The Lab is the Research & Development department of Lingua Custodia. Each member holds a doctorate in Machine Learning and is a specialist in Natural Language Processing (NLP). The Lab has a threefold mission: to actively contribute to academic and industrial research in these specific fields, to maintain the products and services offered by Lingua Custodia at the state of the art, and to develop new value-added applications for our clients.
Lingua Custodia Lab works on very different topics, with Natural Language Processing as a common denominator.
Here are the research projects created by the Lab's team.
Neural Machine Translation (NMT) is the state of the art in Machine Translation, as it can generate high-quality translations. However, if we restricted translation to this stage, it would be less accurate for specialised financial texts. Specialised corpora used to feed our engines are often of modest size in comparison with corpora from the general domain. Our added value lies precisely in our ability to enrich our training data with financial terminology. Selective data training is one of the most investigated approaches for enlarging the training dataset without introducing too much noise. In this project, we study how to select the best portion of general or noisy data.
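As an illustration of what data selection can look like, here is a minimal sketch that ranks general-domain sentences by a cross-entropy difference between two unigram language models, one built on in-domain text and one on general text. This is a textbook-style criterion chosen for illustration, not necessarily the method studied in the project; the function names and the toy sentences are assumptions.

```python
import math
from collections import Counter


def unigram_model(sentences):
    """Return a log-probability function for an add-one smoothed unigram model."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens

    def logprob(token):
        return math.log((counts[token] + 1) / (total + vocab_size))

    return logprob


def cross_entropy(sentence, logprob):
    """Average negative log-probability per token of a sentence."""
    tokens = sentence.lower().split()
    return -sum(logprob(t) for t in tokens) / max(len(tokens), 1)


def select_in_domain_like(in_domain, general, keep):
    """Keep the `keep` general sentences that look most like in-domain text.

    A lower (in-domain entropy minus general entropy) score means the
    sentence is relatively better explained by the in-domain model.
    """
    lp_in = unigram_model(in_domain)
    lp_gen = unigram_model(general)
    ranked = sorted(
        general,
        key=lambda s: cross_entropy(s, lp_in) - cross_entropy(s, lp_gen),
    )
    return ranked[:keep]


# Hypothetical toy data, for illustration only.
financial = [
    "the fund invests in bonds",
    "the bond yield rose",
    "equity fund performance",
]
general_pool = [
    "the cat sat on the mat",
    "the fund bought bonds and equity",
    "dogs like to play outside",
]
selected = select_in_domain_like(financial, general_pool, keep=1)
```

In practice the unigram models would be replaced by stronger language models and the cut-off tuned on held-out financial data.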
Despite their effectiveness, Neural Machine Translation (NMT) systems exhibit an important drawback: they do not provide an explicit source-target link between original segments and translated segments. Without this link, it is particularly laborious for NMT systems to enforce terminological constraints in domain-specific scenarios (e.g. financial translations). The objective of this project is twofold. The first part consists of building bilingual terminologies (lexicons) in the considered domain. The second part consists of extending existing NMT systems into new systems able to efficiently integrate the specific terminological constraints.
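To make the notion of a terminological constraint concrete, here is a minimal sketch that looks up which lexicon entries apply to a source sentence and reports those whose required translation is missing from the output. The lexicon entries, the function names and the naive substring matching are assumptions made for illustration; a real system would match on tokens and handle inflection.

```python
def find_constraints(source, lexicon):
    """Return the lexicon entries whose source term occurs in the sentence."""
    lowered = source.lower()
    return {src: tgt for src, tgt in lexicon.items() if src.lower() in lowered}


def unmet_constraints(source, translation, lexicon):
    """List source terms whose required target term is absent from the translation."""
    lowered = translation.lower()
    active = find_constraints(source, lexicon)
    return [src for src, tgt in active.items() if tgt.lower() not in lowered]


# Hypothetical English-French financial lexicon.
lexicon = {
    "hedge fund": "fonds spéculatif",
    "yield": "rendement",
}
source = "The hedge fund raised its yield."
translation = "Le fonds de couverture a augmenté son rendement."
missing = unmet_constraints(source, translation, lexicon)
```

Here the translation satisfies the constraint on "yield" but not the one on "hedge fund", which is the kind of error a constraint-aware NMT system aims to avoid.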
Neural machine translation models are very powerful, achieving state-of-the-art results in most languages. However, such models do not contain hard alignments between source and target words, making it difficult to handle tasks such as translating markup tags (e.g. HTML, XML). A markup tag gives formatting indications: it is attached to a word to indicate, for example, that it should be in italics. In this project, we will create markup-enriched parallel training data and design a network that is able to correctly translate markup and put it back in the correct positions in the translated text. As a result, it will be possible to automatically translate HTML, XML and many other document types, which was not possible with traditional neural machine translation models.
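For comparison, a common workaround outside the network is to swap tags for placeholders before translation and restore them afterwards. The sketch below shows that placeholder approach only; it is not the markup-aware network described above, and the placeholder format and function names are assumptions. It also relies on the translation engine leaving the placeholders untouched, which is exactly the fragile step a dedicated model is meant to remove.

```python
import re

# Matches simple opening and closing tags such as <i>, </i> or <a href="...">.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
PLACEHOLDER_RE = re.compile(r"__TAG(\d+)__")


def protect_tags(text):
    """Replace each markup tag with a numbered placeholder and return both."""
    tags = []

    def _swap(match):
        tags.append(match.group(0))
        return f"__TAG{len(tags) - 1}__"

    return TAG_RE.sub(_swap, text), tags


def restore_tags(text, tags):
    """Put the original tags back in place of their placeholders."""
    return PLACEHOLDER_RE.sub(lambda m: tags[int(m.group(1))], text)


masked, tags = protect_tags("Read the <i>annual</i> report")
# A hypothetical engine output in which the placeholders moved with the word.
translated = "Lisez le rapport __TAG0__annuel__TAG1__"
restored = restore_tags(translated, tags)
```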
Discover the prototypes developed by the Lab team and based on Natural Language Processing applied to finance.
This prototype, developed by Lingua Custodia’s Lab, gives an analysis of two documents and is based on the GRI Standards.
From 27 June to 1 July 2022, the French Computer Science and Systems Laboratory (“LIS”) and the Computer Laboratory from Avignon University (“LIA”), alongside the Association for Natural Language Processing (ATALA), jointly organise the 29th conference on NLP (TALN in French) and the 24th meeting of Student Researchers in Computer Science for NLP (RECITAL in French).
The Lingua Custodia Lab ranked 1st in 4 out of 7 language pairs in the 2nd round of the Covid-19 MLIA Challenge, a European collaboration project supported by the European Commission, the European Language Resources Coordination (ELRC), and several other entities and universities.
The project organises a community evaluation effort aimed at accelerating the creation of resources and tools for improved MultiLingual Information Access using Machine Learning technology.
This paper describes Lingua Custodia’s submission to the WMT21 shared task on machine translation using terminologies. We consider three directions, namely English to French, Russian, and Chinese. We rely on a Transformer-based architecture as a building block, and we explore a method which introduces two main changes to the standard procedure to handle terminologies. The first one consists in augmenting the training data in such a way as to encourage the model to learn a copy behavior when it encounters terminology constraint terms. The second change is constraint token masking, whose purpose is to ease copy behavior learning and to improve model generalization. Empirical results show that our method satisfies most terminology constraints while maintaining high translation quality.
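The two changes can be pictured with a minimal sketch of source-side annotation: the required target term is inserted next to the matching source term so the model can learn to copy it, and the source term can optionally be replaced by a mask token. The tag names (`<term>`, `<trans>`, `</trans>`), the mask token and the function name are assumptions made for illustration, and this is only one possible reading of the masking step, not the exact format used in the submission.

```python
def annotate_source(tokens, constraints, mask_source=False, mask_token="<mask>"):
    """Insert target terms after matching source terms in a token list.

    `constraints` maps a source term (space-separated tokens) to the
    target term the translation should contain.
    """
    out = []
    i = 0
    while i < len(tokens):
        for src_term, tgt_term in constraints.items():
            src_toks = src_term.split()
            if tokens[i:i + len(src_toks)] == src_toks:
                # Optionally hide the source term behind mask tokens.
                kept = [mask_token] * len(src_toks) if mask_source else src_toks
                out += ["<term>"] + kept + ["<trans>"] + tgt_term.split() + ["</trans>"]
                i += len(src_toks)
                break
        else:
            # No constraint starts at this position: copy the token as-is.
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical example sentence and constraint.
tokens = "the hedge fund grew".split()
constraints = {"hedge fund": "fonds spéculatif"}
augmented = annotate_source(tokens, constraints)
masked = annotate_source(tokens, constraints, mask_source=True)
```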
This research paper “Encouraging Neural Machine Translation to Satisfy Terminology Constraints” has been accepted for publication at the ACL 2021 conference, in the Findings of ACL 2021. It presents a new approach to encourage neural machine translation to satisfy lexical constraints. ACL is the premier conference in the field of computational linguistics, rewarding the most promising research papers worldwide. This new recognition confirms the Lab’s position as a leader in the NLP field, alongside the most prestigious companies: Google Research, Facebook AI or Amazon Sciences.
This workshop aims to explore recent advances in data representation for clustering under different approaches. The LDRC workshop is thus an opportunity to (i) present the recent advances in data representation based clustering algorithms; (ii) outline potential applications that could inspire new data representation approaches for clustering; and (iii) explore benchmark data to better evaluate and study data representation based clustering models.
Raheel Qader, Head of the Lab, gave a lecture at the Université Grenoble Alpes to master's students majoring in Financial Engineering. The lecture covered Natural Language Processing applied to Finance, and more specifically Machine Translation.