Bilingual Translation Memories (TMs) are databases of text in two languages where each sentence in the source language is paired with its translation in the target language. The size of a TM can range from a few hundred to thousands or even millions of sentence pairs.
The importance of TMs cannot be overestimated in everything that revolves around language services and technologies, all the more so when it comes to machine translation. Our Lingua Custodia machine translation solution, VERTO™, is trained to convert the raw information contained in TMs into actionable statistical knowledge for translating new sentences, in much the same way as an engine converts fuel into motion.
But just as there are fuels of varying quality, there are TMs of varying quality as well. This happens, for instance, because the TM originates from a pair of parallel documents which were not actually 100% parallel (say, a legal disclosure appearing in only one of the two languages), or because a USD value has been converted into euros on the other side, or simply because, you know, the TM is extremely large and the internet can be such a wild place at times.
TM cleaning therefore becomes an essential step. But how to do it? Manual cleaning by bilingual experts who check each and every sentence pair is terrific in terms of quality, but potentially prohibitive in terms of cost and absolutely prohibitive in terms of time when one has to deal with several million sentence pairs.
On the other hand, overly simplistic approaches, such as removing all sentence pairs where, say, the English side is twice as long as the French side, are faster but may lead to disappointing results.
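To make the limitation concrete, here is a minimal Python sketch of such a length-based filter. It is only an illustration: the function name `length_ratio_filter`, the word-count measure, the symmetric ratio and the threshold of 2 are our assumptions for this example, not part of any actual tool.

```python
def length_ratio_filter(pairs, max_ratio=2.0):
    """Keep only the pairs whose word-count ratio stays below max_ratio.

    Hypothetical helper: lengths are counted in whitespace-separated
    words, and the ratio is symmetric (longer side / shorter side).
    """
    kept = []
    for src, tgt in pairs:
        src_len = len(src.split())
        tgt_len = len(tgt.split())
        if src_len == 0 or tgt_len == 0:
            continue  # an empty side can never be a valid pair
        ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
        if ratio < max_ratio:
            kept.append((src, tgt))
    return kept


pairs = [
    ("Good morning", "Bonjour"),                                  # correct
    ("The fund lost 5% in 2015.", "Le chat dort sur le canapé."),  # wrong
    ("The report is ready.", "Le rapport est prêt."),             # correct
]
cleaned = length_ratio_filter(pairs)
```

In this toy example the filter throws away a perfectly good pair ("Good morning" / "Bonjour", two words against one) and happily keeps an unrelated pair that just happens to have six words on each side. Length alone says very little about whether two sentences are translations of each other.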
We at Lingua Custodia have therefore developed an automatic system based on machine learning and artificial intelligence to deal precisely with this problem and take a step further towards the best of both worlds: a fast, automatic TM cleaning system which removes the wrong sentence pairs and carefully keeps just the correct ones. The system scans the TM for features such as mismatched numbers and punctuation, suspicious length differences, co-occurrence in dictionaries, words sharing a common root (for Indo-European languages), … then cooks this up and, based on statistical knowledge previously learnt by the system, automatically classifies each sentence pair as wrong or correct.
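For the curious, the kind of features listed above could be computed along the following lines. This is a rough Python sketch under our own assumptions, not the actual Lingua Custodia implementation: the function `extract_features`, the feature names, the toy `dictionary` format (source word mapped to a set of target words) and the 4-letter prefix used as a crude stand-in for "common root" are all hypothetical. In a real system, the resulting feature values would be passed to a classifier trained on labelled sentence pairs.

```python
import re
import string

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def final_punct(sentence):
    """Return the last character if it is ASCII punctuation, else ''."""
    sentence = sentence.strip()
    return sentence[-1] if sentence and sentence[-1] in string.punctuation else ""


def extract_features(src, tgt, dictionary=None, prefix_len=4):
    """Compute a few illustrative features for one sentence pair."""
    dictionary = dictionary or {}
    src_tokens = re.findall(r"\w+", src.lower())
    tgt_tokens = re.findall(r"\w+", tgt.lower())

    # Numbers: 1.0 if both sides contain exactly the same numbers.
    # (Naive: ignores locale formatting such as "1,000" vs "1 000".)
    same_numbers = sorted(NUMBER_RE.findall(src)) == sorted(NUMBER_RE.findall(tgt))

    # Punctuation: here we only compare the sentence-final mark.
    same_punct = final_punct(src) == final_punct(tgt)

    # Length: shorter/longer token count, so 1.0 means equal length.
    shorter, longer = sorted((len(src_tokens), len(tgt_tokens)))
    length_ratio = shorter / longer if longer else 0.0

    # Dictionary: share of source words with a listed translation
    # that appears on the target side.
    tgt_set = set(tgt_tokens)
    hits = sum(1 for w in src_tokens if dictionary.get(w, set()) & tgt_set)
    dict_coverage = hits / len(src_tokens) if src_tokens else 0.0

    # Common root (crude proxy): share of source words whose first
    # prefix_len characters also start some target word.
    tgt_prefixes = {w[:prefix_len] for w in tgt_tokens if len(w) >= prefix_len}
    long_src = [w for w in src_tokens if len(w) >= prefix_len]
    shared = sum(1 for w in long_src if w[:prefix_len] in tgt_prefixes)
    prefix_overlap = shared / len(long_src) if long_src else 0.0

    return {
        "numbers_match": 1.0 if same_numbers else 0.0,
        "punct_match": 1.0 if same_punct else 0.0,
        "length_ratio": length_ratio,
        "dict_coverage": dict_coverage,
        "prefix_overlap": prefix_overlap,
    }


toy_dictionary = {"fund": {"fonds"}, "lost": {"perdu"}}
good = extract_features("The fund lost 5% in 2015.",
                        "Le fonds a perdu 5% en 2015.", toy_dictionary)
bad = extract_features("The fund lost 5% in 2015.",
                       "Le chat dort sur le canapé.", toy_dictionary)
```

Note that the wrong pair from this example has a perfectly innocent length ratio and matching final punctuation; it is the missing numbers and the absence of any dictionary or prefix overlap that give it away. This is why several features are combined rather than relying on a single rule.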
How does our newly developed system perform? We have benchmarked our tool by participating in an international competition (Natural Language Processing for Translation Memories 2016) along with five other academic and industrial players. Our system ranked 1st in one of the 9 sub-competitions (whose aim was to draw a more fine-grained distinction between wrong, correct and almost correct sentence pairs), and took several 2nd places in the remaining sub-competitions. Well done, Lingua Custodia! Only the best fuel for our machine translation engines!