Before evaluating the quality of a translation, one must first ask what a good translation is. The title of Marcel Proust’s novel “À la recherche du temps perdu” was first translated into English as “Remembrance of Things Past”. The wording was explicit: lost time is the past, and the search is carried out through memories. The meaning is conveyed; it is a good translation. Nowadays the novel is published as “In Search of Lost Time”, a more literal rendering, and it too is a good translation. Which one is better? The question is complex, and no evaluation method can settle it definitively.

In practice, there are two ways to evaluate the performance of a machine translation system:
– Human evaluation: An expert evaluates the translation produced by a system, following the instructions given to them, such as not penalizing literal translations. This evaluation is time-consuming and costly, but it enables a fine-grained analysis of a system’s strengths and weaknesses.
– Automatic evaluation: Human intervention is limited to providing a translation of the source text, called the reference. Quality is measured by how close the machine translation output is to the reference. This evaluation is quick and easy to run, but it is very coarse: it yields a single score reflecting the overall quality of the system, with no further detail. The most common automatic metric in machine translation is BLEU. It scores the words shared between the machine translation output and the reference, rewarding matches of consecutive word groups (n-grams). Such a metric cannot resolve the question raised by the title of Proust’s novel, since it treats the reference translation as the truth, although it is merely the choice made by one translator.
However, it is useful for measuring how well a system has adapted to a specific technical vocabulary, such as that of fund prospectuses: the terms there are far less ambiguous than in Proust’s prose. Our English-French system specialising in fund prospectuses and KIIDs thus obtains a BLEU score of 73.71. If, on the other hand, we use it to translate equity research papers, the score drops to 43.48. These two document types use different vocabularies, and the BLEU score reflects the system’s specialization. For a better translation of equity research papers, we have also developed a dedicated system, which obtains a score of 68.03.
The evaluation of machine translation performance, for Lingua Custodia as for all industry players, remains an open field of research.
Franck Burlot, Chief Technology Officer