LLMs – Generative AI is not Sci-fi!


Lingua Custodia was delighted to co-host this event with Cosmian, a company specialised in cybersecurity, at Le Village by CA Paris.


What are LLMs?

Gaëtan Caillaut’s presentation for Lingua Custodia focused on Large Language Models (LLMs) and aimed to ‘demystify’ the engineering and science behind them. He highlighted that LLMs are a type of AI program able to recognise and generate text. These models are trained on very large datasets, which allows them to learn the probability of the next word given the context of the surrounding words or phrase.
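To make this concrete, the sketch below prints the most probable next words for a short prompt. It is only an illustration, assuming the Hugging Face transformers library and the small, publicly available GPT-2 model rather than any model discussed at the event.

```python
# Minimal sketch of next-word prediction, assuming the Hugging Face
# `transformers` library and the publicly available GPT-2 model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The bank approved the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Probability distribution over the next token, given the context so far.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([int(i)]):>12}  {p.item():.3f}")
```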

What are the limitations of LLMs?

The limitations of LLMs were also discussed. The quality of the generated text depends heavily on the underlying training data, and there is a risk that these models misinterpret the context of a word or phrase. An LLM hallucination occurs when the model generates text that is irrelevant to or inconsistent with the input data.
LLMs are also very expensive to run and complicated to train.

Retrieval Augmented Generation and RLHF for fine-tuning

He highlighted the benefit of RAG (Retrieval Augmented Generation), which references an external knowledge base to improve the accuracy and reliability of LLMs. RAG enhances LLM capabilities and has the advantage of not requiring any additional training of the model.
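As an illustration of the idea, the sketch below retrieves the passages most similar to a question from a small in-memory knowledge base and prepends them to the prompt. It assumes the sentence-transformers library; the knowledge base, the question and ask_llm are all hypothetical placeholders, the latter standing in for whatever LLM completion call is used.

```python
# Minimal RAG sketch: retrieve the passages most relevant to a question
# from an external knowledge base and prepend them to the LLM prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical knowledge base used purely for illustration.
knowledge_base = [
    "The fund's net asset value rose 3.2% in Q2.",
    "Management fees are capped at 1.5% per annum.",
    "The prospectus was last updated in March.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = encoder.encode(knowledge_base, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k passages whose embeddings are closest to the question."""
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q_vec  # cosine similarity (embeddings are normalised)
    return [knowledge_base[i] for i in np.argsort(scores)[::-1][:k]]

question = "What are the management fees?"
context = "\n".join(retrieve(question))
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
# answer = ask_llm(prompt)   # hypothetical LLM completion call
```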

RLHF (Reinforcement Learning from Human Feedback) is one of the most widely used fine-tuning approaches. It uses human feedback to make the model more efficient, logical and helpful.
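At the heart of RLHF is a reward model trained on human preference pairs. The snippet below is a minimal sketch of that pairwise preference loss, assuming PyTorch; reward_chosen and reward_rejected stand in for the scores a hypothetical reward model would assign to the human-preferred and rejected responses.

```python
# Minimal sketch of the pairwise preference loss used to train an RLHF
# reward model. The tensors below are dummy scores standing in for the
# outputs of a hypothetical reward model.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Human annotators preferred `chosen` over `rejected`; the loss pushes
    the reward model to score the preferred response higher."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

chosen = torch.tensor([1.2, 0.4])
rejected = torch.tensor([0.3, 0.9])
print(preference_loss(chosen, rejected))
```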

Lingua Custodia’s Generative AI Multi-Document Analyser


Olivier Debeugny, Lingua Custodia’s CEO, then presented the company’s multi-document data extraction technology, which uses RAG to optimise the quality of the extracted data.

Please note that Lingua Custodia now has a new address in Paris, Le Village by CA Paris, at 55 Rue La Boétie, 75008. We are delighted with our new offices and thrilled to be part of this dynamic ecosystem, which prioritises supporting startups and SMEs.

How LLMs (Large Language Models) use long contexts

Large language models (LLMs) are capable of using very long contexts, sometimes hundreds of thousands of tokens. OpenAI’s GPT-4 can handle inputs of up to 32K tokens, while Anthropic’s Claude can handle a 100K-token context. This enables LLMs to process very large documents, which is very useful for question answering or information retrieval.

A recently released paper from Stanford University examines how large language models use context, particularly long contexts, for two key tasks: multi-document question answering and key-value retrieval. Its findings show that performance is typically best when the relevant information occurs at the beginning or end of the input context, but declines significantly when the model needs to access relevant information in the middle of a long context. This could be attributed to the way humans write, where the opening and concluding segments of a text tend to contain the most crucial information.
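The sketch below illustrates the kind of position test described in that line of work: the passage containing the answer is placed at the start, middle or end of a pile of distractor passages, and the resulting answers are compared. Everything here is hypothetical, including the passages, the question and the ask_llm call.

```python
# Sketch of a position-sensitivity test: insert the relevant passage at
# different positions in a long context and compare the LLM's answers.
relevant = "Management fees are capped at 1.5% per annum."   # hypothetical fact
distractors = [f"Unrelated filler passage number {i}." for i in range(20)]

def build_context(position: str) -> str:
    """Place the relevant passage at the start, middle or end of the context."""
    docs = distractors.copy()
    index = {"start": 0, "middle": len(docs) // 2, "end": len(docs)}[position]
    docs.insert(index, relevant)
    return "\n\n".join(docs)

question = "What is the cap on the management fees?"
for position in ("start", "middle", "end"):
    prompt = f"{build_context(position)}\n\nQuestion: {question}"
    # answer = ask_llm(prompt)   # hypothetical call; accuracy typically drops for 'middle'
    print(position, len(prompt), "characters of context")
```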

These findings show that one needs to be careful when using LLMs for search and information retrieval in long documents: information located in the middle may be ignored by the LLM, leading to wrong or less accurate responses.

Lingua Custodia has over 10 years of experience in language technologies for financial document processing, and we are very aware of the importance of context for search and information retrieval, sentiment analysis, content summarisation and extraction. We continuously study the impact of context size on these language models.

Our expert team consists of scientists, engineers and developers, so we are well placed to create, customise and design secure LLMs which are perfectly tailored to meet your business needs.