LLM creation and evaluation
Context
As part of the France 2030 initiative, Bpifrance launched a call for projects in 2023 entitled "Digital Commons for Generative Artificial Intelligence". The aim of this call is to develop specialised large language models (LLMs) tailored to the needs of businesses. The ArGiMi consortium, comprising the companies Artefact, Giskard and Mistral AI as well as the public institutions Institut national de l'audiovisuel (INA) and Bibliothèque nationale de France (BnF), was selected at the end of May 2024.
The major innovations of the ArGiMi project include the development and evaluation of specialised language models for French, as well as methods and tools to ensure regulatory compliance and ethics in the use of these LLMs. The two-year project aims to overcome major technical hurdles, particularly in adapting AI technologies to French linguistic and cultural specificities.
Work carried out
Within this project, the INA teams contribute their expertise to several tasks.
Audiovisual data is characterised by spoken French, which differs from written language in many ways. An essential part of INA's contribution to the project is therefore to adapt the models to audiovisual use cases, in general and at INA in particular, and to evaluate them on those use cases. These experiments, carried out exclusively by our Research Department and solely on the Institute's computing cluster, will lead to scientific publications and to the release of open source annotation and evaluation tools. In this way, the Institute will contribute to building "digital commons" for the French language, a major sovereignty issue. This work will be carried out in strict compliance with current legislation. The audiovisual streams, their transcriptions and the specialised models based on these data will not be shared, not even with the project partners, and will be used only to carry out evaluations for the aforementioned scientific publications.
A legal study will also be carried out, in partnership with the Bibliothèque nationale de France (BnF), to determine whether, and under what conditions, such heritage data could be used to train models, so as to make those models more relevant to French, French-speaking and European use cases. This work will be carried out in conjunction with the two missions entrusted to the Conseil supérieur de la propriété littéraire et artistique (CSPLA) in 2024. It will help to clarify the options for remunerating cultural content used by AI systems and to support the effective implementation of the new European regulation on AI.
Summary
Scientific challenges:
- specificity of the spoken language
- dependence on the quality of transcription
- wide variety of programme types
Methodology:
- fine-tuning of generic models on transcriptions
- evaluation on multiple information extraction tasks
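
As an illustration of what evaluating a model on an information extraction task can involve, here is a minimal Python sketch that scores predicted items against a reference annotation using precision, recall and F1. It does not describe the project's actual evaluation protocol: the function name extraction_scores, the (label, text) representation and the sample data are all assumptions made for this example.

```python
from collections import Counter


def extraction_scores(predicted, reference):
    """Return (precision, recall, f1) for one document.

    Both arguments are lists of (label, text) pairs. This exact-match
    scoring is an illustrative assumption, not the project's metric.
    """
    pred = Counter(predicted)
    ref = Counter(reference)
    # Multiset intersection: an item counts as correct at most as many
    # times as it appears in the reference.
    true_pos = sum((pred & ref).values())
    n_pred = sum(pred.values())
    n_ref = sum(ref.values())
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_ref if n_ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for a single transcript.
reference = [("PERSON", "Marie Curie"), ("ORG", "INA"), ("LOC", "Paris")]
predicted = [("PERSON", "Marie Curie"), ("ORG", "INA"),
             ("LOC", "Lyon"), ("ORG", "BnF")]

precision, recall, f1 = extraction_scores(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is the simplest possible choice. On transcriptions of spoken French, where names may be misspelled by the speech recogniser, a more tolerant matching rule would probably be needed.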
Deliverables:
- scientific publications
- legal study
- annotation GUI and evaluation tools
Project members
Nicolas Hervé (project lead), Abdelkrim Beloued (researcher), Émile Chapuis (researcher), Steffen Lalande (researcher), Agnès Saulnier (researcher)