Today our agency Coteries launches Cedille, a new artificial intelligence for text generation that brings a game-changing machine learning solution to French-speaking users.

Companies generating content in French, which until now mostly had access to English-trained models, can now use the largest publicly available French language model released to date, available in beta on cedille.ai.

 

Cedille.ai, the largest and most powerful French model, publicly available

Martin Müller and Florian Laurent, the two Senior Machine Learning Engineers behind the development of Cedille.ai

The model reaches a perplexity score (a key performance measure, lower is better) of 4.5, compared with 12.9 for the best previously available system (GPT-fr). Cedille's perplexity is therefore almost three times lower.
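For readers unfamiliar with the metric, perplexity is commonly defined as the exponential of the average negative log-likelihood per token. The sketch below illustrates that definition and the ratio of the two reported scores. The function name and the log-probability values are made up for illustration; this is not Cedille's evaluation code, and the dataset and tokenization behind the published scores are not described here.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Illustrative values only: a model that assigns probability 0.25 to every token
example_log_probs = [math.log(0.25)] * 8
example_ppl = perplexity(example_log_probs)  # 4, up to floating-point rounding

# Ratio of the two scores reported above (GPT-fr / Cedille)
score_ratio = 12.9 / 4.5  # roughly 2.87, i.e. "almost three times"
```

Intuitively, a perplexity of 4 means the model is on average as uncertain as if it were choosing uniformly among 4 tokens at each step.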

The project was launched with support from the Google TRC Program and trained for several months on Tensor Processing Units (TPUs), custom chips made by Google to speed up AI computation. Relying on this infrastructure ensured that the training process was carbon neutral. This is a major achievement as most training processes for such models require a tremendous amount of energy, drastically increasing carbon emissions.

Cedille stands on the shoulders of EleutherAI, a grassroots community of open-source AI researchers. As Cedille is released openly, researchers can experiment with the model and reproduce the results themselves.

“With Cedille we are leveling the playing field for French compared to English language models – with other non-English languages soon to follow! We are able to achieve this feat also thanks to the efforts of the open source community EleutherAI. By releasing our model publicly we’re excited to contribute back to the community!”

Martin Müller, Senior Machine Learning Engineer at Coteries

Filtering out toxic and unsafe datasets

Existing AI models for text generation such as GPT-3 are trained on publicly available data from the internet to understand the world. As this data often contains discriminatory content such as sexism and racism, as well as general mis- and disinformation, it has been shown that existing models can exhibit the same discriminatory tendencies.

Coteries strives to offer a model that is free of unsafe content, and its Machine Learning team has taken special care to filter the data Cedille is trained on. Toxic and discriminatory content has been removed, as well as low-quality material. This process combined Natural Language Processing techniques with careful manual examination of sample data.

As a result, Cedille’s output is 14.7% safer than that of the best openly available system (GPT-fr).

Endless possibilities and applications

From enhanced journalism to autocompletion and chatbots, Cedille offers endless possible uses. Coteries offers its model and skills for customized applications, which may be especially relevant for any company wishing to use Artificial Intelligence to generate content in French.

“With Cedille, I’m thrilled that we can bring the power of very large language models to French. Now, there’s no more need to train a new model for each specific task: just give Cedille a few examples and the model will follow your lead!”

Florian Laurent, Senior Machine Learning Engineer at Coteries
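The "few examples" approach mentioned in the quote is usually called few-shot prompting: the task is demonstrated inside the prompt itself and the model continues the pattern. A minimal sketch of how such a prompt could be assembled is shown below. The function name, the "Texte"/"Sentiment" labels and the example sentences are hypothetical, and the call that would send the prompt to Cedille is left out because its API is not described here.

```python
def build_few_shot_prompt(examples, query, input_label="Texte", output_label="Sentiment"):
    """Format (input, output) pairs, then the new input with its output left open."""
    blocks = [f"{input_label}: {text}\n{output_label}: {label}" for text, label in examples]
    # The final block ends right after the output label so the model completes it
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)

# Hypothetical demonstrations for a sentiment task
examples = [
    ("Ce film était magnifique.", "positif"),
    ("Le service était très lent.", "négatif"),
]
prompt = build_few_shot_prompt(examples, "J'ai adoré ce restaurant.")
print(prompt)
```

The same helper can be reused for other tasks by changing the labels and the demonstration pairs, which is the point of the quote: no task-specific training run is needed.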

The public beta version can now be tested on cedille.ai.