Any company that generates content in French, and that until now only had access to models trained on English, can now take advantage of the largest French-language model to date, publicly available in beta on cedille.ai.
Cedille.ai, the largest and most powerful French-language model, now publicly accessible
The model now reaches a perplexity score of 4.5, compared with 12.9 for the best publicly available system (GPT-fr). Perplexity is a key measure of how well a model predicts the next word, and lower is better, so by this measure Cedille scores nearly 3 times better.
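As a rough illustration of the measure cited above, perplexity is commonly computed as the exponential of the average negative log-likelihood per token. The sketch below assumes the per-token natural-log probabilities are already available. The function name and the sample values are hypothetical and are not outputs of Cedille or GPT-fr.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the average negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical example: a model that gives every token probability 0.25
# is as uncertain as a uniform guess among 4 options, so its perplexity is 4.
uniform_over_four = [math.log(0.25)] * 8
ppl_uniform = perplexity(uniform_over_four)

# Ratio of the two scores reported in the text (GPT-fr vs. Cedille).
ratio = 12.9 / 4.5

print(f"perplexity: {ppl_uniform:.2f}, ratio: {ratio:.2f}")
```

The ratio of the two reported scores comes out just under 2.9, which is where the "nearly 3 times" figure comes from.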
The project was launched with the support of the Google TRC program, and the model was trained for several months on Tensor Processing Units (TPUs), special chips designed by Google to speed up artificial intelligence calculations. By relying on this infrastructure, the team was able to ensure a neutral ecological footprint for the model's training process. This is a major achievement, given that such processes require huge amounts of energy and therefore produce high carbon emissions.
Cedille relies on the EleutherAI community, a grassroots movement of open source AI researchers. Because Cedille is available to the public, researchers can verify and replicate the results and experiment with them as they please.
“With Cedille we are redistributing the cards for French compared to English language models — and with even more language models to come! We were able to achieve this feat thanks to the efforts of the EleutherAI open source community. By publishing our model publicly, we are excited to contribute back to the community!”
Martin Müller, Senior Machine Learning Engineer at Coteries
Excluding toxic and inappropriate data
To learn about the world, today's leading AI text generation models, such as GPT-3, are trained on large collections of content publicly available on the Internet. Because this content also contains a good deal of misinformation, sexism and racism, existing models have been shown to reproduce these same discriminatory tendencies in the text they generate.
Coteries made every effort to publish a model that is as free of inappropriate content as possible and to filter the data used for Cedille's training. All toxic content as well as low-quality content has been removed. This process was made possible by a combination of Natural Language Processing and careful manual review of data samples.
As a result, Cedille generates quality text with 14.7% less toxic content than the best previously available model (GPT-fr).
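One way to picture a filtering step that combines automated scoring with manual review is a simple triage: documents with a high toxicity score are dropped, borderline ones are set aside for a human reviewer, and the rest are kept. The sketch below is only an assumption about how such a step could look, not Coteries' actual pipeline. The thresholds, the `triage` function and the word-list scorer are hypothetical placeholders for a real classifier.

```python
def triage(documents, score_fn, review_above=0.1, drop_above=0.5):
    """Split documents into (kept, needs_review, dropped) by toxicity score."""
    kept, needs_review, dropped = [], [], []
    for doc in documents:
        score = score_fn(doc)
        if score >= drop_above:
            dropped.append(doc)
        elif score >= review_above:
            needs_review.append(doc)  # left for manual inspection
        else:
            kept.append(doc)
    return kept, needs_review, dropped


# Placeholder scorer: fraction of words found in a tiny blocklist.
# A real pipeline would use a trained classifier instead.
BLOCKLIST = {"insulte", "haine"}


def toy_score(doc):
    words = [w.strip(".,!?").lower() for w in doc.split()]
    if not words:
        return 0.0
    return sum(w in BLOCKLIST for w in words) / len(words)


docs = [
    "Bonjour tout le monde",
    "Un texte avec une insulte dedans",
    "haine haine haine",
]
kept, needs_review, dropped = triage(docs, toy_score)
```

With these placeholder settings, the first document is kept, the second is routed to manual review, and the third is dropped.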
Endless application possibilities with Cedille
From assisted journalism to chatbots and autocompletion, Cedille has a very wide range of potential uses. Coteries offers its model and the skills of its team to create custom applications, an excellent opportunity for any company wishing to make the most of artificial intelligence to generate content in French.
“With Cedille, I am delighted to be able to bring the power of very large models to the French language. Now there is no need to train a new model for each specific task: just give Cedille a few examples!”
Florian Laurent, Senior Machine Learning Engineer at Coteries
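Giving a model "a few examples" is usually done by placing them directly in the prompt, a technique known as few-shot prompting. The sketch below only builds such a prompt as a string. The helper name, the labels and the sentiment task are hypothetical, and no Cedille API is called or assumed.

```python
def build_few_shot_prompt(examples, query, input_label="Texte", output_label="Sentiment"):
    """Format labeled examples plus a new query into one text prompt."""
    blocks = [
        f"{input_label}: {text}\n{output_label}: {label}"
        for text, label in examples
    ]
    # The final block leaves the output empty for the model to complete.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


# Hypothetical sentiment-classification examples in French.
examples = [
    ("Ce film était magnifique.", "positif"),
    ("Le service était très lent.", "négatif"),
]
prompt = build_few_shot_prompt(examples, "J'ai adoré ce restaurant.")
print(prompt)
```

The resulting text would be sent to the model as-is. The model is then expected to continue after the last "Sentiment:" with a label that follows the pattern set by the examples.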
You can test Cedille on cedille.ai.