Ethically trained AI startup Pleias releases new small reasoning models optimized for RAG with built-in citations
French AI startup Pleias made waves late last year with the launch of its ethically trained Pleias 1.0 family of small language models, among the first and only to date built entirely on "open" data: data explicitly labeled as public domain, open source, or unlicensed and not copyrighted.

Now the company has announced the release of two open source small-scale reasoning models designed specifically for retrieval-augmented generation (RAG), citation synthesis, and structured multilingual output.

The launch includes two core models, Pleias-RAG-350M and Pleias-RAG-1B, each also available in CPU-optimized GGUF format, for a total of four deployment-ready variants. All are based on Pleias 1.0 and can be used independently or in conjunction with other LLMs an organization already deploys or plans to deploy. All appear to be available under the permissive Apache 2.0 open source license, meaning organizations are free to take, modify, and deploy them in commercial use cases.

RAG, as you'll recall, is the widely used technique of hooking an AI large language model (LLM), such as OpenAI's GPT-4o, Google's Gemini 2.5 Flash, Anthropic's Claude 3.7 Sonnet, or Cohere's Command A, or open source alternatives like Llama 4 and DeepSeek V3, up to external knowledge bases such as enterprise documents and cloud storage. This is often necessary for enterprises that want to build chatbots and other AI applications that reference their internal policies or product catalogs. (The alternative, prompting a long-context LLM with all the necessary information up front, may not suit enterprise use cases where security and per-token transmission costs are concerns.)

The Pleias-RAG model family is the latest effort to bridge the gap between accuracy and efficiency in small language models.
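For readers new to the technique, the retrieval step described above can be sketched in a few lines. This is a generic illustration of RAG, not Pleias's pipeline: the toy bag-of-words scoring and the prompt wording are placeholder assumptions (a real deployment would use a dense embedding model and a vector store).

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real RAG stacks use dense encoders.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_rag_prompt(query: str, docs: list[str], k: int = 2) -> str:
    # Retrieve the k documents most similar to the query...
    q = embed(query)
    top = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
    # ...and pass only those to the model, instead of the whole corpus.
    context = "\n".join(f"[{i+1}] {d}" for i, d in enumerate(top))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"

docs = [
    "Employees accrue 25 vacation days per year.",
    "The cafeteria is open from 8am to 3pm.",
    "Vacation requests must be filed two weeks in advance.",
]
print(build_rag_prompt("How many vacation days do I get?", docs))
```

Only the retrieved snippets reach the model, which is why RAG sidesteps the security and per-token cost concerns of shipping an entire knowledge base in every prompt.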
These models are aimed at enterprises, developers, and researchers looking for cost-effective alternatives to large-scale language models without compromising traceability, multilingual capabilities, or structured reasoning workflows.

The target user base is actually Pleias's home continent of Europe, as co-founder Alexander Doria told VentureBeat via direct message on the social network X: "A primary motivation has been the difficulty of scaling RAG applications in Europe. Most private organization have little GPUs (it may have changed but not long ago less than 2% of all [Nvidia] H100 [GPUs] were in Europe). And yet simultaneously there are strong incentive to self-host for regulated reasons, including GDPR.

"SLMs have progressed significantly over the past year, yet they are too often conceived as 'mini-chatbots' and we have observed a significant drop of performance in non-English languages, both in terms of source understanding and quality of text generation. So we have been satisfied to hit most of our objectives: An actual alternative to 7-8b models for RAG even on CPU and other constrained infras. Fully verifiable models coming with citation support. Preservation of European language performance."

However, because the models are open source under the Apache 2.0 license, anyone could take and use them freely anywhere in the world.

Focused on grounding, citations, and facts

A key feature of the new Pleias-RAG models is their native support for source citation with literal quotes, fully integrated into the model's inference process. Unlike post-hoc citation methods or external chunking pipelines, the Pleias-RAG models generate citations directly, using a syntax inspired by Wikipedia's reference format. This approach allows for shorter, more readable citation snippets while maintaining verifiability.

Citation grounding plays a functional role in regulated settings.
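Because the citations are literal quotes, grounding can be checked mechanically: every quoted span either appears verbatim in the named source or it does not. The sketch below illustrates that verification step; the `<ref name="...">` markup is an assumption borrowed from Wikipedia's reference syntax (which the article says inspired Pleias's format), not Pleias's documented output.

```python
import re

def verify_citations(answer: str, sources: dict[str, str]) -> dict[str, bool]:
    """Check that every <ref name="...">"quote"</ref> in a model answer
    appears verbatim in the named source document."""
    pattern = re.compile(r'<ref name="([^"]+)">"([^"]+)"</ref>')
    results = {}
    for source_id, quote in pattern.findall(answer):
        results[quote] = quote in sources.get(source_id, "")
    return results

sources = {"doc1": "Employees accrue 25 vacation days per year."}
answer = ('You get 25 days <ref name="doc1">"Employees accrue 25 vacation '
          'days per year."</ref>')
print(verify_citations(answer, sources))
# → {'Employees accrue 25 vacation days per year.': True}
```

This is the auditability argument in miniature: a compliance reviewer does not have to trust the model, only run a string comparison.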
For sectors like healthcare, legal services, and finance, where decision-making must be documented and traceable, these built-in references offer a direct path to auditability. Pleias positions this design choice as an ethical imperative, aligning with increasing regulatory demands for explainable AI.

Proto-agentic?

Pleias-RAG models are described as "proto-agentic": they can autonomously assess whether a query is understandable, determine whether it is trivial or complex, and decide whether to answer, reformulate, or refuse based on source adequacy. Their structured output includes language detection, query and source analysis reports, and a reasoned answer.

Despite their relatively small size (Pleias-RAG-350M has just 350 million parameters), the models exhibit behavior traditionally associated with larger, agentic systems. According to Pleias, these capabilities stem from a specialized mid-training pipeline that blends synthetic data generation with iterative reasoning prompts.

Pleias-RAG-350M is explicitly designed for constrained environments. It performs well on standard CPUs, including mobile-class infrastructure. According to internal benchmarks, the unquantized GGUF version produces complete reasoning outputs in roughly 20 seconds on 8GB RAM setups. Its small footprint places it in a niche with very few competitors, such as Qwen-0.5 and SmolLM, but with a much stronger emphasis on structured source synthesis.

Competitive performance across tasks and languages

In benchmark evaluations, Pleias-RAG-350M and Pleias-RAG-1B outperform most open-weight models under 4 billion parameters, and rival far larger ones such as Llama-3.1-8B and Qwen-2.5-7B, on tasks such as HotPotQA, 2WikiMultiHopQA, and MuSiQue. These multi-hop RAG benchmarks test a model's ability to reason across multiple documents and identify distractors, common requirements in enterprise-grade knowledge systems.

The models' strength extends to multilingual scenarios.
On translated benchmark sets across French, German, Spanish, and Italian, the Pleias models show negligible degradation in performance. This sets them apart from other SLMs, which typically lose 10-35% of their performance on non-English queries. The multilingual support stems from careful tokenizer design and synthetic adversarial training that includes language-switching exercises. The models not only detect the language of a user query but aim to respond in the same language, an important feature for global deployments.

In addition, Doria highlighted how the models could be used to augment the performance of other models an enterprise may already be using: "We envision the models to be used in orchestration setting, especially since their compute cost is low. A very interesting results on the evaluation side: even the 350m model turned out to be good on entirely different answers than the answers [Meta] Llama and [Alibaba] Qwen were performing at. So there's a real
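Doria's orchestration remark, combined with the models' ability to refuse when sources are inadequate, suggests a routing pattern: try the cheap, citation-grounded SLM first and escalate to a larger model only on refusal. Everything below (the stub models, the refusal signal via `None`) is a hypothetical sketch of such a setup, not Pleias code or a documented API.

```python
from typing import Callable, Optional

def route(query: str, sources: dict[str, str],
          small_model: Callable, large_model: Callable) -> str:
    """Try the small RAG model first; fall back to the large model
    if the small one refuses (signalled here by returning None)."""
    answer: Optional[str] = small_model(query, sources)
    if answer is not None:        # small model judged the sources adequate
        return answer
    return large_model(query, sources)  # escalate: costlier, more capable

# Stubs standing in for Pleias-RAG-350M and a larger hosted LLM.
def small_model(query, sources):
    # Crude source-adequacy check: refuse if no source mentions a query term.
    terms = query.lower().split()
    if any(t in doc.lower() for doc in sources.values() for t in terms):
        return "answered-by-small-model"
    return None

def large_model(query, sources):
    return "answered-by-large-model"

sources = {"doc1": "Employees accrue 25 vacation days per year."}
print(route("vacation policy?", sources, small_model, large_model))
print(route("quarterly revenue?", sources, small_model, large_model))
```

Since the small model runs on CPU, the common case costs almost nothing, and the larger model is invoked only for queries the SLM declines to answer.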











