Bengaluru-based startup Sarvam AI has launched a new large language model, Sarvam 1. The AI model is open-source and has been trained on 11 languages: Bengali, Gujarati, Hindi, Marathi, Malayalam, Kannada, Oriya, Tamil, Telugu, Punjabi and English.
The 2-billion-parameter model was trained on 4 trillion tokens on Nvidia H100 Tensor Core GPUs, using a custom tokeniser developed by Sarvam. The company claims that the tokeniser is up to four times more efficient than those of other AI models trained on Indian languages.
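The article does not say how tokeniser efficiency was measured. One common yardstick is "fertility", the average number of tokens a tokeniser produces per word, where a lower figure means fewer tokens are needed for the same text. The Python sketch below illustrates that idea only; the two toy tokenisers and the sample sentences are hypothetical stand-ins, not Sarvam's tokeniser or data.

```python
def fertility(tokenize, texts):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    return total_tokens / total_words


# Hypothetical stand-ins for real tokenisers (assumptions for illustration).
def char_tokenize(text):
    """Inefficient extreme: every non-space character becomes a token."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text):
    """Efficient extreme: every whitespace-separated word is one token."""
    return text.split()


# Made-up sample sentences, not drawn from any real corpus.
sample = ["नमस्ते दुनिया", "hello world"]

char_fertility = fertility(char_tokenize, sample)
word_fertility = fertility(word_tokenize, sample)

# A ratio above 1 means the first tokeniser needs that many times more tokens.
print(f"character-level fertility: {char_fertility:.2f}")
print(f"word-level fertility: {word_fertility:.2f}")
print(f"ratio: {char_fertility / word_fertility:.2f}x")
```

Under this measure, a claim of "four times more efficient" would correspond to one tokeniser needing roughly a quarter of the tokens another needs for the same text.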
The custom training corpus, Sarvam-2T, comprised 20% datasets in Hindi, English, and programming languages so that the AI model can perform multilingual tasks.
To deal with the lack of high-quality training data for Indian languages, Sarvam AI built datasets using synthetic data generation methods.
Besides Nvidia's hardware, the AI model was also built using Yotta’s data centres and AI4Bharat’s technology and language resources.
“The Sarvam 1 model is the first example of an LLM trained from scratch with data, research, and compute being fully in India,” said Dr. Pratyush Kumar, Co-Founder, Sarvam. He added, “We expect it to power a range of use cases including voice and messaging agents. This is the beginning of our mission to build full stack sovereign AI. We are deeply excited to be working together with NVIDIA towards this mission.”
Developers can use the base model, which is available on Hugging Face, to build their own AI applications for Indic language speakers.
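For developers who want to try the base model, a minimal Python sketch using the Hugging Face Transformers library might look like the following. The repository name `sarvamai/sarvam-1` is an assumption that should be checked against the model's Hugging Face page, and the helper `complete_text` is a hypothetical name introduced here. Because this is a base model, it continues a piece of text rather than following chat-style instructions.

```python
# Assumed Hugging Face repository name; verify on the model page before use.
MODEL_ID = "sarvamai/sarvam-1"


def complete_text(prompt, max_new_tokens=32, model_id=MODEL_ID):
    """Continue `prompt` with a causal language model (hypothetical helper)."""
    # Imported here so the file can be loaded without Transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Encode the prompt as PyTorch tensors and generate a continuation.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode the first (and only) sequence back into text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Downloads the model weights on first run.
    print(complete_text("भारत की राजधानी"))
```

Running the script requires the `transformers` and `torch` packages and enough memory to hold a 2-billion-parameter model.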
Earlier in August, the startup also launched its first foundational AI model called Sarvam 2B. Prior to that, in December last year, Sarvam AI launched the country’s first Hindi LLM, OpenHathi, which was built on Meta AI’s Llama 2 7-billion-parameter model.
Published - October 28, 2024 02:47 pm IST