Bootstrapped AI startup Smallstep serves up Misal, a Marathi LLM

The company claims Misal-7B outperformed ChatGPT 3.5 in reading comprehension but lagged in sentiment analysis, paraphrasing, and translation.

Spicing up the Indian artificial intelligence (AI) scene, Bengaluru-based startup Smallstep.ai has launched Misal, a large language model (LLM) for Marathi. This comes as competition heats up to develop AI capabilities for local languages in India.

Misal's name draws inspiration from a popular spicy Maharashtrian dish made with moth beans.

"It's a staple breakfast for many," explained Smallstep founder Sagar Sarkale. "We chose the name because it's something familiar and relatable for Marathi speakers."

Smallstep rolled out four versions of the Misal LLM, all built on top of Meta's Llama2 model: the Marathi pre-trained models Misal-7B-base-v0.1 and Misal-1B-base-v0.1, and the Marathi instruction-tuned models Misal-7B-instruct-v0.1 and Misal-1B-instruct-v0.1.

Sarkale, who hails from Maharashtra himself, saw the lack of AI models for his native language. "There are models for languages like Tamil," he told Moneycontrol. "I thought, why not create one for Marathi and empower Marathi speakers to use generative AI products?"

Before founding Smallstep, Sarkale worked as a data scientist at Tencent and Krafton-backed content platform Pratilipi. He also held a data scientist position at cloud technology company Tekion.

Sarkale said the cost of training the AI model was around Rs 50,000-60,000. "There were also other small experiments we ran during the developmental phase. I'm not counting those here," he said.

In a blog post, Sarkale said that Misal was developed to address the limitations of the Llama2 model, which is trained primarily on English data, with only a small fraction dedicated to other languages.

"With mere 2 percent of its data representing non-English languages, it's evident that Llama2 is not optimally fine-tuned for building GenAI applications in languages beyond English," he said.

The bootstrapped startup adopted a three-step procedure to develop Instruction Tuned Misal models, with similar processes for both the 7-billion and 1-billion parameter versions.

Parameters are essentially the 'knowledge' the model acquires during its training, with more parameters typically leading to better performance due to increased contextual understanding.

The company said that it identified a significant challenge with Meta's Llama tokenizer, particularly in handling non-English languages due to increased token requirements.

To improve performance on Marathi text, Smallstep created a custom SentencePiece tokenizer designed for the language, which adds approximately 15,000 new tokens to Llama2's existing vocabulary of 32,000 tokens.
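Smallstep has not published the script it used for this step, but the general recipe for extending a tokenizer is well established. The sketch below is only an illustration: the corpus file name, the BPE setting, and the 15,000-token vocabulary size are assumptions, and access to the Llama2 tokenizer on Hugging Face requires accepting Meta's license.

```python
# Illustrative sketch (not Smallstep's code): train a Marathi SentencePiece model
# and merge its new pieces into the Llama2 tokenizer vocabulary.
import sentencepiece as spm
from transformers import LlamaTokenizer

# 1. Train a SentencePiece model on a plain-text Marathi corpus (hypothetical file).
spm.SentencePieceTrainer.train(
    input="marathi_corpus.txt",
    model_prefix="marathi_sp",
    vocab_size=15000,          # roughly the number of new tokens the article mentions
    model_type="bpe",
)

# 2. Load the base Llama2 tokenizer (~32,000 tokens) and the new Marathi model.
base_tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
mr_sp = spm.SentencePieceProcessor(model_file="marathi_sp.model")

# 3. Add only the pieces Llama2 does not already know.
new_pieces = [mr_sp.id_to_piece(i) for i in range(mr_sp.get_piece_size())]
added = base_tok.add_tokens([p for p in new_pieces if p not in base_tok.get_vocab()])
print(f"Added {added} Marathi tokens on top of {base_tok.vocab_size} base tokens")

# The model's embedding matrix must then be resized to the new vocabulary,
# e.g. model.resize_token_embeddings(len(base_tok)).
```

After the embedding matrix is resized, the same Marathi sentence should encode into noticeably fewer tokens than with the stock Llama2 tokenizer, which is exactly the inefficiency the company was targeting.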

The company conducted a manual evaluation using a limited dataset of 100 internet-sourced questions, focusing on tasks like reading comprehension, translation, sentiment analysis, and paraphrasing.

The company claims that Misal-7B outperformed ChatGPT 3.5 in reading comprehension but lagged behind it in sentiment analysis, paraphrasing, and translation. It also says Misal-7B surpassed Ola Krutrim in every task except translation.

The company acknowledged this as a preliminary approach and said, "We understand that a better evaluation method is needed to benchmark our model."
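The article does not describe the evaluation setup beyond this. Purely as an illustration of what collecting outputs for such a manual, side-by-side review might look like, here is a minimal sketch; the questions file, the task labels, and the model identifier are all hypothetical placeholders.

```python
# Hypothetical sketch of gathering model answers for manual review.
import csv
from transformers import pipeline

# Illustrative model id; substitute the actual Misal checkpoint.
misal = pipeline("text-generation", model="smallstepai/Misal-7B-instruct-v0.1")

# 100 internet-sourced questions, one per row, with columns: task, question.
with open("eval_questions.csv", encoding="utf-8") as f:
    questions = list(csv.DictReader(f))

with open("misal_answers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["task", "question", "answer"])
    writer.writeheader()
    for q in questions:
        answer = misal(q["question"], max_new_tokens=256)[0]["generated_text"]
        writer.writerow({"task": q["task"], "question": q["question"], "answer": answer})

# A reviewer then grades these answers against ChatGPT 3.5 and Krutrim outputs,
# task by task, as the article describes.
```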

During Misal's pretraining phase, the company said it exposed the model to a vast corpus of Marathi text comprising around 2 billion tokens, primarily newspaper content from 2016 to 2022 drawn from the CulturaX dataset. This was supplemented with additional datasets from sources like l3cube, ai4bharat, and other internet-based collections.
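CulturaX is distributed on Hugging Face as per-language subsets, with Marathi under the "mr" code, and access may require accepting the dataset's terms. The sketch below is only illustrative: the output file name and the whitespace word count used as a stand-in for a real token count are assumptions.

```python
# Illustrative sketch (not Smallstep's pipeline): stream the Marathi subset of
# CulturaX and accumulate text up to a rough token budget.
from datasets import load_dataset

stream = load_dataset("uonlp/CulturaX", "mr", split="train", streaming=True)

target_tokens = 2_000_000_000        # ~2 billion tokens, per the article
approx_tokens = 0
with open("marathi_corpus.txt", "w", encoding="utf-8") as f:
    for record in stream:
        text = record["text"]
        f.write(text + "\n")
        approx_tokens += len(text.split())   # crude whitespace proxy for tokens
        if approx_tokens >= target_tokens:
            break
```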

The company fine-tuned its model by using around 200,000 Marathi instructions. This involved translating the Alpaca dataset into Marathi using Google Translate. However, due to translation inaccuracies, the team had to clean a significant number of data points to ensure consistency in both instructions and responses within the dataset.
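The article only says Google Translate was used and that custom rules cleaned up the output. As a rough illustration of that step, the sketch below loads the original Alpaca dataset, translates a small slice with the deep-translator package (one convenient wrapper around Google Translate, an assumption here), and applies sanity checks that are placeholders rather than Smallstep's actual rules.

```python
# Hedged sketch of the instruction-data step: translate Alpaca records into
# Marathi and drop obviously broken translations. The cleaning heuristics are
# illustrative only.
import re
from datasets import load_dataset
from deep_translator import GoogleTranslator

alpaca = load_dataset("tatsu-lab/alpaca", split="train")
translator = GoogleTranslator(source="en", target="mr")   # "mr" = Marathi
DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def looks_clean(original: str, translated: str) -> bool:
    # Illustrative checks: non-empty, contains Devanagari script, and not
    # wildly longer or shorter than the English source.
    if not translated or not DEVANAGARI.search(translated):
        return False
    ratio = len(translated) / max(len(original), 1)
    return 0.3 <= ratio <= 3.0

marathi_rows = []
for row in alpaca.select(range(1000)):        # small slice for illustration
    mr_instruction = translator.translate(row["instruction"])
    mr_output = translator.translate(row["output"])
    if looks_clean(row["instruction"], mr_instruction) and looks_clean(row["output"], mr_output):
        marathi_rows.append({"instruction": mr_instruction, "output": mr_output})

print(f"Kept {len(marathi_rows)} of 1000 translated examples")
```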

"The translation didn't fully convey the exact meaning of the data. To address this, we had to write a lot of custom rules to ensure the translation was as close to the real meaning as possible. That was one of the major challenges," Sarkale said.

Sarkale addressed the issue of hallucinations and biases in LLMs, stating that these issues are "inherently learned from the data itself. Therefore, we need to ensure the data is free of bias, which is a very tricky problem to solve."

"We are actively working on improvements, and this is just the first iteration.  Future iterations will incorporate additional techniques to ensure the model is safe for usage. But I would say the model is currently safe for use."
