AI4Bharat Introduces IndicLLMSuite to Foster Indian Language Models

TL;DR

  • AI4Bharat launches IndicLLMSuite, boosting Indian language models with 251B tokens and 74.8M instruction-response pairs.
  • The suite features Sangraha for data aggregation, Setu for content extraction, and IndicAlign for prompt-response pairs.
  • Collaboration with Sarvam AI and IIT Madras introduces IndicVoices, enriching the Indian linguistic landscape with 7348 hours of natural speech.

To improve the representation of Indian languages in language model training, AI4Bharat has launched IndicLLMSuite. This suite of resources is tailored to the challenges that low- and mid-resource languages face in the development of large language models (LLMs). The initiative aims to democratize access to advanced NLP technologies across a diverse linguistic landscape.

Empowering linguistic diversity

IndicLLMSuite covers 22 Indian languages, with a total of 251 billion tokens and 74.8 million instruction-response pairs. The corpus is drawn from several kinds of sources, including curated URLs, existing multilingual corpora, and large-scale translations. This mix of sources is intended to give each language broader representation in language model training.

The suite comprises several essential components designed to facilitate the creation and refinement of Language Models tailored to Indian languages:

Sangraha is the foundational component of IndicLLMSuite: a large pre-training dataset aggregated from diverse linguistic sources. With 251 billion tokens spanning 22 languages, it provides the raw material needed to train language models.

Setu is a Spark-based distributed pipeline, custom-built for Indian languages. It extracts content from a range of sources, including websites, PDFs, and videos, and its built-in stages for cleaning, filtering, toxicity removal, and deduplication safeguard the quality of the extracted data.
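To make the cleaning, filtering, toxicity removal, and deduplication stages concrete, here is a minimal sketch in plain Python. It is not Setu's code or API: Setu runs on Spark, whereas this toy version processes an in-memory list, and every name in it (`clean`, `is_long_enough`, `is_toxic`, `run_pipeline`, `BLOCKLIST`, `MIN_WORDS`) is hypothetical and chosen only for illustration.

```python
import hashlib
import re
import unicodedata

# Hypothetical placeholder blocklist; a real pipeline would rely on
# curated, language-specific resources rather than a single token.
BLOCKLIST = {"badword"}
MIN_WORDS = 5  # assumed threshold, picked only for this example


def clean(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def is_long_enough(text: str) -> bool:
    """Filter stage: keep documents with at least MIN_WORDS whitespace-separated words."""
    return len(text.split()) >= MIN_WORDS


def is_toxic(text: str) -> bool:
    """Toxicity stage: flag documents containing any blocklisted token."""
    words = {w.lower() for w in re.findall(r"\w+", text)}
    return not words.isdisjoint(BLOCKLIST)


def run_pipeline(docs):
    """Apply clean -> filter -> toxicity removal -> exact deduplication."""
    seen = set()
    kept = []
    for raw in docs:
        text = clean(raw)
        if not is_long_enough(text) or is_toxic(text):
            continue
        # Exact-match deduplication on a hash of the cleaned text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


docs = [
    "यह   एक छोटा सा उदाहरण वाक्य है।",   # extra spaces, otherwise same as the next one
    "यह एक छोटा सा उदाहरण वाक्य है।",
    "too short",
    "this sentence contains a badword and is dropped",
]
result = run_pipeline(docs)
print(len(result))
```

Of the four sample documents, only one survives: the first two collapse into a single entry after cleaning, the third is too short, and the fourth trips the blocklist. A production pipeline would typically add near-duplicate detection and language-aware filters, which this sketch leaves out.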

IndicAlign-Instruct is a collection of 74.7 million prompt-response pairs across 20 languages. The pairs are assembled by several methods: compiling existing Instruction Fine-Tuning (IFT) datasets, translating English datasets, generating discussions from India-centric Wikipedia articles, and crowd-sourcing through the Anudesh platform. Additionally, a novel IFT dataset drawn from IndoWordNet enriches the suite’s resources and is meant to help models learn language and grammar.

IndicAlign-Toxic addresses safety alignment in language models by providing a curated dataset of 123K pairs of toxic prompts and non-toxic responses. The pairs are generated with open-source English LLMs and translated into 14 Indian languages, with the goal of making Indic language models safer and more reliable.
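The article quotes 74.8 million pairs for the suite as a whole and 74.7 million for IndicAlign-Instruct alone. The short check below assumes, which the article does not state outright, that the headline figure is simply the sum of the IndicAlign-Instruct and IndicAlign-Toxic pairs; under that assumption the two numbers agree after rounding.

```python
# Figures as quoted in the article; treating the headline total as their
# sum is an assumption made for this check.
instruct_pairs = 74_700_000  # IndicAlign-Instruct
toxic_pairs = 123_000        # IndicAlign-Toxic

total = instruct_pairs + toxic_pairs
total_millions = round(total / 1_000_000, 1)
print(f"{total:,} pairs, about {total_millions}M")
# prints "74,823,000 pairs, about 74.8M"
```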

Collaborative endeavors in language technology

The unveiling of IndicLLMSuite underscores a collaborative effort within the Indian AI landscape to advance the development of language technologies. Partnering with Sarvam AI and IIT Madras, AI4Bharat recently introduced IndicVoices, a comprehensive speech dataset aimed at fostering inclusivity and diversity in speech recognition applications. With 7348 hours of natural speech from 16237 speakers across 145 Indian districts and 22 languages, IndicVoices complements the efforts of IndicLLMSuite in enriching the linguistic ecosystem of India.

The introduction of IndicLLMSuite marks a pivotal moment in the journey towards inclusive language technology development in India. By democratizing access to resources and fostering collaboration among stakeholders, AI4Bharat reinforces its commitment to promoting linguistic diversity and empowering Indian languages in the digital age. As the landscape of NLP continues to evolve, initiatives like IndicLLMSuite serve as catalysts for innovation and progress, paving the way for a more inclusive and accessible linguistic future.

Disclaimer. The information provided is not trading advice. Cryptopolitan.com holds no liability for any investments made based on the information provided on this page. We strongly recommend independent research and/or consultation with a qualified professional before making any investment decisions.

John Palmer

John Palmer is an enthusiastic crypto writer with an interest in Bitcoin, Blockchain, and technical analysis. With a focus on daily market analysis, his research helps traders and investors alike. His particular interest in digital wallets and blockchain aids his audience.
