OpenAI tackles AI language gaps with new India-focused IndQA benchmark

In the global race to build smarter artificial intelligence, a critical question has emerged: Can an AI truly be “intelligent” if it only understands the world from one perspective?

OpenAI, the San Francisco-based research firm that catapulted generative AI into the mainstream with ChatGPT, has confronted this problem head-on. On Tuesday, the company announced the launch of IndQA, a new and highly detailed benchmark designed to evaluate how well AI systems grasp the vast, nuanced, and complex tapestry of Indian languages and culture.

This isn’t just another test of an AI’s ability to translate. It’s an attempt to measure something far more elusive: its understanding of context, history, and the everyday realities that matter to people where they live.

The initiative stems from a glaring gap in the world of AI development. As OpenAI points out, while 80% of the world’s population does not speak English as their primary language, the tools used to measure AI progress have been overwhelmingly Anglo-centric.

This has led to a significant problem. Popular multilingual benchmarks, like the widely used MMMLU (Multilingual Massive Multitask Language Understanding), are now “saturated.” In simple terms, the most powerful AI models are acing these tests, making them less and less useful for measuring real, meaningful progress.

More importantly, these existing tests often focus on multiple-choice questions or direct translations. They might be able to tell you the Hindi word for “computer,” but they can’t capture the cultural nuance of why a certain dish is central to a festival, the historical context of a local monument, or the subtle, code-switching humor of “Hinglish” spoken in a city.

That’s precisely the gap IndQA is built to fill.

“Today we are rolling out IndQA,” announced Srinivas Narayanan, OpenAI’s CTO for B2B Applications, at a media conference. “Built in collaboration with 261 experts across 12 languages, IndQA fills a key gap by enabling fair and rigorous evaluation that reflects India’s cultural and linguistic diversity.”

This is a benchmark built by humans, for AIs. The 261 domain experts, all native-level speakers from across India, were tasked with drafting difficult, reasoning-focused prompts tied directly to their regions and specialties.

The result is a massive evaluation system spanning 2,278 questions. These aren’t just in Hindi or English, but are natively written in 12 languages: Bengali, English, Hindi, Hinglish, Kannada, Marathi, Odia, Telugu, Gujarati, Malayalam, Punjabi, and Tamil.

The prompts cover 10 broad cultural domains:

Architecture & Design
Arts & Culture
Everyday Life
Food & Cuisine
History
Law & Ethics
Literature & Linguistics
Media & Entertainment
Religion & Spirituality
Sports & Recreation

So, how does it work? Instead of a simple “right” or “wrong” answer, IndQA uses a “rubric-based approach.” For each culturally grounded prompt, the human expert also provides a detailed set of criteria for what a good answer looks like, along with an “ideal answer” that reflects expert expectations. This allows for a far more nuanced score than a simple pass/fail.
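To make the idea of rubric-based grading concrete, here is a minimal Python sketch. The article does not spell out the scoring arithmetic, so everything here is an assumption made for illustration: the `Criterion` class, the point weights, the `rubric_score` function, and the sample rubric are all hypothetical, and the sketch simply assumes that each criterion carries a weight and that a grader has already judged whether each one was met.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One expert-written rubric item (hypothetical structure)."""
    description: str  # what a good answer should contain
    points: float     # assumed weight; not taken from the benchmark itself


def rubric_score(criteria, met):
    """Return the fraction of available rubric points that were earned.

    `met` holds one boolean per criterion, in the same order, saying
    whether a grader judged that criterion to be satisfied.
    """
    if len(criteria) != len(met):
        raise ValueError("need exactly one judgment per criterion")
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("rubric must carry a positive number of points")
    earned = sum(c.points for c, ok in zip(criteria, met) if ok)
    return earned / total


# Illustrative rubric for a food-and-festival question (made up for this sketch).
rubric = [
    Criterion("Names the dish associated with the festival", 2.0),
    Criterion("Explains why the dish matters to the festival", 3.0),
    Criterion("Mentions how the dish varies by region", 1.0),
]
judgments = [True, True, False]  # placeholder grader output

score = rubric_score(rubric, judgments)
print(f"{score:.2f}")
```

An answer that meets the first two criteria but misses the third earns five of six points, a partial-credit result that a pass/fail test could not express.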

To ensure its robustness, OpenAI tested the benchmark against its most powerful models at the time of creation, including GPT-4o, GPT-4.5, and even the newly launched GPT-5.

Narayanan emphasized that this tool is designed to help all AI models, not just OpenAI’s, to “perform better in languages and contexts that are currently underrepresented in global datasets.”

With nearly a billion people who don’t speak English as their primary language and 22 official languages, India was described by the company as the “obvious starting point” for this global-first initiative. Company officials framed the work as part of an ongoing commitment to make AI technology more accessible and useful for a wide range of Indian users, from students and farmers to educators and developers.

Narayanan, speaking passionately about the potential, positioned India as a leader in this new era. “India can be a beacon of how AI can be used for social good,” he said, “including education, health and farming etc.”

However, the company was careful to add a few important caveats. Because the questions are unique and deeply tied to each specific language and culture, IndQA is not a “language leaderboard.” You cannot, for example, use its scores to definitively claim a model is “better” at Tamil than it is at Bengali.

Instead, its true value lies in measuring improvement over time within a single model family. It gives developers a clear, culturally rich target to aim for, pushing them beyond simple translation and toward genuine understanding.

Ultimately, the launch of IndQA signals a major shift in how AI capabilities are measured. As OpenAI continues to expand its global developer ecosystem—which Narayanan noted already includes 4-5 million people—the focus is clearly moving. The true test of Artificial General Intelligence (AGI) won’t be its ability to pass an American high school exam, but its capacity to understand and respectfully engage with the countless cultures that make up humanity. And that road, it seems, runs directly through the rich, diverse, and complex linguistic landscapes of India.