"The limits of my language mean the limits of my world." — Ludwig Wittgenstein
When we set out to create mHumanEval, we had a simple question: How well do Large Language Models really understand code generation across different human languages? What started as a research curiosity evolved into one of the most comprehensive multilingual code generation benchmarks, spanning 204 natural languages and 25 programming languages, with 836,400 total prompts (164 problems × 204 natural languages × 25 programming languages).
This blog post shares the technical journey, unexpected challenges, and key lessons from building this benchmark—a journey that taught us as much about the limitations of current LLMs as it did about the fascinating intersection of natural and programming languages.
The Genesis: Why Another Benchmark?
Existing code generation benchmarks like HumanEval and MBPP have been instrumental in advancing the field. However, they share a common limitation: they're almost exclusively in English. This creates several problems:
- English-centric bias: Models are primarily evaluated on their ability to understand English prompts, which can mask how they perform for non-English speakers
- Limited generalization insight: We don't know whether strong performance on English prompts carries over to other languages
- Accessibility barriers: Non-English speaking developers may face disadvantages when using these tools
- Research gaps: The multilingual capabilities of code LLMs remain largely unexplored
With over 7,000 languages spoken worldwide and a growing global developer community, we needed a benchmark that reflects this linguistic diversity.
The Scale Challenge: 204 Languages
Deciding to support 204 languages was not a random choice. We wanted to cover:
- All major programming communities worldwide
- High-resource languages (English, Spanish, Chinese, etc.)
- Mid-resource languages (Bengali, Vietnamese, Swahili, etc.)
- Low-resource languages (Yoruba, Sundanese, Maori, etc.)
We couldn't rely on simple machine translation. A coding problem is not just descriptive text; it contains several components (a possible record layout is sketched after this list):
- Technical specifications and constraints
- Input/output examples that must remain consistent
- Edge cases that need precise description
- Domain-specific terminology
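To make these pieces concrete, here is a minimal sketch of how one benchmark item could be represented. The `BenchmarkItem` class and its field names are illustrative assumptions, not the actual mHumanEval schema.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    """One translated prompt in a multilingual code generation benchmark (illustrative schema)."""
    task_id: str               # e.g. "HumanEval/0"
    natural_language: str      # ISO code of the prompt language, e.g. "bn" for Bengali
    programming_language: str  # target language, e.g. "python"
    prompt: str                # translated problem description plus function signature
    canonical_solution: str    # reference solution, identical across all translations
    tests: str                 # unit tests, which must stay unchanged after translation
```

Keeping the reference solution and the tests invariant across all 204 translations is what makes scores comparable between natural languages.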
Our Translation Pipeline
We developed a multi-stage translation and verification pipeline:
- Initial Translation: Using state-of-the-art neural machine translation models (primarily GPT-4 and specialized translation APIs)
- Technical Verification: Ensuring that code examples, function names, and technical terms remain accurate
- Consistency Checks: Verifying that input/output examples are preserved correctly (a sketch of this check follows the list)
- Native Speaker Review: For high-resource languages, we conducted spot-checks with native speakers who are also programmers
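As an illustration of the consistency-check stage, the sketch below extracts code-like spans (inline code, doctest-style examples, function calls) from the source and translated prompts and flags anything that was dropped or altered. The regular expression and helper names are simplified assumptions, not our production pipeline.

```python
import re

# Inline code in backticks, doctest-style example lines, or bare function calls
CODE_SPAN = re.compile(r"`[^`]+`|>>> .+|\b\w+\([^)]*\)")

def technical_spans(prompt: str) -> set[str]:
    """Collect spans that must survive translation byte-for-byte."""
    return set(CODE_SPAN.findall(prompt))

def consistency_errors(source_prompt: str, translated_prompt: str) -> set[str]:
    """Return spans present in the source prompt but missing or altered in the translation."""
    return technical_spans(source_prompt) - technical_spans(translated_prompt)

# Example: a translation that accidentally dropped the doctest example
errors = consistency_errors(
    "Return the factorial of n.\n>>> factorial(5)\n120",
    "n-এর ফ্যাক্টরিয়াল ফেরত দাও।",  # Bengali translation that lost the doctest line
)
if errors:
    print("needs re-translation:", errors)
```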
Surprising Findings
As we tested various LLMs on mHumanEval, several patterns emerged that challenged our assumptions:
1. The "English Anchor" Effect
Models performed significantly better when prompts contained even small amounts of English text. For example, keeping variable names in English while translating the descriptions improved performance by 15-20% on average across all non-English languages.
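To make the manipulation concrete, here are two hypothetical variants of the same Spanish prompt; the wording is invented for illustration, but the second, code-switched form is the kind that scored higher in our runs.

```python
# Fully translated: even the identifiers are localized
fully_translated = (
    "Escribe una función `contar_vocales(texto)` que devuelva "
    "el número de vocales en la cadena dada."
)

# Code-switched: description in Spanish, identifiers kept in English
code_switched = (
    "Escribe una función `count_vowels(text)` que devuelva "
    "el número de vocales en la cadena dada."
)
```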
2. Script Matters More Than Language Family
We initially hypothesized that models would perform similarly on languages from the same family (e.g., Romance languages). However, we found that script type (Latin, Cyrillic, Arabic, Devanagari, etc.) was a better predictor of performance than linguistic family.
3. The Resource Paradox
Surprisingly, some mid-resource languages outperformed high-resource ones in specific programming domains. For instance, models performed better on Bengali prompts for algorithmic problems than on some European languages, possibly due to the prevalence of Bengali programmers in competitive programming communities.
4. Programming Language Transfer
Models showed interesting cross-lingual transfer: a model that struggled with Python prompts in Bengali performed better when asked to generate JavaScript, suggesting that the training data distribution affects multilingual code generation in complex ways.
Key Technical Decisions
Choosing the Programming Languages
We selected 25 programming languages to cover:
- Popular general-purpose languages: Python, JavaScript, Java, C++, C#, Go, Rust, Swift
- Web technologies: TypeScript, PHP, Ruby
- Functional languages: Haskell, Scala, Erlang, F#
- Systems languages: C and Assembly variants (alongside Rust, listed above)
- Emerging languages: Julia, Kotlin, Dart, Mojo
Evaluation Metrics
We use pass@k metrics (pass@1, pass@5, pass@10) as the primary evaluation criteria (an estimator sketch follows this list), where:
- pass@1: Fraction of problems for which the first generated sample passes all unit tests
- pass@5: Fraction of problems for which at least one of 5 generated samples passes
- pass@10: Fraction of problems for which at least one of 10 generated samples passes
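In practice, pass@k is usually computed with the unbiased estimator introduced alongside the original HumanEval benchmark: generate n ≥ k samples per problem, count the c samples that pass the tests, and estimate the probability that at least one of k randomly drawn samples would pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated per problem, c of them passed the tests."""
    if n - c < k:
        return 1.0  # every size-k subset of the samples contains a passing one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for one problem, 37 of them pass
print(pass_at_k(n=200, c=37, k=1))   # ≈ 0.185
print(pass_at_k(n=200, c=37, k=10))  # ≈ 0.88
# The benchmark-level score is the mean of this estimate over all problems.
```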
Additionally, we track:
- Syntax correctness rates across languages (a parse-based check is sketched after this list)
- Runtime efficiency of generated solutions
- Code quality metrics (readability, maintainability)
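For the syntax correctness rate, the check can be as simple as attempting to parse each generated solution. The sketch below is Python-only (other target languages need their own parsers or compilers) and uses illustrative completions.

```python
import ast

def parses(code: str) -> bool:
    """Return True if a generated Python solution is at least syntactically valid."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

completions = [
    "def add(a, b):\n    return a + b",
    "def add(a, b)\n    return a + b",  # missing colon -> SyntaxError
]
rate = sum(parses(c) for c in completions) / len(completions)
print(f"syntax correctness rate: {rate:.0%}")  # 50%
```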
Lessons for the Research Community
1. Multilingual Data != Multilingual Understanding
Having training data in multiple languages doesn't guarantee true multilingual code generation capability. Models often rely on English as an "intermediate representation," which can introduce errors and biases.
2. Context Window Considerations
Different languages have different verbosity levels. A problem description that takes 200 tokens in English might take 300 in German or 180 in Chinese. This affects how models with limited context windows handle longer problems.
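One quick way to see the verbosity gap is to tokenize the same description in several languages. The sketch below assumes the third-party `tiktoken` library and its generic `cl100k_base` encoding, so the exact counts are illustrative rather than the figures quoted above.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompts = {
    "English": "Write a function that returns the sum of all even numbers in a list.",
    "German":  "Schreibe eine Funktion, die die Summe aller geraden Zahlen in einer Liste zurückgibt.",
    "Chinese": "编写一个函数，返回列表中所有偶数之和。",
}

for language, text in prompts.items():
    print(f"{language:8} {len(enc.encode(text)):3d} tokens")
```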
3. The Importance of Code-Switched Data
Real-world developers often mix languages in their comments and variable names. Benchmarks need to reflect this reality. We found that models trained on code-switched data performed better across the board.
4. Cultural Context in Problem Design
Some problems contain implicit cultural context that doesn't translate well. For example, problems involving dates, addresses, or measurement systems need careful handling to ensure fairness across cultures.
Challenges and Limitations
Despite our best efforts, mHumanEval has limitations:
- Translation quality variance: While we ensured high quality for major languages, some low-resource language translations may contain subtle errors
- Cultural equivalence: Some problems may be more familiar to certain cultural contexts
- Dynamic nature of languages: Both natural and programming languages evolve; the benchmark requires periodic updates
- Computational cost: Evaluating on 836,400 prompts is expensive, limiting accessibility for some researchers
Future Directions
The development of mHumanEval opens several exciting research directions:
- Fine-tuning strategies: How can we better adapt models for multilingual code generation?
- Cross-lingual transfer learning: Can we leverage high-resource languages to improve low-resource performance?
- Cultural adaptation: How do we make code generation tools more culturally aware?
- Real-world deployment: How does benchmark performance correlate with actual developer productivity across languages?
Open Source and Community
mHumanEval is completely open source. We believe that advancing multilingual AI for code generation requires community collaboration. The benchmark is available on GitHub, and we actively encourage:
- Contributions for improving translations
- Addition of new programming languages
- Extensions to cover more natural languages
- Development of better evaluation metrics
Get Started: Access the full benchmark, code, and documentation on GitHub.
Conclusion
Building mHumanEval taught us that creating truly multilingual AI systems is about more than just translation. It's about understanding how language shapes thought, how culture influences problem-solving, and how we can build tools that work for everyone, regardless of the language they speak.
The journey from 1 language to 204 languages wasn't just about scale—it was about inclusion. As AI systems become more integral to software development, ensuring they work equally well for developers worldwide isn't just a technical challenge; it's a moral imperative.
We hope mHumanEval serves as a stepping stone toward more equitable and accessible AI-powered development tools. The code generation revolution should be available to all 26 million developers worldwide, not just the English-speaking ones.
Read the Full Paper
For technical details, experimental results, and comprehensive analysis, check out our full paper accepted at NAACL 2025.
About the Author: Nishat Raihan is a PhD student at George Mason University researching large language models, multilingual NLP, and code generation. His work focuses on making AI tools more accessible and effective for developers worldwide.