Building mHumanEval: Lessons from Creating a 204-Language Benchmark

"The limits of my language mean the limits of my world." — Ludwig Wittgenstein

When we set out to create mHumanEval, we had a simple question: How well do Large Language Models really understand code generation across different human languages? What started as a research curiosity evolved into one of the most comprehensive multilingual code generation benchmarks, spanning 204 natural languages and 25 programming languages, with 836,400 total prompts.

This blog post shares the technical journey, unexpected challenges, and key lessons from building this benchmark—a journey that taught us as much about the limitations of current LLMs as it did about the fascinating intersection of natural and programming languages.

The Genesis: Why Another Benchmark?

Existing code generation benchmarks like HumanEval and MBPP have been instrumental in advancing the field. However, they share a common limitation: they're almost exclusively in English. This creates several problems:

English-centric bias: Models are evaluated almost entirely on English prompts, so we learn little about how they perform on prompts written in other languages
Limited generalization understanding: We don't know if strong performance in English translates to other languages
Accessibility barriers: Non-English speaking developers may face disadvantages when using these tools
Research gaps: The multilingual capabilities of code LLMs remain largely unexplored

With over 7,000 languages spoken worldwide and a growing global developer community, we needed a benchmark that reflects this linguistic diversity.

The Scale Challenge: 204 Languages

Deciding to support 204 languages was not a random choice. We wanted to cover:

All major programming communities worldwide
High-resource languages (English, Spanish, Chinese, etc.)
Mid-resource languages (Bengali, Vietnamese, Swahili, etc.)
Low-resource languages (Yoruba, Sundanese, Maori, etc.)

Technical Challenge: How do you accurately translate coding prompts while preserving their semantic meaning, technical accuracy, and problem-solving requirements?

We couldn't rely on simple machine translation. A coding problem is not just descriptive text—it contains:

Technical specifications and constraints
Input/output examples that must remain consistent
Edge cases that need precise description
Domain-specific terminology

Our Translation Pipeline

We developed a multi-stage translation and verification pipeline:

Initial Translation: Using state-of-the-art translation systems (primarily GPT-4 and specialized machine translation APIs)
Technical Verification: Ensuring that code examples, function names, and technical terms remain accurate
Consistency Checks: Verifying that input/output examples are preserved correctly
Native Speaker Review: For high-resource languages, we conducted spot-checks with native speakers who are also programmers
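
To make the consistency check concrete, here is a minimal sketch of one way it could work. It is an illustration, not the actual mHumanEval pipeline: it assumes prompts follow the HumanEval layout (a Python function signature plus a docstring with `>>>` examples, each followed by one line of expected output), and the helper names `code_skeleton` and `is_consistent` are hypothetical.

```python
def code_skeleton(prompt: str) -> list[str]:
    """Collect the lines a translation must leave untouched:
    the function signature, each >>> example, and the single
    output line that follows each example."""
    kept: list[str] = []
    after_example = False
    for line in prompt.splitlines():
        stripped = line.strip()
        if stripped.startswith("def "):
            kept.append(stripped)
            after_example = False
        elif stripped.startswith(">>>"):
            kept.append(stripped)
            after_example = True
        elif after_example and stripped and not stripped.startswith('"""'):
            kept.append(stripped)  # expected output of the example above
            after_example = False
        else:
            after_example = False  # prose or delimiter: free to change
    return kept


def is_consistent(original: str, translated: str) -> bool:
    """True when signature and examples survived translation unchanged."""
    return code_skeleton(original) == code_skeleton(translated)


ORIGINAL = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b.
    >>> add(2, 3)
    5
    """
'''

TRANSLATED_OK = '''def add(a: int, b: int) -> int:
    """Devuelve la suma de a y b.
    >>> add(2, 3)
    5
    """
'''

# A translation that "helpfully" renamed the function inside the example.
TRANSLATED_BAD = '''def add(a: int, b: int) -> int:
    """Devuelve la suma de a y b.
    >>> sumar(2, 3)
    5
    """
'''
```

A check like this only catches changes to the code skeleton; it says nothing about whether the translated prose is faithful, which is why the native speaker review step still matters.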

Surprising Findings

As we tested various LLMs on mHumanEval, several patterns emerged that challenged our assumptions:

1. The "English Anchor" Effect

Models showed a strong tendency to perform significantly better when prompts contained even small amounts of English text. For example, keeping variable names in English while translating descriptions improved performance by 15-20% on average across all non-English languages.
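
One way to produce such a partially translated prompt is to translate only the descriptive text of the docstring and leave everything else alone. The sketch below assumes HumanEval-style prompts with a `"""` docstring; `translate` stands in for whatever translation backend is used, and `translate_description_only` is a hypothetical helper rather than part of the mHumanEval code.

```python
from typing import Callable

TRIPLE = '"""'


def _translate_prose(line: str, translate: Callable[[str], str]) -> str:
    """Translate the text on one docstring line, keeping its indentation
    and any leading or trailing triple-quote delimiter."""
    indent = line[: len(line) - len(line.lstrip())]
    body = line.strip()
    prefix = TRIPLE if body.startswith(TRIPLE) else ""
    body = body[len(prefix):]
    suffix = TRIPLE if body.endswith(TRIPLE) else ""
    body = body[: len(body) - len(suffix)]
    text = body.strip()
    return indent + prefix + (translate(text) if text else "") + suffix


def translate_description_only(prompt: str, translate: Callable[[str], str]) -> str:
    """Translate docstring prose; keep the signature, identifiers,
    and >>> examples (with one output line each) in their original form."""
    out: list[str] = []
    in_docstring = False
    after_example = False
    for line in prompt.splitlines():
        stripped = line.strip()
        quotes = line.count(TRIPLE)
        if not in_docstring and quotes == 0:
            out.append(line)  # code outside the docstring
        elif stripped.startswith(">>>"):
            out.append(line)  # example call
            after_example = True
        elif after_example and stripped and not stripped.startswith(TRIPLE):
            out.append(line)  # expected output of the example
            after_example = False
        else:
            out.append(_translate_prose(line, translate))
            after_example = False
        if quotes % 2 == 1:
            in_docstring = not in_docstring
    return "\n".join(out)


# Stand-in translator for the demo: a one-entry lookup table.
DEMO_TABLE = {"Return the sum of a and b.": "Devuelve la suma de a y b."}


def demo_translate(text: str) -> str:
    return DEMO_TABLE.get(text, text)


PROMPT = 'def add(a, b):\n    """Return the sum of a and b.\n    >>> add(2, 3)\n    5\n    """\n'
anchored = translate_description_only(PROMPT, demo_translate)
```

Here `anchored` keeps `add`, `a`, and `b` exactly as written and changes only the sentence describing the task.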

2. Script Matters More Than Language Family

We initially hypothesized that models would perform similarly on languages from the same family (e.g., Romance languages). However, we found that script type (Latin, Cyrillic, Arabic, Devanagari, etc.) was a better predictor of performance than linguistic family.
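
To group results by script rather than by language family, each prompt first needs a script label. The following is a rough sketch of one way to get that label with only the Python standard library. It uses the first word of each letter's Unicode character name as a proxy for its script, which is a heuristic and not the official Unicode Script property; the function name `dominant_script` is hypothetical.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most frequent script label among the letters in text.

    The label is the first word of the Unicode character name
    (for example LATIN, CYRILLIC, ARABIC, DEVANAGARI, BENGALI, CJK).
    """
    counts: Counter[str] = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"
```

With labels like these, per-prompt scores can be bucketed by script and compared against the same scores bucketed by language family.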

3. The Resource Paradox

Surprisingly, some mid-resource languages outperformed high-resource ones in specific programming domains. For instance, models performed better on Bengali prompts for algorithmic problems than on some European languages, possibly due to the prevalence of Bengali programmers in competitive programming communities.

4. Programming Language Transfer

Models showed interesting cross-lingual transfer: a model that struggled with Python prompts in Bengali performed better when asked to generate JavaScript, suggesting that the training data distribution affects multilingual code generation in complex ways.

Key Technical Decisions

Choosing the Programming Languages

We selected 25 programming languages to cover:

Popular general-purpose languages: Python, JavaScript, Java, C++, C#, Go, Rust, Swift
Web technologies: TypeScript, PHP, Ruby
Functional languages: Haskell, Scala, Erlang, F#
Systems languages: C, Rust (also counted above), Assembly variants
Emerging languages: Julia, Kotlin, Dart, Mojo

Evaluation Metrics

We use pass@k metrics (pass@1, pass@5, pass@10) as the primary evaluation criteria, where:

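pass@1: Percentage of problems for which a single generated solution passes all tests
pass@5: Percentage of problems for which at least one of 5 generated solutions passes all tests
pass@10: Percentage of problems for which at least one of 10 generated solutions passes all tests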
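
For readers who want to compute these numbers themselves: the original HumanEval paper estimates pass@k by drawing n samples per problem (with n at least k), counting the c samples that pass, and computing 1 - C(n-c, k) / C(n, k), then averaging over problems. The sketch below implements that estimator. That mHumanEval results are computed in exactly this way is an assumption here, and the function name `pass_at_k` is ours.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass all tests, k: attempt budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimate reduces to c / n, the fraction of samples that pass.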

Additionally, we track:

Syntax correctness rates across languages
Runtime efficiency of generated solutions
Code quality metrics (readability, maintainability)
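
As an example of the first item, a syntax correctness rate for Python completions can be approximated by checking whether each completion parses. This is a minimal sketch for Python output only; other target languages would need their own parser or compiler check, and the helper names are hypothetical.

```python
import ast


def is_valid_python(source: str) -> bool:
    """True if the source parses as Python; it is never executed."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def syntax_correctness_rate(completions: list[str]) -> float:
    """Fraction of completions that parse without a syntax error."""
    if not completions:
        return 0.0
    return sum(is_valid_python(c) for c in completions) / len(completions)
```

Parsing says nothing about whether the code is correct, so this rate is a complement to pass@k rather than a substitute for it.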

Lessons for the Research Community

1. Multilingual Data != Multilingual Understanding

Having training data in multiple languages doesn't guarantee true multilingual code generation capability. Models often rely on English as an "intermediate representation," which can introduce errors and biases.

2. Context Window Considerations

Different languages have different verbosity levels. A problem description that takes 200 tokens in English might take 300 in German or 180 in Chinese. This affects how models with limited context windows handle longer problems.
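
A quick back-of-the-envelope calculation shows the effect. The token counts below are the illustrative figures from the paragraph above, and the 512-token context window is a hypothetical value chosen only to make the arithmetic visible.

```python
def remaining_budget(context_window: int, prompt_tokens: int) -> int:
    """Tokens left for the generated solution after the prompt is read."""
    return max(context_window - prompt_tokens, 0)


# Illustrative prompt lengths for the same problem (from the text above).
prompt_tokens = {"English": 200, "German": 300, "Chinese": 180}
context_window = 512  # hypothetical small context window

budgets = {
    lang: remaining_budget(context_window, n) for lang, n in prompt_tokens.items()
}
# Prompt length relative to the English version of the same problem.
relative_length = {lang: n / prompt_tokens["English"] for lang, n in prompt_tokens.items()}
```

Under these assumptions the German prompt leaves 212 tokens for the solution, against 312 for English and 332 for Chinese.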

3. The Importance of Code-Switched Data

Real-world developers often mix languages in their comments and variable names. Benchmarks need to reflect this reality. We found that models trained on code-switched data performed better across the board.

4. Cultural Context in Problem Design

Some problems contain implicit cultural context that doesn't translate well. For example, problems involving dates, addresses, or measurement systems need careful handling to ensure fairness across cultures.

Challenges and Limitations

Despite our best efforts, mHumanEval has limitations:

Translation quality variance: While we ensured high quality for major languages, some low-resource language translations may contain subtle errors
Cultural equivalence: Some problems may be more familiar to certain cultural contexts
Dynamic nature of languages: Both natural and programming languages evolve; the benchmark requires periodic updates
Computational cost: Evaluating on 836,400 prompts is expensive, limiting accessibility for some researchers
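
To put the cost in perspective, the prompt count can be reproduced with simple arithmetic. The sketch assumes each of the 204 x 25 language pairs carries the 164 problems of the original HumanEval, which is consistent with the 836,400 total; the number of samples per prompt is a hypothetical evaluation setting.

```python
NATURAL_LANGUAGES = 204
PROGRAMMING_LANGUAGES = 25
PROBLEMS_PER_PAIR = 164  # size of the original HumanEval, assumed to carry over

total_prompts = NATURAL_LANGUAGES * PROGRAMMING_LANGUAGES * PROBLEMS_PER_PAIR


def generations_needed(prompts: int, samples_per_prompt: int) -> int:
    """Model calls required to sample every prompt the given number of times."""
    return prompts * samples_per_prompt


full_run_at_10_samples = generations_needed(total_prompts, 10)
```

Sampling 10 completions per prompt for a pass@10 estimate would therefore take 8,364,000 generations for a single model, before any test execution.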

Future Directions

The development of mHumanEval opens several exciting research directions:

Fine-tuning strategies: How can we better adapt models for multilingual code generation?
Cross-lingual transfer learning: Can we leverage high-resource languages to improve low-resource performance?
Cultural adaptation: How do we make code generation tools more culturally aware?
Real-world deployment: How does benchmark performance correlate with actual developer productivity across languages?

Open Source and Community

mHumanEval is completely open source. We believe that advancing multilingual AI for code generation requires community collaboration. The benchmark is available on GitHub, and we actively encourage:

Contributions for improving translations
Addition of new programming languages
Extensions to cover more natural languages
Development of better evaluation metrics

Get Started: Access the full benchmark, code, and documentation on GitHub.


Conclusion

Building mHumanEval taught us that creating truly multilingual AI systems is about more than just translation. It's about understanding how language shapes thought, how culture influences problem-solving, and how we can build tools that work for everyone, regardless of the language they speak.

The journey from 1 language to 204 languages wasn't just about scale—it was about inclusion. As AI systems become more integral to software development, ensuring they work equally well for developers worldwide isn't just a technical challenge; it's a moral imperative.

We hope mHumanEval serves as a stepping stone toward more equitable and accessible AI-powered development tools. The code generation revolution should be available to all of the estimated 26 million developers worldwide, not just the English-speaking ones.

Read the Full Paper

For technical details, experimental results, and comprehensive analysis, check out our full paper accepted at NAACL 2025.


About the Author: Nishat Raihan is a PhD student at George Mason University researching large language models, multilingual NLP, and code generation. His work focuses on making AI tools more accessible and effective for developers worldwide.
