"The limits of my language mean the limits of my world." (Ludwig Wittgenstein)

When we set out to create mHumanEval, we had a simple question: how well do large language models (LLMs) generate code when the prompt is written in a human language other than English? What started as a research curiosity grew into one of the most comprehensive multilingual code generation benchmarks: 204 natural languages and 25 programming languages, for 836,400 prompts in total.

This post shares the technical journey, unexpected challenges, and key lessons from building this benchmark, a journey that taught us as much about the limitations of current LLMs as it did about the fascinating intersection of natural and programming languages.

The Genesis: Why Another Benchmark?

Existing code generation benchmarks like HumanEval and MBPP have been instrumental in advancing the field. However, they share a common limitation: they are almost exclusively in English. This creates several problems:

  • English-centric bias: models are evaluated almost entirely on English prompts, so we rarely see how they perform for non-English speakers.
  • Unknown generalization: we do not know whether strong performance in English carries over to other languages.
  • Accessibility barriers: non-English-speaking developers may face disadvantages when using these tools.
  • Research gaps: the multilingual capabilities of code LLMs remain largely unexplored.

With over 7,000 languages spoken worldwide and a growing global developer community, we needed a benchmark that reflects this linguistic diversity.

The Scale Challenge: 204 Languages

Deciding to support 204 languages was not a random choice. We wanted to cover:

  • All major programming communities worldwide
  • High-resource languages (English, Spanish, Chinese, etc.)
  • Mid-resource languages (Bengali, Vietnamese, Swahili, etc.)
  • Low-resource languages (Yoruba, Sundanese, Maori, etc.)

Technical challenge: How do you accurately translate coding prompts while preserving their semantic meaning, technical accuracy, and problem-solving requirements?

We could not rely on simple machine translation. A coding problem is not just descriptive text. It contains:

  1. Technical specifications and constraints
  2. Input/output examples that must remain consistent
  3. Edge cases that need precise description
  4. Domain-specific terminology

Our translation pipeline

We developed a multi-stage translation and verification pipeline:

  1. Initial translation: using state-of-the-art machine translation systems (primarily GPT-4 and specialized translation APIs).
  2. Technical verification: ensuring that code examples, function names, and technical terms remain accurate.
  3. Consistency checks: verifying that input/output examples are preserved correctly.
  4. Native-speaker review: for high-resource languages, we conducted spot-checks with native speakers who are also programmers.
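
The consistency check in step 3 lends itself to automation. The following is a minimal sketch, not the pipeline's actual code: it assumes Python prompts in the HumanEval docstring style, where each `>>>` example line is followed by its expected output, and the helper names (`code_elements`, `is_consistent`) are hypothetical.

```python
import re


def _doctest_pairs(lines):
    """Pair each '>>>' example line with the line that follows it.

    Assumption: the following line is the expected output, as in
    HumanEval-style docstrings.
    """
    pairs = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith(">>>"):
            expected = lines[i + 1].strip() if i + 1 < len(lines) else ""
            pairs.append((stripped, expected))
    return pairs


def code_elements(prompt: str) -> dict:
    """Collect the parts of a prompt that translation must leave untouched."""
    lines = prompt.splitlines()
    return {
        # Function signatures, e.g. "def add(a: int, b: int) -> int:"
        "signatures": [ln.strip() for ln in lines
                       if re.match(r"\s*def\s+\w+\s*\(", ln)],
        # Input/output examples from the docstring
        "examples": _doctest_pairs(lines),
    }


def is_consistent(source_prompt: str, translated_prompt: str) -> bool:
    """True if signatures and examples are identical in both prompts."""
    return code_elements(source_prompt) == code_elements(translated_prompt)
```

A translated prompt whose description changed but whose signature and examples did not passes the check; one with a renamed function or an altered expected output fails and can be sent back for retranslation.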

Surprising Findings

As we tested various LLMs on mHumanEval, several patterns emerged that challenged our assumptions:

1. The "English anchor" effect

Models performed significantly better when prompts contained even small amounts of English text. For example, keeping variable names in English while translating descriptions improved performance by 15–20% on average across all non-English languages.

2. Script matters more than language family

We initially hypothesized that models would perform similarly on languages from the same family (e.g., Romance languages). However, we found that script type (Latin, Cyrillic, Arabic, Devanagari, etc.) was a better predictor of performance than linguistic family.
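
To group results by script, each prompt needs a script label. One rough way to get it, shown here as an illustrative sketch rather than the method used in the benchmark, is to read the first word of each letter's Unicode character name (for example "CYRILLIC" or "DEVANAGARI") and take the most frequent one. The function name `dominant_script` is hypothetical, and the sketch is meant for the natural-language description only, since code identifiers would skew the count toward Latin.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return a rough script label based on Unicode character names."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        if name:
            # Character names begin with the script, e.g. "CYRILLIC SMALL LETTER A"
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


print(dominant_script("Привет, мир"))  # CYRILLIC
print(dominant_script("hello world"))  # LATIN
```

Per-language pass rates can then be averaged per label and compared against an average per language family.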

3. The resource paradox

Surprisingly, some mid-resource languages outperformed high-resource ones in specific programming domains. For instance, models performed better on Bengali prompts for algorithmic problems than on prompts in some European languages, possibly due to the prevalence of Bengali programmers in competitive programming communities.

4. Programming-language transfer

The target programming language also mattered: a model that struggled to generate Python from Bengali prompts performed better when asked to generate JavaScript, suggesting that the training data distribution affects multilingual code generation in complex ways.

Key Technical Decisions

Choosing the programming languages

We selected 25 programming languages to cover:

  • Popular general-purpose languages: Python, JavaScript, Java, C++, C#, Go, Rust, Swift
  • Web technologies: TypeScript, PHP, Ruby
  • Functional languages: Haskell, Scala, Erlang, F#
  • Systems languages: C and Assembly variants (alongside C++ and Rust, listed above)
  • Emerging languages: Julia, Kotlin, Dart, Mojo

Evaluation metrics

We use pass@k metrics (pass@1, pass@5, pass@10) as the primary evaluation criteria, where:

  • pass@1: percentage of problems solved on the first attempt
  • pass@5: percentage of problems solved within 5 attempts
  • pass@10: percentage of problems solved within 10 attempts
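
In practice, pass@k is usually estimated rather than measured by literally stopping after k attempts. A common choice is the unbiased estimator introduced with the original HumanEval benchmark: draw n ≥ k samples per problem, count the c samples that pass the tests, and compute 1 − C(n−c, k) / C(n, k). The sketch below assumes that estimator; the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # Probability that a random size-k subset contains no passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, with 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is about 0.30 and `pass_at_k(10, 3, 5)` is about 0.92.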

Additionally, we track syntax correctness rates across languages, runtime efficiency of generated solutions, and code-quality metrics such as readability and maintainability.
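
As an illustration of the syntax-correctness metric, here is a minimal sketch for Python targets only, using the standard library parser. Other target languages would need their own parser or compiler check, and the function names are hypothetical rather than taken from the benchmark's code.

```python
import ast


def is_valid_python(source: str) -> bool:
    """True if the source parses as Python, without executing it."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on some Python versions
        return False


def syntax_correct_rate(solutions) -> float:
    """Fraction of generated solutions that are syntactically valid."""
    if not solutions:
        return 0.0
    return sum(is_valid_python(s) for s in solutions) / len(solutions)
```

A solution can be syntactically valid and still fail its tests, so this rate complements pass@k rather than replacing it.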

Lessons for the Research Community

1. Multilingual data ≠ multilingual understanding

Having training data in multiple languages does not guarantee true multilingual code generation capability. Models often rely on English as an "intermediate representation," which can introduce errors and biases.

2. Context-window considerations

Different languages have different verbosity levels. A problem description that takes 200 tokens in English might take 300 in German or 180 in Chinese. This affects how models with limited context windows handle longer problems.
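
A simple budget check makes the effect concrete. The sketch below is illustrative: `fits_context` and its parameters are hypothetical names, and `utf8_bytes` is only a stand-in for a real tokenizer. For byte-level tokenizers the UTF-8 length is an upper bound on the token count, not the count itself, so swap in the target model's own tokenizer for real measurements.

```python
from typing import Callable


def utf8_bytes(text: str) -> int:
    """Stand-in token counter: number of UTF-8 bytes (an assumption, see above)."""
    return len(text.encode("utf-8"))


def fits_context(prompt: str,
                 count_tokens: Callable[[str], int],
                 context_window: int,
                 reserved_for_output: int) -> bool:
    """True if the prompt plus the space reserved for the answer fits."""
    return count_tokens(prompt) + reserved_for_output <= context_window


# The same greeting costs a different number of bytes in each script.
for word in ("hello", "привет", "你好"):
    print(word, utf8_bytes(word))  # 5, 12, 6
```

Running the same check over every translation of a problem shows which languages push a prompt closest to the limit.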

3. The importance of code-switched data

Real-world developers often mix languages in their comments and variable names. Benchmarks need to reflect this reality. We found that models trained on code-switched data performed better across the board.

4. Cultural context in problem design

Some problems contain implicit cultural context that does not translate well. For example, problems involving dates, addresses, or measurement systems need careful handling to ensure fairness across cultures.

Challenges and Limitations

Despite our best efforts, mHumanEval has limitations:

  • Translation-quality variance: while we ensured high quality for major languages, some low-resource language translations may contain subtle errors.
  • Cultural equivalence: some problems may be more familiar to certain cultural contexts.
  • Dynamic nature of languages: both natural and programming languages evolve; the benchmark requires periodic updates.
  • Computational cost: evaluating on 836,400 prompts is expensive, limiting accessibility for some researchers.
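
A little arithmetic shows where the cost comes from. The sketch below uses only the figures quoted in this post; the per-pair problem count is derived from them (it works out to 164, the size of the original HumanEval), and the function names are illustrative.

```python
NATURAL_LANGUAGES = 204
PROGRAMMING_LANGUAGES = 25
TOTAL_PROMPTS = 836_400

# Problems per (natural language, programming language) pair
problems_per_pair = TOTAL_PROMPTS // (NATURAL_LANGUAGES * PROGRAMMING_LANGUAGES)
print(problems_per_pair)  # 164


def prompts_for(natural_languages: int, programming_languages: int) -> int:
    """Number of prompts in a subset of the benchmark."""
    return natural_languages * programming_languages * problems_per_pair


def generations_needed(samples_per_prompt: int, prompts: int = TOTAL_PROMPTS) -> int:
    """Model generations required to sample every prompt this many times."""
    return prompts * samples_per_prompt


print(generations_needed(10))                   # 8364000 for the full benchmark
print(generations_needed(10, prompts_for(10, 1)))  # 16400 for 10 languages, 1 target
```

Sampling 10 completions per prompt on the full benchmark means more than 8 million generations, which is why evaluating on a subset of languages is the practical option for smaller budgets.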

Future Directions

  1. Fine-tuning strategies: how can we better adapt models for multilingual code generation?
  2. Cross-lingual transfer learning: can we leverage high-resource languages to improve low-resource performance?
  3. Cultural adaptation: how do we make code generation tools more culturally aware?
  4. Real-world deployment: how does benchmark performance correlate with actual developer productivity across languages?

Open Source and Community

mHumanEval is completely open source. We believe that advancing multilingual AI for code generation requires community collaboration, and we actively encourage contributions that improve translations, add new programming languages, extend coverage of natural languages, or develop better evaluation metrics.

Get started: access the full benchmark, code, and documentation on GitHub.

Conclusion

Building mHumanEval taught us that creating truly multilingual AI systems is about more than just translation. It is about understanding how language shapes thought, how culture influences problem-solving, and how we can build tools that work for everyone, regardless of the language they speak.

The journey from one language to 204 was not just about scale. It was about inclusion. As AI systems become more integral to software development, ensuring they work equally well for developers worldwide is not just a technical challenge; it is a matter of fairness.

We hope mHumanEval serves as a stepping stone toward more equitable and accessible AI-powered development tools. The code generation revolution should be available to all developers worldwide, not just the English-speaking ones.

Read the full paper

For technical details, experimental results, and comprehensive analysis, see the full paper published at NAACL 2025.

About the author. Nishat Raihan is finishing his Ph.D. at George Mason University and joining the University of Notre Dame as a Provost's Postdoctoral Fellow in Fall 2026. His work focuses on adapting Code LLMs for under-explored domains: low-resource languages, less common programming languages, and the settings standard benchmarks miss.