Mar 15, 2025
Building mHumanEval: Lessons from a 204-Language Benchmark
Building a massively multilingual benchmark was less about collecting code and more about understanding how programming languages travel across natural-language boundaries. This post covers the technical challenges, the surprising findings, and what 836,400 prompts taught us about the limits of "multilingual" code models.