Writing

Blog

Longer-form notes on the parts of research that do not fit in a paper: the dead ends, the design decisions, and the things I wish someone had told me earlier.

Mar 15, 2025

Building mHumanEval: Lessons from a 204-Language Benchmark

Building a massively multilingual benchmark was less about collecting code and more about understanding how programming languages travel across natural-language boundaries. This post covers the technical challenges, the surprising findings, and what 836,400 prompts taught us about the limits of "multilingual" code models.

LLMs, Benchmarking, Multilingual NLP, Code Generation
Read the post →

More posts coming as the Notre Dame work gets going.