2025
TigerLLM - A Family of Bangla Large Language Models
Nishat Raihan, Marcos Zampieri
The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)
Accepted
A comprehensive family of large language models designed specifically for the Bangla language, advancing the state of the art in Bangla NLP.
mHumanEval - A Multilingual Benchmark to Evaluate Large Language Models for Code Generation
Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri
Proceedings of The Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025)
A massively multilingual benchmark for evaluating the code generation capabilities of Large Language Models, featuring 836,400 coding prompts written in 204 natural languages and covering 25 programming languages.
MojoBench: Language Modeling and Benchmarks for Mojo
Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri
Findings of The Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025 - Findings)
A complete LLM framework comprising a corpus (6.5M tokens), instruction datasets, and a model that achieves state-of-the-art performance in Mojo code generation.
Large Language Models in Computer Science Education: A Systematic Literature Review
Nishat Raihan, et al.
Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1 (SIGCSE-TS 2025)
A systematic literature review examining 125 papers on how LLMs are used and perceived in computer science education, identifying key trends, challenges, and opportunities.
2024
Code LLMs: A Taxonomy-based Survey
Nishat Raihan, Christian Newman, Marcos Zampieri
2024 IEEE International Conference on Big Data (IEEE Big Data 2024)
A comprehensive taxonomy-based survey on Code LLMs, covering evaluation benchmarks, corpora, limitations, and open problems in the domain.
CSEPrompts: A Benchmark of Introductory Computer Science Prompts
Nishat Raihan, Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Christian Newman, Tharindu Ranasinghe, Marcos Zampieri
27th International Symposium on Methodologies for Intelligent Systems (ISMIS 2024)
A framework with hundreds of programming prompts and MCQs from introductory CS courses, evaluating LLM performance in Python code generation and CS fundamentals.
MentalHelp: A Multi-Task Dataset for Mental Health in Social Media
Md Nishat Raihan, Sadiya Sayara Chowdhury Puspo, Shafkat Farabi, Ana-Maria Bucur, Tharindu Ranasinghe, Marcos Zampieri
The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
A comprehensive, semi-supervised dataset of 14 million instances for mental health support research, containing diverse conversations for developing empathetic AI systems.
Py-holmes: Causal Testing for Deep Neural Networks in Python
Wren McQueary, et al.
Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE 2024)
The BEA 2024 Shared Task on the Multilingual Lexical Simplification Pipeline
Matthew Shardlow, et al.
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
An Extensible Massively Multilingual Lexical Simplification Pipeline Dataset using the MultiLS Framework
Matthew Shardlow, et al.
Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI @ LREC-COLING 2024)
EmoMix-3L: A Code-Mixed Dataset in Bangla-English-Hindi for Emotion Detection
Nishat Raihan, et al.
LREC-COLING 2024
MasonTigers at SemEval-2024 Task 9: Solving Puzzles with an Ensemble of Chain-of-Thoughts
Md Nishat Raihan, Dhiman Goswami, Al Nahian Bin Emran, Sadiya Sayara Chowdhury Puspo, Amrita Ganguly, Marcos Zampieri
The 18th International Workshop on Semantic Evaluation (SemEval @ NAACL 2024)
MasonTigers at SemEval-2024 Task 1: An Ensemble Approach for Semantic Textual Relatedness
Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Md Nishat Raihan, Al Nahian Bin Emran, Amrita Ganguly, Marcos Zampieri
The 18th International Workshop on Semantic Evaluation (SemEval @ NAACL 2024)
MasonTigers at SemEval-2024 Task 8: Performance Analysis of Transformer-based Models on Machine-Generated Text Detection
Sadiya Sayara Chowdhury Puspo, Md Nishat Raihan, Dhiman Goswami, Al Nahian Bin Emran, Amrita Ganguly, O Uzuner
The 18th International Workshop on Semantic Evaluation (SemEval @ NAACL 2024)
MasonPerplexity at ClimateActivism 2024: Integrating Advanced Ensemble Techniques and Data Augmentation for Climate Activism Stance and Hate Event Identification
Amrita Ganguly, Sadiya Sayara Chowdhury Puspo, Dhiman Goswami, Md Nishat Raihan
Fourth Workshop on Language Technology for Equality, Diversity, Inclusion (LTEDI @ EACL 2024)
MasonPerplexity at Multimodal Hate Speech Event Detection 2024: Hate Speech and Target Detection Using Transformer Ensembles
Amrita Ganguly, Al Nahian Bin Emran, Sadiya Sayara Chowdhury Puspo, Md Nishat Raihan, Dhiman Goswami, Marcos Zampieri
7th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE @ EACL 2024)
MasonTigers @ LT-EDI-2024: An Ensemble Approach towards Detecting Homophobia and Transphobia in Social Media Comments
Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Md Nishat Raihan, Al Nahian Bin Emran
Fourth Workshop on Language Technology for Equality, Diversity, Inclusion (LTEDI @ EACL 2024)
2023
Offensive Language Identification in Transliterated and Code-Mixed Bangla
Md Nishat Raihan, Umma Hani Tanmoy, Anika Binte Islam, Kai North, Tharindu Ranasinghe, Antonios Anastasopoulos, Marcos Zampieri
Proceedings of the 1st International Workshop on Bangla Language Processing (BLP-2023 @ EMNLP)
BEST PAPER AWARD
nlpBDpatriots at BLP-2023 Task 1: A Two-Step Classification for Violence Inciting Text Detection in Bangla
Md Nishat Raihan, Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Marcos Zampieri
Proceedings of the 1st International Workshop on Bangla Language Processing (BLP-2023 @ EMNLP)
nlpBDpatriots at BLP-2023 Task 2: A Transfer Learning Approach to Bangla Sentiment Analysis
Dhiman Goswami, Md Nishat Raihan, Sadiya Sayara Chowdhury Puspo, Marcos Zampieri
Proceedings of the 1st International Workshop on Bangla Language Processing (BLP-2023 @ EMNLP)
OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification
Dhiman Goswami, Md Nishat Raihan, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri
The First Workshop in South East Asian Language Processing (AACL 2023)
SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis
Md Nishat Raihan, Dhiman Goswami, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri
The 11th International Workshop on Natural Language Processing for Social Media (AACL 2023)
Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi
Md Nishat Raihan, Dhiman Goswami, Antara Mahmud
arXiv preprint arXiv:2309.10272 (2023)
Determining the Optimal Number of Clusters for Time Series Datasets with Symbolic Pattern Forest
Md Nishat Raihan
arXiv preprint arXiv:2310.00820 (2023)