Code TREAT
Abstract
Large foundation models are transforming software engineering, yet methodologies for evaluating them comprehensively still have significant gaps. Our framework addresses these gaps with four key improvements: Multi-Task Holistic Evaluation, Multi-Language and Multi-Modality Assessment, Robustness Assessment, and Rigorous Evaluation Methodology.
Key Insights: Evaluating over 25 state-of-the-art models, we uncover substantial performance variation across programming tasks, specific limitations in multi-modal code generation, and severe robustness issues. We also demonstrate that multi-prompt evaluation can mitigate bias and yield more reliable results.
Introduction

TREAT introduces the first holistic evaluation framework for Large Language Models in code intelligence tasks.
Our framework features Multi-Task Holistic Evaluation spanning the entire software development lifecycle; Multi-Language & Multi-Modality Assessment covering both visual design and software implementation; Systematic Robustness Testing that uses code transformations to measure model stability; and Rigorous Evaluation Methodology with multi-prompt strategies to reduce bias and align with real-world developer usage.
Key Finding: Evaluating 25+ state-of-the-art models, we find significant performance variation across tasks and severe robustness issues, including a 15.5% average performance decline under code perturbations.
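To make the multi-prompt idea concrete, the sketch below averages a model's score over several prompt templates and reports how far the templates disagree. It is a minimal illustration and not TREAT's actual implementation: the function name `aggregate_multi_prompt`, the template names, and the scores are all hypothetical and made up for this example.

```python
from statistics import mean, pstdev


def aggregate_multi_prompt(scores_by_prompt):
    """Summarize one model's results across several prompt templates.

    scores_by_prompt maps a (hypothetical) template name to the list of
    per-problem scores obtained with that template, e.g. 1.0 = pass, 0.0 = fail.
    """
    # Mean score for each template taken on its own.
    per_prompt = {name: mean(scores) for name, scores in scores_by_prompt.items()}
    values = list(per_prompt.values())
    return {
        "per_prompt": per_prompt,
        "mean": mean(values),      # the multi-prompt score
        "spread": pstdev(values),  # how much the templates disagree
        "best": max(values),
        "worst": min(values),
    }


# Made-up scores for illustration only; these are not results from the benchmark.
scores = {
    "zero_shot": [1.0, 0.0, 1.0, 1.0],
    "with_role": [1.0, 1.0, 1.0, 0.0],
    "step_by_step": [0.0, 0.0, 1.0, 1.0],
}
summary = aggregate_multi_prompt(scores)
print(summary["mean"], summary["best"], summary["worst"])
```

Reporting the spread next to the mean shows how much a single-prompt evaluation could over- or under-state a model, which is the kind of bias a multi-prompt strategy is meant to reduce.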
Benchmark Construction
General Coding Tasks
Code Generation: Algorithmic problems from GeeksforGeeks and HackerRank
Code Summarization: Function-docstring pairs extracted from GitHub repositories
Code Translation: Python-Java bidirectional translation using PolyHumanEval
Code Reasoning: Input/output prediction with masked function components
Code Review: Real-world code review from GitHub pull requests
Test Generation: Unit test creation following CodaMOSA methodology
Vulnerability Detection: Expert-verified vulnerable functions using PRIMEVUL
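As an illustration of the Code Reasoning task, the sketch below checks a predicted output against the value the function actually returns. It assumes the masked component is the output of a call and that predictions are Python literals; the helper name `check_output_prediction` and the sample function are hypothetical, and the benchmark's real harness may work differently.

```python
import ast


def check_output_prediction(source, func_name, args, predicted):
    """Return True if the predicted output matches the real return value."""
    namespace = {}
    # exec is acceptable only for trusted benchmark code; sandbox anything else.
    exec(source, namespace)
    expected = namespace[func_name](*args)
    try:
        # Parse the model's answer as a Python literal instead of comparing strings.
        return ast.literal_eval(predicted.strip()) == expected
    except (ValueError, SyntaxError):
        # The prediction was not a valid Python literal.
        return False


# Hypothetical benchmark item: the model sees the function and the input,
# and has to predict the masked output.
source = """
def dedupe_sorted(items):
    return sorted(set(items))
"""

correct = check_output_prediction(source, "dedupe_sorted", ([3, 1, 3, 2],), "[1, 2, 3]")
wrong = check_output_prediction(source, "dedupe_sorted", ([3, 1, 3, 2],), "[3, 1, 2]")
print(correct, wrong)
```

Parsing the prediction as a literal keeps the check insensitive to formatting differences such as extra whitespace.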
Multi-Modality Tasks
UI-based Code Generation: Visual design to code implementation
Code Edit & Repair: Visual element-based code modification tasks
Robustness Evaluation
Code Transformation: Program structure modifications while preserving functionality
Misleading Comments: Assessment under intentionally deceptive documentation
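The two perturbation types can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's own transformation set: identifier renaming stands in for "program structure modifications", and the helpers `rename_locals`, `add_misleading_comment`, and `run` are hypothetical names. `ast.unparse` requires Python 3.9 or later.

```python
import ast


def rename_locals(source):
    """Rename arguments and assigned variables to opaque names (v0, v1, ...)."""
    tree = ast.parse(source)
    # Pass 1: collect every name that is bound as an argument or by assignment.
    bound = []
    for node in ast.walk(tree):
        if isinstance(node, ast.arg) and node.arg not in bound:
            bound.append(node.arg)
        elif (isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
              and node.id not in bound):
            bound.append(node.id)
    mapping = {old: f"v{i}" for i, old in enumerate(bound)}
    # Pass 2: rewrite every occurrence; builtins and function names stay as they are.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
    return ast.unparse(tree)


def add_misleading_comment(source, comment):
    """Prepend a comment that deliberately misdescribes the code."""
    return f"# {comment}\n{source}"


def run(source, func_name, *args):
    """Execute trusted source and call one of its functions."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)


original = """
def total_even(numbers):
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += n
    return total
"""

transformed = rename_locals(original)
misleading = add_misleading_comment(transformed, "Returns the product of all odd numbers.")

# Both perturbed versions must behave exactly like the original.
assert run(original, "total_even", [1, 2, 3, 4]) == 6
assert run(misleading, "total_even", [1, 2, 3, 4]) == 6
print(misleading)
```

The equivalence check at the end is the important part: a perturbation is only useful for robustness testing if the program's behavior is unchanged, so any drop in model performance can be attributed to the surface change. This simple renamer does not handle keyword-argument call sites or names that already look like `v0`.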