Code TREAT

Abstract

Large foundation models are transforming software engineering, yet significant gaps remain in how they are evaluated. Our framework, TREAT, addresses these gaps with four key improvements: Multi-Task Holistic Evaluation, Multi-Language and Multi-Modality Assessment, Robustness Assessment, and Rigorous Evaluation Methodology.

Key Insights: Evaluating more than 25 state-of-the-art models, we find substantial performance variation across programming tasks, specific limitations in multi-modal code generation, and severe robustness issues. We also show that multi-prompt evaluation can mitigate prompt bias and yield more reliable results.

Introduction

TREAT is the first holistic evaluation framework for Large Language Models on code intelligence tasks.

Our framework features Multi-Task Holistic Evaluation spanning the entire software development lifecycle, Multi-Language & Multi-Modality assessment incorporating visual design and software implementation, Systematic Robustness Testing through code transformations to ensure model stability, and Rigorous Evaluation Methodology with multi-prompt strategies to reduce bias and align with real-world developer usage.
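The multi-prompt strategy can be pictured as scoring the same problem under several prompt phrasings and reporting the aggregate instead of a single-prompt result. The sketch below is illustrative only: the templates, the `multi_prompt_score` helper, and the `score_fn` callback (which stands in for "query the model and grade its answer") are assumptions, not the prompts or scoring code used by TREAT.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List

# Hypothetical prompt templates; the actual TREAT prompts are not shown here.
PROMPT_TEMPLATES: List[str] = [
    "Write a Python function that solves the following problem:\n{problem}",
    "You are an expert programmer. Implement a solution to:\n{problem}",
    "Problem:\n{problem}\n\nProvide working Python code.",
]


def multi_prompt_score(
    problem: str,
    score_fn: Callable[[str], float],
    templates: List[str] = PROMPT_TEMPLATES,
) -> Dict[str, float]:
    """Score one problem under every template and summarize the spread.

    score_fn is an assumed callback that sends the prompt to a model and
    returns a score in [0, 1] (for example, the fraction of tests passed).
    """
    scores = [score_fn(template.format(problem=problem)) for template in templates]
    return {
        "mean": mean(scores),    # aggregate reported for the problem
        "std": pstdev(scores),   # sensitivity to prompt wording
        "min": min(scores),
        "max": max(scores),
    }
```

A large `std` or a wide gap between `min` and `max` would suggest that a single-prompt score for that model is sensitive to wording, which is the bias the multi-prompt methodology is meant to reduce.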

Key Finding: Evaluating more than 25 state-of-the-art models, we find significant performance variation across tasks and severe robustness issues, with an average performance decline of 15.5% under code perturbations.
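One plausible way to compute an average decline of this kind is to take each model's relative drop between its score on the original inputs and its score on the perturbed inputs, then average across models. Whether the reported figure is a relative or an absolute drop is not stated here, so the relative form below is an assumption, and the function name and the scores in the usage note are made up for illustration.

```python
from typing import Dict


def average_relative_decline(
    original: Dict[str, float], perturbed: Dict[str, float]
) -> float:
    """Mean relative drop (in percent) across models.

    Assumption: decline is measured relative to the original score.
    Models with a non-positive original score are skipped to avoid
    division by zero.
    """
    declines = []
    for model, base in original.items():
        if base <= 0:
            continue
        declines.append((base - perturbed[model]) / base * 100.0)
    if not declines:
        raise ValueError("no models with a positive original score")
    return sum(declines) / len(declines)
```

With illustrative scores of 80 and 50 on the original inputs and 60 and 45 on the perturbed inputs, the per-model declines are 25% and 10%, so the function returns 17.5.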

Benchmark Construction

General Coding Tasks

Code Generation

Algorithmic problems from GeeksforGeeks and HackerRank

Code Summarization

Function-docstring pairs extracted from GitHub repositories

Code Translation

Python-Java bidirectional translation using PolyHumanEval

Code Reasoning

Input/output prediction with masked function components
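A minimal sketch of what such a task might look like: a function is shown with either its input or its output hidden, and the model's prediction is checked by execution. The `ReasoningTask` record, the `rotate_left` example, and both checker functions are hypothetical and only illustrate the idea; they are not the benchmark's data format.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class ReasoningTask:
    func: Callable[..., Any]   # code shown to the model
    args: Tuple[Any, ...]      # masked in input prediction
    expected_output: Any       # masked in output prediction


def rotate_left(items, k):
    """Rotate a list left by k positions."""
    k %= len(items)
    return items[k:] + items[:k]


TASK = ReasoningTask(
    func=rotate_left,
    args=([1, 2, 3, 4], 1),
    expected_output=[2, 3, 4, 1],
)


def check_output_prediction(task: ReasoningTask, predicted_output: Any) -> bool:
    """Output prediction: the model sees func and args, predicts the result."""
    return predicted_output == task.expected_output


def check_input_prediction(task: ReasoningTask, predicted_args: Tuple[Any, ...]) -> bool:
    """Input prediction: the model sees func and the output, predicts args.

    Any input that reproduces the output is accepted, since several
    inputs can map to the same result.
    """
    return task.func(*predicted_args) == task.expected_output
```

Checking input predictions by execution rather than by string match is a design choice assumed here; it accepts `([4, 1, 2, 3], 2)` as a valid answer even though it differs from the stored arguments.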

Code Review

Real-world code review from GitHub pull requests

Test Generation

Unit test creation following CodaMOSA methodology

Vulnerability Detection

Expert-verified vulnerable functions using PRIMEVUL

Multi-Modality Tasks

UI-based Code Generation

Visual design to code implementation

Code Edit & Repair

Visual element-based code modification tasks

Robustness Evaluation

Code Transformation

Program structure modifications while preserving functionality
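As an illustration of a functionality-preserving transformation, the sketch below renames identifiers with Python's standard `ast` module and confirms that the original and transformed programs return the same result. Identifier renaming is just one simple example chosen here; the specific transformations applied in the benchmark are not listed in this section, and `RenameIdentifiers`, `rename_variables`, and `run` are hypothetical helper names.

```python
import ast
from typing import Dict


class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and parameters according to a fixed mapping."""

    def __init__(self, mapping: Dict[str, str]):
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def rename_variables(source: str, mapping: Dict[str, str]) -> str:
    """Return source with identifiers renamed (requires Python 3.9+ for ast.unparse)."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


ORIGINAL = '''
def total(values):
    acc = 0
    for v in values:
        acc += v
    return acc
'''

transformed = rename_variables(ORIGINAL, {"values": "x0", "acc": "x1", "v": "x2"})


def run(source: str, arg):
    """Execute the source and call its total() function on arg."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](arg)
```

Running `run(ORIGINAL, [1, 2, 3])` and `run(transformed, [1, 2, 3])` should give the same value, which is the property a robustness test relies on: any change in a model's answer is then attributable to surface form rather than to program behavior.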

Misleading Comments

Assessment under intentionally deceptive documentation
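A perturbation of this kind can be sketched as inserting a comment that contradicts the code while leaving behavior untouched. The helper below is a hypothetical illustration, not the benchmark's generator, and it assumes the function signature fits on a single `def` line.

```python
def add_misleading_comment(source: str, comment: str) -> str:
    """Insert a deceptive comment as the first line of the first function body.

    Assumes a single-line `def` signature; behavior is unchanged because
    only a comment is added.
    """
    lines = source.splitlines()
    for i, line in enumerate(lines):
        stripped = line.lstrip()
        if stripped.startswith("def "):
            # Indent the comment one level deeper than the def line.
            indent = " " * (len(line) - len(stripped) + 4)
            lines.insert(i + 1, f"{indent}# {comment}")
            break
    return "\n".join(lines)


ORIGINAL_SNIPPET = "def is_even(n):\n    return n % 2 == 0"
PERTURBED_SNIPPET = add_misleading_comment(
    ORIGINAL_SNIPPET, "Returns True when n is odd."
)
```

A model that follows the comment instead of the code would describe or predict `is_even` incorrectly on `PERTURBED_SNIPPET`, even though both snippets execute identically.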