GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding

Zeming Dong, Qiang Hu, Xiaofei Xie, Maxime Cordy, Mike Papadakis, Yves Le Traon, Jianjun Zhao

Abstract

Pre-trained code models lead the era of code intelligence, and many models with impressive performance have been proposed. However, one important problem remains understudied in this field: data augmentation for code, which automatically helps developers prepare training data. In this paper, we introduce GenCode, a generic data augmentation framework that enhances the training of code understanding models. Simply put, GenCode follows a generation-and-selection paradigm to prepare useful training code data. Specifically, it first employs code augmentation techniques to generate new code candidates and then uses influence scores to identify the important ones to serve as training data. To evaluate the effectiveness of GenCode, we conduct experiments on four code understanding tasks (e.g., code clone detection), three pre-trained code models (e.g., CodeT5), and two recently released code-specific Large Language Models (LLMs) (e.g., Qwen2.5-Coder). Compared to the state-of-the-art (SOTA) code augmentation method MixCode, GenCode produces pre-trained code models with 2.92% higher accuracy and 4.90% higher adversarial robustness on average. For code-specific LLMs, GenCode achieves an average improvement of 0.93% in accuracy and 0.98% in natural robustness.
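
The generation-and-selection paradigm described in the abstract can be illustrated with a minimal sketch. The abstract does not specify which augmentation operators are used or how the influence score is computed, so everything named here is an assumption made for illustration: `generate_and_select`, the toy augmenters `rename_variable` and `add_dead_code`, and the placeholder scorer `toy_score` are hypothetical and are not GenCode's actual operators or scoring function.

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases: an augmenter rewrites a code string,
# a scorer assigns it a numeric "influence" value.
Augmenter = Callable[[str], str]
Scorer = Callable[[str], float]


def generate_and_select(
    sample: str,
    augmenters: Sequence[Augmenter],
    score_fn: Scorer,
    keep: int = 1,
) -> List[str]:
    """Generate one candidate per augmenter, then keep the top-`keep` by score."""
    # Generation step: apply every augmentation technique to the sample.
    candidates = [augment(sample) for augment in augmenters]
    # Selection step: rank candidates by score, highest first.
    # sorted() is stable, so tied candidates keep their generation order.
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:keep]


# Toy stand-ins (assumptions, not the paper's operators).
def rename_variable(code: str) -> str:
    # Naive textual rename; a real tool would operate on the syntax tree.
    return code.replace("x", "value")


def add_dead_code(code: str) -> str:
    # Append a branch that never executes.
    return code + "\nif False:\n    pass"


def toy_score(code: str) -> float:
    # Placeholder score: string length. A real influence score would
    # typically depend on the model being trained.
    return float(len(code))


if __name__ == "__main__":
    selected = generate_and_select(
        "x = 1", [rename_variable, add_dead_code], toy_score, keep=1
    )
    print(selected[0])
```

In practice, the selected candidates would replace or extend the original training samples before the next training round; the scorer is the part that would need to be swapped for a model-dependent measure.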

Keywords

Computer Science

Paper Details

Authors Zeming Dong, Qiang Hu, Xiaofei Xie, Maxime Cordy, Mike Papadakis, Yves Le Traon, Jianjun Zhao
Published 2024-02-24
Category Computer Science
Status Non-peer-reviewed Preprint
Language English
Word Count 170
