Expertini Research
Computer Science PDF Available Non-peer-reviewed Preprint

GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding

Abstract

Pre-trained code models lead the era of code intelligence, and many models with impressive performance have been proposed. However, one important problem remains understudied in this field: data augmentation for code, which automatically helps developers prepare training data. In this paper, we introduce GenCode, a generic data augmentation framework that enhances the training of code understanding models. In short, GenCode follows a generation-and-selection paradigm to prepare useful training code data. Specifically, it first employs code augmentation techniques to generate new code candidates and then uses influence scores to identify the important ones as training data. To evaluate the effectiveness of GenCode, we conduct experiments on four code understanding tasks (e.g., code clone detection), three pre-trained code models (e.g., CodeT5), and two recently released code-specific Large Language Models (LLMs) (e.g., Qwen2.5-Coder). Compared to the state-of-the-art (SOTA) code augmentation method MixCode, GenCode produces pre-trained code models with 2.92% higher accuracy and 4.90% higher adversarial robustness on average. For code-specific LLMs, GenCode achieves an average improvement of 0.93% in accuracy and 0.98% in natural robustness.
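
The generation-and-selection paradigm described above can be illustrated with a minimal Python sketch. It is not the GenCode implementation. The function `generate_and_select`, the two toy augmenters, and the scoring function are hypothetical names introduced here for illustration. The abstract does not define how influence scores are computed, so the sketch takes the score as a pluggable callable and uses code length as a placeholder.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, int]  # (source code, task label)


def generate_and_select(
    samples: Sequence[Sample],
    augmenters: Sequence[Callable[[str], str]],
    score_fn: Callable[[Sample], float],
    keep_per_sample: int = 1,
) -> List[Sample]:
    """Generate candidates for each sample, then keep the top-scoring ones."""
    selected: List[Sample] = []
    for code, label in samples:
        # Generation step: each augmenter proposes one variant that keeps the label.
        candidates = [(augment(code), label) for augment in augmenters]
        # Selection step: rank candidates by score, highest first.
        ranked = sorted(candidates, key=score_fn, reverse=True)
        selected.extend(ranked[:keep_per_sample])
    return selected


# Hypothetical augmenters, for illustration only.
def add_dead_code(code: str) -> str:
    return code + "\nif False:\n    pass"


def add_comment(code: str) -> str:
    return "# augmented\n" + code


# Placeholder score (an assumption): a real influence score would be
# derived from the model being trained, not from the code length.
def toy_score(sample: Sample) -> float:
    return float(len(sample[0]))


train = [("def f(x):\n    return x + 1", 0)]
augmented = generate_and_select(train, [add_dead_code, add_comment], toy_score)
```

Passing the score as a callable keeps the selection criterion separate from the generation step, so a model-based score can replace the placeholder without changing the loop.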

