Evaluating Small-Scale Code Models for Code Clone Detection

Jorge Martinez-Gil

Abstract

Detecting code clones supports software maintenance and code refactoring. The problem still has unresolved cases, mainly where structural similarity does not reflect functional equivalence, though recent code models show promise. This research therefore systematically measures how well several recently introduced small code models classify code pairs as clones or non-clones. The evaluation covers five datasets (BigCloneBench, CodeJam, Karnalim, POJ104, and PoolC) and six code models (CodeBERT, GraphCodeBERT, Salesforce T5, UniXCoder, PLBART, and Polycoder). Most models performed well on standard metrics, including accuracy, precision, recall, and F1-score. However, a small fraction of clones remains challenging to detect, especially when the code looks similar but performs different operations. The source code that illustrates our approach is available at: https://github.com/jorge-martinez-gil/small-code-models
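
As a reading aid for the metrics named above, the following is a minimal sketch of how accuracy, precision, recall, and F1-score can be computed for binary clone classification. It is not taken from the paper's repository. The names `CloneMetrics` and `clone_metrics` are hypothetical, as is the convention that label 1 means "clone" and label 0 means "non-clone".

```python
from dataclasses import dataclass


@dataclass
class CloneMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def clone_metrics(y_true, y_pred):
    """Compute binary metrics for clone detection.

    Assumption: label 1 means "clone" and label 0 means "non-clone".
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # clones found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-clones rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # look-alikes flagged as clones
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # clones missed

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return CloneMetrics(accuracy, precision, recall, f1)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the paper.
    gold = [1, 1, 1, 0, 0, 0, 1, 0]
    pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(clone_metrics(gold, pred))
```

In this framing, a pair that looks similar but performs different operations and is still predicted as a clone counts as a false positive, which lowers precision, while a missed clone counts as a false negative, which lowers recall.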

Keywords

Computer Science


Paper Details

Authors: Jorge Martinez-Gil
Published: 2025-04-10
Category: Computer Science
Status: Non-peer-reviewed Preprint
Language: English
Word Count: 127
