Are Large Language Models Memorizing Bug Benchmarks?

Ramos, Daniel; Mamede, Claudia; Jain, Kush; Canelas, Paulo; Gamboa, Catarina; Le Goues, Claire

arXiv:2411.13323 (cs)

[Submitted on 20 Nov 2024]

Abstract: Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, limited research has been conducted to quantify the impact of potential leakage.
In this paper, we systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks. To identify potential leakage, we use multiple metrics, including a study of benchmark membership within commonly used training datasets, as well as analyses of negative log-likelihood and n-gram accuracy. Our findings show that certain models, in particular codegen-multi, exhibit significant evidence of memorization in widely used benchmarks like Defects4J, while newer models trained on larger datasets, such as LLaMa 3.1, exhibit limited signs of leakage. These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess model capabilities.
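The abstract names negative log-likelihood and n-gram accuracy as leakage signals but does not define them. The sketch below illustrates one common reading of each, and it is an assumption rather than the authors' implementation: mean negative log-likelihood is taken over per-token probabilities, and n-gram accuracy is the fraction of start positions at which a model reproduces the next n tokens exactly. The names `mean_nll`, `ngram_accuracy`, `predict`, and `make_memorizer` are hypothetical, and the toy "memorizer" stands in for a real model.

```python
import math
from typing import Callable, List, Optional, Sequence

def mean_nll(token_probs: Sequence[float]) -> float:
    """Average negative log-likelihood from per-token probabilities.

    Lower values mean the model finds the text less surprising, which
    may hint at memorization (assumed interpretation).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def ngram_accuracy(
    tokens: Sequence[str],
    predict: Callable[[Sequence[str], int], Sequence[str]],
    n: int = 5,
    starts: Optional[Sequence[int]] = None,
) -> float:
    """Fraction of start positions where `predict` reproduces the next n tokens.

    `predict(prefix, n)` is a hypothetical interface: given a token prefix,
    it returns the model's guess for the following n tokens.
    """
    if starts is None:
        starts = range(1, len(tokens) - n + 1)
    valid = [s for s in starts if 0 < s <= len(tokens) - n]
    if not valid:
        raise ValueError("sequence too short for the requested n")
    hits = sum(
        1 for s in valid
        if list(predict(tokens[:s], n)) == list(tokens[s:s + n])
    )
    return hits / len(valid)

def make_memorizer(memorized: Sequence[str]) -> Callable[[Sequence[str], int], List[str]]:
    """Toy stand-in for a model that has memorized exactly one sequence."""
    def predict(prefix: Sequence[str], n: int) -> List[str]:
        k = len(prefix)
        if list(memorized[:k]) == list(prefix):
            return list(memorized[k:k + n])
        return ["<unk>"] * n  # prefix not seen: no useful guess
    return predict

# A snippet the toy model has "seen" and one it has not.
seen = "public int add ( int a , int b ) { return a + b ; }".split()
unseen = "int x = 1 ;".split()
model = make_memorizer(seen)

seen_accuracy = ngram_accuracy(seen, model, n=5)
unseen_accuracy = ngram_accuracy(unseen, model, n=2)
```

Under these assumptions, a large gap between the two accuracies, or an unusually low mean negative log-likelihood on benchmark code, would be the kind of signal the abstract describes as evidence of memorization.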

Comments:	pre-print
Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2411.13323 [cs.SE]
	(or arXiv:2411.13323v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2411.13323

Submission history

From: Daniel Ramos
[v1] Wed, 20 Nov 2024 13:46:04 UTC (1,543 KB)
