Improving the detection of technical debt in Java source code with an enriched dataset

Nam Le Hai, Anh M. T. Bui, Phuong T. Nguyen, Davide Di Ruscio, Rick Kazman

2024-11-11

Summary

This paper discusses a new approach to improve the detection of technical debt in Java code by creating a dataset that combines comments from developers with the actual source code.

What's the problem?

Technical debt refers to the extra work and cost that arise when developers choose a quick fix instead of a more robust, better-designed solution. Developers often flag these quick fixes themselves in code comments; such cases are called Self-Admitted Technical Debts (SATDs). Existing detection methods focus mainly on the text of these comments and ignore the information in the source code they are attached to. As a result, many technical debts may go unnoticed, which makes them harder to manage and fix.

What's the solution?

The authors built a new dataset, called Tesoro, by analyzing comments and their associated source code from 974 Java projects hosted in the Stack corpus. Each technical debt identified through a comment is paired with the code it refers to. In their evaluation, the comments in this dataset improved the performance of existing SATD detection models, and adding the classified source code significantly improved accuracy when predicting different types of technical debt. The dataset is intended to support further research, and the proposed classifiers are offered as baselines for later studies.
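
To make the idea of a comment paired with its source code concrete, here is a minimal sketch. It is not the authors' pipeline or the Tesoro format: the Java snippet, the marker list, and the function names `pair_comments_with_code` and `looks_like_satd` are all assumptions made for illustration, and the keyword check only stands in for the kind of token-based, comment-only detection the paper aims to go beyond.

```python
import re

# Hypothetical Java snippet; the first comment admits a shortcut (an SATD).
JAVA_SOURCE = '''\
public class PriceCalculator {
    // TODO: hack, hard-coded tax rate until the config service is ready
    public double total(double net) {
        return net * 1.2;
    }

    // Returns the net price unchanged.
    public double net(double net) {
        return net;
    }
}
'''

# Illustrative marker list (an assumption, not taken from the paper).
SATD_MARKERS = ("todo", "fixme", "hack", "workaround")


def pair_comments_with_code(source):
    """Pair each single-line // comment with the first non-blank line after it."""
    lines = source.splitlines()
    pairs = []
    for i, line in enumerate(lines):
        match = re.match(r"\s*//\s*(.*)", line)
        if not match:
            continue
        following = next((l.strip() for l in lines[i + 1:] if l.strip()), "")
        pairs.append((match.group(1), following))
    return pairs


def looks_like_satd(comment):
    """Naive keyword heuristic that looks only at the comment text."""
    text = comment.lower()
    return any(marker in text for marker in SATD_MARKERS)


pairs = pair_comments_with_code(JAVA_SOURCE)
flagged = [(comment, code) for comment, code in pairs if looks_like_satd(comment)]
for comment, code in flagged:
    print(f"SATD candidate: {comment!r} -> {code}")
```

The heuristic sees only the comment. A model trained on comment and code pairs could also take into account the hard-coded `1.2` in the method body, which is the kind of signal the enriched dataset is meant to expose.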

Why does it matter?

This research gives developers and researchers a better way to identify and manage technical debt. With more accurate detection, teams can find and address debt-laden code earlier, which leads to cleaner, more maintainable software.

Abstract

Technical debt (TD) is a term used to describe the additional work and costs that emerge when developers have opted for a quick and easy solution to a problem, rather than a more effective and well-designed, but time-consuming approach. Self-Admitted Technical Debts (SATDs) are a specific type of technical debts that developers intentionally document and acknowledge, typically via textual comments. While these self-admitted comments are a useful tool for identifying technical debts, most of the existing approaches focus on capturing crucial tokens associated with various categories of TD, neglecting the rich information embedded within the source code itself. Recent research has focused on detecting SATDs by analyzing comments embedded in source code, and there has been little work dealing with technical debts contained in the source code. To fill such a gap, in this study, through the analysis of comments and their associated source code from 974 Java projects hosted in the Stack corpus, we curated the first ever dataset of TD identified by code comments, coupled with its associated source code. Through an empirical evaluation, we found out that the comments of the resulting dataset help enhance the prediction performance of state-of-the-art SATD detection models. More importantly, including the classified source code significantly improves the accuracy in predicting various types of technical debt. In this respect, our work is two-fold: (i) We believe that our dataset will catalyze future work in the domain, inspiring various research issues related to the recognition of technical debt; (ii) The proposed classifiers may serve as baselines for other studies on the detection of TD by means of the curated dataset.