CC30k: A Citation Contexts Dataset for Reproducibility-Oriented Sentiment Analysis

Rochana R. Obadage, Sarah M. Rajtmajer, Jian Wu

2025-11-14

Summary

This paper introduces a new dataset called CC30k, designed to help computers recognize how researchers express opinions about the reproducibility of the machine learning papers they cite.

What's the problem?

It's hard to automatically tell whether a research paper's results can be reliably recreated by others, which is a key part of good science. Researchers often express opinions about reproducibility when citing papers, but there was no good dataset for training computers to recognize these opinions. Existing sentiment analysis datasets don't focus on this specific issue, and negative opinions about reproducibility are rarely stated explicitly, making it hard to collect enough examples to learn from.

What's the solution?

The researchers created CC30k, a dataset of over 30,000 citation contexts, snippets of text from machine learning papers where one paper cites another. Crowd workers labeled 25,829 of these snippets as expressing a positive, negative, or neutral sentiment about the cited paper's reproducibility. Because genuinely negative examples are scarce, the remaining snippets were negatives generated through a controlled pipeline. Careful data cleansing, crowd selection, and validation resulted in a labeling accuracy of 94%. Finally, the researchers showed that three large language models got much better at identifying reproducibility sentiments after being fine-tuned on this dataset.
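Crowdsourced labeling like this typically aggregates several annotators' judgments into one label per snippet. As an illustration only, here is a minimal sketch of one common aggregation scheme, majority voting over the three CC30k labels; the paper's actual aggregation and validation rules are not described in this summary, and the tie-breaking to "Neutral" is purely an assumption for the sketch:

```python
from collections import Counter

# The three reproducibility-oriented sentiment labels used in CC30k.
LABELS = {"Positive", "Negative", "Neutral"}

def aggregate_label(annotations):
    """Majority-vote a list of per-annotator labels into a single label.

    Ties fall back to "Neutral" -- an illustrative assumption, not
    necessarily the rule used to build CC30k.
    """
    counts = Counter(a for a in annotations if a in LABELS)
    if not counts:
        return "Neutral"
    (top, top_n), *rest = counts.most_common()
    if rest and rest[0][1] == top_n:  # tie between leading labels
        return "Neutral"
    return top

# Toy annotations for one citation context from three hypothetical workers.
votes = ["Positive", "Positive", "Neutral"]
print(aggregate_label(votes))  # -> Positive
```

A scheme like this only resolves disagreement; the reported 94% labeling accuracy additionally reflects the paper's crowd-selection and validation steps.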

Why it matters?

This dataset is important because it provides a resource for building AI tools that can automatically assess the reproducibility of research. This could help identify potentially unreliable research and promote more trustworthy science in the field of machine learning. It allows for large-scale studies of reproducibility, something that was previously difficult to do.

Abstract

Sentiments about the reproducibility of cited papers in downstream literature offer community perspectives and have shown promise as a signal of the actual reproducibility of published findings. To train models that effectively predict reproducibility-oriented sentiments, and to further systematically study their correlation with reproducibility, we introduce the CC30k dataset, comprising a total of 30,734 citation contexts in machine learning papers. Each citation context is labeled with one of three reproducibility-oriented sentiment labels: Positive, Negative, or Neutral, reflecting the cited paper's perceived reproducibility or replicability. Of these, 25,829 are labeled through crowdsourcing, supplemented with negatives generated through a controlled pipeline to counter the scarcity of negative labels. Unlike traditional sentiment analysis datasets, CC30k focuses on reproducibility-oriented sentiments, addressing a research gap in resources for computational reproducibility studies. The dataset was created through a pipeline that includes robust data cleansing, careful crowd selection, and thorough validation, and achieves a labeling accuracy of 94%. We then demonstrate that the performance of three large language models on reproducibility-oriented sentiment classification improves significantly after fine-tuning on our dataset. CC30k lays the foundation for large-scale assessments of the reproducibility of machine learning papers. The dataset and the Jupyter notebooks used to produce and analyze it are publicly available at https://github.com/lamps-lab/CC30k .