Towards Filtering Out Deficient Pull Requests Collected Through the GitHub API

As pull-based software development has become popular, collecting pull requests (PRs) has become common in empirical studies. Although researchers can use publicly available datasets, on-demand collection of PR data is indispensable for filling in missing information or obtaining the latest information. Unfortunately, PR data collected through the GitHub API is sometimes deficient: parts of the data are lost. This data loss causes trouble for researchers in their data analysis. To reveal which PR-related data tends to be lost during collection through the GitHub API, we conducted a study of 12,118 pull requests in six repositories of OSS projects on GitHub. In the study, we clarified the PR data that needs to be obtained through the GitHub API by defining its entities as features and attributes. We also collected data losses and classified them by checking the lost attributes against exception reports triggered during PR collection. The collected data losses fell into seven categories. Our results show that more than half of the PRs (about 53%) involve some data loss, which may be surprising for many researchers. The paper also discusses possible causes of the data losses, which helps researchers filter out deficient PRs during collection.
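To make the idea of filtering out deficient PRs concrete, the following is a minimal Python sketch, not the method used in the paper. It assumes each PR has already been fetched and parsed into a nested dictionary, and it treats a PR as deficient when any required attribute is absent or null. The attribute paths in `REQUIRED_PATHS` and all function names are hypothetical choices made for illustration; the paper's own feature and attribute definitions and its seven loss categories are not reproduced here.

```python
from typing import Any

# Hypothetical attribute paths a study might require (an assumption for
# illustration, not the paper's feature/attribute definitions).
REQUIRED_PATHS = ["number", "title", "user.login", "head.repo.full_name"]


def lookup(record: dict[str, Any], path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if any step is missing or null."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or current.get(key) is None:
            return None
        current = current[key]
    return current


def missing_attributes(record: dict[str, Any], required: list[str] = REQUIRED_PATHS) -> list[str]:
    """List the required attribute paths that are absent or null in one PR record."""
    return [path for path in required if lookup(record, path) is None]


def split_by_completeness(records: list[dict[str, Any]]) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Partition PR records into (complete, deficient) according to REQUIRED_PATHS."""
    complete, deficient = [], []
    for record in records:
        (deficient if missing_attributes(record) else complete).append(record)
    return complete, deficient


# Two made-up records: the second has lost its author and its head repository.
sample = [
    {"number": 1, "title": "Fix typo", "user": {"login": "alice"},
     "head": {"repo": {"full_name": "alice/project"}}},
    {"number": 2, "title": "Add feature", "user": None, "head": {"repo": None}},
]
complete, deficient = split_by_completeness(sample)
```

Recording which paths are missing, rather than only a yes/no flag, lets a study report losses per attribute and decide case by case whether a partially lost PR is still usable.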


Publication Info

DOI: 10.1109/apsec65559.2024.00065
Year: 2024
Language: en

Article Details

Link Of The Paper: https://doi.org/10.1109/apsec65559.2024.00065

Timeline

Created: June 19, 2026
