As pull-based software development has become popular, collecting pull requests (PRs) has become a frequent task in many empirical studies. Although researchers can use publicly available datasets, on-demand collection of PR data is indispensable to compensate for missing information or to obtain the latest information. Unfortunately, PR data collected through the GitHub API is sometimes deficient, with parts of the data lost. This data loss can hamper researchers' data analysis. To reveal which PR-related data tends to be lost during collection with the GitHub API, we conducted a study of 12,118 pull requests in six repositories of OSS projects on GitHub. In the study, we clarified the PR data that needs to be obtained through the GitHub API by defining its entities as features and attributes. We also collected data losses and classified them by checking the lost attributes based on exception reports triggered during PR collection. The collected data losses were classified into seven categories. Our results show that, overall, more than half of the PRs (about 53%) involve data loss, which may surprise many researchers. The paper also discusses the possible causes of data losses, which helps researchers filter out deficient PRs during collection.
Lili Wei, M Ellen Kuenzig, James Im, Yan Zheng, Taylor McLinden, Scott D. Emerson, Azza Eissa, Henry Halder, Richard Shaw, An‐Wen Chan, William G Dixon, Véra Ehrenstein, Astrid Guttmann, Katie Harron, Lars G. Hemkens, Asbjørn Hróbjartsson, Ronan A Lyons, Shannon E. MacDonald, Jerry M Maniate, David Moher, Irene Petersen, Hude Quan, Sigrún Alba Jóhannesdóttir Schmidt, Henrik Toft Sørensen,