
Motivating the preservation of research data

Michael Hucka
May 2017

This is an annotated bibliography of research relevant to motivating the need for preserving research data. It began as background research for funding proposals to pursue projects in data preservation. It is admittedly biased towards academic research laboratories, but with other people's contributions over time, it can hopefully be broadened.

Usage notes: Under the main topic headings, there are lists of references. Each reference has a small triangle (▶︎) to the left of it. Clicking the triangle expands the reference to show more information and annotations about it. The reference itself (colored in blue) is a hyperlink to the actual paper.

General overview

Heidorn, P. B. (2008). Shedding light on the dark data in the long tail of science. Library Trends, 57(2), 280-299.
Outstanding article. Lays out the problem faced by libraries to preserve research output when much of scholarly data is never disseminated or published in a conventional form. Coined the term "dark data": "any data that is not easily found by potential users. Dark data may be positive or negative research findings or from either “large” or “small” science. Like dark matter, this dark data on the basis of volume may be more important than that which can be easily seen." Applies the concept of long-tail economics to science data. Discusses in a single place essentially all of the problems faced today by data preservation efforts in academia.

Data sharing and archiving by researchers

Research providing evidence for the poor state of data archiving and availability of data. This is not about data disappearance or link rot, but rather about whether researchers bother to share their data in the first place.

Read, K. B., Sheehan, J. R., Huerta, M. F., Knecht, L. S., Mork, J. G., Humphreys, B. L., & NIH Big Data Annotator Group (2015). Sizing the problem of improving discovery and access to NIH-funded data: A preliminary study. PloS One, 10(7), e0132735.
They studied how many of the data sets generated annually by NIH-funded researchers were not deposited in a known data repository. They did this by analyzing journal articles in PubMed Central that could be identified as NIH-funded. They found "about 12% of the articles explicitly mention deposition of datasets in recognized repositories, leaving 88% that are invisible datasets." (Note that this does not necessarily mean the datasets are completely inaccessible, only that they are not reported as being deposited in a known location. They may be available from somewhere.) They also discuss how discoverability could be improved.

Ross, J. S., Tse, T., Zarin, D. A., Xu, H., Zhou, L., & Krumholz, H. M. (2012). Publication of NIH funded trials registered in ClinicalTrials.gov: Cross sectional analysis. BMJ (Clinical Research Ed.), 344, d7292-d7292.
Examined publication of NIH-funded clinical trials by searching Medline and correlating with ClinicalTrials.gov. Found "fewer than half of trials funded by NIH are published in a peer reviewed biomedical journal indexed by Medline within 30 months of trial completion. Moreover, after a median of 51 months after trial completion, a third of trials remained unpublished."

Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A. U., Wu, L., Read, E., Manoff, M., & Frame, M. (2011). Data sharing by scientists: Practices and perceptions. PloS One, 6(6), e21101.
Surveyed researchers and asked them why they did or did not share their data. Also investigated some of the difficulties reported by people who do try to share their data. This is the best survey of its kind that I have found so far. It provides evidence that scientists have lots of excuses not to deposit their data in data archives.

Fleischer, D. & Jannaschk, K. (2011). A path to filled archives. Nature Geoscience, 4(9), 575-576.
Short commentary on the state of data archiving and what the authors think it will take to improve things. Echoes much of what we already know. The following are some choice quotes: "We argue that most scientists view data deposition in remote archives as a burden, because it is too far removed from their daily routine." […] "The human interaction in the data pathway creates unacceptable bottlenecks: only an automated process can turn around the full quantity of data that are generated and published. The curation system simply will be overwhelmed if all data are to be submitted." […] "… we propose that each scientific institution should support its scientists in the form of local data navigators, in combination with structured data storage."

Wicherts, J. M., Bakker, M., & Molenaar, D. (2011). Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PloS One, 6(11), e26828.
Examined the hypothesis that researchers are reluctant to share data because they fear that reanalysis by other people may expose errors in their work or produce contradictory conclusions. Of 49 authors contacted, 21 shared some data; 13 failed to respond; 3 refused to share; 12 promised to share at a later date but never did, even after 6 years. Wicherts et al. then examined the correlation between the statistical power of the authors' published papers and their willingness to share data. They found authors were less likely to want to share data when the results they published had lower statistical power.

Nelson, B. (2009). Data sharing: Empty archives. Nature, 461(7261), 160-163.
Examines the state of data archives at the time, and the reasons it has been difficult to get researchers to archive their data.

Data disappearance and link rot

Studies of link rot (or "URL corrosion") and general disappearance of data and electronic resources that were ostensibly available at one time.

Berg, J. (2016). Editorial expression of concern. Science, 354(6317), 1242-1242.
"The authors have notified Science of the theft of the computer on which the raw data for the paper were stored. These data were not backed up on any other device nor deposited in an appropriate repository. Science is publishing this Editorial Expression of Concern to alert our readers to the fact that no further data can be made available, beyond those already presented in the paper and its supplement, to enable readers to understand, assess, reproduce, or extend the conclusions of the paper."

Vines, T. H., Albert, A. Y., Andrew, R. L., Débarre, F., Bock, D. G., Franklin, M. T., Gilbert, K. J., Moore, J. -S., Renaut, S., & Rennison, D. J. (2014). The availability of research data declines rapidly with article age. Current Biology, 24(1), 94-97.
Examined how quickly data becomes unavailable after publication of a research paper. In order to avoid confounding factors due to different practices in different fields, they focused on a specific domain (articles containing morphological data from plants or animals using a particular kind of analysis). They found that "The major cause of the reduced data availability for older papers was the rapid increase in the proportion of data sets reported as either lost or on inaccessible storage media. For papers where authors reported the status of their data, the odds of the data being extant decreased by 17% per year."
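To get a feel for what a 17% yearly decline in odds implies, the following is a minimal sketch, not a calculation from the paper. It assumes the decline compounds at a constant rate every year, and the starting probability of 0.9 is a purely hypothetical value chosen for illustration. The function names are likewise invented here.

```python
def probability_to_odds(p: float) -> float:
    """Convert a probability (0 <= p < 1) into odds."""
    return p / (1.0 - p)


def odds_to_probability(odds: float) -> float:
    """Convert odds back into a probability."""
    return odds / (1.0 + odds)


def extant_probability(initial_probability: float,
                       years: int,
                       yearly_odds_decline: float = 0.17) -> float:
    """Probability that data are still extant after `years`.

    Assumption: the odds shrink by the same fraction every year,
    so they are multiplied by (1 - yearly_odds_decline) each year.
    """
    initial_odds = probability_to_odds(initial_probability)
    odds = initial_odds * (1.0 - yearly_odds_decline) ** years
    return odds_to_probability(odds)


if __name__ == "__main__":
    # 0.9 is a hypothetical starting point, not a figure from the study.
    for years in (0, 5, 10, 20):
        print(years, round(extant_probability(0.9, years), 3))
```

Note that the decline applies to the odds, not to the probability itself, so the probability does not simply drop by 17 percentage points per year.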

Prithviraj, K. R. & Kumar, B. T. S. (2014). Corrosion of URLs: Implications for electronic publishing. IFLA Journal, 40(1), 35-47.
Another study of link rot (which they call "URL corrosion"). They found very high rates. This paper is also interesting because it cites and summarizes a large number of other studies on the same topic.

Thorp, A. W. & Schriger, D. L. (2011). Citations to web pages in scientific articles: The permanence of archived references. Annals of Emergency Medicine, 57(2), 165-168.
Another study of how published URLs become inaccessible over time. Studied how the web pages at URLs changed after authors referenced them in papers. Method: "We scanned the “Articles in Press” section in Annals of Emergency Medicine from March 2009 through June 2010 for Internet references in research articles. If an Internet reference produced the authors’ expected content, the Web page was archived with WebCite (http://www.webcitation.org). Because the archived Web page does not change, we compared it with the original URL to determine whether the original Web page had changed. We attempted to access each original URL and archived Web site URL at 3-month intervals from the time of online publication during an 18-month study period." They found that 35% of the original URLs were lost, but none of the ones in WebCite were lost.

Wren, J. D. (2008). URL decay in MEDLINE—a 4-year follow-up study. Bioinformatics, 24(11), 1381-1385.
Follow-up to Wren's 2004 study (below). Found no significant change in the rate of decay of URLs. However, found that URLs that were cited more than twice were much less likely to disappear than those cited only once or twice. (I.e., more popular resources were more likely to resist decay.) An important caveat here is that this is about published URLs; the underlying data or resource may still be available elsewhere.

Aronsky, D., Madani, S., Carnevale, R. J., Duda, S., & Feyder, M. T. (2007). The prevalence and inaccessibility of internet references in the biomedical literature at the time of publication. Journal of the American Medical Informatics Association, 14(2), 232-4.
A 2007 study of the citations in 4,700 papers from 844 different journals. They examined which of the citations were to Internet resources. Found that 12% of those Internet resources were already inaccessible within 2 days of an article's release. This paper is also useful for the citations it includes to other similar studies.

Evangelou, E., Trikalinos, T. A., & Ioannidis, J. P. (2005). Unavailability of online supplementary scientific information from articles published in major journals. The FASEB Journal, 19(14), 1943-1944.
A 2005 study of link rot for top journals: Science, Nature, Cell, New England Journal of Medicine, Lancet, and PNAS. Found that nearly 5% of online supplementary information links were bad after 2 years, and nearly 10% were bad after 5 years.

Wren, J. D. (2004). 404 not found: The stability and persistence of URLs published in MEDLINE. Bioinformatics, 20(5), 668-672.
Another study of how URLs published in research journals become inaccessible over time. For the time span evaluated (1994–2002), he found a 19% cumulative loss per year, meaning 19% of URLs become inaccessible per year, every year. Obviously not relevant today because of its age, but useful as a marker when comparing to recent studies of data loss.

Hester, E. J., Heilig, L. F., Drake, A. L., Johnson, K. R., Vu, C. T., Schilling, L. M., & Dellavalle, R. P. (2004). Internet citations in oncology journals: A vanishing resource? Journal of the National Cancer Institute, 96(12), 969-971.
Short 2004 article about how many URLs referenced in cancer research journals became inaccessible over time. "9.5%, 10%, and 33% of Internet addresses were inactive 5, 17, and 29 months after publication, respectively."

Dellavalle, R. P., Hester, E. J., Heilig, L. F., Drake, A. L., Kuntzman, J. W., Graber, M., & Schilling, L. M. (2003). Going, going, gone: Lost internet references. Science, 302(5646), 787-788.
A short 2003 paper reviewing the (then) evidence of URL rot and data loss from web sources found in JAMA, NEJM and Science. Obviously not relevant today because of its age, but useful as a marker when comparing to recent studies of data loss. Also has a great quote that can be used to motivate regular website archiving: "Readers, however, cannot be assured that the information captured by Internet Archive, or Google, or even an active URL is unchanged compared with the information originally captured by the authors."

Policies and consequences

Mills, J. A., Teplitsky, C., Arroyo, B., Charmantier, A., Becker, P. H., Birkhead, T. R., Bize, P., Blumstein, D. T., Bonenfant, C., Boutin, S., Bushuev, A., Cam, E., Cockburn, A., Côté, S. D., Coulson, J. C., Daunt, F., Dingemanse, N. J., Doligez, B., Drummond, H., Espie, R. H., Festa-Bianchet, M., Frentiu, F., Fitzpatrick, J. W., Furness, R. W., Garant, D., Gauthier, G., Grant, P. R., Griesser, M., Gustafsson, L., Hansson, B., Harris, M. P., Jiguet, F., Kjellander, P., Korpimäki, E., Krebs, C. J., Lens, L., Linnell, J. D., Low, M., McAdam, A., Margalida, A., Merilä, J., Møller, A. P., Nakagawa, S., Nilsson, J. Å., Nisbet, I. C., van Noordwijk, A. J., Oro, D., Pärt, T., Pelletier, F., Potti, J., Pujol, B., Réale, D., Rockwell, R. F., Ropert-Coudert, Y., Roulin, A., Sedinger, J. S., Swenson, J. E., Thébaud, C., Visser, M. E., Wanless, S., Westneat, D. F., Wilson, A. J., & Zedrosser, A. (2015). Archiving primary data: Solutions for long-term studies. Trends in Ecology & Evolution, 30(10), 581-589.
Surveyed researchers in ecology and evolution about public data archiving in repositories such as Dryad. Found considerable resistance. The reasons given by researchers concern what data would be archived and to whom access would be given. The paper lists the objections people raised and proposes some possible solutions. One of the suggestions is this: "Data could be archived on institutional servers, and the institution and its staff could control access and determine if collaboration is appropriate. [...]. Such institutional databases also allow the preservation of data and their accessibility after the PI retires." Another idea: "implement data-tracking, allowing data collectors to obtain information on who is using the data and why. For example, any request for data to the Climate Change, Agriculture, and Food Security Data Portal triggers an email to be sent to the PI who deposited the data. Journals should have a rule that no paper is considered where the data users have not corresponded with the data owners and included appropriate acknowledgement of the source of the data within the paper."

Vines, T. H., Andrew, R. L., Bock, D. G., Franklin, M. T., Gilbert, K. J., Kane, N. C., Moore, J. -S., Moyers, B. T., Renaut, S., & Rennison, D. J. (2013). Mandated data archiving greatly improves access to research data. The FASEB Journal, 27(4), 1304-1308.
Examined the impact of journal policies on the availability of data. Found that when journals required archiving of data, it improved the odds of finding the data online by a factor of 1000. However, when it was not a requirement, less than 23% of the data was available online. The journals studied focused on evolutionary biology and included BMC Evolutionary Biology and PLoS One. Conclusion: mandatory data deposition rules by journals had the greatest impact on data availability.