Published (Versions 1), Vol 2 (4): 554–568, 2020
The open data challenge: An analysis of 124,000 data availability statements and an ironic lesson about data management plans
: 2019 - 11 - 26
: 2020 - 07 - 13
: 2020 - 07 - 22
Abstract & Keywords
Abstract: Data availability statements can provide useful information about how researchers actually share research data. We used unsupervised machine learning to analyze 124,000 data availability statements submitted by research authors to 176 Wiley journals between 2013 and 2019. We categorized the data availability statements, and looked at trends over time. We found expected increases in the number of data availability statements submitted over time, and marked increases that correlate with policy changes made by journals. Our open data challenge becomes to use what we have learned to present researchers with relevant and easy options that help them to share and make an impact with new research data.
Keywords: Data availability statement (DAS); FAIR data; Machine learning; Trends; Journal; Policy
Acknowledgements
Thanks to Elisha Morris at Wiley for the literature search and analysis we used to write our introduction. Thanks to Yan Wu at Wiley for insights into data sharing requirements in China. Thanks to Gary Spencer at Wiley for useful discussions about author behavior and manuscript submission processes. Thanks to Alex Moscrop at Wiley for providing our data. Written collaboratively and preprinted using Authorea; thanks to Alberto Pepe and the Authorea team.
Article and author information
Cite As
C. Graf, D. Flanagan, L. Wylie & D. Silver. The open data challenge: An analysis of 124,000 data availability statements and an ironic lesson about data management plans. Data Intelligence 2(2020), 554–568. doi: 10.1162/dint_a_00061
Chris Graf
All authors made substantial contributions to the design of this paper and participated in the investigation and data collection. All authors approved the version to be published and are accountable for the paper. C. Graf and D. Flanagan wrote the first draft.
cgraf@wiley.com
Chris Graf is Director, Research Integrity, in Wiley’s Open Research team. His responsibilities include the implementation of policies and tools at Wiley that enable researchers and journals to adopt more open, transparent practices. Chris is past co-chair of the Committee on Publication Ethics (COPE), and a program committee member for the 7th World Conference on Research Integrity.
0000-0002-4699-4333
David Flanagan
All authors made substantial contributions to the design of this paper and participated in the investigation and data collection. All authors approved the version to be published and are accountable for the paper. D. Flanagan and L. Wylie contributed to the methodology part of this paper.
David Flanagan is Director of Data Science in Wiley’s Research division, where his group develops applications of data science and machine learning to address scholarly publishing questions. Previously he was Editor-in-Chief of Advanced Functional Materials and general manager of ChemPlanner, Wiley’s award-winning organic synthesis prediction tool. He received his PhD in Polymer Science and Engineering from the University of Massachusetts Amherst.
0000-0002-7364-4961
Lisa Wylie
All authors made substantial contributions to the design of this paper and participated in the investigation and data collection. All authors approved the version to be published and are accountable for the paper. D. Flanagan and L. Wylie contributed to the methodology part of this paper.
Lisa Wylie is Senior Data Product Manager at Wiley. With a background in chemistry and geology, her role incorporates 15 years of editorial experience with project management and data science skills. She is particularly interested in the application of topic modelling and natural language processing for visualizing and predicting trends in published research.
0000-0002-0148-6087
Deirdre Silver
All authors made substantial contributions to the design of this paper and participated in the investigation and data collection. All authors approved the version to be published and are accountable for the paper.
Deirdre Silver is Executive Vice President (EVP) and General Counsel at Wiley. Deirdre joined Wiley in 2002 as Legal Director of its Higher Education business and subsequently counseled Wiley’s Professional/Trade, Talent Solutions and Research businesses prior to being appointed EVP and General Counsel. She is responsible for all aspects of legal support. She is a board member of the Copyright Clearance Center and a member of the scientific, technical and medical (STM) association’s Copyright and Legal Affairs Committee. Deirdre is a graduate of Cornell University (B.A., Government, with Distinction in All Subjects) and New York University Law School (J.D.) and a member of the New York State Bar Association.
0000-0002-8648-8857
Publication records
Published: Dec. 17, 2020 (Versions 1)