No raw data, no science: another possible source of the reproducibility crisis
Abstract
Introduction
Raw data rarely comes out
- 1. Attach raw data (full images of entire western blot membranes with size markers, images of staining, quantified numerical data for each sample used in the statistical analyses, etc.) as supplementary materials.
- 2. Report absolute p-values (e.g., p = 0.032) in the results, instead of expressions like p < 0.05.
- 3. Conduct corrections for multiple tests, where necessary (see the sketch after this list).
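To make recommendations 2 and 3 concrete, here is a minimal sketch in Python using the scipy and statsmodels libraries; the outcome names, group sizes, and effect sizes are all hypothetical, invented purely for illustration.

```python
# Minimal sketch of recommendations 2 and 3: report exact p-values
# and correct them for multiple comparisons (hypothetical data).
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)

# Hypothetical per-sample measurements (control vs. treated) for three
# outcomes -- exactly the kind of raw numbers worth depositing.
outcomes = {
    "marker_A": (rng.normal(1.0, 0.2, 8), rng.normal(1.3, 0.2, 8)),
    "marker_B": (rng.normal(1.0, 0.2, 8), rng.normal(1.1, 0.2, 8)),
    "marker_C": (rng.normal(1.0, 0.2, 8), rng.normal(1.0, 0.2, 8)),
}

names = list(outcomes)
raw_p = [stats.ttest_ind(ctrl, trt).pvalue
         for ctrl, trt in outcomes.values()]

# Holm correction across the three tests (recommendation 3).
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

for name, p, p_c, rej in zip(names, raw_p, p_adj, reject):
    # Report absolute values, e.g. "p = 0.0123", not just "p < 0.05".
    print(f"{name}: p = {p:.4f}, Holm-adjusted p = {p_c:.4f}, "
          f"reject H0: {rej}")
```

Reported this way, together with the underlying per-sample values, readers can verify both the exact p-values and the correction procedure rather than taking a bare "p < 0.05" on trust.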
Absence of raw data means the absence of science
The necessity of sharing raw data
Begley and Ioannidis recommend that institutions make it a requirement that raw data be made available on request [17]. These recommendations are also based on the assumption that researchers are honest, at least to the extent that authors will present raw data upon request. However, I imagine that, faced with such a request, some authors might say, "Oops, the hard disk broke!" or something similar. Nor do I think it practical to suppose that every co-author sees and reviews all the raw data in a huge, interdisciplinary paper published in a high-impact journal. I believe it is now time to design a system based on the realistic assumption that not everyone is "honest," replacing the "trust-me" system built on the traditional, idealistic assumption that everyone is good. The idea of open science/open data is needed in such a design, and I propose that it should become commonly accepted that sharing raw data publicly is a necessary condition for a study to be considered scientifically sound, unless the authors have acceptable reasons not to do so (e.g., the data contain confidential personal information).

In the past age of print publishing, it was technically impossible to publish all raw data because of space limitations. This limitation has been virtually eliminated, thanks to the evolution of data storage devices and the internet. Indeed, in 2014, the National Institutes of Health mandated that researchers share large-scale human and non-human genomic data, such as data from genome-wide association studies (GWAS), single nucleotide polymorphism (SNP) arrays, and genome sequencing, as well as transcriptomic, epigenomic, and gene expression data (https://osp.od.nih.gov/scientific-sharing/genomic-data-sharing/). In 2019, the National Institute of Mental Health (NIMH) issued a data sharing policy that requires NIMH-funded researchers to deposit all raw and analyzed data (including, but not limited to, clinical, genomic, imaging, and phenotypic data) from experiments involving human subjects into its informatics infrastructure, so that data collected from and about human subjects can be responsibly shared and used by the entire research community (https://grants.nih.gov/grants/guide/notice-files/NOT-MH-19-033.html). In 2018, it was reported that China had mandated its researchers to share all scientific data in open national repositories (https://www.editage.com/insights/china-mandates-its-researchers-to-share-all-scientific-data-in-open-national-repositories/1523453996). Other countries may well want to follow such a move.

I propose that all journals should, in principle, do their best to have authors and institutions make their raw data open in a public database or on a journal website upon publication of the paper, in order to increase the reproducibility of published results and to strengthen public trust in science. Currently, the data sharing policy of Molecular Brain only "encourages" that all datasets on which the conclusions of the manuscript rely be deposited in publicly available repositories (where available and appropriate) or presented in the main paper or additional supporting files, in machine-readable format (such as spreadsheets rather than PDFs) whenever possible. Building on this existing policy, we will, in principle, require deposition of the datasets on which the conclusions of the manuscript rely from 1 March 2020.
Such datasets include quantified numerical values used for statistical analyses and graphs, images of tissue staining, and uncropped images of all blot and gel results. The deposition does not have to be completed at the time of manuscript submission, but manuscripts will be accepted on the condition that such data are deposited before publication. We can allow exceptions when the authors cannot make their data public for ethical or legal reasons (e.g., the data consist of confidential personal information or proprietary data from a third party). In such cases, the rationale for not doing so should be clearly described in the data availability section of the manuscript and approved by the handling and chief editors.

There are practical issues that need to be solved to share raw data. It is true that big data, such as various kinds of omics data and footage of animal behaviors, are hard to handle and deposit in a public database or repository, and doing so can be costly. Different researchers in different institutions may not have equal access to repositories of the same quality, or the skills to properly share their data. In addition, the definition of "raw data" can itself be an issue. For example, in mouse behavior, we run a database to share "raw data" of mouse behaviors, but the database contains only quantified numerical text data. Ideally, all the footage taken for behavior analysis should be shared, and we would like to do so once we obtain sufficient funding and infrastructure for such a database. The meaning of "raw data" should be discussed by the experts in each field of science, and some consensus should be reached so that the data can be shared in a systematic manner whereby re-analysis and data mining can be conducted easily. Storing and sharing data derived from human subjects, which may contain confidential personal information, is another challenge that needs to be overcome. For these technical issues, institutions, funding agencies, and publishers should cooperate to support such a move by establishing data storage infrastructure that enables raw data to be secured and shared, based on the understanding that "no raw data, no science."

As part of the submission process, journals could require authors to confirm that the raw data are available for inspection (or to stipulate why data are not available). Likewise, co-authors could be asked to confirm that they have seen the raw data and reviewed the submitted version of the paper.
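As an illustration of what depositing quantified numerical values in a machine-readable format might look like, the following minimal sketch writes hypothetical per-animal behavior measurements to a CSV file; the file name, column names, and values are all invented. Unlike a table embedded in a PDF, such a file can be re-loaded directly for re-analysis or data mining.

```python
# Minimal sketch: depositing quantified per-sample values in a
# machine-readable format (CSV rather than PDF). All names and
# numbers below are hypothetical.
import csv

# Per-animal values underlying a hypothetical figure and its t-test.
records = [
    {"animal_id": "m01", "group": "control", "freezing_percent": 32.5},
    {"animal_id": "m02", "group": "control", "freezing_percent": 28.1},
    {"animal_id": "m03", "group": "mutant",  "freezing_percent": 55.0},
    {"animal_id": "m04", "group": "mutant",  "freezing_percent": 47.9},
]

with open("figure1_source_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["animal_id", "group", "freezing_percent"])
    writer.writeheader()
    writer.writerows(records)

# Anyone can now load figure1_source_data.csv and re-run the
# statistics independently -- which is the point of sharing raw data.
```

One row per sample, with explicit identifiers and units in the column names, is what makes systematic re-analysis across studies feasible.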
References
- 1. Prinz F, Schlange T, Asadullah K. Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov. 2011;10(9):712.
- 2. Begley CG, Ellis LM. Drug development: raise standards for preclinical cancer research. Nature. 2012;483:531–3.
- 3. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716.
- 4. Kerr NL. HARKing: hypothesizing after the results are known. Personal Soc Psychol Rev. 1998;2(3):196–217.
- 5. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of P-hacking in science. PLoS Biol. 2015;13(3):e1002106.
- 6. Dwan K, Altman DG, Arnaiz JA, Bloom J, Chan A-W, Cronin E, et al. Systematic review of the empirical evidence of study publication bias and outcome reporting bias. PLoS One. 2008;3(8):e3081.
- 7. Tsilidis KK, Panagiotou OA, Sena ES, Aretouli E, Evangelou E, Howells DW, et al. Evaluation of excess significance bias in animal studies of neurological diseases. PLoS Biol. 2013;11(7):e1001609.
- 8. Ioannidis JPA. Why most clinical research is not useful. PLoS Med. 2016;13(6):e1002049.
- 9. Baker M. 1,500 scientists lift the lid on reproducibility. Nature News. 2016;533(7604):452.
- 10. Promoting research data sharing at Springer Nature. Of Schemes and Memes Blog. Available from: http://blogs.nature.com/ofschemesandmemes/2016/07/05/promoting-research-data-sharing-at-springer-nature. Cited 5 Nov 2019.
- 11. Cell Press STAR★Methods. Available from: https://www.cell.com/star-authors-guide. Cited 5 Nov 2019.
- 12. PLOS’ new data policy: public access to data. EveryONE Blog. 2014. Available from: https://blogs.plos.org/everyone/2014/02/24/plos-new-data-policy-public-access-data-2/. Cited 5 Nov 2019.
- 13. Fanelli D. How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS One. 2009;4(5):e5738.
- 14. Bik EM, Casadevall A, Fang FC. The prevalence of inappropriate image duplication in biomedical research publications. mBio. 2016;7(3):e00809–16.
- 15. Fang FC, Casadevall A. Retracted science and the retraction index. Infect Immun. 2011;79(10):3855–9.
- 16. Asendorpf JB, Conner M, Fruyt FD, Houwer JD, Denissen JJA, Fiedler K, et al. Recommendations for increasing replicability in psychology. Eur J Personal. 2013;27(2):108–19.
- 17. Begley CG, Ioannidis JPA. Reproducibility in science: improving the standard for basic and preclinical research. Circ Res. 2015;116(1):116–26.