T. rex Data Stomp Back Into Spotlight

The controversy over the Tyrannosaurus rex data that Raj Mukhopadhyay posted about last month just won’t go away. Now, questions about data sharing are being raised in response to how the lead investigator, John Asara, made the mass spectral data available to the proteomics community. Adding fuel to the already blazing fire, Asara deposited the data into a free public proteomics database, but he included a rider stating that any other peptides identified by anyone else were owned by the T. rex researchers and were not publishable without their permission. Many proteomics scientists were appalled that such restrictions were placed on free data.

The whole brouhaha got us thinking about the larger issue of data sharing in proteomics. After interviewing proteomics experts, we found that database administrators do not routinely check entries for such restrictions. In addition, some researchers say that all proteomics data sets should be freely available so that bioinformaticians can develop better tools with them. Other researchers counter that they’d like to keep mining the data sets they worked so hard to generate, so they’d prefer to keep some of the information private. And on the whole, proteomics scientists are more reluctant than their genomics counterparts to make their data publicly accessible. What do you think: when should data be shared, and why might proteomics researchers have a different view than other -omics investigators?

Author: kcotting


7 Comments

  1. A key point in this discussion is that these are data sets made public in support of an academic publication. The NIH and other funding agencies have permitted restrictions on prepublication data in a number of contexts (e.g., whole genome shotgun sequencing), primarily to facilitate release of the data to the community while allowing the groups actually producing the data to retain priority in publication. This does not apply to the case of Asara et al.: these authors have already published their findings, and they are providing the data sets to resolve scientific issues regarding interpretation of the data. If another investigator finds a different interpretation, such as that these spectra were derived from a molecule other than T. rex collagen, they should be free to publish this finding. Access to data in support of publications is the foundation upon which science has been based for the past several centuries: “After publication, scientists expect that data and other research materials will be shared with qualified colleagues upon request” [1]. If Asara et al. are still analyzing the data sets and now have some uncertainty in their initially reported findings, they should retract the original paper. If they wish to stand by their initial publication, they are obliged to make the supporting data available to scientists without restriction on subsequent scientific publication.

    There is a question of what data actually forms the basis for the Science paper [2] under discussion. In particular, is it just the specific spectra matching a hypothesized collagen sequence, or is it the entire data set? The Science paper is based on the assertion that the matches between the spectra and the peptide sequence are statistically significant. Statistical significance cannot be assessed on the basis of only the matching spectra. It is necessary to compare the reported matches with other matches to other spectra in the data set to determine a false discovery rate and to interrogate other databases to assess the possibility that the spectra were derived from peptide sequences other than the hypothetical T. rex collagen. Thus, the scientific finding is based on the full data set, and it is the full data set that forms the basis for the publication.
    Patents and commercialization are a different issue. Asara et al. are certainly within their rights to patent findings based on their data, and they are not obliged to disclose in their scientific publications that such a filing may have been made. They are also, of course, free to continue to analyze their data and to seek patent protection for any new discoveries that they might make. However, they have no claim for co-inventor status on subsequent discoveries based on independent analysis of published data sets.

    David J. States, M.D., Ph.D.
    Professor of Health Information Science
    School for Health Information Science
    University of Texas Health Science Center at Houston

    [1] National Academy of Sciences, National Academy of Engineering, Institute of Medicine (1995) On Being a Scientist: Responsible Conduct in Research. National Academies Press, Washington, D.C. Page 10.

    [2] Asara, J. M. et al. (2007) Protein Sequences from Mastodon and Tyrannosaurus Rex Revealed by Mass Spectrometry. Science 316:280-5.

  2. I don’t know what the policy for proteomics data release should be – it is up to the proteomics community to decide. Until the community decides what the standards should be, every call to share the data may be classified as “bullying”. And the biggest “bully” of all would be Science magazine, since the Science guidelines state:

    “After publication, all data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.”

    This reader of Science was unable to ASSESS the conclusions of Asara et al., 2007 and respectfully asked the authors to provide the “data necessary to assess the conclusions of the manuscript”. Needless to say, the seven T. rex spectra alone do not allow one to assess the data (moreover, these seven spectra only became available after this reader wrote to the Editor that they should be released to assess the data). After trying for a year to “understand and assess” the conclusions (based only on seven spectra), this naïve reader wrote a technical column explaining that he had been having difficulty assessing the results. Is it about “bullying” or about the Science guidelines?

    When it comes to “bullying”, NSF comes a close second to Science:

    “NSF expects investigators to share with other researchers, at no more than incremental cost and within a reasonable time, the data, samples, physical collections and other supporting materials created or gathered in the course of the work.”

    Needless to say, the notion of “reasonable time” may be subject to interpretation and abuse, so NSF is not as bad as Science when it comes to bullying.

    The opponents of data sharing in proteomics have a really good argument: “We don’t want to share data (even a year after it was published) since it will allow others to mine them with different tools (and find gold that is ours).” The only danger is that genomics people may find this argument appealing and stop releasing the genomic data to the PROTEOMICS community. Indeed, if this principle benefits the proteomics researchers, it should benefit the genomics researchers in the same way. To avoid double standards, maybe the proteomics researchers who are against data sharing should agree not to use any DNA sequencing data in their MS/MS database searches? This reader participated in the mouse, rat, and chicken consortia (on the genomics side of the barricades) – should I feel sorry that the chicken genome is now available so that the proteomics researchers can mine it without the need to share the results with a curious Science reader who contributed to the Nature Chicken paper in the first place? Particularly if this Science reader wants to figure out whether T. rex indeed tasted like chicken?

    Pavel Pevzner

  3. I know what I think the policy on data release should be: anyone who wants to do publicly-funded science should release the data they collect, and they should release it promptly. They certainly should release all the relevant data supporting a publication no later than the publication date. To those who “counter they’d like to keep mining their data sets that they worked so hard to generate, so they’d prefer to keep some of the information private,” I have a simple response: why do you think you’re getting funded to do this work? It’s not in order to advance your career, whatever you may think – it’s to advance science! And as we all know, science is fundamentally based on describing discoveries and communicating those descriptions to the community, so that they can be tested, validated, and explored further.

    I have encountered this attitude in many fields – most recently among the influenza virus community, who don’t want to share samples or sequences – and I think that scientists who feel this way have simply forgotten why they went into science in the first place. We don’t study science in order to keep the results to ourselves. Yes, I know that no one wants to be “scooped” by someone else using their own data – and my response is that if you’re not clever enough to analyze your data quickly and accurately, too bad.

    The only reason so many fields don’t require rapid data sharing is, bluntly speaking, that results in those fields aren’t particularly urgent. But what scientist wants to admit that it doesn’t matter if the truth is delayed for a few years? That’s like admitting that your field is irrelevant.

    And to those who would respond that NIH, NSF, or another funder allows you to keep your data private, or that the Bayh-Dole Act allows you to file patents on your work: I’ve heard all that before. Just because it’s allowed doesn’t make it right. It’s not.

    As for the T. rex data, Asara et al. want to have it both ways – they did release their data, and I applauded them for that in my blog:
    http://genefinding.blogspot.com/2008/09/t-rex-peptides-now-available-to-public.html
    but to place restrictions on the data – and to withdraw part of it, as they’ve now done – is inappropriate.

    Steven Salzberg

  4. I agree with all the above points. I have learned from this process that transparency is always the best policy for data generation. In fact, it will typically help to support your conclusions rather than refute them. While there will always be skeptics trying to find fault with your data, there will be just as many supporters who find comfort in mining the data themselves and coming to the same conclusions independently. The original T. rex dataset containing all 48,216 spectra will be re-released very shortly on the PRIDE database without restrictions or trimming.

    Computational scientists such as Pavel Pevzner and Steven Salzberg cannot work without data that we generate so we should all release our interesting data so that they can further develop software for improving our analyses and interpretations.

    Everyone who acquires data is hesitant to release it, wary of a hidden gem that they missed or a sketchy spectrum that may have an alternative interpretation. That is precisely the reason why the data should be released. Let others verify the conclusions on their own. In time, the correct conclusions will stand.

    I urge all of you to go to the PRIDE database http://www.ebi.ac.uk/pride/ and download the raw data from T. rex, Mastodon, and ostrich and come to your own conclusions. I only ask that you please let us know about your findings.

    John Asara

  5. John Asara made a good point: “Computational scientists such as Pavel Pevzner and Steven Salzberg cannot work without data that we generate so we should all release our interesting data”. It would be only fair to clarify this point by adding: “Since mass spectrometry research today WOULD NOT BE POSSIBLE without computational scientists we should all release our interesting data”. Indeed, some mass spectrometrists may not know who Steven Salzberg is, but they are benefiting from his work every day when they search databases of proteins predicted by GLIMMER, the leading gene prediction tool. Not to mention that in many cases, these gene predictions were made in genomes assembled using computational tools he helped to develop. To the best of my knowledge, the areas of gene prediction and fragment assembly (enabling modern mass spectrometry) are now being developed exclusively by “computational scientists” since there are no experimental scientists left in these fields. Therefore, releasing data is not a charity but a way for experimental science to survive and stay relevant. It is also a small token of appreciation for two decades of work that went into development of gene prediction and fragment assembly algorithms without which MS-based protein identifications would not be possible today.

  6. First I want to applaud John for releasing his data and removing the restrictions. I hope that others in the mass spec community will follow his example. Thanks to Pavel too for the generous comments about our GLIMMER gene finder and our genome assembly software. I would add that all our software is free and open source, and has been for many years. I’ve also been involved in many genome sequencing projects, and for those where I was the project leader, we have released all of our sequence data to GenBank, with no restrictions, prior to publication. Most recently this includes a bacterial genome (Pseudomonas aeruginosa) sequenced entirely with short-read technology. I was also one of the project leaders for the Influenza Genome Sequencing Project, which has now sequenced and released >3000 complete influenza virus genomes, all in GenBank. Others have already written papers based on some of this data.

    Again, bravo to John for removing all restrictions on his data!
    -Steven Salzberg

  7. Most of that science is fraud; they publish to promote their careers and academic promotions at taxpayer expense. Absolute transparency is required in science, not transparency after the fact for self-serving, self-aggrandizing purposes. Bravo to Dr. Pevzner for his rigor and for holding John to task; hopefully this chemist has learned a valuable life and professional lesson. Regarding a “hidden gem”: we are in a severe economic recession, with millions unemployed, people losing homes, and homes underwater from a mortgage standpoint. Studying dinosaurs (funded by USA taxpayers) is a JOKE, and perhaps John and others should be focusing their efforts and intellectual pursuits on road repairs, schools, etc., things that actually matter to the general population.