Sunday, May 3, 2015

Thinking about the Data Lifecycle

I'm sure many have thought about this before me. 

My thoughts on data life cycle:
  1. Data as an idea (A project I have in mind to collect or generate data)
  2. Model / Methodology refining (Data is of assumed low quality)
  3. Bulk Data Generation (Data is of assumed high quality)
  4. Post Processing (Traditional journal article generation)
  5. Archiving (Code / Method and Data)
  6. Secondary Insights (Other investigators)
Steps 1 through 4 are all under the control of the original investigator; 5 and 6 are not.  There are many problems I see with this workflow. 

The most significant issue I see is that the incentives for the original investigator end at step 4.  We keep talking about more data and code publication, encouraging reuse, etc., but until there is an incentive to do so there will not be much of 5 and 6 going on.  The current method of requiring data management plans will only be moderately useful because it is a stick, not a carrot.

Steps 5 and 6 are more interesting to me from a technical standpoint. There is a lot of bit rot inherent in this process.  Specifically, the only way data are useful to a second investigator is if there are metadata and documentation (probably the first papers) describing what the data are and how they were generated. 

Metadata suffers from two problems.  The first is that, right now, archived data has minimal metadata attached, or the metadata is attached at the last minute rather than at data generation time.  Expecting all needed metadata to be recalled at a later date is unreasonable.
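One way around that last-minute scramble is to write the metadata in the same step that writes the data itself.  Here is a minimal sketch of what I mean; the helper name, sidecar layout, and field names are just my own assumptions for illustration, not any standard:

```python
import json
import hashlib
import getpass
from datetime import datetime, timezone
from pathlib import Path

def save_with_metadata(data_path, data_bytes, params):
    """Write the data file and a JSON metadata sidecar in one step,
    so provenance is captured at generation time rather than recalled later."""
    data_path = Path(data_path)
    data_path.write_bytes(data_bytes)

    metadata = {
        "created": datetime.now(timezone.utc).isoformat(),
        "creator": getpass.getuser(),
        "sha256": hashlib.sha256(data_bytes).hexdigest(),
        # hypothetical example fields: whatever describes how the data were generated
        "generation_parameters": params,
    }
    sidecar = data_path.with_suffix(data_path.suffix + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))

# Example: record the run parameters alongside the output the moment it is produced.
save_with_metadata("run_042.dat", b"...raw output...",
                   {"code_version": "v1.3", "grid_size": 128})
```

The point isn't this particular format; it's that the metadata gets captured automatically at step 3, while the investigator still remembers what the parameters were.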

The second issue is more subtle, and I don't see how it can be addressed well: will the metadata needed by any second investigator actually be listed and preserved?  All data have almost unlimited metadata characteristics, and many are of no interest to those who generated the data, who cannot foresee how the data might be useful to someone else in the future.
