Sunday, May 3, 2015

Metadata ease of use in traditional HPC platforms

I was reading a co-worker's notes from BioIT World 2015 and got to thinking about the usability of data and how object filesystems could be made more useful to the average researcher.  This got me thinking about the data life cycle and the need for better metadata management.

Users love a regular POSIX filesystem with folders, and the folder hierarchy effectively serves as their own metadata structure.


People love working with this; our brains wrap around it easily.  The problem is that a file such as data.h5 carries none of the information encoded in the directory structure the user has placed it in.  Existing object store systems make it hard to navigate data like this.

I propose two ideas.  The first is a pseudo filesystem that looks like folders but can point to data in multiple ways depending on which metadata attribute you are interested in.  The second is a 'search only' filesystem.  Think of a search only filesystem as something like Apple Spotlight or LaunchBar.  Most of the time it's close enough, and it finds what you want based on metadata.  These searchable systems should be extensible from user space (think bash completion add-ons) by different communities of use.
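To make the two ideas a bit more concrete, here is a minimal sketch in Python.  It is only an illustration under my own assumptions, not a real filesystem: the MetadataIndex class, the object IDs, and the attribute names (project, instrument, year) are all hypothetical.  It shows the same objects being presented as different 'folder' trees depending on which attributes you pick, plus a simple attribute search.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy in-memory index: object ID -> metadata attributes (hypothetical)."""

    def __init__(self):
        self._objects = {}  # object_id -> dict of metadata attributes

    def put(self, object_id, **metadata):
        """Register an object and the metadata attached at generation time."""
        self._objects[object_id] = dict(metadata)

    def view(self, *attributes):
        """Build pseudo 'folders' from the chosen attributes, in order.

        The same objects appear under different paths depending on which
        attributes are requested; nothing is copied or moved.
        """
        folders = defaultdict(list)
        for object_id, meta in self._objects.items():
            parts = [str(meta.get(attr, "unknown")) for attr in attributes]
            folders["/".join(parts)].append(object_id)
        return {path: sorted(ids) for path, ids in folders.items()}

    def search(self, **criteria):
        """'Search only' access: return objects matching every criterion."""
        return sorted(
            object_id
            for object_id, meta in self._objects.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        )


# Hypothetical objects and attributes, purely for illustration.
index = MetadataIndex()
index.put("obj-001", project="genomics", instrument="sequencer-a", year=2015)
index.put("obj-002", project="genomics", instrument="sequencer-b", year=2014)
index.put("obj-003", project="imaging", instrument="sequencer-a", year=2015)

by_project = index.view("project", "year")   # folders like genomics/2015
by_instrument = index.view("instrument")     # same objects, different tree
hits = index.search(project="genomics", year=2015)
```

A real system would keep the index in a database next to the object store and expose the views through something like FUSE, but the point is the same: the 'folders' are just queries over metadata.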

This should lead to a few results:
  • Users will find it useful in their own day-to-day work to attach metadata at data generation time rather than leaving the data uncategorized.
  • It should allow for more robust metadata through the data's entire life cycle, all the way to archive, and thus make the data more useful to future users.
  • Object filesystems holding the actual data can phase out traditional POSIX filesystems and hopefully help with many of the data scale problems we have had on the trail to Exascale.
I think this would be the best of both worlds: human friendly (still has 'folders'), computer friendly (trillions of objects, no directory size issues), better for data reuse (more and better metadata), and better for productivity (find my data with attribute X easily using search).
