Thursday, September 17, 2020

Archivetar - A better tar for Big Data

 Challenge: Trade-offs of Cost/Bit vs Bits/File and Performance

Among the options of tape, HDDs, SSDs, and NVMe, there are significant trade-offs: better expected performance for small files comes at a higher cost per unit capacity.  In HPC we would love to deploy petabytes of NVMe, but most budgets cannot support it.

Tape and AWS Glacier have low costs and great bandwidth, but long seek times before the first file appears. These technologies are therefore often targeted at archive use cases.  It is left to the user, though, to organize their data in a way that does not make recalling it painfully slow.

80/20 Rule of Project Folders

In a perfect world, archived project folders would include data, source code, scripts to re-create the data, etc.  This leads to a common 80/20 split, where 80% of the files hold only 20% of the data.  The total data volume drives the budget for storing the data, but the file count drives management complexity.
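This split is easy to check on a real project folder. A quick sketch using GNU find and awk (the 10 MB small-file threshold here is illustrative, not anything the tooling requires):

```shell
# Tally count and total bytes of files under 10 MB vs. the rest.
# Assumes GNU find (for -printf); run from the project root.
find . -type f -size -10M -printf '%s\n' |
  awk '{n++; b+=$1} END {printf "small: %d files, %d bytes\n", n, b}'
find . -type f ! -size -10M -printf '%s\n' |
  awk '{n++; b+=$1} END {printf "large: %d files, %d bytes\n", n, b}'
```

On a typical project folder the first command reports the vast majority of the files but a small fraction of the bytes.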

Current Practice: One Huge Tar

Currently most researchers, lacking better options, will tar an entire project and upload it to an archive. As projects grow larger, this introduces issues:

  • Tars are larger than the maximum object size
  • Compression is limited to a single core
  • To access a subset of the data, the entire archive must be retrieved and expanded, requiring 2x the storage space (tar + expanded tar)
  • Opportunities for parallelism are lost when transferring data at the file level
  • Large files, often binary, don't compress well, yet dominate compressor time for little benefit
  • Low utilization of CPU, storage I/O, and networking

Desired Outcome: Sort and Split

Ideally, files over a given size would be excluded; these are often data files big enough to realize full archive performance on their own.  Files under the threshold could be sorted into lists and assigned to tars of a target size.  The end result is a folder of only large files plus multiple tars of small files.  Subsets of data can then be recalled without needing to expand every archive.
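The idea can be sketched by hand with standard tools, though crudely: this version splits by file count rather than by cumulative tar size, which is what archivetar handles properly. The 10 MB threshold and 1000-files-per-chunk values are illustrative, and `project/` is a placeholder path:

```shell
# 1. List files under the small-file threshold (GNU find assumed)
find project/ -type f -size -10M > small-files.txt
# 2. Split the list into fixed-size chunks (chunk-aa, chunk-ab, ...)
split -l 1000 small-files.txt chunk-
# 3. Create one compressed tar per chunk (GNU tar's -T reads a file list)
for list in chunk-*; do
  tar -czf "archive-${list}.tar.gz" -T "$list"
done
```

Each chunk list doubles as a crude index of which tar holds which files.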

Archivetar - A better tar for Big Data

Archivetar aims to address exactly this workflow.

Archivetar benefits include:

  • Utilizes mpiFileUtils to quickly walk filesystems
  • Creates multiple tars simultaneously for higher performance on network filesystems
  • Auto detects many parallel compressors for multi-core systems
  • Saves an index of files in each tar to find subsets of data without needing to recall and expand all archives
  • Archives are still standalone tars and can be expanded without an archivetar install

Archivetar Example

# example data file count before archiving
[brockp@gl-login1 box-copy]$ find . | wc -l

# create tars of all files smaller than 10M
# tars should be 200M before compression
# save purge list
# compress with pigz if installed
archivetar --prefix my-archive --size 10M --tar-size 200M --save-purge-list --gzip

# delete small files and empty directories
archivepurge --purge-list my-archive-2020-09-17-22-35-20.under.cache

# file count after purge
[brockp@gl-login1 box-copy]$ find . | wc -l

# recreate
unarchivetar --prefix my-archive

[brockp@gl-login1 box-copy]$ find . | wc -l
