Thursday, September 17, 2020

Archivetar - A better tar for Big Data

 Challenge: Trade-offs of Cost/Bit vs Bits/File and Performance

Among the options of tape, HDD, SSD, and NVMe, there are significant trade-offs: better performance for small files comes at a higher cost per unit capacity.   In HPC we would love to deploy petabytes of NVMe, but most budgets cannot support it.

Tape and AWS Glacier have low costs and great bandwidth, but long seek times before the first file appears. Thus these technologies are often targeted at archive use cases.  It is left to the user, though, to organize their data in a way that does not make recalling it painfully slow.

80/20 Rule of Project Folders

In a perfect world archived project folders would include data, source code, scripts to re-create the data, etc.  This leads to a common 80/20 split, where 80% of the files hold 20% of the data.  The total data volume drives the budget for storing the data, but the file count, holding only 20% of the data, drives management complexity.

Current Practices, One Huge Tar

Currently most researchers, not having better options, will tar an entire project and upload it to an archive. As projects grow larger this introduces issues:

  • Tars can exceed the maximum object size of the archive store
  • Compression is limited to a single core
  • To access a subset of the data the entire archive must be retrieved and expanded, requiring 2x the storage space (tar + expanded tar)
  • Opportunities for parallelism are lost when transferring data at the file level
  • Large files, often binary, compress poorly yet dominate compressor time for little benefit
  • Low utilization of CPU, storage IO, and networking

Desired Outcome, Sort and Split

Ideally, files over a given size would be excluded from the tar. These will often be data files that are big enough on their own to realize full archive performance.  Files under this threshold could be sorted into lists and assigned to tars of a target size.  The end result is a folder of only large files plus multiple tars of small files.  Subsets of data can then be recalled without expanding every archive.
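The sort-and-split idea can be sketched with nothing but standard tools.  This is a minimal illustration of the workflow archivetar automates; the size threshold, batch size, and file names here are chosen only for the example:

```shell
#!/bin/sh
# Sketch of sort-and-split using only standard tools (not archivetar itself).
set -e
work=$(mktemp -d)
cd "$work"

# Sample project: many small files plus one "large" file
mkdir -p project
for i in $(seq 1 50); do echo "small data $i" > "project/small_$i.txt"; done
dd if=/dev/zero of=project/big.bin bs=1024 count=2048 2>/dev/null  # 2 MiB file

# Files under the threshold (here 1 MiB) are listed...
find project -type f -size -1024k > small_files.list

# ...split into fixed-size batches (here 20 files per batch)...
split -l 20 small_files.list batch_

# ...and each batch becomes its own stand-alone tar
for b in batch_*; do
    tar -czf "archive_${b}.tar.gz" -T "$b"
done
# Result: several small-file tars, while project/big.bin stays loose
```

Real archivetar batches by target tar size rather than by file count, but the separation of large loose files from many small tarred files is the same.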

Archivetar - A better tar for Big Data

Archivetar aims to address exactly that workflow.

Archivetar benefits include:

  • Uses mpiFileUtils to quickly walk filesystems
  • Creates multiple tars simultaneously for higher performance on network filesystems
  • Auto detects many parallel compressors for multi-core systems
  • Saves an index of files in each tar to find subsets of data without needing to recall and expand all archives
  • Archives are still stand-alone tars and can be expanded without archivetar installed

Example Archivetar

# example data file count
[brockp@gl-login1 box-copy]$ find . | wc -l

# create tars of all files smaller than 10M
# tars should be 200M before compression
# save purge list
# compress with pigz if installed
archivetar --prefix my-archive --size 10M --tar-size 200M --save-purge-list --gzip

# delete small files and empty directories
archivepurge --purge-list my-archive-2020-09-17-22-35-20.under.cache

# File count after
[brockp@gl-login1 box-copy]$ find . | wc -l

# recreate
unarchivetar --prefix my-archive

[brockp@gl-login1 box-copy]$ find . | wc -l
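Because the outputs are plain tars, a subset can also be recalled by hand: scan each tar's listing and expand only the one that matches.  A minimal sketch with stand-in tars (the file and archive names are illustrative; archivetar's saved index would replace the `tar -tzf` scan):

```shell
#!/bin/sh
# Sketch: locate a wanted file among several stand-alone tars, then
# expand only the archive that contains it.
set -e
work=$(mktemp -d)
cd "$work"

# Stand-ins for two small-file tars produced by an archiving run
echo alpha > a.txt
echo beta > b.txt
tar -czf my-archive-1.tar.gz a.txt
tar -czf my-archive-2.tar.gz b.txt
rm a.txt b.txt

# Scan each tar's listing; extract only from the tar that matches
for t in my-archive-*.tar.gz; do
    if tar -tzf "$t" | grep -qx 'b.txt'; then
        tar -xzf "$t" b.txt
    fi
done
# Only b.txt is restored; the other archive is never expanded
```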

Monday, January 22, 2018

Automating Jetstream with Terraform

Jetstream is an OpenStack cluster for science that researchers can request access to via XSEDE, which has traditionally been known only as an HPC provider but has long offered other services.  Jetstream provides many of the infrastructure-as-a-service (IaaS) offerings for which many have turned to public cloud providers (Amazon, Google, and Azure), but many don't know that Jetstream exists.

Another challenge is automation of Jetstream.  AWS provides a service called CloudFormation that automates deployments, scaling, etc. without requiring a lot of time in UIs, and helps with predictability between deployments.

Jetstream is, at its most fundamental level, just an implementation of OpenStack, and thus any tool that understands the OpenStack API can work with it.  So I went out and made a small example of how to bring up a CentOS 7 system on Jetstream, creating all the supporting networks and security groups, with Terraform, an open-source tool for automated infrastructure.

You can find this example and documentation on my Github site.
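The core of such a Terraform configuration is small.  This is a hedged sketch, assuming OpenStack credentials are sourced from a Jetstream openrc file into `OS_*` environment variables; the image, flavor, and key pair names below are illustrative placeholders, not the exact values from my example repo:

```hcl
# Provider picks up OS_* environment variables from a sourced openrc file
provider "openstack" {
}

# A single CentOS 7 instance (names and flavor are example values;
# real image names come from the Jetstream image catalog)
resource "openstack_compute_instance_v2" "example" {
  name            = "tf-example"
  image_name      = "JS-API-Featured-Centos7-Latest"
  flavor_name     = "m1.small"
  key_pair        = "my-key"
  security_groups = ["default"]
}
```

From there `terraform init`, `terraform plan`, and `terraform apply` bring the system up, and the configuration can be extended with network and security group resources.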

Users should find it simple to extend the example into very complex, multi-network, customized, scalable environments, the same as they can on public cloud providers, but without the extreme cost.

Wednesday, May 31, 2017

Job Posting

Join our group!

The work involves using containers (Docker) with an orchestration engine (Mesos, Kubernetes, Rancher, etc.) to increase the flexibility of deploying Big Data tools in a dynamic research environment.

Opportunities for public cloud in research

It comes down to $/performance and government regulations / sponsor requirements. I covered these in other posts.  So what is my read of the tea leaves for cloud use in research computing?

  • Campus level / modest projects with stock CUI / NIST 800-171 or similar regulations. 
    These require purpose-built systems and heavy documentation.  FedRAMP made this simpler; avoid doing all the work required for this yourself and save the time.
  • Logging/Administrative/Web/CI Systems/Disaster Recovery. 
    These systems are generally small and part of the administrative stack of just running a center.  They benefit the same way enterprise systems do from the flexibility of the cloud.  I personally love PaaS and Docker here: yes, I would like another Elasticsearch cluster please; no, I do not want to worry about building it.
  • High Availability Systems
    IoT / sensor net ingest points, or any system where you need higher availability than normal research computing.  Similar to the sensitive systems: if you have a 1MW HPC data center, you don't put the entire thing on generators and stand up a second data center just for 20KW of distributed message buses for sensor networks.  If you are not investing a lot of capital in these computer systems, don't do so anywhere else; piggyback on the clouds' built-in multi-site offerings. 
  • High Throughput Computing / Grid Computing
    New lower-cost pricing models via AWS Spot, Google Preemptible, and Azure Low Priority make the cost of actual cycles very close to what you can buy bulk systems for.  Every HPC center I know of is always running out of power and cooling; take these workloads, which are insensitive to interruption, have short run times, and don't require high IO or tightly coupled parallelism, to the cloud, and keep your scarce power for HPC's unique capability.
  • Archive / HSM Replicas, or the entire thing
    Depending on your use of tape today, the cloud sites make great replicas at similar costs.  Some niche providers like Oracle have costs that are hard to beat, with one catch: they stay low only as long as you never access your data.  Cost predictability for faculty is a problem, and with cold storage costing as much as $10,000 per retrieved PB in the cloud, if your HSM system is busy, use the cloud only for a second copy for DR.  That is, upload data (generally free), delete it (sometimes free), and never bring it back except on media error.  This should limit your capital spend on a second system as well as the need for a second site to put it in.
    If you are doing real archive, that is, set it and forget it, ejected tape will forever be cheaper.  But do you have a place to put it, and people to do the shuffle?  If not, there is a lot of value in using the cloud for all of it.
This is my first (quick) set of thoughts.  Other systems, like analytics systems, should also run in the cloud; cloud offerings there are already more mature than most research sites', and they make hosting things like notebooks, and splitting data across storage buckets for policy reasons, much more practical.

I'm sure many of you will disagree with me, feel free to tweet me at @brockpalen.

Data Providers Need to Catch up to Cloud

In my recent project investigating whether we could migrate HPC to the cloud in this generation, another topic kept arising.

We cannot yet take enough of our data or software off our owned systems and facilities.

Beyond HIPAA and BAAs, there is a raft of other data regulations under which data are provided to our researchers.  Last I checked there were thousands of faculty with hundreds of data sources in a campus environment.

Right now, because most campus projects are small, it is not worth it, in time or in upsetting the data provider, to get an agreement in place with a cloud provider to host said data.  Many of these agreements require revealing information about your physical security and practices that you generally cannot get from a cloud provider, or they refer to standards that predate the cloud (anyone who took FISMA training pre-FedRAMP, or has seen an agreement requiring physical isolation, will recognize this limitation).

Some data types (FISMA / NIST 800-171 come to mind) are actually easier to handle in the major public clouds, because you don't need sign-off from each of the data providers, just from the agency that has already done the work with that public cloud provider.  (NOTE: I am still early in looking into this; this is my current understanding, but I could be wrong.)  Thus, after doing the last-mile work (securing your host images, your staff policies, patch policies, etc.), you can actually respond to these needs faster in the cloud and get an ATO.

So where does this leave the data providers that each have their own rules and require sign-off from the provider for each project, making the fixed cost of each project high?  As a community we should be educating them to align with one of the federal standards.  Very few of the projects I have seen are actually stricter than NIST 800-171; if these data providers would accept those standards, and an ATO (Authority to Operate) from the federal agencies, they would probably get better security and fewer 'air-gapped' servers under desks, while increasing the impact and ease of access to data for the work they are trying to support.

This would make funding go further and get technical staff and researchers back to what they do best, with less time spent poring over data use agreements.

Comparing the Cost of Public Cloud to On Prem for HPC

I was recently working on long-term planning for a modestly large HPC resource (20,000+ cores).  The question posed was: why are we not doing this in the cloud?

Personally I love the cloud for a lot of use cases.  I would love to not worry about hardware and to have the ability to burst to any scale, but after doing the work with one major cloud provider, the economics were just not there.  Do I think they will get there?  Probably, but not for at least 5-10 years for our shop, without some discounting off list price.  Below I lay out my reasoning; for another shop the calculation might change:

  1. Data Center Reliability
    Cloud data centers aim to provide enterprise availability, probably Tier 3 or better.  Academic HPC, which draws the most MW from the data center infrastructure, doesn't value this much, and it is expensive to provide that level of availability.
  2. Offerings designed for web content, enterprise, and analytics
    Cloud offerings are almost all based on enterprise needs or web app delivery.  HPC does not map to these workflows.  Adding a few extra ms of network overhead is a small cost for human interaction with a website, but it is awful for HPC MPI workloads.  Yes, there are HPC-specific providers out there, but most are not big enough to handle the scale we are looking at, and you sacrifice most of the flexibility of cloud.
    NOTE: As analytics becomes more important for academics we should pay very close attention here, and this might be the first option to large scale utilization of public cloud, as enterprise is ahead in this area currently. 
  3. Scaling
    Cloud has way more scale in total cores than any HPC system out there, but as the adage goes, "there is no cloud, only other people's computers," and as many in the community have pointed out, if you have a decent-sized consistent need, even reserved and pre-paid instances rack up costs quickly compared to building your own once you're north of 500KW of constant HPC need.  Someone is paying for all that unused capacity to scale.  In this case the massive scaling works against you in your marginal cost for additional long-term need. 
  4. Staffing
    Refer to #2: because cloud providers never want to let anyone down, they staff at very high levels to keep all services running all the time to meet enterprise needs.  This is great if you have a database that is key to your organization; it is very cheap compared to staffing in-house for that one database.  But for HPC, again, we as academics don't value that enough to pay for it.
In general, because there is no HPC-specific cloud provider with scale offering a service that aims for "good enough" availability, public cloud economics won't work right now if you have significant need.   HPC is capital-heavy and staffing-light compared to enterprise.  Public cloud spends expensive capital on high availability (even if that's the lowest-cost way to get it), which isn't valued by this community.

Now if I were an enterprise IT person in the hardware / data center line of work, I would be worried, and I would retool my skills toward deploying and monitoring HA architectures across public clouds. The small and medium enterprise will be all cloud; it's just lower cost with greater flexibility.  Once the sunk cost of a data center goes away, small-scale operators cannot compete with the investment being made in cloud.  Your services should start shifting to running on cloud.