Wednesday, May 31, 2017

Comparing the Cost of Public Cloud to On Prem for HPC

I was recently working on long-term planning for a modestly large HPC resource (20,000+ cores).  The question posed was: why are we not doing this in the cloud?

Personally, I love the cloud for a lot of use cases.  I would love to not worry about hardware and to have the ability to burst to any scale, but after I did the work with one major cloud provider, the economics were just not there.  Do I think they will get there?  Probably, but not for at least 5-10 years for our shop without some discounting off list price.  Below I'm laying out the factors behind my calculation; for another shop, they might change the result:


  1. Data Center Reliability
    Cloud data centers aim to provide enterprise availability, probably Tier 3 or better.  In academic HPC, which draws the most MW of anything in the data center, we don't value this much, but that level of availability is expensive to provide.
  2. Offerings designed for web content, enterprise, and analytics
    Cloud offerings are almost all based on enterprise needs or web app delivery.  HPC does not map to these workflows.  Adding a few extra ms of network overhead is a small cost for human interaction on a website, but is awful for MPI workloads in HPC.  Yes, there are HPC-specific providers out there, but most are not big enough to handle the scale we are looking at, and you sacrifice most of the flexibility of cloud.
    NOTE: As analytics becomes more important for academics we should pay very close attention here. This might be the first path to large-scale use of public cloud, as enterprise is currently ahead in this area.
  3. Scaling
    Cloud has way more scale in total cores than any HPC system out there, but as the adage goes, "there is no cloud, only other people's computers."  As many in the community have pointed out, if you have a decent-sized, consistent need (say, north of 500 kW of constant HPC load), even reserved and pre-paid instances rack up costs quickly compared to building your own.  Someone is paying for all that unused capacity held ready to scale.  In this case the massive scaling works against you in the marginal cost of additional long-term need.
  4. Staffing
    Refer to #2: because cloud never wants to let anyone down, providers staff at very high levels to keep all services running all the time to meet enterprise needs.  This is great if you have a database that is key to your organization; it is very cheap compared to staffing in house for that one database.  For HPC, though, we as academics again don't value that availability enough to justify its cost.
In general, because no HPC-specific cloud provider operates at scale while aiming for merely "good enough" availability, public cloud economics won't work right now if you have significant need.  HPC is capital heavy and staffing light compared to enterprise.  Public cloud spends expensive capital on high availability (even if it is the lowest-cost way to get it) that this community doesn't value.
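
The comparison above can be roughed out as a cost per used core-hour. The sketch below is only an illustration under simplifying assumptions, not the actual formula or figures behind this post: every name and parameter (`OnPremCosts`, `on_prem_cost_per_core_hour`, `break_even_utilization`) is hypothetical, power draw is treated as constant regardless of load, and the cloud is assumed to bill a flat price per core-hour actually used.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class OnPremCosts:
    """Hypothetical yearly cost inputs for an owned HPC system."""
    capital: float            # hardware purchase price
    lifetime_years: float     # years the hardware stays in service
    power_kw: float           # average electrical draw (assumed constant)
    price_per_kwh: float      # electricity price, cooling overhead included
    staff_per_year: float     # salaries for the people running the system
    facility_per_year: float  # data center space, maintenance, and so on

    def yearly_total(self) -> float:
        # Straight-line amortization of capital plus yearly running costs.
        return (
            self.capital / self.lifetime_years
            + self.power_kw * HOURS_PER_YEAR * self.price_per_kwh
            + self.staff_per_year
            + self.facility_per_year
        )


def on_prem_cost_per_core_hour(costs: OnPremCosts, cores: int,
                               utilization: float) -> float:
    """Cost of each core-hour actually used, given fractional utilization."""
    used_core_hours = cores * HOURS_PER_YEAR * utilization
    return costs.yearly_total() / used_core_hours


def break_even_utilization(costs: OnPremCosts, cores: int,
                           cloud_price_per_core_hour: float) -> float:
    """Utilization above which owning is cheaper than renting.

    A result greater than 1.0 means the cloud is cheaper even for a
    system that is busy all the time.
    """
    full_year_cloud_bill = cores * HOURS_PER_YEAR * cloud_price_per_core_hour
    return costs.yearly_total() / full_year_cloud_bill
```

On-prem costs are fixed whether the cores are busy or idle, so the cost per used core-hour falls as utilization rises, while the cloud price per core-hour stays flat. That is why a large, constant load favors owning and a bursty one favors renting. Plug in your own quotes and salaries; the break-even point is very sensitive to them.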

Now, if I were an enterprise IT person in the hardware / data center line of work, I would be worried and would retool my skills for deploying and monitoring HA architecture across public clouds.  Small and medium enterprises will be all cloud; it's just lower cost with greater flexibility.  Once the sunk cost of your data center goes away, small-scale operators cannot compete with the investment being made in cloud.  Your services should start shifting to running on cloud.
