Wednesday, May 31, 2017

Job Posting

Join our group!

http://careers.umich.edu/job_detail/142372/research_cloud_administrator_intermediate

The work involves using containers (Docker) with an orchestration engine (Mesos, Kubernetes, Rancher, etc.) to increase the flexibility of deploying Big Data tools in a dynamic research environment.

Opportunities for public cloud in research

It comes down to $/performance and government regulations / sponsor requirements; I covered these in other posts.  So what is my read of the tea leaves for cloud use in research computing?

  • Campus level / modest projects with stock CUI / NIST 800-171 or similar regulations. 
    These require purpose-built systems and heavy documentation.  FedRAMP has made this simpler in the public cloud; lean on that rather than doing all of the required work yourself, and save the time.
  • Logging/Administrative/Web/CI Systems/Disaster Recovery. 
    These systems are generally small and part of the administrative stack of simply running a center.  They benefit from the flexibility of the cloud the same way enterprise systems do.  I personally love PaaS and Docker here: yes, I would like another Elasticsearch cluster please; no, I do not want to worry about building it, please.
  • High Availability Systems
    IoT / sensor network ingest points, and any system where you need higher availability than normal research computing.  As with the sensitive systems above, it isn't worth building dedicated facilities for a small need: if you have a 1 MW HPC data center, you don't put the entire thing on generators and build a second data center just for 20 kW of distributed message buses for sensor networks.  If you are not investing a lot of capital in those computer systems, don't invest it in facilities for them either; piggyback on the clouds' built-in multi-site offerings. 
  • High Throughput Computing / Grid Computing
    New lower-cost pricing models via AWS Spot, Google Preemptible, and Azure Low Priority instances bring the cost of actual cycles very close to what you can buy bulk systems for.  Every HPC center I know of is always running out of power and cooling; move these workloads, which are insensitive to interruption, have short run times, and don't require high IO or tightly coupled parallelism, to the cloud, and keep your scarce power for the unique capability of HPC.
  • Archive / HSM Replicas, or the entire thing
    Depending on how you use your tape today, the cloud sites make great replicas at similar cost.  Some niche providers like Oracle have prices that are hard to beat, with one catch: they only stay cheap as long as you never access your data.  Cost predictability for faculty is a problem, and with cold storage costing as much as $10,000 per retrieved PB in the cloud, if your HSM system is busy, use the cloud only for a second copy for DR.  That is, upload data (generally free), delete it (sometimes free), and never bring it back except on media error (a minimal code sketch of this upload-only pattern follows this list).  This should limit your capital spend on a second copy of the system, as well as the need for a second site to put it in.
    If you are doing real archive, that is, set it and forget it, ejected tape will forever be cheaper.  But if you don't have a place to put it, or people to do the media shuffle, there is a lot of value in using the cloud for all of it.
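To make the upload-only DR pattern concrete, here is a minimal sketch in Python using boto3 against AWS S3. The bucket name, key prefix, and the choice of the DEEP_ARCHIVE storage class are my assumptions for illustration; other clouds and the niche providers have equivalent cold tiers with different names, pricing, and retrieval penalties.

```python
# Minimal sketch, assuming boto3 is installed and AWS credentials are already
# configured. Bucket, prefix, and storage class below are illustrative only.
import boto3

s3 = boto3.client("s3")

def upload_dr_copy(local_path, bucket="example-hsm-dr-bucket", prefix="tape-replica/"):
    """Push one file as a disaster-recovery replica to a cold storage class.
    The whole point is to never read it back except on media error, since
    retrieval from cold tiers is where the surprise costs live."""
    key = prefix + local_path.rsplit("/", 1)[-1]
    s3.upload_file(
        local_path,
        bucket,
        key,
        ExtraArgs={"StorageClass": "DEEP_ARCHIVE"},  # cheap to hold, costly to pull back
    )
    return key

if __name__ == "__main__":
    print(upload_dr_copy("/archive/staging/project-dataset.tar"))
```

Once the upload succeeds, the local staging copy can be deleted; the cloud copy exists only as insurance against media error on the primary tape.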
This is my first (quick) set of thoughts.  Other systems, such as analytics platforms, should also go to the cloud; the cloud offerings there are already more mature than most research sites', and they make hosting things like notebooks, and splitting data across storage buckets for policy reasons, much more useful.

I'm sure many of you will disagree with me, feel free to tweet me at @brockpalen.

Data Providers Need to Catch up to Cloud

In my recent project looking at whether we could migrate this generation of HPC to the cloud, another topic kept arising.

We cannot yet take enough of our data or software off our owned systems and facilities.

Beyond HIPAA and BAAs, there is a raft of other data regulations under which data are provided to our researchers.  Last I checked, there were thousands of faculty with hundreds of data sources in a campus environment.

Right now, because most campus projects are small, it is not worth the time, or the risk of upsetting the data provider, to get an agreement in place with a cloud provider to host said data.  Many of these agreements require revealing details about physical security and practices that you generally cannot get from a cloud provider, or they refer to standards that predate the cloud (anyone who took FISMA training pre-FedRAMP, or has seen an agreement requiring physical isolation, will recognize this limitation).

Some data types (FISMA / NIST 800-171 come to mind) are actually easier to handle in the major public clouds, because you don't need sign-off from each of the data providers, just from the agency that has already done the work with that public cloud provider.  (NOTE: I am still early in looking into this; this is my current understanding, but I could be wrong.)  Thus, after doing the last-mile work (securing your host images, your staff policies, patch policies, etc.) you can actually respond to these needs faster in the cloud and get an ATO.

So where does this leave the data providers who each have their own rules and require sign-off from the provider for each project, making the fixed cost of each project high?  As a community we should be educating them and moving them toward aligning with one of the federal standards.  Very few of the projects I have seen are actually stricter than NIST 800-171.  If these data providers would accept those standards, along with an ATO (Authority to Operate) from the federal agencies, they would probably get better security and fewer under-the-desk "air gapped" servers, while increasing the impact and ease of access to data for the work they are trying to support.

This would make funding go further and get technical staff and researchers back to what they do best, with less time spent poring over data use agreements.

Comparing the Cost of Public Cloud to On Prem for HPC

I was recently working on long-term planning for a modestly large HPC resource (20,000+ cores).  The question posed was: why are we not doing this in the cloud?

Personally I love the cloud for a lot of use cases; I would love to not worry about hardware and to have the ability to burst to any scale.  But after I did the work with one major cloud provider, the economics were just not there.  Do I think they will get there?  Probably, but not for at least 5-10 years for our shop without some discounting off list price.  Below I'm laying out the factors; for another shop they might change the calculation:


  1. Data Center Reliability
    Cloud data centers aim to provide enterprise availability, probably Tier 3 or better.  In academic HPC, which draws the most megawatts from the data center infrastructure, we don't value this much, but it is expensive to provide that level of availability.
  2. Offerings designed for web content, enterprise, and analytics
    Cloud offerings are almost all built around enterprise needs or web app delivery.  HPC does not map to these workflows.  Adding a few extra milliseconds of network overhead is a small cost for human interaction with a website, but it is awful for tightly coupled MPI workloads.  Yes, there are HPC-specific providers out there, but most are not big enough to handle the scale we are looking at, and you sacrifice most of the flexibility of the cloud.
    NOTE: As analytics becomes more important for academics we should pay very close attention here; this might be the first avenue to large-scale use of public cloud, as enterprise is currently ahead of us in this area. 
  3. Scaling
    Cloud has way more scale in total cores than any single HPC system out there, but as the adage goes, "there is no cloud, only other people's computers."  As many in the community have pointed out, if you have a decent-sized consistent need, even reserved and pre-paid instances rack up costs quickly compared to building your own once you're north of 500 kW of constant HPC demand (see the rough back-of-the-envelope sketch below).  Someone is paying for all that unused capacity to scale.  In this case the massive scaling works against you in your marginal cost for additional long-term need. 
  4. Staffing
    Related to #2: because cloud providers never want to let anyone down, they staff at very high levels to keep every service running all the time to meet enterprise needs.  This is great if you have a database that is key to your organization; it is very cheap compared to staffing in-house for that one database.  But for HPC, again, we as academics don't value that level of support relative to what it costs.
In general, because there is no HPC-specific cloud provider operating at scale with a service that aims for "good enough" availability, public cloud economics won't work right now if you have significant need.  HPC is capital-heavy and staffing-light compared to enterprise.  Public cloud spends expensive capital on high availability (even if it is the lowest-cost way to get it), and this community doesn't value that availability.
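To illustrate the scaling point in #3, here is a rough back-of-the-envelope sketch in Python comparing an owned, amortized HPC node against a cloud instance left running all year. Every number in it is a placeholder assumption rather than a quote from any vendor; the point is the shape of the comparison, not the specific figures.

```python
# Back-of-the-envelope sketch: cost per core-hour of an owned HPC node vs. a
# continuously running cloud instance. All numbers are placeholder
# assumptions for illustration only; substitute your own quotes.

HOURS_PER_YEAR = 8760

def owned_core_hour(node_price, cores, lifetime_years,
                    power_kw, power_cost_per_kwh, annual_staff_share):
    """Amortized cost per core-hour for a purchased node."""
    capital_per_year = node_price / lifetime_years
    power_per_year = power_kw * HOURS_PER_YEAR * power_cost_per_kwh
    total_per_year = capital_per_year + power_per_year + annual_staff_share
    return total_per_year / (cores * HOURS_PER_YEAR)

def cloud_core_hour(instance_price_per_hour, cores):
    """Cost per core-hour of a cloud instance left running all year."""
    return instance_price_per_hour / cores

if __name__ == "__main__":
    # Hypothetical 28-core node: $8,000 purchase, 5-year life, 0.4 kW draw,
    # $0.10/kWh power, and a $500/year share of admin staff time.
    own = owned_core_hour(8000, 28, 5, 0.4, 0.10, 500)
    # Hypothetical 28-core-equivalent instance: $1.50/hour on demand,
    # $0.90/hour reserved/prepaid.
    on_demand = cloud_core_hour(1.50, 28)
    reserved = cloud_core_hour(0.90, 28)
    print(f"owned:     ${own:.4f}/core-hour")
    print(f"on-demand: ${on_demand:.4f}/core-hour")
    print(f"reserved:  ${reserved:.4f}/core-hour")
```

With these placeholder numbers the owned node comes out around a penny per core-hour against three to five cents in the cloud, which is the gap behind the "economics aren't there yet" conclusion above for steady, large-scale demand.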

Now, if I were an enterprise IT person in the hardware / data center line of work, I would be worried, and I would retool my skills toward deploying and monitoring HA architectures across public clouds.  Small and medium enterprises will be all cloud; it's simply lower cost with greater flexibility.  Once the sunk cost of a data center goes away, small-scale operators cannot compete with the investment being made in cloud.  Your services should start shifting to running on cloud.

Saturday, November 7, 2015

SSH Directly to XSEDE Resources with GSISSH

Many know about the XSEDE Single Sign-On Login Hub.  Fewer know that you can build your own version of it on your local systems.  To create the sign-on hub, XSEDE uses the Globus Toolkit.

The steps include:

  • Build Globus Toolkit with GSI enabled
  • Download XSEDE Certificates
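As a hedged illustration of what the end result looks like once the toolkit is built and the certificates are installed, here is a small Python sketch wrapping the Globus command-line tools. The MyProxy server, username, credential lifetime, and target host below are placeholder assumptions; the XSEDE documentation is the authority for the real values.

```python
# Sketch only: assumes the Globus Toolkit (including the GSI-OpenSSH and
# MyProxy clients) is built with GSI enabled and the XSEDE CA certificates
# are installed. Server, username, lifetime, and host are placeholders.
import subprocess

MYPROXY_SERVER = "myproxy.xsede.org"               # placeholder; confirm in XSEDE docs
XSEDE_USERNAME = "your_xsede_username"             # placeholder
TARGET_HOST = "login.example-xsede-resource.org"   # placeholder login node

# Fetch a short-lived proxy credential (here 72 hours) from the MyProxy server.
subprocess.run(
    ["myproxy-logon", "-s", MYPROXY_SERVER, "-l", XSEDE_USERNAME, "-t", "72"],
    check=True,
)

# Use the proxy credential to SSH directly to the resource without a password.
subprocess.run(["gsissh", TARGET_HOST], check=True)
```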

Big Data and Data Job Openings

Advanced Research Computing - Technology Services ( http://arc-ts.umich.edu/ ) at the University of Michigan has four new job openings as part of our Data Science Initiative ( http://record.umich.edu/articles/u-m-launching-100-million-data-science-initiative ) and to support our ongoing efforts in High Performance Computing.  Positions range from entry level to senior.

ARC-TS builds and operates research computing platforms. These platforms will include High Performance Computing (HPC) Linux clusters, High Throughput Computing (HTCondor), data-intensive systems (Hadoop, SQL, and NoSQL), and containerized/virtualized systems (OpenStack, Docker). 
Big Data System Administrator Senior
http://umjobs.org/job_detail/117063/big_data_system_administrator_seniorintermediate

This position will act as a senior technical resource and will be primarily responsible for creating, operating, and expanding our Hadoop and Spark infrastructure.
----------

Research Database Administrator Senior
http://umjobs.org/job_detail/117056/research_database_administrator_seniorintermediate

This position will act as a senior technical resource responsible for creating and operating our research database infrastructure, including designing, building, operating, and supporting database platforms. These platforms will contain SQL, NoSQL, and columnar data stores.
----------

Research Cloud Administrator Intermediate
http://umjobs.org/job_detail/117062/research_cloud_administrator_intermediate

This position will act as a technical resource on a team that will create and operate our private cloud infrastructure, and will be responsible for designing, building, operating, and supporting a research private cloud. The private cloud will host administrative systems, databases, and other services.