Thursday, February 9, 2012

Doing More With the Same Budget

Andrew Jones has an article up at HPCWire on my series of posts about HPC funding models. I want to move away from condo funding, where researchers own the underlying hardware, to a service model where users effectively rent cores. This is exactly what we did at Michigan.

Note: my comments on this blog are my own and do not reflect those of my employer.

I don't actually see much disagreement between Andrew and myself. The free model, "overheads" in Andrew's terms, is a wonderful example of the Tragedy of the Commons. Queue times get long, and what is to stop anyone from running anything? Under this wild-west approach, which I don't think Andrew is advocating, users end up paying in time. As with anything, if you hold supply steady and increase demand, wait times go up.

I live and work in a world where nothing is free. Nodes, storage, admins, consulting, power, facilities, etc., all take real resources, and their use needs to be moderated. Many educational institutions live in a world where the support from central administration is normally admins, maybe power, and maybe some software tokens.

By default in this world the researcher must bring the hardware, in the form of the funds to purchase nodes. In most cases this hardware can only be used by the group that bought it. I currently run user support for a cluster of 5600+ cores spread across 51 groups. Overall utilization is 50-70%, but because these groups cannot, and are not allowed to, share, the picture inside any given group is very different: some are at 100% with jobs queued, some are idle.
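
To put rough numbers on that, here is a back-of-the-envelope sketch; the 60% figure is just the midpoint of the 50-70% range above, not a measured value:

# Rough cost of partitioning: cores that sit idle because groups cannot share.
# The 60% utilization is an assumption, the midpoint of the 50-70% range above.
TOTAL_CORES = 5600
UTILIZATION = 0.60

idle_cores = TOTAL_CORES * (1 - UTILIZATION)    # ~2240 cores idle on average
idle_core_hours = idle_cores * 24 * 365         # per year
print(f"~{idle_cores:.0f} cores idle on average")
print(f"~{idle_core_hours / 1e6:.1f} million core-hours/year unused")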

If the assumption is that we will never just be given the gear to run, and that funding agencies are only going to spend $X on computational research a year, what value do you want out of it? I argue that just by reorganizing how the capital funding is spent, more value is realized; as Andrew says, "Science and Business output."

I agree 100% with Andrew that users do not always like paying for high-speed networking, consulting, etc., and might not even realize until after the funding arrives that that is what they need. I also agree that some of the most interesting stuff is the "just try this" jobs. I think these are problems we can solve in other ways, out of the savings from driving utilization higher. I don't expect that all resources should be billed by the unit; some resources are most efficiently provided as a public good.

Under the system I propose I would expect to extract close to maximum value from the capital resources while maintaining great service. Thus far those resources have been the hardware and the facilities it consumes. In my most recent post I point out the benefits of a HAAS model to users:
For the user they gain flexibility in utilization.  Groups with small budgets can now utilize large chunks of HPC resources for short periods, opening HPC to an entirely new class of user.  To illustrate with the Michigan Flux Project: a group with a budget of $1000 could not even buy one node in a condo, but can purchase 89 cores of compute for 1 month, or 1 core for 89 months.  These options are the biggest gain to be realized by moving to a HAAS model.
This flexibility in utilizing the hardware equates to flexibility in the most important resource, which is the researcher behind the job.  I am not saying the user behind the desk does nothing while jailed to the smaller core count of a condo that they must maintain for 5 years, but I do think they can gain huge outcomes from being able to use a large number of resources in a short period, at the same cost as a condo or lower.
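
A quick sketch of that arithmetic; the per-core-month rate is only what the Flux figure above implies (roughly $1000 / 89), and actual pricing will differ:

# Back-of-the-envelope: what a fixed budget buys when renting by the core.
# The rate below is an assumption inferred from the $1000 = 89 core-months
# example above; it is not a published Flux price.
PRICE_PER_CORE_MONTH = 1000 / 89    # ~$11.24 per core-month

def core_months(budget_dollars):
    """Total core-months a budget buys; slice cores x months however you like."""
    return budget_dollars / PRICE_PER_CORE_MONTH

total = core_months(1000)           # ~89 core-months
for cores in (1, 89):
    print(f"{cores:3d} core(s) for {total / cores:.0f} month(s)")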

Lastly, the biggest win is HAAS being approachable to those with the smallest budgets: the bottom billion researchers.  If you don't get a bucket of money to provide HPC to everyone who comes knocking, what do you tell the experimentalist who wants to run one model to align their input?  That they have to pony up for all the cores they want to use for 5 years?  Under HAAS this large base of small users can bring a large amount of resources to bear in a short period.  This is real science and business value created from the same set of resources.

Hardware will always cost $X for Y cores, and HPC admins will run those Y cores for 5 years.  I don't think it is good to drag the users along with that. $X is a constant; there is no reason users should be forced into Y cores for 5 years. They should be able to vary Y and the number of years (or months, or days) up and down until the total area under the curve is $X.
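
As a sketch of that area-under-the-curve idea (the 16-core node is a placeholder, not a Flux spec; only the 5-year life is the usual convention):

# $X buys a fixed area: cores x months is constant, however you slice it.
# The node size here is a placeholder assumption; the 60-month life is the
# typical 5-year hardware lifetime mentioned above.
CONDO_CORES = 16            # hypothetical condo node
CONDO_LIFE_MONTHS = 60      # the usual 5-year hardware lifetime

area = CONDO_CORES * CONDO_LIFE_MONTHS    # 960 core-months for the same $X

for months in (1, 6, 12, 60):
    cores = area / months
    print(f"{cores:6.0f} cores for {months:3d} months = {area} core-months")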

Reorganize how your resources are provided and for the same capital you will see more capability.
