Sunday, January 15, 2012

Condo Clusters are Inefficient

As I noted in my post about HPC funding models in higher education, the condo cluster is probably the most popular funding model. A condo is a federation of many private clusters owned by faculty but managed as a single unit, sharing common administration and infrastructure.  Purdue is probably the best-known condo provider.

I see two large inefficiencies in condos and one lost strategic advantage:
  • Capital utilization is low due to the private nature of each condo.
  • Because the cost of purchased hardware is sunk, frivolous utilization is encouraged.
  • The lost advantage is flexibility: the ability to reallocate resources for emergencies and for competitive advantage.
We have a phenomenon we call the cluster hugger. When you can attach a name and an account to hardware, faculty really think of the hardware as theirs and, rightfully, expect access to it in the same way they expect access to their desktop: it is there when they want it. I have experienced this when questions come in from users of our condo and they refer to their nodes as "Dr. X's cluster" or "our cluster".  Managing these expectations is difficult.

The result is almost always that access to a condo's nodes is allowed only to its members, and access for everyone else is non-existent or severely limited. This leads to low utilization due to a lack of workload diversity inside condos.  It also consumes data center space, network ports, etc. for machines that are on average 50% idle.
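To put a rough number on what idle hardware costs, here is a minimal sketch of the arithmetic. Only the 50% idle figure comes from above; the node price, core count, lifetime, and the 90% figure for a busier shared pool are made-up assumptions for illustration, and the function name is my own.

```python
def cost_per_used_core_hour(purchase_cost, cores, lifetime_years, utilization):
    """Hardware cost divided by the core-hours that actually ran jobs."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hours = lifetime_years * 365 * 24
    used_core_hours = cores * hours * utilization
    return purchase_cost / used_core_hours


# Hypothetical node: $5,000, 16 cores, 4-year life (illustrative numbers only).
half_idle = cost_per_used_core_hour(5000, 16, 4, 0.50)  # private condo node
busy = cost_per_used_core_hour(5000, 16, 4, 0.90)       # assumed shared pool

print(f"50% utilized: ${half_idle:.4f} per used core-hour")
print(f"90% utilized: ${busy:.4f} per used core-hour")
print(f"ratio: {half_idle / busy:.2f}x")
```

Whatever the real prices are, the ratio depends only on the two utilization figures: the same hardware costs 0.90 / 0.50 = 1.8 times as much per useful core-hour when it sits half idle.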

Because researchers pay the full cost up front and these nodes are not always utilized, the marginal cost of running a job on them is very low. So jobs that are very low priority, or even frivolous, work that no agency would be willing to fund, are allowed to run.  If the hardware is there, why not?

I actually think this is a good thing for condos. If marginal cost is low, you might as well utilize the hardware. My proposed replacement puts the squeeze on this sort of marginal work.

Condos are slow-changing and static. They also have high start-up costs: because researchers own the hardware, its cost is large, one-time, and paid up front. This pushes out small projects that could benefit from HPC and so limits access to HPC resources. If a group could rent just 100 cores for a week, it would cost less than one machine and provide greater benefit.  Condos also do not allow for bursting, that is, getting resources quickly in an emergency situation.
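The rent-versus-buy comparison can be sketched the same way. The node price and the rental rate below are hypothetical placeholders, not quotes from any provider, and the helper names are mine. The sketch only compares upfront outlay; it ignores that a purchased node keeps running for years.

```python
HOURS_PER_WEEK = 7 * 24


def rental_cost(cores, weeks, rate_per_core_hour):
    """Total cost of renting `cores` cores for `weeks` weeks."""
    return cores * weeks * HOURS_PER_WEEK * rate_per_core_hour


def break_even_rate(node_price, cores, weeks):
    """Rate per core-hour at which the rental costs exactly one node."""
    return node_price / (cores * weeks * HOURS_PER_WEEK)


# 100 cores for one week is 16,800 core-hours.
# Hypothetical inputs: a $5,000 node and a $0.05 per core-hour rental rate.
week_of_100_cores = rental_cost(100, 1, 0.05)
limit = break_even_rate(5000, 100, 1)

print(f"rental: ${week_of_100_cores:.2f}")
print(f"renting is cheaper than one node below ${limit:.4f} per core-hour")
```

Under these assumed numbers the week of 100 cores is cheaper than the node at any rate below roughly 30 cents per core-hour. Plug in your own prices to see where the line falls.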

In a later post I will put down my thoughts on a solution to these problems.

