Saturday, November 1, 2014

Optimizing Star Cluster on AWS

This is a cross post of my own notes from the day job at Advanced Research Computing.

In a previous post we pointed to instructions and some common tasks for using Star Cluster to create an HPC cluster on the fly in Amazon Web Services.  Now we will focus on some options for optimizing your use of Star Cluster.

Cutting the Fat - Your Head Node is Too Big

By default Star creates a head node and N compute nodes. You select the instance type with NODE_INSTANCE_TYPE, but the same type is used for both the head/master and the compute nodes. In most cases your head node need not be so big. Un-comment MASTER_INSTANCE_TYPE and set it to a more modest instance size to control costs. A c3.large for your head node at $0.105/hr is much better than paying $1.68/hr for a c3.8xlarge, but you probably still want the c3.8xlarge for compute nodes because it has a lower cost per unit of CPU performance.
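To put a number on that difference, here is a minimal sketch using only the two hourly prices quoted above. The function name and the 720-hour month are illustrative assumptions, and the prices are the 2014 figures from this post, so check current AWS pricing before relying on them.

```python
# Hourly on-demand prices quoted in this post (2014 figures).
C3_LARGE_PER_HOUR = 0.105
C3_8XLARGE_PER_HOUR = 1.68


def head_node_savings(hours, big=C3_8XLARGE_PER_HOUR, small=C3_LARGE_PER_HOUR):
    """Return the dollars saved by running the head node on the smaller type."""
    return (big - small) * hours


# Hypothetical usage: a head node left running for a 30-day month (720 hours).
print(f"${head_node_savings(720):.2f} saved per month")
```

For a head node that stays up all month, the smaller instance saves roughly $1,134 at these rates, before any compute node costs are counted.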

Use Spot For Lower Costs

With Spot instances you bid the price you are willing to pay for AWS's spare capacity. Star can use spot, but only for compute nodes. That is fine: compute nodes are your main cost, and you don't want your head node killed when the price rises above your bid.

Almost every Star command that starts an instance can submit it as a spot bid using the -b <bid price> option.

Switch Regions and/or Availability Zone

AWS is made up of regions, which in turn are made up of availability zones. Each region has its own pricing; my favorites are Virginia (us-east-1) and Oregon (us-west-2) for the lowest prices. By default Star uses us-east-1, but I mostly switch to us-west-2. Why? Lower spot prices! The graphs from the AWS console spot price history for c3.8xlarge (the fastest single node on AWS) in both regions show the difference.
c3.8xlarge us-east-1 24h price history
c3.8xlarge us-west-2 24h price history
The spot price for compute power on us-west-2 is on average much lower than on us-east-1.  Be sure to really think about how spot works: you can bid high, and it is then possible to pay more than the on-demand rate for a few hours. But a high bid keeps your nodes from being killed, and the total spend (the area under the curve) should still be much lower than what you would have paid On-Demand.
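The "area under the curve" argument can be sketched in a few lines. This is a simplified model, not AWS's actual billing logic: it assumes you pay the market price for each whole hour and that a node is killed for good the first time the market price exceeds your bid. The price trace is made up for illustration; only the $1.68/hr on-demand rate comes from this post.

```python
def spot_cost(hourly_prices, bid):
    """Sum the market price hour by hour until the price exceeds the bid.

    Returns (total_cost, hours_run) under the simplified model described
    in the text: whole-hour billing at the market price, and termination
    the first time the market price rises above the bid.
    """
    total = 0.0
    hours = 0
    for price in hourly_prices:
        if price > bid:
            break
        total += price
        hours += 1
    return total, hours


# Hypothetical 6-hour price trace in $/hr with one spike above on-demand.
prices = [0.30, 0.32, 2.00, 0.31, 0.30, 0.29]
on_demand = 1.68  # c3.8xlarge on-demand rate quoted earlier in this post

low_cost, low_hours = spot_cost(prices, bid=0.50)    # killed at the spike
high_cost, high_hours = spot_cost(prices, bid=3.00)  # survives the spike

print("low bid: ", low_hours, "hours,", round(low_cost, 2), "dollars")
print("high bid:", high_hours, "hours,", round(high_cost, 2), "dollars")
print("on-demand for the same hours:", round(on_demand * len(prices), 2), "dollars")
```

In this made-up trace the high bid pays more than on-demand for one hour, yet the job runs to completion and the total is still well under the on-demand bill, while the low bid loses its node at the spike.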

Changing regions in Star Cluster:

Update the region name, the region host (the machine that accepts AWS API commands from Star), and the availability zone.
# get values from:
$ starcluster listregions
$ starcluster listzones
# then set them in the [aws info] section of your config:
AWS_REGION_NAME = us-west-2
AWS_REGION_HOST = ec2.us-west-2.amazonaws.com
Create a new key pair for each region: repeat the key creation step from the Star Cluster Quick Start with a new key name, then update the [key key-name-here] section and your cluster config to match.
$ starcluster createkey mykey-us-west-2 -o ~/.ssh/mykey-us-west-2.rsa

[key mykey-us-west-2]
KEY_LOCATION = ~/.ssh/mykey-us-west-2.rsa
AMI IDs are also per region, so when you switch regions you need to update the image to boot (NODE_IMAGE_ID). In general, select an HVM image.
#get list of AMI images for region
$ starcluster listpublic 
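
If you switch regions often, the edits above can be scripted. The following is only a sketch using Python's standard configparser, not part of Star itself. The function name, the cluster name `smallcluster`, the zone, and the AMI IDs are placeholders I made up for illustration; take real values from `starcluster listzones` and `starcluster listpublic`. Note that configparser drops comments when it writes the file back, so keep a copy of your original config.

```python
import configparser
import io


def switch_region(config_text, cluster, region, zone, keyname, key_path, image_id):
    """Return config_text with the region-specific settings replaced."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names in upper case
    parser.read_string(config_text)

    # Region name and API endpoint live in the [aws info] section.
    aws = parser["aws info"]
    aws["AWS_REGION_NAME"] = region
    aws["AWS_REGION_HOST"] = f"ec2.{region}.amazonaws.com"

    # Each region needs its own key pair section.
    parser[f"key {keyname}"] = {"KEY_LOCATION": key_path}

    # Zone, key name, and AMI are set per cluster template.
    section = parser[f"cluster {cluster}"]
    section["AVAILABILITY_ZONE"] = zone
    section["KEYNAME"] = keyname
    section["NODE_IMAGE_ID"] = image_id

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical minimal config; the AMI IDs are placeholders, not real images.
sample = """
[aws info]
AWS_REGION_NAME = us-east-1

[cluster smallcluster]
KEYNAME = mykey
NODE_IMAGE_ID = ami-00000000
"""

updated = switch_region(
    sample,
    cluster="smallcluster",
    region="us-west-2",
    zone="us-west-2a",
    keyname="mykey-us-west-2",
    key_path="~/.ssh/mykey-us-west-2.rsa",
    image_id="ami-11111111",
)
print(updated)
```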

Get a Reserved Instance for your Head/Master Node

If you know you are going to have your Star cluster up and running for a decent amount of time, look into Reserved Instances. Discounts of close to 50% can be had compared to on-demand pricing.  There are also light, medium, and heavy reserved types, which match how often you expect your Star head/master node to be running. Discounts vary by instance type and term, so refer to AWS pricing to figure out if this makes sense for you.  You can even buy remaining reserved time from another AWS user, or sell your own unused reserved time, on the Reserved Instance Marketplace. Be careful: reserved contracts are tied to regions and availability zones, so if you plan to move between these to chase lower spot costs your contract won't follow you.

Switch to Instance Store

By default Star uses EBS volumes for the disk on each node. EBS is very simple and allows data stored on the worker nodes to persist even when they are shut down, but it has an extra cost. The cost of a few GB of EBS will be small compared to the compute costs, but if you plan to have a large cluster it can add up to real money.  Consider instance storage, which Star supports.  With instance store the compute node boots by copying the Star AMI image.

Most users' clusters will not be up long enough for this to matter; contact us if you want to switch. Just remember to terminate your cluster, not just stop it.  If you stop it, the EBS volumes remain.
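
To see how the EBS charge scales with cluster size, here is a rough sketch. The volume size and the price per GB-month are hypothetical numbers chosen for illustration, not figures from AWS or from this post, so substitute the current EBS price for your region.

```python
def monthly_ebs_cost(nodes, gb_per_node, price_per_gb_month):
    """Return the monthly EBS charge in dollars for the cluster's node volumes."""
    return nodes * gb_per_node * price_per_gb_month


# Hypothetical numbers: a 10 GB volume per node at $0.10 per GB-month.
for nodes in (4, 100):
    cost = monthly_ebs_cost(nodes, gb_per_node=10, price_per_gb_month=0.10)
    print(nodes, "nodes:", round(cost, 2), "dollars per month")
```

The charge grows linearly with node count, and it keeps accruing for as long as the volumes exist, whether or not the nodes are running.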
