Friday, October 31, 2014

Lustre Stats with Graphite and Logstash

A while back, Matthew Britt, who handles most of our HPC scheduler and resource manager services, got fed up with how hard it was to debug the distributed nature of jobs: we had logs on 1000+ nodes plus the server, scheduler, and accounting manager. To solve this he built a solution around the ELK stack: Elasticsearch, Logstash, and Kibana. His solution is one of the most useful user support and job debugging tools we have ever had. You can see Matt's talk from MoabCon on YouTube.

Once Matt showed me how easy this was, I instantly had an idea: Lustre is another distributed problem. In this case I didn't care about logs, I cared about time-series data, and I had two goals to solve.
  1. What is our filesystem performance over time, in both bandwidth and open/close ops?
  2. Find the users who open 999999 files/s in a single code.
To handle time-series data I used Graphite rather than Elasticsearch. Graphite has been around a while; think of it as RRD, but with a data collector that creates databases on the fly over the network, and with great performance.
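
If you have not used Graphite before, its appeal is how little work it takes to feed it: you send plain "metric-path value timestamp" lines to carbon's line receiver and the databases get created for you. A minimal sketch in Python (the hostname below is a placeholder, not our real Graphite server):

#!/usr/bin/env python
# Minimal sketch of Graphite's plaintext protocol: one "path value timestamp"
# line per metric, sent to carbon's line receiver (TCP port 2003 by default).
import socket
import time

CARBON_HOST = 'graphite.example.com'   # placeholder hostname
CARBON_PORT = 2003                     # carbon's default plaintext port

def send_metric(path, value, timestamp=None):
    if timestamp is None:
        timestamp = int(time.time())
    line = '%s %s %d\n' % (path, value, timestamp)
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
    sock.sendall(line.encode('ascii'))
    sock.close()

send_metric('lustre.scratch.MDT.0000.open', 789470206)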

All the config files used at the time of writing are available on GitHub.

First, some pictures:


So how did I easily get these data? Enter Logstash and the exec {} input. Because Lustre stores all of its summary stats in files like /proc/fs/lustre/[mdt|obdfilter]/*/stats and /proc/fs/lustre/[mdt|obdfilter]/*/exports/*/stats, I had to make my own handler. Why couldn't I just cat the stats files? Logstash, being designed for log files, sees each line as an event and parses each one on its own; in our case I want the entire file as a single event. The solution was a simple Python script that turns the stats into JSON objects.

json-stats-wrapper.py

./json-stats-wrapper.py /proc/fs/lustre/mdt/scratch-MDT0000/md_stats  | python -mjson.tool
{
    "close": "241694077",
    "crossdir_rename": "300771",
    "getattr": "439797690",
    "getxattr": "3393359",
    "link": "117530",
    "mkdir": "1332774",
    "mknod": "1209",
    "open": "789470206",
    "rename": "522526",
    "rmdir": "1289414",
    "samedir_rename": "221755",
    "setattr": "12991707",
    "setxattr": "118798",
    "snapshot_time": "1414810134.237384",
    "source": "/proc/fs/lustre/mdt/scratch-MDT0000/md_stats",
    "statfs": "799026",
    "sync": "43951",
    "unlink": "25767242"
}
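
The real script (json-stats-wrapper.py above) does the actual work; purely for illustration, a stripped-down version of the idea might look like this (a sketch that just takes the counter name and the first value from each line, not the exact script we run):

#!/usr/bin/env python
# Sketch: flatten a Lustre /proc stats file into a single JSON object so
# Logstash can treat the whole file as one event instead of one per line.
import json
import sys

def stats_to_json(path):
    stats = {'source': path}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:
                continue
            # first field is the counter name, second is its value/count
            stats[fields[0]] = fields[1]
    return json.dumps(stats)

if __name__ == '__main__':
    for path in sys.argv[1:]:
        print(stats_to_json(path))
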
Logstash, if told that the input data is JSON, will treat each of these objects as a single event. In general the logstash-lustre.conf and logstash-lustre-mds.conf configs parse each event, including the path to the stats file, grab each counter, and build a metric from it.

You want to be smart about your metric groupings. Luckily for us the Lustre devs did things in a very logical way, and the naming almost falls into our laps. You can use these groups/wildcards with Graphite to quickly make lots of plots of the same metric over all OSTs, MDTs, clients, etc.
lustre.<filesystem>.<OST|MDT>.<index>.<metric> (per-client metrics add a <client> segment before the metric name)
Eg: lustre.scratch.MDT.0000.open
Eg: lustre.scratch.OST.*.10-255-1-100.read_bytes
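
To make that naming concrete, here is roughly the mapping from a stats path to a metric prefix written out in Python (only a sketch of the logic with made-up example paths; the actual mapping lives in the Logstash configs on GitHub):

import re

# Sketch: turn a Lustre stats path into a Graphite metric prefix.
# /proc/fs/lustre/mdt/scratch-MDT0000/md_stats       -> lustre.scratch.MDT.0000
# .../obdfilter/scratch-OST0003/exports/<nid>/stats  -> lustre.scratch.OST.0003.<client>
def metric_prefix(path):
    m = re.search(r'/(\w+)-(MDT|OST)(\w+)', path)
    fsname, target_type, index = m.groups()
    prefix = 'lustre.%s.%s.%s' % (fsname, target_type, index)
    client = re.search(r'/exports/([^/@]+)', path)
    if client:
        # per-client metrics: dots in the client address become dashes
        prefix += '.' + client.group(1).replace('.', '-')
    return prefix

print(metric_prefix('/proc/fs/lustre/mdt/scratch-MDT0000/md_stats') + '.open')
# lustre.scratch.MDT.0000.open
print(metric_prefix('/proc/fs/lustre/obdfilter/scratch-OST0003/exports/10.255.1.100@o2ib/stats') + '.read_bytes')
# lustre.scratch.OST.0003.10-255-1-100.read_bytes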

I don't calculate any rates in Logstash; I chose to store the raw counter values from the stats files. Graphite has lots of nice built-in functions that let you calculate rates, like nonNegativeDerivative(), which deals with counters that roll over (for example after a reboot). So keep your data quite raw.
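
For example, to turn the raw MDT open counter into a rate at render time you just wrap it in a function in the Graphite UI or render URL:
Eg: nonNegativeDerivative(lustre.scratch.MDT.0000.open)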

Be careful with the number of metrics and your Graphite storage schema. We keep our per-OST/MDT data for a year (10s:7d,10m:30d,60m:180d,6h:1y) and per-client stats for 30 days (2m:30d). In our case we have over 1000 clients, 30 OSTs, and 1 MDT, and we store 16 metrics/MDT, 16 metrics/MDT/client, 18 metrics/OST, and 4 metrics/OST/client. In all we store 53,300 metrics just for Lustre. This is about 14GB of data right now, and because Graphite pre-allocates its fixed-size Whisper databases it will not grow unless we add more metrics or more clients.
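
For reference, those retentions are just stanzas in carbon's storage-schemas.conf, something along these lines (the stanza names and the client-matching pattern here are illustrative, not copied from our config; carbon uses the first stanza that matches):

[lustre_client]
# per-client stats, matched here by the dashed client-address segment
pattern = ^lustre\..*\.\d+-\d+-\d+-\d+\.
retentions = 2m:30d

[lustre]
pattern = ^lustre\.
retentions = 10s:7d,10m:30d,60m:180d,6h:1y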

The more likely problem is that you send metrics too fast for Graphite and the number of IOs your Graphite server's disk can provide won't be up to the task. In our case we update OST/MDT summary stats every 10 seconds and per-client OST/MDT stats every 2 minutes. This is being handled by a single 7,200 RPM SATA drive with some tweaking.

Future work:
  • Use Logstash to alert us on slow_attr and LBUG events.
  • Use the DDN SFA SNMP MIB to get raw SFA counters into Graphite.
  • Alert if an OST is deactivated or goes into/out of recovery.
