Thursday, December 4, 2014

Finding the Data Network Bottleneck with perfSONAR and BWCTL

This is a cross post from the day job.

UPDATE: ITS page on perfSONAR
Java based (no software install required) bandwidth test.

Networks are famous for getting the blame when things are slow. It would be wonderful to run tools like IPerf3 at points along the network to pin down whether the network is the problem and, if so, where it goes bad.  We all love data, so how can we collect it?

The problem with IPerf is that a server has to be running on the remote end; if you don't have access to a machine on the other end, you can't run a test.  Enter perfSONAR, a way of registering network test services that allows both authenticated and anonymous bandwidth, ping, and other tests.

A perfSONAR node publishes a list of tests and limits what an external anonymous tester can run against it.  By using a perfSONAR node on the network along your data path, you can find out whether the network can hit the speeds you expect.  In this example we will focus on the Bandwidth Test Controller, or BWCTL.  BWCTL handles the communication with the perfSONAR box and then relies on popular tools such as IPerf, IPerf3, Nuttcp, etc. to run the actual tests.

To run your tests you will need two things. The first is an install of BWCTL along with the test tools supported by the endpoints you use; use IPerf3, as most endpoints support it.  Most major distributions have packages for BWCTL; if yours does not, you can build it from the sites linked above.

You will also need a perfSONAR server to test against. At Michigan, as part of upgrading the backbone to 100Gig along with other links, ITS has installed a series of perfSONAR boxes in each datacenter near the network core.  This is where you should start: make sure you get good performance between your machine and the core servers.

There is a directory of perfSONAR deployments worldwide. As of this writing there are 850 BWCTL servers in the directory. For a list of boxes at umich.edu or another domain, you can filter the directory down to that domain. An example server is ntap-dc-mdc-10g.umnet.umich.edu, the perfSONAR server in the Modular Data Center, which is the datacenter that houses Flux and the data transfer node flux-xfer.engin.umich.edu.

With BWCTL and IPerf3 installed and the hostname of the perfSONAR server in hand, we can run tests:
bwctl -c "ntap-dc-mdc-10g.umnet.umich.edu:4823" -T iperf3 -t 20
[ ID] Interval           Transfer     Bandwidth       Retr
[ 17]   0.00-20.04  sec  14.9 GBytes  6.38 Gbits/sec    0             sender
[ 17]   0.00-20.04  sec  14.9 GBytes  6.37 Gbits/sec                  receiver
If we see 0 retransmits (the Retr column) and decent bandwidth, things are looking good. Next, test the network in the other direction using the -o and -s options in place of -c:
bwctl -o -s "ntap-dc-mdc-10g.umnet.umich.edu:4823" -T iperf3 -t 20
[ ID] Interval           Transfer     Bandwidth       Retr
[ 17]   0.00-20.04  sec  9.85 GBytes  4.22 Gbits/sec    0             sender
[ 17]   0.00-20.04  sec  9.85 GBytes  4.22 Gbits/sec                  receiver
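
If you run these tests often, it can help to pull the numbers out of the output instead of reading them by eye. Here is a minimal Python sketch, assuming the summary rows look exactly like the ones above (GBytes and Gbits/sec units only). The names SUMMARY_RE and parse_summary are my own and not part of BWCTL or IPerf3. It also recomputes the bandwidth from the Transfer column as a sanity check, relying on IPerf3 reporting GBytes in units of 2**30 bytes and Gbits in units of 10**9 bits.

```python
import re

# Matches iperf3 summary rows such as:
# "[ 17]   0.00-20.04  sec  14.9 GBytes  6.38 Gbits/sec    0             sender"
# Assumption: only GBytes / Gbits/sec units appear, as in the output above.
SUMMARY_RE = re.compile(
    r"\[\s*\d+\]\s+"
    r"(?P<start>[\d.]+)-(?P<end>[\d.]+)\s+sec\s+"
    r"(?P<gbytes>[\d.]+)\s+GBytes\s+"
    r"(?P<gbits>[\d.]+)\s+Gbits/sec"
    r"(?:\s+(?P<retr>\d+))?\s+"          # Retr is only printed on the sender row
    r"(?P<role>sender|receiver)"
)


def parse_summary(output):
    """Return one dict per sender/receiver summary row found in the text."""
    rows = []
    for m in SUMMARY_RE.finditer(output):
        seconds = float(m["end"]) - float(m["start"])
        gbytes = float(m["gbytes"])
        rows.append({
            "role": m["role"],
            "seconds": seconds,
            "gbits_per_sec": float(m["gbits"]),
            "retransmits": None if m["retr"] is None else int(m["retr"]),
            # Cross-check: GBytes are 2**30 bytes, Gbits are 10**9 bits.
            "recomputed_gbits_per_sec": gbytes * 2**30 * 8 / seconds / 1e9,
        })
    return rows


# The first test from this post, pasted in as sample input.
SAMPLE = """\
[ ID] Interval           Transfer     Bandwidth       Retr
[ 17]   0.00-20.04  sec  14.9 GBytes  6.38 Gbits/sec    0             sender
[ 17]   0.00-20.04  sec  14.9 GBytes  6.37 Gbits/sec                  receiver
"""

for row in parse_summary(SAMPLE):
    print(row["role"], row["gbits_per_sec"], row["retransmits"])
```

The recomputed figure will differ slightly from the reported one because the Transfer column is rounded to three digits.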

Where to go from here?

Choose servers from the directory that sit along the path your data takes; you can find the path using tools like traceroute or tracepath.  Work with your network administrators if the network does appear to be slowing your data transfer.  If the network is the problem because of errors, speeds normally fall very low.  If you are getting roughly half of the link's rated speed, as in our tests above, things are probably ok on the network side.
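
To put a number on that rule of thumb, here is a small Python sketch that expresses a measured rate as a percentage of the link speed. The 10 Gbit/s figure is my assumption for illustration (the test server's hostname contains "10g"); substitute the rated speed of the slowest link on your own path. The function name percent_of_link is hypothetical.

```python
def percent_of_link(measured_gbits, link_gbits):
    """Measured throughput as a percentage of the link's rated speed."""
    if link_gbits <= 0:
        raise ValueError("link speed must be positive")
    return 100.0 * measured_gbits / link_gbits


# Assumption: a 10 Gbit/s link. The measured rates are the two tests above.
LINK_GBITS = 10.0
for label, measured in [("to server", 6.38), ("from server", 4.22)]:
    print(f"{label}: {percent_of_link(measured, LINK_GBITS):.1f}% of link")
```

Under that assumption the two tests land on either side of the 50% mark, with no retransmits in either direction.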

If the network isn't the problem, the file transfer protocol is likely the weak point and should be replaced with tools like bbcp or Globus, which we recommend.  Lastly, make sure the storage system can read or write data at the speed of the network.
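
For a first rough look at the storage side, a sequential write test reported in the same Gbits/sec units makes the comparison with the network numbers easy. The Python sketch below is only an illustration: the function name and default sizes are my own choices, and a file this small mostly measures caches. For a meaningful number, write much more data than the machine has RAM, or use a dedicated storage benchmark.

```python
import os
import tempfile
import time


def write_throughput_gbits(directory, total_mib=64, chunk_mib=4):
    """Write a temporary file in `directory` and return Gbits/sec achieved."""
    chunk = os.urandom(chunk_mib * 1024 * 1024)
    n_chunks = total_mib // chunk_mib
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_chunks):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the data to the device before stopping the clock
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the test file
    total_bits = n_chunks * len(chunk) * 8
    return total_bits / elapsed / 1e9


# Example: test the filesystem that holds the system temp directory.
rate = write_throughput_gbits(tempfile.gettempdir(), total_mib=16)
print(f"sequential write: {rate:.2f} Gbits/sec")
```

Point the directory argument at the filesystem your transfers actually read from or write to, since the temp directory may live on different storage.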

More Reading

ESnet has a great collection of tools and information at Faster Data.
