UPDATE: At the request of a friend I looked into zstd, and wow, it's a great option. As it becomes more ubiquitous it should likely replace most compressors: compression similar to xz and speed approaching lz4, for a modest CPU increase.
Data volumes keep growing, and single-core performance is improving more slowly than core counts. Compressing large volumes of data quickly therefore requires compressors that can use multiple cores, both to keep up with the data and to make use of the hardware investment.
Luckily, several compatible parallel compressors are available, but how do they perform compared to classic gzip? And how well do they work on scientific data? Scientific data sets often contain a few very large files, usually binary, alongside thousands of small files that are compressible.
All tests were done on the Great Lakes login node. The properties of this node are:
- 36 core 36 thread Intel Xeon 6154
- 192 GB Memory
- 1.9PB GPFS File System
- 100Gbps HDR Network
The data set has the following properties:
- 6649 files
- 276 directories
- 221 GB total size
Size distribution (size range, number of entries):
[ 0.000 B - 0.000 B ) 1
[ 0.000 B - 1.000 KB ) 560
[ 1.000 KB - 1.000 MB ) 4935
[ 1.000 MB - 10.000 MB ) 1175
[ 10.000 MB - 100.000 MB ) 116
[ 100.000 MB - 1.000 GB ) 94
[ 1.000 GB - 10.000 GB ) 43
[ 10.000 GB - 100.000 GB ) 1
[ 100.000 GB - 1.000 TB ) 0
[ 1.000 TB - MAX ) 0
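A size distribution of this kind can be gathered with standard tools. The following is a minimal sketch and not the tool that produced the listing above: the function name size_histogram and the coarser bucket labels are assumptions made for illustration, and it counts regular files only.

```shell
# size_histogram DIR: count regular files under DIR by size bucket.
# Hypothetical helper; bucket edges are powers of 1024.
size_histogram() {
    # wc -c prints "<bytes> <path>"; when given several files it also
    # prints a "<bytes> total" summary line, which awk skips below.
    find "$1" -type f -exec wc -c {} + | awk '
        NF == 2 && $2 == "total" { next }
        {
            s = $1
            if      (s == 0)            b = 0
            else if (s < 1024)          b = 1
            else if (s < 1024^2)        b = 2
            else if (s < 10 * 1024^2)   b = 3
            else if (s < 100 * 1024^2)  b = 4
            else if (s < 1024^3)        b = 5
            else                        b = 6
            n[b]++
        }
        END {
            split("0B <1KB <1MB <10MB <100MB <1GB >=1GB", label, " ")
            for (i = 0; i <= 6; i++) printf "%-8s %d\n", label[i + 1], n[i] + 0
        }'
}
```

Calling size_histogram on the top-level data directory prints one line per bucket, with the bucket label followed by the file count.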
The results compare runtime and final archive size against serial gzip. Each archive was created by passing the compressor to tar with -I, for example:
tar -I pigz -cf myarchive.tar.gz <directory>
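A comparison against serial gzip can be scripted around that tar invocation. This is a rough sketch under stated assumptions, not the harness used for these tests: bench_tar is a hypothetical helper, GNU tar is assumed for -I, and timing uses date +%s, so it only resolves whole seconds.

```shell
# bench_tar COMPRESSOR SRC_DIR ARCHIVE
# Hypothetical helper: runs one tar job through the given compressor and
# prints "<compressor><TAB><seconds><TAB><archive bytes>".
bench_tar() {
    start=$(date +%s)
    # -I hands the tar stream to the named compressor (GNU tar).
    tar -I "$1" -cf "$3" -C "$2" . || return 1
    end=$(date +%s)
    size=$(wc -c < "$3")
    printf '%s\t%s\t%s\n' "$1" "$((end - start))" "$((size))"
}

# Example (run manually; skips compressors that are not installed):
# for c in gzip pigz lbzip2 lz4; do
#     command -v "$c" >/dev/null && bench_tar "$c" mydata "/tmp/test.tar.$c"
# done
```

Dividing each run's seconds and bytes by the gzip row gives the relative runtime and archive size. Write the archive outside the source directory so that tar does not try to include its own output.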
pigz compresses in parallel but gains very little speedup on decompression
xz requires the -T0 option to use all cores in the system; otherwise it defaults to 1 thread
xz cannot decompress files in parallel, but pixz can
lbzip2 and mpibzip2 can only decompress in parallel if the archive was compressed with a parallel-aware compressor
lz4 is not parallel-aware; it is by far the fastest compressor of all, but gives the least space savings
zstd requires the -T0 option to use all cores; otherwise it defaults to 1 thread
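The threading notes above can be folded into a small lookup so that tar always receives the right flags. This is only a sketch: compressor_cmd is a hypothetical helper name, and it assumes versions of xz and zstd that accept -T0.

```shell
# compressor_cmd NAME: print the command string to pass to tar -I.
# Hypothetical helper; xz and zstd get -T0 because they default to a
# single thread, the others are passed through unchanged.
compressor_cmd() {
    case "$1" in
        xz)   echo "xz -T0" ;;
        zstd) echo "zstd -T0" ;;
        gzip|bzip2|pigz|lbzip2|lz4) echo "$1" ;;
        *)    echo "unknown compressor: $1" >&2; return 1 ;;
    esac
}

# Example (run manually):
# tar -I "$(compressor_cmd zstd)" -cf myarchive.tar.zst mydata
```

Quoting the command substitution matters here: tar needs the compressor and its option as a single -I argument.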
Overall, the drop-in replacements for gzip and bzip2 are an obvious improvement on modern multi-core systems. While xz and lz4 are available on almost all modern systems, they are still less portable than the gzip- and bzip2-based compressors.
lz4 is very interesting: it is so fast that it uses almost no CPU. For collecting data on a lower-powered device, lz4 appears to be 'compression for free'. While it is not as effective as the other compressors, it has almost no performance impact during tar/untar.
One would hope that over time the stock installs of gzip and bzip2 are replaced by the parallel versions. xz is very stable and still returns the best compression ratio, but it struggles to utilize the very high core counts of modern systems.