Sunday, January 25, 2015

Modern compression tools are fast!

This morning I played with some compression tools on a new 28-core machine (Xeon E5-2695 v3 @ 2.30GHz).

I used Python to create a 1 GB string of random fake DNA and dumped it to a text file:

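   import random
   bytesize = 1024**3  # 1 GB; assumed, the original value was not shown
   dnalist = list('ACGTACGTACGTACGT')
   # join() avoids the quadratic cost of repeated string concatenation
   dnastr = ''.join(random.choice(dnalist) for i in range(bytesize))
   with open('dna.txt', 'w') as f:  # file name is an assumption
       f.write(dnastr)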
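
A useful yardstick for the results that follow: since the generator above draws the four bases with equal probability, each character carries exactly 2 bits of information, so no lossless compressor can do better than a quarter of the input size on average. The sketch below works that out, assuming "1GB" means 1024 MB and one byte per base; the 293 MB figure is the gzip result reported in this post.

```python
import math

# Assumption: four bases, each drawn with equal probability,
# as in the generator above.
alphabet_size = 4
bits_per_base = math.log2(alphabet_size)        # information per character

file_size_mb = 1024                              # assumed: 1 GB = 1024 MB, one byte per base
min_size_mb = file_size_mb * bits_per_base / 8   # each input byte stores 8 bits

print(f"bits per base: {bits_per_base}")
print(f"smallest possible output: {min_size_mb:.0f} MB")
print(f"gzip -6 result (293 MB) is {293 / min_size_mb:.2f}x that bound")
```

In other words, this test mostly measures how cheaply each tool can approach a fixed 4:1 limit; there is no longer-range structure in the data for a compressor to exploit.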

Then I tried standard gzip as well as the newer lz4 and lzo, and the highly parallel pigz compressor, which produces gzip-compatible archives:

Tool   Compression level   File size (MB)   Run time (s)
gzip   6                   293              111
lz4    1                   693              15
lz4    6                   466              69
lzo    6                   500              6
lzo    7                   399              412
pigz   6                   292              5
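
To make the rows easier to compare, the numbers can be turned into a compression ratio and a throughput figure. This is only a sketch: `summarize` is a hypothetical helper, and the 1024 MB input size is an assumption about what "1GB" means here.

```python
# (tool, level, compressed size in MB, run time in s), copied from the table above
results = [
    ("gzip", 6, 293, 111),
    ("lz4", 1, 693, 15),
    ("lz4", 6, 466, 69),
    ("lzo", 6, 500, 6),
    ("lzo", 7, 399, 412),
    ("pigz", 6, 292, 5),
]

INPUT_MB = 1024  # assumption: the "1GB" input is 1024 MB


def summarize(rows, input_mb=INPUT_MB):
    """Return (tool, level, compression ratio, throughput in MB/s) per row."""
    return [
        (tool, level, input_mb / size_mb, input_mb / seconds)
        for tool, level, size_mb, seconds in rows
    ]


for tool, level, ratio, speed in summarize(results):
    print(f"{tool:5} level {level}: ratio {ratio:.2f}, {speed:6.1f} MB/s")
```

Since the run times are rounded to whole seconds, the throughput for the fastest rows (5 and 6 seconds) is only a rough estimate.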

lz4 is certainly faster than gzip, at the price of a lower compression ratio. However, in this test it is not quite as impressive as in these benchmarks.

lzo does really well: at level 6 it is much faster than lz4 (6 s versus 69 s) while delivering similar compression (500 MB versus 466 MB). Levels 7-9 are really not that useful though; level 7 already takes 412 s to reach 399 MB.

lz4 claims to have much faster decompression times than lzo, but I cannot confirm this here. Both tools take about 6 seconds to decompress and restore the 1 GB file.

pigz shows what can be done with raw compute power. top showed 2800% CPU utilization on this 28-core Linux system, so compression seems to scale almost linearly with the number of cores. Decompression takes about 3 seconds. Here the local RAID array, which can write 300-400 MB/s, may be the limiting factor.
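
The scaling and I/O remarks can be checked with a little arithmetic on the measured times. The sketch below assumes that gzip and pigz at level 6 do comparable work per byte and that the restored file is 1024 MB; both are assumptions, and the timings are rounded to whole seconds.

```python
gzip_seconds = 111   # single-threaded gzip, level 6 (from the table)
pigz_seconds = 5     # pigz, level 6 (from the table)
cores = 28

speedup = gzip_seconds / pigz_seconds   # how many times faster pigz finished
efficiency = speedup / cores            # fraction of ideal linear scaling

# Assumption: the restored file is 1024 MB and decompression took 3 s.
write_rate = 1024 / 3                   # MB/s the array has to absorb

print(f"speedup: {speedup:.1f}x, parallel efficiency: {efficiency:.0%}")
print(f"decompression write rate: {write_rate:.0f} MB/s")
```

That works out to roughly a 22x speedup (about 79% of ideal on 28 cores) and a write rate of about 340 MB/s during decompression, which sits inside the 300-400 MB/s the array can sustain and supports the idea that the disks, not the CPU, set the pace there.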

1 comment:

Tantrem said...


- The kind of "access pattern" your synthetic test creates might be misleading. In particular, it may lead to wrong conclusions. It's better than a file full of zeros, or pure random noise, but in most circumstances a "real sample" is more representative than synthetic data.

- I guess you already noticed that you are comparing single-threaded programs with multi-threaded ones.

- The kind of speed that LZ4/LZO can reach tends to be untestable from a command-line tool. They are designed as memory-to-memory compression algorithms. File compression is limited by the HDD I/O interface, which tends to dominate the test time.
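
One way to act on the last point is to time compression and decompression entirely in memory, so that disk I/O is out of the picture. The sketch below is only an illustration of that measurement approach: it uses Python's built-in zlib as a stand-in, because lz4 and lzo bindings are third-party packages, and `time_in_memory` is a hypothetical helper. The sample is 1 MB rather than 1 GB so that it runs quickly.

```python
import random
import time
import zlib


def time_in_memory(data, level):
    """Compress and decompress `data` entirely in RAM; return size and timings."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(packed)
    decompress_s = time.perf_counter() - start

    assert restored == data  # the round trip must be lossless
    return len(packed), compress_s, decompress_s


# Small synthetic sample (1 MB of random bases); the post used 1 GB.
random.seed(0)
sample = ''.join(random.choices('ACGT', k=1_000_000)).encode('ascii')

for level in (1, 6):
    size, c_s, d_s = time_in_memory(sample, level)
    print(f"zlib level {level}: {size / len(sample):.2%} of original, "
          f"compress {c_s:.3f} s, decompress {d_s:.3f} s")
```

Swapping in a real sample file instead of the synthetic string would address the first point as well.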