Sunday, January 25, 2015

Modern compression tools are fast!

This morning I played with some compression tools on a new 28-core machine (Xeon E5-2695 v3 @ 2.30GHz).

I used Python to create a 1 GB string of fake random DNA and dumped it to a text file:

   import sys, random
   # 16-letter pool; each of the four bases is equally likely
   dnalist = list('ACGTACGTACGTACGT')
   bytesize = 1024*1024*1024  # 1 GB
   # join is much faster than repeated += on a growing string
   dnastr = ''.join(random.choice(dnalist) for _ in range(bytesize))
   sys.stdout.write(dnastr)


Then I tried standard gzip as well as the new lz4, lzo, and the highly parallel pigz compressor, which produces gzip-compatible archives:

Tool   Compression level   File size (MB)   Run time (s)
gzip   6                   293              111
lz4    1                   693              15
lz4    6                   466              69
lzo    6                   500              6
lzo    7                   399              412
pigz   6                   292              5
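
The table is easier to compare once each row is turned into a compression ratio and a throughput. The Python sketch below only redoes arithmetic on the figures above. It assumes the input is exactly 1024 MB and that "MB" means 2^20 bytes, neither of which is stated explicitly; the names `RESULTS` and `summarize` are made up for this illustration.

```python
# Figures copied from the table: (tool, level, compressed size in MB, run time in s).
RESULTS = [
    ("gzip", 6, 293, 111),
    ("lz4", 1, 693, 15),
    ("lz4", 6, 466, 69),
    ("lzo", 6, 500, 6),
    ("lzo", 7, 399, 412),
    ("pigz", 6, 292, 5),
]

INPUT_MB = 1024  # assumption: 1024*1024*1024 bytes, counted in MB of 2**20 bytes


def summarize(results, input_mb=INPUT_MB):
    """Return one dict per row with compression ratio and input throughput."""
    rows = []
    for tool, level, size_mb, seconds in results:
        rows.append({
            "tool": tool,
            "level": level,
            "ratio": input_mb / size_mb,     # original size / compressed size
            "mb_per_s": input_mb / seconds,  # input MB consumed per second
        })
    return rows


if __name__ == "__main__":
    for row in summarize(RESULTS):
        print(f"{row['tool']:5s} level {row['level']}  "
              f"ratio {row['ratio']:.2f}  {row['mb_per_s']:7.1f} MB/s")
```

Under these assumptions gzip works out to roughly 9 MB/s and pigz to roughly 205 MB/s at almost the same ratio (about 3.5). Because the text is uniformly random over four letters, each character carries about 2 bits, so a ratio of 4 (256 MB) is the practical ceiling for this file.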

lz4 is certainly faster than gzip, at the price of a lower compression ratio. However, in this test it is not quite as impressive as in these benchmarks: https://code.google.com/p/lz4/

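lzo is doing really well: at level 6 it is more than ten times faster than lz4 at the same level (6 s vs. 69 s) while delivering similar compression (500 MB vs. 466 MB). Levels 7-9 are really not that useful though: level 7 takes 412 s, almost 70 times longer than level 6, to shrink the file from 500 MB to 399 MB.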

lz4 claims to have much faster decompression times than lzo, but I cannot confirm this here. Both tools take about 6 seconds to decompress and restore the 1 GB file.
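
For readers who want to repeat this kind of timing without installing anything, here is a minimal sketch of a compress/decompress stopwatch. It is not the method used for the numbers above: lz4 and lzo are not in Python's standard library, so it swaps in `zlib` (the DEFLATE algorithm behind gzip) and uses a small in-memory sample instead of the 1 GB file. The helper names `random_dna` and `time_roundtrip` are hypothetical.

```python
import random
import time
import zlib


def random_dna(n, seed=0):
    """Return n bytes of uniformly random A/C/G/T (seeded for repeatability)."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=n)).encode("ascii")


def time_roundtrip(data, level=6):
    """Compress and decompress data with zlib; return (size, t_compress, t_decompress)."""
    t0 = time.perf_counter()
    packed = zlib.compress(data, level)
    t1 = time.perf_counter()
    restored = zlib.decompress(packed)
    t2 = time.perf_counter()
    if restored != data:
        raise ValueError("round trip did not restore the original data")
    return len(packed), t1 - t0, t2 - t1


if __name__ == "__main__":
    sample = random_dna(1024 * 1024)  # 1 MB sample, far smaller than the real test
    size, c_s, d_s = time_roundtrip(sample)
    print(f"{len(sample)} -> {size} bytes, "
          f"compress {c_s:.3f} s, decompress {d_s:.3f} s")
```

Timings from an in-memory sample like this exclude disk I/O, so they are not directly comparable with the command-line runs on the 1 GB file.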

pigz shows what can be done with raw compute power. top showed 2800% CPU utilization on this 28-core Linux system. It seems to scale almost linearly with the number of cores. Decompression takes about 3 seconds. Here the local RAID array may be the limiting factor: it can write 300-400 MB/s.
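
As a rough check on the scaling and I/O remarks, the speedup, per-core efficiency and decompression write rate can be computed from the numbers already given. This is only a sketch: it assumes single-threaded gzip at level 6 is a fair baseline for pigz at level 6, that the file is 1024 MB, and that the rounded timings are accurate enough. The function names are made up for illustration.

```python
def speedup(serial_s, parallel_s):
    """How many times faster the parallel run is than the serial one."""
    return serial_s / parallel_s


def efficiency(serial_s, parallel_s, cores):
    """Fraction of ideal linear scaling achieved (1.0 would be perfect)."""
    return speedup(serial_s, parallel_s) / cores


def write_rate_mb_s(file_mb, seconds):
    """Average rate at which the restored file is written out."""
    return file_mb / seconds


# Numbers from the post; FILE_MB is an assumption (1 GB taken as 1024 MB).
GZIP_S = 111
PIGZ_S = 5
CORES = 28
FILE_MB = 1024
DECOMPRESS_S = 3
RAID_WRITE_MB_S = (300, 400)

if __name__ == "__main__":
    print(f"speedup    {speedup(GZIP_S, PIGZ_S):.1f}x")
    print(f"efficiency {efficiency(GZIP_S, PIGZ_S, CORES):.0%}")
    rate = write_rate_mb_s(FILE_MB, DECOMPRESS_S)
    low, high = RAID_WRITE_MB_S
    print(f"write rate {rate:.0f} MB/s, within RAID range: {low <= rate <= high}")
```

With these inputs the speedup is about 22x on 28 cores, or roughly 79% of ideal, and writing 1024 MB in about 3 s is about 341 MB/s, which falls inside the 300-400 MB/s the array can sustain. That is consistent with the disk, not the CPU, limiting decompression.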