I used Python to create a 1GB string of fake random DNA and dumped it to a text file:
import sys, random

# Alphabet to sample from; every base is equally likely
dnalist = list('ACGTACGTACGTACGT')
bytesize = 1024*1024*1024  # 1GB, one character per base
dnastr = ''
for i in range(bytesize):
    dnastr += random.choice(dnalist)
sys.stdout.write(dnastr)
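Building the whole string in memory one character at a time is slow. As a rough alternative, the sketch below writes the data in chunks. The function name `write_random_dna`, the chunk size and the `seed` parameter are my own choices for illustration, and `random.choices` needs Python 3.6 or later.

```python
import io
import random


def write_random_dna(out, total_bytes, chunk_size=1024 * 1024, seed=None):
    """Write total_bytes of random A/C/G/T characters to a text stream."""
    rng = random.Random(seed)
    remaining = total_bytes
    while remaining > 0:
        n = min(chunk_size, remaining)
        # Draw n bases at once instead of one call per character
        out.write(''.join(rng.choices('ACGT', k=n)))
        remaining -= n


# Small demonstration into an in-memory buffer
buf = io.StringIO()
write_random_dna(buf, 1000, seed=0)
print(len(buf.getvalue()))
```

Calling `write_random_dna(sys.stdout, 1024 * 1024 * 1024)` should produce a file of the same size as the one used here, without holding all of it in memory.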
Then I tried standard gzip, the newer lz4 and lzo, and the highly parallel pigz compressor, which produces gzip-compatible archives:
Tool | compression level | file size (MB) | run time (s) |
---|---|---|---|
gzip | 6 | 293 | 111 |
lz4 | 1 | 693 | 15 |
lz4 | 6 | 466 | 69 |
lzo | 6 | 500 | 6 |
lzo | 7 | 399 | 412 |
pigz | 6 | 292 | 5 |
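To make the table easier to compare, here is a small sketch that turns each row into a compression ratio and a throughput figure. It assumes the input is exactly 1024 MB. It also prints a rough lower bound for this kind of data: if the four bases are treated as uniformly random, each one carries 2 bits but is stored in an 8-bit character, so no general-purpose compressor should get much below a quarter of the input size.

```python
import math

INPUT_MB = 1024  # size of the uncompressed test file (assumed exactly 1024 MB)

# (tool, level, compressed size in MB, run time in s), copied from the table
results = [
    ("gzip", 6, 293, 111),
    ("lz4", 1, 693, 15),
    ("lz4", 6, 466, 69),
    ("lzo", 6, 500, 6),
    ("lzo", 7, 399, 412),
    ("pigz", 6, 292, 5),
]


def summarize(rows, input_mb=INPUT_MB):
    """Return (tool, level, compression ratio, MB/s) for each table row."""
    out = []
    for tool, level, size_mb, seconds in rows:
        ratio = input_mb / size_mb
        throughput = input_mb / seconds
        out.append((tool, level, round(ratio, 2), round(throughput, 1)))
    return out


# Four equally likely symbols: log2(4) = 2 bits of information per 8-bit character
entropy_floor_mb = INPUT_MB * math.log2(4) / 8

for tool, level, ratio, mbps in summarize(results):
    print(f"{tool:5s} level {level}: ratio {ratio:.2f}, {mbps:.1f} MB/s")
print(f"entropy floor: {entropy_floor_mb:.0f} MB")
```

By this estimate gzip and pigz, at 292-293 MB, are already fairly close to the 256 MB floor, which is one reason purely random data is a hard test case.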
lz4 is certainly faster than gzip, at the price of a lower compression ratio. However, in this test it is not quite as impressive as in these benchmarks: https://code.google.com/p/lz4/
lzo is doing really well and is actually much faster than lz4 while delivering similar compression. Levels 7-9 are really not that useful, though: level 7 took 412 seconds to save roughly 100 MB over level 6.
lz4 claims much faster decompression than lzo, but I cannot confirm this here. Both tools take about 6 seconds to decompress and restore the 1GB file.
pigz shows what can be done with raw compute power. top showed 2800% CPU utilization on this 28-core Linux system, so it seems to scale almost linearly with the number of cores. Decompression takes about 3 seconds. Here the local RAID array, which can write 300-400 MB/s, may be the limiting factor.
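A quick back-of-the-envelope check of the pigz numbers, using only the figures reported in this post. The timings are whole seconds, so the results are approximate.

```python
GZIP_SECONDS = 111      # gzip level 6, from the table
PIGZ_SECONDS = 5        # pigz level 6, from the table
CORES = 28              # cores on the test system
INPUT_MB = 1024         # uncompressed file size, assumed exactly 1024 MB
DECOMPRESS_SECONDS = 3  # reported pigz decompression time

# How much faster pigz was than single-threaded gzip
speedup = GZIP_SECONDS / PIGZ_SECONDS
# Fraction of ideal linear scaling across all cores
efficiency = speedup / CORES
# Rate at which the restored file had to be written to disk
write_rate_mbps = INPUT_MB / DECOMPRESS_SECONDS

print(f"speedup over gzip: {speedup:.1f}x")
print(f"parallel efficiency: {efficiency:.0%}")
print(f"decompression write rate: {write_rate_mbps:.0f} MB/s")
```

The write rate comes out at roughly 340 MB/s, which sits inside the 300-400 MB/s the array can sustain, so the disk is a plausible bottleneck for decompression.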
1 comment:
Hello
- The kind of "access pattern" your synthetic test creates might be misleading and may lead to wrong conclusions. It's better than a file full of zeros or pure random noise, but in most circumstances a "real sample" is more representative than synthetic data.
- I guess you already noticed that you are comparing single-threaded programs with multi-threaded ones.
- The kind of speed that LZ4/LZO can reach tends to be untestable from a command-line tool. They are designed as memory-to-memory compression algorithms. File compression is limited by the HDD I/O interface, which tends to dominate the test time.
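The memory-to-memory point can be illustrated with a short timing sketch that never touches the disk. LZ4 and LZO bindings are not part of the Python standard library, so this uses `zlib` as a stand-in; the helper name `time_in_memory` and the 1 MB sample size are assumptions made for the example, and the absolute timings will differ from the command-line results above.

```python
import random
import time
import zlib


def time_in_memory(data, level=6):
    """Compress and decompress a bytes object in memory, timing each step."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    decompress_s = time.perf_counter() - start

    # Make sure the round trip is lossless before trusting the timings
    assert restored == data
    return len(compressed), compress_s, decompress_s


# 1 MB of random DNA, generated the same way as the test file but much smaller
rng = random.Random(0)
sample = ''.join(rng.choices('ACGT', k=1_000_000)).encode('ascii')

size, c_s, d_s = time_in_memory(sample)
print(f"compressed to {size} bytes")
print(f"compress: {c_s:.4f} s, decompress: {d_s:.4f} s")
```

Swapping in a real sample file read into memory, as suggested in the first point, would give a more representative picture than the synthetic data.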