I have some JSON files, 20 GB each, that I want to compress with gzip:
gzip file1.json
This takes up one full CPU core, which is fine.
It processes around 25 MB/s (checked in atop). My hard drive can read at 125 MB/s and I have 3 free processor cores, so I expect a speed-up when compressing multiple files in parallel. So I run in other terminals:
gzip file2.json
gzip file3.json
gzip file4.json
Surprisingly, my throughput does not increase; CPU is around 25% on each core, and my HD still only reads at 25 MB/s.
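(Per-process read rates can also be cross-checked with pidstat from the sysstat package, if it is installed:)
pidstat -d 1    # reports kB_rd/s and kB_wr/s per process, once per second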
Why is that, and how can I address it?
Answer
I figured it out:
The reason is that gzip operates on (in terms of CPU speed vs. HD seek speed these days) extremely small buffer sizes.
It reads a few KB from the input file, compresses it, and flushes it to the output file. Given that this requires a hard drive seek, only a few operations can be done per second.
The reason my performance did not scale is that a single gzip was already seeking like crazy.
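These small reads and writes can be observed directly by tracing gzip's I/O calls, a quick diagnostic sketch (the exact chunk sizes depend on the gzip build):
strace -e trace=read,write -o gzip-io.log gzip -c file1.json > /dev/null
head gzip-io.log    # each line shows one read()/write() call and its byte count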
I worked around this by using the unix buffer utility:
buffer -s 100000 -m 10000000 -p 100 < file1.json | gzip > file1.json.gz
By buffering a lot of input before sending it to gzip, the number of small seeks can be reduced dramatically. The options:
-s and -m specify the size of the buffer (I believe it is in KB, but I am not sure)
-p 100 makes sure the data is only passed to gzip once the buffer is 100% filled
Running four of these in parallel, I could get 4 * 25 MB/s throughput, as expected.
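For convenience, the four runs can also be started from a single shell instead of separate terminals; a minimal sketch, reusing the same buffer options and the filenames from this example:
for f in file1.json file2.json file3.json file4.json; do
    buffer -s 100000 -m 10000000 -p 100 < "$f" | gzip > "$f.gz" &
done
wait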
I still wonder why gzip does not allow increasing its buffer size - as it stands, it is pretty much useless when run on a spinning disk.
EDIT: I tried out the behaviour of a few more compression programs:
bzip2 only processes 2 MB/s due to its stronger / more CPU-intensive compression
lzop seems to allow larger buffers: 70 MB/s per core, and 2 cores can max out my HD without over-seeking
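For reference, both tools can be invoked much like gzip (default compression settings assumed):
bzip2 -k file1.json    # writes file1.json.bz2, -k keeps the original
lzop file1.json        # writes file1.json.lzo, keeps the original by default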