Sunday, 28 January 2018

Why is gzip slow despite CPU and hard drive performance not being maxed out?


I have some JSON files, 20 GB each, that I want to compress with gzip:


gzip file1.json

This takes up one full CPU core, all fine.


It processes around 25 MB/s (checked with atop). My hard drive can read 125 MB/s and I have 3 free processor cores, so I expect to get a speed-up when compressing multiple files in parallel. So I run in other terminals:


gzip file2.json
gzip file3.json
gzip file4.json

Surprisingly, my throughput does not increase; CPU is around 25% on each core, and my HD still only reads at 25 MB/s.


Why is that, and how can I address it?



Answer



I figured it out:


The reason is that gzip uses buffer sizes that are extremely small relative to today's CPU speeds and hard drive seek times.


It reads a few KB from the input file, compresses them, and flushes the result to the output file. Because this requires a hard drive seek, only a few such operations can be done per second.


My performance did not scale because a single gzip was already seeking like crazy.




I worked around this by using the Unix buffer utility:


buffer -s 100000 -m 10000000 -p 100 < file1.json | gzip > file1.json.gz

By buffering a lot of input before sending it to gzip, the number of small seeks can be dramatically reduced. The options:



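  • -s and -m specify the size of the buffer: -s is the size of each block and -m the total buffer memory (I first believed the values were in KB; according to the buffer man page they are in bytes, so this is roughly a 10 MB buffer filled in 100 KB blocks)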

  • -p 100 makes sure that the data is only passed to gzip once the buffer is 100% filled


Running four of these in parallel, I could get 4 * 25 MB/s throughput, as expected.
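For reference, the four parallel runs can also be started from a single shell instead of four terminals. The following is only a sketch under assumptions: the function names buffered_gzip and buffered_gzip_all, and the BUFFER_CMD override variable, are hypothetical and not part of buffer or gzip. By default it runs exactly the buffer command shown above, so the buffer utility must be installed.

```shell
#!/bin/sh
# Sketch: run the buffered gzip pipeline for several files in parallel.
# BUFFER_CMD is a hypothetical override hook (not part of the original
# command); when unset, the exact `buffer` invocation from above is used.

buffered_gzip() {
    # Left unquoted on purpose so the command string is split into words.
    ${BUFFER_CMD:-buffer -s 100000 -m 10000000 -p 100} < "$1" | gzip > "$1.gz"
}

buffered_gzip_all() {
    for f in "$@"; do
        buffered_gzip "$f" &    # one background pipeline per input file
    done
    wait                        # block until every pipeline has finished
}
```

It would be called as buffered_gzip_all file1.json file2.json file3.json file4.json. Unlike plain gzip file1.json, this pipeline leaves the original files in place, because gzip only sees the data on its standard input.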




I still wonder why gzip doesn't allow increasing the buffer size - as it is, it is pretty useless when run on a spinning disk.


EDIT: I tried out the behaviour of a few more compression programs:



  • bzip2 only processes 2 MB/s due to its stronger / more CPU intensive compression

  • lzop seems to allow larger buffers: 70 MB/s per core, and 2 cores can max out my HD without over-seeking
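Since two lzop processes were already enough to saturate the disk here, it can make sense to cap the number of concurrent jobs at two rather than start one per file. A minimal sketch, assuming an xargs that supports -0 and -P (GNU and BSD xargs do); the function name compress_two_at_a_time and the COMPRESSOR variable are hypothetical names introduced for illustration:

```shell
#!/bin/sh
# Sketch: compress the given files with at most two jobs running at once.
# COMPRESSOR is a hypothetical override; it defaults to lzop, which
# writes FILE.lzo next to each input file.

compress_two_at_a_time() {
    [ "$#" -gt 0 ] || return 0              # nothing to do without files
    # NUL-separated names keep paths with spaces intact;
    # -n 1 starts one compressor per file, -P 2 limits concurrency to two.
    printf '%s\0' "$@" | xargs -0 -n 1 -P 2 "${COMPRESSOR:-lzop}"
}
```

Usage would be compress_two_at_a_time file1.json file2.json file3.json file4.json. The right job count depends on the disk and the compressor, so the 2 is only what worked in the measurement above.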

