I have some JSON files, 20 GB each, that I want to compress with gzip:
gzip file1.json
This takes up one full CPU core, which is fine.
It processes around 25 MB/s (checked in atop). My hard drive can read at 125 MB/s and I have 3 free processor cores, so I expect a speed-up when compressing multiple files in parallel. So I run in other terminals:
gzip file2.json
gzip file3.json
gzip file4.json
Surprisingly, my throughput does not increase; CPU is around 25% on each core, and my HD still only reads at 25 MB/s.
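(Per-process read rates can also be cross-checked with pidstat from the sysstat package, if it is installed:)
pidstat -d 1    # reports kB_rd/s and kB_wr/s per process, once per second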
Why is that, and how can I address it?
Answer
I figured it out:
The reason is that gzip operates on (in terms of CPU speed vs. HD seek speed these days) extremely small buffer sizes.
It reads a few KB from the input file, compresses it, and flushes it to the output file. Given that this requires a hard drive seek, only a few operations can be done per second.
The reason my performance did not scale is that a single gzip was already seeking like crazy.
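These small reads and writes can be observed directly by tracing gzip's I/O calls, a quick diagnostic sketch (the exact chunk sizes depend on the gzip build):
strace -e trace=read,write -o gzip-io.log gzip -c file1.json > /dev/null
head gzip-io.log    # each line shows one read()/write() call and its byte count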
I worked around this by using the unix buffer utility:
buffer -s 100000 -m 10000000 -p 100 < file1.json | gzip > file1.json.gz
By buffering a lot of input before sending it to gzip, the number of small seeks can be reduced dramatically. The options:
-s and -m specify the size of the buffer (I believe it is in KB, but I am not sure)
-p 100 makes sure the data is only passed to gzip once the buffer is 100% filled
Running four of these in parallel, I could get 4 * 25 MB/s throughput, as expected.
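For convenience, the four runs can also be started from a single shell instead of separate terminals; a minimal sketch, reusing the same buffer options and the filenames from this example:
for f in file1.json file2.json file3.json file4.json; do
    buffer -s 100000 -m 10000000 -p 100 < "$f" | gzip > "$f.gz" &
done
wait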
I still wonder why gzip does not allow increasing its buffer size - as it stands, it is pretty much useless when run on a spinning disk.
EDIT: I tried out the behaviour of a few more compression programs:
bzip2 only processes 2 MB/s due to its stronger / more CPU-intensive compression
lzop seems to allow larger buffers: 70 MB/s per core, and 2 cores can max out my HD without over-seeking
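For reference, both tools can be invoked much like gzip (default compression settings assumed):
bzip2 -k file1.json    # writes file1.json.bz2, -k keeps the original
lzop file1.json        # writes file1.json.lzo, keeps the original by default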