Wednesday, 29 November 2017

linux - extract single file from huge tgz file


I have a huge tar file (about 500G) and I want to extract just a single file from it.
However, when I run tar -xvf file.tgz path/to/file, it still seems to load the whole contents into memory, and it takes over an hour to extract. I've also tried --exclude=ignore.txt, where ignore.txt is a list of patterns, in an attempt to stop it from traversing futile paths, but that doesn't seem to work.


Perhaps I don't understand tar... Is there a way to quickly extract the file?



Answer



Unfortunately, to unpack a single member of a .tar.gz archive you have to process the archive sequentially from the start, and there is not much you can do about that.


This is where .zip (and some other formats, like .rar) archives work much better: the zip format keeps a central directory of all the files it contains, with direct offsets pointing into the middle of the zip file, so archive members can be extracted quickly without processing the whole thing.
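To make the central-directory point concrete, here is a minimal sketch using Python's standard zipfile module. The helper names (extract_one, build_demo_zip) and the member names are made up for illustration. Opening the archive reads only the directory, and reading one member seeks straight to its recorded offset.

```python
import io
import zipfile


def extract_one(zip_file, member_name):
    """Return the bytes of one member without touching the other members."""
    with zipfile.ZipFile(zip_file) as zf:
        # ZipFile reads the central directory on open; each entry records
        # where its member starts (ZipInfo.header_offset).
        info = zf.getinfo(member_name)
        # read() seeks to that offset and decompresses only this member.
        return zf.read(info)


def build_demo_zip(count=1000):
    """Build a small in-memory zip so the sketch is runnable (hypothetical data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for i in range(count):
            zf.writestr(f"data/file{i}.txt", f"contents {i}\n")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(extract_one(build_demo_zip(), "data/file742.txt"))
```

The same call works with a path to a real .zip file in place of the in-memory buffer.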


You might ask: why is processing .tar.gz so slow?


.tar.gz (often shortened to .tgz) is simply a .tar archive compressed with the gzip compressor. gzip is a streaming compressor that works on a single file and keeps no index into its output. To reach any part of a gzip stream, you have to decompress everything that comes before it, and this is what really kills it for .tar.gz (and for .tar.bz2, .tar.xz and other similar formats based on .tar).
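Since the stream can only be read front to back, about the best you can do is stop as soon as the wanted member turns up. Below is a rough sketch with Python's standard tarfile module in its forward-only stream mode ("r|gz"). The function and member names are invented for the example, and it assumes the first match is the one you want. If the file sits near the end of the archive, this is no faster than reading everything.

```python
import io
import tarfile


def extract_first_match(fileobj, member_name):
    """Stream through a .tar.gz and return the first matching member's bytes."""
    # "r|gz" reads the input strictly forward: everything before the wanted
    # member is still decompressed, but nothing after it is.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for info in tar:
            if info.name == member_name and info.isfile():
                # Read the data now, before the stream moves past it.
                return tar.extractfile(info).read()
    return None  # reached the end of the archive without a match


def build_demo_tgz(count=100):
    """Build a small in-memory .tar.gz so the sketch is runnable (hypothetical data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for i in range(count):
            data = f"contents {i}\n".encode()
            info = tarfile.TarInfo(name=f"data/file{i}.txt")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(extract_first_match(build_demo_tgz(), "data/file3.txt"))
```

For a real archive, pass open("file.tgz", "rb") as fileobj. GNU tar has an --occurrence option intended for the same early exit on the command line; check your tar's manual before relying on it.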


.tar format is actually very, very simple. It is simply a stream of 512-byte file or directory headers (name, size, etc.), each followed by the file or directory contents (padded with 0 bytes to a multiple of 512 bytes if necessary). An all-zero 512-byte block where a header is expected marks the end of the .tar archive (strictly speaking, the format ends with two such blocks).
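The layout is easy to check by hand. The sketch below decodes just two header fields, the name (first 100 bytes) and the size (12 bytes of octal text at offset 124), and ignores everything else, including the ustar name prefix and the GNU binary size encoding used for very large files. The helper names and the demo file are assumptions made for the example.

```python
import io
import tarfile

BLOCK = 512  # tar works in 512-byte blocks


def parse_header(block):
    """Return (name, size) from one header block, or None for an all-zero block."""
    if block == bytes(BLOCK):
        return None  # zero block: end-of-archive marker
    name = block[0:100].split(b"\0", 1)[0].decode("utf-8")
    # The size is stored as octal ASCII digits, terminated by NUL or space.
    size = int(block[124:136].strip(b"\0 ") or b"0", 8)
    return name, size


def padded(size):
    """Round a content size up to a whole number of 512-byte blocks."""
    return (size + BLOCK - 1) // BLOCK * BLOCK


def build_demo_tar():
    """One 5-byte file in plain ustar format (hypothetical demo data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        info = tarfile.TarInfo(name="hello.txt")
        info.size = 5
        tar.addfile(info, io.BytesIO(b"hello"))
    return buf.getvalue()


if __name__ == "__main__":
    raw = build_demo_tar()
    name, size = parse_header(raw[:BLOCK])
    print(name, size, raw[BLOCK:BLOCK + size])
    # The block after the padded contents is the first end-of-archive block.
    print(parse_header(raw[BLOCK + padded(size):2 * BLOCK + padded(size)]))
```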


Some people think that even uncompressed .tar archive members cannot be accessed quickly, but this is not quite true. If a .tar archive contains a few big files, you can seek quickly from one header to the next, and so find the member you need in a few seeks (though it can still take as many seeks as there are archive members). If your .tar archive consists of lots of tiny files, quick member retrieval becomes effectively impossible even for an uncompressed .tar.
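Here is a small sketch of that header-hopping on an uncompressed .tar: read one 512-byte header, and if it is not the wanted member, seek past its contents instead of reading them. It makes one seek per member, which is why a few big files are cheap and millions of tiny ones are not. The function names and demo files are illustrative, and the parsing handles only plain short names and octal sizes.

```python
import io
import tarfile

BLOCK = 512


def find_member(f, wanted):
    """Hop from header to header in an uncompressed tar.

    Returns (data_offset, size, headers_visited), or None if not found.
    """
    visited = 0
    f.seek(0)
    while True:
        header = f.read(BLOCK)
        if len(header) < BLOCK or header == bytes(BLOCK):
            return None  # zero block or short read: end of archive
        visited += 1
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8")
        size = int(header[124:136].strip(b"\0 ") or b"0", 8)
        if name == wanted:
            return f.tell(), size, visited
        # Skip the contents (rounded up to whole blocks) without reading them.
        f.seek((size + BLOCK - 1) // BLOCK * BLOCK, io.SEEK_CUR)


def build_demo_tar(names, size):
    """In-memory ustar archive with equally sized files (hypothetical demo data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for n in names:
            info = tarfile.TarInfo(name=n)
            info.size = size
            tar.addfile(info, io.BytesIO(n.encode().ljust(size, b".")))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    demo = build_demo_tar(["a.bin", "b.bin", "c.bin"], 1000)
    print(find_member(demo, "c.bin"))
```

Python's tarfile module does the same kind of seeking for you when it opens an uncompressed archive, so in practice you would rarely write this by hand.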

