grep through gzip files vs tarball files


I recently came across a couple of instances, on random forums, where a user was trying to grep through a huge log file that had been tarred and then gzipped. Only then did I realize that some people out there do not know the differences between a gzip file and a tarball, or the advantages of each (actually, that particular dude didn’t know the purpose of a tarball). So, for my personal benefit, I ran a couple of experiments with some big log files (413 megabytes of text) to see what the speed vs storage trade-offs were when using gzip and tar.gz, all with bash (sorry tcsh users!).

A little on gzip files

A gzip file is simply a file formatted in a certain way, using a compression algorithm called DEFLATE, that makes the file smaller (read more on wikipedia). It is best to use gzip when you are trying to save some space but still want easy access to your files for peeking in them (see speed results of grepping through log files at bottom). The scenario where I use gzip the most is for compressing 100-200 log files, each 20-30 megs (again, see stats at bottom for space gains). An important feature of gzip is that it applies to a single file (this is not a ZIP file, people!), so you will end up with 100-200 little gzip files instead of one big one. This can be a little annoying for transferring to clients, and for handling in general (notice the for loop in my bash script).

# gzip multiple files (the loop splits on whitespace, so it assumes file names without spaces)
for i in $(find . -name "*.log"); do gzip "$i"; done

# grep through all those gzip files
for i in $(find . -name "*.log.gz"); do zgrep "TIME" "$i"; done
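
The loops above split the output of find on whitespace, so they only work when the log names contain no spaces. A minimal sketch of a variant that lets find hand each name directly to gzip and zgrep is shown here; the temporary directory and the sample log names are made up purely for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical demo data: two small logs, one with a space in its name.
workdir=$(mktemp -d)
printf 'TIME 10:00 start\nINFO nothing\n' > "$workdir/app one.log"
printf 'TIME 10:05 stop\n' > "$workdir/app2.log"

# Compress every .log file; find passes each name intact to gzip.
find "$workdir" -name "*.log" -exec gzip {} +

# Search the compressed logs, one zgrep call per file.
matches=$(find "$workdir" -name "*.log.gz" -exec zgrep "TIME" {} \;)
echo "$matches"
```

With -exec ... \; each file gets its own zgrep process, which mirrors the for loop; swapping \; for + would batch the files into fewer calls.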

A little on tarball files

The ideal situation in which to use a tarball is when you have directories and data files that you want to bundle up in one nice and tidy compressed package for users/clients to download, while still preserving some of the file system information such as permissions and directory structure (read more on wikipedia). tar does the bundling, and gzip then compresses the bundle as a whole. Note here that the end product is one file, which can be extremely useful in some cases (even a necessity at times). I say a necessity because sometimes, when handing in log files over FTP, you may want to have one package that you’ve encrypted for your client (using PGP), in which case you wouldn’t want to have 20-30 small packages all encrypted separately… In any case, to produce a tarball with a lot of logs, you can do the following under bash:

# Create a tarball
tar -czvf logs.tar.gz *.log

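# grep through a tarball
# (note: this extracts the logs into the current directory first; tar -v prints
# each extracted file name, and xargs passes those names to grep)
zcat logs.tar.gz | tar -xvf - | xargs grep "TIME"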
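
The pipeline above unpacks every log onto disk before grepping. If you only want the matching lines, one possible alternative is to have tar write the file contents to standard output with -O and grep that stream. The sketch below assumes GNU tar or bsdtar and uses made-up sample logs; note that the matching lines no longer say which file they came from.

```shell
#!/usr/bin/env bash
# Hypothetical demo data packed into a tarball.
workdir=$(mktemp -d)
printf 'TIME 10:00 start\nINFO nothing\n' > "$workdir/a.log"
printf 'TIME 10:05 stop\n' > "$workdir/b.log"
tar -czf "$workdir/logs.tar.gz" -C "$workdir" a.log b.log
rm "$workdir/a.log" "$workdir/b.log"

# -x extract, -z gunzip, -O write file contents to stdout instead of to disk.
matches=$(tar -xzOf "$workdir/logs.tar.gz" | grep "TIME")
echo "$matches"

# Only the tarball is left in the directory; the search extracted nothing.
ls "$workdir"
```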

Storage vs Speed for grepping through different types of files

Some statistics on storage vs speed, while grepping in different types of files.
Type of file    Size of files (Kilobytes)    Time for grep (Seconds)
RAW             413 248                      2
GZIP            27 656                       3
TARBALL         27 520                       19

Notice the enormous gain in space going from raw log files to gzip log files: a 93% reduction in size (RAW to GZIP), compared to a mere 0.5% further reduction when going from many GZIP files to one tarball. So far, though, I haven’t talked about the loss in speed that comes with compression. That is, of course, the most important thing to consider when dealing with files that you will need to peek through from time to time (logs are a perfect example).
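
The percentages can be recomputed from the sizes in the table, since the unit cancels out in the ratio. Here is a small sketch with the three sizes hard-coded; the helper name saving is just an illustrative choice.

```shell
#!/usr/bin/env bash
# Sizes copied from the table above (same unit for all three).
raw=413248
gz=27656
tgz=27520

# Percentage saved when going from size $1 to size $2, to one decimal place.
saving() {
    LC_ALL=C awk -v from="$1" -v to="$2" 'BEGIN { printf "%.1f", (1 - to / from) * 100 }'
}

raw_to_gz=$(saving "$raw" "$gz")
gz_to_tgz=$(saving "$gz" "$tgz")
echo "RAW to GZIP:     ${raw_to_gz}%"
echo "GZIP to TARBALL: ${gz_to_tgz}%"
```

This prints 93.3% for RAW to GZIP and 0.5% for GZIP to TARBALL.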

In order to compare the speed for each scenario, I used a very simple bash script, which I copied here for documentation (notice that I redirect the output to a ‘toto’ file so that I don’t get anything printed on my screen). The performance of my ‘grep’ command on the RAW logs was very good: 2 seconds to find 107 336 occurrences of “TIME” in the 10 logs. Comparing this with the results for the GZIP logs and the TARBALL, 3 and 19 seconds respectively, you can quickly see that it is extremely advantageous to use gzip for log files (look at the little graph of the different times if you are a more visual person…). Not only is the gain in storage negligible when going from GZIP to TARBALL, but access to your data is also a lot slower.

echo "Testing speed of RAW"
echo "===================="
date
for i in $(find . -name "*.log"); do grep "TIME" "$i" >> toto; done
date

echo "Testing speed of GZIP"
echo "====================="
date
for i in $(find . -name "*.log.gz"); do zgrep "TIME" "$i" >> toto; done
date

echo "Testing speed of TARBALL"
echo "========================"
date
zcat logs.tar.gz | tar -xvf - | xargs grep "TIME" >> toto
date
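
Printing two date stamps and subtracting by hand works, but bash can also report the elapsed time itself. The sketch below wraps a command in a small helper that uses the built-in SECONDS counter; run_timed is a hypothetical name that is not part of the script above, and the sample log is generated on the fly so the snippet runs anywhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and report whole seconds elapsed on stderr.
run_timed() {
    local label=$1; shift
    local start=$SECONDS
    "$@"
    echo "$label: $((SECONDS - start)) s" >&2
}

# Made-up sample data: one raw log and its gzip copy.
workdir=$(mktemp -d)
printf 'TIME 10:00 start\nINFO nothing\n' > "$workdir/a.log"
gzip -c "$workdir/a.log" > "$workdir/a.log.gz"

# Matches go to the 'toto' file, timings go to the terminal.
run_timed "RAW"  grep  "TIME" "$workdir/a.log"    >  "$workdir/toto"
run_timed "GZIP" zgrep "TIME" "$workdir/a.log.gz" >> "$workdir/toto"
```

SECONDS only counts whole seconds, which is enough for runs of several seconds like the ones measured here; for finer resolution the bash time keyword is an option.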

Concluding remark

There are situations where a tarball is necessary (or advantageous), but, in general, to keep the size of many log files down and still be able to search through them, I recommend using gzip. Not to mention that all your favorite bash commands come in a gzip flavour (zcat, zgrep, zdiff, zmore, etc.) and vi can easily read a gzip file on the fly! What more can you ask for!
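
As a quick illustration of those gzip-flavoured commands, here is a minimal sketch using zcat and zgrep on a made-up log. It assumes the GNU gzip tools as found on a typical Linux system.

```shell
#!/usr/bin/env bash
# Hypothetical sample log, compressed in place (app.log becomes app.log.gz).
workdir=$(mktemp -d)
printf 'TIME 10:00 start\nINFO nothing\nTIME 10:05 stop\n' > "$workdir/app.log"
gzip "$workdir/app.log"

# zcat prints the decompressed contents; the file on disk stays compressed.
line_count=$(zcat "$workdir/app.log.gz" | wc -l)

# zgrep accepts grep options such as -c (count matching lines).
time_count=$(zgrep -c "TIME" "$workdir/app.log.gz")

echo "lines: $line_count, TIME lines: $time_count"
```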
