
The Power of Sorted Grep

In the process of mining the data collected for my diploma thesis, I ran across a very useful tool called sorted grep which offers significant speed improvements over the common unix tool grep when it comes to very large pre-sorted files.

Traditional logfiles seem to be the first use case that comes to mind, as they naturally grow large and have an inherent chronological order. However, there are other cases where sorted grep might be the weapon of choice, as it packages binary search in a very convenient way.

This article will walk you through the setup and usage of the powerful tool sorted grep and give you some ideas where and how to use it.

Theory

The Problem with grep

The unix tool grep enables us to search a file for certain patterns and output the matching lines. While this is very useful in (let’s say) a sysadmin’s everyday life, it reaches its limits when files grow to gigabytes or even terabytes in size.

The sheer mass doesn’t even need to be concentrated in one file but can be split across multiple files in a folder, as is common for logfiles, for example.

grep searches every line of a file for the supplied pattern and outputs the matching lines. Generally, it doesn’t halt until it has traversed the whole file (or set of files). If only one match is to be expected, one can provide the parameter -m 1 to tell grep to exit after the first match (see the example below). However, what if it is not known how many matches to expect?
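For completeness, limiting grep to a single match looks like this (using grep’s -m option):

grep -m 1 PATTERN FILE              # stop after the first matching line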

The linear walkthrough of a file is a major drawback when the lines of interest are clustered at a certain position within the file (for example all entries in a logfile for a given time). grep doesn’t know when to stop, as it has no clue whether it has found all matches. This is where sorted grep comes into play.

Improving Search Performance with Sorted Grep

If the file is already sorted and the sort key is located at the beginning of each line, sorted grep can bring huge performance improvements when searching a file, as it uses a binary search algorithm.

The theory of operation is quite simple: sorted grep jumps straight to the middle of the file. It then seeks back to the beginning of the current line and compares the sort key at the start of that line with the supplied search key. If the search key is greater than the sort key, it jumps to the middle of the second half of the file, otherwise to the middle of the first half. This step is repeated until a line matching the search key is found (or the search space is exhausted). If a matching line is found, it also outputs all lines immediately before and after it that begin with the search key.

So, if the file size doubles, only one more iteration is needed, whereas the number of lines grep has to scan doubles. In other words, the work grows logarithmically with the file size instead of linearly. While it might take grep several minutes to crawl through a file of many gigabytes, sorted grep will only need a few milliseconds to find all matching lines.
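To put a rough number on that (a back-of-the-envelope calculation, assuming one key comparison per bisection step): a file with one billion lines needs only about 30 comparisons, which can be checked quickly in the shell:

echo 'l(1000000000)/l(2)' | bc -l   # ≈ 29.9, i.e. about 30 bisection steps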

Setup

The first thing one should know: sorted grep is not the same as the more common structured grep. As both tools are invoked via the command sgrep, there is an unfortunate name collision. If you install sgrep via your favourite package manager (e.g. apt-get), you will end up with structured grep.

Therefore, the best way to install it is to download the source code from SourceForge and build it yourself. This is fairly simple. First we fire up the terminal and download the source (for the current version 1.0):

wget 'http://downloads.sourceforge.net/project/sgrep/sgrep/sgrep-1.0/sgrep-1.0.tgz'

Next we extract and build it:

tar xvfz sgrep-1.0.tgz
cd sgrep-1.0
make

After the build process, the binary sgrep is located in the source folder. Since there is no make install routine, it needs to be installed manually. I suggest /usr/local/bin:

sudo mv sgrep /usr/local/bin

That’s it! Give it a try in order to verify everything was installed properly:

sgrep
# Usage: sgrep [ -i | -n ] [ -c ] [ -b ] [ -r ] key [ sorted_file ... ]

Perfect. sgrep tells us what it eats and is ready for a spin.

Usage

sgrep (for the rest of this article referring to sorted grep) demands input that satisfies a few conditions: all input files must be sorted regular files, and the sort key must be located at the beginning of each line, since the search key only matches the beginning of a line. Also, sgrep doesn’t support regular expressions. But that’s not what it was made for.

Checking the Input File

Let’s assume we want to search a file called raw.csv with the following contents:

AAAAA, 13.4
AAAAC, 12.5
AAAAB, 10.4
AAABA, 87.3
AABAA, 50.8

In order to verify that the file is sorted properly, we use the unix tool sort with the parameter -c, which tells sort to only check whether the file is already sorted:

sort -c raw.csv
# sort: raw.csv:3: disorder: AAAAB, 10.4

As we can see, the file is not sorted correctly, so we sort it and write the output to a new file called sorted.csv, since we don’t want to destroy the original order:

sort raw.csv > sorted.csv

By checking the new file with sort -c sorted.csv, we ensure that the file is properly sorted, because sort won’t report an error anymore.

It is strongly recommended to always check the input file first if you’re not sure everything is sorted properly. Otherwise, sgrep may not find all matching lines.
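A quick way to do this check is to rely on sort’s exit status, which is zero for a correctly sorted file:

sort -c sorted.csv && echo "sorted.csv is sorted"   # prints the message only if no disorder is found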

Performing the Search

Now that we have ensured the input file is sorted correctly, we can perform our first search, which is as simple as:

sgrep PATTERN FILE

Using our file sorted.csv from the example above and searching for AAAA, this outputs:

AAAAA, 13.4
AAAAB, 10.4
AAAAC, 12.5

Technically, as sgrep only matches the beginning of each line, this is the same as invoking the common unix grep in the following way:

grep ^AAAA sorted.csv

However, sgrep tends to be much, much faster when it comes to large files that are gigabytes in size, as described before.

Further Options

Depending on the nature of the sort key, we have to perform the sorting and searching with some additional parameters. Remember: sgrep matches only the beginning of a line.
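To illustrate the last point with the file sorted.csv from above (a hypothetical query): a key that only appears in the middle of a line produces no output, because sgrep compares it against the start of each line only:

sgrep 13.4 sorted.csv               # no output: "13.4" never starts a line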

Case insensitive search (-f/-i): If the case of the sort and search key should be ignored, supply the additional parameter -f to sort and -i to sgrep (cf. the usage line above).

sort -f raw.csv > sorted.csv        # sort case insensitive
sgrep -i aa sorted.csv              # matches "aa", "Aa", "aA" and "AA"

Numeric comparison (-n): If the sort key is numeric and contains negative values or floats, numeric comparison should be used by specifying the parameter -n for both sort and sgrep.

sort -n raw.csv > sorted.csv        # sort numerical
sgrep -n 10 sorted.csv              # matches "10", "10.0" and "010"

Furthermore, ranges can be specified in the format [start:end]:

sort -n raw.csv > sorted.csv        # sort numerical
sgrep -n [10.010:10.5] sorted.csv   # matches "10.015", "10.3" and "010.5"

Reverse order (-r): If the input file is sorted in reverse order, the parameter -r has to be specified for sort as well as for sgrep.
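Analogous to the examples above, this could look like the following:

sort -r raw.csv > sorted.csv        # sort in reverse order
sgrep -r AAAA sorted.csv            # search the reverse-sorted file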

Use Cases

So why do we need this? How does the additional effort of ensuring that the file is sorted (or sorting it in the first place) pay off? Actually, when you’re operating on small files or querying an unsorted file only a few times, it doesn’t. But when you have a huge amount of static data and query it many times, it does.

Up to now we have only discussed the more or less trivial case of mining logfiles. However, sorted grep can generally be applied to all files containing key-value information, for example files describing relationships between specific entities. Let’s say we have a dump containing all relationships of who is following whom on Twitter, structured the following way:

...
1928481, 13920103, '2009-01-04 12:10:34'
1928481, 14032932, '2010-05-04 04:05:22'
1928481, 14139392, '2010-06-04 19:14:46'
...

In the case of Twitter, the file will most probably contain billions and billions of lines. Presuming we sorted the file numerically with sort -n, we can now query it with sgrep:

sgrep -n 1928481 twitter.csv

The result is all lines starting with the number 1928481, i.e. all relationships from user 1928481 to other users.
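If we are only interested in how many such relationships exist, the -c flag from the usage line prints just the number of matching lines instead of the lines themselves:

sgrep -nc 1928481 twitter.csv       # print only the number of matching lines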

This is similar to the setting I had when mining the data for my diploma thesis: a 27 gigabyte file with around 1 billion lines had to be queried thousands of times for different values. The initial sorting took approximately 2 hours, but thanks to sgrep, each query only needed a few milliseconds.
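A small practical note for sorting files of that size: GNU sort lets you specify a larger main-memory buffer and an explicit temporary directory, which can speed up the initial sort considerably (the buffer size, temporary directory and file names below are placeholders to adapt):

sort -n -S 4G -T /tmp twitter_raw.csv > twitter.csv   # bigger buffer and explicit temp dir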

Even for a file with only 10 million lines sgrep already pays off, as the following benchmark (made on my 1.6 GHz MacBook Air, SSD) shows:

time grep -c ^1928481 twitter.csv   |    time sgrep -nc 1928481 twitter.csv
# 84                                |    # 84
#                                   |    #
# real    0m0.492s                  |    # real    0m0.006s
# user    0m0.342s                  |    # user    0m0.001s
# sys     0m0.148s                  |    # sys     0m0.003s

The results on computers with non-SSD hard drives don’t differ very much. Therefore, if you have a static data set and need to query it a whole bunch of times, get it sorted. This is especially helpful for data mining and data science in general.

Conclusion

sgrep is the tool of choice when it comes to enormous data sets that have to be queried many times. It eliminates the need to set up a database just to ensure efficiency when operating on huge amounts of data. Furthermore, only a small amount of RAM is needed, as the file is never loaded into memory completely. Pretty cool.
