In the process of mining the data collected for my diploma thesis, I ran across a very useful tool called sorted grep which offers significant speed improvements over the common unix tool grep when it comes to very large pre-sorted files.
Traditional logfiles seem to be the first use case one can think of, as they are naturally growing large and have an inherent chronological order. However, there are other cases where sorted grep might be the weapon of choice, as it offers binary search packed in a very reasonable way.
This article will walk you through the setup and usage of the powerful tool sorted grep and give you some ideas where and how to use it.
The Problem with grep
The unix tool grep enables us to search a file for certain patterns and output the matching lines. While this is very useful in (let’s say) a sysadmin’s everyday life, it reaches its limits when files grow to gigabytes or even terabytes in size.
The sheer mass doesn’t even need to be concentrated in one file, but can be split across multiple files in a folder, as is common for logfiles, for example.
grep searches every line of a file for the supplied pattern and outputs the matches. Generally, it doesn’t halt until it has traversed the whole file (or set of files). If only one match is expected, one can supply the parameter -m 1 to tell grep to exit after the first match. However, what if it is not known how many matches to expect?
The linear walkthrough of a file is a major drawback when the lines of interest are clustered at a certain position within the file (for example, all entries in a logfile for a given time). grep doesn’t know when to stop, as it has no clue whether it has found all matches. This is where sorted grep comes into play.
Improving Search Performance with Sorted Grep
If the file is already sorted and the sort key is located at the beginning of each line, sorted grep can bring huge performance improvements when searching a file, as it uses a binary search algorithm.
The theory of operation is quite simple: sorted grep jumps directly into the middle of the file. It then seeks the beginning of the current line and compares the sort key at the start of that line with the supplied search key. If the search key is greater than the sort key, it jumps into the middle of the second half of the file, otherwise into the middle of the first half. This step is repeated until a line matching the search key is found (or the search space is exhausted). Once a matching line is found, it outputs that line together with all adjacent lines beginning with the search key.
So, if the file size doubles, only one more iteration is needed, whereas the number of iterations grep needs doubles. This is a huge improvement: where grep takes several minutes to crawl through a file many gigabytes in size, sorted grep needs only a few milliseconds to find all matching lines.
Installing Sorted Grep
The first thing one should know: sorted grep is not the same as the more common structured grep. As both tools are invoked by the command sgrep, there is an unfortunate name collision. If you install sgrep via your favourite package manager (e.g. apt-get), you will end up with structured grep.
Therefore, the best way to install it is to download the source code from SourceForge and build it yourself. This is fairly simple. First we fire up the terminal and download the source (for the current version 1.0):
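One way to fetch it; note that the exact URL and archive name here are assumptions, so check the SourceForge project page if it has moved:

```shell
# Fetch the sorted grep 1.0 source archive. The URL and archive
# name are assumptions; verify them on the SourceForge project page.
wget http://downloads.sourceforge.net/project/sgrep/sgrep-1.0.tgz
```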
Next we extract and build it:
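Assuming the archive name from the download step, this amounts to:

```shell
# Extract the archive and compile; the source ships with a plain
# Makefile, so invoking make is enough.
tar -xzf sgrep-1.0.tgz
cd sgrep-1.0
make
```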
After the build process, the binary is located in the build folder. Since there is no make install routine, it needs to be installed manually. I suggest copying the binary to a directory on your $PATH:
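For example (the destination /usr/local/bin is a suggestion, adapt as needed):

```shell
# Copy the freshly built binary into a directory on the PATH and
# make sure it is executable by everyone.
sudo cp sgrep /usr/local/bin/
sudo chmod 755 /usr/local/bin/sgrep
```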
That’s it! Give it a try in order to verify everything was installed properly: invoked without any arguments, sgrep tells us what it eats and is ready for a spin.
sgrep (referring to sorted grep for the rest of this article) demands input that sticks to some conditions: all input files must be sorted regular files, and the sort key must be located at the beginning of each line, as the search key only matches the beginning of a line. Also, sgrep doesn’t support regular expressions. But that’s not what it was made for.
Checking the Input File
Let’s assume we want to search a file called
raw.csv with the following contents:
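For illustration, a plausible stand-in for such contents, five key;value lines, deliberately out of order:

```
BBBB;2
AAAA;1
CCCC;3
AAAA;4
BBBB;5
```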
In order to verify that the file is sorted properly, we use the unix tool sort with the parameter -c, which tells sort to only check whether the file is already sorted:
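A minimal sketch, with made-up sample data so it runs on its own:

```shell
# Create an unsorted sample file and check it: sort -c reports the
# first out-of-order line and exits with a non-zero status.
printf 'BBBB;2\nAAAA;1\nCCCC;3\nAAAA;4\nBBBB;5\n' > raw.csv
sort -c raw.csv
```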
As we can see, the file is not sorted correctly, so we sort it and write the output to a new file called
sorted.csv, since we don’t want to destroy the original order:
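With the made-up sample data from above recreated for completeness:

```shell
# Sort the sample file into sorted.csv, leaving raw.csv untouched.
printf 'BBBB;2\nAAAA;1\nCCCC;3\nAAAA;4\nBBBB;5\n' > raw.csv
sort raw.csv > sorted.csv
```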
By checking the new file with
sort -c sorted.csv, we ensure that the file is properly sorted, because
sort won’t report an error anymore.
It is strongly recommended to check the input file first whenever you are not sure that everything is sorted properly. Otherwise sgrep may not find all matching lines.
Performing the Search
Now that we have ensured the input file is sorted correctly, we can perform our first search as simple as:
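With the example file and key from this article, the invocation would presumably be:

```shell
# Binary-search sorted.csv for all lines whose key is AAAA
# (assumes the sgrep built above is on the PATH).
sgrep AAAA sorted.csv
```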
Using our file sorted.csv from the example above and searching for AAAA, sgrep outputs all lines beginning with AAAA.
As sgrep matches at the beginning of each line, this is equivalent to invoking the common unix grep in the following way:
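That is, with the pattern anchored to the start of each line:

```shell
# The caret anchors the pattern to the beginning of each line, so
# grep returns the same lines as sgrep, just after a full linear
# scan (sorted.csv is the file from the example above).
grep '^AAAA' sorted.csv
```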
sgrep, however, tends to be much, much faster when it comes to files gigabytes in size, as already described before.
Depending on the nature of the sort key, we have to do the actual check and search operations with some additional parameters. Remember:
sgrep matches only the beginning of a line.
Case-insensitive search (-f): If the case of the sort and search key should be ignored, the additional parameter -f must be supplied to both sort and sgrep.
Numeric comparison (-n): If the sort key is numeric and contains negative values or floats, numeric comparison should be used by specifying the parameter -n for both sort and sgrep. Furthermore, ranges of keys can be specified.
Reverse order (-r): If the input file is sorted in reverse order, the parameter -r has to be specified for sort as well as for sgrep.
So why do we need this? How does the additional effort of ensuring that the file is sorted (or sorting it first) pay off? Actually, when you’re operating on small files or querying an unsorted file only a few times, it doesn’t. But when you have a huge amount of static data and query it many, many times, it does.
Up to now we have only discussed the more or less trivial case of mining logfiles. However, sorted grep can generally be applied to any file containing key-value information, for example one describing relationships between specific entities. Let’s say we have a dump containing all relationships of who is following whom on Twitter, structured the following way:
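Assuming one follower;followed pair of user IDs per line (the concrete IDs here are made up), such a dump might look like this:

```
2834921;1928481
1928481;99213
1928481;2834921
3001200;42
99213;1928481
```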
In the case of Twitter, the file will most probably contain billions and billions of lines. Presuming we sorted the file numerically with
sort -n, we can now query it with
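Assuming the dump was saved as followers.csv (the file name is made up for this example):

```shell
# Query the numerically sorted dump for user 1928481 using numeric
# key comparison (assumes sgrep is installed).
sgrep -n 1928481 followers.csv
```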
The result is all lines starting with the number 1928481, which are all relationships from the user 1928481 to other users.
This is similar to the setting I had when mining the data for my diploma thesis. A 27-gigabyte file with around 1 billion lines had to be queried thousands of times for different values. The initial sorting took approximately 2 hours, but thanks to sgrep, each query afterwards needed only some milliseconds.
Even for a file with only 10 million lines, sgrep already pays off, as a benchmark (made on my 1.6 GHz MacBook Air, SSD) showed.
The results on computers with non-SSD hard drives don’t differ very much. Therefore, if you have a static data set and need to query it a whole bunch of times, get it sorted. This is especially helpful for data mining and data science in general.
sgrep is the tool of choice when it comes to enormous data sets that have to be queried many times. It eliminates the need to set up a database just to operate efficiently on huge amounts of data. Furthermore, only a small amount of RAM is needed, as the file is never loaded completely. Pretty cool.