Linux: Find Files Containing Text
This topic is essential knowledge for every user of UNIX, Linux, Solaris, OS X, and BSD. Furthermore, the LPI certification contains tricky questions about this.
If you want to find files with a certain filename using the command line, use either the find or the locate command. But if you want to find files that contain a certain text, you'll want to use grep and its friends. Here, the term friends means a group of similar tools that are tailored to a specific data format or file structure, such as plain text, compressed files, and PDF documents.
Here is what we'll cover in this article:
- Basic Text Searching
- More grep Options
- Searching Compressed Files
- Searching Other Document Types
Basic Text Searching
The name grep is a combination of the initial letters of the four words "global / regular expression / print". This is similar to formulating search patterns in the stream editor sed. grep is designed to find matching patterns in entire data streams (files). Given patterns are interpreted as text or as Regular Expressions (see below for an example).
Example 1 shows how to find all occurrences of the brand name "Mikrotik", written either as "Mikrotik" or "MikroTik". We use grep to search through all files whose name starts with "invoice-2017". The result is a list of the matching lines, one per line, each preceded by the file name.
Example 1: Calling grep with a Regular Expression
$ grep "Mikro[tT]ik" invoice-2017*
invoice-20170015.text:Mikrotik Routerboard 750GL Gigabit Switch
invoice-20170045.text:MikroTik RouterBoard RB250GS Gigabit Switch
This output is helpful but does not contain the line number. To show the line number with grep, use the option -n, or --line-number as the long version. The result then looks as follows:
Example 2: Calling grep with a Regular Expression, and line numbers
$ grep -n "Mikro[tT]ik" invoice-2017*
invoice-20170015.text:64:Mikrotik Routerboard 750GL Gigabit Switch
invoice-20170045.text:65:MikroTik RouterBoard RB250GS Gigabit Switch
On every line, the individual output fields are separated by colons. The first field contains the filename ("invoice-20170015.text"), the second field is the line number within the matched file ("64"), and the third field is the entire line with the matched text ("Mikrotik Routerboard 750GL Gigabit Switch").
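Because the fields are separated by colons, the output is easy to process further in a script. The following is a minimal sketch, assuming a POSIX shell; the sample files and their content are made up for illustration and are created in a temporary directory.

```shell
#!/bin/sh
# Hypothetical sample data, created only for this sketch.
workdir=$(mktemp -d)
printf 'Item list\nMikrotik Routerboard 750GL Gigabit Switch\n' \
    > "$workdir/invoice-20170015.text"
printf 'Header\nItem list\nMikroTik RouterBoard RB250GS Gigabit Switch\n' \
    > "$workdir/invoice-20170045.text"

# Split each result line at the first two colons: file name, line number, text.
# Any further colons stay part of $text, the last variable.
report=$(
    cd "$workdir" &&
    grep -n 'Mikro[tT]ik' invoice-2017* |
    while IFS=: read -r file line text; do
        printf 'file=%s line=%s text=%s\n' "$file" "$line" "$text"
    done
)
printf '%s\n' "$report"

rm -rf "$workdir"
```

Note that this simple split assumes that the file names themselves do not contain a colon.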
More grep Options
grep has a long list of helpful options. See the manual page for a detailed description. The most relevant ones for this article are:
| Short option | Long option | Description |
|---|---|---|
| -i | --ignore-case | ignore differences between lower and upper case |
| -l | --files-with-matches | stop after the first match, and output the file name |
| -n | --line-number | show the line number of the match |
| | --color or --colour | highlight the actual match |
With the exception of highlighting the actual match, Example 3 combines the options -i and -l as described above, and adds -r (--recursive) so that grep also descends into directories. This simplifies the call and returns a list of files with matches, no matter how many matches exist in each file. This way you can see whether there are matches at all and, if so, in which files.
Example 3: Searching recursively for all files that contain the term "mikrotik" in any kind of spelling
$ grep -irl mikrotik invoice-2017*
invoice-20170015.text
invoice-20170045.text
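The recursive option is most useful when grep is pointed at a directory rather than at a list of files. The following sketch assumes a small, made-up directory tree (the directory and file names are hypothetical) and shows the same option combination applied to it.

```shell
#!/bin/sh
# Hypothetical directory layout, created only for this sketch.
workdir=$(mktemp -d)
mkdir -p "$workdir/invoices/2016" "$workdir/invoices/2017"
printf 'MIKROTIK Routerboard\n'         > "$workdir/invoices/2016/invoice-20160007.text"
printf 'MikroTik RouterBoard RB250GS\n' > "$workdir/invoices/2017/invoice-20170045.text"
printf 'Cat6 patch cable\n'             > "$workdir/invoices/2017/invoice-20170046.text"

# -r descends into the directory, -i ignores case, and -l prints each
# matching file name once. sort makes the order independent of the file system.
matches=$(cd "$workdir" && grep -irl mikrotik invoices | sort)
printf '%s\n' "$matches"

rm -rf "$workdir"
```

Only the two files that mention the brand name are listed; the third file is skipped silently.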
grep comes with two special variants: fgrep and egrep. fgrep interprets the search pattern as a string of single characters and is exactly the same as grep -F (and grep --fixed-strings). egrep takes the pattern as an Extended Regular Expression and is similar to grep -E (and grep --extended-regexp). Depending on the release, both commands are implemented either as small shell scripts that call grep with the corresponding option (as in Debian releases prior to 4.0 Etch) or as separate binary files. In either case, the search can be a bit quicker than using grep without the special option.
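The difference between the three interpretations of a pattern is easiest to see with a pattern that contains a special character. The following is a small sketch with made-up sample data; it uses the option forms grep -F and grep -E rather than the separate commands.

```shell
#!/bin/sh
# Hypothetical sample data, created only for this sketch.
sample=$(mktemp)
printf 'RB750GL price: 39.90\nRB250GS price: 39x90\n' > "$sample"

# As a Regular Expression, the dot matches any single character,
# so both lines are counted.
regex_hits=$(grep -c '39.90' "$sample")

# With -F the pattern is taken literally, so only the first line matches.
fixed_hits=$(grep -F -c '39.90' "$sample")

# -E enables extended syntax such as alternation with "|".
extended_hits=$(grep -E -c 'RB750GL|RB250GS' "$sample")

printf 'regex=%s fixed=%s extended=%s\n' "$regex_hits" "$fixed_hits" "$extended_hits"

rm -f "$sample"
```

The option -c prints the number of matching lines instead of the lines themselves, which keeps the comparison compact.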
Searching Compressed Files
grep is unable to inspect compressed files properly. This is where the specialists zgrep, xzgrep, and zipgrep enter the stage. These tools help you to simplify commands like this:
$ zcat archive.gz | fgrep [pattern]
zcat uncompresses the given file and writes its content to stdout. Piped to fgrep, the data stream is searched for the given pattern.
With the help of the commands above, you don't have to unpack files compressed with gzip, xz, or zip before searching - this step happens behind the scenes. As with grep, special variants exist as well, such as zegrep and zfgrep for gzip, and xzegrep and xzfgrep for xz archives. Example 4 shows how to search an xz-compressed file.
Example 4: Searching an xz-compressed file
$ xzfgrep Mikrotik invoice-20130015.text.xz
Mikrotik Routerboard RB450G Level 5 680MHz
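To see that the pipeline with zcat and the dedicated tool lead to the same result, both can be run against the same file. The following sketch assumes that gzip, zcat, and zgrep are installed, as is common on Linux systems; the sample file and its content are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample file, created only for this sketch.
workdir=$(mktemp -d)
printf 'Mikrotik Routerboard RB450G\nCat6 patch cable\n' > "$workdir/invoice.text"
gzip "$workdir/invoice.text"    # replaces the file with invoice.text.gz

# Long form: decompress to stdout, then search the data stream.
piped=$(zcat "$workdir/invoice.text.gz" | grep -F Mikrotik)

# Short form: zgrep decompresses behind the scenes and hands -F on to grep.
direct=$(zgrep -F Mikrotik "$workdir/invoice.text.gz")

printf 'zcat + grep: %s\nzgrep:       %s\n' "$piped" "$direct"

rm -rf "$workdir"
```

Both variables end up with the same matching line, so the shorter call is a drop-in replacement for the pipeline.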
Searching compressed archives is a bit more complex and requires a bit of shell scripting. Listing 1 shows such a shell script, which works only with gzip-compressed tar archives. For simplicity, we saved the script below under the name "search.sh". The script requires two parameters - the search pattern, and the file name of the compressed archive (see Example 5 below).
Listing 1: Searching compressed tar archives
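#!/bin/bash
# Usage: ./search.sh PATTERN ARCHIVE.tar.gz
pattern=$1
archive=$2

# Walk through the list of files stored in the archive
# (file names containing whitespace are not handled)
for filename in $(tar -tzf "$archive"); do
    # Extract the current file to stdout and keep the matching lines
    if match=$(tar -xOzf "$archive" "$filename" | fgrep "$pattern"); then
        echo "$filename:"
        echo "$match" | fgrep --color "$pattern"
        echo ""
    fi
done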
Example 5: Calling the script
$ ./search.sh Mikro archive.tar.gz
invoice-20110045.text:
Mikrotik

invoice-20110110.text:
MikroTik
Understanding the script may take a moment. First, the script extracts the list of files from the archive and evaluates each file one after the other. The for loop does all the complex work. Second, the matching lines are saved in the variable $match. To do so, the current file is extracted from the compressed archive and piped to fgrep. fgrep searches the data stream and signals a match with exit status 0. In that case, the following echo command is executed, and the file name is sent to stdout. Third, the actual match is printed as well, followed by an empty line. This separates the matches file by file.
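The script relies on the exit status of the search command: 0 signals at least one match, 1 signals no match. The following sketch isolates that mechanism; the function name has_match and the sample strings are hypothetical and only serve as an illustration.

```shell
#!/bin/sh
# has_match is a hypothetical helper used only in this sketch.
# grep -q prints nothing; only its exit status is evaluated:
# 0 means at least one match, 1 means no match.
has_match() {
    if printf '%s\n' "$2" | grep -F -q "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

first=$(has_match Mikro 'MikroTik RouterBoard RB250GS')
second=$(has_match Mikro 'Cat6 patch cable')
printf '%s\n%s\n' "$first" "$second"
```

The same exit status can drive an if statement or the && operator, which is how a script can decide whether to print the file name.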
An alternative is the tool deepgrep, which is part of the desktop search engine Strigi (Debian package strigi-utils). It searches tar.gz files as well as zip archives, Debian packages, and even Microsoft Word files. Example 6 shows how it works. Line by line you see the file name and the corresponding matches.
Example 6: Searching an archive using deepgrep
$ deepgrep Mikro archive.tar.gz
archive.tar.gz/invoice-20110045.text:Mikrotik
archive.tar.gz/invoice-20110110.text:MikroTik
Searching Other Document Types
deepgrep covers a lot of file formats but has quite a few package dependencies. Instead, you may have a look at pdfgrep and ssgrep. pdfgrep is specialized for PDF documents, and ssgrep is for spreadsheets.
What I like about pdfgrep is both its simplicity of use and its variety of options. Matches are highlighted without the need to specify further arguments. The option -n, which is shown in use below, helps to identify the page on which the pattern was found. In Example 7, each line consists of three data fields separated by colons - the file name, the page number of the match, and the extracted text of the match. If the output terminal supports colors, the data fields and the matches are highlighted in different ways.
Example 7: Searching PDF documents
$ pdfgrep -n "Mikro[tT]ik" invoice*.pdf
invoice-20120033.pdf:2:MikroTik Sextant 5HnD 18dbi MIMO
invoice-20120075.pdf:1:MikroTik RouterBOARD 250GS Giga
As mentioned above, ssgrep helps you to search spreadsheets. ssgrep abbreviates "spreadsheet grep" and is part of Gnumeric. Both Open/Libre Office Calc and Gnumeric use compressed XML as their file format. Newer releases of Microsoft Excel may work as well, but I did not test this. Figure 1 shows an example spreadsheet with sales data and four orders.
Figure 1: Gnumeric example spreadsheet
To identify the cells that contain the term "NanoStation", ssgrep is called with the options -H and -n. -H outputs the file name as the first data field, and -n adds the location - the name of the sheet, and the cell position. See Example 8 below for the output.
Example 8: Searching a spreadsheet using ssgrep
$ ssgrep -Hn "Nano[Ss]tation" orders.gnumeric
orders.gnumeric:orders!C3:5x NanoStation M5
orders.gnumeric:orders!C5:4x NanoStation M2
orders.gnumeric:orders!C6:3x NanoStation M3
The tools presented up to now are command line tools. To search Open/Libre Office documents, you may use the graphical tool named "loook" (Debian package loook). Figure 2 shows its simple graphical user interface.
Figure 2: The graphical interface of loook
Searching data formats is complex and can be an endless story. Several tools help you to identify the relevant files easily. For a full list of commands for other data formats, have a look at the references below.
- Axel Beckert: Grep Everything
- Axel Beckert, Frank Hofmann: Nadel im Datenhaufen. Suche in komprimierten Dateien und Archiven, LinuxUser 04/2012
- Axel Beckert, Frank Hofmann: Mit Struktur. Suche in Datenformaten (Teil 1), LinuxUser 06/2012
- Axel Beckert, Frank Hofmann: Durchgekämmt. Suche in Datenformaten (Teil 2), LinuxUser 07/2012