Scripting a 4-Color Multiple Grepper
By Bob Mesibov, published 20/06/2014 in Tutorials
I wrote the shell script described here ('grep4') to find up to four items at the same time in my data tables, which are big text files with one record per line. The script gives each searched-for item its own color on the terminal screen. It lets me know in advance how many 'hits' there are, so that if there are lots of hits, I get the choice of printing the results to a file instead of displaying them. The script also lets me choose between seeing all records with any of the items (the OR case: A or B or...) or just the records containing all the searched-for items on one line (the AND case: A and B and...).
Demo
The script is a wee bit complicated so I'll show it working before I explain the code. For demonstration purposes I'll use a small text file called 'menus' (modified from here):
After navigating to the directory containing 'menus' I launch 'grep4' and I'm asked for a file to search:
If I enter the wrong filename, 'grep4' tells me and exits:
With a correct filename, 'grep4' asks for a sequence of up to four search strings, comma-separated. I can enter 1, 2, 3 or 4 strings, and the strings can contain spaces. I can even mess up the spacing around the commas, because 'grep4' will fix that later:
Next, 'grep4' reports what it's found, and offers five choices of what to do next:
I'll try option 1 first:
Each search string is highlighted in its own color, and I'm again offered option 2, namely a print to file. If I choose 'n', 'grep4' exits. 'grep4' also exits if I do some serious misspelling:
If I choose option 3, 'grep4' finds the one line (in this particular case) which contains all of my search strings:
Printing to file gives me (in this particular case) a 1-line file on my desktop containing menu 8. The file is named 'joint_matches_in_menus':
The code
Now for the gory details. I'll show the script in sections, but with line numbers added so you can see what goes where. You can also download the script here.
The script begins by asking the user to enter a filename, which is stored as the variable '$where' (line 3). The script tests to see if that filename exists in the current directory (line 4). If it does, the script progresses. If the filename doesn't exist, the script exits with a message containing '$where' in red (lines 84-88).
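A minimal sketch of this kind of filename check, not the actual 'grep4' code. The function name 'check_file' and the message wording are assumptions made for illustration; the real script reads '$where' interactively rather than taking an argument.

```shell
# Hypothetical helper: succeeds if the named file exists in the current directory.
check_file() {
    where=$1
    if [ -e "$where" ]; then
        echo "Searching $where"
    else
        # \033[31m switches the terminal to red, \033[0m resets it
        printf 'Cannot find \033[31m%s\033[0m in this directory\n' "$where"
        return 1
    fi
}
```

Calling 'check_file menus' in a directory without a 'menus' file prints the message with the filename in red and returns a non-zero status.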
The next section of the script processes the comma-separated sequence of strings, which is stored as the variable '$query' (line 7). The sequence is turned by the tr command and a redirect into a file, '/tmp/searchlist', in which each search string is on a separate line (line 8). A sed command is used to remove any leading or trailing whitespace, so that grep won't look for it as part of the search string (line 9). Each search string is then stored in its own variable by sed (lines 10-13).
Now for a neat grep option, namely '-f' (line 14). This allows grep to search the file for all the items in '/tmp/searchlist' at one time. All grepped lines in '$where' containing any of the search strings are stored in the variable '$all'. (Which will be tested in a moment...)
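The splitting and trimming steps might look something like the sketch below. The sample query and the use of a mktemp file in place of '/tmp/searchlist' are assumptions, and the exact sed expressions in 'grep4' may differ.

```shell
query='vanilla pudding ,  coffee,tea  '   # sample input with messy spacing (assumed)
list=$(mktemp)                            # stand-in for /tmp/searchlist

# One search string per line, then strip leading and trailing whitespace
printf '%s\n' "$query" | tr ',' '\n' \
    | sed 's/^[[:space:]]*//;s/[[:space:]]*$//' > "$list"

# Pull each line of the list into its own variable
A=$(sed -n '1p' "$list")
B=$(sed -n '2p' "$list")
C=$(sed -n '3p' "$list")
D=$(sed -n '4p' "$list")
```

With only three strings entered, the fourth variable is simply left empty.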
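As a small illustration of '-f' (with made-up sample data, not the 'menus' file from the demo):

```shell
data=$(mktemp)    # stand-in for the file being searched
list=$(mktemp)    # stand-in for /tmp/searchlist
printf '%s\n' 'menu 1: soup, tea' 'menu 2: salad, water' 'menu 3: coffee, cake' > "$data"
printf '%s\n' 'tea' 'coffee' > "$list"

# Every line of $data that matches at least one pattern in $list
all=$(grep -f "$list" "$data")
printf '%s\n' "$all"
```

Each line of the pattern file is treated as a regular expression unless '-F' is added. An empty line in the pattern file matches every line of the input, so the list is best kept free of blank lines.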
'$all' is likely to be a lot smaller than '$where', so it's a good place to search for the AND case (line 15). Here '$all' is passed through a grep filter, first coloring the '$A' items in red (which is the default highlighting color for grep). Lines with '$A' will next be grepped for '$B', but to preserve the red color of '$A' items we need to use the grep option '--color=always', or the coloring characters will be stripped off during the piping.
Items matching '$B' get their own color next (line 15), set through the 'GREP_COLORS' environment variable, whose 'mt=' field gives the color of the matching text. The '$A'-and-'$B' matching lines then get passed to grep for '$C' checking, and so on. All the AND case lines are stored in the variable '$joint'.
Time to check if, in fact, any of the search items were found in the file. This is done with a test (line 16) to see if '$all' is of zero length, i.e. an empty variable. If it is, the script tells me so, deletes '/tmp/searchlist' and exits.
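A two-string version of such an AND pipeline can be sketched as follows, assuming GNU grep. The sample lines and the color code '01;32' (bold green) are assumptions; 'grep4' chains four greps and its color choices aren't reproduced here.

```shell
all=$(printf '%s\n' 'soup and tea' 'coffee and tea' 'coffee and cake')
A=coffee
B=tea

# First grep colors $A in the default red; --color=always keeps the color
# codes alive through the pipe. The second grep keeps only lines that also
# contain $B and colors those matches green.
joint=$(printf '%s\n' "$all" \
    | grep --color=always "$A" \
    | GREP_COLORS='mt=01;32' grep --color=always "$B")
printf '%s\n' "$joint"
```

One caveat with this approach: later greps see the color codes inserted by earlier ones, so a search string made only of characters that occur in those codes (such as 'm' or '31') could match the codes themselves.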
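The zero-length test can be sketched like this. The function name 'stop_if_empty' and the message are assumptions, and the real script exits at this point rather than returning.

```shell
# Hypothetical helper: $1 = grep results, $2 = temporary search list
stop_if_empty() {
    if [ -z "$1" ]; then
        echo 'None of the search strings were found.'
        rm -f "$2"      # housekeeping before leaving
        return 1
    fi
}
```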
Now to report a summary of the search:
The script builds the summary by going through '/tmp/searchlist' with a for loop, and herein lies a gotcha. A for loop in BASH uses the default field separators (space, tab and newline) to recognise the items to be processed. But the search strings may contain spaces, so the for loop would (for example) do something first with 'vanilla', then with 'pudding', rather than with 'vanilla pudding'. The workaround is to temporarily save the default field separators in a variable (line 23), replace them with a newline (line 24), get the for loop working (lines 25-30), then restore the old defaults (line 31). This well-known workaround is nicely discussed here.
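The save, replace and restore steps can be sketched as below. The variable name 'oldIFS' and the sample list are assumptions.

```shell
list=$(mktemp)    # stand-in for /tmp/searchlist
printf '%s\n' 'vanilla pudding' 'tea' > "$list"

oldIFS=$IFS       # save the default field separators
IFS='
'                 # split on newlines only
count=0
for i in $(cat "$list"); do
    count=$((count + 1))
    echo "item $count: $i"
done
IFS=$oldIFS       # restore the defaults
```

With the newline-only separator the loop runs twice, once for 'vanilla pudding' and once for 'tea'. A 'while IFS= read -r' loop fed from the file is a common alternative that avoids changing the separators for the rest of the script.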
The first step in the for loop (line 27) is to use the grep option '-o' to pull out each instance of the searched-for item on a new line, count the number of resulting lines with wc, and store that number in the variable '$var1'. That's the total number of instances of that search string in the file.
The second step in the for loop (line 28) is to use the grep option '-c' to count the total number of lines in the file containing the search string, and to store that number in the variable '$var2'.
Last step in the for loop (line 29) is to report the summary, search string by search string. [If a string hasn't been found, the report just says (for example) Found 0 of watermelon on 0 lines.]
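The body of the loop might be sketched as a function like this one. The name 'count_hits' and the sample data are assumptions; the message format follows the example above.

```shell
# Hypothetical helper: $1 = search string, $2 = file to search
count_hits() {
    # -o prints each match on its own line, so wc -l gives total instances
    var1=$(grep -o "$1" "$2" | wc -l)
    # -c counts matching lines; grep exits non-zero when the count is 0,
    # and '|| true' keeps that from stopping a script run with 'set -e'
    var2=$(grep -c "$1" "$2" || true)
    # $((var1)) drops any padding some versions of wc add
    echo "Found $((var1)) of $1 on $var2 lines"
}

data=$(mktemp)
printf '%s\n' 'vanilla pudding, tea' 'tea and more tea' 'coffee' > "$data"
count_hits tea "$data"
count_hits watermelon "$data"
```

The two counts differ whenever a string occurs more than once on a line, as 'tea' does in the second sample line.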
With the summary finished, the script offers five choices. The user's pick is read into the variable '$choice'. I could have used an 'if/elif/else/fi' construction here, but I find case/esac easier on the brain.
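In outline, a case/esac dispatcher for the five choices could look like the sketch below. The function name 'dispatch', the echoed descriptions and the catch-all branch are assumptions; in 'grep4' each branch holds the real work.

```shell
# Hypothetical outline: $1 = the user's choice
dispatch() {
    case $1 in
        1) echo 'show all matches on screen' ;;
        2) echo 'print all matches to file' ;;
        3) echo 'show joint matches on screen' ;;
        4) echo 'print joint matches to file' ;;
        5) echo 'tidy up and exit' ;;
        *) echo 'unrecognised choice'; return 1 ;;   # anything else
    esac
}
```

Each branch ends with ';;', and the '*' pattern catches any input that isn't one of the five numbers.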
The first choice will print all the lines with found items to the screen (line 37). To do this, the non-colored '$all' is piped to grep much as in line 15, but this time grep searches for either '$A' or the end of a line ('$'). Since it will find an end-of-line on every line, the whole of '$all' passes through the first grepping, with '$A' items in the default red coloring. The same happens in the next three grep searches in the pipeline, so what comes out the end and onto the screen is the whole of '$all' with each item in its own color.
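The end-of-line trick can be tried on its own with two strings, assuming GNU grep. The '-E' option, the sample lines and the color '01;32' are assumptions here; the script may write the alternation differently.

```shell
all=$(printf '%s\n' 'soup and tea' 'coffee and cake')
A=coffee
B=tea

# "$A|$" matches either the search string or the end of the line, so no
# line is filtered out; only real occurrences of the string get colored.
colored=$(printf '%s\n' "$all" \
    | grep --color=always -E "$A|$" \
    | GREP_COLORS='mt=01;32' grep --color=always -E "$B|$")
printf '%s\n' "$colored"
```

Both sample lines survive the pipeline, even though neither contains both strings.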
At line 39 the user is given the option to print the OR case result to a file, with the choice of yes (y) or no (n) stored in the variable '$print1'. The choice is tested (line 40), and if the user said 'no', the script exits after deleting '/tmp/searchlist'. If the user wants 'yes', the non-colored '$all' is sent to a suitably named desktop file (line 45).
Choice 2 for case is just the 'print all results to desktop file' code already seen at line 45.
Things get interesting for case choice 3, because the script doesn't know yet if there is, indeed, any line containing a joint occurrence of all search strings. Any such lines were captured back at line 15 in '$joint', and now (line 50) is the time to test whether '$joint' is of zero length. If it is, the script does the usual housekeeping and exits. If '$joint' has something in it, it's printed to screen with the colors established in line 15 (line 57).
Printing the AND case result to a file contains another gotcha: the variable '$joint' includes color-determining characters (invisible in the terminal) which would show up as unwanted nuisances in the text file. To remove these, '$joint' is passed through a very neat sed filter (line 65) recommended in 2013 by contributor 'Zhoul' on commandlinefu.com.
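A simplified stand-in for such a filter is sketched below; it is not necessarily the expression used in the script. It assumes GNU grep, a temporary directory in place of the desktop, and that only the 'm' and 'K' escape sequences grep emits need removing.

```shell
esc=$(printf '\033')      # the literal escape character
outdir=$(mktemp -d)       # stand-in for the desktop

# A colored AND-case result, built as in the earlier pipeline
joint=$(printf '%s\n' 'coffee and tea' \
    | grep --color=always 'coffee' \
    | GREP_COLORS='mt=01;32' grep --color=always 'tea')

# Delete every ESC[...m and ESC[...K sequence before writing the file
printf '%s\n' "$joint" \
    | sed "s/${esc}\[[0-9;]*[mK]//g" > "$outdir/joint_matches_in_menus"
```

Without the sed step, the saved file would contain the raw escape sequences around each match.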
Choice 4 for case repeats the print option from choice 3, and choice 5 is simply a 'tidy and exit' step.
More...
If you're a scripting wizard, you've probably seen half a dozen ways that 'grep4' could be improved, and you've been impatiently drumming your fingers while I meandered through that long-winded explanation. Improvements welcome in comments!
If you don't like my color taste, check out [this excellent resource on terminal colors](http://misc.flogisoft.com/bash/tip_colors_and_formatting) by developer Fabien Loison.