BASH drivers, start your engines
By Bob Mesibov, published 11/01/2018 in Tutorials
There's always more than one way to do a job in the shell, and there may not be One Best Way to do that job, either.
Nevertheless, different commands with the same output can differ in how long they take, how much memory they use and how hard they make the CPU work.
Out of curiosity I trialled 6 different ways to get the last 5 characters from each line of a text file, which is a simple text-processing task. The 6 commands are explained below and are abbreviated here as awk5, echo5, grep5, rev5, sed5 and tail5. These were also the names of the files generated by the commands.
Tracking performance
I ran the trial on a 1.6GB UTF-8 text file with 1559391514 characters on 3570866 lines, or an average of 437 characters per line, and no blank lines. The last 5 characters on every line were alphanumeric.
To time the 6 commands I used time (the BASH shell built-in, not GNU time) and while the commands were running I checked top to follow memory and CPU usage. My system is the Dell OptiPlex 9020 Micro described here and runs Debian 9.
All 6 commands used between 1 and 1.4GB of memory (VIRT in top), and awk5, echo5, grep5 and sed5 ran at close to 100% CPU usage. Interestingly,
rev5 ran at ca 30% CPU and tail5 at ca 15%.
To ensure that all 6 commands had done the same job, I did a diff on the 6 output files, each about 21 MB:
And the winner is...
Here are the elapsed times:
Well, AWK (GNU AWK 4.1.4) is really fast. Sure, all 6 commands could process a 100-line file zippety-quick, but for big text-processing jobs, fire up your AWK.
Commands used
awk '{print substr($0,length($0)-4,5)}' file > awk5
awk5 used AWK's substring function. The function works on the whole line ($0), starts at the 4th character back from the last character (length($0)-4) and returns 5 characters (5).
#!/bin/bash
while read line; do echo "${line: -5}"; done < file > echo5
exit
echo5 was run as a script and uses a while loop for processing one line at a time. The BASH string function "${line: -5}" returns the last 5 characters in "$line".
grep -o '.....$' file > grep5
In grep5, grep searches each line for the last 5 characters (.....$) and returns (with the -o option) just that searched-for string.
#!/bin/bash
while read line; do rev <<<"$line" | cut -c1-5 | rev; done < file > rev5
exit
The rev5 trick in this script has appeared often in online forums. Each line is first reversed with rev, then cut is used to return the first 5 characters, then the 5-character string is reversed with rev.
sed 's/.*\(.....\)/\1/' file > sed5
sed5 is a simple use of sed (GNU sed 4.4) but was surprisingly slow in the trial. In each line, sed replaces zero or more characters leading up to the last 5 with just those last 5 (as a backreference).
#!/bin/bash
while read line; do tail -c 6 <<<"$line"; done < file > tail5
exit
The "-c 6" in the tail5 script means that tail captures the last 5 characters in each line plus the newline character at the end.
Actually, the "-c" option captures bytes, not characters, meaning if the line ends in multi-byte characters the output will be corrupt. But would you really want to use the ultra-slow tail for this job in the first place?