How to deal with NBSPs in a terminal
By Bob Mesibov, published 02/02/2018 in Tutorials
A non-breaking space (NBSP) is a special kind of whitespace. It's an invisible signal that tells a text-processing program to avoid replacing that space with a linefeed or carriage return.
For example, if you wrote "2 kg" in LibreOffice Writer and didn't want the "2" to be at the end of one line and the "kg" at the beginning of the next, you would put a NBSP between "2" and "kg". (Add a NBSP with Ctrl + Shift + space.) Writer helpfully shows a NBSP as a vertical gray bar:
NBSPs are also useful in HTML documents, where they're coded as &nbsp; and rendered as whitespace by a browser.
NBSPs, however, are fairly useless in text in a terminal. When NBSPs occur where they're not supposed to, they can give unexpected results in data and code processing. In the first command below there's an ordinary whitespace between "NBSP" and "hazard"; in the second command the space is a NBSP:
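As one illustration of the hazard, here is a minimal sketch that assumes a POSIX shell and any awk. It is an assumed example, not necessarily the pair of commands the article had in mind: it builds the NBSP from its UTF-8 octal bytes with printf and counts whitespace-separated fields. The variable names are made up for the example.

```shell
# Build a UTF-8 NBSP from its octal bytes (302 240) so the example does not
# depend on how an editor or terminal handles a pasted NBSP.
nbsp=$(printf '\302\240')

# Ordinary space between the words: awk splits the line into two fields.
plain_fields=$(printf 'NBSP hazard\n' | awk '{print NF}')

# NBSP between the words: awk does not split there, so it sees one field.
nbsp_fields=$(printf 'NBSP%shazard\n' "$nbsp" | awk '{print NF}')

printf 'ordinary space: %s field(s)\n' "$plain_fields"
printf 'NBSP:           %s field(s)\n' "$nbsp_fields"
```

Both lines look identical on screen, but a program that splits on whitespace treats them differently.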
Finding NBSPs
The commonly seen NBSP is character number 160 in the ISO-8859 character sets and Windows-1252. That "160" is "240" in octal and "a0" in hexadecimal, and it's a one-byte character.
In Unicode, and in UTF-8 encoding as in our Linux distros, the same character is two bytes, as shown here by the od command:
The NBSP is "302 240" in octal, or "c2 a0" in hex.
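The byte values can be checked with a short sketch like the one below, which assumes a POSIX shell and od. The NBSP is generated with printf from its octal bytes rather than typed, and the variable names are only illustrative.

```shell
# Dump the UTF-8 NBSP byte by byte; -An suppresses od's offset column.
nbsp_octal=$(printf '\302\240' | od -An -to1)   # octal view
nbsp_hex=$(printf '\302\240' | od -An -tx1)     # hex view

# In a one-byte encoding such as ISO-8859-1 the same character is the
# single byte 240 (octal).
latin1_octal=$(printf '\240' | od -An -to1)

printf 'UTF-8, octal:      %s\n' "$nbsp_octal"
printf 'UTF-8, hex:        %s\n' "$nbsp_hex"
printf 'ISO-8859-1, octal: %s\n' "$latin1_octal"
```

The exact spacing of od's output varies a little between implementations, but the byte values should not.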
In text files that only use ASCII and extended ASCII characters, you can find NBSPs by searching for the hex value "a0" alone. If your grep command supports Perl syntax (the -P option in GNU grep) and you're working in a UTF-8 locale, "\xa0" is read as the character U+00A0 rather than as a lone byte, so you can find lines containing NBSPs with
grep -P "\xa0"
but you'll need to highlight the output in the terminal to make your find visible:
If your grep doesn't allow Perl syntax or the file contains lots of non-ASCII-type Unicode characters, use this syntax and the full two-byte hex code:
grep $'\xc2\xa0'
If that doesn't work with your grep, try
grep "["$'\xc2\xa0'"]"
grep will also find lines with NBSPs if you specify the octal value or the Unicode designation (U+00A0) of NBSP and use printf to tell grep what to look for:
grep "$(printf '\302\240')"
grep "$(printf '\u00a0')"
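To try the printf approach on known data, a sketch along these lines may help. It assumes a POSIX shell, grep and mktemp. The sample file and variable names are hypothetical, and the LC_ALL=C prefix is an added assumption that makes grep compare raw bytes whatever the current locale is.

```shell
# Hypothetical sample data: only line 2 contains a NBSP.
sample=$(mktemp)
printf '2 kg\n2\302\240kg\n3 kg\n' > "$sample"

# Report the line numbers of lines containing the two-byte UTF-8 NBSP.
nbsp_lines=$(LC_ALL=C grep -n "$(printf '\302\240')" "$sample" | cut -d: -f1)
printf 'NBSP found on line(s): %s\n' "$nbsp_lines"

rm -f "$sample"
```

Adding -n to any of the grep commands above gives line numbers in the same way, which is handy because the match itself is invisible.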
A trick to make a NBSP more easily visible is to convert it to its octal value with sed and its 'l' command (that's an ell, not a one), then to grep with color for that octal value:
sed -n 'l' | grep --color '\\302\\240'
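Here is a small sketch of what that pipeline produces, assuming a POSIX shell, sed and grep. The input line is generated with printf, the variable names are illustrative, and LC_ALL=C is an added assumption that asks sed to treat the input as plain bytes.

```shell
# One line with a NBSP between "2" and "kg", built from its octal bytes.
# sed's 'l' command prints non-printable bytes as octal escapes and marks
# the end of the line with "$".
visible=$(printf '2\302\240kg\n' | LC_ALL=C sed -n 'l')
printf '%s\n' "$visible"

# Count the lines in which the escaped NBSP now shows up as plain text.
matches=$(printf '%s\n' "$visible" | grep -c '\\302\\240')
printf 'lines with a NBSP: %s\n' "$matches"
```

One caveat, as far as I know: GNU sed wraps the output of 'l' at 70 characters by default, so on a long line the two escapes could land on different output lines. With GNU sed, `sed -n 'l 0'` turns the wrapping off.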
Removing NBSPs
In most cases "removing" NBSPs means replacing them with ordinary whitespace. Either sed or AWK will do this job nicely:
sed 's/\xc2\xa0/ /g'
awk '{gsub(/\xc2\xa0/," "); print}'
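If your sed doesn't understand "\x" escapes, one possible variant is to hand it the literal NBSP bytes produced by printf. The sketch below assumes a POSIX shell, sed and grep. It also checks that no NBSP is left afterwards. The variable names are made up for the example.

```shell
# The UTF-8 NBSP as literal bytes, and a test string containing it.
nbsp=$(printf '\302\240')
before=$(printf '2%skg' "$nbsp")

# Replace every NBSP with an ordinary space; LC_ALL=C keeps sed byte-oriented.
after=$(printf '%s\n' "$before" | LC_ALL=C sed "s/$nbsp/ /g")

# Verify: count the lines that still contain a NBSP. grep -c exits non-zero
# when the count is 0, hence the "|| true".
remaining=$(printf '%s\n' "$after" | LC_ALL=C grep -c "$nbsp" || true)

printf 'after replacement: "%s"\n' "$after"
printf 'lines still containing a NBSP: %s\n' "$remaining"
```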
Other non-ordinary spaces
Note that there are spaces in Unicode that are neither ordinary whitespace nor the usual NBSP, like
- U+2002 en space
- U+2003 em space
- U+2004 three-per-em space
- U+2005 four-per-em space
- U+2006 six-per-em space
- U+2007 figure space
- U+2008 punctuation space
- U+2009 thin space
- U+200A hair space
- U+202F narrow no-break space
- U+205F medium mathematical space
- U+3000 ideographic space
For more on these oddities, search the excellent Graphemica website for "space". I don't know of a simple, general way to search for all the Unicode space characters at once, but individually they could be found and replaced as above for NBSPs.
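The individual searches can at least be automated with a loop. The sketch below is one possible approach, not a general solution: it assumes a POSIX shell, grep and mktemp, covers only NBSP plus the characters listed above, and uses UTF-8 byte sequences that I worked out by hand from the code points. The sample file and variable names are hypothetical.

```shell
# Hypothetical sample data: an en space on line 1, an ideographic space on line 3.
sample=$(mktemp)
printf 'en\342\200\202space\nplain space\nideographic\343\200\200space\n' > "$sample"

found=""
# Each entry: code point, then its UTF-8 encoding as octal escapes.
while read -r name octal; do
    pattern=$(printf "$octal")   # turn the escapes into literal bytes
    hits=$(LC_ALL=C grep -c "$pattern" "$sample" || true)
    if [ "$hits" -gt 0 ]; then
        printf '%s: %s line(s)\n' "$name" "$hits"
        found="$found $name"
    fi
done <<'EOF'
U+00A0 \302\240
U+2002 \342\200\202
U+2003 \342\200\203
U+2004 \342\200\204
U+2005 \342\200\205
U+2006 \342\200\206
U+2007 \342\200\207
U+2008 \342\200\210
U+2009 \342\200\211
U+200A \342\200\212
U+202F \342\200\257
U+205F \342\201\237
U+3000 \343\200\200
EOF

rm -f "$sample"
```

To check a real file, point grep at it instead of the sample, and add any other space characters you care about to the list.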
--
Note: The strange-looking symbol in the banner image has been proposed as the label for a "no break space" key on a keyboard. See this Wikimedia page.