Software is not data
By Bob Mesibov, published 07/11/2014 in Opinion
You know that trick where the magician snatches the tablecloth from the table, leaving the plates, cutlery and glassware in place? Watch as I do something similar on my computer.
The screenshot above shows the top left corner of a large data table of mine in a spreadsheet. Drum roll, please...
ALAKAZAM! The spreadsheet's gone, but the data are still there and ordered just like before: in rows, with tab-separated column entries.
As a means to store data, plain text is just as good as a spreadsheet, and plain text is a lot more accessible. The spreadsheet file can only be opened by the spreadsheet program, or by (another) spreadsheet or database program which can import the file without mangling it. A plain text file, in contrast, can be opened with any text editor or word processor, or even in a terminal with the cat command.
But storage and access aren't the whole story. I also want to add, delete, find and modify data items. I might want to use the data for various calculations. I definitely want to generate reports based on querying the data. Can I do all that with plain text?
Yes. Simple jobs can be done in a text editor, and for both simple and complex data manipulations there are command line programs (e.g. grep, sed and AWK) designed to work quickly and efficiently on plain text files.
So what have I lost by going from a spreadsheet to plain text? One major loss is data visibility. Spreadsheets and database browsers are great tools for viewing large numbers of data items at a glance. Spreadsheets also allow you to align particular data items, emphasise them with italics or bold font, and set off particular data items with colours or thickened cell borders.
There's more. Data items in a spreadsheet can be annotated with comments, or hyperlinked to external files. Copy/pasting is visually simpler in a spreadsheet, and copy-down in a GUI is quicker than coding. Seriously awkward calculations, of the kind add contents of cell M23 to half the product of cells AA29 and BB61, but only if the contents of cell Z14 are greater than zero, are much easier in a spreadsheet.
But those advantages come at a cost. When I keep my data in a spreadsheet or database application, it's captive there. I can only access the data by running the application. The software effectively is the data. How often has someone sent you a Microsoft Excel file, instead of raw data as a CSV or TSV? Have you ever wanted a particular database report, only to be told it wasn't a standard output, and that the database manager would need to talk nicely to the database as a special job for you?
When data gets eaten by software, data management is limited by the capabilities of that software. Some of the data management operations I do can't be done at all, or can only be done with difficulty, by spreadsheets and databases. For 'with difficulty' think reading help files and exploring submenus, writing complicated macros, attending day-long training sessions, emailing experts for assistance, etc. Yet those non-standard jobs of mine can be done quickly and simply on the command line.
Conversely, I don't use most of the capabilities available in today's spreadsheets and databases. They're wonderfully feature-rich programs, sure, but for my purposes that richness is... well, bloat.
Another problem with data captive to software is migration. I have 27 tables in SuperDB 3 (I made that up) and I need to migrate them to MagnaDB 7 (I made that up, too). There's no direct import option, so I have to export the 27 tables individually as plain text from the first application and import the 27 tables as plain text into the second application. Hmm. Wouldn't it have been easier if I'd managed the tables as plain text in the first place?
Spreadsheets and databases are complex programs written to enable users to do marvellous things with data, but only if the data are kept within those applications. That trade-off isn't widely appreciated. I can't help thinking that many people habitually open a new spreadsheet or database file to store data items and to do simple operations on those items, when the job could just as well be done by a plain text file and either a text editor or the command line.
Well, that's just an old fella's opinion, but it's based on nearly 50 years of mucking around with data tables. I started with tab-separated columns on paper (remember typewriters?) before moving to spreadsheets, flat-file databases, relational databases, the elegantly simple SQLite, and finally to plain text and the CLI. In my case, the magicians who pulled off the tablecloth were the wizards behind the GNU AWK programming language. AWK is very, very good for plain text data management. I don't miss the tablecloth.