WSL - whitespace separated literals

% DOMAIN CityID Int
% DOMAIN CityName String
% DOMAIN CountryCode ID
% TABLE City CityID CityName CountryCode District
% TABLE Country CountryCode CountryCode2 CountryName
% TABLE Capital CountryCode CityID
% TABLE Language CountryCode Language IsOfficial
% ...

Country ABW AW [Aruba]
Country AFG AF [Afghanistan]
Country AGO AO [Angola]
Country AIA AI [Anguilla]
Country ALB AL [Albania]
Country AND AD [Andorra]
Country ANT AN [Netherlands Antilles]
Country ARE AE [United Arab Emirates]
Country ARG AR [Argentina]
Country ARM AM [Armenia]
...

WSL is a clean and practical plain text format for relational data. It comprises a schema language and a notation for typed database tuples with schema-supported lexical syntax.

It is accessible to format-agnostic text utilities like grep/sed/awk as well as to specialized tools that understand the format and can take advantage of the schema information. Due to its simplicity it is also amenable for tooling — for example logical or hierarchial query languages.

Example database

Here is an an example WSL database.

Specification

The current specification

Implementations

A python library.

Practical advantages

Textual storage: scriptability, version control, simple and robust data exchange
Domain-specific and user-defined lexers: low syntactic overhead, canonic representation
Schema support: data integrity, build tooling and query languages.
Convenience: Use the python3 implementation (more to come) to parse or format runtime objects from/to text in a single line of code.

Relational model

The relational model was popularized by Edward Codd in the 1970's. The message was: For persistent data hierarchical representations are often a poor choice; flat tables are often superior.

Why? Hierarchical structures can easily be emulated with flat database tuples that reference each other. On the other hand, hierarchical representations are just transformations of relational data that are opinionated about the access path through which information should be extracted from the database. One has to start at the top of a fixed hierarchy and navigate all the way to the bottom. For example, a list of Employers each containing a list of his/her Employees. But what if one wants to start from a given Employee and find all the Employee's Employers? Too bad the hierarchies don't start with Employees, so one has to go through all Employers and see if the Employee is there, writing a ton of ad-hoc code (which breaks easily when the representation is changed).

Similarly it's difficult to define the structure of what goes where, and which references resolve where, in a nested hierarchy. (Maybe the only somewhat popular hierarchical schema language is XML DTD, but it is not easy to use). If however we constrain data modelling to flat tables, it's easy to express and implement integrity constraints in a simple schema language.

The relational model is closely connected to logic programming. It is a restricted version of first order logic, where each table in the schema is a predicate, and each database table is a universe where each predicate is true if and only if the corresponding database tuple exists in the table.

WSL compared to JSON / XML / S-Expressions ...

WSL is a notation for relational data (flat tables). It does not directly serialize hierarchical data like JSON. However, much hierarchical data found in practice would better be represented as relational data. (See "Relational model").

JSON lacks a schema language, and as a consequence it can't provide a lexical syntax that is both convenient and canonical ("pick one") and can't support data integrity. It also has only few available datatypes. In conclusion, it doesn't offer much support for data modelling. Its success came mostly from mapping easily to the basic built-in types of most dynamic languages.

S-Expressions are somewhat similar to JSON, but less widely used.

The situation is different with XML / DTD / XSLT, which do provide support for well-formedness beyond syntax. But they are syntactically and conceptually heavy.

WSL compared to CSV

CSV is the most widespread format for storing relations as text. It is kind-of-portable, and, thanks to its simplicity, immensely popular and supported virtually everywhere. It has a number of shortcomings, though. Compared to CSV, WSL offers

Contains database (schema + many tables) instead of only one table
Primitive types; built-in and extendable types (CSV has only strings)
Per-type lexical syntax: more readable, editable, scriptable
Non-broken quoting rules.
Better Integrity: primitive types, unique and foreign keys

WSL compared to SQL dump files

The example database was converted manually from this sample database, which is a (probably not very portable) MySQL dump. (I haven't bothered to convert the JSON data in the CountryInfo table). It is freely available from the MySQL website. This is the kind of data WSL was designed for.

A comparison to dump files might appear silly, but it illustrates WSL's design goals. The conversion was done with tedious regular expressions and manual selection and editing. Conversely, it should be easy to re-create an SQL file from the WSL database with vim or grep + sed.

Design decisions

One design decision was to encourage many tables with few columns, instead of few tables with many columns as is common with SQL and big datasets / heavy database servers.

There are good reasons for this

Practicality: At more than four or five columns, table rows become less practical to read and edit in a text editor.
Normalization: tables with fewer columns tend to carry more relational meaning. In SQL culture, tables with more columns typically have many NULL-able attribute/tag/comment type columns and are quite denormalized (and so are in particular OUTER JOINS of such tables).

One way in which WSL encourages fewer columns is making column names optional. The schema designer is encouraged to communicate meaning of data only through table names and their columns' types. This is supported by the separation of the concepts of datatypes (which can't be used directly as columns) and domains (which have to be declared as "instances" of datatypes with optional parameterization).

Take as an example the definition of the City table.

% DOMAIN CityID Integer
% DOMAIN CityName String
% DOMAIN District String
% DOMAIN Population Integer
% TABLE City CityID CityName CountryCode District Population

Here we first create new meaningful domains from available datatypes. Now the meanings of a table's columns are clear from their domains and from the table name. Having many distinct domains is also really useful to avoid comparing apples with oranges in logic queries.

Only rarely is this approach problematic, when a table has two columns of the same datatype and it's not super clear which column has what meaning in the relation. On the other hand, anybody who has written their share of SQL joins will know how painful it is to have to rename columns for each intermediate table because either the names clash or the context changed so the name is not appropriate anymore.

Another means to keep the number of columns low is omitting (first class) NULL-able columns. The practical effects of this are illustrated by the conversion of the example SQL database to WSL. The Country table had to be split in two. The SQL version

CREATE TABLE `Country` (
  `Code` char(3) NOT NULL DEFAULT '',
  `Name` char(52) NOT NULL DEFAULT '',
  `Capital` int(11) DEFAULT NULL,
  `Code2` char(2) NOT NULL DEFAULT ''
);

was translated to

% TABLE Country CountryCode CountryCode2 CountryName
% TABLE Capital CountryCode CityID

(integrity constraints omitted). In this way the number of columns per table was reduced, normalization was improved, and the need for a NULL-able column went away.