sane-tsv/SaneTsv
2024-02-14 16:16:53 -08:00

Sane TSV

Sane TSV is a strict format for tabular data.

'\n' (0x0A) characters delimit lines, and '\t' (0x09) characters delimit fields within a line.

'\n' and '\t' characters are allowed within fields by escaping them with a backslash character (0x5C) followed by 'n' (0x6E) and 't' (0x74) respectively. Additionally, '\' and '#' (0x23) must also be escaped. The '#' character is escaped for compatibility with Commented TSVs.

All fields must be UTF-8 encoded text. All escaping can be done before decoding (and after encoding).
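
Because the escaped bytes (0x5C, 0x0A, 0x09, 0x23) never occur inside a multi-byte UTF-8 sequence, escaping can operate directly on encoded bytes. The following Python sketch illustrates this. The function names are hypothetical, and it assumes that '\' and '#' are escaped by prefixing them with a backslash and that a raw '#' inside a field is rejected, neither of which is spelled out above.

```python
# Map each raw byte to its escaped form. The forms for '\' and '#' are assumptions.
_ESCAPE = {0x5C: b"\\\\", 0x0A: b"\\n", 0x09: b"\\t", 0x23: b"\\#"}
# Map the byte that follows a backslash back to the raw byte.
_UNESCAPE = {ord("\\"): 0x5C, ord("n"): 0x0A, ord("t"): 0x09, ord("#"): 0x23}


def escape_field(raw: bytes) -> bytes:
    """Escape an already UTF-8 encoded field for writing."""
    out = bytearray()
    for b in raw:
        if b in _ESCAPE:
            out += _ESCAPE[b]
        else:
            out.append(b)
    return bytes(out)


def unescape_field(escaped: bytes) -> bytes:
    """Undo escape_field; the result can then be decoded as UTF-8."""
    out = bytearray()
    it = iter(escaped)
    for b in it:
        if b == 0x5C:
            nxt = next(it, None)  # None means the field ended on a lone backslash
            if nxt not in _UNESCAPE:
                raise ValueError("invalid escape sequence in field")
            out.append(_UNESCAPE[nxt])
        elif b == 0x23:
            # Assumption: an unescaped '#' inside a field is an error.
            raise ValueError("unescaped '#' in field")
        else:
            out.append(b)
    return bytes(out)
```

For example, `unescape_field(escape_field(text.encode("utf-8"))).decode("utf-8")` round-trips any string `text`.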

Empty fields (i.e. two consecutive '\t' characters) are allowed.

The first line is always the header and the fields of the header are the column names for the file. Column names must be unique within the file and must not contain ':' characters (for compatibility with Typed TSVs).

All lines in the file must have the same number of fields.

The file must not end with '\n'. A trailing '\n' is treated as an empty row at the end of the file and causes an error.
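
The rules above can be sketched as a small parser. This is a minimal Python illustration, not the repository's .NET implementation. The name `parse_sane_tsv` and the error messages are hypothetical, and it assumes '\' and '#' are escaped as '\\' and '\#' and that a raw '#' is rejected.

```python
_ESCAPES = {ord("n"): 0x0A, ord("t"): 0x09, ord("\\"): 0x5C, ord("#"): 0x23}


def _unescape(field: bytes) -> str:
    """Unescape one field at the byte level, then decode it as UTF-8."""
    out = bytearray()
    it = iter(field)
    for b in it:
        if b == 0x5C:
            nxt = next(it, None)  # None: the field ended on a lone backslash
            if nxt not in _ESCAPES:
                raise ValueError("invalid escape sequence")
            out.append(_ESCAPES[nxt])
        elif b == 0x23:
            raise ValueError("unescaped '#'")  # assumption, see lead-in
        else:
            out.append(b)
    return out.decode("utf-8")


def parse_sane_tsv(data: bytes) -> tuple[list[str], list[list[str]]]:
    """Return (column names, records) or raise ValueError."""
    if data.endswith(b"\n"):
        raise ValueError("file must not end with '\\n'")
    # Split on raw delimiters first; escaped ones are still two-byte sequences.
    rows = [[_unescape(f) for f in line.split(b"\t")] for line in data.split(b"\n")]
    header, records = rows[0], rows[1:]
    if len(set(header)) != len(header):
        raise ValueError("column names must be unique")
    if any(":" in name for name in header):
        raise ValueError("column names must not contain ':'")
    for number, record in enumerate(records, start=2):
        if len(record) != len(header):
            raise ValueError(f"line {number} has {len(record)} fields, expected {len(header)}")
    return header, records
```

Splitting before unescaping is safe here because an escaped tab or newline never contains the raw delimiter byte.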

Implementations of the format do not need to handle file reading and writing directly, but if they do, they should enforce usage of the file extension '.stsv'. They should also provide a manual override option so that other extensions may be forced.
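
One way an implementation might enforce the extension while still offering an override is sketched below in Python. The names `check_extension`, `read_sane_tsv_file`, and the `force_extension` parameter are hypothetical and not taken from the repository.

```python
from pathlib import Path


def check_extension(path: str, force_extension: bool = False) -> None:
    """Raise ValueError unless the path ends in '.stsv' or the override is set."""
    if not force_extension and Path(path).suffix != ".stsv":
        raise ValueError(
            f"expected a '.stsv' file, got {path!r}; "
            "pass force_extension=True to override"
        )


def read_sane_tsv_file(path: str, force_extension: bool = False) -> bytes:
    """Read the raw bytes of a Sane TSV file after checking its extension."""
    check_extension(path, force_extension)
    return Path(path).read_bytes()
```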

Typed TSV

Typed TSV allows for typing of columns. All column names in a typed TSV must end with ':' (0x3A) and then one of the following types:

  • 'string'
  • 'boolean'
  • 'float32'
  • 'float64'
  • 'uint32'
  • 'uint64'
  • 'int32'
  • 'int64'
  • 'binary'

Any other value is an error. However, the portion of the name prior to the last ':' may be anything and may include ':' characters.
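
Since only the last ':' separates the name from the type, a header field can be split from the right. A minimal Python sketch, with `split_typed_column` as a hypothetical helper name:

```python
VALID_TYPES = {
    "string", "boolean", "float32", "float64",
    "uint32", "uint64", "int32", "int64", "binary",
}


def split_typed_column(column: str) -> tuple[str, str]:
    """Split 'name:type' on the last ':' and validate the type."""
    name, sep, type_name = column.rpartition(":")
    if not sep:
        raise ValueError(f"column {column!r} has no ':' type suffix")
    if type_name not in VALID_TYPES:
        raise ValueError(f"column {column!r} has unknown type {type_name!r}")
    return name, type_name
```

For example, `split_typed_column("a:b:int32")` yields the name `"a:b"` and the type `"int32"`.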

All fields in the rest of the file must be of the type corresponding to their column.

Aside from the 'binary' column type, all fields must be UTF-8 encoded text. Each type has the following restrictions:

  • 'boolean' fields must contain only and exactly the text "TRUE" or "FALSE".

  • 'float32' and 'float64' correspond to single and double precision IEEE 754 floating-point numbers respectively. They should be formatted like this regex: -?[0-9]\.([0-9]|[0-9]+[1-9])E-?[1-9][0-9]*

    Both float types may additionally have these values:

    • 'sNaN'
    • 'qNaN'
    • '+inf'
    • '-inf'

  • 'uint32' and 'uint64' are unsigned 32 and 64 bit integers respectively. They should be formatted like this regex: [1-9][0-9]*

  • 'int32' and 'int64' are signed 32 and 64 bit integers respectively. They should be formatted like this regex: -?[1-9][0-9]* (except that '-0' is not allowed)
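
The restrictions above can be checked with a sketch like the following Python function. The regexes are copied verbatim from the text, so as written they do not match "0". The name `is_valid_field` is hypothetical, and the range checks are an assumption derived from the stated bit widths. 'binary' is left out because its encoding is not described here.

```python
import re

FLOAT_RE = re.compile(r"-?[0-9]\.([0-9]|[0-9]+[1-9])E-?[1-9][0-9]*")
FLOAT_SPECIALS = {"sNaN", "qNaN", "+inf", "-inf"}
UINT_RE = re.compile(r"[1-9][0-9]*")
INT_RE = re.compile(r"-?[1-9][0-9]*")


def is_valid_field(type_name: str, text: str) -> bool:
    """Check an unescaped, decoded field against its column type."""
    if type_name == "string":
        return True
    if type_name == "boolean":
        return text in ("TRUE", "FALSE")
    if type_name in ("float32", "float64"):
        # Only the textual format is checked, not representability.
        return text in FLOAT_SPECIALS or FLOAT_RE.fullmatch(text) is not None
    if type_name in ("uint32", "uint64"):
        bits = int(type_name[-2:])
        # Assumption: values outside the unsigned range are invalid.
        return UINT_RE.fullmatch(text) is not None and int(text) < 2**bits
    if type_name in ("int32", "int64"):
        bits = int(type_name[-2:])
        # Assumption: values outside the two's-complement range are invalid.
        return (
            INT_RE.fullmatch(text) is not None
            and -(2 ** (bits - 1)) <= int(text) < 2 ** (bits - 1)
        )
    raise ValueError(f"unsupported type {type_name!r}")
```

`fullmatch` is used so that the whole field must match the pattern, not just a prefix.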

Commented TSV

Comments after the last record are an error.