
Sane TSV

Sane Tab-Separated Values is a family of tabular formats intended as an alternative to the under-specified TSV / CSV quagmire.

Simple TSV

Simple TSV is a strict format for tabular data.

'\n' (0x0A) characters delimit lines, and '\t' (0x09) characters delimit fields within a line.

'\n' and '\t' characters are allowed within fields by escaping them as a backslash character (0x5C) followed by 'n' (0x6E) or 't' (0x74) respectively. The '\' and '#' (0x23) characters must also be escaped. The '#' character is escaped for compatibility with Commented TSVs.
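The escaping rules above can be sketched as a pair of byte-level functions. This is a minimal sketch, not the reference implementation: the names `escape_field` and `unescape_field` are hypothetical, and it assumes that '\' and '#' are escaped as a backslash followed by the character itself, which the text above does not spell out. Working on bytes is safe because none of the four special bytes can occur inside a multi-byte UTF-8 sequence.

```python
# Assumption: '\' is written as '\\' and '#' as '\#'; the spec only names
# the escaped forms of newline and tab explicitly.
_ESCAPES = {
    0x5C: b"\\\\",  # backslash
    0x0A: b"\\n",   # newline
    0x09: b"\\t",   # tab
    0x23: b"\\#",   # '#'
}
_UNESCAPES = {ord("\\"): 0x5C, ord("n"): 0x0A, ord("t"): 0x09, ord("#"): 0x23}


def escape_field(raw: bytes) -> bytes:
    """Return raw with newline, tab, backslash and '#' bytes escaped."""
    out = bytearray()
    for byte in raw:
        out += _ESCAPES.get(byte, bytes([byte]))
    return bytes(out)


def unescape_field(escaped: bytes) -> bytes:
    """Reverse escape_field; raise ValueError on an unknown or cut-off escape."""
    out = bytearray()
    i = 0
    while i < len(escaped):
        byte = escaped[i]
        if byte == 0x5C:
            if i + 1 >= len(escaped) or escaped[i + 1] not in _UNESCAPES:
                raise ValueError(f"invalid escape sequence at byte {i}")
            out.append(_UNESCAPES[escaped[i + 1]])
            i += 2
        else:
            out.append(byte)
            i += 1
    return bytes(out)
```

For example, `escape_field(b"a\tb")` yields the four bytes `a`, `\`, `t`, `b`, and `unescape_field` turns them back into the original three.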

All fields must be UTF-8 encoded text. All escaping can be done before decoding (and after encoding).

Empty fields (i.e. two consecutive '\t' characters) are allowed.

The first line is always the header and the fields of the header are the column names for the file. Column names must be unique within the file and must not contain ':' characters (for compatibility with Typed TSVs).

All lines in the file must have the same number of fields as are in the header.

The file must not end with '\n'. A trailing '\n' is treated as an empty row at the end of the file, which causes an error.
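Taken together, the Simple TSV rules can be sketched as a small strict parser. This is an illustrative sketch only: `parse_simple_tsv` is a hypothetical name, the escaped forms of '\' and '#' are assumed to be '\\' and '\#', and a trailing '\n' is rejected up front rather than being detected as an empty row.

```python
import re

# Assumption: '\' and '#' are escaped as '\\' and '\#'.
_UNESCAPE = {b"n": b"\n", b"t": b"\t", b"\\": b"\\", b"#": b"#"}
# Matches an escape sequence (or a lone trailing backslash) or a bare '#'.
_TOKEN = re.compile(rb"\\(.?)|#", re.DOTALL)


def _unescape(field: bytes) -> str:
    def replace(match):
        code = match.group(1)  # None for a bare '#', b"" for a lone '\'
        if code not in _UNESCAPE:
            raise ValueError(f"invalid sequence {match.group(0)!r} in field")
        return _UNESCAPE[code]

    # Escaping is undone on bytes first, then the result is decoded as UTF-8.
    return _TOKEN.sub(replace, field).decode("utf-8")


def parse_simple_tsv(data: bytes):
    """Return (column_names, rows); raise ValueError on any rule violation."""
    if data.endswith(b"\n"):
        raise ValueError("file must not end with '\\n'")
    # Raw newlines and tabs only ever appear as delimiters, so split first.
    lines = [line.split(b"\t") for line in data.split(b"\n")]
    header = [_unescape(field) for field in lines[0]]
    if len(set(header)) != len(header):
        raise ValueError("column names must be unique")
    if any(":" in name for name in header):
        raise ValueError("column names must not contain ':'")
    rows = []
    for number, fields in enumerate(lines[1:], start=2):
        if len(fields) != len(header):
            raise ValueError(
                f"line {number} has {len(fields)} fields, expected {len(header)}"
            )
        rows.append([_unescape(field) for field in fields])
    return header, rows
```

Splitting on the raw delimiter bytes before unescaping keeps the parser simple, since an escaped tab or newline inside a field never contains the raw byte.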

Implementations of the format do not need to handle file reading and writing directly, but if they do, they should enforce usage of the file extension '.stsv'. They should also provide a manual override option so that other extensions may be forced.

Typed TSV

Typed TSV builds on Simple TSV to allow for typing of columns. All column names in a typed TSV must end with ':' (0x3A) and then one of the following types:

  • 'string'
  • 'boolean'
  • 'float32'
  • 'float32-le'
  • 'float64'
  • 'float64-le'
  • 'uint32'
  • 'uint64'
  • 'int32'
  • 'int64'
  • 'binary'

Any other type is an error. However, the portion of the name before the last ':' may be anything and may include ':' characters.
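Because only the last ':' separates the name from the type, a typed column header can be split from the right. The sketch below is one way to do that; `split_typed_column` and `COLUMN_TYPES` are hypothetical names chosen for illustration.

```python
COLUMN_TYPES = frozenset({
    "string", "boolean",
    "float32", "float32-le", "float64", "float64-le",
    "uint32", "uint64", "int32", "int64",
    "binary",
})


def split_typed_column(column: str) -> tuple[str, str]:
    """Split 'name:type' at the last ':' and validate the type."""
    name, separator, type_name = column.rpartition(":")
    if not separator:
        raise ValueError(f"column {column!r} has no ':type' suffix")
    if type_name not in COLUMN_TYPES:
        raise ValueError(f"unknown column type {type_name!r}")
    return name, type_name
```

With this, `split_typed_column("a:b:string")` returns `("a:b", "string")`, so earlier ':' characters stay part of the name.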

All fields in the rest of the file must be of the type corresponding to their column.

Aside from the 'binary', 'float32-le', and 'float64-le' column types, all fields must be UTF-8 encoded text. Each type has the following restrictions:

  • 'boolean' fields must contain only and exactly the text "TRUE" or "FALSE".

  • 'float32' and 'float64' correspond to single- and double-precision IEEE 754 floating-point numbers respectively. They should match this regex: -?[0-9]\.([0-9]|[0-9]+[1-9])E-?[1-9][0-9]*

    Both float types may additionally have these values:

    • 'sNaN'
    • 'qNaN'
    • '+inf'
    • '-inf'
  • 'float32-le' and 'float64-le' are also IEEE 754 floating-point numbers, but are stored as binary. They must always be stored in little-endian order.

    The reason for having a separate binary format for them is that round-tripping floating-point text values between different parsers is unlikely to work in all cases. The text-based format should be fine for general use, but when exact value transfer is needed, the binary formats are available.

  • 'uint32' and 'uint64' are unsigned 32- and 64-bit integers respectively. They should match this regex: [1-9][0-9]*

  • 'int32' and 'int64' are signed 32- and 64-bit integers respectively. They should match this regex: -?[1-9][0-9]* (except that '-0' is not allowed)
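The text-based restrictions above can be checked with the regexes exactly as written. The sketch below is a hedged illustration: `matches_type` is a hypothetical name, the integer range limits are an assumption derived from the stated bit widths, and float magnitudes are not checked. Since the spec says fields "should" match these patterns, a real parser may accept more than this strict check does.

```python
import re

# Patterns copied literally from the type descriptions.
FLOAT_RE = re.compile(r"-?[0-9]\.([0-9]|[0-9]+[1-9])E-?[1-9][0-9]*")
FLOAT_SPECIALS = {"sNaN", "qNaN", "+inf", "-inf"}
UINT_RE = re.compile(r"[1-9][0-9]*")
INT_RE = re.compile(r"-?[1-9][0-9]*")

# Assumption: values must also fit in the named bit width.
INT_RANGES = {
    "uint32": (0, 2**32 - 1),
    "uint64": (0, 2**64 - 1),
    "int32": (-(2**31), 2**31 - 1),
    "int64": (-(2**63), 2**63 - 1),
}


def matches_type(text: str, type_name: str) -> bool:
    """Return True if text is in the strict textual form for type_name."""
    if type_name == "string":
        return True
    if type_name == "boolean":
        return text in ("TRUE", "FALSE")
    if type_name in ("float32", "float64"):
        return text in FLOAT_SPECIALS or FLOAT_RE.fullmatch(text) is not None
    if type_name in INT_RANGES:
        pattern = UINT_RE if type_name.startswith("uint") else INT_RE
        if pattern.fullmatch(text) is None:
            return False
        low, high = INT_RANGES[type_name]
        return low <= int(text) <= high
    raise ValueError(f"{type_name!r} is not a text-based type")
```

Using `fullmatch` rather than `match` matters here: it rejects trailing garbage such as a redundant zero in "1.50E3".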

Binary fields are left as-is (after unescaping is performed).
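For the binary float types, the little-endian layout can be illustrated with Python's standard `struct` module. This is a sketch under assumptions: `encode_float_le` and `decode_float_le` are hypothetical names, and the bytes shown are the unescaped field contents. It assumes that, like any other field, they would still pass through escaping when written, because the raw bytes can happen to contain 0x09, 0x0A, 0x23 or 0x5C.

```python
import struct

# '<' selects little-endian byte order; 'f' is 32-bit, 'd' is 64-bit IEEE 754.
_FORMATS = {"float32-le": "<f", "float64-le": "<d"}


def encode_float_le(value: float, type_name: str) -> bytes:
    """Pack value into the unescaped bytes of a float32-le / float64-le field."""
    return struct.pack(_FORMATS[type_name], value)


def decode_float_le(raw: bytes, type_name: str) -> float:
    """Unpack an unescaped binary float field, checking its length first."""
    fmt = _FORMATS[type_name]
    if len(raw) != struct.calcsize(fmt):
        raise ValueError(
            f"{type_name} field must be {struct.calcsize(fmt)} bytes, got {len(raw)}"
        )
    return struct.unpack(fmt, raw)[0]
```

A 'float64-le' round trip through these two functions returns the identical Python float, which is the exact-transfer property the binary types exist for.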

Typed TSV files should have the .ytsv extension (.ttsv is already used).

Commented TSV

Commented TSV builds on Typed TSV and allows for more flexibility in the format by including line comments. The formats are kept distinct so that some applications can take advantage of the extra flexibility comments allow, while others can stick with the more restricted Typed TSV format.

Comment lines start with a '#' character. Unescaped '#' characters are not allowed on a line that does not start with a '#', so any '#' characters in fields must be escaped. Note that the leading '#' character is excluded from the comment data.

Multiple consecutive comment lines are considered a single comment, with each line separated by a '\n'.

Comments must be UTF-8 encoded text.

Comments are associated with the record beneath them. If a comment appears at the top of the file, it is associated with the file as a whole.

Comments after the last record are an error.
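The comment rules above can be sketched as a pre-pass that separates comments from data lines before any field parsing happens. The function name `split_comments` and the returned tuple shape are assumptions made for illustration. Data lines are kept as raw bytes so that a Typed TSV parser can handle them afterwards.

```python
def split_comments(data: bytes):
    """Return (file_comment, header_line, records).

    records is a list of (comment, line) pairs; comment is None when the
    record has no comment directly above it.
    """
    file_comment = None
    header = None
    records = []
    pending = []  # consecutive comment lines not yet attached to anything
    for line in data.split(b"\n"):
        if line.startswith(b"#"):
            pending.append(line[1:])  # the leading '#' is not comment data
            continue
        # Consecutive comment lines form one comment, joined by '\n'.
        comment = b"\n".join(pending).decode("utf-8") if pending else None
        pending = []
        if header is None:
            # A comment at the top of the file belongs to the file as a whole.
            header, file_comment = line, comment
        else:
            records.append((comment, line))
    if pending:
        raise ValueError("comments after the last record are an error")
    return file_comment, header, records
```

Only the '#' itself is stripped, so a line written as "# note" yields the comment text " note" with its leading space intact.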

Commented TSV files should have the .ctsv extension.

Extending the Commented TSV Format

Because it can include comments, this format lends itself well to extension. For example, if we wanted to extend the type system to include physical units, we could do so like this:

# UnitsTSV V1.0.0
id:uint32\tdatetime:string\tmeasurement1:m:float64\tmeasurement2:v:float64\tmeasurement3:1/s:float64

Note that extended formats must remain parseable by baseline parsers, hence we must include the base types after the new types.
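To illustrate why the base type comes last, the sketch below reads the example header the way a hypothetical UnitsTSV parser might. A baseline parser splits each column at the last ':' and sees a valid base type; the extended parser then splits the remaining name once more to find the unit. The function `split_units_column` and the rule that a column without a second ':' has no unit are assumptions, since the example format is not specified further.

```python
def split_units_column(column: str):
    """Return (name, unit, base_type); unit is None when no unit is given."""
    rest, separator, base_type = column.rpartition(":")
    if not separator:
        raise ValueError(f"column {column!r} has no base type")
    # What a baseline Typed TSV parser treats as the name may carry a unit.
    name, separator, unit = rest.rpartition(":")
    if not separator:
        return rest, None, base_type
    return name, unit, base_type


header = (
    "id:uint32\tdatetime:string\tmeasurement1:m:float64"
    "\tmeasurement2:v:float64\tmeasurement3:1/s:float64"
)
columns = [split_units_column(column) for column in header.split("\t")]
```

Here `columns[2]` is `("measurement1", "m", "float64")`, while a baseline parser would report the same column as a 'float64' column named "measurement1:m".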

Extended formats may also add restrictions. For example, they could disallow record comments and only allow the file comment above the header.

Extended formats may still use the .ctsv extension, though they could use a dedicated one instead.

Ideas for Extension

  • Physical units
  • Multiformats
    • Instead of multihashes, maybe have a column type for each hash type. That way we can avoid wasting data on the type within each field.
  • ISO 8601
  • https://github.com/multiformats/unsigned-varint
  • Color codes (e.g. #E359FF)
    • Both binary and string-based
  • JSON
  • XML
  • URL