2 Commits

Author SHA1 Message Date
Nathan McRae
cc8a122b57 Flesh out readme 2024-02-14 19:02:46 -08:00
Nathan McRae
e52dc01e7a Add comment parsing test 2024-02-14 18:32:31 -08:00
2 changed files with 48 additions and 4 deletions

@@ -34,5 +34,19 @@ using System.Text;
Console.WriteLine($"Passed {testName}");
}
Console.WriteLine("Done with tests");
}
{
string testName = "Comment test";
string testString1 = "#This is a file comment\n" +
" #One more file comment line\n" +
"column1:type:boolean\tcolumn2:binary\tcolumnthree\\nyep:string" +
"\n#This is a comment" +
"\n#Another comment line" +
"\nTRUE\tvalue\\\\t\0woo\tvaluetrhee" +
"\nFALSE\tnother\tno\\ther";
SaneTsv parsed = SaneTsv.ParseCommentedTsv(Encoding.UTF8.GetBytes(testString1));
}
Console.WriteLine("Done with tests");

@@ -20,7 +20,7 @@ Implementations of the format do not need to handle file reading and writing dir
# Typed TSV
Typed TSV allows for typing of columns. All column names in a typed TSV must end with ':' (0x3A) and then one of the following types:
Typed TSV builds on Sane TSV to allow for typing of columns. All column names in a typed TSV must end with ':' (0x3A) and then one of the following types:
- 'string'
- 'boolean'
@@ -49,10 +49,40 @@ Aside from the 'binary' column type, all fields must be UTF-8 encoded text. Each
- 'uint32' and 'uint64' are unsigned 32 and 64 bit integers respectively. They should be formatted like this regex: `[1-9][0-9]*`
- 'int32' and 'int64' are signed 32 and 64 bit integers respectively. They should be formatted like this regex: `-?[1-9][0-9]*` (except that '-0' is not allowed)
Typed TSV files should have the .ytsv extension (.ttsv is already used).
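The integer formats above can be checked mechanically. The following is a minimal sketch in Python, not part of the SaneTsv implementation; the names `UINT_PATTERN`, `INT_PATTERN`, `is_uint`, and `is_int` are hypothetical. It applies the two regexes exactly as written and does not check the 32 or 64 bit range.

```python
import re

# The two patterns are copied verbatim from the type descriptions above.
UINT_PATTERN = re.compile(r"[1-9][0-9]*")
INT_PATTERN = re.compile(r"-?[1-9][0-9]*")


def is_uint(field: str) -> bool:
    """Return True if the whole field matches the unsigned integer regex."""
    return UINT_PATTERN.fullmatch(field) is not None


def is_int(field: str) -> bool:
    """Return True if the whole field matches the signed integer regex."""
    return INT_PATTERN.fullmatch(field) is not None
```

Taken literally, both patterns reject leading zeros and '-0', and they also reject a bare '0', so an implementation that needs to store zero would have to decide how to treat it.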
# Commented TSV
Commented lines start with a '#' character at the beginning of the line. Unescaped '#' characters are not allowed on a line that does not start with a '#'. Any '#' characters in fields must be escaped.
Commented TSV builds on Typed TSV and allows for more flexibility in the format by including line comments. They are kept distinct so that some applications of it can take advantage of the extra flexibility, while others can stick with the more restricted Typed TSV format.
Commented lines start with a '#' character at the beginning of the line. Unescaped '#' characters are not allowed on a line that does not start with a '#'. Any '#' characters in fields must be escaped. Any unescaped '#' after the start of a line is an error.
Comments must be UTF-8 encoded text.
Comments after the last record are an error.
Comments are associated with the record beneath them. If a comment appears at the top of the file, it is associated with the file as a whole.
Comments after the last record are an error.
Commented TSV files should have the .ctsv extension.
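The association rules above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the `SaneTsv.ParseCommentedTsv` implementation: the function name `attach_comments` is hypothetical, the input is assumed to be already decoded and split into lines, the first non-comment line is assumed to be the header, and escaping and '#' characters inside fields are not checked.

```python
def attach_comments(lines):
    """Group '#' comment lines with the line beneath them.

    Returns (file_comments, header, records), where records is a list of
    (comments, record_line) pairs.
    """
    file_comments = []
    header = None
    records = []
    pending = []  # comments seen since the last non-comment line
    for line in lines:
        if line.startswith("#"):
            pending.append(line[1:])
        elif header is None:
            # Comments at the top of the file belong to the file as a whole.
            header = line
            file_comments, pending = pending, []
        else:
            # Comments are associated with the record beneath them.
            records.append((pending, line))
            pending = []
    if pending:
        raise ValueError("comments after the last record are an error")
    return file_comments, header, records
```

For example, `attach_comments("#file note\nname:string\n#about alice\nalice\nbob".split("\n"))` yields one file comment, the header line, and two records, of which only the first carries a comment.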
## Extending the Commented TSV Format
Because it can include comments, this format lends itself well to extension. For example, if we wanted to extend the type system to include physical units, we could do so like this:
```
# UnitsTSV V1.0.0
id:uint32\tdatetime:string\tmeasurement1:m:float64\tmeasurement2:v:float64\tmeasurement3:1/s:float64
```
Note that extended formats must remain parseable by baseline parsers, hence we must include the base types after the new types.
Extending formats may also have restrictions. For example, they could disallow record comments and only allow the file comment above the header.
Extended formats may still use the .ctsv extension, though they could use a dedicated one as well.
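To illustrate why the base type has to come last, here is a minimal Python sketch. It assumes a baseline parser takes the text after the last ':' as the column type, and that the hypothetical units extension puts the unit in the segment just before it; `split_baseline` and `split_units` are made-up names, not part of any existing parser.

```python
def split_baseline(column):
    """Split a column header the way a baseline Typed TSV parser might:
    everything after the last ':' is the type, the rest is the name."""
    name, _, base_type = column.rpartition(":")
    return name, base_type


def split_units(column):
    """Split a header for the hypothetical units extension into
    (name, unit, base_type). The unit is None when no unit is given."""
    name_and_unit, base_type = split_baseline(column)
    if ":" in name_and_unit:
        name, unit = name_and_unit.rsplit(":", 1)
    else:
        name, unit = name_and_unit, None
    return name, unit, base_type
```

With this split, a baseline parser sees `measurement1:m:float64` as a column named `measurement1:m` of type `float64`, while a units-aware parser additionally recovers the unit `m`.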
## Ideas for Extension
- Physical units
- Multiformats
- ISO 8601
- https://github.com/multiformats/unsigned-varint