TYSON 0.9.3

TYSON

The complete reference

Edition 1-4

Daniela Florescu

Edited by

Ghislain Fourny

ETH Zurich

Abstract

This document is a description of TYSON, the typed JSON language. It describes how to annotate JSON with types.
1. Introduction
2. Grammar
2.1. Whitespace
2.2. Top-level element
2.3. Tokens
2.4. Superset of JSON
2.5. Quotes around type names (non-binding side note)
3. Types
3.1. Type name
3.2. Value space
3.3. Lexical space
3.4. Builtin types
3.5. User-defined types
3.6. Implicit annotations
3.7. Type checks
3.8. (Absence of a) data model
4. Examples
A. Revision History

Chapter 1. Introduction

TYSON is an extended JSON syntax that:
  • is very simple
  • can be used to store semi-structured typed data in a language-agnostic way
  • can be used to exchange semi-structured typed data in a language-agnostic way
  • can be output by a schema validation engine, as a post-validation instance bearing type annotations
  • can serve as an input instance to such a schema validation engine as well
  • is designed to be orthogonal to, and compatible with, any JSON schema technology (JSON Schema, JSound, etc.)
  • is not tied to any host or querying language
  • makes very minimal, backward-compatible changes to JSON
  • does not specify a data model beyond the one naturally implied by its syntax
  • is forward compatible with a future standardized, more comprehensive set of atomic types
A conforming TYSON processor is a software library that parses a text document (read from disk, received from the network, etc.), checks it for well-formedness as specified in this specification, checks it for validity against builtin types and, upon success, exposes the parsed document to a consuming application for further processing. The consuming application may be written in any host language, which allows consuming applications to rely on a common, language-neutral interchange format for typed data. This document specifies the duties of a conforming TYSON processor, as well as which aspects (i.e., agreeing on the user-defined types used and their semantics) are left to the consuming applications.

Chapter 2. Grammar

The grammar is almost identical to that of JSON, and only adds optional type annotations in front of any JSON value (the second alternative of annotated-value below). TYSON furthermore enforces a few additional constraints that are not captured by the grammar and are specified in Chapter 3.
annotated-value
  value
  (  string ) value
object
  {}
  { members }
members
  pair
  pair , members
pair
  string : annotated-value
array
  []
  [ elements ]
elements
  annotated-value
  annotated-value , elements
value
  string
  number
  object
  array
  true
  false
  null
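
Non-normative sketch: the grammar above is small enough to be recognized by a short recursive-descent parser. The Python code below is one possible reading of it, not a reference implementation. The names tokenize, Parser and parse, as well as the (type name, value) tuple representation, are choices made for this illustration only. Atomic values are kept as raw lexemes, and no type checking is performed.

```python
import json
import re

# One token: punctuation, a JSON string, a JSON number, or a JSON literal.
TOKEN = re.compile(
    r'[ \t\n\r]*(?:'
    r'(?P<punct>[()\[\]{},:])'
    r'|(?P<string>"(?:[^"\\\x00-\x1f]|\\.)*")'
    r'|(?P<number>-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?)'
    r'|(?P<literal>true|false|null)'
    r')'
)

def tokenize(text):
    """Split a TYSON text into (kind, lexeme) pairs."""
    text = text.rstrip(' \t\n\r')
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f'unexpected input at or after offset {pos}')
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else (None, None)

    def take(self, kind=None, lexeme=None):
        """Consume one token, optionally checking its kind and lexeme."""
        tok = self.peek()
        if tok[0] is None or (kind and tok[0] != kind) or (lexeme and tok[1] != lexeme):
            raise ValueError(f'unexpected token {tok[1]!r}')
        self.i += 1
        return tok

    def annotated_value(self):
        # annotated-value ::= value | "(" string ")" value
        type_name = None
        if self.peek() == ('punct', '('):
            self.take()
            type_name = json.loads(self.take('string')[1])
            self.take('punct', ')')
        return (type_name, self.value())

    def value(self):
        kind, lexeme = self.take()
        if (kind, lexeme) == ('punct', '{'):
            return self.sequence('}', self.pair)
        if (kind, lexeme) == ('punct', '['):
            return self.sequence(']', self.annotated_value)
        if kind == 'punct':
            raise ValueError(f'unexpected token {lexeme!r}')
        return (kind, lexeme)  # atomic value, kept as its raw lexeme

    def pair(self):
        # pair ::= string ":" annotated-value
        key = json.loads(self.take('string')[1])
        self.take('punct', ':')
        return (key, self.annotated_value())

    def sequence(self, closer, item):
        """Parse comma-separated items up to the closing bracket."""
        items = []
        if self.peek() == ('punct', closer):
            self.take()
        else:
            while True:
                items.append(item())
                separator = self.take('punct')[1]
                if separator == closer:
                    break
                if separator != ',':
                    raise ValueError("expected ',' or a closing bracket")
        return dict(items) if closer == '}' else items

def parse(text):
    """Parse a TYSON text into a (type name or None, value) pair."""
    parser = Parser(tokenize(text))
    result = parser.annotated_value()
    if parser.peek()[0] is not None:
        raise ValueError('trailing content after the top-level value')
    return result
```

In this sketch, parse('("date") "2018-05-28"') yields the type name date paired with the raw string lexeme, and an unannotated value yields None as its type name. Type names that are not quoted are rejected, in line with the formal grammar.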

2.1. Whitespace

Whitespace rules are identical to those of JSON. For the newly introduced syntax, whitespace on either side of the parentheses is handled in exactly the same way as whitespace on either side of curly braces or square brackets.

2.2. Top-level element

A TYSON text is any annotated-value. This implies that the top-level element can be any value, including an atomic one, which is consistent with both ECMA-404 and RFC 7159.

2.3. Tokens

TYSON introduces two new tokens: "(" (Unicode codepoint 40) and ")" (Unicode codepoint 41).

2.4. Superset of JSON

Any well-formed JSON document is also well-formed TYSON.

2.5. Quotes around type names (non-binding side note)

TYSON formally requires quoted type names in the annotated-value non-terminal. The editors consider it possible that, in practice, some implementations may choose to be more lenient and accept unquoted type names such as (date) "2018-06-01" -- this would be similar to widespread acceptance of unquoted JSON keys in practice, although JSON formally requires quoted keys. For interoperability, TYSON processors claiming full conformance MUST however only produce TYSON documents with all type names quoted.

Chapter 3. Types

3.1. Type name

Types MUST have names. Type names can be any sequence of Unicode codepoints that corresponds to a well-formed JSON string.

3.2. Value space

A type MUST have a value space that is the set of all values that match this type.
Such values are called typed values and must be understood as abstract mathematical objects. For example, the integer 1 belongs to the value space of the natural integers.
For example, the boolean type has a value space with two elements: true and false. True is to be understood here as the abstract concept of "trueness", and not as the string "true", which is its lexical value (see below).

3.3. Lexical space

Atomic types MUST additionally have a lexical space, which is a set of strings. These strings are syntactic representations of typed values, and can thus be serialized and appear as a sequence of bits in a stored or transmitted text file.
An atomic type thus also has a lexical mapping, which maps each string in the lexical space to a typed value in the value space. A lexical mapping MUST be surjective, meaning that every typed value MUST have at least one lexical representation. A lexical mapping need not be injective, meaning that a typed value MAY have several lexical representations (example: 1.0 and 1.00).
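
As a non-normative illustration, the following Python sketch models a lexical mapping for decimal numbers with the standard decimal module. The function name decimal_lexical_mapping is hypothetical, and using Decimal as a stand-in for the value space is an assumption of this example.

```python
from decimal import Decimal

def decimal_lexical_mapping(lexical: str) -> Decimal:
    """Map a string of the lexical space to a typed value."""
    return Decimal(lexical)

# Two distinct strings of the lexical space ...
assert "1.0" != "1.00"
# ... denote the same typed value (Decimal compares by numeric value),
# so the lexical mapping is not injective.
assert decimal_lexical_mapping("1.0") == decimal_lexical_mapping("1.00")
```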

3.4. Builtin types

There are builtin types that cover original JSON values. TYSON does not define any further builtin types, even though it is the intent of the designer to specify a suite of standardized, useful types in a separate specification. The builtin types are:
  • object: the value space is the set of all objects.
  • array: the value space is the set of all arrays.
as well as the atomic builtin types:
  • string: the value space is the set of all JSON strings, seen as lists of Unicode codepoints, and with the appropriate restrictions made in the JSON specification (e.g., on null characters, control characters, etc.). The lexical space and the lexical mapping take escaping (backslashes) into account, also as defined in the JSON specification.
  • integer: the value space is the set of all integers (ℤ). The lexical space is the set of all JSON number literals with no dots and no scientific notation. The lexical mapping is implicit from the JSON specification.
  • decimal: the value space is the set of all decimal numbers (𝔻), i.e., all real numbers with a finite decimal representation. The lexical space is the set of all JSON number literals with a dot and with no scientific notation. The lexical mapping is implicit from the JSON specification.
  • double: the value space is the set of all decimal numbers that can be expressed as a double according to IEEE 754, which can casually be described as decimals with a precision of 15 digits and with the exponent in a certain range. The lexical space is the set of all strings corresponding to JSON number literals with a scientific notation (exponent). The lexical mapping is implicit from the JSON specification.
  • boolean: the value space is the canonical boolean domain (𝔹) containing true and false (a mathematical abstraction for trueness and falseness). The lexical space is made of the strings "true" and "false", which are also JSON literals appearing without quotes. The lexical mapping is the natural bijection between the two sets.
  • null: the value space is the singleton set containing null (a mathematical abstraction of the essence of "nullness"). The lexical space is made of the string "null", which is also a JSON literal appearing without quotes. The lexical mapping is the one and only bijection between these two singleton sets.
TYSON processors MUST support all above builtin types and check that they correctly annotate values, i.e. that the corresponding values are valid.

3.5. User-defined types

TYSON supports, and is forward compatible with, any user-defined types. These user-defined types can come from any external environment, specification or language.
TYSON does not restrict the definition of types in any way, except that any user-defined type MUST fulfill the general requirements set out in Sections 3.1, 3.2 and 3.3. These restrictions are designed to ensure soundness of the syntax and of its semantics while being as unrestrictive as reasonably possible, and they correspond to established practice:
  • The type name must correspond to a JSON string and must be different from all builtin type names, which are reserved.
  • The user-defined type MUST have a clearly documented value space.
  • If it is not an atomic type, its value space MUST be either a set of objects, or a set of arrays.
  • If it is an atomic type, it MUST also have a clearly documented lexical space and lexical mapping.
Important remark
While users of TYSON may support types obtained by the union of other types, type annotations used in a TYSON document MUST NOT use union types that do not fulfill the above criteria, because this leads to ambiguities in the lexical mappings. For example, the union type containing all strings and booleans MUST NOT be used as a TYSON type annotation. Any value in the corresponding value space must be annotated either as string or as boolean, so that the typed values are unambiguous. For example, ("string-or-boolean") "true" would be ambiguous and is not allowed, while ("string") "true" and ("boolean") "true" are unambiguous and allowed.
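
As a non-normative illustration, a consuming application could document an atomic user-defined type along the following lines. Everything in this Python sketch is hypothetical: the AtomicType class, the define_atomic_type helper, and the choice of a YYYY-MM-DD lexical space for a type named date. TYSON itself does not define such a type or such an interface.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable

# Builtin type names are reserved and cannot be redefined.
BUILTIN_NAMES = {'object', 'array', 'string', 'integer',
                 'decimal', 'double', 'boolean', 'null'}

@dataclass(frozen=True)
class AtomicType:
    name: str                               # the type name
    lexical_space: re.Pattern               # which strings are allowed
    lexical_mapping: Callable[[str], Any]   # string -> typed value

    def value_of(self, lexical: str) -> Any:
        """Apply the lexical mapping after checking the lexical space."""
        if self.lexical_space.fullmatch(lexical) is None:
            raise ValueError(f'{lexical!r} is not in the lexical space of {self.name}')
        return self.lexical_mapping(lexical)

def define_atomic_type(name, pattern, mapping):
    if name in BUILTIN_NAMES:
        raise ValueError(f'{name!r} is a reserved builtin type name')
    return AtomicType(name, re.compile(pattern), mapping)

# Assumed definition of a "date" type, chosen for this example only.
DATE = define_atomic_type('date', r'[0-9]{4}-[0-9]{2}-[0-9]{2}', date.fromisoformat)
```

With this definition, DATE.value_of("2018-05-28") returns the corresponding calendar date, while a string outside the documented lexical space is rejected by the consuming application rather than by the TYSON processor.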

3.6. Implicit annotations

Annotations are optional. In the absence of annotations on a JSON value, the consuming application MUST assume the annotation to be implicitly:
  • "object" for a JSON object
  • "array" for a JSON array
  • "string" for a JSON string
  • "integer" for a JSON number with no dots and no exponent
  • "decimal" for a JSON number with a dot and no exponent
  • "double" for a JSON number with an exponent
  • "boolean" for the JSON literals true and false
  • "null" for the JSON literal null
This ensures backward-compatibility with JSON documents, with no modification of their semantics.
A conforming TYSON processor MUST add and expose implicit annotations to the consuming application. It MUST hide from the consuming application whether an annotation was explicit or implicit.
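
The implicit annotation rules can be sketched as follows for a plain JSON document (non-normative). The sketch relies on the parse_int, parse_float and parse_constant hooks of Python's standard json module, which receive the number as a string, so that the lexical form (dot, exponent) is still visible. The function names and the (type name, value) tuple representation are assumptions of this example.

```python
import json

def _number(lexeme):
    # Classify a JSON number by its lexical form, as listed above.
    if 'e' in lexeme or 'E' in lexeme:
        return ('double', lexeme)
    return ('decimal', lexeme) if '.' in lexeme else ('integer', lexeme)

def _reject(name):
    # NaN and Infinity are accepted by json.loads but are not JSON values.
    raise ValueError(f'{name} is not a JSON value')

def _annotate(value):
    if isinstance(value, tuple):   # a number already tagged by _number
        return value
    if isinstance(value, dict):
        return ('object', {k: _annotate(v) for k, v in value.items()})
    if isinstance(value, list):
        return ('array', [_annotate(v) for v in value])
    if isinstance(value, str):
        return ('string', value)
    if isinstance(value, bool):
        return ('boolean', value)
    if value is None:
        return ('null', None)
    raise TypeError(f'unexpected value {value!r}')

def with_implicit_annotations(json_text):
    """Return the JSON document with every value paired with its implicit type."""
    raw = json.loads(json_text, parse_int=_number,
                     parse_float=_number, parse_constant=_reject)
    return _annotate(raw)
```

Numbers are kept as lexemes here because 2.2 and 3e6 would otherwise both become Python floats, which would lose the distinction between decimal and double.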

3.7. Type checks

The TYSON specification enforces correctness of its builtin types.
An object MUST be annotated with either "object" or a user-defined type.
An array MUST be annotated with either "array" or a user-defined type.
If an atomic value is annotated with a builtin type, then this builtin type MUST be atomic, and the syntactic representation of this value MUST belong to the lexical space of the builtin type.
Quotes MUST be ignored for the purpose of checking lexical values.
Examples of non-well-formed TYSON (they MUST be refused by a conforming processor)
("boolean") "yes"
("yes" is not in the lexical space of the boolean builtin type)
("integer") { "foo" : "bar" }
(An object is not in the value space of the integer builtin type)
("array") { "foo" : "bar" }
(An object is not in the value space of the array builtin type)
("integer") "foo"
("foo" is not in the lexical space of the integer builtin type)
("integer") "2.0"
(TYSON is lightweight and does NOT do any casts, it only looks at whether the lexical value is in the lexical space of the annotating type, which is not the case here)
("object") true
(true is not an object)
A conforming TYSON processor MUST accept any user-defined type annotation on any value. A TYSON document annotating all its values with non-builtin type names is thus always considered well-formed from the perspective of this specification, assuming the other syntactic constraints are fulfilled.
Whether the annotated atomic values in a TYSON document belong to the lexical space of a user-defined type is outside of the scope of this specification; enforcing such additional constraints is the responsibility of consuming applications.
Whether the annotated objects or arrays in a TYSON document belong to the value space of a user-defined type is outside of the scope of this specification, and is the responsibility of consuming applications.
The format in which types are documented is outside of the scope of this specification. Looking up type documentation given a TYSON document is outside of the scope of this specification.
Examples of well-formed TYSON (they MUST be accepted by a conforming processor, leaving further decisions involving validity against user-defined types to the consuming application)
("my-array") { "foo" : "bar" }
(TYSON does not enforce validation of user-defined types)
("boolean") "true"
(TYSON ignores quotes if an annotation is present: the string "true" is mapped to the boolean value true via the lexical mapping of the boolean builtin type)
("string") false
(This is the string "false", the unquoted literal false being interpreted as a lexical value)
("string") null
(This is the string "null", the unquoted literal null being interpreted as a lexical value)
("integer") "2"
(The string "2" is mapped to the integer value 2 in the lexical mapping of the integer builtin type)
A conforming TYSON processor MUST hide from the consuming application whether an annotated lexical value was quoted or not.
For example, the following three TYSON annotated-values MUST be exposed in the same way to the consuming application: as the boolean value true.
true
("boolean") true
("boolean") "true"
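
The checks on atomic values annotated with builtin types can be sketched as follows (non-normative). The function typed_value and the ATOMIC table are hypothetical names. Representing typed values with Python's int, Decimal, float, bool and None is an assumption of this example, as is the choice to unescape quoted lexemes with json.loads before checking them.

```python
import json
import re
from decimal import Decimal

_INT = r'-?(?:0|[1-9][0-9]*)'

# Lexical space (as a regular expression) and lexical mapping per builtin type.
ATOMIC = {
    'integer': (re.compile(_INT), int),
    'decimal': (re.compile(_INT + r'\.[0-9]+'), Decimal),
    'double': (re.compile(_INT + r'(?:\.[0-9]+)?[eE][+-]?[0-9]+'), float),
    'boolean': (re.compile(r'true|false'), lambda s: s == 'true'),
    'null': (re.compile(r'null'), lambda s: None),
}

def typed_value(type_name, lexeme):
    """Check an atomic lexeme against a builtin type and return its typed value."""
    # Quotes are ignored: "true" and true denote the same lexical value.
    lexical = json.loads(lexeme) if lexeme.startswith('"') else lexeme
    if type_name == 'string':
        return lexical
    if type_name not in ATOMIC:
        raise ValueError(f'{type_name!r} is not an atomic builtin type')
    pattern, convert = ATOMIC[type_name]
    if pattern.fullmatch(lexical) is None:
        raise ValueError(f'{lexical!r} is not in the lexical space of {type_name}')
    return convert(lexical)

# The quoted and unquoted forms are exposed identically.
assert typed_value('boolean', 'true') is True
assert typed_value('boolean', '"true"') is True
```

No cast is attempted: typed_value('integer', '"2.0"') raises an error in this sketch, because 2.0 is not in the lexical space of integer. User-defined type names are not handled here, since their validation is left to the consuming application.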

3.8. (Absence of a) data model

This specification does not specify any data model beyond the one that is directly implied by the syntax.

Chapter 4. Examples

A "random" TYSON document
("customType") {
  "a" : ("date") "2018-05-28",
  "b" : ("my-array-type") [ ("int") 1, ("short") 2, 3, ("zipcode") 8000 ],
  "c" : ("xyType") {
    "x" : "xxx",
    "y" : ("myString") "yyy",
    "z" : ("string") true
  },
  "d" : [ "foo", "bar", true, ("boolean") "false"]
}
A JSON document that is also a TYSON document
{
  "a" : [ 1, 2.2, 3e6],
  "b" : null,
  "c" : true,
  "d" : {
    "e" : false
  }
}
The equivalent TYSON document, where implied annotations are made explicit:
("object") {
  "a" : ("array") [ ("integer") 1, ("decimal") 2.2, ("double") 3e6],
  "b" : ("null") null,
  "c" : ("boolean") true,
  "d" : ("object") {
    "e" : ("boolean") false
  }
}
The equivalent TYSON document, with all lexical values quoted (even though this is not necessary):
("object") {
  "a" : ("array") [ ("integer") "1", ("decimal") "2.2", ("double") "3e6"],
  "b" : ("null") "null",
  "c" : ("boolean") "true",
  "d" : ("object") {
    "e" : ("boolean") "false"
  }
}
A structured document, for example, part of a bigger collection. Other specifications may define new types like date and base64Binary, as well as schema languages to enforce further constraints.
("person") {
  "name" : ("first-and-last") {
    "first name" : ("disney-character") "Mickey",
    "last name" : "Mouse"
  },
  "birth date" : ("date") "1928-11-18",
  "male" : true,
  "picture" : ("base64Binary") "VGhpcyBpcyBhIHBpY3R1cmU="
}
What a tech-savvy power-user of TYSON could do
("my-crazy-structure") {
  "pointer" : ("int[]*") "0x0123456789ABCDEF"
}

Appendix A. Revision History

Revision History
Revision 1-4, Tue Sep 11, 2018, Ghislain Fourny
  • Update TYSON picture to show JSON as a subset
Revision 1-3, Thu Sep 5, 2018, Ghislain Fourny
  • Switch to final name (JOHN -> TYSON)
Revision 1-2, Fri Aug 17, 2018, Ghislain Fourny
  • Add to features compatibility with all schemas
  • Add architecture image
  • Grammar includes the additions to JSON in bold
  • Clarify that user-defined types can come from anywhere in the external environment
Revision 1-1, Tue Jul 24, 2018, Ghislain Fourny
  • Changed terminology from parser to processor.