Core PDML Specification
Version | 1.0.1 |
Published | 2021-12-03 |
License | CC BY-ND 4.0 |
Website | https://pdml-lang.dev/ |
Author | Christian Neumanns |
Introduction
The Practical Data and Markup Language (PDML) is a text format to store data.
A distinction is made between Core PDML and PDML Extensions. Core PDML is the minimum needed to store data. Extensions are optional features to make PDML more practical.
This document is the official specification for Core PDML.
Document Structure
A PDML document is a tree of nodes.
The syntax for a node is defined as follows (in EBNF):
"[" name ( separator ? child_node + ) ? "]"
Node
A node is enclosed by a pair of square brackets: [...]. A node starts with [ and ends with ].
Each document has exactly one root node.
Name
Each node has a name.
A node name must match the regex [a-zA-Z_][a-zA-Z0-9_\.-]*. This means that a name starts with a letter or an underscore (_), optionally followed by any number of letters, digits, underscores (_), hyphens (-), or dots (.).
Here are some examples of valid node names:
color
Index_12
_ID_12.5-a
A node name does not need to be unique. Different nodes in a tree can have the same name.
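The naming rule above can be checked mechanically. The following is a minimal sketch in Python; the function name is_valid_node_name is an assumption made for illustration and is not part of the specification. The pattern itself is copied from the rule above.

```python
import re

# Pattern copied from the specification. fullmatch() requires the whole
# string to match, so no explicit ^ and $ anchors are needed.
NAME_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_\.-]*")


def is_valid_node_name(name: str) -> bool:
    """Return True if name is a valid Core PDML node name (hypothetical helper)."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    # The first three names are the examples listed above; the rest are invalid.
    for candidate in ["color", "Index_12", "_ID_12.5-a", "12abc", "-x", "a b", ""]:
        print(repr(candidate), is_valid_node_name(candidate))
```

A name such as 12abc is rejected because it starts with a digit, and the empty string is rejected because at least one leading letter or underscore is required.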
Separator
The separator separates the node's name from its content.
The separator is a single whitespace character. The following whitespace characters are allowed:
Name | C-style syntax | Unicode |
---|---|---|
Space | " " | U+0020 |
Tab | "\t" | U+0009 |
Unix new line | "\n" | U+000A |
Windows new line | "\r\n" | U+000D U+000A |
The separator is required if the first child node is text. Example:
[color green]
The separator is optional if the first child node is a node. Hence this code:
[b [i huge]]
... can also be written as:
[b[i huge]]
Child Node
A node can optionally have any number of child nodes.
A child node can be text (a sequence of Unicode characters) or another node (with optional child nodes too).
Examples:
-
Node with one text child:
[color light green]
The node's name is color. The node's single child node is the text light green.
-
Node with child node:
[config [color light green]]
The node config has one child node. The child node's name is color, its text is light green.
-
Tree of nodes:
[config [color light green] [size [width 200] [height 100] ] ]
-
Node containing a mixture of text and nodes (markup code):
[p We can write words in [i italic], [b bold], or [b[i bold and italic]].]
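To make the tree structure concrete, here is a minimal sketch of a recursive-descent parser in Python. It is an illustration under stated assumptions, not a reference implementation: the names Node, parse_document, and _parse_node are hypothetical, error handling is reduced to ValueError, and any escape sequence other than the three defined under Escape Characters is treated as an error.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple, Union

NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_\.-]*")
OUTER_WHITESPACE = " \t\r\n"  # characters ignored around the root node


@dataclass
class Node:
    name: str
    # A child is either text (str) or another Node.
    children: List[Union[str, "Node"]] = field(default_factory=list)


def parse_document(source: str) -> Node:
    """Parse a document consisting of exactly one root node."""
    pos = 0
    while pos < len(source) and source[pos] in OUTER_WHITESPACE:
        pos += 1
    root, pos = _parse_node(source, pos)
    if source[pos:].strip(OUTER_WHITESPACE):
        raise ValueError(f"illegal characters after root node at offset {pos}")
    return root


def _parse_node(s: str, pos: int) -> Tuple[Node, int]:
    """Parse one node starting at s[pos]; return it and the offset after ']'."""
    if pos >= len(s) or s[pos] != "[":
        raise ValueError(f"expected '[' at offset {pos}")
    match = NAME.match(s, pos + 1)
    if match is None:
        raise ValueError(f"invalid node name at offset {pos + 1}")
    node = Node(match.group())
    pos = match.end()

    # Consume one separator if present; it is mandatory before text.
    if s.startswith("\r\n", pos):
        pos += 2
    elif pos < len(s) and s[pos] in " \t\n":
        pos += 1
    elif pos < len(s) and s[pos] not in "[]":
        raise ValueError(f"separator required at offset {pos}")

    text: List[str] = []

    def flush_text() -> None:
        # Turn the characters collected so far into one text child.
        if text:
            node.children.append("".join(text))
            text.clear()

    while pos < len(s):
        ch = s[pos]
        if ch == "]":
            flush_text()
            return node, pos + 1
        if ch == "[":
            flush_text()
            child, pos = _parse_node(s, pos)
            node.children.append(child)
        elif ch == "\\":
            # Assumption: only \[, \], and \\ are accepted as escapes.
            if pos + 1 >= len(s) or s[pos + 1] not in "[]\\":
                raise ValueError(f"invalid escape sequence at offset {pos}")
            text.append(s[pos + 1])
            pos += 2
        else:
            text.append(ch)
            pos += 1
    raise ValueError("missing closing ']'")


if __name__ == "__main__":
    tree = parse_document("[b[i huge]]")
    print(tree)  # Node(name='b', children=[Node(name='i', children=['huge'])])
```

The sketch keeps all whitespace inside text children, so a tree written with spaces between sibling nodes yields additional text children that contain only those spaces.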
Empty Node
If a node has no child nodes, it is called an empty node.
Example:
[new_line]
Escape Characters
As seen already, [ and ] are used as node delimiters. Therefore these two characters must be escaped when they are used in text nodes.
A backslash (\) is used as the escape character (as in C-like programming languages). Therefore the backslash must itself be escaped too.
The final rule is simple: Characters [, ], and \ must be preceded by \ when they are used in text nodes, as shown in the following table:
Character | Escape sequence |
---|---|
[ | \[ |
] | \] |
\ | \\ |
Example:
Suppose node foo contains the text: Characters [, ], and \ must be escaped.
This would be written as:
[foo Characters \[, \], and \\ must be escaped.]
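The escaping rule can be sketched as a pair of helper functions in Python. The names escape_text and unescape_text are assumptions chosen for illustration. Rejecting unknown escape sequences and unescaped brackets is also an assumption about how a strict reader might behave.

```python
# Translation table mapping each special character to its escape sequence.
_ESCAPES = str.maketrans({"[": "\\[", "]": "\\]", "\\": "\\\\"})


def escape_text(text: str) -> str:
    """Prefix each bracket and backslash in text with a backslash."""
    return text.translate(_ESCAPES)


def unescape_text(escaped: str) -> str:
    """Reverse escape_text; raise ValueError on malformed input."""
    out = []
    i = 0
    while i < len(escaped):
        ch = escaped[i]
        if ch == "\\":
            # A backslash must be followed by one of the three special characters.
            if i + 1 >= len(escaped) or escaped[i + 1] not in "[]\\":
                raise ValueError(f"invalid escape sequence at offset {i}")
            out.append(escaped[i + 1])
            i += 2
        elif ch in "[]":
            raise ValueError(f"unescaped delimiter at offset {i}")
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    original = "Characters [, ], and \\ must be escaped."
    escaped = escape_text(original)
    print("[foo " + escaped + "]")
    assert unescape_text(escaped) == original
```

Using a single translation pass avoids the classic mistake of escaping brackets first and then escaping the backslashes that were just inserted.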
Whitespace
The following whitespace characters before or after the root node are ignored:
Name | C-style syntax | Unicode |
---|---|---|
Space | ' ' | U+0020 |
Tab | '\t' | U+0009 |
Carriage return | '\r' | U+000D |
Line feed | '\n' | U+000A |
Other characters before or after the root node are illegal.
Within a PDML document, there are no whitespace handling rules defined in Core PDML. Whitespace is preserved when a PDML document is parsed.
Consider the following PDML snippet:
[a  foo   [b]
    2 [c] [d]
]
In this example, node a contains 7 child nodes:
-
text {space}foo{space}{space}{space}
-
empty node b
-
text {new line}{space}{space}{space}{space}2{space}
-
empty node c
-
text {space}
-
empty node d
-
text {new line}
Applications reading PDML documents (or customized PDML parsers) are free to implement any appropriate whitespace handling rules, such as:
-
skip whitespace nodes
-
trim leading and/or trailing whitespace in text nodes
-
replace whitespace sequences with a single space (similar to HTML)
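The three optional policies listed above could be sketched in Python as follows. This is only an illustration of what an application might do, since Core PDML itself defines no such rules. The function names are hypothetical, text children are represented as str, and any other object (here a tuple of name and children) stands in for a node.

```python
import re
from typing import Any, List

WHITESPACE = " \t\r\n"


def skip_whitespace_nodes(children: List[Any]) -> List[Any]:
    """Drop text children that consist of whitespace only."""
    return [
        c for c in children
        if not (isinstance(c, str) and c.strip(WHITESPACE) == "")
    ]


def trim_text_nodes(children: List[Any]) -> List[Any]:
    """Strip leading and trailing whitespace from every text child."""
    return [c.strip(WHITESPACE) if isinstance(c, str) else c for c in children]


def collapse_whitespace(children: List[Any]) -> List[Any]:
    """Replace each run of whitespace in text children with a single space."""
    return [
        re.sub(r"[ \t\r\n]+", " ", c) if isinstance(c, str) else c
        for c in children
    ]


if __name__ == "__main__":
    # The 7 children of node a from the example above; tuples stand in for nodes.
    children = [" foo   ", ("b", []), "\n    2 ", ("c", []), " ", ("d", []), "\n"]
    print(trim_text_nodes(skip_whitespace_nodes(children)))
```

Applying skip_whitespace_nodes and then trim_text_nodes to the example leaves five children: the texts foo and 2, and the three empty nodes.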
New Lines
New lines are defined differently in Unix/Linux and Windows. Unix uses a single line feed ("\n"). Windows uses a carriage return, followed by a line feed ("\r\n").
The following rules are applied in PDML:
-
Reading Rule
When a PDML document is read, both Unix and Windows new lines are supported, regardless of the operating system the application runs on, and even if a single document mixes both styles.
For example, a parser reads each of "\n" and "\r\n" as a single new line.
-
Writing Rule
When a PDML document is written, the operating system's canonical new line is used.
For example, a writer running on Unix writes "\n". On Windows it writes "\r\n".
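The two rules can be sketched as a pair of Python functions. The names normalize_new_lines and to_platform_new_lines are assumptions for illustration. Representing a new line internally as "\n" is a design choice of this sketch, not something the specification mandates. A lone carriage return is left untouched because the rules above do not define it as a new line.

```python
import os


def normalize_new_lines(text: str) -> str:
    # Reading rule: treat a Windows new line like a Unix new line, so that
    # documents mixing both styles are handled uniformly.
    return text.replace("\r\n", "\n")


def to_platform_new_lines(text: str, line_separator: str = os.linesep) -> str:
    # Writing rule: emit the operating system's canonical new line.
    # Normalizing first avoids turning an existing "\r\n" into "\r\r\n".
    return normalize_new_lines(text).replace("\n", line_separator)


if __name__ == "__main__":
    mixed = "[a\r\n[b]\n]"
    print(repr(normalize_new_lines(mixed)))
    print(repr(to_platform_new_lines(mixed)))
```

When writing to a file with these helpers, opening it with open(path, "w", encoding="utf-8", newline="") keeps Python from translating new lines a second time.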
Encoding
PDML documents are encoded in UTF-8.
Grammar
The grammar is defined in separate documents, in two variations: an EBNF grammar and railroad diagrams.
Note
This document is the only official specification for Core PDML.
The EBNF grammar and the railroad diagrams are just auxiliary assets to help readers better contextualize the specification.
Examples
More examples of PDML code can be found in PDML Examples.
License
This specification is licensed under CC BY-ND 4.0.
Permission is granted to create verbatim translations of this specification into other human languages.
Versioning
This specification uses Semantic Versioning.
Website
PDML's website is https://pdml-lang.dev/.
Markup Code
This document is written in PML which uses the PDML syntax.
The markup code is available on GitHub.
Pull requests are welcome.