PDML Extensions Specification
Extensions Version | 0.78.0 |
First Published | 2025-03-25 |
Author | Christian Neumanns |
Website |
Introduction
The Practical Data and Markup Language (PDML) is a text format designed for encoding data and markup (i.e. formatted text).
This document provides the authoritative specification for PDML Extensions. Core PDML is covered in Core PDML Specification. The difference between Core PDML and PDML Extensions is explained in What Is "Core PDML" and What Are "PDML Extensions"?.
The specification is written for anyone who wants to know the exact rules governing PDML Extensions (e.g. software developers who want to implement a PDML parser, an editor/IDE plugin, or other PDML assets). If you are new to PDML Extensions, it is recommended to first read the PDML Extensions User Manual. For general information about PDML you may read the PDML Overview.
Notes
Extensions are a work in progress. All extensions are currently in an experimental state. There might be breaking changes in future versions, and some extensions might even be removed.
Additional extensions are already implemented in the reference implementation, but are not yet documented here.
Syntax rules are shown in Extended Backus–Naur Form (EBNF) notation.
Please share your thoughts using the following links:
- Discussion: ask a question, discuss an idea, enhancement, or anything else.
- Issues: report mistakes, suggest enhancements and missing features, etc.
The Extension Start Character ^
Every extension (except Unicode Escape Sequences) starts with a Circumflex Accent (^, U+005E), also known as the extension start character in the context of PDML.
This character is always followed by one or more characters that identify the extension to be applied. For example, ^/ indicates the start of a comment, while ^" indicates the start of a string literal.
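As an illustration of how a parser might dispatch on the character that follows the extension start character, here is a minimal Python sketch. The function name `identify_extension` and the returned labels are hypothetical, and only the two extensions named above are covered; a real parser would recognize more.

```python
from typing import Optional

EXTENSION_START = "^"  # Circumflex Accent, U+005E


def identify_extension(text: str, pos: int) -> Optional[str]:
    """Return a label for the extension starting at `pos`, or None if no
    extension starts there. The labels are illustrative, not part of PDML."""
    if not text.startswith(EXTENSION_START, pos):
        return None
    following = text[pos + 1:pos + 2]  # empty string at end of input
    if following == "/":
        return "comment"
    if following == '"':
        return "string_literal"
    # Other extensions exist but are not covered by this sketch.
    return "unknown"
```

For example, `identify_extension("^/ a comment", 0)` yields `"comment"`, while a position that does not hold `^` yields `None`.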
Syntax Extensions
The following syntax extensions are provided:
Unicode Escape Sequences
Unicode escape sequences enable users and machines to insert Unicode code points in PDML documents by specifying the hexadecimal values of the code points.
For an introduction and examples you may first read section Unicode Escape Sequences in the PDML Extensions User Manual.
Grammar
PDML adopts the \u{hhhhhh} syntax for Unicode escape sequences, but allows a list of several code points to be defined in a single escape sequence.
The grammar is defined as follows:
Name | Rule | Examples |
---|---|---|
Unicode_escape_sequence | | |
hex_values | | |
hex_value | | |
hex_digit | | |
whitespace | | |
whitespace_item | | |
Additional Rules
In addition to the rules defined by the grammar, the following rules apply:
- Leading zeros are optional, e.g. \u{A}, \u{0A} and \u{00000A} all represent a new line character.
- The following Unicode code points are invalid:
  - U+0000 (for reasons explained in Invalid Characters)
  - Surrogates in the ranges U+D800 to U+DBFF and U+DC00 to U+DFFF (reserved to encode code points beyond U+FFFF in UTF-16)

  All other Unicode code points in the range U+0001 to U+10FFFF (maximum valid code point in Unicode) are valid.
  A parser must generate an error if an invalid Unicode code point is present in a PDML document.
- Unicode escape sequences must be supported in:
  - node tags, e.g. [foo\u{41}bar] → [fooAbar]
  - text leaves, e.g. [foo foo\u{41}bar] → [foo fooAbar]
  - unquoted string literals, e.g. foo\u{41}bar → fooAbar
  - quoted string literals, e.g. "foo\u{41}bar" → "fooAbar"
  - multiline string literals with escape mode enabled, e.g. """e foo\u{41}bar """

  As attribute names and values are defined using string literals, Unicode escape sequences are also supported in attribute names and values.
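The rules above can be sketched as a small decoder. The following Python code is illustrative only and is not taken from the reference implementation. The names `decode_unicode_escapes` and `decode_hex_value` are hypothetical, and the sketch assumes that the code points in a list are separated by whitespace and that each one is written with one to six hexadecimal digits in either case.

```python
import re

MAX_CODE_POINT = 0x10FFFF

# Assumption: an escape sequence is "\u{", a whitespace-separated list of
# hex values, then "}".
_ESCAPE = re.compile(r"\\u\{([^}]*)\}")
_HEX_VALUE = re.compile(r"[0-9A-Fa-f]{1,6}")


def decode_hex_value(hex_value: str) -> str:
    """Convert one hex value to its character, rejecting invalid code points."""
    if not _HEX_VALUE.fullmatch(hex_value):
        raise ValueError(f"malformed hex value: {hex_value!r}")
    code_point = int(hex_value, 16)  # leading zeros are ignored by int()
    if code_point == 0:
        raise ValueError("U+0000 is not allowed")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"surrogate U+{code_point:04X} is not allowed")
    if code_point > MAX_CODE_POINT:
        raise ValueError(f"U+{code_point:X} is beyond U+10FFFF")
    return chr(code_point)


def decode_unicode_escapes(text: str) -> str:
    """Replace every escape sequence in `text` with the characters it denotes."""
    def replace(match: "re.Match[str]") -> str:
        hex_values = match.group(1).split()
        if not hex_values:
            raise ValueError("empty Unicode escape sequence")
        return "".join(decode_hex_value(value) for value in hex_values)

    return _ESCAPE.sub(replace, text)
```

With these assumptions, `decode_unicode_escapes(r"foo\u{41}bar")` gives `fooAbar`, and an input containing `\u{D800}` raises a `ValueError`. The sketch decodes escape sequences wherever they appear in the input and does not check whether the surrounding context permits them.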
Note
When a PDML document is parsed, Unicode escape sequences are usually converted into their Unicode code points. For example, \u{221E} is converted to ∞, which is then stored in the AST as a single Unicode code point (character).
However, this is not always the case. For example, a parser that creates a concrete syntax tree (CST) stores the Unicode escape sequences unchanged in the CST.
String Literals
Note
This feature is implemented already in the PDML reference implementation, but not yet documented here.
Attributes
Note
This feature is implemented already in the PDML reference implementation, but not yet documented here.
Utility Nodes
Note
This feature is implemented already in the PDML reference implementation, but not yet documented here.
Scripting Nodes
Note
This feature is implemented already in the PDML reference implementation, but not yet documented here.
Types
Note
This feature is partially implemented in the PDML reference implementation, but not yet documented here.
Comments
A comment consists of a text segment that is not part of the data/markup code stored in a PDML document. PDML supports single-line comments and nested multi-line comments.
Single-line Comment
A single-line comment starts with ^// or ^/ at any position in a line. It ends at the end of the line.
If the comment starts with ^// then the line break at the end of the line (LF or CRLF) is included in the comment; otherwise it is not included.
Multi-line Comment
A multi-line comment starts with ^/, immediately followed by one or more * characters (e.g. ^/*, ^/**, ^/***). It must be ended by the same number of * characters that were used to start the comment, followed by a / character (e.g. */, **/, ***/).
It can start at any position in a line. It can end on the same line on which it started, or on a subsequent line, at any position.
Multi-line comments can be nested to any level.
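The comment rules described so far can be sketched as a simple comment stripper. This Python code is a hedged illustration, not the reference implementation: the names `strip_comments`, `_skip_line`, and `_skip_multiline` are hypothetical. It assumes that a closing delimiter is matched as soon as the required number of `*` characters followed by `/` is found, and it ignores the context in which a comment appears (for example tags or string literals) as well as any character escaping.

```python
def _skip_line(text: str, start: int, include_line_break: bool) -> int:
    """Return the index just past a single-line comment starting at `start`."""
    newline = text.find("\n", start)
    if newline == -1:
        return len(text)  # comment runs to the end of the input
    if include_line_break:
        return newline + 1  # "^//" swallows the LF (and a preceding CR)
    # "^/" keeps the line break; for CRLF, stop before the CR.
    if text[newline - 1] == "\r":
        return newline - 1
    return newline


def _skip_multiline(text: str, start: int) -> int:
    """Return the index just past a (possibly nested) multi-line comment."""
    i = start + 2  # skip "^/"
    stars = 0
    while i < len(text) and text[i] == "*":
        stars += 1
        i += 1
    closer = "*" * stars + "/"  # same number of stars, then "/"
    while i < len(text):
        if text.startswith("^/*", i):
            i = _skip_multiline(text, i)  # nested comment
        elif text.startswith(closer, i):
            return i + len(closer)
        else:
            i += 1
    raise ValueError("unterminated multi-line comment")


def strip_comments(text: str) -> str:
    """Remove single-line and multi-line comments from `text`."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith("^//", i):
            i = _skip_line(text, i, include_line_break=True)
        elif text.startswith("^/*", i):
            i = _skip_multiline(text, i)
        elif text.startswith("^/", i):
            i = _skip_line(text, i, include_line_break=False)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

In this sketch, `strip_comments("a ^// c\nb")` returns `a b` because the line break belongs to the comment, whereas `strip_comments("a ^/ c\nb")` keeps the line break. A parser that builds a concrete syntax tree would record the comments instead of discarding them.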
Common Rules
Single- and multi-line comments must adhere to the following rules:
- Comments are allowed:
  - At the start, in the middle, and at the end of text leaves.
  - Before, between, and after attribute assignments (see Attributes).
- Comments are not allowed:
  - In tags.
  - In string literals.
  - Before or after the root node.
Note
Comments are usually skipped (ignored) by parsers. However, sometimes they must be handled in one way or another. For example, a parser that creates a concrete syntax tree (CST) stores all comments in the CST.