PDML Extensions User Manual
Extensions Version: 0.78.0
Latest Update: 2025-03-25
First Published: 2022-02-07
Author: Christian Neumanns
Introduction
In this document you'll learn how to use PDML extensions, a set of optional features that are not part of Core PDML.
Reminder
Section What Is "Core PDML" and What Are "PDML Extensions"? in PDML Overview explains the difference between Core PDML and PDML Extensions:
"The Core PDML Specification encompasses the fundamental set of simple rules needed for encoding data and markup of any complexity as plain text. This is the minimum set of rules that every PDML implementation must adhere to.
PDML Extensions encompass additional rules that specify a set of predefined, but optional features (called extensions) designed to enrich the core functionality and increase practicality in specific contexts. A PDML implementation can support some or all extensions, but every extension must be implemented according to the official PDML Extensions Specification, and the documentation of a PDML implementation should clearly specify which extensions (if any) are supported.
Extensions are designed to simplify, and sometimes even automate the creation and maintenance of your PDML documents."
It's important to understand that extensions do not enable you to encode additional types of data or markup, nor do they enable encoding more complex data or markup. You can encode data and markup of any type and complexity using only Core PDML, as demonstrated in Core PDML Examples. However, extensions can significantly simplify and enhance the creation and maintenance of PDML documents — extensions honor the P in PDML, which stands for Practical.
The following extensions will be covered:
Notes
Extensions are a work in progress. All extensions are currently in an experimental state. There might be breaking changes in future versions, and some extensions might even be removed.
Additional extensions are implemented already in the reference implementation, but not yet documented here.
Please share your thoughts using the following links:
- Discussion: ask a question, discuss an idea, enhancement, or anything else.
- Issues: report mistakes, suggest enhancements and missing features, etc.
The Extension Start Character ^
Every extension (except Unicode Escape Sequences) starts with a Circumflex Accent (^, U+005E), also called caret or hat. Therefore, in the context of PDML, ^ is also known as the extension start character.
This character is always followed by one or more characters that identify the extension to be applied. For example, ^/ indicates the start of a comment, while ^" indicates the start of a string literal.
Note
Since ^ always starts an extension, you must escape this character when using it in node tags and texts. For example, a node tagged ^_info, containing the text "Character ^ can have different meanings in different domains.", is written as follows:
[\^_info Character \^ can have different meanings in different domains.]
For more information on escape characters you may read section Character Escape Sequences in the Core PDML Specification.
Because a non-escaped ^ is a reserved character (i.e. invalid in tags and texts, unless used to start an extension), a parser that supports only Core PDML reports an error at parse time instead of silently treating an extension as normal text.
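The escaping rule from the note above can be sketched as a small helper. This is an illustration only: the function name `escape_pdml_text` is hypothetical, and the assumption that `\`, `[` and `]` need escaping in text comes from Core PDML rather than from this manual, which only states the rule for `^`.

```python
def escape_pdml_text(text: str) -> str:
    """Escape reserved characters in text (hypothetical helper, not an official API)."""
    # Escape the backslash first; otherwise the backslashes added for the
    # other characters would themselves be escaped a second time.
    # Assumption: '\', '[' and ']' are reserved by Core PDML; '^' is
    # reserved as soon as extensions are enabled.
    for ch in ("\\", "[", "]", "^"):
        text = text.replace(ch, "\\" + ch)
    return text


print(escape_pdml_text("Character ^ can have different meanings in different domains."))
# prints: Character \^ can have different meanings in different domains.
```

Tags may need additional escaping (for example, for whitespace), which this sketch does not attempt.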
Syntax Extensions
Unicode Escape Sequences
Unicode escape sequences enable you to insert Unicode code points in PDML documents by specifying their hexadecimal value.
PDML adopts the \u{hhhhhh} syntax, which originated in Perl and was later adopted in Rust, Swift, JavaScript (ES6+), and other programming languages.
For example, you can insert the Unicode code point ∞ (U+221E) by typing \u{221E}.
Rationale
Unicode Escape Sequences provide a practical solution for the following potential challenges:
- Some characters (i.e. Unicode code points) are cumbersome or even impossible to type, depending on the user's input device (keyboard, tablet, phone, etc.), OS, and software setup.
  For example, Unicode emoticons (such as 😀, U+1F600) are not present on a standard keyboard. Depending on your device and its setup, you might be able to copy/paste the character, or use a dedicated method or tool (e.g. Alt code, emoji picker, character map) to insert the character. If a PDML implementation supports Unicode escape sequences, you can simply type \u{1F600}.
- Some characters might be invisible or not displayed correctly on the user's output device (e.g. monitor or printer).
  Unicode escape sequences allow you to represent these characters explicitly using only ASCII characters.
  For example, you can insert a Zero Width Space (U+200B) by writing \u{200B}.
- The PDML specification requires documents to be encoded in UTF-8. However, it is sometimes necessary to use other text encodings for storing whole PDML documents or just parts of them.
  Unicode escape sequences ensure that characters are represented consistently, regardless of encoding settings (even if the encoding supports only ASCII characters).
  Example: If text is encoded in UTF-16, then code points beyond U+FFFF must be encoded using UTF-16 surrogate pairs. For instance, U+1F600 must be encoded as the surrogate pair U+D83D U+DE00 (two 16-bit code units). To avoid having to deal with surrogate pairs you can simply write \u{1F600}.
- As stated in the Core PDML specification, some control characters cannot be typed directly into a PDML document, for reasons explained in Invalid Characters.
  Unicode escape sequences allow you to insert these control characters without causing readability, interpretation, and security issues. For example, you can insert a backspace (U+0008) by typing the Unicode escape sequence \u{8}.
- Unicode escape sequences increase clarity and avoid misinterpretations when dealing with invisible, similar, or special characters.
  For example, Unicode defines multiple code points representing Regular and Unusual Space Characters. Imagine the confusion that could arise if these characters were mixed in the same document. In such cases, Unicode escape sequences help to clarify which specific variation is used at each location in the document.
In a nutshell, Unicode escape sequences provide a reliable, portable, and clear way to represent characters that might be difficult or error-prone to insert directly in PDML documents.
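To make the encoding argument concrete, here is a minimal Python sketch. The helper names `utf16_code_units` and `escape_non_ascii` are hypothetical and not part of any PDML tool. The first function shows the surrogate pair that UTF-16 needs for U+1F600; the second rewrites every non-ASCII character as a Unicode escape sequence, so the result can be stored in an ASCII-only encoding.

```python
def utf16_code_units(ch: str) -> list[str]:
    """Return the UTF-16 code units of one character as 4-digit hex strings."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [f"{int.from_bytes(data[i:i + 2], 'big'):04X}"
            for i in range(0, len(data), 2)]


def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with a \\u{...} escape sequence."""
    return "".join(
        ch if ord(ch) < 128 else f"\\u{{{ord(ch):X}}}"
        for ch in text
    )


print(utf16_code_units("\U0001F600"))        # prints: ['D83D', 'DE00']
print(escape_non_ascii("\u221E and \U0001F600"))  # prints: \u{221E} and \u{1F600}
```

The sketch leaves ASCII control characters untouched; a real tool would probably also escape those that Core PDML disallows.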
Examples
Note
You may consult the exact rules governing Unicode escape sequences in the PDML Extensions Specification.
Links to explore Unicode code points:
- Official chart: unicode.org/charts
- More user-friendly version: unicode-explorer.com
The hexadecimal value for the Unicode code point ∞ (infinity) is 221E (as can be seen here). Thus, you can insert ∞ by typing the Unicode escape sequence \u{221E}:
\u{221E} → ∞
You can insert Unicode code points beyond FFFF using the same syntax:
\u{1F600} → 😀
Leading zeros are optional:
\u{41}, \u{0041}, \u{000041} → A, A, A
You can write the hex digits A to F in uppercase or lowercase:
\u{1F4AA}, \u{1f4aa} → 💪, 💪
Instead of specifying just one character between the curly braces, you can specify a sequence of several characters separated by whitespace (spaces, tabs, and/or line breaks):
\u{41 42 43} → ABC
\u{2669 2C 20 266A} → ♩, ♪
The following escape sequence:
\u{2764 2764 A 2764 2764}
... would be rendered like this:
❤️❤️
❤️❤️
The line break in the rendered text is caused by code point A in the escape sequence (U+000A, End of Line).
Instead of using single spaces to separate the hex values, you can use several whitespace characters, including line breaks. Thus, you can also write the above escape sequence in a more readable way, without affecting the result:
\u{2764 2764 A
2764 2764}
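The rules shown so far (optional leading zeros, case-insensitive hex digits, several whitespace-separated values per escape) can be summarized in a short decoder. This is a rough sketch, not the reference implementation: the name `decode_unicode_escapes` is hypothetical, and the code does not handle an escaped backslash before `u` or the validity checks described in the PDML Extensions Specification.

```python
import re

# Matches \u{...} where the braces contain hex digits and whitespace.
_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f\s]+)\}")


def _expand(match: re.Match) -> str:
    # split() without arguments splits on any run of spaces, tabs, or line
    # breaks; int(h, 16) accepts leading zeros and both letter cases.
    return "".join(chr(int(h, 16)) for h in match.group(1).split())


def decode_unicode_escapes(text: str) -> str:
    """Replace every \\u{...} escape sequence with the code points it names."""
    return _ESCAPE.sub(_expand, text)


print(decode_unicode_escapes(r"\u{41}, \u{0041}, \u{000041}"))  # prints: A, A, A
print(decode_unicode_escapes(r"\u{41 42 43}"))                  # prints: ABC
```

Python's `chr` raises a `ValueError` for values above 10FFFF, so out-of-range escapes fail loudly instead of producing wrong text.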
You can use Unicode escape sequences in tags and in text leaves:
[\u{1F4AC} Hi]
→ [💬 Hi]
[b Some math symbols: \u{221A} \u{221E} \u{222B}]
→ [b Some math symbols: √ ∞ ∫]
You can also use them in attribute names and values:
Attribute name:
[food ^(\u{1F44C}=yes) ...]
→ [food ^(👌=yes) ...]
Attribute name and value:
[product ^("\u{1F44D} or \u{1F44E}" = \u{1F44D}) ...]
→ [product ^("👍 or 👎" = 👍) ...]
Contrived:
[\u{1F34E} ^(\u{1F44C}=\u{1F44D}) \u{1F4AA 1F4AA 1F4AA}]
→ [🍎 ^(👌=👍) 💪💪💪]
String Literals
Note
This feature is implemented already in the PDML reference implementation, but not yet documented here.
Attributes
Note
This feature is implemented already in the PDML reference implementation, but not yet documented here.
Utility Nodes
Note
This feature is implemented already in the PDML reference implementation, but not yet documented here.
Scripting Nodes
Note
This feature is implemented already in the PDML reference implementation, but not yet documented here.
Types
Note
This feature is partially implemented in the PDML reference implementation, but not yet documented here.
Comments
PDML comments are very similar to comments in programming languages. You can use them to:
- Add information that is useful for human readers of a PDML document, but ignored by machines that read the PDML document. For example, a comment can provide a general description of the data stored in a document.
- Temporarily disable parts of a document without deleting them.
PDML supports single-line comments (starting with ^//) and nestable multi-line comments (^/* ... */).
Here's an example of a PDML document containing various comments:
Single-line Comment
A single-line comment starts with ^// and ends at the end of line. It can start at any position in a line:
The line break at the end of the comment (i.e. LF on Unix/Linux; CRLF on Windows) is part of the comment. If you want the line break to be excluded from the comment (needed rarely), you can use ^/ (instead of ^//) to start the comment:
Multi-line Comment
A multi-line comment starts with ^/* and ends with */:
It can start at any position in a line, and it ends at any subsequent position in the same line or in a subsequent line:
You can nest multi-line comments. In other words, a comment can contain another comment:
They can be nested to any level:
If a multi-line comment contains */, then it must be embedded in ^/** ... **/ (i.e. two stars to start and end the comment):
You can use as many stars as needed, but the number of stars to start and end the comment must be the same:
You can also use plenty of stars to emphasize comments:
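The comment rules described in this section can be illustrated with a simplified comment stripper. This is only a sketch under stated assumptions, not how the reference implementation works: the names `strip_comments`, `_skip_line_comment` and `_skip_multi_line_comment` are hypothetical, string literals and other extensions are ignored, and a nested comment is assumed to be recognized when it starts with at least as many stars as the enclosing comment.

```python
def _skip_line_comment(text: str, start: int, keep_break: bool) -> int:
    """Return the index at which parsing resumes after a ^// or ^/ comment."""
    end = text.find("\n", start)
    if end == -1:
        return len(text)              # the comment runs to the end of the document
    if not keep_break:
        return end + 1                # ^// : the line break is part of the comment
    # ^/ : resume at the line break (step back over the CR of a CRLF pair)
    return end - 1 if text[end - 1] == "\r" else end


def _skip_multi_line_comment(text: str, start: int) -> int:
    """Return the index just after the end of the comment opened at start."""
    i = start + 2                     # skip "^/"
    stars = 0
    while i < len(text) and text[i] == "*":
        stars += 1
        i += 1
    opener = "^/" + "*" * stars       # assumption: nested comments start like this
    closer = "*" * stars + "/"        # same number of stars to end the comment
    depth = 1
    while i < len(text):
        if text.startswith(opener, i):
            depth += 1
            i += len(opener)
        elif text.startswith(closer, i):
            depth -= 1
            i += len(closer)
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated multi-line comment")


def strip_comments(text: str) -> str:
    """Remove PDML comments from text (simplified illustration)."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i:i + 2])  # escaped character such as \^ : copy as-is
            i += 2
        elif text.startswith("^/*", i):
            i = _skip_multi_line_comment(text, i)
        elif text.startswith("^//", i):
            i = _skip_line_comment(text, i, keep_break=False)
        elif text.startswith("^/", i):
            i = _skip_line_comment(text, i, keep_break=True)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(strip_comments("[a ^/* outer ^/* inner */ outer */b]"))  # prints: [a b]
print(strip_comments("x ^/** contains */ inside **/y"))        # prints: x y
```

Counting nesting depth, rather than stopping at the first closing sequence, is what allows a commented-out block to contain further comments.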