PDML Extensions Specification

Introduction

The Practical Data and Markup Language (PDML) is a text format designed for encoding data and markup (i.e. formatted text).

This document provides the authoritative specification for PDML Extensions. Core PDML is covered in the Core PDML Specification. The difference between Core PDML and PDML Extensions is explained in What Is "Core PDML" and What Are "PDML Extensions"?

The specification is written for anyone who wants to know the exact rules governing PDML Extensions (e.g. software developers who want to implement a PDML parser, an editor/IDE plugin, or other PDML tools). If you are new to PDML Extensions, we recommend reading the PDML Extensions User Manual first. For general information about PDML, see the PDML Overview.

Notes

Extensions are a work in progress. All extensions are currently in an experimental state. There might be breaking changes in future versions, and some extensions might even be removed.

Additional extensions are implemented already in the reference implementation, but not yet documented here.

Syntax rules are shown in Extended Backus–Naur Form (EBNF) notation.

Please share your thoughts using the following links:

Discussion: ask a question, discuss an idea, enhancement, or anything else.
Issues: report mistakes, suggest enhancements and missing features, etc.

The Extension Start Character `^`

Every extension (except Unicode Escape Sequences) starts with a Circumflex Accent (^, U+005E), also known as the extension start character in the context of PDML.

This character is always followed by one or more characters that identify the extension to be applied. For example, ^/ indicates the start of a comment, while ^" indicates the start of a string literal.

Syntax Extensions

The following syntax extensions are provided:

Comments
Unicode Escape Sequences
String Literals
Attributes

Comments

A comment consists of a text segment that is not part of the data/markup code stored in a PDML document. PDML supports single-line comments and nested multi-line comments.

Single-line Comment

A single-line comment starts with ^// or ^/ at any position in a line and ends at the end of that line:

^// Full line comment
text ^// Trailing comment

If the comment starts with ^//, the line break at the end of the line (LF or CRLF) is part of the comment. If it starts with ^/, the line break is not part of the comment and belongs to the text that follows:

^// Line break is part of the comment
^/  Line break is part of the following text
text text
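
The line-break rule above can be illustrated with a minimal Python sketch. It is not taken from the reference implementation: the function name `strip_single_line_comments` and the regular expression are assumptions made for illustration, and the sketch deliberately ignores context (it does not check whether the marker sits inside a tag, a string literal, or a multi-line comment).

```python
import re

# Hypothetical pattern, for illustration only:
#  - `^//` up to the end of the line, INCLUDING the line break (LF or CRLF)
#  - `^/` (not followed by `/` or `*`) up to the end of the line,
#    EXCLUDING the line break
_SINGLE_LINE_COMMENT = re.compile(
    r"\^//[^\r\n]*(?:\r?\n)?"   # ^// comment plus its line break
    r"|\^/(?![/*])[^\r\n]*"     # ^/ comment, line break left in place
)

def strip_single_line_comments(text: str) -> str:
    """Remove single-line comments, ignoring any surrounding context."""
    return _SINGLE_LINE_COMMENT.sub("", text)
```

With this sketch, a trailing `^//` comment joins the current line with the next one, whereas a `^/` comment leaves the line break in the text.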

Multi-line Comment

A multi-line comment starts with ^/, immediately followed by one or more * characters (e.g. ^/*, ^/**, ^/***). It must be closed by the same number of * characters that were used to open it, followed by a / character (e.g. */, **/, ***/).

It can start at any position in a line. It can end in the same line as it was started, or at a subsequent line, at any position:

^/*
    multi
    line
    comment
*/

text ^/* inline comment */ text

text ^/* this text is
commented out */ text

Multi-line comments can be nested to any level:

^/* level 1 (not nested)
    ^/* level 2
        ^/* level 3 */
    */
*/
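
One way to locate the end of a nested multi-line comment is to keep a stack of the star counts of all currently open comments. The following Python sketch shows the idea under stated assumptions: the function name `find_comment_end` is hypothetical, a closing sequence is recognized wherever the required number of `*` characters is immediately followed by `/`, and the treatment of longer runs of `*` is a choice made here rather than a rule stated in this specification.

```python
def find_comment_end(text: str, start: int = 0) -> int:
    """Return the index just past the multi-line comment opening at `start`.

    Illustrative sketch only; not the reference implementation.
    """
    if not text.startswith("^/*", start):
        raise ValueError("no multi-line comment starts at this position")

    stack = []  # star counts of the currently open (nested) comments
    i = start
    while i < len(text):
        if text.startswith("^/*", i):
            # Opening sequence: count the stars that follow `^/`.
            j = i + 2
            while j < len(text) and text[j] == "*":
                j += 1
            stack.append(j - (i + 2))
            i = j
            continue
        # Closing sequence for the innermost open comment.
        closer = "*" * stack[-1] + "/"
        if text.startswith(closer, i):
            stack.pop()
            i += len(closer)
            if not stack:
                return i  # outermost comment is closed
            continue
        i += 1
    raise ValueError("unterminated multi-line comment")
```

For example, for the text `a ^/* x ^/** y **/ z */ b` and `start=2`, the returned index points just past the final `*/`, so the remaining text is ` b`.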

Common Rules

Single- and multi-line comments must adhere to the following rules:

Comments are allowed:

At the start, in the middle, and at the end of text leaves:

[foo ^/* comment at start */ text]

[foo text ^/* comment */ text ^/* comment */ text]

[foo text ^/* comment at end */]

[foo text
    ^// comment
    text ^// comment
]

Before, between, and after attribute assignments (see Attributes).

Comments are not allowed:

In tags:
```
[foo^/* INVALID comment */bar]
```
In string literals:
```
^"start ^/* INVALID comment*/ end"
```

Before or after the root node:

^// INVALID comment
^/* INVALID comment */

[root
    data and/or markup
]

^// INVALID comment
^/* INVALID comment */

Note

Comments are usually skipped (ignored) by parsers. However, sometimes they must be handled in one way or another. For example, a parser that creates a concrete syntax tree (CST) stores all comments in the CST.

Unicode Escape Sequences

Unicode escape sequences enable users and machines to insert Unicode code points in PDML documents by specifying the hexadecimal values of the code points.

For an introduction and examples, see the section Unicode Escape Sequences in the PDML Extensions User Manual.

Grammar

PDML adopts the \u{hhhhhh} syntax for Unicode escape sequences, but allows a list of several code points to be defined in a single escape sequence.

The grammar is defined as follows:

Name	Rule	Examples
Unicode_escape_sequence	`"\u{" hex_values "}"`	`\u{41}` `\u{41 42 43}`
hex_values	`hex_value ( whitespace hex_value ) *`	`41` `41 42 43`
hex_value	`hex_digit hex_digit ? hex_digit ? hex_digit ? hex_digit ? hex_digit ?`	`A` `1F600`
hex_digit	`0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | a | b | c | d | e | f`	`0` `9` `a` `F`
whitespace	`whitespace_item +`
whitespace_item	`{space} (U+0020) | {character tabulation} (U+0009) | {Unix line break} (U+000A) | {Windows line break} (U+000D U+000A)`
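
As an illustration, the grammar above can be transcribed into a regular expression. The Python sketch below is one possible transcription, not part of the specification; the names `HEX_VALUE`, `WHITESPACE`, `UNICODE_ESCAPE` and `is_unicode_escape` are assumptions. It checks the syntax only and does not validate the code point values.

```python
import re

# hex_value: one to six hexadecimal digits
HEX_VALUE = r"[0-9A-Fa-f]{1,6}"
# whitespace: one or more of space, tab, LF, or CRLF
WHITESPACE = r"(?:\r\n|[ \t\n])+"
# Unicode_escape_sequence: "\u{" hex_values "}"
UNICODE_ESCAPE = re.compile(
    r"\\u\{" + HEX_VALUE + r"(?:" + WHITESPACE + HEX_VALUE + r")*\}"
)

def is_unicode_escape(text: str) -> bool:
    """Return True if `text` is exactly one syntactically valid escape sequence."""
    return UNICODE_ESCAPE.fullmatch(text) is not None
```

Note that the grammar, and therefore this sketch, allows whitespace only between hex values, not directly after `{` or before `}`.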

Additional Rules

In addition to the rules defined by the grammar, the following rules apply:

Leading zeros are optional, e.g. \u{A}, \u{0A} and \u{00000A} all represent the line feed character (U+000A).
The following Unicode code points are invalid:
- U+0000 (for reasons explained in Invalid Characters)
- Surrogates in the ranges U+D800 to U+DBFF and U+DC00 to U+DFFF (reserved to encode code points beyond U+FFFF in UTF-16)
All other Unicode code points in the range U+0001 to U+10FFFF (maximum valid code point in Unicode) are valid.

A parser must generate an error if an invalid Unicode code point is present in a PDML document.
Unicode escape sequences must be supported in:
- node tags, e.g. [foo\u{41}bar] → [fooAbar]
- text leaves, e.g. [foo foo\u{41}bar] → [foo fooAbar]
- unquoted string literals, e.g. foo\u{41}bar → fooAbar
- quoted string literals, e.g. "foo\u{41}bar" → "fooAbar"
- multiline string literals with escape mode enabled, e.g.
```
"""e
foo\u{41}bar
"""
```
As attribute names and values are defined using string literals, Unicode escape sequences are also supported in attribute names and values.
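
The code point rules above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the function names `decode_hex_value` and `decode_hex_values` are hypothetical, and the input is assumed to have already been checked against the grammar (one to six hex digits per value, separated by permitted whitespace).

```python
def decode_hex_value(hex_value: str) -> str:
    """Convert one hex value (assumed to match the grammar) to a character."""
    code_point = int(hex_value, 16)  # leading zeros do not change the value
    if code_point == 0:
        raise ValueError("U+0000 is not allowed")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"U+{code_point:04X} is a surrogate")
    if code_point > 0x10FFFF:
        raise ValueError(f"{hex_value} is beyond U+10FFFF")
    return chr(code_point)

def decode_hex_values(hex_values: str) -> str:
    """Convert a whitespace-separated list of hex values to a string."""
    return "".join(decode_hex_value(h) for h in hex_values.split())
```

A parser following this sketch would report the raised `ValueError` as a parse error, in line with the requirement that invalid code points must produce an error.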

Note

When a PDML document is parsed, Unicode escape sequences are usually converted into their Unicode code points. For example, \u{221E} is converted to ∞, which is then stored in the AST as a single Unicode code point (character).

However, this is not always the case. For example, a parser that creates a concrete syntax tree (CST) stores the Unicode escape sequences unchanged in the CST.

String Literals

Note

This feature is implemented already in the PDML reference implementation, but not yet documented here.

Attributes

Note

This feature is implemented already in the PDML reference implementation, but not yet documented here.

Utility Nodes

Note

This feature is implemented already in the PDML reference implementation, but not yet documented here.

Scripting Nodes

Note

This feature is implemented already in the PDML reference implementation, but not yet documented here.

Types

Note

This feature is partially implemented in the PDML reference implementation, but not yet documented here.

Extensions Version	0.79.0
First Published	2025-03-25
Author	Christian Neumanns
Website	https://pdml-lang.dev/
