Core PDML Specification

Version	2.0.0
Latest update	2025-03-25
First published	2021-06-13
Website	https://pdml-lang.dev/
Author	Christian Neumanns
Copyright © 2019-2025	Christian Neumanns

Introduction

The Practical Data and Markup Language (PDML) is a text format designed for encoding data and markup (i.e. formatted text).

This document is divided into two parts:

Part I: Specification: provides the Core PDML rules for encoding data and markup as plain text.

The specification is written for anyone who wants to know the exact rules governing the PDML text format (e.g. software developers who want to implement a PDML parser, an editor/IDE plugin, or other PDML assets).
Part II: Usage and Implementation Guidelines: provides practical tips for using and implementing Core PDML.

Note

Readers are supposed to be familiar with common technical IT terms such as Unicode and abstract syntax tree (AST).

For general information about PDML you may read PDML Overview.

Before reading the specification, you might also want to first see some Core PDML Examples, to get an idea of how data and markup are encoded in PDML.

Part I: Specification

This part provides the authoritative specification for Core PDML, encompassing the fundamental set of simple rules needed for encoding data and markup of any complexity as plain text. This is the minimum set of rules that every PDML implementation must conform to.

PDML Extensions (optional features designed to enrich the Core PDML functionality) are not covered here. The difference between Core PDML and PDML Extensions is explained in What Is "Core PDML" and What Are "PDML Extensions"?.

Note

PDML syntax rules are shown in Extended Backus–Naur form (EBNF) notation.

Overview

In this chapter you'll be presented with a birds-eye view of Core PDML, in order to become acquainted with its terminology and rules, and grasp a general understanding. Throughout the following chapters, each element will be explained in full detail. You may skip this chapter if you're already familiar with the basic Core PDML concepts.

A PDML document is structured as a tree that can contain three types of nodes:

Tagged Branch Node

This node has a tag (aka label or name) and one or more child nodes.

A tag is a sequence of one or more Unicode code points (e.g. config, chapter).

Each child node can be anyone of the three node types.

Syntax: "[" tag separator node + "]"

The separator can be a space, tab, or line break.

Example

You can use tagged branch nodes, for instance, to encode a record (aka structure, struct, or compound data). Here's a simple example of a node tagged dimensions containing two child-nodes, the first tagged width and the second one height:
```
[dimensions [width 200][height 100]]
```
In other words, the above code represents a record (tagged dimensions) with two fields: width and height. Each field is a key–value pair. In our example, the key width contains the value 200).

We can enhance readability by indenting nested nodes:
```
[dimensions
    [width 200]
    [height 100]
]
```
In XML, the semantically equivalent tree looks like this:
```
<dimensions>
    <width>200</width>
    <height>100</height>
</dimensions>
```
And here's the same data structure expressed in JSON:
```
{
    "dimensions": {
        "width": 200,
        "height": 100
    }
}
```
Tagged Leaf Node

This node has a tag, but no child nodes.

Syntax: "[" tag "]"

Example

Tagged leaf nodes are used to encode record fields that have no value (aka "absence of value"; null, nil, nothing, void in programming languages).

Suppose that a user didn't enter a value for field remark in a data entry form. Then the code representing this field (i.e. its key–value pair) would look like this:

PDML: [remark]

XML: <remark></remark> or <remark />

JSON: { "remark": null }

Besides being used for record fields with no values, tagged leaf nodes are also useful to represent native elements, especially in markup documents.

For example, a horizontal rule (hr) in a markup document would be inserted using the following syntax:

PDML: [hr]

HTML: <hr /> or <hr>
Untagged Text Leaf Node

This node is composed of text (Unicode code points), and it doesn't have a tag.

Example

Untagged text leaf node: Lorem ipsum ...

An untagged text leaf is always contained in a tagged branch node. Here's a branch node tagged p (paragraph) containing the untagged text leaf Lorem ipsum ...:

PDML: [p Lorem ipsum ...]

HTML: <p>Lorem ipsum ...</p>

The root node of a PDML document must be a tagged branch node or a tagged leaf node.

Example

Here's a PDML document containing a mixture of data and markup code:

File example.pdml

[document
    [data
        [message
            [id 123]
            [content All is well! 👍]
            [public? yes]
            [remark]
        ]
    ]
    [markup_code
        [p We can write text in [b bold], [i italic], or [b [i bold and italic]].]
    ]
]

The following image shows the tree structure of node data, stripped of indentation:

File System Analogies

A PDML document is structured in a way that resembles the tree of directories and files on your PC.

To help you grasp the essence of PDML terminology, the following analogies may prove useful:

A directory containing other directories and/or files is like a tagged branch node in PDML: It has a name (called tag in PDML) and child-nodes.
An empty directory is like a tagged leaf node in PDML: It only has a name (a tag).
A non-empty text file is like an untagged text leaf node in PDML: It contains a sequence of one or more characters. However, while a file has a name, an untagged text leaf has no tag.

Basic PDML Terminology

Some commonly used IT-terms are defined as follows in the context of PDML:

Character

A single Unicode code point.

Examples: a, Z, é, 日, 0, 9, !, 😀
Text

A sequence of one or more characters.

Examples: Hello, Guten Tag!, やあ, a = b + 1

Whitespace character

Any one of the following characters:

Unicode Code Point	Unicode Name	C-style Syntax
U+0009	Character Tabulation (HT, TAB)	'\t'
U+000A	End of Line (EOL, LF, NL)	'\n'
U+000C	Form Feed (FF)	'\f'
U+000D	Carriage Return (CR)	'\r'
U+0020	Space (SP)	' '

Whitespace

A sequence of one or more whitespace characters.
Line break

One of the following:
- LF: a single End of Line character (U+000A), also called Unix/Linux line break
- CRLF: a sequence of a single Carriage Return character (U+000D) followed by a single End of Line character (+U000A), also called Windows/DOS line break

Tree Structure

As mentioned in Overview, a PDML document is structured as a tree that can contain three types of nodes:

Tagged Branch Node: has a tag and one or more child nodes.

A child node can be anyone of the three node types.
Tagged Leaf Node: has a tag, but no child nodes.
Untagged Text Leaf Node: is composed of text.

The root node must be a tagged branch node or a tagged leaf node.

Note

PDML is a text format, and this specification describes the rules for encoding PDML documents as plain text — typically represented as a sequence of Unicode code points stored in a UTF-8 encoded text file bearing the .pdml extension.

Beyond text encoding, there are other methods for storing PDML documents, but they are not covered here. For example, a PDML parser typically stores a PDML document in an abstract or concrete syntax tree (AST or CST). These data structures are implementation- and programming-language-dependant, and beyond the scope of this specification.

Tag

Each tagged node (i.e. tagged branch node or tagged leaf node) must have a tag (aka label, name).

A tag is composed of arbitrary text.

Some characters are not allowed (see Invalid Characters). Moreover, some characters must be escaped (see Character Escape Sequences).

A node tag does not need to be unique. Distinct nodes in a tree may have the same tag, even if they are siblings.

Here are a few tag examples:

quality
Qualität               // German for "quality"
คุณภาพ                 // Thai for "quality"
text_color
list.index-a
123
2d_point
2025-01-07
_
😍                     // Unicode emoticon

Tags may contain whitespace characters, but they must be escaped (see Character Escape Sequences).

For example, the tag Net Weight contains a space character which must be replaced with its escape sequence \s:

Net\sWeight

And this tag, which contains also a line break and square brackets (in addition to a space):

Net Weight
[kg]

... must be written:

Net\sWeight\n\[kg\]

Tagged Branch Node

A tagged branch node (also simply called branch node) is composed of a tag, a separator, and a list of one or more child nodes.

Syntax

The syntax of a tagged branch node is as follows:

"[" tag separator node + "]"

A tagged branch node is enclosed by a pair of square brackets ([...]) which serve as node delimiters: the node starts with an opening delimiter [ and ends with a closing delimiter ]. Note that a Tagged Leaf Node is also enclosed by a pair of square brackets, but an Untagged Text Leaf Node isn't.

The opening delimiter ([) is immediately followed by the node tag, a separator (often a single space), and one or more child nodes.

A node with tag greeting and text content (child-node) Hello world looks as follows:

[greeting Hello world]

A tagged branch node containing only text (as in the example above) is called a tagged text node.

Separator

In a tagged branch node, the node tag must be separated from the child nodes (i.e. the node contents) by a valid separator.

The following separators (consisting of one or two whitespace characters) are allowed:

Unicode Code Point(s)	Name	C-style Syntax
U+0020	Space	' '
U+000A	End of Line (Unix line break)	'\n'
U+000D U+000A	Windows line break	"\r\n"
U+0009	Character Tabulation	'\t'

Let's look at some examples.

Here's a node tagged color with text content green, and a space used as separator:

[color green]
      ^

If two spaces were present:

[color  green]
      ^^

... then the first space would serve as separator, while the second space would be part of the node content. In other words, the above node contains the text “ green” (i.e. the text starts with one space).

Because a line break is also a valid separator, the following two nodes are semantically equivalent:

[color green]

[color
green]

Note that the line break in the second example can be a Unix/Linux line break (LF) or a Windows line break (CRLF).

Now consider the following snippet:

[color
    green
]

Here, the line break after color serves as separator, and the node contains a single text leaf starting with four spaces (i.e. the indentation) and ending with a line break (included).

The following example shows node b containing node i with text content huge:

[b [i huge]]
  ^  ^

It would be invalid to omit the separator of node b:

[b[i huge]]
  ^          INVALID!

The separator is required in tagged branch nodes.

Note that each tagged branch node has only one separator right after the tag, no matter how many child nodes it contains. Consider this snippet:

[foo [key1 value1] [key2 value2]]
                  ^

The space between the two child nodes is not a separator of the child nodes — it is an Untagged Text Leaf Node composed of a single space. Hence, node foo contains three child nodes, the second one being a single space character.

Also note that escaped whitespace characters are not eligible separators, and will be treated as part of the tag. This snippet:

[a\sb c]

... represents a node with tag a b and text content c — it does not represent a node with tag a and text content b c.

Child Nodes

A tagged branch node contains a list (i.e. ordered sequence) of one or more child nodes.

A child node can be:

a tagged branch node (which might itself have child nodes too) (e.g. [color green])
a tagged leaf node (e.g. [remark])
an untagged text leaf node (e.g. lorem ipsum ...)

Examples:

A branch node tagged color, containing the text light green:
```
[color light green]
```
A branch node containing another branch node:
```
[config [color light green]]
```
Here, node config contains one child node, tagged color, containing the text light green.
A deeply nested tree:
```
[config [color light green][size [width 200][height 100]]]
```
Node config contains two child nodes: color and size. Node size also contains two child nodes: width and height.

The above code is written using the so-called compact form. To increase readability and maintainability we can prettify it using indentation:
```
[config
    [color light green]
    [size
        [width 200]
        [height 100]
    ]
]
```
However, note that the additional whitespace used for indenting is part of the data stored in the AST. As explained later, whitespace is preserved in PDML (as in XML, but unlike JSON and YAML). After removing the whitespace used for indentation, the tree will be equivalent to its compact form (see Insignificant Whitespace).
A branch node containing a mixture of text and other branch nodes:
```
[p We can write words in [b bold], [i italic], or [b [i bold and italic]].]
```
In this example, PDML is used to encode markup that would be rendered as follows:

We can write words in bold, italic, or bold and italic.

Tagged Leaf Node

A tagged leaf node has only a tag, but no separator nor child nodes.

Its syntax is as follows:

"[" tag "]"

Example of a tagged leaf node with tag remark:

[remark]

Important

A separator is not permitted in a tagged leaf node. Hence, the following code is invalid, and the parser must generate an error:

[remark ]
       ^    INVALID!

If two spaces were present:

[remark  ]
       ^^

... then the code would represent a Tagged Branch Node containing an Untagged Text Leaf Node consisting of a single space character.

Untagged Text Leaf Node

An untagged text leaf node (also simply called a text leaf) is composed of arbitrary text. This text is all there is to this node — there's no tag and no delimiters.

Some characters are not allowed (see Invalid Characters). Moreover, some characters must be escaped and others may be escaped (see Character Escape Sequences).

An untagged text leaf node must be contained in a tagged branch node. In other words, a text leaf is always a child of a branch node.

For example, a branch node tagged message containing the text leaf All is well! 👍 looks as follows:

[message All is well! 👍]

As previously mentioned, a tagged branch node containing only text is also called a tagged text node. Hence, the above example shows a tagged text node (tagged message) containing the text leaf All is well! 👍.

Using the Thai language, the node would look like this:

[ข้อความ ทุกอย่างดี 👍]

Here's a contrived example of a branch node tagged 1 containing the text leaf 2 3:

[1 2 3]

The following warning node contains the text leaf: Characters \, [, ], and ^ must be escaped.:

[warning Characters \\, \[, \], and \^ must be escaped.]

Invalid Characters

The following characters (Unicode code points) are invalid anywhere in a PDML document (including tags and text leaves), because they can cause readability, interpretation, and security issues.

C0 Controls in the range U+0000 to U+001F (included), except:
- U+0009 (Character Tabulation)
- U+000A (End of Line)
- U+000C (Form Feed)
- U+000D (Carriage Return)
C1 Controls in the range U+0080 to U+009F (included)

Tip

Unicode Escape Sequences (a PDML extension not covered here) can be used to insert the above Unicode code points (C0 Controls and C1 Controls). For example, a backspace (U+0008) can be inserted using the Unicode escape sequence \u{8}.

The only exception is U+0000, which is always invalid in PDML documents, even when written as an escape sequence. U+0000 is prohibited because many programming languages treat it as a string terminator. As a result, its inclusion within a string could cause unexpected and unpredictable malfunctions, potentially creating vulnerabilities that could be exploited maliciously.

To represent binary or non-printable data in a PDML document, encoding methods like Base64 should be used, ensuring the document remains valid, readable, and interoperable (see Encoding Binary Data).

Unicode surrogate code points in the range U+D800 to U+DFFF are also invalid anywhere in a PDML document, because they are reserved for encoding Unicode code points beyond U+FFFF in UTF-16 encodings.

Character Escape Sequences

Some characters in tags and text leaves must be escaped.

A character escape sequence starts with a backslash (\), followed by a second character that denotes the escaped character (as in C-like programming languages).

For example, the character escape sequence \t is used to escape a Character Tabulation (TAB).

The following table shows the character escape sequences:

PDML Category	Unicode Code Point	Character	Escape Sequence	Mandatory in Text Leaves
Escape character	U+005C	\	\\	yes
Node delimiters	U+005B	[	\[	yes
Node delimiters	U+005D	]	\]	yes
Whitespace characters	U+0009	Character Tabulation (HT, TAB)	\t	no
	U+000A	End of Line (EOL, LF, NL)	\n	no
	U+000C	Form Feed (FF)	\f	no
	U+000D	Carriage Return (CR)	\r	no
	U+0020	Space (SP)	\s	no
Reserved characters (current and future extensions)	U+005E	^	\^	yes
	U+0028	(	\(	no
	U+0029	)	\)	no
	U+003D	=	\=	no
	U+0022	"	\"	no
	U+007E	~	\~	no
	U+007C	\|	\\|	no
	U+003A	:	\:	no
	U+002C	,	\,	no
	U+0060	`	\`	no
	U+0021	!	\!	no
	U+0024	$	\$	no

All characters in the above table must be escaped in tags.

However, only characters in rows marked as "yes" in the last column must be escaped in text leaves. Characters in rows marked as "no" may be escaped.

For instance:

The text snippet a[1] must be written as a\[1\] in tags and in text leaves.
The text snippet a=b must be written as a\=b in tags; in text leaves it can be written as a=b or a\=b.

Note

The reason for having characters that may be escaped in text leaves is to support the same set of escape sequences in tags and text leaves.

This can be useful, for instance, when shared code (possibly auto-generated) is used in tags and text leaves. For example, the text snippet a\sb can be inserted into a tag as well as into a text leaf.

Escape sequences not listed in the above table are invalid and parsers must generate an error if they occur in a PDML document. For example, if \m occurs in a PDML document, the parser must generate an error — it must not convert \m to m or ignore the invalid escape sequence.

Example

A node tagged Note (important) with text content Characters \, [, ], and ^ must be escaped. is written as follows:

[Note\s\(important\) Characters \\, \[, \], and \^ must be escaped.]

Whitespace Rules

A PDML document may contain whitespace before or after the root node. Any other character is invalid (including escaped whitespace characters like \t and \n). Whitespace before or after the root node must be ignored.

Here's a valid PDML document that has one empty line before the root node, and two empty lines after it:


[root
    [child text]
]

Whitespace within PDML documents (i.e. the text contained between the opening and closing delimiters of the root node) must be preserved by parsers.

Consider the following PDML snippet:

[a  foo   [b]
    2 [c] [d]
]

Here, node a contains seven child nodes:

text leaf {space}foo{space}{space}{space}
tagged leaf node b
text leaf {line break}{space}{space}{space}{space}2{space}
tagged leaf node c
text leaf {space}
tagged leaf node d
text leaf {line break}

Note that all line breaks are stored in the AST as they are encountered, without any transformations such as normalization to LF. Hence, an AST might contain a mixture of Unix line breaks (LF) and Windows line breaks (CRLF).

Note

Preserving whitespace must be the default behavior of any PDML-compliant parser. See How to Handle Whitespace for guidelines about handling whitespace in specific contexts.

Encoding

PDML documents must be encoded in UTF-8.

Versioning

This specification adopts Semantic Versioning.

Useful Links

Core PDML Grammar, expressed in Extended Backus–Naur form (EBNF) notation
Official website: pdml-lang.dev
Core PDML Examples

Glossary

Note

The following definitions are PDML-specific — they might have different definitions in other domains.

character	A single Unicode code point. Examples: `a, Z, é, 日, 0, 9, !, 😀`
child node	A node contained in a tagged branch node. The child can be a tagged branch node, tagged leaf node, or untagged text leaf node.
line break	A Unix line break (`LF`) or Windows line break (`CRLF`).
node delimiters	A pair of square brackets used to start and end a node (`[...]`). Example: `[size big] ^ ^`
node separator	A space, line break, or tab that separates a tag from child nodes. Example: `[color dark green] ^`
PDML AST	An abstract syntax tree (AST) representing a PDML text document.
PDML text document	UTF-8 text that encodes data and/or markup according to the rules of the PDML specification. Example: `[doc [data ...] [markup ...] ]`
tag	A label or name assigned to a tagged branch or tagged leaf node. Example: `[pi 3.1415] ^^`
tagged branch node	A node that has a tag, a separator, and one or more child nodes. Example: `[dimensions [width 200] [height 100] ]`
tagged leaf node	A node that has a tag, but no separator and no child nodes. Examples: `[remark]`, `[hr]`
tagged text node	A tagged branch node that contains a single text leaf. Example: `[color light green]`
text	A sequence of one or more characters (Unicode code points). Examples: `Hello`, `Guten Tag!`, `やあ`, `a = b + 1`
untagged text leaf node (aka text leaf)	A tagless leaf node composed of text. Example: `Lorem ipsum ...`
whitespace	A sequence of one or more whitespace characters.
whitespace character	A space, tab, carriage return, line feed, or form feed.

Part II: Usage and Implementation Guidelines

This part provides practical tips for using and implementing PDML, based on real-case scenarios that need to be considered.

The following topics are covered: length limits, tips on using tags, handling whitespace, and encoding binary data.

Length Limits

The specification does not define a length limit (i.e. maximum number of characters/Unicode points) for tags, text leaves, or the whole document. Practical limitations depend on software implementations (e.g. parsers) and should be documented.

In case of very large PDML documents, you may encounter performance issues or out-of-memory errors, depending on the capabilities of the system.

Tip on Implementing a Parser

If you create a parser designed to read very large documents, consider creating a streaming parser that processes the document in smaller chunks. For example, instead of loading the whole document into memory, the parser only holds the currently parsed token in memory (e.g. a single tag or text leaf). The PDML reference implementation includes a streaming parser that uses this technique and generates events to be consumed by a handler.

How to Handle Whitespace

IT history has repeatedly proven that whitespace (though seemingly harmless) can cause unexpected results and malfunctions when handled incorrectly. Problems and unanticipated edge cases can arise due to the following factors:

There is no universal, precise, one-size-fits-all definition for the term whitespace in the IT world — this term can mean different things in different domains and contexts.
To enhance readability and maintainability, we often insert insignificant whitespace (e.g. indents) into text documents containing data or markup.
Line breaks are defined differently in Unix/Linux (LF) and Windows (CRLF).

Moreover, editors and other tools might silently transform line breaks (e.g. CRLF to LF), and these transformations depend on the tools used, the OS, as well as configuration files installed on your PC.
Editors and other tools often render tabs as spaces, making it difficult for humans to distinguish between the two.

Software applications might also silently transform tabs to spaces, or vice versa.

To help you avoid unexpected issues, this section offers some tips for managing whitespace in PDML documents.

Basic Principles

To keep things straightforward and practical, PDML adheres to the following principles:

As stated in the specification, a whitespace character can be a space (SP), character tabulation (HT, TAB), carriage return (CR), end of line (EOL, LF, NL), or form feed (FF). These are the characters commonly used in data and markup documents, and it's easy to type them on a keyboard. The only exception is FF, which is supported because it's still used in some legacy applications.

Note

Unicode defines additional code points representing whitespace (over 20), but they have no special meaning in PDML — they are parsed and stored in ASTs like any other character.
Section Whitespace Rules states: "Parsers must preserve whitespace in PDML documents."

This rule is justified because PDML is a general purpose and portable text format. Any attempt to define standard whitespace-handling rules would fail: the rules would be "just what we need" in some contexts, but "totally unsuited" in others.

Insignificant Whitespace

To illustrate the challenges of handling whitespace correctly in various situations, let's first consider a 2D point with coordinates x=1.234 and y=567.8. In PDML we can encode this point as follows:

[2d_point [x 1.234][y 567.8]]

Here we use the so-called compact form: every character is significant and required.

Note

Whenever PDML documents are read and written only by machines, the compact form is usually preferable because it leads to better time- and space-efficiency.

Although the compact form is short and efficient, it can be difficult to read for humans (especially in the case of large, deeply nested trees). We can easily enhance readability and maintainability by indenting child nodes and treating the additional whitespace as insignificant:

[2d_point
    [x 1.234]
    [y 567.8]
]

The additional whitespace is called insignificant because it doesn't change the data (semantics) — it's sole purpose is to enhance readability and maintainability (aka pretty printing, prettyfying, beautifying).

We can further enhance readability by also inserting insignificant whitespace within the x and y nodes:

[2d_point
    [x   1.234]
    [y 567.8  ]
]

Because whitespace is preserved when the document is parsed into an AST, it must be ignored or removed in one way or another when the AST is transformed into a programming-language-dependant data structure (e.g. record) stored in memory.

Now let's look at an example of markup:

[article [title Harmonious States][p This article [i demonstrates] that ...][p You'll also discover ...]]

Again, we can greatly enhance readability and show the tree structure by inserting insignificant whitespace:

[article [title Harmonious States]
    [p This article [i demonstrates] that ...]
    [p You'll also discover ...]
]

Since paragraphs are frequently used in markup, it would be convenient to omit the p nodes and simply insert an empty line that will be interpreted as a paragraph break:

[article [title Harmonious States]
        
    This article [i demonstrates] that ...
        
    You'll also discover ...
]

Obviously, this level of user-friendlyness requires customized whitespace-handling rules to be applied by the application reading the PDML document.

As you can see from these simple examples:

Insignificant whitespace is often useful because it enhances readability and maintainability, especially in cases of deeply nested PDML documents maintained by humans.
Sometimes whitespace is significant, sometimes it's not.
The rules for proper whitespace-handling can vary largely — they depend on the nature and context of the PDML documents.

For example, whitespace-handling rules for data documents are usually very different from those required for markup documents.

This implies that different rules might be required for different parts of documents that contain a mixture of data and markup.

Now that we're aware of the challenges, let's have a look at possible approaches for whitespace-handling:

Utility Functions and Methods
Types
Customized Parsers
Application-Specific Handling

The best approach (or combination of approaches) depends on the domain and context in which PDML is used. The following sections provide general guidelines.

Utility Functions and Methods

A PDML implementation may provide any number of utility functions/methods designed to apply common whitespace-handling operations. They are usually invoked by the application code after the PDML document has been parsed into an AST. They may check if whitespace is contained in a node, change the data stored in the AST, or transform AST nodes into ready-to-use values of a specific type.

Let's look at some examples (using Java syntax):

Method isWhitespace() in class TextLeafNode: returns true if the text leaf contains only whitespace, otherwise returns false.

Application code can invoke code like remarkField.isWhitespace(), and take appropriate action, such as ignoring the node if the method returns true.
Method trimTextLeafNodes() in class TaggedBranchNode: removes leading and trailing whitespace in all text leaves contained in the branch.

For example, consider this structure:
```
[3d_point
    [x 123.45]
    [y   1.1 ]
    [z   3   ]
]
```
Calling trimTextLeafNodes() on node 3d_point transform the structure into this one:
```
[3d_point
    [x 123.45]
    [y 1.1]
    [z 3]
]
```
Method replaceCRLFWithLF() in class TaggedBranchNode: replaces Windows line breaks (CRLF) with Unix line breaks (LF) in all text leaves contained in the branch.
Methods to convert text leaves into typed values.

Consider the following node:
```
[bonus   1,000,000 ]
```
Suppose that TextLeafNode provides method asInt to convert the text into an int. Invoking asInt on the above text leaf removes leading and trailing whitespace, removes the commas, converts the remaining text into an int value, and returns 1000000 as result.
Methods to convert a tree or sub-tree into typed values.

To streamline the conversion of PDML documents (or parts of it) into commonly used in-memory data structures, a PDML implementation can provide dedicated convenience methods/functions.

For example, we often use PDML to store collections — mainly lists and maps (aka dictionaries, associative arrays). Therefore, the PDML reference implementation provides utility class StringMapUtil, which contains convenience methods to convert PDML trees or sub-trees into Map<String,String> objects.

Here's a file storing three key–value pairs:

File config.pdml
```
[config
    [ip 192.168.1.1]
    [port 8080]
    [timeout]
]
```
You can convert this file into a Map<String,String> using a single line of code:
```
Map<String,String> config = StringMapUtil.parseFile ( "config.pdml" );
```
parseFile() parses the file into an AST, removes insignificant whitespace (e.g. indents), ensures that the root node only contains tagged text nodes or tagged leaf nodes, and traverses the tree to create and return a Map<String,String> object which you can then use in your application to do whatever you want.

Utility functions/methods executing common whitespace-handling operations provide the following benefits:

You can use them in many contexts because they provide a comprehensive set of individual operations that can be combined in many different ways to cover a wide variety of needs.
You can apply them to the whole tree, sub-trees, or individual nodes.
You can apply different whitespace-handling rules to different sections of a document (e.g. data vs markup sections).
The parser is much simpler to implement because there's no need for checking and handling whitespace (less ifs in the parser source code).
Documents are parsed faster, because text leaves never need to be tested for whitespace and handled appropriately.

If your application uses only a small part of a document (quite common), you might benefit from a considerable overall performance gain, because the potentially time-consuming whitespace-handling operations are only executed on the data used by your application — they don't need to be executed on the whole tree.

Types

A PDML type (a PDML extension, not part of Core PDML, and not specified here) can be designed to apply customized whitespace-handling rules, applied implicitly when the PDML document is parsed.

Consider the following code:

[birthdate
    2000-01-01
]

If node birthdate is configured to be of type date, then:

Leading and trailing whitespace is removed at parse-time (by functions that are part of the type and implicitly called by the parser).
The remaining text is verified to be a valid date in the ISO 8601 format and, if valid, stored in the AST, otherwise an error is generated.

Customized Parsers

If you create a PDML parser for a context where whitespace-handling rules are well-known, then the best option might be to implement the rules in the parser.

However, as stated at the end of section Whitespace Rules, any PDML-compliant parser must preserve whitespace by default. Therefore, a compliant parser must support configuration options to enable whitespace-handling rules explicitly.

Example: Consider a parser that supports configuration option convertCRLFToLF, to convert all occurrences of CRLF into LF. The default value of this option must be false, as required by the specification (i.e. whitespace must be preserved). Hence, convertCRLFToLF must explicitly be set to true to enable the conversion.

Application-Specific Handling

If whitespace-handling rules are very specific (uncommon), then the best approach is usually to implement them in the application that reads the parsed document.

This approach is used in the Practical Markup Language (PML) to implement specific rules, such as: A new paragraph is inserted automatically whenever text is separated by whitespace containing at least two Unix/Windows line breaks.

Line Breaks

As previously mentioned, line breaks in text files (e.g. .pdml files) can be problematic because:

Unix/Linux uses LF, while Windows uses CRLF, making them different.
Editors and other tools might silently change line breaks (CRLF to LF or vice versa).

Reading Operations

Consider the following text in a PDML document:

line 1
line 2

The actual line break inserted between the two lines depends on the tool and environment/configuration used to create the PDML document. Most editors running on Windows would (by default) insert a CRLF sequence. If you later edit the document in an editor running on Linux, the line break might silently be changed to LF. Line break changes might also occur due to configuration files such as .editorconfig or .gitattributes, or they might be caused by various tools processing the PDML documents in different environments.

Most modern applications promote EOL neutrality. They tolerate and handle both line break versions equally well, even if they are mixed in a single document — no problem if CRLF is changed to LF, or vice versa.

However, unsupported line breaks can cause strange results or malfunctions. For example, a Unix tool might not work properly if Windows line breaks are present.

To avoid issues, you might want to force the required line break version using character escape sequences. You can force a Windows line break using \r\n:

line 1\r\nline 2

... or a Unix/Linux line break using \n:

line 1\nline 2

Although usually less user-friendly, escape sequences provide the following advantages:

The line break version (Unix or Windows) is clearly shown in editors, and the escape sequences might even be displayed in a different color.
Automatic line break changes by editors and other tools won't affect these escape sequences.

Writing Operations

By default, general-purpose PDML writers should:

Write line breaks stored in text leaves unchanged.

For example, if a text leaf contains a mixture of Unix and Windows line breaks, then they should be written without transformations, regardless of whether the application runs on Unix or Windows.
Write OS-native line breaks for additional line breaks inserted by the writer (e.g. for pretty printing)

Thus, a writer running on Unix writes LF; on Windows it writes CRLF.

More sophisticated writers can support parameters to customize their behavior, such as:

Enforce OS-native line breaks in the output (e.g. a writer running on Unix always writes LF; on Windows it writes CRLF).
Always write LF, even if the application runs on Windows.

Character Tabulations

Character Tabulations (tabs) are often rendered as spaces (typically two to eight spaces per tab). If this is an issue, you can insert the escape sequence \t (instead of just a single tab character), so that tabs and spaces can easily be distinguished.

For example, if an editor is configured to display tabs as two spaces, the text foo bar doesn't tell you if two spaces or a single tab separates foo from bar, whereas foo\tbar clearly shows that a tab is used, and the tab escape sequence might even be displayed in a different color.

Moreover, using \t prevents applications from silently replacing tabs with spaces.

The same goes for sequences of spaces. For example, \s\s (two escaped spaces) would prevent two-space indentation from being replaced by a tab when the editor is set to use tab indentations.

Encoding Binary Data

To encode binary data in PDML documents (e.g. images, audios, and videos), you can use common and portable binary-to-text encoding methods like Base64.

The following example shows a tiny (1x1) PNG image encoded in Base64:

[image
    [title Example Image]
    [format PNG]
    [data
        iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwC
        AAAAC0lEQVR42mP8/wQAAwAB/AL+j3sAAAAASUVORK5CYII=
    ]
]

About This Document

Copyright

Everyone is permitted to:

Copy and distribute verbatim copies of this document (changing it is not allowed).
Create verbatim translations of this document into other human languages and distribute them.
Create verbatim conversions of this document into other text formats and distribute them.

Acknowledgments

Many thanks to Tristano Ajmone for his useful feedback to improve this document.

Markup

This document is written in PML, a lightweight markup language based on PDML.

Revision History

2025-xx-xx: version 2.0.0
2021-12-03: version 1.0.1
2021-06-13: version 1.0.0

Core PDML Specification

Introduction

Part I: Specification

Overview

Basic PDML Terminology

Tree Structure

Tag

Tagged Branch Node

Syntax

Separator

Child Nodes

Tagged Leaf Node

Untagged Text Leaf Node

Invalid Characters

Character Escape Sequences

Whitespace Rules

Encoding

Versioning

Useful Links

Glossary

Part II: Usage and Implementation Guidelines

Length Limits

Tags

Usage

Length

Case-Sensitivity

Escaped Characters

Anonymous Nodes

Duplicate Tags

How to Handle Whitespace

Basic Principles

Insignificant Whitespace

Utility Functions and Methods

Types

Customized Parsers

Application-Specific Handling

Line Breaks

Reading Operations

Writing Operations

Character Tabulations

Encoding Binary Data

About This Document

Copyright

Acknowledgments

Markup

Revision History