Introduction to PDML
Author: Christian Neumanns
Published: 2021-11-16
Introduction
The Practical Data and Markup Language (PDML) is a text format to store data and markup code.
PDML's design goals are:
- human-friendly (easy to read and write for people)
- suitable for:
  - data and markup code
  - small and big, complex data structures
- a core syntax that is succinct and simple, and therefore easy to parse/deserialize and serialize
- unique, powerful extensions
A distinction is made between Core PDML and extensions. Core PDML is the absolute minimum needed to store data. Extensions are optional features to make PDML more practical and powerful.
This document mainly covers Core PDML. However, chapter Extensions contains an overview of extensions, and a link to more information.
Basic Examples
This chapter shows some basic, simple examples to get an idea of what can be done with PDML.
Text Node
Suppose a config file in which parameter color has the value green.
In JSON we would use the following syntax:
"color": "green"
In XML we could use an attribute:
color = "green"
... or an element:
<color>green</color>
In PDML the syntax is:
[color green]
The above code is called a node in PDML.
As can be seen, a node is delimited by a pair of square brackets: []. A node starts with [, and ends with ].
A node has a name and an optional value. In our example, the name is color and the value is the text green.
A space character is used to separate the name from the value.
Text values can contain spaces, new lines and Unicode characters:
[names
Tim
Tom
Tam
😃
]
Child Node
Besides text, a node's value can also be another node:
[config [color green]]
The value of node config is another node with name color and value green.
For better readability we can also write:
[config
[color green]
]
Tree
The node's content can be a list of nodes, and each child node can itself have any number of child nodes:
[config
[color green]
[size
[width 200]
[height 100]
]
]
Hence, PDML can be used to store simple or complex tree data that can be structured or unstructured.
Mixed Child Nodes
A node's content can be a mixture of any number of text and child nodes. This makes PDML convenient to store markup code.
Suppose we want to render:
Life is better if we are kind.
In HTML we would write:
<div>Life is <i>better</i> if we are <b>kind</b>.</div>
In PDML this is written as:
[div Life is [i better] if we are [b kind].]
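To give an idea of how little is needed to read this syntax, here is a minimal recursive-descent sketch in Python. It covers only what has been shown so far (names, text, child nodes, and the escapes \[, \], and \\). The names Node and parse_node are hypothetical, and the handling of whitespace (exactly one separator character after the name, all other whitespace kept as text) is an assumption of this sketch, not a statement about the PDML specification or the reference implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Children in document order: str for text, Node for child nodes.
    children: list = field(default_factory=list)


def parse_node(text: str, pos: int = 0) -> tuple[Node, int]:
    """Parse one node starting at text[pos]; return (node, next position)."""
    if pos >= len(text) or text[pos] != "[":
        raise ValueError(f"expected '[' at position {pos}")
    pos += 1

    # The name runs up to the first whitespace character, '[' or ']'.
    start = pos
    while pos < len(text) and not text[pos].isspace() and text[pos] not in "[]":
        pos += 1
    node = Node(text[start:pos])
    if not node.name:
        raise ValueError(f"missing node name at position {start}")

    # Assumption: one whitespace character separates the name from the content.
    if pos < len(text) and text[pos].isspace():
        pos += 1

    buffer = []  # characters of the text run currently being read

    def flush() -> None:
        if buffer:
            node.children.append("".join(buffer))
            buffer.clear()

    while pos < len(text):
        ch = text[pos]
        if ch == "\\" and pos + 1 < len(text):
            buffer.append(text[pos + 1])  # escaped character, taken literally
            pos += 2
        elif ch == "[":
            flush()
            child, pos = parse_node(text, pos)
            node.children.append(child)
        elif ch == "]":
            flush()
            return node, pos + 1
        else:
            buffer.append(ch)
            pos += 1
    raise ValueError(f"node '{node.name}' is not closed")
```

Applied to the div example above, the sketch yields a node named div whose children alternate between text and the child nodes i and b, in document order.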
Empty Node
A node can be empty. It has a name, but no content:
[color]
In JSON this would be written as:
"color": null
In XML:
<color></color>
or simply:
<color />
There is not much more to say about PDML's core syntax.
For a formal and complete definition please refer to the PDML Specification.
Versatility
Despite PDML's utmost simplicity, it can be used to store different kinds of data, such as:
- configuration files
- database tables
- markup code
- unstructured, heterogeneous, or polymorphic data
Examples are shown in the article PDML Examples.
The PDML syntax is used in the Practical Markup Language (PML), the precursor of PDML (as explained later). For a real-world example of a PDML document you can have a look at the markup code of the PDML specification which is written in PML and uses the PDML syntax.
PDML can be converted to XML, and XML to PDML. Hence, XML technology (which is well supported in many programming languages) can be used with PDML documents. For example you can read a PDML document into an XML DOM and:
- validate the document with XML Schema
- query the document with XML Query
- change the document (add, remove, and modify nodes) and write a modified version back to XML or PDML
- transform the document with XSLT
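As a rough illustration of the PDML-to-XML direction, the following Python sketch reads a Core PDML node directly into an element of the standard xml.etree.ElementTree module, which can then be queried or written out as XML. The function name pdml_to_element is hypothetical, extensions are not handled, and node names are assumed to be valid XML names. It is a sketch under these assumptions, not the converter of the reference implementation.

```python
import xml.etree.ElementTree as ET


def pdml_to_element(text: str, pos: int = 0) -> tuple[ET.Element, int]:
    """Parse one Core PDML node at text[pos]; return (element, next position)."""
    if pos >= len(text) or text[pos] != "[":
        raise ValueError(f"expected '[' at position {pos}")
    pos += 1

    # The name runs up to the first whitespace character, '[' or ']'.
    start = pos
    while pos < len(text) and not text[pos].isspace() and text[pos] not in "[]":
        pos += 1
    element = ET.Element(text[start:pos])

    # Assumption: one whitespace character separates the name from the content.
    if pos < len(text) and text[pos].isspace():
        pos += 1

    last_child = None  # text following a child is stored in that child's tail
    while pos < len(text):
        ch = text[pos]
        if ch == "[":
            last_child, pos = pdml_to_element(text, pos)
            element.append(last_child)
            continue
        if ch == "]":
            return element, pos + 1
        if ch == "\\" and pos + 1 < len(text):
            pos += 1  # escaped character: take the next one literally
            ch = text[pos]
        pos += 1
        if last_child is None:
            element.text = (element.text or "") + ch
        else:
            last_child.tail = (last_child.tail or "") + ch
    raise ValueError("node is not closed")


root, _ = pdml_to_element("[config [color green] [size [width 200] [height 100]]]")
print(ET.tostring(root, encoding="unicode"))  # the document as XML
print(root.findtext("size/width"))            # query with an XPath-like path
```

ElementTree stores mixed content in the text and tail attributes, which is why text after a child node is attached to that child rather than to the parent.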
PDML vs XML/JSON/YAML
For a thorough explanation of the rationale behind PDML please read Suggestion For a Better XML/HTML Syntax.
That article compares code examples written in XML, JSON, and YAML and demonstrates that PDML is:
- less verbose than XML and JSON, but slightly more verbose than YAML
- suitable for markup code, unlike JSON and YAML
- suitable for big, complex data structures, unlike YAML
Moreover PDML has a number of unique, practical extensions not found in XML, JSON, or YAML (see next chapter).
Core PDML (without extensions) is much easier to parse than XML, JSON, or YAML.
Extensions
As seen already, the syntax of Core PDML is very simple and succinct, and therefore easy to read and write for humans and machines. Despite its simplicity, Core PDML can be used to store data as well as markup code, in small or big documents.
However, this utmost simplicity can cause inconveniences, especially when big documents are read and written by humans. Therefore a PDML implementation can optionally provide pluggable extensions to make it more practical.
The following chapters provide a non-exhaustive, brief overview of some useful extensions. It's a subset of extensions that are implemented already in the reference implementation written in Java.
Comments
A comment starts with [- and ends with -]. Comments can be inserted anywhere. They can be nested to any level. Text within comments is ignored.
Example:
This is [- good -] awesome.
[- TODO: explain why -]
[- another comment
[- nested comment -]
-]
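Nested comments cannot be removed with a simple search for the first closing delimiter; a nesting counter is needed. The following Python sketch shows one way to do it. The function name strip_comments is hypothetical, and the decision to leave escape sequences outside comments untouched is an assumption of this sketch.

```python
def strip_comments(text: str) -> str:
    """Remove [- ... -] comments, including nested ones."""
    result = []
    depth = 0  # current comment nesting level
    pos = 0
    while pos < len(text):
        if depth == 0 and text[pos] == "\\" and pos + 1 < len(text):
            # Assumption: an escaped character never starts a comment.
            result.append(text[pos:pos + 2])
            pos += 2
        elif text.startswith("[-", pos):
            depth += 1
            pos += 2
        elif depth > 0 and text.startswith("-]", pos):
            depth -= 1
            pos += 2
        else:
            if depth == 0:
                result.append(text[pos])
            pos += 1
    if depth != 0:
        raise ValueError("comment is not closed")
    return "".join(result)
```

Only the comment itself is removed; surrounding whitespace is kept as it is.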
Attributes
PDML attributes are conceptually similar to XML attributes. They are typically used to add metadata to nodes.
For example, the following HTML code uses attributes to identify and style node div:
<div id="my_div" class="my_class">content</div>
In PDML this would be written as follows:
[div (id=my_div class=my_class) content]
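For the simple form shown above, reading attributes amounts to splitting name=value pairs. The Python sketch below does just that. The function name parse_attributes is hypothetical, and the sketch assumes unquoted values without whitespace, as in the example; it makes no claim about any other attribute syntax PDML may support.

```python
def parse_attributes(text: str) -> dict[str, str]:
    """Parse an attribute list such as "(id=my_div class=my_class)"."""
    text = text.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("attributes must be enclosed in parentheses")
    attributes = {}
    # Assumption: pairs are separated by whitespace and values are unquoted.
    for item in text[1:-1].split():
        name, separator, value = item.partition("=")
        if not separator or not name:
            raise ValueError(f"invalid attribute: {item!r}")
        attributes[name] = value
    return attributes
```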
Character Escape Sequences
Besides the mandatory character escape sequences (\[, \], and \\), the following whitespace and Unicode escape sequences can be used:
Code | Description |
---|---|
\t | TAB character |
\r | carriage return |
\n | line feed |
\uhhhh | Unicode escape (4 hex digits / 16 bits) |
\Uhhhhhhhh | Unicode escape (8 hex digits / 32 bits) |
For example, this text:
line 1\nline 2 \u0041 \U0001F600
... is parsed as:
line 1
line 2 A 😀
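The escape sequences listed in the table can be decoded with a single regular-expression pass, as in the following Python sketch. The function name unescape is hypothetical, and rejecting unknown sequences with an error is an assumption of this sketch; the PDML specification defines the actual behavior.

```python
import re

# Single-character escapes: the mandatory ones plus the whitespace escapes.
_SIMPLE = {"t": "\t", "r": "\r", "n": "\n", "[": "[", "]": "]", "\\": "\\"}

# A backslash followed by U + 8 hex digits, u + 4 hex digits, or any character.
_ESCAPE = re.compile(r"\\(U[0-9A-Fa-f]{8}|u[0-9A-Fa-f]{4}|.)", re.DOTALL)


def unescape(text: str) -> str:
    """Replace the escape sequences in text with the characters they denote."""

    def replace(match: re.Match) -> str:
        code = match.group(1)
        if code[0] in "uU" and len(code) > 1:
            return chr(int(code[1:], 16))  # Unicode escape
        if code in _SIMPLE:
            return _SIMPLE[code]
        raise ValueError(f"unknown escape sequence: \\{code}")

    return _ESCAPE.sub(replace, text)
```

Trying the longer Unicode alternatives first ensures that an incomplete sequence such as \u12 is reported as an error instead of being decoded partially.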
Parameters
Parameters are used to define recurring text snippets and data structures. This helps to eliminate code duplication and makes PDML documents more maintainable.
A parameter is declared once with a !set node, and its value can then be inserted any number of times with a !get node.
Here is an example of PML markup code that stores the company's website URL in parameter company_URL, and then inserts the URL in subsequent text:
[doc [title Company Overview]
[!set company_URL=https://www.my_company.org]
...
Our website: [!get company_URL]
...
Click [link (url=[!get company_URL]/contacts/index.html) here] to see a list of contacts.
]
Note
Note the ! character that precedes the name in nodes set and get. The ! is used to denote a so-called extension node, and provides a distinction from normal data nodes. A PDML implementation can provide any number of extension nodes, and support pluggable, customized extensions to cover specific needs.
Document Splitting
When a PDML document exceeds a certain size, it often makes sense to split it up into different files. For example:
- each table in a database document is stored in a separate file
- each chapter in a long article or book is stored in a separate file
Document splitting is done with the !ins-file extension node. Here is an example of markup code that uses a different file for each chapter in an article:
[doc [title Long Article]
[!ins-file path=chapters/introduction.pml]
[!ins-file path=chapters/body.pml]
[!ins-file path=chapters/conclusion.pml]
]
File chapters/introduction.pml:
[ch [title Introduction]
text text text
]
File chapters/body.pml:
[ch [title Body]
text text text
]
File chapters/conclusion.pml:
[ch [title Conclusion]
text text text
]
Sub-documents can themselves also be split to any level.
!ins-file nodes are also useful if different documents share common parts, such as a common header/footer used in all articles of a blog.
Types (work in progress)
Types are used to validate the content of nodes, and to define how a node is parsed.
For example, node birthdate could be configured to be of type date, which means that the content of node birthdate must be text that represents a valid date in the past, such as:
[birthdate 1879-03-14]
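To make the idea of a type more concrete, here is a small Python sketch of the check described above: the text must be a valid date, and the date must lie in the past. The function name validate_birthdate and the optional today parameter are hypothetical, and the ISO format (YYYY-MM-DD) is an assumption taken from the example. Since types are work in progress, this does not reflect an actual PDML API.

```python
from datetime import date
from typing import Optional


def validate_birthdate(text: str, today: Optional[date] = None) -> date:
    """Return the date denoted by text, or raise ValueError if it is invalid."""
    # fromisoformat raises ValueError if text is not a valid ISO date.
    value = date.fromisoformat(text.strip())
    # A birthdate must lie in the past; today can be injected for testing.
    if value >= (today or date.today()):
        raise ValueError(f"birthdate {value} is not in the past")
    return value
```

A parser with type support could call such a function right after reading the text of node birthdate, and report the error together with the node's position in the document.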
Let's look at a real-world use-case of a PDML type in PML. Some PML nodes are designed to contain small or large pieces of raw text. For instance, PML has a node named code to display highlighted source code. Suppose we want to show the following source code in a PML document:
repeat 3 times
write_line ( "[Hello]" )
.
If we used only the Core PDML syntax, and a code node that is itself indented (because it's contained in a parent node), we would need to write:
[code repeat 3 times
write_line ( "\[Hello\]" )
.]
This is not very readable. Moreover, the characters [ and ] in the source code must be escaped ("\[Hello\]").
A dedicated PDML type associated with node code removes these inconveniences and allows us to write:
[code
    ~~~
    repeat 3 times
    write_line ( "[Hello]" )
    .
    ~~~
]
Note that:
- The text content of node code is defined between the two ~~~ lines.
- The indent of the first ~~~ line defines the indent to be removed in the subsequent source code lines.
- Characters [ and ] in the source code don't need to be escaped anymore.
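The rules in the list above can be sketched in a few lines of Python. The function name extract_raw_text is hypothetical, the delimiter is passed as a parameter (the default is the one used in the example), and the treatment of lines that are indented less than the first delimiter line is an assumption of this sketch.

```python
def extract_raw_text(content: str, delimiter: str = "~~~") -> str:
    """Return the raw text between the first and last delimiter lines."""
    lines = content.splitlines()

    # Locate the opening and closing delimiter lines.
    marks = [i for i, line in enumerate(lines) if line.strip() == delimiter]
    if len(marks) < 2:
        raise ValueError("expected an opening and a closing delimiter line")
    first, last = marks[0], marks[-1]

    # The indent of the opening delimiter is removed from every raw text line.
    indent = len(lines[first]) - len(lines[first].lstrip())
    result = []
    for line in lines[first + 1:last]:
        # Assumption: a line with less indent only loses the whitespace it has.
        result.append(line[:indent].lstrip() + line[indent:])
    return "\n".join(result)
```

Because the text between the delimiter lines is taken as is, square brackets in it are never interpreted as the start or end of a node.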
If a PML highlighter is used, we can use attribute lang to specify the programming language:
[code (lang=Java)
    ~~~
    for (int i=1; i <= 3; i++) {
        System.out.println ( "[Hello]" );
    }
    ~~~
]
A PDML implementation can provide a standard set of frequently used types (string, number, boolean, date, time, etc.). To maximize flexibility and customization for different domains, additional types can be added programmatically or by configuration data that can be included in the PDML document, or provided in an external (possibly shared) PDML document.
History of PDML
In 2018 I created the Practical Markup Language (PML) to solve problems I encountered with existing markup languages (Markdown, Asciidoctor, HTML, DocBook, etc.). In March 2019 I published We Need a New Document Markup Language - Here is Why to illustrate the existing problems, and to show how they are solved in PML.
Besides being suitable for markup code, the PML syntax could also be used to store data. In March 2021 I therefore published Suggestion For a Better XML/HTML Syntax (also published on CodeProject). The new syntax was called practicalXML (pXML), because it was more succinct than XML, but conceptually similar to it. Moreover, pXML could be converted to XML, and vice versa. All was published and documented at the (now obsolete) pXML website.
In October 2021 pXML was renamed to PDML. The reason was that pXML needed a lot of improvements (extensions) to make it suitable for PML (e.g. parameterized text, document splitting, raw text sections, etc.). In the end, pXML was more than just an alternative syntax for XML. It had pluggable and configurable types and extensions, as well as other features not available in XML. Thus the name was changed from practicalXML (pXML) to Practical Data and Markup Language (PDML), and everything was published on a new website.
In a nutshell: PDML originated in PML, and was temporarily called pXML.