CommonMark: Standard Markdown


The source of this manual is available on GitHub.

The commonmark library implements a CommonMark-compliant Markdown parser. Currently, it passes all test cases in v0.31.2 of the specification. By default, only the Markdown features specified by CommonMark are supported, but non-standard support for footnotes can be optionally enabled; see the Extensions section of this manual for more details.

The commonmark module reprovides all of the bindings provided by commonmark/parse and commonmark/render/html (but not the bindings provided by commonmark/struct).

1 Quick start

For information about the Markdown syntax supported by commonmark, see the CommonMark website.

In commonmark, processing Markdown is split into two steps: parsing and rendering. To get started, use string->document or read-document to parse Markdown input into a document structure:

> (require commonmark)
> (define doc (string->document "*Hello*, **markdown**!"))
> doc
(document
 (list (paragraph (list (italic "Hello") ", " (bold "markdown") "!")))
 '())

A document is an abstract syntax tree representing Markdown content. Most uses of Markdown render it to HTML, so commonmark also provides the document->html and write-document-html functions, which render a document to HTML in the way recommended by the CommonMark specification:

> (write-document-html doc)
<p><em>Hello</em>, <strong>markdown</strong>!</p>

The document->xexprs function can also be used to render a document to a list of X-expressions, which can make it more convenient to incorporate rendered Markdown into a larger HTML document (though do be aware of the caveats involving HTML blocks and HTML spans described in the documentation for document->xexprs):

> (document->xexprs doc)
'((p (em "Hello") ", " (strong "markdown") "!"))
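As a minimal sketch of that kind of embedding, the rendered X-expressions can be spliced into a larger page with quasiquotation and then serialized with xexpr->string from Racket's xml library. The page skeleton, the title, and the names content, page, and page-html are illustrative assumptions, not part of commonmark:

```racket
#lang racket/base
(require commonmark xml)

;; Render the Markdown body to a list of X-expressions.
(define content
  (document->xexprs (string->document "Hello, *world*!")))

;; Splice the rendered blocks into a hand-written page skeleton.
;; The surrounding structure here is an assumption made for illustration.
(define page
  `(html (head (title "Example"))
         (body (main ,@content))))

;; Serialize the whole page to a string.
(define page-html (xexpr->string page))
```

Because document->xexprs returns a list, it is spliced with ,@ rather than inserted with a plain unquote.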

2 Parsing

(require commonmark/parse)

package: commonmark-lib

The commonmark/parse module provides functions for parsing Markdown content into a document structure. To render Markdown to HTML, use this module in combination with the functions provided by commonmark/render/html.

All of the bindings provided by commonmark/parse are also provided by commonmark.

procedure
(string->document str) → document?
str : string?

Parses str as a Markdown document.

Example:

> (define doc (string->document "*Hello*, **markdown**!"))
> doc
(document
 (list (paragraph (list (italic "Hello") ", " (bold "markdown") "!")))
 '())
> (write-document-html doc)
<p><em>Hello</em>, <strong>markdown</strong>!</p>

This function cannot fail: every string of Unicode characters is a valid Markdown document.

procedure
(read-document in) → document?
in : input-port?

Like string->document, but the input is read from the given input port rather than from a string.

Example:

> (define doc (read-document (open-input-string "*Hello*, **markdown**!")))
> doc
(document
 (list (paragraph (list (italic "Hello") ", " (bold "markdown") "!")))
 '())
> (write-document-html doc)
<p><em>Hello</em>, <strong>markdown</strong>!</p>

This function may be more efficient than (string->document (port->string in)), but probably not substantially, as the entire document structure must be realized in memory regardless.
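A short sketch of port-based parsing follows. The helper name port->html is hypothetical, and an in-memory string port stands in for a file so that the example is self-contained:

```racket
#lang racket/base
(require commonmark)

;; Hypothetical helper: parse Markdown from any input port and
;; render the result to an HTML string.
(define (port->html in)
  (document->html (read-document in)))

;; An in-memory port stands in for a file here. With a real file, one
;; might instead write (call-with-input-file "notes.md" port->html),
;; where "notes.md" is a placeholder path.
(define html
  (port->html (open-input-string "# Title\n\nSome *text*.\n")))
```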

parameter
(current-parse-footnotes?) → boolean?
(current-parse-footnotes? parse-footnotes?) → void?
parse-footnotes? : any/c
= #f

Enables or disables footnote parsing, which is an extension to the CommonMark specification; see Footnotes for more details.

Note that the value of current-parse-footnotes? only affects parsing, not rendering. If a document containing footnotes is rendered to HTML, the footnotes will still be rendered even if (current-parse-footnotes?) is #f.
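The following sketch, under the assumption that the input contains one footnote reference and one matching definition, shows the parameter being set with parameterize around the parsing call only. The names src, plain, and with-notes are illustrative:

```racket
#lang racket/base
(require commonmark commonmark/struct)

(define src "Some text.[^note]\n\n[^note]: The footnote body.\n")

;; Default behavior: footnote syntax is not recognized, so the
;; document-footnotes field stays empty.
(define plain (string->document src))

;; With the parameter set during parsing, the definition is collected
;; into the document-footnotes field.
(define with-notes
  (parameterize ([current-parse-footnotes? #t])
    (string->document src)))

;; Rendering happens outside the parameterize form; per the note above,
;; the footnote is still rendered.
(define html (document->html with-notes))
```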

Added in version 1.1 of package commonmark-lib.

3 Rendering HTML

(require commonmark/render/html)

package: commonmark-lib

The commonmark/render/html module provides functions for rendering a parsed Markdown document to HTML as recommended by the CommonMark specification. This module should generally be used in combination with commonmark/parse, which provides functions for producing a document structure from Markdown input.

All of the bindings provided by commonmark/render/html are also provided by commonmark.

procedure
(document->html doc) → string?
doc : document?

Renders doc to HTML in the format recommended by the CommonMark specification.

Example:

> (document->html (string->document "*Hello*, **markdown**!"))
"<p><em>Hello</em>, <strong>markdown</strong>!</p>"

procedure
(write-document-html doc [out]) → void?
doc : document?
out : output-port? = (current-output-port)

Like document->html, but writes the rendered HTML directly to out rather than returning it as a string.

Example:

> (write-document-html (string->document "*Hello*, **markdown**!"))
<p><em>Hello</em>, <strong>markdown</strong>!</p>
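To illustrate the optional out argument, here is a small sketch that writes to a string port rather than to the current output port. Any other output port, such as one opened on a file, could presumably be passed the same way:

```racket
#lang racket/base
(require commonmark)

(define doc (string->document "*Hello*, **markdown**!"))

;; Collect the rendered HTML in a string port instead of printing it.
(define out (open-output-string))
(write-document-html doc out)

;; Retrieve everything written to the port so far.
(define html (get-output-string out))
```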

procedure
(document->xexprs doc) → (listof xexpr/c)
doc : document?

Like document->html, but returns the rendered HTML as a list of X-expressions rather than as a string.

Example:

> (document->xexprs (string->document "*Hello*, **markdown**!"))
'((p (em "Hello") ", " (strong "markdown") "!"))

Note that HTML blocks and HTML spans are not parsed and may even contain invalid HTML, which makes them difficult to represent as an X-expression. As a workaround, raw HTML will be represented as cdata elements:

> (document->xexprs
   (string->document "A paragraph with <marquee>raw HTML</marquee>."))
(list
 (list
  'p
  "A paragraph with "
  (cdata #f #f "<marquee>")
  "raw HTML"
  (cdata #f #f "</marquee>")
  "."))

This generally yields the desired result, as xexpr->string renders cdata elements directly as their unescaped content. However, strictly speaking, it is an abuse of cdata.
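As a sketch of that behavior, the example below serializes the X-expressions with xexpr->string and contrasts the result with an ordinary string containing the same characters, which is escaped. The names xexprs, raw-html, and escaped-html are illustrative:

```racket
#lang racket/base
(require commonmark xml)

(define xexprs
  (document->xexprs
   (string->document "A paragraph with <marquee>raw HTML</marquee>.")))

;; The cdata elements are written out verbatim, so the raw tags survive.
(define raw-html
  (apply string-append (map xexpr->string xexprs)))

;; For contrast: the same characters in an ordinary string are escaped.
(define escaped-html
  (xexpr->string '(p "<marquee>")))
```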

parameter
(current-italic-tag) → symbol?
(current-italic-tag tag) → void?
tag : symbol?
= 'em
parameter
(current-bold-tag) → symbol?
(current-bold-tag tag) → void?
tag : symbol?
= 'strong

These parameters determine which HTML tag is used to render italic spans and bold spans, respectively. The default values of 'em and 'strong correspond to those required by the CommonMark specification, but this can be semantically incorrect if “emphasis” syntax is used for purposes other than emphasis, such as italicizing the title of a book.

Reasonable alternate values for current-italic-tag and current-bold-tag include 'i, 'b, 'mark, 'cite, or 'dfn, all of which are elements with semantic (rather than presentational) meaning in HTML5. Of course, the “most correct” choice depends on how italic spans and bold spans will actually be used.

Example:

> (parameterize ([current-italic-tag 'cite]
                 [current-bold-tag 'mark])
    (document->xexprs
     (string->document
      (string-append
       "> First, programming is about stating and solving problems,\n"
       "> and this activity normally takes place in a context with its\n"
       "> own language of discourse; **good programmers ought to\n"
       "> formulate this language as a programming language**.\n"
       "\n"
       "— *The Racket Manifesto* (emphasis mine)"))))
'((blockquote
   (p
    "First, programming is about stating and solving problems,\n"
    "and this activity normally takes place in a context with its\n"
    "own language of discourse; "
    (mark
     "good programmers ought to\n"
     "formulate this language as a programming language")
    "."))
  (p "— " (cite "The Racket Manifesto") " (emphasis mine)"))

4 Document structure

(require commonmark/struct)

package: commonmark-lib

The commonmark/struct module provides structure types used to represent Markdown content as abstract syntax. The root of every syntax tree is a document, which contains blocks, which in turn contain inline content. Most users will not need to interact with these structures directly, but doing so can be useful to perform additional processing on the document before rendering it, or to render Markdown to a format other than HTML.

Note that the bindings in this section are only provided by commonmark/struct, not by commonmark.

struct
(struct document (blocks footnotes)
    #:transparent)
  blocks : (listof block?)
  footnotes : (listof footnote-definition?)

A parsed Markdown document, which has a body flow and a list of footnote definitions. It can be parsed from Markdown input using read-document or string->document and can be rendered to HTML using document->html.

Changed in version 1.1 of package commonmark-lib: Added the footnotes field.

struct
(struct footnote-definition (blocks label)
    #:transparent)
  blocks : (listof block?)
  label : string?

Footnotes are an extension to the CommonMark specification and are not enabled by default; see Footnotes in the Extensions section of this manual for more details.

A footnote definition contains a flow that can be referenced by a footnote reference via its footnote label.

Note: although footnote definitions are syntactically blocks in Markdown input, they are not a type of block (as recognized by the block? predicate) and cannot be included directly in the main document flow. Footnote definitions are collected into the separate document-footnotes field of the document structure during parsing, since they represent auxiliary definitions, and their precise location in the Markdown input does not matter.

(This is quite similar to the way the parser processes link reference definitions, except that footnote definitions must be retained separately for later rendering, whereas link reference definitions can be discarded after all link targets have been resolved.)

Added in version 1.1 of package commonmark-lib.

4.1 Blocks

procedure
(block? v) → boolean?
v : any/c

See § Blocks and inlines in the CommonMark specification for more information about blocks.

Returns #t if v is a block: a paragraph, itemization, block quote, code block, HTML block, heading, or thematic break. Otherwise, returns #f.

A flow is a list of blocks. The body of a document, the contents of a block quote, and each item in an itemization are flows.

struct
(struct paragraph (content)
    #:transparent)
  content : inline?

See § Paragraphs in the CommonMark specification for more information about paragraphs.

A paragraph is a block that contains inline content. In HTML output, it corresponds to a <p> element. Most blocks in a document are usually paragraphs.

struct
(struct itemization (blockss style start-num)
    #:transparent)
  blockss : (listof (listof block?))
  style : (or/c 'loose 'tight)
  start-num : (or/c exact-nonnegative-integer? #f)

See § Lists and § List items in the CommonMark specification for more information about itemizations.

An itemization is a block that contains a list of flows. In HTML output, it corresponds to a <ul> or <ol> element.

The style field records whether the itemization is loose or tight: if style is 'tight, child paragraphs in HTML output are not wrapped in <p> tags.

If start-num is #f, then the itemization represents a bullet list. Otherwise, the itemization represents an ordered list, and the value of start-num is its start number.
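The effect of the style and start-num fields can be sketched by constructing an itemization by hand and rendering it. The helper make-list-doc is hypothetical and exists only for this illustration:

```racket
#lang racket/base
(require commonmark commonmark/struct)

;; Hypothetical helper: build a document holding one two-item list.
;; Each item is a flow, that is, a list of blocks.
(define (make-list-doc style start-num)
  (document
   (list (itemization (list (list (paragraph "one"))
                            (list (paragraph "two")))
                      style
                      start-num))
   '()))

;; A tight bullet list: item paragraphs are not wrapped in <p> tags.
(define tight-html (document->html (make-list-doc 'tight #f)))

;; A loose bullet list: item paragraphs keep their <p> wrappers.
(define loose-html (document->html (make-list-doc 'loose #f)))

;; A non-#f start-num yields an ordered list.
(define ordered-html (document->html (make-list-doc 'tight 3)))
```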

struct
(struct blockquote (blocks)
    #:transparent)
  blocks : (listof block?)

See § Block quotes in the CommonMark specification for more information about block quotes.

A block quote is a block that contains a nested flow. In HTML output, it corresponds to a <blockquote> element.

struct
(struct code-block (content info-string)
    #:transparent)
  content : string?
  info-string : (or/c string? #f)

See § Indented code blocks and § Fenced code blocks in the CommonMark specification for more information about code blocks.

A code block is a block that has unformatted content and an optional info string. In HTML output, it corresponds to a <pre> element that contains a <code> element.

The CommonMark specification does not mandate any particular treatment of the info string, but it notes that “the first word is typically used to specify the language of the code block.” In HTML output, the language is indicated by adding a CSS class to the rendered <code> element consisting of language- followed by the language name, per the spec’s recommendation.
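A brief sketch of how the info string surfaces, assuming a fenced code block whose info string is the single word racket:

```racket
#lang racket/base
(require commonmark commonmark/struct)

;; A fenced code block whose info string names a language.
(define doc (string->document "```racket\n(+ 1 2)\n```\n"))

;; The only block in the document is the code block itself.
(define block (car (document-blocks doc)))

;; The parsed info string is available on the structure.
(define info (code-block-info-string block))

;; In HTML output, the language becomes a CSS class on <code>.
(define html (document->html doc))
```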

struct
(struct html-block (content)
    #:transparent)
  content : string?

See § HTML Blocks in the CommonMark specification for more information about HTML blocks.

An HTML block is a block that contains raw HTML content (and will be left unescaped in HTML output). Note that, in general, the content may not actually be well-formed HTML, as CommonMark simply treats everything that “looks sufficiently like” HTML—according to some heuristics—as raw HTML.

struct
(struct heading (content depth)
    #:transparent)
  content : inline?
  depth : (integer-in 1 6)

See § ATX headings and § Setext headings in the CommonMark specification for more information about headings.

A heading is a block with inline content and a heading depth. In HTML output, it corresponds to one of the <h1> through <h6> elements.

A heading depth is an integer between 1 and 6, inclusive, where higher numbers correspond to more-nested headings.
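As one example of processing the syntax tree directly, the sketch below collects every heading in a document as a pair of depth and plain text, which could serve as the basis for a table of contents. The functions inline->text, block-headings, and document-headings are hypothetical helpers written for this illustration, and dropping raw HTML spans and footnote references from the text is a simplifying assumption:

```racket
#lang racket/base
(require commonmark commonmark/struct racket/list)

;; Flatten inline content to plain text. Raw HTML spans and footnote
;; references fall through to the else clause and contribute nothing.
(define (inline->text v)
  (cond
    [(string? v) v]
    [(list? v) (apply string-append (map inline->text v))]
    [(italic? v) (inline->text (italic-content v))]
    [(bold? v) (inline->text (bold-content v))]
    [(code? v) (code-content v)]
    [(link? v) (inline->text (link-content v))]
    [(image? v) (inline->text (image-description v))]
    [(line-break? v) " "]
    [else ""]))

;; Collect (depth . text) pairs from one block, descending into
;; block quotes and list items.
(define (block-headings b)
  (cond
    [(heading? b)
     (list (cons (heading-depth b) (inline->text (heading-content b))))]
    [(blockquote? b) (append-map block-headings (blockquote-blocks b))]
    [(itemization? b)
     (append-map (λ (flow) (append-map block-headings flow))
                 (itemization-blockss b))]
    [else '()]))

(define (document-headings doc)
  (append-map block-headings (document-blocks doc)))

(define headings
  (document-headings
   (string->document "# One\n\ntext\n\n## Two *b*\n\n> ### Quoted\n")))
```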

value
thematic-break : thematic-break?
procedure
(thematic-break? v) → boolean?
v : any/c

See § Thematic breaks in the CommonMark specification for more information about thematic breaks.

A thematic break is a block. It is usually rendered as a horizontal rule, and in HTML output, it corresponds to an <hr> element.

4.2 Inline content

procedure
(inline? v) → boolean?
v : any/c

See § Blocks and inlines in the CommonMark specification for more information about inline content.

Returns #t if v is inline content: a string, italic span, bold span, code span, link, image, footnote reference, HTML span, hard line break, or list of inline content. Otherwise, returns #f.

struct
(struct italic (content)
    #:transparent)
  content : inline?

See § Emphasis and strong emphasis in the CommonMark specification for more information about italic spans.

An italic span is inline content that contains nested inline content. By default, in HTML output, it corresponds to an <em> element (but an alternate tag can be used by modifying current-italic-tag).

struct
(struct bold (content)
    #:transparent)
  content : inline?

See § Emphasis and strong emphasis in the CommonMark specification for more information about bold spans.

A bold span is inline content that contains nested inline content. By default, in HTML output, it corresponds to a <strong> element (but an alternate tag can be used by modifying current-bold-tag).

struct
(struct code (content)
    #:transparent)
  content : string?

See § Code spans in the CommonMark specification for more information about code spans.

A code span is inline content that contains unformatted content. In HTML output, it corresponds to a <code> element.

struct
(struct link (content dest title)
    #:transparent)
  content : inline?
  dest : string?
  title : (or/c string? #f)

See § Links in the CommonMark specification for more information about links.

A link is inline content that contains nested inline content, a link destination, and an optional link title. In HTML output, it corresponds to an <a> element.
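Link structures can be inspected before rendering, for example to audit the destinations a document points to. The following is a minimal sketch; inline-links, block-links, and document-links are hypothetical helpers, and the example.com URLs are placeholders:

```racket
#lang racket/base
(require commonmark commonmark/struct racket/list)

;; Collect link destinations from inline content, in document order.
(define (inline-links v)
  (cond
    [(list? v) (append-map inline-links v)]
    [(link? v) (cons (link-dest v) (inline-links (link-content v)))]
    [(italic? v) (inline-links (italic-content v))]
    [(bold? v) (inline-links (bold-content v))]
    [(image? v) (inline-links (image-description v))]
    [else '()]))

;; Descend through the blocks that can contain inline content
;; or nested flows.
(define (block-links b)
  (cond
    [(paragraph? b) (inline-links (paragraph-content b))]
    [(heading? b) (inline-links (heading-content b))]
    [(blockquote? b) (append-map block-links (blockquote-blocks b))]
    [(itemization? b)
     (append-map (λ (flow) (append-map block-links flow))
                 (itemization-blockss b))]
    [else '()]))

(define (document-links doc)
  (append-map block-links (document-blocks doc)))

(define links
  (document-links
   (string->document
    "See [a](https://example.com/a) and *[b](https://example.com/b)*.\n")))
```

This sketch ignores footnote definitions; a fuller version would presumably also walk the blocks of each entry in document-footnotes.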

struct
(struct image (description source title)
    #:transparent)
  description : inline?
  source : string?
  title : (or/c string? #f)

See § Images in the CommonMark specification for more information about images.

An image is inline content with a source path or URL that should point to an image. It has an inline content description (which is used as the alt attribute in HTML output) and an optional title. In HTML output, it corresponds to an <img> element.

struct
(struct footnote-reference (label)
    #:transparent)
  label : string?

Footnotes are an extension to the CommonMark specification and are not enabled by default; see Footnotes in the Extensions section of this manual for more details.

A footnote reference is inline content that references a footnote definition with a matching footnote label. In HTML output, it corresponds to a superscript <a> element.

Added in version 1.1 of package commonmark-lib.

struct
(struct html (content)
    #:transparent)
  content : string?

See § Raw HTML in the CommonMark specification for more information about HTML spans.

An HTML span is inline content that contains raw HTML content (and will be left unescaped in HTML output). Note that, in general, the content may not actually be well-formed HTML, as CommonMark simply treats everything that “looks sufficiently like” HTML—according to some heuristics—as raw HTML.

value
line-break : line-break?
procedure
(line-break? v) → boolean?
v : any/c

See § Hard line breaks in the CommonMark specification for more information about hard line breaks.

A hard line break is inline content used for separating inline content within a block. In HTML output, it corresponds to a <br> element.

5 Extensions

By default, commonmark adheres strictly to the CommonMark specification, which provides consistency with other Markdown implementations. However, many implementations include extensions beyond what CommonMark specifies, several of which are useful enough to have become de facto standards. Unfortunately, since extensions are not precisely specified, their interpretation can vary between implementations.

This section documents all of the extensions commonmark currently supports. Note that it may be difficult to determine whether a difference in behavior between two implementations of a Markdown extension constitutes a bug or two incompatible but equally valid interpretations. For that reason, full backwards compatibility of extensions’ behavior, especially in edge cases, is not guaranteed.

5.1 Footnotes

Footnotes enjoy broad support across Markdown implementations, including PHP Markdown Extra, Python-Markdown, Pandoc, GitHub Flavored Markdown, and markdown. The [^label] syntax for references and definitions is essentially universal, but minor differences exist in interpretation, and rendering varies significantly. commonmark’s implementation is not precisely identical to any of them, but it was originally based on the cmark-gfm implementation of GitHub Flavored Markdown.

Footnotes allow auxiliary information to be lifted out of the main document flow to avoid cluttering the body text. When footnote parsing is enabled via the current-parse-footnotes? parameter, shortcut reference links with a link label that begins with a ^ character are instead parsed as footnote references. For example, the following paragraph includes three footnote references:

Racket is a programming language[^1] descended from Scheme.[^scheme]
Although not all Racket programs retain Lisp syntax, most Racket
programs still include a great many parentheses.[^(()())]

Text between the [^ and ] characters constitutes the footnote label, and the content of the footnote is provided via a footnote definition with a matching footnote label. Footnote definitions have similar syntax to link reference definitions, but unlike link reference definitions the body of a footnote definition is an arbitrary flow. For example, the following syntax defines two footnotes matched by the footnote references above:

[^1]: Technically, the name *Racket* refers to both the runtime
    environment and the primary language used to program it.

[^scheme]: The original name for the Racket project was PLT Scheme,
    but it was renamed in 2010 [to avoid confusion and to reflect its
    departure from its roots](https://racket-lang.org/new-name.html).

Like link reference definitions, footnote definitions are syntactically blocks and may appear within any flow, though they are not semantically children of any flow in which they appear. Their placement does not affect their interpretation—a footnote reference may reference any footnote defined in the same document—unless two definitions have matching footnote labels, in which case the later definition is ignored.

As mentioned above, a footnote definition is a container block with an arbitrary flow. All lines after the first must be indented by 4 spaces to be included in the definition (unless they are lazy continuation lines). For example, the following footnote definition includes a block quote, an indented code block, and a paragraph:

[^long note]:
    > This is a block quote that is nested inside
    > the body of a footnote.

        This is an indented code block
        inside of a footnote.

    This paragraph is also inside the footnote.

A footnote reference must match a footnote definition somewhere in the document to be parsed as a footnote reference. If no such definition exists, the label will be parsed as literal text. Each footnote definition can be referenced an arbitrary number of times.

When footnotes are parsed, each footnote reference is represented in-place by a footnote-reference structure, but footnote definitions are removed from the main document flow and collected into a list of footnote-definition structures in a separate document-footnotes field. This allows renderers to more easily match references to their corresponding definitions and ensures that the placement of definitions within a document cannot affect the rendered output.

When given a document containing footnotes, the default HTML renderer mimics the output produced by cmark-gfm. Specifically, the renderer appends a <section class="footnotes"> element to the end of the output, which wraps an <ol> element containing the footnotes’ content:

markdown

Here is a paragraph[^1] with
two footnote references.[^2]

[^1]: Here is the first footnote.

[^2]: And here is the second.

rendered

Here is a paragraph¹ with two footnote references.²

1. Here is the first footnote. ↩
2. And here is the second. ↩
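The same input can be rendered programmatically. This is a sketch under the assumption that footnote parsing is enabled only around the parsing call; the names src and html are illustrative:

```racket
#lang racket/base
(require commonmark)

(define src
  (string-append
   "Here is a paragraph[^1] with\n"
   "two footnote references.[^2]\n"
   "\n"
   "[^1]: Here is the first footnote.\n"
   "\n"
   "[^2]: And here is the second.\n"))

;; Footnote parsing is opt-in, so enable it around the parsing step;
;; rendering itself does not consult the parameter.
(define html
  (document->html
   (parameterize ([current-parse-footnotes? #t])
     (string->document src))))
```

The resulting string should contain the footnotes section after the body paragraph.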

Each rendered footnote definition includes a backreference link, denoted by a ↩ character, that links to the corresponding footnote reference in the body text. If a definition is referenced multiple times, the rendered footnote will include multiple backreference links:

markdown

Here is a paragraph[^1] that
references a footnote twice.[^1]

[^1]: Here is the footnote.

rendered

Here is a paragraph¹ that references a footnote twice.¹

1. Here is the footnote. ↩¹ ↩²

In both of the previous examples, the chosen footnote labels happen to line up with the rendered footnote numbers, but in general, that does not need to be the case. Footnote references are always rendered numerically, in the order they appear in the document, regardless of the footnote labels used in the document’s source:

markdown

Here are some footnotes[^a]
with non-numeric[^b] names.

And here are some footnotes[^2]
numbered out of order.[^3]

[^a]: Here is footnote a.

[^b]: Here is footnote b.

[^2]: Here is footnote 2.

[^3]: Here is footnote 3.

rendered

Here are some footnotes¹ with non-numeric² names.

And here are some footnotes³ numbered out of order.⁴

1. Here is footnote a. ↩
2. Here is footnote b. ↩
3. Here is footnote 2. ↩
4. Here is footnote 3. ↩

Although footnotes are visually renumbered by the renderer, the generated links and link anchors are based on the original footnote labels. This means that a link to a particular footnote definition will remain stable even if a document is modified, as long as its label remains unchanged.

In a similar vein, the order in which footnote definitions appear does not matter, as they will be rendered in the order they are first referenced in the document. If a definition is never referenced, it will not be rendered at all:

markdown

Here is a paragraph[^1] with
two footnote references.[^3]

[^3]: Here is footnote 3.

[^2]: Here is footnote 2.

[^1]: Here is footnote 1.

rendered

Here is a paragraph¹ with two footnote references.²

1. Here is footnote 1. ↩
2. Here is footnote 3. ↩

Footnote references may appear inside footnote definitions, and commonmark will not object (though your readers might). Footnotes that are first referenced in a footnote definition will be numbered so that they immediately follow the referencing footnote:

markdown

Here is a paragraph[^1] with
two footnote references.[^2]

[^1]: Here is footnote 1.[^3]

[^2]: Here is footnote 2.

[^3]: Here is footnote 3.

rendered

Here is a paragraph¹ with two footnote references.³

1. Here is footnote 1.² ↩
2. Here is footnote 3. ↩
3. Here is footnote 2. ↩

Note that although matching footnote references to their corresponding definitions is handled by the parser, pruning and renumbering of footnote definitions is handled entirely by the renderer, which allows alternate renderers to use alternate schemes if they so desire.

6 Comparison with markdown

The commonmark library is not the first Markdown parser implemented in Racket: it is long predated by the venerable markdown library, which in fact also predates the CommonMark specification itself. The libraries naturally provide similar functionality, but there are some key differences:

- Most obviously and most significantly, commonmark conforms to the CommonMark specification, while markdown does not. This has both pros and cons:
  - commonmark enjoys consistency with other CommonMark implementations and is therefore likely to behave better on existing Markdown content than markdown is. Additionally, commonmark handles some tricky edge cases more gracefully than markdown does, such as parsing of emphasis adjacent to Unicode punctuation.
  - On the other hand, markdown is more featureful than commonmark, as it provides some extensions that commonmark does not. Additionally, some users may find some of the ways that markdown’s parser diverges from the CommonMark specification more intuitive (which is largely just a matter of personal taste).
- commonmark provides a full Markdown AST, while markdown always parses directly to HTML (in the form of X-expressions). For many users, this difference is unlikely to be important, as almost all uses of Markdown render it to HTML, anyway. However, the option to process the intermediate representation affords additional flexibility if it is needed.
- commonmark is appreciably faster than markdown. On most documents, markdown is about 5× slower than commonmark, but the performance gap increases dramatically given unusually large inputs: markdown is about 8× slower to parse a 4 MiB document and 28× slower to parse an 11 MiB document.

Takeaway: if you need the extra features provided by markdown, use markdown; otherwise, use commonmark.
