pulldown-cmark

Documentation

This library is a pull parser for CommonMark, written in Rust. It comes with a simple command-line tool, useful for rendering to HTML, and is also designed to be easy to use as a library.

It is designed to be:

  • Fast; a bare minimum of allocation and copying
  • Safe; written in pure Rust with no unsafe blocks (except in the opt-in SIMD feature)
  • Versatile; in particular source-maps are supported
  • Correct; the goal is 100% compliance with the CommonMark spec

Further, it optionally supports parsing footnotes, GitHub flavored tables, GitHub flavored task lists and strikethrough.

Rustc 1.46 or newer is required to build the crate.

Why a pull parser?

There are many parsers for Markdown and its variants, but to my knowledge none use pull parsing. Pull parsing has become popular for XML, especially for memory-conscious applications, because it uses dramatically less memory than constructing a document tree, but is much easier to use than push parsers. Push parsers are notoriously difficult to use, and also often error-prone because of the need for the user to delicately juggle state in a series of callbacks.

In a clean design, the parsing and rendering stages are neatly separated, but this is often sacrificed in the name of performance and expedience. Many Markdown implementations mix parsing and rendering together, and even designs that try to separate them (such as the popular hoedown), make the assumption that the rendering process can be fully represented as a serialized string.

Pull parsing is in some sense the most versatile architecture. It's possible to drive a push interface, also with minimal memory, and quite straightforward to construct an AST. Another advantage is that source-map information (the mapping between parsed blocks and offsets within the source text) is readily available; you can call into_offset_iter() to create an iterator that yields (Event, Range) pairs, where the second element is the event's corresponding range in the source document.

While manipulating ASTs is the most flexible way to transform documents, operating on iterators is surprisingly easy, and quite efficient. Here, for example, is the code to transform soft line breaks into hard breaks:

Or expanding an abbreviation in text:

Another simple example is code to determine the max nesting level:

There are some basic but fully functional examples of the usage of the crate in the examples directory of this repository.

Using Rust idiomatically

A lot of the internal scanning code is written at a pretty low level (it pretty much scans byte patterns for the bits of syntax), but the external interface is designed to be idiomatic Rust.

Pull parsers are at heart an iterator of events (start and end tags, text, and other bits and pieces). The parser data structure implements the Rust Iterator trait directly, and Event is an enum. Thus, you can use the full power and expressivity of Rust's iterator infrastructure, including for loops and map (as in the examples above), collecting the events into a vector (for recording, playback, and manipulation), and more.

Further, the Text event (representing text) is a small copy-on-write string. The vast majority of text fragments are just slices of the source document. For these, copy-on-write gives a convenient representation that requires no allocation or copying, but allocated strings are available when they're needed. Thus, when rendering text to HTML, most text is copied just once, from the source document to the HTML buffer.

When using pulldown-cmark's own HTML renderer, make sure to write to a buffered target like a Vec<u8> or String. Since it performs many (very) small writes, writing directly to stdout, files, or sockets is detrimental to performance. Such writers can be wrapped in a BufWriter.

Build options

By default, the binary is built as well. If you don't want/need it, then build like this:
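The command is roughly (flags shown are standard Cargo ones):

```sh
cargo build --release --no-default-features
```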

Or put in your Cargo.toml file:
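Something along these lines (the version number is only an example):

```toml
[dependencies.pulldown-cmark]
version = "0.9"
default-features = false
```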

SIMD accelerated scanners are available for the x64 platform from version 0.5 onwards. To enable them, build with the simd feature:
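Roughly:

```sh
cargo build --release --features simd
```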

Or add the feature to your project's Cargo.toml:
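Something like (version number is an example):

```toml
[dependencies.pulldown-cmark]
version = "0.9"
features = ["simd"]
```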

Authors

The main author is Raph Levien. The implementation of the new design (v0.3+) was completed by Marcus Klaas de Vries.

Contributions

We gladly accept contributions via GitHub pull requests. Please see CONTRIBUTING.md for more details.

Issues

Collection of the latest Issues

wyes-us

I'm using pulldown-cmark version 0.8.0 in a project with strikethrough enabled and decided to try a bunch of different nested spans.

The following combinations have an issue where the parser misses the strikethrough sequence (~~).

It appears to affect just 3 or 4 depth spans starting with "strike emphasis strong" (~~*__) preceded by a 3 depth span starting with "emphasis strike" (*~~) or "strong strike" (**~~).

Here, I used *__ and **_ for "emphasis strong" and "strong emphasis" in order to make it clearer, but the result is the same if I use ***, ___, or _** and __* instead. Using different characters for emphasis and strong doesn't make any difference.

The following procedure should create a demonstration and show the "good" and "bad" tested inputs:

```rust
println!("md\n{}text", good_input);
let parser = pdcm::Parser::new_ext(&good_input, pdcm::Options::all());
for event in parser {
    println!("{:?}", event);
}
println!("md\n{}text", bad_input);
let parser = pdcm::Parser::new_ext(&bad_input, pdcm::Options::all());
for event in parser {
    println!("{:?}", event);
}
```

While I've used pulldown-cmark's parser, I'm not really knowledgeable of its internals, but if I have time will try to identify the root cause and suggest a fix. If anyone else can provide any insights, please do. Thanks!

boehs

Feel free to close this, but this is something I have been interested in for a while and want to see in my favorite markdown parser for my favorite programming lang, though I don't know enough to understand if/why it can't be done. Most markdown parsers for various langs use a few files, and contain all their parsing in those files. I think it would be cool if, instead of giant monofiles, each feature were its own file. So a parsers crate that has stuff like

etc

I think having it like this makes it

  1. easy to read
  2. easy to add new ones
  3. easy to track down bugs and make patches
  4. extension-first; this forces a strong API.

I don't know why I could not find a parser anywhere that did this, so maybe I am missing something huge

What are your opinions on this? Is it feasible? Is it outright a bad idea?

Weird issue but was not sure where to ask

FToovvr

Hello.

As the title says, for example, if I run cargo run -- --enable-tables with the following as input:

I'd expect the output to be [^1]:

| not | in |
| a | list |

  • A blockquote:

    inside a list item

  • A Heading:

    inside a list item

  • A table:

    | list | item |
    | with | leading |
    | empty | lines |

  • A table:

    | list | item |
    | without | leading |
    | empty | lines |

But the actual output is:

| A | table |
| not | in |
| a | list |
  • A blockquote:

    inside a list item

  • A Heading:

    inside a list item

  • A table:

    | inside | a |
    | list | item |
    | with | leading |
    | empty | lines |
  • A table: | inside | a | | ------- | ------- | | list | item | | without | leading | | empty | lines |

Raw HTML of the actual output

[^1]: I pasted the exact markdown text above to here and appended > to the beginning of each line, so this is just how GitHub's markdown renderer renders my input.


As the above shows:

  • Table extension is enabled (hence the first table above is rendered).
  • Generally, block elements inside list items are rendered exactly as they would be in other places, no matter whether there are leading empty lines. The CommonMark dingus also works this way (hence the blockquote and h1 above are rendered).
  • However, unlike other block elements, pulldown-cmark interprets tables inside list items without leading empty lines as paragraphs, not tables.

Commit I built the binary from: 5088b21d09ef94b424c4d852db7648c9c94fb630 (The current head commit)

kknives

This seems like an edge case to me, but when using an HTML tag inside a table, any newlines inside the tag's attributes (or body) will cause the HTML to be escaped, and the newlines are interpreted as new rows.

Most HTML tags are not used in this way, but the <svg> tag used for rendering math by Katex falls prey to this unfortunate edge case, as we encountered in lzanini/mdbook-katex#3

What was supposed to be rendered (first attached image) instead became (second attached image).

I took a brief look at the code and my guess is that, when TableParseMode::Active is set, newlines are interpreted as new rows: https://github.com/raphlinus/pulldown-cmark/blob/4d5094be11ec524e30acededc8ea097ad95267e8/src/firstpass.rs#L416-L420

Do correct me if my guess is wrong and we can work on this together :+1:

GuillaumeGomez

I read through https://github.github.com/gfm/#example-199 and #145 to understand a bit better, and I'm not sure what the best course of action is here: the HTML spec for td (and th just below it) doesn't list align as a valid attribute.

For more context, I discovered this issue when working on https://github.com/rust-lang/rust/pull/84480. The goal is to check that rustdoc-generated pages are completely valid by using tidy. The problem is that align is an invalid attribute because it is not listed in the spec. One solution would be to rename align to data-align. However, browsers use this attribute to change how content is displayed. So maybe the best thing to do is simply to ignore this warning and keep things as they are... If so, please just close the issue.

ISSOtm

It's not in the CommonMark spec, and unlike the GitHub syntax extensions, the README does not link to anything. I've only figured out the syntax by finding an example; what can the README point to?

vedantroy

I've been browsing the code to get an understanding of it & I have a few questions if you wouldn't mind answering!

  • Why is TreeIndex a wrapper for NonZeroUsize instead of a wrapper for usize? (I saw https://github.com/raphlinus/pulldown-cmark/pull/184 but didn't really understand it).
  • In scan_containers we clone line_start before calling scan_blockquote_marker, but scan_blockquote_marker seems to save/restore internally, so why do we do it twice?

ovidiu-ionescu

I could not find a way to extend the generated HTML so that links open in a new tab. I also could not find, anywhere on the web, a consistent way to specify this behavior in markdown.

OTOH, it is quite easy to add rendering options to html in a similar way to the existing extension options of the parser. I have already written the code, change is here.

Is this something interesting to the project?

5225225

There's already support for fuzzing here, but I found this issue through https://github.com/rust-fuzz/targets and wrote a simple cargo-fuzz fuzzing target that only checks that parsing arbitrary input doesn't panic. (Might be worth having a cargo-fuzz target in addition to the existing one, simply because cargo fuzz is easier to set up and seems to run test cases faster? Though I know testing for panics isn't the main purpose of the existing fuzzer.)

Below's a standalone demo program. I tested that it crashes both with 0.8.0 from crates.io, as well as the latest version from git (d99667b3, if just going to the directory in ~/.cargo/git/checkouts is correct).

Backtrace is below

throwaway1037

Please correct me if I am mistaken in the following message:

As far as I understand, this project currently implements GitHub Flavored Markdown in its mainline with the exception of § 6.9 Autolinks and § 6.11 Disallowed Raw HTML.

I understand the former was discussed in Issue #494, although it was not mainlined.

Could full GFM support be implemented, if it is not already? I think it would be best to make it possible to enable/disable as a compile-time option, with CommonMark being the default.

mattwidmann

The way that footnotes are currently emitted by pulldown_cmark's parser makes it difficult to format them as sidenotes in HTML. The problem is that FootnoteDefinitions are emitted as soon as they're found in the source text. However, sidenote definitions need to be emitted in the HTML just after the reference mark, so that CSS positioning can move them into the margin, like this:

But pulldown_cmark produces the footnote definition too late to make that output possible for a Markdown snippet like this:

I've thought of a few workarounds, but they're not ideal:

  • Pre-process the source to move all paragraphs starting with [^ to the beginning of the document, so that FootnoteDefinitions are guaranteed to be seen before their references.
  • Wrap the parser so that when a FootnoteReference is found, it can hold events in memory until it finds the matching FootnoteDefinition.

I'm not sure of the right way to solve this in the pulldown_cmark library, though. Because a footnote definition can contain more styled content, it seems like definitions should be treated as start/end events. The lunamark library for Lua has the writer parse the definition each time it's referenced. That kind of solution might make sense here: storing the footnote definition's contents and parsing them when the definition is referenced.

edward-shen

I was looking around the codebase and noticed that Rust has determined that Parser is !Send and !Sync. I thought it was odd because, from a very high-level perspective, it seems reasonable to expect to be able to send a parser object between threads without difficulty. Synchronous access, maybe not so much (I can't really imagine a use case where a client would want a parser accessed from multiple threads at the same time), but Send implies Sync so ¯\_(ツ)_/¯.

It looks like the primary problem is that the type BrokenLinkCallback isn't Send nor Sync, because we don't require the trait object to be Send or Sync:

A quick fix is to just require the trait object reference to be Send + Sync:

but since BrokenLinkCallback is part of the public API, this would be a breaking change as it restricts what functions are accepted.

As a refresher on Closure auto-impl rules:

  • A closure is Sync if all captured variables are Sync.
  • A closure is Send if all variables captured by non-unique immutable reference are Sync, and all values captured by unique immutable or mutable reference, copy, or move are Send.

I might be wrong, but I don't think adding this requirement would be too difficult on users. I would expect most use cases to already be Send and Sync (our tests need no modification, for example), and I believe the current definition &mut dyn FnMut already forces some sort of Send/Sync primitive around the Parser itself, if they need to do so.

On that note, is there a particular reason why we accept a &mut to a callback, instead of an owned instance we Box instead or have Parser be generic over the trait?

BenjaminRi

As can be seen in issue #457, if you parse the following code block

```
test

test
```

you get

The strange behaviour here is that the lines start with \n, but don't end with \n. I think this is highly unusual and makes the strings harder to work with than necessary. Desired behaviour would be:

This behaviour would make much more sense (it is also more natural because it reflects the actual lines seen in the code block) and provides better compatibility with other libraries like syntect, which usually parse code on a line-by-line basis, where a line is defined as a string terminated by \n.

I am currently running into issues with this and the only remedy seems to be string slicing and copying, which costs performance.

camelid

Is there a particular reason that pulldown-cmark defines its own trait, StrWrite, rather than using the built-in Rust trait core::fmt::Write? Previously, core::fmt::Write's documentation said

This trait should generally not be implemented by consumers of the standard library.

But in a recent PR, I changed it to remove that section since no one knew of a reason why it said that. Perhaps it would be good to move pulldown over to that built-in trait?

dhardy

Markdown's out-of-line code blocks start and end with a line-break which usually isn't considered part of the block. Example:

Pulldown-cmark's output:

That last \n should be stripped from the Text element, right? I consider it not part of the block. If I want to apply highlighting (e.g. background colour) to the code-block, that should not affect the line-break following the Text block.

Related: #457

makoConstruct

Markdown is... I think it's fair to say that it is inconsistent about whether or not it will wrap a list item in a <p>. There are rules, but they're not intuitive; for me they aren't lining up with what I want the software to do at all.

There doesn't seem to be a way to get it to wrap the list items of single-item lists in <p>, leading lone list items to look like an inconsistency in the document, where the usual padding between sections of text is just mysteriously missing.

Clearly what I want is for us to condemn and abandon the standard behaviour as a default, but that doesn't seem to be what this project is about, so I'll instead ask if we'd be alright with a parser option that assumes loose lists in all cases.

diminishedprime

Description

The generated html for footnotes has the <sup> tag outside of the <p> tag of the footnote. Phrasing content such as <sup> tags should be inside a paragraph. Because the <sup> doesn't belong to the <p>, it ends up on its own line and requires additional CSS to make it display correctly.

I'm happy to tackle this in a Pull Request, but wanted to make sure there's agreement that this is a valid bug report first.

Expected Behavior

Markdown with footnotes generates html with the correct semantic behavior. The <sup> tag for the footnote number should be inside the <p> tag for the footnote content.

jsfiddle

Actual Behavior

Markdown with footnotes generates html with a <sup> as a direct child of a <div>

jsfiddle

Note: This html was copied directly out of the ./specs/footnotes.txt

IronOxidizer

The Parser should be reusable, to improve performance. This would reduce the number of allocations, since no new parser would be required for each Markdown conversion, and would also save work, since parser options and custom event mapping would not have to be set up for each new Parser.

This would be particularly useful in a case where many small parses are happening independently but with the same options and event mapping (e.g. a long forum thread with thousands of comments where each comment has to be parsed with non-standard markdown).

I propose a parser.set_content(text: &'a str) syntax as it would be rather intuitive and does not seem like it would break any current implementations.

frondeus

Hello, thank you for your fantastic project.

I'm writing a small language with markdown support, and I wanted to add syntax highlighting for Neovim, so I have to parse markdown. Pulldown-cmark works great in most cases because it provides access to offsets via into_offset_iter().

However, this gets me a range for the whole tag. For example:

This would give me (0..112) - the range for the whole text.

There are cases when I need to process or highlight only a part, for example the title, therefore I need to somehow calculate the ranges for src, title, and of course the body (displayed text, alt text, etc.).

The third one is easy: it lives under a different event. The first one is also quite interesting and possible: because it's under CowStr::Borrowed, I can perform the offset calculation by subtracting pointers into the source slice, which is, in fact, contiguous memory. (Note that I assume nobody made any owned event and everything references the original single source.)

However, the title is under CowStr::Inlined, which means my only solution for now is to perform a classic search to find where exactly it lives. Even this can be tricky: what if the title has the same content as, for example, the link or the alt text?

Anyway - do you see here any possible good solution?

Versions


v0.9.0 - Dec 22, 2021

This release brings a number of changes.

New features

  • Thanks to @lo48576, pulldown now optionally supports custom header ids and classes for headers. Set ENABLE_HEADING_ATTRIBUTES in the options to enable.
  • Users can now access reference definitions, information that was previously only exposed internally.
  • Pulldown is now CommonMark 0.30 compliant.

Changes

  • The function signature for the broken link callback has changed slightly to allow for FnMut functions.

There have also been a number of (small) parsing bug fixes.

v0.8.0 - Sep 01, 2020

This release brings support for markdown smart punctuation. Further, it comes with a renewed design for broken link callbacks. Finally, it fixes a few minor parsing bugs.

v0.7.2 - Jul 02, 2020

Changes:

  • Minor parsing fixes

v0.7.0 - Feb 12, 2020

Minor parsing fixes and bug fixes. Now exposes the difference between fenced code blocks and indented code blocks.

v0.6.1b - Nov 11, 2019

Minor parsing fixes.

v0.6.0 - Sep 06, 2019

This is a backward incompatible release. However, most users will not experience any breakage. It also fixes some parser correctness bugs.

Breaking changes:

  • the get_offset method on the parser was removed. Its semantics were poorly defined and only provided users with the start offset of the next event. To get proper source mapping information which includes the entire source range for each event, upgrade the Parser to an OffsetIter using the into_offset_iter method. This produces an iterator over (Event, Range<usize>) tuples.
  • the Event::HtmlBlock and Event::InlineHTML event variants were removed. Inline HTML is now represented by regular HTML events.
  • horizontal rules are now events, and no longer (empty) tags.
  • Event::Header(i32) has been replaced by Event::Heading(u32).
  • the starting index of numbered lists is now represented by a u64 instead of a usize.
  • the FIRST_PASS option has been removed.

v0.5.3 - Jul 18, 2019

Changes:

  • Addresses rare panics in emphasis routine
  • Fixes some parser correctness issues
  • Small bugfixes

v0.5.2 - May 28, 2019

Changes:

  • bug fixes
  • improved parsing correctness

v0.5.1 - May 13, 2019

Changes:

  • removes last remaining unsafe block in default mode (without simd feature);
  • various bug fixes and guards against quadratic behavior;
  • very minor performance bumps.

v0.5.0 - Apr 24, 2019

Additions:

  • CommonMark 0.29 compatibility
  • SIMD accelerated parsers feature
  • Guards against known pathological inputs causing quadratic scanning time
  • Speed improvements

Changes:

  • Code spans are no longer tags, but are now events containing a single CowStr. This is a breaking change.

v0.4.1 - Apr 12, 2019

Minor release with a number of small bug fixes. No breaking changes.

v0.4.0 - Mar 18, 2019

New extensions (strikethrough, task lists), public CowStr and InlineStr and some small fixes.

This is not backward compatible with v0.3.0, but the changes should be very manageable.

Information - Updated Jan 02, 2022

Stars: 1.2K
Forks: 170
Issues: 59
