pest. The Elegant Parser

pest is a general purpose parser written in Rust with a focus on accessibility, correctness, and performance. It uses parsing expression grammars (or PEGs) as input, which are similar in spirit to regular expressions, but which offer the enhanced expressivity needed to parse complex languages.

Getting started

The recommended way to start parsing with pest is to read the official book.

Other helpful resources:

  • API reference on docs.rs
  • Play with grammars and share them on our fiddle
  • Leave feedback, ask questions, or greet us on Gitter

Example

The following is an example of a grammar for a list of alpha-numeric identifiers where the first identifier does not start with a digit:
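The grammar itself is not reproduced on this page; the example from pest's README is, as best I recall, of the following shape (treat this as a sketch rather than a verbatim quote):

```pest
alpha = { 'a'..'z' | 'A'..'Z' }
digit = { '0'..'9' }

ident = { (alpha | digit)+ }

// the leading `_` makes ident_list silent, so it produces no tokens itself
ident_list = _{ !digit ~ ident ~ (" " ~ ident)* }
```

The `!digit` negative lookahead is what enforces that the first identifier does not start with a digit.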

Grammars are saved in separate .pest files which are never mixed with procedural code. This results in an always up-to-date formalization of a language that is easy to read and maintain.

Meaningful error reporting

Based on the grammar definition, the parser also includes automatic error reporting. For the example above, the input "123" will result in:

while "ab *" will result in:

Pairs API

The grammar can be used to derive a Parser implementation automatically. Parsing returns an iterator of nested token pairs:
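The code snippet is missing from this page; a sketch of the derive-and-iterate pattern follows (the grammar path `ident.pest`, the rule names, and the input string are assumptions, not taken from the original):

```rust
use pest::Parser;
use pest_derive::Parser;

// The derive macro reads the grammar file at compile time and generates
// a `Rule` enum plus a `Parser` implementation for this type.
#[derive(Parser)]
#[grammar = "ident.pest"] // assumed path, relative to src/
struct IdentParser;

fn main() {
    let pairs = IdentParser::parse(Rule::ident_list, "a1 b2 c3")
        .unwrap_or_else(|e| panic!("{}", e));

    // Each Pair carries its rule, the matched slice of input, and its
    // nested inner pairs.
    for pair in pairs {
        println!("Rule: {:?}, text: {:?}", pair.as_rule(), pair.as_str());
        for inner in pair.into_inner() {
            println!("  {:?}: {}", inner.as_rule(), inner.as_str());
        }
    }
}
```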

This produces the following output:

Other features

  • Precedence climbing
  • Input handling
  • Custom errors
  • Runs on stable Rust

Projects using pest

  • pest_meta (bootstrapped)
  • AshPaper
  • brain
  • cicada
  • comrak
  • elastic-rs
  • graphql-parser
  • handlebars-rust
  • hexdino
  • Huia
  • insta
  • jql
  • json5-rs
  • mt940
  • Myoxine
  • py_literal
  • rouler
  • RuSh
  • rs_pbrt
  • stache
  • tera
  • ui_gen
  • ukhasnet-parser
  • ZoKrates
  • Vector
  • AutoCorrect
  • yaml-peg
  • caith (a dice roller crate)

Special thanks

A special round of applause goes to prof. Marius Minea for his guidance, and to all pest contributors, some of whom are none other than my friends.

Issues

Collection of the latest Issues

tadman

While the .pest grammar format is quite flexible, there are circumstances in which it cannot express what's required. Writing the rule manually can solve the problem, but it seems that pest either generates every rule automatically, with no exceptions, or requires you to define everything manually.

I'm facing a situation where a single rule out of hundreds cannot be expressed with the grammar.

Allowing for manual function definition in addition to automatic definition would solve this.

For example, imagine a grammar like:
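The issue's actual grammar is not shown on this page; a sketch of the kind of grammar in question, with illustrative rule names of my own, might be:

```pest
number  = { ASCII_DIGIT+ }
element = { "{" ~ number ~ "}" ~ literal }
// `literal` would need to consume exactly `number` characters, something a
// pure PEG rule cannot express, hence the request for a manual function.
literal = { ANY+ } // placeholder; would be replaced by hand-written code
```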

Where the parser needs to handle elements like "{4}testX..." being parsed as ( 4, "test" ), with the X... part not consumed but left for the next element. For this to work, number needs to be converted and used to consume a fixed number of characters, which is easily handled with a manual parser.

I'd like to propose an alternative syntax for situations like this:

This would indicate that literal is a manually defined function, called as Self::literal(...) instead.

I've been working on a fork which implements this with some very minor alterations, but I would appreciate some feedback and assistance.

muppi090909

In this code snippet

The consume function is not found. Should the documentation be updated to include a grammar and a valid example?

tadman

When implementing a spec based on an RFC, I wrote a rule like text{0,998}, as the spec (RFC5322) dictates that lines cannot exceed 998 characters.

This dramatically increased compile times from negligible to many seconds. Changing this to text* and doing that validation step in other code eliminates the problem.

By varying the repeat count in the rule and running tests on a Mac Mini (M1, 2020):

  • text{0,2048} = 36.04s
  • text{0,998} = 8.77s
  • text{0,512} = 2.94s
  • text{0,256} = 1.12s
  • text{0,128} = 0.62s

Beyond a certain point it's just a "stack overflow":

  • text{0,3000} = fatal runtime error: stack overflow

Is this a known limitation of the implementation of limited repeat?

Sample grammar:
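The sample grammar is not reproduced on this page; a minimal grammar of the shape described might look like this (rule names are my own, not from the issue):

```pest
text = { ASCII_ALPHANUMERIC | " " }
line = { text{0, 998} }
```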

I tried this in the snippet generator but I think it can't handle it because of this issue.

zacps

Currently if pest fails on a blank line it will output something like this:

I would like it to output a configurable number of surrounding lines a la rustc error messages.

Is it possible to configure this somewhere?

max-sixty

Currently the literals section of the book gives this as a string example:
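The book's snippet is not shown here; a string rule of roughly the shape under discussion (reconstructed for illustration, not a verbatim quote from the book) would be:

```pest
string = { "\"" ~ (!"\"" ~ ANY)* ~ "\"" }
```

Because the rule is not atomic, implicit WHITESPACE matching applies between its sub-expressions, which is the behavior the question is about.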

Is my understanding correct that this will parse " USA" as USA, discarding the initial space, because the rule isn't atomic?

I'm having this issue myself. I'm getting around it by making the rule atomic, but it seems rules can't be both atomic and silent, so the parse tree contains the quote characters, which requires a bit more work downstream.

Thanks!

cgranade

It would be nice to be able to more easily use miette to format diagnostics generated by pest errors. Currently, since pest::error::Error::message is private, this cannot easily be done by third-party libraries that depend on both pest and miette without re-implementing much of the message formatting logic in pest. It would be helpful if either message was made public, or if there was a feature-gated impl<R> IntoDiagnostic for pest::error::Error<R> to make it easier to include pest errors in miette diagnostics.

Thank you! ♥

ccleve

In my app I need to tokenize some text, apply some transformations on it, and then generate an AST. The types of transformations are known only at run time.

In a previous iteration of this app I had a separate lexer and parser, and applied the transformations in the middle. It worked well.

It's not clear how to do this in Pest. How do I supply the output of a custom tokenizer to Pest to generate the ASTs?

mhatzl

As mentioned in the discussion entry, I need to match a block of text that starts with a certain number of opening chars and then is closed with the same number of closing chars.

With a COUNT option in the grammar, this could easily be integrated like so:

block = { PUSH( "<"{3 , } ) ~ ( !( ">"{ COUNT( PEEK ) } ) ~ ANY )+ ~ ">"{ COUNT( POP ) } }

I am not familiar with the implementation of pest, so I don't know how complex it would be to integrate this feature, but @nfejzic and I would be interested in helping to implement it.

Currently, it might be possible to solve the above rule in another way, but if nesting should be allowed, where the outer chars must be at least one char longer, I haven't found a way to solve this with the current pest rule syntax.

Any help is greatly appreciated.

matthew-dean

The existence of an additional match after ident_token causes qualified_rule to hang when typing, as does any rule which includes qualified_rule, such as root, even though selecting ident_token by itself from the drop-down on Pest.rs does not hang. I haven't tried the Rust integration yet, as I'm still crafting the grammar. Is there any reason to expect the Rust side to work when Pest.rs fails or hangs like this?

Note: I tried to reduce this to just qualified_rule, and only the rules referenced. But, when I did that, the grammar actually succeeded and didn't freeze the site. So, somehow there's an invisible interaction with other rules that are not referenced? 🤔

Technohacker

While building the expression parser for my grammar, I was stumped as to why the climber didn't progress beyond one expression sub-unit. It turns out I had forgotten to make my operator catch-all rule silent, which caused the climber to break out on this line.

Making this either an explicit panic (for a static operator set) or an Error return (for a dynamic operator set) may make more sense in alerting users to a bad grammar, since it's unexpected for the parser to ask for an operator beyond those provided to the climber.

This may be a breaking change, since the climber is part of the public API.
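For context, a catch-all rule is made silent in pest grammar syntax with a leading underscore, which keeps it out of the token stream the climber sees; a minimal sketch (operator names are illustrative):

```pest
add      = { "+" }
subtract = { "-" }
// silent: matches, but produces no pairs of its own
operator = _{ add | subtract }
```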

matthew-dean

I've worked a bit with other parser generators (parsers that operate from a PEG-style grammar), and typically there's a CLI tool that can catch / point out errors in the grammar. Is the only way to test a grammar to just try it and see if Rust compilation breaks? I notice that even the editor at pest.rs will throw unhelpful "unreachable" errors if something is wrong.

dlight

pest_derive currently generates a Rule enum without a corresponding enum Rule declaration appearing anywhere in the source code. As a result, there is no natural place for inserting attributes and custom derives.

My suggestion is that instead of

The code should be like this:

And then, pest would provide a type GenParser<T>, that stands for generated parser (note: maybe think about a better name), such that GenParser<Rule> would be used exactly like MyParser is currently.

I believe this would require minimal changes in the generated code: instead of impl Parser<Rule> for MyParser { .. } it would generate impl Parser<Rule> for GenParser<Rule> { ... }.

As an added bonus, the code wouldn't have a hidden Rule with no definition or import in the source code, improving discoverability: Rule is written in the code itself and GenParser needs to be imported.
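The code snippets are missing from this page; to make the contrast concrete, here is a sketch (the first part is standard pest_derive usage; GenParser and the explicit enum are the issue's hypothetical proposal, not an existing API):

```rust
use pest_derive::Parser;

// Today: the derive macro invisibly generates `enum Rule` behind the scenes.
#[derive(Parser)]
#[grammar = "my.pest"]
struct MyParser;

// Proposed (hypothetical, not implemented): the user writes the enum out,
// gaining a natural place for attributes and custom derives, and pest
// supplies a generic parser type:
//
//     pub enum Rule { ident, ident_list }
//
//     type MyParser = GenParser<Rule>;
```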

mdharm

For certain grammars, it would be helpful if PUSH(), PEEK, and POP all had case-insensitive variants. The HTML grammar used by the html_parser crate is one example; it does not properly parse sequences such as <BODY></body>, and there is no "easy" fix. This is partially a side-effect of the grammar being written to accept arbitrary tags (rather than limiting itself to the well-defined HTML tags), so it does not explicitly list all of the valid tag tokens. Instead, it uses PUSH() and POP to find matching opening and closing tags.

I'm sure there are other cases where this would be handy. It doesn't look like a major effort either, as there is already a case-insensitive comparison method, Position::match_insensitive(), that could be used in place of the existing Position::match_string().
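A sketch of the matching-tag pattern in question (rule names are illustrative, not taken from html_parser's actual grammar):

```pest
tag_name  = { ASCII_ALPHANUMERIC+ }
close_tag = { "</" ~ POP ~ ">" }
element   = { "<" ~ PUSH(tag_name) ~ ">" ~ (!close_tag ~ ANY)* ~ close_tag }
```

Since POP compares the pushed string with exact matching, <BODY></body> fails to match; a case-insensitive POP variant would compare using Position::match_insensitive() instead.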

Byter09

Hello,

in the process of upgrading our CI to use the latest Rust toolchain (1.55.0), we've encountered an error originating from this crate.

It seems that the forbidden-lint-groups lint has been updated in this release and no longer allows clippy::all.

As a hotfix, we've added #![allow(forbidden_lint_groups)] to the beginning of the file, as we can't apply it to the derive.

Given that this will become a hard error in the future, we hope this issue notifies you of the upcoming change and speeds up resolving the problem.

Here's the error:

And the related code:

Thanks.

jhoobergs

A little context

I am currently developing a parser for certain mathematical expressions. On top of that, I want to be able to parse these expressions within HTML files. To do this, I copied a pest grammar for HTML into my existing grammar. The main problem I am having is that the COMMENT (and also WHITESPACE) rules are now unusable for me, because the HTML and the mathematical expressions have different comment and whitespace needs. I also don't like the fact that my grammar now contains multiple languages without a clear separation.

My proposal:

  • Allow including one pest grammar in another (e.g. html = include html.pest)
  • All rules of the included grammar become usable in the including grammar, preferably with a prefix so there are no duplicate names (e.g. html:node instead of just node, which is defined in html.pest)
  • Both grammars still use the COMMENT and WHITESPACE rules as described in their own files

Other cases

  • An HTML parser might, for example, also want to use a pest grammar for CSS and JS, and that would be a lot easier with the include function.

Alternative

  • Including all rules might be a bit too much and could cause problems with conflicting COMMENT and WHITESPACE values, but just including one rule, like expression, html_document, js_code, or css_style, from another grammar would help. The other rules defined in the subgrammars would still exist and be usable in the parsing code, but not in the main pest file itself.
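To make the proposal concrete, the suggested usage might look like this (hypothetical syntax from the issue, not valid pest today):

```pest
// Hypothetical, not implemented in pest:
html     = include "html.pest"
document = { html:node* }
```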

Current solution

I currently parse the HTML with an HTML parser (with its own grammar) that has a special rule, expression. Each expression that I extract is then parsed with the parser for mathematical expressions. This works, but has some drawbacks, such as much harder line-number detection.

What do you think? Is this something that seems useful or am I missing something?

Versions

Find the latest versions by id

v2.1.3-generator - Mar 13, 2020

Made secondary grammar include relative.

v2.1.3-pest - Feb 22, 2020

Fixed compile warnings.

v2.1.3-meta - Feb 22, 2020

Fixed compile warnings.

v2.1.2-generator - Feb 22, 2020

Fixed compile warnings and non-reproducible builds.

v2.1.2-pest - Sep 02, 2019

Small fixes:

  • incorrect alignment of underline in Error
  • Pair::as_str indices exception
  • stacking issue #394

v2.1.2-meta - Sep 02, 2019

Small dependency change.

v2.1.1-generator - Sep 02, 2019

Small dependency update.

v2.1.1-pest - Apr 15, 2019

Small release that adds Pair/Pairs serialization.

v2.1.1-meta - Apr 15, 2019

Small dependency update.

v2.1.0 - Dec 21, 2018

This release includes:

v2.0.2-pest - Oct 21, 2018

Added a way to access Error's line/column information.

v2.0.3-meta - Oct 02, 2018

The previous version was yanked due to it not including the .gitignored grammar.rs. Publishing pest_meta now requires --allow-dirty.

v2.0.2-meta - Oct 02, 2018

pest_meta now breaks bootstrap recursion and does not need v1.0 to build anymore. 🎉

v2.0.1-pest - Oct 02, 2018

Fixed an issue where the skip optimizer wouldn't accept inputs that were supposed to be valid.

v2.0.1-derive - Oct 01, 2018

Split pest_derive in two crates to make generation reusable through the new pest_generator.

v2.0.1-meta - Sep 30, 2018

Fixed a bug in the optimizer that was causing exponential compile times in bigger grammars.

v2.0.0 - Sep 30, 2018

We're happy to release pest 2.0!

While there are a lot of changes that came into this release, here are some of the highlights:

  • improved performance significantly
  • revamped error reporting
  • improved grammar validation to a point where almost all degenerate grammars are now rejected
  • improved stack behavior so that you can now define whitespace-aware languages
  • added unicode builtin rules and other helpers
  • pest_meta crate for parsing, validating, and optimizing grammars
  • pest_vm crate for running grammars on-the-fly
  • finally removed funky const _GRAMMAR paper-cut
  • errors are now owned and much easier to use
  • made EOI non-silent for better error reporting
  • changed special rules to be SHOUT_CASE (e.g. ANY, WHITESPACE)

v1.0.7 - Mar 31, 2018

  • fixes #218

v1.0.6 - Mar 25, 2018

  • fixes #201 where a bug in parser generation caused exponential parse times

v1.0.3 - Jan 26, 2018

  • fixed #189

v1.0.2 - Jan 22, 2018

  • small docs fix
  • licenses included in crates

v1.0.0 - Jan 20, 2018

After months of work, countless scraped ideas, and impressive contributions from the community, pest 1.0 is finally here!

Here are some of the highlights:

  • simplified & improved meta-grammar, now with extra bells and whistles
  • lower debug & release compile times
  • innovative pair API that handles parser output
  • greatly improved automatic error reporting
  • same high performance

Information - Updated Jun 23, 2022

Stars: 3.0K
Forks: 165
Issues: 138

Repositories & Extras

Rust library for parsing configuration files

The 'option' can be any string with no whitespace

Access Log Parser

This is a pure Rust library for parsing access log entries

This crate was originally developed as a personal learning exercise for getting acquainted with Rust and parsing in general

rust crates for parsing stuff

Tokenizers for math expressions, splitting text, lexing lisp-like stuff, etc

A Rust crate for parsing and writing BibTeX and BibLaTeX files

WikiBook section on LaTeX bibliography management

Rust library for parsing COLLADA files

Notice: This library is built around files exported from Blender 2

Rust JSON parsing benchmarks

This project aims to provide benchmarks to show how various JSON-parsing libraries in the Rust programming language perform at various JSON-parsing tasks

A WIP Rust library for parsing and extracting assets from DELTARUNE's data

(+ some adjustments for the newer version of GM)

A Rust library for parsing Smash Ultimate XMB files

The binary application prints an XMB file to the console in JSON format

Rust library for parsing English time expressions into start and end timestamps

This takes English expressions and returns a time range which ideally matches the expression