============
Parsimonious
============

Parsimonious aims to be the fastest arbitrary-lookahead parser written in pure
Python—and the most usable. It's based on parsing expression grammars (PEGs),
which means you feed it a simplified sort of EBNF notation. Parsimonious was
designed to undergird a MediaWiki parser that wouldn't take 5 seconds or a GB
of RAM to do one page, but it's applicable to all sorts of languages.

:Code:    https://github.com/erikrose/parsimonious/
:Issues:  https://github.com/erikrose/parsimonious/issues
:License: MIT License (MIT)
:Package: https://pypi.org/project/parsimonious/


Goals
=====

* Speed
* Frugal RAM use
* Minimalistic, understandable, idiomatic Python code
* Readable grammars
* Extensible grammars
* Complete test coverage
* Separation of concerns. Some Python parsing kits mix recognition with
  instructions about how to turn the resulting tree into some kind of other
  representation. This is limiting when you want to do several different things
  with a tree: for example, render wiki markup to HTML *or* to text.
* Good error reporting. I want the parser to work *with* me as I develop a
  grammar.


Install
=======

To install Parsimonious, run::

    $ pip install parsimonious


Example Usage
=============

Here's how to build a simple grammar:

.. code:: python

    >>> from parsimonious.grammar import Grammar
    >>> grammar = Grammar(
    ...     """
    ...     bold_text  = bold_open text bold_close
    ...     text       = ~"[A-Z 0-9]*"i
    ...     bold_open  = "(("
    ...     bold_close = "))"
    ...     """)

You can have forward references and even right recursion; it's all taken care
of by the grammar compiler. The first rule is taken to be the default start
symbol, but you can override that.
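
For instance, here's a sketch of starting from a different rule: a ``Grammar``
acts like a dict of named expressions, and its ``default()`` method returns a
new grammar whose start symbol is the rule you name.

.. code:: python

    >>> # Parse with the "text" rule as the start symbol instead:
    >>> text_grammar = grammar.default('text')
    >>> print(text_grammar.parse('chunky bacon 123'))
    <RegexNode called "text" matching "chunky bacon 123">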

Next, let's parse something and get an abstract syntax tree:

.. code:: python

    >>> print(grammar.parse('((bold stuff))'))
    <Node called "bold_text" matching "((bold stuff))">
        <Node called "bold_open" matching "((">
        <RegexNode called "text" matching "bold stuff">
        <Node called "bold_close" matching "))">

You'd typically then use a ``nodes.NodeVisitor`` subclass (see below) to walk
the tree and do something useful with it.

Another example would be a parser for ``.ini`` files. Consider the following
grammar:

.. code:: python

    grammar = Grammar(
        r"""
        expr        = (entry / emptyline)*
        entry       = section pair*

        section     = lpar word rpar ws
        pair        = key equal value ws?

        key         = word+
        value       = (word / quoted)+
        word        = ~r"[-\w]+"
        quoted      = ~'"[^\"]+"'
        equal       = ws? "=" ws?
        lpar        = "["
        rpar        = "]"
        ws          = ~r"\s*"
        emptyline   = ws+
        """
    )


We could now implement a subclass of ``NodeVisitor`` like so:

.. code:: python

    class IniVisitor(NodeVisitor):
        def visit_expr(self, node, visited_children):
            """ Returns the overall output. """
            output = {}
            for child in visited_children:
                output.update(child[0])
            return output

        def visit_entry(self, node, visited_children):
            """ Makes a dict of the section (as key) and the key/value pairs. """
            key, values = visited_children
            return {key: dict(values)}

        def visit_section(self, node, visited_children):
            """ Gets the section name. """
            _, section, *_ = visited_children
            return section.text

        def visit_pair(self, node, visited_children):
            """ Gets each key/value pair, returns a tuple. """
            key, _, value, *_ = node.children
            return key.text, value.text

        def generic_visit(self, node, visited_children):
            """ The generic visit method. """
            return visited_children or node

And call it like this:

.. code:: python

    from parsimonious.grammar import Grammar
    from parsimonious.nodes import NodeVisitor

    data = """[section]
    somekey = somevalue
    someotherkey=someothervalue

    [anothersection]
    key123 = "what the heck?"
    key456="yet another one here"

    """

    tree = grammar.parse(data)

    iv = IniVisitor()
    output = iv.visit(tree)
    print(output)

This would yield:

.. code:: python

    {'section': {'somekey': 'somevalue', 'someotherkey': 'someothervalue'}, 'anothersection': {'key123': '"what the heck?"', 'key456': '"yet another one here"'}}

Status
======

* Everything that exists works. Test coverage is good.
* I don't plan on making any backward-incompatible changes to the rule syntax
  in the future, so you can write grammars with confidence.
* It may be slow and use a lot of RAM; I haven't measured either yet, and I
  have yet to begin optimizing in earnest.
* Error reporting is now in place. ``repr`` methods of expressions, grammars,
  and nodes are clear and helpful as well. The ``Grammar`` ones are
  even round-trippable!
* The grammar extensibility story is underdeveloped at the moment. You should
  be able to extend a grammar by simply concatenating more rules onto the
  existing ones; later rules of the same name should override previous ones.
  However, this is untested and may not be the final story.
* Sphinx docs are coming, but the docstrings are quite useful now.
* Note that there may be API changes until we get to 1.0, so be sure to pin to
  the version you're using.

Coming Soon
-----------

* Optimizations to make Parsimonious worthy of its name
* Tighter RAM use
* Better-thought-out grammar extensibility story
* Amazing grammar debugging


A Little About PEG Parsers
==========================

PEG parsers don't draw a distinction between lexing and parsing; everything is
done at once. As a result, there is no lookahead limit, as there is with, for
instance, Yacc. And, due to both of these properties, PEG grammars are easier
to write: they're basically just a more practical dialect of EBNF. With
caching, they take O(grammar size * text length) memory (though I plan to do
better), but they run in O(text length) time.
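
To make "no lookahead limit" concrete, here's a small illustrative grammar
(the rule names are invented for this example) whose two alternatives agree on
arbitrarily many characters and differ only at the very end. A fixed-lookahead
parser can't pick a branch up front; a PEG just backtracks:

.. code:: python

    >>> from parsimonious.grammar import Grammar
    >>> g = Grammar(r"""
    ...     sentence = letters "!" / letters "?"
    ...     letters  = "a"*
    ...     """)
    >>> # The "!" branch fails only at the final character; the parser
    >>> # backtracks and the "?" branch succeeds.
    >>> tree = g.parse('aaaa?')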

More Technically
----------------

PEGs can describe a superset of *LL(k)* languages, any deterministic *LR(k)*
language, and many others—including some that aren't context-free
(http://www.brynosaurus.com/pub/lang/peg.pdf). They can also deal with what
would be ambiguous languages if described in canonical EBNF. They do this by
trading the ``|`` alternation operator for the ``/`` operator, which works the
same except that it makes priority explicit: ``a / b / c`` first tries matching
``a``. If that fails, it tries ``b``, and, failing that, moves on to ``c``.
Thus, ambiguity is resolved by always yielding the first successful recognition.
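
Here's a quick sketch of that priority in action. (``IncompleteParseError`` is
what Parsimonious raises when the start rule succeeds without consuming all of
the input.)

.. code:: python

    >>> from parsimonious.grammar import Grammar
    >>> # With the longer literal first, 'hiya' is matched in full:
    >>> g = Grammar('greeting = "hiya" / "hi"')
    >>> tree = g.parse('hiya')
    >>> # Flip the order and "hi" wins first on the same input, leaving
    >>> # "ya" unconsumed:
    >>> g = Grammar('greeting = "hi" / "hiya"')
    >>> g.parse('hiya')
    Traceback (most recent call last):
        ...
    parsimonious.exceptions.IncompleteParseError: ...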


Writing Grammars
================

Grammars are defined by a series of rules. The syntax should be familiar to
anyone who uses regexes or reads programming language manuals. An example will
serve best:

.. code:: python

    my_grammar = Grammar(r"""
        styled_text = bold_text / italic_text
        bold_text   = "((" text "))"
        italic_text = "''" text "''"
        text        = ~"[A-Z 0-9]*"i
        """)

You can wrap a rule across multiple lines if you like; the syntax is very
forgiving.
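
For example, a long alternation can simply continue on the next line, as long
as the continuation doesn't look like the start of a new rule (a small sketch;
``digit`` is a made-up rule name):

.. code:: python

    digit_grammar = Grammar(r"""
        digit = "0" / "1" / "2" / "3" / "4" /
                "5" / "6" / "7" / "8" / "9"
        """)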

If you want to save your grammar in a separate file, name it with the
``.ppeg`` extension.
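
``Grammar`` itself just takes a string, as in all the examples above, so
loading from such a file is an ordinary read; ``my_grammar.ppeg`` below is a
hypothetical file name:

.. code:: python

    from pathlib import Path

    from parsimonious.grammar import Grammar

    # Grammar takes a plain string, so we read the file's text ourselves.
    grammar = Grammar(Path('my_grammar.ppeg').read_text())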


Syntax Reference
----------------

====================    ========================================================
``"some literal"``      Used to quote literals. Backslash escaping and Python
                        conventions for "raw" and Unicode strings help support
                        fiddly characters.

``b"some literal"``     A bytes literal. Using bytes literals and regular
                        expressions allows your grammar to parse binary files.
                        Note that all literals and regular expressions must be
                        of the same type within a grammar. In grammars that
                        process bytestrings, you should make the grammar string
                        an ``r"""string"""`` so that byte literals like ``\xff``
                        work correctly.

[space]                 Sequences are made out of space- or tab-delimited
                        things. ``a b c`` matches spots where those 3
                        terms appear in that order.

``a / b / c``           Alternatives. The first to succeed of ``a / b / c``
                        wins.

``thing?``              An optional expression. This is greedy, always consuming
                        ``thing`` if it exists.

``&thing``              A lookahead assertion. Ensures ``thing`` matches at the
                        current position but does not consume it.

``!thing``              A negative lookahead assertion. Matches if ``thing``
                        isn't found here. Doesn't consume any text.

``things*``             Zero or more things. This is greedy, always consuming as
                        many repetitions as it can.

``things+``             One or more things. This is greedy, always consuming as
                        many repetitions as it can.

``~r"regex"ilmsuxa``    Regexes have ``~`` in front and are quoted like
                        literals. Any flags_ (``ilmsuxa``) follow the end quotes
                        as single chars. Regexes are good for representing
                        character classes (``[a-z0-9]``) and optimizing for
                        speed. The downside is that they won't be able to take
                        advantage of our fancy debugging, once we get that
                        working. Ultimately, I'd like to deprecate explicit
                        regexes and instead have Parsimonious dynamically build
                        them out of simpler primitives. Parsimonious uses the
                        regex_ library instead