regex

Alternative regular expression module, to replace re.
Downloads: 0 (30 days)
Description

Introduction
------------

This regex implementation is backwards-compatible with the standard 're' module, but offers additional functionality.

Python 2
--------

Python 2 is no longer supported. The last release that supported Python 2 was 2021.11.10.

PyPy
----

This module is targeted at CPython. It expects that all codepoints are the same width, so it won't behave properly with PyPy outside U+0000..U+007F because PyPy stores strings as UTF-8.

Multithreading
--------------

The regex module releases the GIL during matching on instances of the built-in (immutable) string classes, enabling other Python threads to run concurrently. It is also possible to force the regex module to release the GIL during matching by calling the matching methods with the keyword argument ``concurrent=True``. The behaviour is undefined if the string changes during matching, so use it *only* when it is guaranteed that that won't happen.

Unicode
-------

This module supports Unicode 17.0.0. Full Unicode case-folding is supported.

Flags
-----

There are 2 kinds of flag: scoped and global. Scoped flags can apply to only part of a pattern and can be turned on or off; global flags apply to the entire pattern and can only be turned on.

The scoped flags are: ``ASCII (?a)``, ``FULLCASE (?f)``, ``IGNORECASE (?i)``, ``LOCALE (?L)``, ``MULTILINE (?m)``, ``DOTALL (?s)``, ``UNICODE (?u)``, ``VERBOSE (?x)``, ``WORD (?w)``.

The global flags are: ``BESTMATCH (?b)``, ``ENHANCEMATCH (?e)``, ``POSIX (?p)``, ``REVERSE (?r)``, ``VERSION0 (?V0)``, ``VERSION1 (?V1)``.

If neither the ``ASCII``, ``LOCALE`` nor ``UNICODE`` flag is specified, it will default to ``UNICODE`` if the regex pattern is a Unicode string and ``ASCII`` if it's a bytestring.

The ``ENHANCEMATCH`` flag makes fuzzy matching attempt to improve the fit of the next match that it finds.

The ``BESTMATCH`` flag makes fuzzy matching search for the best match instead of the next match.

Old vs new behaviour
--------------------

In order to be compatible with the re module, this module has 2 behaviours:

* **Version 0** behaviour (old behaviour, compatible with the re module):

  Please note that the re module's behaviour may change over time, and I'll endeavour to match that behaviour in version 0.

  * Indicated by the ``VERSION0`` flag.

  * Zero-width matches are not handled correctly in the re module before Python 3.7. The behaviour in those earlier versions is:

    * ``.split`` won't split a string at a zero-width match.

    * ``.sub`` will advance by one character after a zero-width match.

  * Inline flags apply to the entire pattern, and they can't be turned off.

  * Only simple sets are supported.

  * Case-insensitive matches in Unicode use simple case-folding by default.

* **Version 1** behaviour (new behaviour, possibly different from the re module):

  * Indicated by the ``VERSION1`` flag.

  * Zero-width matches are handled correctly.

  * Inline flags apply to the end of the group or pattern, and they can be turned off.

  * Nested sets and set operations are supported.

  * Case-insensitive matches in Unicode use full case-folding by default.

If no version is specified, the regex module will default to ``regex.DEFAULT_VERSION``.

Case-insensitive matches in Unicode
-----------------------------------

The regex module supports both simple and full case-folding for case-insensitive matches in Unicode. Use of full case-folding can be turned on using the ``FULLCASE`` flag. Please note that this flag affects how the ``IGNORECASE`` flag works; the ``FULLCASE`` flag itself does not turn on case-insensitive matching.

Version 0 behaviour: the flag is off by default.

Version 1 behaviour: the flag is on by default.

Nested sets and set operations
------------------------------

It's not possible to support both simple sets, as used in the re module, and nested sets at the same time because of a difference in the meaning of an unescaped ``"["`` in a set.

For example, the pattern ``[[a-z]--[aeiou]]`` is treated in the version 0 behaviour (simple sets, compatible with the re module) as:

* Set containing "[" and the letters "a" to "z"

* Literal "--"

* Set containing letters "a", "e", "i", "o", "u"

* Literal "]"

but in the version 1 behaviour (nested sets, enhanced behaviour) as:

* Set which is:

  * Set containing the letters "a" to "z"

* but excluding:

  * Set containing the letters "a", "e", "i", "o", "u"

Version 0 behaviour: only simple sets are supported.

Version 1 behaviour: nested sets and set operations are supported.

Notes on named groups
---------------------

All groups have a group number, starting from 1.

Groups with the same group name will have the same group number, and groups with a different group name will have a different group number.

The same name can be used by more than one group, with later captures 'overwriting' earlier captures. All the captures of the group will be available from the ``captures`` method of the match object.

Group numbers will be reused across different branches of a branch reset, eg. ``(?|(first)|(second))`` has only group 1. If groups have different group names then they will, of course, have different group numbers, eg. ``(?|(?P<foo>first)|(?P<bar>second))`` has group 1 ("foo") and group 2 ("bar").

In the regex ``(\s+)(?|(?P<foo>[A-Z]+)|(\w+) (?P<foo>[0-9]+)`` there are 2 groups:

* ``(\s+)`` is group 1.

* ``(?P<foo>[A-Z]+)`` is group 2, also called "foo".

* ``(\w+)`` is group 2 because of the branch reset.

* ``(?P<foo>[0-9]+)`` is group 2 because it's called "foo".

If you want to prevent ``(\w+)`` from being group 2, you need to name it (different name, different group number).

Additional features
-------------------

The issue numbers relate to the Python bug tracker, except where listed otherwise.

Added ``\p{Horiz_Space}`` and ``\p{Vert_Space}`` (`GitHub issue 477 <https://github.com/mrabarnett/mrab-regex/issues/477#issuecomment-1216779547>`_)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``\p{Horiz_Space}`` or ``\p{H}`` matches horizontal whitespace and ``\p{Vert_Space}`` or ``\p{V}`` matches vertical whitespace.

Added support for lookaround in conditional pattern (`Hg issue 163 <https://github.com/mrabarnett/mrab-regex/issues/163>`_)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The test of a conditional pattern can be a lookaround.

.. sourcecode:: python

  >>> regex.match(r'(?(?=\d)\d+|\w+)', '123abc')
  <regex.Match object; span=(0, 3), match='123'>
  >>> regex.match(r'(?(?=\d)\d+|\w+)', 'abc123')
  <regex.Match object; span=(0, 6), match='abc123'>

This is not quite the same as putting a lookaround in the first branch of a pair of alternatives.

.. sourcecode:: python

  >>> print(regex.match(r'(?:(?=\d)\d+\b|\w+)', '123abc'))
  <regex.Match object; span=(0, 6), match='123abc'>
  >>> print(regex.match(r'(?(?=\d)\d+\b|\w+)', '123abc'))
  None

In the first example, the lookaround matched, but the remainder of the first branch failed to match, and so the second branch was attempted, whereas in the second example, the lookaround matched, and the first branch failed to match, but the second branch was **not** attempted.

Added POSIX matching (leftmost longest) (`Hg issue 150 <https://github.com/mrabarnett/mrab-regex/issues/150>`_)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The POSIX standard for regex is to return the leftmost longest match. This can be turned on using the ``POSIX`` flag.

.. sourcecode:: python

  >>> # Normal matching.
  >>> regex.search(r'Mr|Mrs', 'Mrs')
  <regex.Match object; span=(0, 2), match='Mr'>
  >>> regex.search(r'one(self)?(selfsufficient)?', 'oneselfsufficient')
  <regex.Match object; span=(0, 7), match='oneself'>
  >>> # POSIX matching.
  >>> regex.search(r'(?p)Mr|Mrs', 'Mrs')
  <regex.Match object; span=(0, 3), match='Mrs'>
  >>> regex.search(r'(?p)one(self)?(selfsufficient)?', 'oneselfsufficient')
  <regex.Match object; span=(0, 17), match='oneselfsufficient'>

Note that it will take longer to find matches because when it finds a match at a certain position, it won't return that immediately, but will keep looking to see if there's another longer match there.

Added ``(?(DEFINE)...)`` (`Hg issue 152 <https://github.com/mrabarnett/mrab-regex/issues/152>`_)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If there's no group called "DEFINE", then ... will be ignored except that any groups defined within it can be called and that the normal rules for numbering groups still apply.

.. sourcecode:: python

  >>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant) (?&item)', '5 elephants')
  <regex.Match object; span=(0, 11), match='5 elephants'>

Added ``(*PRUNE)``, ``(*SKIP)`` and ``(*FAIL)`` (`Hg issue 153 <https://github.com/mrabarnett/mrab-regex/issues/153>`_)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``(*PRUNE)`` discards the backtracking info up to that point. When used in an atomic group or a lookaround, it won't affect the enclosing pattern.

``(*SKIP)`` is similar to ``(*PRUNE)``, except that it also sets where in the text the next attempt to match will start. When used in an atomic group or a lookaround, it won't affect the enclosing pattern.

``(*FAIL)`` causes immediate backtracking. ``(*F)`` is a permitted abbreviation.

Added ``\K`` (`Hg issue 151 <https://github.com/mrabarnett/mrab-regex/issues/151>`_)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Keeps the part of the entire match after the position where ``\K`` occurred; the part before it is discarded.

It does not affect what groups return.

.. sourcecode:: python

  >>> m = regex.s