.. image:: https://github.com/ICRAR/ijson/actions/workflows/deploy-to-pypi.yml/badge.svg
    :target: https://github.com/ICRAR/ijson/actions/workflows/deploy-to-pypi.yml

.. image:: https://github.com/ICRAR/ijson/actions/workflows/fast-built-and-test.yml/badge.svg
    :target: https://github.com/ICRAR/ijson/actions/workflows/fast-built-and-test.yml

.. image:: https://coveralls.io/repos/github/ICRAR/ijson/badge.svg?branch=master
    :target: https://coveralls.io/github/ICRAR/ijson?branch=master

.. image:: https://badge.fury.io/py/ijson.svg
    :target: https://badge.fury.io/py/ijson

.. image:: https://img.shields.io/pypi/pyversions/ijson.svg
    :target: https://pypi.python.org/pypi/ijson

.. image:: https://img.shields.io/pypi/dd/ijson.svg
    :target: https://pypi.python.org/pypi/ijson

.. image:: https://img.shields.io/pypi/dw/ijson.svg
    :target: https://pypi.python.org/pypi/ijson

.. image:: https://img.shields.io/pypi/dm/ijson.svg
    :target: https://pypi.python.org/pypi/ijson


=====
ijson
=====

Ijson is an iterative JSON parser with standard Python iterator interfaces.

.. contents::
   :local:


Installation
============

Ijson is hosted on PyPI, so you should be able to install it via ``pip``::

  pip install ijson

Binary wheels are provided
for major platforms
and Python versions.
These are built and published automatically
using `cibuildwheel <https://cibuildwheel.readthedocs.io/en/stable/>`_
via GitHub Actions.


Usage
=====

All usage examples use a JSON document describing geographical
objects:

.. code-block:: json

    {
      "earth": {
        "europe": [
          {"name": "Paris", "type": "city", "info": { ... }},
          {"name": "Thames", "type": "river", "info": { ... }},
          // ...
        ],
        "america": [
          {"name": "Texas", "type": "state", "info": { ... }},
          // ...
        ]
      }
    }


High-level interfaces
---------------------

ijson works by continuously reading data from a JSON stream provided by the user.
This is presented as a file-like object.
In particular it must provide a ``read(size)`` method
returning either ``bytes`` (preferably) or ``str``.
Example file-like objects are
files opened with ``open``,
HTTP/HTTPS requests made using ``urllib.request.urlopen``,
``socket.socket`` objects,
and more.

The most common usage of ijson is to yield native Python objects
located under a prefix.
This is done using the ``items`` function.
Here's how to process all European cities:

.. code-block::  python

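    from urllib.request import urlopen

    import ijson

    f = urlopen('http://.../')
    # yield each element of the "europe" array as a native Python object
    objects = ijson.items(f, 'earth.europe.item')
    cities = (o for o in objects if o['type'] == 'city')
    for city in cities:
        do_something_with(city)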

For how to build a prefix see the prefix_ section below.
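
To experiment without a network connection,
the same call can be tried on an in-memory document.
The following is only a sketch:
the ``DATA`` payload and the ``european_cities`` helper
are made up for illustration,
and ``io.BytesIO`` stands in for the HTTP response.

.. code-block:: python

    import io

    import ijson

    # Made-up sample payload shaped like the document shown earlier.
    DATA = b'''
    {
      "earth": {
        "europe": [
          {"name": "Paris", "type": "city"},
          {"name": "Thames", "type": "river"}
        ],
        "america": [
          {"name": "Texas", "type": "state"}
        ]
      }
    }
    '''

    def european_cities(source):
        """Return the names of all "city" objects under earth.europe."""
        objects = ijson.items(source, 'earth.europe.item')
        return [o['name'] for o in objects if o['type'] == 'city']

    print(european_cities(io.BytesIO(DATA)))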

Other times it might be useful to iterate over object members
rather than objects themselves (e.g., when objects are too big).
In that case one can use the ``kvitems`` function instead:

.. code-block::  python

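    from urllib.request import urlopen

    import ijson

    f = urlopen('http://.../')
    # yield (key, value) pairs for the members of each "europe" element
    european_places = ijson.kvitems(f, 'earth.europe.item')
    names = (v for k, v in european_places if k == 'name')
    for name in names:
        do_something_with(name)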
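
As a rough illustration of what ``kvitems`` yields,
the sketch below runs it over a small in-memory payload.
The ``DATA`` document is invented for this example
and ``io.BytesIO`` replaces the network stream.

.. code-block:: python

    import io

    import ijson

    # Made-up payload containing two European places.
    DATA = (b'{"earth": {"europe": ['
            b'{"name": "Paris", "type": "city"}, '
            b'{"name": "Thames", "type": "river"}]}}')

    # Each member of each object under the prefix becomes one (key, value) pair.
    pairs = list(ijson.kvitems(io.BytesIO(DATA), 'earth.europe.item'))
    names = [v for k, v in pairs if k == 'name']
    print(pairs)
    print(names)

Because the pairs arrive one member at a time,
no complete object under the prefix has to be built in memory.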


Lower-level interfaces
----------------------

Sometimes, when dealing with a particularly large JSON payload, it may be worthwhile
not to construct individual Python objects at all and instead react to individual events
immediately, producing some result.
This is achieved using the ``parse`` function:

.. code-block::  python

    import sys
    from urllib.request import urlopen

    import ijson

    stream = sys.stdout  # assumption: any writable text stream works here
    parser = ijson.parse(urlopen('http://.../'))
    continent = None  # set once the first continent key has been seen
    stream.write('<geo>')
    for prefix, event, value in parser:
        if (prefix, event) == ('earth', 'map_key'):
            stream.write('<%s>' % value)
            continent = value
        elif prefix.endswith('.name'):
            stream.write('<object name="%s"/>' % value)
        elif (prefix, event) == ('earth.%s' % continent, 'end_map'):
            stream.write('</%s>' % continent)
    stream.write('</geo>')

Even more bare-bones is the ability to react to individual events
without even calculating a prefix
using the ``basic_parse`` function:

.. code-block:: python

    from urllib.request import urlopen

    import ijson

    events = ijson.basic_parse(urlopen('http://.../'))
    # count how many map keys are called "name"
    num_names = sum(1 for event, value in events
                    if event == 'map_key' and value == 'name')
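
To make the difference between the two lower-level functions concrete,
the sketch below collects the events both of them produce
for a tiny document.
``DOC`` is an invented input read from memory via ``io.BytesIO``.

.. code-block:: python

    import io

    import ijson

    DOC = b'{"a": [1, "x"]}'  # made-up input

    # basic_parse yields (event, value) pairs.
    basic_events = list(ijson.basic_parse(io.BytesIO(DOC)))
    # parse yields (prefix, event, value) triples for the same input.
    prefixed_events = list(ijson.parse(io.BytesIO(DOC)))

    for prefix, event, value in prefixed_events:
        print(repr(prefix), event, repr(value))

Both lists describe the same sequence of events;
``parse`` additionally reports where in the document each event occurred.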


Command line
------------

A command line utility is included with ijson
to help visualise the output of each of the routines above.
It reads JSON from the standard input,
and it prints the results of the parsing method chosen by the user
to the standard output.

The tool is available by running the ``ijson.dump`` module.
For example::

 $> echo '{"A": 0, "B": [1, 2, 3, 4]}' | python -m ijson.dump -m parse
 #: path, name, value
 --------------------
 0: , start_map, None
 1: , map_key, A
 2: A, number, 0
 3: , map_key, B
 4: B, start_array, None
 5: B.item, number, 1
 6: B.item, number, 2
 7: B.item, number, 3
 8: B.item, number, 4
 9: B, end_array, None
 10: , end_map, None

Using ``-h/--help`` will show all available options.


Benchmarking
------------

A command line utility is included with ijson
to help benchmark the different methods offered by the package.
It offers some built-in example inputs
that try to mimic different scenarios,
but more importantly it also supports user-provided inputs.
You can also specify which backends to time,
number of iterations,
and more.

The tool is available by running the ``ijson.benchmark`` module.
For example::

 $> python -m ijson.benchmark my/json/file.json -m items -p values.item

Using ``-h/--help`` will show all available options.


``bytes``/``str`` support
-------------------------

Although not usually how they are meant to be run,
all the functions above also accept
``bytes`` and ``str`` objects
directly as inputs.
These are then internally wrapped into a file object,
and further processed.
This is useful for testing and prototyping,
but probably not extremely useful in real-life scenarios.
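
A quick prototype along these lines might look as follows.
This is a minimal sketch with an invented payload;
it passes ``bytes`` directly instead of a file-like object.

.. code-block:: python

    import ijson

    # Made-up payload passed directly as bytes, with no file object involved.
    values = list(ijson.items(b'{"values": [1, 2, 3]}', 'values.item'))
    print(values)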


Iterator support
----------------

In many situations the direct input users want to pass to ijson
is an iterator (e.g., a generator) rather than a file-like object.
ijson provides a built-in adapter to bridge this gap:

- ``ijson.from_iter(iterable_or_async_iterable_of_bytes)``


``asyncio`` support
-------------------

All of the methods above
also work on asynchronous file-like objects,
so they can be iterated asynchronously.
In other words, something like this:

.. code-block:: python

   import asyncio
   import ijson

   async def run():
      # async_urlopen is a placeholder for any coroutine
      # returning an asynchronous file-like object
      f = await async_urlopen('http://..../')
      async for obj in ijson.items(f, 'earth.europe.item'):
         if obj['type'] == 'city':
            do_something_with(obj)

   asyncio.run(run())

An explicit set of ``*_async`` functions also exists
offering the same functionality,
except they will fail if anything other
than an asynchronous file-like object is given to them
(so the example above can also be written using ``ijson.items_async``).
In fact, in ijson version 3.0
this was the only way to access
the ``asyncio`` support.
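
For a version that runs without any network access,
one possible sketch is shown below.
It assumes that an object whose ``read`` method is a coroutine function
is treated as an asynchronous file-like object;
``AsyncBytesReader``, ``collect_cities`` and ``DATA``
are hypothetical names introduced only for this example.

.. code-block:: python

    import asyncio
    import io

    import ijson

    # Made-up payload shaped like the document shown earlier.
    DATA = (b'{"earth": {"europe": ['
            b'{"name": "Paris", "type": "city"}, '
            b'{"name": "Thames", "type": "river"}]}}')

    class AsyncBytesReader:
        """Hypothetical async file-like object backed by in-memory bytes."""

        def __init__(self, data):
            self._buf = io.BytesIO(data)

        async def read(self, size=-1):
            # A real implementation would await actual I/O here.
            return self._buf.read(size)

    async def collect_cities(data):
        """Return the names of all cities found under earth.europe."""
        reader = AsyncBytesReader(data)
        names = []
        async for obj in ijson.items(reader, 'earth.europe.item'):
            if obj['type'] == 'city':
                names.append(obj['name'])
        return names

    print(asyncio.run(collect_cities(DATA)))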


Intercepting events
-------------------

The four routines shown above
internally chain against each other:
tuples generated by ``basic_parse``
are the input for ``parse``,
whose results are the input to ``kvitems`` and ``items``.

Normally users don't see this interaction,
as they only care about the final output
of the function they invoked,
but there are occasions when tapping
into this invocation chain could be handy.
This is supported
by passing the output of one function
(i.e., an iterable of events, usually a generator)
as the input of another,
opening the door for user event filtering or injection.

For instance if one wants to skip some content
before full item parsing:

.. code-block:: python

  import io
  import ijson

  parse_events = ijson.parse(io.BytesIO(b'["skip", {"a": 1}, {"b": 2}, {"c": 3}]'))
  while True:
      prefix, event, value = next(parse_events)
      if value == "skip":
          break
  for obj in ijson.items(parse_events, 'item'):
      print(obj)


Note that this interception
only makes sense for the ``basic_parse -> parse``,
``parse -> items`` and ``parse -> kvitems`` interactions.

Note also that event interception
is currently not supported
by the ``async`` functions.


Push interfaces
---------------

All examples above use a file-like object as the data input
(both the normal case, and for ``asyncio`` support),
and hence are "pull" interfaces,
with the library reading data as necessary.
If for whatever reason it's not possible to use such a method,
you can still **push** data
through yet a different interface: `coroutines <https://www.python.org/dev/peps/pep-0342/>`_
(via generators, not ``asyncio`` coroutines).
Coroutines effectively allow users
to send data to them at any point in time,
with a final *target* coroutine-like object
receiving the results.

In the following example
the user is doing the reading
instead of letting the library do it:

.. code-block:: python

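   import functools
   from urllib.request import urlopen

   import ijson

   @ijson.coroutine
   def print_cities():
      while True:
         obj = (yield)
         if obj['type'] != 'city':
            continue
         print(obj)

   buf_size = 64 * 1024  # read size in bytes; an arbitrary choice
   coro = ijson.items_coro(print_cities(), 'earth.europe.item')
   f = urlopen('http://.../')
   # iter() with a sentinel stops once read() returns b'' (end of stream)
   for chunk in iter(functools.partial(f.read, buf_size), b''):
      coro.send(chunk)
   coro.close()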

All four ijson iterators
have a ``*_coro`` counterpart
that works by pushing data into it.
Instead of receiving a file-like object
and an optional buffer size as arguments,
they receive a single ``target`` argument,
which should be a coroutine-like object
(anything implementing a ``send`` method)
through which results will be published.
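
Since the target only needs a ``send`` method,
a plain object can play that role.
The sketch below relies on that statement;
``CityCollector``, ``DATA`` and the 16-byte chunk size
are hypothetical choices made for illustration,
with the in-memory payload pushed in small pieces
to mimic data arriving incrementally.

.. code-block:: python

    import ijson

    # Made-up payload shaped like the document shown earlier.
    DATA = (b'{"earth": {"europe": ['
            b'{"name": "Paris", "type": "city"}, '
            b'{"name": "Thames", "type": "river"}]}}')

    class CityCollector:
        """Hypothetical target: receives each parsed object through send()."""

        def __init__(self):
            self.names = []

        def send(self, obj):
            if obj['type'] == 'city':
                self.names.append(obj['name'])

    collector = CityCollector()
    coro = ijson.items_coro(collector, 'earth.europe.item')
    # Push the data in small chunks; tokens may be split across chunks.
    for start in range(0, len(DATA), 16):
        coro.send(DATA[start:start + 16])
    coro.close()
    print(collector.names)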

An alternative to providing a coroutine
is to use ``ijson.sendable_list`` to accumulate results,
provided the list is cleared after each parsing iteration,
like this:

.. code-block:: python