xmltodict

Makes working with XML feel like you are working with JSON

Downloads: 0 (30 days)

Description

xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec":

>>> import json
>>> import xmltodict
>>> print(json.dumps(xmltodict.parse("""
...  <mydocument has="an attribute">
...    <and>
...      <many>elements</many>
...      <many>more elements</many>
...    </and>
...    <plus a="complex">
...      element as well
...    </plus>
...  </mydocument>
...  """), indent=4))
{
    "mydocument": {
        "@has": "an attribute",
        "and": {
            "many": [
                "elements",
                "more elements"
            ]
        },
        "plus": {
            "@a": "complex",
            "#text": "element as well"
        }
    }
}
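Because the result is a plain dictionary, values are read with ordinary key lookups. The sketch below reuses the document from the example above; the variable names are illustrative only.

```python
import xmltodict

doc = xmltodict.parse("""
<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>
""")

root = doc["mydocument"]
attribute = root["@has"]        # attributes are keyed with the "@" prefix
repeated = root["and"]["many"]  # repeated child elements become a list
text = root["plus"]["#text"]    # text next to attributes lives under "#text"
```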

Namespace support

By default, xmltodict does no XML namespace processing (it just treats namespace declarations as regular node attributes), but passing process_namespaces=True will make it expand namespaces for you:

>>> xml = """
... <root xmlns="http://defaultns.com/"
...       xmlns:a="http://a.com/"
...       xmlns:b="http://b.com/">
...   <x>1</x>
...   <a:y>2</a:y>
...   <b:z>3</b:z>
... </root>
... """
>>> xmltodict.parse(xml, process_namespaces=True) == {
...     'http://defaultns.com/:root': {
...         'http://defaultns.com/:x': '1',
...         'http://a.com/:y': '2',
...         'http://b.com/:z': '3',
...     }
... }
True

It also lets you collapse certain namespaces to shorthand prefixes, or skip them altogether:

>>> namespaces = {
...     'http://defaultns.com/': None, # skip this namespace
...     'http://a.com/': 'ns_a', # collapse "http://a.com/" -> "ns_a"
... }
>>> xmltodict.parse(xml, process_namespaces=True, namespaces=namespaces) == {
...     'root': {
...         'x': '1',
...         'ns_a:y': '2',
...         'http://b.com/:z': '3',
...     },
... }
True

Streaming mode

xmltodict is built on the Expat parser, which makes it fast, and its streaming mode keeps the memory footprint small, making it suitable for big XML dumps like Discogs or Wikipedia:

>>> from gzip import GzipFile
>>> def handle_artist(_, artist):
...     print(artist['name'])
...     return True
>>>
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...
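The same pattern can be tried without a large dump. The following is a minimal sketch on an in-memory string; the sample XML and the callback name collect_first_two are made up for illustration. It also shows that a callback returning a falsy value stops the parse, which xmltodict signals by raising ParsingInterrupted.

```python
import xmltodict

xml = """
<artists>
  <artist><name>A Perfect Circle</name></artist>
  <artist><name>King Crimson</name></artist>
  <artist><name>Chris Potter</name></artist>
</artists>
"""

names = []

def collect_first_two(path, artist):
    # path identifies where the item sits in the document; unused here.
    names.append(artist["name"])
    # Returning a falsy value asks xmltodict to stop parsing early.
    return len(names) < 2

try:
    xmltodict.parse(xml, item_depth=2, item_callback=collect_first_two)
except xmltodict.ParsingInterrupted:
    # Raised when the callback returns a falsy value.
    pass
```

After this runs, names holds only the first two artists; the third is never delivered to the callback.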

It can also be used from the command line to pipe objects to a script like this:

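import sys, marshal
while True:
    try:
        # Each record is a marshalled (path, item) tuple; read bytes, not text.
        _, article = marshal.load(sys.stdin.buffer)
    except EOFError:
        break
    print(article['title'])

$ bunzip2 -c enwiki-pages-articles.xml.bz2 | xmltodict.py 2 | myscript.py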
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople
AfghanistanCommunications
Autism
...

Or just cache the dicts so you don't have to parse that big XML file again. You do this only once:

$ bunzip2 -c enwiki-pages-articles.xml.bz2 | xmltodict.py 2 | gzip > enwiki.dicts.gz

And you reuse the dicts with every script that needs them:

$ gunzip -c enwiki.dicts.gz | script1.py
$ gunzip -c enwiki.dicts.gz | script2.py
...
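A script can also read the cached file directly instead of going through a pipe. The sketch below assumes the cache is a gzip-compressed sequence of marshalled (path, item) tuples, as in the reader script shown earlier. The helper name iter_cached_items and the two sample records are hypothetical stand-ins, not real Wikipedia data.

```python
import gzip
import marshal
import os
import tempfile

def iter_cached_items(filename):
    """Yield (path, item) tuples from a gzip-compressed marshal stream."""
    with gzip.open(filename, "rb") as stream:
        while True:
            try:
                yield marshal.load(stream)
            except EOFError:
                # End of the stream: no more records to read.
                return

# Stand-in for enwiki.dicts.gz: two hand-written records in the assumed layout.
records = [
    ([("mediawiki", None), ("page", None)], {"title": "Anarchism"}),
    ([("mediawiki", None), ("page", None)], {"title": "Autism"}),
]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "sample.dicts.gz")
    with gzip.open(cache, "wb") as out:
        for record in records:
            marshal.dump(record, out)
    titles = [item["title"] for _, item in iter_cached_items(cache)]
```

Reading record by record keeps memory use proportional to a single item rather than to the whole cache.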

Roundtripping

You can also convert in the other direction, using the unparse() function:

>>> mydict = {
...     'response': {
...             'status': 'good',
...             'last_updated': '2014-02-16T23:10:12Z',
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<response>
	<status>good</status>
	<last_updated>2014-02-16T23:10:12Z</last_updated>
</response>

Text values for nodes can be specified with the cdata_key key in the Python dict, while node attributes can be specified by prefixing the key name with attr_prefix. The default value for attr_prefix is @ and the default value for cdata_key is #text.

>>> import xmltodict
>>>
>>> mydict = {
...     'text': {
...         '@color':'red',
...         '@stroke':'2',
...         '#text':'This is a test'
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<text color="red" stroke="2">This is a test</text>
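As a quick check of the roundtrip, the sketch below feeds the output of unparse() back into parse(). It also shows a custom attr_prefix and cdata_key, which must be passed to both functions for the roundtrip to hold; the replacement values "_" and "value" are arbitrary choices for illustration.

```python
import xmltodict

original = {
    "text": {
        "@color": "red",
        "@stroke": "2",
        "#text": "This is a test",
    }
}

# Default prefixes: parse() reverses unparse().
xml = xmltodict.unparse(original)
restored = xmltodict.parse(xml)

# Custom prefixes: use the same values in both directions.
custom = xmltodict.parse(xml, attr_prefix="_", cdata_key="value")
xml_again = xmltodict.unparse(custom, attr_prefix="_", cdata_key="value")
```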

Lists that are specified under a key in a dictionary use the key as a tag for each item. But if a list does not have a parent key, for example if a list exists inside another list, it does not have a tag to use and its items are converted to a string, as shown in the example below. To give tags to nested lists, use the expand_iter keyword argument to provide a tag, as demonstrated below. Note that using expand_iter will break roundtripping.

>>> mydict = {
...     "line": {
...         "points": [
...             [1, 5],
...             [2, 6],
...         ]
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<line>
        <points>[1, 5]</points>
        <points>[2, 6]</points>
</line>
>>> print(xmltodict.unparse(mydict, pretty=True, expand_iter="coord"))
<?xml version="1.0" encoding="utf-8"?>
<line>
        <points>
                <coord>1</coord>
                <coord>5</coord>
        </points>
        <points>
                <coord>2</coord>
                <coord>6</coord>
        </points>
</line>

API Reference

xmltodict.parse()

Parse XML input into a Python dictionary.

  • xml_input: XML input as a string, file-like object, or generator of strings.
  • encoding=None: Character encoding for the input XML.
  • expat=expat: XML parser module to use.
  • process_namespaces=False: Expand XML namespaces if True.
  • namespace_separator=':': Separator between namespace URI and local name.
  • disable_entities=True: Disable entity parsing for security.
  • process_comments=False: Include XML comments if True; by default they are ignored. The exact order of multiple top-level comments may not be preserved.
  • xml_attribs=True: Include attributes in output dict (with attr_prefix).
  • attr_prefix='@': Prefix for XML attributes in the dict.
  • cdata_key='#text': Key for text content in the dict.
  • force_cdata=False: Always store text content under cdata_key (for example {'#text': 'data1'}) instead of as a bare string. Can be a boolean (True/False), a tuple of element names to apply this to, or a callable function that receives (path, key, value) and returns True/False.
  • cdata_separator='': Separator string used to join adjacent text nodes. For example, set it to a space to keep the pieces from running together.
  • postprocessor=None: Function to modify parsed items.
  • dict_constructor=dict: Constructor for dictionaries (e.g., collections.OrderedDict).
  • strip_whitespace=True: Remove leading/trailing whitespace in text nodes. Set to False to preserve whitespace exactly. When process_comments=True, this same flag also trims comment text; disable strip_whitespace if you need to preserve comment indentation or padding.
  • namespaces=None: Mapping of namespaces to prefixes, or None to keep full URIs.
  • force_list=None: Force list values for specific elements. Can be a boolean (True/False), a tuple of element names to force lists for, or a callable function that receives (path, key, value) and returns True/False. Useful for elements that may appear once or multiple times to ensure consistent list output.
  • item_depth=0: Depth at which to call item_callback.
  • item_callback=lambda *args: True: Function called on items at item_depth.
  • comment_key='#comment': Key used for XML comments. Only used when process_comments=True.
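Two of these options are easiest to understand from their effect on the output. The sketch below uses force_list to keep a single occurrence as a list, and a postprocessor that converts numeric text to int. The sample XML and the function name numbers_to_int are made up for illustration.

```python
import xmltodict

# force_list: "item" is always a list, even when it occurs only once.
single = xmltodict.parse(
    "<order><item>pen</item></order>",
    force_list=("item",),
)

def numbers_to_int(path, key, value):
    # Return the (key, value) pair to store; non-numeric values pass through.
    try:
        return key, int(value)
    except (ValueError, TypeError):
        return key, value

typed = xmltodict.parse(
    "<order><qty>3</qty><note>gift</note></order>",
    postprocessor=numbers_to_int,
)
```

Without force_list, the first call would return 'pen' as a bare string, so code that iterates over the items would behave differently depending on how many there are.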

xmltodict.unparse()

Convert a Python dictionary back into XML.

  • input_dict: Dictionary to convert to XML.
  • output=None: File-like object to write XML to; returns string if None.
  • encoding='utf-8': Encoding of the output XML.
  • bytes_errors='replace': Error handler used when decoding byte values during unparse (for example 'replace', 'strict', 'ignore').
  • full_document=True: Include XML declaration if True.
  • short_empty_elements=False: Use short tags for empty elements (<tag/>).
  • attr_prefix='@': Prefix for dictionary keys representing attributes.
  • cdata_key='#text': Key for text content in the dictionary.
  • pretty=False: Pretty-print the XML output.
  • indent='\t': Indentation string for pretty printing.
  • newl='\n': Newline character for pretty printing.
  • expand_iter=None: Tag name to use for items in nested lists (breaks roundtripping).

Note: When building XML from dictionaries, keys whose values are empty lists are skipped. For example, {'a': []} produces no <a> element. Add a placeholder child (for example, {'a': ['']}) if an explicit empty container element is required in the output.
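A short sketch of how three of these behaviors show up in the output: full_document=False, short_empty_elements=True, and the skipping of empty lists. The dictionaries are illustrative only.

```python
import xmltodict

# full_document=False omits the XML declaration.
fragment = xmltodict.unparse({"a": None}, full_document=False)

# short_empty_elements=True collapses empty elements to <a/>.
short = xmltodict.unparse(
    {"a": None}, full_document=False, short_empty_elements=True
)

# Keys whose value is an empty list are skipped entirely.
skipped = xmltodict.unparse(
    {"root": {"a": [], "b": "1"}}, full_document=False
)
```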

Note: xmltodict aims to cover the common 90% of cases. It does not preserve every XML nuance (attribute order, mixed content ordering, multiple top-level comments). For exact fidelity, use a full XML library such as lxml.

Examples

Selective force_cdata

The force_cdata parameter can be used to selectively wrap text content under the cdata_key ('#text' by default) for specific elements:

>>> xml = '<a><b>data1</b><c>data2</c><d>data3</d></a>'
>>> # Force CDATA only for 'b' and 'd' elements
>>> xmltodict.parse(xml, force_cdata=('b', 'd'))
{'a': {'b': {'#text': 'data1'}, 'c': 'data2', 'd': {'#text': 'data3'}}}

>>> # Force CDATA for all elements (original behavior)
>>> xmltodict.parse(xml, force_cdata=True)
{'a': {'b': {'#text': 'data1'}, 'c': {'#text': 'data2'}, 'd': {'#text': 'data3'}}}

>>> # Use a callable for complex logic
>>> def should_force_cda