xmltodict
Makes working with XML feel like you are working with JSON
xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec":
>>> print(json.dumps(xmltodict.parse("""
... <mydocument has="an attribute">
... <and>
... <many>elements</many>
... <many>more elements</many>
... </and>
... <plus a="complex">
... element as well
... </plus>
... </mydocument>
... """), indent=4))
{
    "mydocument": {
        "@has": "an attribute",
        "and": {
            "many": [
                "elements",
                "more elements"
            ]
        },
        "plus": {
            "@a": "complex",
            "#text": "element as well"
        }
    }
}
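Because the result is an ordinary nested dict, values can be read with plain key lookups. The sketch below reuses the document from the "spec" above; the variable names (`root`, `attribute`, `repeated`, `text`) are only illustrative.

```python
import xmltodict

doc = xmltodict.parse("""
<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>
""")

root = doc["mydocument"]
attribute = root["@has"]        # attributes are keyed with the "@" prefix
repeated = root["and"]["many"]  # repeated elements become a list
text = root["plus"]["#text"]    # text next to attributes lives under "#text"

print(attribute)
print(repeated)
print(text)
```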
Namespace support
By default, xmltodict does no XML namespace processing (it just treats namespace declarations as regular node attributes), but passing process_namespaces=True will make it expand namespaces for you:
>>> xml = """
... <root xmlns="http://defaultns.com/"
... xmlns:a="http://a.com/"
... xmlns:b="http://b.com/">
... <x>1</x>
... <a:y>2</a:y>
... <b:z>3</b:z>
... </root>
... """
>>> xmltodict.parse(xml, process_namespaces=True) == {
...     'http://defaultns.com/:root': {
...         'http://defaultns.com/:x': '1',
...         'http://a.com/:y': '2',
...         'http://b.com/:z': '3',
...     }
... }
True
It also lets you collapse certain namespaces to shorthand prefixes, or skip them altogether:
>>> namespaces = {
...     'http://defaultns.com/': None,  # skip this namespace
...     'http://a.com/': 'ns_a',  # collapse "http://a.com/" -> "ns_a"
... }
>>> xmltodict.parse(xml, process_namespaces=True, namespaces=namespaces) == {
...     'root': {
...         'x': '1',
...         'ns_a:y': '2',
...         'http://b.com/:z': '3',
...     },
... }
True
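For comparison, here is a small sketch of the default behavior described at the start of this section, where no namespace processing happens. The `xmlns` declarations show up as ordinary attributes and prefixed tag names are kept verbatim. The trimmed-down document is an illustrative variant of the one above.

```python
import xmltodict

xml = """
<root xmlns="http://defaultns.com/"
      xmlns:a="http://a.com/">
  <x>1</x>
  <a:y>2</a:y>
</root>
"""

# No process_namespaces argument: declarations are treated as attributes.
doc = xmltodict.parse(xml)
print(doc)
```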
Streaming mode
xmltodict is very fast (Expat-based) and has a streaming mode with a small memory footprint, suitable for big XML dumps like Discogs or Wikipedia:
>>> from gzip import GzipFile
>>>
>>> def handle_artist(_, artist):
...     print(artist['name'])
...     return True
>>>
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...
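To try streaming mode without downloading a large dump, the same callback pattern can be run against a small in-memory document. This is a minimal sketch: the XML string and the `names` list are made up for illustration, and the callback collects values instead of printing them.

```python
import xmltodict

xml = (
    "<artists>"
    "<artist><name>King Crimson</name></artist>"
    "<artist><name>Chris Potter</name></artist>"
    "</artists>"
)

names = []

def handle_artist(path, artist):
    # Called once per element at depth 2 (each <artist>).
    names.append(artist["name"])
    return True  # a true value tells the parser to keep going

xmltodict.parse(xml, item_depth=2, item_callback=handle_artist)
print(names)
```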
It can also be used from the command line to pipe objects to a script like this:
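import sys, marshal
# Read (path, item) tuples from binary stdin until the stream is exhausted.
while True:
    try:
        _, article = marshal.load(sys.stdin.buffer)
    except EOFError:
        break
    print(article['title'])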
$ bunzip2 -c enwiki-pages-articles.xml.bz2 | xmltodict.py 2 | myscript.py
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople
AfghanistanCommunications
Autism
...
Or just cache the dicts so you don't have to parse that big XML file again. You do this only once:
$ bunzip2 -c enwiki-pages-articles.xml.bz2 | xmltodict.py 2 | gzip > enwiki.dicts.gz
And you reuse the dicts with every script that needs them:
$ gunzip -c enwiki.dicts.gz | script1.py
$ gunzip -c enwiki.dicts.gz | script2.py
...
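The cached stream is a sequence of marshal-encoded `(path, item)` tuples, so a consuming script only needs to call `marshal.load` until the input runs out. The sketch below simulates such a stream with `io.BytesIO` and the standard library only. The helper `iter_records` and the sample records are hypothetical, not part of xmltodict.

```python
import io
import marshal

def iter_records(stream):
    """Yield (path, item) tuples from a binary stream of marshal dumps."""
    while True:
        try:
            yield marshal.load(stream)
        except EOFError:
            return  # end of stream reached

# Build a small in-memory stand-in for the cached dump.
buffer = io.BytesIO()
for title in ("AccessibleComputing", "Anarchism"):
    path = [("mediawiki", None), ("page", None)]
    marshal.dump((path, {"title": title}), buffer)
buffer.seek(0)

titles = [article["title"] for _, article in iter_records(buffer)]
print(titles)
```

In a real script, `sys.stdin.buffer` would take the place of `buffer`.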
Roundtripping
You can also convert in the other direction, using the unparse() function:
>>> mydict = {
... 'response': {
... 'status': 'good',
... 'last_updated': '2014-02-16T23:10:12Z',
... }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<response>
	<status>good</status>
	<last_updated>2014-02-16T23:10:12Z</last_updated>
</response>
Text values for nodes can be specified with the cdata_key key in the Python dict, while node attributes can be specified by prefixing the key name with attr_prefix. The default value for attr_prefix is @ and the default value for cdata_key is #text.
>>> import xmltodict
>>>
>>> mydict = {
...     'text': {
...         '@color': 'red',
...         '@stroke': '2',
...         '#text': 'This is a test'
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<text color="red" stroke="2">This is a test</text>
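As a quick check of the roundtrip, the sketch below unparses a dict and parses the result again. It assumes all values are strings, since parsing always yields strings. The names `original`, `xml` and `restored` are illustrative.

```python
import xmltodict

original = {
    "text": {
        "@color": "red",
        "@stroke": "2",
        "#text": "This is a test",
    }
}

xml = xmltodict.unparse(original)  # dict -> XML string
restored = xmltodict.parse(xml)    # XML string -> dict

print(restored == original)
```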
Lists that are specified under a key in a dictionary use the key as a tag for each item. But if a list does not have a parent key, for example if a list exists inside another list, it has no tag to use and its items are converted to a string, as shown in the example below. To give tags to nested lists, use the expand_iter keyword argument to provide a tag, as demonstrated below. Note that using expand_iter will break roundtripping.
>>> mydict = {
...     "line": {
...         "points": [
...             [1, 5],
...             [2, 6],
...         ]
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<line>
	<points>[1, 5]</points>
	<points>[2, 6]</points>
</line>
>>> print(xmltodict.unparse(mydict, pretty=True, expand_iter="coord"))
<?xml version="1.0" encoding="utf-8"?>
<line>
	<points>
		<coord>1</coord>
		<coord>5</coord>
	</points>
	<points>
		<coord>2</coord>
		<coord>6</coord>
	</points>
</line>
API Reference
xmltodict.parse()
Parse XML input into a Python dictionary.
xml_input: XML input as a string, file-like object, or generator of strings.
encoding=None: Character encoding for the input XML.
expat=expat: XML parser module to use.
process_namespaces=False: Expand XML namespaces if True.
namespace_separator=':': Separator between namespace URI and local name.
disable_entities=True: Disable entity parsing for security.
process_comments=False: Include XML comments if True. Comments are ignored by default; when enabled they are preserved, but multiple top-level comments may not be preserved in exact order.
xml_attribs=True: Include attributes in output dict (with attr_prefix).
attr_prefix='@': Prefix for XML attributes in the dict.
cdata_key='#text': Key for text content in the dict.
force_cdata=False: Force text content to be wrapped as CDATA for specific elements. Can be a boolean (True/False), a tuple of element names to force CDATA for, or a callable function that receives (path, key, value) and returns True/False.
cdata_separator='': Separator string used to join adjacent text nodes. For example, set to a space to avoid concatenation.
postprocessor=None: Function to modify parsed items.
dict_constructor=dict: Constructor for dictionaries (e.g., dict).
strip_whitespace=True: Remove leading/trailing whitespace in text nodes. Set to False to preserve whitespace exactly. When process_comments=True, this same flag also trims comment text; disable strip_whitespace if you need to preserve comment indentation or padding.
namespaces=None: Mapping of namespaces to prefixes, or None to keep full URIs.
force_list=None: Force list values for specific elements. Can be a boolean (True/False), a tuple of element names to force lists for, or a callable function that receives (path, key, value) and returns True/False. Useful for elements that may appear once or multiple times to ensure consistent list output.
item_depth=0: Depth at which to call item_callback.
item_callback=lambda *args: True: Function called on items at item_depth.
comment_key='#comment': Key used for XML comments. Only used when process_comments=True.
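Two of the parameters above, force_list and postprocessor, are easier to grasp with a short example. The sketch below forces `b` to always be a list and uses a hypothetical `to_int` postprocessor that converts numeric text to integers while leaving everything else untouched.

```python
import xmltodict

# Without force_list a single <b> would be a plain string.
single = xmltodict.parse("<a><b>1</b></a>", force_list=("b",))
print(single)

def to_int(path, key, value):
    """Return (key, value), converting the value to int when possible."""
    try:
        return key, int(value)
    except (ValueError, TypeError):
        return key, value

converted = xmltodict.parse(
    "<a><b>1</b><b>2</b><b>x</b></a>", postprocessor=to_int
)
print(converted)
```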
xmltodict.unparse()
Convert a Python dictionary back into XML.
input_dict: Dictionary to convert to XML.
output=None: File-like object to write XML to; returns string if None.
encoding='utf-8': Encoding of the output XML.
bytes_errors='replace': Error handler used when decoding byte values during unparse (for example 'replace', 'strict', 'ignore').
full_document=True: Include XML declaration if True.
short_empty_elements=False: Use short tags for empty elements (<tag/>).
attr_prefix='@': Prefix for dictionary keys representing attributes.
cdata_key='#text': Key for text content in the dictionary.
pretty=False: Pretty-print the XML output.
indent='\t': Indentation string for pretty printing.
newl='\n': Newline character for pretty printing.
expand_iter=None: Tag name to use for items in nested lists (breaks roundtripping).
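As an illustration of full_document and short_empty_elements, the following sketch emits an XML fragment without the declaration, first with the default long form for an empty element and then with the short form. The sample dict is made up for the example.

```python
import xmltodict

data = {"a": {"b": None, "c": "text"}}

# No XML declaration; the empty element is written as <b></b>.
fragment = xmltodict.unparse(data, full_document=False)
print(fragment)

# Same fragment, but the empty element is written as <b/>.
compact = xmltodict.unparse(
    data, full_document=False, short_empty_elements=True
)
print(compact)
```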
Note: When building XML from dictionaries, keys whose values are empty lists are skipped. For example, {'a': []} produces no <a> element. Add a placeholder child (for example, {'a': ['']}) if an explicit empty container element is required in the output.
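The empty-list behavior can be sketched as follows. The surrounding `root` element and the sibling `b` are only there to keep the output a single well-formed fragment.

```python
import xmltodict

# The empty list under "a" produces no <a> element at all.
skipped = xmltodict.unparse(
    {"root": {"a": [], "b": "x"}}, full_document=False
)
print(skipped)

# A placeholder empty string yields an explicit empty <a> element.
placeholder = xmltodict.unparse(
    {"root": {"a": [""], "b": "x"}}, full_document=False
)
print(placeholder)
```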
Note: xmltodict aims to cover the common 90% of cases. It does not preserve every XML nuance (attribute order, mixed content ordering, multiple top-level comments). For exact fidelity, use a full XML library such as lxml.
Examples
Selective force_cdata
The force_cdata parameter can be used to selectively force CDATA wrapping for specific elements:
>>> xml = '<a><b>data1</b><c>data2</c><d>data3</d></a>'
>>> # Force CDATA only for 'b' and 'd' elements
>>> xmltodict.parse(xml, force_cdata=('b', 'd'))
{'a': {'b': {'#text': 'data1'}, 'c': 'data2', 'd': {'#text': 'data3'}}}
>>> # Force CDATA for all elements (original behavior)
>>> xmltodict.parse(xml, force_cdata=True)
{'a': {'b': {'#text': 'data1'}, 'c': {'#text': 'data2'}, 'd': {'#text': 'data3'}}}
>>> # Use a callable for complex logic
>>> def should_force_cda