Python's xml.etree.ElementTree module can be used to extract data from simple XML documents. As a demonstration, suppose you want to parse the RSS feed on Planet Python. Here is the corresponding code:

from urllib.request import urlopen
from xml.etree.ElementTree import parse

# Download the RSS feed and parse it
u = urlopen("http://planet.python.org/rss20.xml")
doc = parse(u)

# Extract and output tags of interest
for item in doc.iterfind("channel/item"):
    title = item.findtext("title")
    date = item.findtext("pubDate")
    link = item.findtext("link")

    print(title)
    print(date)
    print(link)
    print()

If you run the above code, the output looks like this:

Steve Holden: Python for Data Analysis
Mon, 19 Nov 2012 02:13:51 +0000

Vasudev Ram: The Python Data Model (for v2 and v3)
Sun, 18 Nov 2012 22:06:47 +0000

Python Diary: Been playing around with Object Databases
Sun, 18 Nov 2012 20:40:29 +0000

Vasudev Ram: Wakari, Scientific Python in the Cloud
Sun, 18 Nov 2012 20:19:41 +0000

Jesse Jiryu Davis: Toro: synchronization primitives for Tornado coroutines
Sun, 18 Nov 2012 20:17:49 +0000

Obviously, if you want to do further processing, you can replace the print() calls with something more interesting.
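For example, instead of printing, you might collect each entry into a dictionary for later processing. A minimal sketch (the inline feed string here is a made-up stand-in for the real download, so the snippet runs without network access):

```python
from io import StringIO
from xml.etree.ElementTree import parse

# A tiny inline feed standing in for the downloaded RSS document
FEED = StringIO("""<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <item>
      <title>First post</title>
      <pubDate>Mon, 19 Nov 2012 02:13:51 +0000</pubDate>
      <link>http://example.org/first</link>
    </item>
  </channel>
</rss>""")

doc = parse(FEED)

# Collect each item into a dict instead of printing it
items = [
    {
        "title": item.findtext("title"),
        "date": item.findtext("pubDate"),
        "link": item.findtext("link"),
    }
    for item in doc.iterfind("channel/item")
]
print(items[0]["title"])   # First post
```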

Processing XML-encoded data is common in many applications. Not only is XML widely used for data exchange on the Internet, it is also a common format for storing application data (word processing documents, music libraries, and so on). The following discussion assumes the reader is already familiar with the basics of XML.

In many cases, when XML is used simply to store data, the corresponding document structure is compact and intuitive. For example, the RSS feed from the above example looks like this:

<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Planet Python</title>
    <link>http://planet.python.org/</link>
    <description>Planet Python - http://planet.python.org/</description>
    <item>
      <title>Steve Holden: Python for Data Analysis</title>
      <pubDate>Mon, 19 Nov 2012 02:13:51 +0000</pubDate>
    </item>
    <item>
      <title>Vasudev Ram: The Python Data Model (for v2 and v3)</title>
      <pubDate>Sun, 18 Nov 2012 22:06:47 +0000</pubDate>
    </item>
    <item>
      <title>Python Diary: Been playing around with Object Databases</title>
      <pubDate>Sun, 18 Nov 2012 20:40:29 +0000</pubDate>
    </item>
    ...
  </channel>
</rss>

The xml.etree.ElementTree.parse() function parses the entire XML document into a document object. You can then use find(), iterfind(), and findtext() to search for specific XML elements. The arguments to these functions are tag names, such as channel/item or title.

Every time you specify a tag, you need to know the overall document structure. Each search operation starts from a starting element, and the tag name given to each operation is a path relative to that starting element. For example, doc.iterfind("channel/item") searches for all item elements below the channel element, where doc represents the top of the document (the top-level rss element). Subsequent calls such as item.findtext() then search relative to the found item element.
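To illustrate the relative-path behavior, here is a small self-contained sketch (the tiny inline document is made up purely for demonstration):

```python
from io import StringIO
from xml.etree.ElementTree import parse

doc = parse(StringIO(
    "<rss><channel>"
    "<item><title>A</title></item>"
    "<item><title>B</title></item>"
    "</channel></rss>"))

# Searches are relative to the element they are invoked on:
# doc searches from the top-level <rss> element ...
items = list(doc.iterfind("channel/item"))

# ... while each item searches only within itself
titles = [item.findtext("title") for item in items]
print(titles)   # ['A', 'B']
```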

Each element represented by the ElementTree module has a few attributes and methods that are very useful when parsing. The tag attribute contains the name of the tag, the text attribute contains the enclosed text, and the get() method fetches attribute values. For example:

>>> doc
<xml.etree.ElementTree.ElementTree object at 0x101339510>
>>> e = doc.find("channel/title")
>>> e
<Element 'title' at 0x10135b310>
>>> e.tag
'title'
>>> e.text
'Planet Python'
>>> e.get("some_attribute")
>>>

It is important to note that xml.etree.ElementTree is not the only option for parsing XML. For more advanced applications, consider lxml. It uses the same programming interface as ElementTree, so the example above works with lxml as well; you only need to change the first import statement to from lxml.etree import parse. lxml is fully compliant with the XML standard, is extremely fast, and also supports validation, XSLT, and XPath.
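Because the two libraries expose the same interface, a common pattern is to try lxml first and fall back to the standard library (a sketch; lxml is a third-party package that must be installed separately, e.g. with pip):

```python
# Prefer lxml when it is installed; fall back to the standard library.
# Both modules provide the same parse()/find()/iterfind() interface.
try:
    from lxml.etree import parse   # fast, validating, XPath-capable
except ImportError:
    from xml.etree.ElementTree import parse
```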

Incrementally parsing large XML files

Any time you encounter incremental data processing, the first thing you should think of is iterators and generators. Here is a very simple function that incrementally processes a large XML file using very little memory:

from xml.etree.ElementTree import iterparse

def parse_and_remove(filename, path):
    path_parts = path.split("/")
    doc = iterparse(filename, ("start", "end"))
    # Skip the root element
    next(doc)

    tag_stack = []
    elem_stack = []
    for event, elem in doc:
        if event == "start":
            tag_stack.append(elem.tag)
            elem_stack.append(elem)
        elif event == "end":
            if tag_stack == path_parts:
                yield elem
                elem_stack[-2].remove(elem)
            try:
                tag_stack.pop()
                elem_stack.pop()
            except IndexError:
                pass

To test this function, you need a large XML file to work with. You can often find such data on government and open-data websites. For example, you can download the City of Chicago pothole database in XML format. At the time of this writing, the download contained more than 100,000 rows of data, with a structure roughly like this (abbreviated; the element names shown are the ones that appear in the code and session below):

<response>
   <row>
      <row ...>
         <creation_date>...</creation_date>
         <status>...</status>
         ...
         <zip>...</zip>
      </row>
      ...
   </row>
</response>

Suppose you want to write a script that ranks ZIP codes by the number of pothole reports. You could do it like this:

from xml.etree.ElementTree import parse
from collections import Counter

potholes_by_zip = Counter()
doc = parse("potholes.xml")
for pothole in doc.iterfind("row/row"):
    potholes_by_zip[pothole.findtext("zip")] += 1

for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)

The only problem with this script is that it reads and parses the entire XML file into memory. On my machine, running it takes about 450 MB of memory. Using the following code instead, the program changes only slightly:

from collections import Counter

potholes_by_zip = Counter()
data = parse_and_remove("potholes.xml", "row/row")
for pothole in data:
    potholes_by_zip[pothole.findtext("zip")] += 1

for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)

The result: this version of the code needs only about 7 MB of memory to run, a huge saving in memory resources.


This technique relies on two core features of the ElementTree module. First, the iterparse() function allows incremental processing of XML documents. To use it, supply the filename along with an event list containing one or more of the following: start, end, start-ns, and end-ns. The iterator created by iterparse() produces tuples of the form (event, elem), where event is one of the listed events and elem is the corresponding XML element. For example:

>>> data = iterparse("potholes.xml", ("start", "end"))
>>> next(data)
('start', <Element 'response' at 0x100771d60>)
>>> next(data)
('start', <Element 'row' at 0x100771e68>)
>>> next(data)
('start', <Element 'row' at 0x100771fc8>)
>>> next(data)
('start', <Element 'creation_date' at 0x100771f18>)
>>> next(data)
('end', <Element 'creation_date' at 0x100771f18>)
>>> next(data)
('start', <Element 'status' at 0x1006a7f18>)
>>> next(data)
('end', <Element 'status' at 0x1006a7f18>)

start events are created when an element is first encountered, before any of its content (such as child elements) has been added. end events are created when an element is complete. Although not demonstrated here, start-ns and end-ns events are used to handle XML namespace declarations.
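The distinction matters because an element's text and children are only guaranteed to be present once its end event arrives. A small sketch (toy tags, made up for illustration):

```python
from io import StringIO
from xml.etree.ElementTree import iterparse

events = []
for event, elem in iterparse(StringIO("<a><b>hi</b></a>"), ("start", "end")):
    # At a 'start' event the element object exists but its text has
    # not been read yet; by its 'end' event the text is filled in.
    events.append((event, elem.tag, elem.text))

print(events)
```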

In this recipe, the start and end events are used to manage the tag and element stacks. The stacks represent the document hierarchy as it is being parsed, and are also used to determine whether an element matches the path passed to parse_and_remove(). If it matches, the yield statement returns the element to the caller.

The statement following the yield is the core ElementTree feature that keeps the program's memory footprint so small:

elem_stack[-2].remove(elem)

This statement removes the element previously produced by yield from its parent. Assuming no other references to it remain, the element is destroyed and its memory reclaimed.
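A tiny standalone sketch of what remove() does to the tree (toy element names, unrelated to the pothole data):

```python
from xml.etree.ElementTree import fromstring

root = fromstring("<parent><child>a</child><child>b</child></parent>")
first = root[0]

# Detach the first child from its parent; with no other references
# left, the element becomes garbage and its memory can be reclaimed.
root.remove(first)
print(len(root))          # 1
print(root[0].text)       # b
```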

The net effect of iteratively parsing and removing nodes is an efficient incremental sweep over the document: the full document tree is never constructed. Even so, the XML data can still be processed in the straightforward way shown above.

The main drawback of this approach is its runtime performance. In my own tests, the version that reads the entire document into memory runs almost twice as fast as the incremental version, but it uses 60 times more memory. So if memory usage is your primary concern, the incremental version is a big win.
