Home>

In terms of xml parsing,Python implements its own "batteries included" principle. In the built-in standard library,Python provides a large number of packages and tools that can be used to process the XML language.The quantity,It even leaves new Python programmers with no choice.

This article will introduce several ways to parse XML files using Python.Take the elementtree module recommended by the author as an example.Demonstrate specific usage and scenarios.The python version used in this article is 2.7.

What is xml?

XML is an abbreviation of extensible markup language, where markup is the key part.You can create content,Then mark it with a qualifying tag,This makes each word, phrase, or block identifiable and categorizable.

Markup languages ​​have evolved from early private companies and government-developed forms into standard generalized markup language (sgml), hypertext markup language (html), and eventually evolved into XML. XML has the following characteristics.

xml is designed to transmit data,Instead of displaying data.The xml tags are not predefined.You need to define the labels yourself.XML is designed to be self-describing.XML is the recommended standard for w3c.

At present, xml plays a role in the web no less than html, which has been the cornerstone of the web. XML is everywhere.XML is the most commonly used tool for data transfer between various applications.And it is becoming more and more popular in the field of information storage and description.Therefore, learning how to parse xml files is very important for web development.

What are the Python packages that can parse XML?

Python's standard library,Provides 6 packages that can be used to process XML.

xml.dom

xml.dom implements the dom api developed by w3c. If you are used to using dom api or someone asks for this,You can use this package.But be careful,In this package,Several different modules are also provided,Their performance is different.

dom parser before any processing starts,The tree data generated based on the xml file must be stored in memory,So the memory usage of the dom parser is completely based on the size of the input data.

xml.dom.minidom

xml.dom.minidom is a very simplified implementation of the dom api,Much simpler than the full version of dom,And this bag is much smaller.Those who are not familiar with dom should consider using the xml.etree.elementtree module. According to the authors of lxml,This module is not convenient to use,Efficiency is not high,It is also prone to problems.

xml.dom.pulldom

Unlike other modules,The xml.dom.pulldom module provides a "pull parser". The basic concept behind it is to pull events from the XML stream and then process them.Although it uses the event-driven processing model like sax, the difference is thatWhen using the pull parser,Consumers need to explicitly pull events from the xml stream and traverse these events,Until processing is complete or an error occurs.

Pull parsing is a recent XML processing trend.Previously popular XML parsing frameworks such as sax and dom,Are push-based, that is, control over parsing,In the hands of the parser.

xml.sax

The xml.sax module implements the sax api, which sacrifices convenience for speed and memory footprint.sax is an abbreviation for simple api for xml, it is not a standard proposed by the w3c official.It's event-driven,It is not necessary to read the entire document at once,The reading process of the document is the parsing process of sax.So-called event-driven,Refers to a method of running a program based on a callback mechanism.

xml.parser.expat

xml.parser.expat provides a direct, low-level API interface to the expat parser written in C language. The expat interface is similar to sax, but also based on the event callback mechanism.But this interface is not standardized,Only applicable to expat library.

expat is a stream-oriented parser.You register the parser callback (or handler) function and start searching for its documentation.When the parser recognizes the specified location of the file,It will call the appropriate handler for that part (if you have already registered one). The file is passed to the parser,Will be split into multiple pieces,And loaded into memory in sections.So expat can parse those huge files.

xml.etree.elementtree (hereinafter referred to as et)

The xml.etree.elementtree module provides a lightweight, pythonic API, and also an efficient C language implementation.That is, xml.etree.celementtree. Compared to dom, et is faster,The use of api is more direct and convenient. Compared with sax, et.iterparse function also provides the function of on-demand parsing.The entire document is not read in memory at once.The performance of et is similar to that of the sax module.But its api is more high-level,Users are more convenient to use.

I recommend,When using python for XML parsing, the et module is preferredUnless you have other special needs,Additional modules may be required.

These APIs for parsing XML are not unique to Python. Python is also introduced by borrowing from other languages ​​or directly from other languages.For example, expat is a development library developed in C language for parsing XML documents.And sax was originally developed by davidmegginson using the java language,dom can access and modify the content and structure of a document in a platform and language independent way,Can be applied to any programming language.

Below, we take the elementtree module as an example.Introduce how to parse lxml in python.

Parsing XML using elementtree

In the python standard library,Two implementations of et are provided.One is xml.etree.elementtree implemented in pure python, and the other is xml.etree.celementtree implemented in faster C language. Remember to always use c language,Because it's much faster,And the memory consumption is much less.If the acceleration module required by celementtree is not available in the version of python you are using,You can import the module like this:

try:
 import xml.etree.celementtree as et
except importerror:
 import xml.etree.elementtree as et

If there is a different implementation of an api,The above is a common import method.Of course, it is very likely that when you import the first module directly,No problem.Please note that since Python 3.3, the above import method has not been used.Because the elementree module automatically uses the C accelerator first, if there is no C implementation, it will use the Python implementation. Therefore, friends using Python 3.3+ need only import xml.etree.elementtree.

1. Parse the XML document into a tree

Let's start with the basics.XML is a structured, hierarchical data format.The most suitable data structure for xml is the tree.et provides two objects:elementtree transforms the entire xml document into a tree,element represents a single node in the tree.The interaction (reading, writing, searching for required elements) of the entire XML document is generally performed at the elementtree level.For a single xml element and its child elements,It is done at the element level.Below we give examples to introduce the main usage methods.

We use the following xml document as demo data:

<?xml version="1.0"?>
<doc>
 <branch name="codingpy.com" hash="1cdf045c">
  text, source
 </branch>
 <branch name="release01" hash="f200013e">
  <sub-branch name="subrelease01">
   xml, sgml
  </sub-branch>
 </branch>
 <branch name="invalid">
 </branch>
</doc>

Next, we load this document,And parse it:

>>>import xml.etree.elementtree as et
>>tree=et.elementtree (file="doc1.xml")

Then, we get the root element:

>>tree.getroot ()
<element "doc" at 0x11eb780>

As said before,The root element is an element object. Let's see what attributes the root element has:

>>root=tree.getroot ()
>>>root.tag, root.attrib
("doc", {})

That's right, the root element has no attributes.Like other element objects,The root element also has an interface to traverse its immediate children:

>>for child_of_root in root:
... print child_of_root.tag, child_of_root.attrib
...
branch {"hash":"1cdf045c", "name":"codingpy.com"}
branch {"hash":"f200013e", "name":"release01"}
branch {"name":"invalid"}

We can also access specific child elements by index value:

>>>root [0] .tag, root [0] .text
("branch", "\ n text, source \ n")

Find the elements you need

From the example above,It is obvious that we can use a simple recursive method (for each element,Recursively access all its children) to get all elements in the tree.However, since this is a very common task,et provides some easy implementation methods.

The element object has an iter method that can perform a depth-first traversal (dfs) on all child elements below an element object. The elementtree object also has this method.Here is the easiest way to find all elements in an xml document:

>>for elem in tree.iter ():
... print elem.tag, elem.attrib
...
doc {}
branch {"hash":"1cdf045c", "name":"codingpy.com"}
branch {"hash":"f200013e", "name":"release01"}
sub-branch {"name":"subrelease01"}
branch {"name":"invalid"}

on the basis of,We can traverse the tree arbitrarily-iterate over all elements,Find out the attributes you are interested in.But et can make this job easier and faster. The iter method can accept the tag name and then iterate through all the elements with the provided tag:

>>for elem in tree.iter (tag="branch"):
... print elem.tag, elem.attrib
...
branch {"hash":"1cdf045c", "name":"codingpy.com"}
branch {"hash":"f200013e", "name":"release01"}
branch {"name":"invalid"}

3.Support finding elements by xpath

Use xpath to find elements of interest,More convenient.There are some find methods in the element object that can accept xpath paths as parameters.The find method returns the first matching child element.findall returns all matching child elements as a list, iterfind returns an iterator of all matching elements. elementtree objects also have these methods,Accordingly, its search starts from the root node.

Here is an example of finding elements using xpath:

>>for elem in tree.iterfind ("branch/sub-branch"):
... print elem.tag, elem.attrib
...
sub-branch {"name":"subrelease01"}

The above code returns all elements with the tag sub-branch below the branch element. Next find all branch elements with a certain name attribute:

>>for elem in tree.iterfind ("branch [@ name =" release01 "]"):
... print elem.tag, elem.attrib
...
branch {"hash":"f200013e", "name":"release01"}

4, build xml document

Using et, it is easy to complete the xml document construction,And write to save as a file.The write method of the elementtree object can achieve this requirement.

Generally speaking,There are two main use cases.One is that you first read an xml file and modify it.Then write the changes to the document,The second is to create a new xml document from scratch.

If you modify the document,This can be achieved by adjusting the element object.Consider the following example:

>>root=tree.getroot ()
>>>del root [2]
>>>root [0] .set ("foo", "bar")
>>for subelem in root:
... print subelem.tag, subelem.attrib
...
branch {"foo":"bar", "hash":"1cdf045c", "name":"codingpy.com"}
branch {"hash":"f200013e", "name":"release01"}

In the above code,We removed the third child of the root element,Added new attributes to the first child element.This tree can be rewritten to a file.The final XML document should look like this:

>>import sys
>>tree.write (sys.stdout)
<doc>
 <branch foo="bar" hash="1cdf045c" name="codingpy.com">
  text, source
 </branch>
 <branch hash="f200013e" name="release01">
  <sub-branch name="subrelease01">
   xml, sgml
  </sub-branch>
 </branch>
 </doc>

It's also easy to build a complete document from scratch.The et module provides a subelement factory function,Make the process of creating elements simple:

>>a=et.element ("elem")
>>>c=et.subelement (a, "child1")
>>>c.text="some text"
>>d=et.subelement (a, "child2")
>>>b=et.element ("elem_b")
>>>root=et.element ("root")
>>>root.extend ((a, b))
>>tree=et.elementtree (root)
>>tree.write (sys.stdout)
<root><elem><child1>some text</child1><child2 /></elem>

5.Parse xml stream using iterparse

XML documents are usually large,How to read a document directly into memory,Then there will be problems when parsing.This is one of the reasons why dom is not recommended, but sax api.

We talked above,et can load the xml document as an in-memory tree stored in memory, and then process it.But when parsing large files,This should also cause the same memory consumption problem as dom?That's right, this problem does exist.To solve this problem,et provides a special tool similar to sax-iterparse, which can parse XML sequentially.

Next, I show you how to use iterparse, and compare it with the standard tree parsing method.We use an automatically generated xml document, here is the beginning of the document:

<?xml version="1.0" standalone="yes"?>
<site>
 <regions>
 <africa>
  <item>
  <location>united states</location><!-Counting locations->
  <quantity>1</quantity>
  <name>duteous nine eighteen</name>
  <payment>creditcard</payment>
  <description>
   <parlist>
[...]

Let's count how many location elements with text value zimbabwe appear in the document. Here is the standard method using et.parse:

tree=et.parse (sys.argv [2])
count=0
for elem in tree.iter (tag="location"):
 if elem.text == "zimbabwe":
  count +=1
print count

The above code will load all elements into memory,Parse one by one.When parsing an xml document of about 100mb, the peak memory usage of the python process running the above script is about 560mb, and the total running time is 2.9 seconds.

Note that we don't actually need to talk about loading the entire tree into memory.As long as the text is detected as the corresponding location element.All other data can be discarded.At this point, we can use the iterparse method:

count=0
for event, elem in et.iterparse (sys.argv [2]):
 if event == "end":
  if elem.tag == "location" and elem.text == "zimbabwe":
   count +=1
 elem.clear () #Discard element
print count

The for loop above will traverse the iterparse event, first check if the event is end, and then determine whether the element's tag is location and whether its text value meets the target value.In addition, calling elem.clear () is critical:because iterparse will still generate a tree,It's just generated sequentially.Discard unnecessary elements,Is equivalent to scrapping the entire tree,Free up memory allocated by the system.

When using the above script to parse the same file,The peak memory usage is only 7mb and the runtime is 2.5 seconds. The reason for the speed increase,We are here only when the tree is being constructed,Iterate once.The standard method using parse is to complete the construction of the entire tree,Only then iterate again to find the required elements.

Iterparse's performance is comparable to sax's, but its api is more useful:iterparse builds trees sequentiallyWith sax, you have to build the tree yourself.

  • Previous Android development resource file usage example summary
  • Next Thirty-one of data manipulation in ASPNET 20: Use DataList to display multiple records in a row