
I. Introduction

XML (Extensible Markup Language) is designed to transmit and store data. It has become the core of many emerging technologies and is applied differently in different fields. It is a natural product of the web's development: it keeps the core features of SGML and the simplicity of HTML, and adds features of its own, such as a clear and well-defined structure.

There are three common ways to parse XML in Python. The first is the xml.dom.* modules, an implementation of the W3C DOM API; they are a good fit if you need to work with the DOM API. Note that the xml.dom package contains several modules, and you need to distinguish between them. The second is the xml.sax.* modules, an implementation of the SAX API. They trade convenience for speed and low memory use: SAX is an event-based API, so it can process very large documents on the fly without loading them fully into memory. The third is the xml.etree.ElementTree module (ET for short), which provides a lightweight, Pythonic API. Compared with DOM, ET is much faster and has a more pleasant API. Compared with SAX, ET's iterparse function also offers on-the-fly processing without loading the whole document into memory; ET's average performance is similar to SAX, but its API is higher-level and easier to use.
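As a rough illustration of the on-the-fly processing mentioned above, here is a minimal sketch using ET.iterparse. The sample data and the stream_ranks helper are hypothetical names made up for this example; the document is read from an in-memory buffer so the snippet stands alone.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data for this sketch only.
XML_BYTES = b"""<data>
  <country name="singapore"><rank>4</rank></country>
  <country name="panama"><rank>68</rank></country>
</data>"""


def stream_ranks(source):
    """Yield (name, rank) pairs as each <country> element is completed."""
    # The "end" event fires once an element and all its children are parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "country":
            yield elem.get("name"), int(elem.findtext("rank"))
            elem.clear()  # discard the children we no longer need


ranks = list(stream_ranks(io.BytesIO(XML_BYTES)))
print(ranks)
```

Calling elem.clear() empties each processed element; the root still keeps a reference to the emptied elements, so for truly huge files you may also want to drop those references.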

II. Detailed explanation

The XML file to parse (country.xml):


    <?xml version="1.0"?>
    <data>
      <country name="singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="malaysia" direction="n" />
      </country>
      <country name="panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="costa rica" direction="w" />
        <neighbor name="colombia" direction="e" />
      </country>
    </data>

1. xml.etree.ElementTree

ElementTree was built for processing XML. It has two implementations in the Python standard library: a pure-Python one, xml.etree.ElementTree, and a faster C one, xml.etree.cElementTree. Note: prefer the C implementation where it is available, because it is faster and uses less memory.


    try:
        import xml.etree.cElementTree as ET
    except ImportError:
        import xml.etree.ElementTree as ET

This is a common idiom for choosing between libraries that expose the same API. Starting with Python 3.3, the ElementTree module automatically uses the C accelerator when it is available, so you only need to import xml.etree.ElementTree. (xml.etree.cElementTree was deprecated in 3.3 and removed in Python 3.9.)


    #!/usr/bin/env python
    # coding: utf-8
    import sys

    try:
        import xml.etree.cElementTree as ET
    except ImportError:
        import xml.etree.ElementTree as ET

    try:
        tree = ET.parse("country.xml")  # open the XML document
        # root = ET.fromstring(country_string)  # or parse XML from a string
        root = tree.getroot()  # get the root node
    except (OSError, ET.ParseError):
        print("error: cannot parse file: country.xml.")
        sys.exit(1)

    print(root.tag, "---", root.attrib)
    for child in root:
        print(child.tag, "---", child.attrib)
    print("*" * 10)
    print(root[0][1].text)  # access children by index
    print(root[0].tag, root[0].text)
    print("*" * 10)
    for country in root.findall("country"):  # all country nodes under the root
        rank = country.find("rank").text  # text of the child node <rank>
        name = country.get("name")  # value of the attribute "name"
        print(name, rank)

    # Modify the XML: drop every country ranked below 50th, then save
    for country in root.findall("country"):
        rank = int(country.find("rank").text)
        if rank > 50:
            root.remove(country)
    tree.write("output.xml")
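The commented-out fromstring call above can be tried without any file on disk. The following is a minimal sketch that parses the article's sample data from a string and repeats the lookup and removal steps; the COUNTRY_XML, ranks and remaining names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

# The article's sample document, held in a string instead of country.xml.
COUNTRY_XML = """<data>
  <country name="singapore">
    <rank>4</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
    <neighbor name="malaysia" direction="n" />
  </country>
  <country name="panama">
    <rank>68</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
    <neighbor name="costa rica" direction="w" />
    <neighbor name="colombia" direction="e" />
  </country>
</data>"""

root = ET.fromstring(COUNTRY_XML)  # returns the root Element directly

# Map each country's name attribute to its rank as an integer.
ranks = {c.get("name"): int(c.find("rank").text) for c in root.findall("country")}

# findall() returns a list, so removing elements while looping over it is safe.
for country in root.findall("country"):
    if int(country.find("rank").text) > 50:
        root.remove(country)

remaining = [c.get("name") for c in root.findall("country")]
print(ranks)
print(remaining)
```

Note that fromstring returns an Element rather than an ElementTree, so there is no getroot() step; to serialize the result you could use ET.tostring(root).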


2. xml.dom.*

The Document Object Model (DOM) is the standard programming interface for XML recommended by the W3C. A DOM parser reads the entire document at once and keeps all of its elements in memory as a tree. You can then use the functions DOM provides to read or modify the document's content and structure, and write the modified content back to an XML file. In Python, xml.dom.minidom can be used to parse XML files. An example follows:


    #!/usr/bin/python
    # coding=utf-8
    import xml.dom.minidom

    # Use the minidom parser to open the XML document
    domtree = xml.dom.minidom.parse("country.xml")
    data = domtree.documentElement
    if data.hasAttribute("name"):
        print("name element:%s" % data.getAttribute("name"))

    # Get all countries in the collection
    countries = data.getElementsByTagName("country")

    # Print detailed information for each country
    for country in countries:
        print("***** country *****")
        if country.hasAttribute("name"):
            print("name:%s" % country.getAttribute("name"))
        rank = country.getElementsByTagName("rank")[0]
        print("rank:%s" % rank.childNodes[0].data)
        year = country.getElementsByTagName("year")[0]
        print("year:%s" % year.childNodes[0].data)
        gdppc = country.getElementsByTagName("gdppc")[0]
        print("gdppc:%s" % gdppc.childNodes[0].data)
        for neighbor in country.getElementsByTagName("neighbor"):
            print(neighbor.tagName, ":", neighbor.getAttribute("name"),
                  neighbor.getAttribute("direction"))
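The example above only reads the tree. Modifying content and structure through the DOM might look like the following sketch; the one-line sample document and the "region" attribute are made up for illustration.

```python
from xml.dom.minidom import parseString

# Hypothetical one-country document, parsed from a string for brevity.
doc = parseString('<data><country name="singapore"><rank>4</rank></country></data>')
country = doc.documentElement.getElementsByTagName("country")[0]

# Change existing content: overwrite the text node inside <rank>.
rank = country.getElementsByTagName("rank")[0]
rank.firstChild.data = "5"

# Change the structure: create a <year> element and append it.
year = doc.createElement("year")
year.appendChild(doc.createTextNode("2011"))
country.appendChild(year)

# Add an attribute ("region" is an invented attribute name).
country.setAttribute("region", "asia")

# Serialize the modified tree back to a string.
result = doc.documentElement.toxml()
print(result)
```

To save the result to a file instead, one option is to open the file for writing and pass it to doc.writexml().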


3. xml.sax.*

SAX is an event-driven API. Parsing XML with SAX involves two parts: a parser and an event handler. The parser reads the XML document and sends events, such as element start and element end, to the event handler; the event handler responds to those events and processes the XML data passed to it. To process XML with SAX in Python, first import the parse function from xml.sax and the ContentHandler class from xml.sax.handler. SAX is typically used in the following situations: first, when processing large files; second, when only part of the content is needed, or only specific information has to be extracted from the file; third, when you want to build your own object model.

ContentHandler class methods

(1) characters(content) method

When it is called:

From the start of a line up to the first tag, if there are characters; content holds that text.

From one tag up to the next tag, if there are characters; content holds that text.

From a tag up to the end of the line, if there are characters; content holds that text.

The tag can be either a start tag or an end tag. The parser may also deliver a single run of text in more than one call.

(2) startDocument() method

Called when the document starts.

(3) endDocument() method

Called when the parser reaches the end of the document.

(4) startElement(name, attrs) method

Called when an XML start tag is encountered. name is the name of the tag, and attrs is a dictionary-like object holding the tag's attributes.

(5) endElement(name) method

Called when an XML end tag is encountered.
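To see in what order these callbacks fire, here is a minimal sketch of a handler that just records every event. EventLogger and its events list are hypothetical names for this example; whitespace-only text is skipped to keep the log short.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record each callback so the order of SAX events is visible."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        # attrs is dictionary-like; copy it into a plain dict for logging.
        self.events.append(("startElement", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        if content.strip():  # ignore whitespace-only text
            self.events.append(("characters", content))


logger = EventLogger()
xml.sax.parseString(b'<country name="panama"><rank>68</rank></country>', logger)
for event in logger.events:
    print(event)
```

The log starts with startDocument and ends with endDocument, with each element's start and end events nested in document order between them.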


    #!/usr/bin/python
    # coding=utf-8
    import xml.sax
    import xml.sax.handler


    class CountryHandler(xml.sax.ContentHandler):
        def __init__(self):
            xml.sax.ContentHandler.__init__(self)
            self.current_data = ""
            self.rank = ""
            self.year = ""
            self.gdppc = ""

        # Element start event
        def startElement(self, tag, attributes):
            self.current_data = tag
            if tag == "country":
                print("***** country *****")
                print("name:", attributes["name"])
            elif tag == "neighbor":
                print(attributes["name"], "->", attributes["direction"])

        # Element end event
        def endElement(self, tag):
            if self.current_data == "rank":
                print("rank:", self.rank)
            elif self.current_data == "year":
                print("year:", self.year)
            elif self.current_data == "gdppc":
                print("gdppc:", self.gdppc)
            self.current_data = ""

        # Character data event (long text may arrive in several calls;
        # the short values here are assumed to arrive in one)
        def characters(self, content):
            if self.current_data == "rank":
                self.rank = content
            elif self.current_data == "year":
                self.year = content
            elif self.current_data == "gdppc":
                self.gdppc = content


    if __name__ == "__main__":
        # Create an XMLReader
        parser = xml.sax.make_parser()
        # Turn off namespace processing
        parser.setFeature(xml.sax.handler.feature_namespaces, 0)
        # Replace the default ContentHandler with our own
        handler = CountryHandler()
        parser.setContentHandler(handler)
        parser.parse("country.xml")
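The handler above prints values as they arrive. For the "build your own object model" use case mentioned earlier, a handler can collect the data into ordinary Python structures instead. The sketch below is one possible shape, not the only one: CountryCollector, FIELDS and the dict layout are assumptions made for this example, and text is accumulated in chunks because characters() may be called more than once per element.

```python
import xml.sax


class CountryCollector(xml.sax.ContentHandler):
    """Build a list of plain dicts, one per <country> element."""

    FIELDS = ("rank", "year", "gdppc")  # numeric child elements to capture

    def __init__(self):
        super().__init__()
        self.countries = []
        self._chunks = None  # text pieces of the field currently being read

    def startElement(self, name, attrs):
        if name == "country":
            self.countries.append({"name": attrs["name"], "neighbors": []})
        elif name == "neighbor":
            self.countries[-1]["neighbors"].append(attrs["name"])
        elif name in self.FIELDS:
            self._chunks = []  # start collecting text for this field

    def characters(self, content):
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name in self.FIELDS:
            # Join all chunks before converting, in case the text was split.
            self.countries[-1][name] = int("".join(self._chunks))
            self._chunks = None


SAMPLE = b"""<data>
  <country name="panama">
    <rank>68</rank><year>2011</year><gdppc>13600</gdppc>
    <neighbor name="costa rica" direction="w" />
    <neighbor name="colombia" direction="e" />
  </country>
</data>"""

collector = CountryCollector()
xml.sax.parseString(SAMPLE, collector)
print(collector.countries)
```

Only the current field's text is held while parsing, so memory use depends on the collected model rather than on the size of the document.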


4. Parsing XML with libxml2 and lxml

libxml2 is an XML parser written in C. It is free, open-source software released under the MIT license, and many programming languages have bindings based on it. The libxml2 module for Python has one small shortcoming: its xpathEval() interface does not support template-like usage (substituting variables into an XPath expression), although this does not prevent you from using it. Because libxml2 is written in C, its API inevitably feels somewhat awkward from Python.


    #!/usr/bin/python
    # coding=utf-8
    import libxml2

    doc = libxml2.parseFile("country.xml")
    # Print the text content of every <country> element
    for country in doc.xpathEval("//country"):
        if country.content != "":
            print("----------------------")
            print(country.content)
    # The value has to be written directly into the XPath expression
    for node in doc.xpathEval("//country/neighbor[@name='colombia']"):
        print(node.name, (node.properties.name, node.properties.content))
    doc.freeDoc()  # the document must be freed explicitly

lxml is a Python library built on top of libxml2. In usage it suits Python developers better than libxml2, and its xpath() interface supports template-like usage through XPath variables.


    #!/usr/bin/python
    # coding=utf-8
    import lxml.etree

    doc = lxml.etree.parse("country.xml")
    # $name in the expression is filled in from the keyword argument
    for node in doc.xpath("//country/neighbor[@name=$name]", name="colombia"):
        print(node.tag, node.items())
    for node in doc.xpath("//country[@name=$name]", name="singapore"):
        print(node.tag, node.items())
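For comparison, when lxml is not available, the standard library's ElementTree supports a limited XPath subset in find() and findall(). It has no $variable substitution, so in the sketch below the value is formatted into the predicate by a hypothetical helper, neighbors_named, which assumes the name contains no single quote.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the article's sample data, parsed from a string.
root = ET.fromstring("""<data>
  <country name="singapore">
    <neighbor name="malaysia" direction="n" />
  </country>
  <country name="panama">
    <neighbor name="costa rica" direction="w" />
    <neighbor name="colombia" direction="e" />
  </country>
</data>""")


def neighbors_named(root, name):
    """Return <neighbor> elements whose name attribute equals `name`."""
    # No XPath variables here: the value is interpolated into the predicate.
    # Assumption: `name` contains no single quote.
    return root.findall(".//country/neighbor[@name='%s']" % name)


matches = neighbors_named(root, "colombia")
print([(n.tag, n.items()) for n in matches])
```

This is enough for simple attribute lookups; for full XPath 1.0 expressions, lxml's xpath() remains the more capable option.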

III. Summary

(1) The libraries and modules available for parsing XML in Python include xml, libxml2, lxml, xpath, and others. To learn more, refer to their documentation.

(2) Each parsing method has its own advantages and disadvantages. Before choosing one, weigh the different aspects of its performance against your needs.

(3) If anything here is lacking, please leave a comment. Thanks in advance!
