
When I first learned Python, I only knew of two XML parsing methods: DOM and SAX. Neither performed well enough: given the large number of files I needed to process, both were unacceptably time-consuming.

After searching the web, I found that the widely used and relatively efficient ElementTree is what many people recommend, so I benchmarked it as well. ElementTree itself offers two approaches: the plain ElementTree API (ET) and ElementTree.iterparse (ET_iter).

This article compares the four methods (DOM, SAX, ET, ET_iter) by having each one parse the same files and measuring the time each takes.

Each of the four parsers is written as a function and called separately from the main program, so that its parsing efficiency can be measured in isolation.

An example of the contents of the decompressed xml file is:

The main program function call code is:

print("File count:%d/%d." % (gz_cnt, paser_num))
str_s, cnt = dom_parser(gz)
#str_s, cnt = sax_parser(gz)
#str_s, cnt = et_parser(gz)
#str_s, cnt = et_parser_iter(gz)
output.write(str_s)
vs_cnt += cnt

Each parser returns two values. In an early version of the main loop, each return value was fetched with a separate function call, which executed every parser twice. Changing this to a single call that unpacks both values at once eliminated the wasted work.
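The fix is plain tuple unpacking. A minimal sketch (Python 3, with a hypothetical `parse_stub` standing in for the real parsers):

```python
# A function that, like the parsers in this article, returns two values as one tuple.
def parse_stub(name):
    # Hypothetical stand-in for dom_parser/sax_parser: returns (csv_text, row_count).
    return ("%s,1,2\r\n" % name, 1)

# Wasteful: calling the function once per value runs the expensive parse twice.
s = parse_stub("a.gz")[0]
c = parse_stub("a.gz")[1]

# Better: one call, unpack both return values at once.
str_s, cnt = parse_stub("a.gz")
```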

1. DOM parsing

Function definition code:

def dom_parser(gz):
  import os, gzip, cStringIO
  import xml.dom.minidom
  vs_cnt = 0
  str_s = ""
  file_io = cStringIO.StringIO()
  xm = gzip.open(gz, "rb")
  print("Read:%s.\nParsing:" % (os.path.abspath(gz)))
  doc = xml.dom.minidom.parseString(xm.read())
  bulkpmmrdatafile = doc.documentElement
  # Read child elements
  enbs = bulkpmmrdatafile.getElementsByTagName("enb")
  measurements = enbs[0].getElementsByTagName("measurement")
  objects = measurements[0].getElementsByTagName("object")
  # Write rows to the CSV buffer
  for object in objects:
    vs = object.getElementsByTagName("v")
    vs_cnt += len(vs)
    for v in vs:
      file_io.write(enbs[0].getAttribute("id") + " " + object.getAttribute("id") + " " + \
      object.getAttribute("mmeues1apid") + " " + object.getAttribute("mmegroupid") + " " + object.getAttribute("mmecode") + " " + \
      object.getAttribute("timestamp") + " " + v.childNodes[0].data + "\n")  # Get the text node
  str_s = file_io.getvalue().replace("\n", "\r\n").replace(" ", ",").replace("\t", "").replace("nil", "")
  xm.close()
  file_io.close()
  return (str_s, vs_cnt)

Program running results:

**************************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The input directory contains 12 gz files, of which 12 were processed this run.

**************************************************

File count:1/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_234598_20160224060000.xml.gz.

Parsing:

File count:2/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_233798_20160224060000.xml.gz.

Parsing:

File count:3/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_123798_20160224060000.xml.gz.

Parsing:

……………………………………

File count:12/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_235598_20160224060000.xml.gz.

Parsing:

vs line count:177849, running time:107.077867 s, lines processed per second:1660.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

**************************************************

Program processing ends.

Because DOM parsing reads the entire file into memory and builds a tree structure, both its memory and time consumption are relatively high. Its advantage is simple logic: there are no callback functions to define, and it is easy to implement.
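For readers unfamiliar with minidom, the access pattern used above can be shown on a tiny inline document (a hypothetical fragment, far simpler than the real MRO file; Python 3):

```python
import xml.dom.minidom

# Hypothetical fragment mirroring the tag names used by dom_parser.
xml_text = '<enb id="101"><object id="7"><v>42</v></object></enb>'

doc = xml.dom.minidom.parseString(xml_text)
enb = doc.documentElement                    # root element <enb>
obj = enb.getElementsByTagName("object")[0]  # first <object> child
v = obj.getElementsByTagName("v")[0]         # first <v> child
# Attributes come from getAttribute(); element text from childNodes[0].data.
row = enb.getAttribute("id") + "," + obj.getAttribute("id") + "," + v.childNodes[0].data
print(row)  # → 101,7,42
```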

2. SAX parsing

Function definition code:

def sax_parser(gz):
  import os, gzip, cStringIO
  from xml.parsers.expat import ParserCreate
  # Variable declarations
  d_enb = {}
  d_obj = {}
  s = ""
  global flag
  flag = False
  file_io = cStringIO.StringIO()
  # SAX handler class
  class DefaultSaxHandler(object):
    # Handle start tags
    def start_element(self, name, attrs):
      global d_enb
      global d_obj
      global vs_cnt
      if name == "enb":
        d_enb = attrs
      elif name == "object":
        d_obj = attrs
      elif name == "v":
        file_io.write(d_enb["id"] + " " + d_obj["id"] + " " + d_obj["mmeues1apid"] + " " + d_obj["mmegroupid"] + " " + d_obj["mmecode"] + " " + d_obj["timestamp"] + " ")
        vs_cnt += 1
      else:
        pass
    # Handle text between tags
    def char_data(self, text):
      global d_enb
      global d_obj
      global flag
      if text[0:1].isnumeric():
        file_io.write(text)
      elif text[0:17] == "mr.ltescplrulqci1":
        flag = True
        # print(text, flag)
      else:
        pass
    # Handle end tags
    def end_element(self, name):
      global d_enb
      global d_obj
      if name == "v":
        file_io.write("\n")
      else:
        pass
  # SAX parser setup and invocation
  handler = DefaultSaxHandler()
  parser = ParserCreate()
  parser.StartElementHandler = handler.start_element
  parser.EndElementHandler = handler.end_element
  parser.CharacterDataHandler = handler.char_data
  vs_cnt = 0
  str_s = ""
  xm = gzip.open(gz, "rb")
  print("Read:%s.\nParsing:" % (os.path.abspath(gz)))
  for line in xm.readlines():
    parser.Parse(line)  # Parse the XML content line by line
    if flag:
      break
  str_s = file_io.getvalue().replace("\n", "\r\n").replace(" ", ",").replace("\t", "").replace("nil", "")  # Clean up the parsed content
  xm.close()
  file_io.close()
  return (str_s, vs_cnt)

Program running results:

**************************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The input directory contains 12 gz files, of which 12 were processed this run.

**************************************************

File count:1/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_234598_20160224060000.xml.gz.

Parsing:

File count:2/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_233798_20160224060000.xml.gz.

Parsing:

File count:3/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_123798_20160224060000.xml.gz.

Parsing:

...............................

File count:12/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_235598_20160224060000.xml.gz.

Parsing:

vs line count:177849, running time:14.386779 s, lines processed per second:12361.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

**************************************************

Program processing ends.

Compared with DOM, SAX parsing cuts the running time dramatically. Because SAX parses the input incrementally (here, line by line), it also uses far less memory on large files, which is why it is so widely used. The drawback is that you must implement the callback functions yourself, so the logic is more involved.
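The callback structure above can be sketched with a minimal expat example (Python 3; the inline XML fragment is hypothetical):

```python
from xml.parsers.expat import ParserCreate

rows = []
ctx = {}  # attributes captured from the enclosing start tags

def start_element(name, attrs):
    # Remember attributes of the elements we care about, as sax_parser does.
    if name in ("enb", "object"):
        ctx[name] = attrs

def char_data(text):
    # Emit one row per piece of non-whitespace element text.
    if text.strip():
        rows.append(ctx["enb"]["id"] + "," + ctx["object"]["id"] + "," + text)

p = ParserCreate()
p.StartElementHandler = start_element
p.CharacterDataHandler = char_data
p.Parse('<enb id="101"><object id="7"><v>42</v></object></enb>', True)
print(rows)  # → ['101,7,42']
```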

3. ET parsing

Function definition code:

def et_parser(gz):
  import os, gzip, cStringIO
  import xml.etree.cElementTree as ET
  vs_cnt = 0
  str_s = ""
  file_io = cStringIO.StringIO()
  xm = gzip.open(gz, "rb")
  print("Read:%s.\nParsing:" % (os.path.abspath(gz)))
  tree = ET.ElementTree(file=xm)
  root = tree.getroot()
  for elem in root[1][0].findall("object"):
    for v in elem.findall("v"):
      file_io.write(root[1].attrib["id"] + " " + elem.attrib["timestamp"] + " " + elem.attrib["mmecode"] + " " + \
      elem.attrib["id"] + " " + elem.attrib["mmeues1apid"] + " " + elem.attrib["mmegroupid"] + " " + v.text + "\n")
      vs_cnt += 1
  str_s = file_io.getvalue().replace("\n", "\r\n").replace(" ", ",").replace("\t", "").replace("nil", "")  # Clean up the parsed content
  xm.close()
  file_io.close()
  return (str_s, vs_cnt)

Program running results:

**************************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The input directory contains 12 gz files, of which 12 were processed this run.

**************************************************

File count:1/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_234598_20160224060000.xml.gz.

Parsing:

File count:2/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_233798_20160224060000.xml.gz.

Parsing:

File count:3/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_123798_20160224060000.xml.gz.

Parsing:

.................................

File count:12/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_235598_20160224060000.xml.gz.

Parsing:

vs line count:177849, running time:4.308103 s, lines processed per second:41282.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

**************************************************

Program processing ends.

Compared with SAX, ET parsing is faster still, and the function is simpler to write: ET combines DOM-like straightforward logic with parsing efficiency that rivals (here, exceeds) SAX. That makes ET the first choice for XML parsing today.
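The findall/attrib access pattern can be illustrated on a tiny inline document (hypothetical fragment; note that in Python 3, plain `xml.etree.ElementTree` already uses the C accelerator that `cElementTree` provided in Python 2):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring the nesting that et_parser walks.
root = ET.fromstring(
    '<enb id="101"><measurement>'
    '<object id="7" timestamp="t0"><v>42</v><v>43</v></object>'
    '</measurement></enb>'
)
rows = []
for obj in root[0].findall("object"):   # root[0] is <measurement>
    for v in obj.findall("v"):
        # Attributes via .attrib, element text via .text.
        rows.append(root.attrib["id"] + "," + obj.attrib["id"] + "," + v.text)
print(rows)  # → ['101,7,42', '101,7,43']
```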

4. ET_iter parsing

Function definition code:

def et_parser_iter(gz):
  import os, gzip, cStringIO
  import xml.etree.cElementTree as ET
  vs_cnt = 0
  str_s = ""
  file_io = cStringIO.StringIO()
  xm = gzip.open(gz, "rb")
  print("Read:%s.\nParsing:" % (os.path.abspath(gz)))
  d_enb = {}
  d_obj = {}
  i = 0
  for event, elem in ET.iterparse(xm, events=("start", "end")):
    if i >= 2:
      break
    elif event == "start":
      if elem.tag == "enb":
        d_enb = elem.attrib
      elif elem.tag == "object":
        d_obj = elem.attrib
    elif event == "end" and elem.tag == "smr":
      i += 1
    elif event == "end" and elem.tag == "v":
      file_io.write(d_enb["id"] + " " + d_obj["timestamp"] + " " + d_obj["mmecode"] + " " + d_obj["id"] + " " + \
      d_obj["mmeues1apid"] + " " + d_obj["mmegroupid"] + " " + str(elem.text) + "\n")
      vs_cnt += 1
      elem.clear()  # Discard the finished element to keep memory usage low
  str_s = file_io.getvalue().replace("\n", "\r\n").replace(" ", ",").replace("\t", "").replace("nil", "")  # Clean up the parsed content
  xm.close()
  file_io.close()
  return (str_s, vs_cnt)

Program running results:

**************************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The input directory contains 12 gz files, of which 12 were processed this run.

**************************************************

File count:1/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_234598_20160224060000.xml.gz.

Parsing:

File count:2/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_233798_20160224060000.xml.gz.

Parsing:

File count:3/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_123798_20160224060000.xml.gz.

Parsing:

...............................

File count:12/12.

Read:/tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_235598_20160224060000.xml.gz.

Parsing:

vs line count:177849, running time:3.043805 s, lines processed per second:58429.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

**************************************************

Program processing ends.

With ET_iter, parsing is roughly 40% faster than plain ET (58429 vs. 41282 lines/s) and about 35 times faster than DOM. And because iterparse processes the document sequentially instead of building the whole tree at once, its memory footprint stays small as well.
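A self-contained sketch of the iterparse pattern, including the `elem.clear()` call that keeps memory flat (Python 3; the inline fragment is hypothetical):

```python
import io
import xml.etree.ElementTree as ET

xml_text = ('<enb id="101"><measurement>'
            '<object id="7"><v>42</v><v>43</v></object>'
            '</measurement></enb>')

d_enb = {}
d_obj = {}
out = []
# iterparse yields (event, element) pairs as the stream is read: "start"
# events expose attributes, "end" events expose the complete element text.
for event, elem in ET.iterparse(io.StringIO(xml_text), events=("start", "end")):
    if event == "start":
        if elem.tag == "enb":
            d_enb = elem.attrib
        elif elem.tag == "object":
            d_obj = elem.attrib
    elif event == "end" and elem.tag == "v":
        out.append(d_enb["id"] + "," + d_obj["id"] + "," + elem.text)
        elem.clear()  # free the finished element so large files stay cheap
print(out)  # → ['101,7,42', '101,7,43']
```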

So, dear readers, please make good use of these tools.
