
The previous Python 3 introductory series opened the door to Python. Starting with this chapter, we introduce a Python crawler tutorial and share it with everyone. A crawler, simply put, fetches data from the network for analysis and processing. This chapter is an introduction: a few small exercises, plus an overview of the tools a crawler uses, such as sets, queues, and regular expressions.

Crawling a specified page with Python:

The code is as follows:

import urllib.request

url = "http://www.google.com"
data = urllib.request.urlopen(url).read()  # fetch the page body as bytes
data = data.decode("utf-8")                # decode to text
print(data)

According to the official documentation, urllib.request.urlopen(url) returns an http.client.HTTPResponse object, which provides various methods; here we use its read() method to get the data.
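As a minimal sketch of a few other HTTPResponse accessors (using the same URL as above):

import urllib.request

resp = urllib.request.urlopen("http://www.google.com")
print(resp.status)        # HTTP status code, e.g. 200
print(resp.getheaders())  # list of (header, value) tuples
print(resp.geturl())      # the URL actually retrieved (after any redirects)
body = resp.read()        # the raw response body, as bytes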

Fetching a URL with variable query parameters:

import urllib.parse
import urllib.request

data = {}
data["word"] = "one peace"
url_values = urllib.parse.urlencode(data)
url = "http://www.google.com/s?"
full_url = url + url_values
a = urllib.request.urlopen(full_url)
data = a.read()
data = data.decode("utf-8")
print(data)
# Print out the URL that was actually fetched:
print(a.geturl())

Here data is a dictionary; urllib.parse.urlencode() converts it into the string "word=one+peace", which is then appended to url to form full_url.
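As a quick illustration of how urlencode() behaves with more than one parameter (the extra "lang" key is made up for demonstration):

import urllib.parse

params = {"word": "one peace", "lang": "en"}
print(urllib.parse.urlencode(params))
# word=one+peace&lang=en  (spaces become "+", pairs are joined with "&")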

An introduction to the tools used by the crawler:

Queue

The crawler program uses a breadth-first algorithm, which relies on a queue data structure. You could also implement a queue with a list, but that is inefficient. The standard library's collections module provides one: collections.deque.

# Simple queue test:
from collections import deque

queue = deque(["peace", "rong", "sisi"])
queue.append("nick")
queue.append("pishi")
print(queue.popleft())  # peace
print(queue.popleft())  # rong
print(queue)            # deque(['sisi', 'nick', 'pishi'])

Introduction to sets:

In a crawler program, to avoid crawling the same site twice, we put the URL of every crawled page into a set. Before crawling a URL, we first check whether it is already in the set: if it is, we skip it; if not, we add it to the set first and then crawl the page.
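A minimal sketch of that check (should_crawl is an illustrative name, not part of the original code):

visited = set()

def should_crawl(url):
    if url in visited:   # already crawled: skip this url
        return False
    visited.add(url)     # not seen yet: record it, then crawl the page
    return True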

Python also includes a set data type. A set is an unordered collection with no duplicate elements. Its basic uses include membership testing and eliminating duplicates. Set objects also support mathematical operations such as union, intersection, difference, and symmetric difference.

Braces or the set() function can be used to create a set. Note: to create an empty set you must use set(), not {}; {} creates an empty dictionary.
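A two-line check makes the distinction visible:

print(type({}))     # <class 'dict'>: {} is an empty dictionary
print(type(set()))  # <class 'set'>: use set() for an empty set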

Creating sets is demonstrated below:

a={"peace", "peace", "rong", "rong", "nick"}

print (a)

"peace" in a

b=set (["peace", "peace", "rong", "rong"])

print (b)

#Demo joint

print (a | b)

#Presentation

print (a&b)

#Presentation difference

print (a-b)

#Symmetrical difference

print (a ^ b)

#Output:

{"peace", "rong", "nick"}

{"peace", "rong"}

{"peace", "rong", "nick"}

{"peace", "rong"}

{"nick"}

{"nick"}

Regular expressions

The data collected while crawling is generally a stream of characters, so we need simple string-processing ability to pick the URLs out of it. Regular expressions accomplish this task easily.

Working with a regular expression takes three steps: 1. compile the regular expression; 2. match the regular expression against a string; 3. process the result.

(A figure listing the regular expression syntax appeared here in the original post.)

To use regular expressions in Python, you need to import the re module. Some of the module's methods are introduced below.

1. compile and match

re.compile() in the re module is used to generate a Pattern object; calling the Pattern object's match() method against a string then yields a Match object, from which the match information is read.

import re
# Compile the regular expression into a Pattern object
pattern = re.compile(r"rlovep")
# Use the Pattern to match the text; None is returned if there is no match
m = pattern.match("rlovep.com")
if m:
    # Use the Match object to get the group information
    print(m.group())
### Output ###
# rlovep
re.compile(strpattern[, flag]):

This method is a factory function for the Pattern class, used to compile a regular expression given as a string into a Pattern object. The second parameter, flag, is the match mode; values can be combined with the bitwise OR operator "|" so that both take effect, for example re.I | re.M. Alternatively, the mode can be specified inside the regex string: re.compile("pattern", re.I | re.M) and re.compile("(?im)pattern") are equivalent.
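A quick check of that equivalence (the pattern string is arbitrary):

import re

p1 = re.compile(r"pattern", re.I | re.M)
p2 = re.compile(r"(?im)pattern")
print(p1.flags == p2.flags)  # True: both carry the same compiled flags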

Possible values are:

re.I (re.IGNORECASE): ignore case (the full name is given in parentheses; same below)

re.M (re.MULTILINE): multiline mode; changes the behavior of "^" and "$"

re.S (re.DOTALL): "dot any" mode; changes the behavior of "." so that it also matches newlines

re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale

re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on Unicode character properties

re.X (re.VERBOSE): verbose mode; the regular expression can span multiple lines, whitespace is ignored, and comments may be added
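For example, combining re.I and re.S (the pattern is made up, just to show the flags' effect):

import re

# re.I ignores case; re.S lets "." also match a newline
p = re.compile(r"hello.world", re.I | re.S)
print(p.match("Hello\nWorld"))  # matches despite the case difference and the newline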

Match: a Match object is the result of one match. It contains a great deal of information about that match, which can be obtained through the readable attributes and methods the object provides.

Attributes:

string: the text used in the match.

re: the Pattern object used in the match.

pos: the index in the text at which the regular expression starts searching; the same as the parameter of the same name in Pattern.match() and Pattern.search().

endpos: the index in the text at which the regular expression stops searching; the same as the parameter of the same name in Pattern.match() and Pattern.search().

lastindex: the index of the last captured group. None if no group was captured.

lastgroup: the alias of the last captured group. None if that group has no alias or no group was captured.

Methods:

group([group1, …]):

Gets the substring captured by one or more groups; when several arguments are given, the results are returned as a tuple. group1 can be a number or an alias; the number 0 stands for the whole matched substring; with no argument, it is equivalent to group(0). Groups that captured no string return None; groups that matched several times return the last captured substring.

groups([default]):

Returns the substrings captured by all groups as a tuple, equivalent to calling group(1, 2, …, last). default is the value substituted for groups that captured no string; it defaults to None.

groupdict([default]):

Returns a dictionary whose keys are the aliases of the named groups and whose values are the substrings those groups captured. Unnamed groups are not included. default has the same meaning as above.

start([group]):

Returns the starting index in the string of the substring captured by the given group (the index of the substring's first character). group defaults to 0.

end([group]):

Returns the ending index in the string of the substring captured by the given group (the index of the substring's last character + 1). group defaults to 0.

span([group]):

Returns (start(group), end(group)).

expand(template):

Substitutes the matched groups into template and returns the result. The template can reference groups with \id or \g<id>, \g<name>, but not group 0. \id is equivalent to \g<id>, but \10 is read as the tenth group; to write the character "0" after group 1, you must use \g<1>0.
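A small sketch exercising these methods (the pattern and text are made up for illustration):

import re

m = re.match(r"(\w+) (?P<last>\w+)", "hello world python")
print(m.group())             # "hello world" (same as group(0))
print(m.group(1, 2))         # ('hello', 'world')
print(m.groups())            # ('hello', 'world')
print(m.groupdict())         # {'last': 'world'}
print(m.start(2), m.end(2))  # 6 11
print(m.span(2))             # (6, 11)
print(m.expand(r"\2 \1"))    # "world hello"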

Pattern: a Pattern object is a compiled regular expression; the series of methods Pattern provides can be used to search and match text.

A Pattern cannot be instantiated directly; it must be constructed with re.compile().

Pattern provides several readable attributes for getting information about the expression:

pattern: the expression string used at compile time.

flags: the match mode used at compile time, in numeric form.

groups: the number of groups in the expression.

groupindex: a dictionary mapping the aliases of named groups in the expression to the groups' corresponding numbers; unnamed groups are not included.
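A short sketch of these attributes (the pattern is made up for illustration):

import re

p = re.compile(r"(?P<scheme>https?)://([^/]+)", re.I)
print(p.pattern)     # (?P<scheme>https?)://([^/]+)
print(p.flags)       # the compile-time flags, as a number
print(p.groups)      # 2
print(p.groupindex)  # maps group aliases to numbers: {'scheme': 1}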

Instance methods [| equivalent re module functions]:

match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):

This method tries to match the pattern against the string starting at index pos. If the pattern can be matched to its end, a Match object is returned; if the pattern fails somewhere during matching, or if endpos is reached before the pattern ends, None is returned.

pos and endpos default to 0 and len(string); re.match() cannot specify these two parameters, and its flags parameter specifies the match mode used when compiling the pattern.

Note: this method is not an exact (full) match. If the string has characters left over when the pattern ends, the match still counts as successful. For an exact match, add the boundary anchor "$" to the end of the expression.
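For example:

import re

print(re.match(r"rlovep", "rlovep.com"))   # succeeds even though ".com" is left over
print(re.match(r"rlovep$", "rlovep.com"))  # None: "$" demands the match end with the string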

search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):

This method finds a matching substring anywhere in the string. It tries to match the pattern starting at index pos; if the pattern matches, a Match object is returned. If it cannot match, pos is incremented by 1 and the attempt is repeated; if no match is found by the time pos reaches endpos, None is returned. pos and endpos default to 0 and len(string); re.search() cannot specify these two parameters, and its flags parameter specifies the match mode used when compiling the pattern.

split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):

Splits string at the matching substrings and returns the pieces as a list. maxsplit specifies the maximum number of splits; if unspecified, all possible splits are made.

findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):

Searches string and returns all matching substrings as a list.

finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):

Searches string and returns an iterator that yields each match result (a Match object) in order.

sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):

Replaces every matching substring in string with repl and returns the replaced string. When repl is a string, groups can be referenced with \id or \g<id>, \g<name>, but group 0 cannot be used. When repl is a function, it should accept a single parameter (the Match object) and return the replacement string (the returned string may not reference groups). count specifies the maximum number of replacements; if unspecified, all matches are replaced.

subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):

Returns (sub(repl, string[, count]), number of replacements).
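A small sketch exercising these five methods on a made-up string:

import re

p = re.compile(r"\d+")
text = "one1two22three333"
print(p.split(text))    # ['one', 'two', 'three', '']
print(p.findall(text))  # ['1', '22', '333']
for m in p.finditer(text):
    print(m.group())    # 1, then 22, then 333
print(p.sub("#", text))   # one#two#three#
print(p.subn("#", text))  # ('one#two#three#', 3)
# repl can also be a function of the Match object:
print(p.sub(lambda m: str(len(m.group())), text))  # one1two2three3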

2. re.match(pattern, string, flags=0)

Function parameter description:

pattern: the regular expression to match.

string: the string to match against.

flags: flags that control how the regular expression is matched, e.g. case sensitivity, multiline matching, and so on.

We can use the group(num) or groups() method of the Match object to get the matched expression.

group(num=0): the string matched by the whole expression. group() can take several group numbers at once, in which case it returns a tuple containing the values of those groups.

groups(): returns a tuple containing all the group strings, from group 1 up to the last group.

The demonstration is as follows:

# re.match
import re
print(re.match("rlovep", "rlovep.com"))         # matches "rlovep" at the start
print(re.match("rlovep", "rlovep.com").span())  # the span of the match from position 0
print(re.match("com", "http://rlovep.com"))     # "com" is not at the start, so no match
## Output:
# <_sre.SRE_Match object; span=(0, 6), match='rlovep'>
# (0, 6)
# None

Example 2: using group

import re
line = "this is my blog"
# Match the string containing "is"
matchobj = re.match(r"(.*) is (.*?) .*", line, re.M | re.I)
# group() with no argument outputs the entire match;
# group(1) is the first parenthesized group counting from the left, and so on
if matchobj:
    print("matchobj.group():", matchobj.group())    # the entire match
    print("matchobj.group(1):", matchobj.group(1))  # the first group
    print("matchobj.group(2):", matchobj.group(2))  # the second group
else:
    print("no match!!")
# Output:
# matchobj.group(): this is my blog
# matchobj.group(1): this
# matchobj.group(2): my

3. The re.search method

re.search scans the entire string and returns the first successful match.

Function syntax:

re.search(pattern, string, flags=0)

Function parameter description:

pattern: the regular expression to match.

string: the string to match against.

flags: flags that control how the regular expression is matched, e.g. case sensitivity, multiline matching, and so on.

We can use the group(num) or groups() method of the Match object to get the matched expression.

group(num=0): the string matched by the whole expression. group() can take several group numbers at once, in which case it returns a tuple containing the values of those groups.

groups(): returns a tuple containing all the group strings, from group 1 up to the last group.

Example 1:

import re
print(re.search("rlovep", "rlovep.com").span())
print(re.search("com", "http://rlovep.com").span())
# Output:
# (0, 6)
# (14, 17)

Example 2:

import re
line = "this is my blog"
# Match the string containing "is"
matchobj = re.search(r"(.*) is (.*?) .*", line, re.M | re.I)
# group() with no argument outputs the entire match;
# group(1) is the first parenthesized group counting from the left, and so on
if matchobj:
    print("matchobj.group():", matchobj.group())    # the entire match
    print("matchobj.group(1):", matchobj.group(1))  # the first group
    print("matchobj.group(2):", matchobj.group(2))  # the second group
else:
    print("no match!!")
# Output:
# matchobj.group(): this is my blog
# matchobj.group(1): this
# matchobj.group(2): my

The difference between search and match: re.match only matches at the beginning of the string; if the string does not match the pattern from its start, the match fails and the function returns None. re.search, by contrast, scans the entire string until it finds a match.

A small Python crawler test

Use Python to crawl all the HTTP links in a page, and recursively crawl the links found on the subpages, using a set and a queue. The entry page is my website; this first version has many bugs. The code is as follows:

import re
import urllib.request
from collections import deque

# Use a queue to store the URLs to be crawled
queue = deque()
# Use a visited set to prevent crawling the same page twice
visited = set()
url = "http://rlovep.com"  # entry page; can be changed to something else
# Enqueue the initial page
queue.append(url)
cnt = 0

while queue:
    url = queue.popleft()  # dequeue the first element
    visited |= {url}       # mark as visited
    print("Crawled: " + str(cnt) + " Crawling <--- " + url)
    cnt += 1

    # Fetch the page
    urlop = urllib.request.urlopen(url)
    # Only process HTML pages
    if "html" not in (urlop.getheader("Content-Type") or ""):
        continue

    # Avoid aborting the program: handle decoding errors with try..except
    try:
        # Decode as UTF-8
        data = urlop.read().decode("utf-8")
    except:
        continue

    # Extract all the links in the page with a regular expression,
    # check whether each has been visited, and enqueue the new ones
    linkre = re.compile(r'href=["\']([^"\'>]*?)["\']')
    for x in linkre.findall(data):  # findall returns every match as a list
        if "http" in x and x not in visited:  # skip URLs already crawled
            queue.append(x)
            print("Join the queue ---> " + x)

The results are as follows:
