Regular expression basics

1.1. Brief introduction

Regular expressions are not part of Python itself. They are a powerful tool for working with strings, with their own syntax and an independent processing engine. They may be less efficient than the built-in str methods, but they are far more capable. Because of this independence, the syntax of regular expressions is essentially the same in every language that provides them; what differs is how much of the syntax each implementation supports. Don't worry, though: the unsupported syntax is usually the less commonly used part. If you have already used regular expressions in another language, a quick read-through is enough to get started.

The following figure shows the flow of matching using regular expressions:

The matching process is roughly as follows: the expression and the text are compared character by character. If every character matches, the match succeeds; as soon as one character fails to match, the match fails. When the expression contains quantifiers or boundaries the process differs slightly, but it is still easy to follow: look at the examples in the figure below and try them out a few times.

The following figure lists the regular expression metacharacters and syntax supported by Python:

1.2. Greedy and non-greedy quantifiers

Regular expressions are often used to find matching strings in text. Quantifiers in Python are greedy by default (in a few languages they may be non-greedy by default) and always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, the regular expression "ab*" finds "abbb" when applied to "abbbc", whereas the non-greedy form "ab*?" finds only "a".
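The difference can be checked directly. The following is a minimal sketch, assuming Python 3; the variable names are only illustrative.

```python
import re

text = "abbbc"

# Greedy: "b*" takes as many "b" characters as it can.
greedy = re.search(r"ab*", text).group()

# Non-greedy: "b*?" takes as few as it can, which here is none at all.
lazy = re.search(r"ab*?", text).group()

print(greedy)  # abbb
print(lazy)    # a
```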

1.3. The backslash problem

As in most programming languages, regular expressions use "\" as the escape character, which can lead to a backslash problem. To match a literal "\" in the text, a regular expression written as an ordinary string literal needs four backslashes, "\\\\": the programming language first turns each pair into a single backslash, leaving two backslashes, and the regular expression engine then turns those two into one literal backslash. Python's raw strings solve this problem neatly: the expression in this example can be written as r"\\". Likewise, "\\d", which matches a digit, can be written as r"\d". With raw strings you no longer have to worry about missing a backslash, and the expressions you write are easier to read.
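As a small illustration, assuming Python 3 and a made-up sample string, the ordinary and the raw literal below describe the same pattern:

```python
import re

text = "C:\\temp"  # the text itself contains exactly one backslash: C:\temp

# Ordinary string literal: four backslashes are needed in the source code.
plain = re.search("\\\\", text)

# Raw string literal: two backslashes are enough.
raw = re.search(r"\\", text)

print(plain.group() == raw.group())  # True
print(len(raw.group()))              # 1

# The same applies to "\d": both literals are the same two-character string.
same_digit_pattern = ("\\d" == r"\d")
print(same_digit_pattern)            # True
```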

1.4. Matching modes

Regular expressions provide several matching modes, such as case-insensitive matching and multi-line matching. They are covered together with the Pattern factory method re.compile(pattern[, flags]).

The re module

2.1. Getting started with re

Python provides regular expression support through the re module. The usual steps are: compile the string form of the regular expression into a Pattern instance, use that Pattern instance to process the text and obtain a match result (a Match instance), and finally use the Match instance to get information and carry out further operations.

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r"hello")

# Use the pattern to match text; match() returns None if there is no match
match = pattern.match("hello world!")

if match:
    # Use the Match object to get group information
    print(match.group())

### output ###
# hello

re.compile(strPattern[, flag]):

This method is the factory method of the Pattern class; it compiles a regular expression given as a string into a Pattern object. The second parameter, flag, is the matching mode. Values can be combined with the bitwise OR operator "|" so that several apply at once, for example re.I | re.M. Alternatively, the mode can be specified inside the regex string itself: re.compile("pattern", re.I | re.M) and re.compile("(?im)pattern") are equivalent.

Possible values are:

• re.I (re.IGNORECASE): ignore case (the full name is given in parentheses; the same applies below)

• re.M (re.MULTILINE): multi-line mode; changes the behavior of "^" and "$" (see the figure above)

• re.S (re.DOTALL): dot-matches-all mode; changes the behavior of "."

• re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale

• re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode

• re.X (re.VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added. The following two regular expressions are equivalent:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
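To make the effect of the flags concrete, here is a short sketch, assuming Python 3 and an invented two-line sample text. It compares flags passed as an argument with the inline "(?im)" form, and shows what re.M and re.S change.

```python
import re

text = "first line\nSECOND line"

# Flags passed as an argument, combined with "|".
by_argument = re.compile(r"^second", re.I | re.M)

# The same flags written inline at the start of the pattern.
inline = re.compile(r"(?im)^second")

print(by_argument.search(text).group())  # SECOND
print(inline.search(text).group())       # SECOND

# Without re.M, "^" only matches at the very start of the text.
no_multiline = re.search(r"^second", text, re.I)
print(no_multiline)                      # None

# Without re.S, "." does not match a newline; with re.S it does.
no_dotall = re.search(r"line.SECOND", text)
with_dotall = re.search(r"line.SECOND", text, re.S)
print(no_dotall)                                # None
print(with_dotall.group() == "line\nSECOND")    # True
```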

re also provides a number of module-level functions that offer the same functionality. Each of them can be replaced by the corresponding method of a Pattern instance. Their only advantage is saving the one line of re.compile() code, at the cost of not having a compiled Pattern object to reuse. These functions are described together with the instance methods of the Pattern class. The example above can be shortened to:

m = re.match(r"hello", "hello world!")
print(m.group())
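The two styles give the same results. The sketch below, assuming Python 3 and an illustrative list of strings, uses a compiled Pattern object for several inputs and compares it with the module-level function.

```python
import re

words = ["hello world!", "hello there", "goodbye"]

# Compiled once, then reused for every string.
pattern = re.compile(r"hello")
with_pattern = [bool(pattern.match(w)) for w in words]

# Module-level function: the pattern string is passed on every call.
with_module = [bool(re.match(r"hello", w)) for w in words]

print(with_pattern)                 # [True, True, False]
print(with_pattern == with_module)  # True
```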

The re module also provides the function escape(string), which returns string with an escape character added in front of regular expression metacharacters such as *, + and ?. This is useful when you need to match text that contains many metacharacters.
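A brief sketch of re.escape(), assuming Python 3; the sample strings are made up. The exact set of characters that gets escaped differs between Python versions, so the escaped string itself is not printed here.

```python
import re

literal = "1+1=2?"          # "+" and "?" are metacharacters
text = "Is 1+1=2? Yes."

# Used as a pattern without escaping, "1+1=2?" does not mean the literal text.
unescaped_result = re.search(literal, text)
print(unescaped_result)                  # None

# After escaping, the metacharacters are matched literally.
escaped = re.escape(literal)
escaped_result = re.search(escaped, text)
print(escaped_result.group())            # 1+1=2?
```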

2.2. Match

A Match object is the result of one match and contains a lot of information about it. This information is available through the read-only attributes and the methods that Match provides.

Attributes:

1. string: The text used for matching.

2. re: The Pattern object used for matching.

3. pos: The index in the text at which the regular expression starts searching. The value is the same as the parameter of the same name passed to Pattern.match() and Pattern.search().

4. endpos: The index in the text at which the regular expression stops searching. The value is the same as the parameter of the same name passed to Pattern.match() and Pattern.search().

5. lastindex: The index (group number) of the last captured group. None if no group was captured.

6. lastgroup: The name (alias) of the last captured group. None if that group has no name or no group was captured.

Methods:

1. group([group1, …]):

Returns the string captured by one or more groups; when several arguments are given, the result is a tuple. group1 can be a number or a name. The number 0 stands for the whole matched substring, and with no arguments group(0) is returned. A group that captured nothing returns None; a group that captured several times returns the last substring it captured.

2. groups([default]):

Returns the strings captured by all groups as a tuple. Equivalent to calling group(1, 2, … last). default is the value used for groups that captured nothing; it defaults to None.

3. groupdict([default]):

Returns a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured. Groups without a name are not included. default has the same meaning as above.

4. start([group]):

Returns the start index in string of the substring captured by the specified group (the index of the first character of the substring). group defaults to 0.

5. end([group]):

Returns the end index in string of the substring captured by the specified group (the index of the last character of the substring + 1). group defaults to 0.

6. span([group]):

Returns (start(group), end(group)).

7. expand(template):

Substitutes the matched groups into template and returns the result. The template can reference groups with \id, \g<id> or \g<name>. \0 cannot be used; to refer to the whole match, write \g<0>. \id is equivalent to \g<id>, but \10 is read as the 10th group, so to write group 1 followed by the character "0" you must use \g<1>0.

import re

m = re.match(r"(\w+) (\w+)(?P<sign>.*)", "hello world!")

print("m.string:", m.string)
print("m.re:", m.re)
print("m.pos:", m.pos)
print("m.endpos:", m.endpos)
print("m.lastindex:", m.lastindex)
print("m.lastgroup:", m.lastgroup)
print("m.group(1, 2):", m.group(1, 2))
print("m.groups():", m.groups())
print("m.groupdict():", m.groupdict())
print("m.start(2):", m.start(2))
print("m.end(2):", m.end(2))
print("m.span(2):", m.span(2))
print(r"m.expand(r'\2 \1\3'):", m.expand(r"\2 \1\3"))

### output ###
# m.string: hello world!
# m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
# m.pos: 0
# m.endpos: 12
# m.lastindex: 3
# m.lastgroup: sign
# m.group(1, 2): ('hello', 'world')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!
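Two details from the method descriptions are easier to see in isolation: the default argument of groups(), and the \g<1>0 form in expand(). The following is a small sketch, assuming Python 3, with a pattern invented for the purpose.

```python
import re

# The second group is optional, so it may capture nothing.
m = re.match(r"(\w+)(?: (\d+))?", "hello")

print(m.group(2))        # None
print(m.groups())        # ('hello', None)
print(m.groups("n/a"))   # ('hello', 'n/a')

# \g<1>0 means "group 1 followed by the character 0";
# \10 would be read as a reference to group 10.
expanded = m.expand(r"\g<1>0")
print(expanded)          # hello0
```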

2.3. Pattern

A Pattern object is a compiled regular expression. The methods it provides are used to search and match text.

Pattern cannot be instantiated directly; it must be constructed with re.compile().

Pattern provides several read-only attributes for getting information about the expression:

1. pattern: The expression string used at compile time.

2. flags: The matching modes used at compile time, as a number.

3. groups: The number of groups in the expression.

4. groupindex: A dictionary whose keys are the names of the named groups in the expression and whose values are the corresponding group numbers. Groups without a name are not included.

import re

p = re.compile(r"(\w+) (\w+)(?P<sign>.*)", re.DOTALL)

print("p.pattern:", p.pattern)
print("p.flags:", p.flags)
print("p.groups:", p.groups)
print("p.groupindex:", p.groupindex)

### output ###
# p.pattern: (\w+) (\w+)(?P<sign>.*)
# p.flags: 48  (re.DOTALL is 16; Python 3 adds re.UNICODE, 32, for str patterns)
# p.groups: 3
# p.groupindex: {'sign': 3}

Instance methods [| re module functions]:

1. match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):

This method tries to match the pattern starting at index pos of string. If the pattern can be matched through to its end, a Match object is returned; if the pattern fails to match along the way, or endpos is reached before the match is complete, None is returned.

The default values of pos and endpos are 0 and len(string); re.match() cannot specify these two parameters. The flags parameter specifies the matching mode used when compiling the pattern.

Note: this method does not require a full match. If string still has characters left when the pattern ends, the match is still considered successful. For a full match, add the boundary matcher "$" to the end of the expression.

See Section 2.1 for examples.
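The note about "$" and the pos and endpos parameters can be sketched as follows, assuming Python 3; the pattern names are only illustrative.

```python
import re

prefix = re.compile(r"hello")
exact = re.compile(r"hello$")

# match() succeeds as long as the pattern matches at the start.
print(prefix.match("hello world!") is not None)   # True

# With "$" the pattern must also reach the end of the string.
print(exact.match("hello world!"))                # None
print(exact.match("hello").group())               # hello

# pos moves the starting point; endpos makes the string end early.
world = re.compile(r"world$")
print(world.match("hello world!", 6))             # None, because "!" follows
print(world.match("hello world!", 6, 11).group()) # world
```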

2. search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):

This method looks for a substring of string that matches. It tries to match the pattern starting at index pos of string. If the pattern can be matched through to its end, a Match object is returned; if it cannot, pos is increased by 1 and the match is retried. If nothing has matched by the time pos equals endpos, None is returned.

The default values of pos and endpos are 0 and len(string); re.search() cannot specify these two parameters. The flags parameter specifies the matching mode used when compiling the pattern.

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r"world")

# Use search() to find a matching substring; None is returned if there is none
# match() would not succeed in this example
match = pattern.search("hello world!")

if match:
    # Use the Match object to get group information
    print(match.group())

### output ###
# world
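The difference between match() and search(), and the effect of pos and endpos on search(), can be sketched like this (assuming Python 3; the single-character pattern is chosen only to keep the positions easy to follow):

```python
import re

pattern = re.compile(r"o")
text = "hello world!"

# match() only tries at the starting position.
print(pattern.match(text))            # None

# search() moves forward until it finds a match.
first = pattern.search(text)
print(first.span())                   # (4, 5)

# Starting the search after the first hit finds the next one.
second = pattern.search(text, first.end())
print(second.span())                  # (7, 8)

# Restricting the range with pos and endpos can leave nothing to find.
restricted = pattern.search(text, 5, 7)
print(restricted)                     # None
```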

3. split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):

Splits string at every substring that matches and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not specified, the string is split at every match.

import re

p = re.compile(r"\d+")
print(p.split("one1two2three3four4"))

### output ###
# ['one', 'two', 'three', 'four', '']
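The effect of maxsplit can be seen in a short sketch, assuming Python 3 and reusing the string from the example above:

```python
import re

p = re.compile(r"\d+")
text = "one1two2three3four4"

# At most one split: the rest of the string is returned unsplit.
one_split = p.split(text, maxsplit=1)
print(one_split)    # ['one', 'two2three3four4']

# At most two splits.
two_splits = p.split(text, maxsplit=2)
print(two_splits)   # ['one', 'two', 'three3four4']
```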

4. findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):

Searches string and returns all matching substrings as a list.

import re

p = re.compile(r"\d+")
print(p.findall("one1two2three3four4"))

### output ###
# ['1', '2', '3', '4']

5. finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):

Searches string and returns an iterator that yields each match result (a Match object) in order.

import re

p = re.compile(r"\d+")
for m in p.finditer("one1two2three3four4"):
    print(m.group(), end=" ")

### output ###
# 1 2 3 4
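Because finditer() yields Match objects rather than plain strings, the position of each match is available as well. A minimal sketch, assuming Python 3 and a made-up input string:

```python
import re

p = re.compile(r"\d+")
text = "one1two22three333"

# Each item is a Match object, so positions are available too.
found = [(m.group(), m.span()) for m in p.finditer(text)]
print(found)
# [('1', (3, 4)), ('22', (7, 9)), ('333', (14, 17))]
```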

6. sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):

Replaces each matching substring in string with repl and returns the resulting string.

When repl is a string, it can reference groups with \id, \g<id> or \g<name>. \0 cannot be used; to refer to the whole match, write \g<0>.

When repl is a function, it should accept exactly one argument (the Match object) and return the replacement string (group references in the returned string are not expanded).

count specifies the maximum number of replacements; if it is not specified, all matches are replaced.

import re

p = re.compile(r"(\w+) (\w+)")
s = "i say, hello world!"

print(p.sub(r"\2 \1", s))

def func(m):
    return m.group(1).title() + " " + m.group(2).title()

print(p.sub(func, s))

### output ###
# say i, world hello!
# I Say, Hello World!
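The \g<name> form and the count parameter are not covered by the example above. The following sketch, assuming Python 3, shows both; the group names first and second are invented for illustration.

```python
import re

p = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
s = "i say, hello world!"

# Named groups can be referenced with \g<name>.
swapped_all = p.sub(r"\g<second> \g<first>", s)
print(swapped_all)    # say i, world hello!

# count=1 replaces only the first match.
swapped_once = p.sub(r"\g<second> \g<first>", s, count=1)
print(swapped_once)   # say i, hello world!

# A function receives each Match object and returns the replacement.
shouted = p.sub(lambda m: m.group(0).upper(), s)
print(shouted)        # I SAY, HELLO WORLD!
```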

7. subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):

Returns (sub(repl, string[, count]), number of replacements).

import re

p = re.compile(r"(\w+) (\w+)")
s = "i say, hello world!"

print(p.subn(r"\2 \1", s))

def func(m):
    return m.group(1).title() + " " + m.group(2).title()

print(p.subn(func, s))

### output ###
# ('say i, world hello!', 2)
# ('I Say, Hello World!', 2)