What does W mean in Python


Regular expressions



The term "regular expression" comes from automata theory and the theory of formal languages, two areas of theoretical computer science. The American mathematician Stephen Cole Kleene, who introduced regular sets in the 1950s, is considered to be the inventor of regular languages.
In theoretical computer science, regular expressions are used to formally define a family of languages with certain properties, the so-called regular languages. For every regular expression there is a finite automaton that accepts the language specified by the expression.

In programming languages, regular expressions are mostly used to filter texts or strings, i.e. they allow one to check whether a text or a string matches a regular expression (RE).

REs are also used to perform text substitutions, which can be quite complex.

We would like to mention an extremely interesting aspect of regular expressions: their syntax is largely the same across programming and scripting languages, e.g. in Python, Perl, Java, sed, AWK or C#. They are also used by many text editors, such as vi.

Introduction

In our introduction to sequential data types, you learned about the "in" operator.

>>> s = "Regular expressions simply explained!"
>>> "simply" in s
True

In the above example it was checked whether the word "simply" occurs in the string s.

In the following we show step by step how such a string comparison is carried out. In the string

s = "xaababcbcd"

it should be checked whether the substring

sub = "abc"

occurs. By the way, a substring is already a regular expression, albeit a particularly simple one.

First it is checked whether the first positions match, that is, whether s[0] == sub[0]. This is not the case in our example, because s[0] is "x" and sub[0] is "a".

It is now checked whether s[1:4] == sub holds. For this purpose, sub[0] is first compared with s[1]. These two match, so the comparison continues, but s[2] is not equal to sub[1].

Now we compare s[2:5] with sub. In this case, even the first two positions of the substring match, but the third one does not.

The next step, comparing s[3:6] with sub, already fails at the first position.

In the step after that, a complete match is found, because s[4:7] == sub holds.
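The position-by-position comparison described above can be sketched in a few lines of Python. This is only an illustration of the naive procedure; the function name find_substring is a hypothetical helper and not part of the original text or of the re module.

```python
def find_substring(s, sub):
    """Return the first index at which sub occurs in s, or -1 (naive search)."""
    for start in range(len(s) - len(sub) + 1):
        # Compare position by position and stop at the first mismatch.
        offset = 0
        while offset < len(sub) and s[start + offset] == sub[offset]:
            offset += 1
        if offset == len(sub):
            return start  # every position of sub matched
    return -1


position = find_substring("xaababcbcd", "abc")
print(position)  # 4, because s[4:7] == "abc"
```

In practice you would simply write "abc" in s or s.find("abc"); the sketch only makes the individual comparison steps visible.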

The re module

In the previous section we said that sub can already be viewed as a simple regular expression.
In order to work with regular expressions in Python, you first have to import the module re. re is part of the standard library and provides numerous functions and methods for working with regular expressions.

Representation of regular expressions in Python

Numerous characters within regular expressions have special meanings, among them the backslash. In principle, regular expressions are represented as strings in Python. In ordinary strings, however, backslashes are used as escape characters. This means that they are either removed from our regular expression or given a special meaning together with the following character. You can prevent this by writing every backslash of a regular expression as "\\", but this carries the risk of forgetting it now and then, and the expressions become harder to read. The best solution is to use raw strings, i.e. to mark the string with a prefixed r.
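The difference between ordinary and raw strings can be checked directly. The following is a small sketch; the file-name pattern in it is an assumed illustration of an expression that matches names starting with a lowercase "a" and ending in ".html", and the sample file names are made up.

```python
import re

# In an ordinary string, "\b" is the backspace character (one character);
# in a raw string it stays a backslash followed by "b" (two characters).
print(len("\b"), len(r"\b"))

# Doubling the backslash and using a raw string give the same pattern.
doubled = "\\."
raw = r"\."
print(doubled == raw)

# Assumed example pattern: starts with "a", ends with ".html".
pattern = r"^a.*\.html$"
starts_with_a = bool(re.search(pattern, "about.html"))
other_name = bool(re.search(pattern, "index.html"))
print(starts_with_a, other_name)
```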



We'll discuss the elements of the regular expression above in detail, but so as not to leave you in the dark: This expression matches all filenames (strings) that start with a lowercase "a" and end with ".html".

Regular Expression Syntax


r"cat" is a regular expression. It is one of the easiest expressions to understand because it does not contain any metacharacters (function symbols) with special meanings. For example, our example expression matches the following string: "A cat and a rat can't be friends."

Interestingly enough, this first example already shows a "popular" error. What you actually want is to match strings that contain the word cat. This works, but you also get "cats", for example, which may still be desirable. Worse, however, are the many additional words in which the letter sequence "cat" occurs as a substring, including "education", "communicate", "falsification", "ramifications", "cattle" and many others. This is a case of "over-matching", i.e. we get positive results that are not wanted. We have illustrated this in terms of sets in the accompanying diagram. The dark green circle C corresponds to the set that we would like to recognize, but we actually recognize the set O (blue circle). C is a subset of O. In this diagram you can also see a set U (light green circle), which is a subset of our desired set C. This is a case of "under-matching". An example of this arises when we try to improve our regular expression. We could get the idea of adding a space before and after the word cat in the regular expression. With this change we would no longer fall for words like "education", "falsification" and "ramifications". But what about a string like "The cat, called Oscar, climbed on the roof."? It is now classified as not matching.

This problem becomes even clearer in the following. But first let's take a look at how to check regular expressions in Python. Then we will also be able to follow the next examples directly in Python.

>>> import re
>>> x = re.search("cat", "A cat and a rat can't be friends.")
>>> print(x)
<re.Match object; span=(2, 5), match='cat'>
>>> x = re.search("cow", "A cat and a rat can't be friends.")
>>> print(x)
None

In the previous example we used the search function from the re module. It is arguably the most important and most frequently used function. Using search(expr, s), a string s is examined for the occurrence of a substring that matches the regular expression expr. The search stops at the first matching substring. If we print the result, we see that in the positive case a so-called match object is returned, while in the negative case None is returned. With this knowledge you can already use regular expressions in a Python script without knowing more about match objects:

>>> if re.search("cat", "A cat and a rat can't be friends."):
...     print("The expression matched")
... else:
...     print("The expression did not match")
...
The expression matched
>>> if re.search("cow", "A cat and a rat can't be friends."):
...     print("The expression matched")
... else:
...     print("The expression did not match")
...
The expression did not match
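Over-matching and under-matching can be observed with re.search as well. The sketch below compares the plain expression with the variant that has a space before and after the word; the second sample sentence is made up for illustration.

```python
import re

texts = [
    "A cat and a rat can't be friends.",
    "Higher education matters.",                    # made-up sample sentence
    "The cat, called Oscar, climbed on the roof.",
]

results = []
for text in texts:
    plain = bool(re.search(r"cat", text))      # also finds "cat" inside words
    spaced = bool(re.search(r" cat ", text))   # misses "cat," and similar
    results.append((plain, spaced))
    print(plain, spaced, text)
```

The second sentence is an over-match of the plain expression; the third is lost by the variant with spaces, which is an under-match.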

Any character

Let's assume that in the previous example we are not interested in finding the word cat, but rather three-letter words ending with "at".
Regular expressions offer the metacharacter ".", which stands as a placeholder for any character. The regular expression could be formulated like this:

r" .at "

The regular expression now matches three-letter words that are set off by blanks and end with "at". Now we get words like "rat", "cat", "bat", "eat", "sat" and others.

But what if there was an "@at" or "3at" in the text? They then also match and we have over-matching again. We will learn a solution to get around this in the following subsection of our introduction to regular expressions in Python.
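A quick check shows both the intended matches and the over-matching just mentioned. This is a small sketch; the surrounding "x " and " y" are only there to supply the blanks that the expression requires.

```python
import re

hits = {}
for word in ["rat", "cat", "3at", "@at", "seat"]:
    sentence = "x " + word + " y"
    hits[word] = bool(re.search(r" .at ", sentence))
    print(word, hits[word])
```

"3at" and "@at" are accepted just like "rat" and "cat", while the four-letter word "seat" is rejected.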

Character selection

We can define a selection of characters using square brackets "[" and "]". The expression in square brackets then stands for exactly one character from this selection. Let us consider the following regular expression:

r"M[ae][iy]er"

This expression matches four different spellings of a common German family name. The capital M can be followed by a lowercase "a" or a lowercase "e", then an "i" or a "y" has to follow, followed by "er" at the end.
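The four spellings can be verified with a short sketch. The two additional names in the list are made-up counterexamples.

```python
import re

pattern = r"M[ae][iy]er"
names = ["Mayer", "Meyer", "Maier", "Meier", "Mueller", "Moyer"]

matching = [name for name in names if re.search(pattern, name)]
print(matching)  # ['Mayer', 'Meyer', 'Maier', 'Meier']
```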

Note for advanced users:
Regular expressions can be accepted by so-called finite automata. A finite automaton for accepting the Meyer / Meier / Mayer / Maier variants would look like this:



Simplification in the diagram: there should actually be an arrow at the start node that points back to the start node. This means that you stay at the start node as long as you read characters other than "M". From all other nodes there should also be an arrow back to the start node, which is taken if you read a character that is not on one of the outgoing arrows. As already said at the beginning, these remarks on finite automata can be safely ignored if you do not understand them. They are not essential for using regular expressions.
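For readers who want to experiment, such an automaton can be sketched as a transition table. This is an assumed, hand-written illustration and not how the re module works internally; the names TRANSITIONS and contains_meyer are hypothetical. Unlike the simplified diagram, the sketch moves to state 1 instead of the start node when the mismatching character is itself an "M", because a new candidate may begin there.

```python
# Transition table: state -> {character: next state}. State 5 is accepting.
TRANSITIONS = {
    0: {"M": 1},
    1: {"a": 2, "e": 2},
    2: {"i": 3, "y": 3},
    3: {"e": 4},
    4: {"r": 5},
}


def contains_meyer(text):
    """Return True if text contains Mayer, Meyer, Maier or Meier."""
    state = 0
    for ch in text:
        if ch in TRANSITIONS[state]:
            state = TRANSITIONS[state][ch]
        elif ch == "M":
            state = 1  # a new candidate may start right here
        else:
            state = 0  # fall back to the start node
        if state == 5:
            return True
    return False


print(contains_meyer("He is a German called Mayer."))  # True
print(contains_meyer("MMeier"))                         # True
print(contains_meyer("Mayr"))                           # False
```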

Instead of choosing between individual letters, as in the previous example between "e" and "a" (in RE notation [ea]), you often need to choose between whole ranges of characters, for example a digit between "0" and "5" or a letter between "a" and "e". For this, the notation for regular expressions has a reserved special character within the character selection, namely the hyphen "-". [a-e] is an abbreviation for [abcde], and [0-5] stands for [012345]. The advantage of this notation becomes immediately apparent when you want to write down the character selection "any capital letter". You can write [ABCDEFGHIJKLMNOPQRSTUVWXYZ] or [A-Z]. If this is not yet convincing: what about the character selection "any lower or upper case letter"? It can be written as [A-Za-z].
We leave the inconvenient alternative to the inclined reader :-)

But there is one more thing to say about the hyphen, which we used to mark the beginning and the end of a character range. The hyphen only has a special meaning if it is inside square brackets, and only if it is not positioned immediately after the opening square bracket or immediately before the closing square bracket.
The expression [-az] only describes the choice between the three characters "-", "a" and "z", but no other characters. The same applies to [az-].

Exercise:
Which character class is described by [-a-z]?

Answer: The character "-", because it comes directly after the opening bracket, and all characters from "a" to "z", i.e. the hyphen plus the entire alphabet of lowercase letters.

The other metacharacter within square brackets is the caret (also called the circumflex). If it is positioned directly after the opening square bracket, it negates the selection. [^0-9] denotes the selection "any character that is not a digit". The position of the caret within the square brackets is crucial. If it is not the first character, it has no special meaning and only stands for itself.
[^abc] means any character except "a", "b" or "c"
[a^bc] means either an "a", "b", "c" or a "^"
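The rules for the hyphen and the caret can be tried out with a few made-up test strings. The sketch uses re.findall, a function of the re module that is not covered in the text; it returns all non-overlapping matches as a list.

```python
import re

in_range = re.findall(r"[a-e]", "abcxyz-")   # range from "a" to "e"
literal = re.findall(r"[-az]", "a-m-z")      # hyphen first: only "-", "a", "z"
negated = re.findall(r"[^0-9]", "a1b2")      # caret first: negation
caret = re.findall(r"[a^bc]", "a^d")         # caret elsewhere: a literal "^"

print(in_range)  # ['a', 'b', 'c']
print(literal)   # ['a', '-', '-', 'z']
print(negated)   # ['a', 'b']
print(caret)     # ['a', '^']
```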

Predefined character classes

But even with the simplification offered by the hyphen, i.e. the specification of ranges, it can be very tedious to construct certain character classes in the way shown so far. A good example of this is the character class that defines a valid word character. These are all lower and upper case letters, all digits and the underscore "_". This corresponds to the following character class:

r"[a-zA-Z0-9_]"

This is why there are predefined abbreviations for frequently occurring character classes:
\d  Matches a digit; corresponds to [0-9].
\D  The complement of \d: matches any character except a digit; corresponds to [^0-9].
\s  Matches a whitespace character, i.e. space, tab, newline and so on; corresponds to the class [ \t\n\r\f\v].
\S  The complement of \s: matches anything except whitespace; corresponds to [^ \t\n\r\f\v].
\w  Matches an alphanumeric character or the underscore, i.e. [a-zA-Z0-9_]. If the LOCALE is set, it also matches the special characters of the LOCALE, e.g. the umlauts. In Python 3, str patterns also match such non-ASCII letters by default unless the ASCII flag is set.
\W  The complement of \w.
\b  Matches the empty string, but only at the beginning or the end of a word.
\B  Matches the empty string like \b, but only if it is not at the beginning or the end of a word.
\\  Matches a backslash.

The predefined character classes \b and \B in the previous overview are often misunderstood. While the other classes match individual characters (for example, \w matches "a", "b", "m", "3" and so on), \b and \B do not match any characters. They match empty strings depending on their neighborhood, i.e. it depends on which character comes before and after the empty string. \b matches, for example, if an empty string is between a \W and a \w character or, vice versa, between a \w and a \W character. As usual, \B denotes the complement, which means that empty strings between \W and \W and empty strings between \w and \w are matched.
We illustrate this fact in the following diagram:
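The boundary behaviour described above can also be tried out directly in Python. The sketch below reuses the cat example; the sample sentence is made up, and re.findall (not covered in the text) returns all non-overlapping matches.

```python
import re

text = "The cat, called Oscar, likes education."

whole_words = re.findall(r"\bcat\b", text)   # "cat" as a word of its own
anywhere = re.findall(r"cat", text)          # also inside "education"
inside = re.search(r"\Bcat\B", text)         # only inside a longer word

print(whole_words)            # ['cat']
print(anywhere)               # ['cat', 'cat']
print(inside is not None)     # True
```

\b solves the under-matching problem of the variant with spaces: "cat," is found because the comma is a \W character, while the "cat" in "education" is not.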



Match at the beginning and the end

As we have already explained, the expression r"M[ae][iy]er" is able to match different spellings of the name Meyer. It does not matter whether the name is at the beginning, in the middle or at the end of the string.

>>> import re
>>> line = "He is a German called Mayer."
>>> if re.search(r"M[ae][iy]er", line): print("I found one!")
...
I found one!

But what if we only want occurrences directly at the beginning of the string, i.e. the string has to begin immediately with the "M" of the name?
Python's re module provides two functions for matching regular expressions. We have already got to know one of the two, the search() function. The other function has, in our opinion, a misleading name, because it is called match().
Misleading because match(re_str, s) only checks whether there is a match at the beginning of the string.
Be that as it may, match() is a solution to our problem, as we can see in the following example:

>>> import re
>>> s1 = "Mayer is a very common name"
>>> s2 = "He is called Meyer but he isn't German."
>>> print(re.search(r"M[ae][iy]er", s1))
<re.Match object; span=(0, 5), match='Mayer'>
>>> print(re.search(r"M[ae][iy]er", s2))
<re.Match object; span=(13, 18), match='Meyer'>
>>> print(re.match(r"M[ae][iy]er", s1))
<re.Match object; span=(0, 5), match='Mayer'>
>>> print(re.match(r"M[ae][iy]er", s2))
None

In this way we can match the beginning of a string, but this method only works in Python. The regular expression syntax provides another possibility.

The character '^' (caret) ensures that the following regular expression is only applied directly at the beginning of the string, i.e. a regular expression with a leading "^" must match the beginning of the string. In MULTILINE mode, the expression may additionally match directly after any newline character.

>>> import re
>>> s1 = "Mayer is a very common name"
>>> s2 = "He is called Meyer but he isn't German."
>>> print(re.search(r"^M[ae][iy]er", s1))
<re.Match object; span=(0, 5), match='Mayer'>
>>> print(re.search(r"^M[ae][iy]er", s2))
None

But what happens if we combine the two strings s1 and s2 in the following way: s = s2 + "\n" + s1
The string does not begin with a Maier in any spelling.

>>> s = s2 + "\n" + s1
>>> print(re.search(r"^M[ae][iy]er", s))
None

The expression could not match. But the name comes directly after a newline character. Therefore, the result changes when we switch on the MULTILINE mode:

>>> print(re.search(r"^M[ae][iy]er", s, re.MULTILINE))
<re.Match object; span=(40, 45), match='Mayer'>
>>> print(re.search(r"^M[ae][iy]er", s, re.M))
<re.Match object; span=(40, 45), match='Mayer'>
>>> print(re.match(r"^M[ae][iy]er", s, re.M))
None

The previous example also shows that the multiline mode has no influence on the match function. match() never checks anything other than the beginning of the string, regardless of whether you are in multiline mode or not.

We have now completed the test for the beginning of a string. Checking whether a regular expression matches the end of a string looks similar. The "$" character has a special meaning for this purpose. If a regular expression is followed by a '$' character, the expression must match the end of the string, i.e. there must be no further character between the regular expression and the end of the string (or a newline that ends it). In MULTILINE mode, "$" also matches directly before any newline character. We demonstrate this in the following example:

>>> print(re.search(r"Python\.$", "I like Python."))
<re.Match object; span=(7, 14), match='Python.'>
>>> print(re.search(r"Python\.$", "I like Python and Perl."))
None
>>> print(re.search(r"Python\.$", "I like Python.\nSome prefer Java or Perl."))
None
>>> print(re.search(r"Python\.$", "I like Python.\nSome prefer Java or Perl.", re.M))
<re.Match object; span=(7, 14), match='Python.'>

Optional parts

If you thought that we had already captured all the spellings of the names Mayer and Co., then you were wrong. There are further variants elsewhere in the world, for example in Paris or London, in which the "e" has been dropped. So we get four more spellings: ["Mayr", "Meyr", "Meir", "Mair"] plus our old set ["Mayer", "Meyer", "Meier", "Maier"].

If we try to construct a suitable regular expression, we notice that we are still missing something. How can we say "e can but does not have to occur"? For this purpose, the question mark has been given a special meaning. The expression "e?" means exactly what we want, namely "the letter e can but does not have to appear".

Our final Mayer recognizer now looks like this:

r"M[ae][iy]e?r"

A question mark can also be placed after a pair of round brackets. This means that the complete subexpression within the brackets can appear but does not have to. We can match substrings containing "Feb 2011" or "February 2011" with the following expression:

r"Feb(ruary)? 2011"
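Both uses of the question mark can be checked with a short sketch. The string "Febr 2011" is a made-up counterexample.

```python
import re

name_re = r"M[ae][iy]e?r"
names = ["Mayer", "Meyer", "Meier", "Maier", "Mayr", "Meyr", "Meir", "Mair"]
all_names_match = all(re.search(name_re, name) for name in names)
print(all_names_match)  # True

date_re = r"Feb(ruary)? 2011"
date_hits = {}
for candidate in ["Feb 2011", "February 2011", "Febr 2011"]:
    date_hits[candidate] = bool(re.search(date_re, candidate))
    print(candidate, date_hits[candidate])
```

"Febr 2011" is rejected because the bracketed part "ruary" is optional only as a whole.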

Quantifiers

With the syntactic means that we have learned so far, certain properties cannot be expressed in regular expressions. For example, you always need ways to state that certain subexpressions are to be repeated. We have just got to know one form of repetition, the question mark. A character or a subexpression enclosed in round brackets is "repeated" either once or never.

In addition, at the beginning of this introduction we got to know another quantifier without going into it in any detail. It was the star operator. If an asterisk "*" follows a character or a subexpression, this means that this character or subexpression may be absent or may be repeated any number of times.

r"[0-9]*"

The above expression matches any sequence of digits, but also the empty string. r".*" matches any sequence of characters and the empty string.

Exercise:
Write a regular expression that matches strings that begin with a sequence of digits - at least one - and are followed by a space.

Solution: r"^[0-9][0-9]* "
So you used the plus sign? This is great, but in this case you must have cheated by reading on, or you already know more about regular expressions than what we have covered in our course so far :-)

Since we are already talking about the plus operator: it can be used to solve the previous exercise in a more pleasant way. In principle, the plus operator works like the asterisk operator, except that the plus operator requires at least one occurrence of the character or subexpression.

Solution to our problem with the "+" operator:

r"^[0-9]+ "

But even with the plus and asterisk operators, something important is missing: in certain situations we want to be able to specify the exact number of repetitions or a minimum or maximum number of repetitions. Let us assume that we want to read address lines from envelopes in Switzerland, namely the line containing the postcode and the place name, i.e. a four-digit postcode followed by a space and the place name. + and * are too unspecific for this case, and the following solution is certainly too cumbersome:

r"^[0-9][0-9][0-9][0-9] [A-Za-z]+"

Fortunately, the regular expression syntax offers a better solution:

r"^[0-9]{4} [A-Za-z]*"

Now we want to improve our regular expression even further. Let us assume that there is no city or town in Switzerland with a name made up of fewer than three letters. We can describe this fact with [A-Za-z]{3,}. Now we also want to include mail that goes to Germany. It is well known that postcodes in Germany have one more digit. [0-9]{4,5} means that we expect at least 4 digits but at most 5:

r"^[0-9]{4,5} [A-Z][a-z]{2,}"

In general:
{n}: exactly n times.
{min,max}: at least min times and at most max times.
{,max} is an abbreviated form of {0,max}.
{min,} stands for "at least min times, with no upper limit".
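The effect of the counted quantifiers can be observed with a few address lines. The lines below are made-up test data chosen only to exercise the expression.

```python
import re

address_re = r"^[0-9]{4,5} [A-Z][a-z]{2,}"

address_hits = {}
for line in ["8200 Schaffhausen", "78462 Konstanz", "820 Bern",
             "123456 Bern", "8200 Au"]:
    address_hits[line] = bool(re.search(address_re, line))
    print(line, address_hits[line])
```

Three and six digits are rejected by {4,5}, and the two-letter name fails because [A-Z][a-z]{2,} requires at least three letters in total.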

A practical example in Python

Before we proceed with our introduction, let's insert a little hands-on exercise in Python.
We have a phone book of the Simpsons. Exactly, THE SIMPSONS, the ones from the famous American series. This list includes people whose last name is "Neu". The self-imposed task now is to find those people who have the name Neu and whose first name begins with a "J". To do this, we write a Python script that reads this file and processes it line by line. For those who have problems processing files, we recommend our chapter on files:

import re

fh = open("simpsons_phone_book.txt")
for line in fh:
    if re.search(r"J.*Neu", line):
        print(line.rstrip())
fh.close()

Groupings and backward references

As already explained, expressions can be grouped with round brackets "(" and ")". The matches found by the groups are saved by Python. This enables their reuse at a later point in the same regular expression. This is known as back references. \n (n = 1, 2, 3, ...) denotes the nth group. Before we continue with back references, we want to add a paragraph about match objects, which we will need in the following.

Match Objects

So far we were only interested in whether an expression matched or not. We used the fact that Python, or more precisely the re.search() function, returns a match object if the regular expression matched and otherwise None. So far we weren't interested in what had matched, i.e. which substring. Another piece of information would be where the match took place in the string, i.e. the start and end position.

A match object provides the methods group(), span(), start() and end(), which can be seen in self-explanatory use in the following example:

>>> import re
>>> mo = re.search("[0-9]+", "Customer number: 232454, Date: February 12, 2011")
>>> mo.group()
'232454'
>>> mo.span()
(17, 23)
>>> mo.start()
17
>>> mo.end()
23
>>> mo.span()[0]
17
>>> mo.span()[1]
23

These methods are not difficult to understand: span() returns a 2-tuple containing the start and end position of the substring that the regular expression matched. The methods start() and end() return the start and the end position individually, so span()[0] corresponds to the value of start() and span()[1] corresponds to the value of end().
If group() is called without arguments, it returns the substring that matched the RE. If you call group() with an integer argument n, it returns the substring that the nth group matched. You can also call group() with more than one integer value, e.g. group(n, m). Then no string is returned, but a tuple with the values of group(n) and group(m), so (group(n), group(m)) is equal to group(n, m):

>>> import re
>>> mo = re.search("([0-9]+).*: (.*)", "Customer number: 232454, Date: February 12, 2011")
>>> mo.group()
'232454, Date: February 12, 2011'
>>> mo.group(1)
'232454'
>>> mo.group(2)
'February 12, 2011'
>>> mo.group(1, 2)
('232454', 'February 12, 2011')

A very intuitive example is reading corresponding closing tags from XML or HTML. A file (e.g. "tags.txt") contains the following content:

<composer>Wolfgang Amadeus Mozart</composer>
<author>Samuel Beckett</author>
<city>London</city>

We want to rewrite this text automatically in the following format:

composer: Wolfgang Amadeus Mozart
author: Samuel Beckett
city: London

This can be realized with Python and regular expressions using the following script. The regular expression works as follows: it first tries to find the symbol "<". Then it reads a group of lowercase letters until it comes across the greater-than sign ">". Everything between "<" and ">" is saved in a back reference under \1. At first, \1 contains the value "composer". After the expression has reached the first ">", the regular expression continues as if it had read <composer>(.*)</composer> from the very beginning.

The associated Python script:

import re

fh = open("tags.txt")
for i in fh:
    res = re.search(r"<([a-z]+)>(.*)</\1>", i)
    print(res.group(1) + ": " + res.group(2))

If there is more than one pair of round brackets within a regular expression, the back references are numbered in the order of the opening brackets: \1, \2, \3, ...
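The effect of the back reference \1 can also be tried without a file. The following sketch keeps the lines in a list instead of reading "tags.txt" and adds one made-up line with a mismatched closing tag; it also checks for None before calling group().

```python
import re

lines = [
    "<composer>Wolfgang Amadeus Mozart</composer>",
    "<author>Samuel Beckett</author>",
    "<city>London</city>",
    "<city>London</town>",   # made-up line: closing tag does not correspond
]

tag_re = r"<([a-z]+)>(.*)</\1>"   # \1 repeats whatever group 1 matched

output = []
for line in lines:
    mo = re.search(tag_re, line)
    if mo:
        output.append(mo.group(1) + ": " + mo.group(2))
    else:
        output.append("no corresponding closing tag: " + line)

for entry in output:
    print(entry)
```

The last line is rejected because "town" is not the text that group 1 matched.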

Exercise:
In the next example three groups are used. A phone list of the Simpsons is given. Not every entry contains a phone number, but if a phone number exists, it is at the beginning of the string. This is followed by the last name, separated by a space. The first names follow, separated by a comma. The list should be output in the following form:

Allison Neu 555-8396
C. Montgomery Burns
Lionel Putz 555-5299
Homer Jay Simpson 555-7334

Here is the Python script that solves the problem:

import re

l = ["555-8396 Neu, Allison",
     "Burns, C. Montgomery",
     "555-5299 Putz, Lionel",
     "555-7334 Simpson, Homer Jay"]

for i in l:
    res = re.search(r"([0-9-]*)\s*([A-Za-z]+),\s+(.*)", i)
    print(res.group(3) + " " + res.group(2) + " " + res.group(1))

Extensive python exercise

In this extensive exercise, we need to bring the information from two lists together. The first file contains almost 15,000 lines of postcodes with the associated city names and other information. Here are a few randomly selected lines for clarity and use for regular expressions:

The other file contains a list of the 19 largest German cities. Each line contains the position of the city, the city name, the number of inhabitants and the state, but not the postcode:

1. Berlin 3,382,169 Berlin
2. Hamburg 1,715,392 Hamburg
3. Munich 1,210,223 Bavaria
4. Cologne 962,884 North Rhine-Westphalia
5. Frankfurt am Main 646,550 Hesse
6. Essen 595,243 North Rhine-Westphalia
7. Dortmund 588,994 North Rhine-Westphalia
8. Stuttgart 583,874 Baden-Württemberg
9. Düsseldorf 569,364 North Rhine-Westphalia
10. Bremen 539,403 Bremen
11. Hanover 515,001 Lower Saxony
12. Duisburg 514,915 North Rhine-Westphalia
13. Leipzig 493,208 Saxony
14. Nuremberg 488,400 Bavaria
15. Dresden 477,807 Saxony
16. Bochum 391,147 North Rhine-Westphalia
17. Wuppertal 366,434 North Rhine-Westphalia
18. Bielefeld 321,758 North Rhine-Westphalia
19. Mannheim 306,729 Baden-Württemberg

The task now is to output the 19 largest cities along with their postcodes. To test the following program in Python, you should save the above list in a file called largest_cities_germany.txt and download the list of German postcodes.
# -*- coding: iso-8859-15 -*-

import re

# Build a dictionary that maps each city name to a postcode.
fh_post_codes = open("post_codes_germany.txt")
ZIP = {}
for line in fh_post_codes:
    (post_code, city, rest) = line.split(",", 2)
    ZIP[city.strip("\"")] = post_code

# Extract the city name from each line of the list of largest cities.
fh_largest_cities = open("largest_cities_germany.txt")
for line in fh_largest_cities:
    re_obj = re.search(r"^[0-9]{1,2}\.\s+([\wÄÖÜäöüß\s]+\w)\s+[0-9]", line)
    city = re_obj.group(1)
    print(city, ZIP[city])

Another postal example

Here is another large example using regular expressions and Python. This time we are leaving Germany and Switzerland and going to England or more precisely to the United Kingdom. We are writing a regular expression for a UK postal code.

A postal code consists of five to seven characters, which are divided into two parts by a space. The two to four characters in front of the space represent the so-called outward code, which is used to route the mail from the incoming post office to the destination post offices. The space is followed by a digit followed by two letters. This second part of the postcode is called the inward code, according to which the mail is sorted in detail in the destination post office. The last two characters can only come from the set ABD-HJLNP-UW-Z, i.e. the letters C, I, K, M, O and V are not allowed.

The outward code, i.e. the first part of the postcode, has the following form: one or two capital letters, followed by either a digit or the letter R, optionally followed by either another letter or another digit. (There are many more rules governing which letters can appear in the postal code depending on position and context, but we will spare ourselves these here!)
The following regular expression matches a superset of the valid postal codes of Great Britain:

r"\b[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}\b"

The following Python script uses the above expression without the word boundaries:

import re

example_codes = ["SW1A 0AA",  # House of Commons
                 "SW1A 1AA",  # Buckingham Palace
                 "SW1A 2AA",  # Downing Street
                 "BX3 2BB",   # Barclays Bank
                 "DH98 1BT",  # British Telecom
                 "N1 9GU",    # Guardian Newspaper
                 "E98 1TT",   # The Times
                 "TIM E22",   # a fake postcode
                 "A B1 A22",  # not a valid postcode
                 "EC2N 2DB",  # Deutsche Bank
                 "SE9 2UG",   # University of Greenwich
                 "N1 0UY",    # Islington, London
                 "EC1V 8DS",  # Clerkenwell, London
                 "WC1X 9DT",  # WC1X 9DT
                 "B42 1LG",   # Birmingham
                 "B28 9AD",   # Birmingham
                 "W12 7RJ",   # London, BBC News Center
                 "BBC 007"]   # a fake postcode

# [A-Z] instead of [A-z]: the latter would also accept lowercase letters
# and the characters between "Z" and "a" such as "[" or "_".
pc_re = r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}"

for postcode in example_codes:
    r = re.search(pc_re, postcode)
    if r:
        print(postcode + " matched!")
    else:
        print(postcode + " is not a valid postcode!")