What is the Knuth Morris Pratt Algorithm

Community

introduction

Like the naive algorithm, the Knuth-Morris-Pratt algorithm starts at the beginning of the text and then works its way through the text. If a prefix of the same name is found, the naive algorithm shows, however, that all characters found so far have been compared for free. Knuth-Morris-Pratt, on the other hand, first examines the search pattern for repetitive prefixes.

In the word 'pineapple', the prefix 'ana' is repeated from the 3rd position: 'ananas'. If a comparison is now made in the text, one finds in the naive algorithm and then establishes that the 'n' was found instead of the 's' sought.

However, we have now searched for repetitive prefixes and found that the (three-digit) prefix 'ana' is repeated from the 3rd position onwards. This means that at the 6th position where we found the 'n' instead of the 's' we were looking for, we have already recognized the positions 3, 4 and 5 as a prefix. In this case we are not in position 6 (the 's') but in position 4 (the 'n') in the search pattern.

The idea

The basic one is this: words can have prefixes. So it happens that the verb 'go' has a letter combination 'be' that is repeated in the word: begeben. If there is a combination of letters in the text 'issued', we find giveg and notice at the second 'g' that we did not find the word 'go' we were looking for. But the last hit matched the prefix 'be'. Since we have not yet checked the following letters, it could be that we have just successfully found the first two letters of the search term 'begend', unfortunately we do not compare with the first 'be' in the search term (begive), but with the second 'be' in the search term (covben). In the naive algorithm we would now have to return to the 'e' ('begiven ') and start the search again there, although we have already analyzed the text up to the red marked' b '. The text between 'e' and 'b' would have to be parsed again.

We can optimize here. If we still knew that we had just read 'be', i.e. the first two letters of the word, then we would not have to jump back in the text, but we would simply continue to compare the search word from the 3rd letter onwards. To do this, we have to examine the search pattern to determine where the search pattern corresponds to the beginning of the search pattern. In addition to the search pattern, we create a table that says for each letter at which point the next comparison must take place in the search pattern in order to generate a valid comparison.

So if we find after a prefix that the word does not match, we look in the table to see which previous position in the search pattern we can compare with in order to possibly find an element after all.

We are looking for the search pattern “pineapple” in the text “anabellmagananasananabolika.” (Anabell likes pineapple anabolic steroids).

The prefix investigation

Explanation

Instead of a simple search pattern, we now have to generate a series of information from the pattern. So that we don't have to juggle thousands of arguments here, we first pack everything that belongs to the search pattern into its own small structure:

struct Pattern {charconst * Pattern; unsignedint Length; int * Table;};

Here we can save the table that saves the shift in the search pattern and memorize the length of the search pattern so that we don't have to continually re-determine it.

We now create a search pattern by creating a structure and writing the address of the search pattern in the structure:

struct pattern pat; pat.Pattern = "ababcabab";

Then we have to somehow create the table. It works like this: we leave out the first letter, because the first letter cannot be a repetition of a previous letter. We are interested in prefix repetitions from the 2nd letter. At each position of the search pattern, we are interested in how many positions in the preceding letters correspond to the beginning of the word.

Let's look at the table for 'pineapple':

position Letter value
0 a -1 Special case: marks the beginning of the search pattern. This has to be compared with!
1 n 0 If the letter from the text does not match, we have to compare the letter found in the text with the search mister at position 0.
2 a 0 An 'a' can be found at position 0 in the search pattern, so the prefix 'a' is 1 + 0 characters long
If the letter found in the text does not match, the preceding 'a', which has no prefix in front of it, is found at position 0 of the search pattern.
3 n 1 The 'n' is at position 1 of the search pattern, so the prefix 'an' is 1 + 1 characters long
If the letter found in the text does not match, the preceding 'n', which has an 'a' in front of it, is at position 1 of the search pattern.
4 a 2 The 'a' is at position 2 of the search pattern, so the prefix 'ana' is 1 + 2 characters long
If the letter does not match, the preceding 'a', which has an 'an' in front of it, is found at position 2 of the search pattern.
5 s 0 The 's' can be found Not at position 3 of the search pattern, so 'anas' is not a prefix. If the letter does not match, there is no preceding 's' with an 'ana' in front of it. We have to recreate the search pattern.

implementation

char createPrefixTable (struct Pattern * pattern) {int patternPos; // Position in the patternint partLength; // Length of the match pattern-> Length = strlen (pattern-> Pattern); pattern-> Table = malloc ((1+ pattern-> Length) * sizeof (int)); if (pattern-> Table) {patternPos = 0; partLength = -1; pattern-> Table [0] = - 1; while (patternPos Length) {while (partLength> = 0 && pattern-> Pattern [partLength]! = pattern-> Pattern [patternPos]) partLength = pattern-> Table [partLength]; patternPos ++; partLength ++; pattern-> Table [patternPos] = partLength;} return TRUE;} return FALSE;}

The search

Explanation

The search itself is now comparatively easy: You go through the text and compare as in the naive algorithm search. The larger the alphabet (i.e. the smaller the chance of encountering a prefix), the more likely the Knuth-Morris-Pratt algorithm will behave in a similar way to the naive search algorithm. However, if it breaks off in a part of the search pattern because the text and search pattern no longer match, then it can skip a prefix that has already been found.

When reading the source text, it is helpful to see how the normal case works. As in the naive algorithm, the rule is that the first letter does not match. In this case == 0, so it is compared with the 1st letter (index 0 in the search pattern). If it does not match, the value of the table at index of (i.e. still 0) is set. The value is the special case in the table: -1.

Then it is increased by 1, so that it is 0 again and then the next letter in the text is compared with the letter of the search pattern at index 0. That corresponds to the naive algorithm.

implementation

int find (charconst * text, unsignedint textLength, struct Pattern * pattern) {int textPos = 0; int patternPos = 0; while (textPos = 0 && text [textPos]! = pattern-> Pattern [patternPos]) patternPos = pattern-> Table [patternPos]; patternPos ++; textPos ++; if (patternPos == pattern-> Length) return textPos - pattern-> Length;} return-1;}

Further information

Source code