Overview: Given "needle" string P = P[0..N-1] of length N, prepare a "jump table" A[0..N-1] in O(N) time, to enable finding occurrences of P in any given "haystack" string S of Length M, in O(M) time. Intuitively, we scan S from left to right and maintain in a clever way (using the jump table) how many initial characters from P we already matched. If, say, P = "ababc", S = "xyabababc" and we are at position 6 and we already matched 4 initial needle characters (since S[2..5] = "abab" = P[0..4]) but the next haystack character doesn't match the next needle character (since S[6] = "a" ≠ "c" = P[4]). We don't want to naively go back and try from position 2+1=3 with 0 matched needle characters. Since "ab" is both a suffix of the already matched characters and also a prefix of the needle, we can try again at position 6, this time with only 2 matched needle characters.
For each position i of the pattern string P's positions [0..N-1], KMP precomputes the length A[i] of the longest substring that is both a prefix and a suffix of P[0..i]. For example,
P = abaabc A = 001120
Explanation:
At position i=2, we have P[0..i] = aba, and the longest
string that is both a prefix and a suffix of that is a, which has
length A[i] = A[2] = 1.
At position i=3, we have P[0..i] = abaa, and the longest
string that is both a prefix and a suffix of that is a, again, and it
has length A[i] = A[3] = 1.
At position i=4, we have P[0..i] = abaab, and the
longest string that is both a prefix and a suffix of that is ab, which
has length A[i] = A[4] = 2.
Construction of A is possible in O(n) time with a Dynamic Programming approach: We fill the suffix lengths from left to right. To fill A[i], we reuse the already computed value A[i-1]. At any position i ≥ 1 we have: P[0..A[i-1]-1] = P[i-A[i-1]..i-1] (these substrings are empty if A[i-1] = 0). If P[A[i-1]] = P[i], then it follows that P[0..A[i-1]] = P[i-A[i-1]..i]. We set A[i] := A[i-1]+1. Otherwise, we have P[A[i-1]] ≠ P[i], and unfortunately P[0..A[i-1]] ≠ P[i-A[i-1]..i]. But we can try the next smaller prefix-suffix, A[A[i-1]]: by assumption P[0..A[A[i-1]-1] = P[i-A[A[i]]..i-1]. If P[A[A[i-1]-1]] = P[i], then it follows that P[0..A[A[i-1]]] = P[i-A[A[i]]..i]. We set A[i] := A[A[i-1]]+1.
We continue trying smaller prefix-suffixes until we find one that we can extend by P[i]. If we don't find one, we set A[i] := 0 (no initial characters matched).
The matching algorithm is similar. In any case I think the algorithm is pretty difficult to get right. Here's some C code (untested):
#include <stdio.h> static char P[10000]; static char S[10000]; static int A[10000]; static int N; static int NS; static void preprocess(void) { int i, x; A[0] = 0; for (i = 1; i < N; i++) { x = A[i-1]; while (x && P[x] != P[i]) x = A[x-1]; if (P[x] == P[i]) A[i] = x + 1; else A[i] = 0; } } static void match(void) { int i, x; x = 0; for (i = 0; i < NS; i++) { while (x > 0 && (x == N || P[x] != S[i])) x = A[x-1]; if (P[x] == S[i]) x++; if (x == N) printf("match at %d-%d\n", i-x+1, i); } } int main(void) { int i; fgets(P, sizeof P, stdin); while (P[N] > ' ') N++; preprocess(); printf("A[0..%d]:", N-1); for (i = 0; i < N; i++) printf(" %d", A[i]); printf("\n"); while (fgets(S, sizeof S, stdin)) { NS = 0; while (S[NS] > ' ') NS++; match(); } return 0; }
Created: 2018-07-29