# KMP Algorithm - Notes

Overview: Given "needle" string P = P[0..N-1] of length N, prepare a "jump table" A[0..N-1] in O(N) time, to enable finding occurrences of P in any given "haystack" string S of Length M, in O(M) time. Intuitively, we scan S from left to right and maintain in a clever way (using the jump table) how many initial characters from P we already matched. If, say, P = "ababc", S = "xyabababc" and we are at position 6 and we already matched 4 initial needle characters (since S[2..5] = "abab" = P[0..4]) but the next haystack character doesn't match the next needle character (since S = "a" ≠ "c" = P). We don't want to naively go back and try from position 2+1=3 with 0 matched needle characters. Since "ab" is both a suffix of the already matched characters and also a prefix of the needle, we can try again at position 6, this time with only 2 matched needle characters.

For each position i of the pattern string P's positions [0..N-1], KMP precomputes the length A[i] of the longest substring that is both a prefix and a suffix of P[0..i]. For example,

```P = abaabc
A = 001120
```

Explanation:
At position i=2, we have P[0..i] = aba, and the longest string that is both a prefix and a suffix of that is a, which has length A[i] = A = 1.
At position i=3, we have P[0..i] = abaa, and the longest string that is both a prefix and a suffix of that is a, again, and it has length A[i] = A = 1.
At position i=4, we have P[0..i] = abaab, and the longest string that is both a prefix and a suffix of that is ab, which has length A[i] = A = 2.

Construction of A is possible in O(n) time with a Dynamic Programming approach: We fill the suffix lengths from left to right. To fill A[i], we reuse the already computed value A[i-1]. At any position i ≥ 1 we have: P[0..A[i-1]-1] = P[i-A[i-1]..i-1] (these substrings are empty if A[i-1] = 0). If P[A[i-1]] = P[i], then it follows that P[0..A[i-1]] = P[i-A[i-1]..i]. We set A[i] := A[i-1]+1. Otherwise, we have P[A[i-1]] ≠ P[i], and unfortunately P[0..A[i-1]] ≠ P[i-A[i-1]..i]. But we can try the next smaller prefix-suffix, A[A[i-1]]: by assumption P[0..A[A[i-1]-1] = P[i-A[A[i]]..i-1]. If P[A[A[i-1]-1]] = P[i], then it follows that P[0..A[A[i-1]]] = P[i-A[A[i]]..i]. We set A[i] := A[A[i-1]]+1.

We continue trying smaller prefix-suffixes until we find one that we can extend by P[i]. If we don't find one, we set A[i] := 0 (no initial characters matched).

The matching algorithm is similar. In any case I think the algorithm is pretty difficult to get right. Here's some C code (untested):

```#include <stdio.h>

static char P;
static char S;
static int A;
static int N;
static int NS;

static void preprocess(void)
{
int i, x;

A = 0;
for (i = 1; i < N; i++) {
x = A[i-1];
while (x && P[x] != P[i])
x = A[x-1];
if (P[x] == P[i])
A[i] = x + 1;
else
A[i] = 0;
}
}

static void match(void)
{
int i, x;

x = 0;
for (i = 0; i < NS; i++) {
while (x > 0 && (x == N || P[x] != S[i]))
x = A[x-1];
if (P[x] == S[i])
x++;
if (x == N)
printf("match at %d-%d\n", i-x+1, i);
}
}

int main(void)
{
int i;

fgets(P, sizeof P, stdin);
while (P[N] > ' ')
N++;

preprocess();

printf("A[0..%d]:", N-1);
for (i = 0; i < N; i++)
printf(" %d", A[i]);
printf("\n");

while (fgets(S, sizeof S, stdin)) {
NS = 0;
while (S[NS] > ' ')
NS++;
match();
}

return 0;
}
```

Created: 2018-07-29