KMP Algorithm - Notes

Overview: Given "needle" string P = P[0..N-1] of length N, prepare a "jump table" A[0..N-1] in O(N) time, to enable finding occurrences of P in any given "haystack" string S of Length M, in O(M) time. Intuitively, we scan S from left to right and maintain in a clever way (using the jump table) how many initial characters from P we already matched. If, say, P = "ababc", S = "xyabababc" and we are at position 6 and we already matched 4 initial needle characters (since S[2..5] = "abab" = P[0..4]) but the next haystack character doesn't match the next needle character (since S[6] = "a" ≠ "c" = P[4]). We don't want to naively go back and try from position 2+1=3 with 0 matched needle characters. Since "ab" is both a suffix of the already matched characters and also a prefix of the needle, we can try again at position 6, this time with only 2 matched needle characters.

For each position i of the pattern string P's positions [0..N-1], KMP precomputes the length A[i] of the longest substring that is both a prefix and a suffix of P[0..i]. For example,

P = abaabc
A = 001120

Explanation:
At position i=2, we have P[0..i] = aba, and the longest string that is both a prefix and a suffix of that is a, which has length A[i] = A[2] = 1.
At position i=3, we have P[0..i] = abaa, and the longest string that is both a prefix and a suffix of that is a, again, and it has length A[i] = A[3] = 1.
At position i=4, we have P[0..i] = abaab, and the longest string that is both a prefix and a suffix of that is ab, which has length A[i] = A[4] = 2.

Construction of A is possible in O(n) time with a Dynamic Programming approach: We fill the suffix lengths from left to right. To fill A[i], we reuse the already computed value A[i-1]. At any position i ≥ 1 we have: P[0..A[i-1]-1] = P[i-A[i-1]..i-1] (these substrings are empty if A[i-1] = 0). If P[A[i-1]] = P[i], then it follows that P[0..A[i-1]] = P[i-A[i-1]..i]. We set A[i] := A[i-1]+1. Otherwise, we have P[A[i-1]] ≠ P[i], and unfortunately P[0..A[i-1]] ≠ P[i-A[i-1]..i]. But we can try the next smaller prefix-suffix, A[A[i-1]]: by assumption P[0..A[A[i-1]-1] = P[i-A[A[i]]..i-1]. If P[A[A[i-1]-1]] = P[i], then it follows that P[0..A[A[i-1]]] = P[i-A[A[i]]..i]. We set A[i] := A[A[i-1]]+1.

We continue trying smaller prefix-suffixes until we find one that we can extend by P[i]. If we don't find one, we set A[i] := 0 (no initial characters matched).

The matching algorithm is similar. In any case I think the algorithm is pretty difficult to get right. Here's some C code (untested):

#include <stdio.h>

static char P[10000];
static char S[10000];
static int A[10000];
static int N;
static int NS;

static void preprocess(void)
{
        int i, x;

        A[0] = 0;
        for (i = 1; i < N; i++) {
                x = A[i-1];
                while (x && P[x] != P[i])
                        x = A[x-1];
                if (P[x] == P[i])
                        A[i] = x + 1;
                else
                        A[i] = 0;
        }
}

static void match(void)
{
        int i, x;

        x = 0;
        for (i = 0; i < NS; i++) {
                while (x > 0 && (x == N || P[x] != S[i]))
                        x = A[x-1];
                if (P[x] == S[i])
                        x++;
                if (x == N)
                        printf("match at %d-%d\n", i-x+1, i);
        }
}

int main(void)
{
        int i;

        fgets(P, sizeof P, stdin);
        while (P[N] > ' ')
                N++;

        preprocess();

        printf("A[0..%d]:", N-1);
        for (i = 0; i < N; i++)
                printf(" %d", A[i]);
        printf("\n");

        while (fgets(S, sizeof S, stdin)) {
                NS = 0;
                while (S[NS] > ' ')
                        NS++;
                match();
        }

        return 0;
}

Created: 2018-07-29