Overview: Given "needle" string P = P[0..N-1] of length N, prepare a "jump
table" `A[0..N-1]` in `O(N)` time, to enable finding occurrences
of `P` in any given "haystack" string `S` of Length `M`, in
`O(M)` time. Intuitively, we scan `S` from left to right and
maintain in a clever way (using the jump table) how many initial characters from
`P` we already matched. If, say, `P = "ababc", S = "xyabababc"`
and we are at position `6` and we already matched `4` initial
needle characters (since `S[2..5] = "abab" = P[0..4])` but the next
haystack character doesn't match the next needle character (since `S[6] = "a"
≠ "c" = P[4]`). We don't want to naively go back and try from position
`2+1=3` with `0` matched needle characters. Since `"ab"` is
both a suffix of the already matched characters and also a prefix of the needle,
we can try again at position `6`, this time with only `2` matched
needle characters.

For each position `i` of the pattern string `P`'s positions
`[0..N-1]`, KMP precomputes the length `A[i]` of the longest
substring that is both a prefix and a suffix of `P[0..i]`. For example,

P = abaabc A = 001120

Explanation:

At position `i=2`, we have `P[0..i] = aba`, and the longest
string that is both a prefix and a suffix of that is `a`, which has
length `A[i] = A[2] = 1`.

At position `i=3`, we have `P[0..i] = abaa`, and the longest
string that is both a prefix and a suffix of that is `a`, again, and it
has length `A[i] = A[3] = 1`.

At position `i=4`, we have `P[0..i] = abaab`, and the
longest string that is both a prefix and a suffix of that is `ab`, which
has length `A[i] = A[4] = 2`.

Construction of `A` is possible in O(n) time with a Dynamic
Programming approach: We fill the suffix lengths from left to right. To fill
A[i], we reuse the already computed value A[i-1].
At any position `i ≥ 1` we have: `P[0..A[i-1]-1] =
P[i-A[i-1]..i-1]` (these substrings are empty if `A[i-1] = 0`).
If `P[A[i-1]] = P[i]`, then it follows that `P[0..A[i-1]] =
P[i-A[i-1]..i]`. We set `A[i] := A[i-1]+1`.
Otherwise, we have `P[A[i-1]] ≠ P[i]`, and unfortunately
`P[0..A[i-1]] ≠ P[i-A[i-1]..i]`. But we can try the next smaller
prefix-suffix, `A[A[i-1]]`: by assumption `P[0..A[A[i-1]-1] =
P[i-A[A[i]]..i-1]`. If `P[A[A[i-1]-1]] = P[i]`, then it follows that
`P[0..A[A[i-1]]] = P[i-A[A[i]]..i]`. We set `A[i] := A[A[i-1]]+1`.

We continue trying smaller prefix-suffixes until we find one that we can
extend by `P[i]`. If we don't find one, we set `A[i] := 0` (no
initial characters matched).

The matching algorithm is similar. In any case I think the algorithm is pretty difficult to get right. Here's some C code (untested):

#include <stdio.h> static char P[10000]; static char S[10000]; static int A[10000]; static int N; static int NS; static void preprocess(void) { int i, x; A[0] = 0; for (i = 1; i < N; i++) { x = A[i-1]; while (x && P[x] != P[i]) x = A[x-1]; if (P[x] == P[i]) A[i] = x + 1; else A[i] = 0; } } static void match(void) { int i, x; x = 0; for (i = 0; i < NS; i++) { while (x > 0 && (x == N || P[x] != S[i])) x = A[x-1]; if (P[x] == S[i]) x++; if (x == N) printf("match at %d-%d\n", i-x+1, i); } } int main(void) { int i; fgets(P, sizeof P, stdin); while (P[N] > ' ') N++; preprocess(); printf("A[0..%d]:", N-1); for (i = 0; i < N; i++) printf(" %d", A[i]); printf("\n"); while (fgets(S, sizeof S, stdin)) { NS = 0; while (S[NS] > ' ') NS++; match(); } return 0; }

Created: 2018-07-29