



n text character comparisons in the worst case.The design of the Colussi algorithm follows a tight analysis of the Knuth, Morris and Pratt algorithm.
|
in the first phase the comparisons are performed from left to right with text characters aligned with pattern position for which the value of the kmpNext function is strictly greater than -1. These positions are called noholes; |
|
the second phase consists in comparing the remaining positions (called holes) from right to left. |
|
when a mismatch occurs during the first phase, after the appropriate shift it is not necessary to compare the text characters aligned with noholes compared during the previous attempt; |
|
when a mismatch occurs during the second phase it means that a suffix of the pattern matches a factor of the text, after the corresponding shift a prefix of the pattern will still match a factor of the text, then it is not necessary to compare this factor again. |
For 0
i
m-1: kmin[i]= d>0 iff x[0 .. i-1-d]=x[d .. i-1] and x[i-d]
x[i], 0 otherwise.
When kmin
0 a periodicity ends at position i in x.
For 0 < i < m if kmin[i-1]
0 then i is a nohole otherwise i is a hole.
Let nd+1 be the number of noholes in x.
|
for 0 i nd, h[i] is a nohole and h[i] < h[i+1] for 0 i<nd; |
|
for nd < i < m, h[i] is a hole and h[i] > h[i+1] for nd < i < m-1. |
If i is a hole then rmin[i] is the smallest period of x greater than i.
The value of first[u] is the smallest integer v such that u
h[v].
Then assume that x is aligned with y[j .. j+m-1]. If x[h[k]]=y[j+h[k]] for 0
k < r < nd and x[h[r]]
y[j+h[r]]. Let j’ = j+kmin[h[r]].
Then there is no occurrence of x beginning in y[j .. j’] and x can be shifted by kmin[h[r]] positions to the right.
Moreover x[h[k]]=y[j’+h[k]] for 0
k < first[h[r]-kmin[h[r]]] meaning that the comparisons can be resume with x[h[first[h[r]-kmin[h[r]]]]] and y[j’+h[first[h[r]-kmin[h[r]]]]].
Figure 9.1: Mismatch with a nohole. Noholes are black circles and are compared from left to right.
In this situation, after the shift, it is not necessary to compare the first two noholes again.
If x[h[k]]=y[j+h[k]] for 0
k < r and x[h[r]]
y[j+h[r]] with nd
r < m. Let j’=j+rmin[h[r]]. Then there is no occurrence of x beginning in y[j .. j’] and x can be shifted by kmin[h[r]] positions to the right.
Moreover x[0 .. m-1-rmin[h[r]]]=y[j’ .. j+m-1] meaning that the comparisons can be resume with x[h[first[m-1-rmin[h[r]]]]] and y[j’+h[first[m-1-rmin[h[r]]]]].
To compute the values of kmin, a table hmax is used and defined as follows: hmax[k] is such that x[k .. hmax[k]-1]=x[0 .. hmax[k]-k-1] and x[hmax[k]]
x[hmax[k]-k].
The value of nhd0[i] is the number of noholes strictly smaller than i.
|
shift[i]=kmin[h[i]] and next[i]=nhd0[h[i]-kmin[h[i]]] for i < nd; |
|
shift[i]=rmin[h[i]] and next[i]=nhd0[m-rmin[h[i]]] for nd i < m; |
|
shift[m]=rmin[0] and next[m]=nhd0[m-rmin[h[m-1]]]. |
Thus, during an attempt where the window is positioned on the text factor y[j .. j+m-1], when a mismatch occurs between x[h[r]] and y[j+h[r]] the window must be shifted by shift[r] and the comparisons can be resume with pattern position h[next[r]].
The preprocessing phase can be done in O(m) space and time. The searching phase can then be done in O(n) time complexity and furthermore at most
n text character comparisons are performed during the searching phase.
int preColussi(char *x, int m, int h[], int next[],
int shift[]) {
int i, k, nd, q, r, s;
int hmax[XSIZE], kmin[XSIZE], nhd0[XSIZE], rmin[XSIZE];
/* Computation of hmax */
i = k = 1;
do {
while (x[i] == x[i - k])
i++;
hmax[k] = i;
q = k + 1;
while (hmax[q - k] + k < i) {
hmax[q] = hmax[q - k] + k;
q++;
}
k = q;
if (k == i + 1)
i = k;
} while (k <= m);
/* Computation of kmin */
memset(kmin, 0, m*sizeof(int));
for (i = m; i >= 1; --i)
if (hmax[i] < m)
kmin[hmax[i]] = i;
/* Computation of rmin */
for (i = m - 1; i >= 0; --i) {
if (hmax[i + 1] == m)
r = i + 1;
if (kmin[i] == 0)
rmin[i] = r;
else
rmin[i] = 0;
}
/* Computation of h */
s = -1;
r = m;
for (i = 0; i < m; ++i)
if (kmin[i] == 0)
h[--r] = i;
else
h[++s] = i;
nd = s;
/* Computation of shift */
for (i = 0; i <= nd; ++i)
shift[i] = kmin[h[i]];
for (i = nd + 1; i < m; ++i)
shift[i] = rmin[h[i]];
shift[m] = rmin[0];
/* Computation of nhd0 */
s = 0;
for (i = 0; i < m; ++i) {
nhd0[i] = s;
if (kmin[i] > 0)
++s;
}
/* Computation of next */
for (i = 0; i <= nd; ++i)
next[i] = nhd0[h[i] - kmin[h[i]]];
for (i = nd + 1; i < m; ++i)
next[i] = nhd0[m - rmin[h[i]]];
next[m] = nhd0[m - rmin[h[m - 1]]];
return(nd);
}
void COLUSSI(char *x, int m, char *y, int n) {
int i, j, last, nd,
h[XSIZE], next[XSIZE], shift[XSIZE];
/* Processing */
nd = preColussi(x, m, h, next, shift);
/* Searching */
i = j = 0;
last = -1;
while (j <= n - m) {
while (i < m && last < j + h[i] &&
x[h[i]] == y[j + h[i]])
i++;
if (i >= m || last >= j + h[i]) {
OUTPUT(j);
i = m;
}
if (i > nd)
last = j + m - 1;
j += shift[i];
i = next[i];
}
}
Preprocessing phase

Searching phase



