Next: Galil-Giancarlo algorithm Up: ESMAJ Prev: Simon algorithm

# Colussi algorithm

Main features
• refinement of the Knuth, Morris and Pratt algorithm;
• partitions the set of pattern positions into two disjoint subsets; the positions in the first set are scanned from left to right and when no mismatch occurs the positions of the second subset are scanned from right to left;
• preprocessing phase in O(m) time and space complexity;
• searching phase in O(n) time complexity;
• performs n text character comparisons in the worst case.
Description

The design of the Colussi algorithm follows a tight analysis of the Knuth, Morris and Pratt algorithm.

The set of pattern positions is divided into two disjoint subsets. Then each attempt consists in two phases:
 in the first phase the comparisons are performed from left to right with text characters aligned with pattern position for which the value of the kmpNext function is strictly greater than -1. These positions are called noholes;
 the second phase consists in comparing the remaining positions (called holes) from right to left.
This strategy presents two advantages:
 when a mismatch occurs during the first phase, after the appropriate shift it is not necessary to compare the text characters aligned with noholes compared during the previous attempt;
 when a mismatch occurs during the second phase it means that a suffix of the pattern matches a factor of the text, after the corresponding shift a prefix of the pattern will still match a factor of the text, then it is not necessary to compare this factor again.
Definitions

For  i  m-1: kmin[i]= d>0 iff x[0 .. i-1-d]=x[d .. i-1] and x[i-d x[i], 0 otherwise.

When kmin  0 a periodicity ends at position i in x.

For 0 < i < m if kmin[i-1]  0 then i is a nohole otherwise i is a hole.

Let nd+1 be the number of noholes in x.

The table h contains first the nd+1 noholes in increasing order and then the m-nd-1 holes in decreasing order:
 for 0  i  nd, h[i] is a nohole and h[i] < h[i+1] for 0  i
 for nd < i < m, h[i] is a hole and h[i] > h[i+1] for nd < i < m-1.

If i is a hole then rmin[i] is the smallest period of x greater than i.

The value of first[u] is the smallest integer v such that u  h[v].

Then assume that x is aligned with y[j .. j+m-1]. If x[h[k]]=y[j+h[k]] for  k < r < nd and x[h[r]]  y[j+h[r]]. Let j’ = j+kmin[h[r]].
Then there is no occurrence of x beginning in y[j .. j’] and x can be shifted by kmin[h[r]] positions to the right.
Moreover x[h[k]]=y[j’+h[k]] for  k < first[h[r]-kmin[h[r]]] meaning that the comparisons can be resume with x[h[first[h[r]-kmin[h[r]]]]] and y[j’+h[first[h[r]-kmin[h[r]]]]].

Figure 9.1: Mismatch with a nohole. Noholes are black circles and are compared from left to right.
In this situation, after the shift, it is not necessary to compare the first two noholes again.

If x[h[k]]=y[j+h[k]] for  k < r and x[h[r]]  y[j+h[r]] with nd  r < m. Let j’=j+rmin[h[r]]. Then there is no occurrence of x beginning in y[j .. j’] and x can be shifted by kmin[h[r]] positions to the right.
Moreover x[0 .. m-1-rmin[h[r]]]=y[j’ .. j+m-1] meaning that the comparisons can be resume with x[h[first[m-1-rmin[h[r]]]]] and y[j’+h[first[m-1-rmin[h[r]]]]].

Figure 9.2: Mismatch with a hole. Noholes are black circles and are compared from left to right
while holes are white circles and are compared from right to left.
In this situation, after the shift, it is not necessary to compare the matched prefix of the pattern again.

To compute the values of kmin, a table hmax is used and defined as follows: hmax[k] is such that x[k .. hmax[k]-1]=x[0 .. hmax[k]-k-1] and x[hmax[k]]  x[hmax[k]-k].

The value of nhd0[i] is the number of noholes strictly smaller than i.

We can now define two functions shift and next as follows:
 shift[i]=kmin[h[i]] and next[i]=nhd0[h[i]-kmin[h[i]]] for i < nd;
 shift[i]=rmin[h[i]] and next[i]=nhd0[m-rmin[h[i]]] for nd  i < m;
 shift[m]=rmin[0] and next[m]=nhd0[m-rmin[h[m-1]]].

Thus, during an attempt where the window is positioned on the text factor y[j .. j+m-1], when a mismatch occurs between x[h[r]] and y[j+h[r]] the window must be shifted by shift[r] and the comparisons can be resume with pattern position h[next[r]].

The preprocessing phase can be done in O(m) space and time. The searching phase can then be done in O(n) time complexity and furthermore at most n text character comparisons are performed during the searching phase.

The C code
```int preColussi(char *x, int m, int h[], int next[],
int shift[]) {
int i, k, nd, q, r, s;
int hmax[XSIZE], kmin[XSIZE], nhd0[XSIZE], rmin[XSIZE];

/* Computation of hmax */
i = k = 1;
do {
while (x[i] == x[i - k])
i++;
hmax[k] = i;
q = k + 1;
while (hmax[q - k] + k < i) {
hmax[q] = hmax[q - k] + k;
q++;
}
k = q;
if (k == i + 1)
i = k;
} while (k <= m);

/* Computation of kmin */
memset(kmin, 0, m*sizeof(int));
for (i = m; i >= 1; --i)
if (hmax[i] < m)
kmin[hmax[i]] = i;

/* Computation of rmin */
for (i = m - 1; i >= 0; --i) {
if (hmax[i + 1] == m)
r = i + 1;
if (kmin[i] == 0)
rmin[i] = r;
else
rmin[i] = 0;
}

/* Computation of h */
s = -1;
r = m;
for (i = 0; i < m; ++i)
if (kmin[i] == 0)
h[--r] = i;
else
h[++s] = i;
nd = s;

/* Computation of shift */
for (i = 0; i <= nd; ++i)
shift[i] = kmin[h[i]];
for (i = nd + 1; i < m; ++i)
shift[i] = rmin[h[i]];
shift[m] = rmin[0];

/* Computation of nhd0 */
s = 0;
for (i = 0; i < m; ++i) {
nhd0[i] = s;
if (kmin[i] > 0)
++s;
}

/* Computation of next */
for (i = 0; i <= nd; ++i)
next[i] = nhd0[h[i] - kmin[h[i]]];
for (i = nd + 1; i < m; ++i)
next[i] = nhd0[m - rmin[h[i]]];
next[m] = nhd0[m - rmin[h[m - 1]]];

return(nd);
}

void COLUSSI(char *x, int m, char *y, int n) {
int i, j, last, nd,
h[XSIZE], next[XSIZE], shift[XSIZE];

/* Processing */
nd = preColussi(x, m, h, next, shift);

/* Searching */
i = j = 0;
last = -1;
while (j <= n - m) {
while (i < m && last < j + h[i] &&
x[h[i]] == y[j + h[i]])
i++;
if (i >= m || last >= j + h[i]) {
OUTPUT(j);
i = m;
}
if (i > nd)
last = j + m - 1;
j += shift[i];
i = next[i];
}
}

```
The example

Preprocessing phase

Tables used by Colussi algorithm

Searching phase

References
• BRESLAUER, D., 1992, Efficient String Algorithmics, Ph. D. Thesis, Report CU-024-92, Computer Science Department, Columbia University, New York, NY.
• COLUSSI L., 1991, Correctness and efficiency of the pattern matching algorithms, Information and Computation 95(2):225-251.
• COLUSSI, L., GALIL, Z., GIANCARLO, R., 1990, On the exact complexity of string matching, in Proceedings of the 31st IEEE Annual Symposium on Foundations of Computer Science, Saint Louis, MO, pp 135-144, IEEE Computer Society Press.
• GALIL, Z., GIANCARLO, R., 1992, On the exact complexity of string matching: upper bounds, SIAM Journal on Computing, 21(3):407-437.

Next: Galil-Giancarlo algorithm Up: ESMAJ Prev: Simon algorithm

e-mails: {Christian.Charras, Thierry.Lecroq }@laposte.net
Tue Jan 14 15:03:31 MET 1997