Next: Turbo Reverse Factor algorithm Up: ESMAJ Prev: Raita algorithm

Reverse Factor algorithm

Main features
• uses the suffix automaton of xR;
• fast on practice for long pattens and small alphabets;
• preprocessing phase in O(m) time and space complexity;
• searching phase in O(mn) time complexity;
• optimal in the average.
Description

The Boyer-Moore type algorithms match some suffixes of the pattern but it is possible to match some prefixes of the pattern by scanning the character of the window from right to left and then improve the length of the shifts. This is made possible by the use of the smallest suffix automaton (also called DAWG for Directed Acyclic Word Graph) of the reverse pattern. The resulting algorithm is called the Reverse Factor algorithm.

The smallest suffix automaton of a word w is a Deterministic Finite Automaton S(w) =(Q,q0,T,E). The language accepted by S(w) is L(S(w))={u in * :  exists v in * such that w=vu}. The preprocessing phase of the Reverse Factor algorithm consists in computing the smallest suffix automaton for the reverse pattern xR. It is linear in time and space in the length of the pattern.

During the searching phase, the Reverse Factor algorithm parses the characters of the window from right to left with the automaton S(xR), starting with state q0. It goes until there is no more transition defined for the current character of the window from the current state of the automaton. At this moment it is easy to know what is the length of the longest prefix of the pattern which has been matched: it corresponds to the length of the path taken in S(xR) from the start state q0 to the last final state encountered. Knowing the length of this longest prefix, it is trivial to compute the right shift to perform.

The Reverse Factor algorithm has a quadratic worst case time complexity but it is optimal in average. It performs O( n.log(m) / m ) inspections of text characters on the average reaching the best bound shown by Yao in 1979.

The C code

All the functions to create and manipulate a data structure suitable for a suffix automaton are given in the introduction (see implementation).

```void buildSuffixAutomaton(char *x, int m, Graph aut) {
int i, art, init, last, p, q, r;
char c;

init = getInitial(aut);
art = newVertex(aut);
last = init;
for (i = 0; i < m; ++i) {
c = x[i];
p = last;
q = newVertex(aut);
setLength(aut, q, getLength(aut, p) + 1);
setPosition(aut, q, getPosition(aut, p) + 1);
while (p != init &&
getTarget(aut, p, c) == UNDEFINED) {
setTarget(aut, p, c, q);
setShift(aut, p, c, getPosition(aut, q) -
getPosition(aut, p) - 1);
}
if (getTarget(aut, p, c) == UNDEFINED) {
setTarget(aut, init, c, q);
setShift(aut, init, c,
getPosition(aut, q) -
getPosition(aut, init) - 1);
}
else
if (getLength(aut, p) + 1 ==
getLength(aut, getTarget(aut, p, c)))
else {
r = newVertex(aut);
copyVertex(aut, r, getTarget(aut, p, c));
setLength(aut, r, getLength(aut, p) + 1);
while (p != art &&
getLength(aut, getTarget(aut, p, c)) >=
getLength(aut, r)) {
setShift(aut, p, c,
getPosition(aut,
getTarget(aut, p, c)) -
getPosition(aut, p) - 1);
setTarget(aut, p, c, r);
}
}
last = q;
}
setTerminal(aut, last);
while (last != init) {
setTerminal(aut, last);
}
}

char *reverse(char *x, int m) {
char *xR;
int i;

xR = (char *)malloc((m + 1)*sizeof(char));
for (i = 0; i < m; ++i)
xR[i] = x[m - 1 - i];
xR[m] = '\0';
return(xR);
}

int RF(char *x, int m, char *y, int n) {
int i, j, shift, period, init, state;
Graph aut;
char *xR;

/* Preprocessing */
aut = newSuffixAutomaton(2*(m + 2), 2*(m + 2)*ASIZE);
xR = reverse(x, m);
buildSuffixAutomaton(xR, m, aut);
init = getInitial(aut);
period = m;

/* Searching */
j = 0;
while (j <= n - m) {
i = m - 1;
state = init;
shift = m;
while (i + j >= 0 &&
getTarget(aut, state, y[i + j]) !=
UNDEFINED) {
state = getTarget(aut, state, y[i + j]);
if (isTerminal(aut, state)) {
period = shift;
shift = i;
}
--i;
}
if (i < 0) {
OUTPUT(j);
shift = period;
}
j += shift;
}
}

```
The example

Preprocessing phase

Suffix automaton used by Reverse Factor algorithm.

Searching phase

References
• BAEZA-YATES R., NAVARRO G., RIBEIRO-NETO B., 1999, Indexing and Searching, in Modern Information Retrieval, Chapter 8, pp 191-228, Addison-Wesley.
• CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., RYTTER, W., 1992, Deux méthodes pour accélérer l'algorithme de Boyer-Moore, in Théorie des Automates et Applications, Actes des 2e Journées Franco-Belges, D. Krob ed., Rouen, France, 1991, pp 45-63, PUR 176, Rouen, France.
• CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., RYTTER, W., 1994, Speeding up two string matching algorithms, Algorithmica 12(4/5):247-267.
• CROCHEMORE, M., RYTTER, W., 1994, Text Algorithms, Oxford University Press.
• LECROQ T., 1992, A variation on the Boyer-Moore algorithm, Theoretical Computer Science 92(1):119--144.
• LECROQ, T., 1992, Recherches de mot, Ph. D. Thesis, University of Orléans, France.
• LECROQ, T., 1995, Experimental results on string matching algorithms, Software - Practice & Experience 25(7):727-765.
• YAO, A.C., 1979, The complexity of pattern matching for a random string, SIAM Journal on Computing, 8 (3):368--387.

Next: Turbo Reverse Factor algorithm Up: ESMAJ Prev: Raita algorithm

e-mails: {Christian.Charras, Thierry.Lecroq }@laposte.net
Tue Jan 14 15:03:31 MET 1997