Volume 1 : Number 2 : Paper 3

December 1998 Special Issue of Best Papers presented at CLEI'97. Valparaiso
Title:
A Practical q -Gram Index for Text Retrieval Allowing Errors

Authors and Affiliations:
Gonzalo Navarro, Depto. de Ciencias de la Computaci
Ricardo Baeza-Yates, Depto. de Ciencias de la Computacion, U. de Chile

Abstract:
We propose an indexing technique for approximate text searching, which is practical and powerful, and especially optimized for natural language text. Unlike other indices of this kind, it is able to retrieve any string that approximately matches the search pattern, not only words. Every text substring of a fixed length q is stored in the index, together with pointers to all the text positions where it appears. The search pattern is partitioned into pieces which are searched in the index, and all their occurrences in the text are verified for a complete match. To reduce space requirements, pointers to blocks instead of exact positions can be used, which increases querying costs. We design an algorithm to optimize the pattern partition into pieces so that the total number of verifications is minimized. This is especially well suited for natural language texts, and allows to know in advance the expected cost of the search and the expected relevance of the query to the user. We show experimentally the building time, space requirements and querying time of our index, finding that it is a practical alternative for text retrieval. The retrieval times are reduced from 10% to 60% of the best on-line algorithm.


Received May 1998, Revised December 1998 , Editor: Nivio Ziviani
Full paper, 16 pages [ PDF, 830 Kb ]