Speeding up Natural Language Text Search using Compression

Published 2021 in International Journal of Advanced Computer Science and Applications

ABSTRACT

Text search is a well-known problem in computer science in which the valid shifts of a pattern P in a text string T are found. This paper shows how to speed up text search by searching for P in a compressed version of T. A fast compression algorithm was designed for this purpose. The algorithm is based on the assumption that T is restricted to the letters of a single natural language. Relying on this assumption, each letter in T or P is encoded as a single byte instead of its two-byte Unicode representation, which shortens the string on which a text search algorithm works. The main disadvantage of this approach is that the alphabet of T is restricted to a single natural language. However, a wide range of text documents comply with this assumption. Another issue is the overhead required to compress P and T, but the proposed compression algorithm was found to be fast enough that its run time is paid for while text search time is still reduced overall. Different approaches to storing the compressed T are also explored. The experimental study showed that this approach does reduce text search time.

Keywords: text compression; text search; Unicode
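
The encoding idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm, whose details are not given in this record; it only assumes that the alphabet has at most 256 distinct characters, so that each character maps to one byte and a match offset in the compressed text equals the character shift in T. The names `build_codebook`, `compress`, and `find_shifts` are hypothetical, and Python's built-in `bytes.find` stands in for whatever search algorithm is actually used.

```python
def build_codebook(alphabet):
    """Map each distinct character to a one-byte code (hypothetical helper)."""
    symbols = sorted(set(alphabet))
    if len(symbols) > 256:
        raise ValueError("alphabet does not fit in one byte per letter")
    return {ch: code for code, ch in enumerate(symbols)}


def compress(s, codebook):
    """Encode s as one byte per character; raises KeyError on unknown letters."""
    return bytes(codebook[ch] for ch in s)


def find_shifts(pattern, text, codebook):
    """Return every valid shift of pattern in text, searching the compressed forms."""
    p = compress(pattern, codebook)
    t = compress(text, codebook)
    shifts = []
    i = t.find(p)
    while i != -1:
        shifts.append(i)
        # Advance by one so overlapping occurrences are also reported.
        i = t.find(p, i + 1)
    return shifts


# Assumed example: a short Greek text, with the codebook built from the text itself.
text = "αβγ αβ αβγ"
codebook = build_codebook(text)
shifts = find_shifts("αβγ", text, codebook)  # [0, 7]
```

In this example the compressed text occupies 10 bytes, against 20 bytes for the same text in UTF-16, so the search scans half as many bytes. A pattern containing a character outside the codebook cannot occur in T, so a caller could treat the resulting `KeyError` as "no match".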

PUBLICATION RECORD

Publication year
2021
Venue
International Journal of Advanced Computer Science and Applications
Fields of study
Computer Science
Identifiers
DOI 10.14569/IJACSA.2021.0120452

REFERENCES

Pattern Matching in Compressed Texts and Images (2013)
Simple Compression Code Supporting Random Access and Fast String Matching (2007)
A general compression algorithm that supports fast searching (2006)
String matching with stopper compression (2002)
Faster approximate string matching over compressed text (2001)
Boyer-Moore String Matching over Ziv-Lempel Compressed Text (2000)
Fast and flexible word searching on compressed text (2000)
String Matching in Lempel-Ziv Compressed Strings (1998)
A text compression scheme that allows fast searching directly in the compressed file (1994)
Two-dimensional periodicity and its applications (1992)
Fast Pattern Matching in Strings (1977)

CITED BY

A Hybrid Length-Based Pattern Matching Algorithm for Text Searching (2025)
Arabic News Text Summarization: An Extractive Technique (2025)