Speeding up Natural Language Text Search using Compression

Majed AbuSafiya

Published 2021 in International Journal of Advanced Computer Science and Applications

ABSTRACT

Text search is a well-known problem in computer science: find all valid shifts of a pattern P in a text string T. This paper shows how to speed up text search by searching for P in a compressed version of T. A fast compression algorithm was designed for this purpose. The algorithm assumes that T is restricted to the letters of a single natural language. Under this assumption, each letter in T or P is encoded into a single byte instead of its two-byte Unicode encoding, which shortens the strings on which a text search algorithm works. The main disadvantage of this approach is that the alphabet of T must come from a single natural language; however, a wide range of text documents complies with this assumption. Another issue is the overhead required to compress P and T, but the proposed compression algorithm was found to be fast enough that its run time is offset by the savings in text search time. Different approaches to storing the compressed T are also explored. The experimental study showed that this approach does reduce text search time.

Keywords: Text compression; text search; Unicode
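The abstract does not give the algorithm itself, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes that every character of T and P is either ASCII or lies in one 128-code-point window starting at a hypothetical `base` (for example, 0x0400 for basic Cyrillic). The names `compress`, `valid_shifts`, and `base` are invented for this example. Because each character becomes exactly one byte, a match offset in the compressed bytes equals the character shift in the original text.

```python
from typing import List


def compress(text: str, base: int) -> bytes:
    """Encode each character as one byte (assumed scheme, not the paper's).

    ASCII characters keep their value (0x00-0x7F); characters in the
    assumed language window [base, base + 0x80) map to 0x80-0xFF.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif base <= cp < base + 0x80:
            out.append(0x80 + (cp - base))
        else:
            # The single-language assumption is violated for this input.
            raise ValueError(f"character {ch!r} is outside the assumed alphabet")
    return bytes(out)


def valid_shifts(pattern: str, text: str, base: int) -> List[int]:
    """Return every shift at which pattern occurs in text (overlaps included).

    Both strings are compressed first, then searched as byte strings.
    """
    p = compress(pattern, base)
    t = compress(text, base)
    shifts = []
    i = t.find(p)
    while i != -1:
        shifts.append(i)
        i = t.find(p, i + 1)  # advance by one so overlapping matches are found
    return shifts


CYRILLIC_BASE = 0x0400  # assumption: start of the Unicode Cyrillic block
print(valid_shifts("ма", "мама мыла раму", CYRILLIC_BASE))
```

The search here uses Python's built-in `bytes.find` only as a stand-in; any text search algorithm could run on the compressed byte strings instead. Input containing characters outside the assumed window is rejected rather than silently mis-encoded, which mirrors the single-language restriction noted in the abstract.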

PUBLICATION RECORD

  • Publication year

    2021

  • Venue

    International Journal of Advanced Computer Science and Applications

  • Fields of study

    Computer Science

