In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text.
Let Σ be an alphabet (a finite set). The most basic example of string searching is where both the pattern and the searched text are arrays of elements of Σ. Σ may be an ordinary human alphabet (for example, the letters A through Z of the Latin alphabet). Other applications may use a binary alphabet (Σ = {0,1}) or, in bioinformatics, a DNA alphabet (Σ = {A,C,G,T}).
In practice, the way the string is encoded can affect which string-search algorithms are feasible. In particular, if a variable-width encoding is in use, then it may be slower to find the Nth character (this may take time proportional to N). This can significantly slow down some search algorithms. One of many possible solutions is to search for the sequence of code units instead, but doing so may produce false matches unless the encoding is specifically designed to avoid it.
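As a minimal sketch of the code-unit approach, assume the text is stored as UTF-8, an encoding in which the bytes of one character can never be mistaken for the start of another, so byte-level matches of a valid needle are real matches. The helper name `find_via_code_units` is hypothetical.

```python
def find_via_code_units(haystack: str, needle: str) -> int:
    """Return the character index of the first match, or -1 (hypothetical helper)."""
    hay_bytes = haystack.encode("utf-8")
    needle_bytes = needle.encode("utf-8")
    # Search the raw code units (bytes) instead of the decoded characters.
    byte_offset = hay_bytes.find(needle_bytes)
    if byte_offset < 0:
        return -1
    # Turning a byte offset into a character index requires decoding the
    # prefix, which takes time proportional to the offset.
    return len(hay_bytes[:byte_offset].decode("utf-8"))
```

With an encoding that lacks this property, the same byte-level search could report a match that starts in the middle of a character.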
Kinds of searching
The most basic case of string searching involves one (often very long) string, sometimes called the "haystack", and one (often very short) string, sometimes called the "needle". The goal is to find one or more occurrences of the needle within the haystack. For example, one might search for "to" within:
Some books are to be tasted, others to be swallowed, and some few to be chewed and digested.
One might request the first occurrence of "to", which is the fourth word; or all occurrences, of which there are 3; or the last, which is the fifth word from the end.
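The three kinds of request (first, all, last) can be sketched with Python's built-in `str.find` and `str.rfind`. The helper `find_all` is a hypothetical name, and the sample sentence is assumed to be the quotation that the word counts above refer to.

```python
def find_all(haystack: str, needle: str) -> list[int]:
    """Return the start index of every (possibly overlapping) occurrence."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are not skipped.
        start = haystack.find(needle, start + 1)
    return positions


text = ("Some books are to be tasted, others to be swallowed, "
        "and some few to be chewed and digested.")

first = text.find("to")       # index of the first occurrence, or -1
last = text.rfind("to")       # index of the last occurrence, or -1
every = find_all(text, "to")  # indices of all occurrences
```

All positions here are character offsets into the haystack, not word numbers.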
Very commonly, however, various constraints are added. For example, one might want to match the "needle" only where it consists of one (or more) complete words, perhaps defined as not having other letters immediately adjacent on either side. In that case, a search for "hew" or "low" should fail for the example sentence above, even though those literal strings do occur.
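One way to express the whole-word constraint, as a sketch, is with regular-expression lookarounds from Python's `re` module. Here "word character" is taken to be whatever `\w` matches (letters, digits and underscore), which is an assumption slightly broader than "letters". The function name `find_whole_word` is hypothetical.

```python
import re


def find_whole_word(haystack: str, needle: str) -> list[int]:
    """Return start indices where needle occurs with no word character on either side."""
    # (?<!\w) : not preceded by a word character
    # (?!\w)  : not followed by a word character
    pattern = r"(?<!\w)" + re.escape(needle) + r"(?!\w)"
    return [match.start() for match in re.finditer(pattern, haystack)]
```

With this definition, "hew" is not found inside "chewed", while "to" is still found wherever it stands alone.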
Another common example involves "normalization". For many purposes, a search for a phrase such as "to be" should succeed even in places where something else intervenes between the "to" and the "be":
- More than one space
- Other "whitespace" characters such as tabs, non-breaking spaces, line breaks, etc.
- Less commonly, a hyphen or soft hyphen
- In structured texts, tags or even arbitrarily large but "parenthetical" things such as footnotes, list numbers or other markers, embedded images, and so on.
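The first three cases in the list above can be sketched by compiling the phrase into a pattern whose word separator accepts any run of whitespace, hyphens or soft hyphens. This is only an illustration: the helper name `phrase_pattern` is hypothetical, and skipping tags or footnotes in structured text is not attempted.

```python
import re


def phrase_pattern(phrase: str) -> re.Pattern:
    """Compile a phrase so that its words may be separated by any mix of
    whitespace, hyphens or soft hyphens (hypothetical helper)."""
    words = phrase.split()
    # \s covers spaces, tabs, line breaks and non-breaking spaces;
    # \u00ad is the soft hyphen; the trailing "-" is a literal hyphen.
    separator = r"[\s\u00ad-]+"
    return re.compile(separator.join(re.escape(word) for word in words))


pattern = phrase_pattern("to be")
found = pattern.search("whether to\n   be or not")  # matches across the line break
```

An alternative design is to normalize the text once and search the normalized copy, at the cost of having to map match positions back to the original.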
Many symbol systems include characters that are synonymous (at least for some purposes):
- Latin-based alphabets distinguish lower-case from upper-case, but for many purposes string search is expected to ignore the distinction.
- Many languages include ligatures, where one composite character is equivalent to two or more other characters.
- Many writing systems involve diacritical marks such as accents or vowel points, which may vary in their usage, or be of varying importance in matching.
- DNA sequences can involve non-coding segments which may be ignored for some purposes, or polymorphisms that lead to no change in the encoded proteins, which may not count as a true difference for some other purposes.
- Some languages have rules where a different character or form of character must be used at the start, middle, or end of words.
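For the first three items in the list above, one possible sketch is to reduce both strings to a common "search key" before comparing, using Python's standard `unicodedata` module. The helper names `search_key` and `loose_contains` are hypothetical, and the choice to discard all diacritics is an assumption that will not suit every language.

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold case, expand compatibility ligatures and strip combining marks."""
    # NFKD splits compatibility ligatures such as U+FB01 into "f" + "i" and
    # separates base letters from their combining accents.
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # casefold() is a more aggressive lower() intended for caseless matching.
    return without_marks.casefold()


def loose_contains(haystack: str, needle: str) -> bool:
    """True if needle occurs in haystack once both are reduced to search keys."""
    return search_key(needle) in search_key(haystack)
```

Because the key can differ in length from the original text, match positions in the key do not map directly back to the original string.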
Finally, for strings that represent natural language, aspects of the language itself become involved. For example, one might wish to find all occurrences of a "word" despite it having alternate spellings, prefixes or suffixes, etc.
Another more complex type of search is regular expression searching, where the user constructs a pattern of characters or other symbols, and any match to the pattern should fulfill the search. For example, to catch both the American English word "color" and the British equivalent "colour", instead of searching for two different literal strings, one might use a regular expression such as:
colou?r
where the "?" conventionally makes the preceding character ("u") optional.
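The pattern can be tried directly with Python's `re` module; the sample sentence is made up for illustration.

```python
import re

pattern = re.compile(r"colou?r")  # "u?" matches zero or one "u"
sample = "The color red and the colour blue; colr is a typo."
matches = pattern.findall(sample)  # ['color', 'colour']
```

The misspelling "colr" is not matched, because only the "u" is optional.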
This article deals primarily with algorithms for simpler string search types.
Basic classification of search algorithms
The various algorithms can be classified by the number of patterns each uses.
Single-pattern algorithms
Let m be the length of the pattern, n the length of the searchable text, and k = |Σ| the size of the alphabet.
Asymptotic times are expressed using O, Ω, and Θ notation.
The Boyer-Moore string-search algorithm has been the standard benchmark for the practical string-search literature.
Algorithms using a finite set of patterns
- Aho-Corasick string matching algorithm (extension of Knuth-Morris-Pratt)
- Commentz-Walter algorithm (extension of Boyer-Moore)
- Set-BOM (extension of Backward Oracle Matching)
- Rabin-Karp string search algorithm
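As an illustration of searching for a finite set of patterns, here is a minimal Rabin-Karp-style sketch with a rolling hash. It assumes all patterns have the same length, and the hash base and modulus are arbitrary choices made for this example. The function name `rabin_karp_multi` is hypothetical.

```python
def rabin_karp_multi(haystack: str, patterns: list[str]) -> list[tuple[int, str]]:
    """Return (start index, pattern) for every occurrence of any pattern.
    All patterns must have the same non-zero length."""
    if not patterns:
        return []
    m = len(patterns[0])
    if m == 0 or any(len(p) != m for p in patterns):
        raise ValueError("patterns must be non-empty and of equal length")
    base, modulus = 256, 1_000_000_007  # assumed hash parameters

    def hash_of(s: str) -> int:
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % modulus
        return h

    # Group the patterns by hash value so one lookup checks the whole set.
    by_hash: dict[int, set[str]] = {}
    for p in patterns:
        by_hash.setdefault(hash_of(p), set()).add(p)

    n = len(haystack)
    if n < m:
        return []
    high = pow(base, m - 1, modulus)  # weight of the window's first character
    window = hash_of(haystack[:m])
    hits = []
    for start in range(n - m + 1):
        if window in by_hash:
            candidate = haystack[start:start + m]
            if candidate in by_hash[window]:  # verify, to rule out collisions
                hits.append((start, candidate))
        if start + m < n:
            # Roll the hash: drop haystack[start], append haystack[start + m].
            window = ((window - ord(haystack[start]) * high) * base
                      + ord(haystack[start + m])) % modulus
    return hits
```

The explicit comparison after a hash hit is what keeps hash collisions from producing false matches.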
Algorithms using an infinite number of patterns
Naturally, the patterns cannot be enumerated finitely in this case. They are usually represented by a regular grammar or regular expression.
Other classifications
Other classification approaches are possible. One of the most common uses preprocessing as the main criterion.
Another classifies algorithms by their matching strategy:
- Match the prefix first (Knuth-Morris-Pratt, Shift-And, Aho-Corasick)
- Match the suffix first (Boyer-Moore and its variants, Commentz-Walter)
- Match the best factor first (BNDM, BOM, Set-BOM)
- Other strategies (Naive, Rabin-Karp)
Naïve string search
A simple and inefficient way to see where one string occurs inside another is to check each place it could be, one by one, to see if it is there. So first we see if there is a copy of the needle starting at the first character of the haystack; if not, we look for a copy of the needle starting at the second character of the haystack; if not, we look starting at the third character, and so forth. In the normal case, we only have to look at one or two characters for each wrong position to see that it is a wrong position, so in the average case this takes O(n + m) steps, where n is the length of the haystack and m is the length of the needle; but in the worst case, searching for a string like "aaaab" in a string like "aaaaaaaaab", it takes O(nm) steps.
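The procedure described above can be sketched as follows; the function name `naive_search` is chosen for this example.

```python
def naive_search(haystack: str, needle: str) -> list[int]:
    """Check every possible start position, one by one."""
    n, m = len(haystack), len(needle)
    positions = []
    for start in range(n - m + 1):
        offset = 0
        # Compare character by character, stopping at the first mismatch.
        while offset < m and haystack[start + offset] == needle[offset]:
            offset += 1
        if offset == m:
            positions.append(start)
    return positions
```

On the worst-case input from the text, `naive_search("aaaaaaaaab", "aaaab")` compares almost the whole needle at every start position before it finds the single match.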
Finite-state-automaton-based search
In this approach, we avoid backtracking by constructing a deterministic finite automaton (DFA) that recognizes the stored search string. These are expensive to construct (they are usually created using the powerset construction) but are very quick to use. For example, such a DFA can be built to recognize the word "MOMMY". This approach is frequently generalized in practice to search for arbitrary regular expressions.
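A minimal sketch of the idea for a single literal pattern follows. Note that it builds the transition table directly from the pattern, with a deliberately simple and slow construction, rather than by the powerset construction mentioned above. The names `build_dfa` and `dfa_search` are hypothetical.

```python
def build_dfa(pattern: str) -> list[dict[str, int]]:
    """dfa[state][ch] is the next state; state len(pattern) means 'matched'."""
    m = len(pattern)
    alphabet = set(pattern)
    dfa: list[dict[str, int]] = [{} for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            # The next state is the length of the longest prefix of the
            # pattern that is a suffix of what has been read so far.
            seen = pattern[:state] + ch
            k = min(m, len(seen))
            while k > 0 and pattern[:k] != seen[len(seen) - k:]:
                k -= 1
            dfa[state][ch] = k
    return dfa


def dfa_search(haystack: str, pattern: str) -> list[int]:
    """Scan the haystack once, never re-reading a character."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    dfa = build_dfa(pattern)
    m = len(pattern)
    state = 0
    positions = []
    for i, ch in enumerate(haystack):
        # Characters that do not occur in the pattern reset the automaton.
        state = dfa[state].get(ch, 0)
        if state == m:
            positions.append(i - m + 1)
    return positions
```

For example, `dfa_search("MOMMOMMY MOMMY", "MOMMY")` finds both occurrences in a single left-to-right pass, including the one that begins inside a failed partial match.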
Stubs
Knuth-Morris-Pratt computes a DFA that recognizes inputs with the string to search for as a suffix. Boyer-Moore starts searching from the end of the needle, so it can usually jump ahead a whole needle-length at each step. Baeza-Yates keeps track of whether the previous j characters were a prefix of the search string, and is therefore adaptable to fuzzy string searching. The bitap algorithm is an application of the Baeza-Yates approach.
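The prefix-tracking idea can be sketched with the exact-matching form of the bit-parallel technique, often called Shift-And. This sketch covers only exact matching, not the fuzzy extension, and the function name `shift_and_search` is hypothetical.

```python
def shift_and_search(haystack: str, needle: str) -> list[int]:
    """Exact matching with one bit of state per needle prefix."""
    m = len(needle)
    if m == 0:
        return []
    # masks[ch] has bit i set when needle[i] == ch.
    masks: dict[str, int] = {}
    for i, ch in enumerate(needle):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0
    positions = []
    for j, ch in enumerate(haystack):
        # Bit i of state is set when needle[:i + 1] equals the
        # i + 1 characters of the haystack that end at position j.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            positions.append(j - m + 1)
    return positions
```

Each haystack character costs one shift, one OR and one AND, which is what makes the approach attractive when the needle fits in a machine word.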
Index methods
Faster search algorithms preprocess the text. After building a substring index, for example a suffix tree or suffix array, the occurrences of a pattern can be found quickly. As an example, a suffix tree can be built in Θ(n) time, and all z occurrences of a pattern can be found in O(m + z) time, assuming the alphabet has a constant size.
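A minimal sketch of an index-based search, using a suffix array rather than a suffix tree. The construction below simply sorts all suffixes, which is far slower than the specialized construction algorithms but keeps the example short. The function names are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes, in lexicographic order of the suffixes.
    Naive construction by sorting; not a linear-time algorithm."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list[int], pattern: str) -> list[int]:
    """Binary-search the index for every suffix that starts with pattern."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


index = build_suffix_array("banana")
hits = find_occurrences("banana", index, "ana")  # [1, 3]
```

The index is built once per text; each query then needs only a logarithmic number of string comparisons, regardless of how many patterns are searched for afterwards.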
Some search methods, for instance trigram search, are intended to find a "closeness" score between the search string and the text rather than a "match/non-match". These are sometimes called "fuzzy" searches.
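One simple way to obtain such a closeness score, as a sketch, is to compare the sets of three-character substrings of the two strings. The scoring formula used here (shared trigrams divided by total distinct trigrams) is an assumption for illustration; real trigram search systems differ in padding and weighting. The function names are hypothetical.

```python
def trigrams(s: str) -> set[str]:
    """All distinct three-character substrings of s, case-folded."""
    folded = s.casefold()
    return {folded[i:i + 3] for i in range(len(folded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Score between 0.0 (no shared trigrams) and 1.0 (same trigram set)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0 if a.casefold() == b.casefold() else 0.0
    return len(ta & tb) / len(ta | tb)
```

Under this scoring, "colour" is closer to "color" than to "flavor", even though neither is an exact match.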
See also
- Sequence alignment
- Graph matching
- Pattern matching
- Compressed pattern matching
- Matching wildcards
External links
- Large list of pattern matching links. Last updated: 12/27/2008 20:18:38
- StringSearch - high-performance pattern matching algorithms in Java - implementations of many string-matching algorithms in Java (BNDM, Boyer-Moore-Horspool, Boyer-Moore-Horspool-Raita, Shift-Or)
- StringsAndChars - implementations of many string-matching algorithms (for single and multiple patterns) in Java
- Exact String Matching Algorithms - animation in Java, detailed description, and C implementation of many algorithms.
- (PDF) Improved Single and Multiple Approximate String Matching
- Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features
- C implementation of Suffix Tree Based Pattern Matching
Source of the article: Wikipedia