Suffix Array Part 2–String Matching/Searching
This is part 2 of the post “suffix array”, which covers the string matching/searching problem that can be solved by suffix array. You may want to read part 1 first.
String Matching/Searching
Once we have the suffix array, it’s easy to check if another string appears in the input string/text or not. If the string appears in the input text, if will be the prefix of one or more its suffix. Since the suffix array are sorted suffix of the input string, binary search can be applied.
Below is an implementation that finds all occurrence of a string in the input text,
/**
this program demonstrate how to use suffix array to search for a particular substring
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int pstrcmp(const void *a, const void *b) {
//printf("pstrcmp: %s %s\n", (const char*)*(char **)a, (const char*)*(char **)b);
return strcmp((const char *)*(char **)a, (const char *)*(char **)b);
}
int mybsearch(const char *target, char **ap, int len) {
int l, r, m;
int tsize = strlen(target);
for (l = 0, r = len - 1; l < len && r >= 0 && l <= r;){
m = (l+r)/2;
if (strncmp(target, ap[m], tsize) < 0) {
r = m - 1;
} else if (strncmp(target, ap[m], tsize) > 0) {
l = m + 1;
} else {
return m;
}
}
return -1;
}
int main(int argc, char **argv) {
int i;
char **ap; //suffix pointers array
int len;
int bpos;
len = strlen(argv[1]);
ap = (char**)malloc(len*sizeof(char*));
for (i = 0; i < len; ++i) {
ap[i] = &(argv[1][i]);
}
for (i = 0; i < len; ++i) {
printf("%s\n", ap[i]);
}
printf("sort the suffices\n");
qsort(ap, len, sizeof(char *), pstrcmp);
for (i = 0; i < len; ++i) {
printf("%s\n", ap[i]);
}
printf("searching for string:%s\n", argv[2]);
bpos = mybsearch(argv[2], ap, len);
printf("found at suffix array pos, original string pos: %d, %d\n", bpos, (ap[bpos]-argv[1])/sizeof(char));
if (bpos > 0) {
printf("found %s\n", ap[bpos]);
//after binary search, we searching in both left directions and right directions for suffix
//that also has the target string as prefix
for (i = bpos-1; i >= 0; --i){
if (strncmp(argv[2], ap[i], strlen(argv[2])) == 0) {
printf("found at suffix array pos, original string pos: %d, %d\n", i, (ap[i]-argv[1])/sizeof(char));
} else {
break;
}
}
for (i = bpos + 1; i < len; ++i) {
if (strncmp(argv[2], ap[i], strlen(argv[2])) == 0) {
printf("found at suffix array pos, original string pos: %d, %d\n", i, (ap[i]-argv[1])/sizeof(char));
} else {
break;
}
}
}
}
The program accepts two input parameters, the input text and the string to search for.
Compile the code using the command below,
gcc -o sufsearch sufsearch.c
Below is a screenshot of the execution,
Figure 1. Searching Strings using Suffix Array
Suppose the length of the string is m, the input text is n. Note that every comparison of the string with suffix of the input text consumes O(m) and O(logn) comparisons is needed to perform the binary search. Therefore, the run time for the algorithm above is O(mlogn), excluding the cost of building the suffix array.
There’re are better algorithms that use the longest common prefix to help the search process. The run time is O(m + logn). Interested users can refer to reference 6 for details.
Note that the input text can be preprocessed to build the suffix array, and the search target string can be given later. This differs from KMP algorithm that the pattern must be given before the computation can begin. This difference makes them suitable for different applications.
References:
1. Suffix Array, Wikipedia. http://en.wikipedia.org/wiki/Suffix_array
2. Brief Introduction to Suffix Array: http://sary.sourceforge.net/docs/suffix-array.html
3. Algorithms, 4th edition website: http://algs4.cs.princeton.edu/63suffix/
4. Programming Perls. 2nd Edition.
5. libdivsufsort, http://code.google.com/p/libdivsufsort/
6. Suffix Arrays: a new method for On-Line String Searches: http://delivery.acm.org/10.1145/330000/320218/p319-manber.pdf?ip=137.132.250.14&acc=ACTIVE%20SERVICE&CFID=67974067&CFTOKEN=52233823&__acm__=1330495809_b1bcb53c53d3d7a276f870cb1e0fdf89
Leave a Reply Cancel reply
40% Discount on My Book — Android NDK Cookbook
Android NDK Cookbook ebook 40% discount with promotion code MREANC40 at Packt Publishing The promotion code is valid until 15th June.Categories
- Android Apps (18)
- Android Audio Editor (1)
- TS 2 (3)
- Video Converter Android (8)
- Video2Gif (1)
- Android Tutorial (26)
- Android Dev Tools (1)
- API illustrated (8)
- Multimedia API (3)
- ffmpeg on Android (4)
- NDK (6)
- UI (5)
- Animation (1)
- Code Snippet (2)
- Coding Beyond Technique (18)
- a word, a world (4)
- Bug Rectified (4)
- Programming Habit (1)
- Software as a Career (1)
- Software as User Experience (1)
- Compilers and Related (2)
- ELF (2)
- Computer Languages (31)
- C/C++ (13)
- Java (9)
- JavaScript (2)
- PHP (1)
- Python (8)
- Data Structure & Algorithms (29)
- Bits (1)
- Data Structure (5)
- Integers (10)
- BigInteger (1)
- Prime (4)
- Search (3)
- Sorting (5)
- Strings (5)
- Database (1)
- SQLite (1)
- Digital Signal Processing (33)
- Distributed Systems (17)
- Apache Cassandra (6)
- Apache Hadoop (8)
- Apache Avro (3)
- Apache Nutch (3)
- Apache Solr (1)
- Linux Study Notes (40)
- crontab (1)
- Linux Kernel Programming (8)
- Linux Programming (12)
- IPC (2)
- Linux Network Programming (5)
- Linux Signals (2)
- Linux Shell Scripting (1)
- ssh (3)
- Machinery (30)
- misc (1)
- My Ideas (1)
- My Project (3)
- Mobile Caching (1)
- Selective Decoding (2)
- My Publication (1)
- My Readings (1)
- Networking (15)
- Program for Performance (8)
- Uncategorized (1)
- Virtual Machine (2)
- Web Dev (8)
- web components (3)
- Android Apps (18)
Recent Comments
Archives
- May 2013 (1)
- April 2013 (1)
- March 2013 (4)
- December 2012 (2)
- November 2012 (6)
- October 2012 (6)
- September 2012 (3)
- August 2012 (13)
- July 2012 (15)
- June 2012 (3)
- May 2012 (8)
- April 2012 (4)
- March 2012 (13)
- February 2012 (19)
- January 2012 (9)
- December 2011 (11)
- November 2011 (12)
- October 2011 (4)
- September 2011 (12)
- August 2011 (16)
- July 2011 (15)
- June 2011 (6)
- May 2011 (10)
- April 2011 (13)
- March 2011 (20)
- February 2011 (4)
- November 2010 (2)
- May 2010 (1)
- April 2010 (1)
- February 2010 (1)




