Category: Software Design
Searching for a Search Algorithm

I'm trying to implement a search function for a flat-file database system I'm writing in C, and I need to pick a search algorithm for it.

I was thinking that the obvious way was to search for the string in each record, one-by-one, and return the records that matched. However, this looks a bit inefficient and 'brute force' to me.
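
Here's roughly what I mean by the brute-force version (a minimal sketch; Record is just a stand-in for my real record layout):

#include <string.h>

typedef struct {
    char text[256];    /* record contents; placeholder layout */
} Record;

/* Scan every record one-by-one, counting the ones that contain `needle`. */
int scan_records(const Record *recs, int count, const char *needle)
{
    int hits = 0;
    for (int i = 0; i < count; i++) {
        if (strstr(recs[i].text, needle) != NULL)
            hits++;    /* collect or print record i here */
    }
    return hits;
}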

I was also thinking that an alternative could be to search for the first couple of letters of my 'search words' and search for the rest from the list of records that matched *that*.
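
In code, that first pass might be something like this (a rough sketch; it assumes the search word is at least two characters long):

/* First pass: does the record text contain the first two letters of the word? */
int prefix_candidate(const char *text, const char *word)
{
    char prefix[3] = { word[0], word[1], '\0' };
    return strstr(text, prefix) != NULL;
}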

Perhaps alphabetizing and keeping a separate index file that keeps track of where sections start/end? That way if you're seeking an "S" record it doesn't have to search the entire file. (I wouldn't suggest breaking it down simply by first letter though, because there are quite a few more words that start with S than X, Y, or Z. Perhaps grouping letters might help there, but I digress...)
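
Something like this is what I have in mind for the index entries (a hypothetical layout; adjust to your real record format):

/* One entry per section of the alphabetized data file. */
typedef struct {
    char first;    /* first letter (or letter group) of the section */
    long start;    /* byte offset where the section starts          */
    long end;      /* byte offset where the section ends            */
} IndexEntry;

/* Seek straight to the section for letter `c` instead of scanning the file. */
long section_start(const IndexEntry *idx, int n, char c)
{
    for (int i = 0; i < n; i++)
        if (idx[i].first == c)
            return idx[i].start;
    return -1L;    /* no such section */
}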

Unfortunately, by doing this you incur some overhead on insertions into the database. The question is which operation will be used most often. If searching is the most mission-critical portion of your code, optimizing the search to the detriment of insertions and other ops is not only a good idea, it's a must.

Another option is to apply a hash to the search terms (word by word) as well as to the text in the record (again, word by word). A few bit shifts and a bunch of integer comparisons are _NEVER_ going to hit you nearly as hard as any sort of string comparison. A good string hashing function is the "Dragon Book Hash":



unsigned int hash(const char *s)
{
    const char *ss = s;
    unsigned int h = 0, g;

    for (; *ss != '\0'; ss++)
    {
        h = (h << 4) + (unsigned char)*ss;
        if ((g = h & 0xf0000000) != 0)   /* fold the top 4 bits back in */
        {
            h ^= g >> 24;
            h ^= g;
        }
    }
    return h % table_size;   /* table_size: number of buckets, defined elsewhere */
}


Generally it's used to hash strings for symbol tables in compilers, but in your case this might help. Granted, you'll have to store the hashed values of the records somewhere, incurring overhead in storage, and in insertions too.

This would seem the best approach. It's going to cut down the size of the search space really quickly. One thing to guard against is that this function doesn't produce unique integers for every input string, but it's going to return fewer "hits" to look over on the second pass than re-searching everything that matches the first two letters.
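
To make the two-pass idea concrete, here's a rough sketch (WordEntry and friends are made-up names; it assumes the raw hash value, i.e. h before the % table_size, is stored for each word at insert time):

#include <string.h>

/* Hypothetical per-record storage: each word of the record text
   kept alongside its precomputed hash. */
typedef struct {
    unsigned int hash;    /* raw hash of the word, computed at insert time */
    char word[64];        /* the word itself, for the verification pass    */
} WordEntry;

/* Cheap integer comparison first; strcmp only on the hash hits. */
int record_matches(const WordEntry *words, int nwords,
                   const char *term, unsigned int term_hash)
{
    for (int i = 0; i < nwords; i++) {
        if (words[i].hash == term_hash &&        /* pass 1: integers */
            strcmp(words[i].word, term) == 0)    /* pass 2: verify   */
            return 1;
    }
    return 0;
}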

*shrugs* That's just my $0.02 though...

Just want to note that what you call the "Dragon Book Hash" is actually a modified "multiply by 31" hashing algo. The multiplication

h = h*31 + *ss;

has been optimized to:

h = (h << 5) - h + *ss;

Which, with the subtraction dropped (so it's effectively multiply by 32) and the overflow folding added, becomes:

h = (h << 5) + *ss;
if ((g = h & 0xf0000000) != 0)
{
h ^= g >> 24;
h ^= g;
}

I ran some experiments, and they showed a better distribution using shift by 5 instead of shift by 4.
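
Put together as a complete function, the shift-by-5 variant with the folding looks like this (table_size is assumed to be defined elsewhere, as before):

extern unsigned int table_size;    /* number of buckets, defined elsewhere */

unsigned int hash31(const char *s)
{
    unsigned int h = 0, g;
    for (; *s != '\0'; s++)
    {
        h = (h << 5) + (unsigned char)*s;    /* shift by 5 instead of 4 */
        if ((g = h & 0xf0000000) != 0)       /* fold overflow bits back */
        {
            h ^= g >> 24;
            h ^= g;
        }
    }
    return h % table_size;
}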

The practical way to build your search is to read the file into a data structure that keeps a hash table for each record field you need to search by.

If your search program can load the file at startup, this is not a problem, as the overhead of indexing will only be paid once, at startup.

Once loaded, the data can be manipulated inside your data structure and rewritten to the file when your program needs to save it.
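
As a sketch of the overall shape (all names hypothetical, and the table reduced to its bare bones; a real one needs insert/lookup/free routines):

/* One chain node per indexed value. */
typedef struct Node {
    unsigned int key;      /* hash of the field value        */
    long         record;   /* which record it belongs to     */
    struct Node *next;     /* collision chain                */
} Node;

typedef struct {
    Node *buckets[1024];   /* one chain per bucket           */
} Table;

/* The in-memory database: the records read from the flat file,
   plus one hash table per field you need to search by. */
typedef struct {
    char **records;        /* raw record lines from the file */
    int    count;
    Table  by_name;        /* index on a "name" field        */
    Table  by_city;        /* index on a "city" field        */
} Database;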

Good choice....

I don't know why I didn't think of that before. Some sort of symbol table is definitely the answer.

I have a good optimized one laying around somewhere if you're interested.
