Lets you find the first occurrence of a string within a file.
I started working on a project to replace the often buggy and slow (and, in my opinion, just plain bad) Find Files/Folders function that comes with Windows (Windows key + ‘F’). In Windows XP the OS’s search utility seems to be severely lacking in functionality. In the previous versions of Windows I used, I didn’t find too many problems.
The most important part of my project was searching for text within files, something the Find Files/Folders function claims it can do, but it never seems to return results even when I know there should be some. This is what caused me to look for a nice way to search for text inside a file in much the same way strstr searches for a string inside a larger string. I did find a solution somewhere out there on the web, but for reasons I still can’t figure out (the code was very messy) it would stop actually looking once it had searched through about 16MB worth of files in one run.
Since I could not find anything out there that would let me do really heavy file searching, I had to write it myself. What I created is designed to be generally platform independent. Normally I do not write code this way, because 99.9% of the time I develop exclusively for Windows.
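For reference, here is a minimal sketch of the in-memory strstr behaviour I wanted to mimic for files. OffsetInString is a hypothetical helper name used only for this illustration; it turns the pointer strstr returns into an offset, which is the kind of value a file search would hand back.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only): returns the offset of the first
   occurrence of needle inside haystack, or -1 if it does not occur. */
long OffsetInString(const char* haystack, const char* needle)
{
    /* strstr returns a pointer to the first match, or NULL if there is none */
    const char* pMatch = strstr(haystack, needle);
    if (!pMatch)
    {
        return -1L;
    }
    /* pointer difference gives the position within the larger string */
    return (long)(pMatch - haystack);
}
```

Called as OffsetInString("hello world", "world"), this gives 6, the zero-based position where the match starts.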
I’ll be doing this with plain old C-style file functions, for several reasons:
- They’re MUCH faster than C++ filestreams.
- The compiled code is MUCH smaller than code using C++ filestreams.
- They’re compatible with old C code.
- The code looks nicer (to me at least).
- The project I was working with was using them.
- They’re fun and you should learn to use them.
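If you haven’t used the C-style file functions before, here is a minimal sketch of the fseek()/ftell() idiom for getting a file’s size. StreamSize is a hypothetical helper name, not part of my project, and I’m assuming the file was opened in binary mode so the positions are plain byte counts.

```c
#include <stdio.h>

/* Hypothetical helper (illustration only): returns the size in bytes of an
   open file, or -1L on failure. The caller's file position is restored. */
long StreamSize(FILE* pFile)
{
    long lOriginal;
    long lSize;

    if (!pFile)
    {
        return -1L;
    }
    /* remember where the caller was so we can put the position back */
    lOriginal = ftell(pFile);
    if (lOriginal < 0)
    {
        return -1L;
    }
    /* jump to the end; the position there is the size of the file */
    if (fseek(pFile, 0, SEEK_END) != 0)
    {
        return -1L;
    }
    lSize = ftell(pFile);
    /* go back to the original position */
    if (fseek(pFile, lOriginal, SEEK_SET) != 0)
    {
        return -1L;
    }
    return lSize;
}
```

Note that ftell() returns a signed long, so this approach tops out at 2GB on platforms where long is 32 bits.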
I’m also going to be using malloc() and free() instead of new and delete, for no real reason other than to make this code more C compliant even though it is meant to be C++ code. And of course I’ll be using C-style strings and C-style string functions. I always do this with code I plan on recycling, because if I ever put it in a DLL I can rest assured that programs written in another language will be able to use the function. A program written in VB won’t be able to make use of a function inside a DLL that returns a std::string, but it can make use of a function that returns a pointer to a C-style string.
Enough talk, here is the actual code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

//returns the offset of the first occurrence of lpszSearchString in
//pFile, or -1 on failure; open pFile in binary mode ("rb") so the
//offsets are exact byte positions
unsigned long FileSearch(FILE* pFile, const char* lpszSearchString)
{
    //make sure we were passed valid pointers, if not return -1
    if ((!pFile)||(!lpszSearchString))
    {
        return -1;
    }
    //get the size of the file
    if (fseek(pFile,0,SEEK_END))
    {
        return -1;
    }
    long lFileSize=ftell(pFile);
    //if ftell() failed or the file is empty return -1
    if (lFileSize<=0)
    {
        return -1;
    }
    unsigned long ulFileSize=(unsigned long)lFileSize;
    //get the length of the string we're looking for, this is
    //the size the buffer will need to be
    unsigned long ulBufferSize=(unsigned long)strlen(lpszSearchString);
    //an empty search string, or one longer than the file, can't match
    if ((!ulBufferSize)||(ulBufferSize>ulFileSize))
    {
        return -1;
    }
    //allocate the memory for the buffer
    char* lpBuffer=(char*)malloc(ulBufferSize);
    //if malloc() returned a null pointer (which probably means
    //there is not enough memory) then return -1
    if (!lpBuffer)
    {
        return -1;
    }
    unsigned long ulCurrentPosition=0;
    //this is where the actual searching happens: we set the file
    //pointer to the current position (which is incremented by one
    //each pass), read the size of the buffer into the buffer and
    //compare it with the string we're searching for, if the string
    //is found we return the position at which it was found; the
    //<= matters, without it a match at the very end of the file
    //would be missed
    while (ulCurrentPosition<=ulFileSize-ulBufferSize)
    {
        //set the pointer to the current position and read
        //ulBufferSize bytes from the file, give up on any I/O error
        if (fseek(pFile,(long)ulCurrentPosition,SEEK_SET)||
            (fread(lpBuffer,1,ulBufferSize,pFile)!=ulBufferSize))
        {
            break;
        }
        //if the data read matches the string we're looking for
        if (!memcmp(lpBuffer,lpszSearchString,ulBufferSize))
        {
            //free the buffer
            free(lpBuffer);
            //return the position the string was found at
            return ulCurrentPosition;
        }
        //increment the current position by one
        ulCurrentPosition++;
    }
    //if we made it this far the string was not found in the file
    //so we free the buffer
    free(lpBuffer);
    //and return -1
    return -1;
}
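Just a note: I know the return value is unsigned and that I returned -1 in all the error cases. Remember, -1 converted to an unsigned type is that type’s largest value, which is 0xFFFFFFFF for a 32-bit number. Since I sincerely doubt you will ever come across a single file that is over 4GB this should never be a problem. If you do need to search a file that is over 4GB then I suggest replacing “unsigned long” with “unsigned __int64” if your compiler supports it (it is a Microsoft extension; “unsigned long long” is the standard equivalent). Keep in mind that fseek() and ftell() work with a plain long, so you would also need their 64-bit counterparts, such as _fseeki64()/_ftelli64() with Microsoft’s compiler or fseeko()/ftello() on POSIX systems. In that case I doubt even more that your hard drive can hold a file that is 2^64 bytes in size, so returning -1 (a REALLY big number for 64-bit numbers) will do nicely.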
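To make the -1 convention a little harder to get wrong, one option is to give the error value a name. This is only a sketch: FILESEARCH_FAILED and IsValidOffset are hypothetical names I made up for illustration, not part of the function above.

```c
#include <limits.h>

/* Hypothetical name for the "not found / error" value. Converting -1 to an
   unsigned type always yields that type's largest value (ULONG_MAX here),
   whatever the width of unsigned long on your platform. */
#define FILESEARCH_FAILED ((unsigned long)-1)

/* Hypothetical helper: returns nonzero if ulResult is a real file offset
   rather than the error value. */
int IsValidOffset(unsigned long ulResult)
{
    return ulResult != FILESEARCH_FAILED;
}
```

The caller then compares the result of the search against FILESEARCH_FAILED (or passes it to IsValidOffset) instead of testing for a negative number, which would never be true for an unsigned value.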
The above code is probably not the most efficient way of doing this, but it works, and it works fast. If I get the time I might try to make this as fast as possible, but unless this becomes the bottleneck of a program I’m working on that might not be for a while.
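For the curious, here is a rough sketch of one way a faster version could look. It is not code from my project: FileSearchChunked and CHUNK_SIZE are hypothetical names, and the 64KB chunk size is just an assumption. Instead of seeking and reading once per byte, it reads the file in large chunks and carries the last few bytes of each chunk over to the next one, so a match that straddles two chunks is still found. It follows the same convention as above and returns -1 on failure.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 65536UL /* assumed chunk size, tune to taste */

/* Hypothetical variant: offset of the first match, or -1 on failure. */
unsigned long FileSearchChunked(FILE* pFile, const char* lpszSearchString)
{
    const unsigned long ulFailed = (unsigned long)-1;
    size_t needleLen, overlap, have = 0, got, i;
    unsigned long ulBase = 0; /* file offset of buf[0] */
    char* buf;

    if (!pFile || !lpszSearchString)
    {
        return ulFailed;
    }
    needleLen = strlen(lpszSearchString);
    if (needleLen == 0)
    {
        return ulFailed;
    }
    /* bytes carried over so matches spanning two chunks are not lost */
    overlap = needleLen - 1;
    buf = (char*)malloc(CHUNK_SIZE + overlap);
    if (!buf)
    {
        return ulFailed;
    }
    rewind(pFile);
    /* append each new chunk after the bytes carried over from the last */
    while ((got = fread(buf + have, 1, CHUNK_SIZE, pFile)) > 0)
    {
        have += got;
        /* test every start position that has needleLen bytes available */
        for (i = 0; i + needleLen <= have; i++)
        {
            if (memcmp(buf + i, lpszSearchString, needleLen) == 0)
            {
                free(buf);
                return ulBase + (unsigned long)i;
            }
        }
        /* keep only the untested tail (the last needleLen-1 bytes) */
        if (have > overlap)
        {
            memmove(buf, buf + (have - overlap), overlap);
            ulBase += (unsigned long)(have - overlap);
            have = overlap;
        }
    }
    free(buf);
    return ulFailed;
}
```

The comparison loop is still the simple byte-by-byte scan; the only change is fewer trips to the file. I have not benchmarked this against the version above, so treat any speed-up as something to measure rather than assume.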