开发者

Regex-based strstr function in C

开发者 https://www.devze.com 2023-03-02 18:31 出处:网络
I needed to find a way to get a pointer to a substring (like strstr, first occurence) to more than one possible needles (patterns) a large string. C\'s standard strstr() does only support one needle,

I needed to find a way to get a pointer to a substring (like strstr, first occurence) to more than one possible needles (patterns) a large string. C's standard strstr() does only support one needle, I need 2 needles or even 3. Why all this? I need to be able to "tokenize" a html document into parts to further parse these "snippets". The "anchor" I need for tokenizing can vary, for example <div class="blub"> or <span id="bla and the html tags to be used as an token could contain numbers in the id/class attribute values (there for I could use \d+ or such to filter).

So I thought to write a function in using posix regex.

The function looks like this:

char * reg_strstr(const char *str, const char *pattern) {
    char *result = NULL;
    regex_t re;
    regmatch_t match[REG_MATCH_SIZE];

    if (str == NULL)
        return NULL;

    if (regcomp( &re, pattern, REG_ICASE | REG_EXTENDED) != 0) {
        regfree( &re );         
        return NULL;
    }

    if (!regexec(&re, str, (size_t) REG_MATCH_SIZE, match, 0)) {

        fprintf( stdout, "Match from %2d to %2d: \"%s\"\n",
             match[0].rm_so,
             match[0].rm_eo,
             str + match[0].rm_so);
        fflush(stdout);

        if ((str + match[0].rm_so) != NULL) {
            result = strndup(str + match[0].rm_so, strlen(str + match[0].rm_so));
        }
    }

    regfree( &re );

    return result;
}

The constant REG_MATCH_SIZE is 10

First of all, does that idea using regex as an extended strstr function make sense at all?

In simple test cases that function seem to work fine:

char *str_result = reg_strstr("<tr class=\"i10\"><td><div class=\"xyz\"><!--DDDD-1234--><div class=\"xx21\">", "<div class=\"xyz\">|<div class=\"i10 rr");

printf( "\n\n"
    "reg_strstr result: '%s' ..\n", str_result);

free( str_result) ;

Using that function in a in a real case environment using a complete HTML document does not to work like expected. It does not find the pattern. Using this function on a memory mapped string (I use a mmap'ed file as a cache for tmp. storage while parsing HTML document data).

EDIT:

Here in a loop like used:

Variables: parse_tag->firsttoken and parse_tag->nexttoken are the html anchors I try to match, just like illustrated above. doc is the input document, from the mmap'ed cache an allocated and '\0' terminated string (with strndup()). Code below works with strstr() as expected. If I find out, the idea using regex strstr really work for me开发者_如何学编程 I can rewrite the loop and maybe return all matches from reg_strstr (as an stringlist or such). So for now I am just trying ...


...
char *tokfrom = NULL, *tokto = NULL;
char *listend = NULL;

/* first token found ? */ if ((tokfrom = strstr(doc, parse_tag->firsttoken)) != NULL) { /* is skipto_nexttoken set ? */ if (!parse_tag->skipto_nexttoken) tokfrom += strlen(parse_tag->firsttoken); else { /* ignore string between firsttoken and first nexttoken */ if ((tokfrom = strstr(tokfrom, parse_tag->nexttoken)) == NULL) goto end_parse; }

/* no listend tag found ? */
if (parse_tag->listend == NULL ||
    (listend = reg_strstr(tokfrom, parse_tag->listend)) == NULL) {
    listend = doc + strlen(doc);
}

*listend = '\0';        /* truncate */

do {
    if((tokto = reg_strstr(tokfrom + 1, parse_tag->nexttoken)) == NULL)
        tokto = listend;
    tokto--;  /* tokto-- : this token up to nexttoken */

    if (tokto <= tokfrom)
        break;

    /* do some filtering with current token here ... */
    /* ... */

} while ((tokfrom = tokto + 1) < listend);

} ...

EDIT END

Do I miss something here? Like said, is this possible at all what I try to accomplish? Is the regex pattern errornous?

Suggestions are welcome!

Andreas


I tried you code on a test HTML file that I simply input from a text-file through stdin via redirection, and that seemed to work just fine with repeated reads to fgets(). I would then suspect that the issue is somewhere in the formatting of the string-data in you memory-mapped file. My suspicion is that there is a null-terminating character somewhere in your memory mapped file, such that if you are simply using the memory mapped file itself as a char buffer, it is ending the string far earlier than you were expecting.

Secondly, you are only returning the first match plus the rest of the string, which would mean the entire file from the first match onwards if you're using the pointer to the memory mapped file as your str parameter. I'm suspecting your "real" implementation is a bit different if you're looking to tokenize the file?


EDIT

I've been looking at your concept code, and it does seem to be working overall. I made a couple modifications just to help me print things out, but here is what I'm compiling (very down-and-dirty for the file-memory-mapping just to check if the regex code is working):

#include <regex.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>

#define REG_MATCH_SIZE 10
#define FILE_SIZE 60000

static int total_matches = 0;

char * reg_strstr(const char *str, const char *pattern) 
{
    char *result = NULL;
    regex_t re;
    regmatch_t match[REG_MATCH_SIZE];

    if (str == NULL)
        return NULL;

    if (regcomp( &re, pattern, REG_ICASE | REG_EXTENDED) != 0) {
        regfree( &re );         
        return NULL;
    }

    if (!regexec(&re, str, (size_t) REG_MATCH_SIZE, match, 0)) {

        fprintf( stderr, "@@@@@@ Match from %2d to %2d @@@@@@@@@\n",
             match[0].rm_so,
             match[0].rm_eo);

    total_matches++;

        if ((str + match[0].rm_so) != NULL) {
            result = strndup(str + match[0].rm_so, strlen(str + match[0].rm_so));
        }
    }

    regfree( &re );

    return result;
}


int main()
{   
    int filedes = open("testhtml.txt", O_RDONLY);

    void* buffer = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, filedes, 0); 

    char* str_result;
    char* temp_buffer = strdup((char*)buffer);
    while(str_result = reg_strstr(temp_buffer, "<div"))
    {
        char* temp_print = strndup(str_result, 30);
        fprintf(stderr, "reg_strstr result: '%s' ..\n\n", temp_print);
        free(temp_print);
        free(temp_buffer);
        temp_buffer = strdup(str_result + 1);
        free( str_result) ;
    }

    fprintf(stderr, "Total Matches: %d\n", total_matches);

    return 0;
}

Just using the simple match for "<div", if I run it on the entire HTML source for a page like this one here at Bloomberg, I get a total of 87 matches, and I get something that is equivalent to what you would get with a repeated call to the standard strstr(). For instance, sample output looks like (note: I've cut off the match on the return string after 30 characters for sanity's sake):

@@@@@@ Match from 5321 to 5325 @@@@@@@@@
reg_strstr result: '<div id="noir_dialog" class="p' ..

@@@@@@ Match from 362 to 366 @@@@@@@@@
reg_strstr result: '<div id="container" class="mod' ..

The matching indexes change of course since the new input string is shorter than the previous input string, so that's why you see a match that starts at 5321, but then the next match is at 362 ... the overall offset would be at 5683 in the original string. With a different regular expression I'm sure you would get different results, but overall it seems that your concept is working, or at least is working like strstr() would work, that is it's returning the entire string starting at the match to the substring all the way to the end of the string.

If you're not getting the results you're expecting (I'm not sure exactly what you're not getting), then I would say the problem is either in the regular expression itself, or in the loop, in that you may have your indexing off (i.e., with totok-- you could be creating a loop for yourself that simply keeps returning a match at the same point in the string).


Make sure the data you are loading is null-terminated. Arg 2 of regexec must be a null-terminated string.


Why don't you simply use a finite state machine which processes one character at a time and parses HTML tags. This way, you'll also get rid of the problem of HTML comments. Think of the following cases:

Anchor tag in HTML comment:

<!-- <my anchortag="foo"> -->

Comment in HTML attribute:

<some tag="<!--"> <my anchortag="foo"> </some tag="-->">

With regular expressions, you'll have a hard time dealing with these cases.

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski)

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号