开发者

Finding a substring token in a string in C++

开发者 https://www.devze.com 2023-01-28 15:27 出处:网络
OK suppose I\'m parsing some XML (the problem exists while reading any \"language\" but XML is one many people are familar with).

OK suppose I'm parsing some XML (the problem exists while reading any "language" but XML is one many people are familar with).

The XML looks like the following:

<Tag>
  <[CDATA[ blah blah]]>
  <Tag2>
    <Tag3/>
  </Tag2>
<Tag>

Now I want to find the various tokens in that stream. The important tokens are as follows (Please excuse my crappy "token" names ;)).

<           = Open Token
<[CDATA[    = Open CDATA Token
]]>         = Close CDATA Token
<!          = Open Comment Token
/>          = Close Open Token
</          = Open Close Token
>           = Close Token

The problem I have is that I have an array of the above and I am trying to correctly identify one of the above Token's as I read in the file character by character.

So I read the first character, '<'. The instant thought is that this matches with the "Open Token" so we'll select that. However that also matches with the first character of the "Open Close Token". So lets say we read the second ch开发者_开发技巧aracter and its a'T'. So I instantly know this is the "Open Token" and not the "Open Close Token".

Equally when finishing a tag, e.g "/>". I read the first character and I get '/'. This matches the "Close Open Token". But its not complete so I should check for the next character, which in this case is '>' giving me "/>" which does match the Close Token.

My issue is that when the numbers of these tokens increases significantly it becomes quite hard to track what the possible matches are. Is there an elegant way to do this? Or should I, simply when I meet the first character of one of the "token strings" push that token on to a vector and then ONLY check those tokens on subsequent reads? If the next character does not match then I can clear the list of tokens and then start over.

Is this the correct way to approach the problem? Is there a better way?

(Edit: Please don't point me towards Lexx, YACC etc ... I'm trying to learn some fundamentals here)

Any help would be much appreciated :)


You need to track state in the parser - where am I now? what am I expecting next? - in a context-specific way. When you see what you get next, you check it against a list of valid values for current state, and possibly store a completed parsed data item, and possibly change state.

Parsing XML only looks easy by the way - if you really want to do this work by hand yourself, there are lots of corner cases to handle. Your parser is a Finite State Machine, but a non-trivial example of such.


I've been doing a lot of this type of parsing lately (mostly with C#).

I don't know exactly what you are trying to accomplish so I'm not sure how much help this is, but I would parse the entire thing and store it in some sort of data array.

Find the start tag. Then parse whatever text comes next (you know when you've reached the end of the text because you'll either hit whitespace or punctuation).

You can put in a special test for "!" and maybe set a flag in your data structure when it was found. I've found that it's just not practical to do a quick scan for known sequences. You need to break up the entire thing, character by character.

You can see one of my C# results at http://www.softcircuits.com/Blog/post/2010/02/07/Parsing-HTML-Tags-in-C.aspx.


Parsing is a well known problem, but that doesn't mean it's easily programmed. You could write anything yourself, but as you encountered, this becomes rather complex pretty fast.

You could use the Boost.Spirit library, which is very large and will probably take you some time to master.

Or as an alternative, use Lex / Yacc (or something similar) to generate a parser and lexical analyzer. (this is more C than C++, but that's not necessarily a bad thing of course)

I'd personally spend the time learning to master the Boost Spirit library, while it might seem a lot of work at first, you'll save a lot of time and headaches in the long run. Parsing XML like languages by hand takes a lot more work than you'd expect at first.


You could have flex do this for you. Better yet, dig around for an existing XML parser for your language -- I'm sure someone's implemented it by now.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号