How do I write a simple Ragel tokenizer (no backtracking)?

Developer https://www.devze.com 2023-03-11 08:51 Source: Internet
UPDATE 2

Original question: Can I avoid using Ragel's |**| if I don't need backtracking?

Updated answer: Yes, you can write a simple tokenizer with ()* if you don't need backtracking.

UPDATE 1

I realized that asking about XML tokenizing was a red herring, because what I'm doing is not specific to XML.

END UPDATES

I have a Ragel scanner/tokenizer that simply looks for FooBarEntity elements in files like:

<ABC >
  <XYZ >
    <FooBarEntity>
      <Example >Hello world</Example >
    </FooBarEntity>
  </XYZ >
  <XYZ >
    <FooBarEntity>
      <Example >sdrastvui</Example >
    </FooBarEntity>
  </XYZ >
</ABC >

The scanner version:

%%{
  machine simple_scanner;
  action Emit {
    emit data[(ts+14)...(te-15)].pack('c*')
  }
  foo = '<FooBarEntity>' any+ :>> '</FooBarEntity>';
  main := |*
    foo => Emit;
    any;
  *|;
}%%

The non-scanner version (i.e. it uses ()* instead of |**|):

%%{
  machine simple_tokenizer;
  action MyTs {
    my_ts = p
  }
  action MyTe {
    my_te = p
  }
  action Emit {
    emit data[my_ts...my_te].pack('c*')
    my_ts = nil
    my_te = nil
  }
  foo = '<FooBarEntity>' any+ >MyTs :>> '</FooBarEntity>' >MyTe %Emit;
  main := ( foo | any+ )*;
}%%
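
For comparison, the effect of the MyTs/MyTe marks can be emulated in plain Ruby without Ragel. This is only a sketch under assumptions: extract_entities, OPEN_TAG and CLOSE_TAG are hypothetical names that do not appear in the linked repository, the input is a whole string rather than buffered chunks, and unlike any+ it also accepts an empty element.

```ruby
OPEN_TAG  = '<FooBarEntity>'
CLOSE_TAG = '</FooBarEntity>'

# Hypothetical helper: returns the text between each OPEN_TAG/CLOSE_TAG pair.
def extract_entities(data)
  results = []
  pos = 0
  while (start = data.index(OPEN_TAG, pos))
    my_ts = start + OPEN_TAG.length       # first character after the opening tag
    my_te = data.index(CLOSE_TAG, my_ts)  # first character of the closing tag
    break if my_te.nil?                   # unterminated element: stop looking
    results << data[my_ts...my_te]        # exclusive range, as in the Emit action
    pos = my_te + CLOSE_TAG.length        # resume after the closing tag
  end
  results
end
```

The exclusive range (three dots) matters here: my_te points at the '<' of the closing tag, so that character must be left out.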

I figured this out and wrote tests for it at https://github.com/seamusabshere/ruby_ragel_examples

You can see the reading/buffering code at https://github.com/seamusabshere/ruby_ragel_examples/blob/master/lib/simple_scanner.rl and https://github.com/seamusabshere/ruby_ragel_examples/blob/master/lib/simple_tokenizer.rl


You don't have to use a scanner to parse XML. I've implemented a simple XML parser in Ragel, without a scanner. Here is a blog post with some timings and more info.

Edit: You can do it many ways. You could use a scanner. You could parse for words and if you see STARTANIMAL you start collecting words until you see STOPANIMAL.
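
The word-based approach could look roughly like this in plain Ruby. It is a sketch, not the parser from the blog post: collect_between is a hypothetical helper, and it assumes the input is split on whitespace and that the markers are not nested.

```ruby
# Hypothetical helper: collects the words found between START and STOP markers.
def collect_between(text, start_word: 'STARTANIMAL', stop_word: 'STOPANIMAL')
  groups  = []
  current = nil                     # nil means "not inside a group"
  text.split.each do |word|
    if current.nil?
      current = [] if word == start_word   # start collecting
    elsif word == stop_word
      groups << current                    # group finished
      current = nil
    else
      current << word                      # ordinary word inside a group
    end
  end
  groups
end
```

Each word is looked at exactly once, so there is nothing to backtrack over.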


Rephrasing Occam: you do not need the scanner unless you need it. Without a scanner, you can process one symbol at a time, possibly reading it from the stream with no buffer.
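
To illustrate the one-symbol-at-a-time idea outside Ragel, here is a hand-written state machine in plain Ruby. It is a sketch under assumptions: each_entity, START_TAG and END_TAG are hypothetical names, and the restart rule relies on '<' appearing only as the first character of each tag. Only the text of the current element is kept; the input itself is never buffered or re-read.

```ruby
require 'stringio'

START_TAG = '<FooBarEntity>'
END_TAG   = '</FooBarEntity>'

# Hypothetical helper: reads io one character at a time and yields the text
# of each element found between START_TAG and END_TAG.
def each_entity(io)
  inside  = false   # true while between START_TAG and END_TAG
  matched = 0       # characters of the current target tag matched so far
  content = +''
  io.each_char do |c|
    target = inside ? END_TAG : START_TAG
    if c == target[matched]
      matched += 1
      next if matched < target.length
      yield content if inside            # closing tag complete: emit the text
      content = +''
      inside  = !inside
      matched = 0
    else
      # The partial tag match turned out to be ordinary text.
      content << target[0, matched] if inside
      if c == target[0]
        matched = 1                      # this '<' may start a new tag
      else
        content << c if inside
        matched = 0
      end
    end
  end
end

found = []
each_entity(StringIO.new('<X><FooBarEntity>a<b</FooBarEntity></X>')) { |text| found << text }
```

Any IO that responds to each_char (a File, a socket, $stdin) can be passed in place of the StringIO.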
