Question regarding regex and tokenizing

开发者 (Developer), https://www.devze.com, 2023-01-16 00:46. Source: Internet

Related topics: python regex tokenize

I need to make a tokenizer that is able to tokenize English words.

Currently, I'm stuck on characters that can be part of a URL.

For instance, if the characters ':', '?', '=' are part of a URL, I shouldn't split on them.

My question is, can this be expressed in regex? I have the regex

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])

from here

but I don't know how to piece everything together so that, when these characters appear inside a match of the above expression, no spaces are inserted between them.

Help!

I would approach this problem by doing a sweep with a different regexp, putting hits into an array, removing those hits from the string, and then running your tokenizer as normal.
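A rough Python sketch of that idea, in case it helps. The URL pattern is the one from the question, compiled with `re.IGNORECASE` because it only lists `A-Z`. The names `URL_RE`, `WORD_RE`, `tokenize` and `tokenize_in_order`, and the simple word/punctuation tokenizer, are my own placeholders and not anything from the question. The second function is a variant I'm assuming you might want: it is not the two-pass sweep, but a single pass that tries the URL pattern first so the tokens stay in their original order.

```python
import re

# URL pattern from the question; IGNORECASE because the classes only list A-Z.
URL_RE = re.compile(
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])",
    re.IGNORECASE,
)

# Placeholder tokenizer (an assumption): runs of word characters,
# or any single character that is neither a word character nor whitespace.
WORD_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Two-pass sweep: collect URLs, cut them out, tokenize what is left."""
    urls = URL_RE.findall(text)        # hits go into a list
    remainder = URL_RE.sub(" ", text)  # remove the hits from the string
    tokens = WORD_RE.findall(remainder)
    return urls, tokens


# Variant (not the sweep above): one pattern with the URL alternative first,
# so a URL is kept whole and every token keeps its position in the text.
COMBINED_RE = re.compile(URL_RE.pattern + r"|\w+|[^\w\s]", re.IGNORECASE)


def tokenize_in_order(text):
    return COMBINED_RE.findall(text)


if __name__ == "__main__":
    sample = "See http://example.com/a?b=c: it works!"
    print(tokenize(sample))
    print(tokenize_in_order(sample))
```

Note that the two-pass version gives you the URLs separately from the other tokens, so you lose where each URL sat in the sentence. If you need that, the single-pass variant is probably the easier route.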