PHP Regex Cleaning of User Posts

Developer https://www.devze.com 2023-03-30 09:57 Source: the web
I am trying to clean up user-submitted comments in PHP using regex, but I have become rather stuck and confused!

Is it possible using regex to:

  1. Remove punctuation repeated more than twice so that:

    • OMG it was AWESOME!!!! becomes OMG it was AWESOME!!
    • !!!!!!!!!!.........------ becomes !!..--
    • !?!?!? becomes !?
  2. Remove duplicate words or phrases (for example, when a user has copied and pasted a message), so that:

    • spamspamspamspam becomes spam
    • I love copy and paste. I love copy and paste. I love copy and paste. becomes I love copy and paste.
  3. Convert runs of capital letters and spaces longer than, say, 10 characters to lowercase:

    • I LOVE CAPITALS THEY ARE SO AWESOME becomes I love capitals they are so awesome
    • GOOD that sounds stays the same
  4. Any suggestions you have?

This is for a student system (hence the urge to at least try and tidy up what they post), although I do not wish to go as far as filtering it or blocking their messages, just "correct" it with some regex.

Thanks for your time,


Edit:

If it isn't possible using regex (or regex mixed with other PHP), how would you do it?


1:

// same punctuation mark repeated 3 or more times: keep two
// (preg_replace() returns the result, so assign it back)
$string = preg_replace('#([?!.-])\1{2,}#', '$1$1', $string);

// group of different punctuation marks repeated more than once: keep one copy
$string = preg_replace('#([?!.-][?!.-]+?)\1+#', '$1', $string);

2:

// any sequence of 2 or more characters repeated back to back: keep one copy
$string = preg_replace('#(.{2,}?)\1+#', '$1', $string);
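The pattern above is worth trying on ordinary text before relying on it, because it collapses any back-to-back repeat of two or more characters, not only pasted spam. The sketch below shows two side effects, and what happens to the pasted sentence from the question. The helper names `collapse_repeats` and `collapse_repeats_ws` are hypothetical, and the `\s*` variant is only an assumption about what the question wants, not part of the answer.

```php
<?php
// Hypothetical helper: the pattern from the answer, unchanged.
function collapse_repeats(string $s): string
{
    // preg_replace() returns null on error, so fall back to the input.
    return preg_replace('#(.{2,}?)\1+#', '$1', $s) ?? $s;
}

// Hypothetical variant (an assumption, not from the answer): allow
// whitespace between the copies so space-separated sentences collapse fully.
function collapse_repeats_ws(string $s): string
{
    return preg_replace('#(.{2,}?)(?:\s*\1)+#', '$1', $s) ?? $s;
}

$pasted = 'I love copy and paste. I love copy and paste. I love copy and paste.';

echo collapse_repeats('spamspamspamspam'), "\n"; // spam
echo collapse_repeats('banana'), "\n";           // bana (side effect)
echo collapse_repeats('1000000'), "\n";          // 100  (side effect)
echo collapse_repeats($pasted), "\n";            // two copies remain
echo collapse_repeats_ws($pasted), "\n";         // I love copy and paste.
```

The third copy of the sentence has no trailing space, so the original pattern only finds two identical units and leaves two copies behind. The whitespace-tolerant variant handles that case, but it also collapses deliberate doubles such as "ha ha", so either version may need a minimum length larger than 2.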

3:

// run of 10 or more uppercase letters and spaces: convert to lowercase
// (the first letter is not re-capitalized)
function tolower_cb($match) {
    return strtolower($match[0]);
}
$string = preg_replace_callback('#([A-Z ]{10,})#', 'tolower_cb', $string);

Try it here: http://codepad.org/iQsZ2vJ0
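For reference, the three steps can be chained into one function. This is a minimal sketch: the wrapper name `tidy_comment` is hypothetical, the patterns are the ones given above, and the `?? $text` fallback is an added assumption to cover `preg_replace()` returning null on error.

```php
<?php
// Hypothetical wrapper name; the patterns are the ones from the answer.
function tidy_comment(string $text): string
{
    // 1a. Same punctuation mark three or more times in a row: keep two.
    $text = preg_replace('#([?!.-])\1{2,}#', '$1$1', $text) ?? $text;
    // 1b. Repeated group of mixed punctuation: keep one copy.
    $text = preg_replace('#([?!.-][?!.-]+?)\1+#', '$1', $text) ?? $text;
    // 2. Any run of two or more characters repeated back to back: keep one copy.
    $text = preg_replace('#(.{2,}?)\1+#', '$1', $text) ?? $text;
    // 3. Runs of 10 or more ASCII capitals and spaces: lowercase them.
    $text = preg_replace_callback(
        '#[A-Z ]{10,}#',
        function (array $m): string {
            return strtolower($m[0]);
        },
        $text
    ) ?? $text;
    return $text;
}

echo tidy_comment('OMG it was AWESOME!!!!'), "\n";              // OMG it was AWESOME!!
echo tidy_comment('!!!!!!!!!!.........------'), "\n";           // !!..--
echo tidy_comment('!?!?!?'), "\n";                              // !?
echo tidy_comment('I LOVE CAPITALS THEY ARE SO AWESOME'), "\n"; // i love capitals they are so awesome
echo tidy_comment('GOOD that sounds'), "\n";                    // GOOD that sounds
```

The order matters: the punctuation rules run first so that step 2 does not reduce a run like "!!!!" on its own terms. The capitals step yields all-lowercase text rather than the sentence case shown in the question.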


A good rule of thumb is to never try to "fix" user input. If a user wants to type four exclamation points after a sentence, allow it. There is no reason not to.

You should be more concerned with injection attacks than with things like this.
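As one illustration of that point, a comment can be stored as typed and escaped only when it is printed into HTML. This is a sketch under assumptions: the helper name `render_comment` and the surrounding `<p>` markup are hypothetical, while `htmlspecialchars()` is PHP's built-in escaping function.

```php
<?php
// Hypothetical helper name; htmlspecialchars() is a PHP built-in.
function render_comment(string $comment): string
{
    // Escape at output time so markup in the comment is displayed as text
    // instead of being interpreted by the browser.
    return '<p>' . htmlspecialchars($comment, ENT_QUOTES, 'UTF-8') . '</p>';
}

echo render_comment('<script>alert("hi")</script>'), "\n";
// <p>&lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;</p>
```

For the database side, the usual counterpart is a prepared statement with bound parameters (for example via PDO) rather than building SQL by concatenating the comment text.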
