
Cleaning text scraped from a webpage with PHP & regex

Developer https://www.devze.com 2023-02-10 04:44 Source: Internet

I have been building a function that reads the title text found between a webpage's <title></title> tags. I am using the following regex code to grab the title text from the HTML page:

 if(preg_match('#<title>([^<]+)</title>#simU', $this->html, $m1))
      $this->title = trim($m1[1]);

I am using the following to escape the value for the MySQL insert statement:

mysql_real_escape_string(rawurldecode($this->title))

So that leaves me with a database full of titles that have HTML entities (&nbsp; etc.) and foreign characters, such as in: Dating S.o.s | Gluten-free, Dairy-free, Sugar-free Recipes And Lifestyle Tips

The goal is to decode, remove, and clean the titles so that they look as close to perfect English as possible.

I have constructed a function that uses the following two regexes to remove HTML entities and limit junk, respectively. While not ideal (because it removes the HTML entities rather than preserving them), it's the closest to clean I've got.

$string = preg_replace("/&#?[a-z0-9]+;/i","",$string);
//remove all non-normal chars
$string = preg_replace('/[^a-zA-Z0-9-\s\'\!\,\|\(\)\.\*\&\#\/\:]/', '', $string);

But the non-English chars still exist.

Would anyone be able to offer help as to:

  1. The best way to save these title strings to the db while trying to preserve the English intent (punctuation, apostrophes, etc.)?
  2. How to convert or eliminate the strange chars as shown in my example title above?

Thanks much for your help!


For point 1, PHP has an html_entity_decode() function that you can use to turn HTML entities into "regular" characters.
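As a rough illustration of that suggestion, here is a minimal sketch. The helper name decode_title() and the sample title are made up for this example, and it assumes PHP 7+ and that the scraped title is already valid UTF-8. It decodes entities instead of deleting them, then tidies the whitespace:

```php
<?php
// Hypothetical helper (not from the original post): decode the HTML
// entities in a scraped title instead of stripping them out.
function decode_title(string $raw): string
{
    // ENT_QUOTES also decodes &#039; and &quot;; ENT_HTML5 selects the
    // HTML5 named-entity table. The output is produced as UTF-8.
    $decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // &nbsp; decodes to U+00A0, which trim() does not remove,
    // so map it to an ordinary space first.
    $decoded = str_replace("\u{00A0}", ' ', $decoded);

    // Collapse runs of whitespace; fall back to the decoded string
    // if preg_replace() returns null (e.g. on invalid UTF-8 input).
    $collapsed = preg_replace('/\s+/u', ' ', $decoded) ?? $decoded;

    return trim($collapsed);
}

// prints: Tom & Jerry's Café
echo decode_title('Tom &amp; Jerry&#039;s&nbsp;Caf&eacute;'), "\n";
```

Decoding before storage keeps punctuation such as apostrophes and ampersands intact, which the entity-stripping regex in the question would discard.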


Check out http://www.php.net/manual/en/function.html-entity-decode.php for #1

And http://php.net/manual/en/function.mb-convert-encoding.php for #2
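A hedged sketch of how mb_convert_encoding() might be applied here follows. The helper names to_utf8() and strip_non_ascii() are hypothetical, the ISO-8859-1 fallback is only an assumption about what the scraped pages use, and the mbstring extension must be enabled. The idea is to normalize everything to UTF-8 first, and only then optionally drop whatever is still outside printable ASCII:

```php
<?php
// Hypothetical helper: ensure a scraped title is valid UTF-8.
// The fallback encoding is an assumption; ideally it would come from
// the page's Content-Type header or <meta charset> declaration.
function to_utf8(string $raw, string $fallback = 'ISO-8859-1'): string
{
    // Bytes that already form valid UTF-8 are left untouched.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Otherwise reinterpret the bytes as the fallback encoding.
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

// Hypothetical helper: remove every byte outside printable ASCII.
// This is lossy ("Café" becomes "Caf"), so use it only if English-only
// titles really are the goal.
function strip_non_ascii(string $s): string
{
    return preg_replace('/[^\x20-\x7E]/', '', $s);
}

$latin1 = "Caf\xE9 | Tips";               // "Café | Tips" in ISO-8859-1
$utf8   = to_utf8($latin1);               // now valid UTF-8
echo strip_non_ascii($utf8), "\n";        // prints: Caf | Tips
```

Odd-looking characters in stored titles are often correctly encoded text that is later read back with the wrong charset, so checking the encoding end to end (page, PHP, database connection) is worth doing before deleting anything.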
