Have you ever noticed that FB scrapes a link you post on Facebook (in a status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumbnail of the image, several images from the linked page, or a video thumbnail for a video link (like YouTube)?
Any ideas how one would copy this feature? I'm thinking of a couple of Gearman workers, or even better just JavaScript that makes XHR requests and parses the content with regexes or something similar... Any ideas? Any links? Has anyone already tried to do the same and wrapped it in a nice class? Anything? :)
thanks!
FB scrapes the meta tags from the HTML.
I.e. when you enter a URL, FB displays the page title, followed by the URL (truncated), and then the contents of the <meta name="description"> element.
As for the selection of thumbnails, I think maybe FB chooses only those that exceed certain dimensions, i.e. skipping over button graphics, 1px spacers, etc.
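To illustrate the thumbnail filtering idea, here is a minimal sketch. The function name filterThumbnailCandidates() and the 50x50 threshold are my own assumptions, not Facebook's actual rule. It expects the width and height of each image to be known already; in practice you could get them with PHP's getimagesize().

```php
<?php
// Hypothetical helper: keep only images large enough to work as thumbnails.
// The default 50x50 minimum is an assumption, not Facebook's real threshold.
function filterThumbnailCandidates(array $images, $minWidth = 50, $minHeight = 50)
{
    $candidates = array();
    foreach ($images as $image) {
        // Skip spacers, buttons and other small decorative graphics
        if ($image['width'] >= $minWidth && $image['height'] >= $minHeight) {
            $candidates[] = $image['url'];
        }
    }
    return $candidates;
}

// Example input: dimensions would normally come from getimagesize()
$images = array(
    array('url' => 'http://example.com/spacer.gif', 'width' => 1,   'height' => 1),
    array('url' => 'http://example.com/button.png', 'width' => 80,  'height' => 20),
    array('url' => 'http://example.com/photo.jpg',  'width' => 640, 'height' => 480),
);

print_r(filterThumbnailCandidates($images));
```

Only photo.jpg passes here: the spacer fails both checks and the button fails the height check.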
Edit: I don't know exactly what you're looking for, but here's a function in PHP for scraping the relevant data from pages.
This uses the Simple HTML DOM library from http://simplehtmldom.sourceforge.net/
I've had a look at how FB does it, and it looks like the scraping is done on the server side.
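    // Simple HTML DOM provides file_get_html(); adjust the path to your copy
    require_once 'simple_html_dom.php';

    class ScrapedInfo
    {
        public $url;
        public $title;
        public $description;
        public $imageUrls = array();
    }

    function scrapeUrl($url)
    {
        $info = new ScrapedInfo();
        $info->url = $url;

        $html = file_get_html($info->url);
        if (!$html) {
            // The page could not be fetched or parsed
            return $info;
        }

        // Grab the page title (guard against pages without a <title>)
        $titleNode = $html->find('title', 0);
        if ($titleNode) {
            $info->title = trim($titleNode->plaintext);
        }

        // Grab the page description
        foreach ($html->find('meta') as $meta) {
            if ($meta->name == "description") {
                $info->description = trim($meta->content);
            }
        }

        // Grab the image URLs
        foreach ($html->find('img') as $element) {
            $rawUrl = $element->src;
            // Turn any relative URLs into absolutes.
            // Naive: treats $url as the base and only fixes the joining slash.
            if (substr($rawUrl, 0, 4) != "http") {
                $info->imageUrls[] = rtrim($url, '/') . '/' . ltrim($rawUrl, '/');
            } else {
                $info->imageUrls[] = $rawUrl;
            }
        }

        return $info;
    }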
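The relative-to-absolute step in the function above is simplistic. A slightly more careful version might look like the sketch below. resolveUrl() is a hypothetical helper of my own, and it is still a simplification: it does not collapse "../" segments, honour a <base href> tag, or handle data: URIs.

```php
<?php
// Hypothetical helper: resolve an image src against the URL of the page it came from.
// Simplified: no "../" normalisation, no <base href> support, no data: URIs.
function resolveUrl($pageUrl, $src)
{
    // Already absolute (http:// or https://)
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }

    $parts  = parse_url($pageUrl);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $host   = isset($parts['host']) ? $parts['host'] : '';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $origin = $scheme . '://' . $host . $port;

    // Protocol-relative: //cdn.example.com/a.png
    if (substr($src, 0, 2) === '//') {
        return $scheme . ':' . $src;
    }

    // Root-relative: /images/a.png
    if (substr($src, 0, 1) === '/') {
        return $origin . $src;
    }

    // Document-relative: resolve against the directory part of the page path
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $src;
}

echo resolveUrl('http://example.com/blog/post.html', 'img/a.png'), "\n";
echo resolveUrl('http://example.com/blog/post.html', '/img/a.png'), "\n";
```

The idea is to work out which kind of reference the src is (absolute, protocol-relative, root-relative or document-relative) before joining, instead of concatenating blindly.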
Facebook looks at various meta information in the HTML of the page that you paste into a link field. The title and description are two obvious ones, but a developer can also use <link rel="image_src" href="thumbnail.jpg" /> to provide a preferred screengrab. I guess you could check for these things. If this tag is missing you could always use a website thumbnail generation service.
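Checking for these three things could be sketched with PHP's built-in DOMDocument, as below. extractLinkPreview() is a hypothetical name of my own, and the sample HTML is made up for illustration. This only reads the tags mentioned above and assumes you have already fetched the page as a non-empty string.

```php
<?php
// Sketch: read the title, meta description and image_src hint from an HTML string.
// extractLinkPreview() is a hypothetical name, not part of any library.
function extractLinkPreview($html)
{
    $preview = array('title' => null, 'description' => null, 'image' => null);

    $doc = new DOMDocument();
    // Real-world markup is often invalid, so collect parser warnings silently
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $preview['title'] = trim($titles->item(0)->textContent);
    }

    // First <meta name="description"> wins
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $preview['description'] = trim($meta->getAttribute('content'));
            break;
        }
    }

    // Preferred thumbnail declared by the page author, if any
    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) === 'image_src') {
            $preview['image'] = $link->getAttribute('href');
            break;
        }
    }

    return $preview;
}

// Made-up sample page
$html = '<html><head><title> Example page </title>'
      . '<meta name="description" content="A short summary.">'
      . '<link rel="image_src" href="thumbnail.jpg" />'
      . '</head><body><p>Hello</p></body></html>';

print_r(extractLinkPreview($html));
```

If 'image' comes back null, that is the point where you would fall back to scanning the page's images or to a thumbnail generation service.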
I am developing a project like this, and it is not as easy as it seems: encoding issues, content rendered with JavaScript, and the sheer number of non-semantic websites are some of the big problems I have encountered. Extracting video info and getting auto-play behavior is especially tricky, and sometimes impossible. You can see a demo at http://www.embedify.me . It is written in .NET, but it has a service interface so you can call it via JavaScript. There is also a JavaScript API to get the same UI/behavior as in FB.