Post data to other site using PHP and save output

Developer https://www.devze.com 2023-01-28 23:43 Source: the web
I'm trying to save info from the http://www.woorank.com search results. The site caches data for popular sites, but for most you need to do a search before it returns a report. So I tried this:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.woorank.com/en/report/generate');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('url'=>'hellothere.com'));
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
curl_close($ch);

It seems (based on curl output) to redirect to http://www.woorank.com/en/www/hellothere.com, as it should after you search, but it doesn't generate a report and simply states there is no report yet (just as it would when you visit the url directly).

Am I doing something wrong? Or is it not possible to retrieve this info?

Update

Request headers: http://pastebin.com/3ijZfMmF

(Request-Line) POST /en/report/generate HTTP/1.1
Host    www.woorank.com
User-Agent  Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3
Accept  text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-us,en;q=0.5
Accept-Encoding gzip,deflate
Accept-Charset  ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive  115
Connection  keep-alive
Referer http://www.woorank.com/
Cookie  __utma=201458455.1161920622.1291713267.1291747441.1291773488.4; __utmc=201458455; __utmz=201458455.1291713267.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmb=201458455.1.10.1291773488
Content-Type    application/x-www-form-urlencoded
Content-Length  16

I'm not sure how to get the request headers from the test script, but using this:

curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);

$headers = curl_getinfo($ch);

The $headers var contains:

Array
(
    [url] => http://www.woorank.com/en/www/someothersite.com
    [content_type] => text/html; charset=UTF-8
    [http_code] => 200
    [header_size] => 841
    [request_size] => 280
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 1
    [total_time] => 0.904581
    [namelookup_time] => 3.2E-5
    [connect_time] => 3.3E-5
    [pretransfer_time] => 3.7E-5
    [size_upload] => 155
    [size_download] => 5297
    [speed_download] => 5855
    [speed_upload] => 171
    [download_content_length] => 5297
    [upload_content_length] => 0
    [starttransfer_time] => 0.242975
    [redirect_time] => 0.577306
    [request_header] => GET /en/www/someothersite.com HTTP/1.1
Host: www.woorank.com
Accept: */*
)

It seems to me that this is the request for the redirect that happens after the search form is submitted. I'm not sure whether no POST is sent at all or whether it just isn't visible in these headers. Since it doesn't work, I'm guessing it's the former.
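On whether a POST is sent at all: the request_header value above appears to be the last request cURL made, which is the GET issued after following the 302, so the POST would not show up there even if it was sent. Here is a minimal sketch, assuming the cURL extension is loaded, that turns off CURLOPT_FOLLOWLOCATION so the reported request is the form submission itself. The helper names capturePostRequest and requestMethod are hypothetical, not part of cURL.

```php
<?php
// Hypothetical helper: send the form once WITHOUT following redirects and
// return the raw request headers cURL reports for that single request.
function capturePostRequest(string $url, array $fields): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_FOLLOWLOCATION => false, // stop at the 302 response
        CURLOPT_RETURNTRANSFER => true,  // keep the body out of the page output
        CURLINFO_HEADER_OUT    => true,  // ask cURL to remember the request it sent
    ]);
    curl_exec($ch);
    $sent = curl_getinfo($ch, CURLINFO_HEADER_OUT);
    curl_close($ch);

    return is_string($sent) ? $sent : '';
}

// Hypothetical helper: pull the HTTP method out of a raw request header block.
function requestMethod(string $rawRequestHeaders): string
{
    if (preg_match('/^\s*([A-Z]+)\s/', $rawRequestHeaders, $m) === 1) {
        return $m[1];
    }
    return '';
}

// Usage (performs a real request, so it is left commented out):
// echo requestMethod(capturePostRequest(
//     'http://www.woorank.com/en/report/generate',
//     ['url' => 'hellothere.com']
// ));
```

If that reports POST, the form is being submitted and the problem lies elsewhere, for example in what is sent with the follow-up request.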

The output from curl_exec is simply the HTML from http://www.woorank.com/en/www/someothersite.com.

Update 2

I tried adding some of the headers to the curl request using:

curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

and e.g.

$headers = array( 
  "Host: www.woorank.com",
  "Referer: http://www.woorank.com/"
);

Doesn't make it POST the form, but now the curl_exec shows the response headers. Here's the difference:

Firefox, response headers from site:

HTTP/1.1 302 Found
Date    Wed, 08 Dec 2010 02:19:18 GMT
Server  Apache/2.2.9 (Fedora)
X-Powered-By    PHP/5.2.6
Set-Cookie  language=en; expires=Wed, 08-Dec-2010 03:19:18 GMT; path=/
Set-Cookie  generate=somesite.com; expires=Wed, 08-Dec-2010 03:19:19 GMT; path=/
Location    /en/www/somesite.com
Cache-Control   max-age=1
Expires Wed, 08 Dec 2010 02:19:19 GMT
Vary    Accept-Encoding,User-Agent
Content-Encoding    gzip
Content-Length  20
Keep-Alive  timeout=1, max=100
Connection  Keep-Alive
Content-Type    text/html; charset=UTF-8

and from test.php:

HTTP/1.1 302 Found
Date: Wed, 08 Dec 2010 02:27:21 GMT
Server: Apache/2.2.9 (Fedora)
X-Powered-By: PHP/5.2.6
Set-Cookie: language=en; expires=Wed, 08-Dec-2010 03:27:21 GMT; path=/
Set-Cookie: generate=someothersite.com; expires=Wed, 08-Dec-2010 03:27:22 GMT; path=/
Location: /en/www/someothersite.com
Cache-Control: max-age=1
Expires: Wed, 08 Dec 2010 02:27:22 GMT
Vary: Accept-Encoding,User-Agent
Content-Length: 0
Keep-Alive: timeout=1, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8

I only notice Content-Encoding gzip and Content-Length 20 missing in the test. Don't know what that means but when adding "Content-Length: 20" to the headers it says "HTTP/1.1 413 Request Entity Too Large" and doesn't do anything; adding "Content-Encoding: gzip" makes it return the HTML gzipped (I assume, since it looks like this: "‹ÍXésÚ8ÿœüZíì&]ìºG “æè1 MmÚ...").
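A note on those two headers: in the Firefox capture, Content-Encoding and Content-Length are response headers describing the body the server sent back, so adding them to the request describes a request body that isn't there. If the goal is just to read a compressed response, CURLOPT_ENCODING lets cURL negotiate and decode it. The sketch below assumes the cURL extension is loaded; fetchDecoded and looksGzipped are hypothetical helper names.

```php
<?php
// Hypothetical helper: true if the bytes start with the gzip magic number
// (0x1f 0x8b), i.e. the body is still compressed.
function looksGzipped(string $body): bool
{
    return strncmp($body, "\x1f\x8b", 2) === 0;
}

// Hypothetical helper: fetch a URL and let cURL handle compression.
function fetchDecoded(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        // Empty string: send an Accept-Encoding header listing every encoding
        // this cURL build supports, and decode the response automatically.
        CURLOPT_ENCODING       => '',
    ]);
    $body = curl_exec($ch);
    curl_close($ch);

    return is_string($body) ? $body : '';
}

// Usage (performs a real request, so it is left commented out):
// $html = fetchDecoded('http://www.woorank.com/en/www/someothersite.com');
// var_dump(looksGzipped($html)); // expected to be false once cURL decodes it
```

The "‹" near the start of the garbled output is what the second magic byte (0x8b) looks like when printed as text, which fits the guess that the body was gzipped.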

Hope this info helps.


You want to make sure you're matching the necessary headers. Make the request you want to emulate in a browser and post its headers here; a plugin like HTTPFox for Firefox, or a similar tool, will capture them. Then we can see whether your cURL request matches.

Answer: I looked at the site myself and found that it uses cookies to make sure you're not a simple robot before generating reports. You can get around this by updating your cURL script so that it sends the right cookies.

There may also be other simple checks that you'd have to bypass (e.g. Referer, User-Agent). You can handle all of them with cURL, though.
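To make that concrete: the 302 response in the question sets a generate cookie, but the follow-up GET that cURL reported carried only Host and Accept, so no cookie went back. cURL only stores and returns cookies once its cookie engine is switched on. Below is a minimal sketch, assuming the cURL extension is loaded and assuming the generate cookie is what the site checks (I haven't confirmed which cookies matter). fetchReportWithCookies and sentCookieHeader are hypothetical names, and the User-Agent string is a placeholder.

```php
<?php
// Hypothetical helper: submit the form with the cookie engine enabled so that
// cookies set on the 302 are sent along with the redirected request.
function fetchReportWithCookies(string $site): array
{
    $ch = curl_init('http://www.woorank.com/en/report/generate');
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query(['url' => $site]),
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEFILE     => '', // empty string: in-memory cookie engine, no file read
        CURLOPT_REFERER        => 'http://www.woorank.com/',
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-script)', // placeholder
        CURLINFO_HEADER_OUT    => true,
    ]);
    $body = curl_exec($ch);
    $lastRequest = curl_getinfo($ch, CURLINFO_HEADER_OUT);
    curl_close($ch);

    return [
        'body'         => is_string($body) ? $body : '',
        'last_request' => is_string($lastRequest) ? $lastRequest : '',
    ];
}

// Hypothetical check: did a raw request header block include a Cookie line?
function sentCookieHeader(string $rawRequestHeaders): bool
{
    return preg_match('/^Cookie:/mi', $rawRequestHeaders) === 1;
}

// Usage (performs a real request, so it is left commented out):
// $result = fetchReportWithCookies('hellothere.com');
// var_dump(sentCookieHeader($result['last_request']));
```

To keep cookies between separate script runs, point CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR at the same writable file instead of using the empty string.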

However, they probably use this kind of cookie protection because they don't want people scraping their data. If you're going to work around that restriction, you should have the courtesy to ask the admin for permission to download from the site. While you're not at legal risk (they have no ToS), it'd be a nice thing to do.


Maybe something like this? I'm especially curious what you get as output from print_r.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.woorank.com/en/report/generate');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('url'=>'hellothere.com'));
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec ($ch);
print_r($result); // output?
curl_close($ch);
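One detail worth checking in the snippet above: when CURLOPT_POSTFIELDS is given an array, cURL sends the body as multipart/form-data, whereas the Firefox request in the question used application/x-www-form-urlencoded with a Content-Length of 16. Passing a URL-encoded string instead makes cURL send the same content type as the browser. This is a small sketch of that; formBody is a hypothetical helper name, and whether the site cares about the content type is an assumption.

```php
<?php
// Hypothetical helper: encode form fields the way an HTML form does by
// default (application/x-www-form-urlencoded).
function formBody(array $fields): string
{
    return http_build_query($fields);
}

$body = formBody(['url' => 'somesite.com']);
echo $body, ' (', strlen($body), " bytes)\n"; // prints "url=somesite.com (16 bytes)"

// With cURL, pass the string rather than the array:
// curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
```

The 16-byte length is consistent with the Content-Length in the captured browser request, while the size_upload of 155 reported by the test script is consistent with a larger multipart body.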