How to fetch Google images

I want to fetch Google images for any query. I have gone through the Google Image Search API but was unable to understand it. I have also seen some methods that fetch images, but only from the first page. I have used the following method.

function getGoogleImg($k)
{
    $url = "http://images.google.it/images?as_q=##query##&hl=it&imgtbs=z&btnG=Cerca+con+Google&as_epq=&as_oq=&as_eq=&imgtype=&imgsz=m&imgw=&imgh=&imgar=&as_filetype=&imgc=&as_sitesearch=&as_rights=&safe=images&as_st=y";
    $web_page = file_get_contents( str_replace("##query##",urlencode($k), $url ));
    $tieni = stristr($web_page,"dyn.setResults(");
    $tieni = str_replace( "dyn.setResults(","", str_replace(stristr($tieni,");"),"",$tieni) );
    $tieni = str_replace("[]","",$tieni);
    $m = preg_split("/[\[\]]/",$tieni);
    $x = array();
    for($i=0;$i<count($m);$i++)
    {
        $m[$i] = str_replace("/imgres?imgurl\\x3d","",$m[$i]);
        $m[$i] = str_replace(stristr($m[$i],"\\x26imgrefurl"),"",$m[$i]);
        $m[$i] = preg_replace("/^\"/i","",$m[$i]);
        $m[$i] = preg_replace("/^,/i","",$m[$i]);
        if ($m[$i] != "") {
            array_push($x, $m[$i]);
        }
    }
    return $x;
}

This function returns only 21 images. I want all the images for the query. I am doing this in PHP.

Sadly, the Image API is being shut down, so I won't suggest moving to it, although I think it would have been a nicer solution.

My best guess is that images from number 22 onward are loaded with some AJAX/JavaScript (if you search for, say, "logo" and scroll down, you will see placeholders that get filled in as you move down), so you would need to run the page through a JavaScript engine, and I cannot find anyone who has done that with PHP yet. Have you checked whether $web_page contains more than 21 images? (When I try Google Image search, it uses JavaScript to load some of the images.) What happens when you open the link in your normal browser, and what happens if you turn off JavaScript? Is there perhaps a link to the next page in the result you get?

In the now-deprecated Image API there were ways to limit the number of results per page and to step to the next page: https://developers.google.com/image-search/v1/jsondevguide#json_snippets_php

If you wish to keep searching and fetching images from the search results, then http://simplehtmldom.sourceforge.net/ might be a nice alternative to look at later. It parses HTML into a DOM and makes it easy to find nodes and work with them. But it still uses file_get_contents or the curl library to fetch the data, so it might need some fiddling to get the JavaScript-loaded content.
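To illustrate what a plain HTML parser can and cannot see, here is a minimal sketch in Python using only the standard library. The markup in sample_html is a hypothetical stand-in for a fetched page, and ImgSrcCollector is a name made up for this example. An image whose URL sits in a data-src attribute waiting for a script to fill it in is not picked up, which is the same limitation described above.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in the fed markup."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Hypothetical stand-in for a fetched results page.
sample_html = (
    '<div><img src="http://example.com/a.jpg">'
    '<img data-src="http://example.com/lazy.jpg">'
    '<img src="http://example.com/b.png"></div>'
)

collector = ImgSrcCollector()
collector.feed(sample_html)
print(collector.sources)
```

Only the two images with a real src attribute end up in collector.sources; the lazily loaded one would need the page's JavaScript to run first.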

I wrote a script to download images from Google Image search; it currently downloads 100 original images.

I originally posted the script in this Stack Overflow answer:

Python - Download Images from google Image search?

Here I will explain in detail how I scrape the URLs of the original images from Google Image search using urllib2 and BeautifulSoup.

For example, if you want to scrape images of the movie Terminator 3 from Google Image search:

import urllib2
from bs4 import BeautifulSoup

query = "Terminator 3"
query = '+'.join(query.split())  # this turns the query into Terminator+3
url = "https://www.google.co.in/search?q=" + query + "&source=lnms&tbm=isch"
# send a browser-like User-Agent with the request
header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
req = urllib2.Request(url, headers=header)
# parse the response into a BeautifulSoup tree
soup = BeautifulSoup(urllib2.urlopen(req), 'html.parser')
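Joining the words with '+' only takes care of spaces; a query containing characters such as '&' would still break the URL. As a minimal sketch of a safer way to build the same URL, the helper below uses urlencode from the standard library. The name build_search_url is made up for this example, and the import is wrapped so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_search_url(query):
    """Return the image-search URL for a free-text query."""
    # A list of pairs keeps the parameters in a fixed order.
    params = [("q", query), ("source", "lnms"), ("tbm", "isch")]
    return "https://www.google.co.in/search?" + urlencode(params)


print(build_search_url("Terminator 3"))
print(build_search_url("Tom & Jerry"))
```

The first call produces the same URL as the manual join above; the second shows the ampersand being percent-encoded instead of being read as a parameter separator.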

The variable soup above now holds the HTML of the requested page. Next we need to extract the images; to do that, open the web page in your browser and use Inspect Element on an image.

There you will find the tags containing the URL of the image.

For example, for Google Images I found that "div",{"class":"rg_meta"} contains the link to the image.

See the BeautifulSoup documentation for how to search for tags.

print soup.find_all("div",{"class":"rg_meta"})

You will get a list of results like this:

<div class="rg_meta">{"cl":3,"cr":3,"ct":12,"id":"C0s-rtOZqcJOvM:","isu":"emuparadise.me","itg":false,"ity":"jpg","oh":540,"ou":"http://199.101.98.242/media/images/66433-Terminator_3_The_Redemption-1.jpg","ow":960,"pt":"Terminator 3 The Redemption ISO \\u0026lt; GCN ISOs | Emuparadise","rid":"VJSwsesuO1s1UM","ru":"http://www.emuparadise.me/Nintendo_Gamecube_ISOs/Terminator_3_The_Redemption/66433","s":"Screenshot Thumbnail / Media File 1 for Terminator 3 The Redemption","th":168,"tu":"https://encrypted-tbn2.gstatic.com/images?q\\u003dtbn:ANd9GcRs8dp-ojc4BmP1PONsXlvscfIl58k9hpu6aWlGV_WwJ33A26jaIw","tw":300}</div>

The result above contains the link to our image in the "ou" field:

http://199.101.98.242/media/images/66433-Terminator_3_The_Redemption-1.jpg
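To make the mapping from one rg_meta block to a URL concrete, here is a minimal sketch that parses a shortened copy of the JSON shown above with the standard json module. The helper name extract_image is made up for this example, and falling back to "jpg" when "ity" is empty is an assumption that mirrors what the download loop does.

```python
import json

# Shortened copy of one rg_meta payload from the output above.
sample_meta = (
    '{"ity":"jpg","oh":540,'
    '"ou":"http://199.101.98.242/media/images/66433-Terminator_3_The_Redemption-1.jpg",'
    '"ow":960}'
)


def extract_image(meta_text):
    """Return (url, extension) from one rg_meta JSON string."""
    meta = json.loads(meta_text)
    # "ou" is the original image URL, "ity" the image type.
    # Assumption: use "jpg" when the type is missing or empty.
    return meta["ou"], meta.get("ity") or "jpg"


print(extract_image(sample_meta))
```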

You can extract these links and download the images as follows:

import json
import os
import urllib2

DIR = "Pictures"           # hypothetical output directory; not defined in the original snippet
image_type = "terminator"  # hypothetical filename prefix; not defined in the original snippet

ActualImages = []  # (link to the large original image, image type) pairs
for a in soup.find_all("div", {"class": "rg_meta"}):
    meta = json.loads(a.text)  # parse the JSON once per tag
    ActualImages.append((meta["ou"], meta["ity"]))

if not os.path.exists(DIR):
    os.mkdir(DIR)

for i, (img, Type) in enumerate(ActualImages):
    try:
        # header is already a dict, so pass it as-is
        req = urllib2.Request(img, headers=header)
        raw_img = urllib2.urlopen(req).read()
        # number the file after those already saved with this prefix
        cntr = len([name for name in os.listdir(DIR) if image_type in name]) + 1
        print cntr
        ext = Type if len(Type) > 0 else "jpg"  # default to .jpg when the type is empty
        f = open(os.path.join(DIR, image_type + "_" + str(cntr) + "." + ext), 'wb')
        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load : " + img
        print e
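The file-naming step in the loop above can be pulled out into a small helper, which makes it easy to check without touching the network. This is only a sketch: next_image_path and the "terminator" prefix are names made up for this example, and a temporary directory stands in for the real output folder.

```python
import os
import tempfile


def next_image_path(directory, prefix, extension):
    """Build the path for the next image, numbering from existing files."""
    if not os.path.exists(directory):
        os.makedirs(directory)
    # Count files that already carry the prefix, as the loop above does.
    existing = [name for name in os.listdir(directory) if prefix in name]
    extension = extension or "jpg"  # same .jpg default as the loop
    filename = "%s_%d.%s" % (prefix, len(existing) + 1, extension)
    return os.path.join(directory, filename)


# Usage with a throwaway directory and a hypothetical prefix.
demo_dir = tempfile.mkdtemp()
first = next_image_path(demo_dir, "terminator", "jpg")
open(first, "wb").close()  # create an empty file so the counter advances
second = next_image_path(demo_dir, "terminator", "")
print(os.path.basename(first), os.path.basename(second))
```

Building the path with os.path.join keeps the files inside the directory even when its name has no trailing slash.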

Voilà, now you can use this script to download images from Google search, for example to collect training images.

You can get the fully working script here:

https://gist.github.com/rishabhsixfeet/8ff479de9d19549d5c2d8bfc14af9b88