
Web crawling and robots.txt - II

Similar scenario to one of my previous questions:

  1. Using wget, I type the following to pull down images from a site (sub-folder):

     wget -r -A.jpg http://www.abc.com/images/
    
  2. I get two images from the above command: Img1.jpg and Img2.jpg.

  3. The index.php file in http://www.abc.com/images/ refers only to Img2.jpg (I checked the page source).

  4. If I key in http://www.abc.com/images/Img4.jpg or http://www.abc.com/images/Img5.jpg, I get two separate images.

  5. But these images are not downloaded by wget.

  6. How should I go about retrieving the entire set of images under http://www.abc.com/images/?


Not exactly sure what you want, but try this:

wget --recursive --accept=gif,jpg,png http://www.abc.com

This will:

  1. Create a directory called www.abc.com\
  2. Crawl all pages on www.abc.com
  3. Save all .GIF, .JPG or .PNG files inside the corresponding directories under www.abc.com\

You can then delete all directories except the one you're interested in, namely, www.abc.com\images\
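For example, on a Unix-like shell, a one-liner along these lines could do that pruning (a sketch, assuming wget created the mirror under a www.abc.com directory as described above):

    # Delete every top-level entry of the mirror except the images
    # sub-directory (assumes the default directory layout wget creates).
    find www.abc.com -mindepth 1 -maxdepth 1 ! -name images -exec rm -rf {} +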

Crawling all pages is a time-consuming operation, but it is probably the only way to make sure you get all the images that are referenced by any page on www.abc.com. There is no other way to detect which images are present inside http://abc.com/images/ unless the server allows directory browsing.
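That said, since you already know that unreferenced files such as Img4.jpg and Img5.jpg exist, and the names appear to follow a pattern, you could also probe candidate URLs directly. A minimal sketch, assuming the files are named Img1.jpg, Img2.jpg, and so on (the upper bound of 100 is a guess):

    # Try ImgN.jpg for N = 1..100; wget fails quietly on names that
    # do not exist, so only the images that are really there get saved.
    for i in $(seq 1 100); do
        wget -q -nc "http://www.abc.com/images/Img${i}.jpg"
    done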
