
Web crawling and robots.txt - II

Similar scenario to one of my previous questions:

  1. Using wget, I type the following to pull down images from a site (sub-folder):

     wget -r -A.jpg http://www.abc.com/images/
    
  2. I get two images from the above command: Img1.jpg and Img2.jpg.

  3. The index.php file in http://www.abc.com/images/ refers only to Img2.jpg (I checked the page source).

  4. If I key in http://www.abc.com/images/Img4.jpg or http://www.abc.com/images/Img5.jpg, I get two separate images.

  5. But these images are not downloaded by wget.

  6. How should I go about retrieving the entire set of images under http://www.abc.com/images/?


Not exactly sure what you want, but try this:

wget --recursive --accept=gif,jpg,png http://www.abc.com

This will:

  1. Create a directory called www.abc.com\
  2. Crawl all pages on www.abc.com
  3. Save all .GIF, .JPG or .PNG files inside the corresponding directories under www.abc.com\

You can then delete all directories except the one you're interested in, namely, www.abc.com\images\
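For example, on a Unix-like shell, a one-liner along these lines could do that pruning (a sketch, assuming wget created the mirror under a www.abc.com directory as described above):

    # Delete every top-level entry of the mirror except the images
    # sub-directory (assumes the default directory layout wget creates).
    find www.abc.com -mindepth 1 -maxdepth 1 ! -name images -exec rm -rf {} +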

Crawling all pages is a time-consuming operation, but it is probably the only way to make sure you get all the images that are referenced by any page on www.abc.com. There is no other way to detect which images are present inside http://abc.com/images/ unless the server allows directory browsing.
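That said, since you already know that unreferenced files such as Img4.jpg and Img5.jpg exist, and the names appear to follow a pattern, you could also probe candidate URLs directly. A minimal sketch, assuming the files are named Img1.jpg, Img2.jpg, and so on (the upper bound of 100 is a guess):

    # Try ImgN.jpg for N = 1..100; wget fails quietly on names that
    # do not exist, so only the images that are really there get saved.
    for i in $(seq 1 100); do
        wget -q -nc "http://www.abc.com/images/Img${i}.jpg"
    done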
