Use curl to parse XML, get an image's URL and download it

Developer https://www.devze.com 2023-01-09 18:26 Source: Internet

I want to write a shell script to get an image from an rss feed. Right now I have:

curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/  height="400" \/>//' | sed 's/ //g'

I use this to grab the first occurrence of an image URL in the file. Now I want to put this URL in a variable and use cURL again to download the image. Any help appreciated! (Also, you might give tips on how to better remove everything else from the line with the URL. This is the line:

 <img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />

There's probably some better regex to remove everything except the URL than my solution.) Thanks in advance!

Using a regexp to parse HTML/XML is a Bad Idea in general. Therefore I'd recommend that you use a proper parser.

If you don't object to using Perl, let Perl do the proper XML or HTML parsing for you using appropriate parser libraries:

HTML

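curl -s http://BOGUS.com | perl -e '{use HTML::TokeParser; 
    # Read the page from STDIN and stop at the first <img> tag.
    $parser = HTML::TokeParser->new(\*STDIN); 
    $img = $parser->get_tag("img") ; 
    # get_tag returns [tag, attribute hash, ...]; print the src attribute.
    print "$img->[1]->{src}\n"; 
}'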

/content02/groups/intranetcommon/documents/image/blk_logo.gif

XML

curl http://BOGUS.com/whdata0.xml | perl -e '{use XML::Twig;
    $twig=XML::Twig->new(twig_handlers =>{img => sub { 
       print $_[1]->att("src")."\n"; exit 0;}}); 
    open(my $fh, "-");
    $twig->parse($fh);
}'

/content02/groups/intranetcommon/documents/image/blk_logo.gif

I used wget instead of curl, but it's just the same:

#!/bin/bash
url='http://www.nichtlustig.de/rss/nichtrss.rss'
wget -O- -q "$url" | awk 'BEGIN{ RS="</a>" }
/<img src=/{
  gsub(/.*<img src=\"/,"")
  gsub(/\".[^>]*>/,"")
  print
}'  |  xargs -I{} wget "{}"

Use a DOM parser and extract all img elements using getElementsByTagName. Then add them to a list/array, loop through and separately fetch them.

I would suggest using Python, but any language would have a DOM library.
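As a rough illustration of the DOM approach, here is a minimal Python 3 sketch using the standard-library xml.dom.minidom. The function names image_urls and fetch_all are made up for this example, and it assumes the feed is well-formed XML in which the img tags are real elements (in many RSS feeds they are escaped inside a description, in which case that text would have to be parsed as HTML first). The sample string is built from the img line quoted in the question.

```python
import os
import urllib.request
from urllib.parse import urlparse
from xml.dom.minidom import parseString


def image_urls(xml_text):
    """Return the src attribute of every <img> element in well-formed XML."""
    dom = parseString(xml_text)
    return [img.getAttribute("src")
            for img in dom.getElementsByTagName("img")
            if img.hasAttribute("src")]


def fetch_all(urls, dest_dir="."):
    """Download each URL into dest_dir, named after its last path segment."""
    for url in urls:
        name = os.path.basename(urlparse(url).path)
        urllib.request.urlretrieve(url, os.path.join(dest_dir, name))


# Sample input (hypothetical wrapper element around the line from the question).
sample = ('<item><img src="http://www.nichtlustig.de/comics/full/100802.jpg" '
          'alt="" width="400" height="400" /></item>')
urls = image_urls(sample)
print(urls[0])
# fetch_all(urls)  # uncomment to download; needs network access
```

To run it against a real feed, the text returned by curl or urllib.request.urlopen would be passed to image_urls instead of the sample string.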

#!/bin/sh
URL=$(curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/  height="400" \/>//' | sed 's/ //g')
curl -C - -O "$URL"

This totally does the job! Any idea on the regex?

Here's a quick Python solution:

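# Requires the beautifulsoup4 package (Python 3).
import sys

from bs4 import BeautifulSoup

# Parse the markup piped in on stdin and print the src of the first <img>.
soup = BeautifulSoup(sys.stdin.read(), "html.parser")
print(soup.find_all('img')[0]['src'])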

Usage:

$ curl http://www.google.com/`curl http://www.google.com | python get_img_src.py`

This works like a charm and will not leave you trying to find the magical regex that will parse random HTML (Hint: there is no such expression, especially not if you have a greedy matcher like sed.)
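If installing BeautifulSoup is not an option, roughly the same extraction can be sketched with Python's standard-library html.parser, which tolerates markup that is not well-formed XML. This is only a minimal sketch: the class name FirstImgSrc and the helper first_img_src are invented for illustration, and the sample line is the one quoted in the question.

```python
from html.parser import HTMLParser


class FirstImgSrc(HTMLParser):
    """Remember the src of the first <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_img_src(markup):
    """Return the first image URL found in markup, or None if there is none."""
    parser = FirstImgSrc()
    parser.feed(markup)
    return parser.src


line = ' <img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />'
print(first_img_src(line))
```

Reading sys.stdin.read() instead of the sample line would let it sit at the end of a curl pipeline in the same way as the script above.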