How do you wget a page protected with Shibboleth authentication?

Developer https://www.devze.com 2023-03-09 09:07 Source: the web

I am attempting to scrape data from pages protected by Shibboleth authentication. I had trouble getting cURL and webisoget to work correctly, so I am trying wget, thinking I could pass my certificate and grab the pages I want. However, I am having trouble with this as well, and I have had difficulty finding documentation that covers my problem.

Here is the wget command I am attempting to execute:

>wget --no-check-certificate --certificate=www.washington.edu.crt https://www.washington.edu/cec/i/INFO200A2821.html

This is what that command returns:

SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = c:/progra~1/wget/etc/wgetrc
--2011-05-28 00:32:37--  https://www.washington.edu/cec/i/INFO200A2821.html
Resolving www.washington.edu... 140.142.16.69, 140.142.11.167, 140.142.15.8
Connecting to www.washington.edu|140.142.16.69|:443... connected.
WARNING: cannot verify www.washington.edu's certificate, issued by `/C=ZA/ST=Wes
tern Cape/L=Cape Town/O=Thawte Consulting cc/OU=Certification Services Division/
CN=Thawte Premium Server CA/emailAddress=premium-server@thawte.com':
  Self-signed certificate encountered.
HTTP request sent, awaiting response... 200 OK
Length: 807 [text/html]
Saving to: `INFO200A2821.html.2'

100%[=====================================> ] 807         --.-K/s   in 0s

2011-05-28 00:32:38 (6.78 MB/s) - `INFO200A2821.html.2' saved [807/807]

However, even though I receive a page, it does not contain the information I hope to scrape. The page that comes back contains a form that auto-submits on load. The form has hidden input fields for the pubcookie and the relay_url.

I am able to access the page when I log in with my credentials in a browser. However, I am struggling to automate this and grab the information.


I'm not sure you can do that with wget. Shibboleth is an implementation of the SAML web SSO profile, and it expects you to have a valid session before you access the protected resource. Without a valid session, it will redirect you to the WAYF ("Where Are You From") page so that you can select the appropriate identity provider. A series of steps must be completed before you can access the resource.
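
To make one of those steps concrete: the 807-byte page from the question is an intermediate hop, a form that a browser would submit automatically. Here is a minimal Python sketch, using only the standard library, of pulling the action URL and hidden fields out of such a page. The sample HTML, its field values, and the example.org URLs are made up for illustration; the real page will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AutoSubmitFormParser(HTMLParser):
    """Collect the action URL, method, and named inputs of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Hidden inputs carry the state the next hop expects.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms on the page


def extract_form(html, page_url):
    """Return (absolute_action_url, method, fields), or None if no form."""
    parser = AutoSubmitFormParser()
    parser.feed(html)
    if parser.action is None:
        return None
    return urljoin(page_url, parser.action), parser.method, parser.fields


# Hypothetical stand-in for the page wget saved; not the real markup.
SAMPLE = """
<html><body onload="document.forms[0].submit()">
<form method="post" action="https://idp.example.org/login">
<input type="hidden" name="pubcookie" value="abc123">
<input type="hidden" name="relay_url" value="https://sp.example.org/relay">
</form></body></html>
"""

if __name__ == "__main__":
    print(extract_form(SAMPLE, "https://sp.example.org/page.html"))
```

The extracted fields would then be sent to the action URL, and the response may well be another form of the same kind, so expect to repeat this more than once.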

You could try using something like WWW::Mechanize (Mechanize.pm) for Perl to automate the authentication procedure and then retrieve the protected resource.
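
If Perl is not an option, the same idea can be sketched in Python with the standard library: keep one cookie jar for the whole exchange and submit each form in turn, so the session cookie set during login is sent back when you request the protected page. This is only a rough outline, not a tested Shibboleth client. The function names, the example.org URL, and the field names in the usage comment are assumptions.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_form_request(action_url, fields):
    """Build a POST request carrying the form fields, URL-encoded."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(action_url, data=data, method="POST")


def submit_form(opener, action_url, fields):
    """Send the form; return (final_url, body). Redirects are followed."""
    with opener.open(build_form_request(action_url, fields)) as resp:
        return resp.geturl(), resp.read().decode("utf-8", errors="replace")


# Usage outline (hypothetical URL and field names, needs network access):
#   opener, jar = make_session()
#   url, html = submit_form(opener, "https://idp.example.org/login",
#                           {"user": "me", "pass": "secret"})
#   # If html is another auto-submitting form, extract its fields and
#   # call submit_form again, then fetch the protected page with the
#   # same opener so the session cookies go along.
```

Reusing one opener for every request is the important part; a fresh wget or cURL call without the saved cookies starts the login flow over each time.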
