
What is the best way to check each link of a website?

I want to create a crawler that follows each link of a site and checks the URL to see if it works. Right now my code opens each URL using url.openStream().


So what is the best way to create a crawler?


Use an HTML parser like Jsoup.

import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Set<String> validLinks = new HashSet<String>();
Set<String> invalidLinks = new HashSet<String>();

Document document = Jsoup.connect("http://example.com").get();
Elements links = document.select("a[href]"); // only anchors that actually have a href

for (Element link : links) {
    String url = link.absUrl("href"); // resolves relative URLs against the page URL

    if (!validLinks.contains(url) && !invalidLinks.contains(url)) {
        try {
            int statusCode = Jsoup.connect(url).execute().statusCode();

            if (200 <= statusCode && statusCode < 400) {
                validLinks.add(url);
            } else {
                invalidLinks.add(url);
            }
        } catch (Exception e) {
            // Connection failures and HTTP error responses end up here.
            invalidLinks.add(url);
        }
    }
}
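The loop above only checks the links found on a single page. To follow every link of the site, as the question asks, you could recurse into pages on the same host, roughly like this (a minimal sketch using the same valid/invalid bookkeeping; the method name crawlPage and the http://example.com host check are placeholders, it needs java.io.IOException in the imports, and it does no politeness delay or robots.txt handling):

Set<String> visitedPages = new HashSet<String>();
Set<String> validLinks = new HashSet<String>();
Set<String> invalidLinks = new HashSet<String>();

void crawlPage(String pageUrl) throws IOException {
    if (!visitedPages.add(pageUrl)) {
        return; // this page has already been crawled
    }

    Document document = Jsoup.connect(pageUrl).get();

    for (Element link : document.select("a[href]")) {
        String url = link.absUrl("href");

        // Same valid/invalid bookkeeping as in the loop above.
        if (!validLinks.contains(url) && !invalidLinks.contains(url)) {
            try {
                int statusCode = Jsoup.connect(url).execute().statusCode();

                if (200 <= statusCode && statusCode < 400) {
                    validLinks.add(url);
                } else {
                    invalidLinks.add(url);
                }
            } catch (Exception e) {
                invalidLinks.add(url);
            }
        }

        // Only follow links that stay on the same site.
        if (url.startsWith("http://example.com")) {
            crawlPage(url);
        }
    }
}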

To make the check more efficient, you may want to send a HEAD request instead when testing each URL, but then you'll have to use URLConnection, as Jsoup by design doesn't support it (a HEAD request returns no HTML content to parse).
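With plain HttpURLConnection, that per-URL check could look roughly like this (a sketch, not the answer's code; the helper name isHeadReachable and the 5-second timeouts are my own choices):

import java.net.HttpURLConnection;
import java.net.URL;

static boolean isHeadReachable(String url) {
    try {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("HEAD"); // ask for the headers only, not the body
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(5000);
        int statusCode = connection.getResponseCode();
        connection.disconnect();
        return 200 <= statusCode && statusCode < 400;
    } catch (Exception e) {
        return false; // unreachable, timed out, or malformed URL
    }
}

Inside the loop you would then replace the Jsoup status check with a call to isHeadReachable(url).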


