How to parse a simple page using Html Agility Pack?

开发者 https://www.devze.com 2023-04-12 14:43 Source: Internet
I am trying to parse this page, but there isn't much unique info that would let me identify the sections I want.

Basically, I am trying to get most of the data shown to the right of the Flash video. So:

Alternating Floor Press

Type: Strength
Main Muscle Worked: Chest 
Other Muscles: Abdominals, Shoulders, Triceps 
Equipment: Kettlebells 
Mechanics Type: Compound
Level: Beginner
Sport: No
Force: N/A

And also the links to the images that show the before and after states.

Right now I use this:

HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb ( );
HtmlAgilityPack.HtmlDocument doc = web.Load ( "http://www.bodybuilding.com/exercises/detail/view/name/alternating-floor-press" );
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants ( "a" );

foreach ( var link in threadLinks )
{
    string str = link.InnerHtml;
    Console.WriteLine ( str );
}

This gives me a lot of stuff I don't need but also prints what I need. Should I be parsing this printed data by trying to see where my goal data might be inside it?


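You can select the nodes you are interested in by their id, using an XPath expression:

        HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.bodybuilding.com/exercises/detail/view/name/alternating-floor-press");
        // By default, SelectNodes returns null (not an empty collection) when nothing matches.
        IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.SelectNodes("//*[@id=\"exerciseDetails\"]");

        if (threadLinks != null)
        {
            foreach (var link in threadLinks)
            {
                // InnerText drops the markup and keeps only the text content.
                string str = link.InnerText;
                Console.WriteLine(str);
            }
        }
        Console.ReadKey();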
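
Once you have the text of that node, the "Type: Strength" style fields can be split into name/value pairs. This is only a rough sketch: the SampleHtml markup below is made up to stand in for the real page (whose actual structure I have not checked), ParseDetails is a hypothetical helper name, and it assumes each field ends up on its own line of the node's InnerText.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ExerciseDetailsSketch
{
    // Hypothetical markup standing in for the real page; the real structure may differ.
    public const string SampleHtml = @"
<div id=""exerciseDetails"">
  <h1>Alternating Floor Press</h1>
  Type: <a href=""/type/strength"">Strength</a><br>
  Main Muscle Worked: <a href=""/muscle/chest"">Chest</a><br>
  Equipment: <a href=""/equipment/kettlebells"">Kettlebells</a><br>
</div>";

    // Returns "label -> value" pairs found in the element with id "exerciseDetails".
    public static Dictionary<string, string> ParseDetails(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, string>();

        // SelectSingleNode returns null when no node matches the XPath expression.
        HtmlNode details = doc.DocumentNode.SelectSingleNode("//*[@id='exerciseDetails']");
        if (details == null)
        {
            return result;
        }

        // InnerText strips the tags; DeEntitize turns entities such as &amp; back into characters.
        string text = HtmlEntity.DeEntitize(details.InnerText);

        foreach (string rawLine in text.Split('\n'))
        {
            string line = rawLine.Trim();
            int colon = line.IndexOf(':');
            if (colon <= 0)
            {
                continue; // blank line, or a line without a "label:" prefix (e.g. the title)
            }
            string label = line.Substring(0, colon).Trim();
            string value = line.Substring(colon + 1).Trim();
            result[label] = value;
        }
        return result;
    }

    public static void Main()
    {
        foreach (KeyValuePair<string, string> pair in ParseDetails(SampleHtml))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

If the real page does not put each field on its own line, splitting on line breaks will not work, and you would need to walk the child nodes of the element instead.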


For a given <a> node, to get the text shown, try .InnerText.

Right now you are using the contents of all <a> tags within the document. Try narrowing down to only the ones you need. Look for other elements which contain the particular <a> tags you are after. For example, do they all sit inside a <div> with a certain class?

E.g. if you find the <a> tags you are interested in all sit within <div class="foolinks"> then you can do something like:-

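// Attributes["class"] is an HtmlAttribute (or null), so compare the attribute's value instead.
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("div")
    .First(dn => dn.GetAttributeValue("class", "") == "foolinks").Descendants("a");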

--UPDATE--

Given the information in your comment, I would try:-

IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("div")
    .First(dn => dn.Id == "exerciseDetails").Descendants("a");

--UPDATE--

If you are having trouble getting it to work, try splitting it up into variable assignments and stepping through the code, inspecting each variable to see if it holds what you expect.

E.g.,

var divs = doc.DocumentNode.Descendants("div");
var div = divs.FirstOrDefault(dn => dn.Id == "exerciseDetails");
if (div == null)
{
    // couldn't find the node - do whatever is appropriate, e.g. throw an exception
}

IEnumerable<HtmlNode> threadLinks = div.Descendants("a");

BTW - the .Id property should map to the id attribute of the node, as you suggest it does. If it does not behave as you expect, you could try dn => dn.GetAttributeValue("id", "") == "exerciseDetails" instead.
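
For the image links you asked about, the same narrowing-down approach can be used to read attribute values rather than text. A minimal sketch, with assumptions: the SampleHtml markup is invented for illustration, AttributeValues is a hypothetical helper name, and I am assuming the images sit inside the same exerciseDetails div, which may not be true on the real page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ExerciseLinksSketch
{
    // Hypothetical markup; the real page may place the links and images elsewhere.
    public const string SampleHtml = @"
<div id=""nav""><a href=""/home"">Home</a></div>
<div id=""exerciseDetails"">
  <a href=""/muscle/chest"">Chest</a>
  <a href=""/equipment/kettlebells""><b>Kettlebells</b></a>
  <img src=""/img/press-start.jpg"" alt=""start"">
  <img src=""/img/press-end.jpg"" alt=""end"">
</div>";

    // Collects the given attribute from every <tag> element inside the div with the given id.
    public static List<string> AttributeValues(string html, string containerId, string tag, string attribute)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // FirstOrDefault returns null instead of throwing when no div matches.
        HtmlNode container = doc.DocumentNode
            .Descendants("div")
            .FirstOrDefault(dn => dn.GetAttributeValue("id", "") == containerId);
        if (container == null)
        {
            return new List<string>();
        }

        return container.Descendants(tag)
            .Select(node => node.GetAttributeValue(attribute, "")) // "" when the attribute is absent
            .Where(value => value.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        foreach (string src in AttributeValues(SampleHtml, "exerciseDetails", "img", "src"))
        {
            Console.WriteLine(src);
        }
    }
}
```

Calling AttributeValues(SampleHtml, "exerciseDetails", "a", "href") gives the link targets in the same way. The values are returned exactly as written in the markup, so relative URLs would still need to be combined with the site's base address.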
