C# HTML Agility Pack

Developer https://www.devze.com 2023-03-08 10:57 Source: web
We are moving an e-commerce website to a new platform. Because all of its pages are static HTML and the product information is not stored in a database, we must scrape the current website for the product descriptions.

Here is one of the pages: http://www.cabinplace.com/accrugsbathblackbear.htm

What is the best way to get the description into a string? Should I use the HTML Agility Pack, and if so, how would this be done? I am new to the HTML Agility Pack and to XHTML in general.

Thanks


The HTML Agility Pack is a good library to use for this kind of work.

You did not indicate whether all of the content is structured this way, or whether you have already extracted the kind of fragment you posted from the HTML files, so it is difficult to advise further.

In general, if all pages are structured similarly, I would use an XPath expression to select the paragraph and read its InnerHtml or InnerText on each page.

Something like the following:

var description = htmlDoc.DocumentNode.SelectNodes("//p[@class='content_txt']")[0].InnerText;
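For a slightly fuller picture, here is a minimal sketch of the same idea wrapped in a helper. It assumes the HtmlAgilityPack NuGet package is installed and that the description really sits in a `<p class="content_txt">` element, which is taken from the line above and has not been checked against the live page. The names `DescriptionScraper` and `ExtractDescription` are made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class DescriptionScraper
{
    // Hypothetical helper: returns the text of the first <p class="content_txt">,
    // or null when the page has no such paragraph.
    public static string ExtractDescription(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//" searches the whole document. SelectSingleNode returns null
        // when nothing matches, so check before reading the node.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//p[@class='content_txt']");
        if (node == null)
        {
            return null;
        }

        // InnerText drops the markup; DeEntitize turns entities such as
        // &amp; back into plain characters.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

To read a page straight from the site instead of from a string, `new HtmlWeb().Load(url)` returns an `HtmlDocument` that can be queried the same way. If some pages turn out to use a different layout, the null return gives you a simple way to log them for manual review.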


Also, if you need a good tool for testing or finding XPath expressions for the HAP, you can use HTML-Agility-xpath-finder. It is built on the same library, so an XPath that works in the tool can be used safely in your code.
