开发者

Using Html Agility Pack to grab text content

开发者 https://www.devze.com 2023-03-23 00:54 出处:网络
I will try my best to specific. Basically working on a crawler in vb.net whereby I am more interested in extracting text content of the page. My current application downloads the body of the html sour

I will try my best to specific. Basically working on a crawler in vb.net whereby I am more interested in extracting text content of the page. My current application downloads the body of the html source in a textbox by using a web browser control as follows:

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs)   Handles Button1.Click
    Dim url As String = "<url>"
    WebBrowser1.Navigate(url)
End Sub

Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As    System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
    TextBox2.Text = WebBrowser1.Document.Body.OuterHtml
End Sub

Now from here on, textbox2 consists of junk html which contains href,img,ads,script etc but I need to get ride of all these metadata and grab the plain text.

I could apply regex properties to get ride of all the anomalies but i think HAP is much more appropriate for html parser.

Searching on here brought me to this page which discusses the use of Whitelist technique mentioned by 'Meltdown'

HTML Agility Pack strip tags NOT IN whitelist

But how do I apply it in vb.net as it seems like a great idea?

Please adivce guys..........

EDIT: I found a vb.net version of the code shown below, but there seems to be an error at

If i IsNot DeletableNodesXpath.Count - 1 Then

Errors: IsNot requires operand that have reference types, but this operand has the value type integer

Here is the code:

Public NotInheritable Class HtmlSanitizer Private Sub New() End Sub Private Shared ReadOnly Whitelis开发者_开发百科t As IDictionary(Of String, String()) Private Shared DeletableNodesXpath As New List(Of String)()

Shared Sub New()
    Whitelist = New Dictionary(Of String, String())() From { _
        {"a", New () {"href"}}, _
        {"strong", Nothing}, _
        {"em", Nothing}, _
        {"blockquote", Nothing}, _
        {"b", Nothing}, _
        {"p", Nothing}, _
        {"ul", Nothing}, _
        {"ol", Nothing}, _
        {"li", Nothing}, _
        {"div", New () {"align"}}, _
        {"strike", Nothing}, _
        {"u", Nothing}, _
        {"sub", Nothing}, _
        {"sup", Nothing}, _
        {"table", Nothing}, _
        {"tr", Nothing}, _
        {"td", Nothing}, _
        {"th", Nothing} _
    }
End Sub

Public Shared Function Sanitize(input As String) As String
    If input.Trim().Length < 1 Then
        Return String.Empty
    End If
    Dim htmlDocument = New HtmlDocument()

    htmldocument.LoadHtml(input)
    SanitizeNode(htmldocument.DocumentNode)
    Dim xPath As String = HtmlSanitizer.CreateXPath()

    Return StripHtml(htmldocument.DocumentNode.WriteTo().Trim(), xPath)
End Function

Private Shared Sub SanitizeChildren(parentNode As HtmlNode)
    For i As Integer = parentNode.ChildNodes.Count - 1 To 0 Step -1
        SanitizeNode(parentNode.ChildNodes(i))
    Next
End Sub

Private Shared Sub SanitizeNode(node As HtmlNode)
    If node.NodeType = HtmlNodeType.Element Then
        If Not Whitelist.ContainsKey(node.Name) Then
            If Not DeletableNodesXpath.Contains(node.Name) Then
                'DeletableNodesXpath.Add(node.Name.Replace("?",""));
                node.Name = "removeableNode"
                DeletableNodesXpath.Add(node.Name)
            End If
            If node.HasChildNodes Then
                SanitizeChildren(node)
            End If

            Return
        End If

        If node.HasAttributes Then
            For i As Integer = node.Attributes.Count - 1 To 0 Step -1
                Dim currentAttribute As HtmlAttribute = node.Attributes(i)
                Dim allowedAttributes As String() = Whitelist(node.Name)
                If allowedAttributes IsNot Nothing Then
                    If Not allowedAttributes.Contains(currentAttribute.Name) Then
                        node.Attributes.Remove(currentAttribute)
                    End If
                Else
                    node.Attributes.Remove(currentAttribute)
                End If
            Next
        End If
    End If

    If node.HasChildNodes Then
        SanitizeChildren(node)
    End If
End Sub

Private Shared Function StripHtml(html As String, xPath As String) As String
    Dim htmlDoc As New HtmlDocument()
    htmlDoc.LoadHtml(html)
    If xPath.Length > 0 Then
        Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)
        For Each node As HtmlNode In invalidNodes
            node.ParentNode.RemoveChild(node, True)
        Next
    End If
    Return htmlDoc.DocumentNode.WriteContentTo()


End Function

Private Shared Function CreateXPath() As String
    Dim _xPath As String = String.Empty
    For i As Integer = 0 To DeletableNodesXpath.Count - 1
        If i IsNot DeletableNodesXpath.Count - 1 Then
            _xPath += String.Format("//{0}|", DeletableNodesXpath(i).ToString())
        Else
            _xPath += String.Format("//{0}", DeletableNodesXpath(i).ToString())
        End If
    Next
    Return _xPath
End Function
End Class

Please can somebody help??????


Instead of using IsNot, just use <>. As you're bascially check the value of an integer does not equal the value of another integer - 1.

I believe IsNot can't be used on integers.

edit: I did just notice this is super super old. Just saw the July 26 date!

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号