I will try my best to be specific. I am working on a crawler in VB.NET and am mainly interested in extracting the text content of a page. My current application loads the body of the HTML source into a textbox by using a WebBrowser control as follows:
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
Dim url As String = "<url>"
WebBrowser1.Navigate(url)
End Sub
Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
TextBox2.Text = WebBrowser1.Document.Body.OuterHtml
End Sub
At this point TextBox2 contains junk HTML (href, img, ads, script, etc.), but I need to get rid of all this markup and grab the plain text.
I could use regular expressions to get rid of all the anomalies, but I think the HTML Agility Pack (HAP) is a much more appropriate tool for parsing HTML.
Searching on here brought me to this page which discusses the use of Whitelist technique mentioned by 'Meltdown'
HTML Agility Pack strip tags NOT IN whitelist
But how do I apply it in vb.net as it seems like a great idea?
Please advise.
EDIT: I found a vb.net version of the code shown below, but there seems to be an error at
If i IsNot DeletableNodesXpath.Count - 1 Then
Error: 'IsNot' requires operands that have reference types, but this operand has the value type 'Integer'.
Here is the code:
Public NotInheritable Class HtmlSanitizer
Private Sub New()
End Sub
Private Shared ReadOnly Whitelist As IDictionary(Of String, String())
Private Shared DeletableNodesXpath As New List(Of String)()
Shared Sub New()
Whitelist = New Dictionary(Of String, String())() From { _
{"a", New String() {"href"}}, _
{"strong", Nothing}, _
{"em", Nothing}, _
{"blockquote", Nothing}, _
{"b", Nothing}, _
{"p", Nothing}, _
{"ul", Nothing}, _
{"ol", Nothing}, _
{"li", Nothing}, _
{"div", New String() {"align"}}, _
{"strike", Nothing}, _
{"u", Nothing}, _
{"sub", Nothing}, _
{"sup", Nothing}, _
{"table", Nothing}, _
{"tr", Nothing}, _
{"td", Nothing}, _
{"th", Nothing} _
}
End Sub
Public Shared Function Sanitize(input As String) As String
If input.Trim().Length < 1 Then
Return String.Empty
End If
Dim htmlDocument = New HtmlDocument()
htmldocument.LoadHtml(input)
SanitizeNode(htmldocument.DocumentNode)
Dim xPath As String = HtmlSanitizer.CreateXPath()
Return StripHtml(htmldocument.DocumentNode.WriteTo().Trim(), xPath)
End Function
Private Shared Sub SanitizeChildren(parentNode As HtmlNode)
For i As Integer = parentNode.ChildNodes.Count - 1 To 0 Step -1
SanitizeNode(parentNode.ChildNodes(i))
Next
End Sub
Private Shared Sub SanitizeNode(node As HtmlNode)
If node.NodeType = HtmlNodeType.Element Then
If Not Whitelist.ContainsKey(node.Name) Then
If Not DeletableNodesXpath.Contains(node.Name) Then
'DeletableNodesXpath.Add(node.Name.Replace("?",""));
node.Name = "removeableNode"
DeletableNodesXpath.Add(node.Name)
End If
If node.HasChildNodes Then
SanitizeChildren(node)
End If
Return
End If
If node.HasAttributes Then
For i As Integer = node.Attributes.Count - 1 To 0 Step -1
Dim currentAttribute As HtmlAttribute = node.Attributes(i)
Dim allowedAttributes As String() = Whitelist(node.Name)
If allowedAttributes IsNot Nothing Then
If Not allowedAttributes.Contains(currentAttribute.Name) Then
node.Attributes.Remove(currentAttribute)
End If
Else
node.Attributes.Remove(currentAttribute)
End If
Next
End If
End If
If node.HasChildNodes Then
SanitizeChildren(node)
End If
End Sub
Private Shared Function StripHtml(html As String, xPath As String) As String
Dim htmlDoc As New HtmlDocument()
htmlDoc.LoadHtml(html)
If xPath.Length > 0 Then
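Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)
' SelectNodes returns Nothing when no node matches, so guard before iterating.
If invalidNodes IsNot Nothing Then
For Each node As HtmlNode In invalidNodes
node.ParentNode.RemoveChild(node, True)
Next
End If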
End If
Return htmlDoc.DocumentNode.WriteContentTo()
End Function
Private Shared Function CreateXPath() As String
Dim _xPath As String = String.Empty
For i As Integer = 0 To DeletableNodesXpath.Count - 1
If i IsNot DeletableNodesXpath.Count - 1 Then
_xPath += String.Format("//{0}|", DeletableNodesXpath(i).ToString())
Else
_xPath += String.Format("//{0}", DeletableNodesXpath(i).ToString())
End If
Next
Return _xPath
End Function
End Class
Can somebody please help?
Instead of using IsNot, just use <>. You're basically checking that the value of one integer does not equal the value of another integer minus 1.
IsNot compares object references, so it can't be used on value types such as Integer.
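As a minimal sketch of that change, the loop from CreateXPath could look like the following. The module name XPathSketch and the names parameter are hypothetical, introduced only so the snippet compiles on its own; in the question's class the list is the shared DeletableNodesXpath field.

```vbnet
Imports System
Imports System.Collections.Generic

Module XPathSketch
    ' Hypothetical stand-in for HtmlSanitizer.CreateXPath: the list of node
    ' names is passed in as a parameter instead of being read from a shared field.
    Function CreateXPath(names As IList(Of String)) As String
        Dim xPath As String = String.Empty
        For i As Integer = 0 To names.Count - 1
            ' <> compares Integer values; IsNot only compares object references.
            If i <> names.Count - 1 Then
                xPath &= String.Format("//{0}|", names(i))
            Else
                xPath &= String.Format("//{0}", names(i))
            End If
        Next
        Return xPath
    End Function

    Sub Main()
        Dim names As New List(Of String) From {"script", "img"}
        Console.WriteLine(CreateXPath(names)) ' prints //script|//img
    End Sub
End Module
```

The only behavioural change compared with the code in the question is the comparison operator. The separator handling is left as it was.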
edit: I did just notice this is super super old. Just saw the July 26 date!
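For the original goal of getting plain text out of the page, a rough sketch with the HTML Agility Pack might look like this. It assumes the HtmlAgilityPack assembly is referenced. ExtractText and PlainTextSketch are hypothetical names, and the choice to drop only script and style elements is an assumption, not something taken from the whitelist code in the question.

```vbnet
Imports HtmlAgilityPack

Module PlainTextSketch
    ' Hypothetical helper: returns the visible text of an HTML fragment.
    Function ExtractText(html As String) As String
        ' Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument
        ' in a WinForms project.
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        ' Assumption: script and style content is never wanted in the output.
        Dim junk As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//script|//style")
        If junk IsNot Nothing Then ' SelectNodes returns Nothing when nothing matches
            For Each node As HtmlNode In junk
                node.Remove()
            Next
        End If

        ' InnerText concatenates the remaining text nodes; DeEntitize turns
        ' entities such as &amp; back into plain characters.
        Return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim()
    End Function
End Module
```

In the DocumentCompleted handler from the question, this could be called as TextBox2.Text = PlainTextSketch.ExtractText(WebBrowser1.Document.Body.OuterHtml). Whitespace left behind by removed tags is not collapsed here.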