OK, there are many HTML/XML parsers for Java. What I want to do is a bit more than just knowing how to parse it. I want to filter the content and get it into a suitable form.
More precisely, I want to keep only the text and images. However, I want to preserve some of the text formatting, too, like: italic, bold, alignment, etc.
All this is because I'm trying to implement a converter that turns HTML into a specific format that I've created for my own purposes.
Any ideas? Surely, it must have been done many times before.
If your intent is to clean user-submitted content against a safe whitelist to prevent XSS, then I'd suggest using Jsoup for this. It provides built-in whitelists. It's then as simple as:
String safeHtml = Jsoup.clean(unsafeHtml, Whitelist.basicWithImages());
You can customize the Whitelist as described in its javadoc (newer jsoup releases name this class Safelist).
See also:
- Pros and cons of HTML parsers in Java
JTidy + XSLT?
Have a look at HTML Parser, it could be handy.
OK, I think I figured it out: when parsing, for each Element I can construct a javax.swing.text.html.InlineView, i.e. InlineView iv = new InlineView(element), and then get the attributes with iv.getAttributes().
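For anyone trying the Swing route, here is a minimal sketch of reading formatting attributes from the parsed elements. It is an assumption-laden illustration, not the poster's exact code: it reads the attributes straight from each leaf Element of an HTMLDocument instead of constructing an InlineView, and the class name SwingRunsSketch, the method runs, and the "[b]"/"[i]" markers are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper: lists text runs and images with their formatting flags.
public class SwingRunsSketch {
    static List<String> runs(String html) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Avoid ChangedCharSetException if the page declares a charset.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);

        List<String> out = new ArrayList<>();
        ElementIterator it = new ElementIterator(doc);
        for (Element e = it.next(); e != null; e = it.next()) {
            AttributeSet attrs = e.getAttributes();
            Object name = attrs.getAttribute(StyleConstants.NameAttribute);
            if (name == HTML.Tag.IMG) {
                // Images are leaf elements carrying their own attributes.
                out.add("img " + attrs.getAttribute(HTML.Attribute.SRC));
            } else if (name == HTML.Tag.CONTENT) {
                int start = e.getStartOffset();
                String text = doc.getText(start, e.getEndOffset() - start).trim();
                if (text.isEmpty()) {
                    continue; // skip whitespace-only runs
                }
                // Inline tags such as <b> and <i> show up as attributes of the run.
                if (attrs.isDefined(HTML.Tag.B)) text += " [b]";
                if (attrs.isDefined(HTML.Tag.I)) text += " [i]";
                out.add(text);
            }
        }
        return out;
    }
}
```

Calling SwingRunsSketch.runs(html) returns one entry per text run or image; paragraph-level attributes such as alignment would have to be read from the parent elements in the same way.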
Right. If you could help more, i.e. have some first-hand experience to share, please do!
You can use the XML DOM parser from the org.w3c.dom and javax.xml.parsers packages. With it you can easily parse the document and get the node contents, provided the input is well-formed XML (XHTML):
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
and then get the elements by using
NodeList nl = doc.getElementsByTagName("p"); // for paragraph tags
and then read the content from the NodeList, e.g. nl.item(i).getTextContent(), which gives you the whole text content of each paragraph tag. You can do the same for any other tag.
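To make the DOM approach concrete for the filtering described in the question, here is a rough sketch that keeps only text, bold/italic markers, and images from each paragraph. It assumes the input is well-formed XHTML; the class name DomFilterSketch and the bracketed output format ("[b]", "[img ...]") are hypothetical stand-ins for whatever custom format you are targeting.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical converter: one output string per <p> element.
public class DomFilterSketch {
    static List<String> convert(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList paragraphs = doc.getElementsByTagName("p");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            StringBuilder sb = new StringBuilder();
            render(paragraphs.item(i), sb);
            out.add(sb.toString());
        }
        return out;
    }

    // Walks the children recursively, keeping text, b/i markers, and image sources.
    static void render(Node node, StringBuilder sb) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                String tag = child.getNodeName();
                if (tag.equals("img")) {
                    sb.append("[img ").append(((Element) child).getAttribute("src")).append("]");
                } else if (tag.equals("b") || tag.equals("i")) {
                    sb.append("[").append(tag).append("]");
                    render(child, sb);
                    sb.append("[/").append(tag).append("]");
                } else {
                    render(child, sb); // drop the unknown tag but keep its text
                }
            }
        }
    }
}
```

Real-world HTML is rarely well-formed XML, so in practice the input would first need to be tidied (for example with JTidy, as suggested above) before this parser accepts it.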