I have some emails whose content is in HTML format, and I want to save them in a database in a readable format. Any suggestions?
I also have emails dumped in a text file, and I need to extract data from them. Any suggestions? All of this is in Java.
If you want to remove all HTML tags, take a look at Jsoup. The code below uses Jsoup and should remove all the HTML tags and give you plain text.
import org.jsoup.Jsoup;

public static String html2text(String html) {
    // Parse the markup and return only its text content, with tags removed.
    return Jsoup.parse(html).text();
}
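You could try removing all tags, leaving just the text of the tag content (the (?s) flag lets a tag that spans several lines match; the original (?m) flag only affects ^ and $ and had no effect here):
String text = str.replaceAll("(?s)<.*?>", "");
But it's not going to work for all cases, for example when the text itself contains a literal < or >.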
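To make the limits of the regex approach concrete, here is a minimal sketch using only java.util.regex. The class name RegexTagStripper and the method stripTags are made up for illustration; the pattern is the one-liner above with the (?s) flag so that '.' also matches line breaks.

```java
import java.util.regex.Pattern;

final class RegexTagStripper {
    // (?s) lets '.' match line breaks, so a tag split across lines is still removed.
    // The lazy quantifier stops at the first '>' after each '<'.
    private static final Pattern TAG = Pattern.compile("(?s)<.*?>");

    private RegexTagStripper() {
    }

    // Removes everything that looks like a tag; it does not decode entities
    // and does not understand comments, scripts, or quoted attribute values.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }
}
```

With this sketch, stripTags("<p>Hello <b>world</b></p>") gives "Hello world", but "Tom &amp; Jerry" comes back unchanged (the entity is not decoded), and plain text such as "if a < b and b > c" loses everything between the < and the >. Those are the cases where a real parser is the safer choice.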
This page describes three ways to extract the data:
- using a regular expression
- using HTMLEditorKit included in Swing
- using an HTML parser library like JSoup
Which one is best is for you to decide. An important consideration is the dependencies each option adds to your application: for example, if you already have a desktop application, a Swing dependency will probably not hurt, whereas in a server application it might not be the best idea.
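As a rough sketch of the HTMLEditorKit option from the list above, the parser that ships with Swing can be driven with a callback that collects only the text nodes. The class name SwingHtmlToText and the method html2text are hypothetical; the Swing classes (HTMLEditorKit.ParserCallback, ParserDelegator) are part of the JDK. Joining text chunks with a single space is a simplification made here, not something the library requires.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class SwingHtmlToText {
    private SwingHtmlToText() {
    }

    // Parses the HTML with the JDK's built-in parser and keeps only text nodes.
    static String html2text(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text; separate runs with a space (assumption).
                out.append(data).append(' ');
            }
        };
        try (Reader reader = new StringReader(html)) {
            // Third argument true: ignore any charset declared inside the document.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }
}
```

This avoids a third-party dependency at the cost of pulling in the java.desktop module, which is the trade-off described above for server applications.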
HTML is a "readable" format. But if you're looking for something to abstract away the HTML-isms, there are several libraries you can look at. jsoup is one example. Mostly the keywords to look for here are "DOM", "HTML parser", and "XSS prevention".
You can use HtmlCleaner to remove HTML tags.