Unit testing an HTML parser/cleaner?

Developer https://www.devze.com 2023-01-07 12:24 Source: Internet
I'm trying to choose between a couple of different HTML parsers for a project I am working on, part of which accepts HTML input from the client.

I've built a simple automated test for each one, to see if they fit my needs. I have a large number of real-life HTML fragments to test, but they aren't enough for testing for safety, since they (probably) do not contain any malicious code.

I don't mind reviewing the outputs by hand.
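
For illustration only, a comparison harness of the kind described above might look like the following minimal sketch. The two candidate functions (`escape_everything`, `identity`) are hypothetical stand-ins, not the parsers actually being evaluated; each real parser would be wrapped in a function with the same `str -> str` shape.

```python
import html
from typing import Callable, Dict, List


# Hypothetical stand-ins for the parsers under comparison.
def escape_everything(fragment: str) -> str:
    """Baseline candidate: escapes all markup instead of parsing it."""
    return html.escape(fragment)


def identity(fragment: str) -> str:
    """Deliberately unsafe candidate: returns the input unchanged."""
    return fragment


def run_candidates(
    candidates: Dict[str, Callable[[str], str]],
    fragments: List[str],
) -> List[Dict[str, str]]:
    """Run every fragment through every candidate and collect the outputs.

    Each row holds the input plus one output per candidate, so the
    results can be laid out side by side for review by hand.
    """
    rows = []
    for fragment in fragments:
        row = {"input": fragment}
        for name, sanitize in candidates.items():
            try:
                row[name] = sanitize(fragment)
            except Exception as exc:  # a crash on odd input is a finding too
                row[name] = f"<<error: {exc!r}>>"
        rows.append(row)
    return rows


if __name__ == "__main__":
    samples = ["<b>hi</b>", "<script>alert(1)</script>"]
    table = run_candidates(
        {"escape": escape_everything, "identity": identity}, samples
    )
    for row in table:
        print(row)
```

Recording crashes as results rather than aborting the run keeps one badly malformed fragment from hiding how the candidates behave on the rest.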

My question is, is there a freely available database or list of HTML snippets containing malformed HTML and scripts intended for testing for XSS?


The ha.ckers XSS cheatsheet is pretty comprehensive, and it was the catalyst for me to build a whitelist-based sanitiser into jsoup.
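
To make the whitelist idea concrete, here is a minimal sketch using only Python's standard `html.parser`. It is not jsoup's implementation or API; the tag list, attribute list, and URL prefixes are assumptions chosen for the example, and a toy like this should not be relied on for real input.

```python
from html import escape
from html.parser import HTMLParser

# Assumed policy for this sketch; a real whitelist would be chosen per project.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")
DROP_CONTENT_TAGS = {"script", "style"}  # drop these tags and their text


class WhitelistSanitizer(HTMLParser):
    """Re-emits only whitelisted tags and attributes; escapes all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag, keep its text content
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            if name == "href" and not value.strip().lower().startswith(
                SAFE_URL_PREFIXES
            ):
                continue  # anything that is not an allowed scheme is dropped
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(fragment: str) -> str:
    parser = WhitelistSanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

The point of the whitelist approach is that anything not explicitly permitted (event-handler attributes, unknown tags, unusual URL schemes) is dropped by default, so a new trick from a cheatsheet has to match the allowed set to survive rather than evade a blocklist.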


Google's home page seems to be malformed, so maybe you can use that? See http://validator.w3.org/check?uri=www.google.com&charset=%28detect+automatically%29&doctype=Inline&group=0

http://www.codinghorror.com/blog/2006/11/its-a-malformed-world.html


I built html-sanitizer-testbed for exactly this purpose. It consists of two components:

  1. A suite of tests designed to check the security of an HTML sanitizer. I have collected every tricky case I've been able to find. It includes everything on the ha.ckers.org XSS cheatsheet, plus many other test cases. Over the years I've analyzed dozens of HTML sanitizers (most of them were vulnerable) and added a test case for every security vulnerability I've ever found, so this is a pretty nice collection.

  2. Some test automation functionality, so that you don't need to review the outputs by hand: you can fire up a browser and check whether it appears to have executed any JavaScript in the sanitizer's outputs (if it has, the sanitizer is broken). This part is not 100% reliable and comes with no guarantees whatsoever, so for maximum effectiveness you might still want to review the outputs by hand. However, it has worked pretty well for me so far.
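
The browser-based check in the second component is not reproduced here. As a cheaper first pass before loading anything in a browser, a static scan of the sanitizer's output can flag constructs that commonly lead to script execution. This is a different and weaker technique than the testbed's: the sketch below uses only Python's standard library, the tag and attribute lists are assumptions, and it will miss cases that a real browser would execute.

```python
from html.parser import HTMLParser
from typing import List

# Assumed lists for this sketch; they are not exhaustive.
SCRIPTABLE_TAGS = {"script", "iframe", "object", "embed"}
URL_ATTRS = {"href", "src", "action", "formaction"}
RISKY_SCHEMES = ("javascript:", "vbscript:", "data:")


class SuspicionFinder(HTMLParser):
    """Collects constructs in sanitizer output that deserve a closer look."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.findings: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SCRIPTABLE_TAGS:
            self.findings.append(f"tag <{tag}>")
        for name, value in attrs:
            if name.startswith("on"):
                self.findings.append(f"event handler {name} on <{tag}>")
            # Drop whitespace and control characters, which some browsers
            # ignore inside a URL scheme (e.g. "jav\tascript:").
            compact = "".join(ch for ch in (value or "") if ch > " ").lower()
            if name in URL_ATTRS and compact.startswith(RISKY_SCHEMES):
                self.findings.append(f"risky URL in {name} on <{tag}>")


def find_suspicious(sanitizer_output: str) -> List[str]:
    """Return a list of findings; an empty list means nothing was flagged."""
    finder = SuspicionFinder()
    finder.feed(sanitizer_output)
    finder.close()
    return finder.findings
```

An empty result only means this heuristic found nothing, not that the output is safe, so it works best for deciding which outputs to inspect first in a browser or by hand.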

I welcome feedback and contributions.
