Force HTML Tidy to output XML (instead of XHTML), or force XSLTproc to parse XHTML files

Developers https://www.devze.com 2023-04-03 02:19 Source: Internet

I have a large number of HTML files that I need to process with XSLT, using an XML file to choose which HTML files, and what we're doing with them.

I tried:

  1. Use HTML Tidy to convert HTML -> XHTML / XML
  2. Use document(filename) in XSLT to read in particular XHTML/XML files
  3. ...use standard nodeset commands to access e.g. "html/body/*"

This doesn't work, because:

  1. It seems that XSLT (tried: libXSLT/xsltproc ... and Saxon) cannot process XHTML documents as external files (it sees the xhtml DOCTYPE, and refuses to parse it as nodes).

Fine (I thought) ... XHTML is just XML, I just need to put it through HTML Tidy and say:

"output-xml yes ... output-html no ... output-xhtml no"

...but HTML Tidy ignores you if you attempt that, and forces html instead :(. It seems to be hardcoded to only output XML files if the input was XML to begin with.

Any ideas for how to:

  1. Force HTML Tidy to obey the command-line parameters, and set the doctype I asked for
  2. Force XSLTproc to parse xhtml DOCTYPEs as xml
  3. ...some other cunning way that will work?

NB: this has to work on OS X - it's part of a build process for iOS apps. That shouldn't be a big problem, but e.g. any windows-only tools aren't available. I'd like to achieve this with standard open-source cross-platform tools (like tidy, libxslt, etc)


I finally discovered why XSLTproc / Saxon were refusing to parse the files if they were passed-in with a DOCTYPE html:

The DOCTYPE of the external document alters how they interpret the xmlns (namespace) directive. Tidy was declaring (correctly) "xmlns=...the xhtml: namespace" - so all my node-names were ... I don't know: non-existent? ... inside my XSLT. XSLT was just ignoring them, as if they didn't exist - it needed me to provide a compatible mapping to the same namespace

...strangely, if the DOCTYPE was xml, then they happily ignored the xmlns command - or they allowed me to reference nodes by unqualified name. This fooled me into thinking that they were point-blank ignoring the nodesets inside the xhtml DOCTYPE'd version.

So, the "solution" is something like this:

  1. modify your XSLT stylesheet to ALSO declare the XHTML namespace, bound to a prefix such as "xhtml" - NB: this is required so that you can reference the nodes in the external files
  2. write all your XSL match / select / template rules with the "xhtml" prefix on every element name (attributes don't need it: unprefixed attributes are not in any namespace)
  3. let Tidy output whatever it wants: it doesn't matter, it'll Just Work, once you have the namespace support in there

Example code:

  1. Your stylesheet goes from this:

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    

    ...to this:

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xhtml="http://www.w3.org/1999/xhtml">
    
  2. Your select / match / document-import goes from this:

    <xsl:copy-of select="document('html-files/file1.htm')/html/body"/>
    

    ...to this:

    <xsl:copy-of select="document('html-files/file1.htm')/xhtml:html/xhtml:body"/>
    

NB: just to be clear: if you ignore namespaces, then it seems XSLT will work on files that are unDOCTYPED, even if they have a namespace in them. Don't make the mistake I made of thinking your XSLT is correct just because it appears to be :)
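
The same matching rule can be seen outside XSLT. Below is a minimal sketch in Python using only the standard library's xml.etree.ElementTree; the inline document is a hypothetical stand-in for Tidy's XHTML output, not a file from the question. It shows that an unqualified path finds nothing once the elements sit in the XHTML namespace, while a prefixed path bound to the same URI does.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for a file produced by Tidy in XHTML mode.
TIDY_OUTPUT = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Demo</title></head>
  <body><p>Testing</p></body>
</html>"""

root = ET.fromstring(TIDY_OUTPUT)

# Unqualified name: no match, because the element's real name
# is {http://www.w3.org/1999/xhtml}body, not plain "body".
unqualified = root.find("body")

# Qualified name: the prefix itself is arbitrary, only the URI must match.
ns = {"xhtml": XHTML_NS}
qualified = root.find("xhtml:body", ns)
paragraphs = root.findall("xhtml:body/xhtml:p", ns)

print(unqualified)                    # None
print(qualified.tag)                  # {http://www.w3.org/1999/xhtml}body
print([p.text for p in paragraphs])   # ['Testing']
```

The prefix "xhtml" in the mapping plays the same role as xmlns:xhtml on the xsl:stylesheet element: it is a local alias for the namespace URI, and any other prefix bound to that URI would match equally well.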


XHTML is XML (if it is valid).

To get your XHTML processed as XML, you must not serve it with the "text/html" MIME type. Use application/xhtml+xml instead (keep in mind that IE6 cannot render this and will show a download prompt for your site).

In PHP, you serve it as application/xhtml+xml with the header() function.

I think this should do the trick:

header('Content-Type: application/xhtml+xml');

Does this help?


If you run xsltproc --help, among the accepted input flags is a very conspicuous one called --html which supposedly tells xsltproc that:

--html: the input document is(are) an HTML file(s)

Presumably for this to work you must have valid HTML files to begin with, though. So you might want to tidy them up first.


I think the main problem is caused by the DOCTYPE declaration and its XML catalog lookup. You can test this by removing the external entity reference from the input XHTML and seeing whether the processor then handles it correctly.

I would do as follows:

  • Use Tidy with the doctype option set to omit (--doctype omit).
  • Add the Doctype at XSLT side as described here

The main problem is that neither Saxon nor xsltproc has an option to disable external entity resolution. The MSXSL.exe command-line utility supports this with its -xe option.
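
To run the test suggested above (does the processor behave differently once the DOCTYPE is gone?), here is a rough sketch in Python, standard library only. The sample document, the strip_doctype helper and the regular expression are all illustrative assumptions; the pattern only handles a DOCTYPE without an internal subset, which is what Tidy normally emits.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical Tidy output that still carries an XHTML DOCTYPE.
XHTML_WITH_DOCTYPE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body><p>Testing</p></body>
</html>"""

# Matches a DOCTYPE declaration that has no internal subset ([...]).
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*>\s*", re.IGNORECASE)


def strip_doctype(text: str) -> str:
    """Return text with the first DOCTYPE declaration removed."""
    return DOCTYPE_RE.sub("", text, count=1)


stripped = strip_doctype(XHTML_WITH_DOCTYPE)

# The stripped text is still well-formed XML and keeps its namespace.
root = ET.fromstring(stripped)
print(stripped.startswith("<html"))
print(root.tag)
```

Note that removing the DOCTYPE leaves the xmlns declaration in place, so the elements are still in the XHTML namespace afterwards.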


It's been a while, but I remember trying to use HTMLTidy to prep HTML files for XSLT and was disappointed by how easily it gave up while trying to "well form" the HTML. Then I found TagSoup, and was very pleased.

TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

I don't know if you're bound to HTMLTidy, but if not try this: http://home.ccil.org/~cowan/tagsoup/

As an example, here's a bad HTML file:

<body>
  <p>Testing
</body>

And here's the tagsoup command and its output:

~ zyoung$ java -jar /usr/local/tagsoup-1.2.jar --html bad.html 
src: bad.html
<html><body>
  <p>Testing
</p></body></html>

Edit 01

Here is how tagsoup handles DOCTYPEs.

Here's a bad HTML file with a valid DOCTYPE:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<body>
  <p>Testing
</body>
</html>

Here's how tagsoup handles it:

~ zyoung$ java -jar /usr/local/tagsoup-1.2.jar --html bad.html 
src: bad.html
<html><body>
  <p>Testing
</p></body></html>

It isn't until you explicitly pass a DOCTYPE to tagsoup that it attempts to output one:

~ zyoung$ java -jar /usr/local/tagsoup-1.2.jar --html --doctype-public=html bad.html 
src: bad.html
<!DOCTYPE  PUBLIC "html" "">
<html><body>
  <p>Testing
</p></body></html>

I hope this helps,
Zachary
