Why are the blue quote lines not properly maintained when converting RTF to HTML?

Developer https://www.devze.com 2023-04-10 22:39 Source: the web
I saved an Outlook e-mail reply I was working on as an RTF document, which I've uploaded.

What I'd like to do is convert this RTF document to HTML. I've tried various different means - LibreOffice, various conversion utilities, and of course Microsoft Word. Most of the markup is converted fine, but there seems to be something 'magical' about the blue quote lines down the left. I just can't get them to be accurately converted.

Most conversion utilities just drop them altogether. As for Microsoft Word: when I open the file initially, it looks fine (inline replies have no blue quote line, quoted text does). However, when I save it to HTML in Word and then open that HTML file, the blue quote line is retained up until the first reply ("Indeed it is."), and after that it disappears. Why are the remaining parts of the blue quote line being destroyed in the conversion process, and how can I get them to stay there?

By the way, the exact same problem happens if I instead save the Outlook e-mail in DOCX format, open that in Word, and save it as HTML. There seems to be something proprietary and/or esoteric about the way those quote lines are implemented. See below for screenshots of what it should look like (i.e. after I initially open it in Word), and what it does look like (i.e. after it's been saved to HTML format).


Should look like:

[Screenshot: the document as first opened in Word, with the blue quote line beside the quoted text]


Does look like:

[Screenshot: the document after saving to HTML, with the blue quote line missing after the first reply]


OK, I've been experimenting with the DOCX version of this saved e-mail (I saved it in both RTF and DOCX formats), and I've found and fixed the problem there. I'm guessing the same problem made its way into the RTF version of the file: perhaps Microsoft implements the blue quoteline in RTF with a proprietary extension that stores the same extra styling data the DOCX would hold. That would explain why I lose the quoteline when I open the RTF in anything other than MS Word. As RTF is a rather ugly format and I find DOCX a lot easier to work with, I'll describe my DOCX fix below.

The problem with the DOCX was this: Word defines a bunch of paragraphs in the document.xml part of an Outlook-format document package, links some of them to divIds, and then defines a separate websettings.xml part to go along with it. If you break up the blue quoteline in Outlook by pressing Ctrl+Q, as I did to create this DOCX, Word tags every paragraph that should carry the blue quoteline with the same divId, and defines that one divId just once in websettings.xml. So you get something like this in document.xml (I've formatted it a bit more nicely than the one long string you get from MS Word):

<w:p w:rsidR="00ED60D7" w:rsidRPr="007B768D" w:rsidRDefault="00ED60D7" w:rsidP="007B768D">
    <w:pPr>
        <w:divId w:val="1800686860"/>
    </w:pPr>
    <w:r w:rsidRPr="007B768D">
        <w:t>Let's do some inline quoting when replying to it.</w:t>
    </w:r>
</w:p>

[...]

<w:p w:rsidR="00ED60D7" w:rsidRPr="007B768D" w:rsidRDefault="00ED60D7" w:rsidP="007B768D">
    <w:pPr>
        <w:divId w:val="1800686860"/>
    </w:pPr>
    <w:r w:rsidRPr="007B768D">
        <w:t>Best regards,</w:t>
    </w:r>
</w:p>

... and something like this in websettings.xml (formatting made prettier again):

<w:div w:id="1800686860">
    <w:marLeft w:val="0"/>
    <w:marRight w:val="0"/>
    <w:marTop w:val="0"/>
    <w:marBottom w:val="0"/>
    <w:divBdr>
        <w:top w:val="none" w:sz="0" w:space="0" w:color="auto"/>
        <w:left w:val="single" w:sz="12" w:space="4" w:color="0000FF"/>
        <w:bottom w:val="none" w:sz="0" w:space="0" w:color="auto"/>
        <w:right w:val="none" w:sz="0" w:space="0" w:color="auto"/>
    </w:divBdr>
    <w:divsChild>
        <w:div w:id="1800686861">
            <w:marLeft w:val="0"/>
            <w:marRight w:val="0"/>
            <w:marTop w:val="0"/>
            <w:marBottom w:val="0"/>
            <w:divBdr>
                <w:top w:val="none" w:sz="0" w:space="0" w:color="auto"/>
                <w:left w:val="none" w:sz="0" w:space="0" w:color="auto"/>
                <w:bottom w:val="none" w:sz="0" w:space="0" w:color="auto"/>
                <w:right w:val="none" w:sz="0" w:space="0" w:color="auto"/>
            </w:divBdr>
            <w:divsChild>
                <w:div w:id="1800686862">
                    <w:marLeft w:val="0"/>
                    <w:marRight w:val="0"/>
                    <w:marTop w:val="0"/>
                    <w:marBottom w:val="0"/>
                    <w:divBdr>
                        <w:top w:val="single" w:sz="8" w:space="3" w:color="B5C4DF"/>
                        <w:left w:val="none" w:sz="0" w:space="0" w:color="auto"/>
                        <w:bottom w:val="none" w:sz="0" w:space="0" w:color="auto"/>
                        <w:right w:val="none" w:sz="0" w:space="0" w:color="auto"/>
                    </w:divBdr>
                </w:div>
            </w:divsChild>
        </w:div>
    </w:divsChild>
</w:div>

So, the one w:div defined in websettings.xml is being referenced multiple times in document.xml. Now, although this seems to work fine when you open the file as a DOCX in MS Word, it becomes a major problem when you want to convert the document to HTML. It looks like an XSLT transformation is being applied to document.xml, and because in XML there should only ever be a single element in a document with a particular ID, the transformation only applies the websettings.xml styling to the first paragraph in document.xml with a divId of 1800686860. In my example, that happens to be the paragraph containing the header information and first line ("From: Joe Bloggs [...] This is an initial e-mail.") The remaining paragraphs with that divId DON'T receive the styling in websettings.xml.

Because it's the styling for a divId of 1800686860 in websettings.xml that causes the blue quoteline to appear on the left, the remaining paragraphs that we want to receive the quoteline don't receive it, because the styling isn't applied to any of the remaining paragraphs! In my opinion this is a nasty bug in MS Word - that it allows itself to generate XML like this that causes a broken HTML transform.
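
To check whether a given document.xml has this problem, the shared divIds can be listed with a few lines of Python. This is only a sketch: the function name and the cut-down sample document are made up for illustration, and reading document.xml out of the DOCX package is left out.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML main namespace (the "w:" prefix in the snippets above).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def duplicate_div_ids(document_xml: str) -> dict[str, int]:
    """Return {divId: count} for every divId referenced more than once."""
    root = ET.fromstring(document_xml)
    counts = Counter(
        el.get(f"{{{W_NS}}}val") for el in root.iter(f"{{{W_NS}}}divId")
    )
    return {div_id: n for div_id, n in counts.items() if n > 1}


# Hypothetical, heavily trimmed stand-in for a real document.xml:
# two quoted paragraphs sharing one divId, with an inline reply between them.
SAMPLE = f"""<w:document xmlns:w="{W_NS}"><w:body>
<w:p><w:pPr><w:divId w:val="1800686860"/></w:pPr><w:r><w:t>quoted</w:t></w:r></w:p>
<w:p><w:r><w:t>Indeed it is.</w:t></w:r></w:p>
<w:p><w:pPr><w:divId w:val="1800686860"/></w:pPr><w:r><w:t>quoted</w:t></w:r></w:p>
</w:body></w:document>"""

print(duplicate_div_ids(SAMPLE))  # {'1800686860': 2}
```

Any divId reported here is one whose later paragraphs would, on the theory above, lose their quoteline in the HTML export.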

The fix? Find all paragraphs in document.xml that share a divId and make a note of them. Then, for each shared divId, create one copy of its w:div element in websettings.xml for every duplicate reference in document.xml, and give each copy a new, unique ID. Then change each duplicate reference in document.xml to point at one of the copies, so that no two paragraphs share a divId. Once those changes are made (so each paragraph is genuinely linked to a separate, unique w:div in websettings.xml), and you save the modified DOCX as an HTML file in Word... it works! The generated HTML file looks pretty much identical to the DOCX, blue quotelines included.
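
The manual steps above could be automated roughly as follows. Treat this as a sketch under assumptions, not a tested tool: the function name and sample XML are invented for illustration, unpacking and repacking the DOCX (a ZIP package) is omitted, and renumbering the nested w:div elements inside each copy is an extra precaution not mentioned in the steps above. Real Word files also declare further namespaces, which ElementTree may rewrite unless they are registered too.

```python
import copy
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"
ET.register_namespace("w", W_NS)  # keep the "w:" prefix when serialising


def split_shared_divs(document_xml: str, websettings_xml: str) -> tuple[str, str]:
    """Give every repeated divId reference its own copy of the w:div."""
    doc = ET.fromstring(document_xml)
    web = ET.fromstring(websettings_xml)

    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in web.iter() for child in parent}
    div_by_id = {d.get(W + "id"): d for d in web.iter(W + "div")}
    next_id = max((int(i) for i in div_by_id), default=0) + 1

    seen = set()
    for ref in doc.iter(W + "divId"):
        old_id = ref.get(W + "val")
        if old_id not in seen or old_id not in div_by_id:
            seen.add(old_id)  # first use keeps the original w:div
            continue
        clone = copy.deepcopy(div_by_id[old_id])
        # Assumption: nested divs in the copy get fresh IDs as well,
        # so websettings.xml itself ends up with no duplicate IDs.
        for d in clone.iter(W + "div"):
            d.set(W + "id", str(next_id))
            next_id += 1
        parent_of[div_by_id[old_id]].append(clone)
        ref.set(W + "val", clone.get(W + "id"))

    return (ET.tostring(doc, encoding="unicode"),
            ET.tostring(web, encoding="unicode"))


# Hypothetical cut-down inputs: three paragraphs sharing one divId.
DOC = f"""<w:document xmlns:w="{W_NS}"><w:body>
<w:p><w:pPr><w:divId w:val="1800686860"/></w:pPr></w:p>
<w:p><w:pPr><w:divId w:val="1800686860"/></w:pPr></w:p>
<w:p><w:pPr><w:divId w:val="1800686860"/></w:pPr></w:p>
</w:body></w:document>"""
WEB = f"""<w:webSettings xmlns:w="{W_NS}"><w:divs>
<w:div w:id="1800686860"><w:divsChild>
<w:div w:id="1800686861"/></w:divsChild></w:div>
</w:divs></w:webSettings>"""

new_doc, new_web = split_shared_divs(DOC, WEB)
print([e.get(W + "val") for e in ET.fromstring(new_doc).iter(W + "divId")])
```

The first paragraph keeps the original ID and each later one is repointed at its own copy, which mirrors the hand edit described above.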
