My Life in Bits


Microsoft Word to HTML Converter

Overview

When an intranet was developed at work, many documents that had been distributed on paper or by email were moved to the intranet server (Plone). Almost without exception, these documents were authored in Microsoft Word. Given the work culture, it was unrealistic to try to move my colleagues to an application better suited to authoring HTML. It also made sense for many of the reports generated at work to be written in Microsoft Word and converted to HTML only when they were posted on the intranet.

Not surprisingly, the File > Save As... > webpage (htm, html) option in Microsoft Word produced poor results. The Office 2000 HTML Filter 2.0 from Microsoft did not get us much further. Non-printing characters kept making their way into the final HTML, formatting was set through inline styles (an external stylesheet was preferred), and the document's revision history was left in the HTML.

A Python script for cleaning Microsoft Word HTML was found in the Python Cookbook. It was then modified, hacked up, and rewritten to suit our ongoing needs.

Problem Background

A simple Microsoft Word to HTML converter was needed. The software had to be usable by employees who call for help when the internet is down (the intarweb is broken, could you please put a backup on a floppy for me so I can keep working?). The output HTML needed to be free of nearly all inline styles, document history text, and non-printing characters. The converter needed to handle image files cleanly, organizing them into a logical arrangement and deleting unused, Microsoft-specific files. For use in a corporate setting, the converter had to be easy to deploy and upgrade. Additionally, it was hoped that the converter could easily be tailored to specific applications and document types.

Solution

A client-side Python script (word2html.py) was written to automate MS Word and then post-process the resulting HTML. A client-side approach was taken because, with COM objects in Word documents, the client computer is the most reliable place for document conversion. In our implementation, the word2html.py script was 'compiled' into an executable. This executable was placed on a Windows share and a link to it was distributed and placed on the desktop of client computers. When files were dragged onto the link, the program was loaded across the network onto the client computer and executed. The converted files then appeared on the client computer's desktop in a folder named HTML. Each document's images were placed in their own folder within a common images directory (HTML\images). This folder is named FILE_NAME_images (where FILE_NAME is the name of the file being converted).
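The folder layout described above can be sketched as a small path helper. This is only an illustration in modern Python, not code from word2html.py: the function name output_paths is hypothetical, and it assumes the image folder is named after the source file's name without its extension.

```python
from pathlib import PureWindowsPath


def output_paths(source_file, desktop):
    """Return (html_path, image_dir) for one converted document.

    Hypothetical helper: mirrors the layout described in the text,
    Desktop\\HTML for pages and Desktop\\HTML\\images\\FILE_NAME_images
    for that document's pictures.
    """
    source = PureWindowsPath(source_file)
    html_root = PureWindowsPath(desktop) / "HTML"
    # Assumption: FILE_NAME is the source name without its extension.
    image_dir = html_root / "images" / (source.stem + "_images")
    html_path = html_root / (source.stem + ".htm")
    return html_path, image_dir


if __name__ == "__main__":
    page, images = output_paths(r"C:\docs\report.doc", r"C:\Users\me\Desktop")
    print(page)
    print(images)
```

PureWindowsPath is used so the sketch produces Windows-style paths even when it is run on another operating system.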

In our specific application of converting files for a Plone CMS server, a document subclass was written in the word2html.py script. This subclass automatically fills in metadata used by the Plone CMS (document description, effective date, etc.). Another document subclass is used for a specific type of file. When this type of file is encountered, the user is given the option to have the files automatically uploaded to the intranet server. This step greatly helps people who do not understand the concept of markup language documents that have separate image files.

All of the MS-specific formatting and inline styles are stripped out with regular expressions, leaving behind just the HTML tags. A number of non-standard characters are also replaced with their standard HTML encoding (e.g. \xb7 is mapped to &#149;, the • symbol). Empty paragraphs and the like are also removed. The converted file is saved with a lowercase name that contains no blanks: spaces are mapped to underscores, so 'my homepage.doc' would be saved as my_homepage.htm. Image URLs are rewritten and unneeded files produced by Microsoft Word are deleted.
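To make the cleanup steps concrete, here is a minimal sketch of regex-based stripping, character replacement, and file naming. The patterns and the names clean_word_html and html_file_name are assumptions written for this page in modern Python; the expressions in word2html.py itself are more extensive and may differ. Only the \xb7 mapping and the naming rule come from the description above.

```python
import re

# Hypothetical patterns, not the ones used by word2html.py.
CONDITIONAL_COMMENTS = re.compile(
    r"<!--\[if.*?<!\[endif\]-->", re.IGNORECASE | re.DOTALL)
OFFICE_TAGS = re.compile(r"</?(?:o|v|w|st1):[^>]*>", re.IGNORECASE)
INLINE_ATTRS = re.compile(
    r"\s+(?:style|class|lang)\s*=\s*(?:\"[^\"]*\"|'[^']*'|[^\s>]+)",
    re.IGNORECASE)
EMPTY_PARAGRAPHS = re.compile(r"<p>(?:\s|&nbsp;)*</p>", re.IGNORECASE)

# The one mapping mentioned in the text; a real table would be longer.
CHAR_MAP = {"\xb7": "&#149;"}


def clean_word_html(html):
    """Strip Word-specific markup, leaving plain HTML tags behind."""
    html = CONDITIONAL_COMMENTS.sub("", html)
    html = OFFICE_TAGS.sub("", html)
    html = INLINE_ATTRS.sub("", html)
    for char, entity in CHAR_MAP.items():
        html = html.replace(char, entity)
    # Run last, so paragraphs emptied by the earlier steps are removed too.
    return EMPTY_PARAGRAPHS.sub("", html)


def html_file_name(doc_name):
    """Lowercase the name, map spaces to underscores, and use .htm."""
    stem = doc_name.rsplit(".", 1)[0]
    return stem.lower().replace(" ", "_") + ".htm"


if __name__ == "__main__":
    dirty = ('<p class=MsoNormal style="margin:0in"><o:p></o:p></p>'
             "<p class=MsoNormal>\xb7 Item</p>")
    print(clean_word_html(dirty))
    print(html_file_name("my homepage.doc"))
```

A regex approach like this is blunt: the attribute pattern would also remove text such as class=foo if it appeared in the body of a document, which is acceptable for office reports but worth knowing about.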

In addition, the basic document conversion class is easily sub-classed, allowing for specific document conversions to be written. These specific document conversion classes can be mapped to the files being converted through a file name checker that maps file names to document conversion classes. This file name checker also uses regular expressions.
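A file name checker of this kind might look roughly like the following. Every class name and pattern here (Document, PloneDocument, WeeklyReport, FILE_NAME_RULES, document_class_for) is hypothetical and chosen for illustration; read word2html.py for the actual classes and expressions.

```python
import re


class Document:
    """Base conversion class; subclasses override behaviour as needed."""

    def __init__(self, path):
        self.path = path

    def metadata(self):
        return {}


class PloneDocument(Document):
    """Hypothetical subclass that supplies Plone metadata fields."""

    def metadata(self):
        return {"description": "", "effective_date": ""}


class WeeklyReport(PloneDocument):
    """Hypothetical subclass for one specific type of report."""


# Ordered rules: the first pattern that matches the file name wins,
# so specific patterns must come before general ones.
FILE_NAME_RULES = [
    (re.compile(r"^weekly_report_.*\.doc$", re.IGNORECASE), WeeklyReport),
    (re.compile(r".*\.doc$", re.IGNORECASE), PloneDocument),
]


def document_class_for(file_name):
    """Return the conversion class that should handle file_name."""
    for pattern, document_class in FILE_NAME_RULES:
        if pattern.match(file_name):
            return document_class
    return Document


if __name__ == "__main__":
    for name in ("Weekly_Report_12.doc", "memo.doc", "notes.txt"):
        print(name, "->", document_class_for(name).__name__)
```

Keeping the rules in a plain ordered list means a new document type needs only a new subclass and one extra line in the table.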

Also, if you really want, it should be easy to configure the word2html.py script to automatically upload certain documents to an FTP server. This option is turned off by default, but perusing the source code for AUTO_FTP should get you there very quickly. It is used this way with our intranet site.

Note: this is only a general description of the features in word2html.py. Be sure to read the source code to get a full understanding of what is going on in the document conversion process.
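As a rough idea of what an upload step guarded by such a switch could look like, here is a sketch built on Python's standard ftplib module. Only the name AUTO_FTP comes from word2html.py; the function upload_if_enabled, its parameters, and treating the flag as a boolean are assumptions, and the script's own upload code may be organised quite differently.

```python
import ftplib
import os

AUTO_FTP = False  # Assumption: a simple boolean switch, off by default.


def upload_if_enabled(paths, host, user, password,
                      enabled=AUTO_FTP, ftp_factory=ftplib.FTP):
    """Upload each file over FTP when enabled; return the names sent.

    ftp_factory exists only so the function can be exercised without a
    real server; by default it is ftplib.FTP.
    """
    if not enabled:
        return []
    sent = []
    ftp = ftp_factory(host, user, password)
    try:
        for path in paths:
            name = os.path.basename(path)
            with open(path, "rb") as handle:
                ftp.storbinary("STOR " + name, handle)
            sent.append(name)
    finally:
        ftp.quit()
    return sent
```

With the flag left off the function returns an empty list and never opens a connection, which matches the default behaviour described above.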

Sample Results

Here is a short excerpt of a file that was converted both with Microsoft Word's save as HTML feature and through the word2html.py script. The formatting in the example on the right reflects the standard Plone style sheet.

[Image, left: HTML directly from Microsoft Word in our Plone site.]
[Image, right: HTML converted through word2html.py.]

Installation

If you have Python installed, there is little that you need to do in order to install this script. However, the script can be made much more functional through one of two hacks that allow it to accept dragged and dropped files, and even dragged and dropped directories of files, for document conversion.

  1. Convert this script to a stand-alone executable. I have had the best success with Pyinstaller (http://pyinstaller.hpcf.upr.edu/). MS Windows will pass dragged and dropped files as a list of pathnames to executable programs, allowing for non-command-line operation of the word2html.py script. Even better, you can then distribute it to computers without Python installed. Alternatively, you can place it on a network share and distribute a link; the result is the same, but deployment and upgrades are easier.
  2. Make your Python file be recognized as an executable by Windows (requires a small registry hack). You can download the appropriate registry keys from http://www.japikse.com/resources/. Note: this approach only passes short pathnames to the converter, mangling the document's title in some instances (this might be different on WinXP; I'm presently working on Win2k).
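With either hack, the dropped files and folders arrive as command-line arguments. The sketch below shows one way to turn that list into the documents to convert; it is an illustration in modern Python, and the helper name collect_documents and the choice to accept only .doc files are assumptions rather than the behaviour of word2html.py.

```python
import os
import sys


def collect_documents(paths, extensions=(".doc",)):
    """Expand dropped files and folders into a flat list of documents."""
    found = []
    for path in paths:
        if os.path.isdir(path):
            # A dropped directory: walk it and keep matching files.
            for root, _dirs, files in os.walk(path):
                for name in sorted(files):
                    if name.lower().endswith(extensions):
                        found.append(os.path.join(root, name))
        elif path.lower().endswith(extensions):
            found.append(path)
    return found


if __name__ == "__main__":
    # Windows passes each dropped item as one argument after the program name.
    for document in collect_documents(sys.argv[1:]):
        print(document)
```

The same script also works from a command prompt, since a dropped file and a typed path look identical in sys.argv.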

Downloads

  1. Word2html Converter (word2html.py)
  2. Registry Hack to allow MS Windows to pass file names as a list to python scripts (optional)
  3. Pyinstaller (optional)
