Conversion of `Word Processor' formats into Latex

Correction (2011-03): try LibreOffice (a now-dominant fork).

Something I read somewhere (can't remember the source).

Q: "What does a word-processor do?"
A: "Well, you know what a food-processor does to food ..."

On rare attempts at using these `word-processor' things I think of this explanation. When I see a nice (content-wise) piece of work horribly typeset, I feel tempted to make it visually tolerable before reading it. I've tried this with some long works that I've acquired in word-processor formats (typically msword-doc) but also with some huge html and plaintext files. There are many problems with conversion, generally because the source contains no, or inconsistent, information about the structure.

Even a work with nothing but simple English text has its problems: if there are titles or other formatting then they're probably applied unclearly and inconsistently. This isn't really a fault of word-processors, except in that they emphasise appearance (and `just get going: the interface makes it all obvious') to the point that many intelligent users of them have no idea about setting up formats for different levels of text, heading etc. and then applying these to mark the document structure, thereby making it easy to change its style, automatically number and list sections, or convert it to other formats.

Anything with equations becomes much worse: perhaps these are from m$ equation editor N. Or N+1. Or N.1. Or perhaps an (also proprietary) add-on, version M. Or M+2. And so on. Never mind that some weird symbols have probably been included from a non-standard, proprietary set of add-on fonts. I've not yet met a form of word-processor equation that is reliably converted to Latex source when it contains more than a few subscripts and perhaps Greek letters.

Then figures: perhaps they've come as poor quality screen-grabs or at least some non-vector format, even if quite high resolution: these are typical things to find in word-processor documents, and they really don't look too out of place there. But they probably will when surrounded by text from Latex. Some charts may require a proprietary program to interpret them.

I've looked up a few converters for going from m$word to Latex. I won't waste time listing them. They often need to be run on an m$ platform with the full corresponding office-suite and `net' `framework'. The number of errors these things can encounter in relatively simple documents is quite impressive. To get EPS output in one case, a further printer driver had to be installed, and the figures were hopelessly messed up anyway, except when using PNG. Equations and diagrams became a dog's breakfast. The mess that was made of the Latex source, and the low probability of its even compiling, are tributes to the quality of these sorts of converters. (I fancy there will be costly proprietary converters for use by publishers who allow m$word submissions, but I'm not interested in using or advertising these: I want to help other people do a quick conversion of one or two things, and to give a plug for OOo on the way!)

After wasting time on these programs and on having borrowed a laptop with m$-windos on it (slow, slow, updates, restart now, `you might like to know that I've detected this or that', etc.), I thought of OpenOffice. I'd initially spurned it, thinking that it must be better to convert from m$word format using the m$word libraries to do the interpretation. But I opened my example file in OpenOffice (about 80 pages, 20 figures, 4 tables, poor text-formatting structure: it took about 20s to open) and saw that the proprietary and inefficient .doc-format file had been read very competently. I then saved to the native format, .odt -- opening this took only a second or so, as did subsequent openings. Under `File' I used `Export' to export to Latex. (To do this, it was necessary to enable Java: Tools / Options / OpenOffice.org, Java submenu, then wait a few seconds for the available JVMs to show, and pick a Sun 1.5.x version for the OpenOffice.org-3.0.0 -- it didn't like version 1.6.x.)
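For more than one or two files, the open-and-save-as-.odt step might be scripted rather than clicked through. The following is only a sketch, assuming a recent LibreOffice whose `soffice' binary is on the PATH and accepts the --headless and --convert-to options; OpenOffice.org 3.0.0 as described above may well not. The function names and the example file name are made up for illustration, and the Latex export itself is left as the menu step already described.

```python
import subprocess


def doc_to_odt_command(doc_path, out_dir, soffice="soffice"):
    """Build the command line for a headless .doc -> .odt conversion.

    Assumes a LibreOffice-style `soffice` binary; the name is configurable
    because it differs between installations.
    """
    return [
        soffice,
        "--headless",            # no GUI
        "--convert-to", "odt",   # target format, chosen by file extension
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_odt(doc_path, out_dir, soffice="soffice"):
    """Run the conversion; raises CalledProcessError if soffice fails."""
    subprocess.run(doc_to_odt_command(doc_path, out_dir, soffice), check=True)
```

Calling convert_doc_to_odt("example.doc", "converted") would then be expected to leave converted/example.odt, ready to be opened and exported by hand.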

The result didn't include non-inline equations from one of the `Mathtype' programs, though it pointed out their absence in the text. The result also didn't include the images, but these are easily extracted from the .odt saved version of the word-processor file by opening the file as a zip archive (which it actually is) and checking the Pictures/ subdirectory. The important point was that although this was highly impractical for a true conversion that would allow intervention-free compilation into a good-looking document, it beat the other methods at producing (near)-compilable Latex code that preserved a good semblance of the formatting but without doing so in a crazy way. It was also much easier and quicker. So give OOo a try before wasting time on seemingly more appropriate `application specific' software.
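The image extraction mentioned above needs nothing beyond a zip reader. Here is a minimal sketch in Python using the standard zipfile module; the function name and the file and directory names in the usage note are hypothetical, and it assumes only what is stated above: that the .odt is a zip archive with the images under Pictures/.

```python
import zipfile
from pathlib import Path


def extract_pictures(odt_path, out_dir):
    """Copy every file under Pictures/ in an .odt archive into out_dir.

    Returns the list of paths written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(odt_path) as archive:
        for name in archive.namelist():
            # Skip anything outside Pictures/ and any bare directory entry.
            if not name.startswith("Pictures/") or name.endswith("/"):
                continue
            # Flatten: keep only the file name, not the archive path.
            target = out / Path(name).name
            target.write_bytes(archive.read(name))
            extracted.append(target)
    return extracted
```

For example, extract_pictures("example.odt", "figures") would put the images into figures/, from where they can be converted to whatever format the Latex toolchain wants.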

Page started: 2010-01-12
Last change: 2011-04-01