I've always been a little perplexed at the popularity of the pdf format, which requires a separate application to read it, cannot easily be edited, and is binary rather than text. The only thing pdf is good for, in my view, is preserving the layout of a document, for printing aesthetic documents without the size overhead of bitmapping the text. Jakob Nielsen agrees with me. There seems to be a perception that printing to pdf is somehow equivalent to printing to paper - it fixes a finished document, for publication.
There are other issues with pdfs which widen the digital divide. So much so that DFID commissioned a study on how this could be minimised.
WHO, too, recognised this problem, and went back and painstakingly converted all their World Health Reports back to 1995 into html. I guess the process was just too expensive to carry through to its logical conclusion.
Another problem is searching a whole library of pdfs. There are proprietary tools for archiving and indexing, but I have been looking for a more open way. Ideally, I want the pdfs converted to html.
However, most pdf2html tools don't give you the kind of html that I like. Look at the source of this example converted page to see what I mean. You couldn't put that in a search index, or edit it, or tidy it up with regular expressions.
Pdf2text is a much simpler process and gives more usable results. However, it doesn't distinguish between the title of the document and the footnotes in terms of importance, it leaves line-breaking hyphens in words, and it puts spaces in the wrong places from time to time. In short, it's helpful but not good. But I have discovered something better.
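As an aside, the line-breaking hyphens and stray spaces can be partly tidied up after extraction with a couple of regular expressions. What follows is only a rough sketch in Python, not part of any pdf2text tool: the function names are my own invention, and it assumes the extracted text keeps one newline per printed line.

```python
import re

# A hyphen at the end of a line, followed by a lowercase letter at the
# start of the next line, is assumed to be a line-breaking hyphen.
_LINE_BREAK_HYPHEN = re.compile(r"(\w)-[ \t]*\n[ \t]*([a-z])")


def join_hyphenated(text: str) -> str:
    """Rejoin words that were split by a hyphen at a line break."""
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", text)


def collapse_spaces(text: str) -> str:
    """Squeeze runs of spaces or tabs into a single space."""
    return re.sub(r"[ \t]{2,}", " ", text)


if __name__ == "__main__":
    sample = "The docu-\nment was con-\n  verted to   text."
    print(collapse_spaces(join_hyphenated(sample)))
```

The weakness is obvious: a genuine compound such as "well-known" that happens to break across lines gets joined into "wellknown", so this is a tidy-up for search indexing rather than a faithful restoration.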
Just as Optical Character Recognition looks at a scan and recognises the letters and words, there is now software which looks at pdfs and deduces the structure of the document, as long as the document is sensibly laid out. The structure of the document is, of course, represented in XML, and the output, via a standard or custom dtd, is in whatever format you want. It belongs to a company called Exegenix.
This means our premature enthusiasm for the format is no longer irreversible. All those pdfs can be converted, en masse, to usable, indexable, searchable, web-browsable formats. Can't wait.
Posted November 3rd, 2007 by matslats

