The project I am working on at the moment involves loading a lot of the information that our clients publish into our system so that they can apply metadata to it. All of this published information is structured hierarchically, although it isn't obvious from the printed versions what these hierarchies are. So we have spent the last few weeks moving this information into an XML format that we can load into the CMS.
Some of this work has been done before, so that information was easily exported from another copy of our CMS; however, that is only a small part. What we have been finding is that the only machine-readable versions of much of this data are held in Quark files, and our clients can only offer us Word document exports of these. There is no semantic markup, and the WordML structure involves lots of tables which are used purely for layout. In the end most of this information has had to be cut and pasted by hand.
I have to admit that I am a little shocked by all of this. The information in question is the lifeblood of this organisation; in fact, it is pretty much the only reason that the department we are dealing with exists. Having this data locked into a format that exists strictly for layout and publishing seems absolutely crazy. From what I can see, all editing is done directly in this format. I think this has happened because our clients have always seen the end product, i.e. the printed versions, as sacred, without ever thinking that somewhere in there is pure data, which is what they should actually be concerned about.
The true extent of this problem came to light on a related project with the same client and data set. The printed versions have marginal notes which provide cross-references between parts of the information and we needed to know if there were any reciprocal links in there. Our client couldn't tell us this without checking through all the relevant parts of a printed copy!
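To give a feel for why this matters: once the cross-references live in structured data, the reciprocal-links question becomes a few lines of code rather than a read-through of the printed copy. The sketch below is only an illustration, not what we actually built. The `section` and `xref` element names and the `id` and `target` attributes are invented for the example and are not the client's real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: element and attribute names are invented
# for illustration and do not reflect the real client format.
SAMPLE = """
<document>
  <section id="a"><xref target="b"/><xref target="c"/></section>
  <section id="b"><xref target="a"/></section>
  <section id="c"/>
</document>
"""


def find_links(root):
    """Collect (source_id, target_id) pairs for every cross-reference."""
    links = set()
    for section in root.iter("section"):
        # findall() only looks at direct children, so a reference in a
        # nested section is not also attributed to its parent section.
        for xref in section.findall("xref"):
            links.add((section.get("id"), xref.get("target")))
    return links


def split_by_reciprocity(links):
    """Separate links that have a matching link back from those that do not."""
    reciprocal = {(src, dst) for (src, dst) in links if (dst, src) in links}
    return reciprocal, links - reciprocal


root = ET.fromstring(SAMPLE)
reciprocal, one_way = split_by_reciprocity(find_links(root))
print("reciprocal:", sorted(reciprocal))
print("one-way:", sorted(one_way))
```

In this made-up sample, sections `a` and `b` point at each other, while the link from `a` to `c` has nothing coming back.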
It is easy for us developers to forget that a client's perspective on something can be very different to our own. We had assumed that there would be a way to get at least some of this information in an electronic format that we could begin to use and transform. I think this has been something of a learning curve for our client, and we are helping them to understand the implications. Of course, the great outcome of this project is that they will have these pure data versions in XML. They are now seeing the possible benefits of this change in perspective: from treating the Quark files as their only precious commodity to seeing the underlying data as the more valuable asset.
We have a position open for a Java/XML/XSLT developer at our offices in London (£26-£30K + benefits). If you are interested there is more information available on our website.