Friday, November 21, 2008

ODF documents statistics

As Christian mentioned in his post here, the IBM Symphony UX team did some statistical analysis of text documents on the web and presented the results at the conference in Beijing. Since then, I could not sleep: I kept wondering how they did it and how we could do something similar, since their code is not available yet.

So, I just started sniffing around in the XML files of one of my own ODTs and look what I found:

This is a line from the meta.xml file that describes some basic document statistics. It includes the number of tables, images, other objects, pages, paragraphs, words, and even characters.

As a consequence, it is amazingly easy to get these statistics from ODF documents, assuming that every ODF file has this information included. I will bug Svante a bit about that to figure out some more details.

Then, I checked out the page and the opportunities the framework offers. So here is a shot: how about a small and nice piece of Java code that parses the meta.xml files within ODF documents and outputs a Calc spreadsheet with the statistics? I’d love that, and the information would actually help us make particular decisions within the Renaissance project.
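To make the idea a bit more concrete, here is a minimal sketch of what such a tool could look like. It is not the IBM implementation and it does not use the ODF toolkit; it only relies on the JDK's own ZIP and XML classes. The class name OdfStats and its methods are made up for illustration, and it assumes that the statistics sit in a meta:document-statistic element inside meta.xml, which a given file may or may not contain. Instead of a real Calc spreadsheet it prints CSV, which Calc can open.

```java
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of any existing toolkit.
public class OdfStats {
    // Namespace behind the "meta:" prefix in ODF meta.xml files.
    static final String META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0";

    // The statistics mentioned in the post, in the order they become CSV columns.
    static final List<String> COLUMNS = List.of(
            "table-count", "image-count", "object-count", "page-count",
            "paragraph-count", "word-count", "character-count");

    /** Extracts the known counters from a meta.xml stream; empty map if none are found. */
    static Map<String, String> parseStatistics(InputStream metaXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to look up elements by namespace
        Document doc = factory.newDocumentBuilder().parse(metaXml);

        Map<String, String> stats = new TreeMap<>();
        NodeList nodes = doc.getElementsByTagNameNS(META_NS, "document-statistic");
        if (nodes.getLength() == 0) {
            return stats; // assumption: the element may be missing in some files
        }
        Element stat = (Element) nodes.item(0);
        for (String column : COLUMNS) {
            if (stat.hasAttributeNS(META_NS, column)) {
                stats.put(column, stat.getAttributeNS(META_NS, column));
            }
        }
        return stats;
    }

    /** Opens an ODF file as a ZIP archive and reads the statistics from its meta.xml entry. */
    static Map<String, String> readStatistics(Path odfFile) throws Exception {
        try (ZipFile zip = new ZipFile(odfFile.toFile())) {
            ZipEntry entry = zip.getEntry("meta.xml");
            if (entry == null) {
                return new TreeMap<>();
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return parseStatistics(in);
            }
        }
    }

    /** Formats one CSV row: quoted file name, then one value per column (blank if absent). */
    static String toCsvRow(String fileName, Map<String, String> stats) {
        StringBuilder row = new StringBuilder("\"" + fileName.replace("\"", "\"\"") + "\"");
        for (String column : COLUMNS) {
            row.append(',').append(stats.getOrDefault(column, ""));
        }
        return row.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("file," + String.join(",", COLUMNS));
        for (String arg : args) {
            System.out.println(toCsvRow(arg, readStatistics(Paths.get(arg))));
        }
    }
}
```

Running it as "java OdfStats.java report.odt notes.odt > stats.csv" (the file names are placeholders) should give one row per document. Files without the statistics element simply end up with empty cells, so it would also show how many documents actually carry this information.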

So, what do you think of that, any volunteers?



Christoph Noack said...

Hey, that would be great! At the conference, I talked to Max (mod) and he explained that the ODF toolkit would make this possible with relatively low effort. And, in contrast to the original IBM implementation, it should even be possible to access the data/formatting inside the documents...

If the tool were easy to use and distribute, we would also solve the most important problem: we don't know anything about document statistics inside governments or industry (the IBM tool uses available data from the internet). Maybe open-minded people would even analyze their own documents in the corporate network and share their findings (and not the documents themselves...).

Andreas, do you think that this request should be distributed more widely? If yes, would you mind cross-posting this cool request for participation to ux-discuss, the general (dev) list or the ODF toolkit forums? I know you have an account there... ;-)


JZA said...

So here is an Old Old Old idea I will dust off and see what's up.

Now that you have discovered the meta tags, what about having an ODF manager, similar to the way we manage music and pictures?

Exif and ID3 tags help us manage our favorite songs or whatever. With meta.xml we could get some ideas for an ODF jukebox where we could see the dates of our ODF files or which ones we open most.

We get the user to start using that instead of the filesystem, and we bring order into our lives instead of having millions of folders with different names. Along the way, the user learns desktop document management.

Andreas Bartel said...

That was not my intention but who cares, I like your idea. However, I think this part of ODF is standardized. Hence, a great bunch of people would have to agree with both of us. Let me check who's the right person to ask.

Andreas Bartel said...

... it's me again. As promised, I checked who's the right individual to forward your idea to. Bettina Haberer, another UX team member, is a member of a sub TC within OASIS. You could contact her. As an alternative, you might check out this link:

You can directly submit your comments there. However, the way to do it is a bit cumbersome.