toxi.in.process

Friday, April 21, 2006

Quality data for visualizationists

Quite a few Processing users (incl. myself) are working professionally and/or experimentally with (data) visualizations. Interesting and good work in this field is not just down to the ingenuity of its authors but also largely dependent on quality data sources. Often those can be quite hard to come by, especially if you're reliant on free data sources. Creating and preparing your own data can turn into a big stumbling block, since you're suddenly confronted with major technical issues around retrieval, parsing, storage, transformation, putting bits of data into relationships etc. No wonder a lot of amateur experiments are based around the data readily available from, for example, Flickr, Technorati or del.icio.us.

As the blogging and Open Source movements have shown, innovation in any domain can and does happen bottom-up, and amateurs play a major role in that. As Paul Graham writes:
There's a name for people who work for the love of it: amateurs. The word now has such bad connotations that we forget its etymology, though it's staring us in the face. "Amateur" was originally rather a complimentary word. But the thing to be in the twentieth century was professional, which amateurs, by definition, are not.

That's why the business world was so surprised by one lesson from open source: that people working for love often surpass those working for money.
Related anecdote about amateur hardship: When working on base26 two years ago, I spent about 5 long nights going page by page through the Oxford Dictionary, manually filtering four-letter words and noting down their usage types, all for lack of an electronic version with this information.

On a large scale, good quality Open data is still generally rare, but it is steadily growing across various domains. The success of XML-based standard data formats like RSS has been playing another important role on the road to liberated and readily usable data, yet the inherent problem with these formats is their lack of (direct) support for multi-dimensional and multi-directional data relationships.

So with these (amongst many other things) in mind, researchers at the Austrian company System One have today announced the release of a snapshot of the entire English version of Wikipedia, converted into common flavours of RDF (RDF/XML, Ntriples and Turtle) and licensed under the GFDL. Wikipedia3, as the dataset is called, will be updated monthly and currently counts approx. 47 million triples (metadata statements about Wikipedia articles); so far this only covers the structural information of each article, like internal link and category relationships. A separate dataset containing the actual annotated articles is planned, as is support for inter-wiki relationships and a SPARQL interface for processing remote queries.
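
For the curious, here's a rough sketch of how one might start poking at such a dump from Java (the language Processing itself sits on top of), using the open source Jena RDF library. The file name is purely a placeholder and not a detail of the actual release:

import com.hp.hpl.jena.rdf.model.*;
import java.io.FileInputStream;

public class TripleWalk {
    public static void main(String[] args) throws Exception {
        // an empty in-memory RDF graph
        Model model = ModelFactory.createDefaultModel();
        // parse an N-Triples extract into it ("wikipedia3-sample.nt" is a made-up name)
        model.read(new FileInputStream("wikipedia3-sample.nt"), null, "N-TRIPLE");

        // every statement is a (subject, predicate, object) triple, e.g.
        // article -> linksTo -> article, or article -> inCategory -> category
        StmtIterator it = model.listStatements();
        while (it.hasNext()) {
            Statement st = it.nextStatement();
            System.out.println(st.getSubject() + " " + st.getPredicate() + " " + st.getObject());
        }
        it.close();
    }
}

Loading 47 million triples into memory like this obviously won't fly; for the full dataset you'd want a persistent, database-backed store or the planned SPARQL interface, but for a small extract it's enough to get a feel for the triple structure.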

This is a pretty amazing endeavour and will hopefully provide enough incentive for people to pick up and learn to use RDF technologies as a flexible and powerful tool for infoviz purposes as well.