Data.gov and lessons from the open-source world
A previous blog post talked about what data government departments should be releasing. In this post I'd like to talk about how to release it.
One approach is to centralise things. For example, the US Government has established the Data.gov web site with the stated purpose of “increasing public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government [of the US]”. The UK Government is currently considering a similar approach. The goals are commendable, but in a sense, Data.gov adopts a traditional “Web 1.0” approach to the challenge of increasing access to public sector information (PSI). To use an analogy drawn from Eric Raymond’s “The Cathedral and the Bazaar”, Data.gov can be thought of as a data “cathedral”, which is to say a huge, ambitious, centralised undertaking.
Another approach is decentralised and modelled on a “bazaar”. In this approach, government web sites scattered around the Internet would utilise Web 2.0 technologies to publish data and metadata in both human-readable and machine-readable formats.
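To make this concrete, here is a minimal sketch of what a single dataset published in the bazaar style might look like: a human-readable page, a machine-readable CSV download, and a small metadata descriptor alongside them. The host name, URLs, and field names below are invented for illustration rather than drawn from any particular standard:

```json
{
  "title": "Road traffic counts, 2008",
  "description": "Annual average daily traffic counts by road segment.",
  "publisher": "Department for Transport",
  "licence": "Creative Commons Attribution",
  "modified": "2009-06-30",
  "distributions": [
    {
      "format": "text/csv",
      "downloadURL": "http://transport.example.gov/data/traffic-counts-2008.csv"
    },
    {
      "format": "text/html",
      "landingPage": "http://transport.example.gov/data/traffic-counts-2008.html"
    }
  ]
}
```

With something like this in place, a mash-up developer, a journalist, or another agency could discover and reuse the data without waiting for a central portal to catalogue it.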
In “The Cathedral and the Bazaar” Eric Raymond was of course describing software development. However, if you replace “code” with “data” and “developer” with “author”, the same principles apply, namely:
- Users should be treated as co-authors, because having more co-authors increases the rate at which the data evolves and improves, i.e., user-generated content (UGC) plays a key role.
- Release as early as possible, because this increases one’s chances of finding co-authors early and stimulates innovative uses of the data.
- New data should be incorporated frequently, because, as above, this maximises the rate of innovation and avoids the cost of “big bang” style integration.
- There should be at least two versions of each data set: a newer version for early adopters, which is known to be of lower quality, and an older, stable version of higher quality.
- Data sets should be highly modular, because this allows for parallel and incremental development.
The bazaar approach is flexible and economical and supports evolutionary change. It enables different government agencies to move at different speeds to open up public sector information, one data set at a time.
What about data discoverability, you may ask? Doesn’t a central portal make data easier to find? Well, users don’t expect to find all their news and entertainment at a single web site, so why would they expect to find all of their data at a single web site? The trick is to ensure that government web sites are discoverable and searchable, both technically (through an open robots.txt, sitemaps, and so on) and legally (through friendly copyright provisions such as Creative Commons licences).
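As a concrete sketch of the technical side, a department’s site could open itself to crawlers and advertise its data pages via a sitemap. The host name and paths below are invented for illustration:

```
# robots.txt — an empty Disallow permits crawlers to index everything
User-agent: *
Disallow:

Sitemap: http://data.example.gov/sitemap.xml
```

And a matching sitemap listing the dataset pages:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://data.example.gov/datasets/traffic-counts-2008</loc>
    <lastmod>2009-06-30</lastmod>
  </url>
</urlset>
```

These two small files are enough for any general-purpose search engine to find and index the data pages, no central registry required.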
Of course it’s not an either/or scenario. Data cathedrals can coexist with data bazaars, and perhaps different data sets are best served in different ways. A related question is whether PSI platforms should be government operated at all, or instead left to the private sector or non-profit organisations.
What do you think? Should government departments embrace some of the principles of the open-source world in order to liberate public sector information?