Posted on Fri 18 October 2013

Deliver and receive Open Data

This week I attended a hack event about Open Data organized by the city of Zürich. It was nice to see around 70 people getting excited about mostly spatial data provided by the city and the canton of Zürich. Open Data is currently a hot topic, especially in Switzerland: last month, there was the Open Knowledge conference in Geneva, which, unfortunately, I did not have the opportunity to attend.

One aspect of Open Data is the question of how to deliver and receive the data. I like to think of the use of Open Government Data (or Open Data) as using layers of services1:

Presentation (i.e. Websites, Apps) > Aggregated Services ("topics") > Feature Level Services > Raw Data

  • Raw data is usually in a basic file format (CSV, Excel or, even worse, unstructured PDF), may be zipped, and is delivered via websites, FTP or, shockingly, by DVD.
  • Individual features are usually served by an API, like WFS or a proprietary JSON API. The features may also be requested in bulk.
  • Aggregations or topics are based on the individual features and provide higher-level abstractions, such as a summary of the population of all cantons based on raw population numbers.
  • Presentations may be reports, websites or even apps, which present the aggregated data tailored to a specific problem.

Using Open Data - It's about invalidating your cache

In the long term, one of the biggest challenges of Open Data is not only to have the raw data publicly available, but to provide and use up-to-date, high-quality services based on the original data. You have to keep in mind that everything is a cache as long as you are not the owner of the data. And arguably, handling caches is one of the hard problems of computer science2: you are in a dependency chain, and you are not in control of the whole chain. Note that this observation applies to both the provider and the consumer of third-party data.

The Open Knowledge Foundation says: "Our Mission is a World of Frictionless Data". In my opinion, the challenge of keeping up-to-date is one of the long-term frictions we have to overcome. By the way, other frictions are astutely described in Tyler Bell's write-up on using 3rd party data.

Invalidating your cache means being responsible with data: if you use raw data for your services or apps, you have the responsibility to keep that data up to date for your service (or otherwise to clearly mark the source and the compilation date of your content).
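As a minimal sketch of what such a cache policy could look like (the function name and the freshness window are my own assumptions, not part of any standard), a consumer could record when each dataset was fetched and treat it as stale once a chosen maximum age has passed:

```python
from datetime import datetime, timedelta, timezone

def is_stale(fetched_at: datetime, max_age: timedelta, now=None) -> bool:
    """Return True if a cached dataset fetched at `fetched_at`
    is older than the allowed age `max_age` and should be re-fetched."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - fetched_at > max_age

# Hypothetical example: data fetched ten days ago, allowed age one week.
fetched = datetime(2013, 10, 8, tzinfo=timezone.utc)
today = datetime(2013, 10, 18, tzinfo=timezone.utc)
print(is_stale(fetched, timedelta(days=7), now=today))  # True: re-fetch
```

The point is not the few lines of code but the explicit policy: every cached dataset carries a fetch timestamp, and staleness is a deliberate decision rather than an accident.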

Services or Raw Data?

So, what should the government provide? Raw data, for sure. If it is part of their mission, they should also provide at least feature-level and aggregation services. For example, geodata is used within many governmental processes, so an agency has to provide services like maps or cadastral information in order to make sure others can fulfill their tasks, like issuing building permits. On the other hand, there are cases where services or apps are not part of the government's mission, and then they shouldn't build them (example: location analysis for real estate).

In general, data services are preferable to raw data downloads, since up-to-date services make it easier to invalidate your local data cache.

In Switzerland, the recently launched Open Data government portal is a great step towards frictionless data. It is a common landing point for getting data, which can be referenced by permalinks. However, most of the data is in Excel format with unstructured metadata, in ZIP files or, even worse, in PDFs. And it's still raw data.

The canton of Zürich has experimentally started an approach towards services with its geodata: currently, there are 8 WFS services, which can be accessed directly with common GIS tools, or converted to more popular formats like JSON with HSR's GeoConverter.
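For illustration, a WFS service like these can be queried with a standard OGC GetFeature request built from a handful of query parameters. The endpoint URL and layer name below are placeholders, not the canton's actual service:

```python
from urllib.parse import urlencode

def wfs_getfeature_url(endpoint: str, type_name: str,
                       output_format: str = "application/json",
                       count: int = 100) -> str:
    """Build a WFS 2.0.0 GetFeature request URL using the
    standard OGC query parameters."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,      # WFS 2.0.0 uses typeNames (plural)
        "outputFormat": output_format,
        "count": count,              # "count" replaces WFS 1.x "maxFeatures"
    }
    return endpoint + "?" + urlencode(params)

# Placeholder endpoint and layer name, for illustration only:
url = wfs_getfeature_url("https://example.org/wfs", "ms:boundaries")
print(url)
```

Because the request is just a URL, the same features can be pulled by a GIS tool, a script, or a converter service, which is exactly what makes feature-level services more reusable than file downloads.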

What can you do as an Open Data consumer?

As said above, both the provider and the consumer should act responsibly, since they are both part of the unavoidable dependency chain. As a consumer of data, there are many things you can do, among them:

  • Provide exact attribution, with a precise reference and a timestamp.
  • Make sure that you get notified when the original data changes.
  • Make sure your workflow keeps you up-to-date, such as Mike Bostock's data compilation approach.
  • As a community: Think of an ethos of data consumers, which reminds everyone in the dependency chain to act responsibly.

Did I miss something? Probably lots of things. Let me know by pinging me on Twitter: @ping13.

  1. You might be reminded of the disputable DIKW pyramid: Wisdom > Knowledge > Information > Data

  2. There is also a variation on this that says there are two hard things in computer science: cache invalidation, naming things, and off-by-one errors (Source). 

Category: misc


© Stephan. Built using Pelican. Theme by Giulio Fidente on github.