The Global Open Data Index collects and presents information on the current state of open data release around the world. The Global Open Data Index is run by Open Knowledge and reviewed by volunteers from the Open Knowledge network and around the world. The first Open Data Index was released on October 28, 2013.

The following page explains the methodology behind the Global Open Data Index. If you have any further questions or comments about our methodology, please write to the community of volunteers and reviewers who contributed to the Index at our official discussion forum.

In 2013 the Index included 70 countries. In 2014 we extended our scope to 32 additional countries, mainly from Latin America, Africa, and Asia, and therefore put additional effort into locating contributors and experts from these regions. We believe this will make the Index portray the state of the open government data landscape worldwide more accurately.

The Global Open Data Index is not a representation of the official government open data offering in each country, but an independent assessment from a citizen perspective.

The Global Open Data Index is not only a benchmarking tool, it also plays a powerful role in building the open government data community around the world. If, for example, the government of a country does publish a dataset, but this is not clear to the public and cannot be found through a simple search, then the data can easily be overlooked. Governments and open data practitioners can review the Index results to see how accessible the open data they publish actually appears to citizens, and where improvements are necessary to make open data really open and useful.

Datasets

The Global Open Data Index benchmarks open data by looking at ten key datasets in each place. These datasets were chosen based on the G8 key datasets definition and after consulting the Open Government Community. This year the Global Open Data Index examines the following ten datasets:

  1. Election Results (national) - Results by constituency/district for all major national electoral contests

  2. Company Register - List of registered (limited liability) companies, including name, unique identifier and additional information such as address and registered activities. Submissions in this data category do not need to include detailed financial data such as balance sheets.

  3. National Map (low resolution: 1:250,000 or better) - High-level map at a scale of 1:250,000 or better (1 cm = 2.5 km)

  4. Government Spending (detailed transactional level data) - Records of actual (past) national government spending at a detailed transactional level; at the level of month-to-month government expenditure on specific items (usually this means individual records of spending amounts under $1m or even under $100k). (Note: a database of contracts awarded or similar is not considered sufficient. This data category refers to detailed ongoing data on actual expenditure.)

  5. Government Budget (high level of spending by sector) - National government budget at a high level (e.g. spending by sector, department etc.). This category is about budgets, which are government plans for expenditure (not actual past expenditure).

  6. Legislation (laws and statutes) - This data category requires all national laws and statutes to be available online, although it is not a requirement that information on legislative behaviour (e.g. voting records) be available.

  7. National Statistical Office Data (economic and demographic information) - Key national statistics such as demographic and economic indicators (GDP, unemployment, population, etc.). Aggregate data (e.g. GDP for the whole country at a quarterly level, or population at an annual level) is also considered acceptable in this data category. In general, an answer of 'yes' in this category refers to entries with a reasonable amount of both economic and demographic information available.

  8. National Postcode/ZIP database - A database of postcodes/zipcodes and the corresponding geospatial locations in terms of a latitude and a longitude (or similar coordinates in an openly published national coordinate system). A database which gives a location in terms of the name of a town or a street without lat/long coordinates is not considered acceptable, unless the name of the town or street can be further converted to a latitude and longitude by means of other open data (e.g. an open gazetteer with latitude and longitude attributes); a sketch of such a conversion follows this list.

  9. Public Transport Timetables - Timetables of major government-operated (or commissioned) national-level public transport services (specifically bus and train). The focus here is on national-level services, not those which operate only at a municipal or city level and are not controlled or regulated by the national government. A 'yes' answer to any question refers to both types of transport. However, if there is no national-level service operated or regulated by the government for a given type of transport (for instance buses), then that type is ignored in this data category.

  10. Environmental Data on major sources of pollutants (e.g. location, emissions) - Aggregate data about the emission of air pollutants, especially those potentially harmful to human health (although it is not a requirement to include information on greenhouse gas emissions). Aggregate means national-level or more detailed, and on an annual basis or better. Standard examples of relevant pollutants are carbon monoxide, nitrogen oxides, particulate matter etc.
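For illustration only, here is a minimal sketch of the postcode conversion mentioned in item 8: joining a postcode file that only names towns with an open gazetteer that carries coordinates. The file names and column names are hypothetical, not part of any actual national dataset.

```python
import csv

# Hypothetical inputs: postcodes.csv (postcode, place_name) lacks
# coordinates; gazetteer.csv (place_name, lat, lon) is the open
# gazetteer used to supply them.
with open("gazetteer.csv", newline="", encoding="utf-8") as f:
    coords = {row["place_name"]: (row["lat"], row["lon"])
              for row in csv.DictReader(f)}

with open("postcodes.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        lat_lon = coords.get(row["place_name"])
        if lat_lon:  # only postcodes we can geolocate count
            print(row["postcode"], *lat_lon)
```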

Places

Where we have received submissions for 2013 and 2014 from places that may not be officially recognised as independent countries, we have included them where the submissions are complete and accurate. Therefore, the Global Open Data Index 2014 ranks ‘Places’ and not ‘Countries’.

The way we define a ‘Place’ or ‘Country’ in the Index is under review for 2015. You will see that this has been debated on our discussion list and it’s an issue that we will return to.

Scoring

Each dataset in each place is evaluated using nine questions that examine the technical and the legal openness of the dataset. To balance the two aspects, each question carries a different weight: the six technical questions are together worth 50 points, and the three legal questions are worth the other 50, for a total of 100 points per dataset.

The following questions examine technical openness:

  • Does the data exist?
  • Is the data in digital form?
  • Is the data available online?
  • Is the data machine-readable?
  • Is it available in bulk?
  • Is the data provided on a timely and up to date basis?

The following questions examine legal openness:

  • Is the data publicly available?
  • Is the data available for free?
  • Is the data openly licensed?

The following list describes the questions in further detail, alongside their weights; a short sketch of the scoring arithmetic follows it.

  • Does the data exist? (weighting: 5) - Does the data exist at all? The data can be in any form (paper or digital, offline or online, etc.). If it does not, the other questions are not answered.
  • Is the data in digital form? (weighting: 5) - Addresses whether the data is in digital form (stored on computers or digital storage) or only in, for example, paper form.
  • Is the data publicly available? (weighting: 5) - Addresses whether the data is "public". This does not require it to be freely available, but does require that someone outside of the government can access it in some form (if the data is available for purchase, exists as a PDF on a website that you can access, or can be obtained in paper form, then it is public). If a freedom of information request or similar is needed to access the data, it is not considered public.
  • Is the data available for free? (weighting: 15) - Addresses whether the data is available for free or if there is a charge. If there is a charge, it is stated in the comments section.
  • Is the data available online? (weighting: 5) - Addresses whether the data is available online from an official source. If answered with a 'yes', the link is put in the URL field.
  • Is the data machine-readable? (weighting: 15) - Data is machine-readable if it is in a format that can be easily structured by a computer. Data can be digital but not machine-readable: for example, a PDF document containing tables of data is definitely digital but not machine-readable, because a computer would struggle to access the tabular information (even though it is very human-readable!). The equivalent tables in a format such as a spreadsheet would be machine-readable. Note: the appropriate machine-readable format may vary by type of data; machine-readable formats for geographic data, for example, may differ from those for tabular data. In general, HTML and PDF are not machine-readable.
  • Is the data available in bulk? (weighting: 10) - Data is available in bulk if the whole dataset can be downloaded or accessed easily. Conversely, it is considered non-bulk if citizens are limited to getting parts of the dataset (for example, restricted to querying a web form and retrieving a few results at a time from a very large database).
  • Is the data openly licensed? (weighting: 30) - Addresses whether the dataset is open as per http://opendefinition.org: the terms of use or licence must allow anyone to freely use, reuse or redistribute the data (subject at most to attribution or share-alike requirements). It is vital that a licence is available; if there is no licence, the data is not openly licensed. Open licences which meet the requirements of the Open Definition are listed at http://opendefinition.org/licenses/.
  • Is the data provided on a timely and up-to-date basis? (weighting: 10) - Addresses whether the data is up to date and timely or long delayed; for example, whether election data is made available immediately or soon after the election, or only many years later. Any comments around uncertainty are put in the comments field.
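As a minimal sketch of the arithmetic (assuming simple yes/no answers; the question keys and the function below are illustrative shorthand, not part of the Index codebase, though the weights come from the list above):

```python
# Weights as in the list above; the question keys are our own shorthand.
WEIGHTS = {
    "exists": 5,             # technical
    "digital": 5,            # technical
    "publicly_available": 5, # legal
    "free": 15,              # legal
    "online": 5,             # technical
    "machine_readable": 15,  # technical
    "bulk": 10,              # technical
    "openly_licensed": 30,   # legal
    "timely": 10,            # technical
}

def dataset_score(answers):
    """Score one dataset for one place as the sum of weights of 'yes' answers.

    If the data does not exist at all, the other questions are not
    answered, so the score is 0.
    """
    if not answers.get("exists"):
        return 0
    return sum(weight for question, weight in WEIGHTS.items()
               if answers.get(question))

# Example: data exists, is digital, online, public and free, but is not
# machine-readable, not available in bulk, unlicensed, and not timely.
print(dataset_score({"exists": True, "digital": True, "online": True,
                     "publicly_available": True, "free": True}))  # 35
```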

Sample methodology

The Index uses a non-probability sampling technique: a “snowball sample”. A snowball sample tries to locate subjects of study who are hard to reach. In our case, we work with contributors interested in open government data activity who can assess the availability and quality of open datasets in their respective locations. We recruit not only through referrals, but also by posting on social media, through our Open Government Data and Open Data Census mailing lists, and by meeting people face-to-face at conferences. This means that anyone from any place can participate in the Global Open Data Index as a contributor and make submissions, which are then reviewed. We do not have a quota on the number of places that can participate; rather, we aim to sample as many places as we can.
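Purely as a toy illustration of how a snowball sample grows through referrals (this is not the Index's actual outreach tooling; all names and parameters here are made up):

```python
import random

def snowball(seeds, population, rounds=3, referrals_each=2, rng_seed=0):
    """Toy snowball sample: each recruited contributor refers a few more."""
    rng = random.Random(rng_seed)
    recruited = set(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        newly_recruited = []
        for _person in frontier:
            for candidate in rng.sample(population, referrals_each):
                if candidate not in recruited:
                    recruited.add(candidate)
                    newly_recruited.append(candidate)
        frontier = newly_recruited  # next round starts from the new recruits
    return recruited

population = [f"contributor_{i}" for i in range(100)]
print(len(snowball(["seed_a", "seed_b"], population)))
```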

This also has an impact on the quality of the data we collect in the first stage of the Global Open Data Index. Contributors have diverse knowledge and backgrounds in open data, and therefore sometimes need help finding the data we are looking for. The following section explains how we try to deal with this problem.

Assessment and quality review process

The assessment takes place in two steps: first, collecting evaluations of datasets from volunteer contributors; second, verifying the results through volunteer expert reviewers. The following steps are taken each time a dataset is submitted (a sketch of the overall flow follows the list):

  1. Contributors, who can be anyone, submit information about the availability of one of the key datasets in their Place. At this stage, the results are not published online straight away; instead, they are held back for review (see below). Please note that in some Places, when there were no changes in a dataset between years (for example, from 2013 to 2014), an Open Knowledge staff member may transfer the existing entry over to the next year, noting in the comments section of that entry that there were ‘no changes from year X to year Y’.
  2. Next, the submission is sent to a reviewer appointed by Open Knowledge based on that person’s experience and insight into open data in their Place or region. He or she verifies the results and makes sure they are accurate. Where it was not possible to find a reviewer for a Place, an Open Knowledge staff member reviewed the results. In rare cases the contributor and reviewer for a Place are the same person; this is generally avoided so that each submission gets a second set of eyes going over the data to make sure it is accurate. Countries where the same people did both submissions and reviews include Thailand and Russia.
  3. Once the review process is done, the entries are scrutinized by a panel of expert reviewers: volunteers with particular open data expertise across the key dataset fields. This panel reviewed thematically across several Places (a vertical review of the table, as compared to the prior reviewer’s horizontal review). This final review was carried out on the top and bottom 15 Places for each of the key datasets. It reduces the likelihood of false positive answers, allows the assessments of different Places to be compared, and helps ensure that remaining errors are corrected and estimates better explained. This year, we had a particularly thorough review of some of the datasets:
    • Government Spending: making sure that the entries submitted referred to transactional and not aggregated data.
    • Postcodes: submissions were scrutinized particularly to ensure that the data contains geolocations as well as the postcodes themselves.
    • National map: making sure that the map meets the quality threshold (scale 1:250,000 or better) and that the data could effectively be downloaded.
  Lastly, an Open Knowledge staff member checked that one complex question was addressed consistently: the machine-readability of submissions, which had caused uncertainty among many contributors. For instance, submissions referring to data in HTML or PDF format were marked ‘Not machine-readable’. This was a measure to eliminate false positive answers.
  4. Throughout the process, and right up until the day before the finalized Global Open Data Index was released under embargo to journalists, revisions could be submitted at any time by anyone claiming to have more up-to-date or more accurate information about a dataset. Such contributions were then reviewed in the same manner as outlined above before being posted to the Index. This iterative process was put in place to allow as many people as possible to have a say.
  5. Please note that this extended quality control was not performed for the 2013 Open Data Index. This may explain some of the differences in scores when comparing the 2013 and 2014 results.
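The flow above can be summarized as a small state machine. The following is a hedged sketch with state names of our own choosing, not code from the Index platform:

```python
from enum import Enum

class State(Enum):
    SUBMITTED = "submitted"          # step 1: contribution held back for review
    PLACE_REVIEW = "place_review"    # step 2: place/region reviewer verifies
    EXPERT_REVIEW = "expert_review"  # step 3: thematic panel (top/bottom 15 Places)
    PUBLISHED = "published"          # steps 4-5: visible; revisions re-enter review

# Allowed transitions, in the order the steps describe. A revision to a
# published entry sends it back to the start of the review chain (step 4).
TRANSITIONS = {
    State.SUBMITTED: {State.PLACE_REVIEW},
    State.PLACE_REVIEW: {State.EXPERT_REVIEW},
    State.EXPERT_REVIEW: {State.PUBLISHED},
    State.PUBLISHED: {State.SUBMITTED},
}

def advance(current, nxt):
    """Move an entry to the next state, rejecting transitions not listed above."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {nxt.value}")
    return nxt

# Example: a fresh submission passing through the full review chain.
state = State.SUBMITTED
for step in (State.PLACE_REVIEW, State.EXPERT_REVIEW, State.PUBLISHED):
    state = advance(state, step)
```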