When it comes to data, quantity outweighs quality

By Jill Kowalchuk, Interim President, Compute Canada / Calcul Canada

One of the biggest concerns about the use of "open data" revolves around the quality of data being shared. How, goes the argument, can any organization trust a source of information when any Tom, Dick or Harry has been allowed to contribute to it? After all, there has to be some accountability in the data we use to make important, potentially life-altering decisions.

However, I would argue that this line of thinking could be holding us back from making some amazing discoveries. Having access to an abundance of data, no matter its quality, can provide a comprehensive and unique view of a topic.

One very basic example is Google Maps Traffic data. The Google Live Traffic app provides real-time updates on road conditions in select towns in Canada, the USA, France, the UK and Australia. These updates are displayed through coloured lines overlaid on the map. A green line means traffic is moving fast; a red line means it is at a standstill. This data is drawn from the GPS readings of all smartphones on the road that have enabled Google Maps. These devices transmit where they are and how fast they are going. Combining this data creates a general picture of how well traffic in a specific region is moving.

From a technical viewpoint, it could be argued that this data does not have proper quality assurance. There isn't a person checking every source. You could have a driver who just likes to travel very slowly, or one who likes to travel very fast, or you could have a sensor that has gone awry and is sending bad data. All of this has the potential to lead to inaccurate map information. And yet, due to the vast amount of information available, most anomalies are negligible. Occasionally the information on the map is wrong, but in most cases, it works. As a frequent user of Live Traffic, I can testify to the accuracy of this app. I have even taken to planning my road trips around the live info streaming on my iPhone.
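
To make the point about negligible anomalies concrete, here is a minimal sketch in Python. It is not Google's actual method, which is not described here; the readings are invented for illustration, and the choice of the median as the summary is an assumption. It shows how a slow driver, a speeder and a faulty sensor barely register among a hundred readings for one stretch of road.

```python
from statistics import mean, median

def segment_speed(readings_kmh):
    """Summarize one road segment from many phone speed readings.

    The median is used (an assumption for this sketch) because a handful
    of extreme values cannot pull it far from the typical reading.
    """
    return median(readings_kmh)

# Hypothetical data: 97 ordinary drivers travelling at 58 to 62 km/h,
# plus one slow driver, one speeder and one faulty sensor.
ordinary = [58 + (i % 5) for i in range(97)]
anomalies = [20, 140, 900]
readings = ordinary + anomalies

typical = segment_speed(readings)   # robust summary of all 100 readings
naive = mean(readings)              # plain average, skewed by the bad sensor

print(f"median: {typical} km/h, mean: {naive:.2f} km/h")
```

With these made-up numbers the median stays at 60 km/h, while the plain average is dragged up to roughly 69 km/h by the single faulty sensor. The more readings there are, the less any one source matters, whichever summary is used.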

This is just one example of using public data to create a wider picture. And we will continue to see more projects built on vast amounts of data, much of which is gleaned from so-called "unreliable" sources.

In an excellent article published in the New York Times on December 5th, Larry Smarr discusses the billions of data streams being created by smart sensors for the home or body. People who wear devices that measure their steps, sleep patterns or heart rates will soon be able to post this data to the "cloud," where it can be shared and compared with similar readings. Analysis of this data could lead to early warning detection systems for chronic health issues or diseases, and potentially save lives.

Are these smart sensors 100% accurate? No. There will be faulty devices, and there will be people with abnormal readings who don't fit into a general health diagnosis. But the law of averages should compensate for any freak anomalies. And the potential benefits of having such a vast resource of information far outweigh any risks from the few unreliable sources.

Now, this is not to say that we should trust every piece of data that is sent to us. People should absolutely double-check their sources before making a decision that will affect lives or businesses. And it is all too easy to draw the wrong conclusions from data that has been haphazardly, or incorrectly, published.

But there are solutions. Google has recently released Refine — a program that helps clean up "messy data". It is intended to aid those looking to create their own apps, or conduct data journalism projects. The system sorts through data in a spreadsheet, and helps reduce repetitive fields, or convert information into standard units.
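
The two clean-up tasks mentioned above, reducing repeated entries and converting values to standard units, can be sketched in a few lines of plain Python. This is only an illustration of the idea and does not use or reproduce Refine itself; the function names, the sample rows and the conversion table are all hypothetical.

```python
# Hypothetical conversion factors from each unit to kilometres.
TO_KM = {"km": 1.0, "m": 0.001, "mi": 1.609344}

def normalize_name(raw):
    """Collapse spacing and capitalization differences so variants match."""
    return " ".join(raw.split()).title()

def clean(rows):
    """Return de-duplicated (name, distance_km) records, in original order."""
    seen = set()
    cleaned = []
    for name, value, unit in rows:
        record = (normalize_name(name), round(value * TO_KM[unit], 3))
        if record not in seen:      # skip rows that repeat an earlier one
            seen.add(record)
            cleaned.append(record)
    return cleaned

# Invented spreadsheet rows: the first two describe the same road
# with different spelling and different units.
rows = [
    ("  main street", 1.5, "km"),
    ("Main  Street", 1500, "m"),
    ("Elm Road", 2, "mi"),
]

result = clean(rows)
print(result)
```

Here the two "Main Street" rows collapse into one once the name and the unit are standardized, leaving two records instead of three.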

This is just one way to ensure a more accurate picture, and I am certain there are many other systems and procedures being developed to fine-tune open data, or check it for accuracy.

We should not fall into the belief that there is data out there that can be classified as unequivocally "bad"; all data can be good. Ultimately, in this case, quantity is far more important than quality.