Authors: David Ackerman, Alyssa Moore, David Chan, Tatiana Meleshko, Byron Chu
Last June, Cybera’s data scientists took on a monumental task: to create a tool that can easily examine the thousands of documents and testimonials that were provided to the CRTC for its 2016 “basic telecommunications service” consultation, and search for key words and common language. Our goal was to use data science principles to help Canadians better understand how policy decisions are made to safeguard the internet.
This has not been an easy job. Over the course of the consultation, the equivalent of ~65,000 pages (or ~216 novels' worth) of material was published on the public record. The types of documents submitted range from Microsoft Word documents to PDFs to spreadsheets. Some written submissions read like novels; others are tightly scripted point-form notes. Many submissions used conversational English; others were in legalese. Extracting the data and written language from the different documents, and then classifying the arguments for and against the “basic telecommunications service” designation, has been very tricky.
In this blog post, we will go over some of the complications we have run into over the last six to seven months, and the tools we have used to build our understanding. In the next month, we hope to unveil the results of this massive undertaking!
Issue #1: Navigating the CRTC site
If you want to peruse the consultation submissions on the CRTC website, your only option is to navigate through the hierarchy they set up – you often have to click through several sub-pages to get to a specific document. Also, many of the documents on the site are PDF or Word files, which are not easy to view directly in a web browser. Even HTML form submissions need to be downloaded before you can read them.
Our Solution: Sinatra
We wrote a document browser web application (in Ruby), using the lightweight Sinatra framework. This tool enables users to navigate the submission documents based on three different criteria:
By date of submission
By associated company/organization
By sections of documents that match “fuzzy searches” (i.e. searches for key words or phrases)
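To make the three browsing criteria concrete, here is a minimal sketch in plain Ruby. It is not our application code: the `Submission` record, its field names, and the sample entries are invented for illustration, and the word-containment check in `matching` is only a crude stand-in for a real fuzzy search. In a Sinatra app, each of these helpers would presumably sit behind its own route.

```ruby
require "date"

# Hypothetical record for one submission; field names are illustrative only.
Submission = Struct.new(:title, :organization, :submitted_on, :text)

# Invented sample data, not real submissions.
SUBMISSIONS = [
  Submission.new("Intervention A", "Example Telecom", Date.new(2015, 7, 14),
                 "Broadband should be a basic service."),
  Submission.new("Intervention B", "Example Advocacy Group", Date.new(2015, 7, 14),
                 "Affordable internet access matters."),
  Submission.new("Reply C", "Example Telecom", Date.new(2016, 2, 1),
                 "Market forces already deliver broadband.")
].freeze

# Criterion 1: group submissions by date, oldest first.
def by_date(submissions)
  submissions.group_by(&:submitted_on).sort.to_h
end

# Criterion 2: group submissions by the organization that filed them.
def by_organization(submissions)
  submissions.group_by(&:organization)
end

# Criterion 3 (stand-in): keep submissions whose text contains every word
# of the phrase, ignoring case. A real fuzzy search would tolerate typos.
def matching(submissions, phrase)
  words = phrase.downcase.split
  submissions.select { |s| words.all? { |w| s.text.downcase.include?(w) } }
end
```

For example, `matching(SUBMISSIONS, "basic service")` returns only the first sample record, while `by_organization(SUBMISSIONS)` groups the two "Example Telecom" entries together.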
We hope to use text mining techniques like topic analysis, clustering, and named entity recognition to allow additional ways of organizing/browsing the documents.
We’re currently using the document browser to give us a high-level “dashboard” view of the quality of the data. Our plan is to refine it to answer specific questions, such as “how many interveners, and which groups of them, believe the internet should be considered a basic service?” and “what do they think it should cost?”.
We’re trying to do this in such a way that you don’t need to manually go through every document to find these answers. We also want to make it easier to trace the source text of a statement or fact put forward.
Issue #2: Finding a text match in a large trove of documents
First, there is the challenge of dealing with the different formats of the documents: PDF and DOC files are not nearly as easy to extract information from as raw text or more standardized formats. Then there is the challenge of organizing all the text from these documents so that it is accessible through one search.
Our Solution: Neo4J
One tool we have been using is Neo4J, which is a Java-based “graph database”. It models relationships between data as connected nodes on a graph, and has been used in investigative journalism on the Paradise and Panama Papers.
The important bits for us:
The data model is very flexible and allows us to easily visualize whatever bits of knowledge we can extract from the documents.
Connections can emerge that aren’t readily apparent when just reading through the documents. (Just looking at an emerging graph often invites more questions and avenues for analysis.)
It’s very natural to query and “ask questions” about your data once it’s in this connected form.
We’re also using the Apache Solr project to help find relevant sections of text from larger documents that we want to pull out and do further analysis on. Most important for us at the moment is its ability to do quick “fuzzy” matching on document contents based on queries we give it.
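As a rough illustration of what such a fuzzy query can look like, the sketch below builds a request URL for Solr's select handler using the standard query parser's `term~N` syntax (match within N edits). The host, port, core name (`submissions`), and field name (`content`) are all assumptions made for the example, not our actual configuration.

```ruby
require "uri"

# Assumed endpoint: host, port and core name are hypothetical.
SOLR_SELECT = "http://localhost:8983/solr/submissions/select"

# Build a select URL that fuzzy-matches every term in an assumed text field.
# Solr's standard query parser reads `term~1` as "within one edit of term".
# Terms are not escaped here, so this sketch only suits plain words.
def fuzzy_query_url(terms, field: "content", max_edits: 1, rows: 10)
  clause = terms.map { |t| "#{field}:#{t}~#{max_edits}" }.join(" AND ")
  params = {
    "q" => clause,
    "rows" => rows.to_s,
    "hl" => "true",     # ask Solr to return highlighted snippets
    "hl.fl" => field    # highlight within the same field
  }
  "#{SOLR_SELECT}?#{URI.encode_www_form(params)}"
end

puts fuzzy_query_url(%w[affordable broadband])
```

The highlighting parameters are included because highlighted snippets are one plausible way to pull relevant sections out of larger documents; other approaches would work too.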
We feed those results back into the Neo4J graph as segments that can be traced back to the documents they come from, or the queries that generated them.
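The traceability idea can be sketched with a tiny in-memory graph. This is a stand-in written in plain Ruby, not the Neo4J API, and the node labels and relationship names (`SUBMITTED`, `CONTAINS`, `MATCHED`) are a hypothetical schema chosen for the example.

```ruby
# Minimal in-memory property graph, for illustration only.
Node = Struct.new(:id, :label, :props)
Edge = Struct.new(:from, :type, :to)

class TinyGraph
  def initialize
    @nodes = {}
    @edges = []
  end

  def add_node(id, label, props = {})
    @nodes[id] = Node.new(id, label, props)
  end

  def add_edge(from, type, to)
    @edges << Edge.new(from, type, to)
  end

  # Nodes that point at `id` through an edge of the given type.
  def sources_of(id, type)
    @edges.select { |e| e.to == id && e.type == type }.map { |e| @nodes[e.from] }
  end
end

# Build a small example: organization -> document -> segment <- query.
def build_example_graph
  graph = TinyGraph.new
  graph.add_node(:org1, :Organization, { name: "Example Telecom" })
  graph.add_node(:doc1, :Document, { title: "Intervention A" })
  graph.add_node(:seg1, :Segment, { text: "Broadband should be a basic service." })
  graph.add_node(:q1, :Query, { text: "basic service" })
  graph.add_edge(:org1, :SUBMITTED, :doc1)
  graph.add_edge(:doc1, :CONTAINS, :seg1)
  graph.add_edge(:q1, :MATCHED, :seg1)
  graph
end

# Trace a segment back to its documents, their organizations, and the
# queries that surfaced it.
def provenance(graph, segment_id)
  documents = graph.sources_of(segment_id, :CONTAINS)
  {
    documents: documents.map { |d| d.props[:title] },
    organizations: documents.flat_map { |d| graph.sources_of(d.id, :SUBMITTED) }
                            .map { |o| o.props[:name] },
    queries: graph.sources_of(segment_id, :MATCHED).map { |q| q.props[:text] }
  }
end
```

Calling `provenance(build_example_graph, :seg1)` walks the edges backwards from the segment. In a graph database the same question would be a short pattern-matching query rather than hand-written traversal code.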
Issue #3: A lack of easy metadata about the documents
Uploading the documents is one thing; figuring out whether we’re grabbing the right documents (and what’s even in those documents) is another.
Our Solution: Web scraper
We’re using a Ruby-based web scraping tool to crawl through the CRTC site, keeping track of as much metadata about the documents as possible, and also retrieving those documents.
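The sketch below shows the general shape of that metadata tracking, using only Ruby's standard library. The page URL and HTML snippet are made up, and the regular expression is a shortcut for the example; a real scraper would normally use a proper HTML parser and fetch pages over HTTP.

```ruby
require "uri"

# Hypothetical page URL and markup, invented for this example.
PAGE_URL = "https://example.org/consultation/index.html"
SAMPLE_HTML = <<~HTML
  <ul>
    <li><a href="docs/intervention-001.pdf">Intervention 001</a></li>
    <li><a href="docs/intervention-002.docx">Intervention 002</a></li>
    <li><a href="/about.html">About this site</a></li>
  </ul>
HTML

# File types we treat as submission documents in this sketch.
DOCUMENT_EXTENSIONS = %w[.pdf .doc .docx .xls .xlsx].freeze

# Extract links to documents and record basic metadata about each one.
# The regex only handles simple double-quoted href attributes.
def document_links(html, page_url)
  html.scan(/<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)<\/a>/mi).filter_map do |href, text|
    url = URI.join(page_url, href) # resolve relative links against the page
    ext = File.extname(url.path).downcase
    next unless DOCUMENT_EXTENSIONS.include?(ext)

    {
      url: url.to_s,
      format: ext.delete_prefix("."),
      link_text: text.strip,
      found_on: page_url
    }
  end
end

document_links(SAMPLE_HTML, PAGE_URL).each { |link| puts link }
```

Recording where each link was found (`found_on`) alongside the link text preserves the context of the site hierarchy, which is often the only metadata available for a document.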
A major issue for us has been the different conventions used by the organizations that submitted documents, or used in a particular part of the process. There’s no quick solution for this: we’re using text mining (a mixture of manual coding to extract information based on conventions we identify, fuzzy matching, and unsupervised and supervised machine learning techniques) to better categorize common phrases and language used in government consultation submissions.
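One small example of the fuzzy matching side of this: reconciling differently written organization names against a list of known names using edit distance. This is an illustrative assumption about where such matching helps, not a description of our pipeline; the names and the distance threshold are invented.

```ruby
# Levenshtein edit distance between two strings (row-by-row dynamic programming).
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Lowercase, drop punctuation and collapse whitespace before comparing.
def normalize(name)
  name.downcase.gsub(/[^a-z0-9 ]/, "").squeeze(" ").strip
end

# Return the closest known name, or nil if nothing is within max_distance edits.
def closest_name(raw, canonical_names, max_distance: 3)
  best = canonical_names.min_by { |c| edit_distance(normalize(raw), normalize(c)) }
  return nil if best.nil?

  best if edit_distance(normalize(raw), normalize(best)) <= max_distance
end

# Hypothetical canonical names.
KNOWN_ORGANIZATIONS = ["Example Telecom Inc.", "Example Advocacy Group"].freeze
```

With these definitions, `closest_name("Exmaple Telecom, Inc", KNOWN_ORGANIZATIONS)` resolves to "Example Telecom Inc.", while a name with no close match returns `nil` and can be queued for manual review.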
We’ve written up a more extensive document on these and other issues here.
Going forward, we hope to add more ways of navigating the documents, as well as provide users with a mechanism for directly augmenting the data (perhaps through machine-learning assisted tagging). We’d also like to include methods for correcting errors in the data in a traceable, transparent manner. Ideally the entire toolset will be adaptable to navigating other public consultations and forms of policy documentation.
We plan to publicly release our tool in the coming month. If you have any suggestions or tips based on the work we outlined above, please feel free to leave a comment. Otherwise, watch this space!