Doing Science with Data

As part of Cybera’s mandate to explore new technologies and share our experiences with stakeholders, we are actively looking for new ways to improve our processes and bring value to our research and education community.

One of the first projects I worked on when I joined Cybera was our metrics dashboard. More specifically, I was asked to see where this project could take us and whether it would be valuable. It proved to be a hit among our staff and stakeholders. Most importantly, it made our operations more visible and showed the value we bring to our stakeholder community in a clear, digestible manner. This initiative has since fed into the “data-driven” organization and culture that Cybera is both striving for and evangelizing to others.

What isn’t visible, but always plays an essential role in our dashboards, is the thought process and team effort that goes into producing these metrics. It involves ingesting and exploring our data (observing), asking questions, formulating hypotheses, wrangling the data (formatting), analyzing it, communicating our findings, and then repeating the cycle. If I’m not mistaken, this is very similar to the data science process that is currently generating so much buzz. And, to be honest, it is also similar to the scientific research process in which my background lies (shameless self-plug) and where I found my original scientific inspiration (thank you, Mr. Koch).

At Cybera, we are currently sharpening our data science skills. Our goal is to explore tools and develop data products that we hope will support, spur and drive innovation within Alberta’s technology, educational and public sectors. Recently, my colleague David Chan and I underwent the same training that many people go through on their journey to becoming data scientists. The intensive five-day bootcamp, run by Seattle-based start-up Data Science Dojo, brought together attendees from a wide range of professional and educational backgrounds (including students, business professionals in finance, automotive and health, and IT developers) looking to get their “feet wet” in data science. Over the five days we focused on the R language and covered key areas such as classification (e.g. decision trees), text analytics, recommender systems, evaluation (including the importance of cross-validation), big data engineering tools (e.g. Hadoop and Spark) and working with streaming data.

I learned a great deal from the course content itself, and also from the participants and instructors who shared their own data science experiences. As part of the training, I participated in my first Kaggle competition (Titanic: Machine Learning from Disaster), which involved predicting passenger survival from the training dataset (the highest score obtained was 0.80383). The take-home message for me from the bootcamp and conversations: work in teams and just keep at it through practice, participation and continuous learning 🙂

Kaggle - Titanic


Stay Tuned

One of the key aspects of doing data science is communication. As our data science team ramps up our efforts, we will continuously share our discoveries and products with you, our stakeholder community. By putting our work out there we hope to inspire you and your organizations to do the same, so I invite you to follow us as we embark on our data science journey, and to send us your own data learning stories.