Systems and statistics: moving data science beyond machine learning

By Jordan Engbers, PhD from Desid Labs Inc.


Data science focuses on applying statistical methods, primarily machine learning algorithms, to transform data resources, big or small, into data products that provide actionable insights. This near-complete reliance on statistical methods has led some to argue there is little difference between data science and statistics (but see this rebuttal). Still, this data-centric approach to solving problems has been “unreasonably effective”, and the impact of data science is expected to keep rising. However, as we bet the future on data science, we need to ensure the resulting data products create the right impact. To do this, we will need to expand data science beyond machine learning.

General systems theorist Gerald Weinberg wrote: “science and technology have been unable to keep pace with the second-order effects caused by their first-order victories”. Our world is a complex system, and technology and decisions can have far-reaching and unexpected effects. As we accelerate decision-making through applied data science, the effects of our predictive analytics and automated algorithms must not outpace our ability to ensure we are building the future we want and need. To address these concerns, we need to incorporate systems thinking into data science. This will improve the performance of machine learning models, expand the application of big data, and allow examination of second-order effects at the systems level.

Machine learning algorithms have implicit and opaque assumptions, usually known only to experts in those algorithms. The assumptions may be about how the data were generated (e.g. normal distributions) or how variables relate to each other (e.g. independent or dependent). A new approach called model-based machine learning allows these assumptions to be explicitly modeled, and uses a generic inference method to determine which machine learning algorithm is most appropriate. Knowledge of the system is also incorporated in the underlying model. This approach offers several advantages, such as improving algorithm performance through intuitive tweaking of model parameters. By taking a hybrid approach that allows us to incorporate prior system knowledge, we can potentially increase the success rate of data science projects. Also, by taking a step back from the data and looking at the system, organizations may better understand how the project fits within their overall goals, an understanding that is currently lacking and that profoundly affects the success of a project.
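To make "explicit assumptions" concrete, here is a minimal sketch in Python. It is not the model-based machine learning framework itself. It uses a Beta-Binomial model, chosen here purely for illustration, to show how prior system knowledge can be written down as a model parameter and then adjusted directly. The class name, function name, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaPrior:
    """Explicit belief about an unknown success rate, as a Beta(alpha, beta)."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        # Expected success rate under this belief.
        return self.alpha / (self.alpha + self.beta)


def update(prior: BetaPrior, successes: int, failures: int) -> BetaPrior:
    """Conjugate Bayesian update: add observed counts to the prior counts."""
    return BetaPrior(prior.alpha + successes, prior.beta + failures)


# Two stated assumptions about the same quantity (numbers are made up):
no_knowledge = BetaPrior(1.0, 1.0)       # uniform: no prior system knowledge
domain_knowledge = BetaPrior(2.0, 18.0)  # an expert expects a rate near 10%

# A small, hypothetical data set: 3 successes and 2 failures.
post_plain = update(no_knowledge, successes=3, failures=2)
post_informed = update(domain_knowledge, successes=3, failures=2)

print(f"no prior knowledge:   {post_plain.mean:.3f}")
print(f"with prior knowledge: {post_informed.mean:.3f}")
```

With only five observations, the two estimates differ substantially, and the reason is visible in the model rather than buried inside an algorithm. That visibility is what makes "intuitive tweaking" possible.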

While machine learning can benefit from systems models, systems models can in turn benefit greatly from big data and machine learning. Agent-based models, in which each “agent” is an autonomous entity with a defined behaviour in the system, have been used to simulate everything from biofilm formation to supply chain logistics and financial markets. Big data is now being used in agent-based models of cities to improve the accuracy of agent behaviour. Here, big data and machine learning improve how we predict individual behaviour, and the system model can then show how individual responses to change will affect the system as a whole. For example, this could allow city planners to examine how changing road layouts may affect crime rates, a relationship that is unintuitive because it arises from complex system dynamics. A hybrid systems/statistics approach provides another avenue to use our data resources.
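As a rough illustration of how individual behaviour rules produce system-level outcomes, the sketch below implements a very small agent-based model in Python. It is a threshold adoption model in the style of Granovetter, chosen here as an example and not taken from the studies mentioned above. Each agent adopts a behaviour once enough others have adopted it. The function name and the thresholds are hypothetical; in a hybrid approach, such thresholds are the kind of per-agent parameter that might be estimated from data.

```python
def simulate_adoption(thresholds: list[int]) -> int:
    """Run the model to a fixed point and return how many agents adopt.

    thresholds[i] is the number of existing adopters agent i needs to see
    before adopting. Agents never un-adopt.
    """
    adopted = [False] * len(thresholds)
    while True:
        current = sum(adopted)  # adopters visible at the start of this round
        changed = False
        for i, needed in enumerate(thresholds):
            if not adopted[i] and current >= needed:
                adopted[i] = True
                changed = True
        if not changed:
            return sum(adopted)


# 100 agents with thresholds 0, 1, 2, ..., 99: each adopter tips the next.
chain = list(range(100))

# Identical population, except one agent's threshold moves from 1 to 2.
broken_chain = [0, 2] + list(range(2, 100))

print(simulate_adoption(chain))         # the cascade reaches every agent
print(simulate_adoption(broken_chain))  # the cascade stalls immediately
```

A change to a single agent flips the outcome for the whole population, which is the kind of unintuitive system-level effect that a statistical model of individuals alone would not reveal.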

Perhaps most importantly, incorporating a systems approach into data science will allow us to examine how decisions based on predictive analytics will affect the system as a whole. Techniques like system dynamics and agent-based modeling allow us to examine the behaviour of a system given a set of components and interactions. In healthcare, where limited resources and patient well-being must be balanced, it is especially important to see how decisions based on predictive algorithms could affect resource utilization. Dr. Deborah Marshall, a professor and health economist at the O’Brien Institute for Public Health, is already applying such models to osteoarthritis care in Alberta. When combined with big data and predictive analytics, these models become even more powerful, allowing decision makers to compare how different decisions based on machine learning will play out across the system.
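The idea can be sketched with a toy system dynamics model in Python: a single stock (a waitlist) with one inflow (referrals) and one capacity-limited outflow (treatments). This is an illustrative assumption only and does not represent the osteoarthritis models mentioned above. The function name and every number are made up.

```python
def simulate_waitlist(referrals_per_week: int, capacity_per_week: int,
                      weeks: int, initial_waitlist: int = 0) -> list[int]:
    """Return the waitlist size at the start and after each simulated week."""
    waitlist = initial_waitlist
    history = [waitlist]
    for _ in range(weeks):
        # Outflow is limited by capacity and by how many patients are waiting.
        treated = min(capacity_per_week, waitlist + referrals_per_week)
        waitlist = waitlist + referrals_per_week - treated
        history.append(waitlist)
    return history


# Hypothetical baseline: referrals are below treatment capacity.
baseline = simulate_waitlist(referrals_per_week=40, capacity_per_week=50,
                             weeks=52, initial_waitlist=100)

# Hypothetical scenario: a predictive screening tool flags more patients,
# lifting referrals above capacity while capacity stays fixed.
with_screening = simulate_waitlist(referrals_per_week=55, capacity_per_week=50,
                                   weeks=52, initial_waitlist=100)

print("baseline waitlist after one year:      ", baseline[-1])
print("with screening, waitlist after a year: ", with_screening[-1])
```

In this toy setting, a predictive tool that looks beneficial for individual patients causes the waitlist to grow without bound unless capacity changes too. A model of the prediction task alone would not surface that second-order effect.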

While it is easy to make a case that this hybrid approach will improve the effectiveness of data science, it is equally important to consider the consequences of forgoing systems thinking. As predictive analytics programs proliferate, it will become increasingly difficult to keep track of their second-order effects. Already, algorithmic bias has resulted in discrimination in product pricing, offensive tagging of online photos, and predatory advertising. Inadvertent or not, these biases can have a disparate impact on groups of individuals, often “inheriting” the prejudices of society at large. While it is no substitute for thoughtful consideration of the data, a systems approach may identify these problems before they happen and help ensure our data resources create the future we actually want.