Jason Rudy, Data Scientist / Programmer and Matt Lewis, Programmer / Product Manager
Artificial intelligence (AI) and machine learning are topics most Americans have heard a lot about in recent years, and for good reason – advances in technology have made AI-powered products and services commonplace in consumer applications, and machine learning increasingly helps power technology across many sectors.
Both AI and machine learning are becoming frequently discussed topics in the healthcare sector, and many are wondering – what does machine learning really mean?
Fundamentally, machine learning is a set of techniques that allows the training of computer programs using example data. In the healthcare field, machine learning, coupled with large sources of patient / member data, can help generate predictions about population health, cost of care, and events such as hospital admission and readmission, and can give insight into the potential for individuals to experience specific outcomes and conditions (e.g., heart attacks, strokes, chronic kidney disease).
There are many different uses of and approaches to machine learning, but here we’ll focus specifically on supervised machine learning – the type used to create predictive models – as it relates to modeling healthcare costs, events, and conditions.
In healthcare, using claims data to better understand and predict costs, events, and conditions can be incredibly useful for proactive action by many types of organizations. For example, at Advanced Plan for Health (APH) we build models to predict overall risk levels, medical and pharmaceutical costs, events such as admissions and readmissions, and conditions such as heart attacks and strokes. We are constantly working to improve the accuracy and precision of those models, and to add more areas of prediction into the mix. In this post, we’ll give just a small taste of what it takes to turn healthcare data into accurate predictive models.
The first step in building a predictive model is deciding which features of the data to use, and how to represent those features. In healthcare, how we categorize and utilize patient / member data has an enormous impact on how the model will function. For example, diagnosis codes can be used to categorize records into more general groups and comorbidities. Lab results can help to show trends over time, or can be used as binary thresholds. Some diagnoses, such as a broken leg, should expire after a reasonable interval, while others, such as hemophilia, are permanent conditions. Intelligently shaping our input data is a practice called feature engineering, and it represents the creative process of applying domain knowledge to choose a reasonable set of variables on which to make predictions. Our current generation of predictive models is based on a combination of hand-engineered features, features based on published studies and literature, and features based on statistical analysis of those patients who are over- or under-predicted.
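As a small illustration of this kind of feature engineering, here is a minimal Python sketch. The diagnosis code prefixes, expiry windows, and lab threshold below are hypothetical, chosen only to show the idea of expiring acute diagnoses and thresholding lab values:

```python
from datetime import date, timedelta

# Hypothetical expiry windows: acute diagnoses expire, chronic ones never do.
ACUTE_WINDOW = {"S82": timedelta(days=180)}   # e.g., leg fracture codes
CHRONIC = {"D66"}                             # e.g., hemophilia

def diagnosis_active(code_prefix, diagnosed_on, as_of):
    """Return True if a diagnosis should still count as a feature on a given date."""
    if code_prefix in CHRONIC:
        return True
    window = ACUTE_WINDOW.get(code_prefix)
    if window is None:
        return True  # default: keep the diagnosis active
    return as_of - diagnosed_on <= window

def a1c_flag(lab_value, threshold=6.5):
    """Turn a continuous lab result into a binary threshold feature."""
    return 1 if lab_value >= threshold else 0

# A fracture diagnosed a year ago no longer contributes a feature...
print(diagnosis_active("S82", date(2020, 1, 1), date(2021, 1, 1)))  # False
# ...but hemophilia always does.
print(diagnosis_active("D66", date(2010, 1, 1), date(2021, 1, 1)))  # True
print(a1c_flag(7.2))  # 1
```

In a real pipeline these rules would be driven by clinical code sets and domain review rather than hard-coded constants.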
Once that is complete, we start to train a machine learning algorithm on our data. Choosing the right algorithm to create a useful predictive model depends on many factors: the amount of training data available; how the algorithm handles heterogeneous data and missing values; whether the data is high-dimensional; and how the algorithm handles the bias / variance tradeoff. A machine learning method we use often is py-earth, an implementation of Multivariate Adaptive Regression Splines (MARS). We use py-earth for multiple reasons – it can handle large numbers of predictor variables, it automatically detects interactions among predictors, and it works well with missing data.
Multivariate adaptive regression splines were originally developed in the late 1980s and early 1990s by Jerome Friedman at Stanford University. The method combines a greedy stepwise variable selection routine with a fast method for detecting optimal knot placement in order to efficiently search for predictive variables and low-degree variable interactions. It uses the selected variables and interactions to construct a model that is a sum of products of hinge functions, which are a way for the model to account for non-linear effects. The hinge function is very simple – it is given by h(x) = max(x, 0).
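In code, the hinge function and a tiny MARS-style model – a sum of products of hinge functions, with made-up knots and coefficients purely for illustration – look like this:

```python
def h(x):
    """The hinge function: max(x, 0)."""
    return max(x, 0.0)

def toy_model(bmi, age):
    """A toy MARS-style model with invented knots (25, 50) and coefficients:
    a constant, two hinge terms, and one product-of-hinges interaction."""
    return (1.0
            + 0.5 * h(bmi - 25)
            + 0.2 * h(age - 50)
            + 0.01 * h(bmi - 25) * h(age - 50))

print(h(3.0))             # 3.0
print(h(-2.0))            # 0.0
print(toy_model(30, 60))  # 1.0 + 2.5 + 2.0 + 0.5 = 6.0
```

Because each hinge is zero on one side of its knot, the model behaves like a different linear function in different regions of the input space – that is how it captures non-linear effects.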
At each step of model fitting, py-earth selects a new variable and knot value and adds two new terms to the model of the form h(x – k) × p and h(k – x) × p, where x is the selected predictive variable, k is the selected knot location, and p is either 1 or a term from a previous iteration of the model. It then prunes terms from the resulting model based on a score which penalizes model complexity in order to avoid overfitting. What does this all mean? Let’s go through a simple example.
Suppose the first variable selected is a patient’s BMI, used to predict some outcome N. After the first step, the model has the form:

N̂ = c0,1 + c1,1 h(BMI – k1) + c2,1 h(k1 – BMI)

where k1 is the selected knot location and c0,1, c1,1 and c2,1 are coefficients fitted by linear regression. The symbol at the top of N is called a hat, and it indicates that it is an estimate for N. At the next step, the algorithm will select, say, age, and the new model will be:

N̂ = c0,2 + c1,2 h(BMI – k1) + c2,2 h(k1 – BMI) + c3,2 h(age – k2) + c4,2 h(k2 – age)
The next step may select an interaction, say between age and BMI. In this case, we will be adding terms that are multiples of some existing term, and we could end up with a model such as:

N̂ = c0,3 + c1,3 h(BMI – k1) + c2,3 h(k1 – BMI) + c3,3 h(age – k2) + c4,3 h(k2 – age) + c5,3 h(age – k2) h(BMI – k3) + c6,3 h(age – k2) h(k3 – BMI)
This process, called the forward pass, will likely go on for many more steps, but at some point the model will cease to improve and the forward pass will stop. Some terms will then be pruned to produce a final predictive model, say:

N̂ = c0 + c1 h(BMI – k1) + c2 h(age – k2) + c3 h(age – k2) h(BMI – k3)
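One step of the forward pass can be sketched in a few lines of Python. This is not the py-earth implementation – which uses a much faster knot search – but a bare-bones illustration of the idea: for a single candidate variable, try each observed value as a knot, fit the pair of hinge terms by least squares, and keep the knot with the lowest squared error:

```python
import numpy as np

def h(x):
    """Hinge function, vectorized."""
    return np.maximum(x, 0.0)

def best_knot(x, y):
    """Try each observed x value as a knot; return (knot, coefficients, sse)
    for the least-squares fit y ~ c0 + c1*h(x - k) + c2*h(k - x)."""
    best = None
    for k in np.unique(x):
        X = np.column_stack([np.ones_like(x), h(x - k), h(k - x)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((X @ coef - y) ** 2))
        if best is None or sse < best[2]:
            best = (float(k), coef, sse)
    return best

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x > 6, 2.0 * (x - 6), 0.0)  # noiseless data with a true knot at 6
knot, coef, sse = best_knot(x, y)
print(knot, sse)  # recovers a knot near 6 with near-zero error
```

In the real algorithm this search runs over every candidate variable and every existing model term p at once, which is why an efficient knot-search routine matters so much.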
The above is meant only as an illustration of the concept. In practice, we combine the method illustrated above with other techniques such as bagging, gradient boosting, cross-validation, normalization, and calibration techniques to produce a final predictive model suitable for production use.
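To give a flavor of one of those techniques, bagging trains the same learner on many bootstrap resamples of the data and averages the predictions, which reduces variance. Here is a stripped-down sketch that substitutes a trivial mean predictor for a full MARS model:

```python
import random
import statistics

def bagged_predict(train_y, n_models=50, seed=42):
    """Bagging with a trivial base learner (the sample mean):
    fit each 'model' on a bootstrap resample, then average their predictions."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        resample = [rng.choice(train_y) for _ in train_y]  # bootstrap sample
        predictions.append(statistics.mean(resample))      # base model output
    return statistics.mean(predictions)

data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(bagged_predict(data))  # close to the overall mean, 3.0
```

In production the base learner would be something like a MARS model rather than a simple mean, but the resample-fit-average structure is the same.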
While all of these techniques are quite interesting and powerful, they are nothing without good data on which to train them. In healthcare, a major limiting factor for machine learning is the availability and quality of training data – in a sense, a predictive model is only as good as the data set it’s based on. With a larger and more robust data set, machine learning can reliably capture more and more complex relationships among variables. The APH data set, which consists of medical and pharmacy claims, labs, biometrics and more data from millions of individuals across a variety of ages, is what makes this kind of predictive modeling possible, useful, and even a little exciting.
To learn more about how Advanced Plan for Health’s phenotype predictive analytics capabilities are helping our clients to more proactively understand costs and risks, and address them as early as possible, contact us here, or call us at (888) 600-7566.
About our authors:
Jason Rudy is a data scientist and programmer with over six years of experience specializing in healthcare. He has his M.S. in bioinformatics and medical informatics. He is an author of the machine learning package py-earth as well as several other science and healthcare-related Python and R packages. Jason lives in San Francisco and programs chess engines in his free time.
Matt Lewis is a programmer and product manager with experience working with healthcare data. He’s worked on software projects in industries ranging from the federal government to workforce startups. Matt lives in Portland, Oregon, and has a degree in Politics, Philosophy, and Economics from Claremont McKenna College.