Even with advances in machine learning (ML) techniques, computing power and programming accessibility in the last five years, approaching an ML question can be a daunting process. How do you take reams of messy, unstructured or incomplete data and turn it into something of value? This is a question my company — a machine learning solutions provider — has been working on for years, and one that I’ve spent a lot of time thinking about.
Every problem is different, but there are several key steps that are fairly constant in the ML estimation process. Following these steps won’t solve every issue that might arise, but it may answer some common questions and help you get started.
Identify the Question
This seems obvious, but it’s one of the most common issues we see in the process of machine learning. Algorithms and techniques are only as accurate as the question put to them, so spending time formulating exactly what you want to know can save a great deal of trouble down the line. At its core, a good ML question is specific, both substantively and technically. Generally stated questions can be a useful starting point for thinking about a given business problem, but when building a predictive engine, both the inputs and the outputs have to be specified very clearly.
- General question: “How can we improve customer satisfaction?” This is a good jumping-off point to think about the problem, but it does not translate into a statistical process.
- ML question: “What factors are most closely associated with negative customer reviews?” From the general question, we can identify a specific one: what aspects of the customer experience correlate with negative reviews.
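To make the specific version of the question concrete, here is a minimal sketch in plain Python that ranks a few candidate factors by how strongly they correlate with a negative-review flag. The records, the field names (`support_calls`, `minutes_on_hold`, `orders`, `negative_review`) and the helper `rank_factors` are all invented for illustration, and a simple Pearson correlation is only one of several reasonable ways to measure "closely associated."

```python
from math import sqrt

# Hypothetical customer records; every field name and value is invented.
customers = [
    {"support_calls": 0, "minutes_on_hold": 2, "orders": 5, "negative_review": 0},
    {"support_calls": 1, "minutes_on_hold": 3, "orders": 2, "negative_review": 0},
    {"support_calls": 4, "minutes_on_hold": 5, "orders": 4, "negative_review": 1},
    {"support_calls": 5, "minutes_on_hold": 1, "orders": 3, "negative_review": 1},
    {"support_calls": 0, "minutes_on_hold": 4, "orders": 1, "negative_review": 0},
    {"support_calls": 6, "minutes_on_hold": 6, "orders": 3, "negative_review": 1},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(var_x * var_y)


def rank_factors(rows, target):
    """Sort every non-target field by absolute correlation with the target."""
    factors = [name for name in rows[0] if name != target]
    outcome = [row[target] for row in rows]
    scores = {
        name: pearson([row[name] for row in rows], outcome) for name in factors
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = rank_factors(customers, "negative_review")
for name, r in ranking:
    print(f"{name:16s} r = {r:+.2f}")
```

A ranking like this only says which factors move together with negative reviews. It does not say which ones cause them, which is part of why the choice of statistical approach matters.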
Identify the Metric You Want to Estimate or Predict
This is another obvious issue, but one that comes up often. ML models, like any other numerical process, deal in concrete values. If you want to measure customer satisfaction, what metrics do you have? Are they explicit, such as star ratings, or implicit, like a drop-off in usage rate? Can you get a reliable set of previous records (a ‘training set’) that can be used to fit an ML model and predict future values?
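As a rough illustration of turning explicit and implicit signals into one concrete value, the sketch below derives a 0/1 "dissatisfied" label for each customer. The record layout, the two-star cutoff and the 50 percent usage-drop threshold are assumptions I made up for the example, not recommendations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CustomerRecord:
    customer_id: str
    star_rating: Optional[int]   # explicit signal; None if no review was left
    sessions_last_month: int     # implicit signal: usage before
    sessions_this_month: int     # implicit signal: usage now


def dissatisfied(record, max_stars=2, drop_threshold=0.5):
    """Return 1 (unhappy), 0 (fine) or None when there is no usable signal.

    The thresholds are illustrative assumptions, not tuned values.
    """
    if record.star_rating is not None:
        # Prefer the explicit metric whenever the customer gave one.
        return 1 if record.star_rating <= max_stars else 0
    if record.sessions_last_month > 0:
        # Fall back to the implicit metric: relative drop in usage.
        drop = 1 - record.sessions_this_month / record.sessions_last_month
        return 1 if drop >= drop_threshold else 0
    return None


records = [
    CustomerRecord("a", 1, 12, 10),
    CustomerRecord("b", 5, 8, 9),
    CustomerRecord("c", None, 10, 3),
    CustomerRecord("d", None, 10, 9),
    CustomerRecord("e", None, 0, 0),
]

# Only records with a usable label can go into the training set.
training_set = [
    (r.customer_id, dissatisfied(r)) for r in records if dissatisfied(r) is not None
]
print(training_set)
```

Records with neither kind of signal are left out rather than guessed at, which keeps the training set reliable at the cost of making it smaller.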
Identify the General Statistical Approach That You Need
Say you want to identify relationships between input and output features: you might expect that people who call customer support often or spend more time on the phone are more likely to leave critical reviews, but you want to know the degree to which this is the case. Here you might use a parametric model that explains these relationships in simple terms that can serve as a rule set going forward. Alternatively, you might be interested in prediction rather than explanation, such as reliably identifying disgruntled customers and making efforts to keep them interested. For the latter, you might want to use a more “black-box” ML model that allows for complex interdependencies between inputs and outputs in order to maximize accuracy.
Continuing with our call center example, imagine trying to identify in real time which of your callers might become unhappy, then rerouting these calls to a higher tier of support. If you wanted to build this, you’d need to hypothesize the reasons an individual caller would become disgruntled. Then, you’d need to build a model that factors in each of these variables.
On the other hand, “black-box” machine learning techniques automatically detect the right combination of model variables and weigh each variable’s relative contribution to the model. They automate about 80 percent of feature selection. For this approach, you’d feed all of your available data on these callers into the algorithm and rely on the computer to produce a predictive model. These models are powerful, but they are difficult to interpret. In short, “black-box” models can accurately identify what or who, and can sometimes tell you why, but they are not very good at answering the question “what if.”
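To show what the parametric, explanation-first route can look like, here is a small sketch of a logistic regression fitted by plain gradient descent on toy call center data. The two yes/no features (`frequent_caller`, `long_calls`), the counts in `cells` and the function `fit_logistic` are invented for this example. In practice you would more likely reach for an existing library than write the fitting loop yourself.

```python
from math import exp

# Toy data, invented for illustration. Each key is a pair of yes/no flags
# (frequent_caller, long_calls); each value is how many of the 4 customers
# in that group left a critical review.
cells = {(0, 0): 1, (1, 0): 2, (0, 1): 2, (1, 1): 3}

X, y = [], []
for features, unhappy in cells.items():
    for i in range(4):
        X.append(list(features))
        y.append(1 if i < unhappy else 0)


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, row))
            error = sigmoid(z) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


weights, bias = fit_logistic(X, y)

# exp(weight) is the odds ratio: how much the odds of a critical review
# are multiplied when the corresponding flag goes from 0 to 1.
odds_ratios = [exp(w) for w in weights]
for name, ratio in zip(["frequent_caller", "long_calls"], odds_ratios):
    print(f"{name}: odds ratio about {ratio:.2f}")
```

On this toy data each odds ratio comes out near 3, which reads directly as a rule: each flag roughly triples the odds of a critical review. That readability is the trade you make against the accuracy a black-box model might offer.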
Prepare the Input Data for Analysis
ML techniques are statistical engines that work with numbers. If you have significant data stored in messier formats such as email records, call logs, security footage and so on, this data will need to be converted into a tabular, numeric format. You might also have issues with missing or miscoded data: these problems need to be identified and solved as well. This can take significant time and effort. In my experience, you should expect to spend 80-90 percent of project time cleaning, manipulating and pre-processing data to ready it for analysis.
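A small sketch of what this conversion can involve: the snippet below takes a few hypothetical call-log rows with missing and miscoded values and turns them into a purely numeric table. The field names, the median fill and the extra "missing" indicator column are choices I made for the example. The right treatment depends on the data in front of you.

```python
from statistics import median

# Hypothetical raw call-log rows, as they might arrive from an export.
raw_calls = [
    {"duration_min": "12.5", "channel": "phone", "resolved": "yes"},
    {"duration_min": "",     "channel": "email", "resolved": "no"},
    {"duration_min": "7",    "channel": "phone", "resolved": "YES"},
    {"duration_min": "n/a",  "channel": "chat",  "resolved": "no"},
]


def to_float(text):
    """Parse a number, returning None for blank or miscoded entries."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def preprocess(rows):
    """Return (header, table) where every cell in table is a float."""
    durations = [to_float(r["duration_min"]) for r in rows]
    known = [d for d in durations if d is not None]
    fill = median(known) if known else 0.0
    channels = sorted({r["channel"] for r in rows})

    header = ["duration_min", "duration_missing"]
    header += [f"channel_{c}" for c in channels]
    header.append("resolved")

    table = []
    for r, d in zip(rows, durations):
        numeric = [d if d is not None else fill]      # fill gaps with median
        numeric.append(1.0 if d is None else 0.0)     # remember what was filled
        numeric += [1.0 if r["channel"] == c else 0.0 for c in channels]  # one-hot
        numeric.append(1.0 if r["resolved"].strip().lower() == "yes" else 0.0)
        table.append(numeric)
    return header, table


header, table = preprocess(raw_calls)
print(header)
for row in table:
    print(row)
```

Even this tiny example needs three separate decisions (how to parse, how to fill gaps, how to encode categories), which hints at why pre-processing eats so much of a project's time.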
For our own use, we built a system that automates the pre-processing of the data to reduce model development times. Some of our customers opt to use it for themselves, while others prefer to keep their data in-house.
Identify the Right Algorithm and Computing Platform
Interestingly, the choice of algorithm is often less crucial than it might seem. There are dozens, even hundreds, of algorithms that may prove useful in answering your ML question. However, nine times out of ten this question devolves into questions of availability and logistics. Availability deals with how easy it is to put an ML model into production. What coding platform do your data scientists use? Are there readily available ML packages or libraries that they can access to accomplish this task? If there are, use them! Some reasonably good ML packages are even open source and can be found in the standard distribution of R or in free Python packages like “scikit-learn.”
Logistics are often the more salient concern when building an ML system. Is the data set small enough that everything can be run on a laptop? Or is it large enough to necessitate distributed computing methods like Hadoop? Identifying logistical issues by asking the right questions early on can save both time and money.
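One way to ask the laptop-or-cluster question early is a back-of-the-envelope size estimate. The sketch below assumes the data ends up as a dense table of 8-byte floating-point values and that you want a few times the raw size free for working copies. Both the function names and the `headroom` factor of 3 are assumptions of mine, not a standard.

```python
def estimated_size_gb(n_rows, n_columns, bytes_per_value=8):
    """Approximate in-memory size of a dense numeric table, in gigabytes."""
    return n_rows * n_columns * bytes_per_value / 1e9


def fits_on_one_machine(n_rows, n_columns, ram_gb, headroom=3.0):
    """Rough check: raw size times an assumed headroom factor must fit in RAM."""
    return estimated_size_gb(n_rows, n_columns) * headroom <= ram_gb


# One million customers with 50 features on a 16 GB laptop.
print(estimated_size_gb(1_000_000, 50))
print(fits_on_one_machine(1_000_000, 50, ram_gb=16))

# Two billion rows with the same 50 features.
print(fits_on_one_machine(2_000_000_000, 50, ram_gb=16))
```

An estimate like this will not settle the platform question on its own, but it is a cheap first filter before anyone commits to distributed infrastructure.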
This is a lot of work summarized in five deceptively simple steps. However, tackling an ML solution in this rough order is, in my opinion, a good conceptual foundation. Clearly identifying the problem and the metric you want to estimate helps eliminate uncertainty or miscommunication between the data science team and other members of your organization. Setting up an appropriate statistical method, supported by adequate programming and computation infrastructure, will help answer this question clearly and quickly, with a minimum of technical and logistical issues. Identifying the pre-processing jobs will help set up a realistic timeline to completion, and maximize the leverage your ML approach can get when dealing with various data sources. The last step in this process is the easiest: start the algorithm running, cross your fingers and hope for the best.
A version of this post originally appeared on the Serial Metrics blog.