Since machine learning is a panacea, your company should be able to use it profitably, right? Perhaps; perhaps not. OK, I’m just kidding about the panacea: that’s just marketing hype. Let’s discuss whether you have what it takes to harness artificial intelligence — and how you could get to that point if you’re not yet there.
To begin with, do you know what you want to predict or detect? Do you have enough data to analyze to build predictive models? Do you have the people and tools you need to define and train models? Do you already have statistical or physical models to give you a baseline for predictions?
Here, we’ll break down what you need for your AI and ML projects to succeed, discussing the ramifications of each requirement to help you ascertain whether your organization is truly ready to leverage machine learning, deep learning, and artificial intelligence.
You have plenty of data
Sufficient relevant data is the sine qua non of predictions and feature identification. With it, you might succeed; without it, you can’t. How much data do you need? The more factors you’re trying to take into account, the more data you require, whether you’re doing ordinary statistical forecasting, machine learning or deep learning.
Take the common problem of predicting sales, such as how many navy blue short-sleeved blouses you will sell next month in Miami, and how many of those you need to have in stock in your Miami store and your Atlanta warehouse to avoid back-orders without tying up too much money and shelf space in stock. Retail sales are highly seasonal, so you need statistically significant monthly data from multiple years to be able to correct for month-to-month variations and establish an annualized trend, and that’s just for standard time-series analysis. Machine learning needs even more data than statistical models, and deep learning models need multiples more than that.
One statistical model you might build would analyze your chain’s monthly blouse sales nationally over 5 years, and use that aggregate to predict total blouse sales for next month. That number might be in the hundreds of thousands (let’s say it’s 300,000). Then you could predict blouse sales in Miami as a percentage of national sales (let’s say it’s 3%), and independently predict blue short-sleeved blouse sales as a percentage of total blouse sales (let’s say it’s 1%). That model points to approximately 90 sales of blue short-sleeved blouses in Miami next month. You can do sanity checks on that prediction by looking at year-over-year same-store sales for a variety of products with special attention to how much they vary from the model predictions.
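The arithmetic behind that top-down estimate can be sketched in a few lines of Python. The function name and its parameters are illustrative, and the figures are the assumed ones from the example, not real sales data.

```python
def top_down_forecast(national_units, store_share, item_share):
    """Scale a national forecast down to one store and one item.

    Assumes the store's share and the item's share are independent,
    as in the example in the text.
    """
    return national_units * store_share * item_share


# Assumed figures from the example: 300,000 blouses nationally,
# 3% of them sold in Miami, 1% of all blouses blue and short-sleeved.
miami_blue_short_sleeved = top_down_forecast(300_000, 0.03, 0.01)
print(round(miami_blue_short_sleeved))  # 90
```

The independence assumption is the weak point: if blue short-sleeved blouses sell disproportionately well in Miami, multiplying the two percentages will underestimate demand, which is one reason the year-over-year sanity checks matter.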
Now, suppose you want to take into account external factors such as weather and fashion trends. Do short-sleeved blouses sell better when it is hotter or sunnier than when it is cooler or rainier? Probably. You can test that by including historical weather data in your model, although it might be a little unwieldy to do so with a time-series statistical model. So you might try decision forest regression, and while you’re at it try the other 7 kinds of machine learning models for regression (see screenshot above). Then compare the “cost” (a normalized error function) for each model when tested against last year’s actual results to find the best model.
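The comparison step can be sketched as follows. This is a minimal illustration, not any particular tool's method: I am assuming normalized root-mean-square error as the “cost,” the hold-out numbers are made up, and the two lambdas are stand-ins for whatever trained models you actually have.

```python
from math import sqrt


def normalized_rmse(actual, predicted):
    """Root-mean-square error divided by the mean of the actual values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and the same length")
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return sqrt(mse) / (sum(actual) / len(actual))


def pick_best_model(models, features, actual):
    """Score every candidate on the hold-out data and return the cheapest one."""
    costs = {name: normalized_rmse(actual, predict(features))
             for name, predict in models.items()}
    best = min(costs, key=costs.get)
    return best, costs


# Made-up hold-out data: average monthly temperature (deg C) and units sold.
holdout_temps = [20.0, 25.0, 30.0]
holdout_sales = [80.0, 90.0, 100.0]

# Hypothetical stand-ins for trained models; each maps features to predictions.
candidates = {
    "flat_average": lambda temps: [90.0 for _ in temps],
    "linear_in_temperature": lambda temps: [40.0 + 2.0 * t for t in temps],
}

best_name, all_costs = pick_best_model(candidates, holdout_temps, holdout_sales)
print(best_name, all_costs)
```

Scoring every candidate against the same held-out year is what makes the costs comparable; a model scored on the data it was trained on will always look better than it is.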
Is navy blue going to sell better or worse next month than it did the same time last year? You can look at all monthly sales of navy blue clothing and predict annual fashion trends, and perhaps fold that into your machine learning models. Or you might need to apply a manual correction (a.k.a. “a wild guess”) to your models based on what you hear from the fashion press. (“Let’s bump that prediction up by 20% just in case.”)
Perhaps you want to do even better by creating a deep neural network for this prediction. You might discover that you can improve the regression error by a few percent for each hidden layer that you add, until at some point the next layer doesn’t help any more. The point of diminishing returns might come because there are no more features to recognize in the model, or more likely because there just isn’t enough data to support more refinements.
You have enough data scientists
You may have noticed that a person had to build all the models discussed above. No, it isn’t a matter of dumping data into a hopper and pressing a button. It takes experience, intuition, an ability to program and a good background in statistics to get anywhere with machine learning, no matter what tools you use — despite what vendors may claim.
Certain vendors in particular tend to claim that “anyone” or “any business role” can use their pre-trained applied machine learning models. That might be true if the model is for exactly the problem at hand, such as translating written formal Quebecois French to English, but the more usual case is that your data isn’t well-fit by existing trained machine learning (ML) models. Since you have to train the model, you’re going to need data analysts and data scientists to guide the training, which is still more an art than it is engineering or science.
One of the oddest things about hiring data scientists is the posted requirements, especially when compared to the actual skills of those hired. The ads often say “Wanted: Data Scientist. STEM Ph.D. plus 20 years experience.” The first oddity is that the field hasn’t really been around for 20 years. The second oddity is that companies hire 26-year-olds right out of grad school — that is, with no work experience at all outside of academia, much less 20 years — in preference to people who already know how to do this stuff, because they are afraid that senior people will be too expensive, and despite the fact that they asked for 20 years of experience. Yes, it’s hypocritical, and most likely illegal age discrimination, but that’s what’s been happening.
You track or acquire the factors that matter
Even if you have gobs of data and plenty of data scientists, you may not have data for all the relevant variables. In database terms, you may have plenty of rows but be missing a few columns. Statistically, you may have unexplained variance.
Measurements for some independent variables such as weather observations are easily obtained and merged into the dataset, even after the fact. Other factors may be difficult, impractical or expensive to measure or acquire, even if you know what they are.
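Merging an after-the-fact variable such as weather into existing rows amounts to a keyed join. Here is a minimal sketch in plain Python; the field names (`date`, `city`, `avg_temp_c`, `rain_mm`) are hypothetical, and in practice you would more likely do this in SQL or a data-frame library.

```python
def merge_weather(sales_rows, weather_rows):
    """Left-join weather observations onto sales rows by (date, city).

    Sales rows with no matching observation keep None for the weather
    columns, so the gap stays visible instead of silently vanishing.
    """
    weather_by_key = {(w["date"], w["city"]): w for w in weather_rows}
    merged = []
    for row in sales_rows:
        weather = weather_by_key.get((row["date"], row["city"]), {})
        merged.append({
            **row,
            "avg_temp_c": weather.get("avg_temp_c"),
            "rain_mm": weather.get("rain_mm"),
        })
    return merged


# Made-up rows for illustration only.
sales = [
    {"date": "2017-01", "city": "Miami", "units": 90},
    {"date": "2017-02", "city": "Miami", "units": 85},
]
weather = [
    {"date": "2017-01", "city": "Miami", "avg_temp_c": 20.5, "rain_mm": 40.0},
]
merged = merge_weather(sales, weather)
```

The second sales row ends up with `None` in both weather columns, which is exactly the kind of missing column value that the cleaning step later has to deal with.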
Let’s use a chemical example. When you’re plating lead onto copper, you can measure the temperature and concentration of the fluoroboric acid plating bath, and record the voltage across the anodes, but you won’t get good adherence unless the bath contains enough peptides, and not too many. If you didn’t weigh the peptides you put into the bath, you won’t know how much of this critical catalyst is present, and you’ll be unable to explain the variations in the plate quality using the other variables.
You have ways to clean and transform the data
Data is almost always noisy. Measurements may be missing one or more values, individual values may be out of range by themselves or inconsistent with other values in the same measurement, electronic measurements may be inaccurate because of electrical noise, people answering questions may not understand them or may make up answers, and so on.
The data filtering step in any analysis process often takes the most effort to set up — in my experience, 80% to 90% of total analysis time. Some shops clean up the data in their ETL (extract, transform, and load) process so that analysts should never see bad data points, but others leave all data in the data warehouse or data lake with an ELT (with the transform step at the end) process. That means that even the clearly dirty data is saved, on the theory that the filters and transformations will need to be refined over time.
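A filter of the kind described above might look like the following sketch. The function name, field names and valid ranges are my own assumptions for illustration; real filters are usually far more elaborate. Note that the dirty records are set aside with their reasons rather than deleted, in the spirit of the ELT approach.

```python
def split_clean_and_dirty(records, valid_ranges):
    """Separate records into clean ones and dirty ones with reasons.

    valid_ranges maps a field name to an inclusive (low, high) range.
    A record is dirty if any listed field is missing or out of range.
    """
    clean, dirty = [], []
    for record in records:
        problems = []
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if value is None:
                problems.append(f"{field}: missing")
            elif not (low <= value <= high):
                problems.append(f"{field}: {value} outside [{low}, {high}]")
        if problems:
            dirty.append((record, problems))
        else:
            clean.append(record)
    return clean, dirty


# Hypothetical sensor readings and plausibility limits.
readings = [
    {"temp_c": 21.0, "humidity_pct": 55.0},
    {"temp_c": 480.0, "humidity_pct": 50.0},   # out of range
    {"temp_c": 19.5},                          # humidity missing
]
limits = {"temp_c": (-40.0, 60.0), "humidity_pct": (0.0, 100.0)}
clean_rows, dirty_rows = split_clean_and_dirty(readings, limits)
```

Keeping the reasons alongside each rejected record makes it easier to refine the ranges later, when you discover that some of the "bad" points were real.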
Even accurate filtered data may need to be transformed further before you can analyze it well. Like statistical methods, machine learning models work best when there are similar numbers of rows for each possible state, which may mean reducing the number of the most popular states by random sampling. Again as with statistical methods, ML models work best when the ranges of all variables have been normalized.
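Both transformations are simple to sketch. Below, I am assuming random downsampling to the size of the smallest class and min-max scaling to the range 0 to 1; those are common choices but by no means the only ones, and the function names are illustrative. Both functions assume non-empty input.

```python
import random
from collections import defaultdict


def downsample_to_smallest_class(rows, label_key, seed=0):
    """Randomly sample every class down to the size of the rarest class."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced


def min_max_normalize(values):
    """Rescale values linearly so the smallest is 0.0 and the largest 1.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - low) / (high - low) for v in values]


# Made-up, heavily imbalanced data: 94 rows of one label, 6 of the other.
rows = [{"label": "a", "amount": i} for i in range(94)]
rows += [{"label": "b", "amount": i} for i in range(6)]
balanced_rows = downsample_to_smallest_class(rows, "label")
scaled = min_max_normalize([10, 20, 30])
```

Downsampling throws data away, so it only makes sense when the rare class still leaves you with enough rows to train on; otherwise you would look at oversampling or class weights instead.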
For example, an analysis of Trump and Clinton campaign contributions done in Cortana ML shows how to prepare a dataset for machine learning by creating labels, processing data, engineering additional features and cleaning the data; the analysis is discussed in a Microsoft blog post. This analysis does several transformations in SQL and R to identify the various committees and campaign funds as being associated with Clinton or Trump, to identify donors as probably male or female based on their first names, to correct misspellings, and to fix the class imbalance (the data set was 94% Clinton transactions, mostly small donations). I showed how to take the output of this sample and feed it into a two-class logistic regression model in my “Get started” tutorial for Azure ML Studio.
You’ve already done statistical analyses on the data
One of the big sins in data analysis and problem solving is jumping to cause. Before you can figure out what happened and why, you need to step back and look at all the variables and their correlations.
Exploratory data analysis can quickly show you the ranges and distributions of all the variables, whether pairs of variables tend to be dependent or independent, where the clusters lie, and where there may be outliers. When you have highly correlated variables, it’s often useful to drop one or the other from the analysis, or to perform something akin to stepwise multiple linear regression to identify the best selection of variables. I don’t mean to imply that the final model will be linear, but it’s always useful to try simple linear models before introducing complications; if you have too many terms in your model relative to your data, you can wind up with an underdetermined system that overfits.
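One small piece of that exploration, flagging pairs of variables that are nearly redundant, can be sketched with a hand-rolled Pearson correlation. The 0.9 threshold and the column names are assumptions for illustration; choosing which member of a pair to drop is still a judgment call.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Made-up columns: the two temperature columns are the same measurement
# in different units, so one of them is redundant.
columns = {
    "temp_c": [20, 25, 30, 35],
    "temp_f": [68, 77, 86, 95],
    "rain_mm": [5, 0, 7, 1],
}
redundant = highly_correlated_pairs(columns)
```

Here only the Celsius and Fahrenheit columns are flagged, which is the obvious case; the interesting ones in real data are the pairs you did not expect.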
You test many approaches to find the best models
There’s only one way to find the best model for a given data set: try all of them. If your objective is in a well-explored but challenging domain such as photographic feature identification and language recognition, you may be tempted to try only the “best” models from contests, but unfortunately those are often the most compute-intensive deep-learning models, with convolutional layers in the case of image recognition and long short-term memory (LSTM) layers for speech recognition. If you need to train those deep neural networks, you may need more computing power than you have in your office.
You have the computing capacity to train deep learning models
The bigger your dataset, and the more layers in your deep learning model, the more time it takes to train the neural network. Having lots of data helps you to train a better model, but hurts you because of the increase in training time. Having lots of layers helps you identify more features, but also hurts you because of the increase in training time. You probably can’t afford to wait a year to train each model; a week is more reasonable, especially since you will most likely need to tune your models tens of times.
One way around the training time issue is to use general purpose graphics processing units (GPGPUs), such as those made by Nvidia, to perform the vector and matrix computations (also called linear algebra) underlying neural network layers. One K80 GPU and one CPU together often give you 5 to 10 times the training speed of just the CPU if you can get the whole “kernel” of the network into the local memory of the GPU, and with a P100 GPU you can get up to 100 times the training speed of just the CPU.
Beyond a single GPU, you can set up coordinated networks of CPUs and GPUs to solve bigger problems in less time. Unless you train deep learning models year-round and have a huge capital budget, you might find renting time on a cloud with GPUs to be your most cost-effective option. Several deep learning frameworks, including CNTK, MXNet and TensorFlow, support parallelized computation with CPUs and GPUs, and have demonstrated reasonable scaling coefficients (~85% in one test) for networks of very large virtual machine (VM) instances with capable GPUs. You can find these frameworks and more already installed into VM instances with GPU support on the major cloud providers.
Your ML models outperform your statistical models
Your simple statistical models set the bar for your work with machine learning and deep learning. If you can’t raise the bar with a given model, then you should either tune it or try a different approach. Once you know what you’re doing, you can set up trainings for many models in parallel under the control of a hyperparameter tuning algorithm, and use the best results to guide the next stage of your process.
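The "raise the bar" test is easy to make explicit. In this sketch I am assuming mean absolute error as the metric and an optional minimum relative improvement; both are illustrative choices, and the numbers are made up.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def beats_baseline(actual, baseline_pred, candidate_pred, min_improvement=0.0):
    """True if the candidate's error is below the baseline's error.

    min_improvement is the fraction by which the candidate must beat
    the baseline, e.g. 0.05 demands at least a 5% lower error.
    """
    baseline_error = mean_absolute_error(actual, baseline_pred)
    candidate_error = mean_absolute_error(actual, candidate_pred)
    return candidate_error < baseline_error * (1.0 - min_improvement)


# Made-up hold-out results for illustration.
actual_sales = [90, 100, 110]
statistical_baseline = [100, 100, 100]
ml_candidate = [92, 99, 108]
keep_candidate = beats_baseline(actual_sales, statistical_baseline, ml_candidate)
```

Demanding a minimum improvement is worth considering, since a model that is only marginally better than the baseline may not justify the extra training and maintenance cost.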
You are able to deploy predictive models
Ultimately, you are going to want to apply your trained models in real time. Depending on the application, the prediction may run on a server, in a cloud, on a personal computer or on a phone. The deep learning frameworks offer various options for embedding their models into web and mobile apps. Amazon, Google and Microsoft have all demonstrated the practicality of this by producing consumer devices and smartphone apps that understand speech.
You are able to update your models periodically
If you’ve trained your own model on your data, you may find that the model’s error rates (false positives and false negatives) increase over time. That’s basically because the data drifts over time: your sales patterns change, your competition changes, styles change and the economy changes. To accommodate this effect, most deep learning frameworks have an option for retraining the old model on your new data and replacing the predictive service with the new model. If you do this on a monthly basis you should be able to stay on top of the drift. If you can’t, your model will eventually become too stale to be reliable.
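Watching for drift can be as simple as tracking the error rate month by month and comparing it with the rate you measured at deployment. The sketch below is one hedged way to do that; the three-month window and the 0.05 tolerance are arbitrary assumptions, and the function names are mine.

```python
def error_counts(predictions, actuals):
    """Count false positives and false negatives for boolean labels."""
    false_pos = sum(1 for p, a in zip(predictions, actuals) if p and not a)
    false_neg = sum(1 for p, a in zip(predictions, actuals) if not p and a)
    return false_pos, false_neg


def needs_retraining(monthly_error_rates, baseline_rate, tolerance=0.05, window=3):
    """Flag drift when the recent average error exceeds baseline + tolerance."""
    recent = monthly_error_rates[-window:]
    if not recent:
        return False
    return sum(recent) / len(recent) > baseline_rate + tolerance


# Made-up monthly error rates since deployment, with a 10% starting rate.
history = [0.10, 0.11, 0.12, 0.18, 0.20, 0.22]
retrain_now = needs_retraining(history, baseline_rate=0.10)
```

Averaging over a window keeps one bad month from triggering a retrain, at the cost of reacting a little later to a real shift.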
To return to our initial questions, do you know what you want to predict or detect? Do you have enough data to analyze to build predictive models? Do you have the people and tools you need to define and train models? Do you already have statistical or physical models to give you a baseline for predictions?
If so, what are you waiting for?