ON MEASURING MACHINE LEARNING MODELS AGAINST CONCRETE BUSINESS OBJECTIVES

REVIEW NOTES: DATA SCIENCE FOR BUSINESS BY PROVOST & FAWCETT: CHAPTER 7

I enjoyed reading this chapter. It’s insightful and well explained, with detailed examples, diagrams and graphics, on a few data science topics that correspond directly to conventional scientific research in computer science. That makes me happy, because these are crucial points, yet they are rarely the focus of Kaggle competitions, books on machine learning or statistics, or the latest and greatest TensorFlow, PyTorch and AutoML libraries, and they are too infrequently discussed in DL/AI/ML social posts and blogs. Below I have written up the points that are well worth taking home. These topics are, broadly:

  • Careful consideration of what is desired from data science results.
  • Expected value as a key evaluation framework.
  • Consideration of appropriate comparative baselines for machine learning models.

Back at the university, when looking at new grants, proposals or projects with Jaguar Land Rover, Bristol Robotics Labs and the like, the first point of decision was to set a sensible accuracy target and an acceptable range. This provided the benchmark for assessing progress and completion. It also determined whether enough domain knowledge or related work was known to establish that target, and whether the objective was truly focused on a beneficial outcome. Reading this chapter by Provost and Fawcett reminded me (and, to be honest, reassured me) that both groups are on the same mission: end-goal-focused objectives.

Deeply Consider the Desired Outcomes

Assign a purpose, an action and a cost to each of the FN, FP, TN and TP cells of the confusion matrix, and assign a numeric value for each cost and benefit. Via the Expected Value framework, we can then measure a classifier model’s costs and benefits.

  • The TP and TN cells of a confusion matrix are the correctly classified cases: the model predicts positive and it is the case (True Positive), or it predicts negative and it is not the case (True Negative). Assigning a meaning to Positive and Negative in the context of the problem is vital to begin determining which outcomes we want to aim for. The FP and FN costs might be very different.
  • In the Customer Churn example, a Positive might be a customer who will churn (P) and a Negative a customer who will stay (N). The cost of misclassifying a customer who will churn and failing to send an incentive (FN) could be much higher than the cost of misclassifying a customer who will not churn and sending an incentive (FP). In the FP case, sending an incentive could cost £1 plus a customer discount of £5 over 12 months, for a total of -£6. In the FN case, failing to send one could mean losing the customer and the associated profit, e.g. -£60 over 12 months (assuming a subscription with £5 profit per month). For a classification model tested with 15 FN cases and 10 FP cases, the cost of the FNs = 15 x £60 = -£900 and the cost of the FPs = 10 x £6 = -£60 (the sketch after this list scripts this arithmetic). If we combine the benefits (TP, TN) of the data science task using the Expected Value framework (factoring in class priors, below), then we can determine the expected value (cost/benefit outcome) of the data science and business task.
  • In the Marketing Ad Campaign example, the cost of mailing promotional materials to one contact was measured, as was the benefit (profit) of making a sale via that promotion.
  • Assign a cost to running the data analysis and assessment task itself, for example where additional computing or human resources are required. This can then be compared against “sending to all”, “sending to a random set” or “sending to none”.
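
As a quick illustration, here is a minimal Python sketch of the churn cost arithmetic above; the counts and per-case costs are the illustrative figures from the example, not real data.

    # Illustrative figures from the churn example above, not real data.
    fn_count, fp_count = 15, 10          # churners missed / non-churners incentivised
    fn_cost = -60                        # £5 profit per month lost over 12 months
    fp_cost = -6                         # £1 mailing plus a £5 discount

    total_fn_cost = fn_count * fn_cost   # 15 x -£60 = -£900
    total_fp_cost = fp_count * fp_cost   # 10 x -£6  = -£60
    print(total_fn_cost + total_fp_cost) # combined cost of the model's errors: -£960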

Expected Value Framework

This is a mathematical framework for assigning values to outcomes and probabilities to decisions, and later for assigning concrete costs to classification models.

  • Start by assigning a purpose, action and cost to the TN, TP, FN and FP cases of the confusion matrix for the task at hand.
  • Train the classification model (on a balanced dataset) and test it (on a test set that preserves the original class distribution). From the test results, fill in the raw count of classified instances in each quadrant of a confusion matrix.
  • To produce a cost/benefit figure per individual case, convert those counts to probabilities. Using the class priors (the proportion or number of instances in each class of the test dataset), factor the rate in each quadrant by its class prior to give the probability of that quadrant. This gives a confusion matrix that represents both the model’s performance and the probability of each quadrant’s case occurring in a live scenario.
  • For the data science task at hand, for example Incentivise to Mitigate Customer Churn, assign actual costs to each quadrant; that is, assign a cost/benefit to each ‘purpose and action’ quadrant.
  • Finally, multiply the cost/benefit of each action by the actual count or probability of classified cases in that quadrant, and sum over the quadrants.
  • Using the per-case probabilities leads to an Expected Cost/Benefit Value per individual case; using the raw classified counts leads to an Expected Cost/Benefit Value for the whole data science task. Both provide an evidence-driven basis for deciding whether or not to proceed with the business task (see the sketch below).
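
Here is a minimal sketch of that calculation in Python with NumPy. The FN and FP counts follow the churn example above; the TP and TN counts and the cost/benefit values are my own illustrative assumptions, not figures from the book.

    import numpy as np

    # Confusion-matrix counts from a hypothetical test set, laid out as:
    #                   predicted P   predicted N
    counts = np.array([[ 40,           15],     # actual P (churn): TP, FN
                       [ 10,          935]])    # actual N (stay):  FP, TN

    # Assumed cost/benefit (in £) of the action taken in each quadrant:
    # TP: 12 months' profit retained minus the incentive; FN: lost profit;
    # FP: wasted incentive; TN: no action, no cost.
    values = np.array([[ 54, -60],
                       [ -6,   0]])

    probabilities = counts / counts.sum()        # p(actual, predicted) per quadrant

    ev_per_case = (probabilities * values).sum() # expected value per individual case
    ev_total    = (counts * values).sum()        # expected value for the whole task
    print(ev_per_case, ev_total)                 # 1.2  1200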

Benchmarking

Set a basic ML model as a benchmark to improve against, such as a decision stump or, even simpler, the majority class, or a base assumption (such as “same as previous occurrence”). The reason is that pure classification accuracy is often not a good indicator of business value, but a comparison against a known and understandable benchmark has meaning for all stakeholders. In academic research, where peers will know the latest algorithms for a particular problem, those algorithms provide the benchmark against which a new solution should be measured. That differs for business managers and directors, who bring empirical experience and a honed expert technique for decision making. If the model’s performance cannot reach and exceed that alternative (existing) technique’s performance, then there is no incentive to pursue the new method.
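
The simple benchmarks mentioned above are easy to set up in scikit-learn; here is a minimal sketch on synthetic data (the dataset and any scores it prints are purely illustrative).

    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic, imbalanced data standing in for a real business dataset.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)  # decision stump

    print("majority-class baseline:", majority.score(X_test, y_test))
    print("decision stump:", stump.score(X_test, y_test))

Any candidate model should beat these simple baselines before it is worth comparing against the existing expert technique.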

The authors identify the importance of carefully considering the meaning of TP, TN, FP and FN, their purpose (or action), and their cost in the context of the business task at hand. Each insight that guides a business decision and a business action has an associated cost. In the context of business operations, model classification accuracy is an abstract vanity metric; it is detached from the underlying costs and benefits of the business actions and tasks.

Carefully select and calculate each metric for its specific business task, in the context of its meaning, action and costs. Such a metric may not (and is unlikely to) fall under the typical classification model metrics of Recall, Precision, F-measure, Sensitivity, Specificity, AUC-ROC, etc. Use that selected metric to compare and benchmark models, or to re-evaluate models when new or updated data is available.
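
One way to make such a business metric reusable is to wrap it as a scorer; here is a minimal sketch, reusing the churn cost figures from earlier (the cost values and the expected_cost helper are illustrative, not from the book).

    from sklearn.metrics import confusion_matrix, make_scorer

    def expected_cost(y_true, y_pred, fn_cost=-60, fp_cost=-6):
        # Business cost of the model's errors, using the illustrative churn figures.
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        return fn * fn_cost + fp * fp_cost

    # Wrap it so it can be used anywhere sklearn accepts a scoring argument,
    # e.g. cross_val_score(model, X, y, scoring=business_scorer).
    business_scorer = make_scorer(expected_cost, greater_is_better=True)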

Dealing with Imbalanced Datasets, Labelling via Semi-Supervised Learning, and Class Priors

The majority of real data is sourced from imbalanced observation sets. These skewed or imbalanced classes are said to have class priors: knowledge, or a numerical expression, of each class, i.e. the count or proportion of instances in each class. Of course, that data needs to have been labelled in the first place, which can be carried out manually, artificially, or via a semi-supervised learning process such as Label Propagation or Label Spreading (both in sklearn), “Copy-Paste Modelling” (label some, train a model, have it label more, repeat), or using the Pomegranate library (in Python).
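
As a pointer to the sklearn route, here is a minimal sketch of Label Propagation on synthetic data; the dataset, the fraction of hidden labels and the printed agreement score are illustrative only.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.semi_supervised import LabelPropagation

    X, y = make_classification(n_samples=500, random_state=0)

    # Hide ~70% of the labels; sklearn marks unlabelled instances with -1.
    rng = np.random.default_rng(0)
    y_partial = y.copy()
    hidden = rng.random(len(y)) < 0.7
    y_partial[hidden] = -1

    model = LabelPropagation().fit(X, y_partial)
    inferred = model.transduction_                 # labels propagated to every instance
    print((inferred[hidden] == y[hidden]).mean())  # agreement with the hidden labels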

Training a model using an artificially balanced training set seems (according to the examples in the chapter) to improve the model’s classification performance. My own straightforward reasoning is that a model of a class will fit better if it has seen more samples (of equally high quality) rather than fewer (allowing for over-/under-fitting). Thus a skewed dataset will tend toward a skewed classifier; whether that is desirable depends on the use case. A balanced training dataset can be selected using the class priors (the proportional numbers of instances), and the corresponding test set can be selected from the remaining instances.

Testing must be performed on a true representation of the dataset population. The skewed test sample allows us to transfer meaningful insights to the live scenario.
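
A minimal sketch of that scheme, assuming X and y are NumPy arrays; the helper name and the split proportion are my own illustrative choices.

    import numpy as np
    from sklearn.model_selection import train_test_split

    def balanced_train_skewed_test(X, y, test_size=0.3, seed=0):
        # Hold out a stratified test set first, so it keeps the original
        # (skewed) class priors for a realistic evaluation.
        X_rest, X_test, y_rest, y_test = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed)

        # Downsample each remaining class to the size of the rarest one,
        # giving a balanced training set.
        rng = np.random.default_rng(seed)
        classes, counts = np.unique(y_rest, return_counts=True)
        n_per_class = counts.min()
        train_idx = np.concatenate([
            rng.choice(np.flatnonzero(y_rest == c), n_per_class, replace=False)
            for c in classes
        ])
        return X_rest[train_idx], y_rest[train_idx], X_test, y_test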

The process of comparing models is not complete until the costs of FPs and FNs (for example) have been measured in the context of the business action’s benefits and costs. The Benchmarking and Expected Value sections above describe this mechanism for comparison.


Thank you to Foster Provost and Tom Fawcett for writing their 2013 book; find more details on GoodReads.
Thank you to Anne Nygård for the post feature photo. ^^
