By Pamela Bleve, Data Scientist and Machine Learning Engineer at CounterTack
In today's blog post, we will learn about:
1. Imputing data in highly imbalanced datasets. The imbalance issue.
2. Training machine learning models with highly imbalanced datasets
Obtaining good training data is one of the most challenging problems when using machine learning for cyber security. Classifiers are only as good as the data used to train them, and reliably labeled data is especially important in supervised machine learning.
The imbalanced dataset problem has been attracting much attention in the machine learning community. Many real-world applications face the class imbalance problem. Examples include fault diagnosis, anomaly detection, fraud detection, computer vision, face recognition and medical diagnosis, where the issue is highly significant and the frequency of one class (e.g. cancer) can be 1000 times lower than that of another class (e.g. healthy patient).
As an example, the most common issue affecting classification in breast cancer malignancy grading is caused directly by the imbalanced number of cases across the malignancy classes. This poses a challenge for pattern recognition algorithms and often results in a significant reduction in classification accuracy for the minority class.
Another example from biology: in a microRNA dataset associated with cancer, the number of experimentally validated microRNAs is much greater than the number of microRNAs that are not associated with cancer. The minority class is generally treated as noise and is consequently overshadowed by the majority classes because of the highly imbalanced dataset used to train the classifier.
This hard problem is currently faced by all the organizations in the cyber security field as well.
This article will focus on highly imbalanced intrusion datasets and how to deal with the class imbalance problem when the organization doesn’t have exposure to a large amount and wide variation of attack traffic.
Classifying Network Endpoint Security attacks
One common cyber security goal is preventing bad actors from gaining unauthorized access to computer networks, and finding any intruder who has bypassed those defenses and may already be on the network.
The most frequently used approach to identifying security compromise events on a network is anomaly detection. In a real-time scenario, you have a billion examples of normal traffic in the network activity data, and only a few examples of abnormal/bad/malicious traffic, which belong to some of the following general categories:
- dos (Denial of Service)
- r2l (Unauthorized accesses from remote servers)
- u2r (Privilege escalation attempts)
- probe (Brute-force probing attacks)
It can be difficult to accurately predict the class distribution of real-life data. In a particular scenario, for instance, when designing a network attack classifier for deployment on a network that doesn’t contain any database servers, the sql-injection attack traffic might be near nonexistent.
Consider the class distribution of a training dataset built from these categories.
One obvious observation is that the classes are tremendously imbalanced. For instance, the u2r class is smaller than the dos class by three orders of magnitude in the training set. If we ignore this class imbalance and use the training data as is, there is a chance that the model will learn a lot more about the benign and dos classes than about the u2r and r2l classes, which can result in an undesirable bias in the classifier.
Even if we only want binary detection between benign and malicious activities (normal vs abnormal), it is a tough task to build a classifier on such imbalanced class examples, because the classifier can simply label everything as normal and still produce a very high classification accuracy. Anomaly detection algorithms were created for grossly imbalanced datasets. However, since the diversity found in networks makes modeling normality difficult and can lead to a high rate of false alarms, the anomaly detection problem can be turned into a binary classification problem.
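As a minimal sketch of that transformation, the attack categories listed earlier can be collapsed into a single malicious label. The label strings and the sample list below are assumptions made for illustration, not values from a real dataset.

```python
# Hypothetical label names: "benign" plus the four attack categories listed earlier.
ATTACK_CATEGORIES = {"dos", "r2l", "u2r", "probe"}


def to_binary(label: str) -> int:
    """Return 0 for benign traffic and 1 for any attack category."""
    if label == "benign":
        return 0
    if label in ATTACK_CATEGORIES:
        return 1
    raise ValueError(f"unknown label: {label!r}")


# Illustrative labels only, not real traffic data.
labels = ["benign", "dos", "benign", "u2r", "probe", "benign", "r2l"]
binary_labels = [to_binary(label) for label in labels]
print(binary_labels)  # [0, 1, 0, 1, 1, 0, 1]
```

Raising an error on unknown labels is a design choice here: it avoids silently counting an unexpected category as benign.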
Transform into Binary Classification:
If we take a look at this example:
# 0 == benign 700,000
# 1 == malicious 56,050
Only about 7.4% of the observations are malicious. Therefore, if we were to always predict 0, we'd achieve an accuracy of about 92.6%.
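The arithmetic behind that baseline can be checked directly. The short sketch below uses only the two class counts from the example; the variable names are ours.

```python
# Class counts from the example above.
n_benign = 700_000
n_malicious = 56_050

total = n_benign + n_malicious
minority_share = n_malicious / total
# Accuracy of a "classifier" that always predicts 0 (benign).
baseline_accuracy = n_benign / total

print(f"malicious share:   {minority_share:.1%}")    # 7.4%
print(f"baseline accuracy: {baseline_accuracy:.1%}")  # 92.6%
```

Any model trained on this data has to beat this baseline on measures other than accuracy to be useful.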
Train model on imbalanced data:
In this example:
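from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Train on the imbalanced data (X: features, y: 0/1 labels)
clf_0 = LogisticRegression().fit(X, y)
# Predict on training set
pred_y_0 = clf_0.predict(X)
print( accuracy_score(y, pred_y_0) )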
The model has an overall accuracy above 92%, but only because it is predicting a single class.
import numpy as np
print( np.unique( pred_y_0 ) )
# [0] == benign
This model is only predicting 0, which means it's completely ignoring the minority class in favor of the majority class.
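To see what this means for the malicious class, the sketch below scores a stand-in for such a model: an array of all-zero predictions against labels with the class counts from the example. It assumes scikit-learn's `recall_score` and `confusion_matrix`; the arrays are constructed for illustration and are not the output of the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Labels with the class counts used in the example above.
y_true = np.concatenate([np.zeros(700_000, dtype=int), np.ones(56_050, dtype=int)])
# Stand-in for a model that only ever predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)  # recall for the malicious class (label 1)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(f"accuracy: {accuracy:.3f}")
print(f"malicious recall: {recall:.3f}")
print(cm)
```

High accuracy and zero recall on the malicious class coexist here, which is why accuracy alone says little about a detector.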
As explained with this simple example, in our cybersecurity domain, machine learning classifiers are used to sort through huge populations of negative (uninteresting) cases to find the small number of positive (interesting, alarm-worthy) cases. Conventional algorithms are often biased towards the majority class because their loss functions attempt to optimize quantities such as error rate, not taking the data distribution into consideration. In the worst case, minority examples are treated as outliers of the majority class and ignored. The learning algorithm simply generates a trivial classifier that classifies every example as the majority class.
Our goal is not to maximize simple accuracy (or, equivalently, minimize error rate), but to be more careful with the rare class examples, because they are much more important to classify.
This post outlines accessible ways to learn classifiers from imbalanced data. Most of them involve making adjustments before or after applying standard learning algorithms.
- Choose the right measurement:
  - Don’t use accuracy (or error rate) to evaluate your classifier. Accuracy applies a naïve 0.50 threshold to decide between classes, and this is usually wrong when the classes are imbalanced.
  - Visualize classifier performance using a ROC curve, a precision-recall curve, a lift curve, or a profit curve.
- Balance the training set:
  - Oversample the minority class:
    - i. Upsampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal. There are several heuristics for doing so, but the most common is to simply resample with replacement. Replicating data is not without consequences: since it results in duplicate data, it makes variables appear to have lower variance than they do.
  - Undersample the majority class:
    - i. Undersampling randomly downsamples the majority class. It can make the independent variables look like they have a higher variance than they do.
    - ii. An example of the usage of this approach (implementing a Bagging Classifier, which balances bootstrapped samples prior to aggregation) is shown in the following figure:
  - Synthesize new minority classes:
    - i. The best known example of this approach is Chawla's SMOTE (Synthetic Minority Oversampling Technique) system. The idea is to create new minority samples by interpolating between existing ones, slightly perturbing feature variables to generate synthetic data in a way that does not contaminate the original characteristics of the minority class.
  - Neighborhood-based approach:
    - i. It examines the instance space carefully and decides what to do with instances based on their neighborhoods. For example, Tomek links are pairs of instances from opposite classes that are very close together. A strategy might involve performing k-means clustering on the majority class and removing data points from high-density centroids.
- Discard the minority examples and treat the task as a single-class (or anomaly detection) problem.
- At the algorithm level, or after it:
  - Adjust the class weight (misclassification costs):
    - i. Many classifiers take an optional class_weight parameter that can be set higher than one for the minority class.
    - ii. The following figure shows the effect of increasing the minority class' weight by a factor of ten (example taken from the scikit-learn documentation).
  - Adjust the decision threshold
  - Modify an existing algorithm to be more sensitive to rare classes
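The two random resampling strategies from the list can be sketched with scikit-learn's `resample` utility. The feature matrices below are randomly generated for illustration (1,000 benign rows and 50 malicious rows are assumed sizes), so this is a sketch under those assumptions rather than a recipe for real traffic data.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
# Illustrative data only: 1,000 benign rows and 50 malicious rows, 3 features each.
X_major = rng.normal(0.0, 1.0, size=(1000, 3))
X_minor = rng.normal(2.0, 1.0, size=(50, 3))

# Oversampling: draw minority rows with replacement until the classes match.
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major), random_state=0)

# Undersampling: draw a majority subset without replacement, as large as the minority class.
X_major_down = resample(X_major, replace=False, n_samples=len(X_minor), random_state=0)

X_over = np.vstack([X_major, X_minor_up])
y_over = np.array([0] * len(X_major) + [1] * len(X_minor_up))

X_under = np.vstack([X_major_down, X_minor])
y_under = np.array([0] * len(X_major_down) + [1] * len(X_minor))

print(np.bincount(y_over))   # [1000 1000]
print(np.bincount(y_under))  # [50 50]
```

Resampling is normally applied to the training split only, so that evaluation still reflects the real class distribution.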
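The algorithm-level adjustments can be sketched in the same spirit. The example below compares a plain `LogisticRegression` with one that weights the minority class ten times higher, and then lowers the decision threshold of the plain model. The synthetic blobs, the weight of 10 and the threshold of 0.2 are all assumptions chosen for illustration. As in the earlier snippet, recall is measured on the training data to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.RandomState(0)
# Illustrative data only: two overlapping Gaussian blobs, 2,000 benign and 100 malicious.
X = np.vstack([rng.normal(0.0, 1.0, size=(2000, 2)),
               rng.normal(1.5, 1.0, size=(100, 2))])
y = np.array([0] * 2000 + [1] * 100)

# Same model, with and without a higher weight on the minority class.
plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)

recall_plain = recall_score(y, plain.predict(X))
recall_weighted = recall_score(y, weighted.predict(X))

# Alternative: keep the unweighted model but lower the decision threshold.
threshold = 0.2  # hypothetical value; tune it on validation data
proba_malicious = plain.predict_proba(X)[:, 1]
pred_low_threshold = (proba_malicious >= threshold).astype(int)
recall_low_threshold = recall_score(y, pred_low_threshold)

print(f"recall, plain model:        {recall_plain:.2f}")
print(f"recall, class_weight 1:10:  {recall_weighted:.2f}")
print(f"recall, threshold {threshold}:      {recall_low_threshold:.2f}")
```

Both adjustments trade more false alarms for fewer missed attacks, so the weight or threshold is best chosen against the cost of each kind of error.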
Learning from imbalanced classes continues to be an active area of research in machine learning, with new algorithms introduced every year. In our cybersecurity domain, however, the harder problem may be not having enough examples of the newest rare class.
When looking for a security vendor for your organization, making sure that this problem is handled correctly is half the battle: depending on the approach chosen, it can have a significant impact on the ability to detect unknown, never-before-seen threats.