A-Z of Machine Learning and Computer Vision Terms

Class Imbalance

Class imbalance refers to an uneven distribution of classes in a dataset, where one class (the “majority” class) has many more samples than another (the “minority” class). This situation is common in real-world classification tasks. In fraud detection, for instance, fraudulent transactions might make up only 1% of the data (minority) while legitimate transactions make up 99% (majority). Similarly, in medical diagnostic data, healthy cases often vastly outnumber disease cases.

Class imbalance can be problematic because many machine learning algorithms implicitly assume, or at least perform best when, the classes are roughly balanced. A model trained to minimize overall error will tend to be biased towards the majority class, since predicting the majority almost every time already yields a low error. As a result, it may largely ignore the minority class, which is usually the class of greater interest. For example, a classifier might achieve 99% accuracy on the fraud dataset by always predicting “not fraud,” but such a model is useless for catching actual fraud. Under class imbalance, plain accuracy therefore becomes uninformative, and one must look at metrics that capture minority-class performance, such as precision, recall, F1-score, and area under the ROC curve.

Class imbalance also calls for special techniques during modeling. Data-level methods re-sample the training data: one can over-sample the minority class (e.g. duplicate minority examples or generate synthetic ones using methods like SMOTE) or under-sample the majority class (remove some majority examples) to obtain a more balanced training set. Algorithm-level methods include cost-sensitive learning and class weight adjustments, which assign a higher penalty to mistakes on the minority class during training so that the model is incentivized to get those examples right. In practice, a combination of approaches may be used. For instance, one might slightly over-sample the minority class and also use a weighted loss function that emphasizes minority-class errors. Another strategy is threshold moving: adjusting the decision threshold applied to the model's predicted scores so that the minority class reaches a desired recall (in multi-class problems, this can be done per class in a one-vs-all setup).

It is also important to use a stratified validation scheme, so that every split preserves the class proportions and contains minority examples, and to choose evaluation metrics that reflect the costs of the different types of error. In summary, class imbalance is a common challenge that can lead to biased models if not addressed. The key is to recognize it and apply techniques that restore focus on minority-class performance, without introducing overfitting or noise through naive oversampling.
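The ideas above can be illustrated with a minimal sketch in plain Python. It is not a reference implementation: the toy labels mirror the 99%/1% fraud example, the scores are made up for illustration, and the helper names (`binary_metrics`, `inverse_frequency_weights`, `random_oversample`, `threshold_for_recall`) are hypothetical. The weighting formula `n_samples / (n_classes * class_count)` is one common heuristic among several, and the oversampler only duplicates examples (it does not generate synthetic ones as SMOTE does).

```python
import math
import random
from collections import Counter


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall and F1 for the positive (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def inverse_frequency_weights(labels):
    """Class weights n_samples / (n_classes * class_count): rarer classes weigh more."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority examples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def threshold_for_recall(scores, y_true, target_recall, positive=1):
    """Highest threshold t such that predicting positive when score >= t meets the recall target."""
    pos_scores = sorted((s for s, t in zip(scores, y_true) if t == positive), reverse=True)
    needed = max(1, math.ceil(target_recall * len(pos_scores) - 1e-9))
    return pos_scores[needed - 1]


# Toy fraud data: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10

# Always predicting "not fraud" looks accurate but never catches fraud.
baseline = binary_metrics(y_true, [0] * len(y_true))

# Algorithm-level: per-class weights that could feed a weighted loss.
weights = inverse_frequency_weights(y_true)

# Data-level: balance the training labels by duplicating minority examples.
_, y_balanced = random_oversample(list(range(len(y_true))), y_true)

# Threshold moving on made-up scores (hypothetical model outputs).
scores = [0.1] * 970 + [0.4] * 20 + [0.9, 0.8, 0.7, 0.6, 0.55, 0.45, 0.4, 0.35, 0.3, 0.2]
default = binary_metrics(y_true, [int(s >= 0.5) for s in scores])
threshold = threshold_for_recall(scores, y_true, target_recall=0.8)
moved = binary_metrics(y_true, [int(s >= threshold) for s in scores])

print("baseline:", baseline)        # accuracy 0.99, recall 0.0
print("class weights:", weights)
print("after oversampling:", Counter(y_balanced))
print("threshold 0.5:", default)
print("threshold", threshold, ":", moved)
```

In this toy setup, lowering the threshold raises minority-class recall at the cost of precision, which is the trade-off threshold moving is meant to expose. Any re-sampling should be applied to the training split only, so that validation data keeps the original class proportions.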
