The learning rate is a hyperparameter that controls how much a model's weights are updated during training in response to the estimated error (loss) gradient. In gradient descent optimization, the weight update is typically w := w - η * (∂L/∂w), where η (eta) is the learning rate.

A high learning rate can speed up training but may overshoot minima or cause outright divergence: the training loss fails to decrease because each step is too large. A low learning rate gives more stable convergence, but training becomes slow and can stall on plateaus or in poor local minima. In practice, learning rate schedules or adaptive methods are common: start with a higher rate and decay it over time, or use an optimizer such as Adam, which adapts the effective learning rate per parameter.

Tuning the learning rate is crucial for efficient training. A common heuristic is to try different powers of 10 (e.g., 1e-1, 1e-2, 1e-3, 1e-4) or to use a learning rate finder to pick a good value. Too high a rate typically shows up as divergence (the loss increases); too low a rate shows up as a very slow decrease in loss.
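The effect of the learning rate can be illustrated with a minimal sketch: gradient descent on a toy quadratic loss L(w) = (w - 3)^2, whose minimum is at w = 3. The loss function and the specific η values here are illustrative choices, not from the text.

```python
def gradient_descent(eta, steps=50, w=0.0):
    """Run plain gradient descent on L(w) = (w - 3)**2 for a fixed
    number of steps, starting from w, with learning rate eta."""
    for _ in range(steps):
        grad = 2 * (w - 3)   # dL/dw for the quadratic loss
        w = w - eta * grad   # the update rule w := w - eta * dL/dw
    return w

# Moderate learning rate: converges close to the minimum at w = 3.
print(gradient_descent(eta=0.1))

# Very small learning rate: still far from 3 after 50 steps (slow).
print(gradient_descent(eta=0.001))

# Too-large learning rate: each step overshoots and |w| blows up.
print(gradient_descent(eta=1.1))
```

For this loss, each update multiplies the distance to the minimum by (1 - 2η), so the iterates converge only when that factor has magnitude below 1, i.e. 0 < η < 1, which mirrors the high/low trade-off described above.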