Feature hashing (the “hashing trick”) is a fast and memory-efficient way to convert features (especially categorical or string features) into a fixed-length numerical vector. It works by applying a hash function to the feature (or feature-value pair) to determine an index in a vector, and then incrementing the value at that index. This allows potentially infinite or very large feature spaces to be collapsed into manageable vector sizes, at the cost of possible collisions (different features hashing to the same index). It’s commonly used in text processing: each word might be hashed to an index in a, say, 1e6-length vector, thereby avoiding maintaining a huge dictionary of all words. Collisions introduce noise but empirically often don’t hurt performance significantly if the hash space is large enough. Feature hashing is stateless and simple (no need to store a map of feature->index), thus it’s useful in streaming or large-scale situations.
Data Selection & Data Viewer
Get data insights and find the perfect selection strategy
Learn MoreSelf-Supervised Pretraining
Leverage self-supervised learning to pretrain models
Learn MoreSmart Data Capturing on Device
Find only the most valuable data directly on devide
Learn More