Sliding Window Attention is an attention mechanism for transformer models that restricts each token's attention scope to a fixed-size window of neighboring tokens. It was introduced in architectures for long sequences, such as Longformer, to address the quadratic complexity of standard self-attention. Instead of attending to all tokens in the sequence, a token attends only to those within a window of size w around it (for example, the closest w/2 tokens before and after it).

This local attention window reduces computation from O(n²) to O(n · w), where n is the sequence length. Stacking multiple transformer layers with sliding window attention still provides a degree of global context: because windows overlap, upper layers can indirectly attend to tokens farther away, so the receptive field grows roughly linearly with depth. In practice, sliding window attention enables transformers to handle very long inputs efficiently by focusing on local context, and it can be combined with a few global attention tokens to preserve some long-range information.
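To make the mechanism concrete, here is a minimal NumPy sketch of single-head sliding window attention. It illustrates the masking semantics only: for simplicity it builds a dense n × n score matrix and masks out entries outside the band, whereas an efficient implementation would gather only the w neighbors per token to actually achieve O(n · w) cost. The function name and shapes are illustrative, not from any particular library.

```python
import numpy as np

def sliding_window_attention(q, k, v, w):
    """Single-head attention where each token attends only to tokens
    within w // 2 positions on either side of itself.

    Note: this dense-mask version shows the *semantics*; a real
    implementation gathers only the w neighbors to get O(n * w) cost.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)  # (n, n) raw attention scores
    # Band mask: position j is visible from position i iff |i - j| <= w // 2
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= w // 2
    scores = np.where(mask, scores, -np.inf)
    # Row-wise softmax over the visible window only
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d, w = 8, 4, 2
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out = sliding_window_attention(q, k, v, w)
print(out.shape)  # (8, 4)
```

With w = 2, each token mixes information only from itself and its immediate neighbors; stacking several such layers lets information propagate farther, which is how deeper models recover longer-range context despite the local window.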