We switched from Pillow to Albumentations and got a 2x speedup

Are your GPUs sitting idle because your dataloader is too slow? We’ve been there. In this post, we’ll share how diagnosing CPU bottlenecks led us to swap from PIL to Albumentations for data transformations—nearly doubling throughput and pushing GPU utilization near 100%.

Switching from Pillow to Albumentations resulted in higher GPU utilization and higher throughput.

At Lightly we run a lot of experiments around training pure vision and vision-language models. Many of these methods use self-supervised learning (SSL), and some of them, such as DINO, are quite augmentation-heavy: DINO applies 6 augmentations to each of 8 views per image, totaling 48 augmentation operations per image. When training these models, we noticed GPU utilization drop to around 40–70%. The culprit? A data preprocessing bottleneck that kept our CPUs swamped.

After testing various image decoding and augmentation pipelines, we discovered that switching away from Pillow to Torchvision + Albumentations gave us a 2x speedup and kept our GPUs fully occupied. In this blog post, we’ll reveal our experiments, share the results, and explain how you can optimize your own data preprocessing pipeline. Enjoy the read—and faster training!

Understanding the Pipeline

Once we found out that dataloading was the bottleneck and CPU utilization was at 100%, we dug deeper into the image preprocessing pipeline.

Our image loading and preprocessing pipeline looks roughly like this:

  1. Image is loaded from disk
  2. Image is decoded
  3. Augmentations are applied
  4. Image Tensor is used in PyTorch
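
As a rough sketch, the classic PIL-based version of this pipeline looks something like the snippet below (file paths and transform parameters are placeholders, not our actual setup):

```python
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class ImageDataset(Dataset):
    """Minimal sketch of the four stages above, using the classic PIL-based path."""

    def __init__(self, image_paths):
        self.image_paths = image_paths              # placeholder list of image files
        self.transform = transforms.Compose([
            transforms.RandomResizedCrop(224),      # 3. augmentations (placeholder values)
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),                  # 4. convert to a torch tensor
        ])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        image = Image.open(self.image_paths[index])  # 1. load from disk (PIL decodes lazily)
        image = image.convert("RGB")                 # 2. force decoding into memory
        return self.transform(image)                 # 3. + 4. augment and return a tensor
```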

To identify the potential bottlenecks, we can do some back-of-the-envelope math and a few additional measurements.

1. File Loading

When loading 1400 images/s with an average of 0.15 MB per image, we require a bandwidth of roughly 213 MB/s. This is only a fraction of the bandwidth our SSD-based filesystem provides, so we can rule out the first stage as the bottleneck. If you use remote storage, we recommend moving your data to local storage.
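
For context, the estimate follows directly from the dataset statistics used in the benchmark below:

```python
# Back-of-the-envelope bandwidth estimate for stage 1 (file loading)
num_images = 118_000                                # COCO train2017 (see benchmarking setup below)
dataset_size_mb = 18_000                            # ~18 GB total
avg_image_size_mb = dataset_size_mb / num_images    # ~0.15 MB per image
images_per_second = 1400
print(images_per_second * avg_image_size_mb)        # ~213 MB/s
```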

2. Image Decoding

Decoding the image from its byte representation to a tensor in memory is a CPU-intensive operation. We benchmarked this operation using PIL, OpenCV, and Torchvision’s new read_image() function.

Torchvision now includes its own optimized image decoding pipeline. It uses libjpeg-turbo and other optimizations directly under the hood for you.
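
For reference, here is a rough sketch of decoding the same JPEG with the three libraries we compared (the file path is just a placeholder):

```python
import cv2
import numpy as np
from PIL import Image
from torchvision.io import read_image

path = "example.jpg"  # placeholder file

# PIL: Image.open decodes lazily; converting to a NumPy array forces the actual decode
pil_array = np.asarray(Image.open(path).convert("RGB"))

# OpenCV: returns an HWC uint8 array in BGR order, so we convert back to RGB
cv_array = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)

# Torchvision: returns a CHW uint8 tensor via its own optimized decoder
tv_tensor = read_image(path)
```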

In some related benchmarks, we had already found out that decoding images in jpeg format is much faster than decoding them in png format for all libraries.

This is due to two reasons: png files are larger than jpg files because they use lossless compression, and the decoding process for png files is more compute-heavy, making it significantly slower. Thus, we had already converted all our datasets to jpeg. If you have not done this yet, we highly recommend it.
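
If you want to do the same, a one-off conversion with PIL is straightforward. Here is a minimal sketch, assuming a flat directory of PNGs (note that JPEG is lossy, so keep the originals if you need pixel-exact data):

```python
from pathlib import Path
from PIL import Image

src_dir = Path("dataset_png")   # placeholder input directory
dst_dir = Path("dataset_jpg")   # placeholder output directory
dst_dir.mkdir(exist_ok=True)

for png_path in src_dir.glob("*.png"):
    image = Image.open(png_path).convert("RGB")
    image.save(dst_dir / (png_path.stem + ".jpg"), quality=90)  # quality is an arbitrary choice
```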

3. Augmentations

Many computer vision models, especially those for self-supervised learning, rely heavily on augmentations. Augmentations can involve resizing the image, adding blur or color jitter, flipping, cropping, and more.

Again, there are different libraries to perform augmentations, which we benchmarked. Note that there isn’t one single way of doing augmentations with torchvision:

  • By default, both torchvision transforms v1 and v2 operate on PIL images, using the same code under the hood. They may use either normal PIL or PIL-SIMD, a version of PIL that speeds up some operations.
  • Torchvision transforms v2 can also operate directly on tensors instead of PIL images.
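
A minimal sketch of the tensor-based v2 path (placeholder file name and parameter values):

```python
import torch
from torchvision.io import read_image
from torchvision.transforms import v2

# Transforms v2 operating directly on a uint8 tensor, with no PIL involved
transform = v2.Compose([
    v2.RandomResizedCrop(224, antialias=True),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ToDtype(torch.float32, scale=True),   # uint8 [0, 255] -> float32 [0, 1]
])

image = read_image("example.jpg")            # CHW uint8 tensor
augmented = transform(image)
```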

We also tested augmentations with both versions of torchvision transforms. Interestingly, we did not observe any performance benefit when using v2 over v1, even though we had expected a small advantage due to its updated pipeline.

Benchmarking Setup

We used the following setup to benchmark the different approaches:

  • Dataset: COCO train2017 dataset of 118,000 images with a resolution of 640x480 pixels. Total dataset size is around 18GB.
  • Hardware:
    • AMD Ryzen 3960X CPU with 24 physical cores
    • 4 x RTX4090 GPU
    • SSD storage
  • Software:
    • Python 3.10.16
    • Torchvision 0.20.1
    • Pillow 11.0.0 or Pillow-SIMD 9.5.0.post2 (the latest versions as of January 2025)
    • Albumentations 2.0.0 (latest version as of January 2025)
  • Image transformations:
    • We used the transform outlined in the DINO paper and implemented in the Lightly library (a rough Albumentations sketch follows this list). The DINO transform generates 2 global and 6 local patches, applying these steps to each of the patches:
      • random resized crop
      • random horizontal flip
      • color jitter
      • random conversion to grayscale
      • gaussian blur
      • conversion to a torch tensor
  • Dataloader settings:
    • 48 workers, one for each virtual core
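
For illustration, a rough Albumentations version of one such per-patch transform could look like the snippet below. The probabilities and parameter values are placeholders, not the exact values from the DINO paper or the Lightly implementation:

```python
import albumentations as A
import cv2
from albumentations.pytorch import ToTensorV2

cv2.setNumThreads(0)  # important in multi-worker dataloaders, see the next section

# Sketch of a DINO-style per-patch augmentation pipeline (placeholder parameters)
patch_transform = A.Compose([
    A.RandomResizedCrop(size=(224, 224), scale=(0.4, 1.0)),  # random resized crop
    A.HorizontalFlip(p=0.5),                                 # random horizontal flip
    A.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2, hue=0.1, p=0.8),
    A.ToGray(p=0.2),                                         # random conversion to grayscale
    A.GaussianBlur(blur_limit=(3, 7), p=0.5),                # gaussian blur
    ToTensorV2(),                                            # conversion to a torch tensor
])

# Albumentations expects an HWC uint8 NumPy array:
# augmented = patch_transform(image=numpy_image)["image"]
```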

Albumentations Is Slow with Multiprocessing

When benchmarking on a single core, we found Albumentations to be much faster than PIL. However, as soon as we started using it within a multiprocessing setup like the torch dataloader, it showed poor scaling behaviour and became even slower than PIL. Finding out why took us almost a day of debugging that you should not have to repeat, so here is the problem:

Albumentations uses cv2 under the hood, and cv2 spawns multiple threads by default. As a result, every dataloader worker process spawns its own set of threads, oversubscribing the CPU and slowing down the dataloading. This is a known issue, but it is easy to solve: just call cv2.setNumThreads(0).
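
A minimal sketch of wiring this into a torch dataloader (the dataset below is just a stand-in for your own Albumentations-based dataset):

```python
import cv2
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

cv2.setNumThreads(0)  # disable OpenCV's internal threading in the main process


def worker_init_fn(worker_id: int) -> None:
    # Repeat the call in every dataloader worker, so spawned worker processes are covered too
    cv2.setNumThreads(0)


class DummyDataset(Dataset):
    """Stand-in for an Albumentations-based dataset."""

    def __len__(self) -> int:
        return 1000

    def __getitem__(self, index: int) -> torch.Tensor:
        image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
        return torch.from_numpy(image)


dataloader = DataLoader(
    DummyDataset(),
    batch_size=64,
    num_workers=48,
    worker_init_fn=worker_init_fn,
)
```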

Use Pillow-SIMD Instead of Pillow If Possible

One of the simplest ways to improve image processing performance is by swapping out Pillow for Pillow-SIMD. The famous Performance Tips and Tricks guide by fast.ai highlights this as a key optimization step.

Pillow-SIMD replaces Pillow’s use of libjpeg with libjpeg-turbo, a faster JPEG decoding library. For many setups, this “drop-in replacement” can provide a 20–30% speedup in image decoding tasks with no code changes required—just update your dependencies.
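
A quick way to check which variant is active in your environment: Pillow-SIMD reuses the PIL namespace, and its version string carries a .postN suffix (e.g. 9.5.0.post2):

```python
import PIL

# Pillow-SIMD releases are versioned with a ".post" suffix, standard Pillow is not
print(PIL.__version__)
print("Pillow-SIMD" if "post" in PIL.__version__ else "standard Pillow")
```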

If you must stick to standard Pillow, we strongly recommend following the fast.ai guide to ensure you are getting the best possible performance. Simple adjustments, like recompiling with the turbo library, can still yield significant benefits.

Results

To find the fastest dataloading libraries, we only benchmarked the dataloading. This makes the following benchmark independent of the model and GPU used and easier to reproduce. Furthermore, it reduced the variance between different repetitions to <1%.

Our benchmarks show that the combination of torchvision for image decoding and albumentations for augmentations outperforms the other methods.

Using PIL and either version of the torchvision transforms gives a speed of about 1650 img/s. Replacing PIL with PIL-SIMD increases the speed to about 2160 img/s. Using torchvision for both the decoding and the transforms reduces the speed again: it is only about as fast as using default PIL.

The clear winner is Albumentations, which increases the speed to more than 3200 img/s, about twice as fast as using default PIL.

Conclusion

Removing PIL from the data preprocessing pipeline and instead using Torchvision for decoding the image and Albumentations for augmenting it gives a huge speedup compared to the usual PIL + Torchvision setup. Thanks to this rework, our GPUs now run at almost 100% utilization and we can train our models much faster. The only disadvantage is that our electricity bill is going to be higher ;)