Apple M1 and M2 Performance for Training SSL Models
We want to know how fast Apple M1 and M2 chips are for training self-supervised learning models.
The number of benchmarks for training ML models on the new Apple chips is still low. Furthermore, most published results cover only the M1 chips and were obtained with earlier software versions that might not have been fully optimized at the time the tests were run. That’s why we decided to run our own benchmarks.
To measure the training performance of Apple M1 and M2 chips, we set up a simple benchmark: training a SimCLR model with a ResNet-18 backbone on CIFAR-10. We measure the time it takes to complete one full epoch. For our experiments, we use various M1 and M2 chips and also compare CPU vs. GPU performance.
In detail, we run benchmarks using the following devices:
- 14-inch MacBook Pro 2021 with the M1 Pro and the 14-core GPU (referred to as M1 Pro in this post)
- 13-inch MacBook Air 2023 with the M2 and the 8-core GPU (referred to as M2 in this post)
- We compare the results against a reference implementation using an Nvidia A6000 Ampere GPU
TL;DR
- On the M1 Pro, the GPU is 8.8x faster for training than the CPU.
- The M1 Pro GPU is approximately 13.77x slower than an Nvidia A6000 Ampere GPU.
- The M1 Pro GPU is 26% faster than the M2 GPU.
- PyTorch on Apple M1 and M2 chips does not fully support torch.compile or 16-bit precision yet. Hopefully, this changes in the coming months.
In the following table, you will find the different compute hardware we evaluated. On the right side, you find the average time per epoch in minutes. All Apple M1 and M2 chips use the latest nightly PyTorch build from June 30, 2023, whereas the Nvidia A6000 Ampere GPU uses an older PyTorch version from 2022.
Setup & Experiments
We give an overview of the software and hardware components used for the experiments.
We use the examples for training a ResNet-18 with SimCLR on CIFAR-10 from the LightlySSL benchmarks. Instead of training the models for 200 epochs, we only train them for 2 epochs on the Apple hardware. We leave all other parameters (batch size: 128, precision: 32, number of workers: 8) unchanged.
The training code automatically evaluates the model after each epoch using a kNN classifier. Two epochs therefore correspond to training the model on the training data twice and evaluating it twice. For our experiments, we report the average time per epoch.
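To give an idea of what such a run looks like, here is a minimal sketch of a SimCLR training loop with per-epoch timing, loosely based on the publicly available LightlySSL examples. It is not the exact benchmark code: the real benchmarks use PyTorch Lightning and also run a kNN evaluation after each epoch, and the lightly module names below assume a recent release of the library.

```python
import time

import torch
import torchvision
from torch import nn

from lightly.loss import NTXentLoss
from lightly.models.modules import SimCLRProjectionHead
from lightly.transforms import SimCLRTransform


class SimCLR(nn.Module):
    """ResNet-18 backbone without the classification head, plus a projection head."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18()
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.projection_head = SimCLRProjectionHead(512, 512, 128)

    def forward(self, x):
        features = self.backbone(x).flatten(start_dim=1)
        return self.projection_head(features)


def main():
    # Use the Apple GPU (MPS) if available, otherwise fall back to the CPU.
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    model = SimCLR().to(device)

    # Two augmented views per image, as in the LightlySSL CIFAR-10 examples.
    transform = SimCLRTransform(input_size=32, gaussian_blur=0.0)
    dataset = torchvision.datasets.CIFAR10(
        "datasets/cifar10", download=True, transform=transform
    )
    dataloader = torch.utils.data.DataLoader(
        dataset, batch_size=128, shuffle=True, num_workers=8, drop_last=True
    )

    criterion = NTXentLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.06)

    for epoch in range(2):
        start = time.perf_counter()
        for (x0, x1), _ in dataloader:
            x0, x1 = x0.to(device), x1.to(device)
            loss = criterion(model(x0), model(x1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: {(time.perf_counter() - start) / 60:.2f} min")


if __name__ == "__main__":
    main()
```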
We have reference results on an Nvidia A6000 Ampere GPU for comparison. The Nvidia GPU uses 97.7 min for 200 epochs, or 0.49 min per epoch.
We use 8 workers for data loading. We did not tune or change any parameters when switching from the system with the Nvidia A6000 GPU to the M1 and M2 chips. However, when monitoring CPU and GPU usage, we noticed that the M1 and M2 devices were constantly above 90% utilization, which suggests that we are close to the limit of the available hardware.
Installing PyTorch with GPU support on Apple M1 and M2
For our experiments, we need to install PyTorch on the Apple M1 and M2 hardware.
We follow this guide here: https://developer.apple.com/metal/pytorch/
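Once installed, a quick sanity check along these lines confirms that PyTorch was built with MPS support and can place tensors on the Apple GPU:

```python
import torch

# Was PyTorch compiled with MPS support?
print("MPS built:    ", torch.backends.mps.is_built())
# Is an Apple silicon GPU available at runtime?
print("MPS available:", torch.backends.mps.is_available())

# Create a tensor on the Apple GPU as a smoke test.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.ones(3, device=device)
print(x.device, x)
```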
Finally, our test system on M1 Pro has the following packages installed:
Results
Note that we don’t use torch.compile or 16-bit precision due to a lack of support on the Apple chips. As of today, Apple M1 and M2 GPUs do support 16-bit precision, but PyTorch lacks autocast support for the MPS backend, which is required for scaling the gradients and automatically keeping numerically sensitive operations in higher precision. Therefore, we don’t use any of these features.
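For context, the snippet below shows the standard autocast plus gradient-scaling recipe we would normally use on a CUDA GPU; the toy model and data are purely illustrative. At the time of writing, the equivalent path for the mps device type is missing, so all our runs use full 32-bit precision.

```python
import torch
from torch import nn

# Toy model and data, purely to illustrate the AMP pattern on CUDA.
model = nn.Linear(8, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 8, device="cuda")
y = torch.randn(32, 1, device="cuda")

optimizer.zero_grad()
# autocast runs eligible ops in fp16 and keeps sensitive ops in fp32;
# GradScaler scales the loss so small fp16 gradients do not underflow.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```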
We discuss the various benchmark results in more detail below. As a reference, we compare against two publicly available benchmarks for training ML models on Apple M1 hardware.
Results on the M1 Pro GPU
Let’s take a look at the detailed results on the M1 Pro GPU. Since we use PyTorch Lightning for the experiments, we also get a summary of the accelerator used and the model size.
You can see that the GPU has been found:
GPU available: True (mps), used: True
The total time for the two epochs is 13.5 min, so the time per epoch is 6.75 min. This is 13.77x slower than the Nvidia A6000 GPU at 0.49 min per epoch.
Results on the M1 Pro CPU
Out of curiosity, we also ran the same benchmark on the M1 Pro CPU, which has six performance cores and two efficiency cores.
As expected, the results are worse than with the GPU. The CPU took 118.6 min for the two epochs, or 59.3 min per epoch. The M1 Pro CPU is therefore 8.8x slower than the M1 Pro GPU. These results show a bigger gap between GPU and CPU performance than previous benchmarks:
Prior ML Benchmarks on Apple M1 Hardware
We only found two other benchmarks. Both date from May 2022, when initial support for PyTorch on Apple hardware was announced.
Compared to the other two reported results, our benchmark of training a ResNet-18 is less compute-intensive. Both the VGG-16 and the ResNet-50 are bigger models with more parameters and more FLOPs.
According to the official torchvision pretrained models, we get the following numbers for the three models (a quick way to verify the parameter counts is shown after the list):
- The VGG-16 model has 138 million parameters and 15.47 billion FLOPs
- The ResNet-18 model has 11.7 million parameters and 1.81 billion FLOPs
- The ResNet-50 model has 25.6 million parameters and 4.09 billion FLOPs
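The parameter counts are easy to verify directly from the torchvision model definitions; the FLOP numbers above are taken from the torchvision documentation.

```python
import torchvision

# Count parameters for the three architectures discussed above.
for name, builder in [
    ("VGG-16", torchvision.models.vgg16),
    ("ResNet-18", torchvision.models.resnet18),
    ("ResNet-50", torchvision.models.resnet50),
]:
    model = builder()
    num_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {num_params / 1e6:.1f}M parameters")
```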
Outlook
Half precision (fp16) support
Although the Apple M1 and M2 GPUs support fp16, the software stack around PyTorch is still lacking support in some areas. For example, the issue https://github.com/pytorch/pytorch/issues/88415 is still open, preventing us from easily using mixed precision with autocast. The good news is that fp16 is already supported on M1 and M2 chips, meaning that we can create fp16 tensors and run operations on them.
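As a small illustration, fp16 tensors and basic operations already run on the MPS device:

```python
import torch

device = torch.device("mps")

# fp16 tensors and basic ops (here a matrix multiplication) already work on MPS.
a = torch.randn(1024, 1024, device=device, dtype=torch.float16)
b = torch.randn(1024, 1024, device=device, dtype=torch.float16)
c = a @ b
print(c.dtype, c.device)  # torch.float16 mps:0
```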
Support for torch.compile for M1 and M2 Chips
If you try to use torch.compile, you will get the error RuntimeError: Unsupported device type: mps, because the MPS device is not yet supported.
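A minimal reproduction looks like the sketch below (toy model for illustration only); at the time of writing, running the compiled model on the mps device fails with the error above.

```python
import torch
from torch import nn

# Toy model on the Apple GPU, just to trigger the torch.compile code path.
model = nn.Linear(8, 1).to("mps")
compiled_model = torch.compile(model)

x = torch.randn(4, 8, device="mps")
# At the time of writing this raises:
# RuntimeError: Unsupported device type: mps
out = compiled_model(x)
```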
If you like this post and want to read more from me, you can follow me on Medium.
Igor Susmelj,
Co-Founder Lightly