EpicKitchens-100 in LightlyStudio: From Video Clips to Searchable Embeddings

Learn how to preprocess EpicKitchens-100 video clips and load them into LightlyStudio with action captions and metadata. Explore 37K clips using embedding plots, text search, and diversity sampling.

Get Started with Lightly

Talk to Lightly’s computer vision team about your use case.
Book a Demo


The EpicKitchens dataset has gained popularity in the computer vision community for its rich annotations on a set of egocentric videos, designed to develop models for robotics-related tasks. In this blog post, we show how to preprocess the EpicKitchens dataset and visualize it in LightlyStudio.

By the end, we will have loaded and explored a dataset of more than 37,000 video clips, each with a caption describing the action in the clip. We show how to explore the embedding plot and slice and dice the dataset for further analysis.

Pro tip

For more information, check out LightlyStudio Docs.

Understanding Different EpicKitchens Datasets

For a newcomer, the structure of the EpicKitchens dataset can be a bit overwhelming. In fact, EpicKitchens is a collection of datasets, each with its own structure and annotations, and different ways to access the data.

The main datasets with video recordings are:

  • EPIC-KITCHENS-55 - The original release with 55 hours of video
  • EPIC-KITCHENS-100 - The extended release, growing the dataset to 100 hours

Moreover, separate, derived datasets annotating the data from EPIC-KITCHENS-100 are available, such as:

  • VISOR - Dense instance segmentation annotations
  • EPIC-Sounds - Audio annotations
  • EPIC-Fields - 3D digital twins

Downloading EPIC-KITCHENS-100

For our tutorial we focus on EPIC-KITCHENS-100 and download videos and annotated actions.

Note: You can skip the downloading and preprocessing steps if you are only interested in the final result. We uploaded it as the lightly-ai/epic-kitchens-100-clips dataset to HuggingFace (24 GB).

Download Videos

The first obstacle is that the EPIC-KITCHENS-55 videos and the extension part of EPIC-KITCHENS-100 are distributed separately. For simplicity, we focus on the extension part of EPIC-KITCHENS-100.

The videos are officially hosted on the University of Bristol's data.bris servers, but downloads from there are slow. Luckily, the extension dataset is also available via AcademicTorrents and HuggingFace; we are going to use the HuggingFace mirror:

# Install the HuggingFace CLI according to https://huggingface.co/docs/huggingface_hub/en/guides/cli
curl -LsSf https://hf.co/cli/install.sh | bash

# Download videos (464 GB)
hf download awsaf49/epic_kitchens_100 --repo-type dataset --include "*.MP4" --local-dir ./EPIC-KITCHENS-100

The download is large. To follow along with less data, you can instead fetch a subset of the videos with the official downloader, as follows:

# Clone the helper repo for downloading videos
git clone https://github.com/epic-kitchens/epic-kitchens-download-scripts.git
cd epic-kitchens-download-scripts

# Download the 10 shortest videos
python epic_downloader.py \
    --videos \
    --specific-videos P03_15,P03_26,P06_02,P09_01,P26_30,P04_19,P07_106,P03_110,P02_05,P26_12 \
    --output-path ../EPIC-KITCHENS-100

Download Action Annotations

The action annotations are available in the epic-kitchens-100-annotations repository, which we simply clone:

git clone https://github.com/epic-kitchens/epic-kitchens-100-annotations.git

Verify the Folder Structure

After downloading, you should have the following folder structure. The videos are organized by participants P01 - P37, and each participant has a videos folder with the video files. The action annotations are in the epic-kitchens-100-annotations folder, in the EPIC_100_train.csv and EPIC_100_validation.csv files.

.
├── EPIC-KITCHENS-100/
│   ├── P01/
│   │   └── videos/
│   │       ├── P01_101.MP4
│   │       └── ...
│   └── ...
└── epic-kitchens-100-annotations/
    ├── EPIC_100_train.csv
    ├── EPIC_100_validation.csv
    └── ...

Preprocessing the Videos

We cut the videos into clips, one for each annotated action. The annotations provide the start and end times of each action; an example annotation looks like this:

narration_id,participant_id,video_id,narration_timestamp,start_timestamp,stop_timestamp,start_frame,stop_frame,narration,verb,verb_class,noun,noun_class,all_nouns,all_noun_classes
P01_102_0,P01,P01_102,00:00:01.100,00:00:00.54,00:00:02.23,27,111,take knife and plate,take,0,knife,4,"['knife', 'plate']","[4, 2]"
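The timestamps use an HH:MM:SS.ff format that pandas can parse directly. As a small illustration (not part of the preprocessing script), the clip boundaries of the row above can be turned into a duration like this:

```python
import pandas as pd

# Parse the start/stop timestamps of the example annotation
start = pd.to_timedelta("00:00:00.54")
stop = pd.to_timedelta("00:00:02.23")

# Clip duration in seconds
duration = (stop - start).total_seconds()
print(duration)  # 1.69
```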

We had an AI assistant write a Python script that loads the annotations from the two files with pandas and then calls ffmpeg to cut the clips from the videos. We also downsized the clips to 854x480 px.

We uploaded the script together with its outputs as the lightly-ai/epic-kitchens-100-clips dataset to HuggingFace. You can run it as follows; make sure ffmpeg is already installed on your system:

pip install pandas tqdm
python cut_clips.py

It expects the folder structure described above and creates a clips folder with the cut clips, named by their narration ID, e.g. clips/P01/P01_102_0.mp4 for the example annotation above.
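At its core, the script issues one ffmpeg invocation per annotation. A minimal sketch of how such a command could be assembled (the exact flags in cut_clips.py may differ):

```python
def build_ffmpeg_cmd(src: str, dst: str, start: str, stop: str) -> list[str]:
    """Build an ffmpeg command that cuts [start, stop] out of src
    and downsizes the resulting clip to 854x480 pixels."""
    return [
        "ffmpeg",
        "-i", src,                 # input video
        "-ss", start,              # action start timestamp
        "-to", stop,               # action stop timestamp
        "-vf", "scale=854:480",    # downsize to 854x480
        "-y", dst,                 # overwrite the output clip if it exists
    ]

cmd = build_ffmpeg_cmd(
    "EPIC-KITCHENS-100/P01/videos/P01_102.MP4",
    "clips/P01/P01_102_0.mp4",
    "00:00:00.54",
    "00:00:02.23",
)
# Execute with: subprocess.run(cmd, check=True)
```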

Note: For the 464 GB dataset of videos, the script ran for about 8.5 hours on a 47-core machine, without fully saturating all cores. It created 37,455 clips with a total size of 24 GB.


Loading the Clips in LightlyStudio

Now the difficult part is done, and we are ready to load the clips into LightlyStudio. We again use pandas for loading the annotations and tqdm for displaying progress.

Create a Python script load_clips.py with the following content:

import lightly_studio as ls
import pandas as pd
from tqdm import tqdm

# Load video clips into a LightlyStudio dataset
dataset = ls.VideoDataset.load_or_create()
dataset.add_videos_from_path(path="./clips")

# Load narration CSVs
train_csv = pd.read_csv("./epic-kitchens-100-annotations/EPIC_100_train.csv")
val_csv = pd.read_csv("./epic-kitchens-100-annotations/EPIC_100_validation.csv")
file_name_to_row = {}
for _, row in pd.concat([train_csv, val_csv], ignore_index=True).iterrows():
    filename = f"{row['narration_id']}.mp4"
    file_name_to_row[filename] = row.to_dict()

# Add metadata to each video
for video in tqdm(dataset, "Loading annotations"):
    row = file_name_to_row[video.file_name]

    # Add a caption
    video.add_caption(row["narration"])

    # Add metadata
    for key, value in row.items():
        video.metadata[key] = value

# Start the LightlyStudio GUI
ls.start_gui()

We first create a video dataset and add the videos from the clips folder. Then we load the annotations from the two CSV files into a mapping from file name to CSV row. Next, we loop through the videos in the dataset, add the narration column as the video caption, and populate the video metadata with all the other columns from the CSV. Finally, we start the LightlyStudio GUI:

python load_clips.py

Once the data is loaded, it is persisted in the lightly_studio.db file. The GUI server can be safely stopped by pressing Ctrl+C in the terminal, and restarted by calling ls.start_gui() again:

python -c "import lightly_studio as ls; ls.start_gui()"

Note: Loading annotations one-by-one can be very slow. To process the whole dataset, we used a more optimised version of the script with bulk inserts. You can find it on HuggingFace.

Exploring EpicKitchens with LightlyStudio

Get a Quick Overview

On the initial screen, we see a grid of all the videos together with their captions. The bottom left shows that we loaded 37,455 videos. We can hover over each video to see it playing, and double-click to open the video details page. There we can see all metadata loaded from the CSV.

Captions can also be inspected in a dedicated tab, where long captions are displayed in full. If there were multiple captions per video, they would all be displayed here. Caption editing is supported.

Caption editing in LightlyStudio

Understand the Dataset

LightlyStudio computes embeddings with the Perception Encoder model for all the videos, so that they can be easily visualized and searched. In the embedding plot, we see that the videos are organized in clusters. We can lasso-select a cluster to see which videos are in it. Selected data can be easily tagged.

We can also use text search to find videos with specific content. When submitting a query, the text is embedded with Perception Encoder and compared with indexed video embeddings stored in a local database for high performance.
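Conceptually, such a search reduces to a cosine-similarity ranking between one query embedding and all stored video embeddings. A toy NumPy sketch (the real implementation uses Perception Encoder embeddings and an indexed database):

```python
import numpy as np

def top_k(query: np.ndarray, videos: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k videos most similar to the query embedding."""
    # Normalise so that dot products equal cosine similarities
    query = query / np.linalg.norm(query)
    videos = videos / np.linalg.norm(videos, axis=1, keepdims=True)
    scores = videos @ query
    return np.argsort(-scores)[:k]

# Toy example: 4 "video embeddings" in 3 dimensions
videos = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])
print(top_k(query, videos, k=2))  # indices of the two closest videos
```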

To get a smaller, representative sample of the dataset, we navigate to Menu → Selection and select 100 videos using the “Diversity” strategy. The selection is performed in Rust. Selected videos are tagged with a chosen tag.
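A common way to implement such a diversity strategy is greedy farthest-point sampling over the embeddings. A simplified Python version for intuition (LightlyStudio's Rust implementation will differ in its details):

```python
import numpy as np

def diversity_sample(embeddings: np.ndarray, n: int) -> list[int]:
    """Greedily pick n indices so that each new pick is the point
    farthest from everything selected so far."""
    selected = [0]  # start from an arbitrary point
    # Distance from every point to its nearest selected point
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < n:
        idx = int(np.argmax(dists))        # farthest from current selection
        selected.append(idx)
        new_d = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        dists = np.minimum(dists, new_d)   # update nearest-selected distances
    return selected

points = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [5.0, 5.0]])
print(diversity_sample(points, 3))  # picks well-spread points
```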

Conclusion

To summarise, we have shown how to:

  • Overcome the difficulties of loading the EPIC-KITCHENS-100 dataset
  • Preprocess the videos into clips corresponding to annotated actions
  • Load and explore the dataset in LightlyStudio

This only scratches the surface of the capabilities of LightlyStudio. To see how to edit captions, export the annotations, and more, check out our documentation at https://docs.lightly.ai/studio/ and stay tuned for more updates.
