Too Much Data on the Edge? How to Build Data Pipelines for Edge AI

Sensors around the world are collecting massive amounts of data

The surge in Edge AI adoption brings a unique challenge: managing an overwhelming amount of data. Imagine a world where every device, from security cameras, cars, and phones to fitness trackers, generates data that’s too vast to process by traditional means. This is the world of Edge AI, where efficient data pipelines are not just a convenience, but a necessity.

Why AI is Going to the Edge

Visualization showing how edge computing is turning dreams into reality
Own illustration; inspired by Pushpak Pujari

Edge computing runs software close to the data source, be it cameras, smartphones, or IoT devices. The benefits are manifold:

  • Bandwidth: The reduced need to transmit large data sets to the cloud is a game-changer, especially for video data.
  • Latency: Low latency enables advanced applications like ADAS and UAVs.
  • Resilience: Distributed computing leads to greater system resilience and efficiency.
  • Costs: Edge computing can significantly reduce operational costs.
  • Privacy: Local data processing enhances data privacy.

The problem: Too much data from too many devices

Visualization of the spectrum of Edge AI
Source: “Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing” (Zhou et al., Proceedings of the IEEE, 2019)

The Spectrum of Edge AI: According to Zhou et al. (Proceedings of the IEEE, 2019), Edge AI spans several levels, from cloud-based training with edge inference (Level 1) to complete on-device training and inference (Level 6). However, this progression raises a question: how do we manage data effectively at the higher levels, especially when devices are often offline?

We cannot simply stream all data from the devices to the cloud: this would be prohibitively expensive and would contradict the very reason we moved computation to the edge in the first place. At the same time, we need access to real-world data to ensure our models work in production across different environments.
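To make the cost argument concrete, here is a back-of-envelope calculation. The bitrate and fleet size below are purely illustrative assumptions, not measurements:

```python
# Back-of-envelope: why streaming everything to the cloud does not scale.
# All figures are illustrative assumptions.
bitrate_mbps = 4      # assumed H.264 stream for one 1080p camera
cameras = 1_000       # assumed fleet size
hours_per_day = 24

# Data produced per camera per day, in gigabytes (Mbit/s -> MB/s -> GB/day)
gb_per_camera_day = bitrate_mbps / 8 * 3600 * hours_per_day / 1000

# Data produced by the whole fleet per day, in terabytes
tb_per_fleet_day = gb_per_camera_day * cameras / 1000

print(f"{gb_per_camera_day:.1f} GB per camera per day")   # 43.2 GB
print(f"{tb_per_fleet_day:.1f} TB across the fleet per day")  # 43.2 TB
```

Even under these modest assumptions, a fleet of a thousand cameras produces tens of terabytes every day; uploading, storing, and labeling all of it would dwarf the value it adds.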

The Challenges of Edge AI: Managing data on the edge is filled with challenges:

  • Device Heterogeneity: The diverse range of devices and platforms complicates uniform data handling.
  • Resource Constraints: Limited processing power and storage capacity on edge devices pose significant challenges.
  • Custom Deployments: Tailoring AI solutions for specific edge environments is complex.
  • Feedback-Led Iteration Difficulties: Continuously improving models based on edge data is a difficult task.

Thus, we face several problems in efficient data and model management for Edge AI. First, imbalanced data skews our model’s learning process. Second, the sheer volume of data generated on the edge is overwhelming; it is essential for improving our models, yet remains largely inaccessible and underutilized. Third, the data in our cloud storage is less relevant because it does not represent the real-world data the model sees, and struggles with, on the edge. Therefore, the critical question we face is:

How can we efficiently access data on edge devices to fix our models?

Requirements for a potential solution

A solution to these problems would need to help select the right data on the edge and meet the following requirements:

  • Diverse Data Sources: Handling data from numerous devices across different domains.
  • Selective Data Retrieval: Extracting only the relevant data from customer devices, particularly when these devices are often offline.
  • Offline Data Logging and Processing: Logging data that current models do not handle well, so it can later be selectively retrieved for model fine-tuning.
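The offline logging requirement can be sketched in a few lines. This is a hypothetical illustration, not Lightly's implementation; the directory name and confidence threshold are assumptions:

```python
import json
import time
from pathlib import Path

# Hypothetical sketch: persist pointers to samples the on-device model is
# unsure about, so they can be retrieved selectively the next time the
# device comes online. LOG_DIR and CONF_THRESHOLD are illustrative choices.
LOG_DIR = Path("edge_log")
CONF_THRESHOLD = 0.6  # below this, the model "did not understand" the sample


def log_if_uncertain(sample_id: str, confidence: float, metadata: dict) -> bool:
    """Log a hard sample for later selective retrieval; return True if logged."""
    if confidence >= CONF_THRESHOLD:
        return False  # model handled the sample well; nothing to log
    LOG_DIR.mkdir(exist_ok=True)
    record = {"id": sample_id, "confidence": confidence,
              "timestamp": time.time(), **metadata}
    with open(LOG_DIR / f"{sample_id}.json", "w") as f:
        json.dump(record, f)
    return True


# A low-confidence frame gets logged; a confident one is dropped on-device.
log_if_uncertain("frame_001", 0.31, {"device": "cam-a"})
log_if_uncertain("frame_002", 0.97, {"device": "cam-a"})
```

The key design choice is that only small JSON pointers are written locally; the heavy raw data stays on the device until a sync window decides which samples are worth uploading.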

The Solution

Active learning loop
Own illustration

Active Learning in Edge AI

Active learning plays a crucial role in edge AI by facilitating:

  • Model Output-Driven Data Retention: Prioritizing data that significantly improves the model.
  • Class Imbalance Resolution: Tackling skewed outcomes in data sets.
  • Privacy Considerations: Ensuring data privacy while enhancing model performance.
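The first two signals above can be combined into a single retention score. The following is a minimal, generic sketch of uncertainty-plus-rarity scoring; the weighting scheme is an assumption for illustration, not Lightly's actual algorithm:

```python
import math
from collections import Counter

def entropy(probs):
    """Predictive entropy of a softmax output: high when the model is unsure."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def select_for_retention(samples, budget, class_counts):
    """Rank samples by uncertainty plus class rarity and keep the top `budget`.

    samples: list of (sample_id, predicted_class, softmax_probs).
    class_counts: how often each class has been seen so far.
    """
    total = sum(class_counts.values()) or 1
    def score(item):
        _, cls, probs = item
        rarity = 1.0 - class_counts.get(cls, 0) / total  # rare class -> high
        return entropy(probs) + rarity
    ranked = sorted(samples, key=score, reverse=True)
    return [sid for sid, _, _ in ranked[:budget]]

samples = [
    ("a", "car",        [0.98, 0.01, 0.01]),  # confident, common class
    ("b", "pedestrian", [0.40, 0.35, 0.25]),  # uncertain, rare class
    ("c", "car",        [0.55, 0.30, 0.15]),  # somewhat uncertain
]
counts = Counter({"car": 900, "pedestrian": 10})
print(select_for_retention(samples, budget=2, class_counts=counts))  # ['b', 'c']
```

Sample "b" wins on both signals (high entropy, rare class), while the confidently-predicted common sample "a" is discarded, which is exactly the behavior the bullet points describe.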

Reflecting on Previous Insights: In our previous blog, “Navigating the Future of Edge AI,” we highlighted the practical challenges of data management, deployment, and drift in edge AI. Effective solutions require datasets that accurately reflect real-world conditions, a balance in deployment across diverse hardware, and adaptability to continuous data changes.

Data Curation’s Crucial Role: Efficient data curation is key, optimizing datasets for enhanced model performance and training efficiency. It involves real-time monitoring and the selection of valuable data at the edge.
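Besides uncertainty, curation typically also values diversity: keeping samples that look different from what is already collected. One common heuristic is farthest-point sampling on embeddings, sketched below with made-up 2-D embeddings; real pipelines would use learned representations, and this is a generic illustration rather than Lightly's selection algorithm:

```python
import math

def farthest_point_sample(embeddings, k):
    """Greedily pick k diverse samples: each new pick is the one whose
    nearest already-selected neighbor is farthest away (a coreset heuristic).

    embeddings: dict of sample_id -> vector (tuple of floats).
    """
    ids = list(embeddings)
    selected = [ids[0]]  # seed with an arbitrary first sample
    while len(selected) < k:
        best = max(
            (i for i in ids if i not in selected),
            key=lambda i: min(math.dist(embeddings[i], embeddings[s])
                              for s in selected),
        )
        selected.append(best)
    return selected

# x1, x2, x4 are near-duplicates; x3 is a genuinely different sample.
emb = {"x1": (0.0, 0.0), "x2": (0.1, 0.0), "x3": (5.0, 5.0), "x4": (0.0, 0.2)}
print(farthest_point_sample(emb, k=2))  # ['x1', 'x3']
```

Near-duplicates like "x2" and "x4" add little training value, so a diversity criterion prunes them and keeps the outlier, shrinking the dataset without shrinking its coverage.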

Data pipeline by Lightly
Source: Lightly.ai

Introducing Lightly’s Solution

As we address these multifaceted challenges, Lightly emerges as a pivotal solution. Utilizing active learning, Lightly offers a nuanced approach to data management on the edge:

  • Efficient Data Selection: Lightly’s software, designed to operate offline on edge devices, intelligently selects the most relevant data.
  • Addressing Offline Challenges: It effectively manages data from offline devices, fetching it periodically for model refinement and learning across diverse domains.
  • Enhancing Model Efficiency: Lightly focuses on improving models without the need for excessive computational resources.

Conclusion: The journey to build effective data pipelines for Edge AI involves more than just managing data; it requires intelligent, efficient, and privacy-conscious data processing. Lightly’s approach, grounded in active learning and data curation, stands as a vital tool in transforming the challenge of excessive data into an opportunity for more impactful Edge AI applications.

Matthias Heller, Co-founder Lightly.ai

Thanks to Laura Schweiger and Igor Susmelj for reviewing this blog.