Why are data pipelines important?

Data pipelines are important because, without a professional setup, errors can creep in that jeopardize dataset quality, and significant resources are wasted on manual work (read more about what data pipelines are here, and how to build a data pipeline here). In some of the worst cases, we have seen millions of dollars wasted on bad data, which naturally led to disappointing outcomes for the whole machine learning project.

I have seen many machine learning teams flop for reasons that could have been avoided. The most frequent mistake is neglecting to think through the machine learning pipeline at the start of the project, often because the engineering team lacks “real world” machine learning experience. A rather academic setup will most likely fail in a production setting. Figure 1 lists four of the most common failure reasons I have heard from ML teams, along with explanations of why each is a bad idea:

Figure 1: Machine learning project failure reasons; own table by author

The reasons in Figure 1 do not include wrong assumptions about the business case or bad project management. The common theme across all of them is that the end result is not the focus and that determining success is postponed until late in the development process. In practice, this can mean that a poor machine learning pipeline is only exposed once the model is first trained and deployed in production. This can be avoided by setting up, from the get-go, a machine learning pipeline that closes the model-data feedback loop, which in turn requires a data pipeline that enables fast end-to-end iterations through automation and scalability.

Automation and scalability are therefore the two key properties. Here is why each is so crucial for data pipelines:

1) Automation: The model-data feedback loop can only be executed if iterations are fast (e.g., two weeks instead of the year it takes in many companies). Fast, automated iterations spotlight weaknesses much earlier in the development cycle, before significant investments have been made, and let you steer resources and development effort in the right direction (a minimal sketch of such a loop follows below this list).

2) Scalability: Architecture matters. Most machine learning projects start with small datasets; however, dataset sizes and data storage can multiply quickly. Building a pipeline that can scale from the get-go is therefore crucial to avoid having to redesign the entire pipeline from scratch later on.
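To make the automation point concrete, here is a minimal, self-contained sketch of a model-data feedback loop. It uses scikit-learn with synthetic data purely for illustration; the least-confident sampling strategy, the 0.95 target accuracy, and the batch size of 200 are assumptions made for the sketch, not a prescription:

```python
# Sketch of an automated model-data feedback loop.
# Synthetic data and a simple classifier stand in for a real pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Start with a small labeled set; the rest acts as the unlabeled pool.
labeled = np.zeros(len(X_pool), dtype=bool)
labeled[:100] = True

model = LogisticRegression(max_iter=1000)
for iteration in range(10):
    model.fit(X_pool[labeled], y_pool[labeled])
    accuracy = model.score(X_test, y_test)    # evaluate every cycle
    print(f"iteration {iteration}: accuracy={accuracy:.3f}")
    if accuracy >= 0.95:                      # good enough: stop early
        break
    # Pick the unlabeled samples the model is least confident about
    # and "label" them (labels are already known here; in practice this
    # is where human annotation or auto-labeling happens).
    confidence = model.predict_proba(X_pool).max(axis=1)
    confidence[labeled] = np.inf              # exclude already-labeled samples
    labeled[np.argsort(confidence)[:200]] = True
```

The essential design choice is that evaluation happens on every cycle, so weaknesses surface after days rather than after the model is already in production.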

All of these insights come from my own experience as a co-founder at Lightly, but they are also consistent with industry observations. Machine learning development follows an S-curve (see Figure 2 below): the Y-axis shows AI model accuracy, and the X-axis shows the resources invested in development. In the beginning, decent accuracy can be reached with few resources; however, the higher the targeted accuracy, the higher the cost of each additional accuracy point.

Figure 2: Machine Learning Economics S-Curve; Illustration by author 

Therefore, companies with professional data pipelines have a higher success rate for their machine learning projects than companies without them, because they can allocate their resources more efficiently toward their end goal.

That’s why data pipelines are becoming more and more crucial in machine learning in general, and in computer vision in particular. There are two reasons for this.

The first reason is that machine learning technology is leaving the experimental and prototyping phase. Good benchmark accuracy is no longer good enough: ML-based products need to be reliable and work in the real world. This trend is accelerated by growing industry adoption in many areas such as autonomous driving, robotics, visual analytics, and inspection. The margin for error in an autonomous car is close to zero.

The second reason is that machine learning models are being deployed in many smart products, which means they also need to be updated continuously. In a certain sense, machine learning is like software development: in software, we update products with code; in machine learning, we update products with data. This particularly applies to computer vision, where the visual world changes constantly. There, data drift is a frequent phenomenon that can lead to model failure (e.g., new product packaging in self-checkout systems, or Tesla’s Cybertruck in autonomous driving). Thus, automated data pipelines become crucially important.
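As a rough illustration of how a pipeline might watch for such drift, here is a hedged sketch that compares the mean embedding of production images against that of the training set. The embeddings, the 1.0 threshold, and the re-training trigger are all assumptions made for the sketch; in practice, the embeddings would come from a pretrained vision backbone:

```python
import numpy as np

def drift_score(train_embeddings: np.ndarray,
                prod_embeddings: np.ndarray) -> float:
    """Distance between the mean embeddings of training and production data.

    A large value suggests the input distribution has shifted, e.g. new
    product packaging or a vehicle type the model has never seen.
    """
    return float(np.linalg.norm(train_embeddings.mean(axis=0)
                                - prod_embeddings.mean(axis=0)))

# Stand-in embeddings; in a real pipeline these would come from a
# pretrained vision backbone applied to training and production images.
rng = np.random.default_rng(0)
train_emb = rng.normal(0.0, 1.0, size=(1000, 128))
prod_emb = rng.normal(0.3, 1.0, size=(1000, 128))  # shifted distribution

DRIFT_THRESHOLD = 1.0  # assumed value; tune per project
if drift_score(train_emb, prod_emb) > DRIFT_THRESHOLD:
    print("Data drift detected: trigger the re-training pipeline.")
```

A mean-embedding comparison is deliberately simple, and more robust drift detectors exist; the key point is that the check runs automatically, because nobody manually inspects every production image.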

Conclusion

Data pipelines are critical because they allow you to work directly with the model in production and to improve it iteratively through fast re-training cycles. This is enabled by (1) automation and (2) scalability.

Matthias Heller,
Co-founder Lightly.ai