How high-quality data can improve AI projects
Collecting and handling data well is a key requirement for a successful AI project. AI has long been associated with collecting massive amounts of data. This approach originated with large consumer internet companies, because it is how they got deep learning to work in the first place. In industrial automation, however, data sets are much smaller, because it is not possible to collect huge amounts of data. What matters, therefore, is not the quantity of data but its quality. If the quality is not right, the project will not succeed. To achieve widespread AI deployment in manufacturing, the focus must shift from high quantities of data to high-quality data.
In practice, this means that manufacturers have to focus on correctly classifying, grading, and labeling defect images rather than simply increasing the number of images. A developer training an AI model for manufacturing should focus on the quality of the available data rather than trying to tweak the resulting model by changing specific parameter values or the statistical methods used to sample the images and build the model. According to theoretical estimates, if just 10% of the data is mislabeled, manufacturers need 1.88 times as much new data to reach a given level of accuracy. If 30% of the data is mislabeled, they need 8.4 times as much data as they would with clean data.
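To make these multipliers concrete, the following minimal sketch applies them to a data set size. It assumes the quoted factors scale the total number of images needed, and the baseline of 500 clean images is a hypothetical figure chosen only for illustration; the function name `images_needed` is likewise an assumption, not part of any tool.

```python
# Multipliers quoted in the text: share of mislabeled data -> factor of data needed.
# Only these three points are known; nothing is interpolated between them.
DATA_MULTIPLIER = {0.0: 1.0, 0.10: 1.88, 0.30: 8.4}


def images_needed(clean_baseline: int, mislabeled_share: float) -> int:
    """Images needed to match the accuracy reached with `clean_baseline` clean images.

    Assumption: the quoted factor applies to the total data set size.
    """
    factor = DATA_MULTIPLIER[mislabeled_share]
    return round(clean_baseline * factor)


# Hypothetical baseline: 500 correctly labeled defect images.
for share in (0.0, 0.10, 0.30):
    print(f"{share:.0%} mislabeled -> {images_needed(500, share)} images")
```

Under these assumptions, a project that would need 500 clean images would need 940 images at 10% label noise and 4,200 at 30%, which is why correcting labels is usually cheaper than collecting more parts.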
So how can we ensure that our data sets are of high quality? deevio's solution was to develop a labeling tool with which non-programmers can easily label images. Manufacturing experts are the ones who know their products and their factory, so they are best placed to judge whether the collected data is of sufficient quality. Making AI technologies accessible and easy to use is therefore important for maximizing the accuracy and quality of the training data. Once sample images have been acquired, the responsible manufacturing experts should label them, locating and identifying the defects on the parts. Naturally, different experts may have different opinions. It is therefore advisable to create a defect catalog that specifies the possible defects, so that future quality inspectors have guidelines for labeling data correctly, which in turn improves the performance of the final system.
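One simple way to see where experts disagree is to compare their labels image by image and set aside every image without a unanimous label for discussion against the defect catalog. The sketch below is an illustration only, not deevio's tool: the expert names, image IDs, and label names are hypothetical, and `review_labels` is a helper invented for this example.

```python
from collections import Counter

# Hypothetical labels from three experts for the same four images.
labels_by_expert = {
    "expert_a": {"img_001": "scratch", "img_002": "ok", "img_003": "dent", "img_004": "ok"},
    "expert_b": {"img_001": "scratch", "img_002": "ok", "img_003": "scratch", "img_004": "ok"},
    "expert_c": {"img_001": "scratch", "img_002": "dent", "img_003": "dent", "img_004": "ok"},
}


def review_labels(labels_by_expert):
    """Split images into unanimously labeled ones and ones that need review."""
    # Only consider images that every expert has labeled.
    image_ids = set.intersection(*(set(labels) for labels in labels_by_expert.values()))
    agreed, disputed = {}, {}
    for image_id in sorted(image_ids):
        votes = Counter(labels[image_id] for labels in labels_by_expert.values())
        if len(votes) == 1:
            agreed[image_id] = next(iter(votes))  # the single label everyone chose
        else:
            disputed[image_id] = dict(votes)      # label -> number of experts
    return agreed, disputed


agreed, disputed = review_labels(labels_by_expert)
agreement_rate = len(agreed) / (len(agreed) + len(disputed))
print(f"Unanimous on {agreement_rate:.0%} of images; review: {sorted(disputed)}")
```

Disputed images are good candidates for new entries or clearer definitions in the defect catalog, since they show where the current guidelines leave room for interpretation.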
If the training data is labeled well, deep learning-based machine vision systems can be trained reliably even when the quantity of data is limited. Moreover, the labeling process helps manufacturing experts establish a guideline for what counts as a defect and teaches them how to improve the quality of the data. An additional benefit is that well-curated data makes it possible to analyze the production accuracy of the parts, which can save time and energy in the future. Lastly, combining the strengths of quality managers and AI model developers makes both the labeling of data and the optimization of AI models more effective.