top of page
  • Writer's picturedeevio

How to Acquire a Visual Quality Inspection Dataset

Updated: Nov 29, 2021

At deevio we learned that a good dataset is the key to creating a successful automated visual quality inspection system. Conversely, if the dataset is inaccurate or inconsistent it’s very difficult or sometimes impossible to train an accurate model, no matter which state-of-the-art deep learning architectures or model training techniques are used.

This post is targeted at our customers to help them acquire and label a high quality dataset by avoiding common mistakes and following best practices.

Preparing your project

Keep your business goal in mind

The main goal(s) of a system must be clear from the start. Do you want to decrease costs, reduce the amount of missed defects, reduce the amount of falsely rejected products, increase production speed, etc? Should the system be fully automated? Or will a human be involved in the process?

Knowing the answers to these questions and sharing them with our data scientists will help at every stage of the project, starting from data acquisition. Additionally, this will help to define an appropriate metric to evaluate the model and train it accordingly.

First limit the scope, then scale up on success

With your predetermined goal in mind, the scope can be reduced to something limited but relevant to that goal. It could be just one product, just one defect type, just one perspective, up to hundreds (but not much less) of images (if they cannot be acquired and labeled automatically), and so on. For example, if the goal is to decrease costs caused by incorrectly rejected parts, pick a single product type where these costs are the highest.

A limited scope is sufficient to estimate if the problem is solvable in principle, but it takes less time and effort to acquire and label the data. It’s also easier to reacquire and/or relabel a smaller dataset, in case errors are discovered down the line.

Decide if defect localization is required

It may be enough just to know if a product is defective or not. In this case a single label per image is enough and that is an image classification problem.

If the exact location of a defect on an image is required, there are few options. If it is enough to get a bounding box around a defect it is an object detection problem. If it is necessary to see locations of defects on a per-pixel level it is a semantic segmentation problem (when separating adjacent defects of the same type is not required) or instance segmentation problem (when it is required to separate adjacent defect instances). See an illustration of these differences below.

Localized labels require much more effort to create, of course. But even in cases when it is enough to just get a per-image label, using segmentation can make a model more explainable by design. This can be an alternative to the methods of explainable AI like integrated gradients, that try to explain a decision of a classification model.

Computer vision tasks. Credits to the authors of the photos: Ty_Swartz ( and JACLOU-DL (

Decide if distinguishing defects is required

It’s important to decide if it is necessary to simply distinguish between defective and good products, or to detect several different types of defects like scratches, cracks, dirt, or porosities. Additionally, for some applications it may have sense to introduce gradations of defects e.g. small/medium/large scratches.

Similarly to localization, even if knowing if a product is defective or not is enough to take an action e.g. on an automated production line, having more detailed labels can help to improve model explainability or accuracy at the cost of more laborious labeling.

Find an appropriate vision system configuration

A good vision system with properly selected and calibrated configuration is required to acquire a high quality dataset. The images must have the right properties such as sharpness, brightness, contrast, and resolution good enough to distinguish between defective and good images, as well as to detect various defect types.

To ensure that the results of a feasibility study or a proof-of-concept can be reproduced in production, the configuration of a vision system must stay the same, including the environment that surrounds it. For example, a system that worked well in a dark room may not work if moved to a sunlit room (without additional tuning).

deevio can help you to create a proper vision system or acquire photos of your products for you.

Use lossless image format if possible

Visual inspection datasets often have very small defects, like scratches, that could be just a few pixels in size. In such cases, unless storage capacity or bandwidth becomes a problem, it is preferable to store images in lossless format like PNG or BMP, rather than a lossy one like JPEG.

The labelling process

Involve an experienced visual inspector in the labeling process

We’ve seen several cases where the visual inspectors on a production line were not involved in data labeling. Instead this task was often delegated to less experienced workers e.g. working students which received only brief training. The labellers often followed incorrect labeling rules due to a lack of training, and the model was learning these incorrect rules from the data.

For example, an untrained person was not able to distinguish between “still OK” and “already NOK” as accurately as people who were professionally experienced, and tended to label every borderline case as a defect. As a result the model trained on that data was reproducing the same mistake.

Therefore, we strongly suggest involving experienced visual inspectors from the quality control or production department in the labeling process.

Make sure the task can be unambiguously solved by a human

Сan an experienced person unambiguously recognize the presence or absence of a defect, its type and location, looking just at the acquired images? If this is not the case, it is very likely that the deep learning model cannot do this either. When it is hard or impossible for a human to decide, this means the data are ambiguous. If they can decide, but need to check the part itself instead of just the images, this is still an indication for ambiguity in the data (and a hint that the acquisition should be improved). The reasons for ambiguity could be from low photo resolution, non-optimal perspective or lighting to uncertainty in the criteria of what a defect is.

Training a model on such data harms the overall performance, and there is no reliable way to evaluate the performance of the model without knowing what the correct label should be.

Have a defect catalogue

A good way to achieve certainty in the reject criteria is by compiling all defect types in written form, along with example images for each defect type in a defect catalogue.

The defect catalogue must be regularly updated when new defect types or new product types are introduced, or when an uncertain case is spotted.

Such a document also makes the knowledge transfer process more quick and reliable and allows you to involve more labellers when necessary.

Make sure different annotators are consistent with each other

In cases where more than one person is involved in labeling the data it is important they apply exactly the same rules.

A practical way to ensure this is to ask each labeller to independently label a small sample of the data, and then compare any discrepancies between their labels. If such discrepancies exist it is very important to reach a consensus, verify it by labeling another sample independently, update the defect catalogue, and relabel data accordingly.

Even when a single person has labeled the entire dataset, asking another person for review in the way described above is a good idea as that helps to check if labeling rules are explainable and unambiguous.

Make sure that all the important cases are well represented

When, even considering the limited scope, the data still falls into multiple categories, they should be equally represented. Otherwise the model is likely to have low accuracy on underrepresented cases.

The most common source of underrepresentation is that usually just a small fraction of products are defective, but when acquiring a dataset it makes sense to keep the ratio close to 1:1.

Other categories that should be equally represented are:

  • different product types

  • different defect types

  • different camera perspectives e.g. top side of a product vs. bottom side of a product

Acquire data for a proof-of-concept once but keep labeling some data in production

Real-world data usually changes over time, which often leads to degradation of model accuracy. There are generally two sources of degradation:

  • Data drift refers to changes in input data e.g. changes in lighting.

  • Concept drift refers to changes in relation between input data and target e.g. requirements became more strict and products with tiny scratches that were considered good in the past are now considered defective.

The way to detect this issue is to regularly label some small amount of data while the system is in production to monitor its accuracy and retrain a model when necessary. If a dataset for a proof-of-concept system is not acquired in one day, different dates should be equally represented.

If you want to make a hold-out dataset keep it consistent with rest of the data

Hold-out datasets (also known as test set) contain images of products that were not seen by a model during training and can be used to estimate model accuracy on future production data. If a customer provides a single dataset, it will be split into training and hold-out parts by us.

If a customer is able to prepare their own hold-out dataset, it must be similar to the training data i.e. have the same date of acquisition, the same products, the same defect types, and so on. Otherwise it will significantly under- or overestimate model accuracy.

Avoid introducing bias

A model can become biased, meaning that it learns erroneous assumptions about the data, if there is a systematic difference in irrelevant attributes of images of defective and good products.

For example, if a customer draws circles around defects with a red marker, a model will likely learn to search for red circles instead of actual defects. If the hold-out dataset has the same bias, the problem may not be revealed until testing on production data. The customer could avoid this by circling the defects not on real parts, but on copies of their photos, or better yet by using specialized labeling software and not a general purpose graphical editor.

The example with red marker seems obvious, but there could be more subtle irrelevant differences in images of defective and good products. For example, one of two product types might have a much higher rate of defects, or there might be different lighting conditions when acquiring images of defective and good products. Especially for a small dataset, it is therefore good practice to acquire images of defective and good parts at the same time in a mixed order. Try to keep everything but the presence or absence of defects identical.

Bias introduced because the defect is circled.
Bias would be introduced when images with circled defects would be included in training.

Avoid duplicates

There must be no duplicate images, as they could end up in both training and hold-out parts of a dataset and thus lead to overestimation of model accuracy in production.

A more subtle problem, which is harder to detect, are repeated photos of a single product in the dataset. They can also lead to overestimation of model accuracy.

To prevent the second issue, it is good practice to include unique identifiers of products in file names of their photos, especially when there are multiple photos per product. It is also good to include any meaningful attributes like product types, camera perspectives, etc. either in file or folder names or in an accompanying table.

Make sure the data are clean before acquiring more

Model accuracy usually increases with the amount of data. If a very high level of accuracy is desired, the amount of the data must be appropriately large.

However, before acquiring more data, it makes sense to be sure that the existing data are already accurate and consistent, as existing problems are likely to be repeated in the future, making the efforts to grow the dataset less efficient.

Even if the dataset size is limited, it is still possible to estimate the relation between the data amount and model accuracy and predict if adding more data is likely to make a model significantly more accurate, or if the saturation point has already been reached.

An illustration from "Deep Learning Scaling is Predictable, Empirically".
An illustration from the paper "Deep Learning Scaling is Predictable, Empirically" by Hestness et al..


Be ready to work on improving your dataset while keeping in touch with our data scientists

As we can see, there are many potential problems in acquiring and labeling of visual inspection datasets and each of them may make estimates of model accuracy unreliable. Many of these problems can only be solved by or with the help of the domain expert, which is you, the customer. This is because rules of distinguishing defective products from good ones are often domain specific and customer specific. So plan to allocate people and resources in advance to help reacquire, relabel or grow your dataset, in order to achieve your business goal with the help of an automated quality inspection system.

Although we tried to make this post as comprehensive as possible, it’s impossible to cover every possibility - each case is unique. Get in touch with our data scientists through the lifetime of a project, starting even before data are acquired, and we will do everything we can to assist you.

677 views0 comments


bottom of page