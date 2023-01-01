Yuka’s existing database is massive: it contains over 4 million products and grows by approximately 1,200 new products every day. That’s a huge amount of data, and Yuka’s small team cannot manually review each new product added to the platform. Adding a new product also often requires multiple transcription tasks. Adding a food item, for example, requires the application to scan both a nutritional table and an ingredients list.

Yuka partially accounts for this quantity of data by first using optical character recognition (OCR) to scan product images for text describing the product’s nutritional information and ingredients. This process is not perfect, however: OCR is not accurate across all labels, and it fails in many cases. For example, it performs poorly on images with inconsistent lighting, obstructions, or irregular text surfaces.

To ensure only high-quality information is added to the application, Yuka checks that OCR achieves a sufficient detection rate before adding a product to the database and generating its health score. If the OCR results are of insufficient quality, the product images need to be labeled by a human annotator instead. Because of OCR’s limited robustness, about 60% of the images submitted to Yuka need to be outsourced to a human annotator. On a given day, this could be as many as 500 to 1,000 images! Labeling this many images manually was out of the question for Yuka’s small team. They also wanted transcription results quickly: when a user adds a new product to the database, Yuka aims to provide the product’s health score within 2 to 3 hours!
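
The quality gate described above can be sketched in a few lines of Python. This is only an illustration and not Yuka’s actual implementation. The `OcrResult` type, the list of expected fields, the definition of detection rate as the share of expected fields that OCR recognized, and the 0.9 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical fields expected from a nutritional table (assumption, not from Yuka).
EXPECTED_FIELDS = ("energy", "fat", "saturated_fat", "sugars", "salt", "protein")


@dataclass
class OcrResult:
    """Hypothetical container for the text OCR extracted from one image."""
    image_id: str
    # Field name -> recognized text, or None when OCR missed the field.
    fields: Dict[str, Optional[str]]


def detection_rate(result: OcrResult) -> float:
    """Share of expected fields for which OCR returned non-empty text."""
    found = sum(1 for name in EXPECTED_FIELDS if result.fields.get(name))
    return found / len(EXPECTED_FIELDS)


def route(result: OcrResult, min_rate: float = 0.9) -> str:
    """Return "auto" if OCR output is good enough, else "human".

    min_rate is an assumed threshold; the source does not state the real one.
    """
    return "auto" if detection_rate(result) >= min_rate else "human"


# Example: a clean scan versus a scan where most fields were missed.
clean = OcrResult("img-001", {name: "1 g" for name in EXPECTED_FIELDS})
blurry = OcrResult("img-002", {"energy": "250 kcal", "fat": None})

for scan in (clean, blurry):
    print(scan.image_id, route(scan))
```

In this sketch, an image routed to `"human"` would join the queue for manual transcription, while an image routed to `"auto"` would go straight to database insertion and health scoring.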