How It's Made
7 steps to get started with large-scale labeling
How Instacart built a crowdsourced data labeling process (and how you can too!)
Organizations that develop technologies rooted in information retrieval, machine learning, recommender systems, and natural language processing depend on labels for modeling and experimentation. Humans provide these labels in the context of a specific task, and the data collected is used to construct training sets and evaluate the performance of different algorithms.
How do we collect human labels? Crowdsourcing has emerged as a common way to collect labels at scale. Services like Amazon Mechanical Turk or Figure Eight are platforms where one can create tasks, upload data sets, and pay for the work. However, some homework needs to be done before a data set is ready to be labeled. This is even more important for new domains where there are no existing training sets or other benchmarks… domains like grocery!
At Instacart, we are revolutionizing how people search, discover, and purchase groceries at scale. Every day, our users conduct millions of searches on our platform, and we return hundreds of millions of products for them to choose from. In such a unique domain, collecting human labels at scale has allowed us to augment Instacart search and generate best practices that we hope to share.
Introducing our “Pre-flight Checklist” for implementing large-scale crowdsourcing tasks. The checklist is independent of any specific crowdsourcing platform and can be adapted to any domain.
- Assess the lay of the land
- Identify your use cases
- Understand your product’s data
- Design your Human Intelligence Task (HIT)
- Determine your guidelines
- Communicate your task
- Maintain high quality
Before we jump in, a note on terminology: we use the terms rater, evaluator, and worker interchangeably to refer to a person who completes a task. In a task, humans are asked to answer one or more questions. This process is usually called labeling, evaluation, or annotation, depending on the domain.
1. Assess the lay of the land
The first step to approaching human evaluation is to understand what your organization has already done. Make sure to ask the following questions:
- Have we done any similar human evaluation tasks before?
- Do we have any human-labeled data?
If your organization has already collected human-evaluated data, make sure to understand the existing processes. Do you have vendors you already work with? Is there an established way to store human-labeled data? Existing approaches can influence how you design your crowdsourcing task, so it’s important to take stock. Understand what went well in previous projects and what lessons were learned.
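If no storage convention exists yet, it helps to settle on a minimal record format early. The sketch below is one hypothetical way to represent a single human judgment; every field name is an illustrative assumption, not Instacart's actual schema.

```python
# A minimal, hypothetical record for one human judgment.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class LabelRecord:
    task_id: str             # the HIT this judgment came from
    item_id: str             # the unit being labeled, e.g. a (query, product) pair
    rater_id: str            # anonymized identifier for the rater
    label: str               # the rater's answer, e.g. "relevant" / "not_relevant"
    guidelines_version: str  # which version of the guidelines the rater saw
    labeled_at: datetime     # when the judgment was collected

record = LabelRecord(
    task_id="hit-123",
    item_id="query:milk|product:456",
    rater_id="rater-42",
    label="relevant",
    guidelines_version="v1",
    labeled_at=datetime.now(timezone.utc),
)
```

Keeping the guidelines version alongside each label makes it possible to tell later which judgments were collected under which instructions.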
If you’re starting from scratch, focus on an area that the organization would like to know more about. For example, you may not know how good your top-k organic results are and may want to quantify that with a metric.
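As a concrete illustration, here is a hedged sketch of how such a metric could be computed once human labels exist. Precision@k over binary relevance labels is just one possible choice; the function below is illustrative, not Instacart's actual implementation.

```python
# A minimal sketch of quantifying top-k result quality with precision@k,
# assuming binary human relevance labels (1 = relevant, 0 = not relevant).

def precision_at_k(labels: list[int], k: int) -> float:
    """Fraction of the top-k positions that raters judged relevant.
    `labels` is ordered as the results were shown to the user."""
    if k <= 0:
        return 0.0
    return sum(labels[:k]) / k

# Example: raters judged the first five organic results for one query.
print(precision_at_k([1, 1, 0, 1, 0], k=5))  # 0.6
```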
At Instacart, we had previously completed a few ad-hoc projects, but now that we are beginning to run large-scale projects, we are revising the methodology.
2. Identify your use cases
Creating human-evaluated data is often a costly and time-consuming process. Make sure to ask yourself:
- What do we want the human-evaluated data to accomplish? Is there a metric in mind?
- Why is human-evaluated data necessary here? Is this a critical project or a nice-to-have?
- Is this a one-off attempt or part of a larger continuous project?
Your data could be used as general training and evaluation data, as a way to quality test the output of your model, or as a reference collection to benchmark current and future models. Each of these use cases may require different approaches, which you should keep in mind.
Moreover, make sure that your use cases will genuinely benefit from human labeling. Crowdsourced tasks require proper setup and a budget, so reserve them for tasks that truly require human input.
At Instacart, we wanted to measure the relevance of our search results. Labeled data helps us understand how relevant the products we show are to the queries users type into the search bar. This data can be used to train and evaluate models and to measure the quality of our search results.
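As a rough illustration of the measurement use case, the sketch below computes NDCG@k from graded human relevance labels. The 0-3 label scale and the function names are assumptions for the example, not a description of Instacart's actual metric pipeline.

```python
# A minimal sketch of turning graded human relevance labels into an
# offline search-quality metric (NDCG@k).
import math

def dcg_at_k(relevances: list[int], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)
        for pos, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances: list[int], k: int) -> float:
    """DCG normalized by the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Example: one query's top results labeled on a 0 (irrelevant) to 3 (perfect) scale.
print(ndcg_at_k([3, 2, 0, 1, 2], k=5))  # ~0.97
```

A ranking-aware metric like this rewards putting the most relevant labeled products near the top, which matches how users actually scan search results.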