How It's Made

Unveiling the Core of Instacart’s Griffin 2.0: A Deep Dive into the Machine Learning Training Platform

Authors: Han Li, Sahil Khanna, Jocelyn De La Rosa, Moping Dou, Sharad Gupta, Chenyang Yu, and Rajpal Paryani

Background

About a year ago, we introduced the first version of Griffin, Instacart’s ML platform, and described its development and its support for end-to-end ML in a previous post. Last week, we announced Griffin 2.0, the next generation of the platform, and explained how the transition from Griffin 1.0 to 2.0 delivers a more cohesive user experience, refined model management, distributed computation capabilities, and enhanced deployment automation. That post gave an overview of all key components of the system, including the Griffin UI, Feature Marketplace, ML Training Platform, and ML Serving Platform.

In this article, we focus on the technical details of building the ML Training Platform (MLTP) within Griffin 2.0. MLTP introduces a centralized web interface, a distributed computation framework, standard ML builds, orchestration services, and a scalable metadata store, which together support the creation and management of training workloads at Instacart.

Crafting Strategies

In developing the Griffin 2.0 MLTP, our goal was to establish a unified, centralized platform where all machine learning engineers (MLEs) can create, track, and manage their training workloads. We aimed to support distributed machine learning, preparing the platform for tasks such as distributed training, batch inference, and Large Language Model (LLM) fine-tuning. At the same time, we wanted to address the limitations identified in Griffin 1.0, as detailed in our previous post. These considerations led us to the following strategic design decisions:

  • A Singular Interface: Griffin 1.0 required MLEs to navigate through multiple systems for comprehensive training workload information. Griffin 2.0 brings all tools into a unified web interface, greatly simplifying the user experience and streamlining model training development.
  • Centralized Unity: Griffin 1.0 incorporated various training backend platforms, resulting in increased maintenance overhead. Griffin 2.0 consolidates all training workloads onto a unified Kubernetes platform.
  • Standard ML Runtime: Griffin 1.0 lacked a standardized approach for modeling frameworks and versions, leading to maintenance challenges. Griffin 2.0 introduces standard runtimes for various ML frameworks with consistent building blocks and package versions to address this concern.
  • Horizontal Scalability: Griffin 1.0 relied on vertical scaling for accommodating increased training data and evolving model architectures. In Griffin 2.0, we adopted Ray to support horizontally scalable distributed workloads without imposing excessive complexity on our MLEs.
  • Metadata Store for All: Griffin 1.0 lacked effective model lineage information, making it challenging for users to understand and manage the complete model lifecycle. Griffin 2.0 implements a centralized metadata store to ensure thorough model lineage management throughout the lifecycle.

How It’s Built

System Architecture

Fig 1: Architecture of ML Training Platform

We’ve constructed the MLTP with the following major building blocks (Figure 1) to make it a centralized service and platform with distributed computation capabilities:

Metadata Store

We’ve established a series of data models to manage all aspects of the model lifecycle, including:

  • Model Store: Specifies the raw, untrained model architecture used as input for training jobs.
  • Offline Feature Store: Contains metadata for features in our offline storage to be utilized for training.
  • Workflow Run: Represents a specific training job request submitted to our workflow services.
  • Model Registry: Maintains information about models generated by training jobs. This registry is used in post-training processes like evaluation, batch inference, and real-time inference.

API Endpoints

We’ve developed an API server with accessible REST endpoints for users to interact with the Metadata Store to achieve the following:

  • /api/training/models/: Create and retrieve model architecture in the Model Store
  • /api/registry/models/: Generate, save, and retrieve the trained model from Model Registry during post-training processes
  • /api/features/: Index and retrieve offline features
  • /api/training/dataset/: Generate and retrieve training datasets
  • /api/workflows: Create, retrieve, and terminate training jobs in Workflow Run
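
As a rough illustration of how a client might address these endpoints, the sketch below builds (but does not send) HTTP requests with Python's standard library. Only the endpoint paths come from the list above. The base URL, the choice of HTTP verbs, and every payload field are assumptions made for illustration.

```python
import json
import urllib.request

# Endpoint paths are taken from the list above; everything else is hypothetical.
ENDPOINTS = {
    "model_store": "/api/training/models/",
    "model_registry": "/api/registry/models/",
    "features": "/api/features/",
    "training_dataset": "/api/training/dataset/",
    "workflows": "/api/workflows",
}


def build_request(base_url, name, payload=None):
    """Build a GET request (no payload) or a JSON POST request (with payload)."""
    url = base_url.rstrip("/") + ENDPOINTS[name]
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and payload fields, for illustration only.
request = build_request(
    "https://griffin.example.internal",
    "workflows",
    {"model_id": "example-model", "dataset_id": "example-dataset"},
)
```

Passing the resulting object to `urllib.request.urlopen` would send it. A real client would also need whatever authentication the API server requires, which the post does not describe.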

Workflow Orchestrator

Fig 2: Workflow Orchestrator consisting of MLTP API service and ISC worker

The workflow orchestrator consists of two major components, as illustrated in Figure 2:

  • MLTP API service: The workflow APIs let users customize their training jobs with specifications like worker count, CPU/GPU units, memory, SSD attachment, and runtime choice, while shielding them from the complexities of Kubernetes operations. Figure 3 shows an example of the workflow creation and resource specification UI used to create a Ray cluster. When a user initiates a workflow, the MLTP API service communicates with the ISC backend worker to generate Kubernetes resources.
Fig 3: Workflow Service UI for specifying Ray cluster resources
  • ISC (Instacart Software Center) + Kubernetes: We chose Kubernetes as the orchestration platform for our training workloads. This choice allowed us to centralize the methods for launching workflows to different training backends within a single, unified platform. We collaborated closely with Instacart’s core infrastructure team, which built and manages the Instacart Software Center (ISC). ISC is a suite of tools for creating, validating, and deploying software across the company. We integrated Ray and Kubernetes in ISC to enable company-wide adoption and leverage existing build & deploy features. Upon receiving a workflow request from the MLTP service, the ISC worker establishes a unique Kubernetes namespace for each workflow definition, so that its resources are isolated from other workflows while its runs share identical environment variables and authorization settings. The ISC worker then generates RayCluster and RayJob custom resources (instances of the corresponding Custom Resource Definitions, or CRDs), initializes them, provides an endpoint for the Ray dashboard URL, monitors the status of the launched containers, and manages the cluster lifespan.
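
To make the translation from a user's resource specification to a Kubernetes resource more concrete, here is a minimal sketch that renders a RayCluster manifest as a Python dictionary. It is not ISC's implementation, which the post does not show. The `WorkflowSpec` fields, the namespace and cluster naming scheme, and the image name are all hypothetical. The manifest layout follows the general shape of the KubeRay `RayCluster` resource (`headGroupSpec`, `workerGroupSpecs`), simplified so that the head and workers request the same resources.

```python
from dataclasses import dataclass


@dataclass
class WorkflowSpec:
    """Hypothetical resource specification a user might submit for a workflow."""
    workflow_name: str
    runtime_image: str
    worker_count: int
    cpus: int
    memory_gib: int
    gpus: int = 0


def pod_template(spec, container_name):
    """Build a minimal pod template carrying the requested resources."""
    resources = {"cpu": str(spec.cpus), "memory": f"{spec.memory_gib}Gi"}
    if spec.gpus:
        resources["nvidia.com/gpu"] = str(spec.gpus)
    container = {
        "name": container_name,
        "image": spec.runtime_image,
        "resources": {"requests": dict(resources), "limits": dict(resources)},
    }
    return {"spec": {"containers": [container]}}


def ray_cluster_manifest(spec):
    """Render a RayCluster manifest in a per-workflow namespace (naming assumed)."""
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {
            "name": f"{spec.workflow_name}-cluster",
            "namespace": f"mltp-{spec.workflow_name}",
        },
        "spec": {
            "headGroupSpec": {
                "rayStartParams": {},
                "template": pod_template(spec, "ray-head"),
            },
            "workerGroupSpecs": [
                {
                    "groupName": "workers",
                    "replicas": spec.worker_count,
                    "minReplicas": spec.worker_count,
                    "maxReplicas": spec.worker_count,
                    "rayStartParams": {},
                    "template": pod_template(spec, "ray-worker"),
                }
            ],
        },
    }


# Hypothetical request: 4 workers, each with 8 CPUs, 32 GiB of memory, and 1 GPU.
manifest = ray_cluster_manifest(
    WorkflowSpec("demo", "example.registry/ray-runtime:latest", 4, 8, 32, gpus=1)
)
```

Deriving the namespace from the workflow name mirrors the per-workflow-definition isolation described above, but the actual naming convention is not stated in the post.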

Training Application

Here’s the typical process on MLTP for creating a training job:

  1. Users begin by customizing their input, which includes tasks like organizing features and trying out different model designs. They then select a specific model from the Model Store, pick input data from the Training Dataset, and set training configurations to start a formal training job.
  2. When they’re ready to initiate a training workload, they can either use the Griffin UI or call the Python SDKs directly to send a request to the workflow services.
  3. The workflow services create Kubernetes resources based on the user’s input. This can be a simple single-container job or a more complex multi-node Ray cluster.
  4. After training, results such as metrics from MLflow and logs from Datadog are displayed visually. The model weights and other training artifacts are then stored in the Model Registry for easy access in tasks like evaluation and inference.

Our design considerations for streamlined and standardized training job creation are as follows:

Unified Interface

The centralized service and APIs offer a unified interface for managing the model development lifecycle:

Fig 4: ML development lifecycle supported by a unified MLTP web interface
  1. Users can prototype by scheduling training to a Ray cluster from their laptop or Jupyter server to leverage distributed computation.
  2. Once the model is ready for production, users can utilize the Griffin UI to create a “production” workflow definition.
  3. Alternatively, if there’s an existing production pipeline in Airflow, they can use the Griffin task operator and sensor operator, which integrate with the Griffin workflow APIs to submit and query workflow runs.

Throughout all stages of the model lifecycle, users interact with MLTP using the same API interface.
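
The Airflow integration in step 3 follows a submit-then-sense pattern: one task submits a workflow run, and another polls until the run reaches a terminal state. The sketch below illustrates that pattern in plain Python rather than with Airflow itself. `FakeWorkflowClient`, its methods, and the status strings are hypothetical stand-ins for the Griffin workflow APIs, whose actual signatures the post does not describe.

```python
import time

# Hypothetical terminal states for a workflow run.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TERMINATED"}


class FakeWorkflowClient:
    """In-memory stand-in for a workflow API client (hypothetical)."""

    def __init__(self, statuses):
        self._statuses = list(statuses)
        self._polls = {}

    def submit(self, workflow_name):
        """Register a new run and return its identifier."""
        run_id = f"{workflow_name}-run-{len(self._polls) + 1}"
        self._polls[run_id] = 0
        return run_id

    def status(self, run_id):
        """Return the next scripted status, repeating the last one when exhausted."""
        index = min(self._polls[run_id], len(self._statuses) - 1)
        self._polls[run_id] += 1
        return self._statuses[index]


def wait_for_run(client, run_id, poll_seconds=0.0, max_polls=100):
    """Poll a run until it reaches a terminal status, as a sensor would."""
    for _ in range(max_polls):
        status = client.status(run_id)
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{run_id} did not finish within {max_polls} polls")


client = FakeWorkflowClient(["PENDING", "RUNNING", "SUCCEEDED"])
run_id = client.submit("nightly-training")  # the "task operator" step
final_status = wait_for_run(client, run_id)  # the "sensor operator" step
```

Separating submission from polling keeps the submitting task short-lived and lets the scheduler decide how often to check on a long-running training job.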

Standard ML Runtimes

When we surveyed the training applications our users had implemented, we found that most of the code was not about model development itself. The applications shared many of the same components, for instance parallel executors for feature transformations, batch inference, and distributed GPU workers for parallelizing training batches. Optimizing these components was time-consuming for MLEs: it required them to learn various frameworks and APIs, and made debugging more complex.

To simplify model development and encourage more users to adopt MLTP, we’ve developed standardized ML runtimes with structured and configurable input parameters. This solution offers predefined approaches that cover the majority of use cases for building different models at Instacart, be it decision trees or neural networks. Figure 5 shows the high-level structure of training configurations.

Fig 5: Structure of Training Configuration Schema
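
To give a feel for what a structured, configurable training input can look like, here is a small sketch of a configuration schema built from Python dataclasses. It is not the schema from Figure 5. Every class and field name is hypothetical, and the sections are only loosely modeled on the building blocks the post names (data loading, feature transformation, and distributed training).

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DataLoaderConfig:
    """Hypothetical: which dataset and columns feed the training job."""
    dataset_id: str
    feature_columns: List[str]
    label_column: str


@dataclass
class TransformConfig:
    """Hypothetical: one named feature transformation and its input columns."""
    name: str
    columns: List[str]


@dataclass
class TrainerConfig:
    """Hypothetical: framework choice and distributed training settings."""
    framework: str
    num_workers: int = 1
    use_gpu: bool = False
    hyperparameters: dict = field(default_factory=dict)


@dataclass
class TrainingConfig:
    data_loader: DataLoaderConfig
    trainer: TrainerConfig
    transforms: List[TransformConfig] = field(default_factory=list)

    def validate(self):
        """Return a list of human-readable problems; an empty list means valid."""
        errors = []
        if self.trainer.num_workers < 1:
            errors.append("trainer.num_workers must be at least 1")
        known = set(self.data_loader.feature_columns)
        for transform in self.transforms:
            missing = [c for c in transform.columns if c not in known]
            if missing:
                errors.append(
                    f"transform '{transform.name}' references unknown columns: {missing}"
                )
        return errors


config = TrainingConfig(
    data_loader=DataLoaderConfig("example-dataset", ["price", "basket_size"], "label"),
    trainer=TrainerConfig("xgboost", num_workers=4, hyperparameters={"max_depth": 6}),
    transforms=[TransformConfig("standard_scaler", ["price"])],
)
config_as_dict = asdict(config)  # plain dict, ready to serialize as JSON or YAML
```

Validating the configuration before a job is submitted catches mistakes such as a transform pointing at a missing column without spending cluster time on a run that would fail.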

Underneath, we rely on Ray APIs to implement various components of the training pipeline as shown in Figure 6. Most of these building blocks are shared across different standard runtimes for various ML frameworks. Moreover, these standard runtimes are easily scalable, allowing users to initially test in Ray local mode and then effortlessly scale the workflow in remote distributed environments.

Fig 6(a): Configurable Data Loader and Feature Transformation
Fig 6(b): Configurable Distributed Data Parallel Trainer
Fig 6(c): Configurable Distributed Batch Inference

Lineage

Thanks to the centralized data storage layer, the model lineage is now much better managed. We can effortlessly trace the following lineage pairs, as illustrated in Figure 7:

  • Model architecture → Workflow run
  • Training dataset → Workflow run
  • Workflow run → Model Registry
  • Workflow run → Output metadata URLs (Datadog, MLflow, Ray Dashboard, etc.)
Fig 7: Lineage of Training Lifecycle

This simplifies the training-to-post-training process for users, helps them be more productive in both experimentation and deployment, and enables them to reproduce results for sharing and debugging purposes.
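
The lineage pairs above can be sketched as a small data model in which each workflow run links its inputs to its outputs. The class names, fields, and URLs below are hypothetical and are not the actual Griffin data models. The point is only that, once every run records its model architecture, training dataset, and output URLs, a registered model can be traced back to everything that produced it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class WorkflowRun:
    """Hypothetical record linking a run to its inputs and output URLs."""
    run_id: str
    model_architecture_id: str
    training_dataset_id: str
    output_urls: Dict[str, str]


@dataclass
class RegisteredModel:
    """Hypothetical Model Registry entry pointing back to the run that made it."""
    model_id: str
    workflow_run_id: str


def trace_lineage(model, runs_by_id):
    """Walk from a registered model back to its run, architecture, and dataset."""
    run = runs_by_id[model.workflow_run_id]
    return {
        "registered_model": model.model_id,
        "workflow_run": run.run_id,
        "model_architecture": run.model_architecture_id,
        "training_dataset": run.training_dataset_id,
        "output_urls": dict(run.output_urls),
    }


run = WorkflowRun(
    run_id="run-42",
    model_architecture_id="arch-7",
    training_dataset_id="dataset-3",
    output_urls={"mlflow": "https://mlflow.example.internal/runs/run-42"},
)
lineage = trace_lineage(RegisteredModel("model-9", "run-42"), {run.run_id: run})
```

Because the workflow run sits in the middle of every pair, a single lookup by run ID is enough to recover both the inputs and the outputs of a training job.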

Lessons Learned

We learned a few lessons during the journey of building the next generation of ML training infrastructure:

  • Unified Solutions: To scale ourselves effectively with a growing number of ML use cases and a limited ML infra team, we opted to unify our ML training solutions. As a result, we can now offer a more consistent training job abstraction. For instance, transitioning to Kubernetes as our only orchestration platform required a one-time migration of existing training jobs, but it enables us to provide a consistent experience across tasks and offers users additional benefits like distributed computation and enhanced metadata management.
  • Balancing Flexibility and Standardization: We aim for our platform to be highly flexible and capable of accommodating a wide range of ML applications for model training. Simultaneously, we recognize the importance of providing standardization to cater to the majority of use cases and accelerate development velocity.
  • Consider the Bigger Picture: In our redesign of next-gen MLTP, we didn’t just focus on training alone. We also took into account model serving and feature engineering. We collaborated closely with stakeholders from other Griffin 2.0 projects to co-design the data models of training jobs, ensuring seamless integration into the end-to-end process for easier deployment and metadata management.

Acknowledgments

The MLTP project received support and funding from various teams such as Cloud Foundation, Build & Deploy, Developers Productivity, Ads ML, Fulfillment ML and many more at Instacart. Additionally, we are immensely thankful for the guidance offered by members of our ML consulting group: Jin Zhang, Chuanwei Ruan, Aditya Subramanian, Liang Chen, Trace Levinson, and Peng Qi.

