The main function of MLOps is to automate the more repeatable steps in the ML workflows of data scientists and ML engineers, from model development and training to model deployment and operation (model serving). Automating these steps creates agility for businesses and better experiences for users and end customers, increasing the speed, power, and reliability of ML. These automated processes can also mitigate risk and free developers from rote tasks, allowing them to spend more time on innovation. This all contributes to the bottom line: a 2021 global study by McKinsey found that companies that successfully scale AI can add as much as 20 percent to their earnings before interest and taxes (EBIT).
“It’s not uncommon for companies with sophisticated ML capabilities to incubate different ML tools in individual pockets of the business,” says Vincent David, senior director for machine learning at Capital One. “But often you start seeing parallels—ML systems doing similar things, but with a slightly different twist. The companies that are figuring out how to make the most of their investments in ML are unifying and supercharging their best ML capabilities to create standardized, foundational tools and platforms that everyone can use — and ultimately create differentiated value in the market.”
In practice, MLOps requires close collaboration between data scientists, ML engineers, and site reliability engineers (SREs) to ensure consistent reproducibility, monitoring, and maintenance of ML models. Over the last several years, Capital One has developed MLOps best practices that apply across industries: balancing user needs, adopting a common, cloud-based technology stack and foundational platforms, leveraging open-source tools, and ensuring the right level of accessibility and governance for both data and models.
Understand different users’ different needs
ML applications generally have two main types of users: technical experts (data scientists and ML engineers) and nontechnical experts (business analysts). It’s important to strike a balance between their different needs. Technical experts often prefer complete freedom to use any available tool to build models for their intended use cases. Nontechnical experts, on the other hand, need user-friendly tools that give them access to the data they need to create value in their own workflows.
To build consistent processes and workflows that satisfy both groups, David recommends meeting with the application design team and with subject matter experts across a wide range of use cases. “We look at specific cases to understand the issues, so users get what they need to benefit their work, specifically, but also the company generally,” he says. “The key is figuring out how to create the right capabilities while balancing the various stakeholder and business needs within the enterprise.”
Adopt a common technology stack
Collaboration among development teams—critical for successful MLOps—can be difficult and time-consuming if these teams are not using the same technology stack. A unified tech stack allows developers to standardize, reusing components, features, and tools across models like Lego bricks. “That makes it easier to combine related capabilities so developers don’t waste time switching from one model or system to another,” says David.
A cloud-native stack—built to take advantage of the cloud model of distributed computing—allows developers to self-service infrastructure on demand, continually leveraging new capabilities and introducing new services. Capital One’s decision to go all-in on the public cloud has had a notable impact on developer efficiency and speed. Code releases to production now happen much more rapidly, and ML platforms and models are reusable across the broader enterprise.
Save time with open-source ML tools
Open-source ML tools (code and programs freely available for anyone to use and adapt) are core ingredients in creating a strong cloud foundation and unified tech stack. Using existing open-source tools means the business does not need to devote precious technical resources to reinventing the wheel, quickening the pace at which teams can build and deploy models.