Comcast is one of the leading providers of communications, entertainment, and cable products and services. It uses classical machine learning (ML) models and deep learning for computer vision, natural language processing, and personalization of the customer experience. Because the company invokes its models billions of times per day, it needs a robust ML infrastructure.
Before the recent upgrade of Comcast’s ML platform, its researchers and software engineers faced several significant challenges with the existing ML infrastructure:
- There was no model management or tracking.
- Model packaging and deployments were not standardized.
- Deployment required code to be rewritten when moving from research to operations (e.g., converted from Python to Java). This process was time-consuming and error-prone.
- Model deployment took days and sometimes even weeks. Researchers learned that simpler models reached production much faster, so they were motivated to introduce simpler models rather than better ones.
Comcast has two use cases for their machine learning models:
- The on-demand use case, when a user action invokes a model.
- The streaming use case, when a model is invoked by a streaming action. Comcast receives a tremendous amount of data in its streaming use case.
A machine learning pipeline at Comcast works as follows:
First, the ML team at Comcast has a feature store that ensures consistency and robustness of the features. Then, data is consumed, parsed, transformed, and normalized. Finally, the model is invoked, and either the upstream action is taken or the current state persists.
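The stages described above can be sketched as a chain of small functions. This is only an illustration, not Comcast's code: the JSON event format, the `error_rate` feature, the min-max scaling, and the 0.5 threshold are all assumptions made for the example.

```python
import json
from typing import Callable, Dict, Optional, Tuple

Features = Dict[str, float]


def parse(raw: bytes) -> Features:
    """Parse a raw event (assumed here to be UTF-8 JSON) into numeric fields."""
    return {key: float(value) for key, value in json.loads(raw.decode("utf-8")).items()}


def transform(event: Features) -> Features:
    """Derive a model feature from the parsed event (hypothetical feature)."""
    features = dict(event)
    features["error_rate"] = event["errors"] / max(event["requests"], 1.0)
    return features


def normalize(features: Features, ranges: Dict[str, Tuple[float, float]]) -> Features:
    """Min-max scale each feature listed in `ranges`, clamped to [0, 1]."""
    scaled = {}
    for name, (low, high) in ranges.items():
        value = (features[name] - low) / (high - low)
        scaled[name] = min(max(value, 0.0), 1.0)
    return scaled


def run_pipeline(
    raw: bytes,
    model: Callable[[Features], float],
    ranges: Dict[str, Tuple[float, float]],
    threshold: float = 0.5,
) -> Optional[str]:
    """Consume one event, invoke the model, and decide whether to act.

    Returns "take_action" when the score reaches the threshold; returns None
    when the current state should simply persist.
    """
    features = normalize(transform(parse(raw)), ranges)
    score = model(features)
    return "take_action" if score >= threshold else None
```

Keeping each stage as a separate function mirrors the idea that every step can later be deployed, scaled, and measured independently.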
With the existing challenges in mind and considering the organizational needs, the company set the following requirements for the new platform:
- It should not require converting code from research-ready models to production.
- It should enable easy experimentation and model tracking.
- The researchers should be able to deploy their own models.
- The platform should enable A/B testing for models in production.
- It should be possible to modularize and inject custom metrics and workflows at each step.
Enterprise ML Platform with MLflow and Kubernetes
After considering a number of different options, Comcast ended up with a solution that combines several open-source tools for developing and deploying ML models at scale.
The machine learning researchers at Comcast use the following technologies:
- Databricks notebooks and Spark for coding and training models.
- MLflow for packaging and tracking models.
- Docker for rapidly deploying server environments in “containers”.
- Kubernetes, specifically Kubeflow, ArgoCD, and Seldon Core, for model deployment.
The company discovered that Kubernetes pods are well suited for the different stages of their machine learning pipeline. In particular:
- Data consumption and parsing are handled by the Kafka consumer pod and the Kinesis consumer pod.
- Data transformation and normalization are handled by the Data Transformation pod and the Data Normalization pod.
- Finally, models are deployed in a way that meets all of Comcast's requirements, via the Model Service Orchestration pod, the Single Model Service pod, the A/B, Ensemble, and Multi-armed Bandit model pods, and other possible combinations.
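To make the multi-armed bandit option more concrete, here is a minimal epsilon-greedy router. It is a generic textbook sketch, not the routing logic Comcast or Seldon Core actually uses; the class name, the reward signal, and the default `epsilon` are assumptions. A plain A/B test is the simpler case where traffic is split at random with fixed ratios and no feedback.

```python
import random
from typing import Dict, List, Optional


class EpsilonGreedyRouter:
    """Route each request to one of several models (hypothetical helper).

    With probability `epsilon` a random model is explored; otherwise the
    model with the best average reward so far is used.
    """

    def __init__(self, model_names: List[str], epsilon: float = 0.1,
                 seed: Optional[int] = None) -> None:
        if not model_names:
            raise ValueError("at least one model name is required")
        self.epsilon = epsilon
        self.counts: Dict[str, int] = {name: 0 for name in model_names}
        self.rewards: Dict[str, float] = {name: 0.0 for name in model_names}
        self._rng = random.Random(seed)

    def mean_reward(self, name: str) -> float:
        """Average reward observed for a model (0.0 if it was never used)."""
        count = self.counts[name]
        return self.rewards[name] / count if count else 0.0

    def choose(self) -> str:
        """Pick the model that should serve the next request."""
        names = list(self.counts)
        if self._rng.random() < self.epsilon:
            return self._rng.choice(names)  # explore
        return max(names, key=self.mean_reward)  # exploit

    def record(self, name: str, reward: float) -> None:
        """Feed back the outcome of a request (e.g. 1.0 for a click)."""
        self.counts[name] += 1
        self.rewards[name] += reward
```

In use, the serving layer would call `choose()` per request and `record()` once the outcome is known, so traffic gradually shifts toward the better-performing model.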
The open-source tool MLflow, introduced by Databricks, supports standard packaging formats, including scikit-learn, H2O, TensorFlow, and others.
How Does it Work?
Now that the new Enterprise ML platform has been implemented, the research and model flow at Comcast has the following steps:
- The researchers write their code in Python in Databricks. In more complicated cases, when they need something beyond the existing templates, they write their own YAML or JSON configuration files for the Seldon Core graph implementation.
- The code gets committed to git.
- Then ArgoCD, part of Kubeflow, is used to pick up changes in the YAML or JSON configuration files:
  - If a change is detected in git, the ArgoCD workflow is kicked off.
  - The workflow builds the Docker images, pushes them to Docker Hub, and runs the deployment in Kubernetes, which pulls the images.
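For a sense of what such a Seldon Core graph configuration might contain, the sketch below builds a JSON document for a two-model A/B test. The field names follow the `SeldonDeployment` resource as commonly documented, but the `apiVersion`, the image names, and the deployment name are assumptions here; check them against the Seldon Core documentation for the version you run.

```python
import json
from typing import Any, Dict


def model_node(name: str) -> Dict[str, Any]:
    """A leaf node of the inference graph that serves a single model."""
    return {"name": name, "type": "MODEL", "endpoint": {"type": "REST"}, "children": []}


def ab_test_deployment(deployment_name: str, image_a: str, image_b: str,
                       ratio_a: float = 0.5) -> Dict[str, Any]:
    """Build a SeldonDeployment-style config that splits traffic between two models."""
    return {
        "apiVersion": "machinelearning.seldon.io/v1",  # assumed; varies by release
        "kind": "SeldonDeployment",
        "metadata": {"name": deployment_name},
        "spec": {
            "predictors": [{
                "name": "default",
                "replicas": 1,
                # Containers that back the graph nodes of the same name.
                "componentSpecs": [{"spec": {"containers": [
                    {"name": "model-a", "image": image_a},
                    {"name": "model-b", "image": image_b},
                ]}}],
                # Router node: sends a `ratio_a` share of traffic to the first child.
                "graph": {
                    "name": "ab-router",
                    "implementation": "RANDOM_ABTEST",
                    "parameters": [
                        {"name": "ratioA", "value": str(ratio_a), "type": "FLOAT"},
                    ],
                    "children": [model_node("model-a"), model_node("model-b")],
                },
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical images; the resulting JSON is what would be committed to git.
    config = ab_test_deployment("recs", "example/model-a:0.1", "example/model-b:0.1", 0.7)
    print(json.dumps(config, indent=2))
```

Because the configuration is plain data committed to git, a change to the ratio or to an image tag is exactly the kind of diff that a GitOps tool can detect and roll out.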
The new solution satisfies the company’s needs with regard to throughput capabilities:
- 700 requests per second on a single node with a single replica of a given model, on a machine with modest specifications;
- Up to 100,000 requests per second at burst for some of the production models.
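As a rough back-of-the-envelope check, the two figures can be related by asking how many replicas a burst would need if throughput scaled linearly with replica count. That linearity is an assumption made only for this sketch; the source does not say the two numbers refer to the same model or that scaling is linear. The function name and the `headroom` parameter are hypothetical.

```python
import math


def replicas_needed(target_rps: float, per_replica_rps: float,
                    headroom: float = 0.0) -> int:
    """Estimate replica count, assuming throughput scales linearly.

    `headroom` is extra capacity as a fraction of the target (0.2 = 20% spare).
    """
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    return math.ceil(target_rps * (1.0 + headroom) / per_replica_rps)


if __name__ == "__main__":
    # 100,000 requests/s at burst, 700 requests/s per replica (figures from the list above).
    print(replicas_needed(100_000, 700))
    print(replicas_needed(100_000, 700, headroom=0.2))
```

Under that assumption, a 100,000 requests-per-second burst corresponds to roughly 143 replicas of a 700 requests-per-second model, before any spare capacity is added.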
If you want to learn more details about Comcast’s Enterprise ML Platform and see a demo, please check out the video below that this article is based on: