December 9, 2021
My name is Igor Gorbenko. I am the Team Lead and Architect of the ML-engineering team at Tango. Today, I will tell you how and why we implemented the Machine Learning Operations (MLOps) process at Tango, what challenges we have addressed, and our results so far.
People love to be taken care of. Whenever you open your favorite app and see content that is meaningful to you in particular, content you really like, you feel cared for, don’t you? In the fight for user satisfaction and loyalty, the personalization of services is essential.
Machine learning algorithms relentlessly analyze gigabytes of data, preparing a personalized set of recommendations for each Tango app user.
We implemented MLOps processes and built a recommendation system that incorporates these algorithms to make Tango users around the world a little happier.
Tango is a high load service used by people around the world 24/7. Every day our Data Lake is filled with several terabytes of “raw” data to process and extract valuable information.
Tango is a real-time service, and if a user doesn’t like the content, they are unlikely to keep their attention on the screen for more than a second. Not only do users want to get “their” content, but they also want it quickly, since no one is willing to spend time looking at a boring slider on the screen.
Despite the frequency of releases, every significant change to the system undergoes an architectural review where we comprehensively consider the pros and cons of the proposed changes.
Our objective was to develop a concept and build a platform that would allow us to create other systems on it while getting an acceptable performance, regardless of the load.
The entire infrastructure runs on GCP, but we try to keep our solutions cloud-agnostic and, as far as possible, not tightly coupled to a particular cloud vendor.
The backend is a microservice architecture in Java (Spring Boot) + GKE, Redis Cluster, Kafka, MySQL, Dataflow + PubSub, Bigtable, MongoDB.
We collect a lot of information, starting with “events,” one of the essential components of the whole system. Events reflect the user’s behavior along with their interaction with the application. They are the basis of our knowledge; that’s where we get all the helpful information. They are also an essential part of DWH, which sits on Google BigQuery combined with Dataflow and Google Cloud Composer.
We won’t go into the details of our data warehouse since that’s a topic for a separate article. Suffice it to say that batch processing is what lets us dig through tons of data, some of which is used to train the models.
Every second, thousands of events from the application arrive through PubSub, in near real time, in the Data Lake, the raw data repository. Hundreds of scheduled ETL processes pick up this raw data and transfer it to the DWH several times a day.
This processed and enriched data in DWH provides the basis for model training.
I can foresee a question about delays in delivering data to the repository and, as a result, training models on stale data. Well, this does not happen to us. Optimized ETL procedures coupled with monitoring and alerting allow timely data delivery to the final destination.
Models are trained on schedule, extracting specific sets of features designed for each model. Keep in mind that training time differs from model to model: some take a few minutes, while heavier models might take several hours.
For example, training the collaborative filtering model takes about five hours. This includes preparing an extensive array of features, training the model on those features, and saving the training results.
Just like data processing, model training has monitoring and alerting. Thanks to this, we can analyze how changes in model code affect training speed, model size, and metrics such as MAE (Mean Absolute Error).
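As a reminder, MAE is the average absolute difference between the actual and predicted values. Here is a minimal sketch of how it is computed; the function name and the sample values are illustrative and are not taken from our training code.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values: the absolute errors are 0.5, 0.0 and 1.0.
mae = mean_absolute_error([3.0, 5.0, 2.0], [2.5, 5.0, 3.0])
print(mae)  # 0.5
```

Tracking this number from one training run to the next is what makes a regression caused by a code change visible.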
Training models and creating artifacts in Cloud Storage is useless if the new artifacts never reach the Prediction Services.
Reading artifacts from object storage on every request can’t be considered the best solution for a heavily loaded real-time service. The stringent latency requirements for serving predictions force us to keep the current version of the model in pod memory.
A Jenkins API call is triggered after each model training and artifact creation; it spins up a new set of pods that pull the new artifacts. After the model is loaded into memory, tests are executed to get a “baseline prediction” and check that the service with the new model 1) is up and running and 2) delivers a prediction to a group of consumers. Only then is the incoming traffic switched to this new set of pods, while the old ones are terminated.
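The idea of keeping the model in memory can be sketched as follows. This is an illustration under simplifying assumptions, not our production code: the `ModelHolder` class, the JSON artifact format, and the `fake_storage_read` function that stands in for an object storage download are all hypothetical.

```python
import json
from typing import Callable, Dict, List, Optional


class ModelHolder:
    """Downloads the artifact once and serves every request from memory."""

    def __init__(self, fetch_artifact: Callable[[], bytes]) -> None:
        # fetch_artifact is a hypothetical callable that reads the serialized
        # model from object storage.
        self._fetch_artifact = fetch_artifact
        self._model: Optional[Dict[str, List[float]]] = None
        self.fetch_count = 0

    def load(self) -> None:
        """Called once at pod startup, before the pod accepts traffic."""
        raw = self._fetch_artifact()
        self.fetch_count += 1
        self._model = json.loads(raw)

    def predict(self, features: List[float]) -> float:
        """Scores a request using only the in-memory model."""
        if self._model is None:
            raise RuntimeError("model is not loaded yet")
        return sum(w * x for w, x in zip(self._model["weights"], features))


def fake_storage_read() -> bytes:
    # Stand-in for an object storage read; the weights are made up.
    return json.dumps({"weights": [0.5, 2.0]}).encode("utf-8")


holder = ModelHolder(fake_storage_read)
holder.load()
scores = [holder.predict([1.0, 1.0]), holder.predict([2.0, 0.0])]
print(scores, holder.fetch_count)  # [2.5, 1.0] 1
```

However many requests arrive, the storage is read only once per pod; a new model version means a new set of pods rather than a reload on the request path.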
This cyclic process happens after each retraining – several times a day.
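The switch-over logic can be sketched roughly like this. It is a simplified model rather than our Jenkins pipeline: pods are represented as plain prediction functions, and the names `passes_baseline` and `roll_out` are hypothetical.

```python
from typing import Callable, List, Sequence

# A pod is modeled as a function that turns features into recommendations.
Pod = Callable[[Sequence[float]], List[str]]


def passes_baseline(pod: Pod, baseline_features: Sequence[float]) -> bool:
    """True if the pod answers the baseline request with a non-empty prediction."""
    try:
        prediction = pod(baseline_features)
    except Exception:
        return False  # the service is not up and running
    return len(prediction) > 0


def roll_out(new_pods: List[Pod], old_pods: List[Pod],
             baseline_features: Sequence[float]) -> List[Pod]:
    """Returns the set of pods that should receive traffic after the check."""
    if new_pods and all(passes_baseline(p, baseline_features) for p in new_pods):
        return new_pods  # switch traffic; the old pods can be terminated
    return old_pods  # keep serving with the previous model


def healthy_pod(features: Sequence[float]) -> List[str]:
    return ["streamer_1", "streamer_2"]


def broken_pod(features: Sequence[float]) -> List[str]:
    raise ConnectionError("pod is not ready")


old = [healthy_pod]
good_release = [healthy_pod, healthy_pod]
bad_release = [healthy_pod, broken_pod]
print(roll_out(good_release, old, [0.1, 0.2]) is good_release)  # True
print(roll_out(bad_release, old, [0.1, 0.2]) is old)  # True
```

The important property is that a failed check leaves the old pods in place, so a bad training run never reaches users.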
The output of each prediction is logged and, in turn, also generates Data Lake events. The BI team can perform various analytics, build hypotheses, and validate experiments based on this data.
Prediction Services are built with Flask and deployed on Google Kubernetes Engine, with New Relic, Grafana, and VictoriaMetrics for monitoring.
The Prediction Services input is a set of features which, after passing certain business conditions, are used by the model and become a factor in the prediction output. For instance, filtering online streamers by the number of gifts received or by the date of their very first stream in the app can be based on data from the storage. However, this approach doesn’t guarantee meeting latency requirements since the data flows into the storage with an inevitable delay.
Google Cloud Bigtable, deployed as a cluster on multiple nodes, meets the primary conditions for use in a real-time system. This is where we store and promptly update all relatively cold features needed for future services. We use Bigtable to pre-aggregate, in an easy-to-consume form, historical data (loaded by batch jobs) and real-time data (loaded by stream jobs), along with the features used later for prediction.
Going back to serving results quickly, it’s hard to imagine such a system without Redis, both as the Online Feature Store used for predictions and as the cache for prediction results.
We use a Redis Cluster with multiple masters and replicas for better fault tolerance and scalability. Our secret is a sharded composite key generated from the Offline Feature Store data. This key is distributed evenly across all the nodes in the Redis cluster, so Prediction Services can read a random batch as quickly as possible without creating a hotspot on any single node.
While the Offline Feature Store is primarily a PubSub + Dataflow + Bigtable interaction, the Online Feature Store is a bit more complicated. It uses PubSub + Dataflow + Redis Cluster plus a Google Kubernetes Engine cluster, which runs the composite key builder that extracts data from Bigtable.
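A business condition of this kind is essentially a filter over the candidate set. The sketch below shows the idea; the `Streamer` fields, the thresholds, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Streamer:
    streamer_id: str
    gifts_received: int
    first_stream_date: date


def eligible_streamers(streamers: List[Streamer], min_gifts: int,
                       first_stream_on_or_before: date) -> List[Streamer]:
    """Keeps streamers with enough gifts who started streaming early enough."""
    return [
        s for s in streamers
        if s.gifts_received >= min_gifts
        and s.first_stream_date <= first_stream_on_or_before
    ]


online = [
    Streamer("a", gifts_received=120, first_stream_date=date(2021, 1, 10)),
    Streamer("b", gifts_received=3, first_stream_date=date(2020, 6, 1)),
    Streamer("c", gifts_received=500, first_stream_date=date(2021, 11, 30)),
]
selected = eligible_streamers(online, min_gifts=100,
                              first_stream_on_or_before=date(2021, 6, 30))
print([s.streamer_id for s in selected])  # ['a']
```

The filter itself is cheap; what matters for latency is how fresh the gift counts and dates are and where they are read from.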
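To illustrate why sharding the key helps, here is a minimal sketch. The key layout `<feature set>:<shard>:<batch>` and the three-master split are assumptions made for the example, not our actual key format. What is not an assumption is how Redis Cluster places a key: it takes CRC16 of the key modulo 16384 to get one of the hash slots, which are divided among the masters.

```python
import binascii
import random
from collections import Counter

HASH_SLOTS = 16384  # fixed number of hash slots in any Redis Cluster


def hash_slot(key: str) -> int:
    """Slot of a key that has no {hash tag}: CRC16 (XMODEM) mod 16384."""
    return binascii.crc_hqx(key.encode("utf-8"), 0) % HASH_SLOTS


def batch_key(feature_set: str, shard: int, batch: int) -> str:
    """Hypothetical composite key layout: <feature set>:<shard>:<batch>."""
    return f"{feature_set}:{shard}:{batch}"


def random_batch_key(feature_set: str, shards: int, batches: int,
                     rng: random.Random) -> str:
    """Picks a random shard and batch so that reads spread across keys."""
    return batch_key(feature_set, rng.randrange(shards), rng.randrange(batches))


# Assumption: 3 masters that split the slot range into equal parts.
MASTERS = 3
all_keys = [batch_key("streamers", s, b) for s in range(8) for b in range(4)]
keys_per_master = Counter(hash_slot(k) * MASTERS // HASH_SLOTS for k in all_keys)
print(dict(keys_per_master))
print(random_batch_key("streamers", 8, 4, random.Random()))
```

Because the shard and batch numbers are part of the key, the same feature set ends up in many different slots, and a reader that picks a random batch does not keep hitting one node.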
Besides the ever-changing components mentioned above, there is also a coordinator or orchestration mechanism, a proxy service (or set of services) that is part of the recommendation system. Using data from the Online Feature Store, it prepares arrays of features and asks Prediction Services for recommendations for each request.
The coordinator is the single point in the system with which the recommendations’ consumers interact. Additional mechanisms can also be implemented here, such as caching, filtering data to send to Prediction Services, working with AB tests, experiments, and more.
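In code, the coordinator's role can be sketched like this. It is a deliberately simplified illustration: the `Coordinator` class is hypothetical, the feature store and the Prediction Service are injected as plain functions, and the cache has no expiry, which a real system would need.

```python
from typing import Callable, Dict, List


class Coordinator:
    """Single entry point: reads features, asks for a prediction, caches it."""

    def __init__(self,
                 read_features: Callable[[str], List[float]],
                 predict: Callable[[List[float]], List[str]]) -> None:
        self._read_features = read_features  # e.g. an Online Feature Store read
        self._predict = predict              # e.g. a call to a Prediction Service
        self._cache: Dict[str, List[str]] = {}

    def recommend(self, user_id: str) -> List[str]:
        if user_id in self._cache:
            return self._cache[user_id]
        features = self._read_features(user_id)
        recommendations = self._predict(features)
        self._cache[user_id] = recommendations
        return recommendations


prediction_calls = []


def fake_feature_store(user_id: str) -> List[float]:
    return [float(len(user_id)), 1.0]  # made-up features


def fake_prediction_service(features: List[float]) -> List[str]:
    prediction_calls.append(features)
    return ["streamer_1", "streamer_2"]


coordinator = Coordinator(fake_feature_store, fake_prediction_service)
first = coordinator.recommend("user_42")
second = coordinator.recommend("user_42")
print(first, len(prediction_calls))  # ['streamer_1', 'streamer_2'] 1
```

Since consumers only ever talk to the coordinator, caching, filtering, and experiment routing can change behind it without touching the clients.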
Initially, during the POC phase, we ran all Prediction Services on Cloud Run but came across many limitations during load testing. For example, the Serverless VPC Access connector, which allows services on Cloud Run to access VPC resources such as Memorystore or a custom Redis Cluster, has limited bandwidth, which caused problems under load.
Using the off-the-shelf Memorystore for Redis service also gave us quite a few headaches.
In the end, we abandoned it in favor of a custom Redis Cluster with an admin panel in Redis Insight and good monitoring through Grafana.
We are trying to create solutions with minimal coupling to a particular cloud vendor, but this is not always possible. For example, the PubSub + Dataflow bundle is almost ideal in all respects for processing large volumes of streaming data. However, even the most stable managed services can have bugs and errors that lead to unstable operation.
We also have to mention our performance issues with Bigtable. We’ve tried hundreds of approaches to storing data and forming Row Keys: with a hash, without a hash but in reverse order, in actual order but with extra composition, within business logic, etc. We tried using the main cluster for writes and the replica for reads, and implementing compression and timeouts for reading data. We re-populated data into different tables, then merged everything into one big table, populated the tables differently, and even tried a ClickHouse solution instead of Bigtable. This could indeed make a topic for a separate article, “Tango with Bigtable” 🙂
Monitoring the system is crucial for us; we always need to know what’s happening in it and which components need extra attention. The rapid transition from POC to one of the leading production components drove us to implement logging and monitoring. We needed to quickly figure out how to do logging in a way that would be both fast and reasonably priced. Thus, we’ve developed an approach encapsulated in a shared logging library and a metrics library. The latter is supplemented and expanded with each iteration.
The core functions of our MLOps have proved themselves. We have tested them in a real-time recommendation system and are using them successfully in two other systems.
As for the real-time recommendation system, after we switched it on in all the tabs of the Tango app, the Follow rate increased by more than 10%, and FTP (First Time Purchase) increased too, which directly correlates with overall revenue.
We are by no means going to stop here, and we remain open to new concepts and approaches.
As for the models, they are sure to become even smarter and more sophisticated. The use of reinforcement learning can improve the quality of recommendations too.
We are also considering a test application of Kubeflow on one of our projects, but we’ll talk about that a bit later.
I’ll be happy to answer all your comments and questions.