January 27, 2022
Hi, I am Daniil, Head of Data Science at Tango, the leading live streaming platform worldwide. The Data Science team is responsible for a wide range of areas across the company, from conversion funnel optimization and revenue prediction to moderation and fraud prevention. Today, however, I’d like to give you a sneak peek into another one: the Recommender Engine.
What is the Recommender Engine? In short, it is a combination of models, environment, infrastructure, and processes with only two aims: to show each viewer the most interesting and individually relevant content, and simultaneously to expose streamers to as large an audience as possible. Sounds pretty simple, doesn’t it? We face several challenges that make it less straightforward and, therefore, extremely exciting. In this article, I’d like to highlight some of them.
The field of recommender engines is not unique to Tango. Many other companies face similar challenges of delivering individually picked content to their users and have state-of-the-art recommendation systems. YouTube video suggestions usually guess correctly what I’d like to watch next. The same is true for Netflix, Instagram, Spotify, and TikTok. So why can’t we at Tango just build similar recommender logic and call it a day?
The answer is in the word “dynamic.” All the services mentioned deal mainly with static, unchangeable items: movies, videos, songs. These items are already filmed or recorded and will not change. In other words, if I watch a video on YouTube and like it, people with similar tastes will probably also like this video, because it is the same piece of content. Even stream recommendations at the services that offer streaming (such as YouTube) rely heavily on the streamers’ static content.
The Tango situation is quite different. As we operate only in the live streaming domain, the content is created and consumed in real time. Therefore, we have to recommend content that does not exist yet. What if a particular streamer is generally the perfect match for a specific viewer, but the current stream is not such a great fit, or the streamer has just stepped out of the frame and is now streaming an empty room? Is it still a good fit? Not really.
So, how do we overcome this?
At a high level, there are three time-dependent levels of data.
The long-term data is the general data about streamers and viewers, independent of the current stream.
The mid-term data is the data about the current stream from its beginning. It includes both the content assessment of the stream and the viewers’ behavior data (how long they watched, how involved they were).
The short-term data is the most recent behavior data of the stream (the last minute). This enables us to dynamically react to real-time events, such as the streamer leaving the camera frame and causing viewers to leave the stream.
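To make the three levels concrete, here is a minimal sketch of how they might be represented. The field names and the drop-rate heuristic are illustrative assumptions, not Tango’s actual schema:

```python
from dataclasses import dataclass

# Hypothetical sketch of the three time-dependent data levels;
# field names are illustrative, not Tango's actual schema.

@dataclass
class LongTermData:
    # general streamer/viewer profile, independent of the current stream
    streamer_id: str
    content_tags: list
    historical_match_score: float

@dataclass
class MidTermData:
    # accumulated since the current stream began
    stream_id: str
    avg_watch_seconds: float
    engagement_rate: float  # share of viewers who interacted

@dataclass
class ShortTermData:
    # last-minute behavior, e.g. a sudden exodus of viewers
    viewers_now: int
    viewers_one_minute_ago: int

    def drop_rate(self) -> float:
        # fraction of viewers lost during the last minute
        if self.viewers_one_minute_ago == 0:
            return 0.0
        lost = max(0, self.viewers_one_minute_ago - self.viewers_now)
        return lost / self.viewers_one_minute_ago
```

A drop rate close to 1.0 in the short-term data would be exactly the “streamer left the frame” signal described above.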
Ok, now we know what data we have. But how do we use it?
Before we proceed, it is essential to understand what a recommendation actually is. Long story short, there is an ensemble of models behind the recommendation engine. When a viewer opens the app, the models estimate the likelihood of a good individual match with each online streamer and return the list of streamers ordered from the best match onwards. This likelihood is derived from different models, from ones based on collaborative filtering and image recognition to those using NLP and other methodologies. We use a sophisticated cascade of multiple models for each request and combine their results to give the best recommendation. And here we come to the long-term data, which is exactly the data we use to train these models. Thus, the long-term data, with the help of the models, gives an initial ordered list of online streamers at each viewer’s request. What next?
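The combining step can be sketched roughly as follows. The model names and the weighted-sum blend are my assumptions for illustration; the real cascade is certainly more sophisticated:

```python
# Illustrative sketch: combine per-model match likelihoods into one
# ranked list of online streamers for a viewer. The weighted-sum blend
# is an assumption, not Tango's actual cascade.

def rank_streamers(model_scores: dict, weights: dict) -> list:
    """model_scores: {streamer_id: {model_name: likelihood in [0, 1]}}"""
    combined = {}
    for streamer, scores in model_scores.items():
        combined[streamer] = sum(
            weights.get(model, 0.0) * p for model, p in scores.items()
        )
    # best match first
    return sorted(combined, key=combined.get, reverse=True)

scores = {
    "s1": {"collaborative": 0.9, "image": 0.4},
    "s2": {"collaborative": 0.3, "image": 0.8},
}
weights = {"collaborative": 0.7, "image": 0.3}
ranking = rank_streamers(scores, weights)  # → ["s1", "s2"]
```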
So far, we have a list of streamers ordered by how well they might match the viewer at the moment. But streamer and stream are not the same! A perfectly matching streamer might be running an uninteresting stream, and vice versa. That is where the mid-term and short-term data come into play. They adjust the initial list’s order: streams with relevant content get additionally promoted, while those with a worse match, or where viewers’ behavior indicates loss of interest, get demoted.
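A minimal sketch of that adjustment step, assuming hypothetical multiplier signals: a mid-term content-match factor and a short-term demotion when recent behavior signals loss of interest:

```python
# Hypothetical sketch of re-ordering the long-term list with stream-level
# signals; the multiplier form and the numbers are illustrative.

def adjust_ranking(base_scores: dict, content_match: dict,
                   drop_rate: dict) -> list:
    adjusted = {}
    for streamer, score in base_scores.items():
        # promote streams with matching content (mid-term),
        # demote streams that viewers are abandoning (short-term)
        adjusted[streamer] = (
            score
            * content_match.get(streamer, 1.0)
            * (1.0 - drop_rate.get(streamer, 0.0))
        )
    return sorted(adjusted, key=adjusted.get, reverse=True)

base = {"s1": 0.9, "s2": 0.6}
# s1's streamer left the frame: 80% of viewers departed in the last minute
ranking = adjust_ranking(base, {"s2": 1.2}, {"s1": 0.8})  # → ["s2", "s1"]
```

Note how the short-term signal can overturn an otherwise dominant long-term score, which is exactly the behavior the article describes.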
As a result of streamers’ recommendations adjusted by the estimated stream match, viewers see the content that is individually relevant to them and currently on fire. Hooray!
How many items (movies) does Netflix have? Around fifteen thousand overall. How many items (streamers) are there on Tango? More than half a million unique streamers over the last month. Any of these streamers and any new one might start streaming at any moment. Therefore, we should always be ready. In other words, we should always train our models to give any viewer (more than eight million unique viewers last month) the match-ordered list of streamers even if all streamers happen to go online simultaneously. Our models have to support this.
Due to the number of both streamers and viewers, it is evident that most viewers have never watched most of the streamers. Of course, we do not calculate the adjacency matrix of views (all streamers on one axis, all viewers on the other, with a 1 in each cell if the viewer watched the streamer and a 0 otherwise). But if we had this matrix, it would be 99.99% sparse.
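A back-of-the-envelope sketch of that sparsity: storing views as a set of (viewer, streamer) pairs rather than a matrix, the density of the implied matrix is tiny. The counts below are illustrative, scaled down from the article’s figures:

```python
# Illustrative sparsity estimate; the counts are toy numbers,
# scaled down from the article's figures.

def sparsity(n_viewers: int, n_streamers: int, observed_pairs: set) -> float:
    """Share of viewer-streamer cells that are zero (never watched)."""
    total_cells = n_viewers * n_streamers
    return 1.0 - len(observed_pairs) / total_cells

# e.g. 8,000 viewers x 500 streamers with only 400 observed views
views = {(v, v % 500) for v in range(400)}
density_gap = sparsity(8_000, 500, views)  # → 0.9999, i.e. 99.99% sparse
```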
This combination of a tremendous number of users and extreme sparsity leads to a classic Data Science problem: how do we validate the model? The usual validation method for collaborative filtering-based models uses metrics such as precision-at-k. In brief, suppose we have all the available viewer-streamer pairs that historically had interactions (views, gifts, follows) and their ratings (goodness-of-match scores). We then randomly take out some share of them as a validation set, train the model on the rest of the data, predict the top matches for each viewer, and see how many of the actual top-rated validation-set streamers appear among them.
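The metric itself is simple; here is a minimal implementation of precision-at-k as described above (the streamer IDs are of course made up):

```python
# Minimal precision-at-k as described in the text: what share of the
# model's top-k predictions appear in the held-out validation set.

def precision_at_k(predicted: list, held_out: set, k: int) -> float:
    """predicted: streamers ordered best-first; held_out: validation streamers."""
    top_k = predicted[:k]
    hits = sum(1 for streamer in top_k if streamer in held_out)
    return hits / k

predicted = ["s1", "s2", "s3", "s4", "s5"]
held_out = {"s2", "s5", "s9"}
p = precision_at_k(predicted, held_out, k=5)  # → 0.4 (2 hits out of 5)
```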
This approach works well for many recommendation engine problems. However, at our scale of data, precision-at-k can be nearly useless. Due to the sheer number of streamers, a perfect recommender engine might recommend many streamers that match even better than those in the validation set. In that case, very few of the validation-set streamers would appear at the top, and the metric would look low. If, in turn, the model is genuinely inaccurate and assigns high predicted ratings to mismatched streamers, precision-at-k will be low as well. In other words, the metric cannot distinguish a great model from a poor one.
Precision-at-k is only one example of many model evaluation metrics, but the others behave similarly. With Tango’s number of users and the sparsity of their interactions, no offline metric can tell us how well the models will perform relative to each other before deployment. Not a problem: let’s assess them in production!
How do we do it? First of all, when we have a newly trained model, we get the most out of the available offline evaluation methodologies. Even though they cannot compare models as well as we need, they let us catch some poorly performing ones. After these checks, we start the production AB test. Honestly, calling it an AB test is a bit of an understatement: given that every new model has a bunch of hyperparameters, it makes more sense to call the process an ABCDE…-test. Overall, we start with multiple versions of the new models turned on for a small share of the users, usually around 1% at most. There can be up to twenty such experimental model versions, each with its own dedicated users. After that, the tournament system comes into play: the models that show better performance take over the users of those that perform worse, the latter being “eliminated,” and the competition continues. Of course, if all the experimental models perform much worse than the production one, the tournament stops with no winners, and the production model stays. But if not, in the end the best experimental variation remains and is, in turn, compared to the production model. The fittest survives!
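The tournament dynamic can be sketched like this. The halving rule, the performance numbers, and the even reassignment of users are my simplifying assumptions; the real system compares live metrics, not fixed scores:

```python
# Illustrative tournament between experimental model variants: each round,
# the worse half is eliminated and its users go to the survivors.
# The halving rule and fixed scores are simplifying assumptions.

def run_tournament(performance: dict, users_per_model: int) -> tuple:
    """performance: {model_name: observed metric}; returns (winner, its users)."""
    alive = dict(performance)
    user_counts = {m: users_per_model for m in alive}
    while len(alive) > 1:
        ranked = sorted(alive, key=alive.get, reverse=True)
        half = len(ranked) // 2
        survivors, eliminated = ranked[:half], ranked[half:]
        freed = sum(user_counts.pop(m) for m in eliminated)
        # eliminated models' users are split among the survivors
        for m in survivors:
            user_counts[m] += freed // len(survivors)
        alive = {m: alive[m] for m in survivors}
    winner = next(iter(alive))
    return winner, user_counts[winner]

winner, users = run_tournament({"A": 0.12, "B": 0.15, "C": 0.11, "D": 0.09}, 100)
# → winner "B", which ends up with all 400 users
```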
What do we know about the content of a stream? Virtually not that much. Explicitly, only the stream’s hashtags indicate what is inside. However, streamers set the hashtags themselves, so sometimes the hashtags are not entirely correct. On the one hand, there is direct hashtag abuse, where streamers write misleading hashtags meant to attract viewers. This usually happens with #music, #dance, or other art-related hashtags. On the other hand, some hashtags are too subjective but can still be considered hashtag abuse. For example, sometimes you might see a stream with the hashtag #beautiful, open the stream, but … never mind. In addition, some hashtags give no clue about the stream content at all: hashtags such as #welcome and #followme are of no value from the algorithm’s point of view. Overall, hashtags as they are cannot give much information because of the abundance of meaningless noisy hashtags and the hashtag abusers. But this does not stop us from our desire to label the stream correctly. How do we do it?
First, some of our models help us detect hashtag abuse. A simplified version of one of them looks as follows. There is a group of viewers with strict preferences towards some content, say, music. After identifying these people, we can observe their reaction to streams hashtagged #music. If most of them leave the stream shortly after they start watching, this might indicate either that it is not about music or that the quality is not good enough to be promoted to other music lovers. We can extrapolate this approach to many other content groups and the viewers who love them.
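A sketch of that simplified check, with the early-leave threshold and the leave-share cutoff as hypothetical parameters:

```python
# Simplified sketch of the hashtag-abuse check described above; the
# 30-second and 80% thresholds are hypothetical parameters.

def looks_like_hashtag_abuse(watch_seconds: list, min_seconds: float = 30.0,
                             leave_share_threshold: float = 0.8) -> bool:
    """watch_seconds: watch durations of known music lovers on a #music stream."""
    if not watch_seconds:
        return False
    left_early = sum(1 for s in watch_seconds if s < min_seconds)
    # if most genre fans bail out almost immediately, the tag is suspect
    return left_early / len(watch_seconds) >= leave_share_threshold

suspect = looks_like_hashtag_abuse([5, 12, 8, 40, 3])   # 4 of 5 left early → True
genuine = looks_like_hashtag_abuse([120, 300, 45, 90, 60])  # nobody left early → False
```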
Secondly, we have the history of streamers’ hashtags, which gives us more insights into what the streamers usually perform.
Thirdly, we have a whole internal infrastructure that regularly takes screenshots of the streams and processes them with Image and Video Recognition models to get accurate labels of what is happening inside. This system is automatic and highly scalable, so we can regularly get labels for what happens in many streams. In addition, there are people at Tango who continuously label content manually to make sure viewers get the most out of the service. Knowing the current labels and the history of previous ones, for both the current stream and the streamer, we can build accurate content profiles of streamers.
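One simple way to turn a history of recognition labels into a content profile is a normalized frequency distribution; a minimal sketch, with illustrative label names:

```python
from collections import Counter

# Minimal sketch: build a streamer's content profile as a normalized
# frequency distribution over recognition labels. Label names are
# illustrative, not Tango's actual taxonomy.

def content_profile(label_history: list) -> dict:
    """label_history: labels emitted by recognition models across screenshots."""
    counts = Counter(label_history)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

profile = content_profile(["music", "music", "dance", "music"])
# → {"music": 0.75, "dance": 0.25}: a predominantly music streamer
```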
Combining all the approaches above enables us to build a content meta-model, which helps us give the most individually relevant content to each viewer while promoting high-quality content.
When you open the app, I bet you do not want to wait long to see the content. Neither do we. Therefore, our brilliant ML Engineers developed a state-of-the-art infrastructure that makes the whole client-server-model-model-…-model-server-client cycle run with a median of under 100 milliseconds (!!!). If you are as impressed by these numbers as I am, I strongly recommend reading the article on how we do it by our incredibly talented ML Engineering lead, Igor Gorbenko.
There are many other exciting and interesting things that Data Science does at Tango, but let’s leave those for the next articles. In the meantime, I will always be happy to answer any questions.
Thank you, have fun and use Tango!
Head of Data Science