Discovering Emerging Short Form Video Trends with Clustering and HitL


The rise of social media platforms has driven the popularity of user-generated content (UGC), which is evolving from the bottom up faster than ever. Platforms are competing for user attention and engagement - some are winning, and some are falling behind. How can a social media platform stay ahead of the game?

One of the most effective ways to incentivize users to create and engage with content on the platform is by curating and promoting trends. A trend is a type of content repeated and shared amongst creators - a specific subject matter, action, or format that rises in popularity and eventually becomes viral. Common types of trends include dance challenges, transitions, and comedy skits. While videos in a trend follow a similar format, creators usually express their creativity and individuality by adding a personal touch to the videos. For that reason, the variety of the videos within the same trend can be wide and unpredictable.

The key elements of these video trends include the video’s image, motion, audio, and text overlay. For example, a transition trend is one in which participants show a sudden change in their appearance, often signaled by a dramatic sound effect in the background music and sometimes accompanied by dance moves, lip syncing, and text overlay. The lifecycle of a trend can vary from days to months depending on various factors. To remain competitive, social media platforms need to detect trends while they are still emerging and put them to use immediately by matching them to the right creators and promoting them to the right audience.

Scale Content Understanding improves platform experiences by enriching content metadata, discovering trend insights, and flagging sensitive content. One feature of Content Understanding is content data enrichment for better trend detection. To facilitate this process, we developed proprietary algorithms to detect trends early, quickly, and thoroughly. 

ML Clustering Overview

We took a human-in-the-loop (HitL) machine learning approach to detect trends. We represent trends as clusters of videos estimated from video-extracted embeddings. Once formed, these clusters are sent to a human pipeline (which we call “Cluster Labeling”) where taskers answer a series of questions about the videos in a given cluster - namely, whether a sample of the videos corresponds to a trend and, if so, what the trend might be called. The following are the phases of the ML Clustering process:

Phase 1: Embedding Extraction

To extract embeddings, we use a variety of multi-modal information extraction models, each trained on large-scale, common datasets whose distribution overlaps heavily with the social media domain.

Phase 2: Clustering

Once the embeddings are extracted, we run a clustering pipeline that takes in these embeddings and outputs clusters.


Given the embeddings, we first apply Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction, an approach inspired by BERTopic. UMAP maps high-dimensional data to a lower-dimensional space and scales well (e.g. it processes tens of thousands of points in tens of seconds). It reduces dimensionality in a way that aims to preserve the local neighborhood structure of the underlying Riemannian manifold while retaining much of the global structure. On these low-dimensional embeddings, we perform an iterative hierarchical clustering process. Each individual point in embedding space starts as its own cluster. We then calculate the proximity between all pairs of clusters, defined as the variance in embedding space of the cluster that would result from joining the pair (i.e. the mean squared distance of its points from the joined cluster's mean embedding vector). The two clusters that are closest under this definition are joined. This cluster-joining process repeats until no remaining pair can be joined without the resulting cluster exceeding a specified variance, which is provided as a hyperparameter.
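
To make the cluster-joining loop concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions rather than our production implementation: the function names (joined_variance, variance_clustering) and the max_variance parameter are hypothetical, the UMAP step is omitted (the input is assumed to be already dimensionality-reduced), and the stop condition is assumed to mean that joining halts once even the closest pair would form a cluster whose variance exceeds the threshold. The brute-force pair search is only practical for small inputs.

```python
from itertools import combinations


def joined_variance(points):
    """Mean squared distance of the points from their mean vector."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    total = sum(sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in points)
    return total / len(points)


def variance_clustering(points, max_variance):
    """Greedily join the pair of clusters whose union has the lowest variance.

    max_variance is a hypothetical name for the stop-condition hyperparameter:
    joining stops once the best available join would exceed it.
    """
    # Every point starts as its own cluster.
    clusters = [[tuple(p)] for p in points]
    while len(clusters) > 1:
        # Proximity of a pair = variance of the cluster the pair would form.
        best_var, i, j = min(
            (joined_variance(clusters[a] + clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best_var > max_variance:
            break  # no pair can be joined without exceeding the threshold
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
    for cluster in variance_clustering(toy_points, max_variance=1.0):
        print(cluster)
```

On the toy input, the three points near the origin and the two points near (10, 10) each end up in their own cluster, while the isolated point stays a singleton because joining it to anything would exceed the threshold.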

Stop Condition Variance Hyperparameter

We carefully tune the variance hyperparameter in the stop condition so that clusters are neither too coarse nor too granular. With a large value, the final clusters are allowed to span a wide region of embedding space, so their videos are not semantically similar enough to represent trends (which tend to be extremely specific). Conversely, with a small value, the clusters end up with too few videos for taskers in the Cluster Labeling pipeline to recognize the trend. The tuned value keeps clusters fine enough that their videos are semantically similar enough to correspond to a trend, yet coarse enough that taskers can recognize it. The hyperparameter is normalized to reduce the amount of per-dataset manual tuning.

Phase 3: Cluster Post-Processing

To improve the quality and throughput of the taskers, we prune clusters that contain fewer videos than our definition of a trend requires. We also add a post-processing density score to help taskers focus on the clusters that are tightest in embedding space - i.e. the ones whose videos are most likely to be semantically similar. This density score is calculated as the intra-cluster variance of each cluster in the dimensionality-reduced embedding space, so a lower score indicates a tighter cluster.
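
The post-processing step can be sketched in a few lines of pure Python. This is an illustrative sketch, not our production code: the names cluster_variance, post_process, and min_size are hypothetical, and the minimum cluster size used in the example is an arbitrary placeholder rather than our actual definition of a trend.

```python
def cluster_variance(points):
    """Intra-cluster variance: mean squared distance from the cluster mean."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    total = sum(sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in points)
    return total / len(points)


def post_process(clusters, min_size):
    """Prune small clusters, then rank the rest from tightest to loosest.

    min_size is a hypothetical stand-in for the minimum number of videos
    required for a cluster to count as a candidate trend.
    Returns a list of (density_score, cluster) pairs, lowest variance first.
    """
    kept = [c for c in clusters if len(c) >= min_size]
    scored = [(cluster_variance(c), c) for c in kept]
    scored.sort(key=lambda pair: pair[0])  # lower variance = tighter cluster
    return scored


if __name__ == "__main__":
    toy_clusters = [
        [(0, 0), (0, 2), (2, 0), (2, 2)],  # loose cluster
        [(5, 5), (5, 6), (6, 5)],          # tight cluster
        [(9, 9)],                          # too small, pruned
    ]
    for score, cluster in post_process(toy_clusters, min_size=3):
        print(round(score, 3), cluster)
```

Sorting by the score puts the tightest clusters at the front of the queue, which is one simple way the score could be used to direct tasker attention.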


Our trend detection product consistently achieves over 90% recall and 90% precision in production. Our HitL ML approach lets us stay flexible, customizing for specific content niches while maintaining high throughput. We are able to process tens of millions of videos with a turnaround of a few hours. Our priority is not only to detect trends promptly, but also to equip our customers with the data and insights they need to act on these trends immediately.

At Scale, we believe investing in data is the key to building better platform experiences and discovering trend insights. If you’re interested in learning more about Scale Content Understanding, speak to our team today. 
