Key Highlights
- Discord’s machine learning systems evolved from simple classifiers to complex models serving hundreds of millions of users
- The company overcame scaling challenges by adopting distributed computing with Ray, an open-source framework
- Discord built a custom platform around Ray, resulting in a +200% improvement on business metrics with models like Ads Ranking
The rapid growth of Discord’s user base led to an increased demand for more sophisticated machine learning models. As the company’s models became more complex, they encountered significant scaling challenges, including the need for multiple GPUs, larger datasets, and increased computational power. This move reflects broader industry trends, where companies are struggling to scale their machine learning capabilities to meet growing user demands.
Overcoming Scaling Challenges
The adoption of distributed computing was a crucial step in addressing these challenges. Discord turned to Ray, an open-source distributed computing framework, to build a custom platform that would make distributed machine learning easy to use. The platform included custom CLI tooling, orchestration with Dagster + KubeRay, and an observability layer called X-Ray. By focusing on developer experience, Discord aimed to turn distributed machine learning into a system that developers would be excited to work with.
The company’s efforts paid off, as they were able to transition from ad-hoc experiments to a production orchestration platform. This enabled the development of models like Ads Ranking, which delivered a significant improvement on business metrics. The success of this model demonstrates the importance of scaling machine learning capabilities to drive business growth.
Credit: discord.com
Building a Custom Platform
Discord’s custom platform was built around the following key components:
- Ray: an open-source distributed computing framework
- Dagster: a workflow orchestration tool
- KubeRay: a Kubernetes-based Ray operator
- X-Ray: an observability layer for monitoring and debugging These components worked together to provide a seamless developer experience, allowing developers to focus on building and deploying machine learning models without worrying about the underlying infrastructure.
Conclusion
Discord’s journey to scaling their machine learning capabilities is a testament to the importance of distributed computing in driving business growth. By adopting a custom platform built around Ray and focusing on developer experience, the company was able to overcome significant scaling challenges and achieve remarkable results. As the demand for more sophisticated machine learning models continues to grow, companies must prioritize scaling their capabilities to stay competitive.
Source: Official Link