Scaling an application is a pivotal milestone in its lifecycle. It demands meticulous preparation, robust tooling, and a keen eye for optimization. In our journey of scaling, we encountered various challenges and uncovered invaluable insights. Here's an account of our experience and the lessons we learned.

Proper Preparation is Crucial

Metrics Collection with Prometheus and Grafana

We made a strategic decision to employ Prometheus for data collection and Grafana for visualization. This combination provided us with profound insights into the performance of our application.

Creating Custom Metrics

To gain a comprehensive view, we crafted custom Prometheus metrics for external services (a sketch of their definitions follows the list):

  • Gauge for RabbitMQ Queue Length: Monitoring queue length in RabbitMQ allowed us to manage workloads efficiently.
  • Histograms for Redis and MongoDB Access: Understanding access patterns to Redis and MongoDB enabled us to fine-tune their configurations.
  • Histograms for Background Job Durations: Analyzing job durations provided critical information for optimizing our background processing.
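As a minimal sketch of what such definitions can look like with prom-client; the metric names, labels, and buckets here are illustrative, not our production values:

```typescript
import { Gauge, Histogram } from "prom-client";

// Gauge tracking the current length of a RabbitMQ queue,
// set from a periodic poll of the broker.
export const rabbitQueueLength = new Gauge({
  name: "rabbitmq_queue_length",
  help: "Current number of messages waiting in a RabbitMQ queue",
  labelNames: ["queue"],
});

// Histograms capturing access latency (in seconds) to Redis and MongoDB.
export const redisAccessDuration = new Histogram({
  name: "redis_access_duration_seconds",
  help: "Duration of Redis commands",
  labelNames: ["command"],
  buckets: [0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1],
});

export const mongoAccessDuration = new Histogram({
  name: "mongodb_access_duration_seconds",
  help: "Duration of MongoDB operations",
  labelNames: ["operation", "collection"],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5],
});

// Histogram for background job durations, labelled by job name.
export const jobDuration = new Histogram({
  name: "background_job_duration_seconds",
  help: "Duration of background jobs",
  labelNames: ["job"],
  buckets: [0.1, 0.5, 1, 5, 15, 60, 300],
});
```

At call sites, startTimer(labels) on a histogram returns a function that records the elapsed time when invoked, which keeps the instrumentation down to two lines per operation.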

Leveraging prom-client Library

The prom-client library for Node.js proved to be a powerful tool for creating application-specific metrics. It streamlined the process and ensured the accuracy of our measurements.
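A hedged sketch of the wiring, assuming an Express app and the conventional /metrics scrape path:

```typescript
import express from "express";
import { collectDefaultMetrics, register } from "prom-client";

// Also collect Node.js runtime metrics (event loop lag, GC, memory).
collectDefaultMetrics();

const app = express();

// Prometheus scrapes this endpoint at its configured interval.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});

app.listen(9100);
```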

Load Testing with Locust and GitHub Actions

We engineered load tests using Locust and integrated the environment setup into our CI/CD pipeline using GitHub Actions on Azure. This approach helped us simulate real-world scenarios and uncover performance bottlenecks early in the process.

Redis: Optimizing Latency

Redis played a pivotal role in our application, and choosing the right hosting platform was paramount:

  • Initially, we opted for a Redis instance hosted via redis.io on AWS. However, we encountered unexpectedly high latency of at least 25ms, which prompted us to reevaluate our choice.
  • Transitioning to Azure Redis instances drastically reduced latency to a range of 2-5ms, exceeding our expectations. Additionally, we benefited from cost savings on inbound and outbound traffic.
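To compare providers, we timed round trips from inside our own environment. A sketch of such a probe, assuming the ioredis client (REDIS_URL is a placeholder):

```typescript
import Redis from "ioredis";

// REDIS_URL is a placeholder; point it at the instance under test.
const redis = new Redis(process.env.REDIS_URL!);

async function measurePing(samples = 100) {
  const durations: number[] = [];
  for (let i = 0; i < samples; i++) {
    const start = process.hrtime.bigint();
    await redis.ping(); // one full round trip to the server
    durations.push(Number(process.hrtime.bigint() - start) / 1e6); // ms
  }
  durations.sort((a, b) => a - b);
  console.log(`median: ${durations[Math.floor(samples / 2)].toFixed(2)}ms`);
  console.log(`p99:    ${durations[Math.floor(samples * 0.99)].toFixed(2)}ms`);
}

measurePing().then(() => redis.quit());
```

Round-trip numbers like these are only meaningful when measured from the network the application actually runs in, which is exactly where cross-cloud hosting hurts.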

RabbitMQ: Performance Tuning

Selecting the right message broker was crucial for ensuring seamless communication between services:

  • We initially chose RabbitMQ from CloudAMQP, hosted in Azure. Fortunately, we experienced satisfactory latency from the outset.
  • Switching to LavinMQ further enhanced throughput, providing higher performance at a lower cost. The flexible pricing model allowed us to test high-performance instances without a significant financial commitment.
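Because LavinMQ speaks the same AMQP 0-9-1 protocol as RabbitMQ, the switch was essentially a connection-string change for us. A sketch using amqplib, with an illustrative queue name and URL:

```typescript
import amqp from "amqplib";

async function main() {
  // The same code talks to RabbitMQ or LavinMQ; only AMQP_URL differs.
  const connection = await amqp.connect(process.env.AMQP_URL!);
  const channel = await connection.createChannel();
  await channel.assertQueue("ad-requests", { durable: true });

  // Publish a message...
  channel.sendToQueue(
    "ad-requests",
    Buffer.from(JSON.stringify({ requestId: 42 })),
    { persistent: true }
  );

  // ...and consume it.
  await channel.consume("ad-requests", (msg) => {
    if (msg) {
      // process the message, then acknowledge it
      channel.ack(msg);
    }
  });
}

main();
```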

MongoDB: Navigating Latency Challenges

Hosting our MongoDB instance on Atlas MongoDB Cloud gave us the flexibility to choose from the top three cloud providers:

  • Opting for a serverless configuration provided the agility we needed. While latency averaged around 25 to 30ms, it met our requirements for the moment.
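Since the network distance to a serverless cluster is out of our hands, we at least tracked it. A sketch of timing a query with the MongoDB histogram defined earlier; the metrics module path and collection name are illustrative:

```typescript
import { MongoClient } from "mongodb";
import { mongoAccessDuration } from "./metrics"; // the histogram sketched earlier

const client = new MongoClient(process.env.MONGODB_URI!);

export async function findCampaign(campaignId: string) {
  // startTimer() returns a function that observes the elapsed seconds.
  const end = mongoAccessDuration.startTimer({
    operation: "findOne",
    collection: "campaigns",
  });
  try {
    return await client
      .db("ads")
      .collection("campaigns")
      .findOne({ campaignId });
  } finally {
    end();
  }
}
```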

Efficient Job Management with BullMQ

Managing background jobs efficiently was essential for the smooth operation of our application:

  • BullMQ, which uses Redis under the hood, proved to be a wise choice for job management. Its delayed job execution feature was particularly valuable (see the sketch after this list).
  • With a well-optimized Redis setup, the overhead of job management was minimal, allowing us to process numerous ad server requests concurrently for different users.
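A minimal sketch of the pattern in an ES module context, assuming a shared Redis connection; queue name, job names, and concurrency are illustrative:

```typescript
import { Queue, Worker } from "bullmq";

// Assumption: Redis reachable at this address.
const connection = { host: "localhost", port: 6379 };

const queue = new Queue("ad-requests", { connection });

// The delayed-execution feature mentioned above: BullMQ holds this
// job back for five seconds before it becomes available to workers.
await queue.add("fetch-ads", { userId: "u-123" }, { delay: 5_000 });

// A worker processes jobs for many users concurrently.
const worker = new Worker(
  "ad-requests",
  async (job) => {
    // ...call the ad servers for job.data.userId...
    return { served: true };
  },
  { connection, concurrency: 50 }
);

worker.on("completed", (job) => console.log(`job ${job.id} done`));
```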

Additional Insights: Logging and Correlation IDs

During load testing, we observed unexpected behavior in our application. This highlighted the importance of robust logging:

  • We recommend implementing a correlationId concept and updating your logging strategy to include it. This enables you to trace a request's entire journey in your log aggregation tool, such as Loki, for easier debugging.
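A sketch of the idea, assuming Express and pino; the header name and the AsyncLocalStorage plumbing are one possible implementation, not necessarily ours:

```typescript
import { randomUUID } from "node:crypto";
import { AsyncLocalStorage } from "node:async_hooks";
import express from "express";
import pino from "pino";

const als = new AsyncLocalStorage<{ correlationId: string }>();
const base = pino();

// Returns a logger bound to the current request's correlationId, if any.
export function log() {
  const store = als.getStore();
  return store ? base.child({ correlationId: store.correlationId }) : base;
}

const app = express();

// Reuse an incoming x-correlation-id header or mint a new one, and make
// it available to everything that runs during this request.
app.use((req, _res, next) => {
  const correlationId =
    (req.headers["x-correlation-id"] as string) ?? randomUUID();
  als.run({ correlationId }, next);
});

app.get("/ads", (_req, res) => {
  log().info("serving ad request"); // log line carries the correlationId
  res.sendStatus(204);
});
```

Propagating the same id on outbound requests and into queued jobs is what lets the aggregation tool stitch the whole journey together.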

Next Steps: Automation and Performance-Based Scaling

Looking ahead, our focus is on automating performance tests as part of our CI/CD pipeline. This will ensure that tests are consistently executed, results are interpreted, and deployment decisions are based on accurate performance metrics.

Additionally, we plan to scale external services dynamically based on real-time performance metrics. This adaptive approach allows us to optimize resource allocation and maintain high application performance.

In conclusion, scaling an application demands meticulous planning, continuous monitoring, and a willingness to adapt. Through our journey, we’ve learned that with the right tools and strategies in place, scaling can be a transformative process, setting the stage for even greater achievements in the future. And we are not yet at the end of our journey …