How Razorpay Scaled Their Notification Service | Ravi Singh

Ravi Singh
Oct 13, 2022

Razorpay, a payments service, is one of the most valued fintech businesses in India. As its user base and payment volume have expanded, the number of transactions on the system has been increasing exponentially.

One service that needed to be updated was Razorpay's Notification service, the platform that handles all customer notification needs for SMS, email, and webhooks.

Existing Notification Flow

  1. The notification service API receives the request and, after some validations, sends it to an AWS SQS queue.
  2. Workers consume the message from the SQS queue and send the notification. The executors write the result to the MySQL database and to Razorpay's data lake.
  3. Scheduler nodes check the MySQL database for any notifications that were not sent successfully and push them back to the SQS queue to be processed again.
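
The three steps above can be sketched in a few lines. This is a minimal in-memory model, not Razorpay's code: a `deque` stands in for the SQS queue, a dictionary stands in for the MySQL table, and the names `api_receive`, `worker_step`, `scheduler_step`, and `flaky_send` are hypothetical.

```python
from collections import deque

queue = deque()   # stand-in for the SQS queue
status_db = {}    # stand-in for the MySQL table: id -> "sent" | "failed" | "queued"
payloads = {}     # request bodies, kept so the scheduler can re-enqueue them


def api_receive(notification):
    """Step 1: validate the request, then enqueue it."""
    valid_channels = {"sms", "email", "webhook"}
    if not notification.get("id") or notification.get("channel") not in valid_channels:
        raise ValueError("invalid notification")
    payloads[notification["id"]] = notification
    queue.append(notification["id"])


def worker_step(send):
    """Step 2: consume one message, try to send it, record the result."""
    if not queue:
        return False
    notification_id = queue.popleft()
    ok = send(payloads[notification_id])
    status_db[notification_id] = "sent" if ok else "failed"
    return True


def scheduler_step():
    """Step 3: push every failed notification back onto the queue."""
    retried = [nid for nid, status in status_db.items() if status == "failed"]
    for nid in retried:
        status_db[nid] = "queued"
        queue.append(nid)
    return retried


# Usage: the first delivery attempt of "n2" fails, the retry succeeds.
attempts = {}


def flaky_send(notification):
    nid = notification["id"]
    attempts[nid] = attempts.get(nid, 0) + 1
    return not (nid == "n2" and attempts[nid] == 1)


api_receive({"id": "n1", "channel": "sms"})
api_receive({"id": "n2", "channel": "webhook"})
while worker_step(flaky_send):
    pass
retried = scheduler_step()
while worker_step(flaky_send):
    pass
```

Note that every worker and every scheduler pass touches `status_db` directly, which is where the database load in the real system comes from.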

Although this architecture could handle about 2,000 transactions per second, Razorpay was still unable to meet its SLAs as load grew, with P99 latency increasing from 2 seconds to 4 seconds.

Challenges when Scaling Up

  1. Database Bottleneck - Read load on the database was high during peak traffic.
  2. Customer Responses - Some customers' servers respond slowly to webhooks, which blocked worker nodes while they waited for a response. Scaling the workers was also limited by the input/output operations on the database.
  3. Unexpected Increases in Load - Load would increase unexpectedly on certain events or days, which impacted the notification platform.

To overcome these issues, the Razorpay team decided to:

  1. Prioritize Notifications.
  2. Eliminate the database bottleneck.
  3. Manage SLAs for customers who don’t respond promptly to webhooks.

Rearchitecting the Notification System

  • Not all notifications are equal. Transactional notifications are more important than marketing notifications.
  • One type of notification should not affect other types of notifications.

Solution: Queues

To ensure that one customer's events do not affect others, they use rate limiting on their APIs. Each queue, event, and customer has a configurable rate limit, and messages go into separate queues accordingly.
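
A minimal sketch of per-customer rate limiting with queue routing follows. It assumes a fixed-window counter and a hypothetical overflow queue (`rate_limited_queue`) for messages that exceed the limit. The source does not say which algorithm or queue layout Razorpay uses, so treat the limiter, the limit values, and all names here as illustrative.

```python
from collections import defaultdict, deque


class FixedWindowLimiter:
    """Allows at most `limit` events per key in each `window_seconds` window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        # (key, window index) -> events seen; old windows are never evicted here
        self._counts = defaultdict(int)

    def allow(self, key, now):
        window = int(now // self.window_seconds)
        self._counts[(key, window)] += 1
        return self._counts[(key, window)] <= self.limit


main_queue = deque()
rate_limited_queue = deque()  # assumed overflow queue, processed separately
limiter = FixedWindowLimiter(limit=2, window_seconds=1.0)


def route(customer_id, event, now):
    """Send the event to the main queue, or to the overflow queue if over the limit."""
    target = main_queue if limiter.allow(customer_id, now) else rate_limited_queue
    target.append((customer_id, event))


# Customer "a" bursts four events within one second; customer "b" sends one.
for i in range(4):
    route("a", f"event-{i}", now=10.0 + i * 0.1)
route("b", "event-0", now=10.2)
route("a", "event-4", now=11.5)  # a new window, so "a" is allowed again
```

The burst from customer "a" spills into the overflow queue, while customer "b" is unaffected and still lands in the main queue.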

Reducing the Database Bottleneck

As load increases, the number of workers/executors increases, but the database is not elastic, so input/output operations on the database become the bottleneck. Vertical scaling is costly and cannot be continued forever.

Solution: Writing to the Database Asynchronously

The team decided to write to the database asynchronously using AWS Kinesis. Kinesis is a fully managed data streaming service offered by Amazon, and it is commonly used in real-time big data processing applications. The worker nodes now write the status of notification messages to Kinesis rather than directly to MySQL.
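
The idea can be sketched without any AWS dependency. In the example below a `deque` stands in for the Kinesis stream and an in-memory SQLite database stands in for MySQL. The table name, the `publish_status` and `drain_to_db` helpers, and the batch size are all assumptions made for illustration. With a real stream, the publish step would be a call such as boto3's Kinesis `put_record`.

```python
import json
import sqlite3
from collections import deque

stream = deque()  # stand-in for the Kinesis stream

db = sqlite3.connect(":memory:")  # stand-in for MySQL
db.execute("CREATE TABLE notification_status (id TEXT PRIMARY KEY, status TEXT)")


def publish_status(notification_id, status):
    """Called by a worker: append a record and return without touching the DB."""
    stream.append(json.dumps({"id": notification_id, "status": status}))


def drain_to_db(batch_size=100):
    """Called by a separate consumer: write buffered records in one batch."""
    batch = []
    while stream and len(batch) < batch_size:
        record = json.loads(stream.popleft())
        batch.append((record["id"], record["status"]))
    with db:  # one transaction per batch
        db.executemany(
            "INSERT OR REPLACE INTO notification_status (id, status) VALUES (?, ?)",
            batch,
        )
    return len(batch)


# Workers publish 250 status updates without waiting on the database.
for i in range(250):
    publish_status(f"n{i}", "sent")

# The consumer later drains the stream at its own pace.
written = []
while stream:
    written.append(drain_to_db())

row_count = db.execute("SELECT COUNT(*) FROM notification_status").fetchone()[0]
```

Because the consumer controls the batch size and pace, the number of workers can grow without a matching growth in direct database writes.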

Managing Webhooks with Delayed Responses

When the webhook call is made, the worker node is blocked as it waits for the customer to respond. Some customer servers don’t respond quickly and this can affect overall system performance.

To solve this, engineers came up with the concept of Quality of Service for customers: a customer whose servers respond slowly will have their webhook notifications given a lower priority for the next few minutes. Afterwards, Razorpay re-checks whether the customer’s servers are responding quickly again.
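
One way to model this behaviour is a small tracker that demotes a customer after a slow response and lets the penalty expire on its own. The class name `WebhookQoS`, the 2-second threshold, and the 5-minute penalty are assumptions chosen for the example. The source only says "delayed response" and "the next few minutes".

```python
class WebhookQoS:
    """Demotes customers whose endpoints respond slowly, then re-checks them later."""

    def __init__(self, slow_threshold_s=2.0, penalty_s=300.0):
        self.slow_threshold_s = slow_threshold_s  # assumed cutoff for "slow"
        self.penalty_s = penalty_s                # assumed "few minutes" window
        self._demoted_until = {}  # customer id -> time when the penalty expires

    def record_response(self, customer_id, response_time_s, now):
        """Start (or extend) the penalty window after a slow webhook response."""
        if response_time_s > self.slow_threshold_s:
            self._demoted_until[customer_id] = now + self.penalty_s

    def priority(self, customer_id, now):
        """Return "low" while the penalty is active, "normal" otherwise."""
        return "low" if now < self._demoted_until.get(customer_id, 0.0) else "normal"


qos = WebhookQoS()
qos.record_response("merchant-1", response_time_s=5.0, now=0.0)

p_during = qos.priority("merchant-1", now=60.0)   # still inside the penalty window
p_other = qos.priority("merchant-2", now=60.0)    # other customers are unaffected
p_after = qos.priority("merchant-1", now=301.0)   # penalty expired, re-check allowed

# The re-check call is fast, so the customer stays at normal priority.
qos.record_response("merchant-1", response_time_s=0.3, now=301.0)
p_recheck = qos.priority("merchant-1", now=302.0)
```

Once the penalty expires, the next webhook call acts as the re-check: a fast response keeps the customer at normal priority, and another slow one starts a new penalty window.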

Observability

Razorpay built a robust system around observability to ensure the system scales well with increased load.

They have dashboards and alerts in Grafana to detect any anomalies, monitor the system’s health, and analyze their logs. They also use distributed tracing tools to understand the behavior of the various components in their system.

