Our system has long relied heavily on AWS SQS for message queuing, which cost us approximately $6,000 USD per month. Due to business requirements, we were tasked with reducing AWS expenses, and SQS was one of the key areas we looked at.
Over the previous years, we had already optimised our SQS usage to make it efficient and cost-effective. There are a variety of approaches that you can find with a quick search. A few general ones are:
- Utilise batch operations: Use SendMessageBatch and DeleteMessageBatch to send and delete up to 10 messages in a single API call, and set MaxNumberOfMessages on ReceiveMessage to receive up to 10 messages at once. This drastically reduces request counts, and therefore costs, because AWS SQS pricing is based on the number of requests.
- Remove queues that aren’t needed: Have a system that regularly queries CloudWatch metrics (e.g. ApproximateNumberOfMessagesVisible) and deletes queues that are idle to avoid paying for unused resources.
- Long Polling: Set ReceiveMessageWaitTimeSeconds to a value of up to 20 seconds. This reduces “empty receives” (polling without messages), which saves us costs on request charges.
- Configure Dead-Letter Queues (DLQs) to automatically route messages that fail to process, avoiding endless retries and saving costs from repeated failed attempts.
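To make the batching and long-polling points more concrete, here is a minimal sketch in Python. It is not our production code. It assumes a boto3-style SQS client is passed in as `sqs` (for example one created with `boto3.client("sqs")`), and the helper names `chunked`, `send_in_batches`, and `receive_and_delete` are made up for illustration.

```python
from typing import Any, Callable, Iterator, List

MAX_BATCH = 10  # SQS accepts at most 10 entries per batch call


def chunked(items: List[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_in_batches(sqs: Any, queue_url: str, bodies: List[str]) -> int:
    """Send all bodies with SendMessageBatch and return the number of API requests made."""
    requests = 0
    for batch in chunked(bodies):
        entries = [{"Id": str(i), "MessageBody": body} for i, body in enumerate(batch)]
        response = sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        requests += 1
        # A batch call can partially fail; real code would retry the failed entries.
        if response.get("Failed"):
            raise RuntimeError(f"{len(response['Failed'])} messages failed to send")
    return requests


def receive_and_delete(sqs: Any, queue_url: str, handle: Callable[[str], None]) -> int:
    """Long-poll once, process up to 10 messages, then delete them in one batch call."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=MAX_BATCH,  # receive up to 10 messages per request
        WaitTimeSeconds=20,             # long polling: wait up to 20s instead of returning empty
    )
    messages = response.get("Messages", [])
    if not messages:
        return 0
    for message in messages:
        handle(message["Body"])
    sqs.delete_message_batch(
        QueueUrl=queue_url,
        Entries=[
            {"Id": str(i), "ReceiptHandle": message["ReceiptHandle"]}
            for i, message in enumerate(messages)
        ],
    )
    return len(messages)
```

With this shape, sending 25 messages takes 3 requests instead of 25, and a consumer on a quiet queue makes one request every 20 seconds rather than polling in a tight loop.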
Even after applying these approaches, we still wanted to reduce AWS costs further.
Solutions
So, we began exploring alternatives to SQS and identified two potential candidates: RabbitMQ and Kafka. To truly save costs, we would need to self-host the service. I know this isn’t a new idea, and I’m not claiming it’s a perfect solution.
I believe a good solution doesn’t have to be the most expensive one, and it should be based on context. As engineers, we don’t just focus on the technical side; we weigh the pros and cons of each paid or free tool, along with many other factors, to determine what works best for us at each stage. It ties into business goals. For example, if your business has good sales, the IT budget is not a concern, and your team is working towards something that brings more benefit to the business, then why bother saving on SQS while its cost is at a reasonable level? A self-hosted message queue can save you money, but it can also be a burden when your team is small and has other important things to prioritise.
Conversely, if your business’s main aim is to optimise costs, whether to reallocate capital to other critical areas or simply to run more cost-effectively during a recession, I don’t think it’s a good idea to keep spending a lot on SQS when we can achieve the same result with a cheaper solution. There’s a tradeoff in the added operational overhead, but in our case, it was an acceptable compromise.
That said, the same approach can be applied to other services, e.g. AWS RDS (Amazon Relational Database Service). But please do your own research and make a decision based on your own circumstances.
RabbitMQ and Kafka
When I first started looking at both tools, it was tough to tell which would be the best fit for us. At a glance, they seemed identical. But after some testing, I put together a quick comparison based on what I was looking for. These were basically the main things we cared about:
| Factors | RabbitMQ | Kafka |
| --- | --- | --- |
| The team’s experience with the tool | I have built and deployed RabbitMQ systems in the past. Other developers can pick it up fast. | None of us has experience with it, and the learning curve is steep as far as I could tell. |
| Best Used For | Web servers that need rapid request-response. It also shares loads between workers under high load (20K+ messages/second). RabbitMQ can also handle background jobs or long-running tasks like PDF conversion, file scanning, or image scaling. It fits our context. | Streaming from A to B without resorting to complex routing, but with maximum throughput. It’s also ideal for event sourcing, stream processing, and modelling changes to a system as a sequence of events. Kafka is also suitable for processing data in multi-stage pipelines. It matches our requirements less and seemed to be overkill. |
| Cost | We need a cluster of at least three nodes (3 EC2 instances) to ensure high availability (HA). We chose the r7g.large instance type. | At that time, the docs said an ideal HA Kafka cluster needed 8 nodes: 3 ZooKeeper nodes (KRaft mode was not ready then) and 5 brokers. I could actually have run a cluster of 3 nodes as well, with each node hosting both ZooKeeper and a broker. However, that seemed risky, and we didn’t have much experience with it. Nowadays, KRaft mode is available, allowing a cluster of at least 3 nodes, the same as RabbitMQ, but it needs a larger instance type. |
| Performance | 4K-10K messages per second on average. It can reach millions of messages per second, but that requires more brokers. By our estimates, this was enough for us. | 1 million messages per second on average |
For us, the most important factor driving the decision to adopt RabbitMQ was cost: it is cheaper to operate than Kafka. Apart from the factors above, there are many others (e.g. message handling, security, programming support, ease of use) that I don’t list here, although most of them are supported by both tools as far as I can see.
The migration from SQS to RabbitMQ took us nearly a month to complete. To set up the RabbitMQ cluster, as I mentioned, we provisioned three r7g.large EC2 instances, each with a 100GB EBS volume to store data. In the us-east-1 region, r7g.large costs $0.1071 USD per hour on-demand and S3 costs $0.023 per GB per month, so we calculated as below:
- 1 x EC2 instance node costs us:
(0.1071 * 24 * 365) + (100GB * 0.023) = $940.496 per year
- 3 x EC2 instance nodes:
940.496 * 3 = $2,821.48 per year (equivalent to $235.12 per month)
Result
We still rely on SQS to some extent because some services can only use it, which prevents us from migrating away from it completely. However, migrating part of the message queues from SQS to RabbitMQ was worth it.
We focused on migrating high-workload queues, which brought the SQS cost down to about $2,600 per month, plus $235.12 for self-hosting three RabbitMQ server nodes. That’s roughly $2,835.12 per month in total, compared to the original $6,000, which saves about $3,165 per month (roughly $37,979 per year), a cost reduction of just over 50%.
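If you want to redo this comparison with your own bill, the arithmetic is simple enough to script. The snippet below is only a sketch: the function name `monthly_savings` is made up for illustration, and the inputs are the monthly figures quoted above ($6,000 before, about $2,600 of remaining SQS cost, and $235.12 for the RabbitMQ nodes). It prints unrounded values, so they can differ by a few dollars from the rounded figures in the text.

```python
from typing import Tuple


def monthly_savings(before: float, sqs_after: float, self_hosted: float) -> Tuple[float, float, float]:
    """Return (new monthly total, monthly saving, saving as a percentage of the old bill)."""
    after = sqs_after + self_hosted
    saved = before - after
    return after, saved, saved / before * 100


# Figures from this post, in USD per month.
after, saved, pct = monthly_savings(before=6000.00, sqs_after=2600.00, self_hosted=235.12)

print(f"New monthly total: ${after:,.2f}")
print(f"Monthly saving:    ${saved:,.2f}")
print(f"Yearly saving:     ${saved * 12:,.2f}")
print(f"Reduction:         {pct:.1f}%")
```

Operational overhead is deliberately left out here, so treat the result as an upper bound on what you actually save.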
I haven’t factored in the operational overhead because it’s tough to measure, but in the two years since moving to RabbitMQ, we’ve rarely had to deal with it. Occasionally, we’ve run into a few minor issues that we could resolve on our own without much trouble. Overall, I don’t think the operational overhead is significant enough to account for.
Hope you enjoy reading the post, and please share it if you find it useful. Thank you.
Check my other blogs for AWS cost optimisation
- AWS EBS Cost Optimization: 10 Tips to Reduce Your AWS Bill
- Migrating Graphite from Python to Go for Grafana: How I Reduced Disk I/O and Cut AWS Costs
- Mitigating NXDOMAIN Attacks with AWS Route 53 Strategies
- Optimize AWS Data Transfer Costs with Nginx Gzip Compression
- Dkron – Distributed Cron System – Solution to replace traditional cron and optimise cloud cost

