SQS Introduction
Amazon SQS is a queue-based service that acts as middleware between producers and consumers.
The producer sends a message to the queue, and the consumer retrieves and processes it.
When the message is successfully processed, it is deleted from the queue.
Publishing Messages
There are two ways to publish messages:
Using AWS services
Using the AWS SDK through various programming languages.
Several AWS services can directly publish messages to SQS. Some of the most notable include:
Amazon SNS
Amazon S3
AWS Lambda
AWS Step Functions
Amazon EventBridge
AWS IoT
AWS SDK
The AWS SDK is a comprehensive collection of tools, libraries, and documentation that allows developers to interact with AWS services using their preferred programming languages.
It simplifies the integration of AWS services into applications by providing pre-built functions and classes for tasks such as sending data to S3, managing EC2 instances, interacting with SQS queues, and more.
The AWS SDK supports multiple programming languages. Below is a simple example in Node.js demonstrating how to publish a message to an SQS queue using the AWS SDK:
import { SendMessageCommand, SQSClient } from "@aws-sdk/client-sqs";

const client = new SQSClient({});
const SQS_QUEUE_URL = "queue_url";

export const main = async (sqsQueueUrl = SQS_QUEUE_URL) => {
  const command = new SendMessageCommand({
    QueueUrl: sqsQueueUrl,
    MessageAttributes: {
      Version: {
        DataType: "Number",
        StringValue: "1",
      },
    },
    MessageBody:
      "Information about current NY Times fiction bestseller for week of 12/11/2016.",
  });

  const response = await client.send(command);
  return response;
};
Now that we have covered the various methods and services that can publish messages to SQS, let’s start exploring the SQS service itself.
SQS Types
1. Standard Queues
At-least-once delivery
No message order guarantee
Nearly unlimited throughput
2. FIFO Queues (First-In-First-Out)
Exactly-once processing
Strict message ordering
Limited to 300 messages/second (without batching)
Standard Queue Message Duplication
For high availability, SQS stores copies of your messages on multiple servers.
Rarely, if one of those servers is unavailable when a message is deleted, its copy of the message remains on that server.
The next time you poll the queue, you might receive that same message again.
That's a duplication error.
Standard Queue Message Ordering
SQS standard queues do not guarantee the order in which messages are delivered to your consumers.
Solutions:
Implement Order Handling Logic:
Message Deduplication: If you receive duplicate messages, discard them (see the sketch after this list).
Message Sequencing: Store messages in a local database and process them in the order they were originally sent.
Use FIFO Queues.
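For the deduplication approach, a minimal consumer-side sketch in Node.js is shown below. It keeps processed message IDs in an in-memory Set purely for illustration; a real system would use a persistent store such as DynamoDB or Redis, and might key on a business identifier rather than the SQS MessageId.

import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from "@aws-sdk/client-sqs";

const client = new SQSClient({});
const QUEUE_URL = "queue_url"; // placeholder

// Illustration only: in production, store processed IDs durably
// (e.g., DynamoDB or Redis), not in process memory.
const processedIds = new Set();

export const pollOnce = async () => {
  const { Messages = [] } = await client.send(
    new ReceiveMessageCommand({ QueueUrl: QUEUE_URL, MaxNumberOfMessages: 10 })
  );

  for (const message of Messages) {
    // Standard queues may deliver the same message more than once;
    // skip work we have already done and just delete the extra copy.
    if (!processedIds.has(message.MessageId)) {
      // ... process message.Body here ...
      processedIds.add(message.MessageId);
    }

    // Delete after successful handling so SQS stops redelivering it.
    await client.send(
      new DeleteMessageCommand({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle,
      })
    );
  }
};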
Standard Queue Concurrency
Concurrency in SQS refers to the ability to process multiple messages simultaneously.
This is achieved by having multiple consumers (often Lambda functions) working independently on different messages from the queue.
Carefully managing concurrency is crucial to optimizing processing speed while avoiding issues like message duplication or exceeding resource limits.
Key SQS Trigger Settings for Lambda (a configuration sketch follows this list):
Batch Size:
Controls the number of messages Lambda receives per poll.
Higher values increase throughput but can overload consumers.
Batch Window:
Sets a time limit for message accumulation.
If Batch Size is not reached within the window, messages are still processed.
Maximum Concurrency:
Crucially limits the number of concurrent Lambda invocations triggered by the SQS queue.
Overly high values can cause:
Message contention: Multiple Lambdas compete for the same messages.
Exceeding Lambda concurrency limits: Leading to function throttling.
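To make these settings concrete, here is a hedged sketch that configures an SQS trigger (event source mapping) with the AWS SDK for JavaScript; the function name, queue ARN, and numeric values are placeholders to adjust for your workload.

import { LambdaClient, CreateEventSourceMappingCommand } from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});

// Placeholder identifiers -- substitute your own function and queue.
const FUNCTION_NAME = "my-consumer-function";
const QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:my-queue";

export const createSqsTrigger = async () =>
  lambda.send(
    new CreateEventSourceMappingCommand({
      FunctionName: FUNCTION_NAME,
      EventSourceArn: QUEUE_ARN,
      BatchSize: 10,                             // messages Lambda receives per poll
      MaximumBatchingWindowInSeconds: 5,         // batch window: wait up to 5s to fill a batch
      ScalingConfig: { MaximumConcurrency: 20 }, // cap concurrent invocations for this queue
    })
  );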
Standard Queue Use Cases
Ideal for scenarios where high throughput is crucial but strict message order is not.
Decoupling: Separate user actions from time-consuming background tasks (e.g., media processing).
Workload Distribution: Distribute tasks across multiple workers (e.g., credit card validation).
Batching: Group messages for later processing (e.g., database updates).
Note: Applications must be designed to handle potential message duplication and out-of-order delivery.
Standard Queue Limitations
In-flight messages: Limited to 120,000. Exceeding this leads to errors.
Delay Queue: Minimum 0s, maximum 15-minute delay.
Long Polling: Maximum 20-second wait time.
Message Backlog: Unlimited storage for total messages.
Key takeaway: Manage in-flight messages to avoid exceeding the quota.
FIFO Message Duplication
SQS FIFO prevents duplicate messages from entering the queue: a message sent within the 5-minute deduplication interval that matches an already accepted message is discarded.
Methods:
Content-based deduplication:
SQS hashes message content (not attributes) to identify duplicates.
Explicit deduplication ID:
The producer assigns a unique ID to each message (see the sending sketch below).
Deduplication Scopes:
Queue Level: Applies deduplication across all messages in the queue.
Message Group Level: Applies deduplication only within the same message group (default for high-throughput queues).
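To illustrate explicit deduplication, here is a hedged FIFO sending sketch; the queue URL and the order fields (customerId, eventId) are placeholder assumptions. With content-based deduplication enabled on the queue, MessageDeduplicationId could be omitted.

import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const client = new SQSClient({});
const FIFO_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo"; // placeholder

export const sendOrderEvent = async (order) =>
  client.send(
    new SendMessageCommand({
      QueueUrl: FIFO_QUEUE_URL,
      MessageBody: JSON.stringify(order),
      MessageGroupId: order.customerId,      // ordering is preserved within this group
      MessageDeduplicationId: order.eventId, // explicit deduplication ID from the producer
    })
  );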
SQS FIFO Throughput
Batching: Up to 3,000 messages/second (300 API calls with 10 messages/call).
No Batching: Up to 300 API calls/second.
High Throughput Mode: Can be enabled for significantly higher throughput; per-region quotas apply (see the AWS service quotas documentation).
Throughput Limits: Can be configured at both queue and message group levels.
Lambda and FIFO Concurrency
Single Group, Single Instance: Only one Lambda instance processes messages within the same message group at a time, ensuring order.
Multiple Groups, Concurrent Processing: Lambda can process messages from different groups concurrently, maximizing throughput.
Example:
Messages from the same group are processed sequentially.
Messages from different groups are processed in parallel by different Lambda instances.
SQS FIFO Use Cases
Financial Transactions: Order processing, payment processing.
Data Pipelines: Event streaming, data integration.
Messaging/Chat: Chat applications and messaging platforms.
Workflow Orchestration: Task scheduling, process automation.
Auditing/Logging: Audit trails, log aggregation.
FIFO queues ensure order is maintained, crucial for these scenarios.
SQS FIFO Limitations
Lower Throughput: FIFO queues have a lower throughput than standard queues, especially without batching.
Higher Latency: The strict ordering and deduplication mechanisms can introduce slight latency compared to standard queues.
Message Group Dependency: If a message group is stalled (e.g., due to consumer failure), subsequent messages within that group will also be blocked.
Limited Concurrency: While concurrency is possible across different message groups, processing within a single group is inherently sequential.
SQS Security
SQS employs a multi-layered security approach to protect your data:
Encryption:
In-transit: All communication between your application and SQS is encrypted using HTTPS.
At-rest: You can encrypt messages at rest using AWS Key Management Service (KMS) keys. This provides an additional layer of security for your sensitive data stored within SQS.
Client-side encryption: For maximum control, you can encrypt messages before sending them to SQS and decrypt them within your application.
Access Control:
IAM Policies: Control access to SQS resources based on user, role, and group permissions. Define which actions are allowed on specific queues.
SQS Access Policies: Control access from other AWS accounts or services. Essential for cross-account communication and integration with other AWS services.
By implementing these security measures, you can ensure that your SQS queues and messages are protected from unauthorized access and data breaches.
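As one illustration, the sketch below attaches a queue access policy that allows a second (placeholder) account to send messages; the account IDs, queue URL, and ARN are assumptions, and the statements you actually need depend on your integration.

import { SQSClient, SetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const client = new SQSClient({});
const QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111111111111/my-queue"; // placeholder

// Example access policy: allow a second (placeholder) account to send messages.
const policy = {
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Principal: { AWS: "arn:aws:iam::222222222222:root" },
      Action: "sqs:SendMessage",
      Resource: "arn:aws:sqs:us-east-1:111111111111:my-queue",
    },
  ],
};

export const applyQueuePolicy = async () =>
  client.send(
    new SetQueueAttributesCommand({
      QueueUrl: QUEUE_URL,
      Attributes: { Policy: JSON.stringify(policy) },
    })
  );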
Message Visibility Timeout
Purpose: Prevents other consumers from processing a message while it's being handled by the current consumer.
Default: 30 seconds.
Functionality:
The consumer receives a message.
Visibility timeout begins.
Consumer processes the message.
The consumer deletes the message from the queue.
If timeout expires: The message becomes visible again and can be processed by another consumer.
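If processing is expected to take longer than the configured timeout, the consumer can extend it for the message it is holding. A minimal sketch using ChangeMessageVisibility follows; the queue URL and the 60-second value are placeholders.

import { SQSClient, ChangeMessageVisibilityCommand } from "@aws-sdk/client-sqs";

const client = new SQSClient({});
const QUEUE_URL = "queue_url"; // placeholder

// Give the current consumer another 60 seconds before the message
// becomes visible to other consumers again.
export const extendVisibility = async (receiptHandle) =>
  client.send(
    new ChangeMessageVisibilityCommand({
      QueueUrl: QUEUE_URL,
      ReceiptHandle: receiptHandle,
      VisibilityTimeout: 60,
    })
  );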
Dead Letter Queues (DLQs)
Purpose: Store messages that repeatedly fail processing, i.e., that are received more times than the source queue's configured maximum receive count.
Benefits: Isolates errors, aids debugging, and enables message recovery.
Key Considerations:
DLQ type must match the source queue (FIFO or Standard).
Set the appropriate retention period for the DLQ.
DLQs improve application reliability and error handling in SQS.
DLQ flow
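A DLQ is attached to a source queue through its redrive policy, which specifies the DLQ's ARN and the maximum number of receives before a message is moved. Below is a hedged configuration sketch; the queue URL, DLQ ARN, and maxReceiveCount value are placeholders.

import { SQSClient, SetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const client = new SQSClient({});
const SOURCE_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"; // placeholder
const DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:orders-dlq";                          // placeholder

export const attachDeadLetterQueue = async () =>
  client.send(
    new SetQueueAttributesCommand({
      QueueUrl: SOURCE_QUEUE_URL,
      Attributes: {
        // After 5 failed receives, SQS moves the message to the DLQ.
        RedrivePolicy: JSON.stringify({
          deadLetterTargetArn: DLQ_ARN,
          maxReceiveCount: "5",
        }),
      },
    })
  );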
When should you use a dead-letter queue?
Error Isolation:
When messages consistently fail to process due to issues like invalid data, system errors, or transient network problems.
DLQs isolate these problematic messages, preventing them from clogging the primary queue and disrupting the flow of other messages.
Debugging and Troubleshooting:
DLQs provide a centralized location for analyzing failed messages.
By examining the messages in the DLQ, you can quickly pinpoint the root cause of processing errors, such as invalid data formats, missing dependencies, or code bugs in your consumer applications.
Improved System Resilience:
By capturing and isolating failed messages, DLQs enhance the overall resilience of your message-driven systems.
They prevent cascading failures that can occur when a single message repeatedly fails to process, potentially impacting the processing of other messages in the queue.
Message Recovery:
DLQs enable you to recover from temporary processing failures.
After identifying and resolving the underlying issues, you can manually re-process the messages from the DLQ, ensuring that no data is lost due to processing errors.
When should you NOT use a dead-letter queue?
Transient Errors:
If message processing failures are primarily due to transient issues like temporary network disruptions or brief service outages, implementing retry mechanisms within your consumer application might be more suitable than relying solely on a DLQ.
High-Volume, Low-Impact Failures:
For applications where occasional processing failures have minimal impact, the overhead of managing a DLQ might outweigh the benefits.
For example, if a small percentage of messages fail to process due to minor data inconsistencies, it might be acceptable to simply discard these messages.
When Message Order is Critical and Retries are Essential:
In scenarios where maintaining strict message order is paramount (e.g., financial transactions) and retries are crucial for successful processing, moving failed messages to a DLQ could disrupt the expected message order and lead to unexpected behavior.
Delay Queue
Purpose: Intentionally delay the delivery of messages to consumers.
Functionality:
Delay messages for up to 15 minutes.
Set a default delay at the queue level.
Override the default delay for individual messages using the DelaySeconds parameter.
Key takeaway: Delay queues are useful for scheduling tasks, throttling message consumption, and implementing backoff strategies.
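Below is a hedged sketch of both options: a queue-level default delay set at creation time and a per-message override with DelaySeconds. The queue name, body, and values are placeholders; note that FIFO queues only support the queue-level delay.

import { SQSClient, CreateQueueCommand, SendMessageCommand } from "@aws-sdk/client-sqs";

const client = new SQSClient({});

export const createDelayQueue = async () => {
  // Default delay of 60 seconds applied to every message in the queue.
  const { QueueUrl } = await client.send(
    new CreateQueueCommand({
      QueueName: "delayed-tasks", // placeholder name
      Attributes: { DelaySeconds: "60" },
    })
  );

  // Override the queue default for this one message: deliver after 5 minutes.
  await client.send(
    new SendMessageCommand({
      QueueUrl,
      MessageBody: "run-report",
      DelaySeconds: 300,
    })
  );
};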
SQS Long Polling
Purpose: Reduces API calls and latency by having the consumer wait for messages to become available.
Mechanism: Consumer waits for a specified time (up to 20 seconds) for messages to arrive.
Benefits:
Fewer API calls.
Improved efficiency and reduced latency.
Implementation:
Enabled at the queue level or using the ReceiveMessageWaitTimeSeconds parameter.
Key takeaway: Long polling is generally preferred over short polling for optimal SQS performance.
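Below is a hedged sketch of both ways to enable it: at the queue level via ReceiveMessageWaitTimeSeconds, or per call via WaitTimeSeconds. The queue URL is a placeholder.

import { SQSClient, SetQueueAttributesCommand, ReceiveMessageCommand } from "@aws-sdk/client-sqs";

const client = new SQSClient({});
const QUEUE_URL = "queue_url"; // placeholder

// Option 1: enable long polling for every consumer at the queue level.
export const enableQueueLongPolling = () =>
  client.send(
    new SetQueueAttributesCommand({
      QueueUrl: QUEUE_URL,
      Attributes: { ReceiveMessageWaitTimeSeconds: "20" },
    })
  );

// Option 2: request long polling for a single receive call.
export const receiveWithLongPolling = () =>
  client.send(
    new ReceiveMessageCommand({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 20, // wait up to 20 seconds for messages to arrive
    })
  );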
SQS Extended Client
Handles large messages: Allows sending and receiving messages exceeding the 256 KB SQS limit (up to 2 GB).
S3 Integration: Stores message payloads in S3 and sends a reference to the queue.
Improved Scalability: Efficiently handles large volumes of data by offloading storage to S3.
Enhanced Performance: Potentially improves message processing speed by reducing network data transfer.
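The official Extended Client library targets Java (community ports exist for other languages); to make the pattern concrete, here is a rough, hand-rolled Node.js sketch of the same S3-offloading idea. The bucket name and queue URL are placeholders, and the real library adds details such as payload size checks and cleanup.

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
import { randomUUID } from "node:crypto";

const s3 = new S3Client({});
const sqs = new SQSClient({});
const BUCKET = "my-large-payload-bucket"; // placeholder
const QUEUE_URL = "queue_url";            // placeholder

export const sendLargeMessage = async (largePayload) => {
  // 1. Store the oversized payload in S3.
  const key = `payloads/${randomUUID()}`;
  await s3.send(new PutObjectCommand({ Bucket: BUCKET, Key: key, Body: largePayload }));

  // 2. Send only a small reference to the payload through SQS.
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: QUEUE_URL,
      MessageBody: JSON.stringify({ s3Bucket: BUCKET, s3Key: key }),
    })
  );
};

A consumer would read the reference, fetch the object from S3, and delete both the object and the SQS message once processing succeeds.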
Essential SQS APIs
Core Operations:
CreateQueue: Create a new SQS queue, specifying attributes like retention period.
DeleteQueue: Permanently delete an SQS queue.
SendMessage: Send messages to an SQS queue.
ReceiveMessage: Retrieve messages from an SQS queue.
DeleteMessage: Remove messages from an SQS queue after processing.
Queue Management:
PurgeQueue: Delete all messages within a queue.
Message Control:
MaxNumberOfMessages: Configure the maximum number of messages to receive per ReceiveMessage request (default 1, max 10).
ReceiveMessageWaitTimeSeconds: Implement long polling for improved efficiency.
ChangeMessageVisibility: Adjust the visibility timeout for a specific message.
Batch Operations:
Batch APIs: Utilize batch operations for SendMessage, DeleteMessage, and ChangeMessageVisibility to significantly reduce API calls and improve cost-effectiveness.
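As an example, here is a hedged SendMessageBatch sketch; the entry IDs only need to be unique within the request, and the queue URL and bodies are placeholders. DeleteMessageBatch and ChangeMessageVisibilityBatch follow the same entry-list pattern.

import { SQSClient, SendMessageBatchCommand } from "@aws-sdk/client-sqs";

const client = new SQSClient({});
const QUEUE_URL = "queue_url"; // placeholder

// One API call carries up to 10 messages, reducing request count and cost.
export const sendBatch = () =>
  client.send(
    new SendMessageBatchCommand({
      QueueUrl: QUEUE_URL,
      Entries: [
        { Id: "msg-1", MessageBody: "first task" },
        { Id: "msg-2", MessageBody: "second task" },
        { Id: "msg-3", MessageBody: "third task" },
      ],
    })
  );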
Amazon SQS Best Practices
Message Handling
To optimize costs and performance, SQS messages should be kept under 256KB and processed using batch operations whenever possible.
Implement proper visibility timeout settings based on your processing needs. Typically, start with 30 seconds and adjust based on actual processing time.
Queue Configuration & Security
Standard queues offer high throughput with at-least-once delivery, while FIFO queues guarantee exactly-once processing with ordered delivery.
Security should be implemented using IAM roles with least privilege principle, and sensitive data should be encrypted before transmission.
Monitoring & Performance
Implement CloudWatch alarms to monitor key metrics like queue depth and message age.
Set up alerts for abnormal patterns and ensure proper scaling of consumers based on queue depth.
Long polling should be used to reduce API calls and costs. Key monitoring points:
Track ApproximateNumberOfMessagesVisible metric
Monitor DLQ for failed message patterns
Set up alerts for processing delays
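As an illustration of the first monitoring point, the sketch below creates a CloudWatch alarm on queue depth; the alarm name, queue name, threshold, and evaluation settings are placeholder assumptions to tune for your workload.

import { CloudWatchClient, PutMetricAlarmCommand } from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({});

export const createQueueDepthAlarm = () =>
  cloudwatch.send(
    new PutMetricAlarmCommand({
      AlarmName: "my-queue-backlog-too-deep",                 // placeholder
      Namespace: "AWS/SQS",
      MetricName: "ApproximateNumberOfMessagesVisible",
      Dimensions: [{ Name: "QueueName", Value: "my-queue" }], // placeholder queue name
      Statistic: "Average",
      Period: 300,          // evaluate over 5-minute windows
      EvaluationPeriods: 2, // two consecutive breaching windows trigger the alarm
      Threshold: 1000,      // alert when the visible backlog exceeds ~1,000 messages
      ComparisonOperator: "GreaterThanThreshold",
    })
  );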
Error Handling Strategy
Implement robust error handling with retry mechanisms and proper exception logging.
Use exponential backoff for retries and ensure failed messages are properly routed to DLQ after maximum retry attempts.
Consider implementing idempotent consumers to handle duplicate messages safely.
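To make the retry and idempotency advice concrete, here is a small hedged sketch; processMessage, isAlreadyHandled, and markHandled are hypothetical application functions, and in production the queue's redrive policy (maxReceiveCount) is usually what routes exhausted messages to the DLQ rather than hand-written code.

// Hypothetical application-level helpers (not part of the AWS SDK).
const isAlreadyHandled = async (messageId) => false;  // e.g., look up in DynamoDB
const markHandled = async (messageId) => {};          // e.g., write to DynamoDB
const processMessage = async (body) => {};            // your business logic

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

export const handleWithRetries = async (message, maxAttempts = 3) => {
  // Idempotent consumer: skip work that has already been completed.
  if (await isAlreadyHandled(message.MessageId)) return;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await processMessage(message.Body);
      await markHandled(message.MessageId);
      return;
    } catch (error) {
      console.error(`Attempt ${attempt} failed`, error);
      if (attempt === maxAttempts) throw error; // let SQS redeliver / redrive to the DLQ
      await sleep(2 ** attempt * 1000);         // exponential backoff: 2s, 4s, ...
    }
  }
};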