Which services should I use to build a decoupled architecture?
Written by: Jorge Negron
The idea of decoupling is usually introduced in the context of application development and messaging services, so let’s start there.
What does decoupling an application mean?
A decoupled application allows each component to perform its tasks independently - it allows components to remain completely autonomous and unaware of each other. A change in one component shouldn't require a change anywhere else. More importantly, a failure in one layer of the application should not propagate to other layers but remain isolated where the component failure occurred.
Consider the diagram of a tightly coupled application as shown. We’re keeping it simple for the sake of illustration. Notice the receive layer invokes the transcode layer which then invokes the publish and notify layer. This is a simple three layer image processing application.
In this type of implementation, a failure in one layer can affect the layer that follows it and impair the functioning of the entire application. We can decouple the layers by introducing a messaging mechanism through which the different layers send and receive messages. Ideally the messaging model is “one to one”: one message is generated and put into a queue, and that one message is then picked up by the next processing layer.
The Amazon Simple Queue Service, or SQS, lends itself to this type of implementation. It behaves like an email system for the different application layers and retains each message it receives even if no consumers are listening to pick up the request for processing.
In this particular implementation we are going to use Amazon SQS and implement each of the application layers as a fleet of EC2 instances in an Auto Scaling group.
By placing SQS queues between the processing layers we achieve a loose coupling of the systems, which now exchange messages in order to transfer requests between layers. The connection between the systems is asynchronous, which allows you to increase or decrease the number of EC2 instances that receive and process the messages in parallel. You can also configure Auto Scaling to grow and shrink each application layer’s fleet based on usage and demand.
If an EC2 instance fails to process a message, the message is retained in the corresponding queue. It is picked up again by the same EC2 instance once it is restored, or by another EC2 instance in the same Auto Scaling group for that layer.
In this case Amazon SQS behaves as the equivalent of an email system for the different application layers. In general for SQS the equivalent of a mailbox is called a queue. Applications that put messages into a queue are called producers and applications picking up messages are called consumers. This is the vocabulary used by the AWS documentation.
The general flow of a message in a queue is as follows:
- An application produces a message and sends it to a queue
- A consumer application is usually listening to, or “polling”, the queue for new messages and picks a message up in order to process it.
- When a message is picked up by a consumer, a “visibility timeout” starts and the message becomes invisible to all other polling consumers, so that other consumers do not process the same message while the first consumer is working on it.
- The invisibility of the message is maintained until the consumer is finished processing the message and issues a DeleteMessage call to the queue in order to delete the message.
- If for any reason the message is not processed successfully, and the DeleteMessage call is not issued, the visibility timeout for the message expires. The message becomes visible and available again to be picked up by another consumer or by the restored consumer that failed to complete the initial processing.
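The flow above can be sketched as a small consumer function. This is a minimal sketch, not a definitive implementation: it assumes `client` is an object exposing `receive_message` and `delete_message` with the keyword arguments of the boto3 SQS client, and `process_batch` and `handler` are hypothetical names introduced here for illustration.

```python
from typing import Any, Callable


def process_batch(client: Any, queue_url: str, handler: Callable[[str], None]) -> int:
    """Receive one batch of messages and delete each one the handler completes.

    `client` is assumed to behave like a boto3 SQS client.
    Returns the number of messages that were processed and deleted.
    """
    response = client.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
    deleted = 0
    # The "Messages" key is absent when the queue returned nothing.
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            # No DeleteMessage call is issued, so the message becomes visible
            # again once its visibility timeout expires.
            continue
        client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        deleted += 1
    return deleted
```

Skipping the delete on failure is the important design choice: the queue, not the consumer, is responsible for making the message available for another attempt.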
There is a type of queue known as a delay queue, where the delivery of each new message to consumers is postponed for a number of seconds after the message is sent to the queue.
The predefined delay behaves much like a visibility timeout in that the message is not returned when a ReceiveMessage request is made. Once the delay has elapsed, the message and the queue behave as described before: once the message is picked up, the visibility timeout becomes active. The minimum and default delay for a message is 0 seconds and the maximum is 15 minutes.
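As a rough illustration of the delay limits, the sketch below builds the attribute map for a delay queue and models when a delayed message becomes receivable. `DelaySeconds` is the SQS queue attribute name; the helper names `delay_queue_attributes` and `is_deliverable` are hypothetical, and timestamps are plain seconds chosen for the example.

```python
from typing import Dict

MAX_DELAY_SECONDS = 15 * 60  # the 15-minute maximum, expressed in seconds


def delay_queue_attributes(delay_seconds: int = 0) -> Dict[str, str]:
    """Build a queue attribute map for a delay queue (values are strings)."""
    if not 0 <= delay_seconds <= MAX_DELAY_SECONDS:
        raise ValueError(f"delay must be between 0 and {MAX_DELAY_SECONDS} seconds")
    return {"DelaySeconds": str(delay_seconds)}


def is_deliverable(sent_at: float, delay_seconds: int, now: float) -> bool:
    """A delayed message is only returned once its delay has elapsed."""
    return now >= sent_at + delay_seconds
```

A map like this would typically be passed as the `Attributes` argument when creating the queue.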
Details about SQS:
Amazon SQS is usually the first service that is mentioned in a conversation about application de-coupling. Some details about Amazon SQS are:
1) The visibility timeout for a message in a queue is 30 seconds by default. The minimum is 0 seconds and the maximum is 12 hours. If processing a message takes more than 30 seconds, increase the visibility timeout to match your application’s processing time. This gives the consumer enough time to finish and makes it less likely that another consumer picks up the same message while it is still being processed. The visibility timeout can be set for the entire queue or for an individual message. For an individual message, use the ChangeMessageVisibility call with a VisibilityTimeout parameter in seconds. The new value applies only to that receipt of the message; it does not change the queue’s timeout or the timeout applied when the message is received again later. Please note that if your consumer needs longer than 12 hours to process a message, you should consider using AWS Step Functions instead of SQS.
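A consumer that discovers a message needs more time could extend the timeout for that one message along these lines. This is a sketch under assumptions: `client` is expected to expose `change_message_visibility` with the keyword arguments of the boto3 SQS client, and `extend_visibility` is a hypothetical helper name.

```python
from typing import Any

MAX_VISIBILITY_SECONDS = 12 * 60 * 60  # the 12-hour maximum, in seconds


def extend_visibility(client: Any, queue_url: str, receipt_handle: str, seconds: int) -> None:
    """Set a new visibility timeout for a single received message."""
    if not 0 <= seconds <= MAX_VISIBILITY_SECONDS:
        raise ValueError(
            f"visibility timeout must be between 0 and {MAX_VISIBILITY_SECONDS} seconds"
        )
    client.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=seconds,
    )
```

Validating the range locally avoids a round trip to the service for values that can never be accepted.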
2) There are two ways in which a consumer can listen for messages in a queue and they are called short polling and long polling.
A) By default, queues use short polling. Using short polling a consumer issues the ReceiveMessage request to find messages available and SQS sends a response even if the request found no messages available.
Short polling takes place when the WaitTimeSeconds parameter of a ReceiveMessage request is set to 0 and this can happen in two ways:
First, the ReceiveMessage call sets the WaitTimeSeconds parameter to 0.
Second, the ReceiveMessage call doesn’t set the WaitTimeSeconds parameter, but the queue attribute ReceiveMessageWaitTimeSeconds is set to 0.
B) To use long polling the consumer issues the ReceiveMessage request with a WaitTimeSeconds parameter greater than 0 and less than or equal to 20 seconds. Long polling can also happen if the ReceiveMessageWaitTimeSeconds queue attribute is set to a number greater than 0.
In this case SQS sends a response after it collects at least one message available and up to the maximum number of messages specified in the request by the MaxNumberOfMessages parameter. An empty response only happens when the specified polling wait time expires.
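The rules for choosing between short and long polling can be summarized in a few lines. This sketch reflects the description above, in which a WaitTimeSeconds value on the request takes precedence over the queue's ReceiveMessageWaitTimeSeconds attribute; the function names are hypothetical.

```python
from typing import Optional

MAX_WAIT_SECONDS = 20  # upper bound for long polling


def effective_wait_seconds(request_wait: Optional[int], queue_wait: int = 0) -> int:
    """Return the wait time that applies to a single ReceiveMessage request.

    request_wait: WaitTimeSeconds on the request, or None if it was not set.
    queue_wait:   the queue's ReceiveMessageWaitTimeSeconds attribute.
    """
    wait = queue_wait if request_wait is None else request_wait
    if not 0 <= wait <= MAX_WAIT_SECONDS:
        raise ValueError(f"wait time must be between 0 and {MAX_WAIT_SECONDS} seconds")
    return wait


def polling_mode(request_wait: Optional[int], queue_wait: int = 0) -> str:
    """Classify a request as 'short' or 'long' polling."""
    return "long" if effective_wait_seconds(request_wait, queue_wait) > 0 else "short"
```

For example, `polling_mode(None, 10)` is long polling because the queue attribute applies, while `polling_mode(0, 10)` is short polling because the request explicitly asks for a wait of 0.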
3) The minimum message size is 1 byte and the maximum message size by default is 256 KiB. The Amazon SQS Extended Client Library for Java enables the processing of large messages, up to 2 GB, by using Amazon S3 together with SQS messaging.
If you need to deal with messages larger than 256 KiB, the SQS Extended Client Library lets you specify whether messages are always stored in Amazon S3 or only when the message size exceeds the 256 KiB limit.
You can also send a message that contains a link to an object stored in an S3 bucket. The consumer can then get the message object from Amazon S3 and delete it from the S3 bucket if needed. Once again, the maximum message size when using the SQS Extended Client Library for Java is 2 GB.
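The idea of sending a pointer instead of the payload can be illustrated without any AWS calls. The sketch below is not the Extended Client Library and does not reproduce its message format: a plain dictionary stands in for the S3 bucket, and the JSON envelope with `inline` and `s3_key` fields is a hypothetical format invented for this example.

```python
import json
import uuid
from typing import Dict

SQS_LIMIT_BYTES = 256 * 1024  # the 256 KiB limit described above


def to_sqs_body(payload: str, bucket: Dict[str, bytes], always_offload: bool = False) -> str:
    """Return a message body: the payload itself, or a pointer to it in the bucket."""
    inline = json.dumps({"inline": payload})
    if not always_offload and len(inline.encode("utf-8")) <= SQS_LIMIT_BYTES:
        return inline
    # Too large (or offloading is forced): store the payload and send a pointer.
    key = str(uuid.uuid4())
    bucket[key] = payload.encode("utf-8")
    return json.dumps({"s3_key": key})


def from_sqs_body(body: str, bucket: Dict[str, bytes], delete_object: bool = True) -> str:
    """Recover the payload, fetching (and optionally deleting) the stored object."""
    envelope = json.loads(body)
    if "inline" in envelope:
        return envelope["inline"]
    key = envelope["s3_key"]
    data = bucket.pop(key) if delete_object else bucket[key]
    return data.decode("utf-8")
```

The `always_offload` flag mirrors the choice described above between storing every message in S3 and storing only those over the size limit.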
4) Standard SQS queues make a best effort to preserve message order but do not guarantee it, and they guarantee only at-least-once delivery, so a message can occasionally be delivered more than once. A standard queue can have a maximum of 120,000 in-flight messages, meaning messages that have been received from the queue by a consumer but not yet deleted from it. If you use short polling, this quota causes your consumer to get an OverLimit error if it tries to receive a message while that many messages are in flight. If you use long polling, SQS returns no error messages. You should always delete messages from the queue after they’re processed in order to avoid breaching this quota.
To guarantee message order and exactly-once processing you need to use a FIFO queue, which is a different type of queue from the standard queue. FIFO queues offer lower throughput than standard queues, which makes sense given the mechanisms needed to maintain first-in-first-out order and exactly-once processing. FIFO queues can have a maximum of 20,000 in-flight messages, again meaning messages that have been received from the queue by a consumer but not yet deleted from it. Please keep the performance comparison between standard and FIFO queues in mind when designing your applications. It’s also a common data point tested during exams.
The difference between short polling and long polling is also an important detail to remember for exams and certification.
5) Next up, we already discussed what happens when a message is not processed successfully: the application fails during processing, the consumer never issues the DeleteMessage call, and the visibility timeout for the message expires, making the message available to consumers once again.
This situation assumes that the failure to process the message was rooted in some form of compute malfunction, which can be remedied through CloudWatch alarms and automated remediation in general.
There’s a second possibility, and that is when the message itself is malformed or otherwise corrupted. This can cause a never-ending cycle: the message is consumed by a ReceiveMessage request, it cannot be processed because it’s malformed, and the consumer application never issues the DeleteMessage call. The message then becomes visible again and the cycle repeats.
To guard against this possibility of message corruption and endless attempts to process it, you can define a “dead letter” queue that captures messages that cannot be processed. A message is moved there once it has been delivered for processing the maximum number of times defined by the maxReceiveCount for the queue. Every time a message is picked up by a ReceiveMessage request, its receive count is incremented by one.
Reaching this predefined limit removes the message from normal circulation and places it into the dead-letter queue, where you can examine why it cannot be processed. Once the issue has been repaired you can move the message back to the queue that delivered it using the dead-letter queue redrive capability. Please note that dead-letter queues can break the order of messages in FIFO queues. The redrive policy, set on the source queue, names the dead-letter queue and the maxReceiveCount that governs when messages are moved to it. The redrive allow policy, set on the dead-letter queue, defines which source queues are allowed to use it.
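The poison-message cycle and the way a receive limit ends it can be simulated in memory. This is a toy model rather than SQS itself: `Message`, `drain`, and the in-memory queues are hypothetical, and a failed message is simply appended back to the source queue to represent its visibility timeout expiring.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Message:
    body: str
    receive_count: int = 0


def drain(
    source: Deque[Message],
    dead_letters: List[Message],
    handler: Callable[[str], object],
    max_receive_count: int,
) -> List[str]:
    """Process every message, moving repeat failures to the dead-letter list."""
    processed: List[str] = []
    while source:
        message = source.popleft()
        if message.receive_count >= max_receive_count:
            # The limit has been reached: take the message out of circulation.
            dead_letters.append(message)
            continue
        message.receive_count += 1  # each receive increments the count
        try:
            handler(message.body)
        except Exception:
            # No delete is issued, so the message becomes visible again.
            source.append(message)
            continue
        processed.append(message.body)  # success corresponds to DeleteMessage
    return processed
```

With `handler=int`, a `max_receive_count` of 3, and the bodies `"1"`, `"oops"`, `"2"`, the two numeric messages are processed and `"oops"` lands in the dead-letter list after three failed attempts instead of cycling forever.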
As such, it is important that dead-letter queues be monitored carefully and that arriving messages get examined as soon as possible, either by automated functionality such as a Lambda function or by a person.
This requires a notification mechanism, also in the form of a messaging service. The Amazon Simple Notification Service, or SNS, is commonly used in combination with SQS to dispatch notifications, triggering automatic remediation and human intervention via push notifications at the same time.
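One way to connect the two services is a periodic check that publishes a notification whenever the dead-letter queue is not empty. This is a sketch under assumptions: `sqs` and `sns` are expected to behave like the boto3 SQS and SNS clients (`get_queue_attributes` and `publish`), and `alert_on_dead_letters`, the subject line, and the message text are hypothetical.

```python
from typing import Any


def alert_on_dead_letters(sqs: Any, sns: Any, dlq_url: str, topic_arn: str) -> int:
    """Publish an SNS notification if the dead-letter queue holds any messages.

    Returns the approximate number of messages found in the queue.
    """
    response = sqs.get_queue_attributes(
        QueueUrl=dlq_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # Attribute values are returned as strings.
    depth = int(response["Attributes"]["ApproximateNumberOfMessages"])
    if depth > 0:
        sns.publish(
            TopicArn=topic_arn,
            Subject="Dead-letter queue is not empty",
            Message=f"{depth} message(s) are waiting in {dlq_url}",
        )
    return depth
```

Subscribers to the topic could include both an automated handler and an on-call address, which matches the combination of automatic remediation and human intervention described above.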
This course covers the core learning objective to meet the requirements of the 'Designing for disaster recovery & high availability in AWS - Level 2' skill
- Analyze the amount of resources required to implement a fault-tolerant architecture across multiple AWS Availability Zones
- Evaluate an effective AWS disaster recovery strategy to meet specific business requirements
- Understand SLA for AWS services to ensure the high availability of a given AWS solution
- Analyze which AWS services can be leveraged to implement a decoupled solution