AWS CloudFormation Operations
This section of the SysOps Administrator - Associate learning path introduces you to automation and optimization services relevant to the SOA-C02 exam. We will understand the service options available and learn how to apply these designs and solutions to meet specific design scenarios relevant to the exam.
- Understand how to decouple architecture using Amazon Simple Notification Service and the Simple Queue Service
- Learn how AWS CloudFormation can be used to optimize and speed up your deployments using Infrastructure as Code (IaC)
Hello again, and welcome to Cloud Academy's Advanced AWS CloudFormation course. Today we're going to be talking about stack life cycles and how we can detect the end of, or failure of, a stack creation or update.
Now before we get into the way that we're going to technically accomplish these things, let's quickly look at a rubric for grading the business value or quality of our DevOps maturity in our company.
So this slide helps to explain to non-technical or semi-technical users the DevOps level that they're at, and how far they have to go. So today, we're going to be working on some advanced and intermediate techniques.
The first and clearest benefit that we get from using CloudFormation and automating the invocation of the scripts is in a test system. When I automate my infrastructure stack with a declarative model in CloudFormation, my test systems will always be current, because my continuous integration or continuous deployment system will be keeping them up to date and deploying new stacks on every commit.
We'll also be increasing the comfort level of you and the wider business with major cloud-wide changes, because it'll be so much easier for us to test things. Furthermore, we can make this sentence true: our cloud tells us automatically if things will work. This will include things like automatically failing an infrastructure stack declaration whenever we make a breaking change to the document in source control. Or it will tell us that our integration tests that run on top of a full stack work.
Beyond the English descriptions, there's a DevOps scorecard that we can use to quickly evaluate the maturity of our DevOps system on 17 different areas. The max score I can get is a 51 and the lowest I can get is a 0. So let's see which areas we are working on today for this lecture.
First and foremost, we're looking at creating fully automatic creation of our stacks. This is separate from one-click creation, like you may have done during the basic CloudFormation tutorial because now we'll be using a computer to actually do the invocation of a CloudFormation command for us.
When we're using declarative infrastructure tooling, such as CloudFormation, it is much simpler to produce identical dev, test, and prod environments. During our build deploy pipeline, we'll also be able to automatically deploy essentially anything that we want once we get to a reasonable level of sophistication using our CloudFormation templates.
We'll also be able to end-to-end test code. That is, we can run full production-like systems created from a CloudFormation template and validate that every single piece works in a prod-like environment. As part of our end-to-end testing code, we'll also have the ability to run full automatic architecture tests. That is, we will be able to run load testing, or failure testing, whenever we want because we can create the stacks on demand.
This is a diagram to illustrate the types of technologies that we should expect to be working with as we progress through these levels of maturity and we can see here that we'll be working with advanced CloudFormation. We'll be working on build and deploy scripts and automatic architecture creation.
The first topic we should be concerned with covering if we're looking at doing automation of any process is the life cycle of the process as it stands. Now, what do I mean by life cycle? I mean the phases that a stack will move through as we issue it commands. Since we're talking about issuing commands to our stacks, let's divide this up and work on each topic based on the three create, update, and delete actions that we have available to us.
In addition to dividing states by create, update, and delete commands, we further subdivide by OK and error, meaning statuses you should expect to see during the normal progression of a command versus when we see a failure condition. Create in progress represents the intermediate state between when we've initiated the command and when the stack is complete. Once we're inside create complete, all is well and this will not change unless we issue a further command.
While we are in the middle of create in progress, there is a potential for us to fall over into create failed. The stack will enter this state if any resource that makes up the stack fails or has a problem. After we have reached create failed, the default behavior is to proceed down the error path into rollback in progress. You can override this default and disable rollback so you can inspect the create failed stack, but we'll follow the default paths, since for each of these commands the defaults are the most explanatory. Rollback in progress means the stack is deleting all of the resources that it has created so far during the stack creation. After the rollback in progress step is done deleting, the stack will enter the rollback complete state. This means that the stack no longer exists in terms of resources, but it will still exist in the console for you to inspect. If there is a problem during the rollback in progress phase where CloudFormation is unable to delete a resource for any reason, the stack will enter the rollback failed state. Rollback failed is a pretty serious condition because it means that our stack was both unable to create itself and then unable to clean itself up after it failed. We may have problems if we try to delete the stack and also get a delete failed.
Moving over to update, we see that we have a relatively similar progression through both the OK and the error areas, but there are some differences. For update, we have the update in progress state, which is analogous to create in progress. It simply means that things are progressing all right and we haven't finished yet. We also have the update complete cleanup in progress state, which means that we have completed the part of the update that creates the new or updated resources, but we haven't finished deleting the old resources that needed to be fully replaced. That is, once we've entered the update complete cleanup in progress state, all of our new resources exist, and we're only deleting the old resources that we will no longer be using. If all goes well during this phase, the stack enters the update complete state.
Should something negative happen during either update in progress or update complete cleanup in progress, the stack will enter the update failed state. The update failed state says that something went wrong, and this is the state that your stack will stay in if you force it not to run a rollback. Update rollback in progress represents the state where the stack is in the middle of undoing all of the latest actions that happened during the update or the update cleanup. This includes actions like changing back properties and doing an inverse update, or recreating deleted properties.
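The create path described above can be written down as a small transition table. This is only an illustrative sketch of the lecture's diagram, not an official CloudFormation state machine; the status strings follow the names the API reports, and the helper name `is_valid_create_path` is made up for this example.

```python
# Create-path transitions as described in the lecture (an illustrative model,
# not an official state machine). Keys and values are stack status strings.
CREATE_TRANSITIONS = {
    "CREATE_IN_PROGRESS": {"CREATE_COMPLETE", "CREATE_FAILED"},
    "CREATE_FAILED": {"ROLLBACK_IN_PROGRESS"},  # default path; rollback can be disabled
    "ROLLBACK_IN_PROGRESS": {"ROLLBACK_COMPLETE", "ROLLBACK_FAILED"},
    "CREATE_COMPLETE": set(),    # stays here until a further command is issued
    "ROLLBACK_COMPLETE": set(),  # resources are gone, stack entry remains visible
    "ROLLBACK_FAILED": set(),    # creation failed and cleanup failed too
}


def is_valid_create_path(statuses):
    """Return True if each status can follow the previous one in the model."""
    if not statuses or statuses[0] != "CREATE_IN_PROGRESS":
        return False
    for current, following in zip(statuses, statuses[1:]):
        if following not in CREATE_TRANSITIONS.get(current, set()):
            return False
    return True
```

A table like this can be handy in pipeline tests, for example to assert that a recorded sequence of statuses is one the automation expects to handle.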
Once we finish an update rollback, the cleanup process will also trigger for the rollback, just like it would during a success condition for update. This does the same thing, deleting any new resources that were created during the OK portion of the cycle and removing them in favor of the old resources.
The update rollback complete state means that the failed stack has stabilized in the original state it was in before we ran the update operation. That is, if we are in the update rollback complete state, then the stack is not in the state that you attempted to update to, but in the previous one that you tried to update from. If there is a problem during the update rollback, the stack will enter the update rollback failed state. This is a relatively serious state because it means that nothing worked and the stack now must be force deleted.
The delete command is the simplest command and the easiest to reason about because it has the fewest states. In the OK path, there are only two states, delete in progress and delete complete. In the failure case, there's only delete failed. However, delete failed is the most serious condition of all of the possible state values because it means that we can't remove the stack.
If we can't remove the stack, we may need to submit a support ticket because the CloudFormation team must manually remove the stack from the CloudFormation console while leaving the resources associated with the stack in place because those resources are the ones that are making the stack fail during deletion.
If you do this by submitting a support ticket or commenting in the forums on Amazon, then you will be responsible for deleting the constituent resources that were included in the CloudFormation stack the team deletes. This should give you a good idea of all the possible states that you can see a stack in, either on the console or via the CLI, SDK, or HTTP APIs. We went over these commands and their different states because we will need to look at these states during any of our automation processes so we can understand when the stack has stabilized and is ready for testing.
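For automation, the states above usually collapse into three buckets: still working, finished as requested, or something went wrong. Here is a minimal sketch of that idea, assuming the status strings the CloudFormation API reports; the function name `classify` and the bucket labels are made up for illustration.

```python
# Statuses where the stack ended up exactly as the command requested.
SUCCESS_STATUSES = {"CREATE_COMPLETE", "UPDATE_COMPLETE", "DELETE_COMPLETE"}


def classify(status):
    """Bucket a stack status into 'in_progress', 'ok', or 'error'."""
    # Covers CREATE_IN_PROGRESS, ROLLBACK_IN_PROGRESS,
    # UPDATE_COMPLETE_CLEANUP_IN_PROGRESS, and so on.
    if status.endswith("_IN_PROGRESS"):
        return "in_progress"
    if status in SUCCESS_STATUSES:
        return "ok"
    # Everything else is some failure or rollback outcome. Note that
    # UPDATE_ROLLBACK_COMPLETE lands here: the stack is stable, but it is
    # in the state we updated from, not the one we asked for.
    return "error"
```

A pipeline would typically keep waiting on `in_progress`, start its tests on `ok`, and fail the build on `error`.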
Now we're going to move on to another diagram that illustrates how data flows around the CloudFormation system. This is an important diagram because we need to be able to tap into different locations that data is exposed to be able to detect stack completion and test it.
While this diagram might seem a little intimidating at first, we'll walk through each step so you can understand conceptually how data flows about the CloudFormation system. As we step through this slide, if you ever get stuck or lost, just hit the pause button and review the order in which the data flows through the system.
We should notice that we have a user in the bottom left-hand corner here. The user can be an actual human being sitting at the console or on a terminal, or it can be a machine. In addition to making note of where the user sits on this page, we should also make note of where the CloudFormation service itself is represented in the diagram. It is directly above the user.
All right, let's take a look at the order in which data flows around the system. First, we see that a system must request a change from CloudFormation via the Amazon Web Services command line interface, console, SDK, or HTTP APIs. The requested change should contain the command type, the template, the parameters, and the metadata, which is exactly what CloudFormation needs to understand how to build the stack. After we've composed the request, the request is issued to the CloudFormation service. CloudFormation will immediately acknowledge the attempt to create, update, or delete a stack, or it will throw an error. The errors at this phase are in a request-response format, and thus can only cover things that CloudFormation can detect immediately. The classes of errors that CloudFormation can detect upfront are rather limited. They include things like malformed JSON, schema validation failures, basic logic errors such as circular references, and the like.
If these preliminary checks pass during the request-response cycle, CloudFormation will acknowledge that the request has started. This does not mean that the stack is complete, but simply that the kickoff has begun. After CloudFormation does the error checking or acknowledges, we can expect a response to come back to our user or automation with a pass/fail. This will be in the standard Amazon Web Services response format, either with an error object in the message and a status code, or with passing information.
Because the stack has not been created yet and only kicked off, the pass operation here simply returns the Amazon resource name of the stack that has been kicked off. We use this ARN that the pass/fail validation returns to the user or server to do further operations like polling for status.
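The request and its immediate pass/fail response can be sketched roughly as follows. This assumes `client` is a boto3 CloudFormation client (for example from `boto3.client("cloudformation")`); the wrapper name `request_stack_creation` and its return shape are made up for this example. With boto3, upfront failures would normally surface as `botocore.exceptions.ClientError`; the sketch catches `Exception` only to stay free of third-party imports.

```python
def request_stack_creation(client, stack_name, template_body,
                           parameters=None, notification_arns=None):
    """Ask CloudFormation to start creating a stack.

    Returns (stack_id, None) when the request is acknowledged, or
    (None, error_message) when it is rejected upfront.
    """
    try:
        response = client.create_stack(
            StackName=stack_name,
            TemplateBody=template_body,
            Parameters=[
                {"ParameterKey": key, "ParameterValue": value}
                for key, value in (parameters or {}).items()
            ],
            # Optional SNS topic ARNs that should receive stack events.
            NotificationARNs=notification_arns or [],
        )
    except Exception as error:  # assumption: narrow this to ClientError in real code
        return None, str(error)
    # A pass only means the kickoff succeeded; the stack is not complete yet.
    return response["StackId"], None
```

The returned stack ID is the ARN that later status calls can use to refer to this specific stack.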
As soon as the pass/fail validation has come back, the user can begin polling for the status of the stack that is being created. This is just a normal HTTP call, which goes out to CloudFormation to describe the stacks that are currently running. The describe stacks call will return many different pieces of metadata, including the stack status. We can use the stack status for further logic, but let's continue down this flow path.
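A polling loop along these lines is one way to wait for a stack to settle. It is a minimal sketch, assuming `client` is a boto3 CloudFormation client whose `describe_stacks` call returns a `Stacks` list with a `StackStatus` field; the function name, the default intervals, and the injectable `sleep` parameter are choices made for this example.

```python
import time


def wait_until_settled(client, stack_id, poll_seconds=15,
                       timeout_seconds=3600, sleep=time.sleep):
    """Poll describe_stacks until the status is no longer *_IN_PROGRESS."""
    waited = 0
    while True:
        stacks = client.describe_stacks(StackName=stack_id)["Stacks"]
        status = stacks[0]["StackStatus"]
        if not status.endswith("_IN_PROGRESS"):
            return status  # e.g. CREATE_COMPLETE or ROLLBACK_COMPLETE
        if waited >= timeout_seconds:
            raise TimeoutError(f"{stack_id} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

The caller still has to decide whether the returned status counts as success. boto3 also ships built-in waiters such as `client.get_waiter("stack_create_complete")`, which may be a better fit when the default behavior is enough.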
Once a command has been acknowledged by CloudFormation, CloudFormation begins issuing resource commands to a new stack instance model in its system. A stack instance is simply a line item in the console that you can observe. CloudFormation generally uses the term stack to describe what I call a stack instance, but I want to be extra clear that we understand that this is a single representation of a reusable template. When CloudFormation issues the command to the stack instance, the stack instance will further delegate the command to an actual resource within its template. This is done implicitly with any built-in resources as CloudFormation will know how to do a creation operation on anything that comes from CloudFormation itself.
The resource itself is responsible for implementing logic that interprets the property values that are passed in, and creating the resource represented by that JSON. Once that creation or failure is complete, the resource responds back to the stack instance to tell it how that individual resource performed during its create, update, or delete. Then the notification about the individual resource is passed back up to CloudFormation. CloudFormation then publishes the event, which contains data around reasons for failures or reasons for success to SNS. When publishing to SNS, CloudFormation can either publish to a new topic that is created during the spin-up of the stack itself, or to an existing topic that you define by the ARN.
In typical SNS fashion, once the topic receives a message, it broadcasts that message to multiple users or consumers. There are two main applications for this type of topic. One is a user subscription, where we just want to manually monitor what is happening with the stack. The other is a custom listener that listens to the SNS stream and implements custom behavior. I use Lambda here in my example because it is a convenient SNS subscriber. However, you can use any SNS subscriber you want. The SNS subscriber should then implement any custom behavior that you want, such as fanning notifications out or executing other stack logic, and then respond to the user.
So let's quickly recap the entire flow, since it's pretty complex. First, we have a request that needs to be composed and sent to CloudFormation. Once CloudFormation has been issued a command, it immediately acknowledges or errors out with basic messaging around the type of error and such. The response to the command is then sent to the user as a pass/fail validation, where fail represents any upfront, easy template validation issues that CloudFormation could detect, and a pass means that the stack has begun creating, updating, or deleting.
Once the user receives a passing response, which will contain the ARN of the created stack, the user has the opportunity to poll the service to see the status manually. CloudFormation also begins issuing resource commands to the stack instance, which then delegates the commands to each individual resource, which has its own implementation of wrappers around Amazon service endpoints. As data comes out of each of the resources as a pass or fail on the creation of the individual resource, those values are passed back to CloudFormation as events. Once CloudFormation has the events, it publishes them to an SNS topic, which we can subscribe to like a normal SNS topic. One common use case for subscribing to that topic is to subscribe a user by their email to a very, very long-running stack. For instance, one where we need to create a large cluster database that could take up to an hour. Another more powerful technique is to make a server or Amazon service subscribe to the SNS messages and then implement custom behavior that triggers off of certain stack events, such as completion.
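A custom listener of this kind might look roughly like the Lambda handler below. It is a sketch under assumptions: that each SNS record carries the CloudFormation event as text made of `Key='value'` lines (fields such as `StackName`, `ResourceType`, and `ResourceStatus`), and that the standard Lambda SNS event shape `Records[].Sns.Message` applies. The function names are made up, and the message format is worth checking against a real notification before relying on it.

```python
import re

# Assumed line format inside the SNS message body: Key='value'
_FIELD = re.compile(r"^(\w+)='(.*)'$")


def parse_cfn_message(message):
    """Turn the Key='value' lines of a stack event message into a dict."""
    fields = {}
    for line in message.splitlines():
        match = _FIELD.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def handler(event, context):
    """Return stack-level events that are no longer in progress."""
    settled = []
    for record in event.get("Records", []):
        fields = parse_cfn_message(record["Sns"]["Message"])
        status = fields.get("ResourceStatus", "")
        # Per-resource events arrive on the same topic; keep only events
        # about a stack itself (nested stacks would also match this check).
        is_stack_event = fields.get("ResourceType") == "AWS::CloudFormation::Stack"
        if is_stack_event and status and not status.endswith("_IN_PROGRESS"):
            settled.append({"stack": fields.get("StackName"), "status": status})
    return settled
```

From here the subscriber could kick off integration tests on a completed stack, or forward an enriched notification on a failure.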
Here I use Lambda, but we can use any SNS subscriber we want and implement the logic on any platform we want. We also have the opportunity to send custom notifications out of our server or Lambda or SNS subscriber that receives the message. We can use this to enrich the message rather than sending the raw one.
Finally, it doesn't really matter if this is a real user or a machine. As long as the entity is able to assume Amazon Web Services roles that give it permission to manipulate CloudFormation, then this script will run fine.
Now that we understand how the CloudFormation data flow works, we can use this new knowledge to create neat automation systems that improve our DevOps situation. During the next video, we'll go over how you can integrate this knowledge with a CI or CD system to create automated full stack testing of your entire cloud architecture.
Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.
To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.
Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.
He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.
In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.
Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.