Rolling and Canary Deployments



As modern software expectations increase the burden on DevOps delivery processes, engineers working on AWS need increasingly reliable and rapid deployment techniques. This course aims to teach students how to design their deployment processes to best fit business needs.

Learn how to: 
- Work with immutable architecture patterns
- Reason about rolling deployments
- Design Canary deployments using AWS ELB, EC2, Route53, and AutoScaling
- Deploy arbitrarily complex systems on AWS using Blue-Green deployment
- Enable low-risk deployments with fast rollback using a number of techniques
- Run pre-and-post-deploy whole-cloud test suites to validate DevOps deployment and architectural logic

If you have thoughts or suggestions for this course, please contact Cloud Academy at


Welcome back to Cloud Academy's advanced deployment on AWS course. Today we're going to talk about rolling and Canary deployments, and we've got a couple of sections to cover. We'll define what rolling and Canary deployments are, describe how we might do them, and look at a couple of diagrams. We'll talk about when to use one versus the other, or versus something like blue/green or another deployment methodology. And we'll also talk about when not to use them: when is it not advantageous to use one of these advanced deployment techniques?

So, getting into our definitions of rolling and Canary, let's start with rolling deployment: deploy new code to one or more servers at a time, in batches. You may have done this already, since it isn't a particularly exotic method of deployment. But it is distinct from the way you might normally deploy, where you just turn a server off, put new code on it, and turn it back on, because here we're using an AutoScaling launch configuration. Effectively, we have an AutoScaling group behind a load balancer, the ELB, and traffic comes in through that load balancer. We do a rolling deployment simply by changing the launch configuration associated with the AutoScaling group, and then terminating instances and allowing the group to recreate them. Or, depending on how close you are to your capacity threshold, we add instances using the new launch configuration first and then terminate the old instances.

Now this is great because if issues arise we can roll back fractionally, or roll forward if we want. Imagine we get 20% of the way through the instances, selectively terminating the ones that have the old launch configuration so that they come back with the new one. We can just go backwards: switch the group back to the original launch configuration and launch new servers with it, so we get a nice rollback.

So this is what it looks like in a diagram, for those of you that are more visual thinkers. We've got a lot going on here, so I'll step through it one step at a time. The top center is the first step we're looking at when we're creating this rolling deployment process or script. You can do this manually or you can script it; the important part is that you learn these steps and build them into your deployment logic, whether that's a manual process you walk through or a scripted one. Of course, since I'm doing DevOps training, I must suggest to everyone watching this video that you always script this, but you can do it manually while you're learning how it works.

So, without further ado, let's update with the new version in the top center here, where we're moving from an old launch configuration to a new launch configuration. All AutoScaling groups have an associated launch configuration. The piece that changes when you go from old to new should be either the user data script that bootstraps everything, or the AMI the instances launch from, if it's pre-bootstrapped. What you're trying to do is update the configuration of the instances so that they bootstrap themselves with the new version of the code.
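Here is a minimal sketch of that first step, assuming the group is managed through an AWS SDK client. `update_auto_scaling_group` and its `AutoScalingGroupName` and `LaunchConfigurationName` parameters are the boto3 Auto Scaling names. The `RecordingClient` stand-in and the names `web-asg` and `web-lc-v2` are hypothetical, there only so the sketch runs without AWS credentials.

```python
class RecordingClient:
    """Hypothetical stand-in for a boto3 Auto Scaling client; it only records calls."""

    def __init__(self):
        self.calls = []

    def update_auto_scaling_group(self, **kwargs):
        self.calls.append(kwargs)
        return {}


def point_group_at_new_launch_configuration(client, group_name, new_launch_configuration):
    """Swap the launch configuration attached to the group.

    Running instances are untouched; only instances launched after this
    call bootstrap with the new AMI or user data.
    """
    return client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        LaunchConfigurationName=new_launch_configuration,
    )


client = RecordingClient()
point_group_at_new_launch_configuration(client, "web-asg", "web-lc-v2")
print(client.calls)
```

In a real script you would pass in an actual Auto Scaling client instead of the recording stand-in.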

After we've done that, there are two different ways to proceed, and they're almost equivalent. One is to terminate first and let the group replace the instance, if you have capacity to spare. The other is to update with a higher capacity, plus one, to add our first new node, and then start terminating. That's typically the best way to do it. So look at the AutoScaling group there on the top right. We need to update the group itself, not just the launch configuration, which pertains to the individual instances. We update the group to a higher desired capacity and potentially a higher maximum capacity. In this diagram, we're looking at a system that might originally have required four instances to service all traffic, but during the deploy we update it to require five.
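The plus-one adjustment is simple arithmetic, sketched below. The function name `surge_settings` and its parameters are made up for illustration. The only rule it encodes is the one from the lecture: raise the desired capacity by one, and raise the maximum only if it would otherwise block the extra node.

```python
def surge_settings(desired, max_size, surge=1):
    """Return (new_desired, new_max) with room for `surge` extra nodes."""
    new_desired = desired + surge
    new_max = max(max_size, new_desired)  # raise the ceiling only if needed
    return new_desired, new_max


print(surge_settings(desired=4, max_size=4))  # (5, 5)
print(surge_settings(desired=4, max_size=8))  # (5, 8)
```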

Now, why do we do that? Because we've reconfigured the AutoScaling group to use the new launch configuration, the extra node we add for capacity will be launched as a new version, as you see in the bottom right there: an instance with the new version of the code, the new AMI, or the new bootstrapping logic. The diagram shows the state after the next step as well. Once we've overprovisioned to five, you'd expect four old and one new, but we have three and two, and we get there by terminating an old-version instance. So if I had four old versions and raised my capacity by one, I'd then terminate one of the old versions. Because I set the desired capacity to five, the AutoScaling group replaces that old-version node with a new-version one. And I won't have any downtime, because I did that plus one on the capacity: even once I removed a node I still had four nodes, which was all I needed to service traffic in the first place. So after those steps, where I updated to the new launch configuration, raised the AutoScaling group's desired capacity by one, and terminated one of the old-version nodes, this is what the steady state looks like: three old version and two new version.
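To make the four-and-one, then three-and-two progression concrete, here is a toy simulation. It is not AWS code. `SimulatedGroup` is a hypothetical model in which the group always refills to its desired capacity using whatever version the current launch configuration points at.

```python
from collections import Counter


class SimulatedGroup:
    """Toy model of an AutoScaling group (hypothetical, for illustration)."""

    def __init__(self, desired, version):
        self.desired = desired
        self.launch_version = version
        self.instances = [version] * desired

    def reconcile(self):
        # Launch replacements until the desired capacity is met again.
        while len(self.instances) < self.desired:
            self.instances.append(self.launch_version)

    def terminate_one(self, version):
        self.instances.remove(version)   # capacity dips by one here
        in_service = len(self.instances)
        self.reconcile()
        return in_service                # lowest count seen during the step

    def counts(self):
        return dict(Counter(self.instances))


group = SimulatedGroup(desired=4, version="old")
group.launch_version = "new"        # step 1: new launch configuration
group.desired = 5                   # step 2: plus one on desired capacity
group.reconcile()
print(group.counts())               # {'old': 4, 'new': 1}
floor = group.terminate_one("old")  # step 3: terminate one old node
print(floor, group.counts())        # 4 {'old': 3, 'new': 2}
```

The in-service count never drops below four, which is the capacity the system needed before the deploy started.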

So this is actually a diagram of what it looks like in the middle of the deployment, and we can go in two directions from here. If there are problems with the way our code went out or the way our system is behaving, we can run this process in reverse: switch the launch configuration back, and terminate the new-version instances rather than the old-version ones. If you just swap the text for old and new in the diagram, you can see that the system works in the other direction, and that's our rollback. Or, if we've checked that the instances rolled out with the new code work and we want to proceed with the rolling update, all I need to do is continue terminating the old-version nodes, those three nodes at the bottom left, as that white box says.

Terminate them at a rate that matches how long it takes the AutoScaling group to launch and bootstrap an instance with the new configuration. So if your launch and bootstrapping code for the instance class, plus your business logic, takes two minutes, I wouldn't suggest terminating old-version instances any faster than one every two or three minutes, simply because you may drain your capacity. It's pretty straightforward once you see it: switch the launch configuration, tune the capacity up by one, and start terminating the old instances at a rate that allows the AutoScaling group to replace them with the new version of the code. Although there are a lot of moving pieces, you can script this fairly simply, and this is actually how Elastic Beanstalk performs rolling updates on your behalf, if you ever use that application container service.
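The whole loop, including the pause between terminations and the run-it-in-reverse rollback, can be sketched as a small simulation. Everything here is hypothetical: `roll` works on a plain list of version labels rather than real instances, and `pause_seconds` stands in for however long your launch and bootstrap takes (the lecture's example would be two to three minutes).

```python
import time


def roll(instances, target, desired, pause_seconds=0.0, sleep=time.sleep):
    """Replace every instance not running `target`, one node at a time.

    `instances` is a list of version labels and is modified in place.
    Returns the lowest in-service count seen during the roll.
    """
    lowest = len(instances)
    while any(version != target for version in instances):
        victim = next(i for i, v in enumerate(instances) if v != target)
        del instances[victim]               # terminate one outdated node
        lowest = min(lowest, len(instances))
        sleep(pause_seconds)                # wait out launch + bootstrap
        while len(instances) < desired:     # the group refills from the
            instances.append(target)        # current launch configuration
    return lowest


fleet = ["old", "old", "old", "new", "new"]  # the mid-deployment state
print(roll(fleet, target="new", desired=5), fleet)  # roll forward
print(roll(fleet, target="old", desired=5), fleet)  # rollback: same steps, labels swapped
```

Rolling back is the same function with the target flipped, which is the point the diagram makes when you swap the old and new labels.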

Now, looking at Canary deployments. In Amazon at least, we want to think about a Canary deployment as deploying a Canary group of servers behind its own ELB, and setting up Route 53 weighted round robin records for the current and Canary ELBs at 100% and 0% respectively. So we've got our original servers behind an ELB, with an AutoScaling group and a launch configuration, and then we set up an entirely separate group. In a rolling deployment we use a single load balancer and just update the launch configuration in place. In Canary, we create two separate groups with two separate ELBs and two separate launch configurations, and use Route 53 above the ELBs to split traffic between them. Rather than relying on the ratio of new to old instances, as we do in the rolling deployment, we can fine-tune the ratio with weighted round robin records and direct exactly the amount of traffic we want at each ELB. We can also tell that 100% of the traffic behind each ELB is serviced by the same code: the Canary group is all running the Canary code and the old group is all running the old code, so we can use the ELB metrics to measure the efficacy of the new code. There are three steps here: use the ELB DNS name for an internal test, raise the Canary to greater than 0%, then raise it to 100%.
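As a rough sketch, the starting 100% and 0% setup could be expressed as a Route 53 change batch like the one below. The dictionary follows the shape of the `ChangeBatch` argument to Route 53's `change_resource_record_sets` call. The record name, the ELB DNS names, and the helper names are hypothetical, and the sketch assumes weighted CNAME records on a non-apex name (an apex name would need alias records instead). It only builds the request; it does not call AWS.

```python
def weighted_record(name, set_identifier, elb_dns_name, weight, ttl=60):
    """One UPSERT change for a weighted CNAME record pointing at an ELB."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": set_identifier,  # distinguishes the two weighted records
            "Weight": weight,                 # Route 53 weights are integers 0-255
            "TTL": ttl,
            "ResourceRecords": [{"Value": elb_dns_name}],
        },
    }


def canary_change_batch(name, current_elb, canary_elb, current_weight, canary_weight):
    """Build the pair of weighted records for the current and Canary ELBs."""
    return {
        "Comment": "weighted split between current and canary ELBs",
        "Changes": [
            weighted_record(name, "current", current_elb, current_weight),
            weighted_record(name, "canary", canary_elb, canary_weight),
        ],
    }


batch = canary_change_batch(
    "app.example.com.",
    "current-elb.example.com",   # hypothetical ELB DNS names
    "canary-elb.example.com",
    current_weight=100,
    canary_weight=0,
)
print([c["ResourceRecordSet"]["Weight"] for c in batch["Changes"]])  # [100, 0]
```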

First, the internal test. You create the Canary group and keep its weighted record at 0%, but send test traffic straight to the Canary ELB's DNS name rather than the DNS name you use for your domain. So rather than using the domain where you usually direct people, while the Canary group is still at 0% you should have a test group of users, or your internal testers, point at the ELB's own DNS name, either the internal one or the externally resolvable one: the name that's long and not typically used or addressed directly.

Then, after that first test, comes the part that gives this its name. We call this a Canary group because of the old way people used to test whether mines were safe: they would send a canary into the mine and see if it died from gas poisoning, which is a little morbid. This is the Canary part, where we send the Canary group out into the brave new world of the new code. I raise my weighted round robin records to delegate more than 0% of traffic to the ELB and AutoScaling group on the Canary side. If I raise that to 5%, I can expect that 5% of my incoming traffic is serviced by the Canary deployment rather than the old one.

After I validate at scale by increasing that percentage above 0, eventually I'll be comfortable enough to increase the Canary group's percentage to 100%. At that point it's no longer the Canary group, it's the primary group, and I can destroy the old system, or just let it scale down and run it at one node or zero nodes. All I've done is shift across the gradient from 0% to 100% of traffic delegation on my weighted round robin records.
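That gradient can be scripted as a gated ramp. In the sketch below, the step values after 5% are arbitrary examples, and `set_canary_percent` and `is_healthy` are hypothetical callbacks: the first would update the weighted records, the second would check something like ELB metrics before moving on.

```python
def ramp_canary(steps, set_canary_percent, is_healthy):
    """Walk the Canary share up through `steps`; fall back to 0 on failure.

    Returns the final Canary percentage.
    """
    for percent in steps:
        set_canary_percent(percent)
        if not is_healthy(percent):
            set_canary_percent(0)   # send all traffic back to the old group
            return 0
    return steps[-1] if steps else 0


applied = []
final = ramp_canary([5, 25, 50, 100], applied.append, lambda p: True)
print(final, applied)   # 100 [5, 25, 50, 100]

applied = []
final = ramp_canary([5, 25, 50, 100], applied.append, lambda p: p < 50)
print(final, applied)   # 0 [5, 25, 50, 0]
```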

For those of you that are visual, the diagram here might help more than the step-by-step description. Rather than having one ELB and a single group where I set the traffic ratio by terminating old nodes, we have Amazon Route 53 at the apex of the system, where I split people between the old and the new code. On the left-hand side we have our original, or normal, launch configuration on an AutoScaling group behind an Elastic Load Balancer. And on the right side we have our Canary group.

If you compare the normal and the Canary sides, the only difference is the launch configuration, which carries the code. And the only control mechanism we have, at the top, is shifting the weighted records from one ELB to the other. The diagram shows us in the middle of a deployment. This is what we'd see if we set the weighted round robin records to a 3:2 ratio, sending 60% of traffic to the old Elastic Load Balancer and 40% to the new one. Because both sides are AutoScaling groups, you can adjust that weight and expect the system to scale out reactively without any problems, so long as we do it slowly enough.
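The relationship between record weights and traffic share is just each weight divided by the total. A small sketch, with a made-up helper name, covering both the 3:2 split in the diagram and the 5% Canary from earlier. Since the split happens at the DNS level, treat the result as the expected share of DNS answers, which only approximates the share of requests.

```python
from fractions import Fraction


def traffic_share(weights):
    """Map each record's weight to its expected share of DNS answers."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be greater than zero")
    return {name: Fraction(weight, total) for name, weight in weights.items()}


shares = traffic_share({"current": 3, "canary": 2})
print(float(shares["current"]), float(shares["canary"]))  # 0.6 0.4

shares = traffic_share({"current": 95, "canary": 5})
print(float(shares["canary"]))  # 0.05
```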

So all we've done here is deploy the same thing twice and then allow the Canary group to service some fraction of the traffic. This is not too terribly exotic: we've created two services and then moved across a gradient from 0% to 100% of traffic running on the Canary.

So when should we use a rolling or a Canary deployment over other deployment methodologies? One time you might want rolling or Canary rather than something like a fully immutable full-stack deployment or a blue/green is when you have a huge bank of servers. If I have a lot of servers, the rolling deployment is the fastest way for me to replace the code on all of those instances without creating downtime, simply because other methodologies require me to spin up a much larger portion of my system before I do a switchover. With rolling I don't have to make a large incremental adjustment: even if I have 1000 servers, I only have to bump the node count up by one to keep my capacity the same, and then run the rolling deployment. So it's lower cost and relatively quick. If I want to deploy as quickly as possible, safely, this is still the best way to do it.

There aren't really other ways that let us do it faster. If we wanted to do a full-stack immutable deployment with 1000 instances, for instance, I would need to spin up 1000 new instances to bring the entire second stack up before I did a DNS switchover, and that's not the fastest. Immutable rolling and Canary deployments do still beat mutable ones. Rather than modifying the code in place on each node, we're removing nodes and adding them back into service, as we saw with these AutoScaling-group-managed deployments, and that beats the heck out of in-place changes for the same reasons we talked about in our immutable deployment lecture. We get the same pace and speed of deployment with none of the hassle of mutable systems and almost no additional expenditure, because we can just roll up a single extra server.
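The cost argument comes down to how many extra instances each approach needs running at its peak. A small sketch of that comparison, with hypothetical function and strategy names:

```python
def extra_instances_needed(fleet_size, strategy, batch_size=1):
    """Extra instances running at the peak of a deployment."""
    if strategy == "rolling":
        return batch_size    # only the surge nodes are added
    if strategy == "full-stack":
        return fleet_size    # a complete second stack before the switchover
    raise ValueError(f"unknown strategy: {strategy}")


print(extra_instances_needed(1000, "rolling"))     # 1
print(extra_instances_needed(1000, "full-stack"))  # 1000
```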

Now, the one time this really doesn't work is when we have breaking schema changes, because then there's an extra database management layer where we need to make sure we're doing data synchronization. In that case we'll want to use something more like a blue/green deployment, which we'll learn about in the coming lectures. So, like we said, if I have breaking schema changes I shouldn't be using a Canary or rolling deployment, simply because we can't have two node types, or two versions of the code, talking to the same database if the schema change is breaking. They won't be able to share it.

The next case is when you want an instantly reversible change. While Canary and rolling deployments are faster at rolling code out, they're not instantly reversible. So they're a little faster on the deployment in the first place, if you need to roll a change out quickly, but on the rollback they're not the fastest, because we have to go and do the roll again. If I get 80% of the way through my traffic shift on a Canary, or 80% of the way through my redistribution on a rolling deployment, then I need to roll back 80% of the deployment and wait for the nodes to cycle through for this to work.
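For the rolling case, a back-of-the-envelope estimate shows why that rollback is not instant. The function below is a hypothetical sketch. It assumes nodes are cycled back in fixed-size batches and that each batch takes a fixed time to launch and bootstrap, and the numbers in the examples are illustrative only.

```python
import math


def rollback_minutes(fleet_size, fraction_deployed, minutes_per_batch, batch_size=1):
    """Rough time to cycle already-replaced nodes back to the old version."""
    nodes_to_cycle = round(fleet_size * fraction_deployed)
    batches = math.ceil(nodes_to_cycle / batch_size)
    return batches * minutes_per_batch


print(rollback_minutes(5, 0.8, 2))                    # 4 nodes, one at a time: 8
print(rollback_minutes(1000, 0.8, 2, batch_size=10))  # 800 nodes in 80 batches: 160
```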

You also shouldn't be performing a Canary or rolling deployment if your system and your requirements lend themselves best to blue/green. Blue/green is just a generally preferable way to do things if you can afford it from a time and cost perspective. And if you have any kind of breaking architectural change, where you're fundamentally changing the layout of things, you shouldn't be doing a Canary or rolling deployment, simply because this kind of deployment is geared towards rolling out new code, not new architectures. There's not really an AutoScaling group equivalent for when you need to add a different kind of database behind everything.
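The guidance in this section can be summarized as a small decision helper. This is only a paraphrase of the lecture's rules of thumb, not a definitive policy, and the function and flag names are invented for illustration.

```python
def suggest_strategy(
    breaking_schema_change=False,
    breaking_architecture_change=False,
    needs_instant_rollback=False,
    wants_fine_traffic_control=False,
):
    """Pick a deployment style from the rules of thumb in this lecture."""
    if breaking_schema_change or breaking_architecture_change or needs_instant_rollback:
        return "blue/green"   # rolling and Canary are a poor fit here
    if wants_fine_traffic_control:
        return "canary"       # weighted records give an exact traffic split
    return "rolling"          # cheapest: one extra node, replaced in batches


print(suggest_strategy())                                 # rolling
print(suggest_strategy(wants_fine_traffic_control=True))  # canary
print(suggest_strategy(breaking_schema_change=True))      # blue/green
```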

There are different techniques for making these breaking architectural changes, which we'll learn with blue/green. So the next lecture will cover the blue/green deploys that I've alluded to throughout this lecture: how they differ from rolling and Canary deployments, and why, in the cloud, blue/green is generally superior if you can afford the time and effort.

About the Author

Nothing gets me more excited than the AWS Cloud platform! Teaching cloud skills has become a passion of mine. I have been a software and AWS cloud consultant for several years. I hold all 5 possible AWS Certifications: Developer Associate, SysOps Administrator Associate, Solutions Architect Associate, Solutions Architect Professional, and DevOps Engineer Professional. I live in Austin, Texas, USA, and work as development lead at my consulting firm, Tuple Labs.