Blue-Green Deployments


1h 11m

As modern software expectations increase the burden on DevOps delivery processes, engineers working on AWS need increasingly reliable and rapid deployment techniques. This course aims to teach students how to design their deployment processes to best fit business needs.

Learn how to: 
- Work with immutable architecture patterns
- Reason about rolling deployments
- Design Canary deployments using AWS ELB, EC2, Route53, and AutoScaling
- Deploy arbitrarily complex systems on AWS using Blue-Green deployment
- Enable low-risk deployments with fast rollback using a number of techniques
- Run pre- and post-deploy whole-cloud test suites to validate DevOps deployment and architectural logic

If you have thoughts or suggestions for this course, please contact Cloud Academy at


Welcome back to Cloud Academy's Deployment on Amazon Web Services course. In this lecture, we're going to talk about blue-green deploys, starting with what a blue-green deployment is. I alluded to blue-green several times in the previous lecture on Canary and rolling deployments, so we need to talk about the differences between those and blue-green. We'll talk about when it's appropriate to use blue-green deployments and when it's not. Finally, we'll go over some tools that Amazon Web Services provides to make blue-green deploys significantly easier than they have ever been on any other cloud provider, in a colo, or in an on-site data center.

When talking about blue-green deploys, the first thing to realize is that the colors and their order don't really matter. Here, green is your production system and blue is the system you're going to promote to production, but depending on where you look this up, the terms might be swapped. One color is production, and the other is waiting to be promoted, so we've got two separate systems: a production system, and a copy of it that differs only in running a new version. If your blue, or second, version looks good and passes tests, you promote it to green via a DNS or proxy switch, plus a database swap. That one's a mouthful. But really it gives you the ability to test an entire cloud full stack or sub-stack independently, make sure that it passes tests, and then promote it into the place the production system was in. We can think of that as a "hot swap", where we very quickly change which thing we're pointing to once we've fully validated that the entire system or subsystem is working. Like I said, you just need two environments. I've also heard this called a red-black deployment. The colors, unfortunately, are used differently in a lot of different places, so that might be a little confusing. Just realize that blue and green represent a production system and a standby system.

This is what it looks like using blue and green the way I've used the terms, though of course we could swap the colors. We have green on the left, and we can imagine that this was our original production stack. On the right, we create a second stack, our testable blue stack, which is basically a copy of the first. The difference between the two colors is that we're running Version 1 of the infrastructure, or instance code, on one side and Version 2 on the other. Green remains available for rollback in case we need to switch back.

Now, we can do this side-by-side deployment and run tests. As you can see, we have the master and the test database. We can run integration tests on the blue stack, and if we want, we can replicate our database to provide seed data for those tests. There are all kinds of things we can do at the data management layer, because we have state that we need to carry over. But the idea is that all I need to do to start using a new version of code is switch which ELB I'm pointing at. Because we're running two separate versions and have a fast switch like that, both the time it takes for a deployment to take effect and the time it takes to roll back are about as short as you can make them: all I'm doing is a change at the DNS or reverse proxy level. This is really important. When you're taking an Amazon Certification exam, they might ask you, "Here are some different scenarios in which I want to deploy code as quickly as possible. What's my best option?" Some of your options will be the rolling or Canary deployments that we talked about in a previous lecture. At least one of the answers will be a blue-green deployment with switchover at the DNS or reverse proxy layer. For "Redeploy a new piece of code and allow for rollback as quickly as possible," that is the answer: a blue-green style deployment with a reverse proxy or DNS switchover, where we might also be migrating or switching over the database as quickly as possible. We can leave the green stack on the left, running Version 1, on hand for however long we want, depending on our business requirements. Our business might say we need to be able to roll back for six hours, until we know that we've validated Version 2. Or they might say, "Oh, it's very expensive for us to be running two sets of production infrastructure at once. We want to terminate that green side within 15 minutes, as long as things look good." So the time that we retain that secondary green stack for rollback varies with your business requirements. There's no set number.
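To make the "hot swap" idea concrete, here is a minimal sketch in Python. It is a toy model, not AWS code: `Stack` and `Router` are hypothetical names, and `Router` only stands in for whatever DNS record or reverse proxy sits in front of the two stacks. The point is that activation and rollback are the same single pointer swap, and that the old stack sticks around until the business decides the rollback window has closed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stack:
    """One full environment (load balancer plus everything behind it)."""
    name: str
    version: str

    def handle(self, request: str) -> str:
        return f"{self.version} served {request}"


@dataclass
class Router:
    """Stands in for the DNS record or reverse proxy in front of both stacks."""
    live: Stack
    standby: Optional[Stack] = None

    def promote(self) -> None:
        """Swap live and standby. The same call also performs a rollback."""
        if self.standby is None:
            raise RuntimeError("no standby stack to switch to")
        self.live, self.standby = self.standby, self.live

    def retire_standby(self) -> None:
        """Drop the retained stack once the rollback window has passed."""
        self.standby = None

    def handle(self, request: str) -> str:
        return self.live.handle(request)


green = Stack("green", "v1")
blue = Stack("blue", "v2")
router = Router(live=green, standby=blue)

before = router.handle("GET /")       # traffic goes to Version 1
router.promote()                      # activation: one pointer swap
after = router.handle("GET /")        # traffic goes to Version 2
router.promote()                      # rollback: the same swap in reverse
rolled_back = router.handle("GET /")  # traffic is back on Version 1
```

Once `retire_standby()` has been called there is nothing left to swap to, which mirrors the trade-off between keeping the old stack for six hours and terminating it after 15 minutes.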

As for the differences from Canary or rolling, we've alluded to some of these already. Blue-green can be an on-off switch rather than a gradual shift. It doesn't have to be: if you overprovision, you can shift traffic gradually with Weighted Round Robin. But the idea is that you can do a binary switch and not worry about scaling. If you remember, with Canary deployments we relied on Auto Scaling to shift traffic and have the new Canary groups scale out. With blue-green, we can flick the switch on and off as quickly as possible. So we can ditch being gradual with blue-green, which we can't do with Canary or rolling deployments.
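As a sketch of the binary switch versus Weighted Round Robin distinction, the function below builds a Route 53 change batch with one weighted CNAME record per stack. The function name, hostnames, and TTL are assumptions for illustration. The dictionary follows the `ChangeBatch` shape that boto3's Route 53 `change_resource_record_sets` call accepts, but nothing here talks to AWS.

```python
from typing import Dict


def weighted_change_batch(record_name: str, blue_dns: str, green_dns: str,
                          blue_weight: int, green_weight: int,
                          ttl: int = 60) -> Dict:
    """Build a change batch that UPSERTs one weighted CNAME per stack."""
    for weight in (blue_weight, green_weight):
        if not 0 <= weight <= 255:
            raise ValueError("Route 53 weights must be between 0 and 255")

    def record(set_id: str, target: str, weight: int) -> Dict:
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,  # distinguishes the two weighted records
                "Weight": weight,         # share of traffic = weight / sum of weights
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {"Changes": [record("blue", blue_dns, blue_weight),
                        record("green", green_dns, green_weight)]}


# Binary switch: all traffic to blue, none to green.
binary = weighted_change_batch("app.example.com", "blue-elb.example.com",
                               "green-elb.example.com", 255, 0)

# Gradual shift: roughly 10% to blue, 90% to green.
gradual = weighted_change_batch("app.example.com", "blue-elb.example.com",
                                "green-elb.example.com", 10, 90)
```

In a real account, a batch like this would be passed as `ChangeBatch` together with a `HostedZoneId`. Keep in mind that resolvers cache answers for up to the TTL, so a DNS-level switch takes effect over that window rather than at a single instant.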

Activation and rollback are nearly instant. After I've run my tests on my blue stack, the standby stack that's getting ready to serve the Version 2 code, and I've validated that I believe the stack will work, I can do my activation with the reverse proxy or DNS switch. That's an extremely fast way to shift 100% of traffic over, and done well, users never notice the difference.

That works in the opposite direction, too. Unlike Canary and rolling deployments, we can roll back by the same method: a reverse proxy or DNS switch back to the original production stack if the new stack fails once it's under production load. Rollback is about as fast as you can design it, because it's effectively just a pointer swap at the cloud level.

We can also support breaking schema changes with blue-green. Imagine we need to make a significant schema change to a database that's taking writes. In the blue-green environment, I can deploy a new database and start replicating all of my writes to the new schema in real time, using a schema translation (migration) script that I write. Then, whenever it's time to switch over to the new schema, all I have to do is switch the DNS, and all of my writes start going to the new schema. So this supports breaking schema changes a lot better than a Canary or rolling deployment, simply because we can operate two separate databases at once. Albeit a little bit scary, we do have the capability to do a live database migration.
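Here is a small sketch of that real-time schema translation idea, using in-memory lists in place of real databases. The schema change itself (splitting a `full_name` column into `first_name` and `last_name`) and the `DualWriter` class are invented for illustration; a production version would also need to backfill existing rows and handle failures on either write.

```python
from typing import Dict, List

OldRow = Dict[str, str]
NewRow = Dict[str, str]


def translate(row: OldRow) -> NewRow:
    """Hypothetical breaking change: split 'full_name' into two columns."""
    first, _, last = row["full_name"].partition(" ")
    return {"id": row["id"], "first_name": first, "last_name": last}


class DualWriter:
    """Writes every row to the old store and a translated copy to the new one."""

    def __init__(self) -> None:
        self.old_db: List[OldRow] = []  # Version 1 schema (green)
        self.new_db: List[NewRow] = []  # Version 2 schema (blue)

    def write(self, row: OldRow) -> None:
        self.old_db.append(row)
        self.new_db.append(translate(row))


writer = DualWriter()
writer.write({"id": "1", "full_name": "Ada Lovelace"})
writer.write({"id": "2", "full_name": "Grace Hopper"})
```

Because both stores receive every write, the new database is already current at switchover time, and the old one is still intact if a rollback is needed.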

Blue-green always costs more money, because it requires a full secondary environment. That's just a fact of life. It's great, though, if you can actually afford to operate a secondary environment for a brief period of time while you're doing your testing validation, and then post-DNS switch validation.

Blue-green is good at verifying high-risk deploys. If we have a major new version, maybe we've bumped a full version number, we can use a blue-green deployment to do a very thorough check that the new deployment is going to work. If you've ever promoted releases through a staging environment as part of your operational process, here's a way to picture it: if you validate your staging deployment and operate it at exact parity with production, blue-green is like redirecting all of your production traffic to your staging environment once you've verified that staging works.

Lots of businesses are very comfortable validating their high-risk deployments on a staging environment. You can look at a blue-green deployment as having a production and a production standby, or a "staging++" environment that you're about to switch into production, with the ability to roll back quickly. As we said on an earlier slide, this is the correct answer on an Amazon Web Services Certification test if you're asked about rolling back quickly and mitigating risk as much as possible. Blue-green is hands down the best way to do it, because reverse proxies are the fastest switching mechanism, followed closely by DNS, and we can keep the original stack on hand in case we ever need to switch back. Isolating updates until after verification is the method by which we do this high-risk deploy verification.

We can also ensure whole-system immutability. This is getting a little bit into what we're going to talk about next lecture. But if you look back at the blue-green slide from before, I included an entire ELB, and not just the instances behind it, as part of my blue-green switchover. If we have different application service components that we need to switch over as well, maybe a session store or something, we can replace entire pieces of infrastructure beyond just instances or servers. We can replace entire stacks and verify that our system works before we do a switchover. So I can dramatically increase the complexity of what we're verifying in a blue-green deploy.

Now some problems. Blue-green is less useful for A/B feature tests. That's because if we have shared state like database state, migrating the database in a blue-green deploy, while supported, is fairly difficult. This kind of A/B feature test is also very expensive, since we'd be running a full production-level infrastructure and redirecting only part of the traffic to it. A/B feature tests are better done as a Canary deployment, where rather than two full production-scale deployments we have two subsections of a single production deployment.

It's also expensive and slower for very large systems. Compare a rolling update with blue-green: if we're doing a blue-green deployment on a load-balanced bank of 1,000 servers, we need to deploy, you guessed it, another 1,000 servers to be able to do the switchover gracefully. That's not the case for a rolling deployment. There, we only need to create one more server, bringing the total count to 1,001, and then rotate through the entire bank of 1,000, terminating the instances from the old launch configuration and adding new ones automatically. So it's cheaper, and potentially faster, since you just start propagating through.
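The arithmetic behind that comparison is simple enough to write down. The helper names below are made up for this sketch, and `batch_size` is an assumed knob for how many servers a rolling deployment replaces at a time (the lecture's example uses one).

```python
import math


def peak_instances_blue_green(fleet_size: int) -> int:
    """Blue-green runs a full copy of the fleet alongside the original."""
    return 2 * fleet_size


def peak_instances_rolling(fleet_size: int, batch_size: int = 1) -> int:
    """Rolling only needs one extra batch of servers at any moment."""
    return fleet_size + batch_size


def rolling_rounds(fleet_size: int, batch_size: int = 1) -> int:
    """Number of replace-a-batch steps needed to cover the whole fleet."""
    return math.ceil(fleet_size / batch_size)


blue_green_peak = peak_instances_blue_green(1000)  # 2000
rolling_peak = peak_instances_rolling(1000)        # 1001
rounds_one_by_one = rolling_rounds(1000)           # 1000
rounds_batched = rolling_rounds(1000, 50)          # 20
```

Larger batches shorten a rolling deployment at the cost of more temporary servers, while blue-green always pays for the full second fleet up front.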

Significant schema changes require a lot of tooling. If you want to avoid downtime, you need pretty sophisticated tooling to update the schema on the fly for, say, a SQL database. This is considerably easier for NoSQL databases that are built with BASE, or an eventual consistency model, in mind. But the challenge remains for anybody operating a SQL data store that relies on ACID guarantees. It's very hard to get ACID to work correctly when you're doing a blue-green switchover with two different schemas.

Long-running database transactions are extremely difficult in blue-green. These would be things like big stored procedures. Even if you have the tooling in place for a significant schema change, you can't do the switchover between two databases until all database transactions have stopped, and we need to make sure that the results of the long-running transactions are reflected on the new infrastructure. So if we have a four-hour query or stored procedure running, we have a problem when we think about, "Okay. How do I start switching 100% of traffic over and changing the DNS?" because that four-hour request will still be hanging around.
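One way to picture the constraint is as a gate in front of the switchover. This is only a toy check: the `Transaction` type and the remaining-time estimates are hypothetical, and a real system would query the database for open transactions rather than hold them in a list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transaction:
    name: str
    seconds_remaining: int  # assumed estimate, for illustration only


def safe_to_switch(open_transactions: List[Transaction]) -> bool:
    """The database switchover has to wait until nothing is in flight."""
    return len(open_transactions) == 0


def seconds_until_safe(open_transactions: List[Transaction]) -> int:
    """The longest-running transaction sets the earliest switchover time."""
    return max((t.seconds_remaining for t in open_transactions), default=0)


in_flight = [
    Transaction("nightly_report_procedure", 4 * 60 * 60),  # the four-hour case
    Transaction("checkout_write", 2),
]
```

With the four-hour procedure in the list, `safe_to_switch(in_flight)` is false and the earliest switchover is four hours away, no matter how fast the DNS change itself is.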

Fortunately, Amazon, relative to the rest of the universe, makes this whole blue-green thing a lot easier, even though it's a lot more complicated than a normal deployment. We get CloudFormation, which gives us templated and repeatable resource creation. Blue-green, as our higher-level way to do a switchover, can deploy more than just individual servers, unlike a rolling or Canary deployment. If we're deploying new versions of entire sub-stacks, we should be using CloudFormation, where we swap one template for another and do a DNS switch.

We also get CloudWatch, so we can detect whether new versions are actually working correctly. You can imagine it's a little bit scary to do the switchover and hope that things keep humming along. Because we have CloudWatch, we can instrument both our old and our new versions and measure things like usage metrics and log error frequencies before and after. We can make sure those don't spike after we do a switchover on a deploy. If we do see them go up, we can roll back using blue-green. For SQL databases, there's also a new migration service, AWS Database Migration Service. For eventually consistent NoSQL databases, you can use Lambda or some other compute service.
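The before-and-after metric comparison described above could be reduced to a rule like the one sketched here. The function name, the 1.5x tolerance, and the sample numbers are all assumptions. In practice the two series would come from CloudWatch metrics for the old and new stacks, and the right threshold depends on how noisy your error rate normally is.

```python
from statistics import mean
from typing import Sequence


def should_roll_back(before: Sequence[float], after: Sequence[float],
                     tolerance: float = 1.5) -> bool:
    """True when the mean error rate after the switch exceeds
    `tolerance` times the mean error rate before it."""
    baseline = mean(before)
    current = mean(after)
    if baseline == 0:
        # With a clean baseline, any errors at all count as a regression.
        return current > 0
    return current > baseline * tolerance


errors_before = [2, 3, 2]        # errors per minute on the old stack
errors_after_ok = [2, 3, 3]      # roughly unchanged after the switch
errors_after_bad = [9, 12, 10]   # clear spike after the switch

keep_new_stack = not should_roll_back(errors_before, errors_after_ok)
roll_back = should_roll_back(errors_before, errors_after_bad)
```

If the rule fires, the rollback itself is the same DNS or reverse proxy switch, pointed back at the old stack.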

So there are a number of different tools for doing data migration, including Data Pipeline and the Database Migration Service, all kinds of tools that you can find in the Value-Added Services panes inside the Amazon Console. Just take a look around. Then, of course, we have Route 53, our DNS service, where we can do our DNS-style swaps with ease. Whenever we're ready to promote blue to green, we can just do a switch.

So Amazon again makes this a lot easier than it ever was before. You couldn't even think about doing this when you were running on bare metal. So yay for Amazon. We can do our sophisticated blue-green deploys with our fast switchover and rollback, and our low-risk and our high validation by using all of these value-added services that Amazon gives us.

If you want to see what one of these blue-green deployments looks like, think back to the Immutable Infrastructure video, the second lecture in this course. When I run through that new instance deployment, that is actually a blue-green deploy. We didn't call it that at the time, but if you think about the concepts we just learned, like switchover at some sort of network layer, that's what it is. So go take a look if you're still curious, but hopefully it's still fresh in your mind.

Next up, we'll be talking about whole-cloud tests. This is for deployments where we need to validate that an entire subsystem works, which calls for a kind of automated testing we might not have done before, even though we're familiar with automated testing of the application code on individual instances. In the next lecture, we'll see how we can use the tools that Amazon Web Services gives us to do that.

About the Author

Nothing gets me more excited than the AWS Cloud platform! Teaching cloud skills has become a passion of mine. I have been a software and AWS cloud consultant for several years. I hold all 5 possible AWS Certifications: Developer Associate, SysOps Administrator Associate, Solutions Architect Associate, Solutions Architect Professional, and DevOps Engineer Professional. I live in Austin, Texas, USA, and work as development lead at my consulting firm, Tuple Labs.