Join cloud experts Neel Kumar and Mike McLaughin from Aviatrix for a technical chalk talk on how you can solve some of the common issues that can occur when running cloud networking at scale. This group of chalk talks and technical demonstrations provides a practical reference for how to solve complex cloud networking challenges. First, we outline the common architectures and issues faced when scaling cloud architectures, then we workshop a transitive architecture use case defining best practices and design patterns. We discuss multi-cloud implementation, provider limits, hub and spoke architecture patterns, VPN and connectivity. Next, we set up a transitive controller in the AWS console with two instructional demos.
Learning Objectives
- Recognize and explain the common issues that occur when running complex cloud networks
- Describe and implement transitive architecture designs using a hub and spoke model
- Implement and maintain VPC connectivity at scale
Intended Audience
This course will suit anyone running or planning to run cloud services at scale.
Prerequisites
An understanding of cloud networking and the AWS Virtual Private Cloud will help you gain the most from this Chalk Talk.
We recommend completing the AWS Networking & Content Delivery learning path in order to gain practical knowledge and hands-on experience if you are not familiar with cloud networking and the virtual private cloud.
Content Overview
First, we outline the common architectures and issues faced when scaling cloud architectures, then we workshop a transitive architecture and design pattern. Next, we set up a transitive hub in the AWS console with a hands-on demo, and discuss the following:
- Cloud Networking - The Common Journey
- The Common Patterns with VPC Design
- Designing a Transitive VPC Architecture
- Managing Network Security at Scale
- DEMO - Setting up a Transitive Controller
- DEMO - Setting up a Transitive Hub
Aviatrix.com
Aviatrix is an Advanced AWS technology partner highly regarded in the cloud community for helping AWS customers solve advanced networking challenges.
I strongly recommend reading more about Aviatrix on their website at www.aviatrix.com.
Aviatrix have a number of AWS quick start architectures at the links below.
https://aws.amazon.com/quickstart/architecture/aviatrix-global-transit-hub/
https://aws.amazon.com/quickstart/architecture/aviatrix-user-vpn/
Feedback
If you have any questions or suggestions for this course, please contact Cloud Academy at support@cloudacademy.com.
If you have any questions for Neel or Mike, you can contact them directly at info@aviatrix.com.
- Let's start with the problem first.
- Right.
- We just explained you had these VPCs. And I'm gonna just label them as VPC 1, VPC 2, VPC 100. And they're in different regions, different availability zones. The real need is to go to the internet.
- And this could be for patches, bootstrapping, updates, et cetera.
- Exactly. It could be API access as well, all of those things.
- Yeah.
- Now for that requirement, what is the first thing that comes to mind? It's security policies, right? Amazon already gives you security policies at every instance level.
- Yeah.
- But the security policies are not URL-based, they are IP-based. And just take an example of, let's say you have API access to salesforce.com. Just that can be more than 100 IPs. So that's not going to be sufficient. You might try that, let's call it approach number one. Your approach, we'll call it app.
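To make the IP-based versus URL-based contrast concrete, here is a minimal sketch of a hostname allowlist check. It is an illustration only: the function name and the patterns are hypothetical, and it does not represent how AWS security groups or any Aviatrix product work. The point is that one hostname pattern can stand in for what would otherwise be a long and changing list of IP rules.

```python
from fnmatch import fnmatch

# Hypothetical allowlist: two hostname patterns instead of 100+ IP-based rules.
ALLOWED_HOSTS = ["salesforce.com", "*.salesforce.com"]


def is_egress_allowed(hostname, allowed=ALLOWED_HOSTS):
    """Return True if the destination hostname matches any allowed pattern."""
    # Normalize case and drop a trailing dot from fully qualified names.
    host = hostname.lower().rstrip(".")
    return any(fnmatch(host, pattern) for pattern in allowed)


for candidate in ["login.salesforce.com", "example.com"]:
    verdict = "allow" if is_egress_allowed(candidate) else "deny"
    print(candidate, "->", verdict)
```

An IP-based rule set would need an entry for every address behind those names and would have to be updated whenever the addresses change.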
- OK.
- App1 is security policy.
- Amazon are whitelisting or blacklisting here, right?
- Doesn't work, right? The next thing you would really try is, well I already have my data center. I have solved this problem, this is not a new problem. I have a firewall here, which can control all the other internet access. Why can't I just take everything, move it back here, and egress from here to the internet?
- I was just thinking that, 'cause you were gonna have to put a firewall on each of these VPCs, surely, that would be the only other way to go about it.
- Right, right, so we'll talk about that as App3, or approach 3. Approach 2 is hairpin through the data center.
- Okay, which would make the engineering team quite happy.
- Yeah, or the security team would be: I have the same firewall to manage. But the problem is, you have now forced extra latency on yourself, you might be thinking about a choke point in terms of bandwidth, and most importantly, in cloud remember every egress costs money. A packet going from here to here, additional nine cents per gig per month. So, you will get a big no after doing all the calculation. You might still do that initially; none of these approaches are bad, it's just that they don't last very long, it is a patch for a short time and then you realize it's not gonna work. So what's approach 3? This is where you were going before, which is, can I put a firewall in each one of these VPCs?
- Yeah.
- Well, certainly you can. But the problem is cost.
- Right.
- Right, a firewall in each VPC.
- Plus I'm creating a lot of complexity for myself.
- That's right, it's cost of the software, the cost of the instance, and cost of the engineers who now have to manage not just one but one, two, up to 100. 100 firewalls need an army of security engineers to manage, that's not gonna work. So what's my approach 4? Natural thing is, why can't I centralize it? Let me create a security VPC, let's call it Sec VPC, I put a big firewall here, and I force all the traffic to come here, and egress from here. Well, this doesn't work. Why, because in AWS especially, the packets from here to here, they will get dropped. If the origin is not from the same VPC it will get dropped.
- It will be dropped, yes.
- Unless you are building an IPSec tunnel terminating on the firewall. But if you have 100 IPSec tunnels terminating on the firewall, you are back to the problem of the transit. So there is no easy way, so a centralized firewall doesn't make it easy as well.
- No.
- Plus you still have the problem of double egress because a packet going from here to here, another two cents per gig, it's not as bad as nine cents, but still money.
- Plus you still got a transit cost for that as well, right?
- Transit, of course, for a time doubles. And you still have latency because not all of these VPCs are in the same region, so now you're taking a packet from east sending it to your security VPC which is in west, and egressing from Australia, from Japan, from India. You might think, I will put a security VPC in every single region, which might give you a little bit more but now you are increasing cost, it's borderline benefit, but still doesn't work.
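The cost argument behind approaches 2 and 4 can be sketched as simple arithmetic. The per-gig surcharges below are the figures quoted in the talk (nine cents for the data center hairpin, two cents for the centralized security VPC), not current list prices, and the monthly traffic volume is a hypothetical number chosen only for illustration.

```python
# Per-GB surcharges as quoted in the talk (assumed, not current AWS pricing).
HAIRPIN_USD_PER_GB = 0.09      # approach 2: hairpin through the data center
CENTRAL_VPC_USD_PER_GB = 0.02  # approach 4: centralized security VPC


def extra_egress_cost(gb_per_month, rate_usd_per_gb):
    """Extra monthly cost of sending traffic over an additional paid hop."""
    return round(gb_per_month * rate_usd_per_gb, 2)


monthly_gb = 50_000  # hypothetical traffic volume across all VPCs

hairpin = extra_egress_cost(monthly_gb, HAIRPIN_USD_PER_GB)
central = extra_egress_cost(monthly_gb, CENTRAL_VPC_USD_PER_GB)
print(f"hairpin: ${hairpin:,.2f} per month, central VPC: ${central:,.2f} per month")
```

Either way the extra hop is paid on every gigabyte, which is why both designs tend to be rejected once traffic grows.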
- That's interesting, 'cause that would probably be the first approach people would take, wouldn't it, to be putting a firewall in each region when they do set up in that way.
- Possible, yeah. And you might try different, you might start here, go here, but what we have seen is a journey around this problem. When you have two VPCs you just do this, nobody cares. But because the situation in the cloud is not frozen to be the same situation over a period of time, it changes, your approach needs to be ready for the future state that you are going to be in.
- Right.
- So, there is no good solution in this. If you think about it, what are we trying to do? We have a few machines, these are not the developer machines, so you don't need to worry about them going to Yahoo and the music site and YouTube, these are applications. Application servers generally access a limited set of URLs. So just to manage those URLs, you have to handle so much complexity, that's one. The second is more of a compliance requirement. Because there is always a chance of people getting hacked, you always need to have a finite list of things that you are allowing to go out to internet. And for that small problem, all the solutions look either insufficient or like overkill. Now if you think about it, what do people really want? And we should probably wipe it out and get to a clean diagram. So we were at the problem space of trying to control egress from various VPCs, and we looked at all of the challenges of the various methods, and it turns out there is no good design. So, if you were a developer and you had to build it from scratch, what would you like to achieve? Right, we're back to our VPC 1, VPC 2, and VPC 100. We already said we want something which is inline execution, so we are not paying for extra egress, right?
- Yep.
- So it should work from here. Let's imagine a piece of code, which is sitting out here in the VPC, and it can go directly out to the internet from here. Same here: you have a piece of code, it can directly go from here.
- So we're talking a software solution.
- Say that again?
- We're talking software.
- Software based solution. And it should run on an instance which is running in the VPC. That's step one. However, you don't want to be managing separately 100 instances, so you want a centralized manager. So let's say this is your Centralized Policy Manager.
- Now I think that's a really big benefit because straight away I'm thinking that's too much complexity for me to manage three, I mean remember in our first design I was thinking about putting a firewall in each VPC.
- That's right.
- I'd already created a rod to beat myself with.
- Alright.
- So now I'm thinking same thing, but software. But now I can see you're gonna make this easy for me.
- Exactly, and you have a dotted line coming back in. So what do we call this design? Inline execution, but distributed operation, or a centralized operation. The execution is happening here, the code is happening here. Now the beautiful thing about this policy manager is, all it is doing is it's allowing you to create policies, P, and apply policies. Right, create policy, apply policy. So this is creation, this is execution.
- So this is my whitelisting and blacklisting rules.
- Correct.
- And my security groups as well?
- Possibly security groups, right. First let's talk about egress, then we will extend it to the security groups there for stateful firewalling. So, now the next most critical thing here is, because of this 100, and not all VPC will have the exact same URL, you want to give me something that can be applied to network as code. So, tag based, it should be tag based. I should be able to create a tag, calling it my security guy's favorite URL, and a tag called app tag, app 1, app 2, database 1, database 2, sensitive application, PCI compliance based tag, and those should be able to dynamically apply to whichever VPC I choose, right? And that would really solve the problem of egress, because then everybody is going to internet independently, and yet you have a centralized system.
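The "create policy, apply policy" idea with tags can be sketched as follows. This is a toy model under stated assumptions, not the Aviatrix controller's API: the class, method names, tag names, and hostnames are all hypothetical. It shows tags created once in a central place and attached to whichever VPCs need them, with each VPC's effective allowlist derived from its tags.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyManager:
    """Toy centralized policy manager: create tags once, apply them to many VPCs."""
    tags: dict = field(default_factory=dict)         # tag name -> set of allowed hostnames
    attachments: dict = field(default_factory=dict)  # VPC id -> set of tag names

    def create_tag(self, name, hostnames):
        """Create (or redefine) a tag as a set of allowed hostnames."""
        self.tags[name] = set(hostnames)

    def apply_tag(self, vpc_id, tag_name):
        """Attach an existing tag to a VPC."""
        if tag_name not in self.tags:
            raise KeyError(f"unknown tag: {tag_name}")
        self.attachments.setdefault(vpc_id, set()).add(tag_name)

    def effective_allowlist(self, vpc_id):
        """Union of the hostnames from every tag attached to the VPC."""
        allowed = set()
        for tag_name in self.attachments.get(vpc_id, set()):
            allowed |= self.tags[tag_name]
        return allowed


manager = PolicyManager()
manager.create_tag("baseline", ["*.amazonaws.com"])   # hypothetical baseline tag
manager.create_tag("app-1", ["api.example.com"])      # hypothetical application tag
manager.apply_tag("vpc-1", "baseline")
manager.apply_tag("vpc-1", "app-1")
manager.apply_tag("vpc-100", "baseline")
print(sorted(manager.effective_allowlist("vpc-1")))
```

Because the allowlist is computed from the tag definitions at lookup time, redefining a tag in one place changes the policy for every VPC that carries it.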
- So am I able to report on each of these egress rules, I mean traffic-wise?
- Exactly, what was applied, what packets were denied. So now what you want is, this centralized engine should be able to have a pair, or a partner, called logging, which is collecting information from each one of these systems, and reporting, giving you a simple, nice dashboard saying these were the URLs which were tried to be accessed, these were allowed, these were denied. As well as source information of which instance requested that URL, so you can go to any level. And you can do this logging either in this platform or Splunk,
- I was just gonna ask that.
- Sumo, whichever is your favorite tool. So this logging platform, which is part of the solution, should not be its own platform, but variously extendable to any logging platform of your choice. So, if you had a solution like this, the last part that you would want in it is, these packets which are going to internet, there are also packets going between these. Imagine, if your app is sitting here, and your database is sitting here, you most likely want a firewall rule that says traffic should be able to go only this way, and the traffic coming back should be denied.
- Yep.
- Generally referred to as layer 4 stateful firewall. And the internet access is generally referred to as layer 7 egress control. So if you could do inline execution of your layer 7 and layer 4 from this code, and centralized management of all of those pieces of code, you've got yourself a perfect solution. And all of the activities are then reported into our centralized logging tool. This has the advantage of cost, it has advantage of less complexity, easy to manage, you can do network as code, you can do it over Terraform, APIs, whichever you want.
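A layer 4 stateful rule of the kind described (app may talk to database, but not the other way round) can be sketched like this. The sketch reads "traffic coming back should be denied" as meaning new connections initiated in the reverse direction, while replies on a connection the app opened are still allowed; that reading, and every name below, is an assumption made for illustration.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    src: str
    dst: str
    port: int


class StatefulFirewall:
    """Toy layer 4 stateful firewall: tracks connections opened by allowed flows."""

    def __init__(self, allowed_flows):
        self.allowed = set(allowed_flows)
        self.established = set()

    def new_connection(self, src, dst, port):
        """A host tries to open a connection; permit it only if a rule allows it."""
        flow = Flow(src, dst, port)
        if flow in self.allowed:
            self.established.add(flow)
            return True
        return False

    def reply(self, src, dst, port):
        """A reply from src back to dst is allowed only on an established flow."""
        return Flow(dst, src, port) in self.established


fw = StatefulFirewall([Flow("app", "db", 5432)])
print(fw.new_connection("app", "db", 5432))  # app opens a connection to the database
print(fw.reply("db", "app", 5432))           # database answers on that connection
print(fw.new_connection("db", "app", 5432))  # database tries to initiate on its own
```

The last call is the case the rule exists for: the database side cannot start a conversation with the app tier.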
- So these software based firewalls, can they be done, or managed, or provisioned by CloudFormation or from Terraform?
- Absolutely, in fact so much so that the modern practice, and what we as Aviatrix recommend to our customers, is this: you anyway have a Terraform script, or CloudFormation, to create the VPC. And you anyway know that there is a base list of URLs that you wanna filter. What you might not know is, in the journey of that VPC, what other applications will go there, and therefore what other tags will apply. But you have a baseline tag and firewall link, and a baseline layer 4 stateful firewall link that you wanna allow or deny. If you could write two or three more lines in the same Terraform, which is what you're using to create the VPC, wouldn't that be great?
- Wouldn't that be great.
- And that's the practice that we have been preaching here at Aviatrix, we have built that solution and brought it to bear.
- Look, I love the idea of setting rules by tags, I think everybody's used to that, so it makes it very simple, I think this saves a lot of time and takes out a lot of complexity, are there any best practices or implementation tips you can think of that will help if I wanted to design this.
- Right, there is. There is the part about, if I run code in every single VPC, and it is an instance. If it is a large instance, let's say c4.4xlarge, I'm gonna pay $8,000, and if I have 100 VPCs that's $800,000, that's not going to be sufficient. So this code better be able to run on a very small instance, or be run on an instance that is already there. Now you remember from our other conversation there is a transit gateway. If somehow an egress control can be part of the transit gateway, same instance being able to do the routing and security and firewalling, now you have the best of all worlds, where you are not spending a lot of time and money on each one of these instances.
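The instance cost argument is straightforward multiplication, sketched below. The $8,000 figure and the 100 VPCs come from the talk; the small-instance figure is a hypothetical number added only to show why instance size matters when the same code runs in every VPC.

```python
def fleet_cost_usd(cost_per_instance_usd, vpc_count):
    """Total cost of running one dedicated instance in every VPC."""
    return cost_per_instance_usd * vpc_count


LARGE_INSTANCE_USD = 8_000  # figure quoted in the talk for a c4.4xlarge
SMALL_INSTANCE_USD = 500    # hypothetical figure for a much smaller instance
VPC_COUNT = 100

large = fleet_cost_usd(LARGE_INSTANCE_USD, VPC_COUNT)
small = fleet_cost_usd(SMALL_INSTANCE_USD, VPC_COUNT)
print(f"large instances: ${large:,}; small instances: ${small:,}")
```

Reusing an instance that is already in the VPC, such as the transit gateway mentioned in the talk, removes the dedicated per-VPC instance cost altogether.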
- So your control would be in your transitive VPC?
- Correct.
- Now, can I do bastion hosts, or anything else along inbound access control as well?
- The inbound is trickier, because inbound is not lightweight operation, plus not all of your VPCs are going to have inbound. Most likely you will have one or two VPCs with a few applications, and there will be a lot of security protocols and parameters around it. And that is really a separate --
- A separate, it's a whole separate --
- Yeah, conversation, because that doesn't apply to scale like this, very rarely company has all of their VPC allowing ingress traffic, that's very rare. So that can be a separate design, there we have the ability to make it something more complex, because it is about that dedicated V--
- I imagine this gets harder when you have an even larger or more mature network, have you got any tips on how you can do this at scale?
- Yeah, yeah yeah. One: the cloud, like we were saying, the journey in the cloud is not static, it starts with a few VPCs, it becomes a few hundred VPCs, and in fact going into thousands. It's now being referred to as VPC sprawl at many places, you can read up about it. So, at scale there are multiple newer problems that start coming in. We go back to our favorite three VPCs, this is VPC 1, VPC 2 and let's call it VPC N, right, this N is a very large number. There are multiple problems, first is, at number N you would start having just the provider limits. It could be route table limits, it could be security problems. I'm just gonna call it provider limit.
- Yeah, which is basically that thing that, where you have an account, there's a sort of soft limit to how many security groups or VPN connections you can have.
- That's right.
- And that can change per provider et cetera, something you find out about at last minute.
- Exactly.
- Rather than up front.
- And it's a problem of how you be aware, because you might just be blind-sided by it, you don't know, and suddenly you realize oh now what do I do. The two parts of the problem in this, one is, how do we become aware of those problems and watch out for those, and then how do we architect ourselves so that, either we never get into the problem, or if we do, then we have an alternative path out of it.
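One way to "become aware" of provider limits before hitting them is to compare current usage with the quota and flag anything above a threshold. The sketch below assumes the usage and limit numbers have already been collected; the quota names and values are hypothetical, and no provider API is called.

```python
def limits_at_risk(usage, limits, threshold=0.8):
    """Return the quotas whose usage is at or above the given fraction of the limit."""
    at_risk = {}
    for name, limit in limits.items():
        ratio = usage.get(name, 0) / limit
        if ratio >= threshold:
            at_risk[name] = round(ratio, 2)
    return at_risk


# Hypothetical numbers for one account; real values would come from the provider.
limits = {"routes_per_route_table": 50, "security_groups_per_vpc": 500, "vpn_connections": 50}
usage = {"routes_per_route_table": 45, "security_groups_per_vpc": 100, "vpn_connections": 10}

for name, ratio in limits_at_risk(usage, limits).items():
    print(f"{name}: {ratio:.0%} of limit used")
```

Running a check like this on a schedule gives time to request an increase or change the design, which is the "alternative path" the speakers describe.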
- That's the thing we want.
- Right, so that's problem number one. The second problem is related to throughput. Because, all design that we are doing in the cloud ultimately will start breaking, throughput-wise. Why, because this is not like data center, you are limited to the instance that you can go to, both in terms of how much it costs, as well as availability of those instances. A lot of software solutions designed for the data center were designed with the hardware in mind. And they were shipped with that hardware. But that's not the situation in the cloud, there are only different types of instances and your software will perform the way it performs, and there is a limit to how much you can scale this up. And therefore the architecture that you choose should be able to scale up, scale down, and scale out, all three modes are very important in cloud. Scale up, down, and out. Throughput, however, can be very challenging because that's not just about instance, it's about the facility provided in AWS. A lot of cloud providers are giving internet speed at a higher and higher bandwidth. If you think about it, last 10 year journey of the cloud, 2008 to 2018, internet has gone from being one gig to now 10 gig, 40 gig, and now people even are talking about 100 gig. You can get that through a Direct Connect or you can do it direct internet. However, encryption on Direct Connect is still at one gig level. There is a 100x difference between where the encryption on internet is, and where the internet has gone. And now there are many attempts at trying to solve this problem, people have gone into Intel chipset with SR-IOV, they have tried to do software level optimization, they have tried to do parallelization, but this has been a hard problem. And only in recent times, in 2018, for the first time you will hear things like jumbo IPSec and others, for the first time there is a way to get encrypted traffic to the level of 10 and 20 gigs, which was not possible before.
- If you're designing for a situation like this, and you know that you're going to be hitting that type of requirement, you are better off starting with something like jumbo everywhere. Remember our transit design?
- Yep.
- Yeah, this is jumbo frames, this is jumbo IPSec, many things jumbo and you can research on it, there's a lot of information online about jumbo frames, jumbo IPSec, but that's the second problem that you would hit in scale. Third problem is all about automation. You know, automation is a good idea to begin with, however the amount of automation depends upon the scale, if you have 10 VPCs to manage, it's one level of automation. If you have 100 VPC to manage, it's a different level of automation. And therefore we recommend that, if you know you are going to go there, pay more attention to automation, automate every part of it, automate the build part of it, the destroy part of it, the recreate part of it, automate every single policy change.
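The "automate the build, the destroy, the recreate, and every policy change" advice can be sketched as a plan step that compares the desired state with what actually exists. This is a simplified illustration of the idea behind tools such as Terraform, not their implementation; the function name and the VPC settings are hypothetical.

```python
def plan(desired, actual):
    """Compare desired and actual VPC definitions and list the required actions."""
    create = sorted(desired.keys() - actual.keys())
    destroy = sorted(actual.keys() - desired.keys())
    update = sorted(
        vpc_id for vpc_id in desired.keys() & actual.keys()
        if desired[vpc_id] != actual[vpc_id]
    )
    return {"create": create, "destroy": destroy, "update": update}


# Hypothetical state: VPC id -> settings, including the egress tags it should carry.
desired = {
    "vpc-1": {"region": "us-east-1", "tags": ["baseline", "app-1"]},
    "vpc-2": {"region": "us-west-2", "tags": ["baseline"]},
}
actual = {
    "vpc-1": {"region": "us-east-1", "tags": ["baseline"]},
    "vpc-3": {"region": "us-east-1", "tags": ["baseline"]},
}

print(plan(desired, actual))
```

With 10 VPCs the diff could be done by hand; with hundreds, a repeatable plan and apply cycle is the only practical way to keep policy changes consistent.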
- That's a really good point because I think you hit the nail on the head there, expect to go there rather than think you might.
- That's right.
- If you design for scale, upfront, you're gonna have a much easier scalable journey, aren't you?
- Correct.
- Okay, so automation is a key point.
- Automation is the key part. And the final part of it is, whether you are designing to be multi-cloud and multi-region. Let's start with multi-region and then multi-cloud. A lot of times the journey starts with one region. I'm drawing here now, one region and one provider. One region, one provider. We can call it AWS east--
- See that all the time.
- Yeah, that happens. But it doesn't stay there, it goes into six regions and three providers, right? Three and six. And that's also a scale problem, because you might not have a lot of VPC in a region, you might just have five VPC in a region, but if you have six regions, and three providers, you are back to the scale problem. And pay attention to how you design your transit, how you design your egress, all of that should take this into consideration.
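The multi-region, multi-provider scale effect can be worked through with the numbers from the talk. The sketch assumes five VPCs in each of six regions for each of three providers, which is one reading of "six regions and three providers"; the link-count comparison between a full mesh and a hub-and-spoke layout is added here as an illustration and is not a figure from the talk.

```python
def total_vpcs(vpcs_per_region, regions_per_provider, providers):
    """Total VPC count when every provider has the same number of regions."""
    return vpcs_per_region * regions_per_provider * providers


def full_mesh_links(n):
    """Point-to-point links needed to connect every VPC to every other VPC."""
    return n * (n - 1) // 2


def hub_and_spoke_links(n):
    """Links needed when every VPC connects once to a shared transit hub."""
    return n


n = total_vpcs(vpcs_per_region=5, regions_per_provider=6, providers=3)
print(f"VPCs: {n}, full mesh links: {full_mesh_links(n)}, hub-and-spoke links: {hub_and_spoke_links(n)}")
```

Even a modest five VPCs per region multiplies into a network large enough that the transit and egress design has to account for it from the start.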
- Alright, I like that as a design philosophy, having a best practice partner manage your transitive strategy, means that you do have less complexity if you do need to add more regions, or if you want to go multi-provider.
- Right.
- And I mean, honestly I see that in the field more and more, where people start with one provider and one region, and then the business dictates or starts to direct that we look at other ways of doing this. There may be many reasons for that, it could be due to existing business relationships, it could be due to a whole lot of things outside of the network engineer's control.
- That's right.
- I really like that thought process of look for that best practice provider who can help you with this layout, so, the transitive design is really crucial, you've touched on some really good points here. The egress one is something that I think people would think about last when they get caught out.
- That's right.
- Because, honestly, I think that's the last thing people realize.
- Til you can fly under the radar. Then you hit this problem.
- Yeah, and the fact of also thinking through the encryption speeds with being limited to one gig throughput, so jumbo frames is an interesting space, the speed we're seeing in the networking is constantly increasing. Having a solution that can also increase with that is really really interesting. Are there any other things you can think of that might be worthwhile considering from an architectural perspective, in terms of scale?
- I think these are the four points, and what is important is the architects out there who are listening to it, please know that most likely you are gonna have a scale problem.
- Yes.
- You might start small, but there is all the evidence that this world is going to be more going here and here, than just a small isolated environment, and if that is so, whichever solution you are using, ask yourself and your provider how do you account for multi-region, how do you account for multi-provider, how do you account for throughput, how do you account for automation, how do you account for limits. And ask those questions in advance, be ready with it, so that you are met with no ouch at the end, oh no I forgot about it, because you are ready for those eventualities.
- That's a fantastic strategy, nice one, alright.
Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing Public Cloud Services. Having built 70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+ years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.