Designing a Transitive VPC Architecture
Advanced networking at scale
Join cloud experts Neel Kumar and Mike McLaughin from Aviatrix for a technical chalk talk on how you can solve some of the common issues that can occur when running cloud networking at scale. This group of chalk talks and technical demonstrations provides a practical reference for how to solve complex cloud networking challenges. First, we outline the common architectures and issues faced when scaling cloud architectures, then we workshop a transitive architecture use case defining best practices and design patterns. We discuss multi-cloud implementation, provider limits, hub and spoke architecture patterns, VPN and connectivity. Next, we set up a transitive controller in the AWS console with two instructional demos.
- Recognize and explain the common issues that occur when running complex cloud networks
- Describe and implement transitive architecture designs using a hub and spoke model
- Implement and maintain VPC connectivity at scale
This course will suit anyone running or planning to run cloud services at scale.
An understanding of cloud networking and the AWS Virtual Private Cloud will help you gain the most from this Chalk Talk.
We recommend completing the AWS Networking & Content Delivery learning path in order to gain practical knowledge and hands-on experience if you are not familiar with cloud networking and the virtual private cloud.
First, we outline the common architectures and issues faced when scaling cloud architectures, then we workshop a transitive architecture and design pattern. Next, we set up a transitive hub in the AWS console with a hands-on demo, and discuss the following:
- Cloud Networking - The Common Journey
- The Common Patterns with VPC Design
- Designing a Transitive VPC Architecture
- Managing Network Security at Scale
- DEMO - Setting up a Transitive Controller
- DEMO - Setting up a Transitive Hub
Aviatrix is an Advanced AWS technology partner highly regarded in the cloud community for helping AWS customers solve advanced networking challenges.
I strongly recommend reading more about Aviatrix on their website at www.aviatrix.com.
Aviatrix have a number of AWS quick start architectures at the links below.
If you have any questions or suggestions for this course, please contact Cloud Academy at email@example.com.
If you have any questions for Neel or Mike, you can contact them directly at firstname.lastname@example.org
- Let's think about how we would approach this if we had a clean slate.
- Yeah, we're starting from scratch, and we have the luxury of some time and experience, and the opportunity to go back and start all over. So, yeah...
- Let's do it.
- Yeah, let's do this. So we need to keep our on-premises environment, because our database is still gonna be there, right?
- So, that's not gonna change. And you've still got your edge device sitting there at the edge of your on-premises network. And you've still got two data sets. So we'll keep this relatively simple.
- And so the next logical thing is to, if you don't want to have all those connections, like we saw. We saw that mess of connections.
- We don't want all those lines drawn out of our data center. We just want a single clean line.
- Mm hmm
- We already have Direct Connect.
- It's perfect for this situation. You're gonna have lots of traffic flowing down.
- And it's just a natural next step. We'll draw our Direct Connect up, and instead of connecting to all of our application VPCs, we're going to connect to a new VPC, and we're gonna label this the Transit VPC.
- OK, alright, so this is the dedicated VPC just for transitive routing.
- That's right, exactly right. And since we know AWS doesn't provide transitive networking by default, we need to put a router in here. We need a cloud router inside of this VPC that's gonna handle the routing. It's gonna handle transit into our application VPCs. So we'll draw that in for now and we'll come back to this.
- So do we have to put any rules around the CIDR blocks, or the usage of a number range, for this particular Transit VPC?
- This one you have to be more careful about. This is the one place where you're going to have to be careful about not overlapping and colliding with things in your data center. Now that we've got time to plan this out, we can talk with our networking team and get the right CIDR range here so it doesn't overlap with anything in our...
- OK, so we do this as a design first.
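As a rough illustration of the design-time check discussed here, the sketch below uses Python's standard `ipaddress` module to test a proposed Transit VPC range against the ranges already in use on-premises. The CIDR values and the `find_overlaps` helper are hypothetical and not taken from the talk.

```python
import ipaddress

# Hypothetical on-premises ranges, chosen only for illustration.
ON_PREM_CIDRS = ["10.0.0.0/16", "10.1.0.0/16"]


def find_overlaps(candidate, existing):
    """Return the existing CIDR blocks that overlap the candidate block."""
    candidate_net = ipaddress.ip_network(candidate)
    return [
        cidr
        for cidr in existing
        if candidate_net.overlaps(ipaddress.ip_network(cidr))
    ]


# A range carved out of 10.1.0.0/16 collides with the data center.
print(find_overlaps("10.1.128.0/20", ON_PREM_CIDRS))  # ['10.1.0.0/16']

# A range outside both on-premises blocks is safe to use.
print(find_overlaps("10.20.0.0/20", ON_PREM_CIDRS))   # []
```

Running a check like this before the VPC is created is one way to have the conversation with the networking team up front rather than after a collision.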
- Yeah, exactly right. And then, if you remember from our previous discussion, we have three VPCs in the account.
- DEV, PROD, TEST
- That's right, yep. So we'll do
- Plus there's bound to be a couple more that pop up
- Yeah exactly
- Up to 300
- Yeah, exactly. We'll keep it simple for this drawing here, and then you can imagine it going on, right?
- Yeah, and so these are your application VPCs. If you think about what you want to build and accomplish in your environment, you want these guys to be able to connect down to your databases down here, right? So we've got our database that we had earlier, and that's where we need to connect to. Now, there may be other things down here that you need, like your directory and so on, but ultimately we're trying to get connectivity between them. So the natural next progression is to start connecting each of the VPCs to your Transit VPC.
- Mmm hmm
- And we're gonna draw these in as lines going straight down to our... Yeah.
- Straight away, the difference there is that we're not terminating those on the virtual gateway inside this Transit VPC.
- Yeah, exactly right. We're gonna connect, and then, to make this even more seamless, we're gonna make another gateway here, we'll just call it a gateway, and terminate on the gateway in each of these. This is gonna keep it as simple as possible, and keep the costs down as well.
- That's our goal, right? So let's think this through from a top-down perspective. Yes, we want simplicity so that the teams can move quickly, and then the teams can scale without having to be held up,
- Yeah exactly right
- Which is what we saw in our real-world scenario. And also, yeah, we want to ensure that the costs are as efficient as possible.
- Right exactly
- And we don't have to keep going and asking for more budget.
- Exactly, exactly right. And be able to scale up and down as needed. The nice thing about having an instance here is that you can scale those instances up and down. If you think of a t2.micro, that's a relatively inexpensive instance, and that's where you can start, especially in a DEV environment. You don't have a need for a high-bandwidth box there; typically you're not doing much there. Staging and production is where you can focus in on what you really need for your end-user connectivity. And then the other thing that gateways provide us, from a perspective of HA, or high availability, is the concept we talked about earlier of failing over to another tunnel.
- Alright walk me through the HA side of it. So we've got two different availability zones or we..
- That's right, yeah. Exactly right. And the nice thing about having two separate availability zones is that, if there really ever is a problem with AWS and one of the availability zones goes down, you've got that connectivity going to each one, and you're not gonna lose connectivity from those. Now, in your DEV environment you may choose not to have HA, right? So this may or may not be there.
- Would you do it three or two? Because you've got three availability zones in most regions
- That's right. Yeah, you definitely could. When we work with customers we typically see two, just because, from a cost perspective, we try to balance those, right? Uptime and costs and so on. And each one of those adds to the cost. On the DEV side, what we usually see is that if DEV or test is down for a few minutes, it's not a big deal, versus production, where you actually want it to stay up. And so each one of these will have its own lines coming down to another gateway in an availability zone down here.
- Right, right
- So, we'll have each of those connections coming down to the secondary gateway, and then you have a failover as needed. You can have them all automatic, so there's no wandering in, looking at it, waiting for it to go down, then calling somebody and trying to keep tabs on it.
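The failover idea described above can be sketched in a few lines. This is only a toy model under stated assumptions: the `Gateway` class, the `active_gateway` function, and the gateway and zone names are hypothetical, and a real deployment would rely on the product's own health checks rather than a boolean flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gateway:
    """A gateway instance in one availability zone (hypothetical model)."""
    name: str
    availability_zone: str
    healthy: bool = True


def active_gateway(primary: Gateway, secondary: Optional[Gateway]) -> Optional[Gateway]:
    """Prefer the primary gateway; fall back to the secondary if it is down."""
    if primary.healthy:
        return primary
    if secondary is not None and secondary.healthy:
        return secondary
    return None  # No healthy path, e.g. a DEV spoke deployed without HA.


primary = Gateway("prod-gw-primary", "us-east-1a")
secondary = Gateway("prod-gw-secondary", "us-east-1b")

primary.healthy = False  # Simulate the first availability zone going down.
print(active_gateway(primary, secondary).name)  # prod-gw-secondary
```

Passing `None` as the secondary models the DEV case from the discussion, where a few minutes of downtime is accepted in exchange for lower cost.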
- OK so how are we setting these up? What do we have to do to get these gateways?
- Yeah, so from an Aviatrix perspective, these gateways are EC2 instances. They're in the marketplace and they're controlled by a single controller. So we have one box down here that we call the controller, which is a console very similar to what you're expecting coming from an AWS or Azure kind of environment. You've got one place to go for your environment.
- So you have a controller here, and that controller can automatically provision these for you, either through a GUI or Terraform or through... and that's a traditional way for you to try to automate this. You can build a script for it and push those out automatically.
- Perfect. Yeah, OK. So you do this with CloudFormation as well?
- Yeah exactly, yeah, yep
- So I take the.. I get the controller from the marketplace and then it will provision these virtual gateways for me or not? Or do I do each of those individually?
- Yes, you'll do each one of them individually. So if we start from the controller, when we're actually building this up from a clean slate, the controller is the first thing we're going to install. Then we're gonna have these VPCs, and we're gonna tell the controller, "I've got this DEV VPC," and you can say, "I want this HA or not," from the controller. And then you move on to staging and so on, and add on the next VPCs. Once you've got the controller in place, basically everything else just becomes a click of a button or a script. So, as we talked about earlier, when we added on our second region and our second application, those can all be done the same way. Usually what happens is you have Terraform or you have CloudFormation, and it just all rolls out automatically once you apply that to the second one. That's the beauty of having it all as individual pieces: those individual pieces become components in your script, and then you just replicate it rather than having this mess of trying to do it by hand.
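To make the "individual pieces become components in your script" point concrete, here is a minimal sketch of a declarative spoke list that is expanded into the gateways to create. It does not call any Aviatrix, Terraform, or CloudFormation API; `SpokeVpc`, `plan_gateways`, and the naming scheme are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpokeVpc:
    """One application VPC and whether it should get a second gateway."""
    name: str
    high_availability: bool


def plan_gateways(spokes: List[SpokeVpc]) -> List[str]:
    """List the gateway names that would need to exist for these spokes."""
    plan = []
    for spoke in spokes:
        plan.append(f"{spoke.name}-gw-primary")
        if spoke.high_availability:
            plan.append(f"{spoke.name}-gw-secondary")
    return plan


# DEV and TEST run without HA; PROD gets a gateway in a second zone.
spokes = [
    SpokeVpc("dev", high_availability=False),
    SpokeVpc("test", high_availability=False),
    SpokeVpc("prod", high_availability=True),
]
print(plan_gateways(spokes))
# ['dev-gw-primary', 'test-gw-primary', 'prod-gw-primary', 'prod-gw-secondary']
```

Because the spoke list is plain data, replicating the environment into a second region or account amounts to running the same plan against another list.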
- OK so straightaway we've got a much easier way of provisioning and adding and scaling so I think we added a lot more agility there. We're not gonna end up with bottlenecks where we do need to add perhaps a new region or three or four new VPCs quickly?
- So that's all gonna be automated.
- That's right. Yeah, exactly right. And if you think of a cloud team, the cloud teams are naturally following that automation. They're using CI/CD pipelines, or they're using their own system to automate, and when they do that, this all becomes code, and that code can then just be replicated. So you're not calling someone and saying, "I need to go build this same environment," and they're like, "What did you build? I don't know what you built." You've got it all checked in as code, so this second environment can be replicated really easily, and the connections can then just be built automatically through the code itself.
- But what makes it cleaner and even easier from a debugging perspective is this. Before, we were talking about how, when the cloud ops see a problem, they call the network engineer, and the network engineer has to go and figure out what you did in the cloud. There's a natural line now. You have basically this line that cuts through the Transit VPC, and you can think of it as: cloud operations can operate up here.
- Mmm Hmm
- And down here can be your networking and your infrastructure team. Your network engineers can operate down on this side, and on this side, when there's a problem, the cloud team can operate on their own. They can look at it, they can investigate it, and they can do all that work by themselves. So up here, we look at these connections as software-defined connections. These are software defined; there's no BGP involved. Which means that there...
- Wow, I was just going to ask you that. Already we seem to have taken out a lot of the BGP complexity...
- All the complexity is gone. There's none of the BGP, and as cloud operations you probably don't really understand BGP, and you don't need to at this point. Because now you can actually look at this and understand that these are all AWS components, these are all simple designs, so you're not trying to dig in: well, where did this route come from? This is all in your AWS route table; everything is in a place you expect it. Down here we use BGP because these routers down here are expecting it, right? They're expecting the routes to be automatic or propagated. So now you've got all your CIDR ranges coming up from the on-premises side via BGP, advertised up to our gateways, and the gateways themselves then handle that through the controller. The controller will then say, "These routes that are coming up from BGP, I'm going to send them off to my spokes so the spokes know how to get down." So that's all handled automatically. No more trying to figure that out, and no more debugging from all over the place.
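The propagation step described here, where routes learned from on-premises are pushed out to every spoke, can be modelled very simply. This is a conceptual sketch only: `propagate_routes`, the next-hop label, and the CIDR values are hypothetical, and it does not reflect how the controller is actually implemented.

```python
from typing import Dict, List


def propagate_routes(
    learned_cidrs: List[str],
    spoke_names: List[str],
    next_hop: str = "transit-gateway",
) -> Dict[str, Dict[str, str]]:
    """Build a route table per spoke that sends each learned CIDR to the transit."""
    return {
        spoke: {cidr: next_hop for cidr in learned_cidrs}
        for spoke in spoke_names
    }


# Hypothetical ranges advertised up from the data center via BGP.
learned = ["10.0.0.0/16", "10.1.0.0/16"]
tables = propagate_routes(learned, ["dev", "test", "prod"])

print(tables["prod"])
# {'10.0.0.0/16': 'transit-gateway', '10.1.0.0/16': 'transit-gateway'}
```

The point of the model is that each spoke ends up with ordinary route table entries, which is why the cloud team can inspect them without needing to reason about BGP.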
- I love it. Alright so I can see the simplicity really now. I'm thinking just from a sort of cost perspective, are we generating any more traffic here? Are we going... is there anymore costs we have to consider or factor in using this design?
- There's just a different way of looking at it now. You've got your Direct Connect cost, and this Direct Connect can be a VPN tunnel as well. We see both use cases...
- Which you have for HA, right
- That's right yeah.
- So it could be VPN or Direct Connect.
- Yeah, exactly right. And then, for your traffic going up, you have additional costs in terms of your instances now, but remember you're taking away what's...
- Taking away a lot...
- Which is the VGW. You used to have the cost of the VGW, and then the VPN tunnel that was there before, so you're replacing it with something smaller, something more scalable, something that you can actually control the scale on rather than just whatever's there. So you're giving and taking on both of those perspectives. And from the debugging and troubleshooting perspective, you've actually added a lot more value by having real instances in there rather than all the complexity of "I don't know what's under the hood of the VGW."
- Yep. Okay, so I think we've got a much simpler design. Let's just think through some best practices here. Obviously you wanna create this design, socialize it, and document it as best you can, so that you get buy-in from all of the teams that would be impacted by or benefit from this design.
- So there's probably a bit of a DevOps process. There's gonna be some management buy-in or some executive sponsorship that would help
- at this stage. Are there any other best practices? We've got a CIDR block here, we still have some BGP to negotiate and manage, so we still have some routing configuration to do, right?
- Yeah, exactly. When you think about that from a separation-of-roles perspective, your network engineers are now controlling all that, and they basically decide what gets propagated up. So you have a few more best practices to consider. Like we talked about earlier with the CIDR overlap, we need to be careful on this Transit VPC, but beyond that you don't need to worry as much. You're gonna be propagating some routes coming up from the data center side, but now, since you have the gateways and the controller itself running here, you can monitor for overlaps. Rather than bringing down the whole system and taking down our environment, you can get an alert before you actually put a VPC in. And now that you've got gateways, you can use NAT and handle that all from the simple interface. So even if you don't want to understand NAT, you understand that I've got an error and that there is an overlap. So I know how to fix that, and I don't need
- That's so much easier.
- To call my...yeah exactly
- It gives people a bit of flexibility in what they can do as well, which I think is good, just in terms of staging and PROD. Alright, so we talked about perhaps adding replication and looking at using other providers as well, so how would we do that? Let's envisage we've got a Microsoft server that we want to add. We shift the database to Azure.
- Yeah, right.
- How would you do that?
- Yeah, there are a couple of different approaches that we've seen folks take, but the more natural one is this. If you've got your database, where we talked about using Azure, and we assume the AWS server here, now you've got your VNet inside of Azure, and this VNet is where your database is gonna be. The natural approach is another gateway here, again multi-AZ, connecting straight down to your Transit VPC, and now you can connect seamlessly between production AWS and the production Azure VNet.
- So all I need to do is install this agent on a host inside that Azure network
- Exactly right, and then you don't need to know any more. The beauty of having a centralized controller and a centralized console is that you can do that all from the console, and once it's there you have one console to go to. You don't need to go to Azure, to AWS... It keeps the complexity down a little bit more and makes it easier for operations folks.
- Okay, any sort of reporting or other sort of value adds we'd get from using this one controller? Cause I really like it, I think it solves a lot of problems.
- Yeah, and I guess it's following the natural flow of what ops teams are used to. It's like having the portal from Azure or the console from AWS: a single place to go. What's nice about the controller from the operations and reporting perspective is that you get all the data about what the latency is between my development and my transit, or how long the latency is from my development all the way down to my on-premises environment, which is what you're typically looking for.
- Looking for that.
- Especially from the perspective of: I'm looking at my logs in production, I've seen some long times for a couple of requests. What's going on? You get a really simple view, because with a centralized console you have one place to go and look for that. You can see, oh, there's something up with this tunnel here, let me go investigate that, or maybe we've got a problem with the Direct Connect, and you can go there. So it's all in one place for you rather than hunting around for it.
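As a small illustration of why per-hop latency in one place helps with this kind of investigation, the sketch below picks out the slowest segment of a path. The hop names and millisecond values are invented for the example, and `slowest_hop` is a hypothetical helper, not a controller feature.

```python
from typing import Dict


def slowest_hop(latencies_ms: Dict[str, float]) -> str:
    """Return the name of the hop with the highest measured latency."""
    return max(latencies_ms, key=latencies_ms.get)


# Invented measurements for a request travelling from a spoke to on-premises.
path_latencies_ms = {
    "prod spoke -> transit": 1.8,
    "transit -> on-premises": 42.5,
    "on-premises -> database": 0.9,
}

print(slowest_hop(path_latencies_ms))  # transit -> on-premises
```

With the measurements side by side, a slow request points at a specific tunnel or link instead of prompting a search across several consoles.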
- What if we had... imagine we just had no cost constraint. Would we go for another Direct Connect in here, perhaps?
- Yeah, that's right. With no cost constraint, another Direct Connect gives you a really...
- And then the other Direct Connect gives you full redundancy, right? Typically in the field you might see a Direct Connect, and you might see a VPN tunnel as a secondary... You do see Direct Connect in common use. Lots of folks use that, so another Direct Connect is a very common way of doing that.
- And we can add multi-regions so we've talked about that so that could easily be done. So if we wanted to move to another region as well
- Yeah exactly right. No problem at all. Multi-region and multi-account. The nice thing about having another account is that it gives you the ability to separate out your cost controls, do all the things all your folks want to do internally to help manage where the costs are going. You don't have to worry about saying oh, it's a separate account, that's gonna be too hard, too much complexity. With something like this, with a system where you've got a centralized controller and individual gateways, then you just do the same thing as you did with your Azure, there's no difference, it's just another landing for your connectivity.
- I love it, because straight away, the minute we do introduce other providers, there is a bit of complexity that goes with that, so if we're adding a little bit more management transparency, or just simplicity, that's gonna help a lot. So I think we've got a lot more agility for a start, and we're able to move quicker. We don't have those... we had
- Yeah, on the agility side, if we look back at what we had before, remember, we had one place down here where there was a choke point, where you were just like, there's no network engineer, and you're waiting, and now there's a security guard. Now your cloud ops can all operate on their own. There's a transit here where they're connecting, so this is all in cloud ops. So if you wanna add another account, or Azure or Google, in this case, you can just do it on your own. Your cloud team has that flexibility rather than waiting on some other team, and so you have that agility.
- I love the differentiation of the roles as well. I think a very key difference in a mature design is that you've actually worked out the roles quite clearly, so that you haven't got one person having to do three or four different things.
- Exactly right, yeah
- Which is great. When you're first starting off, you just tend to do what you need to do.
- That's right, and the thing about that is, now there's no more fighting. I guess it's a natural way of separating, and it's a very clear line. No longer are you saying, "Well, you guys are idiots up here for doing this." You don't have to worry about any of that stuff anymore. It's handled for you, or there's a clear line where you're saying this is clearly a networking problem versus an ops problem.
- So what other best practices could we help with here? What else can we do up front to help? I think the benefits of using a transitive network solution are evident. I can see that. I'm thinking it's always gonna be about selling this in: selling it in to the management team to get the buy-in to do this design. But we also have some good opportunities to improve operational efficiency by documenting things correctly, like a script library.
- Exactly right, yeah
- Having it perhaps part of the Center of Excellence if you are going through a transition, that could be quite useful too.
- Yeah, especially once one group has gone down this path. I think you just follow this path, and you've got all the learnings written down and documented at that point, so it's all in one place. That's one of the nice things about this architecture. It's not throwing something over the fence and saying, "Hey, network team, go replicate it for me." It's all there for you. It's all very clean.
- And this is also a reference architecture in AWS Quick Start.
- That's correct, yeah. Exactly right, yeah.
- What is it, what is it called the ...
- The Global Transit Architecture
- Yeah there's that tip light where...
- That's right, exactly right. And that's the other good point: AWS is basically promoting this architecture because of all the things you saw. From a buy-in perspective, if you're selling it internally, that just makes it so much better. Most people trust what AWS does, because AWS has seen so many environments; they know what works and what doesn't. This was one of the architectures that they've seen implemented, for the reasons we just outlined: the simplicity, getting more agile. And AWS has seen how customers want agility up in the cloud. They wanna go fast. That's what they're all about. They found that this architecture...
- Okay, so that's gonna be another great saving and time saving... cause you can start with these blueprints.
- Yeah, exactly right.
- Fantastic. I think that's a really good solution.
- Yeah and we see a lot of customers using it and it works really really great.
Head of Content
Andrew is an AWS certified professional who is passionate about helping others learn how to use and gain benefit from AWS technologies. Andrew has worked for AWS and for AWS technology partners Ooyala and Adobe. His favorite Amazon leadership principle is "Customer Obsession" as everything AWS starts with the customer. Passions around work are cycling and surfing, and having a laugh about the lessons learnt trying to launch two daughters and a few start ups.