Join cloud experts Neel Kumar and Mike McLaughin from Aviatrix for a technical chalk talk on how you can solve some of the common issues that can occur when running cloud networking at scale. This group of chalk talks and technical demonstrations provides a practical reference for how to solve complex cloud networking challenges. First, we outline the common architectures and issues faced when scaling cloud architectures, then we workshop a transitive architecture use case defining best practices and design patterns. We discuss multi-cloud implementation, provider limits, hub and spoke architecture patterns, VPN and connectivity. Next, we set up a transitive controller in the AWS console with two instructional demos.
Learning Objectives
- Recognize and explain the common issues that occur when running complex cloud networks
- Describe and implement transitive architecture designs using a hub and spoke model
- Implement and maintain VPC connectivity at scale
Intended Audience
This course will suit anyone running or planning to run cloud services at scale.
Prerequisites
An understanding of cloud networking and the AWS Virtual Private Cloud will help you gain the most from this Chalk Talk.
We recommend completing the AWS Networking & Content Delivery learning path in order to gain practical knowledge and hands-on experience if you are not familiar with cloud networking and the virtual private cloud.
Content Overview
First, we outline the common architectures and issues faced when scaling cloud architectures, then we workshop a transitive architecture and design pattern. Next, we set up a transitive hub in the AWS console with a hands-on demo, and discuss the following:
- Cloud Networking - The Common Journey
- The Common Patterns with VPC Design
- Designing a Transitive VPC Architecture
- Managing Network Security at Scale
- DEMO - Setting up a Transitive Controller
- DEMO - Setting up a Transitive Hub
Aviatrix.com
Aviatrix is an Advanced AWS technology partner highly regarded in the cloud community for helping AWS customers solve advanced networking challenges.
I strongly recommend reading more about Aviatrix on their website at www.aviatrix.com.
Aviatrix have a number of AWS quick start architectures at the links below.
https://aws.amazon.com/quickstart/architecture/aviatrix-global-transit-hub/
https://aws.amazon.com/quickstart/architecture/aviatrix-user-vpn/
Feedback
If you have any questions or suggestions for this course, please contact Cloud Academy at support@cloudacademy.com.
If you have any questions for Neel or Mike, you can contact them directly at info@aviatrix.com.
- So we see a lot of really interesting problems in the field, you know. There's always something challenging out there. One of the recent ones I was working on was with a customer who's just starting their journey to the cloud. They have a traditional data center.
- So it's really?
- Yeah, exactly. Where everybody's starting. They're down here and they're on premises, and they actually had a couple of data centers.
- Yeah.
- They start out with just one and they're thinking, "I've got this production application, so I need another data center. I've got to scale it quickly. I've got to get it offered up to customers very fast." The data center, as everybody knows, is just too slow. They said, "The time to move data is BS, let's just move to the cloud."
- Yep
- So as with anyone, you start off, you grow to the cloud. Just like Neel was describing, you start with a single VPC, one or two. In this case, they had a single VPC because they needed to move fast. They had a production situation where they had an application that needed to get out and quickly go to a lot of users, right? The way to do that is AWS, right, in the cloud. They started with their one VPC, put their instances in there and got going really quickly. And that worked great for them. The problem was that that VPC still needs to connect to their data center.
- They still want connectivity, right?
- Yeah, their application is here and their database is down in the data center. Very traditional kind of approach. You don't have time to move your database, but you have time to get your application up there and quickly installed in the environment in Amazon. When they first started looking around, they realized Direct Connect is the answer. They went and started looking at that, and they started provisioning Direct Connect and found out, wait a minute, that's gonna take me 20-25 days. So, they started this provisioning process of working on the Direct Connect. But it was too slow for them to get going now.
- Because you have to provision Direct Connect through a partner usually, right? So it takes a lot longer than most people think.
- Especially when you're first starting out you don't realize that. You don't realize this is a problem, I've got to go get my Direct Connect and it's not tomorrow, it's actually a couple weeks away.
- At least! You don't know how long it's gonna take.
- What do they do in the meantime?
- In the meantime, they did the traditional VPN connection, an IPsec tunnel, by coming up here and turning on the VGW.
- So, that's a customer gateway?
- That's right, this is actually to your router on premise, represented in AWS by the customer gateway.
- IPsec. Then a traditional IPsec tunnel going across. This worked great for them because the database wasn't getting pounded. It was just a typical web application with normal levels of throughput needs. They built this connection, worked great. Got their users happy, everyone was happy in the management team, so they were able to release. The next thing that happened, though, was their developers are saying, "You moved the application to the cloud, now how do I test this? I've got bugs that I know are related to the way I've put it in AWS, so how do I make this work so I can actually test it before we roll out the next version?" So, that actually led to another VPC being created. Now what we get is a dev VPC. We'll call this one Prod now, right? Now you've got your Prod VPC and you've got a Dev or you might have a Test VPC, right? So that led to the natural kind of flow of putting the application here, testing it out here and then moving it into production. That sort of led next to having a third VPC where you have a staging environment to stage it and so on.
- Common CI/CD problem, right?
- Exactly right, you end up with multiples of these. You might have another environment where you do testing versus development.
- Yep, yep. Each one of these had the exact same needs. As you're waiting on this Direct Connect to develop, you do the same thing. You already learned how to do this, so you build another one down to the data center. You do that again with each of your VPCs. As you go you might have more of those. This is what this customer did. They ended up with three VPCs.
- That's the natural thing that you do,
- Yeah, exactly!
- 'cuz you need to have this working.
- That's right!
- You can't just sit around waiting, so you put a connection in place.
- Exactly, and the thing is they're being pushed by their ... Just like everybody, right? Gotta get this out, gotta get this out, we'll deal with it later, we'll fix it later. Don't worry we'll be fine, right? So, that's in your one account here.
- So that's three separate IPsec tunnels.
- Yep!
- They've all got pretty much the same bandwidth, only one region. I wonder what would happen here if you had to look at another region, that could happen, right? That for sure happened, and these guys actually had something very similar. This for them was US East 2, they were in Ohio, and they had a need to be in New York for the same application basically, so they needed to replicate that environment. You can imagine that coming over here to another region, US East 1, and the same thing gets pushed to here, your three VPCs: dev, staging, and production. Those same VPCs need the exact same connectivity down here. You go through the same process. Interestingly, this customer ran into another problem along the way, though, because when these VPN tunnels need to be established, you actually have to connect with your router.
- There's a lot of them, I'm seeing a lot of BGP, a lot of security briefs and it's like oh.
- That's what happened, in fact their security was where the problem came about. Their security team said, "Hang on a second, this is in a different region, this is a whole new region that we haven't even looked at yet." So it actually got hung up here. This line that was down here, these three lines actually got into a queue, waiting for the network team to approve it from the security team's perspective. So they had a little bit of down time from that perspective. The complexity here was growing, but so was the amount of going through, back and forth, and approving, and compliance and all of that.
- Everything is slowing down with more complexity.
- So this is one application so if we picture this as our account wrapping around this...
- Yes.
- Then what happened is, another application comes on board, and the exact same thing happened to them. It was in roughly a parallel, they weren't doing these in serial because it's teams working separately.
- Two teams working separately, it's so nice.
- Exactly, you're not doing these one after another, you're actually doing these kinds of things all together. They had the same sort of needs over this other application. You can picture having this exact same picture drawn again. With, again, six lines, so at that point they said, "Hang on a second, this is becoming a mess, how do we make this more clear?"
- I've had that situation so many times, especially as you do have teams working independently. Often, you don't find out about the level of crossover until it's too late.
- That's the problem, you don't see it until the point where you've gotten too far into it.
- Is there anything else that was starting to raise itself as a bit of a problem? We've got security briefs, security management, the amount of connectivity, possibly using BGP routing is gonna get messy, as well. What's happening down here, like we've got a lot... Who's managing all of this?
- Exactly right, and that's where things became problematic for these guys. We have another copy of this somewhere else. One of the teams, in their development environment, got a new VPC and had the connection established, which involved another team they had to work with, as we went through before. When they had a problem in that development VPC, though, one of the instances came up trying to get to the database, and the developer, basically, what's he gonna do? He can't debug this 'cuz, like you said, it's a bunch of BGP sessions. At this point you've got Direct Connect involved, so you've got guaranteed BGP sessions in there. So, it became overly complex for that developer. If you're a developer you don't really understand BGP, you don't know the routing... Plus, it's someone else's problem at that point. So these guys had that issue come up, and so we had to put a support ticket in and wait a couple of hours, maybe another day, before someone got involved, started looking at it, found the issue and debugged it. During that time, this instance isn't connected and he has no ability to get his testing done and get his work done.
- You probably got like a few like, a database sitting down here, is there any kind of replication or off site requirements, as well?
- That's exactly right, and these guys had a data center too with another Direct Connect, and then they wanted failover. So now you want to be able to say, this data center could go down, what do I do now? Well, the way it is now, you're going to have another connection going to every one of the VPCs. For each one of those you've got to call your network engineer and say, "Hey, can you help me out, and do this and help me get this established?" So it became very, very messy at this point.
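To put rough numbers on the sprawl described so far, here is a minimal sketch in Python. It assumes one VPN connection per VPC per data center, which simplifies the whiteboard picture; the function name and parameters are hypothetical and are not part of any AWS or Aviatrix tooling.

```python
def point_to_point_connections(applications: int, regions: int,
                               vpcs_per_region: int, data_centers: int) -> int:
    """Count VPN connections when every VPC links directly to every data center.

    Assumption: one connection per (VPC, data center) pair.
    """
    total_vpcs = applications * regions * vpcs_per_region
    return total_vpcs * data_centers


# One application, one region, three VPCs (dev, staging, prod), one data center.
print(point_to_point_connections(1, 1, 3, 1))  # 3
# The same application replicated into a second region.
print(point_to_point_connections(1, 2, 3, 1))  # 6
# A second application team doing the same thing in parallel.
print(point_to_point_connections(2, 2, 3, 1))  # 12
# A second data center added for failover.
print(point_to_point_connections(2, 2, 3, 2))  # 24
```

Each of those connections is something the network and security teams have to approve, configure on the on-premises router, and debug, which is why the count matters more than the bandwidth.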
- Right, so we've got a pretty messy networking infrastructure problem here.
- Yeah, for sure!
- Let's just imagine, if we could step back and design this again, or perhaps design this from another perspective, how would we do that? Any ideas?
- Yeah, there's a couple of things that came up. In fact, these guys are going through this right now. What's interesting about this problem is... They are looking at it from all the different options that AWS is providing them. So, I want to use AWS and use what I've got with the tools available to me. So, they're starting off with Direct Connect Gateway, 'cuz that's a natural approach. It can span the different regions and Direct Connects can connect to multiple VPCs.
- Yep, good solution. So then it worked out and it works fine for them. The trouble is you have the exact same problem, so they're realizing now, as they start to connect those, you still have to involve the networking team, you still have BGP sessions, you still have a very complicated environment where your cloud team can't handle the difference between what's cloud, what's the networking side, what am I supposed to work on, what am I supposed to pass off to someone else. In fact, when that situation occurred, the networking team was basically fighting with the cloud team like, "You guys are idiots, you didn't do this right."
- Yeah. The cloud team is saying, "We don't know anything about BGP, we don't know what we're doing, we didn't know we had to keep the CIDR ranges from overlapping."
- 'Cuz you have overlapping CIDR ranges by now, surely.
- Oh for sure, especially if you're connected to the data center there's no way to avoid it. These guys, when you're in the cloud you're not thinking about that, you're not calling anyone saying, "Hey, can you provision a CIDR for me that's not going to overlap and collide with something down here?"
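The overlap problem mentioned here is easy to check for before a VPC is connected to anything. The following is a minimal sketch using Python's standard `ipaddress` module; the network names and address ranges are made up for illustration and do not come from the customer in the talk.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named networks whose address ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical ranges: the prod VPC was created without checking the data center.
ranges = {
    "data-center": "10.0.0.0/16",
    "prod-vpc": "10.0.0.0/24",
    "dev-vpc": "10.1.0.0/16",
    "staging-vpc": "172.16.0.0/16",
}
print(find_overlaps(ranges))  # [('data-center', 'prod-vpc')]
```

A check like this could run whenever a team proposes a new VPC range, so a collision is caught before routes are exchanged with the data center.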
- The team are just trying to get done what they need to get done, it's simple.
- That's right, yeah, exactly right.
- That's often the problem, isn't it? You get so much pressure on acceleration, being agile, getting stuff done very quickly, and you get used to that. Often you don't find out until the last minute that you've got a problem here.
- Exactly, and that's the beauty of AWS that it's so simple to come in now...
- Simple to do.
- I can create a VPC now, I can create another instance really quickly, you can create your own CIDR. No one's telling you, stopping you from doing any of those things, so you can easily create a problem very, very quickly. You can create a mess very quickly without even knowing that you've gotten to that point.
- I can see another one coming too, there's bound to be someone wanting to use another cloud provider. You might use some AI services from Google Cloud Platform.
- Azure database is very popular, from what we're seeing now. Imagine over here that your database is maybe replicating up to Azure. Maybe you've got some other needs, we've had several different use cases over here. You might have Active Directory, and if that's the case and your database is over here, rather than in the data center, now you've got to be able to get from here to here. Now what do you do? Do you go back to the data center and come back? Which is the traditional way, so all your traffic is going back to the data center for no reason, then coming back up to here. So then that becomes even more complicated, because now when you're trying to get from Azure, you've got three different consoles. You're going to the router down here to look at a problem, you're going up to Azure. You need an expert for each of those environments. These guys typically don't know Azure as well as AWS, or they may know AWS less than Azure. So, it's that problem I'm trying to minimize and link it down into the smallest problem.
- Probably the first approach we'd take would be to use some sort of shared environment. We'd look at sharing or proxying some services in some way. How would you see people first try to solve this problem?
- The natural way, if you're going and looking at AWS to say, "I've got this problem, how do I solve this?", the first thing you're gonna run into is the transit architecture.
- Yeah, right, right.
- That's typically what we see the most. The next natural progression is to go to a transit architecture and that involves basically having a hub and spoke model, right? That becomes much cleaner, much simpler to view.
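To see why the hub and spoke picture is cleaner, the sketch below compares the two layouts in Python. It assumes a single transit hub that terminates all data center connectivity, with each VPC attaching only to that hub; real designs may use a hub per region, and the function names are hypothetical.

```python
def direct_connections(vpcs: int, data_centers: int) -> int:
    """Every VPC holds its own connection to every data center."""
    return vpcs * data_centers


def hub_and_spoke_connections(vpcs: int, data_centers: int) -> dict[str, int]:
    """Only the hub talks to the data centers; each VPC attaches to the hub.

    Assumption: one hub, one link per data center, one attachment per VPC.
    """
    return {"hub_to_data_center": data_centers, "spoke_to_hub": vpcs}


# Two applications, two regions, three VPCs each, and two data centers.
vpcs, data_centers = 12, 2
print(direct_connections(vpcs, data_centers))         # 24
print(hub_and_spoke_connections(vpcs, data_centers))  # {'hub_to_data_center': 2, 'spoke_to_hub': 12}
```

Under these assumptions the on-premises side deals with 2 connections instead of 24, and adding a VPC means adding one spoke attachment in the cloud rather than another tunnel to the data center router.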
Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing Public Cloud Services. Having built 70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+ years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.