In this course, you'll gain a solid understanding of the key concepts for Domain Eight of the AWS Solutions Architect Professional certification: Cloud Migration.
Our learning objectives for this domain are to cover how to plan and execute application migrations, and to build our ability to design hybrid cloud architectures.
We are going to examine a number of sample questions and scenarios we’ve created to help us achieve these learning objectives. We’ll use the options presented as a way to discuss and extend our knowledge and problem solving ability.
By the end of this course, you'll have the tools and knowledge you need to successfully accomplish the following requirements for this domain, including:
- Plan and execute application migrations.
- Demonstrate ability to design hybrid cloud architectures.
This course is intended for students seeking to acquire the AWS Solutions Architect Professional certification. It is necessary to have acquired the Associate level of this certification. You should also have at least two years of real-world experience developing AWS architectures.
As stated previously, you will need to have completed the AWS Solutions Architect Associate certification, and we recommend reviewing the relevant learning path in order to be well-prepared for the material in this one.
This Course Includes
- Expert-led instruction and exploration of important concepts.
- Complete coverage of critical Domain Eight concepts for the AWS Solutions Architect - Professional certification exam.
What You Will Learn
- Essential skills for cloud migrations for the certification exam.
- Cloud migration scenarios.
So let's start with a sample question. Your company hosts an on-premises legacy engineering application, with 900 gigabytes of data shared by a central file server. The engineering data consists of thousands of individual files, ranging in size from megabytes to multiple gigabytes. Engineers typically modify 5 to 10% of the files a day. Your CTO would like to migrate this application to AWS, but only if the application can be migrated over the weekend, to minimize user downtime. You calculate that it will take a minimum of 48 hours to transfer 900 gigabytes of data using your company's existing 45-Mbps internet connection. After replicating the application's environment to AWS, which option will allow you to move the application's data to AWS without losing any data and within the given time frame?
Okay, so like most of the Solutions Architect Professional questions, there's a lot to take in and try and process into an answer in that two-minute window. So let's go back to our basic steps, and let's highlight the key facts. So the current infrastructure is on premises, we've got 900 gigabytes of data, 5 to 10% of the data is hot, so roughly 90% of it is cold, and the CTO has imposed a hard time constraint of 48 hours.
So, what is the key issue or crux of this question? It is that we cannot upload all the files we want to AWS over the existing connection within the weekend. Now once we've defined that, every other decision becomes easy, as we just need to decide what is the best option to fix that problem out of the ones we've been given.
Now only a portion of the files are modified each day. That makes an interesting opportunity to migrate a portion of these files before the proposed migration date. I mean, we could look at the backup strategy first, to see if there was an archive or backup set that we could use to shift data in that way. That would be the easiest approach.
We'd need to define a way to ensure that matching up any migrated content with our current production environment didn't become an overhead in managing versions and iterations. It may even involve parallel processing of some sort. Again, that's going back to that planning, and how do we plan to ensure there is minimal impact on the business.
The Cloud Adoption Framework can be a useful tool for this type of migration planning, and the purpose of this course is to help you be better as an architect, as well as pass the exam. So these types of tools can really help you in the real world, when you're out there facing these types of situations. So the Cloud Adoption Framework is just one tool that helps give business teams a guide and framework for what needs to be done to enable a migration like the one that they're describing here.
And the reality is there are always many parties involved with a transformation project. And they're both positive and political, and as it turns out, as an architect, it often falls onto your plate to negotiate with these many interested parties on behalf of the business. Cloud is a disruptive technology. And often that means not everyone is as excited about shifting to it as you and your executive sponsor are. So you are going to come up against a wide range of issues and objections, implicit and otherwise. And it does help to have a roadmap and a framework to follow to help you get through some of those things.
Common hiccups you can come across can be around data sovereignty. If someone in the business, like a company lawyer, or maybe an enterprise architect, has issues with moving data off site, or offshore, to the nearest AWS region, it's really good to uncover and deal with these issues earlier, rather than later.
You only need to look through the AWS case studies to see that data sovereignty is actually a bit of a non-issue, otherwise so many of the large corporates and governments that are listed there wouldn't have done what they've done. But, well, you're just gonna come across this at some point for sure. The security blog and compliance white papers can also be quite a good help here. The interesting thing I find is that often when you come up against an objection like data sovereignty, it starts as an objection, but it ends up being a positive for the migration, as often AWS can provide a higher level of durability or compliance over what a business currently has in their local data center or on premises.
Some of the other blocks that you sort of have to navigate your way round are obviously security, and the sooner you can advance and progress that discussion the better. If it's not being discussed up front in the planning stage then it's probably going to end up being a submerged rock which you'll hit later on in the project.
Network performance is another common issue that you need to own. How can we guarantee the network latency will be as expected? Of course you can't. So it's important to build the right expectations and mitigations so your customer understands why things like Direct Connect can provide a more consistent network performance, and why having multiple Direct Connect connections delivered by multiple partners can improve durability and availability.
And sometimes a proof of concept can work very well to prove or disprove some of the problems or concepts that have been bandied about. But it can also work against you if not handled correctly. Because often with a proof of concept, the modus operandi is to try and keep the costs down, so that proofs of concept are small, fast and cheap.
And so they end up being run on the smallest instances, without optimized disks or connectivity, so it is important that you aim to define an environment that is going to be as close to your proposed production environment as possible.
The Cloud Adoption Framework does provide a number of perspectives: business, platform, people, process, operations, and security. And while it's quite theoretical, it can be useful just as a template for kicking off discussions on those important aspects of any migration project.
Anyway, so back to our design. In the real world, after running our data classification exercise, we would probably want to evaluate network latency, define the compliance requirements, at least as a roadmap, and work out our backup and restore strategy: is this going to be a multi-site environment in future, or is this going to be more of a pilot light, or more of a standby environment? And again, proofs of concept can be really effective at proving or disproving and allowing you to stand up and test environmental factors. So we'd probably want to run some sort of proof of concept to evaluate all the environmental factors of this migration.
Anyway, too much digression, let's get back to these options. So option a, copy the data to Amazon S3, using multiple threads and multipart upload for large files over the weekend, and work in parallel with your developers to reconfigure the replicated application environment to leverage Amazon S3 to serve the engineering files. So multipart upload allows you to fire off a number of uploads in parallel. To do that you can use the multipart uploader from the console, or of course, best of all, use the API. It certainly is one way to upload large files, as Amazon S3 limits single objects to five terabytes. It doesn't solve the crux of our problem though, because how long will it take us to upload this content using that process? It's still network bound.
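To make the multipart idea a little more concrete, here is a minimal Python sketch of how one large file could be split into parts before they are handed to parallel upload threads. It only does the planning arithmetic and makes no AWS calls. The 5 MiB minimum part size and the 10,000-part ceiling are the documented S3 multipart limits; the function name `plan_parts` and the 64 MiB default part size are assumptions made for illustration.

```python
import math

MIB = 1024 ** 2
MIN_PART_SIZE = 5 * MIB   # S3 minimum size for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts in one multipart upload


def plan_parts(file_size: int, part_size: int = 64 * MIB) -> list[tuple[int, int, int]]:
    """Return (part_number, offset, length) tuples covering the whole file.

    part_size is a hypothetical default; it is grown automatically when the
    file would otherwise need more than MAX_PARTS parts.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum of 5 MiB")

    # Keep the part count within the S3 ceiling for very large files.
    part_size = max(part_size, math.ceil(file_size / MAX_PARTS))

    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < file_size:
        length = min(part_size, file_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


if __name__ == "__main__":
    two_gib = 2 * 1024 ** 3
    plan = plan_parts(two_gib)
    print(f"{len(plan)} parts, last part is {plan[-1][2]} bytes")
```

Each tuple could be given to a worker thread that uploads that byte range. Splitting the file this way helps with retries and parallelism, but every part still travels over the same 45-Mbps link, so the total transfer time does not shrink.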
The scenario says you calculate that it will take a minimum of 48 hours to transfer 900 gigabytes of data using your company's existing 45 megabits per second internet connection. Now the weekend is two days, which is 48 hours at best. So the internet connection alone is just not going to cut it. The second part of the option proposes using S3 to store and deliver these files, which on its own merit makes sense. But we know very little about the application and how it works, so not only is it something we wouldn't be able to recommend, having no prior knowledge, but also, that would mean redesigning the application in some way, which is likely to impact the business and ultimately, is not something we could do over two days. This one I think we just have to leave. So let's look at option b. Option b, sync the application data to Amazon S3, starting a week before the migration. On Friday morning, perform a final sync, and copy the entire data set to your AWS file server after the sync completes. So on a first look, this sounds feasible, as we can guesstimate that it would take, sort of three to five days to copy 900 gigabytes of data up to S3. And as sync is incremental, it does mean only changed data would be copied during that time window of Friday to Monday, once the majority of the data has been transferred. So on the face of it, it could work. If we go back to the crux of the issue, we still have to upload 800 odd gigabytes of files to AWS. So this option is still network dependent. All we have done really is rely on the sync command to alter the time constraint to be more favorable for us. The fact still remains, we are gonna burn a lot of bandwidth and it's possibly prone to timeouts and network fluctuations while we're doing it. So our timeline could be impacted, and the longer we have with a copy process in place, the longer we have the opportunity to have synchronization issues later. 
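The numbers behind this reasoning can be checked with a short calculation. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second); the `efficiency` parameter and the 50% figure are assumptions for illustration, not something the scenario states.

```python
def transfer_hours(gigabytes: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move `gigabytes` of data over a `link_mbps` link.

    efficiency is the assumed fraction of the link actually available
    to the copy (1.0 means the full line rate, which is a best case).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits = gigabytes * 1e9 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 3600


DATASET_GB = 900
LINK_MBPS = 45

# Whole data set at the full line rate (best case).
full_copy = transfer_hours(DATASET_GB, LINK_MBPS)
# Whole data set if only half the link is usable (assumed figure).
shared_link_copy = transfer_hours(DATASET_GB, LINK_MBPS, efficiency=0.5)
# One day's worth of changes at 5% and 10% of the data set.
delta_low = transfer_hours(DATASET_GB * 0.05, LINK_MBPS)
delta_high = transfer_hours(DATASET_GB * 0.10, LINK_MBPS)

if __name__ == "__main__":
    print(f"full copy, ideal link:  {full_copy:.1f} h")
    print(f"full copy, shared link: {shared_link_copy:.1f} h")
    print(f"daily delta:            {delta_low:.1f} to {delta_high:.1f} h")
```

At the full line rate the whole data set needs a little over 44 hours, which is why the weekend alone is too tight, while a single day's changes need only a few hours. That is the gap option b relies on when it starts the sync a week early.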
Plus, one thing that's really bugging me with this, is the cost of shifting one terabyte of data transfer, has to be considered as a factor in this. Like I said, I'm not throwing it out, I think that this is viable, it's doable. Anyway, option c, copy the application data to a one terabyte USB drive on Friday, and immediately send overnight, with Saturday delivery, the USB drive to AWS Import/Export to be imported as an EBS volume. Mount the resulting EBS volume to your AWS file server on Sunday. Hmm. Well, if you have data you need to migrate to the AWS cloud for the first time, AWS Import/Export disk is often much faster than transferring data via the internet. The timing is a little dubious, as it assumes we would have Import/Export servers in our region first of all. And that the data can be handled over the weekend as proposed. I'm probably being a bit too real world with that, but let's check and let's have a look here. But let's see, well there we go, if it's received that day, import will start immediately, on the same business day. Well, okay, it's viable. Let's just have a look at how much it would cost to use this service, because literally if we were to shift all our data through Import/Export, we just basically send the disk off, it's copied over, and it's available as an EBS volume as, from a snapshot which is a fantastic service. It is what it's designed to do. Here, if we look at the calculator here, we've got a cost estimation of $118 for one terabyte disk, well, okay, that's a pretty reasonable price. Okay, I just did a bit of research, and I found an old support page that says that Import/Export disk can only be received during business hours on business days. So I think the concept of sending it over the weekend is not going to work. 
If the option proposed sending it earlier in the week, then I think it could perhaps have some merit, but because they're literally making it so time constrained, I don't think it's the best use case for Import/Export disk. Unfortunately, because it ticks every other box. Let's have a look at the next option and see where we are.
So option d, leverage the AWS Storage Gateway to create a Gateway-Stored volume. On Friday, copy the application data to the Storage Gateway volume. After the data has been copied, perform a snapshot of the volume and restore the volume as an EBS volume to be attached to your AWS file server on Sunday. Okay, so this is another approach again. It's not a match, if you haven't worked it out already. But let's review and see whether Storage Gateway could be a match if it wasn't for this major faux pas, which I'm assuming you've probably got. If you haven't, I'll let you know in a minute.
So you recall that with a Gateway-Stored volume, when we set up the AWS Storage Gateway, we have to provide two disks. One is a write cache, and one is a local storage volume. So with Gateway-Stored, your data is backed up to this local volume, and then that data is asynchronously copied up to S3. In the Gateway-Stored volume solution, you maintain your volume storage on premises, in your data center. That is, you store all your application data on your on-premises disk. The Gateway VM uploads data to the AWS cloud, and that solution is ideal if you want to keep data locally on prem, because you need to have low-latency access to all your data at any time, and to also maintain a backup. So that differs from Gateway-Cached volumes, where you use Amazon S3 as your primary data storage, while retaining only frequently accessed data locally, in your Storage Gateway.
So would a Gateway-Stored solution work for our scenario? Gateway-Stored volumes can range from one gigabyte to 16 terabytes, so in theory, per volume, yes.
Each gateway configured for Gateway-Stored volumes can support up to 32 volumes, and a total volume storage of 512 terabytes, so 900 gigabytes is not a concern. You can create a snapshot of a Storage Gateway volume; it's a simple way of managing volumes. Snapshots are incremental, but that doesn't create an issue, as the first snapshot will contain all the data, as per any other incremental snapshot in AWS. Yes, you can use the snapshot as a starting point for a new EBS volume, which you can attach to EC2. So in principle, what is described here could work.
Is there anything missing? Let's go back to the crux of our issue. Are we still network bound with this proposal? The problem is that all our data is still not copied to AWS. So if we create the Gateway, and we provision the cache and the volume, 80% of our data is still residing on our local storage. The asynchronous backups to S3 will start, and they're gonna take the same amount of time that uploading would take if we did it via the S3 console, or via FTP, or any other upload procedure. The bandwidth between the Gateway VM and Amazon S3 is the constraint. It doesn't fix that problem.
So this is quite a difficult question, because we've got copying up to S3 using sync, running an Import/Export disk job and having the data transferred on a terabyte drive, or implementing Storage Gateway and having the Storage Gateway manage that upload for us over a period of a week or however long it's going to take. But the problem is all of those are going to take a long time, and I think the cost of shifting this needs to be factored in. And Import/Export looks the cheapest by far: if we run any of these other services, we're gonna have to shift 900 gigabytes, close to a terabyte of data, over the network, which is going to come at a cost.
As this is not a real-world scenario, no one's said anything about cost in the exam question. We have been given information about Import/Export disk being done over the weekend, which we can only assume is relevant, so I think, on the face of it, option b. Like I said, I think all three of those are feasible, but I think I'd choose that one.
Okay, let's explore some of the use cases provided by AWS in the storage white papers. So company A manages two disparate sets of information. Table-oriented data is maintained in an on-premises Oracle Database, while a SAN is used as a repository for file-based information. Now for further safeguarding of these vital assets, tapes are used for backup and for disaster recovery. They've got approximately 30 gigabytes of new information generated each day. Now the problem is, the backup and archive management processes are really cumbersome and they're getting expensive. And restoring archived information can take days to complete. So the IT team is really motivated to boost the reliability of a new storage architecture and get away from tapes, and the speed of archiving and restoring is a big factor, while cost is also a consideration.
So in this situation, the best cloud-based storage architecture would be to employ Amazon S3 as a destination for both file-based and relational data. For file-based artifacts, access to S3 could be via the AWS SDK for Java, and the AWS Toolkit for Eclipse. And for relational data, the Oracle Secure Backup Cloud Module can take advantage of the existing RMAN scripts to back up information directly from the Oracle Database into S3. And finally, the third-party storage management solution they already have can be used to manage this entire process, including the encryption and other security details.
Okay, so company B is already using a cloud-based solution, and they can extend their architecture to further leverage additional cloud storage techniques.
They're maintaining a data warehouse on a High-CPU Extra Large EC2 instance, with ten 800-gigabyte EBS volumes holding the information itself. While this architecture does meet their business needs, some new requirements may require them to extend their storage, specifically a change to organizational policy, which now means they have to do more frequent data snapshots. And these snapshots need to archive supplementary data approximately every one or two hours. So to best meet this requirement, S3 can be added to the mix. So the EC2 instances and supporting EBS volumes can continue with their current roles, and the developers can write a script or a small application that temporarily quiesces the data warehouse, and then uses the EBS API to create incremental snapshots stored on Amazon S3. The new snapshot application can run every 60 or 120 minutes, using a Linux cron job or a Windows scheduled task. Perfect.
So company C's got a digital asset management system. And they hold a lot of sensitive intellectual property in their documents, drawings, and images, and to support these services, they have a proprietary database running on internally managed servers. And as part of their ongoing drive to reduce costs and improve customer service levels, they're looking to leverage the power and scalability of the cloud. Now, electronic assets are the lifeblood of their company, so they really want to protect their information, and they need things to be stored in encrypted formats and serviced by a robust backup and recovery mechanism. They also need rapid access to data, and they need storage and bandwidth to be pretty scalable and affordable. And this new data management system needs to provide workflow capability to support some of the daily and weekly operations that they run. So a possible topology for that solution could be internally configured and controlled EC2 instances, which will act as database servers.
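Going back for a moment to company B's snapshot script, a minimal Python sketch of that job might look like the following. It assumes the caller has already quiesced the data warehouse, and it takes the EC2 client as a parameter so the logic can be exercised without AWS credentials. In real use that parameter would be a boto3 EC2 client, whose `create_snapshot` call accepts `VolumeId` and `Description` and returns a `SnapshotId`. The `FakeEC2` class, the function name, the volume IDs and the description text are all hypothetical stand-ins.

```python
from datetime import datetime, timezone


def snapshot_volumes(ec2_client, volume_ids, now=None):
    """Request one EBS snapshot per volume and return the snapshot IDs.

    The caller is assumed to have quiesced the data warehouse first, so
    the snapshots are taken from a consistent point in time.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%MZ")
    snapshot_ids = []
    for volume_id in volume_ids:
        response = ec2_client.create_snapshot(
            VolumeId=volume_id,
            Description=f"warehouse backup {stamp}",  # hypothetical naming scheme
        )
        snapshot_ids.append(response["SnapshotId"])
    return snapshot_ids


class FakeEC2:
    """Hypothetical stand-in for an EC2 client, used only for a dry run."""

    def __init__(self):
        self.calls = []

    def create_snapshot(self, VolumeId, Description):
        self.calls.append((VolumeId, Description))
        return {"SnapshotId": f"snap-for-{VolumeId}"}


if __name__ == "__main__":
    # Placeholder IDs standing in for the ten EBS volumes.
    volumes = [f"vol-example{i:02d}" for i in range(10)]
    print(snapshot_volumes(FakeEC2(), volumes))
```

A cron entry that runs the script at the top of every hour, or every second hour, would cover the 60 or 120 minute schedule described above.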
The enterprise can use Amazon's existing database-oriented AMIs as a starting point for those database builds. They would have several EBS volumes to provide scalable, secure storage and transaction log support for the database server, and S3 can be used to provide encryption and versioning of the archived digital assets. And then for the mid-tier, I'm thinking DynamoDB, which could hold key-value, schemaless information that can scale really quickly. And for the workflow, probably Simple Queue Service, which can provide highly scalable queueing infrastructure, and that can be accessed by EC2 via SOAP web interfaces.
So company D is a media company with a lot of multimedia content. Now they're storing that internally on hosted servers and publishing it. There's a lot of metadata that goes with the images. So the internal servers are failing to keep up with the demand they're getting from their readership. And the sheer volume of data will soon overwhelm the disk storage they have. So their architects decide to publish all content to AWS cloud-based storage, so they can get rid of the need to purchase and maintain internal servers. That way they can directly fetch content from AWS.
So the combined solution they're thinking of is Amazon S3, CloudFront and DynamoDB. Now this application can be built either with the AWS toolkit for Java, or perhaps templates for Visual Studio, and the multimedia content will be stored in S3. When a file or artifact is uploaded, a series of key-value pair entries can be created in DynamoDB. These entries can contain metadata related to the multimedia content, and they will be created using either the SOAP or REST API. Or maybe even using a toolkit for Java, or the .NET Framework, whatever they feel is going to be the best fit for their requirements.
And then when a user wants to retrieve a particular object, the application will search DynamoDB using the metadata supplied in a look-up form from the website, and access to DynamoDB will be via direct SOAP or REST API invocations from the application platform. And when a reference to a multimedia entry has been located, a call to Amazon S3, or a corresponding CloudFront distribution, will be made to retrieve the multimedia object itself, so as to maximize efficiency when accessing this information. And to maximize delivery, especially for hot content, things that are requested a lot, they're going to use Amazon CloudFront to handle delivery.
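The lookup flow for company D can be sketched in a few lines of Python. This is only an illustration of the data flow, with an in-memory list standing in for the DynamoDB table so that it runs without any AWS access. The attribute names, the sample entries and the `media.example.com` distribution domain are all hypothetical.

```python
from urllib.parse import quote

# Hypothetical domain name for the CloudFront distribution in front of S3.
CDN_DOMAIN = "media.example.com"

# In-memory stand-in for the metadata entries written at upload time.
METADATA = [
    {"s3_key": "images/2015/harbour bridge.jpg", "type": "image", "topic": "travel"},
    {"s3_key": "video/2015/interview.mp4", "type": "video", "topic": "business"},
    {"s3_key": "images/2015/market.jpg", "type": "image", "topic": "business"},
]


def find_assets(items, **criteria):
    """Return the entries whose attributes match every supplied criterion."""
    return [
        item for item in items
        if all(item.get(key) == value for key, value in criteria.items())
    ]


def delivery_url(item, cdn_domain=CDN_DOMAIN):
    """Build the URL the site would request for an entry's S3 object."""
    # quote() percent-encodes spaces but leaves the "/" separators intact.
    return f"https://{cdn_domain}/{quote(item['s3_key'])}"


if __name__ == "__main__":
    for asset in find_assets(METADATA, type="image", topic="travel"):
        print(delivery_url(asset))
```

The design point is the separation of concerns: the metadata store answers the question of which object is wanted, and the object itself is then fetched from S3 through the distribution, so hot content is served from the edge.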
About the Author
Andrew is an AWS certified professional who is passionate about helping others learn how to use and gain benefit from AWS technologies. Andrew has worked for AWS and for AWS technology partners Ooyala and Adobe. His favorite Amazon leadership principle is "Customer Obsession" as everything AWS starts with the customer. Passions around work are cycling and surfing, and having a laugh about the lessons learnt trying to launch two daughters and a few start ups.