Learn and understand the holistic approach to solving a problem and go beyond feature matching with this curated course from Cloud Academy in partnership with Calculated Systems LLC.
Engineer-centric, this course, split into four modules, will cover a broad range of topics to deepen your knowledge in the realm of AI and IoT. Starting with gathering business requirements and matching capabilities to technologies, you will move on to collaboratively building solutions to problems with stakeholders, all leading to being able to build and develop key processes and solutions. The core development takeaways from this Calculated Systems course include developing process, technology, and executable solution architecture, all culminating in picking the right solution for your team and enterprise to successfully move forward. An important addition to this course will be found in the final module, where you will learn to expertly identify gaps in both technology and training.
This course has been produced by Cloud Academy in partnership and collaboration with Calculated Systems LLC.
- Understanding and evaluating business and technical requirements
- Learning to collaborate and reach out to discover and resolve improvements with stakeholders
- Analyzing how to develop and architect solutions
- Being empowered to take the initial steps toward implementing technologies
- This course is geared to engineers, solution architects, and team leaders.
- There are no prerequisites before starting this course
For module one, we're gonna start by defining the requirements. The most important part to start at, is understanding why the company cares. Basically, why are they funding this? What will the realization be? What niche are you filling? By understanding these factors, you can start to understand the mindset of the people who are ultimately gonna have to judge whether or not the project was successful. So to start with these three questions of why does the company care, what is the realized impact to the business, and what specific niche are we filling, let's examine this in the concept of the IoT automotive case study. So in this case, the company has publicly stated to have a level three autonomous car. A level three autonomous car is conditionally autonomous, not quite fully self-driving, but it's a big statement, if they're making that to investors.
Also, a lot of other companies out there have started to make self-driving cars and if this company doesn't get something on the market soon, they're gonna begin to lose face, particularly as they are a luxury brand. And when we think about, why did they make this claim? Who cares? Ultimately, whose pocketbook is it gonna come out or benefit? The company believes there's an upsell potential. They believe that a self-driving car is something that will sell. And also, investors have an expectation as a result of these statements of the new sales. Frankly, investor expectations will drive a lot of business and upsell potential also gives you an understanding of the magnitude of the business that you're working with. A small impact to the company will often fall by the wayside, to put it frankly, versus a larger project will definitely get the resources it needs to be driven through.
But beyond just the company-wide goal, where in the project specifically do we sit? To be assigned the entirety of the self-driving car project is unlikely, and you're most likely going to be assigned a sub-part of it that will fit within a larger vision. So the project, specifically, that we're going to be assigned and tackle today is that the data from the vehicle is being generated but has nowhere to land. This is a classic data movement problem and it is our responsibility or problem, depending how you look at it, to land the data and store the data, and there's an optional objective to enrich the data. Oftentimes the engineers in this project want additional information, such as maybe the city that the GPS coordinates are located in, and if that could be solved during the flight and landing, that is a huge advantage. The most critical need, though, is to get a prototype up and running with a path to production. But vehicle engineers are unable to collaborate with analysts today. So in this case study, looking at these parts, you begin to see which particular departments are going to have a stake in the solution.
At the crux, it looks like we are in the middle of the entire project. The project cannot move forward at full speed without this component. Furthermore, the vehicle engineers are gonna be pushing for this to get solved, so that they can have the analysts begin to look at their data and the analysts' quality of life is either gonna improve or they might not be able to even do their job as of right now. So now that we understand where we sit within the company what does this solution even need to do, beyond how it's going to move data from one site to the other? What type of technology do we need? Is it going to be big data? Is it going to be small data? Is it going to be fast data? How do we even begin to narrow down the scope of what we're going to approach? For this, we typically look at two main problems and that will begin to give insight as to, at least, the magnitude and the class of technology we're looking at, and that would be the three Vs of volume, velocity, and variety of the data, and how do people want to interact with the solution? To dig into each of these, let's start with the three Vs. This can be applied to almost any use case and it really starts to tell you, should we be looking at Hadoop? Should you be looking at a local laptop? And this is really where you should start on most problems in the data domain. So velocity, this is the rate in which events come in. If you're familiar with fluids or plumbing, there's two types of velocity.
You could think of it as the rate at which something is flowing volumetrically, that is the amount of data coming in. This could be measured in gigabytes per second or you could look at it as a pure velocity of events per second. Those two are often tightly related but it's important to understand the difference of them because landing few larger messages is a little bit of a different challenge than landing many smaller messages. The second V, for volume, is not quite as complex but there's two different types of volume to understand. There's the dynamic volume and the cold storage volume. A dynamic volume is stuff that is going to be rewritten regularly, while the cold storage volume is stuff that might be archived. Now, if you're familiar with different cloud solutions or even different types of on-premise solutions, they'll often designate data hot, warm, cold, glacier, archived. It's up to you in this case to understand and set your own thresholds but just keep in mind the amount of dynamic versus archival data that we're gonna be facing.
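As a rough illustration of the two views of velocity described here, the sketch below converts a volumetric rate into an event rate. The function name and the sample numbers are assumptions chosen for illustration, not figures from the case study.

```python
def events_per_second(bytes_per_second: float, avg_message_bytes: float) -> float:
    """Convert a volumetric rate into an event rate for a given average message size."""
    if avg_message_bytes <= 0:
        raise ValueError("avg_message_bytes must be positive")
    return bytes_per_second / avg_message_bytes


# Hypothetical stream: the same 10 MB/s of data, packaged two different ways.
STREAM_BYTES_PER_SECOND = 10_000_000

many_small = events_per_second(STREAM_BYTES_PER_SECOND, 1_000)       # 1 KB messages
few_large = events_per_second(STREAM_BYTES_PER_SECOND, 10_000_000)   # 10 MB messages

print(f"1 KB messages:  {many_small:,.0f} events/s")
print(f"10 MB messages: {few_large:,.0f} events/s")
```

The bytes per second are identical in both cases, but the landing layer sees either ten thousand small writes or a single large one each second, which is why it helps to capture both numbers when gathering requirements.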
And finally, variety. This is how much does the information change over time? So you may imagine that a bank transaction, for example, is going to be a low variety. You're always gonna get money in, money out, time stamp, and maybe that's it. In some cases though, the flexibility may be high. Common cases where this happens are when the application is still under development, or perhaps a research and development project where we're not sure what the schema will be when the data finally comes in from the field. So how do these help you answer the question? Well, once we understand them, we can start to narrow down the scope of the project. By understanding the volume, velocity, and variety, we can start to say, is it a MySQL application? Is it an Apache Hive application? Is this a big data problem? Is this a small data problem? And the reason this is important to understand upfront is that it helps inform whether or not this is a reasonable project. As the project coordinator or engineer, it's your responsibility to be able to inform corporate or your manager if this isn't the right setting or if expectations are off. It's best to call it out now if you think that there's a mismatch between what corporate's expectations are and what you're scoping the project to be.
An example of this would be if they think it's going to be an easy and quick project, however, you're being pointed towards a large, big data system and there's not the budget or the willpower to support such an endeavor. But, assuming everything's in line, rough idea, remember, we don't need a full concept idea yet, we can move on to how do people want to interact with the solution? So when we talk about who's interacting with it, we can break it into a few people. We have the internal and external customer. So is an external, from your client, or maybe from your professional services partner, going to interact with this? Or is it just going to be your immediate team or is it gonna be another team? What is this person's ideal solution? Basically, we need to understand how this person would want to interact with it, if everything was perfect. Even if we can't meet these needs, 'cause sometimes they're just completely unreasonable, it's important to understand what their perspective is, 'cause it'll influence what they will like. And also, it's important to understand what this customer, remember, both internal and external, what their technical proficiency is. If we're designing a project for a bunch of Excel users, we might have a very different vision than if we're designing a project for a bunch of heavy-duty Java developers. And once again, this helps narrow down the scope of what the solutions are.
If it's a series of Excel users, perhaps business intelligence dashboards are going to be more applicable, versus the Java developers might prefer a well-featured API. Also, it could help inform the type of solution, such as some users may only be familiar with SQL, while others may know NoSQL, and other people might even know some more niche, specialty databases, such as Cloud Spanner or Athena. So if we were to take the three Vs and how people want to interact with it and apply it to the automotive case study we've been discussing, it starts to paint a clear picture of what this solution will look like. So in querying and talking to different departments, different teams, we're able to build a consensus between both the automotive engineers and what the analysts have traditionally observed, that we should expect a peak load of around 50 gigabytes per hour. Unfortunately, at this early stage of the project, we don't have a good understanding of whether that will peak in like one or two minutes, perhaps when they're unloading an R&D test vehicle, or whether this will be an average over those hours. So this helps inform us that we need a large system that could probably scale out further, just to help handle these spikes, or maybe it points to an elastic solution based around the cloud. Furthermore, when we started examining the volume, we actually noticed that the business was missing a requirement.
There wasn't actually a good understanding of how many years of data we wanted to retain, or if one year, two years, three years, or maybe only six months was enough. Talking to the analysts and based on what they've seen, planning around one year of retention was most likely a good start, and we already know that we need a scalable system from the velocity, so a scalable solution for the volume also makes sense. So if we were to start this project out assuming one year of retention, which the analysts agree is probably a good minimum, it points to at least 400 terabytes of uncompressed raw data. Now, we should be able to assume we get a good compression ratio, which will help us decrease the total amount of storage needed. However, derivative data sources also begin to appear. As a side note, in many big data systems, you shouldn't use the entire disk space for storage, so if we're assuming we need 400 terabytes of data, it's safe to assume we should provision 800 terabytes of disk. Now, this means we're designing a mid to upper hundreds of terabytes solution. So it's absolutely in big data range, but well within the capabilities of what traditional big data solutions can begin to offer.
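The sizing arithmetic can be checked with a short back-of-the-envelope sketch. It assumes the 50 gigabytes per hour peak is sustained around the clock (an upper bound), decimal units (1 TB = 1,000 GB), and a 2x disk headroom factor matching the 400 to 800 terabyte rule of thumb mentioned here. The function names are hypothetical, and the two-minute burst window is just the scenario raised earlier, not a measured value.

```python
HOURS_PER_YEAR = 24 * 365
GB_PER_TB = 1_000  # decimal units, an assumption


def yearly_raw_terabytes(gb_per_hour: float, retention_years: float = 1.0) -> float:
    """Uncompressed raw data retained, assuming the hourly rate is sustained."""
    return gb_per_hour * HOURS_PER_YEAR * retention_years / GB_PER_TB


def provisioned_terabytes(raw_tb: float, headroom_factor: float = 2.0) -> float:
    """Disk to provision when only part of each disk should hold data."""
    return raw_tb * headroom_factor


def required_gb_per_second(total_gb: float, window_seconds: float) -> float:
    """Ingest rate needed to land total_gb within window_seconds."""
    return total_gb / window_seconds


raw_tb = yearly_raw_terabytes(50)          # 438.0
disk_tb = provisioned_terabytes(raw_tb)    # 876.0

# The same 50 GB arriving evenly over an hour versus in a two-minute burst.
steady_rate = required_gb_per_second(50, 60 * 60)
burst_rate = required_gb_per_second(50, 2 * 60)

print(f"raw: {raw_tb:.0f} TB, provisioned: {disk_tb:.0f} TB")
print(f"steady: {steady_rate:.3f} GB/s, burst: {burst_rate:.3f} GB/s")
```

Under these assumptions the sustained-peak figure lands a little above the 400 terabytes quoted, and the burst case needs roughly thirty times the ingest rate of the steady case, which is why the shape of the peak matters as much as its size.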
And when we discussed the message variety and message formats, this is where we got a very high degree of variability. It turns out that the test cars all have different types of sensors, all running different types of firmware, and we really can't rely on a reliable message format. Basically, different cars are going to send different messages in different formats. We do have the ability to enforce some standards but the variability is gonna be immense. However, it was pointed out to us that we do have one thing on our side and that's messages are self-contained. One message will not influence the next message, so we can treat them as parallel and allow the schema to drift between them and not worry as much, as if we had chains of linked messages. When we started to examine who would be interacting with our solution, we really identified three distinct types of users. We had the automotive engineers, the analysts, and research and development. Then of course, they all had competing priorities.
The automotive engineers are the people working on the car and they of course want immense flexibility. They want it basically a fire-and-forget, where they can send us any data, in any format, and we'll store it and be happy and not pester them. Now, requests like this can be hard to accommodate but it does capture their perspective. Any time that they spend standardizing their message format is lost research and development time so we should investigate if we're able to meet these requirements 'cause it'll actually speed up the development of the car overall. The analysts, conversely, wanted pretty much the opposite. They wanted a highly reliable structure so that they could run their repeated queries. We also discovered that they had a strong preference towards SQL-based solutions. So this is where it starts to get complicated between these two groups of users. The two primary groups of users, one wants a completely unstructured fire-and-forget and the other wants a structured system for running queries, which potentially points to an immense amount of internal ETL work, where we have to extract the incoming messages, transform it, and then load it for the analysts.
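One way to picture the tension between fire-and-forget messages and a structured, SQL-friendly layout is a small transform step that pulls out the fields the analysts rely on and keeps everything else untouched. This is only a minimal sketch: the column names, the sample messages, and the assumption that messages arrive as JSON objects are all hypothetical and not taken from the case study.

```python
import json

# Hypothetical columns the analysts query on; not defined by the course.
ANALYST_COLUMNS = ("vehicle_id", "timestamp", "speed_kph")


def transform(raw_message: str) -> dict:
    """Turn one self-contained JSON message into a fixed-shape row.

    Known columns are extracted (None when absent); any other fields are
    preserved in an 'extras' JSON string so schema drift loses no data.
    """
    record = json.loads(raw_message)
    row = {column: record.pop(column, None) for column in ANALYST_COLUMNS}
    row["extras"] = json.dumps(record, sort_keys=True)
    return row


# Two test cars sending different fields, as described in the transcript.
messages = [
    '{"vehicle_id": "car-1", "timestamp": 1700000000, "speed_kph": 42.5}',
    '{"vehicle_id": "car-2", "timestamp": 1700000001, "lidar_fw": "2.1", "wheel_temp_c": 61}',
]

# Messages do not depend on each other, so each one is transformed on its own.
rows = [transform(message) for message in messages]

for row in rows:
    print(row)
```

Because each message is handled independently, a step like this could in principle be spread across many workers, and the engineers can add new fields without breaking the columns the analysts already query.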
So we're gonna have to be creative and carefully examine the options of how we can design a system that can accommodate these without making an IT swamp. And then there's also a third type of user, although they weren't quite as numerous, is that research and development, outside of the main production analysts, would want some exposure to the data. Basically, we can expose them the same data sets the analysts have and it just means that we need to make a smart system that can share data sets among multiple teams. In my opinion, the presence of R&D more points to a need for permissions and security, rather than any additional technical requirements beyond what the analysts are imposing. The last major factor that needs to be defined at this stage is the business importance.
How critical is it to have high availability and disaster recovery? Basically, this could, in some cases, double the infrastructure costs for the project, if we need to keep an active-active solution. In fact, if the replication is a third-party tool, it might do more than just double. If we needed an active-active, it might be two and a half times the cost, versus just having a single site. So to define the differences between high availability and disaster recovery, high availability is the ability to withstand a standard server failure. This is when there's a problem in the data center but we're going to replicate and distribute the data to handle this. A lot of big data solutions do this out of the box. However, not every solution has built-in disaster recovery. This is if a data center fails, how long can the application be down? How much data can be lost and how do we recover? You might hear the terms RPO and RTO, for recovery point objective and recovery time objective, in reference to disaster recovery, but we're not going to go too in depth on this in this course. So just to bring this back to the IoT case study, we basically have two identified phases. We have development and production. In development, we need to be able to withstand one or two servers going down.
And in production, well, that hasn't been defined yet, but we've already determined we're designing a highly scalable system. We should be able to just scale out any of the big data technologies to handle this. And disaster recovery also has two phases. In development, the application can go down for long periods of time and data loss is also acceptable during development. However, management has asked us to have a disaster recovery plan for the long term, so any technology we implement, even if we don't need to go into this in depth now, should have a plan that we're ready to propose on day zero for how we can bring us up to production over the coming months. So in summary for module one, gathering the business requirements is a critical upfront step. If we don't understand who is paying for it, why they care, and what they really want to see out of it at the end of the day, we're probably going to be off the mark. After we understand the business requirements, we need to define the functional requirements.
These are derived from factors such as what the users want, what we actually need to do technically, and how that all relates to the previously gathered business requirements. And all of this should really happen before we start to define the technology at all. I know we took some steps towards saying, this is going to be a big data solution, it's going to be in the upper hundreds of terabytes range. We do not need disaster recovery but beyond that, I really warn people to not define the project too tightly yet, as we're going to go into a more in-depth exploratory and requirement understanding phase over the next few modules.
About the Author
Chris Gambino and Joe Niemiec collaborate on instructing courses with Cloud Academy.
Chris has a background in big data and stream processing. Having worked on problems from the mechanical side straight through the software side, he tries to maintain a holistic approach to problems. His favorite projects involve tinkering with a car or home automation system! He specializes in solving problems in which the data velocity or complexity is a significant factor, maintaining certifications in big data and architecture.
Joe has had a passion for technology since childhood. Having access to a personal computer resulted in an intuitive sense for how systems operate and are integrated together. His post-graduate years, spent in the automotive industry around connected and autonomous vehicles, expanded his understanding of streaming, telematics, and IoT greatly. His career most recently culminated in a 2+ year role as the Hadoop Resident Architect, where he guided application and platform teams in maturing multiple clusters. While working as Resident Architect, he architected a next-generation Self-Service Real Time Streaming Architecture, enabling end-user analysts to access streaming data for self-service analytics.