Learn a holistic approach to solving problems and go beyond feature matching with this curated course from Cloud Academy, produced in partnership with Calculated Systems LLC.
This engineer-centric course, split into four modules, covers a broad range of topics to deepen your knowledge of AI and IoT. Starting with gathering business requirements and matching capabilities to technologies, you will move on to collaboratively building solutions with stakeholders, all leading to the ability to develop key processes and solutions. The core takeaways from this Calculated Systems course include developing process, technology, and executable solution architectures, culminating in picking the right solution for your team and enterprise to move forward successfully. The final module adds an important skill: learning to identify gaps in both technology and training.
This course has been produced by Cloud Academy in partnership with Calculated Systems LLC.
Learning Objectives
- Understand and evaluate business and technical requirements
- Collaborate with stakeholders to discover issues and identify improvements
- Analyze how to develop and architect solutions
- Take the initial steps toward implementing technologies
Intended Audience
- This course is geared to engineers, solution architects, and team leaders.
Prerequisites
- There are no prerequisites for this course
In module two we are going to begin doing technical discovery. Module one defined the capabilities that we need, while this module begins to narrow down how we are going to fill those capabilities. To use a transportation analogy, module one defined that we need to get from point A to point B in a certain amount of time. In module two we start narrowing down the ways we can fill that capability, specifying whether we should use a car, a train, or a plane. Of course, in the context of the Internet of Things case study we're doing, it's going to be a little more complex than that, so let's begin digging into it. When you begin trying to figure out what people have done before, you're going to quickly find that everybody has an opinion. Most likely this type of problem has been solved in your organization before; however, most likely the specific problem hasn't been. In my experience, most data problems fall into one of three main archetypes: the data isn't being collected, the data isn't where it needs to be, or it's unclear what to do with the data once you have it.
The data not being collected boils down to either we don't know where the data is coming from, or, even if we can describe where it's coming from, it's not being brought in in a usable state. If the data is being collected on the edge, such as by a cellular or embedded device, that counts as being collected, and moving it would be the next main challenge. That would mean the data is in some type of system that we, a partner, or a vendor controls, but it's not where it needs to be. This would be as if the data were on some FTP server and needed to be in our central database, or perhaps it's on some type of edge data collection device and needs to be brought back centrally. This could also refer to inter-data-center movement: perhaps the data is in one database but needs to be in the new analytic database. There are a lot of ways this particular problem can manifest. And finally, what do you do with the data once you have it? There's a big jump between having the data in a spot where we can analyze it or use it to define our next actions, and actually knowing the insights from it. This type of problem would be more like: all of the data is in the database, formatted and nice, but now we need a reliable way to process it. So when we begin to examine these problems with reference to the Internet of Things case study that we're using, a few insights begin to emerge.
If we ask ourselves whether the data is not being collected or generated, well, that's the main challenge the vehicle engineers are solving. It's actually out of scope for our project to collect and generate the data, although it still poses a major challenge to our project. I like to use the expression: it's not our fault, but it is our problem. Although we're not solving for how to collect and generate the data, the schemas are changing constantly, and as we learned in module one, the engineers would really like to be able to change the schemas as they go. So although we don't need to help collect and generate the data, any solution we build needs to carry a little post-it note reminding us that data collection is still a solution under way. Problem two, the data isn't where it needs to be, is actually the most relevant type of problem to the automotive case study, and it manifests itself in a few ways. The most basic form is that the data is on the vehicle when it needs to be in a database that the analysts can use.
However, beyond the physical movement of the data from the edge or the service center to the data center, there is a secondary form of the data not being where it needs to be: we need to somehow transform the schema, and potentially enrich the data, while it's in flight. The enrichment step, although it is an analyst step, might be combined with the movement step, along with merging multiple streams into one as part of that movement. To jump ahead a little, this would imply we need some type of streaming or batch movement technology with data processing capabilities beyond simple message delivery.
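To make that more concrete, here's a minimal sketch of what in-flight transformation and enrichment might look like in Python. The field names and the enrichment rule are hypothetical stand-ins; in practice the target schema and the enrichment logic would come from the analysts.

```python
# A minimal sketch of in-flight transformation: records arrive in the
# vehicle's schema and leave in the schema the analysts expect, with an
# enrichment applied along the way. All field names here are hypothetical.
import itertools

def transform_schema(record: dict) -> dict:
    """Map a raw vehicle record onto the analysts' target schema."""
    return {
        "vehicle_id": record["vin"],
        "timestamp": record["ts"],
        "engine_temp_c": record["sensors"]["engine_temp"],
    }

def enrich(record: dict) -> dict:
    """Example enrichment step; the real logic would come from the analysts."""
    record["overheating"] = record["engine_temp_c"] > 110
    return record

def pipeline(raw_records):
    """Transform and enrich each record while it is 'in flight'."""
    for raw in raw_records:
        yield enrich(transform_schema(raw))

# Merging two inbound streams into one enriched stream as part of movement.
if __name__ == "__main__":
    stream_a = [{"vin": "V1", "ts": 1, "sensors": {"engine_temp": 95}}]
    stream_b = [{"vin": "V2", "ts": 2, "sensors": {"engine_temp": 120}}]
    for out in pipeline(itertools.chain(stream_a, stream_b)):
        print(out)
```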
And finally, the third archetype: we don't know what to do with the data once we have it. Depending on how we solve the second type of problem, the movement, we might still need to enrich the data once it has landed, and we should be talking to the analysts about how they want to work with it once they have it. Even if we don't have the right technical answer, the analysts will be figuring this out as they go, so this is a bit beyond the scope of our problem; but once again, how the analysts want to use the data is going to help us define our technical answer. So although we don't need to worry about exactly how they're going to use it, we do need to connect with them. Once again: not our fault, but it is our problem.
After understanding what the problem is (in this case study, mostly a data movement problem with a few post-it notes around the edges), we need to understand whether people have tried to solve it before. The odds are that this is not a unique project, although this exact project probably has some unique details. Odds are somebody has tried to move data from one point to another in the history of this company. So meet with the stakeholders, particularly senior people, and that doesn't necessarily mean executives; it means people who have been with the company a while or have a lot of industry experience. Simply ask them: how has this worked in the past? Have you worked with anyone who has solved this? Figuring out whether there's somebody with specific niche experience, and honestly asking people's opinions at this stage, will give really good insights into what the solution should be. These types of questions, although you do not ultimately have to listen to all of these colleagues, will help build buy-in and cohesion, and honestly might save you a ton of trouble, particularly if these departments have tried to solve this problem before.
For this automotive case study, we went around and talked to the different engineers on both the analyst and the on-vehicle side of the house. As you might have imagined, they have been trading limited amounts of data as CSV dumps on USB thumb drives. In cases like this it's clear that there are some scalability problems, particularly as we approach hundreds of terabytes a year, but don't influence the responses just yet; it's important to listen, because they might have some interesting insights. One of the more senior engineers, who had been there a few years, also pointed out that beyond USB thumb drives, at one point they tried to use FTP servers to transfer the files, although for some reason that practice has fallen out of favor.
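As a rough back-of-envelope (taking the low end of "hundreds of terabytes a year", 100 TB, as an assumption), the sustained rate alone shows why thumb drives won't keep up:

```python
# Rough back-of-envelope: what does 100 TB/year mean as a sustained rate?
TB = 10**12  # bytes
yearly_bytes = 100 * TB
seconds_per_year = 365 * 24 * 3600

rate_mb_s = yearly_bytes / seconds_per_year / 10**6
daily_gb = yearly_bytes / 365 / 10**9

print(f"~{rate_mb_s:.1f} MB/s sustained, ~{daily_gb:.0f} GB per day")
# Prints: ~3.2 MB/s sustained, ~274 GB per day, every day, with no hand-offs missed.
```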
The interesting thing came up with question number two, when we asked whether anybody had worked on solving this before: they pointed out that in very early stage development they built a RESTful API on the cars, and some people had used it in a limited fashion to pull data. Once again, it seems that the people who were spearheading data collection via the REST API are no longer doing it, but the REST API is still fully featured, as confirmed at this stage. Also, on the analyst side, people had been using Python to enrich the data, along with Tableau to view it.
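For illustration only, pulling from the on-vehicle REST API might look something like the sketch below. The host, endpoint path, and query parameter are hypothetical; the actual API shape would come from the vehicle engineers.

```python
# Hypothetical sketch of polling the on-vehicle REST API. The endpoint and
# parameters are illustrative stand-ins, not the real vehicle API.
import requests

def pull_vehicle_data(host: str, since_ts: int) -> list:
    """Fetch telemetry records newer than `since_ts` from one vehicle."""
    resp = requests.get(
        f"http://{host}/api/telemetry",  # hypothetical endpoint
        params={"since": since_ts},      # hypothetical incremental cursor
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# Usage sketch: poll each known vehicle and hand records to the pipeline.
# for host in vehicle_hosts:
#     records = pull_vehicle_data(host, last_seen[host])
```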
So now we know that the analysts can describe the enrichment using Python and that they still do so, although it does seem to be a bit of a batch process. We also went around and asked people how they think we should solve it. People did like the CSV transfer, although they admitted it wouldn't scale and it was slow to load large amounts of data onto a USB drive and physically hand it over; another person piped up and asked about MySQL. Examining these answers, most of them would not be the most useful in the long run: CSV dumps on USB drives don't scale to hundreds of terabytes, and MySQL can't handle that volume either. However, it shows that the data is useful in its current state, because if an analyst can take a CSV dump, it means they can at least interpret the data, so we know that the in-flight enrichment might be minimal.
However, a counter-indicator to enrichment being minimal is the presence of Python, which could represent more complex transformations. The important thing here is that the analysts are already able to program the enrichments they need, which gives us confidence that we will be able to incorporate this already-defined logic into our streaming or transport layer enrichment. The final set of questions that we need to ask ourselves is about the long-term plans for this project. Is this going to be a cloud or on-premises application long term? Is this going to be supported by a very specific team? Is that team already selected? Are there any company policies that might influence the data or the types of projects this will be used on? In particular, is there sensitive information that we might not even be allowed to retain, or that we have to lock down? To dig into this, understanding whether we're going to go to the cloud is very important: basically the entire suite of technology changes if we're on the cloud versus on-premises, or maybe we would have to go with a hybrid approach. For example, with a cloud-first or cloud-only approach we could use technologies such as Google Dataflow or Amazon Kinesis, while on-premises only we might use something like Apache Hadoop or Spark.
A hybrid approach would target technologies that can work on both, or even be moved between the two. An example would be Kubernetes with containers running Spark inside it, or maybe just containers running Python or some other solution that can run on-premises or in the cloud. Understanding this is very important because oftentimes a solution will be optimized for on-premises, cloud, or hybrid, and you could do something like define a cloud-first approach and then learn you can only deploy on-premises, or select a technology that doesn't adapt well to the cloud and then learn you have to go to the cloud.
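One way to keep the hybrid door open is to make the job itself location-agnostic, so the same container runs on-premises or in the cloud. Here is a minimal sketch, assuming a PySpark job where the hypothetical INPUT_PATH and OUTPUT_PATH environment variables carry the storage locations:

```python
# Sketch of a portable PySpark job: the same container image can run
# on-premises (e.g., HDFS paths) or in the cloud (e.g., object-store paths)
# because the locations come from configuration, not code. Paths and
# environment variable names are hypothetical.
import os
from pyspark.sql import SparkSession

# e.g. INPUT_PATH=hdfs:///telemetry/raw  on-premises
#      INPUT_PATH=gs://telemetry/raw     on a cloud object store
input_path = os.environ.get("INPUT_PATH", "hdfs:///telemetry/raw")
output_path = os.environ.get("OUTPUT_PATH", "hdfs:///telemetry/enriched")

spark = SparkSession.builder.appName("telemetry-enrichment").getOrCreate()

# Read raw telemetry, write it back out in an analyst-friendly format.
df = spark.read.json(input_path)
df.write.mode("overwrite").parquet(output_path)

spark.stop()
```

The same image could then be pointed at an on-premises cluster or a cloud object store purely through configuration, which is exactly the portability a hybrid mandate asks for.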
In this course's description you'll find links to our courses that describe the different cloud-first technologies; they can help you understand what to consider when defining a cloud project versus an on-premises project. Also, when starting to define the technology requirements and which approach to use, we need to understand who's going to be supporting it long term. Is there a central IT group? In my experience, most larger companies will have one; however, smaller companies might not have such a team, and the same developers who created the program will need to maintain it. Particularly if there's a large central IT team, it's important to understand what infrastructure can be leveraged.
Perhaps there's an existing Kubernetes or Apache Hadoop deployment you can simply put this solution on top of. Leveraging these existing central IT capabilities could greatly decrease the effort we need to expend on deploying our solution. If your company has a central IT department, it's very important to consider them up front and treat them as a stakeholder, very much like the automotive engineers or analysts, because even the best solution will degrade over time if not maintained, and the people who will be helping you, or perhaps completely maintaining it, are going to be very important to the project's long-term viability. It's also important to understand whether you have any company policies pertaining to the data.
This could be everything from how to handle HIPAA medical data to payment card industry (PCI) data, such as whether you can retain credit card numbers. Some industries are notoriously stricter about this, such as healthcare or finance, but you still might find interesting policies in manufacturing, particularly around control data with safety implications. Understanding these requirements up front will definitely help you avoid painful stops later, such as when you're undergoing a security audit getting ready for final launch. In our automotive case study, central IT will be providing ongoing support.
They have minimal cloud experience, but they are beginning to move workloads to the cloud. They're not comfortable with a cloud-only solution and have instead advocated a hybrid solution. What this forces us to do is pass on some of the cloud-first technologies like Dataflow or Kinesis and focus on open-source or portable technologies, such as Apache projects. It also means that some of the more restrictive technologies, such as the appliance-based Oracle ones, might not be the best choice. Overall, if we can work with IT to find something that's within their core skill set but can also live on the cloud, we're going to have a more widely accepted solution. For the purposes of the case study, there wasn't much company policy that affected us during the early stages.
Analysts might face some restrictions if they start to try to attribute certain cars to individual owners, but that is out of the scope of our project right now, as it is more of a data governance problem for the analysts than anything our solution needs to account for. So, in summary, when defining the requirements from the technical perspective, start by defining the archetype of the problem. Figure out whether this has been done before and what the individual stakeholders prefer. Remember that the stakeholders are the end users, the people contributing data to the project, central IT, and anybody who's going to have to interact with the product over the course of its life. And also look for policies that might influence the final design, whether they be disaster recovery, high availability, or personal information protections.
Chris Gambino and Joe Niemiec collaborate on instructing courses with Cloud Academy.
Chris has a background in big data and stream processing. Having worked on problems from the mechanical side straight through to the software side, he tries to maintain a holistic approach. His favorite projects involve tinkering with a car or home automation system! He specializes in solving problems in which data velocity or complexity is a significant factor, and maintains certifications in big data and architecture.
Joe has had a passion for technology since childhood. Having access to a personal computer gave him an intuitive sense for how systems operate and are integrated. His post-graduate years were spent in the automotive industry, where working around connected and autonomous vehicles greatly expanded his understanding of streaming, telematics, and IoT. Most recently, his career culminated in a 2+ year role as a Hadoop Resident Architect, where he guided application and platform teams in maturing multiple clusters. While working as Resident Architect, he designed a next-generation self-service real-time streaming architecture, enabling end-user analysts to access streaming data for self-service analytics.