Defining Project Architecture
Learn and understand the holistic approach to solving a problem and go beyond feature matching with this curated course from Cloud Academy in partnership with Calculated Systems LLC.
This engineer-centric course, split into four modules, covers a broad range of topics to deepen your knowledge of AI and IoT. Starting with gathering business requirements and matching capabilities to technologies, you will move on to collaboratively building solutions to problems with stakeholders, all leading to being able to build and develop key processes and solutions. The core development takeaways from this Calculated Systems course include developing process, technology, and executable solution architecture, all culminating in picking the right solution for your team and enterprise to successfully move forward. An important addition to this course is found in the final module, where you will learn to expertly identify gaps in both technology and training.
This course has been produced by Cloud Academy in partnership and collaboration with Calculated Systems LLC.
- Understanding and evaluating business and technical requirements
- Learning to collaborate and reach out to discover and resolve improvements with stakeholders
- Analyzing how to develop and architect solutions
- Taking the initial steps toward implementing technologies
- This course is geared to engineers, solution architects, and team leaders.
- There are no prerequisites before starting this course
For Module Three we're going to begin discussing the project architecture. The goal of this is to get to a point at which we could hand it off to individual development teams to actually begin building out the project. We're going to use a three-phased approach to designing this specific architecture: starting with the process, moving to the solution, and finally defining the exact technology. So first we're going to define what key capabilities we need. Then we're going to assign those capabilities to specific classes or types of technology. And then we're going to define the exact piece of technology, down to a software name or package that we should be using. So starting with the Process Architecture, we're basically trying to assign capabilities that the program will need to meet the business requirements and the technology requirements defined in the first two modules. Here we're trying to outline the high-level tasks.
Think of this as a checklist without any sub-bullet points; we're not trying to assign specific technologies or vendors. So it might be good to say that we need to store data and make it accessible to the analysts, but it's not appropriate to say that we need to use SQL Server 2018. That will come in the later steps in defining the architecture. So if we begin to go through the capabilities that this project needs, first we need to be able to receive data. This is the data transmitted from the automobile, and it needs to come into the data center somehow. The key point here is that data is going to be coming in from outside of our controlled IT environment and has to be ingested into the central location. Once the data has landed somehow, we then need to be able to buffer it or store the messages.
Basically, when receiving raw data from the field, we need to be able to keep a running log that is accessible and potentially even repeatable for all the messages coming in, so that they can be replayed and tested. Once these messages have landed and are buffered and stored in their raw format, we need to be able to route them to the correct final destinations or potentially even enrich them. Basically, in this stage these messages are going to be transformed, they're going to be cleaned up, and they could be combined with other types of data or other tables, but ultimately we're looking to get them from the raw landing spot into some type of storage solution. Basically, we need something that the analysts and researchers can use and that they can repeatably go over in their research.
Now, storage feeds back to logic and enrichment because previous events and other types of data could feed back into it. So the logic and enrichment capability and the storage capability are going to be tightly linked, and these two teams, if they are separate development teams, will need to coordinate closely to make sure that they're interwoven. And finally, we need some type of interface to external users. For this project, we're talking about the analysts and researchers who need to plug into the storage system. We could also consider what an end consumer might need to plug into the end system. The technologies and capabilities shown at this stage are extremely high level. This is going to be the framework on top of which we begin to apply the other business and technology requirements that we gathered in the other two modules. For example, in the next steps we'll start talking about the type of scale we need to handle for the storage, which was defined by the technical requirements.
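To make the checklist concrete, the capabilities and the data flows between them can be written down as a small graph. This is only a sketch: the capability names and the helper functions are hypothetical labels chosen for illustration, and the edges simply mirror the flow described above, including the feedback from storage into logic and enrichment.

```python
# Hypothetical capability names; no technologies or vendors at this stage.
CAPABILITIES = ["receive", "buffer", "route_enrich", "store", "interface"]

# Directed data flows between capabilities, as described in the transcript.
FLOWS = [
    ("receive", "buffer"),
    ("buffer", "route_enrich"),
    ("route_enrich", "store"),
    ("store", "route_enrich"),  # previous events feed back into enrichment
    ("store", "interface"),
]


def tightly_linked(flows):
    """Return capability pairs that have data flowing in both directions."""
    edges = set(flows)
    pairs = {tuple(sorted(edge)) for edge in edges if (edge[1], edge[0]) in edges}
    return sorted(pairs)


def downstream(capability, flows):
    """Return the capabilities that directly consume output of `capability`."""
    return sorted(dst for src, dst in flows if src == capability)


if __name__ == "__main__":
    print(tightly_linked(FLOWS))
    print(downstream("store", FLOWS))
```

Pairs reported by `tightly_linked` are the ones whose development teams would need to coordinate most closely.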
Here we're just trying to keep it to the capabilities. For the Solution Architecture, we're beginning to assign the type of technology to use for each part of the process architecture. So you should be thinking in terms of: are we using a NoSQL or a SQL solution? Perhaps we're using a message buffer or a message bus, perhaps we're using a RESTful API server, or perhaps we're using some type of webhooks. Don't think in terms of MySQL or HBase, though; you're not trying to think of specific brands or distributions of technology. Once again, we're keeping it to the class of the technology. So for receiving the data, we're looking at a RESTful API server in this case study. The automobile already had the ability to make POST requests, which are a way to send data over HTTP, or the internet. It can also make GET requests to query data.
This makes a RESTful API server a clear choice of technology for landing the data, because the vehicles already have the ability to speak that protocol. For the decoupling buffer, which is where the RESTful API server dumps its messages in raw format, we're going to use a messaging system. This system is typically able to handle a high amount of throughput without needing to rely on logic and enrichment. This allows all of the automobiles' messages to be stored as is, without the risk of misinterpreting them, and if we want to try new enrichments we'll have a log of messages to go back to and replay throughout testing and eventually production. Now, there are a few choices for what to put next. There are a lot of ways to handle messages coming off the buffer. But one of the best modern ways is an event processing engine.
This treats every individual message as an event and handles it in a streaming fashion. An alternative would have been to think of the messages as batches, but by handling each specific event we're able to handle the enrichment stage as well, where we take the message, parse it, understand it, enrich it in semi-real or even real time, and then eventually route it to its final storage location. Now, if you remember, we have the business requirement that the schemas are potentially going to be changing a lot as we go, so a NoSQL solution is the best fit for handling that type of problem. We could use the event processor to enforce the schema: take incoming RESTful API server messages from the car, transform them, and force them to fit a very specific SQL solution. But if we select NoSQL at this stage, we're able to store the messages exactly as is, and then the analysts are able to send queries to the NoSQL database to get what the car events were. Now, there was the requirement from the SQL analysts that they want to have a reliable way of parsing these messages. So even though we're using NoSQL, we can still use a standardized schema for some of our tables, or column families, depending on which solution we go with.
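As a rough illustration of handling one event at a time while keeping both a schemaless copy and a standardized view, consider the sketch below. Everything in it is an assumption made for illustration: the two in-memory lists stand in for NoSQL tables, and the field names are hypothetical rather than taken from the case study.

```python
import json

# In-memory stand-ins for two NoSQL tables (assumptions for illustration).
raw_store = []       # schemaless: every event is kept exactly as received
standard_store = []  # standardized: a fixed set of fields for SQL-minded analysts

# Hypothetical standardized schema; the real field names are not in the source.
STANDARD_FIELDS = ("vehicle_id", "timestamp", "speed_kph")


def handle_event(message: str) -> None:
    """Process a single message as an event rather than as part of a batch."""
    event = json.loads(message)
    # Store the event as is, so changing schemas never block ingestion.
    raw_store.append(event)
    # Project onto the standardized schema; fields the car did not send are None.
    standard_store.append({field: event.get(field) for field in STANDARD_FIELDS})


# Two events with different shapes, as a changing vehicle schema might produce.
handle_event('{"vehicle_id": "car-1", "timestamp": 1, "speed_kph": 88}')
handle_event('{"vehicle_id": "car-2", "timestamp": 2, "tire_psi": 31}')
```

The second event keeps its extra `tire_psi` field in the raw store, while the standardized view always exposes the same columns.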
So, the NoSQL store is shown to pipe back into event processing because, as we're handling these events as individual triggers, we could have it so that as a message comes through it gets stored in a schemaless fashion, and then we could enforce a standardized schema using the same engine. So there's a lot of reuse and a lot of attachment between the two of them. And finally, the interface to external users. Although we're looking to define how this will be done, it does change fairly drastically depending on the NoSQL solution chosen. Some of them have JDBC connections, and some of them have client APIs written in different languages.
As we pick a NoSQL solution, the interface will become apparent; however, it is very hard to define at this stage. The final type of architecture review that we will be doing is the Technology Architecture. In this stage, we're starting to select the specific technologies that will be used. Vendors are a concern only to the extent that they limit capabilities or offer different functionality. But it's best to list multiple viable technologies at this stage, and then review them to find common traits between different stages.
So think in terms of: are we looking for MySQL or HBase? Perhaps we should list MySQL and SQL Server, or perhaps we should list HBase and Bigtable for a certain step, but don't think in terms of needing to decide whether it's Oracle or Microsoft at this stage. So in selecting a specific technology for the RESTful API server, there are a ton of open-source options as well as some closed-source ones. For this project, we tended to stick a little more towards open source due to the need for portability between on-premises and Cloud, and because the licensing concerns for a closed-source solution could get difficult. And in that domain, there are many strong options. There are classic options such as Tomcat or Spring. Basically, there are a dozen Java libraries that could handle this, there are a dozen Python libraries, and then there are some Apache projects such as NiFi that could potentially handle multiple roles.
For the messaging system, we have to make a little bit of a choice here, or we will soon, between something like a service bus or a true message queue. So, RabbitMQ, Pub/Sub, and Apache NiFi can all buffer messages in them, but that makes it play-once, and then you have to go back to your auditable archive to replay that message. Really, they're fire once, confirm receipt, and the message is gone. We also have the option at this stage to consider whether we want something like Apache Kafka, where replayability is built in. Realistically, any of these four technologies would work great; however, we can begin to cross a few out at a later step, such as Pub/Sub, which is a Cloud-specific technology and might not work in a hybrid environment. For the event processing engine, there are also tons of strong options. You could use everything from Google's Dataflow, which is basically managed Apache Beam, so we do have some portability there.
There's Amazon's Kinesis, and then there's Apache Spark and Apache NiFi as well. And then for NoSQL options, there are many strong ones: some classic ones including MongoDB, but then there are also Cloud-specific ones such as Bigtable or DynamoDB, or an Apache one such as HBase. And finally, all of these NoSQL solutions have some common interfaces available to them. Most support some form of JDBC, most of them support a Python or Java interface, some of them have RESTful APIs built in, and many of them can export to file dumps. Once we select a very specific one, we'll be able to make that final decision. So for this case study, we did end up making specific choices and applying the architecture. We basically pulled all the data we gathered from modules one and two, and were able to narrow it down to a very compact, easy-to-support footprint in which some technology was actually able to cover multiple steps.
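The narrowing step can be sketched as a small filter over the candidate lists. The stage keys and helper names below are hypothetical, the candidates are the ones named in the transcript, and the excluded set contains only the technologies the transcript explicitly calls Cloud-specific; a real review would weigh many more requirements.

```python
from collections import Counter

# Candidate technologies per stage, as listed in the transcript.
CANDIDATES = {
    "rest_api_server": ["Tomcat", "Spring", "NiFi"],
    "messaging": ["RabbitMQ", "Pub/Sub", "NiFi", "Kafka"],
    "event_processing": ["Dataflow", "Kinesis", "Spark", "NiFi"],
    "nosql": ["MongoDB", "Bigtable", "DynamoDB", "HBase"],
}

# Only the options the transcript explicitly describes as Cloud-specific.
CLOUD_SPECIFIC = {"Pub/Sub", "Bigtable", "DynamoDB"}


def viable(candidates, excluded):
    """Drop excluded technologies from every stage, keeping the original order."""
    return {
        stage: [tech for tech in techs if tech not in excluded]
        for stage, techs in candidates.items()
    }


def multi_stage(candidates):
    """Return technologies that appear as a candidate for more than one stage."""
    counts = Counter(tech for techs in candidates.values() for tech in techs)
    return sorted(tech for tech, count in counts.items() if count > 1)


if __name__ == "__main__":
    print(viable(CANDIDATES, CLOUD_SPECIFIC))
    print(multi_stage(CANDIDATES))
```

Technologies returned by `multi_stage` are the ones that could shrink the footprint by covering several steps at once.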
This is an ideal breakout section that we will be producing future labs and courses on, to dig into how the solution works, how to interact with it, and how people can build this for themselves. But to go through exactly which technology we used: for the RESTful API server we went with Apache NiFi. It has the capability to double as an event processor, which immediately made it a strong option. Also, being completely open source and portable, we are able to move it between on-premises and Cloud extremely easily. Also, as it has a very strong graphical user interface, the lack of IT training made this a very strong choice to work with. For the service bus we went with Apache Kafka. So we're sticking with the trend of purely open-source technology due to its portability and licensing, and although NiFi was an option for this service bus, replayability of the message block was deemed to be a very useful and important trait.
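The difference between a fire-once buffer and a replayable log can be shown with a toy model. This is a conceptual sketch only: the class and method names are hypothetical and are not the API of Kafka, NiFi, or any other product mentioned here.

```python
from collections import deque


class FireOnceQueue:
    """Toy fire-once buffer: a consumed message is gone for everyone."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        """Remove and return the oldest message, or None if the queue is empty."""
        return self._messages.popleft() if self._messages else None


class ReplayableLog:
    """Toy append-only log: messages are retained and each reader has its own offset."""

    def __init__(self):
        self._messages = []
        self._offsets = {}

    def publish(self, message):
        self._messages.append(message)

    def read(self, reader):
        """Return the next message for `reader`, or None when it has caught up."""
        offset = self._offsets.get(reader, 0)
        if offset >= len(self._messages):
            return None
        self._offsets[reader] = offset + 1
        return self._messages[offset]

    def rewind(self, reader, offset=0):
        """Move one reader back so it can replay messages without affecting others."""
        self._offsets[reader] = offset
```

In the log model, an analyst testing a new enrichment can rewind and replay from the start while a production reader keeps its own position, which is the behavior the stakeholders valued.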
We went back, we checked with some of the stakeholders, and being able to run events through the event processor exactly as they were received, and have multiple people be able to read from different points of the queue, was deemed to be an extraordinary capability that we didn't want to give up. Now, it's worth noting NiFi can do this out of the box as well; however, its read-once-and-fire method does make it hard for multi-tenant users to replay messages over and over without impacting each other. For the event processing, as we alluded to earlier, we went with Apache NiFi.
It's extremely adept at handling large amounts of concurrent messaging, so even as our data volume grows to extreme scale, NiFi can handle massively parallel data processing. For the NoSQL solution we ended up going with HBase. Products such as DynamoDB and Bigtable were originally looked at for their ease of use and managed service, but a hybrid strategy requires us to be able to pull data down from the Cloud and onto on-premises systems. Other classic solutions such as Mongo would begin to struggle at the data volumes we're talking about, particularly after one year, when we're in the hundreds of terabytes.
And furthermore, HBase is supported by the Hadoop distributions available, so we're able to use this product for our hybrid Cloud strategy, and we can use a single vendor for handling all of the support of these products. Also, HBase has an interesting feature called Phoenix, which is a co-processor that sits alongside it and provides a JDBC-style SQL connection. It's not quite true SQL, but it's close, which helps us work with the individual analysts who have a strong SQL preference.
Altogether though, looking at the event processing, the NoSQL store, and the interface, we noticed an interesting possibility here. Because NiFi uses a GUI, we're able to actually allow automotive engineers to send their own structured messages, and the analysts can go in and build their own parsers that are then reusable. That solves a ton of the problems that we were originally facing, where the schema is changing and the analysts have no ability to help control that other than to wait for central IT. So this unified solution allows us to use a single vendor, it allows us to use certain pieces of technology repeatedly, and it allows multiple people who might not have full-blown training to begin to use the software, due to its ease of use and graphical user interfaces.
About the Author
Chris Gambino and Joe Niemiec collaborate on instructing courses with Cloud Academy.
Chris has a background in big data and stream processing. Having worked on problems from the mechanical side straight through the software side he tries to maintain a holistic approach to problems. His favorite projects involve tinkering with a car or home automation system! He specializes in solving problems in which the data velocity or complexity are a significant factor, maintaining certifications in big data and architecture.
Joe has had a passion for technology since childhood. Having access to a personal computer resulted in an intuitive sense for how systems operate and are integrated together. His post-graduate years were spent in the automotive industry, where working around connected and autonomous vehicles greatly expanded his understanding of streaming, telematics, and IoT. His career most recently culminated in a 2+ year role as the Hadoop Resident Architect, where he guided application and platform teams in maturing multiple clusters. While working as Resident Architect, he architected a next-generation Self-Service Real Time Streaming Architecture, enabling end-user analysts to access streaming data for self-service analytics.