This course takes you through a case study of a real-world scenario in a financial setting. We'll go through how data gets processed, stored, and presented for an insurance dashboard. This dashboard displays all different types of insurance and is powered by an underlying database. So over this course, you'll both see some specifics for the industry vertical of insurance and some specifics on how to process data.
This course is a little less technical than some of our other database-related content, so if you don't have Python or SQL skills, that's fine. This course is really ideal for those who want to learn a little bit more about how all of the pieces go together to understand more of where to dive in deeper.
If you have any feedback relating to this course, please let us know at firstname.lastname@example.org.
- Learn how to put together a data processing solution for an insurance company
- Plan the project
- Build the front-end interface, the back-end database, and the middleware that connects the two
- Put the whole dashboard solution together
- Data engineers and database administrators
- Anyone who wants to gain insights into how a data-handling solution is built in a real-world environment.
To get the most out of this course, you should have some basic experience with databases and building IT solutions.
Now that we understand the requirements and everyone's on the same page, the next step is to build the database. So the back-end database is really the foundation of your data application. And it's important to make sure that this is in really good shape, because if you spend a little extra time planning here, you're gonna have a solid bedrock for your product, and you're gonna be able to ensure that everything else works smoothly. If you have a good, solid database, you're going to be able to get what you need for the front end. And also, if your database is well designed, it will be easy, hopefully, to expand it later if you need to add additional data.
Okay, so the first step in building the database is to choose the technology solution. There's really two options. The first one is a relational database, which is a good solution if you have structured data. If you can come up with a consistent schema for your data relationships, a relational database such as postgres or MySQL is almost certainly the way to go. There's also NoSQL databases such as MongoDB or Cassandra. These are good solutions if you have unstructured data or you have data where things change a lot, maybe a fire claim is just very different structurally from, say, a flood claim. And since you want to capture both in a single database, that would maybe be a good example of dynamic data. But for most things, you're gonna have a good structured dataset, especially if your data is coming from other relational databases. So generally, relational databases are the way to go, unless there's some really compelling reason to go the NoSQL route.
Once you've chosen your technology, then you need to start designing your database. So for relational databases, that means the database schema. You want to have a schema that's flexible enough to hold all the data fields that you want to have now or in the future. So, for example, we know that for our claims dashboard, they want to be able to see the date of the claim and the type of the claim. So for example, we know that for our claims dashboard they want to be able to see the date of the claim and the type of the claim. So if you build a schema that doesn't include, say claim type, we're gonna have a problem So you want to make sure that the schema has space for all the things that we may want and to consider future expansion. We might want to either add additional fields that might be requested in the future or think about how you could design the database to make it easily expandable if the claims team comes back to you and says that you did a great job, but can we have this field and this field and this field, that you want to make sure that your database structure is flexible enough to support that.
For BI applications, dimensional data models, also sometimes known as snowflake models, often work better than the standard normalized models. Dimensional models have wider tables but require fewer joints. So when you're actually serving the data to the front-end application, your SQL queries will run faster.
So let's go back to our Acme example. The next thing you do is call a meeting with your DBAs to find out what data is actually available. So there's some good news and there's some bad news. The good news is we do in fact, have detailed claims information for the last 10 years, so we'll be able to look at historical claims, we’ll be able to look at data for the last month, the last year, the last five years, which is really great for our claims team because that's exactly what they asked for.
The bad news is that unfortunately, the claims team has switched claims processing systems twice in that period, so there are three different formats of data in three different locations. So we need to handle different data sets and combine them all into our data warehouse and get them to be all in the same format. And there's gonna be some work there.
But there's more good news in that we do have detailed policy information, but again, this comes with a little bit of bad news in that it's in yet another database. So now we're talking about four different data sources that we're going to need to blend into our single database for our BI solution.
Now that we understand the available data, we can pick our technology solution and build our database schema. So what are we gonna do? We've decided to use a relational database that has two tables. I would recommend a nice open-source solution such as postgres, which is pretty standard and flexible, So our two tables are going to be the policy table and the claims table. Our policy table is going to have a primary key of policy ID, and it's going to hold all the information about the policy. So the policyholder, their name, the insured property, the insurance limits, the deductibles, all of the information that you would expect in an insurance policy. Even if you don't work in insurance, you probably have insurance. So hopefully you should be familiar with the types of data that would go into this table.
Similarly, our claims table will hold detailed information about each claim. We're going to have a claim ID field that will work as the primary key for this table. But there's also gonna be a foreign key relationship to the policy table through the policy ID. So each claim is going to be associated with a particular policy, because when you make a claim, it's always a claim against a policy. It's not just a claim off in a vacuum. The claims table holds detailed information on each claim, including the type of claim, the date the claim was filed, whether it was accepted or rejected, the date the claim was finalized and some other information that might be helpful.
So now that we have our database all set, we need to think about how we're going to get data into that new database, and this is where the data pipeline comes in. So it's a good idea to use a tiered approach to data ingestion. Your first tier would be just your raw data from various original sources. You get everything gathered up and you stick it in your database so it's all in one place. In your second tier, you can transform that data into your database schema that we just designed, which is our policy and our claims tables. So this is where the ETL happens. You extract the data from the raw tables, transform it to fit your new schema and load it into the tier two tables. So that's E. T. L. Extract. Transform. Load. Tier two is where that transform happens. Your next tier, tier three is where we're going to do some cleaning, augmenting and maybe enhancing of the data. So you know you're gonna look at the data. Maybe it's messy. Um, maybe you need to translate or update data codes. Maybe there's just some way that you could make it better. And this is the tier where you would do that. So once you've transformed the data into your schema, you can enhance it in this tier. And then your final tier is where you would combine the cleaned tables of the cleaned data from all the sources into one master set of tables. So remember, we have, in our example, three claims databases and one policy database. So we need to combine everything in to one big master set of database tables.
So how do you pick the right tool for your data pipeline? There are a lot of commercial and open-source tools on the market. A good choice for this would be Apache NiFi. NiFi is nice because it uses a graphical interface, which lets you visualize the data flow, which is good for visual thinkers. It can take in lots of different data sources. It can take in CSVs, it can take in JSONs. It can read in from relational databases and NoSQL databases and write out to databases as well. It also allows you to write scripts in various languages, like Groovy or Python, to process and transform your data. And there's a wide range of built-in processors that, for a basic project like this, you could probably do everything you need with the built-in processors. But you can also write custom processors if the built-in processors aren't going to do what you need. So it's a pretty neat tool overall and very flexible.
Okay, so back to our claims example. We've set up our data pipeline, and we've decided that we're gonna use Apache NiFi to build this. So we now need to create a data flow that extracts data from our four primary sources. Remember, we have three separate claims systems, plus the policy data, and then we create a data flow that will transform the raw data into our database schema, which is the policy and the claims tables that we've created. And that's what we're talking about with tier two, where you're doing that transforming of the raw data into your database scheme.
Introduction - Case Study Scenario - Planning - Front-End Design - Middleware Design - Claims Dashboard: Putting it Together
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads development cycles are cut down to a fraction of their normal time. The ability to quickly develop large scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.