The course is part of this learning path
This course discusses some of the fundamental concepts of data management and looks at the differences between spreadsheets and databases for managing data. We'll look at some specific examples to understand when spreadsheets makes sense and when it makes sense to switch over to a database, which is sometimes a much better option for more complex datasets.
Specifically, this course aims to give students a practical hands-on introduction to database concepts. In addition, we'll gain an understanding of how to select the right database and we'll go through the basics of setting up an RDS instance on Amazon. This course includes a practical example of a company that is looking to choose a database, to give you an understanding of how databases work in the real world.
If you have any feedback relating to this course, please contact us at support@cloudacademy.com.
Learning Objectives
- Understand the difference between spreadsheets and databases and when to use one or the other
- Learn about the different types of database available and the various features and characteristics to consider
- Learn how to choose the right database
- Learn how to deploy an Amazon Aurora instance
Intended Audience
This course is designed for anyone who wants to improve their knowledge of databases and understand when it makes sense to use them as opposed to a spreadsheet.
Prerequisites
To get the most out of this course, you should already have a basic understanding of simple data structures such as comma-separated values, as well as an understanding of cloud concepts in general.
So to begin our discussion about spreadsheets in earnest, let's discuss when they actually make sense. We'll discuss the specific codified criteria in a moment, but everybody attending this course should keep in mind that for some applications, spreadsheets, Excel, or Google Sheets are actually the preferred choice. In fact, in many situations, sticking with a spreadsheet makes a lot of sense because it simplifies your administration, maintenance, and upkeep, and a database is just overkill.
So although in this class we'll be discussing a lot of what makes a database good, keep in mind it is completely valid to keep using spreadsheets, as long as you feel that it's not becoming a limit to how you work with the data. So when deciding if a spreadsheet still makes sense, the first thing to really talk about is the structure of the data. This means, is it flat? Are we dealing with a simple list? Is there nested values? So typically with spreadsheets, that best case is flat data.
Now, this simply means that the data can be represented by a series of columns and rows, and typically, not always, the data can be represented in numbers, dates, alphanumeric texts, and other simple types. If you start seeing nested data such as JSONs or heaven forbid, CSVs within a CSV with escaped commas, it's probably time to move beyond a spreadsheet. But as long as it's flat, it's typically a good candidate for considering sticking with a spreadsheet.
Additionally, the size of the data is very important. Typically, I like to say less than 20 columns or less as a rule, this is not by any means a codify rule you'll find in any textbook, but as the data starts to get wider and you start getting into the 25, 2,700-column range, the spreadsheets become harder and harder to manage.
Spreadsheets are at their strongest when you can see a lot of columns all at once to get a good feel for the data. But if it's hard to find the column you want, then maybe it's time to go to a database, where you can search for the data by the field name rather than having to scroll all the way to the right or left to see the column you want. And in terms of length, not just width, after you get beyond a few hundred rows, spreadsheets also start to lose their power.
Now, remember, a spreadsheet is typically processed on your local machine, or chances of Google Cloud, on a server designed to support a spreadsheet. As you start to get to bigger and bigger calculations, you might find a spreadsheet limiting. Perhaps you'll find it lagging. Perhaps you'll just find it hard to define the functions you need. These are not real hard rules, but if you find your data losing structure or getting too big in both the width or the length direction, that is a strong indicator that you're moving beyond what a spreadsheet is.
Now, in my opinion, more importantly than the size or the cumbersomeness, or even the difficulty to parse the data, is how many people are going to be updating it. Now, if you're the only person working with a data set, perhaps you can tough it out. You can handle the giant spreadsheet and be a hero.
But if you start having to pass that giant spreadsheet to your colleagues and multiple people are gonna have to be working on it, maybe at the same time, then you start to have a lot more trouble. Who is going to own the data? How do you make sure that there's not two edits going through at the same time? And maybe you have to merge them at the end or account for discrepancies between multiple copies. This, in my opinion, is a much bigger problem than just the size and shape of the data.
Now, perhaps your company works with a solution that supports multi-tenancy, a big shout out to Google Cloud here for supporting multi-tenancy. Also, programs like Office 365 on the cloud support this, but managing conflicts, managing traceability of multiple people managing the data starts to grow out of control. And if you start to introduce different permission sets to edit different fields or different sections of the spreadsheet, that's when, in my opinion, spreadsheets fall over the fastest. And finally, when thinking about if a spreadsheet makes sense, you need to think of how the data is actually gonna be used.
In many cases, recording data down on a spreadsheet for an audit purpose or a record-keeping purpose isn't really the end goal. Someone is going to need to look at it and probably produce a report, or at least a simplified graphic for an Excel document or a PowerPoint.
So typically, spreadsheets support simplified reporting. Now, many nice programs, such as Tableau, support the import of spreadsheets, but for live dynamic dashboards, that's when this starts to fall over. And that's when you need to start considering you need a better system for managing the size, the concurrency and the reportability consistency of the data.
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads development cycles are cut down to a fraction of their normal time. The ability to quickly develop large scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.