Moving Beyond Spreadsheets
This course discusses some of the fundamental concepts of data management and looks at the differences between spreadsheets and databases for managing data. We'll look at some specific examples to understand when spreadsheets make sense and when it makes sense to switch over to a database, which is sometimes a much better option for more complex datasets.
Specifically, this course aims to give students a practical hands-on introduction to database concepts. In addition, we'll gain an understanding of how to select the right database and we'll go through the basics of setting up an RDS instance on Amazon. This course includes a practical example of a company that is looking to choose a database, to give you an understanding of how databases work in the real world.
If you have any feedback relating to this course, please contact us at firstname.lastname@example.org.
- Understand the difference between spreadsheets and databases and when to use one or the other
- Learn about the different types of database available and the various features and characteristics to consider
- Learn how to choose the right database
- Learn how to deploy an Amazon Aurora instance
This course is designed for anyone who wants to improve their knowledge of databases and understand when it makes sense to use them as opposed to a spreadsheet.
To get the most out of this course, you should already have a basic understanding of simple data structures such as comma-separated values, as well as an understanding of cloud concepts in general.
So we've talked a lot about when to select a spreadsheet and when not to, but let's get a little more practical: how do you select the right database, and how can you get started on that path? As we get started, just know that Cloud Academy's content library has a lot, and I mean a lot, of highly detailed information about the specifics of specific database technologies. And I know I said specific twice there, but they really do have a lot of in-depth content.
So if this introduction and practical guide isn't enough and you are really facing specific problems, search for the exact database you're using on Cloud Academy, and you'll be able to find highly detailed instructions. But to dive in here, by far the most common type of database is known as a relational database. These relational databases are what most people traditionally think of as a database.
Their history goes way back, and they are typically what people mean when they refer to a database in passing, unless you're having a more technical discussion. The most popular ones are MySQL, PostgreSQL, and SQL Server. These are legacy databases that go way back. They also continue to innovate, so they're still used in cutting-edge applications today.
They're used in business applications that require structured data. This means that the data is representable in terms of tables, columns, and rows, so they're a little similar to spreadsheets in that regard. But very importantly, if you're looking at a relational database, you're thinking in terms of structured data: the data is highly repeatable and highly predictable. You know what's coming in and you're able to apply control to it. They're also able to handle extremely large volumes of data relative to a spreadsheet.
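To make "apply control to it" concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a relational database. The orders table and its columns are hypothetical, chosen only for illustration; the point is that the schema rejects a row that breaks the rules, which a spreadsheet would happily accept.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table: every row must fit these columns and constraints.
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        quantity INTEGER NOT NULL CHECK (quantity > 0)
    )
    """
)

# A well-formed row is accepted.
conn.execute("INSERT INTO orders (customer, quantity) VALUES (?, ?)", ("Acme", 3))

# A row that breaks the rules is rejected instead of being silently stored.
try:
    conn.execute("INSERT INTO orders (customer, quantity) VALUES (?, ?)", ("Acme", -1))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("SELECT order_id, customer, quantity FROM orders").fetchall()
print(rows, rejected)
```

MySQL, PostgreSQL, and SQL Server express the same idea with their own SQL dialects, so treat the exact syntax here as SQLite-specific.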
Although, if there's anybody on this course listening in, they don't quite scale up to what a distributed system could mean. I would say a traditional rule of thumb is that if you have less than a handful of terabytes of data, single relational databases such as PostgreSQL, SQL Server, and MySQL are key for you. Amazon, in particular, can now handle MySQL databases of up to 64 terabytes through what they call their Aurora platform. As a rule of thumb though, if you start to get above single-digit terabytes, it's time to start considering more advanced technologies.
The other major category of databases is known as NoSQL. This convenient name doesn't mean negative SQL. It actually means Not only SQL. These are databases that you've probably already come across. They include classics such as MongoDB or CouchDB, and some of the more big data databases in this space might be HBase or Google Bigtable. These databases are really designed for high-volume, high-performance applications, but there's a very important twist. They don't require you to have the data in a tabular format. This is very important because it means that the data can come in with just about any schema without you needing to predefine it. This is not to say there's no schema at all, but this is what you want if you need a high degree of flexibility in your data.
Typically, the best way to use these is with what we call key-value pairs, where you have some type of key like a user ID and some type of value such as user preferences. I personally have used this type of database when handling highly unstructured test data coming from a production environment where they're building new IoT devices and onboarding new sensors.
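As a rough sketch of the key-value idea, the snippet below uses a plain Python dict as a stand-in for a real key-value store. The key names, the put and get helpers, and the sample values are all assumptions made for illustration, not the API of any particular NoSQL product. Notice that the two values have completely different shapes and neither was declared in advance.

```python
import json

# A plain dict stands in for the key-value store (assumption for illustration).
store = {}

def put(key, value):
    # Serialize to JSON text, treating the value as an opaque document.
    store[key] = json.dumps(value)

def get(key):
    # Return the decoded value, or None when the key is absent.
    raw = store.get(key)
    return None if raw is None else json.loads(raw)

# Two values with unrelated structures, stored side by side.
put("user:1001", {"theme": "dark", "language": "en"})
put("sensor:temp-07", {"celsius": 21.5, "battery": 0.82, "firmware": "1.4"})

prefs = get("user:1001")
reading = get("sensor:temp-07")
missing = get("user:9999")
print(prefs, reading, missing)
```

Onboarding a new sensor type in this model just means writing a value with new fields under a new key; no table definition has to change first.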
So to quickly recap right there: SQL is typically good for structured data, where you know what's coming in and you want to enforce expressible relationships. And NoSQL is really ideal for when you have data that's coming in with variable formats and structures.
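The recap, together with the earlier size rule of thumb, can be written down as a tiny decision helper. This is only a sketch of the heuristics from this lecture: the function name is hypothetical, and the 10-terabyte cutoff is an assumed reading of "single-digit terabytes", not a hard limit of any product.

```python
# Assumed cutoff for "single-digit terabytes"; adjust to your own situation.
SINGLE_DIGIT_TB_LIMIT = 10.0

def suggest_database_family(structured: bool, size_tb: float) -> str:
    """Return a starting point for database selection, per the lecture's rules of thumb."""
    if not structured:
        # Variable formats and structures point toward NoSQL.
        return "NoSQL"
    if size_tb < SINGLE_DIGIT_TB_LIMIT:
        # Predictable, tabular data of modest size fits a single relational database.
        return "relational"
    # Structured but large: a relational model may still fit, but look further.
    return "relational, but consider more advanced technologies"

print(suggest_database_family(structured=True, size_tb=2))
print(suggest_database_family(structured=False, size_tb=2))
print(suggest_database_family(structured=True, size_tb=40))
```

A real selection would also weigh factors such as query patterns, team skills, and cost, so use the result as a first question to ask rather than an answer.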
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.