1. Home
  2. Training Library
  3. Big Data
  4. Courses
  5. Moving Beyond Spreadsheets

When is it Time to Think Beyond Spreadsheets?

Developed with
Calculated Systems
play-arrow
Start course
Overview
DifficultyIntermediate
Duration42m
Students110
Ratings
5/5
starstarstarstarstar

Description

This course discusses some of the fundamental concepts of data management and looks at the differences between spreadsheets and databases for managing data. We'll look at some specific examples to understand when spreadsheets makes sense and when it makes sense to switch over to a database, which is sometimes a much better option for more complex datasets. 

Specifically, this course aims to give students a practical hands-on introduction to database concepts. In addition, we'll gain an understanding of how to select the right database and we'll go through the basics of setting up an RDS instance on Amazon. This course includes a practical example of a company that is looking to choose a database, to give you an understanding of how databases work in the real world.

If you have any feedback relating to this course, please contact us at support@cloudacademy.com.

Learning Objectives

  • Understand the difference between spreadsheets and databases and when to use one or the other
  • Learn about the different types of database available and the various features and characteristics to consider
  • Learn how to choose the right database
  • Learn how to deploy an Amazon Aurora instance

Intended Audience

This course is designed for anyone who wants to improve their knowledge of databases and understand when it makes sense to use them as opposed to a spreadsheet.

Prerequisites

To get the most out of this course, you should already have a basic understanding of simple data structures such as comma-separated values, as well as an understanding of cloud concepts in general.

Transcript

So for those of you still with us, I assume that you either need a database or you have an interest in learning more about databases and that you haven't just said, "Hey, spreadsheets are great for me. I'm gonna be on my way." So let's talk about what considerations you need when selecting a database.

First is the volume of the data. As we previously mentioned, databases are typically when you outgrow spreadsheets. So anything from upper hundreds of rows to literally billions if not trillions of rows is the domain of databases. It's important to start to understand what this is exactly in terms of rough scale, 'cause it affects the technology you're going to pick. Complex selection is a little beyond this course, but just trying to get a feeling, is it hundreds, thousands, millions, billions of rows? And you're able to then start to qualify as a single server or multi server solution, with a rule of thumb being: as you get into the upper millions, you start to need more complex, scalable databases.

The next consideration when thinking about a database and if it's correct for a spreadsheet or just how you're going to manage your database is how many stakeholders are and that you don't have death by strangulation of too many people trying to control the data at the same time. If you've ever worked in a big organization, I'm sure you've seen email chains with documents called V1, V2, V final, V final for real, and people don't have good version control or governance over who makes an edit. The good news is all of this is solvable with even the most basic databases. But very importantly, you need to consider your model of who the shareholders are, who the stakeholders are, who is going to be editing the data, and how you wanna control that.

On the simple end of the spectrum is anybody with access to the data can edit the data, but on the more complex side of the spectrum of a database could be only certain people can edit certain rows with certain columns. Just try to understand how simple or complex of an access model you need before making this choice. I advise for many beginners, try to whittle it down to a collection of people who can read and write and people who can just read.

As you grow from there, you start to introduce more complexity. And the simple read, read/write model is by far the simplest to introduce. Data integrity is a slightly more complex issue within stakeholders in that it could come from multiple people, it could also come from the source data itself. But the core concept is when multiple people in multiple systems are accessing and editing the same data, the integrity of the data becomes questionable. And if you're in a company that handles health care or financial information in particular, who added it and why is very important for when an auditor knocks on your door.

So potentially understanding who is editing the data, what is that in the data, where did data come from, that is known as data integrity on a more complex level that's known as data lineage. But databases vary much so more than that spreadsheets enable you to monitor, trace, control, and manage your data quality and integrity, especially because many of the databases come with built-in audit logs on a per field basis. And finally, just to loop back on this issue, because it really is typically what a database supports, is complex reporting requirements.

Whereas a spreadsheet is fantastic at simple one-off reports or maybe manually updated, databases are when you need data that is maybe dynamically updated nightly, hourly, secondly, or maybe it is updated by multiple processes owned by multiple people. But just know, and to reiterate, the rule of thumb is if you need more complex reporting, go with the database over a spreadsheet.

About the Author
Students1898
Labs14
Courses7
Learning paths11

Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity.  With cloud automation tools, deep industry expertise, and experience productionalizing workloads development cycles are cut down to a fraction of their normal time. The ability to quickly develop large scale data ingestion and processing  decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.

Covered Topics