Data vs Big Data


Big Data and AI | SDL4 A3.1 |

Data is information and, just as there are lots of different types of information, there are different types of data. In these videos, you'll learn more about types of data and the ways in which you can store them. 



Now, there's really only one good definition of big data, and there are many bad ones. Here is the good one: not tabular. The reason we have the term 'big data' is to signal to practitioners, businesses and other people that the ordinary methods of business intelligence and business practice can't be used.

So, if you are using an SQL database and running ordinary SQL queries, you are not doing anything with big data. It doesn't actually matter what volume of data you have, whether you have gigabytes or terabytes or anything of that kind; it isn't big. It only becomes big when it can't be processed that way, because at that point we need to use different techniques and different approaches to obtain the information we need. Now, there are other definitions.
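To make "ordinary SQL queries" concrete, here is a minimal sketch of the tabular case using Python's built-in sqlite3 module. The table name, column names and sample rows are hypothetical, made up purely for illustration. The point is only that data of this shape is handled comfortably by traditional methods.

```python
import sqlite3


def total_sales_by_region(rows):
    """Load (region, amount) rows into an in-memory table and aggregate with plain SQL."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per sale, two columns.
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("north", 10.0), ("south", 5.0), ("north", 2.5)]
    print(total_sales_by_region(sample))  # [('north', 12.5), ('south', 5.0)]
```

Everything here fits in rows and columns, so by the definition above it is not big data, however many rows there are.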

One is the three Vs: volume, velocity and variety. Data is said to be big when it has a large volume, when it is many terabytes in size. It is said to be big when it arrives at high speed: let's say 100 Mbps, or even 10 Mbps, to be honest, is quite fast. And it is said to have high variety when it has a complex, non-tabular structure. I'll give you some examples: a graph, as in a social network of some kind, a little network of connected nodes; an image; an audio signal. These count as big data only because they challenge traditional methods.
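As a rough illustration of the graph example, the sketch below stores a tiny social network as an adjacency mapping and asks who is within two hops of one person. The names and the helper `within_hops` are hypothetical. A question like this is a natural traversal on a graph, but it maps awkwardly onto rows and columns, which is the sense in which such data challenges traditional methods.

```python
from collections import deque

# Hypothetical toy social network: each person maps to the set of people they know.
network = {
    "ana": {"ben", "cho"},
    "ben": {"ana", "dev"},
    "cho": {"ana"},
    "dev": {"ben", "eli"},
    "eli": {"dev"},
}


def within_hops(graph, start, max_hops):
    """Return everyone reachable from `start` in at most `max_hops` steps (breadth-first search)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    seen.discard(start)  # the starting person is not their own contact
    return seen


if __name__ == "__main__":
    print(sorted(within_hops(network, "ana", 2)))  # ['ben', 'cho', 'dev']
```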

Let us finish this section, then, with a small discussion of some of the concerns of a big data practitioner. One of the defining concerns here is what's known as the CAP theorem, where we trade off three concerns that our tools may have as far as they interact with data: consistency, availability and partition tolerance. The theorem says that a distributed system cannot fully guarantee all three at once, which is why we have to trade them off. These concerns arise whenever the volume of data, perhaps even the speed of data, becomes so large that it cannot be held on one machine and is, perhaps, best held on hundreds of machines. There are, of course, many, many tools, and I think having some intuition for these concerns is really part of the engineering side of big data. It's what engineers and practitioners have to think through when they're in this tricky situation of having a non-traditional data set that needs a non-traditional tool. It becomes a very challenging problem.
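To build some intuition for that trade-off, here is a toy sketch. It is not how any real database works: `Replica`, `TwoNodeStore` and the "CP"/"AP" mode flag are all hypothetical names invented for this example. Two replicas hold the same data, and when the link between them is cut (a partition), the store must either refuse writes to stay consistent or accept them and let the replicas diverge.

```python
class Replica:
    """One machine holding a copy of the data."""

    def __init__(self):
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store. mode='CP' prefers consistency, mode='AP' prefers availability."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True when a and b cannot talk to each other

    def write(self, key, value):
        """Write through replica a. Returns True if the write was accepted."""
        if not self.partitioned:
            # Healthy network: both copies are updated together.
            self.a.data[key] = value
            self.b.data[key] = value
            return True
        if self.mode == "CP":
            # Cannot reach b, so refuse: consistent but unavailable.
            return False
        # AP: accept locally; b keeps the old value until the partition heals.
        self.a.data[key] = value
        return True

    def read_from_b(self, key):
        """Read the value as seen by replica b."""
        return self.b.data.get(key)


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        store = TwoNodeStore(mode)
        store.write("x", 1)
        store.partitioned = True
        accepted = store.write("x", 2)
        print(mode, "accepted:", accepted, "b sees:", store.read_from_b("x"))
```

In "CP" mode the second write is rejected and both replicas still agree. In "AP" mode it is accepted, but a reader talking to replica b sees the stale value.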
