Data vs Big Data
1h 23m

Machine learning is a big topic. Before you can start to use it, you need to understand what it is, and what it is and isn’t capable of. This course is part one of the module on machine learning. It starts with the basics, introducing you to AI and its history. We’ll discuss the ethics of AI, and talk about examples of AI that exist today. We’ll cover data, statistics and variables, before moving on to notation and supervised learning.

Part two of this two-part series can be found here, and covers unsupervised learning, the theoretical basis for machine learning, model and linear regression, the semantic gap, and how we approximate the truth.

If you have any feedback relating to this course, please contact us at


Now there's only really one good definition of big data, and there are many bad ones. Here is the one good one: not tabular. The reason we have the term big data is to signal to practitioners, to businesses, and to other people that the ordinary methods of business intelligence and business practice can't be used. So if you are using an SQL database, and you're running ordinary SQL queries, you are not doing anything with big data. It doesn't actually matter what volume of data you have, whether you have gigabytes, or terabytes, or anything of that kind; it isn't big. It only becomes big when it can't be processed that way, because at that point we need to use different techniques, and different approaches, to obtain the information we need.

Now there are other definitions. One is the three Vs. The three Vs are volume, velocity, and variety. Data is said to be big when it has a large volume, when it is many terabytes in size. It is said to be big when it arrives at high speed. Let's say a hundred megabits per second, or even 10; to be honest, that's quite fast. And it is said to have very high variety when it has a complex, non-tabular structure. Let me give you some examples. A graph, as in a social network of some kind: a graph like this, a little network. An image. An audio signal. These count as big data only because they challenge traditional methods.

Let us finish this section, then, with a small discussion about some of the concerns of a big data practitioner. One of the defining concerns here is what's known as the CAP theorem. This is where we trade off three concerns that our tools may have, as far as they interact with data: concern for consistency, concern for availability, and concern for partition tolerance. These concerns arise whenever the volume of data, or perhaps even the speed of data, becomes so large that it cannot be held on one machine, and is perhaps best held on hundreds of machines.
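To give a feel for that trade-off, here is a minimal sketch in Python. It is a toy model, not a description of how any real database works: the `TinyCluster` class, its `"CP"` and `"AP"` modes, and the `simulate` helper are all hypothetical names invented for this illustration. It assumes just two machines holding a copy of one value, and a network partition that stops them talking to each other. In that situation a system can either refuse the write (staying consistent but becoming unavailable) or accept it on one machine (staying available but letting the copies disagree).

```python
class TinyCluster:
    """Toy model: two replicas ("a" and "b") of a single value.

    `mode` is a made-up switch for this sketch:
      "CP" keeps the replicas consistent and refuses writes during a partition.
      "AP" keeps accepting writes during a partition and lets replicas diverge.
    """

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = {"a": None, "b": None}
        self.partitioned = False  # True when a and b cannot communicate

    def write(self, value):
        """Write through replica 'a'. Returns True if the write was accepted."""
        if not self.partitioned:
            # Healthy network: both replicas are updated together.
            self.values["a"] = value
            self.values["b"] = value
            return True
        if self.mode == "CP":
            # Refuse the write: consistent, but not available.
            return False
        # AP: accept the write locally; replica 'b' is now stale.
        self.values["a"] = value
        return True

    def read(self, replica):
        return self.values[replica]


def simulate(mode):
    """Write once, cut the network, write again, then read both replicas."""
    cluster = TinyCluster(mode)
    cluster.write("v1")
    cluster.partitioned = True
    accepted = cluster.write("v2")
    return accepted, cluster.read("a"), cluster.read("b")


if __name__ == "__main__":
    print("CP:", simulate("CP"))  # CP: (False, 'v1', 'v1')
    print("AP:", simulate("AP"))  # AP: (True, 'v2', 'v1')
```

In the `"CP"` run the second write is rejected and both copies still agree; in the `"AP"` run the write succeeds but the two machines now return different answers. Real systems sit at many points between these two extremes.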
But of course there are many, many tools. I think having some intuition for these concerns is really part of the engineering side of big data; it's what engineers and practitioners have to think through when they are in the tricky situation of having a non-traditional dataset that needs non-traditional tools. It becomes a very challenging problem.
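As a closing illustration of the "not tabular" definition used in this section, the sketch below shows a tiny social network held as a graph in Python. The names and friendships are made up for the example, and `build_adjacency` and `degrees_of_separation` are hypothetical helpers, not part of any library. The question it answers, how many friendship hops separate two people, involves following links to a depth that is not known in advance, which is the kind of question that does not map neatly onto a simple row-and-column query.

```python
from collections import deque

# Made-up friendship links: each pair is one edge in the network.
friendships = [("ana", "ben"), ("ben", "cho"), ("cho", "dev"), ("eve", "fay")]


def build_adjacency(edges):
    """Turn a list of (person, person) pairs into {person: set_of_friends}."""
    adjacency = {}
    for left, right in edges:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    return adjacency


def degrees_of_separation(adjacency, start, goal):
    """Breadth-first search: fewest hops from start to goal, or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        if person == goal:
            return hops
        for friend in adjacency[person]:
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None  # no path: the two people are in separate groups


if __name__ == "__main__":
    network = build_adjacency(friendships)
    print(degrees_of_separation(network, "ana", "dev"))  # 3
    print(degrees_of_separation(network, "ana", "eve"))  # None
```

The data here is tiny, so volume is not the issue; what makes it awkward for traditional methods is its shape.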

About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. While studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.