Organizations must deal with the collection and storage of continuously growing data, and then harvest it to capture value. “Big Data,” as it’s called, concerns itself with these complex processes.
The following list contains 46 key Big Data terms that you’re likely to encounter in the wild, explained in easy-to-understand terms. If you want to learn more about Big Data in the cloud, be sure to check out our Analytics Fundamentals for AWS course.
Algorithms: These are mathematical and analytical formulas which also include statistical processes used to analyze data. Algorithms are implemented in software to analyze, process the input data, and produce output – or results.
Analytics: This term refers to the process of drawing conclusions from raw data. With the help of analysis, otherwise meaningless numbers and data can be converted into something more useful. The emphasis here is on interpretation and not on big software systems. The focus is on the art of storytelling, something in which data analysts excel.
Biometrics: Refers to using analytics and technology in identifying people by one or many of their physical characteristics, such as fingerprint recognition, facial recognition, iris recognition, and so on.
Cassandra: Cassandra is a well-known open source database management system which is managed by the Apache Software Foundation and is designed to handle high volumes of data across distributed servers.
Cloud: The cloud model provides individuals and companies with scalable, on-demand resources abstracted away from the underlying hardware cluster. Running applications and storing data in the cloud often provides organizations with significant cost savings and operational simplicity.
Database: A systematized collection of data. It could include schemas, charts, or tables. It might also be combined with a Database Management System (DBMS), which is software that helps data to be analyzed and explored (such as Microsoft Access).
Data Mining: It can mean different things in different contexts. For the layman, it means the automatic analysis of large databases. For an analyst, it refers to the pool of statistical and machine learning methodologies applied to those databases.
Dark Data: Relates to the information which is collected and managed by a business, which is never put to use – and which sits waiting to be studied. Many companies have a lot of this kind of data lying around without their awareness and knowledge.
Hadoop: An open source software framework for storing and processing files and data. Hadoop is also known for its processing power, which makes it easy to run a host of tasks in parallel. It helps businesses access, store, and analyze enormous amounts of data.
IoT: Stands for “Internet of Things” and describes the fact that many ordinary objects, from self-driving cars to trash cans, have the capability to send and receive data. It is a network of objects with built-in connectivity. They pull information from the cloud and, through their sensors, relay information back. The IoT generates a large amount of data, making it a significant and prevalent source of material for data scientists to analyze.
Data Scientist: Refers to a skilled expert in extracting value and insights from data. Typically someone who has skills in computer science, analytics, mathematics, creativity, statistics, communication and data visualization as well as strategy and business.
Gamification: The process of creating a game-like structure around something which is not normally a game. With regard to Big Data, gamification is a powerful tool for incentivizing data collection.
Machine Learning: A highly effective way of performing data analysis. Machine learning automates analytical model building and relies on the system’s ability to adapt. With the use of algorithms, models dynamically learn and improve themselves every time they process new data. While machine learning is not a new idea, it is receiving massive attention as an advanced and modern tool for data analysis. It allows systems to grow and adapt without demanding numerous hours of extra work by scientists.
MapReduce: A programming model for processing and generating massive data sets. It does two different things. First, the Map step converts one dataset into another, in which individual elements are broken down into key/value pairs known as tuples. Second, the Reduce step takes those tuples as input and combines them into a smaller set of tuples. MapReduce results in a useful breakdown of information.
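As a rough illustration of the Map and Reduce steps, here is a minimal single-machine sketch in plain Python that counts words. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase) and the sample documents are made up for this example, and a real cluster would spread each step across many machines.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: turn each input record into (key, value) tuples."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key so that each key is reduced exactly once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values into a single, smaller result."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input data for the sketch
documents = ["big data is big", "data is valuable"]
word_counts = reduce_phase(shuffle(map_phase(documents)))
print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, 'valuable': 1}
```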
NoSQL: Refers to database management systems that do not use relational tables that are usually used in traditional relational database systems. It is a data retrieval and storage system designed for managing massive volumes of data without tabular categorization.
SaaS: Stands for Software-as-a-Service, a business model which allows vendors to host applications and to make them available through the use of the internet for a subscription fee. SaaS providers often deliver their services over the cloud.
Spark: An open source computing framework that was initially developed at the University of California, Berkeley, and later donated to the Apache Software Foundation. It is used for interactive analytics and machine learning.
ACID Test: A test which is applied to data for atomicity, consistency, isolation, and durability.
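To make atomicity and consistency a little more concrete, the sketch below uses Python’s built-in sqlite3 module. The accounts table, the names, and the transfer helper are hypothetical and exist only for this example. A transfer that would drive a balance below zero violates the CHECK constraint, and the whole transaction is rolled back, including the credit that had already been applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency rule: no negative balances
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, sender, receiver, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

failed = transfer(conn, "alice", "bob", 500)   # would overdraw alice
after_failed = balances(conn)                  # unchanged: the credit to bob was undone
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
print(failed, after_failed, succeeded, after_success)
```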
Anonymization: Refers to severing links among the people in a particular database and their records to prevent the detection of the source of these records.
Artificial Intelligence: Artificial intelligence (AI) is the development of intelligent software and machines that can perceive their environment, take intelligent action, and, when required, learn from those actions.
Automatic Identification and Capture (AIDC): Consists of methods of collecting and identifying data without requiring manual data entry.
Avro: A data serialization system which allows the schema of Hadoop files to be encoded. It is especially suited to analyzing data and carrying out remote procedure calls.
Cascading: Cascading offers a higher level of abstraction for Hadoop, which allows developers to build complex jobs easily and rapidly in various languages that run on the JVM, including Scala, Ruby, and more.
Chukwa: A Hadoop sub-project which is devoted to the large-scale collection and analysis of logs. It is built on top of HDFS (Hadoop Distributed File System) and MapReduce and inherits Hadoop’s strength and scalability. It also includes a powerful and flexible toolkit for displaying, monitoring, and analyzing results to make the most of the collected data.
Clojure: A functional programming language, a dialect of Lisp, which runs on the JVM (Java Virtual Machine). Clojure is particularly suitable for parallel data processing.
Complex Event Processing: Complex event processing refers to the process of analyzing and monitoring the activities throughout the system of an organization and acting on them when required in real time.
Data Lake: A storage repository which holds a significant amount of raw data in its original format until it is required. Whereas a hierarchical data store keeps data in folders or files, a data lake uses a flat architecture. Each data element in the lake is assigned a unique identifier and tagged with a set of high-level metadata tags. For business analysis, a data lake can be queried for relevant data, and the resulting smaller data sets can be examined to answer business questions.
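The flat architecture described above can be pictured with a toy in-memory sketch. The DataLake class, its put and query methods, and the sample tags are all hypothetical; real data lakes sit on distributed object storage, but the idea of a unique identifier plus metadata tags per raw object is the same.

```python
import uuid

class DataLake:
    """Toy flat store: no folders, just identifier -> (raw data, tags)."""

    def __init__(self):
        self._objects = {}

    def put(self, raw, tags):
        """Store raw data unchanged and return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (raw, set(tags))
        return object_id

    def query(self, *required_tags):
        """Return the raw objects that carry every requested tag."""
        wanted = set(required_tags)
        return [raw for raw, tags in self._objects.values() if wanted <= tags]

lake = DataLake()
lake.put(b'{"user": 1, "page": "/home"}', ["clickstream", "2017", "json"])
lake.put(b"id,amount\n7,19.99\n", ["sales", "2017", "csv"])
lake.put(b'{"user": 2, "page": "/cart"}', ["clickstream", "2018", "json"])

clicks_2017 = lake.query("clickstream", "2017")
print(clicks_2017)
```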
Database-as-a-Service (DaaS): DaaS provides database functionality which is similar to what is in relational database management systems like MySQL, SQL Server, and Oracle. It offers a scalable, flexible, and on-demand platform which is focused on self-service and can be managed easily, mainly for provisioning an environment for a business. These products can also include some degree of data analytics.
Drill: An open-source distributed system, managed by Apache, which helps in performing interactive analysis over large-scale datasets. It can be considered similar to Google’s Dremel.
Grid Computing: Refers to performing computing functions with resources from several distributed systems. It usually involves large files and is often used for various applications. The systems that make up a grid computing network are not required to be alike in design or to be in the same location.
Hama: A distributed computing framework built on Bulk Synchronous Parallel computing techniques for enormous scientific calculations such as graph, matrix, and network algorithms. It is one of the top-level projects undertaken by Apache.
HANA: Refers to a hardware/software in-memory computing tool from SAP which is intended for high-volume transactions and analytics in real-time.
Hive: A Hadoop-based data warehousing framework which was developed by Facebook. It allows users to write queries in HiveQL, a language similar to SQL, which are then transformed into MapReduce jobs. It allows SQL programmers who do not have any MapReduce experience to use the warehouse, and it makes it easier to integrate with business intelligence and visualization tools such as Tableau, MicroStrategy, and more.
Hadoop User Experience (Hue): Hue is an open-source interface which makes it easier to use Apache Hadoop. It is a web-based application and has a file browser for HDFS, a job designer for MapReduce, an Oozie Application for making coordinators and workflows, a Shell, an Impala and Hive UI, and a group of Hadoop APIs.
Impala: Impala offers quick and interactive SQL queries straight to your Apache Hadoop data that is stored in HBase or HDFS using the same SQL syntax, metadata, and user interface as Apache Hive. It provides a unified and familiar platform for real-time and batch-oriented queries.
Kafka: Kafka, developed by LinkedIn, is a distributed publish-subscribe messaging system capable of handling all of the data-flow activity on a consumer website and of processing that data. This type of information is an essential element of the current social web.
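The publish-subscribe idea can be sketched in a few lines of plain Python. This is not the Kafka API: the Broker class, its publish and poll methods, and the topic and consumer names are invented for illustration. It only shows the core pattern, in which producers append messages to a named topic and each consumer reads from its own position in that log.

```python
from collections import defaultdict

class Broker:
    """Toy in-memory publish-subscribe broker (illustration only)."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> append-only message log
        self._offsets = defaultdict(int)  # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        """Producers append to a topic without knowing who will read it."""
        self._logs[topic].append(message)

    def poll(self, consumer, topic):
        """Return the messages this consumer has not yet seen on the topic."""
        log = self._logs[topic]
        start = self._offsets[(consumer, topic)]
        self._offsets[(consumer, topic)] = len(log)
        return log[start:]

broker = Broker()
broker.publish("page_views", {"user": 1, "page": "/home"})
broker.publish("page_views", {"user": 2, "page": "/cart"})

# Two independent consumers each receive every message on the topic.
analytics_batch = broker.poll("analytics", "page_views")
audit_batch = broker.poll("audit", "page_views")

broker.publish("page_views", {"user": 1, "page": "/checkout"})
analytics_next = broker.poll("analytics", "page_views")  # only the new message
print(analytics_batch, audit_batch, analytics_next)
```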
Mashup: A method of merging different datasets into a single application to improve output–for instance, combining real estate listings with demographic data.
Mahout: A library for data mining. It implements the most popular data mining algorithms for regression, clustering, and modeling using the MapReduce model.
Oozie: Refers to a workflow processing system which allows its users to define a series of jobs which can be written in several languages like Pig, MapReduce, and Hive. It then intelligently links them to each other. It permits users to state, for instance, that a particular query is to be started only after the previous jobs on which it depends for data are completed.
Pentaho: Provides a suite of open source BI (Business Intelligence) products known as Pentaho Business Analytics that offers OLAP services, data integration, dashboards, reporting, ETL capabilities, and data mining.
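The dependency rule in the Oozie entry, where a job starts only after the jobs it depends on have finished, amounts to ordering a dependency graph. The sketch below is not Oozie’s workflow format; it uses graphlib from the Python standard library (Python 3.9 or later), and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "load_logs": set(),
    "load_users": set(),
    "clean_logs": {"load_logs"},
    "join_report": {"clean_logs", "load_users"},
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())

completed = []
for job in order:
    # A real scheduler would launch the Pig, Hive, or MapReduce job here.
    assert workflow[job] <= set(completed), "dependency not finished"
    completed.append(job)

print(order)
```

The exact order of independent jobs (here load_logs and load_users) is not fixed, but join_report always comes last because everything else feeds into it.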
Pig: A Hadoop-based language which was developed by Yahoo. It is comparatively easy to understand and learn and is adept at very long and very broad data pipelines.
R: A language and an environment for data visualization and statistical computing. It is similar to the S language. It offers a large variety of statistical (classical statistical tests, linear and nonlinear modeling, classification, time-series analysis, clustering) and graphical techniques which are very extensible.
Sqoop: A tool for moving data between Hadoop and non-Hadoop data stores like data warehouses and relational databases. It permits its users to specify a target location inside Hadoop and direct Sqoop to move data from Teradata, Oracle, or other relational databases to that target.
Storm: A free and open source real-time distributed computing system. It makes it easier to process unstructured data continuously as it arrives, doing for real-time processing what Hadoop does for batch processing.
Thrift: A software framework for scalable cross-language services development. Thrift combines a software stack with a code generation engine to build services which work efficiently and seamlessly between Java, C++, Python, Ruby, PHP, Erlang, Haskell, Perl, and C#.
ZooKeeper: An Apache Software Foundation project which offers a centralized configuration service and an open source naming registry for large distributed systems. It began as a sub-project of Hadoop.