Big Data on AWS: How the Cloud Can Help You

(Update) We’ve recently uploaded new training material on Big Data using services on Amazon Web Services, Microsoft Azure, and Google Cloud Platform on the Cloud Academy Training Library. On top of that, we’ve been busy adding new content on the Cloud Academy blog on how to best train yourself and your team on Big Data.


Big Data is a term used to describe data that is too large and complex to be processed by traditional data processing techniques, instead, it requires massively parallel software running on a big number of servers, which could be in the range of hundreds or even thousands. The size of the data that can be considered “Big” is relative. What is considered Big Data today, might not be considered “Big” few years ahead: 1 GB of data was considered Big Data years ago; 1 TB (more than a thousand time bigger) is not considered to be “Big” nowadays.

According to the widely used Gartner’s definition, Big Data is mainly characterized by the 3 V’s: Volume (amount of data), Velocity (speed of data in and out), and Variety (range of data types and sources). In 2012, Gartner updated its definition for Big Data as follows: “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization.
Room with several computers

Big Data on AWS

Big Data processing requires huge investments in hardware and processing resources, and that creates an obstacle for small to medium-sized businesses. Cloud computing with public clouds can overcome this obstacle by providing pay-as-you-go, on-demand, and scalable cloud services for Big Data handling. Using cloud computing for Big Data will reduce the cost of hardware, reduce the cost of processing, and facilitate testing the value of Big Data before deploying expensive resources.

Amazon Web Services is the largest public cloud and is described by Gartner to be leading other public clouds by years. It provides a comprehensive set of services that enable customers to rely completely on AWS to handle their Big Data. In addition to database services, AWS makes it easy to provision computation (EC2), storage (S3), data transfer (AWS Direct Connect and Import/Export services), and archiving (Glacier) services to facilitate turning data into information useful for business.  In the rest of this article, we will shed light on AWS data services that are used to handle Big Data.

Amazon EMR

EMR is basically an Amazon-enhanced Hadoop to work seamlessly on AWS. Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware (in EMR it would be AWS virtual servers). Hadoop Distributed File System (HDFS) splits files into large blocks and distributes the blocks amongst the nodes in the cluster. Hadoop Map/Reduce processes the data by moving code to the nodes that have the required data, and the data will be processed in parallel on the nodes.

Hadoop clusters running on Amazon EMR use Amazon S3 for bulk storage of input and output data, and CloudWatch to monitor cluster performance and raise alarms. You can also move data into and out of DynamoDB using Amazon EMR and Hive. All of this is orchestrated by Amazon EMR control software that launches and manages the Hadoop cluster. This process is called an Amazon EMR cluster. EMR has the advantage of using the Cloud over the traditional Hadoop. Users can provision scalable clusters of virtual servers within minutes and pay for the actual use only.

EMR can also integrate and benefit from the other AWS services. Open-source projects that run on top of the Hadoop architecture can also be run on Amazon EMR.

Amazon Redshift

Amazon Redshift is Amazon’s Columnar Data Store, that is data stores arranged in columns instead of rows, enabling faster access for analytic applications. It’s a fully managed petabyte-scale data warehouse service. RedShift is designed for analytic workloads and connects to standard SQL-based clients and business intelligence tools.

According to Amazon’s website, Redshift delivers fast query and I/O performance for virtually any size dataset by using columnar storage technology and parallelizing and distributing queries across multiple nodes. Most common administrative tasks associated with provisioning, configuring, monitoring, backing up, and securing a data warehouse are automated.

DynamoDB

Amazon DynamoDB is a fully managed fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale. It has high availability and reliability with seamless scaling. In DynamoDB service is purchased based on throughput rather than storage. When more throughput is requested, DynamoDB will spread the data and traffic over a number of servers using solid-state drives to allow predictable performance.

DynamoDB supports both document and key-value data models and is schema-less, that is each item (row) has a primary key and any number of attributes (columns), and the primary key is the only required attribute that is needed to identify the item. In addition to query, the primary key, DynamoDB has added flexibility by querying non-primary key attributes using Global Secondary Indexes and Local Secondary Indexes. Its flexible data model and reliable performance make it a great fit for mobile, web, gaming, ad-tech, IoT, and many other applications.

Big Data is not necessarily NoSQL, Relational DB are Big too

Although the term Big Data is mainly associated with NoSQL DBs, Relational DBs can come under the definition of Big Data too. According to Amazon’s website Amazon RDS allows you to easily set up, operate, and scale a relational database in the cloud. It provides cost-efficient and resizable capacity while managing time-consuming database administration tasks, freeing you up to focus on your applications and business.

Amazon Kinesis

AWS has introduced a real-time event processing service called Amazon Kinesis. Amazon describes Kinesis as a fully managed streaming data service in which continuously various types of data such as clickstreams, application logs, and social media can be put into an Amazon Kinesis stream from hundreds of thousands of sources. Within seconds, the data will be available for an Amazon Kinesis Application to read and process from the stream. Amazon Kinesis stream consists of Shards receiving data for the producer application. Shard is the basic unit of Kinesis streams which can support 1 MB of data written per second, and 2 MB of data read per second. The consumer applications take the data from the Kinesis stream and do whatever processing required.

By looking at the services provided by Amazon to handle Big Data, AWS has a complete set that covers all needs for Big Data processing, storage, and transfer. AWS covers the full spectrum of Big Data technologies: Hadoop and Map Reduce (EMR), Relational DBs (RDS), NoSQL DBs (DynamoDB), Columnar Data Stores (RedShift), and Stream Processing (Kinesis). In addition to that, Amazon facilitated connecting these services with each other, and with other services on AWS, and that creates unrivaled flexibility and capabilities for Big Data.

Avatar

Written by

Motasem Aldiab

Motasem Aldiab is a professor, consultant, trainer, and developer. Dr. Aldiab has got his PhD in Computer Engineering from QUB in 2008. He is a certified trainer for the Cloud School and SOA School. He has been training and offering consultations for years in Java, SOA, and Cloud Computing, and leading workshops and training session (virtual or instructor led).


Related Posts

Joe Nemer
Joe Nemer
— September 15, 2020

New Content: Azure DP-100 Certification, Alibaba Cloud Certified Associate Prep, 13 Security Labs, and Much More

This past month our Content Team served up a heaping spoonful of new and updated content. Not only did our experts release the brand new Azure DP-100 Certification Learning Path, but they also created 18 new hands-on labs — and so much more! New content on Cloud Academy At any time, y...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Joe Nemer
Joe Nemer
— August 28, 2020

AWS Certification Practice Exam: What to Expect from Test Questions

If you’re building applications on the AWS cloud or looking to get started in cloud computing, certification is a way to build deep knowledge in key services unique to the AWS platform. AWS currently offers 12 certifications that cover major cloud roles including Solutions Architect, De...

Read more
  • AWS
  • AWS Certifications
Patrick Navarro
Patrick Navarro
— August 25, 2020

Overcoming Unprecedented Business Challenges with AWS

From auto-scaling applications with high availability to video conferencing that’s used by everyone, every day —  cloud technology has never been more popular or in-demand. But what does this mean for experienced cloud professionals and the challenges they face as they carve out a new p...

Read more
  • AWS
  • Cloud Adoption
  • digital transformation
Avatar
Andrew Larkin
— August 18, 2020

Constant Content: Cloud Academy’s Q3 2020 Roadmap

Hello —  Andy Larkin here, VP of Content at Cloud Academy. I am pleased to release our roadmap for the next three months of 2020 — August through October. Let me walk you through the content we have planned for you and how this content can help you gain skills, get certified, and...

Read more
  • alibaba
  • AWS
  • Azure
  • content roadmap
  • Content updates
  • DevOps
  • GCP
  • Google Cloud
  • New content
Alisha Reyes
Alisha Reyes
— August 5, 2020

New Content: Alibaba, Azure AZ-303 and AZ-304, Site Reliability Engineering (SRE) Foundation, Python 3 Programming, 16 Hands-on Labs, and Much More

This month our Content Team did an amazing job at publishing and updating a ton of new content. Not only did our experts release the brand new AZ-303 and AZ-304 Certification Learning Paths, but they also created 16 new hands-on labs — and so much more! New content on Cloud Academy At...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Alisha Reyes
Alisha Reyes
— July 16, 2020

Blog Digest: Which Certifications Should I Get?, The 12 Microsoft Azure Certifications, 6 Ways to Prevent a Data Breach, and More

This month, we were excited to announce that Cloud Academy was recognized in the G2 Summer 2020 reports! These reports highlight the top-rated solutions in the industry, as chosen by the source that matters most: customers. We're grateful to have been nominated as a High Performer in se...

Read more
  • AWS
  • Azure
  • blog digest
  • Certifications
  • Cloud Academy
  • OWASP
  • OWASP Top 10
  • Security
  • VPCs
Avatar
Cloud Academy Team
— July 9, 2020

Which Certifications Should I Get?

The old AWS slogan, “Cloud is the new normal” is indeed a reality today. Really, cloud has been the new normal for a while now and getting credentials has become an increasingly effective way to quickly showcase your abilities to recruiters and companies. With all that in mind, the s...

Read more
  • AWS
  • Azure
  • Certifications
  • Cloud Computing
  • Google Cloud Platform
Alisha Reyes
Alisha Reyes
— July 2, 2020

New Content: AWS, Azure, Typescript, Java, Docker, 13 New Labs, and Much More

This month, our Content Team released a whopping 13 new labs in real cloud environments! If you haven't tried out our labs, you might not understand why we think that number is so impressive. Our labs are not “simulated” experiences — they are real cloud environments using accounts on A...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Joe Nemer
Joe Nemer
— June 19, 2020

Kickstart Your Tech Training With a Free Week on Cloud Academy

Are you looking to make a jump in your technical career? Want to get trained or certified on AWS, Azure, Google Cloud Platform, DevOps, Kubernetes, Python, or another in-demand skill? Then you'll want to mark your calendar. Starting Monday, June 22 at 12:00 a.m. PDT (3:00 a.m. EDT), ...

Read more
  • AWS
  • Azure
  • cloud academy content
  • complimentary access
  • GCP
  • on the house
Alisha Reyes
Alisha Reyes
— June 11, 2020

New Content: AZ-500 and AZ-400 Updates, 3 Google Professional Exam Preps, Practical ML Learning Path, C# Programming, and More

This month, our Content Team released tons of new content and labs in real cloud environments. Not only that, but we introduced our very first highly interactive "Office Hours" webinar. This webinar, Acing the AWS Solutions Architect Associate Certification, started with a quick overvie...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Rebecca Willis
Rebecca Willis
— June 3, 2020

Azure vs. AWS: Which Certification Provides the Brighter Future?

More and more companies are using cloud services, prompting more and more people to switch their current IT position to something cloud-related. The problem is most people only have that much time after work to learn new technologies, and there are plenty of cloud services that you can ...

Read more
  • AWS
  • Azure
  • certification
Alisha Reyes
Alisha Reyes
— June 2, 2020

Blog Digest: 5 Reasons to Get AWS Certified, OWASP Top 10, Getting Started with VPCs, Top 10 Soft Skills, and More

Thank you for being a valued member of our community! We recently sent out a short survey to understand what type of content you would like us to add to Cloud Academy, and we want to thank everyone who gave us their input. If you would like to complete the survey, it's not too late. It ...

Read more
  • AWS
  • Azure
  • blog digest
  • Certifications
  • Cloud Academy
  • OWASP
  • OWASP Top 10
  • Security
  • VPCs