Harnessing the Power of Big Data Analysis on AWS

Like a jigsaw puzzle, there are many components in the AWS big data ecosystem. Read this article and see how the components fit together to form a beautiful whole.

If you are a data engineer, wouldn’t it be great if you could easily scale your existing infrastructure on-demand to support your real-time data pipelines?

If you are a data scientist, wouldn’t it be great if you could leverage massive clusters of computers to conduct analyses on large datasets and build statistical models quickly?

Or, maybe you are playing both roles. Perhaps you are a student and you are trying to find a platform to get started in big data quickly. Whatever your current role might be, be sure to read on to learn more about AWS’s big data capabilities provided by the following services: EMR, DynamoDB, Redshift, Data Pipeline, Kinesis Streams, Machine Learning, Elasticsearch, and Kinesis Firehose. That’s a lot of ground to cover, so let’s get started!

Need a big data platform? Look no further than Amazon Web Services!

AWS is one of the largest and fastest-growing cloud infrastructure providers in the world. Over the past ten years, AWS has developed more than 70 products and services to support different and often very demanding types of cloud computing use cases.

The company dominates the global cloud industry with a market share of more than 30%. Even though Microsoft Azure and Google Cloud Platform are growing rapidly, AWS still grew 63% year over year. The brand equity that AWS has built over the years has allowed the company to gain a strong foothold in the increasingly competitive cloud computing landscape.

AWS has embraced the big data movement as part of its Infrastructure-as-a-Service platform strategy since as early as 2008, just two years after it officially launched as a subsidiary of Amazon.com.

In 2008, AWS announced public access to huge data sets such as the map of the human genome and the US census

It all began with an announcement that certain large public datasets, such as the mapping of the Human Genome and the US Census data, would be made available on Amazon S3, so that anyone who wants to analyze these datasets can do so easily and quickly by accessing them directly from their Amazon EC2 instances.

To get an idea of how large some of these datasets are, the 1000 Genomes Project, which tries to build the most comprehensive catalog of human genetic variation, is over 200TB. The 1980, 1990 and 2000 US Census data are approximately 5GB, 50GB and 200GB respectively. AWS is able to leverage its cloud computing infrastructure to host these datasets with the hope of spurring innovation.

Shortly after introducing these large public datasets, AWS started supporting big data processing by launching the Amazon Elastic MapReduce (EMR) service in April 2009. Amazon EC2 and S3 are the basic building blocks that make Amazon EMR possible. AWS made incremental improvements and added new features to EMR, and also introduced other big data-related services over the years, which I have summarized chronologically in the following list:

Let’s take a look at the constituent services of the AWS big data infrastructure-as-a-service.

Amazon EMR: Elastic MapReduce Hadoop Service

Amazon EMR is a service that lets you create and manage large-scale distributed data processing clusters easily. Data engineers know that setting up different components in the Hadoop ecosystem and getting them to work well together often requires a lot of effort. Amazon EMR takes care of this burden and lets you focus on the dataset instead.

Amazon EMR supports Hadoop and Spark, along with interactive notebooks like Hue and Zeppelin-Sandbox, as well as machine learning frameworks like Mahout and Spark MLlib. For the full list of supported applications, see the Amazon EMR release page.
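To make the idea of "creating a cluster" concrete, here is a minimal sketch of launching an EMR cluster from Python with boto3 (the AWS SDK for Python). The cluster name, release label, instance type, and region are illustrative assumptions, and the role names assume the default EMR roles already exist in your account. The request is built by a plain function so you can inspect it before anything is sent to AWS.

```python
import json


def build_emr_cluster_request(name, release_label, applications,
                              instance_type="m5.xlarge", instance_count=3):
    """Build keyword arguments for the EMR RunJobFlow API call.

    All default values here are illustrative assumptions, not recommendations.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Each application to install is passed as {"Name": ...}.
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumes the default EMR roles have been created in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Send the request to EMR and return the new cluster (job flow) ID."""
    import boto3  # imported lazily so the builder works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    # Hypothetical cluster name and release label, for illustration only.
    request = build_emr_cluster_request(
        "spark-sandbox", "emr-5.30.0", ["Hadoop", "Spark", "Zeppelin"]
    )
    print(json.dumps(request, indent=2))
```

Calling `launch_cluster(request)` would start real, billable instances, so it is left out of the example run above.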

I have written a two-part series about using Apache Spark and Apache Zeppelin on Amazon EMR – See Part 1 and Part 2. Check them out. You may find my notes on IAM helpful too.

Cloud Academy offers quizzes around Amazon EMR and you can try them out (and all the Cloud Academy resources) for free.

Amazon DynamoDB: Managed NoSQL Databases

Amazon DynamoDB is a fully managed NoSQL database service that you can use to import, persist, and extract schema-less data. You can create DynamoDB tables and populate them with data from CSV files. You can also export data from DynamoDB tables to Amazon S3 buckets as a backup.
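As a rough illustration of populating a table from a CSV file, the sketch below turns CSV rows into DynamoDB items and writes them with boto3's batch writer. The table name, region, and column names are hypothetical, and the table is assumed to exist already with a key that matches one of the CSV columns. Empty cells are simply left out of the item, which shows the schema-less side of DynamoDB: items in the same table do not need the same attributes.

```python
import csv
import io


def csv_to_items(csv_text):
    """Convert CSV text into a list of DynamoDB-style items (dicts).

    Empty cells are dropped, so items may carry different attributes.
    All values stay as strings; type conversion is out of scope here.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {column: value for column, value in row.items() if value != ""}
        for row in reader
    ]


def load_items(table_name, items, region="us-east-1"):
    """Write items to an existing DynamoDB table in batches."""
    import boto3  # imported lazily so csv_to_items works without boto3

    table = boto3.resource("dynamodb", region_name=region).Table(table_name)
    # batch_writer buffers the puts and sends them as batch requests.
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)


if __name__ == "__main__":
    sample = "id,name,city\n1,Ada,\n2,Bob,Oslo\n"
    for item in csv_to_items(sample):
        print(item)
```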
Our Database Fundamentals for AWS course gives you an introduction to Amazon DynamoDB. We also have a dedicated course focusing on Working with DynamoDB. At the same time, you should also try our hands-on lab to create and query DynamoDB tables.

Amazon Redshift: Industrial-Grade Data Warehousing

Amazon Redshift is a fully managed petabyte-scale data warehouse service. One petabyte is approximately 1,000 terabytes (in case you were going to look it up). You can create and provision an Amazon Redshift cluster to store large amounts of data, and perform fast queries using SQL query tools like SQL Workbench/J or Re:dash. You can also connect your business intelligence application, or any other application, as long as it supports the standard PostgreSQL JDBC or ODBC drivers.
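Because Redshift accepts PostgreSQL-compatible connections, a Python client can talk to it in much the same way as the SQL tools above. The sketch below assumes the third-party `psycopg2` driver is installed; the host, database, user, and query are placeholders. Port 5439 is Redshift's default, but your cluster may be configured differently.

```python
def redshift_connection_params(host, dbname, user, password, port=5439):
    """Collect connection settings; 5439 is the default Redshift port."""
    return {
        "host": host,
        "port": port,
        "dbname": dbname,
        "user": user,
        "password": password,
    }


def run_query(params, sql):
    """Run a SQL statement against the cluster and return all rows."""
    import psycopg2  # third-party PostgreSQL driver, assumed to be installed

    conn = psycopg2.connect(**params)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Placeholder endpoint and credentials, for illustration only.
    params = redshift_connection_params(
        "example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        "dev", "analyst", "not-a-real-password",
    )
    print({k: v for k, v in params.items() if k != "password"})
```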
We have an introductory blog post, and as you might imagine some detailed learning resources if you would like to find out more.

AWS Data Pipeline: Data Processing Workflow Automation

AWS Data Pipeline lets you define a workflow to automate the processing and moving of data from one AWS service to another on a regular basis. For example, you can create a pipeline to launch EMR jobs that run Hive queries on data imported from Amazon RDS, to back up DynamoDB tables to Amazon S3 at the end of every business day, or to import new data into Amazon Redshift as it becomes available and then send an Amazon SNS notification that the import has completed.
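A pipeline definition is essentially a list of objects (schedules, activities, data nodes, alarms) that refer to one another. The sketch below shows only the scheduling part of a "run once a day" pipeline, expressed in the key/value form that the Data Pipeline API expects. The object IDs and start time are hypothetical, and the activity objects that would do the real work (an EMR job, a Redshift copy, and so on) are deliberately omitted, so this is a starting point rather than a complete pipeline.

```python
def pipeline_object(obj_id, name, fields=None, refs=None):
    """Build one pipeline object in the API's key/value wire format.

    `fields` holds plain string settings; `refs` holds references to the
    IDs of other objects in the same pipeline definition.
    """
    wire_fields = [{"key": k, "stringValue": v} for k, v in (fields or {}).items()]
    wire_fields += [{"key": k, "refValue": v} for k, v in (refs or {}).items()]
    return {"id": obj_id, "name": name, "fields": wire_fields}


def daily_schedule_objects(start_date_time):
    """Return the objects that make everything in a pipeline run once a day."""
    return [
        # The Default object supplies settings inherited by other objects.
        pipeline_object(
            "Default", "Default",
            fields={"scheduleType": "cron"},
            refs={"schedule": "DailySchedule"},
        ),
        pipeline_object(
            "DailySchedule", "DailySchedule",
            fields={
                "type": "Schedule",
                "period": "1 day",
                "startDateTime": start_date_time,
            },
        ),
    ]


def upload_definition(pipeline_id, objects, region="us-east-1"):
    """Attach the objects to an existing pipeline (created beforehand)."""
    import boto3  # imported lazily so the builders work without boto3

    client = boto3.client("datapipeline", region_name=region)
    return client.put_pipeline_definition(
        pipelineId=pipeline_id, pipelineObjects=objects
    )


if __name__ == "__main__":
    for obj in daily_schedule_objects("2016-01-01T18:00:00"):
        print(obj)
```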
Our Automated Data Management with EBS, S3, and Glacier course shows you the best methods of backing up your data resources using AWS Data Pipeline.

Amazon Kinesis Streams: Streaming Data Service

Amazon Kinesis Streams is a managed service that ingests streaming data from many different sources. It makes the data available to Amazon Kinesis Applications, which read and process the streaming data in real time. These applications are data consumers developed with the Amazon Kinesis API or the Amazon Kinesis Client Library (KCL). There is also a pre-built library that you can use to integrate Amazon Kinesis Streams with Apache Storm. You can think of Kinesis Streams as AWS’s equivalent of Apache Kafka.
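On the producing side, ingesting an event comes down to sending a blob of bytes plus a partition key to a named stream. The sketch below builds such a record from a Python dict and sends it with boto3; the stream name, region, and event fields are hypothetical. The partition key determines which shard receives the record, so records that share a key stay together.

```python
import json


def build_kinesis_record(stream_name, event, partition_key_field):
    """Build keyword arguments for the Kinesis PutRecord API call."""
    return {
        "StreamName": stream_name,
        # Kinesis treats the payload as opaque bytes; JSON is our choice here.
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event[partition_key_field]),
    }


def send_event(record, region="us-east-1"):
    """Put one record on the stream and return its sequence number."""
    import boto3  # imported lazily so the builder works without boto3

    kinesis = boto3.client("kinesis", region_name=region)
    return kinesis.put_record(**record)["SequenceNumber"]


if __name__ == "__main__":
    # Hypothetical stream name and event shape, for illustration only.
    event = {"sensor_id": "sensor-7", "temperature": 21.5}
    record = build_kinesis_record("example-stream", event, "sensor_id")
    print(record["PartitionKey"], record["Data"])
```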
Our blog post from 2015 titled Amazon Kinesis: Managed Real-time Event Processing is a good starting point and I encourage you to review posts from different sources.

Amazon Machine Learning

Amazon Machine Learning is a service that makes it easy for anyone to create machine learning models to do things like classifications or predictions through the use of wizards. You do not need to have a strong grasp of machine learning to use this service, although you are strongly advised to understand what you are doing.

Try Amazon Machine Learning with our hands-on lab using the HAR (Human Activity Recognition) dataset. We also have a course that introduces you to the principles and practice of Amazon Machine Learning.

Amazon Elasticsearch: Distributed Search Engine

Elasticsearch is an open source distributed search and analytics engine that you can use to run full-text queries on large amounts of data to make sense of the data. Amazon Elasticsearch is a managed service that lets you launch Elasticsearch on AWS.
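To give a feel for a full-text query, the sketch below builds a simple `match` query in the Elasticsearch query DSL and posts it to an index's `_search` endpoint using only the Python standard library. The endpoint URL, index name, and field name are placeholders, and the example assumes the domain's access policy allows unsigned requests from your address; many domains instead require signed requests, which are not shown here.

```python
import json
import urllib.request


def build_match_query(field, text, size=10):
    """Build a full-text `match` query body for the _search endpoint."""
    return {"size": size, "query": {"match": {field: text}}}


def search(endpoint, index, query):
    """POST the query to <endpoint>/<index>/_search and return parsed JSON."""
    url = f"{endpoint.rstrip('/')}/{index}/_search"
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical field and search text, for illustration only.
    print(json.dumps(build_match_query("message", "timeout error", size=5)))
```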

We have a couple of blog posts about our impression of Amazon Elasticsearch when it was first launched, and our comparison with Amazon CloudSearch.

Amazon Kinesis Firehose

Amazon Kinesis Firehose is a managed service that lets you create delivery streams to send streaming data to AWS services such as Amazon S3, Amazon Redshift or Amazon Elasticsearch. Amazon Kinesis Firehose simply reads and writes data, and does not do any processing to the data stream.
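Sending data to a delivery stream looks much like writing to Kinesis Streams, except that there is no partition key and no consumer to write. The sketch below encodes events as newline-delimited JSON and sends them in batches with boto3; the delivery stream name and region are hypothetical. The batch size of 500 reflects the per-call record limit of `PutRecordBatch`.

```python
import json

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(events):
    """Encode events as newline-delimited JSON records."""
    # The trailing newline keeps records separable once Firehose
    # concatenates them into a single delivered object.
    return [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]


def chunk(records, size=MAX_BATCH):
    """Split a list of records into consecutive batches of at most `size`."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def deliver(stream_name, events, region="us-east-1"):
    """Send all events to the delivery stream; return the failed-record count."""
    import boto3  # imported lazily so the helpers work without boto3

    firehose = boto3.client("firehose", region_name=region)
    failed = 0
    for batch in chunk(to_firehose_records(events)):
        response = firehose.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        failed += response["FailedPutCount"]
    return failed


if __name__ == "__main__":
    records = to_firehose_records([{"n": i} for i in range(1200)])
    print([len(batch) for batch in chunk(records)])
```

A production sender would also retry the individual records reported as failed, which this sketch only counts.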

Conclusion

As you probably gathered from reading this post, big data isn’t confined to a single area. I included links to useful posts and other articles. Researching this article forced me to see AWS Big Data with fresh eyes. AWS never stands still and that dynamic progress inspires me.

I hope that this blog post inspires you to explore and learn more about the different big data options that are available on AWS. Check out our Analytics Fundamentals for AWS course. The course covers numerous analytics tools including Amazon EMR, Kinesis Streams and Firehose, Machine Learning, Data Pipeline and Elasticsearch.

Have any questions? Leave a comment below!

Written by

Eugene Teo

Eugene Teo is a director of security at a US-based technology company. He is interested in applying machine learning techniques to solve problems in the security domain.

