In our hyper-connected world, countless sources generate real-time information 24 hours per day. Rich streams of data pour in from logs, Twitter trends, financial transactions, factory floors, click streams, and much more, and developing the ability to properly handle such volumes of high-velocity, time-sensitive data demands special attention. Apache Storm and Kafka can process very high volumes of real-time data in a distributed environment in a fault-tolerant manner.
Both Storm and Kafka are top-level Apache projects currently used by various big data and cloud vendors. While Apache Storm offers highly scalable stream processing, Kafka handles messages at scale. In this post, we’ll see how both technologies work seamlessly together to form the bedrock of your real-time data analysis pipeline. You’re going to learn the basics of Apache Storm and how to quickly install it in the AWS cloud environment.
By design, Apache Storm is similar to other distributed computing frameworks such as Hadoop. However, while Hadoop is used for batch or archival data processing, Storm provides a framework for streaming data processing. Here are the components of a Storm cluster:
(Storm cluster components)
The Nimbus service runs on the master node (like the JobTracker in Hadoop). The task of Nimbus is to distribute code around the cluster, assign tasks to servers, and monitor for cluster failures.
Nimbus relies on the Apache ZooKeeper service to monitor message processing tasks. The worker nodes update their task status in Apache ZooKeeper.
The worker nodes that do the actual processing run a daemon called the Supervisor. The Supervisor starts and stops worker processes to complete the tasks assigned by Nimbus.
There are a few more Apache Storm-related terms with which we should be familiar before we actually get started:
(Storm topology)
Setting up Apache Storm in AWS (or on any virtual computing platform) should be as easy as downloading and configuring Storm and a ZooKeeper cluster. The Apache Storm documentation provides excellent guidance. In this blog post, however, we’re going to focus on storm-deploy – an easy-to-use tool that automates the deployment process.
Deploying with storm-deploy is really easy. Storm-deploy is a GitHub project developed by Nathan Marz, the creator of Apache Storm. Storm-deploy automates both provisioning and deployment. In addition, it installs the Ganglia interface for monitoring disk, CPU, and network usage. I will assume that you already have an AWS account with privileges for AWS EC2 and S3.
You will need Java (JDK 1.6 or newer) installed on your workstation. Storm-deploy is built on top of jclouds and Pallet. Apache jclouds is an open-source, multi-cloud toolkit for the Java platform that lets you create applications that are portable across clouds. Similarly, Pallet is used to provision and maintain servers on cloud and virtual machine infrastructure by providing a consistently configured running image across a range of clouds. It is designed for use from the Clojure REPL (Read-Eval-Print Loop), from Clojure code, and from the command line. I simply followed the storm-deploy GitHub project documentation. The commands are universal.
Now let’s see how all this is really done.
Step-1: Generate password-less key pairs:
ssh-keygen -t rsa
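By default, ssh-keygen prompts for a passphrase, but storm-deploy needs a truly password-less key, so you can pass an empty one with -N "". A minimal sketch (the filename storm_deploy_rsa is my own illustrative choice, not something storm-deploy requires):

```shell
# Generate an RSA key pair with an empty passphrase (-N "")
# The key name storm_deploy_rsa is illustrative; any path works.
mkdir -p "$HOME/.ssh"
ssh-keygen -t rsa -N "" -q -f "$HOME/.ssh/storm_deploy_rsa"
# This produces the private key and its .pub counterpart; their
# paths are what you will later put into ~/.pallet/config.clj.
```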
Step-2: Install Leiningen 2. Leiningen is used for automating Clojure project tasks such as project creation and dependency downloads. All you need to do here is download the script:
wget https://raw.github.com/technomancy/leiningen/stable/bin/lein
Step-3: Place the script on your path, say /usr/local/bin, and make it executable:
chmod +x /usr/local/bin/lein
Step-4: Clone storm-deploy using git.
git clone https://github.com/nathanmarz/storm-deploy.git
Step-5: From inside the cloned storm-deploy directory, download all dependencies by running:
lein deps
Step-6: Create a ~/.pallet/config.clj file in your home directory and configure it with the credentials and details necessary to launch and configure instances on AWS.
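The ~/.pallet directory may not exist yet on a fresh workstation, so it helps to create it first. A small sketch:

```shell
# Create the pallet configuration directory and an empty config file,
# then open it in your editor and fill in the credential template
mkdir -p "$HOME/.pallet"
touch "$HOME/.pallet/config.clj"
```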
(defpallet
  :services
  {:default
   {:blobstore-provider "aws-s3"
    :provider "aws-ec2"
    :environment {:user {:username "storm" ; this must be "storm"
                         :private-key-path "$YOUR_PRIVATE_KEY_PATH$"
                         :public-key-path "$YOUR_PUBLIC_KEY_PATH$"}
                  :aws-user-id "$YOUR_USER_ID$"}
    :identity "$YOUR_AWS_ACCESS_KEY$"
    :credential "$YOUR_AWS_ACCESS_KEY_SECRET$"
    :jclouds.regions "$YOUR_AWS_REGION$"}})
The parameters are as follows:
Step-7: There are two important configuration files to edit. In my setup they live under /opt/chandandata/storm-cluster/conf/. The first file, clusters.yaml, looks like this:
################################################################################
# CLUSTERS CONFIG FILE
################################################################################
nimbus.image: "us-east-1/ami-d726abbe"        #64-bit ubuntu
nimbus.hardware: "m1.large"
supervisor.count: 2
supervisor.image: "us-east-1/ami-d726abbe"    #64-bit ubuntu on us-east-1
supervisor.hardware: "m1.large"
#supervisor.spot.price: 1.60
zookeeper.count: 1
zookeeper.image: "us-east-1/ami-d726abbe"     #64-bit ubuntu
zookeeper.hardware: "m1.large"
All of the properties are self-explanatory. We have set the cluster to launch instances in us-east-1 with the AMI ID ami-d726abbe and an instance size of m1.large. (m1.large is an older instance type; newer m3 or m4 series instances generally offer better price/performance.) You can configure the hardware and images, and set a spot-pricing maximum if that’s what you want.
Step-8: The other file, storm.yaml, is where you can set Storm-specific configurations.
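The exact contents of storm.yaml depend on your topology’s needs; as an illustrative sketch (these keys are standard Storm settings, but the values here are examples only, not recommendations):

```yaml
# Example storm.yaml overrides (illustrative values only)
storm.messaging.netty.buffer_size: 5242880
supervisor.worker.timeout.secs: 600
```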
Step-9: Launch the cluster:
lein deploy-storm --start --name cluster-name [--branch {branch}] [--commit {commit tag-or-sha1}]
For example:
lein deploy-storm --start --name cp-cluster
This will launch the cluster with the configured settings. Before submitting an actual Storm job, there are a few additional tasks worth knowing about. To attach your workstation to the cluster:
lein deploy-storm --attach --name cp-cluster
Attaching to a cluster configures your Storm client (which is used to start and stop topologies) to talk to that particular cluster. It writes the location of Nimbus to ~/.storm/storm.yaml so the client knows which cluster to talk to, authorizes your workstation to access the Nimbus daemon’s Thrift port (used for submitting topologies), and grants access to the Storm UI on port 8080 and to Ganglia on port 80 on Nimbus.
To list the IP addresses of the cluster nodes:
lein deploy-storm --ips --name cp-cluster
To stop the cluster when you are done:
lein deploy-storm --stop --name cp-cluster
You can download Apache Storm here. At the time of this writing, the latest Storm release was 0.9.6. With our workstation attached to the Storm cluster, we are ready to run a Storm job.
To submit a topology:
storm jar PATH_TO_JAR.JAR JOB_CLASSNAME
To kill a running topology:
storm kill TOPOLOGY_NAME
In this post, you’ve been introduced to Apache Storm and seen how easy it is to deploy a Storm cluster using storm-deploy. But as noted earlier, Storm on its own does not make a complete large-scale real-time pipeline: it needs Kafka, which acts as the distributed messaging service feeding the Storm topology. We will discuss Kafka integration and compare it with Amazon Kinesis in a later post. Stay tuned!