Building a Data Pipeline in DC/OS



Total available time: 1h:0m


Lab Overview

DC/OS makes it relatively straightforward to assemble a data pipeline from packaged services. In this Lab, you will perform streaming data analytics by building a data pipeline in DC/OS that combines multiple services with a Twitter-like application. Along the way, you will review many fundamental DC/OS concepts, including installing packages, load balancing traffic with Marathon-LB, and working with virtual IPs.

Lab Objectives

Upon completion of this Lab, you will be able to:

  • Install DC/OS packages with custom options using the DC/OS CLI
  • Deploy a data pipeline using Kafka, Cassandra, and a social networking app
  • Use the Zeppelin package and DC/OS Spark to perform basic streaming analytics on the data pipeline

Lab Prerequisites

You should be familiar with:

  • Basic and intermediate DC/OS concepts including Virtual IPs and Marathon-LB
  • Working at the command-line in Linux
  • AWS services (optional; useful for understanding the architecture of the pre-created DC/OS cluster)

Lab Environment

Before completing the Lab instructions, the environment will look as follows:

After completing the Lab instructions, the environment should look similar to:

Follow these steps to learn by building helpful cloud resources

Logging in to the Amazon Web Services Console

Your first step to start the Lab experience

Understanding the DC/OS Cluster Architecture

Understand the cluster architecture and the resources provisioned for this Lab

Connecting to the Virtual Machine using SSH

Create a secure connection to a remote machine

Installing the DC/OS CLI on Linux

Install the DC/OS command-line interface (CLI) on Linux

Installing the Required Packages in the DC/OS Cluster

Install the Marathon-LB, Cassandra, and Kafka packages from the Mesosphere Catalog
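As a rough illustration of installing packages with custom options, the sketch below writes an options file and assembles the matching `dcos package install` commands without running them. The `PACKAGE_OPTIONS` values, including the Kafka broker count, are assumptions made for illustration and are not the options used in this Lab; the option names available depend on the package version.

```python
import json
import shlex
import tempfile
from pathlib import Path

# Hypothetical custom options per package. None means "install with defaults".
# The Kafka broker count is an assumed example, not a value from the Lab.
PACKAGE_OPTIONS = {
    "marathon-lb": None,
    "cassandra": None,
    "kafka": {"brokers": {"count": 3}},
}


def build_install_commands(options_dir):
    """Return one `dcos package install` argument list per package.

    Packages with custom options get a JSON file written to options_dir
    and a matching --options=<file> argument.
    """
    commands = []
    for package, options in PACKAGE_OPTIONS.items():
        cmd = ["dcos", "package", "install", package, "--yes"]
        if options is not None:
            path = Path(options_dir) / f"{package}-options.json"
            path.write_text(json.dumps(options, indent=2))
            cmd.append(f"--options={path}")
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Print the commands instead of executing them against a cluster.
        for cmd in build_install_commands(tmp):
            print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch usable without a cluster; on the Lab's virtual machine the same commands could be run directly in the shell.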

Running the Tweeter Application

Deploy the Tweeter application on your DC/OS cluster
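To make the roles of virtual IPs and Marathon-LB in this step more concrete, here is a minimal sketch of a Marathon app definition for a Tweeter-like service, built as a Python dictionary. The image name, ports, resource sizes, and VIP address are all assumptions for illustration; the Lab's actual definition may differ.

```python
import json


def tweeter_app(instances=3, vip="1.1.1.1:30000"):
    """Build a hypothetical Marathon app definition for the Tweeter app.

    All concrete values (image, ports, sizes, VIP) are assumed examples.
    """
    return {
        "id": "/tweeter",
        "cpus": 0.25,
        "mem": 256,
        "instances": instances,
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": "mesosphere/tweeter",  # assumed image name
                "network": "BRIDGE",
                "portMappings": [
                    {
                        "containerPort": 3000,
                        "hostPort": 0,  # let Mesos pick a host port
                        "servicePort": 10000,
                        # The VIP label gives the app a stable internal address.
                        "labels": {"VIP_0": vip},
                    }
                ],
            },
        },
        # Marathon-LB only exposes apps whose group matches its own.
        "labels": {"HAPROXY_GROUP": "external"},
    }


if __name__ == "__main__":
    print(json.dumps(tweeter_app(), indent=2))
```

Saving the printed JSON to a file such as `tweeter.json` (a hypothetical name) would allow deploying it with `dcos marathon app add tweeter.json`.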

Analyzing Tweets in Real-Time with Zeppelin

Use Zeppelin for streaming analytics of the tweets coming in from Kafka
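The kind of aggregation a streaming job performs on incoming tweets can be sketched in plain Python. This is a stand-in for the idea of a sliding-window count, not the Spark or Zeppelin code used in the Lab; the `WindowedCounter` class and the tweet handles are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import Counter, deque


class WindowedCounter:
    """Count tweets per handle over a sliding time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, handle), oldest first
        self.counts = Counter()

    def add(self, timestamp, handle):
        """Record one tweet and drop events that fell out of the window."""
        self.events.append((timestamp, handle))
        self.counts[handle] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # An event is expired once it is window_seconds old or older.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_handle = self.events.popleft()
            self.counts[old_handle] -= 1
            if self.counts[old_handle] == 0:
                del self.counts[old_handle]

    def top(self, n):
        """Return the n most active handles in the current window."""
        return self.counts.most_common(n)


if __name__ == "__main__":
    counter = WindowedCounter(window_seconds=10)
    for ts, handle in [(0, "alice"), (1, "bob"), (2, "alice"), (12, "bob")]:
        counter.add(ts, handle)
    print(counter.top(3))
```

A streaming engine applies the same pattern across a cluster, with the input coming from a Kafka topic rather than an in-memory list.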