Moving Data to S3 with Apache NiFi

Moving data to the cloud is one of the cornerstones of any cloud migration. Apache NiFi is an open source tool that enables you to easily move and process data using a graphical user interface (GUI).  In this blog post, we will examine a simple way to move data to the cloud using NiFi complete with practical steps. Calculated Systems offers a cloud-first version of NiFi that you can use to follow along.

Cloud Object Storage

There are many ways to store data on the cloud, but the easiest are the object stores. All three major cloud providers have them:
These is an ideal starting point for files as you can typically land the files without too much forethought or capacity planning. Additionally, these object stores are extremely robust, featuring multiple levels of durability and availability.
NiFi Cloud Migration
For the purposes of this tutorial, we will start with the most common object store: Amazon Simple Storage Service (Amazon S3).

Amazon S3 Terminology

Before we get started moving data, let’s establish some basic terminology:
  • Identity and Access Management (IAM) – Controls for making and controlling who and what can interact with your AWS resources.   
  • Access Keys – These are your access credentials to use AWS. These are not your typical username/password — they are generated using access identity management. 
  • Bucket – A grouping of similar files that must have a unique name. These can be made publicly accessible and are often used to host static objects.
  • Folder – Much like an operating system folder, these exist within a bucket to enable organization.

Creating an Access Key

For NiFi to have permission to write to S3, we must set it up with an access key pair. There are many ways to do this, but the best practice is to create a new IAM user. To get to the IAM user screen, navigate to the IAM homepage
  1. Select “Add user” and check “Programmatic access”
  2. Enter a new name such as “NiFi_demo”
  3. Click “Next: Permissions”
  4. Click “Create Group” and you will be presented with a list of permissions you can add to this new user
  5. Enter a group name such as “Nifi_Demo_Group”
  6. Next to filter policies search for S3 and check “AmazonS3FullAccess” > Click “Create Group”
  7. At the bottom right, select “Next:Tags” > Click through to “Next:Review”
  8. Click “Create user” to finish making an IAM User
The access key ID and secret access key are very important to setting up your data transfer. You can download them as a .CSV file or save them somewhere safe.
IMPORTANT: Be sure to record your secret access key as this is the only time it can be viewed. 

Creating an S3 Bucket

Although we will cover the basics of creating your S3 bucket in this post, you can check out Cloud Academy’s Storage Fundamentals of AWS for an in-depth overview. Now that we have credentials for AWS, we need a place to land them. To put it simply, we need to create a new S3 bucket if you do not already have one. Go to the AWS S3 Console
  1. Click “+ Create Bucket”
  2. Enter a unique bucket name and the region you are creating it in
  3. Click through until the bucket is created (default options are fine to use)
  4. Click on your new bucket and you should be able to see its contents — which will be empty

Setting up your NiFi & AWS Credential Service or Processor Controls

NiFi can be setup several ways including download from the Apache website or using a pre-made solution like Calculated System’s AWS Marketplace Offering.
NiFi has many ways to provide access to AWS either through an overarching credential service or parameters set to a specific processor. The credential service is ideal when you have multiple processors all relying on the same keys. For the scope of this tutorial, we will not be using the service, but it is ideal when moving into a production setting.
  1. To get started, click and drag in a new processor “PutS3Object” > right-click “Configure the processor”
  2. Under the Settings tab, you will see Automatically Terminate Relations > check the boxes next to “failure” and “success” since this is the last processor in the flow.
  3. Under the Properties tab, configure the following properties:
    • Access Key ID – From the User you created earlier and noted down
    • Secret Access Key ID – From the User you created earlier and noted down
    • Bucket – Put the name of the bucket you created
    • Region – The region your bucket is located; often U.S. East (N. Virginia)

    Processor Configuration

  4. Click “Apply” to finish up configuring the processor. 

Setting Up Your Flow

For the purposes of this sample flow, let’s replicate NiFi’s own configuration directory to S3. To accomplish this, we need two additional processors: ListFiles and FetchFiles. Connect and configure them as shown below.
List File Flow
ListFile
  • Properties tab – Set “Input Directory” to /nifi/docs/html
  • Drag a connection from ListFile to FetchFile for relationship success
FetchFile
  • Settings tab – Check the boxes next to “Failure,” “not.found,” * “permission.denied”
  • Drag a connection from FetchFile to PutS3Object for relationship success
Running Your Flow
  • Right-click each of the processors > click “Start”
  • Let this run for a few seconds. If you want to track the progress, right-click into any blank space of your NiFi canvas and press “refresh.” You should see each processor reporting flowfiles “in” and “out”
  • For the purpose of this demo, right-click “Stop list files.” In production, you can leave this task running, but it is always best to stop demos when done. This stops the demo from producing sample files after you stopped using the program.

Viewing the Objects in S3

If you return to your bucket, you should see your files listed. Note: You may have to refresh button the page depending on your browser/settings.
S3 Bucket

[Optional] Security Cleanup

As an optional step, you may wish to revoke the access keys you gave to this Nifi Demo. It is general best practice to remove unused keys when done. To revoke the keys, go the AWS Console.
  • Click on the user you created earlier in the tutorial
  • Go to the Security Credentials tab and search for the Access Keys subsection. Here you can inactivate, delete, or even make new keys.
  • As a best practice, make the key inactive or delete the key.
Chris Gambino

Written by

Chris Gambino

Chris has been focused on the big data ecosystem for years. Starting with simple databases he focused on Hadoop for several years before switching to a cloud-first approach. An Author of "Nifi for Dummies", his approach involves a holistic evaluation of the problem before assigning technology. https://www.calculatedsystems.com/


Related Posts

Avatar
Stuart Scott
— October 16, 2019

AWS Security: Bastion Host, NAT instances and VPC Peering

Effective security requires close control over your data and resources. Bastion hosts, NAT instances, and VPC peering can help you secure your AWS infrastructure. Welcome to part four of my AWS Security overview. In part three, we looked at network security at the subnet level. This ti...

Read more
  • AWS
Avatar
Sudhi Seshachala
— October 9, 2019

Top 13 Amazon Virtual Private Cloud (VPC) Best Practices

Amazon Virtual Private Cloud (VPC) brings a host of advantages to the table, including static private IP addresses, Elastic Network Interfaces, secure bastion host setup, DHCP options, Advanced Network Access Control, predictable internal IP ranges, VPN connectivity, movement of interna...

Read more
  • AWS
  • best practices
  • VPC
Avatar
Stuart Scott
— October 2, 2019

Big Changes to the AWS Certification Exams

With AWS re:Invent 2019 just around the corner, we can expect some early announcements to trickle through with upcoming features and services. However, AWS has just announced some big changes to their certification exams. So what’s changing and what’s new? There is a brand NEW ...

Read more
  • AWS
  • Certifications
Alisha Reyes
Alisha Reyes
— October 1, 2019

New on Cloud Academy: ITIL® 4, Microsoft 365 Tenant, Jenkins, TOGAF® 9.1, and more

At Cloud Academy, we're always striving to make improvements to our training platform. Based on your feedback, we released some new features to help make it easier for you to continue studying. These new features allow you to: Remove content from “Continue Studying” section Disc...

Read more
  • AWS
  • Azure
  • Google Cloud Platform
  • ITIL® 4
  • Jenkins
  • Microsoft 365 Tenant
  • New content
  • Product Feature
  • Python programming
  • TOGAF® 9.1
Avatar
Stuart Scott
— September 27, 2019

AWS Security Groups: Instance Level Security

Instance security requires that you fully understand AWS security groups, along with patching responsibility, key pairs, and various tenancy options. As a precursor to this post, you should have a thorough understanding of the AWS Shared Responsibility Model before moving onto discussi...

Read more
  • AWS
  • instance security
  • Security
  • security groups
Avatar
Jeremy Cook
— September 17, 2019

Cloud Migration Risks & Benefits

If you’re like most businesses, you already have at least one workload running in the cloud. However, that doesn’t mean that cloud migration is right for everyone. While cloud environments are generally scalable, reliable, and highly available, those won’t be the only considerations dri...

Read more
  • AWS
  • Azure
  • Cloud Migration
Joe Nemer
Joe Nemer
— September 12, 2019

Real-Time Application Monitoring with Amazon Kinesis

Amazon Kinesis is a real-time data streaming service that makes it easy to collect, process, and analyze data so you can get quick insights and react as fast as possible to new information.  With Amazon Kinesis you can ingest real-time data such as application logs, website clickstre...

Read more
  • amazon kinesis
  • AWS
  • Stream Analytics
  • Streaming data
Joe Nemer
Joe Nemer
— September 6, 2019

Google Cloud Functions vs. AWS Lambda: The Fight for Serverless Cloud Domination

Serverless computing: What is it and why is it important? A quick background The general concept of serverless computing was introduced to the market by Amazon Web Services (AWS) around 2014 with the release of AWS Lambda. As we know, cloud computing has made it possible for users to ...

Read more
  • AWS
  • Azure
  • Google Cloud Platform
Joe Nemer
Joe Nemer
— September 3, 2019

Google Vision vs. Amazon Rekognition: A Vendor-Neutral Comparison

Google Cloud Vision and Amazon Rekognition offer a broad spectrum of solutions, some of which are comparable in terms of functional details, quality, performance, and costs. This post is a fact-based comparative analysis on Google Vision vs. Amazon Rekognition and will focus on the tech...

Read more
  • Amazon Rekognition
  • AWS
  • Google Cloud Platform
  • Google Vision
Alisha Reyes
Alisha Reyes
— August 30, 2019

New on Cloud Academy: CISSP, AWS, Azure, & DevOps Labs, Python for Beginners, and more…

As Hurricane Dorian intensifies, it looks like Floridians across the entire state might have to hunker down for another big one. If you've gone through a hurricane, you know that preparing for one is no joke. You'll need a survival kit with plenty of water, flashlights, batteries, and n...

Read more
  • AWS
  • Azure
  • Google Cloud Platform
  • New content
  • Product Feature
  • Python programming
Joe Nemer
Joe Nemer
— August 27, 2019

Amazon Route 53: Why You Should Consider DNS Migration

What Amazon Route 53 brings to the DNS table Amazon Route 53 is a highly available and scalable Domain Name System (DNS) service offered by AWS. It is named by the TCP or UDP port 53, which is where DNS server requests are addressed. Like any DNS service, Route 53 handles domain regist...

Read more
  • Amazon
  • AWS
  • Cloud Migration
  • DNS
  • Route 53
Alisha Reyes
Alisha Reyes
— August 22, 2019

How to Unlock Complimentary Access to Cloud Academy

Are you looking to get trained or certified on AWS, Azure, Google Cloud Platform, DevOps, Cloud Security, Python, Java, or another technical skill? Then you'll want to mark your calendars for August 23, 2019. Starting Friday at 12:00 a.m. PDT (3:00 a.m. EDT), Cloud Academy is offering c...

Read more
  • AWS
  • Azure
  • cloud academy content
  • complimentary access
  • GCP
  • on the house