AWS Step Functions: a Serverless Orchestrator

One of the most applauded announcements at re:Invent 2016 was AWS Step Functions (SFN), an orchestration service for AWS Lambda and activity-based tasks. With SFN, you can coordinate multiple executions of your processes using Lambda Functions and activity workers.

What is AWS Step Functions?

AWS Step Functions is the latest application service released by AWS. It solves a problem that many people reading this have probably experienced: orchestrating complex flows using Lambda Functions.

In many use cases, a process is composed of several different tasks. If you want to run the entire process in a serverless way, you can create a Lambda Function for each task and run those functions using your own orchestrator. Writing code that orchestrates those functions can be painful, and the result is hard to debug and optimize. AWS Step Functions removes this need: you describe the flow of your functions or tasks declaratively, and the service runs it for you. According to the AWS documentation page, “AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows.”

Let’s take a look at some examples.

Example 1: Game hosted on AWS

A simple use case could be handling one of your users who completes a level in your game hosted on AWS. In this case, you would need to perform many different tasks, which could include:

  • Updating different DynamoDB tables
  • Storing reports on S3
  • Putting a metric on CloudWatch for further analysis

Implementing even these three tasks can be difficult if each one is a separate Lambda Function that you have to run sequentially or, harder still, in parallel. With AWS Step Functions, you can run those tasks in parallel, handle different kinds of exceptions for each function, and process the final results without further complications. This is how you could implement it:

[Figure: Simple State Machine with Step Functions]
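
To make the parallel flow concrete, here is a minimal sketch in Python that builds an Amazon States Language definition with one Parallel state and three branches. The state names, the Lambda Function names, and the REGION and ACCOUNT_ID placeholders are assumptions made for illustration; they are not the template used in the original example.

```python
import json

# Hypothetical ARN prefix; replace REGION and ACCOUNT_ID with real values.
ARN_PREFIX = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"


def single_task_branch(state_name, function_name):
    """Return a Parallel-state branch that contains a single Task state."""
    return {
        "StartAt": state_name,
        "States": {
            state_name: {
                "Type": "Task",
                "Resource": ARN_PREFIX + function_name,
                "End": True,
            }
        },
    }


def level_completed_definition():
    """Build a definition that runs the three tasks in parallel."""
    return {
        "Comment": "Handle a completed game level (illustrative sketch)",
        "StartAt": "LevelCompleted",
        "States": {
            "LevelCompleted": {
                "Type": "Parallel",
                "Branches": [
                    single_task_branch("UpdateTables", "UpdateDynamoDBTables"),
                    single_task_branch("StoreReport", "StoreReportOnS3"),
                    single_task_branch("PutMetric", "PutCloudWatchMetric"),
                ],
                "End": True,
            }
        },
    }


# Serialize the definition to the JSON text that Step Functions expects.
definition_json = json.dumps(level_completed_definition(), indent=2)
```

The Parallel state finishes only when all three branches have finished, which is what lets you handle the combined result in one place.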

Example 2: A serverless handler for libraries

Other flows require human interaction. For instance, a library wants to keep track of each item loaned out to customers and to help customers return their items before the deadline. In this process, a customer checks out a book, a library employee records that action in the system, and from then on a State Machine orchestrates all of the actions necessary to bring the book from the customer back to the library.

With AWS Step Functions, you can run a Lambda Function that emails the customer a confirmation of the checkout with a link to renew the loan. Another Lambda Function, in conjunction with Amazon API Gateway, can generate a link to mark the loan as complete. After a few days, the State Machine can send an automated message reminding the customer to renew or return the book. This example is a bit more elaborate than the previous one.

[Figure: Serverless handler for libraries]

As you can see, this flow is more complex than the first one, and it is also harder to design and implement with Step Functions. The advantage of using this service is that you can implement very long-running tasks and handle human interaction that can change the flow of execution.
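
A heavily simplified version of the library flow can be sketched as follows. It only covers a confirmation email, a delay, a reminder, and an activity that waits for the human action. All state names, function names, the activity name, and the default delays are assumptions for illustration, and the real flow would need branches for renewals.

```python
import json

SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical resources; replace REGION and ACCOUNT_ID with real values.
LAMBDA_PREFIX = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"
RETURN_ACTIVITY = "arn:aws:states:REGION:ACCOUNT_ID:activity:WaitForReturn"


def library_loan_definition(reminder_after_days=7, loan_days=30):
    """Build a linear loan flow: confirm, wait, remind, wait for the return."""
    return {
        "Comment": "Library loan handler (illustrative sketch)",
        "StartAt": "SendConfirmation",
        "States": {
            "SendConfirmation": {
                "Type": "Task",
                "Resource": LAMBDA_PREFIX + "SendConfirmationEmail",
                "Next": "WaitBeforeReminder",
            },
            "WaitBeforeReminder": {
                "Type": "Wait",
                "Seconds": reminder_after_days * SECONDS_PER_DAY,
                "Next": "SendReminder",
            },
            "SendReminder": {
                "Type": "Task",
                "Resource": LAMBDA_PREFIX + "SendReminderEmail",
                "Next": "WaitForReturn",
            },
            # The activity completes when a worker reports the human action.
            "WaitForReturn": {
                "Type": "Task",
                "Resource": RETURN_ACTIVITY,
                "TimeoutSeconds": loan_days * SECONDS_PER_DAY,
                "End": True,
            },
        },
    }


definition_json = json.dumps(library_loan_definition(), indent=2)
```

The point of the sketch is the shape: Wait states cover the long pauses, and the final activity lets a person (through a link, for example) decide when the execution ends.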

A deeper look at AWS Step Functions

Under the hood

So, what powers AWS Step Functions? As you can imagine, there are several affinities between this service and Amazon Simple Workflow (SWF). In fact, as Tim Bray pointed out in the video presentation at re:Invent 2016, SWF shares part of its backend with SFN, but at first glance, Step Functions is less complicated. Let’s try to understand the main components of this service so that you can start using it in your next project.

State Machine

The biggest component is the State Machine. A State Machine represents the flow that you need to put in place to achieve your goals. For example, to manage lending resources for a library, you need to create a State Machine that coordinates each task to provide a better experience for customers. The two screenshots above are examples of State Machines.

It is very easy to create a State Machine: all you need is a JSON document. Using the API or the AWS Console, you can create it and start as many executions as you need. The JSON template must follow the Amazon States Language. While the language is not so easy to write by hand, the console renders a real-time graph of the State Machine as you type.

The JSON templates that implement the State Machines for examples 1 and 2 above are linked at the end of this post.

State

A State Machine is made of boxes, and each one represents a State. States are referred to by their name inside the State Machine template. Each name must be unique and there are many different State types. Currently, the available states (based on the publish date for this post) are:

  • Choice state: Branch the execution
  • Fail or Succeed state: Stop an execution with a failure or a success
  • Pass state: Pass its input to the output, optionally injecting some fixed data
  • Wait state: Provide a delay for a certain amount of time or until a specified time/date
  • Parallel state: Begin parallel branches of execution
  • Task state: Execute some code in your state machine
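
As a rough illustration of how several of these state types fit together, the sketch below defines a small State Machine with Choice, Pass, Wait, Succeed, and Fail states, plus a helper that flags any state whose Type is not in the list above. The state names, the `$.score` input field, and the error name are hypothetical.

```python
# State types listed in this post.
KNOWN_TYPES = {"Choice", "Fail", "Succeed", "Pass", "Wait", "Parallel", "Task"}

# Hypothetical definition: branch on a score, inject a bonus, pause, finish.
DEFINITION = {
    "StartAt": "CheckScore",
    "States": {
        "CheckScore": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.score",
                    "NumericGreaterThanEquals": 100,
                    "Next": "AddBonus",
                }
            ],
            "Default": "ScoreTooLow",
        },
        "AddBonus": {
            "Type": "Pass",
            "Result": {"bonus": 10},   # fixed data injected into the output
            "ResultPath": "$.reward",
            "Next": "CoolDown",
        },
        "CoolDown": {"Type": "Wait", "Seconds": 5, "Next": "Done"},
        "Done": {"Type": "Succeed"},
        "ScoreTooLow": {
            "Type": "Fail",
            "Error": "ScoreTooLow",
            "Cause": "The score is below 100",
        },
    },
}


def unknown_state_types(definition):
    """Return {state name: type} for states whose Type is not recognized."""
    return {
        name: state.get("Type")
        for name, state in definition["States"].items()
        if state.get("Type") not in KNOWN_TYPES
    }
```

A check like `unknown_state_types` is only a convenience for catching typos locally; the service validates the definition itself when you create the State Machine.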

Task

All of the work in your State Machine is accomplished by tasks. A task can be:

  • A Lambda Function: You have to specify its ARN
  • An Activity: A piece of code that can be hosted wherever you want. The worker calls the GetActivityTask API to pick up a job and the SendTaskSuccess or SendTaskFailure API to report its result. In this way, your State Machine can also include human tasks or tasks that run too long to be hosted in a Lambda Function. In our library example, you need to give the user a link that, when clicked, either renews the resource that has been checked out or marks it as completed. With API Gateway and the SendTaskSuccess or SendTaskFailure API, you can do that.
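
The activity worker cycle described above can be sketched as a single polling step. The function below assumes that `sfn_client` behaves like the boto3 Step Functions client (`boto3.client("stepfunctions")`), which is passed in rather than created here; the function name, the `handler` callback, and the worker name are illustrative assumptions.

```python
import json


def run_activity_once(sfn_client, activity_arn, handler, worker_name="example-worker"):
    """Poll for one activity task, run handler on its input, report the result.

    Returns True if a task was processed and False if the poll came back empty.
    """
    task = sfn_client.get_activity_task(
        activityArn=activity_arn, workerName=worker_name
    )
    token = task.get("taskToken")
    if not token:
        # No task was available during this poll.
        return False
    try:
        result = handler(json.loads(task["input"]))
    except Exception as exc:
        # Report the failure so the State Machine can retry or fail the state.
        sfn_client.send_task_failure(
            taskToken=token, error=type(exc).__name__, cause=str(exc)
        )
    else:
        sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
    return True
```

A real worker would call this in a loop. For the library example, the "handler" could simply be the code behind the API Gateway link that records whether the customer renewed or returned the item.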

Good to know: Pricing

A big difference between Step Functions and Simple Workflow is their pricing. AWS Step Functions is billed for each state transition of your executions. For example, if your State Machine has three steps in series, each execution consists of four state transitions. For each account, the first 4,000 transitions per month are included in the free tier, which does not expire. The free tier is nice to have, but in addition to state transitions you will be charged for Lambda Function executions, data transfer, and EC2 instances if your activities are hosted there. In my opinion, this service is not as inexpensive as you might expect. Using it in production with a lot of executions can incur high expenses, but in many cases it is necessary and removes the pain of orchestrating different tasks. It also provides a lot of nice features.
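
To get a feel for the bill, the arithmetic above can be written down as a small estimator. It uses the transition count from the example (three steps in series give four transitions) and the 4,000 free transitions per month. The price per 1,000 transitions is deliberately a parameter, because this post does not state it; take the current value from the AWS pricing page. The function names are hypothetical, and the estimate covers state transitions only, not Lambda, data transfer, or EC2.

```python
FREE_TRANSITIONS_PER_MONTH = 4000


def transitions_per_execution(steps_in_series):
    """Transitions for a purely sequential State Machine.

    Follows the example in the text: three steps in series -> four transitions.
    """
    return steps_in_series + 1


def monthly_transition_cost(executions, steps_in_series, price_per_1000):
    """Estimate the monthly state-transition charge after the free tier."""
    total = executions * transitions_per_execution(steps_in_series)
    billable = max(0, total - FREE_TRANSITIONS_PER_MONTH)
    return billable * price_per_1000 / 1000
```

With this model, 1,000 executions of a three-step State Machine stay entirely inside the free tier, while a million executions per month produce almost four million billable transitions.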

How can we actually use AWS Step Functions?

After this really long but necessary introduction to AWS Step Functions, let’s move on to how to use this service. There are three ways:

  • AWS Console
  • AWS CloudFormation
  • API/SDK

AWS Console

Simply insert the JSON template in the Code Box and your State Machine will appear in the Preview box.
AWS also provides a small but well-chosen set of blueprints to get started. For example, if you need a simple State Machine made of a parallel step, you only need to click on the related blueprint, change the name and the ARN, and that’s it: your State Machine is ready for production. I think that AWS did a great job here, and the console is really helpful for composing State Machines.

AWS CloudFormation

CloudFormation is the Infrastructure as Code service of AWS. It supports AWS Step Functions, and it is pretty simple to use. As you can see below, you only need to specify the State Machine template and the service role ARN.

{
   "Type": "AWS::StepFunctions::StateMachine",
   "Properties": {
      "DefinitionString": String,
      "RoleArn": String
    }
}
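
Since DefinitionString is a string property, the State Machine definition has to be embedded as serialized JSON. One way to avoid escaping it by hand is to generate the resource with a few lines of Python, as in this sketch. The helper name, the logical resource name, and the role ARN placeholder are assumptions for illustration.

```python
import json


def state_machine_resource(definition, role_arn):
    """Return a CloudFormation resource for the given definition (a dict)."""
    return {
        "Type": "AWS::StepFunctions::StateMachine",
        "Properties": {
            # The definition is embedded as a JSON string, not as an object.
            "DefinitionString": json.dumps(definition),
            "RoleArn": role_arn,
        },
    }


# Hypothetical minimal definition and role ARN.
hello_world = {
    "StartAt": "HelloWorld",
    "States": {
        "HelloWorld": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:HelloWorld",
            "End": True,
        }
    },
}

template = {
    "Resources": {
        "HelloWorldStateMachine": state_machine_resource(
            hello_world, "arn:aws:iam::ACCOUNT_ID:role/StatesExecutionRole"
        )
    }
}

template_json = json.dumps(template, indent=2)
```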

AWS API and AWS SDK

One of the big advantages of AWS is that everything is an API, and this also applies to SFN. These are the service’s most important APIs:

  • CreateStateMachine: To create a state machine, you need to specify a name, a definition written in the Amazon States Language, and a Role ARN that the service will assume.
  • GetActivityTask: With this API, a worker (EC2, ECS container or whatever) receives the activity input to execute it
  • SendTaskFailure and SendTaskSuccess: Allow you to set the result of an activity.

Of course, there are many more APIs available; each one is documented in the AWS Step Functions API Reference.
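
As a sketch of how these calls fit together from the SDK, the function below creates a State Machine and starts one execution. It assumes that `sfn_client` behaves like the boto3 Step Functions client (`boto3.client("stepfunctions")`) and is supplied by the caller; the function name and its arguments are illustrative.

```python
import json


def create_and_start(sfn_client, name, definition, role_arn, execution_input):
    """Create a State Machine from a dict definition and start one execution.

    Returns the ARN of the started execution.
    """
    created = sfn_client.create_state_machine(
        name=name,
        definition=json.dumps(definition),  # the API expects a JSON string
        roleArn=role_arn,
    )
    started = sfn_client.start_execution(
        stateMachineArn=created["stateMachineArn"],
        input=json.dumps(execution_input),
    )
    return started["executionArn"]
```

In practice you would create the State Machine once and then call StartExecution many times; the two calls are combined here only to keep the example short.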

Logging and monitoring

After you have created your State Machine, you would like to run millions of executions and this is the easy part! In fact, AWS Step Functions is a fully managed service and you don’t need to take care of scaling or server maintenance. We are in a serverless world right now!

To stay in control of your system, you need a really good monitoring system in place. With Step Functions, AWS did a pretty good job. In the console, each execution has its own detailed logs for each state. SFN is also integrated with CloudWatch Metrics and CloudTrail. Of course, if your tasks are performed by Lambda Functions, each function delivers its logs to CloudWatch as usual. You can learn more about these services in Introduction to CloudWatch and Learn the tools for governing accounts. Here is a screenshot of the logs that the AWS Console provides.

[Figure: Logging and monitoring in the AWS Console]

Using the console is simple and the user interface is pretty good, but what about getting the logs of an execution via the API? The API that we need here is GetExecutionHistory, which returns the complete history of an execution. Although I have never used it, after reading the documentation I can see that the response could be pretty hard to handle, because each type of event carries its details in its own field. For example, a failure is reported in a different field depending on whether the event type is ActivityFailed or LambdaFunctionFailed, even though both fields contain the same information.
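
One way to tame that response is to map each failure event type to the field that holds its details and normalize the result. The sketch below works on the `events` list of a GetExecutionHistory response and handles only the two event types mentioned above; the function name and the output shape are assumptions for illustration.

```python
# Maps a failure event type to the response field that carries its details.
FAILURE_DETAIL_FIELDS = {
    "ActivityFailed": "activityFailedEventDetails",
    "LambdaFunctionFailed": "lambdaFunctionFailedEventDetails",
}


def extract_failures(events):
    """Return a uniform list of failures from execution history events."""
    failures = []
    for event in events:
        field = FAILURE_DETAIL_FIELDS.get(event.get("type"))
        if field is None:
            continue  # not one of the failure types handled here
        details = event.get(field, {})
        failures.append(
            {
                "type": event["type"],
                "error": details.get("error"),
                "cause": details.get("cause"),
            }
        )
    return failures
```

Other event types (timeouts or the final execution failure, for example) use further fields, so the mapping would need to grow for a complete report.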

Why Step Functions is your friend

There are several great things about this new service:

State as a service

AWS Step Functions provides something that could be called state as a service. Usually, a serverless infrastructure is also stateless. In fact, if you are using multiple Lambda Functions to complete a task, it is really hard to store the state of an execution and keep it up to date. If you need to, you are probably going to use S3 or a database, but this is a repetitive and complex task to accomplish. SFN keeps your state across tasks and orchestrates them so that each one runs only if needed and in the right order.

Keep your tasks alive with a Heartbeat

Another cool feature is that you are able to build really long tasks. The maximum duration for a single execution is one year!

For long tasks, you can also specify the TimeoutSeconds and HeartbeatSeconds parameters. If a state runs longer than its TimeoutSeconds, it fails with a States.Timeout error. The latter parameter is even more powerful: when you specify HeartbeatSeconds, your activity worker must call the SendTaskHeartbeat API at least once within every interval of that many seconds. If the worker does not call that API in time, the state fails with a States.Timeout error. Both parameters are useful, for instance, when your activity has to process a batch of records. You can specify the timeout for the entire duration of the activity or, using HeartbeatSeconds, you can say: the activity must process M records overall, and at least N records every X seconds. You do this by setting HeartbeatSeconds to X and calling the API after every N records.
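
The "N records every X seconds" pattern can be sketched as a worker loop that sends a heartbeat after every N processed records. As before, `sfn_client` is assumed to behave like the boto3 Step Functions client and is supplied by the caller; the function name and the `process_record` callback are hypothetical.

```python
def process_with_heartbeat(sfn_client, task_token, records, process_record, heartbeat_every):
    """Process records and send a heartbeat after every `heartbeat_every` records.

    With HeartbeatSeconds set to X on the state, the worker must get through
    each batch of `heartbeat_every` records in under X seconds, or the state
    fails with States.Timeout.
    """
    processed = 0
    for record in records:
        process_record(record)
        processed += 1
        if processed % heartbeat_every == 0:
            sfn_client.send_task_heartbeat(taskToken=task_token)
    return processed
```

After the loop, the worker would still report the final result with SendTaskSuccess or SendTaskFailure; heartbeats only tell the service that the task is alive.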

Retry strategy

A really difficult problem to solve using Lambda alone is implementing a retry strategy. This is quite important but hard to achieve in a simple way. AWS Step Functions allows you to define a retry strategy for each kind of error that your Lambda Functions can raise. I think this is easier to understand using an example.

{
  "Comment": "Hello World with multiple retry strategies",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:HelloWorld",
      "Retry": [
        {
          "ErrorEquals": ["HandledError"],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        },
        {
          "ErrorEquals": ["CriticalError"],
          "MaxAttempts": 0
        },
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 5,
          "MaxAttempts": 5,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["CriticalError"],
          "Next": "AlertDevOps"
        }
      ],
      "End": true
    },
    "AlertDevOps": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:AlertDevOps",
      "End": true
    }
  }
}

This is the template for creating the Hello World State Machine with a retry strategy. The HelloWorld Lambda Function can fail for different reasons, and each kind of error can be handled with a different strategy. For each strategy you can define three parameters:

  • IntervalSeconds: Represents the number of seconds before the first retry attempt
  • MaxAttempts: Represents the maximum number of retry attempts
  • BackoffRate: The multiplier that increases the retry interval on each attempt

You can use the same strategy for multiple kinds of exceptions by listing multiple values in the ErrorEquals array. If you do not want to retry a function in the event of a specific error, set MaxAttempts to 0 for that error and add a Catch entry whose Next field names another State. In the example above, if a CriticalError happens, the AlertDevOps Lambda Function is invoked and then the execution terminates.
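
To see what the three retry parameters mean in practice, the following sketch computes the waiting time before each retry attempt: the first wait is IntervalSeconds, and each later wait is the previous one multiplied by BackoffRate. The function name is hypothetical, and the calculation ignores the time the task itself takes to run.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Return the wait (in seconds) before each retry attempt, in order."""
    return [
        interval_seconds * backoff_rate ** attempt
        for attempt in range(max_attempts)
    ]
```

For the HandledError strategy in the example (1 second, 2 attempts, rate 2.0), the waits are 1 and 2 seconds. For the States.ALL strategy (5 seconds, 5 attempts, rate 2.0), they are 5, 10, 20, 40, and 80 seconds, so a persistently failing task spends more than two and a half minutes just waiting before the state finally fails.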

Cool Service Console

One thing that I would like to highlight here is the AWS Console. Usually, AWS doesn’t provide a nice console to interact with and that’s ok because after a bit of experience with the service a developer usually switches to use the service via API, CLI, SDK, or even CloudFormation.

With Step Functions, AWS creates a user-friendly experience by providing a lot of blueprints and a nice UI with all the information needed. When you create a State Machine, you can start from different kinds of blueprints and you work with two boxes. The Preview box shows the State Machine that you are building and is based on the Code box, positioned below it. Once you are satisfied with your State Machine, you can create it and start running all the executions that you need. Here the AWS Console is also helpful: for each execution, you have all of the logs that you need, both at the execution level and at the single-state level.

What I’d like to see in the future

I used this service a couple of times and there are several features that I feel are missing:

Price model

The first point is the price. For complex tasks, AWS Step Functions is almost mandatory, but it comes at a high cost. I hope that a price reduction or, even better, a pricing model similar to that of CloudFormation, Elastic Beanstalk, or ECS is forthcoming. With those services you pay only for the resources that you use. With Step Functions, instead, you pay for the resources that you use (Lambda, EC2, on-premises servers), but also for states that only wait or just pass to the next one.

Where are triggers?

With Lambda, AWS taught us to love events and triggers. Where are they? It would be nice to start an execution in response to an event such as an update on an Amazon DynamoDB or Amazon Kinesis stream or an AWS CodeCommit push, or to automatically pull messages from an Amazon SQS queue. For now, you can integrate Step Functions with Amazon API Gateway, which makes human tasks possible in our executions. Yesterday AWS announced the integration between CloudWatch Events and Step Functions. That is good news because it means that AWS is working on integrating more triggers with this service.

What about push events to other services?

The opposite of triggers is missing as well. Here, I’m talking about the ability to automatically send the event received from the previous state in my State Machine to other AWS services. For example, I’d like to receive a notification for each execution that ends without error. The last state of my State Machine could be an integration with Amazon SNS that triggers alerts without any further code.

Conclusion

There are several key points to keep in mind with this service:

  • State as a service in a serverless infrastructure
  • Easy integration of human tasks
  • Really long execution with timeout and heartbeat functionality
  • Deep integrations with CloudWatch Logs, Metrics, and CloudTrail
  • Nice AWS Console with blueprints and everything you need to get started
  • Integration with IAM; each State Machine has its service role
  • Pay attention to the bill: Costs can increase quickly

If you would like to take this intro-level look to the next level, you should try our hands-on lab. In our lab, you will be able to create and use the State Machine in example number one above. You can find the templates of the examples in this post at the following links: game hosted on AWS and library loans handler.

Written by

Giacomo Consonni

Giacomo is a Computer Engineer with a passion for all things AWS and Cloud Technology. Treating every day like a school day, Giacomo is curious about everything. Constantly traveling, discovering and experiencing new things from sports to photography; he revels in uncovering the unknown.

