One of the most applauded announcements at re:Invent 2016 was AWS Step Functions. Step Functions is basically an orchestration service for AWS Lambda and activity-based tasks. Thanks to SFN, you can control multiple executions of your processes using Lambda Functions and activity workers.

What is AWS Step Functions?

AWS Step Functions is the latest application service released by AWS to solve a problem that many people reading this have probably experienced: orchestrating complex flows using Lambda Functions.

In many use cases, there are several processes composed of different tasks. If you want to run the entire process in a serverless way, you can create a Lambda Function for each task and run those functions using your own orchestrator. Writing code that orchestrates those functions can be painful and really hard to debug and optimize. AWS Step Functions removes this need by letting you design even a complex flow for your functions or tasks in a simple way. According to the AWS documentation page, “AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows.”

Let’s take a look at some examples.

Example 1: Game hosted on AWS

A simple use case could be handling one of your users who completes a level in your game hosted on AWS. In this case, you would need to perform many different tasks, which could include:

  • Updating different DynamoDB tables
  • Storing reports on S3
  • Putting a metric on CloudWatch for further analysis

Coordinating just these three tasks could be really difficult if each one is a separate Lambda Function that must run sequentially or, harder still, in parallel. With AWS Step Functions, you can run those tasks in parallel, handle different kinds of exceptions for each function, and handle the final results without any further complications. This is how you could implement it:

Simple State Machine with Step Functions
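
As a rough sketch of what the definition behind such a State Machine might look like, the snippet below builds it as a Python dictionary and serializes it to JSON. The Lambda ARNs, function names, and state names are hypothetical placeholders rather than the ones used for the screenshot; only the overall shape (a Parallel state with one branch per task and a Catch for failures) is the point.

```python
import json

# Hypothetical ARN prefix; replace with the ARNs of your own functions.
ARN_PREFIX = "arn:aws:lambda:us-east-1:123456789012:function:"


def task_branch(state_name, function_name):
    """Build a single-state branch that runs one Lambda task."""
    return {
        "StartAt": state_name,
        "States": {
            state_name: {
                "Type": "Task",
                "Resource": ARN_PREFIX + function_name,
                "End": True,
            }
        },
    }


level_completed = {
    "Comment": "Runs three tasks in parallel when a player completes a level",
    "StartAt": "HandleLevelCompleted",
    "States": {
        "HandleLevelCompleted": {
            "Type": "Parallel",
            "Branches": [
                task_branch("UpdateDynamoDB", "update-tables"),
                task_branch("StoreReport", "store-report-on-s3"),
                task_branch("PutMetric", "put-cloudwatch-metric"),
            ],
            # If any branch fails, route the execution to a failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "LevelFailed"}],
            "Next": "LevelSaved",
        },
        "LevelSaved": {"Type": "Succeed"},
        "LevelFailed": {"Type": "Fail", "Error": "LevelNotSaved"},
    },
}

# The JSON string is what you would paste into the console or send to the API.
definition = json.dumps(level_completed, indent=2)
```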

Example 2: A serverless handler for libraries

Other examples might be flows that require human interaction. For instance, a library wants to keep track of each item loaned out to customers and wants to help customers return their items by the deadline. In this process, a customer checks out a book, the library employee records that action in the system, and afterwards a State Machine can orchestrate all of the actions necessary for bringing the book from the customer back to the library.

Thanks to AWS Step Functions, you can run a Lambda Function that sends the customer an email confirming the checkout, with a link to renew the loan. Another Lambda Function, in conjunction with Amazon API Gateway, can generate a link to mark the loan as complete. After a few days, the State Machine can send an automated message to remind the customer to renew or return the book.

As you can see, this example is more elaborate than the first one, and it is also more difficult to design and implement with Step Functions. The advantage of using this service is that you can implement really long-running tasks and also handle human interaction that can modify the flow of an execution.
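
To make the library flow more concrete, here is a simplified sketch of a possible definition, again written as a Python dictionary. It is not the template used in this post: the ARNs, state names, and waiting periods are assumptions, and the renewal path is left out. It only illustrates how a Wait state and an activity-based Task can keep an execution open until a human acts.

```python
import json

# Hypothetical ARN prefixes, used only for illustration.
LAMBDA = "arn:aws:lambda:us-east-1:123456789012:function:"
ACTIVITY = "arn:aws:states:us-east-1:123456789012:activity:"

library_loan = {
    "Comment": "Tracks a single loan from checkout to return",
    "StartAt": "SendCheckoutEmail",
    "States": {
        "SendCheckoutEmail": {
            "Type": "Task",
            "Resource": LAMBDA + "send-checkout-email",
            "Next": "WaitBeforeReminder",
        },
        "WaitBeforeReminder": {
            "Type": "Wait",
            "Seconds": 7 * 24 * 3600,  # assumed: remind after one week
            "Next": "SendReminder",
        },
        "SendReminder": {
            "Type": "Task",
            "Resource": LAMBDA + "send-reminder-email",
            "Next": "WaitForReturn",
        },
        "WaitForReturn": {
            # Activity task: stays open until a worker reports the result.
            "Type": "Task",
            "Resource": ACTIVITY + "book-returned",
            "TimeoutSeconds": 14 * 24 * 3600,  # assumed: two more weeks
            "Catch": [{"ErrorEquals": ["States.Timeout"], "Next": "LoanOverdue"}],
            "Next": "LoanClosed",
        },
        "LoanClosed": {"Type": "Succeed"},
        "LoanOverdue": {"Type": "Fail", "Error": "LoanOverdue"},
    },
}

definition = json.dumps(library_loan, indent=2)
```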

A deeper look at AWS Step Functions

Under the hood

So, what powers AWS Step Functions? As you can imagine, there are several similarities between this service and Amazon Simple Workflow (SWF). In fact, as Tim Bray pointed out in his video presentation at re:Invent 2016, SWF shares part of its backend with SFN, but at first glance, Step Functions is less complicated. Let’s try to understand the main components of this service so that you can start using it in your next project.

State Machine

The biggest component is the State Machine. A State Machine represents the flow that you need to put in place to achieve your goals. For example, to manage lending resources for a library you need to create a State Machine that coordinates each task to provide a better experience for customers. The two screenshots above are examples of State Machines.

It is very easy to create a State Machine. You basically need a JSON document and that’s it. Using the API or the AWS Console, you can create it and start as many executions as you need. The JSON template must follow the Amazon States Language. While it is not so easy to compose, the console gives you a real-time graph of the State Machine you are describing.

Here are the JSON templates for implementing the State Machines for examples 1 and 2 above.

State

A State Machine is made of boxes, and each one represents a State. States are referred to by their name inside the State Machine template. Each name must be unique and there are many different State types. Currently the available states (based on the publish date for this post) are:

  • Choice state: Branch the execution
  • Fail or Succeed state: Stop an execution with a failure or a success
  • Pass state: Pass its input to its output, optionally injecting some fixed data
  • Wait state: Provide a delay for a certain amount of time or until a specified time/date
  • Parallel state: Begin parallel branches of execution
  • Task state: Execute some code in your state machine
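
The list above can be illustrated with a small definition that mixes several state types, plus a helper that checks that every transition points to an existing state. Both are sketches: `score_flow` and `find_problems` are hypothetical names, and the helper is far from a full Amazon States Language validator (for instance, it does not look inside Parallel branches).

```python
STATE_TYPES = {"Choice", "Fail", "Succeed", "Pass", "Wait", "Parallel", "Task"}


def find_problems(definition):
    """Return a list of problems: unknown types or dangling transitions."""
    states = definition["States"]
    problems = []
    if definition["StartAt"] not in states:
        problems.append("StartAt points to a missing state")
    for name, state in states.items():
        if state["Type"] not in STATE_TYPES:
            problems.append(f"{name}: unknown type {state['Type']}")
        # Collect every state name this state can transition to.
        targets = [state.get("Next"), state.get("Default")]
        targets += [rule.get("Next") for rule in state.get("Choices", [])]
        targets += [rule.get("Next") for rule in state.get("Catch", [])]
        for target in targets:
            if target is not None and target not in states:
                problems.append(f"{name}: transition to missing state {target}")
    return problems


score_flow = {
    "StartAt": "AddDefaults",
    "States": {
        # Pass: copies its input and injects fixed data under $.defaults.
        "AddDefaults": {
            "Type": "Pass",
            "Result": {"bonus": 0},
            "ResultPath": "$.defaults",
            "Next": "CheckScore",
        },
        # Choice: branches on a value found in the input.
        "CheckScore": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.score", "NumericGreaterThan": 100, "Next": "Pause"}
            ],
            "Default": "Done",
        },
        # Wait: delays the execution for a fixed number of seconds.
        "Pause": {"Type": "Wait", "Seconds": 5, "Next": "Done"},
        "Done": {"Type": "Succeed"},
    },
}
```

Calling `find_problems(score_flow)` on this definition returns an empty list, while a typo in any `Next` value would be reported before the template ever reaches AWS.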

Task

All of the work in your State Machine is accomplished by tasks. A task can be:

  • A Lambda Function: You have to specify its ARN
  • An Activity: A piece of code that can be hosted wherever you want. It needs to call the GetActivityTask API to pick up the job and the SendTaskSuccess or SendTaskFailure API to report its result. In this way, you can also include human tasks in your State Machine, or tasks that run too long to be hosted in a Lambda Function. In our library example, you need to provide the user with a link that, if clicked, either renews the resource that has been checked out or marks it as completed. Thanks to API Gateway and the SendTaskSuccess or SendTaskFailure API, you can do it.
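
A minimal activity worker could be sketched as follows. It assumes a boto3 Step Functions client (`boto3.client("stepfunctions")`) is passed in as `sfn`; the function name `poll_once`, the worker name, and the `handler` callback are hypothetical. The three client calls map directly to the GetActivityTask, SendTaskSuccess, and SendTaskFailure APIs mentioned above.

```python
import json


def poll_once(sfn, activity_arn, handler, worker_name="worker-1"):
    """Fetch at most one activity task, run handler, and report the result.

    sfn is a Step Functions client, e.g. boto3.client("stepfunctions").
    Returns True if a task was processed, False if the poll came back empty.
    """
    # Long-polls; the response has no usable taskToken when nothing is scheduled.
    task = sfn.get_activity_task(activityArn=activity_arn, workerName=worker_name)
    token = task.get("taskToken")
    if not token:
        return False
    try:
        result = handler(json.loads(task["input"]))
    except Exception as exc:
        # Report the failure so the State Machine can retry or catch it.
        sfn.send_task_failure(
            taskToken=token, error=type(exc).__name__, cause=str(exc)
        )
    else:
        sfn.send_task_success(taskToken=token, output=json.dumps(result))
    return True
```

For a human task such as the library link, the worker would typically store the task token instead of answering right away, and the code behind the API Gateway endpoint would call `send_task_success` with that token once the customer clicks.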

Good to know: Pricing

A big difference between Step Functions and Simple Workflow is their pricing. AWS Step Functions is billed for each state transition of your execution. For example, if your State Machine has three steps in series, each execution consists of four state transitions. For each account, the first 4,000 transitions per month are included in the free tier, which does not expire. The free tier is a nice thing to have, but in addition to state transitions you will be charged for Lambda Function executions, data transfer, and EC2 instances if your activity is hosted there. In my opinion, this service is not as inexpensive as you might expect. Using it in production with a lot of executions can incur high expenses, but in many cases it is necessary and removes the pain of orchestrating different tasks. It also provides us with a lot of nice features.
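
To get a feeling for the numbers, the arithmetic can be sketched in a few lines. The price of $0.025 per 1,000 state transitions is an assumption that is not stated in this post, so check the current pricing page before relying on it; the 4,000 free transitions per month come from the paragraph above. The function name is hypothetical, and Lambda, EC2, and data transfer costs are not included.

```python
def monthly_transition_cost(executions, transitions_per_execution,
                            price_per_1000=0.025, free_transitions=4000):
    """Estimate the monthly Step Functions bill for state transitions only.

    price_per_1000 is an assumed price in USD; free_transitions is the
    monthly free tier mentioned in the post.
    """
    total = executions * transitions_per_execution
    billable = max(0, total - free_transitions)
    return billable * price_per_1000 / 1000
```

With the three-step State Machine above (four transitions per execution), 1,000 executions a month stay entirely within the free tier, while one million executions would cost roughly $100 in transitions alone under the assumed price.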

How can we actually use AWS Step Functions?

After this really long but necessary introduction to AWS Step Functions, let’s move on to how to use this service. There are three ways:

  • AWS Console
  • AWS CloudFormation
  • API/SDK

AWS Console

Simply insert the JSON template in the Code Box and your State Machine will appear in the Preview box.
AWS also provides a really small but very accurate set of blueprints to start out. For example, if you need a simple State Machine made of a parallel step, you only need to click on the related blueprint and change the name and the ARN. That’s it: your State Machine is ready for production. I think that AWS did a great job here, and the console is really helpful for composing State Machines.

AWS CloudFormation

CloudFormation is the Infrastructure as Code service of AWS (follow this link for an introduction to the service). It supports AWS Step Functions, and it is pretty simple to use: you only need to specify the State Machine template and the service role ARN.
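
A CloudFormation template for a one-state State Machine might be generated like this. The sketch uses the `AWS::StepFunctions::StateMachine` resource type with its `DefinitionString` and `RoleArn` properties; the logical ID, the Lambda ARN, and the role ARN are hypothetical placeholders.

```python
import json

# A minimal one-state definition, serialized to a string for CloudFormation.
definition = {
    "StartAt": "HelloWorld",
    "States": {
        "HelloWorld": {
            "Type": "Task",
            # Hypothetical Lambda ARN.
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:hello-world",
            "End": True,
        }
    },
}

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "HelloWorldStateMachine": {
            "Type": "AWS::StepFunctions::StateMachine",
            "Properties": {
                # The definition is passed as a JSON string, not as an object.
                "DefinitionString": json.dumps(definition),
                # Hypothetical service role that Step Functions will assume.
                "RoleArn": "arn:aws:iam::123456789012:role/hello-world-sfn-role",
            },
        }
    },
}

print(json.dumps(template, indent=2))
```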

AWS API and AWS SDK

One of the big advantages of AWS is that everything is an API, and this also applies to SFN. These are the service’s most important APIs:

  • CreateStateMachine: To create a state machine you need to specify a name, a definition written in the Amazon States Language, and a Role ARN that the service will assume.
  • GetActivityTask: With this API, a worker (EC2, ECS container or whatever) receives the input of an activity so that it can execute it.
  • SendTaskFailure and SendTaskSuccess: Allow you to set the result of an activity.

Of course, there are many more APIs available, and you can find the documentation of each one here.
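
As a sketch of how these APIs fit together through the Python SDK, the helper below creates a State Machine and starts one execution. It assumes `sfn` is a boto3 Step Functions client; the helper name and its arguments are hypothetical, and error handling (for example, a State Machine that already exists) is left out.

```python
import json


def create_and_start(sfn, name, definition, role_arn, execution_input):
    """Create a state machine and start one execution; return the execution ARN.

    sfn is a Step Functions client, e.g. boto3.client("stepfunctions").
    definition and execution_input are plain dictionaries.
    """
    machine = sfn.create_state_machine(
        name=name,
        definition=json.dumps(definition),  # Amazon States Language as a string
        roleArn=role_arn,
    )
    execution = sfn.start_execution(
        stateMachineArn=machine["stateMachineArn"],
        input=json.dumps(execution_input),
    )
    return execution["executionArn"]


# Possible usage (requires boto3 and AWS credentials):
#   import boto3
#   arn = create_and_start(boto3.client("stepfunctions"), "level-completed",
#                          my_definition, my_role_arn, {"level": 3})
```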

Logging and monitoring

After you have created your State Machine, you will want to run millions of executions, and this is the easy part! In fact, AWS Step Functions is a fully managed service and you don’t need to take care of scaling or server maintenance. We are in a serverless world right now!

In order to take control of your system, you need to have a really good monitoring system in place. With Step Functions, AWS did a pretty good job. Using the console, each execution has its own logs for each state and they are well detailed. SFN is also integrated with CloudWatch Metrics and CloudTrail. Of course, if your tasks are performed by Lambda Functions, each of them will deliver its logs to CloudWatch as usual. You can learn more about these services by following these links: Introduction to CloudWatch and Learn the tools for governing accounts. Here is a screenshot where you can see the logs that the AWS Console provides.

Using the console is simple and the user interface is pretty good, but what about getting the logs of an execution via API? The API that we need here is GetExecutionHistory. This will provide us with the complete history of an execution. Although I have never used it before, after reading the docs, I can see that the response could be pretty hard to handle. In fact, there are a lot of different possible fields that represent each type of event and its result. For example, in the case of failure, the details live in a different field depending on whether the event type is ActivityFailed or LambdaFunctionFailed (even though both carry the same information).
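
One way to tame that response is to map each failure event type to the field that holds its details. The sketch below assumes a boto3 Step Functions client passed in as `sfn`; `failed_events` is a hypothetical helper that covers only the two event types mentioned above and follows `nextToken` to read every page of the history.

```python
def failed_events(sfn, execution_arn):
    """Return (event type, error, cause) for each failed task in an execution.

    sfn is a Step Functions client, e.g. boto3.client("stepfunctions").
    """
    # Each failure type stores its details under a differently named key.
    detail_keys = {
        "ActivityFailed": "activityFailedEventDetails",
        "LambdaFunctionFailed": "lambdaFunctionFailedEventDetails",
    }
    failures = []
    kwargs = {"executionArn": execution_arn}
    while True:
        page = sfn.get_execution_history(**kwargs)
        for event in page["events"]:
            key = detail_keys.get(event["type"])
            if key:
                details = event.get(key, {})
                failures.append(
                    (event["type"], details.get("error"), details.get("cause"))
                )
        # Keep reading until the service stops returning a pagination token.
        if not page.get("nextToken"):
            return failures
        kwargs["nextToken"] = page["nextToken"]
```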

Why Step Functions is your friend

There are several great things about this new service:

State as a service

AWS Step Functions provides something that could be called state as a service. Usually, a serverless infrastructure is also stateless. In fact, if you are using multiple Lambda Functions to complete a task, it is really hard to store the state of an execution and keep it up to date. If you need it, you are probably going to use S3 or a database, but this is a repetitive and complex task to accomplish. SFN will carry your state from one task to the next and orchestrate the tasks so that each of them runs only if needed and in the right order.

Keep your tasks alive with a Heartbeat

Another cool feature is that you are able to build really long tasks. The maximum duration for a single execution is one year!

For long tasks, you can also specify the TimeoutSeconds and HeartbeatSeconds parameters. If a state runs longer than its TimeoutSeconds, it fails with a States.Timeout error. The latter parameter is even more powerful. If you specify HeartbeatSeconds, you have to design your activity worker to call the SendTaskHeartbeat API at least once within every interval of that many seconds. If the worker misses a heartbeat, the state fails with a States.Timeout error. Both parameters are useful when, for instance, your activity has to process a batch of records. You can set a timeout for the entire duration of the activity, or, using HeartbeatSeconds, you can say: the activity must process M records overall, and at least N records every X seconds. To do so, set HeartbeatSeconds to X and have the worker call the API after every N records.
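
The record-processing scenario could be sketched like this. The state fragment shows where the two parameters go, and the worker function sends a heartbeat after every N records. The activity ARN, the concrete numbers, and the names `process_records_state` and `process_with_heartbeat` are assumptions; `sfn` is expected to be a boto3 Step Functions client.

```python
# Task state fragment: one hour overall, and a heartbeat at least every minute.
process_records_state = {
    "Type": "Task",
    # Hypothetical activity ARN.
    "Resource": "arn:aws:states:us-east-1:123456789012:activity:process-records",
    "TimeoutSeconds": 3600,
    "HeartbeatSeconds": 60,
    "End": True,
}


def process_with_heartbeat(sfn, task_token, records, process, every=100):
    """Process records, sending a heartbeat after every `every` records.

    Returns the number of records processed. The caller is still expected to
    report the final result with send_task_success or send_task_failure.
    """
    done = 0
    for record in records:
        process(record)
        done += 1
        if done % every == 0:
            # Tells Step Functions that the worker is still alive.
            sfn.send_task_heartbeat(taskToken=task_token)
    return done
```

The value of `every` has to be chosen so that N records are comfortably processed within HeartbeatSeconds; otherwise the state times out even though the worker is making progress.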

Retry strategy

A really difficult problem to solve using Lambda is implementing a retry strategy. This is quite important but hard to achieve in a simple way. AWS Step Functions allows you to define a retry strategy for each of the different kinds of errors that your Lambda Functions can raise. I think this is easier to understand using an example.

This is the template for creating the Hello World State Machine with a retry strategy. The HelloWorld Lambda Function could fail for different reasons, but we can handle errors with a different strategy. For each kind of strategy you can define three parameters:

  • IntervalSeconds: Represents the number of seconds before the first retry attempt
  • MaxAttempts: Represents the maximum number of retry attempts
  • BackoffRate: The multiplier that increases the retry interval on each attempt

You can use the same strategy for multiple kinds of exceptions by listing multiple values in the ErrorEquals array. If you do not want to retry a function when a specific error occurs, you can catch that error instead and set the Next field of the Catch entry to the name of another State. In this example, if a CriticalError happens, the AlertDevOps Lambda Function is invoked and then the execution terminates.
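
A definition along these lines might look like the sketch below. The ARNs and the `TransientError` name are hypothetical, while `CriticalError` and `AlertDevOps` follow the description above. The small `retry_delays` helper is also an illustration: it shows how the three parameters combine, assuming the documented defaults (1 second, 3 attempts, backoff rate 2.0) when a field is omitted.

```python
import json

# Hypothetical ARN prefix, used only for illustration.
LAMBDA = "arn:aws:lambda:us-east-1:123456789012:function:"

hello_world = {
    "StartAt": "HelloWorld",
    "States": {
        "HelloWorld": {
            "Type": "Task",
            "Resource": LAMBDA + "hello-world",
            "Retry": [
                {
                    # One strategy shared by two kinds of errors.
                    "ErrorEquals": ["TransientError", "States.Timeout"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            # No retry for CriticalError: jump straight to another state.
            "Catch": [{"ErrorEquals": ["CriticalError"], "Next": "AlertDevOps"}],
            "End": True,
        },
        "AlertDevOps": {
            "Type": "Task",
            "Resource": LAMBDA + "alert-devops",
            "End": True,
        },
    },
}


def retry_delays(retrier):
    """Return the wait in seconds before each retry attempt of one retrier."""
    interval = retrier.get("IntervalSeconds", 1)
    attempts = retrier.get("MaxAttempts", 3)
    rate = retrier.get("BackoffRate", 2.0)
    # The interval is multiplied by the backoff rate after every attempt.
    return [interval * rate ** n for n in range(attempts)]


definition = json.dumps(hello_world, indent=2)
```

For the retrier above, the waits before the three retry attempts are 2, 4, and 8 seconds.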

Cool Service Console

One thing that I would like to highlight here is the AWS Console. Usually, AWS doesn’t provide a nice console to interact with and that’s ok because, after a bit of experience with the service, a developer usually switches to using the service via API, CLI, SDK, or even CloudFormation.

With Step Functions, AWS creates a user-friendly experience by providing a lot of blueprints but also a nice UI with all the information needed. When you create a State Machine, you can start from different kinds of blueprints and you have two boxes. The Preview box shows the State Machine that you are building and is based on the Code box, positioned below it. Once you are satisfied with your State Machine, you can create it and start all the executions that you need. The AWS Console is helpful here too: for each execution, you have all of the logs that you need, both at the execution and single state level.

What I’d like to see in the future

I used this service a couple of times and there are several features that I feel are missing:

Price model

The first point is the price. With complex tasks, AWS Step Functions is practically mandatory, but it comes at a high cost. I hope that a price reduction or, even better, a pricing model similar to CloudFormation, Elastic Beanstalk, or ECS, is forthcoming. Those services offer a “pay only for what you use” model. Instead, with Step Functions, you pay for the resources that you use (Lambda, EC2, on-premises servers), but also for states that only wait or just pass to the next one.

Where are triggers?

With Lambda, AWS taught us to love events and triggers. Where are they? It would be nice to start an execution in response to an event such as an Amazon DynamoDB or Amazon Kinesis stream record or an AWS CodeCommit push, or to automatically pull messages from an Amazon SQS queue. For now, you are able to integrate Step Functions with Amazon API Gateway, which makes human tasks possible in our executions. Yesterday AWS announced the integration between CloudWatch Events and Step Functions. That is good news because it means that AWS is working on integrating more triggers with this service.

What about push events to other services?

The opposite of triggers is missing too. Here, I’m talking about the ability to automatically send the event received from the previous state in my State Machine to other AWS services. For example, I’d like to receive a notification for each execution that ends without error. The last state of my State Machine could be an integration with Amazon SNS that triggers alerts without any further code.

Conclusion

There are several key points to keep in mind with this service:

  • State as a service in a serverless infrastructure
  • Easy integration of human tasks
  • Really long execution with timeout and heartbeat functionality
  • Deep integrations with CloudWatch Logs, Metrics, and CloudTrail
  • Nice AWS Console with blueprints and everything you need to get started
  • Integration with IAM; each State Machine has its service role
  • Pay attention to the bill: Costs can increase quickly

If you would like to take this intro-level look to the next level, you should try our hands-on lab. In our lab, you will be able to create and use the State Machine in example number one above. You can find the templates of the examples in this post at the following links: game hosted on AWS and library loans handler.

  • Paola

    I don’t fully understand what push events really do..can you please explain with an example?

  • maria lucena

    Hello,

    Thanks for this neat post. I’ve been working on a POC which is contemplating using step functions for maintaining a user session. In my scenario, each state machine execution is a unique session. With each step the user may take time before moving to the next, but not more than 15 minutes. We chose activities as the step resources, because we can advance the user state when and if all goes well at a certain point in time. Lambda’s nature of immediate invocation prevents us from using them inside the state machine, as we don’t want to advance as soon as the work is done, and “waits” are not good because we don’t know how long it will take the user to respond, so the fact that we can call sendTaskSuccess when we are ready meets that requirement. The challenge is that because any worker can pick up an activity, one execution can pick up another execution’s taskToken, and so we could advance the wrong state machine by picking up the taskToken placed in the queue by a different execution, which is unacceptable for our scenario. Say Maria is at step 1 and instead of going to step 2, workerx (which is not in charge of tracking the execution) advances her conversation to end. I tried a few things:

    – Adding activities on the fly. This can be done, but I can’t find a way to attach them to a state machine execution dynamically.

    – Having a worker pick up all tasks and save the taskToken to a table with workerName/activityArn identifiers. Then when the execution state can move to the next step, the task token is retrieved from the db instead of whatever queue the step functions service is placing it in. This works better, b/c as soon as a state machine enters a state the taskToken gets picked up. But when doing a load test it quickly crumbles.

    Can anyone please let me know if there is a fundamental constraint here, or what workaround I can possibly use to overcome this. In your blog I can’t find something that explains how activities are managed inside step functions so that I can help alleviate the problem.
    Best Regards,
    Maria