DevOps Telemetry: Open Source vs Cloud vs Third Party

The DevOps principle of feedback calls for business, application, and infrastructure telemetry. Telemetry is important to engineers debugging production issues or establishing baseline operational conditions, but it is just as important to product owners and business stakeholders because it can reveal how users engage with features, validating (or invalidating) product development decisions. High-value business KPI telemetry can even guide the board of directors. Moreover, strong telemetry is a prerequisite for continuous delivery since it promotes safety at high speed.

Organizations have a seemingly infinite list of DevOps telemetry systems to choose from. Every cloud provider offers something, and there are even more third party vendors like DataDog, NewRelic, and AppDynamics. Of course, there are also open source options like Prometheus and StatsD.

But how do you choose between all the options when they overlap? First, take a step back. Then, consider the business objectives the telemetry system must deliver.

Outcomes First, Technology Second

The DevOps Handbook discusses telemetry at length. Focus on business outcomes first, and the technology choices will follow. The telemetry system should deliver one or more of the following outcomes:

  1. Lower mean-time-to-resolve via more accurate data.
  2. Decreased change failure rate due to earlier detection in the deployment pipeline.
  3. Increased confidence in operations staff.

Let’s also clarify “telemetry.” Telemetry is any form of diagnostic or operational data about a running system. This includes time series metrics, text-based logs, and events, such as an error or a circuit breaker trip. The DevOps Handbook also provides a telemetry collection checklist:

  1. Business logic data: number of sales transactions, revenue, user signups, churn rate, A/B test results, etc.
  2. Application layer data: transaction times, latencies, response codes, and unexpected errors.
  3. Infrastructure layer data: CPU load, memory consumption, disk space, and network bandwidth.
  4. Client/user software level data: application errors, crashes, and user-measured response times.
  5. Deployment pipeline: the pipeline status, lead times, deployment frequencies, number of promotions to the various environments, and their status.

This checklist only accounts for raw data, but your telemetry system should also include alerting, dashboards, and other visualization features. Consider this: you’ll need a deploy dashboard that displays critical metrics to see whether a deploy passes the “smell test.” You’ll also need pre-configured, time-based visualizations so engineers can analyze historical data and understand current conditions.

The next question is how telemetry data is collected throughout the system. There are two schools of thought: push and pull. In a push system, each component reports its telemetry to a central collector for further processing. In a pull system, the collector maintains a list of connected systems and pulls data from each one over a specified protocol. Most real-time systems push data. Pull systems make it easier to detect offline systems: if the telemetry cannot be pulled, then it’s safe to assume the system is unavailable.
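Here’s a minimal sketch of both models in Python, using only the standard library. The StatsD-style packet format and the /metrics endpoint are simplified illustrations rather than complete protocol implementations, and the addresses are assumptions:

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

def push_counter(host="127.0.0.1", port=8125):
    """Push: fire-and-forget a StatsD-style counter increment at a collector.

    "app.logins:1|c" means "increment the app.logins counter by 1".
    Assumes a collector is listening on the given UDP address.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"app.logins:1|c", (host, port))

class MetricsHandler(BaseHTTPRequestHandler):
    """Pull: expose current values over HTTP and let a collector scrape them.

    If the scrape fails, the collector can assume this system is unavailable.
    """
    def do_GET(self):
        if self.path == "/metrics":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"app_logins_total 42\n")
        else:
            self.send_error(404)

if __name__ == "__main__":
    push_counter()
    HTTPServer(("0.0.0.0", 9100), MetricsHandler).serve_forever()
```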

You must also consider how you’ll report and transport telemetry data. This is tricky because different categories are reported differently. Consider a call to increment a counter, which is just a key and a value. Now consider an error: it has a name, location, severity, and a detailed backtrace. That data cannot be reported in the same way as the counter increment. Also consider log output, which contains multitudes of useful information: the log level indicates severity, and the messages may include useful context, such as user IDs. These categories create a large surface area. The DevOps Handbook recommends a centralized system that can ingress all types of data, then route and process them accordingly.
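A minimal sketch of that centralized ingress idea in Python; the downstream handlers are hypothetical placeholders for whichever metric store, pager, and log index you choose:

```python
def store_metric(key, value):
    print("metric", key, value)  # placeholder for your metric store

def page_on_call(event):
    print("paging on-call:", event["name"])  # placeholder for your pager

def store_event(event):
    print("event", event)  # placeholder for your event store

def index_log(level, message, context):
    print("log", level, message, context)  # placeholder for your log index

def ingress(telemetry: dict) -> None:
    """Route raw telemetry by type, then process it accordingly."""
    kind = telemetry.get("type")
    if kind == "metric":
        # A counter increment is just a key and a value.
        store_metric(telemetry["key"], telemetry["value"])
    elif kind == "event":
        # An error carries a name, location, severity, and backtrace.
        if telemetry.get("severity") == "critical":
            page_on_call(telemetry)
        store_event(telemetry)
    elif kind == "log":
        # Log lines carry a level plus context such as user IDs.
        index_log(telemetry["level"], telemetry["message"],
                  telemetry.get("context", {}))

ingress({"type": "metric", "key": "signups", "value": 1})
```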

Now we’ve established three points:

  1. Business outcomes a telemetry system should provide.
  2. Checklist of telemetry data layers.
  3. Types of telemetry data (metrics, events, logs, etc.).

With that established, we can survey the landscape and see what makes the most sense. Let’s begin with cloud providers.

Cloud Providers

Cloud-provided telemetry systems, like AWS CloudWatch, are a key building block, but you don’t have to go all in with them. CloudWatch is great for collecting telemetry inside AWS, but its visualization, ad-hoc querying, and dashboard features leave much to be desired. However, every cloud provider reports telemetry for all of its components to its own telemetry system, providing a central collection point for your cloud infrastructure.
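For example, publishing a custom business-layer metric to CloudWatch is a single boto3 call; the namespace and metric name here are illustrative:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a custom business-layer metric alongside the infrastructure
# metrics AWS collects automatically.
cloudwatch.put_metric_data(
    Namespace="MyApp/Business",  # illustrative namespace
    MetricData=[{
        "MetricName": "SalesTransactions",
        "Value": 1,
        "Unit": "Count",
    }],
)
```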

This implementation detail is especially useful when combined with a serverless function that processes logs. The function can parse logs and forward them to any upstream system, and the cloud provider also offers Infrastructure-as-a-Service for running open source software to fill the gaps. My advice is to look closely at cloud provider telemetry systems and consider the gaps you’ll need to fill.
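Here’s a minimal sketch of that pattern. CloudWatch Logs subscription filters deliver batches to a Lambda function as gzipped, base64-encoded JSON; forward_to_upstream is a hypothetical placeholder for your log pipeline:

```python
import base64
import gzip
import json

def forward_to_upstream(record):
    # Placeholder: ship the record to your log pipeline of choice.
    print(json.dumps(record))

def handler(event, context):
    # Subscription filters deliver records gzipped and base64-encoded
    # under event["awslogs"]["data"].
    data = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(data))
    for log_event in payload["logEvents"]:
        # Parse, enrich, and forward each line to any upstream system.
        forward_to_upstream({
            "log_group": payload["logGroup"],
            "timestamp": log_event["timestamp"],
            "message": log_event["message"],
        })
```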

Open Source

Open source telemetry has come a long way since Ganglia and Nagios. There’s software for practically every layer. Riemann and Fluentd are powerful ingress points and routers. Kibana visualizes all kinds of telemetry on top of Elasticsearch, while Grafana builds dashboards over Elasticsearch, Prometheus, and other backends. collectd and friends run on hosts and report infrastructure metrics. Prometheus is probably the most popular open source telemetry solution since it provides collection, visualization, and alerting in a single package. However, selecting any of these options requires an investment in learning the software and potentially running and operating it yourself. Open source telemetry software tends to be more plumbing oriented than application oriented.
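As a taste of the instrumentation side, here’s a minimal sketch using the official prometheus_client Python library. The process exposes a /metrics endpoint that a Prometheus server scrapes on its own schedule; the metric names and simulated work are illustrative:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

@LATENCY.time()
def handle_request():
    REQUESTS.inc()
    time.sleep(random.random() / 10)  # simulated work

if __name__ == "__main__":
    # Prometheus scrapes http://host:8000/metrics on its own schedule.
    start_http_server(8000)
    while True:
        handle_request()
```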

James Turnbull demonstrates how to implement a full-featured telemetry system with Riemann in The Art of Monitoring. It’s a great example of what you can achieve with off-the-shelf open source software.

Third party vendors, by contrast, tend to fill the gap between cloud provider infrastructure telemetry and application-specific telemetry drawn from real-world application use.

Third Party Vendors

These vendors, like DataDog, offer turnkey solutions that collect telemetry in almost any form from practically any system. DataDog specifically can handle numeric telemetry data and parse text-based log streams, triggering alerts or other notifications. Each vendor targets a different area. Companies like NewRelic take an application-first world view and tend to integrate deeply with specific tech stacks. This is undoubtedly useful, but it forgoes many items on the DevOps Handbook’s checklist.
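For instance, with DataDog’s Python library and a local DogStatsD agent, reporting custom metrics takes a few lines; the metric names, tags, and process_payment call are illustrative:

```python
import time

from datadog import initialize, statsd

# Assumes a local Datadog agent running DogStatsD on the default port.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

def process_payment():
    time.sleep(0.05)  # placeholder for a real payment call

# Increment a business metric, tagged for slicing in dashboards.
statsd.increment("checkout.completed", tags=["env:production", "plan:pro"])

# Time an operation; the agent aggregates and forwards the timing.
with statsd.timed("checkout.payment_seconds"):
    process_payment()
```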

Also consider the pricing. It may be steep, especially if it’s priced per host or container. Pricing climbs even higher for real user monitoring (RUM) systems, although RUM is extremely useful if you can afford it. My advice is to save time and effort by integrating a robust third party solution with your cloud provider.

What’s Right for Me?

The right solution delivers your desired business outcomes without breaking the bank or introducing too much friction. Telemetry collection should be automated and easy, and adding additional instrumentation should take a single line of code. New systems and standard infrastructure components, like databases or web servers, should be automatically instrumented. Moreover, the telemetry system should encourage experimentation and learning from everyone in the organization. If engineers want to track a seemingly random metric, then the system should let them track it and see what happens. The same goes for business stakeholders. If they want to track engagement on a certain feature and see how changes impact it, then they should be able to.
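One way to make that single line literal is a decorator. This is a sketch, not any particular vendor’s API; emit is a placeholder hook for whichever backend you’ve chosen:

```python
import functools
import time

def emit(metric, value):
    # Placeholder: wire this to whatever backend you've chosen.
    print(f"{metric}={value:.4f}")

def timed(metric_name):
    """Instrumenting any function becomes a single decorator line."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                emit(metric_name, time.monotonic() - start)
        return wrapper
    return decorator

@timed("signup.duration_seconds")  # the single added line
def signup(user):
    time.sleep(0.01)  # simulated work

signup("alice")
```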

There’s no prescriptive solution that fits everyone. However, it’s safe to say that a complete telemetry system will mix cloud, open source, and paid third party software. You might, for example, forward AWS CloudWatch metrics to something like Kibana, which in turn triggers alerts in PagerDuty. The best advice is to start small and focus on telemetry that maps directly to business KPIs and known operational regressions. This mitigates an overwhelming pile of telemetry and subsequent alert fatigue, enabling your telemetry system to grow and mature along with your business.

Also, don’t forget that DevOps calls for flow, feedback, and learning across the entire organization. DevOps doesn’t stop at the deployment pipeline. Mik Kersten’s recent book Project to Product introduces the flow framework and metrics that drive the business. Keep in mind that there’s telemetry for the business itself as well. If you can harness that data in the same way as operational data, then you’ll have more insight into and control over the business. Check out Tasktop to learn more about this area.

If this all sounds too abstract, then turn it into concrete learning: you’ll still need to know how to connect systems and build a telemetry system. If you’re doing DevOps, then you’re likely leveraging (and succeeding with) cloud computing. Cloud Academy has entire telemetry courses and labs for AWS, Google Cloud, and Azure. There’s also a useful tutorial on building a serverless telemetry system that provides a great overview and implementation advice.
