IT has changed over the past 10 years with the adoption of cloud computing, continuous delivery, and significantly better telemetry tools. These technologies have spawned an entirely new container ecosystem, demonstrated the importance of strong security practices, and catalyzed a new world of big data. Small and midsize businesses (SMBs) and enterprises alike now likely need to employ data engineers, data scientists, and security specialists. These roles may be siloed right now, but history tells us there can be a more collaborative path.
DevOps broke down the barrier between development and operations to create the best methodology for building and shipping software. InfoSec is the latest member to join the DevOps value stream. Take a look around the internet and you’ll find plenty of posts on integrating security and compliance concerns into a continuous delivery pipeline, including The DevOps Handbook, which dedicates an entire chapter to the topic. I’ve also written about continuous security on this blog. While Dev, InfoSec, and Ops have become the “DevSecOps” you see splashed around on the internet, we need a new movement that’s rooted in DevOps philosophy to bring in data workers.
Calling for DevSecDataOps
So the name may not be great (you may have even seen it coming), but I’ll make my case for integrating all data-related activities into the DevOps value stream. My case begins with the second DevOps principle: The Principle of Feedback.
The Principle of Feedback requires teams to use automated telemetry to identify new production issues and verify the correctness of any new release. Let’s put aside the first clause and instead focus on the arguably more important second clause. First, I must clarify a common shortcoming: many teams ship changes to production and consider that "done". That’s not "Done"; it’s still "Work in Progress". "Done" means delivering the expected business value in production.
Imagine your team ships feature X to production. The product manager expects to see Y engagement on feature X and possible changes in business KPIs A, B, and C. The Principle of Feedback requires telemetry for Y, A, B, and C such that the team can confirm or deny, within a reasonable time, that feature X produced the expected outcome. Businesses live and die by this type of telemetry, and data engineers and data teams are increasingly responsible for providing it. Thus, data workers are part of the critical path from business idea to delivered business value in production, and their workflows and processes must move at the speed of continuous delivery. In other words: it’s time to bring the data side into the DevOps value stream.
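To make "confirm or deny the expected outcome" concrete, here is a minimal sketch of a release-verification check. The metric names and target values are hypothetical; a real implementation would query your telemetry store rather than take a dictionary of observations.

```python
# Sketch: verifying that feature X delivered its expected outcome.
# Metric names and targets are placeholders for Y, A, B, and C.
EXPECTED = {
    "feature_x_engagement": 1000,  # Y: expected engagement events
    "kpi_a": 0.05,                 # A: expected conversion-rate lift
    "kpi_b": 200,                  # B: expected new signups
    "kpi_c": 0.02,                 # C: expected revenue lift
}

def verify_release(observed: dict) -> list:
    """Return the metrics that missed their expected values."""
    return [name for name, target in EXPECTED.items()
            if observed.get(name, 0) < target]

observed = {"feature_x_engagement": 1200, "kpi_a": 0.06,
            "kpi_b": 150, "kpi_c": 0.03}
misses = verify_release(observed)
# The release is only "Done" when no expected metric missed its target.
print("Done" if not misses else f"Still WIP, investigate: {misses}")
```

A check like this can run on a schedule after each release, turning "did feature X work?" from a meeting topic into an automated answer.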
InfoSec and Compliance for Data Pipelines
I think it’s clear that the business KPIs and user engagement telemetry collected by data teams are critical to the business. It’s also clear to me that the principle of continuous security connects to compliance in data pipelines. In my previous post, I predicted that the GDPR (General Data Protection Regulation) would be a big deal in 2019. Data warehouses and data lakes are potential sources of GDPR and other regulatory infractions; consider, for example, a GDPR-driven retention policy requiring that all user data be deleted 90 days after a user terminates service.
One solution is to deploy time-to-live telemetry on different types of data and create alerts for violations. Another is to add automated tests for the scripts that scrub user data and run them as part of the automated deployment pipeline. Hopefully, there’s already a set of automated tests for whatever transformation and munging goes on. If not, then this is the place to start: create those tests and build a deployment pipeline for the data processing system. A deployment pipeline is, after all, required for data teams to move at the speed of DevOps.
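A time-to-live check can be sketched in a few lines. The record fields and the 90-day window here are assumptions for illustration; in production, a check like this would run against the warehouse and feed an alerting system rather than return a list.

```python
# Sketch: flag user records that should already have been deleted
# under a 90-day post-termination retention policy (assumed policy).
from datetime import datetime, timedelta

TTL = timedelta(days=90)

def ttl_violations(records, now):
    """Return ids of terminated users whose data exceeded its TTL."""
    return [r["user_id"] for r in records
            if r["terminated_at"] is not None
            and now - r["terminated_at"] > TTL]

records = [
    {"user_id": 1, "terminated_at": datetime(2019, 1, 1)},
    {"user_id": 2, "terminated_at": None},  # active user, no TTL clock
    {"user_id": 3, "terminated_at": datetime(2019, 6, 1)},
]
print(ttl_violations(records, now=datetime(2019, 6, 15)))
```

The same predicate can double as an automated test for the scrub scripts: run the scrubber, then assert `ttl_violations` returns an empty list.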
Converge Data Engineering and DevOps Practices
Data teams can benefit from DevOps practices. Consider what happens when a new data scientist joins the team. That person needs an environment to build and test their models, which requires test data, workstation setup, and even cloud infrastructure. This calls for automation backed by infrastructure-as-code, a key DevOps practice, along with management of test data and any other artifact required to bootstrap a new environment. The environment may be something simple like a dedicated EC2 instance or a more complex pipeline of data streams and serverless Lambda functions. Regardless, the setup can and should be automated.
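For the simple case of a dedicated EC2 instance, the environment definition can live in version control as code. This sketch assumes an AWS setup; the AMI id, instance type, and tags are placeholders. The function only builds the launch parameters, which could then be passed to boto3.

```python
# Sketch: a data scientist's workstation defined as code.
# AMI id, instance type, and tag values are placeholders.
def data_science_env(owner: str) -> dict:
    """Build EC2 launch parameters for a new data science workstation."""
    return {
        "ImageId": "ami-0123456789abcdef0",  # placeholder AMI with tooling preinstalled
        "InstanceType": "m5.xlarge",
        "MinCount": 1,
        "MaxCount": 1,
        "TagSpecifications": [{
            "ResourceType": "instance",
            "Tags": [{"Key": "owner", "Value": owner},
                     {"Key": "purpose", "Value": "data-science"}],
        }],
    }

params = data_science_env("new-hire")
# With AWS credentials configured, this could be launched via:
#   boto3.client("ec2").run_instances(**params)
print(params["InstanceType"])
```

Because the definition is plain code, onboarding the next data scientist is a one-line call instead of a day of manual setup.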
Consider an architecture with a transactional system and a separate data processing system. The data system ingests data from the transactional system to produce reports, KPIs, and other real-time telemetry. Our imaginary feature X spans both systems: functional changes in the transactional system, and processing or analysis changes in the data system. Both systems need to be developed simultaneously, tested alongside each other, and ultimately promoted together to production.

Note the relationship between these two systems: the data system should not be tested in production, especially if its outputs drive business decisions. Technical issues should not prevent the team from achieving this; it just requires some automated elbow grease and collaboration. Given that both systems are encapsulated in infrastructure-as-code, it should be possible to deploy each into an isolated, dedicated test environment and smoke test across both. A simple test triggers feature X and asserts on the availability of telemetry Y, A, B, and C. This small test eliminates an entire class of costly regressions, like misconfigured integration points and flat-out broken implementations. If the automated tests pass, then both systems can be promoted to production. That’s continuous delivery in a nutshell, also known as The Principle of Flow, and it leads us back to where we started: The Principle of Feedback.
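The cross-system smoke test described above can be sketched as follows. The two client classes are stand-ins for whatever HTTP or SDK clients your test environment actually provides; the point is the shape of the test, not these particular classes.

```python
# Sketch: a smoke test spanning the transactional and data systems.
# Both client classes are hypothetical stand-ins for real test clients.

class DataSystemClient:
    """Stand-in for the data system's telemetry API."""
    def __init__(self):
        self.metrics = set()
    def has_metric(self, name):
        return name in self.metrics

class TransactionalClient:
    """Stand-in for the transactional system's API."""
    def __init__(self, data_system):
        self.data_system = data_system
    def trigger_feature_x(self):
        # In the real system this emits events the data pipeline ingests.
        for metric in ("Y", "A", "B", "C"):
            self.data_system.metrics.add(metric)

def test_feature_x_telemetry():
    data_system = DataSystemClient()
    app = TransactionalClient(data_system)
    app.trigger_feature_x()
    for metric in ("Y", "A", "B", "C"):
        assert data_system.has_metric(metric), f"missing telemetry {metric}"

test_feature_x_telemetry()
print("smoke test passed")
```

Run against real systems in an isolated test environment, a test of this shape catches the misconfigured integration points before they reach production.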
Earlier, we set aside the first part of The Principle of Feedback. Now we must return and apply it to data pipelines. Data pipelines are just like any other IT component: at runtime, they can be impacted by operational conditions such as memory limits, CPU thrashing, network latency, disk capacity, and bandwidth saturation. There are known telemetry playbooks for common data pipeline components such as Kafka or Hadoop, along with known abnormal operational conditions, application-specific failure modes, and tripwires. Consider a data pipeline using Kafka: if no messages arrive on the ingestion stream, then something is wrong. That’s a simple tripwire, and it covers integration points. Data stores and processing systems also require standard USE (Utilization, Saturation, Errors) metrics and relevant alerts. One example is disk capacity inside a data warehouse: known limits can be defined to trigger an alert condition, say at 85% utilization, and a resolution. Again, applying these telemetry practices is a core DevOps concept.
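The two alert conditions above can be expressed as simple predicates. The 85% threshold comes from the example in the text; how the message count and disk figures are fetched (consumer lag metrics, disk telemetry) is left out as an assumption.

```python
# Sketch: the two alert conditions described above.
# Input values would come from Kafka and disk telemetry in practice.

def ingestion_tripwire(messages_last_interval: int) -> bool:
    """Tripwire: zero messages on the ingestion stream means trouble."""
    return messages_last_interval == 0

def disk_alert(used_bytes: int, total_bytes: int,
               threshold: float = 0.85) -> bool:
    """USE-style saturation alert on data warehouse disk capacity."""
    return used_bytes / total_bytes >= threshold

assert ingestion_tripwire(0) is True       # silent stream -> alert
assert ingestion_tripwire(1523) is False   # messages flowing -> ok
assert disk_alert(870, 1000) is True       # 87% >= 85% -> alert
assert disk_alert(600, 1000) is False      # 60% -> ok
print("alert rules behave as expected")
```

Codifying tripwires like this keeps them reviewable and testable, just like any other production code.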
There’s one remaining DevOps principle: The Principle of Continuous Learning and Experimentation. The entire IT organization must experiment and learn together. Integrating all members of the value stream is only possible if it’s attempted, and teams have to start somewhere. It may be by asking questions like “How can we test and deploy our product and data systems together?” or “How can we get more real-time data from our data pipeline?” Both are valid questions with many possible solutions, and the best outcomes involve collaboration and experimentation. With the right leadership and a commitment to learning, your organization will find its own answers.
How to Shift to DevSecDataOps
Cloud Academy has a deep training catalog for anyone interested in development, security, operations, and/or data. You can lead the convergence in your team or organization with a strong knowledge mix across these areas.
The DevOps Culture learning path teaches you to see things from a DevOps perspective and how to bridge gaps in your organization. There’s also a lab on Building a Data Pipeline with DCOS that connects data pipelines to ops and infrastructure. The library of data-oriented courses can get you started with AWS, Google Cloud Platform, or Azure.
Cloud Academy provides in-depth courses which will take you from zero to hero on infrastructure-as-code and configuration management tools:
- Terraform (developed in coordination with Hashicorp)
- Puppet (developed in coordination with Puppet)
- Chef (developed in coordination with Chef)
- Ansible (developed in coordination with Ansible — and my personal favorite!)
There’s also an introduction to continuous delivery course.
With Cloud Academy, you can learn the skills to make these essential changes, converging development, operations, InfoSec, and data engineering. Here’s my last question to you: How will you lead DevSecDataOps in your company?