Test in Prod, No Thanks! Continuous Deployment, Yes Please!

Nobody should test in prod, except that you already do, so get good at it.

Legacy organizations with legacy products have designed complex and rigid processes for releasing software into a production environment. The production environment is special and has limited access and compliance constraints and needs to always work, so developers shouldn’t even try to look at it.

Except things change and stuff breaks. So now what? Continuous deployment to the rescue!?

Thanks to Charity Majors and Honeycomb, there are new voices in the world espousing the need to test in production. This has drawn a seemingly endless chorus of voices screaming back that it’s completely ludicrous to test in production. My sense is that these voices are talking right past each other, possibly intentionally.

Not all tests

When people want to empower testing in production to be part of the process, they are not saying to move all tests out of the SDLC and just deploy nonsense and then see how it goes.

The tests that need to run in production are those that are only applicable to the data, workload, scale, constraints, credentials, and other environment-specific facets. The most important aspect of these facets is that they are ultimately unknowable and ever-changing. Even in highly controlled environments, the way that users engage, the number of them, the types of devices, and third-party components will change in subtle ways.

So any pipeline test that claims to represent the production environment fails in some obvious and some non-obvious ways.
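
As a concrete illustration, here’s a minimal sketch of the kind of check that only means anything when pointed at production; the endpoint path, environment variables, and latency budget are hypothetical, but the assertions depend on real credentials, real data, and real load that no pipeline environment can reproduce.

```python
import os
import requests

# Hypothetical production endpoint and a scoped, read-only credential.
PROD_URL = os.environ["USAGE_API_URL"]
PROD_TOKEN = os.environ["READONLY_PROD_TOKEN"]

def check_usage_endpoint() -> None:
    resp = requests.get(
        f"{PROD_URL}/usage/self",
        headers={"Authorization": f"Bearer {PROD_TOKEN}"},
        timeout=2.0,
    )
    # These assertions only mean something against live traffic and live data.
    assert resp.status_code == 200, f"unexpected status {resp.status_code}"
    assert resp.elapsed.total_seconds() < 0.5, "latency budget blown under real load"
    assert "total_bytes" in resp.json(), "contract drift against real data"

if __name__ == "__main__":
    check_usage_endpoint()
```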

Not instrumenting production for testing means the unknowns stay unknown until catastrophe.

Not all production

When people want to empower testing in production to be part of the process, they are not saying to upgrade your entire deployment in many breaking ways without validating anything beforehand.

The production environment for legacy systems is often made up of many components. In some of the most legacy environments, these components require a lot of coordination, and if one aspect isn’t perfectly matched to the others, the cross-system messages will fail or the shared schema won’t include the right fields or data types.

This coordination risk is absolutely massive and comes from poorly defined service boundaries and working agreements for how the relationships evolve. If a change to one system requires that changes to all the other systems be deployed at the same time, those systems are tightly coupled, super risky, very breaking, and untestable before it all lands in production.

If the idea of instrumenting your production environment is so scary because of the changes required to add that instrumentation, the technical debt is what you’re sensing. It’s likely at the core of everything creating friction for your business trying to evolve and succeed in the market.

Decouple the components, architect production for reliability, localize data, and get ready for velocity.

Continuous means all the time, without approval?

Since we know your production environment changes all the time, and we know your product needs to change a lot more, a lot faster, to keep up with customer and business demands, there seems to be an overlap here. Become comfortable with the production environment as a complex system, one that includes the developers and administrators as components of the sociotechnical whole.

You’ll find there are some good leverage points in the production system once the passage of time and human activities are factored in. Production isn’t a table setting that can be perfectly organized and expected to stay that way. It’s a system that has to re-set the table over and over while hundreds of kids and adults and dogs all use it.

Changes to the production system get a lot less scary when individuals have ownership of their components. It gets a lot easier when the pact between teams dictates what sorts of changes should be expected. It gets a lot safer when the release of new functionality doesn’t require actually changing the software. Deployments should rarely be apparent to users who are just trying to get the value they want from the provided software.

Summarize it in a numbered list

The three mindsets of production excellence that lead to a rapidly growing and stable system are:

  1. Automate deployments and do them often, BUT WITHOUT RELEASING FEATURES
  2. Factor in human activity and drop the assumption that production can be “stable”
  3. Use traces rather than logs/metrics so unknowables can become known

Any other stuff like “integration testing” and “service level objectives” will fall out of one of these mindsets.

The first one would lead to a lot of outcomes and encompasses most of “DevOps” and even “DevSecOps,” so it’s not worth rehashing. The benefits include stuff like modularity, isolating persistence and databases from the software (you can’t move fast while dragging a shared RDBMS around), automated testing, pipeline refinement, configuration management, and security policy reimagination. The point about feature releases being a separate thing from deployments is at the core of continuous deployment but is often overlooked.

The second one (people problems) leads to the cultural changes such as ownership of components and a shared understanding of what should change and what should stay.

The third is to establish a comfort level with the system’s quiet failures, the ones it can recover from on its own. When a massive SaaS system falls over, it’s rarely a single line of code that brought the whole thing down. The system has been swallowing recoverable errors of various types all along, and some new thing overwhelmed that capacity or degraded its ability to recover.
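
One way to make those quiet recoveries visible is sketched below using the OpenTelemetry Python API (not necessarily the tooling any given team uses); the service name, span name, and the injected fetch callable are all illustrative. Each recovered timeout becomes a span event, so the trend is visible long before it becomes an outage.

```python
from typing import Callable
from opentelemetry import trace

tracer = trace.get_tracer("billing-service")  # illustrative service name

def fetch_quota_with_retry(fetch: Callable[[str], int], user_id: str,
                           attempts: int = 3) -> int:
    """Call a flaky downstream dependency, recording every quiet recovery."""
    with tracer.start_as_current_span("fetch_quota") as span:
        span.set_attribute("user.id", user_id)
        for attempt in range(1, attempts + 1):
            try:
                return fetch(user_id)
            except TimeoutError as exc:
                # The call recovered (or will be retried), but the trace still
                # shows how often that happened; the trend is the early warning.
                span.add_event("recoverable_timeout",
                               {"attempt": attempt, "error": str(exc)})
        span.set_attribute("quota.fallback_used", True)
        return 0  # degraded-but-alive default instead of a user-facing failure
```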

Okay, we can’t change production or test in production because of reasons, so what?

Start with parts that are small already, that have a defined system border. Add traces and feature flags to things that are changing anyway. Categorize systems into the camps of those that have versioned and controlled APIs versus those that don’t. Any new system or refactored component or microservice carved from the monolith needs to land in the API camp.

The working agreement view of system architecture can help when considering changes. Set the expectation that APIs are either versioned specifically or follow a capability accrual pattern to maintain backward compatibility for the life of the service. This also leads to the system being better at shedding outmoded services as their usefulness fades. If an API needs to change so much that you’d go from V1 to V2, that sounds like a good time to massively refactor, reuse useful components, and delete the V1 stuff right away.

Run both services in an overlapping context for some time to make sure the new one is meeting the requirements and then decommission the old one. If the new one can’t keep up or struggles for other reasons, taking a step back and still having V1 in production means production is not at risk.
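
A minimal sketch of that overlap period might look like this, with v1_call, v2_call, and record_mismatch as hypothetical stand-ins for the real services and telemetry: V1 keeps answering users while V2 runs in the shadow and any disagreements get recorded.

```python
from typing import Any, Callable

def serve_with_shadow(
    v1_call: Callable[[str], Any],
    v2_call: Callable[[str], Any],
    record_mismatch: Callable[[str, Any, Any], None],
    user_id: str,
) -> Any:
    v1_result = v1_call(user_id)        # still the answer users actually get
    try:
        v2_result = v2_call(user_id)    # the candidate runs alongside, not instead
        if v2_result != v1_result:
            record_mismatch(user_id, v1_result, v2_result)
    except Exception as exc:            # a struggling V2 must never hurt production
        record_mismatch(user_id, v1_result, exc)
    return v1_result
```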

So we need to rebuild prod?

Yeah… It’s being rebuilt all the time anyway, and the unmeasured, high-impact risks should not be accepted or normalized.

Aside 1: What’s that accrual thing?

Time makes fools of us all. If you have a service that takes an input and returns an output, that becomes the basic functionality of the system.

For example, you take in a user account and get back a storage usage integer in bytes. As the production system adds new types of users and new types of data, that basic contract needs to always be fulfilled even as its relevance decreases. When the response for a user grows to include a data total plus three more fields for the quota, database, and object storage values, all services that expect only a single value can continue to react to that first “total” value. Newer services that need the breakdown or react to the quota can parse the additional fields.

The idea of “let’s replace the storage value with the breakdown” leads to breaking changes that are unnecessary.
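
Continuing the hypothetical storage-usage example, an accrued response might look like the sketch below; the field names are illustrative, and the only rule is that new fields are additive while the original field keeps its meaning.

```python
def usage_response(user_id: str) -> dict:
    # Placeholder values standing in for real lookups.
    quota_bytes = 10_000_000_000
    database_bytes = 1_250_000_000
    object_bytes = 3_750_000_000
    return {
        "total_bytes": database_bytes + object_bytes,  # original contract, still honored
        "quota_bytes": quota_bytes,                    # accrued fields: additive only
        "database_bytes": database_bytes,
        "object_storage_bytes": object_bytes,
    }

usage = usage_response("user-123")

# An old consumer written against the original contract keeps working.
legacy_total = usage["total_bytes"]

# A newer consumer can react to the breakdown without waiting for a V2.
near_quota = usage["total_bytes"] > 0.9 * usage["quota_bytes"]
```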

If every service operated this way until its responses became so irrelevant that it was time for a rewrite, the production system would shed the bad old stuff, and new teams using new tech could usher in higher performance, more resilience, and a better user experience. And really, user experience is all that matters here.

Aside 2: Sounds like feature flags.

When talking about Continuous Deployment or Delivery, the default assumption for most people is that you’re skipping the rigorous change control process and just letting random developers put new .jar files into WebSphere or something. That’s a fantastic starting point, but you’d find it quickly falls apart and needs refinement. The first refinement should be feature flags.

If you can feature flag the release, then you have a matrix of risk where one axis is deployment and the other is feature release, with deliberate increments along both. For the deployment axis, your single tiny feature or subcomponent change hits the master branch. The pipeline builds and tests and all looks good. The deployment doesn’t replace the only binary in existence with a different one. The deployment makes a new path among the many others that have been running successfully for a while. Some of these paths are probably 2 or 3 or 10 commits behind the change that is just now being released. Since none of them actually change the behavior of the system, it’s not a big deal to run them side by side and make sure none of them are creating additional error states.
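
Here’s a minimal sketch of that deployment axis, with a plain dict standing in for whatever flag service is actually in use and the two path functions as hypothetical stubs: the new code ships dark and handles no traffic until someone flips the flag.

```python
# A plain dict standing in for whatever flag service is actually in use.
FLAGS = {"new_usage_breakdown": False}   # default off: deployed, not released

def current_usage_path(user_id: str) -> dict:
    return {"total_bytes": 5_000_000_000}

def new_usage_path(user_id: str) -> dict:
    return {"total_bytes": 5_000_000_000, "quota_bytes": 10_000_000_000}

def storage_usage(user_id: str) -> dict:
    # The deploy added new_usage_path, but with the flag off nothing observable
    # changes; the new code just rides along without handling any traffic.
    if FLAGS["new_usage_breakdown"]:
        return new_usage_path(user_id)
    return current_usage_path(user_id)
```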

Since the developer and testing teams, and those building functionality on top of these new aspects, want to see and use them, we can start on the other dimension: release risk. Enable the feature for a single user or a group of internal users. Let them smoke test it while watching the observability tools. See the new traces and make sure performance is okay. If there are troubles, disable the feature flag and deploy some fixes. If it looks good, open up the feature flag to a larger audience.
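
The release axis might look something like the sketch below, where the internal user list and the rollout percentage are placeholders: exposure starts with named smoke testers and then widens by a deliberate percentage, with a stable hash keeping each user’s experience consistent between requests.

```python
import hashlib

INTERNAL_USERS = {"alice@example.com", "bob@example.com"}  # placeholder smoke testers
ROLLOUT_PERCENT = 0                                        # raise as confidence grows

def flag_enabled_for(user_id: str) -> bool:
    # Internal users see the feature first; everyone else is bucketed by a
    # stable hash so the same user gets the same answer on every request.
    if user_id in INTERNAL_USERS:
        return True
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT
```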

Both of these dimensions can react to incidents detected by the system itself. If a deployment creates an error state, the progressive delivery can be halted. If a feature flag is bringing a new code path to life that creates an error state, the feature flag can be disabled. In both cases, parts of production may be suffering, but the users are still happy and operations are still operating.
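
A rough sketch of that automated reaction, with the error-rate source and the flag setter as placeholders for real telemetry and a real flag service: if the error rate attributed to the new path crosses an assumed threshold, the flag gets switched off without waiting for a human.

```python
from typing import Callable

ERROR_RATE_THRESHOLD = 0.02   # assumed budget: 2% of requests on the new path

def guard_release(
    error_rate_for_flag: Callable[[str], float],
    disable_flag: Callable[[str], None],
    flag_name: str,
) -> None:
    # Runs on a schedule or off an alert: if the new path is generating too
    # many errors, shut the flag off immediately and ship the fix later.
    if error_rate_for_flag(flag_name) > ERROR_RATE_THRESHOLD:
        disable_flag(flag_name)
```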