How People Approach Datadog

Observing systems has evolved over the years, and Datadog has been there helping it along. The way its agents deliver data into predictable, information-rich dashboards is genuinely nice. There is a lot to be said for being predictable, and I don’t think anyone would dispute that Datadog is predictable on all fronts, except maybe cost.

In this post, I’m going to use the persona frame from the baseline post and walk the same three people through Datadog.

Disclaimer up front!

I work at Honeycomb, and have for four-plus years. That has biased me toward the kinds of teams that struggle with Datadog and other monitoring tools. These are the people who operate outside the predictable patterns that fit nicely into the existing dashboards.

We all inherit from our architectures, and everyone ships their org chart. I am going to explore how these architectures and org charts influence the operator and user experiences.

Why Datadog deployments get enormous

Nobody sits down and decides to stand up a giant Datadog deployment. It accretes around you and your teams.

Datadog is a product per signal (APM, Logs, Metrics, RUM, Synthetics, Profiling, Security, on and on), and each one is its own little world with its own UI. You pick the silo first and then query inside it. There’s cross-signal stitching (Watchdog, correlations, the Software Catalog), but only when you go looking. Infra brings in host metrics. Some team flips on APM. The frontend folks want RUM. Security shows up with its own line items. Every one of those yeses is reasonable on its own, and not one of them made anybody rethink the whole stack.

So the thing grows sideways, one defensible decision at a time, until every signal lives under one vendor and a number in a spreadsheet startles somebody in finance. I won’t pretend the breadth is a trick — one vendor with a checkbox for whatever just broke is an easy thing to say yes to when you’re a stressed-out platform team.

But “it has every feature” doesn’t explain how it ends up in six budgets with nobody noticing. Two quieter things do that work: how easy it is to get data in, and how the bill gets carved up. Both land on the same idea: Datadog lets an organization grow its observability footprint without ever stopping to agree on anything. Honeycomb, by contrast, doesn’t have an opinionated agent or dashboard, so the implementers have to get consensus up front.

Getting data in is the lubrication

Datadog’s agent does something like 90% of the work for you and still lets you customize until your heart gives out. You drop the agent on a host, it discovers what’s running, and data starts showing up. The defaults are opinionated, and the opinions are pretty close to what most teams wanted anyway. You don’t have to know what good looks like. Datadog already decided.

Honeycomb is “all in on OpenTelemetry,” which asks more of you. The SDK does maybe half the work for a Java app and less for everything else. Infra still needs collectors stood up. And before any of that, if you’re doing it right, you owe yourself a conversation that sounds like “how should our team approach this”: span names, attribute conventions, what to sample, who owns the pipeline when it pages someone at 2am.

That conversation is a tax, and you pay it before a single trace lands. It’s a good tax, since the team that has it ends up with instrumentation that actually fits their system, but it’s a tax, and plenty of teams would rather not. Datadog lets them not. You get what you get, and it’s fine. One guy I worked with had dozens of email folders that each received a ping from Datadog every day so he knew the systems were working. He didn’t look at a dashboard, just watched unread messages accumulate.

So the easy on-ramp isn’t a convenience bolted onto the side. It’s the mechanism. Frictionless ingest is a big part of why the footprint spreads the way it does — every new team can light up its corner without convening a working group first.

The billing silos are doing work

The other half of the story is the invoice.

By keeping every product in its own silo, Datadog keeps every product on its own line. Whether or not anyone drew it up on a whiteboard, the effect is a single contract with tendrils running into half a dozen budgets. APM comes out of one team’s number, RUM out of another’s, Logs out of a third. Nobody feels the whole bill land on their desk, so nobody ever flinches.

Cost split six ways is cost nobody owns, and cost nobody owns is cost nobody is standing there with a reason to stop. The deployment accretes because the accretion never piles up anywhere big enough to scare a single budget-holder.

To be fair to the shops where this doesn’t happen: plenty of them installed the missing feedback loop on purpose. Tag-based cost allocation, a FinOps person watching the total, Datadog’s own cost tooling pointed at the sprawl. For those teams the per-product split isn’t a trap, it’s accounting. They can easily run the chargeback, so I don’t hear from those teams. They’re not mad.

Honeycomb’s “billing simplicity” shows up in the market differently. “Just pay for the events you consume.” Okay, so the observability team eats the whole cost? Otherwise, that team has to build a chargeback model to push the cost back out to the teams generating events.
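For what it’s worth, the simplest chargeback model I can imagine is a proportional split by event count. This is a minimal sketch, not anything Honeycomb ships: the function name, the team names, and the dollar figures are all made up for illustration.

```python
def chargeback(total_bill: float, events_by_team: dict[str, int]) -> dict[str, float]:
    """Split one event-based bill across teams in proportion to events sent."""
    total_events = sum(events_by_team.values())
    if total_events == 0:
        # Nobody sent anything, so nobody owes anything.
        return {team: 0.0 for team in events_by_team}
    return {
        team: round(total_bill * count / total_events, 2)
        for team, count in events_by_team.items()
    }


# Hypothetical numbers, purely for illustration.
shares = chargeback(
    10_000.0,
    {"checkout": 600_000, "search": 300_000, "platform": 100_000},
)
print(shares)  # {'checkout': 6000.0, 'search': 3000.0, 'platform': 1000.0}
```

Note what this model assumes: that the team sending the events is the team getting the value. Rounding to cents can also leave the shares a penny off the total, which a real model would have to reconcile.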

And the final ding against Honeycomb: which teams benefit from the answers to people’s queries is incredibly poorly correlated with the ingested event count.

So how does an Ops person approach Datadog?

The ops person lands on a dashboard and feels at home immediately. Honeycomb has a “service scoped home” but it’s based on an old assumption, that one team cares about one service.name. This is true of (from what I’ve seen) 0% of the population. They all have to head over to the query builder and learn. Datadog is the guy pointing at the charts going “it has fuel, the oil temperature is fine”. Honeycomb can answer the question “am I about to drive off a cliff?”, but you have to figure out how to ask that question.

Datadog’s fit is no accident. The ops box is the most predictable-shaped box on the org chart — it has wanted the same four golden signals for fifteen years — so a product built around opinionated, pre-aggregated dashboards lands exactly where they live. They never pay the consensus tax, because nobody had to agree on anything. The agreement was baked into the product years ago, and it happens to match what they’d have asked for.

They drill down the way they always have — filter a tag, jump to the APM service page, follow it into traces. Comfy and familiar!

The wheels come off the moment they want to group by something nobody pre-aggregated. High cardinality is where the dashboard model runs out of road, and “custom metric” turns out to be a phrase that means “you’re gonna get a call from procurement.” That’s the cost engine from before: instrumenting costs nothing at the moment you do it, and the bill shows up later with a higher custom metric count.
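The arithmetic behind that call from procurement is just multiplication. Here is a rough sketch, assuming each distinct combination of tag values on a metric name counts as its own series. The tag names and cardinalities are invented, and the result is an upper bound, since only combinations that actually occur get counted.

```python
from math import prod


def max_series(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct tag-value combinations for one metric name."""
    return prod(tag_cardinalities.values())


# Hypothetical tag cardinalities for a single latency metric.
base = {"service": 20, "endpoint": 15, "status_code": 5}

print(max_series(base))                           # 1500
print(max_series({**base, "user_id": 50_000}))    # 75000000
```

One extra tag took the ceiling from fifteen hundred series to seventy-five million. Nothing about the instrumentation got harder, which is why nobody notices until the invoice does.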

But for broad fleet ops, Datadog is comfortable in a way that matters to many operators. When you’re on the hook for a few hundred services, those dashboards and that endless integration tail carry weight you’d feel the second they vanished. For the predictable-shaped box, the predictability is the whole point.

So how does a developer approach Datadog?

Most developers I’ve watched in Datadog go straight to Logs. It’s the printf habit from the baseline post wearing a nicer coat — the search is fast, the filter chips feel like writing code, and it’s close enough to the console they already trust that they don’t have to learn a new way of thinking. A smaller, more enlightened cohort starts in APM and clicks into a trace, because flame graphs are pretty and watching your own request fan out across services scratches an itch.

Here’s where the org chart bites them. The developer almost never sat in the room where the pipeline got decided. Sampling rates, what gets kept, what gets dropped at the agent — that was an ops or platform call, made in a different box, optimized for cost and dashboard fidelity rather than for the one weird trace a developer is hunting at 4pm on a Thursday. So they go looking for the request that broke, and it isn’t there. It got sampled away by a policy they never saw, to save money in a budget that isn’t theirs.
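To make “sampled away” concrete, here is a generic sketch of deterministic head sampling. It is not how the Datadog agent decides what to keep. The hashing scheme, the function name, and the 1-in-100 rate are assumptions for illustration.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: int) -> bool:
    """Deterministic head sampling: keep roughly 1 in `sample_rate` traces."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % sample_rate
    return bucket == 0


# Hypothetical trace IDs; real ones would come from the tracing library.
trace_ids = [f"trace-{i}" for i in range(100_000)]
kept = [t for t in trace_ids if keep_trace(t, sample_rate=100)]

# Roughly one percent survive; everything else is gone before anyone queries.
print(len(kept))
```

The decision is made from the trace ID alone, before anyone knows whether the request was interesting. The one that broke at 4pm has the same odds as every boring one.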

That exact moment — “the trace I need doesn’t exist” — is where Honeycomb prospects are born. Not because Honeycomb is magic, but because the developer just found out the thing they cared about got thrown away upstream by someone optimizing for something else.

It isn’t all friction, though. Profiling and Live Containers are good developer tools, and Honeycomb doesn’t play in that space. If your problem is “which function is eating the CPU,” Datadog has a better answer. Finding out that the slow span shows up on only 0.3% of traces, and that those traces all come from one user… that’s where Honeycomb comes in, since most people can’t afford to add a user ID to custom tags unless they only have dozens of users.
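The question itself is tiny once the user ID is sitting on every event. Here is a toy version in plain Python, with made-up span data and field names, standing in for what a query builder would do over real events.

```python
from collections import Counter

# Hypothetical span events; the field names are illustrative only.
spans = (
    [{"user_id": f"u{i}", "duration_ms": 40} for i in range(997)]
    + [{"user_id": "u42", "duration_ms": 2_500} for _ in range(3)]
)

# Filter to the slow spans, then group them by the high-cardinality field.
slow = [s for s in spans if s["duration_ms"] > 1_000]
slow_share = len(slow) / len(spans)
slow_by_user = Counter(s["user_id"] for s in slow)

print(f"{slow_share:.1%}")  # 0.3%
print(slow_by_user)         # Counter({'u42': 3})
```

None of this is clever. It only works because `user_id` was recorded on every span instead of being dropped to keep a tag count down.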

So how does the mysterious third group approach Datadog?

The third group — my catch-all for engineering leads, data people, and anyone who didn’t write the code and doesn’t get paged by it — splits in two the moment you point Datadog at it. It’s also the group I’m farthest from. I know the ops box and the dev box from the inside. This one I’ve mostly watched from across the room.

The execs and engineering leaders love it. Datadog built tooling aimed right at them — the Software Catalog answers the “who owns what and how does it all fit together” question, Notebooks turn a pile of graphs into a narrative, and the executive dashboard is built so a VP can answer “are we on track” without ever interpreting a p99. And we know it works because of Datadog’s billions in annual revenue and consistent growth. Somebody with budget authority signs those renewals, and that somebody is either in this group or above it.

The data people are the other half, and they don’t use Datadog the way they’d use Honeycomb — mostly because they learned not to. The place I’ve actually watched this go sideways is at the Honeycomb end, not the Datadog one. The boss reads Charity describing observability as analytics for everything — compute resources, user experience, all the gooey business parts in between — and wants those insights. So they ask the data team to deploy an agent and have it all show up, the same way a Datadog agent makes host metrics show up. That’s not how it works. Honeycomb can do the advanced analytics, but only once the data is there: every transaction, annotated with an attribute for every business variable you care about.

And that’s exactly the thing the data team trained itself never to do. In Datadog, annotating every transaction like that is the dreaded custom metrics bill, so they never bothered — they built their own collection and piped it into Snowflake or whatever analytics tool they already trust. The habit followed them over.

How much revenue did this incident cost? Honeycomb can tell you without resorting to a BI report.

Who am I to tell you what to do?! Oh yeah. I’m Mike.

For dashboard-first ops with a long integration tail, Datadog wins on inertia and breadth, and it isn’t close. For exploration-first work on high-cardinality data — the developer hunting a trace that got sampled away, the data person who learned to keep business analytics in a warehouse because the observability tool priced them out — the gap with Honeycomb hasn’t closed, and Datadog’s product-per-silo shape is the reason it hasn’t.

Both of those are typically true in the same company, because most companies are not one shape. Do you want a team of people chartered with the primary responsibility of “keeping cardinality low”?

The first step, if you want one

The first step toward fixing a cost pathology is never to rip out the thing causing it. Tearing Datadog agents out of a few hundred services is more than a quarter of work, so it never gets scheduled, and it carries personal risk. Nobody sane volunteers for that, and I’m not going to be the Honeycomb guy who tells you to.

Focus on one question you currently can’t afford to ask — the high-cardinality one where the needed “custom metric” has been denied or removed. Since someone cares about the system and about the question, have them (or more likely Claude) add OpenTelemetry, annotated with the attributes you actually care about. A free Honeycomb account can take 20 million events per month so try it without talking to salesfolks. You don’t touch an agent, you don’t migrate anything, you just answer one question the old model priced out of reach.

The one bad outcome is the one most shops pick by default: keep paying the tax, and keep paying for every copy of the same data as metrics, logs, traces, errors, business intelligence, and data pipelines.

Nothing has to move. You instrument with OpenTelemetry alongside whatever Datadog’s already doing — and every workload you add that way is one less thing locking you in when the renewal lands.

How many other observability tools you got?

Also, I hate to even mention it, but there are more contestants:

Grafana — the free dashboard already in the building; schema-on-write four times over, exploration sold separately.
Observe — the Snowflake data-lake bet: no silos, but a modeling tax instead.

And if you wandered in mid-series, the baseline that started it is where the three personas come from.