Grafana has been around for a while. More than a decade ago someone needed to graph something and did so. Because you can put it in front of almost any data store, it shows up in lots of places.
In recent years, Grafana (the firm) has created Grafana Cloud and built or bought backends like Mimir, Tempo, and Pyroscope. They’re now presenting themselves as a batteries-included observability solution. Let’s dig into that!
I’m going to use the persona frame from the baseline post and walk the same three people through Grafana. But first I want to dig into the cascade of effects because Grafana’s sales pipeline comes from the free tool you installed 12 years ago, that time you ran out of disk space.
Disclaimer up front!
I work at Honeycomb, and have for four-plus years. That gives me a bias toward the teams who fall off the edge of the dashboard. Grafana is great at a lot of things I’m not the right person to sell you on. Keep that in mind.
Every telemetry setup has the same 4 jobs
- Find a datastore
- Put the data into a datastore
- Make that data usable
- Get answers
The first job is mostly a procurement question. With a SaaS tool you buy the datastore. With Grafana you stand up Prometheus, Loki, and Tempo yourself, or you buy Grafana Cloud and let them run it. Either way it’s a solved problem and nobody’s observability experience lives or dies here.
The fourth job — getting answers out — is the part everyone thinks they’re choosing between. It’s the demo, the query bar, the dashboard.
But the third job is where the whole fight actually happens, and it’s the one few people realize is a decision. “Make the data usable” has a small number of shapes:
- Manipulate it in a pipeline to fit a rigid backend. Reshape on the way in so it matches the structure the store demands.
- Map it to a schema at ingest. Decide the columns up front and bind the data to them as it lands.
- Store it wide and raw, and impose structure when you query. Don’t decide anything at write time.
The data world has names for the two ends of this: schema-on-write and schema-on-read. Schema-on-write covers the first two shapes. It is cheap and fast for questions you ask repeatedly, because the designers optimized for exactly those questions. Schema-on-read is Honeycomb’s end of the pool. The floor is very low here: just JSON-ify something and write it. Meaning is derived at read time, which is the only way to answer a question you hadn’t thought of when the data was stored.
Grafana is schema-on-write, four times over. Prometheus has its own data model, and so do Loki, Tempo, and Pyroscope. Each one makes you decide the shape of the answer before you store the data. That is also the correct optimization for each specific kind of telemetry, given its priorities. It happens to de-prioritize a specific approach to questioning, one I find compelling.
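To make the two ends concrete, here is a toy sketch in Python. It is not how Prometheus or Honeycomb are implemented. The events, the field names (`service`, `status`, `user_id`), and the `count_by` helper are all made up for illustration. The write-side store keeps only counts for a shape chosen in advance, while the read-side store keeps the whole event and picks the shape per query.

```python
from collections import Counter

# Hypothetical events; every field name here is illustrative.
events = [
    {"service": "checkout", "status": 500, "user_id": "u1"},
    {"service": "checkout", "status": 200, "user_id": "u2"},
    {"service": "checkout", "status": 500, "user_id": "u1"},
    {"service": "cart", "status": 200, "user_id": "u3"},
]

# Schema-on-write: the shape is fixed before any data arrives.
SCHEMA = ("service", "status")
write_store = Counter()
for event in events:
    key = tuple(event[field] for field in SCHEMA)
    write_store[key] += 1  # user_id is dropped at this point, for good

# Schema-on-read: keep the whole event, decide the shape at query time.
read_store = list(events)


def count_by(store, *fields):
    """Group stored events by whichever fields the question needs."""
    return Counter(tuple(event[f] for f in fields) for event in store)


# The question planned for is a single cheap lookup on the write side.
checkout_errors = write_store[("checkout", 500)]

# The question nobody planned for only works where user_id survived.
failed = [event for event in read_store if event["status"] == 500]
errors_by_user = count_by(failed, "user_id")

print(checkout_errors)       # 2
print(dict(errors_by_user))  # {('u1',): 2}
```

The write-side lookup is the cheaper one, which is the whole point of that design. The catch is that "which user saw the errors" has no answer there, because the field was never kept.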
So how does an Ops person approach Grafana?
Grafana is the ops person’s reflex. Every place I’ve worked had a few Grafanas scattered around — one the platform team blessed, a couple somebody stood up for a single service and never tore down — and the operators reach for all of them the same way. It’s how they know things are okay. You get to feel like, “Everything looks good!” For knowable signals, a glance at a board you trust is the fastest confirmation of health there is.
And the signals genuinely are knowable. The first chart anyone builds is either “remaining memory” or “remaining disk space”. The scope creeps to include stuff like requests per endpoint and error rate per service. You knew the question at instrumentation time, so binding it to a rigid shape up front costs nothing, buys speed, and keeps the bill down. Schema-on-write is a fine trade when the questions don’t change, and the ops person’s questions mostly don’t. Build the board, hang it on the wall, leave it alone. This is the persona Grafana fits best, and it fits for the same reason Datadog fits its ops person: the ops box is a predictable shape, with a predictable amount of time spent on dashboard creation.
The board mostly stays alone, too. You rarely open an old dashboard and discover a brand-new problem on it. When you do, it’s a known signal crossing a line you were already watching, which is a regression on a metric you’d already drawn. This means a change control process failed or your folks didn’t learn from the last one. That’s monitoring working exactly as designed. But a dashboard can only ever surface a problem somebody knew to draw a panel for.
The fit ends the moment the green wall doesn’t explain the midnight pager noises. The alert fired, every dashboard looks normal, and now somebody has to ask a question nobody pre-drew. In a single-backend tool that’s a pivot. In Grafana it’s a scavenger hunt across datastores with different query languages, and Explore makes you pick the store before you know where the answer lives. The answer might be sitting in the cardinality Loki or Mimir dropped on the way in. When the platform team wired the derived fields and exemplars and trace-to-logs jumps, this is survivable and sometimes slick. When they didn’t, the ops person is alt-tabbing between Explore tabs copying a trace ID by hand. The difficulty is still there, hoisted onto whoever built the wiring, and the quality of every triage downstream depends on how good that person was.
There’s one place the ops person does care what attributes the developers are hanging on their spans and metrics: when it costs. “Just add a label to the Prometheus metric” sounds free, but every new label-value combination is a new time series, and a developer live-exploring with all their new attributes is precisely what detonates a Mimir cardinality bill. That’s the same cost engine as Datadog custom metrics because it legitimately hurts the backend. So the dev’s exploration tends to reach the ops person’s radar in exactly one form: as a cost that went up. Which sets up the next persona, because exploration is the developer’s whole job at incident time, and it’s the step where Grafana stops being built for them.
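The arithmetic behind "every new label-value combination is a new time series" is worth seeing once. This is a back-of-the-envelope sketch. The label names and counts are invented, and `series_count` is a hypothetical helper that gives a worst-case upper bound, since a real backend only stores the combinations that actually occur.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Worst-case series for one metric: the product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label sets for a single request-count metric.
before = {"service": 20, "endpoint": 50, "status_code": 5}
after = {**before, "customer_id": 10_000}

print(series_count(before))  # 5000
print(series_count(after))   # 50000000
```

One extra label with ten thousand values multiplies the ceiling by ten thousand. That is why a developer's harmless-looking attribute shows up on the ops side as a bill.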
The one other example I’ve seen is a 25-year-old, Grafana-all-the-way company where a developer asked, “Who do I contact about adding some more metrics for my new service?” and the serious answer was, “We already store hundreds of thousands of these. We can’t possibly store more. Please do not contact anyone about it.” The person asking was a developer who had just been told, in effect, that they were going in blind.
So how does a developer approach Grafana?
Most developers go straight to logs. It’s the printf habit from the baseline post, wearing a Loki coat. Search is fast, the filter chips feel like writing code, and it’s close enough to the console they already trust that they don’t have to learn a new way of thinking. Except they kind of do, because logs are LogQL, metrics are PromQL, and traces are TraceQL, and three query languages increase the cognitive load on newer engineers. TraceQL is the most interesting piece of the whole story: it’s the closest thing to what Honeycomb does to power trace exploration, and the place where the gap is actually closing.
But the schema-on-write tax the ops person paid at step three lands on the developer here, somewhere they don’t expect it. Loki indexes labels, not log content, and Loki’s own documentation tells you not to put high-cardinality fields in labels. So the exact move a developer wants, “group these logs by user.id on the new release,” is the thing Loki’s data model struggles to deliver.
Picture that developer at 4pm, an hour after a release, who wants to know if the new checkout.variant attribute correlates with the latency they’re seeing. In a schema-on-read world that attribute is queryable the instant it lands on the span. They added it to the code, it shipped, and it’s now a thing they can group by in the next query. Nobody had to open a ticket with another team for a pipeline change, or a new dashboard. In Grafana the same attribute is a code change plus a cardinality cost plus maybe a new panel plus, if they’re not a PromQL-and-LogQL native, a ticket to whoever owns the dashboards. And even the power user who skips the ticket and opens Explore can’t query a high-cardinality dimension the backend already discarded at ingest.
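Here is roughly what that 4pm question looks like when the spans are stored wide. It is a toy sketch over an in-memory list, not any vendor's query engine. The span data, the `one_page` variant name, and the `group_median` helper are invented for illustration. Only the `checkout.variant` attribute comes from the scenario above.

```python
from collections import defaultdict
from statistics import median

# Hypothetical spans stored as wide events; the values are made up.
spans = [
    {"name": "POST /checkout", "duration_ms": 120, "checkout.variant": "control"},
    {"name": "POST /checkout", "duration_ms": 135, "checkout.variant": "control"},
    {"name": "POST /checkout", "duration_ms": 910, "checkout.variant": "one_page"},
    {"name": "POST /checkout", "duration_ms": 870, "checkout.variant": "one_page"},
    {"name": "POST /checkout", "duration_ms": 110},  # emitted before the release
]


def group_median(spans, attribute, value_field="duration_ms"):
    """Group spans by any attribute, including one that only just started appearing."""
    groups = defaultdict(list)
    for span in spans:
        groups[span.get(attribute, "(missing)")].append(span[value_field])
    return {key: median(values) for key, values in groups.items()}


by_variant = group_median(spans, "checkout.variant")
print(by_variant)
# {'control': 127.5, 'one_page': 890.0, '(missing)': 110}
```

Nothing about the store had to change for the new attribute to become a group-by key. Spans written before the release simply land in the `(missing)` bucket.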
There’s an asymmetry between the constrained set of questions that ops folks answer and the unconstrained universe of developer questions. It makes exploration feel different in kind rather than degree, and it’s why the developer is the persona Grafana doesn’t serve as effectively. Grafana’s Pyroscope acquisition adds continuous profiling, which developers love once they care how much time each function eats. It’s just harder to paint a coherent picture of what users are doing.
So how does the mysterious third group approach Grafana?
This is the group I’ve watched least, so take the following lightly. The thing that occasionally pulls them in is the one piece of Grafana that is extremely flexible: it’ll draw a chart over almost any backend you point it at. A panel reading from Postgres next to a panel reading from Prometheus next to one hitting some internal API. For an engineering lead or a data person who wants a business number on the same screen as the system health, that’s a magical connection. It’s a connection the more opinionated tools obscure. Sometimes Grafana is the only surface in the building that’ll put revenue and p99 in the same row.
The one time I watched someone lean all the way into that, it ended badly, and for a reason that rhymes with everything above. They pushed A/B test results and business outcomes straight into the same backend setup the ops team used for infra. The instinct was right: put the business outcome next to the system that produced it. The substrate was wrong. Business questions are high-cardinality by nature because you need variant by segment and customer. That’s the shape the time-series model struggles with. It worked until it was a scaling nightmare and every new metric required a budget and capacity planning discussion.
Grafana is good, actually
So this article mushes together a few things: Grafana the dashboard, Loki and the other backends, and a nebulous set of questions and purposes. That isn’t really fair, but it is the experience people have when they want answers and the way to see the line go up is Grafana.
If there’s one thing that I’d point to as an undeniable benefit of always having Grafana around, it’s that ability to draw in data from disparate tools and show people lines.
One easy first step, if I may
One of those lines could be a Honeycomb chart showing the p99 of traces. You don’t have to migrate or change anything that Grafana does today, just fork the telemetry stream into Honeycomb for the high-cardinality stuff. The p99 is a useful line: it shows what the slowest 1% of requests, and the users behind them, are experiencing. Anyone looking at the board can click on that chart and be transported into the wonderful world of high-cardinality exploration.
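If p99 is a new idea for someone on the team, a small worked example helps. This sketch uses the nearest-rank definition of a percentile. Real backends often use interpolation or approximate sketches instead, so treat the method, the helper name, and the made-up durations as assumptions for illustration only.

```python
import math


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical trace durations: 98 fast requests and 2 very slow ones.
durations_ms = [100] * 98 + [2500, 4000]

p50 = percentile_nearest_rank(durations_ms, 50)
p99 = percentile_nearest_rank(durations_ms, 99)
mean = sum(durations_ms) / len(durations_ms)

print(p50)   # 100
print(mean)  # 163.0
print(p99)   # 2500
```

The median and the mean both say things are fine. The p99 is the line that shows somebody waited two and a half seconds.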
How many other observability tools you got?
Also, I hate to even mention it, but there are more contestants:
- Datadog — frictionless agents and a bill split across six budgets nobody owns.
- Observe — the Snowflake data-lake bet: no silos, but a modeling tax instead.
And if you wandered in mid-series, the baseline that started it is where the three personas come from.