Observability maturity favors data clarity over quantity

Site reliability engineers are tasked with maintaining the observability of systems, and while detailed data-gathering tools can help, they can also hinder visibility if not used correctly.

That was a major theme among presentations from expert site reliability engineers at this month’s SRECon. To wit, it’s not about how much data is gathered; it’s about how well it’s used to serve the business, keep systems running smoothly and keep team members informed. Observability, a term that has supplanted IT monitoring in cloud-native environments, refers to a practice in which systems can be queried effectively to troubleshoot or prevent problems in real time, with an emphasis on overall user experience rather than on the performance of individual system components.

Making good use of observability data starts with asking the right questions, aligned with the needs of the organization, said presenters from SRE software vendor Blameless, who showed examples of their internal dashboards that track service reliability against business priorities.

“As leaders in SRE, you can often be perceived as the bearer of bad news, especially to management,” said Christina Tan, a member of the Blameless strategy team. “Understanding business needs [means] that instead of being seen as a cost center, for SRE teams, you can show how you contribute to business goals and business growth.”

The list of places where system reliability needs improvement can seem endless in any enterprise, so SREs must prioritize the goals that matter most. Aligning observability data gathering with those goals will also help SREs present a more useful set of metrics to developers and business leaders.

“When companies invest in incident resolution, they may still have the same number of incidents, but the severity of customer impact will significantly decrease,” said Mindy Stevenson, director of engineering at Blameless, in the SRECon presentation. “And so perhaps instead of the number of incidents, incident severity is a better measure.”
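
To make Stevenson's point concrete, here is a minimal Python sketch of how a raw incident count can stay flat while a severity-weighted score drops. The severity labels, the weights and the sample incidents are all hypothetical and chosen only for illustration; none of them come from the Blameless presentation.

```python
from dataclasses import dataclass

# Hypothetical weights: a higher number means greater customer impact.
# These values are illustrative assumptions, not an industry standard.
SEVERITY_WEIGHT = {"sev1": 10, "sev2": 3, "sev3": 1}


@dataclass
class Incident:
    severity: str  # "sev1" (worst) through "sev3" (mildest)


def incident_count(incidents):
    """Raw number of incidents, regardless of impact."""
    return len(incidents)


def severity_score(incidents):
    """Sum of severity weights, a rough proxy for customer impact."""
    return sum(SEVERITY_WEIGHT[i.severity] for i in incidents)


# Made-up data: the same number of incidents before and after
# an investment in incident resolution, but with milder severities after.
before = [Incident("sev1"), Incident("sev1"), Incident("sev2"), Incident("sev3")]
after = [Incident("sev2"), Incident("sev3"), Incident("sev3"), Incident("sev3")]

print(incident_count(before), severity_score(before))  # 4 24
print(incident_count(after), severity_score(after))    # 4 6
```

In this toy data the count metric shows no change at all, while the weighted score falls from 24 to 6, which is the kind of improvement a count-only dashboard would hide.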