Developers and engineers often use observability to solve three key business and technical challenges: reducing downtime, reducing latency, and improving efficiency.
Outage frequency, mean time to detection (MTTD), and mean time to resolution (MTTR) are common metrics used in security and IT incident management.
This section covers outage causes, frequency, and costs, as well as MTTD and MTTR trends.
Highlights:
Outage causes
More than a third (35%) of respondents said network failure was the most common cause of unplanned outages at their organization in the last two years. More than a quarter cited third-party or cloud provider services failure (29%), someone making a change to the environment (28%), and deploying software changes (27%) as other leading causes.
35% said network failure was the most common cause of unplanned outages in the last two years
Organization size insight
Those from large organizations were more likely to say someone making a change to the environment was a top cause (31% compared to 24% for both small and midsize). Those from small organizations were more likely to cite capacity constraints (26% compared to 19% for large and 16% for midsize).
Regional insight
Those surveyed in Europe and the Americas were more likely to say someone making a change to the environment was a common cause (32% and 31% respectively compared to 23% for those in Asia Pacific). Those surveyed in Asia Pacific were more likely to contend with capacity constraints (21% compared to 18% for those in the Americas and 15% for those in Europe) and unexpected traffic surges (22% compared to 18% for those in the Americas and 16% for those in Europe).
Industry insight
Network failure wasn’t the top choice for all industries. It was tied for first place with security failure for government respondents (34%), and with power failure for media/entertainment respondents (32%). Power failure was also the top choice for energy/utilities respondents (35%).
said high-business-impact outages cost at least $1 million per hour of downtime
said their MTTR improved to some extent since adopting observability
38% experienced high-business-impact outages at least once a week
Outage frequency
When we asked survey takers how often they experience low-, medium-, and high-business-impact outages, the median annual outage frequency across all business impact levels was 232 outages. Low-business-impact outages occurred the most frequently—more than half (57%) experienced them at least once a week, and 15% dealt with them daily. While high-business-impact outages happened the least frequently, 38% still experienced them at least once per week, and 12% said they occurred at least once per day.
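As a rough illustration of how per-respondent answers about outage frequency can be rolled up into an annual count like the one above, the sketch below maps frequency buckets to approximate yearly totals. The bucket names, mapping, and field layout are assumptions made for this example, not the survey's actual methodology.

```python
# Illustrative only: one way per-respondent outage frequencies could be
# annualized. The bucket-to-count mapping is an assumption, not the survey's
# actual methodology.
OUTAGES_PER_YEAR = {
    "daily": 365,
    "at_least_weekly": 52,
    "at_least_monthly": 12,
    "a_few_times_a_year": 4,
}

def annual_outages(responses: dict[str, str]) -> int:
    """Sum approximate annual outage counts across impact levels."""
    return sum(OUTAGES_PER_YEAR.get(freq, 0) for freq in responses.values())

# A respondent reporting weekly low-impact, monthly medium-impact, and a few
# high-impact outages per year lands near 68 outages annually.
print(annual_outages({
    "low": "at_least_weekly",
    "medium": "at_least_monthly",
    "high": "a_few_times_a_year",
}))
```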
Seven factors were associated with less frequent outages, including:
- Having more unified telemetry data: Those who had more unified telemetry data experienced 77% fewer annual outages than those who had more siloed telemetry data (96 outages compared to 409 outages).
- Achieving full-stack observability: Those who had achieved full-stack observability experienced 71% fewer outages per year than those who hadn’t (74 outages compared to 252 outages).
- Deploying more observability capabilities: The more capabilities they deployed, the fewer outages they experienced per year. For example, those who had deployed five or more observability capabilities experienced 47% fewer annual outages than those who had deployed four or fewer (196 outages compared to 370 outages). Those who had deployed 10 or more experienced 62% fewer annual outages than those who had deployed nine or fewer (96 outages compared to 252 outages). And those who had deployed 15 or more experienced 69% fewer annual outages than those who had deployed 14 or fewer (74 outages compared to 234 outages).
- Learning about interruptions with observability: Those who learn about interruptions with observability experienced 69% fewer annual outages than those who used more manual detection methods (114 outages compared to 366 outages).
- Integrating more types of business-related data with telemetry data: Those who had integrated five or more types of business-related data with their telemetry data experienced 47% fewer annual outages than those who integrated one to four types (134 outages compared to 252 outages).
- Using a single tool for observability: Those using a single tool for observability experienced 9% fewer annual outages than those using multiple tools (214 outages compared to 234 outages).
- Employing more observability best practices: Those who had employed five or more observability best practices experienced 8% fewer annual outages compared to those who had employed four or fewer (214 outages compared to 232 outages).
38% experienced high-business-impact outages at least once a week
Organization size insight
Those from small organizations experienced substantially more outages per year (410) compared to those from large (234) and midsize (183) organizations.
Regional insight
Those surveyed in the Americas experienced the fewest outages per year (94) compared to those surveyed in Europe (207) and Asia Pacific (272).
Industry insight
Government organizations experienced the most outages per year (419), followed by media/entertainment organizations (413). Services/consulting organizations experienced the fewest outages per year (55), followed by retail/consumer organizations (118).
Mean time to detection (MTTD)
The mean time to detect an outage is a common service-level metric used in security and IT incident management. The data shows that the median number of hours spent on MTTD per year across all business impact levels was 134 hours—which is approximately six days. The median MTTD for high-business-impact outages was 37 minutes, and more than a quarter (29%) of respondents said MTTD was an hour or more for high-business-impact outages.
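To make the MTTD and annual-detection-hours figures concrete, here is a minimal sketch of how both could be computed from incident records; the timestamps and record layout are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (when the fault began, when it was detected).
incidents = [
    (datetime(2024, 3, 1, 9, 0),   datetime(2024, 3, 1, 9, 30)),
    (datetime(2024, 4, 12, 14, 5), datetime(2024, 4, 12, 14, 42)),
    (datetime(2024, 6, 20, 2, 15), datetime(2024, 6, 20, 3, 40)),
]

detection_times = [detected - began for began, detected in incidents]

# MTTD is the mean of the individual detection times.
mttd = sum(detection_times, timedelta()) / len(detection_times)

# Total annual detection time is the sum of the individual detection times.
total_hours = sum(detection_times, timedelta()).total_seconds() / 3600

print(f"MTTD: {mttd}, total detection time: {total_hours:.1f} hours")
```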
Seven factors were associated with faster MTTD, including:
- Achieving full-stack observability: Those who had achieved full-stack observability spent 85% fewer hours detecting outages per year than those who hadn’t (23 hours compared to 155 hours).
- Deploying more observability capabilities: The more capabilities they deployed, the less time they spent detecting outages per year. For example, those who had deployed five or more observability capabilities spent 52% less time detecting outages than those who had deployed four or fewer (95 hours compared to 195 hours). Those who had deployed 10 or more spent 77% less time detecting outages per year than those who had deployed nine or fewer (39 hours compared to 170 hours). And those who had deployed 15 or more spent 84% less time detecting outages per year than those who had deployed 14 or fewer (22 hours compared to 138 hours).
- Having more unified telemetry data: Those who had more unified telemetry data spent 79% less time detecting outages per year than those who had more siloed telemetry data (28 hours compared to 225 hours).
- Learning about interruptions with observability: Those who learn about interruptions with observability spent 78% less time detecting outages per year than those who used more manual detection methods (48 hours compared to 216 hours).
- Integrating more types of business-related data with telemetry data: Those who had integrated five or more types of business-related data with their telemetry data spent 65% less time detecting outages per year than those who integrated one to four types (57 hours compared to 162 hours).
- Employing more observability best practices: Those who had employed five or more observability best practices spent 35% less time detecting outages per year compared to those who had employed four or fewer (90 hours compared to 138 hours).
- Using a single tool for observability: Those using a single tool for observability spent 15% less time detecting outages per year than those using multiple tools (117 hours compared to 138 hours).
29% took at least an hour to detect high-business-impact outages
Organization size insight
On average, midsize organizations spent less time detecting outages per year (101 hours) than large (138 hours) and small (163 hours) organizations.
Regional insight
On average, those surveyed in Asia Pacific spent the most time detecting outages per year (219 hours), followed by those surveyed in Europe (110 hours) and the Americas (42 hours).
Industry insight
The industries that spent the least time detecting outages per year included services/consulting (23 hours), retail/consumer (61 hours), and education (64 hours). The industries that spent the most time included media/entertainment (331 hours), government (269 hours), and financial services/insurance (227 hours).
Mean time to resolution (MTTR)
There are similar patterns with MTTR, another common service-level metric used in security and IT incident management. The median number of hours spent on MTTR per year across all business impact levels was 141 hours—which is about six days. The median MTTR for high-business-impact outages was 51 minutes, and more than a third (39%) of respondents said MTTR was an hour or more for high-business-impact outages.
Seven factors were associated with faster MTTR, including:
- Achieving full-stack observability: Those who had achieved full-stack observability spent 76% fewer hours resolving outages per year than those who hadn’t (41 hours compared to 168 hours).
- Having more unified telemetry data: Those who had more unified telemetry data spent 76% less time resolving outages per year than those who had more siloed telemetry data (62 hours compared to 258 hours).
- Deploying more observability capabilities: The more capabilities they deployed, the less time they spent resolving outages per year. For example, those who had deployed five or more observability capabilities spent 41% less time resolving outages than those who had deployed four or fewer (113 hours compared to 191 hours). Those who had deployed 10 or more spent 70% less time resolving outages per year than those who had deployed nine or fewer (53 hours compared to 179 hours). And those who had deployed 15 or more spent 75% less time resolving outages per year than those who had deployed 14 or fewer (38 hours compared to 150 hours).
- Learning about interruptions with observability: Those who learn about interruptions with observability spent 74% less time resolving outages per year than those who used more manual detection methods (63 hours compared to 240 hours).
- Integrating more types of business-related data with telemetry data: Those who had integrated five or more types of business-related data with their telemetry data spent 57% less time resolving outages per year than those who integrated one to four types (77 hours compared to 178 hours).
- Using a single tool for observability: Those using a single tool for observability spent 20% less time resolving outages per year than those using multiple tools (124 hours compared to 155 hours).
- Employing more observability best practices: Those who had employed five or more observability best practices spent 10% less time resolving outages per year compared to those who had employed four or fewer (130 hours compared to 145 hours).
39% took at least an hour to resolve high-business-impact outages
Organization size insight
Midsize organizations spent the least time resolving outages per year (118 hours), compared to large (155 hours) and small (167 hours) organizations.
Regional insight
Respondents surveyed in Asia Pacific spent the most time resolving outages per year (245 hours), followed by those surveyed in Europe (125 hours) and the Americas (53 hours).
Industry insight
The industries that spent the least time resolving outages per year included services/consulting (48 hours), retail/consumer (75 hours), and education (97 hours). The industries that spent the most time included government (302 hours), media/entertainment (284 hours), and financial services/insurance (277 hours).
Total downtime
Given the frequency of outages and the time needed to detect and resolve them, downtime adds up to a considerable total for organizations. The data show that the median annual downtime across all business impact levels was 77 hours, or about 3 days.
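The back-of-the-envelope arithmetic below shows how a couple of hundred outages a year compounds into days of downtime. The per-outage resolution time is an assumption chosen purely to illustrate the scale, not a figure from the survey.

```python
# Back-of-the-envelope illustration: annual downtime is roughly the number of
# outages multiplied by the typical time to resolve each one. The 20-minute
# average is an assumption for illustration, not a survey figure.
outages_per_year = 232       # median annual outage count across impact levels
avg_resolution_minutes = 20  # assumed average resolution time per outage

annual_downtime_hours = outages_per_year * avg_resolution_minutes / 60
annual_downtime_days = annual_downtime_hours / 24

print(f"{annual_downtime_hours:.0f} hours, or about {annual_downtime_days:.1f} days, of downtime per year")
```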
Several factors were associated with less annual downtime, including:
- Deploying more observability capabilities: The more capabilities they deployed, the less downtime they experienced per year. For example, those who had deployed five or more observability capabilities experienced 45% less downtime than those who had deployed four or fewer (223 hours compared to 409 hours). Those who had deployed 10 or more experienced 74% less downtime per year than those who had deployed nine or fewer (95 hours compared to 371 hours). And those who had deployed 15 or more experienced 80% less downtime per year than those who had deployed 14 or fewer (60 hours compared to 299 hours).
- Achieving full-stack observability: Those who had achieved full-stack observability experienced 79% less downtime per year than those who hadn’t (70 hours compared to 338 hours).
- Having more unified telemetry data: Those who had more unified telemetry data experienced 78% less downtime per year than those who had more siloed telemetry data (107 hours compared to 488 hours).
- Learning about interruptions with observability: Those who learn about interruptions with observability experienced 73% less downtime per year than those who used more manual detection methods (118 hours compared to 445 hours).
- Integrating more types of business-related data with telemetry data: Those who had integrated five or more types of business-related data with their telemetry data experienced 63% less downtime per year than those who integrated one to four types (139 hours compared to 370 hours).
- Employing more observability best practices: Those who had employed five or more observability best practices experienced 19% less downtime per year compared to those who had employed four or fewer (239 hours compared to 294 hours).
- Using a single tool for observability: Those using a single tool for observability experienced 18% less downtime per year than those using multiple tools (249 hours compared to 305 hours).
Organization size insight
Small organizations had the highest median annual downtime (372 hours, which is about 16 days), followed by large (300 hours, which is about 13 days) and then midsize (230 hours, which is about 10 days) organizations.
Regional insight
Respondents surveyed in Asia Pacific had the highest median annual downtime (467 hours, which is about 19 days), followed by those surveyed in Europe (227 hours, which is about 9 days) and the Americas (97 hours, which is about 4 days).
Industry insight
The industries with the highest median annual downtime included media/entertainment (608 hours, which is about 25 days), government (564 hours, which is about 24 days), and financial services/insurance (528 hours, which is about 22 days). The industries with the lowest median annual downtime included services/consulting (80 hours, which is about 3 days), education (158 hours, which is about a week), and retail/consumer (164 hours, which is about a week).
Outage cost
The median outage cost per hour of downtime was $1.3 million for low-business-impact outages, $1.6 million for medium-business-impact outages, and $1.9 million for high-business-impact outages.
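To show how quickly hourly costs compound, the short calculation below multiplies the median hourly costs above by a hypothetical split of annual downtime across impact levels; the downtime split is invented for illustration.

```python
# Illustration of how hourly outage costs compound over a year.
# The split of annual downtime across impact levels is assumed for illustration.
hourly_cost = {"low": 1.3e6, "medium": 1.6e6, "high": 1.9e6}    # median $/hour from the survey
assumed_downtime_hours = {"low": 40, "medium": 25, "high": 12}  # hypothetical annual hours

annual_cost = sum(hourly_cost[level] * assumed_downtime_hours[level] for level in hourly_cost)
print(f"Estimated annual outage cost: ${annual_cost / 1e6:.0f} million")
```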
said high-business-impact outages cost at least $1 million per hour of downtime
Six factors were associated with a lower median outage cost for high-business-impact outages, including:
- Deploying more observability capabilities: The more capabilities they deployed, the less they spent on outage costs per hour. For example, those who had deployed five or more observability capabilities spent 5% less on outages per hour than those who had deployed four or fewer ($1.9 million compared to $2.0 million). Those who had deployed 10 or more spent 41% less on outages per hour than those who had deployed nine or fewer ($1.3 million compared to $2.2 million). And those who had deployed 15 or more spent 50% less on outages per hour than those who had deployed 14 or fewer ($1.0 million compared to $2.0 million).
- Achieving full-stack observability: Those who had achieved full-stack observability spent 48% less on outages per hour than those who hadn’t ($1.1 million compared to $2.1 million).
- Learning about interruptions with observability: Those who learn about interruptions with observability spent 19% less on outages per hour than those who used more manual detection methods ($1.7 million compared to $2.1 million).
- Integrating more types of business-related data with telemetry data: Those who had integrated five or more types of business-related data with their telemetry data spent 32% less on hourly outage costs than those who integrated one to four types ($1.5 million compared to $2.2 million).
- Using a single tool for observability: Those using a single tool for observability spent 45% less on outages per hour than those using multiple tools ($1.1 million compared to $2.0 million).
- Employing more observability best practices: Those who had employed five or more observability best practices spent 35% less on outages per hour compared to those who had employed four or fewer ($1.3 million compared to $2.0 million).
Organization size insight
Large organizations had higher median hourly outage costs for high-business-impact outages ($2.1 million) than midsize ($2.0 million) or small ($1.3 million).
Regional insight
Respondents surveyed in Asia Pacific had the highest median hourly outage costs ($2.3 million) compared to those in Europe ($1.7 million) and the Americas ($1.4 million).
Industry insight
The industries with the highest median hourly outage costs for high-business-impact outages included government ($2.3 million), media/entertainment ($2.2 million), telco ($2.2 million), and financial services/insurance ($2.2 million). The industries with the lowest median hourly outage costs included services/consulting ($1.3 million), education ($1.3 million), and healthcare/pharma ($1.3 million).
Detection of interruptions
While respondents were still more likely to say they learn about interruptions with observability (54%) than without observability (45%), the share who learn about them with observability is 26% lower than last year. And 12% more said they learn about interruptions with one observability platform compared to last year (17% compared to 15% in 2023).
Compared to those who learned about interruptions without observability, those who learned about them with observability:
- Experienced 73% less annual downtime (118 hours compared to 445 hours).
- Spent 19% less on hourly outage costs ($1.7 million compared to $2.1 million).
- Spent 38% less engineering time addressing disruptions (10 hours per week compared to 16 hours per week based on a 40-hour work week).
45% still learn about interruptions through less efficient methods
Regional insight
Those surveyed in the Americas were the most likely to learn about interruptions with observability (63% compared to 55% for those in Europe and 46% for those in Asia Pacific). Conversely, those surveyed in Asia Pacific were the most likely to learn about them without observability (54% compared to 44% for those in Europe and 36% for those in the Americas).
Industry insight
Services/consulting respondents were the most likely to learn about interruptions with observability (74%), followed by healthcare/pharma (60%) and IT (58%). Media/entertainment respondents were the most likely to say they learn about them without observability (57%), followed by energy/utilities (56%) and telco (53%).
Time spent addressing disruptions
The median percentage of engineering team time spent addressing disruptions was 30%, which works out to 12 hours per week based on a 40-hour work week. Nearly half (45%) of respondents said their engineering team spends less than 30% of their time addressing disruptions, or less than 12 hours per week based on a 40-hour work week.
The percentage of engineering team time spent addressing disruptions was correlated with annual downtime (correlation value of 0.516).
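The correlation cited here is a standard Pearson coefficient. The snippet below shows how such a value would be computed, using made-up numbers rather than the survey's respondent-level data (Python 3.10+ for statistics.correlation).

```python
from statistics import correlation  # Python 3.10+

# Made-up numbers illustrating the kind of relationship described above:
# respondents with more annual downtime tend to report a higher share of
# engineering time spent addressing disruptions.
annual_downtime_hours = [20, 80, 150, 300, 450, 600]
disruption_time_pct = [10, 20, 25, 35, 45, 55]

print(f"Pearson r = {correlation(annual_downtime_hours, disruption_time_pct):.3f}")
```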
Seven factors were associated with a lower percentage of engineering team time spent addressing disruptions, including:
- Using a single tool for observability: Those using a single tool for observability spent 50% less engineering time addressing disruptions than those using multiple tools (17% compared to 33%, or seven hours compared to 13 hours based on a 40-hour work week).
- Achieving full-stack observability: Those who had achieved full-stack observability spent 44% less engineering time addressing disruptions than those who hadn’t (20% compared to 36%, or eight hours compared to 14 hours based on a 40-hour work week).
- Deploying more observability capabilities: The more capabilities they deployed, the smaller the share of engineering time spent addressing disruptions. For example, those who had deployed five or more observability capabilities spent 24% less engineering time addressing disruptions than those who had deployed four or fewer (29% compared to 38%, or 12 hours compared to 15 hours based on a 40-hour work week). Those who had deployed 10 or more spent 41% less engineering time addressing disruptions than those who had deployed nine or fewer (22% compared to 38%, or nine hours compared to 15 hours based on a 40-hour work week). And those who had deployed 15 or more spent 39% less engineering time addressing disruptions than those who had deployed 14 or fewer (20% compared to 33%, or eight hours compared to 13 hours based on a 40-hour work week).
- Employing more observability best practices: Those who had employed five or more observability best practices spent 38% less engineering time addressing disruptions compared to those who had employed four or fewer (21% compared to 34%, or eight hours compared to 14 hours based on a 40-hour work week).
- Learning about interruptions with observability: Those who learn about interruptions with observability spent 38% less engineering time addressing disruptions than those who used more manual detection methods (25% compared to 40%, or 10 hours compared to 16 hours based on a 40-hour work week).
- Integrating more types of business-related data with telemetry data: Those who had integrated five or more types of business-related data with their telemetry data spent 27% less engineering time addressing disruptions than those who integrated one to four types (27% compared to 37%, or 11 hours compared to 15 hours based on a 40-hour work week).
- Having more unified telemetry data: Those who had more unified telemetry data spent 11% less engineering time addressing disruptions than those who had more siloed telemetry data (28% compared to 32%, or 11 hours compared to 13 hours based on a 40-hour work week).
said their engineering team spends at least half of their time addressing disruptions
Organization size insight
Midsize and large organizations spent more time addressing disruptions (32% and 31% respectively) than small organizations (25%).
Regional insight
Respondents surveyed in Asia Pacific estimated the most time spent addressing disruptions (41%), followed by those in Europe (30%) and then the Americas (20%).
Industry insight
The industries with the highest time spent addressing disruptions included media/entertainment (49%), government (43%), and financial services/insurance (40%). The industries with the lowest time spent addressing disruptions included education (20%), services/consulting (20%), and healthcare/pharma (24%).
MTTx change
We also wanted to know how respondents thought their organization’s MTTx (MTTD and MTTR) for outages had changed since adopting an observability solution.
MTTD change
For MTTD, the data show that more than half (56%) of respondents indicated some degree of improvement since adopting an observability solution, including 29% who said it improved by 25% or more. About one in five (19%) said it remained the same.
Six factors were associated with improved MTTD, including:
- Deploying more observability capabilities: The more capabilities they deployed, the more likely they were to say their MTTD improved to some extent. For example, those who had deployed five or more observability capabilities were 62% more likely to say it improved than those who had deployed four or fewer (61% compared to 38%). Those who had deployed 10 or more were 44% more likely to say it improved than those who had deployed nine or fewer (69% compared to 48%). And those who had deployed 15 or more were 34% more likely to say it improved than those who had deployed 14 or fewer (72% compared to 54%).
- Achieving full-stack observability: Those who had achieved full-stack observability were 37% more likely to say their MTTD improved to some extent than those who hadn’t (70% compared to 51%).
- Employing more observability best practices: Those who had employed five or more observability best practices were 37% more likely to say their MTTD improved to some extent than those who had employed four or fewer (72% compared to 53%).
- Learning about interruptions with observability: Those who learned about interruptions with observability were 35% more likely to say their MTTD improved to some extent than those who used more manual detection methods (63% compared to 47%).
- Integrating more types of business-related data with telemetry data: Those who had integrated five or more types of business-related data with their telemetry data were 33% more likely to say their MTTD improved to some extent than those who integrated one to four types (66% compared to 50%).
- Having more unified telemetry data: Those who had more unified telemetry data were 15% more likely to say their MTTD improved to some extent than those who had more siloed telemetry data (63% compared to 55%).
56% said their MTTD improved to some extent since adopting observability
Regional insight
Respondents surveyed in the Americas were much more likely to say their MTTD had improved to some extent since adopting observability (69% compared to 48% for those in both Asia Pacific and Europe).
Industry insight
Services/consulting respondents were the most likely to say their MTTD has improved to some extent since adopting observability (65%), followed by retail/consumer (63%), healthcare/pharma (63%), and media/entertainment (60%) respondents.
MTTR change
Seven factors were associated with improved MTTR, including:
- Deploying more observability capabilities: The more capabilities they deployed, the more likely they were to say their MTTR improved to some extent. For example, those who had deployed five or more observability capabilities were 30% more likely to say it improved than those who had deployed four or fewer (63% compared to 48%). Those who had deployed 10 or more were 27% more likely to say it improved than those who had deployed nine or fewer (69% compared to 54%). And those who had deployed 15 or more were 23% more likely to say it improved than those who had deployed 14 or fewer (71% compared to 58%).
- Employing more observability best practices: Those who had employed five or more observability best practices were 36% more likely to say their MTTR improved to some extent than those who had employed four or fewer (77% compared to 56%).
- Achieving full-stack observability: Those who had achieved full-stack observability were 23% more likely to say their MTTR improved to some extent than those who hadn’t (69% compared to 56%).
- Integrating more types of business-related data with telemetry data: Those who had integrated five or more types of business-related data with their telemetry data were 20% more likely to say their MTTR improved to some extent than those who integrated one to four types (67% compared to 57%).
- Having more unified telemetry data: Those who had more unified telemetry data were 14% more likely to say their MTTR improved to some extent than those who had more siloed telemetry data (65% compared to 57%).
- Learning about interruptions with observability: Those who learn about interruptions with observability were 13% more likely to say their MTTR improved to some extent than those who used more manual detection methods (63% compared to 55%).
- Using a single tool for observability: Those using a single tool for observability were 11% more likely to say their MTTR improved to some extent than those using multiple tools (65% compared to 59%).
said their MTTR improved to some extent since adopting observability
Organization size insight
Small (64%) and large (61%) organizations were more likely to report improved MTTR than midsize organizations (55%).
Regional insight
Respondents surveyed in the Americas were much more likely to say their MTTR improved to some extent since adopting observability (67%) compared to those surveyed in Asia Pacific (59%) and Europe (47%).
Industry insight
Media/entertainment respondents were the most likely to say their MTTR has improved to some extent since adopting observability (73%), followed by education (71%), healthcare/pharma (66%), services/consulting (65%), and financial services/insurance (62%).
Influencers of lower MTTx by capability
The data show a positive association between a lower-than-average MTTD and MTTR and 11 observability capabilities:
- Business observability and error tracking are statistically significant at the 5% level.
- Alerts and dashboards have had a positive association for three years in a row (2022–2024).
- Error tracking and log management have had a positive association for two years in a row (2023 and 2024).
- APM, database monitoring, and security monitoring have had a positive association twice (2022 and 2024).
- AI monitoring (new this year), browser monitoring, business observability (new this year), and network monitoring had a positive association for the first time this year.
Downtime reduction
At least a third of respondents said conducting root cause analysis (RCA) and post-incident reviews (37%), monitoring DORA (DevOps Research and Assessment) metrics (34%), monitoring the golden signals (33%), and tracking, reporting, and incentivizing MTTx (33%) have helped their organization reduce downtime.
About a quarter said implementing service-level management (28%), providing organization-wide access to observability data (26%), using dashboards to report detailed performance and health KPIs (22%), and configuring automated alerts for critical incidents (22%) have helped their organization reduce downtime.
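As a simple, vendor-neutral illustration of what configuring an automated alert on a golden signal can look like, the sketch below fires a notification when the error rate crosses a threshold; the function names, threshold, and notification hook are placeholders rather than any specific product's API.

```python
# Minimal, vendor-neutral sketch of an automated alert on one golden signal
# (error rate). The threshold and notify() hook are placeholders, not any
# specific product's API.
def notify(message: str) -> None:
    print(message)  # stand-in for a paging or incident-management integration

def check_error_rate(errors: int, requests: int, threshold: float = 0.05) -> None:
    rate = errors / requests if requests else 0.0
    if rate > threshold:
        notify(f"Critical: error rate {rate:.1%} exceeds {threshold:.0%} threshold")

check_error_rate(errors=120, requests=2000)  # 6% error rate, so the alert fires
```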
37% said conducting root cause analysis and post-incident reviews helped reduce downtime
Organization size insight
Respondents from large organizations were notably more likely than those from midsize and small organizations to say that conducting RCA and post-incident reviews, monitoring DORA metrics, and monitoring the golden signals helped reduce downtime.
Regional insight
Respondents surveyed in Asia Pacific were much more likely to say monitoring DORA metrics helped reduce downtime. Respondents surveyed in the Americas were much more likely to say monitoring the golden signals and using a centralized log management system helped reduce downtime.
Industry insight
Nearly half of respondents in the following industries said monitoring DORA metrics helped reduce downtime: media/entertainment (48%), telco (47%), government (44%), and financial services/insurance (42%). At least a third of respondents in the following industries said monitoring the golden signals helped reduce downtime: education (39%), financial services/insurance (39%), retail/consumer (37%), media/entertainment (35%), telco (35%), energy/utilities (35%), and industrial/materials/manufacturing (33%).