Imagine that you're the world’s greatest DevOps detective. When issues strike at the heart of a system, you need to track down the details of everything going on in the technology stack leading up to, during, and after an incident. People—developers, users, SREs—can be unreliable witnesses. Logs, on the other hand, may contain the detailed information you need to reconstruct the crime scene. Logs are configurable to provide the level of granularity that an organization deems necessary, which can be helpful in finding the smoking gun.
You may already know what role logs can play in troubleshooting issues that happen across a technology stack, but using logs to improve the efficiency and effectiveness of an organization is a game changer. For example, dig into the case of the hidden forest in the trees: an application is down and you have to sort through 10,000+ logs (trees) to find the source of the issue (the forest). How do you do this efficiently so that you can quickly find the most relevant information regarding the application’s downtime?
In this blog post, you’ll learn how to improve your troubleshooting by leveraging correlated log data with New Relic logs in context. When your logs are correlated to other incoming telemetry data, you can detect and resolve incidents faster, helping you improve on two key performance indicators for stabilizing system availability and performance: the mean time to detection and the mean time to resolution.
What are MTTD and MTTR?
Mean time to detection (MTTD) and mean time to resolution (MTTR) represent two key business measures for how efficiently you are able to detect and analyze incident occurrences, determine root causes, and remediate the underlying issues.
What is mean time to detection (MTTD)?
MTTD is a statistical average. In the MTTD calculation, the numerator is the sum of the time between an incident occurrence and its detection, for all incidents. The denominator is the total number of incidents.
MTTD = sum(time between incident occurrence and detection)/number of incidents
What factors affect MTTD?
Several factors can impact MTTD, including the effectiveness of monitoring tools, the skill level of the people responding to incidents, the complexity of the IT environment, and the quality of incident detection processes.
What is mean time to resolution (MTTR)?
MTTR is also a statistical calculation. To calculate MTTR, the numerator is the sum of the time between an incident occurrence and its resolution, for all incidents. The denominator is the total number of incidents. Resolution in this definition includes the time required to fix the underlying problem, clean up the downstream effects, and take the necessary steps to ensure that the problem does not reoccur.
MTTR = sum(time between incident occurrence and resolution)/number of incidents
How is MTTD different from MTTR?
MTTD measures the average time it takes to detect an incident, while MTTR (mean time to resolution) measures the average time it takes to fully resolve it and restore normal operations. In the formulas above, both are measured from the moment the incident occurs. MTTD is about identification, while MTTR is about resolution.
Let’s see an example. Check out the next table of incidents. Each row is an individual incident with the time of occurrence, detection, and resolution listed.
For this set of four incidents, here's how to calculate MTTD and MTTR:
MTTD = (1.5 + 0.8 + 3.5 + 0.5) / 4 = 1.575 hours
MTTR = (8 + 4.75 + 13.33 + 7.33) / 4 = 8.3525 hours
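The same arithmetic can be sketched in a few lines of Python. This is only an illustration: it takes the per-incident durations from the calculation above as given, and the names `detection_hours`, `resolution_hours`, and `mean_time` are made up for this example.

```python
# Hours from occurrence to detection and from occurrence to resolution,
# copied from the four incidents in the calculation above.
detection_hours = [1.5, 0.8, 3.5, 0.5]
resolution_hours = [8, 4.75, 13.33, 7.33]


def mean_time(durations):
    """Return the arithmetic mean of a list of durations."""
    return sum(durations) / len(durations)


mttd = mean_time(detection_hours)
mttr = mean_time(resolution_hours)

print(f"MTTD = {mttd:.3f} hours")
print(f"MTTR = {mttr:.4f} hours")
```

In practice the durations would be derived from incident timestamps rather than typed in by hand, but the averaging step stays the same.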
Minimizing MTTD and MTTR is essential for businesses to maintain uptime and the reliability of their systems, as well as to improve the digital customer experience. Think of it from a detective’s perspective. You want to keep your town, DevOpsville, safe for local residents and appealing to tourists. A part of that mission is to keep crime rates low. To that end, you need mechanisms in place to discover when a crime has occurred, to bring the responsible parties to justice, and to ensure that the problem does not happen again. This is exactly what improving MTTD and MTTR can do for your business.
Mean time to failure (MTTF) and mean time between failures (MTBF)
Mean time to failure (MTTF) and mean time between failures (MTBF) are pivotal metrics in reliability engineering. MTTF, applicable to non-repairable systems, estimates the average time until the first failure, aiding in early design refinement. MTBF, tailored for repairable systems, considers the average time between consecutive failures, guiding preventive maintenance planning and minimizing downtime. Together, these metrics form the bedrock for resilient and long-lasting products.
MTTF and MTBF drive continuous improvement by facilitating root cause analysis, enabling engineers to address systemic issues. They also play a crucial role in predictive maintenance, combining historical data with real-time monitoring to predict and prevent failures and optimize asset performance. In essence, these metrics elevate product reliability and contribute to customer satisfaction, industry compliance, and overall operational efficiency.
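As a rough sketch of how these two metrics are commonly computed, the Python below averages unit lifetimes for MTTF and divides operating time by the number of failures for MTBF. All figures and names here are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mttf(lifetimes_hours):
    """Mean time to failure: average lifetime of non-repairable units."""
    return sum(lifetimes_hours) / len(lifetimes_hours)


def mtbf(total_operating_hours, failure_count):
    """Mean time between failures: operating hours per failure for a repairable system."""
    return total_operating_hours / failure_count


# Hypothetical figures, for illustration only.
disk_lifetimes = [12_000, 15_500, 9_800]  # hours each disk ran before failing
service_operating_hours = 2_000           # hours the service was running
service_failures = 4                      # failures observed in that period

print(f"MTTF = {mttf(disk_lifetimes):.1f} hours")
print(f"MTBF = {mtbf(service_operating_hours, service_failures):.1f} hours")
```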
How to improve MTTD and reduce MTTR with correlated log data
Detecting issues is not simply a matter of knowing that something is wrong, but knowing what specifically is wrong and how it relates to other system components. For example, a sole log message that indicates an internal server error is less helpful than contextual information, such as on which host the issue occurred or in which application.
Directly linking relevant logs to other telemetry data helps to focus your investigation. Capabilities in a good observability solution, such as New Relic logs in context, can match logs with other telemetry data from your technology stack, giving you more in-depth visibility to detect and resolve issues faster. The result of this contextualization is correlated log data.
Let’s evaluate the different types of telemetry data separately. Metrics alone give you performance information. Events tell you what happened and when. Traces tie together discrete pieces of data to illuminate the flow of information. Logs can fill in the gaps and allow you to weave together the bigger picture.
With the logs in context capability, New Relic agents automatically instrument your logging framework by injecting important entity information into your log data before forwarding it to the New Relic database. This allows your logs to be connected with telemetry data flowing from your other monitored sources (applications, infrastructure, Kubernetes, and Lambda functions). That in turn allows for the contextualization of the details behind a particular transaction, issue, or outage, giving you the clues you need to resolve the problem. Along with these strategies, conducting a thorough incident postmortem analysis is crucial in understanding and learning from each incident, ensuring continuous improvement in your incident response strategy.
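To make the idea of injecting entity information concrete, here is a minimal sketch using Python's standard `logging` module. It is not the New Relic agent's implementation: the `EntityContextFilter` and `JsonFormatter` classes, the attribute names, and the service, host, and trace values are all assumptions chosen for illustration.

```python
import io
import json
import logging


class EntityContextFilter(logging.Filter):
    """Attach entity and trace metadata to every log record (hypothetical names)."""

    def __init__(self, entity_name, hostname, trace_id):
        super().__init__()
        self.context = {
            "entity.name": entity_name,
            "hostname": hostname,
            "trace.id": trace_id,
        }

    def filter(self, record):
        record.context = self.context
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


# Write to an in-memory stream so the sketch has no side effects.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout-service")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
logger.addFilter(EntityContextFilter("checkout-service", "host-01", "trace-abc123"))

logger.error("Internal server error")
enriched = json.loads(stream.getvalue())
print(enriched)
```

Because every record now carries the same entity and trace attributes as the other telemetry, a backend can join logs to traces and hosts on those fields instead of relying on message text.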
Let’s look at an example of using New Relic logs in context. Consider the case of the corrupted coupon code. Customers of the WebPortal ecommerce application are complaining that they can't complete purchases when they try to use a promo code. The SREs have confirmed an excessive amount of “Coupon not found” errors in the past 6 hours. The developers say that the application code is working as expected.
Are the users entering bad promo codes? Is there a problem with the application code? What's the underlying problem? Time to investigate.
You’ll start your investigation by looking at the transaction traces that include a Coupon not found error message in the past 6 hours.
The trace details provide more information about the transaction. Notice how the WebPortal browser application calls the WebPortal backend service, which in turn calls the Promo Service. All three are reporting errors.
This is the proverbial crime scene, but what was the chain of events that led to this state?
When you dig in further by looking at associated logs, you find more detailed information: the system is experiencing an internal server error, and it's happening on host ip-172-31-19-177.
As a next step, you could direct DevOps personnel to look into performance issues or additional telemetry data on host ip-172-31-19-177.
Without the connections you derived from the log data, you might still be investigating possible issues with the promo codes—which aren't the cause of the problem at all.
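The correlation step in this investigation can be sketched as a simple join on a shared trace identifier. The records below are hypothetical stand-ins (only the host name and the error text come from the example above), and the `trace.id` field name and helper functions are assumptions for illustration.

```python
from collections import Counter

# Hypothetical trace IDs of transactions that reported "Coupon not found".
failing_trace_ids = {"t-1", "t-2", "t-3"}

# Hypothetical log records that carry the same trace ID as the traces.
logs = [
    {"trace.id": "t-1", "hostname": "ip-172-31-19-177", "message": "Internal server error"},
    {"trace.id": "t-2", "hostname": "ip-172-31-19-177", "message": "Internal server error"},
    {"trace.id": "t-3", "hostname": "ip-172-31-19-177", "message": "Internal server error"},
    {"trace.id": "t-9", "hostname": "host-b", "message": "Request completed"},
]


def correlated_logs(records, trace_ids):
    """Keep only the log records that belong to one of the given traces."""
    return [record for record in records if record["trace.id"] in trace_ids]


def hosts_by_frequency(records):
    """Count how often each host appears, most frequent first."""
    return Counter(record["hostname"] for record in records).most_common()


related = correlated_logs(logs, failing_trace_ids)
print(hosts_by_frequency(related))
```

When one host dominates the logs attached to the failing traces, that host is a reasonable place to look next.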
It's important to reiterate here that resolution includes not only the time required to fix the underlying problem, but also the time to clean up the downstream effects and to remediate the root cause to prevent the problem from recurring. Having contextual information will allow you to better understand the relationships in the impacted components and more quickly identify the steps needed to fully resolve the problem.
With all of your telemetry data in one place, including your logs, you can begin putting together a holistic view of your environment. Having a complete and integrated picture is essential for detecting and resolving problems. Correlated log data and capabilities such as logs in context will save time and effort, reducing the overall mean time to detect and resolve issues.
How to collect log data and configure logs in context
To use logs in context and correlate the logs from all your telemetry sources, you first need to collect all of your log data in New Relic. There are several ways to do this.
The good news is that the New Relic APM language agents include logs in context. Each of these agents:
- Adds metadata to your logs that will appear as attributes when you are viewing log details in the New Relic user interface.
- Automatically forwards your logs to New Relic, eliminating the need for a third-party tool.
- Aggregates metrics about your logs and reports that information in the APM summary screen.
Alternatively, you could use a third-party log forwarding option, such as Fluent Bit, Fluentd, Kubernetes, AWS CloudWatch, Azure, or Google Cloud Platform. Additionally, New Relic provides a Log API and TCP endpoint that you can use to send your logs into the platform. You could also use the OpenTelemetry Collector to report your logs to New Relic.
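As a rough sketch of what sending a log over HTTP might look like, the Python below builds a request without sending it. The endpoint URL, the `Api-Key` header name, and the payload shape reflect my understanding of the Log API and should be verified against the New Relic documentation; the `build_log_request` helper, the placeholder key, and the attribute values are hypothetical.

```python
import json
import time
import urllib.request

# Assumed US endpoint; confirm the URL and header name in the Log API docs.
LOG_API_URL = "https://log-api.newrelic.com/log/v1"


def build_log_request(message, license_key, attributes=None):
    """Build (but do not send) an HTTP POST carrying one log line."""
    payload = [
        {
            "common": {"attributes": attributes or {}},
            "logs": [{"timestamp": int(time.time() * 1000), "message": message}],
        }
    ]
    return urllib.request.Request(
        LOG_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Api-Key": license_key},
        method="POST",
    )


request = build_log_request(
    "Coupon not found",
    "YOUR_LICENSE_KEY",  # placeholder, not a real key
    {"service": "promo-service"},
)

# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(request.get_method(), request.full_url)
```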
If your logging framework is not supported by this New Relic logs in context solution, you can manually configure your logging libraries using the New Relic APM agent APIs.
Finally, if you’re an existing New Relic customer who manually configured log forwarding prior to the introduction of our automatic log forwarding capabilities, you run the risk of sending duplicate log data to New Relic. You should review our Upgrade to automatic logs in context documentation to determine the best path forward for your environment.
Do more with New Relic log management
Logs in context is one of many New Relic capabilities, albeit a powerful one. There are several other capabilities in the New Relic platform to explore to help you manage your log data. Here are a few to get you started:
- Parsing. Get the most out of your log data by parsing it to extract important attributes.
- Log obfuscation. Prevent certain types of information from being saved in New Relic by masking or hashing sensitive data.
- Drop filter rules. Lower costs, protect privacy and security, or reduce noise by discarding data from the ingestion pipeline before it is written to the New Relic database.
- Log patterns. Use machine learning to automatically group log messages that are consistent in format but variable in content.
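The first three capabilities in this list can be illustrated generically in Python. This sketch does not use New Relic's parsing rules, obfuscation expressions, or drop filter syntax; the log line format, the regular expression, and the `parse`, `obfuscate`, and `should_drop` helpers are all assumptions made up for the example.

```python
import hashlib
import re

# Hypothetical line format: LEVEL host=<host> user=<user> <message>
LINE_PATTERN = re.compile(
    r"(?P<level>[A-Z]+) host=(?P<host>\S+) user=(?P<user>\S+) (?P<message>.*)"
)


def parse(line):
    """Extract named attributes from a raw log line, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None


def obfuscate(record, field="user"):
    """Return a copy of the record with one sensitive field replaced by its SHA-256 hash."""
    masked = dict(record)
    masked[field] = hashlib.sha256(record[field].encode("utf-8")).hexdigest()
    return masked


def should_drop(record):
    """Discard noisy records before they are stored."""
    return record["level"] == "DEBUG"


lines = [
    "ERROR host=ip-172-31-19-177 user=alice@example.com Internal server error",
    "DEBUG host=ip-172-31-19-177 user=bob@example.com cache warm",
]

parsed = [record for record in map(parse, lines) if record is not None]
processed = [obfuscate(record) for record in parsed if not should_drop(record)]
print(processed)
```

The ordering is a design choice: dropping before hashing avoids spending work on records that will be discarded anyway.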
Next steps
- Read more about logs in context in the New Relic documentation.
- Learn more ways to reduce MTTR in Reducing MTTR the Right Way.
- Got lots of logs? Check out our tutorial on how to optimize and manage them.
- Don't have New Relic? Become a hands-on observability practitioner by creating a free account. Your free account includes 100 GB/month of free data ingestion, one free full-platform user, and unlimited free basic users.
The views expressed on this blog are those of the author and do not necessarily reflect the official views of New Relic K.K. This blog may also contain links to external sites. New Relic makes no guarantees about the content of those linked sites.