Service levels describe services provided to users within a given period of time, in measurable terms. Service level objectives (SLOs) are the goals set for the availability expected out of a system. Service level indicators (SLIs) are the key measurements and metrics to determine the availability of a system. Service level agreements (SLAs) are the legal contracts that explain what is agreed upon and what happens if systems don’t meet SLOs.
For example, an SLO for a web application might be that videos must start playing in less than 2 seconds, 99% of the time during a one-week period. The SLI measures the proportion of videos on the website that start playing in less than 2 seconds. The SLA includes this SLO and any other SLOs agreed upon by the customer and the service provider, the scope of services that will be covered, and the SLIs, which are the metrics that will be used to measure performance.
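To make the video example concrete, here is a minimal Python sketch of how the SLI could be computed and compared with the SLO. The function names, the `SloResult` type, and the sample data are hypothetical and exist only for illustration; a real system would read playback start times from its monitoring data.

```python
from dataclasses import dataclass


@dataclass
class SloResult:
    sli: float     # observed proportion of good events, 0.0 to 1.0
    target: float  # objective, 0.0 to 1.0
    met: bool      # True when the SLI is at or above the target


def video_start_sli(start_times_s, threshold_s=2.0):
    """Proportion of playbacks that started in under threshold_s seconds."""
    if not start_times_s:
        raise ValueError("no playback events in the window")
    good = sum(1 for t in start_times_s if t < threshold_s)
    return good / len(start_times_s)


def evaluate_slo(start_times_s, target=0.99, threshold_s=2.0):
    """Compare the measured SLI with the SLO target for one window."""
    sli = video_start_sli(start_times_s, threshold_s)
    return SloResult(sli=sli, target=target, met=sli >= target)


# Hypothetical one-week sample: 1,000 playbacks, 15 of them slow.
week = [0.8] * 985 + [3.5] * 15
result = evaluate_slo(week)
print(f"SLI = {result.sli:.1%}, SLO met: {result.met}")
```

With this sample, 98.5% of videos start in time, so the 99% objective is missed for the week.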
Site reliability engineering (SRE) has popularized best practices for maintaining the uptime and reliability of distributed systems, with a focus on how to measure the performance and reliability of services. Google published Site Reliability Engineering: How Google Runs Production Systems in March 2016, describing a framework for modeling, selecting, and analyzing metrics, starting with service level objectives.
So how do SLOs, SLIs, and SLAs relate to each other and to ways to manage service levels that your users expect? Let’s look at each in more detail.
What are SLOs?
SLOs are the goals you set for how much availability you expect out of your system, expressed as a percentage over a period of time.
Service level objectives help teams collaborate on a shared meaning of “availability” and “uptime.” You use SLOs as a standard to measure your reliability and availability. In the earlier example, the SLO states that videos in the web application must start playing in less than 2 seconds, 99% of the time over a one-week period.
Examples of SLOs
As mentioned previously, SLOs serve as a bridge between technical metrics and the broader service level agreements (SLAs) agreed upon with customers. Let’s take a look at some more examples.
Uptime/Availability SLOs
- 99.9% uptime over a 30-day window.
- Less than 0.1% of requests fail due to system errors in any given week.
Latency SLOs
- 95% of web page loads complete within 2 seconds.
- 99% of API requests return within 300 milliseconds.
Error rate SLOs
- Fewer than 0.05% of all transactions result in an error.
- Less than 1% of database writes fail.
Throughput SLOs
- The system can handle 10,000 requests per second during peak times.
- The system ingests 5 TB of data per day without degradation.
Capacity and usage SLOs
- Disk usage on critical systems remains below 80% at all times.
- RAM usage on any service instance stays at or below 70% of the total.
Data integrity and consistency SLOs
- Data replication across clusters completes within 5 minutes.
- Less than 0.01% data inconsistency between primary and secondary storage systems.
Durability SLOs
- 99.9999999% (nine 9s) durability of data over a year.
- Successful backup restoration 99.5% of the time.
Change management and deployment SLOs
- 98% of deployments occur without rollback.
- 99% of changes result in no unplanned outages.
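An availability target is easier to reason about once it is translated into the downtime it permits. The sketch below does that arithmetic for the 99.9% over 30 days example; `allowed_downtime` is a hypothetical helper name, not part of any library.

```python
from datetime import timedelta


def allowed_downtime(target, window):
    """Downtime a window can absorb while still meeting an availability target.

    target is a fraction such as 0.999; window is a timedelta.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return window * (1.0 - target)


budget = allowed_downtime(0.999, timedelta(days=30))
print(round(budget.total_seconds() / 60, 1))  # minutes of downtime allowed
```

For 99.9% over 30 days this works out to about 43.2 minutes, which shows how little room each additional 9 leaves.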
How to set SLOs
Setting the right SLOs is a strategic process that, when done correctly, improves service reliability and the customer experience. The process starts with understanding your users’ expectations and needs. Engage with all stakeholders, including customers and internal teams, to learn what’s critical to your application’s performance and reliability. Next, analyze the historical performance of your system to understand its current behavior and identify any recurring issues or areas of concern. This information lets you choose specific, measurable indicators that truly represent the service’s health, such as latency, error rate, or uptime. Once these indicators are in place, define your target objectives. They should be challenging but achievable, and they should align with your broader business goals.
Remember, SLOs should be reviewed and potentially adjusted periodically to reflect changes in user expectations, system behavior, or business priorities. Additionally, it’s essential to strike a balance: while high reliability is crucial, over-stringent SLOs can impede agility and innovation. Collaborative tools and observability platforms like New Relic can aid in continuously monitoring and adjusting SLOs as your system and business evolve.
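One way to ground a target in historical performance is to look at a low quantile of past daily SLI values and pick the strictest round-number target that history would have satisfied. The sketch below is only an illustration of that idea under stated assumptions: the 5% quantile, the candidate list, and the `suggest_target` name are all hypothetical choices, not a method prescribed here or by any particular platform.

```python
import math


def suggest_target(daily_sli, quantile=0.05,
                   candidates=(0.90, 0.95, 0.99, 0.995, 0.999)):
    """Suggest the strictest candidate target that history roughly supports.

    daily_sli: per-day SLI values as fractions between 0 and 1.
    quantile: low quantile of history used as the baseline (an assumption).
    Returns None when no candidate is at or below the baseline.
    """
    if not daily_sli:
        raise ValueError("need at least one day of history")
    ordered = sorted(daily_sli)
    # Nearest-rank quantile: pick the value at position ceil(q * n).
    rank = max(1, math.ceil(quantile * len(ordered)))
    baseline = ordered[rank - 1]
    achievable = [c for c in candidates if c <= baseline]
    return max(achievable) if achievable else None


# Hypothetical ten days of availability measurements.
history = [0.997, 0.999, 0.9985, 0.996, 0.9992,
           0.998, 0.9971, 0.9995, 0.9968, 0.9983]
print(suggest_target(history))
```

A suggestion like this is a starting point for the stakeholder conversation, since user expectations may call for a stricter or looser objective than history alone implies.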
What happens if SLOs are consistently not met?
If SLOs are consistently not met, it may indicate underlying issues in the service. Teams should conduct root cause analysis to identify problems and work on improvements. For SLAs, missing SLOs might result in penalties or other consequences defined in the agreement.
How can you balance between setting aggressive SLOs and realistic ones?
Striking a balance involves understanding user expectations and the technical capabilities of your system. It's crucial to involve stakeholders from both the business and technical sides to set SLOs that are challenging yet feasible.
What are SLIs?
SLIs are the quantitative measurements of how users experience the availability of a system. They represent a proportion of successful outputs for a level of service, expressed as a percentage.
Service level indicators are defined in relation to SLOs, and they provide real-time signals about system reliability. SLIs can measure the proportion of requests that were faster than a threshold or the proportion of records coming into a pipeline that result in the correct value coming out. In the earlier example, the SLI measures the proportion of videos on the website that start playing in less than 2 seconds. Comparing that measurement with the SLO target tells you how far you are from the objective.
Examples of SLIs
SLIs serve as the foundation upon which SLOs and SLAs are based. Let’s look at some examples.
Availability/Uptime
- Percentage of successful requests vs. total requests.
- Ratio of system uptime to the total time period.
Latency
- Time taken for an API request to return a response.
- Time taken for a webpage to load for the end user.
Throughput
- Number of requests handled per second.
- Volume of data processed within a specific time frame.
Error rate
- Percentage of failed requests vs. total requests.
- Number of 4xx or 5xx HTTP status codes returned.
Saturation
- Percentage of resource utilization, such as CPU or RAM.
- Amount of used storage relative to the total available storage.
Coverage
- Percentage of users who receive a new feature update within a given time frame.
- Ratio of cached responses vs. total responses delivered.
Freshness
- Age of the data being read relative to when it was written.
- Time taken for data replication across multiple databases or systems.
Capacity
- Maximum number of users or sessions the system can handle simultaneously.
- Maximum data volume the system can handle without degradation.
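Several of the indicators above can be derived from the same stream of request records. The following sketch computes availability, error rate, a latency SLI, and a latency percentile from an in-memory list. The `Request` type and the sample data are hypothetical, and counting only 5xx responses as failures is an assumption; some services also count certain 4xx responses.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to return a response


def availability(requests):
    """Share of requests that did not fail with a server error (5xx)."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)


def error_rate(requests):
    """Share of requests that failed with a server error (5xx)."""
    failed = sum(1 for r in requests if r.status >= 500)
    return failed / len(requests)


def latency_sli(requests, threshold_ms):
    """Share of requests answered within threshold_ms."""
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


def percentile_latency(requests, pct):
    """Nearest-rank latency percentile, for example pct=95 for p95."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 90 fast successes, 7 slow successes, 3 server errors.
sample = ([Request(200, 120.0)] * 90
          + [Request(200, 450.0)] * 7
          + [Request(503, 900.0)] * 3)

print(availability(sample))            # share of non-5xx requests
print(error_rate(sample))              # share of 5xx requests
print(latency_sli(sample, 300))        # share answered within 300 ms
print(percentile_latency(sample, 95))  # p95 latency in ms
```

Ratio-style SLIs such as `availability` and `latency_sli` map directly onto percentage-based SLOs, which is why they are usually easier to set objectives for than raw counts.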
How do you choose appropriate SLIs for a service?
SLIs should be chosen based on what matters most to users/customers. Common SLIs include latency, error rates, throughput, and availability. It's essential to understand user expectations and business priorities.
How do you measure SLIs accurately?
Accurate measurement often requires implementing monitoring and logging systems. Use tools that capture relevant data points and provide insights into SLIs. Regularly validate and calibrate measurement systems to ensure accuracy.
What are SLAs?
SLAs define the level of service your customers expect when they use your service.
These service level agreements are contracts between service providers and their customers that document what services the provider will furnish and define the service standards the provider is obligated to meet. SLAs also describe the remedies or penalties that apply when the SLO commitments are broken.
For the earlier example, the SLA will include all the SLOs for the web application, as well as the scope of services that will be covered, and all the SLIs, which are the metrics that will be used to measure performance against the SLOs. The agreement also includes the responsibilities of both the service provider and the customer.
Who uses service levels, SLOs, SLIs, and SLAs?
SRE teams, reliability engineers, and cross-functional teams often struggle to define and measure service “reliability.” Cross-functional teams need to create an aggregated, comprehensive view of important metrics for all aspects of a service or system so they can easily measure uptime and performance.
Service levels come into play to help SRE teams and reliability engineers identify critical components of their applications and infrastructure. In particular, they need to know when one or more components expose functionality to external customers. We call these intersection points system boundaries. System boundaries are where site reliability engineers need to apply service level indicators and objectives to their metrics in order to tell the real story of system performance and reliability.
It takes a lot of effort and thought to establish service boundaries and determine which metrics need to be SLIs and what the SLO compliance requirements should be. This complexity often results in teams abandoning the effort altogether. Reliability engineers and SRE teams need accurate, customized SLIs and SLOs based on historical system performance so they can quickly set a baseline for availability and uptime across their entire stack, for all of their teams.
While SRE teams and reliability engineers aren’t always responsible for managing service levels, it often falls within their purview. By tracking SLIs and tying them to SLOs, you can set goals around the performance of a system. Google’s SRE book defines the four golden signals of monitoring as latency, traffic, errors, and saturation. So, for example, you could look at an API call and track the proportion of its requests that succeed (the SLI) against the percentage of requests (the SLO, for example 95%) that need to be successful for customers to have a good experience.
SRE teams often set strict SLOs on critical components within their applications and services to better understand how strict an SLA they can agree to with customers. From here, the team can apply error budgets as a way to understand how quickly they must resolve issues in order to stay compliant with their SLOs. Service levels allow teams to aggregate metrics and create a transparent view of uptime, performance, and reliability across the entire organization. At a glance, business leaders can use service levels to monitor compliance across multiple teams, applications, and services to gain a comprehensive understanding of their system’s health.
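An error budget is the share of events an SLO allows to fail, and comparing failures so far against it shows how much room is left. The sketch below illustrates the arithmetic for a request-based SLO, along with a burn rate, which is a common way to express how fast the budget is being spent. The function names and the sample numbers are hypothetical.

```python
def error_budget(total_events, bad_events, target):
    """Return (allowed_bad, remaining) for an event-based SLO.

    allowed_bad: how many bad events the window can absorb at this target.
    remaining: share of that budget still unspent (negative once exceeded).
    """
    allowed_bad = total_events * (1.0 - target)
    if allowed_bad <= 0:
        raise ValueError("no error budget: check target and event count")
    return allowed_bad, 1.0 - bad_events / allowed_bad


def burn_rate(total_events, bad_events, target):
    """Observed failure ratio divided by the failure ratio the SLO allows.

    1.0 means failures arrive exactly as fast as the SLO permits;
    above 1.0 means the budget is being spent faster than that.
    """
    return (bad_events / total_events) / (1.0 - target)


# Hypothetical window so far: 2,000,000 API requests, 40,000 failures, 95% SLO.
allowed, remaining = error_budget(2_000_000, 40_000, 0.95)
rate = burn_rate(2_000_000, 40_000, 0.95)
print(f"allowed failures: {allowed:,.0f}")
print(f"budget remaining: {remaining:.0%}")
print(f"burn rate: {rate:.2f}")
```

In this sample the team has spent 40% of its budget, so there is still room for planned changes before the SLO is at risk.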
What is service level management?
Service level management means making sure that the processes and operational agreements governing the level of service you provide to customers are appropriate. It includes monitoring and reporting on service levels, setting and adjusting SLOs, determining SLIs, making sure you are meeting SLAs, and holding customer reviews.
The central focus is a shared meaning of “availability” across teams, expressed in your SLOs and captured in the SLAs with your customers. To make sure your business is meeting or exceeding these service level agreements, it’s important for cross-functional teams to manage internal SLOs.
Benefits of service level management
Implementing SLO best practices across teams isn’t easy. You need the right data to define a shared language across teams.
Reliability engineers need to quickly set a baseline for availability and uptime across their full stack and team. You need SLOs and SLIs to determine service boundaries and a unified, transparent view of service reliability to better comply with customer-facing SLAs. You need to be able to report on reliability and SLO compliance metrics and error budgets so you can make improvements across your environment.
When you have good practices for SLIs, SLOs, and SLAs, and a platform for your service level management, you’ll see these benefits:
- Easy setup: Automatically establish a baseline of performance and reliability for any service with a one-click setup and recommendations and customizations provided in a simple, guided flow.
- Define reliability across teams: Avoid arduous alignment processes with SLO and SLI recommendations that help you determine service boundaries. Set reliability benchmarks automatically based on recent performance metrics in any entity.
- Iterate and improve: With full-stack context and automation through open-source infrastructure-as-code tools like Terraform, teams have insight into how specific nodes or services impact system reliability and can quickly take control over their performance. Custom views for both service owners and business leaders drive operational efficiency and lead to better reporting, alerting, and incident management processes.
- Standardize reliability: Cross-organizational teams have a unified, transparent view of service reliability, and can better comply with customer-facing SLAs, avoiding SLA breaches. SLO compliance metrics and error budgets give organizations a way to report on reliability and implement changes across applications, infrastructure, and teams in a cohesive fashion.
For more tips, read our blog posts, Best practices for setting SLOs and SLIs for modern, complex systems and Introducing service level management.
Next steps
The best way to learn more about service level management and observability is to get hands-on experience with an observability solution. Sign up for New Relic. Your free account includes 100 GB/month of free data ingest, one free full-access user, and unlimited free basic users. Then explore the service level management documentation. And learn how New Relic can recommend SLIs and SLOs based on historical system performance.