SRE Course: part 01: abstract

I just finished the SRE course and want to capture my thoughts, both directly and indirectly related to it.

This is a work-in-progress.

100% Reliability

In a perfect world everything is 100% reliable.

Too bad we don’t have one. Expecting perfect reliability is a mistake:

100% reliability is the wrong target.

How (un)reliable? That’s the question.

We want to be satisfied with our experiences (aka happy). When it comes to systems, we want both reliability and improvements.

These two are in conflict, since:

Thus

Rather than choosing between reliability and improvements, SRE advocates for embracing a certain level of un-reliability to enable improvements while maintaining the target reliability.

The goal is to find an acceptable balance between being open to change and providing stability, keeping in mind that the balance may change over the system's lifetime.

The balance depends on the objectives: reliability expectations for an early startup and a mature enterprise are likely very different.

Reliability math

Let’s first get familiar with the simple math involved.

It’s convenient to have reliability expressed as a relative value or percentage.

reliability | name      | un-reliability | calculation
99.99%      | 4 nines   | 0.01%          | 100% - 99.99%
99.9%       | 3 nines   | 0.1%           | 100% - 99.9%
99.5%       | 2.5 nines | 0.5%           | 100% - 99.5%

and so on…

It’s worth mentioning that there are 2 ways reliability is calculated:

  1. time based
  2. event based

Time based calculation

Given a duration of 1 year (525,960 min = 365.25 * 24 * 60):

reliability | max un-reliable min / year | calculation
99.99%      | 52.6                       | 525,960 * 0.01 / 100
99.9%       | 526.0                      | 525,960 * 0.1 / 100
99.5%       | 2629.8                     | 525,960 * 0.5 / 100

Note: each additional 9 of reliability cuts the allowed un-reliability by 10x.
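The time based arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `max_unreliable_minutes` and the constant `MINUTES_PER_YEAR` are made up for this post, and the year length follows the 365.25-day assumption used above.

```python
# Minutes in an average year, using the same 365.25-day assumption as the text.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960


def max_unreliable_minutes(reliability_pct: float,
                           period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return how many minutes of the period may be un-reliable for a given target."""
    if not 0.0 <= reliability_pct <= 100.0:
        raise ValueError("reliability_pct must be between 0 and 100")
    unreliability_pct = 100.0 - reliability_pct
    return period_minutes * unreliability_pct / 100.0


# Print the allowance for the targets from the table, rounded to one decimal.
for target in (99.99, 99.9, 99.5):
    print(f"{target}% -> {max_unreliable_minutes(target):.1f} min / year")
```

Passing a different `period_minutes` (for example `28 * 24 * 60`) gives the allowance for a shorter window.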

Event based calculation

Given 1,000,000 events:

reliability | max un-reliable events | calculation
99.99%      | 100                    | 1,000,000 * 0.01 / 100
99.9%       | 1,000                  | 1,000,000 * 0.1 / 100
99.5%       | 5,000                  | 1,000,000 * 0.5 / 100
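The event based variant looks much the same in code. A minimal sketch, assuming a hypothetical helper named `max_unreliable_events`; rounding to a whole number of events is my choice here, mainly to absorb floating-point noise in `100 - 99.99`.

```python
def max_unreliable_events(reliability_pct: float, total_events: int) -> int:
    """Return how many of total_events may fail while still meeting the target."""
    if not 0.0 <= reliability_pct <= 100.0:
        raise ValueError("reliability_pct must be between 0 and 100")
    if total_events < 0:
        raise ValueError("total_events must not be negative")
    unreliability_pct = 100.0 - reliability_pct
    # round() hides tiny float errors, e.g. 999.9999999999 instead of 1000.
    return round(total_events * unreliability_pct / 100.0)


for target in (99.99, 99.9, 99.5):
    print(f"{target}% -> {max_unreliable_events(target, 1_000_000)} events")
```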

Un-reliability aka Error Budget

Let’s assume that we have a desired reliability target in mind (it’ll be discussed in upcoming sections). As the examples above show, the reliability target also sets the target level of un-reliability, known as the Error Budget (EB).

The Error Budget is the “room” for making mistakes.

Similar to financial budget it’s:

Let’s say our reliability target is 99.9%, then EB = 100% - 99.9% = 0.1%

                                        reliability target
---------------|-…---------|-…------------|-…----^-…--------------|
reliability    0          50%           99.5%  99.9%             100%
---------------|-…---------|-…------------|-…----|-…--------------|
 period        |                                 |                |
---------------|---------------------------------|----------------|
 28d           |########################################**********|
 7d            |#####################################*************|
 1d            |###########################***********************|
               |                                 |> Error Budget <|

The ASCII bars above indicate the current reliability level and EB spending per period:

28d:
               **********    10
------------------------- = ---- ~ 58.8% spent: budget surplus
        *****************    17

Or going over the budget:

1d:
 ***********************    24
 ----------------------- = ---- ~ 141% spent: budget debt
       *****************    17
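The two ratios above are the same calculation: measured un-reliability divided by the Error Budget. Here is a rough sketch of it; `budget_spent_pct` and `budget_status` are names I made up, and the "exhausted" label for exactly 100% is my own addition. The inputs can be in any unit (bar widths, minutes, events) as long as both use the same one.

```python
def budget_spent_pct(measured_unreliability: float, error_budget: float) -> float:
    """Return the share of the Error Budget that has been spent, in percent."""
    if error_budget <= 0:
        raise ValueError("error_budget must be positive")
    return 100.0 * measured_unreliability / error_budget


def budget_status(spent_pct: float) -> str:
    """Label the spending: below 100% is a surplus, above 100% is a debt."""
    if spent_pct < 100.0:
        return "budget surplus"
    if spent_pct > 100.0:
        return "budget debt"
    return "budget exhausted"


# The two examples from the bars: 10 of 17 and 24 of 17 budget units.
for used in (10, 24):
    spent = budget_spent_pct(used, 17)
    print(f"{used}/17 ~ {spent:.1f}% spent: {budget_status(spent)}")
```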

Satisfactory Objectives

Users set the objectives

Users seek satisfaction, and they also set the level that counts as satisfactory.

For a business that means finding the trade-off between:

Along with the Error Budget described previously, SRE introduces several concepts:

For online businesses it’s common to have high reliability targets, effectively SLOs: 99%, 99.5%, 99.99%, etc.

SLI vs SLO

An SLI is the indicator (measure) of actual performance. It’s normally measured over a period of time, since time is an important factor.

I like to see it as a tension between want and have:

            | HAVE: SLI | WANT: SLO
Reliability | 98.4%     | >= 99%
EB          | 1.6%      | <= 1%

This means that over a period X, the service reliability, as measured by the SLI, was below the target specified by the SLO: 98.4% < 99%. As a result, the Error Budget was over-spent by 60% (1.6% used against a 1% budget).
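To make the want vs have comparison concrete, here is a small sketch. The `SloReport` class and its field names are hypothetical, invented for this post rather than taken from any SRE tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli_pct: float  # HAVE: measured reliability over the period
    slo_pct: float  # WANT: target reliability

    def __post_init__(self) -> None:
        # A 100% target leaves no Error Budget to compare against.
        if not 0.0 <= self.slo_pct < 100.0:
            raise ValueError("slo_pct must be in [0, 100)")

    @property
    def error_budget_pct(self) -> float:
        """Allowed un-reliability: 100% minus the SLO."""
        return 100.0 - self.slo_pct

    @property
    def budget_used_pct(self) -> float:
        """Actual un-reliability: 100% minus the SLI."""
        return 100.0 - self.sli_pct

    @property
    def slo_met(self) -> bool:
        return self.sli_pct >= self.slo_pct

    @property
    def overspend_pct(self) -> float:
        """How far past the budget we are, in percent (negative means surplus)."""
        return 100.0 * (self.budget_used_pct / self.error_budget_pct - 1.0)


report = SloReport(sli_pct=98.4, slo_pct=99.0)
print(f"SLO met: {report.slo_met}, over-spent by {report.overspend_pct:.0f}%")
```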

Thank you

That’s it for part 01. Make sure to check out the next parts in the, hopefully, near future.
