Some would argue that defense needs to be proactive to face an ever-evolving and never entirely predictable threat environment. To that end, they bring security into SRE (Site Reliability Engineering) and rely on its core principles.
The modern software industry is increasingly distributed, rapidly iterative, and predominantly stateless, and development and operations both have to keep up with that pace. To do so, developers and operations teams each came up with a way to increase the reliability and delivery speed of software: SRE in operations, and DevOps in development.
We can also assess reliability in security, a field that remains predominantly preventative, narrowly focused, and dependent on the state of the system at a given point in time.
We’re focusing on SRE, what motivated its creation at Google in 2003, and whether or not it applies to cybersecurity. The goal is to gain a different perspective on how we conceive of security.
Here’s what we’re going to cover:
- Can you tell me what SRE is?
- Can I have some security in SRE?
- A new approach to cybersecurity.
Can you tell me what SRE is?
Traditionally, development and operations are two distinct fields. Two distinct fields usually mean fragmented tools, metrics, and goals, and hence a lack of communication between the two.
The idea behind SRE was to merge development and operations around the same goal: reliability.
The term reliability has to be understood from the customer’s perspective. What makes software reliable in their eyes? According to SRE’s founder, Benjamin Treynor Sloss, multiple variables sit behind the concept of reliability: availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
It’s easy to apply these variables to your customers’ perspective. For instance, when someone wants to watch a video on their favorite streaming platform, they care a lot about being able to watch it whenever they want: availability. They also want a smooth experience, with no endless buffering or error codes: latency and performance. And in case of failure, they want a fast response from the provider and a quick return to normal: emergency response.
We could go on about what a customer expects when using software or a service, but you get the point. In their eyes, all of these characteristics add up to reliability.
As we said, this was typically handled by operations teams, relying on manual labor to resolve issues as they arose.
SRE takes a different approach to the reliability of your software. Whereas operations traditionally rely on manual tasks to overcome problems arising in production, SRE brings automation to handle them at scale and deliver more reliable software.
In SRE, engineers take a software engineering approach to these issues. Through code, they’re able to address problems in a much more scalable and sustainable way than the traditional process allows.
To that end, the SRE’s mantra is: “Do it once manually, automate it the second time.”
Some would say that SRE was DevOps before the term was coined. That’s partly true, in the sense that both mean bringing one team’s mindset and tools into the other. Still, SRE is more about reducing failure rates as the software evolves than about streamlining change in the development pipeline. To distinguish the two, let’s say that:
- DevOps automates development speed;
- SRE automates reliability at production scale.
To improve reliability, one of the core principles of SRE, besides automation, is chaos engineering, where failures aren’t feared but embraced as valuable lessons.
SRE and chaos engineering: continuous experimentation
Zero risk doesn’t exist in SRE. This initial postulate changes the way SRE engineers approach reliability. Instead of hoping that the system will run smoothly forever, they expect it to break at some point.
As you may have inferred, chaos engineering derives from chaos theory, a cross-disciplinary scientific field studying how dynamical systems evolve from, and are sensitive to, their initial conditions.
Do you picture Ian Malcolm doing his thing? That’s right. He nailed chaos theory! Now let’s transpose this to the software field:
Software evolves, even through minimal changes from its initial condition. Each time, these updates and changes pile onto the infrastructure. Even if the software itself is deterministic, behaving exactly as its creators intended, each update is integrated into a complex system: a distributed computing system linked over networks and shared resources, potentially impacting its foundations. The wider this distributed system is, the more unexpectedly it can behave when changes are introduced.
This can yield widely diverging outcomes in a dynamic system and render long-term prediction impossible, making the system appear random, or chaotic.
Chaos engineering focuses on this random and unpredictable behavior to identify weaknesses through experiments conducted in a controlled environment.
The goal is to stay ahead of the unexpected, or even of the attack. Break your system on purpose instead of waiting for someone or something to do it while you’re asleep.
To that end, you’re going to conduct experiments and tests. Why both? In an experiment, you’re unsure what will happen when you start the process. In a test, however, you know the outcome: it’s done to confirm it, not to discover something. Either way, you’ll compare the results to your initial hypothesis and to your theoretical normal running state.
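To make this concrete, here’s a minimal sketch of such a controlled experiment in Python. Everything in it is hypothetical: `measure_error_rate` and `inject_network_latency` are stand-ins for whatever monitoring and fault-injection tooling you actually use.

```python
import contextlib

# Hypothetical hooks: wire these up to your own monitoring and
# fault-injection tooling (a metrics API, a traffic shaper, etc.).
def measure_error_rate() -> float:
    """Fraction of failed requests over the last minute."""
    return 0.001  # stub value so the sketch runs

@contextlib.contextmanager
def inject_network_latency(ms: int):
    """Pretend to add `ms` of latency between two services."""
    print(f"injecting {ms} ms of latency")
    try:
        yield
    finally:
        print("removing latency, back to normal")

# 1. State a hypothesis about the steady state.
STEADY_STATE_MAX_ERROR_RATE = 0.01  # hypothesis: under 1% errors

# 2. Measure the steady state before touching anything.
baseline = measure_error_rate()
assert baseline < STEADY_STATE_MAX_ERROR_RATE, "system unhealthy; abort"

# 3. Inject a controlled failure and observe the system.
with inject_network_latency(ms=300):
    observed = measure_error_rate()

# 4. Compare the outcome with the hypothesis.
if observed < STEADY_STATE_MAX_ERROR_RATE:
    print(f"hypothesis held: {observed:.2%} errors under fault")
else:
    print(f"weakness found: {observed:.2%} errors, investigate")
```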
Okay, that’s cool, but can I have some security in SRE?
We agree that unreliability can stem from anything in your company, from hardware malfunctions to errors embedded in your code. But there’s also a reliability dimension to security.
A security breach often implies snowballing consequences for end-users themselves. Ransomware, for instance, can force a company to shut down its delivery system, dramatically impacting its customers and, in their eyes, the company’s reliability.
Roughly put, security can be considered part of reliability. Having the most resilient system possible, to reduce incidents as much as possible, is a goal shared by SRE and security teams.
The traditional security approach would be acceptable if one could capture all the risks in an audit. Experience has proved us wrong. Worse, relying on this approach dooms you to face unexpected events barely prepared, since you’re only focused on the risks listed in your audits. In other words, when the tide goes out, you realize you’ve been swimming naked.
You need to prepare yourself for the unexpected incidents arising from this chaos. Applying SRE and chaos engineering to security is the way to do it.
Security and chaos engineering = security experimentation
Above, we said that, without an SRE approach, we often discover failures only after the incident. The same can be said about security: the most common way we learn about our weaknesses is after the security incident. Knowing about a failure only after it has materialized, in an unexpected way, is a little too late. The damage has been done.
This is why some engineers wanted a more proactive approach to security. To them, you have to embrace failure and anticipate it. Yes, security failures are going to happen at some point. And yes, you’d better be prepared for them.
To ensure more robust security, you have to test for known vulnerabilities and experiment to surface unknown ones, understand what happened, and stay at the edge of security as it evolves and changes. You thus create a feedback loop around your controlled experiments. This combination of chaos engineering and security gives us security experimentation.
- Use risk analysis metrics to assess the work to be done
SRE relies on metrics to evaluate the reliability of the service and the work to be done. Key metrics are the Service Level Agreement (SLA), Service Level Indicators (SLIs), and Service Level Objectives (SLOs). SRE teaches us that the closest to perfection you can achieve is 99.999%: 100% security isn’t achievable, given the facts stated above, and trying to reach it would waste resources and money you could employ better elsewhere. Your error budget is built according to this principle: stay within the bounds of these metrics, and you’re set to focus on the next issue. In security, you have to assess your metrics according to the same concept. Use known ones such as the following (the error-budget arithmetic is sketched after the list):
– Key Performance Indicators to assess the effectiveness of your activities: Mean Time To Detect, Mean Time To Repair, Alert Time To Triage, and so on;
– Key Risk Indicators: internal assets deemed critical, known vulnerabilities, etc.

Not relying on such metrics, with precise lower and upper bounds, will lead you to chase ghosts when time, money, and people matter.
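As a back-of-the-envelope illustration, here’s what that error-budget arithmetic looks like in Python. The 99.9% objective, the 30-day window, and the KPI bounds are made-up numbers for the sketch, not recommendations.

```python
# Error budget: the allowed unreliability implied by an SLO.
SLO = 0.999                    # hypothetical availability objective: 99.9%
WINDOW_MINUTES = 30 * 24 * 60  # a 30-day window

error_budget = (1 - SLO) * WINDOW_MINUTES
print(f"error budget: {error_budget:.0f} minutes of downtime per window")
# ~43 minutes; at 99.999% ("five nines") it shrinks to about 26 seconds

# The same logic applies to security KPIs: give each explicit bounds.
# Targets below are purely illustrative.
kpi_bounds = {
    "mean_time_to_detect_hours": (0, 24),  # detect within a day
    "mean_time_to_repair_hours": (0, 72),  # remediate within three days
}
measured = {"mean_time_to_detect_hours": 30,
            "mean_time_to_repair_hours": 48}

for name, (low, high) in kpi_bounds.items():
    status = "ok" if low <= measured[name] <= high else "out of bounds"
    print(f"{name}: {measured[name]} ({status})")
```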
- Tests aren’t enough to face the unknown
Modern distributed systems are made of an immense number of constantly changing, stateless variables, which makes it nearly impossible to understand how they work at any given moment. Moreover, in security, the main factor is the human one, and it’s unpredictable; every system has to work this factor into its security scheme.
No security system can stay idle in an ever-evolving threat environment. The set-it-and-forget-it mantra is doomed to fail because it’s static. Create dynamic security instead: feedback loops and experimentation.
In a few words: test and experiment as many times as possible. You want to be ready when the incident comes. To recap (a sketch of the distinction follows the definitions):
– Test: the validation or assessment of a previously known outcome;
– Experiment: the derivation of new insights and information that were previously unknown.
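Here’s a hedged sketch of that distinction in code. `port_is_blocked` and `alert_fired` are placeholder stubs for your own probing and monitoring tools: a test asserts a known outcome, while an experiment observes what happens and learns from it.

```python
import random

# Placeholder stubs for real tooling: a firewall probe and a check on
# whether monitoring raised an alert.
def port_is_blocked(port: int) -> bool:
    return port != 8080  # stub: pretend port 8080 was left open

def alert_fired() -> bool:
    return random.random() < 0.5  # stub: detection is genuinely uncertain

# A TEST: the outcome is known in advance, and we assert it.
assert port_is_blocked(23), "telnet must always be blocked"

# An EXPERIMENT: the outcome is unknown; we observe and learn.
# Hypothesis: if a control silently fails, monitoring raises an alert.
if not port_is_blocked(8080):
    if alert_fired():
        print("open port was caught by monitoring: hypothesis held")
    else:
        print("weakness found: the open port went unnoticed")
```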
- Do it once manually and automate it
One of the main particularities of SRE is to rely on automation once vulnerabilities have been found and acknowledged. That way, you increase your readiness for when hard times come. Automation is also a dramatic enhancer of human labor: the capacity to remediate issues at machine speed leverages our work, letting us focus on the next issue once we’ve discovered one, fixed it, and found a way for technology to handle its future occurrences.
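For illustration, here’s a minimal detect-and-remediate loop in Python. The `find_unpatched_hosts` and `apply_patch` helpers are hypothetical stand-ins for your vulnerability scanner and configuration-management tooling; the point is the shape of the loop, performed by hand once and by code ever after.

```python
import time

# Hypothetical hooks: replace with your vulnerability scanner and your
# configuration-management or patching tooling.
def find_unpatched_hosts() -> list[str]:
    return ["web-03", "db-01"]  # stub data so the sketch runs

def apply_patch(host: str) -> None:
    print(f"patching {host}")

# The manual runbook, turned into code: scan, remediate, repeat.
def remediation_loop(cycles: int, interval_seconds: float) -> None:
    for _ in range(cycles):
        for host in find_unpatched_hosts():
            apply_patch(host)          # remediation at machine speed
        time.sleep(interval_seconds)   # humans stay free for the next issue

if __name__ == "__main__":
    remediation_loop(cycles=2, interval_seconds=1.0)
```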
A new approach to cybersecurity
Every day we come up with new measures and new tools to handle security threats. Cybersecurity spending already exceeds $50 billion annually and is bound to keep increasing at a fast pace. Still, the issue only seems to get worse: more attacks and new exploits are reported every day.
Interestingly enough, most attacks aren’t due to advanced threats such as APTs. Most are far simpler than that: they exploit relatively mundane things like incomplete implementations, misconfigurations, or design imperfections. Human error and system glitches (system errors, misconfigurations) account for most security breaches; unpatched vulnerabilities and human error (stolen credentials, phishing, accidental data loss) together account for more than 50% of the root causes of data breaches.
More often than not, malicious or criminal attacks succeed because of initial human errors and system glitches. We know these factors; they’re testable and measurable. This is about taking a proactive approach to designing, building, implementing, operating, and monitoring our security controls.
In a dynamic environment such as cyberspace, we need continuous instrumentation and validation of our security capabilities to create a real sense of reliability in our systems’ ability to defend against incoming threats. It’s about relying on something other than hope while we wait for the threat.