SRE Practice

CodingDaddy · March 18, 2022

Put simply, SREs run services—a set of related systems, operated for users, who may be internal or external—and are ultimately responsible for the health of these services. Successfully operating a service entails a wide range of activities: developing monitoring systems, planning capacity, responding to incidents, ensuring the root causes of outages are addressed, and so on. This section addresses the theory and practice of an SRE’s day-to-day activity: building and operating large distributed computing systems.

We can characterize the health of a service, in much the same way that Abraham Maslow categorized human needs [Mas43], from the most basic requirements needed for a system to function as a service at all to the higher levels of function: permitting self-actualization and taking active control of the direction of the service rather than reactively fighting fires. This understanding is so fundamental to how we evaluate services at Google that it wasn’t explicitly developed until a number of Google SREs, including our former colleague Mikey Dickerson, temporarily joined the radically different culture of the United States government to help with the launch of healthcare.gov in late 2013 and early 2014: they needed a way to explain how to increase systems’ reliability. We’ll use this hierarchy, illustrated in Figure III-1, to look at the elements that go into making a service reliable, from most basic to most advanced.

Figure III-1. Service Reliability Hierarchy

Monitoring

Without monitoring, you have no way to tell whether the service is even working; absent a thoughtfully designed monitoring infrastructure, you’re flying blind. Maybe everyone who tries to use the website gets an error, maybe not—but you want to be aware of problems before your users notice them. We discuss tools and philosophy in Practical Alerting from Time-Series Data.

Incident Response
SREs don’t go on-call merely for the sake of it: rather, on-call support is a tool we use to achieve our larger mission and remain in touch with how distributed computing systems actually work (and fail!). If we could find a way to relieve ourselves of carrying a pager, we would. In Being On-Call, we explain how we balance on-call duties with our other responsibilities.
Once you’re aware that there is a problem, how do you make it go away? That doesn’t necessarily mean fixing it once and for all—maybe you can stop the bleeding by reducing the system’s precision or turning off some features temporarily, allowing it to gracefully degrade, or maybe you can direct traffic to another instance of the service that’s working properly. The details of the solution you choose to implement are necessarily specific to your service and your organization. Responding effectively to incidents, however, is something applicable to all teams.
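As a rough illustration of "stopping the bleeding," the sketch below tries several instances of a service in turn and lets an optional feature be switched off. Everything here is hypothetical: the `serve` function, the `suggestions_enabled` flag, and the fallback string are assumptions made for the example, not part of any real system described in the text.

```python
from typing import Callable, Dict, List

# A backend is any callable that answers a query (hypothetical interface).
Backend = Callable[[str], str]


def serve(query: str, instances: List[Backend], flags: Dict[str, bool]) -> str:
    """Answer a query, failing over between instances and honoring kill switches."""
    for instance in instances:
        try:
            answer = instance(query)
        except ConnectionError:
            continue  # this instance is unhealthy; try the next one
        # An optional feature can be switched off during an incident.
        if flags.get("suggestions_enabled", True):
            answer += " (with suggestions)"
        return answer
    # Every instance failed: degrade to a static response instead of an error.
    return "limited results"
```

Which features are safe to turn off, and what a degraded answer should look like, depends entirely on the service in question.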

Figuring out what’s wrong is the first step; we offer a structured approach in Effective Troubleshooting.

During an incident, it’s often tempting to give in to adrenalin and start responding ad hoc. We advise against this temptation in Emergency Response, and in Managing Incidents we counsel that managing incidents effectively should reduce their impact and limit outage-induced anxiety.

Postmortem and Root-Cause Analysis
We aim to be alerted on and manually solve only new and exciting problems presented by our service; it’s woefully boring to "fix" the same issue over and over. In fact, this mindset is one of the key differentiators between the SRE philosophy and some more traditional operations-focused environments. This theme is explored in two chapters.

Building a blameless postmortem culture is the first step in understanding what went wrong (and what went right!), as described in Postmortem Culture: Learning from Failure.

Related to that discussion, in Tracking Outages, we briefly describe an internal tool, the outage tracker, that allows SRE teams to keep track of recent production incidents, their causes, and actions taken in response to them.

Testing
Once we understand what tends to go wrong, our next step is attempting to prevent it, because an ounce of prevention is worth a pound of cure. Test suites offer some assurance that our software isn’t making certain classes of errors before it’s released to production; we talk about how best to use these in Testing for Reliability.

Capacity Planning
In Software Engineering in SRE, we offer a case study of software engineering in SRE with Auxon, a tool for automating capacity planning.

Naturally following capacity planning, load balancing ensures we’re properly using the capacity we’ve built. We discuss how requests to our services get sent to datacenters in Load Balancing at the Frontend. Then we continue the discussion in Load Balancing in the Datacenter and Handling Overload, both of which are essential for ensuring service reliability.

Finally, in Addressing Cascading Failures, we offer advice for addressing cascading failures, both in system design and should your service be caught in a cascading failure.

Development
One of the key aspects of Google’s approach to Site Reliability Engineering is that we do significant large-scale system design and software engineering work within the organization.

In Managing Critical State: Distributed Consensus for Reliability, we explain distributed consensus, which (in the guise of Paxos) is at the core of many of Google’s distributed systems, including our globally distributed Cron system. In Distributed Periodic Scheduling with Cron, we outline a system that scales to whole datacenters and beyond, which is no easy task.

Data Processing Pipelines discusses the various forms that data processing pipelines can take: from one-shot MapReduce jobs running periodically to systems that operate in near real-time. Different architectures can lead to surprising and counterintuitive challenges.

Making sure that the data you stored is still there when you want to read it is the heart of data integrity; in Data Integrity: What You Read Is What You Wrote, we explain how to keep data safe.

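Paxos: the consensus algorithm adopted by Google Chubby

In a P2P network, delayed and undelivered messages are unavoidable. Even when nobody intends to tamper with the data, duplicates caused by retransmission and malfunctions caused by incorrect information make it hard to share accurate information. Solving this problem is the purpose of a consensus algorithm.

The defining property of Paxos is that once a majority has agreed on something, that agreement is never changed afterward. Paxos assumes honest participants: if the leader misbehaves, or if a member reports false information, the replicas fail to stay in sync, so it is not suited to environments with malicious participants.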
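The majority property can be seen in a minimal, in-memory sketch of single-decree Paxos. This is a simplified illustration, not Chubby's implementation: the `Acceptor` class and `propose` function are hypothetical names, messages are plain function calls (so nothing is delayed or lost), and every participant is assumed to be honest.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    """Hypothetical single-decree Paxos acceptor (in-memory, no networking)."""
    promised: int = -1                    # highest proposal number promised so far
    accepted_n: int = -1                  # number of the accepted proposal, if any
    accepted_value: Optional[str] = None  # value of the accepted proposal, if any

    def prepare(self, n: int) -> Optional[Tuple[int, Optional[str]]]:
        # Phase 1: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return (self.accepted_n, self.accepted_value)
        return None  # reject: a higher-numbered proposal was already promised

    def accept(self, n: int, value: str) -> bool:
        # Phase 2: accept unless a higher-numbered promise has been made.
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run one proposal round; return the chosen value, or None without a majority."""
    majority = len(acceptors) // 2 + 1

    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None

    # If any acceptor already accepted a value, the proposer must adopt the
    # one with the highest proposal number instead of its own value.
    _, prev_value = max(promises, key=lambda p: p[0])
    if prev_value is not None:
        value = prev_value

    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None
```

With three acceptors, a first call `propose(acceptors, 1, "A")` chooses `"A"`. A later call `propose(acceptors, 2, "B")` also returns `"A"`, because the second proposer learns of the earlier accepted value during the prepare phase and must adopt it.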

Product

Finally, having made our way up the reliability pyramid, we find ourselves at the point of having a workable product. In Reliable Product Launches at Scale, we write about how Google does reliable product launches at scale to try to give users the best possible experience starting from Day Zero.

Further Reading from Google SRE
As discussed previously, testing is subtle, and its improper execution can have large effects on overall stability. In an ACM article [Kri12], we explain how Google performs company-wide resilience testing to ensure we’re capable of weathering the unexpected should a zombie apocalypse or other disaster strike.

While it’s often thought of as a dark art, full of mystifying spreadsheets divining the future, capacity planning is nonetheless vital, and as [Hix15a] shows, you don’t actually need a crystal ball to do it right.

Finally, an interesting and new approach to corporate network security is detailed in [War14], an initiative to replace privileged intranets with device and user credentials. Driven by SREs at the infrastructure level, this is definitely an approach to keep in mind when you’re creating your next network.
