Analyzing How Containers Work

hyuckhoon.ko · February 9, 2024

I Want to Become an SRE


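We learned that a container is a process on a Linux OS whose resources are isolated (independent).

Before creating a container with Docker right away, we take some time to build an isolated process by hand.
We isolate the rootfs, PIDs, and ports one at a time. The result is not yet a fully independent process at this stage, but it shows the general direction.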

chroot exercise: my own root filesystem space

Check the current root filesystem

$ ls /
bin    dev   initrd.img      lib32       media  proc  sbin  sys  var
boot   etc   initrd.img.old  lib64       mnt    root  snap  tmp  vmlinuz
cdrom  home  lib             lost+found  opt    run   srv   usr  vmlinuz.old

Download a sample root filesystem archive

$ wget -O rootfs.tar.gz "https://www.dropbox.com/s/rx6t9s92h9wdjud/rootfs.tar.gz?dl=1"

Extract the archive

$ sudo tar -zxf rootfs.tar.gz

$ ls rootfs
bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
boot  etc  lib   media  opt  root  sbin  sys  usr

We are going to turn this into our own independent root filesystem.
Let's create a root filesystem that serves only /bin/bash.

reallinux@ubuntu:~$ sudo chroot rootfs /bin/bash
root@ubuntu:/# ls /

Let's exit the process.

root@ubuntu:/# exit
exit
reallinux@ubuntu:~$
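To make the effect of chroot concrete, here is a minimal sketch of how an absolute path seen inside the new root corresponds to a path on the host. The function name `host_path` and the location `/home/reallinux/rootfs` are assumptions for illustration (the transcript only shows `rootfs` relative to the home directory), and symlinks are ignored.

```python
import posixpath


def host_path(new_root, path_in_jail):
    """Return the host-side path that an absolute path inside the chroot refers to.

    Simplified model: symlinks are not followed, and ".." is clamped at the
    new root, mirroring the fact that "/.." inside a chroot is still "/".
    """
    # Normalize against "/" first so ".." can never climb above the new root.
    inside = posixpath.normpath(posixpath.join("/", path_in_jail))
    return posixpath.join(new_root, inside.lstrip("/"))


if __name__ == "__main__":
    # Assumed location of the extracted rootfs directory.
    print(host_path("/home/reallinux/rootfs", "/bin/bash"))
    print(host_path("/home/reallinux/rootfs", "/../../etc"))
```

This is why `ls /` inside the chroot shows the contents of `rootfs` rather than the host's root directory.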

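bash is just an interactive CLI program that responds to commands. Starting another bash simply creates a child process with its own PID.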

$ echo $$
4114

$ bash
$ echo $$
8826


$ ps -ef | grep 8826
reallin+  8826  4114  4 06:35 pts/1    00:00:01 bash
reallin+  9025  8826  0 06:36 pts/1    00:00:00 ps -ef
reallin+  9026  8826  0 06:36 pts/1    00:00:00 grep --color=auto 8826

$ exit
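The same parent/child relationship can be observed from a program. The sketch below starts a child Python interpreter and checks that the parent PID it reports is our own PID, just as PID 4114 is the parent of bash 8826 in the listing above. The function name `child_parent_pid` is made up for this example.

```python
import os
import subprocess
import sys


def child_parent_pid():
    """Start a child Python process and return the parent PID it reports."""
    out = subprocess.check_output(
        [sys.executable, "-c", "import os; print(os.getppid())"]
    )
    return int(out.decode().strip())


if __name__ == "__main__":
    print("my pid:", os.getpid())
    print("ppid seen by child:", child_parent_pid())
```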

PID namespace exercise: my own PIDs

Check all processes on the current system

$ ps -ef

First isolate only the root filesystem

reallinux@ubuntu:~$ sudo chroot rootfs /bin/bash
root@ubuntu:/#

Mount proc (process management)

root@ubuntu:/# mount -t proc proc /proc

Check all processes

$ ps -ef

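The processes listed are the same ones we saw on the host.

That is expected: only the root filesystem was isolated, while the process-management area (proc) was connected to the host's.
(Note to self: the processes look the same even without the "mount proc" step above. Ask whether that step is unnecessary. One likely explanation: a plain chroot shares the host's mount namespace, so a proc mounted at rootfs/proc in an earlier session stays mounted after exit. ps reads /proc and has nothing to read when it is not mounted.)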
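Tools like ps get their process list from the proc filesystem, which is why mounting the host's proc exposes the host's processes. The following is a rough sketch of that idea, not how ps is actually implemented: it treats every all-digit directory name under a proc mount as a PID. The name `list_pids` and the stand-in directory contents are assumptions so that the example runs without a real /proc.

```python
import os
import tempfile


def list_pids(proc_root="/proc"):
    """Return the PIDs visible under proc_root, i.e. its all-digit directory names."""
    pids = []
    for name in os.listdir(proc_root):
        if name.isdigit() and os.path.isdir(os.path.join(proc_root, name)):
            pids.append(int(name))
    return sorted(pids)


if __name__ == "__main__":
    # Stand-in directory with made-up entries so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as fake_proc:
        for name in ("1", "3", "self", "sys"):
            os.mkdir(os.path.join(fake_proc, name))
        print(list_pids(fake_proc))
```

On a Linux host, calling `list_pids()` with the default argument lists whatever the mounted proc instance exposes, so the result depends on which PID namespace that proc belongs to.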

Isolating process management and the root filesystem at the same time

reallinux@ubuntu:~$ sudo unshare -p -f --mount-proc=rootfs/proc chroot rootfs /bin/bash

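(Reading the command: -p unshares the PID namespace, -f forks the program as a child of unshare, and --mount-proc mounts a fresh proc filesystem at rootfs/proc.)

Check all processes (confirm PID 1)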

root@ubuntu:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  4 06:47 ?        00:00:00 /bin/bash
root         3     1  0 06:47 ?        00:00:00 ps -ef
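One way to confirm that a process really lives in a different namespace is to compare the links under /proc/<pid>/ns, whose targets look like `pid:[4026531836]`. The sketch below parses such a target and collects them for a process. The function names and the sample inode number are illustrative assumptions; `namespaces_of` needs a Linux /proc to run.

```python
import os
import re

# Link targets under /proc/<pid>/ns have the form "type:[inode]".
NS_LINK = re.compile(r"^([a-z_]+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'pid:[4026531836]' into (type, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError("unexpected namespace link: %r" % (target,))
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Map each entry in /proc/<pid>/ns to its namespace inode number (Linux only)."""
    ns_dir = "/proc/%s/ns" % (pid,)
    result = {}
    for name in os.listdir(ns_dir):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


if __name__ == "__main__":
    # Sample value for illustration only.
    print(parse_ns_link("pid:[4026531836]"))
```

Two processes share a namespace exactly when the inode numbers for that entry match, so running `namespaces_of()` on the host and inside the unshared shell should differ in the `pid` entry.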

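A container is an isolated process (namespace, cgroup).

First we created a process with only its filesystem isolated, and then
a process with both its filesystem and process management isolated.

Beyond these, unshare can create processes isolated in many other ways.

The namespaces that unshare can isolate are listed in its man page: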

reallinux@ubuntu:~$ man unshare

UNSHARE(1)                               User Commands                              UNSHARE(1)

NAME
       unshare - run program with some namespaces unshared from parent

SYNOPSIS
       unshare [options] [program [arguments]]

DESCRIPTION
       Unshares  the indicated namespaces from the parent process and then executes the speci‐
       fied program. If program is not given, then ``${SHELL}'' is run (default: /bin/sh).

       The namespaces can optionally be made persistent  by  bind  mounting  /proc/pid/ns/type
       files  to  a  filesystem path and entered with nsenter(1) even after the program termi‐
       nates (except PID namespaces where permanently running init process is required).  Once
       a  persistent namespace is no longer needed, it can be unpersisted with umount(8).  See
       the EXAMPLES section for more details.

       The namespaces to be unshared are indicated via options.  Unshareable namespaces are:

       mount namespace
              Mounting and unmounting filesystems will not affect  the  rest  of  the  system,
              except for filesystems which are explicitly marked as shared (with mount --make-
              shared; see  /proc/self/mountinfo  or  findmnt  -o+PROPAGATION  for  the  shared
              flags).   For further details, see mount_namespaces(7) and the discussion of the
              CLONE_NEWNS flag in clone(2).

              unshare since util-linux version 2.27 automatically sets propagation to  private
              in a new mount namespace to make sure that the new namespace is really unshared.
              It's possible to disable this feature with option --propagation unchanged.  Note
              that private is the kernel default.

       UTS namespace
              Setting hostname or domainname will not affect the rest of the system.  For fur‐
              ther details, see namespaces(7) and the discussion of the CLONE_NEWUTS  flag  in
              clone(2).

       IPC namespace
              The  process will have an independent namespace for POSIX message queues as well
              as System V message queues, semaphore sets and shared memory segments.  For fur‐
              ther  details,  see namespaces(7) and the discussion of the CLONE_NEWIPC flag in
              clone(2).

       network namespace
              The process will have independent IPv4 and IPv6 stacks, IP routing tables, fire‐
              wall rules, the /proc/net and /sys/class/net directory trees, sockets, etc.  For
              further details, see namespaces(7) and the discussion of the  CLONE_NEWNET  flag
              in clone(2).

       PID namespace
              Children  will have a distinct set of PID-to-process mappings from their parent.
              For further details, see pid_namespaces(7) and the discussion of the  CLONE_NEW‐
              PID flag in clone(2).

       cgroup namespace
              The  process  will  have a virtualized view of /proc/self/cgroup, and new cgroup
              mounts will be rooted at the namespace cgroup root.  For  further  details,  see
              cgroup_namespaces(7) and the discussion of the CLONE_NEWCGROUP flag in clone(2).
              
       user namespace
              The process will have a distinct set of UIDs, GIDs and capabilities.  For further
              details, see user_namespaces(7) and the discussion of the CLONE_NEWUSER flag in
              clone(2).

OPTIONS
       -i, --ipc[=file]
              Unshare  the  IPC  namespace.  If file is specified, then a persistent namespace is created by a bind
              mount.

       -m, --mount[=file]
              Unshare the mount namespace.  If file is specified, then a persistent namespace is created by a  bind
              mount.   Note  that  file has to be located on a filesystem with the propagation flag set to private.
              Use the command findmnt -o+PROPAGATION when not sure about the current setting.  See also  the  exam‐
              ples below.

       -n, --net[=file]
              Unshare  the  network  namespace.   If file is specified, then a persistent namespace is created by a
              bind mount.

       -p, --pid[=file]
              Unshare the PID namespace.  If file is specified then persistent  namespace  is  created  by  a  bind
              mount.  See also the --fork and --mount-proc options.

       -u, --uts[=file]
              Unshare  the  UTS  namespace.  If file is specified, then a persistent namespace is created by a bind
              mount.

       -U, --user[=file]
              Unshare the user namespace.  If file is specified, then a persistent namespace is created by  a  bind
              mount.

       -C, --cgroup[=file]
              Unshare  the  cgroup  namespace.  If  file  is specified then persistent namespace is created by bind
              mount.

       -f, --fork
              Fork the specified program as a child process of unshare rather than running it  directly.   This  is
              useful when creating a new PID namespace.

       --mount-proc[=mountpoint]
              Just before running the program, mount the proc filesystem at mountpoint (default is /proc).  This is
              useful when creating a new PID namespace.  It also implies creating a new mount namespace  since  the
              /proc  mount  would  otherwise  mess  up existing programs on the system.  The new proc filesystem is
              explicitly mounted as private (with MS_PRIVATE|MS_REC).

       -r, --map-root-user
              Run the program only after the current effective user and group IDs have been mapped to the superuser
              UID  and  GID in the newly created user namespace.  This makes it possible to conveniently gain capa‐
              bilities needed to manage various aspects of the newly created namespaces (such as configuring inter‐
              faces in the network namespace or mounting filesystems in the mount namespace) even when run unprivi‐
              leged.  As a mere convenience feature, it does not support more sophisticated use cases, such as map‐
              ping multiple ranges of UIDs and GIDs.  This option implies --setgroups=deny.

       --propagation private|shared|slave|unchanged
              Recursively  set  the  mount  propagation flag in the new mount namespace.  The default is to set the
              propagation to private.  It is possible to disable this feature with  the  argument  unchanged.   The
              option is silently ignored when the mount namespace (--mount) is not requested.

       --setgroups allow|deny
              Allow or deny the setgroups(2) system call in a user namespace.

              To  be able to call setgroups(2), the calling process must at least have CAP_SETGID.  But since Linux
              3.19 a further restriction applies: the kernel gives permission to call setgroups(2) only  after  the
              GID  map  (/proc/pid/gid_map)  has  been  set.   The GID map is writable by root when setgroups(2) is
              enabled (i.e. allow, the default), and the GID map becomes writable by  unprivileged  processes  when
              setgroups(2) is permanently disabled (with deny).

       -V, --version
              Display version information and exit.


       -h, --help
              Display help text and exit.

NOTES
       The  proc  and  sysfs  filesystems mounting as root in a user namespace have to be restricted so that a less
       privileged user can not get more access to sensitive files that a more privileged user made unavailable.  In
       short the rule for proc and sysfs is as close to a bind mount as possible.

EXAMPLES
       # unshare --fork --pid --mount-proc readlink /proc/self
       1
              Establish a PID namespace, ensure we're PID 1 in it against a newly mounted procfs instance.

       $ unshare --map-root-user --user sh -c whoami
       root
              Establish a user namespace as an unprivileged user with a root user within it.

       # touch /root/uts-ns
       # unshare --uts=/root/uts-ns hostname FOO
       # nsenter --uts=/root/uts-ns hostname
       FOO
       # umount /root/uts-ns
              Establish  a  persistent  UTS namespace, and modify the hostname.  The namespace is then entered with
              nsenter.  The namespace is destroyed by unmounting the bind reference.

       # mount --bind /root/namespaces /root/namespaces
       # mount --make-private /root/namespaces
       # touch /root/namespaces/mnt
       # unshare --mount=/root/namespaces/mnt
              Establish a persistent mount namespace referenced by the bind mount /root/namespaces/mnt.  This exam‐
              ple  shows  a  portable  solution,  because  it makes sure that the bind mount is created on a shared
              filesystem.

SEE ALSO
       clone(2), unshare(2), namespaces(7), mount(8)

AUTHORS
       Mikhail Gusarov ⟨dottedmag@dottedmag.net⟩
       Karel Zak ⟨kzak@redhat.com⟩

AVAILABILITY
       The  unshare  command  is  part  of  the  util-linux  package  and  is   available   from   https://www.ker‐
       nel.org/pub/linux/utils/util-linux/.

util-linux                                         February 2016                                         UNSHARE(1)
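As a reading aid for the options above, here is a small sketch that assembles an unshare command line from a list of namespace names. It only builds the argument list and does not run anything. The helper `build_unshare_argv` and the `NAMESPACE_FLAGS` table are made up for this example; the long option names are the ones shown in the man page.

```python
# Long options taken from the OPTIONS section of the man page above.
NAMESPACE_FLAGS = {
    "ipc": "--ipc",
    "mount": "--mount",
    "net": "--net",
    "pid": "--pid",
    "uts": "--uts",
    "user": "--user",
    "cgroup": "--cgroup",
}


def build_unshare_argv(namespaces, program, fork=False, mount_proc=None):
    """Build an argv list for unshare without executing it."""
    argv = ["unshare"]
    for ns in namespaces:
        argv.append(NAMESPACE_FLAGS[ns])
    if fork:
        argv.append("--fork")
    if mount_proc is not None:
        argv.append("--mount-proc=" + mount_proc)
    return argv + list(program)


if __name__ == "__main__":
    # Long-option form of the command used earlier in this post.
    argv = build_unshare_argv(
        ["pid"], ["chroot", "rootfs", "/bin/bash"],
        fork=True, mount_proc="rootfs/proc",
    )
    print(" ".join(argv))
```

Passing the resulting list to `subprocess.call` (prefixed with `sudo`) would be one way to launch it, though the commands in this post are typed directly in the shell.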

cgroup exercise: my own memory resources

$ sudo su

Check the items that can be configured with cgroup

$ ls /sys/fs/cgroup/
cpu  cpuacct  cpu,cpuacct  cpuset  freezer  memory  systemd  unified

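/sys, /proc, and their subdirectories are special files and directories.
What physically backs them is memory.
They are not backed by disk (SSD, HDD).
More precisely, they are tied to kernel memory, kernel functions, and kernel variables.

They look like ordinary files to us, but Linux routes access to them through an intermediate layer, the VFS.
Every operation on them is handled by this virtual file system layer.

(This is also what lets us set a memory threshold for a process through cgroup and observe what happens when it is reached.)

Create a test directory under the memory controller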

root@ubuntu:/home/reallinux# mkdir /sys/fs/cgroup/memory/test
root@ubuntu:/home/reallinux# ls /sys/fs/cgroup/memory/test
cgroup.clone_children           memory.kmem.tcp.failcnt             memory.oom_control
cgroup.event_control            memory.kmem.tcp.limit_in_bytes      memory.pressure_level
cgroup.procs                    memory.kmem.tcp.max_usage_in_bytes  memory.soft_limit_in_bytes
memory.failcnt                  memory.kmem.tcp.usage_in_bytes      memory.stat
memory.force_empty              memory.kmem.usage_in_bytes          memory.swappiness
memory.kmem.failcnt             memory.limit_in_bytes               memory.usage_in_bytes
memory.kmem.limit_in_bytes      memory.max_usage_in_bytes           memory.use_hierarchy
memory.kmem.max_usage_in_bytes  memory.move_charge_at_immigrate     notify_on_release
memory.kmem.slabinfo            memory.numa_stat                    tasks

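(Normally mkdir produces an empty directory. Under special directories such as /sys and /proc, however, mkdir behaves differently from creating a directory on an ordinary disk: the kernel populates the new cgroup directory with its control files.)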
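Because these control files behave like ordinary files, a program can read them the same way cat does. The sketch below reads three of the files from the listing above (`memory.usage_in_bytes`, `memory.limit_in_bytes`, `memory.failcnt`). The function names are made up, and the demo uses a temporary directory with invented values so it runs without a real cgroup mount.

```python
import os
import tempfile


def read_int(path):
    """Read a control file that holds a single integer."""
    with open(path) as f:
        return int(f.read().strip())


def memory_status(cgroup_dir):
    """Return usage, limit, and limit-hit count for a cgroup v1 memory directory."""
    return {
        "usage": read_int(os.path.join(cgroup_dir, "memory.usage_in_bytes")),
        "limit": read_int(os.path.join(cgroup_dir, "memory.limit_in_bytes")),
        "failcnt": read_int(os.path.join(cgroup_dir, "memory.failcnt")),
    }


def write_fake_cgroup(directory, values):
    """Create stand-in control files (made-up values) so the sketch runs anywhere."""
    for name, value in values.items():
        with open(os.path.join(directory, name), "w") as f:
            f.write("%d\n" % value)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as fake:
        write_fake_cgroup(fake, {
            "memory.usage_in_bytes": 4096,
            "memory.limit_in_bytes": 100000000,
            "memory.failcnt": 0,
        })
        print(memory_status(fake))
```

On the machine from this post, `memory_status("/sys/fs/cgroup/memory/test")` would be the real-world call, assuming the same cgroup v1 layout.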

New cgroup "test": 100MB limit

root@ubuntu:/home/reallinux# echo 100000000 > /sys/fs/cgroup/memory/test/memory.limit_in_bytes

This makes the group behave as if only 100MB of memory existed. So far it is only a setting and has not been applied to any process.

New cgroup "test": disable swap

root@ubuntu:/home/reallinux# echo 0 > /sys/fs/cgroup/memory/test/memory.swappiness

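0 means disabled.

When the allotted memory runs short, the kernel would normally move memory pages that have not been used for a long time to the swap area on disk. That behavior has to be turned off here.

With swap unused (borrowing from disk is disabled), the group can use only the 100MB of memory.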
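The two echo commands above amount to writing numbers into two control files. A minimal sketch of the same steps as a function follows, assuming the cgroup v1 file names used in this post. `configure_memory_cgroup` is a hypothetical helper, and the demo writes into a temporary directory rather than a real cgroup.

```python
import os
import tempfile


def configure_memory_cgroup(cgroup_dir, limit_bytes, swappiness=0):
    """Write the memory limit and swappiness into a cgroup v1 memory directory."""
    settings = {
        "memory.limit_in_bytes": limit_bytes,
        "memory.swappiness": swappiness,
    }
    for name, value in settings.items():
        # Same effect as: echo VALUE > cgroup_dir/name
        with open(os.path.join(cgroup_dir, name), "w") as f:
            f.write("%d\n" % value)


if __name__ == "__main__":
    # Stand-in directory so the sketch runs without root or a cgroup mount.
    with tempfile.TemporaryDirectory() as fake:
        configure_memory_cgroup(fake, 100000000)
        for name in sorted(os.listdir(fake)):
            with open(os.path.join(fake, name)) as f:
                print(name, f.read().strip())
```

Against the real directory this would have to run as root, just like the shell commands.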

Now open a new terminal.

echo $$
25558

From the previous terminal, apply our memory settings to the process above with the following command.

root@ubuntu:/home/reallinux# echo 25558 > /sys/fs/cgroup/memory/test/tasks
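Whether a process actually landed in the group can be checked in /proc/<pid>/cgroup, where each line has the form `hierarchy-ID:controller-list:path`. The sketch below parses that format. The function name and the sample lines (including the hierarchy IDs) are invented for illustration.

```python
def parse_cgroup_file(text):
    """Map each controller name to its cgroup path from /proc/<pid>/cgroup content."""
    membership = {}
    for line in text.splitlines():
        if not line:
            continue
        # Split at most twice: the path itself may contain ':'.
        _hierarchy_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            membership[controller] = path
    return membership


if __name__ == "__main__":
    # Made-up sample in the /proc/<pid>/cgroup format.
    sample = "9:memory:/test\n4:cpu,cpuacct:/\n"
    print(parse_cgroup_file(sample)["memory"])
```

For the shell in this post, reading `/proc/25558/cgroup` and looking up the `memory` entry should show `/test` once the PID has been written to the tasks file.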

Switch back to the new terminal.
Let's verify that this process can use no more than 100MB of memory.

reallinux@ubuntu:~$ vim mem_eater.py

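# Read random bytes forever and keep them all, so memory use grows by about 10MB per loop.
f = open("/dev/urandom", "rb")  # binary mode: the data is raw bytes, not text
data = b""

i = 0
while True:
    data += f.read(10000000)  # 10MB
    i += 1
    print("%dmb" % (i * 10,))  # this form works on both Python 2 and Python 3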

reallinux@ubuntu:~$ python mem_eater.py
10mb
20mb
30mb
40mb
50mb
60mb
70mb
80mb
Killed

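The process cannot be allowed to use memory without bound, so once it crossed the threshold the kernel killed it. The last line printed is 80mb rather than 100mb, presumably because the limit also counts the interpreter's own memory and the temporary copy made while appending to data.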
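The "Killed" message is the shell reporting that the process died from SIGKILL, which is the signal the kernel's OOM killer sends. A supervising program can detect the same thing from the exit status. The sketch below simulates it with a child that sends SIGKILL to itself, since triggering a real OOM kill needs the cgroup setup. The names `was_sigkilled` and `SELF_KILL` are made up for this example.

```python
import signal
import subprocess
import sys

# Stand-in for an OOM-killed process: a child that sends SIGKILL to itself.
SELF_KILL = [
    sys.executable, "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGKILL)",
]


def was_sigkilled(argv):
    """Run argv and report whether it ended because of SIGKILL."""
    returncode = subprocess.call(argv)
    # subprocess reports death by signal N as a return code of -N.
    return returncode == -signal.SIGKILL


if __name__ == "__main__":
    print(was_sigkilled(SELF_KILL))
    print(was_sigkilled([sys.executable, "-c", "pass"]))
```

Running `was_sigkilled(["python", "mem_eater.py"])` from a shell placed in the test cgroup would be the corresponding real check, assuming the setup from this post.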

Now run mem_eater.py from another new terminal, where the cgroup is not applied.

reallinux@ubuntu:~$ python mem_eater.py 
10mb
20mb
30mb
40mb
50mb
60mb
70mb
80mb
90mb
100mb
110mb
120mb
130mb
140mb
150mb
160mb
170mb
180mb
190mb
200mb
210mb
220mb
230mb
...

A process without the cgroup applied goes past the 100MB limit.

Next, we want to lift the memory limit from process 25558, which has the cgroup applied.
How do we restore it?

root@ubuntu:/home/reallinux# echo 25558 > /sys/fs/cgroup/memory/tasks

Just move it back to the default (root) cgroup.

Review

Testing with a single terminal

reallinux@ubuntu:~$ echo $$
2089

reallinux@ubuntu:~$ sudo su
[sudo] password for reallinux: 
root@ubuntu:/home/reallinux# 

root@ubuntu:/home/reallinux# echo $$
2109

root@ubuntu:/home/reallinux# bash

root@ubuntu:/home/reallinux# echo $$
2130

root@ubuntu:/home/reallinux# echo 2130 > /sys/fs/cgroup/memory/test/tasks
root@ubuntu:/home/reallinux# python mem_eater.py

10mb
20mb
30mb
40mb
50mb
60mb
70mb
80mb
Killed

root@ubuntu:/home/reallinux# echo 2130 > /sys/fs/cgroup/memory/tasks
root@ubuntu:/home/reallinux# python mem_eater.py

10mb
20mb
30mb
40mb
50mb
60mb
70mb
80mb
90mb
100mb
110mb
120mb
130mb
140mb
150mb
160mb