Introduction to cgroup v2

Yi Jyun
4 min read · Sep 17, 2023


cgroup overview

Simply put, a cgroup is a mechanism that organizes processes into a hierarchical tree structure and provides controlled resource distribution over that structure. Achieving this relies mainly on two parts.

  • cgroup core

Responsible for organizing processes into the hierarchical tree structure

  • cgroup controller

Responsible for distributing resources within the hierarchy

C-A - C-B - C-C
          \ C-D

Processes inside cgroups have the following characteristics (a quick check is sketched after the list):

  • Every process belongs to exactly one cgroup
  • All threads of a process belong to the same cgroup
  • A newly created process starts in the same cgroup as its parent process
  • Moving a process to another cgroup does not affect its child processes
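
As a quick sketch of the third point (the cgroup path shown is hypothetical), a process started from a shell lands in the shell's own cgroup:

# the shell's current cgroup
❯ cat /proc/$$/cgroup
0::/user.slice/user-1000.slice/session-1.scope

# a child process started from this shell inherits the same cgroup
❯ sleep 100 &
❯ cat /proc/$!/cgroup
0::/user.slice/user-1000.slice/session-1.scope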

Enabling or disabling a controller on a cgroup affects the hierarchy as follows:

  • It affects all processes in the cgroup's sub-hierarchy
  • It only propagates downward; cgroups further up the hierarchy are never affected by modifications made in a sub-hierarchy

Operations

cgroup — create & delete

Inside the cgroup directory, child cgroups can be created or deleted simply with the mkdir and rmdir commands.

root@fce83d832c8f:/sys/fs/cgroup$ mkdir child-1 child-2
root@fce83d832c8f:/sys/fs/cgroup$ rmdir child-2
root@fce83d832c8f:/sys/fs/cgroup$ ls | grep child
child-1

Controller — enable & disable

The controllers a cgroup owns are listed in cgroup.controllers. For a child cgroup, the controllers available to it are determined by the parent's cgroup.subtree_control. By writing "+XXX" or "-XXX" into cgroup.subtree_control with echo, the child cgroups gain or lose the corresponding controller.

# available controllers of this cgroup 
root@5d193511335e:/sys/fs/cgroup# cat cgroup.controllers
cpuset cpu io memory hugetlb pids rdma
root@5d193511335e:/sys/fs/cgroup# cat cgroup.subtree_control
root@5d193511335e:/sys/fs/cgroup# cat child1/cgroup.controllers

# enable
root@4479fe71a3ec:/sys/fs/cgroup# echo "+cpuset +cpu +pids" > cgroup.subtree_control
root@4479fe71a3ec:/sys/fs/cgroup# cat cgroup.subtree_control
cpuset cpu pids
root@4479fe71a3ec:/sys/fs/cgroup# cat child1/cgroup.controllers
cpuset cpu pids

# disable
root@4479fe71a3ec:/sys/fs/cgroup# echo "-cpuset -cpu -pids" > cgroup.subtree_control
root@4479fe71a3ec:/sys/fs/cgroup# cat cgroup.subtree_control
root@4479fe71a3ec:/sys/fs/cgroup# cat child1/cgroup.controllers

This means that for a child to have a given resource, the parent must delegate it. Conversely, while one or more children have a controller enabled and in use, the parent cannot simply withdraw that controller.

A controller can be enabled only if the parent has the controller enabled and a controller can’t be disabled if one or more children have it enabled.

Furthermore, a non-root cgroup can distribute domain resources to its children only if it has no processes of its own.

Non-root cgroups can distribute domain resources to their children only when they don’t have any processes of their own
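
A minimal sketch of this rule, reusing the child1 cgroup from the examples above (the shell PID juggling is illustrative): enabling a controller for the children of a non-root cgroup fails while that cgroup still contains a process.

# move the current shell into child1, then try to delegate a controller
root@4479fe71a3ec:/sys/fs/cgroup# echo "+memory" > cgroup.subtree_control
root@4479fe71a3ec:/sys/fs/cgroup# echo $$ > child1/cgroup.procs
root@4479fe71a3ec:/sys/fs/cgroup# echo "+memory" > child1/cgroup.subtree_control
bash: echo: write error: Device or resource busy

# move the shell back out first, and the same write succeeds
root@4479fe71a3ec:/sys/fs/cgroup# echo $$ > cgroup.procs
root@4479fe71a3ec:/sys/fs/cgroup# echo "+memory" > child1/cgroup.subtree_control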

Resource Distribution Models

Controllers distribute resources according to a few predefined models.

1. Weight

The parent cgroup distributes the resource in proportion to the weights of its active children cgroups. Values are in the range [1, 10000], with a default of 100.

root@4479fe71a3ec:/sys/fs/cgroup# cat child1/cpu.weight
100
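
A minimal sketch of weights in action, assuming two hypothetical children child1 and child2 with the cpu controller delegated from the parent:

root@4479fe71a3ec:/sys/fs/cgroup# echo "+cpu" > cgroup.subtree_control
root@4479fe71a3ec:/sys/fs/cgroup# echo 100 > child1/cpu.weight
root@4479fe71a3ec:/sys/fs/cgroup# echo 200 > child2/cpu.weight
# under CPU contention, child2 now receives roughly twice as much CPU time as child1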

2. Limits

Caps usage of a resource at an upper bound. Values are in the range [0, max], where "max" means no limit.

root@4479fe71a3ec:/sys/fs/cgroup# cat memory.max
max
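
A minimal sketch of setting a limit (the 100M value is arbitrary); the memory files accept human-readable suffixes and report the value back in bytes:

root@4479fe71a3ec:/sys/fs/cgroup# echo 100M > child1/memory.max
root@4479fe71a3ec:/sys/fs/cgroup# cat child1/memory.max
104857600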

3. Protections

Protects a cgroup's resource demand. A protection can be a hard guarantee or a best-effort soft boundary. Values are in the range [0, max].

# “memory.low” implements best-effort memory protection
root@4479fe71a3ec:/sys/fs/cgroup# cat memory.low
0
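
As a minimal sketch (the 50M value is arbitrary), memory.low is the best-effort variant, while memory.min is the hard guarantee:

# best effort: reclaimed only when no unprotected memory is left to reclaim
root@4479fe71a3ec:/sys/fs/cgroup# echo 50M > child1/memory.low
# hard guarantee: memory within this amount is never reclaimed
root@4479fe71a3ec:/sys/fs/cgroup# echo 50M > child1/memory.min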

4. Allocations

Exclusively allocates an absolute amount of a finite resource to this cgroup. Values are in the range [0, max]. The resources allocated to the children, summed together, may not exceed the parent's.

“cpu.rt.max” hard-allocates realtime slices and is an example of this type.

Core Interface Files

All cgroup core interface files are prefixed with "cgroup.". These files expose the cgroup's settings along with information about its processes and resources.

root@f9eb4411427e:/sys/fs/cgroup# ls
cgroup.controllers cgroup.threads cpuset.cpus.effective hugetlb.1GB.rsvd.max hugetlb.32MB.events.local hugetlb.64KB.rsvd.max memory.max pids.events
cgroup.events cgroup.type cpuset.cpus.partition hugetlb.2MB.current hugetlb.32MB.max io.bfq.weight memory.min pids.max
cgroup.freeze cpu.idle cpuset.mems hugetlb.2MB.events hugetlb.32MB.rsvd.current io.max memory.oom.group rdma.current
cgroup.kill cpu.max cpuset.mems.effective hugetlb.2MB.events.local hugetlb.32MB.rsvd.max io.stat memory.stat rdma.max
cgroup.max.depth cpu.max.burst hugetlb.1GB.current hugetlb.2MB.max hugetlb.64KB.current memory.current memory.swap.current
cgroup.max.descendants cpu.stat hugetlb.1GB.events hugetlb.2MB.rsvd.current hugetlb.64KB.events memory.events memory.swap.events
cgroup.procs cpu.weight hugetlb.1GB.events.local hugetlb.2MB.rsvd.max hugetlb.64KB.events.local memory.events.local memory.swap.high
cgroup.stat cpu.weight.nice hugetlb.1GB.max hugetlb.32MB.current hugetlb.64KB.max memory.high memory.swap.max
cgroup.subtree_control cpuset.cpus hugetlb.1GB.rsvd.current hugetlb.32MB.events hugetlb.64KB.rsvd.current memory.low pids.current

cgroup.type

Describes the type of this cgroup:

domain : A normal valid domain cgroup.

domain threaded: A threaded domain cgroup which is serving as the root of a threaded subtree.

domain invalid: A cgroup which is in an invalid state. It can’t be populated or have controllers enabled. It may be allowed to become a threaded cgroup.

threaded: A threaded cgroup which is a member of a threaded subtree.
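
A freshly created child cgroup is a normal domain, e.g. (reusing the hypothetical child1):

root@4479fe71a3ec:/sys/fs/cgroup# cat child1/cgroup.type
domain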

cgroup.procs

Lists the processes that belong to this cgroup

cgroup.threads

Lists the threads that belong to this cgroup

cgroup.controllers

Lists the controllers available to this cgroup

cgroup.subtree_control

Empty by default. Lists the controllers that are enabled for the child cgroups.

cgroup.events

  • populated

1 if this cgroup or any child cgroup still contains live processes; otherwise 0

  • frozen

1 if this cgroup is frozen; otherwise 0
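
A minimal sketch of reading the file (reusing the hypothetical child1); the file also generates modify events, so it can be watched with poll(2) or inotify instead of being re-read:

root@4479fe71a3ec:/sys/fs/cgroup# cat child1/cgroup.events
populated 0
frozen 0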

cgroup.max.descendants

Defaults to max. Sets the upper bound on the number of descendant cgroups.

cgroup.max.depth

Defaults to max. Sets the maximum depth to which the hierarchy may extend below this cgroup.

cgroup.stat

  • nr_descendants

Total number of visible descendant cgroups

  • nr_dying_descendants

Total number of dying descendant cgroups.

cgroup.freeze

Defaults to 0. Writing 1 freezes this cgroup and all the child cgroups under it; that is, all processes belonging to these cgroups are stopped.
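
A minimal sketch of freezing (the sleep process is illustrative): put a process into child1, freeze the cgroup, then confirm the state through cgroup.events:

root@4479fe71a3ec:/sys/fs/cgroup# sleep 1000 &
root@4479fe71a3ec:/sys/fs/cgroup# echo $! > child1/cgroup.procs
root@4479fe71a3ec:/sys/fs/cgroup# echo 1 > child1/cgroup.freeze
root@4479fe71a3ec:/sys/fs/cgroup# cat child1/cgroup.events
populated 1
frozen 1

# thaw the cgroup again
root@4479fe71a3ec:/sys/fs/cgroup# echo 0 > child1/cgroup.freeze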

Mapping a process to its cgroup

You can inspect a process's cgroup via /proc/${pid}/cgroup. The path shown there corresponds to a cgroup directory under /sys/fs/cgroup.

❯ ps  -ax | grep docker
211303 ? Ssl 10:39 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

# check dockerd's cgroup
❯ cat /proc/211303/cgroup
0::/system.slice/docker.service

❯ ls /sys/fs/cgroup/system.slice/docker.service
cgroup.controllers cgroup.subtree_control cpuset.cpus.partition hugetlb.1GB.events hugetlb.2MB.max memory.current memory.peak memory.zswap.max rdma.max
cgroup.events cgroup.threads cpuset.mems hugetlb.1GB.events.local hugetlb.2MB.numa_stat memory.events memory.pressure misc.current
cgroup.freeze cgroup.type cpuset.mems.effective hugetlb.1GB.max hugetlb.2MB.rsvd.current memory.events.local memory.reclaim misc.events
cgroup.kill cpu.idle cpu.stat hugetlb.1GB.numa_stat hugetlb.2MB.rsvd.max memory.high memory.stat misc.max
cgroup.max.depth cpu.max cpu.uclamp.max hugetlb.1GB.rsvd.current io.max memory.low memory.swap.current pids.current
cgroup.max.descendants cpu.max.burst cpu.uclamp.min hugetlb.1GB.rsvd.max io.pressure memory.max memory.swap.events pids.events
cgroup.pressure cpu.pressure cpu.weight hugetlb.2MB.current io.prio.class memory.min memory.swap.high pids.max
cgroup.procs cpuset.cpus cpu.weight.nice hugetlb.2MB.events io.stat memory.numa_stat memory.swap.max pids.peak
cgroup.stat cpuset.cpus.effective hugetlb.1GB.current hugetlb.2MB.events.local io.weight memory.oom.group memory.zswap.current rdma.current

# check the procs belonging to this cgroup
❯ cat /sys/fs/cgroup/system.slice/docker.service/cgroup.procs
211303

runc (docker) and cgroup v2

docker can change the resources of a container. This is done through the runc command. Looking at runc's source code, the Set() function that applies resource updates shows how each resource attribute is set.

// refer to https://github.com/opencontainers/runc/blob/release-1.1/libcontainer/cgroups/fs2/fs2.go#L150-L202

func (m *manager) Set(r *configs.Resources) error {
	if r == nil {
		return nil
	}
	if err := m.getControllers(); err != nil {
		return err
	}
	// pids (since kernel 4.5)
	if err := setPids(m.dirPath, r); err != nil {
		return err
	}
	// memory (since kernel 4.5)
	if err := setMemory(m.dirPath, r); err != nil {
		return err
	}
	// io (since kernel 4.5)
	if err := setIo(m.dirPath, r); err != nil {
		return err
	}
	// cpu (since kernel 4.15)
	if err := setCpu(m.dirPath, r); err != nil {
		return err
	}
	// devices (since kernel 4.15, pseudo-controller)
	//
	// When rootless is true, errors from the device subsystem are ignored because it is really not expected to work.
	// However, errors from other subsystems are not ignored.
	// see @test "runc create (rootless + limits + no cgrouppath + no permission) fails with informative error"
	if err := setDevices(m.dirPath, r); err != nil && !m.config.Rootless {
		return err
	}
	// cpuset (since kernel 5.0)
	if err := setCpuset(m.dirPath, r); err != nil {
		return err
	}
	// hugetlb (since kernel 5.6)
	if err := setHugeTlb(m.dirPath, r); err != nil {
		return err
	}
	// rdma (since kernel 4.11)
	if err := fscommon.RdmaSet(m.dirPath, r); err != nil {
		return err
	}
	// freezer (since kernel 5.2, pseudo-controller)
	if err := setFreezer(m.dirPath, r.Freezer); err != nil {
		return err
	}
	if err := m.setUnified(r.Unified); err != nil {
		return err
	}
	m.config.Resources = r
	return nil
}

Taking cpuset as an example, it too is configured simply by writing to the corresponding cgroup files.

// refer to https://github.com/opencontainers/runc/blob/release-1.1/libcontainer/cgroups/fs2/cpuset.go#L12C1-L28C2

func setCpuset(dirPath string, r *configs.Resources) error {
	if !isCpusetSet(r) {
		return nil
	}

	if r.CpusetCpus != "" {
		if err := cgroups.WriteFile(dirPath, "cpuset.cpus", r.CpusetCpus); err != nil {
			return err
		}
	}
	if r.CpusetMems != "" {
		if err := cgroups.WriteFile(dirPath, "cpuset.mems", r.CpusetMems); err != nil {
			return err
		}
	}
	return nil
}
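
The effect of these writes can be observed end to end. A minimal sketch (the container name demo is hypothetical, and the cgroup path assumes the systemd cgroup driver):

❯ docker run -d --name demo busybox sleep 1000
❯ docker update --cpuset-cpus "0-1" demo
❯ cat /sys/fs/cgroup/system.slice/docker-$(docker inspect -f '{{.Id}}' demo).scope/cpuset.cpus
0-1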

References

[1] Control Group V2

[2] [译] Control Group v2 (cgroupv2 权威指南) (KernelDoc, 2021)

[3] runc, https://github.com/opencontainers/runc
