June 28, 2022
Linux Container Primitives: cgroup Kernel View and Usage in Containerization
Part ten of the Linux Container series
Disclaimer: This post is based on a Master’s thesis by Philipp Schmied, created at SCHUTZWERK in collaboration with Aalen University.
The previous post of the Linux Container Primitives series explains the internals of the cgroup kernel primitive. The following list shows the topics of all scheduled blog posts. It will be updated with the corresponding links as new posts are released.
- An Introduction to Linux Containers
- Linux Capabilities
- An Introduction to Namespaces
- The Mount Namespace and a Description of a Related Information Leak in Docker
- The PID and Network Namespaces
- The User Namespace
- Namespaces Kernel View and Usage in Containerization
- An Introduction to Control Groups
- The Network and Block I/O Controllers
- The Memory, CPU, Freezer and Device Controllers
- Control Groups Kernel View and Usage in Containerization
cgroup Kernel View
In the kernel source code, control groups are represented by the cgroup structure defined in linux/cgroup-defs.h. Every cgroup includes a unique ID, starting from the value 1 and always using the smallest value available. When the control group hierarchy is changed, the kernel frequently has to check whether one group is a descendant of another. To avoid traversing the control group tree for each check, every cgroup stores an integer value level holding its depth in the hierarchy, which allows the check to be performed with numerical comparisons.
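The level-based descendant check can be illustrated with a small user-space model. The following is a minimal sketch and not kernel code: the cg_node structure, the MAX_DEPTH limit and both function names are assumptions made for illustration. The per-node array of ancestor IDs is one possible way to complement level so that the check needs no tree traversal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEPTH 8 /* hypothetical depth limit for this sketch */

/* Simplified user-space model of a control group node (assumption). */
struct cg_node {
    uint64_t id;                      /* unique ID of this group */
    int level;                        /* depth in the tree, the root is 0 */
    uint64_t ancestor_ids[MAX_DEPTH]; /* IDs on the path from the root to this node */
};

/* Initialize node as a child of parent, or as the root if parent is NULL. */
void cg_node_init(struct cg_node *node, uint64_t id,
                  const struct cg_node *parent)
{
    node->id = id;
    node->level = parent ? parent->level + 1 : 0;
    assert(node->level < MAX_DEPTH);

    /* Copy the path of the parent, then append the node itself. */
    for (int i = 0; i < node->level; i++)
        node->ancestor_ids[i] = parent->ancestor_ids[i];
    node->ancestor_ids[node->level] = id;
}

/* Descendant check without traversal: one level and one ID comparison. */
bool cg_is_descendant(const struct cg_node *node,
                      const struct cg_node *ancestor)
{
    if (node->level < ancestor->level)
        return false;
    return node->ancestor_ids[ancestor->level] == ancestor->id;
}
```

With a root group, a child and a grandchild, cg_is_descendant(&grandchild, &root) needs two comparisons regardless of the depth of the tree. In this sketch, a group also counts as a descendant of itself.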
The control group infrastructure is initialized in cgroup_init (kernel/cgroup/cgroup.c):
[...]
BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0, 0));
[...]
for_each_subsys(ss, ssid) {
    [...]
    cgroup_init_subsys(ss, false);
    [...]
    css_populate_dir(init_css_set.subsys[ssid]);
    [...]
}
[...]
WARN_ON(sysfs_create_mount_point(fs_kobj, "cgroup"));
WARN_ON(register_filesystem(&cgroup_fs_type));
WARN_ON(register_filesystem(&cgroup2_fs_type));
WARN_ON(!proc_create_single("cgroups", 0, NULL,
                            proc_cgroupstats_show));
[...]
The function cgroup_setup_root initially sets up the control group root, which is represented by a cgroup_root structure. Internally, this structure includes the cgroup structure discussed above. The setup routine is also responsible for creating the kernfs, the virtual filesystem that exports the files residing in /sys/fs/cgroup. After that, all control group subsystems are enabled. With init_and_link_css, called in cgroup_init_subsys, pointers to the respective children, siblings and parent nodes are created. The abbreviation css stands for cgroup subsystem state in this context and is used to map a specific thread to a set of control groups [1]. The function css_populate_dir populates the kernfs created before with the virtual files of each controller. Finally, sysfs_create_mount_point creates the cgroup directory in sysfs that serves as the mount point for the kernfs. For each control group version, a filesystem type is registered, and the virtual /proc/cgroups file is created.
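The /proc/cgroups file registered at the end of cgroup_init can be inspected from user space. The following is a minimal sketch that assumes the usual four-column layout of the file (subsys_name, hierarchy, num_cgroups, enabled) with a header line starting with #. The structure and function names are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One data row of /proc/cgroups. */
struct cgroups_entry {
    char name[64];   /* controller name, e.g. "cpuset" */
    int hierarchy;   /* ID of the hierarchy the controller is attached to */
    int num_cgroups; /* number of control groups using the controller */
    int enabled;     /* 1 if the controller is enabled, 0 otherwise */
};

/* Parse one line; returns 1 on success, 0 for the header or malformed input. */
int parse_cgroups_line(const char *line, struct cgroups_entry *out)
{
    assert(line != NULL && out != NULL);
    if (line[0] == '#')
        return 0;
    return sscanf(line, "%63s %d %d %d", out->name, &out->hierarchy,
                  &out->num_cgroups, &out->enabled) == 4;
}

/*
 * Look up a controller by name in the given file.
 * Returns 1 if found (out is filled), 0 if not found, -1 on I/O error.
 * If the controller is not found, the content of out is unspecified.
 */
int lookup_controller(const char *path, const char *name,
                      struct cgroups_entry *out)
{
    char line[256];
    int found = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    while (!found && fgets(line, sizeof(line), fp) != NULL) {
        if (parse_cgroups_line(line, out) && strcmp(out->name, name) == 0)
            found = 1;
    }
    fclose(fp);
    return found;
}
```

Calling lookup_controller("/proc/cgroups", "memory", &entry) on a Linux system would then report whether the memory controller is known to the running kernel and whether it is enabled.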
There is a global array of all subsystems called cgroup_subsys, which is defined using an include directive for cgroup_subsys.h, as can be seen below. This header lists all controllers supported by the kernel.
struct cgroup_subsys *cgroup_subsys[] = {
#include <linux/cgroup_subsys.h>
};
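Filling an array through an include directive works because the included list consists of macro invocations whose meaning is defined right before the list is expanded. The following user-space sketch shows this pattern under stated assumptions: the DEMO_SUBSYS_LIST macro stands in for the header file, and all demo_ names as well as the three listed controllers are hypothetical. Which macro definitions the kernel uses at this point is not shown in the excerpt above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical controller list standing in for the included header. */
#define DEMO_SUBSYS_LIST \
    SUBSYS(cpuset)       \
    SUBSYS(cpu)          \
    SUBSYS(memory)

struct demo_subsys {
    const char *name;
};

/* First expansion: one numeric ID per controller. */
#define SUBSYS(x) x##_demo_id,
enum demo_subsys_id {
    DEMO_SUBSYS_LIST
    DEMO_SUBSYS_COUNT
};
#undef SUBSYS

/* Second expansion: one object per controller. */
#define SUBSYS(x) struct demo_subsys x##_demo_subsys = { .name = #x };
DEMO_SUBSYS_LIST
#undef SUBSYS

/* Third expansion: the global array, indexed by controller ID. */
#define SUBSYS(x) [x##_demo_id] = &x##_demo_subsys,
struct demo_subsys *demo_subsys[] = {
    DEMO_SUBSYS_LIST
};
#undef SUBSYS

/* Find a controller by name; returns NULL if it is not in the list. */
const struct demo_subsys *demo_subsys_find(const char *name)
{
    assert(name != NULL);
    for (int id = 0; id < DEMO_SUBSYS_COUNT; id++) {
        if (strcmp(demo_subsys[id]->name, name) == 0)
            return demo_subsys[id];
    }
    return NULL;
}
```

The benefit of this design is that a new controller only has to be added to the list once. The IDs, the objects and the array stay consistent because all three are generated from the same source.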
The controllers are implemented using another kernel structure: cgroup_subsys. This structure provides a common interface for all implemented resource controllers. Another common interface for all subsystems is the cftype structure, which enables every controller to define its own virtual files for exporting data. For example, the cpuset controller exports these files [2]:
Similar to the implementation of namespaces in the kernel, the task_struct structure also holds information regarding control groups [2]:
As can be observed in the figure above, the css_set structure associates a set of control group subsystems with a task. Every task with the same cgroup subsystem set has a pointer to the same css_set. This saves space in the task structure, which effectively speeds up fork calls.
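The effect of sharing one css_set between tasks can be sketched with a reference-counted structure. This is a simplified user-space model under stated assumptions and not the kernel implementation: all demo_ names, the integer array standing in for the per-controller state, and the plain integer reference count are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_NUM_CONTROLLERS 3 /* hypothetical number of controllers */

/* Shared set of per-controller states (model of a css_set). */
struct demo_css_set {
    int refcount;                           /* number of tasks using this set */
    int subsys_state[DEMO_NUM_CONTROLLERS]; /* stand-in for per-controller state */
};

/* A task only stores a pointer to the shared set. */
struct demo_task {
    int pid;
    struct demo_css_set *cgroups;
};

/* Allocate an empty set that no task references yet. */
struct demo_css_set *demo_css_set_new(void)
{
    return calloc(1, sizeof(struct demo_css_set));
}

/* Attach a task to a set and account for the new reference. */
void demo_task_attach(struct demo_task *task, int pid,
                      struct demo_css_set *set)
{
    assert(task != NULL && set != NULL);
    task->pid = pid;
    task->cgroups = set;
    set->refcount++;
}

/* Fork: the child inherits the set by copying a single pointer. */
void demo_fork(const struct demo_task *parent, struct demo_task *child,
               int child_pid)
{
    demo_task_attach(child, child_pid, parent->cgroups);
}

/* Exit: drop the reference and free the set when the last task leaves. */
void demo_task_exit(struct demo_task *task)
{
    if (--task->cgroups->refcount == 0)
        free(task->cgroups);
    task->cgroups = NULL;
}
```

In this model, the cost of demo_fork is independent of the number of controllers, because no per-controller state is copied.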
Internally, there is an M×N relationship between cgroups and css_sets. To link the kernel structures efficiently, the following link structure is in place:
For each process, there is exactly one leader thread whose thread ID is equal to the PID of the whole process. A css_set can be linked to multiple control groups because every single task can be present in various cgroups. For this reason, and to be able to traverse the link structure in the other direction, beginning from a cgroup, the linking structure cgrp_cset_link associates both kernel structures. The labels of the arrows are to be interpreted as UML associations. As shown above, the structure linkage also contains multiple shortcuts that allow efficient direct access to associated structures without having to traverse multiple lists. For example, the link from css_set to cgroup bypasses the linking structure in between.
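The role of a dedicated link structure in an M×N relationship can be sketched as follows. This is a simplified model under stated assumptions: it stores the links in one flat table instead of the lists used by the kernel, and all demo_ names and the MAX_LINKS limit are hypothetical. Each link entry references exactly one control group and one css_set, so the relationship can be followed in both directions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LINKS 16 /* hypothetical capacity of this sketch */

struct demo_cgroup { int id; };
struct demo_cset   { int id; };

/* One entry per (cgroup, css_set) pair, modeled on cgrp_cset_link. */
struct demo_link {
    struct demo_cgroup *cgrp;
    struct demo_cset *cset;
};

struct demo_link_table {
    struct demo_link links[MAX_LINKS];
    size_t count;
};

/* Associate a css_set with a control group; returns -1 if the table is full. */
int demo_link_add(struct demo_link_table *t, struct demo_cgroup *cgrp,
                  struct demo_cset *cset)
{
    assert(t != NULL && cgrp != NULL && cset != NULL);
    if (t->count >= MAX_LINKS)
        return -1;
    t->links[t->count].cgrp = cgrp;
    t->links[t->count].cset = cset;
    t->count++;
    return 0;
}

/* Direction 1: count the css_sets linked to a given control group. */
size_t demo_csets_of_cgroup(const struct demo_link_table *t,
                            const struct demo_cgroup *cgrp)
{
    size_t n = 0;
    for (size_t i = 0; i < t->count; i++)
        if (t->links[i].cgrp == cgrp)
            n++;
    return n;
}

/* Direction 2: count the control groups linked to a given css_set. */
size_t demo_cgroups_of_cset(const struct demo_link_table *t,
                            const struct demo_cset *cset)
{
    size_t n = 0;
    for (size_t i = 0; i < t->count; i++)
        if (t->links[i].cset == cset)
            n++;
    return n;
}
```

A flat table keeps the sketch short but requires a scan of all links for every query. Per-object lists, as described above, avoid this scan at the cost of additional pointers in each structure.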
Usage in Containerization
Container engines like LXC and Docker support the configuration of control group settings. Similar to the configuration of namespaces, command line parameters can be supplied to the Docker CLI, while LXC uses configuration files.
When starting a Docker container without any additional control group configuration, a docker group is created for each controller type mentioned in this series using permissive default values. Additional configuration can then be applied after starting a container using the mechanisms that have already been discussed. Furthermore, the command line options allow cgroup configuration without having to interact with the virtual control group filesystem. For example, the number of CPUs a container may use is configured by passing a numerical value along with the --cpus option. Another convenient feature is the ability to integrate a container into an existing parent control group. Control groups can therefore be prepared in advance and used later as the parent group of a container. With persistent cgroups, containers can be restricted in an automated way upon booting a machine by assigning a parent control group.
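How a fractional CPU count translates into a control group setting can be sketched as a quota per scheduling period. The following example assumes a period of 100000 microseconds and derives the quota from the value passed to --cpus. It is a sketch of the arithmetic only and not Docker's implementation: the function names and the DEMO_CFS_PERIOD_US constant are hypothetical. The formatted string follows the "quota period" layout of the cpu.max file in cgroup v2, while cgroup v1 exposes the two values in cpu.cfs_quota_us and cpu.cfs_period_us.

```c
#include <assert.h>
#include <stdio.h>

#define DEMO_CFS_PERIOD_US 100000L /* assumed scheduling period in microseconds */

/* Convert a CPU count such as 1.5 into a quota in microseconds per period. */
long demo_cpus_to_quota_us(double cpus, long period_us)
{
    assert(cpus > 0.0 && period_us > 0);
    /* Round to the nearest microsecond. */
    return (long)(cpus * (double)period_us + 0.5);
}

/* Format the limit as "<quota> <period>"; returns the snprintf result. */
int demo_format_cpu_max(char *buf, size_t len, double cpus, long period_us)
{
    return snprintf(buf, len, "%ld %ld",
                    demo_cpus_to_quota_us(cpus, period_us), period_us);
}
```

Under these assumptions, a container started with --cpus 1.5 may consume 150000 microseconds of CPU time in every period of 100000 microseconds, which corresponds to one and a half CPUs.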
Credits
The elaboration and software project associated with this subject are results of a Master’s thesis created at SCHUTZWERK in collaboration with Aalen University by Philipp Schmied.