June 28, 2022

Linux Container Primitives: cgroup Kernel View and Usage in Containerization

Part eleven of the Linux Container series

Disclaimer: The work described in this post results from a Master’s thesis written at SCHUTZWERK in collaboration with Aalen University by Philipp Schmied.

This post of the Linux Container Primitives series explains the internals of the cgroup kernel primitive and how container engines make use of control groups. The following list shows the topics of all scheduled blog posts. It will be updated with the corresponding links as new posts are released.

  1. An Introduction to Linux Containers
  2. Linux Capabilities
  3. An Introduction to Namespaces
  4. The Mount Namespace and a Description of a Related Information Leak in Docker
  5. The PID and Network Namespaces
  6. The User Namespace
  7. Namespaces Kernel View and Usage in Containerization
  8. An Introduction to Control Groups
  9. The Network and Block I/O Controllers
  10. The Memory, CPU, Freezer and Device Controllers
  11. Control Groups Kernel View and Usage in Containerization

cgroup Kernel View

In the kernel source code, control groups are represented by the cgroup structure defined in linux/cgroup-defs.h. Every cgroup has a unique ID. IDs start at 1, and a new group always receives the smallest free value. Changes to the control group hierarchy frequently require checking whether one group is a descendant of another. To avoid walking the control group tree for every such check, each cgroup stores an integer level, its depth in the hierarchy, so that the question can be answered with numerical comparisons.
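The level-based descendant check can be sketched in user-space C as follows. This is a simplified illustration, not the kernel's code: the structure name cgroup_sketch, the fixed MAX_DEPTH limit and the helper names are assumptions made for this example, and the per-group ancestor table is only loosely modelled on what the kernel keeps next to level.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 8 /* hypothetical limit, chosen for this sketch only */

/* Simplified stand-in for the kernel's struct cgroup. */
struct cgroup_sketch {
    int id;    /* unique ID of the group */
    int level; /* depth in the tree, the root is at level 0 */
    /* ancestors[i] is the ancestor at level i, ancestors[level] is the group itself */
    const struct cgroup_sketch *ancestors[MAX_DEPTH];
};

/* Initialise cg as a child of parent, or as a root if parent is NULL. */
static bool cgroup_sketch_init(struct cgroup_sketch *cg, int id,
                               const struct cgroup_sketch *parent)
{
    int level = parent ? parent->level + 1 : 0;

    if (level >= MAX_DEPTH)
        return false; /* tree deeper than this sketch supports */

    cg->id = id;
    cg->level = level;
    for (int i = 0; i < level; i++)
        cg->ancestors[i] = parent->ancestors[i];
    cg->ancestors[level] = cg;
    return true;
}

/* True if cg is ancestor itself or lies below it; no tree walk is needed. */
static bool is_descendant(const struct cgroup_sketch *cg,
                          const struct cgroup_sketch *ancestor)
{
    if (cg->level < ancestor->level)
        return false; /* a group can never sit below a deeper group */
    return cg->ancestors[ancestor->level] == ancestor;
}
```

The check costs one integer comparison and one array lookup, regardless of how deep the hierarchy is.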

The logic to initialize the control group subsystem is implemented in cgroup_init (kernel/cgroup/cgroup.c):

[...]
BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0, 0));
[...]
for_each_subsys(ss, ssid) {
    [...]
    cgroup_init_subsys(ss, false);
    [...]
    css_populate_dir(init_css_set.subsys[ssid]);
    [...]
}
[...]
WARN_ON(sysfs_create_mount_point(fs_kobj, "cgroup"));
WARN_ON(register_filesystem(&cgroup_fs_type));
WARN_ON(register_filesystem(&cgroup2_fs_type));
WARN_ON(!proc_create_single("cgroups", 0, NULL,
    proc_cgroupstats_show));
[...]

The function cgroup_setup_root sets up the control group root, which is represented by a cgroup_root structure. Internally, this structure embeds the cgroup structure discussed above. The setup routine is also responsible for creating the kernfs, the virtual filesystem that exports the files residing in /sys/fs/cgroup. After that, all control group subsystems are initialized. init_and_link_css, called from cgroup_init_subsys, sets up the pointers to the respective child, sibling and parent nodes. The abbreviation css stands for cgroup subsystem state in this context; it is used to map a specific thread to a set of control groups [1]. The function css_populate_dir creates the virtual files of each controller in the kernfs created before. Finally, sysfs_create_mount_point creates the mount point for the kernfs, a filesystem type is registered for each control group version, and the virtual /proc/cgroups file is created.
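The /proc/cgroups file registered at the end of cgroup_init can be read from user space. The following is a minimal sketch of such a reader, assuming the usual four-column layout (subsys_name, hierarchy, num_cgroups, enabled) preceded by a header line starting with '#'. The structure and function names are made up for this example.

```c
#include <stdbool.h>
#include <stdio.h>

/* One row of /proc/cgroups: subsys_name, hierarchy, num_cgroups, enabled. */
struct cgroup_stat {
    char name[32];
    int hierarchy;
    int num_cgroups;
    int enabled;
};

/* Parse one line; returns false for the '#' header line and malformed rows. */
static bool parse_cgroups_line(const char *line, struct cgroup_stat *out)
{
    if (line[0] == '#')
        return false;
    return sscanf(line, "%31s %d %d %d", out->name, &out->hierarchy,
                  &out->num_cgroups, &out->enabled) == 4;
}

/* Count the enabled controllers listed in the file at path.
 * Returns -1 if the file cannot be opened. */
static int count_enabled_controllers(const char *path)
{
    FILE *fp = fopen(path, "r");
    char line[256];
    struct cgroup_stat st;
    int enabled = 0;

    if (!fp)
        return -1;
    while (fgets(line, sizeof line, fp))
        if (parse_cgroups_line(line, &st) && st.enabled)
            enabled++;
    fclose(fp);
    return enabled;
}
```

On a Linux machine, calling count_enabled_controllers("/proc/cgroups") would report how many controllers the running kernel has enabled.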

A global array of all subsystems, called cgroup_subsys, is defined using an include directive for cgroup_subsys.h, as can be seen below. This header lists all controllers supported by the kernel.

struct cgroup_subsys *cgroup_subsys[] = {
    #include <linux/cgroup_subsys.h>
};
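Filling an array through an include directive is an instance of the so-called X-macro technique: the same list of entries is expanded several times with different definitions of one macro. The sketch below illustrates the idea in a self-contained way. It uses a list macro (SUBSYS_LIST) instead of re-including a header, only three example controllers, and a reduced structure named subsys_sketch; all of these are simplifications assumed for this example.

```c
#include <stddef.h>

/* Reduced stand-in for the kernel's struct cgroup_subsys. */
struct subsys_sketch {
    const char *name;
};

/* Stand-in for the header contents: one SUBSYS() entry per controller. */
#define SUBSYS_LIST \
    SUBSYS(cpuset)  \
    SUBSYS(cpu)     \
    SUBSYS(memory)

/* Expansion 1: one enum constant per controller, usable as array index. */
#define SUBSYS(x) x##_cgrp_id,
enum subsys_id { SUBSYS_LIST SUBSYS_COUNT };
#undef SUBSYS

/* Expansion 2: one controller object per entry. */
#define SUBSYS(x) static struct subsys_sketch x##_cgrp_subsys = { #x };
SUBSYS_LIST
#undef SUBSYS

/* Expansion 3: the global table, indexed by the enum constants above. */
#define SUBSYS(x) [x##_cgrp_id] = &x##_cgrp_subsys,
static struct subsys_sketch *subsys_table[] = { SUBSYS_LIST };
#undef SUBSYS
```

With this pattern, adding a controller means adding a single line to the list; the enum, the objects and the table stay in sync automatically.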

The controllers themselves are implemented using a kernel structure of the same name: cgroup_subsys. This structure provides a common interface for all resource controllers. Another common interface for all subsystems is the cftype structure, which lets each controller define its own virtual files to export data. For example, the cpuset controller exports these files [2]:

[Figure: Control Group Exports]

Similar to the implementation of namespaces in the kernel, the task_struct structure also holds information regarding control groups [2]:

[Figure: Control Group Structures]

As can be observed in the figure above, the css_set structure associates a set of control group subsystem states with a task. All tasks with the same set of cgroup subsystem states point to the same css_set. This saves space in the task structure and speeds up fork calls, because a forked task can reuse the css_set of its parent instead of copying one pointer per subsystem.
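The sharing of a css_set between tasks can be sketched with a reference counter. The following user-space model is an illustration only: the structure names, the two example membership fields and the helpers task_fork and task_exit are assumptions of this sketch and leave out the locking and RCU handling a kernel implementation needs.

```c
#include <stddef.h>

/* Simplified stand-in for css_set: shared by all tasks with equal membership. */
struct css_set_sketch {
    int refcount;             /* number of tasks pointing at this set */
    const char *cpu_group;    /* hypothetical per-controller membership */
    const char *memory_group; /* hypothetical per-controller membership */
};

/* Simplified stand-in for task_struct: a single pointer, not one per controller. */
struct task_sketch {
    int pid;
    struct css_set_sketch *cgroups;
};

/* Fork: the child shares the parent's set, so only a counter is touched. */
static void task_fork(const struct task_sketch *parent,
                      struct task_sketch *child, int pid)
{
    child->pid = pid;
    child->cgroups = parent->cgroups;
    child->cgroups->refcount++;
}

/* Exit: drop the reference; the set stays alive while other tasks use it. */
static void task_exit(struct task_sketch *task)
{
    task->cgroups->refcount--;
    task->cgroups = NULL;
}
```

Regardless of how many controllers exist, the fork path in this model copies one pointer and increments one counter.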

Internally, there is an M:N relationship between cgroups and css_sets. To link the kernel structures efficiently, the following link structure is in place:

[Figure: Control Group Structure Links]

For each process, there exists exactly one leader thread whose thread ID equals the PID of the whole process. A css_set can be linked to multiple control groups because a single task can be a member of several cgroups at once. For this reason, and to allow traversal in the opposite direction starting from a cgroup, the linking structure cgrp_cset_link associates both kernel structures. The arrow labels are to be interpreted as UML associations. As shown above, the structure linkage also contains several shortcuts that allow efficient direct access to associated structures without traversing multiple lists. For example, the link from css_set to cgroup bypasses the linking structure in between.
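The role of the linking structure can be illustrated with a small model in which every link object is a member of two lists at once, one per side of the M:N relationship. This is a sketch under simplifying assumptions: it uses plain singly linked lists instead of the kernel's list_head, and the names link_s, cgrp_s and cset_s are invented for this example.

```c
#include <stddef.h>

struct cgrp_s;
struct cset_s;

/* One link per (css_set, cgroup) pair; it sits on two lists simultaneously. */
struct link_s {
    struct cgrp_s *cgrp;
    struct cset_s *cset;
    struct link_s *next_in_cgrp; /* next link on the cgroup's list */
    struct link_s *next_in_cset; /* next link on the css_set's list */
};

struct cgrp_s { const char *name; struct link_s *links; };
struct cset_s { int id; struct link_s *links; };

/* Connect cset and cgrp by pushing l onto both lists. */
static void link_css_set(struct link_s *l, struct cset_s *cset,
                         struct cgrp_s *cgrp)
{
    l->cgrp = cgrp;
    l->cset = cset;
    l->next_in_cgrp = cgrp->links;
    cgrp->links = l;
    l->next_in_cset = cset->links;
    cset->links = l;
}

/* Traverse from a cgroup to count the css_sets attached to it. */
static size_t count_csets_of(const struct cgrp_s *cgrp)
{
    size_t n = 0;
    for (const struct link_s *l = cgrp->links; l; l = l->next_in_cgrp)
        n++;
    return n;
}

/* Traverse from a css_set to count the cgroups it belongs to. */
static size_t count_cgrps_of(const struct cset_s *cset)
{
    size_t n = 0;
    for (const struct link_s *l = cset->links; l; l = l->next_in_cset)
        n++;
    return n;
}
```

Because each link carries both pointers and both list memberships, the relationship can be walked from either side without searching.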

Usage in Containerization

Container engines like LXC and Docker support the configuration of control group settings. Similar to the configuration of namespaces, Docker accepts command line parameters via its CLI, while LXC uses configuration files.

When a Docker container is started without any additional control group configuration, a docker group with permissive default values is created for each controller type discussed in this series. Additional configuration can then be applied after the container has started, using the mechanisms that have already been discussed. Furthermore, the command line options allow cgroup configuration without interacting with the virtual control group filesystem. For example, the number of CPUs a container may use is configured by passing a numerical value to the --cpus option. Another convenient feature is the ability to place a container into an existing parent control group. Control groups can therefore be prepared in advance and used as the parent group of a container later on. Combined with persistent cgroups, this allows containers to be restricted automatically when a machine boots by assigning a parent control group.
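To make the --cpus option more concrete: Docker's documentation describes --cpus="1.5" as equivalent to a CPU period of 100000 microseconds combined with a quota of 150000 microseconds. The following sketch reproduces that arithmetic and formats the result as a "quota period" pair, the layout used by the cgroup v2 cpu.max file. The function names and the fixed default period are assumptions of this example and do not reflect how Docker itself is implemented.

```c
#include <stdio.h>

/* Scheduler period in microseconds, taken from the Docker documentation example. */
#define DEFAULT_CFS_PERIOD_US 100000L

/* Convert a fractional CPU count into a quota in microseconds per period.
 * Returns -1 for non-positive input. */
static long cpus_to_quota_us(double cpus, long period_us)
{
    if (cpus <= 0.0 || period_us <= 0)
        return -1;
    return (long)(cpus * (double)period_us + 0.5); /* round to nearest */
}

/* Write "<quota> <period>" into buf. Returns 0 on success, -1 on error. */
static int format_cpu_max(char *buf, size_t len, double cpus)
{
    long quota = cpus_to_quota_us(cpus, DEFAULT_CFS_PERIOD_US);
    int n;

    if (quota < 0)
        return -1;
    n = snprintf(buf, len, "%ld %ld", quota, DEFAULT_CFS_PERIOD_US);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A value of 1.5 therefore lets the container's tasks run for a combined 150 ms of CPU time in every 100 ms window.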

Credits

The elaboration and software project associated with this subject are results of a Master’s thesis created at SCHUTZWERK in collaboration with Aalen University by Philipp Schmied.

References

~ Philipp Schmied
