January 12, 2021
Linux Container Primitives: Network and Block I/O Control Groups
Part eight of the Linux Container series
In the previous post of the Linux Container Primitives series, the basics of control groups were covered. This post illustrates the purpose of two cgroup controllers: Network and Block I/O. The following list shows the topics of all scheduled blog posts. It will be updated with the corresponding links as new posts are released.
- An Introduction to Linux Containers
- Linux Capabilities
- An Introduction to Namespaces
- The Mount Namespace and a Description of a Related Information Leak in Docker
- The PID and Network Namespaces
- The User Namespace
- Namespaces Kernel View and Usage in Containerization
- An Introduction to Control Groups
- The Network and Block I/O Controllers
- The Memory, CPU, Freezer and Device Controllers
- Control Groups Kernel View and Usage in Containerization
The Network Controller (v1)
This section covers the resource controllers net_cls and net_prio. Both attach identifiers to sockets when these are created by a process that is managed by one of the two controllers. The difference between the two is that net_prio assigns an ID that is unique for each control group, whereas net_cls uses a specified identifier that does not have to be unique for each cgroup, allowing flexible tagging of sockets in classes [1]. These identifiers allow quick checks to determine whether a socket originates from the same control group or class. This is more efficient than searching the control group tree, for example with the cgroup function cgroup_is_descendant(), especially if the check has to be performed regularly.
There are multiple use-cases for these additional socket attributes, among them:
- Setting the priority of network packets originating from a specific socket or device by using the network priority set by net_prio.
- Using iptables in combination with the net_cls class identifier to filter and route packets based on the control group membership.
- Scheduling network packets based on class identifiers.
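The first use-case can be sketched in a few lines of shell. In cgroup v1, net_prio exposes a file net_prio.ifpriomap per control group that accepts entries of the form "<interface> <priority>". The helper name set_ifprio, the cgroup path, the interface name and the priority value below are assumptions chosen for illustration, not part of the original setup.

```shell
# Minimal sketch: set the priority of traffic that processes of a cgroup
# send through a given interface.
# Assumption: $1 is a cgroup directory inside a mounted net_prio (v1) hierarchy.
set_ifprio() {
    local cgroup_dir="$1" iface="$2" prio="$3"
    # net_prio.ifpriomap expects one "<interface> <priority>" pair per write
    printf '%s %s\n' "$iface" "$prio" > "$cgroup_dir/net_prio.ifpriomap"
}

# Hypothetical usage (requires root and an existing cgroup):
#   set_ifprio /sys/fs/cgroup/net_prio/mygroup eth0 5
```

Passing the cgroup directory as a parameter keeps the helper independent of where the hierarchy is mounted.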
A simple usage example of net_cls that drops all IP-based traffic of all processes that are not members of a specific control group is shown below [2]:
# Create a mount point and mount the net_cls controller
root@box :~# mkdir /sys/fs/cgroup/net_cls
root@box :~# mount -t cgroup -o net_cls net_cls /sys/fs/cgroup/net_cls
# Create the cgroup and set its fixed class identifier (tc class 10:1)
root@box :~# mkdir /sys/fs/cgroup/net_cls/IPAllowed
root@box :~# echo 0x100001 > /sys/fs/cgroup/net_cls/IPAllowed/net_cls.classid
# Set up an HTB qdisc with one class and a cgroup classifier
root@box :~# tc qdisc add dev <interface> root handle 10: htb
root@box :~# tc class add dev <interface> parent 10: classid 10:1 \
    htb rate 40mbit
root@box :~# tc filter add dev <interface> parent 10: protocol ip \
    prio 10 handle 1: cgroup
# Disallow IP for all non-members
root@box :~# iptables -A OUTPUT -m cgroup ! \
    --cgroup 0x100001 -j DROP
# Add the current shell to the cgroup
root@box :~# echo $$ > /sys/fs/cgroup/net_cls/IPAllowed/cgroup.procs
-- Filtering active --
# Revert the traffic control settings
root@box :~# tc qdisc del dev <interface> root; \
    tc qdisc add dev <interface> root pfifo
The tc (Traffic Control) commands listed above set up an HTB (Hierarchical Token Bucket) qdisc (queueing discipline) with the handle 10: on a network interface, add the class 10:1 to it and attach a filter of the type cgroup. This filter classifies the traffic originating from a control group by its fixed class identifier: the classid 0x100001 corresponds to the tc class 10:1. With the cgroup match of iptables (-m cgroup), the same class identifier can then be used to add rules for a control group, in this case dropping all packets that do not originate from it.
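The relation between the tc class and the value written to net_cls.classid can be made explicit with a small helper. The classid has the form 0xAAAABBBB, where AAAA is the major and BBBB the minor number of the tc handle, both hexadecimal. The function name classid_from_handle is an assumption for illustration, and the sketch expects a complete "MAJOR:MINOR" handle.

```shell
# Minimal sketch: convert a tc class handle "MAJOR:MINOR" (hexadecimal parts,
# as used by tc) into the 0xAAAABBBB value expected by net_cls.classid.
classid_from_handle() {
    local major="${1%%:*}" minor="${1##*:}"
    # Upper 16 bits hold the major number, lower 16 bits the minor number
    printf '0x%x\n' $(( (0x$major << 16) | 0x$minor ))
}

# The class used in the listing above:
classid_from_handle 10:1    # prints 0x100001
```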
The net_* controllers are an example of controllers whose settings are not inherited throughout the hierarchy: child control groups are not automatically affected by the configuration of their parent.
The Block IO Controller (v1/v2)
The blkio (v1) / io (v2) controller is used to enforce I/O resource usage policies. The most common use-cases are:
- Specifying upper bandwidth limits, for example in the blkio.throttle.read_bps_device file to specify the maximum read bandwidth for a device in bytes per second. The rbps parameter in conjunction with the io.max file is the equivalent for version 2.
- Denying access to a specific device.
- Limiting with proportional time-based division: This allows setting weights for various control groups that are used to prioritize all device accesses when performing I/O operations. The blkio.weight file is present for this purpose in cgroup v1, whereas this is configured with io.weight in version 2.
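As a rough illustration of the first use-case, the following sketch writes a read bandwidth limit in both interface versions. Version 1 expects "<major>:<minor> <bytes_per_second>" in blkio.throttle.read_bps_device, version 2 expects "<major>:<minor> rbps=<bytes_per_second>" in io.max. The function names, the cgroup paths and the device number 8:0 are assumptions for illustration.

```shell
# Minimal sketch: limit the read bandwidth of a cgroup for one block device.
# Assumption: $1 is a cgroup directory in the matching hierarchy,
# $2 the device as "major:minor", $3 the limit in bytes per second.

# cgroup v1 (blkio controller)
limit_read_v1() {
    printf '%s %s\n' "$2" "$3" > "$1/blkio.throttle.read_bps_device"
}

# cgroup v2 (io controller)
limit_read_v2() {
    printf '%s rbps=%s\n' "$2" "$3" > "$1/io.max"
}

# Hypothetical usage (requires root), limiting reads to 1 MiB/s:
#   limit_read_v1 /sys/fs/cgroup/blkio/mygroup 8:0 1048576
#   limit_read_v2 /sys/fs/cgroup/mygroup 8:0 1048576
```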
Enforcing Limits in the Kernel
Bandwidth limits are enforced in blk_throtl_bio, which resides in block/blk-throttle.c. This function makes use of throtl_charge_bio to ultimately charge for the data volume used in an I/O operation. Depending on the resource usage, an I/O operation can be executed directly or may have to be queued and delayed to meet the resource limitations. For delayed operations, a dispatcher function then causes the pending operations to be executed using pre-calculated timers in order to throttle the requested operations. throtl_trim_slice calculates the required time limit, yielding the time slice in which the operation may be executed.
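The decision between direct execution and delay can be illustrated with a strongly simplified model. It is not the kernel implementation, only a sketch of the underlying arithmetic: within the current time slice a group may transfer limit * elapsed time bytes, and an operation exceeding this budget has to wait until the budget has grown far enough. The function name throttle_wait_ms and the millisecond granularity are assumptions.

```shell
# Simplified model of bandwidth throttling (not the kernel code).
# throttle_wait_ms BPS_LIMIT ELAPSED_MS BYTES_DISPATCHED BIO_BYTES
# Prints how many milliseconds the operation has to be delayed.
throttle_wait_ms() {
    local bps="$1" elapsed_ms="$2" dispatched="$3" bio="$4"
    # Bytes the group was allowed to transfer since the slice started
    local allowed=$(( bps * elapsed_ms / 1000 ))
    local needed=$(( dispatched + bio ))
    if [ "$needed" -le "$allowed" ]; then
        echo 0    # within budget: execute directly
    else
        # Time until the budget covers the excess bytes, rounded up
        echo $(( ((needed - allowed) * 1000 + bps - 1) / bps ))
    fi
}
```

With a limit of 1 MiB/s, a group that has already transferred 512 KiB after 500 ms has used up its budget, so a further 512 KiB operation is delayed by 500 ms in this model.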
To allow or deny access to a specific device, functions of security/device_cgroup.c are used. When cgroup configuration strings are written to the files present in the virtual control group file system, devcgroup_update_access parses this information and configures the control group accordingly, e.g. by setting flags indicating whether accessing a device is allowed or denied for the processes of a cgroup. This builds an exception list, as seen in the listing below. Upon accessing a block device, __blkdev_get (fs/block_dev.c) is called, which performs access checks prior to allowing access. To perform these checks, __devcgroup_check_permission (security/device_cgroup.c) is called, resulting in the following checks being performed using the internal exception list:
// current is the current task_struct
dev_cgroup = task_devcgroup(current);
if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW)
    // perform checks based on the exception list
    rc = !match_exception_partial(&dev_cgroup->exceptions,
                                  type, major, minor, access);
else
    rc = match_exception(&dev_cgroup->exceptions, type, major,
                         minor, access);
if (!rc)
    return -EPERM; // deny access
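From user space, entries of this exception list are created by writing to the devices.allow and devices.deny files of the cgroup v1 devices controller. Each entry has the form "<type> <major>:<minor> <access>", where the type is a (all), b (block) or c (character) and the access is a combination of r (read), w (write) and m (mknod). The helper name deny_block_device, the cgroup path and the device number below are assumptions for illustration.

```shell
# Minimal sketch: deny all access to one block device for a cgroup.
# Assumption: $1 is a cgroup directory in a mounted devices (v1) hierarchy,
# $2 the device as "major:minor".
deny_block_device() {
    # b = block device, rwm = read, write and mknod
    printf 'b %s rwm\n' "$2" > "$1/devices.deny"
}

# Hypothetical usage (requires root):
#   deny_block_device /sys/fs/cgroup/devices/mygroup 8:0
```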
Historically, the default scheduler for I/O operations in the Linux kernel was the CFQ (Completely Fair Queuing) scheduler, which was removed in Linux 5.0 together with the legacy block layer. It was extended to support the I/O related cgroup controllers after control groups were introduced in the kernel. This makes it possible to account for and constrain the I/O resources consumed by processes, for example using pre-defined weights. The CFQ I/O scheduler is implemented in block/cfq-iosched.c - not to be confused with kernel/sched/fair.c, where the process-related CFS (Completely Fair Scheduler) is implemented. The kernel structure cfq_group holds various settings per cgroup-device relationship. This includes applied policies and weights, which are considered in order to schedule I/O operations.
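The effect of such weights can be approximated with a simple calculation: in an idealized model, each control group that competes for a device receives a share of the device time proportional to its weight. The following sketch computes this share in percent. It is only an approximation under the assumption that all listed groups are continuously active, and the function name weight_share_pct is chosen for illustration.

```shell
# Idealized model of proportional weight division (not the scheduler itself).
# weight_share_pct OWN_WEIGHT WEIGHT...
# The weight list must contain the weights of all competing groups,
# including the group's own weight.
weight_share_pct() {
    local own="$1" total=0 w
    shift
    for w in "$@"; do
        total=$(( total + w ))
    done
    # Integer percentage of the device time, rounded down
    echo $(( own * 100 / total ))
}
```

In this model, a group with weight 500 competing with two groups of weight 250 each receives half of the device time.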
Next post in series
- Continue reading the next article in this series: The Memory, CPU, Freezer and Device Controllers
Credits
The elaboration and software project associated with this subject are results of a Master’s thesis created at SCHUTZWERK in collaboration with Aalen University by Philipp Schmied.