October 29, 2019
Linux Container Primitives: An Introduction to Namespaces
Part two of the Linux Container series
The following list shows the topics of all scheduled blog posts regarding Linux containers. It will be updated with the corresponding links as new posts are released.
- An Introduction to Linux Containers
- Linux Capabilities
- An Introduction to Namespaces
- The Mount Namespace and a Description of a Related Information Leak in Docker
- The PID and Network Namespaces
- The User Namespace
- Namespaces Kernel View and Usage in Containerization
- An Introduction to Control Groups
- The Network and Block I/O Controllers
- The Memory, CPU, Freezer and Device Controllers
- Control Groups Kernel View and Usage in Containerization
First introduced in Linux kernel version 2.4.19 in 2002, namespaces define groups of processes that share a common view of specific system resources. In effect, a namespace isolates the view that a group of processes has of a system resource: a process can, for instance, have its own hostname while the real hostname of the system has an entirely different value.
Several namespace types exist. As of Linux kernel version 4.19, the following types are available:
- UTS
- Mount
- PID
- Network
- IPC (Inter Process Communication)
- Control Group
- User
Consider the UTS namespace as an example: every process in a UTS namespace shares the hostname with every other process in the same UTS namespace, while the value of the hostname is isolated between different UTS namespaces. The code listing below illustrates this:
root@box:~# hostname # Get the hostname
box
root@box:~# unshare -u # Create a sub shell in a new UTS namespace
$ hostname # Get the hostname in the created UTS namespace
box
$ hostname anotherbox # Change the hostname in the UTS namespace
$ hostname # Verify that it has been changed
anotherbox
$ exit # Exit the UTS namespace shell
root@box:~# hostname # Verify that the real hostname of the parent UTS namespace hasn't changed
box
Namespaces can also be used in combination by making a process a member of multiple new namespaces at once. This is useful for containerization, where several resources have to be isolated simultaneously.
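As a rough sketch of how such a combination is expressed in C: each namespace type corresponds to a CLONE_NEW* flag from <sched.h>, and several flags are combined with a bitwise OR. The helper names ns_flag_for and ns_flags_for as well as the lookup table are hypothetical and exist only for this illustration. The type names follow the entries found under /proc/<PID>/ns.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table: namespace type name -> clone flag. */
struct ns_flag {
    const char *name;
    int flag;
};

static const struct ns_flag NS_FLAGS[] = {
    {"uts", CLONE_NEWUTS},       {"mnt", CLONE_NEWNS},
    {"pid", CLONE_NEWPID},       {"net", CLONE_NEWNET},
    {"ipc", CLONE_NEWIPC},       {"cgroup", CLONE_NEWCGROUP},
    {"user", CLONE_NEWUSER},
};

/* Returns the flag for a single namespace type, or 0 if it is unknown. */
int ns_flag_for(const char *name) {
    for (size_t i = 0; i < sizeof(NS_FLAGS) / sizeof(NS_FLAGS[0]); i++) {
        if (strcmp(NS_FLAGS[i].name, name) == 0)
            return NS_FLAGS[i].flag;
    }
    return 0;
}

/* Combines the flags of several namespace types with a bitwise OR.
 * Returns -1 if one of the names is unknown. */
int ns_flags_for(const char *const names[], size_t count) {
    int flags = 0;
    for (size_t i = 0; i < count; i++) {
        int flag = ns_flag_for(names[i]);
        if (flag == 0)
            return -1;
        flags |= flag;
    }
    return flags;
}
```

The resulting value can be passed as the flags argument to the system calls discussed later in this post, for example to request a new UTS and a new network namespace in one call.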
It’s important to note that by default every process is a member of one namespace of each type listed above. These namespaces are called default, init or root namespaces. Unless additional namespace configuration is in place, processes and all their children reside in these namespaces. This can be verified by executing lsns in two different terminals to list all namespaces and comparing the namespace identifiers, which will be equal.
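The same comparison can be sketched programmatically. The kernel exposes the namespaces of each process as symbolic links under /proc/<PID>/ns, and two processes are in the same namespace of a given type if the device ID and inode number of the corresponding links match. The function name same_namespace is an assumption made for this example and not part of any library.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if both PIDs are members of the same namespace of the given
 * type ("uts", "net", "pid", ...), 0 if they are not, and -1 on error. */
int same_namespace(pid_t a, pid_t b, const char *type) {
    char path_a[64], path_b[64];
    struct stat st_a, st_b;

    snprintf(path_a, sizeof(path_a), "/proc/%d/ns/%s", (int)a, type);
    snprintf(path_b, sizeof(path_b), "/proc/%d/ns/%s", (int)b, type);

    /* stat() follows the symbolic link to the namespace object itself. */
    if (stat(path_a, &st_a) == -1 || stat(path_b, &st_b) == -1)
        return -1;

    return st_a.st_dev == st_b.st_dev && st_a.st_ino == st_b.st_ino;
}
```

Inspecting the links of another user's process may fail due to missing permissions, in which case the sketch simply reports an error.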
The isolation provided by namespaces is highly configurable and flexible. For instance, it’s possible for a database application and a web application to share the same network namespace, allowing both processes to communicate while other processes that reside in other network namespaces are not able to do so.
System Calls
Three system calls are commonly used in conjunction with namespaces:
- clone: Creates child processes.
- unshare: Disables sharing a namespace with the parent process, effectively unsharing a namespace and its associated resources. Please note that this system call changes the shared namespaces in place without the requirement of spawning a new process. The PID namespace is an exception in this case, as discussed in one of the next blog posts.
- setns: Attaches a process to an already existing namespace.
Similar to the fork system call, clone is used to create child processes. There are multiple differences between the two calls. The most significant one is that clone can be highly parametrized using flags. For example, it allows sharing parts of the execution context, such as the process memory space, with a child process. clone can therefore also create threads and is more versatile than the legacy fork call, which does not support most of this behavior. Instead, fork is essentially used to create child processes as copies of a parent process. Before going into more detail about the differences and similarities, it’s important to understand what happens when fork, clone or a system call in general is invoked in a C program.
fork() != fork
When one of these two calls is used in C code, the code that is actually executed is not the system call itself as defined in the system call table. Instead, a wrapper function of the C library is called, which is often named after the system call it wraps. These wrappers exist because using them is more convenient for developers than using system calls directly. For instance, to use a system call it’s necessary to set up registers, switch to kernel mode, handle the call results and switch back to user mode [1]. Performing these tasks once in the wrapper’s code simplifies this.
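To make the difference between a wrapper and the underlying system call more tangible, the following sketch obtains the process ID in two ways: through the dedicated C library wrapper getpid() and through the generic syscall() function with the system call number SYS_getpid. getpid was chosen for this illustration only because it is simple and has no side effects. The helper name raw_getpid is made up for this example.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invokes the getpid system call by number instead of using the
 * dedicated getpid() wrapper of the C library. */
pid_t raw_getpid(void) {
    return (pid_t)syscall(SYS_getpid);
}
```

Both raw_getpid() and getpid() are expected to return the same value. Note that syscall() is itself a small library function that takes care of the register setup and the mode switch.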
When inspecting the fork wrapper function, it becomes clear that the fork system call that’s supposed to be wrapped is not used at all. Instead, the ARCH_FORK macro is invoked, which is an inline system call to clone defined as:
#define ARCH_FORK() \
INLINE_SYSCALL (clone, 4, \
CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, 0, \
NULL, &THREAD_SELF->tid)
As a result, the clone and fork library functions invoke the same system call, namely clone. This can be verified by compiling a C application containing a call to fork and using strace to trace the system calls made when executing the resulting binary: no calls to fork are present, only a call to clone with SIGCHLD among its flags. The legacy fork call is thus implemented with a call to clone. The reason is that clone is a more powerful and configurable call than fork, making it possible to replace fork entirely with clone to spawn processes and threads.
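The idea of expressing fork through clone can be sketched with the raw system call. The example assumes x86-64 or AArch64 Linux, since the argument order of the raw clone system call differs between architectures. It passes only SIGCHLD and no stack, so the child continues on a copy of the parent's stack, as with fork. Unlike the real fork wrapper, this simplified version does not set the thread ID flags shown in the macro above and does not run handlers registered with pthread_atfork. The names fork_via_clone and child_exit_code are hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Creates a child process with the raw clone system call.
 * Returns 0 in the child, the child's PID in the parent, -1 on error. */
pid_t fork_via_clone(void) {
    /* Assumed argument order: flags, stack, and three unused pointers. */
    return (pid_t)syscall(SYS_clone, (unsigned long)SIGCHLD, 0L, 0L, 0L, 0L);
}

/* Waits for the given child and returns its exit code, or -1 on error. */
int child_exit_code(pid_t pid) {
    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A caller would use fork_via_clone() like fork(): the child branch handles a return value of 0 and terminates with _exit(), while the parent collects the result with child_exit_code().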
Using clone()
The clone system call accepts various flags to configure the process creation. For namespaces, a subset of these flags specifies the new namespaces a process will join. When such a flag is supplied, the child process is initialized with a modified copy of the parent’s namespace configuration: it becomes a member of a new namespace for each type represented by a flag. If the flag of a namespace type is not specified, the process remains in the parent’s namespace of that type, which provides no additional level of isolation. Consider the UTS namespace example from above: the clone flag responsible for the creation of such a namespace is CLONE_NEWUTS.
The clone call has the following prototype, which allows specifying the flags described above:
int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg)
- child_func: Pointer to the function executed by the child process.
- child_stack: The downwardly growing stack the child will operate on.
- flags: Integer value representing all flags in use, combined with a bitwise OR.
- arg: Argument that is passed to child_func.
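The following code snippet shows a minimal example of using clone to spawn a shell process in a new UTS namespace. Please note that this minimal example omits error handling and does not check return values. Creating the namespace requires the CAP_SYS_ADMIN capability, for example by running the program as root.
#define _GNU_SOURCE // Required for clone() and the CLONE_* flags
#include <sched.h>
#include <stdlib.h>
#include <sys/types.h>
#define STACKSIZE 8192
char *childStack;
char *childStackTop;
pid_t childPid;
// Will be executed by the child process
int run(void *arg) {
    system("/bin/sh");
    return 0;
}
int main(int argc, char const *argv[]) {
    childStack = (char *)malloc(STACKSIZE);
    // The stack grows downward, so clone() expects a pointer to its top
    childStackTop = childStack + STACKSIZE;
    // CLONE_VFORK suspends the parent until the child has exited
    childPid = clone(run, childStackTop,
                     CLONE_VFORK | CLONE_NEWUTS, NULL);
    return 0;
}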
After compiling and running this example, a shell in a new UTS namespace is spawned.
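A variant with the return value checks that the minimal example leaves out might look as follows. This is only a sketch: the function names spawn_and_wait and child_returns_arg, the stack size of 1 MiB and the decision to wait for the child with waitpid are choices made for this illustration. Passing 0 as ns_flags creates an ordinary child process, which works without privileges. Passing CLONE_NEWUTS requires the CAP_SYS_ADMIN capability.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

/* Sample child function: its return value becomes the exit status. */
int child_returns_arg(void *arg) {
    return *(int *)arg;
}

/* Runs child_func(arg) in a child created by clone(), optionally in new
 * namespaces, and returns the child's exit status or -1 on error. */
int spawn_and_wait(int (*child_func)(void *), void *arg, int ns_flags) {
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward, so the top of the buffer is passed.
     * SIGCHLD lets the parent wait for the child like after fork(). */
    pid_t pid = clone(child_func, stack + STACK_SIZE, ns_flags | SIGCHLD, arg);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because CLONE_VM is not set, the child operates on a copy of the parent's memory, so a pointer to a local variable of the parent can safely be passed as arg.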
unshare()
This system call allows processes to disable sharing namespaces after they have been created. In contrast, clone moves child processes into namespaces while creating them. There is also a CLI application called unshare that creates a new process and unshares specific system resources, whereas using the system call in a C program isolates a process at runtime without creating a new process:
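#define _GNU_SOURCE // Required for unshare() and CLONE_NEWUTS
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char const *argv[]) {
    unshare(CLONE_NEWUTS); // No error handling
    // Print hostname
    system("hostname");
    // Set hostname
    system("hostname testname");
    // Print hostname again
    system("hostname");
    // Print the available namespaces
    system("lsns");
    // Spawn a shell; execvp expects a NULL-terminated argument array
    char *shellArgs[] = {"/bin/sh", NULL};
    execvp(shellArgs[0], shellArgs);
    return 0;
}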
Compiling and running the program produces output similar to the following listing:
root@box:~# ./a.out
box
testname
[...]
NS TYPE NPROCS PID USER COMMAND
[...]
4026531835 cgroup 336 1 root /sbin/init splash
4026531836 pid 293 1 root /sbin/init splash
4026531837 user 293 1 root /sbin/init splash
4026531838 uts 333 1 root /sbin/init splash
4026531839 ipc 336 1 root /sbin/init splash
4026531840 mnt 326 1 root /sbin/init splash
4026532009 net 292 1 root /sbin/init splash
[...]
4026532436 uts 3 27750 root ./a.out
[...]
root@box:~# lsns | grep uts
4026531838 uts 134 803 user /bin/sh
As seen above, the created process is able to set its own hostname, isolating the value of this setting in place. Additionally, lsns executed from the spawned process lists two UTS namespaces, whereas the parent process lists only one. According to the namespace IDs, one of these UTS namespaces is the default namespace of that type, which is still used by the other processes on the system. The other namespace is the one that has been created using the unshare call and that the spawned process is now a member of.
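The same steps can also be sketched without spawning the hostname utility, using the sethostname and gethostname library functions and checking every return value. The function name unshare_and_set_hostname is hypothetical. As before, the call to unshare is expected to fail with EPERM when the process lacks the CAP_SYS_ADMIN capability.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <unistd.h>

/* Moves the calling process into a new UTS namespace, sets the hostname
 * there and reads it back into out. Returns 0 on success, -1 on error
 * (errno is set by the failing call). */
int unshare_and_set_hostname(const char *name, char *out, size_t out_len) {
    if (out_len == 0)
        return -1;
    if (unshare(CLONE_NEWUTS) == -1)
        return -1;
    if (sethostname(name, strlen(name)) == -1)
        return -1;
    if (gethostname(out, out_len) == -1)
        return -1;
    out[out_len - 1] = '\0'; /* gethostname may not terminate on truncation */
    return 0;
}
```

Since unshare acts on the calling process, the hostname change is visible to the caller and its future children only, while the hostname of the original UTS namespace stays untouched.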
The command lsns gathers the displayed information by checking the contents of the /proc/<PID>/ns directory for each PID. In the context of containers, it should not be possible to obtain information on parent namespaces as in this example. When creating a container, the /proc directory or sensitive parts of it should therefore be isolated to prevent this information leak.
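What such a lookup involves for a single process can be sketched as follows: the entries of /proc/<PID>/ns are symbolic links whose targets have the form type:[identifier]. This is a simplified illustration and not the actual implementation of lsns. The function name print_namespaces is made up for this example.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Prints every entry of /proc/<pid>/ns together with its link target.
 * Returns the number of entries printed, or -1 if the directory
 * cannot be opened. */
int print_namespaces(pid_t pid, FILE *out) {
    char dir_path[64];
    snprintf(dir_path, sizeof(dir_path), "/proc/%d/ns", (int)pid);

    DIR *dir = opendir(dir_path);
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')
            continue; /* skip "." and ".." */

        char link_path[PATH_MAX];
        char target[64];
        snprintf(link_path, sizeof(link_path), "%s/%s", dir_path, entry->d_name);

        /* readlink() does not append a terminating null byte. */
        ssize_t len = readlink(link_path, target, sizeof(target) - 1);
        if (len == -1)
            continue;
        target[len] = '\0';

        fprintf(out, "%-20s %s\n", entry->d_name, target);
        count++;
    }
    closedir(dir);
    return count;
}
```

Calling print_namespaces(getpid(), stdout) lists the namespaces of the current process. Calling it with the PID of a process outside a container succeeds only if /proc exposes that process, which is exactly the leak described above.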
setns()
To add processes to already existing namespaces, setns is used. It disassociates the calling process from its original namespace and associates it with another namespace of the same type. The prototype of this system call is as follows:
int setns(int fd, int nstype)
- fd: File descriptor referring to one of the symbolic links in /proc/<PID>/ns, each of which represents a specific namespace.
- nstype: This parameter is designated for checks regarding the namespace type. By passing a CLONE_NEW* flag, the namespace type of the first parameter is checked before entering the namespace. This makes sure that the passed file descriptor indeed points to the desired namespace type. To disable this check, a zero value can also be used for this parameter.
A simple example that invokes a command in a given namespace is shown in the following code listing ([2], modified):
[...]
// Get namespace file descriptor
int fd = open(argv[1], O_RDONLY);
// Join the namespace
setns(fd, 0);
// Execute a command in the namespace
execvp(argv[2], &argv[2]);
[...]
The code above joins the given namespace and then replaces the running program with the specified command, which therefore resides in a different namespace than the process that launched it.
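A slightly more defensive version of the open and setns steps could look like the sketch below. It passes a CLONE_NEW* flag as nstype so that the kernel verifies the namespace type, and it closes the file descriptor afterwards. The function name join_namespace is an assumption for this example. Joining a namespace generally requires the CAP_SYS_ADMIN capability.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Joins the namespace referred to by ns_path (for example
 * "/proc/1/ns/uts"). nstype is a CLONE_NEW* flag used for the type
 * check, or 0 to skip the check. Returns 0 on success, -1 on error. */
int join_namespace(const char *ns_path, int nstype) {
    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    int rc = setns(fd, nstype);

    /* The descriptor is no longer needed once the namespace is joined.
     * Preserve errno from setns() across close(). */
    int saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return rc;
}
```

After a successful call, a command started with execvp runs inside the joined namespace, as in the listing above.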
To perform this from a CLI, the nsenter application can be used. Consider a shell process residing in a separate UTS namespace: by executing nsenter -a -t 1, the process is moved into all namespaces of the system initialization process with PID 1. This effectively reverts the call to unshare, making the original UTS namespace available to the shell process. Changing the hostname from the shell process now affects the hostname of the default namespace and therefore the system’s hostname. As a result, it’s important to prevent these kinds of setns calls by isolating the exposed namespaces. This illustrates that namespaces alone may not be sufficient to prevent system modifications.
Next post in series
- Continue reading the next article in this series: The Mount Namespace and a Description of a Related Information Leak in Docker
References / Credits
Credits: The elaboration and software project associated with this subject are results of a Master’s thesis created at SCHUTZWERK in collaboration with Aalen University by Philipp Schmied.