Linux’s network namespaces are the coolest thing since Windows Vista. OK, that’s hardly a fair comparison, but I am talking about a feature that was introduced in Linux 2.6.24, which shipped in January of 2008, roughly one year after Vista was released. One of these things is still very relevant today.

The kernel has a man page[1] with a concise description of the feature. I like this sentence from man ip-netns[2]:

A network namespace is logically another copy of the network stack, with its own routes, firewall rules, and network devices.

What does that mean?

Linux namespaces are similar to namespaces in a programming language: they group a set of related resources and isolate them from another group. Two identical entities can exist in separate namespaces without causing a conflict. Container tools like Docker or Podman use a network namespace to isolate a containerized process from the host’s network, though they also usually link to the host for internet access.

Perhaps the most common way to create a network namespace is through iproute2:

$ ip l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: end0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 72:42:40:55:08:0b brd ff:ff:ff:ff:ff:ff
$ sudo ip netns add asdf
$ sudo ip -n asdf l     # shorthand for `ip netns exec asdf ip l`
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Inside the namespace (note the -n asdf), my computer’s Ethernet device no longer exists, and the loopback device is a different one (notice that its state is DOWN, unlike the host’s).

How does it work?

Go see for yourself! The source code[3] is not too difficult to follow.

The syscall that creates a namespace is called unshare()[4]. The Linux docs have an entire page[5] dedicated to it. There are many types of namespaces, but the network ones are created with the CLONE_NEWNET flag. Each network namespace is represented in the kernel by its own instance of struct net, defined at the top of net_namespace.h, which holds the namespace’s network devices among many other fields.

After creating the namespace, iproute2 effectively stores a reference to it by mount()ing the current process’s namespace handle onto the filesystem. More specifically, it bind-mounts /proc/self/ns/net onto /var/run/netns/<name>. Otherwise the namespace would cease to exist as soon as the ip netns add <name> command exited.
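Here is a minimal sketch of that create-and-pin sequence in C. It is not iproute2's actual code: netns_path() and netns_add() are names I made up for illustration, the run directory is assumed to exist already, and the real tool does extra work (such as setting up the run directory's mount propagation) that is left out here.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

/* Assumed run directory, matching the path used in the text. */
#define NETNS_RUN_DIR "/var/run/netns"

/* Build "<run dir>/<name>" into buf. Returns 0, or -1 if it does not fit. */
int netns_path(const char *name, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%s/%s", NETNS_RUN_DIR, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: create a network namespace and pin it under
 * NETNS_RUN_DIR. Needs CAP_SYS_ADMIN. On success the calling process
 * itself has moved into the new namespace. Returns 0 or -1. */
int netns_add(const char *name)
{
    char path[4096];
    if (netns_path(name, path, sizeof(path)) < 0)
        return -1;

    /* An empty regular file serves as the mount point. */
    int fd = open(path, O_RDONLY | O_CREAT | O_EXCL, 0);
    if (fd < 0)
        return -1;
    close(fd);

    /* Detach from the current network stack... */
    if (unshare(CLONE_NEWNET) < 0)
        goto fail;

    /* ...and pin the new one by bind-mounting its handle over the file. */
    if (mount("/proc/self/ns/net", path, "none", MS_BIND, NULL) < 0)
        goto fail;
    return 0;

fail:
    unlink(path);
    return -1;
}
```

Calling netns_add("asdf") as root should leave behind an nsfs mount like the one in the mount output, and the namespace outlives the process because the mount holds a reference to it.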

$ ls -lh /var/run
lrwxrwxrwx 1 root root 4 Apr 15 15:59 /var/run -> /run
$ ls /run/netns/
asdf
$ mount | rg netns
tmpfs on /run/netns type tmpfs (rw,nosuid,nodev,noexec,relatime,size=395652k,mode=755,inode64)
nsfs on /run/netns/asdf type nsfs (rw)
nsfs on /run/netns/asdf type nsfs (rw)

So far I have only mentioned how to create a namespace, but how do ip -n <name> ... and ip netns exec <name> ... work? For that, we have to use another syscall: setns()[6]. It shares the meaning of the flags with unshare, but also accepts a file descriptor. Go figure: in Linux you reference a namespace with a file. This file descriptor is obtained by simply opening the file that was previously mounted, i.e. /run/netns/asdf.

setns essentially moves a thread to a different namespace. Combine that with one of the exec* syscalls and you can execute a different program in the namespace.
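A rough sketch of that setns-then-exec combination looks like this. exec_in_netns() is a hypothetical helper, not something from iproute2, and the real ip netns exec does additional setup that this skips.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] inside the network namespace pinned at
 * ns_path (e.g. "/run/netns/asdf"). Returns -1 on failure; on success it
 * does not return, because execvp() replaces the process image. */
int exec_in_netns(const char *ns_path, char *const argv[])
{
    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* Passing CLONE_NEWNET makes setns() refuse a descriptor that refers
     * to any other type of namespace. */
    if (setns(fd, CLONE_NEWNET) < 0) {
        close(fd);
        return -1;
    }
    close(fd);

    execvp(argv[0], argv);
    return -1; /* only reached if execvp() failed */
}
```

With sufficient privileges, calling it with "/run/netns/asdf" and an argv of { "ip", "l", NULL } should behave roughly like ip netns exec asdf ip l.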

Broader use of unshare

I’ve covered the basics using iproute2 as an example, but there are more tools that make use of Linux’s namespace feature. The util-linux repo has another program that calls unshare: unshare.c[7].

$ sudo unshare -n
# ip l
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# exit
$

Unlike iproute2, this program does not persist the namespace; it only executes a program inside it (which is a shell by default).

You’ve probably noticed that I have used sudo for creating these namespaces, and that’s because CLONE_NEWNET requires CAP_SYS_ADMIN. Well, unshare-the-program supports all the other kinds of namespaces too. For example:

$ whoami
jordan
$ unshare -U
$ whoami
nobody

When creating a user namespace, the program has a handy feature where it can map the current user’s UID to the root user inside the namespace. This is privilege escalation stuck inside a namespace separate from the rest of the system. I wouldn’t quite call it a sandbox, but it does give you permissions inside the namespace you wouldn’t otherwise have in the root/default namespace. I rhetorically wonder if this solves the permission problem with network namespaces?

$ unshare -Urn
$ whoami
root

So with the combination of CLONE_NEWUSER and CLONE_NEWNET, which are OR’d together in the flags argument to unshare, you can create a network namespace without your user having CAP_SYS_ADMIN. This is my go-to method for quickly tinkering with network config, e.g. exploring BPF programs or nftables rules.
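For the curious, here is approximately what that looks like at the syscall level. This is a sketch under my own assumptions, not the util-linux implementation: format_root_map(), write_file() and unshare_user_net_as_root() are made-up names, and it only maps a single user and group to root.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Format one uid_map/gid_map line mapping a single outside ID to root (0).
 * Returns the line length, or -1 if it does not fit. */
int format_root_map(char *buf, size_t len, unsigned long outside_id)
{
    int n = snprintf(buf, len, "0 %lu 1\n", outside_id);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Write a short string to a /proc file. Returns 0 on success. */
static int write_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t want = (ssize_t)strlen(text);
    ssize_t got = write(fd, text, (size_t)want);
    close(fd);
    return got == want ? 0 : -1;
}

/* Hypothetical helper approximating `unshare -Urn` for the calling process. */
int unshare_user_net_as_root(void)
{
    char map[64];
    /* Read the IDs first: after unshare() they show up as the overflow ID
     * (the "nobody" seen earlier) until a mapping is written. */
    uid_t uid = geteuid();
    gid_t gid = getegid();

    /* The two flags are OR'd together into one unshare() call. */
    if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0)
        return -1;

    if (format_root_map(map, sizeof(map), uid) < 0 ||
        write_file("/proc/self/uid_map", map) < 0)
        return -1;

    /* An unprivileged process must disable setgroups() before it is
     * allowed to write gid_map. */
    if (write_file("/proc/self/setgroups", "deny") < 0)
        return -1;
    if (format_root_map(map, sizeof(map), gid) < 0 ||
        write_file("/proc/self/gid_map", map) < 0)
        return -1;
    return 0;
}
```

After a successful call, the process can exec a shell that sees itself as root and owns a fresh, empty network namespace. Some distros restrict unprivileged user namespaces, in which case the unshare() call fails.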

The downside of unshare -Urn is that the namespace isn’t easily referenced by other processes. For example, when using iproute2, I can ip netns exec <name> to open up several shells, which is handy for running tcpdump or bpftool prog tracelog. The unshare program can’t help me with this.

Broader use of setns

But going back to the basics, I know that the new namespace my shell is running in can be referenced by /proc/self/ns/net. And /proc/self is a symlink to a directory named with the process ID:

$ ls -lh /proc/self
lrwxrwxrwx 1 root root 0 Dec 31  1969 /proc/self -> 212116

So I should be able to enter the shell’s namespaces via

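/* Join the user namespace first: that is what grants the
 * capabilities needed to join the network namespace. */
int fd = open("/proc/212116/ns/user", O_RDONLY);
setns(fd, CLONE_NEWUSER);
close(fd);
fd = open("/proc/212116/ns/net", O_RDONLY);
setns(fd, CLONE_NEWNET);
close(fd);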

Turns out, the util-linux repo has a program called nsenter[8] to do exactly that.

$ nsenter -U -n --preserve-credentials -t 212116
$ whoami
root

The UX of this flow is lame, but it does prove I can recreate ip netns exec without elevated privileges.
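If you want to double-check that a second shell really landed in the same namespace as the first, one option is to compare the namespace handles under /proc. Two processes share a namespace when their /proc/<pid>/ns/net entries have the same device and inode numbers. The helper below is a hypothetical sketch of that check; same_netns() is my own name, and inspecting another process's handle needs permission to access that process.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if both processes are in the same network
 * namespace, 0 if not, and -1 if either handle cannot be inspected.
 * pid_a and pid_b are /proc entries such as "212116" or "self". */
int same_netns(const char *pid_a, const char *pid_b)
{
    char path_a[64], path_b[64];
    struct stat a, b;

    snprintf(path_a, sizeof(path_a), "/proc/%s/ns/net", pid_a);
    snprintf(path_b, sizeof(path_b), "/proc/%s/ns/net", pid_b);

    /* stat() follows the symlink to the namespace object itself. */
    if (stat(path_a, &a) < 0 || stat(path_b, &b) < 0)
        return -1;

    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

Run from the nsenter shell, same_netns("self", "212116") should return 1, while a shell in the default namespace should get 0.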