
Linux Containers (LXC) and how they work

(This article was written for the MIT 6.858 Computer Systems Security class to supplement lecture content, but is not intended to be a replacement for attending lectures. The 2020 lecture video can be found here.)

What comes to mind when you hear the buzzword “containerization”? Perhaps you have heard of software packages such as Virtuozzo, OpenVZ and Docker (in fact, Docker used lxc in its early days before dropping it in favor of its own libcontainer).

The word “container” is defined pretty loosely – is it a process? Is it a virtual machine? Is it a Docker container? What is an image?

This article aims to demystify Linux containers – specifically lxc – and give a practical introduction to them.

Introduction

Containerization is best defined as a process isolation mechanism that is enabled through features of the operating system. Hence,

A container is a collection of one or more processes that are isolated from the rest of the system.

The concept of Linux containers is not novel – lxc has existed in Linux for more than a decade, and other operating systems have long had their own implementations: FreeBSD Jails, Solaris Containers, AIX Workload Partitions, etc. Containers were conceived with software portability in mind. When packaging software for staging and production, weird problems often arise from differences between operating environments: a different version of a required shared library, a different network topology or even different underlying storage. By packaging an entire runtime environment (applications with all their dependencies, configuration, etc.) into a single container, we introduce an abstraction over the differences across multiple environments.

Containers != VMs

A common misconception is that Linux containers are virtual machines. They are not. lxc achieves containerization through the use of the following Linux features to abstract the operating system away (and to limit container-to-host privileges):

  • Control Groups (cgroups)
  • Capabilities
  • seccomp
  • Mandatory Access Control (via AppArmor, SELinux)
  • Namespaces
  • chroot jails

In contrast, VMs run on hypervisors which abstract the hardware. The following figures show the main differences between Linux containers and VMs:

Virtual machines running on hardware-assisted type I (baremetal) virtualization (left) vs. Linux containers (lxc, right)
  • Operating system – VM: runs an isolated guest OS inside the VM; lxc container: runs on the same OS as the host (shared kernel)
  • Networking – VM: via virtual network devices; lxc container: via an isolated view of a virtual network adapter (through namespaces)
  • Isolation – VM: complete isolation from other guest OSes and the host OS; lxc container: namespace- and cgroup-based isolation from the host OS and other lxc containers
  • Size – VM: usually on the order of gigabytes; lxc container: usually on the order of megabytes
  • Startup time – VM: on the order of seconds to minutes, depending on storage media; lxc container: on the order of seconds
Main differences between VMs and Linux lxc containers

Use-Cases

Before we dive into the inner workings of lxc, let us consider some scenarios in which containerization is a viable solution:

  • Stronger privilege segregation in a microservice architecture on a single host (e.g. zookd in lab 2)
  • Improved blast radius containment in the event of a security compromise
  • More effective resource utilization in isolation (compared to hardware-assisted virtualization)
  • Ease of software deployment (the purpose for which containers were first developed)
  • Increasing the velocity of application delivery and operational efficiency (e.g. through the use of a DevSecOps framework)

Four main factors compel the use of containers in modern environments:

  1. Need for stronger privilege segregation between processes on a host
  2. Need for blast radius containment in the event of security compromise
  3. Need for speed (performance) on limited hardware, or need for greater resource utilization efficiency (over VMs)
  4. Software portability – the ease of packaging and deployment (which increases software development agility and operational consistency)

(In all of the fictitious use-case scenarios discussed in lecture, the attack surface was large and contiguous: exploiting a vulnerability in a single component gave access to multiple other components in the system. In all of these cases, containerization would help apply the principles of least privilege and defense-in-depth to the system. For example, Bob the journalist could exclusively use sandboxed applications for his work. In the event that one application is compromised, the threat actor cannot reach the other applications because they exist in a different PID namespace, to name just one of the protections. Modern browsers like Firefox similarly sandbox tabs so that it is more difficult for threat actors to break out of a single tab into the parent process or the host OS.)

chroot

By default, the OS root directory is /, and processes see it as the system root at which all absolute file paths are rooted. This “view” can be changed by invoking the chroot() system call, creating a separate, isolated environment to run processes in. chroot changes the apparent root directory for the currently running process and its children.

However, chroot alone does not provide strong isolation. It may seem that preventing access to parent directories is sufficient, but chroot simply changes the starting point for absolute pathname lookups (paths starting with /) for a process and its children, effectively prepending the new root directory to them. Among other escape routes, a process can still reach files outside the jail if it holds a handle (such as an open directory file descriptor) obtained outside the chroot jail – so this alone is not strong isolation.
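
To make this concrete, here is a minimal sketch of a chroot jail in C. It assumes the target directory already contains a /bin/sh to exec, and must be run as root (CAP_SYS_CHROOT):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <new_root>\n", argv[0]);
        return 1;
    }

    /* chroot() requires CAP_SYS_CHROOT, so run this as root. */
    if (chroot(argv[1]) != 0) { perror("chroot"); return 1; }

    /* Without this chdir(), the working directory would remain outside
     * the jail (one of the classic ways to escape it). */
    if (chdir("/") != 0) { perror("chdir"); return 1; }

    /* "/" now resolves to <new_root> for this process and its children. */
    execl("/bin/sh", "sh", (char *) NULL);  /* needs a shell inside the new root */
    perror("execl");
    return 1;
}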

Capabilities

The root superuser used to be all-powerful, capable of performing any action in the OS. Traditional UNIX discretionary access control splits the world into two: root/superuser/privileged and user/unprivileged. Suppose a system user needs to spawn a server process that requires some root privileges, and suppose the server code has a remote code execution vulnerability. Should the vulnerable server process get compromised, the entire system gets compromised (since the process has UID 0). Is there a way to give a process only the privileges it needs (least privilege)?

In a bid to shard the privileges usually afforded wholly to root, Linux capabilities were introduced into the Linux kernel starting with version 2.2. Each capability represents a distinct unit of privilege and is prefixed by CAP_. Some capabilities include:

  • CAP_CHOWN – the capability to change user and group ownership of files
  • CAP_NET_ADMIN – the capability to perform network-related administration on the system
  • CAP_NET_RAW – the capability to create RAW and PACKET sockets, and arbitrary address binding
  • CAP_SYS_ADMIN – the capability to do a lot of things to the point where many regard it as the new root. Definitely needs further privilege sharding in the future.

The commands getcap and setcap exist to get/set capabilities on a file. Let us take a look at the ping utility, which needs to create a RAW socket to send out ICMP packets:

rayden@uwuntu:~$ ls -al /bin/ping
-rwxr-xr-x 1 root root 72776 Jan 30 15:11 /bin/ping

It is owned by root:root, but it is readable and executable by any user. If we try and ping google.com we can verify that the UID is that of the current user (since there is no setuid bit set):

USER   PID  %CPU %MEM VSZ   TTY   COMMAND
rayden 3220 0.0  0.0  18464 pts/0 /bin/ping google.com

The unprivileged user here is able to ping because of a capability set on the /bin/ping binary:

rayden@uwuntu:~$ getcap /bin/ping
/bin/ping = cap_net_raw+ep

Here, two flags are set: Effective (E) and Permitted (P). There are 3 capability flags one may set:

  • Effective: whether the capability is currently active (this is the set the kernel checks on privileged operations)
  • Inheritable: whether the capability is preserved across an execve()
  • Permitted: the limiting set of capabilities the process is allowed to make effective, regardless of the parent’s capability set

What happens if we clear that capability from ping?

rayden@uwuntu:~$ cp /bin/ping .
rayden@uwuntu:~$ getcap ./ping 
rayden@uwuntu:~$ ./ping google.com
ping: socket: Operation not permitted
rayden@uwuntu:~$ sudo setcap cap_net_raw=ep ./ping
rayden@uwuntu:~$ ./ping google.com
PING google.com (172.217.11.14) 56(84) bytes of data.
...

By copying the ping binary to a new destination, its extended attributes (which store the file capabilities) and any setuid bit are not carried over. Without the cap_net_raw capability, the spawned ping process is unable to open a RAW socket. Once we give that capability back with setcap, ping functions normally again.
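
To see why ping needs this capability at all, here is a minimal sketch that attempts the first thing ping does: open a raw ICMP socket. Run unprivileged, it fails with the same "Operation not permitted" error; after running sudo setcap cap_net_raw+ep on the compiled binary (as above), it succeeds:

#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void) {
    /* Opening a raw ICMP socket requires CAP_NET_RAW (or root). */
    int fd = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
    if (fd == -1) {
        /* Without the capability this fails with EPERM, which is the
         * "Operation not permitted" error ping printed above. */
        printf("socket: %s\n", strerror(errno));
        return 1;
    }
    printf("raw socket opened (fd = %d), CAP_NET_RAW is available\n", fd);
    return 0;
}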

Capabilities seem like a good idea, but CAP_SYS_ADMIN still carries too many privileges, and capabilities are just one of several mechanisms lxc uses to enforce stronger isolation.

Control Groups (cgroups)

Control groups (cgroups) enable limiting system resource utilization based on user-defined groups of processes. Suppose you are running a very intensive data analysis routine that uses so much compute and memory that your system becomes unresponsive. cgroups is a kernel feature that lets you define a group of processes running the analysis job and limit, account for and isolate the resources allocated to it – so that you can multitask while the analysis job runs with limited resources. In particular, the cgroup feature enables:

  • Limits: maximum limits can be specified on processor usage, memory usage, device usage, etc.
  • Accounting: resource usage is monitored.
  • Prioritization: resource usage can be prioritized over other cgroups.
  • Control: the state of processes can be controlled (e.g. stop, restart, suspend)

A cgroup is a set of one or more processes which are bound to the same set of defined limits for the cgroup. A cgroup can also inherit the properties of another cgroup in a hierarchical manner.

cgroups is generally available in most modern releases of Linux distros, and most define about 10 subsystems (also known as controllers). From the Red Hat Enterprise Linux documentation:

  • blkio — this subsystem sets limits on input/output access to and from block devices such as physical drives (disk, solid state, or USB).
  • cpu — this subsystem uses the scheduler to provide cgroup tasks access to the CPU.
  • cpuacct — this subsystem generates automatic reports on CPU resources used by tasks in a cgroup.
  • cpuset — this subsystem assigns individual CPUs (on a multicore system) and memory nodes to tasks in a cgroup.
  • devices — this subsystem allows or denies access to devices by tasks in a cgroup.
  • freezer — this subsystem suspends or resumes tasks in a cgroup.
  • memory — this subsystem sets limits on memory use by tasks in a cgroup and generates automatic reports on memory resources used by those tasks.
  • net_cls — this subsystem tags network packets with a class identifier (classid) that allows the Linux traffic controller (tc) to identify packets originating from a particular cgroup task.
  • net_prio — this subsystem provides a way to dynamically set the priority of network traffic per network interface.
  • ns — the namespace subsystem.
  • perf_event — this subsystem identifies cgroup membership of tasks and can be used for performance analysis.

The cgroup-tools and libcgroup1 packages are needed to administer them, which can be installed on Ubuntu via:

$ sudo apt install cgroup-tools libcgroup1

To demonstrate how cgroups limit resources, let us look at the memory subsystem. Suppose we had a memory-intensive process called memes that we wish to run on a workstation. We can use cgroups to limit its memory usage by creating a cgroup called memegroup in the memory subsystem (using cgcreate), setting its limit (using cgset) and executing the process under that cgroup (using cgexec):

rayden@uwuntu:~$ sudo cgcreate -g memory:memegroup
rayden@uwuntu:~$ sudo cgset -r memory.limit_in_bytes=1500K memegroup
rayden@uwuntu:~$ cgget -r memory.limit_in_bytes memegroup
memegroup:
memory.limit_in_bytes: 1536000
rayden@uwuntu:~$ cat /sys/fs/cgroup/memory/memegroup/memory.limit_in_bytes
1536000
rayden@uwuntu:~$ sudo cgexec -g memory:memegroup ./memes 
...

The cgcreate command creates the corresponding directory under the cgroup filesystem (mounted under /sys/fs/cgroup, where it can also be manipulated directly from the command line), and cgset writes the value into the appropriate control file. Note that 1500K is interpreted as 1500 KiB, i.e. 1,536,000 bytes, and the kernel keeps memory limits aligned to its 4096-byte page size. Finally, we execute memes in the memegroup cgroup under the memory subsystem.
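
Under the hood, cgcreate, cgset and cgexec are thin wrappers around that filesystem. The following is a rough sketch of the same three steps in C; it assumes the cgroup v1 memory hierarchy is mounted at /sys/fs/cgroup/memory (as in the example above), that it is run as root, and that the memory-hungry ./memes binary sits in the current directory:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Helper: write a string into a cgroup control file. */
static void write_file(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    if (fputs(value, f) == EOF) { perror(path); exit(1); }
    fclose(f);
}

int main(void) {
    char pid[32];

    /* cgcreate: a cgroup is just a directory in the mounted hierarchy. */
    if (mkdir("/sys/fs/cgroup/memory/memegroup", 0755) != 0 && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    /* cgset: write the limit into the memory controller's control file. */
    write_file("/sys/fs/cgroup/memory/memegroup/memory.limit_in_bytes", "1500K");

    /* cgexec: move the current process into the cgroup, then exec the job. */
    snprintf(pid, sizeof(pid), "%d", getpid());
    write_file("/sys/fs/cgroup/memory/memegroup/cgroup.procs", pid);

    execl("./memes", "memes", (char *) NULL);
    perror("execl");
    return 1;
}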

After a while, you’ll see the message on the terminal saying that the process has been killed (literally just ‘Killed‘).

rayden@uwuntu:~$ cat /sys/fs/cgroup/memory/memegroup/memory.oom_control
oom_kill_disable 0
under_oom 0
oom_kill 1

We see that oom_kill has been set to 1, which means that the Kernel Out-Of-Memory Killer (OOM Killer) has terminated the processes in the memegroup cgroup.

A simple memory-intensive C program that would be killed in the example above is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main() {
    char *ptr;
    /* Allocate one page per second and never free it. The memset() forces
     * the kernel to actually back each allocation with physical memory, so
     * the cgroup's memory usage climbs steadily until the limit is reached
     * and the OOM killer terminates the process. */
    while (1) {
        ptr = (char *)malloc(4096);
        memset(ptr, 0, 4096);
        sleep(1);
    }
    return 0;
}

That’s an example of how limits are enforced and process control is done via a cgroup. Most subsystems have accounting features such as memory.usage_in_bytes, cpuacct.usage_sys, etc. An example of prioritization would be cpu.shares (the share of CPU resources available to each process in every cgroup).

Namespaces

A namespace is an abstract object that wraps a set of system resources so that processes inside the namespace only see the resources belonging to that namespace. For example, Linux processes form a single process tree rooted at init (PID 1). Typically, privileged processes in this tree can trace or kill other processes. With the introduction of the PID namespace, we can have multiple disjoint process trees (which do not know of processes in other namespaces). If we create a new PID namespace and run a process in it, that first process becomes PID 1 in that namespace. The process that creates the namespace remains in the parent namespace, but its child becomes the root of the new process tree.

The Linux kernel defines 7 namespaces:

  • PID – isolates processes
  • Network – isolates networking
  • User – isolates User/Group IDs
  • UTS – isolates hostname and fully-qualified domain name (FQDN)
  • Mount – isolates mountpoints
  • cgroup – isolates the cgroup sysfs root directory
  • IPC – isolates IPC/message queues

You can see the namespaces defined on your system via the procfs:

rayden@uwuntu:~$ sudo ls /proc/1/ns
cgroup ipc mnt net pid pid_for_children user uts

This level of isolation is useful in containerization. Without namespaces, a process running in a container might be able to change the hostname of another container, unmount a file system, remove a network interface, change limits, etc. By using namespaces to encapsulate these resources, the processes in container X are unaware of the resources in container Y.

With the introduction of namespaces, the Linux kernel provides 3 system calls for working with them:

  • clone() – creates a new process, optionally in new namespaces. If CLONE_NEW* flags (e.g. CLONE_NEWPID, CLONE_NEWUTS) are passed, a new namespace of each specified type is created and the child is placed in it.
  • setns() – allows a process to join an existing namespace. The namespace is specified by a file descriptor reference in the procfs like so:
rayden@uwuntu:~$ sudo ls -al /proc/1/ns
total 0
dr-x--x--x 2 root root 0 Jul 22 00:46 .
dr-xr-xr-x 9 root root 0 Jul 21 22:08 ..
lrwxrwxrwx 1 root root 0 Jul 22 00:46 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Jul 22 00:46 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Jul 22 00:46 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Jul 22 00:46 net -> 'net:[4026531992]'
lrwxrwxrwx 1 root root 0 Jul 22 00:46 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Jul 22 00:47 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Jul 22 00:46 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Jul 22 00:46 uts -> 'uts:[4026531838]'
  • unshare() – moves the calling process into new namespaces of the specified types.

For example, we can create a new bash shell in a new UTS namespace through the unshare command:

rayden@uwuntu:~$ hostname
uwuntu
rayden@uwuntu:~$ sudo unshare -u /bin/bash
root@uwuntu:/home/rayden# hostname lmao
root@uwuntu:/home/rayden# hostname
lmao
root@uwuntu:/home/rayden# exit
exit
rayden@uwuntu:~$ hostname
uwuntu

Notice that the hostname remains unchanged in the parent shell. The same thing can be done for process IDs:

rayden@uwuntu:~$ sudo unshare --fork --pid --mount-proc /bin/bash
root@uwuntu:/home/rayden# pidof /bin/bash
1

The PID of the forked process is 1, but if you look at the output of ps aux on the parent shell, we see PID 6499:

...
root 6499 0.0 0.0 16712 580 pts/3 S 01:23 0:00 unshare --fork --pid --mount-proc /bin/bash
...

Bash sees itself as PID 1 only because its view is confined to its own PID namespace. Once you have forked a process into its own namespace, the processes inside that namespace are numbered starting from 1, but only within that namespace.
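
The same behavior can be reproduced directly from C with the clone() system call. Below is a minimal sketch (an illustration, not lxc's actual implementation) that creates a child in new PID and UTS namespaces; run it as root, since creating these namespaces requires CAP_SYS_ADMIN:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];   /* stack for the cloned child */

static int child_fn(void *arg) {
    char hostname[64];

    /* Inside the new PID namespace this process is PID 1. */
    printf("child:  pid = %d\n", getpid());

    /* Changing the hostname only affects the new UTS namespace. */
    sethostname("lmao", 4);
    gethostname(hostname, sizeof(hostname));
    printf("child:  hostname = %s\n", hostname);
    return 0;
}

int main(void) {
    char hostname[64];

    /* CLONE_NEWPID and CLONE_NEWUTS put the child in fresh PID and UTS
     * namespaces; this requires CAP_SYS_ADMIN, hence running as root. */
    pid_t pid = clone(child_fn, child_stack + sizeof(child_stack),
                      CLONE_NEWPID | CLONE_NEWUTS | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); exit(1); }

    printf("parent: child pid as seen from the parent namespace = %d\n", pid);
    waitpid(pid, NULL, 0);

    gethostname(hostname, sizeof(hostname));
    printf("parent: hostname = %s (unchanged)\n", hostname);
    return 0;
}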

Namespaces are the foundation of containerization. Understanding the abstract concept of namespaces and how they encapsulate resources in an environment can help you understand how and why containerized applications behave the way they do. For instance, a container running a web server is unaware that it is running in a container – it knows that it has access to system calls and resources it needs, but it has its own view of things like the hostname, the process tree, the user, etc. (There are ways to detect if a process is in a container, but that is out of the scope of this discussion.)

Furthermore, a malicious process spawned from the web server cannot see or signal any other process on your system, because as far as any process in that PID namespace knows, the process tree is rooted at PID 1, and PID 1 is the container's init process (or, in some cases, the compromised process itself).

There is a namespace subsystem defined by cgroups (in that you can control resources by their namespace), but be careful not to confuse the two: cgroups limit resource utilization, while namespaces limit the resource view (what a process may see on the system).

seccomp

There are cases where isolation via chroot, capabilities, cgroups and namespaces is not enough. Suppose some web server running in a container was compromised, and a remote shell was spawned by the attacker. The same set of system calls is invocable by the host process and the container, and there could exist some sequence of calls that makes a container escape possible. (In fact, there are a number of container-escape exploits: False Boundaries and Arbitrary Code Execution, Container escape through open_by_handle_at, Abusing Privileged and Unprivileged Linux Containers to name a few.)

seccomp protects against the threat of damage by a malicious process via syscalls, by limiting the number of syscalls a process is allowed to execute. Modern browsers such as Chrome and Firefox use seccomp to clamp down tighter on their applications. Many container-escape exploits can be easily blocked by limiting the syscall interface to only syscalls required for the containerized application to carry out its function.
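
As an illustration of syscall filtering (using the libseccomp library directly, rather than lxc's configuration mechanism described next), the following sketch installs an allow-list so that any syscall outside a small set terminates the process. Compile with -lseccomp:

#include <seccomp.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Default action: kill the process on any syscall not explicitly allowed. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL) { perror("seccomp_init"); return 1; }

    /* Allow only the handful of syscalls this program actually needs. */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigreturn), 0);

    if (seccomp_load(ctx) != 0) { perror("seccomp_load"); return 1; }

    /* write(2) is on the allow list, so this succeeds... */
    write(STDOUT_FILENO, "write is still allowed\n", 23);

    /* ...but any other syscall (e.g. open(2)) would now terminate the
     * process before it can do any damage. */
    return 0;
}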

In lxc, seccomp filters can be specified through the container configuration file (~/.local/share/lxc/<container_name>/config):

lxc.seccomp = /usr/share/lxc/config/common.seccomp

where /usr/share/lxc/config/common.seccomp is a list of disallowed system calls by default.

Mandatory Access Control (MAC)

Suppose that, in the web server toy example from the previous section, the attacker managed to escape the container despite isolation via chroot, capabilities, cgroups, namespaces and seccomp. What happens now? If the container is privileged (run as UID 0), it's pretty much GG. If it is unprivileged, the attacker could still try to escalate privileges, or do plenty of damage if the current user is privileged enough.

In this situation, only discretionary access control (DAC) (via UNIX permissions) stands between the attacker and a fully compromised system. In good old Defense-in-Depth fashion, we layer another control to mitigate this risk: Mandatory Access Control (MAC).

MAC is a centralized authorization mechanism that operates on the philosophy that information belongs to an organization (and not the individual members). A security policy is defined and kept in the kernel, which authorizes accesses based on the defined policy. Modern MAC implementations such as SELinux are a combination of Role-based Access Control (RBAC) and two concepts:

  • Type Enforcement (TE)
  • Multilevel Security (MLS)

Type Enforcement (TE)

TE introduces type labeling for every file system object, and is a prerequisite for MAC. Objects are labeled with a type, and a policy kept in the kernel specifies which types may access (or transition to) which other types. The kernel checks this policy every time a labeled file system object is accessed; if the specific access or transition is not present in the policy, it is denied by default. For example, in Security-Enhanced Linux (SELinux), web content served by Apache carries the standard label httpd_sys_content_t, and processes confined to that label are not allowed to access files labeled bin_t, the label applied to binaries in /usr/bin.

Suppose that, in the same web server toy example, the attacker managed to get root access. Under DAC, a user has discretionary control over anything it owns, so a root-level attacker has full control. If TE is enforced via SELinux, the attacker is severely impeded: the exploit yields a process with UID 0, but that process inherits the type label of the exploited web content (httpd_sys_content_t), which only allows access to other file system objects with the same label (the idea being that content served by a web server should only need to access other web content, and nothing else).

Similarly, if we enable a MAC mechanism like SELinux on our container host, all containers will be labeled with the default lxc_t type label (defined by the default SELinux TE policy for lxc). Any malicious process that bypasses the other isolation mechanisms will still be confined by the TE policy. More information on which type transitions and accesses are allowed by default can be seen directly from the .te file here.

Multilevel Security (MLS)

(MLS is out of the scope of this article, and is only treated briefly).

Few systems are configured with MLS outside government or military environments. In a military environment, files are labeled with a sensitivity level (e.g. Unclassified, Confidential, Secret, Top Secret). However, sensitivity levels alone are insufficient for classifying files, because they do not respect the principle of least privilege (the need-to-know basis). Hence, the US military compartmentalizes its most secretive information (known as Top Secret/Sensitive Compartmented Information, or TS/SCI). Every information asset belongs to a set of compartments, which could be categories such as cyber, nuclear, biological, blackops, etc. An information asset in the compartments [cyber, biological] may only be accessed by principals (persons) that have clearance for BOTH of those compartments. MLS is formalized through the Bell-LaPadula (BLP) model of 1973:

Given the ordered set of all sensitivity levels [latex]S[/latex] and the set of all compartments [latex]C[/latex], a label is a pair [latex]l = (s, c)[/latex] with [latex]s \in S[/latex] and [latex]c \subseteq C[/latex]. For two labels [latex]l_1 = (s_1, c_1)[/latex] and [latex]l_2 = (s_2, c_2)[/latex], we write [latex]l_1 \leq l_2[/latex] (in that [latex]l_1[/latex] is no more restrictive than [latex]l_2[/latex]) when [latex]s_1 \leq s_2[/latex] and [latex]c_1 \subseteq c_2[/latex].
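
As a concrete example, with sensitivity levels ordered Unclassified [latex]\leq[/latex] Confidential [latex]\leq[/latex] Secret [latex]\leq[/latex] Top Secret, the label [latex]l_1 = (\text{Secret}, \{\text{cyber}\})[/latex] satisfies [latex]l_1 \leq l_2[/latex] for [latex]l_2 = (\text{Top Secret}, \{\text{cyber}, \text{nuclear}\})[/latex], since Secret [latex]\leq[/latex] Top Secret and [latex]\{\text{cyber}\} \subseteq \{\text{cyber}, \text{nuclear}\}[/latex]; but it does not satisfy [latex]l_1 \leq l_3[/latex] for [latex]l_3 = (\text{Top Secret}, \{\text{nuclear}\})[/latex], because [latex]\{\text{cyber}\} \not\subseteq \{\text{nuclear}\}[/latex].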

Let [latex]P[/latex] denote a principal and [latex]L(E)[/latex] denote the label of an entity [latex]E[/latex] (a principal or an asset). The BLP model also specifies two security conditions:

  • Principals are not allowed to “read up”, i.e. [latex]P[/latex] may only read some asset [latex]A[/latex] if [latex]L(A)\leq L(P)[/latex], and
  • Principals are not allowed to “write down”, i.e. [latex]P[/latex] may only write to some asset [latex]A[/latex] if [latex]L(P)\leq L(A)[/latex].

Together, these two conditions guarantee that a principal can never directly read an information asset for which it is not cleared, and that a principal can never learn information about a higher-labeled asset [latex]A[/latex] by reading some lower-labeled object [latex]A^\prime[/latex]: suppose some principal [latex]P[/latex] reads [latex]A[/latex] before writing to [latex]A^\prime[/latex]; the two conditions give [latex]L(A)\leq L(P) \leq L(A^\prime)[/latex], hence [latex]L(A)\leq L(A^\prime)[/latex], so information only flows into assets that are at least as restricted and there is no leakage.

The BLP model is not perfect, and that is why real-world systems combine different access control mechanisms. Some problems with the BLP model are:

  • Only confidentiality is considered, and not integrity (in the event that principals write up to an asset of a higher label)
  • The security level of a principal is assumed to be static, when in reality it could change mid-operation.
  • By the second security condition, any principal [latex]P[/latex] cannot write down, and privileges have to be stripped to a minimal set (which may not be a problem since Least Privilege is observed here)

By default, SELinux only carries out TE using the default targeted policy. MLS can be enabled by switching to the mls policy in /etc/selinux/config. An operation is allowed if and only if both the MAC and DAC policies permit it (and, in some cases, the RBAC policy as well).

SELinux is installed by default on Red Hat-based distributions such as Fedora and CentOS. On Debian-based systems, the MAC implementation used is AppArmor.

Demo: working with lxc

With that in mind, let us go through a short demo on how to work with lxc. Throughout this section we will be using a Ubuntu 19.10 Eoan amd64 VMware workstation virtual machine on Windows 10. You may use your own choice of hypervisor (kvm, VirtualBox, etc.) and host operating system – it should not affect your ability to follow the steps listed below.

lxc can be simply installed through your favorite package manager. On Ubuntu:

rayden@uwuntu:~$ sudo apt install lxc
[sudo] password for rayden:
Reading package lists… Done
Building dependency tree
Reading state information… Done
The following additional packages will be installed:
bridge-utils liblxc-common liblxc1 libpam-cgfs lxc-utils lxcfs uidmap
Suggested packages:
ifupdown btrfs-tools lvm2 lxc-templates lxctl
The following NEW packages will be installed:
bridge-utils liblxc-common liblxc1 libpam-cgfs lxc lxc-utils lxcfs uidmap
[output truncated]

Subordinate UID/GID ranges

We ensure that the current user is allowed to have subordinate uids and gids by making sure that the following files are defined:

rayden@uwuntu:~$ cat /etc/subuid
rayden:100000:65536
rayden@uwuntu:~$ cat /etc/subgid
rayden:100000:65536

which allows the user rayden to have 65536 subordinate uids/gids starting at 100000. We also need to create the user config directory for lxc if it does not exist and create the default configuration file:

$ mkdir -p ~/.config/lxc
$ touch ~/.config/lxc/default.conf

The ~/.config/lxc/default.conf file should be modified so that it looks like this (with the correct id_map values):

lxc.include = /etc/lxc/default.conf
lxc.id_map = u 0 100000 65536
lxc.id_map = g 0 100000 65536

Virtual network interfaces

When installing lxc, a default bridge should have been created for you: lxcbr0. You can verify that the bridge exists via the command

rayden@uwuntu:~$ brctl show
bridge name  bridge id          STP enabled  interfaces
lxcbr0       8000.00163e000000  no

Ensure that the /etc/lxc/lxc-usernet file is defined with:

# user type bridge max_interfaces_by_user
rayden veth lxcbr0 10

This tells lxc how many virtual network interfaces it may attach to the specified bridge as the user rayden (or group, one per line).

The quickest way to effect the changes would be to restart the node or log out and back in. This restarts dbus, sets up the cgroups properly and turns user namespaces on (kernel.unprivileged_userns_clone=1).

Verify that the veth (virtual Ethernet) networking module is loaded via

rayden@uwuntu:~$ lsmod | grep veth
veth                   28672 0

If the veth module is not loaded, load it with sudo modprobe veth, and make it persist after a reboot by appending it to /etc/modules:

rayden@uwuntu:~$ echo veth | sudo tee -a /etc/modules
veth

Creating a container

To create a container, simply run lxc-create:

rayden@uwuntu:~$ lxc-create -t download -n example
Setting up the GPG keyring
Downloading the image index

DIST RELEASE ARCH VARIANT BUILD
alpine 3.10 amd64 default 20200714_13:00
alpine 3.10 arm64 default 20200714_13:00
alpine 3.10 armhf default 20200714_13:00
[output truncated]

This is an interactive command that creates a container with the name example, using the download template. There are 4 default templates specified by the lxc install, which are basically scripts in /usr/share/lxc/templates/:

  • download – downloads pre-built images and unpacks them
  • local – consumes local images that were built with the distrobuilder build-lxc command
  • busybox – common UNIX utilities contained in a single executable
  • oci – creates an application container from images in the Open Containers Image (OCI) format

The download template prompts for your choice of distribution/release from a given list as the base image for your container, which is what we will be using to create our example container. We can specify the desired image directly on the command line, i.e. for a Ubuntu 19.10 (Eoan) amd64 image (note the double dash -- after the name):

rayden@uwuntu:~$ lxc-create -t download -n example -- --dist ubuntu --release eoan --arch amd64
Setting up the GPG keyring
Downloading the image index
Downloading the rootfs
Downloading the metadata
The image cache is now ready
Unpacking the rootfs

You just created an Ubuntu eoan amd64 (20200714_07:42) container.
To enable SSH, run: apt install openssh-server
No default root or user password are set by LXC.

A container directory will be created at ~/.local/share/lxc/example/, with a container-specific configuration file named config where you can specify further filters and controls such as MAC, seccomp deny lists, networks, etc. You will see that the root filesystem of the newly created container is unpacked in rootfs/, which looks like a standard Linux root filesystem:

rayden@uwuntu:~/.local/share/lxc/example$ ls rootfs/
bin   dev  home lib32  libx32  mnt  proc  run   srv  tmp  var
boot  etc  lib  lib64  media   opt  root  sbin  sys  usr

You may make changes offline (without starting and attaching to the container) by using chroot on the rootfs directory.

Running a container

To start the example container, simply run

rayden@uwuntu:~$ lxc-start example

which daemonizes the container. If you encounter errors starting the container, using the -F option to start the container in the foreground will give more verbose output.

We can verify that our container is running via

rayden@uwuntu:~$ lxc-info example
Name:        example
State:       RUNNING
PID:         7547
IP:          10.0.3.79
Memory use:  50.38 MiB
KMem use:    30.09 MiB
Link:        veth1000_JSJQ
TX bytes:    2.08 KiB
RX bytes:    8.93 KiB
Total bytes: 11.01 KiB

We can also get a summarized view of all containers:

rayden@uwuntu:~$ lxc-ls --fancy
NAME     STATE    AUTOSTART  GROUPS  IPV4       IPV6  UNPRIVILEGED
example  RUNNING  0          -       10.0.3.79  -     true

Attaching to our running container instance is as simple as:

rayden@uwuntu:~$ lxc-attach example
root@example:/# id
uid=0(root) gid=0(root) groups=0(root)
root@example:/# passwd
New password:
Retype new password:
passwd: password updated successfully

Once we have added our users (as needed) and changed their passwords, we can connect to the container using an interactive login via the lxc-console command. The difference is that lxc-attach behaves more like a key-based ssh setup (you get a root session directly inside without any prompts), while lxc-console gives you a virtual console that simulates an interactive console on a real server (serial, DRAC, iLO, etc.).

Notice that we are root inside the container, even though we created an unprivileged container. This behavior is the result of user namespaces and the UID mapping we configured earlier. We can see that any process in the container is mapped to an unprivileged UID on the host by running a process in the container:

root@example:/# while [ 1 ]; do sleep 5; done &
[1] 132

On the host we can see that the process is running with UID 100000:

rayden@uwuntu:~$ ps aux | grep sleep
100000 7983 0.0 0.0 8068 844 pts/3 S 20:24 0:00 sleep 5

If you look at other processes from the ps aux output, you will notice that the container init process is UID-mapped as well:

rayden@uwuntu:~$ ps aux | grep init
...
100000 7547 0.0 0.1 166192 10220 ? Ss 19:48 0:00 /sbin/init

We may run most system administration tasks inside, such as installing packages. Let us install the nginx web server and the net-tools binary package:

root@example:/# apt update
[output truncated]
root@example:/# apt install nginx net-tools
[output truncated]

Verify that nginx is running on port 80:

root@example:/# netstat -atunp | grep LISTEN
tcp  0 0 127.0.0.53:53 0.0.0.0:* LISTEN 88/systemd-resolved
tcp  0 0 0.0.0.0:80    0.0.0.0:* LISTEN 881/nginx: master p
tcp6 0 0 :::80         :::*      LISTEN 881/nginx: master p

If for some reason it isn’t running, start and persist it with

root@example:/# systemctl start nginx
root@example:/# systemctl enable nginx

Networking

(This section assumes knowledge of iptables.)

lxc creates an independent bridge (lxcbr0) by default and attaches the containers' virtual interfaces to it; all outbound traffic from the bridge is masqueraded through the main interface. This allows the containers to reach the Internet as long as the main interface has Internet access (through forwarding and masquerading). A quick look at the interfaces on our host shows the main interface with an Internet connection ens33, the default bridge lxcbr0 and the virtual interface veth1000_XXXX for the container example.

rayden@uwuntu:~$ ifconfig
ens33: flags=4163 mtu 1500
  inet 192.168.3.131 netmask 255.255.255.0 broadcast 192.168.3.255
  inet6 fe80::1dca:deec:91b8:2e31 prefixlen 64 scopeid 0x20
  ether 00:0c:29:d0:e5:24 txqueuelen 1000 (Ethernet)
  ...
...
lxcbr0: flags=4163 mtu 1500
  inet 10.0.3.1 netmask 255.255.255.0 broadcast 0.0.0.0
  inet6 fe80::216:3eff:fe00:0 prefixlen 64 scopeid 0x20
  ether 00:16:3e:00:00:00 txqueuelen 1000 (Ethernet)
  ...
veth1000_JSJQ: flags=4163 mtu 1500
  inet6 fe80::fcc8:d2ff:fee1:3646 prefixlen 64 scopeid 0x20
  ether fe:c8:d2:e1:36:46 txqueuelen 1000 (Ethernet)
  ...

The local network setup looks like this:

  • Hypervisor (Windows 10)
    • Directly connected to 192.168.3.0/24 (address 192.168.3.130)
  • Container Host (Ubuntu VM)
    • Directly connected to 192.168.3.0/24 via ens33 (address 192.168.3.131)
    • Directly connected to 10.0.3.0/24 via lxcbr0 (address 10.0.3.1)
  • example Container running nginx (Ubuntu)
    • Directly connected to 10.0.3.0/24 via eth0 (address 10.0.3.79), which is connected to the lxcbr0 host bridge via the virtual adapter veth1000_JSJQ.

In this setup, the 10.0.3.0/24 network uses the system default gateway in the 192.168.3.0/24 network, which we can see from the system routing table on the container host:

rayden@uwuntu:~$ route -n
Kernel IP routing table
Destination  Gateway      Genmask        Flags Metric Ref Use Iface
0.0.0.0      192.168.3.2  0.0.0.0        UG    100    0   0   ens33
10.0.3.0     0.0.0.0      255.255.255.0  U     0      0   0   lxcbr0
169.254.0.0  0.0.0.0      255.255.0.0    U     1000   0   0   ens33
192.168.3.0  0.0.0.0      255.255.255.0  U     100    0   0   ens33

We can see the masquerading rule in the NAT table through iptables:

rayden@uwuntu:~$ sudo iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target      prot opt source       destination
Chain INPUT (policy ACCEPT)
target      prot opt source       destination
Chain OUTPUT (policy ACCEPT)
target      prot opt source       destination
Chain POSTROUTING (policy ACCEPT)
target      prot opt source       destination
MASQUERADE  all  --  10.0.3.0/24  !10.0.3.0/24

Hence, the nginx default site is reachable from the container host, since it is directly connected to the 10.0.3.0/24 network. If you browse to http://10.0.3.79/ from the Ubuntu VM you should see the welcome page served by the container:

Browsing to nginx service in example container from Ubuntu VM

However, any other external network should not be able to reach the nginx service in the example container. In this setup the container host is running Ubuntu on a VMware workstation virtual machine, which runs on Windows 10. Since the addresses are translated from 10.0.3.0/24 to 192.168.3.0/24, the only address we can reach from the Windows 10 host is the ens33 interface in the Ubuntu VM (192.168.3.131).

In order to expose the nginx service in the container to the Windows 10 host, we need to forward port 80 on 192.168.3.131 to 10.0.3.79. We can do this via a NAT table PREROUTING chain rule:

rayden@uwuntu:~$ sudo iptables -t nat -A PREROUTING -p tcp -i ens33 --dport 80 -j DNAT --to-destination 10.0.3.79:80
Browsing to nginx service in example container from Windows 10 VM host

If you cannot access the service, check that IP forwarding is enabled in the Ubuntu kernel:

rayden@uwuntu:~$ cat /proc/sys/net/ipv4/ip_forward
1

Otherwise, append the following line to /etc/sysctl.conf:

net.ipv4.ip_forward=1

and load the value using the command sudo sysctl -p.

Now let us tighten the firewall rules a little bit on the Ubuntu VM. The default chain policy on the filter table (ACCEPT) is too permissive, so let’s set the default policy on the INPUT and FORWARD chains to DROP:

rayden@uwuntu:~$ sudo iptables -P INPUT DROP
rayden@uwuntu:~$ sudo iptables -P FORWARD DROP

Make sure to delete any rules that accept all traffic on both INPUT and FORWARD. You should not be able to access the nginx service from the Windows 10 host right now, since the Ubuntu VM is not forwarding any traffic to the container. We need to enable some forwarding rules to allow HTTP traffic to 10.0.3.79:

rayden@uwuntu:~$ sudo iptables -A FORWARD -p tcp -d 10.0.3.79 --dport 80 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT

and vice versa:

rayden@uwuntu:~$ sudo iptables -A FORWARD -s 10.0.3.79 -p tcp --sport 80 -j ACCEPT

You should be able to access the nginx service from your VM host via the container host, which forwards it to the container.

Stopping a container

To stop the example container we have created simply issue the command:

rayden@uwuntu:~$ lxc-stop example
rayden@uwuntu:~$ lxc-info example
Name:     example
State:    STOPPED

If you would like to purge (delete) the container from the file system:

rayden@uwuntu:~$ lxc-destroy example

What’s the difference between lxc and Docker?

Both solutions are suited for different use-cases. In short:

lxc: been around much longer (Docker used to use lxc). Feels more like a full OS in a VM and has to be handled in a similar manner: software has to be installed and updated manually, either by hand or through configuration management tools such as Ansible.

Docker: intended for running a single application. Does not have a full stack of system processes like lxc. A container with the application and its dependencies is built and deployed using a Dockerfile.

In terms of container orchestration, both have rather new tools: lxc has lxd, and Docker has Docker Swarm and Kubernetes. There is a new project called lxe which aims to integrate lxc/lxd with Kubernetes.

A common misconception is that Docker uses lxc. Docker DOES NOT use lxc; Docker made use of lxc to run containers in its early days, but that ceased a few years ago. Both Docker and lxc use the same kernel features for containerization, but they are independent solutions.

Summary

To summarize:

  • A container is a collection of one or more processes that are isolated from the rest of the system.
  • lxc achieves containerization through the use of Linux kernel features to abstract the operating system away and isolate the container, such as:
    • Control Groups (cgroups)
    • Capabilities
    • seccomp
    • Mandatory Access Control (via AppArmor, SELinux)
    • Namespaces
    • chroot jails
  • Container-specific configuration for lxc is located at ~/.local/share/lxc/<container name>/config
  • User-specific configuration for lxc is located at ~/.config/lxc/default.conf
  • Global configuration for lxc is located at /etc/lxc/default.conf
  • Create a container using lxc-create (and attach to it using lxc-attach)
  • Start a container using lxc-start
  • Stop a container using lxc-stop
  • List containers using lxc-ls [--fancy] and inspect a container using lxc-info
  • Destroy containers using lxc-destroy
  • Container rootfs is at ~/.local/share/lxc/<container name>/rootfs
