2022-03-06

The cgroup release_agent escape is a classic usermode-helper escape that has been known for several years. Recently it received a CVE and became popular again. At first I didn’t understand why, and I had little time to dig into why it was assigned a CVE only now. After reading Yuval Avrahami’s post “New Linux Vulnerability CVE-2022-0492 Affecting Cgroups: Can Containers Escape?” and discussing it with him, I found that there is a lot going on behind CVE-2022-0492, so I decided to write this post.

CVE-2022-0492

In the earlier release_agent escape, the container needed the CAP_SYS_ADMIN capability. CVE-2022-0492 shows that we can instead mount cgroupfs in a new user namespace and then write to the release_agent file. The following is a reproducer.

Docker doesn’t give CAP_SYS_ADMIN to the container by default. Note that seccomp and AppArmor are disabled on the command line below.

    root@ubuntu:/home/test# docker run --rm -it --security-opt seccomp=unconfined --security-opt apparmor=unconfined ubuntu bash
    root@26604070fc87:/# cat /proc/self/status | grep Cap
    CapInh:	00000000a80425fb
    CapPrm:	00000000a80425fb
    CapEff:	00000000a80425fb
    CapBnd:	00000000a80425fb
    CapAmb:	0000000000000000


    test@ubuntu:~$ capsh --decode=00000000a80425fb
    WARNING: libcap needs an update (cap=40 should have a name).
    0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap

Then, in the container, we run unshare to create a new user namespace, mount namespace and cgroup namespace. After that we can mount cgroupfs and write our data to release_agent.

    root@26604070fc87:/# unshare -UrmC bash
    root@26604070fc87:/# cat /proc/self/status | grep Cap
    CapInh:	0000000000000000
    CapPrm:	000001ffffffffff
    CapEff:	000001ffffffffff
    CapBnd:	000001ffffffffff
    CapAmb:	0000000000000000
    root@26604070fc87:/# mount -t cgroup -o rdma cgroup /mnt
    root@26604070fc87:/# ls /mnt
    cgroup.clone_children  cgroup.procs  cgroup.sane_behavior  notify_on_release  release_agent  tasks
    root@26604070fc87:/# echo "test" > /mnt/release_agent 
    root@26604070fc87:/# cat /mnt/release_agent 
    test
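
The two capability masks in the transcripts above can also be checked by hand. The following is a minimal sketch, not part of the original reproducer: the helper names are made up for illustration, and only the value of CAP_SYS_ADMIN (21, from linux/capability.h) is taken from the kernel headers. Keep in mind that the full mask seen after unshare is only valid inside the new user namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Capability number from <linux/capability.h>. */
#define CAP_SYS_ADMIN 21

/* Hypothetical helper: is capability number `cap` set in a CapEff-style mask? */
static bool cap_mask_has(uint64_t mask, unsigned int cap)
{
        return cap < 64 && ((mask >> cap) & 1) != 0;
}

/* Hypothetical helper: how many capabilities are set in the mask? */
static unsigned int cap_mask_count(uint64_t mask)
{
        unsigned int n = 0;

        for (; mask != 0; mask >>= 1)
                n += (unsigned int)(mask & 1);
        return n;
}

/* Self-check against the two CapEff values printed in the transcripts. */
static void check_article_masks(void)
{
        const uint64_t in_container = 0x00000000a80425fbULL;
        const uint64_t after_unshare = 0x000001ffffffffffULL;

        /* Plain Docker container: no CAP_SYS_ADMIN, 14 capabilities (the names capsh printed). */
        assert(!cap_mask_has(in_container, CAP_SYS_ADMIN));
        assert(cap_mask_count(in_container) == 14);

        /* After unshare -UrmC: capabilities 0..40 are all set, including CAP_SYS_ADMIN. */
        assert(cap_mask_has(after_unshare, CAP_SYS_ADMIN));
        assert(cap_mask_count(after_unshare) == 41);
}
```

Calling check_article_masks() from a small main() is enough to confirm that bit 21 is missing from the container mask and present after unshare.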

Why sysfs and procfs can't work

The PoC is not complex, but there is a lot of detail behind it. The first question is why core_pattern and uevent_helper can’t be used in the same way. Let’s see whether we can mount sysfs or procfs in the new user namespace.

    root@26604070fc87:/# mkdir /tmp/procfs
    root@26604070fc87:/# mkdir /tmp/sysfs
    root@26604070fc87:/# mount -t proc procfs /tmp/procfs 
    mount: /tmp/procfs: permission denied.
    root@26604070fc87:/# mount -t sysfs sysfs /tmp/sysfs
    mount: /tmp/sysfs: permission denied.

As we can see, we can’t mount either of them.

The mount syscall path is as follows:

    SYSCALL_DEFINE5(mount)
            -->do_mount
                    -->do_new_mount
                            -->mount_capable
                            -->vfs_get_tree
                            -->do_new_mount_fc
                                    -->mount_too_revealing
                                    -->vfs_create_mount
                                    -->do_add_mount

The first permission check is in ‘mount_capable’. Notice that the user namespace passed to ‘ns_capable’ is the fs_context’s user namespace.

    bool mount_capable(struct fs_context *fc)
    {
            if (!(fc->fs_type->fs_flags & FS_USERNS_MOUNT))
                    return capable(CAP_SYS_ADMIN);
            else
                    return ns_capable(fc->user_ns, CAP_SYS_ADMIN);
    }

The ‘fc->user_ns’ is set in the ‘init_fs_context’ callback of ‘struct file_system_type’. In the cgroupfs case, we unshared the user namespace and the cgroup namespace together, so ‘fc->user_ns’ is the new user namespace, in which we hold CAP_SYS_ADMIN. The mount therefore passes the ‘mount_capable’ check.

    static int cgroup_init_fs_context(struct fs_context *fc)
    {
            struct cgroup_fs_context *ctx;

            ctx = kzalloc(sizeof(struct cgroup_fs_context), GFP_KERNEL);
            if (!ctx)
                    return -ENOMEM;

            ctx->ns = current->nsproxy->cgroup_ns;
            ...
            fc->user_ns = get_user_ns(ctx->ns->user_ns);
            fc->global = true;
            return 0;
    }

In the procfs case, ‘proc_init_fs_context’ sets fc->user_ns to the user namespace that owns the current pid namespace.

    static int proc_init_fs_context(struct fs_context *fc)
    {
            struct proc_fs_context *ctx;

            ctx = kzalloc(sizeof(struct proc_fs_context), GFP_KERNEL);
            if (!ctx)
                    return -ENOMEM;

            ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
            put_user_ns(fc->user_ns);
            fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
            fc->fs_private = ctx;
            fc->ops = &proc_fs_context_ops;
            return 0;
    }

As we didn’t create a new pid namespace, fc->user_ns is the init user namespace. The container has no CAP_SYS_ADMIN in that user namespace, so the mount fails the ‘mount_capable’ check.
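
The decision made by ‘mount_capable’ can be summarised in a small user-space model. This is only a sketch under assumptions: the struct and function names are hypothetical, the two capability lookups are replaced by plain booleans, and both cgroupfs and procfs are assumed to have FS_USERNS_MOUNT set, which is consistent with the behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the inputs of mount_capable(). */
struct toy_mount_req {
        bool fs_userns_mount;   /* fs_type->fs_flags & FS_USERNS_MOUNT */
        bool admin_in_init_ns;  /* what capable(CAP_SYS_ADMIN) would return */
        bool admin_in_fc_ns;    /* what ns_capable(fc->user_ns, CAP_SYS_ADMIN) would return */
};

/* Same branch structure as mount_capable(), with the lookups precomputed. */
static bool toy_mount_capable(const struct toy_mount_req *req)
{
        if (!req->fs_userns_mount)
                return req->admin_in_init_ns;
        return req->admin_in_fc_ns;
}

static void check_mount_capable_cases(void)
{
        /* cgroupfs after unshare -UrmC: fc->user_ns is the new user namespace. */
        const struct toy_mount_req cgroupfs = {
                .fs_userns_mount = true,
                .admin_in_init_ns = false,
                .admin_in_fc_ns = true,
        };
        /* procfs without a new pid namespace: fc->user_ns is the init user namespace. */
        const struct toy_mount_req procfs_old_pidns = {
                .fs_userns_mount = true,
                .admin_in_init_ns = false,
                .admin_in_fc_ns = false,
        };

        assert(toy_mount_capable(&cgroupfs));
        assert(!toy_mount_capable(&procfs_old_pidns));
}
```

The only difference between the two cases is which user namespace ends up in fc->user_ns, which is exactly what the two ‘init_fs_context’ callbacks decide.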

Why 'unshare -UrmC -pf bash' can't work

So what if we also unshare the pid namespace, that is, run ‘unshare -UrmC -pf bash’ first?

    root@26604070fc87:/# mount -t proc procfs /mnt
    mount: /mnt: permission denied.

We still can’t mount procfs in the new user namespace and pid namespace. This time we pass the ‘mount_capable’ check, but we run into the second permission check of mount.

The second permission check is in ‘mount_too_revealing’. This function is interesting.

    static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags)
    {
            const unsigned long required_iflags = SB_I_NOEXEC | SB_I_NODEV;
            struct mnt_namespace *ns = current->nsproxy->mnt_ns;
            unsigned long s_iflags;

            if (ns->user_ns == &init_user_ns)
                    return false;

            /* Can this filesystem be too revealing? */
            s_iflags = sb->s_iflags;
            if (!(s_iflags & SB_I_USERNS_VISIBLE))
                    return false;

            if ((s_iflags & required_iflags) != required_iflags) {
                    WARN_ONCE(1, "Expected s_iflags to contain 0x%lx\n",
                            required_iflags);
                    return true;
            }

            return !mnt_already_visible(ns, sb, new_mnt_flags);
    }

‘mount_too_revealing’ only matters outside the initial user namespace: as we can see, it returns ‘false’ if the current mount namespace is owned by init_user_ns. I guess the name says it all: if the mount operation would reveal too much data, the kernel should deny it. The first interesting part is ‘SB_I_USERNS_VISIBLE’. If the super_block doesn’t have this flag set, the reveal check is skipped. The only two filesystems that set it are sysfs and procfs.

            if (!(s_iflags & SB_I_USERNS_VISIBLE))
                    return false;

For example, procfs sets it in ‘proc_fill_super’:

    static int proc_fill_super(struct super_block *s, struct fs_context *fc)
    {
            ...
            s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
            ...
    }
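
The early exits of ‘mount_too_revealing’ can be condensed into a user-space sketch. Everything here is illustrative: the TOY_ flag values are placeholders rather than the real bits from include/linux/fs.h, the function name is hypothetical, and the result of ‘mnt_already_visible’ is simply passed in as a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for this sketch only, not the kernel's definitions. */
#define TOY_SB_I_NOEXEC         0x1UL
#define TOY_SB_I_NODEV          0x2UL
#define TOY_SB_I_USERNS_VISIBLE 0x4UL

/* Mirrors the control flow of mount_too_revealing() with precomputed inputs. */
static bool toy_mount_too_revealing(bool mnt_ns_owned_by_init_user_ns,
                                    unsigned long s_iflags,
                                    bool already_visible)
{
        const unsigned long required = TOY_SB_I_NOEXEC | TOY_SB_I_NODEV;

        if (mnt_ns_owned_by_init_user_ns)
                return false;   /* the check never fires in init_user_ns */
        if (!(s_iflags & TOY_SB_I_USERNS_VISIBLE))
                return false;   /* filesystem is not considered revealing */
        if ((s_iflags & required) != required)
                return true;    /* unexpected superblock flags: refuse */
        return !already_visible;
}

static void check_too_revealing_cases(void)
{
        const unsigned long procfs_flags =
                TOY_SB_I_USERNS_VISIBLE | TOY_SB_I_NOEXEC | TOY_SB_I_NODEV;
        /* SB_I_USERNS_VISIBLE is not set for cgroupfs; the other bits do not matter here. */
        const unsigned long cgroupfs_flags = TOY_SB_I_NOEXEC | TOY_SB_I_NODEV;

        assert(!toy_mount_too_revealing(false, cgroupfs_flags, false));
        assert(toy_mount_too_revealing(false, procfs_flags, false));
        assert(!toy_mount_too_revealing(false, procfs_flags, true));
        assert(!toy_mount_too_revealing(true, procfs_flags, false));
}
```

In this model a filesystem without SB_I_USERNS_VISIBLE is never refused, while procfs and sysfs depend entirely on whether an existing mount is already fully visible.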

So cgroupfs passes the ‘mount_too_revealing’ check right away. procfs and sysfs go on to ‘mnt_already_visible’, and that is the check we can’t pass.

    static bool mnt_already_visible(struct mnt_namespace *ns,
                                    const struct super_block *sb,
                                    int *new_mnt_flags)
    {
            int new_flags = *new_mnt_flags;
            struct mount *mnt;
            bool visible = false;

            down_read(&namespace_sem);
            lock_ns_list(ns);
            list_for_each_entry(mnt, &ns->list, mnt_list) {
                    struct mount *child;
                    int mnt_flags;

                    ...
                    list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
                            struct inode *inode = child->mnt_mountpoint->d_inode;
                            /* Only worry about locked mounts */
                            if (!(child->mnt.mnt_flags & MNT_LOCKED))
                                    continue;
                            /* Is the directory permanetly empty? */
                            if (!is_empty_dir_inode(inode))
                                    goto next;
                    }
                    /* Preserve the locked attributes */
                    *new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
                                            MNT_LOCK_ATIME);
                    visible = true;
                    goto found;
            next:	;
            }
    found:
            unlock_ns_list(ns);
            up_read(&namespace_sem);
            return visible;
    }

‘mnt_already_visible’ iterates over the mounts in the current mount namespace, looking for an existing mount of the same filesystem, and checks whether that mount has child mounts. If a locked child mount covers anything other than a permanently empty directory, the filesystem is not fully visible in this mount namespace, so the new procfs mount is refused. The reason is as follows. procfs and sysfs contain some global data that the container should not touch, so mounting procfs and sysfs in a new user namespace has to be restricted. Otherwise we could simply mount a fresh, complete procfs in the new user namespace. Docker and runc have ‘maskedPaths’, which lists the paths that should be masked in the container. For example, runc’s default maskedPaths is as follows:

	"maskedPaths": [
		"/proc/acpi",
		"/proc/asound",
		"/proc/kcore",
		"/proc/keys",
		"/proc/latency_stats",
		"/proc/timer_list",
		"/proc/timer_stats",
		"/proc/sched_debug",
		"/sys/firmware",
		"/proc/scsi"
	],

As we can see, some of the proc and sys files are masked in the container, which means the container does not have a full view of procfs and sysfs. ‘maskedPaths’ is implemented by bind-mounting ‘/dev/null’ over these files (runc mounts a read-only tmpfs instead when the path is a directory), so the procfs mount has child mounts. As I didn’t want to find out how to set Docker’s maskedPaths, let’s use runc for the test. We will also use sysfs, as we only need to delete one line.

First we need to delete the read-only rootfs setting from runc’s default config.json:
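
The effect of a masked path on the child-mount loop of ‘mnt_already_visible’ can be sketched in user space. This models only that inner loop, not the elided checks before it. The struct and function names are hypothetical, and the masked path is assumed to be a locked mount, as mounts inherited into a less privileged mount namespace normally are.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one child mount under an existing sysfs/procfs mount. */
struct toy_child_mount {
        bool locked;                    /* MNT_LOCKED is set on the child */
        bool on_permanently_empty_dir;  /* is_empty_dir_inode(mountpoint inode) */
};

/* A mount counts as fully visible unless a locked child hides real content. */
static bool toy_fully_visible(const struct toy_child_mount *children, size_t n)
{
        for (size_t i = 0; i < n; i++) {
                if (!children[i].locked)
                        continue;       /* only locked mounts matter */
                if (!children[i].on_permanently_empty_dir)
                        return false;   /* something is hidden underneath */
        }
        return true;
}

static void check_masked_path_cases(void)
{
        /* sysfs with /sys/firmware masked: one locked child on a non-empty directory. */
        const struct toy_child_mount masked_firmware[] = {
                { .locked = true, .on_permanently_empty_dir = false },
        };

        assert(!toy_fully_visible(masked_firmware, 1));
        /* sysfs with the mask removed: no child mounts at all. */
        assert(toy_fully_visible(NULL, 0));
}
```

This is why removing a single masked path is enough to change the outcome for sysfs in the test that follows.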

     "readonly": true 

sysfs uses the user namespace that owns the network namespace, so we need to use ‘unshare -Urmn sh’.

    root@ubuntu:~/mycontainer# runc run test
    / # unshare -Urmn sh
    / # mkdir /mnt
    / # mount -t sysfs -o ro sysfs /mnt
    mount: permission denied (are you root?)

Next we delete ‘/sys/firmware’ from maskedPaths in config.json. Now sysfs mounts successfully.

    root@ubuntu:~/mycontainer# runc run test
    / # unshare -Urmn sh
    / # mount -t sysfs -o ro sysfs /mnt
    / # ls /mnt
    block       class       devices     fs          kernel      power
    bus         dev         firmware    hypervisor  module

This test succeeded on 5.4.1 but failed on 5.11, so newer kernels may have more protections. We use ‘ro’ here because runc mounts sysfs read-only in the container.

Summary

Just as Yuval Avrahami pointed out, CVE-2022-0492 is about creating a new user namespace and cgroup namespace and then performing the release_agent escape. The kernel security mechanism behind this CVE is quite interesting.

Reference

I mostly followed Yuval Avrahami’s post, and I thank him for pointing out the key to understanding CVE-2022-0492.

New Linux Vulnerability CVE-2022-0492 Affecting Cgroups: Can Containers Escape?

Rootless containers don’t work from unprivileged non-root Docker container (operation not permitted for mounting procfs)


