org/20211216125356-fuchsia_starnix.org
49 additions & 44 deletions
@@ -6,35 +6,37 @@
6
6
7
7
[[https://fuchsia.googlesource.com/fuchsia/+/refs/heads/main/src/proc/bin/starnix][~Starnix~]] is the code name of a ~Fuchsia~ project which proposes to run unmodified Linux programs.
8
8
This is my attempt to understand what needs to be done for ~Fuchsia~ to run Linux programs,
9
-
and how Linux run programs itself. The main reference is [[https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0082_starnix][RFC 0082 from ~Fuchsia~]],
9
+
and how Linux runs programs itself. The main reference is [[https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0082_starnix][RFC 0082 from ~Fuchsia~]],
10
10
from which you definitely will benefit more.
11
11
12
-
* A Tale of two alternatives
12
+
* A Tale of Two Alternatives
13
13
So, you want to run unmodified Linux programs in ~Fuchsia~. You have two choices.
14
14
15
-
+ Creating a virtual machine which is able to emulate instructions of the binary you want to run on Linux
16
15
+ Mimicking Linux only when crossing the system boundary and running other instructions unmodified on the host
16
+
+ Creating a virtual machine which is able to emulate instructions of the binary you want to run on Linux
17
17
18
-
Coincidentally, the first approach is what WSL 1 takes, and the second approach is what WSL 2 takes.
18
+
Coincidentally, the first approach is what WSL 1 takes to run Linux programs, and the second approach is what WSL 2 takes.
19
19
The virtual machine approach is an easy choice if you don't need tight system integration.
20
20
You don't need to differentiate the guest kernel and the applications running on the guest kernel.
21
21
You know, virtualization is a mature field. All you need to do is port the virtual machine monitor (hypervisor).
22
22
After that, the case is settled for good.
23
23
24
-
The second way has more stringent requirements. First, you will need the same ISA for the Linux program and the host machine.
24
+
Although mimicking Linux at the system boundary is much more lightweight (you don't need a scheduler within another scheduler), it has more stringent requirements.
25
+
First, you will need the same ISA for the Linux program and the host machine.
25
26
Second, you need to implement a ton of system interfaces (system calls, or API through system libraries like win32 API).
26
27
If the upstream syscall interface changes, you need to
27
28
keep up to date. Third, not only are there many syscalls to port, but there are also many unwritten conventions the Linux binaries
28
29
expect the running host to satisfy. To name a few: the ELF loader, the dynamic interpreter, the System V interface for process initialization,
29
-
POSIX API, stdin and stdout conventions, and so on.
30
+
POSIX API, stdin/stdout conventions.
30
31
31
32
To summarize, it is a great price to pay for tight integration. So why does ~fuchsia~ choose to implement this?
32
-
And how does ~fuchsia~ implement the POSIX interface. I am not able to answer the first question.
33
+
And how does ~fuchsia~ implement the POSIX interface? I am not able to answer the first question
34
+
(fuchsia actually implemented a hypervisor called [[https://fuchsia.googlesource.com/fuchsia/+/refs/heads/main/src/virtualization][Machina]]).
33
35
As for the second one, follow me patiently.
34
36
35
-
* A detour through how debugger works
37
+
* A Detour through How Debuggers Work
36
38
37
-
Ever wonder how a debugger can stop the expectation of a debuggee and inspect the running status of the debuggee,
39
+
Ever wonder how a debugger can stop the execution of a debuggee and inspect the running status of the debuggee,
38
40
and even change its control flow?
39
41
40
42
Here is the pseudocode of a Windows debugger. It is copied from [[https://www.microsoftpressstore.com/articles/article.aspx?p=2201303][How Windows Debuggers Work]].
@@ -71,7 +73,7 @@ The debugger can then do whatever it needs to facilitate debugging. For instance
71
73
it can not only read the memory pages of the debuggee,
72
74
but also change the control flow, e.g. jump to another address and execute the instruction there.
73
75
74
-
* Syscalls and how to emulate them in userspace
76
+
* Syscalls and How to Emulate Them in Userspace
75
77
The moral of the above story is that the operating system normally provides a way for one process to
76
78
trace and modify the running status of another process. If we can "arbitrarily" modify the control flow
77
79
of sub processes, we may be able to run foreign binaries.
Take reading a file as an example. This ultimately boils down to three syscalls,
84
86
+ The userspace program asks to open a file at the specified path; the kernel returns a file handle in the form of a file descriptor.
85
-
+ userspace program continues on by reading the file descriptor. The kernel writes the data it read from block devices, and then
86
-
write the bytes to the location the userspace program specified.
87
+
+ The userspace program continues by reading from the file descriptor. The kernel reads the data from block devices, and then
88
+
writes the bytes to the location the userspace program specified.
87
89
+ When the userspace program is done, it asks to close the file descriptor. The kernel releases the related resources.
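
The three steps above can be sketched with Python's ~os~ module, whose ~os.open~, ~os.read~ and ~os.close~ are thin wrappers over the corresponding syscalls. This is only an illustration; the function name ~read_whole_file~ and the chunk size are assumptions of this sketch, and on current Linux the C library may issue ~openat(2)~ rather than ~open(2)~ under the hood.

```python
import os
import tempfile


def read_whole_file(path: str, chunk_size: int = 4096) -> bytes:
    """Read a file through the open/read/close syscall wrappers."""
    # Step 1: ask the kernel to open the path; it returns a file descriptor.
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            # Step 2: ask the kernel to copy up to chunk_size bytes to us.
            chunk = os.read(fd, chunk_size)
            if not chunk:  # an empty result means end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        # Step 3: hand the descriptor back so the kernel can release it.
        os.close(fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello from userspace\n")
    try:
        print(read_whole_file(tmp.name))
    finally:
        os.unlink(tmp.name)
```

Running the script under ~strace~ should show the same open, read and close sequence, surrounded by the syscalls the Python interpreter itself makes.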
88
90
89
-
All the hardware resources is managed and utilized this way. The kernel provides a unified abstraction, the userspace programs
90
-
utilizes this abstraction through the convention of syscalls.
91
+
Almost all hardware resources are managed and utilized this way (userspace programs can bypass the kernel in some situations).
92
+
The kernel provides a unified abstraction, and userspace programs use this abstraction through the convention of syscalls.
91
93
92
-
** How to make a syscall manually
93
-
See [[https://lwn.net/Articles/604287/][Anatomy of a system call, part 1]], [[https://lwn.net/Articles/604515/][Anatomy of a system call, part 2]] for details.
94
+
** How to Make a Syscall Manually
95
+
See [[https://lwn.net/Articles/604287/][Anatomy of a system call, part 1]], [[https://lwn.net/Articles/604515/][Anatomy of a system call, part 2]] and [[https://blog.packagecloud.io/eng/2016/04/05/the-definitive-guide-to-linux-system-calls/][The Definitive Guide to Linux System Calls]] for details.
94
96
95
-
The gist is that programs put the required arguments in the specified register. It then runs instruction [[https://stackoverflow.com/questions/1817577/what-does-int-0x80-mean-in-assembly-code][~int 0x80~]] to raise a soft interrupt.
96
-
The CPU automatically dispatch this interruption to a registered interruption handler, which is a kernel-space procedure.
97
+
The gist is that a program puts the required arguments in the specified registers. It then runs the instruction [[https://stackoverflow.com/questions/1817577/what-does-int-0x80-mean-in-assembly-code][~int 0x80~]] to raise a software interrupt (this is the legacy 32-bit x86 mechanism; x86-64 programs use the dedicated ~syscall~ instruction instead).
98
+
The CPU automatically dispatches this interrupt to a registered interrupt handler, which is a kernel-space procedure.
97
99
The kernel space procedure then checks the syscall number and dispatches the call to a specialized handler.
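
As a rough sketch of calling a syscall by number, the snippet below goes through the C library's generic ~syscall~ wrapper via ~ctypes~ instead of writing assembly. The wrapper does the register loading and executes the trap instruction for the current architecture. The helper name ~raw_getpid~ and the small number table are assumptions of this sketch; it only knows the ~getpid~ numbers for 64-bit x86 and ARM Linux and returns ~None~ elsewhere.

```python
import ctypes
import os
import platform
import sys
from typing import Optional

# Syscall numbers are per architecture. Only these two entries are known to
# this sketch; both are the number of getpid on 64-bit Linux.
GETPID_NUMBERS = {"x86_64": 39, "aarch64": 172}


def raw_getpid() -> Optional[int]:
    """Call getpid by number; return None where the sketch does not apply."""
    if not sys.platform.startswith("linux"):
        return None
    if ctypes.sizeof(ctypes.c_void_p) != 8:  # 32-bit userland uses other numbers
        return None
    number = GETPID_NUMBERS.get(platform.machine())
    if number is None:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # libc places the number and arguments in the registers the kernel
    # expects and then traps into the kernel.
    return int(libc.syscall(ctypes.c_long(number)))


if __name__ == "__main__":
    print(raw_getpid(), os.getpid())
```

On a supported machine the two printed values should be the same process id, since both paths end in the same kernel handler.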
98
100
99
-
** How to intercept syscalls in Linux
100
-
In Linux, we can easily trace the syscalls made by a program with ~strace~.
101
-
~strace~ is able to print out all the syscalls a program has called and all the return code of those syscalls.
101
+
** How to Intercept Syscalls in Linux
102
+
In Linux, we can easily trace the syscalls made by a program with [[https://strace.io/][~strace~]].
103
+
~strace~ is able to print out all the syscalls a program has called and all the return codes of those syscalls.
102
104
103
105
You might have wondered how ~strace~ is able to inspect syscalls. We need the blessing of the Linux kernel to do such a thing.
104
106
In order to obtain such a blessing, ~strace~ needs to, you might have guessed,
@@ -107,33 +109,36 @@ The tracer is then notified to take some actions. In the ~strace~ case, ~strace~
107
109
tells the kernel to continue executing ~syscalls~. Just after the kernel finishes the ~syscall~ logic and before it returns control to the tracee,
108
110
the kernel tells the tracer the return code, so you can see the syscall return code with ~strace~.
109
111
110
-
** How to hijack syscalls in Linux
112
+
** How to Hijack Syscalls in Linux
111
113
As we have mentioned, the kernel is able to let userspace programs hook into syscalls.
112
-
In order to fully emulate syscalls, the userspace program only needs a few more privileges.
114
+
In order to fully emulate syscalls, the userspace program needs a few more privileges.
113
115
For example, some syscalls need to write the result to the memory of the caller, an operation strictly forbidden in normal situations.
114
116
The kernel needs to grant memory read and write permission to the tracing program. Fortunately, this is also doable with ~ptrace(2)~.
115
117
Well, theoretically this is fantastic. Do we have any real-world usage of userspace syscall dispatch? Yes.
User-mode Linux is an ancient poor man's virtualization on Linux. See [[https://www.usenix.org/conference/als-01/user-mode-linux][User-mode Linux paper]] and [[https://www.kernel.org/doc/html/latest/virt/uml/user_mode_linux_howto_v2.html][kernel documentation]] for details.
122
+
User-mode Linux is an ancient poor man's virtualization on Linux. It uses ~ptrace(2)~ to implement a Linux on Linux.
123
+
See [[https://www.usenix.org/conference/als-01/user-mode-linux][User-mode Linux paper]] and [[https://www.kernel.org/doc/html/latest/virt/uml/user_mode_linux_howto_v2.html][kernel documentation]] for details.
121
124
122
125
*** gVisor
123
126
A modern application is [[https://gvisor.dev/][gVisor]]. According to its [[https://gvisor.dev/docs/][official website documentation]],
124
127
#+begin_quote
125
128
gVisor is an application kernel, written in Go, that implements a substantial portion of the Linux system call interface. It provides an additional layer of isolation between running applications and the host operating system.
126
129
#+end_quote
127
130
128
-
Quite mouthful, isn't it? In gVisor environment, safe syscalls from the applications are passed to the underlying kernel,
131
+
Quite a mouthful, isn't it? In gVisor-managed environments, safe syscalls from the applications are passed to the underlying kernel,
129
132
while dangerous ones are censored by a mediator component called [[https://github.com/google/gvisor/tree/master/pkg/sentry][Sentry]].
130
-
Sentry passes the syscalls to the [[https://gvisor.dev/docs/architecture_guide/platforms/][Platform]], which emulates real syscalls. When the emulation is done, the results are
131
-
delivered to user applications. In this way, gVisor provides greater isolation between applications, which is quite useful in container environment.
133
+
Sentry passes the syscalls to the [[https://gvisor.dev/docs/architecture_guide/platforms/][Platform]], which emulates real syscalls.
134
+
gVisor currently supports two platforms, ptrace and kvm. When the emulation is done, the results are
135
+
delivered to user applications. In this way, gVisor provides greater isolation between applications,
136
+
which is quite useful in container environments. Google Cloud Functions uses gVisor to harden the system.
132
137
133
-
** A new mechanism to dispatch syscalls
138
+
** A New Mechanism to Dispatch Syscalls
134
139
[[https://www.kernel.org/doc/html/latest/admin-guide/syscall-user-dispatch.html][Syscall user dispatch]].
135
140
136
-
* The starnix runner
141
+
* The Starnix Runner
137
142
~Fuchsia~ already has the ability to run unmodified Linux binaries. See initial implementation [[https://fuchsia-review.googlesource.com/c/fuchsia/+/485746][here]].
138
143
The basic idea has already been presented. We need a hook mechanism in the kernel to run a specific handler when some exceptional event happens.
139
144
Those kinds of exceptional events are called [[https://fuchsia.dev/fuchsia-src/concepts/kernel/exceptions][exceptions in ~Fuchsia~]].
@@ -147,7 +152,7 @@ inspect or correct the condition.
147
152
148
153
We now dive into the details.
149
154
150
-
** hooks in the kernel
155
+
** Hooks in the Kernel
151
156
As a matter of fact, ~fuchsia~ (more precisely, zircon, ~fuchsia~'s kernel) provides system APIs through [[https://fuchsia.dev/fuchsia-src/concepts/kernel/vdso][vDSO]]
152
157
(which is great for binary compatibility and updatability, see [[https://xuzhongxing.github.io/201806fuchsia.pdf][P20 of these slides]]).
153
158
When you invoke normal Linux syscalls in ~Fuchsia~, exceptions are raised.
The line ~ret = sys_invalid_syscall(syscall_num, pc, vdso_code_address)~ saves the original syscall number and raises an exception.
183
188
The kernel then suspends the current thread and notifies the registered exception handler.
184
189
185
-
** handlers in the userspace
190
+
** Handlers in the Userspace
186
191
[[https://cs.opensource.google/fuchsia/fuchsia/+/main:src/proc/bin/starnix/runner.rs;l=69-152;drc=5744210c57bc34495941363f6ae1b7423483fe0b][Here]] is the code snippet copied from ~fuchsia~'s ~starnix~ runner.
Sans a few setup work (see elf loader, dynamic interpreter and process initialization below) and the actual dispatch logic,
276
-
this is how ~starnix~ runs unmodified Linux binaries. The ~starnix~ runner first set up an exception channel.
280
+
Aside from some setup work (see ELF loader, dynamic interpreter and process initialization below) and the actual dispatch logic,
281
+
this is how ~starnix~ runs unmodified Linux binaries. The ~starnix~ runner first sets up an exception channel
277
282
and then runs a loop in which it waits for any message from the exception channel.
278
-
When the data arrive at this channel, The runner first checks if this message is actually bad syscall exception.
283
+
When data arrives on this channel, the runner first checks whether the message is actually a bad syscall exception.
279
284
If so, the runner acquires the current register state, then dispatches the original
280
285
syscall number and its arguments to the user-defined functions. The actual implementations are scattered among different
281
286
files named ~syscalls.rs~. As an example, here is the link to [[https://cs.opensource.google/fuchsia/fuchsia/+/main:src/proc/bin/starnix/fs/socket/syscalls.rs;l=612-633][~sendto~]].
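
The dispatch step described above can be modelled in a few lines. The sketch below is a pure simulation and does not use any ~Fuchsia~ or ~starnix~ API: ~Registers~, ~Task~, ~HANDLERS~ and ~handle_bad_syscall~ are hypothetical names, and task memory is just a ~bytearray~. It only assumes the x86-64 Linux convention, where ~rax~ carries the syscall number on entry and the result on return, and ~rdi~, ~rsi~, ~rdx~ carry the first three arguments.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

SYS_WRITE = 1    # x86-64 Linux syscall numbers
SYS_GETPID = 39
ENOSYS = 38      # errno for "function not implemented"


@dataclass
class Registers:
    """Subset of the x86-64 registers used by the Linux syscall convention."""
    rax: int = 0  # syscall number on entry, return value on exit
    rdi: int = 0  # argument 1
    rsi: int = 0  # argument 2
    rdx: int = 0  # argument 3


@dataclass
class Task:
    """Hypothetical per-process state kept by the emulator."""
    pid: int
    memory: bytearray
    output: List[bytes] = field(default_factory=list)


def sys_getpid(task: Task, regs: Registers) -> int:
    return task.pid


def sys_write(task: Task, regs: Registers) -> int:
    # Copy rdx bytes starting at address rsi out of the task's memory.
    # The file descriptor in rdi is ignored in this sketch.
    data = bytes(task.memory[regs.rsi:regs.rsi + regs.rdx])
    task.output.append(data)
    return len(data)


HANDLERS: Dict[int, Callable[[Task, Registers], int]] = {
    SYS_GETPID: sys_getpid,
    SYS_WRITE: sys_write,
}


def handle_bad_syscall(task: Task, regs: Registers) -> None:
    """Look up the handler by syscall number and store the result in rax."""
    handler = HANDLERS.get(regs.rax)
    regs.rax = handler(task, regs) if handler else -ENOSYS


if __name__ == "__main__":
    task = Task(pid=7, memory=bytearray(b"..hi!"))
    regs = Registers(rax=SYS_WRITE, rdi=1, rsi=2, rdx=3)
    handle_bad_syscall(task, regs)
    print(regs.rax, task.output)
```

A real runner would read the registers from the suspended thread, write the result back, and resume the thread; here that round trip is reduced to mutating the ~Registers~ object in place.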
@@ -284,24 +289,24 @@ files named ~syscalls.rs~. As an example, here is the link to [[https://cs.opens
284
289
Although I have described how ~starnix~ intercepts and hijacks normal Linux syscalls, there are still quite
285
290
a few things I have omitted that Linux programs need in order to run normally.
286
291
287
-
*** More syscalls
292
+
*** More Syscalls
288
293
There are [[https://filippo.io/linux-syscall-table/][quite a few syscalls]] to reimplement. Linux offers many syscalls, most of which require a reimplementation.
289
294
Some syscalls like ~gettimeofday~ need only stateless shims, while some require ~starnix~ to save state internally.
290
295
For example, you may not want other processes to access your file descriptors.
291
296
When ~starnix~ opens a file on the Linux binary's behalf, it needs to keep track of the ownership of handles.
292
297
Some syscalls are performance critical. Any implementation needs careful measurement.
293
298
[[https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0082_starnix#memory][Memory access]] is an example.
294
299
295
-
*** ELF Loader and Dynamic interpreter
296
-
Programs do not automagically run on a platform. The platform need to do a few setup work.
297
-
The first thing it needs to do is load the program from disk to memory.
298
-
The elf loader for ~fuchsia~ is implemented [[https://cs.opensource.google/fuchsia/fuchsia/+/main:src/proc/bin/starnix/loader.rs;drc=a447744ac172d77b4165342360c579a7fecb181b][here]].
300
+
*** ELF Loader and Dynamic Interpreter
301
+
Programs do not automagically run on a platform. The platform needs to do some setup work.
302
+
The first thing it needs to do is load the program from disk to memory. This is what the ELF loader does.
303
+
The ELF loader for ~fuchsia~ is implemented [[https://cs.opensource.google/fuchsia/fuchsia/+/main:src/proc/bin/starnix/loader.rs;drc=a447744ac172d77b4165342360c579a7fecb181b][here]].
299
304
To complicate things further, not all programs are self-contained. Some of them require a symbol resolution at runtime.
300
305
After the program is loaded into memory, depending on whether it has a ~PT_INTERP~ segment, the runner may run
301
-
the dynamic interpreter first. The interpreter resolves symbols in the dynamically-linked binaries and then
306
+
the dynamic interpreter first. The interpreter resolves symbols in the dynamically linked binaries and then
302
307
jumps to the entry point address (which is available from the auxiliary vector ~AT_ENTRY~, see below) of this program.
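
To make the ~PT_INTERP~ check concrete, here is a minimal sketch that scans the program headers of a little-endian ELF64 image and returns the interpreter path if there is one. It is not how the ~starnix~ loader is written; the function name ~find_interpreter~ is an assumption of this sketch, and a real loader would also have to map the ~PT_LOAD~ segments, which is skipped here.

```python
import struct
from typing import Optional

# Layouts of the ELF64 file header (64 bytes) and program header (56 bytes).
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")
ELF64_PHDR = struct.Struct("<IIQQQQQQ")
PT_LOAD, PT_INTERP = 1, 3


def find_interpreter(image: bytes) -> Optional[str]:
    """Return the PT_INTERP path of a little-endian ELF64 image, or None."""
    if (len(image) < ELF64_HEADER.size or image[:4] != b"\x7fELF"
            or image[4] != 2 or image[5] != 1):
        raise ValueError("not a little-endian ELF64 image")
    fields = ELF64_HEADER.unpack_from(image, 0)
    e_phoff, e_phentsize, e_phnum = fields[5], fields[9], fields[10]
    for index in range(e_phnum):
        phdr = ELF64_PHDR.unpack_from(image, e_phoff + index * e_phentsize)
        p_type, p_offset, p_filesz = phdr[0], phdr[2], phdr[5]
        if p_type == PT_INTERP:
            # The segment holds a NUL-terminated path to the interpreter.
            raw = image[p_offset:p_offset + p_filesz]
            return raw.rstrip(b"\0").decode()
    return None  # no PT_INTERP: the binary is statically linked
```

On a typical Linux system, calling this on the bytes of a dynamically linked executable should return the path of the system's dynamic loader, while a statically linked binary yields ~None~.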
303
308
304
-
*** Process initialization
309
+
*** Process Initialization
305
310
On Linux, the kernel does some setup work for programs, which is quite different from the process initialization
306
311
logic of ~Fuchsia~. For example, the Linux kernel sets up the stack for the binary, and then pushes the auxiliary vector, environment variables, argv and argc
307
312
onto the stack (See [[https://gitlab.com/x86-psABIs/x86-64-ABI/-/blob/a0ea20c1a611e51891ea71687ba844abb86e987b/x86-64-ABI/low-level-sys-info.tex#L998][System V x86 psABIs]], [[https://lwn.net/Articles/630727/][How programs get run]] and [[https://lwn.net/Articles/631631/][How programs get run: ELF binaries]] for details),
@@ -328,16 +333,16 @@ The environmental information may be in a quite different format. Here is [[http
328
333
#+end_src
329
334
It is immediately clear from the parameter names what is populated onto the initial stack.
330
335
331
-
*** Other conventions
336
+
*** Other Conventions
332
337
There are many other implicit conventions Linux programs rely on.
333
338
For example, if you can't open stdout/stderr on your system, I expect more than 50% of the programs will crash immediately.