Commit ccd1e6c

add more references for syscalls
1 parent 3fd17eb commit ccd1e6c

1 file changed

+49
-44
lines changed

org/20211216125356-fuchsia_starnix.org

Lines changed: 49 additions & 44 deletions
@@ -6,35 +6,37 @@
66

77
[[https://fuchsia.googlesource.com/fuchsia/+/refs/heads/main/src/proc/bin/starnix][~Starnix~]] is the code name of a ~Fuchsia~ project which proposes to run unmodified Linux programs.
88
This is my attempt to understand what needs to be done for ~Fuchsia~ to run Linux programs,
9-
and how Linux run programs itself. The main reference is [[https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0082_starnix][RFC 0082 from ~Fuchsia~]],
9+
and how Linux runs programs itself. The main reference is [[https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0082_starnix][RFC 0082 from ~Fuchsia~]],
1010
from which you definitely will benefit more.
1111

12-
* A Tale of two alternatives
12+
* A Tale of Two Alternatives
1313
So, you want to run unmodified Linux programs in ~Fuchsia~. You have two choices.
1414

15-
+ Creating a virtual machine which is able to emulate instructions of the binary you want to run on Linux
1615
+ Mimicking Linux only when crossing the system boundary and running other instructions unmodified on the host
16+
+ Creating a virtual machine which is able to emulate instructions of the binary you want to run on Linux
1717

18-
Coincidentally, the first approach is what WSL 1 takes, and the second approach is what WSL 2 takes.
18+
Coincidentally, the first approach is what WSL 1 takes to run Linux programs, and the second approach is what WSL 2 takes.
1919
The virtual machine approach is an easy choice if you don't need tight system integration.
2020
You don't need to differentiate the guest kernel and the applications running on the guest kernel.
2121
You know, virtualization is a mature field. All you need to do is port the virtual machine monitor (hypervisor).
2222
After that, the case is settled for good.
2323

24-
The second way has more stringent requirements. First, you will need the same ISA for the Linux program and the host machine.
24+
Although the syscall-mimicking approach is much more lightweight (you don't need a scheduler within another scheduler), it has more stringent requirements.
25+
First, you will need the same ISA for the Linux program and the host machine.
2526
Second, you need to implement a ton of system interfaces (system calls, or API through system libraries like win32 API).
2627
If the upstream syscall interface changes, you need to
2728
keep up to date. Third, not only are there many syscalls to port, but there are also many unwritten conventions the Linux binaries
2829
expect the running host to satisfy. To name a few, ELF loader, dynamic interpreter, System V interface for process initialization,
29-
POSIX API, stdin and stdout conventions, and so on.
30+
POSIX API, stdin/stdout conventions.
3031

3132
To summarize, it is a great price to pay for tight integration. So why does ~fuchsia~ choose to implement this?
32-
And how does ~fuchsia~ implement the POSIX interface. I am not able to answer the first question.
33+
And how does ~fuchsia~ implement the POSIX interface? I am not able to answer the first question
34+
(fuchsia actually implemented a hypervisor called [[https://fuchsia.googlesource.com/fuchsia/+/refs/heads/main/src/virtualization][Machina]]).
3335
As for the second one, follow me patiently.
3436

35-
* A detour through how debugger works
37+
* A Detour through How Debuggers Work
3638

37-
Ever wonder how a debugger can stop the expectation of a debuggee and inspect the running status of the debuggee,
39+
Ever wonder how a debugger can stop the execution of a debuggee and inspect the running status of the debuggee,
3840
and even change its control flow?
3941

4042
Here is the pseudocode of a Windows debugger. It is copied from [[https://www.microsoftpressstore.com/articles/article.aspx?p=2201303][How Windows Debuggers Work]].
@@ -71,7 +73,7 @@ The debugger can then do whatever it needs to facilitate debugging. For instance
7173
it can not only read the memory pages of the debuggee,
7274
but also change the control flow, e.g. jump to another address and execute the instruction there.
7375

74-
* Syscalls and how to emulate them in userspace
76+
* Syscalls and How to Emulate Them in Userspace
7577
The moral of the above story is that the operating system normally provides a way for one process to
7678
trace and modify the running status of another process. If we can "arbitrarily" modify the control flow
7779
of sub processes, we may be able to run foreign binaries.
@@ -82,23 +84,23 @@ uses [[https://en.wikipedia.org/wiki/VDSO][vDSO]], [[https://lwn.net/Articles/80
8284

8385
Take reading a file as an example. This ultimately boils down to three syscalls:
8486
+ The userspace program asks to open a file at the specified path, and the kernel returns a file handle in the form of a file descriptor.
85-
+ userspace program continues on by reading the file descriptor. The kernel writes the data it read from block devices, and then
86-
write the bytes to the location the userspace program specified.
87+
+ The userspace program continues by reading from the file descriptor. The kernel reads the data from block devices, and then
88+
writes the bytes to the location the userspace program specified.
8789
+ When the userspace program is done, it asks to close the file descriptor. The kernel releases the related resources.
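
The three steps above can be sketched in Rust. This is only an illustration: the helper name ~read_via_syscalls~ and the temporary file name are made up for this note, and the standard library decides which exact syscalls are issued (on Linux, typically ~openat~, ~read~ and ~close~).

```rust
use std::fs::{self, File};
use std::io::{Read, Write};
use std::path::Path;

/// Read a whole file through the open -> read -> close sequence.
/// (Hypothetical helper, named for this note only.)
fn read_via_syscalls(path: &Path) -> std::io::Result<Vec<u8>> {
    // Step 1: open. The kernel hands back a file descriptor, wrapped by `File`.
    let mut file = File::open(path)?;
    let mut contents = Vec::new();
    let mut buf = [0u8; 4096];
    loop {
        // Step 2: read. The kernel copies bytes into the buffer we pass in.
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // a zero-length read signals end of file
        }
        contents.extend_from_slice(&buf[..n]);
    }
    // Step 3: close. Dropping the `File` releases the descriptor.
    drop(file);
    Ok(contents)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("starnix_note_demo.txt");
    File::create(&path)?.write_all(b"hello from userspace")?;
    let bytes = read_via_syscalls(&path)?;
    println!("read {} bytes", bytes.len());
    fs::remove_file(&path)?;
    Ok(())
}
```

Running this program under ~strace~ on Linux is a quick way to see which syscalls the standard library really makes.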
8890

89-
All the hardware resources is managed and utilized this way. The kernel provides a unified abstraction, the userspace programs
90-
utilizes this abstraction through the convention of syscalls.
91+
Almost all hardware resources are managed and utilized this way (a userspace program can bypass the kernel in some situations).
92+
The kernel provides a unified abstraction, and userspace programs use this abstraction through the convention of syscalls.
9193

92-
** How to make a syscall manually
93-
See [[https://lwn.net/Articles/604287/][Anatomy of a system call, part 1]], [[https://lwn.net/Articles/604515/][Anatomy of a system call, part 2]] for details.
94+
** How to Make a Syscall Manually
95+
See [[https://lwn.net/Articles/604287/][Anatomy of a system call, part 1]], [[https://lwn.net/Articles/604515/][Anatomy of a system call, part 2]] and [[https://blog.packagecloud.io/eng/2016/04/05/the-definitive-guide-to-linux-system-calls/][The Definitive Guide to Linux System Calls]] for details.
9496

95-
The gist is that programs put the required arguments in the specified register. It then runs instruction [[https://stackoverflow.com/questions/1817577/what-does-int-0x80-mean-in-assembly-code][~int 0x80~]] to raise a soft interrupt.
96-
The CPU automatically dispatch this interruption to a registered interruption handler, which is a kernel-space procedure.
97+
The gist is that a program puts the required arguments in the specified registers. It then runs the instruction [[https://stackoverflow.com/questions/1817577/what-does-int-0x80-mean-in-assembly-code][~int 0x80~]] (the legacy 32-bit x86 entry point; x86-64 code normally uses the ~syscall~ instruction) to raise a software interrupt.
98+
The CPU automatically dispatches this interrupt to a registered interrupt handler, which is a kernel-space procedure.
9799
The kernel space procedure then checks the syscall number and dispatches the call to a specialized handler.
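
To make the "check the syscall number and dispatch" step concrete, here is a toy dispatcher in Rust. Everything in it is an assumption for illustration: ~SyscallRegs~, ~dispatch~ and the two handlers are invented names, the handlers return canned values, and the numbers are the x86-64 Linux ones (the 32-bit ~int 0x80~ table uses different numbers).

```rust
/// Hypothetical snapshot of the registers that carry a syscall request.
#[derive(Debug, Clone, Copy)]
struct SyscallRegs {
    number: u64,    // which syscall is requested
    args: [u64; 3], // first three arguments (real syscalls take up to six)
}

// x86-64 Linux syscall numbers and the errno for "not implemented".
const SYS_WRITE: u64 = 1;
const SYS_GETPID: u64 = 39;
const ENOSYS: i64 = 38;

/// Toy handler: pretend every requested byte was written.
fn sys_write(args: [u64; 3]) -> i64 {
    args[2] as i64
}

/// Toy handler: return a made-up process id.
fn sys_getpid(_args: [u64; 3]) -> i64 {
    4242
}

/// Look at the syscall number and hand the call to a specialized handler.
/// Unknown numbers yield -ENOSYS, following the Linux error convention.
fn dispatch(regs: SyscallRegs) -> i64 {
    match regs.number {
        SYS_WRITE => sys_write(regs.args),
        SYS_GETPID => sys_getpid(regs.args),
        _ => -ENOSYS,
    }
}

fn main() {
    let regs = SyscallRegs { number: SYS_GETPID, args: [0; 3] };
    println!("syscall {} returned {}", regs.number, dispatch(regs));
}
```

A real kernel uses a table indexed by the syscall number rather than a ~match~, but the control flow is the same.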
98100

99-
** How to intercept syscalls in Linux
100-
In Linux, we can easily trace the syscalls made by a program with ~strace~.
101-
~strace~ is able to print out all the syscalls a program has called and all the return code of those syscalls.
101+
** How to Intercept Syscalls in Linux
102+
In Linux, we can easily trace the syscalls made by a program with [[https://strace.io/][~strace~]].
103+
~strace~ is able to print out all the syscalls a program has called and all the return codes of those syscalls.
102104

103105
You might have wondered how ~strace~ is able to inspect syscalls. We need the blessing of the Linux kernel to do such a thing.
104106
In order to obtain such a blessing, ~strace~ needs to, you might have guessed,
@@ -107,33 +109,36 @@ The tracer is then notified to take some actions. In the ~strace~ case, ~strace~
107109
tells the kernel to continue executing ~syscalls~. Just after the kernel finishes the ~syscall~ logic and before it returns control to the tracee,
108110
the kernel tells the tracer the return code, so you can see the syscall return code with ~strace~.
109111

110-
** How to hijack syscalls in Linux
112+
** How to Hijack Syscalls in Linux
111113
As we have mentioned, the kernel is able to let userspace programs hook into syscalls.
112-
In order to fully emulate syscalls, the userspace program only needs a few more privileges.
114+
In order to fully emulate syscalls, the userspace program needs a few more privileges.
113115
For example, some syscalls need to write the result to the memory of the caller, an operation strictly forbidden in normal situations.
114116
The kernel needs to grant memory read and write permission to the tracing program. Fortunately, this is also doable with ~ptrace(2)~.
115117
Well, theoretically this is fantastic. Do we have any real-world usage of userspace syscall dispatch? Yes.
116118

117119
*** User-mode Linux
118120
[[file:assets/images/obama-awards-obama-a-medal.jpg]]
119121

120-
User-mode Linux is an ancient poor man's virtualization on Linux. See [[https://www.usenix.org/conference/als-01/user-mode-linux][User-mode Linux paper]] and [[https://www.kernel.org/doc/html/latest/virt/uml/user_mode_linux_howto_v2.html][kernel documentation]] for details.
122+
User-mode Linux is an ancient poor man's virtualization on Linux. It uses ~ptrace(2)~ to implement a Linux on Linux.
123+
See [[https://www.usenix.org/conference/als-01/user-mode-linux][User-mode Linux paper]] and [[https://www.kernel.org/doc/html/latest/virt/uml/user_mode_linux_howto_v2.html][kernel documentation]] for details.
121124

122125
*** gVisor
123126
A modern application is [[https://gvisor.dev/][gVisor]]. According to its [[https://gvisor.dev/docs/][official website documentation]],
124127
#+begin_quote
125128
gVisor is an application kernel, written in Go, that implements a substantial portion of the Linux system call interface. It provides an additional layer of isolation between running applications and the host operating system.
126129
#+end_quote
127130

128-
Quite mouthful, isn't it? In gVisor environment, safe syscalls from the applications are passed to the underlying kernel,
131+
Quite a mouthful, isn't it? In gVisor-managed environments, safe syscalls from the applications are passed to the underlying kernel,
129132
while dangerous ones are censored by a mediator component called [[https://github.com/google/gvisor/tree/master/pkg/sentry][Sentry]].
130-
Sentry passes the syscalls to the [[https://gvisor.dev/docs/architecture_guide/platforms/][Platform]], which emulates real syscalls. When the emulation is done, the results are
131-
delivered to user applications. In this way, gVisor provides greater isolation between applications, which is quite useful in container environment.
133+
Sentry passes the syscalls to the [[https://gvisor.dev/docs/architecture_guide/platforms/][Platform]], which emulates real syscalls.
134+
gVisor currently supports two platforms, ptrace and kvm. When the emulation is done, the results are
135+
delivered to user applications. In this way, gVisor provides greater isolation between applications,
136+
which is quite useful in container environments. Google Cloud Functions uses gVisor to harden the system.
132137

133-
** A new mechanism to dispatch syscalls
138+
** A New Mechanism to Dispatch Syscalls
134139
[[https://www.kernel.org/doc/html/latest/admin-guide/syscall-user-dispatch.html][Syscall user dispatch]].
135140

136-
* The starnix runner
141+
* The Starnix Runner
137142
~Fuchsia~ already has the ability to run unmodified Linux binaries. See initial implementation [[https://fuchsia-review.googlesource.com/c/fuchsia/+/485746][here]].
138143
The basic idea is already presented. We need a hook mechanism in the kernel to run a specific handler when some exceptional event happens.
139144
Those kinds of exceptional events are called [[https://fuchsia.dev/fuchsia-src/concepts/kernel/exceptions][exceptions in ~Fuchsia~]].
@@ -147,7 +152,7 @@ inspect or correct the condition.
147152

148153
We now dive into the details.
149154

150-
** hooks in the kernel
155+
** Hooks in the Kernel
151156
As a matter of fact, ~fuchsia~ (more precisely, zircon, ~fuchsia~'s kernel) provides system APIs through [[https://fuchsia.dev/fuchsia-src/concepts/kernel/vdso][vDSO]]
152157
(which is great for binary compatibility and updatability, see [[https://xuzhongxing.github.io/201806fuchsia.pdf][P20 of these slides]]).
153158
When you invoke normal Linux syscalls in ~Fuchsia~, exceptions are raised.
@@ -182,7 +187,7 @@ inline syscall_result do_syscall(uint64_t syscall_num, uint64_t pc, bool (*valid
182187
The line ~ret = sys_invalid_syscall(syscall_num, pc, vdso_code_address)~ saves the original syscall number and raises an exception.
183188
Then the kernel suspends the current thread and notifies the registered exception handler.
184189

185-
** handlers in the userspace
190+
** Handlers in the Userspace
186191
[[https://cs.opensource.google/fuchsia/fuchsia/+/main:src/proc/bin/starnix/runner.rs;l=69-152;drc=5744210c57bc34495941363f6ae1b7423483fe0b][Here]] is the code snippet copied from ~fuchsia~'s ~starnix~ runner.
187192

188193
#+begin_src rust
@@ -272,10 +277,10 @@ fn run_task(mut current_task: CurrentTask, exceptions: zx::Channel) -> Result<i3
272277
}
273278
#+end_src
274279

275-
Sans a few setup work (see elf loader, dynamic interpreter and process initialization below) and the actual dispatch logic,
276-
this is how ~starnix~ runs unmodified Linux binaries. The ~starnix~ runner first set up an exception channel.
280+
Apart from some setup work (see ELF loader, dynamic interpreter and process initialization below) and the actual dispatch logic,
281+
this is how ~starnix~ runs unmodified Linux binaries. The ~starnix~ runner first sets up an exception channel,
277282
and then runs a loop in which it waits for any message from the exception channel.
278-
When the data arrive at this channel, The runner first checks if this message is actually bad syscall exception.
283+
When data arrives at this channel, the runner first checks whether the message is actually a bad syscall exception.
279284
If so, the runner acquires the current register state, then dispatches the original
280285
syscall number and its arguments to the user-defined functions. The actual implementations are scattered among different
281286
files named ~syscalls.rs~. As an example, here is the link to [[https://cs.opensource.google/fuchsia/fuchsia/+/main:src/proc/bin/starnix/fs/socket/syscalls.rs;l=612-633][~sendto~]].
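
The shape of that loop can be imitated with nothing but the Rust standard library. The sketch below is not the ~starnix~ code: a ~std::sync::mpsc~ channel stands in for the Zircon exception channel, and ~Exception~, ~dispatch_syscall~ and this simplified ~run_task~ are names I made up to show the control flow (wait, check for a bad syscall, dispatch, resume the task with the result).

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical message type standing in for a Zircon exception.
enum Exception {
    /// The task tried a Linux syscall; `reply` resumes it with the result.
    BadSyscall { number: u64, args: [u64; 3], reply: mpsc::Sender<i64> },
    /// Any other exception; this sketch only logs it.
    Other(String),
}

/// Toy syscall table: only getpid (39 on x86-64) is "implemented".
fn dispatch_syscall(number: u64, _args: [u64; 3]) -> i64 {
    match number {
        39 => 4242, // made-up pid
        _ => -38,   // -ENOSYS
    }
}

/// Wait on the exception channel until every sender is gone.
/// Returns how many syscalls were emulated.
fn run_task(exceptions: mpsc::Receiver<Exception>) -> usize {
    let mut handled = 0;
    for exception in exceptions {
        match exception {
            Exception::BadSyscall { number, args, reply } => {
                let result = dispatch_syscall(number, args);
                // Hand the result back so the suspended task can continue.
                let _ = reply.send(result);
                handled += 1;
            }
            Exception::Other(reason) => {
                eprintln!("ignoring non-syscall exception: {}", reason);
            }
        }
    }
    handled
}

fn main() {
    let (exception_tx, exception_rx) = mpsc::channel();
    // This thread plays the Linux binary: it "traps" and waits for the runner.
    let task = thread::spawn(move || {
        let (reply_tx, reply_rx) = mpsc::channel();
        exception_tx
            .send(Exception::BadSyscall { number: 39, args: [0; 3], reply: reply_tx })
            .unwrap();
        reply_rx.recv().unwrap()
    });
    let handled = run_task(exception_rx);
    let result = task.join().unwrap();
    println!("handled {} syscall(s), result = {}", handled, result);
}
```

In the real runner the result is written back into the thread's registers before the thread is resumed. The reply channel here only imitates that step.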
@@ -284,24 +289,24 @@ files named ~syscalls.rs~. As an example, here is the link to [[https://cs.opens
284289
Although I have described how ~starnix~ intercepts and hijacks normal Linux syscalls, there are still quite
285290
a few things omitted that Linux programs need in order to run normally.
286291

287-
*** More syscalls
292+
*** More Syscalls
288293
There are [[https://filippo.io/linux-syscall-table/][quite a few syscalls]] to reimplement. Linux offers many syscalls, most of which require a reimplementation.
289294
Some syscalls like ~gettimeofday~ need only stateless shims, while some require ~starnix~ to save state internally.
290295
For example, you may not want other processes to access your file descriptor.
291296
When ~starnix~ opens a file on the Linux binaries' behalf, it needs to keep track of the ownership of handles.
292297
Some syscalls are performance critical. Any implementation needs careful measurement.
293298
[[https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0082_starnix#memory][Memory access]] is an example.
294299

295-
*** ELF Loader and Dynamic interpreter
296-
Programs do not automagically run on a platform. The platform need to do a few setup work.
297-
The first thing it needs to do is load the program from disk to memory.
298-
The elf loader for ~fuchsia~ is implemented [[https://cs.opensource.google/fuchsia/fuchsia/+/main:src/proc/bin/starnix/loader.rs;drc=a447744ac172d77b4165342360c579a7fecb181b][here]].
300+
*** ELF Loader and Dynamic Interpreter
301+
Programs do not automagically run on a platform. The platform needs to do some setup work.
302+
The first thing it needs to do is load the program from disk to memory. This is what the ELF loader does.
303+
The ELF loader for ~fuchsia~ is implemented [[https://cs.opensource.google/fuchsia/fuchsia/+/main:src/proc/bin/starnix/loader.rs;drc=a447744ac172d77b4165342360c579a7fecb181b][here]].
299304
To complicate things further, not all programs are self-contained. Some of them require symbol resolution at runtime.
300305
After the program is loaded into memory, depending on whether the program has a ~PT_INTERP~ segment, the runner may run
301-
the dynamic interpreter first. The interpreter resolves symbols in the dynamically-linked binaries and then
306+
the dynamic interpreter first. The interpreter resolves symbols in the dynamically linked binaries and then
302307
jumps to the entry point address (which is available from the auxiliary vector ~AT_ENTRY~, see below) of this program.
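
Checking for a ~PT_INTERP~ segment is simple enough to sketch. The function below is my own illustration, not the ~starnix~ loader: ~find_interp~ and ~read_le~ are made-up names, and the code only understands little-endian ELF64 images. The field offsets follow the ELF64 layout (~e_phoff~ at 0x20, ~e_phentsize~ at 0x36, ~e_phnum~ at 0x38).

```rust
const PT_INTERP: u64 = 3; // program header type for the interpreter path

/// Read a little-endian unsigned integer of `len` bytes (at most 8) at `off`.
fn read_le(bytes: &[u8], off: usize, len: usize) -> Option<u64> {
    let field = bytes.get(off..)?.get(..len)?;
    // The last byte is the most significant one, so fold from the back.
    Some(field.iter().rev().fold(0u64, |acc, &b| (acc << 8) | b as u64))
}

/// Return the requested interpreter path if the image has a PT_INTERP segment.
/// Returns None for static binaries and for anything that is not ELF64 LE.
fn find_interp(image: &[u8]) -> Option<String> {
    if !image.starts_with(b"\x7fELF") {
        return None;
    }
    // EI_CLASS must be 2 (64-bit) and EI_DATA must be 1 (little endian).
    if *image.get(4)? != 2 || *image.get(5)? != 1 {
        return None;
    }
    let phoff = read_le(image, 0x20, 8)? as usize;     // e_phoff
    let phentsize = read_le(image, 0x36, 2)? as usize; // e_phentsize
    let phnum = read_le(image, 0x38, 2)? as usize;     // e_phnum
    for i in 0..phnum {
        let ph = phoff.checked_add(i.checked_mul(phentsize)?)?;
        if read_le(image, ph, 4)? == PT_INTERP {
            let offset = read_le(image, ph + 8, 8)? as usize;  // p_offset
            let filesz = read_le(image, ph + 32, 8)? as usize; // p_filesz
            let raw = image.get(offset..)?.get(..filesz)?;
            // The path is stored as a NUL-terminated string.
            let path = raw.split(|&b| b == 0).next()?;
            return String::from_utf8(path.to_vec()).ok();
        }
    }
    None
}

fn main() {
    if let Some(path) = std::env::args().nth(1) {
        let image = std::fs::read(&path).expect("cannot read file");
        match find_interp(&image) {
            Some(interp) => println!("{} requests interpreter {}", path, interp),
            None => println!("{} has no PT_INTERP segment", path),
        }
    }
}
```

On a typical x86-64 Linux system, dynamically linked binaries name a loader such as ~/lib64/ld-linux-x86-64.so.2~ here, while statically linked ones have no such segment.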
303308

304-
*** Process initialization
309+
*** Process Initialization
305310
On Linux, the kernel does some setup work for programs, which is quite different from the process initialization
306311
logic of ~Fuchsia~. For example, the Linux kernel sets up the stack for the binary, and then pushes the auxiliary vector, environment variables, argv and argc
307312
onto the stack (See [[https://gitlab.com/x86-psABIs/x86-64-ABI/-/blob/a0ea20c1a611e51891ea71687ba844abb86e987b/x86-64-ABI/low-level-sys-info.tex#L998][System V x86 psABIs]], [[https://lwn.net/Articles/630727/][How programs get run]] and [[https://lwn.net/Articles/631631/][How programs get run: ELF binaries]] for details),
@@ -328,16 +333,16 @@ The environmental information may be in a quite different format. Here is [[http
328333
#+end_src
329334
It is immediately clear from the parameter names what is populated onto the initial stack.
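
For reference, the order of the words a Linux program expects at its initial stack pointer (per the System V psABI linked above) can be sketched as follows. ~initial_stack_words~ is a hypothetical helper for this note. It takes already-computed "pointers" as plain ~u64~ values and does not copy any strings, so it only shows the layout, not a working loader.

```rust
// Auxiliary vector keys, with the values used by Linux.
const AT_NULL: u64 = 0;   // terminates the auxiliary vector
const AT_PAGESZ: u64 = 6; // system page size
const AT_ENTRY: u64 = 9;  // entry point of the program

/// Lay out the words found at the initial stack pointer, lowest address first.
/// `argv` and `envp` hold hypothetical addresses of strings placed elsewhere.
fn initial_stack_words(argv: &[u64], envp: &[u64], auxv: &[(u64, u64)]) -> Vec<u64> {
    let mut words = Vec::new();
    words.push(argv.len() as u64); // argc
    words.extend_from_slice(argv); // argv[0] .. argv[argc - 1]
    words.push(0);                 // NULL terminator of argv
    words.extend_from_slice(envp); // environment pointers
    words.push(0);                 // NULL terminator of envp
    for &(key, value) in auxv {
        words.push(key);           // auxiliary vector entries are (key, value) pairs
        words.push(value);
    }
    words.push(AT_NULL);           // closing AT_NULL entry
    words.push(0);
    words
}

fn main() {
    let words = initial_stack_words(
        &[0x7000_1000],
        &[0x7000_2000],
        &[(AT_PAGESZ, 4096), (AT_ENTRY, 0x40_1000)],
    );
    for (i, word) in words.iter().enumerate() {
        println!("sp + {:3}: {:#x}", i * 8, word);
    }
}
```

The dynamic interpreter walks this same layout to find ~AT_ENTRY~ before it jumps to the program.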
330335

331-
*** Other conventions
336+
*** Other Conventions
332337
There are many other implicit conventions Linux programs rely on.
333338
For example, if you can't open stdout/stderr on your system, I expect more than 50% of the programs will crash immediately.
334339

335-
**** Posix compatibility
340+
**** POSIX Compatibility
336341
+ Many libraries
337342
+ ~system(3)~
338343
+ POSIX threads
339344

340-
**** Linux standard base
345+
**** Linux Standard Base
341346
+ Many libraries
342347
+ [[https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard][Filesystem Hierarchy Standard]]
343348
