diff --git a/articles/20230715-porting-riscv-ukl-translate-part3.md b/articles/20230715-porting-riscv-ukl-translate-part3.md new file mode 100644 index 0000000000000000000000000000000000000000..88662da6b22e1ee7ac6c08d58e1829a373f466a1 --- /dev/null +++ b/articles/20230715-porting-riscv-ukl-translate-part3.md @@ -0,0 +1,264 @@ +> Corrector: [TinyCorrect](https://gitee.com/tinylab/tinycorrect) v0.2-rc1 - [spaces toc refs pangu]
+> Title: [Integrating Unikernel Optimizations in a General Purpose OS](https://arxiv.org/pdf/2206.00789.pdf)
+> Author: Ali Raza
+> Translator: Gege-Wang <2891067867@qq.com>
+> Date: 2023/07/14
+> Revisor: Falcon
+> Project: [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
+> Sponsor: PLCT Lab, ISCAS + +# 在通用式操作系统中集成 Unikernel 优化-part3 + +## 评估 + +> After our experimental environment (§5.1), §5.2 discusses our experience with UKL supporting the fundamental non-performance goals of enabling Linux’s application support, HCL, and ecosystem. In Section 5.3 microbenchmarks are used to evaluate the performance of UKL on simple system calls (§5.3.1), more complex system calls (§5.3.2) and page faults (§5.3.3). We find that, while the advantage of just avoiding the hardware overhead of system calls is small, the advantage of adopting unikernel optimizations is large for simple kernel calls (e.g., 83%) and significant for page faults (e.g., 12.5%). Moreover, the improvement is significant even for expensive kernel calls that transfer 8KB of data (e.g., 24%). + +在我们的实验环境(§5.1)之后,§5.2 讨论了我们使用 UKL 支持的基本非性能目标的经验,包括允许 Linux 应用程序支持,HCL(Hardware Compatibility List)和生态系统。在 5.3 节中,微基准测试用于评估 UKL 在简单系统调用(第 5.3.1 节)、更复杂系统调用(第 5.3.2 节)和页面错误(第 5.3.3 节)上的性能。我们发现,虽然仅仅避免系统调用的硬件开销的优势很小,但采用单内核优化的优势对于简单的内核调用很大(例如,83%),对于页面错误也很重要(例如,12.5%)。此外,即使对于传输 8KB 数据的昂贵内核调用(例如,24%),这种改进也是显著的。 + +> In Section 5.4 we evaluate applying unikernel optimizations to both throughput (Redis §5.4.1, Memcached §5.4.2) and latency bound (Secrecy §5.4.3) applications. We find that configuration options provided by UKL can enable significant throughput improvements (e.g., 12%) and a simple 10 line change in Redis code results in more significant gains (e.g., 26%). The results are even more dramatic for latency-sensitive applications where configuration changes result in 15% improvement and a trivial application change enables a 100x improvement in performance. 
+ +在 5.4 节中,我们评估将单内核优化应用于吞吐量敏感(Redis §5.4.1,Memcached §5.4.2)和延迟敏感(Secrecy §5.4.3)的应用程序。我们发现,UKL 提供的配置选项可以显著提高吞吐量(例如,12%),Redis 代码中简单的 10 行更改会带来更显著的收益(例如,26%)。对于对延迟敏感的应用程序,结果更加显著,其中配置更改带来 15% 的改进,而一个微不足道的应用程序更改可以使性能提高 100 倍。 + +### 实验设置 + +> Experiments are run on Dell R620 servers configured with 128G of RAM arranged as a single NUMA node. The servers have two sockets, each containing an Intel Xeon CPU E5-26600 @ 2.20GHz, with 8 real cores per socket. The processors are configured to disable Turbo Boost, hyper-threads, sleep states, and dynamic frequency scaling. The servers are connected through a 10Gb link and use Broadcom NetXtremeII BCM57800 1/10 Gigabit Ethernet NICs. Experiments run on multiple computers use identically configured machines attached to the same top of rack switch to reduce external noise. On the software side, we use Linux 5.14 kernel and glibc version 2.31. Linux and different configurations of UKL were built with the same compile-time config options. We ran experiments on virtual and physical hardware and got consistent and repeatable results. In the interest of space, we only report bare-metal numbers unless stated otherwise. + +实验运行在 Dell R620 服务器上,服务器配置的 128G RAM 被安排为单个 NUMA 节点。服务器有两个插槽,每个插槽都包含一个英特尔至强处理器 E5-26600 @ 2.20GHz,每个插槽有 8 个物理核心。处理器被配置为禁用 Turbo Boost、超线程、睡眠状态和动态频率缩放。服务器通过 10Gb 的链路连接,并使用 Broadcom NetXtremeII BCM57800 1/10 千兆以太网网卡。在多台计算机上运行的实验使用配置相同的机器,并连接到同一台 TOR(top of rack)交换机,以减少外部噪声。在软件方面,我们使用 Linux 5.14 内核和 glibc 2.31 版本。Linux 和不同的 UKL 配置是用相同的编译时配置选项构建的。我们在虚拟和物理硬件上进行了实验,得到了一致和可重复的结果。由于篇幅的关系,除非另有说明,否则我们只报告裸机数据。 + +### Linux 应用程序、硬件及生态系统 + +> The fundamental goals of the UKL project are to integrate unikernel optimizations without losing Linux’s broad support for applications, hardware, and ecosystem. We discuss each of these three goals in turn. 
+ +UKL 项目的基本目标是在不失去 Linux 对应用程序、硬件和生态系统的广泛支持的情况下集成单内核优化。我们依次讨论这三个目标。 + +> Application support: As expected, we have had no difficulty running any Linux application as normal user-level processes on our modified kernel. We have used hundreds of unmodified binaries running as normal user-level processes without effort. That includes all the standard UNIX utilities, bash, different profilers, perf, and eBPF tools. + +应用程序支持:正如预期的那样,在修改后的内核上把任何 Linux 应用程序作为普通用户级进程运行都没有任何困难。我们毫不费力地运行了数百个未经修改的二进制文件,它们都作为普通的用户级进程运行。这包括所有标准的 UNIX 实用程序、bash、不同的分析器、perf 和 eBPF 工具。 + +> Dozens of unmodified applications have been tested as optimization targets for UKL. These include Memcached, Redis, Secrecy, a small TCP echo server, simple test programs for C++ constructors and the STL, a complex C++ graph based benchmark suite, a performance benchmark called LEBench, and a large number of standard glibc and pthread unit test programs. + +数十个未经修改的应用程序已经作为 UKL 的优化目标进行了测试。其中包括 Memcached,Redis,Secrecy,一个小型的 TCP 回显服务器,一些用于测试 C++ 构造函数和 STL 的简单测试程序,一个复杂的基于图(graph)的 C++ 基准测试套件,一个名为 LEBench 的性能基准测试,以及大量标准的 glibc 和 pthread 单元测试程序。 + +> There are some challenges in getting some applications running on UKL. First, as expected, one needs to be able to re-compile and statically link both the application and all its dependencies. Second, we have hit a number of programs that by default invoke fork followed by exec e.g., Postgres, and many that are dependent on the dynamic loader through calls to dlopen and others. Third, we have run into issues of proprietary applications available in only binary form, e.g., user-level libraries for GPUs. + +让一些应用程序在 UKL 上运行存在一些挑战。首先,正如预期的那样,需要能够重新编译和静态链接应用程序及其所有依赖项。其次,我们遇到了一些默认情况下先调用 fork 再调用 exec 的程序,例如 Postgres,以及许多通过调用 dlopen 等函数而依赖动态加载器的程序。第三,我们遇到了闭源应用程序仅以二进制形式提供的问题,例如,用于 GPU 的用户级库。 + +> Hardware support: For hardware, we have not run into any compatibility issues and have booted or kexeced to UKL on five different x86-64 servers and virtualization platforms. 
The scripts and tools used to deploy and manage normal Linux machines were used for UKL deployments as well. + +硬件支持:对于硬件,我们没有遇到任何兼容性问题,并且已经在五个不同的 x86-64 服务器和虚拟化平台上直接启动或通过 kexec 切换到 UKL。用于部署和管理普通 Linux 机器的脚本和工具也用于 UKL 部署。 + +> Ecosystem: Due to having a full-fledged userspace, we have been able to run all the different applications, utilities, and tools that can run on unmodified Linux. This has been extremely critical in building UKL, i.e., we use all the debugging tools and techniques available in Linux. We have been able to profile UKL workloads with perf and able to identify +code paths that could be squashed for performance benefits (see fig. 5). + +生态系统:由于拥有一个成熟的用户空间,我们已经能够运行所有可以在未经修改的 Linux 上运行的应用程序、实用程序和工具。这在构建 UKL 时非常关键,也就是说,我们使用 Linux 中可用的所有调试工具和技术。我们已经能够用 perf 工具对 UKL 工作负载进行性能剖析,并能够识别可以压缩以获得性能优势的代码路径(见图 5)。 + +> The UKL patch size for the base model is around 550 lines, and the full UKL patch with all the configurations is 1250 lines. Since the patch is small and non-invasive, we are hopeful that we can work with the Linux community towards upstream acceptance. + +基本模型的 UKL 补丁大小约为 550 行,而具有所有配置的完整 UKL 补丁大小为 1250 行。由于该补丁很小且非侵入性,我们希望能够与 Linux 社区合作,使其得到上游的接受。 + +> Table 2 compares the UKL patch to Kernel-Mode Linux (KML) and a selection of Linux features described in Linux Weekly News (LWN) articles in 2020. For comparison, the KML patch, used in the recent Lupine work, that runs applications in kernel mode is 3177 LOC, a complexity that has resulted in the patch not being accepted upstream. In contrast, UKL both provides richer functionality than KML, and is much simpler. This simplicity is due to three fortuitous changes since KML was introduced. First, UKL takes advantage of recent changes to the Linux kernel that make the changes to assembly much less intrusive. Second, UKL supports only x86-64, while KML was introduced at a time when it was necessary to support i386 to be relevant. 
Third, UKL does not deal with older hardware, like the i8259 PIC, that had to be supported by KML. + +表 2 将 UKL 补丁与 Kernel-Mode Linux (KML) 以及 2020 年 Linux Weekly News (LWN) 文章中描述的一些 Linux 特性进行了比较。作为对比,最近的 Lupine 工作所使用的、让应用程序在内核模式下运行的 KML 补丁有 3177 行代码,这种复杂性导致该补丁未被上游接受。相比之下,UKL 既提供了比 KML 更丰富的功能,又简单得多。这种简单性得益于 KML 问世以来的三个有利变化。首先,UKL 利用了最近对 Linux 内核的修改,使得对汇编代码的改动侵入性大大降低。其次,UKL 只支持 x86-64,而 KML 问世时必须支持 i386 才有实际意义。第三,UKL 不处理旧的硬件,比如 KML 必须支持的 i8259 PIC。 + +
+ +
+ +### 微基准测试 + +> Unikernels offer the opportunity to dramatically reduce the overhead of interactions between application and kernel code. We evaluate how UKL optimizations impact the overhead of simple system calls (§5.3.1), more expensive system calls (§5.3.2), and page faults (§5.3.3). Our results contradict recent work that suggests that the advantages are modest; we see that the reduction in overhead is larger (e.g., 90%) than previously reported and has a significant impact even for requests with large payloads (e.g., 24% with 8KByte recvfrom()). + +Unikernels 提供了显著减少应用程序和内核代码之间交互开销的机会。我们评估了 UKL 优化如何影响简单系统调用(§5.3.1)、更昂贵的系统调用(§5.3.2)和页面错误(§5.3.3)的开销。我们的结果与最近的研究相矛盾,该研究认为这种优势并不大;我们看到开销的减少比之前报告的更大(例如,90%),甚至对于具有大有效负载(large payloads)的请求也有显著的影响(例如,对于 8KByte 的 recvfrom(),减少了 24%)。 + +1. 系统调用的基础性能 + +> Figure 1 compares the overhead of simple system calls between Linux, UKL’s base model, and UKL_BYP. Results were gathered using the (slightly modified) LEBench microbenchmark to measure the base latency of getppid, read, write, sendto, and recvfrom (all with 1 byte payloads). + +图 1 比较了 Linux、UKL 的基本模型和 UKL_BYP 之间简单系统调用的开销。结果是使用(稍作修改的)LEBench 微基准测试收集的,用于测量 getppid()、read()、write()、sendto() 和 recvfrom() 这些系统调用的基本延迟(都是 1 字节的有效负载)。 + +
+ +
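The base-latency measurement described above can be approximated from user space. The following is a minimal Python sketch, not LEBench itself: it times repeated `os.getppid()` calls with `time.perf_counter_ns()`. The helper name `measure_getppid_ns` and the default iteration count are assumptions made for illustration. Interpreter overhead is included in the result, so the number is only a loose upper bound on the per-call cost and is not comparable to the figures reported in the paper.

```python
import os
import time


def measure_getppid_ns(iterations: int = 100_000) -> float:
    """Return the mean wall-clock time of one os.getppid() call, in nanoseconds.

    Hypothetical helper for illustration: the loop and the Python call
    overhead are part of the measurement, so this overestimates the cost
    of the underlying getppid system call.
    """
    start = time.perf_counter_ns()
    for _ in range(iterations):
        os.getppid()  # thin wrapper around the getppid system call
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


if __name__ == "__main__":
    print(f"mean getppid latency: {measure_getppid_ns():.0f} ns")
```

Running the same script on two kernels gives a rough relative comparison; a C harness with cycle counters would be needed for anything close to the precision of the original benchmark.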
+ +> We find that the advantage of the base model of UKL that essentially replaces syscall/sysret instructions with call/ret is modest, i.e., less than 5%. However, the UKL_BYP configuration that avoids expensive checks on transitions between application and kernel code can be up to 83% for a getppid; suggesting that optimizing the transition between application code may have a significant performance impact. + +我们发现,UKL 基本模型(本质上是用 call/ret 指令代替 syscall/sysret 指令)的优势并不大,即小于 5%。然而,UKL_BYP 配置避免了应用程序和内核代码之间转换时的昂贵检查,对于 getppid 来说,其改进最高可达 83%;这表明优化应用程序代码之间的转换可能会对性能产生重大影响。 + +2. 大请求 + +> Figure 2 contrasts the performance of Linux to UKL and UKL_BYP for read, write, sendto and recvfrom as we use LEBench microbenchmark to vary the payload up to 8KB of data. Again, baseline UKL shows very little improvement over Linux, but UKL_BYP shows a significant constant improvement. The right vertical axis also shows the downward trend of percentage improvement of UKL_BYP compared to Linux. As the time spent in the kernel increases, the percentage gain decreases. But even for payloads of up to 8KB, the percentage improvement is still significant, i.e., between 11% and 22%. + +图 2 对比了 Linux 与 UKL 和 UKL_BYP 在 read()、write()、sendto() 和 recvfrom() 方面的性能,我们使用 LEBench 微基准测试让有效负载在最大 8KB 的范围内变化。基线 UKL 再次显示相对于 Linux 几乎没有改进,但是 UKL_BYP 显示出显著且恒定的改进。右纵轴也显示了 UKL_BYP 相对于 Linux 的百分比改进的下降趋势。随着在内核中花费的时间的增加,百分比增益减少。但是,即使对于高达 8KB 的有效负载,改进的百分比仍然很大,即在 11% 到 22% 之间。 + +
+ +
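The downward trend of the percentage gain follows from simple arithmetic: if the bypass removes a roughly constant amount of work per call, that saving becomes a smaller fraction of the total as the time spent in the kernel grows. The sketch below illustrates this under stated assumptions. The per-call saving and the kernel work times are hypothetical values chosen for illustration, not measurements from the paper, and the function names are made up for this example.

```python
from typing import List, Sequence


def percent_improvement(baseline_ns: float, optimized_ns: float) -> float:
    """Relative improvement of the optimized latency over the baseline, in percent."""
    return 100.0 * (baseline_ns - optimized_ns) / baseline_ns


def improvement_by_kernel_work(saved_ns: float,
                               kernel_work_ns: Sequence[float]) -> List[float]:
    """Percentage gain when a fixed saved_ns is removed from each call.

    baseline  = kernel work + transition overhead (saved_ns)
    optimized = kernel work only
    """
    return [percent_improvement(work + saved_ns, work) for work in kernel_work_ns]


if __name__ == "__main__":
    # Hypothetical numbers: 100 ns saved per call, growing in-kernel work.
    for work, gain in zip((20.0, 400.0, 900.0),
                          improvement_by_kernel_work(100.0, (20.0, 400.0, 900.0))):
        print(f"kernel work {work:6.0f} ns -> {gain:5.1f}% improvement")
```

The absolute saving stays constant in every row while the percentage shrinks, which matches the shape described for Figure 2.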
+ +> It is interesting to contrast our results with those from the recent Lupine work. Surprisingly they observed that just eliminating the system call overhead is significant (40%) for a null system call, but since they found that (like us) the improvement dropped to below 5% in most cases, they concluded that the benefit of co-locating the application and kernel is minimal. Our results suggest that the major performance gain comes not from eliminating the hardware cost but from eliminating all the checks on the transition between the application and kernel code and that reducing this overhead has a significant impact on even expensive system calls. + +将我们的结果与最近 Lupine 的研究结果进行对比是很有趣的。令人惊讶的是,他们观察到,对于空系统调用,仅仅消除系统调用开销就能带来显著收益(40%);但由于他们发现(和我们一样)在大多数情况下改进下降到 5% 以下,他们得出结论:将应用程序和内核放在一起的好处微乎其微。我们的结果表明,主要的性能增益不是来自消除硬件成本,而是来自消除应用程序和内核代码之间转换的所有检查,并且减少这种开销对昂贵的系统调用也有重大影响。 + +3. 页面错误处理 + +> Figure 3 compares three different schemes we have for handling page faults, i.e., UKL_PF_DF, UKL_PF_SS and (UKL_RET_PF_DF). For UKL_PF_DF, we see close to 5% improvement in page fault latency compared to Linux. UKL_PF_SS is also comparable to the previous case, which means that stack switch on every page fault is not too costly, and most of the benefit over Linux in both these cases is due to handling page faults in kernel mode and avoiding ring transition. (UKL_RET_PF_DF) gives us more than 12.5% improvement over normal Linux. In all these cases, since the time taken to service more page faults increases, the improvement over normal Linux also increases, which is why we see a constant percentage improvement. Unmodified applications can choose any one of these options through build time Linux config options. 
+ +图 3 比较了处理页面错误的三种不同方案,即 UKL_PF_DF、UKL_PF_SS 和 (UKL_RET_PF_DF)。对于 UKL_PF_DF,我们看到页面错误延迟相比于 Linux 有将近 5% 的改进。UKL_PF_SS 也与前一种情况类似,这意味着在每个页面错误上进行堆栈切换的成本不会太高,而且在这两种情况下相对于 Linux 的大部分好处是由于在内核模式下处理页面错误并避免了环转换。(UKL_RET_PF_DF) 比普通 Linux 改进了 12.5% 以上。在所有这些情况下,由于处理更多页面错误所需的时间增加了,因此相对于普通 Linux 的改进也增加了,这就是为什么我们看到一个恒定百分比的改进。未修改的应用程序可以通过构建时 Linux 配置选项选择这些选项中的任何一个。 + +
+ +
+ +> We repeated this experiment for non-stack page faults, i.e., on mapped memory and got the same results. + +我们对非栈页面错误(即在映射内存上)重复这个实验,得到了相同的结果。 + +### 应用性能 + +> We want to see how real world applications perform on UKL. We chose three different types of applications: a simple application (Redis) used by previous works as well, a more complex application (Memcached) that many unikernels don’t support unmodified, and a latency-sensitive application (Secrecy). Our results show significant advantages in Redis (26%), Memcached (8%), and Secrecy (100x). + +我们想看看真实世界的应用程序在 UKL 上的表现。我们选择了三种不同类型的应用程序:一个以前的研究也使用过的简单的应用程序(Redis),一个许多 unikernel 无法在不修改的情况下支持的更复杂的应用程序(Memcached),和一个延迟敏感的应用程序(Secrecy)。我们的结果显示出显著的优势:Redis(26%)、Memcached(8%)和 Secrecy(100x)。 + +1. 简单应用:Redis + +> We use Redis, a widely used in-memory database, to measure the performance of UKL and its different configurations in real world applications. For this experiment, we ran Redis server on UKL on bare metal and ran the client on another physical node in the network. + +我们使用 Redis,一个广泛使用的内存数据库,来衡量 UKL 及其不同配置在现实世界的应用程序中的性能。在这个实验中,我们在裸机上的 UKL 上运行 Redis 服务器,在网络中的另一个物理节点上运行客户端。 + +> We use the Memtier benchmark to test Redis. Through Memtier benchmark, we create 300 clients, each sending 100 thousand requests to the server. The ratio of get to set operations is 1 to 10. We ran Redis on Linux, UKL_RET_BYP and UKL_RET_BYP with deeper shortcuts. Figure 4 helps us visualize the latency distribution for these requests. + +我们使用 Memtier 基准测试来测试 Redis。通过 Memtier 基准测试,我们创建 300 个客户端,每个客户端向服务器发送 10 万个请求。get 和 set 操作的比率为 1 比 10。我们分别在 Linux、UKL_RET_BYP 以及带有更深层快捷方式的 UKL_RET_BYP 上运行 Redis。图 4 帮助我们可视化这些请求的延迟分布。 + +
+ +
+ +> To better understand where the time was being spent, we profiled Redis UKL with perf. Figure 5, which is part of the flame graph we generated, shows two clear opportunities for performance improvement. Blue arrows show how we could shorten the execution path by bypassing the entry and exit code for read and write system calls and invoke the underlying functionality directly. Figure 4 shows how Redis on UKL_RET shows improvement in average and 99th percentile tail latency when it bypasses the entry and exit code (UKL_RET_BYP). Table 3 shows that UKL_RET_BYP has 11% better tail latency and 12% better throughput. + +
+ +
+ +为了更好地了解时间消耗在哪里,我们用 perf 对 Redis UKL 进行了分析。图 5 是我们生成的 flame graph 的一部分,它显示了两个明显的性能改进机会。蓝色箭头显示了我们如何通过绕过读写系统调用的入口和退出代码来缩短执行路径,并直接调用底层功能。图 4 显示了当 Redis 绕过进入和退出代码(UKL_RET_BYP)时,UKL_RET 上的平均和第 99 百分位尾部延迟是如何改善的。表 3 显示,UKL_RET_BYP 的尾部延迟提高了 11%,吞吐量提高了 12%。 + +
+ +
+ +> Looking at Figure 5 again, the green arrows show that read and write calls, after all the polymorphism, eventually translate into tcp_recvmsg and tcp_sendmsg respectively. To investigate any potential benefit of shortcutting deep into the kernel, we wrote some code in the kernel to interface read and write with tcp_recvmsg and tcp_sendmsg respectively. We then modified Redis (10 lines modified) to call our interface functions instead of read and write. Our results show (Figure 4) further improvement in average and 99th percentile tail latency i.e., UKL_RET_BYP (shortcut). Table 3 shows that UKL_RET_BYP (shortcut) has 22% better tail latency and 26% better throughput. + +再次查看图 5,绿色箭头显示在所有多态性之后,读和写调用最终分别转换 tcp_recvmsg 和 tcp_sendmsg。为了研究深入内核的快捷方式的潜在好处,我们在内核中编写了一些代码,分别使用 tcp_recvmsg 和 tcp_sendmsg 进行读写操作。然后我们修改了 Redis(修改了 10 行)来调用我们的接口函数,而不是读写。我们的结果显示(图 4)平均和第 99 百分位尾部延迟的进一步改善,即 UKL_RET_BYP(快捷方式)。表 3 显示 UKL_RET_BYP(快捷方式)的尾部延迟提高 22%,吞吐量提高 26%。 + +> Figure 4 provides us some nice insights into future possibilities. There is almost a 0.5ms difference in the shortest latencies for Linux versus UKL_RET_BYP (shortcut) case. This means that there is an opportunity to further reduce the average and tail latencies to sit closer to the smallest latency case. + +图 4 为我们提供了一些关于未来可能性的深刻见解。Linux 与 UKL_RET_BYP(快捷方式)相比,最短延迟几乎相差 0.5ms。这意味着有机会进一步减少平均和尾部延迟,使其更接近最小延迟情况。 + +> Lupine shows slightly better results than baseline Linux for Redis, but it does so in virtualization on a lightweight hypervisor. It would be interesting to see how UKL performs in that setting, even though there is a huge difference in kernel versions used by Lupine and UKL. + +Lupine 在 Redis 上显示的结果略好于基线 Linux,但它是在轻量级管理程序上进行虚拟化的。尽管 Lupine 和 UKL 使用的内核版本存在巨大差异,但看看 UKL 在这种情况下的表现将是一件很有趣的事情。 + +2. 复杂应用:Memcached + +> Memcached is a multithreaded workload that relies heavily on pthreads library and glibc ’s internal synchronization mechanisms. 
It is an interesting application because unikernels generally don’t support complex applications, and systems like EbbRT first have to port Memcached. To evaluate Memcached, we use the Mutilate benchmark. This benchmark uses multiple clients to generate a fixed queries-per-second load on the server and then measures the latency. We ran the clients in userspace on the same node as Memcached UKL to remove any network delays, and we pinned the Memcached server and clients to separate cores. We used Mutilate to generate queries based on Facebook’s workloads. For different configurations of UKL, we measured how many queries per second Memcached can serve while keeping the 99% tail latency under the 500 us service level agreement. Figure 6 shows Memcached with UKL_RET performs similar to Memcached on Linux, i.e., both serve around 73 thousand queries before exceeding the 500 us threshold. Memcached on UKL_RET_BYP can serve around 77 thousand queries (around 5% improvement), and Memcached on UKL_RET_BYP (shortcut) can serve up to 79 thousand queries (around 8% improvement) before going over the 500 us threshold. + +Memcached 是一个严重依赖 pthread 库和 glibc 内部同步机制的多线程工作负载。这是一个有趣的应用程序,因为 unikernels 通常不支持复杂的应用程序,像 EbbRT 这样的系统必须先移植 Memcached。为了评估 Memcached,我们使用了 Mutilate 基准测试。此基准测试使用多个客户端在服务器上生成固定的每秒查询数负载,然后测量延迟。我们在与 Memcached UKL 相同的节点上的用户空间中运行客户端,以消除任何网络延迟,并且我们将 Memcached 服务器和客户端固定在不同的核心上。我们使用 Mutilate 来生成基于 Facebook 工作负载的查询。对于不同的 UKL 配置,我们测量了在 99% 尾部延迟保持在 500 us 服务级别协议之内的前提下,Memcached 每秒可以处理多少查询。图 6 显示,运行在 UKL_RET 上的 Memcached 与 Linux 上的 Memcached 表现相近,即二者在超过 500 us 阈值之前都能处理大约 73000 个查询。UKL_RET_BYP 上的 Memcached 可以处理大约 77000 个查询(大约提高 5%),而 UKL_RET_BYP(快捷方式)上的 Memcached 在超过 500 us 阈值之前可以处理多达 79000 个查询(大约提高 8%)。 + +
+ +
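The metric used for Memcached above, the highest query rate served before the 99% tail latency exceeds the 500 us service level agreement, can be computed from a load sweep. The following is a minimal sketch under stated assumptions: the function name `max_qps_under_sla`, the `(offered QPS, p99 latency in us)` sample format, and the numbers in the usage example are all hypothetical and are not taken from Mutilate or from the paper's data.

```python
from typing import Iterable, Optional, Tuple

SLA_US = 500.0  # tail-latency threshold named in the text, in microseconds


def max_qps_under_sla(samples: Iterable[Tuple[int, float]],
                      sla_us: float = SLA_US) -> Optional[int]:
    """Return the highest offered QPS reached before p99 latency exceeds sla_us.

    samples holds (offered_qps, p99_latency_us) pairs from a load sweep.
    The sweep is walked in order of increasing load and stops at the first
    violation, so a lucky pass at a higher load is not counted.
    Returns None when even the lowest load violates the threshold.
    """
    best = None
    for qps, p99_us in sorted(samples):
        if p99_us > sla_us:
            break
        best = qps
    return best


if __name__ == "__main__":
    # Hypothetical sweep results, for illustration only.
    sweep = [(60_000, 210.0), (70_000, 340.0), (80_000, 650.0), (75_000, 480.0)]
    print(max_qps_under_sla(sweep))
```

Stopping at the first violation is a design choice of this sketch; taking the maximum over all passing points would be an alternative reading of the same metric.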
+ +> This experiment also serves as a functionality and compatibility result; a comparatively large application with multiple threads etc. can run on UKL. + +本实验也可作为功能性和兼容性方面的结果:一个具有多个线程等特性的较大应用程序可以在 UKL 上运行。 + +3. 对延迟敏感的应用:Secrecy + +> Secrecy is a multi-party computation framework for secure analytics on private data. While Redis and Memcached are throughput sensitive, Secrecy is latency-sensitive. This represents an important class of applications, e.g., high-speed financial trading, etc. Secrecy is a three node protocol with each node sending data to its successor and receiving from its predecessor with the third node sending to the first. Computation is done row by row with a round of messages that act as a barrier between each row. + +Secrecy 是一个多方计算框架,用于对私有数据进行安全分析。Redis 和 Memcached 是吞吐量敏感的,而 Secrecy 是延迟敏感的。这代表了一类重要的应用,例如高速金融交易等。Secrecy 是一个三节点协议,每个节点向后继节点发送数据,并从前一个节点接收数据,第三个节点向第一个节点发送数据。计算逐行进行,并使用一轮消息作为每行之间的屏障。 + +> We used a test in the Secrecy implementation for a GROUP-BY operator which groups rows in a table by key attributes and counts the number of rows per group. Messages used in a round of communication are each very small, between 8 and 24 bytes each, so we configured each TCP socket to use TCP_NODELAY to avoid stalls caused by congestion control. Using this test executable, we ran experiments with 100, 1000, and 10,000 input rows and measured the time required to complete the GROUP-BY. Each system and row size combination was run 20 times, and the worst two runs for each combination were discarded. + +我们使用了 Secrecy 实现中针对 GROUP-BY 操作符的一个测试,该操作符根据键属性对表中的行进行分组,并计算每组的行数。在一轮通信中使用的每个消息都非常小,每个消息在 8 到 24 字节之间,因此我们将每个 TCP 套接字都配置为使用 TCP_NODELAY,以避免拥塞控制造成的停顿。使用这个测试可执行文件,我们运行了 100、1000 和 10000 个输入行的实验,并测量了完成 GROUP-BY 所需的时间。每个系统和行大小组合运行 20 次,每个组合中最差的两次运行被丢弃。 + +> Figure 7 shows the run times of the three systems normalized to the run time of Linux and the error bars show the coefficient of variation for each configuration. 
As with other experiments, the UKL_BYP configuration shows a modest improvement in run time. However, when we use the deeper shortcut to the TCP send and receive functions, we see significant (100x) runtime improvements. + +图 7 显示了三个系统的运行时间归一化为 Linux 的运行时间,误差条显示了每种配置的变异系数。与其他实验一样,UKL_BYP 配置在运行时显示出适度的改进。然而,当我们对 TCP 发送和接收函数使用更深的快捷方式时,我们看到了显著的(100x)运行时改进。 + +
+ +
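Two pieces of the Secrecy measurement setup above lend themselves to a short sketch: enabling `TCP_NODELAY` on a socket, and summarizing repeated runs after discarding the slowest two. The Python below is illustrative only. The helper names `make_nodelay_socket` and `summarize_runs` are hypothetical, the run times in the usage example are made up, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used. As a side note, `TCP_NODELAY` works by disabling Nagle's algorithm, so small messages are sent without waiting to be coalesced.

```python
import socket
import statistics
from typing import Sequence, Tuple


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with TCP_NODELAY set (Nagle's algorithm disabled)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def summarize_runs(times_s: Sequence[float],
                   discard_worst: int = 2) -> Tuple[float, float]:
    """Drop the slowest runs, then return (mean, coefficient of variation).

    The coefficient of variation is stdev / mean. The sample standard
    deviation is used here, which is an assumption of this sketch.
    """
    kept = sorted(times_s)[: len(times_s) - discard_worst]
    if len(kept) < 2:
        raise ValueError("need at least two runs after discarding outliers")
    mean = statistics.mean(kept)
    return mean, statistics.stdev(kept) / mean


if __name__ == "__main__":
    # Hypothetical run times in seconds, for illustration only.
    runs = [1.02, 0.98, 1.00, 1.01, 0.99, 1.00, 3.50, 2.75]
    mean, cv = summarize_runs(runs)
    print(f"mean={mean:.3f}s cv={cv:.3%}")
```

Normalizing to Linux, as Figure 7 does, would then be a matter of dividing each configuration's mean by the mean obtained for Linux.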
+ +> The improvement of the shortcut system over the others was larger than anticipated, so we reran the experiments and achieved the same level of performance. To verify that the work was still happening, we collected a capture of all the inter-node traffic using Wireshark and verified that the same number of TCP packets traveled between nodes in all three system setups for a 100 row experiment. We also instrumented the send and receive paths in Secrecy to collect individual times for send and receive calls in each system for a 100 row run. The mean and standard deviation of send times for Linux were 2.23us and 1.14us, respectively, and the values for receive times on Linux were 1,100us and 3,300us, respectively. The shortcut showed send mean and standard deviation of 896ns and 1,755ns, which is a significant speed up, but the receive numbers were 638ns and 3,888ns. + +快捷系统相对于其他系统的改进比预期的要大,因此我们重新进行了实验,并达到了相同的性能水平。为了验证计算确实仍在执行,我们使用 Wireshark 捕获了所有节点间的流量,并验证了在 100 行实验中,三种系统设置下节点之间传输的 TCP 数据包数量相同。我们还在 Secrecy 的发送和接收路径中加入了插桩,以收集每个系统在一次 100 行运行中各次发送和接收调用的耗时。Linux 系统发送时间的均值和标准差分别为 2.23us 和 1.14us,Linux 系统接收时间的均值和标准差分别为 1100 us 和 3300 us。快捷方式的发送时间均值和标准差分别为 896ns 和 1755ns,速度明显加快,而接收时间的均值和标准差则分别为 638ns 和 3888ns。 + +> It appears that, with the shortcut, the system is never having to wait on packet delivery on top of bypassing system call entry and exit paths, so the shortcut system is never put to sleep waiting on incoming messages. We believe that because Secrecy is latency-sensitive and because we accelerate the send path, we ensure that no node ever has to wait for data and can move to the next round of processing immediately. Moreover, the shortcut implicitly disables scheduling on transitions, ensuring that the application is always run to completion. This is critical for an application with frequent barriers. 
+ +看起来,使用快捷方式,系统在绕过系统调用进入和退出路径的基础上永远不必等待数据包传递,因此快捷方式系统永远不会在等待传入消息时处于休眠状态。我们相信,因为 Secrecy 是延迟敏感的,因为我们加速了发送路径,我们确保没有节点需要等待数据,可以立即进入下一轮处理。此外,该快捷方式隐式地禁用了转换时的调度,确保应用程序始终运行到完成。这对于频繁使用屏障(barrier)的应用程序至关重要。 + +## 相关工作 + +> There has been a huge body of research on unikernels that we categorize as clean slate designs, forks of existing operating systems, and incremental systems. + +关于 unikernels 已有大量研究,我们将其归类为全新设计、现有操作系统的分支和增量系统。 + +> Clean Slate Unikernels: Many unikernel projects are written from scratch or use a minimal kernel like MiniOS for bootstrapping. These projects have complete control over the language and methodology used to construct the kernel. MirageOS uses OCaml to implement the unikernel and uses the language and compiler level features to ensure robustness against vulnerabilities and small attack surface. Similarly, OSv uses lock-free scheduling algorithms to gain performance benefits for unmodified applications. Implementations in clean-slate unikernels can also be fine-tuned for performance of specific applications, e.g., Minicache optimizes Xen and MiniOS for CDN based use case. Further, from scratch implementations can easily expose efficient, low-level interfaces to applications e.g., EbbRT. Different clean slate unikernels can often be polar opposites in some regards, exposing the wide range of choices available to them. For instance, some might target custom APIs for performance while others like HermiTux target full Linux ABI compatibility. Recently, efforts like Unikraft provide strong POSIX support while also allowing custom APIs for further performance gains. 
+ +全新设计的 unikernel:许多 unikernel 项目都是从头开始编写的,或者使用像 MiniOS 这样的最小内核来引导。这些项目完全控制用于构造内核的语言和方法。MirageOS 使用 OCaml 实现单内核,并使用语言和编译器级别的特性来确保对漏洞的鲁棒性和较小的攻击面。同样,OSv 使用无锁调度算法来获得未修改应用程序的性能优势。在全新 unikernel 中的实现也可以针对特定应用程序的性能进行微调,例如,Minicache 针对基于 CDN 的用例优化了 Xen 和 MiniOS。此外,从头开始实现可以很容易地将高效、低级的接口暴露给应用程序,例如 EbbRT。不同的全新单内核在某些方面通常是截然相反的,这体现出它们可做的选择范围之广。例如,有些可能为了性能而采用自定义 API,而像 HermiTux 这样的项目则以完全兼容 Linux ABI 为目标。最近,像 Unikraft 这样的项目提供了强大的 POSIX 支持,同时还允许自定义 API 来进一步提高性能。 + +> These unikernels offer compelling trade-offs to general purpose operating systems. These include improved security and smaller attack surfaces e.g., Xax and MirageOS, shorter boot times e.g., ClickOS and LightVM, efficient memory use through single address space e.g., OSv and many others, and better run-time performance e.g., EbbRT, Unikraft and SUESS. Some approaches target direct access to virtual or physical hardware. A number of researchers have directly confronted the problem of compatibility, e.g., OSv is almost Linux ABI compatible and HermiTux is fully ABI compatible with Linux binaries. Other projects aim to make building unikernels easier e.g., EbbRT, Libra and Unikraft. + +相对于通用操作系统,这些 unikernel 提供了有吸引力的权衡。这些包括改进的安全性和更小的攻击面,例如 Xax 和 MirageOS,更短的启动时间,例如 ClickOS 和 LightVM,通过单个地址空间高效地使用内存,例如 OSv 和许多其他,以及更好的运行时性能,例如,EbbRT、Unikraft、SUESS。一些方法针对直接访问虚拟或物理硬件。许多研究者直接面对兼容性问题,例如 OSv 几乎与 Linux ABI 兼容,HermiTux 与 Linux 二进制文件完全 ABI 兼容。其他项目旨在使构建 unikernel 更容易,例如 EbbRT,Libra 和 Unikraft。 + +> The UKL effort was inspired by the tremendous results demonstrated by clean slate unikernels. Our research targets trying to find ways to integrate some of the advantages these systems have shown into a general-purpose OS. + +UKL 的工作受到全新设计的 unikernels 所展示的出色成果的启发。我们的研究目标是试图找到将这些系统所显示的一些优点集成到通用操作系统中的方法。 + +> Forks of General Purpose OS. A number of projects either fork an existing general-purpose OS code base or reuse a significant portion of one. 
Examples include Drawbridge which harvests code from Windows, Rump kernel which uses NetBSD drivers and Linux Kernel Library (LKL) which borrows code from Linux. These systems, although constrained by the design and structure of the original OS, generally have better compatibility with existing applications. The codebase these systems fork are well tested and can serve as building blocks for other research projects, e.g., Rump has been used in other projects. + +通用操作系统的分支。许多项目要么派生现有的通用操作系统代码库,要么重用其中的很大一部分。例如,Drawbridge 从 Windows 中获取代码,Rump kernel 使用 NetBSD 驱动程序,Linux Kernel Library (LKL) 从 Linux 中借用代码。这些系统虽然受到原始操作系统的设计和结构的限制,但通常与现有应用程序具有更好的兼容性。这些系统分叉的代码库经过了良好的测试,可以作为其他研究项目的构建块,例如:Rump 已在其他项目中使用。 + +> Our goal in UKL is to try to find a way to integrate unikernel optimizations without having to fork the original OS. + +我们在 UKL 中的目标是尝试找到一种方法来集成单内核优化,而无需对原始操作系统进行分叉(fork)。 + +> Incremental Systems. There are systems, e.g., Kernel Mode Linux (KML), Lupine and X-Containers which use an existing general-purpose operating system (Linux) but make comparatively fewer changes. This way, a lot of working knowledge of users of Linux can easily transfer over to these systems, but in doing so, these systems only expose the system call entry points to applications and don’t make any further specializations. Unlike UKL, they don’t co-optimize the application and kernel together. Lupine and X-Containers demonstrate opportunities in customizing Linux through build time configurations, and that is orthogonal and complementary to UKL. UKL can also benefit from a customized Linux and then add unikernel optimizations on top of that. + +增量系统。有一些系统,例如 Kernel Mode Linux (KML),Lupine 和 X-Containers,它们使用现有的通用操作系统 (Linux),但进行的更改相对较少。这样,Linux 用户的许多工作知识可以很容易地转移到这些系统中,但是这样做,这些系统只向应用程序公开系统调用入口点,而不进行任何进一步的专门化。与 UKL 不同的是,它们不会共同优化应用程序和内核。Lupine 与 X-Containers 展示了通过构建时配置定制 Linux 的机会,这与 UKL 是正交的和互补的。UKL 还可以从定制的 Linux 中受益,然后在此基础上添加单内核优化。 + +## 总结 + +> UKL creates a unikernel target of glibc and the Linux kernel. 
The changes are modest, and we have shown even with these, it is possible to achieve substantial performance advantages for real workloads, e.g., 26% improvement in Redis throughput while improving tail latency by 22%. UKL supports both virtualized platforms and bare-metal platforms. While we have not tested a wide range of devices, we have so far experienced no issues using any device that Linux supports. Operators can configure and control UKL using the same tools they are familiar with, and developers have the ability to use standard Linux kernel tools like BPF and perf to analyze their programs. + +UKL 创建基于 glibc 和 Linux 内核的单内核目标。这些变化是适度的,我们已经证明,即使有了这些,也有可能在实际工作负载中实现实质性的性能优势,例如,在 Redis 吞吐量提高 26% 的同时将尾部延迟提高 22%。UKL 同时支持虚拟化平台和裸机平台。虽然我们没有测试过大量的设备,但到目前为止,我们在使用 Linux 支持的任何设备时都没有遇到任何问题。操作人员可以使用他们熟悉的相同工具配置和控制 UKL,开发人员可以使用标准的 Linux 内核工具(如 BPF 和 perf)来分析他们的程序。 + +> UKL differs in a number of interesting ways from unikernels. First, while application and kernel code are statically linked together, UKL provides very different execution environments for each; enabling applications to run in UKL with no modifications while minimizing changes to the invariants (whatever they are) that the kernel code expects. Second, UKL enables a knowledgable developer to incrementally optimize performance by modifying the application to directly take advantage of kernel capabilities, violating the normal assumptions of kernel versus application code. Third, processes can run on top of UKL, enabling the entire ecosystem of Linux tools and scripting to just work. + +UKL 在许多有趣的方面与 unikernels 不同。首先,虽然应用程序和内核代码是静态链接在一起的,但 UKL 为两者提供了非常不同的执行环境;使应用程序无需修改就可以在 UKL 中运行,同时最大限度地减少对内核代码所期望的不变量的更改(不管它们是什么)。其次,UKL 使开发人员能够通过修改应用程序来直接利用内核功能,从而逐步优化性能,这违反了内核代码与应用程序代码之间的常规假设。第三,进程可以在 UKL 之上运行,使整个 Linux 工具和脚本的生态系统能够正常工作。 + +> We have repeatedly thought that we were only a few weeks away from a stable system, and it has only been recently that we had a design and a set of changes that met our fundamental goals. 
While the set of changes to create UKL ended up being very small, it has taken us several years of work to get to this point. The unique design decisions are a result of multiple, typically much more pervasive, changes to Linux as we changed directions and gained experience with how the capability we wanted could be integrated into Linux. It is in some sense an interesting experience that the very modularity of Linux that enables a broad community to participate both: 1) makes it very difficult to understand how to integrate a change like UKL and, 2) can be harnessed to enable the change in a very small number of lines of code. + +我们一再认为,我们离一个稳定的系统只有几周的时间了,而直到最近,我们才有了一个达到我们基本目标的设计和一组更改。虽然创建 UKL 的更改集最终非常小,但我们花了几年的时间才达到这一点。这些独特的设计决策,是我们在不断调整方向、并逐步积累如何将所需能力集成到 Linux 中的经验的过程中,对 Linux 进行多次(通常波及面大得多的)更改后得到的结果。从某种意义上说,这是一段有趣的经历:正是让广泛社区得以参与的 Linux 模块化特性,1)使得理解如何集成 UKL 这样的更改变得非常困难;2)又可以被利用,以很少的代码行实现这一更改。 + +> The focus of our work so far has been on functionality and just a proof of concept of a performance advantage in order to justify integrating the code into Linux. Now that we have achieved that, we plan to start working on getting UKL upstreamed as a standard target of Linux so that the community will continue to enhance it. + +到目前为止,我们的工作重点一直放在功能上,以及仅仅对性能优势做概念验证,以证明将代码集成到 Linux 中是合理的。现在我们已经实现了这一点,我们计划着手推动 UKL 作为 Linux 的标准目标合入上游,以便社区能够继续增强它。 + +> We have only started performance optimizing UKL. As our knowledge of Linux has increased, a whole series of simple optimizations that can be readily adopted have become apparent beyond the current efforts. How hard will it be to introduce and/or exploit zero-copy interfaces to the application? How hard will it be to reduce some of the privacy assumptions implicit in the BSD socket interface when only one application consumes incoming data? + +我们才刚刚开始对 UKL 进行性能优化。随着我们对 Linux 了解的加深,在当前工作之外,一系列易于采用的简单优化已经显现出来。向应用程序引入和/或利用零拷贝接口有多难?当只有一个应用程序使用传入数据时,减少 BSD 套接字接口中隐含的一些隐私假设有多难? + +> These kernel-centric optimizations are just the start. 
From an application perspective, we believe that UKL will provide a natural path for improving performance and reducing the complexity of complex concurrent workloads. Concurrent operations on shared resources must be regulated. Often the burden falls onto the user code. From the user-level, it is hard to determine whether synchronization is needed, and the controlling operations and controlled entities usually live in the kernel. If the user code moves into the kernel and has the same privileges, some operations might become faster or possible in the first place. For instance, in a garbage collector, it might be necessary to prevent or at least detect whether concurrent accesses happen. With easy and fast access to the memory infrastructure (e.g., page tables) and the scheduler, many situations in which explicit, slow synchronization is needed might get away with detecting and cleaning up violations of the assumptions. + +这些以内核为中心的优化仅仅是个开始。从应用程序的角度来看,我们相信 UKL 将为提高复杂并发工作负载的性能并降低其复杂性提供一条自然的途径。对共享资源的并发操作必须加以管控,而这一负担往往落在用户代码上。在用户态很难判断是否需要同步,而且执行管控的操作和被管控的实体通常都位于内核中。如果用户代码移入内核并拥有相同的特权,某些操作可能会变得更快,甚至某些原本无法实现的操作会变为可能。例如,在垃圾收集器中,可能需要阻止并发访问,或者至少检测是否发生了并发访问。借助对内存基础设施(例如页表)和调度器的简单而快速的访问,许多原本需要显式且缓慢的同步的场景,或许只需检测并清理违反假设的情况即可应对。 + +> If the Linux community accepts UKL, we believe it will not only impact Linux but may become a very important platform for future research. While the benefits to researchers of broad applications on HCL support are obvious. Perhaps less obvious, as unikernel researchers, is the ability to use tools like ktest to deploy and manage experiments, BPF and perf to be able to understand performance, have been incredibly valuable. 
+ +如果 Linux 社区接受 UKL,我们相信它不仅会影响 Linux,而且可能成为未来研究的一个非常重要的平台。广泛的应用程序支持和 HCL(Hardware Compatibility List)支持给研究人员带来的好处是显而易见的。也许不那么明显的是,作为 unikernel 研究人员,我们能够使用 ktest 之类的工具来部署和管理实验,使用 BPF 和 perf 来理解性能,这种能力已被证明极具价值。 + +## 参考资料 + +- https://www.cnblogs.com/demonatic/p/12962119.html +- https://docs.kernel.org/next/arch/x86/entry_64.html +- https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/entry_64.S +- https://zhuanlan.zhihu.com/p/603421427 diff --git a/articles/images/porting-riscv-ukl-2/translate-figure-1.PNG b/articles/images/porting-riscv-ukl-2/translate-figure-1.PNG new file mode 100644 index 0000000000000000000000000000000000000000..a246fb3b0971084cb46dc72d6259d0595501e0cc Binary files /dev/null and b/articles/images/porting-riscv-ukl-2/translate-figure-1.PNG differ diff --git a/articles/images/porting-riscv-ukl-2/translate-figure-2.PNG b/articles/images/porting-riscv-ukl-2/translate-figure-2.PNG new file mode 100644 index 0000000000000000000000000000000000000000..ed43ebe2389025e23323efdf0166d31894a598db Binary files /dev/null and b/articles/images/porting-riscv-ukl-2/translate-figure-2.PNG differ diff --git a/articles/images/porting-riscv-ukl-2/translate-figure-3.PNG b/articles/images/porting-riscv-ukl-2/translate-figure-3.PNG new file mode 100644 index 0000000000000000000000000000000000000000..dc50ff5a811f48f1c10e117110b82edaa6a91fcc Binary files /dev/null and b/articles/images/porting-riscv-ukl-2/translate-figure-3.PNG differ diff --git a/articles/images/porting-riscv-ukl-2/translate-figure-4.PNG b/articles/images/porting-riscv-ukl-2/translate-figure-4.PNG new file mode 100644 index 0000000000000000000000000000000000000000..b8c128aa6474a4f2d2941213cf355414649dac6d Binary files /dev/null and b/articles/images/porting-riscv-ukl-2/translate-figure-4.PNG differ diff --git a/articles/images/porting-riscv-ukl-2/translate-figure-5.PNG b/articles/images/porting-riscv-ukl-2/translate-figure-5.PNG new file mode 100644 index 
0000000000000000000000000000000000000000..3107b62c505dce5addd0a03cc94117bb9d7d482b Binary files /dev/null and b/articles/images/porting-riscv-ukl-2/translate-figure-5.PNG differ diff --git a/articles/images/porting-riscv-ukl-2/translate-figure-6.PNG b/articles/images/porting-riscv-ukl-2/translate-figure-6.PNG new file mode 100644 index 0000000000000000000000000000000000000000..5ba82d3af54013da56b3ba6cee9569177aa2c7e7 Binary files /dev/null and b/articles/images/porting-riscv-ukl-2/translate-figure-6.PNG differ diff --git a/articles/images/porting-riscv-ukl-2/translate-figure-7.PNG b/articles/images/porting-riscv-ukl-2/translate-figure-7.PNG new file mode 100644 index 0000000000000000000000000000000000000000..8c991556baf9adea8a33af226390bf92bb4bc4cd Binary files /dev/null and b/articles/images/porting-riscv-ukl-2/translate-figure-7.PNG differ diff --git a/articles/images/porting-riscv-ukl-2/translate-table-2.PNG b/articles/images/porting-riscv-ukl-2/translate-table-2.PNG new file mode 100644 index 0000000000000000000000000000000000000000..cbbd05ed2863e31ff06abec983da320ba5a05bc7 Binary files /dev/null and b/articles/images/porting-riscv-ukl-2/translate-table-2.PNG differ diff --git a/articles/images/porting-riscv-ukl-2/translate-table-3.PNG b/articles/images/porting-riscv-ukl-2/translate-table-3.PNG new file mode 100644 index 0000000000000000000000000000000000000000..0a4b4e28d330534185d88312b76cfa500dfa1d1b Binary files /dev/null and b/articles/images/porting-riscv-ukl-2/translate-table-3.PNG differ