From b3d65027f3ab665da9c42cd534b4cbafcb7cfbd7 Mon Sep 17 00:00:00 2001 From: Chen Ridong Date: Tue, 8 Oct 2024 11:24:56 +0000 Subject: [PATCH 1/9] cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction stable inclusion from stable-v6.6.60 commit 0d86cd70fc6a7ba18becb52ad8334d5ad3eca530 category: bugfix issue: #IBBMYW CVE: CVE-2024-53054 Signed-off-by: May27May --------------------------------------- [ Upstream commit 117932eea99b729ee5d12783601a4f7f5fd58a23 ] A hung_task problem shown below was found: INFO: task kworker/0:0:8 blocked for more than 327 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Workqueue: events cgroup_bpf_release Call Trace: __schedule+0x5a2/0x2050 ? find_held_lock+0x33/0x100 ? wq_worker_sleeping+0x9e/0xe0 schedule+0x9f/0x180 schedule_preempt_disabled+0x25/0x50 __mutex_lock+0x512/0x740 ? cgroup_bpf_release+0x1e/0x4d0 ? cgroup_bpf_release+0xcf/0x4d0 ? process_scheduled_works+0x161/0x8a0 ? cgroup_bpf_release+0x1e/0x4d0 ? mutex_lock_nested+0x2b/0x40 ? __pfx_delay_tsc+0x10/0x10 mutex_lock_nested+0x2b/0x40 cgroup_bpf_release+0xcf/0x4d0 ? process_scheduled_works+0x161/0x8a0 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0 ? process_scheduled_works+0x161/0x8a0 process_scheduled_works+0x23a/0x8a0 worker_thread+0x231/0x5b0 ? __pfx_worker_thread+0x10/0x10 kthread+0x14d/0x1c0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x59/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 This issue can be reproduced by the following pressuse test: 1. A large number of cpuset cgroups are deleted. 2. Set cpu on and off repeatly. 3. Set watchdog_thresh repeatly. The scripts can be obtained at LINK mentioned above the signature. The reason for this issue is cgroup_mutex and cpu_hotplug_lock are acquired in different tasks, which may lead to deadlock. It can lead to a deadlock through the following steps: 1. A large number of cpusets are deleted asynchronously, which puts a large number of cgroup_bpf_release works into system_wq. The max_active of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are cgroup_bpf_release works, and many cgroup_bpf_release works will be put into inactive queue. As illustrated in the diagram, there are 256 (in the acvtive queue) + n (in the inactive queue) works. 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put smp_call_on_cpu work into system_wq. However step 1 has already filled system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has to wait until the works that were put into the inacvtive queue earlier have executed (n cgroup_bpf_release), so it will be blocked for a while. 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2. 4. Cpusets that were deleted at step 1 put cgroup_release works into cgroup_destroy_wq. They are competing to get cgroup_mutex all the time. When cgroup_metux is acqured by work at css_killed_work_fn, it will call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read. However, cpuset_css_offline will be blocked for step 3. 5. At this moment, there are 256 works in active queue that are cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as a result, all of them are blocked. Consequently, sscs.work can not be executed. Ultimately, this situation leads to four processes being blocked, forming a deadlock. system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4) ... 2000+ cgroups deleted asyn 256 actives + n inactives __lockup_detector_reconfigure P(cpu_hotplug_lock.read) put sscs.work into system_wq 256 + n + 1(sscs.work) sscs.work wait to be executed warting sscs.work finish percpu_down_write P(cpu_hotplug_lock.write) ...blocking... css_killed_work_fn P(cgroup_mutex) cpuset_css_offline P(cpu_hotplug_lock.read) ...blocking... 256 cgroup_bpf_release mutex_lock(&cgroup_mutex); ..blocking... To fix the problem, place cgroup_bpf_release works on a dedicated workqueue which can break the loop and solve the problem. System wqs are for misc things which shouldn't create a large number of concurrent work items. If something is going to generate >WQ_DFL_ACTIVE(256) concurrent work items, it should use its own dedicated workqueue. Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") Cc: stable@vger.kernel.org # v5.3+ Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t Tested-by: Vishal Chourasia Signed-off-by: Chen Ridong Signed-off-by: Tejun Heo Signed-off-by: Sasha Levin Signed-off-by: May27May Conflicts: kernel/bpf/cgroup.c --- kernel/bpf/cgroup.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index 82f908e2f52f..51160db9657f 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -22,6 +22,23 @@ DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key); EXPORT_SYMBOL(cgroup_bpf_enabled_key); +/* + * cgroup bpf destruction makes heavy use of work items and there can be a lot + * of concurrent destructions. Use a separate workqueue so that cgroup bpf + * destruction work items don't end up filling up max_active of system_wq + * which may lead to deadlock. + */ +static struct workqueue_struct *cgroup_bpf_destroy_wq; + +static int __init cgroup_bpf_wq_init(void) +{ + cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1); + if (!cgroup_bpf_destroy_wq) + panic("Failed to alloc workqueue for cgroup bpf destroy.\n"); + return 0; +} +core_initcall(cgroup_bpf_wq_init); + void cgroup_bpf_offline(struct cgroup *cgrp) { cgroup_get(cgrp); -- Gitee From 1b6e47ff7b699aa84d6ac63e725ef1e3516c0b44 Mon Sep 17 00:00:00 2001 From: Xinqi Zhang Date: Wed, 16 Oct 2024 14:13:38 +0800 Subject: [PATCH 2/9] firmware: arm_scmi: Fix slab-use-after-free in scmi_bus_notifier() mainline inclusion from mainline-v6.12-rc7 commit 295416091e44806760ccf753aeafdafc0ae268f3 category: bugfix issue: #IBBMYW CVE: CVE-2024-53068 Signed-off-by: May27May --------------------------------------- The scmi_dev->name is released prematurely in __scmi_device_destroy(), which causes slab-use-after-free when accessing scmi_dev->name in scmi_bus_notifier(). So move the release of scmi_dev->name to scmi_device_release() to avoid slab-use-after-free. | BUG: KASAN: slab-use-after-free in strncmp+0xe4/0xec | Read of size 1 at addr ffffff80a482bcc0 by task swapper/0/1 | | CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.6.38-debug #1 | Hardware name: Qualcomm Technologies, Inc. SA8775P Ride (DT) | Call trace: | dump_backtrace+0x94/0x114 | show_stack+0x18/0x24 | dump_stack_lvl+0x48/0x60 | print_report+0xf4/0x5b0 | kasan_report+0xa4/0xec | __asan_report_load1_noabort+0x20/0x2c | strncmp+0xe4/0xec | scmi_bus_notifier+0x5c/0x54c | notifier_call_chain+0xb4/0x31c | blocking_notifier_call_chain+0x68/0x9c | bus_notify+0x54/0x78 | device_del+0x1bc/0x840 | device_unregister+0x20/0xb4 | __scmi_device_destroy+0xac/0x280 | scmi_device_destroy+0x94/0xd0 | scmi_chan_setup+0x524/0x750 | scmi_probe+0x7fc/0x1508 | platform_probe+0xc4/0x19c | really_probe+0x32c/0x99c | __driver_probe_device+0x15c/0x3c4 | driver_probe_device+0x5c/0x170 | __driver_attach+0x1c8/0x440 | bus_for_each_dev+0xf4/0x178 | driver_attach+0x3c/0x58 | bus_add_driver+0x234/0x4d4 | driver_register+0xf4/0x3c0 | __platform_driver_register+0x60/0x88 | scmi_driver_init+0xb0/0x104 | do_one_initcall+0xb4/0x664 | kernel_init_freeable+0x3c8/0x894 | kernel_init+0x24/0x1e8 | ret_from_fork+0x10/0x20 | | Allocated by task 1: | kasan_save_stack+0x2c/0x54 | kasan_set_track+0x2c/0x40 | kasan_save_alloc_info+0x24/0x34 | __kasan_kmalloc+0xa0/0xb8 | __kmalloc_node_track_caller+0x6c/0x104 | kstrdup+0x48/0x84 | kstrdup_const+0x34/0x40 | __scmi_device_create.part.0+0x8c/0x408 | scmi_device_create+0x104/0x370 | scmi_chan_setup+0x2a0/0x750 | scmi_probe+0x7fc/0x1508 | platform_probe+0xc4/0x19c | really_probe+0x32c/0x99c | __driver_probe_device+0x15c/0x3c4 | driver_probe_device+0x5c/0x170 | __driver_attach+0x1c8/0x440 | bus_for_each_dev+0xf4/0x178 | driver_attach+0x3c/0x58 | bus_add_driver+0x234/0x4d4 | driver_register+0xf4/0x3c0 | __platform_driver_register+0x60/0x88 | scmi_driver_init+0xb0/0x104 | do_one_initcall+0xb4/0x664 | kernel_init_freeable+0x3c8/0x894 | kernel_init+0x24/0x1e8 | ret_from_fork+0x10/0x20 | | Freed by task 1: | kasan_save_stack+0x2c/0x54 | kasan_set_track+0x2c/0x40 | kasan_save_free_info+0x38/0x5c | __kasan_slab_free+0xe8/0x164 | __kmem_cache_free+0x11c/0x230 | kfree+0x70/0x130 | kfree_const+0x20/0x40 | __scmi_device_destroy+0x70/0x280 | scmi_device_destroy+0x94/0xd0 | scmi_chan_setup+0x524/0x750 | scmi_probe+0x7fc/0x1508 | platform_probe+0xc4/0x19c | really_probe+0x32c/0x99c | __driver_probe_device+0x15c/0x3c4 | driver_probe_device+0x5c/0x170 | __driver_attach+0x1c8/0x440 | bus_for_each_dev+0xf4/0x178 | driver_attach+0x3c/0x58 | bus_add_driver+0x234/0x4d4 | driver_register+0xf4/0x3c0 | __platform_driver_register+0x60/0x88 | scmi_driver_init+0xb0/0x104 | do_one_initcall+0xb4/0x664 | kernel_init_freeable+0x3c8/0x894 | kernel_init+0x24/0x1e8 | ret_from_fork+0x10/0x20 Fixes: ee7a9c9f67c5 ("firmware: arm_scmi: Add support for multiple device per protocol") Signed-off-by: Xinqi Zhang Reviewed-by: Cristian Marussi Reviewed-by: Bjorn Andersson Message-Id: <20241016-fix-arm-scmi-slab-use-after-free-v2-1-1783685ef90d@quicinc.com> Signed-off-by: Sudeep Holla Signed-off-by: May27May Conflicts: drivers/firmware/arm_scmi/bus.c --- drivers/firmware/arm_scmi/bus.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/firmware/arm_scmi/bus.c b/drivers/firmware/arm_scmi/bus.c index def8a84d1611..14e253f50f3b 100644 --- a/drivers/firmware/arm_scmi/bus.c +++ b/drivers/firmware/arm_scmi/bus.c @@ -137,7 +137,10 @@ EXPORT_SYMBOL_GPL(scmi_driver_unregister); static void scmi_device_release(struct device *dev) { - kfree(to_scmi_dev(dev)); + struct scmi_device *scmi_dev = to_scmi_dev(dev); + + kfree_const(scmi_dev->name); + kfree(scmi_dev); } struct scmi_device * @@ -178,7 +181,6 @@ scmi_device_create(struct device_node *np, struct device *parent, int protocol, return scmi_dev; put_dev: - kfree_const(scmi_dev->name); put_device(&scmi_dev->dev); ida_simple_remove(&scmi_bus_id, id); return NULL; @@ -186,7 +188,6 @@ scmi_device_create(struct device_node *np, struct device *parent, int protocol, void scmi_device_destroy(struct scmi_device *scmi_dev) { - kfree_const(scmi_dev->name); scmi_handle_put(scmi_dev->handle); ida_simple_remove(&scmi_bus_id, scmi_dev->id); device_unregister(&scmi_dev->dev); -- Gitee From e7bb2bc118c5106e707d7f67aae4cc1120f1ef0f Mon Sep 17 00:00:00 2001 From: Jiri Kosina Date: Tue, 29 Oct 2024 15:44:35 +0100 Subject: [PATCH 3/9] HID: core: zero-initialize the report buffer MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit stable inclusion from stable-v5.10.230 commit d7dc68d82ab3fcfc3f65322465da3d7031d4ab46 category: bugfix issue: #IBBMYW CVE: CVE-2024-50302 Signed-off-by: May27May --------------------------------------- [ Upstream commit 177f25d1292c7e16e1199b39c85480f7f8815552 ] Since the report buffer is used by all kinds of drivers in various ways, let's zero-initialize it during allocation to make sure that it can't be ever used to leak kernel memory via specially-crafted report. Fixes: 27ce405039bf ("HID: fix data access in implement()") Reported-by: Benoît Sevens Acked-by: Benjamin Tissoires Signed-off-by: Jiri Kosina Signed-off-by: Sasha Levin Signed-off-by: May27May --- drivers/hid/hid-core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/hid/hid-core.c b/drivers/hid/hid-core.c index e64a086651d6..cef54bd7755d 100644 --- a/drivers/hid/hid-core.c +++ b/drivers/hid/hid-core.c @@ -1657,7 +1657,7 @@ u8 *hid_alloc_report_buf(struct hid_report *report, gfp_t flags) u32 len = hid_report_len(report) + 7; - return kmalloc(len, flags); + return kzalloc(len, flags); } EXPORT_SYMBOL_GPL(hid_alloc_report_buf); -- Gitee From c1b9519989be821ed725fb1f324bbb463edd1399 Mon Sep 17 00:00:00 2001 From: Yin Fengwei Date: Sat, 29 Apr 2023 16:27:58 +0800 Subject: [PATCH 4/9] THP: avoid lock when check whether THP is in deferred list mainline inclusion from mainline-v6.5-rc1 commit deedad80f660af8199ea3b3f70939f2d226b9154 category: bugfix issue: #IBBMYW CVE: CVE-2024-53079 Signed-off-by: May27May --------------------------------------- free_transhuge_page() acquires split queue lock then check whether the THP was added to deferred list or not. It brings high deferred queue lock contention. It's safe to check whether the THP is in deferred list or not without holding the deferred queue lock in free_transhuge_page() because when code hit free_transhuge_page(), there is no one tries to add the folio to _deferred_list. Running page_fault1 of will-it-scale + order 2 folio for anonymous mapping with 96 processes on an Ice Lake 48C/96T test box, we could see the 61% split_queue_lock contention: - 63.02% 0.01% page_fault1_pro [kernel.kallsyms] [k] free_transhuge_page - 63.01% free_transhuge_page + 62.91% _raw_spin_lock_irqsave With this patch applied, the split_queue_lock contention is less than 1%. Link: https://lkml.kernel.org/r/20230429082759.1600796-2-fengwei.yin@intel.com Signed-off-by: Yin Fengwei Acked-by: Kirill A. Shutemov Reviewed-by: "Huang, Ying" Cc: Matthew Wilcox Cc: Ryan Roberts Cc: Yu Zhao Signed-off-by: Andrew Morton Signed-off-by: May27May Conflicts: mm/huge_memory.c --- mm/huge_memory.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e4c690c21fc9..9dc57f5c4f36 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2773,12 +2773,19 @@ void free_transhuge_page(struct page *page) struct deferred_split *ds_queue = get_deferred_split_queue(page); unsigned long flags; - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); - if (!list_empty(page_deferred_list(page))) { - ds_queue->split_queue_len--; - list_del(page_deferred_list(page)); + /* + * At this point, there is no one trying to add the folio to + * deferred_list. If folio is not in deferred_list, it's safe + * to check without acquiring the split_queue_lock. + */ + if (data_race(!list_empty(page_deferred_list(page))) { + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); + if (!list_empty(page_deferred_list(page))) { + ds_queue->split_queue_len--; + list_del(page_deferred_list(page)); + } + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); } - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); free_compound_page(page); } -- Gitee From 253e61d7e63baf7f89a39e8ad732ec263ac52d13 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Sun, 27 Oct 2024 13:02:13 -0700 Subject: [PATCH 5/9] mm/thp: fix deferred split unqueue naming and locking mainline inclusion from mainline-v6.12-rc7 commit f8f931bba0f92052cf842b7e30917b1afcc77d5a category: bugfix issue: #IBBMYW CVE: CVE-2024-53079 Signed-off-by: May27May --------------------------------------- Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting. Before fixing locking: rename misleading folio_undo_large_rmappable(), which does not undo large_rmappable, to folio_unqueue_deferred_split(), which is what it does. But that and its out-of-line __callee are mm internals of very limited usability: add comment and WARN_ON_ONCEs to check usage; and return a bool to say if a deferred split was unqueued, which can then be used in WARN_ON_ONCEs around safety checks (sparing callers the arcane conditionals in __folio_unqueue_deferred_split()). Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all of whose callers now call it beforehand (and if any forget then bad_page() will tell) - except for its caller put_pages_list(), which itself no longer has any callers (and will be deleted separately). Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0 without checking and unqueueing a THP folio from deferred split list; which is unfortunate, since the split_queue_lock depends on the memcg (when memcg is enabled); so swapout has been unqueueing such THPs later, when freeing the folio, using the pgdat's lock instead: potentially corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 here, so no problem with calling folio_unqueue_deferred_split() before resetting memcg_data. That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware"): which included a check on swapcache before adding to deferred queue, but no check on deferred queue before adding THP to swapcache. That worked fine with the usual sequence of events in reclaim (though there were a couple of rare ways in which a THP on deferred queue could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split underused THPs") avoids splitting underused THPs in reclaim, which makes swapcache THPs on deferred queue commonplace. Keep the check on swapcache before adding to deferred queue? Yes: it is no longer essential, but preserves the existing behaviour, and is likely to be a worthwhile optimization (vmstat showed much more traffic on the queue under swapping load if the check was removed); update its comment. Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing folio->memcg_data without checking and unqueueing a THP folio from the deferred list, sometimes corrupting "from" memcg's list, like swapout. Refcount is non-zero here, so folio_unqueue_deferred_split() can only be used in a WARN_ON_ONCE to validate the fix, which must be done earlier: mem_cgroup_move_charge_pte_range() first try to split the THP (splitting of course unqueues), or skip it if that fails. Not ideal, but moving charge has been requested, and khugepaged should repair the THP later: nobody wants new custom unqueueing code just for this deprecated case. The 87eaceb3faa5 commit did have the code to move from one deferred list to another (but was not conscious of its unsafety while refcount non-0); but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care deferred split queue in memcg charge move path"), which argued that the existence of a PMD mapping guarantees that the THP cannot be on a deferred list. As above, false in rare cases, and now commonly false. Backport to 6.11 should be straightforward. Earlier backports must take care that other _deferred_list fixes and dependencies are included. There is not a strong case for backports, but they can fix cornercases. Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins Acked-by: David Hildenbrand Reviewed-by: Yang Shi Cc: Baolin Wang Cc: Barry Song Cc: Chris Li Cc: Johannes Weiner Cc: Kefeng Wang Cc: Kirill A. Shutemov Cc: Matthew Wilcox (Oracle) Cc: Nhat Pham Cc: Ryan Roberts Cc: Shakeel Butt Cc: Usama Arif Cc: Wei Yang Cc: Zi Yan Cc: Signed-off-by: Andrew Morton Signed-off-by: May27May Conflicts: mm/huge_memory.c mm/internal.h mm/memcontrol-v1.c mm/memcontrol.c mm/migrate.c mm/page_alloc.c mm/swap.c mm/vmscan.c --- mm/huge_memory.c | 9 +++++++-- mm/internal.h | 14 ++++++++++++++ mm/memcontrol.c | 1 + 3 files changed, 22 insertions(+), 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9dc57f5c4f36..97a42d154111 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2768,7 +2768,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) return ret; } -void free_transhuge_page(struct page *page) +void __page_unqueue_deferred_split(struct page *page) { struct deferred_split *ds_queue = get_deferred_split_queue(page); unsigned long flags; @@ -2782,10 +2782,15 @@ void free_transhuge_page(struct page *page) spin_lock_irqsave(&ds_queue->split_queue_lock, flags); if (!list_empty(page_deferred_list(page))) { ds_queue->split_queue_len--; - list_del(page_deferred_list(page)); + list_del_init(page_deferred_list(page)); } spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); } +} + +void free_transhuge_page(struct page *page) +{ + __page_unqueue_deferred_split(page); free_compound_page(page); } diff --git a/mm/internal.h b/mm/internal.h index 9fc8978e62f3..e85e249a4d9b 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -801,4 +801,18 @@ void reclaimacct_collect_data(void); void reclaimacct_collect_reclaim_efficiency(void); #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +void __page_unqueue_deferred_split(struct page *page); +static inline void page_unqueue_deferred_split(struct page *page) +{ + if (!PageTransCompound(page)) + return; + + __page_unqueue_deferred_split(page); +} +#else +static inline void page_unqueue_deferred_split(struct page *page) +{ +} +#endif #endif /* __MM_INTERNAL_H */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index fa91ac9de753..ecec5862aa19 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -7337,6 +7337,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) VM_BUG_ON_PAGE(oldid, page); mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries); + page_unqueue_deferred_split(page); page->mem_cgroup = NULL; if (!mem_cgroup_is_root(memcg)) -- Gitee From 3eb04ed3aa34aee67ff81cdfdc7d638cf9680b26 Mon Sep 17 00:00:00 2001 From: junhua huang Date: Fri, 2 Dec 2022 15:11:10 +0800 Subject: [PATCH 6/9] arm64:uprobe fix the uprobe SWBP_INSN in big-endian MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit stable inclusion from stable-v5.10.229 commit 7f1ef59185d23450d77e94671c0abb03f8978c01 category: bugfix issue: #IBBMYW CVE: CVE-2024-50194 Signed-off-by: May27May --------------------------------------- [ Upstream commit 60f07e22a73d318cddaafa5ef41a10476807cc07 ] We use uprobe in aarch64_be, which we found the tracee task would exit due to SIGILL when we enable the uprobe trace. We can see the replace inst from uprobe is not correct in aarch big-endian. As in Armv8-A, instruction fetches are always treated as little-endian, we should treat the UPROBE_SWBP_INSN as little-endian。 The test case is as following。 bash-4.4# ./mqueue_test_aarchbe 1 1 2 1 10 > /dev/null & bash-4.4# cd /sys/kernel/debug/tracing/ bash-4.4# echo 'p:test /mqueue_test_aarchbe:0xc30 %x0 %x1' > uprobe_events bash-4.4# echo 1 > events/uprobes/enable bash-4.4# bash-4.4# ps PID TTY TIME CMD 140 ? 00:00:01 bash 237 ? 00:00:00 ps [1]+ Illegal instruction ./mqueue_test_aarchbe 1 1 2 1 100 > /dev/null which we debug use gdb as following: bash-4.4# gdb attach 155 (gdb) disassemble send Dump of assembler code for function send: 0x0000000000400c30 <+0>: .inst 0xa00020d4 ; undefined 0x0000000000400c34 <+4>: mov x29, sp 0x0000000000400c38 <+8>: str w0, [sp, #28] 0x0000000000400c3c <+12>: strb w1, [sp, #27] 0x0000000000400c40 <+16>: str xzr, [sp, #40] 0x0000000000400c44 <+20>: str xzr, [sp, #48] 0x0000000000400c48 <+24>: add x0, sp, #0x1b 0x0000000000400c4c <+28>: mov w3, #0x0 // #0 0x0000000000400c50 <+32>: mov x2, #0x1 // #1 0x0000000000400c54 <+36>: mov x1, x0 0x0000000000400c58 <+40>: ldr w0, [sp, #28] 0x0000000000400c5c <+44>: bl 0x405e10 0x0000000000400c60 <+48>: str w0, [sp, #60] 0x0000000000400c64 <+52>: ldr w0, [sp, #60] 0x0000000000400c68 <+56>: ldp x29, x30, [sp], #64 0x0000000000400c6c <+60>: ret End of assembler dump. (gdb) info b No breakpoints or watchpoints. (gdb) c Continuing. Program received signal SIGILL, Illegal instruction. 0x0000000000400c30 in send () (gdb) x/10x 0x400c30 0x400c30 : 0xd42000a0 0xfd030091 0xe01f00b9 0xe16f0039 0x400c40 : 0xff1700f9 0xff1b00f9 0xe06f0091 0x03008052 0x400c50 : 0x220080d2 0xe10300aa (gdb) disassemble 0x400c30 Dump of assembler code for function send: => 0x0000000000400c30 <+0>: .inst 0xa00020d4 ; undefined 0x0000000000400c34 <+4>: mov x29, sp 0x0000000000400c38 <+8>: str w0, [sp, #28] 0x0000000000400c3c <+12>: strb w1, [sp, #27] 0x0000000000400c40 <+16>: str xzr, [sp, #40] Signed-off-by: junhua huang Link: https://lore.kernel.org/r/202212021511106844809@zte.com.cn Signed-off-by: Will Deacon Stable-dep-of: 13f8f1e05f1d ("arm64: probes: Fix uprobes for big-endian kernels") Signed-off-by: Sasha Levin Signed-off-by: May27May --- arch/arm64/include/asm/uprobes.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/uprobes.h b/arch/arm64/include/asm/uprobes.h index 315eef654e39..ba4bff5ca674 100644 --- a/arch/arm64/include/asm/uprobes.h +++ b/arch/arm64/include/asm/uprobes.h @@ -12,7 +12,7 @@ #define MAX_UINSN_BYTES AARCH64_INSN_SIZE -#define UPROBE_SWBP_INSN BRK64_OPCODE_UPROBES +#define UPROBE_SWBP_INSN cpu_to_le32(BRK64_OPCODE_UPROBES) #define UPROBE_SWBP_INSN_SIZE AARCH64_INSN_SIZE #define UPROBE_XOL_SLOT_BYTES MAX_UINSN_BYTES -- Gitee From 557f6d05b734a5587f2bffbfb228b7ba0e63cb33 Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Tue, 8 Oct 2024 16:58:48 +0100 Subject: [PATCH 7/9] arm64: probes: Fix uprobes for big-endian kernels stable inclusion from stable-v5.10.229 commit cf9ddf9ed94c15564a05bbf6e9f18dffa0c7df80 category: bugfix issue: #IBBMYW CVE: CVE-2024-50194 Signed-off-by: May27May --------------------------------------- [ Upstream commit 13f8f1e05f1dc36dbba6cba0ae03354c0dafcde7 ] The arm64 uprobes code is broken for big-endian kernels as it doesn't convert the in-memory instruction encoding (which is always little-endian) into the kernel's native endianness before analyzing and simulating instructions. This may result in a few distinct problems: * The kernel may may erroneously reject probing an instruction which can safely be probed. * The kernel may erroneously erroneously permit stepping an instruction out-of-line when that instruction cannot be stepped out-of-line safely. * The kernel may erroneously simulate instruction incorrectly dur to interpretting the byte-swapped encoding. The endianness mismatch isn't caught by the compiler or sparse because: * The arch_uprobe::{insn,ixol} fields are encoded as arrays of u8, so the compiler and sparse have no idea these contain a little-endian 32-bit value. The core uprobes code populates these with a memcpy() which similarly does not handle endianness. * While the uprobe_opcode_t type is an alias for __le32, both arch_uprobe_analyze_insn() and arch_uprobe_skip_sstep() cast from u8[] to the similarly-named probe_opcode_t, which is an alias for u32. Hence there is no endianness conversion warning. Fix this by changing the arch_uprobe::{insn,ixol} fields to __le32 and adding the appropriate __le32_to_cpu() conversions prior to consuming the instruction encoding. The core uprobes copies these fields as opaque ranges of bytes, and so is unaffected by this change. At the same time, remove MAX_UINSN_BYTES and consistently use AARCH64_INSN_SIZE for clarity. Tested with the following: | #include | #include | | #define noinline __attribute__((noinline)) | | static noinline void *adrp_self(void) | { | void *addr; | | asm volatile( | " adrp %x0, adrp_self\n" | " add %x0, %x0, :lo12:adrp_self\n" | : "=r" (addr)); | } | | | int main(int argc, char *argv) | { | void *ptr = adrp_self(); | bool equal = (ptr == adrp_self); | | printf("adrp_self => %p\n" | "adrp_self() => %p\n" | "%s\n", | adrp_self, ptr, equal ? "EQUAL" : "NOT EQUAL"); | | return 0; | } .... where the adrp_self() function was compiled to: | 00000000004007e0 : | 4007e0: 90000000 adrp x0, 400000 <__ehdr_start> | 4007e4: 911f8000 add x0, x0, #0x7e0 | 4007e8: d65f03c0 ret Before this patch, the ADRP is not recognized, and is assumed to be steppable, resulting in corruption of the result: | # ./adrp-self | adrp_self => 0x4007e0 | adrp_self() => 0x4007e0 | EQUAL | # echo 'p /root/adrp-self:0x007e0' > /sys/kernel/tracing/uprobe_events | # echo 1 > /sys/kernel/tracing/events/uprobes/enable | # ./adrp-self | adrp_self => 0x4007e0 | adrp_self() => 0xffffffffff7e0 | NOT EQUAL After this patch, the ADRP is correctly recognized and simulated: | # ./adrp-self | adrp_self => 0x4007e0 | adrp_self() => 0x4007e0 | EQUAL | # | # echo 'p /root/adrp-self:0x007e0' > /sys/kernel/tracing/uprobe_events | # echo 1 > /sys/kernel/tracing/events/uprobes/enable | # ./adrp-self | adrp_self => 0x4007e0 | adrp_self() => 0x4007e0 | EQUAL Fixes: 9842ceae9fa8 ("arm64: Add uprobe support") Cc: stable@vger.kernel.org Signed-off-by: Mark Rutland Cc: Catalin Marinas Cc: Will Deacon Link: https://lore.kernel.org/r/20241008155851.801546-4-mark.rutland@arm.com Signed-off-by: Will Deacon Signed-off-by: Sasha Levin Signed-off-by: May27May --- arch/arm64/include/asm/uprobes.h | 8 +++----- arch/arm64/kernel/probes/uprobes.c | 4 ++-- 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/arch/arm64/include/asm/uprobes.h b/arch/arm64/include/asm/uprobes.h index ba4bff5ca674..98f29a43bfe8 100644 --- a/arch/arm64/include/asm/uprobes.h +++ b/arch/arm64/include/asm/uprobes.h @@ -10,11 +10,9 @@ #include #include -#define MAX_UINSN_BYTES AARCH64_INSN_SIZE - #define UPROBE_SWBP_INSN cpu_to_le32(BRK64_OPCODE_UPROBES) #define UPROBE_SWBP_INSN_SIZE AARCH64_INSN_SIZE -#define UPROBE_XOL_SLOT_BYTES MAX_UINSN_BYTES +#define UPROBE_XOL_SLOT_BYTES AARCH64_INSN_SIZE typedef u32 uprobe_opcode_t; @@ -23,8 +21,8 @@ struct arch_uprobe_task { struct arch_uprobe { union { - u8 insn[MAX_UINSN_BYTES]; - u8 ixol[MAX_UINSN_BYTES]; + __le32 insn; + __le32 ixol; }; struct arch_probe_insn api; bool simulate; diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c index 2c247634552b..8a02c549e57f 100644 --- a/arch/arm64/kernel/probes/uprobes.c +++ b/arch/arm64/kernel/probes/uprobes.c @@ -42,7 +42,7 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, else if (!IS_ALIGNED(addr, AARCH64_INSN_SIZE)) return -EINVAL; - insn = *(probe_opcode_t *)(&auprobe->insn[0]); + insn = le32_to_cpu(auprobe->insn); switch (arm_probe_decode_insn(insn, &auprobe->api)) { case INSN_REJECTED: @@ -108,7 +108,7 @@ bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs) if (!auprobe->simulate) return false; - insn = *(probe_opcode_t *)(&auprobe->insn[0]); + insn = le32_to_cpu(auprobe->insn); addr = instruction_pointer(regs); if (auprobe->api.handler) -- Gitee From 54e6b2e9d293847e10756f72c2c2f3113acf8e94 Mon Sep 17 00:00:00 2001 From: junhua huang Date: Wed, 28 Dec 2022 09:54:12 +0800 Subject: [PATCH 8/9] arm64/uprobes: change the uprobe_opcode_t typedef to fix the sparse warning stable inclusion from stable-v5.10.229 commit 4abbba7105831a4402b2c983b24503b908f39396 category: bugfix issue: #IBBMYW CVE: CVE-2024-50194 Signed-off-by: May27May --------------------------------------- commit ef08c0fadd8a17ebe429b85e23952dac3263ad34 upstream. After we fixed the uprobe inst endian in aarch_be, the sparse check report the following warning info: sparse warnings: (new ones prefixed by >>) >> kernel/events/uprobes.c:223:25: sparse: sparse: restricted __le32 degrades to integer >> kernel/events/uprobes.c:574:56: sparse: sparse: incorrect type in argument 4 (different base types) @@ expected unsigned int [addressable] [usertype] opcode @@ got restricted __le32 [usertype] @@ kernel/events/uprobes.c:574:56: sparse: expected unsigned int [addressable] [usertype] opcode kernel/events/uprobes.c:574:56: sparse: got restricted __le32 [usertype] >> kernel/events/uprobes.c:1483:32: sparse: sparse: incorrect type in initializer (different base types) @@ expected unsigned int [usertype] insn @@ got restricted __le32 [usertype] @@ kernel/events/uprobes.c:1483:32: sparse: expected unsigned int [usertype] insn kernel/events/uprobes.c:1483:32: sparse: got restricted __le32 [usertype] use the __le32 to u32 for uprobe_opcode_t, to keep the same. Fixes: 60f07e22a73d ("arm64:uprobe fix the uprobe SWBP_INSN in big-endian") Reported-by: kernel test robot Signed-off-by: junhua huang Link: https://lore.kernel.org/r/202212280954121197626@zte.com.cn Signed-off-by: Will Deacon Signed-off-by: Greg Kroah-Hartman Signed-off-by: May27May --- arch/arm64/include/asm/uprobes.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/uprobes.h b/arch/arm64/include/asm/uprobes.h index 98f29a43bfe8..014b02897f8e 100644 --- a/arch/arm64/include/asm/uprobes.h +++ b/arch/arm64/include/asm/uprobes.h @@ -14,7 +14,7 @@ #define UPROBE_SWBP_INSN_SIZE AARCH64_INSN_SIZE #define UPROBE_XOL_SLOT_BYTES AARCH64_INSN_SIZE -typedef u32 uprobe_opcode_t; +typedef __le32 uprobe_opcode_t; struct arch_uprobe_task { }; -- Gitee From a5a5de00916068d2f4b102b3e9c2d018ab8ec463 Mon Sep 17 00:00:00 2001 From: Eduard Zingerman Date: Tue, 24 Sep 2024 14:08:43 -0700 Subject: [PATCH 9/9] bpf: sync_linked_regs() must preserve subreg_def mainline inclusion from mainline-v6.12-rc4 commit e9bd9c498cb0f5843996dbe5cbce7a1836a83c70 category: bugfix issue: #IBBMYW CVE: CVE-2024-53125 Signed-off-by: May27May --------------------------------------- Range propagation must not affect subreg_def marks, otherwise the following example is rewritten by verifier incorrectly when BPF_F_TEST_RND_HI32 flag is set: 0: call bpf_ktime_get_ns call bpf_ktime_get_ns 1: r0 &= 0x7fffffff after verifier r0 &= 0x7fffffff 2: w1 = w0 rewrites w1 = w0 3: if w0 < 10 goto +0 --------------> r11 = 0x2f5674a6 (r) 4: r1 >>= 32 r11 <<= 32 (r) 5: r0 = r1 r1 |= r11 (r) 6: exit; if w0 < 0xa goto pc+0 r1 >>= 32 r0 = r1 exit (or zero extension of w1 at (2) is missing for architectures that require zero extension for upper register half). The following happens w/o this patch: - r0 is marked as not a subreg at (0); - w1 is marked as subreg at (2); - w1 subreg_def is overridden at (3) by copy_register_state(); - w1 is read at (5) but mark_insn_zext() does not mark (2) for zero extension, because w1 subreg_def is not set; - because of BPF_F_TEST_RND_HI32 flag verifier inserts random value for hi32 bits of (2) (marked (r)); - this random value is read at (5). Fixes: 75748837b7e5 ("bpf: Propagate scalar ranges through register assignments.") Reported-by: Lonial Con Signed-off-by: Lonial Con Signed-off-by: Eduard Zingerman Signed-off-by: Andrii Nakryiko Signed-off-by: Daniel Borkmann Acked-by: Daniel Borkmann Closes: https://lore.kernel.org/bpf/7e2aa30a62d740db182c170fdd8f81c596df280d.camel@gmail.com Link: https://lore.kernel.org/bpf/20240924210844.1758441-1-eddyz87@gmail.com Signed-off-by: May27May Conflicts: kernel/bpf/verifier.c --- kernel/bpf/verifier.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index fe07e7d0a9e6..0ef625098d6e 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -8042,8 +8042,12 @@ static void find_equal_scalars(struct bpf_verifier_state *vstate, struct bpf_reg_state *reg; bpf_for_each_reg_in_vstate(vstate, state, reg, ({ - if (reg->type == SCALAR_VALUE && reg->id == known_reg->id) + if (reg->type == SCALAR_VALUE && reg->id == known_reg->id) { + s32 saved_subreg_def = reg->subreg_def; + copy_register_state(reg, known_reg); + reg->subreg_def = saved_subreg_def; + } })); } -- Gitee