kernel: Do not call quiet_vmstat from IRQ context
We received a bug report indicating that the "Dirty" field in /proc/meminfo was increasing without bounds, to the point that the number of dirty file pages would eventually reach what is enforced by the vm.dirty_bytes threshold (which is set to 800_000_000 bytes in StarlingX) and cause any task attempting to carry out disk I/O to get blocked. Upon further debugging, we noticed that this issue occurred on nohz_full CPUs where a user application was carrying out disk I/O by writing to and rotating log files. The issue was reproducible with the preempt-rt patch set very reliably. This commit addresses the issue in question, by reverting commit 62cb1188ed86 ("sched/idle: Move quiet_vmstate() into the NOHZ code"), which was merged in the v4.15-rc1 time frame. The revert, in effect, moves the quiet_vmstat function call from hard IRQ context back to the start of the idle loop. Please see the patch description for a more detailed overview. Note that this commit does not introduce a "novel" change, as the 4.14.298-rt140 kernel, released on 2022-11-04 does not have the reverted commit either, which should preclude the need for regression testing in terms of functionality and performance. I would like to acknowledge the extensive help and guidance provided by Jim Somerville <jim.somerville@windriver.com> during the debugging and investigation of this issue. Verification - The issue was reproduced with an older CentOS-based StarlingX-based system, running a StarlingX/linux-yocto preempt-rt kernel based on v5.10.112-rt61 by running a test application for about 4~5 hours. In this configuration, the issue becomes apparent within 1 hour or so, where the Dirty field in /proc/meminfo reaches the threshold sysctl vm.dirty_background_bytes (set to 600_000_000 bytes in StarlingX). By the end of the test, the Dirty field was very close to the vm.dirty_bytes threshold sysctl (800_000_000 bytes). Afterwards, a kernel patched with this commit was found to no longer reproduce the issue, by running the same test application for ~12.5 hours. (Note that the second test had Meltdown/Spectre mitigations enabled by accident, but we are confident that this does not affect the test results.) The Dirty value in /proc/meminfo stayed around 180_000 KiB for the duration of the test. A test re-run with the Meltdown/Spectre mitigations disabled, for a duration of 1.75 hours, had similar results. The test application that reproduces this issue writes to and rotates log files in a rapid manner, with a usleep(0) call between every log file rotation. The issue is reproduced on nohz_full CPUs with the preempt-rt kernel, more reliably at least. - A Debian-based StarlingX ISO image was successfully built with this commit. - The ISO image was successfully installed into a qemu/KVM-based virtual machine using the All-in-One Simplex, low-latency profile, and the Ansible bootstrap procedure was successful. - The issue was confirmed to no longer exist with this commit, by running multiple concurrent instances of a simplified test application for about 30 minutes (with the installation resulting from the Debian-based StarlingX ISO image built with this commit). Without a patched kernel, the issue becomes apparent within 10 minutes of test runtime in this configuration. Closes-Bug: 2002039 Change-Id: I818d8bd751f4b1941a26530a99a4a635e98d5c54 Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
This commit is contained in:
parent
f29ab14ca8
commit
436c7067d0
@ -0,0 +1,120 @@
|
||||
From d636082d0ae322ebd744b88ccdc3a8a501826494 Mon Sep 17 00:00:00 2001
|
||||
From: "M. Vefa Bicakci" <vefa.bicakci@windriver.com>
|
||||
Date: Wed, 4 Jan 2023 20:36:53 -0500
|
||||
Subject: [PATCH] Revert "sched/idle: Move quiet_vmstate() into the NOHZ code"
|
||||
|
||||
This reverts commit 62cb1188ed86a9cf082fd2f757d4dd9b54741f24.
|
||||
|
||||
We received a bug report indicating that the "Dirty" field in
|
||||
/proc/meminfo was increasing without bounds, to the point that the
|
||||
number of dirty file pages would eventually reach what is enforced by
|
||||
the vm.dirty_bytes threshold (which is set to 800_000_000 bytes in
|
||||
StarlingX) and cause any task attempting to carry out disk I/O to get
|
||||
blocked.
|
||||
|
||||
Upon further debugging, we noticed that this issue occurred on nohz_full
|
||||
CPUs where a user application was carrying out disk I/O by writing to
|
||||
and rotating log files, with a usleep(0) call between every log file
|
||||
rotation. The issue was reproducible with the preempt-rt patch set very
|
||||
reliably.
|
||||
|
||||
The reverted commit moved the quiet_vmstat() call from the entry of the
|
||||
idle loop (do_idle function) to the tick_nohz_stop_tick function.
|
||||
However, the tick_nohz_stop_tick function is called very frequently from
|
||||
hard IRQ context via the following call chain:
|
||||
|
||||
irq_exit_rcu
|
||||
tick_irq_exit (has a condition to check for nohz_full CPUs)
|
||||
tick_nohz_irq_exit
|
||||
tick_nohz_full_update_tick
|
||||
tick_nohz_stop_sched_tick
|
||||
tick_nohz_stop_tick
|
||||
quiet_vmstat
|
||||
|
||||
The check for nohz_full CPUs in tick_irq_exit() explains why this issue
|
||||
occurred with nohz_full CPUs more reliably.
|
||||
|
||||
Calling quiet_vmstat from hard IRQ context is problematic.
|
||||
quiet_vmstat() makes the following calls to update vm_node_stat as well
|
||||
as other statistics such as vm_zone_stat and vm_numa_stat. Recall that
|
||||
an element in the vm_node_stat array tracks the number of dirty file
|
||||
pages:
|
||||
|
||||
quiet_vmstat
|
||||
refresh_cpu_vm_stats
|
||||
fold_diff (Updates vm_node_stat and other statistics)
|
||||
|
||||
However, __mod_node_page_state() (and fellow functions) also update
|
||||
vm_node_stat, and although it is called with interrupts disabled in most
|
||||
contexts (via spin_lock_irqsave), there are instances where it is called
|
||||
with interrupts enabled (as evidenced by instrumenting the function with
|
||||
counters that count the number of times the function was called with and
|
||||
without interrupts disabled). Also, the fact that __mod_node_page_state
|
||||
and its sibling function __mod_zone_page_state should not be called with
|
||||
interrupts enabled is evidenced by the following comment in mm/vmstat.c
|
||||
above __mod_zone_page_state():
|
||||
|
||||
For use when we know that interrupts are disabled, or when we know
|
||||
that preemption is disabled and that particular counter cannot be
|
||||
updated from interrupt context.
|
||||
|
||||
Furthermore, recall that the preempt-rt patch set makes most spinlocks
|
||||
sleeping locks and changes the implementation of spin_lock_irqsave in
|
||||
such a way that IRQs are *not* disabled by spin_lock_irqsave. With the
|
||||
preempt-rt patch set, this corresponds to a significant increase in the
|
||||
number of calls to __mod_node_page_state() with interrupts *enabled*.
|
||||
This in turn significantly increases the possibility of incorrectly
|
||||
modifying global statistics variables such the ones in the vm_node_stat
|
||||
array.
|
||||
|
||||
To avoid this issue, we revert commit 62cb1188ed86 ("sched/idle: Move
|
||||
quiet_vmstate() into the NOHZ code") and therefore move the quiet_vmstat
|
||||
call back into the idle loop's entry point, where it is *not* called
|
||||
from an hard IRQ context. With this revert applied, the issue is no
|
||||
longer reproducible.
|
||||
|
||||
I would like to acknowledge the extensive help and guidance provided by
|
||||
Jim Somerville <jim.somerville@windriver.com> during the debugging and
|
||||
investigation of this issue.
|
||||
|
||||
Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
|
||||
---
|
||||
kernel/sched/idle.c | 1 +
|
||||
kernel/time/tick-sched.c | 2 --
|
||||
2 files changed, 1 insertion(+), 2 deletions(-)
|
||||
|
||||
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
|
||||
index 2593a733c084..a4122590ca54 100644
|
||||
--- a/kernel/sched/idle.c
|
||||
+++ b/kernel/sched/idle.c
|
||||
@@ -271,6 +271,7 @@ static void do_idle(void)
|
||||
*/
|
||||
|
||||
__current_set_polling();
|
||||
+ quiet_vmstat();
|
||||
tick_nohz_idle_enter();
|
||||
|
||||
while (!need_resched()) {
|
||||
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
|
||||
index a800fd555499..9935576b295d 100644
|
||||
--- a/kernel/time/tick-sched.c
|
||||
+++ b/kernel/time/tick-sched.c
|
||||
@@ -24,7 +24,6 @@
|
||||
#include <linux/irq_work.h>
|
||||
#include <linux/posix-timers.h>
|
||||
#include <linux/context_tracking.h>
|
||||
-#include <linux/mm.h>
|
||||
|
||||
#include <asm/irq_regs.h>
|
||||
|
||||
@@ -812,7 +811,6 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
|
||||
*/
|
||||
if (!ts->tick_stopped) {
|
||||
calc_load_nohz_start();
|
||||
- quiet_vmstat();
|
||||
|
||||
ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
|
||||
ts->tick_stopped = 1;
|
||||
--
|
||||
2.25.1
|
||||
|
@ -35,3 +35,4 @@
|
||||
0037-xfs-drop-unused-ioend-private-merge-and-setfilesize-.patch
|
||||
0038-xfs-drop-unnecessary-setfilesize-helper.patch
|
||||
0039-samples-bpf-use-kprobe-and-urandom_read_iter.patch
|
||||
0040-Revert-sched-idle-Move-quiet_vmstate-into-the-NOHZ-c.patch
|
||||
|
@ -0,0 +1,120 @@
|
||||
From ce6ac6352c2d5d319a852959b0d3169a0cf23d78 Mon Sep 17 00:00:00 2001
|
||||
From: "M. Vefa Bicakci" <vefa.bicakci@windriver.com>
|
||||
Date: Wed, 4 Jan 2023 20:41:54 -0500
|
||||
Subject: [PATCH] Revert "sched/idle: Move quiet_vmstate() into the NOHZ code"
|
||||
|
||||
This reverts commit 62cb1188ed86a9cf082fd2f757d4dd9b54741f24.
|
||||
|
||||
We received a bug report indicating that the "Dirty" field in
|
||||
/proc/meminfo was increasing without bounds, to the point that the
|
||||
number of dirty file pages would eventually reach what is enforced by
|
||||
the vm.dirty_bytes threshold (which is set to 800_000_000 bytes in
|
||||
StarlingX) and cause any task attempting to carry out disk I/O to get
|
||||
blocked.
|
||||
|
||||
Upon further debugging, we noticed that this issue occurred on nohz_full
|
||||
CPUs where a user application was carrying out disk I/O by writing to
|
||||
and rotating log files, with a usleep(0) call between every log file
|
||||
rotation. The issue was reproducible with the preempt-rt patch set very
|
||||
reliably.
|
||||
|
||||
The reverted commit moved the quiet_vmstat() call from the entry of the
|
||||
idle loop (do_idle function) to the tick_nohz_stop_tick function.
|
||||
However, the tick_nohz_stop_tick function is called very frequently from
|
||||
hard IRQ context via the following call chain:
|
||||
|
||||
irq_exit_rcu
|
||||
tick_irq_exit (has a condition to check for nohz_full CPUs)
|
||||
tick_nohz_irq_exit
|
||||
tick_nohz_full_update_tick
|
||||
tick_nohz_stop_sched_tick
|
||||
tick_nohz_stop_tick
|
||||
quiet_vmstat
|
||||
|
||||
The check for nohz_full CPUs in tick_irq_exit() explains why this issue
|
||||
occurred with nohz_full CPUs more reliably.
|
||||
|
||||
Calling quiet_vmstat from hard IRQ context is problematic.
|
||||
quiet_vmstat() makes the following calls to update vm_node_stat as well
|
||||
as other statistics such as vm_zone_stat and vm_numa_stat. Recall that
|
||||
an element in the vm_node_stat array tracks the number of dirty file
|
||||
pages:
|
||||
|
||||
quiet_vmstat
|
||||
refresh_cpu_vm_stats
|
||||
fold_diff (Updates vm_node_stat and other statistics)
|
||||
|
||||
However, __mod_node_page_state() (and fellow functions) also update
|
||||
vm_node_stat, and although it is called with interrupts disabled in most
|
||||
contexts (via spin_lock_irqsave), there are instances where it is called
|
||||
with interrupts enabled (as evidenced by instrumenting the function with
|
||||
counters that count the number of times the function was called with and
|
||||
without interrupts disabled). Also, the fact that __mod_node_page_state
|
||||
and its sibling function __mod_zone_page_state should not be called with
|
||||
interrupts enabled is evidenced by the following comment in mm/vmstat.c
|
||||
above __mod_zone_page_state():
|
||||
|
||||
For use when we know that interrupts are disabled, or when we know
|
||||
that preemption is disabled and that particular counter cannot be
|
||||
updated from interrupt context.
|
||||
|
||||
Furthermore, recall that the preempt-rt patch set makes most spinlocks
|
||||
sleeping locks and changes the implementation of spin_lock_irqsave in
|
||||
such a way that IRQs are *not* disabled by spin_lock_irqsave. With the
|
||||
preempt-rt patch set, this corresponds to a significant increase in the
|
||||
number of calls to __mod_node_page_state() with interrupts *enabled*.
|
||||
This in turn significantly increases the possibility of incorrectly
|
||||
modifying global statistics variables such the ones in the vm_node_stat
|
||||
array.
|
||||
|
||||
To avoid this issue, we revert commit 62cb1188ed86 ("sched/idle: Move
|
||||
quiet_vmstate() into the NOHZ code") and therefore move the quiet_vmstat
|
||||
call back into the idle loop's entry point, where it is *not* called
|
||||
from an hard IRQ context. With this revert applied, the issue is no
|
||||
longer reproducible.
|
||||
|
||||
I would like to acknowledge the extensive help and guidance provided by
|
||||
Jim Somerville <jim.somerville@windriver.com> during the debugging and
|
||||
investigation of this issue.
|
||||
|
||||
Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
|
||||
---
|
||||
kernel/sched/idle.c | 1 +
|
||||
kernel/time/tick-sched.c | 2 --
|
||||
2 files changed, 1 insertion(+), 2 deletions(-)
|
||||
|
||||
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
|
||||
index 2593a733c084..a4122590ca54 100644
|
||||
--- a/kernel/sched/idle.c
|
||||
+++ b/kernel/sched/idle.c
|
||||
@@ -271,6 +271,7 @@ static void do_idle(void)
|
||||
*/
|
||||
|
||||
__current_set_polling();
|
||||
+ quiet_vmstat();
|
||||
tick_nohz_idle_enter();
|
||||
|
||||
while (!need_resched()) {
|
||||
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
|
||||
index 92fb738813f3..81ef5f83a0d8 100644
|
||||
--- a/kernel/time/tick-sched.c
|
||||
+++ b/kernel/time/tick-sched.c
|
||||
@@ -24,7 +24,6 @@
|
||||
#include <linux/irq_work.h>
|
||||
#include <linux/posix-timers.h>
|
||||
#include <linux/context_tracking.h>
|
||||
-#include <linux/mm.h>
|
||||
|
||||
#include <asm/irq_regs.h>
|
||||
|
||||
@@ -812,7 +811,6 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
|
||||
*/
|
||||
if (!ts->tick_stopped) {
|
||||
calc_load_nohz_start();
|
||||
- quiet_vmstat();
|
||||
|
||||
ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
|
||||
ts->tick_stopped = 1;
|
||||
--
|
||||
2.25.1
|
||||
|
@ -34,3 +34,4 @@
|
||||
0036-xfs-drop-unused-ioend-private-merge-and-setfilesize-.patch
|
||||
0037-xfs-drop-unnecessary-setfilesize-helper.patch
|
||||
0038-samples-bpf-use-kprobe-and-urandom_read_iter.patch
|
||||
0039-Revert-sched-idle-Move-quiet_vmstate-into-the-NOHZ-c.patch
|
||||
|
Loading…
x
Reference in New Issue
Block a user