Eric MacDonald 94afab2a6b Reduce CPU spikes during collect
The collect tool executes various Linux and system commands to gather
data into an archived collect bundle.

System administrators often need to run collect on busy, in-service
systems. During such runs they have reported excessive CPU usage,
with undesirable CPU spikes caused by certain collect operations.

While the collect tool already employs throttling for ssh and scp,
its data collection and archiving commands currently lack similar
safeguards.

This update introduces the following enhancements to mitigate CPU
spikes and improve performance on heavily loaded in-service servers:

 - remove one unnecessary tar archive operation.
 - add tar archive checkpoint option support with an action handler.
 - remove one redundant kubectl api-resources call in the
   containerization plugin.
 - add --chunk-size=50 to the all-in-one kubectl get api-resources
   command to help throttle this long running heavyweight command.
   A chunk size of 50 seems to yield the lowest k8s api latency as
   measured with the k8smetrics tool.
 - launch collect plugins with 'nice' and 'ionice' attributes.
 - add 'nice' and 'ionice' attributes to select commands.
 - add sleep delays after known cpu intensive data collection commands.
 - remove the unnecessary -v (verbose) option from all tar commands.
 - add a run_command utility that times the execution of commands
   and adaptively adds a small post execution delay based on how
   long that command took to run.
 - reduce the cpu impact of the containerization plugin by adding
   periodic delays.
 - add a few periodic delays in long running or cpu intensive plugins.
 - create a collect command timing log that is added to each
   host collect tarball.
 - the timing log file records how long each plugin took to run as
   well as commands called with the new run_command function.
 - fix an issue in the networking plugin.
 - add a 60 second timeout for the heavyweight 'lsof' command.
 - fix the delimiter string hostname in all plugins.
 - increase the default global timeout from 20 to 30 minutes.
 - increase the default collect_host timeout from 600 to 900 seconds.
 - increment the tool minor version.
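
To illustrate the tar checkpoint item in the list above, here is a
minimal sketch and not the code from this change. It assumes GNU tar
on Linux. The names throttled_tar and CHECKPOINT_HANDLER, the load
limit and the sleep value are all hypothetical.

```shell
#!/bin/bash
# Sketch only: hypothetical names and values, assuming GNU tar on Linux.

CHECKPOINT_RECORDS=100        # run the handler every N tar records
export LOAD_LIMIT=4           # 1-minute load average treated as overload
export OVERLOAD_SLEEP=0.5     # seconds to back off when overloaded

# Handler text that tar passes to /bin/sh at each checkpoint.
# Single quotes are deliberate: expansion happens in the handler shell.
# It sleeps only when the integer part of the 1-minute load average
# has reached LOAD_LIMIT; the trailing 'true' keeps its status zero.
CHECKPOINT_HANDLER='load=$(cut -d. -f1 /proc/loadavg 2>/dev/null); '
CHECKPOINT_HANDLER+='[ "${load:-0}" -ge "${LOAD_LIMIT}" ] && '
CHECKPOINT_HANDLER+='sleep "${OVERLOAD_SLEEP}"; true'

# throttled_tar <archive> <source-dir>
# Create a gzip compressed archive of <source-dir>, invoking the
# overload handler at every checkpoint.
function throttled_tar()
{
    local archive="$1"
    local srcdir="$2"

    tar -czf "${archive}" \
        --checkpoint="${CHECKPOINT_RECORDS}" \
        --checkpoint-action=exec="${CHECKPOINT_HANDLER}" \
        -C "${srcdir}" .
}
```

A call such as 'throttled_tar /tmp/bundle.tgz /var/log' would archive
the directory while giving the handler a chance to pause tar whenever
the system looks overloaded. Running the handler as a separate shell
keeps the back-off policy out of the tar command line itself.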

These improvements aim to minimize the performance impact of running
collect on busy in-service systems.

Note: When a process is started with nice, its CPU priority is
      inherited by all threads spawned by that process.
      However, it does not restrict the total CPU time a process
      or its threads can use when no contention exists.
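
As a rough illustration of the note above, the following sketch
launches a command at reduced CPU and I/O priority. The helper name
run_low_priority and the chosen priority values are hypothetical and
are not taken from the collect tool. Because ionice (util-linux) may
be missing or not permitted, the sketch probes it first and falls
back to nice alone.

```shell
#!/bin/bash
# Sketch only: hypothetical helper, not the collect tool's implementation.

NICE_LEVEL=19          # lowest CPU scheduling priority
IONICE_CLASS=3         # 3 = idle I/O scheduling class

# run_low_priority <command> [args...]
# Run the command with low CPU priority and, when ionice is usable,
# with the idle I/O class as well. Returns the command's exit status.
function run_low_priority()
{
    # Probe ionice with a harmless command before relying on it.
    if command -v ionice > /dev/null 2>&1 && \
       ionice -c "${IONICE_CLASS}" true 2> /dev/null ; then
        ionice -c "${IONICE_CLASS}" nice -n "${NICE_LEVEL}" "$@"
    else
        nice -n "${NICE_LEVEL}" "$@"
    fi
}
```

For example, 'run_low_priority tar -czf /tmp/logs.tgz /var/log' lets
other work win when the CPU or disk is contended, but as the note
says it does not cap CPU use on an otherwise idle system.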

Test Plan:

PASS: Verify build and install of collect package.
PASS: Verify collect runtime is not substantially longer.
PASS: Verify tar checkpoint handling on busy system where checkpoint
      action handler detects and invokes system overload handling.
PASS: Verify some CPU spike reduction compared to before update.

Regression:

PASS: Compare collect bundle size and contents before and after update.
PASS: Soak collect on busy/overloaded AIO SX system.
PASS: Verify report tool reports the same data before/after update.
PASS: Verify multi-node collect.

Closes-Bug: 2090923
Change-Id: If698d5f275f4482de205fa4a37e0398b19800777
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-12-06 03:01:06 +00:00

#! /bin/bash
#
# Copyright (c) 2023,2024 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
##########################################################################################
# default timeouts for collect ; in seconds
declare -i SCP_TIMEOUT_DEFAULT=600
declare -i SSH_TIMEOUT_DEFAULT=60
declare -i SUDO_TIMEOUT_DEFAULT=60
declare -i COLLECT_HOST_TIMEOUT_DEFAULT=900
declare -i CREATE_TARBALL_TIMEOUT_DEFAULT=200
declare -i TIMEOUT_MIN_MINS=10
declare -i TIMEOUT_MAX_MINS=120
declare -i TIMEOUT_DEF_MINS=30
# shellcheck disable=SC2034
declare -i TIMEOUT_MIN_SECS=$((TIMEOUT_MIN_MINS*60))
# shellcheck disable=SC2034
declare -i TIMEOUT_MAX_SECS=$((TIMEOUT_MAX_MINS*60))
declare -i TIMEOUT_DEF_SECS=$((TIMEOUT_DEF_MINS*60)) # 30 minutes
# overall collect timeout
declare -i TIMEOUT=${TIMEOUT_DEF_SECS}
# sleep delay for specific operations outside of the run_command
# do not remove labels. Set to 0 for no delay.
COLLECT_RUNPARTS_DELAY=0.25 # delay time after running each collect plugin
COLLECT_RUNEXTRA_DELAY=0.20 # inline delay time for collect host extra commands
COLLECT_RUNCMD_DELAY=0.20 # inline delay for commands not using run_command
# collect run_command adaptive delay controls
COLLECT_RUN_COMMAND_ADAPTIVE_DELAY=true
# adaptive delays and time thresholds for collect commands
# that use run_command; these apply only when
# COLLECT_RUN_COMMAND_ADAPTIVE_DELAY is true.
# Sleep <tsize>_DELAY if the command took
# greater than or equal to <tsize>_THRESHOLD to run.
COLLECT_RUNCMD_XLARGE_THRESHOLD=2 # secs command took to run
COLLECT_RUNCMD_XLARGE_DELAY=0.75 # sleep value
COLLECT_RUNCMD_LARGE_THRESHOLD=1 # secs command took to run
COLLECT_RUNCMD_LARGE_DELAY=0.5 # sleep value
COLLECT_RUNCMD_MEDIUM_THRESHOLD=100 # milliseconds command took to run
COLLECT_RUNCMD_MEDIUM_DELAY=0.2 # sleep value
COLLECT_RUNCMD_SMALL_THRESHOLD=50 # milliseconds command took to run
COLLECT_RUNCMD_SMALL_DELAY=0.1 # sleep value
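
The thresholds above mix seconds and milliseconds, so a short sketch
may help show how a delay could be selected from them. This is a
hypothetical illustration and not the run_command function shipped
with collect. The function names select_adaptive_delay and timed_run
are assumptions, and the timing relies on GNU date supporting %N.

```shell
#!/bin/bash
# Sketch only: hypothetical illustration of adaptive delay selection.

# Values copied from the settings above.
COLLECT_RUN_COMMAND_ADAPTIVE_DELAY=true
COLLECT_RUNCMD_XLARGE_THRESHOLD=2     # secs
COLLECT_RUNCMD_XLARGE_DELAY=0.75
COLLECT_RUNCMD_LARGE_THRESHOLD=1      # secs
COLLECT_RUNCMD_LARGE_DELAY=0.5
COLLECT_RUNCMD_MEDIUM_THRESHOLD=100   # milliseconds
COLLECT_RUNCMD_MEDIUM_DELAY=0.2
COLLECT_RUNCMD_SMALL_THRESHOLD=50     # milliseconds
COLLECT_RUNCMD_SMALL_DELAY=0.1

# select_adaptive_delay <elapsed_ms>
# Echo the sleep value for a command that took <elapsed_ms> milliseconds.
# The second based thresholds are converted to milliseconds first.
function select_adaptive_delay()
{
    local -i ms=$1

    if [ "${COLLECT_RUN_COMMAND_ADAPTIVE_DELAY}" != true ] ; then
        echo 0
    elif [ "${ms}" -ge $((COLLECT_RUNCMD_XLARGE_THRESHOLD*1000)) ] ; then
        echo "${COLLECT_RUNCMD_XLARGE_DELAY}"
    elif [ "${ms}" -ge $((COLLECT_RUNCMD_LARGE_THRESHOLD*1000)) ] ; then
        echo "${COLLECT_RUNCMD_LARGE_DELAY}"
    elif [ "${ms}" -ge "${COLLECT_RUNCMD_MEDIUM_THRESHOLD}" ] ; then
        echo "${COLLECT_RUNCMD_MEDIUM_DELAY}"
    elif [ "${ms}" -ge "${COLLECT_RUNCMD_SMALL_THRESHOLD}" ] ; then
        echo "${COLLECT_RUNCMD_SMALL_DELAY}"
    else
        echo 0
    fi
}

# timed_run <command> [args...]
# Run a command, measure how long it took, then sleep for the
# selected delay. Returns the exit status of the command.
function timed_run()
{
    local -i start_ms end_ms
    local rc delay

    start_ms=$(($(date +%s%N)/1000000))
    "$@"
    rc=$?
    end_ms=$(($(date +%s%N)/1000000))

    delay=$(select_adaptive_delay $((end_ms-start_ms)))
    sleep "${delay}"
    return ${rc}
}
```

With these values, 'select_adaptive_delay 150' prints 0.2 and
'select_adaptive_delay 2500' prints 0.75, so longer running commands
are followed by longer pauses while very short ones add no delay.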