Reduce CPU spikes during collect

The collect tool executes various Linux and system commands to gather
data into an archived collect bundle.

System administrators often need to run collect on busy, in-service
systems and have reported that certain collect operations cause
excessive CPU usage, leading to undesirable CPU spikes.

While the collect tool already throttles its ssh and scp operations,
its data collection and archiving commands currently lack similar
safeguards.

This update introduces the following enhancements to mitigate CPU
spikes and improve performance on heavily loaded in-service servers:

 - remove an unnecessary tar archive operation.
 - add tar archive checkpoint option support with an action handler.
 - remove a redundant kubectl api-resources call in the
   containerization plugin.
 - add --chunk-size=50 support to the all-in-one kubectl get
   api-resources command to help throttle this long-running,
   heavyweight command. A chunk size of 50 yields the lowest k8s API
   latency as measured with the k8smetrics tool.
 - launch collect plugins with 'nice' and 'ionice' attributes.
 - add 'nice' and 'ionice' attributes to select commands.
 - add sleep delays after known CPU-intensive data collection commands.
 - remove the unnecessary -v (verbose) option from all tar commands.
 - add a run_command utility that times the execution of commands
   and adaptively adds a small post-execution delay based on how
   long each command took to run.
 - reduce the CPU impact of the containerization plugin by adding
   periodic delays.
 - add periodic delays to other long-running or CPU-intensive plugins.
 - create a collect command timing log that is added to each host
   collect tarball. The timing log records how long each plugin took
   to run as well as each command called with the new run_command
   function.
 - fix an issue in the networking plugin.
 - add a 60 second timeout for the heavyweight 'lsof' command.
 - fix the delimiter string hostname in all plugins.
 - increase the default global timeout from 20 to 30 minutes.
 - increase the default collect_host timeout from 600 to 900 seconds.
 - increment the tool minor version.
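
The run_command utility above can be sketched as follows. This is an
illustrative outline rather than the shipped implementation; the 25%
post-execution delay factor and the log file defaults are assumptions:

```shell
#!/bin/bash
# Sketch of an adaptive run_command helper: time the command, record the
# duration in a timing log, then sleep for a fraction of that duration so
# long-running commands are followed by a proportionally longer cool-down.
COLLECT_CMD_TIMING_LOG="${COLLECT_CMD_TIMING_LOG:-/tmp/collect_cmd_timing.log}"

run_command()
{
    local cmd="${1}"
    local logfile="${2}"
    local start stop duration rc

    start=${SECONDS}
    eval ${cmd} >> "${logfile}" 2>&1
    rc=${?}
    stop=${SECONDS}
    duration=$((stop - start))

    echo "$(date): ${duration}s : ${cmd}" >> "${COLLECT_CMD_TIMING_LOG}"

    # Adaptive throttle (assumed factor): sleep for a quarter of the
    # command's run time before returning to the caller.
    [ ${duration} -gt 0 ] && sleep $((duration / 4))
    return ${rc}
}

run_command "echo hello" "/tmp/collect_example.info"
```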

These improvements aim to minimize the performance impact of running
collect on busy in-service systems.
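
For reference, the tar checkpoint support works by having GNU tar exec a
handler script at regular intervals during the archive run; tar exports
TAR_ARCHIVE to that handler, which is what lets it report which archive
triggered the overload check. A minimal, self-contained demonstration
(the handler body, paths, and checkpoint interval here are illustrative,
not the shipped handler):

```shell
#!/bin/bash
# Demonstrate GNU tar's --checkpoint / --checkpoint-action=exec options.
# A real handler would sample /proc/loadavg, /proc/stat and /proc/meminfo
# and sleep when the system looks overloaded; this one just logs.
handler=$(mktemp)
cat > "${handler}" <<'EOF'
#!/bin/bash
# Illustrative handler: note the archive being checkpointed.
echo "checkpoint on ${TAR_ARCHIVE}" >> /tmp/checkpoint_demo.log
EOF
chmod +x "${handler}"

workdir=$(mktemp -d)
seq 100000 > "${workdir}/data.txt"

# Fire the handler on every record (10 KiB by default) while archiving.
tar -cf "${workdir}/example.tar" \
    --checkpoint=1 --checkpoint-action=exec="${handler}" \
    -C "${workdir}" data.txt
```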

Note: When a process is started with nice, its CPU priority is
      inherited by all threads spawned by that process. However,
      nice does not restrict the total CPU time a process or its
      threads can use when there is no CPU contention.
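
The 'nice' and 'ionice' attributes referenced throughout are
conventionally applied as command prefixes; the priority values below
are illustrative assumptions, not necessarily the ones collect uses:

```shell
#!/bin/bash
# Illustrative NICE_CMD / IONICE_CMD prefixes: nice 19 is the lowest CPU
# scheduling priority, and ionice class 3 (idle) only receives disk I/O
# time when no other process needs it.
NICE_CMD="nice -n 19"
IONICE_CMD="ionice -c 3"

# Apply both attributes to a command; fall back to nice alone if ionice
# is not available (e.g. minimal or non-Linux environments).
if command -v ionice >/dev/null 2>&1 ; then
    ${IONICE_CMD} ${NICE_CMD} ls / > /tmp/nice_demo.out
else
    ${NICE_CMD} ls / > /tmp/nice_demo.out
fi
```

Because niceness only lowers priority under contention rather than
capping CPU time, prefixes like these are paired with explicit sleep
delays in this update.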

Test Plan:

PASS: Verify build and install of collect package.
PASS: Verify collect runtime is not substantially longer.
PASS: Verify tar checkpoint handling on busy system where checkpoint
      action handler detects and invokes system overload handling.
PASS: Verify some CPU spike reduction compared to before update.

Regression:

PASS: Compare collect bundle size and contents before and after update.
PASS: Soak collect on busy/overloaded AIO SX system.
PASS: Verify report tool reports the same data before/after update.
PASS: Verify multi-node collect.

Closes-Bug: 2090923
Change-Id: If698d5f275f4482de205fa4a37e0398b19800777
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This commit is contained in:
Eric MacDonald 2024-11-25 20:15:16 +00:00
parent 5a79edbc59
commit 94afab2a6b
15 changed files with 387 additions and 219 deletions


@@ -200,7 +200,7 @@
TOOL_NAME="collect"
TOOL_VER=3
TOOL_REV=0
TOOL_REV=1
# only supported username
UN=$(whoami)
@@ -208,7 +208,6 @@ pw=""
# pull in common utils and environment
source /usr/local/sbin/collect_utils
source /etc/collect/collect_timeouts
declare -i RETVAL=${FAIL}
function collect_exit()
@@ -3618,7 +3617,7 @@ fi
create_collect_log
echo -n "creating ${COLLECT_TYPE} tarball ${TARBALL_NAME} ... "
(cd ${COLLECT_BASE_DIR} ; ${IONICE_CMD} ${NICE_CMD} ${TAR_CMD_APPEND} ${TARBALL_NAME} --remove-files ${COLLECT_NAME}/* 2>>${COLLECT_ERROR_LOG} 1>/dev/null)
(cd ${COLLECT_BASE_DIR} ; ${IONICE_CMD} ${NICE_CMD} ${TAR_CMD_APPEND} ${TARBALL_NAME} --remove-files ${COLLECT_NAME}/* ${CHECKPOINT_CMD} 2>>${COLLECT_ERROR_LOG} 1>/dev/null)
rc=${?}
if [ ${rc} -ne ${PASS} ] ; then
collect_errors ${HOSTNAME}


@@ -0,0 +1,37 @@
#!/bin/bash
#
# Copyright (c) 2024 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
TAG="COLLECT:"
# Parse /proc/loadavg and get the number of running processes
get_running_processes() {
awk '{split($4, arr, "/"); print arr[1]}' /proc/loadavg
}
# Parse /proc/stat and get the number of blocked processes
get_procs_blocked() {
awk '/^procs_blocked/ {print $2}' /proc/stat
}
# Parse writeback data size
get_writeback() {
awk '/^Writeback:/ {print $2}' /proc/meminfo
}
# Note: tar exports TAR_ARCHIVE
fsync ${TAR_ARCHIVE}
running_processes=$(get_running_processes)
procs_blocked=$(get_procs_blocked)
writeback_size=$(get_writeback)
if [ ${writeback_size} -gt 0 -o ${procs_blocked} -gt 0 ] ; then
sleep 1
logger -t ${TAG} -p user.warning "tar '${TAR_ARCHIVE}' : \
checkpoint handler overload stats -> \
running:${running_processes} blocked=${procs_blocked} writeback=${writeback_size}"
fi


@@ -1,6 +1,6 @@
#! /bin/bash
#
# Copyright (c) 2019-2022 Wind River Systems, Inc.
# Copyright (c) 2019-2022,2024 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
@@ -34,36 +34,32 @@ mkdir -p ${HELM_DIR}
source_openrc_if_needed
CMD="docker system df"
delimiter ${LOGFILE_IMG} "${CMD}"
${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_IMG}
run_command "${CMD}" "${LOGFILE_IMG}"
CMD="du -h --max-depth 1 /var/lib/docker"
delimiter ${LOGFILE_IMG} "${CMD}"
${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_IMG}
run_command "${CMD}" "${LOGFILE_IMG}"
CMD="docker image ls -a"
delimiter ${LOGFILE_IMG} "${CMD}"
${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_IMG}
run_command "${CMD}" "${LOGFILE_IMG}"
CMD="crictl images"
delimiter ${LOGFILE_IMG} "${CMD}"
${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_IMG}
run_command "${CMD}" "${LOGFILE_IMG}"
sleep ${COLLECT_RUNCMD_DELAY}
CMD="ctr -n k8s.io images list"
delimiter ${LOGFILE_IMG} "${CMD}"
${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_IMG}
run_command "${CMD}" "${LOGFILE_IMG}"
CMD="docker container ps -a"
delimiter ${LOGFILE_IMG} "${CMD}"
${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_IMG}
run_command "${CMD}" "${LOGFILE_IMG}"
CMD="crictl ps -a"
delimiter ${LOGFILE_IMG} "${CMD}"
${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_IMG}
run_command "${CMD}" "${LOGFILE_IMG}"
CMD="cat /var/lib/kubelet/cpu_manager_state | python -m json.tool"
delimiter ${LOGFILE_HOST} "${CMD}"
eval ${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_HOST}
run_command "eval ${CMD}" "${LOGFILE_HOST}"
sleep ${COLLECT_RUNCMD_DELAY}
###############################################################################
# Active Controller
@@ -102,38 +98,31 @@ if [ "$nodetype" = "controller" -a "${ACTIVE}" = true ] ; then
CMDS+=("kubectl describe helmcharts.source.toolkit.fluxcd.io -A")
CMDS+=("kubectl describe helmreleases.helm.toolkit.fluxcd.io -A")
DELAY_THROTTLE=4
delay_count=0
for CMD in "${CMDS[@]}" ; do
delimiter ${LOGFILE_KUBE} "${CMD}"
eval ${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_KUBE}
echo >>${LOGFILE_KUBE}
run_command "eval ${CMD}" "${LOGFILE_KUBE}"
if [ ! -z ${COLLECT_RUNCMD_DELAY} ] ; then
((delay_count = delay_count + 1))
if [ ${delay_count} -ge ${DELAY_THROTTLE} ] ; then
sleep ${COLLECT_RUNCMD_DELAY}
delay_count=0
fi
fi
done
# api-resources; verbose, place in separate file
CMDS=()
CMDS+=("kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found --all-namespaces")
CMDS+=("kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found --all-namespaces -o yaml")
for CMD in "${CMDS[@]}" ; do
delimiter ${LOGFILE_API} "${CMD}"
eval ${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_API}
echo >>${LOGFILE_API}
done
# describe pods; verbose, place in separate file
CMDS=()
CMDS+=("kubectl describe pods --all-namespaces")
for CMD in "${CMDS[@]}" ; do
delimiter ${LOGFILE_PODS} "${CMD}"
eval ${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_PODS}
echo >>${LOGFILE_API}
done
run_command "eval kubectl api-resources --verbs=list --namespaced -o name | xargs -I {} kubectl get {} --chunk-size=50 --show-kind --ignore-not-found --all-namespaces -o yaml" "${LOGFILE_API}"
run_command "kubectl describe pods --all-namespaces" "${LOGFILE_PODS}"
# events; verbose, place in separate file
CMDS=()
CMDS+=("kubectl get events --all-namespaces --sort-by='.metadata.creationTimestamp' -o go-template='{{range .items}}{{printf \"%s %s\t%s\t%s\t%s\t%s\n\" .firstTimestamp .involvedObject.name .involvedObject.kind .message .reason .type}}{{end}}'")
for CMD in "${CMDS[@]}" ; do
delimiter ${LOGFILE_EVENT} "${CMD}"
eval ${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_EVENT}
run_command "eval ${CMD}" "${LOGFILE_EVENT}"
echo >>${LOGFILE_EVENT}
sleep ${COLLECT_RUNCMD_DELAY}
done
# Helm related
@@ -144,8 +133,7 @@ if [ "$nodetype" = "controller" -a "${ACTIVE}" = true ] ; then
# NOTE: helm environment not configured for root user
CMD="sudo -u $(whoami) KUBECONFIG=${KUBECONFIG} helm list --all --all-namespaces"
delimiter ${LOGFILE_HELM} "${CMD}"
${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_HELM}
run_command "${CMD}" "${LOGFILE_HELM}"
# Save history for each helm release
mapfile -t RELEASES < <( ${CMD} 2>>${COLLECT_ERROR_LOG} )
@@ -157,17 +145,16 @@ if [ "$nodetype" = "controller" -a "${ACTIVE}" = true ] ; then
${CMD} >> ${HELM_DIR}/helm-history.info 2>>${COLLECT_ERROR_LOG}
done
sleep ${COLLECT_RUNCMD_DELAY}
CMD="sudo -u $(whoami) KUBECONFIG=${KUBECONFIG} helm search repo"
delimiter ${LOGFILE_HELM} "${CMD}"
${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_HELM}
run_command "${CMD}" "${LOGFILE_HELM}"
CMD="sudo -u $(whoami) KUBECONFIG=${KUBECONFIG} helm repo list"
delimiter ${LOGFILE_HELM} "${CMD}"
${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE_HELM}
run_command "${CMD}" "${LOGFILE_HELM}"
CMD="cp -r /opt/platform/helm_charts ${HELM_DIR}/"
delimiter ${LOGFILE} "${CMD}"
${CMD} 2>>${COLLECT_ERROR_LOG}
run_command "${CMD}" "${LOGFILE}"
export $(grep '^ETCD_LISTEN_CLIENT_URLS=' /etc/etcd/etcd.conf | tr -d '"')
@@ -181,8 +168,7 @@ if [ "$nodetype" = "controller" -a "${ACTIVE}" = true ] ; then
--key=/etc/etcd/etcd-server.key --cacert=/etc/etcd/ca.crt"
fi
delimiter ${LOGFILE} "${CMD}"
${CMD} 2>>${COLLECT_ERROR_LOG} >> ${ETCD_DB_FILE}
run_command "${CMD}" "${ETCD_DB_FILE}"
fi
exit 0


@@ -1,5 +1,7 @@
#! /bin/bash
#
# Copyright (c) 2024 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
@@ -27,8 +29,11 @@ if [ "$nodetype" = "controller" ] ; then
# These go into the SERVICE.info file
delimiter ${LOGFILE} "fm alarm-list"
fm alarm-list 2>>${COLLECT_ERROR_LOG} >> ${LOGFILE}
sleep ${COLLECT_RUNCMD_DELAY}
delimiter ${LOGFILE} "fm event-list --nopaging"
fm event-list --nopaging 2>>${COLLECT_ERROR_LOG} >> ${LOGFILE}
sleep ${COLLECT_RUNCMD_DELAY}
fi
exit 0


@@ -73,7 +73,6 @@ fi
COLLECT_BASE_DIR="/scratch"
EXTRA="var/extra"
hostname="${HOSTNAME}"
COLLECT_NAME_DIR="${COLLECT_BASE_DIR}/${COLLECT_NAME}"
EXTRA_DIR="${COLLECT_NAME_DIR}/${EXTRA}"
TARBALL="${COLLECT_NAME_DIR}.tgz"
@@ -89,6 +88,8 @@ COLLECT_DIR_USAGE_CMD="df -h ${COLLECT_BASE_DIR}"
COLLECT_DATE="/usr/local/sbin/collect_date"
COLLECT_SYSINV="${COLLECT_PATH}/collect_sysinv"
rm -f ${COLLECT_CMD_TIMING_LOG}
function log_space()
{
local msg=${1}
@@ -134,11 +135,19 @@ function collect_parts()
if [ -d ${COLLECT_PATH} ]; then
for i in ${COLLECT_PATH}/*; do
if [ -f $i ]; then
local start=$(date ${DATE_FORMAT})
if [ ${i} = ${COLLECT_SYSINV} ]; then
$i ${COLLECT_NAME_DIR} ${EXTRA_DIR} ${hostname} ${INVENTORY}
${IONICE_CMD} ${NICE_CMD} $i ${COLLECT_NAME_DIR} ${EXTRA_DIR} ${hostname} ${INVENTORY}
else
$i ${COLLECT_NAME_DIR} ${EXTRA_DIR} ${hostname}
${IONICE_CMD} ${NICE_CMD} $i ${COLLECT_NAME_DIR} ${EXTRA_DIR} ${hostname}
fi
local stop=$(date ${DATE_FORMAT})
duration=$(get_time_delta "${start}" "${stop}")
echo "${stop}: ${duration} Plugin $i" >> ${COLLECT_CMD_TIMING_LOG}
# Add delay between parts ; i.e. collect plugins
sleep ${COLLECT_RUNPARTS_DELAY}
fi
done
fi
@@ -151,178 +160,140 @@ function collect_extra()
LOGFILE="${EXTRA_DIR}/process.info"
echo "${hostname}: Process Info ......: ${LOGFILE}"
delimiter ${LOGFILE} "ps -e -H -o ..."
${PROCESS_DETAIL_CMD} >> ${LOGFILE}
run_command "${IONICE_CMD} ${NICE_CMD} ${PROCESS_DETAIL_CMD}" "${LOGFILE}"
sleep ${COLLECT_RUNEXTRA_DELAY}
# Collect process and thread info (tree view)
delimiter ${LOGFILE} "pstree --arguments --ascii --long --show-pids"
pstree --arguments --ascii --long --show-pids >> ${LOGFILE}
run_command "${IONICE_CMD} ${NICE_CMD} pstree --arguments --ascii --long --show-pids" "${LOGFILE}"
sleep ${COLLECT_RUNEXTRA_DELAY}
# Collect process, thread and scheduling info (worker subfunction only)
# (also gets process 'affinity' which is useful on workers;
which ps-sched.sh >/dev/null 2>&1
if [ $? -eq 0 ]; then
delimiter ${LOGFILE} "ps-sched.sh"
ps-sched.sh >> ${LOGFILE}
run_command "${IONICE_CMD} ${NICE_CMD} ps-sched.sh" "${LOGFILE}"
sleep ${COLLECT_RUNEXTRA_DELAY}
fi
# Collect process, thread and scheduling, and elapsed time
# This has everything that ps-sched.sh does, except for cpu affinity mask,
# adds: stime,etime,time,wchan,tty).
delimiter ${LOGFILE} "ps -eL -o pid,lwp,ppid,state,class,nice,rtprio,priority,psr,stime,etime,time,wchan:16,tty,comm,command"
ps -eL -o pid,lwp,ppid,state,class,nice,rtprio,priority,psr,stime,etime,time,wchan:16,tty,comm,command >> ${LOGFILE}
# Collect per kubernetes container name, QoS, and cpusets per numa node
delimiter ${LOGFILE} "kube-cpusets"
kube-cpusets >> ${LOGFILE}
run_command "${IONICE_CMD} ${NICE_CMD} kube-cpusets" "${LOGFILE}"
# For debugging pods and cgroups, etc
run_command "sudo LANG=POSIX systemd-cgls cpu -k -l" "${LOGFILE}"
# Various host attributes
LOGFILE="${EXTRA_DIR}/host.info"
echo "${hostname}: Host Info .........: ${LOGFILE}"
# CGCS build info
delimiter ${LOGFILE} "${BUILD_INFO_CMD}"
${BUILD_INFO_CMD} >> ${LOGFILE}
# StarlingX build info
run_command "${BUILD_INFO_CMD}" "${LOGFILE}"
delimiter ${LOGFILE} "timedatectl"
timedatectl >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "timedatectl" "${LOGFILE}"
delimiter ${LOGFILE} "uptime"
uptime >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "uptime" "${LOGFILE}"
delimiter ${LOGFILE} "cat /proc/cmdline"
cat /proc/cmdline >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "cat /proc/cmdline" "${LOGFILE}"
delimiter ${LOGFILE} "cat /proc/version"
cat /proc/version >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "cat /proc/version" "${LOGFILE}"
delimiter ${LOGFILE} "lscpu"
lscpu >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "lscpu" "${LOGFILE}"
delimiter ${LOGFILE} "lscpu -e"
lscpu -e >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "lscpu -e" "${LOGFILE}"
delimiter ${LOGFILE} "cat /proc/cpuinfo"
cat /proc/cpuinfo >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "cat /proc/cpuinfo" "${LOGFILE}"
delimiter ${LOGFILE} "cat /sys/devices/system/cpu/isolated"
cat /sys/devices/system/cpu/isolated >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "cat /sys/devices/system/cpu/isolated" "${LOGFILE}"
delimiter ${LOGFILE} "ip addr show"
ip addr show >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "ip addr show" "${LOGFILE}"
delimiter ${LOGFILE} "lspci -nn"
lspci -nn >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "lspci -nn" "${LOGFILE}"
delimiter ${LOGFILE} "find /sys/kernel/iommu_groups/ -type l"
find /sys/kernel/iommu_groups/ -type l >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
sleep ${COLLECT_RUNEXTRA_DELAY}
run_command "find /sys/kernel/iommu_groups/ -type l" "${LOGFILE}"
# networking totals
delimiter ${LOGFILE} "cat /proc/net/dev"
cat /proc/net/dev >> ${LOGFILE}
run_command "cat /proc/net/dev" "${LOGFILE}"
delimiter ${LOGFILE} "dmidecode"
dmidecode >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "dmidecode" "${LOGFILE}"
# summary of scheduler tunable settings
delimiter ${LOGFILE} "cat /proc/sched_debug | head -15"
cat /proc/sched_debug | head -15 >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "cat /proc/sched_debug | head -15" "${LOGFILE}"
if [ "${SKIP_MASK}" = "true" ]; then
delimiter ${LOGFILE} "facter (excluding ssh info)"
facter | grep -iv '^ssh' >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "facter | grep -iv '^ssh'" "${LOGFILE}"
else
delimiter ${LOGFILE} "facter"
facter >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "facter" "${LOGFILE}"
fi
if [[ "$nodetype" == "worker" || "$subfunction" == *"worker"* ]] ; then
delimiter ${LOGFILE} "topology"
topology >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "topology" "${LOGFILE}"
fi
# CPU C-state power info
delimiter ${LOGFILE} "cpupower monitor"
cpupower monitor >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "${IONICE_CMD} ${NICE_CMD} cpupower monitor" "${LOGFILE}"
LOGFILE="${EXTRA_DIR}/memory.info"
echo "${hostname}: Memory Info .......: ${LOGFILE}"
delimiter ${LOGFILE} "cat /proc/meminfo"
cat /proc/meminfo >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "cat /proc/meminfo" "${LOGFILE}"
delimiter ${LOGFILE} "cat /sys/devices/system/node/node?/meminfo"
cat /sys/devices/system/node/node?/meminfo >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "cat /sys/devices/system/node/node?/meminfo" "${LOGFILE}"
delimiter ${LOGFILE} "cat /proc/slabinfo"
log_slabinfo ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "log_slabinfo" "${LOGFILE}"
delimiter ${LOGFILE} "ps -e -o ppid,pid,nlwp,rss:10,vsz:10,cmd --sort=-rss"
ps -e -o ppid,pid,nlwp,rss:10,vsz:10,cmd --sort=-rss >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "${IONICE_CMD} ${NICE_CMD} ps -e -o ppid,pid,nlwp,rss:10,vsz:10,cmd --sort=-rss" "${LOGFILE}"
# list open files
delimiter ${LOGFILE} "lsof -lwX"
lsof -lwX >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "timeout 60 ${IONICE_CMD} ${NICE_CMD} lsof -lwX" "${LOGFILE}"
# hugepages numa mapping
delimiter ${LOGFILE} "grep huge /proc/*/numa_maps"
grep -e " huge " /proc/*/numa_maps >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
# rootfs and tmpfs usage
delimiter ${LOGFILE} "df -h -H -T --local -t rootfs -t tmpfs"
df -h -H -T --local -t rootfs -t tmpfs >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
LOGFILE="${EXTRA_DIR}/filesystem.info"
echo "${hostname}: Filesystem Info ...: ${LOGFILE}"
# disk inodes usage
delimiter ${LOGFILE} "df -h -H -T --local -t rootfs -t tmpfs"
df -h -H -T --local -t rootfs -t tmpfs >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
# rootfs and tmpfs usage
run_command "df -h -H -T --local -t rootfs -t tmpfs" "${LOGFILE}"
# disk space usage
delimiter ${LOGFILE} "df -h -H -T --local -t ext2 -t ext3 -t ext4 -t xfs --total"
df -h -H -T --local -t ext2 -t ext3 -t ext4 -t xfs --total >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "df -h -H -T --local -t ext2 -t ext3 -t ext4 -t xfs --total" "${LOGFILE}"
# disk inodes usage
delimiter ${LOGFILE} "df -h -H -T --local -i -t ext2 -t ext3 -t ext4 -t xfs --total"
df -h -H -T --local -i -t ext2 -t ext3 -t ext4 -t xfs --total >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "df -h -H -T --local -i -t ext2 -t ext3 -t ext4 -t xfs --total" "${LOGFILE}"
sleep ${COLLECT_RUNEXTRA_DELAY}
# disks by-path values
delimiter ${LOGFILE} "ls -lR /dev/disk"
ls -lR /dev/disk >> ${LOGFILE}
run_command "ls -lR /dev/disk" "${LOGFILE}"
# disk summary (requires sudo/root)
delimiter ${LOGFILE} "fdisk -l"
fdisk -l >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "fdisk -l" "${LOGFILE}"
delimiter ${LOGFILE} "cat /proc/scsi/scsi"
cat /proc/scsi/scsi >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "cat /proc/scsi/scsi" "${LOGFILE}"
sleep ${COLLECT_RUNEXTRA_DELAY}
# Controller specific stuff
if [ "$nodetype" = "controller" ] ; then
run_command "cat /proc/drbd" "${LOGFILE}"
delimiter ${LOGFILE} "cat /proc/drbd"
cat /proc/drbd >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
delimiter ${LOGFILE} "/sbin/drbdadm dump"
/sbin/drbdadm dump >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "${IONICE_CMD} ${NICE_CMD} /sbin/drbdadm dump" "${LOGFILE}"
fi
# LVM summary
delimiter ${LOGFILE} "/usr/sbin/vgs --version ; /usr/sbin/pvs --version ; /usr/sbin/lvs --version"
/usr/sbin/vgs --version >> ${LOGFILE}
/usr/sbin/pvs --version >> ${LOGFILE}
/usr/sbin/lvs --version >> ${LOGFILE}
run_command "/usr/sbin/vgs --version" "${LOGFILE}"
run_command "/usr/sbin/pvs --version" "${LOGFILE}"
run_command "/usr/sbin/lvs --version" "${LOGFILE}"
delimiter ${LOGFILE} "/usr/sbin/vgs --all --options all"
/usr/sbin/vgs --all --options all >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "${IONICE_CMD} ${NICE_CMD} /usr/sbin/vgs --all --options all" "${LOGFILE}"
delimiter ${LOGFILE} "/usr/sbin/pvs --all --options all"
/usr/sbin/pvs --all --options all >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "${IONICE_CMD} ${NICE_CMD} /usr/sbin/pvs --all --options all" "${LOGFILE}"
delimiter ${LOGFILE} "/usr/sbin/lvs --all --options all"
/usr/sbin/lvs --all --options all >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "${IONICE_CMD} ${NICE_CMD} /usr/sbin/lvs --all --options all" "${LOGFILE}"
# iSCSI Information
LOGFILE="${EXTRA_DIR}/iscsi.info"
@@ -330,12 +301,10 @@ function collect_extra()
if [ "$nodetype" = "controller" ] ; then
# Controller- LIO exported initiators summary
delimiter ${LOGFILE} "targetcli ls"
targetcli ls >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "${IONICE_CMD} ${NICE_CMD} targetcli ls" "${LOGFILE}"
# Controller - LIO sessions
delimiter ${LOGFILE} "targetcli sessions detail"
targetcli sessions detail >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "${IONICE_CMD} ${NICE_CMD} targetcli sessions detail" "${LOGFILE}"
elif [[ "$nodetype" == "worker" || "$subfunction" == *"worker"* ]] ; then
# Worker - iSCSI initiator information
@@ -345,56 +314,64 @@ function collect_extra()
find ${collect_dir} -type d -exec chmod 750 {} \;
# Worker - iSCSI initiator active sessions
delimiter ${LOGFILE} "iscsiadm -m session"
iscsiadm -m session >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "iscsiadm -m session" "${LOGFILE}"
# Worker - iSCSI udev created nodes
delimiter ${LOGFILE} "ls -la /dev/disk/by-path | grep \"iqn\""
ls -la /dev/disk/by-path | grep "iqn" >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
fi
sleep ${COLLECT_RUNEXTRA_DELAY}
LOGFILE="${EXTRA_DIR}/history.info"
echo "${hostname}: Bash History ......: ${LOGFILE}"
# history
delimiter ${LOGFILE} "cat /home/sysadmin/.bash_history"
cat /home/sysadmin/.bash_history >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "cat /home/sysadmin/.bash_history" "${LOGFILE}"
LOGFILE="${EXTRA_DIR}/interrupt.info"
echo "${hostname}: Interrupt Info ....: ${LOGFILE}"
# interrupts
delimiter ${LOGFILE} "cat /proc/interrupts"
cat /proc/interrupts >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "${IONICE_CMD} ${NICE_CMD} cat /proc/interrupts" "${LOGFILE}"
sleep ${COLLECT_RUNEXTRA_DELAY}
delimiter ${LOGFILE} "cat /proc/softirqs"
cat /proc/softirqs >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "${IONICE_CMD} ${NICE_CMD} cat /proc/softirqs" "${LOGFILE}"
# Controller specific stuff
if [ "$nodetype" = "controller" ] ; then
netstat -pan > ${EXTRA_DIR}/netstat.info
LOGFILE="${EXTRA_DIR}/netstat.info"
run_command "${IONICE_CMD} ${NICE_CMD} netstat -pan" "${LOGFILE}"
fi
LOGFILE="${EXTRA_DIR}/blockdev.info"
echo "${hostname}: Block Devices Info : ${LOGFILE}"
# Collect block devices - show all sda and cinder devices, and size
delimiter ${LOGFILE} "lsblk"
lsblk >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "lsblk" "${LOGFILE}"
# Collect block device topology - show devices and which io-scheduler
delimiter ${LOGFILE} "lsblk --topology"
lsblk --topology >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "${IONICE_CMD} ${NICE_CMD} lsblk --topology" "${LOGFILE}"
# Collect SCSI devices - show devices and cinder attaches, etc
delimiter ${LOGFILE} "lsblk --scsi"
lsblk --scsi >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
run_command "${IONICE_CMD} ${NICE_CMD} lsblk --scsi" "${LOGFILE}"
}
log_space "before collect ......:"
collect_extra_start=$(date ${DATE_FORMAT})
echo "${collect_extra_start}: collect_extra start" >> ${COLLECT_CMD_TIMING_LOG}
collect_extra
collect_extra_stop=$(date ${DATE_FORMAT})
duration=$(get_time_delta "${collect_extra_start}" "${collect_extra_stop}")
echo "${collect_extra_stop}: ${duration} collect_extra done" >> ${COLLECT_CMD_TIMING_LOG}
collect_parts_start=$(date ${DATE_FORMAT})
echo "$(date ${DATE_FORMAT}): collect_parts start" >> ${COLLECT_CMD_TIMING_LOG}
collect_parts
collect_parts_stop=$(date ${DATE_FORMAT})
duration=$(get_time_delta "${collect_parts_start}" "${collect_parts_stop}")
echo "$(date ${DATE_FORMAT}): ${duration} collect_parts done" >> ${COLLECT_CMD_TIMING_LOG}
#
# handle collect collect-after and collect-range and then
@@ -403,22 +380,30 @@ collect_parts
VAR_LOG="/var/log"
rm -f ${VAR_LOG_INCLUDE_LIST}
# Dated collect defaults to 1 month of logs.
# Consider adding a check for how long the system has been provisioned
# and avoid running dated collect, which causes a CPU spike that can
# take up to 10 seconds or more even on newly provisioned systems.
echo "$(date ${DATE_FORMAT}): Date range start" >> ${COLLECT_CMD_TIMING_LOG}
if [ "${STARTDATE_RANGE}" == true ] ; then
if [ "${ENDDATE_RANGE}" == false ] ; then
ilog "collecting $VAR_LOG files containing logs after ${STARTDATE}"
${COLLECT_DATE} ${STARTDATE} ${ENDDATE} ${VAR_LOG_INCLUDE_LIST} ${DEBUG} ""
${IONICE_CMD} ${NICE_CMD} ${COLLECT_DATE} ${STARTDATE} ${ENDDATE} ${VAR_LOG_INCLUDE_LIST} ${DEBUG} ""
else
ilog "collecting $VAR_LOG files containing logs between ${STARTDATE} and ${ENDDATE}"
${COLLECT_DATE} ${STARTDATE} ${ENDDATE} ${VAR_LOG_INCLUDE_LIST} ${DEBUG} ""
${IONICE_CMD} ${NICE_CMD} ${COLLECT_DATE} ${STARTDATE} ${ENDDATE} ${VAR_LOG_INCLUDE_LIST} ${DEBUG} ""
fi
elif [ "${ENDDATE_RANGE}" == true ] ; then
STARTDATE="20130101"
ilog "collecting $VAR_LOG files containing logs before ${ENDDATE}"
${COLLECT_DATE} ${STARTDATE} ${ENDDATE} ${VAR_LOG_INCLUDE_LIST} ${DEBUG} ""
${IONICE_CMD} ${NICE_CMD} ${COLLECT_DATE} ${STARTDATE} ${ENDDATE} ${VAR_LOG_INCLUDE_LIST} ${DEBUG} ""
else
ilog "collecting all of $VAR_LOG"
find $VAR_LOG ! -empty > ${VAR_LOG_INCLUDE_LIST}
${IONICE_CMD} ${NICE_CMD} find $VAR_LOG ! -empty > ${VAR_LOG_INCLUDE_LIST}
fi
echo "$(date ${DATE_FORMAT}): Date range stop" >> ${COLLECT_CMD_TIMING_LOG}
sleep ${COLLECT_RUNEXTRA_DELAY}
# collect the www lighttpd logs if they exists and are not empty.
# note: The lighttpd logs don't have the date in the right place.
@@ -453,13 +438,17 @@ for i in /var/lib/ceph/data/rook-ceph/log/*; do
fi
done
sleep ${COLLECT_RUNEXTRA_DELAY}
echo "$(date +'%H:%M:%S.%3N'): Running host tars" >> ${COLLECT_CMD_TIMING_LOG}
log_space "before first tar ....:"
(cd ${COLLECT_NAME_DIR} ; ${IONICE_CMD} ${NICE_CMD} ${TAR_CMD} ${COLLECT_NAME_DIR}/${COLLECT_NAME}.tar -T ${VAR_LOG_INCLUDE_LIST} -X ${RUN_EXCLUDE} -X ${ETC_EXCLUDE} -X ${VAR_LOG_EXCLUDE} ${COLLECT_INCLUDE} 2>>${COLLECT_ERROR_LOG} 1>>${COLLECT_ERROR_LOG} )
(cd ${COLLECT_NAME_DIR} ; ${IONICE_CMD} ${NICE_CMD} ${TAR_CMD} ${COLLECT_NAME_DIR}/${COLLECT_NAME}.tar -T ${VAR_LOG_INCLUDE_LIST} -X ${RUN_EXCLUDE} -X ${ETC_EXCLUDE} -X ${VAR_LOG_EXCLUDE} ${COLLECT_INCLUDE} ${CHECKPOINT_CMD} 2>>${COLLECT_ERROR_LOG} 1>>${COLLECT_ERROR_LOG} )
log_space "after first tar .....:"
(cd ${COLLECT_NAME_DIR} ; ${IONICE_CMD} ${NICE_CMD} ${UNTAR_CMD} ${COLLECT_NAME_DIR}/${COLLECT_NAME}.tar 2>>${COLLECT_ERROR_LOG} 1>>${COLLECT_ERROR_LOG} )
(cd ${COLLECT_NAME_DIR} ; ${IONICE_CMD} ${NICE_CMD} ${UNTAR_CMD} ${COLLECT_NAME_DIR}/${COLLECT_NAME}.tar ${CHECKPOINT_CMD} 2>>${COLLECT_ERROR_LOG} 1>>${COLLECT_ERROR_LOG} )
log_space "after first untar ...:"
@@ -482,24 +471,23 @@ if [ "${OMIT_CERTS}" != "true" ]; then
log_space "after certificates ..:"
fi
(cd ${COLLECT_BASE_DIR} ; ${IONICE_CMD} ${NICE_CMD} ${TAR_ZIP_CMD} ${COLLECT_NAME_DIR}.tgz ${COLLECT_NAME} 2>/dev/null 1>/dev/null )
log_space "after first tarball .:"
mkdir -p ${COLLECT_NAME_DIR}/${FLIGHT_RECORDER_PATH}
(cd /${FLIGHT_RECORDER_PATH} ; ${TAR_ZIP_CMD} ${COLLECT_NAME_DIR}/${FLIGHT_RECORDER_PATH}/${FLIGHT_RECORDER_FILE}.tgz ./${FLIGHT_RECORDER_FILE} 2>>${COLLECT_ERROR_LOG} 1>>${COLLECT_ERROR_LOG})
(cd /${FLIGHT_RECORDER_PATH} ; ${TAR_ZIP_CMD} ${COLLECT_NAME_DIR}/${FLIGHT_RECORDER_PATH}/${FLIGHT_RECORDER_FILE}.tgz ./${FLIGHT_RECORDER_FILE} ${CHECKPOINT_CMD} 2>>${COLLECT_ERROR_LOG} 1>>${COLLECT_ERROR_LOG})
# save the collect.log file to this host's tarball
cp -a ${COLLECT_ERROR_LOG} ${COLLECT_NAME_DIR}/${COLLECT_LOG}
cp -a ${COLLECT_CMD_TIMING_LOG} "${COLLECT_NAME_DIR}/${HOST_COLLECT_CMD_TIMING_LOG}"
log_space "with flight data ....:"
(cd ${COLLECT_BASE_DIR} ; ${IONICE_CMD} ${NICE_CMD} ${TAR_ZIP_CMD} ${COLLECT_NAME_DIR}.tgz ${COLLECT_NAME} 2>>${COLLECT_ERROR_LOG} 1>>${COLLECT_ERROR_LOG} )
(cd ${COLLECT_BASE_DIR} ; ${IONICE_CMD} ${NICE_CMD} ${TAR_ZIP_CMD} ${COLLECT_NAME_DIR}.tgz ${COLLECT_NAME} ${CHECKPOINT_CMD} 2>>${COLLECT_ERROR_LOG} 1>>${COLLECT_ERROR_LOG} )
log_space "after collect .......:"
echo "$(date +'%H:%M:%S.%3N'): Finished host tars" >> ${COLLECT_CMD_TIMING_LOG}
rm -rf ${COLLECT_NAME_DIR}
rm -f ${VAR_LOG_INCLUDE_LIST}
@@ -525,4 +513,4 @@ else
fi
dlog "collect_host exit code: ${rc}"
exit ${rc}
exit ${rc}


@@ -1,6 +1,6 @@
#! /bin/bash
#
# Copyright (c) 2020 Wind River Systems, Inc.
# Copyright (c) 2020,2024 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
@@ -52,10 +52,14 @@ SELECT
FROM information_schema.TABLES
ORDER BY table_schema, TABLE_NAME' >> ${LOGFILE}
sleep ${COLLECT_RUNCMD_DELAY}
# MariaDB dump all databases
delimiter ${LOGFILE} "Dumping MariaDB databases: ${DB_DIR}"
mkdir -p ${DB_DIR}
(cd ${DB_DIR}; mariadb-cli --dump --exclude keystone,ceilometer)
sleep ${COLLECT_RUNCMD_DELAY}
fi
exit 0


@@ -1,6 +1,6 @@
#! /bin/bash
#
# Copyright (c) 2013-2014 Wind River Systems, Inc.
# Copyright (c) 2013-2014,2024 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
@@ -42,17 +42,24 @@ CMD="ip6tables-save"
delimiter ${LOGFILE} "${CMD}"
${CMD} > ${extradir}/ip6tables.dump 2>>${COLLECT_ERROR_LOG}
sleep ${COLLECT_RUNCMD_DELAY}
###############################################################################
# Only Worker
###############################################################################
if [[ "$nodetype" = "worker" || "$subfunction" == *"worker"* ]] ; then
NAMESPACES=($(ip netns))
# Create a list of network namespaces; exclude the (id: #) part
# Example: cni-56e3136b-2503-fe5f-652f-0998248c1405 (id: 0)
NAMESPACES=()
for ns in $(ip netns list | awk '{print $1}'); do
NAMESPACES+=("$ns")
done
for NS in ${NAMESPACES[@]}; do
delimiter ${LOGFILE} "${NS}"
delimiter "${LOGFILE}" "Network Namespace: ${NS}"
for CMD in "${CMDS[@]}" ; do
ip netns exec ${NS} ${CMD}
run_command "ip netns exec ${NS} ${CMD}" "${LOGFILE}"
done
done >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
sleep ${COLLECT_RUNCMD_DELAY}
done
fi
exit 0


@@ -1,6 +1,6 @@
#! /bin/bash
#
# Copyright (c) 2013-2019 Wind River Systems, Inc.
# Copyright (c) 2013-2019,2024 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
@@ -90,10 +90,20 @@ function openstack_commands {
# nova commands
CMDS+=("nova service-list")
DELAY_THROTTLE=4
delay_count=0
for CMD in "${CMDS[@]}" ; do
delimiter ${LOGFILE} "${CMD}"
eval ${CMD} 2>>${COLLECT_ERROR_LOG} >>${LOGFILE}
echo >>${LOGFILE}
if [ ! -z ${COLLECT_RUNCMD_DELAY} ] ; then
((delay_count = delay_count + 1))
if [ ${delay_count} -ge ${DELAY_THROTTLE} ] ; then
sleep ${COLLECT_RUNCMD_DELAY}
delay_count=0
fi
fi
done
}
@@ -130,6 +140,8 @@ if [ "$nodetype" = "controller" ] ; then
# host rabbitmq usage
rabbitmq_usage_stats
sleep ${COLLECT_RUNCMD_DELAY}
# Check for openstack label on this node
if ! is_openstack_node; then
exit 0
@@ -138,6 +150,8 @@ if [ "$nodetype" = "controller" ] ; then
# Run as subshell so we don't contaminate environment
(openstack_credentials; openstack_commands)
sleep ${COLLECT_RUNCMD_DELAY}
# TODO(jgauld): Should also get containerized rabbitmq usage,
# need wrapper script rabbitmq-cli
fi


@@ -1,6 +1,6 @@
#! /bin/bash
#
# Copyright (c) 2022 Wind River Systems, Inc.
# Copyright (c) 2022,2024 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
@@ -38,6 +38,7 @@ ostree log ${OSTREE_REF} --repo=${SYSROOT_REPO} >> ${LOGFILE} 2>>${COLLECT_ERRO
for feed_dir in ${FEED_OSTREE_BASE_DIR}/*/ostree_repo
do
sleep ${COLLECT_RUNCMD_DELAY}
delimiter ${LOGFILE} "ostree log ${OSTREE_REF} --repo=${feed_dir}"
ostree log ${OSTREE_REF} --repo=${feed_dir} >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
done
@@ -48,6 +49,7 @@ done
for feed_dir in ${FEED_OSTREE_BASE_DIR}/*/ostree_repo
do
sleep ${COLLECT_RUNCMD_DELAY}
delimiter ${LOGFILE} "ostree summary -v --repo=${feed_dir}"
ostree summary -v --repo=${feed_dir} >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}
done


@ -1,6 +1,6 @@
#! /bin/bash
#
-# Copyright (c) 2013-2014 Wind River Systems, Inc.
+# Copyright (c) 2013-2014,2024 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
@@ -20,7 +20,7 @@ echo "${hostname}: Service Management : ${LOGFILE}"
if [ "$nodetype" = "controller" ] ; then
kill -SIGUSR1 $(</var/run/sm.pid)
-sm-troubleshoot 2>>${COLLECT_ERROR_LOG} >> ${LOGFILE}
+run_command "sm-troubleshoot" "${LOGFILE}"
fi
exit 0


@@ -19,7 +19,6 @@ PLOTFILE="${extradir}/${SERVICE}-startup-plot.svg"
###############################################################################
echo "${hostname}: Systemd analyze .........: ${LOGFILE}"
-delimiter ${LOGFILE} "systemd-analyze plot > ${PLOTFILE}"
-timeout 30 systemd-analyze plot > ${PLOTFILE} 2>>${COLLECT_ERROR_LOG}
+run_command "timeout 10 systemd-analyze plot > ${PLOTFILE}" "${LOGFILE}"
exit 0


@ -1,6 +1,6 @@
#! /bin/bash
#
-# Copyright (c) 2023 Wind River Systems, Inc.
+# Copyright (c) 2023,2024 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
@@ -10,12 +10,12 @@
declare -i SCP_TIMEOUT_DEFAULT=600
declare -i SSH_TIMEOUT_DEFAULT=60
declare -i SUDO_TIMEOUT_DEFAULT=60
-declare -i COLLECT_HOST_TIMEOUT_DEFAULT=600
+declare -i COLLECT_HOST_TIMEOUT_DEFAULT=900
declare -i CREATE_TARBALL_TIMEOUT_DEFAULT=200
declare -i TIMEOUT_MIN_MINS=10
declare -i TIMEOUT_MAX_MINS=120
-declare -i TIMEOUT_DEF_MINS=20
+declare -i TIMEOUT_DEF_MINS=30
# shellcheck disable=SC2034
declare -i TIMEOUT_MIN_SECS=$((TIMEOUT_MIN_MINS*60))
# shellcheck disable=SC2034
@@ -25,3 +25,29 @@ declare -i TIMEOUT_DEF_SECS=$((TIMEOUT_DEF_MINS*60)) # 20 minutes
# overall collect timeout
declare -i TIMEOUT=${TIMEOUT_DEF_SECS}
# sleep delays for specific operations outside of run_command
# do not remove these labels ; set a value to 0 for no delay
COLLECT_RUNPARTS_DELAY=0.25 # delay time after running each collect plugin
COLLECT_RUNEXTRA_DELAY=0.20 # inline delay time for collect host extra commands
COLLECT_RUNCMD_DELAY=0.20 # inline delay for commands not using run_command
# collect run_command adaptive delay controls
COLLECT_RUN_COMMAND_ADAPTIVE_DELAY=true
# collect adaptive collect handling delays and time
# thresholds for collect commands that use run_command
# when COLLECT_RUN_COMMAND_ADAPTIVE_DELAY is true
# sleep <tsize>_DELAY if run_command took
# greater than or equal to <tsize>_THRESHOLD
COLLECT_RUNCMD_XLARGE_THRESHOLD=2 # secs command took to run
COLLECT_RUNCMD_XLARGE_DELAY=0.75 # sleep value
COLLECT_RUNCMD_LARGE_THRESHOLD=1 # secs command took to run
COLLECT_RUNCMD_LARGE_DELAY=0.5 # sleep value
COLLECT_RUNCMD_MEDIUM_THRESHOLD=100 # milliseconds command took to run
COLLECT_RUNCMD_MEDIUM_DELAY=0.2 # sleep value
COLLECT_RUNCMD_SMALL_THRESHOLD=50 # milliseconds command took to run
COLLECT_RUNCMD_SMALL_DELAY=0.1 # sleep value
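Assuming a duration string of the form `secs.msecs` (as `run_command` produces), the thresholds above map to delays as sketched below; `adaptive_delay` is an illustrative helper, not part of collect:

```shell
# Map a measured command duration to a post-execution sleep value using the
# thresholds defined above. adaptive_delay is a hypothetical helper.
COLLECT_RUNCMD_XLARGE_THRESHOLD=2   ; COLLECT_RUNCMD_XLARGE_DELAY=0.75
COLLECT_RUNCMD_LARGE_THRESHOLD=1    ; COLLECT_RUNCMD_LARGE_DELAY=0.5
COLLECT_RUNCMD_MEDIUM_THRESHOLD=100 ; COLLECT_RUNCMD_MEDIUM_DELAY=0.2
COLLECT_RUNCMD_SMALL_THRESHOLD=50   ; COLLECT_RUNCMD_SMALL_DELAY=0.1
function adaptive_delay () {
    local secs=${1%%.*}
    local msec=$((10#${1#*.}))    # force base 10; tolerate leading zeros
    if [ ${secs} -ge ${COLLECT_RUNCMD_XLARGE_THRESHOLD} ] ; then
        echo ${COLLECT_RUNCMD_XLARGE_DELAY}
    elif [ ${secs} -ge ${COLLECT_RUNCMD_LARGE_THRESHOLD} ] ; then
        echo ${COLLECT_RUNCMD_LARGE_DELAY}
    elif [ ${msec} -ge ${COLLECT_RUNCMD_MEDIUM_THRESHOLD} ] ; then
        echo ${COLLECT_RUNCMD_MEDIUM_DELAY}
    elif [ ${msec} -ge ${COLLECT_RUNCMD_SMALL_THRESHOLD} ] ; then
        echo ${COLLECT_RUNCMD_SMALL_DELAY}
    else
        echo 0
    fi
}
adaptive_delay "3.120"   # 0.75
adaptive_delay "0.060"   # 0.1
```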


@@ -20,18 +20,15 @@ USM_DIR="/opt/software"
function collect_usm {
RELEASES=$(software list | tail -n +4 | awk '{ print $2; }')
-delimiter ${LOGFILE} "software deploy show"
-software deploy show 2>>${COLLECT_ERROR_LOG} >> ${LOGFILE}
+run_command "software deploy show" "${LOGFILE}"
-delimiter ${LOGFILE} "software deploy host-list"
-software deploy host-list 2>>${COLLECT_ERROR_LOG} >> ${LOGFILE}
+run_command "software deploy host-list" "${LOGFILE}"
-delimiter ${LOGFILE} "software list"
-software list 2>>${COLLECT_ERROR_LOG} >> ${LOGFILE}
+run_command "software list" "${LOGFILE}"
for release in ${RELEASES}; do
-delimiter ${LOGFILE} "software show --packages ${release}"
-software show --packages ${release} 2>>${COLLECT_ERROR_LOG} >> ${LOGFILE}
+run_command "software show --packages ${release}" "${LOGFILE}"
+sleep ${COLLECT_RUNCMD_DELAY}
done
}
@@ -40,8 +37,8 @@ function collect_usm {
###############################################################################
function collect_feed {
for feed in /var/www/pages/feed/*; do
-delimiter ${LOGFILE} "ls -lhR --ignore __pycache__ --ignore ostree_repo ${feed}"
-ls -lhR --ignore __pycache__ --ignore ostree_repo ${feed} >> ${LOGFILE}
+run_command "ls -lhR --ignore __pycache__ --ignore ostree_repo ${feed}" "${LOGFILE}"
+sleep ${COLLECT_RUNCMD_DELAY}
done
}
@@ -58,10 +55,12 @@ if [ "$nodetype" = "controller" ] ; then
collect_feed
# copy /opt/software to extra dir
-rsync -a /opt/software ${extradir}
+run_command "rsync -a /opt/software ${extradir}" "${LOGFILE}"
+sleep ${COLLECT_RUNCMD_DELAY}
# copy /var/www/pages/feed to extra dir, excluding large and temp directories
-rsync -a --exclude __pycache__ --exclude ostree_repo --exclude pxeboot /var/www/pages/feed ${extradir}
+run_command "rsync -a --exclude __pycache__ --exclude ostree_repo --exclude pxeboot /var/www/pages/feed ${extradir}" "${LOGFILE}"
+sleep ${COLLECT_RUNCMD_DELAY}
fi


@@ -7,6 +7,10 @@
##########################################################################################
source /etc/collect/collect_timeouts
hostname="${HOSTNAME}"
DEBUG=false
redirect="/dev/null"
@@ -142,7 +146,8 @@ declare -i COLLECT_BASE_DIR_FULL_THRESHOLD=2147484 # 2Gib in K blocks rounded up
COLLECT_LOG=collect.log
COLLECT_ERROR_LOG=/tmp/$(whoami)_collect_error.log
HOST_COLLECT_ERROR_LOG="/tmp/$(whoami)_host_collect_error.log"
COLLECT_CMD_TIMING_LOG="/tmp/$(whoami)_collect_cmd_timing.log"
HOST_COLLECT_CMD_TIMING_LOG="collect_cmd_timing.log"
DCROLE_SYSTEMCONTROLLER="systemcontroller"
DCROLE_SUBCLOUD="subcloud"
@@ -194,21 +199,28 @@ cmd_done_file="/usr/local/sbin/expect_done"
EXPECT_LOG_FILE="/tmp/collect_expect"
# Compression Commands
-TAR_ZIP_CMD="tar -cvzf"
-TAR_UZIP_CMD="tar -xvzf"
-TAR_CMD="tar -cvhf"
-TAR_CMD_APPEND="tar -rvhf"
-UNTAR_CMD="tar -xvf"
+TAR_ZIP_CMD="tar -czf"
+TAR_UZIP_CMD="tar -xzf"
+TAR_CMD="tar -chf"
+TAR_CMD_APPEND="tar -rhf"
+UNTAR_CMD="tar -xf"
ZIP_CMD="gzip"
NICE_CMD="/usr/bin/nice -n19"
IONICE_CMD="/usr/bin/ionice -c2 -n7"
COLLECT_TAG="COLLECT"
# Checkpoint definitions
# Default is 512 bytes per block
# Example 10000*512 = 5MBytes
CHECKPOINT_BLOCKS=10000
CHECKPOINT_CMD="--checkpoint=${CHECKPOINT_BLOCKS} --checkpoint-action=exec=/usr/local/sbin/collect_checkpoint"
STARTDATE_OPTION="--start-date"
ENDDATE_OPTION="--end-date"
DATE_FORMAT="+%H:%M:%S.%3N"
-PROCESS_DETAIL_CMD="ps -e -H -o ruser,tid,pid,ppid,flags,stat,policy,rtprio,nice,priority,rss:10,vsz:10,sz:10,psr,stime,tty,cputime,wchan:14,cmd"
+PROCESS_DETAIL_CMD="ps -e -H --forest -o ruser,tid,pid,ppid,flags,stat,policy,rtprio,nice,priority,rss:10,vsz:10,sz:10,psr,stime,etime,cputime,wchan:14,tty,cmd"
BUILD_INFO_CMD="cat /etc/build.info"
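A self-contained sketch of the tar checkpoint throttling enabled by `CHECKPOINT_CMD`. The real handler, `/usr/local/sbin/collect_checkpoint`, is not shown in this change, so GNU tar's built-in `sleep` checkpoint action stands in; the checkpoint interval here is chosen small enough that the sample fires at least once:

```shell
# Throttle an archive operation: tar pauses briefly at each checkpoint,
# smoothing CPU/disk load instead of running flat out. GNU tar is assumed.
workdir=$(mktemp -d)
dd if=/dev/zero of="${workdir}/payload" bs=1M count=6 status=none
# fire a checkpoint every 500 archive records and sleep 1 second there;
# upstream uses --checkpoint-action=exec=/usr/local/sbin/collect_checkpoint
tar --checkpoint=500 --checkpoint-action=sleep=1 \
    -czf "${workdir}/bundle.tgz" -C "${workdir}" payload
listing=$(tar -tzf "${workdir}/bundle.tgz")
echo "${listing}"
rm -rf "${workdir}"
```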
################################################################################
@@ -254,14 +266,103 @@ function dlog()
function delimiter()
{
echo "--------------------------------------------------------------------" >> ${1} 2>>${COLLECT_ERROR_LOG}
-echo "`date` : ${myhostname} : ${2}" >> ${1} 2>>${COLLECT_ERROR_LOG}
+echo "`date` : ${hostname} : ${2}" >> ${1} 2>>${COLLECT_ERROR_LOG}
echo "--------------------------------------------------------------------" >> ${1} 2>>${COLLECT_ERROR_LOG}
}
function get_time_delta () {
start_epoch=$(date -d "1970-01-01 $1" +%s%3N)
stop_epoch=$(date -d "1970-01-01 $2" +%s%3N)
delta=$((${stop_epoch}-${start_epoch}))
secs=$((delta / 1000))
msecs=$((delta % 1000))
# zero-pad milliseconds so a 5 ms fraction prints as ".005", not ".5"
printf "%d.%03d\n" ${secs} ${msecs}
}
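`get_time_delta` turns two `date +%H:%M:%S.%3N` stamps into an elapsed-seconds string with millisecond resolution. A usage sketch, with the function restated (and its milliseconds zero-padded) so the example runs stand-alone:

```shell
# Elapsed time between two "HH:MM:SS.mmm" stamps (GNU date assumed).
function get_time_delta () {
    local start_epoch=$(date -d "1970-01-01 $1" +%s%3N)
    local stop_epoch=$(date -d "1970-01-01 $2" +%s%3N)
    local delta=$((stop_epoch-start_epoch))
    # %03d keeps sub-100ms fractions exact, e.g. 55 ms -> "0.055"
    printf "%d.%03d\n" $((delta / 1000)) $((delta % 1000))
}
get_time_delta "10:15:02.100" "10:15:03.350"   # 1.250
get_time_delta "10:15:02.100" "10:15:02.155"   # 0.055
```

Pinning both stamps to the same date makes the timezone offset cancel, so only the wall-clock delta remains.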
###############################################################################
#
# Name: run_command
#
# Purpose: Run the specified command and direct the output of
# that command to the specified log.
#
# Assumptions: Requires 2 and only 2 arguments
#
# Arguments: $1 - string command to execute
# $2 - string path/name of file to direct command output to
#
# Warning: Command is not executed unless there are only 2 arguments
# supplied. This check helps identify code errors in command
# execution and output redirection. Error is logged to the error
# log as well as the execution timing summary log.
#
###############################################################################
function run_command () {
if [ "$#" -ne 2 ]; then
echo "Error: run_command requires 2 arguments only ; saw $# Argument(s): '$@'" >> ${COLLECT_CMD_TIMING_LOG}
return 1
fi
local cmd="${1}"
local log="${2}"
local start=$(date ${DATE_FORMAT})
delimiter "${log}" "${cmd}"
${cmd} >> ${log} 2>>${COLLECT_ERROR_LOG}
rc=$?
local stop=$(date ${DATE_FORMAT})
duration=$(get_time_delta "${start}" "${stop}")
# perform a short sleep based on how long this command took
# if COLLECT_RUN_COMMAND_ADAPTIVE_DELAY handling is true
# Example:
# sleep 0.75 if command took 2 seconds or longer
# sleep 0.50 if command took 1 second or longer
# sleep 0.2 if command took 100 milliseconds or longer
# sleep 0.1 if command took 50 milliseconds or longer
sleep_time=0
sleep_time_str="0"
if [ "${COLLECT_RUN_COMMAND_ADAPTIVE_DELAY}" = true ] ; then
secs=${duration%%.*}
if [ ${secs} -ge ${COLLECT_RUNCMD_XLARGE_THRESHOLD} ] ; then
sleep_time=${COLLECT_RUNCMD_XLARGE_DELAY}
sleep_time_str="${COLLECT_RUNCMD_XLARGE_DELAY}"
elif [ ${secs} -ge ${COLLECT_RUNCMD_LARGE_THRESHOLD} ] ; then
sleep_time=${COLLECT_RUNCMD_LARGE_DELAY}
sleep_time_str="${COLLECT_RUNCMD_LARGE_DELAY}"
else
msec=${duration#*.}
if [ ${msec} -ge ${COLLECT_RUNCMD_MEDIUM_THRESHOLD} ] ; then
sleep_time=${COLLECT_RUNCMD_MEDIUM_DELAY}
sleep_time_str="${COLLECT_RUNCMD_MEDIUM_DELAY}"
elif [ ${msec} -ge ${COLLECT_RUNCMD_SMALL_THRESHOLD} ] ; then
sleep_time=${COLLECT_RUNCMD_SMALL_DELAY}
sleep_time_str="${COLLECT_RUNCMD_SMALL_DELAY}"
fi
fi
fi
if [ "${sleep_time}" != "0" ] ; then
sleep ${sleep_time}
fi
echo "$(date ${DATE_FORMAT}): ${duration}->${sleep_time_str} - ${log} - ${cmd}" >> ${COLLECT_CMD_TIMING_LOG}
return ${rc}
}
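The plugins above replace the old `delimiter` + command + redirect triple with a single `run_command` call. A minimal stand-alone sketch of that calling pattern; the stubs omit collect's timing, adaptive delay, and error-count handling:

```shell
# Stub environment so the pattern runs outside of collect.
LOGFILE=$(mktemp)
COLLECT_ERROR_LOG=$(mktemp)
function delimiter () { echo "--- ${2} ---" >> "${1}" ; }
function run_command () {
    local cmd="${1}"
    local log="${2}"
    delimiter "${log}" "${cmd}"          # header line naming the command
    ${cmd} >> "${log}" 2>>"${COLLECT_ERROR_LOG}"
    return $?
}
# one call replaces "delimiter + raw command + output redirection"
run_command "echo hello" "${LOGFILE}"
cat "${LOGFILE}"
```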
function log_slabinfo()
{
PAGE_SIZE=$(getconf PAGE_SIZE)
-cat /proc/slabinfo | awk -v page_size_B=${PAGE_SIZE} '
+${IONICE_CMD} ${NICE_CMD} cat /proc/slabinfo | awk -v page_size_B=${PAGE_SIZE} '
BEGIN {page_KiB = page_size_B/1024; TOT_KiB = 0;}
(NF == 17) {
gsub(/[<>]/, "");


@@ -27,6 +27,7 @@ override_dh_auto_install:
install -m 755 -p collect_utils $(ROOT)/usr/local/sbin/collect_utils
install -m 755 -p collect_parms $(ROOT)/usr/local/sbin/collect_parms
install -m 755 -p collect_timeouts $(SYSCONFDIR)/collect/collect_timeouts
install -m 755 -p collect_checkpoint $(ROOT)/usr/local/sbin/collect_checkpoint
install -m 755 -p collect_mask_passwords $(ROOT)/usr/local/sbin/collect_mask_passwords
install -m 755 -p collect_certificates $(ROOT)/usr/local/sbin/collect_certificates
install -m 755 -p expect_done $(ROOT)/usr/local/sbin/expect_done