[Linux] -1- Common Commands for Troubleshooting CPU Problems

uptime

[root@iZ8lgm9icspkthZ ~]# uptime
 08:00:23 up 206 days, 23:17,  1 user,  load average: 0.07, 0.06, 0.08

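08:00:23: the current time

up 206 days, 23:17: how long the system has been running

1 user: the number of users currently logged in

load average: 0.07 (1-minute load average), 0.06 (5-minute load average), 0.08 (15-minute load average)

Load average: the average number of processes per unit of time that are in a runnable or uninterruptible state, that is, the average number of active processes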
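Comparing the 1-minute and 15-minute values gives a rough sense of direction: a 1-minute value above the 15-minute value suggests the load is rising, and the reverse suggests it is falling. A minimal sketch of that comparison follows; the function name `load_trend` is hypothetical, and the sample values are the ones from the uptime output above.

```shell
# load_trend ONE_MIN FIFTEEN_MIN
# Hypothetical helper: compares two load averages and names the trend.
load_trend() {
    awk -v one="$1" -v fifteen="$2" 'BEGIN {
        # "+ 0" forces a numeric comparison
        if (one + 0 > fifteen + 0)      print "rising"
        else if (one + 0 < fifteen + 0) print "falling"
        else                            print "steady"
    }'
}

# 1-minute and 15-minute values from the uptime output above
load_trend 0.07 0.08
```

With those two values the 1-minute figure is the smaller one, so the sketch reports the load as falling.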

Below is the uptime entry as shown by the man command:

[root@iZ8lgm9icspkthZ ~]# man uptime

SYNOPSIS
       uptime [options]

DESCRIPTION
       uptime  gives  a one line display of the following information.  The current time, how long the system has been running, how many users
       are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.

       This is the same information contained in the header line displayed by w(1).

       System load averages is the average number of processes that are either in a  runnable  or  uninterruptable  state.   A  process  in  a
       runnable  state  is either using the CPU or waiting to use the CPU.  A process in uninterruptable state is waiting for some I/O access,
       eg waiting for disk.  The averages are taken over the three time intervals.  Load averages are not normalized for the number of CPUs in
       a  system,  so a load average of 1 means a single CPU system is loaded all the time while on a 4 CPU system it means it was idle 75% of
       the time.

OPTIONS
       -p, --pretty
              show uptime in pretty format

       -h, --help
              display this help text

       -s, --since
              system up since, in yyyy-mm-dd HH:MM:SS format

       -V, --version
              display version information and exit

FILES
       /var/run/utmp
              information about who is currently logged on

       /proc  process information

AUTHORS
       uptime was written by Larry Greenfield ⟨greenfie@gauss.rutgers.edu⟩ and Michael K. Johnson ⟨johnsonm@sunsite.unc.edu⟩

SEE ALSO
       ps(1), top(1), utmp(5), w(1)

REPORTING BUGS
       Please send bug reports to ⟨procps@freelists.org⟩

procps-ng                                                        December 2012                                                       UPTIME(1)
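Because load averages are not normalized for the number of CPUs, it can help to divide the value by the CPU count before judging it. The following is a small sketch of that arithmetic, not part of uptime itself; the function name `load_per_cpu` is hypothetical, and the live reading assumes a Linux system with /proc/loadavg and nproc available.

```shell
# load_per_cpu LOAD CPUS
# Hypothetical helper: divides a load average by the CPU count.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# The man page example: a load average of 1 on a 4-CPU system
load_per_cpu 1 4

# Live values, only when /proc/loadavg and nproc exist (Linux)
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    # The first field of /proc/loadavg is the 1-minute load average
    read -r one _ < /proc/loadavg
    load_per_cpu "$one" "$(nproc)"
fi
```

For the man page example this gives 0.25, which matches its statement that a 4-CPU system with a load of 1 was idle 75% of the time.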

stress: a Linux system stress-testing tool

CPU problem analysis

Command to simulate 100% CPU usage on one core:

stress --cpu 1 --timeout 600

Command to watch the load:

watch -d uptime

Command to watch CPU usage change:

mpstat -P ALL 5

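-P ALL reports every CPU; 5 prints a report every 5 seconds

If the load average rises, CPU usage rises, and iowait is 0 or very small, the load is caused by rising CPU usage, so analyze what is driving CPU usage up

If the load average rises, CPU usage rises, and iowait also rises sharply, the load is probably caused by rising I/O, so analyze I/O usage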
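The two rules above can be written down as a small decision function. This is only a sketch: the function name `classify_load` and the thresholds (80% busy, 20% iowait) are assumptions chosen for illustration, not values from mpstat or from this article, and the percentages are expected to be read off the mpstat output by hand.

```shell
# classify_load CPU_BUSY_PCT IOWAIT_PCT
# Hypothetical helper. Thresholds are illustrative assumptions:
#   iowait >= 20  -> treat the load as I/O driven
#   busy   >= 80  -> treat the load as CPU driven
classify_load() {
    awk -v busy="$1" -v iowait="$2" 'BEGIN {
        if (iowait + 0 >= 20)    print "io-bound: analyze I/O usage"
        else if (busy + 0 >= 80) print "cpu-bound: analyze CPU usage"
        else                     print "inconclusive"
    }'
}

classify_load 99.5 0.2    # high CPU usage, negligible iowait
classify_load 45.0 60.0   # iowait has risen sharply
```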

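Analyzing rising CPU usage:

pidstat -u 5 1

Prints one report of per-process CPU statistics sampled over 5 seconds, from which the offending process can be identified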
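When the report is long, the busiest process can be picked out with a short filter. The sketch below assumes the report has a header row containing the column names `PID` and `%CPU` and that the command name is the last field; it looks the columns up by name because their positions differ between sysstat versions. The function name `top_cpu_process` is hypothetical.

```shell
# top_cpu_process: reads pidstat -u style output on stdin and prints
# "PID COMMAND %CPU" for the row with the highest %CPU.
# Hypothetical helper; columns are located by header name.
top_cpu_process() {
    awk '
        /%CPU/ {
            # Header row: remember where the PID and %CPU columns are
            for (i = 1; i <= NF; i++) {
                if ($i == "PID")  pid = i
                if ($i == "%CPU") cpu = i
            }
            next
        }
        cpu && NF >= cpu && $cpu + 0 > max {
            max  = $cpu + 0
            best = $pid " " $NF " " $cpu
        }
        END { if (best != "") print best }
    '
}
```

A possible invocation is `pidstat -u 5 1 | top_cpu_process`.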

I/O problem analysis

Simulate an I/O problem by calling sync continuously:

stress -i 1 --timeout 600

Check the load:

watch -d uptime

Check CPU usage:

mpstat -P ALL 5 1

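If the load average rises, CPU usage rises, and iowait is 0 or very small, the load is caused by rising CPU usage, so analyze what is driving CPU usage up

If the load average rises, CPU usage rises, and iowait also rises sharply, the load is probably caused by rising I/O, so analyze I/O usage

Analyzing the rise in I/O:

pidstat -u 5 1

Find the process whose usage has risen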

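Context switches fall into three kinds:

process context switches
thread context switches
interrupt context switches

The vmstat command

Analyzes memory usage, CPU context switches, and interrupt counts

For example: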

# vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  0      0 4991428 189240 1574904    0    0     6    16  131  218  8  3 88  0  0
 3  0      0 4992688 189240 1574912    0    0     0    22 2024 3951  6  4 90  0  0
 1  0      0 4992436 189248 1574908    0    0     0    25 1719 3418  5  3 92  0  0
 0  0      0 4990672 189252 1574916    0    0     0    34 1976 3933  7  3 90  0  0

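cs (context switch): context switches per second

in (interrupt): interrupts per second

r (running or runnable): length of the run queue, that is, the number of processes running or waiting for a CPU

b (blocked): number of processes in uninterruptible sleep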
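To follow only these four columns, the vmstat output can be filtered by header name. The sketch below is one way to do it; the function name `vmstat_switches` is hypothetical, and the sample input is the first two data rows of the vmstat output shown above.

```shell
# vmstat_switches: reads vmstat output on stdin and prints only the
# in, cs, r and b columns. Hypothetical helper; the in and cs columns
# are located by name from the header row.
vmstat_switches() {
    awk '
        $1 == "r" {
            for (i = 1; i <= NF; i++) {
                if ($i == "in") in_col = i
                if ($i == "cs") cs_col = i
            }
            next
        }
        in_col && $1 ~ /^[0-9]+$/ {
            print "in=" $in_col, "cs=" $cs_col, "r=" $1, "b=" $2
        }
    '
}

# Sample rows copied from the vmstat output above
vmstat_switches <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  0      0 4991428 189240 1574904    0    0     6    16  131  218  8  3 88  0  0
 3  0      0 4992688 189240 1574912    0    0     0    22 2024 3951  6  4 90  0  0
EOF
# prints:
# in=131 cs=218 r=6 b=0
# in=2024 cs=3951 r=3 b=0
```

Against a live system the same filter could be used as `vmstat 5 | vmstat_switches`.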

Command to view context switches per process:

# pidstat -w 5
Linux 5.10.104-linuxkit (481b468ee053) 	06/27/22 	_x86_64_	(2 CPU)


00:21:24      UID       PID   cswch/s nvcswch/s  Command
00:21:29        0        24      0.00    298.60  stress
00:21:29        0        25      0.20      0.00  pidstat




00:21:29      UID       PID   cswch/s nvcswch/s  Command
00:21:34        0        24      0.00     59.60  stress
00:21:34        0        25      0.20      0.00  pidstat

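cswch/s: voluntary context switches per second. These occur when a process cannot obtain a resource it needs and gives up the CPU; this is normal behavior

nvcswch/s: involuntary context switches per second. These occur when the system forcibly reschedules a process, for example because its time slice has expired, as happens when many processes compete for the CPU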
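The kernel also exposes cumulative totals of both kinds of switch for each process in /proc/PID/status, in the fields `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches`. The sketch below extracts them from a status file; the function name `ctxt_switches` is hypothetical, and unlike pidstat it shows totals since the process started rather than per-second rates.

```shell
# ctxt_switches STATUS_FILE
# Hypothetical helper: prints the voluntary and nonvoluntary context
# switch totals found in a /proc/PID/status style file.
ctxt_switches() {
    awk '
        /^(non)?voluntary_ctxt_switches:/ {
            name = $1
            sub(/:$/, "", name)   # drop the trailing colon
            print name "=" $2
        }
    ' "$1"
}

# Totals for the current shell, only where /proc is available (Linux)
if [ -r "/proc/$$/status" ]; then
    ctxt_switches "/proc/$$/status"
fi
```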

Run 10 threads continuously for 5 minutes:

sysbench --threads=10 --max-time=300 threads run

# sysbench --threads=10 --max-time=300 threads run
WARNING: --max-time is deprecated, use --time instead
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 10
Initializing random number generator from current time


Initializing worker threads...

Threads started!

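Print one set of data every second:

vmstat 1

Print one set of per-process data every second:

pidstat -w -u 1

Print one set of per-process and per-thread data every second:

pidstat -wt 1

Watch interrupt data: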

watch -d cat /proc/interrupts

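An interrupt is the mechanism the system uses to respond to requests from hardware devices. It interrupts the normal scheduling and execution of processes and then calls an interrupt handler in the kernel to service the device's request.

Linux splits interrupt handling into two stages, the top half and the bottom half:

The top half handles the hardware request directly. This is what is usually called a hard interrupt, and it executes quickly, for example acknowledging the network card
The bottom half is triggered by the kernel. This is what is usually called a soft interrupt (softirq), and its execution is deferred, for example processing the data passed up from the network card

Querying interrupt information

/proc/softirqs shows softirq activity;

/proc/interrupts shows hard interrupt activity

As shown below: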

# tail /proc/softirqs
          HI:          0          0
       TIMER:     112173     109993
      NET_TX:        122        128
      NET_RX:     260343     260278
       BLOCK:      36680      36453
    IRQ_POLL:          0          0
     TASKLET:          2          5
       SCHED:     199036     206830
     HRTIMER:          0          0
         RCU:     287404     285279
# 
# 
# tail /proc/interrupts
IWI:          0         30   IRQ work interrupts
RTR:          0          0   APIC ICR read retries
RES:      25477      26021   Rescheduling interrupts
CAL:     650516     606681   Function call interrupts
TLB:      74973      73394   TLB shootdowns
ERR:          0
MIS:          0
PIN:          0          0   Posted-interrupt notification event
NPI:          0          0   Nested posted-interrupt event
PIW:          0          0   Posted-interrupt wakeup event
#

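TIMER (timer interrupts),

NET_RX (network receive),

SCHED (kernel scheduling),

RCU (RCU locks)

These four softirqs normally fire often, but the one whose count changes fastest is likely the problematic one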
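Since the counters in /proc/softirqs are cumulative, finding the fastest-changing softirq means comparing two snapshots. The sketch below sums each row across CPUs, subtracts the earlier snapshot from the later one, and sorts by the difference. The function name `softirq_delta` is hypothetical, and the one-second gap between snapshots is an arbitrary choice.

```shell
# softirq_delta BEFORE_FILE AFTER_FILE
# Hypothetical helper: prints "NAME DELTA" for every softirq type,
# summed over all CPUs, largest change first.
softirq_delta() {
    awk '
        FNR == 1 { next }   # skip the CPU0 CPU1 ... header row
        {
            name = $1
            sub(/:$/, "", name)
            total = 0
            for (i = 2; i <= NF; i++) total += $i
            if (NR == FNR) before[name] = total   # first file
            else printf "%s %d\n", name, total - before[name]
        }
    ' "$1" "$2" | sort -k2,2nr
}

# Take two snapshots one second apart, only where /proc/softirqs exists
if [ -r /proc/softirqs ]; then
    a=$(mktemp); b=$(mktemp)
    cat /proc/softirqs > "$a"
    sleep 1
    cat /proc/softirqs > "$b"
    softirq_delta "$a" "$b"
    rm -f "$a" "$b"
fi
```

The first line of the output is the softirq type whose count grew the most during the interval.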

