High CPU on a Catalyst switch running IOS

This is the troubleshooting process you can follow to solve high CPU problems in your network. The root cause is always something different, but the steps are mostly the same.
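Before digging in, it is worth confirming how long the CPU has actually been elevated. IOS keeps a built-in CPU history graph for exactly this; a quick first look might be (standard commands, output omitted here since it was not part of my notes):

C6500#show processes cpu history

C6500#show processes cpu sorted | exclude 0.00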

High CPU since market open…

C6500#show ver

System image file is "disk0:s72033-advipservicesk9_wan-mz.122-33.SXH.bin"

=-=

Below we see 77% total CPU, with 33% of that at interrupt level and roughly 40% from the IP Input process. IP Input handles process-switched packets, so a high value there means traffic is being punted to the CPU instead of being switched in hardware.

C6500#show proc cpu | exc 0.00

CPU utilization for five seconds: 77%/33%; one minute: 75%; five minutes: 77%

PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process

5    58144468   3708230      15679  0.87%  0.30%  0.29%   0 Check heaps

140     1850756  45849063         40  0.07%  0.11%  0.13%   0 CDP Protocol

146   106922548 674564539        158 40.09% 41.10% 41.95%   0 IP Input

168       15712     17711        887  0.23%  0.03%  0.22%   1 SSH Process

335   136091476 995454760        136  1.11%  0.51%  0.45%   0 Port manager per

374    18563992 190871473         97  0.15%  0.29%  0.28%   0 IGMP Input

376    11477444 194064823         59  0.15%  0.19%  0.18%   0 PIM Process

377      114620 192409350          0  0.15%  0.06%  0.06%   0 Mwheel Process

C6500#
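As an aside: if the interrupt percentage were the dominant component, the punted packets themselves would be the thing to capture. The Catalyst 6500 has a built-in, low-impact capture for packets hitting the RP CPU that can be used instead of (or alongside) the buffer dump shown below; a sketch of its use, not part of my original capture:

C6500#debug netdr capture rx

C6500#show netdr captured-packets

C6500#debug netdr clear-capture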

=-=

Next, we clear the counters and then look at the VLAN interfaces to see which ones have the most input queue drops. Vlan10 and Vlan200 seem to be getting hit the hardest.
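Clearing first matters because the drop counters are cumulative since the last reload or clear; only interfaces that keep incrementing afterward are interesting. The clear itself is just:

C6500#clear counters

Clear "show interface" counters on all interfaces [confirm]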

C6500#show int | inc is up|drop

Vlan10 is up, line protocol is up

Input queue: 16/75/416/416 (size/max/drops/flushes); Total output drops: 0

Vlan200 is up, line protocol is up

Input queue: 4/75/565/565 (size/max/drops/flushes); Total output drops: 0

Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0

Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0

Loopback0 is up, line protocol is up

Next, we dump the buffers to see what kind of traffic is hitting the input buffers of Vlan10 and Vlan200. We see that it is all multicast traffic.

C6500#show buffers input-interface vlan 10 packet | inc source:

source: 10.5.1.54, destination: 239.248.10.134, id: 0x0000, ttl: 15,

source: 10.5.1.54, destination: 239.248.10.134, id: 0x0000, ttl: 15,

source: 10.5.1.78, destination: 239.248.10.54, id: 0x0000, ttl: 15,

source: 10.5.1.78, destination: 239.248.10.55, id: 0x0000, ttl: 15,

source: 10.5.1.54, destination: 239.248.10.132, id: 0x0000, ttl: 15,

source: 10.5.1.78, destination: 239.248.10.55, id: 0x0000, ttl: 15,

=-=

C6500#show buffers input-interface vlan 200 packet | inc source:

source: 10.5.200.103, destination: 239.248.10.175, id: 0x0000, ttl: 15,

source: 10.5.200.108, destination: 239.248.10.145, id: 0x0000, ttl: 15,

source: 10.5.200.112, destination: 239.248.10.224, id: 0x0000, ttl: 15,

source: 10.5.200.103, destination: 239.248.10.175, id: 0x0000, ttl: 15,

source: 10.5.200.108, destination: 239.248.10.146, id: 0x0000, ttl: 15,

source: 10.5.200.113, destination: 239.248.10.94, id: 0x0000, ttl: 15,

=-=
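Before picking a group, it can help to see which streams are actually the busiest; the mroute counters give a rough per-group rate (standard commands, not in my original notes):

C6500#show ip mroute active

C6500#show ip mroute 239.248.10.134 count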

So, we focus on one multicast stream to see why it would be getting punted to the CPU for processing. We look at the mroute table and see many of the multicast routes in “Registering, Partial-SC”. This indicates that the designated router (DR) is trying to register the source with the rendezvous point (RP), but the process is not completing.

C6500#show ip mroute 239.248.10.134

IP Multicast Routing Table

Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,

L - Local, P - Pruned, R - RP-bit set, F - Register flag,

T - SPT-bit set, J - Join SPT, M - MSDP created entry,

X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,

U - URD, I - Received Source Specific Host Report,

Z - Multicast Tunnel, z - MDT-data group sender,

Y - Joined MDT-data group, y - Sending to MDT-data group

V - RD & Vector, v - Vector

Outgoing interface flags: H - Hardware switched, A - Assert winner

Timers: Uptime/Expires

Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.248.10.134), 02:06:47/stopped, RP 10.7.240.240, flags: SJCF

Incoming interface: Vlan6, RPF nbr 10.5.20.5, Partial-SC

Outgoing interface list:

Vlan10, Forward/Sparse, 01:44:18/00:02:48, H

(10.5.1.54, 239.248.10.134), 01:55:47/00:02:59, flags: PFT

Incoming interface: Vlan10, RPF nbr 0.0.0.0, Registering, Partial-SC

Outgoing interface list: Null

C6500#
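Since the Register messages are unicast from the DR to the RP, the natural next check is whether this switch can even reach 10.7.240.240. These commands were not in my notes, but would confirm it quickly:

C6500#show ip rpf 10.7.240.240

C6500#show ip route 10.7.240.240

C6500#ping 10.7.240.240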

=-=

So we look at the RP configuration and see several static RP statements.

C6500#show run | inc ip pim rp

ip pim rp-address 10.7.240.240   <-- may not be needed

ip pim rp-address 198.140.52.4 AAAA

ip pim rp-address 198.140.52.3 BBBB

ip pim rp-address 198.140.52.1 CCCC

ip pim rp-address 198.140.52.2 DDDD

ip pim rp-address 198.140.33.5 EEEE

ip pim rp-address 198.140.33.2 FFFF
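Which RP a given group maps to is controlled by those access lists; the 10.7.240.240 entry has no ACL, so it ends up as the RP for any group the named ACLs do not cover, including our 239.248.10.x streams. A sketch of how to confirm the mapping (not in the original notes):

C6500#show ip pim rp mapping

C6500#show ip pim rp 239.248.10.134

C6500#show ip access-lists AAAA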

=-=

We set up a temporary rate limiter for the Partial-SC packets hitting the CPU, allowing only 10 per second (non-intrusive). With the rate limiter in place, the CPU is now in the 10-20% range, which is in line with the 72-hour historical average. The customer will look into removing the invalid RP config.

C6500(config)#mls rate-limit multicast ipv4 partial 10

C6500(config)#do show proc cpu | exc 0.00

CPU utilization for five seconds: 13%/7%; one minute: 67%; five minutes: 71%

PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process

146   107426044 674852622        159  3.67% 36.28% 38.19%   0 IP Input

168       20684     22373        924  0.15%  0.14%  0.35%   1 SSH Process

335   136098248 995490856        136  1.43%  0.50%  0.51%   0 Port manager per

374    18567412 190884074         97  0.23%  0.26%  0.28%   0 IGMP Input

386    41108136 139317998        295  0.07%  0.13%  0.12%   0 SNMP ENGINE

C6500(config)#
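Two follow-ups worth doing here (a sketch, assuming the 10.7.240.240 entry really is stale): verify the limiter was programmed into hardware, then remove the bad static RP once that is confirmed.

C6500#show mls rate-limit

C6500(config)#no ip pim rp-address 10.7.240.240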

Hope this helps!


3 Responses to High CPU on a Catalyst switch running IOS


  1. Joost says:

    I don’t understand the “may not be needed” statement for the RP config. Why would you come to the conclusion to remove this static RP, and why would this resolve the high CPU load?

  2. Jeff says:

    Joost,

    Any time you see “Registering, Partial-SC” when you do “show ip mroute x.x.x.x”, a flag should go off. In general, the registration process from the DR to the RP should be very quick (like less-than-one-second quick). So, if you see it, it’s either coincidence, or stuck. The registration process is done via software switching, not hardware switching.

    In the above example, I don’t remember all of the details…these were notes that I took during the troubleshooting. Most likely the conversation was something like this:

    Me: Did you make any config changes recently?
    Answer: No, nothing that I can think of.

    Actually, changes had been made earlier, probably something like adding the above static RP statement to several different switches in the network on the assumption that they all needed it. So it was likely added without this switch actually having a route to the RP 10.7.240.240, causing the problem.

  3. Ulrik says:

    Thanks for the good information about this. We have this problem on interfaces facing Arris D5 Edge QAM units. The rate limit is solving the problem for now.

