rcu_nocbs=1-11 does not work, kernel crash on some reboots
Dear Steven,
I am Prankur Chauhan, a software engineer at Panasonic in Weiterstadt, Germany. We develop video switchers and are currently using the Liquorix Debian kernel:

Linux 13 5.17.0-12.1-liquorix-amd64 #1 ZEN SMP PREEMPT liquorix 5.17-15 (2022-05-30) x86_64 GNU/Linux

We have disabled RCU callbacks on all cores except core 0 from the kernel command line:

BOOT_IMAGE=/vmlinuz root=/dev/disk/by-label/rootfsA rw rootdelay=3 ro hpet=enable isolcpus=1-11 hugepages=17827 clocksource=tsc rcu_nocbs=1-11 rcu_nocb_poll nohz_full=1-11 intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll mce=ignore_ce nosoftlockup audit=0 mitigations=off net.ifnames=0 biosdevname=0 nomodeset quiet console=tty2

And yet, in the htop output, I see the rcuop threads running seemingly at random on some of the cores that are supposed to be covered by rcu_nocbs. Somewhere in the changelog I saw that you disabled rcu_nocbs?

Unfortunately, during some reboot cycles I see errors that are somehow related to RCU:

Aug 31 12:17:21 AT-KC100-30-15 kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
Aug 31 12:17:21 AT-KC100-30-15 kernel: rcu: 7-....: (24241 ticks this GP) idle=833/1/0x4000000000000000 softirq=1042693/1042693 fqs=14225
Aug 31 12:17:21 AT-KC100-30-15 kernel: (t=60000 jiffies g=789405 q=88593)
Aug 31 12:17:21 AT-KC100-30-15 kernel: NMI backtrace for cpu 7
Aug 31 12:17:21 AT-KC100-30-15 kernel: CPU: 7 PID: 11878 Comm: SDI:Rx1 Tainted: P S OE 5.17.0-12.1-liquorix-amd64 #1 liquorix 5.17-15.1~bullseye
Aug 31 12:17:21 AT-KC100-30-15 kernel: Hardware name: Supermicro Super Server/X11SPG-TF, BIOS 3.4 12/14/2020
Aug 31 12:17:21 AT-KC100-30-15 kernel: Call Trace:
Aug 31 12:17:21 AT-KC100-30-15 kernel: <IRQ>
Aug 31 12:17:21 AT-KC100-30-15 kernel: dump_stack_lvl+0x48/0x5e
Aug 31 12:17:21 AT-KC100-30-15 kernel: nmi_cpu_backtrace.cold+0x32/0x7a
Aug 31 12:17:21 AT-KC100-30-15 kernel: ? lapic_can_unplug_cpu+0x80/0x80
Aug 31 12:17:21 AT-KC100-30-15 kernel: nmi_trigger_cpumask_backtrace+0xc6/0xf0
Aug 31 12:17:21 AT-KC100-30-15 kernel: trigger_single_cpu_backtrace+0x24/0x27
Aug 31 12:17:21 AT-KC100-30-15 kernel: rcu_dump_cpu_stacks+0xa7/0xe0
Aug 31 12:17:21 AT-KC100-30-15 kernel: rcu_sched_clock_irq.cold+0x208/0x543
Aug 31 12:17:21 AT-KC100-30-15 kernel: ? sched_clock_cpu+0x9/0xb0
Aug 31 12:17:21 AT-KC100-30-15 kernel: ? update_rq_clock+0x2c/0x1d0
Aug 31 12:17:21 AT-KC100-30-15 kernel: update_process_times+0x8c/0xc0
Aug 31 12:17:21 AT-KC100-30-15 kernel: tick_sched_timer+0x88/0xa0
Aug 31 12:17:21 AT-KC100-30-15 kernel: ? tick_sched_do_timer+0x90/0x90
Aug 31 12:17:21 AT-KC100-30-15 kernel: __hrtimer_run_queues+0x127/0x2c0
Aug 31 12:17:21 AT-KC100-30-15 kernel: hrtimer_interrupt+0xfc/0x210
Aug 31 12:17:21 AT-KC100-30-15 kernel: __sysvec_apic_timer_interrupt+0x59/0x100
Aug 31 12:17:21 AT-KC100-30-15 kernel: sysvec_apic_timer_interrupt+0x6d/0x90
Aug 31 12:17:21 AT-KC100-30-15 kernel: </IRQ>
Aug 31 12:17:21 AT-KC100-30-15 kernel: <TASK>

My question is: did you really disable the rcu_nocbs option, so that even the kernel cmdline has no effect? And if so, I wonder why. For comparison, on our previous Debian 9 kernel (the standard kernel, NOT Liquorix) we had a tickless kernel and were therefore forced to offload all RCU callbacks to core 0.

I also want to let you know that we have isolated all CPU cores except core 0 from the scheduler and are using real-time threads (policy SCHED_FIFO).

Cheers
Prankur
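One way to cross-check the htop observation is to compare the CPU lists on the command line with the CPUs each rcuo kthread is actually allowed to run on. Keep in mind that the number in a name such as rcuop/7 identifies the CPU whose callbacks the thread handles, not the CPU the thread runs on, so seeing rcuop threads at all does not by itself mean rcu_nocbs was ignored. The following is a minimal Python sketch under stated assumptions: the helper names are hypothetical, only plain lists such as "1-11" or "0,2-4" are parsed (no stride syntax, no isolcpus flags, no rcu_nocbs=all), and it has to run as root on the affected machine to see every thread.

```python
import os


def parse_cpu_list(spec):
    """Expand a simple kernel CPU list such as "0,2-4" into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cmdline_cpu_param(cmdline, name):
    """Return the CPU set passed as name=<list> on a kernel command line."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == name:
            return parse_cpu_list(value)
    return set()  # parameter absent or given without a value


def rcuo_thread_affinities():
    """Map each rcuo* kthread name to the set of CPUs it may run on."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as f:
                comm = f.read().strip()
            if comm.startswith("rcuo"):
                result[comm] = os.sched_getaffinity(int(entry))
        except OSError:
            continue  # thread exited or is not accessible
    return result


def threads_allowed_on(affinities, cpus):
    """Names of threads whose affinity mask overlaps the given CPU set."""
    return sorted(name for name, mask in affinities.items() if mask & cpus)
```

On the machine itself, something like `threads_allowed_on(rcuo_thread_affinities(), cmdline_cpu_param(open("/proc/cmdline").read(), "isolcpus"))` would list the rcuo threads that are still permitted on isolated cores. If that list is non-empty, pinning those kthreads to core 0 (for example with taskset) may be worth trying.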
Hi Prankur, thanks for using Liquorix for your hardware, hope it's working out well.
Have you tried a newer kernel to confirm whether you are hitting a bug in your specific version? Also note that kernel 5.19 introduces realtime RCU kthreads, and rcupdate.rcu_expedited=1 is passed through cmdline by default, so the behavior of your system may improve (or not) if you adopt the latest 5.19 kernel.

:: Quote ::
My question is do you really disabled the rcu_nocbs option so even the kernel cmdline is not having any effect ?
And I wonder why you did that, in comparison our previous Debian 9 kernel (standard kernel NOT liquorix) we had a tickless kernel and therefore were forced to offload all the rcus to core 0 ?

I don't entirely understand your question. CONFIG_RCU_NOCB_CPU=y is enabled in Liquorix and has been for quite some time. If configuring no-callback CPUs through the kernel cmdline doesn't work, that could be an issue with the code in that version of Linux.
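To confirm on the running system that CONFIG_RCU_NOCB_CPU really is set, the kernel config can be inspected directly. The following is a small Python sketch, assuming the config is available as /boot/config-<release> (the usual Debian location) or as /proc/config.gz (only present with CONFIG_IKCONFIG_PROC); the function names are hypothetical.

```python
import gzip
import os


def config_option(config_text, name):
    """Return the value of a config option ("y", "m", ...) or None if unset."""
    prefix = name + "="
    for line in config_text.splitlines():
        line = line.strip()
        # Unset options appear as "# CONFIG_X is not set" and do not match.
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


def read_kernel_config():
    """Read the running kernel's config text, or return None if not found."""
    boot_path = "/boot/config-" + os.uname().release
    if os.path.exists(boot_path):
        with open(boot_path) as f:
            return f.read()
    if os.path.exists("/proc/config.gz"):
        with gzip.open("/proc/config.gz", "rt") as f:
            return f.read()
    return None
```

Calling `config_option(read_kernel_config(), "CONFIG_RCU_NOCB_CPU")` on the affected machine should return "y" if the option is built in; a None result would point at the kernel build rather than at the command line.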
Dear Steven,
Thanks for the reply. Liquorix is generally better than our previous kernel; in terms of timing it is more precise. The only problem we are running into is that during some reboot cycles the machine won't boot up completely. Sometimes there is a kernel crash (as stated in my previous message), sometimes the NIC (Mellanox) has issues during boot (mlx5_core : wait_func: MODIFY_CQ canceled on out of queue timeout), and sometimes a NETDEV watchdog is triggered on the ixgbe (1G Ethernet) card.

I checked, and the rcuop threads already have a FIFO priority of 1 on the Liquorix 5.17 kernel. Maybe this is interfering with the ksoftirqd threads that serve the NIC and the SDI card? (Our SDI card generates a lot of interrupts; we distribute them across some cores and also manually raise the ksoftirqd priority on those cores to FIFO 87.)

We will for sure try the latest 5.19 kernel in the next release, but this close to the release we don't want to take any risks. I cannot say why, but my feeling is that this has something to do with some cores deadlocking in kernel mode during boot. Any ideas that would help us understand the boot problems are welcome. By the way, the boot problem happens only once or twice in 30 boots.
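Since the question is whether FIFO 1 rcuop threads and FIFO 87 ksoftirqd threads get in each other's way, it may help to dump the real-time policy and priority of both groups side by side. On a single CPU, a runnable SCHED_FIFO thread always preempts a lower-priority one, so the ordering matters wherever the two can share a core. This is a hedged Python sketch with hypothetical helper names; the policy numbers are the Linux values and the script is only a diagnostic aid, not a statement about the cause of the stall.

```python
import os

# Linux scheduling policy numbers as used by sched_getscheduler().
POLICY_NAMES = {0: "SCHED_OTHER", 1: "SCHED_FIFO", 2: "SCHED_RR"}


def thread_sched_info(prefixes=("rcuo", "ksoftirqd")):
    """List (name, policy, rt_priority) for threads whose name starts with a prefix."""
    rows = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        pid = int(entry)
        try:
            with open("/proc/%d/comm" % pid) as f:
                comm = f.read().strip()
            if not comm.startswith(tuple(prefixes)):
                continue
            policy = os.sched_getscheduler(pid)
            prio = os.sched_getparam(pid).sched_priority
        except OSError:
            continue  # thread exited or is not accessible
        rows.append((comm, POLICY_NAMES.get(policy, str(policy)), prio))
    return rows


def by_priority(rows):
    """Sort rows from highest to lowest real-time priority, then by name."""
    return sorted(rows, key=lambda r: (-r[2], r[0]))
```

Printing `by_priority(thread_sched_info())` shortly after boot, and again once the application has raised the ksoftirqd priorities, would show whether any rcuo thread ends up below a FIFO 87 thread that is allowed on the same core.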
Just for information:
:: Code ::
root@AT-KC100-30-15:/usr/local/bin# inxi -bxx
System:
  Host: AT-KC100-30-15 Kernel: 5.17.0-12.1-liquorix-amd64 arch: x86_64 bits: 64 compiler: gcc v: 10.2.1
  Console: pty pts/14 Distro: Debian GNU/Linux 11 (bullseye)
Machine:
  Type: Server System: Supermicro product: Super Server v: 0123456789 serial: 0123456789
  Chassis: type: 17 v: 0123456789 serial: 0123456789
  Mobo: Supermicro model: X11SPG-TF v: 1.01 serial: ZM191S015608
  UEFI-[Legacy]: American Megatrends v: 3.4 date: 12/14/2020
CPU:
  Info: 12-core Intel Xeon Silver 4214 [MCP] arch: Cascade Lake speed (MHz): avg: 2018
Graphics:
  Device-1: ASPEED Graphics Family vendor: Super Micro driver: N/A bus-ID: 03:00.0 chip-ID: 1a03:2000
  Device-2: NVIDIA TU104GL [Quadro RTX 4000] driver: nvidia v: 515.43.04 arch: Lovelace pcie: speed: 8 GT/s lanes: 16 bus-ID: 65:00.0 chip-ID: 10de:1eb1
  Display: server: No display server data found. Headless machine? tty: 253x67
  Message: GL data unavailable in console for root.
Network:
  Device-1: Intel Ethernet 10G X550T vendor: Super Micro driver: ixgbe v: kernel pcie: speed: 8 GT/s lanes: 4 port: N/A bus-ID: 01:00.0 chip-ID: 8086:1563
  Device-2: Intel Ethernet 10G X550T vendor: Super Micro driver: ixgbe v: kernel pcie: speed: 8 GT/s lanes: 4 port: N/A bus-ID: 01:00.1 chip-ID: 8086:1563
  Device-3: Mellanox MT27800 Family [ConnectX-5] driver: mlx5_core v: 5.6-1.0.3 pcie: speed: 8 GT/s lanes: 16 port: N/A bus-ID: b3:00.0 chip-ID: 15b3:1017
  Device-4: Mellanox MT27800 Family [ConnectX-5] driver: mlx5_core v: 5.6-1.0.3 pcie: speed: 8 GT/s lanes: 16 port: N/A bus-ID: b3:00.1 chip-ID: 15b3:1017
Drives:
  Local Storage: total: 119.24 GiB used: 50.96 GiB (42.7%)
Info:
  Processes: 286 Uptime: 2d 18h 13m Memory: 46.15 GiB used: 43.62 GiB (94.5%) Init: systemd v: 247 target: multi-user (3) default: multi-user
  Compilers: gcc: N/A Packages: pm: dpkg pkgs: 417 Shell: Bash v: 5.1.4 running-in: pty pts/14 inxi: 3.3.21