[RESOLVED] Hard lockups on 5.4ish kernels
wildstar84
Status: Interested
Joined: 31 May 2017
Posts: 41
Location: Texas
I've tried two of the 5.4 kernels (linux-image-5.4.0-5.1-liquorix-amd64_5.4-3.1 and then, after waiting a couple of weeks hoping the problem would be resolved, linux-image-5.4.0-8.2-liquorix-amd64_5.4-9.1). Within 24 hours my machine has suddenly locked up hard (requiring a power cycle): twice upon entering suspend-to-RAM, once coming up locked on resume from s2ram (full screen displayed, no mouse/keyboard), and once just randomly freezing during normal mouse/keyboard/video activity. No issues with any previous kernels (up through linux-image-5.3.0-16.3-liquorix-amd64_5.3-14.1, which I'm back to using now). No errors in the logs that I can find.

:: Code ::
inxi -F:
$>inxi -F
System:
  Host: wildstar Kernel: 5.3.0-16.3-liquorix-amd64 x86_64 bits: 64
  Desktop: AfterStep 2.2.12
  Distro: antiX-16.2_x64-base Berta Cáceres 15 June 2017
Machine:
  Type: Laptop System: Hewlett-Packard product: HP EliteBook 8440p v: N/A
  serial: <root required>
  Mobo: Hewlett-Packard model: 172A v: KBC Version 30.31
  serial: <root required> BIOS: Hewlett-Packard v: 68CCU Ver. F.11
  date: 11/25/2010
Battery:
  ID-1: BAT0 charge: 51.7 Wh condition: 54.0/54.0 Wh (100%)
CPU:
  Topology: Dual Core model: Intel Core i5 M 520 bits: 64 type: MT MCP
  L2 cache: 3072 KiB
  Speed: 1222 MHz min/max: 1199/2400 MHz Core speeds (MHz): 1: 1617 2: 1914
  3: 1440 4: 1443
Graphics:
  Device-1: Intel Core Processor Integrated Graphics driver: i915 v: kernel
  Display: x11 server: X.Org 1.20.6 driver: intel resolution: 1920x1080~60Hz
  OpenGL: renderer: Mesa DRI Intel Ironlake Mobile v: 2.1 Mesa 19.2.6
Audio:
  Device-1: Intel 5 Series/3400 Series High Definition Audio
  driver: snd_hda_intel
  Sound Server: ALSA v: k5.3.0-16.3-liquorix-amd64
Network:
  Device-1: Intel 82577LM Gigabit Network driver: e1000e
  IF: eth0 state: up speed: 100 Mbps duplex: full mac: b4:99:ba:e2:cb:7c
  Device-2: Intel Centrino Advanced-N 6200 driver: N/A
Drives:
  Local Storage: total: 931.51 GiB used: 163.01 GiB (17.5%)
  ID-1: /dev/sda vendor: HGST (Hitachi) model: HTS721010A9E630
  size: 931.51 GiB
Partition:
  ID-1: / size: 31.25 GiB used: 13.18 GiB (42.2%) fs: ext4 dev: /dev/sda2
  ID-2: /home size: 842.06 GiB used: 138.94 GiB (16.5%) fs: ext4
  dev: /dev/sda7
  ID-3: /var size: 31.25 GiB used: 10.89 GiB (34.9%) fs: ext4 dev: /dev/sda5
  ID-4: swap-1 size: 9.00 GiB used: 0 KiB (0.0%) fs: swap dev: /dev/sda6
Sensors:
  System Temperatures: cpu: 47.0 C mobo: N/A
  Fan Speeds (RPM): N/A
Info:
  Processes: 144 Uptime: 20h 50m Memory: 7.57 GiB used: 1.27 GiB (16.8%)
  Shell: bash inxi: 3.0.36


Anyone else experienced this?
techAdmin
Status: Site Admin
Joined: 26 Sep 2003
Posts: 3987
Location: East Coast, West Coast? I know it's one of them.
While it's very unlikely to be the cause, run an extended RAM test on the laptop: a few iterations of memtest. That's a boot option in many distros. If not, you can get it as a live CD or a live USB flash image.

You would want to exclude corrupted ram before anything else. When you suspend to ram, everything is held in ram, so if it's got a glitch, that would of course expose it.

Also try suspend to disk just to see. That requires at least as much swap space as the RAM you have in use, which I see you already have, so you're set. If suspend to disk works but suspend to RAM crashes, it could well be corrupted RAM.
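
If you want to check the swap-versus-RAM rule of thumb above rather than eyeball it, something like this rough sketch would do. The function name swap_covers_used_ram is made up for illustration, and "used RAM" is assumed to mean MemTotal minus MemAvailable from /proc/meminfo, which is only an approximation of what hibernation actually needs to write.

```shell
# Sketch: compare free swap against RAM currently in use.
# Assumption: "used" = MemTotal - MemAvailable (values in kB).
# Reads /proc/meminfo by default; pass another file as $1 (handy for testing).
swap_covers_used_ram() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        /^SwapFree:/     { swap  = $2 }
        END {
            used = total - avail
            printf "used RAM: %d kB, free swap: %d kB\n", used, swap
            # Exit status 0 means free swap is at least as large as used RAM.
            if (swap >= used) exit 0
            exit 1
        }
    ' "${1:-/proc/meminfo}"
}
```

Run it as: swap_covers_used_ram && echo "swap looks big enough". It prints the two numbers either way and only the exit status changes.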

Check what brand your RAM is with inxi -mxxx, run with sudo or as root. If it's Hynix, that's a bad sign; that's not very good RAM, and it might be the issue.

All of these are just to exclude hardware causes, and are worth checking out, since if you have a bad ram stick, nothing anyone does will make it better.

Note that resuming from suspend to disk will, I believe, also write a lot to RAM at once as it wakes up, but the data isn't held there during the suspend, so it can't be corrupted in the meantime if something is failing in the RAM itself.

I'd give very low odds that these are the causes, but they should be checked just to be on the safe side. Once you can definitely exclude them, it's probably a kernel issue. For some reason, hardware support for suspend on laptops tends to break for a few kernel versions and then come back; that's been my experience on laptops too. The bright side is that you have all Intel hardware, which is the most likely to be fixed quite rapidly.

Check your boot logs too. AntiX, if I remember right, does not use systemd, so it would be in one of the system boot log files. See if you can find any errors there right at the moment of waking, though you said you already did that. It probably isn't getting logged, since everything is probably fine until the hard lock happens on wake.
damentz
Status: Assistant
Joined: 09 Sep 2008
Posts: 849
It seems like the biggest problem right now is that Intel GPUs are affected by a full-system hang bug that stable backports have not solved.

bbs.archlinux.org/viewtopic.php?id=250765

It's a long thread, and there have been lots of patches. Jans has been doing a good job backporting them into our zen fixes branch, but it doesn't seem to matter. One person even reported that the 5.5 RCs are affected.

My suggestion is to disable sleep states in the i915 module through a file in the /etc/modprobe.d/ directory, if you're in a position to do that. Also, through TLP you can raise the minimum GPU frequency to whatever tlp-stat -g reports. I'm currently doing that on my own X1C7.

:: Code ::
options i915 enable_rc6=0


One final thing: one of the latest microcode updates from Intel caused stability issues when mitigations=off is set. It might be worth seeing whether disabling microcode updates solves any of your stability problems.
wildstar84
Status: Interested
Joined: 31 May 2017
Posts: 41
Location: Texas
Thanks for the insightful replies!

My first suspicion would also be memory (I actually had a very similar issue several years and laptops ago, and fixed it by buying new memory!), but that's not the issue here: no kernel prior to 5.4 causes this (even when booted up for a couple of weeks or more), while neither 5.4 kernel I've tried will last more than a few hours or more than one suspend.

I long ago had to comment out "enable_rc6=1" as it's not recognized by this (Ironlake) GPU, so I assume it's zero. Another setting I've needed for recent kernels is "i915.fastboot=0" on the boot line and "options i915 modeset=1" (in modprobe.d) to avoid black-screen issues.

I read that "long thread", which sheds a lot of light on what seem to be two different issues: 1) my "hard lockup" issue in 5.4x, and 2) a "multi-second GPU hang" issue (which I've also experienced rarely in 5.3x, but which self-recovers after a few seconds with the "hang" reported in the logs). From that it would seem it's not a Liquorix issue but an upstream one that is still being worked on, so I guess for now I'll stick with 5.3 and wait for 5.5 (while watching for future fixes/backports). As far as microcode goes, I'm at 3.20190618-1.

I'm not seeing any suspicious messages in dmesg, /var/log/messages, kern.log, or Xorg.0.log (from previous sessions that locked up). Are there any other "boot logs" I should be checking? You are correct that AntiX (and I) do not use systemd.

Thanks again!

Jim
damentz
Status: Assistant
Joined: 09 Sep 2008
Posts: 849
Oh ya, I checked enable_rc6 later and it is indeed gone. I wonder what prompted Intel to remove it.
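
For anyone who wants to check whether a given option still exists before putting it in modprobe.d, a minimal sketch along these lines should work. The helper name param_listed is hypothetical; it only assumes that modinfo -p prints one "name:description" line per parameter.

```shell
# Sketch: succeed if parameter $1 appears in `modinfo -p` style output on stdin.
# Each line of that output is expected to look like "name:description (type)".
param_listed() {
    grep -q "^$1:"
}

# Example use on a real system (not run here):
#   modinfo -p i915 | param_listed enable_rc6 && echo "still there" || echo "gone"
```

The match is anchored to the start of the line and ends at the colon, so enable_dc would not accidentally match a longer name that merely begins with it.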

As for any other logs, I think that kern.log is similar to running journalctl -k on systemd. You'll want to look at the last written log file before your current boot. Depending on how your system hangs, it may not get a chance to write. On my work laptop, about half the time I can find the rcs0 reset message in the last kernel log.

And I wanted to put this gitlab issue here for reference. This seems to be where the fix will probably come from:
gitlab.freedesktop.org/drm/intel/issues/673
damentz
Status: Assistant
Joined: 09 Sep 2008
Posts: 849
Following up with a bit more information. The gitlab issue [1] is still active and some other workarounds have been suggested.

My first recommendation, setting enable_rc6=0, doesn't work because that parameter no longer exists. The option to use instead is enable_dc=0:

:: Code ::
$ modinfo -p i915 | grep enable_dc
enable_dc:Enable power-saving display C-states. (-1=auto [default]; 0=disable; 1=up to DC5; 2=up to DC6) (int)


So passing i915.enable_dc=0 in your kernel boot parameters, or adding options i915 enable_dc=0 to a conf file in /etc/modprobe.d/, should prevent your system from hitting this bug.

And just for reference, this is what you're looking for at the end of your last kernel log:
:: Code ::
Jan 26 22:22:21 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0


[1] gitlab.freedesktop.org/drm/intel/issues/673

EDIT: Just got my first hang with display C-states disabled, so this doesn't help on 5.4:

:: Code ::
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: GPU crash dump saved to /sys/class/drm/card0/error
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: [drm] GuC communication enabled
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: GuC firmware i915/kbl_guc_33.0.0.bin version 33.0 submission:disabled
Jan 27 10:22:50 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: HuC firmware i915/kbl_huc_ver02_00_1810.bin version 2.0 authenticated:yes
Jan 27 10:22:58 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 27 10:23:06 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 27 10:23:08 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 27 10:23:10 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 27 10:23:12 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 27 10:23:14 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 27 10:23:16 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 27 10:23:18 steven-thinkpad-x1c7 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

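To get a quick count out of a log like the one above instead of scrolling through it, here is a small sketch. The function name gpu_hang_summary is made up, and the patterns are taken only from the message wording in the excerpt above, so they are an assumption for other kernel versions.

```shell
# Sketch: summarise i915 hang activity in a kernel log.
# Reads the file(s) given as arguments, or stdin if none are given.
# Patterns are based on the message wording quoted in this thread.
gpu_hang_summary() {
    awk '
        /GPU HANG:/                    { hangs++ }
        /Resetting [a-z0-9]+ for hang/ { resets++ }
        /reset request timed out/      { timeouts++ }
        END { printf "hangs=%d resets=%d timeouts=%d\n", hangs, resets, timeouts }
    ' "$@"
}
```

On a non-systemd setup you would point it at whichever file holds the previous session, for example gpu_hang_summary /var/log/kern.log (the exact file name depends on your syslog and rotation settings).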
[Resolved in 5.7] Hard lockups on 5.4ish kernels
wildstar84
Status: Interested
Joined: 31 May 2017
Posts: 41
Location: Texas
After going through several kernels (5.4x, 5.5x, 5.6x), this now seems to be silently FIXED in 5.7.0-12.1-liquorix-amd64 #1 ZEN SMP PREEMPT liquorix 5.7-17.1~sid! It's good to finally no longer be orphaned at 5.3.0-15.1! The only very minor annoyance is that 5.7 now boots up using about 24 MB MORE memory (with only two small modules added: cec, rc_core), so I'm not sure why, but oh well! ;-)

NOTE: Command line includes: i915.fastboot=0 i915.enable_dc=0
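
To confirm options like these actually made it onto the running kernel's command line, a sketch such as the following can help. The helper name cmdline_has is hypothetical; it only checks what was passed at boot, not the value the driver ended up using.

```shell
# Sketch: succeed if the exact option ($1) is present on the kernel command line.
# Reads /proc/cmdline by default; pass another file as $2 (handy for testing).
cmdline_has() {
    # Put each space-separated option on its own line, then match a whole line.
    tr -s ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

For example: cmdline_has i915.enable_dc=0 && echo "enable_dc=0 was passed at boot". Because the match is on the whole option, i915.enable_dc=1 or a bare enable_dc=0 would not count.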
damentz
Status: Assistant
Joined: 09 Sep 2008
Posts: 849
That's good! Intel has a pattern of sabotaging their own graphics drivers for multiple releases at a time.

There's a new article on Phoronix about how Intel focused on refactoring their code to fix a regression, but they still don't know the source: www.phoronix.com/scan.php?page=news_item&px=Intel-Graphics-WA-For-Regress

Expect to see more regressions like this in the future, unfortunately. Hopefully next time the commit that causes the issue will be easy to revert, or the patch straightforward to apply (unlike the last few releases).
All times are GMT - 8 Hours