Hard lockups on 5.4ish kernels
wildstar84
Status: Interested
Joined: 31 May 2017
Posts: 39
Location: Texas
I've tried two of the 5.4 kernels (linux-image-5.4.0-5.1-liquorix-amd64_5.4-3.1 and then, after waiting a couple of weeks hoping the problem had been resolved, linux-image-5.4.0-8.2-liquorix-amd64_5.4-9.1). Within 24 hours my machine has suddenly locked up hard (requiring a power cycle): twice on entering suspend-to-RAM, once on resume from suspend-to-RAM (full screen displayed, but no mouse or keyboard response), and once just randomly, with mouse, keyboard, and video activity all freezing. No issues with any previous kernels (up through linux-image-5.3.0-16.3-liquorix-amd64_5.3-14.1, which I'm back to using now). No errors in the logs that I can find.

:: Code ::
inxi -F:
$>inxi -F
System:
  Host: wildstar Kernel: 5.3.0-16.3-liquorix-amd64 x86_64 bits: 64
  Desktop: AfterStep 2.2.12
  Distro: antiX-16.2_x64-base Berta Cáceres 15 June 2017
Machine:
  Type: Laptop System: Hewlett-Packard product: HP EliteBook 8440p v: N/A
  serial: <root required>
  Mobo: Hewlett-Packard model: 172A v: KBC Version 30.31
  serial: <root required> BIOS: Hewlett-Packard v: 68CCU Ver. F.11
  date: 11/25/2010
Battery:
  ID-1: BAT0 charge: 51.7 Wh condition: 54.0/54.0 Wh (100%)
CPU:
  Topology: Dual Core model: Intel Core i5 M 520 bits: 64 type: MT MCP
  L2 cache: 3072 KiB
  Speed: 1222 MHz min/max: 1199/2400 MHz Core speeds (MHz): 1: 1617 2: 1914
  3: 1440 4: 1443
Graphics:
  Device-1: Intel Core Processor Integrated Graphics driver: i915 v: kernel
  Display: x11 server: X.Org 1.20.6 driver: intel resolution: 1920x1080~60Hz
  OpenGL: renderer: Mesa DRI Intel Ironlake Mobile v: 2.1 Mesa 19.2.6
Audio:
  Device-1: Intel 5 Series/3400 Series High Definition Audio
  driver: snd_hda_intel
  Sound Server: ALSA v: k5.3.0-16.3-liquorix-amd64
Network:
  Device-1: Intel 82577LM Gigabit Network driver: e1000e
  IF: eth0 state: up speed: 100 Mbps duplex: full mac: b4:99:ba:e2:cb:7c
  Device-2: Intel Centrino Advanced-N 6200 driver: N/A
Drives:
  Local Storage: total: 931.51 GiB used: 163.01 GiB (17.5%)
  ID-1: /dev/sda vendor: HGST (Hitachi) model: HTS721010A9E630
  size: 931.51 GiB
Partition:
  ID-1: / size: 31.25 GiB used: 13.18 GiB (42.2%) fs: ext4 dev: /dev/sda2
  ID-2: /home size: 842.06 GiB used: 138.94 GiB (16.5%) fs: ext4
  dev: /dev/sda7
  ID-3: /var size: 31.25 GiB used: 10.89 GiB (34.9%) fs: ext4 dev: /dev/sda5
  ID-4: swap-1 size: 9.00 GiB used: 0 KiB (0.0%) fs: swap dev: /dev/sda6
Sensors:
  System Temperatures: cpu: 47.0 C mobo: N/A
  Fan Speeds (RPM): N/A
Info:
  Processes: 144 Uptime: 20h 50m Memory: 7.57 GiB used: 1.27 GiB (16.8%)
  Shell: bash inxi: 3.0.36


Anyone else experienced this?
techAdmin
Status: Site Admin
Joined: 26 Sep 2003
Posts: 3951
Location: East Coast, West Coast? I know it's one of them.
While it's very unlikely to be the cause, do an extended RAM test on the laptop: run a few iterations of memtest on it. That's a boot option in many distros. If not, you can get a live CD of it or a live USB flash image.

You would want to exclude corrupted ram before anything else. When you suspend to ram, everything is held in ram, so if it's got a glitch, that would of course expose it.

Also try suspend to disk, just to see. That requires at least as much swap space as your consumed RAM, which I see yours already has, so you are set. If suspend to disk works but suspend to RAM crashes, it could well be corrupted RAM.
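
As a rough way to confirm the swap requirement above, the following sketch compares total swap against RAM currently in use, as read from /proc/meminfo. The function name swap_covers_used_ram and its optional file argument are made up for illustration, and "used" is approximated here as MemTotal minus MemAvailable, which is an assumption rather than an exact measure of what a suspend to disk would write.

```shell
#!/bin/sh
# Sketch: compare total swap with RAM currently in use, reading a
# /proc/meminfo-style file. The function name and the optional file
# argument are illustrative, not part of any existing tool.
swap_covers_used_ram() {
    meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "SwapTotal:"    { swap  = $2 }
        END {
            # Approximation (assumption): used = total - available.
            used = total - avail
            printf "used RAM: %d kB, swap: %d kB\n", used, swap
            # Exit status 0 when swap is at least as large as used RAM.
            exit ((swap >= used) ? 0 : 1)
        }
    ' "$meminfo"
}
```

Calling swap_covers_used_ram with no argument reads the live /proc/meminfo; a zero exit status suggests there is enough swap for a suspend to disk attempt.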

Check what brand your RAM is with inxi -mxxx, run with sudo or as root. If it's Hynix, that's a bad sign: that's not very good RAM, and it might be the issue.

All of these are just to exclude hardware causes, and are worth checking out, since if you have a bad ram stick, nothing anyone does will make it better.

Note that suspend to disk, I believe, also writes a lot to RAM at once as it wakes up, but the data isn't held in RAM while suspended, so it can't be corrupted there if something is failing in the RAM itself.

I'd give very low odds that these are the causes, but they should be checked just to be on the safe side. Once you can definitely exclude them, it's probably a kernel issue. For some reason, hardware support for suspend on laptops tends to break for a few kernel versions and then come back; that's been my experience on laptops too. The bright side is that you have all Intel hardware, which is the most likely to be fixed quite rapidly.

Check your boot logs too. antiX, if I remember right, does not use systemd, so it would be in one of the system boot log files. See if you can find any errors there right at the moment of waking, though you said you already did that. It probably isn't getting logged, since everything is probably fine until the hard lock on wake happens.
damentz
Status: Assistant
Joined: 09 Sep 2008
Posts: 783
It seems like the biggest problem right now is that Intel GPUs are affected by a full system hang bug that stable backports have not solved.

bbs.archlinux.org/viewtopic.php?id=250765

It's a long thread, and there's been lots of patches. Jans has been doing a good job backporting them into our zen fixes branch, but it doesn't seem to matter. One person even reported that 5.5 RCs are affected.

My suggestion is to disable sleep states in the i915 module through the /etc/modprobe.d/ directory, if you're in a position to do that. Also, through TLP you can increase the minimum frequency to whatever tlp-stat -g reports. I'm currently doing that on my own X1C7.

:: Code ::
options i915 enable_rc6=0


One final thing: one of the latest microcode updates from Intel caused stability issues when mitigations=off is set. It might be worth seeing if disabling any microcode updates solves any of your stability problems.
wildstar84
Status: Interested
Joined: 31 May 2017
Posts: 39
Location: Texas
Thanks for the insightful replies!

My suspicion would also be memory (I actually had a very similar issue several years and laptops ago, and fixed it by buying new memory!), but that's not the issue here: no kernel prior to 5.4 causes this (even when booted up for a couple of weeks or more), while neither 5.4 kernel I've tried will last more than a few hours or more than one suspend.

I long ago had to comment out "enable_rc6=1" as it's not recognized by this (ironlake) gpu, so I assume it's zero. Another setting I've had to do for recent kernels is "i915.fastboot=0" on boot line and "options i915 modeset=1" (modprobe.d) to avoid black screen issues.
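
One way to see whether an option such as enable_rc6 is recognized by the running i915 module is to look for it under /sys/module/i915/parameters/. The sketch below does that; the function name has_module_param and its optional third argument (an alternate sysfs root, handy for testing) are hypothetical. Parameters declared without sysfs permissions will not appear there, so modinfo -p i915 is a useful cross-check.

```shell
#!/bin/sh
# Sketch: report whether a loaded kernel module exposes a given parameter
# under sysfs. has_module_param and its optional third argument (the
# sysfs module root) are illustrative names, not an existing tool.
has_module_param() {
    module="$1"
    param="$2"
    root="${3:-/sys/module}"
    if [ -e "$root/$module/parameters/$param" ]; then
        echo "$module: $param is available"
        return 0
    fi
    echo "$module: $param is not available"
    return 1
}
```

For example, has_module_param i915 enable_rc6 checks the option discussed above, and has_module_param i915 modeset checks one that is set in modprobe.d here.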

I read that "long thread", which sheds a lot of light on what seem to be two different issues: 1) my "hard lockup" issue in 5.4.x, and 2) a "multi-second GPU hang" issue (which I've also experienced rarely in 5.3.x, but which self-recovers after a few seconds with the "hang" reported in the logs). From that it would seem it's not a Liquorix issue but an upstream one that is still being worked on, so I guess for now I'll stick with 5.3 and wait for 5.5 (while watching for future fixes and backports). As far as microcode goes, I'm at 3.20190618-1.

I'm not seeing any suspicious messages in dmesg, /var/log/messages, kern.log, or Xorg.0.log (from previous sessions that locked up). Are there any other "boot logs" I should be checking? You are correct that antiX (and I) do not use systemd.

Thanks again!

Jim
damentz
Status: Assistant
Joined: 09 Sep 2008
Posts: 783
Oh ya, I checked enable_rc6 later and it is indeed gone. I wonder what prompted Intel to remove it.

As for any other logs, I think that kern.log is similar to running journalctl -k on systemd. You'll want to look at the last written log file before your current boot. Depending on how your system hangs, it may not get a chance to write. On my work laptop, about half the time I can find the rcs0 reset message in the last kernel log.
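
To pull GPU-related lines out of the log from the boot before the hang, something like the sketch below may help. The function name last_gpu_messages, the search pattern, and the default line count are assumptions for illustration; the exact wording of a hang or reset message varies between kernels, so the pattern may need adjusting.

```shell
#!/bin/sh
# Sketch: print the last lines of a kernel log that mention the GPU.
# The function name, the search pattern and the default line count
# are assumptions for illustration.
last_gpu_messages() {
    logfile="$1"
    count="${2:-20}"
    # Case-insensitive match on a few likely keywords, newest lines last.
    grep -i -E 'gpu hang|rcs0|i915' "$logfile" | tail -n "$count"
}
```

On a system that rotates logs, the file to check is likely a rotated copy, for example last_gpu_messages /var/log/kern.log.1, though that file name depends on the local logrotate setup.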

And I wanted to put this gitlab issue here for reference. This seems to be where the fix will probably come from:
gitlab.freedesktop.org/drm/intel/issues/673