Tech

plecavalier

I just came across this post...I'm experiencing the exact same problem so naturally I attempted the workaround.

Rebooted into the previous kernel and got the same error when running the script. Rebooted again and ran the script. It completed successfully, as it always did before, but after a reboot I was stuck at the console once again and had to re-run the script.

So this didn't work for me.

I'm running Xorg 1.7, kernel 2.6.32-trunk-686 and SGFXI 4.13.40. This happened for the first time about two weeks ago, and ever since, every time I start or reboot I must run the script.

techAdmin

Generally the reason it fails is because grub isn't actually installed where you think it is.

Other possible causes are nouveau drivers locking things up or out.

However, when posting, please post at least the bare minimum information required to start looking at the problem, ideally the /var/log/sgfxi/sgfxi.log using paste.debian.net or some other paste service if that's too long for the debian one.

When posting failures, remember that while you may know what hardware and driver you are using, we don't. In this case, for example, you've given everything except the information actually needed, such as which video card and which driver. That is why I prefer getting the paste url to your sgfxi.log: it has all that information.

However, the primary reason this happens is this:

The user is running a default debian install, with metapackages for linux-image. The user does a system upgrade and a new kernel is installed without the user noticing. On restart, the new kernel has no nvidia module built, and X fails.

An alternative version of this, slightly more complicated, is: the user has the meta kernel package and gets a new kernel in an upgrade, but grub is installed incorrectly, and the user is actually booting into the old kernel without realizing it. That one is fairly common when it comes to this type of error report. In other words, the master grub (the one in the mbr) points to a hardcoded kernel rather than chainloading the install's own grub, while the install's grub writes to the partition boot sector, not the mbr, so the hardcoded entry doesn't get updated when a new kernel arrives.
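A rough way to check both scenarios is to see which kernel you actually booted and whether an nvidia module file exists for it. This is only a sketch: the function name module_built_for is made up for illustration, and it assumes modules live under /lib/modules/<kernel>/ with a file named nvidia.ko (possibly compressed).

```shell
# Sketch only: succeed if a module file exists for the given kernel.
# Assumptions: modules live under /lib/modules/<kernel>/ and the file is
# named <module>.ko, optionally with a compression suffix.
module_built_for() {
    kernel="$1"                  # kernel release, e.g. the output of: uname -r
    module="${2:-nvidia}"        # module name, defaults to nvidia
    root="${3:-/lib/modules}"    # modules root, overridable for testing
    # Exit status 0 if at least one matching file is found, non-zero otherwise.
    find "$root/$kernel" -name "$module.ko*" 2>/dev/null | grep -q .
}
```

Run it as module_built_for "$(uname -r)" and check the exit status. If uname -r shows an older kernel than the one you thought you upgraded to, that points at the grub case described above.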

There are a few other possibilities. The new nouveau stuff promises endless fun and messiness in terms of getting it out of the system, and that may take more work.

Also, when X fails to start, you'll want to look at the last 50 or so lines of /var/log/Xorg.0.log and see if you can see what the actual error was.

like so: tail -n 50 /var/log/Xorg.0.log
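If the tail is noisy, it can help to show only the error lines, which Xorg marks with (EE). A minimal sketch, where xorg_errors is just an illustrative name and the default log path is the one mentioned above:

```shell
# Sketch only: print the (EE) error lines from the end of an X log.
xorg_errors() {
    log="${1:-/var/log/Xorg.0.log}"   # log file to read
    lines="${2:-50}"                  # how many trailing lines to inspect
    # -F treats "(EE)" as a literal string rather than a regex group.
    tail -n "$lines" "$log" | grep -F '(EE)'
}
```

Calling xorg_errors with no arguments looks at the last 50 lines of /var/log/Xorg.0.log. No output means no (EE) lines were found in that window.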

plecavalier

It's actually a driver mismatch. The kernel error is:

May 20 09:18:38 localhost kernel: [ 34.108179] NVRM: API mismatch: the client has the version 195.36.24, but
May 20 09:18:38 localhost kernel: [ 34.108181] NVRM: this kernel module has the version 195.36.15. Please
May 20 09:18:38 localhost kernel: [ 34.108183] NVRM: make sure that this kernel module and all NVIDIA driver
May 20 09:18:38 localhost kernel: [ 34.108184] NVRM: components have the same version

After running sgfxi again:

May 20 09:20:19 localhost kernel: [ 135.211281] NVRM: loading NVIDIA UNIX x86 Kernel Module 195.36.24 Thu Apr 22 09:18:20 PDT 2010

My first thought was to run sgfxi with -n, hoping it would remove both, and then run sgfxi again to rebuild both. But somehow the kernel still complains about mismatched drivers after rebooting.
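For anyone else chasing this, a small filter like the one below pulls the two versions out of NVRM lines in the format shown above. It is only a sketch: nvrm_versions is a made-up name, and it assumes the kernel messages are worded exactly as in my log.

```shell
# Sketch only: extract client and kernel module versions from NVRM
# "API mismatch" messages read on stdin.
# Assumption: the messages are worded as in the log excerpt above.
nvrm_versions() {
    sed -n \
        -e 's/.*NVRM: API mismatch: the client has the version \([0-9.]*[0-9]\).*/client \1/p' \
        -e 's/.*NVRM: this kernel module has the version \([0-9.]*[0-9]\).*/module \1/p'
}
```

Feed it the kernel log, for example dmesg | nvrm_versions. The "module" line is what the kernel loaded at boot, and the "client" line is what the X driver expects.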

I can say for certain that this did not result from a kernel upgrade. On the other hand, it is possible, though not certain, that some Debian/nvidia stuff could have been installed during an upgrade. I always force NO kernel upgrades and keep that for times when I can afford the potential downtime ;)

Sorry for the missing hardware info. I must have been distracted while posting.

paste.debian.net/74114

Coincidentally, I'm a longtime Debian user and had never heard of paste.debian.net... If I did something wrong just let me know.
What a great service. One more reason to support Debian!

techAdmin

There are now several other possibilities. dkms should be handled by sgfxi, but it looks like it's not.

dkms is probably about the worst thing anyone can use for a dynamic rolling release distro with non-free drivers, which can break at any xorg update or kernel update, but still debian insists on using it.

sgfxi was testing for that junk but I assume something, as usual, changed, randomly, in dkms, and now whatever sgfxi was trying to do is failing.

I have seen few new technologies worse implemented than dkms, and coming from Dell as it does, that's hardly a surprise.

sgfxi tests for and removes all nvidia packages, and if dkms had been done correctly, purging all nvidia packages would of course also purge all dkms cr#p, but it doesn't, leaving gunk in the system like you are seeing.

If you can give me a PRECISE step by step procedure for removing your dkms cruft, I'd appreciate it, I really hate spending any of my time on that garbage, but I will implement further tweaks if I get the steps that have to be done.

techAdmin

To make this clear, it looks to me like you used the debian dkms nvidia build method, and although sgfxi tries to remove all that junk, it's quite likely it's simply missing something. Sadly, the dkms authors didn't have the courtesy to include a simple option to purge or remove a module by generic name, preferring instead to require a specific module version for removal, which requires even further processing to get that data.
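The extra processing would look roughly like this: read dkms status, pick out every version registered for the module, and build one dkms remove command per version. This is a sketch, not what sgfxi does. The name dkms_remove_cmds is made up, and it assumes status lines start with either "module, version, ..." or "module/version, ...", which varies between dkms releases, so check your own output first. It only prints the commands and runs nothing.

```shell
# Sketch only: turn `dkms status` output (on stdin) into remove commands.
# Assumption: each status line begins with "module, version, ..." or
# "module/version, ..."; the exact format differs between dkms releases.
dkms_remove_cmds() {
    module="$1"                       # generic module name, e.g. nvidia
    awk -v mod="$module" -F'[,/:] *' '
        $1 == mod && $2 != "" {
            print "dkms remove -m " $1 " -v " $2 " --all"
        }' | sort -u                  # one command per distinct version
}
```

Usage would be dkms status | dkms_remove_cmds nvidia, then review the printed lines and run them as root if they look right.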

I wish distros had displayed a bit less enthusiasm about this method, and ensured it actually works as intended with their package managers, in terms of removing it, but I spend as little time as humanly possible tracking dkms so I'm sure something changed since my last efforts to subdue it.

plecavalier

Hm. I'm going to have to do some homework because I don't even know what DKMS is. It sounds like something Dell has done, which implies to me it's firmware related, but you're saying it's also distro-specific.

I'll see what I can find out about my "DKMS" on dld (the Dell Linux Desktops mailing list) and get back to you.

techAdmin

The error you show suggests dkms, i.e., the auto module rebuilding thing that doesn't really work but is used anyway.

It's a hack, nasty, and it fails. That's my first guess here.

apt-cache show dkms

If you ever followed a recent debian or sidux how-to for nvidia installation, it probably had you use a dkms method.

Dkms is fine for stable, frozen pool distros like stable lenny, ubuntus, etc, but it's not at all fine for rolling release distros like sid/testing, because new kernels/xorgs can cause non free video driver failure.

At least your errors suggest this is the cause. If not, I have a new set of errors, caused by an as-yet-unknown new agent.
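To see what those how-tos left behind, you can filter the dpkg -l listing for installed packages with nvidia or dkms in the name. A minimal sketch, where nvidia_dkms_pkgs is just an illustrative name, and it relies on dpkg marking fully installed packages with "ii" in the first column:

```shell
# Sketch only: list installed packages whose name mentions nvidia or dkms.
# Reads `dpkg -l` output on stdin; "ii" in column one means installed.
nvidia_dkms_pkgs() {
    awk '$1 == "ii" && $2 ~ /nvidia|dkms/ { print $2 }'
}
```

Run it as dpkg -l | nvidia_dkms_pkgs. Packages in other states (for example "rc", removed but with config files left) are deliberately skipped here, so widen the filter if you want those too.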

plecavalier

Very interesting. I'm sure you're right. I had many difficulties prior to stumbling onto your script. It is likely that I installed various pkgs while following more than one how-to.

So I definitely want to stick with sgfxi because in my opinion that's how things should work. I don't feel I should bury myself in building stuff when I can run a script to do all that for me.

That said, can I uninstall all the cr#p I previously installed and start clean so I only have your stuff?

plecavalier

By your stuff I mean only the software your script installs of course.

techAdmin

Yes, those how-tos are not a positive step for Linux users.

They ignore the real problems, and offer no real solution to escape, and pretend that just because dkms is a debian package, it's somehow good, or better than doing a direct install.

The direct install method is the default for one very simple reason: a bit over three years ago, when I started sgfxi, I read all those how-tos from the Debian wiki as well, and it became immediately obvious that for nvidia, the debian method should always be considered the second or third option, and the direct binary install the first. Years of experience have shown no time when this hasn't been true in Sid/Testing. In Stable it doesn't really matter unless you want the newer drivers, or beta drivers.

The sgfxi -! 40 option also lets you build modules for each kernel, and the automatic kernel module rebuild-without-reinstall tests make it that much cleaner for nvidia users as well.

However, sgfxi is supposed to try to remove dkms, and it's clearly failing, which is a technical bug. The problem is, I hate dkms, and I find it very hard to justify spending the day or two it usually takes to really read up, test, debug and then fix the issue. It will happen, as it did with nouveau, which is equally bad at this point, only in a different way: binding a userland video module to the console, and only being able to remove it after blacklisting and a reboot, sounds to me like everything bad people said about Windows reboots for every driver or core change.
