4.16 kernel vs Nvidia proprietary drivers: are other distros affected?

Hey there!

The Nvidia/AMD kernel modules fail because the exported symbols in 4.16 have changed. If this is a brick wall, then openSUSE Tumbleweed applied it directly to the forehead, leaving users to watch as GDM just dies.

This is the issue:

for nvidia: https://devtalk.nvidia.com/default/topic/1030082/linux/kernel-4-16-rc1-breaks-latest-drivers-unknown-symbol-swiotlb_map_sg_attrs-/

for AMD:
https://bugzilla.kernel.org/show_bug.cgi?id=199101
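
For readers who want to see what "unknown symbol" means in practice, here is a minimal sketch, not the kernel's actual loader logic. It assumes symbol lines in the `/proc/kallsyms` layout (`<address> <type> <name>`), and the sample text is made up for illustration. Note that `/proc/kallsyms` lists all kernel symbols, not only exported ones, so this is a rough check at best. The function names are hypothetical; only `swiotlb_map_sg_attrs` comes from the Nvidia thread linked above.

```python
def listed_symbols(kallsyms_text):
    """Collect symbol names from lines shaped like '<addr> <type> <name> [module]'."""
    names = set()
    for line in kallsyms_text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            names.add(parts[2])
    return names


def missing_symbols(required, kallsyms_text):
    """Return, sorted, the required symbols that do not appear in the listing."""
    return sorted(set(required) - listed_symbols(kallsyms_text))


# Hypothetical sample standing in for the symbol listing of a 4.16 kernel.
SAMPLE_4_16 = """\
0000000000000000 T printk
0000000000000000 T schedule
"""

# A module that needs a symbol the kernel no longer provides cannot be loaded.
print(missing_symbols(["printk", "swiotlb_map_sg_attrs"], SAMPLE_4_16))
```

On a real system you would feed it the contents of `/proc/kallsyms` and the symbols named in the module's load error.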

This is the solution:
https://www.reddit.com/r/openSUSE/comments/8bnzbl/nvidia_39618_beta_released/

How did this slip through QA? According to openSUSE, the modules are being rebuilt after every update:

The kernel module is being built during installation (as it’s done with the Leap packages). But the kernel module is also being rebuilt and reinstalled after a kernel update has been done, since we don’t necessarily keep the kABI stable for TW. This has been implemented by making use of RPM’s trigger scripts.

When I asked about it, I got the inane-sounding answer that openSUSE can’t test proprietary drivers because the licensing forbids virtual machines, which, upon reading the license, looks like misinformation to me.

I just don’t get it. Did this issue manage to sneak past the maintainers of the other distros as well? This is absurd!

The reality seems to be that most of the people developing the Linux kernel hold the ethos that it’s Nvidia’s problem, not theirs, to make sure their closed-source driver works on the latest kernel. I can only imagine the fit Linus would pitch if a meaningful patch were reverted for the purpose of making the proprietary Nvidia driver work. If you are willing to build your own kernel, the Arch developers seem to have grabbed a patch from somewhere that makes the proprietary drivers work again on the latest 4.16 kernel. Chances are this problem is going to be resolved when Nvidia publishes a new driver, not when Linux upstream decides to revert this change.

I find it acceptable to maintain the absolute, as in no compromises, priority of the kernel. Regardless of what you think of Linus or the primacy of the kernel, it is reasonable to treat it as a development issue and push these errors into the domain of the proprietary developers or maintainers.

Because I think it is first and foremost a maintainer issue. I cannot see how a distro maintainer can push out kernel updates without testing them against existing, widely used proprietary drivers, subsequently breaking things for a giant portion of their user base. That is unacceptable behaviour. I would like to know which devops philosophy, when provided with the alternatives of:

  • Delaying an update
  • Patching a kernel

chooses breaking the users’ computers.

Disclaimer: I am not affected myself; I am on the stable (Leap) release.

Kernel 4.16.x has been finicky in general. I don’t have GPU problems… But as of 4.16.1 I’ve gotten non-stop kernel panics and I don’t know why.

Please read to the end, because I actually have a solution for you. : D

This is not an opensuse problem. Opensuse is just giving you the most up to date kernel which is what it is designed to do.

Opensuse has stated time and time and time and time and time again that it DOES NOT SUPPORT NVIDIA DRIVERS WITH TUMBLEWEED.

Tumbleweed packages and ships the Nouveau drivers, and they test for reliability with the Nouveau drivers. Nvidia drivers are considered 3rd party software by every distribution, period. No distro out there officially supports Nvidia drivers. Some distros give you a nice easy installer, but they are in no way obligated to test the driver against newer kernels.

So Opensuse is working as advertised.

And trust me, I have a GTX1060, so I am not happy about this situation either, but it is completely unfair to blame any distro for this cluster fuck of a situation.


Now that we have that out of the way, this is actually most likely an Nvidia issue. It looks like they sent out a shit driver.

However, their new shiny beta driver actually works with the 4.16 kernel, and I can report that it works on openSUSE Tumbleweed with my 1060.


Ahh so glad Vega is now in-kernel natively :slight_smile:

The bug in the first post about amd includes the open source drivers too…
However I really don’t see it as big of a problem as the OP says it is. Just boot from the last kernel or if you are on opensuse use snapper to reverse the updates.

I can’t agree more with Tjj’s post.


Thank you for linking the new drivers. I will add the link to the OP momentarily.

Now with that out of the way. …

Okay, let’s roll with that: Tumbleweed rejects the support of third party drivers. They don’t give a shit what happens to you if you use proprietary drivers. Does this mean that they should knowingly push a system-breaking update, especially without a warning? I mean, Arch has patched the kernel for their users.

We are talking about something as prevalent as the nvidia gfx chips, just watching as users’ computers get ditched.

In case anyone was wondering: the issue was well known and the patch was first published in March, the kernel was released around 04.01-04.02, and the TW image was shipped on 04.06.

I am an extremely agreeable man but knowing that the problem was well published, I am sorry I just can’t accept that the oS (blessed be their work) didn’t at least notify their users that “hey, this will crash your system unless”

Except they didn’t even test for that. The first mention of the problem came after the snapshot was released:
https://lists.opensuse.org/opensuse-factory/2018-04/msg00316.html

I don’t agree that they have done everything one can reasonably expect from a maintainer. Let me repeat myself: they didn’t even check for a common error scenario that certainly affects a large part of their userbase, even though they had the infrastructure to do so. If it were done by humans, sure… but this is all automated. That is the whole point of openQA :(

I have read back on the oS subreddit and boy, nvidia vs TW comes up a lot! This is clearly an issue that is impacting oS’s usability and its image.

No they don’t reject. Nvidia doesn’t offer.

No? Where are you getting that from?

They have no way to test the proprietary stuff. As @Tjj226_Angel said:

The FOSS drivers are what they can look at.

Arch is not an organization. They have more freedom to do whatever the fuck they want. SUSE is a company.

The issue is with proprietary drivers, things that SUSE can’t look at, breaking a rolling release distro. Don’t get all bent out of shape because you fail to understand what that means.

How? They ship with a config that works out of the box?

Again, this is a rolling release. If stability is a desire or a need, then that’s what you get. Rolling has a habit of breaking; that’s what it’s meant for.


I don’t agree with most of your reply, but this is just untrue.


Maybe my phrasing could’ve used work. Most RR distros are only tested at the minimum. There’s no way to test for everything, so there will be bugs. Whereas the stable stuff has had much more time and testing poured into it to ensure that it doesn’t break.


Yes, now we can agree :slight_smile: :+1:


Few more things for you.

1: The only people this really screws are the people who are doing a fresh install of openSUSE. If you are doing a kernel update, then I think openSUSE saves the last 5 or so kernel updates in GRUB. Just boot an older kernel option.

2: If Opensuse does not support Nvidia drivers, how do you expect them to test against it? They are not concerned about nvidia drivers. They have consistently made this very clear. With tumbleweed you can’t even install the Nvidia driver from an opensuse repository, you have to download the driver from the Nvidia website and use the CLI installer to get it to work. IDK how much clearer opensuse can make it that they are not responsible for Nvidia.

3: This same issue is also happening for people on Arch, Fedora Rawhide, and any other rolling release that got the 4.16 kernel update.

4: This is only a system-breaking issue for people who have Nvidia cards. The people running officially supported GPU drivers are not affected. I honestly think it would be worse if the kernel got held back just so that the people with non-official drivers can have a safety net.

Yeah, it is an impact mitigation and trivial for those who understand what is going on, but looking at the subreddit’s posts, that is not true for everyone.

I am not sure I can follow why they can’t test it. Help me understand: what stops openQA from running a sub-scenario with proprietary drivers?

Oh, I see it, and I am saying that it is WRONG. I am saying that this frequent error scenario should at least be tested against, not because they are responsible for the NVIDIA blobs, but because they are responsible for their users.

It is an ongoing, recurring problem whose handling is ‘recovery’, not prevention, even though it would be easy to prevent or at least test for it.

But I have already addressed this in my previous comment; I am repeating myself at this point.

3: This same issue is also happening for people on Arch, Fedora Rawhide, and any other rolling release that got the 4.16 kernel update.

That is actually what I would like to know. IIRC Arch has patched it, and I couldn’t find any complaints about the kernel bug on Rawhide using DDG, but I might be wrong. If you have a link at hand, it would be greatly appreciated.

Again, I understand that they voided all liability. My point is that they had the opportunity to correct it or at least warn the users about it. This is literally a case of “not my fucking job”. Frankly, I would prefer my maintainers not to follow a philosophy like that.

All of this is beside the fact that I am eternally grateful to the people whose work allows me to use this great distro for free.

Here is the thing. There is a reason no one officially supports Nvidia drivers.

Using Nvidia drivers is like playing Russian roulette: at any given moment, something could go horribly wrong.

There have been occasions where a driver would work with most GPUs, but not a select few. There have been times where installing other 3rd party apps broke the drivers. There have been times where the driver would work, but for whatever reason OpenGL was broken for various games and applications.

The list goes on and on.

If anyone wanted to officially support Nvidia drivers, they would have to buy AT LEAST every current nvidia GPU (and before you say they only need 1 GPU, remember that drivers have been known to fail with specific models. For instance the GTX970 used to have a bunch of issues when the GTX960 and 980 were fine), and then expand their testing 10 fold just to be able to make sure the driver wasn’t broken somehow.

Tumbleweed is constantly updating, but for argument’s sake, let’s say they have 1 update per day. This means that there are 2 major road blocks.

1: In order to roll out updates as consistently as they currently are, they would have to have multiple testing machines with the various nvidia GPU models. This is a community distro. They do get some funding from Suse, but not enough to spend the money on the infrastructure.

2: What happens when they do run into a bug? Nvidia has a really nasty habit of dragging their feet when it comes to issues like these. It could be a couple weeks or more before we see a proper fix. If you held back kernel updates for nvidia driver bugs, the kernel would almost never update. This is why there isn’t even an opensuse repo for the nvidia driver. It just literally would kill off all the “tumble” of tumbleweed.

I really think the only people who could possibly support it would be ubuntu or red hat simply because they have the money and clout to call nvidia directly and tell them to fix their shit.

Hell, even fucking Apple, with the billions of dollars they have, struggles to get Nvidia drivers to run right on their OS.


Now if Nvidia was even slightly better about linux support, then I would agree with you. I would personally pay a few hundred bucks so that rbrown could get an Nvidia card for the QA computer.

But Nvidia is usually so bad about releasing problematic drivers that it is literally an impossible task. That’s why everyone HAS taken the stance of “not my job”.


I think you just have the wrong expectations about TW :slight_smile: You should use something else, you will not change this.

I have my opinion of course, but I respect their freedom to build the distro they want. The ZFS FUD bothers me a lot more than this.

this = !first_time && !last_time

This has been happening since forever, and it’s a non-issue in the scheme of bleeding-edge kernels.

Nvidia has generally been very responsive with beta and then new drivers to address this (in this case too, I believe they have a beta driver out already to address this).

The reality of living with Nvidia proprietary drivers for a long time now has been philosophically irksome, but logistically straightforward:

set run-mode 3
update kernel
build nvidia driver with push-button script
set run-mode 5

occasionally update the driver when it won’t build.

Very rarely, hold off on a bleeding edge kernel to avoid symbol issues. If you are worried about stability and “bad things happening” with leading edge drivers/kernels, you should not be running 4.16 right now - period.
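
A rough sketch of that routine as a dry-run plan, under several assumptions: a systemd system (where run-mode 3 and 5 correspond to `multi-user.target` and `graphical.target`), `zypper dup` as the kernel update step on Tumbleweed, and a hypothetical installer file name. The function name is made up too. It only builds and prints the commands; it does not run anything. Also keep in mind that the installer typically builds against the running kernel, so a reboot into the new kernel may be needed before the build step.

```python
def driver_rebuild_plan(installer_path, update_cmd=("zypper", "dup")):
    """Return the four steps as argument lists, in order, without executing them."""
    return [
        ["systemctl", "isolate", "multi-user.target"],  # "set run-mode 3"
        list(update_cmd),                               # "update kernel" (assumed: zypper dup)
        ["sh", installer_path],                         # "build nvidia driver" via the .run script
        ["systemctl", "isolate", "graphical.target"],   # "set run-mode 5"
    ]


# Hypothetical installer file name; substitute whatever you downloaded.
for step in driver_rebuild_plan("./NVIDIA-installer.run"):
    print(" ".join(step))
```

Swap `update_cmd` for your own distro's update command if you are not on Tumbleweed.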

I’ve had fewer issues w/ linux nvidia drivers than I have had with their windows counterparts.

Would I like to see a full open-source stack and yum/apt “just work”™? Yep.
Do I think this is going to produce the best performance/stability balance any time soon? Nope.

There may come a day when GPUs are sufficiently “figured out” that you don’t have to have inside info to write a highly optimized driver. That day is not today. That day is not within any known or as yet realistically proposed architecture.

Even Intel/AMD x86 is subject to substantial gains with compiler tweaks “knowing a few things about how the latest silicon works”.

TL;DR
Enhance your calm… Have a towel - remain calm.


Latest snapshot: kernel sources were missing, so updating the module is no longer possible. (Where are your push-button scripts now?)

Last snapshot: Samba config files were conflicting with AppArmor, thus the Samba service could not be started.

Reading the factory mailing list can kinda kill your will to live, like seeing how sausages are made can turn you into a vegan.

Regarding the message of your post: I can only reiterate that I am willing to accept shit occasionally breaking, but I have issues with not testing for the most used scenarios that cover the majority of the users and RAG-ing it. It is not like there are 7852 different proprietary GPU drivers out there.

Assume that there are 10’000 Tumbleweeders around. Either check it centrally, where infrastructure is already available, or push the problem onto 10’000 users, and who knows how many laymen, to deal with. With all due respect, this is a lot of man (and woman and child) hours wasted.

Sure, don’t even attempt to fix them, that is fine, but there is absolutely no excuse for not testing for the Nvidia drivers. It is not like it isn’t fully automated, with 20-something different setups already being tested, now is it?

Umm… how is it nvidia’s fault that your distro failed to include something needed for even an open source driver addition?

Again though, if your expectation is violated by the 4.16 kernel (being bleeding edge for non-kernel-dev), then your expectation is to blame not the kernel or nvidia.

This is what eventually happens when you use Tumbleweed instead of Leap, and you really need a stable release. It’s pretty clear in Tumbleweed’s Portal that --due to the Linux kernel being updated very frequently-- the Tumbleweed distribution is not recommended if you rely on 3rd party kernel driver modules (including graphic drivers):

While every effort is made to build them, at this point there is no guarantee to have all additional modules available in openSUSE Tumbleweed like for example, Vmware or Virtualbox. And while the Packman Tumbleweed Essential repository attempts to deliver them there is no guarantee they will always succeed due to the incompatibilities with the quickly advancing Linux Kernel. The problems with proprietary graphics drivers are similar and there is no guarantee they will work tomorrow, even if they do today. If you don’t know how to compile your own additional kernel modules and you don’t wish to learn or keep a very close eye on what is being updated, please don’t use Tumbleweed.

If you happen to be a developer and must use Tumbleweed to have the latest software updates, you should read the Support Database: NVIDIA the hard way