00:22fdobridge_: <gfxstrand> Yeah, I don't remember what all is in there. 😅 We obviously can't do non-saturating conversions. 😂
00:23fdobridge_: <karolherbst🐧🦀> well.. on some gens we can
00:23fdobridge_: <karolherbst🐧🦀> like floats are always possible
00:24fdobridge_: <karolherbst🐧🦀> ints on some gens
00:24fdobridge_: <karolherbst🐧🦀> well... `I2I.sat` was added with turing
00:25fdobridge_: <karolherbst🐧🦀> and only from 32 bit to 8/16 bit :ferrisUpsideDown:
00:25fdobridge_: <karolherbst🐧🦀> though saturated from 64 bit is trivial
00:25fdobridge_: <karolherbst🐧🦀> I think?
00:26fdobridge_: <karolherbst🐧🦀> ohh wait
00:26fdobridge_: <karolherbst🐧🦀> I misread what you wrote :ferrisUpsideDown:
00:27fdobridge_: <karolherbst🐧🦀> so uhm... ignore all the above, and uhm.. the problem with non-saturated int conversion is that using `iand` would be faster anyway
00:27fdobridge_: <karolherbst🐧🦀> all those conversion ops are variable runtime
00:27fdobridge_: <karolherbst🐧🦀> and don't run on the normal alu blocks
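A minimal Python sketch of the two semantics being contrasted here (the function names are hypothetical, not NAK or NIR APIs): a truncating u32-to-u8 conversion just keeps the low bits, which is a single `iand`, while a saturating one clamps to the destination range.

```python
def trunc_u32_to_u8(x: int) -> int:
    """Non-saturating conversion: keep the low 8 bits (a single iand)."""
    return x & 0xFF


def sat_u32_to_u8(x: int) -> int:
    """Saturating conversion: clamp to 255 instead of wrapping."""
    return min(x & 0xFFFFFFFF, 0xFF)
```

For an input of 300, truncation gives 44 (300 & 0xff) while saturation gives 255.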
00:37fdobridge_: <gfxstrand> Yeah
00:37fdobridge_: <gfxstrand> I'm already doing int conversions with `prmt`
00:38fdobridge_: <gfxstrand> It can do everything we want anyway
00:38fdobridge_: <karolherbst🐧🦀> yeah
00:38fdobridge_: <gfxstrand> For truncating conversions, anyway.
00:38fdobridge_: <karolherbst🐧🦀> and is faster 😛
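For reference, `prmt` is a byte permute. The sketch below models the default mode of PTX `prmt.b32` as described in NVIDIA's PTX ISA documentation; the SASS `PRMT` that NAK emits may encode its modes differently, and the helper names are hypothetical. It shows how a single selector yields either a sign-extended or a zero-extended 8-bit truncation of a 32-bit value.

```python
def prmt(a: int, b: int, selector: int) -> int:
    """Emulate PTX prmt.b32 (default mode) on 32-bit unsigned operands.

    Source bytes 0-3 come from `a`, bytes 4-7 from `b`. Each of the four
    low selector nibbles picks the source byte for one destination byte;
    if the nibble's top bit is set, the sign bit of the selected byte is
    replicated across the destination byte instead of copying it.
    """
    src = ((b & 0xFFFFFFFF) << 32) | (a & 0xFFFFFFFF)
    result = 0
    for i in range(4):
        nibble = (selector >> (4 * i)) & 0xF
        byte = (src >> (8 * (nibble & 0x7))) & 0xFF
        if nibble & 0x8:
            byte = 0xFF if byte & 0x80 else 0x00
        result |= byte << (8 * i)
    return result


def i32_to_i8_sext(x: int) -> int:
    # Byte 0 copied; bytes 1-3 replicate the sign bit of byte 0.
    return prmt(x, 0, 0x8880)


def u32_to_u8_zext(x: int) -> int:
    # Byte 0 copied; bytes 1-3 come from byte 4, i.e. byte 0 of b (== 0).
    return prmt(x, 0, 0x4440)
```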
00:38fdobridge_: <gfxstrand> And, once I review and merge @marysaka's optimization pass, we can propagate and fold them very nicely.
00:38fdobridge_: <karolherbst🐧🦀> cool
00:38fdobridge_: <karolherbst🐧🦀> anyway
00:39fdobridge_: <karolherbst🐧🦀> I hope this run survives `Pass: 317311, Fail: 20, Crash: 5100, Warn: 3, Skip: 1769064, Flake: 2, Duration: 1:05:48, Remaining: 59:41`
00:39fdobridge_: <gfxstrand> For unsigned saturating conversions, `imnmx` is probably faster
00:39fdobridge_: <karolherbst🐧🦀> you have saturating conversions
00:39fdobridge_: <karolherbst🐧🦀> and on the hw you don't, you also don't have `imnmx` :ferrisUpsideDown:
00:39fdobridge_: <gfxstrand> For signed saturating conversions IDK if `imnmx` or `I2I` is going to be faster.
00:39fdobridge_: <karolherbst🐧🦀> yeah...
00:39fdobridge_: <karolherbst🐧🦀> good question
00:40fdobridge_: <gfxstrand> IDK that it matters much. We'll only ever see them in CL anyway.
00:40fdobridge_: <karolherbst🐧🦀> I noticed a `0.2%` improvement in pixmark_piano once I used `fadd` for `fneg` instead of `f2f` :ferrisUpsideDown:
00:41fdobridge_: <karolherbst🐧🦀> so it's not _super_ slow
00:41fdobridge_: <gfxstrand> Yeah, I'm already using fadd/iadd for neg
00:41fdobridge_: <karolherbst🐧🦀> just slow enough that it kinda matters in the most shader heavy benchmark
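One way to do negation with an add, assuming the add has a negate source modifier: `fneg(x)` becomes `(-x) + (-0.0)` and `ineg(x)` becomes `0 + (-x)`. The `-0.0` is the important detail: under round-to-nearest `-0.0 + +0.0` is `+0.0`, so adding `+0.0` would lose the sign of a negated zero. The Python sketch below (hypothetical helper names; Python floats are IEEE 754 doubles) checks that; it is not the exact NAK lowering.

```python
import math


def fneg_via_fadd(x: float) -> float:
    # (-x) models the negate source modifier on the fadd operand;
    # adding -0.0 keeps the sign of a zero result correct.
    return (-x) + (-0.0)


def fneg_via_fadd_plus_zero(x: float) -> float:
    # Adding +0.0 instead turns -(+0.0) into +0.0: the sign is lost.
    return (-x) + 0.0


def ineg_via_iadd(x: int) -> int:
    # 32-bit two's-complement negate as 0 + (-x), wrapped to 32 bits.
    return (0 - x) & 0xFFFFFFFF
```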
00:41RedSheep: Update on my message a few hours ago about adventures in trying to get 4k120 working with GSP, I was able to extract the modeline that my displayport monitor uses and that worked fine, but the modeline for my HDMI monitor seemed to lock things up.
00:41karolherbst: could be some HDMI specific problem then
00:42RedSheep: In researching this it may be related to significant struggles around the HDMI 2.1 FRL standard as discussed here: https://gitlab.freedesktop.org/drm/amd/-/issues/1417
00:42fdobridge_: <karolherbst🐧🦀> my assumption is, that using two instructions might kill any benefits
00:42fdobridge_: <karolherbst🐧🦀> but it might also depend on scheduling
00:42RedSheep: However it is mentioned there that the open nvidia driver that uses GSP actually can do it, so maybe there's just something we need in drm/nouveau to get FRL wired up?
00:43fdobridge_: <gfxstrand> Yeah, there's enough "It might be faster depending on `$STUFF`" that IDK that there's a right/wrong choice.
00:43fdobridge_: <karolherbst🐧🦀> they probably had a good reason to add `I2I.SAT`
00:43fdobridge_: <karolherbst🐧🦀> actually.. let me check something
00:44fdobridge_: <gfxstrand> Especially if you want to throw away the top bits. Then it's `imnmx` twice and an `iand`.
00:44fdobridge_: <gfxstrand> So three instructions at which point `I2I` is almost certainly faster.
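The instruction count above can be sketched in Python, with `imnmx` modeled as `min`/`max` and the input taken as an already sign-interpreted integer (the function name is hypothetical). An unsigned saturating conversion needs only one clamp; a signed one needs two clamps, plus an `iand` when only the low 8 bits are kept.

```python
def sat_i32_to_i8_bits(x: int) -> int:
    """Signed saturating i32 -> i8, returned as the raw 8-bit pattern.

    op 1: imnmx (max) against -128
    op 2: imnmx (min) against 127
    op 3: iand with 0xff to throw away the sign-extended top bits
    """
    if not -(2**31) <= x < 2**31:
        raise ValueError("expected a signed 32-bit value")
    clamped = min(max(x, -128), 127)
    return clamped & 0xFF
```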
00:45fdobridge_: <karolherbst🐧🦀> yeah
00:45karolherbst: RedSheep: so I guess it works if you lower the refresh rate?
00:46RedSheep: Yeah as long as I stay below the threshold where FRL is needed then the HDMI monitor works fine
00:46karolherbst: I see
00:46karolherbst: who is adding that mode btw?
00:46karolherbst: there are cases where userspace does
00:46karolherbst: and userspace is wrong to do so
00:46karolherbst: the modesetting DDX does so in Xorg
00:47karolherbst: and there is no way the kernel can actually do much about this, other than double-checking that the mode is sound
00:47karolherbst: which.. we don't always do
00:47karolherbst: let me check something..
00:47fdobridge_: <gfxstrand> Yeah, 32 is what I had in my WIP branch. I think it's fine for a start.
00:48RedSheep: I am not sure I understand what you mean, there were modes up to 4k60 automatically there, and I used xrandr to input modes I had created myself to try and test the limits
00:48karolherbst: ahh yeah
00:48karolherbst: then it's not surprising it doesn't work
00:50RedSheep: Yeah. I am new around here and wanted to see where I can help to make my use cases work better, so I just wanted to see if I could get it going anyway and try to understand where I would need to start with trying to get some code merged somewhere to fix it up
00:50karolherbst: something like this is needed: https://patchwork.freedesktop.org/patch/544681/?series=119998&rev=1
00:50karolherbst: just for HDMI
00:50karolherbst: Lyude: ^^ we haven't merged that patch yet
00:51karolherbst: RedSheep: mind testing that the kernel doesn't misbehave with that patch applied?
00:51Liver_K: Hm, is running the x server as a normal user (vs. root) expected to cause problems?
00:51karolherbst: depends on the problems
00:52karolherbst: it's only really supported with systemd+logind acting as the session manager
00:53Liver_K: Well take the problems of my own I've described. I'm mostly wondering just because of this diagram here https://en.wikipedia.org/wiki/X.Org_Server#/media/File:Linux_graphics_drivers_2D.svg but that might be specific to XAA
00:53RedSheep: Oops meant to tag, sorry new to IRC as well :) Yes I will try that kernel patch
00:54karolherbst: don't use diagrams from wikipedia, they are either outdated or wrong
00:54karolherbst: RedSheep: it won't make the mode work, however the kernel should reject userspace from switching to it
00:55Liver_K: Lol okay then
00:55RedSheep: Right, better than it locking up
00:56karolherbst: yeah
00:56karolherbst: the nouveau bug here is to not verify modes coming from userspace and assuming they are sane and fit into hardware constraints
00:56karolherbst: or rather...
00:56karolherbst: what nouveau supports
00:56karolherbst: the kernel never rejects those modes in the first place, because that's kinda how the interface works
00:56karolherbst: (it does reject them coming from the EDID however)
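The missing check can be sketched as a simple bound on the mode's clock. This Python sketch is not the nouveau or DRM code: `hdmi_tmds_mode_valid` is a hypothetical helper, and it assumes RGB output, where the TMDS character rate scales with bpc/8, and the HDMI 2.0 TMDS limit of 600 MHz. The CTA-861 4K raster is 4400x2250 total, so 4K60 needs 594 MHz and fits, while 4K120 needs 1188 MHz, which only HDMI 2.1 FRL can carry.

```python
HDMI20_MAX_TMDS_KHZ = 600_000  # HDMI 2.0 TMDS limit (600 MHz character rate)


def pixel_clock_khz(h_total: int, v_total: int, refresh_hz: int) -> int:
    """Pixel clock implied by the total raster size and refresh rate."""
    return h_total * v_total * refresh_hz // 1000


def hdmi_tmds_mode_valid(clock_khz: int, bpc: int = 8) -> bool:
    """Hypothetical mode_valid-style check for a TMDS (non-FRL) HDMI link.

    For RGB / 4:4:4 output the TMDS character rate scales with bpc / 8.
    """
    tmds_khz = clock_khz * bpc // 8
    return tmds_khz <= HDMI20_MAX_TMDS_KHZ
```

Without a check like this, a userspace mode above the limit goes straight to the hardware instead of being rejected.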
01:03fdobridge_: <karolherbst🐧🦀> @gfxstrand ohhh.. I might know why we only have 253 regs on volta....
01:03fdobridge_: <karolherbst🐧🦀> so that those binaries run on turing as well :ferrisUpsideDown: or at least are easily patchable
01:03fdobridge_: <karolherbst🐧🦀> I think nvidia has use cases like that
01:03fdobridge_: <karolherbst🐧🦀> and "freeing up" two registers is endless pain
01:04fdobridge_: <karolherbst🐧🦀> and I suspect they might already have worked on uniform registers to know it's a potential problem
01:04fdobridge_: <karolherbst🐧🦀> or at least... that's my current theory
01:04karolherbst: HdkR: does that sound like something nvidia would do? :D
01:34RedSheep: karolherbst: Sorry for the noob question, but what is the preferred method to build a kernel that includes that patch? I just tried shoving the .patch file into the linux-git aur package and hoping for the best but I am fairly sure that's a dumb way to do this, and I don't think it applied.
01:39RedSheep: Either that or it didn't work. I tested by doing cvt 3840 2160 120, and the resulting modeline isn't reduced and therefore doesn't actually fit in the DP HBR3 bandwidth limit, and I still got locked up when shoving that through with xrandr.
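The bandwidth claim can be checked with rough arithmetic. DP HBR3 runs at 8.1 Gbit/s per lane with 8b/10b channel coding, so four lanes carry about 25.92 Gbit/s of payload. As an illustrative input, this sketch uses the CTA-861 4K120 timing (4400x2250 total at 120 Hz, a 1188 MHz pixel clock) rather than the cvt modeline, which is not quoted here. It ignores DSC and protocol overhead, and the helper names are hypothetical.

```python
HBR3_GBPS_PER_LANE = 8.1     # raw DP HBR3 link rate per lane
CODING_EFFICIENCY = 8 / 10   # 8b/10b channel coding (DP 1.4 link rates)


def dp_payload_gbps(lanes: int = 4, rate_gbps: float = HBR3_GBPS_PER_LANE) -> float:
    """Usable link bandwidth after channel coding."""
    return lanes * rate_gbps * CODING_EFFICIENCY


def stream_gbps(pixel_clock_khz: int, bpp: int) -> float:
    """Uncompressed video bandwidth for a pixel clock and bits per pixel."""
    return pixel_clock_khz * 1e3 * bpp / 1e9


def mode_fits_dp(pixel_clock_khz: int, bpp: int = 24, lanes: int = 4) -> bool:
    # Ignores DSC, MST and protocol overhead: a rough bound only.
    return stream_gbps(pixel_clock_khz, bpp) <= dp_payload_gbps(lanes)
```

At 24 bpp, 1188 MHz needs about 28.5 Gbit/s, above the ~25.9 Gbit/s HBR3 payload, while the 594 MHz 4K60 timing needs about 14.3 Gbit/s and fits.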
01:39karolherbst: uhhh... the log would tell... but I think doing it through the aur package is probably the sanest way for you
01:41karolherbst: RedSheep: but yeah.. I'm not sure we properly validate everything as well... so it might also fall through the cracks
01:42RedSheep: I will try doing the AUR package manually, I was using yay which might have trampled it somehow.
01:51fdobridge_: <karolherbst🐧🦀> @gfxstrand you have to make sure to run `lower_subgroups` after `nir_lower_non_uniform_access` as it can add `read_first_invocation`
01:51fdobridge_: <karolherbst🐧🦀> `dEQP-VK.descriptor_indexing.combined_image_sampler` seems to hit this
01:51fdobridge_: <karolherbst🐧🦀> but it doesn't look like a volta specific problem
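The ordering constraint can be illustrated with a toy pass pipeline. The pass names mirror the NIR passes mentioned above, but the bodies are hypothetical stand-ins operating on a list of opcode strings, not Mesa code: a pass that introduces `read_first_invocation` has to run before the pass that lowers subgroup operations, or the new intrinsic reaches the backend unlowered.

```python
from typing import Callable, List

Shader = List[str]  # toy IR: a flat list of opcode names


def lower_non_uniform_access(shader: Shader) -> Shader:
    """Toy stand-in: a non-uniform access gets a read_first_invocation."""
    out: Shader = []
    for op in shader:
        if op == "non_uniform_tex":
            out += ["read_first_invocation", "tex"]
        else:
            out.append(op)
    return out


def lower_subgroups(shader: Shader) -> Shader:
    """Toy stand-in: map subgroup intrinsics to a (hypothetical) hw op."""
    return ["hw_read_first" if op == "read_first_invocation" else op
            for op in shader]


def run(passes: List[Callable[[Shader], Shader]], shader: Shader) -> Shader:
    for p in passes:
        shader = p(shader)
    return shader
```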
01:53RedSheep: OK now I know for sure that yay had messed me up, doing it manually yielded an error with that hunk. I assume that means there is a conflict? Does that patch need to be rebased?
01:54karolherbst: possibly
01:55karolherbst: let me do that real quick
02:00fdobridge_: <karolherbst🐧🦀> @gfxstrand `dEQP-VK.graphicsfuzz.nested-for-loops-switch-fallthrough` also crashes, but this time something with phis 🙂
02:00fdobridge_: <karolherbst🐧🦀> and I think that's all...
02:01fdobridge_: <karolherbst🐧🦀> that read_first_invocation comes up in quite a bunch of tests
02:01karolherbst: RedSheep: https://gitlab.freedesktop.org/karolherbst/nouveau/-/commit/effb80a10f4ddb46c02cf21294f780a5c57db2b3.patch
02:11RedSheep: Awesome, let me see if that gets it to build
02:20RedSheep: Yes, that got it building, thank you! I will be back later to check that the patch has the expected effect
03:10fdobridge_: <gfxstrand> Right. That's easy enough to run again as needed.
06:11fdobridge_: <!DodoNVK (she) 🇱🇹> I originally published that patch to fix vkd3d-proton stuff but it's interesting that it makes some non-vkd3d games go further (also the pipeline caching MR could use a rebase)
06:15fdobridge_: <!DodoNVK (she) 🇱🇹> zmike will be happy for these performance improvements
07:29RedSheep: karolherbst: Yes I can confirm that patch works great, I tried the same steps from before trying to use xrandr to load a bogus modeline and the display didn't even flicker
08:08RedSheep: Huh. If I was reading you correctly it sounded like you expected that patch to only help with displayport, but I am seeing it also prevents me blowing everything up on HDMI as well, so that's great.
09:50HdkR: karolherbst: Doesn't sound right, pretty sure I poked at maximum register usage to test some limits.
10:02fdobridge_: <!DodoNVK (she) 🇱🇹> `Dec 11 11:58:06 RenoirBeast kwin_wayland[1446]: kwin_wayland_drm: failed to import XR24 gbm_bo for multi-gpu usage: Function not implemented` 🤔
11:37karolherbst: HdkR: mhh.. so the issue is, that on turing we are only really able to use 253 registers, which kinda makes sense if you assume uniform registers come from the same hardware block, however, on volta we apparently also need to reserve some, otherwise we run into OOR errors
11:38HdkR: Which to me doesn't seem right, but maybe I'm misremembering that there was some quirk there
11:39HdkR: The configured register count might truncate to some alignment as well which might explain needing +2
11:44karolherbst: HdkR: mhhh... would be weird tho
11:45karolherbst: I know that the alignment is 4 anyway, but
11:45karolherbst: but I think the hardware implicitly is doing it anyway
11:46karolherbst: maybe I need to dig a bit deeper, but so far doing a +2 makes it all work on volta and turing
11:46karolherbst: alignment of 3 is weird however
11:46karolherbst: so it's confusing on why +2 (and not +3) would fix it
11:47karolherbst: maybe on volta +1 is enough ...
11:54HdkR: Ah cool, I thought it was alignment of 4 but couldn't remember :P
11:54karolherbst: yeah... it might be 8 on turing+ now or something, but anyway...
11:54HdkR: As far as I'm aware with Turing, you shouldn't need to allocate the URegs
11:55karolherbst: yeah....
11:55karolherbst: sooo
11:55karolherbst: there might be other reasons :)
11:55karolherbst: the end result is kinda the same
11:55karolherbst: we need to allocate more
11:55karolherbst: and +2 makes it work reliably
11:55karolherbst: it's just that the maths on uregs checks out
11:56karolherbst: you got 64 uregs and 32 lanes in a warp, so it's 2*32 = 64
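Both numbers in this thread can be written down explicitly. The sketch assumes, as the unconfirmed working theory from this discussion, that 64 warp-wide uniform registers would cost 64 / 32 = 2 per-thread registers. It also encodes the empirical "+2" workaround with R255 (RZ) as the ceiling, which leaves R0 to R252, i.e. 253 usable registers. `gprs_to_program` is a hypothetical helper, not NAK's actual shader-header code.

```python
RZ_INDEX = 255     # R255 reads as zero, so it is never allocated
NUM_UGPRS = 64     # uniform registers per warp, as stated above
WARP_SIZE = 32
EXTRA_GPRS = 2     # empirical workaround from this discussion


def ugpr_footprint_in_gprs(num_ugprs: int = NUM_UGPRS, warp: int = WARP_SIZE) -> int:
    """Per-thread GPRs that warp-wide UGPRs would cost if carved out of the
    same register file (an assumption, not confirmed hardware behavior)."""
    return num_ugprs // warp


def gprs_to_program(max_reg_used: int, extra: int = EXTRA_GPRS) -> int:
    """Hypothetical: GPR count to program given the highest register index."""
    count = max_reg_used + 1 + extra
    if count > RZ_INDEX:
        raise ValueError("shader uses a register the workaround must reserve")
    return count
```

With R13 as the highest register this yields 16, the configuration that was later tested as working on Volta and Turing.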
11:56HdkR: If you're thinking you can't use beyond 253 registers though, Does that mean you think you can't use register 254?
11:56HdkR: and 255 is ZR so w/e there
11:56karolherbst: it would also include reg 253
11:57HdkR: ah, so you lose those top two.
11:57karolherbst: but yeah.. not sure we explicitly tried that
11:57HdkR: That seems wrong, I don't remember that being a problem
11:57karolherbst: should probably check with my turing GPU and that vulkan test spilling :)
11:57HdkR: Create a cuda kernel that burns a ton of registers? :D
11:58karolherbst: let me try on my volta first though
11:58karolherbst: as that's already plugged in
11:58HdkR: Should be the same since it isn't a ureg problem
11:58HdkR: Volta not supporting ureg at all
11:58karolherbst: I'd be happy knowing the real reasons :)
11:59HdkR: My brain is a sieve and I'm happier not remembering the finer details :P
12:00HdkR: Full-time FEX for five years at this point
12:02karolherbst: fair :D
12:04karolherbst: shoo.. that test I was testing with doesn't actually end up using r253 and r254 :')
12:17HdkR: hmm
12:19karolherbst: I messed up nouveau :')
12:19karolherbst: HdkR: but anyway... soo let's assume we allocate 16 registers, would you say that using R15 is okay?
12:19HdkR: Yea, R0-R15 in that case
12:21karolherbst: yeah.. doesn't work :)
12:21HdkR: huh
12:21karolherbst: as I said.. that + 2 makes it work reliably
12:21HdkR: And works sanely pre-volta as well I guess?
12:21karolherbst: yes
12:22karolherbst: it's only needed since volta
12:22karolherbst: I'm not sold on the "needed for ugprs" assumption, but that's the one making the most sense
12:22karolherbst: just it's weird for volta, unless they wanted to burn those regs for weird binary compatibility reasons
12:22HdkR: Curious, I did a bunch of work on Volta, which I don't remember that being a problem :D
12:22karolherbst: and already anticipated they add ugprs
12:23karolherbst: yeah.. dunno
12:23karolherbst: it's weird and we don't understand on why it's needed
12:23HdkR: Maybe something derpy in the header since that changed as well
12:24karolherbst: it didn't :)
12:24karolherbst: that's turing+
12:24HdkR: ah right
12:25karolherbst: so using R13 max and allocating 16 works.. let's try R14 with 16 :)
12:25karolherbst: mhh yeah.. that works on volta
12:26karolherbst: let's try on turing
12:28karolherbst: also works on turing...
12:28karolherbst: mhhhh
12:28karolherbst: I _wished_ we'd know what's going on here...
12:29karolherbst: HdkR: anyway... if you know of any other reasons we need to allocate more gprs (barriers? predicates? memory barriers? whatever else?) that would help :) until we know what's up, using +2 makes it work
12:34HdkR: Sadly I don't remember anything that consumes GPRs that need more allocation like that
12:34karolherbst: I wish I'd know
12:35karolherbst: would be funky if it's predicates tho
12:35fdobridge_: <marysaka> Totally forgot to reopen that against mesa main but here it is <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26633>
12:35karolherbst: but that would only explain +1
12:35HdkR: That would be funky :)
12:36karolherbst: yeah.. and cursed
12:36karolherbst: scoreboard barriers would also be cursed
12:36karolherbst: mem barriers? also
12:37karolherbst: barrier registers? mhhh maybe?
12:37karolherbst: but also cursed
12:37HdkR: Pretty sure all those are just fixed size resources so it is unnecessary
12:37karolherbst: given there are 16 special ones already
12:37karolherbst: yeah
12:37karolherbst: so yeah.. uniform regs was the only thing left which would make remotely sense
15:50fdobridge_: <mhenning> I could imagine it being something like: a yield saves some state in the last register or two
16:11fdobridge_: <karolherbst🐧🦀> pain...
16:11fdobridge_: <karolherbst🐧🦀> the annoying part is that the hardware thinks the shader shouldn't access it
16:53fdobridge_: <gfxstrand> It's entirely possible that Volta has the UGPR hardware but it's disabled.
16:56fdobridge_: <karolherbst🐧🦀> at least it feels like the ISA was designed with UGPRs in mind already
16:56fdobridge_: <karolherbst🐧🦀> but yeah... I have some shaders where doing +1 seems to also work
16:56fdobridge_: <karolherbst🐧🦀> but +2 is definitely needed in some cases
16:56fdobridge_: <karolherbst🐧🦀> I kinda wished we'd understand this better
16:59fdobridge_: <karolherbst🐧🦀> @gfxstrand do you have a tool to run raw shader binaries?
17:00fdobridge_: <gfxstrand> no
17:00fdobridge_: <karolherbst🐧🦀> maybe I rebase my trap handler thing, because it got everything.. I can just copy&paste a shader blob :ferrisUpsideDown:
17:00fdobridge_: <karolherbst🐧🦀> yeah...
17:00fdobridge_: <karolherbst🐧🦀> would be kinda fun to dig into that
17:13karolherbst: airlied: seems like your commit to handle ENODEV from kmsDisp breaks stuff :')
17:57Ilgaz: It looks like this nouveau bug is somehow happening on Intel 5500 GPU. https://bugzilla.opensuse.org/show_bug.cgi?id=1217748 . I accidentally launched gnome-terminal on i5500 graphics laptop and saw the exact same issue. Even more mysteriously same issue happens in "vanilla" kernel which is not modified by distro. How should I proceed? Both systems run opensuse tw
17:58Ilgaz: Basically gnome-terminal text area is corrupt. No hacks etc of any kind installed I don't actually use gnome at all
18:06karolherbst: Ilgaz: could be something up with fonts? dunno...
18:06karolherbst: ehh wait
18:06karolherbst: other issue
18:07karolherbst: Ilgaz: can you check if this issue is specific to gtk4?
18:07karolherbst: gnome-console is gtk4 where gnome-terminal is gtk3
18:07karolherbst: and you seem to use the new console one
18:08karolherbst: gtk4 uses more OpenGL so I'm not surprised if that hits new/old bugs
18:08karolherbst: or something
18:08karolherbst: could also be a bug inside gtk4
18:08Ilgaz: # GNOME Terminal 3.50.1 using VTE 0.74.1 +BIDI +GNUTLS +ICU +SYSTEMD
18:08karolherbst: and only hit by pre GL 4 hardware/drivers
18:08karolherbst: Ilgaz: sure, but the application looks like gnome console
18:09karolherbst: mhh maybe not?
18:09karolherbst: uhh
18:09karolherbst: I'm confused
18:10karolherbst: Ilgaz: ahh yeah.. ehh weird...
18:10karolherbst: Ilgaz: do you have any custom themes or something? because it looks kinda in between gnome-terminal and gnome-console...
18:10Ilgaz: karolherbst: I tried the "git" version which moved to gtk4 and it works fine. I am totally confused here too. I don't run any crazy hacks etc here. This is infamous OSX "shapeshifter" ape level issue
18:10karolherbst: ohh
18:10karolherbst: git of gnome-terminal?
18:10karolherbst: and that moved to gtk4?
18:11karolherbst: interesting
18:11karolherbst: but yeah.. in your screenshot it looks like gtk4, but in gnome-terminal...
18:12karolherbst: Ilgaz: gnome-terminal (3.48.1) looks like this here: https://i.imgur.com/ioPEfdj.png
18:13Ilgaz: yes gnome-terminal guys suggested me to try the git version as it moved to gtk4. I am checking if anything/theme installed . This one is stock gnome-terminal from tumbleweed stable
18:13karolherbst: mhhh
18:14karolherbst: what gnome version is that?
18:17Ilgaz: Version : 45.0-3.1
18:17karolherbst: I'm still on 44.6
18:17karolherbst: (should update to fedora 39 soon... )
18:19Ilgaz: I have run fedora 39 installer iso on nv9400/macbook 5.1 and it didn't have this gnome-terminal issue. I am running duperemove to get some space to try current fedora via distrobox. This video is taken on Intel HD5500 https://photos.app.goo.gl/dRjHiJFcdrzZdDR5A
18:20karolherbst: but yeah.. given that you also hit this on intel makes me think it's not really nouveau specific
18:21Ilgaz: karolherbst: while I love some of gnome tools I rarely use gnome desktop. Launching gnome-terminal was an accident. I am telling since I checked, I don't even have gtk theme set
18:22karolherbst: I see
18:22karolherbst: do you use plasma?
18:22karolherbst: I wonder if this is a bug inside breeze-gtk or whatever it's called
18:23Ilgaz: karolherbst: this could be related to SDDM glitch I reported on opensuse bugzilla but it was closed because of nvidia 9400 situation (upstream), I use plasma-wayland for a long time now on both intel and nvidia thanks to nouveau
18:23karolherbst: mhhh
18:23karolherbst: mind checking if gnome-terminal is fine with the adwaita theme set?
18:24Ilgaz: OK.
18:29Ilgaz: wow it is built in KDE theme. It works fine with adwaita!
18:31karolherbst: can still be a nouveau bug, but at least the variable to trigger this bug is the kde theme
18:32karolherbst: (or mesa bug)
18:33fdobridge_: <airlied> I doubt the ENODEV patch is the bug, just the driver probably doesn't load before that fix
18:34fdobridge_: <karolherbst🐧🦀> it does.. but runpm misbehaves it seems
18:34fdobridge_: <karolherbst🐧🦀> it's a GP107
18:34fdobridge_: <karolherbst🐧🦀> and at some point the nouveau gets reloaded or something?
18:34fdobridge_: <karolherbst🐧🦀> kinda odd
18:35fdobridge_: <karolherbst🐧🦀> ehh wait..
18:35fdobridge_: <karolherbst🐧🦀> it doesn't load, don't mind me :ferrisUpsideDown:
18:35Ilgaz: karolherbst: I have a lot of debug options set on the kernel as I am living with Intel 5500 freezes. How would I check if I get the same "oom" alert in dmesg, as it is really flooded here? not urgent of course
18:35fdobridge_: <karolherbst🐧🦀> but uhm.. it looks like a runpm bug
18:36karolherbst: Ilgaz: no idea.. normally you'd use something like systemd-oomd or disable swap so an oom doesn't take down your kernel
19:26Ilgaz: OK I will try the same thing (change gtk theme via kde settings) on the nouveau one.
19:27Ilgaz: If it is the same deal, I am changing the bug report to a KDE one rather than kernel
19:32Ilgaz: karolherbst: thanks again. I will try the same thing on nvidia 9400 when I get there and possibly change the bug category to KDE if it is the case
19:52fdobridge_: <karolherbst🐧🦀> @gfxstrand your suggestion breaks `nak_nir_lower_scan_reduce` 😢
19:53fdobridge_: <karolherbst🐧🦀> https://gist.githubusercontent.com/karolherbst/60508f76aca9a81093770f7bc5204c63/raw/771dc2d267c8d0a01caf79f446d93214913546ba/gistfile1.txt
19:53fdobridge_: <gfxstrand> You can move `nak_nir_lower_scan_reduce` too
19:53fdobridge_: <karolherbst🐧🦀> ohhh that's in preprocess...
19:54fdobridge_: <karolherbst🐧🦀> `nak_nir_lower_subgroup_id` as well?
19:54fdobridge_: <karolherbst🐧🦀> ehh
19:54fdobridge_: <karolherbst🐧🦀> that's before so it shouldn't matter
19:54fdobridge_: <karolherbst🐧🦀> okay, will try with that
19:55fdobridge_: <karolherbst🐧🦀> okay, it fixes that crash.. another run 🙂
19:56fdobridge_: <gfxstrand> \o/
19:58fdobridge_: <karolherbst🐧🦀> well.. not sure if it actually fixes tests, because I was too lazy to figure out which hit this bug :ferrisUpsideDown: but I shall know soon
19:58fdobridge_: <karolherbst🐧🦀> `Pass: 13472, UnexpectedImprovement: 2, ExpectedFail: 227, Skip: 75299, Duration: 2:07, Remaining: 1:32:56` maybe that's it...
20:12fdobridge_: <gfxstrand> Hrm... it looks like `mufu.rcp rZ` is giving me NaN... The CTS does not like this. :xenia_sob:
20:12fdobridge_: <karolherbst🐧🦀> still looks good: `Pass: 108827, UnexpectedImprovement: 14, ExpectedFail: 1745, Skip: 603914, Duration: 16:23, Remaining: 1:15:07`
20:12fdobridge_: <karolherbst🐧🦀> mhhhh.....
20:14fdobridge_: <karolherbst🐧🦀> +Inf is expected, no?
20:14fdobridge_: <karolherbst🐧🦀> or what's expected?
20:14fdobridge_: <karolherbst🐧🦀> or undefined?
20:16fdobridge_: <karolherbst🐧🦀> is that even an actual instruction in spirv...
20:17fdobridge_: <karolherbst🐧🦀> `Clarify that OpFDiv has a defined result when the divisor is 0. (MR !195.)`
20:18fdobridge_: <karolherbst🐧🦀> (not part of the spec)
20:22fdobridge_: <gfxstrand> Oh, that's a spectacular SPIR-V MR. Drop the text that says it's undefined without actually bothering to define it. 😂
20:23fdobridge_: <dadschoorse> inherits from IEEE 754-2008 then?
20:23fdobridge_: <gfxstrand> Oh, you'd like to think so, wouldn't you?
20:27fdobridge_: <gfxstrand> Hrm.. Maybe it's our lowering and not the hardware. But how are other drivers working?!?
20:27fdobridge_: <dadschoorse> what specific case is your issue?
20:28fdobridge_: <gfxstrand> fp64 division by 0
20:28fdobridge_: <gfxstrand> Which is, of course, an rcp but we have a lowering pass which does that by using 32-bit rcp as an approximation and then building the fp64 rcp from that.
20:30fdobridge_: <dadschoorse> yea radv uses that too
20:31fdobridge_: <dadschoorse> but the lowering pass explicitly returns +-inf for 0 input
20:31fdobridge_: <dadschoorse> so I don't see how that could go wrong
20:32fdobridge_: <dadschoorse> so together with the div lowering you get inf unless you divide 0 by 0
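A sketch of that approach in Python (Python floats are IEEE 754 doubles). The 32-bit reciprocal seed is modeled by rounding through float32, the exponent is split off with `math.frexp` so the seed never overflows float32, two Newton-Raphson steps refine it, and zero and infinity are special-cased so that `1/0` is `inf` and `0/0` ends up NaN. The structure follows the description above, but the step count and helper names are assumptions, not the NIR lowering's code.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a double to the nearest float32 value (models a 32-bit rcp)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def drcp_lowered(a: float) -> float:
    """fp64 reciprocal built from a float32-precision seed."""
    if a == 0.0:
        return math.copysign(math.inf, a)   # explicit +-inf for a zero input
    if math.isinf(a):
        return math.copysign(0.0, a)
    m, e = math.frexp(a)                    # a == m * 2**e, 0.5 <= |m| < 1
    try:
        r = math.ldexp(to_f32(1.0 / to_f32(m)), -e)   # ~24 correct bits
    except OverflowError:                   # reciprocal not representable
        return math.copysign(math.inf, a)
    for _ in range(2):                      # each step roughly doubles the bits
        err = 1.0 - a * r                   # residual; real lowerings use an fma
        r = r + r * err
    return r


def ddiv_lowered(x: float, y: float) -> float:
    return x * drcp_lowered(y)
```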
20:40fdobridge_: <gfxstrand> Yeah. I think I have a bug. The question is where
20:40fdobridge_: <gfxstrand> copy-prop is doing very strange things
20:44fdobridge_: <dadschoorse> am I stupid or is fp64 precision completely undefined in vulkan?
20:45fdobridge_: <dadschoorse> the "Precision of Individual Operations" section only has a table for fp32/fp16
20:45fdobridge_: <gfxstrand> 🌶️
20:47fdobridge_: <karolherbst🐧🦀> is that `MUFU.RCP64H`?
20:48fdobridge_: <karolherbst🐧🦀> thought that's 64 bit
20:48fdobridge_: <karolherbst🐧🦀> just the upper 32 bit of the result
20:48fdobridge_: <karolherbst🐧🦀> or rather.. upper 32 bit of input and output
20:48fdobridge_: <karolherbst🐧🦀> not sure if that's a bit better in the end
20:49fdobridge_: <karolherbst🐧🦀> CL took the other extreme and everything is 0 ULP
20:50fdobridge_: <karolherbst🐧🦀> well.. "everything"
20:50fdobridge_: <karolherbst🐧🦀> but fdiv is 0 ulp, where it's 2.5 for fp32 :ferrisUpsideDown:
20:51fdobridge_: <gfxstrand> Ooh! I want that.
20:52fdobridge_: <dadschoorse> AMD has something like that too fwiw
20:52fdobridge_: <gfxstrand> Mind digging up what AMD has?
20:52fdobridge_: <karolherbst🐧🦀> there is also `MUFU.RSQ64H`
20:53fdobridge_: <dadschoorse> > v_rcp_f64
20:53fdobridge_: <dadschoorse> > This opcode has (2**29)ULP accuracy and supports denormals.
20:53fdobridge_: <gfxstrand> It takes and returns 32 bits?
20:53fdobridge_: <dadschoorse> no, 64bit input/output, but fp32 precision internally
20:54fdobridge_: <karolherbst🐧🦀> sounds like what nvidia has
20:54fdobridge_: <karolherbst🐧🦀> it's intended as a starting point in your lowering sequence
20:54fdobridge_: <gfxstrand> @karolherbst Does `mufu.rcp64h` take/return a vec2?
20:54fdobridge_: <karolherbst🐧🦀> fp64 input/output
20:55fdobridge_: <gfxstrand> Okay
20:55fdobridge_: <gfxstrand> So yeah they're probably the same.
20:55fdobridge_: <gfxstrand> Or close enough.
20:55fdobridge_: <karolherbst🐧🦀> "upper 32 bits of input/output" to be precise
20:55fdobridge_: <karolherbst🐧🦀> so mhh
20:55fdobridge_: <karolherbst🐧🦀> maybe it's scalar?
20:55fdobridge_: <gfxstrand> Yeah that sounds scalar
20:55fdobridge_: <karolherbst🐧🦀> should be easy to verify 🙂
20:55fdobridge_: <dadschoorse> upper 32 bits of input/output sounds scary with NaN/Inf
20:56fdobridge_: <gfxstrand> Depends on the NaN but yea
20:56fdobridge_: <karolherbst🐧🦀> nvidia canonical NaN is 0x7ffffffff.....
20:57fdobridge_: <gfxstrand> That should be fine then
20:57fdobridge_: <gfxstrand> Well, no
20:57fdobridge_: <gfxstrand> It would throw away some nans
20:57fdobridge_: <gfxstrand> that'd be bad
20:58fdobridge_: <karolherbst🐧🦀> you can always check for nan in your lowering
20:59fdobridge_: <gfxstrand> `rcp.approx.f32`
20:59fdobridge_: <gfxstrand> I bet it's that in PTX
20:59fdobridge_: <karolherbst🐧🦀> possibly
20:59fdobridge_: <karolherbst🐧🦀> I could check if I can make nvidia generate it with CL C
21:00fdobridge_: <gfxstrand> Compute a fast, gross approximation to the reciprocal as follows:
21:00fdobridge_: <gfxstrand>
21:00fdobridge_: <gfxstrand> - extract the most-significant 32 bits of .f64 operand a in 1.11.20 IEEE floating-point format (i.e., ignore the least-significant 32 bits of a),
21:00fdobridge_: <gfxstrand>
21:00fdobridge_: <gfxstrand> - compute an approximate .f64 reciprocal of this value using the most-significant 20 bits of the mantissa of operand a,
21:00fdobridge_: <gfxstrand>
21:00fdobridge_: <gfxstrand> - place the resulting 32-bits in 1.11.20 IEEE floating-point format in the most-significant 32-bits of destination d, and
21:00fdobridge_: <gfxstrand>
21:00fdobridge_: <gfxstrand> - zero the least significant 32 mantissa bits of .f64 destination d.
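The quoted steps (this is the PTX ISA wording for `rcp.approx.ftz.f64`) can be modeled directly on the bit patterns. In this Python sketch the reciprocal of the truncated input is computed exactly, so it is more accurate than the hardware approximation; only the input truncation and output zeroing follow the quoted steps, and the helper names are hypothetical.

```python
import math
import struct

HI32_MASK = 0xFFFFFFFF_00000000


def f64_to_bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_f64(b: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def rcp_approx_hi32(a: float) -> float:
    """Read only the sign, 11-bit exponent and top 20 mantissa bits of `a`,
    and return a result whose low 32 bits are zero."""
    truncated = bits_to_f64(f64_to_bits(a) & HI32_MASK)  # ignore low 32 bits of a
    if truncated == 0.0:
        r = math.copysign(math.inf, truncated)
    else:
        r = 1.0 / truncated                  # exact here; approximate in hardware
    return bits_to_f64(f64_to_bits(r) & HI32_MASK)  # zero low 32 bits of d
```

Inputs that differ only in their low 32 bits, such as `1.0 + 2**-40`, are indistinguishable from `1.0` to this operation.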
21:00fdobridge_: <karolherbst🐧🦀> `rcp.rn.f64 %fd2, %fd1;` mhh.. whatever 😄
21:00fdobridge_: <karolherbst🐧🦀> `/*0070*/ MUFU.RCP64H R7, R5 ;`
21:00fdobridge_: <karolherbst🐧🦀> looks scalar
21:00fdobridge_: <karolherbst🐧🦀> https://gist.githubusercontent.com/karolherbst/971dcebd197b761b5cd86c1e43644060/raw/d31d89da8a4d7f9564ba237a5e6af2fcbd48af36/gistfile1.txt
21:00fdobridge_: <gfxstrand> Yeah, with odd numbers it is
21:00fdobridge_: <karolherbst🐧🦀> CL C: 1/a
21:01fdobridge_: <karolherbst🐧🦀> saves you the `f2f` apparently
21:02fdobridge_: <karolherbst🐧🦀> I like how they have a slowpath
21:02fdobridge_: <gfxstrand> But I think if we had `drcp_approx` and implement that in NV and AMD we should be good
21:02fdobridge_: <karolherbst🐧🦀> and the fastpath is like.. fast
21:02fdobridge_: <gfxstrand> We'd want to canonicalize the NaN on the input somehow, though.
21:03fdobridge_: <karolherbst🐧🦀> nvidia is doing that though, no?
21:03fdobridge_: <karolherbst🐧🦀> at least in the fast path
21:04fdobridge_: <gfxstrand> Not if they're throwing away the bottom 32 bits
21:04fdobridge_: <karolherbst🐧🦀> they still use the original input
21:04fdobridge_: <karolherbst🐧🦀> `DFMA R8, -R4, R6, 1 ;` // R4 == original value
21:04fdobridge_: <karolherbst🐧🦀> so there you get NaNs covered
21:04fdobridge_: <gfxstrand> Oh, yeah, that'll cover the NaN
21:04fdobridge_: <karolherbst🐧🦀> and it just propagates through
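A small demonstration of why the residual computed from the original input catches NaNs, assuming the reciprocal unit only sees the top 32 bits (as in the PTX description). A NaN whose only set mantissa bit lives in the low 32 bits turns into +inf when truncated, so the estimate is a harmless 0; the `DFMA R8, -R4, R6, 1` style residual (modelled here with `f64::mul_add`) then multiplies that 0 by the full 64-bit NaN, and the NaN propagates. Both helper names are hypothetical.

```rust
/// Keep only the top 32 bits of an f64, as the reciprocal unit does.
fn truncate_high32(x: f64) -> f64 {
    f64::from_bits(x.to_bits() & 0xFFFF_FFFF_0000_0000)
}

/// Returns (estimate, residual) for input `a` (hypothetical helper).
fn estimate_and_residual(a: f64) -> (f64, f64) {
    // The estimate is computed from the truncated input only.
    let r = 1.0 / truncate_high32(a);
    // Residual e = 1 - a * r, like `DFMA R8, -R4, R6, 1`, using the full `a`.
    let e = (-a).mul_add(r, 1.0);
    (r, e)
}
```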
21:05fdobridge_: <gfxstrand> So we don't care about NaN correctness
21:05fdobridge_: <gfxstrand> Yeah, I'm going to make a NIR op
21:07fdobridge_: <karolherbst🐧🦀> they actually do a function call there 😄
21:07fdobridge_: <karolherbst🐧🦀> in case you ever wondered what that would look like
21:09fdobridge_: <karolherbst🐧🦀> no idea what's special about `CALL/RET` vs `BRA` though
21:11fdobridge_: <karolherbst🐧🦀> as you see they save the return address in `R12` and use that
21:13fdobridge_: <karolherbst🐧🦀> mhh.. CALL can use cb sources.. where BRA can only take an imm and BRX a register..
21:14fdobridge_: <karolherbst🐧🦀> ohhh
21:14fdobridge_: <karolherbst🐧🦀> seems like `BRA` is the more magic of those two
21:14fdobridge_: <karolherbst🐧🦀> BRA is actually funky
21:15fdobridge_: <karolherbst🐧🦀> @gfxstrand soo.. BRA has some really nice features btw... `BRA.U` (branches when all threads agree) `BRA.DIV` (branch on divergence) `BRA.CON` (branch on convergence)
21:15fdobridge_: <karolherbst🐧🦀> sounds like something we could use in corner cases
21:16fdobridge_: <karolherbst🐧🦀> soooo
21:17fdobridge_: <karolherbst🐧🦀> .U works this way: all _active_ threads (decided by the instruction predicates) jump only if their _input_ predicate is all true
21:17fdobridge_: <karolherbst🐧🦀> or rather.. evaluates to true
21:17fdobridge_: <karolherbst🐧🦀> kinda funky
21:19fdobridge_: <dadschoorse> so `if (subgroupVoteAll(cond))`?
21:19fdobridge_: <karolherbst🐧🦀> sounds like it
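A toy model of the `.U` rule as described, assuming a 32-thread warp with one mask bit per thread: `active` is the set of threads enabled by the guard predicate and `pred` is each thread's branch condition. Both functions are hypothetical; the point is only that "taken iff every active thread's condition is true" is the same test as a vote-all over the active threads.

```rust
/// Hypothetical model of `BRA.U`: taken only if no active thread has a false
/// condition. Bit i of each mask corresponds to thread i of a 32-thread warp.
fn bra_u_taken(active: u32, pred: u32) -> bool {
    (active & !pred) == 0
}

/// Reference vote-all over the active threads, checked thread by thread.
fn vote_all(active: u32, pred: u32) -> bool {
    (0u32..32)
        .filter(|&i| (active >> i) & 1 == 1)
        .all(|i| (pred >> i) & 1 == 1)
}
```

Inactive threads do not take part in the vote, so a false condition on a thread that is masked off does not prevent the branch.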
21:20fdobridge_: <karolherbst🐧🦀> `.CONV`/`.DIV` take a uniform register input which is a thread mask of participating threads
21:21fdobridge_: <karolherbst🐧🦀> but those are a little more complicated
21:22fdobridge_: <karolherbst🐧🦀> but yeah.. sounds useful for some subgroup stuff
21:31fdobridge_: <karolherbst🐧🦀> @gfxstrand is `dEQP-VK.glsl.shader_clock.vertex.clockARB` failing a known issue?
21:32fdobridge_: <karolherbst🐧🦀> mhhh
21:32fdobridge_: <karolherbst🐧🦀> doesn't seem to fail when running alone
21:36fdobridge_: <karolherbst🐧🦀> ehh.. the compute variants are always failing
21:36fdobridge_: <karolherbst🐧🦀> but I think you brought that one up...
21:36fdobridge_: <karolherbst🐧🦀> ohh.. I found UB in nak.. fun
21:40fdobridge_: <karolherbst🐧🦀> a phi_src being different in some runs
21:40fdobridge_: <karolherbst🐧🦀> `dEQP-VK.graphicsfuzz.nested-for-loops-switch-fallthrough` is the test
21:43fdobridge_: <karolherbst🐧🦀> updated the volta MR (code + desc)
21:44fdobridge_: <gfxstrand> Yeah, that's the problem we were discussing a week ago or so
21:45fdobridge_: <karolherbst🐧🦀> @gfxstrand btw.. `assert!` stays in release code, what you want to use is `debug_assert!` :ferrisUpsideDown:
21:46fdobridge_: <karolherbst🐧🦀> I also have to update some places in rusticl with that...
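For reference, a minimal Rust sketch of the difference (the helper functions and the register limit are hypothetical, not NAK code): `assert!` is checked in every build profile, while `debug_assert!` is only checked when `debug_assertions` is enabled, which is the default for `cargo build` and off for `cargo build --release`.

```rust
/// Hypothetical register limit, for illustration only.
const MAX_REGS: u32 = 255;

fn reg_index(idx: u32) -> u32 {
    // Checked in debug AND release builds.
    assert!(idx < MAX_REGS, "register index {idx} out of range");
    idx
}

fn aligned_pair_base(idx: u32) -> u32 {
    // Checked only when debug_assertions is on; compiled out in release.
    debug_assert!(idx % 2 == 0, "64-bit register pair must start on an even register");
    idx
}
```

One reasonable split: keep `assert!` for invariants whose violation would silently produce wrong code, and use `debug_assert!` for sanity checks that are too costly or too paranoid for release builds.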
22:12Lyude: mhh, getting closer and closer but the laptop LCD still isn't working quite yet
22:13Lyude: I've definitely got it handling panel delays w/r/t aux transactions and link training properly now, and the display pushbuffer isn't falling over and crashing anymore - so now it's just an issue with whatever parameters we've got the display link trained on
22:17fdobridge_: <gfxstrand> Yeah. I'm aware
22:28fdobridge_: <airlied> Lyude: if you push a branch somewhere I can try and see if I can spot anything else
22:28Lyude: yeah sure thing one sec
22:31fdobridge_: <gfxstrand> @airlied What's the story with resizable BAR in nouveau?
22:31fdobridge_: <airlied> someone needs to write a story
22:32fdobridge_: <gfxstrand> 😕
22:32fdobridge_: <airlied> currently there is no story, and @karolherbst keeps saying there are problems mapping vram on the cpu, but we haven't worked out where
22:32fdobridge_: <gfxstrand> Mapping VRAM on the CPU works fine on Maxwell+. NVK does it all over.
22:32fdobridge_: <gfxstrand> It's cursed on Kepler and we don't know why.
22:32fdobridge_: <karolherbst🐧🦀> on my turing I only have 256M of mappable VRAM though 🥲
22:32fdobridge_: <airlied> okay, so that's probably in the "I'm not sure I care that much about Kepler" category 😛
22:32fdobridge_: <gfxstrand> The problem is that if we expose that to clients it's easy to run out and then bad stuff happens.
22:33fdobridge_: <gfxstrand> Yeah, that's because we need to resize the BAR.
22:33fdobridge_: <karolherbst🐧🦀> yeah...
22:33fdobridge_: <karolherbst🐧🦀> but apparently it depends also on the GPU
22:33fdobridge_: <gfxstrand> Exposing mapped VRAM is pretty important for DXVK
22:33fdobridge_: <karolherbst🐧🦀> and some do it on their own or something...
22:33fdobridge_: <karolherbst🐧🦀> yeah...
22:33fdobridge_: <airlied> well the BIOS usually should be doing it
22:33fdobridge_: <karolherbst🐧🦀> zink is where I ran into issues
22:33fdobridge_: <gfxstrand> Yeah, GPU, CPU, motherboard, phase of the moon, etc.
22:34fdobridge_: <airlied> make sure you enable 4G decode etc in the BIOS
22:34fdobridge_: <karolherbst🐧🦀> nvidia also doesn't resize BAR on my turing 🥲
22:34fdobridge_: <airlied> okay then probably not much we can do
22:34fdobridge_: <karolherbst🐧🦀> welll
22:34fdobridge_: <karolherbst🐧🦀> we can
22:34fdobridge_: <karolherbst🐧🦀> soooo
22:34fdobridge_: <airlied> since at least on Intel hw I'm not sure we can do it outside the BIOS
22:34fdobridge_: <airlied> Intel CPU
22:34fdobridge_: <karolherbst🐧🦀> how it works on nvidia is that you upgrade your vbios rom
22:34fdobridge_: <karolherbst🐧🦀> and then it works
22:34fdobridge_: <airlied> some AMD CPUs we can do it I think
22:34fdobridge_: <gfxstrand> I've not looked at the blob on my Turing but my blob box is a Haswell so no resizing BARs at all anyway.
22:35fdobridge_: <karolherbst🐧🦀> and some people had patches on the open driver to do it anyway
22:35fdobridge_: <airlied> I updated the BIOS on a bunch of my machines to get it to work
22:35fdobridge_: <airlied> but I haven't checked what the nvidia devices do
22:35fdobridge_: <karolherbst🐧🦀> https://github.com/NVIDIA/open-gpu-kernel-modules/pull/3
22:35fdobridge_: <airlied> once I get home I can play around a bit
22:36fdobridge_: <karolherbst🐧🦀> okay, cool
22:36fdobridge_: <gfxstrand> I should update my bios and see what happens
22:36fdobridge_: <karolherbst🐧🦀> it works on my AMD GPU on my machine
22:36fdobridge_: <gfxstrand> Any easy way to figure out how big my BAR is?
22:36fdobridge_: <karolherbst🐧🦀> so there is that
22:37Lyude: airlied: https://gitlab.freedesktop.org/lyudess/linux/-/commits/wip/nv-gsp-edp-fix here's what I've got so far
22:37karolherbst: Lyude: btw, saw my message about the nouveau patch?
22:37fdobridge_: <airlied> lspci -vv
22:37Lyude: I think I may have missed it? what was it?
22:37karolherbst: Lyude: https://patchwork.freedesktop.org/patch/544681/?series=119998&rev=1
22:38karolherbst: rebased version: https://gitlab.freedesktop.org/karolherbst/nouveau/-/commit/effb80a10f4ddb46c02cf21294f780a5c57db2b3.patch
22:38karolherbst: helps when some user adds modes the hw can't support and the system dies :')
22:38karolherbst: but also the modesetting DDX pushes modes regardless
22:38karolherbst: and then modesets to a mode not supported
22:38karolherbst: it's all terrible
22:39karolherbst: especially as it doesn't handle the initial modeset failing
22:39karolherbst: but with that patch it's all less terrible
22:43Lyude: hm, not totally sure if we want to be doing that directly from the connector atomic check - it also seems really strange to me that it's not already calling that hook elsewhere
22:43Lyude: karolherbst: does this happen if atomic is enabled in the kernel module?
22:43karolherbst: Lyude: it's not calling it for modes added by userspace
22:43karolherbst: yes
22:44karolherbst: userspace modes are not verified by drm core
22:45Lyude: hm. fwiw: the only reason I'm hesitant is since doing a drm_atomic_get_crtc_state() there inherently pulls the CRTC into the state even for things like connector prop changes that might not otherwise need or want that
22:45Lyude: I wonder how other drivers handle this or if they do
22:45Lyude: i can try checking in just a bit, I'm going to go grab some coffee first
22:45karolherbst: okay, cool :)
22:45fdobridge_: <gfxstrand> Looks like my motherboard has a toggle for ReBAR
22:45karolherbst: but yeah.. if there are better places to do it, fine
22:46karolherbst: Lyude: could also skip the check if there is no crtc_state attached
22:46karolherbst: as then it's probably already verified?
22:46karolherbst: or not relevant?
22:47karolherbst: something something.. I don't know what I'm doing there :')
22:49Lyude: I think so - that would be drm_atomic_get_new_crtc_state(). I'm going to check i915 though post-coffee. Also, why exactly is the modesetting DDX suggesting this mode via userspace?
22:54fdobridge_: <gfxstrand> `Region 1: Memory at 13800000000 (64-bit, prefetchable) [size=16G]`
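A hedged sketch for pulling the size out of a Region line like the one above, assuming the `[size=N<unit>]` suffix that lspci prints with binary K/M/G/T units; `bar_size_bytes` is a hypothetical helper. For comparison, 16G is 64 times the 256M mappable window mentioned for the Turing earlier.

```rust
/// Hypothetical helper: BAR size in bytes from an `lspci -vv` Region line,
/// or None if the line has no `[size=...]` field.
fn bar_size_bytes(line: &str) -> Option<u64> {
    let start = line.find("[size=")? + "[size=".len();
    let end = start + line[start..].find(']')?;
    let field = &line[start..end];
    // Binary units: K = 2^10, M = 2^20, G = 2^30, T = 2^40.
    let (num, shift) = match field.chars().last()? {
        'K' => (&field[..field.len() - 1], 10),
        'M' => (&field[..field.len() - 1], 20),
        'G' => (&field[..field.len() - 1], 30),
        'T' => (&field[..field.len() - 1], 40),
        _ => (field, 0),
    };
    num.parse::<u64>().ok().map(|n| n << shift)
}
```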
22:55Lyude: oh airlied, forgot to mention - the other two things I'm planning on trying with that branch next: first, an interesting trick that nvidia's driver does in dp_connectorimpl.cpp where they re-read the link rates over DPCD before starting training (in ConnectorImpl::train(), "Read link rate table before link-train to assure on-board re-driver knows link rate going to be set in link rate table."), since I can see the nvidia driver doing that on this machine. The other thing is implementing the EDP_PANEL_DATA call, but that's more for avoiding redundant MAIN_LINK_CTRL calls and handling other panels with quirks
22:56fdobridge_: <gfxstrand> Let's see if that lets me run with my mem types patch.
22:57fdobridge_: <gfxstrand> Do we have a way to query the BAR size from the kernel?
22:58karolherbst: Lyude: because "broken displays with broken edids" or something
22:58karolherbst: I've suggested to the modesetting DDX folks not to do this, and the response is always "not happening"
22:59karolherbst: so the end result is.. the modesetting DDX checks the modes it gets from the kernel
22:59Lyude: to be honest I don't think that broken EDIDs should be userspace's problem… but I also don't remember if there's some funny shenanigans X has that I'm forgetting about
22:59karolherbst: and if it sees 4K@30 as the max on a 4K@60 display, it adds the 4K@60 mode itself
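To make the bandwidth side concrete, a rough calculation under assumptions not stated in the log: standard CTA-861 4K timings (4400 x 2250 total pixels including blanking), 24 bits per pixel, and a 4-lane DisplayPort link with 8b/10b coding. The functions are hypothetical and ignore DSC, MST and FEC; they only illustrate why a kernel can keep 4K@30 while pruning 4K@60 on a slower link, and why a re-added 4K@60 mode then exceeds what the link carries.

```rust
const HBR_BPS_PER_LANE: u64 = 2_700_000_000; // 2.7 Gbit/s
const HBR2_BPS_PER_LANE: u64 = 5_400_000_000; // 5.4 Gbit/s

/// Pixel clock = total (active + blanking) pixels per frame times refresh rate.
fn pixel_clock_hz(h_total: u64, v_total: u64, refresh_hz: u64) -> u64 {
    h_total * v_total * refresh_hz
}

/// Hypothetical check: does the uncompressed pixel stream fit on the link?
fn fits_dp_link(clock_hz: u64, bits_per_pixel: u64, lane_bps: u64, lanes: u64) -> bool {
    let needed_bps = clock_hz * bits_per_pixel;
    // 8b/10b: only 8 of every 10 bits on the wire carry payload.
    let usable_bps = lane_bps * lanes * 8 / 10;
    needed_bps <= usable_bps
}
```

With these numbers 4K@30 needs 297 MHz x 24 = 7.128 Gbit/s against 8.64 Gbit/s usable on HBR x4, while 4K@60 needs 14.256 Gbit/s, which only fits at HBR2 (17.28 Gbit/s usable).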
23:00fdobridge_: <airlied> Not sure, might need to grow one, I think amdgpu has a query
23:00karolherbst: Lyude: yeah.. my point is just, that it's how it always was and suggesting to change it gives you a "nack" response
23:00Lyude: karolherbst: *squints* that seems, just completely incorrect??? do you have a link to where they nack'd it?
23:02karolherbst: somewhere on IRC
23:02karolherbst: however
23:02karolherbst: one could just write a patch against drm and see what happens...
23:02karolherbst: (or the DDX)
23:02karolherbst: though changing it in DRM would be a regression.. probably
23:02karolherbst: like rejecting not working modes
23:02karolherbst: maybe it would be fine if nobody files bugs
23:03fdobridge_: <gfxstrand> Yeah, a query would be good. The other thing we probably need is some sort of protection in the kernel so it starts failing allocs or mmaps or something when we run out.
23:03Lyude: so I can see some comments in i915 about x.org adding modes w/r/t DBLSCAN, so it at least seems like we're probably not the only one who's had to deal with this before
23:04karolherbst: ohh yeah
23:04fdobridge_: <gfxstrand> Right now if I throw my nvk/mem-types branch at a non-ReBAR setup, it'll just run out and get stuck.
23:04fdobridge_: <airlied> We just migrate on overallocation
23:04karolherbst: DBLSCAN is the insane part here
23:04fdobridge_: <airlied> Should all work already
23:04Lyude: OH hm
23:04karolherbst: which has _insane_ bandwidth reqs
23:04fdobridge_: <gfxstrand> Hehe. That "should" is load-bearing, I'm afraid.
23:04fdobridge_: <airlied> It's generic code
23:04karolherbst: so it tries modes which have higher reqs than the ones the kernel pruned already
23:05karolherbst: and then the kernel just accepts it
23:05karolherbst: it's uhh...
23:05fdobridge_: <gfxstrand> Grab my branch, throw a CTS run at your machine, and watch the fireworks.
23:05karolherbst: pain
23:05fdobridge_: <gfxstrand> (I'm running -j18 for my CTS runs)
23:06fdobridge_: <airlied> Got some backtraces?
23:06fdobridge_: <gfxstrand> Oh, it doesn't crash the kernel
23:06fdobridge_: <gfxstrand> Just everything gets stuck
23:06fdobridge_: <gfxstrand> All my contexts start timing out
23:07karolherbst: Lyude: at least mutter e.g. is smart about it and only adds modes if it doesn't get anything useful or something... but patching Xorg is also something I don't want to do 🙃
23:08Lyude: yeah I'm digging through i915 right now to see what their answer is, I have a feeling it's going to be something like adding a mode_valid check in the atomic_check path when we go through CRTCs
23:08Lyude: i'm just a bit surprised the core doesn't already do that
23:08karolherbst: I think it's intentional
23:08karolherbst: or at least that's the feeling I've gotten
23:08karolherbst: let me find the place...
23:09fdobridge_: <gfxstrand> @airlied If you want something that crashes, I get this every once in a while:
23:09fdobridge_: <gfxstrand> https://cdn.discordapp.com/attachments/1034184951790305330/1183908520454795335/message.txt?ex=658a0bbb&is=657796bb&hm=2636933d8b6232b9bc86287a447abe1b13de5984ebfcda513cf4cea8fd851672&
23:11Lyude: ok I wasn't crazy - drm_atomic_helper_check_modeset() does actually call to mode_valid()
23:12karolherbst: Lyude: I think the problem was that it never reaches the driver
23:12karolherbst: like the driver has no chance of pruning userspace modes
23:12karolherbst: as they never see them in any mode_valid call
23:12Lyude: karolherbst: I mean - that's in the actual atomic check path
23:14karolherbst: there was one tiny annoying detail about it...
23:14karolherbst: like
23:14karolherbst: Lyude: we have those different mode_valid checks, but one is special
23:14karolherbst: uhhh...
23:16karolherbst: Lyude: back in June I said this: "drm_connector_helper_funcs::mode_valid would be the _perfect_ place for it, but it's not called on userspace provided modes"
23:17karolherbst: ahh found it
23:17karolherbst: Lyude: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/drm/drm_modeset_helper_vtables.h?h=v6.7-rc5#n114
23:17karolherbst: ehh
23:17karolherbst: wrong one
23:17karolherbst: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/drm/drm_modeset_helper_vtables.h?h=v6.7-rc5#n933
23:17karolherbst: that one
23:17karolherbst: ...
23:18karolherbst: "This allows userspace to force and ignore sink constraint (...) , which is useful for e.g. testing, or working around a broken EDID."
23:18karolherbst: "Any source hardware constraint (which always need to be enforced) therefore should be checked in one of the above callbacks, and not this one here."
23:19karolherbst: that's kinda the core problem
23:19karolherbst: but we kinda need to know the connector to enforce certain constraints like the bandwidth limitation
23:19Lyude: OK I see I see
23:20Lyude: honestly then I think I know the solution - we just need to add the mode_valid() checks after drm_atomic_helper_check() or whatever the big common atomic check call is
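A toy model of the idea, with hypothetical types that are not the DRM API: after the common atomic checks pass, run the driver's own mode check on every enabled CRTC that is part of the new state. CRTCs not in the new state (`None` here) are skipped rather than pulled in, mirroring the drm_atomic_get_new_crtc_state() vs drm_atomic_get_crtc_state() concern raised earlier. `ModeStatus::ClockHigh` is loosely named after `MODE_CLOCK_HIGH`, and `clock_khz` uses kHz like the clock field of struct drm_display_mode.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ModeStatus {
    Ok,
    ClockHigh,
}

struct Mode {
    clock_khz: u32,
}

struct NewCrtcState {
    enabled: bool,
    mode: Mode,
}

/// Hypothetical source-side hardware limit check.
fn driver_mode_valid(mode: &Mode, max_clock_khz: u32) -> ModeStatus {
    if mode.clock_khz > max_clock_khz { ModeStatus::ClockHigh } else { ModeStatus::Ok }
}

/// `new_crtc_states[i]` is `None` when CRTC i is not part of this commit.
fn atomic_check(new_crtc_states: &[Option<NewCrtcState>], max_clock_khz: u32) -> Result<(), ModeStatus> {
    // The common helper checks would run before this loop.
    for state in new_crtc_states.iter().flatten() {
        if !state.enabled {
            continue;
        }
        match driver_mode_valid(&state.mode, max_clock_khz) {
            ModeStatus::Ok => {}
            err => return Err(err),
        }
    }
    Ok(())
}
```

Because the check runs on the committed state rather than on the probed mode list, it also sees modes that userspace added itself.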
23:20Lyude: let me write up a patch real quick
23:20karolherbst: :)
23:20karolherbst: cool
23:21Lyude: there's definitely a bit in my brain that is telling me "we should move the call somewhere else then I think" but, considering nouveau's future I don't think it's going to make much difference.
23:22karolherbst: also doesn't fix the modesetting DDX ignoring any failures on the initial modeset :')
23:22karolherbst: (or xorg)
23:22Lyude: yeah but that's been an issue for a while :(
23:22karolherbst: yeah...
23:22Lyude: it used to have a proper atomic implementation but then some folks pointed out that's kind of impossible with xrandr
23:22Lyude: (and they were right, it was!)
23:22karolherbst: pain
23:22karolherbst: but
23:23karolherbst: any future modeswitch gets handled correctly
23:23Lyude: how about after christmas we just flip the atomic switch tbh. i mean that won't fix this problem
23:23karolherbst: it's just the initial one being special
23:23Lyude: but i've wanted us to do it for a while anyway and just keep pushing it off
23:23karolherbst: yeah.. the problem also exists with atomic :)
23:23karolherbst: and even with atomic the initial modeset is still special
23:23karolherbst: can't win against X here
23:24Lyude: what do you mean by initial modeset exactly? or are you just talking about X constraints
23:24karolherbst: yeah, so if X starts it tries to switch a mode initially and doesn't care if it fails on the UAPI level
23:24karolherbst: even if those are userspace added modes from the modesetting DDX
23:24karolherbst: I can switch to a working mode just fine via ssh
23:25karolherbst: and X doesn't let me switch to the 4K@60 mode from there either
23:25karolherbst: it's just the one at "start time" which.. uhm.. is broken
23:25karolherbst: I can double check though
23:25karolherbst: and I also haven't figured out why that all is
23:28fdobridge_: <gfxstrand> Machine's dead again. 😩
23:28fdobridge_: <karolherbst🐧🦀> :ferrisSob:
23:44fdobridge_: <gfxstrand> Okay, here's my ddiv:
23:44fdobridge_: <gfxstrand> ```
23:44fdobridge_: <gfxstrand> /*0030*/ MUFU.RCP64H R4, R9 ; /* 0x0000000900047308 */
23:44fdobridge_: <gfxstrand> /*0040*/ MOV R13, R4 ; /* 0x00000004000d7202 */
23:44fdobridge_: <gfxstrand> /*0050*/ MOV R12, R0 ; /* 0x00000000000c7202 */
23:44fdobridge_: <gfxstrand> /*0060*/ DFMA R10, R12, R8, -1 ; /* 0xbff000000c0a742b */
23:45fdobridge_: <gfxstrand> /*0070*/ DFMA R10, -R12, R10, R12 ; /* 0x0000000a0c0a722b */
23:45fdobridge_: <gfxstrand> /*0080*/ DFMA R8, R10, R8, -1 ; /* 0xbff000000a08742b */
23:45fdobridge_: <gfxstrand> /*0090*/ DFMA R8, -R10, R8, R10 ; /* 0x000000080a08722b */
23:45fdobridge_: <gfxstrand> /*00a0*/ DMUL R8, R6, R8 ; /* 0x0000000806087228 */
23:45fdobridge_: <gfxstrand> ```
23:45fdobridge_: <gfxstrand> and it's still wrong in some corners. 😭
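A hedged Rust emulation of the sequence above, mapping each SASS line to one step (DFMA as `f64::mul_add`). Assumptions not stated in the log: `R0` holds 0, so `R12:R13` is the reciprocal estimate with a zero low word, and `rcp64h_approx` is a hypothetical stand-in for MUFU.RCP64H that only models truncation and flush-to-zero. Two Newton-Raphson steps, r' = r - r(rb - 1), bring ordinary inputs close to a/b, but a divisor of 0 or infinity makes the first residual compute inf * 0, which is NaN, even though the estimate itself (inf or 0) is not a NaN.

```rust
const HIGH_MASK: u64 = 0xFFFF_FFFF_0000_0000;

/// Flush a subnormal to a zero of the same sign.
fn ftz(x: f64) -> f64 {
    if x.is_subnormal() { 0.0f64.copysign(x) } else { x }
}

/// Hypothetical MUFU.RCP64H stand-in: truncate input, invert, truncate output.
fn rcp64h_approx(b: f64) -> f64 {
    let hi = ftz(f64::from_bits(b.to_bits() & HIGH_MASK));
    f64::from_bits(ftz(1.0 / hi).to_bits() & HIGH_MASK)
}

fn ddiv_fast(a: f64, b: f64) -> f64 {
    // MUFU.RCP64H R4, R9 ; MOV R13, R4 ; MOV R12, R0  -> estimate r0
    let r0 = rcp64h_approx(b);
    // DFMA R10, R12, R8, -1     : e0 = r0*b - 1
    let e0 = r0.mul_add(b, -1.0);
    // DFMA R10, -R12, R10, R12  : r1 = r0 - r0*e0
    let r1 = (-r0).mul_add(e0, r0);
    // DFMA R8, R10, R8, -1      : e1 = r1*b - 1
    let e1 = r1.mul_add(b, -1.0);
    // DFMA R8, -R10, R8, R10    : r2 = r1 - r1*e1
    let r2 = (-r1).mul_add(e1, r1);
    // DMUL R8, R6, R8           : q = a * r2
    a * r2
}
```

Separately from the special cases, a * (1/b) is not guaranteed to be correctly rounded, so even ordinary inputs can be off by an ulp or so compared to a true division.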
23:45fdobridge_: <karolherbst🐧🦀> mhhh
23:45fdobridge_: <karolherbst🐧🦀> yeah
23:45fdobridge_: <karolherbst🐧🦀> I mean.. there is a reason nvidia also has a slowpath 😄
23:46fdobridge_: <gfxstrand> What does their slow path generate?
23:46fdobridge_: <karolherbst🐧🦀> https://gist.githubusercontent.com/karolherbst/971dcebd197b761b5cd86c1e43644060/raw/d31d89da8a4d7f9564ba237a5e6af2fcbd48af36/gistfile1.txt
23:46fdobridge_: <gfxstrand> What's confusing the hell out of me is the fact that I'm getting NaNs.
23:46fdobridge_: <karolherbst🐧🦀> they have this `FSETP.GEU.AND P0, PT, |R6|, 5.8789094863358348022e-39, PT ;` condition
23:47fdobridge_: <karolherbst🐧🦀> and R6 is the upper 32 bits of the input slightly adjusted
23:48fdobridge_: <karolherbst🐧🦀> but that can also only matter for CL
23:48fdobridge_: <karolherbst🐧🦀> so ... dunno
23:48fdobridge_: <karolherbst🐧🦀> their fastpath is also a little different from yours
23:50fdobridge_: <gfxstrand> They're generating that whole pile of garbage for Vulkan, too. 😭
23:50fdobridge_: <gfxstrand> I just pulled a dump out of the blob for this CTS test
23:50fdobridge_: <karolherbst🐧🦀> 🥲
23:50fdobridge_: <karolherbst🐧🦀> the question is...
23:51fdobridge_: <karolherbst🐧🦀> is mesa's lowering worse?
23:52fdobridge_: <gfxstrand> Mesa's lowering doesn't produce correct values.
23:52fdobridge_: <karolherbst🐧🦀> kinda not surprising
23:52fdobridge_: <gfxstrand> But I'm not convinced that's because mesa's is wrong
23:52fdobridge_: <karolherbst🐧🦀> yeah...
23:52fdobridge_: <gfxstrand> I suspect nvidia's fast RCP is broken in both versions.
23:52fdobridge_: <karolherbst🐧🦀> maybe it works with `mufu.rcp`
23:52fdobridge_: <karolherbst🐧🦀> ahhh
23:52fdobridge_: <karolherbst🐧🦀> plausible
23:52fdobridge_: <gfxstrand> I'm very confused because `rcp.approx.ftz.f64` is supposed to only ever return NaN on a NaN input.
23:53fdobridge_: <gfxstrand> According to the PTX docs
23:53fdobridge_: <karolherbst🐧🦀> I can check what that produces 🙂
23:53fdobridge_: <karolherbst🐧🦀> ptx is more like spirv anyway
23:54fdobridge_: <gfxstrand> I suspect we need to do something like this and deal with infs ourselves or something
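One hedged reading of "deal with infs ourselves": check the IEEE 754 special cases before taking the reciprocal-then-multiply fast path. `ddiv_special_case` is a hypothetical helper. It handles NaN inputs, zero and infinite divisors, and an infinite dividend (which avoids inf * 0 if the estimate was flushed to zero), but it does not cover divisors whose reciprocal would be subnormal, which would still need a slow path like NVIDIA's.

```rust
/// Returns Some(IEEE result of a / b) for inputs the fast path gets wrong,
/// or None when the fast path can be used.
fn ddiv_special_case(a: f64, b: f64) -> Option<f64> {
    let negative = a.is_sign_negative() != b.is_sign_negative();
    let signed_inf = if negative { f64::NEG_INFINITY } else { f64::INFINITY };
    let signed_zero = if negative { -0.0 } else { 0.0 };
    if a.is_nan() || b.is_nan() {
        Some(f64::NAN)
    } else if b.is_infinite() {
        // inf / inf is invalid; finite / inf is a signed zero.
        Some(if a.is_infinite() { f64::NAN } else { signed_zero })
    } else if b == 0.0 {
        // 0 / 0 is invalid; nonzero / 0 is a signed infinity.
        Some(if a == 0.0 { f64::NAN } else { signed_inf })
    } else if a.is_infinite() {
        // inf / finite nonzero is a signed infinity.
        Some(signed_inf)
    } else {
        None
    }
}
```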
23:58fdobridge_: <karolherbst🐧🦀> @gfxstrand yeah soo.. `rcp.approx.ftz.f64` maps 1:1 to `MUFU.RCP64H`
23:59fdobridge_: <gfxstrand> Womp womp....