00:04fdobridge: <airlied> I know NAK is the future, but it would be nice to have fquantize2f16 in codegen just to avoid tests exploding
00:57fdobridge: <mhenning> @airlied want to point me to an example test? I might have a chance to hack at that
01:00fdobridge: <airlied> @mhenning dEQP-VK.spirv_assembly.instruction.graphics.opquantize*
02:43fdobridge: <airlied> @karolherbst for some reason I'm having trouble reproducing it now, because I changed kernels and gsp installs
03:20fdobridge: <gfxstrand> It's pretty easy to hook up. Just F2F twice with the right flags.
04:18fdobridge: <mhenning> yeah, pretty much have that written already. It's failing some tests for me on kepler but I haven't pulled cts in a year so currently building an updated version
04:24fdobridge: <gfxstrand> @airlied I think I know part of why CTS is so slow: nouveau seems pretty bad when it comes to submit latency.
04:25fdobridge: <gfxstrand> dEQP-VK.api.external.fence.sync_fd.export_multiple_times_temporary takes 1:12.85 on NVK and 23s on ANV.
04:32fdobridge: <airlied> strace isn't pointing out any obvious pain
04:35fdobridge: <mhenning> hmm cts update didn't fix it.
04:36fdobridge: <mhenning> well, it doesn't pass all the tests but maybe better than nothing https://gitlab.freedesktop.org/mhenning/mesa/-/commit/935f3a7cbf2cb2f149a9121af06b7fc0185ff84f
04:38fdobridge: <gfxstrand> Care to make an NVK MR?
04:38fdobridge: <gfxstrand> Looks good to me. IDK why it'd be failing on kepler. May be unrelated
04:38fdobridge: <gfxstrand> Kepler is basically not tested
04:39fdobridge: <gfxstrand> I've got one in my pile somewhere but I've only ever really done much testing on Turing and Maxwell
04:42fdobridge: <mhenning> Sure https://gitlab.freedesktop.org/nouveau/mesa/-/merge_requests/236
04:43fdobridge: <gfxstrand> Merged! Now @airlied can stop feeling bothered. 😝
04:49fdobridge: <gfxstrand> I wonder if syncobj is just somehow heavier than `GEM_CPU_PREP`? 🤔
04:49fdobridge: <gfxstrand> It shouldn't be but sometimes installing `dma_fence` callbacks does things
04:51fdobridge: <gfxstrand> @airlied Does the new `EXEC` ioctl support zero push ranges? It would be really nice if it did.
04:52fdobridge: <airlied> so just wait/signals?
04:52fdobridge: <gfxstrand> Yeah
04:52fdobridge: <gfxstrand> Otherwise we have to push a tiny thing with a NOP just so we have something to execute
04:53fdobridge: <airlied> nope currently it just succeeds on 0 push counts, but does nothing
04:54fdobridge: <gfxstrand> That's kinda useless...
04:54airlied: dakr: ^ feature request :)
04:55fdobridge: <airlied> probably quite easy to fix to run the sync objects
04:56fdobridge: <gfxstrand> Yeah
04:56fdobridge: <gfxstrand> Just need whatever your job thingy is to support 0 pushbufs
04:57fdobridge: <gfxstrand> It needs to wait on all in fences and whatever came before it on the context and then signal out fences
04:57fdobridge: <airlied> I suspect removing the check will just work
04:57fdobridge: <gfxstrand> I'm also realizing that I'm pretty sure `QueueWaitIdle()` is broken right now
04:58fdobridge: <gfxstrand> WRT sparse binding in particular
05:00fdobridge: <gfxstrand> And I kinda don't care
05:00fdobridge: <airlied> isn't that generic code now?
05:01fdobridge: <gfxstrand> Yeah
05:01fdobridge: <gfxstrand> Well, the generic code makes assumptions that I don't think are true basically anywhere
05:01fdobridge: <gfxstrand> It sends a dummy submit down the queue and assumes that implicit ordering works.
05:02fdobridge: <gfxstrand> But nothing guarantees ordering between sparse binds
05:02fdobridge: <gfxstrand> Or between sparse binds and submits
05:03fdobridge: <airlied> ah the current code pushes an empty, but yeah we could drop that if we fix the kernel
05:03fdobridge: <gfxstrand> Yeah but the kernel won't synchronize binds with execs
05:03fdobridge: <gfxstrand> but like I said I don't care right now
05:04fdobridge: <gfxstrand> The only way you'd hit that in practice is if you were doing a bind followed by `QueueWaitIdle` and then a submit with no other synchronization.
05:04fdobridge: <gfxstrand> And got really bad at losing races
05:05fdobridge: <gfxstrand> We probably should come up with a plan to fix it one of these days but I don't care for now
05:07fdobridge: <gfxstrand> The thing that's really frustrating is that, because binds can complete out-of-order, you basically have to track all in-flight binds separately and wait for all of them.
05:07fdobridge: <gfxstrand> You can't just have a timeline
05:08fdobridge: <gfxstrand> I mean, *maybe* the timeline syncobj semantics would make it safe for us but IDK. I'd have to think about it more and go do some reading.
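The out-of-order point above can be shown with a toy model. This is only a sketch: `BindTracker` and its methods are hypothetical names invented for illustration, not anything in NVK or the nouveau uAPI. It contrasts a single "highest completed" counter with tracking every in-flight bind.

```python
class BindTracker:
    """Toy model of in-flight sparse binds that may complete out of order.

    Hypothetical illustration only; not NVK or kernel code.
    """

    def __init__(self):
        self._next_id = 1
        self._pending = set()
        # What a naive single counter would remember.
        self.highest_completed = 0

    def submit(self) -> int:
        """Record a new bind and return its id."""
        bind_id = self._next_id
        self._next_id += 1
        self._pending.add(bind_id)
        return bind_id

    def complete(self, bind_id: int) -> None:
        """Mark one bind as finished, in any order."""
        self._pending.remove(bind_id)
        self.highest_completed = max(self.highest_completed, bind_id)

    def naive_idle(self) -> bool:
        """Wrong under out-of-order completion: only checks the newest id."""
        return self.highest_completed == self._next_id - 1

    def idle(self) -> bool:
        """Correct: idle only when no bind is still pending."""
        return not self._pending


tracker = BindTracker()
first, second, third = (tracker.submit() for _ in range(3))
tracker.complete(third)  # the newest bind happens to finish first
```

At this point the single counter claims the queue is idle while two binds are still pending, which is why every in-flight bind has to be tracked in this model.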
05:08fdobridge: <airlied> https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/airlied-testing-mctestface/ just testing the top patch now
05:10fdobridge: <airlied> https://gitlab.freedesktop.org/airlied/mesa/-/commits/nvk-new-uapi-wip-dummy
05:11fdobridge: <airlied> doesn't oops so should be working
05:12airlied: dakr: ^ you might want to pull in that top patch
05:16fdobridge: <gfxstrand> Looks plausible
05:16fdobridge: <gfxstrand> If this run survives, I'll pull that in and run it tomorrow
05:34fdobridge: <airlied> heaven on zink on nvk running slowly
05:34fdobridge: <airlied> and some corruption
05:36fdobridge: <gfxstrand> 🎉
05:39fdobridge: <gfxstrand> Okay, I'm going to have to spend some time figuring out why tests are taking so long. It isn't because we're running that many more. We're running a few more but not enough to increase runtimes 2x.
05:41fdobridge: <gfxstrand> I'm sure there's a reasonable explanation for it. Hopefully one which doesn't involve kernel patching.
05:51fdobridge: <airlied> I assume running with the old abi on the same kernel is back to the faster speed
05:52fdobridge: <gfxstrand> Yup
05:52fdobridge: <gfxstrand> So it's either something weird with the new exec path or syncobj
05:52fdobridge: <gfxstrand> Or possibly NVK taking a different, slower path
05:53fdobridge: <gfxstrand> It could also be device start-up time from all the stalling.
05:53fdobridge: <gfxstrand> I should test at the "stall all the time" commit
05:54fdobridge: <gfxstrand> My current run has 4 minutes left and then I'll try that
05:56fdobridge: <airlied> yeah that commit on its own might be responsible
05:59fdobridge: <gfxstrand> About a minute in, it's estimating 40 minutes. Slower but not that much. I'll let it run to completion to see
06:01fdobridge: <gfxstrand> Same if I just shut off the new uapi
06:01fdobridge: <gfxstrand> I'm going to let this run finish while I sleep
08:46fdobridge: <karolherbst🐧🦀> yeah mhh.. I think it might need a copysign thing, so that negative denorms become `0.0`?
08:46fdobridge: <karolherbst🐧🦀> or something?
08:47fdobridge: <karolherbst🐧🦀> or was that for huge negative values?
08:47fdobridge: <karolherbst🐧🦀> or maybe it's okay like this uhhh...
08:48fdobridge: <karolherbst🐧🦀> I'd have to think about it, but also...
09:18fdobridge: <dadschoorse> do you use the correct rounding mode? OpQuantizeToF16 requires rtne
09:19fdobridge: <dadschoorse> actually aco also gets that wrong, but I guess there is no cts test for float controls with fp16 rtz and OpQuantizeToF16
12:05fdobridge: <karolherbst🐧🦀> should be rtne by default
12:28fdobridge: <triang3l> Still implements D3D's rules? :nope_gears:
12:38fdobridge: <dadschoorse> wdym
12:39fdobridge: <dadschoorse> d3d doesn't have OpQuantizeToF16
12:39fdobridge: <mohamexiety> D3D, more specifically D3D9, had special rules wrt normalizing iirc
14:18fdobridge: <gfxstrand> Without the new uAPI but with the "stall all the time" commit, I get 37 min. Yeah, something is hella slow. I'll look into it later
14:21fdobridge: <gfxstrand> I'll try to look at that a bit later. At the moment, I'm going to kick off another run. This will be the 3rd new UAPI run I've done since rebooting the machine. Not enough to prove it's as stable as before but not bad
14:21fdobridge: <gfxstrand> Actually.... Let me grab @airlied 's kernel patch first and try with zero-size jobs
14:29fdobridge: <gfxstrand> @airlied We should probably have the same zero behavior on `VM_BIND` as well.
14:29fdobridge: <gfxstrand> That or error. Returning success without signaling fences is bad.
14:30fdobridge: <mhenning> The spec says you can round to either +/- 0, so I'm not sure a copysign would help. The tests for denorms all pass, just the ones labeled "too small" fail - note that also the "negative too small" ones pass
14:31fdobridge: <mhenning> and yes, we're rounding to nearest
14:32fdobridge: <gfxstrand> "too small" as in inf flushing?
14:33fdobridge: <mhenning> no. I can look up the exact value again
14:34fdobridge: <mhenning> from the name I assumed it's less than an f16 denorm or something
14:35fdobridge: <gfxstrand> Could be
14:35fdobridge: <dadschoorse> does your ftz only flush input denorms maybe?
14:36fdobridge: <karolherbst🐧🦀> ohh wait.. I think there was something like that...
14:36fdobridge: <karolherbst🐧🦀> *reads ISA docs*
14:36fdobridge: <mhenning> the forbidden isa docs
14:36fdobridge: <karolherbst🐧🦀> forbidden indeed
14:36fdobridge: <gfxstrand> That patch is no good. IDK what it's doing wrong but it definitely blows up the whole CTS.
14:36fdobridge: <gfxstrand> I need to get me some of those forbidden docs....
14:37fdobridge: <dadschoorse> you could try if putting it on the f2f32 fixes the issues in that case
14:37fdobridge: <karolherbst🐧🦀> nope.. .FTZ flushes inputs and outputs
14:37fdobridge: <karolherbst🐧🦀> sign-preserving btw
14:38fdobridge: <dadschoorse> non-sign-preserving denorm flushing would be cursed
14:40fdobridge: <mhenning> Looks like the value is 2 ^ -16 https://github.com/KhronosGroup/VK-GL-CTS/blob/main/external/vulkancts/modules/vulkan/spirv_assembly/vktSpvAsmInstructionTests.cpp#L9062
14:43fdobridge: <dadschoorse> it looks to me like emitF2F doesn't even use ftz?
14:43fdobridge: <dadschoorse> or maybe I'm looking at the assembler for the wrong gen
14:45fdobridge: <mhenning> We do emit ftz for OP_CVT on kepler gen 2, which is what I'm running https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/codegen/nv50_ir_emit_gk110.cpp#L1091
14:45fdobridge: <mhenning> I'd be curious if other gens work
14:46fdobridge: <karolherbst🐧🦀> also handled on volta
14:50fdobridge: <karolherbst🐧🦀> mhhh
14:50fdobridge: <karolherbst🐧🦀> yeah sooo...
14:50fdobridge: <karolherbst🐧🦀> f16 denorms shouldn't be flushed by default, no?
14:51fdobridge: <karolherbst🐧🦀> though p-16 isn't an f16 denorm, is it?
14:52fdobridge: <karolherbst🐧🦀> ehh wait
14:52fdobridge: <karolherbst🐧🦀> it is
14:52fdobridge: <karolherbst🐧🦀> I think we have to preserve denorms on the fp16 side
14:52fdobridge: <karolherbst🐧🦀> depending on the fp16 mode that is
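A quick way to check the "is 2^-16 an f16 denorm" question is to look at the fp16 bit pattern. The sketch below uses only Python's standard `struct` module (format `e` is IEEE half precision); the helper names `f16_bits` and `f16_class` are made up for this illustration.

```python
import struct


def f16_bits(value: float) -> int:
    """Return the 16-bit pattern of value after conversion to fp16."""
    return struct.unpack('<H', struct.pack('<e', value))[0]


def f16_class(value: float) -> str:
    """Classify the fp16 encoding of value by its exponent/mantissa fields."""
    bits = f16_bits(value)
    exponent = (bits >> 10) & 0x1F  # 5-bit exponent field
    mantissa = bits & 0x3FF         # 10-bit mantissa field
    if exponent == 0:
        return "zero" if mantissa == 0 else "denorm"
    if exponent == 0x1F:
        return "inf" if mantissa == 0 else "nan"
    return "normal"


# The smallest normalized fp16 value is 2^-14, so 2^-16 lands in the
# denormal range: exponent field 0, mantissa 0x100 (256 * 2^-24).
kind = f16_class(2.0 ** -16)
```

So the CTS "too small" value is exactly representable in fp16, just not as a normalized number.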
14:54fdobridge: <karolherbst🐧🦀> @mhenning mind trying to replace the first `CVT.FTZ` with a `MUL 1` + `CVT` without flushing?
14:54fdobridge: <karolherbst🐧🦀> though not sure if we'd optimize that mul away...
14:55fdobridge: <dadschoorse> OpQuantizeToF16 has to flush fp16 denorms
14:55fdobridge: <dadschoorse> always
14:55fdobridge: <dadschoorse> > If the magnitude of Value is too small to represent as a normalized 16-bit floating-point value, the result may be either +0 or -0.
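As a reference for what the tests expect, here is a minimal software sketch of the quantize behavior described in this thread: round to fp16 with round-to-nearest-even, flush anything below the smallest normalized fp16 value to zero, then widen again. It is not the codegen or NAK implementation. The name `quantize_to_f16` is hypothetical, the input is a Python double rather than a true fp32, and keeping the sign of the flushed zero is just one of the two results the spec sentence above allows.

```python
import math
import struct

F16_MIN_NORMAL = 2.0 ** -14  # smallest normalized fp16 magnitude


def quantize_to_f16(value: float) -> float:
    """Round value to fp16 precision and flush fp16 denorms to zero."""
    if math.isnan(value) or math.isinf(value):
        return value
    try:
        # struct's 'e' format converts with round-to-nearest-even.
        half = struct.unpack('<e', struct.pack('<e', value))[0]
    except OverflowError:
        # Too large for fp16: the result is a signed infinity.
        return math.copysign(math.inf, value)
    if abs(half) < F16_MIN_NORMAL:
        # Too small for a normalized fp16: flush to (signed) zero.
        return math.copysign(0.0, value)
    return half


too_small = quantize_to_f16(2.0 ** -16)
```

The explicit magnitude comparison is the same idea as the "do a comparison" fallback mentioned later in the log for hardware where the flush flag does not cover fp16 denorms.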
14:56fdobridge: <karolherbst🐧🦀> I see...
14:56fdobridge: <karolherbst🐧🦀> so maybe the output is indeed not flushed...
14:56fdobridge: <karolherbst🐧🦀> weird
14:56fdobridge: <karolherbst🐧🦀> oh well.. not going to argue with the hardware
14:57fdobridge: <karolherbst🐧🦀> @mhenning mind adding `ftz` to the second CVT and see if that fixes it?
14:57fdobridge: <karolherbst🐧🦀> and .. what's the error anyway.. it returns 0 or the value?
14:58fdobridge: <dadschoorse> cts expects 0
14:58fdobridge: <mhenning> I tried adding ftz to the second one last night and it didn't help
14:59fdobridge: <mhenning> Granted, the nearest fp16 value in this case is +zero so I'm not sure it's a flushing issue
14:59fdobridge: <karolherbst🐧🦀> maybe it's broken on older gens.. 😄
15:00fdobridge: <mhenning> yeah, that's part of what I'm wondering
15:00fdobridge: <karolherbst🐧🦀> the question is what does the hardware produce and what's expected
15:00fdobridge: <karolherbst🐧🦀> also.. what does the shader look like
15:00fdobridge: <mhenning> Right, I don't know how to get that info out of cts
15:00fdobridge: <karolherbst🐧🦀> codegen could be doing something silly
15:01fdobridge: <mhenning> yeah, I didn't get a chance to look at the asm last night
15:01fdobridge: <mhenning> I don't have a ton of time right now so might revisit tonight
15:02fdobridge: <karolherbst🐧🦀> meanwhile my ampere GPU keeps crashing 🥲
15:02fdobridge: <karolherbst🐧🦀> or rather
15:02fdobridge: <karolherbst🐧🦀> the kernel is getting confused I think
15:02fdobridge: <mhenning> yeah, ampere is pretty unstable for me
15:02fdobridge: <dadschoorse> unrelated question, what does dnz do for non mul opcodes? I see it's used for mul_zero_wins in some places, but I don't see what it would do for cvt for example
15:03fdobridge: <karolherbst🐧🦀> it's only available on MUL afaik
15:03fdobridge: <karolherbst🐧🦀> and FFMA
15:03fdobridge: <karolherbst🐧🦀> yep
15:03fdobridge: <karolherbst🐧🦀> (and other floating point MULs)
15:04fdobridge: <mhenning> The hope is that it would be flushing fp16 denorms to zero in this case (which can be normalized for the fp32 source), although I have not observed the flag actually making a difference
15:05fdobridge: <karolherbst🐧🦀> mhhh
15:05fdobridge: <karolherbst🐧🦀> I think .FTZ has no effect on fp16
15:06fdobridge: <karolherbst🐧🦀> the PTX docs have some comments on that
15:09fdobridge: <karolherbst🐧🦀> yeah.. there are some more comments of that being a FP32 only thing actually
15:10fdobridge: <gfxstrand> Ugh... Does it work if we put the ftz on both?
15:10fdobridge: <gfxstrand> It's supposed to be inputs and outputs.
15:10fdobridge: <gfxstrand> Oh, wait, no it wouldn't
15:11fdobridge: <gfxstrand> That sucks.
15:11fdobridge: <karolherbst🐧🦀> yep...
15:11fdobridge: <gfxstrand> We're going to have to do the dumb ANV thing and do a comparison, I guess.
15:11fdobridge: <karolherbst🐧🦀> maybe there is a context flag somewhere...
15:11fdobridge: <gfxstrand> Context flag doesn't help. It needs to be on that instruction.
15:11fdobridge: <dadschoorse> you wouldn't want to flush all fp16 denorms
15:11fdobridge: <dadschoorse> games break if you do that
15:12fdobridge: <karolherbst🐧🦀> mhh.. right...
15:12fdobridge: <gfxstrand> That's okay. So quantize ends up being 4 instructions instead of 2 on Kepler and earlier. That's not a big deal.
15:12fdobridge: <karolherbst🐧🦀> @gfxstrand I'm sure this issue exists on all generations
15:12fdobridge: <karolherbst🐧🦀> or does it work on turing for you?
15:12fdobridge: <gfxstrand> Works great on Turing
15:12fdobridge: <karolherbst🐧🦀> mhhh
15:13fdobridge: <karolherbst🐧🦀> might be a legacy issue then
15:14fdobridge: <karolherbst🐧🦀> maybe there is something fancy in the encoding
15:14fdobridge: <karolherbst🐧🦀> some bit we don't know about
15:15fdobridge: <karolherbst🐧🦀> oh well.. would be good to figure out on what gens it works
15:17fdobridge: <dadschoorse> maybe it only works on generations that support fp16 arithmetic?
15:17fdobridge: <dadschoorse> iirc kepler doesn't
15:18fdobridge: <karolherbst🐧🦀> ohh.. probably
15:18fdobridge: <karolherbst🐧🦀> kepler only supports it for textures
15:19fdobridge: <karolherbst🐧🦀> let's see...
15:20fdobridge: <karolherbst🐧🦀> pascal was the first with fp16 I think?
15:20fdobridge: <karolherbst🐧🦀> yep..
15:20fdobridge: <karolherbst🐧🦀> SM60+ which is Pascal
15:21fdobridge: <dadschoorse> maxwell tegra stuff supports fp16 too, I think
15:22fdobridge: <karolherbst🐧🦀> well.. depends in what sense
15:22fdobridge: <karolherbst🐧🦀> that's SM53 anyway...
15:22fdobridge: <karolherbst🐧🦀> might be the first one in the end..
15:23fdobridge: <karolherbst🐧🦀> ahh indeed
15:24fdobridge: <karolherbst🐧🦀> at least according to wikipedia, SM53 added fp16 arithmetic
15:25fdobridge: <karolherbst🐧🦀> but no fp16 atomics, that's SM60
15:33fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Does SM mean Shader Model?
15:33fdobridge: <karolherbst🐧🦀> yeah
15:34fdobridge: <karolherbst🐧🦀> but the CUDA one
15:34fdobridge: <karolherbst🐧🦀> it's also sometimes called "compute capability"
15:35fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> So it's another case of HLSL/Cg? 🐸
15:35fdobridge: <karolherbst🐧🦀> ehh.. it relates to PTX
15:36fdobridge: <karolherbst🐧🦀> https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
15:36fdobridge: <karolherbst🐧🦀> there you sometimes see references to `sm_`
15:36fdobridge: <karolherbst🐧🦀> but they also have tables with limits and funky stuff like that
15:36fdobridge: <karolherbst🐧🦀> it's a good way to figure out what a GPU supports on the ISA level
15:37fdobridge: <karolherbst🐧🦀> but... ptx is a high level language still
15:37fdobridge: <karolherbst🐧🦀> and it might or might not translate to direct ISA instructions
15:38fdobridge: <karolherbst🐧🦀> but it's close enough for a lot of things
16:53fdobridge: <mohamexiety> iirc it stands for "Streaming Multiprocessor"; it comes from back when the individual execution units were called Streaming Processors in the Tesla days. groups of SPs were then called SMs
16:55fdobridge: <phomes> I am a bit stuck with the last bits of the conditional rendering implementation using MME. I tried to experiment with SET_RENDER_ENABLE again, but I might be doing something stupid
16:56fdobridge: <phomes> I cannot get even a simple case using a hard coded value to work. Is it valid to use nvk_cmd_buffer_upload_data to set up the value?
16:57fdobridge: <phomes> simple test here https://gitlab.freedesktop.org/phomes/mesa/-/commit/211a6f55a5171a7685dab67a305af54ac12829f2
19:37fdobridge: <gfxstrand> Yeah, prior to full fp16 support, all you really needed it for was the fp16 pack opcodes which have different rules.
19:37fdobridge: <gfxstrand> We explicitly chose not to care about how many instructions quantize takes and require it to flush/round all the things so it can be used to emulate worst-case hardware.
19:40fdobridge: <karolherbst🐧🦀> yeah, that's fair I guess
19:57fdobridge: <airlied> @gfxstrand blows up how? I did a full CTS run here with that patch yesterday, I'll recheck it
20:00fdobridge: <gfxstrand> Here's the branch I'm running: https://gitlab.freedesktop.org/gfxstrand/linux/-/commits/nvk-uapi/
20:00fdobridge: <gfxstrand> Maybe I'm missing patches? What I'm seeing is LOTS of tests fail when I submit an empty exec for the `cmd_buffer_count == 0` case.
20:00fdobridge: <gfxstrand> Oh, shit...
20:01fdobridge: <gfxstrand> I think I screwed it up.
20:02fdobridge: <gfxstrand> Yup, I flubbed it. It seems to be running okay, now. 🙈
20:09fdobridge: <karolherbst🐧🦀> Ada + GSP is an unstable mess so I have to run with 2 threads.. but at least my MR does seem to fix a significant amount of tests: `Pass: 4812, Fail: 1, UnexpectedPass: 8, ExpectedFail: 69, Skip: 21609, Flake: 1, Duration: 1:54, Remaining: 2:23:13` ... let's see how that looks in 2 hours 🙃
20:10fdobridge: <karolherbst🐧🦀> I should check out if we have some weirdo races on Ampere+ because also Ampere without GSP is quite unstable for me
20:10fdobridge: <karolherbst🐧🦀> probably something silly going on
20:10fdobridge: <airlied> My GSP Ampere is pretty solid for VK cts
20:11fdobridge: <karolherbst🐧🦀> are you running with like 12 threads+ by default?
20:11fdobridge: <airlied> Yup all the threads
20:11fdobridge: <karolherbst🐧🦀> heh
20:11fdobridge: <airlied> Haven't tried GL
20:11fdobridge: <karolherbst🐧🦀> GL is fine
20:11fdobridge: <karolherbst🐧🦀> VK isn't
20:11fdobridge: <karolherbst🐧🦀> I think it has something to do with recovery
20:11fdobridge: <airlied> Takes about an hour to finish a full run, did like 5 yesterday
20:11fdobridge: <karolherbst🐧🦀> I'm also running like all the tests
20:11fdobridge: <karolherbst🐧🦀> mhhh
20:11fdobridge: <airlied> Oh I just run Faith's script
20:11fdobridge: <karolherbst🐧🦀> well.. guess you are more lucky than I am
20:12fdobridge: <karolherbst🐧🦀> ahh.. no, I run everything
20:12fdobridge: <airlied> Might be possible to narrow it down then
20:12fdobridge: <karolherbst🐧🦀> last complete run: `Pass: 369765, Fail: 4302, Crash: 205, Warn: 4, Skip: 1640771, Timeout: 14, Flake: 182, Duration: 3:29:03, Remaining: 0`
20:12fdobridge: <karolherbst🐧🦀> nah..
20:12fdobridge: <karolherbst🐧🦀> I don't think it's a specific test
20:12fdobridge: <karolherbst🐧🦀> just recovery being broken
20:13fdobridge: <karolherbst🐧🦀> but anyway.. seems my MR fixes like 10% of the fails?
20:14fdobridge: <karolherbst🐧🦀> not sure if you ran it already and seen similar results
20:15fdobridge: <karolherbst🐧🦀> it kinda feels like that we can trash other channels... my working theory is that if a process has the channel id 4, it gets reaped and continues to submit stuff, another process also getting id 4 assigned (after the old 4 was reaped) and then things break
20:15fdobridge: <karolherbst🐧🦀> but not entirely sure about that yet
20:15fdobridge: <airlied> I did but I was having trouble reproducing on the latest kernel
20:16fdobridge: <airlied> Not sure if skeggsb latest gsp branch reports different numbers for my card now
20:16fdobridge: <karolherbst🐧🦀> @airlied I was testing with `dEQP-VK.spirv_assembly.instruction.compute.opphi.wide` btw
20:16fdobridge: <karolherbst🐧🦀> nah
20:16fdobridge: <karolherbst🐧🦀> it has nothing to do with gsp
20:16fdobridge: <karolherbst🐧🦀> try that test, it kinda makes codegen spill and use quite a bunch of local mem
20:17fdobridge: <karolherbst🐧🦀> anyway, I'm not seeing any local mem errors in dmesg with my branch
20:23fdobridge: <airlied> Merge it!
20:24fdobridge: <karolherbst🐧🦀> nobody reviewed it yet
20:24fdobridge: <karolherbst🐧🦀> but it gets more promising over time: `Pass: 42322, Fail: 6, UnexpectedPass: 58, ExpectedFail: 422, Skip: 187680, Flake: 12, Duration: 17:27, Remaining: 2:15:13`
20:24fdobridge: <karolherbst🐧🦀> *becomes
20:36fdobridge: <gfxstrand> MR?
20:37fdobridge: <karolherbst🐧🦀> https://gitlab.freedesktop.org/nouveau/mesa/-/merge_requests/234
20:38fdobridge: <airlied> @karolherbst can confirm that test passes now for me, and fails before, not sure why my runs aren't catching it
20:38fdobridge: <karolherbst🐧🦀> mhhh.. maybe some install fail? dunno
20:38fdobridge: <karolherbst🐧🦀> I don't know if my runs actually do catch it either, but... it would be a big coincidence by now: `Pass: 77768, Fail: 11, UnexpectedPass: 143, ExpectedFail: 679, Skip: 343876, Flake: 23, Duration: 31:33, Remaining: 1:58:58` 😄
20:39fdobridge: <karolherbst🐧🦀> maybe you had an older version of my branch or something?
20:42fdobridge: <karolherbst🐧🦀> luckily we don't hit this issue in GL because we never resize the TLS buffer in the first place and just size it "big enough" from the start 🙃
20:45fdobridge: <gfxstrand> @karolherbst were you going to make a GL MR to go with it?
20:45fdobridge: <karolherbst🐧🦀> for the num_gprs I already merged it: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24261
20:45fdobridge: <karolherbst🐧🦀> for TLS it doesn't matter
20:46fdobridge: <karolherbst🐧🦀> though I should fix it there as well, but in GL if you run out of local memory you are screwed anyway
20:46fdobridge: <karolherbst🐧🦀> (should probably fix that, but...)
20:49fdobridge: <gfxstrand> Okay, I'm going to rebase nvk/main to pick that up so we don't duplicate the compiler change.
20:49fdobridge: <esdrastarsis> typo: unkown
20:50fdobridge: <karolherbst🐧🦀> that chunk should drop on a rebase anyway, but yeah..
21:30fdobridge: <gfxstrand> Yup
21:30fdobridge: <gfxstrand> Okay, nvk/main has been rebased on mesa/main. This included picking up @mhenning 's "get rid of TGSI" stuff which is nice but, sadly, doesn't let us drop the one remaining bit of TGSIness. 😭
21:31fdobridge: <gfxstrand> I've also merged the GPR fixes
21:31fdobridge: <karolherbst🐧🦀> cool
21:31fdobridge: <gfxstrand> And I'm going to do a turing run to see what all the GPR fixes fixed. 😄
21:32fdobridge: <karolherbst🐧🦀> it should
21:32fdobridge: <karolherbst🐧🦀> my last run was pretty clean afaik
21:32fdobridge: <gfxstrand> Oh, I'm not worried about adding bugs. Mostly want to get a new baseline after the fixes.
21:32fdobridge: <karolherbst🐧🦀> just there weren't many of those
21:32fdobridge: <gfxstrand> It all looked pretty sane to me.
21:34fdobridge: <karolherbst🐧🦀> there are some `INVALID_VALUE` errors left
21:36fdobridge: <karolherbst🐧🦀> @gfxstrand I think this was my last run on Turing with my fixes: https://gist.githubusercontent.com/karolherbst/cd9e0255402a4a193a173f678c1d6cf5/raw/0b2fbc4e1384965a6780fbe6fa48c25f0cdcd795/gistfile1.txt
21:37fdobridge: <karolherbst🐧🦀> most of those `CTXNOTVALID` errors would go away if nvk would stop submitting on dead channels... probably...
21:37fdobridge: <karolherbst🐧🦀> mhh.. still some `REGISTER_COUNT`.. maybe it was my base run? dunno.. journalctl is weird sometimes
21:37fdobridge: <karolherbst🐧🦀> but there might be more bugs, so dunno
21:38fdobridge: <karolherbst🐧🦀> ohhh
21:38fdobridge: <karolherbst🐧🦀> it contains two runs 🙃
21:38fdobridge: <karolherbst🐧🦀> drwxr-xr-x. 2 kherbst kherbst 928K Jul 20 18:03 nvk-tu102-2016ff00a592c4ed539d172272674bbc931ef9f4
21:38fdobridge: <karolherbst🐧🦀> drwxr-xr-x. 2 kherbst kherbst 28K Jul 20 19:14 nvk-tu102-lts-fixes
21:38fdobridge: <karolherbst🐧🦀> so around 18:03 I started the second run
21:39fdobridge: <karolherbst🐧🦀> and no register errors in that half
21:39fdobridge: <karolherbst🐧🦀> `Jul 20 18:07:59 localhost.localdomain kernel: nouveau 0000:01:00.0: fifo:PBDMA0: CTXNOTVALID chid:16` probably the first line of the run with my MR
21:40fdobridge: <karolherbst🐧🦀> mhhh.. but I think I restarted it midway, because I screwed up
21:40fdobridge: <karolherbst🐧🦀> something something
21:40fdobridge: <karolherbst🐧🦀> yeah.. forget, just do a run yourself and check the logs 🙃
21:41fdobridge: <karolherbst🐧🦀> the invalid value on `0dcc` is more interesting.. let's see
21:42fdobridge: <karolherbst🐧🦀> that's `NVC597_SET_PATCH`
21:42fdobridge: <karolherbst🐧🦀> ohhh
21:42fdobridge: <karolherbst🐧🦀> I think it can't be 0
21:43fdobridge: <karolherbst🐧🦀> let's see...
21:44fdobridge: <karolherbst🐧🦀> mhh
21:46fdobridge: <karolherbst🐧🦀> in OpenGL we default to 3
21:47fdobridge: <karolherbst🐧🦀> maybe I'll look into that one next if nobody wants to
21:50fdobridge: <gfxstrand> Go for it
22:20fdobridge: <mhenning> yeah, replacing the TGSI enums in the codegen interface is on my TODO list
22:23fdobridge: <gfxstrand> You may want to take a look at NAK. There I lower directly from locations to HW attribute addresses and I don't need any of that remapping madness.
22:23fdobridge: <gfxstrand> So much complexity just... gone.
22:24fdobridge: <gfxstrand> @airlied Why does the kernel need to manage VMA addresses at all? Talking about `drm_nouveau_vm_init`, specifically. Why not just have userspace assign all addresses all the time.
22:24fdobridge: <airlied> The kernel needs a safe range for its own stuff
22:25fdobridge: <airlied> I think there is some channely stuff allocated in the vm
22:25fdobridge: <gfxstrand> Okay
22:26fdobridge: <gfxstrand> Any guidelines on how big that space needs to be?
22:26fdobridge: <airlied> Not off hand, I was going with 1 bit out of 40 being enough
22:26fdobridge: <gfxstrand> lol, fair.
22:27fdobridge: <gfxstrand> At the very least, a gig or so should be fine
22:28fdobridge: <airlied> I should see what SVM uses, I also think there might be some restrictions on full 40 bit vma
22:28fdobridge: <airlied> Allocating from top seems not to work always
22:28fdobridge: <mhenning> @gfxstrand are you talking about putting the hardware addresses into driver_location? alyssa already pointed out that what codegen is currently doing is a little strange. Fixing that is also on the todo list
22:29fdobridge: <gfxstrand> Yup. https://gitlab.freedesktop.org/gfxstrand/mesa/-/blob/nak/main/src/nouveau/compiler/nak_nir.c#L316
22:29fdobridge: <gfxstrand> @mhenning It should be noted that NAK doesn't work on all the graphicsy things yet so there may be a gotcha I'm not thinking of but I don't know what it would be.
22:29fdobridge: <karolherbst🐧🦀> yeah.. that callback madness was a bit... weird
22:30fdobridge: <gfxstrand> Only 40 bits is already annoying for SVM. 😕
22:30fdobridge: <gfxstrand> The big thing with SVM is that we can do a big `MAP_ANONYMOUS` and hand that off to the kernel as its range.
22:31fdobridge: <gfxstrand> So as long as we're giving the kernel a range and not querying, we're fine.
22:31fdobridge: <karolherbst🐧🦀> system SVM is different
22:31fdobridge: <gfxstrand> Yeah, system SVM is a mess
22:31fdobridge: <karolherbst🐧🦀> and nouveau supports it
22:31fdobridge: <airlied> Yeah the SVM init path does a similar interface, not sure how to cross the streams
22:32fdobridge: <gfxstrand> Even there, though, if you reserve a big CPU VA space and give that to the kernel, that should satisfy what system SVM needs, too.
22:32fdobridge: <gfxstrand> As long as the app isn't incredibly evil and tries to SVM into that range.
22:32fdobridge: <karolherbst🐧🦀> it won't
22:32fdobridge: <gfxstrand> At which point just kill the app
22:32fdobridge: <karolherbst🐧🦀> that's the problem
22:32fdobridge: <gfxstrand> Why not?
22:32fdobridge: <karolherbst🐧🦀> because you have to mirror the entire CPU VM
22:33fdobridge: <airlied> You can usually put the kernel GPU VA resv at same place as CPU kernel resv
22:33fdobridge: <gfxstrand> Yes, but as a driver you can mmap 1GB of nothing and not tell the client about it and then the client will never touch that memory
22:33fdobridge: <karolherbst🐧🦀> it's not about the client not touching it
22:33fdobridge: <karolherbst🐧🦀> the client doesn't know about any of this
22:33fdobridge: <karolherbst🐧🦀> and doesn't have to
22:33fdobridge: <karolherbst🐧🦀> it just mallocs memory
22:34fdobridge: <karolherbst🐧🦀> and the driver can't place any GPU memory in that region
22:34fdobridge: <gfxstrand> Yes and if you've mmap'd a range, it'll never malloc inside that range.
22:34fdobridge: <karolherbst🐧🦀> okay, this way around...
22:35fdobridge: <karolherbst🐧🦀> I.... was looking into this with iris and I concluded it's not the right approach though
22:35fdobridge: <karolherbst🐧🦀> I think what might work is if you just mmap a region for any GPU memory you place in userspace. But I'm still not sure how well that pans out with system level SVM
22:35fdobridge: <karolherbst🐧🦀> because there is that page faulting kernel side mess as well
22:36fdobridge: <karolherbst🐧🦀> but maybe that would work
22:36fdobridge: <karolherbst🐧🦀> the issue with kinda allocating 64GB+ of VM is that every process now has a virtual memory usage of 64GB+
22:36fdobridge: <gfxstrand> The important thing is that whenever GPU memory exists, its address range is also "removed" from the CPU address space.
22:36fdobridge: <karolherbst🐧🦀> yeah...
22:37fdobridge: <karolherbst🐧🦀> but I think the better approach is to simply mmap each bo and place it there... but I still don't know if that works properly with system SVM
22:37fdobridge: <karolherbst🐧🦀> maybe it does
22:37fdobridge: <gfxstrand> Yup
22:38fdobridge: <gfxstrand> It should
22:38fdobridge: <gfxstrand> And that's exactly what you do
22:38fdobridge: <gfxstrand> For the kernel reserved region, though, you mmap nothing (Well, it'll be the CoW zero page) and then tell the kernel to use that.
22:38fdobridge: <karolherbst🐧🦀> yeah...
22:38fdobridge: <karolherbst🐧🦀> I think that's the part i915 doesn't have yet
22:39fdobridge: <karolherbst🐧🦀> but we have it in nouveau
22:39fdobridge: <gfxstrand> i915 doesn't have a lot of things. 😂
22:39fdobridge: <karolherbst🐧🦀> and I'm using that for SVM on nouveau
22:39fdobridge: <karolherbst🐧🦀> anyway.. my 2 thread run on Ada completed
22:39fdobridge: <karolherbst🐧🦀> `Pass: 369780, Fail: 63, UnexpectedPass: 738, ExpectedFail: 3730, Skip: 1640812, Flake: 120, Duration: 2:30:23, Remaining: 0`
22:39fdobridge: <gfxstrand> \o/
22:39fdobridge: <karolherbst🐧🦀> doesn't look too bad
22:40fdobridge: <gfxstrand> @airlied Do you want comments on `nouveau_drm.h` on the MR or on a mailing list somewhere?
22:40fdobridge: <airlied> Dri-devel posting is probably best
22:58fdobridge: <gfxstrand> Hey, look! I reviewed a kernel patch.
22:59fdobridge: <gfxstrand> Well, not R-B'd quite yet but I'm really quite happy with it.
23:26fdobridge: <airlied> btw unmanaged just was taken from the svm api that existed already
23:27fdobridge: <airlied> didn't put too much thought into it other than aligning with existing api
23:42fdobridge: <gfxstrand> Fair
23:43fdobridge: <gfxstrand> @airlied If you want to give the fixups in your branch a skim for sanity, that'd be good. Also, maybe think through my image sparse binding code in the queue patch. I had to rework it a good bit and I think I got it but another set of eyes would be good.
23:43fdobridge: <gfxstrand> Otherwise, I think we're 99% good to go.
23:44fdobridge: <gfxstrand> Things have been pretty stable since fixing a few bugs.
23:44fdobridge: <gfxstrand> I'd still like to hunt down the slowness but it's almost 7:00 PM on a Friday here so that'll have to wait until Monday.
23:50fdobridge: <gfxstrand> Even there, I highly doubt the slowness will affect the API but I'd like to give a bit of time to it just in case it does for some reason.