05:28fdobridge: <!DodoNVK (she) 🇱🇹> So would even glLight() have a shader on Mesa? And did Mesa have direct fixed-function support before? :frog_gears:
05:31fdobridge: <airlied> it's not really about mesa having it, as much as the hardware has to have it
05:32fdobridge: <airlied> and hardware hasn't had fixed function in that area since maybe nv10/20 or amd r100 eras
05:33fdobridge: <airlied> and I think we've retired all the drivers for that era hw out of mesa main now, so removed the drivers that used ff
06:17fdobridge: <Misyl with Max-Q Design> The rest of the TGSI crew should be next
06:25fdobridge: <airlied> why?
06:26fdobridge: <airlied> maybe we could burn the hw ones, but softpipe and virgl are kinda hard
07:36fdobridge: <airlied> so I tried to parse some more nvidia headers into clang AST, this was what was needed for one : clang -Wno-error -std=gnu11 -Xclang -ast-dump -fsyntax-only -include $BD/src/common/sdk/nvidia/inc/cpuopsys.h -DRPC_MESSAGE_GENERIC_UNION -DRPC_MESSAGE_STRUCTURES -DRPC_STRUCTURES -DRPC_GENERIC_UNION -DPORT_MODULE_memory=1 -DPORT_MODULE_cpu=1 -DPORT_MODULE_core=1 -DPORT_MODULE_debug=1 -DPORT_MODULE_util=1 -DPORT_MODULE_safe=1 -DNVRM -D_LANGUAGE_C -
07:37fdobridge: <airlied> include path hygiene isn't a thing 🙂
07:40fdobridge: <dadschoorse> virgl should just switch to spirv by reusing zink's compiler, softpipe is a bit of a problem but maybe it could consume nir with register intrinsics instead?
07:43fdobridge: <dadschoorse> what TGSI hw is still in mesa anyway? I think just i915g and ancient nvidia, r300 consumes NIR nowadays
07:49fdobridge: <airlied> How can you replace the virgl wire protocol?
08:05fdobridge: <dadschoorse> add support for spirv, drop tgsi support after x years?
08:36fdobridge: <triang3l> There's TGSI software, by the way… D3D9 and/or D3D10
08:37fdobridge: <dadschoorse> nine doesn't need nir to tgsi
09:06fdobridge: <clangcat> So I found something bizarre while making my compositor. I swapped TTY, opened a seat with libseat, and after a while my Nouveau GPU on my sway session timed out and caused sway to crash. The timeout was at `drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmtu102.c:45`. Is this a known issue?
09:07fdobridge: <clangcat> Oh also it left my PC in a funny state where nothing worked and even logging in with SSH to manually restart TTY services didn't restore graphics output :)
13:43fdobridge: <gfxstrand> `Pass: 1222465, Fail: 59, Crash: 5, Warn: 1, Skip: 1519390, Timeout: 8, Flake: 1, Duration: 35:48, Remaining: 0`
13:49fdobridge: <redsheep> Is it just me or is that a pretty fast CTS run?
13:56fdobridge: <!DodoNVK (she) 🇱🇹> ||https://tenor.com/view/misaka-misaka-mikoto-railgun-aetrna-gif-18654651 ||
15:01fdobridge: <Sid> does anyone here have a quadro card?
15:01fdobridge: <Sid> there's something I'd like to have tested c:
15:05fdobridge: <dadschoorse> I got used to 20min CTS runs, then radv added shader object support
15:14fdobridge: <clangcat> Quadro is normally for mac?
15:16fdobridge: <Sid> workstation
15:16fdobridge: <Sid> but yeah some mac pros have quadro cards
15:21fdobridge: <ahuillet> @tiredchiku what testing do you need?
15:21fdobridge: <Sid> an OBS plugin
15:21fdobridge: <Sid> on 555 drivers
15:22fdobridge: <Sid> https://gitlab.com/Sid127/obs-nvfbc
15:24fdobridge: <Sid> 1. compile and install the obs plugin
15:24fdobridge: <Sid> 2. remove all other sources in OBS and add NvFBC source
15:24fdobridge: <Sid> that's all that has to be done, I just wanna know if it works (the preview should show the framebuffer)
15:25fdobridge: <clangcat> XD of course the NVIDIA employee has easy access to odd Nvidia hardware.
15:25fdobridge: <clangcat> I could ask my brother. if you want. But I doubt he knows how to compile code.
15:25fdobridge: <Sid> I can probably set up a CI for it
15:26fdobridge: <clangcat> Yea but my brother uses Ubuntu.
15:26fdobridge: <clangcat> So like gotta bear in mind the libraries are probably hella out of date
15:26fdobridge: <Sid> that's fine
15:26fdobridge: <Sid> as long as it's nvidia driver 555
15:26fdobridge: <Sid> and OBS v28 or newer
15:27fdobridge: <Sid> oh and it's currently X11 only .-.
15:27fdobridge: <mohamexiety> iirc ubuntu is still on 525 or something
15:27fdobridge: <Sid> I wanna confirm if it actually works before bashing my head into wayland
15:27fdobridge: <clangcat> Yea :p and my brother once needed me to "fix" his computer growing up
15:28fdobridge: <clangcat> Turns out computers sometimes need to be charged
15:28fdobridge: <Sid> I have this stupid idea to buy a cheap quadro card along with my (future) PC
15:29fdobridge: <Sid> 2 GPUs :wolfFIRE:
15:29fdobridge: <clangcat> XD
15:29fdobridge: <Sid> an RTX 3070 *and* an nvidia quadro, just out of spite /j
15:31fdobridge: <Sid> I'd have tested it myself but the patch to enable nvfbc on consumer cards broke with the 555 driver
15:35fdobridge: <clangcat> get a GT210
15:35fdobridge: <clangcat> For the memes
15:35fdobridge: <zmike.> @gfxstrand have you tried out the piglits yet
15:37fdobridge: <clangcat> I was so confused I read pigtails. and now I am imagining Faith with pigtails XD
15:38fdobridge: <mohamexiety> the problem is all the cheap quadros are old/obsolete gens so it's questionable value
15:38fdobridge: <mohamexiety> maybe if you can find a cheap T400 or so
15:40fdobridge: <Sid> T400 for 170 USD new
15:40fdobridge: <clangcat> Could sell a kidney to buy a quadro
15:40fdobridge: <clangcat> XD
15:40fdobridge: <Sid> I'd rather buy a ticket to you
15:41fdobridge: <Sid> uhm
15:41fdobridge: <Sid> there's someone on facebook marketplace selling it for a dollar???
15:41fdobridge: <gfxstrand> That's what you get when you have a 36-thread CPU and two GPUs
15:41fdobridge: <Sid> I need to investigate this
15:41fdobridge: <clangcat> That sounds like a scam.
15:42fdobridge: <Sid> still gonna investigate
15:44fdobridge: <Sid> quadro for 60$
15:44fdobridge: <Sid> T400
15:44fdobridge: <Sid> man replied immediately
15:45fdobridge: <mohamexiety> so that's a turing quadro
15:46fdobridge: <mohamexiety> make no mistake though, it's extremely weak. iirc it's around GTX 1650 level, or below that
15:46fdobridge: <Sid> I know, but it'll give me something to play around the workstation features with
15:46fdobridge: <mohamexiety> https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/productspage/quadro/quadro-desktop/nvidia-t400-datasheet-1987150-r3.pdf
15:46fdobridge: <Sid> and it being so weak will also mean low power draw
15:47fdobridge: <Sid> meaning I might be able to easily accommodate it into (future) PC
15:47fdobridge: <mohamexiety> yeah it's 30W
15:47fdobridge: <mohamexiety> very smol and only uses slot power (no external power connector needed)
15:48fdobridge: <Sid> :eyeshake:
15:48fdobridge: <Sid> he's willing to sell it for 60$ including shipping
15:49fdobridge: <Sid> it's also still in the box, apparently, just missing an invoice
15:53fdobridge: <Sid> I've asked him for time until wednesday to confirm it
15:53fdobridge: <Sid> a few friends owe me money >:3
15:55fdobridge: <mohamexiety> make sure it's working/functional though given you can't test it until you get a PC later
15:59fdobridge: <Sid> yeah, I'll ask him to send me proof of it working
15:59fdobridge: <Sid> even if it means getting it unboxed
15:59fdobridge: <Sid> assuming I do decide to pick it up, that is
15:59fdobridge: <Sid> trying to weigh the pros and cons
21:12fdobridge: <gfxstrand> https://mastodon.gamedev.place/@gfxstrand/112537713509093022
21:15fdobridge: <Sid> so cool
21:21fdobridge: <gfxstrand> And now I can get dep info out of the blob. 😈
21:23fdobridge: <redsheep> Is that how you teased apart the information to get closer to conformance on your new branch?
21:25fdobridge: <redsheep> Hmm I see all the tooling commits after that, so I'd guess no
21:27fdobridge: <gfxstrand> Those are the tools, yes. Today I pulled them all into the same project and cleaned things up
21:27fdobridge: <gfxstrand> But versions of them have existed for a while
21:29fdobridge: <gfxstrand> The important thing from today is that I can now get dependency information out of the blob:
21:29fdobridge: <gfxstrand> ```
21:29fdobridge: <gfxstrand> 0x000000: mov r1, c[0x2][0x0] // delay=1 reuse=000001
21:29fdobridge: <gfxstrand> 0x000010: ipa.pass r3, a[0x7c] // delay=1 wr:1 yld
21:29fdobridge: <gfxstrand> 0x000020: ipa.pass r0, a[0x80] // delay=1 wr:1 yld
21:29fdobridge: <gfxstrand> 0x000030: ipa.pass r2, a[0x88] // delay=1 wr:1 yld
21:29fdobridge: <gfxstrand> 0x000040: lea r6, r1.reuse, 0x8, 0x3 // delay=1 reuse=000001 yld
21:29fdobridge: <gfxstrand> 0x000050: ipa.pass r4, a[0x8c] // delay=1 wr:1 yld
21:29fdobridge: <gfxstrand> 0x000060: lea r1, r1, 0x70, 0x4 // delay=1 yld
21:29fdobridge: <gfxstrand> 0x000070: umov ur9, 0x5 // delay=3
21:29fdobridge: <gfxstrand> 0x000080: ldc.64 r6, c[0x1d][r6] // delay=1 wr:1 yld
21:29fdobridge: <gfxstrand> 0x000090: mufu.rcp r3, r3 // delay=1 wt=000001 wr:1 yld
21:29fdobridge: <gfxstrand> 0x0000a0: r2ur ur4, r6 // delay=4 wt=010000 wr:4 yld
21:29fdobridge: <gfxstrand> 0x0000b0: r2ur ur5, r7 // delay=1 wr:1 yld
21:29fdobridge: <gfxstrand> 0x0000c0: fmul.ftz.rz r0, r3.reuse, r0 // delay=2 wt=000010 reuse=000001 yld
21:29fdobridge: <gfxstrand> 0x0000d0: fmul.ftz.rz r2, r3.reuse, r2 // delay=2 wt=000100 reuse=000001 yld
21:29fdobridge: <gfxstrand> 0x0000e0: fmul.ftz.rz r3, r3, r4 // delay=1 wt=001000 yld
21:29fdobridge: <gfxstrand> ```
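The trailing comments in the dump above carry the scheduling information. A minimal sketch of pulling those fields apart in Python, assuming the format is exactly as it appears in this paste. `parse_line` is a hypothetical helper, not part of any tool mentioned here, and the field meanings noted in the comments are guesses based on the later discussion.

```python
import re

# offset, instruction text, then everything after the "//" comment marker
LINE_RE = re.compile(r"^(0x[0-9a-f]+):\s+(.*?)\s*//\s*(.*)$")

def parse_line(line):
    """Split one disassembly line into (offset, instruction, annotations)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    offset, instr, notes = m.groups()
    fields = {"yld": False}
    for tok in notes.split():
        if tok == "yld":
            fields["yld"] = True
        elif "=" in tok:
            key, val = tok.split("=", 1)
            # delay is a plain number; wt and reuse look like bit masks,
            # so they are kept as strings (assumption)
            fields[key] = int(val) if key == "delay" else val
        elif ":" in tok:
            key, val = tok.split(":", 1)
            fields[key] = int(val)
    return int(offset, 16), instr, fields

example = "0x0000a0: r2ur ur4, r6 // delay=4 wt=010000 wr:4 yld"
print(parse_line(example))
```

Keeping `wt` and `reuse` as strings avoids committing to a bit order that the paste does not spell out.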
21:29fdobridge: <gfxstrand> Also, I made it not shout at me. 😅
21:34fdobridge: <mohamexiety> that is really really impressive. nice read too. awesome work!
21:35fdobridge: <gfxstrand> Rust makes this stuff easy
21:36fdobridge: <gfxstrand> It's as good at bit-twiddling as C and as good at OS stuff and string processing as Python.
21:38fdobridge: <redsheep> I haven't seen the old output, but more info and less shouting is always good
21:42fdobridge: <redsheep> Was this from your uniform alu branch? Is that one in a state where it would help to do testing?
21:46fdobridge: <gfxstrand> Yes it was but no it's not
21:54fdobridge: <redsheep> I'm curious about the general workflow surrounding these tools.
21:54fdobridge: <redsheep> Just from the instruction docs we talked about recently and guessing I think I can read about 70% of the output you sent, but not quite all of it. Is there a guide somewhere?
21:55fdobridge: <redsheep> Also, maybe this is an overly basic question but what is the easiest way to get the shader binaries or hex out of real applications that you might use to see how things have been compiled?
21:55fdobridge: <redsheep> It seems like it would be interesting to compare how Nvidia is doing it with current output from NAK
21:56fdobridge: <gfxstrand> I had a layer for that at one point but it's very much fallen into disrepair.
21:57fdobridge: <gfxstrand> Guide? 🤣 🤣 🤣
21:57fdobridge: <redsheep> Can you get that from renderdoc somewhere?
21:57fdobridge: <redsheep> And yes I know, anyone using this should probably just know how to read it
21:59fdobridge: <gfxstrand> I mean, it's not documented by NVIDIA. Not really
22:00fdobridge: <gfxstrand> No. If someone wanted to make a proper layer that exposes VK_KHR_pipeline_executable_properties by using the CUDA disassembler, that'd be really cool.
22:01fdobridge: <!DodoNVK (she) 🇱🇹> `mov` and `lea` also exist on x86
22:03fdobridge: <gfxstrand> I should really make NAK use lea
22:04fdobridge: <redsheep> Oh I know it's probably ptx instructions, there's bound to be overlap with other ISA
22:06fdobridge: <redsheep> Or I guess this is probably the actual hardware instructions, one level lower than ptx? I dunno I'm at the edge of what I understand here
22:07fdobridge: <gfxstrand> It's HW instructions
22:07fdobridge: <gfxstrand> which are typically pretty close to PTX but not the same
22:10fdobridge: <redsheep> What are you doing inside of lea? Docs say that is compute effective address?
22:10fdobridge: <redsheep> Instead*
22:11fdobridge: <gfxstrand> Just doing the math
22:11fdobridge: <redsheep> Or load effective address. Hmm. These docs are actually quite vague.
22:12fdobridge: <gfxstrand> LEA is shift and 64-bit add or something like that
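A minimal sketch of the arithmetic described here, assuming `lea dst, a, b, shift` computes `(a << shift) + b`. The operand order and the wraparound width are assumptions, since the chat only says "shift and 64-bit add or something like that".

```python
def lea(a, b, shift, bits=32):
    """Hypothetical model of LEA: shift `a` left, add `b`, wrap to `bits` bits."""
    mask = (1 << bits) - 1
    return ((a << shift) + b) & mask

# Modeled on `lea r6, r1, 0x8, 0x3` from the dump: scale an index by 8, then add 8.
print(hex(lea(5, 0x8, 0x3)))
```

Read this way, the instruction turns an element index into a byte offset in one step, which is the same job `lea` does on x86.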
22:19fdobridge: <gfxstrand> There is still a bug affecting some of the uniform control-flow tests. However, given that everything else is fine, I doubt it's intrinsic to uniform ALU and probably has something to do with re-convergence.
22:19fdobridge: <gfxstrand> That said, that was with `NAK_DEBUG=serial` because there are still instruction dependency issues I don't understand.
22:20fdobridge: <gfxstrand> Also, I haven't written the optimization pass to make it not suck yet so right now it's generating an absolute flood of R2UR and MOV
22:21fdobridge: <redsheep> Another potentially quite basic question but... How do you know how fast a bit of shader assembly will be? Does this output convert pretty much one-to-one to machine instructions that are ingested one at a time every clock cycle, so you just want a shorter list of instructions, or do some of these cause it to wait before ingesting more? Like, some instructions are just worse to use, like div on a CPU
22:22fdobridge: <redsheep> Trying to figure out what the mental model is for a GPU and I'm not sure how much comes over from CPU world where it's all out of order and you retire instructions way later and all that
22:22fdobridge: <gfxstrand> It's nearly impossible to know from static analysis alone
22:22fdobridge: <gfxstrand> GPUs are always in-order so you don't need to worry about that (edited)
22:23fdobridge: <gfxstrand> However, they work in warps and divergence has funny effects on how many cycles it actually takes.
22:23fdobridge: <gfxstrand> And memory ops are always an unknown
22:24fdobridge: <gfxstrand> And then there's occupancy. So the fewer GPRs you use, the more threads can run at a time.
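A rough way to see the occupancy point: registers come out of a shared register file per shader core, so the per-thread register count caps how many warps can be resident at once. The sketch below assumes a 65536-entry register file, 32-thread warps, and a hardware cap of 64 warps. These defaults are illustrative placeholders, not figures from this conversation, and real hardware adds allocation granularity and other limits.

```python
def max_resident_warps(regs_per_thread, regfile_regs=65536,
                       warp_size=32, hw_max_warps=64):
    """Upper bound on resident warps when only register usage is limiting."""
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    by_regs = regfile_regs // (regs_per_thread * warp_size)
    return min(hw_max_warps, by_regs)

for regs in (32, 64, 128, 255):
    print(regs, max_resident_warps(regs))
```

Under these assumptions, doubling the register count per thread halves the number of warps available to hide memory latency.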
22:24fdobridge: <redsheep> So, they are in order, but if something is blocking then some other pending execution can come along and utilize that bit of hardware though, right?
22:25fdobridge: <gfxstrand> yes
22:25fdobridge: <gfxstrand> They're massively hyper-threaded
22:25fdobridge: <gfxstrand> With typically zero thread-switch cost
22:26fdobridge: <gfxstrand> So the moment you touch memory, the next cycle will be on a different thread
22:27fdobridge: <redsheep> Hmm. So from the compiler code side then why is it important to know the latency of an instruction? Is it not just built in that whenever the latency is over it goes back to executing on that thread?
22:29HdkR: Because the hardware doesn't do dependency tracking. So it'll happily try to read the result of an instruction in one pipeline from another pipeline before it's ready :P
22:29fdobridge: <redsheep> So telling it to block until the latency is over is explicit?
22:30HdkR: in-order dispatch doesn't mean it won't be executing multiple operations at the same time
22:30HdkR: Yes, compiler does explicit dependency tracking for the hardware
22:31HdkR: And of course the compiler will try and put independent operations in-between write and use to keep hardware utilization up
22:32fdobridge: <gfxstrand> Yes. They deleted the HW scoreboard on Maxwell.
22:33HdkR: Meanwhile AMD went the opposite direction and added one recently :D
22:33fdobridge: <gfxstrand> AMD had a weird "everything is a single cycle" design until RDNA2
22:34fdobridge: <redsheep> So with that in mind, what does NAK debug serial do? How is it different from normal execution?
22:35fdobridge: <gfxstrand> It goes scorched earth and uses the token system for every instruction. Everything waits on and sets a token.
22:35HdkR: lol wtf that's awesome
22:36fdobridge: <gfxstrand> There are a couple instructions which can't signal a token and those are special cased.
22:36HdkR: That sounds like so much overkill compared to scoreboarding for maximum wait
22:36fdobridge: <gfxstrand> At one point, it also set a delay of 15 on everything
22:36fdobridge: <gfxstrand> I should probably put that back, TBH.
22:37fdobridge: <gfxstrand> It's great for "is this a dependency bug?" debugging.
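The serial mode described above can be pictured with a toy model: every instruction waits on the token set by the one before it and sets a token of its own. This is only a sketch of the idea, not NAK's code. The dictionary layout, the single slot 0, and the op names are hypothetical, and how NAK actually special-cases instructions that cannot signal is not stated in the chat. Here they simply get a long fixed delay, which is a guess.

```python
def serialize(ops, cannot_signal=frozenset(), max_delay=False):
    """Annotate ops so each one waits for the previous one to finish."""
    out = []
    prev_signals = False
    for op in ops:
        signals = op not in cannot_signal
        out.append({
            "op": op,
            # wait on the previous instruction's token, if it set one
            "wt": {0} if prev_signals else set(),
            # set our own token on completion when the op is able to
            "wr": 0 if signals else None,
            # fall back to a long fixed delay when no token is available
            "delay": 15 if (max_delay or not signals) else 1,
        })
        prev_signals = signals
    return out

for entry in serialize(["ld", "fadd", "opX"], cannot_signal={"opX"}):
    print(entry)
```

With `max_delay=True` the model also applies the "delay of 15 on everything" variant mentioned in the chat.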
22:38HdkR: Good for sanity checking dependency problems
22:38fdobridge: <redsheep> So 15 is a pretty reasonable max for latency?
22:39fdobridge: <redsheep> Memory operations could be like... Thousands I'd assume, but I guess those can use these tokens you're referring to?
22:39fdobridge: <nanokatze> you're conflating two things
22:39fdobridge: <nanokatze> there's things with variable latency, e.g. memory ops
22:39fdobridge: <nanokatze> there's things with fixed latency, e.g. alu ops
22:40fdobridge: <nanokatze> for variable latency things you tell them which slot to signal and then wait on that slot
22:40fdobridge: <nanokatze> for fixed latency things that's what the delay thing is for (edited)
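The split between the two mechanisms can be sketched as a toy annotation pass over a straight-line program: fixed-latency results are covered by stretching the `delay` of the preceding instruction, while variable-latency results are covered by a slot that the producer signals and the consumer waits on. Everything here is an illustration under stated assumptions. `Instr`, `annotate`, and `FIXED_LATENCY = 4` are hypothetical, six slots is inferred from the 6-bit `wt=` masks in the paste, and the pass ignores write-after-read hazards, slot exhaustion, and any upper limit on `delay`.

```python
from dataclasses import dataclass, field
from typing import Optional

NUM_SLOTS = 6       # assumption: matches the 6-bit wt= masks in the paste
FIXED_LATENCY = 4   # hypothetical ALU latency in cycles, illustration only

@dataclass
class Instr:
    op: str
    dst: str
    srcs: tuple
    variable: bool = False     # True for memory-like ops with unknown latency
    delay: int = 1             # cycles to stall before the next instruction issues
    wr: Optional[int] = None   # slot signalled when the result lands
    wt: set = field(default_factory=set)  # slots to wait on before issuing

def annotate(instrs):
    ready_at = {}  # reg -> cycle at which a fixed-latency result is readable
    slot_of = {}   # reg -> slot a pending variable-latency producer will signal
    cycle, next_slot, prev = 0, 0, None
    for ins in instrs:
        for reg in ins.srcs:
            if reg in slot_of:
                ins.wt.add(slot_of[reg])
            elif prev is not None and ready_at.get(reg, 0) > cycle:
                # stretch the previous instruction's delay to cover the gap
                prev.delay += ready_at[reg] - cycle
                cycle = ready_at[reg]
        # anything whose slot we just waited on is no longer pending
        for reg in [r for r, s in slot_of.items() if s in ins.wt]:
            del slot_of[reg]
        if ins.variable:
            ins.wr = next_slot
            slot_of[ins.dst] = next_slot
            next_slot = (next_slot + 1) % NUM_SLOTS
            ready_at.pop(ins.dst, None)
        else:
            ready_at[ins.dst] = cycle + FIXED_LATENCY
            slot_of.pop(ins.dst, None)
        cycle += ins.delay
        prev = ins
    return instrs

prog = annotate([
    Instr("ldc", "r6", ("r1",), variable=True),
    Instr("fadd", "r2", ("r0", "r1")),
    Instr("fmul", "r3", ("r2", "r6")),
])
for ins in prog:
    print(ins.op, "delay", ins.delay, "wr", ins.wr, "wt", sorted(ins.wt))
```

In this run the `fadd` picks up a longer delay because the `fmul` reads its result, and the `fmul` waits on the slot that the load signals.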
22:43fdobridge: <nanokatze> also I should read things to completion before replying :didnotread:
22:43fdobridge: <nanokatze> yes
22:43fdobridge: <redsheep> Ok so three things I think I need to read up on then, tokens, slots, and scoreboarding
22:44fdobridge: <nanokatze> tokens and slots is the same in the above I think
22:44fdobridge: <redsheep> Ah ok
22:45fdobridge: <gfxstrand> Yes
22:45fdobridge: <redsheep> So even under normal operation then I assume memory ops use these slots? Like, the memory controller goes and signals that token when it's done fetching?
22:46fdobridge: <gfxstrand> It's also good for just separating concerns. Every time you can take a huge chunk of the complexity and make it a problem for future you, it makes things easier.
22:47fdobridge: <redsheep> Replies still don't work with the bridge yet, that was directed at you Hdkr
22:47HdkR: Oh
22:47fdobridge: <gfxstrand> In this case, I wrote `NAK_DEBUG=serial` early on and had most of NAK working and tested before I ever shut it off.
22:48fdobridge: <gfxstrand> Yes, that's primarily what they're for.
22:48HdkR: Yea, memory loads will use the scoreboarding slots for signaling. Kind of required since memory accesses are slow AF and the hardware will immediately context switch if allowed
22:48fdobridge: <gfxstrand> But they work on everything. Even on Maxwell where the disassembler rejects them, the hardware is fine with it.
22:50fdobridge: <nanokatze> does the hardware actually signal it when it's specified for fixed latency things or is it just ignored?
22:50fdobridge: <gfxstrand> It signals it
22:50HdkR: You can even do multiple loads and scoreboard slot on only one to have multiple in flight, which is a nice feature of the hardware
22:52fdobridge: <redsheep> Probably another really basic question, but it sounds like y'all are saying tokens are a scoreboarding thing, so if Maxwell removed the hardware scoreboard how is it possible to make almost everything use these tokens, and how is it possible to wait on memory ops at all?
22:53HdkR: Nah, it kept the scoreboarding, but it removed the pre-maxwell hardware dependency tracking
22:54fdobridge: <gfxstrand> Previously, there was a magic scoreboard internally where every instruction automatically set the scoreboard based on the registers it wrote and waited based on registers accessed. That's gone. What we have now requires the compiler to figure out dependences.
22:54fdobridge: <redsheep> Oh I missed this message
22:54HdkR: Kepler and below was weird and let you do scoreboarding plus hardware dependency tracking, potentially because NVIDIA wasn't yet willing to commit to full compiler scoreboarding
22:54HdkR: But anything pre-maxwell is dinosaur class hardware, so it can be completely ignored these days :)
22:55fdobridge: <nanokatze> HdkR: do you know if operations to the same memory (e.g. global loads) can actually complete in different order they were started?
22:55HdkR: I don't recall
22:56fdobridge: <gfxstrand> You say that but... https://www.collabora.com/news-and-blog/news-and-events/mesa-241-brings-new-hardware-support-for-arm-and-nvidia-gpus.html#qcom4085
22:56HdkR: Oh no
22:56fdobridge: <gfxstrand> They can, depending on ordering flags
22:58fdobridge: <nanokatze> btw @gfxstrand earlier you said that kepler can't implement vmm, why is that?
22:58fdobridge: <gfxstrand> Because it can't guarantee ordering in some scenarios. 🤡
22:58fdobridge: <nanokatze> I see
22:59fdobridge: <gfxstrand> I don't know the details, just that one of the smartest people I know convinced himself it wasn't possible
23:00fdobridge: <redsheep> I'd assume you can end up hitting cache and sometimes go much faster?
23:00fdobridge: <gfxstrand> Yup
23:04fdobridge: <gfxstrand> I honestly can't tell if that comment on the Collabora blog is a troll or not.
23:04fdobridge: <redsheep> Yeah that's pretty weird
23:06fdobridge: <redsheep> Some people just don't understand not being entitled to your hardware or use case being supported
23:07fdobridge: <Sid> always the kepler users..
23:10fdobridge: <redsheep> I mean just from the perspective of being an effective contributor, why would anyone work on kepler when there are still things to do on modern hardware unless they personally need that? It makes sense to work on whatever will be the most effective to help the most people, and kepler is almost nowhere to be seen on current hardware charts
23:14fdobridge: <gfxstrand> And if someone shows up and starts adding Kepler support to NAK, I'll happily take the patches.
23:15fdobridge: <gfxstrand> But they rarely have any idea just how much work they're asking for.
23:16fdobridge: <gfxstrand> Hell, even if they just want to fix some codegen bugs and wire up images.
23:16fdobridge: <gfxstrand> But there are a lot more important problems to be solved in the Linux graphics ecosystem than Kepler Vulkan.
23:23fdobridge: <gfxstrand> Which I don't mean to be callous. But my time is a finite resource and I have to budget it aggressively. I can figure out Mesa compute or gallium2 or common ray tracing or I can work on Kepler... One of these clearly falls off the list.
23:25fdobridge: <redsheep> Gallium2? What? Did I miss something?
23:25fdobridge: <!DodoNVK (she) 🇱🇹> When will NVK start using that common RT code?
23:25fdobridge: <gfxstrand> It only exists in my brain at the moment.
23:26fdobridge: <gfxstrand> I mean, If I write it, I'll probably develop against NVK
23:26fdobridge: <gfxstrand> Part of why I wrote NVK was so that I can have a "home driver" again.
23:40fdobridge: <phomes_> nvdump fails with `unsafe precondition(s) violated: slice::from_raw_parts requires the pointer to be aligned and non-null, and the total size of the slice not to exceed isize::MAX`. Is this because I do not have the right driver version?
23:51fdobridge: <gfxstrand> Uh... Maybe?
23:51fdobridge: <gfxstrand> Not sure
23:58fdobridge: <phomes_> okay. I will see if I can figure it out. You mention that it only works for 6 months' worth of drivers and I am on a pretty old 535