00:29airlied[d]: dEQP-VK.compute.pipeline.basic.empty_workgroup_x fails with an encoding error if you want to start simpler
01:52gfxstrand[d]: Womp womp.
01:52gfxstrand[d]: I'll look tomorrow
01:54gfxstrand[d]: But I really want to get these SSBO tests sorted. I have a feeling whatever is failing is something that's going to bite us elsewhere in even less nice ways.
07:46karolherbst[d]: airlied[d]: I nuked the membars, perf doubled 🙃 and --correctness is still happy
07:46karolherbst[d]: sooo I think we need to be smarter about membars 🙃
07:47airlied[d]: Yeah, I'm not sure if the spirv code adding those is correct or if we need to optimise them out
07:47karolherbst[d]: I think we can merge some
07:47karolherbst[d]: what I saw was a few ldsm after each other with a bunch of membars around them
07:48karolherbst[d]: and I think we only need to keep the last one if the results aren't used and those don't overlap or something weird?
07:48karolherbst[d]: but yeah.. getting nice numbers now
07:48karolherbst[d]: "41.745574 TFlops"
07:48karolherbst[d]: nvidia is around 50 or so
07:49karolherbst[d]: maybe not with this matrix type..
07:49karolherbst[d]: but yeah.. I think optimizing membars sounds promising at least
07:50karolherbst[d]: there is another micro opt we could be doing. We end up with a lot of movs after each other. We should alternate them with imad instead
07:50karolherbst[d]: not sure nvidia does it on all gens tho
07:50karolherbst[d]: but I doubt that gives significant gains
07:51karolherbst[d]: mhhhhh
07:52karolherbst[d]: looking at the code with the membars it looks a loooot nicer, so maybe it's not the membars per se
07:52karolherbst[d]: https://gist.githubusercontent.com/karolherbst/4e3c6097ef5df12ec8f792afa6a44fc4/raw/7bc0d07a4de36aa3361883db1a7a23c64c64e3c1/gistfile1.txt
07:52karolherbst[d]: looks pretty optimal to me
07:55karolherbst[d]: mhhh
07:55karolherbst[d]: interesting...
07:56mangodev[d]: karolherbst[d]: that's actually pretty impressive so far
07:56mangodev[d]: 4/5 of the matrix perf of official
07:57karolherbst[d]: I'd have to check
07:57karolherbst[d]: because the perf varies a lot between what matrices are used
07:57karolherbst[d]: and the one I'm working on I think isn't even advertised by nvidia 🙃
07:57mangodev[d]: fair, though this is a good start
07:57karolherbst[d]: but it's simpler to reason about
07:57mangodev[d]: karolherbst[d]: that's even better >:)
07:58mangodev[d]: already beating them to the punch and you haven't even made a fist yet
07:58karolherbst[d]: anyway.. ditching the membars nukes 6 cycles of waits
07:58karolherbst[d]: inside a loop
07:58karolherbst[d]: and like 20 of those are removed
07:58karolherbst[d]: so yeah.. I can see this mattering a lot
07:58karolherbst[d]: we basically stall the warp just hitting a membar
07:59karolherbst[d]: spots like this: https://gist.githubusercontent.com/karolherbst/337abcbce70ee68b40a9f1a54d526fcf/raw/706c89c8135e2c5fe4d436254e699de52fee7b10/gistfile1.txt
07:59karolherbst[d]: I'm sure we can just keep the last one and it's fine
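To make the idea concrete, here is a minimal sketch of the trivially-safe version being described. This is not the actual NAK/NVK code: it assumes the current `nir_intrinsic_barrier` intrinsic name and its scope/semantics index getters, only handles back-to-back identical barriers (where it doesn't matter which of the pair survives), and a real pass would still have to prove that nothing between two barriers needs the extra ordering.
```c
/* Sketch only: drop a barrier intrinsic when it immediately follows another
 * barrier with identical scopes and memory semantics, so only one of each
 * adjacent identical pair survives. */
static bool
drop_adjacent_dup_barriers(nir_function_impl *impl)
{
   bool progress = false;

   nir_foreach_block(block, impl) {
      nir_intrinsic_instr *prev = NULL;

      nir_foreach_instr_safe(instr, block) {
         if (instr->type != nir_instr_type_intrinsic) {
            prev = NULL; /* anything in between breaks adjacency */
            continue;
         }

         nir_intrinsic_instr *bar = nir_instr_as_intrinsic(instr);
         if (bar->intrinsic != nir_intrinsic_barrier) {
            prev = NULL;
            continue;
         }

         if (prev &&
             nir_intrinsic_execution_scope(bar) == nir_intrinsic_execution_scope(prev) &&
             nir_intrinsic_memory_scope(bar) == nir_intrinsic_memory_scope(prev) &&
             nir_intrinsic_memory_semantics(bar) == nir_intrinsic_memory_semantics(prev) &&
             nir_intrinsic_memory_modes(bar) == nir_intrinsic_memory_modes(prev)) {
            nir_instr_remove(instr); /* identical duplicate: keep the earlier one */
            progress = true;
            continue;
         }
         prev = bar;
      }
   }
   return progress;
}
```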
07:59mangodev[d]: i honestly long for the day that nvk can somehow outperform nvidia proprietary in some areas (in horsepower, not just wsi)
08:00mangodev[d]: it's probably far from us, but nowhere near as far as some may think
08:00mangodev[d]: karolherbst[d]: what kind of matrices are being worked on rn
08:00karolherbst[d]: cooperative matrix
08:01mangodev[d]: matrix math for rendering, or for compute?
08:01karolherbst[d]: compute
08:01mangodev[d]: oooh nice
08:01mangodev[d]: i wonder what compute stuff works with NVK already
08:02mohamexiety[d]: It could be used for rendering too if needed, the coop matrix extension basically just allows you to leverage onboard matmul accel in Vk without having to do CUDA/OpenCL/etc interop
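For anyone unfamiliar with the extension, the host side is small: the app just asks the driver which matrix shapes and component types it advertises and then uses coopmat types in its shaders. A hedged sketch (error handling omitted, helper name is illustrative):
```c
/* Sketch: list the MxNxK cooperative-matrix shapes a device exposes via
 * VK_KHR_cooperative_matrix. */
#include <stdio.h>
#include <stdlib.h>
#include <vulkan/vulkan.h>

static void
print_coopmat_shapes(VkInstance instance, VkPhysicalDevice pdev)
{
   PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR get_props =
      (PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR)
      vkGetInstanceProcAddr(instance,
                            "vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR");

   uint32_t count = 0;
   get_props(pdev, &count, NULL);

   VkCooperativeMatrixPropertiesKHR *props = calloc(count, sizeof(*props));
   for (uint32_t i = 0; i < count; i++)
      props[i].sType = VK_STRUCTURE_TYPE_COOPERATIVE_MATRIX_PROPERTIES_KHR;
   get_props(pdev, &count, props);

   for (uint32_t i = 0; i < count; i++)
      printf("coopmat %ux%ux%u (scope %u)\n",
             props[i].MSize, props[i].NSize, props[i].KSize,
             (unsigned)props[i].scope);

   free(props);
}
```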
08:02mangodev[d]: mohamexiety[d]: blender cycles on vulkan compute when 🥺
08:03karolherbst[d]: mohamexiety[d]: I was considering if we could use ldsm to optimize some shared memory loads
08:03karolherbst[d]: but...
08:03karolherbst[d]: that's also just for compute shaders
08:03karolherbst[d]: ldsm is a funky instruction
08:03mangodev[d]: still a good effort though, afaik compute is really rough rn
08:04karolherbst[d]: but it can nuke a bit of address calculation once the stars all align perfectly
08:08mangodev[d]: i'm more concerned with common rendering tasks and wsi, but those are frequently worked-on issues that i'm not sure where to even begin diagnosing (let alone fixing)
08:08mangodev[d]: i'd love to learn more and eventually contribute, but by the time i'm worthy enough to contribute anything helpful, this driver will probably support raytracing on the RTX AI 8090 Ti Super
08:08mangodev[d]: i live and breathe optimization and love seeing the dev process and progress on the driver, though i'm fully aware that there's little i can do aside from report issues
08:15karolherbst[d]: I double-checked, and we are actually more around 50% perf 😄
08:17karolherbst[d]: might need to pick up more of Dave's hacks
08:19mangodev[d]: karolherbst[d]: oh? curious
08:32notthatclippy[d]: How are you testing the perf? If comparing NVK+nouveau vs proprietary stack you might be comparing apples and oranges without knowing it.
08:33notthatclippy[d]: For this stuff I think it'd be good to make a setup where you can only swap the SASS and keep the rest of the stack the same. Whichever stack that is.
08:33karolherbst[d]: just a benchmark doing number crunching
08:35notthatclippy[d]: One of the things where prop (and especially CUDA) will blow you away is launch latency, so make sure that the number crunching is long enough to make the constant overhead irrelevant.
08:36karolherbst[d]: well.. fair, but for now just closing the gap is useful
08:36karolherbst[d]: but it's vulkan vs vulkan
08:36notthatclippy[d]: Even vulkan, the NV driver doesn't need syscalls to submit work
08:36karolherbst[d]: but yeah.. it runs for 1-2 seconds or so
08:36karolherbst[d]: not that long, but also not very short
08:37karolherbst[d]: right...
08:37notthatclippy[d]: I think CUDA launch latency is on the order of a microsecond.
08:37karolherbst[d]: but like if the old perf is 15 TFLOPs and with opts we get 40, and nvidia is around 80, it still shows progress, and that's all I care about atm 😄
08:38notthatclippy[d]: Progress from yesterday's NVK is always good. I'm just saying that you might have set yourself too high a target. The reason NV gets 80 might be due to things other than the compiler
08:38karolherbst[d]: ohh, sure
08:39notthatclippy[d]: Like, if you use the prop stack and just stream your instructions instead of ours, you might be at 50 instead of 40. Or, conversely, if you extract our bitstream and feed that into the NVK stack, we might be at 75 instead.
08:43karolherbst[d]: mhh yeah, but if things would be that simple...
08:53airlied[d]: I don't know how many submissions the test does, should be easy to see
08:55notthatclippy[d]: Submit latency is part of it. Completion notifications, SW methods, error recovery and probably a dozen things I'm missing are all hitting other parts of the stack.
08:57notthatclippy[d]: How big is the final disassembly for this test? Maybe just tweaking it by hand to match what prop does and measuring that on the same stack could give more insight on what the real target is.
08:58notthatclippy[d]: (well, assuming the target is equal to proprietary, which is a good simple target to set, but there's nothing claiming that the proprietary version is actual ideal)
09:03karolherbst[d]: so let's see how perf is with all of those address calc opts
09:04karolherbst[d]: notthatclippy[d]: like 1k instructions
09:04karolherbst[d]: with some loops
09:04karolherbst[d]: not the thing I'd wanna touch by hand tbh (tm)
09:08airlied[d]: I think on my turing one test i was getting 27tflops with NVIDIA, and got to 19.9 with nvk
09:10karolherbst[d]: which test? Because I have some where nvidia runs at 80tflops
09:10karolherbst[d]: but it's also like.. a quadro one
09:11karolherbst[d]: quadro 6000
09:11airlied[d]: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12817
09:12airlied[d]: Has all the details and hackery
09:12karolherbst[d]: ahh the 16x16x16 one
09:12karolherbst[d]: I test against the 16x8x16 one, because that's what the hw directly supports, but it's the same perf as the 16x16x16 one anyway
09:12karolherbst[d]: so I guess my GPU is just faster
09:13airlied[d]: Also has the NVIDIA shader in it to admire how few movs there are 🙂
09:13karolherbst[d]: ahh yes... predication
09:14karolherbst[d]: they don't do as much nonsense as we do 🙃
09:16karolherbst[d]: I've seen a lot of prmt with nak and am wondering what that's all about
09:16karolherbst[d]: feels like fp16 packing
09:17karolherbst[d]: but maybe that's gone with the 32 bit cmat load/stores
09:17karolherbst[d]: should check again
09:17karolherbst[d]: mhh no
09:17karolherbst[d]: wait...
09:18karolherbst[d]: ehh yeah.. uhh
09:18karolherbst[d]: might be the phi stuff
09:18karolherbst[d]: but yeah.. I think the coop matrix side of things doesn't look bad, it's just everything else which isn't very optimal
09:31snowycoder[d]: I think that dEQP is creating a imageView with storage for a format that does not support it, but `--deqp-validation=enable` is not printing anything.
09:31snowycoder[d]: Am I in the wrong?
09:31snowycoder[d]: P.s. the test case is: `dEQP-VK.image.mutable.2d.r8g8b8a8_srgb_r16g16_sfloat_clear_load`
09:54snowycoder[d]: Nevermind, we are adding the usage bit ourselves
12:39gfxstrand[d]: Why are we adding a storage bit?!?
12:43snowycoder[d]: Sorry, that was me misreading the whole frontend code.
12:43snowycoder[d]: The error was that I swapped image and view format 😅
13:59karolherbst[d]: okay.. compared Dave's wip branch and mine: 22 (Dave) vs 35 (mine). If I disable membar, Dave's branch also jumps to 35, but nvidia is still at 70
13:59karolherbst[d]: so I guess I should figure out the membar situation and see what to do about it
13:59karolherbst[d]: because it looks huge
13:59karolherbst[d]: all the address stuff doesn't really do much
14:00karolherbst[d]: the RA vector thing and the membar are making a massive difference tho
14:01asdqueerfromeu[d]: Is this on Blackwell or an older architecture?
14:02karolherbst[d]: Turing
14:02karolherbst[d]: Blackwell will be so much more work
14:40gfxstrand[d]: snowycoder[d]: That'll do it. :silvy_sweat:
14:40gfxstrand[d]: gfxstrand[d]: Lasted an hour:
14:40gfxstrand[d]: `Pass: 30713, Fail: 16456, Crash: 8348, Skip: 219213, Missing: 2568864, Flake: 167, Duration: 57:01, Remaining: 0`
14:43snowycoder[d]: There's also Kepler `dEQP-VK.image` that's looking pretty good!
14:43snowycoder[d]: Test run totals:
14:43snowycoder[d]: Passed: 19817/210958 (9.4%)
14:43snowycoder[d]: Failed: 1685/210958 (0.8%)
14:43snowycoder[d]: Not supported: 189456/210958 (89.8%)
14:45snowycoder[d]: All fails are when texelFetch and imageStore are used together 0_o
14:45snowycoder[d]: Everything else seems to work.
14:45snowycoder[d]: (I should probably also test robustness)
14:57gfxstrand[d]: Debating how I feel about landing my hacked up QMDs
15:07gfxstrand[d]: airlied[d]: karolherbst[d] mhenning[d] Are we sure where all the UGPR changes cut off? Is it Blackwell or Hopper? Specifically:
15:07gfxstrand[d]: - Increasing UGPR count to 255
15:07gfxstrand[d]: - Getting rid of cbufs in ALU ops
15:07gfxstrand[d]: Dave's got it all cutting off at SM100
15:08karolherbst[d]: I don't know
15:08gfxstrand[d]: I guess we're kind-of guessing on everything Hopper-related
15:08karolherbst[d]: have you checked the whitepapers?
15:08karolherbst[d]: I'm sure it's mentioned somewhere
15:09mhenning[d]: gfxstrand[d]: Some of this is tested in nvdisasm_tests, and that does check against hopper, blackwell1, blackwell2
15:09karolherbst[d]: the 64 one number is also public
15:09mhenning[d]: nvdisasm_tests is, for example, how I discovered that the atomic changes happened in hopper instead of blackwell
15:10mhenning[d]: the texture changes were definitely blackwell
15:11mhenning[d]: I'm not sure for the two things you listed, but it isn't too hard to add those tests
15:11gfxstrand[d]: That's no surprise. No one uses the texture unit on Hopper
15:11karolherbst[d]: gfxstrand[d]: mhhh they nuked cbufs in alus?
15:12mhenning[d]: yep, that's gone in blackwell
15:12karolherbst[d]: indeed
15:12karolherbst[d]: interesting
15:12karolherbst[d]: weird, but okay
15:17mhenning[d]: karolherbst[d]: for the membar stuff, I tried adding nir_opt_barrier_modes to our opt passes a while ago, but it broke dEQP-VK.memory_model.write_after_read.core11.u32.coherent.fence_fence.atomicwrite.device.payload_nonlocal.buffer.guard_local.image.comp so I removed it. It could be worthwhile to look into how to use that pass correctly
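For context, wiring that existing NIR pass back in is a one-liner in whatever optimization loop nak/nvk already runs; the hard part is the correctness question above, not the plumbing. A sketch, assuming the usual NIR_PASS idiom:
```c
/* Sketch: hook nir_opt_barrier_modes into an existing NIR opt loop.
 * Whether this is legal depends on getting the memory-model cases
 * (like the failing write_after_read test) right. */
bool progress;
do {
   progress = false;
   NIR_PASS(progress, nir, nir_opt_barrier_modes);
   /* ... the rest of the existing opt passes ... */
} while (progress);
```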
15:17karolherbst[d]: sounds like a good plan
15:19karolherbst[d]: even if I only get 50% more perf instead of 100% it's a big win 😄
15:19karolherbst[d]: though I don't think the membars are the issue, more the 6 cycles per membar that are not doing anything
15:19karolherbst[d]: and potential reordering opportunities
16:03karolherbst[d]: mhhh.. maybe I just need to figure out how to make `nir_opt_combine_barriers` trigger more often..
16:05karolherbst[d]: https://gist.githubusercontent.com/karolherbst/c1dbf5de8df9d949e4c88327923a6cd9/raw/bb2228e1c0001fe22c1e441256d8c854d46a1e17/gistfile1.txt
16:05karolherbst[d]: mhenning[d]: any thoughts on what needs to be proven so we could nuke all but the last barrier?
16:07karolherbst[d]: maybe just needs more flags on the intrinsics added...
16:12gfxstrand[d]: Ugh... LDC is a mess but I think I've about got it working
16:15gfxstrand[d]: Did we ever get a full opcode dump out of Hopper or Blackwell? I know a couple of us poked at it
16:16gfxstrand[d]: mhenning[d]: Found it!
16:32gfxstrand[d]: Okay, I think I have LDC for realz. I'm gonna add nvdisasm tests for this shit
16:45gfxstrand[d]: Because ugh...
17:11snowycoder[d]: Is there any way to find at what instruction address a GPU trap happens?
17:11snowycoder[d]: I have a `MISALIGNED_REG` but the shader is just too long to parse
17:25gfxstrand[d]: Ugh... Not that I'm aware of, no.
17:25gfxstrand[d]: Does it do texturing?
17:34snowycoder[d]: gfxstrand[d]: yep, why?
17:35snowycoder[d]: There's also some `ILLEGAL_INSTR_ENCODING` sprinkled in
18:06gfxstrand[d]: Because I suspect we need to always use a vec4 for both texture sources on Kepler. Currently I have a thing for the second one but not the first.
18:08gfxstrand[d]: gfxstrand[d]: Properly encoded and tested. :frog_party:
18:08gfxstrand[d]: That's probably a LOT of failures right there
18:17gfxstrand[d]: Ooh! `gsp: Xid:13 Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Out Of Range Register`
18:21gfxstrand[d]: Looks like there is a limit on UGPRs
18:32gfxstrand[d]: Apparently we have 80 of them
18:34mhenning[d]: karolherbst[d]: I think you actually need the first barrier, not the last. But yeah, I don't think nir does this just yet. I think you want something along the lines of https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33504 (that patch doesn't work on normal loads, only atomics, but it could be extended for loads)
18:36karolherbst[d]: mhhh, yeah I guess keeping the first one is the right one yeah 😄
18:37karolherbst[d]: I don't even know where the barriers are coming from...
18:37karolherbst[d]: maybe I should check that, because if something adds them pointlessly it might be easier to convince it to not do that
18:40mhenning[d]: karolherbst[d]: Maybe from loading with acquire semantics?
18:41karolherbst[d]: those were initially plain shared loads
18:41karolherbst[d]: like the matrix loads I mean
18:42karolherbst[d]: mhhhhhhhhh
18:42mhenning[d]: gfxstrand[d]: 80 feels like a weird number of UGPRs. like, we did all that work to add an extra bit to the encoding because apparently an extra 6 of them is going to make a huge difference?
18:42karolherbst[d]: actually
18:42karolherbst[d]: sooo
18:42karolherbst[d]: how the coop matrix works is that you have a deref_load on the nir_variable being the matrix
18:43gfxstrand[d]: mhenning[d]: An extra 17 but yes
18:43karolherbst[d]: and we split it up based on components... maybe something there is overly eager duplicating barriers
18:43mhenning[d]: gfxstrand[d]: ah, right did that math wrong
18:44gfxstrand[d]: gfxstrand[d]: I'm also a little skeptical of this. I wonder if there's a bit somewhere to specify how many. I guess at the end of the day they had to pick a number and they probably balanced hw cost vs. benefit.
18:44airlied[d]: The barriers come from the mesa spirv coopmat code
18:44karolherbst[d]: airlied[d]: not all of them
18:45karolherbst[d]: the original shader has 6 barriers
18:45karolherbst[d]: before going to nak we have 21
18:46mhenning[d]: Does the original have acquire semantics on loads though? Because we do need to turn those into barriers
18:46karolherbst[d]: maybe it's the coop matrix lowering code, but I haven't seen anything... maybe I missed it
18:47karolherbst[d]: https://gist.github.com/karolherbst/4c7d8df309c380cfa8cbe930e96f50a5
18:48karolherbst[d]: I think it's just some lowering duplicating them or something...
18:48karolherbst[d]: just have to find what
18:48karolherbst[d]: mhhhh
18:48karolherbst[d]: maybe vector stuff?
18:54karolherbst[d]: loop unrolling adds some
18:55gfxstrand[d]: karolherbst[d]: Did you ever figure out how to get the CTS to build?
18:56karolherbst[d]: adding `#include <cstdint>` in a couple of places
18:56karolherbst[d]: in the deps
18:57karolherbst[d]: dunno where anymore, git diff won't tell me
18:57gfxstrand[d]: 😢
18:57karolherbst[d]: but yeah, just add it to headers where it fails to compile
18:57karolherbst[d]: it's like 2 or so
18:57karolherbst[d]: maybe 3
18:57gfxstrand[d]: That's because it's in amber
18:58karolherbst[d]: okay yeah.. the barriers come from loop unrolling
18:58karolherbst[d]: what a pain
19:01gfxstrand[d]: Looks like Ricardo fixed it upstream
19:05gfxstrand[d]: https://github.com/google/amber/commit/7fc1c7eded4c28e4b35ce0d3929dfd768f57663a
19:30gfxstrand[d]: karolherbst[d]: Where do the chipset numbers come from?
19:32karolherbst[d]: gfxstrand[d]: mmio reg 0
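That register is NV_PMC_BOOT_0 at MMIO offset 0x0. A minimal sketch of how the chipset id is derived from it; the exact field layout here is quoted from memory and should be treated as an assumption, but it matches the nv180 / nv1a0 style numbers mentioned later:
```c
#include <stdint.h>

/* Sketch: on modern GPUs nouveau takes the chipset id from bits [28:20]
 * of NV_PMC_BOOT_0 (e.g. 0x180 for Hopper, 0x1a0 for GB100). */
static uint32_t
chipset_from_boot0(uint32_t boot0)
{
   return (boot0 & 0x1ff00000) >> 20;
}
```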
19:35gfxstrand[d]: kk
19:35gfxstrand[d]: I found the table in the kernel
19:35gfxstrand[d]: Though why hopper is GB100, I don't know
19:37karolherbst[d]: should be GH100 tho
19:40gfxstrand[d]: <a:shrug_anim:1096500513106841673>
19:40gfxstrand[d]: It's GB in the kernel. Maybe someone should fix that
19:40gfxstrand[d]: I guess that would be good review feedback for skeggsb9778[d]'s series?
19:46gfxstrand[d]: Test run totals:
19:46gfxstrand[d]: Passed: 12160/12160 (100.0%)
19:46gfxstrand[d]: Failed: 0/12160 (0.0%)
19:47gfxstrand[d]: That's all of `dEQP-VK.ssbo.*`
19:55mhenning[d]: airlied[d]: Do you have latencies for REDUX? (sm80+ only)
20:11airlied[d]: decoupled/to_ur
20:13mohamexiety[d]: gfxstrand[d]: yoo that's all of them, awesome!!
20:13airlied[d]: gfxstrand[d]: nv1a0 is gb100
20:13airlied[d]: hopper is nv180
20:18airlied[d]: gfxstrand[d]: just limiting ugprs fixed it or other stuff needed?
20:18gfxstrand[d]: I fixed up ldc, fixed ugprs, fixing sm_for_chipset and a few other things
20:19gfxstrand[d]: Wait... So what's SM100?
20:19gfxstrand[d]: Is that Blackwell A?
20:19gfxstrand[d]: i.e. GB10x? i.e. server parts?
20:21mhenning[d]: Yeah, sm90 is hopper, sm100 blackwell a, sm120 blackwell b
20:26mohamexiety[d]: and for the class, cd97 is blackwell A and ce97 is blackwell B
21:01gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34910
21:02gfxstrand[d]: I'm CTSing this bunch on Ampere to make sure they don't regress anything but I think we can land them. The rest still needs more baking.
21:28hentai: Can someone point me where do I procure some nvd9_fuc084?
21:34gfxstrand[d]: karolherbst[d]: Ugh... The nouveau wiki/docs situation needs an overhaul so bad.
21:34karolherbst[d]: yes
21:47gfxstrand[d]: Including do we even want to keep the old wiki alive in anything other than archival form?
22:54gfxstrand[d]: Blackwell is very grumpy about depth/stencil
22:57karolherbst[d]: gfxstrand[d]: I think the right question to ask is rather, who wants to put in the work to update it
23:02gfxstrand[d]: Maybe we should plan on doing a nouveau hackfest before XDC this year and make all sitting around and fixing up the docs part of it.
23:36karolherbst[d]: mhhh
23:44magic_rb[d]: I can serve you folks coffee, or something. I should be at xdc, can also be there a bit earlier