00:16 gfxstrand[d]: Why does `redg` throw an illegal instruction error?
00:17 hentai: Does https://github.com/envytools/firmware/blob/master/extract_firmware.py not work on 390?
00:19 mhenning[d]: gfxstrand[d]: I tried to fix redg in https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34334/diffs?commit_id=f70b7d10c2e6435702641db4dd98e34841bf64e0 maybe I didn't fix it enough
00:25 airlied[d]: gfxstrand[d]: I think it's a bit 91 vs some ugpr issue, but I didn't manage to spot it yesterday, even after dumping the prop driver
00:25 airlied[d]: mhenning[d]: it's one of those cases where nvdisasm works but the hw complains
00:26 mhenning[d]: yep, that would do it
00:27 airlied[d]: I think some of the int cases also have a bit 91 encoding
00:31 mhenning[d]: Hmm, yeah bit 91 seems to toggle between redg and redg_uniform:
00:31 mhenning[d]: 0x98e 0x0 redg_int_uniform__Ra32
00:31 mhenning[d]: 0x98e 0x1 redg_int_uniform__Ra64
00:31 mhenning[d]: 0x98e 0x2 redg_int__RaNonRZ
00:31 mhenning[d]: 0x98e 0x3 redg_int__RaNonRZ
00:31 mhenning[d]: so maybe set it true for the general case?
00:31 mhenning[d]: although it does that before blackwell too and we don't have an issue there
00:32 mhenning[d]: but maybe blackwell cares where earlier hardware doesn't
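The dump above can be read as a two-bit selector on opcode `0x98e`. A minimal Python sketch of that reading, assuming the selector's high bit is the bit 91 under discussion and that an instruction is held as one 128-bit integer. Both points are assumptions for illustration, not taken from NAK or nvdisasm, and the helper names are hypothetical:

```python
# Variant names copied from the dump in the chat: selector value -> encoding name.
REDG_VARIANTS = {
    0x0: "redg_int_uniform__Ra32",
    0x1: "redg_int_uniform__Ra64",
    0x2: "redg_int__RaNonRZ",
    0x3: "redg_int__RaNonRZ",
}

UNIFORM_TOGGLE_BIT = 91  # bit number quoted in the chat


def get_bit(instr: int, bit: int) -> int:
    """Return bit `bit` of an instruction word stored as a Python int."""
    return (instr >> bit) & 1


def set_bit(instr: int, bit: int, value: bool) -> int:
    """Return a copy of `instr` with bit `bit` forced to `value`."""
    mask = 1 << bit
    return (instr | mask) if value else (instr & ~mask)


def is_uniform_variant(selector: int) -> bool:
    """Assumption: the selector's high bit separates the two families."""
    return (selector >> 1) == 0


if __name__ == "__main__":
    for sel, name in REDG_VARIANTS.items():
        kind = "uniform" if is_uniform_variant(sel) else "non-uniform"
        print(sel, name, kind)
    # "Set it true for the general case", as suggested in the chat.
    print(hex(set_bit(0, UNIFORM_TOGGLE_BIT, True)))
```

Under this reading, selectors 0 and 1 are the uniform encodings and 2 and 3 are the general ones, which matches the suggestion to set the bit for the general case.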
00:48 gfxstrand[d]: Yeah, I got it
00:48 gfxstrand[d]: We were setting a predicate destination for red which doesn't exist and aliases some important bits
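One way this kind of aliasing could be caught early is an encoder that remembers which bits have already been written and refuses to write them twice. This is only a sketch: the `Encoder` class is hypothetical, and the bit ranges in the usage below are invented for illustration and are not the real `red` encoding.

```python
class Encoder:
    """Toy instruction encoder that rejects overlapping field writes."""

    def __init__(self, bits: int = 128):
        self.bits = bits
        self.word = 0     # the instruction being built
        self.written = 0  # mask of bits that some field already claimed

    def set_field(self, lo: int, hi: int, value: int) -> None:
        """Write `value` into bits [lo, hi); raise if any bit is reused."""
        if not (0 <= lo < hi <= self.bits):
            raise ValueError("field out of range")
        width = hi - lo
        if not (0 <= value < (1 << width)):
            raise ValueError("value does not fit in field")
        mask = ((1 << width) - 1) << lo
        if self.written & mask:
            raise ValueError(f"bits {lo}..{hi - 1} alias an earlier field")
        self.word |= value << lo
        self.written |= mask


if __name__ == "__main__":
    enc = Encoder()
    enc.set_field(81, 84, 0b111)     # made-up "predicate destination" range
    try:
        enc.set_field(83, 92, 0b1)   # made-up range that overlaps it
    except ValueError as err:
        print("caught:", err)
```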
00:48 mangodev[d]: airlied[d]: why 91 :blobcatnotlikethis:
00:49 mhenning[d]: gfxstrand[d]: that makes sense
00:51 mhenning[d]: mangodev[d]: dunno. nvidia engineers decided that bit 91 changes stuff
01:05 gfxstrand[d]: Woohoo! I've gotten rid of the "SQUASH from Dave" patch and everything is real patches now. Still have piles of texture hacks, though.
01:06 gfxstrand[d]: Oh, of course shared doesn't work. I haven't hooked that up to the QMD yet. :bim_giggle:
01:08 gfxstrand[d]: It's probably the barrier that's not working. 🤔
01:10 gfxstrand[d]: Time to reboot to the blob
01:43 gfxstrand[d]: Okay, I think I found BARRIER_COUNT in the QMD
01:44 gfxstrand[d]: Now if only we can figure out shared memory sizes
02:53 gfxstrand[d]: `dEQP-VK.glsl.atomic_operations.exchange_signed_compute_shared` passes
03:11 gfxstrand[d]: Okay, so I fixed shared but now `dEQP-VK.glsl.atomic_operations.exchange_signed_compute` fails
03:11 gfxstrand[d]: unless...
03:28 gfxstrand[d]: The weird part is that they fault on NULL. IDK why just changing QMD bits would cause that. Unless those QMD bits cause a conflict with the SLM window or something.
03:28 gfxstrand[d]: Or maybe I got them wrong and I'm accidentally flipping on some bad mode.
03:33 gfxstrand[d]: Who knows? I'm flying blind here.
03:34 gfxstrand[d]: https://tenor.com/view/bugs-bunny-reading-fly-flying-pilot-gif-8771762
08:09 snowycoder[d]: Fixed the instr_encoding bug (tld does not support all lods).
08:09 snowycoder[d]: Now the only major problem is textureBuffers, for those we need tics, not su_info -.- (and we need both if usage requires both).
08:39 mohamexiety[d]: gfxstrand[d]: What do you mean?
08:41 mohamexiety[d]: I already got size and potentially the target_sm_config_shared stuff (I found something in the Blackwell qmd that changes similarly to what that changes in Ada but I didn’t understand what that field was about so I wasn’t sure if it really was it)
08:41 mohamexiety[d]: I couldn’t find the minimum and maximum tho
09:38 nilmemoryathand: So i do not have an fpga that is suggested mostly for measuring zero-delay interrupts but google ai found something like this: Increment Without Transfer:
09:38 nilmemoryathand: This refers to a scenario where the DMA controller is set to increment the address even if the transfer itself is halted or not initiated. This could be useful for tasks like:
09:38 nilmemoryathand: Checksum Calculation: You might increment the source address to point to the next data block without actually transferring the data, potentially for calculating a checksum of a large block of data without loading it into RAM.
09:38 nilmemoryathand: Pointer Manipulation: You could increment pointers within the DMA descriptor without actually moving data, which can be helpful for manipulating memory addresses or creating new addresses.
09:38 nilmemoryathand: Pre-allocation: Incrementing without transferring could be used to pre-allocate memory or set up DMA transfer parameters without triggering an actual data transfer.
09:47 nilmemoryathand: https://psx.arthus.net/sdk/Psy-Q/DOCS/CONF/SCEA/adv_gpu.pdf and indeed some controllers or maybe most have when DC=0 that would not transfer anything but only increment things per how many bytes in total.
10:06 nilmemoryathand: well i see that CRC can be used as this, but it would still read the data to the crc engine.
10:24 nilmemoryathand: Now the ai started to be honest that it is not actually the intended behaviour, that is just side effect of something not being configured before use when absolutely no data is being read or written and only counters increment.
11:50 trinidadsbrigade: Now how that could work, no floating point exceptions either, so you modify the driver a bit according to not make memory transfers at one point when fetch/decode cycles are done you are in the issue stage which would finish with say texture read or in case of fragment processor a write of that. So as the memory transfer is not being done you get approximate issue pipeline latency, so
11:50 trinidadsbrigade: the shader core likely would cause a dma controller to make the memory read, by ignoring that we land there approximately. So fetch/decode/issue+dma is the whole pipeline , where fetch uses dma, decode and issue would then be indistinguishable, so if memory transfers are done without execution unit we get the latency of decode, and if they are done in exec unit, we get latency of
11:50 trinidadsbrigade: decode+issue, hence we assume that illegal instruction will be the end of decode, but likely there is no feedback on that, so all we could ever get is approximate length for decode+issue/exec :( though milkymist code could offer more pointers, i would assume that.
13:10 gfxstrand[d]: mohamexiety[d]: Yeah, I found that. And I think you got it right, up to a shift. And I think I might know where the other bits go. It's the top commit in my branch.
13:10 gfxstrand[d]: But now that's causing faults and I need to figure those out.
13:23 mohamexiety[d]: :PainPeko:
13:24 anglosaxon901: according to wikipedia of illegal opcode https://en.wikipedia.org/wiki/Illegal_opcode it does not halt the old gpu engines i think. So there are instructions in r300 which would wait for 2d/3d datapath to become idle in r300_reg.h, so illegal instruction if detected causes the engine to become idle, hence now you can follow with memory instruction, then you get fetch+decode and then
13:24 anglosaxon901: fetch+decode+issue, and using arithmetic to subtract fetch+decode latency you get issue/exec one , but it's a long shot.
13:25 gfxstrand[d]: notthatclippy[d]: skeggsb9778[d] There's this thing I noticed yesterday where after about 8-10 GPU exceptions, the GSP (I think) starts punishing me by making every GPU exception a 10s stall or so. This is probably fine in production but it's really rough on developers. Is there any way to shut that off?
13:26 anglosaxon901: AI at least responded that shaders are internal/integral part of 2d/3d datapath.
13:26 gfxstrand[d]: I'm assuming GSP because I definitely don't see this on old cards and as far as I know it's new behavior relative to what I've seen in the past and I haven't heard of anything like that going into Nouveau.
13:27 anglosaxon901: But i was never interested in r300 but atom's gpus , which just likely very likely behave the same there.
13:29 anglosaxon901: Those are very old drivers, tungsten commented nothing much on the defines, but did pretty good job now vmware.
13:42 notthatclippy[d]: gfxstrand[d]: Huh. IIRC there is some buffer that has only 8+8 slots for exceptions but that _should_ just wrap around. I can imagine it taking up to a millisecond longer when you go above 8 but not 10s. Totally sounds like a bug.
13:42 notthatclippy[d]: I’ll try to repro on NV stack on Monday
13:43 gfxstrand[d]: Cool. Yeah, a buffer filling up makes sense because I should be getting a new context each time unless nouveau is doing some weird reuse thing.
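A minimal sketch of the wrap-around behaviour described here, assuming a fixed buffer of 16 slots (8+8) in which the newest entry overwrites the oldest. The class and its fields are hypothetical and say nothing about how GSP actually stores exceptions:

```python
class ExceptionRing:
    """Fixed-size ring buffer: once full, new entries overwrite the oldest."""

    def __init__(self, slots: int = 16):  # 8 + 8 slots, per the chat
        self.slots = [None] * slots
        self.head = 0   # index of the next slot to write
        self.count = 0  # total number of entries ever pushed

    def push(self, entry) -> None:
        self.slots[self.head] = entry
        self.head = (self.head + 1) % len(self.slots)  # wrap around
        self.count += 1

    def entries(self) -> list:
        """Return the stored entries, oldest first."""
        if self.count < len(self.slots):
            return self.slots[: self.count]
        return self.slots[self.head:] + self.slots[: self.head]


if __name__ == "__main__":
    ring = ExceptionRing()
    for i in range(20):
        ring.push(i)
    print(ring.entries())
```

With a structure like this, going past the slot count costs nothing extra per push, which is why a multi-second stall would point at a bug rather than at the buffer filling up.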
14:56 anglosaxon901: writeback is separate stage but it seems that operand forwarding isn't, hence that method would also bypass writeback if the operand is read right after writing. And it much looks like gpus have these forwarding networks.
15:04 anglosaxon901: On a cpu i have not dealt with this yet, but there are old kernel mode patches, no zephyr seggers embos or anything will offer zero delay interrupt routing for x86.
15:14 karolherbst[d]: maybe we should just move to matrix and accept the pain, because that way we can at least delete those messages...
15:29 snowycoder[d]: Kepler almost passes all image and robustness 2 tests!
15:29 snowycoder[d]: There are only failures for mutable srgb and a single sample_cubemap.write_face_0.
15:29 snowycoder[d]: Everything else works!
15:31 blltrigopll: your pain is infinite from the day you started to bully me, you fucking rats. You will face such pain that oi oi oi in life. Interrupt controller can not route the signals without zero delay to do measurements to dma, but the kernel mode linux could actually, they had patches from japan for this. Otherwise this is APIC -- advanced programmable interrupt controller, this is self-timed, but
15:31 blltrigopll: it needs supervisor mode, but in case of gpu that is not needed. Cause as soon as the units comes to idle it will initiate a dma transfer, there is no such mechanism for cpu that i am aware of without talking to interrupt controller and defining its priorities.
15:41 gfxstrand[d]: snowycoder[d]: Congrats!
15:49 sravn: snowycoder[d]: Well done! But to understand this from a user perspective, does this imply that NVK+Zink can be used for a Kepler based machine now (as in, when this hits mesa), or are we far away from that?
15:52 snowycoder[d]: sravn: We are still missing proper instruction latencies so for now it would be much slower.
15:52 snowycoder[d]: But a NVK+Zink for Kepler should be in the not-too-distant future I think? (I haven't worked with Zink at all, sorry)
15:53 snowycoder[d]: gfxstrand[d]: Thanks! I've also updated the initial merge with requested changes and fixes I found along the way (I have no more misaligned regs nor illegal instructions)
15:54 mohamexiety[d]: snowycoder[d]: does kepler even need those? it had a HW scheduler iirc
15:54 mohamexiety[d]: but yeah, congrats and awesome work!! :vibrate:
15:54 sravn: snowycoder[d]: Sounds brilliant. As you have guessed I have a Kepler in my box (do not play games), and would love to use NVK. I will continue lurking around and follow your progress. Keep up the spirit and the good work!
15:57 snowycoder[d]: mohamexiety[d]: IIRC, Fermi has the hardware scheduler but both Kepler A and B need them.
15:57 snowycoder[d]: The original issue said the same but I guess it was just confusion between KeplerA and Fermi sharing the encoding.
15:57 snowycoder[d]: For now I set timings at 0 and everything seems to work, but every instruction is delayed by 32 cycles.
15:58 mohamexiety[d]: hmm interesting
15:58 snowycoder[d]: sravn: Wow, that's a really big dopamine boost, thank you :3
15:59 mohamexiety[d]: ah damn yeah
15:59 mohamexiety[d]: somehow I thought the same as that issue but I just rechecked official NV slides
15:59 mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1370792338573623306/image.png?ex=6820c959&is=681f77d9&hm=273017a9c2495edd4f232d8b228312fa867dabaffecb8ff9fab11f2ca41e3842&
15:59 mohamexiety[d]: https://www.highperformancegraphics.org/previous/www_2012/media/Hot3D/HPG2012_Hot3D_NVIDIA.pdf
16:00 mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1370792515732508692/image.png?ex=6820c983&is=681f7803&hm=723b224fc4149424e24bea50c7129f29324ffe32b3cf09ca00e0cf1247fd3180&
16:00 mohamexiety[d]: as an aside, ngl it's funny to read this in hindsight
16:00 mohamexiety[d]: (Maxwell was an _insane_ efficiency jump over Kepler. practically unprecedented for iso-node uarch changes)
16:01 snowycoder[d]: I should really read all the whitepapers, seems really interesting
16:08 gfxstrand[d]: snowycoder[d]: Sounds good. I'm calling into a Khronos meeting next week so IDK where my brain will be at RE review vs. R/E on Blackwell but hopefully we can get some of it merged.
18:18 CompanionCube: hi, i don't feel like actually debugging it right now or maybe soon, but would ' drm: failed to create ce channel, -22' explain why xorg appears to load fine but GLX is borked?
18:27 blltrigopll: what i can see, that maximum hpet resolution being 100ns up to date, so cpu can not measure such things at all, however one can perhaps use interrupt pin on jtag pins on the motherboard to route anything to some fpga board to be able to succeed in this, even zero delay interrupts are at best 100ns latency, it's way too high.
18:27 karolherbst: CompanionCube: yeah, that means that the kernel isn't able to create a hardware context, so you can't really do much GL, and I suspect xorg is doing software rendering or simply is doing something differently
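The `-22` in that dmesg line most likely follows the usual kernel convention of returning a negated errno value. A small Python sketch for decoding such a code; the helper name is made up, while the lookup itself uses the standard `errno` module:

```python
import errno
import os


def decode_kernel_error(code: int) -> str:
    """Turn a negative kernel return code such as -22 into 'EINVAL (...)'."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name} ({os.strerror(number)})"


if __name__ == "__main__":
    print(decode_kernel_error(-22))
```

For `-22` this gives `EINVAL`, i.e. the kernel rejected an argument while setting up the channel.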
18:27 karolherbst[d]: mohamexiety[d]: well.. that's more that tesla and fermi weren't great
18:29 CompanionCube: karolherbst[d]: well i did an update on this particularly-outdated-laptop including from 6.2 to 6.14, and poof no GL, and that's the difference in dmesg i noticed.
18:29 karolherbst: weird...
18:29 karolherbst: what GPU?
18:31 CompanionCube: GT218
18:38 karolherbst: mind filing a bug on https://gitlab.freedesktop.org/drm/nouveau/-/issues then? Sounds like a kernel regression
18:52 HdkR: 💃
19:15 CompanionCube: karolherbst: https://gitlab.freedesktop.org/drm/nouveau/-/issues/427
19:18 karolherbst: thanks
21:18 ladygagafan: Well it's not like i ever tried , but it's SAMPLE command and should be inside the processor's tap, i can't remember there was some type of interrupt called like level or edge triggered, so one should hold the interrupt down for a while , maybe i misunderstood something again. https://interrupt.memfault.com/blog/diving-into-jtag-part-3 https://www.youtube.com/watch?v=518nC_8RZBM but cpu
21:18 ladygagafan: might run ahead of sample command or fall behind, i was never sure how those synchronizations work, but if that's not some loop in the capture state coded on a chip then how could the pins signal at all if interrupt or whatever signal is not held long enough?
22:02 paralleluniverse: https://www.ganssle.com/articles/adma.htm • Legacy support for ISA regime protocol (PHOLD/PHOLDA) required for parallel port. DMA, floppy drive, and LPC bus masters. • DC coupling – no capacitors , so intel has hardwired legacy ISA in fact, which then would not use any of the JTAG, LOL. Not sure how to go into this hardwired interrupt routing which ai also said can be done somehow,
22:02 paralleluniverse: cause osdev said always that those APIC gpios are pinned by efi type of acpi code etc.
23:25 soreau: does "OpenGL renderer string: NV168" mean zink on nvk?
23:26 gfxstrand[d]: No. That means old-school Nouveau GL.
23:26 soreau: ah, ok
23:26 gfxstrand[d]: Though why it says NV168 and not TU116, I don't know.
23:27 gfxstrand[d]: If you're getting NVK+Zink, the renderer should say Zink
23:28 gfxstrand[d]: `OpenGL renderer string: zink Vulkan 1.4(Intel(R) Graphics (RPL-P) (INTEL_OPEN_SOURCE_MESA))`
23:28 gfxstrand[d]: Only with NVK/NVIDIA stuff in there
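A rough way to automate the check described above, assuming the line has the `OpenGL renderer string: ...` shape quoted in the chat and that a Zink renderer starts with `zink`, as in the pasted example. The function name is hypothetical:

```python
PREFIX = "OpenGL renderer string:"


def is_zink_renderer(line: str) -> bool:
    """Return True if a renderer line reports Zink."""
    if not line.startswith(PREFIX):
        raise ValueError("not a renderer string line")
    renderer = line[len(PREFIX):].strip()
    return renderer.lower().startswith("zink")


if __name__ == "__main__":
    samples = [
        "OpenGL renderer string: NV168",
        "OpenGL renderer string: zink Vulkan 1.4(Intel(R) Graphics (RPL-P) (INTEL_OPEN_SOURCE_MESA))",
    ]
    for s in samples:
        print(is_zink_renderer(s), s)
```

The first sample is the old-school Nouveau GL case from the chat and the second is the Zink example, so they come out as `False` and `True` respectively.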