02:21chikuwad[d]: interesting
02:21chikuwad[d]: thanks :D
02:53chikuwad[d]: okay, adding 10:3 null bytes for non-keyframes did remove the green from the video I'm using to test
02:53chikuwad[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1417705072456106006/image.png?ex=68cb743c&is=68ca22bc&hm=8fb30ed43c0287948b1627fe11245de78b0ef2eb07f4f20ce6b114ab680b79f6&
02:53chikuwad[d]: but it's still garbled
02:54chikuwad[d]: or, perhaps, the driver is attempting a modern-art mosaic
02:56chikuwad[d]: ooh
02:57chikuwad[d]: adding the bytes for all frames actually restores some of the frame
02:57chikuwad[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1417706089541013534/image.png?ex=68cb752f&is=68ca23af&hm=a8357ec52375aa407aa9f9534136028799682ae52b217a63c28f34e6e3cbdead&
02:57chikuwad[d]: but now it only plays the i-frames I think
03:26chikuwad[d]: :o I did a dumb, it works
03:27chikuwad[d]: avhe[d]: that was it, thanks averne 🩷
07:36asdqueerfromeu[d]: chikuwad[d]: I've had this to a smaller extent on my phone sometimes after ads finished playing on YouTube
09:30avhe[d]: chikuwad[d]: nice job!
09:46chikuwad[d]: thank, couldn't have done it without you :3
14:06karolherbst[d]: Yeah soo.. we need a solution for applications which compile multi-threaded in a way that you can't debug anything 🙃
15:36karolherbst[d]: okay, I have a solution there
15:36karolherbst[d]: `r6 = ld.global.a64.constant.b32 [r3+r4..6] // delay=1 wr:0` ehh yeah...
15:37karolherbst[d]: about that 🙃
15:40karolherbst[d]: uhhh...
15:40karolherbst[d]: `con 32x2 %71 = @ldc_nv (%48 (0x1), %29 (0x0)) (access=none, align_mul=16, align_offset=0)` that gets emitted as a gpr?
15:41karolherbst[d]: %r74 = copy c[0x1][0x0]
15:41karolherbst[d]: %r75 = copy c[0x1][0x4]
15:41karolherbst[d]: %r76 = copy %r74
15:41karolherbst[d]: %r77 = copy %r75
15:41karolherbst[d]: %r78 = ld.global.a64.constant.b32 [%r73+{%r76 %r77}]
15:43karolherbst[d]: gfxstrand[d]: I think I'm missing something. Is there any specific part in from_ir.rs that forces a gpr to be used here?
15:43karolherbst[d]: ohhh...
15:43karolherbst[d]: I missed the `let dst = b.alloc_ssa_vec(RegFile::GPR, size_B.div_ceil(4));`
15:44karolherbst[d]: mhhhh
15:45karolherbst[d]: guess I need to deal with the situation that actually convergent nir values might not end up as ugpr later on...
15:45gfxstrand[d]: yes
15:46gfxstrand[d]: The rule is `!divergent && !block.divergent`
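A minimal sketch of that rule as a Rust predicate. `DefInfo` and `can_def_ugpr` are hypothetical names, not NAK's actual types. The `block_divergent` half is what karolherbst runs into below: even a uniform value cannot safely live in a UGPR inside divergent control flow, because threads on different paths could clobber the shared register.

```rust
/// Hypothetical, simplified view of what the compiler knows about an SSA def.
#[derive(Clone, Copy, Debug)]
pub struct DefInfo {
    /// The value itself may differ between threads.
    pub divergent: bool,
    /// The defining block is reached under divergent control flow.
    pub block_divergent: bool,
}

/// A value may be defined in a UGPR only if it is uniform AND its
/// defining block is not divergent (`!divergent && !block.divergent`).
pub fn can_def_ugpr(d: DefInfo) -> bool {
    !d.divergent && !d.block_divergent
}
```

As discussed further down, this only holds relative to the block divergence that `nak_nir_lower_cf()` computes, so checking it earlier in NIR can give slightly different answers.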
15:46karolherbst[d]: guess I could legalize it to an add
15:46gfxstrand[d]: yup
15:46karolherbst[d]: gfxstrand[d]: not the issue I'm seeing here tho
15:46gfxstrand[d]: Or lower late enough that the lowering can be smart
15:46karolherbst[d]: just ldc handling seems to always emit to a gpr
15:47gfxstrand[d]: `opt_non_uniform()` in NAK should add an R2UR
15:47karolherbst[d]: div 32 %70 = ishl %67, %69 (0x2)
15:47karolherbst[d]: con 32x2 %71 = @ldc_nv (%48 (0x1), %29 (0x0)) (access=none, align_mul=16, align_offset=0)
15:47karolherbst[d]: con 64 %72 = pack_64_2x32 %71
15:47karolherbst[d]: div 32 %73 = @load_global_nv (%70, %72) (base=0, access=readonly|reorderable, align_mul=4, align_offset=0)
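A rough model of the address arithmetic in the NIR above: `%71` is two 32-bit words read from constant buffer 1, `pack_64_2x32` joins them into a 64-bit base (`%72`), and `%70 = ishl %67, 2` is an index scaled by 4 bytes. This sketch assumes the per-thread 32-bit part is zero-extended before being added to the uniform 64-bit base, which the log does not state; `effective_address` is a hypothetical helper.

```rust
/// Mirrors NIR's pack_64_2x32: component 0 is the low word, component 1 the high word.
pub fn pack_64_2x32(lo: u32, hi: u32) -> u64 {
    (lo as u64) | ((hi as u64) << 32)
}

/// Address for the `@load_global_nv (%70, %72)` pattern, under the
/// zero-extension assumption described above.
pub fn effective_address(cb_lo: u32, cb_hi: u32, index: u32) -> u64 {
    let base = pack_64_2x32(cb_lo, cb_hi); // %72, uniform
    let offset = index << 2; // %70 = ishl %67, 2 (32-bit, per-thread)
    base.wrapping_add(offset as u64)
}
```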
15:47karolherbst[d]: yeah.. might not do it for loads?
15:47karolherbst[d]: but the block is indeed divergent.. mhh
15:48karolherbst[d]: annoying
15:48karolherbst[d]: ohh right.. because threads could clobber the ugpr...
15:49karolherbst[d]: yeah okay, then I guess I have to check also the def block and hope the issue goes away
15:49karolherbst[d]: totally missed that aspect
15:50gfxstrand[d]: 👍🏻
15:50mhenning[d]: gfxstrand[d]: That's not correct. We have situations where def_block is not divergent but we still don't allow defining ugprs
15:50karolherbst[d]: we have `nir_opt_licm` to move such values out of blocks I should maybe also wire up at the same time
15:50mhenning[d]: the correct way to do it is to traverse the control flow tree looking for divergent if/loop
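A sketch of what "traverse the control flow tree" could look like: walk from a node up through its ancestors and report whether any enclosing `if` or `loop` is divergent. The `CfNode`/`CfKind` tree is hypothetical and is not NIR's `nir_cf_node` API. It also covers the case karolherbst hits later, where a convergent `if` sits inside a divergent one.

```rust
/// Hypothetical control-flow tree node kinds.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CfKind {
    Function,
    Block,
    If { divergent_condition: bool },
    Loop { divergent: bool },
}

pub struct CfNode {
    pub kind: CfKind,
    pub parent: Option<usize>, // index into the node array
}

/// True if any enclosing if/loop of `node` is divergent, i.e. the node
/// may run with only part of the warp active.
pub fn under_divergent_cf(nodes: &[CfNode], mut node: usize) -> bool {
    while let Some(p) = nodes[node].parent {
        match nodes[p].kind {
            CfKind::If { divergent_condition: true } => return true,
            CfKind::Loop { divergent: true } => return true,
            _ => {}
        }
        node = p;
    }
    false
}
```

Checking only the immediate parent would miss node 6 in the test below (a block under a uniform `if` that is itself nested in a divergent `if`).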
15:51karolherbst[d]: that would move the ldc_nv outside the divergent block
15:51karolherbst[d]: I think
15:51karolherbst[d]: mhhh
15:51karolherbst[d]: mhenning[d]: at this point it almost sounds saner to lower it back to a gpr + gpr add if the assumption I make on the nir side doesn't hold up
15:52gfxstrand[d]: mhenning[d]: Not sure what you mean. https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/compiler/nak/from_nir.rs?ref_type=heads#L4039
15:52mhenning[d]: maybe
15:53karolherbst[d]: I should check the block to at least get as few false positives as possible, but if there is a sub-optimal add not merged into a lea in like 2 shaders, whatever
15:53gfxstrand[d]: Yes, our definition of what blocks are divergent is slightly different from NIR's in a few cases. But relative to what `nak_nir_lower_cf()` does, that is the rule.
15:53gfxstrand[d]: But yeah, if you check before that pass is run you might get slightly different results
15:53karolherbst[d]: I'm just prototyping still and only want to get rid of a crash atm 🙃
15:59karolherbst[d]: ... now I'm running into the issue of a convergent block inside a divergent block..
15:59karolherbst[d]: I should just lower that
16:02gfxstrand[d]: The uniform optimization might also be assuming you can shove GPRs in anywhere and you might need some more special casing there.
16:03mhenning[d]: gfxstrand[d]: Ah, I guess I've been assuming that karol is working in nir before `nak_nir_lower_cf` which would then cause that discrepancy
16:03karolherbst[d]: Well..
16:03karolherbst[d]: it's weird
16:04mhenning[d]: the block's divergence info is incorrect before that pass for the purpose of checking if you can use a ugpr
16:07karolherbst[d]: atm I move the uniform base address into the load/store in nir, but that was me ignoring that divergent blocks prevent the use of ugprs...
16:10gfxstrand[d]: So, yeah, I think there are two steps: Make the lowering a little smarter so it checks `def->divergent || def->parent_instr->block->divergent` and also we need to make NAK smart enough to not fall over if things aren't UGPRs for some reason.
16:10karolherbst[d]: yeah..
16:10karolherbst[d]: gfxstrand[d]: apparently I need to indeed check the entire chain, at least with the shader I'm now crashing on
16:11karolherbst[d]: or uhm...
16:11karolherbst[d]: I guess it's the "divergence info is incorrect" situation
16:11karolherbst[d]: which now gets even more annoying
16:13karolherbst[d]: well.. I'll get it to work
18:01orowith2os[d]: Nvidia shield received 🔥
18:02orowith2os[d]: steel01[d]: I'm gonna rope you in, since you seem most active here: any documentation I should follow to get Linux onto it?
18:02orowith2os[d]: Specifically, I'm gonna try my hand at an ostree variant of Fedora, if possible.
18:02HdkR: Woo SHIELD {portable,tablet,tv}
18:11steel01[d]: orowith2os[d]: Afaik, no. There's a thread on xda about loading an old version of l4t, nvidia's downstream ubuntu fork, but that's all that I know. Mainline doesn't run due to lack of a device tree.
18:11steel01[d]: Once I get my units back with uart exposed, I'll be trying to get those device trees done.
18:12steel01[d]: Which variant did you get? If you boot the unit and go to settings->device->about, it should tell you.
18:14orowith2os[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1417936827721187328/PXL_20250917_181433386.jpg?ex=68cc4c13&is=68cafa93&hm=af798e46decb69490db1758d4551f0391ca83129d8af47a50bce22ff43c4fed5&
18:15orowith2os[d]: Oh, damn, that's a mean camera artifact.
18:15orowith2os[d]: My phone does NOT like taking pictures of the TV.
18:15steel01[d]: Scroll down a bit further. Should have a build version.
18:15steel01[d]: And say like darcy, foster_e, etc.
18:16orowith2os[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1417937148732244129/PXL_20250917_181552041.jpg?ex=68cc4c60&is=68cafae0&hm=6a5ff49483459fe8520d5f15409b4a2f7773c39dd31a07d22c3fb976baef2d24&
18:16orowith2os[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1417937267904745522/PXL_20250917_181619920.jpg?ex=68cc4c7c&is=68cafafc&hm=364a973718600808e8800a3b930aeb201a358ce0cbf788cd4e48537c572c29ac&
18:16orowith2os[d]: About -> Status -> Hardware Information fwiw
18:16steel01[d]: Ah, there we go.
18:16steel01[d]: So the 2017 model, darcy.
18:18steel01[d]: That actually does have half a kernel device tree. But not enough to boot in a user friendly way. Like, no display, etc.
18:19orowith2os[d]: So just a plain TTY, no graphics at all?
18:19orowith2os[d]: (basically only good for an arm headless server?)
18:19steel01[d]: https://github.com/torvalds/linux/blob/master/arch/arm64/boot/dts/nvidia/tegra210-p2894.dtsi
18:20steel01[d]: The display won't even turn on because the software won't know it's there. You'd have to have uart serial soldered down to get anything out of it.
18:20orowith2os[d]: Ah, I see
18:20orowith2os[d]: So I have to run a tegra kernel?
18:21orowith2os[d]: Is that even kept up to date enough for me to reliably use Linux with, even if it's without GPU accel...?
18:21orowith2os[d]: (does it even support KMS?)
18:21steel01[d]: The downstream 4.9 kernel? Yeah, no. Nvgpu isn't using drm at all.
18:22steel01[d]: https://xdaforums.com/t/dev-ubuntu-18-04-lts-bionic-linux-for-tegra-for-shield-android-tv.4465949/
18:22steel01[d]: Oh. This only advertises support for the 2015 models anyways.
18:23steel01[d]: Well, hopefully in a few weeks, I'll have mainline support in better shape.
18:24orowith2os[d]: Alrighty, lmk if there's anything I can do :EmmaValPenguinSalute:
18:24orowith2os[d]: (though, I don't have uart, or any way to access that)
18:24orowith2os[d]: Just a keyboard and remote.
18:25steel01[d]: Yeah. Once it's booting, and initial nvk support is running, having more people throw 'random' stuff at it will be useful. Break all the things early to get it fixed earlier.
18:26orowith2os[d]: At the very least, display is all I need
18:26orowith2os[d]: I'm not worried about GPU accel yet
18:27steel01[d]: Yeah, that part 'should' be pretty simple. Tegra-drm should 'just work'. But when I tried to boot stuff blind on my units, it failed to boot even the simplest stuff. Hence having to go hunt down uart to get logs.
18:27steel01[d]: Probably something exceedingly stupid. But without logs, it's almost impossible to know what.
18:34orowith2os[d]: Keep me posted, so long as it doesn't involve uart, I'll be okay :EmmaValPenguinSalute:
18:35mhenning[d]: marysaka[d]: I'm trying to get envyhooks running again on my machine. Right now it's dying with "assertion failed: old_va_entry.is_none()" Is this a known issue?
18:35mhenning[d]: currently on 580.82.09 fwiw
18:43marysaka[d]: hmm I need to update it to 580.xx likely
18:43marysaka[d]: what are you testing that against?
18:44mhenning[d]: both vkcube and a test I wrote
18:45marysaka[d]: okay will try to reproduce
18:49marysaka[d]: For reference I pushed the rewrite of the memory tracker last month so it might have some bugs still but should track way more than before (and be able to differentiate VAs)
18:58marysaka[d]: mhenning[d]: fixed with latest commit
18:59marysaka[d]: there were some user API changes with 580.xx
19:27mhenning[d]: marysaka[d]: working now, thanks!
21:07karolherbst[d]: okay.. better, but:
21:07karolherbst[d]: ur9 = ldc.b32 c[0x1][+0x0] // delay=1
21:07karolherbst[d]: ur10 = ldc.b32 c[0x1][+0x4] // delay=1
21:07karolherbst[d]: r16 = mov ur9 // delay=1
21:07karolherbst[d]: r17 = mov ur10 // delay=5
21:07karolherbst[d]: r15 = ld.global.a64.constant.u16 [r15+r16..18] // delay=2 rd:0 wr:1
21:07karolherbst[d]: I guess that's opt uniform's fault somewhere
21:08karolherbst[d]: or uniform lowering actually..
21:08karolherbst[d]: ohh wait.. shoo
21:11karolherbst[d]: `ld.global.a64.constant.b32 [%r16832+{%ur224 %ur225}]` ehh
21:14karolherbst[d]: yeah so it's legalize, even though I already removed the legalize stuff from the opcode
21:17karolherbst[d]: that's gonna be hell to debug
21:17mhenning[d]: huh, apparently the proprietary driver will combine consecutive overlapping vkCmdUpdateBuffer into one
21:22mhenning[d]: karolherbst[d]: legalize.rs line 441 is what's bitten us in the past, not sure if that's what you're running into or not
21:22karolherbst[d]: yeah...
21:23karolherbst[d]: I'm looking at it and was wondering
21:24karolherbst[d]: uniform regs are almost for free, I wonder if the solution is simply to like load them elsewhere
21:25karolherbst[d]: but I've also seen nvidia rematerialize them over and over again, probably to reduce pressure on the file
21:26mhenning[d]: well, spilling can't currently load to uniform regs in nonuniform control flow, so you need to be careful not to use too many uregs
21:26mhenning[d]: or spilling will just give up
21:26karolherbst[d]: yeah.. that's the "almost" part
21:27karolherbst[d]: mhhhh
21:27karolherbst[d]: mhenning[d]: I wonder.. is this an assumption or has anybody actually reverse engineered how it all works?
21:27karolherbst[d]: like I'm sure it's a bit more predictable
21:28mhenning[d]: It's a limitation of the compiler
21:28karolherbst[d]: right... I was like thinking of a divergent if/else within a convergent parent, where each path could just use different uniform regs and it would be fine
21:29karolherbst[d]: probably
21:29mhenning[d]: Yeah, I think pre-blackwell we could just do that and it would be fine
21:29mhenning[d]: post-blackwell I think you need to do some additional work to prevent it from switching
21:29karolherbst[d]: but not gonna do it just for the io ureg stuff 😄
21:30karolherbst[d]: for now I just want more complex shaders to finally compile, lol
21:30mhenning[d]: but anyway that's a rough understanding and it's a really big task to reverse engineer semantics and rework the compiler
21:30karolherbst[d]: mhhh
21:30karolherbst[d]: anyway
21:31karolherbst[d]: the op legalize function is called after that
21:31karolherbst[d]: I could just check if the load/store somehow ended up with two gprs and just convert it to an add
21:31karolherbst[d]: probably the less painful workaround for now
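A minimal sketch of that workaround on a toy IR: if a global load's "uniform" offset ended up in a GPR (as in `ld.global.a64.constant.u16 [%r14526+{%r15605 %r15606}]`), emit an explicit add first and load from the single summed address. `Reg`, `Instr`, and `legalize_ld` are hypothetical; the real pass would operate on NAK's own instruction types and on 64-bit register pairs.

```rust
/// Hypothetical register operand: per-thread GPR or uniform UGPR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Reg {
    Gpr(u32),
    Ugpr(u32),
}

/// Hypothetical, heavily simplified instructions.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Instr {
    /// dst = a + b, standing in for a 64-bit address add.
    Iadd { dst: Reg, a: Reg, b: Reg },
    /// dst = load [addr + uoff]; `uoff` is expected to be a UGPR.
    LdGlobal { dst: Reg, addr: Reg, uoff: Option<Reg> },
}

/// If the offset ended up in a GPR, fold it into the address with an add.
/// `tmp` is a freshly allocated GPR supplied by the caller.
pub fn legalize_ld(instr: Instr, tmp: Reg) -> Vec<Instr> {
    match instr {
        Instr::LdGlobal { dst, addr, uoff: Some(off @ Reg::Gpr(_)) } => vec![
            Instr::Iadd { dst: tmp, a: addr, b: off },
            Instr::LdGlobal { dst, addr: tmp, uoff: None },
        ],
        // UGPR offset, no offset, or not a load: already legal, keep as-is.
        other => vec![other],
    }
}
```

The cost is one extra add per affected load, which matches mhenning's point that this is fine as long as it does not happen often.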
21:32mhenning[d]: Yeah, as long as that doesn't happen too often that should be fine
21:33karolherbst[d]: I need to do some stats, but I already had benchmarks seeing +10% just because it hammers down register pressure a lot
21:33karolherbst[d]: the whole thing I mean
21:36karolherbst[d]: okay.. I'm getting `ld.global.a64.constant.u16 [%r14526+{%r15605 %r15606}]` into the `legalize` function, so I can work with that indeed
22:17marysaka[d]: mhenning[d]: I guess it delays emitting it before the next different command? that or they merge/replace the content of the inline upload
22:18marysaka[d]: also note for Fill and Copy if the size is bigger than 1 << 17 bytes, they schedule a compute shader for it
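A small sketch of the size-based split marysaka describes for the proprietary driver's Fill and Copy. 1 << 17 bytes is 131072 bytes (128 KiB). `choose_path` and `FillCopyPath` are hypothetical names, the small-size path is left unspecified because the log does not say what it is, and "bigger than" is read as strictly greater.

```rust
/// Threshold from the log: above 1 << 17 bytes a compute shader is scheduled.
pub const COMPUTE_THRESHOLD_BYTES: u64 = 1 << 17;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FillCopyPath {
    /// Whatever the driver does for small sizes (not described in the log).
    Direct,
    /// Meta fill/copy implemented as a compute dispatch.
    Compute,
}

/// Picks the path for a fill or copy of `size_bytes`.
pub fn choose_path(size_bytes: u64) -> FillCopyPath {
    if size_bytes > COMPUTE_THRESHOLD_BYTES {
        FillCopyPath::Compute
    } else {
        FillCopyPath::Direct
    }
}
```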
22:18mohamexiety[d]: yeah doing compute meta copies and fill was on my list
22:30karolherbst[d]: okay, got it to compile, but getting faults.. well that's kinda expected almost
22:30karolherbst[d]: prolly missed wiring something up
23:20mhenning[d]: marysaka[d]: yeah, not sure. It seems it can even do it sometimes if I insert a vkCmdPipelineBarrier between them.
23:20mhenning[d]: I guess their command recording is somewhat complex
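One way such merging could work, as a hypothetical sketch rather than what the proprietary driver actually does: keep the pending update around and, when the next one targets the same buffer with an overlapping or adjacent range, fold it in so the later bytes win. `PendingUpdate` and `try_merge` are invented names, and barriers are not modeled at all.

```rust
/// Hypothetical pending vkCmdUpdateBuffer-style write.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct PendingUpdate {
    pub buffer: u64, // hypothetical buffer handle
    pub offset: u64,
    pub data: Vec<u8>,
}

/// Merge `next` into `cur` if both target the same buffer and their byte
/// ranges overlap or touch. Bytes from `next` overwrite `cur`.
pub fn try_merge(cur: &mut PendingUpdate, next: &PendingUpdate) -> bool {
    if cur.buffer != next.buffer {
        return false;
    }
    let cur_end = cur.offset + cur.data.len() as u64;
    let next_end = next.offset + next.data.len() as u64;
    if next.offset > cur_end || cur.offset > next_end {
        return false; // disjoint with a gap: keep as separate commands
    }
    // Overlapping/adjacent ranges form one contiguous span, so no gap remains.
    let start = cur.offset.min(next.offset);
    let end = cur_end.max(next_end);
    let mut merged = vec![0u8; (end - start) as usize];
    let a = (cur.offset - start) as usize;
    merged[a..a + cur.data.len()].copy_from_slice(&cur.data);
    let b = (next.offset - start) as usize;
    merged[b..b + next.data.len()].copy_from_slice(&next.data);
    cur.offset = start;
    cur.data = merged;
    true
}
```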