01:04steel01[d]: on early-fs
01:04steel01[d]: wait /dev/dri/card0 10
01:04steel01[d]: Alright gfxstrand[d]. In the repo `device/nvidia/foster`, in the file `initfiles/ack.rc`, add this block. If you want to modify this on device without doing a reinstall, it's at `/vendor/etc/init/ack.rc`. vim and nano are both available on device. This makes sure that tegra-drm finishes enumeration before nouveau gets loaded, thus ensuring card order.
02:31steel01[d]: steel01[d]: And tx2 is throwing effectively the same stack trace now. So they might be at parity.
02:40gfxstrand[d]: When I do get the right device order, it's now faulting in dmesg. With that, it should be way more reproducible.
02:49steel01[d]: Mmm, I'm not getting driver faults. Just the texture creation failure.
02:50steel01[d]: This might be helpful for reference. TX1:
02:50steel01[d]: 01-01 00:00:56.331 490 583 E [minigbm:nouveau_bo_create_for_modifier(681)]: Allocating new BO: domain = 0x4, tile_mode = 0xfe, pte_kind = 0x50, modifier = 0x3000000000fe015
02:50steel01[d]: TX2:
02:50steel01[d]: 01-01 00:01:06.015 511 529 E [minigbm:nouveau_bo_create_for_modifier(681)]: Allocating new BO: domain = 0x4, tile_mode = 0xfe, pte_kind = 0x50, modifier = 0x3000000000fe015
02:50steel01[d]: Which... looks the same to me.
09:45karolherbst[d]: mhenning[d]: yeah... https://gist.github.com/karolherbst/befa838221d50b1fa2741b28629307fd here I see the issue I've mentioned.
09:51jja2000[d]: I think you can probably build some wordlists and score a message based on those wordlists and then ban the person if it exceeds some score
09:51jja2000[d]: Lots of countries and regions, some very arbitrarily weird words
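The wordlist idea above can be sketched roughly as follows. This is only an illustration: `spam_score`, `should_ban`, the per-word weights, and the threshold are all hypothetical names and values, not part of any existing bot.

```rust
use std::collections::HashMap;

/// Hypothetical scorer: sums the weight of every wordlist entry found in the
/// message. Words are split on non-alphanumeric characters and lowercased.
fn spam_score(message: &str, wordlist: &HashMap<&str, u32>) -> u32 {
    message
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .filter_map(|w| wordlist.get(w.as_str()).copied())
        .sum()
}

/// Hypothetical policy: ban once the score exceeds the chosen threshold.
fn should_ban(message: &str, wordlist: &HashMap<&str, u32>, threshold: u32) -> bool {
    spam_score(message, wordlist) > threshold
}
```

As noted right after in the log, scoring alone does not solve the latency problem: a burst of messages can still land before the bot reacts.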
09:53karolherbst[d]: maybe
09:53karolherbst[d]: but a bot isn't fast enough to react
09:54karolherbst[d]: like IRC is slow, they will get 1-3 messages through in bursts and then get kicked. Nothing changes
09:55karolherbst[d]: I'm just banning all those corpo /24 networks now, should run out of them eventually (or out of money)
09:57karolherbst[d]: some of the last ones are actually hosted in the US... maybe with a lawyer something can be done there...
09:57asuasuasu[d]: soon enough they're gonna create a disruptive tech startup that instead of using botnets to scrap the web, spams IRC
10:23karolherbst[d]: yeah.. adding a bit of waits after `war_latency` does fix the issue and nothing else did...
10:23karolherbst[d]: so yeah, confirmed it's that issue
10:48snowycoder[d]: karolherbst[d]: Ooooh, I see the problem, this is because we're walking the code backwards, right?
10:53karolherbst[d]: not entirely sure what's going wrong, but I think it never checks with the hmma
10:53snowycoder[d]: Current code works like this:
10:53snowycoder[d]: - visit C, R0 = [write C]
10:53snowycoder[d]: - visit B, resolve WaR, R0 = [read B]
10:53snowycoder[d]: - visit A, R0 = [read B, read A]
10:53snowycoder[d]: So it only accounts for the C-B WaR but not for C-A WaR
10:53karolherbst[d]: yeah something like that
10:56snowycoder[d]: A solution would be to always keep the latest Write alongside Reads
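A minimal sketch of the proposed fix, assuming a single register and a backwards walk over one block. `RegUse`, `Access`, and `war_hazards` are hypothetical names for illustration and do not mirror the actual NAK data structures. The point is that a read no longer displaces the tracked write, so every earlier read still sees it.

```rust
/// What has been seen for one register while walking a block backwards.
#[derive(Default)]
struct RegUse {
    /// Nearest later write (instruction index), kept even after reads arrive.
    write: Option<usize>,
    /// Later reads seen since that write.
    reads: Vec<usize>,
}

impl RegUse {
    /// Visiting a write: it becomes the tracked write and hands back the
    /// later reads it displaces (those would be RaW dependencies).
    fn add_write(&mut self, ip: usize) -> Vec<usize> {
        self.write = Some(ip);
        std::mem::take(&mut self.reads)
    }

    /// Visiting a read: record it and report the later write it must not be
    /// overtaken by (WaR). The write stays tracked for earlier reads.
    fn add_read(&mut self, ip: usize) -> Option<usize> {
        self.reads.push(ip);
        self.write
    }
}

#[derive(Clone, Copy)]
enum Access {
    Read,
    Write,
}

/// Returns (read_ip, write_ip) pairs for every WaR hazard on one register.
fn war_hazards(accesses: &[Access]) -> Vec<(usize, usize)> {
    let mut state = RegUse::default();
    let mut hazards = Vec::new();
    for (ip, access) in accesses.iter().enumerate().rev() {
        match access {
            Access::Write => {
                let _raw_readers = state.add_write(ip);
            }
            Access::Read => {
                if let Some(write_ip) = state.add_read(ip) {
                    hazards.push((ip, write_ip));
                }
            }
        }
    }
    hazards
}
```

With accesses A (read), B (read), C (write), this reports both the B-C and the A-C hazard, whereas the behaviour described above only accounts for B-C.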
10:57karolherbst[d]: there are two places where `war_latency` is called, and it's in `calc_instr_deps` and the sched_postpass
10:59snowycoder[d]: `calc_instr_deps` is the one that assigns delays, the other is instruction reordering
11:00karolherbst[d]: yeah I can ignore the postpass for now, but it could reorder in a suboptimal way
11:19snowycoder[d]: Don't hate me, but can I try to fix it in my cross-block branch?
11:19snowycoder[d]: It is actually better suited for a faster fix (reguses can contain both writes and reads at the same time)
11:19karolherbst[d]: sure
11:19karolherbst[d]: just throw me a branch
11:21karolherbst[d]: I saw another issue when not disabling ldsm use.. I wonder if it's the same issue...
11:25karolherbst[d]: ah yeah.. seems to be the same issue
11:25karolherbst[d]: "great"
11:27phomes_[d]: I updated the perfetto tracing layer and made it easy to run from devenv. The layer traces all vulkan calls and things from common mesa code. I added a readme file with instructions on how to use it
11:27phomes_[d]: https://gitlab.freedesktop.org/phomes/mesa/-/tree/perfetto-layer/
11:39snowycoder[d]: karolherbst[d]: Try this: https://gitlab.freedesktop.org/SnowyCoder/mesa/-/commits/cross_block_sched_fix_war
11:39snowycoder[d]: (haven't tested yet)
11:49karolherbst[d]: snowycoder[d]: does seem to work, but I also don't know if any of the other patches are masking the problem or not 🥲
11:52snowycoder[d]: Remove the last one
11:52snowycoder[d]: If what I think is correct, only the WaRaR patch should matter for your use-case
12:27karolherbst[d]: I finally cleaned up airlied[d]'s vec cmat load/store opt, and also made it support 8 and 32 bit matrices
12:40karolherbst[d]: on top of my fixes it's even relatively small now
12:42snowycoder[d]: karolherbst[d]: Does it still work without the last patch?
12:42karolherbst[d]: currently testing that
12:49karolherbst[d]: snowycoder[d]: yeah, looks like that still works
12:50karolherbst[d]: unless I messed up git, which looks like I might have
12:50karolherbst[d]: let me test again
12:51snowycoder[d]: karolherbst[d]: Uhh, that's weird, it should only modify things near block intersections
12:51karolherbst[d]: ohh ehh
12:51karolherbst[d]: yeah
12:51karolherbst[d]: it's broken
12:52karolherbst[d]: without `Hackfix: patch WaRaR` I mean
12:52karolherbst[d]: I messed up using git
12:52snowycoder[d]: Perfect, then yeah, you found the first WaRaR
12:52karolherbst[d]: nice.. 🥲
12:52karolherbst[d]: to be fair.. HMMA is the only instruction that can hit this apparently
12:54snowycoder[d]: We should still fix it, this has very subtle implications that I don't want to debug (WaPaR)
12:56karolherbst[d]: yeah...
13:15snowycoder[d]: Should I retrofit the fix for the main branch? My pass still has a 12% compile time slow-down (geomean)
13:17karolherbst[d]: yeah I think that would be great
13:17karolherbst[d]: then we could already merge the bits needed to fix this issue
15:56gfxstrand[d]: Looking at gpuinfo and it looks like NVIDIA claims `integerDotProductAccumulatingSaturating4x8BitPackedUnsignedAccelerated` and friends. So I guess IDP4 has a saturate bit?
15:57karolherbst[d]: let me check
15:59karolherbst[d]: gfxstrand[d]: the saturate is for the accumulation, right?
16:00karolherbst[d]: afaik there is no .sat modifier on `IDP`, however maybe there are other good ways to accelerate it?
16:01karolherbst[d]: maybe it just accumulates with 0 and then does the saturation manually?
16:01gfxstrand[d]: Yes, saturate is on the accumulate
16:01karolherbst[d]: ohhh wait...
16:01karolherbst[d]: mhh
16:01karolherbst[d]: there is something
16:01mhenning[d]: Doesn't look like there's a saturate in PTX's dp4a
16:01karolherbst[d]: `I2IP`
16:02karolherbst[d]: it's packed integer to integer conversion with .sat
16:02karolherbst[d]: maybe that's used?
16:02karolherbst[d]: though that's 32 bit to packed...
16:03gfxstrand[d]: Yeah, we could dp4 with 0 accum and then do a saturating add
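As a reference for the semantics being discussed, here is a sketch of the unsigned packed variant: a 4x8-bit dot product with a zero accumulator, followed by a saturating add of the real accumulator. The function names are made up for illustration, and Rust's `saturating_add` stands in for whatever instruction sequence a compiler would actually emit.

```rust
/// Dot product of four unsigned 8-bit lanes packed into each u32.
/// The result is at most 4 * 255 * 255 = 260100, so it cannot overflow a u32.
fn udot_4x8(a: u32, b: u32) -> u32 {
    (0..4u32)
        .map(|i| ((a >> (8 * i)) & 0xff) * ((b >> (8 * i)) & 0xff))
        .sum()
}

/// Accumulating, saturating form: dot product first, then a saturating add.
fn udot_4x8_acc_sat(a: u32, b: u32, acc: u32) -> u32 {
    udot_4x8(a, b).saturating_add(acc)
}
```

Because the dot product itself cannot overflow, only the final add needs saturation, which is why the discussion reduces to how to lower `uadd_sat` and `iadd_sat`.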
16:04gfxstrand[d]: But we also don't have iadd_sat so...
16:04karolherbst[d]: mhh
16:04gfxstrand[d]: Technically, it's less instructions than the client just open-coding it, I guess
16:04karolherbst[d]: well could do the sat with a carry and predicate
16:04gfxstrand[d]: For uadd, that works. Not so much for iadd
16:05karolherbst[d]: ~~use iadd3 and add `0x80000000`~~
16:06karolherbst[d]: though I could check what nvidia generates there for CL
16:06gfxstrand[d]: might be interesting
16:07gfxstrand[d]: But even so, the lowering for iadd_sat can't possibly be worse than open-coding a vec4 dp4
16:07gfxstrand[d]: Especially since you need to do an iadd_sat anyway
16:07karolherbst[d]: a lot...
16:08karolherbst[d]: ohh.. the ptx is cursed
16:08snowycoder[d]: karolherbst[d]: What if I magically made the new pass faster than the older one? (-3% on avg)
16:08karolherbst[d]: doing a 64 bit add and then saturated i2i
16:08gfxstrand[d]: karolherbst[d]: That's actually not horrible.
16:08karolherbst[d]: yeah..
16:08gfxstrand[d]: It's an i2i and it's 2 iadd3s but still
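The PTX-level idea mentioned above (widen, add, then narrow with saturation) can be modelled like this. It is a behavioural sketch only; `iadd_sat_via_i64` is a hypothetical name and the code says nothing about the SASS that is actually generated.

```rust
/// Signed saturating add modelled as a 64-bit add followed by a saturating
/// narrowing conversion back to 32 bits.
fn iadd_sat_via_i64(a: i32, b: i32) -> i32 {
    // The 64-bit sum of two i32 values can never overflow.
    let wide = i64::from(a) + i64::from(b);
    // Saturating narrow: clamp into the i32 range, then truncate.
    wide.clamp(i64::from(i32::MIN), i64::from(i32::MAX)) as i32
}
```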
16:09karolherbst[d]: the generated code is still massive
16:09karolherbst[d]: it doesn't actually generate i2i
16:10karolherbst[d]: https://gist.github.com/karolherbst/9fb5aa5621220863cc56e00dfb953fc4
16:10gfxstrand[d]: `uadd_sat` should just be
16:10gfxstrand[d]: x, p = iadd3 a, b, 0
16:10gfxstrand[d]: x = sel p, ~0, x
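The two-instruction `uadd_sat` lowering above maps directly onto a carry-out add plus a select. A small model, with Rust's `overflowing_add` standing in for the `iadd3` carry predicate:

```rust
/// Unsigned saturating add built from a carry-out add and a select.
fn uadd_sat(a: u32, b: u32) -> u32 {
    // x, p = iadd3 a, b, 0   (p is the carry-out)
    let (x, p) = a.overflowing_add(b);
    // x = sel p, ~0, x
    if p { !0 } else { x }
}
```

This works for the unsigned case because carry-out and overflow are the same condition there; for signed adds they are not, which is the objection raised later in the log.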
16:11karolherbst[d]: heh there is `add{.sat}.s32` in PTX tho...
16:11karolherbst[d]: let me try that
16:13karolherbst[d]: https://gist.github.com/karolherbst/a97f01c8cb59cfe2fe210496e31c5d36
16:14karolherbst[d]: but I suspect that because the IDP can be, like, fast, doing a lowered iadd.sat is probably still better than whatever the app could do
16:15karolherbst[d]: I like my carry predicate idea better
16:16gfxstrand[d]: Ooh! What is this sign bit form of plop3?
16:16mhenning[d]: snowycoder[d]: That would be useful. I think the compile time hit is the main thing blocking that MR right now
16:17karolherbst[d]: gfxstrand[d]: only reads the sign bit
16:18gfxstrand[d]: Well, yes. But it looks useful
16:18gfxstrand[d]: We don't have that encoding in NAK right now
16:18gfxstrand[d]: karolherbst[d]: You can't use carry predicates for `iadd_sat`.
16:18snowycoder[d]: mhenning[d]: I used the magic of allocating 1/500 of the memory
16:18gfxstrand[d]: For `uadd_sat`, you can
16:18karolherbst[d]: yeah but you could add 0x80000000 and remove it later
16:18gfxstrand[d]: Eh...
16:18gfxstrand[d]: I guess?
16:19karolherbst[d]: mhhh
16:19gfxstrand[d]: I don't see why that's better than the plop magic
16:19karolherbst[d]: probably isn't
16:19karolherbst[d]: so apparently .SIGN exists on all three sources
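For reference, signed overflow can be detected from sign bits alone, which is why a logic op that reads only the sign bit of each source looks useful here. The sketch below is a generic formulation of that idea, not a transcription of the code in the gists; `iadd_sat_sign` is a hypothetical name.

```rust
/// Signed saturating add using only a wrapping add and sign-bit logic.
fn iadd_sat_sign(a: i32, b: i32) -> i32 {
    let sum = a.wrapping_add(b);
    // Overflow happens exactly when a and b share a sign and the sum's sign
    // differs from it, i.e. when the sign bit of (a ^ sum) & (b ^ sum) is set.
    let overflow = ((a ^ sum) & (b ^ sum)) < 0;
    if overflow {
        // Saturate towards the sign of the inputs.
        if a < 0 { i32::MIN } else { i32::MAX }
    } else {
        sum
    }
}
```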
16:19karolherbst[d]: nice
16:20mhenning[d]: snowycoder[d]: not sure what that means concretely but sounds promising
16:21snowycoder[d]: mhenning[d]: I made a sparse RegTracker, nothing special but removes most of the memory-boundness
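To illustrate what "sparse" could mean here, this is a hypothetical tracker that only allocates state for registers that are actually touched, instead of one slot per possible register. `SparseRegTracker` and its methods are invented for this sketch and are not taken from the branch.

```rust
use std::collections::BTreeMap;

/// Hypothetical sparse tracker: per-register state lives in a map keyed by
/// register index, so untouched registers cost no memory.
struct SparseRegTracker<T> {
    entries: BTreeMap<u32, T>,
}

impl<T: Default> SparseRegTracker<T> {
    fn new() -> Self {
        Self { entries: BTreeMap::new() }
    }

    /// Returns the state for `reg`, creating a default entry on first use.
    fn get_mut(&mut self, reg: u32) -> &mut T {
        self.entries.entry(reg).or_default()
    }

    /// Returns the state for `reg` if it was ever touched.
    fn get(&self, reg: u32) -> Option<&T> {
        self.entries.get(&reg)
    }

    /// Number of registers that currently have state.
    fn len(&self) -> usize {
        self.entries.len()
    }
}
```

The trade-off in a design like this is a slower lookup per access in exchange for a much smaller footprint when only a few registers are live at a time.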
16:29gfxstrand[d]: Filed it so we don't lose it: https://gitlab.freedesktop.org/mesa/mesa/-/issues/14153
18:53gfxstrand[d]: Did we figure out the sample locations bug?
18:53snowycoder[d]: What bug?
18:53gfxstrand[d]: It's still open so I'm going with no https://gitlab.freedesktop.org/mesa/mesa/-/issues/14108
19:21marysaka[d]: I think we have something similar on HK btw
19:21marysaka[d]: still haven't had the time to just dig into that
19:56phomes_[d]: mohamexiety[d]: I have tested the compression MR with a lot of games the last few days. No issues found. Same perf improvement as before
19:57mohamexiety[d]: phomes_[d]: Awesome! Thanks so much ❤️ ❤️
20:15mhenning[d]: gfxstrand[d]: No, I haven't managed to figure that one out. Next steps I can think of are seeing if it passes on proprietary and checking if the test case makes sense
20:17mhenning[d]: also, if anyone has cycles to review, https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37964 is ready
20:35mohamexiety[d]: oof, sorry about that one
20:53gfxstrand[d]: mohamexiety[d]: No worries. It was easy to miss and Mel caught it pretty quick.
20:54mhenning[d]: Yeah, also I only caught it because cts added a test.
20:58mhenning[d]: and in addition to that a lot of mesa drivers don't actually need any special handling for that bit. Very easy to skip over.