00:08redsheep[d]: mohamexiety[d]: I'd expect it to work on ada, last time I checked Talos 2 worked fine and it seems to be almost the same as Talos 1 remastered. Not that that says anything about Blackwell.
00:09redsheep[d]: cubanismo[d]: I want to say microcenter has them, I know I saw some components but not sure the minimum number in a package. I've only been once though, microcenter is an 8 hour drive away.
00:11redsheep[d]: Really wish I could be helping with the blackwell bug hunt. The enormous purchase just never made enough sense outside of testing nvk.
00:17gfxstrand[d]: Running DA:TV with `NVK_DEBUG=push_sync` causes it to hang on something unrelated.
00:28gfxstrand[d]: Yeah, looks like `push_sync` syncs too much and waits start timing out inside the app.
00:30redsheep[d]: Maybe a crazy and overly basic solution but how hard would it be to wire up a delay so push sync only starts once the app has been open for X number of seconds?
00:40gfxstrand[d]: Honestly? Not hard.
00:47redsheep[d]: Might get you past the timeouts
00:53redsheep[d]: If there's anything I've learned working in IT instead of dev, it's that the low tech or no tech solution is often the best one
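The suggested workaround can be sketched in a few lines. This is a toy model of the idea, not actual NVK code; all names here are invented, and the real driver would check elapsed time in C inside the submit path:

```python
import time

class DelayedDebugFlag:
    """Only honor a heavyweight debug flag (like NVK_DEBUG=push_sync)
    once the app has been running for delay_s seconds, so that the
    extra syncing doesn't trip app-side timeouts during startup.
    Hypothetical helper, not part of NVK."""

    def __init__(self, delay_s, now=time.monotonic):
        self.start = now()
        self.delay_s = delay_s
        self.now = now

    def push_sync_enabled(self):
        # Stay disabled until the grace period has elapsed.
        return self.now() - self.start >= self.delay_s
```

A fake clock makes the behavior easy to check: the flag reads as off at t=0 and on once the delay has passed.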
00:53gfxstrand[d]: Yeah, as soon as my timed out sync kicks in, it locks up again
00:55redsheep[d]: Does it output anything before the app dies?
00:55gfxstrand[d]: Nope
00:56redsheep[d]: That's annoying
00:56mhenning[d]: does cts like push_sync
00:57gfxstrand[d]: Most tests do
00:58redsheep[d]: Does it currently output to disk on the same thread? Wonder if spinning off that task makes it fast enough not to break
00:59redsheep[d]: Though if it's not outputting even a little I feel like the answer is no
00:59gfxstrand[d]: I made it also spin up a submit thread if `NVK_DEBUG=push_sync` is set
00:59gfxstrand[d]: That seems to help a little
01:01redsheep[d]: I recall push sync having an insane amount of output, I wonder if a ramdisk might help too, if you only need it to last for a minute or less
01:02gfxstrand[d]: `NVK_DEBUG=push` has a lot of output
01:02gfxstrand[d]: but it's not dumping any pushbufs
01:06gfxstrand[d]: Yeah, it kinda looks like syncing just slows it down too much and somewhere either in the app or in VKD3D it gives up
01:11gfxstrand[d]: I'm also sometimes seeing `-EBUSY` on memory allocations
01:11gfxstrand[d]: I feel like this is a kernel bug
01:23cubanismo[d]: redsheep[d]: I don't actually live in Silicon Valley anymore, but there was a Fry's in San Diego as well. Nearest Micro Center is in Orange County, which is like a 1.5 hour drive at midnight, but 2-3 hours pretty much any time they'd actually be open. I never need a resistor that bad, fortunately.
01:29redsheep[d]: cubanismo[d]: I'd love to have ever had either one. Utah actually has a pretty booming tech sector but ever since CompUSA died nobody has been interested in those kinds of stores existing here.
01:30gfxstrand[d]: `[ 1086.424587] nouveau 0000:01:00.0: Dragon Age The [7023]: job timeout, channel 8 killed!`
01:30gfxstrand[d]: That's new
01:33redsheep[d]: I'm pretty sure my area is near the top of the list for getting a new microcenter location but they only seem to build one every like... 5-10 years. But, it pays to expand cautiously when that caution is probably the reason you're around and all your competitors have gone bust
01:36redsheep[d]: I've nearly opted for the 8 hour drive several times when there's big upcoming hardware releases
01:39gfxstrand[d]: I think fencing is a bit wrong in the kernel somewhere.
01:39gfxstrand[d]: These sync waits shouldn't be sticking
01:43gfxstrand[d]: Or not...
01:43gfxstrand[d]: [0x00000013] HDR 2001001e subch 0 NINC
01:43gfxstrand[d]: mthd 0078 unknown method
01:43gfxstrand[d]: .VALUE = 0x0
01:43gfxstrand[d]: [0x00000015] HDR 2004000a subch 0 NINC
01:43gfxstrand[d]: mthd 0028 NV906F_MEM_OP_A
01:43gfxstrand[d]: .OPERAND_LOW = (0x0)
01:43gfxstrand[d]: .TLB_INVALIDATE_ADDR = (0x0)
01:43gfxstrand[d]: .TLB_INVALIDATE_TARGET = VID_MEM
01:43gfxstrand[d]: mthd 002c NV906F_MEM_OP_B
01:43gfxstrand[d]: .OPERAND_HIGH = (0x0)
01:43gfxstrand[d]: .OPERATION = 0x0
01:43gfxstrand[d]: .MMU_TLB_INVALIDATE_PDB = ONE
01:43gfxstrand[d]: .MMU_TLB_INVALIDATE_GPC = ENABLE
01:43gfxstrand[d]: mthd 0030 unknown method
01:43gfxstrand[d]: .VALUE = 0x0
01:43gfxstrand[d]: mthd 0034 unknown method
01:43gfxstrand[d]: .VALUE = 0x28000000
01:43gfxstrand[d]: [0x0000001a] HDR 200120a6 subch 1 NINC
01:43gfxstrand[d]: mthd 0298 NVC7C0_INVALIDATE_SKED_CACHES
01:43gfxstrand[d]: .V = (0x0)
01:48mhenning[d]: gfxstrand[d]: Kernel-side, semaphores are supposed to be the same gv100+ so in theory they shouldn't be different
01:49gfxstrand[d]: The only thing that makes sense for that unknown method is `INVALIDATE_SHADER_CACHES_NO_WFI` but that's very much not 0x0078
01:49mhenning[d]: but also sync waits sticking sounds a little like some of the transfer-only queue issues I've seen
01:49mhenning[d]: gfxstrand[d]: NVC86F_WFI ?
01:52mhenning[d]: I'm guessing that beginning part is from NVK_BARRIER_INVALIDATE_MME_DATA; it looks fine to me
01:54gfxstrand[d]: Yup. And right before that is the texture data piece of `nvk_cmd_invalidate_deps()`, so the `0x0078` should be `INVALIDATE_SHADER_CACHES_NO_WFI`
01:55gfxstrand[d]: And there's nowhere in NVK where we ever emit `6F_WFI`
01:57mhenning[d]: gfxstrand[d]: what https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/vulkan/nvk_cmd_buffer.c#L664
01:58gfxstrand[d]: Ugh... I failed at grep
01:58gfxstrand[d]: Okay, so there's nothing strange here
02:04mhenning[d]: admittedly that code could be suspicious since we don't really know the semantics of OPERATION_MEMBAR in detail
02:07cubanismo[d]: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/kernel-open/nvidia-uvm/clc96f.h
02:07cubanismo[d]: Method 0x30 and 0x34 on the GPFIFO class are MEM_OP_C and MEM_OP_D.
02:07cubanismo[d]: I assume they're showing up as unknown because they're being decoded as NV906F class methods, where they don't exist yet.
02:08mhenning[d]: Yeah, the printer doesn't handle 6f methods quite right yet
02:11mhenning[d]: mhenning[d]: okay I lied we actually do have membar docs in manuals/volta/gv100/dev_pbdma.ref.txt and it sounds fine to me
02:12gfxstrand[d]: I fixed 6f a bit
02:19gfxstrand[d]: Now that I'm starting to see some hang batches, every single one of them is cache maintenance.
02:19gfxstrand[d]: I wonder if we've found our first HW bug
02:20gfxstrand[d]: Like, this one is just doing the usual beginning of command buffer stuff and then a couple `REPORT_SEMAPHORE`
02:21cubanismo[d]: Impossible. The hardware is perfect.
02:21gfxstrand[d]: So far it's been damn close
02:22HdkR: It's perfect, including all of the limitations that some people call bugs :)
02:23gfxstrand[d]: Seriously. I have yet to find an actual bug which, coming from Intel, is nothing short of a miracle.
02:24gfxstrand[d]: Unexpected behavior? Yes, but it always makes sense once I take the time to understand it. Annoying limitations? Also yes but same. Bugs? Haven't found one yet.
02:28gfxstrand[d]: Like, how does this time out?!?
02:28gfxstrand[d]: [0x00000000] HDR 20018040 subch 4 NINC
02:28gfxstrand[d]: mthd 0100 NVCAB5_NOP
02:28gfxstrand[d]: .PARAMETER = (0x0)
02:28gfxstrand[d]: [0x00000002] HDR 200120a6 subch 1 NINC
02:28gfxstrand[d]: mthd 0298 NVC7C0_INVALIDATE_SKED_CACHES
02:28gfxstrand[d]: .V = (0x0)
02:28gfxstrand[d]: [0x00000004] HDR 20012509 subch 1 NINC
02:28gfxstrand[d]: mthd 1424 NVC7C0_INVALIDATE_SAMPLER_CACHE_NO_WFI
02:28gfxstrand[d]: .LINES = ALL
02:28gfxstrand[d]: .TAG = (0x0)
02:28gfxstrand[d]: [0x00000006] HDR 20012091 subch 1 NINC
02:28gfxstrand[d]: mthd 0244 NVC7C0_INVALIDATE_TEXTURE_HEADER_CACHE_NO_WFI
02:28gfxstrand[d]: .LINES = ALL
02:28gfxstrand[d]: .TAG = (0x0)
02:28gfxstrand[d]: [0x00000008] HDR 20020509 subch 0 NINC
02:28gfxstrand[d]: mthd 1424 NVC797_INVALIDATE_SAMPLER_CACHE_NO_WFI
02:28gfxstrand[d]: .LINES = ALL
02:28gfxstrand[d]: .TAG = (0x0)
02:28gfxstrand[d]: mthd 1428 NVC797_INVALIDATE_TEXTURE_HEADER_CACHE_NO_WFI
02:28gfxstrand[d]: .LINES = ALL
02:28gfxstrand[d]: .TAG = (0x0)
02:28gfxstrand[d]: [0x0000000b] HDR 20010369 subch 0 NINC
02:28gfxstrand[d]: mthd 0da4 NVC797_INVALIDATE_SHADER_CACHES_NO_WFI
02:28gfxstrand[d]: .INSTRUCTION = TRUE
02:28gfxstrand[d]: .GLOBAL_DATA = TRUE
02:28gfxstrand[d]: .CONSTANT = TRUE
02:28gfxstrand[d]: [0x0000000d] HDR 200104a2 subch 0 NINC
02:28gfxstrand[d]: mthd 1288 NVC797_INVALIDATE_TEXTURE_DATA_CACHE_NO_WFI
02:28gfxstrand[d]: .LINES = ALL
02:28gfxstrand[d]: .TAG = (0x0)
02:29gfxstrand[d]: That's it. That's the pushbuf that's timing out.
02:29gfxstrand[d]: It makes zero sense
02:31mhenning[d]: In the transfer-only queue stuff I'm also seeing pushbufs that do almost nothing time out. I've been wondering if it's some sort of race where the interrupt fires too soon and then the kernel doesn't re-check the fence value or something like that
02:32mhenning[d]: but that's random guessing and I haven't been able to point out anything that's actually wrong on the kernel side
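The guessed race can be written down as a toy model: the completion interrupt fires, the handler reads the fence sequence number before the GPU's payload write has landed, and the waiter is never re-checked, so even a trivially short pushbuf "times out". A defensive re-check in the timeout path would paper over it. All names below are invented for illustration; this is not nouveau code:

```python
class FenceContext:
    """Toy model of a fence BO plus its waiters (hypothetical names)."""

    def __init__(self):
        self.seqno = 0      # value the GPU has written to the fence BO
        self.waiters = []   # each: {"target": n, "done": False}

    def add_waiter(self, target):
        w = {"target": target, "done": False}
        self.waiters.append(w)
        return w

    def irq_handler(self):
        # Signal every waiter whose target seqno has been reached.
        # If the payload write lands *after* this read, a waiter can
        # be missed and will appear to stick forever.
        for w in self.waiters:
            if self.seqno >= w["target"]:
                w["done"] = True

    def timeout_handler(self):
        # Re-check before declaring the channel hung: the seqno may
        # have landed just after the interrupt-time read.
        self.irq_handler()
        return all(w["done"] for w in self.waiters)
```

In the simulated race, the interrupt-time check misses the waiter, and only the re-check at timeout observes the late seqno write.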
02:32gfxstrand[d]: So you're saying the reason Blackwell crashes is because it's just too fast? :bim_giggle:
02:33gfxstrand[d]: airlied[d]: ^^
02:33mhenning[d]: something like that
03:05airlied[d]: I've long suspected but never proven a race on the irq/fence handling
03:33airlied[d]: Esp around the event allow/block and intr allowed stuff, but I've only found the race I fixed a year or two ago
03:36cubanismo[d]: What's the sequence for the fence/semaphore op + interrupt?
03:37cubanismo[d]: It's possible to do that wrong.
03:38mhenning[d]: cubanismo[d]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/gv100_fence.c?h=v6.17-rc2
03:39mhenning[d]: It looks the same as ogk's src/common/unix/nvidia-push/src/nvidia-push.c VoltaReleaseTimelineSemaphore / VoltaAcquireTimelineSemaphore to my eye
03:40mhenning[d]: I was just looking at it because I was wondering if we were missing a membar or something
03:43mhenning[d]: maybe we need to also set SEM_PAYLOAD_HI?
03:59airlied[d]: I do think we are missing a membar or wfi somewhere
04:00airlied[d]: uvm_channel.c also has some sequences
04:12airlied[d]: Might be good when we see a timeout to dump the whole fence bo to see what seq numbers made it
07:08cubanismo[d]: Yeah, it looks right. WFI_EN does an implicit sys membar
07:10cubanismo[d]: You could check that the mappings imply coherent CPU access.
07:11cubanismo[d]: But I assume there'd be bigger problems if that wasn't set up right.
07:13cubanismo[d]: Make sure there aren't any races if you're initializing the semaphore destination from the CPU as well.
07:16cubanismo[d]: And yeah, PAYLOAD_HI should be irrelevant if PAYLOAD_SIZE is 32BIT
09:19chikuwad[d]: how would I debug these mmu faults?
09:19chikuwad[d]: [ 1308.849389] nouveau 0000:01:00.0: gsp: rc engn:00000009 chid:10 gfid:0 level:2 type:39 scope:1 part:233 fault_addr:0000000000000000 fault_type:00000000
09:19chikuwad[d]: [ 1308.849395] nouveau 0000:01:00.0: fifo:c00000:000a:000a:[deqp-vk[49462]] errored - disabling channel
09:19chikuwad[d]: [ 1308.849399] nouveau 0000:01:00.0: deqp-vk[49462]: channel 10 killed!
09:20chikuwad[d]: Test case 'dEQP-VK.api.ds_color_copy.r32_sfloat_d32_sfloat_s8_uint_depth_level0_to_level0'..
09:20chikuwad[d]: Fail (Unexpected results found; check log for details)
09:21chikuwad[d]: hmm
09:23OftenTimeConsuming: File -> save as an image or video in tor browser now crashes it?
09:23OftenTimeConsuming: Updating to Linux-libre 6.16.2 and mesa 25.2.0 didn't help.
09:27chikuwad[d]: what gpu?
09:30OftenTimeConsuming: 780 Ti.
09:32OftenTimeConsuming: There's also another bug that I could reproduce across 2 different model 780 Ti's, where if any screen locker is ever run (xscreensaver or i3lock or xfce4-screensaver) then any future suspend will cause Xorg to segfault on resume.
12:00snowycoder[d]: How are you running nouveau with Blackwell?
12:00snowycoder[d]: Each time I try to launch sddm or kde (directly from CLI) the system just hangs up with no warnings in dmesg nor journalctl
12:00snowycoder[d]: Running with Linux 6.16.2 and mesa 25.2.1
12:07mohamexiety[d]: that's.. interesting
12:07mohamexiety[d]: do you have the prop driver installed?
12:10mohamexiety[d]: if you do, there's something else you have to do to make it work
12:10mohamexiety[d]: otherwise I am out of ideas
12:11snowycoder[d]: I do have it installed, but with a complex system of mirrors and levers, just nouveau is loaded and bound
12:13snowycoder[d]: Interestingly vulkaninfo hangs too unless I start it with my local mesa build. Is arch's mesa build borked?
12:18karolherbst[d]: was blackwell support added to 25.2?
12:18karolherbst[d]: ohh indeed...
12:19karolherbst[d]: snowycoder[d]: ohh.. that's prolly the nvidia driver
12:19mohamexiety[d]: snowycoder[d]: yeah ok this is the NV driver
12:19mohamexiety[d]: let me give you something, 1 sec
12:20snowycoder[d]: Ooooh, because devenv keeps the other drivers from loading
12:20mohamexiety[d]: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13436#note_2983601 snowycoder[d] courtesy of chikuwad[d], do this and restart
12:20mohamexiety[d]: it will work fine
12:20karolherbst[d]: snowycoder[d]: exactly
12:21karolherbst[d]: mohamexiety[d]: ahh yeah
12:21mohamexiety[d]: the explanation for what happens is in that issue too btw if you're curious
12:21mohamexiety[d]: it's not a nv kernel thing, but a nv userspace thing
12:21mohamexiety[d]: so even if you block the NV kernel module you will still run into this
12:21karolherbst[d]: it's the best workaround ~~(and totally not because it was my suggestion)~~
12:22karolherbst[d]: the nvidia userspace tries very hard to load the nvidia kernel module
12:22karolherbst[d]: so every time you launch an application that loads any of nvidias drivers it executes `nvidia-modprobe`
12:23karolherbst[d]: it's a suid binary whose only purpose is to load the driver
12:24mohamexiety[d]: now funnily enough I am actually running into a similar issue on my fedora system with the che copr
12:24mohamexiety[d]: and doing that systemd thing didn't fix it
12:24mohamexiety[d]: but I have no clue how to fix it there
12:24snowycoder[d]: mohamexiety[d]: Thanks! We should mention it in the arch wiki
12:25mohamexiety[d]: yeah some more easily accessible documentation would be nice
12:25mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1408789224118812802/image.png?ex=68ab04b5&is=68a9b335&hm=28bed71cae85c1f07c3f3affc93edce9de913a161a2b1d9a9f28581ed2e7d3f9&
12:25mohamexiety[d]: karolherbst[d]: loading headless with `VK_LOADER_DEBUG=all` I can see that it gets stuck here. but idk anything else
12:46karolherbst[d]: uhhh
12:46karolherbst[d]: device selection layer...
12:46karolherbst[d]: uhhh
12:46karolherbst[d]: maybe run it inside a debugger and check
12:46mohamexiety[d]: oh huh, I didn't think that would work
13:28gfxstrand[d]: airlied[d]: The good news is that it's way more reproducible on Blackwell so maybe we can find it?
13:47x512[m]: After updating NVK I started to get assert failure for some clients when doing copy from VRAM to CPU memory:... (full message at <https://matrix.org/oftc/media/v1/media/download/AfIhhXLFFE4wnLhZdODsoxh3JKtl24eB-OSnAB5rC3GsLn5GX7fId7GUgXT1IxBiUPpepAfh-KnEJ7UTo9S2yuRCeZH0gBkwAG1hdHJpeC5vcmcvSmFKenVMcHdaQ0d1dUdoRnVLU1ZXdGNY>)
13:47x512[m]: Code that cause failure: https://github.com/X547/VideoStreamsWsi/blob/6bd3926c4394f429ede1736835d17c591b84c353/Wsi.cpp#L510
13:48x512[m]:sent a code block: https://matrix.org/oftc/media/v1/media/download/Aewi0nAwhSEBVsGvvAZLChn3QZ7y3fhJ92a58IT1W_1MrI4xyHsSuZDP6rtvp7pZ0QB1rCHnqtoQkBxitzRapnxCeZH0jGhQAG1hdHJpeC5vcmcvUkdHZWZzVVNweXNhQ1FFcVN6ZVhvVE1a
14:05gfxstrand[d]: I'm not sure what would have changed with that recently. Wait... Why isn't driver-internal set? It should be if it comes from blit.
14:08gfxstrand[d]: Oh, I know why...
14:10gfxstrand[d]: But also, it shouldn't assert unless the client isn't setting TRANSFER_SRC.
14:13x512[m]: https://gitlab.freedesktop.org/mesa/demos/-/blob/main/src/vulkan/vkgears.c?ref_type=heads#L589
14:13x512[m]: It gets fixed if I add TRANSFER_SRC/DST flags.
14:14x512[m]: Should the WSI layer add these flags?
14:31gfxstrand[d]: only if it's doing blits, which it does.
14:31gfxstrand[d]: But VkGears is doing its own thing
15:19gfxstrand[d]: x512[m]: It got a little long but...
15:19gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36957
15:19gfxstrand[d]: vkgears is still wrong, though.
15:25gfxstrand[d]: x512[m]: Also, why the hell are they using `vkCmdBlitImage()` for that?!? There's no scaling. They should be using a copy, not a blit.
15:26gfxstrand[d]: Oh... That's your repo. π Sorry if that was an over-the-top reaction.
15:26gfxstrand[d]: Yeah, you need to add transfer_src/dst flags since you're the one doing the blit. Also, use copy not blit.
16:34x512[m]: Yes, it is my code for the Haiku Vulkan WSI implementation as a Vulkan layer. It also currently abuses the headless surface type, cast to a private Haiku struct, because the Haiku Vulkan WSI extension is not officially proposed yet.
18:16gfxstrand[d]: Ah. Yeah, the swapchain doesn't add any usage flags you don't ask for (especially not headless) so if you want transfer you have to ask for it.
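The usage-flag rule being described is simple enough to model directly: a swapchain image only has the usage bits the client requested, and a blit needs TRANSFER_SRC on the source and TRANSFER_DST on the destination. The flag values below are stand-ins for Vulkan's VkImageUsageFlagBits, not the real constants:

```python
# Stand-in flag bits (hypothetical values, mirroring Vulkan's
# VK_IMAGE_USAGE_* flags only in spirit).
TRANSFER_SRC     = 0x1
TRANSFER_DST     = 0x2
COLOR_ATTACHMENT = 0x10

def swapchain_usage(requested):
    # The swapchain adds nothing you didn't ask for (headless included):
    # the image's usage is exactly the requested mask.
    return requested

def blit_valid(src_usage, dst_usage):
    # A blit (or copy) is only legal if both sides carry transfer usage.
    return bool(src_usage & TRANSFER_SRC) and bool(dst_usage & TRANSFER_DST)
```

So a client that only asked for COLOR_ATTACHMENT cannot legally be the source of a blit; adding TRANSFER_SRC at swapchain creation fixes it, which is exactly the vkgears situation above.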
18:57snowycoder[d]: Can I work on cross-block scheduling if nobody else is doing it?
18:57snowycoder[d]: I already did the Kepler texture thingy and this should be similar, it seems fun.
19:29karolherbst[d]: go ahead, I probably won't
20:03snowycoder[d]: karolherbst[d]: Did you have an idea on how to make the merge operation? The current model takes the whole instruction not just the delay (like codegen).
20:03snowycoder[d]: Should we just, not do a merge and check all successors for each instruction if we have multiple?
20:06karolherbst[d]: snowycoder[d]: the issue is that it's a cyclic dependency problem: if one of the last instructions depends on one of the first instructions of a block, the wait count depends on all the other instructions in between them (the remaining ones and the first of the same block)
20:08karolherbst[d]: and no, I didn't come up with a good idea on how to deal with it
20:08karolherbst[d]: well.. except to rewrite the whole thing π
20:08snowycoder[d]: I have an idea but I don't think it's a good one
20:09karolherbst[d]: I think it needs to be a two pass thing
20:10mhenning[d]: Are you talking about the fixed-latency waits or about the variable latency barriers?
20:10karolherbst[d]: fixed
20:10karolherbst[d]: at least I am
20:11mhenning[d]: (I was asking Snowy)
20:11karolherbst[d]: barriers aren't really a problem here are they?
20:11snowycoder[d]: Me too, I don't think we need a two-pass
20:11mhenning[d]: karolherbst[d]: we don't do them cross-block yet
20:11karolherbst[d]: ohh..
20:11karolherbst[d]: how does it work then?
20:12mhenning[d]: you wait on all the barriers at any control flow
20:12karolherbst[d]: ahh
20:12karolherbst[d]: well that just needs a bit of tracking
20:14mhenning[d]: snowycoder[d]: I think a two-pass version would make sense to me - you could schedule each block individually and then add delays on edges to legalize them in a second pass
20:15karolherbst[d]: I was more thinking of a "decide earliest cycle per instruction first, then schedule" approach, but not sure if that's any better
20:16mhenning[d]: I don't know what you mean by that
20:17karolherbst[d]: like you just count the cycles from start to finish and figure out the earliest each instruction could be executed at, and then in a second pass figure out the actual waits between them
20:18karolherbst[d]: I think the issue with the current impl is, that it goes backwards
20:18mhenning[d]: I don't think the issue is that it goes backwards
20:18mhenning[d]: I think a forwards impl would be pretty similar
20:19karolherbst[d]: the issue is just that if you go backwards you don't know how long to wait for cross block uses
20:19mhenning[d]: you don't know that forwards either
20:20snowycoder[d]: My idea is to keep the same algorithm and pass the register tracking between blocks.
20:20snowycoder[d]: If two successors have incompatible read/writes we can just check both
20:20mhenning[d]: snowycoder[d]: The tricky part is: what do you do with loop back edges?
20:21snowycoder[d]: You start with the worst latency for each read/write and the next pass you refine it
20:21mhenning[d]: One simple option would be to assume the worst case for them (which wouldn't be a regression since we currently always assume the worst case)
20:22mhenning[d]: snowycoder[d]: Okay, that sounds like a two-pass algorithm to me
20:22mhenning[d]: But yeah, that works
20:23snowycoder[d]: More or less? I guess I see it as a conservative data-flow pass.
20:24karolherbst[d]: not sure you get around any sort of two-pass even if it's not something where you look at everything in the second one
20:25mhenning[d]: Yeah, it's possible that the most general version of this is a dataflow analysis eg. maybe we should be using one to calculate the delays along each edge
20:25mhenning[d]: but I haven't thought too carefully about that yet
20:26mhenning[d]: It definitely doesn't need that much complexity just to be an improvement on the current one though
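The two-pass idea discussed above can be sketched with invented data structures (this is not NAK's representation): pass 1 schedules each block locally and records, per register, how many cycles have elapsed since its last write when the block exits; pass 2 propagates that "age" across CFG edges, taking the minimum (most conservative) value at merges and assuming age 0 (full stall) on loop back edges, matching the "assume the worst case" fallback:

```python
LATENCY = 6  # hypothetical fixed instruction latency, in cycles

def entry_age(block, preds, exit_age, back_edges):
    """Cycles since the tracked register's last write, at block entry.
    preds maps block -> predecessor list; exit_age maps block -> age
    at its exit; back_edges is a set of (pred, succ) loop edges."""
    ages = [0 if (p, block) in back_edges else exit_age[p]
            for p in preds.get(block, [])]
    # No predecessors (entry block): nothing can still be in flight.
    return min(ages) if ages else LATENCY

def wait_cycles(age):
    # Extra delay to insert before the first read of that register.
    return max(0, LATENCY - age)
```

With a normal edge the successor inherits the predecessor's exit age and only pays the remaining latency; a back edge forces age 0 and hence a full stall, which is no worse than the current always-worst-case behavior.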
21:23TheHypervisor[m]: Do Ampere GPUs have any sort of internal crash log I can reference when they die? I imagine a nouveau user/dev would be most aware of undocumented diagnostics
21:24TheHypervisor[m]: I'm debugging a GPU which seems to artifact then die under load
21:27TheHypervisor[m]: Also, on EVGA cards, does the OC/Normal switch even do anything?
21:28TheHypervisor[m]: It doesn't look like a BIOS switch especially given there's only one bios floating around
23:02airlied[d]: dying under load usually implies some sort of hardware failure, don't really have any good logging mechanism for that
23:12TheHypervisor[m]: Well, the nvidia driver on windows did crash in a way that didn't take down the operating system a couple times. Makes me wonder if Nouveau could provide more info
23:28gfxstrand[d]: mhenning[d]: Yeah, the most general thing would be a full data flow pass. But just doing the forward thing and throwing up our hands and stalling in the presence of back edges should be good enough.
23:31gfxstrand[d]: Barriers are a little trickier. We need a very accurate model of how barriers are tracked (per warp vs. per lane) and what happens if the waits aren't 1:1, i.e., what happens if you double wait or if you signal before waiting. And then we have to model it to ensure what we emit doesn't break under weird conditions.
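One *possible* model of such a barrier, written down only to make the open questions concrete (per warp vs. per lane, double wait, signal before wait). The semantics below are invented for illustration and do not reflect confirmed hardware behavior:

```python
class DepBarrier:
    """Toy dependency barrier with one arbitrary semantic choice:
    waiting on a clear barrier completes immediately, so a double
    wait is a no-op. Whether real hardware behaves this way is
    exactly what would need to be verified."""

    def __init__(self):
        self.armed = False

    def arm(self):
        # A variable-latency op sets the barrier when it issues...
        self.armed = True

    def retire(self):
        # ...and clears it when its result lands.
        self.armed = False

    def wait(self):
        # Returns True if the wait actually had to stall.
        stalled = self.armed
        self.armed = False
        return stalled
```

Under this model a wait with nothing armed falls through, a wait on an in-flight op stalls, and the second of two back-to-back waits is harmless; any divergence between this model and the hardware is precisely the kind of weird condition the emitted code would have to survive.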