02:14redsheep[d]: Has anybody else tested NOUVEAU_USE_ZINK after !29771? Is that going to llvmpipe for everybody, or is that just a me problem?
02:22redsheep[d]: I haven't bisected to that MR but it seems the most likely culprit. Do you have an nvidia card installed at the moment, zmike[d]?
02:30zmike[d]: at the moment I am on my couch watching the news before bed
02:30zmike[d]: but tomorrow's me might be in a different situation
02:30redsheep[d]: Depending on what news that sounds either more or less fun
09:14karolherbst[d]: skeggsb9778[d]: maybe something fun for you to look into: https://gitlab.freedesktop.org/drm/nouveau/-/merge_requests/25
12:35zmike[d]: redsheep[d]: works fine for me
14:02gfxstrand[d]: karolherbst[d]: Does Maxwell have requirements on the ordering of blocks? Like that everything is nicely nested in the command stream?
14:02gfxstrand[d]: Because I think that would explain most of my remaining fails
14:04gfxstrand[d]: I know a lot of hardware has greedy scheduling strategies of the form "run the thread with the lowest IP" and depends on ordering
14:04karolherbst[d]: gfxstrand[d]: not quite sure what you mean here
14:08gfxstrand[d]: Right now NAK has an annoying habit of doing crazy things like putting the guts of the inner-most loop at the end of the program for no reason.
14:08gfxstrand[d]: I'm worried that's screwing up the stack somehow
14:09gfxstrand[d]: Like maybe the hardware has some "get out of jail free" strategy that assumes ordering and pops the stack until it gets back to sanity
14:09ahuillet[d]: I'm not aware of any weird memory layout requirements for code (like baking dominance into the memory space, which is, I assume, what you're implying?)
14:09karolherbst[d]: ohh mhh
14:09karolherbst[d]: though you can run out of stack, but the hardware would complain about it
14:09karolherbst[d]: CSR something
14:09karolherbst[d]: ehh CRS
14:10karolherbst[d]: might also just misrender
14:10karolherbst[d]: but I think it reports a soft error in dmesg
14:10karolherbst[d]: as in, not something which would nuke the channel
14:10karolherbst[d]: sooo...
14:10karolherbst[d]: there is this part in the shader header to spill the CRS into VRAM, not sure if you always set it or never
14:10gfxstrand[d]: I'm seeing two categories of really annoying errors, both of which seem to happen with multiply nested loops:
14:10gfxstrand[d]: 1. Stuff just doesn't execute for some reason
14:10gfxstrand[d]: 2. It pushes breaks onto the stack without popping them and blows through the stack.
14:11karolherbst[d]: 1. might be out of CRS space, because the shader aborts in that case, check dmesg
14:11gfxstrand[d]: In both cases the max stack size should be like 5. It's not CRS
14:12karolherbst[d]: it's definitely not 5
14:12karolherbst[d]: the stack size calculation is super weird
14:12karolherbst[d]: and it's shared across the warp
14:12gfxstrand[d]: Sure
14:12karolherbst[d]: so the more your warp diverges, the more CRS you need
14:12gfxstrand[d]: Right...
14:13gfxstrand[d]: And how much CRS do you get by default?
14:13karolherbst[d]: I've had a reproducer which ran out of CRS with two nested loops
14:13karolherbst[d]: gfxstrand[d]: I think 512 bytes is the on-chip one?
14:13karolherbst[d]: maybe more?
14:14karolherbst[d]: `The SPH field ShaderLocalMemoryCrsSize sets the additional (off chip) call/return stack size (CRS_SZ). Units are in Bytes/Warp. Minimum value 0, maximum 1 megabyte. Must be multiples of 512 bytes. `
14:14karolherbst[d]: at least
14:14gfxstrand[d]: Yeah, so it's probably 512B
14:14karolherbst[d]: 1 megabyte is impressive
14:14karolherbst[d]: anyway, it lives in local memory
14:15gfxstrand[d]: Yeah, I know that
14:15gfxstrand[d]: I'm just a bit distrustful that this stuff is working at all right now
14:15karolherbst[d]: but if dmesg doesn't print anything in regards to that it should be something else
14:15gfxstrand[d]: I mean, it works for piles of cases, but I've got weird shaders where it just doesn't
14:15karolherbst[d]: yeah..
14:15karolherbst[d]: I have like one bug report where this is an issue
14:15karolherbst[d]: because most of the time it's fine using the on-chip one
14:16karolherbst[d]: for 2., there is a hierarchy of how things are pushed/popped
14:17karolherbst[d]: and one pop also pops other things implicitly
14:17karolherbst[d]: I just don't know the hierarchy
14:17gfxstrand[d]: Yeah, I assume it's brk > cont > sync
14:17karolherbst[d]: it might be that break also pops the cont one, and things like that
14:17karolherbst[d]: maybe
14:18karolherbst[d]: mhh actually, maybe I have docs on that
14:19karolherbst[d]: maybe I don't.. though I thought I was reading about that somewhere
14:20gfxstrand[d]: Now might be the time where I use my fancy new unit-testing framework to poke at things.
14:20karolherbst[d]: I also don't know how deep it implicitly clears the things
14:20karolherbst[d]: it might pop until it hits something else
14:20gfxstrand[d]: If I can push enough to see it in memory, I can RE it.
14:20karolherbst[d]: ohh yeah...
14:21karolherbst[d]: nobody bothered with REing how all of that works, so understanding how it works will also be helpful in calculating how big the CRS needs to be anyway
14:22gfxstrand[d]: Yeah. I think that's today's project then
14:22gfxstrand[d]: Can't write a compiler for hardware you don't understand
14:28gfxstrand[d]: Before I do that, though, I'm going to CTS this giant pile of patches to make sure it doesn't break Ampere.
14:33gfxstrand[d]: karolherbst[d]: Does the CRS come at the start of local mem or the end?
14:33gfxstrand[d]: Or do we even know?
14:33karolherbst[d]: I don't think we know, but it doesn't impact the shader itself
14:33gfxstrand[d]: Nah, just impacts RE
14:34karolherbst[d]: how so?
14:34karolherbst[d]: ehh
14:34karolherbst[d]: RE
14:34karolherbst[d]: I read RA :ferrisUpsideDown:
14:34gfxstrand[d]: You need to add all those ferris emotes to fd.o
14:37karolherbst[d]: I guess
14:37karolherbst[d]: but there are a lot
14:37gfxstrand[d]: Either that or I need to join that server just for the emotes
14:37karolherbst[d]: that's what I've done
14:45karolherbst[d]: though if I had to guess, it's at the end, because the local memory area has weird alignment rules and if the CRS were at the start you'd end up wasting tons of memory
14:45karolherbst[d]: probably
14:50karolherbst[d]: gfxstrand[d]: though, might be worth setting the stack to the max and see how much it helps, but given that we need to know this stuff anyway...
15:21mhenning[d]: gfxstrand[d]: My old RE notes on this include the sentence "It's apparently illegal to join to the address of a join(?)"
15:22mhenning[d]: I forget why I wrote that down, but you could try inserting NOPs for that case
15:23mhenning[d]: For the sizing of the stack, I have "Allocate 512 + 512 * (depth / 32) bytes (round up)"
15:24mhenning[d]: where the PreCont and PreBreak entries take one slot, the `else` side of an if takes two slots, and the `then` side takes one slot
15:26mhenning[d]: Note that that was all on kepler, so maxwell could differ slightly
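(A rough sketch of how those notes translate into a depth estimate; the function name and weights below are illustrative only, taken from the Kepler-era slot counts quoted above, and are not anything NAK actually implements.)

```c
/* Illustrative only: per the notes above, PreBreak and PreCont each take
 * one slot, and a divergent if takes up to two slots while the not-taken
 * side is still pending. */
static unsigned crs_depth_estimate(unsigned nested_loops, unsigned divergent_ifs)
{
    unsigned depth = nested_loops * 2;   /* one PreBreak + one PreCont per loop */
    depth += divergent_ifs * 2;          /* ssy entry + saved not-taken state */
    return depth;
}
```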
16:40RSpliet: Looks like Arnd's "what takes more than 5s to compile on an M1" script flagged up nouveau/nvkm/subdev/fb/ramgk104.c 5.90s ... sorry!
16:40RSpliet: https://pastebin.com/raw/fXJe7y9P
16:42MrTrueDev: It's been a minute since I've frequented this IRC. Wondering if Ilia Mirkin still lurks around here.
16:48DodoGTA: MrTrueDev: Are you aware of the new NVK driver?
16:48gfxstrand[d]: MrTrueDev: No, I've not seen him around in a few years. Might still be lurking somewhere on IRC but IDK.
16:48MrTrueDev: NVK?
16:50DodoGTA: MrTrueDev: It's a new open-source Vulkan driver for modern NVIDIA GPUs (Turing+ is the main priority but even Kepler kind of works)
16:51DodoGTA: https://www.collabora.com/news-and-blog/news-and-events/nvk-is-now-ready-for-prime-time.html
16:56clangcat[d]: RSpliet: What script is this? Also yea, sometimes some files will just take longer to compile; it depends on the content of the file and whether it can be optimized for compile time without causing undesired effects at runtime. Also, are you compiling this for an M1 or just cross-compiling x86 Linux on an M1?
16:59karolherbst[d]: MrTrueDev: he sometimes replies to patches on the ML or on gitlab
17:01clangcat[d]: karolherbst[d]: What's ML?
17:01clangcat[d]: all I can think is Machine learning
17:01karolherbst[d]: mhenning[d]: you also have to keep divergence in mind as the stack is per warp
17:01selaaaa[d]: clangcat[d]: mailing list i presume :p
17:01selaaaa[d]: (think lkml)
17:02clangcat[d]: selaaaa[d]: Ohhhh right those things I hate them
17:02clangcat[d]: XD
17:02clangcat[d]: I think my brain learns mailing lists then suppresses them to protect my squishy grey matter
17:03karolherbst[d]: mhenning[d]: though when I was looking into it, I did come to the conclusion that the on-chip stack is not 512, but I was thinking 768 or 1024, which I think are both wrong
17:13mhenning[d]: karolherbst[d]: Not sure what you're saying here. The "two entries for an else branch" was assuming a divergent if - the taken side of the divergent branch is executed first and during that block's execution we have both an entry from the ssy and an entry that remembers the state for the not-taken branch
17:16karolherbst[d]: mhenning[d]: what if the diverged threads push more entries, now you have two thread blocks pushing things
17:19karolherbst[d]: I have an example where you have only two nested loops, nothing funny going on and you run out of stack space quickly, because the more the warp diverges the more things get pushed
17:40RSpliet: clangcat[d]: you'd have to ask Arnd Bergmann, a kernel heavyweight
17:41RSpliet: Also, I didn't write ramgk104.c, but I'm pretty sure I contributed some bits to it
17:41clangcat[d]: RSpliet: I don't really care about that, just saying some files will take longer. I'm more curious what script you ran to test this?
17:42RSpliet: clangcat[d]: you'd have to ask Arnd Bergmann, a kernel heavyweight
17:42RSpliet: https://society.oftrolls.com/@arnd/112837247327000513
17:42clangcat[d]: Okay
17:43tiredchiku[d]: irc bridge probably being funky rn
17:44MrTrueDev: I've been working on an FPGA-based GPU. My next stop is to get a hand-rolled libGL.so working for the GPU.
17:45MrTrueDev: I'm toying with adding MESA support
17:45MrTrueDev: But seems like work
17:49clangcat[d]: tiredchiku[d]: Sadly doesn't seem like they shared the script, big sad.
17:57notthatclippy[d]: Can't reproduce the ramgk104.c thing. Compiles in line with all the other files for me.
17:58notthatclippy[d]: Shame though, sounded like a fun problem to dive into.
17:59clangcat[d]: notthatclippy[d]: Yea it's very complex cause a lot of that will depend on the compiler used (version as well), the options passed and a lot more.
18:00clangcat[d]: Though I do question why you would compile amdgpu and nouveau on an M1, didn't really know you could use NVIDIA or AMD cards with an M1
18:01clangcat[d]: Though I guess you could do one of those USB to PCI-E things
18:01tiredchiku[d]: cross compile
18:02tiredchiku[d]: could be compiling for another machnie
18:02notthatclippy[d]: I'm guessing that's just what they have as a primary dev machine, they're compiling the whole kernel and finding these outliers
18:02tiredchiku[d]: that too
18:03clangcat[d]: tiredchiku[d]: Yea I suppose also I just hate cross compiling and prefer to do native compiles.
18:04karolherbst[d]: RSpliet: there was a regression where the min/max macros explode quite a lot
18:04karolherbst[d]: like megabytes of text
18:05RSpliet: Yup
18:05RSpliet: and then Arnd went and benchmarked all the other files too
18:05karolherbst[d]: ahh
18:05RSpliet: Lots of offenders in amdgpu when judged purely by compile time, and nouveau popped up once
18:05RSpliet: could be the luck of the scheduler-draw of course
18:06RSpliet: it's only 6 seconds, not 45 :-)
18:06karolherbst[d]: I don't think it's our fault tho 😄
18:07karolherbst[d]: though maybe
18:07karolherbst[d]: dunno
18:07RSpliet: all that reclocking code in ramgk104 is quite macro-heavy, so wouldn't surprise me if it is :-)
18:07RSpliet: but in the grand scheme of things it's fairly irrelevant too
18:07karolherbst[d]: ohh `ram_mask` and co are macros...
18:07karolherbst[d]: why are those macros and not static inlines :ferrisUpsideDown:
18:07karolherbst[d]: ehh wait...
18:07karolherbst[d]: there is a concat
18:07karolherbst[d]: pain
18:08clangcat[d]: RSpliet: Yea also M1 has performance and efficiency cores. so it can also be luck of which core a job got placed on cause `-j$(nproc)` probably would lead to some jobs being on efficiency cores due to all cores being in use.
18:08RSpliet: karolherbst[d]: some of that macro stuff is definitions too, not functions
18:09karolherbst[d]: I would hope that those are not expensive to use
18:09RSpliet: inline functions I guess are sometimes an option, but also there's no way to force GCC to actually inline them. inline is just a hint
18:09clangcat[d]: karolherbst[d]: Yea inline in C is just a hint
18:09karolherbst[d]: static inline forces an inline
18:09RSpliet: even if -O0 is passed?
18:10RSpliet: Anyway, it's a different rabbithole :D
18:10RSpliet: and as we know, every rabbithole is a can of worms
18:10karolherbst[d]: RSpliet: pretty sure
18:10RSpliet: well, minus the can
18:10RSpliet: but there's definitely worms in rabbitholes
18:11mhenning[d]: karolherbst[d]: I'd be interested in this example because this doesn't make any sense in my mental model of the hardware.
18:11clangcat[d]: karolherbst[d]: `Regardless of the storage class, the compiler can ignore the inline qualifier and generate a function call in all C dialects and C++`
18:12clangcat[d]: I always thought this meant static inline is still just a hint tho?
18:12karolherbst[d]: it depends, practically with clang and gcc it's pretty much always inlined unless you pass some extra compiler flags to force those to get compiled to a function
18:13karolherbst[d]: or rather, I have never seen it not get inlined
18:13karolherbst[d]: which is quite a pain
18:13RSpliet: I guess the proof of the pudding is in the eating... and there goes my night stuck in godbolt ;x
18:13karolherbst[d]: oh yeah.. good idea actually
18:14clangcat[d]: karolherbst[d]: Yea I mean from my experience `inline` is enough anyway, but IIRC the compiler can choose to make it a normal function if it's too complex for the compiler to inline.
18:14clangcat[d]: But yea I imagine so long as the inline function isn't anything insane modern Clang and GCC can probably just inline it
18:15notthatclippy[d]: `static inline` won't _force_ it. However, assuming some level of competence, you're only tagging functions that should be static inline as such, so the compiler will honor it in all real scenarios. Easy to make an artificial case where it will ignore it.
18:15karolherbst[d]: seems like with O0 the compilers indeed don't inline it
18:15karolherbst[d]: which for the kernel is irrelevant anyway, because the kernel is O2 only
18:17clangcat[d]: notthatclippy[d]: Yea I mean I've never seen `inline` not be inlined, but it's one of those things that can "technically" happen. But so long as you use inline for small utility functions (like min/max) it shouldn't really ever happen.
18:17clangcat[d]: Unless yea force all optimizations off
18:17RSpliet: GCC has the __attribute__((always_inline)) thingy too... which may even force inlining in O0
18:17karolherbst[d]: yeah, I've seen that one
18:18RSpliet: Not sure, never actually tried, but otherwise it defeats the purpose of the attribute
18:18karolherbst[d]: I wonder what happens if you do function pointer things
18:18clangcat[d]: RSpliet: Yea from what I understand C only uses it as a hint so compilers that aren't competent enough to inline a certain function can choose to leave the function as a function.
18:18clangcat[d]: Which should only really be an issue if you start using custom C compilers with poor inline support
18:19notthatclippy[d]: IIRC force-inline works as expected and will error out if not possible to inline
18:19clangcat[d]: notthatclippy[d]: Yea it does. Just isn't standard.
18:19karolherbst[d]: I think the kernel needs both
18:19karolherbst[d]: functions never inlined and functions always inlined
18:20RSpliet: for normal tools, not inlining stuff in your debug build has advantages when debugging. valgrind/sanitiser backtraces make more sense that way. Kernels are a bit special in that your normal debugging flow is different anyway
18:20notthatclippy[d]: not inlining lets you attach ebpf probes easier too
18:20clangcat[d]: But yea I get why C leaves it as a hint. Otherwise you might have code that compiles in one compiler but not the other.
18:20clangcat[d]: RSpliet: I mean I use -O0 when debugging my kernel.
18:21karolherbst[d]: clangcat[d]: that's technically unsupported
18:21clangcat[d]: karolherbst[d]: My kernel Karol.
18:21clangcat[d]: Not Linux
18:21karolherbst[d]: ahh
18:21karolherbst[d]: are you supporting your kernel?
18:21RSpliet: :D
18:21clangcat[d]: Mmm I do it for fun
18:21karolherbst[d]: but yeah...
18:21karolherbst[d]: kernel world is fun
18:21clangcat[d]: But yea it's also useful for kernels as you can attach GDB over sockets/vms
18:22clangcat[d]: and can make stack traces super nice to read
18:22RSpliet: clangcat[d]: does it support crowdstrike?
18:22RSpliet: :D
18:22clangcat[d]: NO
18:22karolherbst[d]: I forgot why, but the linux kernel absolutely has functions which have to be inlined and such that must never be
18:22RSpliet: aight I'm out xD
18:22clangcat[d]: I mean
18:22clangcat[d]: Maybe
18:22clangcat[d]: I should support crowdstrike
18:22tiredchiku[d]: I support crowds going on a strike
18:22karolherbst[d]: need it for enterprise customers
18:22tiredchiku[d]: does that count
18:23clangcat[d]: ```c++
18:23clangcat[d]: if (crowd_strike == true) {
18:23clangcat[d]:     BSOD();
18:23clangcat[d]: }
18:23clangcat[d]: ```
18:23clangcat[d]: There
18:23clangcat[d]: support for crowd strike XD
18:24clangcat[d]: Though maybe writing an actual module loading system could be fun.
18:24clangcat[d]: at the minute everything is just linked into one chunky kernel.
18:27karolherbst[d]: clangcat[d]: won't need it for crowd strike, could just support bpf instead 😄
18:27karolherbst[d]: or I think that's what they do on linux
18:27karolherbst[d]: which is also the reason the problem didn't occur on linux
18:28clangcat[d]: karolherbst[d]: Yea but like I kinda need to anyway. Cause like the virtio GPU driver is always built into the final kernel atm
18:28clangcat[d]: Which like
18:28clangcat[d]: It would actually be cool to make virtio driver a module
18:28karolherbst[d]: yeah, fair
18:29karolherbst[d]: I suspect it's quite some work though to make it all work properly
18:31clangcat[d]: Yea I imagine so, as I would need to think of some method to hand devices from the PCI/MMIO/etc... interfaces to the modules that drive the devices.
18:32karolherbst[d]: you know how linux does that? 😄
18:32clangcat[d]: along with other stuff like mapping the modules, and on a real system I'd probably have to make the modules relocatable.
18:32asdqueerfromeu[d]: clangcat[d]: `panic("csagent.sys"):`
18:32clangcat[d]: As hardcoded addresses would run the risk of two modules using the same address
18:33karolherbst[d]: right.. but I'd just check out how linux handles it (because it doesn't) so you might be better off doing the same, because... you might be able to just reuse the same stuff for now
18:34karolherbst[d]: conceptually I mean
18:34karolherbst[d]: or rather, there are very good reasons why the "which driver to load" is done in userspace
18:35karolherbst[d]: though I never checked out how the kernel loads other types of modules (e.g. crypto algo, or compression stuff). I just know that those are special
18:48clangcat[d]: karolherbst[d]: yea I'll probably have a look at Linux modules. Just to get a rough idea of how it's doing it but I think it should be roughly
18:48clangcat[d]: 1. Define some common entry point in each module e.g. module_main()
18:48clangcat[d]: 2. Have the module scan for devices it's interested in.
18:48clangcat[d]: 3. Define some way for a module to say it's taken ownership of a device, cause I imagine multiple modules using one device would be a headache.
18:48clangcat[d]: 4. Then just expose common kernel functions to module to manage device with.
18:49clangcat[d]: Though devices that send interrupts
18:49clangcat[d]: Would need a way to direct interrupts to the modules
18:50karolherbst[d]: clangcat[d]: so how linux does it is that each module has a list of devices it supports, or a filter or something, e.g. for nouveau it's `alias: pci:v000010DEd*sv*sd*bc03sc*i*`. Then `udev` loads modules based on existing or hotplugged devices and simply monitors for changes. I think there might even be a kernel interface which pushes those changes
18:51karolherbst[d]: and then you can have policies on when/whether to load modules, all in userspace
18:51karolherbst[d]: and you don't have to bother having all that stuff in the kernel
18:51karolherbst[d]: e.g. how do you make it not load a specific module
18:51karolherbst[d]: how do you overwrite that behavior, etc...
18:51karolherbst[d]: what if you have multiple modules matching the same device?
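(For context, the alias quoted above is generated from the driver's PCI ID table; a minimal sketch of how a Linux driver exports one, with an illustrative table name rather than nouveau's actual code.)

```c
#include <linux/module.h>
#include <linux/pci.h>

/* Matches any NVIDIA PCI device whose base class is "display controller",
 * which yields an alias like pci:v000010DEd*sv*sd*bc03sc*i*. udev resolves
 * hotplug uevents against these aliases and modprobes the matching module,
 * so load/don't-load policy can live entirely in userspace. */
static const struct pci_device_id example_ids[] = {
    {
        PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID),
        .class = PCI_BASE_CLASS_DISPLAY << 16,
        .class_mask = 0xff << 16,
    },
    { }  /* terminating entry */
};
MODULE_DEVICE_TABLE(pci, example_ids);
```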
18:52notthatclippy[d]: It is also not unheard of to have two equivalent devices and want a different module to drive each, e.g. nvidia.ko and nouveau.ko each driving one GPU.
18:52clangcat[d]: karolherbst[d]: Well yea, that stuff would probably come from just having a blacklist for certain modules.
18:53karolherbst[d]: clangcat[d]: how are you configuring it? `sysfs` like interface? kernel command line? Do you need to reboot when changing it?
18:53clangcat[d]: notthatclippy[d]: Yea I mean I don't intend my project to ever be super serious and that sounds like a headache. But could probably implement that by matching on the PCI address.
18:53karolherbst[d]: but anyway
18:53karolherbst[d]: I'm sure it's fun to figure out all those issues 😄
18:53clangcat[d]: karolherbst[d]: I mean Idk yet I currently don't have a sysfs type thing nor a kernel command line XD
18:54karolherbst[d]: sooo... module loading is quite the project 😄
18:54clangcat[d]: Yup.
18:54karolherbst[d]: though I think what linux is doing here is quite the good idea, as in, make userspace responsible for loading device drivers when necessary
18:54clangcat[d]: But it should be easy to port the virtio driver into its own set of files cause it already is.
18:55karolherbst[d]: having a mechanism to define policies in the kernel is always a huge mess
18:55karolherbst[d]: or rather, manage those policies
18:56clangcat[d]: karolherbst[d]: Yea I mean the most I'd be willing to do would probably just be a command line option for `module_blacklist=names;to;ignore`
18:56clangcat[d]: But anything more complex I wouldn't wanna do in Kernel space
18:56karolherbst[d]: yeah...
18:57karolherbst[d]: device hotplugging is another nightmare area 😄
18:57clangcat[d]: Though doing it in userspace would mean creating a hot-plugging interface for userspace
18:58karolherbst[d]: yeah...
18:58karolherbst[d]: which like
18:58karolherbst[d]: you'll need anyway
18:58clangcat[d]: cause the kernel is aware of the hotplugs
18:58karolherbst[d]: for e.g. if you hotplug storage, then userspace wants to know anyway
18:58clangcat[d]: but I'd need to export that to userspace
18:58clangcat[d]: karolherbst[d]: Yea or input/other things.
19:07gfxstrand[d]: Ugh... Something's wrong with my CRS setup. If I go more than 16 deep, I hit
19:07gfxstrand[d]: `[ 9242.997596] nouveau 0000:17:00.0: gr: GPC0/TPC0/MP trap: global 00000000 [] warp 0016 []`
19:07gfxstrand[d]: SLM works so I know it's not that
19:07karolherbst[d]: are you increasing the SLM size as well?
19:07gfxstrand[d]: Yeah, should be big enough
19:07gfxstrand[d]: And I'm setting CRS_SIZE in the QMD
19:08gfxstrand[d]: No I'm not. 🤦🏻♀️
19:08karolherbst[d]: though I'm curious on why you get this error then...
19:08karolherbst[d]: mhhh
19:09gfxstrand[d]: Okay, now I am setting CRS size (I know because the HW throws an error if CRS size + LMEM size is too big)
19:09karolherbst[d]: I thought we had a CRS specific error...
19:10gfxstrand[d]: Okay, now I've got it plumbed through correctly. Time to see what it dumps
19:13karolherbst[d]: gfxstrand[d]: also.. 16 huh? If the on-chip stack is 512 bytes, that means each entry would be 32 bytes
19:14karolherbst[d]: ~~how high can you go if you set the CRS to 1 and 2?~~
19:20gfxstrand[d]: Okay, things I've learned:
19:20gfxstrand[d]: 1. The CRS size set in the SPH or the QMD includes the 512B of on-chip stack
19:20gfxstrand[d]: 2. It never seems to spill the on-chip stack to memory
19:20gfxstrand[d]: 3. You need space for the on-chip anyway?!? (That's not a question. I just don't know why.)
19:20gfxstrand[d]: 4. With on-chip only, I can got 16 SSY deep and with 1024B of CRS, I can go 32 SSY deep.
19:21karolherbst[d]: gfxstrand[d]: I'm sure that 3. isn't true
19:22gfxstrand[d]: Then why does setting 512 behave the same as setting 0?
19:22gfxstrand[d]: And setting 1024 only doubles the number of ssy I can have
19:22karolherbst[d]: yeah mhhh... maybe there is something hw specific going on
19:23karolherbst[d]: my understanding is rather, that you flip between on and off chip entirely
19:23karolherbst[d]: but the docs also say "additionally"
19:23karolherbst[d]: it's kinda weird
19:24karolherbst[d]: but I think it would also make sense to see the written stuff anyway
19:24gfxstrand[d]: With `crs_size = 2048B`, I can push 96 on the stack?!?
19:26karolherbst[d]: when I was looking into it it didn't make much sense either
19:27gfxstrand[d]: with `crs_size = 4096` I can push 224
19:28karolherbst[d]: ohh wait
19:28karolherbst[d]: I forgot something 😄
19:28karolherbst[d]: what a pain
19:28karolherbst[d]: I think the trap handler needs a bit of space
19:29karolherbst[d]: even if unused
19:29gfxstrand[d]: gfxstrand[d]: `crs_size = 8192` -> 480
19:29karolherbst[d]: plot it and maybe it makes sense 😄
19:30gfxstrand[d]: It's `(depth + 32) * 16`
19:30gfxstrand[d]: With some special cases when depth is small
19:30karolherbst[d]: yeah..
19:30gfxstrand[d]: That's weird but not bad
19:30karolherbst[d]: so each entry is 16 bytes
19:31gfxstrand[d]: Yup
19:31gfxstrand[d]: Well, each ssy entry is 16 bytes
19:31gfxstrand[d]: I've not tried pbr or pcnt
19:32karolherbst[d]: I hope for everybody's sanity the others are the same 😄
19:34gfxstrand[d]: gfxstrand[d]: Seems to hold up to about 1000 frames
19:36karolherbst[d]: that behavior on small CRS is kinda weird, but it also aligns with what I saw, because... it was weird
19:37gfxstrand[d]: Okay, pbk without a pcnt is invalid
19:39gfxstrand[d]: Wait, how does that make sense?!? You do pbk before pcnt anyway
19:41gfxstrand[d]: derp...
19:41gfxstrand[d]: It's because I'm syncing on a pbk
19:49gfxstrand[d]: Looks like each frame looks like:
19:49gfxstrand[d]: ```rust
19:49gfxstrand[d]: #[repr(C)]
19:49gfxstrand[d]: struct CRSFrame {
19:49gfxstrand[d]:     addr: u32,
19:49gfxstrand[d]:     mask: u32,
19:49gfxstrand[d]:     frame_type: u32, // ssy = 4, pbk = 8
19:49gfxstrand[d]:     zero: u32,
19:49gfxstrand[d]: }
19:49gfxstrand[d]: ```
19:49gfxstrand[d]: At least for ssy and pbk
19:49gfxstrand[d]: Let's throw pcnt at it
19:51karolherbst[d]: I wonder if the hw reserves some space for internal usage the bigger the stack is... maybe you can see the first entry in the off-chip cache changing the bigger it gets, as the hw might reserve it in the on-chip one only
19:52gfxstrand[d]: I do wonder a bit if something like that isn't going on.
19:53karolherbst[d]: or the reverse might also be true, if you set it to 1, maybe not everything is filled with entries
19:53gfxstrand[d]: Like maybe the HW stack has 16 entries and whenever it gets full, it copies it out to memory in one big blob and then uses the newly empty stack
19:53karolherbst[d]: ohh.. maybe
19:53gfxstrand[d]: because the entries I see in memory don't 100% make sense
19:53karolherbst[d]: that actually might make sense
19:54gfxstrand[d]: It makes a lot of sense actually. If you're 300 stack frames into a cuda program, you want whatever you're scratching around on in the current function to be blazing fast.
19:54karolherbst[d]: yeah...
19:55gfxstrand[d]: There is definitely some sort of bulk thing going on. I know that much
19:55gfxstrand[d]: Because when I go from 16 to 17 frames, I see 4 frames show up in memory
19:55karolherbst[d]: hah
19:55karolherbst[d]: interesting
19:57gfxstrand[d]: Same when I go from 20 to 21
19:58gfxstrand[d]: So yeah there's some internal stack and then it spills old frames or something like that
19:58karolherbst[d]: feels like the on-chip stack is a circular buffer then, and it somehow keeps track of where it is, and always swaps things out when it gets full
20:01gfxstrand[d]: Something like that. 🤷🏻♀️
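(A toy model of the behavior hypothesized here; purely speculative and only meant to reproduce the observations above: a 16-entry on-chip stack that flushes 4-entry blocks of its oldest frames to local memory when it fills up. All names are made up.)

```c
#include <stdint.h>
#include <string.h>

/* 16-byte entries, matching the frames observed in memory */
struct crs_entry { uint32_t words[4]; };

#define ON_CHIP_ENTRIES 16
#define SPILL_BLOCK      4

struct crs_model {
    struct crs_entry on_chip[ON_CHIP_ENTRIES];
    unsigned on_chip_count;
    struct crs_entry *lmem;      /* off-chip spill area in local memory */
    unsigned lmem_count;
};

static void crs_push(struct crs_model *s, struct crs_entry e)
{
    if (s->on_chip_count == ON_CHIP_ENTRIES) {
        /* flush the oldest block out to memory in one go, slide the rest down */
        memcpy(&s->lmem[s->lmem_count], s->on_chip,
               SPILL_BLOCK * sizeof(struct crs_entry));
        s->lmem_count += SPILL_BLOCK;
        memmove(s->on_chip, s->on_chip + SPILL_BLOCK,
                (ON_CHIP_ENTRIES - SPILL_BLOCK) * sizeof(struct crs_entry));
        s->on_chip_count -= SPILL_BLOCK;
    }
    s->on_chip[s->on_chip_count++] = e;
}
```

With this model, the 17th push spills the first 4 frames to memory and the 21st spills the next 4, matching what gfxstrand sees.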
20:10gfxstrand[d]: Looks like pcnt has the same frame format
20:10gfxstrand[d]: Okay, good, there is sanity. 😄
20:24mhenning[d]: gfxstrand[d]: yeah, that matches the 512 + 512 * (depth / 32) in my notes
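(The two numbers do line up: a small sketch, with an illustrative name, rounding the measured (depth + 32) * 16 up to the 512-byte granularity the SPH docs require gives exactly 512 + 512 * ceil(depth / 32).)

```c
#include <stdint.h>

/* 16 bytes per entry plus 512, rounded up to a multiple of 512 bytes
 * as the SPH requires; equals 512 + 512 * ceil(depth / 32). */
static uint32_t crs_size_bytes(uint32_t max_depth)
{
    uint32_t bytes = (max_depth + 32) * 16;
    return (bytes + 511) & ~511u;
}
```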
20:52karolherbst[d]: mhenning[d]: just that each entry is 16 bytes
20:54karolherbst[d]: anyway, there is something funky going on
20:55karolherbst[d]: I know that the on-chip stack isn't 512 bytes big and that also matches that you can only push 16 entries with the off-chip disabled
20:55karolherbst[d]: unless it's bigger, but half the space is wasted on something
22:09gfxstrand[d]: karolherbst[d]: Do 980s generally reclock? I'm not hearing fans
22:09karolherbst[d]: gfxstrand[d]: they don't
22:10gfxstrand[d]: Is this something where I can throw a super sketchy patch on my kernel and hack it, or is it a never?
22:10karolherbst[d]: well... we have the code, but controlling the fans requires signed firmware so it was never enabled
22:10karolherbst[d]: well..
22:10gfxstrand[d]: So if I'm willing to burn the card...
22:10karolherbst[d]: if you like it toasty...
22:10gfxstrand[d]: hehe
22:10karolherbst[d]: but the reclocking code is also living in our custom PMU, and the PMU is also responsible for loading the signed firmware
22:11gfxstrand[d]: I'm not going to pass CTS without reclocking. 😢
22:11karolherbst[d]: so it's all quite the mess
22:11karolherbst[d]: I think you used some old kernel branch of mine to test it on a laptop
22:11gfxstrand[d]: I did
22:11gfxstrand[d]: I don't have that laptop anymore, though. It had working fans.
22:11karolherbst[d]: gfxstrand[d]: huh? too slow?
22:11karolherbst[d]: there are tests failing if things are slow?
22:11gfxstrand[d]: Yeah, a couple of the sparse tests
22:11karolherbst[d]: mhhh
22:12gfxstrand[d]: Could still be a shader bug but that's getting less and less likely at this point
22:12karolherbst[d]: well.. the clocks shouldn't impact correctness
22:12gfxstrand[d]: No, but they do impact whether or not the test completes before we kick the context.
22:12karolherbst[d]: ahh...
22:13karolherbst[d]: I have an alternative idea. We'll need to make the timeout configurable for serious compute stuff (tm) anyway...
22:14karolherbst[d]: otherwise, I had some patches to only reclock the engines and leave memory alone, and the GPU in general has enough overheating protections in place so it shouldn't cause any actual issues anyway
22:15karolherbst[d]: mhh.. maybe you could just enable the reclocking stuff, but I'm not quite sure how the code would react if the PMU stuff doesn't work/returns an error/whatever
22:19gfxstrand[d]: 🤷🏻♀️
22:19gfxstrand[d]: Honestly, I don't care how we accomplish it. I just want to be able to submit conformance on Maxwell.
22:19gfxstrand[d]: But if I never do that's also not that big of a deal, honestly.
22:22gfxstrand[d]: As long as Zink runs the desktop as well or better than nouveau GL, that's all I care about.
22:22gfxstrand[d]: And honestly maxwell perf is going to be pretty hard to get with NVK because constant data sucks.
22:23gfxstrand[d]: I have a lot more tricks in my toolbox on Turing
22:24gfxstrand[d]: At this point, I'm working on it more because I'm interested in the history than anything else.
22:27airlied[d]: you could decrease the sparse buffer limits
22:27RSpliet: gfxstrand[d]: unlucky really, GTX 980 is second-gen Maxwell. First-gen doesn't require signed firmware
22:27airlied[d]: pretty sure the tests that fail are doing a bunch of copy engine work on a 1<<24 sized buffer
22:28RSpliet: Don't think the first-gen Maxwell came in high-end though
22:29gfxstrand[d]: I've got a first-gen maxwell I can run, too.
22:29gfxstrand[d]: Maybe I submit on that and pretend it applies to Maxwell B, too? 😂
22:30RSpliet: :D
22:30gfxstrand[d]: I think that's probably not quite in the spirit of the rules but 🤷🏻♀️
22:31RSpliet: First-gen should hit ramgk104 for reclocking, and should work fine if not for the screen flicker, because we never figured out how to sync to vblank, or better, operate the scanout buffer so that there are enough pixels in it for scanout while RAM is taken offline for 100 microseconds
22:33RSpliet: sorry, to be precise, how to wait for vblank before taking RAM offline in the PMU
23:14marysaka[d]: gfxstrand[d]: Another solution for Maxwell B would be to implement host1x syncpoint in nouveau and run CTS on a Tegra X1/Switch :painpeko:
23:14marysaka[d]: but last time I tried, the host1x driver APIs were quite invasive and I had no good way to handle it...
23:15RSpliet: I for one would rather get my gums surgically removed
23:20karolherbst[d]: gfxstrand[d]: at least the ISA is almost the same :ferrisUpsideDown:
23:21karolherbst[d]: gfxstrand[d]: is it really that bad? Like bad because of how vulkan/spir-v works, or what's the issue here?
23:40gfxstrand[d]: No ldg.constant
23:41karolherbst[d]: I see
23:41gfxstrand[d]: If I kept a CPU shadow copy of the descriptor set, I could probably do better promotion to bound cbufs
23:43karolherbst[d]: right.. descriptor sets kinda don't fit the model here very well.. I wonder what nvidia is doing to deal with this
23:55gfxstrand[d]: I'm not sure. I've never looked at any shaders from the blob on Maxwell.