05:21 qq[IrcCity]: hello. From where does the symbol rcu_read_unlock_strict originate in nouveau.ko? I only see it defined in <linux/rcupdate.h>,
05:22 qq[IrcCity]: but cannot find the use of it in the source.
05:26 qq[IrcCity]: It is the thing that obstructed testing of karolherbst’s patch to nouveau_gem_new() on my system. I cannot boot linux-5.12 built at home,
05:27 qq[IrcCity]: hence resorted to inserting the patched module into 5.12 downloaded (as binaries) from Internet.
08:20 qq[IrcCity]: Bypassed the problem (made rcu_read_unlock_strict() a dummy function), but yet cannot load my nouveau.ko: Exec format error and “Skipping invalid relocation target…” in dmesg.
08:21 qq[IrcCity]: Where to look for causes of binary incompatibility between Linux kernels (of the same version)?
08:23 RSpliet: rcu_read_unlock_strict is kernel API
08:24 RSpliet: There's quite a few consumers, I'd be fairly confident that rcu_read_unlock_strict itself isn't broken
08:31 qq[IrcCity]: I don't suspect bugs there. I was unable to boot a home-built linux-5.12, whereas the home-built nouveau.ko is incompatible with the *running* linux-5.12.
08:32 qq[IrcCity]: Hence I just can't test fixes.
08:34 qq[IrcCity]: Look, I was nearly out of touch with Linux kernel for 11 years, not surprisingly I cannot do things deemed easy by you guys.
08:39 pmoreau: karolherbst: Your use-after-free patch did get rid of the warning here.
08:39 pmoreau: Now, why does OpenCL think that the allocation succeeded when I can see a `bo_new(1fa00000, 0): -12` right before… and why does it try allocating the max size when the code seemed to restrict it to only 60%…
09:15 RSpliet: qq[IrcCity]: I've been out of touch w kernel dev myself, I find it tedious to build a new nouveau.ko, don't worry, I understand the pain. If it's incompatibility it's usually because it's using the wrong includes - e.g. for a different kernel version.
09:16 qq[IrcCity]: And the code in includes depends on .config and fuck-knows-what other things.
08:39 qq[IrcCity]: And there are also different (sometimes binary-incompatible) versions of gcc and fuck-knows-which ancillary tools used in the building process.
11:03 karolherbst: pmoreau: cool
11:04 karolherbst: pmoreau: probably because we don't pass the error along or something?
11:05 karolherbst: qq[IrcCity]: just clone a linux tree and use your distribution config and install the whole kernel. Probably the easiest way
11:12 pmoreau: karolherbst: Actually it’s because the allocation in VRAM fails, but it then tries to reallocate in GART and that works. So that’s why OpenCL remains happy.
11:12 pmoreau: Now on to understanding why the allocation test still fails even though the allocation succeeded. It looks like they fill the memory with something and check it afterwards, and that check fails. They run some kernel as well, and that one triggers a bunch of “gr: TRAP_MP_EXEC - TP 0 MP 0: 00000008 [TIMEOUT] at 000248 warp 12, opcode 1004b003 00000780”.
11:12 karolherbst: pmoreau: ahh
11:12 karolherbst: mhh
11:12 karolherbst: yeah.. I was seeing those messages as well but didn't care all that much. It's the same with nvc0
11:13 karolherbst: maybe it is indeed something like "we can't use GART for this access type" or something
11:24 pmoreau: Even if I reduce the allocation size so that it fits in VRAM, I still get those.
11:25 pmoreau: Also I know that local_arg_def, local_kernel_def, async_copy_global_to_local, async_copy_local_to_global also trigger those TIMEOUTs and fail.
11:26 karolherbst: yeah..
11:26 karolherbst: I think there is something weird going on
11:26 karolherbst: I just never really looked into it
11:26 karolherbst: although I got all test_basic tests to pass on nvc0
11:27 karolherbst: and I doubt there are any patches pending from nvc0 point of view even
11:27 pmoreau: This new kernel is only 89 instructions whereas the other ones are closer to 140, so it might make it easier to debug.
11:28 pmoreau: I’ll try to figure out what is going on, as that and the LOCAL_LIMIT_(WRITE|READ) are the two main sources of errors for basic on nv50.
17:53 pmoreau: What could the reason be for a “gr: TRAP_MP_EXEC - TP 1 MP 0: 00000008 [TIMEOUT] at 000298 warp 12, opcode 30020801 c4100780”? When I encountered them before, it was in kernels using synchronisation and thought it might be something linked to some of the threads never reaching the barrier and getting timed out or something.
17:53 pmoreau: But in this case there is no synchronisation going on. Does the driver/card have some kind of killing mechanism if the kernel takes too long, and that would result in a timeout?
17:54 imirkin: pmoreau: iirc timeout means the shader ran for too many cycles
17:54 imirkin: so infinite loop of some sort
17:54 imirkin: or just a _massive_ shader
17:54 pmoreau: Ok
17:55 imirkin: it tells you what the "last" opcode was
17:55 imirkin: when it decided to die
17:55 pmoreau: I could believe it being an infinite loop
17:56 pmoreau: I had a look at some of the last opcodes, and they usually are still within the loop except for some almost at the end. It’s a mix of loads and shuffles.
18:09 karolherbst: ohh...
18:09 karolherbst: yeah.. with CL that can happen I guess :/
18:09 karolherbst: pmoreau: does CL have any way of dealing with this?
18:10 pmoreau: With kernels that do not want to leave the cosy environment of a GPU? :-)
18:10 pmoreau: No idea
18:10 karolherbst: no I mean.. if a kernel execution aborts, because you execute too many instructions
18:10 karolherbst: let's see...
18:11 pmoreau: Mmh, still no idea
18:11 karolherbst: CL_OUT_OF_RESOURCES probably?
18:11 karolherbst: just need to catch it somehow
18:11 karolherbst: and deal with it
18:12 karolherbst: pmoreau: do you know if CUDA has anything official here?
18:12 karolherbst: probably best to mimic what nvidia is doing
18:12 pmoreau: No idea either, and I don’t remember running into such a situation (trying to execute too many instructions).
18:12 karolherbst: mhh
18:13 karolherbst: well.. easy to try
18:13 pmoreau: Infinite loops on the other hand…
18:13 karolherbst: you just need 2 million on tesla
18:15 pmoreau: But that’s 2 million instructions loaded on the GPU, not 2 million instructions executed, right? Like if the only code on the GPU is 1000 instructions for doing a loop from 0 to 1000, would that trigger it?
18:15 karolherbst: mhh..
18:15 karolherbst: good question
18:16 pmoreau: Though having too many instructions loaded would probably not trigger a timeout but some other kind of error.
18:16 karolherbst: maybe we just set a very low timeout somewhere
18:16 karolherbst: maybe we can configure it
18:18 pmoreau: I’ll try to reproduce it with a shorter example. A bit too lazy to go through those 89 instructions and double-check that it’s doing what it’s supposed to.
18:18 karolherbst: adding debugging features is also kind of on my todo list
18:19 karolherbst: it's even simpler for compute to support it
18:19 pmoreau: That would be awesome, yes!
18:21 pmoreau: Priority-wise, I’m planning on getting at least test_basic working on NV50, then probably have a quick look at images here with the work that Ilia and you did. And after that, we will see.
18:21 pmoreau: It might be 2022 by then and future involvement will depend on the new job.
18:24 karolherbst: yeah, no rush