01:49Ariel_Cabello: Hi guys, does anyone know where I can download the "pmu" firmware?
01:49Ariel_Cabello: dmesg saying "pmu: firmware unavailable"
01:50Ariel_Cabello: I have upstream 5.10 kernel and fresh linux-firmware
01:52imirkin: does not exist.
01:53Ariel_Cabello: So I am not missing anything?
01:53imirkin: nope
01:54Ariel_Cabello: And this is because I dont have it? "DRM: failed to create kernel channel, -22"
01:55Ariel_Cabello: Its a red scary message in the middle of my logs
01:55imirkin: yeah, that's a different problem
01:55imirkin: it's indicative of acceleration not working on your gpu
01:55imirkin: what gpu do you have?
01:55Ariel_Cabello: 2060
01:56imirkin: ok. make sure you've updated your linux-firmware, and that the firmware is accessible at the time of nouveau kernel module load
01:56imirkin: (frequently, that means included in initrd)
01:56Ariel_Cabello: Baking it into the kernel will work?
01:56imirkin: if nouveau is built in, you have to include it into EXTRA_FIRMWARE
01:56imirkin: and i'm pretty sure if it's in EXTRA_FIRMWARE, it'll work with a module as well
01:57Ariel_Cabello: Then I can compile Nouveau as a module and put the firmware in EXTRA_FIRMWARE right?
01:57imirkin: i believe that will work, but it's an uncommon configuration
01:58imirkin: this implies that you are building your own kernel, etc. which is totally fine, but just making sure.
02:01Ariel_Cabello: Can I put "nvidia" and it will detect all files under that folder?
02:01Ariel_Cabello: No it wont
02:01imirkin: yeah, iirc you have to enumerate
02:01imirkin: you don't actually need all the files
02:01imirkin: but it's not necessarily easy to work out which ones you need iirc
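As a rough illustration of the enumeration discussed above, the following Python sketch lists every firmware file under one chip directory, symlinks included, as paths relative to the firmware root. That is the form a space-separated firmware list such as CONFIG_EXTRA_FIRMWARE (resolved against CONFIG_EXTRA_FIRMWARE_DIR) expects. The function name list_firmware and the example paths are assumptions for illustration only; nothing here is shipped by nouveau or the kernel. Note that `find -type f` skips symlinked files, whereas this walk keeps them.

```python
import os


def list_firmware(fw_root, subdir):
    """Return sorted paths (relative to fw_root) of all files under subdir.

    Symlinks that resolve to regular files are kept, and symlinked
    directories are descended into. Dangling symlinks are skipped.
    """
    found = []
    base = os.path.join(fw_root, subdir)
    # followlinks=True so that directory symlinks are traversed as well.
    for dirpath, _dirnames, filenames in os.walk(base, followlinks=True):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # os.path.isfile follows symlinks, so a link to a real file counts.
            if os.path.isfile(full):
                found.append(os.path.relpath(full, fw_root))
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical usage: print a space-separated list for the kernel config.
    print(" ".join(list_firmware("/lib/firmware", "nvidia/tu106")))
```

Which of the listed files a given GPU actually needs is not worked out here; the sketch only avoids silently dropping symlinked entries.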
08:36pabs3: https://boilingsteam.com/amd-vs-nvidia-are-linux-gamers-switching-yet/ https://news.ycombinator.com/item?id=25984328
13:04karolherbst: imirkin: I've added the multithreading fixes for nouveau_mm to the MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8765
13:06karolherbst: uhm... I should add the helper macros first
13:06karolherbst: anyway, that only affects the trylock thing
14:47karolherbst: ohh we have a simple_mtx_t
14:47karolherbst: nice
14:47karolherbst: comes with the assert as well
15:39karolherbst: and I think I will submit the fix for the races on the fence list next :) but that will already touch driver code :/ so somebody needs to verify on nv50 and nv30
15:39imirkin: i have nv50 plugged in
15:42RSpliet: imirkin: have you notified your power supplier?
15:42imirkin: RSpliet: nah, but they notify with a high electricity bill
15:42imirkin: also it's just a G84 ;)
15:42imirkin: Quadro FX 370
15:42RSpliet: Oh that's fine then ;-D
15:50Ariel_Cabello: Hi, I had a problem yesterday with "DRM: failed to create kernel channel, -22" and somebody suggested making sure the firmware was in the initramfs. I have done it but I still get the error. Any help?
15:51imirkin: Ariel_Cabello: my suggestion was to ensure that the firmware was available at the time 'nouveau' loads
15:51imirkin: does nouveau load from initramfs?
15:52imirkin: Ariel_Cabello: oh also, it just occurred to me ... you're on a TUsomething ... is your kernel recent enough? what kernel are you on?
15:52imirkin: accel on those is somewhat recent
15:52Ariel_Cabello: It is embedded in the kernel with the entire /lib/firmware/nvidia directory
15:53Ariel_Cabello: Dmesg says that I have TU106
15:53Ariel_Cabello: Its a 2060
15:53imirkin: right. what kernel?
15:53Ariel_Cabello: 5.10
15:53imirkin: TU106 is the important bit, the marketing name is irrelevant
15:54imirkin: anyways ... would have to check if 5.10 had turing accel. sorta assume it does
15:54imirkin: karolherbst: --^
15:54karolherbst: 5.9 has already
15:55karolherbst: probably mesa outdated
15:55karolherbst: uhh wait
15:55karolherbst: error doesn't fit
15:55imirkin: yeah :)
15:55karolherbst: I guess the firmware is missing :D
15:55karolherbst: so, dmesg would help
15:55imirkin: Ariel_Cabello: pastebin dmesg
15:57Ariel_Cabello: pastebin.com/t2KwwfgG
15:58imirkin: [ 0.573010] nouveau 0000:01:00.0: gr: firmware unavailable
15:58imirkin: so yeah, definitely can't find the firmware.
15:59imirkin: based on the timings it seems like nouveau is built into the kernel, yes?
15:59Ariel_Cabello: Yes
15:59Ariel_Cabello: But the firmware is also built into the kernel
15:59imirkin: apparently not hard enough :)
15:59imirkin: it's failing to find something
15:59imirkin: maybe something that's a symlink on the fs isn't making it into EXTRA_FIRMWARE?
16:04Ariel_Cabello: If I put something in EXTRA_FIRMWARE and it fails to find it, make stops
16:05imirkin: try booting with nouveau.debug=trace -- that should dump more info iirc.
16:14karolherbst: Ariel_Cabello: probably some files missing
16:14karolherbst: but uh,,,
16:14karolherbst: strange
16:14karolherbst: normally it tells what file is missing
16:14karolherbst: wait...
16:15Ariel_Cabello: Well I have booted with nouveau.debug=trace
16:15Ariel_Cabello: And dmesg does not display all logs
16:15imirkin: pastebin the boot messages?
16:15imirkin: do you have a level= in there?
16:15karolherbst: Ariel_Cabello: shouldn't matter
16:16Ariel_Cabello: No I dont have any level in there
16:16karolherbst: the firmware loader prints that stuff normally
16:16imirkin: hm, nope
16:16imirkin: Ariel_Cabello: do you have a /lib/firmware/nvidia/tu106/gr dir
16:16Ariel_Cabello: I pastebin them even when they are not all the logs?
16:17imirkin: Ariel_Cabello: can you actually pastebin your EXTRA_FIRMWARE setting?
16:17karolherbst: I am actually wondering if having i915 and nouveau builtin does cause other issues.. like what happens if nouveau is loaded quicker?
16:17imirkin: at least the nvidia-related bits of it
16:17imirkin: karolherbst: same as when they're modules?
16:17Ariel_Cabello: Yes I have tu106/gr with 11 files in it
16:17karolherbst: imirkin: mhhh....
16:17karolherbst: that... worries me now
16:17imirkin: Ariel_Cabello: aha
16:17imirkin: i knew it
16:17imirkin: you took a shortcut
16:18imirkin: there should be 13 files there.
16:18imirkin: 2 are symlinks
16:18karolherbst: same for sec2
16:18karolherbst: there are 3 symlinks
16:19Ariel_Cabello: I have git cloned the firmware repo from kernel.org
16:19Ariel_Cabello: Where are the missing files?
16:19karolherbst: they are there
16:19karolherbst: or should be at least
16:19karolherbst: ohhhh
16:19karolherbst: wait..
16:20imirkin: Ariel_Cabello: you probably did a 'find -type f'?
16:20Ariel_Cabello: Yes
16:20imirkin: does that pick up symlinks?
16:20karolherbst: mhhh
16:20Ariel_Cabello: But they are still not there
16:20karolherbst: there are only 11 in git
16:20Ariel_Cabello: Even if I ls the directory
16:20imirkin: Ariel_Cabello: uhm... that's odd.
16:21karolherbst: but I think you need to do something..
16:21imirkin: Ariel_Cabello: yeah, that seems accurate - they're not in linux-firmware
16:21karolherbst: run "make" once
16:21imirkin: but they are in my "linux-firmware" install
16:21Ariel_Cabello: Where?
16:22imirkin: i don't see where the symlinks are coming from tbh
16:22karolherbst: imirkin: there is this WHENCE file doing crazy shit
16:22karolherbst: "Link: nvidia/tu106/acr/ucode_ahesasc.bin -> ../../tu102/acr/ucode_ahesasc.bin" etc...
16:22imirkin: oh lol
16:22karolherbst: yeah
16:22karolherbst: run "make" :)
16:22imirkin: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/WHENCE#n4345
16:22karolherbst: yep
16:23Ariel_Cabello: make: nothing to be done for 'all'
16:23imirkin: it's done by install
16:24karolherbst: ohh, true
16:24Ariel_Cabello: Oh now i have them
16:24karolherbst: yeah...
16:25karolherbst: usually why I use distribution provided stuff, they usually know what to do :p
16:25Ariel_Cabello: Sorry to bother you guys and thank you. Im dumb...
16:26karolherbst: no worries
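To make the WHENCE mechanism above concrete, here is a small Python sketch that pulls the "Link:" entries out of WHENCE text, so one can see which symlinks the install step is expected to create for a given chip. The line format is taken from the entry quoted in the log; the function name parse_whence_links and the prefix filter are assumptions for illustration, not part of linux-firmware's own tooling.

```python
import re

# Matches lines of the form "Link: <link path> -> <target>", as quoted above.
LINK_RE = re.compile(r"^Link:\s*(\S+)\s*->\s*(\S+)\s*$")


def parse_whence_links(whence_text, prefix=""):
    """Return (link_path, target) pairs whose link_path starts with prefix."""
    links = []
    for line in whence_text.splitlines():
        match = LINK_RE.match(line)
        if match and match.group(1).startswith(prefix):
            links.append((match.group(1), match.group(2)))
    return links


if __name__ == "__main__":
    # Hypothetical usage from a linux-firmware checkout.
    with open("WHENCE", encoding="utf-8", errors="replace") as fh:
        for link, target in parse_whence_links(fh.read(), "nvidia/tu106/"):
            print(link, "->", target)
```

A plain git checkout contains only the real files, so any path reported here will be missing from the tree until the links are created.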
16:27karolherbst: imirkin: btw, will you have time to look at the modifier stuff or should I just merge it? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3724
16:27imirkin: karolherbst: i'll try over the next few days
16:27karolherbst: cool
16:27karolherbst: I am fairly sure everything is alright, but maybe you'll spot something
17:11Ariel_Cabello: Well I have recompiled with the firmware and the error is gone
17:11imirkin: yay
17:12Ariel_Cabello: But now there are worse things
17:12imirkin: boo
17:14Ariel_Cabello: pastebin.com/WmL9MDgW
17:15Ariel_Cabello: The first error is "acr: unload binary failed"
17:16Ariel_Cabello: But I think it starts going south in pmu: firmware unavailable
17:21karolherbst: you can ignore the pmu stuff
17:21karolherbst: Ariel_Cabello: mhhh yeah.. this is a known issue which is just also very painful to debug :/
17:22Ariel_Cabello: And the [cut here] stuff is a product of that?
17:24imirkin: this is a laptop, right?
17:24karolherbst: Ariel_Cabello: yeah, essentially.
17:25karolherbst: tsan is starting to get useless :( https://gist.githubusercontent.com/karolherbst/cdf6d2dea3a88e5b7ad2ab6050714181/raw/8b39e410c8929482a278136663b00e3b479492c0/gistfile1.txt
17:26karolherbst: not sure why it fails to resolve symbols of some functions
17:26imirkin: karolherbst: i think it's not resolving the right SO
17:27imirkin: look at that offset
17:27imirkin: probably larger than the whole binary
17:27karolherbst: imirkin: well... not sure, because some entries have some functions resolved
17:27karolherbst: search for nouveau_pushbuf_kick
17:27imirkin: sure
17:27imirkin: but for those other ones
17:27karolherbst: but yeah.. the offset is huge
17:27imirkin: i think nouveau_dri.so just happens to be the last one
17:27imirkin: karolherbst: oh, could be some generated code
17:27imirkin: from translate
17:27karolherbst: called from pushbuf_submit?
17:28karolherbst: but yeah.. was thinking the same
17:28karolherbst: ohh, right
17:28karolherbst: it calls the callback inside libdrm
17:29imirkin: translate would definitely not call anything in libdrm
17:29karolherbst: yeah..
17:29imirkin: it's just sse'ish instructions to convert a handful of formats the hw doesn't do
17:29karolherbst: and we wouldn't save the func pointer in kick_notify
17:31imirkin: it's POSSIBLE that we instruct translate to write to the pushbuf directly
17:32karolherbst: mhhh
17:32imirkin: it's all in nvc0_vbo_translate.c iirc
17:32karolherbst: sure, but why would kick_notify call into translate?
17:32imirkin: it wouldn't
17:32karolherbst: uhm
17:32karolherbst: would be a translate function
17:32karolherbst: yeah.. well, that's what the stack is saying
17:32imirkin: i'm just explaining how we use translate, which generates code on the fly.
17:32karolherbst: "pushbuf_submit ../nouveau/pushbuf.c:324 (libdrm_nouveau.so.2+0x74b0)" -> push->kick_notify(push);
17:33karolherbst: it probably is a stupid bug in libtsan :D
18:08karolherbst: sooo.. let's see what chromium does
18:09karolherbst: I think "chromium works without issues" is probably a good enough baseline for now
18:13imirkin: karolherbst: so my concern with this tsan-driven stuff is that you're just mashing the keyboard at random until tsan says it's all good
18:29HdkR: Even if tsan shows false positives, it can still manage to point out bad code smells at least :)
18:30imirkin: i'm not saying "don't use tsan"
18:31imirkin: i'm saying "have a global approach, implement it, verify with tsan"
18:31HdkR: Yea, legacy codebase usually means you have to flail to clean up the noise before then though
18:58karolherbst: imirkin: yeah, I know, I do that when writing the patches. Just tsan is nice at pointing out what is wrong
18:59imirkin: karolherbst: like i said, no problem with tsan
19:00karolherbst: ehh.. how could I start chromium with a custom GL driver again ...
19:01imirkin: like for debugging?
19:01imirkin: here's what i've used:
19:01imirkin: chromium --no-sandbox --user-data-dir=/tmp/chrome-gpu-debug --gpu-launcher='xterm -title gpu-launcher -e gdb -ex run --args'
19:01imirkin: i guess that's a little advanced
19:02imirkin: it dumps it into gdb directly
19:02imirkin: you might not want that, dunno
19:03karolherbst: mhhh
19:03karolherbst: ohh
19:03karolherbst: forgot the env var
19:03karolherbst: heh...
19:03imirkin: i think you can put the env var on the outside
19:03imirkin: i.e. not in the gpu-launcher command. i forget.
19:04karolherbst: it spams https://gist.github.com/karolherbst/9dd18750df247b6762dc1cd5314c8036
19:04imirkin: hm
19:04imirkin: well that didn't happen before
19:04imirkin: are you on wayland or something?
19:04imirkin: probably don't want the xterm launcher then...
19:05karolherbst: it does work with DRI_PRIME=0 mhhh
19:05karolherbst: yeah, wayland, but that shouldn't matter
19:05karolherbst: or maybe we don't support something we need for wayland?
19:05karolherbst: let me disable the wayland stuff
19:05karolherbst: heh..
19:05karolherbst: that works
19:06karolherbst: " ANGLE (novu, NV167, OpenGL 4.3 core)" at last
19:06karolherbst: uhhh
19:06karolherbst: they disable a bunch of stuff if they detect anything besides intel
19:06karolherbst: workarounds I mean
19:06karolherbst: okay.. so.. play store I guess would crash it
19:07karolherbst: or well.. nothing
19:07imirkin: i mean ... pull up google maps and turn on 3d
19:07imirkin: that always used to be a good test
19:07karolherbst: I am running WebGL tests
19:08imirkin: i think i got those mostly working
19:09karolherbst: yeah well.. even maps doesn't crash :D
19:09imirkin: chrome does work fairly well for me
19:09imirkin: at least on this pascal board
19:09imirkin: (i have the ignore-gpu-blacklist thing on)
19:09imirkin: the maps issues were fixed ages ago
19:10karolherbst: ahh... maybe I should turn that on as well
19:10karolherbst: but chrome://gpu is saying everything is alright
19:10imirkin: yea
19:11karolherbst: "Multiple Raster Threads: Enabled"
19:11karolherbst: yeah well...
19:11karolherbst: I am running without my patches though
19:12karolherbst: but I guess we should get chromium in wayland mode to work with nouveau regardless :D
19:12karolherbst: or maybe it's a stupid prime thing
19:12karolherbst: mhh "disabled_extension_GL_NV_path_rendering"
19:12karolherbst: what's GL_NV_path_rendering :D
19:12karolherbst: uhhh
19:12karolherbst: sounds like something big
19:21airlied: its a 2d accel ext and big
19:21HdkR: You don't want to burn time implementing that
19:22HdkR: Nightmare
19:24karolherbst: looks like it, yeah
19:28karolherbst: ohh, right. android emulator, that's what I actually wanted to try out :D
19:32karolherbst: oh wow :O
19:32karolherbst: that's brutal
19:32karolherbst: so.. not only does the GPU context crash like immediately
19:32karolherbst: it took down my wifi and my cursor behaves... "strangely"
19:32karolherbst: guess too many IRQs
19:33karolherbst: now it spams "[263324.615318] nouveau 0000:01:00.0: fifo: PBDMA0: 80000000 [] ch 3 00000480 004c5100" :)
19:34karolherbst: ohh, now the reset triggers
19:35karolherbst: ehh "DRM: failed to idle channel 0 [DRM]"
19:36karolherbst: nice, the GPU fails to suspend
19:39karolherbst: anyway.. I think I have a goal now :D
19:40karolherbst: imirkin: but if you find some time, we could already land the MT fixes in nouveau_mm as those are quite self contained and I think the changes in itself make totally sense
19:41karolherbst: we could also try to make it less racy, but I'd rather not replace something on a whim there
19:54imirkin: karolherbst: well, without understanding the overall strategy, making individual things "thread-safe" may not make sense
19:54imirkin: an argument could be that it shouldn't be thread-safe, but rather should be accessed in a thread-safe manner. etc.
21:14karolherbst: imirkin: sure, but the mm code has more or less sane interfaces and all races are internal
21:14karolherbst: for the fence list eg that's not possible and requires driver changes
21:14karolherbst: the races are essentially an implementation detail of the interfaces we have
21:15karolherbst: of course, we could rewrite it so it doesn't race, but...
21:15imirkin: i mean, by that same logic c++ map should be made thread-safe
21:15imirkin: we could say "no, there should be external locking" etc
21:15imirkin: i dunno what the right thing is
21:15imirkin: i haven't looked at how it's used in quite a while
21:16imirkin: maybe the right thing *is* to make nouveau_mm thread-safe, but i don't take that as a given
21:16karolherbst: well.. the memory is shared on a screen level
21:16karolherbst: it's essentially just slab based allocation
21:16imirkin: right
21:16imirkin: within a bo, right?
21:16karolherbst: and used in quite a few places
21:16karolherbst: yes
21:16karolherbst: and has multiple buckets of different sizes
21:17imirkin: right
21:17karolherbst: we could of course replace the entire thing with a bo cache
21:17karolherbst: that's what other drivers are doing
21:17karolherbst: and have magic for sub allocated bos
21:17karolherbst: but my target was rather to fix the current thing and think about all the reworks later :)
21:18karolherbst: mhh, nouveau_buffer_allocate uses nouveau_mm_allocate
21:18karolherbst: and transfer_staging
21:19karolherbst: anyway.. the users are mostly without a context
21:19karolherbst: which... makes it annoying to implement without races
21:21imirkin: like i said - perhaps it's the right call
21:21karolherbst: imirkin: I mean.. I totally get what you are trying to say, the fixes in mm are just the ones I am happy with.. the ones I did for the overall driver and the fence lists are... messy and I don't like those ;)
21:21imirkin: sounds good
21:26karolherbst: the main idea I have for fixing the pushbuffer races is to start each sequence with a "PUSH_ACQ" which does the locking and end with a "PUSH_END" which also always kicks, but intermediate GPU state and everything can be problematic... so.. probably need to get a better overview of everything. But it also shows random other issues, so have to see if I either ditch libdrm completely and rewrite everything or just move it in and
21:26karolherbst: remove a bunch of code... dunno :) but yeah.. the fencing and fixing pushbuffer races are super annoying :/
21:31imirkin: that could work. the stuff in the push_kick callback is the trickiest
21:31imirkin: that's really the driver of everything else
21:31imirkin: get that to work, and everything falls into line
21:32karolherbst: yeah.. I am also thinking of asserting on the pushbuffer being empty in PUSH_ACQ
21:32karolherbst: all the other macros already assert on the lock to be taken
21:33karolherbst: so, that's roughly the main idea: touch it only if you hold the lock and verify it through asserts
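The discipline described above (take the lock at PUSH_ACQ, assert the buffer is empty, assert the lock on every write, always kick at PUSH_END) can be sketched in a language-neutral way. The Python below is only an illustration of that idea: the real code would be C macros in the driver, and the class PushBuf, its method names, and the kick callback signature are all hypothetical.

```python
import threading


class PushBuf:
    """Toy push buffer that may only be touched while its lock is held."""

    def __init__(self, kick):
        self._lock = threading.Lock()
        self._owner = None      # thread id of the current lock holder
        self._words = []        # pending, not yet submitted commands
        self._kick = kick       # called with the pending words on end()

    def acquire(self):
        """Rough analogue of PUSH_ACQ: lock, then check the buffer is empty."""
        self._lock.acquire()
        self._owner = threading.get_ident()
        assert not self._words, "pushbuf must be empty at PUSH_ACQ"

    def emit(self, word):
        """Append a command; asserts the calling thread holds the lock."""
        assert self._owner == threading.get_ident(), "pushbuf lock not held"
        self._words.append(word)

    def end(self):
        """Rough analogue of PUSH_END: always kick, then unlock."""
        assert self._owner == threading.get_ident(), "pushbuf lock not held"
        words, self._words = self._words, []
        self._kick(words)
        self._owner = None
        self._lock.release()
```

Because end() always submits and clears the pending words, the emptiness assert in acquire() only fires if some path wrote to the buffer outside an acquire/end pair, which is the kind of bug the asserts are meant to expose. The sketch does not address the intermediate GPU state concern raised above.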