01:49 Ariel_Cabello: Hi guys, does anyone know where I can download the "pmu" firmware?
01:49 Ariel_Cabello: dmesg saying "pmu: firmware unavailable"
01:50 Ariel_Cabello: I have an upstream 5.10 kernel and a fresh linux-firmware
01:52 imirkin: does not exist.
01:53 Ariel_Cabello: So I am not missing anything?
01:53 imirkin: nope
01:54 Ariel_Cabello: And is this because I don't have it? "DRM: failed to create kernel channel, -22"
01:55 Ariel_Cabello: It's a scary red message in the middle of my logs
01:55 imirkin: yeah, that's a different problem
01:55 imirkin: it's indicative of acceleration not working on your gpu
01:55 imirkin: what gpu do you have?
01:55 Ariel_Cabello: 2060
01:56 imirkin: ok. make sure you've updated your linux-firmware, and that the firmware is accessible at the time of nouveau kernel module load
01:56 imirkin: (frequently, that means included in initrd)
01:56 Ariel_Cabello: Baking it into the kernel will work?
01:56 imirkin: if nouveau is built in, you have to include it into EXTRA_FIRMWARE
01:56 imirkin: and i'm pretty sure if it's in EXTRA_FIRMWARE, it'll work with a module as well
01:57 Ariel_Cabello: Then I can compile Nouveau as a module and put the firmware in EXTRA_FIRMWARE right?
01:57 imirkin: i believe that will work, but it's an uncommon configuration
01:58 imirkin: this implies that you are building your own kernel, etc. which is totally fine, but just making sure.
02:01 Ariel_Cabello: Can I put "nvidia" and it will detect all files under that folder?
02:01 Ariel_Cabello: No, it won't
02:01 imirkin: yeah, iirc you have to enumerate
02:01 imirkin: you don't actually need all the files
02:01 imirkin: but it's not necessarily easy to work out which ones you need iirc
08:36 pabs3: https://boilingsteam.com/amd-vs-nvidia-are-linux-gamers-switching-yet/ https://news.ycombinator.com/item?id=25984328
13:04 karolherbst: imirkin: I've added the multithreading fixes for nouveau_mm to the MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8765
13:06 karolherbst: uhm... I should add the helper macros first
13:06 karolherbst: anyway, that only affects the trylock thing
14:47 karolherbst: ohh we have a simple_mtx_t
14:47 karolherbst: nice
14:47 karolherbst: comes with the assert as well
15:39 karolherbst: and I think I will submit the fix for the races on the fence list next :) but that will already touch driver code :/ so somebody needs to verify on nv50 and nv30
15:39 imirkin: i have nv50 plugged in
15:42 RSpliet: imirkin: have you notified your power supplier?
15:42 imirkin: RSpliet: nah, but they notify me with a high electricity bill
15:42 imirkin: also it's just a G84 ;)
15:42 imirkin: Quadro FX 370
15:42 RSpliet: Oh that's fine then ;-D
15:50 Ariel_Cabello: Hi, I had a problem yesterday with "DRM: failed to create kernel channel, -22" and somebody suggested making sure the firmware was in the initramfs. I have done that but I still get the error. Any help?
15:51 imirkin: Ariel_Cabello: my suggestion was to ensure that the firmware was available at the time 'nouveau' loads
15:51 imirkin: does nouveau load from initramfs?
15:52 imirkin: Ariel_Cabello: oh also, it just occurred to me ... you're on a TUsomething ... is your kernel recent enough? what kernel are you on?
15:52 imirkin: accel on those is somewhat recent
15:52 Ariel_Cabello: It is embedded in the kernel with the entire /lib/firmware/nvidia directory
15:53 Ariel_Cabello: Dmesg says that I have TU106
15:53 Ariel_Cabello: It's a 2060
15:53 imirkin: right. what kernel?
15:53 Ariel_Cabello: 5.10
15:53 imirkin: TU106 is the important bit, the marketing name is irrelevant
15:54 imirkin: anyways ... would have to check if 5.10 had turing accel. sorta assume it does
15:54 imirkin: karolherbst: --^
15:54 karolherbst: 5.9 has already
15:55 karolherbst: probably mesa outdated
15:55 karolherbst: uhh wait
15:55 karolherbst: error doesn't fit
15:55 imirkin: yeah :)
15:55 karolherbst: I guess the firmware is missing :D
15:55 karolherbst: so, dmesg would help
15:55 imirkin: Ariel_Cabello: pastebin dmesg
15:57 Ariel_Cabello: pastebin.com/t2KwwfgG
15:58 imirkin: [ 0.573010] nouveau 0000:01:00.0: gr: firmware unavailable
15:58 imirkin: so yeah, definitely can't find the firmware.
15:59 imirkin: based on the timings it seems like nouveau is built into the kernel, yes?
15:59 Ariel_Cabello: Yes
15:59 Ariel_Cabello: But the firmware is also built into the kernel
15:59 imirkin: apparently not hard enough :)
15:59 imirkin: it's failing to find something
15:59 imirkin: maybe something that's a symlink on the fs isn't making it into EXTRA_FIRMWARE?
16:04 Ariel_Cabello: If I put something in EXTRA_FIRMWARE and it fails to find it, make stops
16:05 imirkin: try booting with nouveau.debug=trace -- that should dump more info iirc.
16:14 karolherbst: Ariel_Cabello: probably some files missing
16:14 karolherbst: but uh...
16:14 karolherbst: strange
16:14 karolherbst: normally it tells what file is missing
16:14 karolherbst: wait...
16:15 Ariel_Cabello: Well I have booted with nouveau.debug=trace
16:15 Ariel_Cabello: And dmesg does not display all logs
16:15 imirkin: pastebin the boot messages?
16:15 imirkin: do you have a level= in there?
16:15 karolherbst: Ariel_Cabello: shouldn't matter
16:16 Ariel_Cabello: No, I don't have any level in there
16:16 karolherbst: the firmware loader prints that stuff normally
16:16 imirkin: hm, nope
16:16 imirkin: Ariel_Cabello: do you have a /lib/firmware/nvidia/tu106/gr dir
16:16 Ariel_Cabello: Should I pastebin them even though they aren't the full logs?
16:17 imirkin: Ariel_Cabello: can you actually pastebin your EXTRA_FIRMWARE setting?
16:17 karolherbst: I am actually wondering if having i915 and nouveau builtin does cause other issues.. like what happens if nouveau is loaded quicker?
16:17 imirkin: at least the nvidia-related bits of it
16:17 imirkin: karolherbst: same as when they're modules?
16:17 Ariel_Cabello: Yes I have tu106/gr with 11 files in it
16:17 karolherbst: imirkin: mhhh....
16:17 karolherbst: that... worries me now
16:17 imirkin: Ariel_Cabello: aha
16:17 imirkin: i knew it
16:17 imirkin: you took a shortcut
16:18 imirkin: there should be 13 files there.
16:18 imirkin: 2 are symlinks
16:18 karolherbst: same for sec2
16:18 karolherbst: there are 3 symlinks
16:19 Ariel_Cabello: I have git cloned the firmware repo from kernel.org
16:19 Ariel_Cabello: Where are the missing files?
16:19 karolherbst: they are there
16:19 karolherbst: or should be at least
16:19 karolherbst: ohhhh
16:19 karolherbst: wait..
16:20 imirkin: Ariel_Cabello: you probably did a 'find -type f'?
16:20 Ariel_Cabello: Yes
16:20 imirkin: does that pick up symlinks?
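`find -type f` matches only regular files, so any symlinks in the firmware tree are silently left out of a list built that way. Below is a minimal C sketch of an alternative walker, assuming POSIX `nftw()` with `FTW_PHYS`, which reports symlinks as `FTW_SL` instead of following them. The function `list_firmware` and its "note:" messages are illustrative names, not part of any kernel or linux-firmware tooling, and the output is only a candidate value for EXTRA_FIRMWARE. It can only list symlinks that actually exist on disk.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ftw.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* State shared with the nftw() callback (nftw has no user-data pointer). */
static size_t root_len;   /* length of the firmware root, e.g. "/lib/firmware" */
static FILE *out;         /* where the space-separated list is written */
static int first;         /* suppresses the separator before the first entry */

static int visit(const char *path, const struct stat *sb, int type,
                 struct FTW *ftwbuf)
{
    (void)ftwbuf;
    /* With FTW_PHYS, symlinks arrive as FTW_SL and sb comes from lstat().
     * Keep symlinks and regular files; skip directories and anything else. */
    if (!(type == FTW_SL || (type == FTW_F && S_ISREG(sb->st_mode))))
        return 0;

    if (type == FTW_SL) {
        char target[4096];
        ssize_t n = readlink(path, target, sizeof target - 1);
        if (n >= 0) {
            target[n] = '\0';
            fprintf(stderr, "note: %s is a symlink -> %s\n", path, target);
        }
    }
    /* Strip "<root>/" so each entry is relative to the firmware root. */
    fprintf(out, "%s%s", first ? "" : " ", path + root_len + 1);
    first = 0;
    return 0;
}

/* Writes every regular file and symlink under <root>/<subdir> to dst,
 * space-separated and relative to root. root must not end in '/'.
 * Returns nftw()'s result (0 on success). */
int list_firmware(const char *root, const char *subdir, FILE *dst)
{
    char start[4096];

    assert(root[0] != '\0' && root[strlen(root) - 1] != '/');
    snprintf(start, sizeof start, "%s/%s", root, subdir);
    root_len = strlen(root);
    out = dst;
    first = 1;
    return nftw(start, visit, 16, FTW_PHYS);
}
```

For example, `list_firmware("/lib/firmware", "nvidia/tu106", stdout)` would print both the regular files and any installed symlinks under that directory.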
16:20 karolherbst: mhhh
16:20 Ariel_Cabello: But they are still not there
16:20 karolherbst: there are only 11 in git
16:20 Ariel_Cabello: Even if I ls the directory
16:20 imirkin: Ariel_Cabello: uhm... that's odd.
16:21 karolherbst: but I think you need to do something..
16:21 imirkin: Ariel_Cabello: yeah, that seems accurate - they're not in linux-firmware
16:21 karolherbst: run "make" once
16:21 imirkin: but they are in my "linux-firmware" install
16:21 Ariel_Cabello: Where?
16:22 imirkin: i don't see where the symlinks are coming from tbh
16:22 karolherbst: imirkin: there is this WHENCE file doing crazy shit
16:22 karolherbst: "Link: nvidia/tu106/acr/ucode_ahesasc.bin -> ../../tu102/acr/ucode_ahesasc.bin" etc...
16:22 imirkin: oh lol
16:22 karolherbst: yeah
16:22 karolherbst: run "make" :)
16:22 imirkin: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/WHENCE#n4345
16:22 karolherbst: yep
16:23 Ariel_Cabello: make: nothing to be done for 'all'
16:23 imirkin: it's done by install
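The `Link:` entries in WHENCE describe symlinks that only the install step creates, which is why a fresh git clone has 11 files in `tu106/gr` instead of 13. A minimal C sketch, assuming the exact `Link: <link> -> <target>` spacing quoted above: it parses such a line and resolves the relative target against the link's directory, showing that `nvidia/tu106/acr/ucode_ahesasc.bin` ends up pointing at `nvidia/tu102/acr/ucode_ahesasc.bin`. The functions `parse_whence_link` and `resolve_link` are hypothetical helpers; they do not create links and are not how linux-firmware itself implements this.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parses "Link: <link> -> <target>" into two NUL-terminated buffers of
 * `cap` bytes each. Returns 0 on success, -1 for any other line. */
int parse_whence_link(const char *line, char *link, char *target, size_t cap)
{
    static const char prefix[] = "Link: ";
    const char *arrow;
    size_t n;

    assert(cap > 0);
    if (strncmp(line, prefix, sizeof prefix - 1) != 0)
        return -1;
    line += sizeof prefix - 1;

    arrow = strstr(line, " -> ");
    if (arrow == NULL)
        return -1;
    n = (size_t)(arrow - line);
    if (n == 0 || n >= cap)
        return -1;
    memcpy(link, line, n);
    link[n] = '\0';

    arrow += 4;                       /* skip " -> " */
    n = strcspn(arrow, "\r\n");       /* ignore a trailing newline */
    if (n == 0 || n >= cap)
        return -1;
    memcpy(target, arrow, n);
    target[n] = '\0';
    return 0;
}

/* Joins a relative target onto the link's directory and collapses "."
 * and ".." components. Returns 0 on success, -1 if the result would
 * climb above the firmware root or does not fit in `cap` bytes. */
int resolve_link(const char *link, const char *target, char *out, size_t cap)
{
    char buf[1024];
    char *parts[128];
    char *tok, *save = NULL;
    const char *slash = strrchr(link, '/');
    int dirlen = slash ? (int)(slash - link) : 0;
    int depth = 0;
    size_t used = 0;

    assert(cap > 0);
    if ((size_t)snprintf(buf, sizeof buf, "%.*s/%s", dirlen, link, target) >= sizeof buf)
        return -1;

    for (tok = strtok_r(buf, "/", &save); tok != NULL; tok = strtok_r(NULL, "/", &save)) {
        if (strcmp(tok, ".") == 0)
            continue;
        if (strcmp(tok, "..") == 0) {
            if (depth == 0)
                return -1;
            depth--;
        } else {
            if (depth == 128)
                return -1;
            parts[depth++] = tok;
        }
    }

    out[0] = '\0';
    for (int i = 0; i < depth; i++) {
        int w = snprintf(out + used, cap - used, "%s%s", i ? "/" : "", parts[i]);
        if (w < 0 || (size_t)w >= cap - used)
            return -1;
        used += (size_t)w;
    }
    return 0;
}
```

Whatever builds the EXTRA_FIRMWARE list has to account for these link targets too, either by running the install step first or by adding the resolved paths.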
16:24 karolherbst: ohh, true
16:24 Ariel_Cabello: Oh now i have them
16:24 karolherbst: yeah...
16:25 karolherbst: usually why I use distribution provided stuff, they usually know what to do :p
16:25 Ariel_Cabello: Sorry to bother you guys, and thank you. I'm dumb...
16:26 karolherbst: no worries
16:27 karolherbst: imirkin: btw, will you have time to look at the modifier stuff or should I just merge it? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3724
16:27 imirkin: karolherbst: i'll try over the next few days
16:27 karolherbst: cool
16:27 karolherbst: I am fairly sure everything is alright, but maybe you'll spot something
17:11 Ariel_Cabello: Well I have recompiled with the firmware and the error is gone
17:11 imirkin: yay
17:12 Ariel_Cabello: But now there are worse things
17:12 imirkin: boo
17:14 Ariel_Cabello: pastebin.com/WmL9MDgW
17:15 Ariel_Cabello: The first error is "acr: unload binary failed"
17:16 Ariel_Cabello: But I think it starts going south in pmu: firmware unavailable
17:21 karolherbst: you can ignore the pmu stuff
17:21 karolherbst: Ariel_Cabello: mhhh yeah.. this is a known issue which is just also very painful to debug :/
17:22 Ariel_Cabello: And the [cut here] stuff is a product of that?
17:24 imirkin: this is a laptop, right?
17:24 karolherbst: Ariel_Cabello: yeah, essentially.
17:25 karolherbst: tsan is starting to get useless :( https://gist.githubusercontent.com/karolherbst/cdf6d2dea3a88e5b7ad2ab6050714181/raw/8b39e410c8929482a278136663b00e3b479492c0/gistfile1.txt
17:26 karolherbst: not sure why it fails to resolve symbols of some functions
17:26 imirkin: karolherbst: i think it's not resolving the right SO
17:27 imirkin: look at that offset
17:27 imirkin: probably larger than the whole binary
17:27 karolherbst: imirkin: well... not sure, because some entries have some functions resolved
17:27 karolherbst: search for nouveau_pushbuf_kick
17:27 imirkin: sure
17:27 imirkin: but for those other ones
17:27 karolherbst: but yeah.. the offset is huge
17:27 imirkin: i think nouveau_dri.so just happens to be the last one
17:27 imirkin: karolherbst: oh, could be some generated code
17:27 imirkin: from translate
17:27 karolherbst: called from pushbuf_submit?
17:28 karolherbst: but yeah.. was thinking the same
17:28 karolherbst: ohh, right
17:28 karolherbst: it calls the callback inside libdrm
17:29 imirkin: translate would definitely not call anything in libdrm
17:29 karolherbst: yeah..
17:29 imirkin: it's just sse'ish instructions to convert a handful of formats the hw doesn't do
17:29 karolherbst: and we wouldn't save the func pointer in kick_notify
17:31 imirkin: it's POSSIBLE that we instruct translate to write to the pushbuf directly
17:32 karolherbst: mhhh
17:32 imirkin: it's all in nvc0_vbo_translate.c iirc
17:32 karolherbst: sure, but why would kick_notify call into translate?
17:32 imirkin: it wouldn't
17:32 karolherbst: uhm
17:32 karolherbst: would be a translate function
17:32 karolherbst: yeah.. well, that's what the stack is saying
17:32 imirkin: i'm just explaining how we use translate, which generates code on the fly.
17:32 karolherbst: "pushbuf_submit ../nouveau/pushbuf.c:324 (libdrm_nouveau.so.2+0x74b0)" -> push->kick_notify(push);
17:33 karolherbst: it probably is a stupid bug in libtsan :D
18:08 karolherbst: sooo.. let's see what chromium does
18:09 karolherbst: I think "chromium works without issues" is probably a good enough baseline for now
18:13 imirkin: karolherbst: so my concern with this tsan-driven stuff is that you're just mashing the keyboard at random until tsan says it's all good
18:29 HdkR: Even if tsan shows false positives, it can still manage to point out bad code smells at least :)
18:30 imirkin: i'm not saying "don't use tsan"
18:31 imirkin: i'm saying "have a global approach, implement it, verify with tsan"
18:31 HdkR: Yea, legacy codebase usually means you have to flail to clean up the noise before then though
18:58 karolherbst: imirkin: yeah, I know, I do that when writing the patches. Just tsan is nice at pointing out what is wrong
18:59 imirkin: karolherbst: like i said, no problem with tsan
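The workflow imirkin describes (decide the locking rule, implement it, then verify with tsan) can be illustrated with a minimal C sketch. The rule here is stated up front: `value` is only touched while `lock` is held. The names `struct counter`, `worker` and `run_counter` are hypothetical and unrelated to Mesa. Built with `-fsanitize=thread` (supported by GCC and Clang), this version should run cleanly; removing the lock/unlock pair around the increment is the kind of change tsan would report as a data race.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Locking rule decided before writing code: `value` is only read or
 * written while `lock` is held. */
struct counter {
    pthread_mutex_t lock;
    long value;
};

static void *worker(void *arg)
{
    struct counter *c = arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&c->lock);
        c->value++;   /* without the lock/unlock pair, tsan reports a race here */
        pthread_mutex_unlock(&c->lock);
    }
    return NULL;
}

/* Runs `nthreads` workers (1..16) and returns the final count. */
long run_counter(int nthreads)
{
    struct counter c;
    pthread_t tid[16];

    assert(nthreads > 0 && nthreads <= 16);
    pthread_mutex_init(&c.lock, NULL);
    c.value = 0;
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, &c);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    pthread_mutex_destroy(&c.lock);
    return c.value;
}
```

The point of the design is that tsan checks an existing rule rather than defining one: a clean run confirms the rule is followed, it does not tell you what the rule should have been.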
19:00 karolherbst: ehh.. how could I start chromium with a custom GL driver again ...
19:01 imirkin: like for debugging?
19:01 imirkin: here's what i've used:
19:01 imirkin: chromium --no-sandbox --user-data-dir=/tmp/chrome-gpu-debug --gpu-launcher='xterm -title gpu-launcher -e gdb -ex run --args'
19:01 imirkin: i guess that's a little advanced
19:02 imirkin: it dumps it into gdb directly
19:02 imirkin: you might not want that, dunno
19:03 karolherbst: mhhh
19:03 karolherbst: ohh
19:03 karolherbst: forgot the env var
19:03 karolherbst: heh...
19:03 imirkin: i think you can put the env var on the outside
19:03 imirkin: i.e. not in the gpu-launcher command. i forget.
19:04 karolherbst: it spams https://gist.github.com/karolherbst/9dd18750df247b6762dc1cd5314c8036
19:04 imirkin: hm
19:04 imirkin: well that didn't happen before
19:04 imirkin: are you on wayland or something?
19:04 imirkin: probably don't want the xterm launcher then...
19:05 karolherbst: it does work with DRI_PRIME=0 mhhh
19:05 karolherbst: yeah, wayland, but that shouldn't matter
19:05 karolherbst: or maybe we don't support something we need for wayland?
19:05 karolherbst: let me disable the wayland stuff
19:05 karolherbst: heh..
19:05 karolherbst: that works
19:06 karolherbst: " ANGLE (novu, NV167, OpenGL 4.3 core)" at last
19:06 karolherbst: uhhh
19:06 karolherbst: they disable a bunch of stuff if they detect anything besides intel
19:06 karolherbst: workarounds I mean
19:06 karolherbst: okay.. so.. play store I guess would crash it
19:07 karolherbst: or well.. nothing
19:07 imirkin: i mean ... pull up google maps and turn on 3d
19:07 imirkin: that always used to be a good test
19:07 karolherbst: I am running WebGL tests
19:08 imirkin: i think i got those mostly working
19:09 karolherbst: yeah well.. even maps doesn't crash :D
19:09 imirkin: chrome does work fairly well for me
19:09 imirkin: at least on this pascal board
19:09 imirkin: (i have the ignore-gpu-blacklist thing on)
19:09 imirkin: the maps issues were fixed ages ago
19:10 karolherbst: ahh... maybe I should turn that on as well
19:10 karolherbst: but chrome://gpu is saying everything is alright
19:10 imirkin: yea
19:11 karolherbst: "Multiple Raster Threads: Enabled"
19:11 karolherbst: yeah well...
19:11 karolherbst: I am running without my patches though
19:12 karolherbst: but I guess we should get chromium in wayland mode to work with nouveau regardless :D
19:12 karolherbst: or maybe it's a stupid prime thing
19:12 karolherbst: mhh "disabled_extension_GL_NV_path_rendering"
19:12 karolherbst: what's GL_NV_path_rendering :D
19:12 karolherbst: uhhh
19:12 karolherbst: sounds like something big
19:21 airlied: its a 2d accel ext and big
19:21 HdkR: You don't want to burn time implementing that
19:22 HdkR: Nightmare
19:24 karolherbst: looks like it, yeah
19:28 karolherbst: ohh, right. android emulator, that's what I actually wanted to try out :D
19:32 karolherbst: oh wow :O
19:32 karolherbst: that's brutal
19:32 karolherbst: so.. not only do the GPU contexts crash almost immediately
19:32 karolherbst: it took down my wifi and my cursor behaves... "strangely"
19:32 karolherbst: guess too many IRQs
19:33 karolherbst: now it spams "[263324.615318] nouveau 0000:01:00.0: fifo: PBDMA0: 80000000 [] ch 3 00000480 004c5100" :)
19:34 karolherbst: ohh, now the reset triggers
19:35 karolherbst: ehh "DRM: failed to idle channel 0 [DRM]"
19:36 karolherbst: nice, the GPU fails to suspend
19:39 karolherbst: anyway.. I think I have a goal now :D
19:40 karolherbst: imirkin: but if you find some time, we could already land the MT fixes in nouveau_mm as those are quite self-contained and I think the changes in themselves make total sense
19:41 karolherbst: we could also try to make it less racy, but I'd rather not replace something on a whim there
19:54 imirkin: karolherbst: well, without understanding the overall strategy, making individual things "thread-safe" may not make sense
19:54 imirkin: an argument could be that it shouldn't be thread-safe, but rather should be accessed in a thread-safe manner. etc.
21:14 karolherbst: imirkin: sure, but the mm code has more or less sane interfaces and all races are internal
21:14 karolherbst: for the fence list eg that's not possible and requires driver changes
21:14 karolherbst: the races are essentially an implementation detail of the interfaces we have
21:15 karolherbst: of course, we could rewrite it so it doesn't race, but...
21:15 imirkin: i mean, by that same logic c++ map should be made thread-safe
21:15 imirkin: we could say "no, there should be external locking" etc
21:15 imirkin: i dunno what the right thing is
21:15 imirkin: i haven't looked at how it's used in quite a while
21:16 imirkin: maybe the right thing *is* to make nouveau_mm thread-safe, but i don't take that as a given
21:16 karolherbst: well.. the memory is shared on a screen level
21:16 karolherbst: it's essentially just slab based allocation
21:16 imirkin: right
21:16 imirkin: within a bo, right?
21:16 karolherbst: and used in quite a few places
21:16 karolherbst: yes
21:16 karolherbst: and has multiple buckets of different sizes
21:17 imirkin: right
21:17 karolherbst: we could of course replace the entire thing with a bo cache
21:17 karolherbst: that's what other drivers are doing
21:17 karolherbst: and have magic for sub allocated bos
21:17 karolherbst: but my target was rather to fix the current thing and think about all the reworks later :)
21:18 karolherbst: mhh, nouveau_buffer_allocate uses nouveau_mm_allocate
21:18 karolherbst: and transfer_staging
21:19 karolherbst: anyway.. the users are mostly without a context
21:19 karolherbst: which... makes it annoying to implement without races
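The shape being discussed (screen-level, slab-based sub-allocation within a bo, several size buckets, callers without a context) is the case for an allocator that takes its own lock rather than relying on external locking. A minimal C sketch under that assumption follows. It is not nouveau_mm's code or API: `struct mm`, `mm_init`, `mm_alloc` and `mm_free` are hypothetical, each bucket owns one fixed region of a single simulated bo, and a full bucket falls back to the next larger one instead of allocating a new slab as a real allocator would.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define MM_NUM_BUCKETS 4
#define MM_SLOTS 64               /* slots per bucket, tracked by one bitmap */

/* One size class: a region of the backing bo split into equal slots. */
struct mm_bucket {
    uint32_t slot_size;           /* bytes per slot */
    uint32_t base;                /* offset of this bucket's region in the bo */
    uint64_t used;                /* bit i set => slot i allocated */
};

struct mm {
    pthread_mutex_t lock;         /* internal lock: callers need no context */
    struct mm_bucket bucket[MM_NUM_BUCKETS];
};

void mm_init(struct mm *mm)
{
    uint32_t base = 0;
    pthread_mutex_init(&mm->lock, NULL);
    for (int i = 0; i < MM_NUM_BUCKETS; i++) {
        mm->bucket[i].slot_size = 256u << (2 * i);   /* 256, 1K, 4K, 16K */
        mm->bucket[i].base = base;
        mm->bucket[i].used = 0;
        base += mm->bucket[i].slot_size * MM_SLOTS;
    }
}

/* Stores the offset of a free slot >= size in *offset and returns 0,
 * or returns -1 if no bucket can satisfy the request. */
int mm_alloc(struct mm *mm, uint32_t size, uint32_t *offset)
{
    int ret = -1;
    pthread_mutex_lock(&mm->lock);
    for (int i = 0; i < MM_NUM_BUCKETS && ret < 0; i++) {
        struct mm_bucket *b = &mm->bucket[i];
        if (size > b->slot_size || b->used == UINT64_MAX)
            continue;
        for (int s = 0; s < MM_SLOTS; s++) {
            if (!(b->used & (UINT64_C(1) << s))) {
                b->used |= UINT64_C(1) << s;
                *offset = b->base + (uint32_t)s * b->slot_size;
                ret = 0;
                break;
            }
        }
    }
    pthread_mutex_unlock(&mm->lock);
    return ret;
}

/* Releases the slot containing `offset`; asserts on a double free. */
void mm_free(struct mm *mm, uint32_t offset)
{
    pthread_mutex_lock(&mm->lock);
    for (int i = 0; i < MM_NUM_BUCKETS; i++) {
        struct mm_bucket *b = &mm->bucket[i];
        uint32_t end = b->base + b->slot_size * MM_SLOTS;
        if (offset >= b->base && offset < end) {
            uint32_t s = (offset - b->base) / b->slot_size;
            assert(b->used & (UINT64_C(1) << s));
            b->used &= ~(UINT64_C(1) << s);
            break;
        }
    }
    pthread_mutex_unlock(&mm->lock);
}
```

Because every race is confined to the bitmaps behind `lock`, the interface stays unchanged for callers, which is the property karolherbst points to when contrasting the mm code with the fence list.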
21:21 imirkin: like i said - perhaps it's the right call
21:21 karolherbst: imirkin: I mean.. I totally get what you are trying to say, the fixes in mm are just the ones I am happy with.. the ones I did for the overall driver and the fence lists are... messy and I don't like those ;)
21:21 imirkin: sounds good
21:26 karolherbst: the main idea I have for fixing the pushbuffer races is to start each sequence with a "PUSH_ACQ" which does the locking and end with a "PUSH_END" which also always kicks, but intermediate GPU state and everything can be problematic... so.. probably need to get a better overview of everything. But it also shows random other issues, so have to see if I either ditch libdrm completely and rewrite everything or just move it in and remove a bunch of code... dunno :) but yeah.. the fencing and fixing pushbuffer races are super annoying :/
21:31 imirkin: that could work. the stuff in the push_kick callback is the trickiest
21:31 imirkin: that's really the driver of everything else
21:31 imirkin: get that to work, and everything falls into line
21:32 karolherbst: yeah.. I am also thinking of asserting on the pushbuffer being empty in PUSH_ACQ
21:32 karolherbst: all the other macros already assert that the lock is taken
21:33 karolherbst: so, that's roughly the main idea: touch it only if you hold the lock and verify it through asserts
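The scheme just described (PUSH_ACQ takes the lock and asserts the pushbuffer is empty, every emit asserts the lock is held, PUSH_END always kicks and then unlocks) can be sketched in C as below. Only the names PUSH_ACQ and PUSH_END come from the discussion; `struct pushbuf`, `PUSH_DATA`, `push_kick` and `pushbuf_init` are hypothetical stand-ins, and the kick merely counts submissions instead of talking to the kernel. The atomic `locked` flag plays a role similar to the assert helper that the chat notes comes with Mesa's simple_mtx_t: it checks that someone holds the lock, not which thread does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PUSH_MAX 256

struct pushbuf {
    pthread_mutex_t lock;
    atomic_bool locked;          /* debug aid: true while the lock is held */
    uint32_t cmd[PUSH_MAX];
    unsigned count;              /* words queued since the last kick */
    unsigned kicks;              /* number of submissions so far */
};

static void push_kick(struct pushbuf *p)
{
    /* A real driver would submit cmd[0..count) to the kernel here. */
    p->kicks++;
    p->count = 0;
}

/* Start of a sequence: take the lock and require an empty buffer, so no
 * half-written sequence from another user leaks into this one. */
#define PUSH_ACQ(p) do {                                    \
    pthread_mutex_lock(&(p)->lock);                         \
    atomic_store(&(p)->locked, true);                       \
    assert((p)->count == 0);                                \
} while (0)

/* Every emit checks that the lock is held. */
#define PUSH_DATA(p, v) do {                                \
    assert(atomic_load(&(p)->locked));                      \
    assert((p)->count < PUSH_MAX);                          \
    (p)->cmd[(p)->count++] = (v);                           \
} while (0)

/* End of a sequence: always kick, then release the lock. */
#define PUSH_END(p) do {                                    \
    assert(atomic_load(&(p)->locked));                      \
    push_kick(p);                                           \
    atomic_store(&(p)->locked, false);                      \
    pthread_mutex_unlock(&(p)->lock);                       \
} while (0)

void pushbuf_init(struct pushbuf *p)
{
    pthread_mutex_init(&p->lock, NULL);
    atomic_init(&p->locked, false);
    p->count = 0;
    p->kicks = 0;
}
```

Always kicking in PUSH_END trades some batching for the simple invariant that the buffer is empty whenever the lock is free; as imirkin notes, the harder part in practice is what the kick callback itself does.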