02:40 AndrewR: ..it seems the blender 2.79 performance settings->system tab can freeze my g92 when using mesa git. I tried a few older versions and it seems the problem started between 40a554c8dd06d24cbf56b919ea0a08ffdc7a4e5b and 8ecace073ea68e3819ea7c5637bd479067e1420b (9-13 Jan 2021). I hope to bisect but it may take some time ...
02:40 imirkin: woohoo, i must have broken it with all my fixes
02:41 imirkin: i definitely see my name in there :(
02:42 imirkin: sorry :(
02:43 imirkin: AndrewR: might be this guy? ec668e2fd142db27dfa9ea1084005db328889721
02:43 imirkin: or this? c0171c4626319ae6822f9c490d9118d5caf43246 ? but seems super-unlikely
02:44 AndrewR: imirkin, not sure yet, I just started compile of first bisection step ...let it finish?
02:44 imirkin: sure
02:47 AndrewR: imirkin, I had an old dri install on liveDVD, but was surprised that just replacing nouveau_dri.so there with the new version doesn't introduce the bug. Still, moving in nouveau_pipe.so did this ... so first I looked for st/mesa commits .... maybe I was wrong.
02:47 imirkin: let me know what broke - happy to (try to) fix
03:01 AndrewR: imirkin, https://cloud.mail.ru/public/ww6k/mhfLh5kh6 - there is an apitrace (3 mb) from a working mesa version - but be aware it may hang your gpu... sometimes I was able to kill blender and shut down, but if it hangs for the first time with a dmesg complaint it usually will hang again harder on the second try.
03:01 imirkin: AndrewR: ok thanks. i have a G84 but it's not my primary
03:01 imirkin: but if i disconnect, i hope you'll understand :)
03:02 AndrewR: imirkin, yeahh ....if you need your machine for something this is not urgent .....
03:02 imirkin: it's night
03:02 imirkin: i'll need it for work tomorrow
03:03 imirkin: but hopefully it'll recover by then ;)
03:03 imirkin: AndrewR: replays fine fwiw
03:04 imirkin: with the latest master
03:04 imirkin: + a couple patches, mostly unrelated
03:05 AndrewR: imirkin, good ... do you use 64-bit mesa? I have weird setup with 64b kernel + 32b mesa compiled with clang ..... it may complicate something
03:05 imirkin: yes, 64-bit userspace
03:05 imirkin: doing 32-bit would not be trivial for me
03:05 imirkin: achievable, but non-trivial
03:06 AndrewR: imirkin, then let's see where my bisect lands .....
03:06 imirkin: AndrewR: does replaying the trace trigger the issue for you?
03:07 AndrewR: imirkin, I did it on working mesa just now, so I have not yet tried to replay on known bad mesa ...let me cook (compile) mesa again :}
03:08 imirkin: ok
03:08 imirkin: could be something that doesn't end up causing a problem when replaying the trace
03:08 imirkin: or could be something that only happens in your setup
03:13 AndrewR: ..X restart ...
03:20 AndrewR: ..so, " iris: Delete iris_resolve_conditional_render" seems to be good ... now testing 4439757db23490b9a4b75487d699470cdd7ebcf4 (it seems to be slightly above those nv50 commits .. )
03:21 imirkin: actually i didn't check it with today's commits
03:21 imirkin: only yesterday's i think? or day before?
03:24 AndrewR: imirkin, I think the bug was around (for me) for a few days already ..I was usually checking another tab :}
03:25 imirkin: oh no, it was this morning
03:25 imirkin: i rebased to test out the drm shim
03:45 AndrewR: so, 4439757db23490b9a4b75487d699470cdd7ebcf4 seems to be fine as well ..... (and so all commits before it, incl. those nv50- ones .... ?) There is one more nouveau-related commit just below cut line ....mmm..intrigue!
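The bisection steps above amount to a binary search between a known-good and a known-bad commit. A minimal sketch of that idea, assuming a strictly linear history indexed by integers and a hypothetical test hook `bad_from` (real `git bisect` works on the commit graph and is not implemented this way):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical test hook: returns true if the build at commit `index` hangs. */
typedef bool (*is_bad_fn)(int index, void *ctx);

/* Example hook for illustration: every commit at or after *ctx is bad. */
static bool bad_from(int index, void *ctx)
{
    return index >= *(int *)ctx;
}

/*
 * `good` is a commit index known to work, `bad` one known to hang,
 * with good < bad. Returns the index of the first bad commit and,
 * if `steps` is non-NULL, stores how many builds had to be tested.
 */
static int bisect_first_bad(int good, int bad, is_bad_fn is_bad,
                            void *ctx, int *steps)
{
    int tested = 0;

    assert(good < bad);
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;

        if (is_bad(mid, ctx))
            bad = mid;   /* regression is at mid or earlier */
        else
            good = mid;  /* regression is after mid */
        tested++;
    }
    if (steps)
        *steps = tested;
    return bad;
}
```

With 100 commits between the two endpoints this needs at most 7 test builds, which is why each "compile, restart X, try Blender" round above narrows the range so quickly.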
04:16 AndrewR: sadly replaying this trace (blender.bin) doesn't trigger the bug, while a real Blender run hangs the GPU now ....
04:17 imirkin: =/
04:17 imirkin: that's what i kinda figured
04:17 imirkin: also isn't blender 2.79 super-old now?
04:19 AndrewR: imirkin, sure, but I rarely use it, mostly for looking into files my friend sends me from time to time (he uses 2.79 + 2.80x ... on win)
04:19 imirkin: not a problem, just pointing it out
04:20 imirkin: the one in portage is 2.83
06:23 AndrewR: imirkin, sorry: reverting 28a781323fba87e6e338cfecb0b6fe25a08f61a4 ("nouveau: change fence destruction logic on screen destroy") seems to fix my blender hang ... (note, this machine also has two nv50-class gpus)
15:31 imirkin: AndrewR: gah, ok
15:50 imirkin: AndrewR: ok, we need to quickly revert that
15:50 imirkin: it was for drm-shim's benefit... sigh
15:50 imirkin: i guess this stuff is more subtle
15:50 imirkin: there's some case i didn't consider
15:51 imirkin: AndrewR: that's very weird though ... this gets triggered at, essentially, program exit
15:51 imirkin: although blender might do something weird
15:51 imirkin: and create/destroy a few screens first
15:51 imirkin: hm
15:53 imirkin: skeggsb: does anything bad happen if we destroy a channel while stuff is running on it? will the kernel "handle it", or does userspace need to be careful about it?
16:31 karolherbst: imirkin: if you submit a new pushbuf the kernel returns an error
16:31 imirkin: karolherbst: what about the old ones?
16:31 karolherbst: well, you already submitted them
16:31 imirkin: that were already submitted and are in the middle of executing
16:31 karolherbst: well, they won't finish
16:31 imirkin: does destroying the channel screw anything up for them
16:32 imirkin: what if you start GEM_CLOSE'ing stuff
16:32 imirkin: etc
16:32 karolherbst: I was planning on handling all that much nicer
16:32 karolherbst: but we also have a way to "sync" on a submission, but that's mainly for debugging
16:32 imirkin: this has nothing to do with multi-threading
16:32 karolherbst: but we kind of have to handle the case where we know the context is gone
16:32 imirkin: this has to do with screen shutdown
16:33 imirkin: anyways, i sorta assume you don't have a precise answer. that's why i was hoping skeggsb would answer.
16:55 AndrewR: imirkin, if it works on your side or with different gpus - then I can live with local revert - I have 72 patches for patch am anyway :} (cl stuff and some random series I like to test today)
16:57 AndrewR: imirkin, (testing/bisecting for this problem was done without any of my patches, apart from two related to buildability )
17:05 imirkin: AndrewR: if it's a problem for you, it's a problem for others
17:05 imirkin: i don't understand what the problem is, and i don't run this code live.
17:06 karolherbst: AndrewR: what's the thing breaking btw?
17:06 imirkin: blender 2.79
17:06 karolherbst: ahh.. :/
17:06 imirkin: (maybe other versions too, but that's the one he uses)
17:08 AndrewR: karolherbst, not just blender, but attempt at looking at 'system' preferences pane .... where I already set some aa and stuff ..... may be this is important. but apitrace replay of very same actions doesn't hang like real thing here .....
17:09 imirkin: yeah, it's probably creating/destroying contexts
17:11 karolherbst: probably
17:11 AndrewR: imirkin, but does 2.83 work for you? Or 'not compiled yet' due to surprisingly-heavy dependencies? :}
17:13 imirkin: well, definitely not compiled yet
17:13 imirkin: but due to not wanting it ;)
17:18 AndrewR: imirkin, I was intrigued by how Openshot tries to create animated titles via blender, but as it turned out this process is ...slow .... surprisingly slow. (on my amd fx 4300)
17:43 AndrewR: https://gitlab.freedesktop.org/mesa/mesa/-/issues/4223
17:43 AndrewR: just so it will not get lost
20:46 skeggsb: imirkin: the drm submits a fence and waits on it before telling nvkm to tear the channel down
20:47 imirkin: skeggsb: ok, so modulo bugs, userspace can submit a pushbuf, and then immediately tear down the universe, and all should be well, right?
20:47 skeggsb: that's where those "failed to idle channel" messages come from on dead channels
20:47 skeggsb: yes, in theory
20:48 imirkin: AndrewR: to confirm - does blender hang, or does the computer hang?
20:48 skeggsb: in an ideal world we'd just tear the channel down without waiting, but at least on some chipsets, nvkm doesn't do enough to ensure that's safe right now
20:49 imirkin: skeggsb: the change that seems to trigger issues is when tearing down nouveau in screen_destroy
20:49 imirkin: it used to wait for all (user) fences to complete before triggering work / tearing things down
20:49 imirkin: but i changed it to not do that anymore
20:50 imirkin: so now it instantly triggers all the work and destroys the context/etc
20:51 skeggsb: does it still blow up if you don't delete the engine objects and just kill the channel? there's some potential issues there
20:51 imirkin: and i think some (esp older) software does a funny dance with GLX which ends up causing screens to be created/destroyed on start
20:51 imirkin: skeggsb: unfortunately i can't repro the issues, but i could try to write a patch which leaks those
20:52 imirkin: (obv not a permanent solution, but it's a thing to test)
20:52 skeggsb: (that, also, *should* work, but probably doesn't on some chipsets for the same reasons we need a fence before channel destroy)
20:52 skeggsb: hopefully fixed "soon"...
20:52 imirkin: this is on G92
20:52 imirkin: in case it matters
20:52 skeggsb: yeah, very likely that one's broken
20:52 imirkin: so why don't i see the breakage on a G84?
20:53 skeggsb: yeah, those use identical code for everything relevant there, so i'm not sure
20:53 imirkin: hm, i'll try some stuff maybe a bit more directed
20:53 imirkin: e.g. just like do the bringup, submit something, destroy, repeat
20:54 skeggsb: yeah, could just be timing, the g84 perhaps gets through the pushbuffer faster for whatever reason
20:54 imirkin: hrm
20:54 imirkin: i'll try submitting lots of stuff then ;)
20:54 imirkin: big draws
20:54 skeggsb: worth a try, who knows though, could be a bunch of things
20:55 imirkin: yeah. would definitely prefer having a repro
20:55 imirkin: than just 'guess and check'
20:56 skeggsb: in any case, it *should* work fine without causing the world to end (in the case of deleting engine objects that are still in use, channel errors would be expected, but they *should* be harmless to the system, but quite likely not on some chipsets)
20:56 skeggsb: working on it...
20:58 imirkin: ok cool
21:13 AndrewR: imirkin, sorry was distracted by other things. Blender hangs GPU, usually - i can move mouse cursor but not much else. acpi poweroff work
21:14 imirkin: ok, so that's consistent with a hung gpu + failed recovery
21:14 AndrewR: imirkin, _sometimes_ I was able to switch to a tty and see dmesg/kill blender. But not lately :(
21:14 imirkin: yea
21:14 imirkin: check my suggestion in the bug you filed
21:14 imirkin: if you're willing to experiment some more
21:16 AndrewR: imirkin, I assume I should comment this block with hangy (unreverted) mesa ?
21:16 imirkin: yes
21:16 AndrewR: I guess russian support will not answer too fast ( around midnight here) so I'll risk :}
21:17 AndrewR: (was buying tickets online)
21:17 imirkin: skeggsb: btw, assuming that not destroying the engines works ... what's the recourse?
21:17 imirkin: AndrewR: well, no big rush
21:17 imirkin: skeggsb: i guess there's nothing we can really do without a kernel change...
21:18 skeggsb: imirkin: just free() them instead, the kernel will clean them up
21:18 imirkin: aha right
21:18 imirkin: and we'll "leak" them for the duration of the process, but ... wtvr
21:18 skeggsb: ah yeah that's a potential problem
21:18 imirkin: well, only a problem if you create tons and tons of these
21:18 imirkin: not sure what would cause that
21:19 imirkin: maybe can do something clever with the loader
21:20 skeggsb: even if the kernel didn't screw up and cause a hang, you'd still risk channel errors in dmesg, it's userspace's responsibility to make sure stuff isn't being used before it kills them
21:20 imirkin: hrmph ok
21:20 imirkin: annoying
21:20 imirkin: ooh, can i do a nouveau_bo_wait on something?
21:20 imirkin: here's the problem i'm trying to solve:
21:20 imirkin: drm-shim is basically "emulating" some of these ioctl's
21:20 imirkin: obviously doing any sort of fence wait with the shim ain't gonna work
21:21 imirkin: so my solution to that problem was "just delete stuff instead of waiting"
21:21 imirkin: is there a way to wait for things to quiesce via kernel api's?
21:21 imirkin: (since then the shim would just return "yup, it's done" immediately)
21:21 skeggsb: mmmm.... perhaps waiting on the pushbuf bo
21:22 imirkin: hmmmmmm. yeah, that's a good idea. unfortunately not easy to implement since that's not exposed iirc.
21:22 imirkin: but maybe a libdrm bump
21:23 karolherbst: ohhh, that might be indeed a fine addition
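The idea discussed above (wait for submitted work through a kernel-facing call, so that an emulated device can simply report "done") can be illustrated with a toy model. This is only a sketch under stated assumptions: `struct backend`, `struct toy_hw`, `hw_wait_idle`, `shim_wait_idle` and this `screen_destroy` are hypothetical names invented for illustration, not Mesa, libdrm, or drm-shim APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for "the device": not a Mesa or libdrm type. */
struct backend {
    /* Blocks until previously submitted work is idle; 0 on success. */
    int (*wait_idle)(void *priv);
    void *priv;
};

/* Toy hardware state: number of submitted jobs that have not finished. */
struct toy_hw {
    int pending;
};

/* "Real" backend: waits until the toy hardware has retired every job. */
static int hw_wait_idle(void *priv)
{
    struct toy_hw *hw = priv;

    while (hw->pending > 0)
        hw->pending--;  /* stand-in for the GPU finishing one job */
    return 0;
}

/* Emulated backend: nothing ever executes, so it is always idle. */
static int shim_wait_idle(void *priv)
{
    (void)priv;
    return 0;
}

/*
 * Teardown that waits through the backend before freeing anything.
 * Returns true if the teardown happened with no work in flight.
 */
static bool screen_destroy(struct backend *be, struct toy_hw *hw)
{
    assert(be != NULL && be->wait_idle != NULL);
    if (be->wait_idle(be->priv) != 0)
        return false;
    /* Safe point: objects would be destroyed here. */
    return hw == NULL || hw->pending == 0;
}
```

The design point is that the teardown path is identical for both backends; only the wait hook differs, so the emulated case never has to special-case "skip the wait and delete immediately".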
21:24 imirkin: AndrewR: i'll have some more patches for you tonight. sorry for the troubles =/
21:25 AndrewR: imirkin, sorry for discovering this that late!
21:26 imirkin: thanks for discovering it at all =]
21:34 AndrewR: it hanged .....
21:34 imirkin: bleh
21:34 imirkin: sorry
21:34 imirkin: well, there could be other problems
21:34 AndrewR: but again, I have reclocking patch in kernel .... so, it partially reclocked GPU (may play role too) will add to bug
21:34 imirkin: i'll try to make something more robust
21:35 imirkin: do you increase or decrease clocks?
21:41 AndrewR: i think g92 only can go up .....
21:42 AndrewR: *my* g92 - thanks to vbios being different on different cards ...
21:43 Lyude: hey btw - if any folks here are interested: https://lore.kernel.org/dri-devel/CAHUNapTB1tt6T931LfBWVWreXGFwd6tTPqH58i7s3WKivCDT4g@mail.gmail.com/ if anyone has ideas for shorter GSoC projects for this year, we're still looking for responses to the emails we sent out about this
21:46 imirkin: AndrewR: ah ok. ben's theory was that slower boards would be more susceptible to this
21:46 imirkin: your G92 should be better in every way than my G84 though
21:47 imirkin: although maybe you're only reclocking some things and not others. dunno
21:47 imirkin: i'm going to run some experiments to see what all i can repro
22:15 imirkin: AndrewR: ftr, that memory reclock doesn't seem to work
22:16 imirkin: note that the memory clock stays at 500mhz, not the desired 850mhz
22:16 AndrewR: imirkin, yeah .... but I have no knowledge to fix this :} :(
22:16 imirkin: i bet RSpliet has been *dying* to fix this...
22:17 * AndrewR hopes this is a metaphor
22:17 imirkin: heh
22:17 imirkin: more like needling him
22:17 imirkin: it's something he looked at a while back
22:17 imirkin: like ... 5y ago
22:17 imirkin: heh
22:18 imirkin: this stuff is all beyond old
22:18 imirkin: the heyday of these GPUs was late 2000's
22:18 imirkin: i.e. 2007-2009
22:19 AndrewR: imirkin, yeah, but this consequence: you can buy one for 1000 rubles and not for say 5000 :}
22:19 imirkin: heh, yeah
22:19 imirkin: sounds like about the same as the rates in USD
22:20 imirkin: $10-ish
22:25 AndrewR: imirkin, I recall the problem with mem reclock was due to how hard it was/is to capture all those subtle operations between engines, controlled via firmware (memory link training?). Same for the instability problem on fermi ... sadly there is no ext. x-scope you can point at a gpu :}
22:26 imirkin: it's not so hard on the earlier gpu's
22:27 AndrewR: imirkin, I still can find some way to boot livecd + blob, if there are ideas for testing ....
22:28 imirkin: AndrewR: i'll have some patches tonight
22:28 imirkin: not this second, sorry
22:28 AndrewR: imirkin, ok, I'll probably stay 'up' for quite some time because I spent the day sleeping
22:29 imirkin: ehm
22:29 imirkin: might be easier if you sleep at night
22:29 imirkin: your call ;)
22:44 RSpliet: imirkin: day sleeping tends to not be one's call, funny enough
22:45 imirkin: don't i know it.
22:45 RSpliet: I had no say in the fact that I just slept from 19:00-22:30
22:45 RSpliet: Now what have I been dying to fix?
22:45 RSpliet: oh, DRAM reclocking on G93
22:45 imirkin: reclock on earlier tesla
22:45 RSpliet: ehh
22:45 RSpliet: -1
22:46 RSpliet: Yeah that just doesn't work. Don't think the DRAM reclocking code is even hooked up?
22:47 imirkin: it's not
22:50 RSpliet: Figured. Anyway, it's just like all the other DRAM reclocking code. Not difficult, just tedious.
22:50 RSpliet: Biggest blocker for these <G98 cards is that it needs to be selective about which bits in which registers it touches. If a column of bits in the pstate table contains either all-0 or all-1, we shouldn't touch it.
22:50 RSpliet: Not set it once, or unset it once, just leave it as it is
22:50 RSpliet: Need a surprising amount of code for something that sounds so trivial
22:50 imirkin: boo
22:51 imirkin: can't you just add if's before emitting the memx_* stuff?
22:51 skeggsb: the gk104 code handles that stuff somehow, though i don't remember how exactly
22:51 RSpliet: I was thinking of a solution where the if's are done once, and decide the masks of bits that can be touched by memx_*
22:51 skeggsb: as does the incomplete fermi stuff
22:52 imirkin: sounds complicated
22:52 imirkin: if's are cheap.
22:52 RSpliet: yeah, GK10x handles it, NVAx didn't need it (mainly because we got lucky)
22:54 skeggsb: if (ram->diff.rammap_11_0b_0400) {
22:54 skeggsb: data |= cfg->bios.rammap_11_0b_0400 << 5;
22:54 skeggsb: mask |= 0x00000020;
22:54 skeggsb: }
22:54 skeggsb: like that :P
22:55 imirkin: skeggsb: btw, i assume any fermi plans are as stalled as they have been in the past?
22:55 RSpliet: But a hundred times :-D Yeah
22:55 skeggsb: yes, there's an inconvenient number of knobs and dials to deal with
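The "only touch bits that actually differ" rule and the mask/data pattern pasted above can be sketched in isolation. This is an illustrative sketch, not nouveau code: `varying_bits` and `masked_update` are hypothetical helpers, and the table of per-pstate register values is assumed to be a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Given one 32-bit value per performance state, return a mask of the bit
 * positions whose value differs between at least two entries. Bits that
 * are all-0 or all-1 across the table are left out of the mask, so they
 * are never written.
 */
static uint32_t varying_bits(const uint32_t *vals, size_t n)
{
    uint32_t any_set = 0;            /* bits set in at least one entry */
    uint32_t all_set = 0xffffffffu;  /* bits set in every entry */

    assert(vals != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        any_set |= vals[i];
        all_set &= vals[i];
    }
    return any_set & ~all_set;
}

/*
 * Read-modify-write helper: bits inside `mask` come from `data`,
 * bits outside `mask` keep whatever the register already held.
 */
static uint32_t masked_update(uint32_t old, uint32_t mask, uint32_t data)
{
    return (old & ~mask) | (data & mask);
}
```

Computing the mask once per table and reusing it for every write corresponds to the "if's are done once" approach suggested in the conversation; the per-field `if` blocks in the pasted snippet build the same kind of `mask`/`data` pair one field at a time.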
22:55 skeggsb: imirkin: tbh, i forgot it even existed
22:55 imirkin: hehe
22:55 imirkin: so even more stalled than before ;)
22:57 skeggsb: dealing with that stuff requires a level of concentration and willingness to accept pain that is... rare lol
22:57 imirkin: i don't blame you!
22:57 imirkin: just checking
22:57 skeggsb: keeping all the interactions paged in at once is difficult
22:58 skeggsb: with hw that old, i really don't see why nvidia couldn't just go "here's the RM code as a reference, go for it"
22:58 skeggsb: buuuut wishes be fishes
22:58 RSpliet: I'm thinking the same thing about RM code for current gen HW, but then who am I...
22:59 skeggsb: :P