01:08 HdkR: karolherbst: Way too old for me :P
02:44 imirkin: karolherbst: yeah, i had a fence issue. i emitted the fence from a different unit and things were better. but it was very basically like a fence vs query output race.
02:44 imirkin: so i changed it to emit a fence by the unit which was outputting the query results
02:44 imirkin: basically there are two types of query outputs ... 32- and 64-bit. the 32-bit ones also write a sequence number from command stream. 64-bit ones don't.
02:45 imirkin: we rely on the screen fence for the latter ones, but it wasn't working here
02:45 imirkin: the theory was that this particular xfb-related query was extra-special
02:45 imirkin: and even a debug_printf() in there was enough time for things to "work out"
02:45 imirkin: so ... very timing-sensitive
02:46 imirkin: karolherbst: commit in question: https://cgit.freedesktop.org/mesa/mesa/commit/?id=729020c7e096ddb17b262797153d56fa4bade990
07:21 karolherbst: imirkin: I could fix it with a usleep(10000)....
07:21 karolherbst: so yeah.. it kind of fits
07:21 karolherbst: but this has nothing to do with queries
07:22 karolherbst: soo.. the test ended up waiting on the last fence through eglDestroyContext
07:22 karolherbst: and if I add a little sleep before pushbuf_submit it helps
07:23 karolherbst: so my issue was that we emit the fence properly, the pushbuffer submitted to the kernel looked alright, but the last fence never got signalled
07:23 karolherbst: and.. it wasn't really doing anything in the last batch
07:23 karolherbst: just some blit operations
07:24 karolherbst: no error in dmesg
07:24 karolherbst: (the bo addresses and sizes are bogus, because libdrm without patches) https://gist.githubusercontent.com/karolherbst/0621352aac035b858e839298c96de669/raw/c4735722b6ad3ffc0c8027c59e7c226ec6003b3b/gistfile1.txt
07:25 karolherbst: 0xd got signalled
07:25 karolherbst: but 0xe never
07:25 karolherbst: but once mesa timed out on that fence, later ones got signalled again :)
07:25 karolherbst: soo.. maybe this is just the issue where we submit too fast?
07:26 karolherbst: and somehow with my mt patches I end up kicking more often, which... I don't think actually happens.. but maybe
07:26 karolherbst: who knows
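The "0xd got signalled, 0xe never" observation can be pictured with a small sketch of a sequence-number fence check. This is not Mesa's actual code: `fence_signalled` and `fence_wait` are hypothetical names, and the wraparound-safe comparison is an assumption about how such a check is typically written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not Mesa's implementation: a fence with sequence
 * number `want` counts as signalled once the value the GPU wrote into
 * the fence buffer (`seen`) has reached it. The signed difference keeps
 * the comparison correct across 32-bit wraparound. */
static bool fence_signalled(uint32_t seen, uint32_t want)
{
    return (int32_t)(seen - want) >= 0;
}

/* Poll a fence word up to `max_polls` times and report whether it was
 * signalled. `fence_word` stands in for the mapped fence bo. */
static bool fence_wait(const volatile uint32_t *fence_word, uint32_t want,
                       unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (fence_signalled(*fence_word, want))
            return true;
    }
    return false; /* gave up: the value never reached `want` */
}
```

With the fence word stuck at 0xd, waiting for 0xe runs out of polls, while any later value such as 0xf would also satisfy a wait for 0xe, which fits later fences appearing signalled after the timeout.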
07:30 Santurysim: race condition?
07:41 karolherbst: unlikely
07:41 karolherbst: that test runs single threaded
07:41 karolherbst: it might be something on the kernel side though
07:41 karolherbst: maybe I should boot a debug kernel
07:50 montjoie: hello it seems nouveau is broken since one week on linux-next
07:50 montjoie: nouveau 0000:02:00.0: DRM: failed to initialise sync subsystem, -28
07:50 montjoie: already know or worth a report on mailing list ?
07:56 karolherbst: huh?
07:56 karolherbst: that's a weird one
07:56 karolherbst: montjoie: mind showing your dmesg?
07:58 karolherbst: mhhh
07:58 karolherbst: montjoie: but.. are you up for doing a git bisect? a week isn't that long and you might get lucky by bisecting the last two weeks
07:58 karolherbst: I see some nouveau related patches from 8-9 days ago
07:58 montjoie: karolherbst: https://pastebin.com/bDc6ULNe I got it on two different systems
07:59 montjoie: I will try to bisect
07:59 karolherbst: cool, thanks
08:12 montjoie: v5.13 seems okay
10:38 karolherbst: mhh.. but that submitting too fast bug should throw an error... mhhh
10:38 karolherbst: strange
11:10 montjoie: karolherbst: bisect led to d02117f8efaa5fbc37437df1ae955a147a2a424a drm/ttm: remove special handling for non GEM drivers
11:10 montjoie: but I fear a problem, since I don't see a relation
11:10 karolherbst: mhhh
11:11 karolherbst: you might have run into an issue which doesn't always happen
11:11 montjoie: i bisected only on drivers/gpu, perhaps I need to bisect without filter
11:11 karolherbst: yeah maybe
11:11 montjoie: I reverted it to see if it works
11:11 karolherbst: I don't bother with sub dirs as this usually only saves 1-2 steps
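The "only saves 1-2 steps" remark follows from bisection being logarithmic: halving or quartering the candidate set removes one or two rounds. A rough sketch of the arithmetic, where `bisect_steps` is a hypothetical helper and the commit counts in the note below are made-up illustration values:

```c
#include <assert.h>

/* Approximate number of build-and-test rounds needed to isolate one bad
 * commit among `n` candidates by bisection: ceil(log2(n)). */
static unsigned bisect_steps(unsigned n)
{
    unsigned steps = 0;
    /* Stop at 31 so the shift below never overflows a 32-bit unsigned. */
    while (steps < 31 && (1u << steps) < n)
        steps++;
    return steps;
}
```

For example, if an unfiltered range held 8000 commits and a path filter cut that to 2000, the estimate drops from 13 rounds to 11.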
11:16 montjoie: seems to fix the issue
11:16 karolherbst: mhhhh
11:16 karolherbst: try booting a few times just to make sure
11:17 montjoie: at least the fail was always here
11:17 karolherbst: yeah sure.. but before bothering Christian with this, make sure that reverting it indeed fixes the issue, so that you won't run into this "ohh, now it's still broken after the third reboot" situation
11:17 montjoie: I remote boot those PCs to develop another driver, and nouveau always fails
11:18 karolherbst: sometimes it's just bad luck
11:18 karolherbst: if that would have been a change inside nouveau.. sure, but your error caused by this commit would be a little strange
11:20 montjoie: I reboot both systems to confirm
11:25 montjoie: karolherbst: I restarted bisect without path, but in fact between the last good and bad there was no other commit
11:28 karolherbst: okay
11:28 montjoie: BUT on one of them a second issue is present, screen stuck on start of boot
11:30 montjoie: and it happens on v5.13
11:39 montjoie: but this issue is on fb
13:13 karolherbst: imirkin: I think nv30 is quite bonkers :D
13:13 karolherbst: running into asserts by just running glxgears
13:15 karolherbst: assert(imm->Immediate.DataType == TGSI_IMM_FLOAT32);
13:16 karolherbst: mhh IMM[0] UINT32 {0, 0, 0, 0}
13:17 karolherbst: no int ops though
13:19 karolherbst: removing that assert makes everything black and broken :(
13:20 karolherbst: mhh
13:20 karolherbst: guess it's bisecting time
13:26 karolherbst: ehh.. the display just went black actually
13:27 Santurysim: The whole display?
13:27 karolherbst: well.. it turned of
13:27 karolherbst: off
13:28 Santurysim: This is related to power management, isn't it?
13:30 karolherbst: dpms
13:59 karolherbst: okay.. nv30 also runs with my MT fixes
13:59 karolherbst: but uff...
13:59 karolherbst: the driver...
15:12 imirkin_: karolherbst: fences are queries
15:12 imirkin_: karolherbst: tell anholt about the nv30 troubles. i'm gonna bet it's due to his nir->tgsi stuff.
15:18 imirkin_: neat
15:18 imirkin_: PSA: linking nicks shares permissions for the "primary" nick. you can do that with nickserv link / nickserv group.
15:22 karolherbst: imirkin_: yeah.. it's just some immediate thing
15:22 imirkin_: karolherbst: well, it's generating int immediates. that's bad.
15:22 karolherbst: well...
15:22 imirkin_: in any case, immediates are encoded very weirdly
15:22 karolherbst: they contain the float data
15:22 imirkin_: oh
15:22 imirkin_: lol
15:22 imirkin_: ok
15:23 karolherbst: nir just have int immediates :)
15:23 karolherbst: *has
15:23 karolherbst: because it doesn't matter
15:23 imirkin_: yeah. for a no-integers driver, maybe be nice and generate them as floats in the conversion?
15:23 imirkin_: the nir->tgsi pass knows that it's a no-integers situation
15:23 karolherbst: well.. it's an union..
15:24 karolherbst: I guess the convert could be improved a little
15:24 imirkin_: that's my point
15:24 karolherbst: anyway... I am more annoyed by this nv50 issue
15:24 imirkin_: just have it set the imm type to float for no-integer drivers
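The suggestion here is that the immediate's bits stay the same and only the type tag changes for drivers without integer support. A minimal sketch of that idea, using hypothetical stand-in types (`enum imm_type`, `struct imm`, `make_imm`) rather than the real TGSI or NIR definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the immediate type tags; the real enum
 * lives in Mesa's TGSI headers. */
enum imm_type { IMM_FLOAT32, IMM_UINT32 };

struct imm {
    enum imm_type type;
    union { float f; uint32_t u; } value[4]; /* same 32 bits, two views */
};

/* Build an immediate from raw 32-bit words. The bits are stored
 * unchanged; only the type tag depends on whether the driver supports
 * integers. */
static struct imm make_imm(const uint32_t bits[4], bool driver_has_integers)
{
    struct imm imm;
    imm.type = driver_has_integers ? IMM_UINT32 : IMM_FLOAT32;
    for (int i = 0; i < 4; i++)
        imm.value[i].u = bits[i];
    return imm;
}
```

Because the payload is a union, a consumer that asserts on a float type tag would accept the result for a no-integers driver without any change to the stored data.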
15:24 imirkin_: right
15:24 karolherbst: soo.. no clue what's happening
15:24 imirkin_: so fences are emitted by the query engine
15:24 karolherbst: the stuff gets emitted
15:24 karolherbst: and submitted
15:25 imirkin_: we use a 32-bit report type (and iirc tell it to do "short" output), which means it will just write the sequence number.
15:25 imirkin_: which is given in the cmdstream
15:25 karolherbst: right
15:25 imirkin_: long output = 128 bit always ... so for a 32-bit query it goes <seq no> <query result> <64-bit timestamp>
15:25 imirkin_: for a 64-bit query it goes like <64-bit query result> <64-bit timestamp>
15:26 karolherbst: ohh okay
15:26 karolherbst: we just ignore the timestamp I guess
15:26 imirkin_: for 128-bit output? except for the timestamp query ;)
15:26 karolherbst: right
15:26 imirkin_: but there are long and short outputs
15:26 imirkin_: long is 128-bit, short is 32-bit iirc
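The two long report layouts described above can be written down as C structs. This is only a sketch of the description in the log: the struct and field names are made up, and the field order simply follows the order in which the parts were listed, which is an assumption about the actual memory layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Long (128-bit) report for a 32-bit query, following the order given
 * in the log: sequence number, result, 64-bit timestamp. */
struct report32_long {
    uint32_t sequence;  /* sequence number taken from the command stream */
    uint32_t result;
    uint64_t timestamp;
};

/* Long (128-bit) report for a 64-bit query: no sequence number. */
struct report64_long {
    uint64_t result;
    uint64_t timestamp;
};

_Static_assert(sizeof(struct report32_long) == 16, "long report is 128 bits");
_Static_assert(sizeof(struct report64_long) == 16, "long report is 128 bits");
```

The 64-bit layout has no slot for a sequence number, which is why a 64-bit query needs some other signal, such as a separate fence, to tell when its result is valid.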
15:26 karolherbst: yeah sure.. and I only see short queries in the cmd stream
15:26 imirkin_: and the short output ... i forget. is it the seq no, or the actual query value? i sort of assume the latter....
15:26 imirkin_: since the former makes no sense
15:26 karolherbst: I am just wondering _why_ some queries are never executed
15:26 imirkin_: but who knows
15:27 imirkin_: they all get executed
15:27 imirkin_: there are additional flags
15:27 imirkin_: which indicate that the query should wait for things
15:27 imirkin_: in terms of read or write ordering
15:27 karolherbst: ohhh
15:27 karolherbst: well.. I pasted the command stream since the last query
15:27 imirkin_: i missed it
15:28 karolherbst: imirkin_: https://gist.githubusercontent.com/karolherbst/0621352aac035b858e839298c96de669/raw/c4735722b6ad3ffc0c8027c59e7c226ec6003b3b/gistfile1.txt
15:28 karolherbst: I see 0xd in the fence bo, but never 0xe
15:28 karolherbst: except...
15:28 karolherbst: I do a usleep before submitting the last batch
15:28 karolherbst: or well.. it helps
15:28 karolherbst: the bo addresses are bogus
15:29 imirkin_: that's very odd.
15:29 karolherbst: unpatched drm
15:29 karolherbst: imirkin_: yep...
15:29 karolherbst: later submits are executed
15:29 karolherbst: so the test just times out and moves on
15:29 karolherbst: imirkin_: _but_
15:29 karolherbst: before the 0xd query, there are two other queries without anything in between
15:30 karolherbst: as... our way of submitting fences isn't really optimal
15:30 karolherbst: but..
15:30 imirkin_: so the 0x10 bit, iirc, means that it should wait for all other writes to complete first
15:30 karolherbst: maybe there aren't enough query slots or something dumb like that?
15:30 karolherbst: imirkin_: yeah well.. I wait inside gdb
15:30 karolherbst: and if I wait 10 seconds it still doesn't finish
15:30 imirkin_: i don't have a perfect mental model of this
15:30 karolherbst: and I don't think anything in between takes that much time
15:30 karolherbst: and again..
15:31 karolherbst: if I put a usleep before submit it helps
15:31 imirkin_: this print is from a gem pushbuf submit, right?
15:31 karolherbst: yeah
15:31 imirkin_: i.e. you're definitely sending this to the kernel
15:31 imirkin_: ok
15:31 karolherbst: in userspace though
15:31 imirkin_: sure.
15:31 karolherbst: but the kernel doesn't return any error
15:31 karolherbst: but yeah
15:31 karolherbst: I checked with gdb
15:31 karolherbst: I had bugs in the past where it actually didn't send.. but that was because of an empty pushbuffer :D
15:32 karolherbst: but again.. why would a usleep change any of this
15:32 imirkin_: what is the [4] and [3] stuff? the subchan?
15:32 karolherbst: yeah
15:33 imirkin_: i think mwk has the clearest picture of how all this stuff works
15:35 karolherbst: I will probably dig a bit deeper into this... I think there are several contexts involved, but single threaded... but so far nothing really stands out. It works without my changes, but that doesn't explain why
15:39 karolherbst: imirkin_: btw.. this bufctx handling of nv30 is annoying.. any reason why that exists?
15:40 imirkin_: skeggsb_ wrote it.
15:40 karolherbst: probably never updated it to use the libdrm thing
15:40 imirkin_: i did fix it up at some point
15:40 imirkin_: since it was actively broken
15:40 imirkin_: whereas now it's just weird :)
15:40 karolherbst: :D
15:40 karolherbst: ohh btw.. I found one weird thing in nv50
15:40 karolherbst: inside vbo_kick_notify there is this "nv50_bufctx_fence(nv50, nv50->bufctx_3d, true)" call
15:40 karolherbst: and ...
15:40 karolherbst: removing that doesn't change anything
15:40 karolherbst: afaik
15:41 karolherbst: any idea why it's in there?
15:41 imirkin_: maybe not in simple tests
15:41 karolherbst: maybe...
15:41 karolherbst: but having it in there makes it... annoying for me
15:41 imirkin_: ok so
15:41 karolherbst: all I can tell is, that it was always there
15:41 karolherbst: and nvc0 doesn't do it
15:41 imirkin_: hmmm you sure?
15:41 karolherbst: yeah
15:41 imirkin_: nvc0 calls some nv50 functions
15:42 imirkin_: anyways
15:42 karolherbst: but not nv50_bufctx_fence inside vbo_kick_notify
15:42 imirkin_: the vbo kick notify is triggered when you're in the middle of a draw
15:42 karolherbst: ehh
15:42 karolherbst: sure
15:42 karolherbst: but nvc0 only does the fencing
15:42 imirkin_: you know, i'd have to look at the code again
15:42 imirkin_: and i can't right now
15:42 imirkin_: i don't remember it quite as perfectly as i'd like
15:43 karolherbst: nouveau_fence_update vs nouveau_fence_update + nv50_bufctx_fence
15:43 imirkin_: but the idea is to avoid doing too much fencing in the middle of drawing
15:43 imirkin_: i think.
15:43 karolherbst: yeah..
15:43 karolherbst: it also breaks things doing the normal thing :D
15:44 karolherbst: so what nv50_bufctx_fence is doing is to call nv50_resource_validate on the stuff inside the bufctx
15:44 karolherbst: and nv50_resource_validate is only calling nouveau_fence_ref
15:44 karolherbst: and updating the res->status thing
15:45 imirkin_: as you've likely discovered
15:45 imirkin_: there's a lot of hacks around resource state tracking
15:45 karolherbst: yeah....
15:45 imirkin_: (a) buffers have a ->fence and ->fence_wr
15:45 imirkin_: which links to the screen-fence-du-jour
15:46 imirkin_: (b) textures have the ->status which is occasionally updated, with half-hearted attempts to add synchronizes
15:46 karolherbst: right
15:46 imirkin_: (to avoid read vs write hazards)
15:47 imirkin_: i never quite fully grok'd that
15:47 imirkin_: and likely added problems
15:47 imirkin_: when enabling "fancy" features
15:47 imirkin_: like images/etc
15:49 karolherbst: yeah...
15:49 karolherbst: it's a bit annoying that we have no good way of making sure there aren't any regressions
15:49 karolherbst: I mean.. I can add this bufctx call back, then I just need to change a few things around
15:49 karolherbst: I was just curious if it really has to be there or it's just there because it always was
15:51 imirkin_: i'm honestly unsure.
15:51 imirkin_: in general
15:52 imirkin_: i like to make nv50 look like nvc0 when possible
15:52 imirkin_: esp around these subtle types of issues
15:52 Svanto: Hello hello, was there a major update recently? I updated my linux install recently and now the output to my external monitor is freezing up
15:52 Svanto: Which is the monitor connected to my nouveau card
15:52 imirkin_: Svanto: sounds like there was. what did you update?
15:53 Svanto: I can't figure out whether it's a nouveau thing since sway recently had an update too
15:53 imirkin_: Svanto: lots of reported issues with sway
15:53 imirkin_: it tries to use modifiers and fails
15:53 Santurysim: Hello, Svanto, which is your card?
15:53 imirkin_: you have to tell it not to use modifiers.
15:54 Svanto: I do have the WLR_NO_MODIFIERS = 1 envvar set up, that's the thing
15:54 imirkin_: yeah, that is
15:54 imirkin_: at least that sounds right
15:54 Svanto: Santurysim: hello! GeForce GTX 970
15:55 imirkin_: Santurysim: WLR_DRM_NO_MODIFIERS=1
15:55 imirkin_: er
15:55 imirkin_: Svanto: --^
15:56 Svanto: Yeah, that's it
15:56 imirkin_: well, you should hit up the sway guys ... dunno where their irc chan moved to
15:56 Svanto: I fixed that before when sway 1.6 came out and there wasn't really an issue, this is new
15:57 imirkin_: ok
15:57 Svanto: imirkin_: understood, it's helpful to know it's a sway problem then. Thanks!
15:57 imirkin_: well, we didn't add new logic to explicitly freeze monitors :)
15:58 imirkin_: it's not necessarily a sway problem
15:58 imirkin_: but if you want help, we need more info than "i updated a bunch of stuff and now it no go"
16:18 Svanto: Sorry, seems I crashed
16:19 Svanto: imirkin_: True, sorry. I was wondering that maybe it was an issue that came up so I'd ask here, but now it seems like it's on Sway's field. They recently added a libseat dependency so idk if that has anything to do with it, although I would be surprised.
16:21 karolherbst: Svanto: normally it's easier to just try with an older mesa/kernel/sway and see what fixes it
16:38 Svanto: Understood thank you!
18:06 Svanto: Okay, I'm convinced it's a sway issue. When I move a window, like a video playing, it works just fine, but as soon as I move the cursor or switch workspaces the video freezes (but audio still works)
18:06 Svanto: I can still use my keyboard shortcuts to move the video back to the internal monitor
18:06 Svanto: So no idea what gives...
18:07 Svanto: Man, what did they do this time
19:53 Svanto: all right, it was a wlroots issue that's scheduled to be fixed in 14.1.
19:54 imirkin_: got a link?
19:56 Svanto: Right! Here imirkin_ https://github.com/swaywm/wlroots/issues/2991
19:57 imirkin_: thanks
19:57 imirkin_: oh
19:57 imirkin_: it's not resolved or scheduled to be fixed.
19:58 imirkin_: i thought it had already been fixed based on what you said
19:58 imirkin_: emersion: do you know what's up with it?
19:58 Svanto: I'm sorry, I'm not a native speaker so my wording is off sometimes.
19:58 imirkin_: Svanto: no worries!
19:59 imirkin_: almost none of us are native speakers
19:59 imirkin_: well - we're all native speakers. just not of english :)
20:01 imirkin_: emersion: oh ... you're trying to be clever about rendering the cursor or something, and it ends up getting double-pinned?
20:02 imirkin_: emersion: you need to make at least one cursor bo per drm fd.
20:10 imirkin_: emersion: or even better, don't try to be clever with cursors :)
20:11 imirkin_: not worth the aggravation
20:38 ccr: manage the aggro, throw more dots
20:40 emersion: yeah, multi-gpu is busted on wlroots 0.14.0, need to investigate
20:40 emersion: we're supposed to do one cursor BO per DRM FD, yes
20:41 emersion: oh, but that's not even multi-gpu
20:42 emersion: yeah, this is a breakage i don't really understand
20:43 emersion: maybe it's mad that we're doing trigger-happy dmabuf imports/exports
20:53 imirkin_: emersion: it works great on UMA systems :)
20:53 imirkin_: but unfortunately you can't just treat these things like magic
20:53 emersion: lol
20:53 emersion: well
20:53 imirkin_: the memory has to sit somewhere. and for things which are scanned out, they have to be in VRAM
20:53 imirkin_: but for things to be exported, they have to be in GART
20:53 imirkin_: so ... not a great fit.
20:54 emersion: ok. but import/export to same device shouldn't move the BO
20:54 imirkin_: (until p2p dma takes over the world)
20:54 imirkin_: hmmm
20:54 imirkin_: yeah, if it's always intra-device, that's fine
20:54 imirkin_: but based on the error message, that might not be happening
20:54 emersion: the "bo pinned elsewhere" happens when a single GPU is used
20:54 imirkin_: or we see the export and immediately move to gart? dunno
20:55 emersion: (ie, when the machine has only a single GPU)
20:55 emersion: i guess i'll need to read some kernel code and check
20:55 imirkin_: kernel code won't necessarily show the problem
20:55 imirkin_: the kernel code is pretty simple
20:56 imirkin_: when you use a bo for a cursor, we pin that bo to vram
20:56 emersion: oh, so user-space is in charge of moving the BOs around?
20:56 imirkin_: mmmmmm
20:56 imirkin_: not COMPLETELY
20:56 imirkin_: kernel will move the bo's from a -> b
20:56 imirkin_: but on the instruction of userspace
20:56 imirkin_: and in some cases, the kernel will do its own moving / pinning
20:57 imirkin_: like for scanout things
20:57 imirkin_: but then if userspace decides to request that buffer move back to gart, then it won't be happy
20:57 imirkin_: also ... not sure what happens if userspace has it in gart, is rendering to/from it, and simultaneously that buffer is requested to move
20:57 imirkin_: probably nothing good.
20:58 imirkin_: with a bit of luck, we pin things behind fences
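The VRAM-versus-GART conflict discussed above can be modelled with a toy pin counter. This is not the kernel's code: `toy_bo`, `toy_bo_pin` and the enums are hypothetical, and the only rule encoded is the one stated in the conversation, namely that a buffer pinned in one place cannot be pinned in another at the same time.

```c
#include <assert.h>

enum domain { DOMAIN_NONE, DOMAIN_VRAM, DOMAIN_GART };
enum pin_result { PIN_OK, PIN_ELSEWHERE };

/* Toy buffer object: where it is pinned and how many users pinned it. */
struct toy_bo {
    enum domain pinned_in;
    unsigned pin_count;
};

/* Pin `bo` into `want`. A second pin succeeds only for the same domain,
 * because a pinned buffer cannot be moved. */
static enum pin_result toy_bo_pin(struct toy_bo *bo, enum domain want)
{
    if (bo->pin_count > 0 && bo->pinned_in != want)
        return PIN_ELSEWHERE;
    bo->pinned_in = want;
    bo->pin_count++;
    return PIN_OK;
}

/* Drop one pin; the buffer becomes movable again when the count hits 0. */
static void toy_bo_unpin(struct toy_bo *bo)
{
    if (bo->pin_count > 0 && --bo->pin_count == 0)
        bo->pinned_in = DOMAIN_NONE;
}
```

In this model, a cursor buffer pinned to VRAM for scanout rejects a later request to pin it in GART until every VRAM pin has been dropped, which is the shape of a "bo pinned elsewhere" failure.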
21:00 emersion: nouveau_bo_prime_handle_ref doesn't seem too harmful
21:01 imirkin_: https://www.youtube.com/watch?v=8lYBVi3ALWA
21:01 emersion: lol
21:01 emersion: on-point