00:08 HdkR: Was hoping Wayland would solve my Stepmania frame pacing problems. Better, but still not perfect
00:20 bnieuwenhuizen: HdkR: I thought the kernel had quirks for HMDs?
00:22 HdkR: bnieuwenhuizen: That's what I thought as well
00:22 imirkin: it just sets a property
00:22 bnieuwenhuizen: maybe the quirks don't get picked up by your compositor then :)
00:22 imirkin: up to the kms client to do something about it
00:24 HdkR: Maybe I'll poke at the Sway sources and see if there is an easy miss for the Index. For now I've just unplugged it
00:24 HdkR: or...wlroots?
00:25 bnieuwenhuizen: fix it in wlroots if possible, fixes more compositors :)
00:36 HdkR: ick, disabled vsync in the game settings and the pacing got better. Wonder if this panel isn't exactly 60Hz
00:36 HdkR: 59.98...ugh okay
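The 59.98 Hz reading explains the pacing trouble: content rendered at a fixed 60 Hz drifts against a 59.98 Hz panel until a frame has to be duplicated or dropped. A quick sketch of the arithmetic (the rates are from the log above; the function name is mine):

```python
# Drift between a fixed-rate source and a slightly-off panel:
# the beat period is the time until one full frame of misalignment
# has built up, at which point a frame must be dropped or duplicated.

def seconds_per_slipped_frame(content_hz: float, panel_hz: float) -> float:
    """Time until the source and the panel are one whole frame apart."""
    return 1.0 / abs(content_hz - panel_hz)

# 60 Hz content on a 59.98 Hz panel: one visible stutter roughly every 50 s
drift = seconds_per_slipped_frame(60.0, 59.98)
```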
00:38 imirkin: HdkR: there's a property called 'non_desktop' which is meant to be set to 1 for detected HMDs
00:38 imirkin: there's a list of the known ones in drm_edid.c
00:38 HdkR: I'd guess the Valve Index is in there :P
00:39 imirkin: HdkR: a bunch of them are
00:39 imirkin: but there's not just one it seems
00:39 imirkin: HdkR: https://cgit.freedesktop.org/drm/drm/tree/drivers/gpu/drm/drm_edid.c#n172
00:41 imirkin: you can check your edid with the edid-decode tool, or online with https://people.freedesktop.org/~imirkin/edid-decode/
00:41 HdkR: that's a bunch
01:10 marex: so speaking of neverball again, did anyone ever notice problems with how it does the reflection on the logo surface in the main menu?
01:11 marex: there is something with Face + Stencil, which seems to fail in my case, did it fail for anyone on any other GPU too before?
01:11 marex: (face+stencil interaction on etnaviv is probably broken too :) )
01:53 xyene: if I have a texture sourced from a dmabuf and wish to realize it immediately, is creating an FBO from it and glCopyTexImage2D to a second texture (and waiting for a fence) the ~most efficient way to do this?
02:02 imirkin: what do you mean by 'realize'?
02:02 imirkin: i.e. why can't you use the dma-buf you got directly
02:04 xyene: the dmabuf exporter (which I don't have much control over) "leased" the dmabuf to my process for 10ms, but I'd like to be able to draw the same texture for longer than 10ms
02:05 xyene: but after 10ms the exporter will reuse the memory pointed to by that dmabuf, making my texture "garbage"
02:31 imirkin: i see.
02:44 marex: xyene: is that one of those video pipelines where buffers are written to specific DRAM addresses and then overwritten shortly after ?
02:45 marex: (by some new frame)
02:49 xyene: marex: yes, basically exactly that
02:51 marex: :(
02:52 marex: xyene: this is some embedded system, isn't it?
02:53 marex: xyene: I guess you can't just load a list of DMABUFs into that hardware, and then pull them out one after the other and use them as GPU texture, while the ones you processed could be queued back to the hardware ?
02:58 xyene: it's a desktop system with some "interesting" constraints: a kernel module (that I control) presents an mmap-able file under /dev/ and can hand out dmabufs to it; QEMU is instructed to use that file as shmem; a Windows guest uses the IVSHMEM driver to see that file as regular RAM; a Windows daemon uses Windows capture APIs pointed at that shared buffer; and finally a Linux-side app receives dmabuf fds from the kernel module and uses them to draw on a surface
02:59 xyene: if the Linux-side is vsync'd to 60Hz and the guest isn't, it can submit 1000+ frames per second, and the amount of outstanding frames is limited by the size of that shmem buffer
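The situation xyene describes can be modeled as a bounded ring: however fast the guest produces, only as many frames as the shmem has slots can be outstanding, and older frames get overwritten. A toy sketch (the slot count and the deque-based model are mine, not the actual implementation):

```python
from collections import deque

# A frame queue backed by a fixed number of shmem slots: when the
# producer outruns the consumer, the oldest frame is overwritten.

class FrameRing:
    def __init__(self, slots: int):
        self.ring = deque(maxlen=slots)  # a full ring discards its oldest entry

    def produce(self, frame_id: int) -> None:
        self.ring.append(frame_id)

    def consume(self):
        return self.ring.popleft() if self.ring else None

ring = FrameRing(slots=4)
for frame in range(1000):   # unthrottled producer, e.g. a 1000+ fps guest
    ring.produce(frame)
# outstanding frames are bounded by the slot count, not the frame rate
```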
03:03 marex: (wouldn't it be easier to patch qemu itself with maybe some custom device model (?) to handle all this without the kernel driver?)
03:04 kisak: anholt, dcbaker: I think that the Fixes line on https://gitlab.freedesktop.org/mesa/mesa/-/commit/58e43594fc457eaaf1b1e01e48948959a82080bc is wrong. It's pointing to the 20.2 backport instead of 634384e4a09d897e0f045e6e99b787804ef4fe40
03:04 marex: xyene: I might be wrong , but I would expect that if a dmabuf is in use by userspace, kernel should not be able to overwrite its content
03:05 kisak: dcbaker: is that something the maintainer scripts handle?
03:07 dcbaker: kisak: i don't see a fixes tag in there. All the scripts do for fixes is ask if the commit in the fixes line is in the branch at all. So if the commit isn't in master it won't get picked up
03:08 kisak: line 5 of the commit message
03:08 kisak: (gitlab hides a line making it 4)
03:08 xyene: marex: hmm, not sure I follow regarding the device model bit. regarding dmabuf overwriting, empirically it does seem like the guest can overwrite the memory the dmabuf fd points to while the fd is still open (had to explicitly implement synchronization logic around that)
03:11 kisak: in any case, it's confirmed in #3990 that 20.3 is affected by the issue that commit fixes and safe to assume 21.0 as well
03:15 HdkR: xyene: You don't happen to work on Looking Glass do you? :D
03:17 xyene: HdkR: haha, yes, actually :P
03:18 HdkR: woo. I haven't used it but every time I look at it I think, "Damn, that's impressive."
03:19 Lightkey: They made such good games.
03:20 HdkR: I'm probably the exact demographic for the project as well actually. Running Linux with a Windows VM for gaming
03:20 xyene: HdkR: it eats voraciously through memory bandwidth, but it's fun when it works :)
03:20 HdkR: haha, I bet
03:22 HdkR: Quad or Eight channel memory + PCIe 4 must be a huge boon to it
03:26 xyene: for sure... it works well enough for 4K@60 using GTX 10-series-era cards with dual-channel RAM / PCIe 3, but my personal usecase is also... running Excel so the guest GPU isn't exactly taxing the bandwidth like a game would :)
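For scale, the 4K@60 case works out to roughly 2 GB/s per copy of the stream, before counting the multiple hops it takes (guest write, shmem, host texture upload). A back-of-envelope helper (RGBA framing and the helper name are my assumptions):

```python
# Raw bandwidth of one copy of an uncompressed video stream.

def stream_rate_gbps(width: int, height: int, bytes_per_px: int, fps: int) -> float:
    """GB/s (decimal) needed to move every frame exactly once."""
    return width * height * bytes_per_px * fps / 1e9

# ~1.99 GB/s for 4K RGBA at 60 fps -- and Looking Glass-style setups
# pay this several times over per displayed frame.
rate = stream_rate_gbps(3840, 2160, 4, 60)
```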
03:26 HdkR: hehe
03:27 HdkR: I should throw my 8k panel at it sometime
08:30 pq: imirkin, I suppose you could put depth surfaces into dmabuf, if you have a DRM format/modifier to match.
08:37 pq: xyene, dmabufs may have implicit fences, but in complex use cases they are often not enough. You need explicit fences communicated over the same protocol you use to communicate the dmabufs. Ideally your producer will never re-use a buffer until the consumer signals it free for re-use.
08:38 pq: xyene, as for the producer running unthrottled, when the consumer gets a new frame and it hasn't used the previous frame yet either, it can release the buffer of the previous frame for re-use immediately.
08:39 pq: xyene, this way you'd never need to copy in the consumer, but it can just keep a hold of the buffer as long as it needs.
08:41 pq: marex, xyene, fences are how producer and consumer are synchronized. If someone does not honour or forward the fences, nothing much stops it from stomping on the buffer content.
08:42 pq: ideally buffers are allocated and shared once, then re-used many times
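pq's protocol can be sketched as a small buffer pool: the producer only ever takes buffers the consumer has released, and the consumer releases the previous frame the moment a newer one arrives (mailbox-style), so it never needs to copy. All names here are illustrative, not any real API:

```python
# Explicit buffer-release protocol: the producer never reuses a buffer
# the consumer still holds, and the consumer holds at most one frame.

class BufferPool:
    def __init__(self, n: int):
        self.free = list(range(n))    # buffer ids the producer may write
        self.held = None              # buffer the consumer is displaying

    def producer_acquire(self):
        # never touch a buffer the consumer hasn't released yet
        return self.free.pop(0) if self.free else None

    def consumer_present(self, buf: int) -> None:
        # mailbox behaviour: a newer frame releases the old one immediately
        if self.held is not None:
            self.free.append(self.held)
        self.held = buf

pool = BufferPool(3)
a = pool.producer_acquire()
pool.consumer_present(a)
b = pool.producer_acquire()
pool.consumer_present(b)      # releases a back to the pool for re-use
```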
13:26 KevinTang: dim: No kernel git checkout found in 'src'.
13:27 KevinTang: I have configured $HOME/.dimrc, but I don't know why this problem still occurs...
15:27 imirkin: pq: the question came up as i was reviewing a nouveau modifier patch, which allows linear modifier on all surfaces. but it's not legal for depth, so seems like we should guard against it. unless it's moot for dma-buf in the first place.
17:16 robclark: mareko: was looking a bit at threaded ctx to see what would be needed on driver side.. one awkward corner case is image/sampler views that are created with a format that doesn't support compression, but with a pipe_resource which was originally created with a format that does support compression.. currently we demote to uncompressed (ie. a blit) when the CSO is created, which would no longer work w/ threaded ctx..
17:16 robclark: but I'm wondering if there is some way the driver could queue up its own internal work somehow to run in driver thread? That seems like it would be a convenient way to handle this (ie. when CSO is created, it queues up whatever internal work before the CSO has a chance to be bound in driver thread)
17:51 karolherbst: imirkin: well.. we have this issue on tegra already, where you kind of have to enforce full modifier awareness across the full stack, otherwise it just breaks.
18:08 imirkin: karolherbst: right ... my question was specifically about whether depth textures were in-scope for dma-buf
18:09 karolherbst: ohh, right
18:10 robclark: do we have fourcc's defined for z/s?
18:41 imirkin: zmike: any reason you only added scissored clears for color and not depth/stencil?
18:41 zmike: imirkin: huh?
18:41 imirkin: zmike: commit 1c8bcad81a7c
18:41 imirkin: that's you, right?
18:42 imirkin: you only set have_scissor_buffers for color attachments
18:42 zmike: huh
18:42 zmike: dunno
18:43 zmike: maybe lack of test cases
18:43 zmike: or just an oversight
18:43 imirkin: ok. was just checking it wasn't some like "oh no, vulkan is weird" thing
18:43 zmike: that predated my involvement with zink by 2-3 months
18:43 imirkin: o
18:44 imirkin: what was the target? iris?
18:44 zmike: probably?
18:44 imirkin: ah yeah. looks like it -- 328cc00d3980
18:45 imirkin: seems like you made it work for depth on the iris side
18:46 imirkin: Kayden: can you think of any funny issues with iris + scissored depth/stencil clears?
20:33 karolherbst: dcbaker: soo, generated stuff and creating rlib to include those as extern crates does work... just need to figure out how to declare nice names and such :D
20:50 mareko: robclark: we do that by uncompressing in set_sampler_views
20:50 mareko: robclark: create_sampler_view only sets a flag whether decompression should happen, but doesn't touch the context
20:52 robclark: mareko: I guess that would be a fallback.. although for us the create_cso fxns are gen specific but the bind fxns are not.. so handling it in create would be cleaner.. would you be open to me adding something that create_cso could use to enqueue a driver callback?
20:53 mareko: robclark: you could still add a driver-specific callback to the sampler view to invoke it in set_sampler_views
20:53 imirkin: robclark: don't you have to do it at set time anyways? could be any number of things that happen between the create and set
20:54 robclark: demotion to uncompressed is a one-way street
20:54 imirkin: ah ok
20:55 mareko: it's not with shared resources
20:55 robclark: it is basically something that exists for deqp.. not really something that actually happens in the "real world"
20:55 imirkin: like 90% of our drivers...
20:56 bnieuwenhuizen: imirkin: for radeonsi we hit it with some of the UE4 demos at least
20:56 robclark: it's the same situation with shared resources.. once you use it once in an "I can't be UBWC" way, it is forever non-UBWC
20:56 mareko: robclark: I'm not opposed to the idea of enqueueing a driver specific callback per se, the question is whether we can avoid it
20:56 robclark: we could probably avoid it by handling in bind-CSO path.. but generally I assume things are bound more frequently than they are created so handling it at create time would be nice
20:57 robclark: probably should add some perf_debug() logging for that path.. but UBWC is supported widely enough that I think real games/demos shouldn't hit that path
20:59 mareko: wouldn't it be nice if our compression supported arbitrary format reinterpretations
21:00 robclark: indeed
21:00 mareko: and image stores
21:00 robclark: it is the thing that makes copy_image a huge pita
21:00 robclark: oddly, a6xx seems to be quite happy with UBWC + image stores
21:01 mareko: if you benchmarked it, you might be unpleasantly surprised at the performance
21:02 mareko: we have compressed image stores in hw but we don't use them. why? because we know better
21:02 robclark: hmm, possibly; IIRC the blob didn't use it..
21:05 bnieuwenhuizen: the other thing is render feedback loops
21:05 bnieuwenhuizen: almost-loops can sometimes be legal in GL but not in HW :(
21:05 robclark: you mean bouncing btwn compressed/uncompressed?
21:05 mareko: the problem with image stores is that when a simd doesn't overwrite the whole compressed block, the hw has to read the block from memory, decompress, update, recompress, store
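mareko's read-modify-write point can be made concrete with a coverage check: any store region that isn't aligned to whole compression blocks leaves some block partially written, forcing the read -> decompress -> update -> recompress -> store round trip. The 16x4 block size below is an arbitrary example, not any GPU's real layout:

```python
# Does a store region cover only whole compression blocks?
# If not, at least one block is partially written and the hardware
# must do a decompress/recompress round trip for it.

BLOCK_W, BLOCK_H = 16, 4   # made-up block geometry for illustration

def store_needs_rmw(x: int, y: int, w: int, h: int) -> bool:
    """True if the region (x, y, w, h) isn't block-aligned."""
    return bool(x % BLOCK_W or y % BLOCK_H or
                w % BLOCK_W or h % BLOCK_H)

# a 64x16 store at the origin touches only whole blocks: direct write
aligned = store_needs_rmw(0, 0, 64, 16)
# a 15-pixel-wide store straddles a block: decompress/recompress needed
partial = store_needs_rmw(0, 0, 15, 4)
```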
21:06 bnieuwenhuizen: stuff like sampling from the same pixel as you're rendering to
21:07 robclark: we support that, just to a limited extent, for that blend extension.. but it is limited because sampling isn't cache coherent with writing too..
21:25 Kayden: interesting, intel recently gained compressed image stores as well, but I never did benchmark the performance
21:27 robclark: I am kinda wondering.. as long as you aren't thrashing L1 cache, are stores to compressed images really worse than a decompress -> compute -> recompress cycle
21:27 bnieuwenhuizen: well, if you know, there may be ways to not have to do that cycle
21:28 bnieuwenhuizen: like decompress more permanently
21:28 bnieuwenhuizen: also the outcome of my benchmarks from AMD was that it was fairly dependent on the pattern of the shader: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6796#note_643299
21:28 robclark: I'm kinda seeing CS mostly used for post-proc effects.. where we'd like to have things re-compressed afterwards for scanout..
21:29 robclark: (and presumably shaders are writing to at least most of a compression block)
21:29 bnieuwenhuizen: in general I think on AMD compression for write once + read once may not be a win if there is a decompress/recompress step in between
21:30 bnieuwenhuizen: and it may be slower if you hit the cache most of the time in your workload
21:31 bnieuwenhuizen: also the problem with "at least most of a compression block" is that as soon as it is not whole you might need to do the compression roundtrip in HW
21:31 robclark: for us, my understanding is that things are uncompressed in L1
21:32 bnieuwenhuizen: here it is uncompressed in L1 but write-through ... so each one ends up in L2 separately AFAICT
21:32 robclark: but honestly, we have some low hanging fruit around compute shaders.. it hasn't been where most of our time goes in android app workloads.. which may ofc be somewhat different from steam store workloads
21:33 bnieuwenhuizen: (another thing learned from the benchmarks is to try to split 16x16 workgroups into 4x 8x8 instead of 4x 16x4 subgroups which performs much better with compression)
21:33 bnieuwenhuizen: which may need some id mangling in the shader
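bnieuwenhuizen's workgroup-shape point follows from the same block-coverage logic: with (say) 8x8 compression blocks, an 8x8 subgroup fully covers one block, while a 16x4 subgroup half-covers two blocks, so neither block can be written without the decompress round trip. The 8x8 block size is an assumed example, not AMD's actual DCC layout:

```python
import math

# How many compression blocks does a subgroup's tile touch,
# and what fraction of each block does it actually cover?

BLOCK = 8   # assumed square compression block, for illustration only

def blocks_touched(w: int, h: int):
    """(number of blocks touched, average fraction of each block covered)."""
    nx = math.ceil(w / BLOCK)
    ny = math.ceil(h / BLOCK)
    coverage = (w * h) / (nx * ny * BLOCK * BLOCK)
    return nx * ny, coverage

square = blocks_touched(8, 8)    # one block, fully covered: clean write
skinny = blocks_touched(16, 4)   # two blocks, each only half written
```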