00:35anholt: naturally, upon touching num_ssbos again, everything goes badly. shelving that branch for the moment, !9070 up for getting more asan going
01:25airlied: wow my asan deqp vk has been running single threaded for 5 days
01:25airlied: (lavapipe built with asan)
01:26airlied: might kill it, really want to use that machine for something else :-P
01:28zmike: how's it doing?
01:37airlied: zmike: well asan kills the process when it has a problem, so the fact it's still running seems good
01:37bnieuwenhuizen: depends on how far you got :)
01:37zmike: hm dunno about that, I left a cts run going for a couple days one time, finally checked and it had deadlocked itself :/
01:37bnieuwenhuizen: though you're applying a level of asan that we can only dream of with a HW driver :P
01:38airlied: it's somewhere in pipeline.sampler.view_type
01:39airlied: not too sure how asan goes inside the llvm jit code, though I finally got perf hooked up so I can see which asm instructions in the jit code are just waiting for memory accesses
01:42zmike: nice!
01:45HdkR: Huh, I've never looked in to how ASAN interacts with LLVM's JITs. I would assume that you lose ASAN inside the JIT though
01:47airlied: HdkR: yes that would be my assumption too
01:56HdkR: airlied: Looks like at the /very/ least you need to put the `sanitize_address` function attribute on each function
01:56HdkR: Which modifies load and store IR ops
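A rough sketch of what that could look like through the LLVM C API, assuming the JIT consumer has access to the generated functions (whether lavapipe/gallivm actually wires this up is not established here, and the module would still need to go through the AddressSanitizer passes for the attribute to have any effect):

    #include <llvm-c/Core.h>
    #include <string.h>

    /* Request ASan instrumentation of loads/stores in a JIT-generated
     * function by tagging it with the sanitize_address attribute. */
    static void
    mark_function_for_asan(LLVMContextRef ctx, LLVMValueRef fn)
    {
       const char *name = "sanitize_address";
       unsigned kind = LLVMGetEnumAttributeKindForName(name, strlen(name));
       LLVMAttributeRef attr = LLVMCreateEnumAttribute(ctx, kind, 0);
       LLVMAddAttributeAtIndex(fn, LLVMAttributeFunctionIndex, attr);
    }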
02:10sunshavi: I am compiling mesa-git on an H3 SBC. And I am getting https://termbin.com/1v4m. Any idea how to fix?
02:48airlied: sunshavi: if you don't want clover, build with -Dgallium-opencl=disabled
02:48airlied: otherwise get a clang devel package installed
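For reference, disabling clover when configuring mesa looks something like this (assuming a fresh meson build directory):

    meson setup build/ -Dgallium-opencl=disabled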
03:41imirkin: zmike: is there anything missing for ES 3.2 for zink?
03:42zmike: imirkin: !9060
03:43zmike: that's it
03:43imirkin: ah ok. i forgot that advanced blend was in there
03:43zmike: the vk ext is only implemented by nvidia atm so it's not going to be super helpful to most people
03:43zmike: but it still counts!
03:44imirkin: it's a weird ext.
03:46HdkR: The photoshop extension :)
03:47zmike: haha
03:47imirkin: it's esp weird that it was included as a required feature for ES 3.2
03:47imirkin: i guess it's because it was part of AEP, but that's weird too.
03:48HdkR: Makes sense for mobiles because implementing the blend modes is easy enough
03:48imirkin: yeah, but why is it a good idea
03:48HdkR: That's more about questioning Google requirements at that point :)
03:48imirkin: who wants those ultra-specific blend modes
03:48imirkin: heh
03:49HdkR: Maybe there was some grand plan of getting Photoshop on ChromeOS
03:49imirkin: and photoshop somehow NEEDED this to be done as a blend mode?
03:51HdkR: It's just a perf thing for them right?
03:51imirkin: sure. but getting that to be required in ES?
03:52HdkR: That's more questionable. But I try not to question Google's choices these days
03:52imirkin: trying to avoid the worry lines?
03:52HdkR: Pretty much
03:54HdkR: But you get fbfetch and hardware blending support for it near everywhere so it's more :shruggie:
04:01HdkR: I think it is only AMD that doesn't have a solution for it?
04:16HdkR: (I guess maybe Vivante and BCM could get screwed with it as well)
04:31anholt: broadcom can fb fetch.
04:32HdkR: Nice
04:32anholt: vc4 only had fbfetch for implementing blending
07:37mareko: anholt: radeonsi uses num_ssbos as the highest used slot + 1, if it's incorrect, the driver will blow up
07:46mareko: anholt: st/mesa packs SSBO bindings, so there should be no holes
07:48mareko: it's safe to assume for GL that there are no unused holes between slots
08:05sailus: danvet: The vsprintf patch goes on top of rc1 once we have that, right?
08:37danvet: sailus, yeah, since it's for 5.13 anyway let's wait for -rc1 with the topic branch
08:38danvet: assuming that was the question
08:38tzimmermann: sailus, see my response
08:39tzimmermann: danvet, i'm worried about the removal of drm_get_format_name()
08:39tzimmermann: might break one of the trees
08:41elmarco: hi! does anyone know how eglExportDMABUF images are synchronized between processes?
08:41elmarco: it seems this happens somehow lazily / in idle
08:42cwabbott: anholt: i think it's supposed to mean the total number in the linker, and the max binding point + 1 for everything after the linker
08:43cwabbott: at least, it's used that way in iris, freedreno, etc.
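A minimal sketch of what that packing guarantee lets a driver assume (emit_ssbo_descriptor is a hypothetical helper, not a real Mesa function):

    /* st/mesa packs SSBO bindings, so [0, num_ssbos) is a dense range:
     * num_ssbos is "highest used slot + 1" with no holes in between. */
    for (unsigned i = 0; i < shader->info.num_ssbos; i++)
       emit_ssbo_descriptor(ctx, shader, i);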
08:46elmarco: ah, nevermind, qemu is missing a glFlush when sharing the rendering...
08:47danvet: tzimmermann, yeah I think we probably need two-stage for this
08:47danvet: like merge all the removals right after topic branch lands
08:47danvet: wait 2-3 weeks or so for trees to sync up
08:48danvet: then another round including the removal patch
08:48danvet: otherwise too much conflict potential
08:48danvet: I don't think we have to stage this out over more than one release cycle if the topic branch happens right after -rc1
08:49danvet: plus it's not a tricky conflict, so even if we screw up and have to resolve it in a merge it should be fine
08:51danvet: airlied, not sure you replied, but I'm going to do the kcmp topic branch now
09:11danvet: robertfoss_,
09:11danvet: lol still can't irc
09:12danvet: airlied, linux-next: build warning after merge of the pm tree
09:12danvet: we need to be early I guess :-)
09:53sailus: danvet: Petr and Mauro are both fine merging this through the drm-misc-next so I guess we won't need a topic branch.
10:01danvet: wfm
10:58tzimmermann: sailus, danvet, i just saw v4. i'll take care of merging later today or tomorrow
11:10airlied: danvet: i think linus has power/internet issues so the merge window is delayed a bit
11:18sailus: tzimmermann: Thanks!
11:34danvet: airlied, well just prepping it really
11:38tzimmermann: sailus, i'll wait with merging until the recent comment has been addressed
12:02sailus: tzimmermann: You mean Andy's comments? I can send v9, or address them later, either works.
12:30tzimmermann: ok
16:49jenatali: daniels: You're probably the right person to review/ack this Windows CI change: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9018
16:49jenatali: It's blocking another MR which wants to bump the Piglit version we use
17:15emersion: do i need to eglFlush after eglCreateSyncKHR?
17:15emersion: or is eglCreateSyncKHR enough?
17:29daniels: emersion: if you're trying to extract a dmabuf fd, iirc you need to create the sync point, flush, then dup
17:30emersion: daniels: hm, no, i'm trying to glReadPixels asynchronously
17:30emersion: so i'm doing my glReadPixels calls, then create the sync point
17:30emersion: then i want to wait on the sync point, but do i need to glFlush before that?
17:32daniels: oh, you're doing PBOs
17:34emersion: oh, yes
17:34bnieuwenhuizen: probably better to use the GL thing rather than egl?
17:34imirkin: well, presumably the buffer is passed somewhere else which can't?
17:35emersion: but i need a sync fence FD, so that i can poll() it
17:35imirkin: otherwise, just mapping the buffer in GL will wait
17:35bnieuwenhuizen: imirkin: you could use it in combination with a coherent mapping :)
17:35emersion: i don't want my process to block while the DMA transfer is in progress
17:37imirkin: emersion: i think you have to glFlush then
17:37emersion: ok, thanks!
17:37imirkin: since that read pixels could be sitting in some buffer, essentially, in the GL
17:37imirkin: if the GL has no reason to execute it, it might not (or at least not to completion)
17:37emersion: after or before creating the sync point?
17:37emersion: my guess would be after?
17:38imirkin: i don't really know how sync points fit into things
17:38imirkin: i know this won't work with nouveau :)
17:38imirkin: but presumably nouveau also doesn't advertise sync fd's
17:38emersion: since creating a sync point is essentially just a command
17:38emersion: eh :)
17:38bnieuwenhuizen: if that is true then definitely after :)
17:39imirkin: otoh, perhaps creating the sync point is enough to force a flush? i don't know how it's specified.
17:39emersion: > When a fence sync object is created, eglCreateSyncKHR also inserts a fence command into the command stream of the bound client API's
17:40imirkin: ok, so doesn't force a flush then.
17:41imirkin: how do you get a sync fd out of it?
17:42imirkin: i could imagine the sync fd forcing a flush (up to the sync point)
17:42emersion: eglCreateSyncKHR returns a EGLSyncKHR, then eglDupNativeFenceFDANDROID
17:42emersion: ah, good point, but doesn't look like it
17:43imirkin: i guess it just creates the fd at fence creation time, but yeah, i think you have to flush. after creating the fence.
17:44imirkin: er. s/fence/sync point/
17:44bnieuwenhuizen: yeah, my reading as well
17:45imirkin: (that said, an impl might flush anyways, esp to "fix" or prevent buggy applications ... my suspicion is that android isn't necessarily the paragon of spec conformance)
17:47emersion: it worked without the flush, so… :P
18:02lynxeye: EGL fences have EGL_SYNC_FLUSH_COMMANDS_BIT_KHR in their wait primitives to do the flush for you. As you don't use the native waits if you convert to a sync fd, you need to do your own flushing.
18:02lynxeye: creating the fence doesn't trigger a flush
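Putting the thread together, a condensed sketch of the async-readback flow (PBO creation, error handling, and fence cleanup are elided, and the extension entry points are assumed to have been resolved with eglGetProcAddress into the pointers shown):

    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <GLES3/gl3.h>

    extern PFNEGLCREATESYNCKHRPROC egl_create_sync;
    extern PFNEGLDUPNATIVEFENCEFDANDROIDPROC egl_dup_native_fence_fd;

    /* Queue an async glReadPixels into an existing PBO and return a
     * sync_file fd that becomes readable once the copy has finished. */
    static int
    start_async_readback(EGLDisplay dpy, GLuint pbo, int width, int height)
    {
       glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
       glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, NULL);

       EGLSyncKHR sync = egl_create_sync(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID, NULL);

       /* Creating the sync only queues a fence command; flush explicitly so
        * it (and the readback queued before it) actually get submitted. */
       glFlush();

       return egl_dup_native_fence_fd(dpy, sync);
    }

The returned fd can then go into poll() or an event loop; once it signals, mapping the PBO should no longer stall on the DMA transfer.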
18:06robclark: mareko: can PIPE_MAP_UNSYNCHRONIZED (or really TC_TRANSFER_MAP_THREADED_UNSYNC) ever happen with something that is not PIPE_BUFFER? (And if so, how are you supposed to handle tiling/compression?)
18:10emersion: lynxeye: thanks
18:19robclark: mareko: nevermind, I see that TC_TRANSFER_MAP_THREADED_UNSYNC is limited to PIPE_BUFFER
18:22zmike: robclark: iirc it's only used for doing backing storage replacement
18:22zmike: which is why it's buffer-only
18:23robclark: right, ok.. I was just worried that there might be a case that combined TC_TRANSFER_MAP_THREADED_UNSYNC with detile/decompress.. but if it is just PIPE_BUFFER that won't happen
18:23zmike: yeah, I wish the docs for that were a bit better
18:30zmike: robclark: not sure if it'll be at all useful as another reference, but here's my tc transfer map implementation https://gitlab.freedesktop.org/zmike/mesa/-/blob/zink-wip/src/gallium/drivers/zink/zink_resource.c#L845
18:32robclark: hmm, I think in theory you are supposed to drop the valid_buffer_range stuff (although tbh I'm still wondering about keeping that for non-threaded case)
18:33zmike: you're allowed to use it if it's a synchronous mapping
18:36robclark: ahh
20:49jekstrand: jenatali: You said "assuming it doesn't break our openCL CI". Is that well-integrated into gitlab now?
20:49jenatali: jekstrand: Yep
20:49jekstrand: 👍
20:50jekstrand:assigns marge
21:01imirkin: mareko: you were mentioning earlier that ssbo's are packed. are images also packed?
21:02imirkin: on nv50, ssbo's and images are in the same hardware "array" of things, and i'd rather avoid having some complex remapping thing when i don't have to
21:06robclark: that sounds a bit like adreno (ssbo and image are same thing)..
21:06imirkin: yeah ... do you do any remapping there?
21:06imirkin: or do you just start images after ssbo's / vice-versa
21:06robclark: yeah
21:06imirkin: and assume the best
21:06imirkin: ok
21:06jekstrand: SSBOs and Images are sort-of the same here too.
21:07imirkin: so i have 16 total slots for "stuff", and if i expose 16 of each with a max combined of 16
21:07robclark: would be nice if mesa/st had an option to lower ssbo to image (and maybe even lower to texture for read-only with no coherence requirements)
21:07imirkin: i could end up with a shader that has an image at binding 10 and ssbo at binding 10
21:08imirkin: so something will remap both of those to 0, and then internally i can count up the ssbo's and bump up the images by 1 in the shader compiler and api right?
21:09imirkin: ugh, but i'm still dependent on the shader to know how many total there are. annoying, but not the end of the world.
21:09robclark: yeah, ir3 builds a remapping table when it compiles the shader
21:09robclark: but it's also kinda slow-path for draw-time emit when there are images/ssbo
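For illustration, the "images start after the SSBOs" layout discussed above could be expressed roughly like this (hw_slot_for_* are hypothetical helpers; real drivers such as ir3 build a per-shader remap table instead):

    #include "compiler/shader_info.h"

    /* One shared 16-entry hw binding space: SSBOs first, then images.
     * State emit and the shader compiler must agree on the split, which
     * is why the shader's SSBO count has to be known. */
    static unsigned
    hw_slot_for_ssbo(unsigned binding)
    {
       return binding;
    }

    static unsigned
    hw_slot_for_image(const struct shader_info *info, unsigned binding)
    {
       return info->num_ssbos + binding;
    }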
21:09jekstrand: We've got most of the code in our stack to lower images to SSBOs.
21:09jekstrand: It assumes Intel's tiling formats but it could be adapted.
21:10jekstrand: We didn't get typed load/store that could be used on all bit sizes up until Gen9 (SKL)
21:10robclark: (for us, it is more "ssbos are images" than "images are ssbos")
21:10jekstrand: So we have to turn R/W images into SSBOs
21:10imirkin: that's ok - nv50 doesn't have typed at all :)
21:10imirkin: making writeonly work will be fun. on the bright side, that means i'll be able to support formatless no problem
21:11jekstrand: We've got typed for writeonly all the way back to Ivy Bridge
21:11imirkin: yeah, it's part of DX11 i think
21:11jekstrand: And I suppose we could lower readonly to texture but we haven't
21:11imirkin: nv50 is a DX10 part
21:12imirkin: not sure if DX10 had some sort of optional compute thing, but obviously they were also in the early stages of CUDA
21:12imirkin: it's not enough to support ARB_compute_shader proper since that requires ARB_images and ARB_ssbo, which in turn are spec'd to work in frag (whereas this does not)
21:13jenatali: imirkin: There was a very limited shader model 4 (DX10) compute, yeah
21:13imirkin: but it's like 90% of the way to ES3.1 requirements i think
21:14jenatali: imirkin: https://docs.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-devices-downlevel-compute-shaders
21:14imirkin: anyways, this work should also help opencl efforts on that line of GPUs
21:14jekstrand40: Yeah. If you wanted to steal the NIR pass from the Intel drivers, you'd be welcome to.
21:14jekstrand40: And we could make a general one if we're ok with making them linear.
21:15imirkin: jekstrand40: well, i actually have that part 50% written (the "read" half, not the write half)
21:15jekstrand40: For OpenCL, we already have a pass to turn readonly images into textures. We've not hooked it up for GL or Vulkan, though.
21:15imirkin: (since we have to do it on fermi/kepler as well)
21:15jekstrand40: Right
21:16jekstrand40: storage image support is one of the parts of our HW I like the least....
21:16imirkin: anyways, at this point i'm not looking at NIR -- too much new stuff for me.
21:16imirkin: getting the hw to work is hard enough
21:16jekstrand40: heh
21:16imirkin: just found out the "zero" register doesn't work when used as an index into global memory
21:16imirkin: so that was fun.
21:17jekstrand: You have a magic zero register?
21:17imirkin: wellll
21:17imirkin: sort of
21:17imirkin: on later gens, there is a dedicated "zero" register
21:18imirkin: on tesla, you just use a register above the "gpr limit"
21:18imirkin: frequently you'd use r63, as there are encoding benefits to using a register below 64
21:18jekstrand: Ah
21:18jekstrand: Neat
21:18imirkin: on fermi+, the last register in the register file is zero forever
21:19imirkin: on nvidia isa, you can't just use an immediate 0 wherever you want
21:19imirkin: whereas you can use a register wherever you want
21:19jekstrand: We can't put immediates wherever but we can put them most places.
21:20imirkin: ah. on nvidia, only the second arg can be a not-register (with VERY rare exceptions)
21:20jekstrand: And most of the places you couldn't put an immediate, you wouldn't want zero anyway.
21:20jekstrand: Same here. It's only in the 2nd
21:21imirkin: [that also goes for uniforms being used directly]
21:21jekstrand: Yeah, we can put uniforms most places.
21:21jekstrand: If they're push constants
21:22jekstrand: Because then they're just a regular register with a scalar stride
21:22imirkin: and on tesla you had both 4- and 8-byte encodings per instruction
21:22imirkin: but the 4-byte encodings can only take "short" registers (less than 64)
21:22imirkin: and lots of other limitations
21:22jekstrand: Oh
21:23jekstrand: We have 8-byte and 16-byte instructions. :D
21:23imirkin: hehe
21:23jekstrand: Our "compacted" instructions are 8-byte.
21:23imirkin: fermi through pascal are 8-byte only
21:23imirkin: and i think volta+ is 16
21:23jekstrand: Gotta put all that regioning and type information somewhere!
21:24imirkin: on kepler+ you had "groups" of opcodes, with scheduling info once per 7 or 3 opcodes (depending on gen)
21:24imirkin: on volta+, the scheduling info is baked into each op
21:24jekstrand: Ah
21:24imirkin: (hence more bytes per op)
21:25jekstrand: Yeah, we've got sched info in ops now too
21:25jekstrand: I'm not sure what they got rid of in order to add it.
21:29jekstrand: On a completely unrelated note, Vulkan is 5 years old today...
21:30imirkin: jekstrand: with 16 bytes, you don't really have to get rid of anything... it's a lot of bytes.
21:31jekstrand: imirkin: You would think so, and yet we still don't get full immediates on 3-src instructions. 😩
21:31imirkin: like you don't get 3 immediates in there?
21:31jekstrand: We don't get one
21:31imirkin: we don't even get 3 registers in there sometimes
21:32imirkin: src3 == dst reg in some cases.
21:32imirkin: definitely doesn't complicate RA at all.
21:32jekstrand: I would love it if we had an encoding that let you put two fp16 immediates in a fma
21:33jekstrand: Of course, knowing our HW, they'd probably manage to design it so both immediates were on the multiply...
21:34imirkin: ahaha
21:36jekstrand: The other thing I'd really like is an integer MUL+ADD or SHL+ADD with two small integer immediates.
21:37imirkin: finally!
21:37imirkin: nvidia has something you don't have
21:37jekstrand: lol
21:37jekstrand: You mean other than fast GPUs?
21:37imirkin: wtvr, without reclocking, the GT1 destroys an RTX 3080 Ti
21:38jekstrand: :-(
21:40imirkin: SHL+ADD on nvidia *requires* an immed for the shift, and the add bit can be an immed as well though
21:40imirkin: (called 'ISCADD' for some reason. don't ask what the C is.)
21:41Sachiel: integer shift clockwise?
21:41jekstrand85: There are so many places where that'd be nice
21:41imirkin: that'd be pretty intense
21:42imirkin: jekstrand85: yeah, it's great for address calcs
21:43jekstrand21: There's also a lot of stuff sort-of like addresses where it'd be nice.
21:45imirkin: indirects can also take immediate offsets directly in the op (up to a size)
21:45jekstrand21: Same here but it's a uselessly low size
21:45jekstrand21:should give up and just IRC on his phone
21:45imirkin: oh. on nvidia it's a good size (like 12-16bit)
21:46jekstrand21: I think we can get to the first 1/4 of the file with it or something like that
21:46jekstrand21: So not useless but it usually doesn't save enough to be worth the bother.
21:46imirkin: i mean indirect on an address
21:46imirkin: not on a register
21:46imirkin: indirect registers is insanity :)
21:47jekstrand21: What are addresses?
21:47jekstrand21: :P
21:47imirkin: like from memory
21:47jekstrand21: Instructions don't touch memory on our HW
21:47imirkin: you have those messages right?
21:48jekstrand21: Yes but they're a lot of set-up to use and have to go off the EU to a shared function unit.
21:48imirkin: and since you're putting the message together, there's no real benefit to having a separate offset
21:48jekstrand21: Pretty much
21:48imirkin: makes sense
21:48jekstrand21: So instead of being able to add x to uniform y, we first have to set up a message and do a send and it's like 4 instructions to get uniform y
21:48jekstrand21: And SENDs can't take any immediates at all
21:49imirkin: right
21:49jekstrand21: So you can stick "easy access to uniforms" on the list of things nvidia has that we don't.
21:49imirkin: yea
21:51bnieuwenhuizen: huh at that point I suspect even AMD is easier wrt uniforms ...
21:53jekstrand: bnieuwenhuizen: Yea, pretty sure you are
21:53jekstrand: Though we don't have the weird scalar vs. vector cache thing. That's just strange.
21:54bnieuwenhuizen: I dunno, generally the hitrate on the scalar cache is way better
21:54bnieuwenhuizen: as we put more stuff in it that is not variant between invocations
21:55bnieuwenhuizen: though maybe on average a good LRU cache beats that ...
21:55imirkin: recent nvidia gens have gained "SGPRs" too
21:55kherbst: turing+
21:55imirkin: we don't really know what to do with them :)
21:55kherbst: I do :p
21:55kherbst: have WIP patches :D
21:56bnieuwenhuizen: AFAIU sometimes compute kernels can actually have a very significant SGPR load
21:56imirkin: kherbst: yeah, but it's just like opportunistic stuff right? i feel like there might be a "good" thing to do with it
21:56kherbst: what do you mean?
21:56imirkin: if only i knew...
21:56jekstrand: Yeah, I really need to finish IBC so we can start doing scalar allocations
21:56bnieuwenhuizen: in nir there is a pass to figure out if stuff is scalar vs. vector and that should be pretty good
21:57imirkin: bnieuwenhuizen: you mean dynamically uniform vs not right?
21:57bnieuwenhuizen: yep
21:57kherbst: yeah, there is a nir pass I was making use of
21:57kherbst: it's quite okay
21:58imirkin: iirc intel already needs to determine that for ssbo indexing or some such
21:58bnieuwenhuizen: and then we do a second pass over that to assign register types with further limitations (e.g. our SGPR stuff can't do float computations)
21:58kherbst: I have to get back to it at some point :)
21:59jenatali: kherbst: (Unrelated) - was your r-b on that CL MR supposed to be for all of it or just the clover bit?
21:59kherbst: uhm.. your call :p
22:00jenatali: I don't think that's how it's supposed to work :P
22:01jekstrand: lol
22:07kherbst: jenatali: I mean, the changes look correct
22:07jenatali: Good enough for me, considering how simple it is to verify the fix