00:35 anholt: naturally, upon touching num_ssbos again, everything goes badly. shelving that branch for the moment, !9070 up for getting more asan going
01:25 airlied: wow my asan deqp vk has been running single threaded for 5 days
01:25 airlied: (lavapipe built with asan)
01:26 airlied: might kill it, really want to use that machine for something else :-P
01:28 zmike: how's it doing?
01:37 airlied: zmike: well asan kills the process when it has a problem, so the fact it's still running seems good
01:37 bnieuwenhuizen: depends on how far you got :)
01:37 zmike: hm dunno about that, I left a cts run going for a couple days one time, finally checked and it had deadlocked itself :/
01:37 bnieuwenhuizen: though you're applying a level of asan that we can only dream of with a HW driver :P
01:38 airlied: it's somewhere in pipeline.sampler.view_type
01:39 airlied: not too sure how asan goes inside the llvm jit code, though I finally got perf hooked up so I can see which asm instructions in the jit code are just waiting for memory accesses
01:42 zmike: nice!
01:45 HdkR: Huh, I've never looked into how ASAN interacts with LLVM's JITs. I would assume that you lose ASAN inside the JIT though
01:47 airlied: HdkR: yes that would be my assumption too
01:56 HdkR: airlied: Looks like at the /very/ least you need to put the `sanitize_address` function attribute on each function
01:56 HdkR: Which modifies load and store IR ops
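For reference, a rough sketch (untested) of what that could look like through the LLVM-C API while the JIT builds each function — `mark_function_for_asan` is a made-up name, and actually running the ASan instrumentation passes and linking the runtime into the JIT'd code are separate problems not shown here:

```c
#include <string.h>
#include <llvm-c/Core.h>

/* Hypothetical helper: tag a JIT-built function with the sanitize_address
 * attribute so the AddressSanitizer IR pass will instrument its loads and
 * stores.  Without the attribute, the pass skips the function entirely. */
static void mark_function_for_asan(LLVMContextRef ctx, LLVMValueRef fn)
{
    unsigned kind = LLVMGetEnumAttributeKindForName("sanitize_address",
                                                    strlen("sanitize_address"));
    LLVMAttributeRef attr = LLVMCreateEnumAttribute(ctx, kind, 0);
    LLVMAddAttributeAtIndex(fn, LLVMAttributeFunctionIndex, attr);
}
```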
02:10 sunshavi: I am compiling mesa-git on an H3 SBC. And I am getting https://termbin.com/1v4m. Any idea how to fix?
02:48 airlied: sunshavi: if you don't want clover, -Dgallium-opencl=disabled
02:48 airlied: otherwise get a clang devel package installed
03:41 imirkin: zmike: is there anything missing for ES 3.2 for zink?
03:42 zmike: imirkin: !9060
03:43 zmike: that's it
03:43 imirkin: ah ok. i forgot that advanced blend was in there
03:43 zmike: the vk ext is only implemented by nvidia atm so it's not going to be super helpful to most people
03:43 zmike: but it still counts!
03:44 imirkin: it's a weird ext.
03:46 HdkR: The photoshop extension :)
03:47 zmike: haha
03:47 imirkin: it's esp weird that it was included as a required feature for ES 3.2
03:47 imirkin: i guess it's because it was part of AEP, but that's weird too.
03:48 HdkR: Makes sense for mobiles because implementing the blend modes is easy enough
03:48 imirkin: yeah, but why is it a good idea
03:48 HdkR: That's more about questioning Google requirements at that point :)
03:48 imirkin: who wants those ultra-specific blend modes
03:48 imirkin: heh
03:49 HdkR: Maybe there was some grand plan of getting Photoshop on ChromeOS
03:49 imirkin: and photoshop somehow NEEDED this to be done as a blend mode?
03:51 HdkR: It's just a perf thing for them right?
03:51 imirkin: sure. but getting that to be required in ES?
03:52 HdkR: That's more questionable. But I try not to question Google's choices these days
03:52 imirkin: trying to avoid the worry lines?
03:52 HdkR: Pretty much
03:54 HdkR: But you get fbfetch and hardware blending support for it near everywhere so it's more :shruggie:
04:01 HdkR: I think it is only AMD that doesn't have a solution for it?
04:16 HdkR: (I guess maybe Vivante and BCM could get screwed with it as well)
04:31 anholt: broadcom can fb fetch.
04:32 HdkR: Nice
04:32 anholt: vc4 only had fbfetch for implementing blending
07:37 mareko: anholt: radeonsi uses num_ssbos as the highest used slot + 1, if it's incorrect, the driver will blow up
07:46 mareko: anholt: st/mesa packs SSBO bindings, so there should be no holes
07:48 mareko: it's safe to assume for GL that there are no unused holes between slots
08:05 sailus: danvet: The vsprintf patch on top of rc1 once we have that, right?
08:37 danvet: sailus, yeah, since it's for 5.13 anyway let's wait for -rc1 with the topic branch
08:38 danvet: assuming that was the question
08:38 tzimmermann: sailus, see my response
08:39 tzimmermann: danvet, i'm worried about the removal of drm_get_format_name()
08:39 tzimmermann: might break one of the trees
08:41 elmarco: hi! does anyone know how eglExportDMABUF images are synchronized between processes?
08:41 elmarco: it seems this happens somehow lazily / in idle
08:42 cwabbott: anholt: i think it's supposed to mean the total number in the linker, and the max binding point + 1 for everything after the linker
08:43 cwabbott: at least, it's used that way in iris, freedreno, etc.
08:46 elmarco: ah, nevermind, qemu is missing a glFlush when sharing the rendering...
08:47 danvet: tzimmermann, yeah I think we probably need two-stage for this
08:47 danvet: like merge all the removals right after topic branch lands
08:47 danvet: wait 2-3 weeks or so for trees to sync up
08:48 danvet: then another round including the removal patch
08:48 danvet: otherwise too much conflict potential
08:48 danvet: I don't think we have to stage this out over more than one release cycle if the topic branch happens right after -rc1
08:49 danvet: plus it's not a tricky conflict, so even if we screw up and have to resolve it in a merge it should be fine
08:51 danvet: airlied, not sure you replied, but I'm going to do the kcmp topic branch now
09:11 danvet: robertfoss_,
09:11 danvet: lol still can't irc
09:12 danvet: airlied, linux-next: build warning after merge of the pm tree
09:12 danvet: we need to be early I guess :-)
09:53 sailus: danvet: Petr and Mauro are both fine merging this through the drm-misc-next so I guess we won't need a topic branch.
10:01 danvet: wfm
10:58 tzimmermann: sailus, danvet, i just saw v4. i'll take care of merging later today or tomorrow
11:10 airlied: danvet: i think linus has power/internet issues so merge window is delayed a bit
11:18 sailus: tzimmermann: Thanks!
11:34 danvet: airlied, well just prepping it really
11:38 tzimmermann: sailus, i'll hold off on merging until the recent comment has been addressed
12:02 sailus: tzimmermann: You mean Andy's comments? I can send v9, or address them later, either works.
12:30 tzimmermann: ok
16:49 jenatali: daniels: You're probably the right person to review/ack this Windows CI change: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9018
16:49 jenatali: It's blocking another MR which wants to bump the Piglit version we use
17:15 emersion: do i need to eglFlush after eglCreateSyncKHR?
17:15 emersion: or is eglCreateSyncKHR enough?
17:29 daniels: emersion: if you're trying to extract a dmabuf fd, iirc you need to create the sync point, flush, then dup
17:30 emersion: daniels: hm, no, i'm trying to glReadPixels asynchronously
17:30 emersion: so i'm doing my glReadPixels calls, then create the sync point
17:30 emersion: then i want to wait on the sync point, but do i need to glFlush before that?
17:32 daniels: oh, you're doing PBOs
17:34 emersion: oh, yes
17:34 bnieuwenhuizen: probably better to use the GL thing rather than egl?
17:34 imirkin: well, presumably the buffer is passed somewhere else which can't?
17:35 emersion: but i need a sync fence FD, so that i can poll() it
17:35 imirkin: otherwise, just mapping the buffer in GL will wait
17:35 bnieuwenhuizen: imirkin: you could use it in combination with a coherent mapping :)
17:35 emersion: i don't want my process to block while the DMA transfer is in progress
17:37 imirkin: emersion: i think you have to glFlush then
17:37 emersion: ok, thanks!
17:37 imirkin: since that read pixels could be sitting in some buffer, essentially, in the GL
17:37 imirkin: if the GL has no reason to execute it, it might not (or at least not to completion)
17:37 emersion: after or before creating the sync point?
17:37 emersion: my guess would be after?
17:38 imirkin: i don't really know how sync points fit into things
17:38 imirkin: i know this won't work with nouveau :)
17:38 imirkin: but presumably nouveau also doesn't advertise sync fd's
17:38 emersion: since creating a sync point is essentially just a command
17:38 emersion: eh :)
17:38 bnieuwenhuizen: if that is true then definitely after :)
17:39 imirkin: otoh, perhaps creating the sync point is enough to force a flush? i don't know how it's specified.
17:39 emersion: > When a fence sync object is created, eglCreateSyncKHR also inserts a fence command into the command stream of the bound client API's current context
17:40 imirkin: ok, so doesn't force a flush then.
17:41 imirkin: how do you get a sync fd out of it?
17:42 imirkin: i could imagine the sync fd forcing a flush (up to the sync point)
17:42 emersion: eglCreateSyncKHR returns a EGLSyncKHR, then eglDupNativeFenceFDANDROID
17:42 emersion: ah, good point, but doesn't look like it
17:43 imirkin: i guess it just creates the fd at sync point creation time, but yeah, i think you have to flush. after creating the sync point.
17:44 bnieuwenhuizen: yeah, my reading as well
17:45 imirkin: (that said, an impl might flush anyways, esp to "fix" or prevent buggy applications ... my suspicion is that android isn't necessarily the paragon of spec conformance)
17:47 emersion: it worked without the flush, so… :P
18:02 lynxeye: EGL fences have EGL_SYNC_FLUSH_COMMANDS_BIT_KHR in their wait primitives to do the flush for you. As you don't use the native waits if you convert to a sync fd, you need to do your own flushing.
18:02 lynxeye: creating the fence doesn't trigger a flush
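Pulling the thread above together, a minimal sketch of the sequence under discussion — assuming an EGL context is current, `pbo` already has `size` bytes of storage, and the extension entry points resolve; error handling and eglDestroySyncKHR are omitted:

```c
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES3/gl3.h>
#include <poll.h>
#include <unistd.h>

/* Queue an async readback into a PBO and hand back a pollable sync-file fd.
 * eglCreateSyncKHR only inserts the fence command; the explicit glFlush is
 * what actually submits the readback plus fence (native waits would flush via
 * EGL_SYNC_FLUSH_COMMANDS_BIT_KHR, but exporting an fd bypasses them). */
static int queue_readback(EGLDisplay dpy, GLuint pbo, int width, int height)
{
    PFNEGLCREATESYNCKHRPROC create_sync = (PFNEGLCREATESYNCKHRPROC)
        eglGetProcAddress("eglCreateSyncKHR");
    PFNEGLDUPNATIVEFENCEFDANDROIDPROC dup_fence_fd = (PFNEGLDUPNATIVEFENCEFDANDROIDPROC)
        eglGetProcAddress("eglDupNativeFenceFDANDROID");

    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, NULL);

    EGLSyncKHR sync = create_sync(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID, NULL);
    glFlush();

    return dup_fence_fd(dpy, sync);   /* EGL_NO_NATIVE_FENCE_FD_ANDROID (-1) on failure */
}

/* Later, e.g. from an event loop: the sync-file fd polls readable once the
 * fence signals, so the process never blocks inside GL waiting for the DMA. */
static void *map_readback(GLuint pbo, size_t size, int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    poll(&pfd, 1, -1);
    close(fd);

    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    return glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, size, GL_MAP_READ_BIT);
}
```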
18:06 robclark: mareko: can PIPE_MAP_UNSYNCHRONIZED (or really TC_TRANSFER_MAP_THREADED_UNSYNC) ever happen with something that is not PIPE_BUFFER? (And if so, how are you supposed to handle tiling/compression?)
18:10 emersion: lynxeye: thanks
18:19 robclark: mareko: nevermind, I see that TC_TRANSFER_MAP_THREADED_UNSYNC is limited to PIPE_BUFFER
18:22 zmike: robclark: iirc it's only used for doing backing storage replacement
18:22 zmike: which is why it's buffer-only
18:23 robclark: right, ok.. I was just worried that there might be a case that combined TC_TRANSFER_MAP_THREADED_UNSYNC with detile/decompress.. but if it is just PIPE_BUFFER that won't happen
18:23 zmike: yeah, I wish the docs for that were a bit better
18:30 zmike: robclark: not sure if it'll be at all useful as another reference, but here's my tc transfer map implementation https://gitlab.freedesktop.org/zmike/mesa/-/blob/zink-wip/src/gallium/drivers/zink/zink_resource.c#L845
18:32 robclark: hmm, I think in theory you are supposed to drop the valid_buffer_range stuff (although tbh I'm still wondering about keeping that for non-threaded case)
18:33 zmike: you're allowed to use it if it's a synchronous mapping
18:36 robclark: ahh
20:49 jekstrand: jenatali: You said "assuming it doesn't break our openCL CI".  Is that well-integrated into gitlab now?
20:49 jenatali: jekstrand: Yep
20:49 jekstrand: 👍
20:50 jekstrand:assigns marge
21:01 imirkin: mareko: you were mentioning earlier that ssbo's are packed. are images also packed?
21:02 imirkin: on nv50, ssbo's and images are in the same hardware "array" of things, and i'd rather avoid having some complex remapping thing when i don't have to
21:06 robclark: that sounds a bit like adreno (ssbo and image are same thing)..
21:06 imirkin: yeah ... do you do any remapping there?
21:06 imirkin: or do you just start images after ssbo's / vice-versa
21:06 robclark: yeah
21:06 imirkin: and assume the best
21:06 imirkin: ok
21:06 jekstrand: SSBOs and Images are sort-of the same here too.
21:07 imirkin: so i have 16 total slots for "stuff", and if i expose 16 of each with a max combined of 16
21:07 robclark: would be nice if mesa/st had an option to lower ssbo to image (and maybe even lower to texture for read-only with no coherence requirements)
21:07 imirkin: i could end up with a shader that has an image at binding 10 and ssbo at binding 10
21:08 imirkin: so something will remap both of those to 0, and then internally i can count up the ssbo's and bump up the images by 1 in the shader compiler and api right?
21:09 imirkin: ugh, but i'm still dependent on the shader to know how many total there are. annoying, but not the end of the world.
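In isolation, the remap being described could look something like this sketch (made-up helper names; the point is just that image bindings are offset by the shader's SSBO count into the shared slot array):

```c
/* One shared hardware "surface" slot array: SSBOs packed first, images after.
 * Both API binding spaces are already packed by st/mesa, so the remap is
 * just an offset by the per-shader SSBO count. */
static unsigned hw_slot_for_ssbo(unsigned ssbo_binding)
{
    return ssbo_binding;                    /* SSBO n  -> slot n */
}

static unsigned hw_slot_for_image(unsigned num_ssbos, unsigned image_binding)
{
    return num_ssbos + image_binding;       /* image n -> slot num_ssbos + n */
}
```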
21:09 robclark: yeah, ir3 builds a remapping table when it compiles the shader
21:09 robclark: but it's also kinda slow-path for draw-time emit when there are images/ssbo
21:09 jekstrand: We've got most of the code in our stack to lower images to SSBOs.
21:09 jekstrand: It assumes Intel's tiling formats but it could be adapted.
21:10 jekstrand: We didn't get typed load/store that could be used on all bit sizes up until Gen9 (SKL)
21:10 robclark: (for us, it is more "ssbos are images" than "images are ssbos")
21:10 jekstrand: So we have to turn R/W images into SSBOs
21:10 imirkin: that's ok - nv50 doesn't have typed at all :)
21:10 imirkin: making writeonly work will be fun. on the bright side, that means i'll be able to support formatless no problem
21:11 jekstrand: We've got typed for writeonly all the way back to Ivy Bridge
21:11 imirkin: yeah, it's part of DX11 i think
21:11 jekstrand: And I suppose we could lower readonly to texture but we haven't
21:11 imirkin: nv50 is a DX10 part
21:12 imirkin: not sure if DX10 had some sort of optional compute thing, but obviously they were also in the early stages of CUDA
21:12 imirkin: it's not enough to support ARB_compute_shader proper since that requires ARB_images and ARB_ssbo, which in turn are spec'd to work in frag (whereas this does not)
21:13 jenatali: imirkin: There was a very limited shader model 4 (DX10) compute, yeah
21:13 imirkin: but it's like 90% of the way to ES3.1 requirements i think
21:14 jenatali: imirkin: https://docs.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-devices-downlevel-compute-shaders
21:14 imirkin: anyways, this work should also help opencl efforts on that line of GPUs
21:14 jekstrand40: Yeah.  If you wanted to steal the NIR pass from the Intel drivers, you'd be welcome to.
21:14 jekstrand40: And we could make a general one if we're ok with making them linear.
21:15 imirkin: jekstrand40: well, i actually have that part 50% written (the "read" half, not the write half)
21:15 jekstrand40: For OpenCL, we already have a pass to turn readonly images into textures.  We've not hooked it up for GL or Vulkan, though.
21:15 imirkin: (since we have to do it on fermi/kepler as well)
21:15 jekstrand40: Right
21:16 jekstrand40: storage image support is one of the parts of our HW I like the least....
21:16 imirkin: anyways, at this point i'm not looking at NIR -- too much new stuff for me.
21:16 imirkin: getting the hw to work is hard enough
21:16 jekstrand40: heh
21:16 imirkin: just found out the "zero" register doesn't work when used as an index into global memory
21:16 imirkin: so that was fun.
21:17 jekstrand: You have a magic zero register?
21:17 imirkin: wellll
21:17 imirkin: sort of
21:17 imirkin: on later gens, there is a dedicated "zero" register
21:18 imirkin: on tesla, you just use a register above the "gpr limit"
21:18 imirkin: frequently you'd use r63, as there are encoding benefits to using a register below 64
21:18 jekstrand: Ah
21:18 jekstrand: Neat
21:18 imirkin: on fermi+, the last register in the register file is zero forever
21:19 imirkin: on nvidia isa, you can't just use an immediate 0 wherever you want
21:19 imirkin: whereas you can use a register wherever you want
21:19 jekstrand: We can't put immediates wherever but we can put them most places.
21:20 imirkin: ah. on nvidia, only the second arg can be a not-register (with VERY rare exceptions)
21:20 jekstrand: And most of the places you couldn't put an immediate, you wouldn't want zero anyway.
21:20 jekstrand: Same here.  It's only in the 2nd
21:21 imirkin: [that also goes for uniforms being used directly]
21:21 jekstrand: Yeah, we can put uniforms most places.
21:21 jekstrand: If they're push constants
21:22 jekstrand: Because then they're just a regular register with a scalar stride
21:22 imirkin: and on tesla you had both 4- and 8-byte encodings per instruction
21:22 imirkin: but the 4-byte encodings can only take "short" registers (less than 64)
21:22 imirkin: and lots of other limitations
21:22 jekstrand: Oh
21:23 jekstrand: We have 8-byte and 16-byte instructions. :D
21:23 imirkin: hehe
21:23 jekstrand: Our "compacted" instructions are 8-byte.
21:23 imirkin: fermi through pascal are 8-byte only
21:23 imirkin: and i think volta+ is 16
21:23 jekstrand: Gotta put all that regioning and type information somewhere!
21:24 imirkin: on kepler+ you had "groups" of opcodes, with scheduling info once per 7 or 3 opcodes (depending on gen)
21:24 imirkin: on volta+, the scheduling info is baked into each op
21:24 jekstrand: Ah
21:24 imirkin: (hence more bytes per op)
21:25 jekstrand: Yeah, we've got sched info in ops now too
21:25 jekstrand: I'm not sure what they got rid of in order to add it.
21:29 jekstrand: On a completely unrelated note, Vulkan is 5 years old today...
21:30 imirkin: jekstrand: with 16 bytes, you don't really have to get rid of anything... it's a lot of bytes.
21:31 jekstrand: imirkin: You would think so, and yet we still don't get full immediates on 3-src instructions. 😩
21:31 imirkin: like you don't get 3 immediates in there?
21:31 jekstrand: We don't get one
21:31 imirkin: we don't even get 3 registers in there sometimes
21:32 imirkin: src3 == dst reg in some cases.
21:32 imirkin: definitely doesn't complicate RA at all.
21:32 jekstrand: I would love it if we had an encoding that let you put two fp16 immediates in a fma
21:33 jekstrand: Of course, knowing our HW, they'd probably manage to design it so both immediates were on the multiply...
21:34 imirkin: ahaha
21:36 jekstrand: The other thing I'd really like is an integer MUL+ADD or SHL+ADD with two small integer immediates.
21:37 imirkin: finally!
21:37 imirkin: nvidia has something you don't have
21:37 jekstrand: lol
21:37 jekstrand: You mean other than fast GPUs?
21:37 imirkin: wtvr, without reclocking, the GT1 destroys a GTX 3080 Ti
21:38 jekstrand: :-(
21:40 imirkin: SHL+ADD on nvidia *requires* an immed for the shift, and the add bit can be an immed as well though
21:40 imirkin: (called 'ISCADD' for some reason. don't ask what the C is.)
21:41 Sachiel: integer shift clockwise?
21:41 jekstrand85: There are so many places where that'd be nice
21:41 imirkin: that'd be pretty intense
21:42 imirkin: jekstrand85: yeah, it's great for address calcs
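As a plain-C illustration of that address-calc pattern (nothing driver-specific, just the shape of computation a single ISCADD covers, with the shift amount as an immediate):

```c
#include <stdint.h>

/* dst = base + (index << imm): here the immediate shift is 4, i.e. indexing
 * an array of 16-byte elements in a flat address space. */
static uint64_t element_address(uint64_t base, uint32_t index)
{
    return base + ((uint64_t)index << 4);
}
```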
21:43 jekstrand21: There's also a lot of stuff sort-of like addresses where it'd be nice.
21:45 imirkin: indirects can also take immediate offsets directly in the op (up to a size)
21:45 jekstrand21: Same here but it's a uselessly low size
21:45 jekstrand21:should give up and just IRC on his phone
21:45 imirkin: oh. on nvidia it's a good size (like 12-16bit)
21:46 jekstrand21: I think we can get to the first 1/4 of the file with it or something like that
21:46 jekstrand21: So not useless but it usually doesn't save enough to be worth the bother.
21:46 imirkin: i mean indirect on an address
21:46 imirkin: not on a register
21:46 imirkin: indirect registers is insanity :)
21:47 jekstrand21: What are addresses?
21:47 jekstrand21: :P
21:47 imirkin: like from memory
21:47 jekstrand21: Instructions don't touch memory on our HW
21:47 imirkin: you have those messages right?
21:48 jekstrand21: Yes but they're a lot of set-up to use and have to go off the EU to a shared function unit.
21:48 imirkin: and since you're putting the message together, there's no real benefit to having a separate offset
21:48 jekstrand21: Pretty much
21:48 imirkin: makes sense
21:48 jekstrand21: So instead of being able to add x to uniform y, we first have to set up a message and do a send and it's like 4 instructions to get uniform y
21:48 jekstrand21: And SENDs can't take any immediates at all
21:49 imirkin: right
21:49 jekstrand21: So you can stick "easy access to uniforms" on the list of things nvidia has that we don't.
21:49 imirkin: yea
21:51 bnieuwenhuizen: huh at that point I suspect even AMD is easier wrt uniforms ...
21:53 jekstrand: bnieuwenhuizen: Yea, pretty sure you are
21:53 jekstrand: Though we don't have the weird scalar vs. vector cache thing.  That's just strange.
21:54 bnieuwenhuizen: I dunno, generally the hitrate on the scalar cache is way better
21:54 bnieuwenhuizen: as we put more stuff in it that is not variant between invocations
21:55 bnieuwenhuizen: though maybe on average a good LRU cache beats that ...
21:55 imirkin: recent nvidia gens have gained "SGPRs" too
21:55 kherbst: turing+
21:55 imirkin: we don't really know what to do with them :)
21:55 kherbst: I do :p
21:55 kherbst: have WIP patches :D
21:56 bnieuwenhuizen: AFAIU sometimes compute kernels can actually have a very significant SGPR load
21:56 imirkin: kherbst: yeah, but it's just like opportunistic stuff right? i feel like there might be a "good" thing to do with it
21:56 kherbst: what do you mean?
21:56 imirkin: if only i knew...
21:56 jekstrand: Yeah, I really need to finish IBC so we can start doing scalar allocations
21:56 bnieuwenhuizen: in nir there is a pass to figure out if stuff is scalar vs. vector and that should be pretty good
21:57 imirkin: bnieuwenhuizen: you mean dynamically uniform vs not right?
21:57 bnieuwenhuizen: yep
21:57 kherbst: yeah, there is a nir pass I was making use of
21:57 kherbst: it's quite okay
21:58 imirkin: iirc intel already needs to determine that for ssbo indexing or some such
21:58 bnieuwenhuizen: and then we do a second pass over that to assign register types with further limitations (e.g. our SGPR stuff can't do float computations)
21:58 kherbst: I have to get back to it at some point :)
21:59 jenatali: kherbst: (Unrelated) - was your r-b on that CL MR supposed to be for all of it or just the clover bit?
21:59 kherbst: uhm.. your call :p
22:00 jenatali: I don't think that's how it's supposed to work :P
22:01 jekstrand: lol
22:07 kherbst: jenatali: I mean, the changes look correct
22:07 jenatali: Good enough for me, considering how simple it is to verify the fix