00:00 jenatali: It's even in the translator already: https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/7a0767f2a83c49003d66edfbc70c9cc6cd374ab7
00:00 karolherbst: of course it is :p
00:00 karolherbst: but what if khronos refuses to accept such evil
00:01 karolherbst: should they add dozens of spirv opts instead
00:01 karolherbst: what should we care about some hw vendor ISA :p
00:01 airlied: I think it's pretty much a core part of making oneapi kernels
00:01 karolherbst: so what?
00:01 airlied: sometimes you have to let a bit of the crazy in to get the market share
00:01 karolherbst: although they already created like 20 extensions and I doubt they plan to get it into core
00:01 airlied: like I'm sure cuda is full of that stuff
00:01 karolherbst: it isn't?
00:02 karolherbst: nvidia is quite strict about it: no assembly
00:02 airlied: karolherbst: for users
00:02 airlied: for their libraries I expect there is bypasses
00:02 karolherbst: mhhh. right
00:02 karolherbst: I mean.. they allow PTX as inline "assembly"
00:02 airlied: lots of hand tuned kernels for specific devices
00:02 jekstrand: karolherbst: uh....
00:02 karolherbst: but PTX isn't assembly anyway
00:03 airlied: it might be they control PTX enough so they can just add opcodes everywhere easily
00:03 karolherbst: sure
00:03 airlied: but I wonder if they've got libraries that bypass PTX
00:03 karolherbst: but PTX instruction set is quite sane
00:03 karolherbst: and documented
00:03 karolherbst: mhhh
00:03 karolherbst: internally probably
00:03 airlied: and have hand rolled kernels that are the thing they say doesn't exist
00:03 karolherbst: ohh those exist and they even say they do
00:04 karolherbst: they even have function mangling
00:04 karolherbst: but it's still all internal
00:04 karolherbst: and they don't abuse PTX for it
00:04 karolherbst: what intel does with spir-v
00:04 karolherbst: I don't care if there is a native format for writing assembly code
00:04 karolherbst: but why does it have to be inside spir-v?
00:05 karolherbst: spir-v was nice: if you support all the extensions, it can run on any hw. With inline assembly you kill that
00:05 karolherbst: there is no way other runtimes want to support other vendors' assembly languages
00:10 airlied: karolherbst: yeah it's a bit of a nightmare path, they do have native binary support as well
00:10 airlied: so not sure why inline asm is really needed.
00:12 karolherbst: yeah.. not sure either
00:14 karolherbst: airlied: where are the spir-v things discussed within khronos btw? gitlab or is there also a mailing list?
00:14 karolherbst: :D
00:15 airlied: karolherbst: gitlab only
00:15 airlied: though I'm not sure if I'm just not subscribed to some spir-v wg
00:15 karolherbst: there is no spir-v wg
00:15 karolherbst: that's why I am asking
00:16 karolherbst: seems like they weren't crazy enough to bring it up internally...
00:16 karolherbst: oh well...
00:16 karolherbst: let's hope it stays that way
00:20 karolherbst: jekstrand: anyway.. any idea how to deal with drivers having crappy alignments because their compilers are just broken? :p
00:21 karolherbst: for fun reasons, codegen only assumes 0x10 alignment for !compute shaders...
00:21 karolherbst: so vectorization is pretty much disabled for compute on indirects
00:22 karolherbst: but for graphics shaders.... if you have an indirect, codegen just assumes 0x10 alignment
00:23 karolherbst: mhh.h.... I have an idea
00:23 karolherbst: I can disable it for nir and just use nirs pass to vectorize loads
00:23 karolherbst: mhhh
00:23 karolherbst: I wanted to do that anyway
01:03 karolherbst: ahhh... why are things so broken :(
01:38 jekstrand: karolherbst: Not sure what to tell you...
06:26 airlied: jstultz: are you willing to give an ack for mauro's patch series? I'm not feeling a great amount of respect for his engineering law interpretations
07:28 MrCooper: Lyude: it's an issue with the Git v2 transport protocol, which was enabled by default in Git 2.26 and disabled again as of 2.27
07:44 pq: Venemo, "dispensable"? An incomplete triangle sounds like just two points, not three points with zero area.
07:44 pq: I used 'dict' to look in Moby Thesaurus
07:45 Venemo: pq: it really is about just 1 or 2 vertices, not 3 vertices with 0 area
07:45 Venemo: pq: so I decided to go with 'incomplete'
07:50 pq: ok, that's accurate then
07:54 Venemo: this is the patch, in case you're curious: https://gitlab.freedesktop.org/Venemo/mesa/-/commit/75b8bc4942accb72c6afcdb48911837f13becc4a
08:04 pq: sounds good
08:05 pq: personally I often go to 'dict' command when I need to find a word for something
09:23 karolherbst: jekstrand: but I think it's common for drivers to specify different alignment for types. On nv hardware we only care up to 128-bit alignment; everything higher is pointless. And I think AMD hardware can deal with unaligned loads just fine
09:23 karolherbst: to some degree
09:29 daniels: karolherbst: assembly in CL, I hear you say? https://people.collabora.com/~daniels/tmp/cl_viv_vx_ext.h
09:29 karolherbst: hey....
09:29 karolherbst: can people stop doing it? :p
09:29 karolherbst: ohh no
09:29 karolherbst: it's the intel mistake
09:30 karolherbst: daniels: please remove this extension as it suffered from a fallacy :p
09:30 daniels: oh, I don't think anyone will ever use or care about it too much - it's in fact been almost completely disappeared from the internet. but very useful if you wanted to, say, reverse-engineer the Vivante NN engine in the RK3399Pro and i.MX8 :)
09:31 daniels: or something
09:31 karolherbst: :D
09:31 karolherbst: right...
09:31 karolherbst: but usually vendors should just have their own ISA or language specified
09:31 karolherbst: and devs should just use this instead of clc
09:31 karolherbst: less headaches
09:32 karolherbst: I don't even understand why anyone would want to add inline assembly support to spir-v even...
09:32 daniels: well, quite, but that's not in the top ten things that Vivante should be doing differently with their proprietary driver
09:32 karolherbst: like there are 0 benefits for doing so
09:32 daniels: and it definitely doesn't have any impact on SPIR-V, or anything CL anyone will ever directly use
09:33 daniels: the only user I ever found was a proprietary Rockchip TensorFlow backend
09:33 karolherbst: daniels: intel has a spirv extension for inline assembly :p
09:33 daniels: yeah, I saw that - that's threatening since they actually did it 'properly' with spv and also got stuff upstream - no such danger with Vivante
09:33 karolherbst: right...
09:34 karolherbst: I mean.. there are external instruction sets for a reason...
09:34 karolherbst: why inline assembly...
09:34 karolherbst:just doesn't understand why people don't push back on that
09:51 HdkR: an assembly shader binary format is too hard, just allow me to __asm volatile the world please
10:11 emersion: is there a way to check whether a device is render-only?
10:11 emersion: (ie. DRIVER_MODESET not supported)
10:12 emersion: is the only way to drmModeGetResources and check whether the error is ENOTSUP?
10:13 emersion: it doesn't seem like https://lwn.net/Articles/588016/ has landed
10:15 daniels: emersion: yeah, we just treat a device as render-only unless drmModeGetResources succeeds and shows at least one crtc + connector
10:16 emersion: ok
10:50 karolherbst: mhhhh
10:50 karolherbst: load_store_vectorize is pretty useless right now for nouveau if the address is any kind of indirect
12:38 kisak: anyone happen to know what distro has "vulkan-icd-loader 1.2.151.r14" (specifically uses <ver>.r<distro_ver>)
12:52 siqueira: mlankhorst , Hi, could you give your ack to this issue? https://gitlab.freedesktop.org/freedesktop/freedesktop/-/issues/287 Melissa is in the process of becoming a VKMS maintainer, and for this reason she requested write permission to drm-misc for applying VKMS patches.
12:58 mlankhorst: siqueira: seems already acked
12:59 mlankhorst: daniels: ^
13:03 siqueira: I thought we needed an ack from all maintainers, my mistake. Thanks
13:13 mlankhorst: one suffices :)
13:13 daniels: cool, I'll do that then
13:14 siqueira: thanks!
14:04 danvet: mlankhorst, topic/phy-compliance still exists, maybe time for a dim remove-branch?
14:59 pinchartl: robher: ping
15:00 robher: pinchartl: pong
15:03 pinchartl: robher: I'd like to try converting graph.txt to yaml
15:03 pinchartl: but I'm not sure what the best option is
15:03 pinchartl: should it move to dt-schema ?
15:03 pinchartl: and be handled automatically, or referenced explicitly with $ref to start with ?
15:04 robher: pinchartl: that would be my preference.
15:05 robher: pinchartl: though if we have $ref's then that implies bumping the dt-schema version.
15:06 pinchartl: is it ok to assume that a node named "ports" or "port" always refers to OF graph ?
15:09 robher: pinchartl: I wish, but there's at least ethernet switch bindings that use ports. IIRC, in dtc I had to look for ports+port or port+endpoint.
15:10 pinchartl: can that be done with a YAML schema ?
15:19 sravn: danvet: The KeemBay DRM driver could use a new set of eyes. Now you got the drm_managed patches out maybe you could find some time?
15:20 pinchartl: sravn: Daniel has just left the channel
15:48 sravn: pinchartl: Thanx, will ping him next time I see him around. If you have time then that would be great. A partial review is better than none...
15:54 pinchartl: sravn: I'm short on time at the moment :-(
16:17 jekstrand: karolherbst: Too high alignments shouldn't be a problem. You can always clamp down.
16:18 jekstrand: karolherbst: As far as setting your own alignment goes, not sure what to say
16:19 jekstrand: karolherbst: The problem is that, while it's ok to have a high alignment on a scalar when it's a member of a struct or a stand-alone variable, you don't get to pick if it's part of a vector.
16:19 karolherbst: right...
16:19 karolherbst: I can just fix it by depending on nirs vectorizer
16:19 karolherbst: but that's still not good enough
16:19 karolherbst: :/
16:19 jekstrand: karolherbst: We could add in a concept of aligned scalars as long as we're crystal clear that the scalar component of a vector is not that.
16:20 jekstrand: But even there you run the risk of breaking if you implement variable pointers type extensions.
16:20 jekstrand: Though maybe you're safe if you only care about GL and CL
16:20 jekstrand: The moment you care about Vulkan, you're hosed.
16:20 karolherbst: I mean.. I really don't want to overalign
16:20 karolherbst: _but_
16:20 karolherbst: now if I get indirect loads, the alignment is the size of the member
16:20 jekstrand: I get it. YOu want to use 64 and 128-bit loads when you can
16:20 karolherbst: yes
16:21 jekstrand: It sucks that nvidia's 64 and 128-bit loads require 64 and 128-bit alignment
16:21 karolherbst: well...
16:21 karolherbst: I am sure they have a reason for doing it like that :p
16:21 jekstrand: Oh, I'm sure they do too
16:22 jekstrand: On Intel, our wide loads are u32vec2/3/4 loads that require scalar alignment
16:22 jekstrand: Then we have a scalar 8/16/32-bit load that only requires byte alignment
16:22 karolherbst: I was wondering if I could analyze the indirect somehow and figure out what alignment the given pointer has, but that's also terrible to do
16:22 jekstrand: karolherbst: The alignment code should do that for you
16:22 karolherbst: but.. maybe we could do the following
16:22 jekstrand: Well, it can't analyze "the array index is even"
16:22 karolherbst: for all variables we can assume that, depending on their location, they have an inherent alignment
16:23 karolherbst: and depending on that, we could derive the alignment of the indirect
16:23 jekstrand: That's exactly what the code I landed yesterday does
16:23 karolherbst: mhhh
16:23 karolherbst: maybe I just have to figure out how to use it inside nouveau then
16:24 jekstrand: Possibly
16:24 jekstrand: It should give you a mul/offset which is the best known alignment
16:24 karolherbst: but all I got was just small alignments, but maybe that's because my local tree wasn't properly rebased due to the regression
16:24 karolherbst: ahh.. let's see
16:25 jekstrand: If you find some bugs, I'm happy to help fix them. I've not spent much time looking at the alignments it kicks out.
16:25 jekstrand: If it can chase all the way to the variable and all the array indices are constant, it should give you align_mul=256 align_offset=offset%256
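[editor's note: the align_mul/align_offset bookkeeping jekstrand describes can be sketched in plain C; the struct and helper names below are illustrative only, not NIR's actual API.]

```c
#include <stdint.h>

/* Illustrative sketch of NIR-style alignment tracking (made-up names,
 * not the real NIR API): an address is known to satisfy
 * addr % align_mul == align_offset. */
struct align_info {
    uint32_t align_mul;    /* power of two */
    uint32_t align_offset; /* address % align_mul */
};

/* Adding a constant byte offset keeps align_mul and rotates the offset. */
static struct align_info
align_add_const(struct align_info a, uint32_t const_offset)
{
    a.align_offset = (a.align_offset + const_offset) % a.align_mul;
    return a;
}

/* Largest power-of-two alignment a load at this address can rely on. */
static uint32_t
usable_alignment(struct align_info a)
{
    if (a.align_offset == 0)
        return a.align_mul;
    return a.align_offset & (0u - a.align_offset); /* lowest set bit */
}
```

So a variable known to live at a 256-byte boundary, accessed at constant offset 20, still supports a 4-byte-aligned load but not a 16-byte one.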
16:26 karolherbst: does it work after lower_io?
16:26 karolherbst: I guess not?
16:26 jekstrand: No
16:26 jenatali: lower_io embeds the alignment from the last deref into the load/store op
16:26 jekstrand: Maybe it's not picking up on constant array indices? You may need to do some constant-folding before lower_io
16:33 karolherbst: jekstrand: no, constant folding is fine as far as I can tell
16:33 karolherbst: it does work with directs anyway
16:37 jekstrand: Ok, cool.
16:37 jekstrand: For indirects, no, it doesn't have smarts to detect i * 2
16:38 jekstrand: That'd be kind-of cool
16:38 jekstrand: But I don't know how useful
16:40 karolherbst: well. for us it is :p
16:40 karolherbst: I hit a few shaders needing it for shared
16:40 karolherbst: so you have an indirect on the invocation id
16:40 jekstrand: By "don't know how useful" I mean I don't know how often x[i * 2] comes up in practice
16:40 karolherbst: and the "offset" part is shifted
16:41 karolherbst: uhm..
16:41 karolherbst: base I mean
16:41 karolherbst: or.. offset?
16:41 karolherbst: dunno :D
16:41 karolherbst: anyway, shared seems like it hits this a few times
16:41 jekstrand: So you actually have cases where you're seeing x[i << 2] or something?
16:42 jekstrand:really wants NIR to grow some proper range analysis for things like this.
16:42 karolherbst: I can dump the nir.. wait a moment
16:42 jekstrand: It probably wouldn't be *that* hard to detect left-shift and mul cases.
16:46 jenatali: That'd be pretty cool actually
16:47 jekstrand: I'm just not sure how much of it we should do with on-demand helpers and how much we should do with an actual analysis pass.
16:48 jekstrand: A nir_ssa_scalar_get_possible_set_bits helper wouldn't be that hard to write
16:49 jekstrand: The question we really want to ask for something like array alignments is "how many bits at the bottom can I count on being unset?"
16:49 karolherbst: jekstrand: https://gist.github.com/karolherbst/14e4cf31a074008178e5e667005b3ed6
16:49 karolherbst: there are some store_shared
16:49 jekstrand: x << y, x & y, and x * y all have implications there.
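[editor's note: the "how many low bits can I count on being unset" analysis jekstrand sketches could look roughly like this; the helper names are invented for illustration and are not a NIR API.]

```c
#include <stdint.h>

/* Trailing-zero-bit analysis sketch: each helper returns how many low
 * bits of the result are guaranteed to be zero. */

/* Known constant: count its actual trailing zeros (32 for zero). */
static uint32_t known_tz_const(uint32_t c) {
    uint32_t tz = 0;
    while (tz < 32 && !((c >> tz) & 1))
        tz++;
    return tz;
}

/* x << y with constant y: shifts y more zeros into the bottom. */
static uint32_t known_tz_shl(uint32_t tz_x, uint32_t y) {
    uint32_t tz = tz_x + y;
    return tz > 32 ? 32 : tz;
}

/* x * y: trailing zeros add (a*2^m times b*2^n is ab*2^(m+n)). */
static uint32_t known_tz_mul(uint32_t tz_x, uint32_t tz_y) {
    uint32_t tz = tz_x + tz_y;
    return tz > 32 ? 32 : tz;
}

/* x & y: AND can only clear bits, so take the larger guarantee. */
static uint32_t known_tz_and(uint32_t tz_x, uint32_t tz_y) {
    return tz_x > tz_y ? tz_x : tz_y;
}
```

With this, `i << 2` on an unknown `i` is known to be a multiple of 4, which is exactly the fact an alignment pass would need for the `x[i * 2]` style accesses discussed above.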
16:49 karolherbst: and I think they could be vectorized
16:50 karolherbst: mhhh...
16:50 karolherbst: maybe not?
16:50 karolherbst: intrinsic store_shared (ssa_87, ssa_95) + intrinsic store_shared (ssa_63, ssa_100)
16:51 karolherbst: I think those can be
16:51 karolherbst: yeah...
16:51 karolherbst: so what the code does is
16:51 karolherbst: base + 0/1/2/3
16:52 karolherbst: on a shared memory array
16:52 karolherbst: soo.. it would be perfectly fine to get those merged as 128b stores
16:52 karolherbst: the source is also a vec4 essentially
16:53 jenatali: LLVM probably would've done that for you if we were allowed to have optimizations enabled
16:53 karolherbst: just swizzled
16:53 jekstrand: karolherbst: Yeah....
16:53 karolherbst: jenatali: that's GLSL :p
16:53 jenatali: Oh, nvm
16:53 jekstrand: karolherbst: But I think you're talking about a lot of smarts here.
16:53 karolherbst: maybe
16:53 anholt: karolherbst: are you using the i/o vectorizer already?\
16:53 jekstrand: It's all doable, probably.
16:53 karolherbst: I am trying to
16:54 karolherbst: just hitting cases where it doesn't work for us
16:54 jekstrand: But it requires seeing through a bunch of shifts and ANDs to get an alignment
16:54 anholt: karolherbst: might want to throw together a unit test and run it by pendingchaos, then
16:54 karolherbst: jekstrand: maybe we could just set values and constant fold the chain? :p
16:54 jekstrand: karolherbst: ?
16:55 karolherbst: I mean.. most indirects actually just have one or two variables in them
16:55 karolherbst: and if we can fake constant fold we might be able to derive the alignment
16:55 karolherbst: but maybe that's also super brittle
16:55 jekstrand: I don't see any way that fake constant folding is going to be reliable
16:56 jekstrand: But we may be able to do some analysis
16:56 karolherbst: yeah...
16:56 karolherbst: anyway. for shared we don't even have to know the vars as the indirect always has a constant part
16:56 jenatali: Hm... I can pass the bruteforce test for single-component fma, but fail for 3-component fma... something seems odd there...
16:56 karolherbst: we just need to figure out how the variable bits affect the alignment
16:57 karolherbst: jenatali: normal :p
17:01 jenatali: Oh... the CTS's kernel for 3-component math is crazy
17:02 karolherbst: yep
18:04 jenatali: Hm, this probably isn't my problem but it's bugging me. Taking a function_temp vec3, casting to vec4, storing a vec3+undef, then loading a vec3 out of it... There must be an optimization pass that'd recognize how stupid that is, right?
18:04 jenatali: jekstrand: ^^
18:04 jekstrand: Uh... Nope.
18:05 jekstrand: We don't have a lot of stuff that can see through casts all that well
18:05 jekstrand: Also, that sounds incredibly sketchy
18:05 pepp: jekstrand: have you seen https://gitlab.freedesktop.org/mesa/mesa/-/issues/3487? (crash introduced by MR 6472)
18:05 jenatali: I agree, it's incredibly sketchy
18:06 jekstrand: pepp: No, I haven't. Do you have some GLSL that I can look at?
18:07 jekstrand: pepp: I tried to be super-careful with the blob stuff but I may have flubbed it. I did flub it at least once before and it was caught by CI.
18:07 jekstrand: jenatali: That potentially stomps memory after the vec3 which might hold something useful.
18:07 jekstrand: jenatali: Who's doing that?
18:08 jenatali: jekstrand: I'm still trying to figure out where it's coming from. It should be fine though because the vec3 is treated as a vec4 (since it's not packed)
18:10 jekstrand: jenatali: Assuming there isn't a solitary float after it. :)
18:10 jenatali: jekstrand: There shouldn't be. Lower_vars_to_explicit_types treats a float3 as float4-sized, so the next float should be float4-aligned
18:10 jenatali: Or greater
18:10 jenatali: Hence why I don't *think* it's my problem, but it still definitely shouldn't be there
18:11 jekstrand: Does it? I thought float3 was float4 aligned but float3 sized. I'm probably wrong though
18:11 jenatali: For CL, the size/align func treats vec3 to be sized/aligned as vec4
18:12 karolherbst: jenatali: size should be vec3, no?
18:13 jenatali: karolherbst: Check glsl_type::cl_size()
18:13 jenatali: unsigned vec_elemns = this->vector_elements == 3 ? 4 : this->vector_elements;
18:15 jekstrand: Ok, then :)
18:16 karolherbst: heh..
18:17 karolherbst: that was me :D
18:17 karolherbst: mhhh
18:17 karolherbst: let me check..
18:17 karolherbst: I think that's even correct
18:18 karolherbst: ahh yeah
18:18 karolherbst: "Except for 3-component vectors whose size is defined as 4 * size of each scalar component."
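[editor's note: the OpenCL rule karolherbst quotes can be restated as a standalone sizing helper; this is a sketch mirroring what glsl_type::cl_size() does, not the Mesa code itself.]

```c
#include <stdint.h>

/* OpenCL sizing rule: a 3-component vector is sized (and aligned)
 * like a 4-component one; all other widths are scalar_size * components. */
static uint32_t cl_vec_size(uint32_t scalar_bytes, uint32_t components)
{
    uint32_t c = components == 3 ? 4 : components;
    return scalar_bytes * c;
}
```

So a float3 occupies 16 bytes, which is why, as jenatali notes above, the next float after it is already float4-aligned.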
18:20 jenatali: jekstrand: It's in the SPIR-V for the libclc function
18:21 jekstrand: jenatali: Which libclc function?
18:21 jenatali: Currently looking at ilogb3
18:21 jenatali: er, ilogb, for vec3
18:21 jenatali: %x_addr_609 = OpVariable %_ptr_Function_v3float Function
18:21 jenatali: %storetmp_121 = OpBitcast %_ptr_Function_v4float %x_addr_609
18:25 jenatali: jekstrand: Any thoughts on a way to detect that kind of thing?
18:26 jekstrand: Not sure
18:26 karolherbst: jenatali: well..
18:26 karolherbst: what's the issue there?
18:27 karolherbst: you get an undef
18:27 karolherbst: and we could optimize undef stores away
18:27 karolherbst: just remove the undef channel and move on :p
18:28 jekstrand: jenatali: Short of hand-coding something to detect that exact pattern, I'm not sure.
18:28 jenatali: Seems like copy-prop should be kicking in
18:28 jekstrand: Specifically, detecting store((vec4 *)my_vec3, vec4(x, y, z, undef))
18:29 jekstrand: copy-prop can't see through the cast
18:29 karolherbst: jekstrand: wouldn't it be easier to just detect stores of undef values?
18:29 jenatali: Ahh
18:30 jekstrand: karolherbst: Potentially but converting store(ptr, vec4(x, y, z, undef)) to store((vec3 *)ptr, vec3(x, y, z)) unconditionally may lead to worse problems.
18:30 karolherbst: why would it?
18:30 jekstrand: Because we could end up with all sorts of casts sprinkled around for no good reason
18:31 jekstrand: We only want to do that in the case where ptr is, itself, a cast of a vec3
18:31 karolherbst: but I am not saying we should cast anything, at least not on the deref level
18:31 karolherbst: just saying we want to get rid of stores of undef values
18:31 jekstrand: karolherbst: In general, we do but it's a vector store and only one piece of it is undef
18:31 karolherbst: right
18:31 jekstrand: The problem here isn't the extra bit of undef in the store, it's the fact that copy-prop can't see through it.
18:32 karolherbst: mhhh
18:32 jenatali: Right
18:32 jekstrand:is really tired of all of LLVM's pointless casting
18:32 jenatali: Hm... I wonder if I have optimizations enabled and shouldn't somehow...
18:32 jenatali:checks
18:33 jenatali: Nope, there's -O0 there
18:34 jekstrand: It's probably how clang is handling the vector stuff
18:34 jenatali: I don't get where it's coming from either, the implementation should just be:
18:34 jenatali: DECLSPEC RET_TYPE##3 FUNCTION(ARG1_TYPE##3 x) { \
18:34 jenatali: return (RET_TYPE##3)(FUNCTION(x.x), FUNCTION(x.y), FUNCTION(x.z)); \
18:34 jenatali: Yeah, must just be passing a vec3 by value that causes it?
18:34 jekstrand: Probably
18:35 krh: airlied, danvet: do udmabuf patches go through drm pulls?
18:37 jekstrand: jenatali: If this is something that clang generates every time a vec3 is passed by-value, then we should figure out how to sort it out even if it is with something hand-rolled.
18:37 jenatali: jekstrand: Yeah, that seems to be the case
18:38 jekstrand: jenatali: I think we could have a little hand-rolled thing that looks for store((vec4 *)my_vec3, vec3(x, y, z, undef)) and drops the cast and the undef
18:39 jenatali: Maybe even specifically just for function_temp
18:39 jekstrand: Or maybe store((vec4 *)my_vec3, val, wrmask=xyz)
18:40 airlied: krh: pretty sure they do
18:41 jekstrand: jenatali: I kind-of want an opt_algebraic like thing for derefs so we can more easily sort out these cases. I just have no clue how to design such a thing.
18:41 krh: airlied: I sent it to you as well
18:41 jenatali: jekstrand: Yeah, I remember you filed an issue for that :P
18:41 jenatali: * TODO: At some point in the future, we could be clever and understand
18:41 jenatali: * that a float[] and int[] have the same layout and aliasing structure
18:41 jenatali: * but double[] and vec3[] do not and we could potentially be a bit
18:41 jenatali: * smarter here.
18:42 krh: airlied: may be something for the current cycle - just adding missing compat ioctl
19:04 anholt:wishes for a unit testing path for "take this GLSL, make asserts about the NIR that results"
19:04 anholt: (going through mesa/st and everything)
19:11 jekstrand: jenatali: Yeah, I could make a case for a pass which takes all types and stomps them to uint
19:11 jekstrand: jenatali: But that wouldn't fix this case. It's a vec3/4 thing :-/
19:11 jenatali: Yeah
19:12 jenatali: I wonder if a pass that removed known out-of-bounds writes would be a good idea in general
19:12 jekstrand: But that one isn't known out-of-bounds. :-(
19:12 jenatali: Isn't it? Writing to the 4th component of a vec3?
19:12 jekstrand: Not unless we can chase it all the way to the variable
19:12 jekstrand: If we can chase it to the variable, it is
19:12 jenatali: Sure, which we kinda can, there's only one relatively simple cast in the way
19:13 jekstrand: In fact, that might be UB in CL :P
19:13 jenatali: It probably is...
19:18 karolherbst: jenatali: what does out of bounds mean anyway? :p
19:19 karolherbst: I think for C-like languages it is pretty much impossible to really tell if anything is OOB, except for arrays with a static size I guess
19:19 jenatali: Or non-arrays?
19:19 karolherbst: but even then, container_of is just a way to still write beyond
19:20 karolherbst: jenatali: you never know if it's valid use case or not
19:20 jenatali: An out-of-bounds access to a function_temp variable is pretty clear UB
19:20 jenatali: That's stack corruption
19:20 karolherbst: really.. the only case where it's quite obvious is an indirect/direct on a fixed-size array
19:20 karolherbst: jenatali: right.. for the entire variable
19:21 karolherbst: but what if you have a pointer to a struct
19:21 karolherbst: _any_ struct
19:21 karolherbst: and you do container_of on that?
19:21 karolherbst: if you can follow the chain back to the top level declaration, sure
19:21 karolherbst: if not? no chance
19:21 jenatali: If it's a direct pointer to the variable, it's invalid
19:21 jenatali: Sure
19:21 jekstrand: Yeah, OOB only works if you have a pointer to an actual variable
19:22 jekstrand: There might be a few more cases you can argue but it gets tricky
19:22 jenatali: Yeah
19:22 jenatali: jekstrand: Got a hand-rolled pass that enables copy prop on the vec3
19:22 jekstrand: jenatali: Cool
19:23 jenatali: But probably we want something that drops writes to known OOB, and replaces reads of known OOB with undefs
19:23 jenatali: Which is what I did just for the hardcoded case of vec3 and vec4
19:24 karolherbst: jenatali: I'd let applications just crash :p
19:25 jenatali: karolherbst: They don't crash, it's just a missed optimization
19:25 jenatali: Ends up spilling to scratch instead of copy prop
19:25 karolherbst: ohhh.. mhhh
19:25 karolherbst: annoying
19:26 jenatali: Agreed
19:26 jenatali: Now I have a shader I can actually kinda read and see why it's failing :P
19:26 karolherbst: :D
19:27 jekstrand: The #1 real reason for optimizations: Being able to read the shader. :P
19:28 jenatali: Seriously though, I've been adding a bunch of early optimizations so that I can actually read the shader before libclc gets inserted... and then a bunch after libclc before explicit_io so I can still read the shader :P
19:28 karolherbst: jekstrand: guess your hw doesn't have stupid opcodes to make the shader less readable :p
19:28 jekstrand: karolherbst: We have register regioning. :P
19:29 karolherbst: well.. fair :D
19:29 karolherbst: our xmad is pretty terrible though
19:29 karolherbst: so.. imul is not the fast way of doing imul :p
19:29 karolherbst: but using 3 xmads is
19:30 jekstrand: We can't do 32x32 imul
19:30 jekstrand: We can only do 32x16 so it's two imul and an add, all with fancy strides. :P
19:30 karolherbst: right
19:30 karolherbst: xmad is more or less like that
19:30 jekstrand:thinks he's winning :)
19:30 karolherbst: it's just a 16x16+32 op
19:30 jekstrand: 32x16->32, actually
19:32 karolherbst: heh.. that's nice :p
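[editor's note: the decomposition jekstrand describes, two 32x16 multiplies plus a shifted add, can be checked in plain C; this is an illustrative sketch, not either backend's actual codegen.]

```c
#include <stdint.h>

/* Full 32x32->32 multiply built from a 32x16->32 multiplier:
 * a*b mod 2^32 = a*(b & 0xffff) + ((a*(b >> 16)) << 16) mod 2^32,
 * since the upper half of the high partial product shifts out of
 * the 32-bit result. */
static uint32_t imul_via_32x16(uint32_t a, uint32_t b)
{
    uint32_t lo = a * (b & 0xffff); /* 32x16 -> 32 */
    uint32_t hi = a * (b >> 16);    /* 32x16 -> 32 */
    return lo + (hi << 16);
}
```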
19:33 karolherbst: our XMAD is really annoying though, has several operation modes which are all.. strange
19:33 karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/blob/master/src/gallium/drivers/nouveau/codegen/nv50_ir.h#L287
19:33 karolherbst: still no idea what CSFU means
19:34 karolherbst:wonders what jekstrand would think about adding all those to nir :p
19:35 jekstrand:isn't opposed to back-end-specific opcodes
19:37 karolherbst: right.. but most of them make more or less sense, no?
19:37 karolherbst: although.. I guess adding iadd3 is something I should focus on
19:37 karolherbst: we have hw support for quite some time
18:38 italove: umm, speaking of that, I'm working on fusing ball_eq + cmp (lt, ge) into midgard opcodes such as ball_lt, and someone suggested that maybe these should be added in NIR
19:39 jekstrand: There's a balance here. It could be that what you really need for midgard is an optimization pass to fuse
19:40 jekstrand: Not everything needs to be done in NIR
19:40 anholt: or just backend instruction selection that recognizes it.
19:40 italove: sure, that was what I was doing actually
19:41 anholt: jekstrand: got a few minutes to talk load_ubo range annotation?
19:43 jekstrand: anholt: Sure
19:44 anholt: so, we need some sort of range information on load_ubo. at worst, size of the whole ubo. ideally, the minimum range the deref might access.
19:44 karolherbst: jekstrand: anyway.. what should I do about my stupid bug until I figure out how to let the load/store do what I need ..
19:45 anholt: I was trying to go at this by taking the GL uniform buffer variable's Offset and Size, stuff them into a nir_address_format_32index_offset_range's extra 2 components compared to 32bit_index_offset, then pick those bits back out in lower_explicit_io and stick it on the load_ubo
19:45 karolherbst: hitting this "assert(*alignment == explicit_type_scalar_byte_size(this))" assert
19:45 karolherbst: *alignment is 16, the other side 4
19:45 anholt: but this feels like maybe explicit_io should be able to calculate the range for me?
19:46 anholt: (such that I don't need to do something similar for spirv)
19:46 anholt: but if I have explicit_io do this, how does it figure out the base offset of the ubo from the address format?
19:46 anholt: (WIP https://gitlab.freedesktop.org/anholt/mesa/-/commits/nir-ubo-ranges-real-ubos)
19:46 jekstrand: anholt: In the case where we have the whole deref chain and assume no OOB array access, it should be possible to compute bounds by walking the deref
19:46 jekstrand: anholt: Of course, if it ever hits a non-deref along the path, it has to throw up its hands and walk away
19:51 jekstrand: karolherbst: If you always assume 16B alignment, your compiler is broken. :P
19:51 karolherbst: jekstrand: I won't say it's not :p
19:52 jekstrand: karolherbst: I mean, we could do something that would allow for higher alignments for variables and structure members
19:52 jekstrand: But the moment you have a free-floating pointer, you're toast
19:52 karolherbst: I mean.. I wouldn't care as much if that doesn't affect users, but it probably will :/
19:53 karolherbst: jekstrand: alternative solution would be to disable codegens memory vectorizer when coming from nir
19:53 jekstrand: karolherbst: That sounds like a kind-of terrible but maybe necessary option
19:53 karolherbst: right
19:53 karolherbst: and hence the idea to use nirs
19:53 jekstrand: karolherbst: Or make it align_mul/offset-aware.
19:53 karolherbst: but that doesn't give me the result I need
19:54 anholt: jekstrand: looks like my deref chain at load_ubo creation time is something like "deref_array(deref_cast(load_const))"
19:54 karolherbst: jekstrand: well... right... but I wasn't planning on adding such stuff to codegen
19:54 jekstrand: karolherbst: Right
19:54 karolherbst: the vectorizer code isn't.... the best one to understand and read
19:54 karolherbst: codegens I mean
19:54 jekstrand: anholt: Is this Vulkan or GL?
19:54 anholt: GL
19:55 jekstrand: anholt: Where does that load_const come from? I don't remember what the lowering path for GL looks like. :-/
19:55 karolherbst: jekstrand: maybe I just throw up my hands, accept it's not perfect, and see that the regression isn't that bad
19:56 anholt: jekstrand: gl_nir_lower_buffers is picking the UBO variable's offset within the ubo out of the linked program info, then you make a 32bit_index_offset with the ubo block plus that
19:56 jekstrand: anholt: Right
19:59 jekstrand: anholt: Do you care about a double-ended range or just a max offset?
20:00 anholt: I could live with a max offset, probably. (though I would need that 32bit_index_offset's offset included)
20:02 jekstrand: Assuming that the offset portion of the vec2 going into your cast is a constant (it should always be for you, I think), it shouldn't be too hard to construct an offset range
20:03 jekstrand: So I think you can chase the deref chain and then, when you hit a non-deref, look to see if it's a load_const or a vec2 where the offset side is a load_const
20:04 jekstrand: That should be able to always get you a range for GL
20:05 anholt: ok. this seems plausible
20:07 jekstrand: anholt: We just need to make sure there's some way to encode "I don't know" in the range
20:07 jekstrand: range=0 comes to mind
20:07 jekstrand: or range=UINT32_MAX
20:08 anholt: range=~0 in the existing MR that fixes up the default uniform block for freedreno
20:08 anholt: (we were losing load_uniform's range information and making bad choices as a result)
20:39 jenatali: karolherbst: My vec3 bug was that I have a small bug in lower_compute_system_values which prevents adding in work group offsets if global offsets aren't also being used :D
20:40 karolherbst: uff :D
20:45 anholt: jekstrand: thanks for setting me on the right path. I've been intimidated by understanding nir derefs, and this ended up being not bad.
20:45 jekstrand: anholt: \o/
21:48 karolherbst:wished he had more time for nir stuff...
22:18 jekstrand: karolherbst: So say we all
23:04 Rush: hello. What is the best place to report a feature request for llvmpipe? It's missing an extension EXT_blend_minmax which will force me to create a CPU costly software workaround.
23:04 Rush: https://developer.mozilla.org/en-US/docs/Web/API/EXT_blend_minmax
23:09 Rush: https://gitlab.freedesktop.org/mesa/mesa/-/issues/3489 - I've created here. Hope it's the right place
23:09 dcbaker[m]: Yup, that's the right place
23:11 anholt: Rush: airlied has been actively working on llvmpipe, so there's definitely hope for that getting added :)
23:11 karolherbst: Rush: I think for llvmpipe it probably doesn't matter all that much if you do it in the application or if it happens inside mesa :p, but sure.. we could probably still support that extension
23:18 Rush: karolherbst: theoretically yeah, but in practice it's not that easy. I'm running drawing commands in a WebGL context on the server. So I either redo the entire drawing pipeline, or blend each sprite manually with the buffer. Either way is inefficient.
23:19 karolherbst: ahh.. fair