02:56mareko: anholt_: no, emulating transform feedback would be very special at NIR level, because we would need shared memory, workgroup scan operations, and global ordered atomic iadd and isub in VS and TES, because we can only do the atomic once per workgroup
02:56mareko: actually the atomic isub should be unordered (normal atomic)
02:59mareko: GS is even more complicated, because we don't do it at the GS (multiple primitives per thread) level, but we pass all transform feedback outputs via shared memory and then store 1 vertex per thread like VS
03:02robclark: mareko: fwiw, the adreno gens that need to lower xfb to store_global are also things that don't support gs/tess.. which I guess makes things a bit easier..
03:07imirkin: robclark: just a3xx, right?
03:08imirkin: there are some fails in there i might look at
03:17robclark: imirkin: I think we still use stg for a4xx, although we don't need to (it does have for-realsies hw xfb).. coincidentally that was also where hw gained gs/tess (although I guess enabling that isn't a priority, we probably are closer to getting that sorta stuff working on a5xx, and there is probably a lot more low hanging fruit when it comes to backporting a6xx gles31+ fixes to a5xx)
03:18imirkin: robclark: ah ok. well, a3xx is what i have unpacked and powered on :)
03:19robclark: imirkin: you should get a snapdragon laptop.. they are fun ;-)
03:20imirkin: robclark: first i have to get a dump truck to get rid of all the clutter
03:20robclark: (and at least for the lenovo c630 one, we are only a handfull of patches on top of upstream)
03:20imirkin: i do sorta need a new laptop ... i barely ever use them though
03:20imirkin: esp since they stopped making ones with useful screens
03:20robclark: heh, fair.. tho the laptop does take up a bit less room, since it doesn't need a monitor/keyboard
03:21imirkin: now you just have movie-watching screens. o well.
03:22robclark: the movie aspect ratio just means two columns of code side-by-side.. which is useful
03:23imirkin: robclark: -ENOHEIGHT
03:23imirkin: i use rotated 24" monitors, so that fits 2 columns of code on each one. pretty nice.
03:23imirkin: that's the only redeeming quality of wide-screen monitors -- they can be rotated :)
03:41HdkR: Careful, the c630 is now dead on Lenovo's website, they recommend an Intel based c640 as a replacement :P
03:44robclark: I still recommend c630.. at least until we get 8cx stuff running ;-)
03:45HdkR: Please more 8cx :)
04:56austriancoder: anholt_: nice.. but what is tracie?
07:34mareko: does X use modifiers?
07:35mareko: or does it depend on toolkits?
07:44MrCooper: mareko: glamor & DRI3 support modifiers, but yeah I think it's up to clients to actually make use of them
07:45MrCooper: zmike_: FWIW, with direct CPU access to GPU accessible memory it's easy to fall off a performance cliff
08:08mareko: MrCooper: will it ever possible to drop the non-modifier allocation path?
08:09MrCooper: I doubt it, why?
08:28mareko: MrCooper: so that we can get the correct scanout flag everywhere
08:29danvet: mareko, so it's all wired up, but not enabled by default
08:29danvet: at least not in -modesetting
08:29MrCooper: mareko: I suspect you'll need to treat buffers as not scanout capable unless an appropriate modifier is used
08:30danvet: mareko, I don't think the correct scanout flag happens without modifiers
08:30emersion: that's why they want to drop the non-modifier allocation path i assume?
08:31MrCooper: warm up your time machine then ;)
08:32emersion: maybe once modifiers are supported the non-modifier path can be changed to always assume USE_SCANOUT
08:32emersion: and then you can tell downstream to use modifiers to get betetr perf
08:35emersion: downstream isn't in a bad state tbh - wayland compositors and hw-accelerated clients running via mesa support it
08:49mareko: emersion: I wanna change the default to non-scanout
08:58emersion: mareko: this will break stuff. doing the opposite (default to scanout) won't.
09:34bbrezillon: pendingchaos: I'm looking at https://www.khronos.org/registry/spir-v/specs/unified1/SPIRV.html#Memory_Semantics_-id- and was wondering if the Volatile flag really applied to all ops
09:34bbrezillon: what does it mean for a barrier for instance?
09:44pendingchaos: I don't know what it would mean
09:44pendingchaos: I think it was probably supposed to be invalid
09:47bbrezillon: ok, so I can assume the access property is only needed for atomic ops
09:51pendingchaos: it looks like spirv-val errors on it: https://github.com/KhronosGroup/SPIRV-Tools/blob/master/source/val/validate_memory_semantics.cpp#L162-L166
09:56bbrezillon: pendingchaos: ok, thanks for the pointer, I'll refer to spvtools validation next time :)
12:37Zeising: I have questions about xmlpool, who should I ask?
13:23zmike: MrCooper: ?
13:49MrCooper: zmike: re your discussion with mareko about using blits
13:50zmike: MrCooper: uhhh sorry, can you rephrase/elaborate then?
13:50zmike: I'm still learning :)
13:52MrCooper: maybe I'm misunderstanding, I assumed the alternative to blits was direct CPU mappings
13:53zmike: the discussion was re: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5338 which is more or less a wrapper in zink for calling vulkan functions
13:53zmike: e.g., vkCmdCopyBufferToImage and vkCmdCopyImageToBuffer
13:55MrCooper: never mind then :)
13:55zmike: this is still good info!
14:29baryluk: Hi, is anybody else getting __gnu_cxx::recursive_init_error from LLVM 9 / LLVM 10 (llvm/include/llvm/ADT/Hashing.h:328) when running from radeonsi's si_init_compiler ?
14:36rhyskidd: mslusarz: can i get you to look at wip rebase of nouveau's downstream valgrind-mmt fork against the latest valgrind 3.16 release?: https://github.com/Echelon9/valgrind-mmt/tree/mmt-3.16-RC1
14:36rhyskidd: major new features in (upstream) valgrind 3.16 include support for AArch 64 v8.1 and improvements on false positive rates with highly optimised code. Lots of small bugfixes around too
14:43mslusarz: rhyskidd: 2 questions: 1) does it build without warnings? 2) did you have to resolve any conflict?
14:43rhyskidd: 1) clean build; and 2) no major rebase conflicts, only in Makefiles
14:44rhyskidd: binary seemed to pass my basic testing, but tbh i haven't used mmt in anger much at all
14:44mslusarz: ok, then there's nothing for me to look at
14:45mslusarz: I haven't used mmt for quite some time...
14:45karolherbst:still has mmt stuff to fix *sigh*
14:45karolherbst: but I will complain if anything broke :p
14:46karolherbst: once I am using it again
14:48rhyskidd: :p thanks mslusarz
15:46jenatali: jekstrand/karolherbst: I'm planning on adding align_mul/align_offset to load/store deref intrinsics, so that we can track alignment coming from SPIR-V. For load/store deref that's not coming from SPIR-V, I was thinking to just leave alignment 0 to indicate "natural" alignment (which means different things for GL/CL). That currently breaks validation in nir_intrinsic[_set]_align. Objections to relaxing that validation to allow 0 for untyped load/store
15:47karolherbst: jenatali: I don't think that's a good idea
15:47karolherbst: think about vectorizing loads
15:47karolherbst: like merging 4 32 bit loads to 1 128 bit one
15:47karolherbst: and it would be helpful to know if the first member is actually aligned to 128 bit and not just 32
15:48karolherbst: and if that's not comming from spir-v the driver can eg say that the first element in private storage is at least 128 bit aligned
15:48karolherbst: or something
15:48jekstrand: jenatali: I've wondered about how to best do that
15:48karolherbst: or for shared mem even
15:48jekstrand: jenatali: One option would be to provide some sort of alignment on the derefs instead
15:49jenatali: Hm, SPIR-V can annotate alignment on pointers too
15:50karolherbst: yeah.. but stuff like pointers to global mem are usually in the responsibility of the driver to decide their initial alignment and stuff :/
15:50karolherbst: the more you think about it the worse it gets actually
15:50karolherbst: nv hardware can't do unaligned memory loads/writes
15:50jenatali: Yeah, DXIL can't do unaligned loads/stores either
15:51karolherbst: and being able to load a vec4 in one go would be cool.. but if the pointers alignment is 4 we can't really do that
15:51karolherbst: it's... complicated
15:51jenatali: Sure, I get that
15:52jenatali: I guess all I'm trying to do right now is add information when we have it, not try to synthesize information for the cases where we don't
15:52karolherbst: I am wondering how a 0 can help here
15:52karolherbst: maybe 0 could mean "driver controlled" or something?
15:52jenatali: Yeah, or just unknown
15:53jenatali: Essentially right now, lower_explicit_io assumes that loads are being done at natural alignment for the type being loaded, since the loads/stores don't have any alignment info to start with
15:54karolherbst: yeah.. which with packed is not the case
15:54jenatali: My thinking was to keep that behavior when the load/store has 0 alignment, otherwise propagate the original alignment through
15:54karolherbst: I decided to ignore this issue for now.. but yeah... you kind have have to split the value for the driver in the end :/
15:56karolherbst: and we do have some hardware having relaxed rules here :/
15:56karolherbst: I think AMD can do unaligned loads of 32 bit data or so
15:56karolherbst: like 128 bits in one go, but only aligned to 32
15:56karolherbst: if I remember correctly
15:57jenatali: DXIL requires all loads to be 32-bit aligned, and for UBO loads we do 128bit loads exclusively, while everything else we're currently doing 32bit loads/stores exclusively
15:57karolherbst: I see
15:57karolherbst: I think we want to have a seperate pass cleaning it up
15:57karolherbst: so.. we would end up with a load_global of 32 bit data, but 2 alignment
15:57jenatali: Er, I guess that's not entirely true, shared/temp memory can be explicitly typed to anything, but in order to support pointers into that space, we're treating them as int32 arrays
15:58karolherbst: and a pass can split it up
15:58karolherbst: and apply given alignment rules for loads
15:58jenatali: Yeah, I have a pass I wrote for all load/store deref for our DXIL backend which splits under-aligned loads/stores
15:58karolherbst: okay, cool
15:58jenatali: Works well, except there's a couple places that are choking on the load/store having 0 align
15:59jenatali: E.g. the vectorizer
15:59karolherbst: ahh yeah...
15:59karolherbst: I have "some" patches to fix it, but they are hacky
16:00karolherbst: jenatali: mind checking if the nir part of this patch helps? https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/dee91751aede20ccc9fcaaa409795c064a41e54b
16:00karolherbst: it is incompatible to vulkan.. but if that solves all your issues getting a proper patch done mght make sense
16:01jenatali: I'm not sure how it would?
16:01karolherbst: maybe there are more places which generate 0 alignment :/
16:01karolherbst: but that was one of them at least
16:01karolherbst: there was another...
16:02karolherbst: jenatali: do you have this patch locally? https://gitlab.freedesktop.org/mesa/mesa/-/commit/7afc9632a6d03ed8d23fbab08b564da594b9cfd6
16:02jenatali: I haven't looked at any of the other sources of nir besides SPIR-V, but if any of them generate plain untyped load/store deref, they wouldn't be setting alignment
16:03jenatali: And in my patch, I have 0 alignment for all places where vtn generates its own custom loads/stores (e.g. function parameters): https://gitlab.freedesktop.org/kusma/mesa/-/merge_requests/116/diffs?commit_id=14e7788e9765ce25567cbfe3bf8110b063c8c97c
16:03karolherbst: I am strictly talking about spirv here though
16:03jenatali: Yes, we have that patch I'm pretty sure
16:03karolherbst: ohh, you are even forcing an alignment of 0.. right mhh
16:03karolherbst: that's dangerous especially with function args
16:04karolherbst: well.. the entrypoint is special here
16:04jenatali: I mean we inline everything afterwards so they all disappear
16:04jekstrand: Having 0 mean "I don't know" seems reasonable. Then the code which creates strides/alignments on types can say do "if (align_mul == 0) nir_intrinsic_set_align(natural); else assert(properly aligned);"
16:04jenatali: Yeah that's what I was thinking too
16:04karolherbst: jekstrand: we can't rely on having a proper alignment though
16:04Zeising: eric_engestrom: Ping? :) Sorry for nagging you all the time when I have issues, but does https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5403 look at all sensible, or am I missing something?
16:05jekstrand: As far as hardware goes, Intel can do unaligned loads of up to 4 bytes and a 4-byte-aligned loads of up to a vec4.
16:05jekstrand: karolherbst: What do you mean?
16:05karolherbst: jekstrand: packed structs
16:05jekstrand: karolherbst: Right...
16:05jenatali: karolherbst: Those are annotated from SPIR-V correctly so they'll have the right alignment info
16:05jekstrand: karolherbst: But do packed structs allow you to have, say, a uint32_t that's unaligned?
16:05karolherbst: jekstrand: sure
16:05karolherbst: C packed ;)
16:06karolherbst: just like in C
16:06jekstrand: karolherbst: Fun....
16:06karolherbst: that's for sharing the same packed struct between C and CLC
16:06jekstrand: karolherbst: So the pass is going to have to grow some real smarts
16:06karolherbst: so same rules apply
16:06jenatali: FYI I was getting unaligned loads/stores *before* dealing with packed structs
16:06karolherbst: jekstrand: or we split it in a seperate pass where you can inject driver rules
16:07karolherbst: jenatali: uff.. not good
16:07jenatali: char* out; char* in; for (int i = 0; i < 2; ++i) out[i] = in[i]
16:07jenatali: That generated a int16_t load/store
16:07jenatali: 1-byte aligned though
16:07karolherbst: or something?
16:07jenatali: I mean the clang frontend did it
16:07karolherbst: ehh... :/
16:07karolherbst: yeah.. sounds like some opt doing magic
16:07jenatali: The SPIR-V already had the 1-byte aligned 2-byte load/store
16:08jenatali: Took me forever to figure out that was the reason for the CTS failures :)
16:08jenatali: And then I learned about packed structs...
16:08jekstrand: That sounds like something our hardware would actually like though
16:08karolherbst: the fun never ends with CL :p
16:08jekstrand: We can happily do an unaligned 16-bit load/store
16:09karolherbst: lucky you :p
16:09jenatali: We can't, which is why we need the alignment info coming from SPIR-V, so we can split it
16:09karolherbst: but yeah.. that's why the custom rules thing
16:09jekstrand: In any case, It's starting to sound to me like we want to pass through the alignment from SPIR-V and then have a pass which tries to increase the alignment if it can.
16:09karolherbst: and no rule could mean aligned by siye
16:09karolherbst: jekstrand: yeah
16:09jekstrand: And having align_mul/offset on load/store_deref seems like the way to do that.
16:10jekstrand: We may also want an alignment on cast derefs
16:10jenatali: Hm, what for?
16:10jekstrand: To give said NIR pass something it can work with if we get a pointer alignment in SPIR-V.
16:10karolherbst: jekstrand: we already have, no?
16:11jekstrand: karolherbst: No, we don't last I checked
16:11karolherbst: I fixed it...
16:11karolherbst: jekstrand: https://gitlab.freedesktop.org/mesa/mesa/-/commit/7afc9632a6d03ed8d23fbab08b564da594b9cfd6
16:11karolherbst: or did you mean soemthing else?
16:11karolherbst: ohh wait
16:11karolherbst: that's ptr_stride...
16:11jekstrand: Yeah, we need ptr_align
16:11jekstrand: And I don't think we need to care about offset there
16:12jenatali: I guess I'm not following, what would ptr_align do for a cast that couldn't just be inferred from the loads/stores?
16:12jenatali: Well, assuming the code that generated the SPIR-V was able to chase the alignment correctly
16:12jekstrand: It's more for the cases where we get an alignment in and are able to do a better job of chasing.
16:12jekstrand: For instance, we may inline something which allows us to chase even better.
16:13jenatali: Ah hm
16:13karolherbst: yeah.. there are spir-v things for that
16:13karolherbst: there is an Alignment Decoration
16:13jekstrand: Also, for Vulkan, I'd like to be able to throw an alignment of 64B on all UBO base pointers and have NIR figure out alignments from there.
16:14jenatali: For right now I'm going to start with align/offset on load/store, try to patch up existing places to be able to deal with 0, and just focus on correctness first
16:14jekstrand: That seems reasonable.
16:14jekstrand: We can do the deref stuff and the inference pass as a second step.
16:14karolherbst: yeah, sounds good
16:14jenatali: The optimizations that can be done sound worthy but are a bit out of my expertise :)
16:14jenatali: Thanks guys
16:14jekstrand: jenatali: Sure. I'm just trying to make sure we're building the right thing.
16:14jenatali: jekstrand: Agreed, which is why I'm asking :)
16:15jenatali: Though I guess in this case I already built it :P I just caused some test failures as a result
16:36zmike: jekstrand: heya, I'm trying to debug a dead code/variable elimination in nir related to piglit's shaders@glsl-array-bounds-10 in zink
16:38zmike: the fs is using a uniform as an array index, and the uniform is getting removed
16:38zmike: from the nir output it looks like the instructions related to the uniform are getting removed
16:40zmike: I've checked against other drivers and it seems to be okay in those cases; any tips on where I might start checking for the source of the problem?
16:40jekstrand: from your description, I"m not sure
16:40zmike: what more info can I provide?
16:40jekstrand: What do you mean by "getting removed"? Is it being replaced with an undef? A constant?
16:41zmike: I think replaced with undef
16:42jekstrand: I'd start with NIR_PRINT=1 and see where it gets removed.
16:42zmike: right, I've done that as well
16:42zmike: it's gone after one of nir_lower_pack, nir_copy_prop, nir_opt_remove_phis, nir_opt_dce
16:42zmike: though I'm suspecting nir_lower_pack
16:43zmike: in that the print associated with that pass is the last place I see instructions related to the uniform
16:46anholt_: austriancoder: CI for "does my driver keep rendering these apitrace/renderdoc/etc. traces correctly"
16:56jekstrand: zmike: I'm not sure. I would look for where the array index becomes an undef rather than looking for where the uniform gets removed.
16:56zmike: jekstrand: alright, that's a good tip
16:57jekstrand: It's possible they happen at the same time. But if they don't, looking at where the load_uniform gets deleted may be a red herring.
16:59jekstrand: 90% of "where did my instruction go?" is from DCE. However, that's rarely the reason why it stopped being used.
17:00austriancoder: anholt_: ahh.. thats nice
17:00zmike: this is good info.
17:44austriancoder: robclark: how big was the win when switching to simple-mtx? and what did you used to benchmark it?
17:49robclark: I don't think it was massive.. but gfxbench driver-overhead.. and there is a piglit drawoverhead test too
17:57austriancoder: robclark: okay.. will move it down in my to-do list. drawoverheads needs gl3 if I am not wrong - so nothing for me
17:59anholt_: austriancoder: there's a -compat flag
18:00anholt_: definitely recommend trying it out
18:01austriancoder: anholt_: aha.. okay will give it a try
18:13zmike: jekstrand: okay, I think I figured it out with your hint: in this test, there's an array with no defined values, and a uniform is used as an index; nir_opt_if propagates the evaluation of index < 0 for array indexing (which is undef since there's no initial value), then nir_opt_dead_cf removes the whole thing and the uniform use is replaced by an undef
18:13zmike: then nir_remove_dead_variables removes the uniform since it's no longer in a block
18:14jekstrand: zmike: That sounds like a thing that could happen
18:16zmike: so uhh...how do I go about figuring out how to translate that knowledge into not removing the uniform?
18:24jekstrand: zmike: Good question!
18:24zmike: reading the spec, it seems to me that the comment and behavior in can_remove_uniform() might not be accurate for this case?
18:25zmike: specifically the spec part (which isn't referenced in the comment) which says the uniform is considered active if it can be determined that the uniform is accessed or if it can't be determined conclusively
18:25jekstrand: zmike: So what is the actual problem?
18:25jekstrand: I'm still not seeing where NIR is doing anything wrong here
18:25zmike: the problem is that the test fails because shader_runner is trying to access that uniform
18:26zmike: but it gets removed
18:26zmike: accessing it doesn't change any behavior for the shader, but the test fails because the access fails
18:26jekstrand: The test fails because it tries to upload a value to a uniform that's gone?
18:27jekstrand: How does the test not fail everywhere else?
18:27anholt_: zmike: generally this means the test needs to be fixed such that the test does actually use the uniform
18:27zmike: well I ran it on iris and I think you just avoid hitting this case because it gets taken care of later in brw?
18:27jekstrand: zmike: Are you sure you're on the latest piglit? It's possible that tarceri fixed some tests when he made that change.
18:27zmike: I...am not on the latest piglit
18:28jekstrand: I'd try that first
18:28zmike: there don't appear to have been any changes to this particular test in the past 6 years
18:28jekstrand: Maybe there have been changes to shader_runner?
18:29zmike: I don't see anything that would change this behavior?
18:29zmike: I'm rebuilding now
18:29zmike: I was a couple months behind
18:29anholt_: zmike: I'd recommend moving the uniform setup to "uniform int idx = -2" inside the shader instead of a uniform command outside
18:30zmike: anholt_: hm but isn't the existing usage technically valid?
18:30zmike: or do we not care so much
18:30anholt_: then even if the array access is optimized out, you still get what you wanted (test runs to completion, alpha channel with the undef value is ignored by the probe)
18:31zmike: alrighty, easy enough
18:31anholt_: zmike: the compiler eliminating that uniform is valid. shader runner asserts that all uniforms you try to set didn't get eliminated, because we want to make sure you don't typo a uniform name, and asking to set an eliminated uniform is a silly, rare thing to do.
18:31zmike: anholt_: thanks for the info!
18:31anholt_: so, in this case a uniform initializer is a convenient way to dodge shader_runner's checks
18:33zmike: anholt_: while you're checking the tests there, the array-bounds-12 test has the same issue
18:33zmike: but it seems like it's intentional for that case
18:33sravn: pcercuei: I will look at your dbi series next week. Just a heads up that I have seen it but do not have time atm
18:34anholt_: zmike: looks like 12 wants the same treatment, it's just the vs version of that test?
18:34zmike: right I didn't notice the vertex shader line :)
18:34zmike: anholt_: thanks again!
18:35anholt_: glad to help!
18:35anholt_: (more fun than debugging /init being ENOENT in what feels like an innocent CI change)
18:36zmike: I'll keep your CI woes in mind next time I have shader test questions
19:08anholt_: vulkan CI enabled, finally!
19:15airlied: anholt_: how do you get it to complete in a reasonable time? cut down the number of tests?
19:16robclark: yeah, I think it is run one out of every N..
19:23airlied: yaya finally cts_runner runs to completetion without dying, now to make it pass
19:30austriancoder: anholt_: is anyone working on https://gitlab.freedesktop.org/mesa/mesa/-/issues/2655 ?
19:53anholt_: austriancoder: not really, I did it for the nfsroot mode but not the fastboot+initramfs mode
19:58pcercuei: sravn: alright, no rush
20:01austriancoder: anholt_: okay.. let's see if I get it working
20:03anholt_: did the nginx idea make sense?
20:14austriancoder: it makes sense... I thought about it and did not found any other easy way
20:26jenatali: airlied, jekstrand, karolherbst: Thoughts about trying to src/dest alignment info on a memcpy_deref intrinsic? That'd bring the number of consts up to 6 (with accesses) for that intrinsic, where there's currently a limit of 4 for nir
20:31anholt_: austriancoder: another thing I've been thinking about for the initramfs mode: could we precompute most of initramfs in the container and then just append the particular job's contents to it at test time?
20:32jekstrand: jenatali: Is this intrinsic in upstream? I don't see it.
20:32airlied: jenatali: I'd bump the consts :-P
20:32jenatali: jekstrand: Not yet
20:32jekstrand: What are the others?
20:32jenatali: Src/Dest access
20:32jekstrand: src/dest access, I'd assume
20:32jekstrand: that's 2, what are the other 2?
20:33anholt_: austriancoder: cpio can be concatenated, and xz can be concatenated, this sounds pretty good.
20:33jenatali: Yeah, it's for OpCopyMemorySized
20:33jenatali: Src/dest align mul/offset
20:33jekstrand: Oh, right. You need align on both ends.
20:33jekstrand: Yeah, bump it to 6
20:33jekstrand: It'll bloat things a bit but it's not the end of the world.
20:34jenatali: Mmkay, wanted to make sure that number wasn't an important choice
20:34jenatali: Or some other trick for "this intrinsic is special and needs just a bit more data"
20:34jekstrand: If we become *that* concerned about it, we can always make the alignment so that it takes one slot and we pack it as two 16-bit halves.
20:34jekstrand: But I don't think we should be that worried about NIR bloat.
20:34jenatali: Cool, sounds good
20:35jenatali: I'll tackle that later, our memcpy's super unoptimized right now (byte-by-byte where each byte is a read-modify-write in a larger dword :D), would be nice to get alignment info and improve it
20:38jekstrand: Yeah. When the Google team was working on clspv (for OpenCL -> Vulkan SPIR-V translation) they had to do some very exciting things to get 8 and 16-bit stores to work correctly. Some of them involved atomic cmp/swap loops. :-(
20:38jekstrand: We added 8 and 16-bit types to the API for buffers as a result.
20:38jenatali: Oof... we're currently doing atomic and/or
20:39imirkin: jekstrand: some hardware has to play those tricks too =/
20:40jekstrand: jenatali: Does that actually work? If you have a genuine data race on that byte, won't that result in you maybe getting neither of the two values?
20:40jenatali: jekstrand: Hm. Didn't think about that. Not sure if data races are defined to have to produce one of the two values though
20:41jekstrand: jenatali: I'm not sure. It certainly seems nicer to applications to give them that but I don't know how OpenCL defines it if they define it at all.
20:42jenatali: Found it
20:42jenatali: jekstrand: "Any such data race results in undefined behavior."
20:47tango_: otoh, opencl pre 1.2 doesn't necessary support sub-32-bit data access
20:47tango_: unless the cl_khr_byte_addressable_store extension is enabled
20:48jenatali: tango_: That's required for 1.1 it looks like
20:48tango_: jenatali: hm 1.2 iirc
20:48tango_: was it 1.1?
20:48tango_: wow I'm old
20:48tango_: I'm starting to forget thing 8-P
20:48jenatali: We're shooting for 1.2 anyway (well... hopefully 3.0 now that that's a thing :D)
20:49tango_: well 3.0 should be easier than 2.0 actually
20:49jekstrand: jenatali: And/or it is, then!
20:49tango_: at least if you have 1.2
20:49airlied: it's a thing that is 1.2 :-P
20:49jenatali: Yep, should be basically free after adding some caps
20:49tango_: a lot of 3.0 changes were actually rollbacks from 2.0 to make it more “implementable”
20:50jekstrand: Where by "a lot of" you mean "we rolled back almost all of 2.0"
20:50tango_: well, there were also a lot of other changes
20:50jenatali: Is there anything left of 2.0 that's still required?
20:51tango_: jenatali: I think the memory model
20:51jenatali: What do you mean by that?
20:51tango_: (which is good because the 1.x was ... suboptimal)
20:51tango_: jenatali: the memory ordering and consistency rules
20:52tango_: the way it was defined in 1.x, one didn't even have the guarantee that atomics worked the way one would expect
20:54tango_: (still can't blame the standard for bugs such as this though ;-) https://gitlab.freedesktop.org/mesa/mesa/-/issues/2703 )
20:55jenatali: Why does the platform advertise 6GB allocations? O_o
20:58tango_: jenatali: 8GB card. not sure why 6GB is the max
20:58tango_: for what it's worth, rocm gives me a max alloc size of 6.8GiB
20:59tango_: mesa 6871947673 (6.4GiB)
20:59jenatali: Hm, that seems quite large. Unfortunately I don't think we're going to be able to go above 4GB, though realistically we might end up capped at 2GB
20:59tango_: (boy how much do I hate the fact that free software opencl implementation can't coexist when they share llvm ...)
20:59jenatali: We're using 32bit buffer index + offset
21:00tango_: jenatali: doesn't that prevent using the entire memory of the device?
21:00jenatali: In a single allocation, sure
21:01tango_: but if the offset is 32-bit too then the max alloc size should definitely be clamped at 4GB
21:01jenatali: Yeah, right now we're reporting 1GiB (minimum required by the spec if you have more than 4GiB of available mem)
21:06tango_: btw is the choice of 32-bit for performance or are there other considerations?
21:07jenatali: We're mapping to the DXIL intermediate language which doesn't have pointers
21:08jenatali: I guess we could do 64-bit values with like 16 bits of buffer index and 48 bits of offset, but that's significantly more complicated to pack/unpack and would result in a lot of 64-bit math. Using 32bit splits ends up doing most math at 32 bits
21:16tango_: jenatali: would it be possible to “choose” the encoding based on usage?
21:17tango_: as in: if the user only does < 4GiB allocations, use 32+32, otherwise use 16+48?
21:17tango_: hm duplicating codepaths this way would be a PITN for maintenance though
21:18jenatali: Not easily...
23:27anholt_: are there any hw drivers that have a significantly reduced featureset on gles compared to desktop?
23:27anholt_: the question is for firefox: if they're going to pick only gl or only gles on linux, which should it be?
23:28anholt_: (I'm arguing gles, because of cases like pi4/freedreno/i think panfrost where we have much less exposed for desktop due to some desktop limits)
23:34airlied: anholt_: are they picking gles2/3/3.1/3.2? :-P
23:35airlied: like gles3.1 + exts covers quite a lot of desktop
23:35anholt_: airlied: they pick desktop gl and glx, period, currently. trying to convince them to autodetect and pick something sensible, but they don't want two paths for context setup.
23:36karolherbst: tango_, jenatali: nvidia limits the allocation size to 1/4 of VRAM btw
23:36airlied: anholt_: don't they run on windows/osx?
23:37anholt_: yeah, but windows/osx/android have separate context setup paths
23:38airlied: but I think for desktop hw you get compute shaders on ivybridge where GL doesn't
23:38anholt_: airlied: you get compute on ivy in gles, you're saying?
23:39airlied: jekstrand: being the only driver to overallocate buffers, brings out some bugs :-P
23:39airlied: anholt_: yes
23:39airlied: but unless they need some desktop functionality it's probably not a bad idea to use GLES