IRC Logs of #dri-devel on irc.freenode.net for 2023-12-12

01:33 airlied: agd5f: will do it now
05:02 mareko: ACO seems to work very well with radeonsi
07:14 airlied: agd5f: backmerged pushed out
07:51 mupuf: mareko: great to hear! Any more you want to share?
07:53 mupuf: As in, do you mean it is close to being functionally and performance equivalent? Or is it equal, or better? What apps did you use for your testing?
12:23 luc: hi all, recently I did some experiments on an aarch64 platform. I replaced memcpy[1] with __memcpy_aarch64_simd[2] in _mesa_store_compressed_texsubimage. It turns out that the latter is almost 1x slower than the former. If I understand correctly, what _mesa_store_compressed_texsubimage() does is copy data from RAM to VRAM. I don't know why SIMD does worse in this case
12:24 luc: [1]https://github.com/bminor/glibc/blob/master/sysdeps/aarch64/memcpy.S
12:25 luc: [2]https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/memcpy-advsimd.S
12:25 karolherbst: luc: memcpy is already implemented efficiently via simd instructions
12:25 karolherbst: glibc chooses what is the fastest given the hw and input
12:26 karolherbst: also compilers might replace memcpy by something better as well
12:29 luc: compared to the ARM-software version, I noticed that glibc just doesn't use SIMD/FP registers, I wonder how they (SIMD/FP registers) make a difference.
12:29 karolherbst: yeah.. but I'd trust them to know what they are doing and apparently they seem to do
12:30 karolherbst: but it might be best to check with gdb what actually happens on that memcpy
12:30 karolherbst: compilers are free to skip going through libc on any memcpy call, so it might just be that the compiler does something even smarter
12:32 karolherbst: and by using something besides memcpy you take that freedom away from compilers
12:42 luc: I've checked that with gdb. It is indeed __memcpy_generic in [1] above that is chosen. So I guess what is slow are instructions such as the loads/stores of q0..q7
12:47 karolherbst: luc: out of curiosity, did you try the sve version?
12:50 luc: karolherbst: not yet, because my cpu is armv8-a; according to the ARM reference, sve was introduced in armv8.2-a
12:51 karolherbst: could check in /proc/cpuinfo but yeah.. it's kinda hard to find out when sve was actually introduced
12:54 luc: in fact, __ARM_FEATURE_SVE not defined by my compiler
12:59 luc: karolherbst: /proc/cpuinfo shows `Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid`
12:59 karolherbst: I see
12:59 karolherbst: yeah.. then no idea why the generic version is faster, unless there is a good reason for this
13:08 luc: karolherbst: thanks a lot
13:12 cwabbott: https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/memcpy-advsimd.S is using the same instructions as __memcpy_generic, because with aarch64 you can assume that you have ASIMD instructions (which is what both use to load/store)
13:12 cwabbott: the inner loop even looks very similar
13:12 cwabbott: no idea why one would be slower
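A minimal C sketch of karolherbst's point above that a plain memcpy call leaves the compiler free to bypass libc, while a call to a custom routine does not. The struct and function names are hypothetical and are not Mesa code; whether a given compiler actually expands the fixed-size copy inline depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 16-byte block, standing in for one compressed texel block. */
struct block16 {
    unsigned char bytes[16];
};

/* Fixed-size memcpy: the length is a compile-time constant, so the compiler
 * is free to expand this inline (for example as a few loads and stores)
 * instead of emitting a call into libc. */
void copy_block_memcpy(struct block16 *dst, const struct block16 *src)
{
    memcpy(dst, src, sizeof(*dst));
}

/* Hypothetical type for a hand-written copy routine, standing in for an
 * externally provided assembly implementation. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Calling through an opaque function pointer: the compiler generally has to
 * emit a real call here and cannot substitute its own expansion. */
void copy_block_custom(copy_fn fn, struct block16 *dst,
                       const struct block16 *src)
{
    assert(fn != NULL);
    fn(dst, src, sizeof(*dst));
}
```

Comparing the disassembly of the two functions (as suggested with gdb above) is one way to see what the compiler really did with each call.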
14:05 karolherbst: jenatali: yeah, so I didn't hit any regressions with https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26641
14:06 jenatali: karolherbst: Great, let's land it
14:07 jenatali: karolherbst: How long does a full CL CTS run take on rusticl?
14:07 karolherbst: depends on the kind of the CL CTS run
14:08 jenatali: Kind?
14:08 karolherbst: or did you mean the run with the official runner and everything?
14:08 karolherbst: but also FULL vs EMBEDDED matters
14:08 jenatali: Yeah, a full run
14:09 MrCooper: karolherbst: heads up, I'm bisecting a rusticl regression since Friday with radeonsi
14:09 karolherbst: my last run on zink on radv was ~3 hours apparently
14:09 karolherbst: MrCooper: which one?
14:09 karolherbst: I wanted to dig into one today myself ...
14:10 karolherbst: jenatali: zink only exposes EMBEDDED though, and on iris which does FULL it was like 10 hours?
14:10 karolherbst: but that was like a year ago
14:11 MrCooper: piglit program@execute@builtin@builtin-float-remquo-1.0.generated & program@execute@builtin@builtin-float-sincos-1.0.generated broke
14:11 jenatali: Huh ok. When I did it a few years ago I was seeing closer to like 72. Not sure if it was just slow execution or if it was slow compilation
14:11 karolherbst: yeah.. maybe rusticl being heavily threaded helps
14:11 karolherbst: though
14:11 karolherbst: the CTS built in release mode helps a lot
14:12 karolherbst: but yeah... I interface with a `pipe_context` only from a special worker thread, which allows some kind of parallelism
14:12 karolherbst: (I should compile programs in parallel though...)
14:13 karolherbst: but I have a script which runs like everything in an hour, parallelized
14:13 karolherbst: or under 10 minutes with wimpy and some annoying and irrelevant tests disabled
14:14 karolherbst: MrCooper: ahh
14:14 karolherbst: MrCooper: on my end I have nextafter, remainder and remquo failing sometimes
14:15 karolherbst: but also something with half vstore/vload
14:15 karolherbst: I'll look into the vstore/vload stuff first then
14:16 MrCooper: ~30 tests fail here ever since I started testing, these were passing until today though
14:16 karolherbst: jenatali: maybe something serializes on conversion/math_brute_force on your end? Those tests are already threaded themselves and run on multiple CL queues
14:16 karolherbst: and conversions is like 60% of the runtime
14:16 karolherbst: at least for me
14:16 jenatali: Yeah they just take forever
14:17 jenatali: I haven't tried to do a full run recently and I'm working on perf currently so hopefully it'll be faster when I'm done
14:17 karolherbst: Test Conversions passed in 28495.6525979s on iris
14:17 karolherbst: roughly 8 hours full profile
14:18 jenatali: I think my last fails actually disappeared since I last looked too (hooray shared / external libraries) so I might be able to actually submit for CL3.0 certification
14:18 karolherbst: nice
14:19 karolherbst: jenatali: full or embedded profile?
14:19 karolherbst: I guess full as you don't have the image restriction issue with d3d
14:19 jenatali: Full
14:19 karolherbst: stats from my iris CTS run: https://gist.githubusercontent.com/karolherbst/6373866091ab497f4683edfa3902a2e4/raw/90d6962f0ed09f3a81a709d409f794a51db60f3d/gistfile1.txt
14:19 jenatali: What issue?
14:19 karolherbst: like.. GL doesn't split samplers and textures
14:20 karolherbst: so most drivers only support 32 read only images
14:20 jenatali: Oh right
14:20 karolherbst: and radeonsi wasn't interested unless anything actually needs more, as it's otherwise just pointless overhead :D
14:20 jenatali: Yeah the one main benefit of using an external runtime+driver
14:20 jenatali: Right
14:22 karolherbst: MrCooper: I see
14:23 karolherbst: MrCooper: one concerning issue is that I _sometimes_ hit this assert: test_bruteforce: ../src/gallium/auxiliary/util/u_inlines.h:83: pipe_reference_described: Assertion `count != 1' failed.
14:23 karolherbst: kinda need to figure out what that's all about
14:30 MrCooper: karolherbst: bisect landed on https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26307
14:30 karolherbst: mhh still doing mine
14:30 karolherbst: or rather.. git bisect run is still running
14:31 karolherbst: which of those commits though?
14:31 karolherbst: however
14:31 karolherbst: I think something is up with denormal flushing or rounding behavior...
14:32 MrCooper: 17e01a9a9b743d89066ba0a42c841e9b7e7d0528 "radeonsi: merge context_reg_saved_mask and other_reg_saved_mask into a BITSET"
14:32 karolherbst: mhhh
14:32 karolherbst: that's going to be fun to debug
14:37 MrCooper: will you file an issue about this?
14:37 karolherbst: I'd probably just debug it and send a patch...
14:38 MrCooper: even better :) thanks
14:40 karolherbst: my bisect ended at dbbf566588cedc72062f3d3640a0cf1bebd40af9 aco,ac/llvm,radeonsi: lower f2f16 to f2f16_rtz in nir :')
14:41 karolherbst: I think I have to set the execution mode or something
15:11 karolherbst: MrCooper: ohh.. yours was fp16 stuff?
15:11 karolherbst: ehh wait
15:11 karolherbst: it's a vec16
15:11 karolherbst: nvm
16:09 emersion: jani: enums can't be forward-declared :/
16:09 jani: emersion: welcome to gcc, they can!
16:10 emersion: jani, hm, but how does the compiler decide on the size?
16:10 emersion: is there a flag to make all enums int or something?
16:10 jani: emersion: of course, you shouldn't ever put that in an abi, but within a kernel build it'll always be the same
16:19 CounterPillow: why did they add that particular footgun to the compiler?
16:20 emersion: p e r f o r m a n ce
16:21 emersion: i hope C23 enums can be forward-declared if you specify the underlying type
16:21 jani: I think that's what modern C++ has
16:24 bl4ckb0ne: CounterPillow: saving a few bits
16:24 bl4ckb0ne: back in '78 those were expensive
16:24 HdkR: forward declaring enums is valid since C++11, just need to be explicit about the sizing :)
16:25 karolherbst: bl4ckb0ne: still could have made it explicit
16:25 karolherbst: though that's easy to say 45 laters
16:25 karolherbst: *45 years
16:26 bl4ckb0ne: indeed
16:26 karolherbst: though back then C was also more like assembly on steroids if anything
16:26 bl4ckb0ne: better late than never eh
16:26 HdkR: Oh cool, C23 gains the same `enum Enum : int` explicit sizing as C++11
16:26 karolherbst: yeah..
16:27 karolherbst: C23 is another huge release (finally)
16:27 CounterPillow: Now we just have to wait 20 years to be able to use it in the kernel :)
16:27 karolherbst: and then in 2035 we'll get a mesa MR defaulting to it
16:27 bl4ckb0ne: do you think msvc will have full support by then?
16:27 CounterPillow: Partial but only if you pretend it's C++
16:29 karolherbst: by 2035 MS will have ditched the NT kernel and uses linux anyway (and long ditched MSVC for something llvm based)
16:31 jani: emersion: actually it's not always the same, but the caller needs to have the complete type before it can make the call
16:31 jani: emersion: so you can't call functions with the wrong size
16:32 emersion: i see
16:32 karolherbst: sounds like an oversight, they should have allowed it for extra cursedness
16:32 jani: :)
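A small C sketch of the enum discussion above, assuming GCC or clang in their default GNU modes. It uses the GNU extension jani refers to; the names are hypothetical. The C23 / C++11 form mentioned by HdkR (`enum Enum : int`) is not used here because it needs a newer language mode.

```c
#include <assert.h>

/* GNU C extension: declare the enum without listing its values.
 * ISO C before C23 does not allow this, and -pedantic warns about it. */
enum demo_pipe;

/* A prototype may mention the still-incomplete type... */
int demo_pipe_index(enum demo_pipe p);

/* ...but before anything defines or calls the function, the complete type
 * has to be visible, so the size is known by then. */
enum demo_pipe {
    DEMO_PIPE_A,
    DEMO_PIPE_B,
    DEMO_PIPE_C,
};

int demo_pipe_index(enum demo_pipe p)
{
    assert(p <= DEMO_PIPE_C);
    return (int)p;
}
```

This matches jani's correction: the size is not guaranteed to be the same everywhere, but a caller cannot pass the enum by value without having seen its full definition.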
16:33 karolherbst: MrCooper: .... can you confirm that your issue goes away with MESA_SHADER_CACHE_DISABLE=1?
16:34 karolherbst: what a pain issue
16:34 karolherbst: so it fails for me on the first run compiling things
16:34 karolherbst: and then passes
16:35 karolherbst: well.. apparently that's not even true
16:35 karolherbst: pain
16:35 karolherbst: anyway...
16:35 karolherbst: I think the cache is broken
16:37 jenatali: bl4ckb0ne: I think MSVC will probably have full support for C23 within a year or two if I had to guess
16:37 bl4ckb0ne: thats good to know
16:37 jenatali: Just a guess FWIW, I don't have insider knowledge on their timelines for that stuff
16:37 jenatali: But they finally added C11 threads (https://devblogs.microsoft.com/cppblog/c11-threads-in-visual-studio-2022-version-17-8-preview-2/) so it seems like they actually care a bit about C now
17:32 MrCooper: karolherbst: nope, fails even with MESA_SHADER_CACHE_DISABLE=1
17:34 karolherbst: mhhh
17:34 karolherbst: I wonder if my issue is the same, but it's quite random
17:34 karolherbst: and I need to run ~7 times with the cache disabled to either hit it or not
17:49 cmarcelo: jenatali: best part of that MSVC news for me: struct whatever w = {}; will work for it now.
17:49 karolherbst: pain.. I always bisect towards nonsense commits :(
17:50 jenatali: cmarcelo: Hm? Is that a thing being added in C23?
17:50 MrCooper: karolherbst: it's been consistent for me so far, I've only done low double-digit number of tests though
17:51 cmarcelo: jenatali: yes. you can use = {} instead of = {0} to zero initialize structs.. that is helpful in some edge cases too (nested structs etc). it was already supported in clang/gcc as compiler extensions for a while.
17:51 jenatali: Oh cool
17:53 karolherbst: MrCooper: mhh.. maybe I'm debugging a different bug then
17:53 MrCooper: seems likely
17:59 karolherbst: let's see how many attempts it will take to find the culprit :')
18:00 cmarcelo: jenatali: and from my understanding it also will zero the padding bits (!)
18:00 jenatali: :O
18:00 karolherbst: it's not already guaranteed?
18:00 karolherbst: or will {} != { 0 } then?
18:02 cmarcelo: I don't think it is guaranteed :-( my understanding is that it will be different. trying to parse out the spec proposals.
18:10 bwidawsk: so there were a few patches which landed for 23.3 (started with 9ec9849c85e8202cb) that leandrohrb56 authored and that emersion and daniels reviewed which essentially stop me from using VKMS as an EGL renderer. I'm wondering what the right path would be for me to run my test suite now
18:10 bwidawsk: at least I think this is the case...
18:12 emersion: bwidawsk: why do they stop you from doing that?
18:12 bwidawsk: I think the main one is I lose dmabuf import apparently
18:12 emersion: sounds like a bug
18:13 cmarcelo: karolherbst: AFAICT "= {0}" isn't guaranteed to also zero the padding. empty initializer "= {}" guarantees that.
18:13 karolherbst: cursed
18:14 karolherbst: the same for the compiler extensions?
18:14 emersion: eh, really?
18:14 bwidawsk: emersion:
18:14 bwidawsk: ```
18:14 bwidawsk: 2023-12-12T17:35:10.212251Z DEBUG main: smithay::backend::egl::display: Supported EGL client extensions: ["EGL_EXT_client_extensions", "EGL_EXT_device_base", "EGL_EXT_device_enumeration", "EGL_EXT_device_query", "EGL_EXT_platform_base", "EGL_KHR_client_get_all_proc_addresses", "EGL_KHR_debug", "EGL_EXT_platform_device", "EGL_EXT_explicit_device", "EGL_EXT_platform_wayland", "EGL_KHR_platform_wayland", "EGL_EXT_platform_x11",
18:14 bwidawsk: "EGL_KHR_platform_x11", "EGL_EXT_platform_xcb", "EGL_MESA_platform_gbm", "EGL_KHR_platform_gbm", "EGL_MESA_platform_surfaceless"]
18:14 bwidawsk: ```
18:15 bwidawsk: sorry, wrong one
18:15 bwidawsk: I meant this
18:15 bwidawsk: ```
18:15 bwidawsk: 2023-12-12T17:35:10.248792Z INFO main: smithay::backend::egl::display: Supported EGL display extensions: ["EGL_EXT_create_context_robustness", "EGL_KHR_cl_event2", "EGL_KHR_config_attribs", "EGL_KHR_context_flush_control", "EGL_KHR_create_context", "EGL_KHR_create_context_no_error", "EGL_KHR_fence_sync", "EGL_KHR_get_all_proc_addresses", "EGL_KHR_gl_colorspace", "EGL_KHR_gl_renderbuffer_image", "EGL_KHR_gl_texture_2D_image",
18:15 bwidawsk: "EGL_KHR_gl_texture_3D_image", "EGL_KHR_gl_texture_cubemap_image", "EGL_KHR_image_base", "EGL_KHR_no_config_context", "EGL_KHR_reusable_sync", "EGL_KHR_surfaceless_context", "EGL_EXT_pixel_format_float", "EGL_KHR_wait_sync", "EGL_MESA_configless_context", "EGL_MESA_drm_image", "EGL_MESA_query_driver", ""]
18:15 bwidawsk: ```
18:19 bwidawsk: oh hang on a sec
18:19 bwidawsk: maybe it's my fault, let me check something else
18:33 bwidawsk: daniels, emersion, leandrohrb56: It was my mistake. It was falling back to gles renderer instead of using pixman as it was supposed to be.
18:34 cmarcelo: karolherbst: the GCC extension seems to do that (zero padding), although it's not really documented. also looks like in practice gcc/clang already treat "={0}" == "={}". will keep an eye open to see what MSVC will do here.
18:35 karolherbst: yeah, it's also often faster to just initialize it all in one go, because vector instructions
18:39 vsyrjala: iirc c23 mandates ={} to make sense. ie. padding is also zeroed
18:42 cmarcelo: vsyrjala: yes
18:44 vsyrjala: oh that was exactly what is being discussed :)
18:44 vsyrjala: didn't look far back
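A short C sketch of the three ways of zeroing a struct discussed above, assuming a compiler that accepts the empty initializer (C23, or GCC/clang as an extension). The struct and helper names are hypothetical. The code only relies on members being zero for `= {}` and `= {0}`; it makes no claim about padding for either, and uses memset where all bytes, padding included, must be zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical struct; on most ABIs there are padding bytes after `tag`. */
struct demo_padded {
    char tag;
    int value;
};

/* Returns 1 if every byte of the object, padding included, is zero. */
int demo_all_bytes_zero(const void *obj, size_t size)
{
    const unsigned char *p = obj;
    for (size_t i = 0; i < size; i++) {
        if (p[i] != 0)
            return 0;
    }
    return 1;
}

/* Empty initializer: standard in C23, an extension in GCC and clang before
 * that. All members are zero. */
struct demo_padded demo_init_empty(void)
{
    struct demo_padded s = {};
    return s;
}

/* Classic form: first member set to 0, the rest zero-initialized. */
struct demo_padded demo_init_zero(void)
{
    struct demo_padded s = {0};
    return s;
}

/* memset clears every byte of the object, padding included. */
void demo_init_memset(struct demo_padded *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof(*s));
}
```

The memset form is the one to reach for when the struct is later hashed or compared with memcmp, since those look at padding bytes too.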
18:45 vsyrjala: if only constexpr for functions had been included as well :(
18:48 jenatali: cmarcelo: Feel free to +1 https://developercommunity.visualstudio.com/t/add-an-experimental-c23-mode-stdclatest-and-implem/1657588
18:48 jenatali: I just did :)
18:53 cmarcelo: jenatali: voted
19:46 mareko: mupuf: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10285
19:48 karolherbst: mareko: while you are here, are you aware of any recent regression inside radeonsi in regards to the shader compiler _sometimes_ producing different/wrong code? Should be a 2-3 weeks old change, but I'm still having trouble figuring out what's actually going on here. Just wondering if you know something
19:49 mupuf: mareko: thanks!
19:49 mareko: karolherbst: if it's the bitset thing, try to use CLEAR instead of SET
19:49 mareko: in si_compute.c
19:50 mareko: for the saved registers
19:50 karolherbst: not quite sure yet.. I need to run a test ~15 times to properly detect the regression, so my git bisect runs are kinda... unreliable so far
19:50 karolherbst: yeah.. MrCooper bisected to that I think
19:50 karolherbst: 17e01a9a9b743d89066ba0a42c841e9b7e7d0528 specifically
19:50 karolherbst: I might end up at the same commit
19:50 mareko: BITSET_SET_RANGE is 100% wrong, it should be CLEAR
19:51 karolherbst: okay, thanks :) will try that out then
19:53 mareko: the previous code used a bitmask and it set 0
19:54 karolherbst: that one inside si_launch_grid?
19:54 mareko: there should be only one in that file
19:54 karolherbst: okay, must be that one then
19:55 mareko: it's rather obvious from the bad commit
19:55 karolherbst: yeah, now that I found the spot it indeed looks wrong
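A hedged C sketch of why SET versus CLEAR matters when a single-word bitmask is converted to a multi-word bitset, as mareko describes above. The helpers below are hypothetical stand-ins written for this example; they are not Mesa's BITSET macros and not the radeonsi code.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NUM_REGS 64
#define DEMO_WORDS (DEMO_NUM_REGS / 32)

/* Set bits [first, last], inclusive. */
void demo_set_range(uint32_t *words, unsigned first, unsigned last)
{
    assert(first <= last && last < DEMO_NUM_REGS);
    for (unsigned bit = first; bit <= last; bit++)
        words[bit / 32] |= UINT32_C(1) << (bit % 32);
}

/* Clear bits [first, last], inclusive. */
void demo_clear_range(uint32_t *words, unsigned first, unsigned last)
{
    assert(first <= last && last < DEMO_NUM_REGS);
    for (unsigned bit = first; bit <= last; bit++)
        words[bit / 32] &= ~(UINT32_C(1) << (bit % 32));
}

/* Returns 1 if the given bit is set, 0 otherwise. */
int demo_test_bit(const uint32_t *words, unsigned bit)
{
    assert(bit < DEMO_NUM_REGS);
    return (int)((words[bit / 32] >> (bit % 32)) & 1u);
}
```

With a plain bitmask, "forget all saved registers" is `mask = 0`. The bitset equivalent is `demo_clear_range(words, 0, DEMO_NUM_REGS - 1)`; calling `demo_set_range` over the same range marks every register as saved, which is the opposite state.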
19:57 karolherbst: will need to run the test in a loop for a while to be sure it fixes it :)
19:58 soreau: could it cause gpu hangs or app crashes?
19:58 karolherbst: I've seen such happening but not sure if it was caused by that
19:58 mareko: only with rusticl, clover, or CDNA
19:59 karolherbst: mareko: yeah.. so it looks better, do you want to submit an MR or should I?
19:59 karolherbst: I'll do more testing to make sure it's better
20:00 mareko: feel free to do it
20:00 karolherbst: okay, once I run more tests I'll open one then
20:22 ChaosPrincess: is there any documentation on how tessellation control shaders are compiled? even a very simple one that only sets the levels and passes through one variable (tes_color[gl_InvocationID] = tcs_color[gl_InvocationID]) turns into a huge pile of bcsels and control flow.
21:28 airlied: ChaosPrincess: for what gpu?
21:31 ChaosPrincess: asahi, but that is me dumping nir quite early, right at the beginning of agx_compile_variant
21:36 airlied: not sure where they lower tess to compute and do actual tessellation
21:37 airlied: might need alyssa to appear
21:37 ChaosPrincess: they don't. i am looking at input nir that is being passed from opengl compiler to driver-specific code
21:38 airlied: NIR_DEBUG=print_tcs might be a good place to look
21:39 jenatali: ChaosPrincess: IIRC it comes out of the GLSL frontend that way
21:41 ChaosPrincess: print_tcs says the offending pass is gl_nir_lower_buffers
21:44 jenatali: Ah, looks like it's probably nir_lower_indirect_derefs which just isn't wrapped in NIR_PASS_V so it doesn't print