00:50 karolherbst: dcbaker: seems to be flaky..
00:51 karolherbst: ehh wait
00:51 karolherbst: there are also fails in the test I've written.. uhh
00:52 karolherbst: huh.. something is weird
00:54 karolherbst: heh.. that generated wrapper doesn't include the original header file... maybe something changed with an updated bindgen
00:54 karolherbst: I'll look into that tomorrow
01:02 karolherbst: and ubuntu "rolling" doesn't have a new enough bindgen...
01:02 karolherbst: *sigh*
01:34 dcbaker: karolherbst: soon we’ll be able to build bindgen in the build process, lol
01:36 HdkR: Oh wow, how long would that take?
01:37 HdkR: That sounds like my idea of forcing people to rebuild a clang fork :D
01:38 soreau: HdkR: upstream the patches, then you won't have to ;)
01:38 HdkR: :)
03:03 kurufu: Since anv exports yuv formats as a single plane, is that essentially set in stone now? Asking because vulkan video makes those formats more popular, and the single-plane representation seems to make lots of things harder (if only because the world expects 3 planes).
03:09 Sachiel: what do you mean? The number of planes depends on the specific format, no?
03:48 kurufu: https://github.com/haasn/libplacebo/blob/c7e9e95cd5144cefc52bfba43ca2d9138f7b3eec/src/vulkan/formats.c#L480 for example
03:49 kurufu: Despite the format being x planes, drivers don't necessarily need to export a representation with multiple planes.
03:58 Sachiel: An image’s memory planecount (as returned by drmFormatModifierPlaneCount) is distinct from its format planecount (in the sense of multi-planar Y′CBCR formats). In VkImageAspectFlags, each VK_IMAGE_ASPECT_MEMORY_PLANE_i_BIT_EXT represents a memory plane and each VK_IMAGE_ASPECT_PLANE_i_BIT a format plane.
03:58 Sachiel: https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#VkDrmFormatModifierPropertiesEXT
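(Editorial sketch of how the memory-plane count Sachiel refers to can be queried, via vkGetPhysicalDeviceFormatProperties2 with a VkDrmFormatModifierPropertiesListEXT chained in; assumes VK_EXT_image_drm_format_modifier is supported, and pdev/format plus the printing are illustrative only:)

    #include <inttypes.h>
    #include <stdio.h>
    #include <vulkan/vulkan.h>

    /* print the number of *memory* planes per DRM modifier for a format;
     * this is distinct from the format plane count of multi-planar YUV */
    static void print_memory_planes(VkPhysicalDevice pdev, VkFormat format)
    {
        VkDrmFormatModifierPropertiesListEXT mod_list = {
            .sType = VK_STRUCTURE_TYPE_DRM_FORMAT_MODIFIER_PROPERTIES_LIST_EXT,
        };
        VkFormatProperties2 props = {
            .sType = VK_STRUCTURE_TYPE_FORMAT_PROPERTIES_2,
            .pNext = &mod_list,
        };
        /* first call fills in drmFormatModifierCount */
        vkGetPhysicalDeviceFormatProperties2(pdev, format, &props);
        if (!mod_list.drmFormatModifierCount)
            return;

        VkDrmFormatModifierPropertiesEXT mods[mod_list.drmFormatModifierCount];
        mod_list.pDrmFormatModifierProperties = mods;
        /* second call fills in the per-modifier properties */
        vkGetPhysicalDeviceFormatProperties2(pdev, format, &props);

        for (uint32_t i = 0; i < mod_list.drmFormatModifierCount; i++)
            printf("modifier 0x%" PRIx64 ": %u memory plane(s)\n",
                   mods[i].drmFormatModifier,
                   mods[i].drmFormatModifierPlaneCount);
    }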
04:42 Company: fun fact: Disabling optimizations when generating spirv with glslang/glslc can result in significantly faster shader code
04:43 Company: almost as fast as zink now
06:19 daniels: soreau: could you please file an MR?
08:18 soreau: daniels: Now that I thought about it, it needs zmike's MR to be merged first since it's against the egl branch of !24700
08:44 soreau: daniels: also I'd like some feedback on fixing the crash http://pastie.org/p/6tqXQok1L7hJhIMv7mCQa9/raw with resizing weston-simple-egl regardless of vblank_mode/swap_interval http://ix.io/4GKs
09:17 soreau: in that trace, wsi_wl_surface is NULL because it's using the old swapchain after it has been replaced with a new swapchain but before it was pruned
09:18 soreau: attempts to simply prune the old swapchain before it was used were unsuccessful
09:32 karolherbst: HdkR: building bindgen takes like a minute
09:54 karolherbst: dcbaker: okay.. so the arch container simply had bindgen 0.64 and there the static inline stuff is mostly broken..
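(For context: bindgen's experimental --wrap-static-fns mode emits a C file of extern wrappers so Rust can call static inline functions, and that file only compiles if it sees the original definitions. A sketch of what it needs to contain; my_header.h and foo are made up for illustration, and the #include is the part karolherbst found missing from the generated file:)

    /* extern wrapper file as produced by bindgen --wrap-static-fns (sketch) */
    #include "my_header.h"   /* defines: static inline int foo(int);
                              * without this include, nothing compiles */

    /* out-of-line symbol the Rust bindings can actually link against */
    int foo__extern(int x) { return foo(x); }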
11:00 daniels: soreau: the patch looks reasonable to me, but it would need zmike to comment on it since the flush_queue is zink-specific
11:01 soreau: daniels: unfortunately, zmike said he can't reproduce the resize crash
11:01 soreau: but no telling what sort of dragons he has installed ;)
11:02 soreau: or maybe it's only on radv? idk which chip he tried
11:05 daniels: right, but he's the one who knows whether flushing the flush queue at that point is theoretically correct or not
11:05 daniels: I didn't even know zink had its own internal queue of stuff
12:26 Wallbraker: Does https://docs.mesa3d.org/envvars.html#envvar-MESA_VK_TRACE do anything? Is it a newish variable?
12:34 pixelcluster: Wallbraker: it does something on 23.2 and newer
12:35 pixelcluster: before that, the different trace modes had different environment variables to enable them, but the docs are generated from latest main
12:35 Wallbraker: Ah yes, I'm on 23.1.7, that clears that up. Really those things should have version tags so you know which versions they apply to.
13:04 zmike: soreau: I slept on it and I think https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25288 is what you need
13:08 soreau: zmike: Thanks, I'll try it. Were you ever able to reproduce the resize crash? (on radv?)
13:08 zmike: eventually
13:09 soreau: super good
13:15 DavidHeidelberg[m]: running GPT-2 on iris (TGL): ~ 112ms per query; llvmpipe (i7-1185G7 @ 3.00GHz): ~150ms (but CPU heats up quickly, so power consumption would be very different)
13:16 DavidHeidelberg[m]: outcome: OpenCL may not be the best, but it's still better than the CPU on an integrated GPU with shared memory for LLVM
13:16 DavidHeidelberg[m]: *LLM damn
13:20 soreau: zmike: yes it works
13:21 soreau: zmike: did you get a chance to glance at the swap interval patch?
13:25 karolherbst: DavidHeidelberg[m]: wondering how well it runs on Intel's stack
13:25 karolherbst: but yeah..
13:25 karolherbst: on Intel's iGPUs it's mostly about power efficiency
13:26 karolherbst: and.. there might be some overhead randomly :D
13:26 karolherbst: always hard to tell with tests running quickly
13:31 zmike: soreau: no, but I'm also probably not the ultimate reviewer for such a thing
13:31 DavidHeidelberg[m]: karolherbst: RUSTICL_ENABLE=iris GPU=1 PYTHONPATH="." JIT=1 python examples/gpt2.py --model_size=gpt2 --prompt "Hello." --count 46 --temperature 0 --timing
13:33 karolherbst: heh...
13:33 karolherbst: DavidHeidelberg[m]: what repo do I need to clone?
13:33 DavidHeidelberg[m]: https://github.com/tinygrad/tinygrad/
13:33 karolherbst: ahh, it's tinygrad
13:33 soreau: zmike: ok
13:38 karolherbst: DavidHeidelberg[m]: looks like intel is a bit faster :D but I'm also using a debug build
13:39 karolherbst: What is Rusticl?: Rusticl is a library for building and manipulating Rusticl objects. It is a library for building and manipulating Rusticl objects. It is a library for building and manipulating Rusticl objects. It is a
13:39 karolherbst: oh well..
13:39 karolherbst: but that question breaks with rusticl
13:39 karolherbst: "ValueError: probabilities do not sum to 1" :D
13:39 karolherbst: guess I might want to fix that
13:40 karolherbst: works with llvmpipe...
13:40 DavidHeidelberg[m]: Interesting :D
13:41 DavidHeidelberg[m]: Usually so far rusticl w/ iris worked best for me
13:42 karolherbst: I break it way too easily
13:42 karolherbst: but yeah...
13:42 karolherbst: I'm sure it's a 0.9999 vs 1.0 thing
13:42 karolherbst: and some precision is slightly off
13:42 karolherbst: let's see...
13:43 karolherbst: yeah.. works with count 45 :D
13:44 DavidHeidelberg[m]: Iris doing trolololo
13:44 DavidHeidelberg[m]: Which gen?
13:44 karolherbst: but yeah.. Intel's stack is significantly faster.. I guess there is something I should optimize then
13:44 karolherbst: 9.5
13:45 karolherbst: "Rusticl is an EULA agreement-based and open source project which aims to provide a framework for building web apps based on Rust. This means that the project is open source."
13:45 daniels: well, those are certainly all words
13:46 karolherbst: it's fascinating how sure it is and how much garbage it is saying
13:46 karolherbst: *sure of itself
13:46 tnt: mmm, I just checked pocl, rusticl/iris, rusticl/llvmpipe, and intel compute, and yeah, rusticl/iris is quite a bit slower than the other 3 for some reason.
13:47 karolherbst: probably a debug build thing
13:47 DavidHeidelberg[m]: it's GPT-2, with llama (but it takes 13G of VRAM) it's "sometimes" reasonable
13:47 karolherbst: clinfo is also 3x quicker with release builds
13:47 DavidHeidelberg[m]: yeah, I'm using Debian nightly builds (so release)
13:48 karolherbst: ahh
13:48 tnt: I should be on a release build too
13:49 karolherbst: anyway.. I'm sure there are optional CL extensions or something
13:49 karolherbst: or the runtime overhead is just too high
13:49 karolherbst: I think I broke it: https://gist.githubusercontent.com/karolherbst/c5f1ccf96268327801f3bafa88a8a770/raw/7022887aee7013cd0b77c32637fa8801572c3159/gistfile1.txt
13:50 karolherbst: but glad to know that it also produces same nonsense with Intel
13:50 karolherbst: so the math seems to be alright
14:03 karolherbst: nice.. seems to work with zink as well... kinda
14:03 karolherbst: I kinda have to find a solution for the gpu getting reset too quickly
14:04 karolherbst: and I get tons of "MESA: error: zink: couldn't allocate memory: heap=4 size=20102"
14:04 karolherbst: oh well...
14:24 karolherbst: okay.. yeah it's mostly GPU side things ...
14:25 karolherbst: and some memcpies, but whatever
14:25 karolherbst: 83% is just the python runtime
14:27 karolherbst: GPU is 90% busy, so there is that
14:33 karolherbst: and it's always creating a context with profiling enabled.. oh boy.. anyway, I suspect there's something we can optimize on the kernel side somewhere
15:31 karolherbst: mhhh.. maybe I have some synchronization bug somewhere... zink is also hitting something all the time
15:54 karolherbst: I wonder if we want to do something about shaders like this: https://gist.githubusercontent.com/karolherbst/55db26f796f9b5bde6ea213d0c255346/raw/1b281ff277964c51c0f0f07765a05255068df00b/gistfile1.txt
16:24 karolherbst: oof... ralloc_asprintf accounts for 33% of my launch kernel overhead...
16:32 karolherbst: "SIMD16 skipped because workgroup size 1 already fits in SIMD8" getting tons of those
16:33 karolherbst: or "SIMD32 skipped because workgroup size 16 already fits in SIMD16"
16:34 karolherbst: Kayden: ^^ seems like this is a significant CPU overhead for launching compute jobs
16:34 karolherbst: like.. the biggest part of it
16:35 karolherbst: might make sense to skip those ralloc_asprintf calls
16:35 karolherbst: inside brw_simd_should_compile that is
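(Not actual brw code, just the shape of the suggestion: make the skip-reason string lazy so hot dispatch paths don't pay for formatting nobody reads. Everything here, including the SIMD_DEBUG env var and note_simd_skip, is a made-up stand-in; in Mesa the check would be something like an INTEL_DEBUG-style query:)

    #define _GNU_SOURCE
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* stand-in for a real debug-flag check */
    static bool simd_debug_enabled(void)
    {
        static int cached = -1;
        if (cached < 0)
            cached = getenv("SIMD_DEBUG") != NULL;
        return cached;
    }

    /* record why a SIMD width was skipped, but only pay for the
     * string allocation when the message can actually be seen */
    static void note_simd_skip(char **reason, unsigned width,
                               unsigned group_size, unsigned prev_width)
    {
        if (!simd_debug_enabled()) {
            *reason = NULL;
            return;
        }
        if (asprintf(reason,
                     "SIMD%u skipped because workgroup size %u already fits in SIMD%u",
                     width, group_size, prev_width) < 0)
            *reason = NULL;
    }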
16:42 jrpan: Hi, I'm new to graphics rendering and drivers, so please forgive me for asking dumb questions. I've been playing with the vulkan intel driver. I see that when anv_cmd_buffer_bind_descriptor_set is called, the descriptor set is saved to the cmd_buffer at cmd_buffer->state.gfx.base->descriptors[set_index].
16:42 jrpan: If there are multiple draws in a command buffer, and new descriptors are bound for each draw, how are the descriptors distinguished between drawcalls within the same command buffer (when the buffer is submitted to the GPU)?
16:43 jrpan: Because it's the same command buffer, and a later bound descriptor would just overwrite a previously bound one. Or is my understanding just wrong?
16:45 jrpan: I just want to reference all descriptors that are being used in a command buffer at queuesubmit. But so far I can only get the last descriptor used.
16:45 pendingchaos: most likely vkCmdDraw/etc read from state.gfx.base->descriptors to create commands that actually bind the descriptors
16:47 jrpan: So at vkCmdDraw, the state.gfx.base->descriptors is being "saved" to the vkcmdDraw?
16:47 Sachiel: anv_CmdBindPipeline in anv_cmd_buffer.c takes care of that
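(A hypothetical pseudo-C sketch of the pattern pendingchaos and Sachiel describe, not actual anv code: bind calls only touch CPU-side state, and each draw emits GPU commands from whatever is bound at that moment, so a later bind can't clobber an earlier draw:)

    #include <stdint.h>

    #define MAX_SETS 8
    #define DESCRIPTORS_DIRTY (1u << 0)

    struct descriptor_set;                        /* opaque here */

    struct cmd_buffer {
        struct descriptor_set *descriptors[MAX_SETS];
        uint32_t dirty;
    };

    /* provided elsewhere: append real GPU commands to the batch */
    void emit_binding_tables(struct cmd_buffer *cb);
    void emit_3dprimitive(struct cmd_buffer *cb);

    void BindDescriptorSet(struct cmd_buffer *cb, unsigned idx,
                           struct descriptor_set *set)
    {
        /* CPU-side bookkeeping only; nothing reaches the GPU yet */
        cb->descriptors[idx] = set;
        cb->dirty |= DESCRIPTORS_DIRTY;
    }

    void Draw(struct cmd_buffer *cb)
    {
        /* the draw snapshots the *current* bindings into GPU commands,
         * so each draw in the buffer keeps its own descriptors */
        if (cb->dirty & DESCRIPTORS_DIRTY)
            emit_binding_tables(cb);
        emit_3dprimitive(cb);
        cb->dirty &= ~DESCRIPTORS_DIRTY;
    }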
16:51 mareko: tarceri_: I've noticed that gl_program::sh::UniformBlocks is populated before UBO linking for GLSL, but not SPIR-V. If I add UBOs before UBO linking, do I also need to update gl_program::sh::UniformBlocks?
16:57 Kayden: karolherbst: that isn't handled by shader cache?
17:04 karolherbst: Kayden: nope, that's all at the SIMD selection level
17:06 karolherbst: Kayden: iris_launch_grid -> iris_upload_compute_state -> iris_upload_compute_walker -> brw_cs_get_dispatch_info -> brw_simd_select_for_workgroup_size -> brw_simd_should_compile -> ralloc_asprintf
17:08 karolherbst: guess that's only hit if the workgroup size is variable...
17:53 karolherbst: Kayden: btw, any ideas what could be improved here? https://gist.githubusercontent.com/karolherbst/bb1ba3fa4c680dd04639357d18782959/raw/9012fd6756b6332c7ae2c938a34e5cfbad82fb17/gistfile1.txt
17:53 karolherbst: it kinda feels all a bit suboptimal
17:53 karolherbst: but I can't really put my finger on it yet
17:54 karolherbst: ehh
17:55 karolherbst: nvm, maybe I should check what kernels are actually used often
17:55 karolherbst: but I'm wondering if there are any magic intel instructions we might want to wire up to speed up matrix multiplications
17:57 karolherbst: I kinda need GPU perf profiling tools...
19:01 anholt: hakzsam: I've got some wip ci stuff that I'd like to have the 1.3.6 cts for. are you still working on polishing that today?
19:44 anholt: hakzsam: I guess we're waiting for !25284 for cts uprev?
21:47 karolherbst: okay
21:48 karolherbst: I have a fun idea for an optimization of compute kernels
21:48 karolherbst: https://gist.githubusercontent.com/karolherbst/d9cb39f00329014550eacca62536544a/raw/78bdc9bc7f9080d0345e99de765727ddd4882eb6/gistfile1.txt
21:48 karolherbst: so all threads execute the lower loop, which is doing the exact same thing in all threads
21:48 karolherbst: and stores the same result at the same location in all threads
21:48 karolherbst: I wonder if we could just vectorize or optimize that code in a way that it's not doing something silly like this
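(The pattern, boiled down to a hypothetical OpenCL C kernel rather than the actual gist: the loop is uniform across the workgroup, so every work-item computes the identical sum and performs the identical store:)

    __kernel void silly(__global const float *in,
                        __global float *out, int n)
    {
        __local float acc;
        float sum = 0.0f;
        for (int i = 0; i < n; i++)   /* uniform loop, same in every thread */
            sum += in[i];
        acc = sum;                    /* same value, same location, once per thread */
        barrier(CLK_LOCAL_MEM_FENCE);
        out[get_global_id(0)] = acc;
    }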
22:05 karolherbst: mhhh
22:05 karolherbst: this entire loop could be an atomic... actually.. the entire shared memory array could be one atomic
22:06 karolherbst: I wonder how terrible it would be to match this pattern...
22:15 airlied: uggh yeah I wonder if tinygrad itself could do better in its generator code there
22:15 karolherbst: the pain point is that it is shared code with the CPU stuff...
22:16 karolherbst: somehow
22:18 karolherbst: but I do wonder if we actually just want to match that pattern, because it's actually not _that_ complicated
22:18 karolherbst: the loop is uniform, which we should be able to prove
22:18 karolherbst: and if every thread stores it at the same location, we should do something smarter
22:19 karolherbst: yeah
22:19 karolherbst: `con 32 %54 = fmul! %62, %53 (0.000781)`
22:19 karolherbst: so yeah.. nir detects it as uniform
22:20 karolherbst: the biggest question is simply what we should optimize this to
22:22 karolherbst: maybe subgroup ops and just cut that loop by the subgroup size?
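(A sketch of that idea as a drop-in replacement for the uniform loop in the kernel above, assuming cl_khr_subgroups and untested: stride the loop by the subgroup size, fold the partial sums with a subgroup reduction, then store once:)

    /* cooperative replacement for the uniform loop */
    float sum = 0.0f;
    for (int i = get_sub_group_local_id(); i < n; i += get_sub_group_size())
        sum += in[i];
    sum = sub_group_reduce_add(sum);          /* cl_khr_subgroups */
    if (get_sub_group_local_id() == 0)
        acc = sum;                            /* a single store instead of n */
    barrier(CLK_LOCAL_MEM_FENCE);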
22:23 karolherbst: but yeah.. if LLVM is able to optimize this to something not stupid, no wonder Intel is faster