02:55luc: cwabbott: I noticed that glibc started using SIMD/FP registers in https://sourceware.org/git/?p=glibc.git;a=commit;h=e6f3fe362f1aab78b1448d69ecdbd9e3872636d3, but my test is on an older glibc version (2.31) which does NOT contain instructions like `ldr q0, [src]`. So I guess __memcpy_aarch64_simd() might not be faster than normal LD/ST if the destination of memcpy is VRAM
03:09HdkR: ldr q0 is asimd
03:09HdkR: Oh, does not contain those
08:13MrCooper: karolherbst: BTW, any reason rusticl on radeonsi couldn't be tested in CI?
08:47airlied: MrCooper: getting CL cts into a useful form, though we could just pick a couple of main tests
08:48MrCooper: I was thinking of piglit
08:48MrCooper: that should have caught this regression and a few before at least
09:20pq: jani, what if you forward-declare an enum, use it in struct definition, and then include the definition of the enum which results in a different size? Or maybe you just copy the struct without ever having the enum definition?
09:23emersion: pq: it appears the compiler remembers the enum has unknown size, and will require a size definition before you can do these things
09:24jani: pq: it's an incomplete type similar to a struct/union forward declaration, and you can't use it before you know the size
09:24jani: https://gcc.gnu.org/onlinedocs/gcc/Incomplete-Enums.html
09:26pq: cool, thanks!
09:31jani: I know it's a bit hacky, but if you can use it to avoid pulling in some headers everywhere, I'll use it
09:32emersion: it's non-standard, so i won't use it
09:33jani: fair, though the kernel is explicitly non-standard
09:34jani: I'd also avoid it outside of the kernel
09:34emersion: yeah
11:25tomba: Looking at the atomic helpers, the current sequence when enabling the video pipeline is: crtc enable, bridge pre_enable, encoder enable, bridge enable. The crtc's enable happening before the bridge's pre_enable strikes me as a bit odd, especially as the bridge's pre_enable documentation says "The display pipe (i.e. clocks and timing signals) feeding this bridge will not yet be running when this callback is called". Anyone have insight on why the sequence has evolved to be such? Does the DRM framework expect that there's always an encoder which will somehow gate the signal from the CRTC until the encoder enable is called?
11:47karolherbst: MrCooper: not really
11:47karolherbst: piglit is better than nothing I guess, but I also kinda want the CL CTS to be tested, it's just a pain to do
11:48mupuf: karolherbst: what's different about cl vs gl/gles/vk?
11:49karolherbst: mupuf: it's a bunch of binaries with no consistent way for fetching subtests, see my own runner dealing with that nonsense: https://gitlab.freedesktop.org/karolherbst/opencl_cts_runner/-/blob/master/clctsrunner.py
11:49mupuf: thanks!
11:49karolherbst: `def create(cls, id, file):` specifically
11:50karolherbst: so I have a bunch of different regexes to parse the help thing.. and one test has a special option for it
11:50karolherbst: it's all messy
11:50mupuf: wow
11:51karolherbst: I think it would be easier to fix the CTS to be more consistent here instead 🙃 probably
11:51karolherbst: yeah.. the image ones are really crazy as they have flags you can pass in
11:51karolherbst: like the image format + order and such as well
11:51karolherbst: it's nice for testing, but a pain for creating such a list
11:52mupuf: https://www.khronos.org/conformance/adopters/conformant-products/opencl <-- it is surprising to still see submissions here
11:52karolherbst: _maybe_ it would make sense to write a binary/lib translating from deqp style naming to CL CTS
11:52karolherbst: mupuf: why surprising though? :D
11:52mupuf: I thought no one cared
11:52karolherbst: Intel is pretty big in CL
11:53karolherbst: and arm as well
11:53karolherbst: nah
11:53karolherbst: that was like 5-10 years ago
11:53karolherbst: today they care more
11:53karolherbst: the CL WG is pretty active even
11:53karolherbst: currently people work on making it more vulkan like by adding command buffers and stuff
11:53mupuf: so many conformance results, so few CL apps
11:53karolherbst: cl_khr_command_buffer
11:54mupuf: I see, good to hear
11:54karolherbst: mupuf: davinci resolve is one CL app :D
11:54karolherbst: it's a pro video editing tool
11:54mupuf: right
11:54karolherbst: but yeah.. it's more used in professional apps than like linux desktop ones
11:54karolherbst: photoshop also uses CL for some stuff?
11:54mupuf: and I guess some AI stuff may be using CL
11:55karolherbst: otherwise in the foss world you have darktable being able to use CL
11:55karolherbst: ahh yeah
11:55karolherbst: openvino uses CL on intel
11:55karolherbst: it's a framework doing ONNX-based AI/ML stuff
11:55karolherbst: and can be used to some degree with tensorflow/pytorch/etc...
11:56karolherbst: mupuf: the sad part is simply that besides CUDA we only have CL as a cross vendor API which doesn't suck
11:56karolherbst: though that _might_ change with the CXL stuff
11:56mupuf: SYCL was also supposed to change that
11:56karolherbst: uhm..
11:56karolherbst: ha
11:56karolherbst: no
11:57karolherbst: SYCL is C++ only for starters, and SYCL is a _compile time_ API, there is no runtime specified
11:57karolherbst: sooooo
11:57mupuf: :o
11:57karolherbst: if your toolchain only supports AMD GPUs, your app only supports AMD GPUs
11:57karolherbst: luckily the toolchain intel worked on layers on top of CL
11:57karolherbst: which brings us back to CL anyway
11:58karolherbst: so yeah....
11:58karolherbst: mupuf: that's kinda what's going on atm: https://www.linuxfoundation.org/press/announcing-unified-acceleration-foundation-uxl
11:58karolherbst: but that just formed
11:59karolherbst: and it's unknown what comes out of it
11:59mupuf: Ah, UXL, not CXL
11:59mupuf: that was confusing ;)
11:59karolherbst: ahh yeah.. my bad
11:59mupuf: np
12:00mupuf: too many three letter acronyms
12:00mupuf: Thanks Karol, keep up the good work!
12:00karolherbst: :) thanks
12:01karolherbst: but yeah...
12:01karolherbst: I kinda want to test the CL CTS in CI
12:03karolherbst: actually I kinda like my deqp adapter idea....
12:05mupuf: that would be an easy path forward, yeah
12:06karolherbst: just also slow if it simply would `execv` the binaries...
12:06mupuf: deqp is extremely slow at test enumeration
12:06mupuf: so don't worry
12:06karolherbst: I wonder if one could `dlopen` them and just call into their `main` function instead...
12:07mupuf: how many tests are there, and how long is a typical runtime?
12:07karolherbst: all tests in non wimpy mode are like between 3 and 70 hours apparently
12:07karolherbst: 70 as that's what jenatali needed for a full CL CTS run
12:08karolherbst: wimpy reduces the number of iterations in arithmetic tests
12:08karolherbst: if I pass `wimpy` and `quick` into my runner it's like 10 minutes parallelized
12:08karolherbst: but it skips a bunch of corner case tests
12:08karolherbst: subtests without taking image formats/order into account is roughly 2500
12:08mupuf: I see
12:09karolherbst: but you could split those up if you wanted to
12:09karolherbst: probably around 5000 then
12:09mupuf: sure, but let's just say that it is a little silly for it to take that long
12:09mupuf: just like igt used to take a month to run
12:09karolherbst: it's testing a lot of stuff
12:09karolherbst: like
12:09karolherbst: arithmetic precision
12:10karolherbst: and it iterates over a bunch of random values just to make sure the runtime is okay there
12:10mupuf: right, but knowing what to test is an important thing too
12:10karolherbst: yeah
12:10mupuf: I'm sure the wimpy mode is a good start anyway
12:10karolherbst: yeah
12:10karolherbst: it's good enough
12:11karolherbst: it doesn't catch all subnormal/nan related corner cases, but whatever
12:11mupuf: 10 minutes means it could run on one runner
12:11karolherbst: that's 10 minutes on a 20 core machine
12:11mupuf: is it cpu-limited?
12:11karolherbst: yes
12:11karolherbst: compiling a lot of C code
12:11karolherbst: well some tests
12:11karolherbst: some tests are GPU limited
12:11mupuf: we have 16 cores runners
12:11karolherbst: I could do a 4 core run and see how that changes things
12:12mupuf: (5950X, for navi21/31)
12:12mupuf: 4 cores run? It would only make sense for lavapipe
12:12mupuf: For bare metal runners, our slowest runners are the steam decks
12:12karolherbst: yeah... I just want to see how slow things become if you limit on the CPU side aggressively
12:13karolherbst: but it's kinda wild.. some tests utilize the GPU at 100%, others at like 0.1%
12:14karolherbst: and most isn't even runtime, it's just validation on the CTS side
12:14karolherbst: as you know... it calculates the same thing also on the CPU for checking the result
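The CPU-side validation described above can be sketched as follows. This is a minimal illustration, not the CL CTS implementation: the helper names (`to_f32`, `ulp_distance`, `count_failures`) and the `max_ulp` parameter are assumptions made for this sketch, and NaN handling is left out. It recomputes each result on the CPU and compares it with the device value by distance in float32 ULPs.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def _ordered_bits(x):
    """Map a float32 to an integer so that adjacent floats differ by 1."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Negative floats sort in reverse bit order, so mirror them below zero.
    return bits if bits < 0x80000000 else 0x80000000 - bits


def ulp_distance(a, b):
    """Number of representable float32 steps between a and b."""
    return abs(_ordered_bits(a) - _ordered_bits(b))


def count_failures(inputs, device_results, reference, max_ulp):
    """Recompute every result on the CPU and count out-of-tolerance values."""
    failures = 0
    for x, got in zip(inputs, device_results):
        want = to_f32(reference(x))
        if ulp_distance(got, want) > max_ulp:
            failures += 1
    return failures
```

Every checked value costs one CPU evaluation of the reference function, which is in line with the observation that most of the time goes into validation on the CTS side.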
12:21karolherbst: at least on intel CPU util idles around 8% while running the CTS :')
12:21karolherbst: (with 4 cores)
12:22karolherbst: now it's 0% :')
12:23mupuf: ha ha, you are latency bound :D Everything just keeps waiting on each other
12:24karolherbst: mhhh
12:24karolherbst: doubt
12:24karolherbst: my cores are still at 400% in userspace
12:24karolherbst: ehh
12:24karolherbst: each one at 100%
12:25karolherbst: I mean.. the CL CTS for 90% of its time runs the same code on the CPU and checks if the result is correct, so if it wouldn't be CPU bound it would be kinda sad
12:25pq: I suspect a simple misread idle vs. usage % :-)
12:25karolherbst: but it kinda depends on the tests, some are more GPU bound, for non optimal code reasons
12:26karolherbst: pq: no seriously.. some of the tests are just doing 100% math
12:27karolherbst: if you validate multiple 10 thousands results from e.g. `sin` that's what you get
12:27karolherbst: (and all the other builtins)
12:27pq: karolherbst, I think mupuf read "usage" when you said "idle".
12:27karolherbst: ahh I see
12:28mupuf: pq: indeed
12:28karolherbst: ehh wait
12:28karolherbst: I wrote CPU
12:28karolherbst: I meant GPU
12:28karolherbst: duh
12:28pq: lol
12:28mupuf: makes more sense :p
12:28karolherbst: my fault 🙃
12:28karolherbst: mhh but yeah.. limiting to 4 cores kinda doubles the CL CTS runtime on my intel GPU
12:29karolherbst: could be worse
12:30karolherbst: but we also have a wimpy factor option in some tests, which just adjusts how many iterations are done
12:31karolherbst: anyway.. maybe I'll play around with the deqp idea and see how bad it would be
12:33mupuf: karolherbst: sounds like a good idea. deqp support is found in a lot of tools, so if you can keep compatibility to it, it would be easiest
12:33karolherbst: yeah..
12:34mupuf: maybe it could even land in clcts, but that's not a requirement
12:35karolherbst: I wonder how hard it would be to convert all those testing to deqp actually, maybe I should bring it up at the CL WG next year as well.. but that kinda requires everybody else wanting it also :D
12:35karolherbst: but migrating the entire code base is probably quite a bit of work
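A rough sketch of the deqp adapter idea, under stated assumptions: the `cl.<group>.<subtest>` naming scheme, the `test_<group>` binary naming, and the convention of passing the subtest as a positional argument are all hypothetical choices made for this illustration, not something the CL CTS or deqp defines. The function only builds the command line; launching it (via `execv` or otherwise) is left to the caller.

```python
import os


def deqp_name_to_command(name, bin_dir):
    """Translate a hypothetical 'cl.<group>.<subtest>' name into an argv list.

    Assumption for this sketch: each group maps to one binary called
    'test_<group>' in bin_dir, and the subtest is a positional argument.
    """
    parts = name.split(".")
    if len(parts) != 3 or parts[0] != "cl":
        raise ValueError(f"not a cl.<group>.<subtest> name: {name!r}")
    _, group, subtest = parts
    return [os.path.join(bin_dir, "test_" + group), subtest]
```

Keeping the translation separate from process launching would make it possible to swap the `execv` approach for something cheaper later without touching the naming layer.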
12:39tomeu: haven't been following, but if somebody is going to make any big changes to the CTSes, it would be great if caching of golden results was taken into account
12:39tomeu: once I implemented that in my test suite, I started finding concurrency bugs in the kernel driver...
12:42mupuf: tomeu: I guess for images, it can make sense... but for precision/arithmetics tests, I doubt caching would improve performance :s
12:43tomeu: well, anything that computes something expensive in the CPU that is used to compare it with the driver's output
12:44tomeu: guess that is the case if it's CPU bound
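Caching of golden results could be sketched like this. It is a minimal illustration under assumptions: the function name `cached_golden`, the JSON on-disk layout, and the cache key (test name plus parameters) are invented for this sketch and do not describe tomeu's test suite. Results must be JSON-serializable for it to work.

```python
import hashlib
import json
import os


def cached_golden(cache_dir, test_name, params, compute):
    """Return the CPU reference result, computing it only on a cache miss."""
    # The key covers everything the reference result depends on.
    key_src = json.dumps([test_name, params], sort_keys=True).encode()
    path = os.path.join(cache_dir, hashlib.sha256(key_src).hexdigest() + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute(params)
    with open(path, "w") as f:
        json.dump(result, f)
    return result
```

The second run of the same test then skips the expensive CPU computation entirely and only pays for the comparison against the driver's output.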
12:54mupuf: it doesn't have to be expensive, it can be that there are just too many tests. Imagine a test that checks that the gpu can increment a variable... for every format and for every acceptable value in the format
12:54mupuf: that seems to be what clcts is doing in some cases... not much to cache there
12:55mupuf: but... maybe the issue is that the tests are stupid :D
12:55mupuf: and instead a lot of operations could be tested at the same time, and only decomposed if the final result doesn't match expectations
12:56mupuf: but that requires serious work
12:56mupuf: glad to see you are taking cpu time seriously in the design for the test suite!
12:59jenatali: Honestly the CL CTS does at least test things the right way, multithreaded and batching work together. It just does an insane amount of work
13:09mupuf: jenatali: good to hear too!
13:47Venemo: Lynne: ping, I am trying to reproduce the multiplanar issue, can you give me a hand, please? it seems your command produces a video file with 1 frame, and I am unsure how best to view it
13:48Venemo: Lynne: gnome's video app only shows a black screen, and vlc only shows the frame for a very brief time
13:48Venemo: Lynne: however, the output seems to be broken with or without enabling the transfer queue, so I am convinced the issue is not in the transfer queue implementation
14:06Lynne: Venemo: sure
14:07Lynne: try "ffmpeg -init_hw_device vulkan -i test.png -vf format=yuv422p10le,hwupload,hwdownload,format=yuv422p10le,format=rgba -c:v png -y test_output.png"
14:08Venemo: Lynne: looks good
14:08Lynne: compared to the input image?
14:09Venemo: they look the same to me
14:10Lynne: let me test again, maybe it fixed itself
14:10Lynne: which patches do I apply aside from the multiplane one?
14:10Venemo: I tested this branch here: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25594
14:11Venemo: the other branch does not expose the transfer queue yet
14:11karolherbst: mupuf: yeah so on radeonsi I kinda end at 22 minutes
14:14Lynne: segfault in radv_GetPhysicalDeviceQueueFamilyProperties2?
14:15Lynne: I thought we fixed that bug
14:17Lynne: Venemo: ah, the sparse binding queue was added, so the count's off again, https://0x0.st/H3RO.diff fixes it
14:18Lynne: I am seeing corruption here on 10bits: https://0x0.st/H3RV.png
14:19Venemo: okay, give me a moment
14:19mupuf: karolherbst: on what host cpu?
14:19karolherbst: intel i7-12700
14:19mupuf: ack, with 4 threads, right?
14:19karolherbst: yes
14:19karolherbst: and it has the cooling to run at max clock all the time :)
14:20karolherbst: I _could_ check how long it takes on my steamdeck, but then I have to figure out how to do that first
14:21mupuf: yeah, not important
14:22mupuf: I think expecting about 45-60 would be somewhat accurate
14:26jenatali: FL4SHK: you're not authed with NickServ so IRC folks aren't seeing your messages. This is the right place though
14:27pq: emersion, I'd like to wash my hands of that heap naming discussion now, there is nothing useful I could say.
14:28emersion: ahah
14:28emersion: well thanks for your replies, i think they've been useful
14:29pq: glad you think so, I just feel like I'm butting in somewhere that's not my business :-p
14:30FL4SHK[m]: hello
14:30pq: FL4SHK[m], hello, we see you now.
14:30FL4SHK[m]: cool
14:31FL4SHK[m]: I was wondering, how difficult would it be to develop either a GCC version of LLVMPipe for a custom GPU, (given that I know how to write a GCC backend, which I do) or to develop a new Mesa driver in general?
14:32FL4SHK[m]: I am a hardware/software developer and I plan on creating a new FPGA-based workstation.
14:32FL4SHK[m]: I would be happy with something like 500 MHz for the system (doable with my Xilinx ZCU104)
14:33FL4SHK[m]: This would be a passion project but I'd be happy to open source everything
14:35FL4SHK[m]: I typically do that for anything public I make anyway
14:36FL4SHK[m]: at least for stuff made outside of work :)
14:41karolherbst: FL4SHK[m]: gcc doesn't support those kind of use cases
14:42karolherbst: or uhm..
14:42karolherbst: mhh
14:42karolherbst: maybe with libgccjit.so it actually does...
14:42karolherbst: though not sure what input it expects
14:43FL4SHK[m]: I see
14:44karolherbst: but it looks very very C centric
14:44karolherbst: I think it would be a cool research project in terms of "how good is the gcc jit"
14:44karolherbst: but not sure we'd have any plans of merging it unless it has strong benefits over llvmpipe
14:44FL4SHK[m]: I see
14:45FL4SHK[m]: perhaps I could write an LLVM backend for the GPU then
14:45karolherbst: at least having a more stable API/ABI would be a benefit
14:45FL4SHK[m]: that would be nice yes
14:45karolherbst: but in general we kinda prefer having the GPU's backend compiler all inside mesa
14:46karolherbst: as doing GPU backends in gcc and llvm kinda... well.. have their disadvantages and we move away from that
14:46FL4SHK[m]: okay, right
14:47FL4SHK[m]: well, I was actually hoping to make my GPU have really long vectors instead of making a conventional GPU
14:47karolherbst: please don't
14:47FL4SHK[m]: otherwise it'd be like the CPU core
14:47karolherbst: or rather
14:47karolherbst: not inside the ISA
14:47FL4SHK[m]: hm?
14:48karolherbst: vector ISAs have too many drawbacks and everybody moves away from them
14:48FL4SHK[m]: why?
14:48karolherbst: (if they haven't already)
14:48karolherbst: GPU ISAs are mostly scalar
14:48FL4SHK[m]: I see
14:48karolherbst: makes it easier to optimize code
14:48karolherbst: so every SIMD lane is just a scalar program from the ISA point of view
14:48FL4SHK[m]: I could make a regular GPU then
14:48karolherbst: and the hardware just runs e.g. 32 threads in a SIMD group
14:48karolherbst: with each executing the same instruction
14:49karolherbst: makes it easier to parallelize data as you won't run into the issue of "what if you can only use 3 of your 32 SIMD lanes, because of program code"
14:49karolherbst: vectorization within a thread is destined to fail
14:50karolherbst: so you get the most perf if you don't rely on it
14:50karolherbst: however.. some GPUs have e.g. vec2 operations for fp16 or 128 bit wide memory load/stores
14:51FL4SHK[m]: I could do that
14:51FL4SHK[m]: the ISA I was going to go with is designed to reduce memory traffic
14:51karolherbst: but in the end feel free to experiment :D
14:52karolherbst: yeah.. so some ISAs have "scalar" or "uniform" registers which are special cases where one instruction inside the SIMD group has the same result across all lanes
14:52karolherbst: so optimizations like that exist
14:53karolherbst: but that's alu <-> register traffic stuff
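The lane-utilization argument above can be illustrated with a toy cost model. This is a sketch under simplifying assumptions: the function names are made up, one instruction occupies all `width` lanes for one issue slot, and divergence, memory, and scheduling are ignored. It compares a vector-in-the-ISA design, where each thread issues wide instructions and fills only as many lanes as it has components, with a scalar design, where each lane of a SIMD group is a separate thread running the same scalar instruction.

```python
import math


def vector_isa_cost(threads, components, width):
    """Each thread issues its own width-wide instructions."""
    issued = threads * math.ceil(components / width)
    useful = threads * components
    return issued, useful / (issued * width)


def scalar_simt_cost(threads, components, width):
    """Threads are packed into groups; one scalar op per component per group."""
    groups = math.ceil(threads / width)
    issued = groups * components
    useful = threads * components
    return issued, useful / (issued * width)
```

For 64 threads each doing a 3-component operation on 32-wide hardware, the vector model issues 64 instructions with under 10% of lanes doing useful work, while the scalar model issues 6 instructions with every lane busy.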
15:06Venemo: Lynne: I still don't see it
15:06Venemo: Lynne: what GPU do you use?
15:07Lynne: 7900XTX, give me a sec to try on a 6900XT
15:13Venemo: Lynne: I am also trying on 7900XTX
15:13Venemo: sorry but the issue just doesn't happen here
15:13Venemo: are you sure we are doing the same thing?
15:15Lynne: you're using the sample image I'm using, right: https://gitlab.freedesktop.org/mesa/mesa/uploads/2ffa09962eb83f2e1f7de2d919b549ec/test.png
15:15Venemo: exactly
15:16Lynne: you're running "ffmpeg -init_hw_device vulkan -i test.png -vf format=yuv422p,hwupload,hwdownload,format=yuv422p,format=rgba -c:v png -y test_out.png"
15:16Lynne: with export RADV_PERFTEST=transfer_queue
15:16Venemo: I tried both with and without, both commands produce the same output
15:17Lynne: no green stripe across the image?
15:18Venemo: Lynne: this is how it looks here: https://i.imgur.com/B08PvuU.jpg
15:18Venemo: I have ffmpeg-6.0.1-3.fc39.x86_64 in case that matters
15:19Lynne: it does, that's before the large vulkan patchset was merged
15:19Venemo: let me try again
15:19Lynne: could you update to 6.1?
15:19Venemo: ehh
15:20Venemo: there doesn't seem to be a fedora build for ffmpeg 6.1 yet, what is the easiest way for me to get it?
15:21Lynne: git clone, ./configure && make build?
15:21Venemo: ehh, okay
15:22Lynne: on the bright side, it's less time to compile than a minimal mesa and doesn't depend on llvm
15:23Venemo: Lynne: what package do I need for this one? nasm/yasm not found or too old. Use --disable-x86asm for a crippled build.
15:23kwizart: Venemo, 6.1 is in rpmfusion for f40 (you could use a container or chroot ?)
15:24Lynne: Venemo: nasm
15:26Venemo: kwizart: I would prefer not to
15:26Venemo: Lynne: got it, it's building now
15:28Venemo: Lynne: awesome, I got the green stripe now
15:29Lynne: it should disappear if you remove the transfer_queue perftest
15:30Venemo: it does indeed
15:35Lynne: it also does happen on 6900XT, but the stripe is twice as long as on the 7900
15:36Venemo: whatever the issue is, it's probably the same problem
15:36Venemo: on both GPUs
15:41tomeu: gfxstrand: what would you think of adding some of these instructions to NIR? https://www.tensorflow.org/mlir/tfl_ops
15:41tomeu: it doesn't need to be this dialect, can be something different, but of equivalent functionality and level of abstraction
15:45gfxstrand: tomeu: A bunch of those we already have.
15:46gfxstrand: tomeu: The first question that comes to mind is how big is a tensor?
15:46gfxstrand: NIR vectors are currently limited to the SPIR-V limit of 16
15:46gfxstrand: With some limitations on exactly what sizes are allowed but those can probably be lifted.
15:46karolherbst: ~~question is for how long still~~
15:46tomeu: they can be megabytes big, but I'm not sure a tensor can be mapped as a vector
15:47gfxstrand: tomeu: Oh... Okay, that changes my mental model.
15:48gfxstrand: So what is a tensor then? Is it an opaque object that's backed by memory somewhere?
15:51tomeu: yep, with attributes such as dimensions (4 is common) and data type
15:52tomeu: guess I should investigate a bit more what others are doing for generating machine code from MLIR, I just got really frustrated by having to reinvent NIR in my NPU driver
15:52tomeu: there is a cute graph at https://www.tensorflow.org/mlir/overview
15:53tomeu: there is a mlir-to-spirv translator out there, but I'm not sure what is the level of abstraction of the output
15:54tomeu: ie. if convolution operations have been lowered to CL spirv or are still there
15:58Venemo: Lynne: interesting. it would seem that there are 3 buffer->image and image->buffer copies and the 3rd copy seems to miss a part of the image
16:00gfxstrand: tomeu: So, my gut feeling is that if you do it all with intrinsics, find yourself a suffix you're happy with and you can make as many as you'd like.
16:02tomeu: ok, I will play with it after holidays and comment back
16:02gfxstrand: It's unclear to me how tensor ops would fit into NIR long-term.
16:03gfxstrand: If it's a good match, we may want to add a new op type for tensor ops and make them more first-class.
16:03gfxstrand: My biggest fear is that tensor NIR will end up looking so different from regular NIR that we might as well have different IRs.
16:04gfxstrand: But I haven't thought about it hard enough for that fear to be an opinion. It's more of a "Hey, there's a mountain over there and I've heard rumors of dragons so, uh... watch out!"
16:08Lynne: Venemo: luma plane looks fine, so it's the chroma planes
16:16tomeu: yeah, I also see the tensor type as the main difficulty here
16:20gfxstrand: If it remains an opaque thing, that's easy enough.
17:19Venemo: Lynne: only the 3rd thingy seems wrong
17:19Venemo: but I don't yet see why
17:43Lynne: in case you're wondering why you couldn't replicate on 6.0: it doesn't use multiplane images (ever)
17:45FL4SHK[m]: is LLVMPipe not going to be supported in the future?
17:46koike: o/ I'm trying to run dim setup, but it fails to update the rerere cache and asks me to run git branch --set-upstream-to=<remote>/<branch> rerere-cache. I already removed the rerere-cache branch and worktree to see if it would fix it, but no luck. I'm new to the dim tool so I was wondering if anyone could give me pointers here
17:49koike: (never mind, looks like the branch didn't really get removed, it seems it worked now, sorry for the noise and thanks for :rubberduck: xD )
17:57jenatali: FL4SHK: LLVMPipe runs on the CPU, not a GPU
17:58jenatali: There exist drivers that use LLVM for generating GPU code. I think it's just radeonsi at this point. But it has nothing to do with LLVMPipe
17:58FL4SHK[m]: Okay
17:58Venemo: Lynne: how is the buffer uploaded to the GPU?
17:58kisak: Intel has a couple fingers into llvm (OpenCL?)
17:59FL4SHK[m]: How difficult would it be to develop a Mesa driver for a new GPU then?
17:59FL4SHK[m]: I'm assuming it'd be hard...
18:00jenatali: LLVM is used in frontends like rusticl, yeah. I dunno how much it's used for GPU backends, especially from Mesa
18:00jenatali: FL4SHK: It depends
18:00gfxstrand: Baseline is "hard". It only goes up from there depending on hardware and what APIs you want to support.
18:00gfxstrand: I mean, multiple highschoolers have successfully written Mesa drivers, so...
18:01FL4SHK[m]: Hm
18:01gfxstrand: But also I've been head down on NVK for 1.5 years and we're just now starting to play games and I'm one of the best there is.
18:01idr: Totally average, every day high school students...
18:01gfxstrand: From Normal High
18:01FL4SHK[m]: I see
18:01FL4SHK[m]: I'll keep that in mind
18:02gfxstrand: What are you wanting to make a driver for?
18:02FL4SHK[m]: a custom FPGA-based GPU
18:02FL4SHK[m]: I have part of the instruction set written up
18:02FL4SHK[m]: my goal is to have a 500 MHz workstation
18:03FL4SHK[m]: which should be possible with the hardware I've got
18:03FL4SHK[m]: A Xilinx ZCU104
18:03FL4SHK[m]: I know it's a lot of work
18:08Lynne: Venemo: memory map image on RAM to a vkbuffer, then vkbuffer->vkimage copy
18:08Lynne: same but in reverse for downloads
18:09Venemo: Lynne: is it possible that there is a sync bug in there somehow?
18:09Lynne: validation passes
18:10Lynne: we do a barrier before each copy too
18:10Venemo: I'm not sure if that is relevant here. by the same logic I could say radv passes the cts
18:10Lynne: (it doesn't?)
18:10Venemo: it does
18:11Venemo: or what do you mean?
18:11Lynne: nothing
18:11Lynne: disabling host-mapping and falling back to a RAM->vkbuffer + vkbuffer->vkimage copy doesn't help
18:11Venemo: it is very curious that only some middle part of the image is missing and the rest is correct
18:12Lynne: it's always the same part too, everywhere, so it's not a sync issue, I think
18:12Lynne: it seems like it could be alignment related somehow, though not sure
18:13Venemo: alignment of what?
18:15heat: gfxstrand, was NVK considerably harder cuz nvidia? or do you reckon it didn't matter much?
18:15heat: well, doesn't, it's still ongoing work ofc :)
18:15Venemo: Lynne: what is very peculiar here is that it fails the same way even if I force the code to copy the image line-by-line
18:16gfxstrand: heat: It's hard because we're going straight to "can play D3D11 and D3D12 games"
18:17gfxstrand: If you just want enough of OpenGL ES 2 to get a desktop up and going it's significantly easier.
18:17FL4SHK[m]: I'd like to be able to run some lower end emulators
18:18FL4SHK[m]: eventually
18:18gfxstrand: NVIDIA hardware is quite nice, actually. That's not at all the problem.
18:19karolherbst: the tldr on nvidia is, that the hardware is designed for driver developers
18:20soreau: is there a gl(es) driver for nvk-capable hw too, or you mean running $compositor on zink?
18:20Venemo: soreau: there is nouveau like always has been
18:20gfxstrand: There's a GL driver but Zink+NVK is already starting to outpace it
18:20soreau: I see
18:21soreau: well, don't forget there's a forest in the trees, somewhere..
18:22Lynne: Venemo: <some> alignment, after all, if the image has nicer dimensions it looks fine
18:22heat: gfxstrand, i thought it was a PITA to get docs on nvidia hw though? or did you folks solve that situation already?
18:22Lynne: I chose that image because it's all odd-sized
18:22gfxstrand: We have headers now
18:23gfxstrand: Which is a big step up
18:23gfxstrand: For the ISA, some folks have access to some docs and we have the PTX docs public which are often helpful.
18:23gfxstrand: But developing any GPU driver involves a certain amount of R/E anyway
18:35Venemo: Lynne: does it behave better with even-sized images?
18:44FL4SHK[m]: if the GPU is open source surely there's no R/E involved other than reading the code
18:45karolherbst: *doubt*
18:45FL4SHK[m]: Doubt for what?
18:45karolherbst: GPUs are quite complex
18:45idr: FL4SHK[m]: GPUs are complex. Some of the RE is, "What happens if I do these things together in a way nobody really thought about?"
18:45karolherbst: and reverse engineering isn't limited to binary blobs
18:46FL4SHK[m]: ah
18:46FL4SHK[m]: gotcha
18:46karolherbst: but yeah.. you might have to figure out how your open source GPU behaves doing certain things as it might not be obvious from the code
18:46karolherbst: maybe "debugging" would be the better term here? dunno :)
18:46FL4SHK[m]: right
18:47heat: i've gone through the intel GPU docs a fair bit... safe to say they don't tell you all the things you need to know
18:47FL4SHK[m]: In my case I will be developing both the GPU and the driver
18:47karolherbst: and usually: hw trumps the spec/code in any argument
18:47heat: and between the thousands of pages of docs and the i915 kernel driver... yeah i gave up on that pretty quickly :)))
18:49gfxstrand: Well, reverse engineering is just debugging something you don't have the capacity to change.
18:49airlied: also a gpu is a lot more than just compute execution units
18:49gfxstrand: So you're just replacing debugging something you can't change with debugging something you can.
18:50karolherbst: :D fair
18:50airlied: those are the fun pieces, but for a useful graphics gpu, you'd probably want texture units at least, and maybe hw blending
18:54vsyrjala: iirc some pirate once said: "hardware docs are more of what you'd call guidelines than an actual description of how the hardware really works"
18:55gfxstrand: hehe
18:55gfxstrand: Pretty much
18:55dj-death: when it's not outright lies
18:56karolherbst: hard to tell if anybody actually lies there
18:56gfxstrand: That's the best part of not having docs. There's nothing to lie to me!
18:56FL4SHK[m]: <airlied> "those are the fun pieces, but..." <- not sure there's much more I'll be including in mine
18:56karolherbst: but the code to design the hw doesn't have to match it out of various reasons, including bugs
18:56dj-death: yeah
18:56dj-death: gfxstrand: nice little feeling not to have to trust anybody ;)
18:57karolherbst: at least nvidia just leaves out the parts they don't want to share :D
18:58vsyrjala: i feel docs are a bit of a double edged sword sometimes. inexperienced developers tend to blindly trust what they read there, and then nothing works correctly
19:13daniels: just like the average CV, they're aspirational but unreliable
19:13karolherbst: or they end up debugging their own code for days just until somebody tells them the docs are wrong or come to that conclusion on their own or in the worst case give up
20:14Lynne: Venemo: yup
20:17Lynne: it's only the width that matters, the height can be odd
20:20Lynne: correction: it happens on 1024x1024 images too
20:21Lynne: corruption seems to depend on dimensions but so far seems to be random
20:22Lynne: the corruption on 1024x1024 does look like an incorrect stride issue for the chroma
20:49karolherbst: okay.. the u_trace stuff has a use-after-free :')
20:49karolherbst: apparently util_queue_finish doesn't properly clean up the thread started in util_queue_init
20:50karolherbst: mhh maybe it's more of a u_queue issue then
21:08Venemo: Lynne: so, if you take the same image, but add 1 pixel to each side, it will work?
21:12Venemo: Lynne: really weird. can you give me an image to reproduce with at 1024x1024?
21:31Lynne: better yet, I can teach you to make your own: "ffmpeg -i test.png -vf crop=w=1359:h=1791 -y test_cropped.png"
21:32Lynne: you can use any larger image as an input and crop to whatever size you need, the ffmpeg command to upload+download always does a format conversion to yuv422
21:34Venemo: Lynne: by any chance, have you tried if this works on amdvlk?
21:37Venemo: it probably will
21:37Lynne: err, no, I haven't, it's a pain to use since it hamfistedly overrides the default driver
21:37Venemo: if it's any consolation, I've found 3 other bugs while chasing this, but none of them has anything to do with your use case sadly
21:53Lynne: bad news: it works just fine on amdvlk
21:54Venemo: it's not bad news, it means whatever the problem is, it's solvable
21:57Venemo: Lynne: can you help me understand how the 3 planes are combined into a single image that I can see in the file?
21:57Lynne: err, turns out it works because amdvlk doesn't support vulkan's multiplane 422
21:58Venemo: heh
21:59Lynne: yeah, so after downloading, there's a conversion step to turn it into rgb to compress it with png and output it
22:00Lynne: you can get the exact data out directly via .nut, but you'll just have to launch vlc with a cli arg to start in paused mode rather than play
22:01Lynne: internally, the vkformat 1000156004 is used for the images, with the vkimage being non-disjoint, regularly allocated, optimally-tiled
22:33Venemo: Lynne: I think I found the solution... who the hell would have thought that the size of a plane is not the same as the image extent... the issue seems to be gone if I use vk_format_get_plane_width/height
22:33Venemo: give me a few minutes to update the branch
22:40jenatali: Venemo: That's what subsampling does, though
22:42Venemo: I learn something new and interesting every day in this job
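Why the plane size differs from the image extent can be shown with a short sketch. It is only an illustration: the `SUBSAMPLING` table, the format labels, and the `plane_extent` helper are names invented here (not Mesa's `vk_format_get_plane_width/height`), and rounding odd dimensions up is an assumption of this sketch. Plane 0 is luma at full resolution; the chroma planes are divided by the subsampling factors.

```python
# Horizontal and vertical chroma subsampling factors per format family.
SUBSAMPLING = {
    "yuv444": (1, 1),
    "yuv422": (2, 1),  # chroma has half the width, full height
    "yuv420": (2, 2),  # chroma has half the width and half the height
}


def plane_extent(fmt, plane, width, height):
    """Return (width, height) of one plane of a 3-plane YUV image."""
    if plane == 0:
        return width, height  # luma plane matches the image extent
    sx, sy = SUBSAMPLING[fmt]
    # Ceiling division so odd image sizes still cover the last column/row.
    return -(-width // sx), -(-height // sy)
```

For the 1024x1024 yuv422 case from the log, the chroma planes come out as 512x1024, so a copy that uses the full image width for planes 1 and 2 would step through memory with the wrong row length, which fits the "incorrect stride" look of the corruption.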
23:54Venemo: Lynne: I've updated the MR now, can you pls check if the issue is gone?
23:59Lynne: Venemo: yup, fully works now!