00:00 karolherbst: some interesting data on nir vs tgsi backend caching in nouveau and it somehow shows that either we mess up or nir_serialize with strip = true isn't doing the best job: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4264#a74530e411f7be6b3f755b03dcf655bfc31b1c46
01:05 airlied: jekstrand: why does the simple_bo path not set exec no reloc?
08:07 pq: emersion, "just tell me the cause of einval" - that's the kernel DRM flight recorder, I dunno what state it is in nowadays.
08:08 emersion: pq: this? https://patchwork.kernel.org/patch/11184403/
08:10 emersion: good to see we aren't the only ones with this issue at least
08:11 pq: emersion, yeah, that effort by Sean. I forget which one was the latest incarnation, but it has been a while since I recall hearing from it.
08:11 emersion: seanpaul: do you know what's the status of this? ^
08:13 pq: ISTR something about getting the dump from tracefs instead of dmesg
08:13 pq: also that it would still need privileges more than just DRM master
08:13 emersion: hrm
08:14 pq: maybe there should be a master-only DRM ioctl to dump it out...
11:16 Vanfanel: emersion: I am still trying to set a cursor to the cursor plane, can't get this working... I am getting "[drm:drm_atomic_plane_check [drm]] [PLANE:42:plane-2] invalid pixel format XR24 little-endian (0x34325258), modifier 0x0".
11:16 Vanfanel: I am creating the cursor GBM bo with "gbm_bo_create(viddata->gbm_dev, usable_cursor_w, usable_cursor_h, GBM_FORMAT_ARGB8888, GBM_BO_USE_CURSOR | GBM_BO_USE_WRITE)", so it IS an ARGB888 cursor, not XR24
11:18 Vanfanel: But strangely enough, when I remove "GBM_BO_USE_CURSOR" from the gbm_bo_create() flags, the atomic commit on the same cusror plane DOES succeed
11:18 Vanfanel: I have found this: https://github.com/swaywm/wlroots/commit/7bc43413edc2db04bbfba395f9606957ff2fa21c
11:19 Vanfanel: So, what's going on here with CURSOR planes? there's something very strange (possibly something I don't know)
11:21 Vanfanel: removing "GBM_BO_USE_CURSOR" let's the later atomic commit that modifies the CRTC_ID and the FB_ID of the CURSOR PLANE succeed, as I said, but of course no cursor appears on screen even if the atomic commit succeded
11:21 mlankhorst: have you tried with drm debugging?
11:22 Vanfanel: mlankhorst: yes, that's where I got "[drm:drm_atomic_plane_check [drm]] [PLANE:42:plane-2] invalid pixel format XR24 little-endian" from.
11:22 Vanfanel: mlankhorst: but that fail does NOT appear when I remove the GBM_BO_USE_CURSOR from the BO creation
11:26 Vanfanel: in other words, the GBM BO seems to be a working one ONLY if I remove the GBM_BO_CURSOR flag on it's creation
11:26 mlankhorst: seems like gbm does something then
11:26 Vanfanel: Nnote that I have to get the fb_id of the GBM BO with gbm_bo_get_user_data(bo)
11:27 Vanfanel: but that's normal
11:28 Vanfanel: mlankhorst: have you seen this? -> https://github.com/swaywm/wlroots/commit/7bc43413edc2db04bbfba395f9606957ff2fa21c
11:28 Vanfanel: It says something similar
11:29 emersion: Vanfanel: you add the GBM BO as a DRM FB via drmModeAddFB2 right?
11:29 emersion: or drmModeAddFB
11:29 emersion: maybe the format is wrong there
11:29 emersion: or drmModeAddFB2WithModifiers
11:37 Vanfanel: emersion: No, I simply 1) create the GBM BO for the cursor with gbm_bo_create(ARGB8888). 2) Write the cursor data to it via gbm_bo_write(). 3) Get the fb_id for it via gbm_bo_get_user_data(). 4) Try to set the CRTC_ID and FB_ID pros of the CURSOR PLANE, using the GBM BO corresponding fb_id.
11:37 pq: Vanfanel, userdata is not FB ID
11:38 pq: userdata is whatever pointer you decide to store with the gbm_bo so that you can look up your custom data later
11:39 pq: Vanfanel, the only way to get an FB ID is that you or something somewhere calls AddFB, AddFB2, or AddFB2WithModifiers.
11:42 pq: Vanfanel, gbm_bo_get_user_data() returns whatever you set with gbm_bo_set_user_data(). If you didn't set anything, it returns... who knows. NULL or uninitialized.
11:43 Vanfanel: pq: I am seeing I was filling userdata with fb_id for other buffers, but not for this one. I was obviating something fundamental :)
11:43 pq: heh
11:44 Vanfanel: pq: and indeed I was using AddFB on those buffers
11:44 Vanfanel: pq: Thanks A LOT
11:44 Vanfanel: I have been looking for an answer for hours now.. since yesterday
12:10 karolherbst: bbrezillon: fround_even doesn't cut it?
12:12 bbrezillon: karolherbst: I guess I tried it and didn't do what was expected, but I don't remember tbh
12:13 karolherbst: mhh
12:13 karolherbst: I could test that on nvidia
12:13 karolherbst: but we can do it in one op
12:14 karolherbst: our conversion op support rounding towards integers
12:14 karolherbst: and it doesn't seem like your implementation has any magic to it
12:14 bbrezillon: nope
12:15 bbrezillon: and the CTS was complaining IIRC
12:15 karolherbst: I can take a look if I know which test it was
12:21 bbrezillon: karolherbst: I think it was bruteforce
12:21 karolherbst: k.. let me see
12:26 bbrezillon: karolherbst: according to https://manpages.debian.org/jessie/opencl-1.2-man-doc/round.3clc.en.html, we don't want rtne
12:27 bbrezillon: we need 'round-away-from-zero'
12:37 karolherbst: let's see
12:40 danvet: airlied, Re: [linux.git drm/ttm]: NULL pointer dereference upon driver probe <- probably good if you weigh in on this
12:41 danvet: sravn, pinchartl, any good ideas for people we could volunteer a bit more for bridge stuff?
12:42 danvet: you guys are doing great work, but I think we can use more ...
12:59 karolherbst: bbrezillon: mhh.. yeah.. seems like I hit "ERROR: round: -16777216.000000 ulp error at 0x1p-1 (0x3f000000): *0x1p+0 vs. 0x0p+0"
13:00 karolherbst: let me see what nvidia does
13:01 karolherbst: ups...
13:01 karolherbst: ptx doesn't support round anyway :D .. argh
13:02 karolherbst: oh wow.. they just go beserck on that one
13:03 karolherbst: cvt.rna.tf32.f32 b1, f; // convert fp32 to tf32 format mhhh
13:05 karolherbst: but that's some fancy new stuff
13:15 karolherbst: bbrezillon: well.. your implementation also doesn't cut it :/
13:15 karolherbst: ERROR: round: inf ulp error at 0x1.fffffep-2 (0x3effffff): *0x0p+0 vs. 0x1p+0
13:17 bbrezillon: hm, I'm pretty sure all round tests were passing
13:17 karolherbst: could be some dx specific thing
13:41 bbrezillon: karolherbst: hm, I guess that's caused by the rounding on the fadd(+0.5)
13:41 karolherbst: probably
13:46 karolherbst: bbrezillon: I guess you might need to throw in an inf check as well.. although I'd argue that inf + anything is till inf.. mhh
14:30 karolherbst: bbrezillon: ohh wait.. inf has no mantissa so it should be handled arleady.. strange
14:31 imirkin: karolherbst: inf + nan = nan. inf - inf = nan.
14:32 karolherbst: right...
14:32 imirkin: re "inf + anything is still inf"
14:37 bbrezillon: karolherbst: but it's not inf that's failing here, is it?
14:37 karolherbst: ohh right.. it's 0x3effffff
14:37 karolherbst: which should be 0
14:38 karolherbst: bbrezillon: mhhh.. 0x3effffff + 0.5 == 1
14:39 karolherbst: now I wonder why you don't hit this
14:39 karolherbst: mhh.. wait
14:40 bbrezillon: karolherbst: can you try with https://gist.github.com/bbrezillon/9283df26059db636a12a8f7668aeeafd ?
14:40 bbrezillon: and yes, that's weird
14:41 karolherbst: bbrezillon: ohhh. not, it's not I think
14:42 bbrezillon: I mean, that's weird that it didn't fail during our tests
14:42 karolherbst: 0.999999970197 == 0x3f7fffff, but 0.999999970198 == 0x3f800000
14:42 karolherbst: 0x3effffff == 0.4999999701976776123046875
14:42 karolherbst: so I guess it could vary depending on hw or whatever
14:43 bbrezillon: yep, there's a rounding imprecision when we add 0.5
14:43 imirkin: floating point is well-defined
14:43 imirkin: there is no "depending on hw"
14:43 karolherbst: well....
14:43 bbrezillon: but I'd expect it to always round the same way
14:43 imirkin: with specific values, there's a single correct answer
14:44 karolherbst: should be, yes
14:44 bbrezillon: what imirkin says
14:44 bbrezillon: :)
14:44 jekstrand: airlied: Probably no good reason
14:44 imirkin: not should be. is.
14:44 karolherbst: yeah, then I guess we do something incorrectly
14:44 jekstrand: airlied: Feel free to submit a MR to fix that. :)
14:44 bbrezillon: karolherbst: could be dxil too
14:44 karolherbst: right
14:44 imirkin: unless you're claiming a serious flaw in the hw that you're the first to discover.
14:45 karolherbst: bbrezillon: your new version seems to be better
14:45 karolherbst: at least it runs longer now
14:45 jenatali: IIRC floating point math should be RTNE in DXIL and CL
14:45 karolherbst: imirkin: what are we defaulting to for compute shaders?
14:46 karolherbst: is there a default even?
14:46 imirkin: for what operation?
14:46 karolherbst: uhm.. all floating point?
14:46 imirkin: lol
14:46 imirkin: well the rounding logic is different depending on the operation
14:46 karolherbst: right
14:46 imirkin: so ... which operation.
14:46 jenatali: Add? :)
14:46 bbrezillon: yes, add
14:47 karolherbst: yeah, in that case it was the add I guess
14:47 imirkin: i'd have to check. i *assume* rtz
14:47 imirkin: (round-to-zero)
14:47 imirkin: er, towards
14:47 bbrezillon: then dxil was correct :P
14:47 imirkin: karolherbst: envydis prints the rounding mode
14:48 imirkin: although it's questionable whether it gets emitted correctly post-kepler
14:48 karolherbst: imirkin: at least from the input we only set it explicitly for conversions
14:48 imirkin: yeah, coz there's no way to specify what you want, so we always set it to one thing
14:48 imirkin: (prior to whatever you guys are talking about)
14:49 agd5f_: hanetzer, dmesg output and xorg log (if using X)
14:49 karolherbst: mhhh
14:49 karolherbst: bruteforce add fails :D
14:49 karolherbst: maybe our rounding mode is indeed busted
14:49 imirkin: i take that back --
14:49 imirkin: i think we use "round nearest"
14:50 karolherbst: yeah... sounds about right
14:50 imirkin: karolherbst: have a look at the RoundMode type
14:50 imirkin: (in nv50_ir.h)
14:50 imirkin: should also be a properly of all instructions i think
14:50 karolherbst: yeah
14:50 imirkin: there's ROUND_N, _M, _Z, and _P
14:50 karolherbst: mhhh
14:51 imirkin: i believe all those options are settable on floating point ALU ops like add and mul
14:51 karolherbst: they are indeed
14:51 imirkin: but there's never any reason to set them to anything in particular for GLSL, so we go with the flow
14:51 imirkin: iirc blob uses ROUND_N, so we do too.
14:52 imirkin: (even post-ARB_shader_precision, floating point behavior in GLSL isn't particularly well-defined)
14:54 karolherbst: mhh ERROR: add: -8388607.000000 ulp error at {-0x0p+0, -0x1.000002p-126}: *-0x1.000002p-126 vs. -0x1p-125 (0x81000000) at index: 3566
14:55 karolherbst: that's a simple add test
14:55 bbrezillon: karolherbst: should I submit the new version?
14:56 bbrezillon: regardless of those rounding issues on nvidia
14:57 karolherbst: I am wondering what's the better version actually
14:57 imirkin: i'm sure holy wars have been fought over the best rounding mode...
14:57 jenatali: FWIW: https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#rounding-modes-1
14:57 imirkin: that and whether one should eat a hard-boiled egg from the big or little end...
14:58 jenatali: > Round to nearest even is currently the only rounding mode required67 by the OpenCL specification for single precision and double precision operations and is therefore the default rounding mode.
14:59 alyssa:trying to understand Mesa formats, again
15:00 Sachiel: they tend to be a flat surface, usually with four legs to keep them upright at a decent height
15:02 karolherbst: imirkin: ehh.. we don't emit the rounding mode for fadd
15:02 imirkin: like i said, on maxwell+, it might not be done properly.
15:06 karolherbst: mhh, but by default we emit NE anyway
15:10 karolherbst: bbrezillon: anyway.. use the ones you think results in the better binaries, I can still try to figure out why it doesn't work here
15:12 karolherbst: imirkin: mhhh.... mhhh.... so.. for whatever reason, in a debug build it seems to pass...
15:12 imirkin: lol
15:12 karolherbst: or I messed something up :D
15:13 karolherbst: anyway.. maybe adding the code to emit RND does have an impact even though by default nearest is 0?
15:13 karolherbst: I'll dig into it...
15:14 imirkin: check the nvdisasm stuff :)
15:33 bl4ckb0ne: could Mali 400 MP2 run gles3 or its limited by the hardware?
15:34 alyssa: bl4ckb0ne: hw limitations
15:34 imirkin: the attached CPU can run GLES3 via llvmpipe though :)
15:34 bl4ckb0ne: so no gles3 on the pinephone then
15:34 bl4ckb0ne: thanks alyssa
15:34 alyssa: In theory particular extensions could be emulated but gles3 as a whole, no
15:35 alyssa: mali t600 is very much "what happens if we made mali 400 gles3+opencl capable"? ...and it's a radically different arch as a result
15:35 imirkin: note that unlike the later desktop GL series, in general ES3 is not the same as ES2 + exts. same for ES3.1 vs ES3.
15:35 bl4ckb0ne: iirc the vao extension works
15:35 imirkin: i.e. no amount of (currently written) exts on ES2 will get you ES3 functionality
15:36 alyssa: ^ also that
15:36 imirkin: individual features, sure. but not the whole thing.
15:36 emersion: good to know
15:36 bl4ckb0ne: is the difference that big between es3 and es2?
15:36 imirkin: no
15:36 imirkin: there's just no exts that cover that functionality ;)
15:37 imirkin: much like there are no exts that cover desktop GL 3.1 -> GL 3.2
15:37 alyssa: bl4ckb0ne: ES2.0 spec is 200 pages, ES3.2 spec is 600 pages ;)
15:37 bl4ckb0ne: oh boy
15:37 bl4ckb0ne: didnt expect that
15:37 imirkin: ES2 is DX9. ES3 is DX10.
15:37 imirkin: (but without some of the bits)
15:37 imirkin: (like no geometry shader... probably the biggest bit.)
15:38 bl4ckb0ne: geometry shader is limited by the HW?
15:38 alyssa: actually, no Mali has hardware support for geometry shaders
15:38 alyssa: mali t600+ uses a compute shader to emulate it
15:38 bl4ckb0ne: could it work with an extension?
15:38 alyssa: but you really can't do that on mali 400
15:39 alyssa: (and no, s/w transform feedback for the GP is Not Recommended(TM) )
15:39 karolherbst: alyssa: that sounds... annoying?
15:39 alyssa: karolherbst: checkbox compliance
15:39 karolherbst: yay....
15:39 karolherbst: nobody sane uses geometry shaders anyway :p
15:40 alyssa: karolherbst: indirect draws are compute shaders to patch the cmdstream ;P
15:40 karolherbst: ufffff
15:41 karolherbst: that sounds even more annoying than what we have to deal with
15:41 karolherbst: but I guess it's kind of sameish
15:41 alyssa: gs/tess likewise
15:41 karolherbst: just we have a macro language for that...
15:41 alyssa: yo, I heard you like NIR, ...
15:46 karolherbst: ehhhh
15:46 karolherbst: yeah soo.. no changes, debug build: it works
16:17 sravn: danvet: cd drivers/gpu/drm/bridge; git log --since=1year * | grep ^Author | sort - anyone that pops up more than a few times should be candidates
16:20 sravn: danvet: Maybe try to ask Douglas Anderson <dianders@chromium.org>, Jerome Brunet <jbrunet@baylibre.com>, Torsten Duwe <duwe@suse.de> or Icenowy Zheng <icenowy@aosc.io>
16:21 imirkin: karolherbst: same shader binary?
16:21 imirkin: bl4ckb0ne: i meant that DX10 includes geometry shaders while ES3 doesn't. otherwise most DX10 features are required for ES3.
16:22 sravn: danvet: Or the same query with focus on who reviewed bridge patches the last year?
16:23 sravn: There seems to be an influcx of bridge patches, or maybe it is just me who looses track. I hope we will see a decrease in number of new bridge patches but that may be optimistic
16:24 sravn: s/new bridge patches/new bridges/
16:37 danvet: sravn, yeah maybe just dropping a few on the floor until more people start peer-reviewing their stuff might be needed
16:37 danvet: since it does seem to be fairly active
16:57 karolherbst: imirkin: not sure yet.. I think that march/mtune settings mess it up
16:57 karolherbst: but mhhh
16:59 imirkin: oh that makes sense
16:59 imirkin: const folding
17:00 imirkin: is not "precise" wrt that
17:05 Vanfanel: On drmModeSetCursor2(), there were the hot_x and hot_y parameters. SInce I am now using the ATOMIC interface, what CURSOR PLANE properties would hot_x and hot_y map to? I am so far using hot_x for CRTC_X and hot_y for CRTC_Y. Is that right?
17:26 karolherbst: imirkin: actually.. it is -Ofast
17:27 imirkin: karolherbst: right, but as a result of const folding
17:27 imirkin: i.e. the const folding works differently under diff optimization levels
17:27 karolherbst: actually.. no
17:27 karolherbst: the shaders are the same
17:27 imirkin: imm -> const?
17:27 imirkin: and the const tables are different?
17:28 karolherbst: we have no const tables at the moment
17:28 imirkin: o
17:28 karolherbst: the kernel is also super trivial, so I would be surprised if any of that matters
17:28 karolherbst: it's probably something dumb
17:28 karolherbst: like a compiler option flipping shit fo the entire application, not just mesa
17:29 imirkin: ah perhaps
17:31 karolherbst: anyway, will figure out which one and see if we can just "fix" it inside meson
17:31 karolherbst: imirkin: I bet it's unsafe-math-optimizations
17:31 imirkin: these types of floating "options" are super-sensitive to compiler opts
17:32 imirkin: like x87 does one thing, sse does another, etc
17:32 imirkin: so depending on the instructions selected, you'll get diff things
17:32 imirkin: (diff versions of sse/avx have diff things available)
17:32 karolherbst: "When used at link time, it may include libraries or startup files that change the default FPU control word or other similar optimizations."
17:32 imirkin: yeah, that's not something you want when you want super-precise results :)
17:32 karolherbst: yep
17:33 karolherbst: maybe we should just force -fno-unsafe-math-optimizations.
17:33 imirkin: -fsafen-up :)
17:33 karolherbst: "-fno-signed-zeros" heh.. yeah
17:33 karolherbst: I am sure we never want that :D
17:34 imirkin: signed zeros are annoying.
17:35 HdkR: Reminder that reciprocal with SSE still allows a huge swing in its approximation and ends up with quite different results between Intel and AMD :P
17:35 karolherbst: HdkR: it's about add being broken :p
17:37 imirkin: right, so there are some things which are not well-defined by IEEE. rcp is one of them.
17:37 imirkin: but add and mul are well-defined.
17:38 HdkR: People compiling with unsafe math operations? Sounds terrible
17:38 karolherbst: HdkR: speed
17:39 HdkR: Still sounds terrible :P
17:39 dcbaker[m]: It don't matter how fast you get the wrong answer
17:39 karolherbst: moar speed
17:39 HdkR: Time to compile all of gentoo with unsafe optimizations
17:39 zmike: jekstrand: ping, I think I found a bug in nir constant folding and am not sure how best to report it (I have piglit test case, spirv shader, nir output, ...)
17:39 karolherbst: HdkR: don't ask
17:41 karolherbst: zmike: shader_test files are always nice
17:41 zmike: uhhh well it's not a shader test specifically
17:42 karolherbst: dcbaker[m]: do you think we want to add -fno-unsafe-math-optimizations?
17:43 karolherbst: but yeah.. without unsafe-math-optimizations it behaves as expected
17:44 dcbaker[m]: Hmmmm. Might make sense for at least the bits where we know it breaks things
17:45 karolherbst: dcbaker[m]: I don't think it actually breaks stuff directly, just that it messes with the FPU state also breaking applications
17:46 karolherbst:really wants a -ffpu-state-dont-touchy options
17:47 karolherbst: this entire FPU stuff is totally bonkers in x86 anyway
17:47 karolherbst: ehh.. maybe it was something else actually? uhhh
17:48 karolherbst: maybe we should just disallow -Ofast..
18:04 imirkin: HdkR: it's the wrong way, but faster!
18:05 HdkR: :D
18:05 HdkR: Need to get all the perf back from all the Intel mitigations?
18:11 karolherbst: :D
18:12 karolherbst:doesn't understand why intel keeps shipping a broken vector processor alongside their GPU
18:14 karolherbst: bbrezillon: mhhhh.... so Ofast wasn't the culprit of your round implementation failing :/
18:17 sravn: danvet: well, I cleared my bridge queue now I think. One applied, one wiht review stuff to fix, one waiting for binding review (have asked for help on #devicetree)
18:18 imirkin: karolherbst: that's ok, the cpu guys can't figure out why they keep shipping a broken gpu alongside their perfect vector processor :)
18:20 danvet: sravn, one tricked I sometimes used when you have two patch series
18:20 karolherbst: :D
18:20 danvet: is to ask each to review the other
18:20 imirkin: danvet: works best when both series are from the same person :)
18:20 danvet: might end up with just mutual rubber-stamp, but often you get a new volunteer
18:20 Vanfanel: What's the best way to remove an FB from a plane? I mean, what should I set the plane FB_ID prop to? 0? NULL?
18:20 danvet: maybe just *often enough
18:20 danvet: imirkin, lol
18:21 danvet: Vanfanel, 0
18:21 danvet: all properties are u64
18:21 danvet: so NULL not really a thing
18:21 danvet: if this is for a userspace atomic ioctl call
18:22 karolherbst:never understood why using SSE/AVX for memcpy is such a great idea, couldn't they just add an.. you know, "memcpy" instruction doing the right thing?
18:22 imirkin: karolherbst: like nvidia did? :)
18:23 HdkR: You mean like Ice Lake's `fast rep movb`? :P
18:23 imirkin: HdkR: is fast another prefix?
18:23 karolherbst: HdkR: I bet it's slower than using AVX512, because you still need to have a reason to sell AVX512 CPUs :p
18:23 imirkin: replacement for turbo button?
18:23 HdkR: karolherbst: It's a single instruction that can saturate memory
18:23 karolherbst: imirkin: well.. I guess GPU had that DMA shit for ages :p
18:24 karolherbst: but yeah.. the instruction is just for global -> shared
18:24 karolherbst: afaik
18:24 imirkin: give it time.
18:24 HdkR: imirkin: Nah, it just means that any `rep movb` will saturate memory if the cpuid flag is set
18:24 Vanfanel: danvet: Thanks! Yes, it's for an userspace atomic commit :)
18:24 karolherbst: HdkR: ohh, so it's a magic thing you can flip off to make code slower again?
18:24 karolherbst: or faster?
18:24 karolherbst: "slow rep movb" :p
18:25 HdkR: In kernel space I guess, userspace can't modify CPUID :P
18:25 karolherbst: doesn't see why the kernel would ever want to disable that :p
18:25 karolherbst: at which point it was pointless to add
18:25 karolherbst: or it has to be disabled when a process does AVX
18:25 karolherbst: dunno
18:26 HdkR: nah, it is just an indicator to userspace that it is available
18:26 karolherbst: ohhh
18:26 karolherbst: I see
18:26 karolherbst: so we have now those 1000 loc memcpy implementations with a simple if (fast rep movb) rep movb; else thing?
18:26 HdkR: Should let glibc set up its memcpy dispatcher to jump to that handler
18:26 karolherbst: :p
18:26 HdkR: :)
18:26 HdkR: yep yep
18:27 karolherbst: HdkR: saw that on twitter a while ago: https://twitter.com/jfbastien/status/1288232681432440834
18:28 karolherbst: sad truth
18:28 HdkR: yea
18:28 karolherbst: it's even sader if you know the reasons why
18:29 karolherbst: but not everybody has the icc, so they need to write assembly like glibc does :D
18:29 HdkR: I hope that AMD gains the same feature and in a couple years we will just be running a single instruction for copying memory :/
18:29 karolherbst: finally
18:29 karolherbst: and it takes 20 years until we can remove the "legacy code", but then we have AVXX16K requiring even more code
18:30 HdkR: :D
18:31 karolherbst: mhhh
18:31 karolherbst: I think I didn't though it through
18:31 karolherbst: it won't be AVXX16K, but AVX8Kx8K because abusing tensor instructions will be the future
18:32 karolherbst: ahh.. it's called AMX
18:32 karolherbst: the fuck...
18:32 karolherbst: they are actually doing it
18:32 HdkR: :)
18:33 karolherbst: I am sure we will see an AMX based memcpy soonish
19:20 zmike: Kayden: ping re:!5338
19:20 jekstrand: zmike: That's entirely possible. File a gitlab issue. Tests are nice.
19:21 zmike: jekstrand: sure, but what info specifically would be useful? is "run this piglit test with zink from this branch" okay, or do you need dumps?
19:23 jekstrand: zmike: A piglit that fails without Zink would be nice.
19:23 jekstrand: zmike: As would a patch to fix constant folding. :-P
19:24 jekstrand: zmike: But Zink isn't too hard to run. I think my mesa-run script still has a flag for it.
19:24 zmike: I can just provide you a cmdline to run the test?
19:25 jekstrand: sure
19:25 zmike: I'm not sure how feasible it is to make a standalone test that would fail on any driver since the spirv is...girthy
19:25 zmike: cool
19:25 jekstrand: Yeah... SPIR-V.....
19:26 zmike: it's also spirv that's come from glsl, during which time I've rewritten all the buffer reads to single int reads with the understanding that they'll be flattened back into full-length reads by nir optimization
19:26 Kayden: zmike: err, so you want to interleave/deinterleave them...but you also want to support packed buffers...?
19:26 zmike: Kayden: right
19:26 zmike: I can blit them but not read them
19:51 zmike: jekstrand: alright, I included all possible info I think
19:52 jekstrand: zmike: Cool.
19:57 jekstrand: zmike: Replied. I think NIR is fine.
19:58 zmike: hm
19:59 zmike: jekstrand: no, that's actually how I do reads
20:00 zmike: but for writes I use the component mask
20:00 zmike: and then, starting at the array indexed vec4, I check the mask for the vec4 member
20:01 jekstrand: zmike: Maybe you just don't want both levels of indexing?
20:02 zmike: jekstrand: I've thought about flattening it before, but that's sort of a big project since we use this layout for all ssbo/ubo reads/writes, and redoing the layout is going to mean probably fighting regressions the whole way
20:02 zmike: hoping to finish through gl 4.6 before I do stuff like that
20:03 zmike: hm
20:03 zmike: will try your change though, it does seem reasonable
20:03 zmike: thanks for checking
20:05 zmike: (having spent the morning battling packed format zs samplers, it's taking me some time to get back into shader mode)
20:09 jekstrand: zmike: Yeah... Packed formats....
20:10 airlied: packed is easier than z32s8 :-P
20:10 airlied: so who wants to ack vallium :-P
20:10 zmike: oh me
20:10 zmike: definitely me
20:13 airlied: I expect the only really reviewable bits are the nir passes, unless someone wants to really review the guts of it val_execute.c
20:14 zmike: is there a MR up for this?
20:15 airlied: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6082/
20:18 airlied:tagged a few ppl
20:28 jekstrand: karolherbst: I'm looking at an OpenCL 2.0 shader and seeing someone declare a CrossWorkgroup variable. Am I going crazy or did the OpenCL group when they allowed such a thing?
20:29 jenatali: jekstrand: CrossWorkgroup is just global, isn't it?
20:29 karolherbst: yeah.. I just wanted to say
20:29 karolherbst: cross is global
20:29 karolherbst: workgroup is shared
20:29 jekstrand: Sure, but how can you declare such a thing *in* a shader?
20:29 karolherbst: why not?
20:29 jekstrand: Where's the memory supposed to come from?
20:29 jekstrand: How are you supposed to access it?
20:29 jekstrand: It's not a pointer, it's an OpVariable
20:29 karolherbst: (int global*) 0x12345678
20:29 karolherbst: :p
20:29 karolherbst: ahh
20:29 karolherbst: variable
20:30 karolherbst: mhhh
20:30 karolherbst: mind sharing the spir-v?
20:30 jekstrand: I'm not sure I can...
20:30 jenatali: jekstrand: Is it a global variable?
20:30 karolherbst: jekstrand: maybe just the declarations of the variable?
20:30 karolherbst: you can leave out the code :D
20:31 jenatali: > For OpenCL C 2.0, or with the __opencl_c_program_scope_global_variables feature macro, variables defined at program scope and static variables inside a function can also be declared in the global address space.
20:31 karolherbst: ufff
20:31 karolherbst: 2.0 strikes again
20:32 jekstrand: What does it mean though?
20:32 jekstrand: If I know what it means, I can implement it.
20:32 jenatali: I've thought about how I'd implement that - basically have the runtime allocate a buffer for the memory, and pass it as a hidden pointer
20:32 karolherbst: jenatali: and what about the size?
20:32 jekstrand: Yeah, I guess....
20:32 jenatali: It means that it's memory that's implicitly bound to the program essentially
20:32 jekstrand: And is just accessible across everything?
20:32 jekstrand: Seems a but nuts but ok.
20:32 jenatali: Yup
20:32 jekstrand: That's implementable.
20:33 karolherbst: but how big should the buffer be?
20:33 jenatali: And then you look at C++ and you can have constructors and destructors ;)
20:33 jenatali: karolherbst: Big enough for all program-scope variables
20:33 jekstrand: Please no
20:33 karolherbst: :D
20:33 karolherbst: jekstrand: OpenCL C++ is a thing
20:33 airlied: does the definition at least have to be constant
20:33 jekstrand: karolherbst: I know. It makes me sad.
20:33 karolherbst: airlied: global int *p;
20:33 jekstrand: karolherbst: No, "global int p"
20:33 karolherbst: but.. I guess it contains one int
20:34 karolherbst: jekstrand: no pointer?
20:34 karolherbst: mhhh
20:34 karolherbst: the spec sows with a pointer
20:34 karolherbst: "global int *p;"
20:34 jenatali: karolherbst: There's an example of "global uchar buf[512]; // OK."
20:34 jenatali: Scroll down a bit in the https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#global-or-global section
20:35 karolherbst: ahhh
20:35 HdkR: Sometimes I feel like the people behind OpenCL 2.0 went mad with power
20:35 karolherbst: "global int * global ptr; " yeah.. well
20:35 karolherbst: _but_ is it sizeof(*ptr) or how big is it? :p
20:35 airlied: it does have a big SVM caveat
20:36 jenatali: karolherbst: That's sizeof(ptr)
20:36 karolherbst: jenatali: it's a pointer inside global memory pointing to global memory
20:36 jenatali: Yes, I know
20:36 karolherbst: the question is, does it point to an array or a single element
20:36 jenatali: You can have one kernel write a value there, and another one read it
20:36 jenatali: Does it matter? It's just a pointer?
20:36 karolherbst: right
20:36 jekstrand: Hrm... I'm not seeing where this global is in the CL.
20:36 karolherbst: and where do you specify how big it is?
20:36 airlied: ughh my tangent took me to cudaMemcpyToSymbol
20:37 jekstrand: But there are headers included so.....
20:37 karolherbst: _or_
20:37 karolherbst: is there no buffer and a thread has to set it?
20:37 jenatali: karolherbst: The buffer is sized based on how many global variables you declare, and what their types are
20:37 karolherbst: jenatali: int*
20:37 jenatali: So if you declare a single global int, you need a 4 byte buffer
20:37 karolherbst: not int
20:37 karolherbst: int*
20:37 karolherbst: so, how big?
20:37 jenatali: Ok, fine, and int* is pointer-sized, so the implicit buffer needs to be able to hold a pointer
20:38 karolherbst: how many ints do I need?
20:38 karolherbst: okay, so the pointer is undefined until a thread sets it, right?
20:38 jenatali: You don't allocate the data that the int* points to. It starts out pointing to0
20:38 Kayden: yeah, the amount of "let's add things, and damn the consequences" was pretty frustrating..
20:38 jenatali: > Program scope and static variables in the global address space are zero initialized by default. A constant expression may be given as an initializer.
20:38 karolherbst: okay
20:39 karolherbst: so it's really just a buffer having all the variables, nothing more
20:39 jenatali: Yep
20:39 karolherbst: that makes a bit of sense then
20:39 karolherbst: zero iniitialized is annoying though
20:40 jenatali: Yeah it's just an implicit buffer object, associated with the program
20:40 karolherbst: mhhh
20:40 karolherbst: stuff like that is still annoying
20:40 jenatali: karolherbst: If you accept SPIR-V generated from OpenCL C++, then you have to deal with constructor-initialized, so before the first kernel execution of a given program, you have to run a different implicit kernel to initialize all the memory :D
20:40 karolherbst: right...
20:41 karolherbst: that goes into the kernel wrapper :p
20:41 jenatali: Nope, only should run once
20:41 karolherbst: right, the kernel wrapper :p
20:41 karolherbst: ohh wait
20:41 jenatali: The kernel wrapper runs every time you execute the kernel
20:41 karolherbst: first global execution?
20:41 jenatali: Yep
20:41 karolherbst: wat....
20:41 karolherbst: ......
20:41 jenatali: Yeah
20:41 karolherbst: ......
20:42 karolherbst: okay... I think it's the time to relalize that CL 2.0 was just a big trolling attempt and we should skip it :p
20:42 karolherbst: the heck comes up with that?
20:42 karolherbst: that makes no sense whatsoever
20:43 karolherbst: how...
20:44 karolherbst: why?...
20:44 karolherbst: ufff
20:44 jenatali: I can't actually find it in SPIR-V though, so I'm not sure how it's supposed to be implemented
20:46 karolherbst: jenatali: Initializer and Finalizer
20:46 jenatali: Ah, thanks, that's what I was missing
20:46 karolherbst: coll that the env spec doesn't define it either
20:46 karolherbst: *cool
20:48 karolherbst: jenatali: do you know where the runtime behaviour is defined?
20:48 jenatali: Which behavior?
20:49 karolherbst: when it needs to be run
20:49 jekstrand: karolherbst: Actually, I can share the OpenCL C:
20:49 jekstrand: __global bool debug_error = false;
20:49 karolherbst: I guess it's not that painful to implement it.. but still
20:49 jekstrand: It's not even used..... :-(
20:49 karolherbst: jekstrand: right... but yeah
20:49 jekstrand:replaces it with a #define
20:49 jenatali: > In this case, OpenCL C/C++ compiler shall generate program initialization kernels that perform C initialization or C++ construction. These kernels must be executed by OpenCL runtime on a device before any kernel from the same program can be executed on the same device. The ND-range for any program initialization kernel is (1,1,1). When multiple programs are linked together, the order of execution of program initialization kernels that belong to
20:49 jenatali: different programs is undefined.
20:49 karolherbst: jekstrand: so... it's essentially like uniforms
20:49 karolherbst: just for ssbos
20:49 karolherbst: so you have one implicit buffer having all those global variables
20:50 karolherbst: just like uniforms are like an implicit ubo
20:50 jenatali: Yep
20:50 Kayden: zmike: I'm really confused what you're trying to do in !5338. You're packing two separate buffers at map time into a single buffer. So, you still have split buffers. Yet, your gitlab comments make it sound like you're having *both* a packed copy and a split copy? They both live on? How do they stay in sync???
20:50 karolherbst: jenatali: ehhhh
20:50 karolherbst: linking even
20:50 karolherbst: at least I am happy there are CTS tests testing that :p
20:50 jenatali: Yeah, if you do the linking in SPIR-V, you'll just get multiple initializer entrypoints
20:51 karolherbst: right
20:51 karolherbst: so we just need to compile towards initializers and finalizers
20:51 jenatali: Yeah
20:51 karolherbst: but when are finalizers ran? when the cl_program objects gets destryed?
20:51 jenatali: > Program clean up may result in the execution of one or more program clean up kernels by the OpenCL runtime. This is due to the presence of non-trivial C++ destructors for program scope variables. The ND-range for executing any program clean up kernel is (1,1,1). The order of execution of clean up kernels from different programs (that are linked together) is undefined.
20:51 jenatali: So yeah, cl_program cleanup
20:51 karolherbst: ....
20:52 karolherbst: I still only care about 1.2 at this point :D
20:52 karolherbst: what a mess
20:52 karolherbst: no wonder CUDA is used
20:52 jenatali: Yeah, same - though now that 3.0 makes all the *hard* 2.x stuff optional, we could implement some of the convenience stuff like this
20:52 karolherbst: yeah
20:53 zmike: Kayden: this is just for handling buffer map/unmap, so e.g., when mapping a buffer for read it returns the expected result
20:53 karolherbst: I would totally go for the minimum and add features which are used/make sense
20:53 jenatali: Yep, exactly
20:53 zmike: there's no "copy"
20:53 jenatali: I'm still proud that I implemented multiple devices on a single CL context, even across disparate GPU vendors :D
20:53 zmike: and the only split is the mapped buffer data while it's mapped
20:54 Kayden: zmike: so the data in the gpu resource is stored as....packed, or split?
20:54 zmike: packed
20:54 jekstrand: Now I'm seeing " const uint64_t c_one = 1ul;
20:54 Kayden: zmike: then what is map doing
20:54 jekstrand: This OpenCL code is just awesome
20:54 Kayden: zmike: it's packing two things
20:54 karolherbst: jekstrand: fun...
20:54 zmike: Kayden: the map is deinterleaving it on the fly
20:54 jekstrand: karolherbst: Note that's "const" not "constant" :)
20:54 Kayden: but the point of map is to return a single packed thing...
20:55 Kayden: which ... you... have
20:55 Kayden: I thought
20:55 jenatali: jekstrand: Ugh that's the worst
20:55 Kayden: the other one takes split things, and creates packed data on map, and re-splits on unmap
20:55 Kayden: this seems to be taking the same packed thing twice, and re-packing it on map (?) and ...
20:56 zmike: Kayden: uhh tbh this is so old that it's out of context a bit, but from what I recall, the issue here is that vulkan can't do readbacks for zs buffers
20:56 zmike: it can only copy z or s
20:57 karolherbst: jekstrand: I don't even now what's the default address space for variables :D
20:57 jenatali: karolherbst: If that was in a function, it'd be private. If it's not in a function, then it's global
20:57 zmike: so yes, I have a packed buffer, and yes, I'm returning a packed buffer, but it's not possible to do any other way unless I use a compute shader or somesuch
20:57 karolherbst: global?
20:57 karolherbst: ahh
20:57 Kayden: oh, so it's trying to return only 1 of depth or stencil?
20:58 Kayden: well
20:58 zmike: it's more or less a workaround for vulkan not supporting copying image data to/from buffers if it's a combined zs buffer
20:58 Kayden: it seems like the zink calls then would be reading depth only from the ZS buffer, and then also stencil only from the ZS buffer, and something would need to re-pack them
20:58 zmike: right, that's what this is doing
20:58 zmike: (it's sort of coming back to me now)
20:59 zmike: anytime something wants to do a readback on a zs buffer, it has to go through here because vulkan only handles z or s for image data copying but not both like this
20:59 zmike: so it has to read each aspect and then manually pack it
21:00 zmike: but at any other time things are 100% awesome and we just wanna keep jamming that packed buffer into the gpu without any changes
21:00 Kayden: ok
21:01 Kayden: but, you're just...mutating the pipe resource's format to S8 and back
21:01 zmike: yup
21:01 Kayden: but it's not a packed S8 buffer
21:01 zmike: in the driver it handles the rest
21:01 Kayden: it's X24S8
21:01 Kayden: with strides and offsets that aren't S8
21:01 zmike: right, but internally we know the format
21:01 Kayden: right
21:01 Kayden: so this is a generic helper that relies on zink internals
21:01 zmike: uhhhhh
21:02 Kayden: because you're calling transfer_map on a pipe resource with the wrong format
21:02 jekstrand: karolherbst, jenatali: When the OpenCL C compiler sees "const" does it give us a NonWriteable decoration or anything useful like that?
21:02 Kayden: but in the subclass of pipe_resource you have the right format...
21:02 zmike: right, the idea is that when you're using this, you know to watch for that happening and then grab the stencil aspect
21:02 jenatali: jekstrand: I don't think so
21:02 zmike: Kayden: as in the comment
21:02 Kayden: yeah, I mean, that's doable, I guess
21:03 jekstrand: jenatali: That's annoying...
21:03 Kayden: I do wonder if it'd make sense to have a transfer_map_with_aspect or something that only selects Z or S...
21:03 zmike: err but then the driver still has to manually pack it after?
21:04 jenatali: jekstrand: Not sure actually, I do see that clang outputs "constant" in the LLVM IR, even though it puts it in address space 1 (global)
21:04 jenatali: Not sure what happens when transforming that to SPIR-V though
21:04 jekstrand: jenatali: Right.
21:05 jekstrand: Am I missing patches for handling __constant variables?
21:05 karolherbst: jekstrand: well, you know how much "const" means in C?
21:05 jekstrand: I'm seeing "__constant const uint shuffleA[8] = {1, 0, 3, 2, 5, 4, 7, 6};
21:05 Kayden: zmike: the thing is, if you give a driver a pipe resource that says it's S8, and tell it to map it....that's supposed to be packed bytes. not bytes within 32-bytes
21:05 jekstrand: And that's failing on not creating the variable
21:06 Kayden: I guess normally if we wanted to reinterpret and read packed bytes, we'd viewclass as R8_UINT, not S8_UINT, but.
21:06 jekstrand: Hrm... Maybe that's the same problem?
21:06 zmike: Kayden: right, but the driver knows that it's not really a S8 (just like it knows that it's not going to be the depth-only format before that)
21:06 jenatali: Hm, I don't think we really had to do much to vtn to support __constant, I think all of our work was mapping it to DXIL, so clover should mostly handle it?
21:06 zmike: so the idea is that it's passing first the depth format and then the stencil
21:06 karolherbst: jekstrand: yeah.. I don't think we are ready to consume __constant memory correctly at the moment
21:07 zmike: I can change it to R8_UINT if you want, but imo S8_UINT is a little more explicit that this is a stencil format
21:07 Kayden: I guess it has ptrans->stride set correctly for the larger format
21:07 karolherbst: jenatali: is the spilling to a buffer worked out already?
21:07 jekstrand: karolherbst: Do we want a separate storage class for it?
21:07 Kayden: maybe that makes it okay
21:07 karolherbst: jekstrand: it's the same as __constant kernel args
21:07 Kayden: also
21:07 jenatali: karolherbst: Not sure what you're asking?
21:07 jekstrand: karolherbst: I'm wondering if we want nir_var_mem_constant
21:07 Kayden: pipe_resources can be shared across multiple contexts
21:08 karolherbst: jekstrand: maybe
21:08 Kayden: so mutating the pipe_resource's format is not threadsafe
21:08 jekstrand: Which can map to nir_shader::constant_data
21:08 karolherbst: anyway, no matter where the constant is comming from, it all is just __constant
21:08 karolherbst: jekstrand: doesn't work
21:08 zmike: ergh
21:08 jekstrand: Right, 'cause kernel args
21:08 karolherbst: yes...
21:08 karolherbst: and being able to take addresses and stuff
21:09 jekstrand: We can make taking the addresses of nir_shader::constant_data stuff work.
21:09 karolherbst: I think it does make sense to put them into constant_data for directly accesses values
21:09 jekstrand: That doesn't concern me
21:09 karolherbst: but once we get indirects we have to spill into a buffer
21:09 jenatali: karolherbst: We've got a lowering pass that attempts to do that FYI
21:09 karolherbst: jenatali: mhhhh
21:09 karolherbst: ...
21:09 jekstrand: karolherbst: nir_shader::constant_data *is* a buffer. Well, it turns into one, at least.
21:09 karolherbst: jekstrand: mhhh...
21:10 karolherbst: could we treat it exactly like a ubo and be done with it? :p
21:10 karolherbst: but yeah.. maybe we need mem_constant
21:10 jekstrand: karolherbst: In fact, with !6244, I do a neat thing and patc the address of that buffer into the shader binary post-compile.
21:10 zmike: Kayden: fwiw this is what the usage implementation looks like in zink https://gitlab.freedesktop.org/zmike/mesa/-/commit/77ed43462ad6d36bae0aac40ceb6ceb320b542d4
21:10 karolherbst: jekstrand: mhhh...
21:10 jenatali: We've been able to get by with using memo_ubo FWIW
21:11 zmike: the only real change related to u_transfer_helper is is the part at zink_resource.c:400
21:11 karolherbst: jekstrand: I was thinking we could just tread it like uniforms in galliums
21:11 jekstrand: It's pretty slick. No more magic UBO. :D
21:11 karolherbst: and const buffer 0 are the in shader constants
21:11 jekstrand: karolherbst: That could work, maybe.
21:11 jekstrand: Then we don't really want var_mem_ubo, we want var_uniform
21:11 jekstrand: But that's ok
21:11 karolherbst: you know what kills it?
21:11 jekstrand: ?
21:12 karolherbst: CL 2.0 allows you to bind SVM pointers to __constant kernel args :p
21:12 jekstrand: Fine with me. :)
21:12 karolherbst: yeah....
21:12 jekstrand: 64-bit address or bust!
21:12 karolherbst: well..
21:12 karolherbst: we could add a mem_constant and treat it like global memory and have normal global loads and shit
21:12 karolherbst: just with different caching or so
21:12 karolherbst: ....
21:12 karolherbst: no idea how much it actually matters
21:13 jekstrand: You can't cast between __constant and generic, right?
21:13 karolherbst: right
21:13 jekstrand: Ok
21:13 jekstrand: Then I'm starting to think nir_var_uniform and a lowering pass.
21:13 jekstrand: UBO doesn't make sense because that assumes all values come in from the outside.
21:13 karolherbst: well.... right
21:13 jekstrand: But nir_var_uniform allows initializers on entries, allows the driver to re-arrante at-will, etc.
21:14 karolherbst: ahhh
21:14 karolherbst: and then we can lower it to an ubo once the need arrises
21:14 jekstrand: It sounds like a much closer match to me
21:14 karolherbst: right now clovers ABI is stupid
21:14 HdkR: What happens if you union a constant and generic?
21:14 jekstrand: Yup, just like gallium does
21:14 karolherbst: HdkR: undefined? :p
21:14 HdkR: :D
21:14 karolherbst: you can cast, but you can't convert
21:14 karolherbst: but casting in C also gives you garbage if you are not careful :p
21:15 karolherbst: you can also do (int global*)0x12345678 and hope it does something sane :p
21:15 zmike: Kayden: I guess the simplest change to fix the threadsafety issue would be to add a member to struct pipe_transfer for the transfer format
21:15 karolherbst: jekstrand: the issue is that clover has a 8/24 bit split of the ubo index + offset
21:15 karolherbst: no idea if it makes sense to keep it
21:15 karolherbst: but that is how clover passes the buffer refernses in
21:15 jenatali: karolherbst: Sounds like it's time to use 64bit pointers ;)
21:15 karolherbst: so you have always 2 level indirections
21:15 karolherbst: which sounds stupid
21:16 karolherbst: jenatali: just count them...
21:16 karolherbst: there is like 0 benefit on how clover does it right now
21:16 karolherbst: and two level indirections suck anyway
21:16 jekstrand: karolherbst: 8/24 is reasonable for constants, I think.
21:16 karolherbst: the point is, that it's indirects
21:16 jekstrand: It just sucks because it's hard for NIR to pack/unpack
21:16 karolherbst: allows less optimizations
21:16 karolherbst: we can embed direct const buffer into instructions
21:17 karolherbst: like add $r0 $r1 c4[0x104]
21:17 karolherbst: but not if they are indirect
21:17 Kayden: zmike: Huh. I'm kind of surprised that Vulkan doesn't allow Z24S8 to be in a view compatibility class with R32_UINT
21:17 Kayden: I guess that's to allow it to actually be separate
21:17 jekstrand: karolherbst: Yeah, it doesn't
21:17 jekstrand: Kayden, rather
21:17 karolherbst: some instructions allow offset indirects, but those are rather rare
21:17 karolherbst: anyway
21:17 jenatali: jekstrand: That's one of the reasons we switched to 64bit pointers, so we could use 32bit pack/unpack ops
21:17 karolherbst: I'd like to just count them, just like we do in glsl
21:18 zmike: Kayden: I'm far from an expert, but I couldn't find any other way to do it besides handling them separately
21:18 Kayden: zmike: I might just add an aspect member to pipe_transfer
21:18 zmike: aspect would work for me as well
21:18 jekstrand: karolherbst: If the problem is NIR failing at pack/unpack, that might be fixable.
21:18 karolherbst: jekstrand: no, it's a clover issue :p
21:18 Kayden: think it's probably ok other than that
21:19 zmike: cool
21:19 zmike: Kayden: I can just add this as another patch on the MR?
21:19 Kayden: still not sure what u_transfer_helper is buying you without using the create/destroy hooks
21:19 karolherbst: clover expects the shader to take two indirects for all __constant accesses
21:19 karolherbst: as it passes the indirects into the kernel via the kernel input
21:19 Kayden: it seems very odd to me to be using half the mechanism
21:19 zmike: I prefere to think of it as an expedited version of the mechanism
21:20 Kayden: and, get/set_stencil hooks that set...the separate stencil hook to...itself
21:20 zmike: is there a gallium enum for aspect somewhere?
21:20 karolherbst: I think the idea was to make sub allocations and save space
21:20 karolherbst: but.. uff
21:20 karolherbst: that also kills performance a lot
21:20 Kayden: zmike: actually I'd just add PIPE_TRANSFER_DEPTH_ONLY and PIPE_TRANSFER_STENCIL_ONLY flags
21:20 jekstrand: karolherbst: :-/
21:20 zmike: Kayden: ah, okay
21:20 Kayden: zmike: then zink could assert that it has one of those set when seeing a packed depth/stencil format
21:21 Kayden: since it can't implement both at once
21:21 zmike: right
21:21 zmike: and I'll remove the format swizzling and just use the flags
21:21 karolherbst: although I am not sure if the offset is indirect or if clover just passes the size in... anyway
21:21 karolherbst: the index indirect is the annoying one
21:21 karolherbst: jekstrand: but hey.. you know we can only bind 8 ubos for compute shaders?
21:22 jekstrand: yeah....?
21:22 karolherbst: and we use 1 for kernel input and 1 for the driver buffer?
21:22 karolherbst: and now imagine what we have to do if the shader/kernel has 8 ubos
21:22 karolherbst: or rather we indirectly access ubo 9 like c[$r5][0x100]
21:22 jenatali: karolherbst: Switch them all to SSBOs? :)
21:22 karolherbst: jenatali: bingo
21:22 karolherbst: so...
21:23 karolherbst: having the index indirect means we have to lower to global memory
21:23 karolherbst: _all_ of them
21:23 karolherbst: except kernel input and driver buffer
21:23 karolherbst: which make __constant pointless :)
21:23 karolherbst: so yeah, the clover __constant ABI sucks a lot
21:23 jekstrand: karolherbst: That's pretty terrible
21:23 karolherbst: maybe AMD is fine with it
21:23 jenatali: Yeah, we ran into driver bugs with non-uniform accesses to constant buffers (UBOs) so we switched our __constant handling to SSBO/UAV anyway
21:23 karolherbst: we are not
21:24 karolherbst: jenatali: hah.. yeah
21:24 karolherbst: non uniform access can be slower than global memory even
21:24 karolherbst: on nv hardware
21:24 jekstrand: karolherbst: What's the status of the latest unstructured patches?
21:24 karolherbst: jekstrand: I reworked the vtn bits
21:24 jekstrand: karolherbst: Ready for me to read again?
21:24 karolherbst: fixed a few bugs
21:24 karolherbst: jekstrand: if you review the vtn changes, yes
21:24 HdkR: <3 when non-uniform uniform access ends up being slower than global
21:24 karolherbst: the nir pass? dunno, I think we might have to work on that together or something
21:25 jekstrand: karolherbst: I may actually review the NIR structurizer too. :)
21:25 karolherbst: :)
21:25 karolherbst: I think I would take care of some of the review, can't promise I'd fix all of the issues
21:25 karolherbst: but there are enough others motivated enough to fix that maybe :p
21:40 zmike: Kayden: I revised the MR to add those flags and pass them without touching resource->format
21:45 Kayden: zmike: in unmap ptrans->usage will have PIPE_TRANSFER_DEPTH_ONLY and PIPE_TRANSFER_STENCIL_ONLY set at the same time now
21:46 zmike: Kayden: we do?
21:47 Kayden: ah, no, you're resetting it
21:47 jekstrand: karolherbst: I hate switch statements. Have I evern mentioned that? :-P
21:47 karolherbst: I think you did
21:47 karolherbst: and I might have as well :p
21:48 karolherbst: I'd like to share the code between structured and unstructured, but I diverged quite a lot
21:48 Kayden: zmike: don't you just want to be setting those on trans->trans->usage and trans->trans2->usage?
21:48 zmike: uhhhh
21:48 zmike: my brain...
21:48 zmike: yes
21:48 jekstrand: karolherbst: Yeah, as would i
21:48 Kayden: or rather, I think they will be
21:48 jekstrand: karolherbst: Apart from the hash set/table issue I commented on, the new thing looks great.
21:48 karolherbst: :) cool
21:49 jekstrand: karolherbst: I think we could simplify it even more if we were clever with the worklist
21:49 karolherbst: yeah...
21:49 karolherbst: maybe
21:49 Kayden: because your trans2 = transfer_map(... | STENCIL_ONLY) will set it on trans2->usage already
21:49 zmike: Kayden: I think...yeah, you're right
21:49 jekstrand: karolherbst: How are you testing it?
21:49 zmike: I was being overzealous
21:49 karolherbst: jekstrand: some tests and test_basic if and test_basic loop
21:49 jenatali: jekstrand: We've tested it with CLOn12 as well
21:49 karolherbst: and relying on jenatali and co to catch issues :p
21:49 jekstrand: karolherbst: K... I may need to dig up my iris clover branch.
21:50 jenatali: Not super thoroughly though
21:50 airlied: bring back shamrock
21:50 jekstrand: airlied: Do you have a more recent version somewhere?
21:50 jekstrand: My branch is oooooollllld
21:50 jekstrand: karolherbst: do you have a clover branch I should rebase on?
21:50 karolherbst: not anymore
21:50 karolherbst: uhm...
21:50 jekstrand: :-/
21:50 karolherbst: one thing maybe
21:51 karolherbst: jekstrand: you get 32 bit addresses for allocated buffers, no?
21:51 jekstrand: karolherbst: I think I used 32-bit, yeah.
21:51 karolherbst: jekstrand: the thing is, upstream sutff just works :p
21:51 karolherbst: but for 64 bit pointers you might need https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6064
21:51 jekstrand: karolherbst: Oh?
21:51 airlied: https://gitlab.freedesktop.org/airlied/mesa/-/commits/shamrock-constant-buffers/ might be the last time I rebasd it
21:51 jekstrand: karolherbst: I didn't realized we'd merged all of it.
21:51 airlied: still old
21:51 karolherbst: jekstrand: I have the structurizer + 64 bit pointers and even SVM stuff just works now :p
21:52 jekstrand: airlied: Newer than mine. :)
21:52 karolherbst: jekstrand: maybe some spirv shit is annoyihg sometimes
21:52 jekstrand: karolherbst: Cool. I'll re-up shamrock so I can have a slightly more sane environment for hacking on this.
21:52 airlied: you can have my lvl0 on anv, though I don't have opencl on lvl0 on anv :-P
21:52 karolherbst: jekstrand: branch I tested the structurizer with: https://gitlab.freedesktop.org/karolherbst/mesa/-/commits/clover_tmp
21:52 karolherbst: "add validation" and the spirv change might be helpful
21:53 jekstrand: karolherbst: Wow, that's really close to master
21:53 karolherbst: yeah
21:53 karolherbst: I disable HMM because of ... kernel bugs :p
21:53 jekstrand: karolherbst: Ok, I think I'll rework shamrock on top of your structurizer branch then
21:53 jekstrand: From the MR
21:53 karolherbst: cool
21:55 karolherbst: bbrezillon: mind pushing your updated round implementation? I think then we could just merge it :p
21:55 karolherbst: happy to accept anything that works on d3d and nouveau
21:55 jekstrand: This isn't gonna rebase well....
21:55 zmike: Kayden: third time's the charm?
21:55 karolherbst: jekstrand: don't bother... the only thing you probably need is the gallium api bits, and some nir stuff?
21:56 karolherbst: or was there more?
21:56 karolherbst: I assume most nir patches can just go
21:56 jekstrand: karolherbst: There were some iris patches to make inputs work and stuff
21:56 karolherbst: right..
21:56 jekstrand: And it's > 1 year old
21:56 jekstrand: So yeah
21:56 karolherbst: jekstrand: or just use llvmpipe :p
21:56 jekstrand: I want to test stuff in our compiler stack too
21:56 karolherbst: ahh
21:57 karolherbst: but fun how at first I always thought it might stay a little hobby project of me pmoreau and maybe some others to implement CL :D
21:57 airlied: just merge the shamrock patches under a debug envvar
21:58 jenatali: karolherbst: I'm glad it didn't :)
21:58 jekstrand: airlied: I'll probably try to land at least some of them.
21:58 karolherbst: jenatali: wondering what would have happened if I wouldn't have pushed for nir in clover that strong :p
21:58 jekstrand: airlied: As long as I can keep Larabel (and therefore management) from picking up on it, I'm good.
21:59 Sachiel: time for a new account
21:59 karolherbst: jekstrand: ehh, just say that's the one CL stack they know about, just different name. They probablt wouldn't notice it's a new one besides the other 10 :p
21:59 jekstrand:has been tempted to have airlied re-author and push them.
21:59 jekstrand: BUt I don't think we need to go that far
21:59 airlied: jekstrand: doesn't call the envvar INTEL_DEBUG=opencl
21:59 airlied: jekstrand: I should do that at some point :-P
21:59 karolherbst: jekstrand: just create a new account on gitlab and calling it senior_jason and nobody will know it was you :p
22:00 Sachiel: definitelynotjason@jlkestrand.net
22:01 Kayden: zmike: so how does writeback work?
22:01 zmike: Kayden: hm?
22:02 airlied: jekstrand: but I can submit them and have karolherbst ack them :-P
22:02 karolherbst: sure :p
22:02 karolherbst: I ack everything
22:02 Kayden: hm, I guess it get split back up and uploaded twice
22:02 Kayden: zmike: a bit confused why you need to munge the format to depth-only in flush_region()
22:02 Kayden: zmike: it doesn't look like it does anything
22:02 zmike: umm
22:02 Kayden: zmike: originally I was wondering if you needed a flush_region for depth and another for stencil
22:03 Kayden: but I think it's ok with 1
22:03 zmike: Kayden: yea that's...actually probably a good point?
22:03 Kayden: the flush splits the packed thing back out into separate, and then zink's transfer unmap would upload each, I guess
22:03 Kayden: re-packing it again
22:03 zmike: right
22:03 jekstrand: airlied: Let me see what I think is upstreamable after I get done rebaseing, shall we?
22:04 Kayden: but yeah, flush_region always does both, so ... could drop the format changes there I think
22:04 zmike: alrighty
22:04 zmike: dropped
22:05 karolherbst: jekstrand: I think once we talked about it on a publicly logged channel, the chances of getting found out might be not 0 anyway :p
22:05 Kayden: probably want to re-test that in case I'm completely off-base
22:05 zmike: yea I'm trying to find a case that hits it
22:06 Kayden: jekstrand: we actually had customers asking about CL in mesa using gallium and not wanting to use NEO
22:06 karolherbst: :D
22:06 karolherbst: more CL stacks!
22:06 karolherbst: at some point intel alone will have more CL stacks than everybody else combined :P
22:06 Kayden: they emailed Mark and he emailed me and I told him "uhh, that doesn't really exist"
22:07 jekstrand: karolherbst: It's been a while... Do I want -Dgallium-opencl=icd?
22:07 karolherbst: jekstrand: yeah
22:07 Kayden: (beyond "some tests run" at least)
22:07 karolherbst:doesn't bother with standalone and would like to remove it
22:07 karolherbst: jekstrand: you probably have ocl-icd installed anyway
22:07 airlied: I think we are getting close to being able to CI CL at least
22:07 airlied: would be good to get some coverage in the Intel CI if it was possible
22:08 jekstrand: karolherbst: Looking through the patches again, some of them seem pretty trivial.
22:08 karolherbst: airlied: I guess we should get your libclc stuff in as well
22:08 jekstrand: karolherbst: I'll make an MR out of what I think we can land
22:08 karolherbst: jekstrand: cool!
22:09 jekstrand: There's no reason why we can't land global resources, for instance.
22:09 karolherbst: yeah
22:09 karolherbst: for us there is no differences between CL and GL compute anyway :p
22:09 jekstrand: It's a tiny number of lines of code
22:09 karolherbst: *are no
22:09 jekstrand: There's not much difference for us these days between CL and Vulkan compute
22:09 jekstrand: Vulkan's added so many features
22:10 karolherbst: right...
22:10 karolherbst: and our hw is "simple" enough so it doesn't matter :p
22:10 jenatali: karolherbst, airlied: FYI, I've got an MR for an alternate integration with libclc. Needs to be rebased on top of some of jekstrand's changes though...
22:10 jenatali: I guess, alternate's not right. Extended :)
22:11 karolherbst: :)
22:16 zmike: Kayden: confirmed that's fine
22:30 jekstrand: karolherbst: Are OpenCL kernels always local_size_variable?
22:30 karolherbst: yes
22:31 jekstrand: I seem to recall there maybe being an intel extension to for local size.
22:31 karolherbst: kernels can provide hints, but the API can always do whatever
22:31 jekstrand: But maybe not
22:31 jekstrand: Oh, ok.
22:31 jenatali: karolherbst: No, there's a required size annotation in addition to the hint
22:31 karolherbst: normally you set local_size to NULL and the runtime chooses the perfect size for you :p
22:31 karolherbst: jenatali: ohh, really?
22:31 karolherbst: since when?
22:31 jenatali: Forever? Not sure
22:32 karolherbst: never heard of it though
22:32 jenatali: https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#optional-attribute-qualifiers
22:32 jenatali: reqd_work_group_size
22:32 karolherbst: ohh, didn't know
22:33 karolherbst: so maybe it's indeed not _always_ variable :O
22:33 karolherbst: jenatali: but the documentation of clEnqueueNDRangeKernel doesn't specify anything about it?
22:33 karolherbst: weird
22:33 jenatali: It should
22:33 karolherbst: ohh
22:33 karolherbst: I think I missed it
22:33 karolherbst: "CL_​INVALID_​WORK_​GROUP_​SIZE if the program was compiled with cl-uniform-work-group-size and the number of work-items specified by global_work_size is not evenly divisible by size of work-group given by local_work_size or by the required work-group size specified in the kernel source."
22:34 karolherbst: or is it something else?
22:34 jenatali: Let me find it, one sec
22:34 jenatali: > The work-group size to be used for kernel can also be specified in the program source or intermediate language. In this case the size of work group specified by local_work_size must match the value specified in the program source.
22:34 karolherbst: jekstrand: anyway.. by deafult most kernels are variable. Never saw those fixed ones
22:34 jenatali: (from the CL3.0 API spec)
22:35 karolherbst: it's also in 2.2
22:35 jenatali: CL_​INVALID_​WORK_​GROUP_​SIZE if local_work_size is specified and does not match the required work-group size for kernel in the program source.
22:35 karolherbst: ahh, okay
22:35 karolherbst: I guess I should fix this in clover
22:35 karolherbst:always thought the API can always overwrite
22:36 jenatali: Nah, only for work_group_size_hint, which isn't the required hint
22:37 karolherbst: right...
22:37 karolherbst: it's actually part of CL 1.0 :O
22:37 jenatali: Yeah, I thought it might've been
22:37 jekstrand: cmarcelo: How do variable workgroup sizes work in iris? Does it get stuffed in cbuf0 somehow?
22:37 jekstrand: cmarcelo: Or is it push constants?
22:38 karolherbst: jenatali: the CTS even has tests for it :D
22:38 jenatali: Yup
22:43 jekstrand: Yeah, looks like it all goes in cbuf0
22:47 jekstrand: Which is fine but annoying
22:48 karolherbst: jenatali: api kernel_required_group_size :)
22:49 jenatali: karolherbst: Yep, I'm aware
22:49 jekstrand: karolherbst: How are you handling inputs in clover? cbuf1?
22:49 karolherbst: yeah.. I think I'll fix it tomorrow then :D
22:49 karolherbst: jekstrand: cbuf0
22:49 karolherbst: but clover has it's special API
22:49 karolherbst: jekstrand: pipe_grid_info.input or so
22:49 jekstrand: karolherbst: It just stuffs them in there already?
22:49 jekstrand: karolherbst: Or do you have to have nouveau code for it?
22:49 jekstrand: karolherbst: It'd be really nice if it was automatic. :)
22:50 karolherbst: yeah....
22:50 karolherbst: clover has a lot of ... own APIs which are potinless
22:50 jekstrand: Is it automatic or no?
22:50 karolherbst: it is not
22:50 jekstrand: Ok
22:51 jekstrand: That sounds like a good improvement we could make
22:51 jekstrand: But, for now, I'll just stuff it in cbuf0 after the system values
22:51 jekstrand: It shouldn't be hard
22:51 karolherbst: jekstrand: cso->req_input_mem and info->input
22:51 karolherbst: and then you just upload it yourself..
22:51 karolherbst: it's annoying really
22:52 jekstrand: karolherbst: Oh, so there's already a size field I can just grab?
22:52 karolherbst: on the cso, yes
22:52 jekstrand: Cool. I think that helps. Not sure, but I'm looking
22:52 karolherbst: pipe_compute_state..req_input_mem
22:52 jekstrand: karolherbst: I have the uncompiled_shader. Does that have it?
22:52 karolherbst: yes
22:53 karolherbst: when clover calls create_compute_state it just provides the size already
22:53 jekstrand: Ok, cool. I just need to figure out how to fish it out then.
22:53 karolherbst: check nvc0_cp_state_create :p
22:53 karolherbst: it's really just saving that value until later
22:54 karolherbst: nve4_compute_upload_input uses it when the grid gets launched
22:54 karolherbst: really..
22:54 karolherbst: we should just use cb0...
22:54 jekstrand: We really should
22:55 jekstrand: It means pulls on Intel instead of push but meh
22:55 jekstrand: We aren't getting pulls in iris
22:55 karolherbst: mhh
22:55 karolherbst: how are you handling uniforms then?
22:55 jekstrand: Aren't getting push in iris, rather.
22:56 jekstrand: And if we really cared, we could do some sort of opimtization on cbuf0 in general.
22:56 karolherbst: the thing is, the kernel input can be smaller than constant buffers.. so maybe it does make sense to treat it differently?
22:56 karolherbst: mhh
22:56 karolherbst: yeah...
22:56 karolherbst: but sizes are different
22:56 karolherbst: and then you could just call it cbuf0 :)
22:56 jekstrand: ?
22:56 karolherbst: but then it messes with nir again
22:56 karolherbst: jekstrand: I assumed that the push constant space isn't as big as ubos
22:57 karolherbst: maybe we should treat uniforms/kernel inputs as a different thing alltogether then...
22:58 karolherbst: mhhh
22:58 jekstrand: karolherbst: That's for us to worry about. :P
22:58 Lyude: hey airlied, how do you usually merge nouveau into drm? mainly wondering if it's a part of drm-tip or not before you pull
22:58 karolherbst: well.. gallium might need to check for the size and stuff
22:58 karolherbst: dunno if there is a special "uniform" size thing
22:58 karolherbst: or if all uniforms put together are always as big as an ubo
22:59 karolherbst: jekstrand: anyway, as it's right now, you can just use push constants I guess
22:59 karolherbst: and advertise 1k as the max size
23:00 karolherbst: jekstrand: PIPE_COMPUTE_CAP_MAX_INPUT_SIZE is the CAP for the input size
23:07 jekstrand: Now time to remember how to invoke iris clover
23:08 jekstrand: Well... clinfo says I have a driver.
23:08 jekstrand: But no devices.
23:10 jekstrand: karolherbst: Do I have to add iris to the clover deps anywhere?
23:10 jekstrand: karolherbst: I had a patch to do that but it conflicted so bad I had to throw it out
23:12 airlied: jekstrand: do you get to iris_create_screen?
23:12 jekstrand:tries
23:12 airlied: or iris_screen_create rather
23:12 jekstrand: airlied: Nope
23:13 jekstrand: I know I've had this problem before....
23:13 Kayden: zmike: are you updating the MR to drop those bits, then?
23:13 zmike: Kayden: uhhh did I not push?
23:13 Kayden: doesn't look like it
23:13 zmike: oh
23:14 zmike: well pushing here says it's up to date
23:14 airlied: jekstrand: yeah me too and I alwys debug it and forget it :-P
23:15 zmike: Kayden: assuming I didn't miss something, the last thing we talked about was keeping the resource format in unmap for flush, and that's removed in the MR now
23:15 airlied: jekstrand: I usually hit strace to see if it's loading things
23:15 airlied: from where I expect
23:15 Kayden: zmike: oh, I meant dropping the flush_region() format parameter
23:15 zmike: ohh
23:15 zmike: I misunderstood
23:15 Kayden: like I think the first 5 hunks are unnecessary
23:15 zmike: right
23:16 zmike: this is from an older version of the patch actually
23:16 zmike: I forgot I added that...
23:16 jekstrand: airlied: It's looking for stuff in lib64/gallium-pipe
23:17 jekstrand: Ah, need to setup pipe-loader stuff
23:18 jekstrand: maybe?
23:18 airlied: sounds like it
23:19 airlied: though I thought that was automagic
23:19 Kayden: I don't think it's entirely auto-magic
23:19 Kayden: I think you need a couple lines to ask it to try
23:19 jekstrand: There's a C file which looks like it could be automagic but maybe isn't?
23:19 jekstrand: No, it isn't because it's checked into git
23:20 zmike: Kayden: okay, removed
23:20 airlied: jekstrand: pipe_loader_create_screen also another place to dig
23:23 Kayden: zmike: thanks, looks good now
23:23 zmike: Kayden: thanks for the reviews, will add that note in
23:26 jekstrand: Oh, I think I built it static before which is why this wasn't a problem
23:29 airlied: might be an envvar
23:30 jekstrand: Ok, got that. Now just need to figure out how to make driconfig not borked
23:33 jekstrand: Ok, it's not crashing now.
23:33 jekstrand: Also not finding iris. :-(
23:35 Kayden: zmike: thanks! sorry about all the fuss, hopefully the result is a bit nicer in the end
23:37 jekstrand: karolherbst, airlied: Where am I supposed to find llvmspirvlib?
23:37 jenatali: jekstrand: https://github.com/KhronosGroup/SPIRV-LLVM-Translator ?
23:37 jekstrand: jenatali: Right....
23:38 airlied: ah yes that old chestnut
23:38 airlied: if ou have a system llvm just checkot the correct branch
23:38 jekstrand: airlied: Is that ever going to get packaged?
23:39 airlied: jekstrand: once we get mesa cl needing it :-P
23:39 airlied: it's like chicken-egg at the moment
23:39 airlied: don't want to bother tstellard before I really have to
23:39 jekstrand: airlied: Fair enough
23:43 jekstrand: This is fun....
23:43 jekstrand: /usr/bin/ld: /home/jason/.local/lib64/libLLVMSPIRVLib.a(SPIRVWriter.cpp.o): relocation R_X86_64_32 against symbol `__pthread_key_create@@GLIBC_2.2.5' can not be used when making a shared object; recompile with -fPIC
23:45 airlied: jekstrand: stick osme -fPIC in there
23:45 jekstrand: Yeah
23:45 jekstrand: Doing so now
23:45 jekstrand:hates having to remember all this garbage
23:45 jekstrand: libLLVMSPIRVLib seems like a really good candidate for a meson wrap
23:46 airlied: it also seems like something we should just vendor into mesa somedays :-P
23:49 jekstrand: Woo! I have a driver!
23:49 jekstrand: Now to watch it crash
23:52 karolherbst: ehhh "OpExecutionMode %10 LocalSize 64 14 1" mhh
23:52 karolherbst: that's how it's done in vulkan, isn't it?
23:53 jenatali: karolherbst: Yeah, LocalSize is either Vulkan size or CL required size
23:53 jenatali: LocalSizeHint is the hint variant
23:54 karolherbst: ehh.. to no surprise, it's not implemented in clover... at all
23:54 karolherbst: not even for llvm
23:55 jenatali: Btw, you might want the first 2 patches from https://gitlab.freedesktop.org/kusma/mesa/-/merge_requests/34 if you're looking at those attributes
23:56 karolherbst: jenatali: don't care about the hint for now :p