03:30 mareko: anholt: yes, we'll need fp16 interpolation for load_interpolated_input
03:52 mareko: anholt: our barycentric coordinates always have 32 bits
04:19 jekstrand:has a writeimage passing. \o/
04:21 jenatali: jekstrand: I thought you peaced out :P
04:22 jekstrand: jenatali: I warred back in?
04:22 jekstrand:doesn't know
04:22 jenatali: Congrats either way :P
04:22 jekstrand: thanks
04:23 jekstrand: I've also come to the conclusion that clover image support needs to be rewritten ground-up
04:23 jekstrand: Maybe not quite that bad
04:23 jekstrand: But it needs some work
04:24 jekstrand: It assumes images "just work" which isn't quite true
04:24 jekstrand: It also doesn't talk gallium the way 3D drivers do *at all*
04:24 jenatali: :/
04:26 airlied: jekstrand: seems correct, it's really r600 specific code
04:28 jekstrand: airlied: Yeah, the history on it is pretty obvious. As in, someone added images for r600 clover and then someone else added them for GL and the two basically didn't talk. Then gallium moved on and evolved and the clover image stuff just sat there.
04:28 jekstrand: So not only is it different, it's also self-inconsistent with the rest of gallium by now.
04:29 jekstrand: For instance, clover doesn't set pipe_surface::writeable on things that it clearly writes.
04:29 jenatali: Wait... Clover images predate GL?
04:30 jekstrand: predate the gallium implementation of GL_image_load_store, yes
04:30 jenatali: Huh
04:30 jekstrand: At least that's the way imirkin made it sound the other day
04:31 jekstrand: airlied: You may have to mail karolherbst an r600 :P
04:40 jekstrand: jenatali: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6578
04:40 jekstrand: jenatali: Made an MR out of the three probably-ok patches.
04:40 jekstrand: The rest is a mess
04:40 jekstrand: jenatali: But, if you're interested in common NIR code for image -> tex lowering, I figure we might want to land it as part of CLOn12 rather than waiting for clover to do something in common code.
04:41 jekstrand: jenatali: No pressure though. I'm not trying to rock your boat.
04:41 jenatali: jekstrand: I'm not sure it's super important. DXIL requires images to be typed, meaning we have to do more than just lowering the image ops to tex ops, we have to replace the variables with typed versions of them based on which types of reads/writes are done
04:42 jenatali: So, we could share *part* of a common pass, but right now it's all one pass for us
04:42 jekstrand: jenatali: You require typed for write-only UAVs?
04:42 jenatali: jekstrand: Yep
04:42 jekstrand: Interesting
04:42 jekstrand: Desktop never did
04:42 jekstrand: Desktop GL, that is
04:42 jenatali: We support "RAW" UAVs for buffers, but if you want automatic type conversions (e.g. float -> unorm) then the type needs to be declared in the shader
04:43 jenatali: At least, the type of float vs int vs uint
04:43 jekstrand: That's the same as GLES. In desktop GL, you only need a format if you're going to read from it.
04:43 jenatali:shrugs
04:43 jekstrand: So CL works pretty well. You do reads as a texture and writes as storage/UAV
04:43 jekstrand: It only gets tricky on read/write images
04:44 jenatali: Yeah, our read/write is the same as write-only, except we have API-level caps for which formats are supported that way
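The read-as-texture, write-as-storage split described above can be sketched roughly as follows. This is only an illustration: the `Access` enum, the `binding_for` helper, and the returned labels are hypothetical names, not clover, CLOn12, or NIR API.

```python
from enum import Enum


class Access(Enum):
    """OpenCL image access qualifiers."""
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    READ_WRITE = "read_write"


def binding_for(access: Access) -> str:
    """Pick a binding kind for a CL image argument (illustrative labels only)."""
    if access is Access.READ_ONLY:
        # Reads can go through the texture/sampler path.
        return "texture"
    if access is Access.WRITE_ONLY:
        # Writes go through a storage image / UAV.
        return "storage"
    # Read/write is the tricky case: reads need to know the format.
    return "storage+format"
```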
04:44 jekstrand: ah
04:44 jenatali: Lots of this predates me in this space though, most of it was just carried from D3D11 -> 12
04:44 jekstrand: Yeah, our HW is an unfortunate mess when it comes to UAV reads and formats.
04:44 jekstrand: (And probably the reason for all those caps. *sigh*)
04:53 jenatali: Sounds like https://reviews.llvm.org/D85911 is just about ready to land, so hopefully I can take the WIP tag off libclc soon (without having to split it)
04:54 jekstrand: jenatali: Cool!
04:54 jekstrand: jenatali: The __builtin one is still in limbo?
04:54 jenatali: jekstrand: Sounds like they're requesting a rename of the function but otherwise seems fine
04:55 jekstrand: jenatali: What are they suggesting it be named?
04:55 jenatali: jekstrand: Something like __clc_runtime_has_fma32
04:55 jekstrand: That seems reasonable
04:56 jenatali: Yep, agreed
04:57 jenatali: But since we don't have any patches in the libclc series that care about it, I don't think it's critical to land that before that series
04:57 jekstrand: Agreed
04:59 jenatali: So I still just need to rewrite how we do conversions (and finish getting the rest of my reworks into our downstream fork... *cough* images/constants/alignments *cough*) and then I think we'll be in good shape to actually be able to plop the rest upstream
05:00 jenatali: But for now, bedtime
05:11 jekstrand: jenatali: There's now a patch in that MR you do care about. It's a fix for access qualifiers. Turns out I gave you some very bad review feedback. :-( Sorry...
05:12 jekstrand: karolherbst: Pass 78 Fails 11 Crashes 13 Timeouts 4 with images enabled. :D
05:13 mareko: jekstrand: any plan to have a textual representation of NIR for internal shaders? (GLSL would be OK too, but the current NIR API is ugly for this)
05:14 jekstrand: mareko: No plans ATM
05:14 jekstrand: mareko: Why do you want text?
05:15 mareko: jekstrand: we use TGSI text in radeonsi at the moment
05:15 jekstrand: So far, everyone who's doing NIR codegen from inside the driver uses nir_builder.
05:15 jekstrand: I've not looked at your shaders but, IMO, it's way better than any GLSL codegen I've ever seen.
05:16 mareko: jekstrand: even pure GLSL?
05:16 jekstrand: If you can write just a pure GLSL string, that's probably better, sure.
05:16 jekstrand: But the moment you start generating GLSL with a pile of printfs......
05:17 mareko: it would be a constant string, maybe formatted with %s etc.
05:18 jekstrand: It's been suggested before. Usually in the context of testing.
05:18 jekstrand: In fact, there may even be something in-tree but I don't know how well it works. I've never used it.
05:19 jekstrand: "nir_builder is too painful to use for a few built-in shaders" isn't something I hear often.
05:21 mareko: glsl_float64_funcs_to_nir is close to what I might need
05:21 jekstrand: I would be a little worried about making a text representation super-important to the driver unless we have some very good unit tests as part of `ninja test` because it's not going to get caught by compile testing if someone ever introduces a bug.
05:21 jekstrand: mareko: If you're already a GL driver, spinning up the GLSL compiler isn't terrible, I don't think.
05:21 jekstrand: It may be a little painful from a gallium back-end though. Fishing out the GL context can be a pain.
05:22 jekstrand:peaces out again
05:22 airlied: just use tgsi and tgsi->nir :-P
05:23 mareko: yes we use that right now (TGSI text -> TGSI -> NIR)
05:23 jekstrand:glares at airlied
05:25 mareko: I'm thinking of a standalone tool that does GLSL text -> serialized NIR at build time
05:26 airlied: mareko: what's the worst shaders you have? the query ones?
05:26 mareko: yes
05:26 airlied: mareko: radv has probably equivs for most of those in nir already
05:27 mareko: radeonsi doesn't have descriptor tables in NIR though
05:28 mareko: I'll just leave the TGSI shaders as-is, long live TGSI!
05:28 airlied: mareko: oh yeah but it would be a one time effort to just port the radv ones
06:15 daniels: jekstrand: nice work!
06:15 daniels: airlied: definitely send karolherbst some r600, might make gerddie feel less lonely :P
06:17 airlied: daniels: I should ebay gerddie some fp64 hw :-P
06:17 daniels: fp64 :(
06:18 DrNick: GLSL text -> C nir_builder source
07:31 pq: Lyude, well, if the backlight switch is defined as "compliant HDR" vs. "battery-saving non-compliant HDR", maybe that will be simple enough? Otherwise sounds like it could be non-trivial to define what it actually means and how it's used.
07:40 pq: Lyude, or, if you want to keep the HDR "compliant", then userspace will need to know the changed max HDR luminance.
09:14 MrCooper: DPA: something like https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4861 might work for etnaviv as well
09:28 DPA: MrCooper: Thanks, I'll take a look at it.
10:16 austriancoder: MrCooper: etnaviv is special as it is a renderonly gpu and I do not think that will help here
10:17 MrCooper: not sure how that would matter?
10:18 MrCooper: Bottom line is always that the driver needs to keep track of the original DRM FD passed in, and make sure GEM handles it returns are valid for that
10:23 austriancoder: MrCooper: that is all fine on etnaviv. The thing is that we have two fds (gpu and kms) and we need to use the correct kms fd to allocate a dumb buffer and import it. In the problematic case (two kms devices) we always use the first used kms fd for allocation and not the correct one. We create one pipe_screen instance where we store the kms fd. A hashmap is used where the gpu fd is the key (see etnaviv's winsys).
10:26 MrCooper: if it was all fine, we wouldn't be having this conversation :)
10:27 MrCooper: the problem is that the screen de-duplication only takes the device into account, not the file description referenced by the DRM FD
10:28 MrCooper: if the DRM FD passed in for creating the second screen references a different description, GEM handles generated by the de-duplicated screen aren't valid for it
10:29 MrCooper: that is what that iris MR fixed
10:30 MrCooper: (I just got disconnected briefly, so might have missed something you wrote, or something I wrote might have been dropped)
10:34 emersion: (no, you haven't missed anything)
10:34 MrCooper: thanks
10:37 austriancoder: MrCooper: so that's why we are using the DRM fd for the screen hashmap. The created screen has a reference to the kms fd that gets used to alloc a dumb buffer. If we have one kms and one etnaviv device everything is fine
10:39 DPA: I think we all mean the same thing here, but there are many ways to fix it:
10:39 DPA: 1) Don't de-duplicate screens at all
10:39 DPA: 2) de-duplicate screens not on a per gpu, but on gpu+kms device basis.
10:39 DPA: 3) differentiate between a common per gpu screen and kms pipe_screens
10:39 DPA: 4) Keep track of the kms device for a handle in other ways
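Option 2 in the list above could look roughly like the sketch below. It is a toy model under stated assumptions: `ScreenCache`, `file_key`, and `demo` are hypothetical names, regular temporary files stand in for the GPU and KMS device nodes, and the cache key identifies the underlying file by (device, inode) only. It therefore does not address the file description concern MrCooper raises above.

```python
import os
import tempfile


def file_key(fd):
    """Identify the underlying file by (device, inode)."""
    st = os.fstat(fd)
    return (st.st_dev, st.st_ino)


class ScreenCache:
    """Hypothetical cache: one screen object per (GPU file, KMS file) pair."""

    def __init__(self):
        self._screens = {}

    def get_screen(self, gpu_fd, kms_fd):
        key = (file_key(gpu_fd), file_key(kms_fd))
        if key not in self._screens:
            # Stand-in for creating a real screen object.
            self._screens[key] = {"gpu": key[0], "kms": key[1]}
        return self._screens[key]


def demo():
    """Return (same pair shares a screen, other KMS device shares a screen)."""
    with tempfile.TemporaryDirectory() as d:
        paths = [os.path.join(d, n) for n in ("gpu", "kms0", "kms1")]
        for p in paths:
            open(p, "w").close()
        gpu, kms0, kms1 = (os.open(p, os.O_RDONLY) for p in paths)
        gpu_again = os.open(paths[0], os.O_RDONLY)  # second open of the same "GPU"
        try:
            cache = ScreenCache()
            a = cache.get_screen(gpu, kms0)
            b = cache.get_screen(gpu_again, kms0)  # same pair: de-duplicated
            c = cache.get_screen(gpu, kms1)        # other KMS device: separate screen
            return (a is b, a is c)
        finally:
            for fd in (gpu, kms0, kms1, gpu_again):
                os.close(fd)
```

With this keying, a second KMS device gets its own screen while repeated requests for the same GPU and KMS pair still share one.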
10:39 austriancoder: if there is a second kms device we want to use in parallel we have the same gpu fd but two different kms fds.
10:39 karolherbst: DPA: multiple screens can mean multiple contexts on a GPU which some GPUs don't have many of
10:40 austriancoder: DPA: i think this issue should be relevant for all renderonly gpus
10:40 MrCooper: austriancoder: the hash table is keyed on the gpu_fd, which is presumably not the original FD passed in for creating the screen
10:41 austriancoder: MrCooper: could be yes..
10:42 austriancoder:sits on a park bench
10:42 DPA: MrCooper: That shouldn't matter, util_hash_table_create_fd_keys compares the inodes: https://gitlab.freedesktop.org/mesa/mesa/-/blob/master/src/gallium/auxiliary/util/u_hash_table.c#L83
10:44 DPA: Or are the gem handles only valid for exactly the same filedescriptor?
10:44 MrCooper: it matters because it doesn't ensure the original DRM FDs passed in for creating the screens reference the same file description (and thus the same GEM handle namespace)
10:47 pq: austriancoder, to clarify, if someone does open("/dev/dri/card0") twice to get two fds, they are different file descriptions. GEM handle made for one is not valid for the other.
10:47 MrCooper: GEM handles are valid for a file description, which can be referenced by any number of file descriptors; something like os_same_file_description is needed to check if two FDs reference the same description
10:50 pq: So I don't think comparing inodes is correct either. It confirms the underlying device is the same, but not the file description.
10:51 pq: file description != file descriptor
10:56 MrCooper: right, exactly
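The descriptor versus description distinction can be seen on an ordinary file, without any DRM device. The sketch below is an illustration only: `shares_offset` and `demo` are hypothetical helpers, and the shared file offset is used purely as a visible symptom of two descriptors referring to one open file description. It is not how Mesa's os_same_file_description works.

```python
import os
import tempfile


def shares_offset(fd_a, fd_b):
    """Return True if moving fd_a's file offset also moves fd_b's.

    Descriptors that refer to the same open file description share one offset.
    The original offsets are restored before returning.
    """
    old_a = os.lseek(fd_a, 0, os.SEEK_CUR)
    old_b = os.lseek(fd_b, 0, os.SEEK_CUR)
    probe = old_b + 1  # any position that differs from fd_b's current offset
    os.lseek(fd_a, probe, os.SEEK_SET)
    shared = os.lseek(fd_b, 0, os.SEEK_CUR) == probe
    os.lseek(fd_a, old_a, os.SEEK_SET)
    return shared


def demo():
    """Return (dup shares a description, second open shares a description)."""
    with tempfile.NamedTemporaryFile() as tmp:
        fd1 = os.open(tmp.name, os.O_RDONLY)
        fd2 = os.open(tmp.name, os.O_RDONLY)  # second open(): new file description
        fd3 = os.dup(fd1)                     # dup(): same file description as fd1
        try:
            return (shares_offset(fd1, fd3), shares_offset(fd1, fd2))
        finally:
            for fd in (fd1, fd2, fd3):
                os.close(fd)
```

Both `fd1` and `fd2` name the same inode, yet only the dup'ed descriptor shares state with `fd1`, which is why an inode comparison alone cannot answer the question.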
10:59 pq: I don't know how one could check if two fds refer to the same file description. The inode check only checks for the same file.
10:59 danvet: export gem handle on fd1, import on fd2
10:59 danvet: if you get back the same gem handle it's either a) bad luck b) same drmfd :-)
10:59 danvet: not aware of anything else
11:03 danvet: so if someone else has duped your fd, then it's pretty much lost
11:03 MrCooper: pq: see Mesa's os_same_file_description
11:03 danvet: also, since gem handles aren't refcounted, the other guy could close stuff and you crash
11:04 MrCooper: there's an ioctl for that
11:04 danvet: ah yeah just looked
11:04 MrCooper: actually a syscall, kcmp
11:05 pq: MrCooper, ooh, SYS_kcmp, thanks
11:06 pq: sounds new :-)
11:07 danvet: oh from 2012
11:07 danvet: for checkpoint restore stuff
11:08 DPA: Is os_same_file_description good enough for deduplication, or should I only do stuff on the first file description, and import/export handles as needed?
11:08 danvet: this still freaks me out since if the other side just does GEM_CLOSE it might all break badly
11:09 danvet: this works only (maybe?) for display-only kms drivers I think
11:09 danvet: even there maybe not
11:13 danvet: MrCooper, or is that somehow excluded?
11:13 danvet: that intel mesa pull very much talks about multiple exports of the same stuff
11:14 danvet: or is this just internally within mesa and the gbm thing is still refcounted?
11:34 DPA: How effective is this method of deduplication, doesn't every program that makes a glx or egl context get a different file description for the gpu and a different instance of the mesa libs?
11:34 DPA: So it would only be per process? How many gpu contexts does etnaviv have? And how many programs use the GPU for stuff like to accelerate gui rendering?
11:36 karolherbst: DPA: just from a normal desktop I have like 15 clients on the intel GPU
11:36 karolherbst: and normally with qt5 and gtk3 you get a couple of those
11:36 pq: DPA, I suspect there might be some EGL specification behind this, like saying eglGetDisplay() with the same arguments always returns the same EGLDisplay instead of a new one.
11:37 pq: ...which is what https://www.khronos.org/registry/EGL/sdk/docs/man/html/eglGetPlatformDisplay.xhtml indeed says
11:37 karolherbst: DPA: also.. removing deduplication is just out of the question, there are GPUs with like 16 contexts top out there afaik
11:39 karolherbst: and you don't even have to search for SoCs being that limited
11:39 karolherbst: (older) desktop gpus have similar constraints
11:43 pq: karolherbst, I don't think that's the use case. Every process opens the render node itself, they can't share anything and there is nothing to de-dup.
11:44 karolherbst: pq: it's not about being able to share, just that GPU contexts are quite limited
11:44 karolherbst: at least on some hw
11:44 pq: karolherbst, yes, and de-dup does not help with that.
11:44 karolherbst: and you don't want your actual games to fail creating OpenGL contexts just because your desktop already uses them all
11:44 karolherbst: why not?
11:44 pq: the kernel driver has to pretend there are more sw contexts than there are hw contexts
11:45 pq: because there is nothing to de-dup between different processes
11:45 pq: in userspace
11:45 karolherbst: right, you can't share a context between processes, true
11:47 karolherbst: but if an application uses OpenGL and vaapi (or ends up with multiple gallium screens in different ways) we really do want just one hw context
11:47 karolherbst: and I don't see why the kernel has to fake multiple sw contexts just for that
11:47 pq: that makes sense
11:47 karolherbst: as there is literally no benefit of doing so either
11:48 pq: the kernel has to fake multiple sw contexts because people very easily run more than 16 processes that use the GPU
11:48 karolherbst: but then you run into a security nightmare I am convinced most will mess up
11:48 karolherbst: you really don't want to share a hw context with multiple applications except you are willing to make it secure
11:49 karolherbst: and this will require a lot of time to get right
11:49 pq: that's already done and implemented, e.g. in Nouveau ages ago, right?
11:49 karolherbst: no?
11:49 karolherbst: maybe we do that on very very old GPUs
11:49 karolherbst: but on newer ones we have one hw context per application
11:49 pq: my memory from 10 years ago says yes
11:49 karolherbst: maybe it was the case pre nv50
11:50 karolherbst: we don't do that on nv50 and newer, which I am quite sure of
11:50 pq: maybe you have room for "enough" hw contexts nowadays?
11:50 karolherbst: yeah
11:50 karolherbst: not many, but enough
11:50 karolherbst: I mean..
11:50 karolherbst: it's all in VRAM in the end anyway
11:51 karolherbst: we even have context switching firmware doing all the magic
11:51 pq: I do remember the lamentation from driver developers when toolkit developers were planning to start using the GPU by default for everything
11:51 karolherbst: right
11:51 karolherbst: I don't think it's an issue on GL3+ hw though
11:52 karolherbst: maybe we have unlimited contexts these days.... not quite sure about how that works internally
11:53 pq: that doesn't really compare to what you said before about some things having 16 or less hw contexts max :-)
11:53 karolherbst: I meant older or more limited hw though
11:53 karolherbst: modern GPUs are fine, they have enough contexts, but making it painful for hw being more limited is not the way to go ;)
11:54 pq: the de-dup code we are talking about is in Mesa, so userspace, and it's very easy to have many processes needing a ctx each, which the de-dup cannot help with.
11:54 karolherbst: I mean.. if _every_ driver is willing to go into this messy area, fine
11:54 karolherbst: I just suspect it will be a sec nightmare
11:54 karolherbst: right
11:55 karolherbst: well...
11:55 pq: well.. you can also just fail :-)
11:55 karolherbst: as I said, if an application ends up with multiple screens, we kind of want to share a hw context for drivers not emulating more
11:56 pq: right
11:56 karolherbst: yeah.. I think it matters how much we can actually do here...
11:56 karolherbst: at least it would be nice to have more data on what GPUs can actually do
11:56 karolherbst: and I suspect in the future we could end up with 50 contexts at the same time...
11:57 karolherbst: like browser thinking: let's do per tab gpu rendering and allocate a new process for each
11:57 pq: yup, wheee
11:57 karolherbst: which I am sure they don't do because of hw limitations
11:57 karolherbst: but
11:57 karolherbst: that's a reasonable thing from a security perspective
11:57 karolherbst: would make browser way simpler and less complex in that area
11:57 pq: and even your text editor uses the GPU
11:58 karolherbst: I think we might need an API for that
11:58 karolherbst: active/total hw/sw contexts or something?
11:58 karolherbst: but yeah...
12:00 karolherbst: I guess desktop/toolkits could be smarter and only use hw OpenGL for the compositor if they knew the hw can only do 8 contexts or something
12:00 lynxeye: karolherbst: AFAIK none of the embedded GPUs have any HW contexts, so it's all down to the kernel driver pretending there are more contexts
12:00 lynxeye: not counting Tegra as embedded ;)
12:00 karolherbst: heh
12:00 karolherbst: that bad?
12:00 karolherbst: that sounds horrifying
12:00 lynxeye: Just assume you don't carry any state between submits
12:00 pq: also sounds like unlimited contexts :-D
12:00 karolherbst: I mean.. that's fine for embedded use cases, but not for a desktop
12:01 karolherbst: lol...
12:01 karolherbst: :D
12:01 pq: karolherbst, why would you say it's fine for embedded?
12:01 karolherbst: because you usually don't open op dozen of untrusted applications
12:01 pq: a phone is embedded, right? But it's far from a closed system.
12:01 karolherbst: *up
12:02 lynxeye: karolherbst: Not assuming state across submits also makes device reset robustness a lot easier
12:02 karolherbst: I don't consider a phone embedded really
12:02 karolherbst: it's more of a small desktop
12:02 karolherbst: I mean embedded as in physically embedded
12:02 karolherbst: like controller hardware for machines or stuff like this
12:03 karolherbst: not consumer hw
12:03 pq: I think a phone is as physically embedded as anything can be :-D
12:03 karolherbst: it's not embedded _into_ something else :p
12:03 karolherbst: I can also say controller hardware instead if that's better
12:04 pq: I think the term you are looking for is "a closed system".
12:04 karolherbst: well.. you could argue that iphones/ipads are closed systems :p
12:05 karolherbst: but in the end it boils down whether you run multiple applications or just a single one doing stuff
12:05 karolherbst: anyway...
12:05 karolherbst: lynxeye: the question is though: is there hw state?
12:06 karolherbst: and if so, how do you protect against an attacker just submitting stuff?
12:07 karolherbst: like how are you isolating processes seeing each other data inside the kernel?
12:09 lynxeye: karolherbst: Sure, HW keeps the last state, but from the individual submit it's unpredictable
12:09 karolherbst: doesn't matter for an attacker
12:09 karolherbst: they just do it multiple times then
12:09 lynxeye: userspace doesn't directly submit in the HW runqueue, but goes through a kernel arbitration
12:09 karolherbst: right
12:09 lynxeye: and at least on etnaviv we have different page tables for each SW context and switch between them in the kernel driver
12:10 karolherbst: the question was more: does the kernel filter stuff out, rewrite addresses or whatever so it prevents cross process data leaks?
12:10 karolherbst: ahh
12:10 karolherbst: okay, so you hand each process its own page table
12:10 lynxeye: yep, on old GPU we filter the commandstream on new ones we just rely on the per SW context page tables
12:11 karolherbst: how many page tables can you create? I guess multiple ones as those are just pointers I figure
12:11 danvet: karolherbst, yeah I think nouveau would be the only one with a limit on gl context
12:12 danvet: if it works like you say
12:12 lynxeye: depends on GPU implementation, but basically 64K or unlimited
12:12 danvet: e.g. amdgpu even swaps out pagetables and stuff
12:12 karolherbst: lynxeye: okay, that sounds like enough
12:12 karolherbst: danvet: yeah... I think I always forget that we are the only ones with real hw contexts...
12:12 danvet: so as long as you don't run out of system memory, you can allocate
12:12 danvet: karolherbst, we have those too on i915 now
12:12 danvet: but they're just stuff in vram
12:13 karolherbst: but I am willing to accept that "multiple VMs" are good enough as a real hw context replacement
12:13 danvet: so there's iirc a hw limit on 20bit for ctx id
12:13 danvet: but we can even virtualize that
12:13 karolherbst: oh, nice
12:13 danvet: ofc you'll probably run out of random other things first :-)
12:14 karolherbst: lynxeye: are you rejecting forbidden stuff or accepting allowed commands?
12:14 lynxeye: karolherbst: At least until you get to HW preemption I don't really see the benefit of real HW contexts
12:14 karolherbst: rejecting kind of comes with the problem you have to know about all of the commands :)
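The difference between the two filtering strategies can be sketched with a toy command stream. Everything here is hypothetical: the opcode names are invented for illustration and do not correspond to etnaviv or any other real command stream.

```python
# Hypothetical opcode names, for illustration only.
ALLOWED = {"DRAW", "LOAD_STATE", "NOP"}
FORBIDDEN = {"WRITE_PAGE_TABLE"}


def accept_allowlist(stream):
    """Accept a stream only if every command is known to be safe."""
    return all(op in ALLOWED for op in stream)


def accept_denylist(stream):
    """Accept a stream unless it contains a command known to be dangerous."""
    return not any(op in FORBIDDEN for op in stream)


def demo():
    # A command the filter author never heard of.
    stream = ["LOAD_STATE", "UNDOCUMENTED_POKE", "DRAW"]
    return (accept_allowlist(stream), accept_denylist(stream))
```

The unknown command is rejected by the allowlist but slips through the denylist, which is the problem with rejecting: the filter has to know about every command that exists.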
12:14 danvet: karolherbst, I'd assume there's some way to save/restore an idle ctx (kinda needed for suspend/resume)
12:14 karolherbst: lynxeye: isolation
12:14 danvet: so pushing past whatever hw limit you have shouldn't be too hard, as long as the hw limit isn't so low to cause thrashing
12:14 karolherbst: danvet: yeah, we have that in firmware
12:15 karolherbst: and then VRAM to system memory we do in the kernel
12:15 lynxeye: karolherbst: what exactly are you isolating against if only one context can run on the HW at any given time?
12:15 karolherbst: but modern nvidia GPUs even allow us to power down the entire GPU except VRAM
12:15 karolherbst: and let VRAM self refresh
12:15 karolherbst: we don't make use of it yet
12:15 karolherbst: but nvidia does afaik
12:16 karolherbst: lynxeye: data leaks mainly?
12:16 karolherbst: not everything lives in VRAM
12:17 karolherbst: but we can preempt
12:17 karolherbst: soo...
12:17 karolherbst: compute even allows instruction level preemption these days
12:17 karolherbst: graphics still only does stage level preemption
12:18 lynxeye: yep, I get that HW contexts make sense once you have multiple runqueues and/or preemption, but that's nothing I worry about for the next few years in embedded ;)
12:18 karolherbst: right...
12:18 karolherbst: well, tegra has all of that now :p
12:18 karolherbst: well.. maybe not now, I think you need orin for multiple runqueues?
12:19 karolherbst: although I think volta also had it already
12:19 karolherbst: maybe older gens
12:19 lynxeye: yea and NVidia is charging so much money for the Tegra industrial stuff that you can just go with other more powerful HW...
12:20 karolherbst: hw is not everything
12:21 karolherbst: nvidia is a software company also doing great hardware :p
12:21 lynxeye: if you like proprietary software, sure
12:22 karolherbst: just saying how it is
12:22 karolherbst: if the tooling is shit, nobody likes working with your hardware
12:23 karolherbst: and in the open source world we have literally 0 tooling (you could argue that apitrace _might_ fall into this, but I'd ignore it)
12:23 karolherbst: or frameretrace
12:24 karolherbst: but that's more driver developer tools than what application developers can actually rely on using
12:24 karolherbst: and that's also a big issue for OpenCL that there is literally only crappy tooling out there
12:29 DPA: So, what does this all mean for me, and how much de-duplication should I do?
12:44 MrCooper: agd5f_: FWIW, a GitLab issue can be marked as duplicate of an issue in a different project
12:44 MrCooper: (so no need to move it first)
13:59 agd5f_: MrCooper, ah, cool
14:35 jekstrand: karolherbst: We really need to sort out pipeline compilation.... Images require sampler views to be created
14:35 jekstrand: s/pipeline/kernel
14:36 MrCooper: DPA: FWIW, another example you could look at is src/gallium/winsys/amdgpu/drm/amdgpu_winsys.c, with amdgpu_winsys (1 instance per device) and amdgpu_screen_winsys (1 instance per DRM file description)
14:40 jekstrand: karolherbst: Wait... We have a device when we do clover::nir::spirv_to_nir. Why isn't that enough to allocate memory?
14:41 jekstrand: karolherbst: Maybe I'm confused between clover device vs. context?
14:42 karolherbst: jekstrand: when we compile kernels (to nir) we don't have a context
14:42 karolherbst: queues create gallium contexts
14:42 karolherbst: clover contexts create gallium screens
14:43 karolherbst: so we can allocate memory when we have our kernel
14:43 karolherbst: jekstrand: maybe take a look at my constant buffer MR, I reworked it a bit so it sucks less.. maybe that gives you an idea?
14:45 jekstrand: karolherbst: I'm somewhat unconvinced by that approach
14:46 jekstrand: In particular, I'm not sure what it does to lifetimes of things if the kernel gets destroyed after the context
14:46 karolherbst: internal objects are usually refcounted
14:46 jekstrand: But do buffers hold a reference to the context?
14:46 karolherbst: indirectly yes
14:47 karolherbst: well.. not to the context, but to the device
14:50 karolherbst: maybe we could store the buffers somewhere else.. but I think all places just suck differently
14:50 jekstrand: Yeah
14:50 karolherbst: storing it with kernels is the more natural thing
14:51 karolherbst: you compile programs against devices already anyway
14:51 karolherbst: so I doubt it's common to have kernels around but not the devices anymore
14:56 jekstrand: karolherbst: Looking at the OpenCL API, CreateProgramWithSource takes a context and createKernel, presumably, inherits that context from the program.
14:56 jekstrand: karolherbst: So why does clover drop it on the floor?
14:56 jekstrand: This seems like poor architecture to me and not a real problem.
14:57 karolherbst: ohh... mhhh
14:57 jekstrand: I think we can probably just fix clover
14:57 karolherbst: yeah..
14:57 karolherbst: I kind of assumed the buffer are per device, but they are per context...
14:57 karolherbst: yeah.. I'll fix that :d
14:58 karolherbst: root_buffers have a per device resource list anyway
14:59 jekstrand: karolherbst: So a context contains a list of devices?
14:59 karolherbst: yes
15:00 karolherbst: but a buffer as well. It stores pipe_resources per device
15:00 jekstrand: Ok, so I guess it maybe makes sense for kernels to be per-device
15:00 jekstrand: hrm...
15:00 karolherbst: no, it does not :p
15:00 karolherbst: you can use the same kernel on multiple devices
15:00 jekstrand: right
15:00 karolherbst: anyway.. I can store a root_buffer on a kernel
15:00 jekstrand: From the API pov, a kernel is per-context
15:00 karolherbst: and that can be used for all devices anyway
15:01 karolherbst: this map I used is totally useless anyway
15:02 jekstrand: That sounds reasonable
15:04 Venemo: here is a silly question. in the context of GS, do we have an alternative for the term 'degenerate'? I just glanced at the "respectful coding" guidelines that jekstrand posted to the mailing list a while ago and realized that this term might be offensive to some people
15:06 jekstrand: karolherbst: For images, we need to be able to create samplers and attach them to kernels as well.
15:06 karolherbst: jekstrand: soo.. first rework done :) I think that already looks much better
15:06 jekstrand: Venemo: trivial?
15:06 karolherbst: I look if I can move the buffer creation up a bit
15:07 Venemo: jekstrand: I've never heard "trivial triangle" or "trivial vertex", does that get the point across?
15:07 karolherbst: and do it at kernel creation time
15:07 pendingchaos: Venemo: zero-area?
15:07 Venemo: pendingchaos: sounds good
15:07 jekstrand: Venemo: I had no idea that was on the list. I always think about that term in the mathematical sense and have to google for it to get the social sense. :-/
15:07 pendingchaos: I've never heard it used, but it's short and seems self-explanatory
15:07 Venemo: though not a good description for vertices, since those are always zero area
15:08 jekstrand: Venemo: What exactly are you trying to describe?
15:08 jekstrand: Venemo: The case where it's just a pass-through?
15:08 jekstrand: Or the case where it throws everything away?
15:08 jekstrand: The case where it contracts everything to a point?
15:09 Venemo: jekstrand: I'm adding a feature to NIR's GS intrinsics lowering which will filter out degenerates, meaning it filters out primitives and vertices which are not gonna result in actual output
15:09 Venemo: I'm thinking maybe a good term might be 'incomplete', but I also like pendingchaos 's zero-area
15:09 pendingchaos: zero-area might not be a good word, I might be misremembering what it means for a primitive to be "degenerate"
15:10 karolherbst: ehhh....
15:10 Venemo: eg. if the GS creates triangle strips, but only outputs 2 vertices in a strip
15:10 jekstrand: Venemo: incomplete?
15:10 jekstrand: filter_incomplete_prims?
15:10 karolherbst: jekstrand: I am stupid.. we actually have to manage a map, just not a device -> buffer, but module ->buffer one
15:10 Venemo: jekstrand: yeah that's what I had in mind too.
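The triangle strip case Venemo describes can be sketched in a few lines. This is not the NIR GS intrinsics lowering; `triangles_in_strip` and `filter_incomplete_strips` are hypothetical helpers that only show what "incomplete" means for a strip: fewer than three vertices produce no triangle.

```python
def triangles_in_strip(vertex_count):
    """A triangle strip with n vertices yields n - 2 triangles, never fewer than 0."""
    return max(vertex_count - 2, 0)


def filter_incomplete_strips(strips):
    """Drop strips that cannot form even one triangle."""
    return [s for s in strips if triangles_in_strip(len(s)) > 0]
```

For example, given strips of 4, 2, and 3 vertices, the 2-vertex strip is dropped and the other two are kept.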
15:11 jekstrand: karolherbst: Wha?
15:11 karolherbst: because the buffer can be different between llvm vs nir
15:11 karolherbst: or doesn't exist at all with llvm
15:11 jekstrand: karolherbst: Yeah.....
15:11 jekstrand: karolherbst: Also, if we start doing per-device optimizations, that could change it as well
15:11 karolherbst: yeah...
15:12 karolherbst: so maybe my approach wasn't incorrect after all...
15:12 jekstrand: karolherbst: I'm not sure
15:12 karolherbst: soo.. there are several issues
15:12 jekstrand: karolherbst: I feel like an API-level kernel object should contain multiple device-level kernel objects
15:12 karolherbst: at cl_program creation time you compile something into multiple nirs
15:12 jekstrand: And each of those can contain some pipe_resources
15:12 karolherbst: per entry_point
15:13 jekstrand: karolherbst: Yes, programs can have multiple kernels
15:13 jekstrand: Or, rather, kernels are created from programs
15:13 Venemo: jekstrand: the google page lists "crazy, insane, cripple" as bad examples, and I think "degenerate" kind of comes close to these. of course I'm not a native speaker so these things are not easy to judge
15:14 karolherbst: jekstrand: the thing is, a cl_kernel is just a kernel.. it doesn't know about devices or anything
15:14 karolherbst: I mean.. we could pull from a cl_program for which devices it's valid
15:14 jekstrand: Venemo: I could see people not liking that and, honestly, "incomplete primitive" is more descriptive in this case, IMO
15:14 karolherbst: and do more stuff when it's created
15:15 Venemo: jekstrand: okay. that sounds good to me (that, until "primitive" becomes offensive, too :P)
15:15 karolherbst: jekstrand: kern.program.devices :p
15:15 jekstrand: karolherbst: cl_kernel inherits a program so it knows about a list of devices
15:16 jekstrand: But, also, I think we want some sort of a per-device hook
15:16 jekstrand: Which is what we have today
15:16 jekstrand: We have the actual device in spirv_to_nir
15:16 jekstrand: Because link_program is called per-device
15:16 jekstrand: So maybe we have roughly what we need?
15:16 jekstrand: And we need to just create pipe_resources instead of buffers?
15:17 jekstrand: Or, more-to-the-point, we need to break out some of the internal buffer-create API to allow for single-device buffers.
15:17 karolherbst: yeah...
15:17 karolherbst: sounds like a bigger rework as binding arguments only works on buffers at the moment
15:17 jekstrand: It could be a buffer with only one device
15:18 karolherbst: yeah
15:18 jekstrand: And then we assert on bind that it matches
15:18 karolherbst: which is what I am doing already
15:18 jekstrand: hrm...
15:18 karolherbst: I still want to move the buffer creation
15:18 karolherbst: so it is already there after the kernel was created
15:18 jekstrand: Yeah
15:18 karolherbst: or make it lazy as it's easier to set it up in the constructor then on bind
15:19 karolherbst: *than
15:19 jekstrand: I'd rather have a good way to create stuff at kernel compile
15:19 karolherbst: yeah...
15:19 glennk: Venemo, "zero area triangle" perhaps?
15:19 karolherbst: I think it's not difficult.. let's see
15:20 Venemo: glennk: maybe that can work too, though I would lean towards 'incomplete'
15:21 glennk: in GL speak incomplete typically means the provoking vertex hasn't been emitted yet, but the current vertices are part of what will be a valid triangle
15:22 glennk: if memory serves that is :-)
15:23 Venemo: well, this is true here, too
15:23 Venemo: it could be a valid triangle if one more vertex is emitted
15:24 jekstrand: Venemo: If the case you care about is the one in which there aren't enough vertices, I think incomplete is the right term
15:24 Venemo: agreed
15:25 jekstrand: karolherbst: After my fixes, does the deref align stuff work now?
15:25 karolherbst: yes
15:31 jekstrand: karolherbst: \o/
15:32 jekstrand: anholt: Freedreno CI keeps failing on me: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5909#note_616131
15:33 karolherbst: jekstrand: slowly the code also gets a bit cleaner :D
15:34 jekstrand: karolherbst: :)
15:36 anholt: jekstrand: the a630_vk fail there is fdo infrastructure, haven't seen that happen other than maybe during gitlab upgrades
15:36 anholt: the a530 fail I can't get the logs from?
15:37 jekstrand: anholt: I'm attempting to re-run now
15:37 jekstrand: anholt: Other stuff seems to have gone through Marge fine this morning
15:37 karolherbst: jekstrand: pushed
15:38 anholt: even the current a530 job, I'm getting intermittent 404s on and it keeps resetting the log visualizer to loading from the top and I don't know if I've seen the end
15:38 jekstrand: karolherbst: I still don't get why we need an unordered map
15:39 anholt: if you can see logs, if it's got bootloader dump late in the run (after deqp starts), then https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6529 should basically make it go away
15:39 karolherbst: because we need a per device buffer and we need to select when binding the arguments
15:39 jekstrand: karolherbst: Right...
15:39 karolherbst: I mean.. I could use a root_resource directly, but then I'd need to rework the binding code as well
15:41 jekstrand: karolherbst: This is quite annoying....
15:41 karolherbst: maybe curro has some ideas :)
15:41 karolherbst: but at least now the overhead when running kernel is quite small
15:42 karolherbst: mhh.. but I guess with a resource it would be even smaller as we don't have to do the upload anymore
15:44 jekstrand: root_resource seems like the thing we want, I think.
15:44 jekstrand:looks at samplers
15:48 jekstrand: Wow... It looks like clover re-creates every sampler object every dispatch
15:48 jekstrand: I guess that's probably ok. They're generally pretty light-weight
15:49 karolherbst: it's not the only thing clover does on each dispatch :p
15:49 karolherbst: I think at some point it makes sense to check stuff out with a profiler and see where the big issues are
15:49 karolherbst: but we are not there yet anyway
15:49 jekstrand: No, we aren't
15:50 jekstrand: Now that I'm looking through more stuff, I don't think I'm nearly as offended by the map
15:50 karolherbst: :D
15:50 jekstrand: Everything else seems to use them
15:50 karolherbst: the STL containers are not the best ones.. but at least unordered_map isn't that terrible to use
15:51 karolherbst: better than std::map eg :p
15:52 jekstrand: why is it better than std::map?
15:52 karolherbst: std::map is a binary tree, unordered_map is a hash table
15:52 jekstrand: Why is a hash table better?
15:52 karolherbst: because lookups are constant time?
15:52 jekstrand: Especially when you typically have one device
15:52 karolherbst: binary tree is log(n)
15:53 jekstrand: Where n < number of PCIe ports
15:53 karolherbst: mhhhhh
15:53 karolherbst: mhhhh
15:53 karolherbst: good poing
15:53 karolherbst: *point
15:53 jekstrand: Maybe +1 for the Intel GPU
15:53 karolherbst: I guess for small numbers a map could be faster
15:53 jekstrand: I suspect for the typical case of 1 thing, a map probably is faster
15:54 karolherbst: mhh
15:55 karolherbst: seems like people benchmarked it a bit and only when the map is empty a std::map is faster
15:55 jekstrand: Personally, I'd use std::map just to be consistent
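A minimal sketch of the per-device lookup being discussed, assuming the map is keyed by device as in the 15:39 message and that bind asserts the buffer matches the device. The `device`, `device_buffer` and `kernel_constants` types are hypothetical stand-ins, not clover's actual classes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical stand-ins for a device and a single-device buffer.
struct device { int id; };
struct device_buffer { const device *dev; std::size_t size; };

// One API-level kernel object holding one buffer per device.
class kernel_constants {
public:
    // Create the buffer up front (e.g. at kernel creation time).
    void add(const device &dev, std::size_t size) {
        buffers_.emplace(&dev, device_buffer{&dev, size});
    }

    // Select the buffer for the device the arguments are bound on.
    const device_buffer &bind(const device &dev) const {
        auto it = buffers_.find(&dev);
        assert(it != buffers_.end());   // a buffer must exist for this device
        assert(it->second.dev == &dev); // and it must belong to that device
        return it->second;
    }

private:
    // std::map for consistency; with one or two devices the container
    // choice should make little practical difference.
    std::map<const device *, device_buffer> buffers_;
};
```

Swapping `std::map` for `std::unordered_map` here would only change the member declaration and the include, since only `emplace` and `find` are used.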
15:56 jekstrand: karolherbst: What's left to resolve on alignments?
15:56 karolherbst: yeah... I guess that's fine as it really doesn't matter
15:57 lrusak_: any ideas why this would be happening? mesa bug or meson bug?
15:57 lrusak_: 08:17:33 FileNotFoundError: [Errno 2] No such file or directory: '/home/jenkins/workspace/LINUX-arm-GBM/tools/depends/xbmc-depends/arm-linux-gnueabihf-debug/lib/pkgconfig/egl.pc'
15:58 lrusak_: happens when trying to install mesa
15:58 lrusak_: dcbaker[m], any ideas? ^^
15:59 karolherbst: jekstrand: I don't think there is anything to resolve. I am just not very familiar with most of the deref code :)
16:01 dcbaker[m]: lrusak_: I think a meson bug. Is it a clean build?
16:01 jenatali: jekstrand: I didn't get a chance to try it out yet, but I'm fine to land it as-is. I have high confidence I'll be able to rework our compiler on top of it
16:01 lrusak_: yes it's a clean build with meson 0.55.1
16:04 dcbaker[m]: Yeah, that looks like a meson bug for sure
16:08 lrusak_: dcbaker[m], should I file a bug report?
16:09 dcbaker[m]: Yeah. Go ahead and mention me in the bug report.
16:20 lrusak_: dcbaker[m], https://github.com/mesonbuild/meson/issues/7691
16:23 dcbaker[m]: Perfect, thanks
17:13 mslusarz: kusmabite, daniels: Windows Piglit jobs are timing out, even the merge of build-windows job did that...
17:16 daniels: mslusarz: if that's happening just now, I suspect it's because the runner is being restarted for an update
17:17 mslusarz: it does that for at least 2-3 hours
17:21 daniels: hmm
17:24 daniels: mslusarz: fixed now, thanks
17:39 mslusarz: daniels: it's failing in a different way now: "denied: requested access to the resource is denied\nNot logged in to registry.freedesktop.org"
17:50 jekstrand: karolherbst, jenatali: I guess I'll take that as a pair of ACKs
17:53 jekstrand: karolherbst: Can I have your RB on the first two? That fix clover?
17:57 karolherbst: jekstrand: the deref MR?
17:57 jekstrand: karolherbst: yeah
17:58 karolherbst: rb
17:58 jekstrand: Thanks!
17:58 karolherbst: I just noticed that I know call nir_lower_mem_constant_vars twice :)
17:58 karolherbst: *now
17:58 jekstrand: karolherbst: Did you ever write that deref-based constant-folding patch?
17:58 karolherbst: nope
17:58 jekstrand: karolherbst: Uh... I'll fix that first. :)
17:58 jekstrand: Then I'll write us a constant-folding patch.
17:59 karolherbst: jekstrand: no.. it's part of my indirect constant MR :p
17:59 karolherbst: ehh.. I thought?
17:59 jekstrand: karolherbst: Right. Yeah, it is. now.
17:59 jekstrand: It's part of the top patch
17:59 jekstrand: Oh, well. That's fixable. :)
18:00 karolherbst: ohh... right, none of us is merged, so it looks fine on my branch :D right
18:00 karolherbst: I think we'll merge your MR first anyway
18:00 karolherbst: so I can fix it up later
18:02 jekstrand: Just assigned Marge
18:02 karolherbst: cool
18:02 karolherbst: I guess with that I can probably just run all CTS tests...
18:02 karolherbst: mostly
18:02 jekstrand: :D
18:02 karolherbst: (and fix codegen)
18:03 karolherbst: unaligned loads will be painful
18:03 karolherbst: and I need to properly support int8/int16 as well
18:03 karolherbst: right now codegen just assumes all indirect are 0x10 aligned
18:04 karolherbst: so it vectorizes
18:04 karolherbst: I think I really just want to use the nir io vectorizer and disable annoying codegen passes
18:04 karolherbst: codegen has no concept of pointer alignments anyway
18:08 karolherbst: jekstrand: btw.. kind of next on my list is figuring out how to deal with opcodes wanting more precision :/
18:09 jekstrand: karolherbst: Yeah
18:10 jekstrand: karolherbst: That sounds reasonable
18:10 jekstrand: jenatali: What happened to the conversion intrinsic stuff?
18:10 karolherbst: so.. in the long term, we kind of need to add modifiers based on opcode
18:10 jekstrand: We need something
18:10 jenatali: jekstrand: I still need to write it
18:10 karolherbst: jekstrand: just keeping https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#compiler-options in mind
18:10 karolherbst: "Math Intrinsics Options"
18:10 karolherbst: I think it makes more or less sense if we find something which doesn't make it harder to implement that
18:10 karolherbst: as I suspect some applications do care
18:11 karolherbst: uhm.. and "Optimization Options"
18:11 jekstrand: karolherbst: Yeah
18:11 karolherbst: "cl-mad-enable" is a fun
18:11 karolherbst: one, but also very straightforward
18:12 karolherbst: "cl-opt-disable".. ehhh
18:12 karolherbst: more interesting are the no-signed-zeros and the further ones
18:12 jekstrand: cl_opt_disable isn't going to happen, at least not completely
18:12 karolherbst: with exact we have this all or nothing approach
18:12 jekstrand: We can potentially shut off a bunch of stuff though
18:12 karolherbst: jekstrand: just disable opt_arithmatic :p
18:12 karolherbst: maybe const folding as well
18:12 karolherbst: but yeah..
18:12 karolherbst: I'd just keep most stuff on
18:13 jekstrand: Yeah, exact is very all-or-nothing which is also what makes it workable.
18:13 karolherbst: thing is.. we never know what drivers actually require
18:13 karolherbst: nouveau _requires_ constant folding :)
18:13 jekstrand: It has very clearly defined semantics: "Don't do anything with this that changes the bits"
18:13 karolherbst: yeah..
18:13 jekstrand: Anything less than that and it gets sticky fast
18:13 karolherbst: but what happens if you enable finite-math-only, but you still can't fold fadd+fmul into fmad :)
18:14 jekstrand: I do think we need to follow LLVM's lead and split fmad and ffma
18:14 karolherbst: mhhh...
18:14 karolherbst: although I think it has its benefits
18:14 karolherbst: and would lead to better codegen overall
18:14 jekstrand: What has its benefits?
18:14 karolherbst: splitting fmad and ffma
18:15 jekstrand: I think that one's more-or-less necessary
18:15 jekstrand: ffma has very specific semantics that not all GPUs can handle.
18:15 karolherbst: ohh
18:15 karolherbst: I've read "I don't think" :D
18:15 jekstrand: And very specific precision requirements
18:15 jekstrand: Oh, ok then. :)
18:16 jekstrand: Unfortunately, GL's semantics for fma are weird
18:16 jekstrand: You can split it if you want
18:16 jekstrand: Unless it's marked precise
18:16 jenatali: Same as D3D's
18:16 jekstrand: At least, that's how we interpret it
18:16 karolherbst: yeah.. I had the discussion with cwabbott a few days ago and the first concerns was that it might end up worse, but I think we can be sneaky about it and eg cwabbott came up with a late algebraic opt doing ~fmad -> ffma or ~ffma -> fmad for hw not supporting the other
18:16 karolherbst: which.. should just resolve all issues
18:17 karolherbst: so we just need to make the opts more precise and split it up a little
18:17 jekstrand: Yeah
18:17 karolherbst: and in the end you have better codegen as you can merge more stuff
18:17 karolherbst: or know what you can merge and what not
18:17 jekstrand: Right now, basically ffma.exact is LLVM's ffma and ffma is LLVM's fmad
18:17 jekstrand: And I guess that sort-of works
18:17 karolherbst: I think I'd say that ffma is always fused, and fmad is never fused
18:18 jekstrand: But with CL making a big distinction, I'm not sure.
18:18 karolherbst: but we allow optimizing into the other one
18:18 karolherbst: for inexact ops
18:18 jekstrand: Yeah
18:18 karolherbst: so a driver can say what it supports, and we are fine
18:18 jekstrand: Semantically, ffma should be fused and fmad should be mul+add
18:18 karolherbst: yes
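To make the fused versus mul+add distinction concrete, here is a small host-side sketch in plain C++ (not NIR). It assumes IEEE-754 single precision with round-to-nearest-even; the function names are illustrative only.

```cpp
#include <cassert>
#include <cmath>

// "fmad" in the sense used above: multiply, round, then add.
// The volatile store keeps the compiler from contracting this into an FMA.
float mul_add_unfused(float a, float b, float c) {
    volatile float product = a * b;
    return product + c;
}

// "ffma" in the sense used above: a single rounding at the very end.
float mul_add_fused(float a, float b, float c) {
    return std::fma(a, b, c);
}

void check_example() {
    const float a = 1.0f + std::ldexp(1.0f, -12);    // 1 + 2^-12
    const float c = -(1.0f + std::ldexp(1.0f, -11)); // -(1 + 2^-11)
    // The exact product a*a is 1 + 2^-11 + 2^-24. Rounding it to float
    // first drops the 2^-24 term, so the unfused form cancels to zero.
    assert(mul_add_unfused(a, a, c) == 0.0f);
    // The fused form keeps the term and returns it exactly.
    assert(mul_add_fused(a, a, c) == std::ldexp(1.0f, -24));
}
```

This is why an exact op cannot be freely rewritten from one form to the other: the two produce different bits for the same inputs.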
18:19 karolherbst: mhh.. actually ~ffma -> fmad makes no sense
18:19 karolherbst: as if the hw doesn't support ffma natively, for gl it doesn't matter anyway
18:19 karolherbst: mhhh
18:19 karolherbst: so we would have to do ffma -> fmad if !options.has_ffma
18:20 karolherbst: no matter if it's precise or not
18:20 karolherbst: uhm.. exact
18:20 jekstrand: Yeah, something like that
18:20 karolherbst: and CL is just special and will emit this fma function
18:20 karolherbst: and we handle that on a vtn level
18:20 jenatali: Or lower ffma into a software fma implementation
18:20 jekstrand: ^^
18:20 jekstrand: That
18:20 karolherbst: not for graphics :p
18:21 jekstrand: For CL, I think we want to leave it as ffma, let the optimizer run, and then lower it to the CLC function
18:21 karolherbst: so if we end up from glsl ir _or_ vtn does emit ffma, we just assume it's correct to convert it to fmad later
18:21 karolherbst: mhhhhh
18:21 jekstrand: karolherbst: No, for graphics we emit fmad up-front
18:21 karolherbst: jekstrand: what about glsl fma?
18:21 karolherbst: we have to keep it :)
18:21 jenatali: jekstrand: That means tweaking how we handle libclc to be able to do the mangling outside of vtn
18:21 jekstrand: karolherbst: Honestly, we could probably make GLSL fma just fmad
18:21 karolherbst: please don't :p
18:21 karolherbst: for hw supporting both it matters
18:21 jekstrand: jenatali: It's one mangled function name. We can hard-code it.
18:22 karolherbst: and I know some games which emit ffma and precise modifiers
18:22 jenatali: jekstrand: ... good point :)
18:22 karolherbst: and do fmul+fadd
18:22 jenatali: __Zfmaff I'm pretty sure
18:22 jekstrand: karolherbst: Yeah, we probably want to look at options->has_ffma and emit based on that
18:22 jenatali: er, fff
18:22 karolherbst: fair
18:22 karolherbst: mhhh
18:22 karolherbst: yeah.. that could work then
18:22 karolherbst: so ffma in nir is really ffma
18:22 karolherbst: but vtn and glsl ir _could_ emit fmad instead if they think it's safe
18:23 jekstrand: karolherbst: For sin, cos, sqrt, and friends, I'm kind-of inclined to say we just use the NIR ops and write a fix-up pass which adds some extra math around it to fix precision.
18:23 jekstrand: And have some sort of threshold for "how much fixing do you need?"
18:23 karolherbst: how am I glad I deal with a driver where some hw has fmad, and the other has ffma, but none has both :p
18:23 karolherbst: jekstrand: we can't
18:23 jekstrand: karolherbst: Why not?
18:23 karolherbst: there is native_* vs the real one
18:23 jekstrand: Oh, right....
18:23 karolherbst: ...
18:23 jekstrand: bother
18:23 jenatali: jekstrand: For sin/cos you want to use libclc. For sqrt/div I agree we should use a fix-up pass
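As an illustration of what "extra math around it" could look like for sqrt, the sketch below refines an imprecise result with Newton-Raphson (Heron) steps. Nothing in the discussion says the pass would use this particular method; the function names and the `steps` knob (standing in for the "how much fixing do you need?" threshold) are assumptions.

```cpp
#include <cassert>
#include <cmath>

// Refine an approximate square root of x with `steps` Newton-Raphson
// iterations: y' = (y + x / y) / 2. Each step roughly doubles the number
// of correct bits when the starting guess is reasonable.
float refine_sqrt(float x, float approx, int steps) {
    assert(x > 0.0f && approx > 0.0f);
    float y = approx;
    for (int i = 0; i < steps; ++i)
        y = 0.5f * (y + x / y);
    return y;
}

// Relative error of `root` as a square root of x, measured on the square.
float relative_error(float x, float root) {
    return std::fabs(root * root - x) / x;
}
```

A driver with a precise enough native sqrt would use zero steps; a less precise one would ask for more.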
18:24 karolherbst: yes
18:24 karolherbst: jenatali: native just forces us to flag instructions
18:24 karolherbst: more or less
18:24 karolherbst: we really need to know inside nir which opcodes need more precision and which do not
18:24 karolherbst: maybe exact is enough for that.. but what if you have three fsqrt variants?
18:25 jekstrand: So we probably want new opcodes for sqrt/rsq
18:25 karolherbst: glsl fsqrt, cl native_fsqrt and cl fsqrt all having different requirements
18:25 jenatali: So you want vtn to emit sin for both native_ and non-native_ sin, and then if the hardware doesn't have a super-precise sin, lower the non-native one to libclc?
18:25 karolherbst: ...
18:25 jekstrand: Fortunately, they don't show up in nir_opt_algebraic too much.
18:25 karolherbst: although native is really just "whatever"
18:25 karolherbst: jekstrand: more or less
18:25 karolherbst: ...
18:25 karolherbst: jenatali: ^^
18:25 jekstrand: karolherbst: I'm mildly inclined to say that the back-end can look at stage == KERNEL
18:25 jekstrand: For dealing with GL vs. CL rules
18:25 karolherbst: fair
18:25 jenatali: karolherbst: I highly doubt any hardware is really going to have a sin that meets CL's precision requirements
18:26 karolherbst: and we mark everything as exact except native_*?
18:26 karolherbst: how we deal with stuff like "cl-finite-math-only" then?
18:26 jekstrand: karolherbst: Depends on ffastmath
18:26 jekstrand: karolherbst: For finite-math-only we can ignore it for now
18:26 karolherbst: right
18:26 jekstrand: Unless we really find stuff that badly cares
18:26 jekstrand: It's just a hint
18:27 karolherbst: well.. it would get rid of a few nan checks
18:27 karolherbst: I'd assume it really matters
18:27 jekstrand: Besides, with clover not bothering to optimize derefs, a little math isn't going to hurt anyone. :-P
18:27 karolherbst: right..
18:27 karolherbst: for now it doesn't matter one bit
18:27 karolherbst: I just prefer to potentially implement something working for the corner cases as well if we find a solution already for that
18:27 karolherbst: otherwise we just rework later
18:27 jekstrand: And I think it'll be far easier to reason about it once we have some concrete examples of optimizations we want to do.
18:27 karolherbst: I just want to at least keep it in mind
18:28 karolherbst: fair
18:28 karolherbst: okay.. so for now, CL is just always exact, except for native_* opcodes and we just lower all exact ones to clc functions
18:29 karolherbst: and !fsqrt can be lowered to some scaling and fsqrt (dropping exact)
18:29 jenatali: So... you want to move libclc mangling outside of vtn?
18:29 karolherbst: same for fdiv for some hw
18:29 karolherbst: mhhh..
18:29 karolherbst: I think yes
18:30 karolherbst: doing optimizations of opcodes directly could help
18:30 karolherbst: optimizing the inlined functions is painful
18:30 jekstrand: Wait, what?
18:30 jekstrand: Why are we moving mangling?
18:30 karolherbst: I am sure there are some nice optimizations one can do with trigonometric functions
18:31 jekstrand: karolherbst: We have nothing interesting along those lines today. :P
18:31 karolherbst: jekstrand: I mean.. we can keep it for now, but potentially
18:31 jekstrand: jenatali: The big advantage to doing CLC late is things like CSE
18:31 karolherbst: jekstrand: yeah.. because games are not crazy enough to use those heavily
18:31 jekstrand: jenatali: If those built-in functions involve control-flow, CSE can be difficult to impossible.
18:31 karolherbst: compute workloads especially scientific ones on the other hand....
18:32 jekstrand: jenatali: Give this a read: 656ace3dd85b2eb8c565383763a00d059519df4c
18:32 karolherbst: oh wow
18:32 jekstrand: :)
18:33 jenatali: jekstrand: Where is that?
18:33 jekstrand: jenatali: It's a mesa sha from a change I made in our compiler
18:33 jenatali: Ah
18:33 karolherbst: I bet with libclc the difference can be multiple magnitudes bigger :p
18:35 jekstrand: Quite possibly
18:36 jekstrand: Not that I want jenatali to rewrite mangling *again* :(
18:36 karolherbst: also just because the nirs will remain much slower and most passes will have much less to do
18:36 karolherbst: uhm
18:36 karolherbst: smaller, not slower
18:36 jenatali: It probably makes sense to lower some stuff to libclc directly out of vtn, but others it probably makes sense to keep it as an opcode and lower it later I suppose
18:36 jenatali: Hopefully the set of things needing mangling at that later point would be scoped enough that mangling could be simpler
18:36 karolherbst: jenatali: I'd just reuse whatever opcodes we have today I think... but yeah... I think we want to be able to convert some nir opcodes to libclc functions
18:37 karolherbst: but yeah.. we don't have to do it now anyway
18:37 karolherbst: right now what matters is to get the precision right
18:37 karolherbst: and split up ffma
18:38 karolherbst: splitting up ffma will take a month anyway :p (until every driver was fixed and stuff)
18:40 karolherbst: anyway.. I can deal with it :p and I will try a solution for the precision thing
18:44 jenatali: jekstrand: If you were curious, this is the reworking I had to do for images: https://gitlab.freedesktop.org/kusma/mesa/-/merge_requests/298
18:53 jekstrand: jenatali: Doesn't look too bad
18:53 jenatali: Yeah, not terrible
18:54 jenatali: I'm hoping alignments also won't be too bad, but dealing with constant mem first
18:54 karolherbst: heh.. splitting ffma is kind of fun
18:55 jekstrand: jenatali: The good news is that I'm just about done churning the world out from under you. :)
18:55 karolherbst: :D
18:55 jenatali: jekstrand: Until karolherbst reworks libclc mangling ;)
18:56 jekstrand: jenatali: Well, yeah...
19:03 karolherbst: is there hw not having ffma and fmad?
19:03 jekstrand: yes
19:04 karolherbst: I think it does not make sense to have lower_ffma and lower_fmad .. but I think having one for both should be safe?
19:04 karolherbst: or...
19:04 karolherbst: we just have use_ffma and use_fmad and if neither is set we lower?
19:05 karolherbst: ohh wait
19:05 karolherbst: yeah.. doesn't matter
19:06 karolherbst: ohh.. it's fuse_ffma...
19:06 karolherbst: not use
19:11 jekstrand: karolherbst: Ok, I think I've got a pass. What's a good test?
19:17 jenatali: jekstrand: A pass for what?
19:19 jekstrand: constant-folding load_deref of nir_var_mem_constant
19:20 jenatali: Ah
19:22 airlied: karolherbst: renderdoc is open and considered to be generally better than vendor tooling, just need a cl profiler :-p
19:33 karolherbst: jekstrand: basic constant and constant_source
19:34 jekstrand: karolherbst: I've got constant_source folding nicely.
19:34 karolherbst: nice
19:34 karolherbst: constant_source was the one with the in kernel direct load?
19:34 jekstrand: yup
19:34 karolherbst: ahh, cool
19:34 jenatali: I think I'm missing why this pass is needed...
19:35 karolherbst: optimizations
19:35 karolherbst: mainly to reduce the size of the constant buffer
19:35 jenatali: Ah
19:35 karolherbst: but also that direct loads are direct and potentially end up as load_const instead
19:35 jekstrand: And reduce the number of memory loads that are just for silly shader constants
19:36 karolherbst: mhhh
19:36 karolherbst: btw.. we need such a pass for nouveau also for GL
19:37 jekstrand: karolherbst: ?
19:37 karolherbst: guess what happens with indirects on in shader constant arrays?
19:37 karolherbst: they get spilled :p
19:37 jekstrand: karolherbst: nir_opt_large_constants
19:37 karolherbst: but does it do it for all indirects?
19:37 jekstrand: yeah
19:38 karolherbst: ahh, cool
19:38 jekstrand: It's basically exactly what you want for that case
19:38 karolherbst: cool
19:38 karolherbst: right now we don't bind a constant table for shaders, but that's something I'd wanted to look into
19:38 karolherbst: especially as this also allows for better optimized code for constants overall
19:39 jekstrand: jenatali, karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6588
19:39 karolherbst: nice
19:40 jekstrand: Seems to pretty neatly do what we want
19:41 karolherbst: what about handling casts?
19:41 jekstrand: karolherbst: It will handle simple casts
19:41 jekstrand: karolherbst: More complex casts.... That'll take a lot more work.
19:41 karolherbst: okay
19:42 karolherbst: I was more thinking about int bit size casts
19:42 jekstrand: Even that's tricky
19:42 jekstrand: I think we can make it work but we need to do something similar to load_constant
19:43 jekstrand: Let it spill into the nir_shader::constant_data and then look for load_global_constant(iadd(constant_base_ptr, offset))
19:43 karolherbst: mhhh
19:43 jekstrand: I've also thought about writing a pass that handles load_deref(cast(x)) for bitcast type stuff
19:43 karolherbst: I was wondering if we could just continue following the chain as if you cast the ptr and have constant indices you still end up at some value sooner or later
19:43 jekstrand: If the cast is the first deref that the load touches, it might not be too bad
19:44 jekstrand: karolherbst: The problem is that nir_constant doesn't have a physical memory layout
19:44 jekstrand: It's a logical thing
19:44 karolherbst: mhhh
19:44 jekstrand: There are some casts we can probably trivially get rid of. Like float->int
19:44 jekstrand: if they have the same bit size
19:45 jekstrand: But that should be done in opt_deref
19:45 karolherbst: fair
19:45 jekstrand: And I've thought about writing a pass that tries to turn load_deref(cast(x)) into alu(load_deref(x))
19:45 karolherbst: jenatali: does CL demand a strict ordering of in kernel constants?
19:46 karolherbst: like you declare a global array with constants and run OOB
19:46 karolherbst: what happens?
19:46 jenatali: UB I'm pretty sure
19:46 karolherbst: okay
19:48 karolherbst: mhhhhh
19:49 karolherbst: jekstrand: soo... I know it's probably UB and all, but with such an opt or with our constant handling in general we might turn a OOB UBO read, which would normally not crash the kernel into a global memory read which could end up trapping... but because it's UB we can just not care
19:49 karolherbst: but I could see issues like this coming up
19:49 karolherbst: I still wouldn't care unless there is a good reason
19:49 karolherbst: just a random thought
19:50 jekstrand: karolherbst: In our graphics shaders, we bounds-check it.
19:50 karolherbst: all global/constant memory accesses?
19:50 jekstrand: karolherbst, jenatali, bbrezillon: Between the three of us, we've landed 205 patches this month!
19:50 karolherbst: only? :p
19:50 jenatali: Four of us?
19:50 karolherbst: ohh wait
19:51 karolherbst: it's september already
19:51 jekstrand: Of course, alyssa has landed 226 by herself. :P
19:51 karolherbst: so 205 in three days?
19:51 karolherbst: :D
19:51 Sachiel: this month being september? that's a lot for 3 days
19:51 jekstrand: karolherbst: No 205 since the start of August
19:51 karolherbst: ahh...
19:51 karolherbst: although you landed like 40 this week, no? :p
19:53 karolherbst: ehh.. at some point we want to have shader-db for cl kernels :p
19:53 jekstrand: karolherbst: Yeah, most of mine have landed this week.
19:53 jekstrand: Or a large chunk of them
20:00 karolherbst: can I rely on the ordering in opt_algebraic?
20:07 jekstrand: karolherbst: Yes
20:07 jekstrand: karolherbst: It runs top-to-bottom
20:07 karolherbst: okay, cool
20:39 karolherbst: nice, 14 labels...
20:40 karolherbst: still needs more work though
20:43 karolherbst: jekstrand: zink foils my plan
20:45 jekstrand: karolherbst: ?
20:46 karolherbst: so zink is one driver which doesn't know
20:46 jekstrand: doesn't know what?
20:46 karolherbst: about hw ffma/fmad support
20:46 jekstrand: Well, it is in the same boat as D3D12
20:47 karolherbst: so it doesn't want fadd+fmul to get merged, but it also doesn't want ffma to get lowered
20:47 karolherbst: but having three boolean switches also sounds annoying
20:48 karolherbst: maybe we need an enum fma_mode { has_ffma, has_fmad, has_both, has_neither, has_neither_but_leave_ffma }...
20:48 karolherbst: at least those are the 5 variations we have, no?
20:49 jekstrand: Uh...
20:49 jekstrand:is starting to like .exact :P
20:49 karolherbst: :D
20:50 karolherbst: but somehow an enum indeed feels more natural
20:50 jekstrand: enum seems really complicated
20:50 karolherbst: any better idea?
20:51 karolherbst: mhhh
20:51 karolherbst: I mean I could also add back "lower_ffma"...
20:52 jekstrand: So zink is the weird "has neither but leave ffma?"
20:52 karolherbst: yes
20:52 karolherbst: well.. right now nouveau is the same :p
20:52 karolherbst: but I'd like to advertise to nir what the hw supports
20:52 jenatali: I don't understand that option - why would you use that?
20:53 karolherbst: jenatali: does d3d12 know what the hw actually supports?
20:53 jenatali: karolherbst: No
20:53 karolherbst: so what do you do with "fma"?
20:53 karolherbst: lower it to fadd+fmul or leave it as fma? :p
20:53 jenatali: Lower to mad I think
20:53 karolherbst: why?
20:53 karolherbst: the hw can have fma
20:54 karolherbst: I bet you do have a fma function in hlsl, no?
20:54 jenatali: Sure, but DXIL doesn't have a fma instruction
20:54 jenatali: Nah, just mad
20:54 karolherbst: mhhhhh
20:54 karolherbst: interesting
20:54 karolherbst: I mean strictly speaking glsl doesn't have fma either
20:54 karolherbst: it has an fma with a strong preference being ffma, but fmad is fine as well
20:55 jenatali: https://github.com/microsoft/DirectXShaderCompiler/blob/master/docs/DXIL.rst#fmad
20:55 jenatali: There is an fma, but only for doubles
20:55 jekstrand: karolherbst: I'm not sure it's that strong of a preference
20:55 karolherbst: well.. at least we treat it as one
20:56 karolherbst: "fma performs, where possible, a fused multiply-add operation, returning a * b + c."
20:56 karolherbst: sounds like a preference to me, maybe not a strong one
20:56 karolherbst: but a preference
20:57 karolherbst: for precise it is fused
20:57 karolherbst: otherwise whatever
20:58 airlied: there is also various radeon hw where fma runs slow because market differentiation
20:58 karolherbst: right...
20:58 airlied: not sure though if that is just terascale or if the newer ones do it
20:58 karolherbst: yeah...
20:58 karolherbst: but I solved this by doing fmad first, then ffma
20:59 karolherbst: so if radeon reports has_fmad and has_ffma it gets fmad first
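A rough sketch of how the five-way `fma_mode` enum floated above might map to behaviour, combined with the "fmad first, then ffma" ordering. The enumerator names come from the 20:48 message; `fma_op`, `fma_policy` and `policy_for` are hypothetical, and none of this is NIR's real option set.

```cpp
#include <cassert>

enum class fma_mode {
    has_ffma,
    has_fmad,
    has_both,
    has_neither,
    has_neither_but_leave_ffma, // the zink / D3D12 style "doesn't know" case
};

enum class fma_op { none, fmad, ffma };

struct fma_policy {
    fma_op merge_inexact_mul_add; // what an inexact fmul+fadd may be merged into
    bool lower_ffma;              // whether an explicit ffma gets rewritten
};

fma_policy policy_for(fma_mode mode) {
    switch (mode) {
    case fma_mode::has_ffma:
        return {fma_op::ffma, false};
    case fma_mode::has_fmad:
        return {fma_op::fmad, true};  // ffma -> fmad when there is no ffma
    case fma_mode::has_both:
        return {fma_op::fmad, false}; // fmad is tried first, ffma is kept
    case fma_mode::has_neither:
        return {fma_op::none, true};
    case fma_mode::has_neither_but_leave_ffma:
        return {fma_op::none, false}; // no merging, but ffma is left alone
    }
    assert(!"unknown fma_mode");
    return {fma_op::none, true};
}
```

Whether a table like this is less annoying than three boolean switches is exactly the open question in the exchange; the sketch only shows that the five cases are distinguishable by two outputs.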
21:02 Ntemis: hi
21:03 Ntemis: we are facing a regression on mesa 20.1.3 and up for rpi4
21:03 Ntemis: we get no picture from hdmi
21:03 Ntemis: is this known?
21:04 Ntemis: 20.1.3 workd fine
21:04 Ntemis: *works
21:04 Ntemis: updating beyond is no go
21:04 Ntemis: any takers?
21:04 Ntemis: to enlighten me
21:05 airlied: Ntemis: like 20.1.4 fails?
21:05 Ntemis: yes
21:06 airlied: Ntemis: should be bisectable then
21:06 Ntemis: for sure
21:06 airlied: there is only 20-30 commits in there, and most of them don't apply to rpi4 at all
21:07 Ntemis: i was wondering if you knew anything thus my coming here
21:07 airlied: not sure who the rpi experts are anymore
21:07 airlied: seems strange that it would affect hdmi at all
21:07 airlied: since mesa knows nothing about that
21:08 airlied: dri2: do not conflate unbind and bindContext() failure
21:08 airlied: egl/dri2: try to bind old context if bindContext failed
21:08 airlied: are the only two commits I could even remotely blame
21:09 Ntemis: we have to ship batocera with an older mesa version so all other devices get affected by this
21:09 Ntemis: do you have rpi4?
21:10 airlied: nope, probably best to file an issue in gitlab
21:10 airlied: or bisect and revert the bad commit
21:11 Ntemis: tbh we were waiting for an update to bring the fix but it didnt happen yet
21:11 Ntemis: ofc we havent tested 1.7 yet
21:12 Ntemis: but i doubt that one can fix it too
21:12 ccr: ...
21:13 airlied: I haven't heard anyone else complain, are you confident it isn't the kernel?
21:15 Ntemis: let me try to update that too
21:16 Ntemis: and i can get back and report
21:16 Ntemis: thank you for your time
21:25 Lyude: daniels: any idea if the issue with git fetches being very large on fdo ever got fixed? it seems like I'm still running into the problem on amdgpu-next, but I don't -think- it's happening on the other drm repos?
21:41 Lyude: huh, and now i'm only getting ~20KB/s
22:10 karolherbst: oh wow.. doesn't seem like I broke anything
22:16 karolherbst: jenatali: do I read the OpenCL mad right that it allows to be fused _and_ unfused?
22:17 jenatali: karolherbst: Where'd you see that?
22:18 karolherbst: ohh wait.. I think that's only valid for the embedded profile
22:18 karolherbst: yeah...
22:18 karolherbst: I didn't see the embedded
22:18 karolherbst: "The user is cautioned that for some usages, e.g. mad(a, b, -a*b), the definition of mad() is loose enough in the embedded profile that almost any result is allowed from mad() for some values of a and b." :D
22:18 jenatali: karolherbst: No, I think you're right
22:18 jenatali: https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#relative-error-as-ulps
22:19 karolherbst: okay
22:19 karolherbst: soo mad is essentially: whatever is the fastest
22:19 jenatali: Yeah, I think so
22:19 jenatali: "On some hardware the mad instruction may provide better performance than expanded computation of a * b + c"
22:19 karolherbst: which makes it inherently an inexact operation
22:19 karolherbst: _but_
22:20 karolherbst: a * b + c _are_ exact, because you have to allow it through a compiler option to get fused
22:20 jekstrand: But can it be higher precision?
22:20 jekstrand: It doesn't really say
22:20 jekstrand: I guess higher than expected counts as "reduced accuracy"?
22:20 karolherbst: "Implemented either as a correctly rounded fma or as a multiply followed by an add both of which are correctly rounded"
22:20 jenatali: Right
22:20 jenatali: It has to be exactly one of those two
22:21 karolherbst: fun..
22:21 karolherbst: so mad is either, fma is fused and a * b + c is unfused unless -cl-mad-enable was set
22:21 jenatali: Seems right
22:22 jekstrand: Ugh
22:22 karolherbst: the wording of the spec though...
22:22 karolherbst: -cl-mad-enable
22:22 karolherbst: Allow a * b + c to be replaced by a mad instruction. The mad instruction may compute a * b + c with reduced accuracy in the embedded profile. See the SPIR-V OpenCL environment specification for accuracy details. On some hardware the mad instruction may provide better performance than the expanded computation.
22:22 jenatali: In the embedded profile
22:22 karolherbst: ohh mad is either.. right
22:22 karolherbst: and embedded can be even worse?
22:22 jenatali: Yep
22:22 karolherbst: okay
22:23 jenatali: "Any value allowed (infinite ulp)"
22:23 karolherbst: lol...
22:23 karolherbst: you know what results in fast code?
22:23 karolherbst: by constant folding it to 0 :p
22:23 jenatali: Yep
22:24 karolherbst:is thinking about how far an implementation could get with that
22:24 karolherbst: I mean..
22:25 jekstrand: Let's not....
22:25 karolherbst: I know they did this for a*b - a*b
22:25 jekstrand: That's worse than the GL spec for quad interpolation
22:25 karolherbst: and fmad(a, b, -fmul(a, b)) could give you non 0 results
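The non-zero result mentioned above can be reproduced with a small sketch. It assumes double precision rather than CL floats, and it emulates a correctly rounded fma with exact rational arithmetic instead of a hardware instruction; `mad_unfused` and `mad_fused` are hypothetical names used only for illustration.

```python
from fractions import Fraction


def mad_unfused(a, b, c):
    # Multiply, round to double, then add and round again.
    return a * b + c


def mad_fused(a, b, c):
    # Compute a * b + c exactly, then round once to double.
    # This stands in for a correctly rounded fma.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# a * b is exactly 1 + 2**-29 + 2**-60; the last term is lost when the
# product is rounded to a double.
a = b = 1.0 + 2.0 ** -30
c = -(a * b)

print(mad_unfused(a, b, c))  # the rounded product cancels itself
print(mad_fused(a, b, c))    # the rounding error of a * b survives
```

Both results are permitted for mad under the wording quoted earlier, which is why mad(a, b, -a*b) cannot be assumed to be zero.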
22:26 karolherbst: ehh. the glsl spec also has its dirty tricks :p
22:27 karolherbst: imirkin or I wrote a fun fix for something the VK-GL CTS didn't even consider to happen
22:28 karolherbst: https://github.com/KhronosGroup/VK-GL-CTS/issues/51
22:29 karolherbst: so.. we ended up with the hw giving us 1 _and_ 0
22:29 karolherbst: and the CTS only either tested against 1 or 0
22:29 karolherbst: but not both
22:30 karolherbst: or 0 and a?
22:30 karolherbst: something like that
22:42 karolherbst: jekstrand: btw, I have an idea on how to implement those compiler options: we just have a list of opt_algebraic expressions removing nan checks or doing whatever optimization :)
22:43 karolherbst: and those can be just called from inside clover or somewhere after all the clc lowering happened
22:44 karolherbst: or simply being part of opt_algebraic depending on special options..
22:44 karolherbst: I think that's probably the most straightforward way
22:49 jekstrand: karolherbst: At the moment, nir_compiler_options seems like the best plan
22:49 jekstrand: Might be the first thing we ever put in that struct that's actually a compiler option. :p
22:49 karolherbst: lol...
22:54 karolherbst: jekstrand: mhh.. should we emit fmad or fadd(fmul(a, b), c) for ffma if the target doesn't have any ffma?
22:55 karolherbst: or should we keep it as an inexact ffma and let opt_algebraic lower it later
22:56 jekstrand: karolherbst: I think I'm inclined to say we should always emit fmad for GLSL and Vulkan SPIR-V
22:56 karolherbst: mhhh
22:56 jekstrand: Possibly with .exact
22:56 karolherbst: but that would hurt nouveau
22:57 jekstrand: Hrm...
22:57 jekstrand: bah
22:57 karolherbst: and other drivers having devices with only ffma but no fmad
22:57 jekstrand: You're right
22:57 jekstrand: It's almost like we want a 3rd opcode which is "ffma_if_you_have_it"
22:57 karolherbst: jekstrand: that's what I have for late opts: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6591/diffs#203b86d7d0bd6811cdfd089faac4ce0552ec4151_2096_2120
22:58 karolherbst: which kind of solves this nicely
22:58 jekstrand: The other option is that we could emit ffma and say that 3D drivers are allowed to lower it to mad or mul+add
22:58 karolherbst: if you don't have fma it doesn't matter if it's exact or not
22:59 karolherbst: if you have neither it gets lowered earlier to fmul+fadd
22:59 karolherbst: (ignoring CL for now and then it's all valid), just wondering how we have to change things for CL then
23:01 karolherbst: so we could just prevent glsl or vtn to emit ffma (except for kernels) and use fmad instead so it should be fine for all cases
23:01 karolherbst: well.. in case the hw doesn't have ffma
23:01 karolherbst: so with CL we still get the ffma out of vtn, but then we can lower it to the sw emulation
23:01 karolherbst: and for graphics it's just fmad
23:02 jekstrand: I think we probably do want ffma when the HW supports it
23:02 karolherbst: or a * b + c and let opt_algebraic merge it into fmad
23:02 karolherbst: sure
23:02 jekstrand: karolherbst: On Intel, we split all !.exact ffma into mul+add in opt_algebraic
23:02 karolherbst: fmad only if the hw has no fma
23:03 jekstrand: And then re-merge at the end
23:03 karolherbst: jekstrand: yeah.. for nouveau I did !fuse_fma but also !lower_ffma
23:03 karolherbst: which gives me what we need
23:03 jekstrand: karolherbst: We do lower_ffma and then use the ffma fusion pass later.
23:03 jekstrand: Way better shader-db numbers
23:03 karolherbst: heh.. interesting
23:04 karolherbst: shouldn't matter... but hey..
23:04 karolherbst: probably order of optimization messing something up
23:04 karolherbst: or missing optimizations for fma
23:05 jekstrand: It's just that it's way easier to optimize mul+add than fmad
23:05 karolherbst: right...
23:05 karolherbst: mhhh
23:05 jekstrand: Also, there are cases where the compiler can do a better job with mad-fusion than if it trusts the fusion that the app does
23:06 karolherbst: sure..
23:06 karolherbst: I've added a todo addressing this
23:06 karolherbst: slowly I start to think my 1 month estimation wasn't pessimistic :p
23:07 karolherbst: mhhh..
23:08 karolherbst: I wonder how many opts one might have to add to make it not matter anymore
23:08 karolherbst: sadly I don't have access to your super huge shader-db :p
23:09 karolherbst: ohh.. I already see a few opts one could add
23:09 karolherbst: I think I will check with nouveau first and see what regressions I get and see if I can sort it out
23:24 karolherbst: jekstrand: I think one of your patches regressed opengl...
23:24 karolherbst: getting a crash with a shader inside glsl_type::get_explicit_type_for_size_align
23:24 karolherbst: ... I'll bisect
23:26 karolherbst: what a terrible shader...
23:35 jenatali: karolherbst: Do you still have a link to the fdiv scaling patch you shared a while ago?
23:35 karolherbst: jenatali: I linked it on the MR
23:36 karolherbst: but I think I removed it from my tree
23:36 jenatali: Ah, great
23:36 karolherbst: the scaling is quite trivial though
23:39 karolherbst: jenatali: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/7071473a923 seems to be valid still :)
23:40 karolherbst: ohh that's without fdiv.. mhh
23:40 karolherbst: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/4b249c0e6ba454c48bf6fc844aa39788a0d76fe1
23:40 karolherbst: ehh..
23:41 karolherbst: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/c21d90832f6dae5ee67caa3c589dc9af56e0624f
23:41 karolherbst: that one :)
23:41 jenatali: Thanks :)
23:41 karolherbst: why.. exit code 255 from './bisect.sh' is < 0 or >= 128
23:44 karolherbst: jekstrand: ehh.. I think you'll hate me for what I've done in nouveau :D
23:46 karolherbst: sooo.. what I do to calculate alignment of function temp memory is to do this: https://gitlab.freedesktop.org/mesa/mesa/-/blob/master/src/gallium/drivers/nouveau/codegen/nv50_ir_from_nir.cpp#L63
23:46 karolherbst: of course that trips up glsl_type::get_explicit_type_for_size_align because now the alignment is way bigger than the natural alignment of the type itself
23:49 karolherbst: airlied: why..........
23:50 karolherbst: jekstrand, jenatali: up until now I hoped OpenCL is a world without inline assembly... but guess what: https://github.com/ROCm-Developer-Tools/LLVM-AMDGPU-Assembler-Extra/blob/master/examples/gfx8/s_memrealtime_inline.cl
23:50 karolherbst: .....
23:51 jenatali: ......
23:51 karolherbst: ....
23:51 karolherbst: and people even ask for it
23:51 jenatali: Clearly that's not portable
23:51 karolherbst: I don't think that's what the devs have in mind when writing that anyway
23:51 karolherbst: I mean....
23:52 karolherbst: ufff
23:52 airlied: lvl0 has also got some inline stuff
23:52 karolherbst: I kind of liked the earlier me being oblivious of the fact that people want this stuff in CL
23:52 airlied: since that's how you make libraries work at all
23:52 karolherbst: they should write spirv
23:53 karolherbst: or just write the entire thing in assembly :p
23:53 karolherbst: and the runtime has a nice IR extension
23:53 karolherbst: ...
23:53 karolherbst: *sigh*
23:53 karolherbst: I mean.. how would that even work with spirv?
23:54 airlied: you get the asm txt
23:54 karolherbst: wait.. spirv supports this evil even?
23:57 airlied_: karolherbst: lost my irc host
23:57 airlied_: https://github.com/intel/llvm/pull/1290/files
23:57 airlied_: in case you didn't find it :-P
23:57 karolherbst: ehh....
23:58 karolherbst:runs away
23:58 jenatali: .....
23:58 karolherbst: good luck getting that accepted in spir-v
23:59 airlied: it's a vendor ext, it'll likely get accepted without anyone noticing
23:59 karolherbst: well, now we noticed
23:59 karolherbst: can we just ack it?
23:59 klys: what's directfb-core-drmkms, I don't have that in debian, still wondering about getting sdl to work with virgl
23:59 karolherbst: I am sure we will get like 10 people together
23:59 karolherbst: ehh
23:59 karolherbst: *nack