00:37chrisf: anholt, interesting.
00:38chrisf: anholt, can you ping my work email about it and i'll see who i can light a fire under :P
00:41chrisf: nvm, ive got things in motion now
12:27udovdh: something happened: https://pastebin.com/rthnVYFN
12:28udovdh: looks like some page fault, but is it a bug or an indicator of an issue in the video stream that was played?
12:28udovdh: stuff recovered and kept playing after the glitch
12:29udovdh: happened some more after a little while
12:29udovdh: did not see mmhub0 mentioned before
12:30udovdh: kernel 5.7.9 on Gigabyte X570 Aorus Pro, BIOS F20, Fedora 31 with kernel.org, git mesa, etc.
12:34udovdh: different stuff happened after that: https://pastebin.com/6U7XZjsA
12:34udovdh: real quality code... :-/
12:34udovdh: uptime of 12 days and gnome-shell is killed for OOM reasons?
12:36HdkR: Oom? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6148 Might help
12:40udovdh: HdkR, how can I be sure that this is the fix for what just happened?
12:40HdkR: Can't be
12:41udovdh: Ah, so wait and do a git pull when the commit is merged?
12:41udovdh: I do not see this issue regularly
12:45udovdh: was playing HD video off a bluray iso using vlc
12:45udovdh: fairly high bitrate: 35 GB for 1h38m
12:59udovdh: due to multiple audio tracks but also hq video
14:28karolherbst: divergency analysis
14:28karolherbst: I need per component info for vec2/3/4
14:29karolherbst: well.. I can also hack around it :p
15:36jekstrand: karolherbst: I'm starting to think we want a nir_var_mem_global_const
15:36jekstrand: karolherbst: And a load_global_const
15:37karolherbst: jekstrand: probably
15:37jekstrand: karolherbst: Mostly because load_global_const would be CSE-able and we could ignore barriers for it.
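A toy Python model (an illustration, not NIR code) of the property jekstrand is describing here: a load marked reorderable, like the proposed load_global_const, can be combined across a barrier, while a plain load_global must be reloaded after one:

```python
def cse(instrs):
    """Toy common-subexpression elimination over (op, addr, reorderable)
    tuples. Barriers invalidate cached non-reorderable loads; reorderable
    loads (the load_global_const idea) survive them and stay CSE-able."""
    seen = {}
    out = []
    for op, addr, reorderable in instrs:
        if op == 'barrier':
            # keep only loads that are allowed to move across barriers
            seen = {k: v for k, v in seen.items() if v}
            out.append((op, addr, reorderable))
            continue
        key = (op, addr)
        if key in seen:
            continue  # duplicate load eliminated
        seen[key] = reorderable
        out.append((op, addr, reorderable))
    return out
```

With a barrier between two identical loads, the load_global_const version collapses to one load while the load_global version does not.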
15:37jekstrand: The more I play around with global memory stuff, the more it looks like a good idea.
15:37karolherbst: yeah.. I am still not sure how to deal with constant* memory as it's just painful
15:38jekstrand: I'm also wondering if we don't want some sort of a load_global_const which takes a base pointer and an offset.
15:38karolherbst: jekstrand: the big issue is, for the kernel there is no difference between input memory and taking the address of _tmp values
15:38jekstrand: karolherbst: Yeah, that's tricky
15:38jekstrand: karolherbst: CL does have __constant though doesn't it?
15:39jekstrand: And inputs aren't __constant?
15:39karolherbst: but so are in kernel constants
15:39karolherbst: there is no difference between those
15:39jekstrand: you answered "yes" to a negative. Which way does it go?
15:39karolherbst: well, inputs are only constant if declared as such
15:40karolherbst: of course you can't write to the input buffer
15:41karolherbst: you also don't have direct access to it anyway
15:41karolherbst: the way stuff works is that there is an implicit wrapper around kernel functions anyway
15:41karolherbst: and parameters are just parameters
15:41karolherbst: so you can take the address of the parameter and write to it if you want to
15:41karolherbst: but if you have a __constant * to something
15:42karolherbst: it can point to memory you pass as kernel args _or_ in kernel constant memory
15:46jenatali: What we're doing (if it helps) is running a pass on inline constants to see if we can promote them shader_temp, which maps to an inline constant buffer inside DXIL, otherwise we reflect them out to the runtime and bind them as hidden inputs
15:47jenatali: That way you can do all kinds of crazy take-address-of-local-const stuff, convert it to an int, store it in a buffer, load it later through a different deref chain, cast it back, and it just works
15:48karolherbst: we were mainly wondering on how to deal with actual constant memory though and maybe having a new file could help
15:48karolherbst: just makes handling of inline constants more annoying
15:48jenatali: I'm not seeing what global_const buys you that uniform/ubo doesn't?
15:49karolherbst: jekstrand: so.. here is the big issue: constants can be global loads with a constant flag and ubos
15:49karolherbst: soo.. the thing is, we only have 8 constant buffers in total with compute shaders on nv
15:50karolherbst: 1 used for inputs/uniforms, 1 for driver const buffer
15:50karolherbst: so we only have 6, but GL mandates more of course
15:50karolherbst: so what we do is to spill the remaining ubos to global memory loads
15:50karolherbst: and I bet that marking global memory loads as containing constant data probably has a benefit
15:51karolherbst: so yeah.. load_global_const would make sense in this case
15:51karolherbst: does it make sense for the general case? no, as CL constant memory _are_ ubos
15:51karolherbst: they have the same size limitation
15:51karolherbst: they behave the same
15:51karolherbst: and you could even bind them like in GL
15:51karolherbst: by having implicit indexing on the args or go full indirect
15:52karolherbst: the bigger issue for us just would be.. what happens if you have some conditional code selecting either inline constants _or_ spilled ubos?
15:52karolherbst: but in this case as the inline constants are in a scratch buffer already they could be accessed by global addressing as well
15:53karolherbst: those details are just super annoying to deal with
15:55jekstrand: I'm less concerned with spilling out to UBOs and more concerned with how the compiler treats them with respect to things like CSE and barriers.
15:55jenatali: Yup - hence why we promote any inline constants that have a deref chain to them which doesn't end in a load (e.g. a select) into ubos as well
15:56jekstrand: I think, if we have a different mode which is unaffected by barriers, NIR's copy-prop stuff should be good enough to take care of most of our CSE issues.
15:56jekstrand: Though I'm not 100% sure on that.
15:57jekstrand: The other thing is that, in the case where you have a constant offset from some effectively fixed address, we can do much better if we can use a block load instruction.
15:58jekstrand: Essentially, it loads 32B or 64B of data spread across 8 or 16 SIMD lanes. We then pick up the individual bits with something that looks like subgroup ops. (Not real subgroup ops but that's what they sort-of look like.)
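A toy model of the block load jekstrand sketches (an assumption for illustration, not actual Intel ISA semantics): one wide read of 32B or 64B gets spread across 8 or 16 SIMD lanes, one dword per lane:

```python
def block_load(memory, base, nbytes=32, lanes=8):
    """Toy block load: read `nbytes` contiguous bytes starting at a
    subgroup-uniform `base` and spread them across `lanes` SIMD lanes,
    4 bytes per lane. Individual values are then picked out per-lane
    with subgroup-op-like shuffles (not modeled here)."""
    assert nbytes == 4 * lanes, "one dword per lane"
    return [memory[base + 4 * i : base + 4 * (i + 1)] for i in range(lanes)]
```

One such load replaces many per-lane scalar loads, which is where the wide-load and register-pressure win comes from.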
15:58jekstrand: The result is very wide loads and much lower register pressure.
15:58jekstrand: For Intel, I couldn't care less about being able to shunt them off to actual UBOs. I just want my big wide loads.
15:58jekstrand: That's the #1 benefit we get from UBOs
15:59jekstrand: For Nvidia, I realize you really want to use your magic UBO hardware and that's fine.
15:59jekstrand: They're sort-of different problems.
15:59jekstrand: But the "help NIR reason about them" problem is one that I think will only be solved with nir_var_mem_global
16:00jekstrand: The others will take more creativity, I think.
16:00karolherbst: jekstrand: but would there be a problem if we just treat them as ubos? using mem_global was always more of a workaround
16:01jekstrand: karolherbst: Not strictly speaking, no. However, with generic pointers, you may not always know.
16:01karolherbst: jekstrand: can't cast constant to generic
16:01jekstrand: Ok, that's helpful
16:01karolherbst: and what was the reason I added the workaround
16:02karolherbst: you can back constant* with SVM memory
16:02karolherbst: CL 2.0 strikes again
16:02jenatali: You could just not do CL2.0 :P
16:02jekstrand: For us, as I said, the only benefits we get from UBOs are a) no 64-bit arithmetic and b) big wide loads if it's a constant offset.
16:03karolherbst: constant buffers in CL are super small
16:03jekstrand: So if we can figure out how to get our big wide loads with a 64-bit pointer, 64-bit pointers all the way.
16:03karolherbst: so doing 32bit math is feasible
16:03karolherbst: I think I'd even back up SVM memory with a real ubo when launching the kernel and just hide it entirely
16:03jekstrand: I also have other reasons to want nir_var_mem_global_const for 64-bit pointers which I can't discuss in detail. :-)
16:03karolherbst: I think that's what nvidia is doing these days as well
16:04karolherbst: I'd like to have it for spilled ubos :p
16:04jekstrand: Yeah, I think there are plenty of reasons why it's useful.
16:04karolherbst: but that's more of a "read this value, but it's constant"
16:04karolherbst: we have LDG.CONSTANT variants
16:05jekstrand: Spilled UBOs, Anything where it's just more convenient to pass a 64-bit pointer into the shader, etc.
16:05karolherbst: and I bet it gets cached more aggressively or so
16:05jekstrand: That could be.
16:05karolherbst: jekstrand: that's what we do with GL :)
16:05karolherbst: well.. we load it from the driver constbuf
16:05karolherbst: but yeah
16:06karolherbst: still.. we don't do LDG.CONSTANT with those yet
16:06jekstrand: See, you could get a perf boost from this! :P
16:06karolherbst: yeah, that's my hope
16:07karolherbst: but I still need to write this lowering pass for ubos as well :)
16:07karolherbst: it's a bit tricky with indirects
16:07jekstrand: I'm also not sure how I would lower load_global_const to wide loads
16:08jekstrand: It requires some alignment restrictions and things.
16:08jekstrand: I guess I could add a nir_intrinsic_assert_aligned_uniform(x, align)
16:08jekstrand: Which declares that its value is both subgroup-uniform and aligned to the given alignment
16:09jekstrand: And otherwise it's a total pass-through op.
16:09jekstrand: Then detect load_global(iadd(aligned_uniform(x, align), const_offset))
16:09karolherbst: I think we just have to make better use of the alignment info we have
16:09karolherbst: and set it better
16:09jekstrand: That would also be an option
16:10jekstrand: Check that align_mul >= 32 && align_offset % 32 == const_offset % 32
16:10karolherbst: so if you load a vec4 of an ubo, the first entry is 0x10 aligned, the second mul 0x10 + 0x4, etc...
16:10jekstrand: That probably works
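A sketch of the check being discussed (an assumption for illustration, not Mesa code): combining NIR-style align_mul/align_offset info with a constant offset to decide whether a wide block load applies, plus the per-component alignment karolherbst describes for a vec4 UBO load:

```python
def fits_wide_load(align_mul, align_offset, const_offset, width=32):
    """jekstrand's proposed condition: the base address must be aligned
    to at least `width` bytes, and the constant offset must land at the
    same position within a `width`-byte block as the known offset."""
    return align_mul >= width and align_offset % width == const_offset % width

def vec4_component_align(vec_align_mul, comp_index, comp_bytes=4):
    """Per-component (align_mul, align_offset) for a vec4 load: the
    first entry is 0x10-aligned, the second sits at mul 0x10 + 0x4, etc."""
    return (vec_align_mul, comp_index * comp_bytes)
```

This only covers the alignment half; the address still has to be shown subgroup-uniform separately, as mentioned above.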
16:10jenatali: One of bbrezillon's pending MRs to upstream has patches to propagate alignment from SPIR-V through explicit IO
16:10jekstrand: Combine it with uniformity analysis
16:10karolherbst: well.. aren't things in ubos normally vec4 aligned?
16:10karolherbst: well.. not for CL at least..
16:11karolherbst: but I think we could be more clever
16:11jekstrand: karolherbst: In order to get my fancy wide pulls, I need vec8 alignment and the address to be subgroup-uniform.
16:11jekstrand: That's tricky to get just from the alignment parameters
16:11karolherbst: but the first member of a UBO is aligned to the UBOs size
16:11jekstrand: Well, to some driver-specified UBO alignment
16:12karolherbst: I bet you could have offsets
19:28austriancoder: jekstrand: I just looked again at your xdc2019 presentation .. what happened to ibc?
19:30jekstrand: austriancoder: Kayden is working on it now.
19:31jekstrand: austriancoder: I had to step away from it back in October/November or so to help out with some of the performance analysis and image compression stuff on Gen12
19:31jekstrand: austriancoder: So it sat there for probably 4 months with no one doing anything
19:31jekstrand: But now Kayden's working on it and starting to make real progress again.
19:31jekstrand: He's just about got spilling working
19:31jekstrand: And I think he hooked up tessellation shaders.
19:35jekstrand: I think it's really good to have someone else work on it for a while; that way we have some chance of breaking any design ruts I may have gotten us into.
19:35jekstrand: I'm really looking forward to it, though. I keep seeing so many cases where real scalars are going to be useful.
19:38austriancoder: great to hear that there is still some movement
19:41jekstrand: Yeah, there was chatter a few weeks ago about trying to land it upstream. I really wanted to get spilling implemented before we did that and someone other than myself familiar with the code-base who could say, "Yes, I like this a lot. We should switch to it."
19:42jekstrand: I wouldn't be surprised if we don't drop it in behind an environment variable some time soonish.
19:42jekstrand: But no promises there.
21:35austriancoder: can I pass custom options to an nir_algebraic.AlgebraicPass() based custom py file?
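On austriancoder's question: nir_algebraic pass files are ordinary Python scripts, so custom options can be handled with argparse before the generated pass is emitted. The sketch below is a hypothetical pattern, not an existing Mesa file; only the `AlgebraicPass(pass_name, transforms)` call in the comment reflects the real nir_algebraic API:

```python
# Hypothetical custom_algebraic.py: add your own CLI options with
# argparse, then hand the transforms to nir_algebraic.AlgebraicPass.
import argparse

# (search, replace) tuples in nir_algebraic's expression syntax
optimizations = [
    (('fadd', 'a', 0.0), 'a'),
]

def main(argv=None):
    parser = argparse.ArgumentParser()
    # custom option controlling the generated pass name (illustrative)
    parser.add_argument('--pass-name', default='custom_opt')
    args = parser.parse_args(argv)
    # In the real file you would now generate the C source, e.g.:
    #   import nir_algebraic
    #   print(nir_algebraic.AlgebraicPass(args.pass_name,
    #                                     optimizations).render())
    return args.pass_name
```

The build system (meson custom_target in Mesa's case) can then pass `--pass-name whatever` on the script's command line.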