00:05 karolherbst: anyway, the runtime has implemented some interesting perf features we might want to pick up
00:05 karolherbst: eg zcull, but zcull is really trivial in the end
00:11 karolherbst: imirkin: mhh, I ran it through shader-db and it didn't really have any impact on the compile time
00:11 karolherbst: I will still check though what's the max depth it still causes changes, etc...
00:19 imirkin: hm - if they worked out zcull, we should take a look
00:19 karolherbst: it's really simple
00:19 karolherbst: we already talked about it
00:20 karolherbst: it's just a buffer tied to the render target if there is a depth buffer on it as well
00:20 karolherbst: and save/restore stuff when that render target gets switched
00:20 imirkin: yeah
00:20 karolherbst: there are two modes though: depth and stencil
00:20 imirkin: we all know that
00:20 imirkin: the question is how is it sized, etc
00:20 karolherbst: wait..
00:20 karolherbst: imirkin: https://github.com/devkitPro/deko3d/blob/master/source/maxwell/gpu_3d_zcull.cpp#L7
00:21 imirkin: cool
00:21 karolherbst: I guess there is a bit more to it
00:21 imirkin: should look at what the align stuff is, but seems easy
00:21 karolherbst: but it's not much
00:21 karolherbst: yeah
00:21 skeggsb: nvgpu is probably useful for some of those values it uses to calculate that stuff
00:21 imirkin: pretty sure that's not accurate for all gens though
00:22 imirkin: i remember seeing totally weird sizes
00:22 skeggsb: stuff like zcullInfo.pixel_squares_by_aliquots is calculated/exposed by the RM
00:22 imirkin: but that could have been on e.g. fermi
00:22 karolherbst: skeggsb: doesn't ctxsw have to have support for it as well?
00:22 skeggsb: sure does
00:22 karolherbst: I assume ours lacks this?
00:22 skeggsb: not from gm200 onwards :P
00:23 karolherbst: right ;)
00:23 karolherbst: I don't mind implementing it for gm200 first and then move down
00:23 karolherbst: probably easier as the firmware stuff is done
00:23 imirkin: one less moving part
00:23 karolherbst: there is more though.. wait
00:24 karolherbst: ohh right
00:24 karolherbst: tiled rendering
00:24 karolherbst: + there seems to be a primitive cache on it
00:24 karolherbst: so based on the saved primitives a tile can be replayed from the cache
00:25 karolherbst: they also talked about surface compression being disabled in mesa due to lack of decompression support for rgba32
00:27 fincs: Hi, I was pointed to here in relation to a recent nvidia-related project I'm responsible for
00:27 karolherbst: :p
00:27 linkmauve: Hey fincs, long time no see. :)
00:27 fincs: I'm willing to clarify anything; by no means did I intend to sideline nouveau or cause any harm
00:27 imirkin: karolherbst: yeah, dunno, haven't looked at any of the compression stuff
00:28 imirkin: fincs: guessing that project is devkitpro based on your bnc's domain?
00:28 fincs: Yeah, it's a project I started within devkitPro
00:28 linkmauve: fincs, I’ve been thinking about writing a GL extension to load your .dksh in Nouveau.
00:29 karolherbst: linkmauve: dksh are the offline compiled shaders, right?
00:29 fincs: Heh, not sure if that makes much sense for something that's heavily Maxwell 2nd gen / Tegra X1 specific
00:29 fincs: And yes DKSH is the container format for precompiled shaders
00:29 karolherbst: well
00:29 karolherbst: you need an extension to define new shader binary formats
00:29 linkmauve: karolherbst, yes.
00:29 karolherbst: but the API is already there
00:29 fincs: And there's the thing that deko3d ABI is really different from nouveau ABI
00:29 fincs: So you can't just expect that code to work
00:30 linkmauve: fincs, I don’t have any other Nvidia GPU than a Switch. :p
00:30 karolherbst: imirkin: btw, is there a reason the driver constbuf is not at index 0?
00:30 karolherbst: might make sense to move it there
00:30 karolherbst: so it's easier to support the additional ones from maxwell
00:31 karolherbst: probably not worth it though :/
00:31 fincs: (Fwiw deko3d constbuf0 is the driver constbuf, constbuf1 contains compiler constants, and constbuf2 and up are user ubos)
00:31 karolherbst: dunno if anything even uses 14 ubos
00:31 karolherbst: yeah.. the compiler constants is something I actually implemented in nouveau as well
00:31 karolherbst: for not loaded constants
00:31 karolherbst: it's just not in the tree
00:32 karolherbst: I put that in the driver constbuf though
00:32 karolherbst: there is enough space
00:32 karolherbst: https://github.com/karolherbst/mesa/commit/4deecb06aacde3bb1eedf4fbf3f8c9cd2a81782d
00:32 karolherbst: the results were impressive
00:33 fincs: Er, we aren't talking about the same kind of compiler constants
00:33 fincs: I mean the "default uniform block"
00:33 karolherbst: ahhh
00:33 karolherbst: aka ubo0 :p
00:33 fincs: Which in my case only contains constants mesa's glsl runtime decided to stash
00:33 fincs: Yes
00:33 fincs: Which for me is ubo1
00:33 karolherbst: well
00:33 fincs: Because I shifted everything
00:33 karolherbst: in gallium it's all ubos
00:33 karolherbst: so.. uff
00:33 karolherbst: or at least I think
00:34 karolherbst: dunno, don't care
00:34 karolherbst: at least in TGSI there is no difference
00:34 fincs: I don't think you need to care about that bit tbh
00:34 karolherbst: btw, my dual issue pass "avg single issue in shared programs : 0.900005 -> 0.865488 (-0.03)" mhhh oh well
00:34 karolherbst: a little bit
00:35 fincs: Anyway... there are some juicy bits in deko3d I was able to figure out thanks to RE
00:36 fincs: I'm happy to answer questions related to them
00:36 karolherbst: I can imagine that reing an API like nvn is less painful than to re nvidias driver through GL
00:36 karolherbst: :p
00:36 fincs: Exactly
00:36 fincs: Hence why I did it
00:37 fincs: Actually I'm not sure if this kind of RE is allowed to be used in nouveau...
00:37 karolherbst: I think right now for us mainly the zcull and compression bits are really interesting, meaning everything which reduces shader invocations (especially fps) or reduces memory load :p
00:37 imirkin: fincs: i think a rough list of useful bits you discovered that you figure the "nouveau project" doesn't know about would be interesting
00:37 karolherbst: fincs: you can write the doc and then we can use it :p
00:38 fincs: Yeah
00:38 fincs: Well
00:38 fincs: I tend to express myself better in code
00:38 karolherbst: which I think is still fine, but you wouldn't be allowed to share the code
00:38 karolherbst: if you think that your re effort isn't up to the standard
00:39 fincs: Hmm, what standard?
00:39 karolherbst: well, let me put it this way: somebody who looked at disassembled code isn't allowed to write code based on it and share it. I have no idea what you did, but that's kind of the rule
00:39 fincs: Well
00:39 karolherbst: that person has to write documentation and then others have to implement it
00:40 fincs: That's kind of exactly what we tend to do all the time in the homebrew scene
00:40 karolherbst: that's the only way from disassembling onwards
00:41 karolherbst: I don't know how that would look like from an open source perspective
00:41 karolherbst: but getting donations or any kind of money based on such work (skipping the documentation step) can be a huge risk
00:41 fincs: Like, people have literally reimplemented parts of the Switch's operating system
00:41 karolherbst: which is fine, as long as they have this documentation step in the middle ;)
00:41 fincs: Nintendo in particular seems to be mostly interested in stopping piracy/warez
00:42 karolherbst: yeah sure
00:42 karolherbst: just saying
00:42 imirkin: basically the standard is a documentation firewall between the person looking at the disassembled code and the person implementing it anew
00:42 fincs: They don't seem to mind us doing what we do
00:42 imirkin: so the person looking at code is only allowed to document, etc
00:43 karolherbst: fincs: well there are other companies which sue faster, so it always depends on the stuff
00:43 imirkin: the RE that we tend to do is a lot more about tracing, at which point i think it's a lot more OK
00:43 imirkin: although i've barely looked at anything in ages
00:43 karolherbst: well, nvidia doesn't mind us and we know that :p
00:43 karolherbst: so it's fine
00:43 karolherbst: but they also know we trace stuff
00:43 karolherbst: I think
00:43 karolherbst: ... dunno
00:44 fincs: I have not looked at any code outside the Switch fwiw
00:44 karolherbst: well, it's more of an issue you would have anyway
00:44 RSpliet: There's a name for this: https://en.wikipedia.org/wiki/Clean_room_design
00:44 fincs: Yes I know ^
00:45 karolherbst: the wine project is super strict
00:46 karolherbst: if there is a chance you ever saw MS code you are essentially out
00:46 fincs: Yikes
00:46 karolherbst: like internship at MS -> out for life :p
00:47 fincs: I guess I'm out then for writing nouveau code
00:47 karolherbst: btw, my dual issueing pass has an even smaller impact on kepler, but the base line is also way lower
00:47 karolherbst: avg single issue in shared programs : 0.715022 -> 0.692852 (-0.02)
00:47 karolherbst: 30% dual issueing in avg
00:47 RSpliet: Understandable, MS (and nintendo) are known to sue and sue hard
00:47 karolherbst: means 60% of instructions are issued in pairs ;)
00:48 fincs: Ninty got smart in their legal strategy though
00:48 fincs: They are focusing on piracy groups, and warez resellers, etc
00:48 fincs: Not so much homebrew
00:48 karolherbst: uff I still dump the wrong data
00:48 fincs: Hmm about the dual issue thing
00:49 fincs: What improvements did you measure for maxwell?
00:49 karolherbst: so less than 30% of instructions are issued alone
00:49 karolherbst: mhh pascal
00:49 fincs: I don't really know anything about Pascal and up
00:49 karolherbst: ahh yeah. I removed my file :D
00:49 karolherbst: but yeah
00:49 fincs: Is Pascal basically Maxwell gen 3?
00:49 karolherbst: it was 80% single issued down to 70% roughly
00:49 karolherbst: essentially
00:50 karolherbst: SM60 in cuda speak
00:50 fincs: Pfft
00:50 karolherbst: it's the same as SM52
00:50 karolherbst: mostly
00:50 karolherbst: for all we care at least
00:50 fincs: Any glaring incompatibilities with Maxwell?
00:50 karolherbst: unimportant stuff
00:50 karolherbst: I think one sys value is different
00:50 fincs: I saw remnants of Pascal compute code in nvn
00:50 fincs: Not sure why that's there
00:50 RSpliet: some speed improvements for lower precision? Or is that later...
00:51 karolherbst: dunno
00:51 karolherbst: but that hardly changes anything in the ISA...
00:51 karolherbst: maybe some latencies are different
00:51 karolherbst: who knows
00:51 RSpliet: Sched codes perhaps.
00:51 karolherbst: doubtful
00:51 karolherbst: we use the same
00:51 karolherbst: but yeah..
00:52 karolherbst: maybe we could use lower ones with pascal
00:52 karolherbst: dunno
00:52 karolherbst: don't really care even
00:52 karolherbst: wouldn't make much of a difference probably
00:52 karolherbst: ahh yeah
00:52 karolherbst: there is some f16 stuff added to SM60
00:53 karolherbst: uff iadd3
00:53 karolherbst: I should clean up my todo list
00:53 karolherbst: seriously
00:53 RSpliet: 16-bit (FP16) floating-point operations (colloquially "half precision") can be executed at twice the rate of 32-bit floating-point operations
00:53 RSpliet: Yeah that stuff
00:53 karolherbst: looks like SIMD within a reg
00:53 fincs: TX1 has FP16 stuff btw
00:54 karolherbst: fincs: in the ISA
00:54 fincs: And according to some people who have worked on Switch emulators, some stuff even Pascal lacks
00:54 RSpliet: Heh, yeah packed-SIMD, we meet again... after parting with that concept after GeForce 7 :')
00:54 fincs: Yes in the ISA
00:54 karolherbst: mhh weird
00:54 karolherbst: I have a jetson nano so I could probably check up on that
00:55 karolherbst: but that would be weird
00:55 imirkin: RSpliet: fermi+ has the "video" opcodes which are also pretty SIMD-ish
00:55 fincs: I wonder if the Switch's TX1 is a special TX1 with sekrit stuff
00:55 karolherbst: probably not
00:55 karolherbst: nvidia doesn't do this shit
00:56 karolherbst: mhh, but tx1 one is SM53
00:56 RSpliet: imirkin: oh I forgot about those. Are they even useful? Or is their accuracy too low for anything except very specific hard-coded shaders?
00:56 karolherbst: the nano as well btw
00:56 karolherbst: mhh
00:57 karolherbst: CUDA says that SM53 does have half precision stuff for real
00:57 karolherbst: so mhh, interesting
00:57 karolherbst: SM60 got real fp64 atomics :D
00:57 karolherbst: wow insane
00:57 fincs: I mean, I've seen stuff that shouldn't be there, such as some Volta regs inexplicably being used and working
00:58 karolherbst: what's odd about SM53 is that the registers per block is lower
00:59 karolherbst: and there is no texture memory cache
00:59 karolherbst: but maybe that's expected because of stolen RAM
00:59 fincs: There seems to be a texture cache though
00:59 fincs: And a descriptor cache
00:59 karolherbst: "Cache working set per multiprocessor for texture memory"
00:59 fincs: Maybe it's just virtual, who knows
01:00 karolherbst: yeah dunno
01:00 karolherbst: I wouldn't trust nvidia docs on that anyway
01:00 fincs: Docs? Which docs? :p
01:00 karolherbst: the cuda ones
01:00 fincs: Ah
01:00 fincs: Well yeah
01:01 fincs: Switch TX1 has 128 warps per SM
01:01 fincs: Not 64
01:01 karolherbst: sometimes I wonder if most of the differences are just some sw bs
01:01 karolherbst: the only real differences we were able to verify is ASTC support I think?
01:02 fincs: I have my own suspicions about stuff too
01:02 fincs: ASTC is indeed supported on Switch
01:02 fincs: Speaking of texture compression
01:02 karolherbst: yeah, not on desktop cards
01:02 fincs: Is ETC2 support supposed to exist?
01:02 fincs: nvn has no support for etc2 whatsoever
01:02 fincs: But I see nouveau code has no trouble with it
01:02 karolherbst: dunno
01:02 fincs: Haven't actually tested loading an etc2 texture yet
01:03 karolherbst: maybe it just works
01:03 fincs: Also fun fact: until very recently, the ASTC compressor was proprietary and you were not allowed to do anything with it
01:03 karolherbst: imirkin probably knows
01:03 karolherbst: ohh
01:04 karolherbst: ETC is also tegra only
01:04 karolherbst: fun
01:04 fincs: I should really test this purported ETC2 stuff lol
01:04 karolherbst: yeah
01:04 fincs: But I find it really suspicious how nvn doesn't expose it
01:04 karolherbst: in doubt: patents
01:04 fincs: Also true
01:05 fincs: But even so, if the support is there, then your hardware is infringing patents
01:05 fincs: I don't think nvidia would do that
01:05 karolherbst: nvidia pays for it ;)
01:06 fincs: Yes but maybe they aren't paying fees for tx1 or Nintendo's version of tx1
01:06 fincs: As it is "not supposed to be used"
01:06 karolherbst: ohh but ETC2 is royalty free
01:06 karolherbst: oh well
01:06 karolherbst: maybe nintendo thought ASTC is superior or so
01:06 karolherbst: dunno
01:06 fincs: And ETC2 is backwards compatible with ETC1
01:06 fincs: Maybe
01:07 fincs: So far the only compressed format I've verified working is BC1
01:07 karolherbst: anyway, it's late, I am going to sleep
01:07 fincs: Gn
01:07 fincs: Not sure when will I come back up next
01:07 fincs: I don't really use IRC much these days
01:08 fincs: I guess I can be reached by email too
01:08 fincs: (Or Twitter DM)
01:33 imirkin: ETC2 and ASTC are tegra-only. mesa has software decoders for them.
01:35 fincs: What happens if I'm actually running this on tegra? :p
01:35 imirkin: fincs: nouveau uses the hw decoders
01:35 imirkin: unless we screwed it up. but that's the intent at least.
01:36 fincs: Does nouveau know when to use hw stuff, or when to fall back on software decoding?
01:36 imirkin: fincs: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/nvc0_screen.c#n78
01:36 imirkin: looks like we only expose it on GK20A, not on GM10B
01:36 imirkin: easily adjusted.
01:36 fincs: GM20B in our case
01:36 imirkin: er, right
01:37 imirkin: if someone would like to test that the rest of the integration works, happy to take a patch
01:37 imirkin: i don't have a GM20B
01:38 imirkin: (and my GK20A is sick)
01:38 fincs: I have a Switch in front of my keyboard, does it count? :p
01:38 imirkin: if you're running nouveau on it, sure
01:38 fincs: We do use nouveau for getting GL stuff to run, yeah
01:38 imirkin: GK20B also uses the "new" texture descriptor format
01:38 imirkin: so it's questionable whether we fill it in correctly for the extended formats
01:39 imirkin: GM20B*
01:39 fincs: Ah yes the TIC
01:39 imirkin: i.e. just coz it works on the GK20A doesn't mean it'll work as-is on GM20B
01:39 fincs: I mean I saw the ids in nouveau/envytools for ETC2 and used them
01:40 imirkin: oh yeah, looks like there are no shenanigans for the GM107+ format
01:40 imirkin: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/nvc0_tex.c#n323
01:40 imirkin: look at that & 0x40 bit at the end :)
01:41 fincs: Ah
01:42 imirkin: on gm107+ it just fits nicely into TIC[0]
01:42 fincs: Perfect
01:42 fincs: So by just adjusting that check it should work then
01:42 imirkin: anyways, needs testing on GM20B - for the nouveau bit, just relax the check in nvc0_screen i pointed earlier
01:42 fincs: Noted
01:42 imirkin: and if ETC2/ASTC still work, then all good
01:43 imirkin: if not, then more debugging required
01:43 imirkin: also get rid of that GM107 note
01:43 imirkin: it's pretty clear that no desktop chips support it
01:43 imirkin: [if you're going to be sending a patch]
01:43 fincs: Heh, if no desktop chips support it I'm not sure what would be the appropriate check
01:44 imirkin: just keep adding != chip stuff
01:44 imirkin: it's a short list of non-desktop chips
01:44 fincs: GM20B is always tegra?
01:44 imirkin: yes
01:44 fincs: Ok
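The "keep adding != chip stuff" idea above can be sketched as a small allow-list of Tegra chips. This is only an illustration: the helper name is hypothetical, and the chipset IDs used here (0xea for GK20A, 0x12b for GM20B) are assumptions that should be checked against nouveau before relying on them. The real check in nvc0_screen.c may be written differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed chipset IDs for the Tegra parts named in the log. */
enum {
    CHIPSET_GK20A = 0xea,  /* Tegra K1 (assumption) */
    CHIPSET_GM20B = 0x12b, /* Tegra X1 (assumption) */
};

/*
 * Hypothetical helper: returns true when the hardware ETC2/ASTC texture
 * formats may be advertised. Every chip that is not on the short Tegra
 * list is treated as a desktop chip, where the formats stay hidden and
 * mesa's software decoders are used instead.
 */
bool chipset_has_hw_etc2_astc(uint32_t chipset)
{
    return chipset == CHIPSET_GK20A || chipset == CHIPSET_GM20B;
}
```

Writing the check as a positive list of Tegra chips is equivalent to chaining `!=` comparisons in the rejection path, and it keeps the list in one place if another non-desktop chip has to be added later.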
01:44 fincs: Still need to get around doing many things to our mesa/nouveau port tbh
01:45 fincs: It's been quite a long time since the last time we did anything on it
01:54 imirkin: fincs: btw, https://github.com/devkitPro/uam/commit/a42f4e2ce7f46394e57d30dc4b1b779f61ea759c
01:54 imirkin: it's a little subtle
01:54 imirkin: have a look at nouveau
01:54 imirkin: it depends on the prim mode
01:54 fincs: That was flipped I reckon
01:54 imirkin: oh, the flip, yeah
01:54 imirkin: i just twiddle it until tests pass
01:55 imirkin: coz there's also "y flip" considerations
01:55 imirkin: which screw with the clock-ness of the winding
01:55 fincs: True
01:55 fincs: I think nouveau actually does it properly, but it has some misleading comment somewhere
01:55 imirkin: yes
01:55 imirkin: i never went back to fix rnndb
01:55 fincs: So perhaps that should be updated :p
01:56 imirkin: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/nvc0_program.c#n309
01:57 imirkin: the tess bringup was so long ago...
01:57 fincs: I tested the tess stuff btw
01:58 imirkin: 9f19ccff9c7f - hm, 2016. was probably a "later" fix after the initial bringup
01:59 fincs: I think the only shader type I didn't test was geometry, but gs sucks anyway so who gives a shit :)
01:59 imirkin: not too much weird stuff with GS...
01:59 imirkin: tess was much harder to bring up
01:59 fincs: Speaking of GS
01:59 fincs: Do you know anything about passthrough gs?
02:00 imirkin: uhm
02:00 imirkin: sure?
02:00 imirkin: it's when you don't set a GS
02:00 fincs: No I mean
02:00 fincs: https://www.khronos.org/registry/OpenGL/extensions/NV/NV_geometry_shader_passthrough.txt
02:01 fincs: I more or less know how that works, but sadly it needs GLSL parser/compiler support
02:02 imirkin: i don't think anyone's working on it
02:02 imirkin: how does it work? just set a bit somewhere?
02:02 imirkin: or do you specify a routing map somewhere
02:03 imirkin: probably a per-input bit if it's passthrough, for a total of ... 64? bits
02:03 fincs: Yup
02:04 fincs: It's a table of bits specifying which inputs are passthrough (or maybe it's in reverse)
02:04 imirkin: looks like it's GM20x+
02:04 fincs: Yup
02:04 imirkin: if you can give me the details, i could perhaps add glsl parser support for it
02:04 fincs: Sure
02:04 imirkin: do you know of any software that uses it?
02:04 fincs: Software, no idea
02:05 fincs: But nvidia seems to promote it for multi viewport stuff
02:05 imirkin: (i have a GP108 atm, so i can test this stuff)
02:05 fincs: And cubemap stuff
02:05 fincs: I can write the details in a text file, but that's probably going to happen some other day as it is really late for me
02:05 imirkin: if you just tell me where the bits live
02:05 imirkin: i can write it up
02:06 fincs: Register 0x1240..0x125F
02:06 fincs: 256 bits in total
02:06 fincs: And also there's a bit in the shader header that's undocumented, that turns on passthrough gs
02:07 imirkin: there can be up to 128 gs outputs iirc, so 2 bits per output?
02:07 imirkin: or did it get increased and it's 1 bit per output?
02:07 fincs: Aren't there 256 slots in the A[] area or am I misremembering?
02:08 fincs: It's definitely one bit per "output"
02:08 imirkin: it's 0x200 ( / 4) = 128
02:08 fincs: Hmm
02:08 imirkin: oh
02:08 imirkin: wait, yeah, there are the special ones
02:08 fincs: Yes
02:08 imirkin: ok, i see what you mean
02:09 imirkin: now i get it
02:09 fincs: If bit is set it's a normal attribute, if bit is clear it's a passthrough attribute
02:09 imirkin: like gl_TexCoord and such (which can then be sprite-replaced)
02:09 imirkin: you can only have up to 128 enabled at a time
02:10 fincs: Let's see if I can remember where the special enable bit is
02:10 fincs: bit28 of word0 there we go
02:11 imirkin: nice
02:11 imirkin: thanks!
02:11 fincs: ( ͡° ͜ʖ ͡°)
02:11 fincs: Would be nice to get this working, heh
02:11 imirkin: hmmmmmmmmmm
02:11 imirkin: bit28 in a 0-indexed world?
02:11 fincs: 0 is lsb 31 is msb
02:11 imirkin: i.e. 0x10000000 ?
02:11 fincs: Yes
02:12 imirkin: that's the SO mask for the first stream
02:12 fincs: Uh, that's in the middle of a reserved field
02:12 imirkin: k, let me check
02:12 imirkin: i could be off
02:12 fincs: https://nvidia.github.io/open-gpu-doc/Shader-Program-Header/Shader-Program-Header.html
02:12 fincs: CommonWord0
02:12 fincs: Hmm
02:13 fincs: Wait a minute
02:13 imirkin: gp->hdr[0] |= 0x10000000;
02:14 fincs: There's something odd in my notes
02:14 imirkin: yeah, we do this to mask out non-0 streams for non-points
02:14 imirkin: the reserved bits are a bit lower
02:17 fincs: Ahh yeah
02:17 fincs: It's bit24
02:17 fincs: Not bit28 sorry
02:17 imirkin: ok cool, that makes more sense
02:17 fincs: So 0x01000000
02:18 imirkin: right
02:18 imirkin: i'll write up a change which specifies this
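The passthrough GS details given above (enable bit 24 of header word 0, and a 256-bit table at 0x1240..0x125F where a set bit means "normal attribute" and a clear bit means "passthrough") can be sketched roughly as follows. All names are hypothetical, and the mapping from an attribute to a slot index is an assumption. fincs also noted that the bit polarity might be reversed, so treat this as a starting point for testing rather than a description of the hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit 24 of shader header word 0, as corrected in the log (not bit 28). */
#define SPH_WORD0_GS_PASSTHROUGH  0x01000000u

/* Mask range quoted in the log: 0x1240..0x125F, 32 bytes, 256 bits. */
#define GS_PASSTHROUGH_MASK_FIRST 0x1240u
#define GS_PASSTHROUGH_MASK_LAST  0x125fu
#define GS_PASSTHROUGH_MASK_WORDS 8u
#define GS_PASSTHROUGH_SLOTS      (GS_PASSTHROUGH_MASK_WORDS * 32u)

/* Hypothetical container for the eight 32-bit mask words. */
struct gs_passthrough_mask {
    uint32_t word[GS_PASSTHROUGH_MASK_WORDS];
};

/* Start with every bit set, i.e. every slot is a normal attribute. */
void gs_passthrough_mask_init(struct gs_passthrough_mask *m)
{
    memset(m->word, 0xff, sizeof(m->word));
}

/* Clear one bit to mark that slot as passthrough (assumed polarity). */
void gs_passthrough_mask_set_passthrough(struct gs_passthrough_mask *m,
                                         unsigned slot)
{
    assert(slot < GS_PASSTHROUGH_SLOTS);
    m->word[slot / 32u] &= ~(UINT32_C(1) << (slot % 32u));
}

/* Turn on the passthrough enable bit in header word 0. */
uint32_t sph_word0_enable_gs_passthrough(uint32_t word0)
{
    return word0 | SPH_WORD0_GS_PASSTHROUGH;
}
```

Note that 0x10000000 (bit 28) is a different bit that nouveau already sets in `gp->hdr[0]` for stream masking, which is why the two must not be confused.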
02:18 fincs: Also, nvidia pushes passthrough gs alongside gl_ViewportMask[] (NV_viewport_array2), which should probably be a good addition too
02:19 imirkin: i think that one's semi-documented already
02:19 fincs: The two things are meant to go together
02:19 imirkin: at the time i didn't have a GM200+ GPU
02:19 fincs: No compiler support though
02:19 imirkin: nor could i find anyone to do enough of the legwork for it
02:19 imirkin: and then i forgot about it until just now :)
02:19 imirkin: in nouveau already: /* case TGSI_SEMANTIC_VIEWPORT_MASK: return 0x3a0; */
02:19 fincs: Yup
02:19 fincs: It's just not exposed in the glsl frontend
02:20 imirkin: ye
02:20 imirkin: should be easy
02:20 imirkin: i'll do that one first
02:20 fincs: Yeah do that one first
02:20 imirkin: that's just an extra thing, right?
02:20 imirkin: no funny parser business
02:20 fincs: Just a var
02:20 fincs: No funny business
02:20 imirkin: yeah that's like a 30-min change
02:20 imirkin: no tests though, so would need to write those too
02:21 fincs: https://www.khronos.org/registry/OpenGL/extensions/NV/NV_viewport_array2.txt
02:21 imirkin: yeah, i think there's a TODO to look at that
02:21 fincs: Anyway, it's getting really late for me lol
02:22 imirkin: https://trello.com/c/efjNP3bW/155-gm200-nvviewportarray2
02:22 fincs: Heh wow
02:22 fincs: 4 years ago damn :\
02:22 imirkin: like i said ... forgot about it :)
02:22 fincs: Understandable
02:22 imirkin: also haven't been doing too much nouveau stuff
02:22 fincs: Also understandable
02:23 imirkin: esp for gm200+, since no reclocking
02:23 imirkin: but if it's actually useful for someone, happy to put in some effort
02:23 fincs: Get a Switch running homebrew :p
02:23 imirkin: i barely have the time to run the arm boards i've gotten for free
02:23 fincs: nouveau is unleashed on the Switch
02:24 imirkin: glad someone's getting some use out of it
02:24 fincs: We are indeed getting quite a lot of mileage out of it lol
02:24 imirkin: the crowds tend to view it as an annoyance to uninstall ASAP
02:24 fincs: We have lots of lazy ports of random shit
02:24 fincs: Yeah and that's not your fault
02:25 fincs: It's the reclocking shit :\
02:25 fincs: (amongst other glaring issues)
02:25 imirkin: well, some of it is my fault. but yeah.
02:25 imirkin: s/my/our/
02:25 fincs: We got hardware accelerated sdl2 too, using gl
02:25 fincs: (specifically gles2)
02:25 imirkin: fincs: not necessarily today, but if you have notes/traces for the viewport_relative thing, let me know
02:26 fincs: Oh I missed that
02:26 imirkin: tracing myself requires reboot, which i tend to do quite rarely
02:26 imirkin: since i do actual work on this thing :)
02:26 fincs: I can torture^H^H^H^H^H^H get the blob to cough up code
02:27 imirkin: you don't have a mmt-style thing?
02:27 fincs: mmt?
02:27 imirkin: valgrind + special thing
02:27 imirkin: which just traces all userspace submissions to the kernel
02:27 fincs: Why do that when you can just hack up a Switch emulator and dump command lists
02:27 imirkin: hehe ok
02:28 fincs: viewport_relative is probably some shitty emulated thing though
02:28 fincs: Anyway
02:28 fincs: Good night :)
02:28 imirkin: ext requires it
02:29 imirkin: that means i have to implement.
02:29 imirkin: nite
02:57 imirkin: fincs: https://github.com/envytools/envytools/commit/e177a6b3bb1702ef52797967b24bb11481fcd91a
06:59 imirkin: fincs: https://github.com/imirkin/mesa/commits/viewport_swizzle -- totally untested.
10:12 karolherbst: imirkin: actually the jetson nano is also gm20b... I could also test the etc and astc stuff
10:39 flipmess: hi, i have lenovo w530 with GK107GLM [Quadro K2000M] and it never goes to sleep http://sprunge.us/594Eqj
10:40 flipmess: archlinux kernel: 5.5.13-arch2-1
10:41 karolherbst: flipmess: dmesg please
10:41 karolherbst: you might hit RSpliet bug
10:43 flipmess: http://sprunge.us/EGQfle
10:44 flipmess: karolherbst, i hope this is useful... otherwise i should probably reboot
10:50 karolherbst: journalctl --dmesg
10:52 flipmess: karolherbst, http://sprunge.us/K2eJNN
10:53 karolherbst: flipmess: mhh, don't see anything wrong. Could probably be some userspace misconfiguration
10:54 karolherbst: make sure that in tlp or laptop-mode-tools audio power management is enabled, etc...
10:54 karolherbst: check /sys/module/snd_hda_intel/parameters/power_save and /sys/module/snd_hda_intel/parameters/power_save_controller
10:55 karolherbst: should be 1 and Y
10:55 flipmess: it's 0 and N ^^;
10:56 karolherbst: yeah.. tlp or laptop_mode_tools if installed and running might turn that off on AC
10:56 karolherbst: and if you switch to battery the toggle those bits
10:56 karolherbst: *they
10:56 flipmess: its 1 and Y when on battery but nvidia card stays on...
10:57 karolherbst: it takes a while
10:57 karolherbst: like 5 seconds :p
10:57 karolherbst: but yeah it should go off then
10:58 karolherbst: there can still be something using it though.. didn't check that
10:58 flipmess: that was the problem... i've noticed that i only have like 2h of battery
10:58 karolherbst: mhhh
10:58 karolherbst: weird
10:59 flipmess: http://sprunge.us/5i3qY9
10:59 karolherbst: that's normal
10:59 karolherbst: flipmess: try this
11:00 karolherbst: cd /sys/module/snd_hda_intel/drivers/pci\:snd_hda_intel/
11:00 karolherbst: echo 0000:01:00.1 > unbind
11:00 karolherbst: and see if anything happens
11:01 flipmess: ack
11:03 karolherbst: flipmess: did anything changes now?
11:03 karolherbst: *change
11:04 karolherbst: flipmess: anyway, do you know if you have tlp or laptop-mode-tools running?
11:04 karolherbst: for tlp you'd need to change SOUND_POWER_SAVE_ON_AC to 1
11:04 flipmess: tlp
11:04 karolherbst: and SOUND_POWER_SAVE_CONTROLLER=Y (but that's usually set already)
11:05 karolherbst: if unbinding doesn't help, a reboot might get rid of weirdo state
11:05 karolherbst: I was able to run into weird issues last time I investigated and tried to reproduce this
11:06 flipmess: hmm... 0:IGD:+:Pwr:0000:00:02.0 1:DIS: :DynOff:0000:01:00.0 now but nvidia still shows temp which it didn't use to when off...
11:08 karolherbst: temp like a garbage value or some real value
11:08 flipmess: real... i think.. 55°C
11:08 karolherbst: ohh mhh
11:08 karolherbst: weird
11:08 karolherbst: I don't know how that stuff works pre _PR3 support
11:09 karolherbst: maybe that's expected..
11:09 karolherbst: dunno
11:09 flipmess: k
11:09 karolherbst: I added a patch to disable it
11:09 karolherbst: because on my system (_PR3) based it read -511
11:09 karolherbst: flipmess: you can check if the battery lifetime is better now
11:09 karolherbst: or check with powertools what the battery draw is, etc..
11:10 flipmess: hm... 30min better... maybe...
11:10 karolherbst: mhhh
11:10 flipmess: powertools... like powertop?
11:10 karolherbst: ahh yeah
11:10 karolherbst: powertop
11:10 karolherbst: powertop is a CPU heavy beast though, so I prefer to get the data myself if the battery reports it
11:10 karolherbst: /sys/class/power_supply/BAT0/
11:11 karolherbst: but the values are always others
11:11 karolherbst: current_now and voltage_now is what I have
11:11 karolherbst: and you can calculate the watts based on that
11:11 flipmess: oh.. good idea
11:11 karolherbst: flipmess: https://gist.github.com/karolherbst/1b865c0fe72b178dae4138f0d6f79191
11:12 karolherbst: but...
11:12 karolherbst: your battery might report the value directly
11:12 karolherbst: always depends
11:12 flipmess: it has power_now ^^; and shows 19703000
11:12 karolherbst: "power_now" would be the correct one I think
11:12 karolherbst: ahh yeah
11:12 flipmess: so 19,7W i guess
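The arithmetic behind that reading can be sketched in C. This assumes the usual power_supply sysfs units (microwatts for power_now, microamps for current_now, microvolts for voltage_now); the function names are made up for illustration, and reading the files themselves is left out.

```c
#include <stdint.h>

/* power_now is assumed to be in microwatts. */
double watts_from_power_now(uint64_t power_now_uw)
{
    return (double)power_now_uw / 1e6;
}

/*
 * Fallback for batteries that only report current_now and voltage_now,
 * assumed to be in microamps and microvolts respectively.
 */
double watts_from_current_voltage(uint64_t current_now_ua,
                                  uint64_t voltage_now_uv)
{
    double amps  = (double)current_now_ua / 1e6;
    double volts = (double)voltage_now_uv / 1e6;
    return amps * volts;
}
```

With the value from the log, `watts_from_power_now(19703000)` comes out at about 19.7 W, matching flipmess's estimate.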
11:12 karolherbst: mhh
11:12 karolherbst: quite high
11:12 flipmess: yeah
11:13 karolherbst: but it's usually high on those older ones :/
11:13 karolherbst: dunno
11:13 karolherbst: on my current laptop I can get down to 7W
11:13 flipmess: hm..
11:13 karolherbst: the GPU usually adds something like 10W
11:13 karolherbst: you could trigger the GPU a bit
11:14 karolherbst: "DRI_PRIME=1 glxinfo" or so
11:14 karolherbst: and then check the consumption
11:14 karolherbst: and how it changes over time
11:14 flipmess: ack
11:14 karolherbst: sometimes there are more bits to reducing power consumption...
11:14 karolherbst: like if I have a USB hub on my TB3 port, the lowest I can get is 14W
11:14 karolherbst: ...
11:14 karolherbst: it's just painful
11:15 karolherbst: but it might be that we screw up the ACPI calls in nouveau on some systems
11:16 karolherbst: everything before _PR3 is kind of painful
11:16 fincs: imirkin: :) However the length of "GP_PASSTHROUGH" is 32 bytes, not 8 -- viewport swizzle looks good to me so I think it should work
11:17 karolherbst: fincs: would be still cool to know who is using those extensions (or to have tests), that's usually the bits where extensions are.. well, turned down. But anyway those seem super small so nobody minds adding things like that
11:17 karolherbst: once I tried to add support for shader_include because a game was actually using it
11:17 karolherbst: that was painful
11:18 karolherbst: got turned down as we found a simpler workaround
11:18 fincs: Ouch
11:18 flipmess: karolherbst, i also tried out some kernel params like drm_kms_helper.poll=0
11:19 karolherbst: flipmess: those usually don't help much
11:19 karolherbst: fincs: yeah.. shader_include is a super useless extension anyway
11:19 karolherbst: applications needing that should just do it themselves, like everybody else
11:19 fincs: Heh actually that would be handy to have for me
11:19 karolherbst: no
11:19 karolherbst: there is no FS
11:19 fincs: FS?
11:20 karolherbst: filesystem
11:20 karolherbst: so, what you do is "this path is this shader" for every include
11:20 fincs: I mean the standalone shader compiler
11:20 karolherbst: and then let the driver do the merging
11:20 karolherbst: ahh
11:20 karolherbst: but this has nothing to do with a standalone compiler
11:20 karolherbst: even there it's useless
11:21 fincs: Would be nice to be able to #include shit
11:21 karolherbst: although I guess you could write a fs parser which just adds every file and be a bit smarter later on....
11:21 karolherbst: fincs: again, there is no fs support for it ;)
11:21 fincs: From the standalone tool there should be
11:21 fincs: Just like any other compiler
11:21 karolherbst: it's totally virtual
11:22 karolherbst: yes, I agree
11:22 karolherbst: but that's not what the extension is about
11:22 karolherbst: the thing is, games/applications don't have a path where the shaders are inside
11:23 karolherbst: so... in order to not depend on that, the application creates the virtual filesystem themselves
11:23 karolherbst: and just upload the files to the driver essentially
11:23 karolherbst: and then you have a key value storage with some path semantics
11:23 karolherbst: it has no benefit over simple sprintfs really
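The "key value storage with some path semantics" idea can be sketched roughly as below. This is a toy model, not the GL API: the real extension registers strings with glNamedStringARB, while `expand_includes` and the dictionary here are hypothetical stand-ins to show that the driver only ever sees strings the application uploaded, never a real filesystem.

```python
import re

# Matches lines such as: #include "/lib/common.glsl"
INCLUDE_RE = re.compile(r'^\s*#include\s+["<]([^">]+)[">]\s*$')


def expand_includes(source, named_strings, _stack=()):
    """Replace #include lines with entries from a path -> source mapping."""
    out = []
    for line in source.splitlines():
        match = INCLUDE_RE.match(line)
        if match is None:
            out.append(line)
            continue
        path = match.group(1)
        if path in _stack:
            raise ValueError(f"recursive include of {path}")
        if path not in named_strings:
            raise KeyError(f"no named string registered for {path}")
        # Included text may itself contain includes, so recurse.
        out.append(expand_includes(named_strings[path], named_strings,
                                   _stack + (path,)))
    return "\n".join(out)


named = {"/lib/common.glsl": "float sq(float x) { return x * x; }"}
shader = '#version 450\n#include "/lib/common.glsl"\nvoid main() {}'
print(expand_includes(shader, named))
```

A standalone compiler that reads real files would fill the mapping from disk first, which is the part the extension itself does not cover.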
11:24 fincs: But I have a standalone tool, taking in files directly from the filesystem
11:24 fincs: Not from a string
11:24 karolherbst: sure, but that's then not shader_include ;)
11:24 karolherbst: fincs: https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_shading_language_include.txt
11:25 fincs: Hmm so it's just a registry of "virtual" files
11:25 karolherbst: yes
11:25 fincs: Bah
11:25 karolherbst: exactly
11:25 fincs: I recall there was some _GOOGLE_ extension that was a more traditional include
11:25 karolherbst: it misses this "this is the root of my fs" call ;=
11:25 karolherbst: ;)
11:25 karolherbst: yeah, might be
11:26 karolherbst: but a game was actually using this on linux
11:26 karolherbst: but this game was broken anyway
11:26 karolherbst: checked for the existence of the func pointer
11:26 karolherbst: instead of checking against the ext
11:26 karolherbst: ...
11:26 karolherbst: stupid mistakes like this
11:26 karolherbst: so if you just NULLed out the returning pointers it didn't use the extension anymore
11:26 karolherbst: shit like this is just annoying
11:27 karolherbst: and they refused to fix it
11:27 fincs: I bet
11:27 karolherbst: flipmess: anyway, I would be happy to investigate it further, but it's really hard to tell what's actually wrong
11:28 karolherbst: if now the power consumption is lower than normal then I'd say it's working now, but... maybe it could be lower
11:28 karolherbst: always hard to tell
11:28 karolherbst: I have a kepler laptop as well and the lowest I got there was like 13W or so
11:30 karolherbst: I could actually check if a newer kernel there is fine or not.. I need to reinstall the entire system anyway
11:41 dirbaio: it shouldn't have
12:15 fincs: imirkin: This https://github.com/mesa3d/mesa/commit/869e32593a9096b845dd6106f8f86e1c41fac968 seems really fishy to me, is there a glsl shader I can look at that exhibits the problem this commit is supposed to fix?
12:16 karolherbst: fincs: it's not a shader problem
12:16 karolherbst: it's a dumb issue because the GL runtime allows hacky stuff
12:16 karolherbst: so.. you have a 3D image, _but_ you can bind layers of that to a 2d image
12:16 fincs: That's... not fun :p
12:17 karolherbst: exactly
12:17 fincs: Thing is, I want to see what nvidia does in this case
12:17 karolherbst: but because 3d images are tiled differently (even on the z axis) a 2d image operation returns garbage
12:17 karolherbst: fincs: I saw that nvidia disables z tiling on 3d images
12:17 fincs: On GL?
12:17 karolherbst: but this needs detection at runtime
12:17 karolherbst: yes
12:17 fincs: So they basically don't need to worry about it?
12:17 karolherbst: so you have to either retile or never tile on the z axis
12:18 karolherbst: fincs: no, they detect at runtime to enable or not enable it
12:18 fincs: Ah
12:18 karolherbst: really it's a silly issue and the spec shouldn't allow stupid things like that
12:18 fincs: Meanwhile newer nouveau generates code that will work with both, at the cost of doing more things at shader runtime, right?
12:18 karolherbst: well, it's just an if
12:18 karolherbst: but yeah
12:19 fincs: I'd like to be able to disable this :\
12:19 karolherbst: you can disable z tiling instead
12:19 karolherbst: I have some alternative patch somewhere
12:19 fincs: I can say that this is disallowed in my API
12:20 karolherbst: fincs: https://github.com/karolherbst/mesa/commit/8137bafc7a702f72196613750fe1d429a7a1dcd5
12:20 fincs: Currently this means I need to revert this commit before porting newer mesa changes to my setup
12:21 fincs: Because I'm not allowing this 3D texture as 2D layer trickery and such there's no need to do this workaround
12:24 karolherbst: well, if you ship a GL implementation you have to conform to the spec ;) this feature isn't used by anybody seriously though, so you can just ignore it
12:25 fincs: I mean, for the standalone shader compiler, not used with GL
12:25 karolherbst: ahh, mhhh
12:25 karolherbst: well, I don't like the fix either, but imirkin was under the impression it has no perf penalties (and shouldn't really have any)
12:25 karolherbst: but yeah...
12:25 fincs: Still, this is a workaround that only makes sense for GL
12:26 fincs: Not for other APIs... I wonder if Vulkan explicitly bans this access
12:28 karolherbst: dunno
12:29 fincs: Seems to me like a massive oversight tbh
12:29 karolherbst: I mean you don't have to pick this patch, it's up to you anyway
12:29 fincs: Yeah, but it makes other stuff difficult to sync up
12:29 fincs: And ah, it also conflicts with the non-bindless image support stuff I did
12:30 fincs: Which I'd like to see make its way back to mesa
12:30 karolherbst: about that, I didn't see any benefit of the changes you made there, so what's the thing it addresses and changes?
12:30 karolherbst: or improves
12:30 fincs: Currently all image accesses are treated as bindless
12:30 fincs: I made it use non-bindless opcodes for non-bindless accesses
12:31 karolherbst: sure, but what's the benefit here? I guess you just skip the handle load?
12:31 fincs: Yes
12:31 fincs: Less gpr usage
12:32 karolherbst: ahh, I see
12:32 fincs: There's also this bugfix: https://github.com/devkitPro/uam/commit/76166bdb69be6a63a014492509c66ab2ddb333e5
12:32 karolherbst: imirkin probably knows more about those things
12:33 fincs: Leaving this stuff here for now
12:33 karolherbst: but reducing gpr count is normally a good sign
12:35 fincs: Non-bindless image op support is in these two commits: https://github.com/devkitPro/uam/commit/7f1aaabf694bbd99235474b5b08a6b718a3f2bd8 + https://github.com/devkitPro/uam/commit/b8df713edfe46a84743ddbce6b8960d92530b606
12:37 karolherbst: that stuff has high potential to break things though.. but yeah... I can already imagine some issues
12:37 fincs: Which issues?
12:37 karolherbst: that 0 immediate load is already a bit fuzzy
12:37 karolherbst: but maybe nvidia does that as well.. not quite sure
12:37 fincs: Where is that?
12:37 karolherbst: suq->setSrc(0, suq->tex.bindless ? ind : bld.loadImm(NULL, 0)); // fincs-edit
12:38 karolherbst: and then, what about indirect image access?
12:38 karolherbst: non bindless ones
12:38 fincs: Second commit
12:38 fincs: I fixed stuff related to that
12:38 karolherbst: ahh
12:38 karolherbst: would be nice to have those commits merged together :p
12:38 karolherbst: easier to review
12:38 fincs: Yeah sorry
12:39 fincs: But this is stuff I've been working on and then retroactively fixing
12:39 karolherbst: there is git rebase -i for a reason :p
12:39 fincs: True
12:39 karolherbst: but there is a trick
12:39 karolherbst: ehhh no, there isn't
12:40 fincs: I mean
12:40 fincs: I eat interactive rebases for breakfast
12:40 fincs: But in this case I was working on multiple things at once, not individual things
12:41 karolherbst: I guess this is probably one of those "not worth the effort" or "more important things to fix/deal with" so I doubt anybody would mind changing this stuff
12:41 karolherbst: just needs testing and proper review
13:02 imirkin: fincs: length is in 32-bit registers. i think 8 is right.
13:33 imirkin: fincs: and yeah, the whole 2d/3d thing with images is infuriating
13:33 imirkin: fincs: note that it only happens with bindless images, so there's that
13:33 imirkin: (hm, or does it? no. bindless has nothing to do with it)
13:34 imirkin: what it amounts to though is just an extra instruction, and only one of the two will get executed
13:34 imirkin: given that we're talking about image loads, which aren't exactly speedy to begin with, i think this is fine
13:35 imirkin: nvidia's image handling is a bit different from ours ... their image descriptor is always in gmem (probably to facilitate some bindless-related items, which we also handle differently)
13:37 imirkin: i think on input, we define the nv50_ir tex indirects to mean IndirectR + r, i.e. r is a constant offset on top of the indirect r
13:37 imirkin: in practice, i don't think that's really used for much
13:37 karolherbst: imirkin: well, but I think the point was that our image ops are always bindless
13:38 karolherbst: and using the non bindless forms would actually reduce gpr usages
13:38 karolherbst: I doubt it matters for bigger shaders though
13:39 imirkin: oh hm
13:39 imirkin: pretty soon every line will have a "// fincs edit" on it
13:39 karolherbst: or maybe it's only the case for maxwell...
13:39 karolherbst: but yeah
13:41 karolherbst: imirkin: I am fine with using the bound variants, I was just wondering if you have any opinions on why we use the bindless ones besides "it wouldn't matter perf wise"
13:41 imirkin: no good reason that i know of
13:42 imirkin: plausible reasons include "i never noticed" and "i forgot there was a non-bindless mode"
13:42 karolherbst: "other more important things to look at" might be ones as well :p
13:42 imirkin: that would imply that i knew about it in the first place
13:43 imirkin: which, if i ever did, left my head promptly.
13:43 karolherbst: mhhh.. I know that I saw nvidia using bound image ops on SM60 at least
13:43 karolherbst: but yeah..
13:44 karolherbst: fincs: so if you've got some time, you can move those patches to a proper mesa tree and clean up the patches, and I can run some tests on my gp107
13:47 karolherbst: and even check with shader-db how big the difference is
13:47 imirkin: karolherbst: give astc/etc2 a shot on the gm20b. lots of deqp tests for those.
13:47 karolherbst: ahh, cool
13:47 imirkin: (the piglit ones are also there, but kinda crap)
13:48 karolherbst: but tegra is still broken on master or got that fixed?
13:48 imirkin: oh
13:48 imirkin: hadn't heard that
13:48 imirkin: what's wrong?
13:48 karolherbst: well, the shadow buffer issue
13:48 imirkin: oh, with stupid tiling?
13:48 karolherbst: something like that
13:48 imirkin: gr
13:48 karolherbst: I don't remember the details
13:48 karolherbst: just that using kmsro fixed it for me :p
13:49 imirkin: ok, so do that :) shouldn't affect deqp one way or the other
13:49 imirkin: er, astc/etc2
13:49 karolherbst: yeah
13:49 karolherbst: just annoying that upstream is broken.. really need to test it
13:49 karolherbst: maybe tagr or somebody else fixed it in the meantime
13:49 karolherbst: dunno
14:14 fincs: imirkin: Yeah but this 2d/3d thing is a workaround that only applies to GL (and as such I need to revert it), and it also conflicts with the non-bindless image op support I did in my source tree :\
14:14 fincs: Which kind of gets in the way of upstreaming
14:14 fincs: As I plan on having this commit reverted at all times
14:14 fincs: Unless a better solution exists for disabling this logic
14:16 imirkin: how does it interact with bindless?
14:17 fincs: I mean my commits implement *non* bindless image operations instead of emulating them as bindless operations
14:17 imirkin: right...
14:17 fincs: Which gets rid of the need to load texture handles and waste a gpr on that
14:17 imirkin: right...
14:17 fincs: When native image instruction forms exist
14:17 fincs: However
14:17 imirkin: and how does that interact with my workaround?
14:18 fincs: Same functions were changed by your workaround
14:18 imirkin: ok...
14:18 fincs: And the logic was quite heavily changed
14:18 fincs: At least from what I can tell
14:18 imirkin: and you don't feel like rebasing your work on top
14:18 imirkin: or is there a hard reason for it to not be possible?
14:19 fincs: I would rebase my work on top if there was a way to disable the new workaround cleanly without reverting it
14:19 imirkin: there's already an out for if (bindless) iirc
14:19 imirkin: (and it just plain fails for bindless images that are bound this way. wtvr.)
14:20 imirkin: or maybe not? i forget. anyways, you could just skip over the code which adds the extra fetch. it's only done for 2d, so should be easy to avoid.
14:20 fincs: I have to look into this more
14:20 fincs: But anyway
14:21 fincs: The other thing I did is low hanging fruit
14:21 fincs: https://github.com/devkitPro/uam/commit/76166bdb69be6a63a014492509c66ab2ddb333e5
14:22 imirkin: ok... not sure that's correct though
14:22 fincs: Yeah it is
14:22 karolherbst: fincs: welcome to upstreaming :p
14:23 fincs: It was trying to load an offset through the handle table, using an indirect handle as the index...
14:23 fincs: The correct thing is to actually use the handle itself
14:23 imirkin: nv50_ir has several stages
14:23 fincs: This was generating wrong code
14:23 imirkin: the values before lowering aren't always the same as after lowering
14:23 imirkin: before lowering, indirectR is supposed to be an index into the table
14:23 imirkin: except in the bindless case
14:24 fincs: This is a bindless surface query
14:24 imirkin: ah
14:24 karolherbst: there are probably exactly 0 tests for that
14:24 fincs: It was trying to load the handle table with the handle as the index, instead of just using the handle itself
14:25 karolherbst: we have a few bindless tests in piglit though I guess it would be easy to add one for bindless queries as well
14:25 karolherbst:really wished the CTS would have tests for bindless
14:26 imirkin: fincs: ok, that's a problem then :)
14:26 fincs: Yeah
14:26 imirkin: sorry, i'm not really looking at this stuff very carefully right now
14:26 imirkin: i'll look at it more tonight
14:26 fincs: Yeah I get suddenly being dragged into compiler internals is not pleasant
14:26 karolherbst: fincs: did you ever get piglit to run?
14:26 imirkin: i like compiler internals
14:26 imirkin: i just have a job :)
14:26 fincs: karolherbst: I think someone else did, using our mesa port
14:26 fincs: Not me
14:27 imirkin: which is not to work on nouveau, as it happens
14:27 karolherbst: it's usually way easier to provide tests which fail, also when upstreaming bits, etc...
14:28 imirkin: there are a handful of bindless tests, but they're not very good. still, much better than nothing.
14:28 karolherbst: fincs: but this patch would fail if you have a bound indirect image, no?
14:28 fincs: No
14:28 fincs: This else only applies to bindless
14:29 fincs: Err
14:30 karolherbst: yeah.. don't see it ;)
14:30 fincs: I recall indirect txq being lowered to bindless beforehand
14:30 karolherbst: mhhh
14:30 fincs: Or something like that
14:30 fincs: Let me check
14:32 imirkin: the logic for bindless is by far the least tested
14:32 imirkin: because there are no tests
14:32 karolherbst: ;)
14:32 imirkin: and the games that use it tend to crash for lots of other reasons
14:32 karolherbst: well.. there are _some_ tests :p
14:32 fincs: Yeah it looks like bindless is quite spotty
14:32 imirkin: the way images are handled is a bit bogus, but the lift to doing it the nvidia way is ... annoying.
14:33 karolherbst: yes
14:33 karolherbst: I am sure the emulator devs can tell their share on this topic as well :p
14:34 karolherbst:still has CL images on his todo list... *sigh*
14:34 karolherbst: those are super annoying, I can tell you that
14:37 fincs: Aaaand yeah, looks like I totally missed indirect
14:37 fincs: Let me fix the fix
14:39 karolherbst: imirkin: do we actually have an idea how to fix the "million resident images" problem?
14:39 karolherbst: uhm
14:40 karolherbst: bindless textures/images I mean
14:40 fincs: Ah wait
14:40 fincs: I'm misreading the sass
14:41 fincs: This is fine
14:41 imirkin: karolherbst: well, the TIC table is like 0x3fffff or something? 20 bits?
14:41 imirkin: er, more than that
14:41 imirkin: but whatever, it's only so many bit
14:41 imirkin: bits
14:41 fincs: With my fix
14:41 fincs: layout (binding = 0, rgba8) uniform image2D uImages[4];
14:41 karolherbst: I thought some games would run out of handles or something
14:41 imirkin: so ... there's a natural max number of textures that can be resident at a time
14:41 karolherbst: can't remember
14:41 fincs: s.dims = imageSize(uImages[s.index]);
14:42 fincs: Generated sass has
14:42 imirkin: karolherbst: well that's different ... we don't auto-scale the table
14:42 fincs: LDC R2, c[0x0][R2+0x120] ; <-- this is the driver constbuf
14:42 karolherbst: ahhh
14:42 karolherbst: I see
14:42 fincs: TXQ.B R2, R2, TEX_HEADER_DIMENSION, 0x0, 0x3 ;
14:42 imirkin: karolherbst: also we only allow like 512 images
14:42 karolherbst: I see
14:42 fincs: It does decay to bindless somewhere else as I correctly remembered
14:42 karolherbst: well the fix is obvious then, just need somebody caring enough
14:43 karolherbst: fincs: what's the initial R2? because this kind of looks fine
14:43 fincs: It is fine
14:43 imirkin: karolherbst: also once you create a handle, i think it's basically permanently resident
14:43 fincs: R2 is loaded from the ssbo and shifted left
14:43 karolherbst: okay
14:43 imirkin: (in our impl)
14:43 imirkin: because it's a literal TIC/TSC reference
14:43 imirkin: on nvidia, there's an extra level of indirection
14:43 fincs: imirkin: Yeah my fix respects indirect non-bindless access
14:44 imirkin: which allows them to hand back an address to the super-indirect buffer
14:44 fincs: The indirect case is lowered to bindless somewhere else
14:44 imirkin: which they can then manage behind the scenes as things get marked resident/not
14:44 karolherbst: mhhh
14:44 imirkin: that's the thing we don't do but probably ought to.
14:44 imirkin: otherwise we're limited in the number of handles we can create
14:45 fincs: I see where it is
14:45 fincs: It is lowered in GM107LoweringPass::handleSUQ
14:45 fincs: loadTexHandle is called from there
14:45 karolherbst: ahh, makes sense
14:46 fincs: So previously loadTexHandle was being called twice
14:46 fincs: It was loading a handle, then loading a handle using the handle as an index
14:46 fincs: Which is wrong
14:47 fincs: Even for indirect non-bindless
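A toy model of the double lookup fincs describes, under loud assumptions: the handle table in the driver constbuf is modelled as a plain dict, the handle values are invented, and the function names are hypothetical rather than the nv50_ir ones. It only shows why using an already-loaded handle as a table index gives the wrong result.

```python
def load_tex_handle(handle_table, index):
    """Fetch the handle stored in table slot `index`."""
    return handle_table[index]


def surface_query_fixed(handle_table, index):
    # Load the handle once and hand it straight to the query.
    return load_tex_handle(handle_table, index)


def surface_query_buggy(handle_table, index):
    # First load is fine ...
    handle = load_tex_handle(handle_table, index)
    # ... but the second one treats the handle as a slot index,
    # so it reads an unrelated (here: nonexistent) table entry.
    return handle_table.get(handle)


# Invented handle values for four image slots.
table = {0: 0x1400, 1: 0x1401, 2: 0x1402, 3: 0x1403}
print(hex(surface_query_fixed(table, 2)))
print(surface_query_buggy(table, 2))
```

On real hardware the second load would return whatever happens to sit at that constbuf offset instead of failing visibly.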
14:47 imirkin: right now, creating a handle reserves that id forever (until the handle is destroyed), while making resident simply marks the buffer to be included in the batch
14:59 fincs: Okay I read through the 2d/3d layer commit more closely and it looks like my unrelated non-bindless image stuff might not be affected after all
15:00 fincs: Let me see if I can actually clean this up and produce something that can be applied on mesa
15:00 fincs: Is it preferable to work against mesa master?
15:01 imirkin: if you want your changes applied, then yes
15:01 fincs: Ok
15:02 imirkin: (except highly exceptional circumstances, e.g. you're trying to do a non-trivial backport)
15:03 fincs: Cloning latest master right now
15:03 fincs: (also heh, if I get to upstream there will be no "fincs-edit" comments, don't worry :p)
15:05 imirkin: =]
15:06 karolherbst: the thing I am most concerned about with the dual issuing stuff is that it requires us to set the yield flag :/
15:10 karolherbst: although I can imagine what it is, but I think we had some issues with it?
15:11 karolherbst: like setting that bit might hurt perf or something?
15:16 fincs: https://gist.github.com/fincs/28916aa987ec06ccf98c41f10c71a1db <-- patch
15:22 karolherbst:still doing shader-db runs
16:57 karolherbst: yay, new numbers
16:57 karolherbst: https://gist.github.com/karolherbst/82fdf5cc0ab5f12a2bfd4a2fcbb1996d
16:57 karolherbst: fincs: ^^
16:59 karolherbst: 73% single issued kind of sounds like a number nvidia is also around
16:59 karolherbst: dunno if you saw more or less
17:03 karolherbst: pendingchaos: didn't you have a patch to report the cycle count of shaders?
17:04 pendingchaos: yes
17:04 pendingchaos: I think so
17:06 pendingchaos: found it: https://patchwork.freedesktop.org/patch/236941/
17:06 pendingchaos: it was archived for some reason
17:07 pendingchaos: I can't remember if it considered dual-issue
17:07 pendingchaos: and I don't think it handled nested loops
17:24 fincs: karolherbst: Nice numbers :) I see it even improved things a tiny bit for Kepler, and made huge gains on Maxwell/Pascal
17:24 fincs: You didn't change anything in dual_issue_v3 since yesterday, right?
17:27 karolherbst: I did, I just didn't push it
17:27 fincs: Oh
17:27 karolherbst: pendingchaos: yeah.. I guess those issues still have to be figured out, but at least I won't have to start from scratch :)
17:27 fincs: Looking forward to it :)
17:28 karolherbst: well, I changed nothing functional
17:28 fincs: So it's just refactoring?
17:28 karolherbst: more or less
19:31 kherbst: seems like dual issuing gives around 1.5% more perf
19:31 fincs: Yay :p
19:31 fincs: That's on your Pascal card?
19:31 kherbst: yes
19:31 kherbst: well, 1.5 in the best case
19:31 kherbst: some stuff is just unaffected
19:31 kherbst: oh well
19:32 kherbst: it's not like it's a big thing anyway
19:32 kherbst: some of the gpumark benchmarks give higher numbers
19:32 kherbst: volplosion eg
19:32 kherbst: or julia fp32
19:32 kherbst: pixmark_piano as well, but lower
19:32 kherbst: there is too much alu stuff going on, so only a low rate
19:33 fincs: An improvement is an improvement though
19:33 kherbst: yeah
19:33 fincs: Lots of small things can and in fact do add up
19:33 kherbst: ohh, sure
19:33 kherbst: and 1% for a micro optimization is quite something
19:34 kherbst: but I'd rather implement zcull and get like 10% :p
19:34 fincs: Heh
19:34 kherbst: mhh, let me see what's really benefiting from this
19:35 fincs: I saw my "tiled" renderer did benefit from this codewise, as it was able to mix IPA with ALU instructions
19:35 kherbst: uhh, wow: helped single_issue ../nouveau_shaderdb/civilization_v/merged/05cb172dd00b734dd3eff26f5bbde1623c977cba.shader_test - 1 1.0 -> 0.384615
19:35 fincs: Wow
19:35 fincs: That's huge
19:35 kherbst: yeah
19:36 fincs: But... by how much does that actually improve the perf? :p
19:36 kherbst: not much
19:36 kherbst: juliafp32 is around 0.7
19:36 kherbst: and got like a 1.5% boost
19:36 kherbst: so.. maybe 2.5% in the best case?
19:37 fincs: Earlier log you posted was for around 73% of single issue
19:37 fincs: That one is 38%
19:37 kherbst: yeah
19:37 imirkin: average over all shaders
19:37 kherbst: thing is dual issuing just doesn't help all that much
19:37 imirkin: vs single shader
19:37 kherbst: you save like 2 ticks per instruction
19:37 kherbst: which.. is not much
19:38 kherbst: especially if you have integer ops in there
19:38 kherbst: or memory stuff
19:38 fincs: I guess it helps mixed workloads
19:38 kherbst: never saw much of an impact in kepler either
19:38 kherbst: fincs: like here the comit notes: https://github.com/karolherbst/mesa/commit/2a40abd91f3f1ca3642a4523ef5862aad86c96a4
19:39 kherbst: it's a bit.. and on kepler the effect is higher anyway
19:39 fincs: Either way, I'm glad there's a working dual-issue enhancing scheduler pass now :)
19:40 kherbst: well.. dunno if it doesn't break stuff :p
19:40 kherbst: but.. that's easier to figure out on maxwell/pascal
19:40 HdkR: woo dual issue, time to get minor perf gains on the lowest end hardware :)
19:40 kherbst: :p
19:40 kherbst: I really want to add support for zcull though
19:40 fincs: Hey, I'm still running maxwell lol
19:40 kherbst: maybe I am doing this next
19:41 kherbst: dual issuing is more interesting on volta anyway
19:41 kherbst: if there is dual issuing at all, maybe it's not needed anymore
19:41 HdkR: It's more pipeline ordering than dual issuing
19:42 kherbst: yeah.. which is still the same you do on maxwell
19:42 kherbst: kepler was the only gen with real dual issuing
19:43 HdkR: Just ends up being your `canDualIssue` check is just checking if the previous instruction is opposite the current instructions type :P
19:43 kherbst: sure :p
19:43 kherbst: well, at least on maxwell
19:43 kherbst: kepler could dual issue... like nearly everything
19:44 kherbst: two movs? sure
19:44 kherbst: mov and something? sure
19:44 fincs: I've observed the first instruction in a dual issue cannot have variable latency
19:44 HdkR: er, I meant on Volta
19:44 kherbst: add and add? sure!
19:44 HdkR: Since it has the split int/float pipelines
19:44 kherbst: ahh yeah
19:44 kherbst: this will be interesting to figure out
19:45 HdkR: You can issue an instruction every cycle, the int and float pipeline can accept a new instruction every two cycles. Interleave them perfectly to get perfect scaling
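A very rough model of what HdkR describes, for illustration only: one instruction can issue per cycle, and each pipeline (int or float) accepts a new instruction every two cycles. Fixed latency, in-order issue and the two-cycle figure are taken from the sentence above; everything else (names, the string encoding of the instruction stream) is an assumption, and real hardware has far more constraints.

```python
def cycles_to_issue(stream, pipe_busy=2):
    """Count cycles needed to issue `stream` in order.

    `stream` is a sequence of pipeline tags, e.g. "I" (int) and "F" (float).
    After accepting an instruction a pipeline is busy for `pipe_busy` cycles.
    """
    next_free = {}  # pipeline tag -> first cycle it can accept again
    cycle = 0       # next cycle in which the issue slot is available
    for pipe in stream:
        issue_at = max(cycle, next_free.get(pipe, 0))
        next_free[pipe] = issue_at + pipe_busy
        cycle = issue_at + 1
    return cycle


print(cycles_to_issue("IFIFIFIF"))  # perfectly interleaved
print(cycles_to_issue("IIIIFFFF"))  # same mix, grouped by type
```

In this model the interleaved stream issues one instruction every cycle, while the grouped one stalls on the busy pipeline, which is the "pipeline ordering" point made above.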
19:45 kherbst: right... but details
19:45 HdkR: Although I pity a shader that has a perfect 50/50 int/float split
19:45 kherbst: like some integer ops have variable runtime and require a barrier on maxwell
19:45 kherbst: maybe that's different on volta
19:45 kherbst: but.. doubtful
19:46 kherbst: stupid stuff like imad
19:46 fincs: IMAD sucks lol
19:46 HdkR: ah, variable latency would heck it up anyway. is only true for basic alu ops
19:46 kherbst: like xor and fmad?
19:46 fincs: Also there was a small codegen issue in IMAD
19:46 kherbst: :p
19:46 kherbst: fincs: I am sure I fixed it years ago :p
19:46 HdkR: hey at least xmad is gone on Volta, so all you get is imad :P
19:46 fincs: https://github.com/devkitPro/uam/commit/07d283a6e1ccc197169e3918914253db9a6801e3
19:47 fincs: IMAD is not normally generated anyway, because nouveau optimizes it to xmad...
19:47 kherbst: fincs: I am actually wondering about this as it would have caused issues before
19:47 kherbst: fincs: we added xmad later
19:47 kherbst: fincs: I am 100% sure our emitter is correct
19:48 fincs: Well, for some reason I can't remember I ended up having IMAD instructions show up
19:48 fincs: And they were being generated incorrectly
19:48 fincs: I look at shader code using nvdisasm
19:48 kherbst: did you verify at runtime or disassembler?
19:48 fincs: And it was wrong
19:48 fincs: Disassembler
19:48 kherbst: ahh yeah.. nvdisasm is buggy
19:48 fincs: Is it?
19:48 kherbst: yes
19:48 kherbst: HdkR will tell you as well
19:48 fincs: I have a hard time believing it to produce bad output for IMAD instructions
19:49 kherbst: you never know
19:49 HdkR: It usually misses details rather than being outright wrong
19:49 kherbst: let's see what envydis is saying
19:49 kherbst: ehh, I have seen both
19:49 kherbst: madsp is also wrong
19:50 kherbst: but it's also broken in mesa
19:50 kherbst: just works by... well, chance
19:50 HdkR: oof, never tested that one with it
19:50 imirkin: yeah, we should stop producing IMAD ops ... ever
19:50 fincs: imirkin: I think it was my fault that it produced IMAD ops
19:50 HdkR: imad super bad
19:51 imirkin: fincs: well, it'll happily produce them on fermi/kepler
19:51 fincs: Something else that I did afterwards unfucked it
19:51 kherbst: imirkin: but the question is if the emitter is broken or not
19:51 imirkin: i remember the change where i "fixed" that
19:51 fincs: And it resumed producing xmad
19:51 imirkin: in hindsight - bad idea.
19:51 kherbst: and I am sure imad was correct
19:51 imirkin: i can't speak to the emitter.
19:51 kherbst: otherwise we would have seen issues
19:51 imirkin: but mul + add is faster than IMAD
19:51 kherbst: :D
19:51 fincs: If nvdisasm is to be believed, I'm 100% sure the emitter is wrong here
19:51 fincs: And this commit fixes it
19:51 kherbst: fincs: I can run it through some runtime tests :p
19:52 fincs: Also, did any of you look at the patch I posted earlier?
19:54 kherbst:is happy that we have some working CL stack so writing such small tests is super trivial
19:56 HdkR: fincs: Never ever emit an imad. If not using xmad then you're doing it wrong at that point :P
19:56 fincs: Yeah
19:57 fincs: I think I was fucking around with the integer division stuff and somehow I broke things
19:57 fincs: Either way I haven't seen an imad since then so that's good
20:03 kherbst: huh
20:03 kherbst: I don't get an imad generated with the modifier
20:06 imirkin: i've yet to see nvdisasm be flat-out wrong
20:06 imirkin: it does crash for some valid instructions sometimes
20:06 imirkin: (texture query or surface query-related)
20:07 kherbst: imirkin: uff: https://gist.github.com/karolherbst/8c6691efdfd1526444885192f5ad6883
20:08 kherbst: 23: add s32 %r52 %r49 %r51 (0)
20:08 kherbst: this can't be turned into a mad or can we?
20:08 kherbst: the neg is annoying there
20:08 imirkin: ideally one of the mul sources would be negated
20:09 imirkin: but i don't think we're that smart
20:09 imirkin: it _is_ odd that the add doesn't get a negated source either
20:09 kherbst: ehh, that comes out like this from llvm already
20:09 imirkin: my guess is that the neg happens after modifier folding
20:09 kherbst: ohh wait
20:09 imirkin: and so it's too late and we don't run to a fixed point
20:09 kherbst: we get imul + isub
20:10 kherbst: and nir turns that isub into ineg
20:10 kherbst: + iadd
20:10 imirkin: we should fold that
20:10 imirkin: if we don't, there's something very weird.
20:10 kherbst: yeah...
20:10 kherbst: well
20:10 kherbst: 19: add s32 $r4 $r0 neg $r1 (8)
20:10 kherbst: that's the result
20:10 kherbst: 18: mul u32 $r1 $r5 $r4 (8) and this is the mul
20:10 kherbst: but that could have been an imad
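A small check of the arithmetic behind that remark: with 32-bit wraparound, `x - a*b` (mul followed by add with a negated source) equals a multiply-add with one multiply source negated. The helper names are made up; this only models the integer math, not instruction encodings or modifier support.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit registers with wraparound


def neg32(a):
    return (-a) & MASK32


def mul32(a, b):
    return (a * b) & MASK32


def add32(a, b):
    return (a + b) & MASK32


def mad32(a, b, c):
    """a * b + c with 32-bit wraparound."""
    return (a * b + c) & MASK32


def mul_then_add_neg(x, a, b):
    # mul t, a, b ; add r, x, neg t
    return add32(x, neg32(mul32(a, b)))


def mad_with_neg_src(x, a, b):
    # mad r, neg a, b, x
    return mad32(neg32(a), b, x)


for x, a, b in [(100, 7, 3), (0, 0xFFFFFFFF, 2), (5, 0x80000000, 3)]:
    assert mul_then_add_neg(x, a, b) == mad_with_neg_src(x, a, b)
print(mul_then_add_neg(100, 7, 3))
```

Whether the fold actually fires depends on the pass ordering and refcount issues discussed around it.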
20:12 kherbst: fincs: also, why did you revert that COMBINED_TID stuff
20:12 kherbst: it usually generates better code
20:12 kherbst: rdsv is expensive as hell
20:12 kherbst: for.. odd reasons
20:12 fincs: I saw nvidia shaders would use the non-combined stuff
20:13 fincs: And I wanted to match it
20:13 kherbst: it's fine if you only use one of the values though
20:13 kherbst: and for me that gets optimized to TID:0
20:13 fincs: Maybe it should only be conditionally enabled according to used inputs
20:13 imirkin: that's hard.
20:13 fincs: Don't we have an input map available?
20:14 imirkin: (harder)
20:14 imirkin: it's not an input
20:14 imirkin: we do know which sysvals are read
20:14 fincs: Or well, we are supposed to know which sysvals are read
20:14 imirkin: but then they get eliminated in DCE
20:14 imirkin: so it's not an exact science :)
20:15 fincs: Could add a pass to detect it, whatever
20:15 kherbst: fincs: anyway, one rdsv + 3 extbfs are cheaper
20:15 kherbst: than 3 rdsv
20:15 kherbst: if you end up using two components, you are better off with the combined one
20:16 kherbst: only if you only need one dimension you can load it directly
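To illustrate "one rdsv + 3 extbfs": a single combined read is split into components with bitfield extracts. The bit layout used here (x in bits 0-15, y in bits 16-25, z in bits 26-31) is an assumption that is not stated in the chat, and `extbf`, `pack_tid` and `unpack_tid` are illustrative helpers, not codegen functions.

```python
# Assumed layout of the combined thread id: (bit offset, bit width).
TID_FIELDS = {"x": (0, 16), "y": (16, 10), "z": (26, 6)}


def extbf(value, offset, width):
    """Extract `width` bits of `value` starting at bit `offset`."""
    return (value >> offset) & ((1 << width) - 1)


def pack_tid(x, y, z):
    """Build a combined value, standing in for the single sysval read."""
    return (x & 0xFFFF) | ((y & 0x3FF) << 16) | ((z & 0x3F) << 26)


def unpack_tid(combined):
    """One 'read', three extracts."""
    return tuple(extbf(combined, *TID_FIELDS[axis]) for axis in "xyz")


combined = pack_tid(5, 3, 2)
print(unpack_tid(combined))
```

If only one component survives DCE, the extract can be turned back into a direct read of that component, which is the optimization discussed a bit further down.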
20:16 fincs: I'd agree to using combined_tid if you use more than one component
20:16 kherbst: yes, and that's what we are doing
20:16 fincs: Seems to be used regardless of the number of components read
20:17 kherbst: fincs: https://gist.github.com/karolherbst/56627d210c756890d19cba2a2f83849f
20:17 kherbst: fincs: because you never know
20:17 kherbst: also, dce
20:17 kherbst: eg spirv always reads all three components
20:17 kherbst: so... you really have to rely on dce here
20:18 kherbst: and smart opts
20:18 fincs: Seems like it should really be lowered after dce
20:18 kherbst: maybe a shader also reads all three, but the result doesn't depend on two of those because of opts
20:18 kherbst: it's not that easy
20:19 kherbst: ahh well
20:19 kherbst: we turn TID into combined and then back again later :D
20:19 kherbst: like this: https://gist.github.com/karolherbst/e6b4311c229ea97b66e7ae6ea8a2b829
20:20 fincs: nouveau does that optimization?
20:20 kherbst: sure
20:20 imirkin: fincs: right, so basically every opt wants to be after every other opt
20:20 imirkin: unfortunately someone's gotta go first
20:20 fincs: Turning combined_tid back into tid if it only ever uses one?
20:20 fincs: Where's the code for that?
20:20 kherbst: yes
20:20 kherbst: dunno, but that's what's happening
20:20 imirkin: we don't run to a fixed point so that compiles are faster
20:21 imirkin: but that does lead to occasionally not-fully-100% optimal code
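The pass-ordering problem can be illustrated with a toy model. This is only a sketch: the two "passes" and the `toy_ir` struct are invented for the example and have nothing to do with the real codegen data structures. It shows why a single run can leave a neg unfolded when the pass that creates it runs after the pass that would fold it, and how iterating until nothing changes fixes that at the cost of extra rounds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented stand-in for a shader: just three flags. */
struct toy_ir {
    bool has_sub;     /* an isub is present */
    bool has_neg;     /* a standalone neg is present */
    bool neg_folded;  /* the neg was folded into a source modifier */
};

typedef bool (*pass_fn)(struct toy_ir *);

/* Folds a standalone neg into its user. Returns true on progress. */
static bool fold_modifiers(struct toy_ir *ir)
{
    if (!ir->has_neg)
        return false;
    ir->has_neg = false;
    ir->neg_folded = true;
    return true;
}

/* Rewrites sub as neg + add. Returns true on progress. */
static bool lower_sub(struct toy_ir *ir)
{
    if (!ir->has_sub)
        return false;
    ir->has_sub = false;
    ir->has_neg = true;
    return true;
}

/* Run every pass exactly once, in order. */
static void run_once(struct toy_ir *ir, const pass_fn *passes, size_t n)
{
    for (size_t i = 0; i < n; i++)
        passes[i](ir);
}

/* Repeat the whole pass list until no pass reports progress.
 * Returns the number of rounds executed. */
static unsigned run_to_fixed_point(struct toy_ir *ir, const pass_fn *passes,
                                   size_t n, unsigned max_rounds)
{
    assert(max_rounds > 0);
    unsigned rounds = 0;
    bool progress = true;
    while (progress && rounds < max_rounds) {
        progress = false;
        for (size_t i = 0; i < n; i++)
            if (passes[i](ir))
                progress = true;
        rounds++;
    }
    return rounds;
}
```

With the order `{ fold_modifiers, lower_sub }`, `run_once` leaves the neg standing, while `run_to_fixed_point` folds it on the second round and needs a third round to confirm nothing else changes.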
20:21 kherbst: well, but we might want to change that soonish, especially with a shader cache then
20:21 kherbst: but yeah
20:21 imirkin: for a very long time there was no cache at all, so compile times were very important
20:21 kherbst: would require codegen to get some fixes
20:21 fincs: In my case I don't really mind paying the price of slightly longer compilation times
20:21 kherbst: anyway, codegen does what you would expect
20:22 fincs: Found it
20:22 fincs: AlgebraicOpt::handleEXTBF_RDSV
20:22 fincs: In nv50_ir_peephole.cpp
20:22 imirkin: i really need to review those shader cache patches
20:22 fincs: Awesome
20:22 fincs: Time for a revert
20:23 kherbst: imirkin: ahh.. refcount prevents that op
20:24 kherbst: or well
20:24 kherbst: the neg as well
20:24 kherbst: annoying
20:24 kherbst: don't want to provide my own spirv.. uff
20:25 kherbst: abusing piglit it is then
20:28 fincs: Reverted: https://github.com/devkitPro/uam/commit/1dade3eb723a01d4386cf252a72612d113e1dc5c
20:28 fincs: :)
20:29 imirkin: if you're curious why a piece of code is there, just do an annotate
20:29 imirkin: usually that'll point you to the commit that added it
20:30 imirkin: sometimes it dates back to the dawn of time
20:30 fincs: There's git blame too
20:30 imirkin: but most of the questionable stuff is stuff i added more recently :)
20:30 imirkin: blame == annotate... no?
20:30 fincs: Ah
20:30 imirkin: just a more confrontational version of it :)
20:31 fincs: Heh, I know it by the name "blame" :p
20:31 imirkin: yeah. just like i know the "m" shortcut in gmail as "murder", but apparently it's "mute" now
20:34 fincs: Alright, time for some food, I'll be back later
20:35 kherbst: okay, so with piglit I get at least an imad: mad u32 $r0 $r0 c0[0x10] $r1
20:36 kherbst: but I don't manage to get the neg folded in
20:36 kherbst: but that's probably because all of those bits are uniforms.. mhh
20:36 kherbst: let me cheat
20:38 kherbst: hum
20:38 kherbst: 18: neg s32 %r62 %r61 (0)
20:38 kherbst: 19: mad u32 %r63 %r55 %r58 %r62 (0)
20:39 kherbst: imirkin: that neg should have been folded in, no?
20:39 imirkin: no clue
20:40 imirkin: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/codegen/nv50_ir_target_nvc0.cpp#n473
20:40 imirkin: return false.
20:41 kherbst: ehhh
20:41 imirkin: if it's a thing, that means the emitter may not be ready
20:42 kherbst: but emitImad has it
20:42 imirkin: ok
20:42 kherbst: I mean handling for the negs
20:42 imirkin: so then allow it in isModSupported
20:42 imirkin: emitIMAD is written based on envydis
20:42 imirkin: without much thought to the earlier stuff
20:42 kherbst: no, we already handle it
20:42 kherbst: it's in the table
20:42 kherbst: neg support on all 3 sources
20:43 imirkin: assume i'm right for a second
20:43 imirkin: and look at the thing i posted
20:43 imirkin: more closelier.
20:43 kherbst: uhhh
20:43 kherbst: I see
20:43 imirkin: ;)
20:43 kherbst: nasty
20:44 imirkin: yeah
20:44 imirkin: you think that's nasty, go look at the nv50 rules and try to make sense of them
20:50 kherbst: ahhhh
20:50 kherbst: all the code prevents this
20:51 kherbst: 10: mad u32 $r0 $r0 neg $r1 $r2 \o/
20:51 HdkR: \o/
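To make the folding concrete, here is a small sketch of why `mul` + `neg` + `add` can collapse into a single multiply-add with a negate modifier, and why it matters which source the modifier lands on. `toy_mad` is a made-up model written for this example, not the emitter code; it assumes that negating src0 or src1 negates the product and that negating src2 negates the addend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy multiply-add with per-source negate flags.
 * Unsigned arithmetic wraps modulo 2^32. */
static uint32_t toy_mad(uint32_t s0, bool neg0,
                        uint32_t s1, bool neg1,
                        uint32_t s2, bool neg2)
{
    uint32_t a = neg0 ? 0u - s0 : s0;
    uint32_t b = neg1 ? 0u - s1 : s1;
    uint32_t c = neg2 ? 0u - s2 : s2;
    return a * b + c;
}

/* Unfolded sequence: mul, then neg, then add. Computes x - y*z. */
static uint32_t mul_neg_add(uint32_t x, uint32_t y, uint32_t z)
{
    uint32_t prod = y * z;
    uint32_t negp = 0u - prod;
    return x + negp;
}

/* The folded form puts the neg on one multiplicand. */
static void check_fold(uint32_t x, uint32_t y, uint32_t z)
{
    assert(mul_neg_add(x, y, z) == toy_mad(y, false, z, true, x, false));
}
```

Moving the flag from a multiplicand to the addend gives `y*z - x` instead of `x - y*z`, which is the kind of mix-up mentioned a few lines further down.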
20:52 kherbst: annoying
20:53 kherbst: and according to fincs we swapped the neg of src0 and 1 with that of src2, okay
20:53 kherbst: crap, fincs is right :D
20:54 kherbst: heh.. the xmad code just trips over the mod stuff now
20:55 lovesegfault: I think I found a bug on HDMI unplug; if I re-plug my monitor does not work
20:55 lovesegfault: Cases:
20:55 lovesegfault: A. Boot with monitor plugged in: works
20:56 lovesegfault: B. Boot with monitor plugged in, then hotplug off/on: works at first, doesn't work on re-plug
20:56 lovesegfault: C. Boot with nothing plugged in; then plug in: doesn't work
20:57 lovesegfault: cc. imirkin karolherbst
20:57 imirkin: lovesegfault: laptop?
20:57 lovesegfault: imirkin: yep, there's a relevant dmesg message when I unplug: nouveau 0000:01:00.0: DRM: DDC responded, but no EDID for HDMI-A-1
20:58 imirkin: in the bad case, presumably?
20:58 imirkin: i wonder if there's some script we're supposed to execute on hpd that we don't
20:58 lovesegfault: Yeah, here's the log for: plug in after boot and then plug/unplug a couple times: https://gist.github.com/df0ba82abb67d58191b5228a2c88ab5e
20:59 imirkin: lovesegfault: can you dump your vbios into https://people.freedesktop.org/~imirkin/nvbios/ and make a gist out of it?
20:59 imirkin: (i really need to add a "make gist" button or something...)
21:00 lovesegfault: where do I find my vbios?
21:00 imirkin: /sys/kernel/debug/dri/1/vbios.rom
21:02 lovesegfault: imirkin: https://gist.github.com/874adc10b518ce80dd7c4a3649164e15
21:04 imirkin: dunno what i was looking for tbh
21:04 imirkin: HPD stuff all seems there and reasonable
21:04 imirkin: GPIO 27: line 27 tag 0x51 [HPD_2] IN NEG DEF 0 gpio: normal SPEC_IN 0x00 [AUXCH_HPD_0]
21:04 lovesegfault: What does HPD stand for?
21:04 imirkin: CONN 1: type 0x61 [HDMI] tag 1 HPD_2
21:04 imirkin: hot plug detect
21:05 lovesegfault: Ah :)
21:05 lovesegfault: So, I think it detects fine because sway even shows the output and correctly id's the monitor
21:05 lovesegfault: but
21:05 imirkin: hpd is what happens when you plug in a cable
21:05 imirkin: (or remove one)
21:08 lovesegfault: imirkin: sway reports this: Apr 03 14:46:07 foucault sway[2654]: 2020-04-03 14:46:07 - [backend/drm/legacy.c:21] HDMI-A-1: Failed to page flip: Device or resource busy
21:08 imirkin: that's not great.
21:09 lovesegfault: c.f. https://github.com/swaywm/sway/issues/5176
21:09 imirkin: fun, there's the additional little tidbit that you're doing 4k@60
21:10 imirkin: is this a tv or a monitor?
21:10 lovesegfault: It's a monitor
21:10 imirkin: is the monitor powered on at the time of plugging in the cable?
21:11 lovesegfault: I... think so? there's a blinking LED
21:11 lovesegfault: but the screen is off
21:11 imirkin: can you try setting it to 1920x1080 instead of 4K?
21:11 imirkin: basically i'm not 100% confident that HDMI 2.0 stuff works perfectly
21:12 lovesegfault: yep, do you want me to do that before or after plugging it in?
21:12 lovesegfault: (or both)
21:12 imirkin: whenever
21:12 lovesegfault: alright, trying
21:12 imirkin: it might fix everything
21:12 imirkin: or it might have no effect
21:12 imirkin: if it fixes everything, i have a bug in HDMI 2.0 processing
21:12 imirkin: if it fixes nothing, i'm not (yet) to blame
21:14 lovesegfault: imirkin: it works!
21:14 imirkin: d'oh
21:14 imirkin: j'accuse... moi
21:14 imirkin: =/
21:14 imirkin: 4k@30 should work too then
21:14 lovesegfault:tries
21:15 lovesegfault: Yep :)
21:15 lovesegfault: that also works
21:16 imirkin: can you confirm that you don't see messages about scdc write failures in your dmesg?
21:17 lovesegfault: imirkin: here's all I see from the moment I first plugged it in and through all these tests: https://gist.github.com/f8f7c7afa9ce2c37a2daffc195349a48
21:17 imirkin: [12668.985124] i2c i2c-13: sendbytes: NAK bailout.
21:17 imirkin: interesting.
21:17 imirkin: can you try 4k@60 again?
21:18 lovesegfault: yep
21:18 lovesegfault: imirkin: doesn't work, nothing in dmesg, sway reports the same old thing: 2020-04-06 14:18:13 - [backend/drm/legacy.c:21] HDMI-A-1: Failed to page flip: Device or resource busy
21:18 imirkin: 'kay
21:19 lovesegfault: (50hz also doesn't work, fwiw)
21:20 imirkin: yeah
21:20 imirkin: have to get down to 30 to get into the HDMI 1 range.
21:20 lovesegfault: Figured
21:20 imirkin: you could do 4k@60 with yuv420, but we don't support that.
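The numbers behind "have to get down to 30" can be worked out from the mode timings. The sketch below assumes the standard CTA-861 totals for 3840x2160 (4400x2250 at 60 Hz and 30 Hz, 5280x2250 at 50 Hz) and the 340 MHz TMDS limit of HDMI 1.4; the monitor in question may advertise different timings, so treat the figures as illustrative. Function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* HDMI 1.4 tops out at a 340 MHz TMDS character rate. */
#define HDMI1_MAX_TMDS_KHZ 340000u

/* Pixel clock in kHz from total (active + blanking) dimensions. */
static uint32_t pixel_clock_khz(uint32_t htotal, uint32_t vtotal,
                                uint32_t refresh_hz)
{
    assert(htotal && vtotal && refresh_hz);
    return (uint32_t)(((uint64_t)htotal * vtotal * refresh_hz) / 1000u);
}

/* For 8 bpc RGB the TMDS rate equals the pixel clock;
 * 8 bpc YUV 4:2:0 halves it. */
static uint32_t tmds_khz(uint32_t pixel_khz, bool yuv420)
{
    return yuv420 ? pixel_khz / 2u : pixel_khz;
}

/* Above the HDMI 1.4 limit the link needs the HDMI 2.0 mode. */
static bool needs_hdmi2_signalling(uint32_t tmds_rate_khz)
{
    return tmds_rate_khz > HDMI1_MAX_TMDS_KHZ;
}
```

Under those assumptions 4k@60 and 4k@50 both come out at 594 MHz, well above the limit, while 4k@30 and 4k@60 in YUV 4:2:0 land at 297 MHz and stay within HDMI 1.x.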
21:20 imirkin: ok, and just to super-duper confirm ... this HDMI-A-1 is hanging off the nvidia gpu, not intel, yes?
21:21 lovesegfault: yep, that is correct
21:21 imirkin: GP107... i have a GP108. i should test on a recent kernel, perhaps it got broken
21:21 imirkin: unfortunately the TV i tested it with is presently located further away than my longest cable
21:21 lovesegfault: which is also the main reason this 30Hz solution doesn't upset me: the time to copy to and from the igpu for compositing already had that screen working at ~30Hz
21:22 imirkin: lol
21:22 lovesegfault: I'm on kernel 5.6.2 with karolherbst's patches btw
21:23 imirkin: i'm on 5.0 :)
21:23 fincs: kherbst: Hehe, told ya :p
21:24 lovesegfault: imirkin: are you okay with me closing the sway bug as a nouveau issue?
21:25 lovesegfault: also: if you need any help testing patches feel free to ping me :)
21:25 imirkin: lovesegfault: well, it's a lot of tedious work, to figure out wtf is going on
21:25 imirkin: lovesegfault: are you a developer?
21:26 lovesegfault: Yeah, I work on HPC stuff in Rust
21:26 imirkin: ok, but you can add print's to the kernel if push comes to shove...
21:26 lovesegfault: Yep
21:26 imirkin: ok
21:26 imirkin: so the first thing to do is to just increase the debug verbosity
21:27 lovesegfault: I assume the kernel has some special print, kprint?
21:27 lovesegfault: printk?
21:27 imirkin: ya
21:27 imirkin: it'll be obvious from surrounding code
21:27 imirkin: do whatever that does :)
21:27 lovesegfault: understood
21:27 lovesegfault: where should I sprinkle these?
21:27 imirkin: gimme a few
21:27 lovesegfault: 👍
21:28 imirkin: so with HDMI 2.0, there are to parts to it
21:29 imirkin: you have to tell the monitor that you're going to be sending stuff in a new-and-improved way
21:29 imirkin: and then you have to tell the gpu that it should do that
21:29 imirkin: two parts*
21:29 imirkin: this is called "SCDC"
21:29 imirkin: you have to tell it to enable TMDS clock division as well as "scrambling"
21:30 imirkin: https://github.com/skeggsb/nouveau/blob/master/drm/nouveau/dispnv50/disp.c#L722
21:30 imirkin: this is where we do both -- first we tell the GPU that it should do this (that's the nvif_mthd thing)
21:30 imirkin: and then we tell the monitor with the attempted SCDC writes
21:30 lovesegfault: "Status and Control Data Channel"
21:30 imirkin: the SCDC writes can fail, which we totally don't handle
21:30 imirkin: however that doesn't appear to be what's happening
21:30 lovesegfault: TMDS = Transition-minimized differential signaling
21:30 imirkin: (since i scream bloody murder when that happens)
21:30 lovesegfault: (these are just noted for me)
21:31 imirkin: TMDS = digital pixels, basically.
21:31 imirkin: ok, so that's where we initiate the writes. the SCDC stuff is an i2c bus, like edid (but a different device id)
21:32 lovesegfault: Oh, this stuff is all i2c?
21:32 imirkin: this is the thing that the nvif_mthd calls: https://github.com/skeggsb/nouveau/blob/master/drm/nouveau/nvkm/engine/disp/hdmigm200.c
21:34 imirkin: as well as here: https://github.com/skeggsb/nouveau/blob/master/drm/nouveau/nvkm/engine/disp/sorgf119.c#L119
21:34 imirkin: it could be that these things are getting ordered a bit wrong
21:35 imirkin: and high_speed isn't set "yet" by the time the gf119_sor_clock function is run, for example
21:36 imirkin: lovesegfault: and i'm told that we do it backwards
21:36 imirkin: lovesegfault: we should be doing the scdc write *first*
21:37 imirkin: and only *then* turning up the link
21:37 imirkin: so also try reversing the order of those calls in disp.c
21:37 lovesegfault: So this[1] should come after this[2]?
21:37 lovesegfault: [1]: https://github.com/skeggsb/nouveau/blob/master/drm/nouveau/dispnv50/disp.c#L722
21:37 lovesegfault: [2]: https://github.com/skeggsb/nouveau/blob/master/drm/nouveau/dispnv50/disp.c#L736
21:37 imirkin: no
21:38 lovesegfault: wait
21:38 imirkin: sec
21:38 lovesegfault: let me read what you said again
21:38 imirkin: https://github.com/skeggsb/nouveau/blob/master/drm/nouveau/dispnv50/disp.c#L746-L757
21:38 imirkin: this should come BEFORE the nvif_mthd thing
21:38 imirkin: and obviously you can't bail if scrambling isn't supported
21:38 imirkin: since the nvif_mthd is needed no matter what
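The ordering being proposed can be sketched as follows. Every function here is a stand-in invented for the example (the real code lives in disp.c and uses different names and arguments); the stubs only record the order in which they were called. The sketch shows the sink being told first, the GPU side second, and the GPU call happening even when the sink has no scrambling support.

```c
#include <assert.h>
#include <stdbool.h>

enum step { STEP_SCDC_WRITE, STEP_GPU_ENABLE };

/* Records call order so the sequence can be inspected. */
struct trace {
    enum step steps[4];
    int count;
};

static void record(struct trace *t, enum step s)
{
    assert(t->count < 4);
    t->steps[t->count++] = s;
}

/* Hypothetical stand-in for the SCDC writes to the monitor. */
static void sink_scdc_write(struct trace *t, bool scrambling, bool high_ratio)
{
    (void)scrambling;
    (void)high_ratio;
    record(t, STEP_SCDC_WRITE);
}

/* Hypothetical stand-in for the method call that configures the GPU. */
static void gpu_enable_hdmi(struct trace *t, bool scrambling, bool high_ratio)
{
    (void)scrambling;
    (void)high_ratio;
    record(t, STEP_GPU_ENABLE);
}

static void hdmi_enable(struct trace *t, bool sink_supports_scrambling,
                        bool want_scrambling, bool want_high_ratio)
{
    /* Tell the sink first, but only if it can understand the request. */
    if (sink_supports_scrambling)
        sink_scdc_write(t, want_scrambling, want_high_ratio);

    /* The GPU side is needed in every case, so never bail before it. */
    gpu_enable_hdmi(t, want_scrambling, want_high_ratio);
}
```

Whether this reordering is sufficient on real hardware is a separate question; the sketch only pins down the sequence under discussion.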
21:39 lovesegfault: I see
21:39 lovesegfault: let me try that
21:39 lovesegfault:clones kernel
21:39 imirkin: (i had no idea what i was doing when i implemented this ... i thought i did it in the order the blob did, but it can be hard to tell)
21:40 imirkin: (but intel driver does it the other way, and vsyrjala says the spec says so, and he's the expert in these matters ... a lot more than me, anyways)
21:40 lovesegfault: Eh, what's the right place to clone the kernel from?
21:40 lovesegfault: I wanted to checkout the tag for my version, produce a diff with git, and then apply to my kernel for Nix to build
21:40 imirkin: https://www.kernel.org/
21:40 imirkin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
21:41 lovesegfault: Bingo
21:41 imirkin: this code has been in place since 4.20
21:42 lovesegfault: Jesus the kernel's git repo is yuge
21:42 karolherbst: well..
21:42 karolherbst: expected :p
21:42 lovesegfault:stares at company's 15GB git repo
21:43 lovesegfault:sees it's 24GB now
21:43 karolherbst: heh, the checkedout kernel uses like 24GB
21:43 karolherbst: :p
21:43 imirkin: more like 2.4GB
21:43 karolherbst: nah, --depth 1 are no real git repos :p
21:43 karolherbst: go away with those pseudo checkouts
21:44 imirkin: uhh
21:44 karolherbst: but yeah.. .git is smaller
21:44 lovesegfault: git gc --aggressive --prune=now
21:44 karolherbst: :D
21:44 karolherbst: no
21:44 imirkin: $ du -sh .
21:44 imirkin: 3.8G .
21:44 imirkin: that's with fully built objects
21:44 karolherbst: mine is 24G
21:44 imirkin: well i dunno what you did
21:44 karolherbst: I built it :p
21:44 imirkin: i did too.
21:44 lovesegfault: are there build arti
21:44 karolherbst: well, with debug symbols
21:44 lovesegfault: there we go :P
21:44 imirkin: yeah, of course
21:44 imirkin: i did a standard build
21:45 imirkin: git tree + checkout + all the build artifacts = 3.8GB
21:45 lovesegfault: Oh, this reminds me I should check on my chromium build
21:45 imirkin: so i dunno where your extra 20GB is coming from
21:45 karolherbst: builds without debug symbols are just pseudo builds :p
21:45 karolherbst: imirkin: probably because I use my distributions .config or
21:45 karolherbst: so
21:45 karolherbst: anyway.. it's huge
21:46 imirkin: lol ok
21:46 lovesegfault: fresh, untouched checkout: 2.5G linux
21:46 imirkin: in case you solder in some of those fancy raid controllers into your laptop?
21:46 karolherbst: well, I have TB3
21:46 imirkin: it could happen!
21:46 karolherbst: :p
21:46 imirkin: i don't think you can get a 3c509 to work over TB3 though
21:46 imirkin: TB3 -> PCIe -> PCI -> ISA
21:46 imirkin: what are the chances that would work :p
21:46 karolherbst: heck, why not
21:46 imirkin: coz ... io ports
21:47 lovesegfault: Oh, this reminds me, how do monitors over USB-C work?
21:47 karolherbst: DP
21:47 imirkin: and interrupts with ISA are super-dodgy
21:47 karolherbst: worst case, I solder on a button and some LEDs doing it with my finger :p
21:47 imirkin: haha
21:47 imirkin: should be able to keep up with ISA :)
21:49 karolherbst: I think I really want to work on zcull.. shouldn't be too hard
21:49 karolherbst: just need to figure out where to allocate the buffer
21:49 karolherbst: on which buffers
21:49 karolherbst: or ondemand...
21:49 karolherbst: mhh
21:49 fincs: Zcull needs support from the kernel interface apparently
21:49 fincs: I don't know how that bit works, as I only deal with stuff up to the interface level
21:49 lovesegfault: imirkin: https://gist.github.com/ad2daaead9fc31747626478effd10525
21:49 karolherbst: I highly doubt that
21:49 lovesegfault: does this look good
21:49 fincs: Not anything beyond
21:49 lovesegfault: just making sure before I apply
21:50 fincs: As in, nvidia has an ioctl for binding some sort of zcull buffer
21:50 karolherbst: the zcull buffer needs to be context switched, but we have that in firmware
21:50 karolherbst: mhhhh
21:50 karolherbst: but maybe there is more to it
21:50 karolherbst: dunno
21:50 fincs: You already have handling for handling zcull context?
21:50 lovesegfault: wait
21:50 lovesegfault: urgh
21:50 lovesegfault: ignore that diff, I have _no_ idea what I was thinking
21:50 karolherbst: fincs: well, if you get the firmware from nvidia it usually comes with random goodies :p
21:51 fincs: I'm saying it's an ioctl, I have no idea how it's internally implemented so I can't really comment
21:51 fincs: And I'm asking if nouveau already has something for that
21:51 lovesegfault: imirkin: https://gist.github.com/8d1a38747d484cc83451e80ca9b8d2e8
21:51 fincs: Even if it's just internally managed
21:52 lovesegfault: okay, this is a non-braindead attempt
21:52 fincs: Without requiring userland driver intervention
21:52 imirkin: well, you can't QUITE do that
21:52 imirkin: but in your case you can test with that
21:53 imirkin: you can only do the SCDC writes if SCDC is supported :)
21:53 imirkin: i think if you just put that into a if (hdmi->scdc.scrambling.supported) { ... } that should be good enough
21:53 lovesegfault: got it
21:53 imirkin: (high clock ratio requires scrambling, so it's ok)
21:55 lovesegfault: imirkin: perfection? https://gist.github.com/52b60b9a1be8bcca3a70651da24b61e7
21:55 imirkin: i've never seen a more perfect patch in my life
21:55 lovesegfault: :D
21:56 lovesegfault:applies
21:59 lovesegfault: Alright, building the kernel on my server
22:00 lovesegfault: going to enter weechat from another box, brb
22:02 lovesegfault: alright, this should avoid me going offline on reboot
22:03 fincs: karolherbst: I just noticed, in your latest take on dual issue (specifically SchedDataCalculatorGM107::setDelay) you moved around the targ->canDualIssue section to just prior emitStall, what was the reason for that?
22:03 fincs: I retested after moving that part and the output is identical
22:04 fincs: Hmm I guess it can be problematic if the delay is updated by the "delay <= GM107_MIN_ISSUE_DELAY" section
22:08 karolherbst: fincs: get rid of the barrier check
22:09 fincs: The entire "delay <= GM107_MIN_ISSUE_DELAY" section you mean?
22:09 karolherbst: well, we adjust the delay if we find matching barriers
22:09 karolherbst: and that's something we might want to do before
22:10 fincs: The issue I saw was that with the previous version it would do the barrier check after the dual issue check, which overwrites the delay with bad info
22:10 karolherbst: the == -> <= conversion I did because we don't clamp anymore
22:10 karolherbst: that as well
22:10 karolherbst: but I mainly did this so I don't have to check for barriers anymore
22:10 karolherbst: in canDualIssue I mean
22:10 fincs: And I agree that this check goes *after* the dual issue one
22:10 fincs: Ah
22:11 fincs: Ahhhh I understand now
22:11 karolherbst: anyway, I think this way it's actually more correct
22:11 karolherbst: :p
22:11 fincs: Yeah I'm trying to understand and apply your latest changes
22:12 fincs: Oh wait
22:12 fincs: There was no barrier check in TargetGM107::canDualIssue
22:12 karolherbst: you added one somewhere though
22:12 fincs: I added one in TargetGM107::canDualIssue
22:13 fincs: Which I guess now can be removed
22:13 karolherbst: there you go :p
22:13 karolherbst: yep
22:13 fincs: Nice
22:13 fincs: So it does apply the rule
22:13 fincs: And simplify logic, sweet
22:13 karolherbst: yeah, it's just not very obvious
22:14 fincs: Let me retest, hang on a sec
22:15 fincs: Wow
22:15 fincs: This added a fuckton of dual issues :D
22:15 karolherbst: :D
22:15 karolherbst: I am still wondering about the emitYield though
22:15 karolherbst: we don't have a proper understanding on when we are not allowed to set that bit
22:15 fincs: I wonder if IPA + EXIT dual issue is legal
22:16 karolherbst: because setting it always hurts
22:16 karolherbst: sure.... :D
22:16 karolherbst: dunno
22:16 fincs: inb4 this breaks shit
22:16 fincs: I have to test on hw now
22:16 lovesegfault: imirkin: rebooting into patched kernel
22:16 imirkin: karolherbst: just coz you set the bit doesn't make it true :)
22:16 imirkin: i doubt EXIT can be dual-issued with anything
22:16 imirkin: or any flow ops
22:16 karolherbst: yeah well...
22:16 karolherbst: imirkin: exit as the second one, not the first...
22:17 karolherbst: but yeah...
22:17 karolherbst: sounds dubious. but...
22:17 karolherbst: you know
22:18 karolherbst: but I will probably remove it from the list, just to be sure
22:18 karolherbst: I also don't think we should dual issue with any kind of jumps as the first of the pair
22:18 karolherbst: but they get soo high delays it doesn't even matter
22:19 fincs: Okay
22:19 fincs: Tests passing with flying colors
22:19 lovesegfault: imirkin: alright, on boot it works (30Hz)
22:19 fincs: Looks like it *is* legal to dual issue EXIT in the second slot
22:19 lovesegfault: dmesg doesn't show anything
22:19 lovesegfault: i'll try a hotplug at 30Hz
22:19 karolherbst: fincs: yeah dunno... :D
22:20 karolherbst: but on the other hand
22:20 karolherbst: it's just _issuing_ not _executing_
22:20 karolherbst: you still have the pipeline after that
22:20 karolherbst: and stuff
22:20 fincs: This is really stuffed with dual issues :D
22:20 karolherbst: issuing is essentially just enqueueing an instruction for execution
22:20 karolherbst: but mhhh
22:21 karolherbst: this all is super weird anyway
22:21 karolherbst: but yeah. I agree that executing anything alongside an exit sounds wrong :D
22:21 fincs: FMUL + EXIT -> legal
22:21 fincs: AST.128 + EXIT -> legal :p
22:21 karolherbst: the bigger question is, does nvidia do it
22:21 fincs: IPA + EXIT -> legal :p
22:21 fincs: I've seen Nvidia dual issue something with exit I think
22:21 fincs: Let me find it
22:22 lovesegfault: Alright, the hotplug worked at 30Hz imirkin, I needed to do dpms off, dpms on for it to work though
22:22 imirkin: lovesegfault: what about @60?
22:22 fincs: Heh, TEXS + EXIT -> legal
22:22 fincs: I say it's legal because it runs on hardware without crashing
22:23 lovesegfault: imirkin: holy shit, hotplug worked @60Hz!
22:23 imirkin: lovesegfault: yay!
22:23 lovesegfault: no need to dpms off/on even
22:23 fincs: Hmm AST.128 + LDC is legal too apparently?
22:23 fincs: No way
22:23 imirkin: fincs: again ... just coz the bits are set doesn't mean it does something
22:23 imirkin: i doubt it'd be reflected in the inst_issued2 counter
22:23 karolherbst: imirkin: it does
22:24 karolherbst: there is _no_ verification
22:24 imirkin: lol
22:24 imirkin: ok
22:24 karolherbst: if you mess it up, your shader does weirdo shit
22:24 karolherbst: and you have rendering issues
22:24 fincs: I *have had* rendering issues
22:24 lovesegfault: imirkin: dammit, wait a moment
22:24 fincs: While debugging stuff and having bad dual issues
22:24 karolherbst: goes from some tiles looking weird, to all colors are weird up to "the hell is that even supposed to be"
22:24 lovesegfault: I think it's setting it to 30Hz behind my back
22:24 karolherbst: enabling dual issue on all instructions is fun
22:25 karolherbst: even if you get the delay stuff right
22:25 karolherbst: things are just terribly broken
22:25 lovesegfault: (testing again)
22:25 karolherbst: I enabled dual issuing of two alu instructions once
22:25 karolherbst: hell...
22:26 karolherbst: well at least this is the advantage over kepler, you know when you mess up
22:26 lovesegfault: imirkin: 2020-04-06 15:26:31 - [backend/drm/legacy.c:21] HDMI-A-1: Failed to page flip: Device or resource busy
22:26 lovesegfault: 😭
22:27 lovesegfault: sway was being cheeky and resetting it to 30hz on unplug/replug, which is why I thought it was working
22:27 karolherbst: imirkin: what's also fun, you can't dual issue two instructions which use immediates
22:28 karolherbst: no idea why, but apparently the hw can't
22:28 lovesegfault: [ 531.386480] i2c i2c-12: sendbytes: NAK bailout.
22:28 lovesegfault: [ 531.387804] nouveau 0000:01:00.0: DRM: DDC responded, but no EDID for HDMI-A-1
22:28 fincs: Yeah I see a case where nvidia dual issued MOV and EXIT
22:28 fincs: IADD.X and RET
22:28 lovesegfault: so, same stuff as before
22:28 fincs: ISETP and RET
22:29 karolherbst: fincs: what I am curious about is instructions on smem and gmem
22:29 fincs: Yeah that is a bit mysterious
22:29 fincs: I suppose smem only touches internal memory, while gmem is more generally touching global memory
22:29 karolherbst: thing is, it really doesn't matter that often so I treat mem as mem
22:29 karolherbst: welll
22:29 karolherbst: sure
22:29 karolherbst: but
22:29 karolherbst: snen is just L2 cache
22:29 karolherbst: *smem
22:30 karolherbst: and it matters rather what hw resources are used to execute this instruction
22:30 karolherbst: so.. mhhh
22:30 karolherbst: dunno
22:30 fincs: I don't see tex instructions paired up with exit though
22:30 karolherbst: I'd rather be conservative here
22:30 HdkR: Watch out for that L2 slice balancing ;)
22:30 fincs: It's... odd
22:30 fincs: It shouldn't be happening
22:30 karolherbst: fincs: nope, tex with a stall of 0?
22:30 karolherbst: how?
22:30 fincs: I mean
22:30 karolherbst: doesn't it even install barriers?
22:30 fincs: nvidia does not dual issue tex instructions with exit
22:31 fincs: My current branch apparently does
22:31 karolherbst: yeah, but tex would be the first one and would also install a barrier or not?
22:31 karolherbst: would be really weird if tex isn't variable runtime
22:31 fincs: Like for example
22:31 fincs: /* 0x07ffbc03e200172f */
22:31 lovesegfault: imirkin: I'm going to spam printk everywhere
22:31 fincs: /*0028*/ IPA R0, a[0x84], R0 ; /* 0xe043ff884007ff00 */
22:31 fincs: /*0030*/ { TEXS R2, R0, R1, R0, 0x1a4, 2D, RGBA ; /* 0xd8301a4020070100 */
22:31 fincs: /*0038*/ EXIT }
22:32 fincs: ^ that's currently generated by my working tree
22:32 karolherbst: fincs: the fourth number is missing
22:32 fincs: I don't think that should be legal
22:32 karolherbst: :p
22:32 fincs: No it's not
22:32 fincs: Look above
22:32 fincs: sched insn insn insn
22:32 karolherbst: there are only three
22:32 fincs: Ah
22:32 fincs: Sorry
22:32 fincs: /* 0xe30000000007000f */
22:32 fincs: The weird thing is
22:33 fincs: This doesn't crash
22:33 fincs: Or produce bad output
22:33 karolherbst: it makes no sense
22:33 karolherbst: the texs result is not even used
22:33 karolherbst: so there is no point creating barriers
22:33 fincs: This is a fragment shader
22:33 karolherbst: so...
22:33 karolherbst: ohhh
22:33 karolherbst: mhhh
22:33 fincs: It's the output
22:33 karolherbst: huh
22:33 karolherbst: okay.. it has a barrier though
22:33 karolherbst: and exit waits on it
22:34 karolherbst: maybe it does make sense though
22:34 karolherbst: dunno
22:34 karolherbst: maybe it doesn't change perf either
22:34 karolherbst: so...
22:34 karolherbst: I can see why it's fine
22:35 karolherbst: again we talk about issuing here, not execution
22:35 fincs: Hmm
22:35 fincs: I guess nothing reads it
22:35 fincs: So there's no "variable latency" shit
22:35 karolherbst: in the pipeline the exit will still block on the barrier so it should be fine
22:35 karolherbst: well there
22:35 karolherbst: is
22:35 karolherbst: the barrier
22:35 fincs: So this dual issue is just useless
22:35 karolherbst: yeah, probably
22:35 fincs: Technically correct but it accomplishes exactly nothing
22:36 karolherbst: well, the yl flag does
22:36 karolherbst: :p
22:36 fincs: It just inflates our numbers to make us look nice when we show off to our boss :p
22:36 karolherbst: fine by me
22:36 fincs: ( ͡° ͜ʖ├┬┴┬┴
22:36 karolherbst: but I guess the yl flag will do something
22:36 fincs: Okay that makes sense
22:36 karolherbst: but that only matters if you have other threads in flight
22:36 karolherbst: I think
22:36 karolherbst: I still am not sure on what the flag does
22:37 karolherbst: but I think it's a point the SM can schedule to a different thread if something would take longer
22:37 karolherbst: or so
22:37 karolherbst: dunno
22:37 fincs: Maybe we accidentally made it greedier than whatever NV is doing
22:38 fincs: Hmm
22:38 karolherbst: might be
22:38 fincs: { MUFU.LG2 R0, R0 ;
22:38 fincs: FMUL.FTZ R1, R1, R4 }
22:38 karolherbst: as long as the perf is better than without it I don't care :p
22:38 karolherbst: yep SFU
22:38 karolherbst: SFU + ALU
22:38 fincs: SFU, our good old slow friend
22:38 fincs: Variable latency, but the next instruction doesn't depend on it
22:38 fincs: I guess it's correct
22:39 karolherbst: heh.. nvdisasm is sometimes weird
22:39 fincs: I mean if this works and we accidentally get good perf then that's also fine by me :)
22:39 karolherbst: XMAD.PSL.CBCC R0, R0.H1, R5.H1, R3 SLOT 0;
22:39 karolherbst: S2R R3, SR_TID.Y SLOT 1
22:39 karolherbst: slot 0 and 1?
22:39 karolherbst: the hell is that
22:39 fincs: Wat
22:41 karolherbst: ahh but yes
22:41 karolherbst: nvidia also dual issues stuff installing barriers
22:41 karolherbst: nice
22:41 karolherbst: so.. yeah
22:41 fincs: :)
22:41 karolherbst: whatever
22:42 fincs: Okay
22:42 fincs: I think I feel good enough about this
22:43 karolherbst: yeah.. so if an instruction activates a barrier it really only matters if the directly next one consumes it
22:43 karolherbst: because then you have a stall of 2
22:44 karolherbst: but if the one after it uses it, you get a lower stall
22:44 karolherbst: and can dual issue
22:44 karolherbst: so everything is fine
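The rule sketched in the last few lines can be written down as a tiny predicate. This is a toy model with invented types: it assumes six barrier slots, tracks only which barrier an instruction sets and which it waits on, and ignores every other dual-issue constraint (immediates, functional units, and so on). It is not the scheduler code being discussed.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_BARRIERS 6  /* assumption for this sketch */

struct toy_insn {
    int sets_barrier;     /* -1 if none, else 0..NUM_BARRIERS-1 */
    unsigned waits_mask;  /* bit n set: waits on barrier n */
};

/* True if `second` waits on the barrier that `first` activates. */
static bool next_consumes_barrier(const struct toy_insn *first,
                                  const struct toy_insn *second)
{
    if (first->sets_barrier < 0)
        return false;
    assert(first->sets_barrier < NUM_BARRIERS);
    return ((second->waits_mask >> first->sets_barrier) & 1u) != 0;
}

/* A pair is only a dual-issue candidate when the second instruction
 * does not immediately consume the first one's barrier. */
static bool may_pair(const struct toy_insn *first,
                     const struct toy_insn *second)
{
    return !next_consumes_barrier(first, second);
}
```

In this model an instruction that sets a barrier can still be paired with a neighbour that does not wait on it, which matches the observation that barrier-installing instructions show up in dual-issued pairs.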
22:45 karolherbst: fincs: the other thing I was wondering about was filling up the shader with 0 at the end
22:45 karolherbst: I mean.. nvidia adds also this weirdo BRA
22:45 karolherbst: but.. the hell
22:46 fincs: Yeah I generate that too
22:46 karolherbst: I have no idea if that even matters
22:47 karolherbst: having the shader size be a multiple of 0x10 maybe helps with ... random weird shit. I don't know
22:47 karolherbst: caching or loading the sahder
22:47 karolherbst: *shader
22:48 karolherbst: no clue
22:48 fincs: Maybe it's some prefetch/speculation shit
22:48 karolherbst: and no idea
22:48 karolherbst: imirkin: any ideas?
22:48 fincs: https://github.com/devkitPro/uam/commit/408d69375ef95a31e7a9c0e7ddb070d017150d63 <-- done
22:49 fincs: I love how reverting stuff back to how you originally wrote it is giving me improvements lol
22:49 karolherbst: :p
22:50 karolherbst: you don't even want to know how I wrote this dual issuing stuff
22:50 fincs: lol
22:51 karolherbst: can't even say I fully understood that maxas code shit.. dumb perl script
22:51 fincs: Classic write only language
22:51 fincs: Like writeonly image2D :p
22:51 karolherbst: seriously...
23:00 fincs: Hmm
23:00 fincs: I think I know why this is suddenly so greedy
23:00 fincs: The instruction reorder pass depends on canDualIssue, and I removed the barrier check from canDualIssue
23:00 karolherbst: yep
23:01 fincs: That makes the pass think that anything can be dual issued pretty much, with very loose rules :p
23:01 fincs: Still
23:01 karolherbst: mhhhh
23:01 karolherbst: valid point though
23:01 fincs: It technically produces the right code as later passes disallow bad dual issues
23:01 karolherbst: yeah
23:01 karolherbst: the entire pass is just best effort anyway
23:01 fincs: Yeah
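The split described above, a loose check driving the reorder pass and a stricter check applied later, could look roughly like this. Everything here is a hypothetical illustration: the `Insn` fields, the "different class" rule and the exact form of the barrier check are assumptions, not the real `canDualIssue` from nouveau or uam.

```cpp
// Hypothetical instruction record for illustration only.
struct Insn {
    int cls;          // assumed execution class (e.g. 0 = alu, 1 = mem)
    int setsBarrier;  // barrier index this instruction activates, -1 if none
    unsigned waitMask; // bitmask of barriers this instruction waits on
};

// Loose check, as used by the reorder pass once the barrier test is gone:
// it only looks at the (assumed) class rule, so it accepts many pairs.
inline bool canDualIssueLoose(const Insn &a, const Insn &b)
{
    return a.cls != b.cls;
}

// Stricter check, as a later pass might apply before really pairing:
// reject the pair if b waits on the barrier that a activates.
inline bool canDualIssueStrict(const Insn &a, const Insn &b)
{
    if (!canDualIssueLoose(a, b))
        return false;
    if (a.setsBarrier >= 0 && (b.waitMask & (1u << a.setsBarrier)))
        return false;
    return true;
}
```

The reorder pass then only makes best-effort guesses with the loose predicate, while correctness rests on the strict one.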
23:01 karolherbst: but the more you move those variable length instructions up, the more likely it becomes you can actually dual issue those as well
23:02 karolherbst: but yeah.. I think I'd like to limit the depth though
23:02 fincs: It's looking like the hardware doesn't care if you dual issue variable length instructions at all, contrary to what I previously thought
23:02 fincs: However
23:02 fincs: These dual issues may be fake after all
23:02 karolherbst: nope
23:03 karolherbst: why would it be?
23:03 karolherbst: it might just not lead to higher perf as you stall somewhere else
23:03 fincs: Seems fishy how nvidia doesn't generate dual issue scheduling for similar instruction sequences
23:03 karolherbst: mhhh
23:03 karolherbst: would be good to investigate where nvidia doesn't do it
23:03 fincs: Maybe dual issue is cancelled by variable length
23:03 karolherbst: doubtful
23:04 fincs: As in, dual issue ignored
23:04 karolherbst: it might just become pointless
23:04 karolherbst: but it always depends on whatever is executed after it
23:04 karolherbst: anyway, would be good to get into the details there
23:14 fincs: Hmm
23:14 fincs: Previously there were FADD + MUFU dual issues, now those are replaced by MUFU + FADD dual issues
23:14 fincs: I don't know how to feel about that
23:14 fincs: I have a feeling that's a downgrade
23:15 fincs: I.e. what's better, { FADD + MUFU } or { MUFU + FADD }
23:18 karolherbst: if you dual issue it doesn't matter :p
23:19 karolherbst: but usually it is better to start with high latency instructions anyway
23:19 karolherbst: so at the time you get to the users, the result is already there
23:19 fincs: I still wonder why nvidia never dual issues that
23:19 karolherbst: mhhh
23:19 fincs: First insn is never a variable latency one in nvidia generated shaders
23:20 karolherbst: let me check some bigger shaders
23:21 karolherbst: but yeah, maybe there is a reason they are doing it this way, but sometimes it is also something different
23:21 karolherbst: usually you really need to take a look at bigger shaders
23:22 fincs: Do you have a big database of nvidia generated shaders or something?
23:23 karolherbst: no
23:23 karolherbst: but I could trace a few shaders
23:23 karolherbst: but mhhh
23:23 HdkR: reading big shaders isn't fun
23:23 karolherbst: it's easier to compile big cl files
23:23 karolherbst: and just dump that
23:23 fincs: Whatever, works too
23:23 fincs: Assuming that cl does all the optimizations that are needed
23:24 karolherbst: I am just not aware of good cl kernels which do a lot of random shit :D
23:24 karolherbst: sure
23:24 karolherbst: it's the same for the hw
23:25 HdkR: Cuda compiler does spend a bit more time doing some more aggressive optimizations, not sure if that extends to CL
23:25 karolherbst: it does
23:25 karolherbst: CL gets translated to PTX first
23:25 HdkR: Nice
23:25 karolherbst: hihi, darktable kernels: https://github.com/darktable-org/darktable/tree/master/data/kernels
23:26 karolherbst: nice
23:26 karolherbst: https://github.com/darktable-org/darktable/blob/master/data/kernels/blendop.cl
23:26 karolherbst: this is big enough :D
23:26 fincs: Yeah seems big enough
23:27 karolherbst: how to deal with the includes though mhhh
23:27 karolherbst: ahhh
23:27 karolherbst: this way
23:28 karolherbst: nice
23:28 karolherbst: how big is that :D
23:29 karolherbst: fincs: there https://gist.githubusercontent.com/karolherbst/d9c11a4dcd77d59c563e2b6ab1408e25/raw/5e101c1b3a2007ffe309b87ba48502cf4f8fdcbb/gistfile1.txt
23:29 karolherbst: have fun :D
23:29 fincs: Alright let me have a look
23:30 fincs: FMUL + BRA = ok
23:30 karolherbst: I hope it has everything...
23:30 fincs: FADD + BRA = ok
23:30 fincs: FSETP + SSY = ok
23:30 fincs: FMUL + LDC = ok
23:30 fincs: FMUL + SYNC = ok
23:30 karolherbst: it might be that the first should be an alu one.. dunno
23:31 fincs: I guess alu + flow = ok
23:31 fincs: Nah I've seen non-alu in first slot before
23:31 karolherbst: alu + ldc as well
23:31 fincs: I.e. alu + mem
23:31 fincs: mov + depbar = ok
23:32 fincs: ffma + mufu = ok
23:33 fincs: Oh wait this is SM60
23:33 karolherbst: yeah
23:33 karolherbst: I can compile for sm53 if you want to
23:33 karolherbst: but I doubt it makes a difference
23:33 fincs: Yeah please
23:34 karolherbst: there are actually differences, wow
23:35 fincs: So far I really can't find dual issue with variable latency in first slot
23:35 karolherbst: fincs: https://gist.githubusercontent.com/karolherbst/d9c11a4dcd77d59c563e2b6ab1408e25/raw/0a8b369ee20c3c8f5b3f92b6c83ff90a26e98378/gistfile1.txt
23:36 karolherbst: fincs: FMUL32I?
23:36 karolherbst: I see some FMUL32I and f2i later on
23:36 karolherbst: /*9358*/ { FMUL32I R7, R7, 65535 ;
23:36 fincs: Eh?
23:36 fincs: I don't think FMUL32I is variable latency?
23:36 karolherbst: it is
23:37 karolherbst: needs to set a write barrier
23:37 fincs: Huh, isn't this just ALU?
23:37 karolherbst: it is
23:37 karolherbst: but it is integer
23:37 karolherbst: integer sucks
23:37 karolherbst: :p
23:37 fincs: Hmm
23:38 fincs: Wait a minute
23:38 fincs: What does fmul32i do
23:38 fincs: And why is it both f and i
23:38 karolherbst: 32immediate
23:38 fincs: Oh wait
23:38 karolherbst: :p
23:38 fincs: This is not integer
23:38 fincs: It's floating point mult with an immediate
23:39 karolherbst: ohhh wait
23:39 karolherbst: I am stupid, it was an fmul
23:39 karolherbst: I am silly
23:39 karolherbst: I thought it's imul
23:40 fincs: :p
23:40 karolherbst: well, nobody uses imul anyway
23:40 karolherbst: imuls suck
23:40 karolherbst: imul sucks so bad, you don't use them
23:40 fincs: So yeah so far all these are non-variable-latency on first slot
23:40 karolherbst: period
23:41 karolherbst: but mhh
23:41 karolherbst: that one texs, but that was weird
23:41 fincs: Which texs?
23:41 karolherbst: ahh that was us
23:41 karolherbst: not nvidia
23:42 karolherbst: but it's really crazy to see how little dual issuing nvidia does
23:42 karolherbst: sometimes there are blocks with 100 instructions without a single dual issue
23:42 fincs: True
23:43 karolherbst: but it seems like the first slot is always alu
23:43 karolherbst: or did you see anything else?
23:43 fincs: I've seen a vote instruction in some shader I once looked at
23:44 karolherbst: but it is odd to see that the generated shader is different between sm53 and sm60
23:44 karolherbst: there are even changes in the sched opcodes
23:45 karolherbst: ohh
23:45 karolherbst: no
23:45 karolherbst: it's just the yield flag
23:45 karolherbst: the hell
23:46 karolherbst: the changes also look like rather random stuff
23:47 fincs: Hmmm
23:48 fincs: It really looks like you're not supposed to put a variable latency instruction in the first slot of a dual issue
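The rule guessed at here can be written down as a small predicate. This is a sketch of the observation only (no variable-latency instruction in the first slot), inferred from the nvidia-generated shaders in the log and not from any documentation. The `Insn` type and function names are hypothetical.

```cpp
// Hypothetical latency classification for illustration only.
enum class Latency { Fixed, Variable };

struct Insn {
    const char *name;
    Latency latency;
};

// Observed pattern: the first slot of a dual-issue pair always holds a
// fixed-latency instruction; the second slot may hold either kind.
inline bool firstSlotOk(const Insn &first)
{
    return first.latency == Latency::Fixed;
}

// Applies only the first-slot rule; a real scheduler would need
// further checks on top of this.
inline bool pairMatchesObservedRule(const Insn &first, const Insn &second)
{
    (void)second; // the second slot is unconstrained by this rule
    return firstSlotOk(first);
}
```

Under this rule { FADD + MUFU } is accepted and { MUFU + FADD } is rejected, treating MUFU as variable latency and FADD as fixed latency.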
23:48 fincs: Is RRO alu?
23:50 fincs: Actually
23:50 fincs: Are there any non-variable-latency instructions which *aren't* ALU? :p
23:56 karolherbst: mhhh, good question
23:56 karolherbst: i2p and p2i are fixed
23:57 karolherbst: but you could say those are alu instructions as well
23:57 fincs: Is that isetp or whatever it's called?
23:57 karolherbst: no
23:57 karolherbst: it's i2p :p
23:57 karolherbst: integer to predicate
23:57 karolherbst: ehh that would include f2p and p2f as well
23:57 fincs: What does nvidia call those?
23:57 imirkin: what about c2p and p2c? :p
23:58 imirkin: fsetp, csetp, etc
23:58 karolherbst: they call them i2p, no?
23:58 imirkin: isetp
23:58 fincs: So isetp fsetp csetp alright
23:58 fincs: Yeah I think those are alu
23:58 imirkin: psetp
23:58 karolherbst: imirkin: I mean the convert ones
23:58 karolherbst: not the compare instructions
23:58 imirkin: there are no convert instructions
23:58 imirkin: only compare
23:59 imirkin: look at how we emit the CVT's :)
23:59 karolherbst: as movs
23:59 karolherbst: right
23:59 karolherbst: the predicate ones