06:02 pmoreau: imirkin: Oh neat! Does that include add for example? I know I have been doing that to handle indirect + offset.
06:03 imirkin: shl and add with a fixed value both work
06:03 imirkin: dynamic values don't
06:03 imirkin: but i'm having a lot of trouble wrapping my head around what needs to be done
06:03 imirkin: s.t. all the bits fall into place properly
06:03 pmoreau: Ok
06:04 imirkin: ideally add wouldn't get used in the first place and the value would get dumped into the fixed indirect offset in the access
10:09 pmoreau: Really weird how the add test on long4 works fine, except for the first 32 bits of long4.x which seems to get some random value.
10:15 shining: pmoreau: I upgraded to Debian 11 and it seems better now. intel accel is working, I can switch between both cards using DRI_PRIME. I even made a quick benchmark for fun with openarena and it's 7x slower with the nvidia card :D https://pastebin.com/4SBqgffJ
10:15 shining: I guess that's because of missing reclocking support.
10:19 pmoreau: Glad to hear that!
10:19 pmoreau: And yes, most likely due to no reclocking on the Nouveau side.
10:29 shining: Should we gently ask nvidia, or is there still no hope? Since it's an nvidia guy who directly commits the firmwares in the kernel, couldn't we ask for the pmu ones?
10:37 karolherbst: shining: sadly it's not that simple
12:00 lint-tech: Hi all! I have a problem with display on nouveau. The system is quite old, P4, GA-8PE800, nVidia GeForce FX 5600XT (NV31). Linux distribution is Mint 19.3, straight from installation, no tweaks made. Display looks like this https://bit.ly/3agP9g4, only mouse pointer remains intact. But I see no problems in console mode, and also display fixes for a brief moment when switching to it. I've tried simple
12:00 lint-tech: xorg.conf from installation guide - still the same problem persists, lsmod gives no signs of nvidia driver. Xorg log https://bpa.st/H63Q. Is there any way to fix this problem?
12:09 ccr: that google drive file is restricted
12:11 ccr: however, based on xorg.log, your "stack" is rather old. old kernel, old xorg.
12:27 lint-tech: I've fixed google drive file permissions
12:28 lint-tech: Mint 19.3 is the last version that supports 32-bit, also I installed all available updates from update manager
13:50 pmoreau: Something goes wrong somewhere, cause instead of doing `out.x.hi = inA.x.hi + inB.x.hi + carry`, I end up with `out.x.hi = inA.z.lo + inB.z.lo + inB.x.hi + carry`… and it’s only for the `out.x.hi`, all other parts are fine (with `long4 out, inA, inB`).
15:03 pmoreau: I think I understand what is going on: 1) (inA.x + inB.x) needs to be spilled due to lack of registers, 2) (inA.z + inB.z) also needs to be spilled but the logic does not handle compound values so well, so it only considers the lower half of the compound as being valid, and ends up overwriting the upper half, 3) the rest of the code behaves as if everything went well and retrieves the high half of (inA.x + inB.x) where it should
15:03 pmoreau: have been, but will end up with the low half of (inA.z + inB.z) instead.
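For reference, the intended `long` addition discussed above can be sketched in C as two 32-bit adds linked by a carry. This is only a minimal illustration of the arithmetic (`out.x.hi = inA.x.hi + inB.x.hi + carry`); the `u64_halves` type and `add64_split` name are hypothetical and are not taken from Nouveau.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit value held in two 32-bit registers. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_halves;

/* Add two 64-bit values using only 32-bit operations. */
static u64_halves add64_split(u64_halves a, u64_halves b)
{
    u64_halves out;

    out.lo = a.lo + b.lo;                /* wraps modulo 2^32 */
    uint32_t carry = (out.lo < a.lo);    /* 1 if the low add overflowed */
    out.hi = a.hi + b.hi + carry;        /* high half consumes the carry */
    return out;
}
```

If the high half of one spilled compound value is replaced by the low half of another, `out.hi` is computed from the wrong operand, which matches the symptom described for `out.x.hi`.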
15:04 pmoreau: The whole annotated trace (NV50_PROG_DEBUG=255) can be found here: https://gitlab.freedesktop.org/pmoreau/mesa/-/snippets/1895
17:09 imirkin: lint-tech: do you know what DE is being used?
17:09 imirkin: i would guess something like gnome-shell or kde thing?
17:09 imirkin: if so, those won't work
17:10 imirkin: pmoreau: spilling compound values should, generally, work
17:10 imirkin: although i haven't debugged it much on nv50
17:11 imirkin: pmoreau: btw, that snippet page crashes chrome
17:11 pmoreau: :-D
17:11 imirkin: if you can link me to a not-so-heavy site, i can take a look
17:12 pmoreau: Firefox takes a few seconds to load, but it does it fine even on my 2009 laptop and its Core 2 Duo.
17:12 pmoreau: Does this work better? https://gitlab.freedesktop.org/pmoreau/mesa/-/snippets/1895/raw/main/snippetfile1.txt
17:12 imirkin: much better
17:12 imirkin: thanks
17:13 pmoreau: If you look for "XXX", the first hit will be when the second spill overwrites part of the first one.
17:14 imirkin: pmoreau: so ... the weird thing that i notice
17:14 imirkin: is that nothing uses l[0x4]
17:14 pmoreau: That’s a good point
17:15 pmoreau: I think I saw something in envytools about an auto-increment thing, but first time I’m seeing it in action: “R2G.U16.U16 g[A1+++0x1], R3H”
17:15 imirkin: we don't emit that
17:15 imirkin: does the blob emit that? or is that us?
17:16 imirkin: that's a post-increment of A1 by 1 (well, 2 ... it's sized to access size)
17:16 pmoreau: The blob does
17:17 imirkin: yeah, we're nowhere near smart enough for that
17:20 imirkin: pmoreau: ok, so...
17:20 imirkin: %207 is marked as spill
17:21 imirkin: unfortunately that's a merge op
17:21 imirkin: we *suck* at spilling merge ops
17:21 imirkin: there's all kinds of logic to make it not suck, and it still sucks.
17:21 imirkin: the thing is that it's pretty unclear wtf it means to spill a merge
17:22 pmoreau: I expected it was something like that :-/
17:25 imirkin: IDEALLY when spilling a merge op
17:25 imirkin: we get rid of the original
17:25 imirkin: and just rematerialize it wherever it's needed
17:25 imirkin: i forget if we do that already
17:26 imirkin: that way, the next round of spilling can decide if it wants to actually spill either or both of the underlying values
17:29 pmoreau: If I understand correctly, would it be that instead of spilling the merge, we spill the two sources of the merge?
17:32 imirkin: not sure what it's doing now
17:33 imirkin: but when spilling a merge, it's unclear that the underlying bits need to be spilled as well
17:33 imirkin: since a merge is a new ssa value
17:33 imirkin: which takes up regs as well, potentially
17:52 pmoreau: I think I’ll let that issue be for now, but thank you for the help!
17:52 pmoreau: Hopefully the issues with shared memory tests will be a bit easier to solve than spilling of merge ops.
17:53 imirkin: pmoreau: if you're on a G96, note that you don't have atomics
17:54 pmoreau: I will try to remember that, but the current test is not using atomics, just doing a bunch of writes to shared using a single thread, a barrier, and then cooperative loads from shared to global.
17:57 imirkin: pmoreau: ok. that's a pretty common pattern
17:57 imirkin: i'd be surprised if i hadn't hit it in my testing.
17:58 pmoreau: I think the issues there are more about mis-aligned pointers, and potentially using the wrong index.
18:00 imirkin: ah yeah, that could be a problem :)
18:00 pmoreau: :-)
18:04 pmoreau: Argh, I should not be reading a mix of Nouveau disassembly and blob disassembly: one uses g[] for global, the other g[] for shared and it gets confusing pretty quickly 🙃
18:05 RSpliet: with shared you mean local? :-D
18:05 pmoreau: 😰
18:06 imirkin: yeah, it's pretty annoying
18:08 lint-tech: imirkin: DE is MATE
18:08 imirkin: lint-tech: try xfce
18:10 lint-tech: imirkin: Thanks, I'll try
18:14 pmoreau: imirkin: Did the barrier stuff for shared need weird reads from global, access to c15[0x6c4] and sv[PHYSID:0]?
18:14 imirkin: pmoreau: for shared, no
18:15 imirkin: only for global
18:15 imirkin: it checks the MEMBAR type
18:15 imirkin: (make sure it gets set properly in the conversion)
18:15 pmoreau: Yeah, I think something might be missing there
18:15 imirkin: but that shouldn't hurt
18:16 imirkin: (other than perf)
18:17 pmoreau: But it did confuse me quite a bit, trying to figure out where those weird strided reads from global were coming from.
18:17 imirkin: :)
18:24 pmoreau: I’m guessing GL is not what I want here, but rather CTA?
18:24 imirkin: correct
18:24 pmoreau: Double-checking with TGSI seems to say yes
18:24 imirkin: shared = CTA
18:25 imirkin: GL = global
18:25 imirkin: at least that's how i read it :)
18:27 pmoreau: Ooooh, right GL for global… I always read it as OpenGL xD
18:27 pmoreau: Which makes no sense, but that’s the first thing that comes to mind
18:27 pmoreau: Every single time
18:27 imirkin: :)
18:27 RSpliet: That totally didn't happen to me
18:27 pmoreau: :-)
18:31 pmoreau: Unsurprisingly, that did not solve the problem. So, now to investigate why NVIDIA thinks we need to do the work in 126 instructions when we can do it in only 52 (but wrongly).
18:32 imirkin: the Max Power way -- the wrong way, but faster!
18:34 pmoreau: ;-)
18:40 pmoreau: TBF, they are doing a bunch of weird things too
18:40 imirkin: fwiw, i think in many cases we do emit better code. their compiler lowers things in a weird order.
18:40 imirkin: but they are much better at like ... compiler things.
18:41 RSpliet: They have interesting optimisation passes for sure
18:42 RSpliet: and I guess it helps to understand your arch to the T when trying to tune your ins scheduling
18:43 imirkin: sure, but they're better at doing compiler things, like loop opt, etc
18:46 pmoreau: They actually partially unrolled the loop, so instead of having `for (i = 0; i < 64; ++i) { s[i] = g[i]; }`, they now have `for (i = 0; i < 64; i += 16) { s[i] = g[i]; x16 }`
18:46 pmoreau: That explains most of the added code
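The partial unrolling described above can be sketched in C as follows. For brevity this uses a factor of 4 rather than the x16 seen in the blob output; the function names and the `COPY_COUNT` constant are hypothetical.

```c
#include <stdint.h>

enum { COPY_COUNT = 64 };

/* Straightforward loop: one copy and one loop test per element. */
static void copy_rolled(uint32_t *s, const uint32_t *g)
{
    for (int i = 0; i < COPY_COUNT; ++i)
        s[i] = g[i];
}

/* Partially unrolled by 4: same result, a quarter of the loop tests,
 * at the cost of a larger loop body. COPY_COUNT must be a multiple of 4. */
static void copy_unrolled4(uint32_t *s, const uint32_t *g)
{
    for (int i = 0; i < COPY_COUNT; i += 4) {
        s[i + 0] = g[i + 0];
        s[i + 1] = g[i + 1];
        s[i + 2] = g[i + 2];
        s[i + 3] = g[i + 3];
    }
}
```

Both versions produce identical output; the unrolled one simply trades instruction count for fewer branches, which is consistent with the larger instruction count noted above.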
18:46 imirkin: like i said, they're a lot better at compiler things ;)
18:47 pmoreau: :-)
18:47 imirkin: but sometimes they'll also have a *super* obvious missed opportunity for optimizing some little thing
18:47 imirkin: which just implies a pass ordering issue
18:50 pmoreau: I had that impression a lot when I started looking at the code, and then I realised I had `-g` specified when assembling with ptxas… 🤦 But I have seen that as well with optimisations turned on.
18:51 imirkin: to be fair, that also happens to us, since we don't run it to a fixed point, but it's rare
18:51 pmoreau: After removing most of the loop unrolling (though they still have a x4 instead of the x16 they were doing), they now only have 2 extra instructions compared to us.
18:52 pmoreau: But it’s going to be a lot easier to compare the two.
18:53 imirkin: cool
18:57 pmoreau: They do some neat trick like `add b32 $c0 $r2 -1` (I inlined the immediate), rather than incrementing a count + doing a comparison against the limit.
18:57 pmoreau: *a counter
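The counter trick can be illustrated in C by comparing a count-up loop with a count-down loop. This is a rough sketch under the assumption that the `add ... -1` writes a condition code (`$c0`) that the branch can test directly; the function names are hypothetical.

```c
#include <stdint.h>

/* Count up: each iteration needs an increment plus a compare against n. */
static uint32_t sum_count_up(const uint32_t *v, uint32_t n)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}

/* Count down: the decrement itself tells us whether we reached zero,
 * so no separate comparison against a limit is required. */
static uint32_t sum_count_down(const uint32_t *v, uint32_t n)
{
    uint32_t sum = 0;
    uint32_t remaining = n;
    while (remaining != 0) {
        sum += *v++;
        remaining -= 1;
    }
    return sum;
}
```

Both loops visit the same elements in the same order; only the loop bookkeeping differs.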
19:02 imirkin: did i mention they were better at compiler things?
19:03 pmoreau: Wouldn’t hurt to say it one more time
19:03 imirkin: :)
19:12 pmoreau: This is genius for packing/unpacking 8-bit values!
19:12 pmoreau: Are r0l/r0h still a thing on Fermi+?
19:12 imirkin: no
19:13 imirkin: unfortunately that's not ideal for 8-bit values
19:13 imirkin: it's better for packing 16-bit values :)
19:13 pmoreau: Well
19:13 imirkin: but fermi+ has stuff like F2I.B1 etc
19:13 imirkin: er, I2F.B1
19:13 imirkin: which will take the lower byte of a 32-bit reg, and convert to float
19:13 imirkin: etc
19:14 imirkin: (we use those)
19:18 pmoreau: This is kinda neat to unpack a char2 “cvt u16 $r3h u8 $r6l; and b16 $r4h $r6l c1[0x0] (maybe 0xff00?);” but why do they recombine it right afterwards “or b16 $r3h $r3h $r4h”?
19:46 imirkin: pmoreau: 2x 8-bit values into 1 16-bit value?
19:46 imirkin: pmoreau: i do a lot of that in the image unpack/repack
20:27 pmoreau: Yes, 2x 8-bit in 1x 16-bit
20:54 imirkin: pmoreau: yeah, that's the best thing i could come up with for image pack/unpack
20:54 imirkin: and then i just arrange for those to be packed into adjacent regs
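The 2x 8-bit in 1x 16-bit packing can be sketched in C as below. The last function mirrors the quoted instruction sequence (zero-extend the low byte, mask the high byte in place with the guessed `0xff00`, then `or` them together) to show why the recombination yields the original 16-bit value. All names are hypothetical, and the mask value is the guess from the conversation, not a confirmed constant.

```c
#include <stdint.h>

/* Pack two bytes into one 16-bit half register: lo in bits 0-7, hi in 8-15. */
static uint16_t pack_2x8(uint8_t lo, uint8_t hi)
{
    return (uint16_t)(lo | ((uint16_t)hi << 8));
}

static uint8_t unpack_lo(uint16_t v)
{
    return (uint8_t)(v & 0x00ffu);
}

static uint8_t unpack_hi(uint16_t v)
{
    return (uint8_t)(v >> 8);
}

/* Mirrors the quoted sequence: cvt u8->u16, and with 0xff00, then or. */
static uint16_t split_and_recombine(uint16_t v)
{
    uint16_t low  = (uint16_t)(v & 0x00ffu);   /* cvt u16 <- u8 (low byte) */
    uint16_t high = (uint16_t)(v & 0xff00u);   /* and b16 with 0xff00 */
    return (uint16_t)(low | high);             /* or b16: back to v */
}
```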
21:08 pmoreau: What is `st u16 # s[$a0+0x0] $r1 $r2` supposed to do? I see that in the emission but running envydis on it results in `st b16 s[$a1] $r0h` which makes more sense, except that’s the wrong register to be used here.
21:08 imirkin: pmoreau: a bug :)
21:08 pmoreau: Cool :-)
21:09 pmoreau: I was surprised to see a st with 3 sources
21:09 imirkin: pmoreau: "in the emission"?
21:09 pmoreau: EMIT: st u16 # s[$a0+0x0] $r1 $r2
21:09 imirkin: ok, that's really bad.
21:09 imirkin: figure out where that's coming from
21:09 pmoreau: It’s even there way earlier
21:09 imirkin: coz that's up to no good.
21:09 imirkin: yeah, figure out wtf is going on, that should never happen
21:09 pmoreau: Ok
21:10 imirkin: the fact that the emitter doesn't choke on it is just luck
21:12 pmoreau: if (op == nir_intrinsic_load_kernel_input) { /* some stuff */ if (op == nir_intrinsic_load_kernel_input) { /* other stuff */ } }
21:12 pmoreau: One can never be too sure 😅
21:15 karolherbst: :D
21:16 RSpliet: I don't suppose /* some stuff */ has the potential side-effect of changing op? :-D
21:17 pmoreau: No it didn’t :-D
21:18 karolherbst: pmoreau: who wrote that code anyway :p
21:18 pmoreau: The second one was supposed to be a check on the chipset to only perform it on Tesla, but I guess someone was too tired…
21:18 karolherbst: please don't do chipset checks :D
21:42 pmoreau: Seems to be one of the optimisation passes doing that, by merging 2x st.u8 into 1x st.u16. But I thought I already told the MemoryOpt to not bother…
21:43 imirkin: apparently not hard enough ;)
21:45 pmoreau: It indeed was
21:50 imirkin: yeah, so for that to actually work, memoryopt would need to know how to combine those values
21:50 imirkin: so anytime there's sub-32-bit write, purge that record and move on.
22:09 pmoreau: It does not seem to care about what I tell it
22:09 imirkin: have you tried compiling? :)
22:10 pmoreau: I stepped through and saw that it did purge the records, but it still ended up merging them somehow
22:11 pmoreau: This is what I went with: https://gitlab.freedesktop.org/pmoreau/mesa/-/snippets/1896/raw/main/snippetfile1.txt
22:12 imirkin: mmmmm
22:12 imirkin: let me see how purgeRecords works
22:14 imirkin: pmoreau: hm, that seems reasonable.
22:14 imirkin: pmoreau: actually i think you want to flip the dType / sType
22:15 imirkin: load + sType; store + dType
22:16 pmoreau: That works better :-)
22:18 pmoreau: Guess I should rather go and get some sleep
23:10 pmoreau: imirkin: Do you think this is an emission problem? This is what Nouveau thinks it is doing “ld u32 $r2 u8 g[$r2+0x0]; st u8 # u32 s[$a0+0x34] $r2”, but this is what actually gets emitted “ld u8 $r2 g15[$r2]; st b8 s[$a1+0x34] $r1l”.
23:18 imirkin: pmoreau: hmmmm
23:19 imirkin: that's a problem, but not an emission problem
23:19 imirkin: it _appears_ that 16/8-bit stores want a half-reg
23:19 imirkin: (for shared mem only)
23:21 pmoreau: Ah, that is something I wanted to ask earlier: how does one get a half-reg? Just by doing `getSSA(2)` or similar?
23:33 imirkin: yep
23:36 pmoreau: Perfect, thanks! I initially went with a cvt to get from u32 to u16, but a split seems to work better.
23:37 imirkin: well, definitely cheaper :)
23:37 imirkin: have a look at my image pack/unpack logic
23:37 imirkin: i play a _lot_ of tricks
23:37 imirkin: lots of half-reg usage in there too
23:39 pmoreau: I will do that tomorrow, once my brain is a bit more awake. :-)
23:39 imirkin: that code is a wee bit dense, i must admit
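The split mentioned above (taking a 32-bit value apart into two half registers instead of converting it) can be modelled in C roughly like this. It only illustrates the bit layout; `u32_halves`, `split32` and `merge32` are hypothetical names and do not correspond to Nouveau's codegen API.

```c
#include <stdint.h>

/* Hypothetical model of a 32-bit register viewed as two 16-bit halves. */
typedef struct {
    uint16_t lo;   /* corresponds to the "l" half, e.g. $r1l */
    uint16_t hi;   /* corresponds to the "h" half, e.g. $r1h */
} u32_halves;

/* Split: pure reinterpretation of the bits, no arithmetic conversion. */
static u32_halves split32(uint32_t v)
{
    u32_halves h;
    h.lo = (uint16_t)(v & 0xffffu);
    h.hi = (uint16_t)(v >> 16);
    return h;
}

/* Merge: the inverse operation, rebuilding the 32-bit value. */
static uint32_t merge32(u32_halves h)
{
    return (uint32_t)h.lo | ((uint32_t)h.hi << 16);
}
```

In this model the low half of a split holds the same bits a truncating conversion to 16 bits would produce, which fits the remark that a split is the cheaper route.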
23:49 pmoreau: Well that did solve vload_store for char2; for char3 it is sort of working: I correctly get the middle component, but it is consistently in the 3rd component; the other two are just random values.