09:50pmoreau: imirkin: I have sadly not; I ended up working longer than expected and got distracted by another.
09:56karolherbst: imirkin: does this ring any bell? https://bugzilla.redhat.com/show_bug.cgi?id=1930977
10:42RSpliet: karolherbst: this is aarch64?
10:42RSpliet: Yeah, Jetson Nano. Surprised they declare a blocker for such a niche piece of HW
10:43RSpliet: "niche" - pehaps a slightly premature judgement :-P
11:24karolherbst: RSpliet: yeah well :p
12:41imirkin: karolherbst: doesn't ring a bell, sorry. how does it die? not clear from the report
13:00karolherbst: apparently booting
13:00karolherbst: maybe the modifier stuff broke shit? dunno
13:00karolherbst: I am setting up my jetson nano right now
13:01imirkin: karolherbst: that's not what i mean
13:01imirkin: what causes the process to terminate
13:01imirkin: i see a backtrace
13:01imirkin: but not a reason for it dying
13:02imirkin: like did someone attach gdb to it and type "bt"? etc
13:02imirkin: (likely causes might be SIGSEGV, SIGBUS, etc)
13:02imirkin: pmoreau: no rush!
13:02imirkin: just my curiousity ;)
13:03karolherbst: imirkin: SIGSEGV, written in the bug title
13:03imirkin: karolherbst: lol, go me for missing that
13:04imirkin: in fairness to me, i'm still sipping my first cup of coffee
16:56pmoreau: imirkin: https://gitlab.freedesktop.org/pmoreau/mesa/-/snippets/1664
16:56imirkin: pmoreau: can you run just dEQP-GLES31.functional.compute.basic.shared_atomic_op_single_group
16:56pmoreau: Halfway there, though! 😉
16:56imirkin: with NV50_PROG_DEBUG=255
16:57imirkin: and also include the contents of TestResults.qpa
16:58bencoh: woops :)
16:59pmoreau: I added the results to the same snippet.
17:00imirkin: oh. nir.
17:00imirkin: that's untested.
17:00imirkin: can you double-check that it fails with tgsi?
17:01pmoreau: I trigger an assert with TGSI, in nv50_program.c
17:02pmoreau: Let me retry with commenting that code out
17:02imirkin: what's the assert?
17:03imirkin: oh, one you added?
17:04pmoreau: Yeah due to relying on some NIR properties to get the combined size of all inputs, to know by how much to offset the accesses to shared.
17:04imirkin: EMIT: shl (SUBOP:1) s32 $r1 $r0 4 (8)
17:04imirkin: that seems weird. wtf is subop 1?
17:05pmoreau: Added results for TGSI
17:05imirkin: also fail i assume?
17:08imirkin: i don't see anything obviously wrong =/
17:10imirkin: pmoreau: mind trying to write a similar test in opencl
17:10imirkin: and seeing what that compiles to?
17:10pmoreau: I was going to try that
17:10imirkin: could be missing something obvious
17:10imirkin: or could be something sutble
17:10pmoreau: Just stuck looking on the sources for that test 🙃
17:10imirkin: like $r63 being a lot more busted than i thought
17:10imirkin: it's in the TestResults.qpa file
17:11imirkin: along with the very helpful comment of "Comparison failed for Output.values"
17:11imirkin: definitely didn't want to know what value you *did* find there ... sigh
17:14pmoreau: You just get a bad grade, and no comments as to why. 🤷
17:15RSpliet: Lyude: silly suggestion, but are we sure that nouveau uses the 917d display class (codepaths) on kepler?
17:15RSpliet: sorry, on *those* keplers even
17:15Lyude: RSpliet: I was wondering the same thing actually
17:15Lyude: it should say in the open-gpu-docs
17:16Lyude: RSpliet: ok yeah-gk104+ is DISP021X, core channel cl917d and cursor cl917a
17:16imirkin: RSpliet: pretty sure.
17:16imirkin: RSpliet: otherwise they wouldn't get 128x128 either
17:16imirkin: and 256x256 at least sorta-works
17:17imirkin: just has rendering artifacts
17:17Lyude: i'm a bit surprised by james' suggestion, but it definitely sounds like it's worth trying
17:17imirkin: 4k-aligned vs 256-aligned?
17:17imirkin: we definitely wouldn't use a large page for those
17:17karolherbst: imirkin: https://gist.github.com/karolherbst/a8486ea170221987a1588da07b06e1ea :)
17:18Lyude: imirkin: yes. which is weird because the cursor doesn't follow the same alignment as the rest, but if nvidia suggested it I think it's definitely worth trying. especially since the issues with the cursor as described sound like what would happen when I was doing igt stuff and set the wrong alignment for surfaces
17:18imirkin: Lyude: could be that it's due to crossing pages
17:18Lyude: oooooh, good point
17:19imirkin: Lyude: although ... even 64x64(*4) = 16k
17:19imirkin: so it'd already be crossing pages
17:19imirkin: 256*256*4 = 256k
17:19imirkin: so it COULD be getting a large page (64/128k)
17:20Lyude: tbh - I wonder if this might also explain the issues we've been seeing with small ovlys on kepler
17:20imirkin: could be
17:21imirkin: anyways ... priority should be to getting users going again
17:21imirkin: and then worrying about fixing the universe
17:21imirkin: nobody actually needs 256x256 cursors
17:21Lyude: imirkin: yeah-that's why i'm going to try this today :)
17:21Lyude: otherwise we'll just limit it for now and figure it out later
17:22imirkin: sounds good
17:23imirkin: pmoreau: karolherbst: should separately figure out where that SHL + subop 1 was coming from in the nir path
17:24imirkin: why would that be getting set?
17:24karolherbst: nir shifts do warp or do not... dunno which way it was
17:24karolherbst: see commit 59b44f90aa4d
17:25imirkin: well the nv50 hw just does one thing
17:25imirkin: i don't know offhand which thing it does, should be obvious from mwk's excelltn hwtests
17:25karolherbst: so only clamped shifts?
17:25karolherbst: because that's kind of the default
17:26imirkin: doesn't look like it wraps
17:26imirkin: i dunno what clamped is
17:26karolherbst: okay, jason will love this
17:26karolherbst: imirkin: !wrap
17:26imirkin: is that where shifting by 100 == 0?
17:27imirkin: admittedly it does wrap
17:27imirkin: just not to the argument bit width
17:27karolherbst: I see
17:27imirkin: /* const shift count */
17:27imirkin: s2 = op1 >> 16 & 0x7f;
17:27imirkin: so there's that
17:27imirkin: the immediate is max 0x7f
17:27imirkin: er wait
17:27imirkin: hold on
17:28imirkin: exp = s1 << s2;
17:28imirkin: but ... this is C
17:28imirkin: do shifts wrap in C?
17:28imirkin: ok. we should test it out.
17:28imirkin: iirc they do
17:30pmoreau: OpenCL compiler is unhappy about that atomic add on shared 🙃
17:30karolherbst: pmoreau: ugh
17:30imirkin: pmoreau: did you target to sm_12?
17:30imirkin: (i dunno how one does that in CL)
17:30pmoreau: Using a global pointer did not help
17:30imirkin: ok. just use ptx i guess
17:31imirkin: might not be hooked up in CL
17:31karolherbst: pmoreau: using this secret API to compile from clc to PTX?
17:31pmoreau: Compiler is like: “(0) Error: unsupported operation”
17:31karolherbst: forcing sm12 didn't help?
17:31karolherbst: that's probably the frontend
17:31pmoreau: I need to double check what the compiler options are for it
17:31karolherbst: set CL to 1.1?
17:32karolherbst: -cl-nv-arch sm_12 -cl-nv-cstd=CL1.1
17:32pmoreau: Just found it
17:35pmoreau: Okay, that helped!
17:39imirkin: i was just guessing about how ld lock / st unlock worked
17:40imirkin: but it seemed VERY similar to the nvc0 variant
17:40imirkin: the kepler variant is slightly evolved relative to the nvc0 variant
17:40imirkin: (on kepler st unlock can also fail)
17:40karolherbst: the heck...
17:40karolherbst: and how do you figure out it failed?
17:41pmoreau: imirkin: Added to the snippet
17:41imirkin: am i missing something, or are those gitlab snippets just really hard to navigate?
17:41imirkin: there's no way to collapse a file etc?
17:42pmoreau: Here is a direct link
17:42karolherbst: run away
17:42imirkin: uh wut
17:42pmoreau: Yeah, I just saw there was no way to collapse the different files… 🤦
17:42imirkin: /*02b8*/ R2G.U32.U32 g[A1+0x0], R2; /* 0xe420878004000001 */
17:42imirkin: /*02c0*/ R2G.U32.U32.UNL g[A1+0x0], R2; /* 0xe4a0878004000001 */
17:43imirkin: so ... i guess you gotta write it and then write it again with the unlock?
17:43imirkin: that ... was not immediately apparent to me
17:43pmoreau: If at first you do not success, try again! 😉 🤣
17:43pmoreau: *succeed, would be better
17:44imirkin: and it loads it a second time too
17:44imirkin: that's just weird.
17:44imirkin: /*0290*/ G2R.U32.UNL.C0 o[0x7f], g [A1+0x0].U32; /* 0x4480c7c8140001fd */
17:44imirkin: /*02a8*/ MOV R1, g [A1+0x0]; /* 0x0423c7801400c005 */
17:44imirkin: oh, interesting
17:44imirkin: the first time it doesn't actually have a destination
17:44imirkin: ok yeah, so this is slightly diff than how it all works on nvc0
17:45imirkin: pmoreau: want to fix it up yourself, or want me to patch it up? i can't right now, but can do tonight
17:45pmoreau: I won’t have the time, other things to take care of, so go for it.
17:45imirkin: ok. tonight then.
17:46imirkin: thanks for testing
17:46imirkin: pmoreau: btw, if it's easy, can you output the same thing for nvc0 (sm_20)?
17:46imirkin: i want to see if they produce similarly weird code, or if there's more
17:46pmoreau: Sure, one sec
17:46imirkin: coz i have another theory - that some of those ops don't respect predicates
17:47imirkin: which would also explain some of these problems
17:49imirkin: yeah ok, so for nvc0 they do it the much more reasonable way
17:49imirkin: IME when nvidia does something weird like this it isn't completely for no reason at all
17:49imirkin: generally they know something we don't :)
17:50imirkin: i'll have a couple of commits to test out various theories
17:50pmoreau: Thanks for your work!
17:53imirkin: yw. been fun to do some bringup for a change.
17:54imirkin: and looks like we'll be able to finally fix 3d images on fermi
17:54imirkin: not like the most useful feature in the world
17:54imirkin: but ... still nice
17:58karolherbst: imirkin: that jetson bug is an evil memory corruption :/
17:58imirkin: karolherbst: i hate evil memory corruption. friendly memory corruption is so much nicer
17:58karolherbst: imirkin: wasn't there some fix for indirect draws?
17:58imirkin: there have been many fixes
17:58imirkin: since the dawn of time
17:58karolherbst: I meant like recently
17:59imirkin: user draws got broken for a period of time
17:59karolherbst: it's probably that
17:59imirkin: but you'd have to be using a git checkout for that iirc
17:59karolherbst: let's see
17:59imirkin: an old one at that
17:59karolherbst: I am at 21
18:00imirkin: my fixes landed in 21.0.0-rc2
18:00imirkin: and also https://cgit.freedesktop.org/mesa/mesa/commit/?h=staging/21.0&id=4dee39f04d025d6ae1ad631caef35b0254516bd7
18:01karolherbst: there is new stuff :/
18:01imirkin: and also this guy i guess: https://cgit.freedesktop.org/mesa/mesa/commit/?h=staging/21.0&id=f763d0f1952151e0fcae596e85600e7f391ea442
18:01karolherbst: the bug report is for rc4
18:01imirkin: but that was in the branchpoint
18:01karolherbst: and there are more commits from mareko on that file
18:01karolherbst: since your changes
18:01karolherbst: I bet one of them broke it
18:02imirkin: i don't see them...
18:02imirkin: other than obviously trivial / irrelevant ones
18:03imirkin: (in the -rc branch)
18:03karolherbst: imirkin: https://gitlab.freedesktop.org/mesa/mesa/-/commits/21.0/src/gallium/drivers/nouveau/nvc0/nvc0_vbo.c
18:03karolherbst: ohh wait
18:03karolherbst: wrong... branch/tag?
18:03karolherbst: ahh, I messed up
18:04imirkin: even if i pick staging/21.0 it doesn't seem to show my commits
18:04imirkin: which is weird
18:04imirkin: karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/commits/21.0/src/gallium/drivers/nouveau/nvc0
18:05imirkin: and my testing included running all of deqp / cts
18:05karolherbst: ahh yeah, there is your fix
18:05karolherbst: but the code is clearly doing garbage things
18:05karolherbst: p indirect->buffer
18:06karolherbst: (struct pipe_resource *) 0xffffa5b02504 <_mesa_validated_drawrangeelements+260>
18:06karolherbst: and calls nvc0_draw_indirect with that
18:06karolherbst: imirkin: I guess tegra_draw_vbo needs fixing too
18:06imirkin: there's a tegra_draw_vbo now?
18:07imirkin: lol yeah
18:07karolherbst: marek touched that as well
18:07karolherbst: let's see
18:07imirkin: doesn't seem obviously wrong
18:07imirkin: but also doesn't seem obviously right
18:12karolherbst: HUH :O
18:12karolherbst: imirkin: 4. ctx->Driver.DrawGallium(ctx, &info, &draw, 1);
18:12imirkin: right ...
18:12imirkin: that's the new super-multi-draw callback that mareko added
18:12karolherbst: but it has 4 arsg
18:12karolherbst: tegra_draw_vbo has 5
18:12imirkin: it's not 1:1
18:13imirkin: that calls DrawGallium in st/mesa
18:13karolherbst: ahh.. optimized builds
18:20imirkin: oh. and we're also looking at the wrong flag. super.
18:20imirkin: pmoreau: --^
18:20karolherbst: imirkin: this pindirect = &indirect; line looks like it breaks
18:20imirkin: they branch on LT, we branch on equality
18:20karolherbst: so indirect is on the stack
18:21karolherbst: and we never write to it
18:21karolherbst: then we pass garbage into nvc0
18:21karolherbst: inside tegra_draw_vbo
18:21imirkin: memcpy(&indirect, pindirect, sizeof(indirect));
18:21karolherbst: only if the clause is true
18:21imirkin: oh, but we should pass in null
18:21imirkin: if it's not
18:22imirkin: i think you can just move the pindirect = &indirect into the if ()
18:22imirkin: there's a second thing
18:22imirkin: which should be unwrapped
18:22imirkin: while you're at it
18:22imirkin: i forget what it's called
18:22imirkin: for the "count" to also be indirect
18:22imirkin: AZDO! yeah!
18:25imirkin: karolherbst: also you can get rid of the if (num_draws) thing at the front
18:25imirkin: since it just passes through to another driver
18:26karolherbst: ohh wait
18:26karolherbst: no, we can't
18:26imirkin: why not
18:26karolherbst: well... we could probably rewrite the entire function to just loop anyway
18:26karolherbst: because it calls itself to unwrap all the stuff? no?
18:26imirkin: but nouveau takes care of that too
18:27imirkin: and there's nothing to unwrap in the draws array
18:27karolherbst: yeah, you are right
18:27imirkin: really i should hook that up better in nouveau
18:27imirkin: since we can skip on some of the validation
18:27imirkin: something for the future.
18:28karolherbst: for the times where perf actually matters
18:28karolherbst: what do you have against AZDO? :D
18:28imirkin: it's good
18:28imirkin: you're the one against it!
18:28imirkin: some of those things are taking it a wee bit too far imho
18:29imirkin: like the indirect count
18:29imirkin: but hey, it was a fun macro to write
18:29karolherbst: oh well
18:29karolherbst: could be worse
18:31imirkin: but reducing validation is also part of AZDO
18:31imirkin: which is what mareko has been driving towards
18:31karolherbst: well, reducing CPU overhead is generally a good idea anyway
18:32karolherbst: just not a big fan of moving more load towards the GPU
18:32imirkin: this is purely CPU overhead though
18:32karolherbst: yeah, probably
18:32imirkin: a bunch of things that can't change don't need to be checked over and over
18:32imirkin: it's not a LOT of overhead
18:32karolherbst: I should CPU profile mesa on the jetson at some point
18:32karolherbst: because there it actually matters
18:33imirkin: but mareko has gotten it all so low that even little bits matter
18:33imirkin: if one is to improve
22:40Lyude: imirkin: so 256x256 cursors definitely do work on kepler (tested with a GK104, and I double checked to verify igt wasn't lying to me). I think the issue does have something to do with the alignment that X is using
22:41imirkin: Lyude: modetest
22:41imirkin: there's a patch on list to make modetest use 256x256 cursors
22:41imirkin: (in that thread, further up)
22:46pmoreau:sent a long message: < https://matrix.org/_matrix/media/r0/download/matrix.org/ZfzjJAeCDaswAUIzrGINyoCp/message.txt >
22:46pmoreau: How is this using 4 GPRs? 🤔
22:46imirkin: there's a min iirc
22:46imirkin: and/or alignment
22:47pmoreau: Oh, could be
22:47pmoreau: And same for the local then?
22:47imirkin: well, min for local is 0
22:48imirkin: that means that there's some l usage further up
22:48pmoreau: That’s the whole emitted program
22:48imirkin: what gpu? that doesn't decode at all
22:49pmoreau: It decodes with nvdisasm, but couldn’t get it to work with envydis for whatever reason.
22:49imirkin: yeah, doesn't decode in envydis at all
22:49pmoreau: (Also, it does run fine on actual hardware :-D)
22:49imirkin: yeah, seems fine
22:50imirkin: i dunno what causes the local usage
22:50imirkin: maybe the thing being compiled has some l which gets optimized out
22:51imirkin: oh lol
22:51imirkin: it decodes fine in envydis
22:51imirkin: i still had -W from when i was pasting 64-bit things into it
22:51imirkin: works much better with -w :)
22:52imirkin: apparently we both made the exact same mistake
22:52imirkin: good times
22:52pmoreau: Yes indeed! 🤣
22:52imirkin: anyways, i'm in the middle of coming up with some candidate fixes for the atomic thing
22:52imirkin: should have something in the next hour or so
22:52imirkin: basically a series of commits
22:52pmoreau: Cool! I am looking at those offsets for shared/inputs and cleaning/fixing that up.
22:53pmoreau: bin.tlsSpace is what gets reported as local afterwards, right?
22:53imirkin: i would think so
22:55pmoreau: Okay, that is set to the value reported by NIR as scratch, and it does report 4 bytes of scratch size.
22:56imirkin: that should be set by codegen
22:56imirkin: not an input
23:13Lyude: there's our buggy cursor
23:18Lyude: and, it's definitely not alignment related (unless I keep getting very lucky when running igt)
23:18imirkin: got it with modetest?
23:19imirkin: modetest sticks it into a dumb bo
23:19Lyude: that could definitely explain it then
23:19Lyude: imirkin: yeah I got it with modetest
23:19imirkin: also does igt place the cursor at random places
23:19Lyude: alignment/stride seems to be the same between the two
23:19imirkin: or only fixed offsets?
23:19Lyude: imirkin: it does have random testing (also like I said I double checked, e.g. I actually did look at the visual output :)
23:20imirkin: i just mean the cursor position
23:20Lyude: imirkin: yeah-that's what I meant, there's a test for placing the cursor in random spots
23:20Lyude: and that one does pass
23:21imirkin: ah ok
23:21Lyude: I'm pretty sure we're seeing the cursor scanout from somewhere other then the fb, because when I tried modetest I was able to see the remanents of one of the cursor patterns igt drew
23:21Lyude: so it is something related to the bo allocation
23:22imirkin: well it's pretty easy to see what modetest does
23:22imirkin: i.e. nothing complicated :)
23:22Lyude: yeah, which is why I'm thinking it might be something with how nouveau handles dumb bos
23:22imirkin: and it does work at 128x128
23:22Lyude: skeggsb: any idea how I can get nouveau to spit out as much info as possible about bo/fb allocations?
23:33Lyude: "Use VRAM if there is any ; otherwise fallback to system memory" that's definitely not the problem but I can't help but to think that's probably wrong in nouveau_display_dumb_create() :)
23:36imirkin: Lyude: when that system memory gets pinned, it'll end up in vram
23:36Lyude: ah right, forgot about that
23:39imirkin: (but maybe not, who knows)
23:39imirkin: but hopefully you're not running out of VRAM to make a cursor bo
23:39imirkin: that'd be a tight fit ;)
23:39Lyude: yeah lol, that's definitely not our issue
23:39Lyude: was just surprised to see that
23:39imirkin: "if only i had more vram, i could fit the cursor in there"
23:39imirkin: you know the next GPUs will have like a 4096x4096 cursor
23:40imirkin: with FP16 alpha blend
23:40imirkin: but packed in a surface that the GPU can't render to
23:41Lyude: hehe, we've already got 32 planes on newer gpus :P
23:43imirkin: Lyude: btw, one additional diff is that modetest (by default) doesn't use atomic
23:43imirkin: or universal planes
23:43Lyude: imirkin: it's reproducible with atomic
23:43imirkin: dunno what igt does
23:43Lyude: (I almost always use -a with modetest)
23:43imirkin: [of course neither does xf86-video-modesetting]
23:44Lyude: I'm convinced it has something to do with how we're allocating the bo in nouveau tbh
23:44imirkin: you could be right
23:44imirkin: what does igt do?
23:44Lyude: imirkin: we call down to libdrm's nouveau functions directly, which causes us to use the NOUVEAU_GEM_* ioctls
23:45Lyude: so my assumption is there's some requirement for scanout surfaces we're not enforcing in the dumb path, but i'm not sure what
23:46Lyude: oh-i have an idea that might tell me what
23:49pmoreau: imirkin: I rework the offsets for shared memory in the NIR frontend: https://gitlab.freedesktop.org/pmoreau/mesa/-/commit/a3004a82158d89c6a7ee2f50e1099f7645c06400
23:57imirkin: pmoreau: just pushed an update
23:58imirkin: with a bunch of commits at the end
23:58imirkin: which try various things
23:58imirkin: try the full branch first
23:58imirkin: and if that works, peel it back commit by commit
23:58imirkin: and if that doesn't work, would appreciate NV50_PROG_DEBUG=1 output (tgsi please)