00:41 yates: has anyone noticed that a system with an nvidia video card, the nouveau driver, and a dual-input monitor (hdmi), that the disconnect caused by switching monitors causes the video to die?
00:42 yates: if you have the nvidia system selected while booting, it comes up fine, but after switching to an alternate computer, the video dies.
00:42 yates: s/switching monitors/switching monitor inputs/
00:42 yates: the system with nvidia card is f32/cinnamon, the other system has an intel chipset and is lubuntu 18.04lts/lxde
00:42 yates: the lubuntu system comes back every time, no problem. i've switched cables and inputs and determined the problem is at the computer
00:42 yates: here's my /var/log/Xorg.0.log. i don't see anything that i recognize as a problem. http://paste.ubuntu.com/p/DSxmSM8tCb/
00:44 imirkin: yates: dmesg might be more instructive
00:45 imirkin: i've not seen any issues flipping between diff hdmi inputs
00:45 imirkin: but it'll depend on the specific monitor as to precisely what that does
00:45 imirkin: oh interesting
00:46 imirkin: can you confirm that 4k@60 is working as expected on this monitor?
00:46 imirkin: i suspect that the monitor may be losing its SCDC settings when flipping between inputs, and nouveau doesn't realize it should re-set those
00:47 imirkin: since it thinks that they're still set
00:47 imirkin: that's just a quick guess based on 0 data
00:47 imirkin: one quick way to check is if you change it to 4k@30 (or a lower resolution @60), these problems go away
00:48 yates: imirkin: here is the dmesg output: http://paste.ubuntu.com/p/2RQcnXcmbr/
00:49 yates: [13474.481587] nouveau 0000:01:00.0: DRM: DDC responded, but no EDID for HDMI-A-1
00:49 imirkin: yeah, saw that. not ideal, but not necessarily fatal, since it might have responded a bit later
00:49 yates: what is scdc?
00:49 imirkin: it's like edid
00:49 imirkin: but on a different i2c address
00:50 imirkin: it controls scrambling and tmds clock division
00:50 imirkin: which are both required to achieve > 340mhz pixclk's with hdmi
00:50 imirkin: but both the source and sink must be in agreement about it
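As a rough illustration of the agreement described above, the sketch below computes the byte a source would write to the sink's SCDC TMDS configuration register for a given TMDS character rate. The i2c address, register offset and bit positions follow the definitions in the Linux kernel's `drm_scdc.h`; the helper name and the MHz-based interface are hypothetical, and the character rate is assumed to equal the pixel clock (8 bits per component). This is not how nouveau itself is structured.

```python
# Sketch only: the two SCDC settings source and sink must agree on.
SCDC_I2C_ADDR = 0x54                  # 7-bit i2c address of SCDC (EDID lives at 0x50)
SCDC_TMDS_CONFIG = 0x20               # register offset of TMDS_Config
SCRAMBLING_ENABLE = 1 << 0            # bit 0: scrambling on
TMDS_BIT_CLOCK_RATIO_BY_40 = 1 << 1   # bit 1: clock lane runs at 1/4 of the character rate

HDMI14_MAX_MHZ = 340.0                # above this, both settings are required


def tmds_config_for(rate_mhz: float) -> int:
    """Return the TMDS_Config byte for a TMDS character rate (hypothetical helper)."""
    if rate_mhz > HDMI14_MAX_MHZ:
        return SCRAMBLING_ENABLE | TMDS_BIT_CLOCK_RATIO_BY_40
    return 0


print(hex(tmds_config_for(594.0)))  # 4k@60 -> 0x3
print(hex(tmds_config_for(297.0)))  # 4k@30 -> 0x0
```

If the sink forgets this byte (for example after an input switch) while the source keeps sending a scrambled, clock-divided signal, the two sides no longer agree, which matches the failure guessed at in this conversation.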
00:54 yates: imirkin: what app can i use to change the monitor settings? i am ssh'ed in from another system with x11 ability so i can run a gui app from my second system and change it
00:54 imirkin: DISPLAY=:0 xrandr -r 30
00:54 imirkin: er
00:54 imirkin: sorry
00:54 imirkin: er no, that's right
00:54 yates: as normal user?
00:54 yates: or root?
00:54 imirkin: as the same user who's logged in on the main thing
00:55 imirkin: (that's the easiest)
00:55 imirkin: if it's at the display manager, it's achievable to change it, but not easy
00:55 imirkin: you're better off logging in
00:55 yates: will that change it across reboots?
00:56 imirkin: one-time
00:56 yates: ok let me try that
01:00 yates: that worked a couple times, then it died again..
01:01 yates: seemed to be better
01:01 imirkin: can you just try a lower resolution? it might decide to flip back to @60 for some reason
01:01 imirkin: run
01:01 imirkin: DISPLAY=:0 xrandr -s 1920x1080
01:01 yates: k
01:07 yates: imirkin: that didn't work right. the xrandr -x 1920x1080 made the display look "squashed". and then switching caused it to die again
01:07 imirkin: -s
01:07 imirkin: not -x
01:07 imirkin: (what's -x?)
01:07 yates: due to this problem, i'm having to reboot each time
01:07 yates: yes, i meant -s
01:07 imirkin: hm
01:07 imirkin: "squashed" is right, since we reduced the resolution
01:08 imirkin: i bet that powering the display on/off would also fix it
01:12 yates: no, powering the monitor off and on did not fix it
01:12 imirkin: what about cable unplug/replug?
01:13 imirkin: and what about, when it's in a broken state
01:13 imirkin: doing
01:13 imirkin: DISPLAY=:0 xrandr -r 30
01:14 yates: how can i do that in its broken state? i can't see!
01:15 imirkin: you're ssh'd in no?
01:15 yates: yes
01:15 imirkin: so ... do it over ssh
01:15 imirkin: make sure you're in as the same user as the one who's logged in "locally" on the machine
01:16 imirkin: otherwise the X credentials won't match up
01:16 yates: Rate 30.00 Hz not available for this size
01:16 imirkin: erm
01:16 imirkin: pastebin just "DISPLAY=:0 xrandr"?
01:17 yates: http://paste.ubuntu.com/p/WGrQBYxnvw/
01:17 yates: that's different than it was!
01:17 imirkin: uhhhhh
01:17 imirkin: and the monitor is plugged in?
01:17 yates: yes
01:17 imirkin: that's not what the video card thinks :)
01:17 yates: it's the same one i'm using now, with the lubuntu system
01:17 imirkin: note how all the connectors say 'disconnected'
01:18 imirkin: it has that 4k@60 mode left-over
01:18 imirkin: but that's just an X internal detail
01:18 yates: i see that now
01:18 imirkin: that mode isn't really there
01:18 yates: well, it IS disconnected - i have my monitor switched to HDMI2, which is my lubuntu system
01:18 imirkin: oh
01:19 imirkin: oh i see, you only have one monitor total
01:19 yates: that's where i'm sshing from
01:19 yates: i do have another i could hook up
01:19 imirkin: does it have picture-in-picture?
01:19 imirkin: i used to do that a lot with an older monitor, hooking up the GPU i was working on to s-video or whatever
01:20 yates: yes, i have one monitor with two hdmi inputs, one from my f32/cinnamon system, and one from my lubuntu-18.04lts/lxde system
01:21 yates: it is the switching between these two that is the entire problem
01:22 imirkin: right
01:22 imirkin: does the monitor have a picture-in-picture setting
01:22 yates: i could run the command when the HDMI1 is selected. shall i try that?
01:22 imirkin: which allows showing both at once?
01:22 imirkin: yeah
01:22 imirkin: just like hit enter on the "other" machine
01:22 imirkin: after you've switched
01:22 yates: right
01:22 yates: it doesn't appear to have PIP
01:23 yates: ok here: http://paste.ubuntu.com/p/tS5YKntJGs/
01:24 yates: did you want me to try "DISPLAY=:0 xrandr -r 30" in a similar manner?
01:24 imirkin: yes
01:25 imirkin: btw, that's really bad of this monitor to do that. it's supposed to respond to EDID even if it's turned off
01:25 imirkin: (with power supplied by the source)
01:27 yates: it works for one switch, then it stops working. when it stopped working, i ran xrandr (no opts) and got this: http://paste.ubuntu.com/p/vZ2YwqxJTy/
01:28 yates: so it seems to be switching back to 60
01:28 imirkin: yeah
01:28 imirkin: ok
01:28 imirkin: so
01:28 imirkin: and to confirm, doing -r 30 "fixes" it that time right?
01:28 imirkin: (i.e. no reboot required/etc)
01:28 yates: yes
01:28 yates: right, no reboot required
01:28 imirkin: so my money's on SCDC being messed up
01:29 yates: you mean it's my monitor?
01:29 imirkin: no
01:29 imirkin: i mean, sorta
01:29 imirkin: your monitor is doing something in a way that's different than other monitors
01:29 imirkin: and all this HDMI 2.0 stuff hasn't gotten an immense quantity of testing on nouveau in the first place
01:29 imirkin: it's a somewhat awkward protocol
01:30 imirkin: since it requires the sink and source to be in agreement
01:30 yates: i just noticed that my lubuntu system is running at 30 Hz, so i guess that's why it always works?
01:30 imirkin: probably, yeah
01:30 imirkin: older GPU
01:30 yates: i don't need 60 Hz, how do i force it to 30 Hz permanently? (on my nvidia system)
01:30 imirkin: this is up to your DE
01:31 imirkin: i don't know anything about ubuntu or its derivatives
01:31 yates: fedora
01:31 yates: f32/cinnamon
01:31 imirkin: i don't know anything about fedora or its derivatives
01:31 yates: that's the one with the nvidia
01:31 imirkin: ;)
01:31 yates: ha
01:31 yates: hokay let me poke around
01:31 yates: thanks much, imirkin
01:31 imirkin: you can also force the issue
01:31 yates: have you worked on the nouveau driver?
01:32 yates: huh? force the issue?
01:32 imirkin: i did the initial HDMI 2.0 impl, among other things
01:32 imirkin: i think you can run with nouveau.hdmimhz=300 or something
01:32 imirkin: which will completely disallow the high-freq modes
01:32 imirkin: let me double-check though
01:32 yates: you mean in the kernel boot modeline?
01:32 imirkin: cmdline, yea
01:32 yates: cool
01:32 yates: well, let me try to configure first - that's easier
01:32 imirkin: yeah, should work
01:33 imirkin: nouveau.hdmimhz=300 will disallow modelines > 300mhz pix clk
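To make the effect of that cap concrete, here is a small sketch that filters a mode list by pixel clock. The pixel clocks are the standard CEA-861 values for these modes; the `allowed_modes` helper is hypothetical and only mirrors the "disallow modelines > 300mhz" description, not the driver's actual mode validation code.

```python
# Pixel clocks in MHz for the standard CEA-861 timings of these modes.
MODES = {
    "3840x2160@60": 594.0,
    "3840x2160@30": 297.0,
    "1920x1080@60": 148.5,
}


def allowed_modes(modes, hdmimhz):
    """Keep only modes whose pixel clock does not exceed the cap (hypothetical helper)."""
    return sorted(name for name, clk in modes.items() if clk <= hdmimhz)


print(allowed_modes(MODES, 300))  # ['1920x1080@60', '3840x2160@30']
```

With a 300 MHz cap, 4k@60 is never offered, so the desktop cannot silently flip back to the mode that needs scrambling and clock division.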
01:36 yates: ok so i configured it to 30 Hz permanently and it's working over multiple switches.
01:36 imirkin: cool!
01:36 imirkin: ultimately this is a bug in nouveau
01:36 imirkin: but this works around it for you for now
01:36 imirkin: which monitor is this ftr?
01:37 yates: how can it be? if the monitor won't allow SCDC to be read
01:37 imirkin: i doubt it's that
01:37 yates: it's a 27-inch LG
01:37 imirkin: i think it just loses SCDC settings
01:37 imirkin: when nouveau doesn't think they'll get lost
01:37 yates: aha
01:37 imirkin: and so nouveau is sending a scrambled / clock-divided signal
01:37 imirkin: and the monitor isn't expecting it
01:37 imirkin: (don't ask me wtf clock division is)
01:37 yates: so it just needs to detect that scenario and then reestablish the setting if detected?
01:38 imirkin: (and i know what scrambling is, but i don't really see how it helps in this case)
01:38 yates: usually it means a higher frequency clock is divided down.
01:38 imirkin: yes
01:38 imirkin: right, but ... what does that mean in practice? :)
01:38 imirkin: so you have a 600mhz pixclock, and there's a 40x divider
01:38 imirkin: so now you're sending a 15mhz or whatever clock ... but you still need to get the data across
01:39 imirkin: er, 4x, not 40x
01:39 yates: not sure - i'd have to dive into all the parameters to get my head wrapped around it.
01:39 imirkin: those are the only two parameters
01:39 imirkin: scrambling and div-clk-by-4
01:39 yates: the pixclk and the divider ratio?
01:39 imirkin: both need to be enabled for the higher rate modes to work
01:39 imirkin: the division is always by 4 for the higher modes
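The arithmetic behind the divide-by-4 can be sketched as follows. As far as I understand HDMI 2.0, only the clock lane slows down above 340 MHz while the data lanes keep running at the full rate, so each clock period spans 40 bits per lane instead of 10 and the sink has to multiply the clock back up. The helper name is hypothetical and the sketch assumes the character rate equals the pixel clock.

```python
def clock_lane_mhz(char_rate_mhz):
    """Return (clock lane frequency in MHz, data bits per lane per clock period)."""
    divided = char_rate_mhz > 340.0
    clock = char_rate_mhz / 4 if divided else char_rate_mhz
    bits_per_clock = 40 if divided else 10
    return clock, bits_per_clock


print(clock_lane_mhz(600.0))  # (150.0, 40)
print(clock_lane_mhz(297.0))  # (297.0, 10)
```

So the 600 MHz example ends up with a 150 MHz clock on the wire, and the amount of data per second on the data lanes is unchanged.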
01:40 imirkin: this is like a HDMI-level thing
01:40 imirkin: the cable can't handle the high-freq signals
01:40 imirkin: so they play games
01:40 imirkin: by being clever
01:40 imirkin: but they're cleverer than me, at least when it comes to signals stuff in practice :)
01:40 imirkin: so i don't worry about it too much
01:40 yates: ultimately it comes down to more bandwidth required. no matter what tricks.
01:41 yates: you can reduce clock rate by expanding the number of data lanes.
01:41 imirkin: not what's happening here
01:41 imirkin: same data lates
01:41 imirkin: lanes*
01:41 imirkin: same everything
01:41 imirkin: lower clock, scrambling, and more data
01:41 imirkin: magic.
01:41 imirkin: scrambling usually means that it goes through an encoder which ensures that you don't get too many strings of 1's or 0's in a row
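A generic scrambler of this kind can be sketched by XORing the data with the output of a linear feedback shift register. The code below uses a textbook 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) purely for illustration; the polynomial, the seed and all function names are assumptions and this is not the scrambler HDMI 2.0 specifies. Note that such a scrambler makes long runs statistically unlikely rather than strictly impossible.

```python
def lfsr_stream(seed, nbits):
    """Yield nbits pseudo-random bits from a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    state = seed & 0xFFFF
    assert state != 0, "an all-zero LFSR state never changes"
    for _ in range(nbits):
        feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        out = state & 1
        state = (state >> 1) | (feedback << 15)
        yield out


def scramble(bits, seed=0xACE1):
    """XOR the data with the LFSR stream; applying it twice restores the input."""
    return [b ^ k for b, k in zip(bits, lfsr_stream(seed, len(bits)))]


def longest_run(bits):
    """Length of the longest run of identical consecutive bits."""
    best = run = 0
    prev = None
    for b in bits:
        run = run + 1 if b == prev else 1
        prev = b
        best = max(best, run)
    return best


data = [0] * 64                      # worst case: one long run of zeros
tx = scramble(data)
print(longest_run(data), longest_run(tx))
print(scramble(tx) == data)          # the receiver recovers the data with the same seed
```

The receiver only recovers the data if it runs the same LFSR from the same state, which is one more reason the source and sink have to agree on whether scrambling is enabled.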
01:42 yates: perhaps they have a mode where the clock sent along the cable is lower, but they tell the monitor it has to double it
01:42 imirkin: perhaps
01:42 imirkin: also leading/falling edge style cheating
01:42 imirkin: who knows
01:42 yates: right
01:42 yates: DDR blah
01:43 imirkin: pre-hdmi 2.0, hdmi was so simple
01:44 yates: i thought about making my own hdmi device for my tv/audio system, til i heard they want you to ante up $10K for a license...
01:45 yates: yes, scrambler or whitener
01:45 yates: thanks again, much, imirkin
01:45 imirkin: yw
01:47 imirkin: yates: can you get the exact model of the monitor? or just pastebin xrandr --verbose on it?
01:47 imirkin: er nevermind
01:47 imirkin: i have your EDID already in one of your pastes
01:48 yates: sure hang on
01:48 imirkin: "LG Ultra HD"
01:48 imirkin: is what it says in the EDID
01:48 yates: ah ok
01:50 yates: http://paste.ubuntu.com/p/zmS2BXhqvx/
01:50 imirkin: that was done when it wasn't plugged in, but it's fine, i got the edid from one of your earlier pastes
01:50 imirkin: it's dumped in xorg log
01:52 yates: http://paste.ubuntu.com/p/YMjJ6tckCd/
01:52 yates: :)
08:29 pmoreau: imirkin: Here is what I ended up with to fix the spill offset issues; I think it makes sense, but would definitely not mind a second opinion on it especially from someone who knows a lot better than me how this is supposed to work. https://gitlab.freedesktop.org/pmoreau/mesa/-/commit/e12de28192cd6996646c00a87a28a749d727c701
09:48 pmoreau: > Todo wtf is up with $a7?
09:48 pmoreau: I agree with that statement from envytools; the hardware has been throwing at me some “Invalid opcode” for the following: `R2G.U32.U32 g[A7+0xd], R1; /* 0xe42047840c001a01 */` and `LST.S64 local[A7+0x0], R0; /* 0x60800784dc000001 */`.
09:49 pmoreau: Both envydis and nvdisasm were happy to decode those instructions without issues, but the hardware was having none of it. So might need to restrict $a7’s usage.
16:59 imirkin: pmoreau: instead of *2
17:00 imirkin: pmoreau: you want * lval->reg.size / units
17:00 imirkin: er
17:00 imirkin: actually let me read over it more carefully
17:01 tavvva: Hello
17:03 pmoreau: imirkin: It sounds like `* lval->reg.size / units` will achieve the same thing but probably be more robust in case we later change the value of `units`.
17:03 imirkin: pmoreau: so the subtlety is that nv50 has 16-bit colors (and 32-bit regs are expressed as pairs of colors), while nvc0+ has 32-bit colors
17:03 imirkin: pmoreau: unfortunately i don't remember how (or whether) this is reflected in e.g. compMask
17:03 imirkin: the fixed * 2 seems bogus pretty much no matter how you slice it...
17:04 pmoreau: Did you check the commit message? I tried to explain it there
17:04 imirkin: i'm trying to grok it
17:04 imirkin: but i'm wondering if it looks the same on e.g. nvc0
17:06 imirkin: pmoreau: there's also some difference if e.g. a merge is being spilled
17:06 imirkin: vs a component of a merge
17:06 imirkin: the lval->reg.size might be different
17:07 pmoreau: Okay, I understand better why I got those values on nv50: I thought the values would be expressed in bytes, not in 16-bit multiples.
17:07 imirkin: lval->reg.size is expressed in bytes
17:07 imirkin: so here's the question... what's the lval here?
17:07 imirkin: is it the U64 which is being spilled?
17:07 imirkin: or is it the U32 which is half of the U64?
17:09 pmoreau: Let me see if I can remember…
17:10 imirkin: i think the problem is that this logic works for one case but not the other
17:10 imirkin: or something.
17:10 imirkin: ok
17:11 imirkin: and compMask *is* done in terms of colors
17:11 imirkin: so because on nv50 colors are 2x
17:12 imirkin: you actually need to divide ffs(lval->compMask) by 2
17:12 pmoreau: If I multiply by the register size, I agree.
17:12 imirkin: although..... hm
17:12 imirkin: oh right
17:12 imirkin: on nvc0, it's 1 color per 32-bit
17:12 imirkin: so this works out
17:13 imirkin: actually this is just bogus
17:13 imirkin: heh
17:13 imirkin: instead of * lval->reg.size
17:13 imirkin: it should be * units
17:14 imirkin: targ->getFileUnit(lval->getFile())
17:14 imirkin: or something
17:14 imirkin: << getFileUnit(). it's a shift :)
17:16 pmoreau: Okay, let me try that out
17:18 imirkin: and this works because compMask is expressed in terms of colors, not bytes. so e.g. a u64 value will just have more colors associated with it
17:18 imirkin: so the offsets should all work out
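The offset computation being converged on here can be sketched as below: take the index of the first color in `compMask` and shift it by the register file's unit. The shift values are assumptions read off this discussion (1 for nv50's 16-bit colors, 2 for nvc0's 32-bit colors), and `slot_offset` is a hypothetical stand-in, not the actual Mesa function.

```python
def ffs(mask):
    """1-based index of the lowest set bit, 0 if no bit is set (like C's ffs())."""
    return (mask & -mask).bit_length()


def slot_offset(base, comp_mask, file_unit_shift):
    """Byte offset of a spilled value: first color index scaled by the color size."""
    return base + ((ffs(comp_mask) - 1) << file_unit_shift)


# Assumed nv50 case: 16-bit colors (shift 1); a 32-bit value occupying colors 2 and 3.
print(slot_offset(0, 0b1100, 1))  # 4
# Assumed nvc0 case: 32-bit colors (shift 2); a 32-bit value occupying color 1.
print(slot_offset(0, 0b0010, 2))  # 4
```

Both cases land on byte offset 4, which is why scaling by the color unit works on both generations while a fixed `* 2` or `* lval->reg.size` does not.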
17:19 pmoreau: Awesome! I will change the commit message to reflect that. I knew I was missing something and a second pair of eyes would help. :-)
17:20 imirkin: as with all matters of RA, i'm not 100% sure
17:20 imirkin: so ... some testing would be prudent
17:20 pmoreau: Do you happen to know how to prevent Nouveau from using $a7? I tried changing `TargetNV50::getFileSize()` for `FILE_ADDRESS` to return 3 (instead of 4), but that did not help.
17:21 imirkin: looks like this was one of calim's last contributions: 2e9ee44797fcce10 -- i guess he missed it / forgot
17:21 imirkin: mmmmmm
17:22 imirkin: if file size is 4, it should definitely never use a7 in the first place
17:22 imirkin: if file size is 4 it should use 0..3
17:22 imirkin: do you have a sample where this happens?
17:22 imirkin: if so, can you pastebin the NV50_PROG_DEBUG=255 for it? (and please don't use gitlab snippet thing, or give me a raw link or something)
17:22 pmoreau: I was thinking, maybe it is expressed in terms of 32-bit values and $a being 16-bit, a value of 4 would allow 8 of them.
17:23 imirkin: i didn't even know there were 7 address regs
17:23 imirkin: i only thought there were 4
17:23 imirkin: i guess that's $c
17:23 pmoreau: I have multiple samples where this happens. One sec.
17:24 pmoreau: Me neither; I’ve been looking at https://envytools.readthedocs.io/en/latest/hw/graph/tesla/cuda/isa.html#registers
17:24 imirkin: yeah, i don't think that's used
17:25 imirkin: and internally, $a0 -> $a1
17:25 imirkin: i.e. we will compute $a0, but then it will get fixed up to $a1 at emission time
17:33 pmoreau: imirkin: Here you go; I just noticed that the emitter thinks it is generating `st u32 # s[$r6+0x3c] $r3` but in practice we get `st b32 s[$a7+0x3c] $r3`. https://gitlab.freedesktop.org/pmoreau/mesa/-/snippets/1921/raw/main/Out.txt
17:36 pmoreau: Ah right, I added prints after each peephole pass. It looks like that weird stuff is coming from the IndirectPropagation pass, if I am not mistaken.
17:42 pmoreau: Disregard my previous comment, I was looking at the wrong part of the code.
17:46 pmoreau: Looks like it’s coming from splitting those 64-bit stores to local memory into 32-bit stores, and I’m quite sure I’m the one who wrote that… time to go and fix it! Thanks for the help. :-D
17:53 pmoreau: Yup, now it works a lot better 😅
18:04 tavvva: imirkin: sorry to interrupt your thoughts :] have you had a chance to look at the NVS 140 video thingy?
18:07 tavvva: imirkin: I'll have to return the laptop soon and the chance to test the patches will drop
19:02 imirkin: tavvva: not yet, sorry. i haven't been able to make progress on getting my test env set up
19:03 imirkin: pmoreau: cool. yeah, it's important to see what the emitter thinks it's doing
19:04 pmoreau: I usually do that, but for some reason I did not do it this time. :-/
19:04 imirkin: yea no worries
19:06 tavvva: imirkin: ok, should I try it later or do you see it as improbable that you could make progress with it?
19:06 imirkin: tavvva: it's improbable, sorry
19:06 imirkin: i dunno wtf is wrong with this G84, but it _really_ doesn't want to work
19:06 imirkin: maybe i can just reset it... let's try that
19:08 tavvva: imirkin: ok, thanks for your time anyway
19:08 imirkin: yea, reset doesn't help
19:08 imirkin: tavvva: sorry =/
19:09 tavvva: imirkin: no need to be sorry ... that's life :]
19:11 tavvva: imirkin: do you know anyone else who could look at the issue or you're the only one who has the required knowledge for that?
19:11 imirkin: dunno
19:12 imirkin: i definitely don't want to volunteer anyone else for it.
19:15 tavvva: imirkin: may I know more about the blocker you were/are facing? can I help somehow?
19:16 imirkin: tavvva: [274452.417103] nouveau 0000:04:00.0: bus: MMIO write of 0000003f FAULT at 00fd84
19:16 imirkin: and related issues
19:16 tavvva: imirkin: like giving you the access to the hardware or something
19:16 imirkin: the VP2 engine seems like it's semi-dead
19:17 tavvva: imirkin: does it mean it's a kind of regression that appeared recently?
19:17 imirkin: dunno
19:17 imirkin: i haven't tried it with this specific board ever, i think
19:17 imirkin: i could also plug a different board in
19:17 imirkin: but that requires reboots, etc
19:18 tavvva: imirkin: and if I give you access to the hardware, would it help
19:18 tavvva: ?
19:18 imirkin: tavvva: no, i'd rather do this locally
19:21 tavvva: imirkin: ok, in that case I have no other ideas how to move forward ...
19:21 tavvva: imirkin: I'll try to ask again in few weeks/months
19:21 tavvva: thank you :]
19:23 tavvva: c.u. guys
19:34 imirkin: pmoreau: sounds like you're making good progress too?
19:34 pmoreau: Fixed 6 tests over the weekend. :-)
19:35 imirkin: 6 down, 70000 to go?
19:35 pmoreau: :-D
19:35 imirkin: but this spilling fix should actually be good for everything
19:35 imirkin: it's a pretty subtle problem
19:35 imirkin: that said, spilling is less frequent on nv50 coz it has more regs
19:35 pmoreau: Down to 13 for the basic test
19:36 pmoreau: I am not sure about that: with the CTS I usually end up constrained to only 16 regs.
19:41 imirkin: i mean for graphics
19:41 imirkin: you get 128 regs
19:41 pmoreau: Ah, ok
19:41 pmoreau: Among the most annoying issues left for fixing some of the subtests of basic, are timeouts (“gr: TRAP_MP_EXEC - TP 0 MP 0: 00000008 [TIMEOUT] at 000398 warp 0, opcode f0000001 e0000001”) and still getting some LOCAL_LIMIT_WRITE despite the tls realloc working.
19:51 imirkin: sounds tricky
20:35 pmoreau: I am not impressed: the binary generated by NVIDIA for SM 11 `BAR.ARV.WAIT b0, 0xfff; /* 0x00000000861ffe03 */`, what my G96 thinks about it: “gr: TRAP_MP_EXEC - TP 0 MP 0: 00000010 [INVALID_OPCODE] at 0000e0 warp 10, opcode 861ffe03 00000000”
20:35 imirkin: at least we're consistent!
20:37 pmoreau: :-)
20:38 pmoreau: I guess I’ll need to try to follow the control flow to figure out what is going wrong with that kernel and why the last instruction is timing out for some of the warps.
20:40 imirkin: pmoreau: hm, i wonder if there's some control flow around the barrier
20:40 imirkin: and some threads exit before hitting it?
20:41 pmoreau: There definitely is some control flow around the barrier, as the barrier sits inside a do-while loop.
20:42 imirkin: hm, scary
20:42 imirkin: i don't know how barrier works tbh
20:42 pmoreau: The conditional is based on the size of the block so all threads within the block should always have the same value there.
20:45 imirkin: hopefully
20:55 pmoreau: I’ll leave that timeout be for now and focus on the LOCAL_LIMIT_WRITE errors instead: that way I can include it in my TLS rework series and submit it, though I’ll need to do some testing with graphics workloads too before that.
20:55 imirkin: yeah, at least run like dEQP and KHR-GL33 test suites
20:56 pmoreau: What is the scope of a nv50_screen BTW? Would it be possible on Tesla to get both a non-compute program and a compute program in the same screen?
20:56 imirkin: i'd like to get us a standard "test set" that we can run easily, e.g. a docker thing
20:56 imirkin: sure
20:56 pmoreau: That would be great!
20:56 imirkin: with ES 3.1
20:56 imirkin: with my patches, there are only a small handful of non-conformant situations with ES 3.1
20:57 imirkin: the big one is textureGather with an explicit component
20:57 imirkin: the hw just doesn't support that
20:57 imirkin: earlier tesla's don't have texgather at all, and the later ones do only the 'r' component
20:57 imirkin: other than that, i think the DX10.1 tesla's can handle it all.
20:58 pmoreau: Okay, so I’ll need to make sure I support that case.
21:00 pmoreau: There is no barrier between the different 3D stages, right? So a fragment shader could already start running before all vertex shader instances are done (as long as the vertex shaders for all the points of the primitive containing that fragment did finish).
21:00 imirkin: correct
21:00 imirkin: in fact multiple draws can sometimes run in parallel
21:01 pmoreau: And those draws being in the same screen?
21:01 imirkin: yes
21:02 pmoreau: Grrrr
21:02 imirkin: but that shouldn't matter
21:02 imirkin: i wouldn't worry _too_ much about the mixing stuff
21:02 imirkin: usually it requires explicit barriers
21:02 imirkin: in order to work properly
21:03 karolherbst: imirkin: you mean like different SMs run different shaders? Or just same shaders, but different parameters?
21:03 imirkin: karolherbst: i don't know all the specifics.
21:03 imirkin: the hw does it when it thinks it can
21:03 pmoreau: Shouldn’t it? If both the vertex and fragment shader use local memory, we need to make sure they don’t step on each other’s toes.
21:03 karolherbst: yeah.. I think the former thing is getting used more often the more SMs are on the GPU :D
21:03 karolherbst: but the latter should be "always" possible I think?
21:03 karolherbst: dunno..
21:04 karolherbst: and also no idea how much impact the driver has here
21:04 karolherbst: pmoreau: yeah, hence we reserve space per SM per thread
21:04 karolherbst: per shader stage even?
21:05 pmoreau: Not per shader stage IIRC, which is the problem
21:05 karolherbst: uhhh
21:05 karolherbst: huh?
21:05 imirkin: he means like let's say VS and FS both use lmem
21:05 karolherbst: sure
21:05 karolherbst: I get that
21:05 imirkin: pmoreau: ultimately the shaders have to run somewhere
21:06 karolherbst: I am just surprised we don't partition according to stage
21:06 imirkin: the tls should be based on the size of the chip, i think
21:06 karolherbst: but...
21:06 karolherbst: I mean..
21:06 imirkin: not based on the number of "different" shaders
21:06 karolherbst: per SM per thread _should_ be enough
21:06 pmoreau: We definitely give the same starting address into the TLS bo for all 3D stages, so even if it was per stage it wouldn’t matter.
21:07 karolherbst: pmoreau: which should be fine
21:07 karolherbst: do we even set more than offset+size?
21:07 pmoreau: Not if the fragment shader can start running while instances of the vertex shader are still running.
21:07 karolherbst: pmoreau: well...
21:07 karolherbst: not on the same SM
21:08 karolherbst: SM has to be done with its execution before it can load a different shader
21:08 pmoreau: An SM can’t execute two different shaders?
21:09 karolherbst: hard to say
21:09 karolherbst: there is a concept of blocks..
21:10 karolherbst: anyway
21:10 karolherbst: it doesn't matter
21:10 karolherbst: as local memory is allocated per thread
21:10 karolherbst: we might have to set the same size per thread across all stages?
21:11 karolherbst: dunno what happens if the VS has 40k and the FS 80k
21:11 karolherbst: and then a SM starts executing the FS while finishing the VS? if that's even possible
21:12 karolherbst: but then again
21:12 karolherbst: it doesn't matter what is possible on the same SM
21:12 pmoreau: Okay, if we have the same size per thread for all stages, it’s indeed fine.
21:12 karolherbst: I _think_ if one SM still works on a VS with a different local memory size than a different SM running a FS...
21:12 pmoreau: If we don’t have the same size, that could be trouble.
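One way to read the "same size per thread for all stages" idea is to size the per-thread local memory by the largest requirement among the bound stages. The sketch below does only that; the function name, the stage keys and the 16-byte alignment are hypothetical and say nothing about how the driver actually lays out its TLS buffer.

```python
def tls_bytes_per_thread(stage_lmem_sizes, align=16):
    """Pick one per-thread local memory size that fits every bound stage."""
    biggest = max(stage_lmem_sizes.values(), default=0)
    return (biggest + align - 1) // align * align   # round up to the alignment


# The example from the discussion: VS needs 40k, FS needs 80k.
print(tls_bytes_per_thread({"vs": 40 * 1024, "fs": 80 * 1024}))  # 81920
```

With a single shared size, a thread's slot is large enough no matter which stage runs in it, at the cost of over-allocating for the smaller stages.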
21:14 karolherbst: I could see how that messes things up
21:14 karolherbst: pmoreau: okay, cool
21:14 karolherbst: next question, what happens if a CS runs in parallel :p
21:14 karolherbst: I am sure that's not possible on your hw, but ... maybe that's supported with ampere? :D
21:14 karolherbst: dunno
21:14 karolherbst: imirkin: do you know if any nvidia gen can run a graphics pipeline alongside a compute pipeline?
21:15 pmoreau: I think it’s only possible since Ampere
21:15 karolherbst: okay
21:16 karolherbst: so we might have to serialize in a few places.. great...
21:16 pmoreau: Or allocate more memory :-)
21:16 karolherbst: ehhh...
21:16 karolherbst: well
21:16 karolherbst: can't do it while stuff is running
21:17 karolherbst: pmoreau: do you know what we should do? per stage tls space or whatever?
21:17 karolherbst: I don't know how many possibilities we have here
21:17 pmoreau: True
21:17 karolherbst: but it might make sense to just... split it
21:19 pmoreau: > in fact multiple draws can sometimes run in parallel
21:19 pmoreau: imirkin But those multiple draws would be using the same set of shaders, correct? Since the screen only contains up to one instance of each shader stage, if I am not mistaken.
21:19 imirkin: i'm talking about the hw doing the parallelism
21:19 imirkin: you can submit multiple draws
21:19 imirkin: it won't necessarily wait for each draw to complete
21:19 imirkin: before starting the next one
21:20 karolherbst: I think if we change the TLS config we either WFI or we hope the command itself will do it?
21:20 imirkin: yeah, probably
21:20 karolherbst: the issue is.. are draw calls interfering with each other, or just shader stages
21:20 imirkin: and perhaps the hw only does it if tls isn't used
21:20 imirkin: i dunno
21:22 karolherbst: yeah...
21:22 karolherbst: too many unknowns here
21:23 pmoreau: I’ll bail out for today; I’ll continue looking into fixing my non-parallel case tomorrow.
21:23 pmoreau: At least the rework should already be an improvement over the current situation, even if it doesn’t address the parallel cases.
21:23 karolherbst: I'll probably review stuff tomorrow, so just throw links at me :D
21:24 pmoreau: The clover series is still up :-) I added an additional patch today
21:25 karolherbst: it's still open in a tab :D
21:27 pmoreau: Some feedback on https://gitlab.freedesktop.org/pmoreau/mesa/-/commit/a9b4302544e4f076be0aad89573428ad8e9086f4 and https://gitlab.freedesktop.org/pmoreau/mesa/-/commit/80dc15dc8523c0e3d326cb94d0ac1f18788029f3 would be appreciated (note that in the latter.
21:27 karolherbst: ohh yeah.. that API one
21:28 karolherbst: pmoreau: CL_SUCCESS is 0 btw
21:28 karolherbst: ehh wait
21:28 karolherbst: huh?
21:28 karolherbst: return status >= 0 ? CL_SUCCESS : status; is.. ehh CL_SUCCESS
21:29 karolherbst: or am I missing something?
21:29 karolherbst: ohh...
21:29 pmoreau: So, status contains either an error code (< 0), CL_SUCCESS, or a value > 0 for things like CL_QUEUED and others.
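The point of the `status >= 0 ? CL_SUCCESS : status` expression can be shown with the OpenCL constant values: non-negative statuses are execution states rather than errors, so they collapse to `CL_SUCCESS`, while negative error codes pass through. The constants below are the standard OpenCL values; `api_return_code` is a hypothetical name for illustration, not clover's function.

```python
# Standard OpenCL status values.
CL_SUCCESS = 0
CL_COMPLETE = 0
CL_RUNNING = 1
CL_SUBMITTED = 2
CL_QUEUED = 3
CL_OUT_OF_RESOURCES = -5


def api_return_code(status):
    """Map an event status to an API return value: states become CL_SUCCESS, errors pass through."""
    return CL_SUCCESS if status >= 0 else status


print(api_return_code(CL_QUEUED))            # 0
print(api_return_code(CL_OUT_OF_RESOURCES))  # -5
```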
21:29 karolherbst: ehhh
21:29 karolherbst: annoying
21:30 karolherbst: pmoreau: you know what's the most annoying thing about clover's error handling? it throws random error values even if an API doesn't allow it :/
21:30 pmoreau: :-/
21:31 karolherbst: yeah.. but ... no idea on how to "fix" it without rewriting the entire api layer
21:31 karolherbst: but..
21:31 karolherbst: probably fine as long as the CTS doesn't complain..
21:33 pmoreau: And if you are desperate for things to look at, you could browse through this branch which contains all my current patches for NV50: https://gitlab.freedesktop.org/pmoreau/mesa/-/commits/nv50_improve_tls. I need to squash some patches there, and rework some of them to perform the transformation in the legalisation pass instead.
21:35 pmoreau: Given the current status of OpenCL, I’m not sure rewriting the entire API layer is worth it, unless OpenCL gets picked up again as an API but I doubt it.