00:05 karolherbst: https://gist.github.com/karolherbst/ad8ee8572526bd6adaf622094d745abc
00:05 karolherbst: that shader should be fine with 8 regs...
00:07 karolherbst: HdkR: any ideas?
00:08 HdkR: Ah, so this is the failing one
00:08 karolherbst: I am quite sure
00:09 karolherbst: I should probably write a shader_test file tomorrow and see where that goes
00:09 HdkR: Turing instead of Volta right? So no errata workarounds needed?
00:09 karolherbst: yeah, turing
00:10 HdkR: That one looks fine, shouldn't be a problem with 8
00:11 HdkR: It's pretty straightforward to break if that is the case
00:16 karolherbst: okay..
00:16 karolherbst: I can reproduce with a shader_test file
00:18 karolherbst: HdkR: btw, which errata did you mean?
00:20 HdkR: nop padding on Volta
00:20 karolherbst: yeah.. well
00:20 karolherbst: that should be unrelated :p
00:20 HdkR: Yea, since this is Turing it doesn't have that :)
00:20 karolherbst: I hope :D
00:21 karolherbst: I think we upload padding regardless
00:21 karolherbst: HdkR: https://gitlab.freedesktop.org/mesa/mesa/-/blob/master/src/gallium/drivers/nouveau/codegen/nv50_ir_target_gv100.cpp#L514 :p
00:23 karolherbst: although I think we load it first and after that comes the shader.. mhh
00:23 karolherbst: anyway
00:23 karolherbst: all I do is to change the max reg from 12 to 8 and then it breaks
00:24 HdkR: huh
00:24 karolherbst: mhhhh
00:24 karolherbst: mhhhh
00:26 HdkR: Wonder if the requirements differ between compute and graphics stages, that would be annoying
00:27 karolherbst: it would be
00:27 karolherbst: let me try something annoying
00:28 karolherbst: heh
00:28 karolherbst: HdkR: messed with the emitter, now it's fine: https://gist.githubusercontent.com/karolherbst/2b5c95dbb9654e0aeb119b42e9c8624b/raw/8e8ec0573c085bfd06600a839da20278abfde772/gistfile1.txt
00:29 karolherbst: if I set the last source to RZ instead it's also fine...
00:30 karolherbst: ehhh wait
00:30 karolherbst: I forgot to revert other changes
00:30 HdkR: :)
00:33 karolherbst: ehhh
00:35 karolherbst: mhhhhh
00:47 karolherbst: HdkR: lol... write to R5 is fine, but R6 is not
00:48 karolherbst: regs set to 8
00:49 HdkR: Some nuance is being missed that I don't remember
00:50 karolherbst: regs set to 9 allows a write up to 6 :p
00:51 karolherbst: let's say there are two hidden regs
00:51 karolherbst: so if I set regs to 32 it should be okay with 29, but 30 fails
00:51 karolherbst: well.. let's just go over the top and make it 128, then test writes to 125 and 126
00:53 karolherbst: heh.. nice
00:53 karolherbst: HdkR: so yeah...
00:53 karolherbst: it seems that for vertex programs you have to set it to max reg id used + 3
00:53 karolherbst: if there are more special conditions? no clue
00:59 karolherbst: HdkR: and for fps it seems to be max + 1...
00:59 karolherbst: I am sure there is _some_ logic to it
00:59 karolherbst: anyway, I am convinced we can do better :)
00:59 karolherbst: just need to figure out how exactly
01:00 karolherbst: and I bet the +5 comes from something stupid like tess ctrl shaders :p
01:07 HdkR: Maybe, I never touched those stages :P
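A minimal sketch of the rule found empirically above, assuming the pad really is a fixed per-stage constant; the function name and the constants are guesses drawn from these experiments, not confirmed hardware behaviour:

    #include <stdbool.h>

    /* Guessed from the experiments above (Turing): vertex programs fault
     * unless the register count is the highest GPR id written + 3, while
     * fragment programs get away with + 1. */
    static unsigned
    prog_num_gprs(unsigned max_gpr_id, bool is_vertex_prog)
    {
       return max_gpr_id + (is_vertex_prog ? 3 : 1);
    }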
05:55 fling: is VN240GT-MD1G a fine card?
05:56 fling: shouldn't I worry about ddr3?
05:56 fling: gddr5 is what works badly?
05:57 HdkR: If you want a card weaker than a GT 710 then maybe
05:57 fling: There is also a gddr5 version from palit
05:57 fling: Which one should I pick? :P
06:00 fling: imirkin: RSpliet: hello.
06:08 fling: ok I will just avoid gddr5 for now
06:27 imirkin: fling: the GT 240 with GDDR5 should work fine too
06:27 imirkin: there was a (long) period of time when it didn't work, but then it was fixed, so all was well
06:28 imirkin: it won't reclock though
06:28 imirkin: i wonder what's faster - a top-clocked GT 240 with GDDR3 or boot clocks (which were medium on mine) on a GDDR5
08:10 fling: imirkin: I bought gddr3 one
08:10 fling: imirkin: I want to run it low clock all the time
12:59 karolherbst: HdkR: it seems like in the end it's that 2 more regs are needed, but I bet you also have no idea why
13:00 karolherbst: or maybe that's just the uniform register stuff and it takes 2 regs from each thread
13:00 karolherbst: ...
18:18 airlied: skeggsb: https://paste.debian.net/plain/1158685 probs needs some testing
18:19 imirkin: fling: ah yeah, if perf isn't your goal, then the gddr3 is fine. note that the number of perf levels on those earlier GPUs was not always ... high. sometimes just one perf level, sometimes 2
18:19 qeeg: hey, i'm working on riva 128 emulation for 86box, and i was wondering what the itm classes do exactly? i know itm stands for image to memory, but i'm not sure what the difference between ifm and itm is
18:20 qeeg: i've at least got gdi rects working tho :p
18:20 qeeg: also, what does the notify method on the itm class do? i was thinking it might send a PGRAPH NOTIFY interrupt but i'm not exactly sure
18:21 airlied: is ifm the opposite?
18:21 airlied:doesn't know
18:21 qeeg: i guess? but i don't know what they're supposed to do exactly either way :/
18:22 airlied:assumes they translate a linear buffer to a tiled buffer
18:22 airlied: and vice-versa
18:23 imirkin: i don't think we ever fully figured out tiling on pre-nv10
18:23 qeeg: nono, it's like
18:23 qeeg: image to memory and image from memory i think
18:23 qeeg: but i don't know wtf that means exactly?
18:24 airlied: memory is usually a linear buffer, and image is usually tiled
18:24 qeeg: ah :/
18:24 qeeg: that sucks...
18:24 qeeg: weirdly, the size of the image is only 1x1???
18:25 qeeg: and i'm in 640x480x256 mode
18:28 qeeg: the given parameters are offset 0x1a, size 1x1, pitch 0 (???), location 20, 10
18:28 qeeg: also for some reason, it occasionally sends method 0 to an already bound object
18:29 qeeg: this is in win98se's default drivers btw
18:31 airlied: might be a small image used for filling something
18:31 airlied: that was a common pattern back then, for a fill, just upload 1x1 image and then execute blits from it
18:33 qeeg: yeah but it's currently showing as magenta, and i'm using the default background
18:33 qeeg: furthermore, i'm not getting ANY commands after that to draw the rest of the ui
18:34 airlied: yeah I'm lost beyond that in terms of what the hw is expected to do
18:38 qeeg: i'm mostly just stuck with like, infinite vblank interrupts...
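To illustrate the linear-vs-tiled distinction airlied describes above: a toy address calculation, not the actual RIVA 128 layout (which, per imirkin, was never fully reverse engineered for pre-NV10); the 4x4 tile of 8bpp pixels is made up for the example.

    #include <stdint.h>

    /* "memory": plain linear framebuffer addressing */
    static uint32_t
    linear_offset(uint32_t x, uint32_t y, uint32_t pitch)
    {
       return y * pitch + x;
    }

    /* "image": same pixel, but stored in (made-up) 4x4 tiles so that
     * neighbouring pixels stay close together in memory */
    static uint32_t
    tiled_offset(uint32_t x, uint32_t y, uint32_t width_in_tiles)
    {
       uint32_t tile = (y / 4) * width_in_tiles + (x / 4);
       return tile * 16 + (y % 4) * 4 + (x % 4);
    }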
20:54 karolherbst: imirkin: I was wondering if it would make sense to have per context shaders rather than per screen, and allocate the buffer lazily, so that a context might not have a bo at all until a shader gets bound
20:55 imirkin: a common pattern is to compile in one thread and use in another
20:55 karolherbst: right, that's why I'd not allocate the bo directly and instead wait until a context binds the shader
20:55 imirkin: if you just mean the final code upload - delaying that until first use - whatever
20:56 karolherbst: I am also more thinking about multithreaded applications also binding shaders concurrently and having a shared bo for shaders is kind of racy
20:57 karolherbst: spent some time thinking recently about the MT issues and wondering if we want to rework the drivers a bit to make it suck less
20:58 karolherbst: and I am sure if multiple contexts go through validation it will break horribly already :)
20:59 karolherbst: just wondering if you have a better idea
20:59 karolherbst: I kind of prefer to have resources per context if there is no good reason to share them
21:01 imirkin: seems reasonable to just have a mutex around validation
21:01 karolherbst: locks hurt perf
21:01 karolherbst: so having the shader uploaded twice (if that even happens) is kind of preferable here
21:03 karolherbst: but I also don't know how much VRAM that would save with eg chromium or so
21:04 karolherbst: at this point I am at a "I prefer solutions without locks, but I have no data on saying which case would be better overall"
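For reference, a sketch of the mutex-around-upload variant imirkin suggests below, with a re-check under the lock so two contexts racing on the same shader only upload it once; all struct and helper names here are invented for illustration, not nouveau code:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>
    #include <threads.h>

    struct toy_shader {
       const void *code;
       uint32_t    code_size;
       uint32_t    code_base;   /* offset in the shared code bo */
       bool        uploaded;
    };

    struct toy_screen {
       mtx_t    text_lock;
       uint8_t *text_map;   /* CPU mapping of the shared code bo */
       uint32_t text_next;  /* toy bump allocator, no eviction */
    };

    static void
    upload_shader_code(struct toy_screen *s, struct toy_shader *sh)
    {
       mtx_lock(&s->text_lock);
       if (!sh->uploaded) {               /* another context may have won */
          sh->code_base = s->text_next;
          s->text_next += sh->code_size;
          memcpy(s->text_map + sh->code_base, sh->code, sh->code_size);
          sh->uploaded = true;
       }
       mtx_unlock(&s->text_lock);
    }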
21:11 airlied: if other drivers do it, copy them :-]
21:12 karolherbst: airlied: it's unclear to me what radeonsi is doing as it doesn't seem they have a GPU buffer with all the shaders
21:12 airlied: then if there is a problem with apps, everyone has it
21:12 karolherbst: :D
21:12 karolherbst: I mean.. I am mainly wondering what makes more sense
21:12 airlied: i think it has a bo that might be suballocated
21:13 karolherbst: having a global buffer with all shaders (and deal with evictions globally) or do it per context
21:13 airlied: but I'm not in front of it right now
21:16 karolherbst: we need to be able to compile without a context anyway... but that's unrelated as gallium keeps track of the objects already
21:17 karolherbst: ahhh
21:17 karolherbst: mhh
21:25 karolherbst: yeah.. dunno.. I really don't know what would be best here
21:26 airlied: copy what radeonsi does within the limits of hw, because mareko has probably spent a bit of time working it out
21:26 karolherbst: right.. it's just not easy to follow what radeonsi does with its 1MB structs :p
21:27 airlied: ill decode it when i get to my desk, tablet sucks for code reading
21:47 airlied: karolherbst: okay it doesn't have a special buffer for shaders
21:47 karolherbst: okay
21:47 airlied: it just uploads them via its normal resource allocation path
21:48 airlied: which suballocates from a slab for smaller things
21:48 karolherbst: right..
21:48 karolherbst: but there is not this big buffer containing all shaders, right?
21:48 airlied: nope
21:48 karolherbst: so you suballocate, put in the shaders you need for the draw and submit, correct?
21:48 airlied: each shader gets its own allocation
21:48 karolherbst: ahh, I see
21:48 karolherbst: mhhh
21:48 karolherbst: we have a real code area on nv
21:49 airlied: si_shader_binary_upload is the path
21:49 karolherbst: and we have to specify the address of the function the shader stage calls
21:49 karolherbst: and we have to bind the buffer on the hw channel
21:50 karolherbst: so it's more of a "big buffer with all code" + function offset per shader stage on draw
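In rough pseudo-pushbuf form, the nouveau-style model described above; the method names are from memory of the nvc0 driver and should be treated as approximate:

    /* once, at screen init: bind the shared code bo as the code segment */
    BEGIN_NVC0(push, NVC0_3D(CODE_ADDRESS_HIGH), 2);
    PUSH_DATAh(push, screen->text->offset);
    PUSH_DATA (push, screen->text->offset);

    /* per draw, per stage: point the stage at an offset, not an address */
    BEGIN_NVC0(push, NVC0_3D(SP_START_ID(stage)), 1);
    PUSH_DATA (push, prog->code_base);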
21:51 karolherbst: airlied: so I guess on a draw on AMD hw you specify which resources/addresses the shaders have, rather than an offset into a bound buffer?
21:51 airlied: yeah you have VM address per shader
21:51 karolherbst: okay..
21:52 airlied: oh so each shader on nouveau needs to be in a common base buffer? with a limited-size offset?
21:52 karolherbst: more or less
21:52 karolherbst: it's not really limited though
21:52 airlied: like is it an address space sized offset?
21:53 karolherbst: seems like we have 32 bit offsets
21:53 karolherbst: at least for compute
21:53 imirkin: normally we set the offset size to something much smaller
21:53 karolherbst: right
21:53 imirkin: but yes, in theory the max is 32-bit. (dunno, maybe 0 means "unbounded")
21:53 karolherbst: just saying it's not really a limitation
21:53 airlied: like maybe older hw is more limited
21:54 karolherbst: airlied: the idea is that you can upload common functions and call them
21:54 karolherbst: airlied: not really
21:54 airlied: karolherbst: ah okay which we don't really take any advantage of
21:54 airlied: like older intel had a limit and you had to put them into one state buffer
21:54 karolherbst: seems like on tesla we have 24 bits
21:55 airlied: even 32-bit might be a limit you have to handle alright
21:55 karolherbst: on older gens.. it might be less
21:55 karolherbst: airlied: it's really just for code though
21:55 karolherbst: and I doubt you will run into situations with 4GB of shaders :p
21:56 karolherbst: I guess the 16MB limit is more of an issue
21:56 karolherbst: but still
21:56 karolherbst: imirkin: do we even check for that in nv50? ...
21:56 imirkin: check for what?
21:56 karolherbst: that our text area grows bigger than 16MB
21:57 imirkin: we grow the buffer
21:57 imirkin: but we max out
21:57 imirkin: if we max out and need to add another shader
21:57 airlied: intel reuploads the active shaders when the buffer tops out
21:57 imirkin: then we dump all the current shaders and upload the ones we need
21:57 imirkin: if the *active* set of shaders is greater than the limit, then you get a warning :)
21:57 imirkin: i.e. the number of shaders needed for that one draw
21:58 imirkin: (or rather shader size)
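The grow-then-evict scheme imirkin lays out, sketched; try_place, grow_text, evict_all_shaders, MAX_TEXT_SIZE and the text_size field are all hypothetical helpers added on top of the toy structs from the earlier sketch:

    #include <stdbool.h>
    #include <stdio.h>

    static bool
    code_get_space(struct toy_screen *s, struct toy_shader *sh)
    {
       if (try_place(s, sh))
          return true;
       /* buffer is full: grow it, up to a hard cap */
       while (s->text_size < MAX_TEXT_SIZE) {
          grow_text(s);
          if (try_place(s, sh))
             return true;
       }
       /* at the cap: dump every resident shader and retry with only
        * what the current draw needs; evicted ones re-upload on use */
       evict_all_shaders(s);
       if (!try_place(s, sh)) {
          fprintf(stderr, "active shaders for one draw exceed the code buffer\n");
          return false;
       }
       return true;
    }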
21:58 karolherbst: airlied: we do the same, we just resize the buffer while at it
21:58 karolherbst: but maybe not for nv50
21:58 imirkin: yeah, buffer starts small-ish, maybe 256K or 512K
21:58 imirkin: and then it goes up to some amount
21:58 imirkin: remember nv50 shipped with 128MB VRAM GPUs
21:58 karolherbst: imirkin: I think it's constant for nv50 actually
21:58 imirkin: so making a 16MB code buffer would be... impractical.
21:59 karolherbst: yeah...
21:59 karolherbst: nv50 is constant
22:00 karolherbst: and fermi got 32 bit already
22:00 karolherbst: so it makes sense that it grows in nvc0 but not nv50
22:02 karolherbst: imirkin: actually.. we don't have to lock for the entire copy, so maybe a spinlock would be good enough.. we really just have to reserve the space and then other threads can continue I guess
22:03 imirkin: you're micro-optimizing.
22:03 imirkin: get something that doesn't deadlock first.
22:03 karolherbst: it's just about moving the unlock a bit up :p
22:04 airlied: yeah, don't replace a mutex with anything until it shows up in a profile
22:05 karolherbst: oh well.. but yeah, I guess I can start with a normal lock and just move on
22:05 karolherbst: just have to think about how the interactions with eviction and uploading from multiple threads work out
22:05 karolherbst: but maybe that's not a problem anyway
22:06 imirkin: should be single-threaded at that point
22:06 imirkin: if you take a mutex around the draw
22:06 karolherbst: ehhhh
22:06 karolherbst: I will for sure not lock the entire draw
22:08 imirkin: why not
22:08 karolherbst: but once I am done with it we will also have per context buffers, proper state reemission and no implicit submissions
22:08 karolherbst: because I still want there to be a benefit when doing threading
22:08 karolherbst: throwing locks solves the crashes, but doesn't help with concurrency
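The "move the unlock a bit up" idea from above, sketched with the same toy structs as before (all names hypothetical): hold the lock only long enough to reserve a range, then copy outside it, since no other thread can be handed the same range.

    static void
    upload_shader_code_short_lock(struct toy_screen *s, struct toy_shader *sh)
    {
       mtx_lock(&s->text_lock);
       uint32_t base = s->text_next;   /* reserve while locked... */
       s->text_next += sh->code_size;
       mtx_unlock(&s->text_lock);

       /* ...but copy unlocked: the reserved range is private to us now */
       memcpy(s->text_map + base, sh->code, sh->code_size);
       sh->code_base = base;
    }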
22:20 airlied:would have liked to have seen the "just add a mutex around the whole thing" fix first :P
22:20 airlied: esp if you want to stay compliant with such a big broken piece
22:21 airlied: but really just getting it fixed at all would be a big win
22:21 airlied: I could finally run virgl :-P
22:21 karolherbst: airlied: doesn't work anyway
22:21 imirkin: airlied: unfortunately JUST adding a mutex couldn't be made to work
22:21 imirkin: i tried and failed
22:21 imirkin: karolherbst didn't believe me and tried too, and failed
22:21 airlied: ah yeah the libdrm calls back
22:22 airlied: and hey am I holding a mutex, who knows
22:22 imirkin: not just calls back
22:22 imirkin: but calls back at unpredictable points in the flow
22:22 karolherbst: yeah, but it's not that hard to get rid of it
22:22 karolherbst: just... needs some rework
22:22 imirkin: nouveau_bo_map could end up triggering a kick, etc
22:22 karolherbst: I'd like to remove all implicit submissions
22:23 karolherbst: and only submit on flush, etc..
22:23 karolherbst: uhh.. right bo_map as well for very stupid reasons
22:28 airlied: is bo_map in libdrm or mesa?
22:28 karolherbst: libdrm
22:28 airlied: seems like you'd want to keep all of that tracking in mesa
22:28 airlied: and only have libdrm do the actual maps
22:28 imirkin: yes
22:29 imirkin: that's the big part - essentially get rid of libdrm_nouveau
22:29 imirkin: (or keep very small portions of it)
22:29 airlied: yeah or just merge into mesa
22:29 imirkin: can't actually delete it since it's also heavily used by the ddx (and correctly there)
22:29 airlied: and then drop it
22:29 karolherbst: if "moving" alone would be enough
22:29 imirkin: but the API it presents is inappropriate for a sophisticated application like mesa.
22:29 karolherbst: but we need to rework our entire pushbuffer handling anyway :p
22:29 airlied: well moving means you can change the api
22:29 imirkin: right
22:30 airlied: designing it in two places while trying to retain the API is messy
22:30 imirkin: anyways, it's a deep rabbit hole
22:30 karolherbst: airlied: well skeggsb claims we can already do what we need with the current API, but at this point only skeggsb seems to know how
22:30 airlied: yeah I'm happy to say unless skeggsb gives you a patch today, burn it all
22:31 airlied: he's been saying that, but if nobody else can figure it out then it's not a great position to be in from a bus factor pov
22:31 imirkin: i've been reluctant to climb down it because i expect it'd require me a week or two full time of heavily concentrated work
22:31 imirkin: and i just don't have that sort of time commitment to give
22:31 airlied: merging libdrm_nouveau into mesa sounds boilerplately at least
22:31 imirkin: yeah, that's the easy part :)
22:31 karolherbst: airlied: and is not the main part of the work anyway
22:31 imirkin: "cp" works nicely
22:31 karolherbst: :D
22:31 karolherbst: yeah
22:31 karolherbst: nouveau's command submission is just busted
22:32 airlied: but if you get the first step out of the way it's easier to see the next step
22:32 airlied: and maybe when you merge it you can burn lots of things down straight away
22:32 imirkin: [and spreading that sort of work out over a long period of time is difficult too, since the key is to stay concentrated and keep it all in your head]
22:32 karolherbst: airlied: I already know the next steps, I just don't know if we need a new API or not :p
22:33 imirkin: karolherbst: what do you think about the modification where we lose the tracking of "currently in use" buffers and get rid of the implicit flush in nouveau_bo_wait
22:33 airlied: karolherbst: don't worry about that
22:33 imirkin: i wonder if that'd be enough to get an impl going
22:33 airlied: if you merge libdrm_nouveau, there is no API
22:33 karolherbst: imirkin: refn also flushes
22:33 karolherbst: ohh wait
22:33 imirkin: karolherbst: that's fine though
22:33 imirkin: that was never the problem
22:33 karolherbst: no?
22:33 karolherbst: we should have _zero_ implicit pushes :p
22:33 imirkin: the problem was that nouveau_bo_map/wait would flush
22:34 imirkin: ok, well, maybe later.
22:34 imirkin: but that wasn't the immediate problem.
22:34 karolherbst: right
22:34 karolherbst: but it is a potential problem
22:34 karolherbst: and if we fix it, I'd like to have no potential problems I am aware of :p
22:34 imirkin: the immediate problem, at least with my patches, was that nouveau_bo_map() would get randomly called, which would in turn trigger fence-attached work to happen
22:34 imirkin: which would in turn potentially do bad things
22:35 imirkin: it's not a problem - at least my implementation handled kicks in the middle of submission like that just fine
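The hazard being described, in outline; nouveau_bo_map and NOUVEAU_BO_WR are real libdrm_nouveau entry points, but the flushing behaviour noted in the comment is the problem under discussion, not documented API contract:

    /* somewhere deep in state validation, an innocent-looking map: */
    ret = nouveau_bo_map(bo, NOUVEAU_BO_WR, client);
    /* if bo is busy and referenced by the current pushbuf,
     * libdrm_nouveau kicks the channel right here -- fence-attached
     * work then fires at an unpredictable point in the flow */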
22:35 karolherbst: imirkin: do you know what would be the easiest solution? a malloced buffer: allocate pushbuffers on submit and tear them down again
22:35 karolherbst: lock around submission
22:35 imirkin: still not enough :p
22:35 imirkin: at least not without redoing lots of other stuff
22:35 imirkin: the pushbuf was never the problem
22:35 karolherbst: that's the part we have to do anyway
22:35 karolherbst: it actually is
22:35 imirkin: not in my recollection :p
22:35 karolherbst: we need per thread buffers anyway
22:35 imirkin: perhaps i implemented differently
22:36 imirkin: we definitely do not
22:36 karolherbst: of course we do :p
22:36 imirkin: per-thread buffers get you ~nothing
22:36 karolherbst: a global buffer is just calling for problems
22:36 imirkin: except extra confusion
22:36 imirkin: you can still only submit one thing at a time
22:36 karolherbst: sure
22:36 imirkin: and the thing you submit depends on the previous thing
22:36 imirkin: so ... what does it get you.
22:36 karolherbst: but we shouldn't share an array with all threads which can write into it randomly
22:37 karolherbst: or we lock 90% of the time
22:37 imirkin: the key is to make all writes be under a lock
22:37 imirkin: which is what i did
22:37 karolherbst: I don't see why even bother keeping one buffer
22:37 karolherbst: ehh...
22:37 karolherbst: I like perf
22:37 imirkin: because you have to order the logic which writes anyways
22:37 imirkin: so you STILL have to have the mutex
22:37 imirkin: but problems become harder to detect
22:37 airlied: per screen buffer seems a bit wrong
22:37 karolherbst: ahh, that leads to another problem
22:37 karolherbst: our state emission is broken :p
22:37 imirkin: airlied: hw context switches are expensive
22:37 imirkin: lots of data
22:37 imirkin: so we have one hw context per screen
22:37 karolherbst: imirkin: who said something about multiple contexts?
22:37 imirkin: rather than per pipe context
22:38 imirkin: having multiple buffers per hw context sounds like disaster
22:38 imirkin: that's just asking for trouble.
22:38 karolherbst: soo.. this is how it should be: 1. state tracking per context 2. state emission on flush with per thread buffer containing the "full state" 3. done
22:38 karolherbst: nouveau does nothing of that
22:38 karolherbst: or only in broken ways
22:38 imirkin: ok, well we clearly have a slightly different vision there
22:39 karolherbst: I know
22:39 airlied: imirkin: you just flush more state in exchange for less locking
22:39 imirkin: the thing i did only broke coz of the nouveau_bo_map thing
22:39 imirkin: airlied: and have to re-audit everything
22:39 karolherbst: and in a world where 12 cores are the norm, locking is the last thing we need
22:39 karolherbst: I prefer any solution over locking
22:39 karolherbst: cause I like concurrency
22:40 karolherbst: killing concurrency is just wrong
22:40 karolherbst: uhm.. parallelism would be the better term here I think
22:41 karolherbst: also.. that's also more like how vulkan works
22:41 karolherbst: and I bet it's designed like that for a reason
22:41 airlied: it seems like just having a fully self consistent per-context buffer should be fine
22:41 karolherbst: yes
22:42 airlied: you just track state changes in the context, and send the whole thing to the hw
22:42 karolherbst: and I'd like to have the option of doing a full state reemission on draw
22:42 karolherbst: and dump the _full_state_ if something gets rejected or the context crashes
22:42 airlied: allocate a new per-context buffer and emit all the current state on flush
22:42 karolherbst: for simpler debugging
22:42 airlied: or on first draw
22:42 karolherbst: yeah
22:42 karolherbst: we can be smarter for future calls by keeping some state if we know it never changes
22:42 airlied: like I can't see interleaving things into a per-screen thing really winning here
22:42 karolherbst: but by default I'd like to have the option of full reemission
22:42 karolherbst: especially because it helps debugging
22:43 airlied: well draws are per-context, so just normal context state dirty tracking should be fine
22:43 karolherbst: right
22:43 airlied: and you dirty all the bits on flush
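A sketch of the per-context scheme airlied outlines here, with all names invented for illustration: state setters only flag dirty bits on the context, a flush hands off the context's own pushbuf and dirties everything, and the next draw re-emits the full state.

    #include <stdint.h>

    struct toy_context {
       struct toy_pushbuf *pushbuf;  /* private to this context */
       uint64_t            dirty;    /* one bit per state group */
    };

    static void
    ctx_flush(struct toy_context *ctx)
    {
       submit_pushbuf(ctx->pushbuf);       /* the only place we submit */
       ctx->pushbuf = alloc_pushbuf();
       ctx->dirty   = ~0ull;               /* full re-emission next draw */
    }

    static void
    ctx_draw(struct toy_context *ctx)
    {
       emit_dirty_state(ctx);              /* everything, after a flush */
       ctx->dirty = 0;
       emit_draw(ctx->pushbuf);
    }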
22:43 karolherbst: but.. well
22:43 karolherbst: airlied: we don't track :)
22:43 karolherbst: or not really
22:43 karolherbst: it's just broken from a design perspective
22:43 karolherbst: we know what state is dirty, we just don't know what a thread did
22:44 karolherbst: or if we have to dirty something at all, we just do it
22:44 karolherbst: which might be fine
22:44 karolherbst: just... annoying with multiple contexts
22:44 karolherbst: as we dirty everything anyway
22:44 karolherbst: or pretty much everything
22:45 karolherbst: I already found bugs where the concept of gallium and the concept of nouveau's "switch_pipe_context" just clash
22:46 karolherbst: airlied: what we do is to mark things as dirty once something changes or another thread gets active
22:46 karolherbst: the first part is okay, the second not
22:47 imirkin: it is. st/mesa is busted if it leaves dangling pointers.
22:47 karolherbst: imirkin: it's CPU overhead optimized for a reason
22:47 imirkin: making things difficult to debug for no gain?
22:47 imirkin: i don't think that's a good reason.
22:48 karolherbst: that we can't debug stuff isn't gallium's fault
22:49 karolherbst: imirkin: I don't know why you keep ignoring CPU overhead as a solid argument
22:50 imirkin: it's not CPU overhead until it is
22:50 imirkin: if it's like 2 extra instructions executed once in a blue moon
22:50 imirkin: that's not CPU overhead
22:50 imirkin: that's just making developers' lives more difficult for no reason
22:50 karolherbst: imirkin: I can give you games where you end up with full CPU load and maybe 20% gpu load
22:50 imirkin: i bet
22:51 imirkin: but it's not the cpu overhead we're talking about here
22:51 imirkin: that's overhead from doing tons of glVertex calls or whatever
22:51 airlied:can't see dirty tracking being a problem here
22:51 karolherbst: I kind of see the dangling pointer bit, but you also say that locking is fine on a screen level and so on
22:51 airlied: like where's the extra CPU coming in over the current solution?
22:51 imirkin: yes. it's all fine. it will not impact CPU overhead in any material way.
22:52 airlied: per context dirty is how nearly every driver ever has worked
22:52 imirkin: which is what we do
22:52 imirkin: we also have some screen-level things which are "expensive" to flip
22:52 imirkin: small handful
22:52 karolherbst: airlied: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/1415191c22391a441e93d9af87c221880e667d60
22:53 karolherbst: on context switch we try to reupload cb0 from a user pointer
22:53 karolherbst: which gallium destroyed already
22:53 karolherbst: anyway...
22:53 imirkin: so st/mesa is buggy. shit happens.
22:53 imirkin: not the first time, not the last.
22:53 airlied: imirkin: if you have per-gallium context dirty, then when you change threads you reemit all the state anyways?
22:53 karolherbst: imirkin: my patch still reduces CPU overhead in nouveau while adding no overhead to gallium....
22:53 imirkin: airlied: almost all yeah
22:54 airlied: imirkin: so there isn't a real difference in having it in a per-context pushbuf then
22:54 airlied: it's possibly more efficient
22:54 imirkin: airlied: there are some rarely changing things which trigger WFI's in the GPU
22:54 imirkin: like changing the code segment, among others
22:55 airlied: but changing them once per pushbuf doesn't seem too excessive
22:55 imirkin: it can be, depending on usage
22:55 imirkin: WFI vs not-WFI
22:55 airlied: once per draw I'd say maybe
22:55 karolherbst: airlied: the bigger issue we have is that we don't submit in one go
22:56 karolherbst: we build our push buffer and it can just get pushed while we're filling it
22:56 imirkin: i've never seen anything wrong with that
22:56 karolherbst: which.. is a horrible thing to do anyway
22:56 airlied: imirkin: I assume the hw is optimised for vulkan or d3d12-like behaviour anyways
22:56 imirkin: airlied: probably. but the driver for those APIs will also ensure that those things don't change too much
22:57 karolherbst: imirkin: so.. let's say we have a bug and want to track it down, how do you know what state is on the hw besides just dumping _all_ the submitted buffers?
22:57 imirkin: code segment, texture/sampler stuff
22:57 imirkin: a few other things i'm not thinking of
22:57 karolherbst: and why doing a state reemission on flush helps here
22:57 karolherbst: does it even make sense to track texture/samplers on a screen level honestly?
22:58 karolherbst: I don't mean the buffers, but the tic/tsc entries
22:58 airlied: imirkin: the way vulkan works the driver doesn't really control things like push buf ordering or shader base addresses as easily
23:00 imirkin: well, the tic/tsc table is fixed
23:00 imirkin: changing it requires a WFI
23:00 imirkin: airlied: yeah, so you bake the CSO type stuff into fixed pushbufs, like we do in nouveau
23:00 imirkin: and generate the rest i guess?
23:01 imirkin: airlied: the vast majority of stuff will work fine like that
23:01 imirkin: a handful of things need love and care
23:01 imirkin: anyways, i haven't looked at blob traces in a _very_ long time
23:05 imirkin: karolherbst: as for multicore ... the whole glthread thing mareko did, which was a win in a bunch of places, essentially single-threads the whole thing
23:06 airlied: is a WFI a big hit on context boundaries?
23:06 karolherbst: I wasn't talking about glthread
23:06 airlied: like is there anything that people run that might notice
23:06 airlied: it's not like a ctx switch type event
23:06 karolherbst: but that's still to reduce application level CPU overhead by offloading stuff to another core
23:07 karolherbst: "essentially single-threads the whole thing" couldn't be further from the truth, even though it single threads gallium, but not the application
23:07 karolherbst: just makes the application wait less on gl calls
23:07 karolherbst: so you get more parallelism in the end
23:07 airlied: oh compute/graphic switches cause one
23:07 karolherbst: airlied: yep...
23:08 airlied: "Subsequent render calls having a bound UAV in common are conservatively separated by GPU WFI commands injected by our driver, to prevent any data hazards."
23:09 imirkin: airlied: it's a pipeline stall basically
23:09 imirkin: it's pretty rare to occur naturally unless you're doing weird shit
23:09 karolherbst: airlied: nvidia also faked async compute support in d3d for a looong time
23:09 karolherbst: cause the hw simply couldn't
23:10 HdkR: It could, just badly ;)
23:10 karolherbst: I think that changed with volta?
23:10 karolherbst: HdkR: right... but then it's pointless :p
23:10 karolherbst: but I think with volta that's better now
23:10 karolherbst: or turing
23:11 HdkR: Each generation improves async compute in some aspect
23:11 airlied: imirkin: not like intel then :-P
23:11 HdkR: Volta and Turing may be the first generation that it could actually be worth using though :P
23:11 karolherbst: yep
23:12 karolherbst: how much of ampere is public at this point? :D
23:12 imirkin: airlied: right, not like intel's "emit 32 pipe controls" workarounds :p
23:12 karolherbst: ahh.. quite so
23:12 imirkin: queries can cause WFI's, that's probably the most common cause
23:13 imirkin: sticking that in allows us to not wait on the CPU
23:13 imirkin: but still gotta wait for the thing being queried to actually complete
23:14 karolherbst: yeah.. but that might change in the future
23:14 HdkR: karolherbst: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf You get all of this since it is technically a released product even though nothing is in the consumer space
23:15 karolherbst: HdkR: just have to check the stuff is in there I'd like to talk about :D
23:15 imirkin: HdkR: do you know if it's going to be like volta - essentially datacenter-only, or will there be consumer hw too?
23:16 karolherbst: imirkin: ampere has one funky thing: global to shared async copy instruction
23:16 karolherbst: that will be fun to support
23:16 imirkin: yeah, i saw you talking about it earlier
23:16 karolherbst: but I guess that's just a CPY s[] g[] thing and a barrier
23:17 karolherbst: just... teaching codegen will be fun
23:17 HdkR: imirkin: With all the leaks and two years since the last hardware release, one can assume consumer hardware will arrive eventually
23:17 imirkin: well, with volta it never did
23:17 imirkin: unless you count the DGX or whatever
23:17 karolherbst: well turing is essentially volta
23:18 karolherbst: I could imagine they do it like intel's failed tick-tock :p just compute-graphics
23:18 HdkR: :D
23:19 karolherbst: imirkin: the MIG stuff is interesting as you can partition the GPU quite precisely
23:19 karolherbst: but they only talk about the high level bits
23:20 imirkin: the cgroups people will be happy
23:20 karolherbst: I doubt it
23:20 imirkin: they're never happy? :)
23:21 karolherbst: well.. with ampere you can split the SM between applications/work/whatever
23:21 karolherbst: so I guess you could also use it for graphics vs compute splits
23:21 karolherbst: just need to find the balance
23:21 karolherbst: but no idea how that works precisely
23:22 karolherbst: ahh you can even slice the caches
23:25 karolherbst: but I bet you would be able to load balance between contexts somehow... oh well
23:28 karolherbst: anyway... one of the early steps is to rework how submission works and make it not implicitly submit while other threads have the chance... regardless of how it gets solved