15:21 robclark: karolherbst: btw, https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/frontends/rusticl/core/event.rs?ref_type=heads#L264 .. that should probably happen after queue.finish() somehow? Since a blocking read of a query will trigger a flush (which you probably don't want if there are multiple queue.enqueueNDRangeKernel() calls, and even if there is only one, it will trigger an extra execbuf/submit ioctl to get a fence)
15:22 karolherbst: robclark: sure, but the entire profiling needs to be reworked anyway
15:23 robclark: ok, as long as $new_thing dtrt
15:23 karolherbst: the thing is that start/end events are about when the GPU starts/ends processing commands
15:23 karolherbst: and I don't really want to read those values out on the CPU
15:24 karolherbst: should use `get_query_result_resource` instead of `get_query_result`
15:24 karolherbst: and have a slab allocated heap that I can just throw into it
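For reference, a minimal sketch of the two gallium readback paths being compared here; the helper names are invented for illustration, and get_query_result_resource()'s exact signature has shifted a bit across Mesa versions:

```c
#include "pipe/p_context.h"
#include "pipe/p_defines.h"

/* Path 1: blocking CPU readback. wait=true can force a flush plus a
 * fence wait for every single query, which is the stall robclark is
 * pointing at. */
static uint64_t read_ts_blocking(struct pipe_context *pipe,
                                 struct pipe_query *q)
{
   union pipe_query_result res;
   pipe->get_query_result(pipe, q, /* wait = */ true, &res);
   return res.u64;
}

/* Path 2: GPU-side readback. The driver writes the 64-bit result into
 * `buf` at `offset` from the command stream; the CPU only looks at the
 * buffer later, once the queue has been flushed anyway. */
static void read_ts_to_resource(struct pipe_context *pipe,
                                struct pipe_query *q,
                                struct pipe_resource *buf,
                                unsigned offset)
{
   pipe->get_query_result_resource(pipe, q, PIPE_QUERY_WAIT,
                                   PIPE_QUERY_TYPE_U64,
                                   /* index = */ 0, buf, offset);
}
```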
15:25 robclark: well, app is going to _eventually_ read them out on CPU.. I guess you could use the get_query_result() path, but that is a pita for me since I can't properly convert ticks to ms on the gpu without spinning up a compute shader ;-)
15:26 robclark: so for timestamp queries I have to kinda fake it for get_query_result_resource()
15:26 karolherbst: mhhh
15:26 robclark: if you just kept a list/vector/whatever of the queries and then did readback after flush it wouldn't be so bad
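Sketched, robclark's alternative: park the query handles in a plain array while commands are enqueued, and only call get_query_result() after the flush, when the fence already exists and wait=true no longer forces an extra submit. The names and the size cap are made up:

```c
#define MAX_PROFILED_EVENTS 256 /* arbitrary cap for the sketch */

struct deferred_queries {
   struct pipe_query *q[MAX_PROFILED_EVENTS];
   uint64_t result[MAX_PROFILED_EVENTS];
   unsigned count;
};

static void readback_after_flush(struct pipe_context *pipe,
                                 struct deferred_queries *dq)
{
   pipe->flush(pipe, NULL, 0);
   for (unsigned i = 0; i < dq->count; i++) {
      union pipe_query_result res;
      /* The fence from the flush exists now, so waiting here
       * doesn't trigger any extra submit per query. */
      pipe->get_query_result(pipe, dq->q[i], /* wait = */ true, &res);
      dq->result[i] = res.u64;
   }
}
```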
15:27 karolherbst: the thing is.. I just submit commands, and I need to know when the GPU starts/ends processing a command, and a query object thrown into the command stream is generally how it should be (tm). Though I can see that on tilers things are different (tm)
15:27 robclark: for compute, fortunately tiling isn't a thing.. it's really just about not being able to convert to ms on the CP
15:28 karolherbst: it doesn't have to be ms or something afaik
15:28 robclark: or us or something.. I forget what units the result is defined in, but it is a unit of time, not ticks
15:28 karolherbst: oh actually.. the spec wants it in ns
15:28 robclark: ahh, right
15:30 robclark: hmm, although I wonder if I could get something added to fw.. we really just need to multiply by (approx) 52
15:31 robclark: I guess 52 is close enough to 52.083333333
15:33 robclark: still, it would be easier to do on cpu
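The ~52 factor is consistent with a 19.2 MHz always-on counter: 10^9 ns/s divided by 19.2 × 10^6 ticks/s is exactly 625/12 ≈ 52.0833 ns per tick, so multiplying by a plain 52 would drift by about 0.16%. On the CPU the exact fraction is trivial; assuming that 19.2 MHz source, a sketch:

```c
#include <stdint.h>

/* Exact ticks -> ns for a 19.2 MHz counter: 1e9 / 19.2e6 = 625/12.
 * Multiplying before dividing keeps the integer math exact; the
 * intermediate ticks * 625 only overflows 64 bits after roughly
 * 48 years of counter time. */
static uint64_t ticks_to_ns(uint64_t ticks)
{
   return ticks * 625 / 12;
}
```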
15:33 karolherbst: robclark: anyway.. the idea was to insert a timestamp query object, do a bunch of gallium commands, do the second timestamp query and just read out the results once the GPU is done
15:34 robclark: yeah, and as you pointed out one way to do that is get_query_result_resource().. the other is to track the query objects
15:35 karolherbst: how are queries implemented for you anyway? Is it like a command stream thing where the GPU writes a timestamp into a location at some point in time, or is it more like.. done on the cpu side?
15:35 robclark: it's on the GPU
15:35 karolherbst: okay
15:36 karolherbst: yeah, when it's writing the values asynchronously to a buffer I can map later, that's perfect. We _could_ do CPU-side fixups of the values
15:36 karolherbst: what's important is that I don't want to stall the pipeline with busy waits
15:36 karolherbst: as I'm doing atm
15:36 robclark: right
15:36 robclark: well, with get_query_result_resource() you eventually stall on the result resource
15:36 karolherbst: but I could apply a factor before reporting back the values
15:37 karolherbst: well sure
15:37 robclark: so it amounts to the same thing.. but I could see get_query_result_resource() being easier to implement
15:37 karolherbst: I can map without stalling
15:37 robclark: yeah, I guess we could add a pipe cap to adjust the result on the cpu
15:37 karolherbst: PIPE_MAP_UNSYNCHRONIZED or something
15:38 robclark: sure, but I guess you want to actually _have_ the result on the CPU at some point ;-)
15:38 karolherbst: yeah... needs to flush the thing at some point
15:38 karolherbst: but.
15:38 karolherbst: you can also copy to a second resource
15:38 karolherbst: and map that one without stalling :D
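The trick karolherbst means, sketched with invented names: the GPU copies the results into a CPU-visible staging buffer as just another command in the stream, so the buffer the queries actually target never has to be mapped at all:

```c
#include "util/u_box.h"

static void stage_results(struct pipe_context *pipe,
                          struct pipe_resource *staging,
                          struct pipe_resource *results,
                          unsigned size)
{
   /* GPU-side copy, queued behind the query writes; no CPU stall. */
   struct pipe_box box;
   u_box_1d(0, size, &box);
   pipe->resource_copy_region(pipe, staging, 0, 0, 0, 0,
                              results, 0, &box);
}
```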
15:39 robclark: only for readback of result on the GPU
15:39 karolherbst: there are a few tricks for how the actual important main work can be left alone doing its stuff
15:39 robclark: (idk if cl can do that)
15:39 karolherbst: oh the CL API doesn't care about how it's implemented really
15:39 robclark: if you are reading back on the CPU, then you have to wait
15:39 karolherbst: it just gives you raw values
15:39 robclark: right, but on the _CPU_
15:39 karolherbst: sure
15:40 robclark: so unless you invent time travel, it needs to wait on GPU somewhere ;-)
15:40 robclark: (but moar fps via time travel would be a neat trick)
15:40 karolherbst: not really if you e.g. tell the GPU to write the results into a coherent/staging buffer
15:41 robclark: sure, but cpu needs to read it after gpu writes it
15:41 karolherbst: right, you can't prevent that one :)
15:42 karolherbst: but atm, we do the read after each event is processed, then stall the GPU, then execute the next event
15:42 karolherbst: that just stalls the CL queue all the time
15:43 karolherbst: an "event" here is like a cl queue command
15:43 robclark: right, either deferring the query object read to getProfilingInfo() or stalling and reading the result rsc in getProfilingInfo() would amount to the same thing
15:43 robclark: it would stall until result is avail if it isn't already
15:43 karolherbst: nah, it's done on different threads
15:43 karolherbst: there is a queue thread working through the commands
15:44 robclark: gpu doesn't care so much about cpu threads
15:44 karolherbst: and that's stalling the GPU side of things with the current implementation constantly
15:44 robclark: sure, ofc
15:44 karolherbst: with get_query_result_resource the GPU would just write the results into some buffer while working through the commands
15:44 karolherbst: and then at some point it gets read out once the queue is flushed/finished or so
15:45 karolherbst: but that happens on a different thread then and wouldn't bother the queue one
15:45 robclark: but if that thread just pushed the queue objects to some data structure, and deferred get_query_result() until getProfilingInfo() is called.. then you don't stall any more than you would with the get_query_result_resource() approach
15:45 karolherbst: getProfilingInfo doesn't call into get_query_result
15:45 karolherbst: the get_query_result happens way earlier
15:45 robclark: right, that is the problem
15:46 robclark: oh, but I guess you might need extra locking with my approach to avoid calling into ctx on multiple threads
15:46 karolherbst: yeah, and instead of get_query_result, I want to use get_query_result_resource so it's not constantly waiting on the GPU. And getProfilingInfo simply reads from the buffer instead of from the temporary values the results of get_query_result were written to
15:47 karolherbst: there is already a bit of indirection going on there, because things are already cursed enough
15:48 karolherbst: well.. that's why I want to map unsynchronized or something, so I can just map on a different context
15:48 karolherbst: need to figure out the details at some point
15:48 karolherbst: maybe I just do a resource_from_user_memory thing...
15:49 robclark: yeah, for the result rsc approach, that would work, because you can wait on fence on any thread
15:49 karolherbst: yeah
15:49 karolherbst: just need to wait until the GPU is actually done, maybe make sure the results are flushed, but then it should work in principle
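The readback side of that plan might look like this: waiting on the flush fence is legal from any thread, and once it has signaled, an unsynchronized map asks the driver for no further stall. The ctx-locking caveat robclark raises still applies to the map call itself; names are illustrative:

```c
#include "util/u_inlines.h"

static uint64_t read_result(struct pipe_screen *screen,
                            struct pipe_context *pipe,
                            struct pipe_fence_handle *fence,
                            struct pipe_resource *staging,
                            unsigned offset)
{
   struct pipe_transfer *xfer;

   /* Any thread may wait on the fence from the queue's flush. */
   screen->fence_finish(screen, NULL, fence, OS_TIMEOUT_INFINITE);

   /* The fence already guarantees the GPU writes landed, so the
    * map can skip driver-side synchronization entirely. */
   uint64_t *map = pipe_buffer_map_range(pipe, staging, offset,
                                         sizeof(uint64_t),
                                         PIPE_MAP_READ |
                                         PIPE_MAP_UNSYNCHRONIZED,
                                         &xfer);
   uint64_t value = *map;
   pipe_buffer_unmap(pipe, xfer);
   return value;
}
```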
15:51 robclark: let me look into whether CP_TICKS_TO_NS is something that I can talk someone into.. I guess it at least has a non-zero chance now..
15:51 karolherbst: could also just collect all the query results whenever I had to wait on a fence anyway
15:51 robclark: and it would be useful for qbo
15:52 karolherbst: it could default to 1
15:53 karolherbst: it's a bit more problematic with GL, because you can hand out a GPU resource to applications with the raw data, no?
15:53 karolherbst: or well.. have it written to a memory object
15:53 robclark: right, right now I just fail the big gl qbo timestamp/elapsed tests
15:53 robclark: which CP_TICKS_TO_NS would help
15:54 karolherbst: yeah..
15:54 karolherbst: anyway.. it's harder to get all the gl bits correct here. I can just apply a factor if needed, that's not really a big issue
15:55 robclark: yeah, I guess we could do that as the workaround if we had to (or at least do that when fw is too old)
16:09 karolherbst: mhh, I think I just understood what you were trying to explain earlier 🙃.. I guess I could move the `get_query_result` calls to a later place and only do that after waiting on related fences anyway...
16:14 karolherbst: anyway, that would also require rewriting most of the profiling code, for weird reasons
16:15 robclark: yeah.. but if you are calling into pctx on a different thread from app thread, the threading might be a bit awkward
16:15 robclark: but other than that detail the two approaches are the "same"
16:16 karolherbst: yeah... atm PipeQuery stores a reference to the Context, and for rust reasons it would be a bit painful to delay reading it out. So writing into a resource would get around that part
16:17 karolherbst: because then I won't have to keep the query object around
16:17 karolherbst: And using host visible memory mapped into the GPU might just make everything trivial enough to handle
16:18 robclark: yeah, you'd have to wait on a fence but no threading constraints there
16:18 karolherbst: or it's a persistent mapping and the event object just gets a pointer into a slice allocated for it
16:18 karolherbst: and then it's just reading from a pointer (after a flush/wait) and nothing else matters
16:18 karolherbst: anyway...
16:19 karolherbst: I have ideas, and I'd just need to figure out what I like the most
16:20 karolherbst: I think I like the idea of using a coherent + persistent mapping thing, because then I don't have to bother on the CPU side with copying values around and the results just appear at the right location at some point
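That last idea, sketched: one slab created with persistent+coherent map flags and mapped once, so each profiled event just remembers its slot; after get_query_result_resource() targets the slot and the fence signals, the value is simply sitting in memory with no copies and no further maps. The template fields and names are guesses, not rusticl's actual code:

```c
#include <stdbool.h>
#include "util/u_inlines.h"

struct ts_slab {
   struct pipe_resource *buf;
   uint64_t *cpu;   /* persistent, coherent mapping, valid for life */
   unsigned next;   /* next free slot */
};

static bool ts_slab_init(struct pipe_screen *screen,
                         struct pipe_context *pipe,
                         struct ts_slab *slab, unsigned slots)
{
   struct pipe_resource tmpl = {
      .target = PIPE_BUFFER,
      .format = PIPE_FORMAT_R8_UNORM,
      .width0 = slots * sizeof(uint64_t),
      .height0 = 1, .depth0 = 1, .array_size = 1,
      .usage = PIPE_USAGE_STAGING,
      .flags = PIPE_RESOURCE_FLAG_MAP_PERSISTENT |
               PIPE_RESOURCE_FLAG_MAP_COHERENT,
   };
   struct pipe_transfer *xfer;

   slab->buf = screen->resource_create(screen, &tmpl);
   if (!slab->buf)
      return false;

   /* Map once; coherent means GPU writes just show up here. */
   slab->cpu = pipe_buffer_map(pipe, slab->buf,
                               PIPE_MAP_READ | PIPE_MAP_PERSISTENT |
                               PIPE_MAP_COHERENT, &xfer);
   slab->next = 0;
   return slab->cpu != NULL;
}
```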