14:07 gsedej: hi! I have an "old" laptop with geforce 840m ("optimus"). If I boot vanilla ubuntu 18.04, I get a lot of output in dmesg
14:08 karolherbst: gsedej: yeah... and I kind of know what issue that is and we are working on a fix. I assume that with "nouveau.runpm=0" those messages will be gone
14:13 gsedej: ok, just checking if it's known issue
14:13 gsedej: this issue has been present for ~6 years. I am using that laptop with the blob driver though
14:14 karolherbst: gsedej: you can try out a patch if you want to but maybe it's a different issue.
14:14 karolherbst: do you have the messages somewhere?
14:21 gsedej: sorry, don't have the laptop at hand currently. I can try to check the messages and report here
14:28 imirkin_: karolherbst: that's a bunch older than the current runpm issues you're dealing with
14:31 gsedej: by the way... I was testing old GTS250 with ubuntu 18.04 and it works really nice, including 3D performance
14:37 karolherbst: imirkin_: not necessarily
14:38 karolherbst: imirkin_: I think I've heard about this issue on maxwell based laptops as well, but couldn't pinpoint it to a model anymore
14:39 imirkin_: karolherbst: there were issues, but unclear what those issues were
14:39 imirkin_: i thought they were fixed, but apparently not
14:39 imirkin_: iirc it was the whole acpi vs pcie pm thing
14:39 karolherbst: right, but that's also the issue on pascal
14:41 karolherbst: or do you mean the nouveau calls acpi vs pci core calls acpi stuff?
15:15 karolherbst: imirkin_: how much do you know about the barrier stuff inside codegen?
15:16 karolherbst: I am kind of investigating weird issues with shared memory
15:16 karolherbst: and I am wondering if I just set the wrong barriers
15:17 imirkin_: some?
15:18 karolherbst: mhhh, maybe the barrier code is wrong indeed
15:18 karolherbst: I end up with a membar gl; bar sync 0x0 0x0 0x1
15:18 karolherbst: but that membar is for global memory I guess
15:21 imirkin_: right.
15:21 imirkin_: so ... a memory barrier is entirely unrelated to a barrier.
15:21 imirkin_: (almost entirely)
15:21 imirkin_: a memory barrier is basically a syncpoint that ensures all previous reads/writes are finished
15:22 karolherbst: right, but the kernel also syncs all threads
15:22 imirkin_: a barrier is a code execution syncpoint which ensures that all threads have reached that point
15:22 karolherbst: actually both are happening
15:22 imirkin_: i don't know that barrier implies memory barrier (in the ISA)
15:22 imirkin_: it might, but i could definitely imagine implementations where it doesn't
15:23 karolherbst: OpControlBarrier 2 2 272, 2 means Workgroup, first 2 is for execution, second for memory
15:24 karolherbst: 272 is a mask for WorkgroupMemory | SequentiallyConsistent
15:24 karolherbst: SC: All observers will see this memory access in the same order with respect to other sequentially-consistent memory accesses from this invocation.
15:25 karolherbst: but... what bothers me is that we emit a membar on global, not shared memory...
15:25 karolherbst: and this should be wrong I think
15:25 imirkin_: well, i think global > shared
15:25 imirkin_: but i could be wrong
15:25 imirkin_: (i.e. doing it on global _also_ hits shared)
15:25 imirkin_: i'd look at the ptx docs for this
15:26 imirkin_: it's highly likely that those ptx ops mirror the hw fairly precisely
15:26 karolherbst: mhhh yeah
15:26 karolherbst: probably
15:26 karolherbst: maybe something else is bonkers
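For reference, a minimal OpenCL C sketch (illustrative only, not the CTS source) of the construct behind that OpControlBarrier 2 2 272: barrier() with a local-memory fence carries Workgroup execution scope, Workgroup memory scope, and WorkgroupMemory | SequentiallyConsistent semantics (0x100 | 0x10 = 272), i.e. both a thread sync and a shared-memory fence. The kernel assumes a work-group size of 64:

    __kernel void local_sum(__global const int *in, __global int *out)
    {
        __local int scratch[64];              /* shared within the work-group */
        size_t lid = get_local_id(0);

        scratch[lid] = in[get_global_id(0)];

        /* lowers to OpControlBarrier Workgroup Workgroup
         * (WorkgroupMemory | SequentiallyConsistent) in SPIR-V:
         * every work-item reaches this point and its writes to
         * scratch[] become visible to the rest of the group */
        barrier(CLK_LOCAL_MEM_FENCE);

        if (lid == 0) {
            int sum = 0;
            for (int i = 0; i < 64; i++)
                sum += scratch[i];
            out[get_group_id(0)] = sum;
        }
    }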
15:26 karolherbst: weird thing is that the tests see partial writes
15:27 karolherbst: ERROR: Data sample 137 for vstore of char2 did not validate (expected {97 f5}, got {00 f5}, stored from store #17 (of 512, offset = 137) with address offset of 1)
15:27 karolherbst: like this
15:27 karolherbst: and the numbers change with each execution
15:28 imirkin_: do the tests do atomic things?
15:28 karolherbst: no
15:28 imirkin_: ok. atomics are funny since they bypass some caches
15:28 imirkin_: which can in turn lead to sadness
15:28 imirkin_: we don't handle that case very well
15:28 imirkin_: in large part because i don't have a solid handle on what the sadness is
15:29 imirkin_: (i.e. combo of atomic + regular access)
15:29 karolherbst: I see
15:30 RSpliet: imirkin_: from what I recall, NVIDIA doesn't handle that situation well either
15:30 karolherbst: the kernel is weird though
15:30 imirkin_: hehe
15:31 RSpliet: Not joking. I did some stuff with that back in 2013, could not enforce sequential consistency across work-groups.
15:31 karolherbst: https://gist.github.com/karolherbst/6a1c63472f15692a8944b143776994b7
15:32 imirkin_: RSpliet: yeah, just nice to know they also have issues
15:34 RSpliet: I suspect it's a "feature". You're not supposed to share data across work-groups, I don't think the OpenCL spec at the time gave any guarantees either.
15:34 karolherbst: RSpliet: why did they have barriers then
15:34 RSpliet: to synchronise the warps within a work-group
15:35 karolherbst: CL doesn't know warps
15:35 RSpliet: I know, but it does have a notion of work-items in a work-group requiring sync
15:35 karolherbst: sure
15:36 karolherbst: but the kernel I linked should execute successfully, no?
15:37 RSpliet: let me take a look.
15:37 karolherbst: I mean.. even nvidia needs to pass this test somehow
15:38 RSpliet: without looking at the launch parameters, indexing into a local buffer using a global TID is a red flag
15:39 RSpliet: no
15:39 RSpliet: wait, there's another level of indirection (offsets)
15:39 karolherbst: it's fine
15:39 RSpliet: so that's fine
15:39 karolherbst: block dimensions = 512x1x1
15:39 karolherbst: grid dimensions = 1x1x1
15:40 karolherbst: maybe I count barriers wrongly
15:40 karolherbst: we also have this barrier count thing
15:40 RSpliet: wait, with grid dimension you mean work-group dimensions?
15:41 karolherbst: block is the size of each block and grid is the size of the launched grid ;)
15:41 karolherbst: grid > block
15:41 RSpliet: oh... this terminology confuses OpenCL devs
15:42 RSpliet: But ok, so one work-group, 512 work-items.
15:42 karolherbst: well with barriers set to 2 it still doesn't work
15:42 karolherbst: *sigh*
15:44 karolherbst: ehh.. nvidia's CL implementation is also only slightly annoying about everything
15:46 karolherbst: okay.. if it doesn't want to load, I can't help it
15:54 RSpliet: karolherbst: it looks like quite the dodgy kernel to me. Lines 4-6 set parts of the local storage to 0 (no guarantee the whole thing is, the values in offsets[] are unknown to me), line 8 copies data into it at different locations (alignmentOffset away). And then there's an open-coded copy from local mem to global mem... each work-group reading the exact values it just wrote into sp[].
15:54 RSpliet: The barriers don't look problematic to me tbh.
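A rough sketch of the kernel shape described above (a hedged reconstruction, not the literal gist/CTS source; sp, offsets and alignmentOffset follow the names used in the conversation, and the buffer size is made up):

    #define SP_SIZE (2 * 512 + 1)

    __kernel void test_vstore2_local(__global const char2 *src,
                                     __global const uint *offsets,
                                     __global char *dst,
                                     uint alignmentOffset)
    {
        __local char sp[SP_SIZE];
        size_t tid = get_global_id(0);

        /* cooperatively zero the shared buffer (the real test zeroes
         * per-work-item slices; a loop is just simpler to read here) */
        for (size_t i = get_local_id(0); i < SP_SIZE; i += get_local_size(0))
            sp[i] = 0;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* vstore2 into local memory at a per-work-item offset, shifted
         * by alignmentOffset bytes so the store is only byte-aligned */
        vstore2(src[tid], offsets[tid], (__local char *)sp + alignmentOffset);
        barrier(CLK_LOCAL_MEM_FENCE);

        /* open-coded copy local -> global so the host can validate;
         * the slice a work-item copies back may have been written by a
         * different work-item, which is why the barriers matter */
        dst[2 * tid]     = sp[2 * tid + alignmentOffset];
        dst[2 * tid + 1] = sp[2 * tid + 1 + alignmentOffset];
    }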
15:55 karolherbst: yeah well.. it's part of the official CTS
15:56 RSpliet: Is there anything known about the values in the offsets[] array? Does it contain duplicates?
16:01 karolherbst: mhhhh, good question
16:02 karolherbst: RSpliet: https://github.com/KhronosGroup/OpenCL-CTS/blob/master/test_conformance/basic/test_vloadstore.cpp
16:02 karolherbst: test_vstore_local
16:03 karolherbst: ahh, it seems like it tests for duplicates inside offsets
16:04 karolherbst: or rather generates them in a way so there shouldn't be any
16:04 RSpliet: Yep, looks like it, good.
16:05 karolherbst: the same test works for other memory anyway
16:06 RSpliet: In that case the kernel is obfuscated, but should not be flawed
16:06 karolherbst: well, except private, but that's special
16:06 RSpliet: Got failing NVIDIA assembly?
16:06 RSpliet: or well... nouveau assembly...
16:07 karolherbst: https://gist.githubusercontent.com/karolherbst/f884e26dd84110a1892cb405a670efc2/raw/6a28144ae34a2f6f869f2726fa0b9c3a1d4897ce/gistfile1.txt
16:08 karolherbst: duhdaushdakjsdhkwlhdjk.1hkd1nw.m,dn1,mdn1d
16:08 karolherbst: I found it
16:08 karolherbst: guess what
16:09 RSpliet: st u8 instead of st u16?
16:09 karolherbst: NV50_PROG_SCHED=0
16:09 karolherbst: mhh, but still got one random fail with long16
16:11 karolherbst: but uhm... why
16:11 karolherbst: I mean, why does skipping sched make it work better
16:12 karolherbst: ohhh.. maybe it just changes timings
16:12 RSpliet: Oh, the compiler actually issues two "st u8 s[]" commands for that vstore2
16:12 karolherbst: yep
16:12 karolherbst: ohhhh
16:12 karolherbst: no
16:12 karolherbst: oh no oh no oh no oh no
16:12 karolherbst: duh
16:13 karolherbst: but that shouldn't matter, right?
16:14 karolherbst: RSpliet: the main reason is that codegen implicitly assumes vec4 alignment for loads/stores
16:14 RSpliet: In this case I don't think so. Just a tad wasteful from the ld/st perspective, but that's future work
16:14 karolherbst: so... optimizing those loads/stores breaks things
16:15 RSpliet: a vec4 of what?
16:16 karolherbst: inputs/outputs generally
16:16 karolherbst: but I also don't make use of alignment information inside nir -> codegen
16:16 RSpliet: No, I mean is that a vec4 of single-precision floats or chars?
16:16 karolherbst: so it's safer to split those
16:16 karolherbst: 32bit values
16:18 RSpliet: Ok. Yeah alignment constraints are a PITA if you don't know alignment up-front.
16:19 RSpliet: "The write address computed as (p + (offset * n)) must be 8-bit aligned if gentype is charn" - Khronos
16:20 karolherbst: yep
16:20 karolherbst: which is different than charn alignment
16:20 RSpliet: Indeed :-)
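To make that concrete, a small made-up example: the spec only requires the computed write address to be aligned to the element type, so a char2 vstore at an odd byte address is legal even though a char2 object itself is 2-byte aligned.

    __kernel void odd_aligned_store(__global char *p)
    {
        char2 v = (char2)(97, 42);
        /* write address is p + 1 + (0 * 2); only 8-bit alignment is
         * required for charn, so storing at an odd address is valid
         * even though a char2 lvalue would need 2-byte alignment */
        vstore2(v, 0, p + 1);
    }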
16:21 RSpliet: Anyway, hope I've been a helpful rubber duck / cardboard helpdesk for a bit. Back to my dissertation.
18:45 karolherbst: imirkin_: mhhhhh nvidia inserts NOPs
18:46 imirkin_: maybe nvidia knows something we don't
18:46 karolherbst: https://gist.githubusercontent.com/karolherbst/5aa0da659dd7f24009c5a6b97c2c38e3/raw/fb79cc99e05743be5d1142893576af7d0623755f/gistfile1.txt
18:46 karolherbst: the ones in the middle
18:46 karolherbst: but...
18:46 karolherbst: a little bit before they don't
18:47 imirkin_: yeah, dunno. the formatting of the bundles is all funny
18:47 karolherbst: no
18:47 karolherbst: it shows dual issuing
18:47 imirkin_: right. "all funny" :)
18:47 karolherbst: :)
18:47 karolherbst: I see
18:47 karolherbst: anyway.. those NOPs are weird
18:47 imirkin_: anyhoo ... i have no comment about those NOP's ... check if they have flags set
18:47 imirkin_: like, barrier flags
18:48 imirkin_: (we print all that in envydis)
18:48 karolherbst: nop 0x0
18:48 imirkin_: i mean the flags
18:49 imirkin_: i.e. the sched line
18:49 karolherbst: ohhh.. right
18:49 karolherbst: heh
18:49 karolherbst: the nops have stall counts
18:50 imirkin_: like 0xf stall counts?
18:50 karolherbst: no
18:50 karolherbst: the first one has 0xf, the second 0xb
18:51 imirkin_: welp, i've never seen that before
18:51 imirkin_: but i also have never looked at SM50+ with any level of care
19:05 namtsui: https://nouveau.freedesktop.org/wiki/IntroductoryCourse/
19:07 namtsui: what might be the motivation for understanding the compiler theory articles? will it help me understand gallium 3D or how shaders are compiled?
19:16 imirkin_: namtsui: that wiki page is fairly far out of date
19:16 imirkin_: what are you intereted in?
19:16 namtsui: good question. I should have led off with that.
19:16 imirkin_: i'm guessing it hasn't been updated much in the past decade or so
19:16 namtsui: I want to understand how nouveau works on linux in the hopes of porting it to openbsd
19:17 imirkin_: ah yeah, then pretty much 100% of that page is of zero use to you
19:17 namtsui: very far fetched and I am starting from 0 understanding right now. for now I just want to understand how nouveau works.
19:17 imirkin_: the simplest win is to make a "drm" adapter in openbsd -- this is what the other bsd's have done
19:18 imirkin_: at that point you can just copy the kernel modules over and let the good times roll
19:18 imirkin_: separately, nouveau is fairly separated from its "core" which controls the gpu, and a wrapper which adapts linux's drm interface requirements to what the core can do
19:18 imirkin_: if one wanted, one could just copy that core and write a different adapter around it
19:19 karolherbst: and port drm over
19:20 imirkin_: attempting to reimplement the drivers from scratch is going to be a losing strategy
19:20 imirkin_: unless you can bring massive resources to make it happen - rivaling the effort required to create the drivers on linux in the first place
19:24 namtsui: https://marc.info/?l=openbsd-misc&m=138311697223527&w=2
19:24 namtsui: this is the current state of openbsd graphics (project 3 - hard)
19:25 imirkin_: right
19:25 namtsui: sounds like the dri (drm is the kernel component, right?) infrastructure from linux 3.8 is in place
19:25 imirkin_: right, so you have drm from 3.8
19:25 imirkin_: so just grab nouveau from 3.8 and let the good times roll :)
19:25 imirkin_: it should be fairly easy
19:25 namtsui: excellent
19:26 imirkin_: nouveau doesn't need anything that radeon doesn't need
19:26 imirkin_: and if radeon already works, the conversion should be moderately straightforward
19:26 imirkin_: and anything tricky can probably just be commented out / removed
19:27 imirkin_: of course, that email is from 2013
19:27 imirkin_: hopefully they've pulled in a fresher drm than that?
19:27 namtsui: yes I think so
19:27 imirkin_: but yeah - figure out what kernel they grabbed it from, and just use nouveau from the same one
19:27 imirkin_: also, what GPU do you have?
19:27 namtsui: 770
19:28 namtsui: it works well with reclocking on linux
19:28 imirkin_: GTX 770?
19:28 namtsui: yeah
19:28 imirkin_: ok, then options 1 and 2 aren't for you :)
19:28 imirkin_: xf86-video-nv only supported up to the tesla series
19:28 namtsui: oh yeah option 1 is done already by the OP
19:28 imirkin_: oh ok
19:28 karolherbst: I've never looked into nv myself...
19:29 karolherbst: is it worth it?
19:29 imirkin_: define 'it'
19:29 karolherbst: well, to look at the code for fun
19:29 imirkin_: it's basically for pre-tesla and someone hacked g80 support onto it
19:29 imirkin_: it's a lot like the dispnv04 code
19:29 imirkin_: (dispnv04 = copy of -xv, with arbitrary changes to remove various chunks of code)
19:30 imirkin_: (unfortunately mixed in with fixes, so we can't ever know if a difference is good or bad)
20:14 karolherbst: heh.. it seems like the barrier has no effect
20:31 imirkin_: double check that it's emitted properly?
20:33 karolherbst: "BAR.SYNC RZ, RZ" nouveau vs "BAR.SYNC 0x0" nvidia
20:34 imirkin_: figure out what bits are different?
20:34 imirkin_: we probably replace the 0x0 imm with RZ and shouldn't?
20:34 karolherbst: uhh, there are quite a few differences
20:35 imirkin_: right ... imm vs register variant
20:35 karolherbst: ahh, probably
20:36 karolherbst: what's interesting is that nvidia inserts the bar first, then the membar; mesa does it the other way around
20:38 imirkin_: i've never looked at that stuff
20:38 imirkin_: i was really happy when the piglits started passing
20:38 imirkin_: i think hakzsam actually did most of the compute bringup
20:39 imirkin_: i did an initial pass and then he filled in the gaps? or other way around? i forget.
20:39 imirkin_: he did do a bunch of work on it though
20:39 karolherbst: ehh.. what was this thing called that replaces 0 with $r255
20:39 imirkin_: replaceZero()
20:39 karolherbst: ohh, it's not part of RA
20:39 imirkin_: no
20:39 imirkin_: after RA
20:39 imirkin_: on nv50, there is no "zero" register
20:40 imirkin_: so we just use $r63 so that we can use short encodings
20:40 imirkin_: however if r63 is legitimately used, we move to $r127
20:40 imirkin_: (which is obviously determined after RA)
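A conceptual sketch of the rule imirkin_ describes (not the actual nv50_ir replaceZero() code; the helper name is invented): after RA, the stand-in register for constant zero is chosen depending on whether $r63 is still free.

    /* pick which register stands in for the constant zero on nv50:
     * $r63 keeps short encodings available, $r127 is the fallback
     * when RA legitimately handed out $r63 */
    static int pick_zero_reg(int max_gpr_used)
    {
        return (max_gpr_used < 63) ? 63 : 127;
    }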
20:41 karolherbst: okay... but at least that didn't change a thing
20:42 imirkin_: might be hardcoded in the BAR emission
20:42 karolherbst: ohh, no, I meant using the imm variant didn't change anything
20:42 imirkin_: oh ok
20:42 imirkin_: do the bits match now?
20:42 karolherbst: zes
20:42 karolherbst: *yes
20:43 imirkin_: xes!
20:44 karolherbst: mhh.. although emitting the barriers makes things a bit better
20:44 karolherbst: disabling sched has a bigger impact though
20:44 karolherbst: and disabling opt + sched makes it nearly bug-free
20:44 imirkin_: i'm sure throwing like 30 nop's in between each op will fix it too
20:44 imirkin_: maybe those stalls weren't for nothing
20:45 karolherbst: well.. nvidia throws in two
20:45 karolherbst: but...
20:45 karolherbst: I will try swapping the bar and membar
20:45 karolherbst: maybe that does something
20:45 imirkin_: oh, i wonder...
20:45 imirkin_: the STS is a "deferred" instruction -- you normally set a bar on it
20:46 imirkin_: and then the use will effectively wait for it to complete
20:46 imirkin_: i wonder if BAR / MEMBAR don't properly wait for all those to actually complete
20:48 karolherbst: I can't comment on that
20:49 imirkin_: :)
20:49 karolherbst: anyway.. will try out the ordering thing, that might help.. hopefully
21:03 edgecase: I hear about "bar" a lot, is this related to PCI address decode BAR?
21:04 imirkin_: edgecase: no
21:04 imirkin_: http://docs.gl/sl4/barrier
21:07 edgecase: is there any chance I can think of, say, GPU discrete ram as a pool, where TTM is the equivalent of malloc()? I'm afraid to ask for documentation for something RE'd like nouveau.
21:07 imirkin_: sorta
21:07 imirkin_: more like ...
21:08 imirkin_: GPU needs access to certain resources
21:08 imirkin_: those resources may be in system RAM or GPU RAM
21:08 imirkin_: the GPU has a VM which enables a particular VA to point to either one
21:08 imirkin_: and the GPU has a finite amount of ram (sadly)
21:08 imirkin_: so you need to be able to move stuff around
21:08 imirkin_: ttm helps handle some of that
21:09 HdkR: imirkin_: A dual epyc system has a finite amount of RAM too, just happens to be 8TB :P
21:09 imirkin_: pretty sure infinity = 4TB
21:09 edgecase: but not a finite number of shader types... we have a tessellation control shader now, have we?
21:09 imirkin_: so this is like 2 * infinity!
21:09 * edgecase puts finger to head and pulls trigger
21:09 imirkin_: edgecase: yes, finite number of shader types
21:09 imirkin_: but you can submit a job that runs many shaders (or the same shader many times)
21:10 imirkin_: which needs access to various resources
21:10 imirkin_: all this computation is happening asynchronously from the cpu
21:10 imirkin_: so you need to know when that is done before migrating those resources off of vram
21:10 edgecase: well i thought the earth was flat, and that geometry shader was the edge, until now.
21:10 imirkin_: funny enough, geometry shader *is* the edge
21:10 imirkin_: tess happens before geometry ;)
21:10 edgecase: omg
21:11 * edgecase puts on clown suit
21:11 imirkin_: after geometry you have primitive assembly and rasterization
21:12 edgecase: ok, so your malloc() also lets you pin/unpin things in vram, swap them to/from host ram
21:12 imirkin_: HdkR: where was that pdf that shows like the full GL pipeline... the giant one?
21:12 imirkin_: having trouble searching for it
21:13 edgecase: well, GL in general, there are resources, tutorials, you know the rainbow-coloured triangle etc.
21:14 HdkR: imirkin_: I've seen too many different ones of those graphs to know which one you want :P
21:14 imirkin_: there's a specific thing i'm looking for
21:14 pendingchaos: imirkin_: https://www.g-truc.net/doc/OpenGL%204.4%20Pipeline%20Map.pdf ?
21:14 imirkin_: YES
21:14 imirkin_: pendingchaos: thank you :)
21:14 HdkR: ah right
21:15 HdkR: Need a new one for mesh shading now
21:15 imirkin_: lol
21:15 edgecase: can I call Tank and have him just download this into my brain (while I'm in The Matrix)?
21:15 imirkin_: that just replaces everything, right?
21:15 imirkin_: edgecase: only frank the tank
21:16 HdkR: https://devblogs.nvidia.com/wp-content/uploads/2018/09/meshlets_pipeline.png Most of it yea
21:16 imirkin_: and you might not want to download what he has to say :)
21:16 imirkin_: HdkR: right, i meant the whole vertex pipeline. obviously rast still seems like a thing
21:16 HdkR: yea
21:17 HdkR: Although I guess if you include the D3D model, then RT ends up escaping from any shader stage, so that would be fun to represent on that graph
21:17 edgecase: you know, what they should really do, is replace it all with one block "graphics", with annotation "Nobody kumz in here but nVidiaz!"
21:19 edgecase: imirkin_, am I wrong to think of the GPU as a bus-master like a NIC doing SG-DMA? except it's got its own VM and CPU
21:19 imirkin_: yeah, basically a GPU is just a fancy NIC
21:19 imirkin_: DisplayPort is pretty much a networking protocol :)
21:19 edgecase: at the TTM level, everything is just a buffer, mesa packs instruction streams into buffers,
21:20 edgecase: with references to textures, in other buffers, etc.?
21:20 imirkin_: yep
21:21 edgecase: well I want to learn the pipeline stuff, because well I want to possess *all* knowledge, but this buffer management has my attention due to boogz
21:22 imirkin_: bugs in buffer management are basically undebuggable
21:22 imirkin_: requires a lot of dedication and effort and luck
21:22 imirkin_: basically the GPU is a remote computer
21:23 imirkin_: which doesn't have gdb :)
21:23 edgecase: choosing to put buffer in host vs vram is an optimization, it should still give correct result, but slower, if chosen wrong?
21:23 imirkin_: usually.
21:23 imirkin_: some things _must_ be in vram
21:23 edgecase: unless a buffer moves but pointers aren't updated
21:23 imirkin_: well, for better or worse, nouveau never moves between vram and "gart" memory
21:23 imirkin_: but things do get evicted from vram
21:24 edgecase: gart is iommu in some cases?
21:24 imirkin_: gart meant something back in the day
21:24 edgecase: AGP days
21:24 imirkin_: but today it's "system memory accessible by the gpu"
21:25 edgecase: used to be on host bridge, GART
21:25 edgecase: is that the VM on GPU you mentioned?
21:25 karolherbst: VM
21:25 karolherbst: VM != RAM/VRAM
21:26 edgecase: i assume he means address translation block?
21:26 karolherbst: VM is just a view on memory and the GPU and CPU have their own views
21:26 edgecase: sure, CPU has MM unit, page tables, etc. what has GPU got?
21:26 karolherbst: same
21:27 edgecase: nv has its own, or uses the system IOMMU?
21:27 karolherbst: the GPU has its own MMU
21:27 HdkR: Every recent desktop GPU has its own MMU, even most mobile ones do
21:27 edgecase: ok, so GL "programs" work in the GPU's virtual-memory space
21:28 edgecase: so, buffers are relocatable, without changing "code", but rather update GPU VM mappings?
21:29 imirkin_: exactly.
21:29 imirkin_: frequently you'll have explicit VA's in command sequences as well as other buffers
21:29 imirkin_: and those are static throughout the lifetime of those buffers
21:32 edgecase: so logically, TTM's generic "malloc()-like" heap management routines must work with nouveau's nv-specific VM mapping hardware routines.
21:33 imirkin_: right
21:33 imirkin_: and keep in mind that nouveau supports lots of different hardware
21:33 imirkin_: not all of which has a VM available
21:33 edgecase: mine is nv50
21:33 imirkin_: VMs became a thing with nv50
21:33 imirkin_: the original G80 itself has some extra-special restrictions
21:39 edgecase: so, [TTM] failed to evict blah blah, i should be able to at least see what type of objects are filling up whatever is full?
21:41 edgecase: are there any nice diagrams of nv(50) VM?
21:42 imirkin_: it's pretty standard
21:42 imirkin_: PDE / PTE
21:43 karolherbst: edgecase: TTM doesn't know as there are no types of buffers
21:43 imirkin_: various bits indicate whether it's GPU or CPU memory
21:43 karolherbst: buffer types are indicated by their use
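Purely to illustrate the "pretty standard PDE / PTE" structure mentioned above, a generic two-level walk in C; the table sizes and flag bits here are invented for the sketch and do not match the real nv50 MMU layout:

    #include <stdint.h>

    #define ENT_PRESENT  (1ull << 0)           /* entry is valid */
    #define ENT_VRAM     (1ull << 1)           /* page lives in VRAM rather than system RAM */
    #define ENT_ADDR(e)  ((e) & ~0xfffull)     /* page-aligned target address */

    /* walk a page directory (PDEs) and then a page table (PTEs)
     * to turn a GPU virtual address into a physical one; 0 = fault */
    static uint64_t translate(const uint64_t *pd, uint64_t va)
    {
        uint64_t pde = pd[(va >> 21) & 0x7ff];                 /* page directory entry */
        if (!(pde & ENT_PRESENT))
            return 0;
        const uint64_t *pt = (const uint64_t *)(uintptr_t)ENT_ADDR(pde);
        uint64_t pte = pt[(va >> 12) & 0x1ff];                 /* page table entry */
        if (!(pte & ENT_PRESENT))
            return 0;
        return ENT_ADDR(pte) | (va & 0xfff);                   /* page base + in-page offset */
    }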
21:45 edgecase: TTM's user (nouveau?) must have various lists of objects by type, that it allocated using TTM?
21:45 edgecase: i mean, framebuffer (plane?), that accounts for a few yes?
21:46 karolherbst: there are no types
21:46 karolherbst: as there is also no different kind of memory for system ram
21:47 edgecase: ok maybe i can look at it another way, the routine that allocated the memory; say there's 1000 buffers allocated for textures? or the routine that's freeing up stuff?
21:47 imirkin_: there are "kinds", which are specified at the PTE level, and indicate somewhat different types of buffers
21:48 edgecase: for (i = 0; i < 1000; i++) { ttm_alloc_vram(thing); }
21:48 karolherbst: no, you allocate a buffer, fill it with data, then use it for texturing and then other stuff
21:48 edgecase: ttm doesn't track who called it, but all 1000 of those are the same "type" yes?
21:48 edgecase: ok
21:49 edgecase: like asking, what "type" of data is on a hard drive. answer: 160GB of it.
21:49 edgecase: or, answer: files
21:50 edgecase: PTE type just means, write-thru vs write-back, prefetch, those kind of "types" ?
21:50 edgecase: page attributes?
21:58 karolherbst: edgecase: the type of the data on a hard drive is determined by how you interpret the existing bits
21:58 karolherbst: there is no label on the file directly
21:58 karolherbst: PTE is just the type of page table
21:58 karolherbst: there are several levels to be able to allocate bigger pages at once
21:58 karolherbst: and reduce overhead and stuff for huge allocations
21:58 karolherbst: there are also properties like security flags and stuff
21:59 karolherbst: but that's all derived from what the MMU supports
21:59 imirkin_: ttm is generic - it's meant to assist in performing these things
21:59 imirkin_: it doesn't know about ... almost anything
21:59 imirkin_: it just keeps track of where buffers are at
21:59 imirkin_: and calls driver functions to move buffers around
22:00 imirkin_: you should look at what could cause eviction to fail
22:00 imirkin_: tbh i can't think of too much
22:00 edgecase: ah, so the generic ttm-malloc would call a nouveau hook to update VM PTE/PDE, flush caches, etc
22:00 karolherbst: actually, there was a ttm talk at fosdem this year
22:02 imirkin_: edgecase: something like that
22:02 imirkin_: there are memory manager objects
22:02 imirkin_: it's actually fairly confusing :)
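A hypothetical sketch of that split (invented names, not the real TTM interface): the generic manager decides that a buffer has to move, and a driver hook does the hardware-specific copy and VM update.

    #include <stdint.h>

    struct buffer {
        int placement;            /* e.g. VRAM or system memory */
        uint64_t gpu_va;          /* GPU virtual address stays stable */
    };

    struct driver_ops {
        /* copy the contents and repoint the GPU page tables */
        int (*move)(struct buffer *bo, int new_placement);
    };

    static int evict(const struct driver_ops *ops, struct buffer *bo)
    {
        /* generic code only knows *that* the buffer must leave VRAM;
         * the driver callback knows *how* */
        return ops->move(bo, 0 /* system memory */);
    }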
22:04 karolherbst: isn't an eviction fail also caused when all other pages are in use and can't be paged out to system memory?
22:04 karolherbst: aka OOM
22:05 edgecase: drm/ttm/ttm_bo_vm.c sounds interesting
22:06 edgecase: karolherbst, OOM sounds possible, there are 3 DRM clients, gnome-shell, Battle.net.exe and StarCraft.exe, and only 512M VRAM, maybe one or all of them is wasting vram-pinned buffers.
22:07 edgecase: is there accounting per DRM client of vram used?
22:08 karolherbst: I don't think so
22:08 edgecase: maybe try non-compositing WM to save ram?
22:08 karolherbst: would be cool to have it though
22:09 edgecase: shouldn't be too hard, just add a var to each client's struct when they open: mem += alloc, mem -= free
22:09 karolherbst: yeah well...
22:09 karolherbst: what if an application has 6 clients
22:10 karolherbst: but sure
22:10 edgecase: well having to add 6 numbers, beats having nothing
22:10 karolherbst: also it needs to be tracked on the kernel side
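A hypothetical sketch of that accounting idea (invented names, not existing nouveau code; the real version would have to live kernel-side and hook the buffer alloc/free paths):

    #include <stdint.h>

    /* per-DRM-client VRAM accounting, as suggested above */
    struct client_vram_stats {
        uint64_t vram_used;       /* bytes currently resident in VRAM */
    };

    static void account_alloc(struct client_vram_stats *s, uint64_t size)
    {
        s->vram_used += size;     /* mem += alloc */
    }

    static void account_free(struct client_vram_stats *s, uint64_t size)
    {
        s->vram_used -= size;     /* mem -= free */
    }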
22:11 edgecase: another thing I don't get is gpu context switching. I assume it's happening: client A does all their 3d setup and renders into a texture, client B also, then the compositor renders them into the framebuffer...
22:11 karolherbst: right, we have hw contexts on the GPU
22:12 edgecase: they just round-robin?
22:12 karolherbst: and each application allocates their own
22:12 karolherbst: edgecase: we have firmware doing context switching
22:12 edgecase: each client has their own cmd ring, textures, etc. is there a lot of GPU state to context-switch?
22:12 karolherbst: since 2nd gen maxwell we need to use signed firmware, so we don't know what nvidia's firmware is doing
22:12 karolherbst: but ours for older hw is open source
22:13 edgecase: hmm well however large it is, it's swapped to VRAM, is that one of the things that can't go to host ram?
22:13 karolherbst: it's not big enough to matter
22:13 karolherbst: and you want that to be in VRAM for speed
22:13 edgecase: ok, so bottom line, context switching happens, multiple drm clients are suported.
22:14 edgecase: most of the client context is in the buffer objects yes?
22:14 karolherbst: unknown
22:14 karolherbst: the client can also provide buffers to the runtime
22:14 karolherbst: and they get shadow copied or whatever
22:14 edgecase: i mean, many megabytes of stuff allocated by drm client, vs small context switched by firmware
22:15 karolherbst: ohh, sure
22:15 karolherbst: the context isn't that big
22:15 karolherbst: and it depends on the GPU
22:15 edgecase: man the firmware was REd and rewritten? that's insane
22:15 karolherbst: well, had to be done
22:15 edgecase: really old GPUs didn't context switch very well?
22:16 edgecase: assume only 1 client, a game or CAD app?
22:18 karolherbst: each application gets only one context anyway
22:18 karolherbst: we don't have as many
22:20 karolherbst: imirkin_: this shared memory issue is annoying :( it just makes no sense
22:20 karolherbst: unless we also have this bug in gl but just by luck never hit it
22:27 edgecase: is nouveau using GEM in front of TTM?
22:27 karolherbst: imirkin_: what if the bar.sync needs to be the last instruction within a block of three?
22:28 imirkin_: karolherbst: could be!
22:28 karolherbst: nope
22:28 imirkin_: edgecase: all drm drivers use GEM
22:28 karolherbst: tried to compile code with O0
22:28 karolherbst: so there is a bar sync as the second one :(
22:29 imirkin_: edgecase: GEM provides handles to objects which can be passed around
22:29 edgecase: and some (with vram?) cards use TTM for the actual objects?
22:29 imirkin_: yes
22:29 imirkin_: e.g. i965 doesn't use ttm
22:29 imirkin_: er
22:30 imirkin_: i915 doesn't use ttm
22:30 imirkin_: while radeon and nouveau do
22:30 imirkin_: basically if you don't have vram, ttm doesn't do much for you
22:31 edgecase: is there a way to look at vram layout, ie where are framebuffers, firmware, buffer objects, etc?
22:33 imirkin_: no
22:36 edgecase: i see /proc/<pid>/maps tracks GEM and "ttm swap" perhaps
22:37 edgecase: but those are cpu virtual addresses. perhaps building blocks tho
22:37 imirkin_: also even figuring out what process those buffers belong to isn't an exact science
22:37 imirkin_: with DRI2, the fd for nouveau comes from the X server
22:37 imirkin_: so a lot of the accounting goes to Xorg
22:37 imirkin_: with DRI3, the client does the allocations directly
22:38 edgecase: will I be shot if I mention Wayland?
22:38 karolherbst: and with prime things get even more complicated as you can pass buffers around
22:40 * edgecase buried head in sand
22:40 edgecase: hmm, cat /sys/kernel/debug/dri/*/i915_gem_objects
22:41 imirkin_: this stuff isn't impossible to do
22:41 imirkin_: radeon has a radeontop thing too, no idea exactly what it does
22:41 imirkin_: just requires someone who cares to put in the effort
22:41 edgecase: i was thinking about radeontop. would be neat to show mem acct per client, maybe even execution times etc
22:42 edgecase: yeah, someone with a bug
22:42 edgecase: an itch that needs scratching
23:27 edgecase: much to study. thanks for the guided tour!
23:51 imirkin: edgecase: feel free to ask more questions! although note that karol and i are less expert on the kernel side of the code. skeggsb is the kernel guru