17:24 edgecase: so is all of VRAM always mapped to CPU's address space? PCIe BARs, MMIO window, etc?
17:25 edgecase: kernel seems to just use any BO as framebuffer with its CPU address
17:25 edgecase: with KMS add_fb2
17:29 imirkin_: edgecase: generally no
17:29 imirkin_: but there's cleverness which allows it all to be accessed
17:31 edgecase: so, virtually yes?
17:32 imirkin_: yeah, it just moves the window around in the fault handler, or something along those lines
17:32 edgecase: i'm thinking, PCI BARs are setup, but kernel page tables not always?
17:33 imirkin_: yeah, but the BARs are only so wide
17:35 edgecase: oh, on 32bit cpus, "640k ought to be enough for anyone"
17:36 edgecase: in theory, 64bit cpu could have it all decoded on PCI as one big window?
17:36 imirkin_: BAR width is controlled by the card itself
17:36 imirkin_: as well as restrictions on the path to the root complex
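The "moves the window around in the fault handler" trick above can be sketched as a toy model (purely illustrative, not nouveau code): a BAR aperture smaller than VRAM, where an access outside the current window "faults" and the driver slides the window before retrying.

```python
# Toy model: a 256-unit BAR aperture over 512 units of "VRAM".
# An access outside the current window moves the window (the "fault").
VRAM_SIZE = 512   # stand-in for 512 MiB of VRAM
BAR_SIZE = 256    # stand-in for a 256 MiB BAR

class WindowedVram:
    def __init__(self):
        self.vram = bytearray(VRAM_SIZE)
        self.window_base = 0   # VRAM offset the BAR currently decodes to
        self.faults = 0        # how many times the window had to move

    def _resolve(self, addr):
        # Out-of-window access: slide the window, then the access proceeds.
        if not (self.window_base <= addr < self.window_base + BAR_SIZE):
            self.window_base = (addr // BAR_SIZE) * BAR_SIZE
            self.faults += 1
        return addr

    def read(self, addr):
        return self.vram[self._resolve(addr)]

    def write(self, addr, value):
        self.vram[self._resolve(addr)] = value

v = WindowedVram()
v.write(10, 0xAA)    # inside the initial window, no fault
v.write(300, 0xBB)   # outside the 256-unit window: window slides, one fault
print(v.read(10), v.read(300), v.faults)   # → 170 187 3
```

All of VRAM is reachable this way, just not all at once, which matches "there's cleverness which allows it all to be accessed."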
17:39 edgecase: oh well. so are textures moved to/from VRAM by DMA always, or does host CPU access VRAM also?
17:39 karolherbst: imirkin_: I am currently wondering if there is a DMA controller involved for accesses to VRAM from the CPU or not.. never looked into that in detail
17:40 imirkin_: well, by definition, DMA is not-cpu accessing system memory
17:40 karolherbst: mhh.. true
17:40 edgecase: dma engine in gpu, could have large mapping of host ram, but secured by iommu/gart?
17:41 imirkin_: yes
17:41 imirkin_: usually you try to use DMA as much as possible
17:41 imirkin_: to let the CPU do useful things
17:41 edgecase: karolherbst, did you mean accesses to host ram from GPU?
17:41 imirkin_: rather than wait on bits to move around
17:41 karolherbst: no, the other way around
17:41 karolherbst: I am wondering how VRAM gets advertised to the CPU
17:41 edgecase: plus DMA wouldn't need host cpu to setup/teardown windows
17:41 imirkin_: karolherbst: you just access special memory regions
17:42 edgecase: cat /proc/iomem ?
17:42 imirkin_: which get decoded by the memory controller to go to the PCI root or whatever
17:42 karolherbst: imirkin_: okay, so we already have the physical addresses and such..
17:42 imirkin_: yea
17:42 karolherbst: okay.. cool
17:42 imirkin_: i mean, it can work however
17:42 edgecase: when you said the windows are only so big, which windows?
17:43 imirkin_: the GPU could provide an aperture into VM
17:43 imirkin_: but then which VM, etc
17:43 imirkin_: edgecase: lspci -v -- look for BAR sizes
17:43 edgecase: ah iomem is page tables
17:43 karolherbst: no
17:44 karolherbst: physical addresses have nothing to do with pages per se. Memory pages only make sense in regards to virtual memory
17:44 karolherbst: or physical memory
17:45 edgecase: i mean, /proc/iomem shows things mapped by page tables, vs lspci, which is PCI decoded, but may or may not be mapped by cpu page tables
17:45 karolherbst: _but_ if you map iomem into your virtual memory, then you've got page table entries declaring those mappings
17:45 karolherbst: edgecase: no
17:45 edgecase: ok my lspci shows 256M as the largest BAR, but I have 512M VRAM
17:45 karolherbst: iomem shows physical memory
17:45 imirkin_: which is why you can't have it all accessible at once =]
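As a sketch of what /proc/iomem shows (physical address ranges, including PCI BARs claimed by a driver): the sample text below is invented, but the "start-end : description" format, with indentation for sub-ranges, matches the real file.

```python
# Parse /proc/iomem-style lines. The sample is hypothetical, but a
# 256 MiB nouveau BAR would show up as a range like this one.
sample = """\
e0000000-efffffff : PCI Bus 0000:01
  e0000000-efffffff : 0000:01:00.0
    e0000000-efffffff : nouveau
"""

def parse_iomem(text):
    regions = []
    for line in text.splitlines():
        span, _, desc = line.strip().partition(" : ")
        start, end = (int(x, 16) for x in span.split("-"))
        regions.append((start, end, desc))
    return regions

for start, end, desc in parse_iomem(sample):
    size_mib = (end - start + 1) // (1024 * 1024)
    print(f"{desc}: {size_mib} MiB")   # each range here is 256 MiB
```

Comparing these sizes against the VRAM size is exactly how the 256M-BAR-vs-512M-VRAM mismatch above shows up.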
17:46 edgecase: ok
17:46 edgecase: userspace has to mmap it, but kernel has a permanent map I think?
17:46 karolherbst: no
17:46 karolherbst: in linux there is ioremap
17:46 karolherbst: on the kernel
17:47 karolherbst: which nouveau uses as well
17:47 edgecase: so by default there's no page tables for kernel either?
17:47 karolherbst: right
17:47 edgecase: okey
17:47 karolherbst: wouldn't make sense to map memory the kernel doesn't know what to do with
17:47 edgecase: waste of page table pages etc
17:47 karolherbst: so drivers are responsible for mapping whatever they need
17:48 karolherbst: yes
17:49 karolherbst: however you can access memory by its physical address inside the kernel directly, but it's discouraged to do so
17:50 edgecase: in drivers/gpu/drm/tegra/drm.c there is some example debugfs stuff, looks useful to me
17:50 edgecase: when adding a framebuffer, it uses the bo address after translating from GEM handle
17:51 edgecase: i guess that address gets mapped somehow, when the bo is created?
17:52 karolherbst: you don't have to map VRAM in order to use it, but if it's used by the CPU you have to map it first
17:53 karolherbst: creating a bo is nothing more than allocating VRAM and adding page entries into the GPU's MMU
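That two-part description of BO creation can be modeled in miniature (a toy sketch; real TTM/nvkm allocation is far more involved): carve pages out of VRAM, then install entries in the GPU's page tables.

```python
# Toy model of "creating a bo": (1) allocate VRAM pages,
# (2) add page-table entries to the GPU's MMU. Illustrative only.
PAGE = 0x1000

class ToyGpuVm:
    def __init__(self, vram_pages):
        self.free = list(range(vram_pages))  # free VRAM page numbers
        self.ptes = {}                       # GPU VA page -> VRAM page

    def create_bo(self, gpu_va, npages):
        phys = [self.free.pop(0) for _ in range(npages)]   # step 1
        for i, page in enumerate(phys):
            self.ptes[gpu_va // PAGE + i] = page           # step 2
        return phys

vm = ToyGpuVm(vram_pages=16)
bo = vm.create_bo(gpu_va=0x100000, npages=4)
print(bo, len(vm.free))   # → [0, 1, 2, 3] 12
```

Note there is no CPU mapping anywhere in this model, matching "you don't have to map VRAM in order to use it."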
17:53 edgecase: is there a flag to say "I want this bo accessible by CPU?"
17:54 edgecase: i'm getting ahead of myself, haven't found that code yet
17:54 karolherbst: no, you just map VRAM into your CPU's VM
17:54 edgecase: so, 2 step process
17:55 karolherbst: in userspace we have the drm_mmap syscall, but what it does on the kernel side I never looked into
17:55 karolherbst: well...
17:55 karolherbst: it's just mmap anyway
17:56 karolherbst: mmap on an fd
17:57 edgecase: i think i saw that... it uses fault handler to do a lot of the work
17:57 karolherbst: and a fd you get from the kernel whenever you create a bo
17:57 karolherbst: why fault handler?
17:58 edgecase: i skipped over why... not in my direct line of inquiry
17:58 imirkin_: edgecase: there's a flag to auto-make a map, but it's the fault handler which makes those maps real
17:59 karolherbst: ohh, right, we have a NOUVEAU_BO_MAP flag
18:00 edgecase: is that what you use if you're making a BO for a framebuffer, that you want to write to from CPU? (fb console for example?)
18:00 imirkin_: can't seem to find the actual page fault code
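The "flag to auto-make a map, but the fault handler makes those maps real" idea can be modeled like this (a toy sketch, not the actual DRM/TTM code path): the map call only reserves a range, and backing is installed lazily on first touch.

```python
# Toy model of lazy mapping: mmap-like call reserves a range; pages
# become real only when the "fault handler" is hit on first access.
PAGE = 4096

class LazyMapping:
    def __init__(self, size):
        self.size = size
        self.pages = {}        # page index -> backing, filled on fault
        self.fault_count = 0

    def _fault(self, page_idx):
        # The "fault handler": back only the page that was touched.
        self.pages[page_idx] = bytearray(PAGE)
        self.fault_count += 1

    def write(self, offset, value):
        idx = offset // PAGE
        if idx not in self.pages:
            self._fault(idx)
        self.pages[idx][offset % PAGE] = value

m = LazyMapping(16 * PAGE)   # "map" 16 pages up front...
m.write(0, 1)
m.write(5 * PAGE + 7, 2)
print(len(m.pages), m.fault_count)   # → 2 2 (only touched pages are real)
```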
18:08 edgecase: nouveau_dmem.c is moving pages to/from HOST/VRAM sending DMA commands to GPU
18:08 imirkin_: dmem is the "device memory" stuff
18:08 imirkin_: which actually allows the GPU to have page faults
18:09 imirkin_: used for SVM
18:09 imirkin_: and/or HMM
18:09 edgecase: svm? hmm?
18:09 imirkin_: hmm = heterogeneous memory management
18:09 imirkin_: svm = shared virtual memory
18:10 edgecase: so OUT_RING() adds them to tail of current cmd buffer, queueing them up?
18:11 imirkin_: just adds to the cmd buffer
18:11 imirkin_: there's another thing to actually submit
18:11 edgecase: not guaranteed to execute until flush(fence) ?
18:11 imirkin_: FIRE_RING maybe? also OUT_SPACE will optionally do that
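A minimal model of the submission pattern described here (the OUT_RING/FIRE_RING names are borrowed from nouveau's macros; the behavior is illustrative only): OUT_RING just appends to the current command buffer, and nothing reaches the GPU until the ring is fired.

```python
# Toy pushbuf model: commands queue up locally until FIRE_RING submits.
class PushBuf:
    def __init__(self):
        self.pending = []     # accumulated, not yet submitted
        self.submitted = []   # what the "GPU" has actually been handed

    def OUT_RING(self, cmd):
        self.pending.append(cmd)

    def FIRE_RING(self):
        self.submitted.extend(self.pending)
        self.pending.clear()

pb = PushBuf()
pb.OUT_RING(0x1234)
pb.OUT_RING(0x5678)
assert pb.submitted == []          # queued, but not guaranteed to execute
pb.FIRE_RING()
assert pb.submitted == [0x1234, 0x5678]
```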
18:12 edgecase: each client has a ring? does the driver have its own for mm housekeeping like this?
18:12 imirkin_: yes
18:12 imirkin_: it's slightly more involved, as you might imagine
18:13 edgecase: doesn't sound like you can pre-empt cmd rings, more like cooperative multitasking?
18:14 imirkin_: it's just a sequence of commands
18:14 imirkin_: gpu does whatever it wants with them
18:14 imirkin_: the gpu is just like a remote computer you can access over dialup
18:14 imirkin_: really fast dialup :)
18:15 edgecase: but if you have a ring of cmds, that moves around pages, you don't want to fire another ring that uses those pages, until they're actually there...
18:15 imirkin_: why not
18:15 edgecase: i guess with GPU side VM, pages can move, as long as you update mappings before going back to client ring cmds
18:16 edgecase: you could use the GPU mm fault handler to do the dma, i guess, just the code i was looking at was doing immediate commands for memory moves
18:17 * edgecase puts telephone receiver in acoustic coupler
18:17 edgecase: AT DT 1-800-4NVIDIA
18:17 edgecase: NO CONNECT ;<
18:17 imirkin_: :)
18:17 imirkin_: forgot the dial-around code
18:18 imirkin_: 1010321!
18:18 edgecase: those jerks!
18:18 imirkin_: as an aside - did you know that that's actually 101-0321? 101 = dial-around, 0321 = OCN of the handler of the call
18:19 imirkin_: https://localcallingguide.com/lca_cic.php?cic=0321&acna=&entity=
18:19 edgecase: what's the context, do I need a Captain Crunch whistle?
18:19 imirkin_: oh, i figured you were old enough to remember that since you mentioned AT DT...
18:20 edgecase: i couldn't afford long-distance BBSes
18:20 imirkin_: there were these long-distance things in the US after the ILEC deregulation, to allow random companies to handle your long distance calling
18:20 imirkin_: so you'd dial a prefix, and then the ultimate number, and then billing would go through that LEC
18:21 imirkin_: 10-10-321 was a fairly popular one, esp since it was easy to remember
18:21 imirkin_: https://en.wikipedia.org/wiki/10-10-321
18:21 edgecase: ah. CRTC mandated not having to do that, but anyone can setup a dialin where you get a 2nd dial-tone
18:24 edgecase: ok, so I specifically don't want to debug page table mappings, host or cpu side, GEM layer... I think my target is VRAM allocations
18:26 edgecase: which consists of firmware, and then TTM/bo allocated stuff, right? Does the firmware just reserve some part, then let TTM allocator manage the rest?
18:38 edgecase: hmm i'm curious about firmware loading now.
18:41 karolherbst: edgecase: the HMM stuff is only available for the compute engine
18:42 karolherbst: and only matters for the GPU VM
18:42 karolherbst: so the compute engine is able to recover from page faults, and that's what most of the code there is for (+ mirroring the CPU VM into the GPU's VM as well)
18:56 edgecase: ic. says here pre-NV50 PCI BAR1 was direct mapped to all VRAM, NV50+ thru VM
18:56 karolherbst: yep. as the PCI bar is too small for all VRAM
19:23 edgecase: nouveau HMM you mentioned, is using the new Linux HMS?
19:24 imirkin_: HMM is a new thing. SVM is a new thing. don't worry about either one - they don't affect your situation
19:35 edgecase: Linux's HMM was initially called HMS, hence my confusion
19:35 edgecase: pretty nice stuff tho
19:35 edgecase: yeah page tables is too low level for what i want
22:30 lovesegfault: karolherbst: do you know if this is a sane default? https://github.com/linrunner/TLP/blob/master/tlp.conf#L311-L318
22:31 lovesegfault: Shouldn't runtime pm for the nouveau dev be enabled?
22:33 karolherbst: the kernel driver enables it itself already
22:33 karolherbst: the bigger problem is SOUND_POWER_SAVE_ON_AC=0
22:34 karolherbst: as this will prevent GPUs from runtime suspending on AC when they also expose an audio device
22:34 lovesegfault: Yeah, that insanity I already changed :)
22:34 karolherbst: which actually also reduces performance on AC effectively
22:34 karolherbst: ahh, cool
22:34 lovesegfault: b/c you waste a zillion watts
22:34 karolherbst: yeah.. and most cooling on laptops has a shared heat pipe for the CPU and GPU
22:36 karolherbst: on *U CPUs all of that matters even more as those usually boost according to avg W
22:36 lovesegfault: karolherbst: is there a way to find out what is keeping my GPU from suspending on AC?
22:36 karolherbst: not really
22:36 karolherbst: most of the time it's either because of the audio device or something is keeping it awake
22:36 karolherbst: but mostly it's the audio stuff
22:37 karolherbst: you should check if the audio device gets runtime suspended as well
22:37 lovesegfault: how do I check?
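One way to check from userspace is to read the standard `power/runtime_status` sysfs attribute of each PCI device; here is a small sketch (device names differ per machine, and the default path is just where PCI devices normally appear in sysfs):

```python
# Sketch: report runtime-PM status ("active"/"suspended") for PCI
# devices by reading the power/runtime_status sysfs attribute.
from pathlib import Path

def runtime_status(devices=Path("/sys/bus/pci/devices")):
    if not devices.is_dir():
        return {}
    out = {}
    for dev in sorted(devices.iterdir()):
        attr = dev / "power" / "runtime_status"
        if attr.is_file():
            out[dev.name] = attr.read_text().strip()
    return out

for name, status in runtime_status().items():
    print(name, status)
```

A discrete GPU that also exposes an HDA audio function shows up as two entries (e.g. 01:00.0 and 01:00.1); both need to reach "suspended" for the card to power down.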
22:37 lovesegfault: Oh
22:38 lovesegfault: I know
22:38 lovesegfault: one moment
22:38 lovesegfault: karolherbst: https://github.com/lovesegfault/nix-config/commit/dbd9140074e8f02cc093f4ba95cb2c4a03cf1b69
22:38 lovesegfault: this solved it
22:38 lovesegfault: for future reference
22:39 karolherbst: ohh auto
22:39 karolherbst: nice
22:39 karolherbst: yeah, I guess it's best to keep to the kernel defaults here
22:41 lovesegfault: yeah, TLP is kind of dumb
23:06 edgecase: hey does nv50 have that FuC and RTOS?
23:07 edgecase: trying to find where TTM calls nouveau, the one i found sends IOCTL to nvkm to do everything
23:07 edgecase: but maybe nv50 does it a more direct way?
23:40 imirkin: edgecase: nv50 does have falcons, but not for most things
23:44 edgecase: i am lost trying to find out how gpu page tables are updated. to bootstrap, surely the host CPU accesses VRAM bypassing gpu vm?
23:45 imirkin: of course
23:45 imirkin: on nv50 it's also a bit different than nvc0+
23:45 imirkin: the PDE's are in a fixed location in physical memory
23:45 imirkin: so to "switch" VMs, you have to overwrite the PDE's
23:45 imirkin: instead of having a CR3-style pointer
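The contrast can be sketched as a toy model (illustrative only): switching address spaces by repointing a register (the way x86's CR3 points at the page directory) versus rewriting page-directory entries that live at a fixed physical location, as described for nv50.

```python
# Two toy MMUs: pointer-style switch vs. fixed-slot PDE overwrite.
N_PDES = 4

class PointerStyleMMU:              # nvc0+-like: repoint and done
    def __init__(self):
        self.pd_pointer = None
    def switch(self, page_directory):
        self.pd_pointer = page_directory      # O(1)
    def current(self):
        return self.pd_pointer

class FixedSlotMMU:                 # nv50-like: PDEs at a fixed spot
    def __init__(self):
        self.fixed_pdes = [0] * N_PDES
    def switch(self, page_directory):
        for i, pde in enumerate(page_directory):
            self.fixed_pdes[i] = pde          # must overwrite each PDE
    def current(self):
        return self.fixed_pdes

vm_a = [0xA0, 0xA1, 0xA2, 0xA3]
vm_b = [0xB0, 0xB1, 0xB2, 0xB3]
m1, m2 = PointerStyleMMU(), FixedSlotMMU()
m1.switch(vm_a); m2.switch(vm_a)
m1.switch(vm_b); m2.switch(vm_b)
assert m1.current() == vm_b and m2.current() == vm_b
```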
23:46 karolherbst: imirkin: how many entries do we have in the texture binding table?
23:46 imirkin: textures? iirc 128. and 16 samplers on nv50, 32 on nvc0
23:47 karolherbst: ahh
23:47 karolherbst: perfect
23:47 imirkin: infinity on kepler+
23:47 karolherbst: CL requires 128 textures and 16 samplers :)
23:47 imirkin: yeah, that's the DX10 requirement
23:47 karolherbst: which.. you can mix and match in the kernel though
23:47 imirkin: right
23:47 imirkin: which is how we're set up
23:47 imirkin: SVIEW vs SAMP in tgsi
23:47 karolherbst: okay
23:47 karolherbst: cool
23:47 karolherbst: trying to get CL images to work
23:48 imirkin: that said, iirc we're forcing a max of 16 textures on nv50
23:48 imirkin: and 32 on nvc0
23:48 imirkin: for not-so-great reasons
23:48 karolherbst: yeah... I know
23:48 karolherbst: sadly CL requires us to have 128 read only image args
23:48 karolherbst: but only 8 writeable image args
23:49 karolherbst: but... that shouldn't matter
23:49 karolherbst: as we don't need support for 128 textures inside st/mesa
23:49 karolherbst: I guess the driver might need some fixing here and there... oh well
23:49 imirkin: yeah, only 8 images on nvc0
23:49 karolherbst: welll
23:49 imirkin: iirc PIPE_MAX_TEXTURES is 32
23:50 karolherbst: CL images are like textures
23:50 imirkin: ok, but ... as far as the driver is concerned, entirely different things
23:50 imirkin: (and as far as the hw is concerned)
23:50 karolherbst: read_imagef == texs lz 0x0 $r0 $r2 $r0 0x51 t2d r
23:50 karolherbst: :)
23:50 imirkin: on maxwell maybe
23:50 imirkin: where the same descriptor is used for textures and images
23:51 karolherbst: "TEX.LZ.P R2, R2, 0x51, 2D, 0x1" on kepler1
23:51 karolherbst: I think..
23:51 imirkin: ok
23:51 karolherbst: depends on what SM30 was
23:51 karolherbst: but SM30 was kepler1..
23:51 karolherbst: I think
23:51 karolherbst: SM35 kepler2 and SM50 maxwell
23:51 imirkin: so then they must just bind it both ways
23:51 imirkin: right
23:51 karolherbst: probably
23:51 imirkin: what about SM20?
23:51 * imirkin_ is curious now
23:51 karolherbst: my cuda is too new
23:51 karolherbst: SM30 is the lowest I can go
23:51 imirkin: hehe
23:52 karolherbst: I could install an older version in parallel though..
23:52 karolherbst: mhh, let me check
23:52 imirkin: it was idle curiosity on my part
23:52 karolherbst: ohh wait
23:52 karolherbst: I think I have it installed actually
23:55 edgecase: imirkin, any idea where the code is that rewrites the PDEs?
23:56 edgecase: i guess i could look for the pci BARs getting mmaped, and go from there
23:56 edgecase: the PRAMIN one that bypasses VM
23:56 imirkin: maybe https://github.com/skeggsb/nouveau/blob/master/drm/nouveau/nvkm/subdev/mmu/vmmnv50.c
23:57 imirkin: tbh i dunno. nouveau is fairly heavily indirected to support multiple gens of stuff
23:57 edgecase: nvkm is FuC side?
23:57 imirkin: no
23:58 imirkin: it's just an interface
23:58 imirkin: fuc is in *.fuc files