17:24edgecase: so is all of VRAM always mapped to CPU's address space? PCIe BARs, MMIO window, etc?
17:25edgecase: kernel seems to just use any BO as framebuffer with its cpu address
17:25edgecase: with KMS add_fb2
17:29imirkin_: edgecase: generally no
17:29imirkin_: but there's cleverness which allows it all to be accessed
17:31edgecase: so, virtually yes?
17:32imirkin_: yeah, it just moves the window around in the fault handler, or something along those lines
17:32edgecase: i'm thinking, PCI BARs are setup, but kernel page tables not always?
17:33imirkin_: yeah, but the BARs are only so wide
17:35edgecase: oh, on 32bit cpus, "640k ought to be enough for anyone"
17:36edgecase: in theory, 64bit cpu could have it all decoded on PCI as one big window?
17:36imirkin_: BAR width is controlled by the card itself
17:36imirkin_: as well as restrictions on the path to the root complex
17:39edgecase: oh well. so are textures moved to/from VRAM by DMA always, or does host CPU access VRAM also?
17:39karolherbst: imirkin_: I am currently wondering if there is a DMA controller involved for accesses to VRAM from the CPU or not.. never looked into that in detail
17:40imirkin_: well, by definition, DMA is not-cpu accessing system memory
17:40karolherbst: mhh.. true
17:40edgecase: dma engine in gpu, could have large mapping of host ram, but secured by iommu/gart?
17:41imirkin_: usually you try to use DMA as much as possible
17:41imirkin_: to let the CPU do useful things
17:41edgecase: karolherbst, did you mean accesses to host ram from GPU?
17:41imirkin_: rather than wait on bits to move around
17:41karolherbst: no, the other way around
17:41karolherbst: I am wondering how VRAM gets advertised to the CPU
17:41edgecase: plus DMA wouldn't need host cpu to setup/teardown windows
17:41imirkin_: karolherbst: you just access special memory regions
17:42edgecase: cat /proc/iomem ?
17:42imirkin_: which get decoded by the memory controller to go to the PCI root or whatever
17:42karolherbst: imirkin_: okay, so we already have the physical addresses and such..
17:42karolherbst: okay.. cool
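A minimal sketch of reading the physical address map mentioned above, assuming the usual `start-end : name` layout of /proc/iomem with two spaces of indentation per nesting level. The sample text and the PCI address 0000:01:00.0 are made up for illustration, and `parse_iomem` is a hypothetical helper. Unprivileged reads of the real file typically show zeroed addresses.

```python
import re

# One /proc/iomem line: optional indentation, hex start, hex end, owner name.
IOMEM_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def parse_iomem(text):
    """Return a list of physical address regions found in iomem-style text."""
    regions = []
    for line in text.splitlines():
        match = IOMEM_LINE.match(line)
        if not match:
            continue  # skip anything that does not look like a region line
        indent, start_hex, end_hex, name = match.groups()
        start, end = int(start_hex, 16), int(end_hex, 16)
        regions.append({
            "depth": len(indent) // 2,   # nesting level (assumed 2 spaces each)
            "start": start,
            "end": end,
            "size": end - start + 1,     # the end address is inclusive
            "name": name.strip(),
        })
    return regions


# Made-up sample: a 16 MiB region and a 256 MiB region owned by one device.
SAMPLE = """\
c0000000-dfffffff : PCI Bus 0000:00
  ce000000-ceffffff : 0000:01:00.0
  d0000000-dfffffff : 0000:01:00.0
"""

if __name__ == "__main__":
    for region in parse_iomem(SAMPLE):
        print(region["name"], hex(region["start"]), region["size"] // (1024 * 1024), "MiB")
```

These are physical address ranges; whether any of them is currently mapped by CPU page tables is a separate question.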
17:42imirkin_: i mean, it can work however
17:42edgecase: when you said the windows are only so big, which windows?
17:43imirkin_: the GPU could provide an aperture into VM
17:43imirkin_: but then which VM, etc
17:43imirkin_: edgecase: lspci -v -- look for BAR sizes
17:43edgecase: ah iomem is page tables
17:44karolherbst: physical addresses have nothing to do with pages per se. Memory pages only make sense in regards to virtual memory
17:44karolherbst: or physical memory
17:45edgecase: i mean, /proc/iomem shows things mapped by page tables, vs lspci, which is PCI decoded, but may or may not be mapped by cpu page tables
17:45karolherbst: _but_ if you map iomem into your virtual memory, then you've got page table entries declaring those mappings
17:45karolherbst: edgecase: no
17:45edgecase: ok my lspci shows, 256M largest BAR, but I have 512M VRAM
17:45karolherbst: iomem shows physical memory
17:45imirkin_: which is why you can't have it all accessible at once =]
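A toy model of why a BAR smaller than VRAM still lets everything be reached, just not all at once. It assumes the 2:1 ratio from the lspci output above (256M BAR, 512M VRAM), scaled down to KiB so the simulation stays cheap. `WindowedVram` and its methods are hypothetical names, and this is a plain simulation of "moving the window around", not how nouveau actually does it.

```python
class WindowedVram:
    """A small CPU-visible aperture that slides over a larger VRAM."""

    def __init__(self, vram_size, window_size):
        assert vram_size % window_size == 0
        self.vram = bytearray(vram_size)  # stand-in for device memory
        self.window_size = window_size
        self.window_base = 0              # VRAM offset the aperture exposes now
        self.remaps = 0                   # how often the window had to move

    def _ensure_mapped(self, vram_offset):
        """Move the window if needed; return the offset inside the window."""
        base = vram_offset - (vram_offset % self.window_size)
        if base != self.window_base:
            self.window_base = base
            self.remaps += 1
        return vram_offset - self.window_base

    def write8(self, vram_offset, value):
        offset = self._ensure_mapped(vram_offset)
        self.vram[self.window_base + offset] = value

    def read8(self, vram_offset):
        offset = self._ensure_mapped(vram_offset)
        return self.vram[self.window_base + offset]


if __name__ == "__main__":
    KIB = 1024
    vram = WindowedVram(512 * KIB, 256 * KIB)
    vram.write8(10, 1)           # lower half, window already there
    vram.write8(300 * KIB, 7)    # upper half, window has to move
    print(vram.read8(10), vram.read8(300 * KIB), vram.remaps)
```

Every access that lands outside the current window costs a remap, which is one reason bulk transfers are better left to DMA.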
17:46edgecase: userspace has to mmap it, but kernel has a permanent map I think?
17:46karolherbst: in linux there is ioremap
17:46karolherbst: on the kernel
17:47karolherbst: which nouveau uses as well
17:47edgecase: so by default there's no page tables for kernel either?
17:47karolherbst: wouldn't make sense to map memory the kernel doesn't know what to do with
17:47edgecase: waste of page table pages etc
17:47karolherbst: so drivers are responsible for mapping whatever they need
17:49karolherbst: however you can access memory by its physical address inside the kernel directly, but it's discouraged to do so
17:50edgecase: in drivers/gpu/drm/tegra/drm.c there is some example debugfs stuff, looks useful to me
17:50edgecase: when adding a framebuffer, it uses the bo address after translating from GEM handle
17:51edgecase: i guess that address gets mapped somehow, when the bo is created?
17:52karolherbst: you don't have to map VRAM in order to use it, but if it's used by the CPU you have to map it first
17:53karolherbst: creating a bo is nothing more than allocating VRAM and adding page entries into the GPU's MMU
17:53edgecase: is there a flag to say "I want this bo accessible by CPU?"
17:54edgecase: i'm getting ahead of myself, haven't found that code yet
17:54karolherbst: no, you just map VRAM into your CPU's VM
17:54edgecase: so, 2 step process
17:55karolherbst: in userspace we have the drm_mmap syscall, but what it does on the kernel side I never looked into
17:55karolherbst: it's just mmap anyway
17:56karolherbst: mmap on an fd
17:57edgecase: i think i saw that... it uses fault handler to do a lot of the work
17:57karolherbst: and a fd you get from the kernel whenever you create a bo
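A small sketch of the "mmap on an fd" step described above, assuming an ordinary temporary file as a stand-in for whatever fd the driver provides. `poke_through_mapping` and `demo` are hypothetical helpers, and this runs on Unix only. It shows the generic mechanism, not the DRM-specific details.

```python
import mmap
import os
import tempfile


def poke_through_mapping(fd, length, index, value):
    """Map `length` bytes of `fd` shared and writable, then store one byte."""
    with mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                   prot=mmap.PROT_READ | mmap.PROT_WRITE) as mapping:
        mapping[index] = value  # a plain memory store through the mapping
        mapping.flush()


def demo():
    """Write through a mapping, then read the byte back through the fd."""
    with tempfile.TemporaryFile() as backing:
        backing.truncate(4096)  # give the file one page of zeroes
        poke_through_mapping(backing.fileno(), 4096, 16, 0xAB)
        return os.pread(backing.fileno(), 1, 16)


if __name__ == "__main__":
    print(demo())
```

The mmap call only sets up the mapping; the kernel usually fills in page table entries lazily, on first access.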
17:57karolherbst: why fault handler?
17:58edgecase: i skipped over why... not in my direct line of inquiry
17:58imirkin_: edgecase: there's a flag to auto-make a map, but it's the fault handler which makes those maps real
17:59karolherbst: ohh, right, we have a NOUVEAU_BO_MAP flag
18:00edgecase: is that what you use if you're making a BO for a framebuffer, that you want to write to from CPU? (fb console for example?)
18:00imirkin_: can't seem to find the actual page fault code
18:08edgecase: nouveau_dmem.c is moving pages to/from HOST/VRAM sending DMA commands to GPU
18:08imirkin_: dmem is the "device memory" stuff
18:08imirkin_: which actually allows the GPU to have page faults
18:09imirkin_: used for SVM
18:09imirkin_: and/or HMM
18:09edgecase: svm? hmm?
18:09imirkin_: hmm = heterogeneous memory management
18:09imirkin_: svm = shared virtual memory
18:10edgecase: so OUT_RING() adds them to tail of current cmd buffer, queueing them up?
18:11imirkin_: just adds to the cmd buffer
18:11imirkin_: there's another thing to actually submit
18:11edgecase: not guaranteed to execute until flush(fence) ?
18:11imirkin_: FIRE_RING maybe? also OUT_SPACE will optionally do that
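A toy illustration of the split described above: writing words into a command buffer is one step, handing them to the GPU is another. The method names echo OUT_RING and FIRE_RING from the discussion, but `PushBuffer` is a hypothetical class and real submission is more involved.

```python
class PushBuffer:
    """Commands accumulate in `pending` until they are explicitly fired."""

    def __init__(self):
        self.pending = []    # words written but not yet submitted
        self.submitted = []  # batches handed over to the simulated GPU

    def out_ring(self, word):
        """Append one 32-bit word to the buffer; nothing executes yet."""
        self.pending.append(word & 0xFFFFFFFF)

    def fire_ring(self):
        """Submit everything queued so far as one batch."""
        if self.pending:
            self.submitted.append(tuple(self.pending))
            self.pending.clear()


if __name__ == "__main__":
    push = PushBuffer()
    push.out_ring(0x1)
    push.out_ring(0x2)
    print(push.submitted)  # still empty, nothing has been fired
    push.fire_ring()
    print(push.submitted)
```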
18:12edgecase: each client has a ring? does driver have its own for mm housekeeping like this?
18:12imirkin_: it's slightly more involved, as you might imagine
18:13edgecase: doesn't sound like you can pre-empt cmd rings, more like cooperative multitasking?
18:14imirkin_: it's just a sequence of commands
18:14imirkin_: gpu does whatever it wants with them
18:14imirkin_: the gpu is just like a remote computer you can access over dialup
18:14imirkin_: really fast dialup :)
18:15edgecase: but if you have a ring of cmds, that moves around pages, you don't want to fire another ring that uses those pages, until they're actually there...
18:15imirkin_: why not
18:15edgecase: i guess with GPU side VM, pages can move, as long as you update mappings before going back to client ring cmds
18:16edgecase: you could use GPU mm fault handler, to do the dma, i guess, just the code i was looking at was doing immediate commands for memory moves
18:17edgecase:puts telephone receiver in acoustic coupler
18:17edgecase: AT DT 1-800-4NVIDIA
18:17edgecase: NO CONNECT ;<
18:17imirkin_: forgot the dial-around code
18:18edgecase: those jerks!
18:18imirkin_: as an aside - did you know that that's actually 101-0321? 101 = dial-around, 0321 = OCN of the handler of the call
18:19edgecase: what's the context, do I need a Captain Crunch whistle?
18:19imirkin_: oh, i figured you were old enough to remember that since you mentioned AT DT...
18:20edgecase: i couldn't afford long-distance BBSes
18:20imirkin_: there were these long-distance things in the US after the ILEC deregulation, to allow random companies to handle your long distance calling
18:20imirkin_: so you'd dial a prefix, and then the ultimate number, and then billing would go through that LEC
18:21imirkin_: 10-10-321 was a fairly popular one, esp since it was easy to remember
18:21edgecase: ah. CRTC mandated not having to do that, but anyone can setup a dialin where you get a 2nd dial-tone
18:24edgecase: ok, so I specifically don't want to debug page table mappings, host or cpu side, GEM layer... I think my target is VRAM allocations
18:26edgecase: which consists of firmware, and then TTM/bo allocated stuff, right? Does the firmware just reserve some part, then let TTM allocator manage the rest?
18:38edgecase: hmm i'm curious about firmware loading now.
18:41karolherbst: edgecase: the HMM stuff is only available for the compute engine
18:42karolherbst: and only matters for the GPU VM
18:42karolherbst: so the compute engine is able to recover from page faults, and that's what most of the code there is for (+ mirroring the CPU VM into the GPU's VM as well)
18:56edgecase: ic. says here pre-NV50 PCI BAR1 was direct mapped to all VRAM, NV50+ thru VM
18:56karolherbst: yep. as the PCI bar is too small for all VRAM
19:23edgecase: nouveau HMM you mentioned, is using the new Linux HMS?
19:24imirkin_: HMM is a new thing. SVM is a new thing. don't worry about either one - they don't affect your situation
19:35edgecase: Linux's HMM was initially called HMS, hence my confusion
19:35edgecase: pretty nice stuff tho
19:35edgecase: yeah page tables is too low level for what i want
22:30lovesegfault: karolherbst: do you know if this is a sane default? https://github.com/linrunner/TLP/blob/master/tlp.conf#L311-L318
22:31lovesegfault: Shouldn't runtime pm for the nouveau dev be enabled?
22:33karolherbst: the kernel driver enables it itself already
22:33karolherbst: the bigger problem is SOUND_POWER_SAVE_ON_AC=0
22:34karolherbst: as this will prevent GPUs from runtime suspending on AC when they also expose an audio device
22:34lovesegfault: Yeah, that insanity I already changed :)
22:34karolherbst: which actually also reduces performance on AC effectively
22:34karolherbst: ahh, cool
22:34lovesegfault: b/c you waste a zillion watts
22:34karolherbst: yeah.. and most cooling on laptops has a shared heat pipe for the CPU and GPU
22:36karolherbst: on *U CPUs all of that matters even more as those usually boost according to avg W
22:36lovesegfault: karolherbst: is there a way to find out what is keeping my GPU from suspending on AC?
22:36karolherbst: not really
22:36karolherbst: most of the time it's either because of the audio device or something is keeping it awake
22:36karolherbst: but mostly it's the audio stuff
22:37karolherbst: you should check if the audio device gets runtime suspended as well
22:37lovesegfault: how do I check?
22:38lovesegfault: I know
22:38lovesegfault: one moment
22:38lovesegfault: karolherbst: https://github.com/lovesegfault/nix-config/commit/dbd9140074e8f02cc093f4ba95cb2c4a03cf1b69
22:38lovesegfault: this solved it
22:38lovesegfault: for future reference
22:39karolherbst: ohh auto
22:39karolherbst: yeah, I guess it's best to keep to the kernel defaults here
22:41lovesegfault: yeah, TLP is kind of dumb
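One way to check whether the GPU and its audio function are runtime suspended is to read the `power/control` and `power/runtime_status` attributes in each device's sysfs directory. A minimal sketch, assuming the devices sit under /sys/bus/pci/devices/; the PCI addresses in the usage part are placeholders and `runtime_pm_state` is a hypothetical helper.

```python
from pathlib import Path


def runtime_pm_state(device_dir):
    """Read the runtime PM attributes of one device directory.

    Returns None for an attribute that is missing or unreadable.
    """
    power = Path(device_dir) / "power"

    def read(name):
        try:
            return (power / name).read_text().strip()
        except OSError:
            return None

    return {
        "control": read("control"),                # "auto" or "on"
        "runtime_status": read("runtime_status"),  # e.g. "active", "suspended"
    }


if __name__ == "__main__":
    # Placeholder addresses: function 0 is often the GPU, function 1 its audio.
    for address in ("0000:01:00.0", "0000:01:00.1"):
        print(address, runtime_pm_state(f"/sys/bus/pci/devices/{address}"))
```

A `control` value of "on" keeps a device awake; "auto" lets the kernel suspend it when idle.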
23:06edgecase: hey does nv50 have that FuC and RTOS?
23:07edgecase: trying to find where TTM calls nouveau, the one i found sends IOCTL to nvkm to do everything
23:07edgecase: but maybe nv50 does it a more direct way?
23:40imirkin: edgecase: nv50 does have falcons, but not for most things
23:44edgecase: i am lost trying to find out how gpu page tables are updated. to bootstrap, surely the host CPU accesses VRAM bypassing gpu vm?
23:45imirkin: of course
23:45imirkin: on nv50 it's also a bit different than nvc0+
23:45imirkin: the PDE's are in a fixed location in physical memory
23:45imirkin: so to "switch" VMs, you have to overwrite the PDE's
23:45imirkin: instead of having a CR3-style pointer
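A toy contrast of the two schemes described above: page directory entries at a fixed location that must be overwritten on every VM switch, versus a base-pointer register that is simply repointed. Both classes are hypothetical models for illustration, not nouveau code, and the entry values are arbitrary.

```python
class FixedPdeMmu:
    """Page directory lives in one fixed place; switching VMs copies entries in."""

    def __init__(self, num_pdes):
        self.pdes = [0] * num_pdes
        self.writes = 0  # how many entries had to be written

    def switch_vm(self, vm_pdes):
        for index, pde in enumerate(vm_pdes):
            self.pdes[index] = pde
            self.writes += 1


class PointerMmu:
    """Hardware follows a base pointer; switching VMs is one register write."""

    def __init__(self):
        self.base = None
        self.writes = 0

    def switch_vm(self, vm_pdes):
        self.base = vm_pdes  # repoint only, the entries stay where they are
        self.writes += 1


if __name__ == "__main__":
    vm_a, vm_b = [1, 2, 3, 4], [5, 6, 7, 8]
    fixed, pointer = FixedPdeMmu(4), PointerMmu()
    for vm in (vm_a, vm_b):
        fixed.switch_vm(vm)
        pointer.switch_vm(vm)
    print(fixed.writes, pointer.writes)
```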
23:46karolherbst: imirkin: how many entries do we have in the texture binding table?
23:46imirkin: textures? iirc 128. and 16 samplers on nv50, 32 on nvc0
23:47imirkin: infinity on kepler+
23:47karolherbst: CL requires 128 textures and 16 samplers :=
23:47imirkin: yeah, that's the DX10 requirement
23:47karolherbst: which.. you can mix and match in the kernel though
23:47imirkin: which is how we're set up
23:47imirkin: SVIEW vs SAMP in tgsi
23:47karolherbst: trying to get CL images to work
23:48imirkin: that said, iirc we're forcing a max of 16 textures on nv50
23:48imirkin: and 32 on nvc0
23:48imirkin: for not-so-great reasons
23:48karolherbst: yeah... I know
23:48karolherbst: sadly CL requires us to have 128 read only image args
23:48karolherbst: but only 8 writeable image args
23:49karolherbst: but... that shouldn't matter
23:49karolherbst: as we don't need support for 128 textures inside st/mesa
23:49karolherbst: I guess the driver might need some fixing here and there... oh well
23:49imirkin: yeah, only 8 images on nvc0
23:49imirkin: iirc PIPE_MAX_TEXTURES is 32
23:50karolherbst: CL images are like textures
23:50imirkin: ok, but ... as far as the driver is concerned, entirely different things
23:50imirkin: (and as far as the hw is concerned)
23:50karolherbst: read_imagef == texs lz 0x0 $r0 $r2 $r0 0x51 t2d r
23:50imirkin: on maxwell maybe
23:50imirkin: where the same descriptor is used for textures and images
23:51karolherbst: "TEX.LZ.P R2, R2, 0x51, 2D, 0x1" on kepler1
23:51karolherbst: I think..
23:51karolherbst: depends on what SM30 was
23:51karolherbst: but SM30 was kepler1..
23:51karolherbst: I think
23:51karolherbst: SM35 kepler2 and SM50 maxwell
23:51imirkin: so then they must just bind it both ways
23:51imirkin: what about SM20?
23:51imirkin:is curious now
23:51karolherbst: my cuda is too new
23:51karolherbst: SM30 is the lowest I can go
23:52karolherbst: I could install an older version in parallel though..
23:52karolherbst: mhh, let me check
23:52imirkin: it was idle curiosity on my part
23:52karolherbst: ohh wait
23:52karolherbst: I think I have it installed actually
23:55edgecase: imirkin, any idea where the code is that rewrites the PDEs?
23:56edgecase: i guess i could look for the pci BARs getting mmaped, and go from there
23:56edgecase: the PRAMIN one that bypasses VM
23:56imirkin: maybe https://github.com/skeggsb/nouveau/blob/master/drm/nouveau/nvkm/subdev/mmu/vmmnv50.c
23:57imirkin: tbh i dunno. nouveau is fairly heavily indirected to support multiple gens of stuff
23:57edgecase: nvkm is FuC side?
23:58imirkin: it's just an interface
23:58imirkin: fuc is in *.fuc files