02:43 imirkin: skeggsb: any thoughts on the likely failure for the sparc guy?
02:44 airlied: is the answer in the question?
02:44 imirkin: :p
02:45 imirkin: airlied: https://bugs.freedesktop.org/show_bug.cgi?id=103284
02:45 imirkin: it dies right after successful creation of the disp object, while in nv50_display create logic.
02:46 airlied: the sparc oops logo is still the coolest
02:49 skeggsb:really should fix that failure path some day
02:49 skeggsb: it annoys me more frequently than i'd expect
02:49 skeggsb: as to why it's failing in the first place...
02:49 skeggsb:keeps reading
02:49 imirkin: it seems like if something fails *really* early, we fail at the destroy stage. but it's all pretty indirected, and i don't know how it's supposed to all work.
02:50 skeggsb: no, the oops is from a failure path going badly
02:50 imirkin: well, it goes to out:
02:50 imirkin: which then calls destroy
02:50 imirkin: which dies on the first line
02:51 imirkin: nv50_dmac_destroy(&disp->mast.base, disp->disp);
02:51 imirkin: pretty sure this fails
02:51 imirkin: perhaps because the list isn't initialized? dunno.
02:52 skeggsb: yeah, i'm aware that that's broken, i hit it semi-regularly.. i'm more interested why there's a failing bit in the first place
02:53 imirkin: my *theory* was that it's trying to do a 4K allocation and the platform has 8K pages.
02:53 skeggsb: yeah, i'm trying that theory now..
02:53 imirkin: this theory is based on ... nothing :)
02:53 skeggsb: i don't think so though, because nouveau_bo_new() will round to PAGE_SIZE
02:53 imirkin: and if you hit this, then it's for something else
02:54 imirkin: unless every so often your system happens to become a sparc :)
02:54 imirkin: some sort of boot init missing, sometimes it'll be a sparc, sometimes x86. ;)
02:55 skeggsb: haha... or, it fails because i break some crucial underlying code
02:55 skeggsb: or screw the GPU up enough that we can't init the EVO channels
09:03 karolherbst: does somebody have some time to proofread those project ideas or give suggestions (for others, changes, etc...)? https://gist.githubusercontent.com/karolherbst/f80890aad3983cd37a502888229d0978/raw/91cd15c577998129204ba734a634bc104b5ede86/gistfile1.txt
11:10 pmoreau: karolherbst: Just having a quick glance at your gist: For OpenCL, hardware required is Tesla+, as Tesla should support OpenCL 1.0. No need to constrain to Fermi+. And I’m trying to support Tesla+, even though Tesla is getting a bit old.
11:12 pmoreau: Could be good to use the same terminology as well: “Any NVIDIA G80 or newer” versus “Any NVIDIA Fermi or newer”. Not a big deal, but consistency.
11:14 pmoreau: For 1), I wouldn’t put it as less difficult than the OpenCL one, and compiler knowledge should be required.
11:16 karolherbst: pmoreau: well, the problem I have with the term "Tesla" is that some might confuse it with those super high end GPUs
11:16 karolherbst: pmoreau: you mean vulkan?
11:16 karolherbst: uhm or which entry did you mean
11:17 karolherbst: ahh "Instruction Scheduler"?
11:17 pmoreau: For OpenCL, everything is in C++, so no need to mention C. I would add “Compilers” as a useful skill as well, to understand what SSA is, and other changes to RA and the NVIR compiler.
11:17 pmoreau: Yes
11:17 pmoreau: 1st of the list
11:17 karolherbst: well, the first three are the ones which are already in the wiki
11:17 karolherbst: so I don't really plan to change those
11:17 karolherbst: unless there is a good reason
11:17 pmoreau: You use Tesla for the reclocking one ;-)
11:18 karolherbst: context is important, it will make sense after my changes
11:38 RSpliet: karolherbst: I wonder how hard insn scheduling really still is now that dboyan has done some important pre-work
11:39 RSpliet: And I've already played around a bit with heuristics that work (and some that don't work)
11:41 RSpliet: Esp. now that I have a strong impression that you don't want to do true liveness analysis during the scheduling pass, as its exactness can misguide heuristics
11:44 RSpliet: In the coming few weeks I won't have a lot of time to invest, but the stuff I currently have in github only very minimally regresses on volplosion (presumably due to register allocation dominating this benchmark perf rather than instruction distance), does great on heaven and piano. I want some more real benchmarks now, which'll involve making steam believe it should use my custom mesa build over what my distro/steam ships with
11:47 RSpliet: What will be more valuable is a post-RA pass that optimises for dual issue, something you've already taken a closer look at. Considering the kind of large-scale reshuffling the pre-RA pass already does, your ad-hoc approach of scanning/swapping only within a small window might just be all we need. If we can get dual-issue close to "20% of the insn have the dual-issue bit set" (so 40% dual issue), I think we do about as well as the
12:46 jolar2: Any ideas on what to test in the kernel to progress from the status in https://bugs.freedesktop.org/show_bug.cgi?id=101778 ?
12:48 jolar2: Guess it is hard to debug when this hybrid graphics setup does not even work properly with the proprietary nvidia driver
14:10 karolherbst: RSpliet: yeah, optimizing for dual issuing is a big win
14:11 karolherbst: pmoreau: update https://gist.githubusercontent.com/karolherbst/f80890aad3983cd37a502888229d0978/raw/7d80731ccc7a180a4ca492e40f8205f2d0021333/gistfile1.txt
14:38 karolherbst: pmoreau: we already hit dual issue rates above 40%
14:38 karolherbst: ...
14:38 karolherbst: RSpliet:
14:38 karolherbst: the best I got was 60% for something
14:39 RSpliet: karolherbst: I've peeked at some NVIDIA OpenCL benchmarks, and they achieved 40%. If that was your average improvement, I'd say it's good enough
14:40 RSpliet: (well, the scientist in me cries, but the engineer cheers)
14:47 karolherbst: it highly depends on the shaders though
14:47 karolherbst: big shaders with only a little branching usually do pretty well
14:48 karolherbst: one of our problems is that we don't really reorder memory instructions
14:49 RSpliet: The main heuristic in the insn scheduler I built is bringing loads forward
14:50 RSpliet: which is one of the reasons why I tend to feel that a small sliding-window post-RA reordering for dual-issue is probably the best approach.
14:51 imirkin_: sliding window is a great idea
14:51 imirkin_: iirc my main objection to karol's previous attempt was that there was unbounded N^2-edness
14:51 imirkin_: but in practice, moving more than, say, 5 instructions is going to be *very* unlikely
14:52 imirkin_: and 5^2 is a much smaller number than 1000^2
14:52 imirkin_: not sure why i didn't think of suggesting that. very good idea, RSpliet.
14:52 RSpliet: not to mention moving more than 5 instructions is undesirable, because the pre-RA pass does all the heavy lifting on trying to meet minimum distance between dependent instructions
14:53 karolherbst: imirkin_: ahh right, I actually wanted to work on that
14:53 karolherbst: and somebody suggested that already
14:53 karolherbst: but there is a bug in it anyhow I still want to track down
14:54 karolherbst: RSpliet: right, with pixmark_piano I got to 50% dual issuing
14:55 RSpliet: karolherbst: I've found that the gputest benchmarks are all static in their own regard. Good for testing whether any pass has serious effect, but not very representative for what (prospective) users are running on their machines
14:56 karolherbst: true
15:02 karolherbst: imirkin_, RSpliet: I think I will rework that pass to just swap with the next instruction and not look beyond that. This should be good enough for now and eliminates that deep lookup which may introduce some weird issues
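[editor's note: a toy sketch of the window-1 post-RA pairing idea discussed here. The instruction representation, the `insn_depends` check, and the even-slot pairing assumption are all invented for illustration; the real pass would operate on nv50_ir Instructions and the hardware's actual dual-issue rules.]

```c
#include <assert.h>
#include <stddef.h>

/* Invented, simplified instruction model: one destination register and
 * up to two source registers; can_dual marks instructions eligible for
 * the dual-issue slot. */
struct insn {
    int def;      /* register written, -1 if none */
    int src[2];   /* registers read, -1 if unused */
    int can_dual; /* eligible for the dual-issue slot */
};

/* True if b must stay after a (b reads or rewrites a's result). */
int insn_depends(const struct insn *a, const struct insn *b)
{
    if (a->def < 0)
        return 0;
    return b->src[0] == a->def || b->src[1] == a->def || b->def == a->def;
}

/* Window-1 pass: walk the block in pairs (pairing on even slots is a
 * simplification); when (i, i+1) can't dual-issue, try swapping i+1
 * with i+2 -- legal only when those two are independent both ways.
 * Returns the number of dual-issue pairs formed. */
int pair_pass(struct insn *code, size_t n)
{
    int pairs = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        int ok = code[i].can_dual && code[i + 1].can_dual &&
                 !insn_depends(&code[i], &code[i + 1]);
        if (!ok && i + 2 < n &&
            code[i].can_dual && code[i + 2].can_dual &&
            !insn_depends(&code[i], &code[i + 2]) &&
            !insn_depends(&code[i + 1], &code[i + 2]) &&
            !insn_depends(&code[i + 2], &code[i + 1])) {
            struct insn tmp = code[i + 1];
            code[i + 1] = code[i + 2];
            code[i + 2] = tmp;
            ok = 1;
        }
        if (ok)
            pairs++;
    }
    return pairs;
}
```

Because the swap only ever looks one instruction ahead, the cost stays linear in the block size, avoiding the unbounded N^2 behaviour imirkin_ objected to.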
15:04 imirkin_: iirc we already do something like that on nv50
15:04 imirkin_: not for dual-issue purposes, but for packing purposes
15:04 imirkin_: but the end result is the same
15:05 karolherbst: I see
15:05 imirkin_:tries to find it
15:05 karolherbst: I already saw it I think. Wasn't it somewhere inside RA?
15:06 imirkin_: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/codegen/nv50_ir_target.cpp#n253
15:06 imirkin_: not exactly a logical place :)
15:06 karolherbst: ;)
15:06 imirkin_: see the "try to group short instructions" thing
15:06 karolherbst: should be a post RA pass
15:06 karolherbst: ohh wait
15:06 karolherbst: mhh yeah,
15:07 imirkin_: anyways, same general idea
15:07 imirkin_: you can only have 4-byte instructions in pairs
15:07 imirkin_: so you want to group instructions together that can both be emitted in 4 bytes
15:07 karolherbst: I kind of don't like the idea of moving instructions outside passes, to keep saneness
15:07 karolherbst: *sanity
15:07 imirkin_: i'm not saying this is a GOOD idea
15:08 imirkin_: i'm just saying we already have something like it
15:08 imirkin_: it does feel like it could be largely done in a post-ra pass
15:08 imirkin_: as long as you made sure it was the last.
15:08 imirkin_: oh goodie. this randomly ends up in there too:
15:09 imirkin_: if (i->op == OP_MEMBAR && !targ->isOpSupported(OP_MEMBAR, TYPE_NONE)) {
15:09 imirkin_: bb->remove(i);
15:09 imirkin_: continue;
15:09 imirkin_: }
15:09 imirkin_: that might be my bad, i forget :)
15:09 karolherbst: :D
15:13 RSpliet: karolherbst: it'd be interesting to look at the influence of that window size (look ahead one, two, three instructions)... although it could easily complicate dependency satisfaction logic
15:14 karolherbst: 1 already does more than 50% of the improvement
15:14 karolherbst: *causes
15:14 karolherbst: more than 3 doesn't really make any sense
15:15 RSpliet: perhaps try it on top of my branch, where memory loads should be less of a bottleneck
15:15 RSpliet: for performance measurements
15:16 RSpliet: (I say "should", because I'm far from satisfied with the set of benchmarks I used to validate ;-))
15:16 karolherbst: yeah, I will try to work on that patch today
15:19 RSpliet: the "arsepull" cost model(tm) I use has a couple of tunables. The default cost of 15 was determined by trial-and-error on volplosion and piano. With higher dual issue rates, it might pay off increasing that value to say 20
15:20 RSpliet: (line 358 in my insn_sched tree)
15:20 karolherbst: well, with higher dual issuing you need to move instructions further away from each other afaik
15:20 RSpliet: and that's precisely what increasing that number will do
15:21 karolherbst: k
15:22 karolherbst: I am currently wondering if we can disable that hw check for real. Because currently if you put garbage values into the sched opcode, nothing bad happens on Kepler
15:23 RSpliet: note that I deliberately don't use the cost derived from the per-arch throughput routines. as I wanted "reg<-reg,reg(,reg)" instructions to be of the same cost, not favouring one over the other based on what the distance to the next would be. Again trial-and-error heuristic tuning, so there might be better ideas ;-)
18:29 karolherbst: RSpliet: I pushed an update to my dual_issue branch. It is untested though
18:30 karolherbst: I think there is still a bug somewhere regarding texbars or something, not _quite_ sure
18:41 duttasankha: What is the purpose of register 0x1700...in lookup it is showing as PBUS.HOST_MEM.PMEM => { OFFSET = 0 | TARGET = VRAM } ......if I pass a value of 1 then it is PBUS.HOST_MEM.PMEM => { OFFSET = 0x10000 | TARGET = VRAM }...
18:41 duttasankha: My idea is that it is used to set address offset for VRAM
18:42 duttasankha: But I want to discuss to make sure...
18:44 karolherbst: duttasankha: you can open a window into vram through the mmio registers
18:44 karolherbst: or sys mem
18:45 duttasankha: yes...so this opens the window for VRAM as that is set as the target....what is the offset for?
18:48 karolherbst: offset inside memory
18:52 duttasankha: okay...If I want to access the global memory (RAM) inside the GPU I can directly access the VRAM using the 0x700000 register...right? but before doing that I can set the offset with the 0x1700 register and then access the device RAM using the 0x700000 register...am I right?
18:53 imirkin_: accessing VRAM from the GPU uses internal interfaces
18:54 imirkin_: it all goes via the MMU (on G80+)
18:57 duttasankha: I am trying to figure out a way to access and get the VRAM data from the host side using the MMIOs...
18:58 imirkin_: oh. then yes.
18:59 duttasankha: okay...so first I would set the offset using 1700 and then access through 700000...
18:59 duttasankha: right?
19:07 imirkin_: yep
19:07 imirkin_: note that this is a very slow and inefficient way of accessing VRAM
19:07 imirkin_: but it works :)
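[editor's note: a minimal sketch of the address arithmetic behind the 0x1700 / 0x700000 recipe above, assuming the behaviour observed earlier in the log (writing 1 to 0x1700 gives offset 0x10000, i.e. the base is in 64 KiB units). The macro names are informal; for simplicity only the first 64 KiB of the 1 MiB window is used.]

```c
#include <assert.h>
#include <stdint.h>

#define NV_PBUS_BAR0_WINDOW 0x001700 /* window base, in 64 KiB units */
#define NV_PRAMIN           0x700000 /* 1 MiB window into VRAM on BAR0 */

/* Value to program into 0x1700 so the window covers vram_addr. */
uint32_t pramin_window_base(uint64_t vram_addr)
{
    return (uint32_t)(vram_addr >> 16);
}

/* BAR0 offset at which vram_addr then appears. */
uint32_t pramin_bar0_offset(uint64_t vram_addr)
{
    return NV_PRAMIN + (uint32_t)(vram_addr & 0xffff);
}

/* Usage would look roughly like (pseudo-MMIO, not real kernel code):
 *   writel(pramin_window_base(addr), bar0 + NV_PBUS_BAR0_WINDOW);
 *   val = readl(bar0 + pramin_bar0_offset(addr));
 */
```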
19:18 karolherbst: wuhu, found another shader bug
19:22 karolherbst: although, could be a wine bug as well
19:27 karolherbst: only happens with the gallium nine code path...
19:34 duttasankha: imirkin_: thank you...what would be the recommended way to access the VRAM faster?
19:48 imirkin_: duttasankha: fastest would be to get the GPU to DMA the data directly
19:51 imirkin_: second fastest would be to do a map on the VRAM data ... which works by .... you know, i can never remember. maybe that's that PMEM thing deep down inside.
19:51 imirkin_: skeggsb would know.
19:54 duttasankha: doing the DMA would be a really interesting thing to do and I agree to that...will it be possible for you to direct me to the code/documentation/anything which would help me to understand the procedure for DMAing the data out of the VRAM...
20:00 imirkin_: well ... it gets a lot trickier
20:00 imirkin_: you have to submit commands to the GPU
20:00 imirkin_: which means that you have to play nice with whatever driver is controlling it
20:03 imirkin_: duttasankha: what are you doing again?
20:04 duttasankha: well, actually there are 3 things I am trying to achieve...they are as follows....
20:05 duttasankha: 1. trying to access and retrieve the data of VRAM from the host side
20:05 imirkin_: https://github.com/skeggsb/nouveau/blob/master/drm/nouveau/nouveau_bo.c#L1064
20:05 duttasankha: 2. Another thing is I am trying to understand the memory mapping of the VRAM so I can locate a particular data
20:06 imirkin_: this has code for various methods of moving stuff around
20:06 imirkin_: i guess i mean ... what's your end goal?
20:06 imirkin_: those are all means to an end... what's the end?
20:07 duttasankha: 3. Retrieve the page directory and page table entries which would in turn help me to understand the memory mapping...
20:07 pmoreau: karolherbst: I would still add “Compiler” knowledge as being useful, or on the verge of required, as, as you said yourself, “This will include working a lot on the Nouveau compiler”.
20:09 duttasankha: imirkin_: My end goal is to demonstrate different methods through which I can access and get VRAM data (using both physical and virtualization addressing)
20:11 karolherbst: pmoreau: ohh, to the OpenCL thing, right, will add it
20:11 pmoreau: Yup, thanks :-)
20:11 imirkin_: karolherbst: a bunch of those are non-actionable
20:11 imirkin_: karolherbst: like "write a bunch of opts"
20:11 duttasankha: imirkin_: thanks a lot for the code pointer..
20:12 imirkin_: duttasankha: hm, ok. well - i guess i'm still confused as to why that's desirable - some kind of security research?
20:12 karolherbst: imirkin_: well, why not?
20:12 imirkin_: karolherbst: let's say a student shows up and says "i'd like to do that one, please". what do you tell them?
20:12 duttasankha: imirkin_: yes
20:12 imirkin_: duttasankha: ah ok. cool.
20:15 karolherbst: imirkin_: dunno. Think about some crazy super ops that student could try to implement. My initial idea was to have that student work on a big one, but mupuf said it might be a good idea to rather work on more, smaller ones
20:15 imirkin_: a big one like what? a bunch of little ones like what?
20:15 imirkin_: you expect the student to come up with this stuff? good luck.
20:15 karolherbst: loop unrolling
20:15 karolherbst: or that loop invariant thing
20:15 imirkin_: then write those
20:16 karolherbst: or that value range thing
20:16 imirkin_: rather than "unspecified list of things"
20:16 karolherbst: I would simply link to the trello list
20:20 duttasankha: imirkin_: we are trying to do GPU security research where we are trying to demonstrate various kinds of GPU vulnerabilities and VRAM is one of them....can you provide me some suggestions as to how I can access, view and retrieve the GPU PD and PT from the host side...
20:20 karolherbst: duttasankha: isn't it a bit pointless to search for sec issues within the mmio space if you need root privileges to actually access those?
20:21 karolherbst: I mean, sure the entire GPU can read all the memory and you can use the GPU to write into random kernel memory...
20:21 duttasankha: we have already shown that the OS is vulnerable....kernel can be hacked as well...
20:21 karolherbst: but this isn't a secret, is it?
20:21 karolherbst: well right, but I mean, of course the GPU can write/read to/from all memory
20:22 karolherbst: personally I would have assumed that looking into the driver is more profitable than looking at the hardware
20:23 airlied: it's why we have an iommu
20:24 karolherbst: well there is this iomem=strict thing to prevent access from root to those registers though
20:24 karolherbst: but only if a driver is loaded
20:24 karolherbst: afaik
20:24 airlied: we have found root escalations in drivers before
20:24 karolherbst: airlied: right, but then somebody would look into the driver, right?
20:25 duttasankha: karolherbst: We are trying to look into the hardware because we are primarily an architecture research group...so ultimately our goal is to provide a secure architecture for GPUs
20:25 karolherbst: I mean, if I already knew that I can do everything with the GPU, I would rather spend time finding a way to access the GPU without being root
20:25 karolherbst: duttasankha: ahhh, yeah, that makes sense then
20:25 duttasankha: so first I have to find the vulnerabilities....
20:26 duttasankha: it is just for demonstration...mainly we are writing a grant
20:26 duttasankha: so finally the grant would be for secure architecture...
20:26 karolherbst: well, you can crash the kernel with the pramin window
20:30 karolherbst: or maybe this can all actually be disabled already so that the GPU can't access random kernel memory. Never actually looked into preventing the GPU from messing up the system
20:31 duttasankha: Actually I am trying to do the reverse...I am trying to access the GPU from the host...not the other way round...so when I am talking about the MMU, I am referring to the GPU VM
20:32 karolherbst: duttasankha: ohh, so you want to access VRAM from the host, not system ram through the GPU?
20:32 duttasankha: yes
20:33 karolherbst: I see
20:33 karolherbst: wait a second
20:33 karolherbst: it is already done
20:34 karolherbst: duttasankha: there is this exploit where any userspace can read out non-cleared GPU buffers which contain the VRAM content of other processes, like the content of a firefox window. Something like that?
20:34 karolherbst: no root privs needed
20:36 duttasankha: yes that is done...I think it was published in ATC or CCS...I don't remember
20:36 karolherbst: well, I only have the source of that exploit, no idea where it came from
20:37 karolherbst: duttasankha: but I assume you are trying to find a way without needing the help of a driver or something like that?
20:37 duttasankha: yes...I am trying to write my own module to achieve what I mentioned before...
20:39 duttasankha: So I am also trying to access and retrieve the GPU PT and PD....can you provide me some pointers as to how I can do it?
20:40 karolherbst: duttasankha: mhh, well it might make sense to read the nouveau source code regarding all that memory stuff
20:41 duttasankha: I am doing it but I am getting lost whenever I am starting something with nouveau after some point of time....I have absolutely no idea of the correct way of doing it...I am doing everything that was suggested by this community
20:43 karolherbst: well the problem shouldn't actually be accessing the memory. The bigger problem is more likely properly parsing everything. Or setting the GPU up to a state where it is actually usable.
20:44 duttasankha: Yes...that is true...
20:45 karolherbst: duttasankha: nvidia published some information about the pascal MMU stuff: http://download.nvidia.com/open-gpu-doc/pascal/1/gp100-mmu-format.pdf
20:49 duttasankha: okay...but I guess accessing the MMU and retrieving the PT and PD from GF100 has been implemented in nouveau...or am I wrong?
20:49 duttasankha: not retrieving but accessing the PD and PT
20:50 karolherbst: yeah
20:50 duttasankha: or I don't know may be both...
20:51 duttasankha: do you know where/what should I look into it to understand it?
20:51 karolherbst: subdev/mmu makes sense
20:52 duttasankha: okay...then I was looking into the right one....
20:53 karolherbst: I think skeggsb is the one with the best understanding of all this anyhow
20:55 duttasankha: the problem is I would again get lost if I don't find a better, more systematic approach to understanding it...
20:58 duttasankha: do you know how I can contact skeggsb..I was hoping to discuss with him as well..but I don't see him here in the discussion...so do you know how I can approach him?
21:18 karolherbst: duttasankha: at some point he just turns up being here, sometimes it takes some days
21:19 karolherbst: duttasankha: you could also just write to the mailing list and maybe somebody answers there as well
21:20 duttasankha: okay...yeah I can do that as well...
21:40 skeggsb: duttasankha: you've got three options, either through the BAR0 window (0x700000), or BAR1 (which can be mapped in various ways, if a driver is running, it's through the MMU with page tables etc), or you've got IFB (i think rnndb calls it PEEPHOLE or something)
21:40 skeggsb: nouveau does all three in different circumstances
21:41 skeggsb: or you use any of the accel engines i guess
22:40 duttasankha: @skeggsb: thank you very much for the pointers....I am currently accessing the VRAM using the BAR0 window. But I want to figure out the memory mapping of the VRAM and retrieve the GPU VM PD and PT...do you have any suggestions regarding that...
22:40 skeggsb: read the pascal doc that you were pointed at and/or the code in nouveau
22:40 skeggsb: there's not just one PD/PT either.. every GPU context has its own
22:42 duttasankha: Yes I was going through the pascal document that karolherbst mentioned..but I am using a GPU which has GF100 in it..so that pascal mmu document would still be valid for my generation GPU?
22:43 skeggsb: no, gf100's layout is a lot simpler
22:43 skeggsb: the general idea is the same though
22:44 skeggsb: they support basically the same capabilities, gf100 is 2 levels instead of 5, and bits in different positions
22:46 duttasankha: okay..so there is just PDE0 and PT?
22:48 duttasankha: regarding the nouveau code, I was looking into subdev/mmu to retrieve the PDE and PT as pointed out by karolherbst...
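[editor's note: to make the "2 levels instead of 5" comparison concrete, here is a sketch of how a GPU virtual address might split under the commonly described gf100 layout, assuming the default configuration of one PDE per 128 MiB and 4 KiB small pages (128 MiB = 32768 * 4 KiB PTEs per PDE). The field widths are assumptions to verify against nouveau's subdev/mmu code.]

```c
#include <assert.h>
#include <stdint.h>

/* Assumed gf100-style small-page VA split:
 *   [ PDE index | PTE index | page offset ]
 *      bits 27+    bits 26:12    bits 11:0
 * Each PDE covers 128 MiB (2^27 bytes). */

uint32_t gf100_pde_index(uint64_t va) { return (uint32_t)(va >> 27); }
uint32_t gf100_pte_index(uint64_t va) { return (uint32_t)(va >> 12) & 0x7fff; }
uint32_t gf100_page_off(uint64_t va)  { return (uint32_t)(va & 0xfff); }
```

A walk would then read PD entry `gf100_pde_index(va)` from the context's page directory, follow it to the small-page table, and index that with `gf100_pte_index(va)` — one level fewer of indirection than the 5-level Pascal scheme in the gp100-mmu-format.pdf document linked earlier.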