11:08 danvet: is nv50_wndw_func->acquire not called anywhere or am I blind?
11:12 pmoreau: danvet: Isn’t it called on the last line of wndw.c:`nv50_wndw_atomic_check_acquire()`?
11:13 pmoreau: nouveau/dispnv50/wndw.c:318: return wndw->func->acquire(wndw, asyw, asyh);
11:13 danvet: indeed
11:14 danvet: why did grep not find that
11:14 pmoreau: It did for me :-D
11:14 pmoreau: Maybe a typo in the grep?
13:00 pmoreau: Back to being able to run some super basic OpenCL tests on Tesla: https://hastebin.com/azezofaqed.coffeescript
13:02 RSpliet: ye olde faithful macbook?
13:04 pmoreau: Indeed! :-)
13:05 pmoreau: I’m not planning on replacing it yet.
13:17 pmoreau: Ah, the 2nd test is failing because the offsets aren’t properly applied apparently.
13:17 pmoreau: It thinks it is emitting:
13:17 pmoreau: EMIT: st u32 # g[$r0+0x0] $r1 (8)
13:17 pmoreau: EMIT: st u32 # g[$r0+0x4] $r2 exit (8)
13:17 pmoreau: But actually, does the following:
13:17 pmoreau: 00000018: d00f0005 a0c00780 st b32 g15[$r0] $r1
13:17 pmoreau: 00000020: d00f0009 a0c00781 exit st b32 g15[$r0] $r2
13:20 pmoreau: Anyway, I should be working on work stuff and not Nouveau, at the moment. I might have a look at it tonight.
14:59 imirkin: pmoreau: i don't think offsets are supported on nv50
15:48 pmoreau: imirkin: I’ll need to check what nv50_ir_from_tgsi does to avoid generating offsets, cause apparently the lowering does take care of removing them.
15:54 imirkin: pmoreau: it's in the target
15:55 imirkin: i might have added a second function which doesn't respect this restriction
17:24 kherbst: imirkin: there is a comment
17:24 kherbst: return true; // blah blah, doesn't matter because we don't support global/shared, need to rethink when we do ;)
17:26 imirkin: lol
17:26 imirkin: i guess it's rethinkin' time!
17:28 pmoreau: :-D
17:28 kherbst: well, I didn't copy&paste the actual comment ;)
17:28 kherbst: the actual comment is longer, but same meaning
17:29 kherbst: pmoreau: src/gallium/drivers/nouveau/codegen/nv50_ir_target_nv50.cpp:404
17:29 kherbst: this needs some changes
17:29 imirkin: probably part of the fixing i did to only inline constant offsets up to the allowed quantity
17:29 kherbst: yeah
17:30 imirkin: (which really only became necessary due to some opts which got much better at inlining the offsets in the first place)
17:30 kherbst: right now load and stores aren't further checked, just a simple return true
17:30 imirkin: on nv50?
17:30 kherbst: yes
17:30 imirkin: or in general?
17:31 kherbst: nvc0+ has different code
17:31 imirkin: yeah ok, makes sense
17:31 imirkin: there was probably rhyme to that reason
17:31 imirkin: i think i was running into the fact that the offsets turn negative on nvc0
17:31 imirkin: and also the cross-ubo accesses
17:31 kherbst: oh, maybe
17:32 kherbst: for nvc0 we do return "offset >= -0x8000 && offset < 0x8000" though
17:32 imirkin: e.g. one might be tempted to encode uboblock[2+indirect] as an offset of 0x20000
17:32 kherbst: yeah... maybe we could even make use of that
17:32 imirkin: can't
17:32 imirkin: offset has a range
17:32 imirkin: hence the problem =]
17:32 imirkin: it's a signed 16-bit integer
17:33 kherbst: well, for load/stores we could
17:33 imirkin: so not even the full 16-bit range is covered of a UBO
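For reference, the nvc0 check quoted above amounts to asking whether an offset fits in a signed 16-bit immediate. A minimal C++ sketch follows; it assumes nothing about the real mesa code beyond the quoted expression, and the name `fitsSigned16` is made up for illustration:

```cpp
#include <cstdint>

// Hypothetical helper, NOT the actual mesa function. It mirrors the
// expression quoted in the log for nvc0:
//     offset >= -0x8000 && offset < 0x8000
// i.e. the offset must be representable as a signed 16-bit integer.
constexpr bool fitsSigned16(int64_t offset)
{
    return offset >= -0x8000 && offset < 0x8000;
}
```

Under this check the tempting 0x20000 offset for uboblock[2+indirect] is rejected, and so is 0x8000, which is why not even the full 16-bit range of a UBO is reachable through the inline offset.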
17:33 imirkin: maybe. i don't remember what the limits are there.
17:33 imirkin: might be the same.
17:33 kherbst: I just don't think it makes much of a difference
17:33 imirkin: [on nvc0]
17:33 pmoreau: Thanks for the link, Karol.
17:33 imirkin: saved a lot of instructions when i improved the offset inlining
17:33 kherbst: on nvc0 everything except const always works :)
17:33 pmoreau: I’ll try to push a branch with my modifications, sometimes this week.
17:34 kherbst: which makes sense, as only const buffers are inlined
17:34 kherbst: (and for input/output it doesn't matter)
17:34 imirkin: btw, i'm out until sunday, in case someone needs to reach me - send email.
17:34 kherbst: imirkin: any idea about 4 byte encoding btw?
17:35 kherbst: and nv50
17:35 imirkin: is there a question in there?
17:35 imirkin: nv50 has 4- and 8-byte encodings
17:35 kherbst: I meant relevant to load/stores
17:35 imirkin: every op that has a 4-byte encoding can also be encoded in 8 bytes
17:35 kherbst: but it seems load/stores are always 8 byte?
17:35 imirkin: but the inverse is obviously not true
17:35 imirkin: mmmmmmmmm
17:35 imirkin: dunno
17:35 imirkin: doubtful.
17:35 kherbst: well.. the emitter assumes 8 bytes
17:36 kherbst: wondering what the constraints then are...
17:36 kherbst: pmoreau: mind checking with cuda?
17:37 imirkin: looked at envydis - sounds right
17:37 imirkin: aaaactually
17:38 imirkin: looks like shared memory should be able to have 4-byte encodings for some ops
17:38 imirkin: look at tabi + tabis (ptype = CP)
17:39 kherbst: mhh, sadly my installed cuda is too new and nv50 support was removed way back :/
17:39 imirkin: but only for reading shared memory, not for loading/storing it
17:39 kherbst: can't just compile CL to SASS
17:39 imirkin: er, well i guess technically for loading it. but not for storing.
17:41 kherbst: mhhhh
17:41 kherbst: global has a 4 bit offset and local 8 bit?
17:46 kherbst: huh
17:46 kherbst: "echo "0x80000000d0000000" | envydis -m g80 -W -O cp" should be correct, no?
17:46 imirkin: i didn't count the 0's, but sure
17:47 kherbst: ??? $r0 $r0 $r0 [unknown: d0000000] [unknown instruction]
17:47 imirkin: ok
17:47 imirkin: what are you trying to do?
17:48 kherbst: match against { 0x80000000d0000000ull, 0xe0000000f0000000ull, N("ld"), T(ldstm), T(ldsto), GLOBAL, .ptype = CP }
17:49 imirkin: 0x80000780d0000001
17:50 kherbst: mhh, yeah, that works
17:50 imirkin: lower 1 indicates it's a long instruction
17:50 imirkin: echo "0x80000000d0000001" | ~/src/envytools/envydis/envydis -m g80 -W -O cp
17:50 imirkin: 00000000: d0000001 80000000 (never) ld u8 $r0 g0[$r0]
17:50 kherbst: what's the 0x1 for?
17:51 imirkin: <imirkin> lower 1 indicates it's a long instruction
17:51 kherbst: ohhhh
17:51 kherbst: important to know
17:51 imirkin: check tabroot
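The exchange above can be sketched in C++. The sketch assumes that the two constants in the quoted table entry are a value and a mask, in that order, and that the 64-bit number passed to envydis holds the first 32-bit word in its low half (which matches the `d0000001 80000000` output shown). All function names are made up for illustration:

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not envydis code.

// First 32-bit word of the instruction (low half of the 64-bit input).
constexpr uint32_t lowWord(uint64_t insn)
{
    return static_cast<uint32_t>(insn & 0xffffffffull);
}

// Second 32-bit word of the instruction (high half of the 64-bit input).
constexpr uint32_t highWord(uint64_t insn)
{
    return static_cast<uint32_t>(insn >> 32);
}

// Per the log: the lowest bit set means this is a long (8-byte) instruction.
constexpr bool isLongEncoding(uint64_t insn)
{
    return (insn & 1ull) != 0;
}

// Assumed table semantics: an entry matches when the masked bits equal value.
constexpr bool matchesEntry(uint64_t insn, uint64_t value, uint64_t mask)
{
    return (insn & mask) == value;
}
```

With these definitions, 0x80000000d0000000 satisfies the quoted value/mask pair but has the long bit clear, which is presumably why envydis never reached the `ld` entry; 0x80000000d0000001 satisfies both.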
17:55 kherbst: seems like there is a lot of unknown bits in that one: join (no $c3) ld s16 $r127 g15[$r127] [unknown: 0ff00000 1f1fc07c]
17:56 imirkin: "that one"?
17:57 kherbst: mhh SM11 added support for 32 bits and SM12 for 64 bits
17:57 imirkin: -V g84 / gt215
17:59 kherbst: yeah.. envydis doesn't care about that so far
17:59 kherbst: just enables 32/64 for all archs
18:01 kherbst: heh cuda 6.5? mhh, that will be annoying to install
18:03 imirkin: right. actually i'm not sure what SM11/12/13 map onto
18:04 imirkin: i don't think it's actual gens
18:05 kherbst: SM10 == G80, SM11 == (G80, GT2xx), SM12 == GT21X, SM13 == GT200
18:05 kherbst: just pain to handle that in mesa
18:05 kherbst: like.. do we really want to lower 32 bit operations? Guess we do
18:06 imirkin: well, 64-bit is only even semi-supported on G200
18:06 imirkin: iirc i never upstreamed full support
18:06 kherbst: okay.. so cuda 8.0's nvdisasm still supports SM1X
18:06 imirkin: coz it still needs a ton of lowering
18:07 imirkin: (f64 support that is)
18:07 kherbst: sure
18:08 imirkin: iirc it has add and mul (and f32 <-> f64) which are really the important ones
18:09 Lyude: We're still waiting on fw for turing, right?
18:09 kherbst: ahh, cuda 9.1 is the last one with nvdisasm support for SM1X
18:09 kherbst: important to know!
18:09 kherbst: Lyude: I guess so, yes
18:09 Lyude: kherbst: btw, going to take a short look into why your runpm patch doesn't seem to work on this machine
18:10 kherbst: Lyude: why?
18:10 kherbst: it doesn't work.. that's all we need to know
18:10 kherbst: it's not like this patch was supposed to work, it simply works by chance
18:10 Lyude: kherbst: ah, gotcha
18:11 kherbst: and apparently we need a new patch as it doesn't seem to work on all gens :(
18:11 Lyude: that's why I'm a bit curious
18:11 kherbst: what I really need is a laptop where my patch works reliably and has a Turing GPU
18:11 kherbst: then we can just check whatever nvidia is doing
18:11 Lyude: especially since that branch that bjorn linked seemed to kind of work
18:12 kherbst: worst case, it doesn't work with nvidia, but then we get nvidia to fix it
18:12 kherbst: then we fix it in nouveau
18:12 Lyude: kherbst: well we should be able to trace the driver if their driver works with the X1
18:12 Lyude: or we get them to fix it, yeah
18:12 kherbst: sure
18:12 kherbst: but the point is that it's different
18:12 kherbst: _but_
18:12 kherbst: it could be that we need the same thing fixed
18:12 kherbst: Lyude: anyway, mind leaving up SSH access on that machine for tomorrow+?
18:12 Lyude: kherbst: I can do it right now since I'm still going to take a brief look anyway
18:13 kherbst: I guess you won't work then until monday, right?
18:13 Lyude: kherbst: yeah I'll be off after tomorrow but I'm here today
18:13 kherbst: right
18:13 kherbst: I was just meaning that after you are done with that machine, I can poke into it the next days :)
18:13 Lyude: kherbst: mind giving me your ssh key again? I'll add an account for you
18:13 Lyude: kherbst: ahh, gotcha
18:13 kherbst: Lyude: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIL0nWnr+mGAmMbV133JkQmFJhO7xrloI+LUHuZgoeJEl
18:13 kherbst: save it somewhere :D
18:17 Lyude: I'm going to try bjorn's branch again and see how "working" things actually are after the card appears to runtime resume properly
18:22 Lyude: a-ha
18:22 Lyude: i knew something was suspicious
18:23 Lyude: kherbst: it works with noaccel=1, get some weird unresolvable handle errors but I'm guessing that's just because the gr channel doesn't have a valid handle programmed to it
18:23 kherbst: probably
18:24 Lyude: i figured that might have been it from the sched errors, although then that doesn't entirely explain why those sched errors don't happen with the other branch
18:24 Lyude: oh hold on, those are from the disp channel
18:24 Lyude: sigh
18:26 Lyude: oh, but, hm. i have an idea
18:43 Lyude: i what
18:43 Lyude: kherbst: I think your patch isn't needed here, surprisingly
18:44 kherbst: maybe they fixed the bug in later hw :p
18:44 Lyude: yeah maybe
18:44 Lyude: seems like the actual failure is gr bringup
18:44 Lyude: although things still looked like they acted way better with bjorn's patches
18:44 kherbst: yeah
18:45 Lyude: anyway, let's see if my hunch about the disp channel issue just being a problem of needing to move the channel into vram is correct
18:49 Lyude: yep, +1 points for me
18:50 kherbst: heh
18:50 kherbst: maybe dma is busted or so?
18:50 kherbst: but I guess having that in VRAM makes sense anyway
18:50 Lyude: kherbst: I think it is
18:50 Lyude: kherbst: well it shouldn't really matter afaik, at least for everything after pascal
18:51 Lyude: it does tell us the gpu is upset that it can't access certain bits of memory on resume though
18:51 kherbst: I'd rather we do all the display stuff in vram in order to decrease accesses over pci
18:52 Lyude: kherbst: note this is just for the actual channel though, which doesn't really push that much data across does it?
18:52 kherbst: well
18:52 kherbst: if you do reverse prime there isn't much bandwidth left anyway
18:52 Lyude: yeah but this is like, 4k
18:52 kherbst: I am more worries about access latency than anything else
18:52 kherbst: *worried
18:53 kherbst: reverse prime is slow as heck
18:53 kherbst: even prime is
18:53 kherbst: and I think it's because of stupid reasons like that
18:54 kherbst: and maybe even ttm doing too many migrations or something.. never really looked into it
18:54 kherbst: but I'd prefer to keep as much as possible in VRAM
18:54 kherbst: and simply because it's 4k, why even bother not putting it into VRAM?
18:55 Lyude: yeah I'm not sure tbh
18:57 kherbst: wished we would have more time looking into why prime is slow though
19:14 Lyude: kherbst: interesting, so bjorn's branch also seems to fix the issues with the pushbuf being inaccessible from system memory after resume
19:25 kherbst: yeah
19:25 kherbst: makes sense
19:27 pmoreau: kherbst: There is more to it than just that function needing to be changed: it is never called in the first place for the store operations.
19:29 kherbst: :(
19:29 kherbst: ohhh, uhm
19:29 kherbst: I see
19:30 pmoreau: Looking into why right now.
19:30 kherbst: I know why
19:30 kherbst: when doing nir -> codegen, I just emit instructions with the offsets already embedded
19:31 kherbst: and I think we would do the same in the tgsi path as well
19:31 pmoreau: Ah
19:32 kherbst: pmoreau: loadFrom
19:32 kherbst: and mkStore
19:33 kherbst: mkLoad(ty, def, mkSymbol(file, i, ty, base + c * tySize), indirect0) inserts the symbol containing the offset
19:33 kherbst: I think we shouldn't do that and rather have an optimization moving the offset into the instruction when possible
19:34 pmoreau: Okay, will check after opening an MR to fix spirv2nir no longer compiling with ToT.
19:37 kherbst: ohh, obviously this needs to be a codegen pass, can't do that in nir :)
19:37 kherbst: and might even benefit other shaders
19:38 kherbst: as we could merge constants with the current offset
19:38 kherbst: like (ld (iadd a 10)+20) -> (ld a+30)
19:39 kherbst: (which shouldn't happen on the nir path, as nir will already constant fold it, but inside codegen through TGSI shaders we might end up with that)
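The folding idea above, (ld (iadd a 10)+20) -> (ld a+30), can be sketched as a small C++ function. This is only an illustration of the arithmetic, not a codegen pass: `FoldResult` and `foldAddIntoOffset` are hypothetical names, and the allowed range is passed in by the caller because the log only gives the nvc0 limits and suggests nv50 may not accept offsets on global accesses at all.

```cpp
#include <cstdint>

// Hypothetical result type for illustration only.
struct FoldResult {
    bool folded;     // true if the constant add was merged into the offset
    int64_t offset;  // resulting instruction offset
};

// Try to merge a constant addend from the address computation into the
// offset already on the load/store. The merge is refused when the sum falls
// outside [minOffset, maxOffset], in which case the add must stay a separate
// instruction and the existing offset is kept.
constexpr FoldResult foldAddIntoOffset(int64_t addend, int64_t offset,
                                       int64_t minOffset, int64_t maxOffset)
{
    return (addend + offset >= minOffset && addend + offset <= maxOffset)
               ? FoldResult{true, addend + offset}
               : FoldResult{false, offset};
}
```

Called with the signed 16-bit limits mentioned earlier for nvc0, `foldAddIntoOffset(10, 20, -0x8000, 0x7fff)` yields an offset of 30. A target that accepts no offset would pass a range of 0..0, so nothing is folded and the explicit add is kept.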
23:22 Lyude: Hey karolherbst I just realized it's actually probably a silly idea for me to leave this X1 Extreme at the office because I won't be able to reboot it if you crash it, so I'm going to bring it home with me. Let me know over the weekend when you want access to it and I'll hook it up
23:25 karolherbst: Lyude: okay, sounds good