01:45imirkin: pmoreau: just did a force-push, both the "safe" and the more feature-complete nv50_compute branches
01:46imirkin: pmoreau: i think what's left is (a) clean up the image functions and (b) adding writeonly image formats to shader "key", so that we can support the GL format-less writeonly image, which is required for desktop GL
01:46imirkin: oh, and (c) having dynamic allocation between buffers and images, rather than the current hard-coded slot thing
01:47imirkin: i think it's in pretty good shape otherwise
01:52Lyude: how far ahead does the nv50 stuff you're doing apply?
01:52imirkin: can you rephrase the question?
01:52imirkin: (aka i understand the words individually, but not together)
01:54imirkin: pmoreau: oh, just remembered. also shared atomics, but i think we can copy that ~verbatim from nvc0
01:54imirkin: since nvc0 had the same load locked/store unlock thing
03:12imirkin: mwk: do you have any reason to believe that ld lock/st unlock are restricted to 32-bit bytes? nvdisasm happily allows it for e.g. 16-bit, but who knows in practice...
03:12imirkin: 32-bit *types*
03:14imirkin: i guess CUDA doesn't let you do it
03:14imirkin: so probably there's some semi-decent reason for it
03:48Lyude: imirkin: sorry-I meant what nv generations does the work you're doing w/r/t compute affect? I'm mostly just curious
03:48imirkin: Lyude: tesla
03:50imirkin: (yeah... long list there ... https://www.youtube.com/watch?v=68t74jofBD0 )
05:10imirkin: pmoreau: fyi, i'm listing the remaining TODO items here: https://trello.com/c/QUbwENLp/220-nv50-compute
05:11imirkin: feel free to add stuff to it, but if it's opencl-specific, i'd rather you make a separate card for it
07:04pmoreau: I’ll have another look at the series this evening, and try to rebase my work on top of your branch + test it.
08:39karolherbst: anybody here with a fermi GPU and ubuntu running?
11:37karolherbst: pmoreau: clEnqueueSVMMigrateMem is also supported :D
13:57mwk: imirkin: heh, good question
13:58mwk: no idea, really
13:58mwk: I see no particular reason why it wouldn't work
16:17imirkin: mwk: hey, i'm reading the i2i test code... am i correct in understanding that cvt s8 $r1l s32 $r0, when $r0 = -1, will end up with $r1l == 0xffff and not 0xff?
16:20karolherbst: imirkin: for what gpus?
16:20karolherbst: but I think that's correct, let me check as well
16:21imirkin: karolherbst: nv50
16:21imirkin: i don't think any later gpu's have individually addressable 16-bit regs
16:23karolherbst: yeah, but nvidias 16/8 bit handling is generally.... weird
16:24imirkin: yeah, manually getting rid of the high bits fixes the test.
16:25karolherbst: I think nvidia generally keeps the value correct for the full width register
16:25karolherbst: or at least that's what I seem to remember from working on the conversion stuff for CL
16:26karolherbst: makes it easier to consume the values in ALU instructions
16:27imirkin: yeah. i added the clamp, all good now. onto figuring out why snorm is messed up
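[editor's note] A minimal sketch of the behaviour described above, assuming the s8 result is kept sign-extended across the whole 16-bit half register. The function names are hypothetical, and the model says nothing about whether the hardware saturates or truncates out-of-range inputs; it only sign-extends the low byte.

```cpp
#include <cstdint>

// Hypothetical model of "cvt s8 $r1l s32 $r0" as discussed in the chat:
// the low byte of the source is treated as an s8 and sign-extended into the
// 16-bit half register. This is an assumption taken from the conversation,
// not from a hardware reference.
constexpr std::uint16_t cvt_s8_into_half_reg(std::int32_t src)
{
    const std::uint32_t low = static_cast<std::uint32_t>(src) & 0xffu;
    const std::uint32_t extended = (low & 0x80u) ? (low | 0xff00u) : low;
    return static_cast<std::uint16_t>(extended);
}

// The manual fix mentioned in the chat: drop the high bits when only the
// 8-bit pattern is wanted.
constexpr std::uint16_t keep_low_byte(std::uint16_t halfReg)
{
    return static_cast<std::uint16_t>(halfReg & 0xffu);
}
```

With this model, an input of -1 gives 0xffff in the half register, and masking reduces it to 0xff.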
16:27imirkin: looks like some edge cases, only a handful of values get messed up
16:27imirkin: maybe i need to clamp
16:32imirkin: oh neat. i2f s8 -> f32 does exist.
16:33imirkin: it's just f2i that's missing
16:45imirkin: ok, well at least that's a nice simplification to what's probably the most common format path
16:45imirkin: (i.e. rgba8_unorm)
16:52imirkin: mwk: if you feel like thinking ... can you see a way that this shader will store a different value than what was loaded? https://paste.debian.net/plain/1187363
16:52imirkin: (and yes, it's a dumb shader. it's basically loading an image, and then storing it. but our compiler isn't smart enough to realize that it's all a no-op)
16:53imirkin: it does seem like we get the correct result *most* of the time, but there's some funny edge case
16:56imirkin: (here's one with terminal color escape codes... curl https://termbin.com/dhs0 )
17:08imirkin: this nearly identical shader works fine for rgba8i: https://paste.debian.net/plain/1187367
17:08imirkin: the difference being that it doesn't round-trip to float
17:10imirkin: so something in the round-tripping must be broken for *some* values of s8...
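[editor's note] A sketch of one known way an s8 snorm round trip through float can fail to be an identity, using the usual GL-style conversion rules (decode as max(v / 127, -1), encode as round(clamp(f, -1, 1) * 127)). Whether this is the edge case being chased in the chat is not established; the function names are hypothetical.

```cpp
#include <cstdint>

// Decode an 8-bit snorm value to float: v / 127, clamped so that -128 also
// maps to -1.0.
constexpr float snorm8_to_float(std::int8_t v)
{
    const float f = static_cast<float>(v) / 127.0f;
    return f < -1.0f ? -1.0f : f;
}

// Encode a float as 8-bit snorm: clamp to [-1, 1], scale by 127, then round
// half away from zero.
constexpr std::int8_t float_to_snorm8(float f)
{
    const float clamped = f < -1.0f ? -1.0f : (f > 1.0f ? 1.0f : f);
    const float scaled = clamped * 127.0f;
    return static_cast<std::int8_t>(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}
```

Under these rules -128 and -127 both decode to -1.0, so a load followed by a store turns -128 into -127 while other values survive unchanged.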
17:16pmoreau: karolherbst: Re “clEnqueueSVMMigrateMem is also supported”: feel free to add it. :-)
17:16pmoreau: It was a best effort to generate something that matched the current status, and since no one seemed to care enough to jump in and point out mistakes/missing support, this is what I ended up with.
17:16pmoreau: Similarly some of the drivers could be marked as supporting OpenCL 1.0/1.1, but since I do not know how much the drivers actually support vs claim to support, I didn’t mark any of them as even supporting 1.0.
17:16karolherbst: well... I probably landed the support for migration 2 or 3 weeks ago
17:20pmoreau: I didn’t look around too much once the MR was opened, so that could explain why I missed it if it landed within the past month.
17:22pmoreau: (or even, within the past 6 months)
19:05Lyude: imirkin: btw - you were able to reproduce those cursor issues?
19:05imirkin: Lyude: on a GP108 - no.
19:05Lyude: well yeah-it should definitely work on pascal, i'm more curious about kepler
19:06imirkin: i've been otherwise occupied and haven't had a chance to reboot / switch gpu's
19:06imirkin: however people on the ML have reported that both GK104 and GK208 have issues with 256x256, but 128x128 seems to be fine
19:06Lyude: i don't have the time to look at it quite yet but I actually did manage to recover some kepler cards from my office that should still be working
19:06Lyude: imirkin: ooooh, that's weird
19:07imirkin: and this is with modetest, so no funny Xorg weirdness involved
19:08imirkin: could be a bit in pdisp ala PDISP_UNFUCKUP_LARGE_CURSOR somewhere...
19:09Lyude: you'd think the RM would just handle that itself though
19:29pmoreau: imirkin: Okay, I think I successfully rebased my branch on top of yours.
19:29pmoreau: And second batch of comments sent your way; I’ll see if I can go through more patches today.
19:31imirkin: pmoreau: sounds good
20:31imirkin: pmoreau: btw, some of the changes which i didn't include weren't necessarily because i thought they were 'bad' but more because i didn't think they were needed for the core support
20:31imirkin: as much as possible, i was trying to reduce feature creep
20:32imirkin: i do think that the lowering pass fixing up shared offsets is bad -- that should be done by the frontend
20:33imirkin: pmoreau: btw, do you have a working cuda setup for g80?
20:33imirkin: would be nice to find out what the right incantation for membar is
20:46pmoreau: Yeah, no worries regarding the changes; the current MR is already quite packed commit-wise and I would not want to review a 40-commit long one! 😆
20:48pmoreau: Re “i do think that the lowering pass fixing up shared offsets is bad -- that should be done by the frontend” IIRC I went with a lowering pass to only implement it once rather than once in the NIR frontend and once in the TGSI one, but given that only the NIR frontend will support compute (I imagine), it is no big deal.
20:49imirkin: hm? TGSI definitely supports compute
20:49imirkin: but it's just not logical to do it in the lowering pass
20:49imirkin: since you don't necessarily know what needs to be offset
20:49imirkin: and what doesn't
20:49imirkin: only the frontend knows that
20:50imirkin: offsetting FILE_INPUT is fine, since you know exactly what needs offsetting
20:50pmoreau: The lowering pass can access the attribute containing the offset, so it does not need to know the value of the offset just that one is needed.
20:51imirkin: but offsetting general SHARED loads ... who knows where they came from
20:51pmoreau: Ah, right
20:51pmoreau: I see what you mean
20:52pmoreau: Re “btw, do you have a working cuda setup for g80?” I have a semi-working one (i.e. I can compile OpenCL C (and most likely CUDA too) to binary, but I haven’t tried running it as I do not have the blob installed).
20:52imirkin: i just want the compiled binary
20:52imirkin: can you see what membar becomes?
20:53pmoreau: Let me find where I hid my script for doing it, one sec
20:53imirkin: allegedly it's some long sequence of accesses
20:53imirkin: i'm concerned that it might be GPU-specific
20:53imirkin: depending on the memory geometry
20:54pmoreau: Okay, found my script again
20:54imirkin: pmoreau: in CL, I think it's barrier(CLK_GLOBAL_MEM_FENCE)
20:57pmoreau: BTW I am using https://github.com/ljbade/clcc which calls the NVIDIA OpenCL compiler to compile OpenCL C down to PTX, and then run ptxas on top of it. So you only need to have CUDA installed (6.5 for Tesla support) and use clcc to get PTX; no need to run an OpenCL program.
20:57pmoreau: Okay, let’s see what I can get.
20:57imirkin: well, the main thing is ptxas
20:57imirkin: i could probably give you an appropriate PTX program too
20:57imirkin: i just don't have that old a version
21:01pmoreau: I get the same thing for SM1.2; let me try 1.0
21:02imirkin: it wouldn't be based on the SM
21:02imirkin: it'd be based on the specific chip
21:02imirkin: and/or fuse settings
21:02imirkin: or maybe they weren't so careful
21:02imirkin: anyways, this is great, thanks
21:04pmoreau: Welp, the CUDA version I have is too “recent” for SM 1.0; it only supports from 1.1+. 🙃
21:04pmoreau: Mmh, if it’s chip-/fuse-based, then I would need to run a CUDA/OpenCL program and see what is being generated. :-/
21:04imirkin: although it looks like it just assumes there are 8 memory partitions or whatever
21:05imirkin: which was the max on tesla
21:05imirkin: so all is well
21:05imirkin: it could be that you could have cheaper barriers on some boards
21:05imirkin: or whatever
21:05pmoreau: I see
21:05imirkin: i really don't remember how the memory stuff works on tesla
21:05imirkin: i just remember lots of weirdness
21:17imirkin: pmoreau: hm, i guess that assumes that the zero page is mapped?
21:17imirkin: i wonder if we do that
21:19imirkin: huh, i wonder why it does the MOV R63, R1 stuff
21:19imirkin: i can't imagine that does anything...
21:19imirkin: probably just a quirk of its own RA?
21:27imirkin: mind doing the other one too?
21:28imirkin: (the local barrier)
21:34pmoreau: Sure, what would it look like?
21:35imirkin: s/GLOBAL/LOCAL iirc
21:35imirkin: https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/barrier.html :)
21:35pmoreau: Got it
21:36imirkin: i sorta assume it's a no-op
21:36imirkin: but maybe it's a BAR with some args
21:40pmoreau: BAR.ARV.WAIT b0, 0xfff; /* 0x00000780861ffe03 */
21:41pmoreau: Not too much going on, but on the other hand it's the only thing happening in the kernel.
21:41RSpliet: it's been forever, but last I remember Tesla only has a hammer for its memory barriers (or fences?), and the blob just issues the same barrier for all type of requested OpenCL ones
21:41imirkin: pmoreau: ok cool. so it's just the bar
21:42imirkin: i wonder what the diff is between that 0xfff thing and "ALL"
21:42imirkin: the bar we emit is "bar inc wait 0x0 all", whereas theirs is "bar inc wait 0x0 0xfff"
21:43pmoreau: Mmh, is there a way to specify that all in OpenCL or PTX?
21:43imirkin: in PTX it's just "membar.a" or whatever
21:44imirkin: there's membar.cta and membar.gl -- those roughly correspond to the global/local things
21:45pmoreau: Okay, let me try that
21:46pmoreau: BTW, in PTX local is just `bar.sync 0;` whereas global is `membar.gl; bar.sync 0;`
21:49pmoreau: It’s not happy about that `membar.a`.
21:50pmoreau: “Barrier modifier required for instruction 'membar'”
21:51imirkin: pmoreau: what about membar.cta?
21:51pmoreau: Better! 👍️
21:52imirkin: i think bar.sync is the equivalent of barrier() in GL
21:52imirkin: whereas membar is the equivalent of memoryBarrier()
21:52pmoreau: That one got optimised away
21:53imirkin: yeah, they're probably clever.
21:53imirkin: those jerks.
21:53imirkin: i think i have what i need though - thanks a lot!
21:53pmoreau: You’re welcome!
21:54pmoreau: If you have a slightly more involved kernel you would like me to try, do not hesitate.
21:56imirkin: they seem to rely on the first few pages being mapped ... unfortunate.
21:56imirkin: i'll come up with something.
21:58pmoreau: I have a question for you before I get some rest: when creating a GPR value it has to be at least 4 bytes, right? So when doing a `ld u8 %r114 g[%r113+0x0]`, %r114 is actually 4 bytes wide. Or should %r114 be allocated with some `getScratch(1)`?
22:02pmoreau: The issue being that `MemoryOpt::combineSt()` crashes due to `sizeSt` becoming negative rather than 0 (`typeSizeof(st->dType) == 1` but `st->getSrc(s + 1)->reg.size == 4` so `sizeSt == -3`), and since the loop condition is `sizeSt`, it keeps on iterating past the number of sources that `st` has.
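[editor's note] A minimal sketch of the arithmetic behind the crash described above. This is a hypothetical stand-in, not the Mesa code: it only models a byte counter that is decremented by each source's register size, and shows why a "non-zero" loop condition never stops once the counter skips past 0.

```cpp
// Bytes left after consuming one source register. This goes negative when
// the register is wider than the stored type (1 - 4 == -3 in the chat).
constexpr int remaining_after_one_source(int storeBytes, int srcRegBytes)
{
    return storeBytes - srcRegBytes;
}

// Hypothetical guarded walk over the sources: stop as soon as the counter
// is no longer positive, and never step past the sources that exist.
// A loop conditioned only on "remaining != 0" would keep going at -3.
constexpr int count_consumed_sources(int storeBytes, int srcRegBytes, int numSrcs)
{
    int consumed = 0;
    for (int remaining = storeBytes; remaining > 0 && consumed < numSrcs; ++consumed)
        remaining -= srcRegBytes;
    return consumed;
}
```

Guarding the loop is only one option; the alternative raised later in the chat is to type the instruction so the sizes agree in the first place.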
22:03imirkin: yeah, so it's all slightly wishy-washy
22:03imirkin: there are a few things going on here
22:03imirkin: there's a dType/sType on the instruction
22:03imirkin: and there are practical widths of the arguments
22:03imirkin: now, while the IR can support any theoretical combination of these
22:03imirkin: the hw can only do the things it can do
22:03imirkin: as reflected in the emitter
22:04imirkin: there are no 1-byte registers on nv50
22:04imirkin: so the value should at least be getScratch(2)
22:04imirkin: i'd have to check if ld u8 $r0 g0 is supported
22:04imirkin: or if the dest has to be a half-reg
22:04imirkin: i usually check with envyas :)
22:05imirkin: looks like ld only supports full-width regs
22:05imirkin: (or wider)
22:05pmoreau: Ah good point, ld u8 might not be supported; if you could check on non-Tesla too it would be nice, as we will have the same problem there.
22:06pmoreau: Same with st I would assume?
22:06imirkin: so ld u8 $r0 g0[$r0] is legal
22:06imirkin: and will do a 1-byte read from memory
22:06imirkin: and will zero-extend it into $r0
22:06imirkin: yes, same idea with st
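[editor's note] A small sketch of what "1-byte read, zero-extended into a full register" means for the `ld u8` case above. The function name is hypothetical, and memory is modelled as a single little-endian 32-bit word purely for illustration.

```cpp
#include <cstdint>

// Hypothetical model of "ld u8 $r0 g0[...]": pick one byte out of memory
// (modelled here as a little-endian 32-bit word) and return it in a 32-bit
// register with the upper 24 bits cleared.
constexpr std::uint32_t ld_u8_zero_extended(std::uint32_t memoryWord, unsigned byteIndex)
{
    return (memoryWord >> (8u * (byteIndex & 3u))) & 0xffu;
}
```

The destination is a full 32-bit value even though only 8 bits were read, which matches the idea of giving the instruction a 32-bit destination type and an 8-bit source type.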
22:06pmoreau: Okay, so we will need to fix combineSt then.
22:07imirkin: slightly more ideally
22:07imirkin: the instruction would be
22:07imirkin: dType = TYPE_U32
22:07imirkin: sType = TYPE_U8
22:08imirkin: fwiw i haven't run into that particular issue
22:08pmoreau: Mmh, good point.
22:08pmoreau: May I introduce you to the OpenCL CTS then? :-)
22:09imirkin: but of course i also don't have a ton of code exercising this stuff
22:09imirkin: yeah, i'm sure the CTS will (or does) expose tons of problems
22:09imirkin: i'm just trying to make things work at least slightly
22:09imirkin: to quote the simpsons...
22:09pmoreau: I’ll try to change that tomorrow in the current nir frontend.
22:09imirkin: i'm just trying to do a good deed here, i'm not running for jesus!
22:10karolherbst: ahhhh... networking is annoying
22:10imirkin: esp if you don't plug the cables in
22:10karolherbst: yeah well
22:10karolherbst: I have this new apartment with 6 eth ports all over the place
22:10pmoreau: While we are looking at ld/st, I believe we need something like https://gitlab.freedesktop.org/pmoreau/mesa/-/commit/61b529ed000e198eee06fa7ce9b80032e18d5b48. Do you think it should be implemented differently?
22:10karolherbst: 1 works at 1000, 2 at 100 and 3 are just broken :/
22:11pmoreau: Arf, not great
22:11karolherbst: and my WFH room has one of the 100 mbit and a broken one :D
22:11karolherbst: my bedroom has the 1000 one
22:11imirkin: pmoreau: i haven't given it much thought. i think there's some 64-bit support in at least g200, maybe all of g200+
22:11karolherbst: pmoreau: but I also have a managed switch which is able to check cable quality and one is "short" and another one "open" and stuff
22:12karolherbst: very fun
22:12pmoreau: Okay no worries, it can wait until a later MR.
22:12imirkin: pmoreau: looks like at least CAS/EXCH support 64-bit with SM1.2 on global memory
22:14pmoreau: IIRC I was getting hardware errors when doing 64-bit loads/stores to shared, but it could have been an encoding issue in the emitter or me coming to the wrong conclusion regarding a bug.
22:15imirkin: and ADD for 64-bit too
22:15imirkin: yeah, doesn't seem like there's any 64-bit shared support
22:15imirkin: would be good to confirm with opencl
22:15imirkin: or whatever
22:15pmoreau: ADD as in atomic ADD, or regular ADD?
22:15imirkin: atomic global memory add on 64-bit memory
22:16imirkin: again, not shared mem ;)
22:16pmoreau: I’ll see if I can confirm tomorrow the 64-bit ld/st on shared mem with OpenCL.
22:16imirkin: i guess if you can lock 32-bit mem
22:16imirkin: you can also treat that as the lock location for any later memory too
22:16imirkin: there's no 64-bit shared mem variants in envydis, but it's conceivable those were missed
22:17imirkin: but there's no 64-bit shared mem in GL, so i left that off until later.
22:17imirkin: pmoreau: btw, what board are you testing on?
22:17pmoreau: MCP79 and G96 mostly
22:17imirkin: ok cool
22:18imirkin: the MCP79 is SM12 =]
22:18imirkin: i'll have some more patches later to do shared atomics
22:53Lyude: skeggsb: btw - do we have any plans to get the new nouveau repo included in drm-tip by default?
22:57Lyude: actually seems like it shouldn't be hard at all (just merged it with drm-tip in a local branch of mine, no conflicts)
23:05Lyude: also - do folks still use the nouveau trello board? I'm thinking of putting up a TODO list for igt, since I'm hoping to send out most of my pending patches today and I think with that we'll be at a point where multitasking with this should be possible
23:05imirkin: i do
23:05imirkin: from time to time
23:05Lyude: btw mupuf - prepare for igt patches :)
23:05imirkin: it's kinda a personal todo list, in many ways
23:06imirkin: it's unlikely that someone will do an item that someone else puts up
23:06imirkin: unless there's a concerted effort to distribute the work, and the cards are actionable items
23:06Lyude: yeah - that's kinda what i'm hoping to go for
23:07Lyude: i'd like to have it up there at least so folks have the option
23:07imirkin: i guess you and karol have gone so far as to just create yourselves lists of TODO's :)
23:07imirkin: i disguised mine under more general categories
23:51Lyude: well, that explains this issue
23:53Lyude: ...sigh, or not
23:54Lyude: wait-no, I think it does, I think the last kms_plane bug that nouveau is hitting is because we're forgetting to include the ovly immediate in the wndw interlock
23:55Lyude: s/wndw interlock/interlock/
23:56skeggsb: i'm not 100% sure the interlocks work on evo with the pio channels (curs/ovim), iirc having problems interlocking cursor with core when initially writing atomic
23:56skeggsb: we interlock everything on nvd
23:56skeggsb: i could have just done something dumb at the time though
23:58Lyude: skeggsb: I -think- this might be it though, since it's the only thing that makes [ 324.980409] nouveau 0000:1f:00.0: disp: chid 7 mthd 0080 data 00000001 00005080 0000000b make sense
23:58Lyude: (also hold on, I will post the full log just for reference)