00:00imirkin_: most of it is meant to just be adapting some of the ways in which tgsi hands you the data
00:00imirkin_: the min/max thing is really my bad - sorry!
00:01imirkin_: are there other weird ops?
00:01karolherbst: quite a few
00:01imirkin_: i guess stuff like SSG?
00:01imirkin_: we could add a "sign" op in nvir
00:02karolherbst: well I also fail basics like add for 64bit ints
00:03karolherbst: but I am not sure why though
00:03imirkin_: yeah, that might be your bad :p
00:03karolherbst: I doubt that though
00:04karolherbst: there are sets
00:04karolherbst: in the test
00:04karolherbst: so the add is probably fine, but the checks within the shader
00:04karolherbst: probably not
00:05karolherbst: anyway, I will go carefully through this and move stuff which makes sense
00:06karolherbst: imirkin_: bit select
00:06karolherbst: not bit select
00:08imirkin_: what's select?
00:08imirkin_: like OP_SLCT?
00:08karolherbst: there are two versions of that in nir
00:08karolherbst: fcsel and bcsel
00:08karolherbst: fcsel compares against 0.0f
00:08imirkin_: fcsel == CMP, bcsel == UCMP i guess?
00:09karolherbst: it is a real select
00:09imirkin_: what's the condition?
00:09karolherbst: a boolean value
00:09imirkin_: for fcsel
00:09karolherbst: float != 0.0f
00:09imirkin_: so kinda like CMP but not identical
00:09imirkin_: CMP is float < 0.0f iirc
00:09imirkin_: UCMP is a boolean
00:10imirkin_: mkCmp(OP_SLCT, (srcTy == TYPE_F32) ? CC_LT : CC_NE,
00:11karolherbst: 64 bit in to bool conversion is also quite painful
00:11karolherbst: well, or basically doing a 64bit AND
00:12karolherbst: or did I get this one right?
00:13karolherbst: I know that some conversion things related to bools are also broken
00:14karolherbst: I need to go through those again and see what should be moved somewhere else
04:11imirkin: karolherbst: i pushed that textureGrad change
04:12imirkin: gonna look at the maxwell thing and see if something obvious pops out
05:56bazzy: I filed that "double adapter" bug I discovered: https://bugs.freedesktop.org/show_bug.cgi?id=104344
05:57bazzy: crashes nouveau
05:57bazzy: it's a rare scenario I reckon
10:08pmoreau: imirkin: I stopped working on the SPIR-V -> NVIR while karolherbst was playing with NIR -> NVIR, as if he could get it working, it would become more relevant to instead work on improving NIR to support OpenCL SPIR-V, rather than continuing with the SPIR-V -> NVIR.
10:08pmoreau: (Plus that gave me some more time to work on SPIRV-Tools.)
14:24karolherbst: pmoreau: wow.. some games are also already running through my nir pass :) although I didn't test anything complex
14:26imirkin: karolherbst: if you have a gm10x+ handy, can you test out my textureGrad patch?
14:26karolherbst: pascal if that helps?
14:26imirkin: yep, that counts as gm10x+ :)
14:27karolherbst: you didn't move the quadon/pops, but I doubt they matter except for performance or whatever reason
14:28imirkin: they were already positioned same way as in the new code
14:28imirkin: oh fuck, just saw a bug
14:28karolherbst: ahh, okay
14:29imirkin: here's an updated version
14:30karolherbst: the one test passes
14:31karolherbst: tex-miplevel-selection texturegrad cube
14:31imirkin: mind checking the others?
14:31imirkin: CubeShadow, CubeArray
14:31imirkin: and the CTS test
14:31karolherbst: the two others as well
14:31karolherbst: mhh, do you know the name of the cts one?
14:32imirkin: check that patchwork link
14:32imirkin: i say which tests in the description / comments
14:35karolherbst: mhh, it still fails
14:35karolherbst: it runs much much longer
14:35karolherbst: well it tests quite a lot
14:35imirkin: grep 'Sampling shader' TestResults.qpa
14:36karolherbst: no output
14:36imirkin: but it fails?
14:36imirkin: must be some other failure then
14:37imirkin: pastebin the whole TestResults.qpa file?
14:38imirkin: or at least a good chunk of it if it's huge
14:38karolherbst: ugh... I've messed the run up, needed to force 4.5...
14:39imirkin: ah oops :)
14:39karolherbst: I get lines like this:" <Text>Format: GL_RGBA8, Mutability: 1, Sampling shader: Fragment, Sampling function: TextureGrad, W: 64, H: 64</Text>"
14:39imirkin: can you grab the full list?
14:39imirkin: i.e. run that grep
14:40karolherbst: no, I messed up. It passes with your test
14:41karolherbst: I forgot to use the local version of mesa :)
14:41imirkin: ok awesome
14:41imirkin: well, play around with it, feel free to send a Tested-by
14:41imirkin: i'll sit on it until at least tonight
14:42karolherbst: maybe at some point I get to run a full piglit here
14:42karolherbst: there are at least no regressions on the run with NIR
14:43karolherbst: but that's also not a full run, more like a 60% run
14:53imirkin: pmoreau: well, i'd still much much much rather have SPIR-V -> NVIR directly.
14:55karolherbst: imirkin: for what reason though? I see a lot of benefit in using NIR here, especially in terms of work sharing and benefiting from what others do
14:55karolherbst: if we do it directly, we are basically on our own
14:56imirkin: we're on our own either way.
14:56karolherbst: I don't see why we should be
14:57karolherbst: we are not the only ones wanting to have this
15:01imirkin: nir is an extra translation step
15:01imirkin: which seems unnecessary
15:02imirkin: and we're bound to how intel works, because nir is basically an intel compiler (in terms of manpower)
15:03karolherbst: well, right, but when working on nir it doesn't feel like we are constrained by how intel hardware works
15:03imirkin: and lastly, i'm still and will probably remain pissed at the fact that nir was created without *any* consideration to the already-existing in-tree compilers which did everything it did and more.
15:04karolherbst: you mean tgsi?
15:04imirkin: i mean nvir
15:04imirkin: and to a lesser extent, r600/sb, although i think that one is *very* r600-specific. i haven't looked super-closely though.
15:05imirkin: tgsi is not a compiler - it's just an instruction spec + stream.
15:13karolherbst: mhh, but what difference could it have made to consider any of the backend compilers? None of those would have been chosen instead of nir anyway, because they are too specific, and if you remove the nv specific stuff from codegen, not much would be left
15:21imirkin: nvir isn't specific at all
15:21imirkin: and it's a fully optimizing compiler
15:22imirkin: which does SSA and everything
15:22imirkin: which the nir guys spent the better part of a year reimplementing
15:22imirkin: instead they could have been making nvir better and everyone would benefit
15:25karolherbst: maybe, maybe not
15:26karolherbst: anyway, it is tiring to discuss from a what-if kind of perspective. I doubt I would have even considered it, because a lot of the code is still being developed with nv hw in mind, and then you have to decide whether a clean new implementation or reworking something else is more worth the effort
15:26karolherbst: and usually the former decision turns out being the better decision, even if it is the harder one
15:27karolherbst: and we still would have to fix most opt passes in nvir
15:27karolherbst: especially because they depend on order and can't be run multiple times and so on
15:27karolherbst: sure, most of it can be fixed
15:28karolherbst: but I highly doubt it would be a 2m vs 12m kind of difference
15:28karolherbst: and then again, doing something new allows you to make different foundational and other design decisions
15:38karolherbst: in the end, I don't know what would have been the better decision or the better thing to do. And I don't want to look back and be annoyed by the decisions others made and others adapted to. And because we have nir now and a lot of drivers are picking it up, I don't see why I should be grumpy about this. It isn't like NIR is crap and not worth investigating. I see it as a way to reduce our work to support certain things and if
15:38karolherbst: it makes it easier for us to reach certain performance/feature goals, then I don't see why I should even consider it being the wrong decision. It is maybe not the perfect one, but neither are 99% of decisions made basically
15:57imirkin_: i'm pulling for spirv -> nvir. this isn't a performance/feature project, it's a "i'm having fun" project. i can't have fun with nir.
16:05karolherbst: I wouldn't mind if we end up with both anyway. I am having fun with nir currently and I am getting the feeling that this will be the way we get those features faster as well. So this is basically the foundation of my decision here
16:10imirkin_: glad you're having fun :)
16:15karolherbst: mhh, currently a little stuck with edgeflag...
16:16karolherbst: I see that in TGSI we get the .xxxx swizzle, but I don't see it happening in nir, anyway, there are also those weirdo writes into o[0x0] - o[0x8] with the edgeflag.x value
16:16imirkin_: ignore it.
16:16imirkin_: it doesn't do jack shit.
16:17karolherbst: mhh, test is failing due to this though
16:17karolherbst: "shaders@point-vertex-id divisor" for example
16:17imirkin_: i think that fails with tgsi too
16:18karolherbst: it doesn't
16:18karolherbst: otherwise I would have ignored it :)
16:18imirkin_: good luck
16:18karolherbst: in nir I do this: "vfetch b128 $r0q a[0xa0]" and exporting those values in o[0x3fc] + o[0x0] ... o[0x8] and tgsi just uses the x component
16:19karolherbst: yeah, right
16:19imirkin_: the edgeflag does not need to be fetched or written in the shader.
16:19imirkin_: i know tgsi passes it through because other hw supports it
16:19imirkin_: but nvidia hw has no edgeflag support in shaders
16:19karolherbst: I see
16:19karolherbst: so we could just drop it?
16:19karolherbst: and just totally ignore it
16:20karolherbst: well I guess I still need to care about the slots and so on
16:20imirkin_: it's handled by nvc0_vbo_translate
16:21imirkin_: the only thing is
16:21imirkin_: iirc whether there's an edgeflag vbo thing or not is determined by the TGSI shader
16:21imirkin_: but it's just a "present or absent" sort of thing i think
16:21imirkin_: maybe it keeps track of which vertex attrib it is too
16:21karolherbst: mhh, interesting
16:22imirkin_: either way, it has no direct effect on the shader
16:22imirkin_: i.e. the shader compiles identically whether there's edge flag stuff or not
16:22imirkin_: but nvc0_vbo will do different things depending
16:22karolherbst: then I just need to stop writing/reading the wrong values
16:22karolherbst: but then again...
16:24karolherbst: there is a nir_lower_passthrough_edgeflags thing
16:25karolherbst: I guess if I make nir not run this, it might just work :)
16:31imirkin_: well, you still need to know if there's an edgeflag or not, and which attrib it's in
16:31imirkin_: (might always be the last attrib though)
16:32karolherbst: yeah, I am currently checking within gallium how that thing is handled
16:32karolherbst: maybe we get that information without having to pass it through
16:32imirkin_: you don't.
16:32imirkin_: but it could easily be done.
16:58karolherbst: imirkin_: but would you agree that writing into o[0x0] is wrong if the output is edgeflag? with tgsi we end up writing the value into 0x3fc and 0x0, 0x4, 0x8. Or won't the last write matter at all anyway?
16:58imirkin_: you're using nouveau_compiler
16:59imirkin_: which is not representative as to the actual output locations
16:59karolherbst: I am not
16:59imirkin_: if we write to a[0x3fc] then ... i can't imagine it matters
16:59imirkin_: i assume it's discarded.
16:59imirkin_: by the hw
16:59karolherbst: I am running "/home/kherbst/git/piglit/bin/point-vertex-id divisor -auto -fbo"
16:59karolherbst: mhh, yeah
17:01karolherbst: I have the same shaders/headers now, but it still fails, I guess something else is wrong then
17:02karolherbst: and I am sure I set io.edgeFlagOut correctly
19:41pmoreau: imirkin_: I haven’t flushed all the code down the toilet yet either :-D
19:48pmoreau: Need to do a couple modifications, and I’ll resend the clover series, regardless of doing NIR or SPIR-V.
19:55karolherbst: pmoreau: this includes accepting spir-v through clover?
19:55karolherbst: and all the general spir-v IR handling there?
19:56pmoreau: So, nothing that involves accepting NIR or SPIR-V in Nouveau, besides adding PIPE_IR_SPIRV
19:57karolherbst: pmoreau: by the way, I did a full piglit run except those silly tests: 24984/37936
19:57RSpliet: pmoreau: you had some nice work on local(/shared) memory and atomics too right?
19:57karolherbst: let's see if I can manage to implement variable array accesses as well
19:57pmoreau: karolherbst: Looking quite good!
19:57karolherbst: then the biggest general stuff should be done
19:58karolherbst: I am actually surprised that I got texturing working
19:58karolherbst: actually I should increase my sample size here
19:58karolherbst: but I am too lazy to compile a 32bit version of mesa right now
19:58karolherbst: and I have no clue what are 64bit games
20:00karolherbst: amd games not using geometry or compute shaders :)
20:00pmoreau: RSpliet: Not entirely sure about that. For atomics, I just translated the SPIR-V ops to NVIR. For shared, I only did the patch allowing to report shared mem in the stats, and I want (but haven’t finished yet) to report stats per kernel rather than per program.
20:08karolherbst: does anybody know off hand, if we choose the first lmem address depending on info->bin.tlsSpace in the spilling code?
20:11karolherbst: most likely prog.tlsSize instead though
20:11karolherbst: yep, it uses this
20:21imirkin_: karolherbst: the idea is that you want to spill *after* the variable-indexed arrays
20:22karolherbst: I already figured that
20:22imirkin_: ok :)
20:22karolherbst: doing it the other way sounds like a lot of pain
20:22karolherbst: well maybe not that much, because you basically just have to rewrite all lmem access symbols....
20:22karolherbst: but pointless
20:22imirkin_: but ... why cause yourself the pain
20:22karolherbst: for ... fun?
20:23karolherbst: maybe it even makes sense to consider it being a big area for both
20:23karolherbst: and optimize accesses based on those memory bank stuff or something
20:23karolherbst: no idea if that is a thing there
20:24karolherbst: but maybe
20:24karolherbst: RSpliet should know more :)
20:28RSpliet: local memory banks? phoa... no I have no details on those I'm afraid
20:29RSpliet: I'm sure they did something to get vast throughput from 32-work-item warps even when doing weird stride patterns, but I haven't looked at any measurements on its performance
20:31pmoreau: shared memory banks, there is some info available on those, but no clue on local one :-/
20:33karolherbst: RSpliet: well, if you are interested and want to check if you are able to improve perf in situations where we have spilling and "normal" lmem access ;)
20:33RSpliet: pmoreau: OpenCL local memory == CUDA shared memory
20:34RSpliet: so... which "shared memory banks" were you talking about? :-D
20:34pmoreau: RSpliet: But spilling is to CUDA local memory, not shared, right?
20:34RSpliet: I don't think there is such a thing as CUDA local memory
20:34pmoreau: RSpliet: I’m so used to CUDA names that I can’t get used to the OpenCL ones, so I’m always using the CUDA ones ;-)
20:35RSpliet: hah, I'm the opposite
20:35karolherbst: well, we use the terms we use in nouveau now...
20:35pmoreau: I think there is, but I don’t think you can use it directly.
20:35RSpliet: ideally you want to spill to what OpenCL calls local memory
20:36RSpliet: Stackexchange: "Local memory" in CUDA is actually global memory (and should really be called "thread-local global memory") with interleaved addressing (which makes iterating over an array in parallel a bit faster than having each thread's data blocked together).
20:36RSpliet: This is why I follow OpenCL convention...
20:36karolherbst: so lmem is thread-local global memory?
20:36pmoreau: RSpliet: Right, CUDA local memory is global memory.
20:37RSpliet: karolherbst: what other flavours do we have in nouveau?
20:37karolherbst: RSpliet: smem
20:37pmoreau: smem is CUDA shared mem
20:37karolherbst: and then those pseudo types
20:37karolherbst: like a/o for input/output
20:38karolherbst: did I forget something?
20:38pmoreau: a is also for address thingies on Tesla
20:38RSpliet: cmem == constant mem. That's in my terms read-only global memory so without the cache coherence
20:38karolherbst: what is buffer?
20:38karolherbst: do we have a b thing?
20:39karolherbst: ubos and that kind of stuff?
20:39karolherbst: RSpliet: I meant constant buffer here
20:39karolherbst: c0 - c15
20:39karolherbst: but yeah
20:39karolherbst: this should be read only
20:43imirkin_: ubo is c0..c15
20:43imirkin_: buffers are converted to g ... somewhere. i think in the from_tgsi logic.
20:43imirkin_: i forget.
20:44RSpliet: karolherbst: yeah me too. Think they're mapped into global mem. might be some more hw faff for buffering/prefetching values or sth, lots of magic I don't know of
20:44karolherbst: imirkin_: ahh, I see. I just noticed we have that MEMORY_BUFFER constant, didn't check where it is actually used
20:44karolherbst: RSpliet: right, they are allocated global memory
20:45imirkin_: constbufs have *tons* of magic behind them
20:45imirkin_: to support the draw; glUniform(); draw; glUniform(); draw; stuff
20:46karolherbst: imirkin_: ATOM stuff uses FILE_MEMORY_BUFFER
20:46karolherbst: and it gets lowered in NVC0LoweringPass::handleATOM
20:46karolherbst: or the NV50 version of it
20:47RSpliet: shared, smem or s is OpenCL local memory, the per-SM stuff (the 16 or 48KiB chunk)
20:47RSpliet: that's where you want to spill into if memory permits, lots and lots cheaper
20:47pmoreau: (or even 96KiB on recent cards ;-))
20:47imirkin_: you spill into lmem
20:47imirkin_: which is invocation-private
20:47karolherbst: RSpliet: but then we have to be careful not to overwrite those
20:48imirkin_: smem is global to that execution, and is cut out of L2 memory
20:48imirkin_: i believe lmem is cut out of that too
20:48imirkin_: or maybe out of L1
20:48pmoreau: smem is L1
20:49karolherbst: lmem is global memory if it is true what RSpliet said
20:49pmoreau: (100% sure for Volta and up, 95% for previous gens)
20:49imirkin_: all this stuff has all sorts of tricks to it
20:49karolherbst: I figure
20:49karolherbst: in the end, it matters what nvidia does ;)
20:49karolherbst: if they spill to lmem, so should we
20:49karolherbst: and hopefully it is the right thing to do
20:53pmoreau: It looks like CUDA shared mem has always been a chunk of L1, as you could configure on Tesla and Fermi the split between L1 and smem.
20:54pmoreau: lmem, from what I remember, is off-chip and just a subsection of global mem. Let me find a ref
20:54imirkin_: hmmmm interesting
20:54imirkin_: but of course smem isn't available in gfx shaders right?
20:55pmoreau: I don’t think so
20:55pmoreau: Maybe through an NV extension?
20:55imirkin_: i meant in hw
20:57pmoreau: Well, hw-wise, I don’t see why shaders couldn’t use it: they run on the same cores as the CUDA kernels which can use them.
20:58pmoreau: http://on-demand.gputechconf.com/gtc/2010/presentations/S12011-Fundamental-Performance-Optimization-GPUs.pdf has some information about the different memories (except lmem), and the banks backing CUDA shared memory. It’s for Fermi though, but still interesting
20:58imirkin_: i just don't think you can use s ops in the hw
20:58imirkin_: you have to declare how much shared mem you use
20:58imirkin_: which is in the compute descriptor / compute class method
20:58imirkin_: (kepler+ / fermi)
21:00karolherbst: I would assume we have to rely on caching for gfx shaders, because usually you don't need lmem in gfx shaders anyway
21:00pmoreau: When we compile the shaders, we can compute how much we will spill, and could declare that when launching the shaders, no?
21:00karolherbst: only in a few cases
21:00karolherbst: pmoreau: not if that is only part of the compute descriptor, no?
21:00imirkin_: pmoreau: but where would you declare it...
21:00pmoreau: imirkin_: Oh, you meant it’s not part of the shader descriptor?
21:00pmoreau: Ah, gotcha
21:01pmoreau: Or could it be we mistook it for just some random value that’s always 0?
21:02karolherbst: might be worth investigating
21:03imirkin_: well, the SPH is documented by nvidia
21:03imirkin_: and there's no mention of shared mem
21:03karolherbst: but I would assume that nvidia moves things around so much, that having a bigger cache improves perf more than using it for spilling
21:03karolherbst: imirkin_: it is?
21:03karolherbst: ohh right
21:03imirkin_: now obviously it might be there but not documented
21:04karolherbst: Reserved ;)
21:04pmoreau: It’s true that we have some official documentation we can look at :-D
21:04karolherbst: in those public docs, Reserved doesn't really mean "reserved for future use"
21:04imirkin_: with helpful descriptions like "The SPH field Version sets is used during development to pick the version."
21:04pmoreau: http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf slide 34: “All data lives in DRAM: global memory, local memory, textures and constants”
21:05pmoreau: Awesome stuff! xD
21:05karolherbst: imirkin_: how big is the field in the compute desc?
21:05imirkin_: for shared? i forget
21:05imirkin_: that's documented too btw
21:05imirkin_: http://download.nvidia.com/open-gpu-doc/Compute-Class-Methods/1/ and http://download.nvidia.com/open-gpu-doc/qmd/1/
21:06imirkin_: the latter is the descriptor bo
21:06imirkin_: a0c0 = kepler
21:10pmoreau: I miss those slides about memory requests sizes and all that stuff. :-D I should spend some time going through them again.
21:18pmoreau: I had completely missed that: the GV100 has actual 32-bit int units
21:19pmoreau: Err, rather we know how many there are, which was apparently not the case before
21:19RSpliet: What's this thing: https://github.com/skeggsb/nouveau/blob/master/drm/nouveau/nvkm/engine/gr/ctxnv50.c#L3210 ?
21:20imirkin_: "good times"
21:20imirkin_: this is for the whole GR unit
21:20imirkin_: it's not class-specific
21:21NSA: damn it
21:22NSA: INFO: task kworker/u24:12:2640 blocked for more than 120 seconds
21:22NSA: Workqueue: events_unbound nv50_disp_atomic_commit_work [nouveau]
21:29imirkin_: RSpliet: remember that nv50 doesn't have a unified isa... all the smem opcodes are in compute shaders only
21:30NSA: btw those logs also mean that my screen is completely frozen
21:32NSA: i'm open for suggestions to try to fix that
21:33imirkin_: it's stuck waiting for some fence which will most likely never be reached
21:33imirkin_: either you have multiple GPUs and something went wrong
21:33imirkin_: or you got a gpu hang
21:33NSA: single gpu currently
21:35NSA: so any ideas for fixing/analysing other than reboot?
21:40* imirkin_ has no clue how to debug that
21:53NSA: https://wiki.archlinux.org/index.php/nouveau#Random_lockups_with_kernel_error_messages that doesn't sound like a good idea :/
22:41karolherbst: imirkin_: do temporaries have to be aligned to 0x10 even if they are just vec1 in glsl?
22:41karolherbst: like if I have a vec1, the elements have to be stored at 0x0, 0x10, 0x20 and 0x30?
22:41imirkin_: alignment only matters for access
22:41karolherbst: well, right
22:41imirkin_: if you do a 128-bit load, it has to be aligned to 0x10
22:41karolherbst: but they are accessed
22:41imirkin_: if you do a 32-bit load, you have to be aligned to 0x4
22:41karolherbst: ohh, I meant on a glsl level
22:41imirkin_: are you talking about ubo?
22:42imirkin_: there are specific ubo layouts
22:42imirkin_: e.g. layout (std140) has very well-defined layout
22:42karolherbst: no, plain arrays
22:42karolherbst: like if you do arr[some_var].x
22:42imirkin_: i have no idea wtf you're talking about then
22:43imirkin_: you mean like if you have a variably-indexed array?
22:43imirkin_: glsl doesn't know or care about what the compiler does
22:44imirkin_: those variables don't leave the shader invocation
22:44imirkin_: so as long as it's all internally consistent, no one is the wiser
22:44imirkin_: the reason it's that way for TGSI is ... i'm lazy
22:45imirkin_: coz all temps are vec4 in tgsi
22:45imirkin_: so i'd have to back out the access mask for that var
22:45imirkin_: which i think might actually be passed in now, but definitely wasn't before
22:45karolherbst: ahhh, okay
22:45karolherbst: I see
22:45karolherbst: I think I still need to align the input address a bit
22:45karolherbst: I see there is a shr in the tgsi
22:45karolherbst: which I don't do yet
22:46karolherbst: just wondering how the input and all works out so far
22:46imirkin_: you know, i DO remember some bit of code
22:46imirkin_: which relied on something being vec4-aligned
22:46imirkin_: i THINK it had to do with inputs/outputs
22:46karolherbst: I see
22:47imirkin_: like if you have indirect indexing on an input/output
22:47imirkin_: but that output is backed not-as-a-vec4
22:47imirkin_: then... sadness
22:47imirkin_: but that can't happen in practice
22:47karolherbst: might be a tgsi limitation though
22:47karolherbst: in the end
22:48imirkin_: ok ... so
22:48imirkin_: have a look at Converter::shiftAddress
22:48imirkin_: this shifts the address by 4, i.e. * 16
22:49imirkin_: which assumes a vec4 situation
22:49imirkin_: and esp look at the "XXX" comment in fetchSrc()
22:50imirkin_: naturally it's specific to GS inputs
22:50imirkin_: which are extra-super-weird on nv50
22:50karolherbst: and I wished I could get around just consuming the nir layout
22:50imirkin_: some stupid $vstride thing
22:50imirkin_: i forget
22:50imirkin_: you can.
22:51imirkin_: hopefully you're learning a lot about the nvidia isa in the process
22:51imirkin_: like i said, the conversion should be fairly straightforward
22:52imirkin_: i suspect most of the issues you're running into are "i don't know how nir works" and "i don't know how nvir works" :)
22:52karolherbst: pretty much yeah
22:52karolherbst: although for the nvir parts I can just look at what tgsi does
22:53karolherbst: I somehow stopped looking at how other drivers use nir, because I already got the basics down
22:53imirkin_: yeah, it's mostly self-explanatory
22:53imirkin_: as long as you're not modifying it
22:53imirkin_: once you start touching it, it's crashville.
22:54imirkin_: but consuming it is very easy
22:54karolherbst: ohh by the way, we can't decode maxwell/pascal binaries with nvdisasm and SM70 :)
22:54imirkin_: so the volta isa is different?
22:55karolherbst: maybe just the scheds
22:55imirkin_: have you tried fuzzing it?
22:55karolherbst: who knows
22:55imirkin_: is it public?
22:55karolherbst: it was just a quick check of the horrors we might end up having to deal with
22:55imirkin_: just need a new nvidia-cuda-tools?
22:55karolherbst: the cuda 9 tools support volta afaik
22:55imirkin_: well, if it's a different isa, i know one thing for sure --
22:55imirkin_: they reordered the tex arguments!
22:58karolherbst: ohh I can compile my ptx code to SM70
23:00karolherbst: sched for _every_ instruction?
23:00imirkin_: so ... not 100% identical :)
23:00karolherbst: or 128 bit encoding?
23:01imirkin_: i guess.
23:01imirkin_: mwk will cry
23:01imirkin_: (coz envydis doesn't support 128-bit)
23:02karolherbst: I am sure those are scheds though
23:03imirkin_: if you look at the 3 mov's
23:03karolherbst: you know
23:03karolherbst: does it make a difference?
23:03imirkin_: they def have some slightly diff stuff
23:03imirkin_: so if it's op/sched pairs, that should work out ok for envydis
23:03karolherbst: either you have a 64 bit sched info for every instruction or just one 128 bit instruction. no difference
23:03karolherbst: well, right
23:03imirkin_: well, if they're groupped as 64/64, then envydis can avoid changing
23:03karolherbst: look at the bra
23:03imirkin_: if they're all mixed up, then fail
23:03karolherbst: it isn't 64/64
23:03karolherbst: it is 128
23:04imirkin_: yeah. you're right.
23:04imirkin_: or at least > 64-bit
23:04imirkin_: mwk: time for __int128?
23:05karolherbst: volta ISA
23:06karolherbst: check the gist above
23:06karolherbst: and look at the bra
23:06mwk: well fuck
23:09karolherbst: pmoreau: you also want to take a look? so far everybody enjoys the new volta ISA :)
23:10karolherbst: imirkin_: well maybe we have now proper 64 bit immediates :)
23:10mwk: nv20 is back :)
23:10karolherbst: or crazy things like having 4 fp16 immediates :)
23:11karolherbst: for the tensor stuff
23:11karolherbst: let me check something
23:14imirkin_: mwk: well, that has weird stuff in the middle of the instruction stream too, so it's extra odd
23:14imirkin_: iirc immediates and/or consts are embedded?
23:15mwk: well, if you have 128 bits, why not
23:15karolherbst: there is no 64bit immediate mov
23:16karolherbst: or I can't really force to use the immediate form anyway
23:16karolherbst: nvidia does its silly c2 opt
23:16mwk: we have 64-bit operands now?
23:17karolherbst: what do you mean
23:19karolherbst: 0x7f4dbdf900047802 : MOV R4, 0x7f4dbdf9;
23:19karolherbst: 0x40383bf600057802 : MOV R5, 0x40383bf6;
23:19karolherbst: both having the same high bits: 0x003fde0000000f00
23:20karolherbst: mhh, let me try something out
23:21karolherbst: what could that be?
23:24karolherbst: the encoding is >64 bit
23:24karolherbst: 0x00047808 0x7f4dbdfa 0xffffffff 0xe0000000 : FSEL.FTZ R4, -|R0|, 2.73478153844571749285e+38, !PT;
23:24karolherbst: 0x00047808 0x7f4dbdfa 0xffff0000 0xe0000000 : FSEL.FTZ R4, R0, 2.73478153844571749285e+38, !PT;
23:25karolherbst: but there are many bits which don't do anything obvious
23:26imirkin_: like why did 16 bits change in there :)
23:26karolherbst: anyway, if anybody wants to have fun with it :)
23:26karolherbst: imirkin_: I changed them
23:26imirkin_: that makes more sense.
23:27karolherbst: 0x00000100 : neg
23:27karolherbst: 0x00000200 : abs
23:27karolherbst: for src0
23:27karolherbst: I didn't find the predicate parts yet
23:28karolherbst: 0x04000000 : not
23:29karolherbst: predicate at 0x02800000
23:29karolherbst: predicate at 0x03800000
23:29karolherbst: which is also the true predicate
23:29karolherbst: and 0x07800000 is !PT ;)
23:30karolherbst: this is bits 64-91
23:30karolherbst: imirkin_: maybe there is just one form now
23:30karolherbst: because they have space
23:30karolherbst: well mabye two for each
23:30karolherbst: one with immediate and one without
23:31karolherbst: IADD3 R244, P0, P0, R15, 0x7f4dbdfa, R0; :) the heck
23:34imirkin_: IADD3 is a thing on maxwell too
23:35imirkin_: supported by the nouveau emitter
23:38karolherbst: anyway, I couldn't figure out if those bits 92-127 are doing anything
23:38karolherbst: maybe we have a 92 bit encoding + 32 bit sched
23:39karolherbst: or something like that
23:39karolherbst: might be 90/34 as well
23:39karolherbst: or just 91/33 for no reason at all
23:41imirkin_: still maxes out at 256 regs?
23:47karolherbst: imirkin_: seems that way
23:51karolherbst: bit 0-31: 0x00070000 the predicate for conditional execution
23:51karolherbst: 0x00070000 means no predicate
23:51imirkin_: aka PT
23:51karolherbst: makes sense
23:51imirkin_: that's how all the fermi+ encodings work
23:52imirkin_: the next bit is inverting it
23:52karolherbst: 0xff is RZ
23:52imirkin_: i.e. instead of P0 it'll be !P0
23:52karolherbst: then comes the dest reg
23:52karolherbst: 0x0000f000 is the predicate
23:53karolherbst: 0x00ff0000 is the dest reg
23:53karolherbst: 0xff000000 nothing
23:53karolherbst: at least for iadd3
23:53imirkin_: right, not all bits will mean something for all ops
23:56karolherbst: ... the fuck
23:59karolherbst: 0x00000386 0x00000000 0x00000000 0x0001c000 is STG
23:59karolherbst: the high bits seem to be important somehow