01:24 Lyude: i swear, the biggest challenge of getting clockgating working is just figuring out the right spot to put the register writes you want in the driver...
07:16 karolherbst: some stats with Lyude clock gating patches: https://gist.githubusercontent.com/karolherbst/ef26354f501f025136a1be1d556b902b/raw/f5e5161d3a04750f2743fcc2b6a3d9e0ba530571/gistfile1.txt
07:17 karolherbst: 3W/14% on highest clocks
07:17 karolherbst: but in general it looks like at least 10% less power consumption
07:29 karolherbst: added values on nvidia for 07 and 0f/0
07:29 karolherbst: there is still some potential :)
07:42 karolherbst: added values for furmark: https://gist.github.com/karolherbst/ef26354f501f025136a1be1d556b902b
07:43 karolherbst: so BLCG seems to be the real thing here ;)
08:11 koz_: karolherbst: Are you working on power saving or something?
08:20 karolherbst: no
08:36 karolherbst: koz_: Lyude is
08:38 koz_: karolherbst: Neat!
09:20 pmoreau: karolherbst, Lyude: Neat! Where to the patches live?
09:55 karolherbst: does anybody have this fancy perl command for nvdisasm?
09:56 karolherbst: perl -ane 'foreach (@F) { print pack "I", hex($_) }' > tt; nvdisasm -b SM30 tt
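For readers who do not have Perl at hand, the one-liner above can be approximated in Python. This is a minimal sketch, assuming the input is whitespace-separated hex words that each fit in 32 bits; the function name `hex_words_to_bytes` is made up for illustration and is not part of any tool mentioned in the log.

```python
import sys

def hex_words_to_bytes(text):
    """Pack whitespace-separated hex words as native-endian 32-bit unsigned ints.

    Mirrors what Perl's `pack "I", hex($_)` does for each field.
    """
    out = bytearray()
    for token in text.split():
        word = int(token, 16)  # accepts "0x1234" as well as "1234"
        # to_bytes raises OverflowError if the word does not fit in 32 bits
        out += word.to_bytes(4, sys.byteorder)
    return bytes(out)
```

Writing the returned bytes to a file opened in binary mode (for example `open("tt", "wb")`) gives a file that can then be handed to the disassembler as in the command above.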
09:56 pmoreau: yes
09:57 karolherbst: "nvdisasm error : Unaligned instruction found"
09:57 pmoreau: Wow, I wasn’t fully awake earlier: “Where *do* the patches live?” --"
09:58 karolherbst: pmoreau: IRC log some hours ago
09:59 pmoreau: OK, will have a look
10:00 pmoreau: Yeah, this is the command Ilia gave me.
10:00 karolherbst: ohhh maxwell crap
10:00 pmoreau: SM51?
10:00 karolherbst: I have to fill in with nops
10:01 karolherbst: but mhh
10:01 pmoreau: Well, use the proper SM version to start with as well :-p
10:01 karolherbst: IMADSP.U24.U24.U32 R2, R2, R3, R4;
10:01 karolherbst: sounds right, right?
10:01 pmoreau: Ugh, I forgot to check on my Kepler, but yes, sounds right
10:01 karolherbst: mhhh
10:01 karolherbst: weird
10:05 karolherbst: https://gist.github.com/karolherbst/46a5611e8e648db8c206812bab7c0e60
10:05 karolherbst: pmoreau: ^
10:06 karolherbst: what result would you expect?
10:06 pmoreau: 0x11
10:06 karolherbst: well, it is 0x3
10:06 pmoreau: Eh?!
10:06 karolherbst: exactly
10:07 pmoreau: Does the driver complain about something? Like an OOR or similar?
10:07 karolherbst: no
10:07 karolherbst: kernel is *out = mad24(3, 4, 5);
10:08 pmoreau: Try to run it on NVIDIA and check what they do, cause it does look quite right what you currently have
10:08 karolherbst: yeah
10:09 karolherbst: I will use this to RE the immediate forms and everything else anyway
10:09 pmoreau: Awesome, thanks! :-)
10:09 karolherbst: is mad24 often used anyway?
10:09 karolherbst: well, we need to support it, but still
10:10 pmoreau: I never used OpenCL besides for this project, so no clue :-D
10:10 pmoreau: RSpliet: Might have more experience
10:10 karolherbst: :D
10:10 karolherbst: well it is part of the spec
10:10 karolherbst: so
10:10 pmoreau: Yes
10:11 pmoreau: While you are at it, you could have a look at XMAD on GM107+ https://trello.com/c/SAMEdPQR/30-re-xmad-on-gm107 (there was some discussion on devtalk on what it could be)
10:13 karolherbst: the fucky crap...
10:13 karolherbst: pmoreau: /usr/lib64/nvidia-bumblebee/libOpenCL.so.1: version `OPENCL_2.1' not found (required by ./smad24.test)
10:13 karolherbst: ....
10:14 pmoreau: Ugh, let me have a look
10:14 karolherbst: well, nvidia only does opencl 1.2
10:14 karolherbst: but I am sure they have the extension for that spirv stuff?
10:14 karolherbst: maybe
10:14 karolherbst: or maybe not?
10:14 pmoreau: Yeah, but it shouldn’t be using the SPIR-V stuff by default, I changed its behaviour.
10:15 karolherbst: ahh
10:15 karolherbst: yeah well, on master it won't work, but maybe make fails to detect changes
10:16 pmoreau: It should be using createProgramFromSource (in utils.cpp), which calls clCreateProgramWithSource
10:16 karolherbst: well, it doesn't
10:17 pmoreau: If you go in utils.cpp and comment lines 248 to the end? https://phabricator.pmoreau.org/diffusion/SPVTES/browse/master/util.cpp;01fd7930a91035745ebe1f524ac7cd3fd927b096$248-281
10:18 karolherbst: right
10:18 karolherbst: you can't do that
10:18 karolherbst: always runtime checks ;)
10:19 pmoreau: I am runtime checking (well not yet): the function called by the test is createProgram, which **never** calls createProgramFromIL
10:19 karolherbst: mhh, wait, I need to write a vendor file
10:20 karolherbst: mhhh
10:21 pmoreau: I think the issue is that the included OpenCL header files define CL_VERSION_2_1, so the function gets compiled, but NVIDIA does not expose 2.1 which is required for the symbol clCreateProgramWithIL, and all hell breaks loose.
10:22 karolherbst: I compile not against vendor headers, right
10:22 karolherbst: I compile against libclc
10:22 pmoreau: I am not sure which headers it ends up using
10:22 karolherbst: pmoreau: just like in opengl: don't depend on linking time
10:23 karolherbst: mhh "1 is not a valid index: only 1 valid devices found."
10:23 pmoreau: And how would you do it instead?
10:23 karolherbst: check the runtime
10:24 karolherbst: the point in OpenCL is, that you can even use three drivers at once
10:24 pmoreau: karolherbst: Change the default value in utils.hpp, for device_index
10:25 pmoreau: That function is currently **never** called. Checking at runtime won’t change anything to that: I can’t remove that the binary contains clCreateProgramWithIL at runtime, besides making the binary self-modifying.
10:26 RSpliet: karolherbst: mad24 is quite useful for converting 3D work-item dimensions to a 1D array offset. Often you know how many threads those are (and usually they fit well within the 24 bits)
10:27 karolherbst: pmoreau: don't argue with me about it. In OpenGL you do basically the same. You don't check at linking time if something is there, you do that at runtime.
10:27 karolherbst: there is clGetExtensionFunctionAddress
10:27 karolherbst: for example
10:28 pmoreau: Hum, cause clCreateProgramWithIL is not a function pointer, but a real function
10:28 pmoreau: ?
10:29 karolherbst: I think you can still link against certain stuff, but you still need to check at runtime if this function is there for real
10:29 karolherbst: either by checking the version or checking if the func pointer exists
10:30 karolherbst: if you don't know about it at compile time, then you are screwed, because nobody can use it, but if you compile against a 2.1 OpenCL, but the application is run on a 1.2 context, then you could still say: either 1.1 + extension or 2.0+
10:31 karolherbst: it gets even worse, if you run the same stuff on a system, where you have multiple devices, which all expose different openCL versions
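The runtime check karolherbst describes can be sketched independently of any OpenCL binding. The OpenCL specification defines the platform and device version strings as `OpenCL <major>.<minor> <vendor-specific text>`, and `clCreateProgramWithIL` is a core entry point from OpenCL 2.1 onwards. The helper names below are hypothetical, and the sketch only looks at the core version; it does not check for extensions.

```python
import re

def parse_cl_version(version_string):
    """Return (major, minor) from a string such as 'OpenCL 1.2 CUDA 8.0.0'."""
    match = re.match(r"OpenCL (\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("not an OpenCL version string: %r" % version_string)
    return int(match.group(1)), int(match.group(2))

def can_use_create_program_with_il(version_string):
    """True if the reported core version is at least 2.1."""
    return parse_cl_version(version_string) >= (2, 1)
```

With several devices exposing different versions, the same check would have to be applied per device rather than once per process.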
10:31 pmoreau: That’s what I wanted to do, but went the easy way and decided to **only** call the function that is present in OpenCL 1.0: clCreateProgramWithSource
10:32 karolherbst: well right, but you have to assume, that somebody compiles against 2.1, but has only devices which expose 1.0
10:33 pmoreau: I can see it being an issue if the function was called, but it is not.
10:33 pmoreau: Unless it complains for every missing symbol, even if unused?
10:33 karolherbst: well something checks if the version is 2.1 or not
10:33 karolherbst: at runtime
10:33 karolherbst: well, the test passes on nvidia at least
10:35 pmoreau: Why don’t I get that error on this computer. :-/
10:35 karolherbst: do you compile against 2.1 and run on 1.2?
10:36 karolherbst: "brk segment overflow in thread #1: can't grow to 0x4a2c000"
10:36 karolherbst: sigh
10:37 pmoreau: I should, but let me check
10:38 pmoreau: #ifdef CL_VERSION_2_1
10:38 pmoreau: #error "foo"
10:38 pmoreau: I get foo, so I should be compiling against 2.1
10:39 karolherbst: mhhh, odd
10:39 karolherbst: anyhow, valgrind-mmt can't trace it
10:40 karolherbst: https://gist.githubusercontent.com/karolherbst/c31ca310bb96bc4263a8ddeffbb8dd8a/raw/8c19ebf686cbf2682a75c76a5b86b01df3a5abe6/gistfile1.txt
10:40 pmoreau: :-(
10:41 pmoreau: Write a CUDA app using it instead, and disassemble the binary?
10:41 karolherbst: good plan
10:41 karolherbst: or I just compile that cl kernel to PTX
10:42 pmoreau: And then run ptxas on it? Yeah, that should work as well I think
10:42 karolherbst: https://arrayfire.com/generating-ptx-files-from-opencl-code/ ...
10:47 pmoreau: You *might* even be able to generate the PTX using clang.
10:49 karolherbst: mhh
10:52 pmoreau: karolherbst: If you add https://hastebin.com/pipezuyage.cpp (and an “#include <regex>") inside createProgramFromIL (in utils.cpp), right after the “CHECK_CL_CALL( context.getInfo(CL_CONTEXT_DEVICES, &devices); )”. Does that help with the VERSION_2.1 error?
11:02 karolherbst: pmoreau: llc -mcpu=sm_20 test.ll -o test.s
11:03 karolherbst: and the ll you get with something like this: clang -Dcl_clang_storage_class_specifiers -isystem libclc/generic/include -include clc/clc.h -target nvptx64-nvidia-nvcl -xcl test.cl -emit-llvm -S -o test.ll
11:05 karolherbst: meh
11:05 karolherbst: ...
11:05 karolherbst: it calls a _Z5mad24iii function
11:05 karolherbst: super annoying
11:09 karolherbst: pmoreau: did you ever write ptx code yourself?
11:10 karolherbst: ahh, no worries
11:10 karolherbst: I just didn't remove enough from that silly llvm stuff
11:11 karolherbst: :D
11:11 karolherbst: pmoreau: https://gist.github.com/karolherbst/dbec77b5c2f61cc990b63f7a5acbb6d3
11:12 karolherbst: the hell
11:30 RSpliet: ... reuse flags. If that's what I think it is, it's a really really cool optimisation
11:56 pmoreau: karolherbst: I did inline some PTX code in my CUDA kernels, but that’s about it.
11:57 pmoreau: I can guess what the two BFE instructions are doing, but all those XMAD? No clue.
12:00 pmoreau: I guess it is to get some high bits or sign bit right, but it still seems weird.
12:09 pmoreau: karolherbst: Have a look at the linked PPTX presentation in https://devtalk.nvidia.com/default/topic/980740/cuda-programming-and-performance/xmad-meaning/post/5033928/#5033928 They explain how mad24 worked on 1.x, 2.x-3.x and 5.x.
12:18 karolherbst: pmoreau: the first xmad actually does most of the things
12:18 karolherbst: or is the mad where the addition takes place as well
12:18 karolherbst: somehow this is super weird
12:18 karolherbst: and I actually just wanted to use madsp...
12:19 karolherbst: imirkin: do you know more about madsp on maxwell/pascal?
12:24 karolherbst: pmoreau: are you sure that that smad test passes on kepler?
12:24 pmoreau: Certain
12:26 karolherbst: maybe something is broken for compute on maxwell and the parameters aren't passed in?
12:26 karolherbst: although the first one is
12:26 karolherbst: and the result should be 0 kind of
12:27 pmoreau: Well, the other memory_access tests do take multiple arguments, and do pass.
12:28 pmoreau: But there is definitely something wrong going on.
12:29 karolherbst: I will switch the parameters and check what happens then
12:29 pmoreau: Ok
12:29 pmoreau: Or try replacing the mad24 with a regular a + b * c
12:31 karolherbst: uhm, a * b + c, right?
12:31 pmoreau: Well rather (a + b) * c to get the priorities right
12:31 pmoreau: Euh, you’re right
12:32 karolherbst: PASS
12:33 pmoreau: So, not something wrong with how the parameters are passed to the kernel
12:36 karolherbst: yeah
12:36 karolherbst: wondering what is wrong with the instruction
12:36 karolherbst: maybe it means something else here
12:36 karolherbst: maybe not
12:38 pmoreau: They “deprecated” it in some way, so it ends up doing a simple mov of src(0) to dest?
12:38 karolherbst: maybe?
12:40 karolherbst: I turned it into a IMADSP.U24.U24.U32 R2, R3, c[0x0][0xc], R2; now
12:40 karolherbst: and I still get the initial value of r2
12:41 karolherbst: not r3
12:41 pmoreau: Ah, then probably imadsp -> nop
12:41 karolherbst: so I would rather assume it does nothing at all
12:41 pmoreau: yep
12:43 karolherbst: on sm_30 it uses IMAD R4, R4, R0, c[0x0][0x150];
12:44 karolherbst: and starting with sm_50 they use that xmad stuff
12:44 pmoreau: Yup
12:44 pmoreau: Look at the slides I pointed to
12:52 karolherbst: so we need to lower that indeed into that xmad thing
12:53 karolherbst: or well, using our own ops for now anyway
12:54 karolherbst: pmoreau: well, we basically just need to do this, right? d = mad24(a, b, c) ==> d = mad(a & 0x00ffffff, b & 0x00ffffff, c)?
12:55 pmoreau: What if a or b are signed? ;-)
12:55 pmoreau: Hum, maybe that would still work
12:55 karolherbst: ;)
12:55 karolherbst: well, what is bfe doing?
12:56 pmoreau: bitfield extract
12:56 karolherbst: it extracts the first 24 bits starting from pos 0
12:57 karolherbst: aka & 0x00ffffff
12:57 pmoreau: yup
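The masking idea, and pmoreau's question about signed inputs, can be made concrete with a small reference model. This is a sketch of the arithmetic only, not of what any driver emits: values are treated as 32-bit patterns, and the signed variant sign-extends from bit 23 instead of masking. The function names are invented for the example.

```python
MASK24 = 0x00FFFFFF
MASK32 = 0xFFFFFFFF

def umad24(a, b, c):
    """Unsigned mad24: multiply the low 24 bits of a and b, add c, keep 32 bits."""
    return ((a & MASK24) * (b & MASK24) + c) & MASK32

def sext24(x):
    """Interpret the low 24 bits of x as a signed two's-complement value."""
    x &= MASK24
    return x - (1 << 24) if x & (1 << 23) else x

def smad24(a, b, c):
    """Signed mad24, returned as a 32-bit two's-complement bit pattern."""
    return (sext24(a) * sext24(b) + c) & MASK32

# The test kernel from earlier in the log: mad24(3, 4, 5) should give 0x11.
result = umad24(3, 4, 5)
```

For in-range non-negative inputs both variants agree. For a negative operand such as -3 (bit pattern `0xFFFFFFFD`), plain masking gives a different low 32 bits than sign extension, which is why the signed case needs a signed extract rather than an AND.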
12:57 karolherbst: I think I will just lower that and be done with it
12:58 karolherbst: mhh
12:58 karolherbst: lowering was after ssa?
12:59 pmoreau: You could do it after SSA
13:00 karolherbst: why?
13:00 karolherbst: I want to do that post ra
13:00 karolherbst: or....
13:00 karolherbst: mhhh
13:00 karolherbst: actually
13:00 karolherbst: I want to do it prior ssa
13:00 karolherbst: the later we do it, the less opts we can do upon it
13:00 karolherbst: and nobody will write opts based on madsp
13:01 karolherbst: so GM107LoweringPass it is
13:06 karolherbst: the fuck... now it links against the cuda opencl stuff
13:07 pmoreau: :o
13:07 pmoreau: You changed some paths?
13:07 karolherbst: I installed some cuda tools
13:08 karolherbst: sigh
13:08 karolherbst: why does it have to be so much pain
13:09 pmoreau: Agreed :-(
13:09 karolherbst: I removed the ld.so.conf.d entry
13:09 karolherbst: PASS, yay
13:09 karolherbst: you don't want to see the patch though
13:10 karolherbst: ;) https://gist.github.com/karolherbst/f1e14f5bc9a770ed59cfd5d356e39517
13:11 pmoreau: lol :-D
13:11 pmoreau: I thought you would at least add a BFE or AND to restrict to 24-bits
13:11 karolherbst: :D
13:11 karolherbst: fast results don't need precision
13:11 karolherbst: ;)
13:12 karolherbst: well, I can work on that now, right?
13:12 karolherbst: now that we know that the instruction is just broken
13:12 pmoreau: On the other hand, the OpenCL specs says that if the values for a and b are outside the 24-bit domain, it is implementation defined.
13:12 karolherbst: :D
13:12 karolherbst: well
13:12 karolherbst: I can still parse the subop and try to do the right thing
13:13 karolherbst: I am currently just thinking about if I should use the extbf or AND instruction
13:13 karolherbst: but extbf should make more sense in case of s16hi
13:13 pmoreau: Yes; xmad could still be useful in other contexts.
13:13 karolherbst: and then I just do extbf 0x1010
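To illustrate what an operand like `0x1010` would select, here is a small model of a bitfield extract with a packed control word. It assumes the offset sits in bits 0-7 and the width in bits 8-15 of that word; this layout is an assumption made for the sketch and should be checked against the actual ISA or nv50_ir code before relying on it. For `0x1010` both fields are 16, so the value reads the same either way.

```python
def extbf(value, packed, signed=False):
    """Extract a bitfield from a 32-bit value.

    packed: assumed layout, offset in bits 0-7 and width in bits 8-15.
    signed: sign-extend the extracted field if True.
    """
    value &= 0xFFFFFFFF
    offset = packed & 0xFF
    width = (packed >> 8) & 0xFF
    field = (value >> offset) & ((1 << width) - 1)
    if signed and width and field & (1 << (width - 1)):
        field -= 1 << width
    return field

high_half = extbf(0x12345678, 0x1010)  # upper 16 bits of the word
```

Under the same assumed layout, a 24-bit extract starting at bit 0 would use `0x1800`.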
13:13 pmoreau: Right
13:15 karolherbst: well umad24.test is still broken
13:15 karolherbst: but I think this is due to a broken spir-v to nv50ir thing
13:15 karolherbst: because I get madsp (SUBOP:34) s32 %r4 %r1 %r2 %r3 for both cases
13:16 karolherbst: mhh
13:16 karolherbst: 3129d83f vs 5129d83f
13:17 pmoreau: The blob seems to prioritise BFE over AND, even in cases which do not require any shifting. The only AND I can find are applied as modifiers on other ops.
13:18 karolherbst: mhh, bfe shifts though
13:18 karolherbst: ohh
13:18 karolherbst: right
13:18 karolherbst: well, maybe bfe is faster
13:19 karolherbst: what about ands like 0xff00ff00?
13:19 pmoreau: Can you dump the SPIR-V in both cases please? (using CLOVER_DEBUG=spirv CLOVER_DEBUG_FILE=somefilename)
13:19 pmoreau: I don’t have such weird things in my CUDA kernels :-D
13:19 pmoreau: (At least I don’t think so...)
13:20 karolherbst: pmoreau: https://gist.githubusercontent.com/karolherbst/f7c3bea50a147243d948085c82e4730c/raw/802ea4ac26a93879992791c589608d5b5d0d019c/gistfile1.txt
13:20 karolherbst: uhhh
13:21 karolherbst: is the spirv wrong?
13:21 pmoreau: Mmh, I should have some of those weird AND. Not sure what they became
13:22 pmoreau: Yeah, it should be calling s_mad24 and u_mad24, not twice s_mad24
13:23 pmoreau: Weird, I don’t remember having that issue
13:42 rhyskidd: Lyude: ^^ great work on those preliminary clockgating stats
13:43 karolherbst: Lyude: well, now you know: people are only thankful, if there are benchmarks as well :p
13:45 karolherbst: pmoreau: the painful part is now parsing that damn subop bitmask
13:50 RSpliet: On kepler, NVIDIA definitely doesn't do bit extracts for mad24
13:51 karolherbst: right, we know
13:51 karolherbst: but it was mainly a question whether they use imadsp or not
13:51 karolherbst: on kepler they use imadsp
13:52 karolherbst: on maxwell they use xmad
13:53 karolherbst: and by the way, that madsp subop thing is terrible
13:55 RSpliet: Ah yes. the quick references I have (traces of OpenCL code) are all stored as decoded by envydis in my archives, so I'm afraid some subtleties have gone lost in translation.
14:01 imirkin_: before you guys get too far with something i'm going to reject...
14:01 imirkin_: someone should state clearly what is trying to be attempted, and what the proposed solution is
14:02 karolherbst: imirkin_: lower madsp to extbf + mad
14:02 karolherbst: on maxwell+
14:02 karolherbst: and we want to implement madsp on maxwell+
14:02 imirkin_: see, now i know you're crazy
14:02 imirkin_: :)
14:03 imirkin_: "i want to implement madsp" is never a thing anyone wants
14:03 imirkin_: it could be a means, but not an end
14:03 karolherbst: ;)
14:03 karolherbst: well, we need mad24 for opencl
14:03 imirkin_: what's mad24?
14:03 karolherbst: madsp with u24 u24 u32
14:03 karolherbst: or s24 s24 s32
14:03 karolherbst: or whatever
14:04 karolherbst: but that's basically it
14:04 karolherbst: basically this:
14:04 karolherbst: mad24(a, b, c) = mad(a & 0x00ffffff, b & 0x00ffffff, c)?
14:04 karolherbst: without the ?
14:04 imirkin_: ok
14:05 imirkin_: well, e.g. the nv50 mul has a 24-bit version of itself
14:05 imirkin_: in addition to the 16-bit one which we predominantly use
14:05 imirkin_: do we have a TYPE_U24?
14:05 imirkin_: (and S24)
14:05 karolherbst: I doubt that
14:05 karolherbst: no, we don't
14:05 imirkin_: looks like no...
14:05 karolherbst: to be honest
14:05 imirkin_: so the question is
14:06 karolherbst: I would just remove that madsp subop thing and put that into types
14:06 imirkin_: should this be properly represented and be subject to usual math operations
14:06 imirkin_: or should it be hacked like the MADSP op
14:06 karolherbst: well, here is what I base that on: nobody understands this madsp subop thing
14:06 imirkin_: now, the thing about MADSP is that it's just a convenience hack for some of the image-related lowering.
14:06 karolherbst: and it wasn't made to be understood
14:06 karolherbst: and that's clearly reason enough to change it to something else
14:06 imirkin_: it's not like a "proper" instruction
14:06 karolherbst: uhm
14:07 karolherbst: nvidia uses imadsp on kepler for mad24
14:07 imirkin_: right.
14:07 imirkin_: perhaps you didn't understand what i meant
14:07 imirkin_: it's not like a proper instruction in nv50_ir
14:07 imirkin_: it's a hack instruction
14:07 karolherbst: right
14:07 imirkin_: because it doesn't map to a clean concept
14:07 imirkin_: it *does* map directly to a nve4 op, but that doesn't make it right
14:07 karolherbst: right
14:08 imirkin_: but since it was only ever used in nve4 image lowering
14:08 imirkin_: that seemed a lot more acceptable
14:08 karolherbst: well, what I do first is to rev engineer the meaning of that op based on what the emitter is doing
14:08 karolherbst: and check if all usages make sense
14:08 RSpliet: imirkin_: if I want to do a 32*32 multiplication, I basically perform four 16*16 muls. Is there a reason to assume that 24*24 leads to me using only 3 of those? I'm assuming this is the optimisation NVIDIA is aiming for, but I'm too frazzled at the moment to work out why...
14:08 karolherbst: because this is clearly something we want to improve
14:08 karolherbst: totally unrelated to anything else
14:09 imirkin_: RSpliet: on nv50, it doesn't help. on nvc0, there's 32-bit mul.
14:09 imirkin_: RSpliet: but if your values are < 24 bit, then you can just do your 1 mul and move on
14:09 imirkin_: RSpliet: and the reason it exists is that float mantissas are 24 bit
14:10 imirkin_: so they already have the hw to do it
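RSpliet's question about building wider multiplies out of 16*16 products can be worked through as plain arithmetic. The sketch below shows that a full 64-bit product of two 32-bit values needs four 16-bit partial products, while the low 32 bits need only three, because the high*high term is shifted out entirely. It says nothing about how any particular GPU schedules these; the function names are invented for the example.

```python
def mul32_full_from_16(a, b):
    """Full 64-bit product of two 32-bit values from four 16x16 partial products."""
    a_lo, a_hi = a & 0xFFFF, (a >> 16) & 0xFFFF
    b_lo, b_hi = b & 0xFFFF, (b >> 16) & 0xFFFF
    return (a_lo * b_lo
            + ((a_lo * b_hi + a_hi * b_lo) << 16)
            + ((a_hi * b_hi) << 32))

def mul32_lo_from_16(a, b):
    """Low 32 bits of the product from three 16x16 partial products."""
    a_lo, a_hi = a & 0xFFFF, (a >> 16) & 0xFFFF
    b_lo, b_hi = b & 0xFFFF, (b >> 16) & 0xFFFF
    # a_hi * b_hi would be shifted left by 32 and never reaches the low word.
    cross = (a_lo * b_hi + a_hi * b_lo) & 0xFFFF
    return (a_lo * b_lo + (cross << 16)) & 0xFFFFFFFF
```

When both operands fit in 24 bits, a multiplier sized for the 24-bit significand of a single-precision float can produce the product in one step, which is the point imirkin_ makes above.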
14:11 imirkin_: karolherbst: so ... it seems like having a MUL24 and MAD24 op might make sense.
14:11 karolherbst: it would make things much easier, somehow
14:11 karolherbst: right
14:11 imirkin_: that way we don't mess up our types
14:12 imirkin_: (which wouldn't be real anyways)
14:12 karolherbst: right
14:12 RSpliet: Ah, so on kepler they can be dual-issued (pipelined) with a 32-bit integer mul for virtually free...
14:13 karolherbst: imirkin_: just saw that we never had those as inputs anyway
14:13 karolherbst: wondering how clover does it then...
14:13 karolherbst: or we never hooked it up
14:13 imirkin_: RSpliet: and remember that nv50 doesn't have 32-bit imul, only 16 and 24
14:13 karolherbst: where is the list with all OPs tgsi knows?
14:14 imirkin_: doesn't really exist
14:14 imirkin_: and stuff can get lowered too
14:14 imirkin_: (or optimized)
14:14 karolherbst: not even as an enum in a header file?
14:14 imirkin_: er wait
14:14 imirkin_: perhaps i don't understand what you mean
14:14 karolherbst: I just want to know what instructions tgsi knows
14:15 imirkin_: like the list of tgsi opcodes?
14:15 imirkin_: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/include/pipe/p_shader_tokens.h#n336
14:15 RSpliet: karolherbst: I suspect that in OpenCL behaviour is undefined if the top 8 bits are not a sign extension (perhaps worth double-checking), so they might just blindly issue a 32-bit mad?
14:15 karolherbst: pmoreau: please call the file nv50_ir_from_spirv.cpp
14:16 karolherbst: RSpliet: well, nvidia doesn't do mul32 here
14:16 karolherbst: or mad 32
14:16 karolherbst: where did you get this idea they would?
14:16 RSpliet: Oh I thought you asked for clover
14:17 karolherbst: well, not really, well I am interested what clover does, but in the end we do spir-v to nv50ir stuff here
14:17 karolherbst: and spir-v knows mad24
14:18 pmoreau: RSpliet: Have a look at http://arith23.gforge.inria.fr/slides/Emmart_Luitjens_Weems_Woolley.pptx for the “history” of mad24
14:19 karolherbst: and if tgsi knows mad24 as well, then it makes total sense to have an OP_MUL24 and OP_MAD24 as well
14:19 imirkin_: tgsi does not.
14:19 imirkin_: but nv50 ir isn't tgsi
14:19 karolherbst: okay, then now I am wondering how they do mad24 at all
14:19 karolherbst: I guess they lower it?
14:19 imirkin_: tgsi?
14:19 karolherbst: yeah
14:19 imirkin_: it never comes up
14:19 karolherbst: ohh right
14:19 karolherbst: clover does llvm
14:20 karolherbst: mhhh
14:20 pmoreau: clover can do tgsi as well, just there is no consumer for it.
14:20 RSpliet: pmoreau: slide 7 explains why the "reuse" flag is even more important :-D
14:20 karolherbst: imirkin_: so we would end up adding OP_MAD24, which no hw can do and lower it on every target?
14:20 imirkin_: lower it?
14:20 imirkin_: who can't do OP_MAD24?
14:21 karolherbst: I am sure no chipset can do mad24
14:21 karolherbst: not directly
14:21 karolherbst: kepler can do madsp
14:21 imirkin_: which is different from mad24... how?
14:21 karolherbst: but maxwell can't do it at all
14:21 karolherbst: well
14:21 RSpliet: pmoreau: thanks, I'll forward that home
14:21 karolherbst: madsp has some bits to configure the size of the operands
14:21 imirkin_: uh huh
14:21 pmoreau: karolherbst: NV50IR is not specific to NV50, and we tend to call it NVIR more and more. But for consistency, either we rename everything to nv_ir or leave it as-is.
14:21 pmoreau: bbiab
14:21 karolherbst: pmoreau: yes
14:21 imirkin_: nv50 ir op doesn't have to map 1:1 with gpu ops
14:22 imirkin_: and yes, i want to rename nv50_ir to nvir
14:22 karolherbst: I know, but we could either lower mad24 to madsp and set the subop or we emit it
14:22 imirkin_: lower implies a transformation
14:22 imirkin_: mad24 can be emitted directly
14:22 karolherbst: okay
14:22 imirkin_: and what's the issue with SM50+?
14:22 karolherbst: madsp doesn't work
14:23 karolherbst: it is basically a nop
14:23 imirkin_: ok
14:23 imirkin_: and XMAD doesn't have a 24-bit mode?
14:23 karolherbst: nvidia does bfe and xmads
14:23 karolherbst: :D
14:23 karolherbst: no
14:23 imirkin_: ok, so *that* would imply lowering then.
14:23 imirkin_: at which point
14:23 karolherbst: this is mad24 on nvidia: https://gist.githubusercontent.com/karolherbst/dbec77b5c2f61cc990b63f7a5acbb6d3/raw/aeac9b253f2bf77a00488d5bc1a922e60f793f78/gistfile1.txt
14:23 Sprow_: Not sure this is the best place to ask - but does nouveau assume an x86 host (to run the BIOS ROM)? Or could it say run on PowerPC or ARM Linux
14:23 imirkin_: the whole MAD24/MUL24 stuff becomes less interesting.
14:24 imirkin_: Sprow_: it could, but often doesn't
14:24 RSpliet: Sprow_: should work on ARM, mixed results on Power
14:24 imirkin_: karolherbst: what about fermi?
14:24 karolherbst: I couldn't check anything below SM30
14:25 karolherbst: SM30-SM37 is imadsp, SM50+ is this BFE+XMAD stuff
14:25 karolherbst: I have the ptx if you wanna try?
14:26 imirkin_: no
14:26 karolherbst: I didn't check what they do for mad24.hi though
14:26 karolherbst: only mad24.lo
14:26 karolherbst: where we might want to support the .hi one in the future as well
14:26 karolherbst: maybe
14:27 imirkin_: anyways ... if this is only for nve4 in the first place
14:27 karolherbst: but currently I only care about opencls mul24 and mad24
14:27 imirkin_: i'd avoid touching anything
14:27 karolherbst: right
14:27 imirkin_: and just emit it as-is
14:27 Sprow_: RSpliet: So the BIOS scripts are largely ignored (or replaced), or is a 8086 emulator employed?
14:27 imirkin_: i.e. have a target check for the MADSP op
14:27 imirkin_: if it's supported, great. if not, emit BFE + MAD
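The target check imirkin_ proposes can be sketched as a small selection function over a toy instruction list. Everything here is hypothetical: the tuple-based instruction format, the temporaries `t0`/`t1`, and the function name are invented for illustration and do not correspond to real nv50_ir data structures. Only the op names (madsp, extbf, mad) come from the discussion.

```python
def select_mad24(target_has_madsp, dst, a, b, c):
    """Return a list of toy instructions implementing an unsigned mad24.

    target_has_madsp: True where the hardware op works (e.g. the Kepler case
    discussed above), False where it has to be replaced.
    """
    if target_has_madsp:
        # One instruction; the 24-bit operand sizes are encoded in the subop.
        return [("madsp.u24.u24.u32", dst, a, b, c)]
    # Fallback: extract the low 24 bits of each factor, then a regular mad.
    field = {"offset": 0, "width": 24}
    return [
        ("extbf", "t0", a, field),
        ("extbf", "t1", b, field),
        ("mad", dst, "t0", "t1", c),
    ]
```

Keeping the decision in one place like this means the rest of the compiler never has to know whether the target has a working madsp.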
14:27 karolherbst: imirkin_: we don't get madsp as input
14:27 karolherbst: well
14:27 karolherbst: not now
14:27 imirkin_: Sprow_: bios scripts aren't in x86. they have to be interpreted anyways.
14:27 karolherbst: imirkin_: my idea was to lower it pre ssa
14:27 imirkin_: Sprow_: the bios rom includes an interpreter written in x86
14:28 imirkin_: karolherbst: my point is ... it doesn't have to exist in the first place.
14:28 RSpliet: Sprow_: nouveau has such an interpreter built in, and there's a kernel param to force it to run your init script
14:28 karolherbst: imirkin_: right
14:28 karolherbst: imirkin_: but then we need to clean up that subop thing, because this part is terrible anyway
14:28 RSpliet: Which may or may not be necessary, and mileage may vary
14:28 imirkin_: karolherbst: meh
14:28 karolherbst: seriously
14:28 imirkin_: karolherbst: it's just for nve4, so wtvr
14:28 imirkin_: it'll always be a fixed subop
14:28 karolherbst: uhm... I didn't mean replacing it with a non subop thing
14:29 karolherbst: but to make that subop at least understandable
14:29 karolherbst: because currently it isn't
14:29 karolherbst: and nobody is able to be 100% sure if that subop is used correctly or not
14:29 Sprow_: imirkin_: Thanks for the clarification
14:29 imirkin_: karolherbst: ok
14:29 imirkin_: karolherbst: some additional macros would be fine by me
14:29 karolherbst: and I am sure we even use it wrongly in some places
14:29 karolherbst: yeah
14:29 karolherbst: that's what I meant
14:30 karolherbst: or maybe even "fix" it in some way, because not all combinations can be used anyway
14:30 Sprow_: RSpliet: Clever to have a script interpreter built in. I guess I'll grab a test card and plug it in and see what happens! Thanks
14:30 karolherbst: maybe we have fixed combinations in the end?
14:30 karolherbst: not quite sure yet
14:30 karolherbst: I want to RE the nve4 ISA and see what combinations can be used and so on
14:31 karolherbst: and just change it, so that it is less painful
14:36 imirkin_: karolherbst: pretty sure all of them. hence the mask.
14:36 karolherbst: I wouldn't be so sure
14:36 karolherbst: I don't think it can be configured in such a way in hw
14:36 karolherbst: I think they have a value for each combination
14:36 karolherbst: and maybe they cover all or none
14:37 karolherbst: but it isn't in a way like (this src is this)
14:37 karolherbst: but more like: 0x5 means this for all
14:37 marmistrz: pmoreau, I've been using your dual-driver Xorg.conf for a longer time and I have noticed one problem - I can't start an X server manually
14:37 marmistrz: I get a permission denied, like in these logs:
14:37 karolherbst: imirkin_: I am currently creating the list and you will see what I mean
14:38 marmistrz: https://zerobin.net/?b4a6e79d0553a575#NqplKg2CPwdBqOPyKnqIGY6bRAkX4GOhpNzztMYTqC4=
14:38 marmistrz: Have you ever encountered something like this? It's weird that I get no error without any custom xorg.conf
14:39 imirkin_: marmistrz: [ 63.459] (EE) modeset(0): glamor initialization failed
14:39 imirkin_: not a good start.
14:39 karolherbst: well, there are some patterns to it, just a question if everything is covered in some way
14:44 karolherbst: imirkin_: so this seems to be all: https://gist.githubusercontent.com/karolherbst/72baf5ac8ded0f27fc480006d988430c/raw/abc50e74300e953c251113b72fb41377da670622/gistfile1.txt
14:44 imirkin_: there's more.
14:44 imirkin_: check the macros in nv50 ir
14:44 karolherbst: I am sure there are no more
14:44 karolherbst: well, nvdisasm didn't show more
14:44 imirkin_: that's the V1 ones
14:44 imirkin_: there's V2/3/4
14:45 imirkin_: also, where there's a U16H0 i'm surprised there's no U16H1
14:45 karolherbst: last block
14:45 imirkin_: for second arg
14:45 karolherbst: I will try to find more, but mhh
14:48 imirkin_: karolherbst: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/codegen/nv50_ir.h#n241
14:48 imirkin_: oh wait, shit. that V1 stuff is for something else.
14:48 karolherbst: yeah I am aware of that, I am just not 100% if I can trust what is currently there within mesa
14:49 imirkin_: yeah, that V1/etc stuff is for the vector ops
14:49 imirkin_: (like VSHL)
14:49 karolherbst: tabimadp3 in envydis seems wrong as well
14:50 karolherbst: especially because there are three s32
14:51 karolherbst: there is this higher bit though
14:52 karolherbst: but mhhh
14:52 karolherbst: at least nvdisasm doesn't react to it
14:52 imirkin_: also worth noting that nvdisasm isn't perfect
14:52 karolherbst: right
14:53 imirkin_: but it's PRETTY good
14:53 karolherbst: well, I could check with some emitted code
14:53 karolherbst: but not on this machine
14:53 karolherbst: so for all I know, the table I wrote is what nvdisasm tells me
14:53 karolherbst: and envydis agrees, that the middle type has no u16h1
14:54 imirkin_: k
14:54 karolherbst: just both 24 and 16h0
14:54 karolherbst: I think h0 is low
14:55 imirkin_: it is.
14:55 karolherbst: just wondering what the result is
14:55 karolherbst: always s32?
14:56 karolherbst: or does it depend on the third source?
14:56 imirkin_: good q.
14:56 karolherbst: ptx docs might answer it
14:56 karolherbst: wait
14:56 karolherbst: they don't
14:56 karolherbst: because imadsp isn't documented there
14:56 marmistrz: Hmmm... missed that. Any idea why this small config file causes that? https://zerobin.net/?91d809859e3c7bec#b5Hlnz4a9N152SWKtqN/0bexpn3XkmPoIbyRXEG/JHg=
14:56 imirkin_: ultimately it doesn't matter
14:56 karolherbst: right
14:57 imirkin_: because the destination is 32 bits of data
14:57 imirkin_: there's no signed/unsigned/etc
14:57 marmistrz: My normal, LightDM-started session works perfectly
14:57 karolherbst: true
14:57 imirkin_: marmistrz: that's almost definitely not what you want
14:57 imirkin_: marmistrz: what are you trying to do?
14:57 karolherbst: imirkin_: well, I will check if I can reduce the stuff in mesa to the things I just found out and maybe simplify it in a way that makes it easier to use and easier to understand
14:59 imirkin_: marmistrz: https://nouveau.freedesktop.org/wiki/Optimus/
14:59 marmistrz: imirkin_, I want to switch nouveau->nvidia in runtime
14:59 imirkin_: ok
14:59 imirkin_: then you don't want the nvidia gpu in there at all
14:59 marmistrz: and I want to make use of DRI_PRIME when I don't need the nvidia card
15:00 imirkin_: switching between nouveau and nvidia at runtime means that intel has to be driving your screens
15:03 karolherbst: :O
15:03 karolherbst: imirkin_: the sign of the third source depends on the other two
15:04 karolherbst: if there is one signed value, the third source is automatically signed
15:04 imirkin_: right, coz there's no U32 vs S32 addition
15:04 imirkin_: there's just ... addition
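To illustrate the point about there being only one kind of 32-bit addition, here is a minimal Python sketch. The helper names `add32` and `as_signed32` are made up for illustration; the only thing it relies on is the usual two's-complement property that the result bits do not depend on whether the inputs are read as signed or unsigned.

```python
MASK32 = 0xFFFFFFFF

def add32(a, b):
    """Add two values and keep only the low 32 bits, as a 32-bit adder would."""
    return (a + b) & MASK32

def as_signed32(x):
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

# The same two bit patterns, read once as unsigned and once as signed.
a_bits, b_bits = 0xFFFFFFFF, 0x00000005

unsigned_sum = add32(a_bits, b_bits)                          # 4294967295 + 5
signed_sum = add32(as_signed32(a_bits), as_signed32(b_bits))  # -1 + 5
```

Both sums end up with the same 32-bit pattern, which is why a signedness flag on the 32-bit addend would carry no information.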
15:05 karolherbst: imirkin_: that means it doesn't make sense to have three parameters to that subop macro
15:05 karolherbst: because the third one is always 32 bit
15:06 karolherbst: and its signedness always depends on the other two values
15:06 karolherbst: that also means, there is no IMADSP.S32.S24.U32 for example
15:07 karolherbst: imirkin_: so would you mind if I just kill the third arg?
15:08 karolherbst: there seems to be a way to change the amount of bits though...
15:08 karolherbst: let me figure that out first
15:08 karolherbst: might be a bug inside nvdisasm though
15:10 karolherbst: mhhh
15:10 karolherbst: I will just assume it is correct for now
15:10 karolherbst: I mean what is hinted at in envydis
15:10 karolherbst: but I'll remove the signedness logic from the third one
15:11 karolherbst: imirkin_: are there cases where they added stuff in hw, but never used it in sw?
15:12 karolherbst: https://gist.githubusercontent.com/karolherbst/72baf5ac8ded0f27fc480006d988430c/raw/8d80e662dfbcd3c43388d5464da8baabf85ea32f/gistfile1.txt
15:24 karolherbst: imirkin_: how do you like something like this? https://gist.github.com/karolherbst/b48e8aae09d25938d328db9ea4978e0e
15:25 karolherbst: mhh, wait I have an idea
15:25 karolherbst: much better now
15:32 karolherbst: but mhh, maybe this can be even better
15:33 pmoreau: marmistrz: Hum, I might have had a dual-driver Xorg.conf, but I’m fairly sure I would have gotten it from karolherbst. I haven’t played much with that kind of scenario myself. :-/
15:34 imirkin_: karolherbst: fine by me
15:37 karolherbst: imirkin_: or something more explicit like this? https://gist.github.com/karolherbst/c8af7c6aab8d685c9b0230d5ca39ba6f
15:38 imirkin_: worksforme
15:38 imirkin_: (assuming it's correct)
15:38 karolherbst: well, it doesn't matter right? the emitter can still change the values if it wants to
15:39 karolherbst: I like the last idea, because then the emitter could do stuff like this: (insn->subOp & 0xf00) == NV50_IR_SUBOP_MADSP_SRC2_32 or so
15:39 karolherbst: but those values also map directly to hardware afaik
15:39 karolherbst: just need shifting
15:39 imirkin_: yeah
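The idea of packing one type field per source into the subop and testing a single field with a mask can be sketched as follows. This is only an illustration in Python: the src2 field position follows the 0xf00 mask mentioned above, but the src0/src1 positions, all constant names, and all numeric type codes are hypothetical and are not taken from Mesa or from the hardware encoding.

```python
# Hypothetical layout: one 4-bit field per source, src2 in bits 8-11.
SRC0_SHIFT, SRC1_SHIFT, SRC2_SHIFT = 0, 4, 8
SRC2_MASK = 0xF << SRC2_SHIFT          # 0xf00

# Hypothetical type codes for src0/src1 (made up for this sketch).
U32, S32, U24, S24 = 0x0, 0x1, 0x2, 0x3

# Hypothetical src2 codes, already shifted into place.
SRC2_32 = 0x0 << SRC2_SHIFT            # 0x000
SRC2_24 = 0x4 << SRC2_SHIFT            # 0x400
SRC2_16L = 0x8 << SRC2_SHIFT           # 0x800

def madsp_subop(src0, src1, src2_shifted):
    """Combine the three per-source fields into one subop value."""
    return (src0 << SRC0_SHIFT) | (src1 << SRC1_SHIFT) | src2_shifted

def src2_is(subop, src2_shifted):
    """Test only the src2 field, ignoring the src0/src1 bits."""
    return (subop & SRC2_MASK) == src2_shifted

subop = madsp_subop(U24, U24, SRC2_32)
```

Keeping the mask and compare in one helper also keeps the parentheses around the mask in one place, which matters in C and C++ where `==` binds more tightly than `&`.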
15:40 karolherbst: nvdisasm only tells me, that there is a 32 bit type for src2, so maybe I should leave the others out for now as long as we don't use it?
15:41 karolherbst: mhh, we use 8 as the value for src2 though
15:41 karolherbst: and 0
15:42 karolherbst: and 8 maps to NV50_IR_SUBOP_MADSP_SRC2_16L, just with different shifting
16:28 karolherbst: imirkin_: meh... they call those things VSET2 in SM35?
16:31 karolherbst: something is odd here
16:38 karolherbst: ahh no, the encoding is just odd
16:41 imirkin_: VSET2 is a vector op
16:41 imirkin_: (or video, depending who you ask)
16:54 karolherbst: I got confused with how the opcodes are generated on SM35
16:54 karolherbst: anyhow, here is how it looks like now
16:54 karolherbst: https://github.com/karolherbst/mesa/commit/f68641c228443aedf936ddfbd77dd0071842acad
16:54 karolherbst: pmoreau: ^
16:55 imirkin_: uhm
16:56 imirkin_: that seems off. or the previous emitter was off.
16:56 karolherbst: no, this is correct
16:56 karolherbst: I checked with nvdisasm
16:56 imirkin_: k
16:56 karolherbst: I changed the values of SRC2
16:56 karolherbst: previous 8 meant 16L for src2
16:56 karolherbst: now it is 2
16:57 karolherbst: I could also set SRC2_24 to 0x400 and SRC2_16L to 0x800
16:57 pmoreau: karolherbst: The SPIR-V change is wrong: you are doing Uxx both for sMad24 and uMad24, but otherwise looks good. Gonna give it a try tonight.
16:57 karolherbst: pmoreau: it does the same thing it did before :p
16:58 karolherbst: imirkin_: that comes from the assumption, that src2 could take a sign bit and this made all that shifting super complicated. Those previous shifts could even override the u32 of src0 to be a s32, although src0 was 0 aka u32
16:59 imirkin_: k
16:59 pmoreau: Ah, yes, the values were fixed as well. So is U24 really different from S24 in that case? Maybe I need to do some more testing.
16:59 karolherbst: well, maybe?
16:59 imirkin_: yes, S24 and U24 are different. but S32 and U32 are the same.
16:59 karolherbst: ohh, okay
16:59 karolherbst: well, I put there what nvdisasm told me about
16:59 imirkin_: at least ... i think
17:00 karolherbst: and I figured out, why the src2 didn't work: I put 64 bit hex values into that perl script, which didn't work out so well
17:00 karolherbst: imirkin_: it shouldn't
17:00 karolherbst: because you just mask it with 0x00ffffff anyhow
17:00 pmoreau: Doesn’t the sign come from the instruction enum type? (I haven’t checked the code)
17:00 karolherbst: maybe the result changes? who knows
17:01 karolherbst: pmoreau: there is dType and sType. For madsp we basically ignore sType
17:01 karolherbst: like completely
17:02 pmoreau: When you do something like `mkOp3(OP_MADSP, TYPE_U32, res, op1, op2, op3)`, TYPE_U32 is the dType or sType (or both)?
17:02 pmoreau:is at work and doesn’t have a copy of Mesa at hand
17:02 karolherbst: dType
17:02 imirkin_: dtype == stype most of the time
17:02 imirkin_: you have to very explicitly set the stype to something else for e.g. cvt
17:03 imirkin_: normally you only care about one
17:03 karolherbst: okay, right
17:03 pmoreau: Right, makes sense
17:04 karolherbst: anyway, I think this makes the madsp subop mess easier to use :) so now write that gm107 lowering for madsp
17:04 karolherbst: I think I will adjust those masks so that unrelated bits won't get overwritten
17:05 pmoreau: I was wondering if the macros could be simplified by only having some I32/I24 and such, with the unsigned vs signed coming from the dType instead.
17:05 pmoreau: Yes! Thanks for the work Karol!! I’ll give all of that a try tonight.
17:06 karolherbst: pmoreau: I thought about this as well, but then I ended up with 5 arguments
17:06 karolherbst: because the third one doesn't know any signedness
17:06 karolherbst: well it is implicitly there
17:06 karolherbst: and doesn't matter
17:07 karolherbst: well my goal was as long as you use the NV50_IR_SUBOP_MADSP_TUPLE macro, you can't do anything wrong
17:07 karolherbst: the compiler won't let you :p
17:07 pmoreau: :-)
17:08 karolherbst: well, unless you add new defines, but then you basically knew it was wrong
17:12 imirkin_: no - just don't generate the MADSP op
17:16 karolherbst: imirkin_: mhh, so we should check if the target supports it while converting to nvir?
17:16 imirkin_: yes.
17:16 karolherbst: mhhh
17:17 karolherbst: let us assume we have three different converters like this, then either all of them implement that logic or we have a helper function somewhere which does it based on a madsp subop
17:18 imirkin_: note that it'd be nice to figure out how to handle the nv50 case as well
17:18 imirkin_: where 24-bit mul/mad is supported
17:18 karolherbst: right
17:19 karolherbst: but I doubt that anybody cares about compute on those
17:22 karolherbst: pmoreau: you are aware that this spir-v to nvir will be painful to review, right? :p
17:22 pmoreau: As it currently is? Yes, I am painfully aware of that. :-D
17:22 karolherbst: :D
17:24 pmoreau: I want to clean it up first, and submit only a skeleton: just the overall structure + the memory management. And then, do separate pull requests to add missing pieces.
17:25 karolherbst: sounds good
17:25 karolherbst: the opencl parts may be a good first step
17:25 karolherbst: and then the nouveau stuff
17:26 karolherbst: even if nobody is able to generate those spir-v kernels :D
17:26 karolherbst: or maybe some are?
17:26 karolherbst: doesn't matter though
17:26 pmoreau: imirkin_: There are two hardware instructions for mad: mad24.lo and mad24.hi. I guess there are two for mul24 as well
17:26 karolherbst: pmoreau: on tesla?
17:26 pmoreau: Yes
17:26 karolherbst: insane
17:26 karolherbst: :p
17:27 pmoreau: karolherbst: Oh, yes! I need to resend the clover pull request updated, but that was the plan to have it first and separate from Nouveau.
17:27 karolherbst: yeah, sounds good
17:28 pmoreau: For mul24/mad24: it’s still from that (partly NVIDIA) presentation I linked to earlier: http://arith23.gforge.inria.fr/slides/Emmart_Luitjens_Weems_Woolley.pptx
17:28 karolherbst: bld.mkMAD24()... sounds like a good idea, then there could be the check what the target supports, either MAD24 or MADSP or nothing at all
17:29 karolherbst: right
17:29 imirkin_: pmoreau: huh? on tesla??
17:29 pmoreau: SM1.x should be Tesla, right?
17:29 imirkin_: pmoreau: there's no mul.hi on tesla... only on fermi+
17:30 pmoreau: maybe they only had mad24.lo and mad24.hi then
17:31 imirkin_: i just mean hi/lo
17:31 imirkin_: we jump through a LOT of hoops to get 32-bit mul to work coz of a lack of that
17:32 karolherbst: oh crap, I just saw the code
17:32 karolherbst: I am sure it was fun to write that code
17:32 imirkin_: it was especially fun to write the IMUL_HI emulation
17:32 imirkin_: where we want the *upper* 32 bits
17:32 imirkin_: which is useful for faster division
17:33 karolherbst: :D
17:33 imirkin_: after wondering "why does division by 255 not work", which at the time broke dolphin
17:33 pmoreau: Hum, well looks like they have it for mad24, if we can trust a presentation with people from NVIDIA.
17:33 imirkin_: i think they've since stopped doing integer division
17:33 karolherbst: :D
17:33 imirkin_: https://github.com/envytools/envytools/blob/master/envydis/g80.c#L717
17:34 imirkin_: mwk RE'd the whole G80 ALU
17:34 imirkin_: it's not there.
17:34 imirkin_: there might be a convenience thing in PTX, but it's not there in the hw.
17:35 imirkin_: oh wait
17:35 imirkin_: WTF?
17:35 pmoreau: They do say “hardware instruction” but maybe it’s an oversimplification
17:35 imirkin_: there is a mul24.hi
17:35 imirkin_: https://github.com/envytools/envytools/blob/master/hwtest/g80_int.cc#L206
17:35 karolherbst: ;)
17:36 imirkin_: but there isn't for mul16
17:37 karolherbst: should be possible to use the mul24.hi one, no?
17:38 mwk: imirkin_: that's because mul16 gives you a full 32-bit result
17:38 mwk: you don't need a hi version
17:39 mwk: but mul24 computes a 48-bit product, and you can select bits 0-31 or 16-47
17:40 imirkin_: right
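mwk's description of why mul16 needs no hi variant while mul24 does can be sketched in Python. This is a model of the arithmetic only, for unsigned operands; the function names are made up and nothing here is taken from the actual hardware encoding.

```python
MASK16 = 0xFFFF
MASK24 = 0xFFFFFF
MASK32 = 0xFFFFFFFF

def mul16(a, b):
    """16x16 multiply: the full product always fits in 32 bits."""
    return (a & MASK16) * (b & MASK16)

def mul24_lo(a, b):
    """24x24 multiply, bits 0-31 of the 48-bit product."""
    return ((a & MASK24) * (b & MASK24)) & MASK32

def mul24_hi(a, b):
    """24x24 multiply, bits 16-47 of the 48-bit product."""
    return (((a & MASK24) * (b & MASK24)) >> 16) & MASK32

# Largest unsigned inputs for each width.
full16 = mul16(0xFFFF, 0xFFFF)          # 0xFFFE0001, still 32 bits
lo24 = mul24_lo(0xFFFFFF, 0xFFFFFF)     # low window of 0xFFFFFE000001
hi24 = mul24_hi(0xFFFFFF, 0xFFFFFF)     # high window of 0xFFFFFE000001
```

The two 24-bit windows overlap in bits 16-31, so together they cover all 48 bits of the product.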
17:40 imirkin_: what would actually be useful is a mul16.hi32
17:41 imirkin_: er, that makes no sense.
17:41 imirkin_: the 64-bit mul is just annoying with 16/24-bit muls
17:41 imirkin_: i wonder if the high 32 bits would be easier to compute with the mul24.hi + a 16-bit mul
17:41 mwk: no, you need a full carry chain :(
17:45 karolherbst: pmoreau: how do I access the buildUtil thing in your spirv to nvir stuff?
17:46 karolherbst: ohh, it inherits stuff
17:46 karolherbst: weird
17:46 pmoreau: Yes, similar to the nv50_ir_from_tgsi.cpp code
17:46 pmoreau: *similarly
17:47 karolherbst: ohh wait, it works
17:47 karolherbst: I just forgot the arguments
17:47 pmoreau: Who needs arguments :-D
17:55 karolherbst: seriously :D
18:01 karolherbst: imirkin_: is "extbf s32 %r1 %r1 6144" equivalent to "%r1 & 0x00ffffff"?
18:01 karolherbst: I thought it would be, but it doesn't seem to be
18:01 imirkin_: what's 6144?
18:01 karolherbst: 0x1800
18:01 imirkin_: ah, 0x1800
18:01 imirkin_: and 0x18 is ... 24
18:01 imirkin_: so yes. it is equivalent.
18:01 karolherbst: or do we use a different encoding than nvidia?
18:01 karolherbst: weird
18:01 imirkin_: no, we use the same encoding.
18:02 imirkin_: oh wait
18:02 imirkin_: it is not equivalent
18:02 imirkin_: but it's similar.
18:02 karolherbst: order?
18:02 imirkin_: no
18:02 karolherbst: ohh
18:02 imirkin_: extbf sign-extends
18:02 imirkin_: (extbf s32)
18:02 imirkin_: so it'll fill the top bits with the 24th bit
18:02 imirkin_: while and will zero it out
18:03 karolherbst: mhh
18:03 karolherbst: but it seems to produce the same stuff nvidia does, so I thought it would be fine
18:03 imirkin_: extbf u32 would be equivalent to the and.
18:03 imirkin_: well, i dunno what the context is here. if it's just those 2 ops, there will be inputs for which they produce different outputs.
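The difference between the signed and unsigned extract and the plain `and` can be shown with a small Python model. It assumes, following the 0x1800 example above, that the control operand holds the bit offset in its low byte and the field width in the next byte; the function name `bfe` and its `signed` parameter are illustrative, and edge cases such as a zero width are not modelled.

```python
MASK32 = 0xFFFFFFFF

def bfe(value, control, signed):
    """Extract a bit field; optionally sign-extend it to 32 bits.

    Assumed control layout: bits 0-7 = offset, bits 8-15 = width.
    """
    offset = control & 0xFF
    width = (control >> 8) & 0xFF
    field = ((value & MASK32) >> offset) & ((1 << width) - 1)
    if signed and width and field & (1 << (width - 1)):
        field -= 1 << width            # copy the top field bit upwards
    return field & MASK32

CONTROL = 0x1800                       # offset 0, width 24 (decimal 6144)

# Bit 23 clear: all three operations agree.
same = (bfe(0x12345678, CONTROL, True),
        bfe(0x12345678, CONTROL, False),
        0x12345678 & 0x00FFFFFF)

# Bit 23 set: only the signed extract fills the top byte.
signed_result = bfe(0x00FFFFFF, CONTROL, True)
unsigned_result = bfe(0x00FFFFFF, CONTROL, False)
```

So the unsigned extract matches the mask for every input, while the signed one differs exactly when bit 23 of the input is set.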
18:04 karolherbst: https://gist.githubusercontent.com/karolherbst/884ed81e6a4aff31f46ffd28c911ac0e/raw/b988cc27982f73336c56d9cb62bc10a98e3895b3/gistfile1.txt
18:04 imirkin_: right, so you have to use the extbf there
18:04 imirkin_: otherwise you'll get the wrong results ;)
18:04 karolherbst: hum
18:05 karolherbst: well I get the wrong results and I use extbf
18:05 imirkin_: well
18:05 imirkin_: it all depends on what is right and wrong
18:05 karolherbst: this should be mad24
18:05 imirkin_: but imagine that c0[0x8] == 0x00ffffff
18:05 imirkin_: and c0[0xc] == 0x00000001
18:06 imirkin_: and c0[0x10] == 0
18:06 karolherbst: actually 0x8 is 1 and 0xc is 0x00ffffff
18:06 imirkin_: what would you expect the answer to be?
18:06 karolherbst: 0x00ffffff
18:06 imirkin_: then the code you have is wrong.
18:06 imirkin_: however that's not mul24
18:06 imirkin_: that'd be umul24
18:06 karolherbst: mhh
18:06 imirkin_: smul24 would need the bfe's
18:07 karolherbst: ohhhhh
18:07 karolherbst: right
18:07 karolherbst: the umad24 test passes
18:07 karolherbst: ohh, I see
18:08 karolherbst: I get 0xffffffff instead of 0x00ffffff in the smad24 case
18:09 karolherbst: ahh crap, I see the issue
18:10 karolherbst: now it makes sense that nvidia does all those weird xmad instructions
18:11 karolherbst: I guess carry stuff and so on might be super complicated as well
18:11 imirkin_: :)
18:11 imirkin_: bfe.u32 and bfe.s32 will give you diff results
18:12 imirkin_: bfe.s32 will sign-extend, bfe.u32 will not
18:12 karolherbst: ahh
18:13 karolherbst: yay
18:13 karolherbst: umad24 and smad24 pass now :)
18:14 karolherbst: this should be good enough for now: https://github.com/karolherbst/mesa/commit/69bf54ae1bd0423aad78709b994c7034b3f6c8a4
18:15 imirkin_: mmmmm ... smad24 needs bfe.s32
18:15 imirkin_: at least ... i would think
18:15 karolherbst: actually, I tried that before and it was wrong...
18:15 karolherbst: or maybe pmoreau test is wrong?
18:15 karolherbst: well, it passed on nvidia
18:16 imirkin_: hmmmmm
18:16 karolherbst: and nvidia sues u32
18:16 imirkin_: ok
18:16 imirkin_: then there's no signed mul
18:16 pmoreau: Poor u32, he did nothing wrong though :-(
18:16 imirkin_: + madsp->subOp = NV50_IR_SUBOP_MADSP_TUPLE(U24, U24, 32);
18:16 imirkin_: 
18:16 imirkin_: right
18:16 karolherbst: yeah
18:16 imirkin_: so to match U24, you want BFE.U32
18:17 imirkin_: and it's all just umul24
18:17 imirkin_: however i think that SMad24 needs somewhat different treatment.
18:17 pmoreau: I think my test is wrong, cause I use & 0x00ffffff when computing on the CPU, so it does not sign extend.
18:18 karolherbst: mhh
18:18 karolherbst: I think you are right
18:19 pmoreau: (a << 8) >> 8 should correctly sign extend, shouldn’t it?
18:19 imirkin_: yes.
18:19 karolherbst: https://gist.githubusercontent.com/karolherbst/bcafaf31dfa31c7863926e0b92bd2c99/raw/bf5814bdec32c8843f42825bd943be736cbb8fc2/gistfile1.txt
18:19 pmoreau: OK, I’ll fix the test
18:20 pmoreau: So the BFE vs BFE.U32 is the only difference between umad24 and smad24
18:20 karolherbst: seems that way
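A CPU-side reference for the two variants, in the spirit of the test fix discussed above, might look like the following Python sketch. It assumes the interpretation where the low 24 bits of each multiplicand are zero-extended for umad24 and sign-extended for smad24; the function names are made up, and `sext24` plays the role of the `(a << 8) >> 8` idiom on a 32-bit signed integer.

```python
MASK24 = 0xFFFFFF
MASK32 = 0xFFFFFFFF

def sext24(x):
    """Sign-extend the low 24 bits, like (a << 8) >> 8 on an int32."""
    x &= MASK24
    return x - (1 << 24) if x & 0x800000 else x

def umad24(a, b, c):
    """Unsigned 24x24 multiply plus 32-bit add, truncated to 32 bits."""
    return ((a & MASK24) * (b & MASK24) + c) & MASK32

def smad24(a, b, c):
    """Signed 24x24 multiply plus 32-bit add, truncated to 32 bits."""
    return (sext24(a) * sext24(b) + c) & MASK32

# The failing case from the discussion: 1 * 0xffffff + 0.
u = umad24(0x000001, 0xFFFFFF, 0)
s = smad24(0x000001, 0xFFFFFF, 0)
```

With these definitions the two variants only disagree when at least one multiplicand has bit 23 set, which matches the observation that the extract type is the only difference in the generated code.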
18:21 karolherbst: I like those cuda tools. Make things much easier for us :D
18:21 imirkin_: (and BFE is implicitly "BFE.S32")
18:21 pmoreau: On the other hand, I don’t think there is a special version of mul for signed or unsigned (definitely not for add)
18:21 karolherbst: let's see what they do on SM30
18:21 imirkin_: no, coz with 32-bit mul it's all the same
18:21 imirkin_: as long as the result is 32-bit
18:22 pmoreau: Yeah, the CUDA tools are pretty good! No need to run the code, just compile it! Though the driver might recompile using the PTX found in the binary, but mostly if it didn’t find a binary corresponding to the architecture you are running on.
18:22 karolherbst: ohh right, they use imad
18:22 karolherbst: ... https://gist.githubusercontent.com/karolherbst/2af295fd5b082b3131b3b5bc1cdbc62b/raw/986ab8c1643e6f0ab3568c8fbdb39efa0324bc5e/gistfile1.txt
18:23 karolherbst: so basically what we do on maxwell/pascal now
18:23 pmoreau: What did you change compared to the previous gist?
18:23 karolherbst: wondering why they don't use madsp
18:23 karolherbst: is madsp really that terrible?
18:23 karolherbst: pmoreau: sm30 instead of sm50
18:23 pmoreau: Ah, k :-)
18:24 imirkin_: nvidia misses a ton of such optimizations
18:24 imirkin_: because they don't matter in practice
18:24 karolherbst: or they know more than we do
18:24 imirkin_: but it's a heck of a lot easier to do them than it is to make a proper scheduler
18:24 imirkin_: so we obsess over them :)
18:24 pmoreau: :-D
18:24 karolherbst: :D
18:25 karolherbst: so this should be right then: https://github.com/karolherbst/mesa/commit/110343111f782d74e36de799886d6cbc0cf06c3f
18:26 karolherbst: I am not quite sure about that madsp subop now, but I can't test it here anywat
18:26 karolherbst: *anyway
18:27 pmoreau: I’ll test it tonight
18:27 karolherbst: cool
18:30 pmoreau: I’m still wondering if using OP_MAD (imad?) will do the job, or if we are missing something by not using those extra xmad.
18:30 karolherbst: performance mainly I suppose
18:30 karolherbst: maybe some precision in a few cases? dunno
18:31 karolherbst: well, if we would know what xmad does, we might be able to answer that question
18:31 karolherbst: I could imagine, those .reuse mean something like: keep the value, we use it later again?
18:32 pmoreau: https://devtalk.nvidia.com/default/topic/980740/xmad-meaning/ should contain most answers. They seem to have found most (if not all) flag combinations and what they do
18:33 karolherbst: PSL: post shift left...
18:33 pmoreau: Yes; it does not seem to be specific to xmad. You can have it both on input and output operands IIRC.
18:33 karolherbst: or pre shift left? >D
18:33 pmoreau: product shift left, done before the addition
18:34 karolherbst: I could imagine, that .reuse reduces the latency between instructions
18:35 pmoreau: Yeah, they keep it in some operand cache, but I don’t know much about that.
18:35 pmoreau: I think RSpliet had a better understanding of it?
18:35 karolherbst: it might be a way to somehow reduce the perf hit by not dual issuing stuff
18:36 imirkin_: pmoreau: allegedly XMAD is a lot faster than IMAD. i have no clue what IMAD does that XMAD doesn't
18:36 karolherbst: so they say: we just reduce the costs of loading registers again to execute things faster. It makes sense
18:36 karolherbst: imirkin_: well, that mad24 thing might show us
18:36 karolherbst: because they seem to add a lot of those fancy xmad features
18:36 imirkin_: i was kinda assuming that XMAD was a subset of IMAD
18:37 imirkin_: e.g. 24-bit mul, etc
18:37 imirkin_: but i have no idea
18:37 karolherbst: right, but what is R5 missing in the first XMAD?
18:37 pmoreau: Carry bit from the mul?
18:37 karolherbst: ohhh wait
18:37 pmoreau: And maybe sign extension to 32-bit
18:38 karolherbst: what is mrg doing...
18:38 karolherbst: mhh
18:39 pmoreau: “I would think that MRG stands for "merge": It takes bits [15:0] from (a * b + c), while taking bits [31:16] from (b << 16). So really: ((a * b + c) & 0xffff) | (b << 16), since the fields don't overlap.”
18:39 karolherbst: uhm....
18:39 karolherbst: MRG (move register): (a * b + c) & 0xffff + (b << 16)
18:39 karolherbst: what an idea
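The quoted guess about the MRG flag can be written out as a short Python model. This only encodes the formula from the forum quote, under the assumption that a and b are 16-bit halves; it is not a verified description of what the hardware does, and the function name is invented.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def xmad_mrg(a, b, c):
    """Model of the quoted formula: ((a * b + c) & 0xffff) | (b << 16)."""
    a &= MASK16
    b &= MASK16
    low = (a * b + c) & MASK16          # bits [15:0] of the multiply-add
    high = (b << 16) & MASK32           # b placed in bits [31:16]
    return low | high

result = xmad_mrg(3, 5, 1)              # low = 16, high = 5 << 16

# The two fields never overlap, so OR and ADD give the same value.
low_part = (3 * 5 + 1) & MASK16
high_part = (5 << 16) & MASK32
```

Because the fields do not overlap, the `+` in the second quoted form and the `|` in the first describe the same result, provided the mask is applied before the add.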
18:39 imirkin_: right, so i get that XMAD has lots of goofy options
18:40 imirkin_: but can it just do a * b + c?
18:40 karolherbst: well
18:40 karolherbst: let us put all together
18:40 pmoreau: imirkin_: You don’t like goofy options? :-(
18:40 pmoreau: :-D
18:40 imirkin_: no, i like goofy options
18:41 imirkin_: i just like to have the option to not use them :)
18:41 pmoreau: Poor unused options
18:42 imirkin_: options don't have feelings.
18:42 pmoreau: Maybe they do? :o
18:42 imirkin_: that'd make me an evil man
18:43 Lyude: karolherbst: btw: we may get even more interesting results by using stuff like glmark2 that doesn't always use the whole GPU
18:43 pmoreau: Anyway, xmad might only be a 16x16+32?
18:44 pmoreau: Should be doable to hack the smad/umad example to have it generate just a simple xmad, with all arguments being full 32-bit values, and see what we get out
18:45 imirkin_: it's all a bit difficult to care about
18:45 imirkin_: given that GM20x+ = no accel
18:46 pmoreau: no accel? You mean no reclock, right?
18:46 pmoreau: Or is there another kind of accel I am missing
18:46 karolherbst: ohhhhhhhhhhh
18:46 karolherbst: xmad does only 16 bit stuff
18:46 imirkin_: pmoreau: yeah, the kind that makes things go faster, rather than slower :)
18:46 karolherbst: then it makes sense
18:46 pmoreau: ;-)
18:47 pmoreau: I tend to have accel=hardware accel=Gr firmwares, reclock=well, reclock=PMU firmwares :_D
18:47 karolherbst: https://gist.githubusercontent.com/karolherbst/c46fc285be74828595205cf6f88e0d87/raw/23c3c3d155afb9752bb9230ec79d2b86a84fe03f/gistfile1.txt
18:48 pmoreau: But definitely true that performance on GM20x+ is not a priority regarding code generation.
18:48 karolherbst: imirkin_: what do you think. Does this look like something which makes sense?
18:49 karolherbst: so MADSP 16/16/32 could be lowered to XMAD on maxwell :D
18:50 imirkin_: or we dump all that junk from the FE and have a const folding pass
18:50 karolherbst: FE?
18:50 imirkin_: that detects mul(a&0xffffff,b&0xffffff)
18:50 imirkin_: and converts it into cool things
18:50 karolherbst: :D
18:52 imirkin_: iirc dolphin does a TON of stuff like that
18:52 imirkin_: since it's a mostly-8-bit ALU
18:52 karolherbst: mhhh
18:52 karolherbst: interesting
18:52 imirkin_: but ... maybe they just let it overflow. dunno.
18:53 karolherbst: so now the interesting question: should we add a OP_XMAD or.....
18:53 imirkin_: is XMAD different than MADSP?
18:53 karolherbst: yes
18:53 imirkin_: how
18:53 karolherbst: can madsp do ((R0.LO * R2.HI) & 0xffff) | (R2.HI << 16)?
18:53 imirkin_: remember --
18:53 imirkin_: nv50 ops are theoretical ops
18:53 imirkin_: they only have to map to hw at emit time
18:54 karolherbst: right, but still
18:54 imirkin_: so if we can come up with a combined description
18:54 karolherbst: how should we do xmad.mrg then?
18:54 imirkin_: and ensure that the relevant hw only sees what it supports
18:54 karolherbst: well, we could just put it on top of madsp somehow and create something even more insane than before
18:54 imirkin_: stick it into madsp subop? dunno
18:54 karolherbst: :D no, thanks
18:54 imirkin_: when are you planning on using xmad.mrg?
18:54 karolherbst: apparently for mad24
18:55 imirkin_: erm ok
18:55 karolherbst: that PSL.CBCC thing also looks weird enough to actually make sense in real world scenarios... no idea
18:55 karolherbst: I am sure the engineers at nvidia thought having such a smartass instruction benefits something
18:56 karolherbst: I don't know
18:56 imirkin_: probably optimizing some key algo
18:56 karolherbst: probably
18:56 karolherbst: well apparently doing those crazy xmad things is faster than one imad
18:56 karolherbst: maybe it is even faster than mul....
18:57 pmoreau: The use case in the slides was some modular multiplication
18:57 karolherbst: well, maybe it is just neural network stuff in the first place and they found other use cases?
18:58 pmoreau: slide 3? they talk more about cryptographic operations, but maybe it was neural network.
18:58 karolherbst: were those slides done by nvidia guys? I thought those were random researchers
18:58 pmoreau: Look at slide 1
18:59 karolherbst: I see
18:59 karolherbst: well that xmad stuff looks like some pre tensor thing
18:59 karolherbst: just less fancy
18:59 karolherbst: or even more fancy
19:00 karolherbst: those tensor things are basically also just 16/16/32 bit mads
19:00 karolherbst: basically
19:00 pmoreau: Possibly, but they seem to have started working on that even before neural networks. (Though hard to tell as we don’t have that kind of insight in NVIDIA.)
19:00 karolherbst: right
19:00 pmoreau: Done in matrices, and it also supports 4x8-bit integers
19:01 karolherbst: anyway, I haed home now
19:01 karolherbst: *head
19:01 pmoreau: I should do that, but on the other hand in need to work seems I have spent so much time on Nouveau stuff
19:01 pmoreau: s/in need/I need
19:04 pmoreau: Hum, I thought their tensor cores worked for 8-bit integers as well, but the part exposed in CUDA only talks about halfs.
19:09 RSpliet: tensor cores aren't the same as the TPU they just released as far as I know
19:10 pmoreau: Ah, could be that. I think I know who to ask otherwise, though I might not get an answer. :-D
19:22 airlied: 18
20:36 yann-kaelig: Hi
20:37 yann-kaelig: I forgot something suring my compilation and I get this issue usr/lib/gallium-pipe/pipe_nouveau.so': No such file or directory. What I missed in my configuration ?
20:37 yann-kaelig: s:d:
20:42 annadane: yann-kaelig, probably helps if you post your complete output to a paste site like paste.debian.net
20:42 annadane: sort of hard to know just going off of that
20:45 annadane: and/or your configuration
20:45 annadane: any relevant details
20:56 johnjay: pastebinit for the win
20:56 johnjay: that way you just have to remember the 6 digit number to copy the link
21:08 annadane: *shrug*
21:08 annadane: use a carrier pigeon if you want
21:49 pmoreau: I am wondering whether or not I found a bug in Beignet.
21:50 imirkin_: i suspect there's tons of bugs
21:50 pmoreau: (0x000001 * 0xffffff) + 0x00000000 = 0x00ffffff
21:50 imirkin_: there's not a ton of different applications
21:50 pmoreau: The result should be sign-extended to 32-bit, shouldn’t it?
21:50 imirkin_: depends on the operation
21:50 imirkin_: umad or smad?
21:50 pmoreau: smad
21:51 imirkin_: and this is with spir-v?
21:51 imirkin_: or something else
21:51 pmoreau: This is by giving it the OpenCL kernels directly, so I have no idea what it is doing with it.
21:51 imirkin_: so ... opencl c?
21:51 pmoreau: Right
21:52 pmoreau:needs to check the OpenCL spec
21:54 imirkin_: https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/mul24.html
21:54 imirkin_: doesn't seem to say anything about whether the result is 24- or 32-bit
21:54 pmoreau: Yeah :-/
21:54 imirkin_: let's check ths epc
21:55 imirkin_: the spec, that is
21:55 pmoreau: You get the same text in the spec.
21:55 imirkin_: yeah
21:55 imirkin_: so i guess it may or may not be correct.
21:56 imirkin_: since the result is a 32-bit number, it'd definitely be *odd* for it not to be able to support the full 32-bit value
21:56 pmoreau: At least they specify when the input does not fit in 24 bits, that it is implementation defined, but they don’t say anything about the result.
21:56 imirkin_: otoh, the whole premise here is that it's a 24-bit alu, so, i dunno.
21:56 pmoreau: Right
21:58 pmoreau: It’s even weirder for mad24, as it adds a 32-bit value
21:59 tstellar: pmoreau: The gcn instructions for mul24 return a 32-bit value.
22:00 pmoreau: OK. 32-bit value would be sensible.
22:00 pmoreau: The spec for mad24 does say that the mul part returns a 32-bit value as well, and since it adds a 32-bit value to it, the final value should still be 32-bit.
22:11 imirkin_: pmoreau: a better test is 0xffffff * 2
22:11 imirkin_: does that come out as 0xfffffe or 0x1fffffe
22:14 imirkin_: pmoreau: and an argument could be made that the opencl spec actually doesn't imply the sign-extension
22:14 imirkin_: using a 0xffffff arg to a smul24 is undefined
22:14 imirkin_: since it's outside the 32-bit-valued [-2^23, 2^23-1] range
22:14 pmoreau: :-D Found a small bug: the umad24 test was using the smad24 kernel --"
22:15 pmoreau: Shouldn’t 0xffffff be -1 encoded in 24-bits?
22:15 imirkin_: but it's not
22:15 imirkin_: it's encoded in 32 bits
22:15 imirkin_: there is no 24-bit type
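The range argument can be checked with a few lines of Python. The helper names are hypothetical; the sketch only shows that the 32-bit value 0x00ffffff lies outside [-2^23, 2^23-1], while -1 as a 32-bit integer is 0xffffffff and lies inside it.

```python
S24_MIN = -(1 << 23)        # -8388608
S24_MAX = (1 << 23) - 1     #  8388607

def as_signed32(bits):
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    bits &= 0xFFFFFFFF
    return bits - (1 << 32) if bits & 0x80000000 else bits

def fits_s24(bits):
    """True if the 32-bit signed value lies in the signed 24-bit range."""
    return S24_MIN <= as_signed32(bits) <= S24_MAX

checks = {
    0x00FFFFFF: fits_s24(0x00FFFFFF),   # 16777215, too large
    0xFFFFFFFF: fits_s24(0xFFFFFFFF),   # -1, in range
    0x007FFFFF: fits_s24(0x007FFFFF),   # largest in-range value
    0xFF800000: fits_s24(0xFF800000),   # smallest in-range value
}
```

Under this reading, a test that wants to pass -1 to the signed variant would pass 0xffffffff rather than 0x00ffffff.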
22:16 pmoreau: Hum, right
22:17 pmoreau: OK, that fixes the issue then
22:18 pmoreau: Thanks for pointing out that it should be 32-bit encoded.
22:20 imirkin_: well, i'm no opencl expert
22:20 imirkin_: but that's what it reads like to me
22:44 pmoreau: karolherbst: The smad24 support was wrong: it was indeed using u24·u24+i32 instead of s24·s24+i32
22:52 gnarface: so, which AMD cards are good for hardware h264 decoding with open source drivers (AMDGPU not AMDGPU-PRO?)
22:52 gnarface: looking for something under 50$
22:52 gnarface: just has to stream video, basically
22:52 gnarface: lower latency the better
22:56 pmoreau: imirkin_ pushing for AMD cards rather than NVIDIA ones, does not mean we are expert in which are the best AMD cards either. ;-p
22:57 pmoreau: (or maybe he is :o )
22:58 pmoreau: I did look at AMD cards when I built my computer (some time ago, and I did buy an AMD card over an NVIDIA one), but I had completely different criteria, sorry
23:01 gnarface: no no, i know that. i'm completely clear this isn't a AMD support/sales channel too. i just figured you guys might be the type of people who actually have some in your possessions that you had good experiences with so you could vouch for a particular model or model range. (i don't even know the AMD numbering schemes; they always seemed bewildering to me compared to the nvidia ones)
23:01 gnarface: last AMD video card i had was a mach64
23:01 gnarface: heh
23:01 gnarface: pci
23:01 gnarface: good card, too
23:01 gnarface: oy, ATI technically but you know what i mean
23:02 gnarface: good hardware, garbage drivers, nothing surprising
23:03 pmoreau: I had an HD 6870 IIRC. Can’t complain about it, mainly used it under Windows for gaming and had pretty good performance for the games of that era.
23:03 gnarface: 6870
23:03 gnarface: those are 1GB cards?
23:03 pmoreau: I can’t remember, sorry
23:03 gnarface: in theory should work perfectly for my purposes if the linux drivers can actually do hardware h264 decoding
23:04 gnarface: hell, 256MB should be enough i think
23:04 pmoreau: I still have it, but a few thousand kilometres away from me. :-D
23:04 gnarface: it's only gotta support 1920x1080. dual display support would be nice but not required
23:05 pmoreau: I used it with 3 1920x1080 screens, so dual should work.
23:07 pmoreau: I did try playing BF3 on those three monitors, but it was a bit laggy; I most likely had settings at max still.
23:12 gnarface: yea, opengl performance isn't actually important for this use case, because it would just be in a steam-streaming box, so all that matters is that it can do h264 hardware decoding with linux drivers
23:12 gnarface: the opengl would be happening on my desktop machine in another room, 50' away
23:14 gnarface: this actually appears to be a much rarer hardware feature in linux than one would suspect, if you don't want to throw more than 50$ at it
23:14 gnarface: actually currently just looking at what's on the shelf at fry's the next cheapest card that i know for sure can do it from nvidia is 350$
23:15 gnarface: which of course is absurd
23:15 gnarface: (not the least because it would also require sawing holes in the side of the dell case and replacing the power supply)
23:18 airlied:wonders what model gpus are in the $50 range nowadays
23:22 imirkin_: airlied: GT 730 :)
23:26 imirkin_: gnarface: ask in #radeon, but i think anything that's GCN+ is pretty well supported
23:26 gnarface: imirkin_: i'll try
23:27 imirkin_: i barely keep up with AMD generations, much less what the pricing is for various GPUs
23:27 airlied:has no ideas on the video side
23:27 airlied: decode/encode is still all magic
23:27 imirkin_: what i can say is that AMD is funding a team of developers who try their best to support the hardware
23:29 yann-kaelig: is it possible to connect an AMD an Nvidia card on the same motherboard on on the pcie slot1 and second on pcislot2, then shutdown on or the other slot dependening on the need
23:29 yann-kaelig: on:one
23:29 airlied: motherboards don't have the ability to poweroff slots
23:30 yann-kaelig: too bad
23:31 gnarface: well, wait, some server motherboards do, don't they? i thought that was an implicit requirement for hotplug
23:32 airlied:doesn't know of any server motherboards with 16x pcie slots
23:32 airlied: that you can power on/off
23:33 airlied:is willing to be wrong if you can point me at one :)
23:34 yann-kaelig: but for example usb or anything about Power-Saving has not the ability to shutdown a part of the motherboard ?
23:35 airlied: pcie d3 cold exists, but I don't know of any motherboards that support it for gfx slots
23:43 imirkin_: i've heard of it existing, but i'm not aware of anything off-the-shelf that does it
23:43 imirkin_: the only thing is in like some external enclosure
23:44 imirkin_: the use-case is basically a board goes bad and you want to replace it without rebooting the box
23:44 imirkin_: think sun e10k style :)
23:46 airlied: you could get a thunderbolt 3 egpu enclosure I suppose
23:46 airlied: I've been meaning to get one of those