12:28 karolherbst: pmoreau: mhh, mesa can't find the spirv-tools in my prefix
12:29 pmoreau: karolherbst: I have updated the installation script as well; you need to use a new branch for SPIRV-Tools, which generates the .pc file
12:29 karolherbst: which branch?
12:29 karolherbst: v2017.1?
12:29 pmoreau: for_nouveau, on my own fork
12:30 karolherbst: ahh
12:30 pmoreau: The patch hasn’t been submitted yet for merging
12:30 pmoreau: https://github.com/pierremoreau/SPIRV-Tools/tree/for_nouveau
12:31 pmoreau: I set up that branch so that it’s easier for me to push all my changes there, and have others use them, while they are making their way to master.
12:31 karolherbst: today I hope I can get the OpenCL CTS to run :)
12:31 pmoreau: :-)
12:32 pmoreau: I did a bit of cleaning yesterday on the code; mainly removing checks that are already done by the validator, and fixing some things as well.
12:32 pmoreau: Like opcodes that could take either a scalar or a vector, but the code only supported the scalar form: now it supports both.
12:32 karolherbst: yeah, I saw
12:32 karolherbst: already ran into a conflict
12:33 karolherbst: that smad24/umad24 thing
12:33 pmoreau: Ah yeah, I haven’t had time to try your patch for it. :-/
12:34 karolherbst: and it might make sense to have a proper build system for spir-v-testing, because it sometimes fails to detect changes
12:36 karolherbst: mhh
12:36 karolherbst: pmoreau: /usr/bin/ld: cannot find -lSPIRV-Tools-comp
12:36 pmoreau: Yeah, I’m planning to move it to CMake (or maybe Meson, for fun). And if using CMake, maybe make use of CTest.
12:36 karolherbst: Ctest?
12:36 karolherbst: mhh
12:36 pmoreau: Hum, I didn’t that issue.
12:36 karolherbst: yeah, why not
12:36 pmoreau: A framework for unit testing.
12:36 pmoreau: They use it in LLVM and SPIRV-Tools.
12:37 pmoreau: didn’t *hit* that issue
12:37 karolherbst: well, ctest is just the cmake part of it
12:37 karolherbst: it basically checks the return code of executables
12:38 karolherbst: mhh, I am indeed missing the libSPIRV-Tools-comp.so file
12:39 pmoreau: Maybe enable SPIRV_BUILD_COMPRESSION=ON
12:39 karolherbst: ahhh
12:39 karolherbst: yeah
12:39 karolherbst: that's it
12:39 pmoreau: I don’t think I had it enabled before, but now I do.
12:39 karolherbst: I thought comp stands for compute
12:39 pmoreau: I updated the instructions on the webpage as well as the script, but that’s not going to help you since you already have everything configured.
12:39 karolherbst: yeah
12:40 karolherbst: just mesa failed to build, so I am nearly done with rebuilding
12:40 pmoreau: I wanted to tell you about it yesterday when I pushed, but you were offline and I was getting pretty tired as well (around 3-4am).
12:43 karolherbst: *sigh*... now I get other issues
12:43 karolherbst: `program.compile();` failed: CL_COMPILE_PROGRAM_FAILURE
12:43 karolherbst: invalid source========================================
12:44 pmoreau: You got that one earlier when you were setting things up
12:44 pmoreau: initially
12:44 karolherbst: mhh
12:44 pmoreau: I can’t remember how you fixed it though
12:44 karolherbst: ohh, right, wait
12:44 karolherbst: I forgot to update my script to use the local mesa
12:45 karolherbst: there I still used lib and not lib64
12:45 pmoreau: Ah, ok
12:52 karolherbst: pmoreau: can you check if the test still passes with the two commits for you? https://github.com/karolherbst/mesa/commits/nouveau_spirv_support
12:53 karolherbst: uhm.. your smad24 case should be wrong
12:53 pmoreau: I’ll check tonight. I am currently at work and don’t have my laptop to check on it.
12:53 karolherbst: or wait...
12:53 karolherbst: ohh, you used U16L before, not U24
12:53 karolherbst: I see
12:53 karolherbst: so the umad24 case should be wrong as it is right now
12:54 pmoreau: When I checked umad24, it was doing 24-bit, not 16
12:54 karolherbst: weird
12:55 karolherbst: I am sure 2 is wrong
12:55 karolherbst: "NV50_IR_SUBOP_MADSP(4,2,8); // u16l u16l u16l"
12:55 karolherbst: "NV50_IR_SUBOP_MADSP(0,2,8); // u32 u16l u16l"
12:55 pmoreau: When I generated the 3 line comments here: https://github.com/karolherbst/mesa/commit/3fca678c2e905bdbc8ea9b2d2332912792778414#diff-96014c8c85b364c512bef1ae32bb3f73L246
12:55 karolherbst: well, your comments aren't correct there
12:56 karolherbst: a and c sound correct
12:56 pmoreau: That was based on the result I got out of running the code
12:56 karolherbst: although c bit 0 and 1 do a few strange things
12:56 karolherbst: but b can't be right
12:56 karolherbst: mhh
12:56 karolherbst: let me check with nvdisasm
12:57 karolherbst: you run on kepler1, right?
12:57 karolherbst: or was it kepler2?
12:57 pmoreau: I’ll check again tonight, but I think I checked it by running 1 x 0xffffffff + 0 and see what I got out.
12:57 pmoreau: GK107
12:58 karolherbst: so
12:58 karolherbst: 0 means U24
12:59 karolherbst: echo '0x00000003 0x00000000' | mesa_nvdisasm SM30: @P0 IMADSP.U32.U24.U32 R0, R0, R0, R0;
12:59 karolherbst: 1 is S24
13:00 karolherbst: '0x00000023 0x00000000': @P0 IMADSP.U32.S24.S32 R0, R0, R0, R0;
13:00 karolherbst: 2 is U16l '0x00000043 0x00000000': @P0 IMADSP.U32.U16H0.U32 R0, R0, R0, R0;
13:01 karolherbst: and 3 is S16l
13:01 karolherbst: 4 is messing with src0
13:01 karolherbst: '0x00000083 0x00000000': @P0 IMADSP.S32.U24.S32 R0, R0, R0, R0;
13:02 karolherbst: in the old code you had this in the emitter: code[0] |= (i->subOp & 0x0f0) << 1;
13:02 pmoreau: How is the test even passing if it’s doing U16l? oO
13:02 karolherbst: don't ask me
13:02 karolherbst: but well
13:02 karolherbst: 0xffffff == 0xffff for u16l, right?
13:02 pmoreau: Yes
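The question above is whether pmoreau's check (1 x 0xffffffff + 0) can tell the source modes apart. The sketch below models the four modes named in the nvdisasm output (U24, S24, U16 low half, S16 low half) as plain bit operations. The function name `madsp_src1` and the mode strings are hypothetical, chosen only for illustration, and this is an assumption about what the mnemonics mean, not a description of what the hardware actually does.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def madsp_src1(value, mode):
    """Hypothetical model of how one MADSP source operand is read."""
    if mode == "U24":
        return value & 0xffffff
    if mode == "S24":
        return sign_extend(value, 24)
    if mode == "U16L":
        return value & 0xffff
    if mode == "S16L":
        return sign_extend(value, 16)
    raise ValueError("unknown mode: " + mode)


# Reproduce the 1 x 0xffffffff + 0 check for every assumed mode.
for mode in ("U24", "S24", "U16L", "S16L"):
    result = (1 * madsp_src1(0xffffffff, mode) + 0) & 0xffffffff
    print(mode, hex(result))
```

Under this model the unsigned modes give different results for that input, so the check should distinguish U24 from U16L, while both signed modes read the operand as -1 and cannot be told apart by it.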
13:02 karolherbst: and both is -1, right?
13:03 karolherbst: ohh wait
13:03 karolherbst: as signed
13:03 karolherbst: but it may not matter
13:03 karolherbst: maybe that instruction is a bit broken
13:03 karolherbst: keep in mind, that nvidia isn't using it
13:03 karolherbst: they use plain BFE+IMAD even on kepler1
13:03 pmoreau: True
13:04 karolherbst: so, what I did in my patch is according to nvdisasm
13:04 karolherbst: and what is inside envydis
13:04 karolherbst: it might or might not be correct practically
13:05 karolherbst: but they don't use IMAD on maxwell either, so we are doing something which nvidia doesn't on all chipsets in the first place
13:06 karolherbst: we might want to dig a bit deeper into madsp and see what it means on a hw level here
13:08 pmoreau: On the other hand, getting mad24/mul24 up and running efficiently is not the most important task right now.
13:09 karolherbst: not really
13:09 karolherbst: I didn't say it from a performance point of view though
13:09 karolherbst: if madsp is doing crazy things, we might want to know about it before
13:10 karolherbst: but... to be honest, compute won't be important prior to pascal anyway and in one or two years not even there anymore
13:10 pmoreau: Depends what kind of compute.
13:10 karolherbst: well, ignoring compute shaders
13:11 karolherbst: compute shaders are important for GPUs users might use for playing games, and kepler cards are fine there as well
13:11 pmoreau: Running CUDA kernels on Maxwell (OK, a Titan card), is giving pretty good results. Sure in two-three years you’ll have better performance on newer cards, but that does not make older cards perform badly.
13:11 karolherbst: but nobody will use compute on kepler with nouveau for professional use cases. not today, not in the future
13:11 karolherbst: well
13:12 karolherbst: keep in mind, that we won't have an open source cuda/opencl implementation just tomorrow
13:12 karolherbst: and it may need a few years until it gets stable enough
13:12 pmoreau: Right, and it may take 2x that time before we are able to reclock Maxwell v2+ cards
13:13 karolherbst: if we are able to run stuff for 1 month straight on most cards without issues and with performance comparable with nvidia, yeah, then is the time it is actually usable
13:13 pmoreau: So Kepler and Maxwell v1 will remain the main target currently
13:13 karolherbst: well, not here it won't
13:14 karolherbst: for you and the normal users and everybody, then yes
13:14 karolherbst: but not under the terms I should work on that stuff
13:15 karolherbst: and it is fine this way, just saying that I will work mainly on Pascal and everybody who wants to use it in the short term will work on kepler/maxwell1.
13:19 karolherbst: pmoreau: by the way: gr: TRAP ch 2 [00ffc63000 loop_with_if.te[20230]] :/
13:19 karolherbst: gr: GPC0/TPC0/MP trap: global 00000000 [] warp b00000f [MISALIGNED_ADDR]
13:20 karolherbst: then nouveau resets the fifo channel
13:21 pmoreau: That’s the error I talked to you about, where I wasn’t sure whether it was a bug in RA or not. (the live set wrongly computes the live range of the value storing the out pointer)
13:21 karolherbst: okay
13:22 pmoreau: If you have some time to look into it that would be great, but you told me you wanted to look more into missing functionalities in Nouveau.
13:23 karolherbst: well, if it is a bug within codegen, I can take care of that as well
13:23 karolherbst: just I don't want to interfere with your spirv_to_nvir work
13:23 pmoreau: But what is currently happening, is that the Value* storing the res pointer gets overwritten on every loop turn.
13:24 pmoreau: I am not sure where the bug lies. I think I properly set up the different edges between the BBs, and the error lies in buildLiveSet, but maybe not.
13:25 karolherbst: mhh
13:26 karolherbst: well, right, I see the issue, but I don't see why it happens, let me check
13:26 pmoreau: From what I remember when debugging this, buildLiveSet would always fail when the for-loop contains more than 1 BB. So it would be nice to check by creating a TGSI shader with those properties and see whether it succeeds or not.
13:26 karolherbst: luckily the program is small enough
13:26 pmoreau: Yup
13:27 pmoreau: It could probably be simplified a bit further.
13:27 karolherbst: "livei(%14): [1 9) [12 15) [29 34)" this seems wrong
13:27 pmoreau: But solving that issue would allow us to run at least 3-4 more tests from the CTS/test_basic
13:27 karolherbst: r14 is the register with the address
13:27 pmoreau: The livei is wrong.
13:28 karolherbst: 9-11 are phi nodes
13:28 pmoreau: It doesn’t consider it to be alive in most of the loop
13:28 karolherbst: 16 is fine
13:28 karolherbst: and 17 as well
13:28 karolherbst: 18-28 need to be added to the livei
13:28 karolherbst: 18 is where it gets overwritten
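To make the hole in the printed live interval easier to see, here is a small sketch that lists the gaps in `livei(%14)`. It assumes the printed ranges are half-open `[start, end)` ranges over instruction serials; the helper names `gaps` and `covers` are made up for this example and are not part of codegen.

```python
def gaps(ranges):
    """Return the uncovered half-open ranges between consecutive live ranges."""
    ordered = sorted(ranges)
    holes = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:
            holes.append((prev_end, next_start))
    return holes


def covers(ranges, serial):
    """True if the instruction serial falls inside any live range."""
    return any(start <= serial < end for start, end in ranges)


# The interval quoted in the log: livei(%14): [1 9) [12 15) [29 34)
livei_14 = [(1, 9), (12, 15), (29, 34)]

print(gaps(livei_14))
print(covers(livei_14, 18))
```

The second gap contains serial 18, the instruction where the register is reported to be overwritten, which matches the observation that the value is not considered live for most of the loop.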
13:29 karolherbst: yeah
13:29 karolherbst: mhh
13:29 karolherbst: I think the graph should be faulty then
13:29 pmoreau: Does it get overwritten? It should only be set once when loading the kernel’s argument.
13:30 karolherbst: yeah, it gets overwritten
13:30 karolherbst: in the BB after ret
13:31 pmoreau: Care to share a paste of the output? I am at work so I don’t have Mesa & co installed here, nor am I running Nouveau, and I didn’t learn the NV50_PROG_DEBUG=255 output for that test by heart (nor any other test) :-D
13:31 karolherbst: https://gist.githubusercontent.com/karolherbst/d2f5bcefa1640adf8e0612b5d08e93c7/raw/4c2345c6bd9bfaff47bd9408e0a0ee994d781b6c/gistfile1.txt
13:32 karolherbst: let me draw that graph
13:34 pmoreau: Thanks
13:38 karolherbst: mhh
13:38 karolherbst: BB:6 => BB:8 is a tree edge, but BB:7 => BB:8 is a forward one
13:38 karolherbst: that can't be right
13:38 karolherbst: both should be the same
13:41 karolherbst: pmoreau: something like this: if you visit a new node, it is a tree edge, if you visit something you already visited, it is a back/forward/cross edge
13:41 pmoreau: Isn’t forward used for nodes that have already been “declared”/“visited”?
13:41 karolherbst: same goes for back and cross edges
13:41 karolherbst: and BB:8 wasn't visited
13:41 karolherbst: keep in mind, it is a directed graph
13:42 pmoreau: It has been visited through BB:6
13:42 karolherbst: BB:0 => BB:2 => BB:3 => BB:5 => BB:7 => BB:8
13:42 karolherbst: where did you already visit BB:8?
13:42 karolherbst: no
13:42 karolherbst: it hasn't
13:43 karolherbst: if your reasoning would be correct, then BB:6 => BB:8 would be also a forward edge, because it was already visited through BB:7
13:43 karolherbst: there is no order except the path you walk through the graph
13:43 pmoreau: Not if you are following a deterministic order to visit the BBs
13:43 karolherbst: neither BB:7 nor BB:6 is walked "before" the other
13:44 pmoreau: How do you get a forward edge then? I can easily see how you get a back edge, but I don’t see how for a forward one.
13:44 karolherbst: I think this sums it up quite nicely: https://stackoverflow.com/a/29915605
13:46 pmoreau: Too bad it doesn’t have a small drawn example. Let me try to come up with one.
13:48 karolherbst: anyway, the only edge which shouldn't be a tree one is BB:9 => BB:3
13:50 karolherbst: although I am quite sure that BB:9 => BB:3 is a back edge
13:51 pmoreau: That seems to be the easy one to classify
13:51 karolherbst: BB:5 => BB:9 would be a forward edge
13:52 karolherbst: if there would be an edge
13:52 karolherbst: there are no cross edges possible in this tree
14:00 karolherbst: but imirkin knows more about that stuff anyway
14:01 pmoreau: karolherbst: Looking at those examples http://www.csd.uoc.gr/~hy583/papers/ch3_4.pdf wouldn’t BB:7 -> BB:8 be a cross edge? (Similar to (c,d) being a cross edge in example (a))
14:01 karolherbst: pmoreau: what did df stand for again?
14:01 pmoreau: depth-first?
14:01 karolherbst: pmoreau: BB:6 => BB:7 would be a cross edge I think
14:01 karolherbst: kind of
14:02 karolherbst: mhhh
14:02 pmoreau: DFS: Depth First Search
14:02 karolherbst: well I will try to follow that paper super strictly and see what I can figure out
14:08 karolherbst: I am currently thinking if the idom is correct for BB:3, but most likely it is
14:11 boot1: whats the best open graphics cards that work with viking and libreboot? AMD, nVidia? Model?
14:17 karolherbst: pmoreau: ohhh now I have it
14:17 karolherbst: I was being stupid
14:17 karolherbst: well, it all just depends on how you actually walk through the graph
14:17 karolherbst: and you can have multiple versions of the result
14:17 karolherbst: right
14:19 karolherbst: pmoreau: BB:7 => BB:8 should be a cross edge
14:20 karolherbst: you basically define an order in which to traverse the graph, if you choose: 0, 2, 3, 4, 10, 1, 5, 6, 8, 9, 7
14:20 karolherbst: then 7 => 8 is cross and 9 => 3 is back
14:20 karolherbst: everything else is tree
14:21 karolherbst: you could also do things like 0, 2, 3, 1, 4, 10, 1, ....
14:21 karolherbst: then 3 => 1 would be tree, 3 =>4, 4=>10 tree as well and 10 => 1 would be forward
14:21 karolherbst: uhm
14:21 karolherbst: cross
14:21 karolherbst: not forward
14:23 karolherbst: but
14:23 karolherbst: because 7 is traversed before 6
14:23 RSpliet: boot1: the libreboot webpages are ambiguous on whether it initialises GPUs through their VBIOS methods. If yes, any GPU would do. If no, supposedly no GPU would do
14:23 karolherbst: 7 => 8 is tree and 6 => 8 is cross
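The point that the classification depends on the visiting order can be checked with a short depth-first search. This is a generic textbook-style classifier, not the code Mesa uses, and the `CFG` dictionary below is only a reconstruction of the basic-block graph from the edges mentioned in this conversation, so treat both as assumptions.

```python
def classify_edges(graph, root):
    """Classify every edge reachable from root as tree, back, forward or cross."""
    discovered = {}   # node -> discovery time
    finished = set()  # nodes whose successors have all been handled
    kinds = {}
    clock = 0

    def visit(u):
        nonlocal clock
        discovered[u] = clock
        clock += 1
        for v in graph.get(u, []):
            if v not in discovered:
                kinds[(u, v)] = "tree"
                visit(v)
            elif v not in finished:
                kinds[(u, v)] = "back"       # v is still on the DFS stack
            elif discovered[v] > discovered[u]:
                kinds[(u, v)] = "forward"    # v is a finished descendant of u
            else:
                kinds[(u, v)] = "cross"      # v was finished in another subtree
        finished.add(u)

    visit(root)
    return kinds


# Successor lists reconstructed from the discussion (an assumption).
CFG = {0: [2], 2: [3], 3: [4, 5], 4: [10], 10: [1], 1: [],
       5: [6, 7], 6: [8], 7: [8], 8: [9], 9: [3]}

six_first = classify_edges(CFG, 0)
seven_first = classify_edges({**CFG, 5: [7, 6]}, 0)
with_5_to_9 = classify_edges({**CFG, 5: [6, 7, 9]}, 0)

print(six_first[(7, 8)], six_first[(9, 3)])
print(seven_first[(7, 8)], seven_first[(6, 8)])
print(with_5_to_9[(5, 9)])
```

Swapping the order of the two successors of BB:5 swaps which of the two edges into BB:8 is the tree edge and which is the cross edge, while BB:9 => BB:3 stays a back edge in both runs. Adding the hypothetical BB:5 => BB:9 edge mentioned earlier yields a forward edge.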
14:24 RSpliet: there's no GPUs out there with an open-source video bios if that's what you're after. That's impractical for good reasons.
14:25 boot1: how about just ones that work
14:25 karolherbst: RSpliet: well a libre kernel works with nouveau for open firmware
14:25 RSpliet: boot1: as I said, the libreboot website is ambiguous on whether they run the initialisation methods of the GPUs video bios. You're better off asking them.
14:26 RSpliet: karolherbst: I thought they ripped out all the necessary fecs/gpccs firmwares, despite being FLOSS?
14:27 karolherbst: I think they did it for some time, but now it should work
14:27 RSpliet: \o/ progress!
14:28 karolherbst: pmoreau: I think we should start printing the incident and outgoing edges
14:28 karolherbst: pmoreau: that should make it easier to see what is going on here
14:28 karolherbst: because if the order is messed up, things break
14:29 pmoreau: karolherbst: Aren’t the outgoing already printed?
14:29 pmoreau: I added the printing for incoming ones at some point, but removed it later.
14:29 karolherbst: ohh right
14:30 pmoreau: How could you do 0, 2, 3, 1, 4, 10: there is no 3 -> 1 edge.
14:31 karolherbst: pmoreau: this is the order on how to do the DFS
14:31 karolherbst: you can do any order you want
14:31 karolherbst: as long as you go on direct connections
14:31 pmoreau: DFS still follows the edges, it doesn’t create new ones
14:32 pmoreau: *existing edges
14:32 karolherbst: that isn't the point here
14:32 karolherbst: you go back if you don't find new nodes
14:32 karolherbst: if you are at 1 and see: ohh, no new edges, I go back
14:32 pmoreau: Well it is: 3 -> 1 is not an existing edge
14:32 karolherbst: mhhh, right, true
14:32 karolherbst: well
14:32 pmoreau: From 3 you either have to go to 4 or 5
14:33 karolherbst: no, you are right, I was thinking about something silly
14:33 karolherbst: wondering if it would be legal to start with 5
14:33 karolherbst: well should be, right?
14:33 pmoreau: But you could do 0, 2, 3, 4, 10, 1, 5, 7, 8, 6
14:33 pmoreau: It should be
14:33 karolherbst: that would create a fun graph
14:34 karolherbst: where 2 => 3 would be a cross edge
14:34 pmoreau: I think you can start wherever you want, just might not be as efficient
14:34 karolherbst: yeah
14:34 karolherbst: well
14:34 pmoreau: Yeah right! 2->3 cross would be fun :-D
14:34 karolherbst: what is important is the DFS tree we build initially
14:35 karolherbst: and the order on how we traverse through that
14:35 karolherbst: so
14:35 pmoreau: Right
14:35 karolherbst: I don't really know which order we do
14:35 karolherbst: but it seems like we do 5, 7, 8, 9, 6
14:35 karolherbst: or does it do 5, 6, 8, 9, 7?
14:35 pmoreau: Just print it in buildLiveSet or some other function which visits BBs
14:36 karolherbst: well, currently I just planned to work on the OpenCL CTS and fix the easier issues there :D
14:36 karolherbst: actually I wanted to ask what I have to pass into CL_OFFLINE_COMPILER
14:36 pmoreau: Good question
14:36 pmoreau: Which repo are you using again for the CTS?
14:36 pmoreau: I can tell you what I am using, but only once I get home.
14:37 karolherbst: the cl22_trunk one
14:37 pmoreau: Is it awatry’s fork?
14:37 karolherbst: no
14:38 karolherbst: will use his 12 branch
14:38 pmoreau: cl12_trunk from his repo is what I am currently using.
14:39 karolherbst: doesn't compile >(
14:39 karolherbst: error: narrowing conversion of ‘18446744071562067968’ from ‘long long unsigned int’ to ‘Long {aka long int}’ inside { } [-Wnarrowing]
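The constant in that error message is easier to read in hexadecimal. The sketch below only inspects the number quoted by the compiler; the helper `as_signed64` is a made-up name, and what the CTS source intended with that constant is not known from this log.

```python
def as_signed64(value):
    """Reinterpret an unsigned 64-bit bit pattern as a signed 64-bit integer."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


constant = 18446744071562067968  # the value quoted in the g++ error

print(hex(constant))           # 0xffffffff80000000
print(as_signed64(constant))   # -2147483648
```

The value is 2^64 - 2^31, i.e. the bit pattern of -2147483648 sign-extended to 64 bits. As an unsigned quantity it does not fit into a signed 64-bit `long`, which is why the compiler rejects the conversion inside braces as narrowing.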
14:44 karolherbst: duh....
14:44 pmoreau: :-/ it did compile when I tried. I think I have some custom patches, but not sure what they were for.
14:44 karolherbst: pmoreau: https://gist.githubusercontent.com/karolherbst/7381a6cbdba4a815ad6d3bea9da9880e/raw/cde1aac2904b0a0cd213a332a1065351ea81633b/gistfile1.txt
14:44 karolherbst: ...
14:44 pmoreau: Yeah, that rings a distant bell
14:45 boot1: /join #power9
14:45 pmoreau: Probably went with an #undef issubnormal
14:48 karolherbst: the OpenCL CTS is kind of small
14:49 karolherbst: pmoreau: how do I run those tests? :D
14:49 pmoreau: Well, if you run the script it will tell you something like 50 tests. But test_basic, which is one of them, itself is made of ~30-50 tests. So, it’s a bit deceptive.
14:50 karolherbst: what scripts?
14:50 karolherbst: ohh, in test_conformance?
14:50 pmoreau: In test_conformance, there is a “run_conformance.py” script
14:50 karolherbst: :D that starts great
14:50 karolherbst: bash: ./run_conformance.py: /usr/bin/python^M: bad interpreter: No such file or directory
14:51 pmoreau: Otherwise, so far I mainly run the test_basic binary manually. Plenty of things to try already.
14:54 karolherbst: the hell
14:54 karolherbst: it doesn't run because of silly issues
14:54 karolherbst: ==> ERROR: test file (/home/kherbst/git/OpenCL-CTS/test_conformance/computeinfo/computeinfo) does not exist. Failing test. ...
14:55 karolherbst: guess no out of tree builds
14:56 pmoreau: Can you run the test_basic by hand? (somewhere inside the build directory)
14:56 karolherbst: mhh, it runs now after I build it in the top level dir
14:57 karolherbst: ui, failing subtests
14:57 karolherbst: nice
14:57 karolherbst: but it runs!
14:57 karolherbst: fancy
14:58 karolherbst: intmath_long, intmath_long2 and intmath_long4 are failing
14:58 karolherbst: this sounds interesting
14:58 pmoreau: Indeed
14:58 pmoreau: What are those doing?
14:58 pmoreau: And what is the reason of the failure?
15:01 karolherbst: how long does the entire run take?
15:02 karolherbst: pmoreau: LONG_MAD int test failed
15:03 karolherbst: :D
15:03 karolherbst: oh wait
15:03 pmoreau: I don’t think I have ever done a full run. Plus there are enough tests running into the loop bug that I did not want to wait 20sec for the GPU to recover after those tests before continuing.
15:05 karolherbst: I think we do something stupid in lowering 64bit mads
15:05 karolherbst: 5: shl u32 $r2 u64 $r255 0x0000000000000003 $r0 ....
15:06 karolherbst: weird
15:10 karolherbst: pmoreau: LONG_MAD int test failed 712c4e5005cce8ec != 712c4e4f05cce8ec
15:10 karolherbst: first is expected
15:10 karolherbst: look, a carry bit went missing
15:11 karolherbst: our result is 0x1 << 32 too low
15:12 pmoreau: Indeed
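The two values printed by the test differ by exactly 0x100000000, i.e. by one in the upper 32-bit word, which is what a lost carry out of the lower word looks like. The sketch below is a generic illustration of splitting a 64-bit multiply-add into 32-bit pieces. It is not the lowering used in codegen, and the function name and the `keep_carry` switch are invented for this example.

```python
MASK32 = 0xffffffff
MASK64 = 0xffffffffffffffff


def mad64_split(a, b, c, keep_carry=True):
    """Compute (a * b + c) mod 2**64 from 32-bit halves of the operands."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    c_lo, c_hi = c & MASK32, (c >> 32) & MASK32

    low_product = a_lo * b_lo                 # full 64-bit product of the low halves
    low_sum = (low_product & MASK32) + c_lo   # may overflow into bit 32
    carry = (low_sum >> 32) if keep_carry else 0

    high = (low_product >> 32) + a_lo * b_hi + a_hi * b_lo + c_hi + carry
    return ((high & MASK32) << 32) | (low_sum & MASK32)


# Inputs chosen so that the low-word addition overflows.
a, b, c = 0xffffffff, 1, 1
good = mad64_split(a, b, c)
bad = mad64_split(a, b, c, keep_carry=False)
print(hex(good), hex(bad), hex(good - bad))
```

With the carry dropped, the result comes out one unit too low in the upper word, the same shape of error as in the failing LONG_MAD test.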
15:12 karolherbst: https://gist.githubusercontent.com/karolherbst/bcfbcf098491399855f928df062c34ca/raw/593c1ef73f81b503729b923b78da40901987b8ac/gistfile1.txt
15:14 pmoreau: If it’s the mad at lines 18-19, it should propagate the carry bit
15:14 karolherbst: right, it should
15:15 karolherbst: $r5 should be the low bits
15:16 karolherbst: I think something went wrong with the encoding of that mad
15:27 karolherbst: uhm, wait
15:39 karolherbst: I think it doesn't work like this there
15:43 karolherbst: I don't think we need that carry here
15:44 karolherbst: well, we kind of do, but different
15:47 pmoreau: What’s the kernel for that test BTW?
15:47 karolherbst: dst[tid] = srcA[tid] * srcB[tid] + srcC[tid];
15:47 pmoreau: OK
15:48 karolherbst: I could check what nvidia does
15:48 karolherbst: now, how do I compile a cl kernel to ptx again :D
15:48 pmoreau: Look in the logs? :-D
15:48 karolherbst: in fact
15:48 pmoreau: Or write a CUDA kernel instead. For those, it should be identical to an OpenCL kernel
15:49 karolherbst: duh
15:49 karolherbst: there is no mad64 for integers in ptx
15:49 karolherbst: oh wait, there is
16:08 karolherbst: pmoreau: guess what.. they use xmad on maxwell
16:08 pmoreau: That good old xmad!! :-)
16:08 karolherbst: I mean. why not, using a 16x16 mad operation to do 64bit mads sounds totally sane
16:09 karolherbst: now
16:09 karolherbst: compiling to sm30
16:09 pmoreau: That’s interesting though: why not use the 32x32 opcode they already have? Or are they using xmad for 32x32 as well?
16:10 karolherbst: ohh and I didn't compile any cuda or cl files, I just wrote the ptx
16:10 karolherbst: https://gist.githubusercontent.com/karolherbst/f9ef7510502d19e2c84e3943e7ebf558/raw/ec2a62abd141610613e9c1d5266a2a2c79145ce7/gistfile1.txt
16:10 karolherbst: pmoreau: :D
16:10 karolherbst: do you really really really ask that question now?
16:10 pmoreau: Whatever floats your boat :-D I would have gone with CUDA, but fine
16:10 karolherbst: I literally put MAD in there and they produce imad + imul on opt level 0
16:10 karolherbst: ...
16:11 karolherbst: iadd
16:11 karolherbst: they don't touch imad
16:11 karolherbst: I am sure this instruction is crappily slow
16:12 karolherbst: pmoreau: so here is the deal: we do 4 mads, they do 3 mads + mul + add
16:12 karolherbst: well they start with the mul though
16:57 feaneron: i know this might be a too vague and naive question, but what are the challenges of automatic reclocking (on supported hw), and why is it so hard?
17:01 pmoreau: feaneron: 1) You need to be able to reclock reliably, that’s pretty hard in itself as you need to work around broken VBIOSes, find a reclocking formula that works for **all** cards, not just for 99% of them.
17:02 karolherbst: meh
17:02 karolherbst: pmoreau: I literally do the same as nvidia, same result
17:02 imirkin_: pmoreau: karolherbst: i've been aware that IMAD is perhaps not a good idea, fyi
17:02 karolherbst: well, nvidia uses it prior to maxwell
17:02 karolherbst: also for mad64
17:03 pmoreau: 2) Driving the fan should work reliably, and 3) be conservative on when to downclock, to not damage the card.
17:03 imirkin_: feaneron: to stress the point, if you reclock manually and 99.9% of reclocks work, that's fine. if you reclock automatically, the frequency of reclocks is much higher so the chances of breakage is higher
17:03 karolherbst: imirkin_: in an OpenCL CTS test, we get 712c4e5005cce8ec vs 712c4e4f05cce8ec for mad64
17:03 pmoreau: Also that
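The argument about reclock frequency can be put into numbers. The sketch below assumes every reclock succeeds independently with the same probability, which is a simplification made for this example only; the 99.9% figure is the one used in the conversation, not a measured rate.

```python
def p_any_failure(success_rate, reclocks):
    """Probability that at least one of `reclocks` independent reclocks fails."""
    return 1.0 - success_rate ** reclocks


# A handful of manual reclocks versus the many an automatic governor would do.
for count in (1, 10, 1000, 10000):
    print(count, round(p_any_failure(0.999, count), 4))
```

Under that assumption a single reclock fails once in a thousand tries, but after a thousand reclocks the chance of having hit at least one failure is already about 63%.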
17:03 karolherbst: mul64 is fine
17:03 pmoreau: karolherbst: It could be my lowering code which is wrong as well
17:03 imirkin_: karolherbst: should be easy to repro that with a piglit
17:03 imirkin_: er, shader_test
17:04 karolherbst: wait a second
17:05 karolherbst: something else is super wrong
17:05 karolherbst: like super wrong
17:07 feaneron: pmoreau: it'd be necessary to have a plethora of hw to test on then... :(
17:07 feaneron: well, thanks for answering
17:11 karolherbst: imirkin_: do we have to do the Split64BitOpPreRA pass always?
17:11 karolherbst: :D
17:11 karolherbst: if I disable opts, then the test passes
17:11 karolherbst: weird
17:11 karolherbst: but I thought so
17:11 imirkin_: karolherbst: yeah
17:11 karolherbst: the carry gets moved
17:12 karolherbst: imirkin_: okay, so we should set the opt level to 0, right?
17:12 karolherbst: not 1
17:12 imirkin_: karolherbst: it's not an opt pass
17:12 karolherbst: still
17:12 karolherbst: it isn't run if we do NV50_PROG_OPTIMIZE=0
17:12 imirkin_: oh. hold up.
17:12 imirkin_: that is not the pass i was thinking of.
17:12 karolherbst: ;)
17:13 imirkin_: there's a post-RA one
17:13 imirkin_: which runs *always*
17:13 imirkin_: the pre-ra one is just for fun.
17:13 karolherbst: ohhh
17:13 karolherbst: mhh
17:13 karolherbst: then the post ra one is wrong
17:14 imirkin_: that's likely.
17:14 karolherbst: and the pre ra as well
17:14 imirkin_: those passes only carefully handle the situations they're ready for
17:14 imirkin_: they're not generic
17:14 karolherbst: and something is broken in a way, that OPTIMIZE=0 even hangs the gpu
17:15 pmoreau: Split64BitOpPreRA should always be run. I don’t remember if I sent a patch for that yesterday or just planned to do so.
17:15 imirkin_: probably an address calculation gone wrong
17:15 karolherbst: well
17:15 karolherbst: why does it work with OPTIMIZE=1 then
17:15 imirkin_: pmoreau: if that's the case, then the post-RA one is broken.
17:15 karolherbst: okay, optimize=1 passes
17:16 karolherbst: optimize=2 fails
17:16 karolherbst: and =0 hangs
17:16 karolherbst: nice
17:16 pmoreau: Let me check
17:17 karolherbst: ohhhhhhh fun
17:17 karolherbst: pmoreau: didn't you fix something there the other day?
17:18 pmoreau: Where?
17:18 karolherbst: mhhh
17:18 karolherbst: I thought you did
17:18 karolherbst: weird
17:18 karolherbst: guess what my assumption is
17:18 pmoreau: That doesn’t answer my question :-p
17:18 karolherbst: doesn't matter
17:19 karolherbst: I am sure some opt just does stuff which breaks other stuff
17:19 pmoreau: imirkin_: You can’t split a 64-bit MUL/MAD after RA as you need one extra register to do it.
17:19 karolherbst: yeah
17:19 karolherbst: algebraic opt breaks it
17:20 pmoreau: imirkin_: That’s why there is that pre-RA 64-bit split pass, as I couldn’t put it in the post-RA one.
17:20 karolherbst: so we need to set the opt level to 0 for that pass
17:20 imirkin_: pmoreau: oh. then yeah, it has to definitely happen.
17:21 pmoreau: Yeah, I realised that yesterday, but haven’t sent a patch yet, it looks like.
17:21 karolherbst: pmoreau: pls send :p
17:21 karolherbst: .... that algebraic opt
17:22 pmoreau: I should have it on my branch, I think
17:23 pmoreau: Hum, doesn’t look like it.
17:23 pmoreau: Will add that tonight
17:26 karolherbst: mhhh interesting
17:26 karolherbst: merging a 64bit mul + add breaks it
17:26 karolherbst: but why though
17:27 pmoreau: When does it get merged?
17:28 pmoreau: Is it before or after the split?
17:28 pmoreau: *Split64BitOpPreRA
17:28 karolherbst: before
17:29 karolherbst: mhh, it can't be that broken
17:30 karolherbst: anyway, I think we lose that carry somehow
17:31 pmoreau: Come back carry!! We love you!!
17:31 karolherbst: ohhhhh wait
17:31 karolherbst: gnar...
17:31 karolherbst: that is the plain and stupid kind of missing that carry
17:31 pmoreau: ?
17:32 karolherbst: wait a sec, need to verify something
17:34 pmoreau: Ah, https://github.com/doe300/VC4C is also using SPIRV-LLVM to generate the SPIR-V binaries out of OpenCL C. I was wondering what they used.
17:39 imirkin_: karolherbst: btw, we could also be getting it right in the nvir, but wrong at emission time
17:39 imirkin_: esp stuff like carry bits can get ... missed
17:39 karolherbst: I checked with nvdisasm
17:39 imirkin_: .CC == set carry, .X = add carry
17:39 karolherbst: I know
17:39 karolherbst: I think it looked fine
17:39 imirkin_: ok :)
17:39 karolherbst: let me take another look
17:40 karolherbst: yeah, it looks fine
17:55 karolherbst: it is weird though, nvidia uses different ops than we do, but they should be equally correct
18:04 karolherbst: weird, now it works with nvidia's opcodes
18:09 karolherbst: okay yeah, so our version is wrong
18:10 karolherbst: pmoreau: https://github.com/karolherbst/mesa/commit/a0709182638faca2ce046fb488651102a0935b64
18:10 karolherbst: I will still rework it so that it is closer to what we did
18:10 karolherbst: well
18:10 karolherbst: the code
18:13 karolherbst: some other opt pass is still interfering. odd
18:15 karolherbst: we should do more carry stuff :D
18:16 karolherbst: local_cse is the second cause
18:16 karolherbst: imirkin_: do you know off hand why localCSE might mess up things here?
18:23 karolherbst: duh!
18:25 karolherbst: that ain't funny
18:27 karolherbst: I think it might end up reordering the carries
18:29 karolherbst: something is odd and I don't know what
18:30 pmoreau: karolherbst: If you just split https://phabricator.pmoreau.org/diffusion/MESA/browse/nouveau_spirv_support/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp;0bd9ba0d8d1affd6d12e6b88a971dcd8f6ecd5f5$2451 into a mul.hi and an add, the add having the carry, would that work?
18:30 karolherbst: pmoreau: maybe
18:30 pmoreau: Maybe MAD does not accept a carry, or we forget to emit it?
18:30 karolherbst: I will look into that tomorrow
18:30 pmoreau: I’ll have a try tonight
18:30 karolherbst: I think it does and I doubt we forget
18:31 karolherbst: but maybe
18:31 karolherbst: nvdisasm prints this: IMAD.U32.U32.HI.X R1, R10, R2, R0;
18:31 karolherbst: so it should work, right?
18:31 karolherbst: but maybe it works differently?
18:32 karolherbst: maybe the carry is added after the multiplication?, should it matter?
18:32 pmoreau: It could be that it works differently if doing a .hi
18:32 karolherbst: maybe it is hw broken?
18:32 karolherbst: maybe
18:32 karolherbst: I could imagine it adds it to the lower bits
18:32 karolherbst: because, it is still important there
18:32 karolherbst: even if like never
18:33 karolherbst: it really really only matters if the result is something like 0xffffffff to begin with on the low bits
18:33 karolherbst: and at that point, meh
18:33 karolherbst: although it is int stuff...
18:33 karolherbst: precision kind of matters there
18:33 karolherbst: yeah, I think you are right
18:33 karolherbst: the carry is applied on the lower bits still
18:34 karolherbst: I give up for today, it makes no sense
18:38 pmoreau: karolherbst: Someone did a small comparison between imul/imad on Kepler vs Maxwell: https://devtalk.nvidia.com/default/topic/804281/cuda-programming-and-performance/maxwell-integer-mul-mad-instruction-counts/post/4423835/
18:42 imirkin_: yeah, seems like the combination of HI and X could be its undoing
18:42 imirkin_: although ... hm
18:42 imirkin_: it could also be that we're doing stuff wrong
18:42 imirkin_: if this is decomposed, the carry needs to actually be 0x100000000 right?
18:58 pmoreau: It’s the carry from the low bits, so if adding to the high bits, that should only be 0x1, not 0x100000000, right?
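The carry argument above can be sketched in plain integer arithmetic. This is only a model of the math, not of what the hardware or the nouveau lowering actually does; the function name `mad64_via_halves` and its parameters are hypothetical and chosen for illustration. It shows that when a 64-bit `a*b + c` is built from 32-bit pieces, the carry out of the low-word add enters the high word as 1, which is the same thing as 0x100000000 in the full 64-bit value.

```python
MASK32 = 0xFFFFFFFF

def mad64_via_halves(a, b, c_lo, c_hi):
    """Compute a*b + c (mod 2**64) for 32-bit a, b and a 64-bit c given as
    two 32-bit halves. Hypothetical helper: models the arithmetic only."""
    prod = a * b                       # full 64-bit product
    lo_sum = (prod & MASK32) + c_lo    # low-word add, may overflow 32 bits
    carry = lo_sum >> 32               # carry out of the low word: 0 or 1
    hi = ((prod >> 32) + c_hi + carry) & MASK32  # carry enters the high word as 1
    return hi, lo_sum & MASK32, carry

hi, lo, carry = mad64_via_halves(0xFFFFFFFF, 1, 1, 0)
print(hex(hi), hex(lo), carry)
```

If the carry were dropped, or consumed before it was written, the high word would come out one too small in exactly the cases where the low-word add overflows.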
18:59 imirkin_: i haven't looked at the code
18:59 imirkin_: either way, perhaps .HI.X doesn't do what we think
18:59 imirkin_: like ... what does it do at all (forget the .X bit of it)
18:59 imirkin_: i.e. i do IMAD.HI a, b, c
18:59 imirkin_: what is the result?
19:00 imirkin_: (a*b+c) >> 32?
19:00 imirkin_: (a*b >> 32) + c?
19:01 pmoreau: Good question
19:01 pmoreau: I would assume the latter, ((a*b) >> 32) + c
19:01 karolherbst: I found something different
19:01 karolherbst: Why would the last deadcodeelim change the test result ;)
19:01 pmoreau: But I can’t back that
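The two readings of IMAD.HI floated above give different answers for some inputs, which a small model makes concrete. Neither function is confirmed hardware behaviour, since the chat leaves the question open; both function names are hypothetical and the sketch assumes unsigned 32-bit operands.

```python
MASK32 = 0xFFFFFFFF

def imad_hi_add_then_shift(a, b, c):
    """Reading 1 (assumption): (a*b + c) >> 32."""
    return ((a * b + c) >> 32) & MASK32

def imad_hi_shift_then_add(a, b, c):
    """Reading 2 (assumption): ((a*b) >> 32) + c."""
    return (((a * b) >> 32) + c) & MASK32

# a*b is exactly 2**32 here, so the two readings disagree.
a, b, c = 0x10000, 0x10000, 1
print(imad_hi_add_then_shift(a, b, c), imad_hi_shift_then_add(a, b, c))
```

Comparing the output of a test kernel against both models for inputs like these would be one way to tell which reading matches the hardware.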
19:02 pmoreau: The carry gets removed as dead code? oO
19:02 karolherbst: pmoreau, imirkin: please explain https://gist.githubusercontent.com/karolherbst/4511a54fca0af1614e35d90820b225b9/raw/97e9e93a9f138e33599828af57e760ef7d45dbb6/gistfile1.txt
19:03 karolherbst: the upper works, the lower fails
19:05 karolherbst: uhh
19:05 karolherbst: OOR_ADDR warp error
19:05 karolherbst: weird
19:06 pmoreau: What’s this mul/mad line 13-16 in the lower part supposed to be?
19:06 karolherbst: dead code
19:07 pmoreau: But how did it get there?
19:08 karolherbst: the weird thing is, with the working one I don't get that OOR_ADDR warp error
19:08 pmoreau: Could you give the disassembly of the binary for each please?
19:09 karolherbst: nvdisasm or envydis?
19:09 pmoreau: Let’s go with nvdisasm
19:10 karolherbst: https://gist.githubusercontent.com/karolherbst/4511a54fca0af1614e35d90820b225b9/raw/2a98de5492a1bfd216ccabcc93f685876bddf5e4/gistfile2.txt
19:11 pmoreau: I had an issue yesterday when cleaning the code, where a merge would trigger an assert. The printed IR, right before the crash, was the **exact** same with and without my changes. Turned out I had made a typo somewhere, and I was creating regs of size 0. That error was absolutely invisible in the printed IR.
19:11 pmoreau: Thanks
19:11 karolherbst: ;)
19:12 karolherbst: yeah, I had an issue once where the jump had a wrong address
19:12 karolherbst: of course the print doesn't show that either
19:14 pmoreau: I have no idea why the “fail” version fails. Unless some of the dead mul/mad have side effects
19:15 karolherbst: you know what would be weird?
19:15 karolherbst: if the carry isn't reset
19:15 pmoreau: But you manually “reset” it here: /*00b8*/ IADD R0.CC, R2, c[0x0][0x10];
19:15 karolherbst: ;)
19:15 karolherbst: do we?
19:15 karolherbst: how sure are we that we do?
19:16 pmoreau: Isn’t it enough to say “set the carry bit for me” to get it set, in which case it would overwrite its old value?
19:16 karolherbst: I mean pretty much only that load at R0 can be affected by the dead code, right?
19:16 pmoreau: Yes
19:16 karolherbst: the store shouldn't care much about it
19:17 pmoreau: Well, if we have some issues about the carry bit not being reset, the store will have the same issue. And even the second load would.
19:18 pmoreau: bbiab, need to get some food.
19:19 karolherbst: fun
19:19 karolherbst: it is the store which gets messed up
19:47 annadane: so i'm currently still on the hunt for what exactly makes nouveau freeze with plasma and i found this, don't know if it works for everyone https://bugs.freedesktop.org/show_bug.cgi?id=92077#c22
21:06 imirkin_: karolherbst: do you still have questions?
21:06 imirkin_: or did you work it out?
21:08 imirkin_: 5: shl u32 $r2 u64 $r255 0x0000000000000003 $r0 (8)
21:09 imirkin_: that just seems bad all around...
21:11 karolherbst: well, I have no idea what is wrong there
21:11 karolherbst: and why does it look bad?
21:14 imirkin_: $r255 as the dest...
21:14 karolherbst: is it the dest?
21:14 karolherbst: I thought $r2 would be the dest
21:14 imirkin_: i don't think our RA does well with unused outputs
21:15 imirkin_: oh f me
21:15 imirkin_: yes, sorry
21:15 imirkin_: that's the SM35 SHF.L variant
21:15 karolherbst: well, currently I am more curious about this: https://gist.githubusercontent.com/karolherbst/4511a54fca0af1614e35d90820b225b9/raw/2a98de5492a1bfd216ccabcc93f685876bddf5e4/gistfile2.txt
21:15 karolherbst: why is the top one good
21:15 karolherbst: and the bottom one bad
21:16 karolherbst: well, if we assume those get executed on the hw for real
21:17 imirkin_: yeah i def don't have time to analyze this
21:17 imirkin_: sorry
21:17 karolherbst: well, the thing is, deadCodeElim removes 4 instructions and I don't see how those 4 changed anything at all
21:17 karolherbst: but with those 4 instructions, we get OOR_ADDR errors, any ideas?
21:18 karolherbst: I am sure looking at the shader won't help
21:23 karolherbst: pmoreau: maybe the sched opcodes are wrong?
21:23 pmoreau: Mayyyybe? Would be surprising, but maybe?
21:24 karolherbst: any other ideas?
21:29 karolherbst: imirkin_: do you know what was the default sched info for maxwell?
21:29 karolherbst: or pascal
21:29 imirkin_: 0x7fe
21:29 karolherbst: thx
21:30 imirkin_: (or 0x7f0? i forget. check hakzsam's commit which added it)
21:30 imirkin_: 0x7e0
21:30 imirkin_: that's my final answer.
21:30 karolherbst: are you sure about it? Don't you want to change it back to 0x7fe?
21:30 imirkin_: :)
21:30 imirkin_: i'll use a lifeline
21:30 imirkin_: ask the audience ;)
21:31 karolherbst: 0x7e0 seems to be right though
21:32 karolherbst: ... it is still inside the code
21:32 karolherbst: it just gets overwritten
21:33 karolherbst: ahh, there is a NV50_PROG_SCHED variable
21:33 imirkin_: oh right yea
21:33 imirkin_: i knew that
21:34 karolherbst: !!!!
21:34 karolherbst: the hell
21:34 karolherbst: it's the sched stuff
21:35 karolherbst: pmoreau: you know what? maybe those other issues are also kind of related to the sched stuff...
21:35 karolherbst: oh well
21:35 karolherbst: not your issues
21:35 karolherbst: :D
21:35 karolherbst: and the carry bit also got lost due to bad scheds
21:36 karolherbst: who would have known those sched opcodes are so damn important
21:36 karolherbst: hakzsam: any ideas regarding scheduling on maxwell+ and carry bits?
21:37 pmoreau: karolherbst: It’s really the sched stuff?? Oo
21:37 karolherbst: yes
21:37 imirkin_: i wonder if he forgot to track the carry bit
21:37 imirkin_: it's an easy one to forget
21:37 pmoreau: Dang!
21:37 karolherbst: pmoreau: even our original lowering code is correct with disabled sched
21:37 karolherbst: :D
21:37 pmoreau: I would have never thought of that one
21:38 pmoreau: I’m proud of myself for getting the code right, then 8-)
21:38 karolherbst: :D
21:38 karolherbst: good job pmoreau!
21:38 pmoreau: Yeah! \o/
21:38 pmoreau: Well done for tracking it down!!
21:38 karolherbst: but please push or post the patch moving that to opt level 0 though
21:39 pmoreau: I still have to get home ;-)
21:39 karolherbst: ohh, right
21:48 karolherbst: imirkin_: well, right, it seems like the CC isn't even considered in the sched calculator
21:49 karolherbst: only in recordWr I think, mhhh
21:50 karolherbst: ohh wait, it is
21:50 karolherbst: odd
21:55 hakzsam: karolherbst: link?
21:56 karolherbst: well, here is the output at least: https://gist.githubusercontent.com/karolherbst/083eef9913aa828b5e108a9c70795da3/raw/d416673682da83da8cf98af247936bb6fcfbefb4/gistfile1.txt
21:56 karolherbst: something is wrong with the sched codes and CC flag
21:57 karolherbst: as the result is correct if calculating those is disabled
21:58 hakzsam: it does work with NV50_PROG_SCHED=0 then?
21:58 karolherbst: yes
21:58 hakzsam: mmh
21:58 karolherbst: and looking at the result, it is kind of obvious that it is the carry bit
21:59 karolherbst: I guess it didn't wait long enough to be set
21:59 hakzsam: yeah, it's probably something like that
21:59 karolherbst: best idea on how to debug this?
21:59 hakzsam: is there a bug report somewhere?
21:59 karolherbst: write ptx files until I find a pattern?
22:00 karolherbst: uhm, no
22:00 karolherbst: I just run some OpenCL CTS tests
22:00 hakzsam: okay
22:00 hakzsam: I have to remember that sched stuff first :)
22:01 hakzsam: it was approximately one year ago..
22:01 karolherbst: it seems to only affect tests with imad.hi.x or something
22:01 karolherbst: yeah, I think so
22:02 karolherbst: maybe all
22:02 karolherbst: dunno
22:02 hakzsam: mmh, the carry stuff should already work actually
22:02 hakzsam: if I remember the thing correctly
22:02 karolherbst: yeah, seems that way
22:02 karolherbst: but well
22:03 karolherbst: apparently there is an issue somewhere
22:03 hakzsam: if it does work with NV50_PROG_SCHED=0, yeah I do agree
22:03 karolherbst: maybe the CC bit is written one cycle later?
22:04 karolherbst: or something like this
22:04 hakzsam: let me clone envytools on this new box
22:04 karolherbst: ohh, do you have some tool to test it?
22:04 karolherbst: well
22:04 karolherbst: keep in mind, that I do those tests on a pascal GPU
22:05 hakzsam: shouldn't matter, same ISA
22:05 karolherbst: well
22:05 karolherbst: latencies might be different, but right, shouldn't matter
22:05 karolherbst: well
22:05 karolherbst: I could check
22:06 karolherbst: or I don't
22:06 hakzsam: if you have a maxwell plugged in, yeah
22:06 karolherbst: on sm50 they use xmad
22:07 karolherbst: no idea why they would do that for a mad64 thing... but apparently doing mad16s is faster
22:07 karolherbst: iadd3, interesting
22:08 hakzsam: yeah, xmad...
22:09 hakzsam: all the flags look complicated to RE
22:09 karolherbst: well, I think we kind of understood a few of those flags
22:09 karolherbst: mhh
22:09 karolherbst: well
22:09 hakzsam: yeah, with time and patience, that's definitely doable
22:16 hakzsam: karolherbst: can you show me the output from envydis?
22:17 karolherbst: hakzsam: https://gist.githubusercontent.com/karolherbst/c97cfec4143827113dfa50dd2c2cbe33/raw/0d64d77e02984e8b246155bf5715a4a7566fd24b/gistfile1.txt
22:17 hakzsam: thanks
22:24 hakzsam: not abvious..
22:24 hakzsam: *obvious
22:30 hakzsam: karolherbst: I think it's related to shf
22:33 hakzsam: karolherbst: https://hastebin.com/abexaxesib can you try this?
22:34 karolherbst: but why do you think it is related to shf?
22:35 karolherbst: anyhow, it doesn't help
22:35 hakzsam: the patch doesn't help?
22:35 hakzsam: because for 64 instructions, latency shouldn't be 6
22:35 karolherbst: well, it doesn't
22:35 karolherbst: I am super sure it is somehow related to an instruction doing CC stuff
22:36 karolherbst: look at that for example: https://gist.github.com/karolherbst/4511a54fca0af1614e35d90820b225b9
22:37 hakzsam: without sched codes, it's hard to tell
23:20 pmoreau: karolherbst: Finally home, what was I supposed to do? :-D
23:21 karolherbst: change the opt level of the split pass
23:21 pmoreau: Right, that one I remember. Anything else?
23:25 karolherbst: no idea
23:25 karolherbst: hakzsam: any other suggestions regarding that issue?
23:38 karolherbst: I get the feeling, that the writes of CC aren't recorded in the sched thing
23:39 karolherbst: well, they are apparently