02:24 karolherbst: the heck...
02:24 karolherbst: when I test against 1.0/x in glsl, it just works
02:25 karolherbst: how super annoying
02:28 HdkR: woo? :)
02:28 karolherbst: I am sure something is totally broken regarding call
02:29 karolherbst: wait a second...
02:33 karolherbst: wasn't there somebody posting a patch fixing something regarding that?
02:35 karolherbst: imirkin: do you remember anything here?
03:00 HdkR: karolherbst: Implementation of call is broken? So if it was inlined it would theoretically work?
03:00 karolherbst: no idea
03:01 karolherbst: I only know that disabling optimizations changes the result
03:05 karolherbst: nononononono... please no
03:06 karolherbst: nono
03:06 karolherbst: I have a theory
03:06 karolherbst: and I don't like it
03:11 karolherbst: DQSNSADJHASJD!DN!WM!<DN<MWND!<NWD!N<WD< .... seriously
03:11 karolherbst: the heck
03:12 HdkR: lol
03:12 karolherbst: I'll show you the workaround
03:12 karolherbst: it isn't funny
03:13 HdkR: I look forward to it :)
03:13 karolherbst: I literally added add $r0 $r0 0x0 and add $r1 $r1 0x0 _directly_after_ the call, and then it works
03:14 karolherbst: stupid maxwell with its stupid sched data thing
03:14 karolherbst: where not even the thing which should always work works
03:16 HdkR: Oh, so it was a RAW hazard?
03:16 karolherbst: yes...
03:16 HdkR: Got to handle that scheduling ;)
03:17 HdkR: karolherbst: Do you have it part of your call ABI for sched of result data?
03:18 karolherbst: no idea
03:18 karolherbst: and at this time I don't care, I will sleep now
03:18 HdkR: xD
03:21 HdkR: Should probably make it so sb0 must be drained to safely get the result or something
06:42 karolherbst: skeggsb: do you know if there is something weird about jcal on the maxwell ISA?
06:42 karolherbst: or rather RET
06:54 HdkR: karolherbst: Is it behaving strangely? :)
06:54 karolherbst: no idea
06:56 HdkR: Are you testing gm10x?
06:57 karolherbst: actually on gp107
06:57 HdkR: ah
06:59 HdkR: karolherbst: This the weirdness with the call you were concerned about?
07:01 karolherbst: well yeah
07:01 karolherbst: weird thing is, if I add those add nops inside the builtin function it doesn't work
07:01 karolherbst: but if I add them after the call, it does
07:03 HdkR: karolherbst: What if you full drain on your ret inside the call?
07:04 karolherbst: no idea how to do that or if we can
07:05 HdkR: It still has sched so you should be able to?
07:07 karolherbst: well for this I would have to know how those sched opcodes actually work :D
07:08 HdkR: lol
07:09 HdkR: I guess this is why your asm file has zero for all of these :P
07:09 karolherbst: yes
07:10 karolherbst: in theory this should just work
07:10 karolherbst: apparently, it doesn't
07:10 karolherbst: or maybe those aren't the real issue
07:10 karolherbst: hakzsam: are you aware of anything stupid regarding scheds and call/ret instructions?
07:14 HdkR: karolherbst: A couple of the other returns in that asm file is using different sched data. Maybe if you nab those?
07:14 karolherbst: yeah... will try to get at least the ret right and see what that gives me
07:45 karolherbst: HdkR: well anyway, with this workaround added I can just go ahead and finish porting the rsq code
07:45 karolherbst: I would even be crazy enough to merge it with that if we don't find anything else that works reliably
07:46 skeggsb: i would expect RET would neet to have a stall count high enough to ensure the function return values are ready
07:46 skeggsb: need*
07:46 karolherbst: skeggsb: I tried 0xf
07:46 karolherbst: skeggsb: but apparently it didn't really work out
07:47 karolherbst: I just added a plain and simple +0 after the call and this seems to be enough for odd reasons
07:47 HdkR: Ah, so the sched stuff still didn't work?
07:47 karolherbst: it even depends on the value actually set within the function
07:47 karolherbst: HdkR: exactly
07:47 HdkR: Curious
07:47 skeggsb: would be real nice to find the proper solution though instead of hacking around it, who knows what other issues are caused by the real problem
07:47 karolherbst: skeggsb: https://github.com/karolherbst/mesa/commit/3a1c5fc3c7d723deb5b90d45c9308ed089c295d2
07:48 karolherbst: skeggsb: sure
07:48 karolherbst: skeggsb: but for now I just want to run the CTS and see how well that works
07:48 karolherbst: skeggsb: anyway, with that fp64 taken care of, 2 fails remaining on my pascal GPU
07:48 karolherbst: (hopefully)
07:49 karolherbst: skeggsb: if you give me instructions on what sched opcodes to try, I am happy to try
07:49 karolherbst: but usually if I mess around with those I don't really know if I do something correctly or not
07:49 skeggsb: how is it working with "0" as stall count?!
07:49 skeggsb: like, at all, not just the ret thing
07:49 karolherbst: 0 is some magic default value
07:50 karolherbst: check the generated code
07:50 skeggsb: ah
07:50 skeggsb: sneaky
07:50 karolherbst: yeah
07:50 karolherbst: if we disable scheds completely we default to that as well
07:50 karolherbst: and usually it works better
07:50 karolherbst: also much slower
07:51 karolherbst: so if you debug sched opcodes and stick that st 0x0 in it and it suddenly works you know what to do :)
07:51 karolherbst: but in this case, not even that worked
07:51 karolherbst: but
07:52 karolherbst: I don't see any reason why that sched stuff isn't the cause here
07:53 karolherbst: skeggsb: but in the end we want to be able to set st 0x0 to declare dual issuing
07:53 karolherbst: I think
07:54 karolherbst: so we might want to change that and have a magic value being "default" or something
07:54 karolherbst: nice, "KHR-GL45.pipeline_statistics_query_tests_ARB.functional_compute_shader_invocations" failed again
07:55 karolherbst: skeggsb: anyway, if you have some ideas, please share :)
08:05 karolherbst: skeggsb: do you know how those sched works out with calls anyway? could you basically put a write barrier on the cal and wait on that in the next instruction?
08:06 skeggsb: i seriously doubt that'd work
08:06 skeggsb: but no idea
08:06 skeggsb: "see what nvidia do" ;)
08:06 karolherbst: yeah, I also doubt that
08:06 karolherbst: :D
08:06 karolherbst: how? :D
08:06 karolherbst: mhh
08:06 karolherbst: although I guess in ptx it should work
08:06 karolherbst: although I am sure they will end up with relative cals there
08:06 skeggsb: find a test where they generate CAL, see what they do with sched?
08:06 karolherbst: shouldn't matter though
08:08 skeggsb: here's a volta trace where they use CALL, might be able to glean something from that
08:08 skeggsb: https://paste.fedoraproject.org/paste/pRyb2DYHxwRm2UIQZSoaYw
08:08 skeggsb: sched codes are the same layout, just take the highest 32-bits and >>= 9
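The extraction skeggsb describes can be sketched roughly as follows. This is a minimal illustration, not envytools code: the function names are made up here, and the split into stall/yield/barrier/wait/reuse fields is an assumption based on the publicly documented maxas control-code layout, not something stated in this log. The yield bit is returned raw; its polarity is not interpreted.

```python
# Hypothetical helper names; the field widths are an assumption taken from
# the public maxas control-code notes (lowest bits first).
FIELDS = (
    ("stall", 4),
    ("yield", 1),
    ("wr_barrier", 3),
    ("rd_barrier", 3),
    ("wait_mask", 6),
    ("reuse", 4),
)


def volta_sched_bits(insn128: int) -> int:
    """Take the highest 32 bits of a 128-bit instruction, then shift right by 9."""
    top32 = (insn128 >> 96) & 0xFFFFFFFF
    return top32 >> 9


def split_fields(ctrl: int) -> dict:
    """Split the control bits into named raw fields (assumed layout)."""
    out = {}
    for name, width in FIELDS:
        out[name] = ctrl & ((1 << width) - 1)
        ctrl >>= width
    return out


if __name__ == "__main__":
    # Synthetic example word, not taken from the trace in the paste.
    ctrl = 0xF | (1 << 4) | (1 << 11)
    insn = ctrl << (96 + 9)
    print(split_fields(volta_sched_bits(insn)))
```

Under these assumptions the decoded fields line up with the "st", "yl", "rd", "wt" and "ru" annotations that envydis prints for sched data.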
08:14 karolherbst: skeggsb: "sched (st 0x0)" for the call :D
08:14 karolherbst: uhm wait
08:14 karolherbst: wrong line
08:15 karolherbst: st 0x5 yl, mhh
08:15 karolherbst: _maybe_ that yield flag might be required here actually
08:15 karolherbst: same for the ret
08:16 karolherbst: + wts
08:16 karolherbst: I played with the yield flag on the ret side
08:16 karolherbst: but maybe the call needs it as well for weirdo reasons
08:17 skeggsb: btw, volta's CALL/RET are basically just BRA... so, they could be different from maxwell
08:19 karolherbst: yeah... that's what my concern here is
08:22 karolherbst: hum "Failed: 10/7769 (0.1%)"
08:23 karolherbst: still 6 fp64 fails
08:24 karolherbst: uhm. yeah ... I need to fix that sched stuff correctly
08:24 karolherbst: input: [16.5, 16.625], result: [16.5, 0.245256] expected: [0.246183, 0.245256] :D hihi
08:34 karolherbst: skeggsb: yeah... "(st 0xf yl)" doesn't help, even if set on both the ret and the jcal
08:35 karolherbst: it doesn't even improve the situation a bit, so I still need to add two more nop adds
08:35 karolherbst: I think something else is actually wrong here
08:36 karolherbst: still related to scheds, but not the sched itself
08:38 karolherbst: ha, nice
08:38 karolherbst: got nvidia to use cal with ptx code :)
08:39 karolherbst: "(st 0xd yl)" for the call
08:39 karolherbst: "(st 0xf yl wt 0x1)" for the ret :(
08:39 karolherbst: hum
08:40 skeggsb: what's setting the barrier there?
08:40 skeggsb: last instruction in the function, or, something that usually uses barriers anyway?
08:42 karolherbst: "(st 0xd yl rd 0x0)" set on STG [R0], R3;
08:42 karolherbst: last instruction before ret
08:43 karolherbst: hum wait
08:43 karolherbst: I think I can tell ptxas to not do sched stuff
08:45 karolherbst: or at least I thought I could
08:48 karolherbst: skeggsb: yeah... I don't see it
08:49 karolherbst: skeggsb: https://gist.github.com/karolherbst/b9246cfd57088a7d086173c56e68cacf
08:51 karolherbst: skeggsb: that one NOP only appears for SM60/SM61 though
08:51 karolherbst: but I don't think this is relevant?
09:11 karolherbst: okay, nice, got some stuff working with a ret val in ptx, maybe that shows it better
09:17 karolherbst: skeggsb: the actual fuck... I think I found it
09:17 karolherbst: but this could be also just our RA being crappy
09:20 karolherbst: no, this is RA being crappy
09:21 HdkR: karolherbst: So not scheduling related, the emitted code just ends up doing bad things in RA?
09:22 karolherbst: well in a way that it sometimes works
09:22 karolherbst: I had the theory that you need a different output register
09:22 karolherbst: but...
09:22 karolherbst: that doesn't hold up
09:24 HdkR: That sounds like it would break with a lot of these other routines as well
09:24 HdkR: Or is it an issue in returning a 2reg value instead of a single reg?
09:26 karolherbst: oh wow
09:27 karolherbst: skeggsb: I thought it might be a smart idea to move the value to $r2 and $r3
09:27 karolherbst: but that gives me "failed to idle channel 2"....
09:27 karolherbst: there is not even a loop inside that shader
09:29 karolherbst: uhm
09:29 karolherbst: maybe I should put in more scheds
09:41 karolherbst: what is ru though
09:42 karolherbst: ohh, reuse
09:43 HdkR: huh
09:52 karolherbst: something is not right
10:02 karolherbst: HdkR: I am literally now putting that builtin into ptx and see what I get...
10:08 karolherbst: why does literally everything GPU related suck at consuming float hex values :(
10:08 karolherbst: or outputting
10:10 karolherbst: "0D7ff0000000000000" in PTX...
10:46 karolherbst: skeggsb: now I use code generated by ptxas and it still doesn't work :(
10:54 mslusarz: karolherbst: have you found the shader you were looking for in that pascal compute trace?
10:54 karolherbst: mslusarz: didn't really get to it, because we already kind of have a solution for the stuff
11:23 karolherbst: skeggsb: could it be that jcal is simply "broken"?
12:14 karolherbst: yeah... I am quite sure it isn't related to the sched opcodes directly
12:38 skeggsb: karolherbst: why are you using jcal anyways?
12:38 karolherbst: skeggsb: the emitter does that
12:38 karolherbst: I guess jcal is an absolute jump?
12:38 skeggsb: yes, what i mean is: why aren't you just using cal?
12:39 karolherbst: does cal work with absolute addresses as well?
12:39 karolherbst: I thought cal is offset only
12:39 skeggsb: let's try this again: why use an absolute address? :P
12:39 karolherbst: dunno, that's how builtins work
12:39 karolherbst: I am currently rewriting it a bit to use relative jumps for builtins
12:39 skeggsb: i'd be surprised if it changes anything, but, it's simple enough to test
12:40 karolherbst: well, kind of. that insn->serial thing + sched handling is super annoying
12:50 mslusarz: karolherbst: does any of those 3 shaders look relevant to your trace? https://paste.fedoraproject.org/paste/26n9kf6ID8ATfU31oKHqIw/raw
12:51 karolherbst: mslusarz: the offsets are wrong
12:51 mslusarz: can you elaborate?
12:54 mslusarz: ok, I added 0x50 and it looks much better
12:55 mslusarz: https://paste.fedoraproject.org/paste/LODoCpmh1wfBRH3UlDtChQ/raw
12:57 karolherbst: hum
12:57 karolherbst: it does rcp here
12:58 karolherbst: but that looks like a fragment shader
12:59 mslusarz: all 3?
12:59 karolherbst: mslusarz: after 00000180 (including) things are looking weird again
12:59 karolherbst: although maybe that's fine actually
12:59 karolherbst: mslusarz: it is one shader
12:59 karolherbst: ohh wait
12:59 karolherbst: the lines :D
12:59 karolherbst: second dunno
12:59 karolherbst: could be tess eval?
12:59 karolherbst: doing nothing
13:00 karolherbst: everything after the exit at 00000158 is unreachable
13:00 karolherbst: wondering how that ends up there
13:01 karolherbst: the third block looks like a vertex shader
13:02 karolherbst: reading from uniforms writing into outputs
13:02 mslusarz: it was generated using very hackish method, so it may be completely wrong
13:03 karolherbst: skeggsb: mhh that absolute jump thing depends on evil magic :(
13:05 imirkin: karolherbst: probably any op after a CALL has to totally wait on things. or we should stick it into the called function.
13:05 karolherbst: imirkin: well, I already wait in the last instruction of the builtin
13:06 imirkin: so then it should be fine... RET is equivalent to a JUMP - it doesn't write registers
13:06 karolherbst: so everything before the ret should be already inside the registers actually :(
13:06 karolherbst: imirkin: maybe I did something wrong though, but here my crappy first try to add sched opcodes: https://github.com/karolherbst/mesa/commit/25035f826681c219fc0445f873dd3a8642fe4b66
13:08 imirkin: i may not have time to look
13:09 karolherbst: where are the builtins located actually? would like to try if a relative jump yields better results (which I doubt)
13:10 imirkin: they are uploaded by the nvc0 code
13:10 skeggsb: they should be located at offset 0 actually
13:10 imirkin: yeah
13:10 skeggsb: i think they're always allocated first
13:10 imirkin: it kinda has to be
13:10 imirkin: otherwise jumping to them won't work
13:11 imirkin: (well, it would require relocation, which would just be an unnecessary step)
13:12 karolherbst: ohh
13:12 karolherbst: weird
13:13 karolherbst: because usually we get things like jcal 0x7fd60
13:13 karolherbst: or is this also a relative jump actually?
13:13 karolherbst: or is envydis just bad at displaying stuff?
13:14 imirkin: i think on maxwell it may be incomplete
13:14 imirkin: check how it decodes with nvdisasm
13:14 karolherbst: same
13:14 karolherbst: 0xe220007fd6000000
13:14 karolherbst: so the 7fd60 is there
13:15 karolherbst: but maybe that's just okay still
13:15 imirkin: maybe it gets placed at the end? but then everything would get messed up when we resize the buffer...
13:15 karolherbst: uhm
13:15 karolherbst: we can set a start pc, no?
13:15 imirkin: ?
13:15 karolherbst: maybe it is absolute to this
13:16 imirkin: there are relative and absolute jump variants
13:16 imirkin: for built-ins, we use the absolute jump variant
13:16 karolherbst: like if we say the shader starts at 0x300 then an offset of -0x300 would reach code at 0x0
13:16 karolherbst: or not?
13:16 imirkin: not.
13:16 imirkin: since different shader positions would then require fixing up relocations
13:16 karolherbst: okay, then the 0x7fd60 doesn't make much sense
13:16 imirkin: which we don't do for absolutes.
13:17 karolherbst: we call that addReloc thing for the builtins though
13:17 imirkin: oh we do? hm -- nvc0_program_library_upload just takes an alloc from the heap
13:18 imirkin: which iirc allocates from the end
13:18 imirkin: so yeah - i guess they are placed at the end
13:18 imirkin: and then the reloc code fixes it all up at upload time
13:18 karolherbst: mhh okay
13:19 imirkin: btw, are you planning on writing those piglits?
13:19 imirkin: for compat clipping
13:20 karolherbst: maybe later. working on CTS fixes seems kind of more important right now
13:21 karolherbst: I am just hoping that the AMD guys are doing that over time and if they stop working on it, maybe I continue with missing stuff, dunno
13:25 imirkin: ok
13:25 imirkin: and are you planning on fixing your "let's do compat" patch for the issues i pointed out?
13:27 karolherbst: imirkin: https://github.com/karolherbst/mesa/commit/6832b8737061688a127bb3db4730b0484ffb6938
13:27 karolherbst: I think this should be enough because the recompilation stuff seems to be already handled
13:27 karolherbst: not 100% sure though
13:27 imirkin: yeah, definitely not.
13:27 imirkin: also the clipping is only for TES
13:27 karolherbst: well, the code looked okay
13:27 karolherbst: okay
13:29 imirkin: hm, having trouble finding the code...
13:30 imirkin: something tricky about it... don't remember what though
13:31 karolherbst: nvc0_check_program_ucps?
13:31 imirkin: hah! looks like it is handled.
13:31 imirkin: surprising!
13:31 karolherbst: yeah
13:31 karolherbst: I was surprised as well
13:33 karolherbst: mhh "CAL.NOINC 0x7fd60; /* 0xe260007fd0800000 */"
13:33 karolherbst: this looks okay
13:33 karolherbst: but it gives me even worse results
13:33 karolherbst: or is the noinc thing wrong?
13:34 imirkin: no, that's fine
13:34 imirkin: without it, it bumps up the address by 8
13:34 imirkin: or something like that
13:35 karolherbst: uhm, okay
13:35 imirkin: what are you using to generate these?
13:35 imirkin: nv50_ir emitter should handle it all fine
13:36 karolherbst: codegen
13:36 imirkin: so then what's the problem?
13:36 karolherbst: well, apparently register RAW hazards
13:37 karolherbst: dunno
13:37 karolherbst: but sticking add $r0 $r0 0x0 after the call "helps"
13:37 karolherbst: two help more than one
13:37 karolherbst: imirkin: fixed compat patch https://github.com/karolherbst/mesa/commit/fe4c6961a7ca549df4f8ed7fc84233654d8dcaf1
13:37 imirkin: could also be something dumb
13:37 karolherbst: yeah
13:37 karolherbst: disabling optimizations also "helps" of course
13:37 karolherbst: but disabling sched makes 0 difference
13:38 imirkin: karolherbst: R-b: me
13:39 karolherbst: thx
13:40 karolherbst: but I have no idea what that dumb thing might be
13:40 karolherbst: I even checked mmt traces
13:40 karolherbst: maybe some super odd shader flag or gpu config thing or whatever?
13:40 karolherbst: no clue
13:40 imirkin: no
13:40 imirkin: or at least - very unlikely
13:40 karolherbst: I am fairly sure it isn't due to the sched opcodes
13:40 imirkin: but sorry, can't really help =/
13:41 karolherbst: maybe rets can't be inside the first half of an instruction block
13:41 karolherbst: maybe silly stuff like that
13:41 karolherbst: ....
13:42 imirkin: more likely is that you have to pad out the remainder of the sched block with NOP's
13:42 karolherbst: I do that already
13:42 imirkin: :)
13:42 karolherbst: and the builtin starts with a sched instruction
13:42 imirkin: both of those shouldn't matter tho
13:43 karolherbst: the cal isn't the problem, I am quite sure I get the correct result in the end if I just stall long enough after ret
13:44 karolherbst: anyhow, all builtins start with "sched (st 0xd wt 0x3f)"
13:44 karolherbst: I guess somebody had some weird issues as well
13:44 karolherbst: maybe I should stick a wt 0x3f at the end as well?
13:44 karolherbst: but that doesn't really make sense
13:45 karolherbst: yeah well, doesn't help
13:46 imirkin: what about sticking something on the op after the call?
13:49 karolherbst: you mean like those add $r0 $r0 0x0 I mentioned?
13:49 karolherbst: or did you have something else in mind?
13:49 karolherbst: ohh you meant sched stuff
13:49 karolherbst: mhh
13:58 karolherbst: imirkin: but shouldn't disabling sched opcodes help in that case?
14:18 karolherbst: imirkin: the super weird thing is, that adding those nop adds inside the builtin doesn't help, adding them after the call helps
14:37 karolherbst: mhh, with that cal I get a ILLEGAL_INSTR_ENCODING
14:37 karolherbst: but I think it just jumps somewhere where no code is actually
14:52 karolherbst: skeggsb, imirkin: cal skips unallocated pages... so I have to do a cal 0x660, not cal 0x7fd60
14:52 imirkin: unallocated pages?
14:52 karolherbst: well
14:52 imirkin: that just means you're doing a relative call
14:52 imirkin: rather than an absolute one
14:52 karolherbst: the builtin library gets allocated at the end
14:52 karolherbst: yeah
14:52 imirkin: did you forget to set call->absolute = 1?
14:52 karolherbst: no
14:53 karolherbst: I convert an absolute one to a relative
14:53 imirkin: in the fixup thing?
14:53 imirkin: er, reloc
14:53 karolherbst: no, I hardcode the address in emitCAL()
14:53 karolherbst: I just want to test stuff
14:53 karolherbst: mhh
14:53 karolherbst: that doesn't fix the issue either
14:54 karolherbst: imirkin: also envydis prints the absolute address in cal
14:54 karolherbst: not the relative one encoded inside the instruction
14:55 imirkin: sounds right
14:55 karolherbst: anyway, it doesn't help
14:55 imirkin: you have to pass the proper base address to envydis for it to all work out properly
14:55 imirkin: this is done in demmt
14:55 karolherbst: yeah...
14:55 imirkin: but generally not by people calling it from the cmdline ;)
14:55 karolherbst: but I know the correct code is called, so that's fine
14:56 karolherbst: at least it passes with the tests with NV50_PROG_OPTIMIZE=0 set
14:57 karolherbst: and I emit a wt 0x3f in every op as well
14:57 karolherbst: and setup deps inside the builtin
14:58 karolherbst: this issue is getting quite painful by now, because I seriously run out of ideas
15:02 imirkin: karolherbst: check out what the maxas guy has to say on the topic
15:02 imirkin: i'm sure calls are covered in that doc
15:02 karolherbst: right and I already checked
15:02 karolherbst: something like cal/ret/etc... needs a st of 0x5 or more
15:02 karolherbst: but that was basically it
15:04 karolherbst: I already compare against the shader nvidia uses
15:04 karolherbst: and there seems to be no magic as well
15:10 imirkin: what if you stick nops after the CAL?
15:10 imirkin: maybe it jumps back to the next sched block or something dumb?
15:10 imirkin: but realistically, i doubt there's any issue with call/ret
15:10 imirkin: this would have been noticed a very long time ago
15:10 karolherbst: yeah
15:11 karolherbst: also nops are getting dropped by RA or something
15:11 imirkin: the issue is most likely either with some sort of RA fail
15:11 imirkin: which i've seen
15:11 karolherbst: something eliminates instructions
15:11 karolherbst: so I had to stick connected ones in there
15:11 imirkin: or alternatively, something getting clobbered by the function being called, but not marked as such
15:11 karolherbst: imirkin: the shader is way too trivial for an RA fail
15:11 imirkin: trace it.
15:11 imirkin: i.e. check the RA
15:12 imirkin: make sure that nothing uses the clobbered regs
15:12 imirkin: etc
15:12 imirkin: there were definitely issues with this fixed reg allocation scheme
15:12 karolherbst: envydis output: https://gist.githubusercontent.com/karolherbst/bc83262b467a9e2b93c664b214fcc93b/raw/ceea95b7fb3c548d32a16f7447bfaff9e68bced8/gistfile1.txt
15:13 karolherbst: 0-9 are reserved for the builtin
15:13 karolherbst: 0 and 1 are input/output
15:13 karolherbst: that $p0 is some envydis screwup
15:14 imirkin: are you sure?
15:14 karolherbst: yes
15:14 karolherbst: nvdisasm: /*0018*/ JCAL.NOINC 0x7fd60; /* 0xe220007fd6000000 */
15:14 imirkin: i wonder.
15:14 karolherbst: and it gets called
15:14 imirkin: ok
15:14 karolherbst: I get semi random output
15:14 karolherbst: still predictable
15:15 imirkin: which builtin is this?
15:15 karolherbst: some state from inside the builtin
15:15 karolherbst: rsq
15:15 imirkin: link to commit
15:15 karolherbst: https://github.com/karolherbst/mesa/commit/25035f826681c219fc0445f873dd3a8642fe4b66
15:16 karolherbst: I am not 100% sure about those scheds, but sticking default ones doesn't help either
15:16 imirkin: so ...
15:16 imirkin: this looks bogus
15:16 karolherbst: yesterday I was like: "yeah, I port those to gm107 and then I am done with fp64 stuff" ... :(
15:16 imirkin: but can i see the input shader (e.g. tgsi)?
15:17 imirkin: perhaps the input shader is just doing something weird :)
15:17 karolherbst: https://gist.githubusercontent.com/karolherbst/a87f2825f10c26ca422d078cead13d5a/raw/793331722a3ffeef9cb84fe5fbb7801331f23536/gistfile1.txt
15:17 imirkin: right. that's what i figured the input shader would do.
15:18 imirkin: but that's not what it does.
15:18 imirkin: note the second call
15:18 karolherbst: sure?
15:18 imirkin: it's supposed to get args in r0/r1
15:18 imirkin: but it doesn't
15:18 karolherbst: ..............
15:18 imirkin: so it's ending up with rsq(rsq(x))
15:19 karolherbst: ahh crap :(
15:19 karolherbst: that's why adding adds helps
15:19 karolherbst: because that doesn't make RA screw up
15:19 imirkin: like i said -- trace the registers. it's important.
15:19 karolherbst: yeah....
15:19 karolherbst: it simply didn't occur to me
15:19 imirkin: it's interesting such a simple shader would get screwed up though
15:20 karolherbst: :)
15:20 imirkin: which is ... good, since the RA should be easier to debug than when i've seen this in the past
15:20 karolherbst: yeah
15:20 imirkin: calim had sent me some test patches which fix stuff like this
15:20 imirkin: but unfortunately they ended up breaking other stuff
15:20 karolherbst: I remember some patches
15:20 karolherbst: but couldn't find such
15:20 karolherbst: anyway, I am sure it is related to the clobber thing
15:20 karolherbst: so things are still reserved
15:20 karolherbst: or something
15:20 imirkin: fixed regs, actually
15:20 imirkin: but yeah.
15:20 imirkin: same idea.
15:21 imirkin: this should be easier to trace through.
15:21 imirkin: good luck :)
15:21 karolherbst: with optimize=0 we get this: mov u32 $r10 $r0
15:21 karolherbst: mov u32 $r0 c0[0x8]
15:21 imirkin: yeah, it has to save off the results
15:21 imirkin: to a register that's not clobbered
15:22 imirkin: that's why those mov's are put in - to resolve the constraints
15:22 imirkin: don't look at the post-ra
15:22 imirkin: look at the pre-ra
15:22 imirkin: make sure that's all good
15:22 imirkin: etc
15:22 karolherbst: mhhh
15:22 karolherbst: post-ra has two movs more
15:23 karolherbst: basically after a split
15:23 karolherbst: ......
15:23 karolherbst: my brain hurts suddenly...
15:23 karolherbst: maybe clobber + split = RA doesn't want to?
15:23 imirkin: a clobber is just a register use
15:24 karolherbst: merge u64 %r23d $r0 $r1; split u64 { %r24 %r25 } %r23d after the call
15:24 imirkin: or ... something.
15:24 imirkin: oh yeah, can't do that
15:24 karolherbst: opt =0 puts two movs
15:24 karolherbst: mov u32 %r26 %r24 and mov u32 %r27 %r25
15:24 karolherbst: and this is basically the only difference
15:24 imirkin: the first thing
15:24 imirkin: right after the call
15:24 imirkin: has to be mov %rX $r0, etc
15:25 imirkin: if something removes them, they need to be dropped back in
15:25 imirkin: (pre-ra ... RA can kill them all it wants obviously)
15:25 karolherbst: the movs should stay
15:25 karolherbst: if done in the lowering
15:25 imirkin: i wonder if it's the MergeSplits thing
15:25 imirkin: or something
15:25 karolherbst: well
15:25 karolherbst: maybe
15:25 imirkin: anyways
15:26 imirkin: you need to figure out where those constraint-resolving movs are disappearing to
15:26 karolherbst: just stick movs in the lowering, got it
15:26 karolherbst: ohh
15:26 karolherbst: yeah, right, I could do that
15:26 imirkin: what lowering?
15:26 karolherbst: rcp -> builtin
15:26 imirkin: that should be adding the mov's already.
15:26 karolherbst: that happens after the SSA opts
15:26 imirkin: bld.mkMovFromReg(i->getDef(0), i->op == OP_DIV ? 0 : 1);
15:27 imirkin: right. so the problem is that it's a double-wide thing
15:27 karolherbst: wrong function ;)
15:27 imirkin: same diff :p
15:27 karolherbst: handleRCPRSQLib
15:27 imirkin: i don't have that locally
15:27 karolherbst: doesn't have those
15:27 imirkin: well, it needs 'em
15:27 karolherbst: :)
15:27 imirkin: i thought i had brought that up in my reviews of it originally =/
15:27 karolherbst: maybe
15:28 karolherbst: I will add those and see what happens then
15:28 imirkin: that was over 1y ago
15:28 karolherbst: before the merge I have to add those, right?
15:28 imirkin: right after the call.
15:28 imirkin: (and after the clobbers, i guess)
15:29 imirkin: er, no, before the clobbers
15:29 karolherbst: so between merge and clobbers?
15:29 imirkin: the merge should be at the very end
15:30 karolherbst: okay
15:30 imirkin: mov; clobber; merge
15:30 karolherbst: https://gist.githubusercontent.com/karolherbst/c663f8941ef34737f7ffdc9eb2104b10/raw/510bebca4e766887a69306c8153f9a930d98bb73/gistfile1.txt
15:31 imirkin: that looks correct.
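The symptom and the fix discussed above can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not nv50_ir code, the function names are hypothetical, a single Python float stands in for the $r0/$r1 fp64 pair, and the "missing movs" case is simplified to "the next argument never gets loaded into $r0".

```python
import math


def rsq_builtin(regs):
    """Toy builtin: argument in $r0, result in $r0, $r1..$r9 clobbered."""
    regs["$r0"] = 1.0 / math.sqrt(regs["$r0"])
    for i in range(1, 10):
        regs[f"$r{i}"] = float("nan")  # scratch registers hold garbage after the call


def second_result(a, b, with_movs):
    """Call the builtin on a, then on b, and return the second result."""
    regs = {"$r0": a}
    rsq_builtin(regs)
    if with_movs:
        regs["$r10"] = regs["$r0"]  # mov out of the fixed reg right after the call
        regs["$r0"] = b             # load the next argument into the fixed reg
    # Without the movs, the second call consumes the first result: rsq(rsq(a)).
    rsq_builtin(regs)
    return regs["$r0"]


if __name__ == "__main__":
    print(second_result(4.0, 16.0, with_movs=True))
    print(second_result(4.0, 16.0, with_movs=False))
```

With the movs in place the second call sees its own argument; without them the model reproduces the rsq(rsq(x)) pattern imirkin spotted in the shader dump.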
15:34 karolherbst: https://github.com/karolherbst/mesa/commit/84adc31f1b961501fe6944846a673ed7c6bd8c9b
15:35 karolherbst: hopefully that fixes all fp64 issues
15:36 karolherbst: really don't want to actually debug those builtins
15:37 imirkin: that looks right
15:37 karolherbst: maybe I have old commits actually, dunno
15:38 imirkin: oh. hah. this was a bug that i fixed
15:38 imirkin: i remember doing this for the DIV thing
15:38 karolherbst: :)
15:38 karolherbst: right, but div is master
15:38 imirkin: but the code got copied before i did that
15:38 karolherbst: where this isn't
15:38 karolherbst: I see
15:39 imirkin: ea22ac23e04c093f9dd0bb8f9b946e61d79824ff
15:39 imirkin: which was to enable 1c4e6d7ca83578caf5212f7a484538cb1b4ae2a3 iirc
15:39 imirkin: er no, maybe that was unrelated. but the first one clearly is :)
15:39 karolherbst: ahh
15:39 karolherbst: yeah I remember that one
15:41 imirkin: probably the RA merge/split-handling should have inserted those
15:41 imirkin: i guess isConstrained or whatever it's called should return true in case of a fixed reg :)
15:42 imirkin: constrainedDefs.
15:42 imirkin: which wouldn't cover this case. wtvr.
15:42 imirkin: any fixed reg usage should be accompanied by movs into SSA values
15:42 imirkin: either into or from
15:43 karolherbst: yeah
15:44 karolherbst: pendingchaos: there are indeed some issues with your compute invocations counter stuff
15:44 karolherbst: the CTS test sometimes fails
15:44 karolherbst: no idea if you have an updated patch
15:44 imirkin: he has a hack to kick out the cmdbuf
15:45 imirkin: but i doubt it's a complete fix, just makes it easier to get lucky
15:45 karolherbst: ahh, true
15:45 karolherbst: anyway I think for the CTS we actually have to test every chipset before enabling stuff, so we might end up with a whitelist switch checking each chipset or something. Maybe one GPU from each gen to test would be enough
15:46 karolherbst: this is all super vague anyway
15:46 karolherbst: but I think we should go for testing each chipset
15:46 imirkin: i think you're supposed to cover each SKU in theory
15:46 imirkin: although dunno - that's probably much
15:46 imirkin: each silicon revision maybe?
15:46 karolherbst: maybe?
15:47 karolherbst: mupuf knows more I think
15:47 imirkin: i suspect we can get away with something less thorough
15:48 imirkin: like a separate submission per chipset
15:48 karolherbst: we can also do merged submissions
15:48 imirkin: yeah
15:48 karolherbst: I have access to all low end gens starting with kepler
15:48 imirkin: ben will have to run half of them, but i'm sure he's good for it
15:48 karolherbst: so gk107/gk208/gm108/...
15:48 imirkin: esp if it comes with instructions like "download this, run these commands, provide output"
15:48 karolherbst: well
15:49 karolherbst: the "real" cts run throws in some randomness
15:49 karolherbst: and reruns each test with a different salt for the pseudo random number generator
15:49 karolherbst: and it takes quite some time I've heard
15:49 imirkin: really? that sucks =/
15:49 imirkin: i thought it was a fixed seed
15:49 karolherbst: yeah, I think like 8 iterations
15:49 karolherbst: no, fixed seed only if you invoke glcts
15:49 imirkin: ah
15:50 imirkin: silly me.
15:50 karolherbst: with a test
15:50 karolherbst: cts-runner is the real thing
15:50 karolherbst: I think somebody on ARM had to run it for over 24hs
15:52 karolherbst: the fixed seed thing is mainly for better testing, so that you kind of can restart without depending on randomness
15:53 imirkin: right.
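
The fixed-seed versus salted-rerun distinction discussed above can be sketched as follows. This is a minimal illustration only, not how glcts or cts-runner are implemented; `run_iteration`, the base seed 1234 and the way the salt is added are all hypothetical, and the count of 8 reruns is just the figure guessed at in the conversation.

```python
import random

def run_iteration(seed, n=5):
    # Hypothetical stand-in for one test iteration: draws n inputs
    # from a PRNG seeded with `seed`.
    rng = random.Random(seed)
    return [rng.randint(0, 999) for _ in range(n)]

# Fixed seed: a rerun sees exactly the same inputs, so a failure can be
# reproduced after a restart without depending on randomness.
assert run_iteration(1234) == run_iteration(1234)

# Salted reruns (assumed scheme): each iteration perturbs the seed, so the
# same test is exercised with a different input sequence each time.
salted = [run_iteration(1234 + salt) for salt in range(8)]
```
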
16:10 karolherbst: mhh not _that_ bad: Failed: 7/7773 (0.1%)
16:11 karolherbst: some mod fp64 fails though
16:16 karolherbst: but this time regs are indeed saved properly
16:17 karolherbst: but actually there is a different issue
16:17 karolherbst: uhm no, this is rcp
16:19 karolherbst: imirkin: sync (join) acts like a bra jumping to the ssy (joinat) target with extra syncing, right?
16:33 imirkin: dunno. i think so :)
16:40 karolherbst: k
16:40 karolherbst: in the kepler code we have that join op thing, but with maxwell that is gone. I just hope I did port the code correctly...
16:40 karolherbst: will recheck
16:40 imirkin: link to commit (again)
16:48 karolherbst: https://github.com/karolherbst/mesa/commit/d38dff075a6d2906ce0034c24ea4ae41a9ac7275
16:53 imirkin: $p0 f2f f32 f64 $r6 0x3e800000
16:53 imirkin: you sure you didn't get that backwards?
16:54 imirkin: i dunno what the order of the things is
16:56 imirkin: should check a known-good shader, see how it emits, and then decode that
16:56 imirkin: i.e. check how a convert from f32 -> f64 works
17:13 karolherbst: imirkin: first source type then dest type
17:13 karolherbst: I already checked
17:13 karolherbst: otherwise that $r5 $r6 thing further above would already trigger unaligned reg errors
17:14 karolherbst: (which it did yesterday, because I hit that same problem)
17:21 karolherbst: but the error output is suspicious
17:22 karolherbst: input: [-13.5, -13.375] [-13.5, -13.375], expected: [0, 0], result: [0, -13.375]
17:22 karolherbst: mod is the operation
17:23 karolherbst: https://gist.githubusercontent.com/karolherbst/d68d1c41d8e32522cfcde3bc6c4c035f/raw/07c8718c1a902825ebd50db15f74b0b3cf501401/gistfile1.txt
17:23 karolherbst: as far as I can tell, this looks correct though
17:27 karolherbst: maybe that predicate thing really is an issue
17:31 karolherbst: yeah well, that doesn't change a thing
17:38 imirkin: oh
17:38 imirkin: did you update cts?
17:38 imirkin: i convinced them that what we were doing was ok
17:38 imirkin: iirc they changed the CTS test
17:38 imirkin: erm ... wait
17:38 imirkin: that's not the issue i was thinking of
17:38 imirkin: sometimes for x%x we return x instead of 0
17:39 imirkin: oh, and that's what's going on here
17:39 imirkin: yeah, it's legal
17:39 imirkin: (in a spec-lawyer sort of way)
17:39 imirkin: https://github.com/KhronosGroup/VK-GL-CTS/issues/51
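
A minimal sketch of how x%x can come out as x instead of 0, assuming mod is computed as x - y * floor(x * rcp(y)) with a reciprocal that is slightly too small in magnitude. Whether this matches the actual nouveau lowering is an assumption; `mod_div`, `mod_rcp` and the deliberately exaggerated relative error of 2**-20 are hypothetical and chosen only to make the effect visible.

```python
import math

def mod_div(x, y):
    # GLSL-style mod, x - y * floor(x / y), with a correctly rounded division.
    return x - y * math.floor(x / y)

def mod_rcp(x, y, rel_err):
    # Same formula, but x / y is replaced by x * rcp(y), where rcp is a
    # hypothetical reciprocal whose magnitude is too small by rel_err.
    rcp = (1.0 / y) * (1.0 - rel_err)
    return x - y * math.floor(x * rcp)

x = -13.375
print(mod_div(x, x))          # 0.0: x / x is exactly 1, floor gives 1
print(mod_rcp(x, x, 2**-20))  # -13.375: x * rcp is just below 1, floor gives 0
```

Because floor is discontinuous at 1, any reciprocal error that pushes the quotient below 1 turns the result from 0 into x.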
17:43 karolherbst: ahh
17:44 karolherbst: but I have that fix
17:44 karolherbst: but from the output something edgy is going on
17:50 karolherbst: imirkin: any reason why we would return a different value for the same input though?
17:55 karolherbst: ....
17:55 karolherbst: imirkin: guess what
17:57 karolherbst: so of course it detects this, but if you run that test with vectors, the expected case doesn't hit because one component fails, and if it tests the edge case it fails as well, because one component is actually the expected result
17:57 karolherbst: so.....
17:57 karolherbst: how can we make that a bit more deterministic
17:57 karolherbst: or should we fix the tests here again?
18:06 karolherbst: ohh wait, I misread the test input
18:09 karolherbst: so the bug isn't fixed
18:12 karolherbst: fun, python3 agrees with our implementation actually
18:51 karolherbst: interesting, "KHR-GL45.vertex_attrib_binding.advanced-iterations" fails only after a full run (or hopefully after a certain test)
21:17 rhyskidd: RSpliet: interesting topic coming up as OSPERT
21:17 rhyskidd: *at
21:18 rhyskidd: is there more code coming related to your talk?
22:06 rhyskidd: gahh at GitHub PRs .. i've created a wholly new v2 of my ctxsw microcode support improvements to rnndb
22:08 rhyskidd: mwk: thanks for your review, think I've addressed all your comments now
22:09 mwk: rhyskidd: how did you learn about UC_CTRL_ALIAS?
22:09 mwk: it seems... very strange
22:09 rhyskidd: nvgpu and watching mmiotraces of the blob
22:10 rhyskidd: it's definitely conditionally used, when working with secretful Falcons
22:10 mwk: nvgpu?
22:10 rhyskidd: the GPL driver nvidia shipped for the Pixel C android tablet
22:11 mwk: link?
22:11 mwk: anyway... I find it very strange that this would be a simple alias
22:11 rhyskidd: https://nv-tegra.nvidia.com/gitweb/?p=linux-nvgpu.git;a=blob;f=drivers/gpu/nvgpu/include/nvgpu/hw/gp10b/hw_gr_gp10b.h;hb=refs/tags/tegra-l4t-r28.2#l765
22:12 mwk: the fuck.
22:12 mwk: sigh
22:13 mwk: alright, if nvidia calls it an alias, let's call it an alias too, until proven otherwise
22:13 mwk: ship it
22:13 rhyskidd: and the usage here: https://nv-tegra.nvidia.com/gitweb/?p=linux-nvgpu.git;a=blob;f=drivers/gpu/nvgpu/gm20b/acr_gm20b.c;h=ed144c0fe1eb8c849af59913284582839bd705d5;hb=refs/tags/tegra-l4t-r28.2
22:14 mwk: so it's some kind of alias with different access rights, great...
22:14 rhyskidd: yeh, appears to be conditioned access rights
22:15 mwk: but for that to make any sense, it should be missing some fields
22:15 mwk: ah well, screw that
22:15 rhyskidd: i've been banging on the HS/LS trust boundary -- including around firmware interfaces
22:16 rhyskidd: documenting what I find as I go
22:16 rhyskidd: thanks, i'll merge the PR
22:16 mwk: yeah, we've all been banging on the HS/LS boundary like on fucking bars of a jail cell
22:17 mwk: [which it is]
23:03 karolherbst: imirkin: mhh: https://trello.com/c/Kgv3gKBf/26-khr-gl45vertexattribbindingadvanced-iterations-fails-in-full-run
23:04 karolherbst: any ideas?
23:04 karolherbst: it is really annoying to track down as it really seems to require all the 4k tests to run before
23:06 karolherbst: "Data is: 9 9 9 9, data should be: 10 10 10 10"
23:08 imirkin: hm
23:08 imirkin: all it's doing is binding a fresh tfb and drawing
23:08 imirkin: oh
23:08 imirkin: it's alternating between two tfb's ... interesting
23:09 karolherbst: well anyway, if I simply run this test it passes
23:09 karolherbst: so, a bit annoying
23:09 imirkin: probably the position in the buffer is not being saved quite properly
23:10 imirkin: since successive draws are supposed to increment the buffer position
23:10 imirkin: or something more subtle
23:18 karolherbst: also that CTS fix is wrong
23:21 karolherbst: took me a while to really understand the code
23:27 karolherbst: meh... weird
23:27 karolherbst: yeah well
23:28 karolherbst: common pitfalls
23:36 karolherbst: imirkin: https://github.com/karolherbst/VK-GL-CTS/commit/85d93ea7f92a59e5989cbb2fbbcb48c897fbc109