00:45 Lyude: oh no, that's weird, and somewhat surprising. I guess pcounter had slcg registers on kepler2 as well, but only starting from gk208 o_O
00:45 Lyude: anyone have a gk208 card lying around anywhere who might be willing to try some patches at some point
00:46 mooch2: should i start over on my nvidia code? https://github.com/86Box/86Box/blob/master/src/video/vid_nv_riva128.c
00:47 Lyude: Oh, wow. that's old
00:48 mooch2: yeah, nv3, nv4, and nv5
00:48 imirkin: Lyude: i will...
00:48 mooch2: i figured it was a good starting point
00:48 imirkin: Lyude: mine's a GK208B actually (NV106)
00:48 imirkin: Lyude: note that GK208 is the first gpu to use fuc5 for ctxsw fw
00:49 Lyude: ahhh, but I don't think my vbios register scan turned up anything earlier than nv108 having these registers
00:49 juri_: hmm. i have two distinct revisions of the quadro 600.
00:49 imirkin: Lyude: well, nv106 came after nv108 :)
00:49 Lyude: imirkin: additionally, it seems that we have no nv106 mmiotraces
00:50 imirkin: Lyude: https://people.freedesktop.org/~imirkin/traces/gk208b-vbios.rom
00:50 Lyude: mm, is that the one in the vbios repo as well?
00:50 imirkin: no clue
00:51 Lyude: ah, do you think you'd mind getting me mmiotraces of: loading up the nvidia driver and starting up X or something like that, doing the same thing but followed by suspend and resume, and then one of a simple load/unload of the nvidia driver?
00:51 imirkin: mmmm ... sorry. that'll be tough
00:52 imirkin: i haven't brought up blob on this machine in quite a while
00:52 mooch2: hey, does anybody have any bios dumps of nv10 cards that AREN'T underdumps? i.e. they gotta be a power-of-two size
00:52 mooch2: all the dumps i can find are underdumps
00:52 imirkin: i don't get a lot of "let's play and reboot" time, so it's limited to simple things
00:52 imirkin: mooch2: they don't have to be power of two
00:52 imirkin: depends what you're dumping
00:53 mooch2: imirkin, they do to work in 86box
00:53 imirkin: then you're not looking for the vbios.
00:53 mooch2: ?
00:53 imirkin: you're probably looking for the whole ROM
00:53 mooch2: yeah
00:53 imirkin: i.e. the full option rom
00:53 mooch2: basically
00:53 imirkin: which != vbios
00:53 mooch2: how do i get dumps of the full rom?
00:53 Lyude: imirkin: ah... hm, gonna have to find someone with one then
00:53 imirkin: (although maybe for nv10 it did? dunno)
00:53 skeggsb: pad it with zeroes...
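A minimal sketch of the zero-padding suggestion, assuming the only requirement is that the image length be a power of two. The helper name `pad_to_power_of_two` is hypothetical, and whether a given emulator accepts the padded image is a separate question.

```python
def pad_to_power_of_two(data: bytes, fill: int = 0x00) -> bytes:
    """Return `data` padded with `fill` bytes up to the next power-of-two length."""
    if not data:
        return data
    # Smallest power of two that is >= len(data).
    target = 1 << (len(data) - 1).bit_length()
    return data + bytes([fill]) * (target - len(data))
```

An image that is already a power-of-two size is returned unchanged.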
00:54 imirkin: Lyude: they're quite common in low-end dell desktops
00:54 imirkin: (that's where i got mine =] )
00:54 imirkin: or not even necessarily low end
00:54 imirkin: it's like the default video card they used to put into haswell through skylake era machines
00:55 imirkin: they've started throwing in gm107's now (gtx 745 and such)
00:55 Lyude: oh really? that's strange...
00:55 Lyude: or not, I guess it makes sense
00:55 imirkin: [and you can guess where my gtx 745 came from...]
00:55 Lyude: what's the marketing name that they used for those?
00:55 imirkin: GT 730
00:55 imirkin: and lower
00:55 imirkin: i.e. GT 710, 720, 730
00:57 Lyude: aaaah, found a couple on amazon. i'm always down for an excuse to buy more hw off work's budget :)
00:57 imirkin: careful - if you're buying random ones, they could be other models
00:57 imirkin: i meant on the dells
00:57 imirkin: some of those are GF108's and GF119's
00:57 imirkin: and the rare GK107 thrown in
00:57 Lyude: ahh, nah I'm just getting the card as-is https://www.amazon.com/Gigabyte-Video-Graphics-GV-N710D3-2GL-REV2-0/dp/B01IEO05NU/ref=sr_1_10?ie=UTF8&qid=1524099361&sr=8-10&keywords=GT+730
00:58 juri_: I'm having fun with my GF108s. :)
00:58 mooch2: skeggsb tried that
00:58 mooch2: didn't work
00:59 imirkin: Lyude: can't tell from the description. normally i look at shader cores. 96 = fermi. 384 = kepler2
01:00 imirkin: oh. 28nm process. that's definitely the GK208
01:00 imirkin: ergh
01:00 imirkin: no, that's not so clear =/
01:01 imirkin: x8 is common for those though. and i don't think it was earlier.
01:02 imirkin: juri_: to really have fun, check out ben's reclocking branch for fermi
01:02 imirkin: iirc quadro 600 is one of the tested cards
01:06 juri_: I'm not using them for video; just trying to run my haskell code on them.
01:07 juri_: reading through envytools has been a lot of fun.
01:07 imirkin: reclocking is most useful for computational operations
01:08 imirkin: since it enables you to run the card at higher clocks (both shader and memory), thus achieving a higher computational rate.
01:08 imirkin: (higher than the "default" boot clocks, which tend to be low or middle for fermi)
01:10 juri_:nods.
01:10 juri_: first i have to make my code run, tho. :)
01:10 imirkin: an important little detail, yes
01:14 Lyude: yay, found someone with a gt720 and gt730
01:14 Lyude: still going to get one for myself, but at least I can get an mmiotrace on that tomorrow
01:19 imirkin: they're quite common as low-end cards
01:19 imirkin: i don't know why people bother to plug them into boxes with intel gpu's... i think it's because motherboard connectors can sometimes be idiotic, and the solution to that is to throw in an nvidia board
01:21 HdkR: Even if it is a super trash low end card that the iGPU can probably beat? :P
01:21 imirkin: nah, it beats igpu by like 10-20%
01:21 imirkin: but that hardly seems worth it
01:21 HdkR: I guess you need to go down to the GT 710 if you want pain
01:22 imirkin: i don't know if there are practical differences between gt 710..730 that are chip-related
01:22 imirkin: maybe ddr3 vs ddr5 ram, etc
01:22 imirkin: perhaps they fuse parts off? dunno
01:22 imirkin: not a lot that can be fused off... hard to go lower than 1 ;)
01:22 HdkR: haha
01:36 juri_:turns on -Wextra
07:37 mooch2: does anybody have any idea why super mario odyssey sends 0x346c on GM204?3D methods during initialization
07:37 mooch2: *GM204_3D
10:08 RSpliet: mooch2: That is an oddly specific question. Do you happen to know what that method translates to?
10:08 RSpliet: mooch3: ^
10:22 pmoreau: mooch2: Making the jump from nv2-nv5 to nv124? :o
10:35 mooch2: pmoreau, i'm working on switch emulation now lol
10:36 mooch2: RSpliet, nope
10:36 mooch2: mesa doesn't have it, envytools doesn't have it
10:38 RSpliet: mooch2: long shot... have you checked this fork: https://github.com/kfractal/envytools/tree/hwref ?
10:39 RSpliet: Not sure whether kfractal had generated 3d method docs too or just HW MMIO regs
10:40 mooch2: nope
10:41 RSpliet: (some other branches could be equally useful. It's focussed on gk20a, but you might get lucky...)
11:24 pmoreau: mooch2: Ooh, cool! :-)
13:30 imirkin: mooch2: $ ~/src/envytools/rnn/lookup -a GM204 -d SUBCHAN -- -v obj-class GM204_3D 346c
13:30 imirkin: GRAPH.SCRATCH[0x1b] => 0
13:31 imirkin: (scratch can be used as additional inputs to macros sometimes, or firmware methods)
13:32 imirkin: pendingchaos: if you get a chance to trace blob, would be curious how it handles atomicAdd() with a float on shared memory. i can't find the instruction encoding for it, so it must do something clever. maybe it pulls up $sbase and goes through the gmem window?
13:33 imirkin: (this is exposed with NV_shader_atomic_float)
13:33 imirkin: i have a prelim branch here: https://github.com/imirkin/mesa/commits/atomicfloat -- you have to force-enable the ext with the env var though
13:36 pendingchaos: sure
13:36 pendingchaos: do you have a file for shader_runner I can use or a particular program to trace?
13:36 pmoreau: imirkin: Want me to write such a CUDA kernel? Or do you specifically need it in a shader context?
13:36 imirkin: pendingchaos: sure -- https://cgit.freedesktop.org/~idr/piglit/log/?h=NV_shader_atomic_float
13:37 imirkin: you'll need that branch
13:37 imirkin: actually, you probably don't need the whole branch for shared
13:37 imirkin: you need it for ssbo's i think?
13:37 imirkin: pendingchaos: https://cgit.freedesktop.org/~idr/piglit/commit/?h=NV_shader_atomic_float&id=025bb9e5cb7e8e7955aa6ad85575c0cb318f6165
13:38 imirkin: that should work with an upstream shader_runner
13:40 imirkin: i'm banking on the $sbase thing. since there's a dedicated ATOMS now, that should mean that the regular ATOM op ought to be able to handle it too via the window.
13:40 imirkin: on previous gens, i think it just threw errors if you tried that
13:41 imirkin: pmoreau: oh, the cuda thing should work too
13:41 imirkin: pmoreau: i just need to know how to do a f32 atomic add on shared memory ("local" in opencl? i forget)
13:42 pmoreau: local in OpenCL, shared in CUDA :-)
13:42 imirkin: lol
13:42 pmoreau: Can do that quickly and share the SASS
13:43 imirkin: which one's sass?
13:43 pmoreau: GP102
13:43 imirkin: i mean -- what is "sass"?
13:43 karolherbst: the isa
13:43 imirkin: PTX? or the real thing?
13:43 pmoreau: Oh, Disassembly
13:43 karolherbst: imirkin: the real thing
13:43 imirkin: ok cool
13:52 pmoreau: imirkin: SASS: https://hastebin.com/geradajowa.swift, CUDA code: https://hastebin.com/ezacivucor.cpp
13:54 pendingchaos: imirkin: hopefully these should be useful: https://drive.google.com/open?id=1oYQjhM13NhG7Cj3iXfMgfeErBnjM1Xuw https://drive.google.com/open?id=1YKBdCDrv66f80mDI-zz0Q-Dx3crLrnTo
13:55 imirkin: pmoreau: grrr. blob was being clever :(
13:55 imirkin: pmoreau: can you make the value being added dynamic?
13:55 pmoreau: Sure
13:55 imirkin: pendingchaos: thanks!
13:57 pendingchaos: shared-atomicAdd-float.shader_test is missing a GL_ARB_shader_storage_buffer_object requirement btw
13:57 pmoreau: imirkin: https://hastebin.com/ajuwiyacuj.swift the added value is now a parameter to the kernel
13:57 pendingchaos: or GLSL 4.30, but the other test seems to use the extension
13:57 imirkin: pendingchaos: should it need it? doesn't it use GLSL 4.30?
13:58 pendingchaos: the compute shader has "#version 330" at the beginning
13:58 pendingchaos: at least in the commit I was using
13:58 imirkin: pmoreau: crap. it still does a loop, doing compare-and-swaps
13:58 imirkin: pendingchaos: ah. that should probably be 430 ;)
13:58 imirkin: i'll let idr know
14:00 pmoreau: :-/
14:00 imirkin: hm, pendingchaos's code is similar: https://hastebin.com/safagopubi.php
14:00 imirkin: well THATS JUST PLAIN ANNOYING
14:01 imirkin: i wonder what LDS.U does
14:01 imirkin: i've only seen it here, i think
14:02 pmoreau: Oh yeah, indeed.
14:02 pmoreau: I have it for SM 5.2 and SM 7.0 as well
14:02 imirkin: pendingchaos: i decoded your mmt, looks like it does the same as pmoreau's cuda thing
14:03 imirkin: eternal sadness :(
14:03 imirkin: (basically load shared mem, do op, compare and swap, retry if the compare fails)
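The load / op / compare-and-swap / retry pattern described above can be sketched in Python as a simulation. This is not the SASS the blob emits: `SharedWord` is a hypothetical stand-in for one 32-bit shared-memory slot, and a lock stands in for the hardware's atomic compare-and-swap.

```python
import struct
import threading


class SharedWord:
    """One 32-bit word with a compare-and-swap primitive (lock = hardware atomicity)."""

    def __init__(self, bits=0):
        self._bits = bits
        self._lock = threading.Lock()

    def load(self):
        return self._bits

    def compare_and_swap(self, expected, new):
        # Store `new` only if the word still holds `expected`; always return the old value.
        with self._lock:
            old = self._bits
            if old == expected:
                self._bits = new
            return old


def f32_to_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]


def atomic_add_f32(word, value):
    """Float add built from integer CAS: load, add, swap, retry if the swap lost a race."""
    while True:
        old_bits = word.load()
        new_bits = f32_to_bits(bits_to_f32(old_bits) + value)
        if word.compare_and_swap(old_bits, new_bits) == old_bits:
            return bits_to_f32(old_bits)
```

The comparison is done on the raw bit pattern rather than on the float value, so the retry test stays well-defined even if the slot holds a NaN.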
14:03 pmoreau: Here’s the SM 7.0 version, for the records: https://hastebin.com/mubazukavi.php
14:03 imirkin: ATOMS.CAST.SPIN -- that's new!
14:03 imirkin: i've seen the op, but no clue what it does
14:04 imirkin: /*0170*/ WARPSYNC 0xffffffff; /* 0xffffffff00007948 */
14:04 imirkin: don't think i've seen that either
14:04 pmoreau: Yeah, that’s new with Volta
14:04 imirkin: sm70 = volta?
14:04 pmoreau: Yup
14:04 skeggsb: it might have something to do with the per-thread instruction pointer stuff
14:04 pmoreau: The Titan V and some Quadro cards
14:04 pmoreau: It does
14:05 imirkin: should i even ask why it does a FFMA instead of FADD?
14:05 pmoreau: All the warp operations have an explicit sync now, because you can no longer assume that each lane is the warp are running in locksteps.
14:05 pmoreau: *in the warp
14:05 imirkin: oh. R2 = count of active lanes. ugh
14:05 imirkin: defeated again.
14:06 imirkin: pmoreau: the idea was a varying quantity *per lane*
14:06 skeggsb: imirkin: i was seing FFMA instead of FMUL too
14:06 pmoreau: Oh, sorry
14:06 skeggsb: seeing*
14:06 imirkin: skeggsb: well, FFMA instead of FMUL - reasonable if odd
14:06 imirkin: FFMA instead of FADD -- there's something else going on ;)
14:06 imirkin: which in this case, there is
14:06 imirkin: "compiler's too smart"
14:07 imirkin: defeated the simple case of doing atomicAdd with a fixed value
14:07 imirkin: (by counting up active lanes)
14:07 pmoreau: imirkin: https://hastebin.com/melogeyeyu.go I multiply the offset by the thread index.
14:08 imirkin: pmoreau: right thanks. now it's a lot more straightforward
14:09 imirkin: anyways, the compare-and-swap thing isn't that hard to implement. just annoying that they'd have gone to ALL this trouble of making it work with shared memory, and then missed this opportunity.
14:09 imirkin: (it = atomic ops in general)
14:10 pmoreau: “ /*00a0*/ BSSY B0, 0x170; /* 0x000000c000007945 */” So the barriers are no longer part of the “scheduling bits” now? Or could you already have those B0 before?
14:11 pmoreau: (Assuming B0 is a barrier, and not a BB (which I first thought it was :-D))
14:11 skeggsb: well, they're still in the sm70 control codes at least
14:11 skeggsb: don't know what that B0 is though :P
14:12 pmoreau: Let’s have a look at the NVIDIA forum, maybe someone already looked at that
14:15 pmoreau: BTW, some of you already know, but I’ll be working at NVIDIA as a Research Intern, possibly on things related to ray tracing (I honestly don’t know yet what I’ll be working on, and even if I did, I wouldn’t be able to say so :-D) for 6 months starting in 10 days. So no more Nouveau for me for some time. :-/
14:37 imirkin: pmoreau: enjoy :) probably the NDA will preclude you from ever working on nouveau, dunno
14:48 karolherbst: pmoreau: what about the clover patches?
14:48 juri_: ironically, my work is on a raytracer, as well. just a monochromatic industrial one.
14:50 pmoreau: karolherbst: Might finish rebasing them tonight, otherwise definitely during the weekend. I was having some issues with it yesterday, due to building in-tree, it pulled in the LLVM headers as well.
14:50 karolherbst: pmoreau: ahh, nice
14:50 pmoreau: imirkin: Thanks! :-) I’ll have another look at it, but I don’t think it does. Still some restrictions of course.
14:54 pmoreau: juri_: Nice :-) I’ve been doing mostly photon mapping (well photon splatting) so far in my PhD.
14:59 juri_: pmoreau: 3d modelling here.. but it will be a LOT of work before i have nvidia cards backing it.
15:03 hopetech: pmoreau: You can still work on llvm-spirv, Just saying. ;)
15:06 hermier: hi, just tried nouveau since I was using nvidia for a while, because of ddr issues
15:06 hermier: ddr issues are gone but plasma still makes nouveau leak like hell :/
15:07 hermier: are there any investigations in that direction ?
15:10 imirkin: hermier: not really
15:14 HdkR: pmoreau: Playing with Volta? :D
15:23 karolherbst: imirkin: do we have to keep the sign if we cast from 64 bit int to 32 bit?
15:23 karolherbst: there is no I642I opcode in TGSI, so, dunno
15:23 karolherbst: I would expect we just get the low bits, but still
15:30 karolherbst: well, the issue is something else anyway
16:15 Lyude: You don't need strap_peek if you got the vbios through pramin, right?
16:17 imirkin_: karolherbst: how would the result be any different between I642I and I642U?
16:18 karolherbst: yeah, dunno. I don't see it as well
16:18 karolherbst: but something is odd
16:18 imirkin_: i.e. give me a precise situation where you'd get a different output :)
16:19 karolherbst: imirkin_: well the high level issue I see is, that for this kernel: https://gist.githubusercontent.com/karolherbst/066278ca52ec3d0137072af819b4721b/raw/21fe1b207e3a182bf554a474946e0470f82f6f98/a.cl
16:19 karolherbst: where get_global_id returns a long not int
16:19 karolherbst: I get $tid << 32 | $tid as the result
16:19 karolherbst: not $tid
16:19 karolherbst: I mean + whatever I add
16:19 karolherbst: so for 1 thread running, I get 0x1 00000001
16:19 karolherbst: if I do +1
16:19 karolherbst: 0x2 00000002 for +2 and so on
16:21 karolherbst: I am sure something is wrong with the submitted shader, but I am not sure what exactly
16:24 Lyude: (anyone know the answer to the pramin question? wondering before I send off a pull request with some small improvements for nvagetbios)
16:24 karolherbst: imirkin_: this is the nvir output: https://gist.githubusercontent.com/karolherbst/b51eff9bfb1f6df1c0d6a196ec82c864/raw/6514c67df056a2fc67a4b88dbbc496f8ab2e3a5b/gistfile1.txt
16:24 karolherbst: but... it all looks super sane
16:25 karolherbst: at least to me
16:25 imirkin_: karolherbst: and you want it to be 1 0?
16:25 imirkin_: (just trying to figure out what you're trying to get)
16:25 karolherbst: no, high bits should be 0
16:25 imirkin_: right
16:25 imirkin_: but high bits go second
16:25 imirkin_: little endian.
16:25 karolherbst: right
16:25 karolherbst: ahh
16:26 karolherbst: yeah okay
16:26 imirkin_: so you're expecting high bits to be 0, right?
16:26 karolherbst: yes
16:26 imirkin_: can i see something more complete? will make it easier to trace what's desired.
16:26 karolherbst: weird thing is, if I do "unsigned int" I get the correct result for example
16:26 imirkin_: so your conversion is wrong.
16:27 karolherbst: https://gist.github.com/karolherbst/b51eff9bfb1f6df1c0d6a196ec82c864
16:27 imirkin_: (maybe)
16:27 karolherbst: imirkin_: well, not sure
16:27 karolherbst: because I disabled some opts as well
16:27 imirkin_: ok, cool thanks
16:27 imirkin_: that helps.
16:27 karolherbst: AlgebraicOpt -> Split64BitOpPreRA -> DeadCodeElim
16:27 karolherbst: everything else is disabled
16:27 karolherbst: if I disable tryAddToMulSad, then I get the expected result as well
16:28 imirkin_: mov u32 %r46 0x00000000 (0)
16:28 imirkin_: add s32 %r77 %r74 %r41 (0)
16:28 imirkin_: merge u64 %r79d u32 %r77 %r46 (0)
16:28 imirkin_: split u64 { %r80 %r81 } %r79d (0)
16:28 imirkin_: and then those split'd values end up getting stored.
16:28 imirkin_: so ... if you're seeing the high bits as anything other than 0
16:28 imirkin_: you're in SERIOUS trouble
16:28 karolherbst: yeah...
16:28 imirkin_: also, if you look at the generated code, it has
16:28 imirkin_: mov u32 $r3 0x00000000 (8)
16:28 imirkin_: st u32 # g[$r0d+0x0] $r2 (8)
16:28 imirkin_: st u32 # g[$r0d+0x4] $r3 (8)
16:29 imirkin_: so ... again ... that better end up with 0's in the high bits
16:29 imirkin_: is that not what you're seeing?
16:29 karolherbst: mhh, I think I posted the wrong version actually
16:29 imirkin_: ok
16:29 imirkin_: yeah, your original example didn't have that
16:29 karolherbst: https://gist.githubusercontent.com/karolherbst/b51eff9bfb1f6df1c0d6a196ec82c864/raw/bd872f1f6cd3c6b51bfd79ca9c756b7a1f0cddf0/gistfile1.txt
16:30 karolherbst: the shr went away :)
16:30 karolherbst: I kind of think the shr is the culprit here
16:30 karolherbst: because last write to $r3
16:30 karolherbst: and the value of $r2 is pretty predictable
16:30 imirkin_: why is that a signed shift?
16:30 imirkin_: can i see the opencl input?
16:31 karolherbst: signed 32 to 64 bit
16:31 karolherbst: same as I2I64
16:31 imirkin_: you're shifting down by 31 bits
16:31 karolherbst: yeah
16:31 karolherbst: the sign
16:31 imirkin_: that just makes every bit the same as the sign bit
16:31 karolherbst: from_tgsi: mkOp2(OP_SHR, TYPE_S32, dst0[c + 1], dst0[c], loadImm(NULL, 31));
16:31 karolherbst: exactly
16:31 karolherbst: and $r3 is the high bits of the result
16:32 karolherbst: so there is a 32 to 64 bit cast
16:32 karolherbst: in the cl file I linked above: " res[0] = tid + 2;"
16:32 karolherbst: res is long*
16:32 imirkin_: that's I2I64
16:32 karolherbst: right
16:32 imirkin_: not I642I
16:33 imirkin_: for I2I64, you want to sign-extend the low word
16:33 imirkin_: so that the high word is either all 1's or all 0's
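A small Python model of the two conversions under discussion, assuming 32-bit words stored low word first. The function names are hypothetical and only mirror the TGSI opcode names; this is not the from_tgsi code itself.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def i2i64(low):
    """Sign-extend a 32-bit word: high = arithmetic shift of low by 31."""
    high = (to_signed32(low) >> 31) & MASK32  # all ones if negative, else zero
    return low & MASK32, high


def i642i(low, high):
    """Truncate 64 -> 32 bits: the low word is kept, signed or unsigned alike."""
    return low & MASK32


def as_u64(low, high):
    return (high << 32) | low
```

For a small non-negative value such as `tid + 2` the high word comes out as 0, so any non-zero high word in the stored result points at something other than the conversion itself.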
16:33 karolherbst: but isn't that what is happening?
16:33 karolherbst: $r2 contains the lower 32 bit
16:33 karolherbst: and gets shifted by 31 into $r3
16:33 imirkin_: yes
16:34 karolherbst: so, that's why it is a signed shift
16:34 imirkin_: is the problem that some of the high TID bits are set?
16:34 karolherbst: mhh
16:34 imirkin_: can you look at the full value?
16:34 imirkin_: i don't know that all the bits are valid there.
16:35 karolherbst: well, tid is a 32 bit integer
16:35 karolherbst: I am sure that something really ugly is happening here
16:35 karolherbst: if I do out[1] = tid; out [0] = tid +2; I get the wrong result
16:36 karolherbst: when I swap both expressions, I get the expected one
16:36 karolherbst: well, more or less
16:37 karolherbst: actually not
16:37 karolherbst: both things are garbage
16:37 karolherbst: but in different ways
16:38 karolherbst: it makes no sense
16:38 karolherbst: ...
16:38 karolherbst: imirkin_: we know what value $r2 has, we know from that point what value $r3 should have
16:38 karolherbst: and still...
16:43 karolherbst: imirkin_: just by looking at the end of the shader: https://gist.githubusercontent.com/karolherbst/199b8ac5669630c5f7f80bd63b7cffd1/raw/17d99abb5fed5d5cec87eb1ee6a33dfa4efda77b/gistfile1.txt
16:43 karolherbst: it only makes sense if the shr is wrong, but why should it be wrong :(
16:43 karolherbst: uhm....
16:44 imirkin_: i dunno. have you looked at the shader binary?
16:44 karolherbst: imirkin_: make a super good duess
16:44 karolherbst: *guess
16:44 imirkin_: your test app is wrong?
16:44 karolherbst: nope
16:44 imirkin_: i.e. all is well and you're just checking the same address twice
16:45 karolherbst: :(:(
16:45 karolherbst: NV50_PROG_SCHED=0 and it works
16:46 karolherbst: I hope I didn't break it though
16:47 karolherbst: good
16:47 karolherbst: still wrong if I revert my two sched fixes
16:48 karolherbst: something like that is like super annoying
16:57 karolherbst: imirkin_: I guess the shr takes longer than we expect it does
16:58 karolherbst: but
16:58 karolherbst: only if the shr reads from a reg
16:58 karolherbst: this is so annoying
16:59 karolherbst: hakzsam: did you know that a shr with an immediate shift is faster than one with reading it from a reg?
16:59 karolherbst: now you know :p
17:00 karolherbst: maybe it doesn't even need longer but something else gets delayed, dunno
17:01 karolherbst: https://gist.github.com/karolherbst/5289bafe70eac500fa3c86ba188c0c98 I just enabled/disabled loadpropagation now
17:10 imirkin_: moral of the story: use immediate shifts ;)
17:10 imirkin_: karolherbst: alternatively you weren't waiting long enough for the mov to complete?
17:10 karolherbst: mhhh
17:10 karolherbst: so it shifts by 0?
17:11 karolherbst: that would work, because $r3 was set to 0 before
17:11 imirkin_: or whatever the prev value was
17:11 karolherbst: well everything above the add is basically a nop
17:11 imirkin_: on adreno this is handled with explicit nop's
17:11 karolherbst: if tid and ctaid is 0, everything is 0
17:12 karolherbst: basically
17:37 pmoreau: hopetech: I’m planning to, as well as maybe help on common clover issues; I also have a Radeon HD 6870 home, so I can run clover on that, rather than NVIDIA cards with Nouveau.
17:37 pmoreau: HdkR: I was just a bit curious to see whether there was any difference regarding atomics.
17:38 HdkR: ah
17:38 pmoreau: I don’t have (yet) a Volta to play with.
17:38 pmoreau: So just compiling to SM 7.0 and looking at the output.
17:39 Lyude: How exactly do you do an mmiotrace with multiple cards and separate one card from the other?
17:51 karolherbst: Lyude: maybe binding it to the pci_stub driver or something helps?
17:53 Lyude: what exactly do you mean, ftrace filters?
17:56 karolherbst: Lyude: no, just get the driver to ignore the device
17:56 karolherbst: I don't think there is a feature for that in mmiotrace, because that's not how mmiotrace works
18:06 duttasankha: Is there an assembler for the falcon up? I don't remember properly but I thought I saw it once but I am not sure....
18:07 Lyude: karolherbst: ah, makes sense
18:08 karolherbst: Lyude: but the lines are prefixed with an index of the device anyway, so you can also just filter
18:08 Lyude: oh! just using awk or something, right
18:08 Lyude: *grep
18:08 karolherbst: yeah
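A rough sketch of that filtering step in Python, assuming the index meant here is the map id field of the mmiotrace text format (`R`/`W` lines carry it as the fourth field, `MAP`/`UNMAP` lines as the third). The function name is hypothetical, and the field positions should be checked against a real trace before relying on them.

```python
def filter_by_map_id(lines, wanted_ids):
    """Yield only the trace lines that belong to the given map ids."""
    wanted = {str(i) for i in wanted_ids}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        kind = fields[0]
        if kind in ("R", "W") and len(fields) > 3:
            map_id = fields[3]  # R/W width timestamp map_id ...
        elif kind in ("MAP", "UNMAP") and len(fields) > 2:
            map_id = fields[2]  # MAP/UNMAP timestamp map_id ...
        else:
            yield line  # keep header and marker lines untouched
            continue
        if map_id in wanted:
            yield line
```

The `MAP` lines for each id show which physical range, and therefore which card's BAR, that id corresponds to.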
18:40 hakzsam: karolherbst: I didn't :)
18:40 karolherbst: yeah well... the issue could be something else, but... anyway, there is another bug :)
19:52 karolherbst: hakzsam: I don't really get it fixed though :( any suggestions what I could try here?
21:05 Lyude: imirkin_: poke; through some of my connections I got some stuff from what I'm pretty sure is a GK208/GK208B but it seems to be listed as nv108
21:05 imirkin_: Lyude: cool. nv108 = GK208, nv106 = GK208B
21:06 imirkin_: the B version is an unofficial name we created
21:06 imirkin_: in e.g. lspci they're all GK208
21:06 imirkin_: i'm not aware of any differences at all whatsoever
21:07 Lyude: ahh, so it doesn't matter a whole ton, cool
21:08 imirkin_: obviously boot0 returns a diff value, but that's about it
22:38 pendingchaos: imirkin_: for the compute invocation counter thing, I've currently got something that uses a macro to modify a bo at each indirect launch
22:38 pendingchaos: currently, it reads a slightly incorrect value when writing to a query's bo using the same macro
22:38 pendingchaos: adding a PUSH_KICK() before writing it into a query seems to fix the issue
22:38 pendingchaos: do you know if there is some subtle behavior that I'm missing?
22:42 karolherbst: imirkin_: can we actually read a byte from const memory? I am sure we can't do c[0x1], but do you think we can just use an indirect with its value being 0x1?
22:43 imirkin_: karolherbst: with LDC maybe. definitely not with MOV
22:43 karolherbst: ahh right
22:43 imirkin_: pendingchaos: how are you modifying a bo in the macro?
22:44 karolherbst: although, seems like the emitCBUF for maxwell only accepts 0x4 precision
22:44 pendingchaos: like with the query buffer macro, using QUERY_GET
22:44 imirkin_: karolherbst: check if that's the case if you use a smaller size.
22:45 karolherbst: ohh wait
22:45 karolherbst: LDC can do it
22:45 karolherbst: the ld c0[] just gets converted to a mov later on
22:45 imirkin_: pendingchaos: sorry, now's not a great time for me to look into it. ping me again a different time
22:45 pendingchaos: ok
22:45 imirkin_: karolherbst: yeah, but only for a 32-bit mov i think
22:45 karolherbst: maybe you are right
22:46 karolherbst: "cvt f32 %r51 u8 c0[0x1]" is what I see
22:46 karolherbst: would make sense
22:46 karolherbst: we can't load propagate this
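To illustrate what a byte-sized read like `u8 c0[0x1]` has to produce when only 4-byte-aligned word loads are available, here is a Python sketch. It models the arithmetic only, assuming a little-endian buffer; it is not how nouveau lowers the instruction, and the function name is hypothetical.

```python
import struct


def load_u8_via_aligned_word(cbuf: bytes, offset: int) -> int:
    """Read one byte by loading the enclosing aligned 32-bit word and extracting it."""
    word = struct.unpack_from("<I", cbuf, offset & ~3)[0]  # aligned little-endian load
    shift = (offset & 3) * 8                               # byte position inside the word
    return (word >> shift) & 0xFF


def cvt_f32_u8(cbuf: bytes, offset: int) -> float:
    """Semantics of `cvt f32 <- u8 c0[offset]`: zero-extend the byte, convert to float."""
    return float(load_u8_via_aligned_word(cbuf, offset))
```

Folding the load into a plain 32-bit move would read the whole word at `offset & ~3` and lose the byte selection.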
22:55 imirkin_: so that's a I2F.B0
22:55 imirkin_: actually, hm. no i dunno what that is.
22:56 karolherbst: anyway, disabling load propagation at least lets the test pass
22:56 karolherbst: so it works in theory
22:57 karolherbst: *theory
22:57 karolherbst: ...
22:57 karolherbst: we just need to make load propagation less optimistic