00:01 levrano: tstellar: but this nvidia link does not say explicitly what happens when the index lanes are like 1 1 1 2 2 2 3 3 3; how many replays do you think there are then?
00:08 levrano: anyways, again this guy is russian that made the link in 2015, the most insulted eastern european one like me, it pretty much seems that those are the ones that make neat theories, however there is an implicit statement in the end like this:
00:08 levrano: With non-uniform access, the cost depends on the number of unique elements addressed by each warp. In some cases the number of instruction replays can be very high.
00:16 levrano: anyhow i still do not give the exact scheduling code, but honestly if you ever had some sane thoughts or will have you start to see very easy trick to trace the address space and regroup the fragments into new warps, and it is also kinda very long time known thing, mainly described as data remapping case, though there also was a reference redirection theory too
00:59 Lyude: Kepler2 BLCG ✔️
00:59 Lyude: now i'm finally at the point where it's time to try slcg
00:59 karolherbst: Lyude: nice
01:07 Lyude: holy shit even on idle with max clock and BLCG on kepler2, power draw goes down about 4W
01:07 Lyude: i wonder how much it's going to be when i go through and benchmark all of this with glmark + furmark so I can see differences between idle, mostly busy, completely busy
01:08 Lyude: (still expecting the middle one to show the greatest savings)
01:10 karolherbst: well I think this always depends on the workload
01:10 Lyude: yep, hence why I said the middle one
01:10 Lyude: since we'll be able to find workloads that don't stress certain portions of the GPU as much as others
01:10 karolherbst: that wasn't what I meant
01:11 Lyude: hm?
01:11 karolherbst: I meant, that if something only uses parts of the GPU, this won't do much, because the unused parts are already in power saving modes
01:11 karolherbst: so the actual kind of workload is more important than how much the GPU is under stress
01:12 Lyude: ah
01:12 Lyude: interesting
01:12 karolherbst: I am sure that a workload which is using all parts at 50% will see higher power savings than workloads which use 50% of all parts at 100%
01:13 karolherbst: but then again, it should depend on the engines being actually used
01:13 karolherbst: and if they have like smaller parts which can be turned off and so on
01:14 karolherbst: Lyude: I think you can actually see something like this happening in my furmark benchmarks
01:14 Lyude: btw: https://github.com/Lyude/linux/tree/wip/kepler%2B-clockgating-v1r2 working blcg2 for kepler + karolherbst's suggested changes + a mmio packet format I'm finally happy with
01:15 Lyude: karolherbst: yeah I did
01:15 karolherbst: like why is the power saving for 0f/0 lower than for 0a?
01:15 Lyude: good question
01:16 karolherbst: I would assume that on 0f/0 you have less waits on memory operations
01:16 karolherbst: so you kind of have to keep the cores longer in an active state overall
01:16 karolherbst: if you increase the clocks, you have more time to put them into idle states
01:17 karolherbst: same goes for 0a, where you have such slow memory that the cores actually have to wait much longer again
01:17 karolherbst: just a random theory
01:17 karolherbst: but maybe it is right, maybe not
01:18 karolherbst: but then it would mean, that on 0a/2 I should get even higher power savings...
01:19 karolherbst: sadly I didn't test on those levels
01:59 Lyude: PCOUNTER == pmu right
01:59 Lyude: (@ karolherbst )
01:59 karolherbst: no
01:59 karolherbst: PDAEMON==pmu
02:00 Lyude: ah ok, trying to figure out a name for this mmio pack
02:03 Lyude: alright, slcg written, let's see what explodes...
02:06 Lyude: well holy shit. that was fast.
02:06 Lyude: SLCG on kepler2✔️
02:12 karolherbst: :D
02:15 Lyude: we do get an mmio fault now though, but it doesn't seem to cause any problems
02:15 Lyude: must have a bogus register somewhere here
02:16 karolherbst: yeah, most likely
02:18 karolherbst: meh... vloadn...
02:20 Lyude: yeah, wow
02:20 Lyude: i'm seeing some power savings of >10W
02:21 karolherbst: wow
02:21 karolherbst: that's a lot
02:21 Lyude: I might be off by a watt or two but yeah
02:22 Lyude: *is not actually writing anything down, it's too late at night for me to care :P*
02:22 Lyude: i can'
02:22 Lyude: *I can't believe we're technically still not done either with ELPG and the iso hub still on my todo
02:29 karolherbst: well, usually people care more about getting performance/feature working before caring much about reducing power consumption :p
02:32 karolherbst: yeah... this sounds like fun: (180 / π) * radians.
03:49 Aristar: welp i tried dumping/fixing/recompiling the acpi DSDT table on this odd Dell notebook with nv86 (nv50). loaded from initrd and forced osi "Windows 2006", but nope... even with compositing off it froze pretty darn fast lol
03:52 Aristar: also tested acpi=copy_dsdt acpi_force_32bit_fadt_addr (since i know it is technically 32bit addressing), and acpi_enforce_resources=lax
03:53 Aristar: bleh, i guess time to trash this thing... but has nostalgic value :(
03:53 Aristar: or lul hyperv linux on top of win vista
03:53 Aristar: if it even has hyperv
03:56 Aristar: running out of ideas here though, anyone feel like looking at any acpi dumps or other info? nv binary drivers sort of work albeit some LibGL traps here and there
03:57 Aristar: but would like to get nouveau working well
04:01 Aristar: (old merom platform with core2duo, 965 chipset with ICH8 plus ICH6 for IDE optical drive, onboard 8400m GS 128MB/DDR2 (and an odd vga port with non-fatal duplicate video devices detected on boot). also lol expresscard slot but can't even eGPU on it due to 32bit addressing limit)
04:01 Aristar: (not a mxm gpu, soldered on board)
04:02 Aristar: (and no intel graphics, at the time there was like 2 chipsets, one with iGPU in it, and one for dgpu)
04:04 Aristar: DSDT table does have some weird stuff when "Windows 2006" is detected (vista)
04:05 Aristar: no weird "Linux" stuff like some HP, so I assume it just skips all the RETURNs in dsdt table and falls back to generic stuff
04:09 Aristar: well if anyone has any ideas for nouveau-specific config= options or other parameters, i'd be grateful. I've tested most of the other parameters. also maybe of note, it detects the vram as being GDDR3 but is DDR2/GDDR2 128MB (yeah low bandwidth stuffs, Dell cheaped out... max is like 500MHz vram or something)
04:10 Aristar: also heh, this was one of those chips with the solder fail recall back in the day due to lead->leadfree transition, but mine was never affected and passes all stress testing
04:10 Aristar: also runs hot under nouveau, ~65c idle
04:11 Aristar: anyway sorry for the spam, if anyone has any ideas I'll be lurking or i'll try some of the mailing lists
04:57 imirkin: karolherbst: OP_BAR blocks for all active invocations in a single work group. there's no cross-work-group synchronization that i'm aware of.
04:58 imirkin: pmoreau: i think you were asking about BRA's in BB's? the idea is that all BRA's are at the end, but there can be multiple ones
04:58 imirkin: they execute the logic for the inter-bb edges
04:59 imirkin: so usually it'll be a bunch of conditional bra's followed by an unconditional one
04:59 imirkin: i think there's some unfortunate implicit logic with TREE edges vs other kinds which implies certain things it shouldn't imply
04:59 imirkin: i haven't had the heart to dig into it though
09:20 pmoreau: imirkin, karolherbst: For cross work-groups/blocks, you have to wait for Volta IIRC.
09:21 pmoreau: imirkin: Oh, I thought you could only have one, regardless whether it’s a conditional one or not. Good to know.
09:27 pmoreau: is hoping to get his hands on the new Titan V O:-)
09:27 pmoreau: (Using the research program ofc, not going to pay $2,999 for it! But that means no RE on it then)
09:42 karolherbst: pmoreau: I fixed a lot of the commonfns test stuff
09:43 karolherbst: pmoreau: there are two reasons why those fail: 1. the frontend doesn't compile with fp64 support 2. we don't do vloadn/vstoren
09:45 karolherbst: pmoreau: and the painful part is, all those Opencllib opcodes need to deal with vectors
09:49 karolherbst: ahh PIPE_CAP_DOUBLES
09:49 karolherbst: but we enable that, weird
09:50 karolherbst: and clinfo even says so as well
09:52 karolherbst: okay, this sounds like a bug either in clang or mhhh, maybe we need to add that define?
09:59 pmoreau: karolherbst: Weird, I had it detect fp64 support on my branch
09:59 karolherbst: pmoreau: https://github.com/karolherbst/mesa/commit/64e19547c84688c2d126756ade4656689184fecf
09:59 karolherbst: pmoreau: it is just the define, which is missing while parsing the header files
10:00 karolherbst: so if you have #ifdef cl_khr_fp64 ... some functions #endif, those don't get picked up
10:00 pmoreau: karolherbst: If you run the “hiloeo” test, does it run for doubles as well? Cause it did for me
10:00 karolherbst: but
10:00 karolherbst: double is a valid time
10:00 karolherbst: *type
10:00 karolherbst: pmoreau: there is the type ;)
10:00 karolherbst: but
10:01 karolherbst: clang/3.6.1/include/opencl.h:10555
10:01 karolherbst: for example those function overloads don't get picked up
10:02 karolherbst: now fixing all my functions for double support
10:21 karolherbst: pmoreau: would you like to take a look at vloadn and vstoren? commonfns/test_commonfns fmaxf is a test failing due to this on my branch
11:46 pmoreau: karolherbst: I’d rather finish working on the memory management (as it’s one of the bases for all the work, and one of the first things we will need to get merged) and the SPIRV-Tools fixes/changes.
11:46 karolherbst: ahh, yeah, makes sense
11:46 pmoreau: And given that I still have a paper deadline going on, I won’t have time to look into more than that. (Even memory management + SPIRV-Tools might be too much for now).
11:47 pmoreau: Hopefully I can get most of it done by new year.
11:48 pmoreau: karolherbst: How is the talk proposal for FOSDEM going on BTW? We should be able to have “cool” things to talk about for FOSDEM, even if most of it will still be a WIP and barely anything merged, but there should be some good progress to talk about.
11:48 karolherbst: I think I will try to fix those RA issues now, because those seems kind of important
11:48 pmoreau: ;-)
11:48 pmoreau: Who would have thought :-p
11:48 karolherbst: pmoreau: well right, but first thing: who is going to fosdem?
11:48 pmoreau: (talking about the loop thing, or something else?)
11:49 pmoreau: I’ll be going
11:49 karolherbst: generally, like tracking down why those basic tests are failing
11:49 karolherbst: okay
11:49 karolherbst: well, I started to write up a proposal, and I wanted to finish it this weekend
11:49 pmoreau: OK
11:53 karolherbst: pmoreau: how long?
11:53 karolherbst: I would go for 30 minutes
11:54 pmoreau: Seems reasonable. What would there be to talk about (more at length than just saying, kernel X.Y has features A, B, C, etc.)?
11:55 karolherbst: I would talk about gsoc and that I worked on putting a lot more interesting ideas on the site
11:55 karolherbst: also some aspects I can talk about nvidia
11:55 karolherbst: and like general progress, what we are focusing on
11:55 karolherbst: we = ben and me
11:56 karolherbst: we also have Lyude power gating stuff where we can show nice benchmarks/graphs
11:56 karolherbst: maybe until then my reclocking rework is also fully merged
11:56 karolherbst: progress on the OpenCL/SPIR-V front
11:57 pmoreau: Oh yes, Lyude’s power gating! True
11:58 pmoreau: Who will be coming from RH? Are Ben and Lyude coming?
11:59 pmoreau: I assume Hans will be there as it’s quite close by for him
12:00 karolherbst: no idea
12:00 karolherbst: there is devconf.cz just one week before
12:00 karolherbst: and that is currently getting a bigger focus than fosdem
12:02 pmoreau: Hum, interesting
12:03 karolherbst: Well in the worst case I just take PTO and pay for myself :p
12:09 karolherbst: do we have some kind of nouveau account for fosdem? it sounded like that since last year there is a new/update/cleaned system or so
12:20 pmoreau: I made a pentabarf account for fosdem in 2016, as I was giving a talk, but I don’t know whether we have one for Nouveau or not.
12:22 karolherbst: maybe mupuf knows something?
12:29 karolherbst: pmoreau: hiloeo regressed
12:30 pmoreau: :-(
12:30 pmoreau: What’s wrong with it?
12:30 karolherbst: ohh wait, let me see
12:31 karolherbst: no, it was broken all the time already
12:31 karolherbst: at opt level 0 it breaks
12:31 karolherbst: and it fails for longs
12:31 pmoreau: Could you bisect what broke it please?
12:32 karolherbst: it has been broken since the beginning, it just didn't break on higher opt levels
12:33 pmoreau: Hum, I don’t think I tried it on different opt levels, besides the highest one.
12:34 karolherbst: maybe I will see what is wrong
12:34 karolherbst: mhh
12:34 karolherbst: there is not even branching
12:35 karolherbst: ohhh yeah
12:35 karolherbst: that won't fly
12:35 karolherbst: mov u64 %r62 %r61d
12:35 karolherbst: mov u64 %r63 0x0000000000000000
12:35 pmoreau: How do we end up with that?
12:35 karolherbst: I guess no typed set somewhere
12:39 karolherbst: also things like this: mov u64 %r2d %r55
12:39 pmoreau: *Sigh*
12:40 karolherbst: I think it is OpVectorShuffle
12:41 pmoreau: That would be weird
12:41 pmoreau: Cause the size and type are computed based on the input
12:42 karolherbst: then maybe OpSConvert
12:42 karolherbst: but that wouldn't make much sense
12:42 pmoreau: (except when using garbage data, in which case it’s using the res type, but the res type and the input type should be the same)
12:43 pmoreau: Could you paste the initial NVIR print please?
12:43 pmoreau: (as well as the corresponding SPIR-V)
12:45 karolherbst: pmoreau: https://gist.github.com/karolherbst/7b45a217e1078d70f9ddfb914e907469
12:45 pmoreau: Thanks
12:46 karolherbst: also those "cvt u64 %r17d - %r8" thingies are things we might not want to do
12:47 pmoreau: Yeah, I probably broke something when cleaning up that part
12:47 pmoreau: Oh right
12:47 pmoreau: https://github.com/karolherbst/mesa/blob/nouveau_spirv_support/src/gallium/drivers/nouveau/codegen/nv_ir_from_spirv.cpp#L3325
12:47 pmoreau: In most cases, the type of a value is not set
12:48 karolherbst: ahh
12:48 karolherbst: so if we have TYPE_NONE, we simply mov?
12:48 pmoreau: Hardcode it to TYPE_U32 for now.
12:48 karolherbst: mhh, right, makes sense
12:49 karolherbst: those are all for the SVs?
12:49 pmoreau: It’s more like no one really sets the type of a value, because you can compute it from the size of the value and the type of operation being done.
12:50 pmoreau: All the builtins are currently SVs, and TYPE_U32
12:50 karolherbst: makes sense
12:51 karolherbst: well, it still fails, but at least this issue is fixed
12:51 karolherbst: now this: mov u32 %r55 %r52d
12:52 karolherbst: OpCompositeInsert?
12:53 pmoreau: Not entirely sure
12:53 karolherbst: those moves above should come from OpCompositeExtract
12:53 karolherbst: or... wait
12:57 pmoreau: I think you were right about the VectorShuffle
12:57 pmoreau: I think we are hitting the same issue as with the builtins
12:57 pmoreau: https://github.com/karolherbst/mesa/blob/nouveau_spirv_support/src/gallium/drivers/nouveau/codegen/nv_ir_from_spirv.cpp#L2991
12:58 karolherbst: mhh,okay
12:58 pmoreau: Change `typeSizeof(src->reg.type)` to simply `src->reg.size`
12:58 pmoreau: Oh wait, that might cause issues with chars --"
13:00 pmoreau: Maybe `resType->getElementSize(i - 4u)` might not break things?
13:00 pmoreau: On second thought, `src->reg.size` should be fine, I think
13:01 karolherbst: well, no
13:01 karolherbst: I think the issue isn't the type here
13:01 karolherbst: the thing is, should we move into a wide reg, or not?
13:01 karolherbst: the dest is a 32bit reg
13:01 karolherbst: even though we want to move 64bits
13:02 pmoreau: Isn’t the type TYPE_NONE here, meaning we take max(4, 0) == 4 instead of 8?
13:02 karolherbst: well right, so we can change u32 to u64, but the remaining issue is, that we still have a 32bit dest
13:02 karolherbst: so we have to fixup the line above as well, right?
13:03 pmoreau: The line above should be fine: if the input is 64-bit, we allocated 64-bit, if it’s 32-bit or less, we allocate 32-bit
13:04 karolherbst: no, we don't
13:04 karolherbst: mov u32 %r55 %r52d
13:04 pmoreau: Du
13:04 pmoreau: Get rid of that typeOfSize
13:04 pmoreau: --"
13:04 karolherbst: ahh it passes now :)
13:04 pmoreau: It should be `getScratch(std::max(static_cast<uint8_t>(4u), src->reg.size))`
13:04 karolherbst: Value *dst = getScratch(std::max(4u, resType->getElementSize(i - 4u))); :D
13:05 pmoreau: I must have been reallllllly tired to write that --"
13:05 pmoreau: And I think I fixed a similar issue in another place
13:06 karolherbst: getting all those things upstreamed will be fun
13:06 pmoreau: While you are at it, you might want to compute the byte size once in that code, and reuse it
13:06 karolherbst: I think we can get most of the things in without depending on non default llvm?
13:06 pmoreau: Yeah
13:07 karolherbst: nice
13:07 karolherbst: I was mainly thinking about that intel spir-v work
13:07 karolherbst: and that we just accept spir-v inputs for opengl as well
13:07 karolherbst: should be a good enough reason to get it merged already
13:08 pmoreau: We would need to test on some shaders then, before merging anything
13:08 karolherbst: right
13:09 karolherbst: and it seems like that most of the OpenCL stuff is abstracted away through OpExtInst*
13:09 pmoreau: I was thinking of splitting part of the SPIR-V to NVIR in different files: like have the main work done in the current file, and everything specific to OpenCL done in another file, and OpenGL/Vulkan in a third one
13:09 karolherbst: yeah
13:09 karolherbst: the opencl part is kind of big already
13:09 pmoreau: Yes, except how pointers work and some other memory things.
13:10 karolherbst: yeah okay, but I think this shouldn't be too hard
13:10 pmoreau: The convertOpenCL thing does not take that much space I would say, unless you added tons of things to it
13:10 pmoreau: Probably not
13:11 karolherbst: well
13:11 karolherbst: I have 400 loc handling opencllib opcodes now
13:11 karolherbst: and it will be getting more
13:11 pmoreau: Oh, OK. It grew quite a bit then :-D
13:11 karolherbst: sign was a big pain to implement
13:12 pmoreau: Oh yeah, indeed!! I missed all those new functions you added!
13:12 karolherbst: yeah
13:12 pmoreau: I was thinking it still mostly contained just smad24 and umad24 :-D
13:12 karolherbst: most of commonfn is passing now
13:12 pmoreau: Good job!
13:12 karolherbst: well and the fails are due to missing vloadn and vstoren support
13:13 pmoreau: Awesome!
13:13 karolherbst: and sign doesn't work for doubles for reasons I don't understand
13:13 karolherbst: and some tests only fail for double16
13:13 karolherbst: but yeah
13:13 karolherbst: it works for float
13:14 karolherbst: I cheated quite a lot for radians and degrees
13:14 karolherbst: do you know a better way instead of doinf f * constant?
13:14 karolherbst: *doing
13:15 karolherbst: this is still more precise than the CTS wants to have it, but well
13:20 karolherbst: ohh, implementing OpVectorExtractDynamic and OpVectorInsertDynamic should be fun
13:21 mupuf: karolherbst: I think I always submitted with my own pentabarf account
13:21 karolherbst: okay
13:29 pmoreau: karolherbst: Nope, I don’t know of any better way for doing the conversion.
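For reference, the "f * constant" approach mentioned above can be sketched in plain C as follows. This is only an illustration under stated assumptions: the function and macro names are hypothetical, and it is not the nvir lowering itself, just the arithmetic it would have to reproduce for float.

```c
#include <assert.h>

/* 180/pi and pi/180, rounded to float by the compiler.
 * Both macro names are hypothetical. */
#define DEG_PER_RAD 57.29577951308232f
#define RAD_PER_DEG 0.017453292519943295f

/* degrees(x) = (180 / pi) * x, done as one multiply by a constant. */
float cl_degrees(float radians) { return DEG_PER_RAD * radians; }

/* radians(x) = (pi / 180) * x, done as one multiply by a constant. */
float cl_radians(float degrees) { return RAD_PER_DEG * degrees; }

/* Usage check: pi radians is 180 degrees, within float rounding. */
void check_degrees_radians(void)
{
    float d = cl_degrees(3.14159265f);
    assert(d > 179.999f && d < 180.001f);
}
```

A single multiply keeps the error to the rounding of the constant plus one rounding of the product, which fits the remark that this is already more precise than the CTS requires.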
13:50 pmoreau: karolherbst: Feel free to move the OpenCL part to another file already. Not sure about the name. I was thinking of renaming the current file to nvir_from_spirv.cpp (getting rid of the ‘_’ between “nv” and “ir”).
13:50 karolherbst: mhh
13:50 karolherbst: I think I would just do that when we are cleaning up all that stuff
13:50 karolherbst: have to think about how to do it properly
13:51 pmoreau: OK
14:12 karolherbst: pmoreau: maybe we should check if we are already able to mine bitcoins with that OpenCL stuff :D
14:12 karolherbst: like to have a "real world example" which works
14:14 pmoreau: :-D
14:15 pmoreau: Feel free to do that!
14:15 pmoreau: Another cool example would be running Cycles (from Blender), but that will require images support for sure.
14:21 RSpliet: Perhaps the OpenCL branch of Caffe is an easier target
14:22 RSpliet: Well... maybe not easier than bitcoin mining, but definitely easier than image stuff
14:22 pmoreau: I’m not touching anything that sounds like coffee! :-p
14:23 karolherbst: pmoreau: by the way: 39 basic tests passing now
14:24 karolherbst: well, bitcoin should be fairly easy to support
14:24 pmoreau: So, two more than last time? Nice
14:24 karolherbst: and like 10 commonfns tests ;)
14:25 pmoreau: RSpliet: It could be an interesting test case though; I’ll just rename it locally to “tea”
14:25 karolherbst: 7 to be precise
14:25 karolherbst: so 9 tests more
14:26 RSpliet: pmoreau: sure about that? It's bound to give you migraines, perhaps "kahlúa" is a more appropriate name.
14:27 pmoreau: That’s an interesting mix! :-D
14:34 karolherbst: pmoreau: SPIR-V linker: Invalid linker options: '-DOUTPUT_SIZE=256' '-DOUTPUT_MASK=255' '-I' '/usr/lib64/python2.7/site-packages/pyopencl/cl'
14:34 karolherbst: any idea?
14:34 pmoreau: Yes, I see where that is coming from
14:35 pmoreau: https://github.com/karolherbst/mesa/blob/nouveau_spirv_support/src/gallium/state_trackers/clover/spirv/invocation.cpp#L802
14:41 karolherbst: mhhh
14:41 karolherbst: annoying
14:46 karolherbst: pmoreau: Type 10880 is missing
14:46 karolherbst: please tell me it isn't F16
14:50 karolherbst: pmoreau: those kernels are super hardcore... I think if we are able to compile those bitcoin kernels we can compile like everything...
14:50 karolherbst: except if something uses images
15:05 pmoreau: 10880? What is that one?
15:07 pmoreau: As for those extra options passed to the linker, I’m not 100% sure those are valid. Cause all the inputs should have been compiled, so we don’t care about those. Maybe we should just silently ignore them?
15:08 karolherbst: maybe
15:08 karolherbst: need to implement rotate anyway
15:09 karolherbst: rotate is annoying
15:09 pmoreau: Did you find out what 10880 was? The ID seems realllly large for a type. Plus, if it was an fp16, it would be defined as a float of width 16.
15:09 karolherbst: mhh
15:09 karolherbst: yeah no idea
15:10 pmoreau: rotate is funky
15:10 karolherbst: especially 8 bit rotates
15:11 pmoreau: I guess you can compute it by doing `(a << i) | (a >> (typeBitSize - i))`
15:12 pmoreau: (And making sure that the second one is an unsigned type, as we don’t want it to be sign extended)
15:12 karolherbst: yeah
15:12 karolherbst: well
15:12 karolherbst: then I still need to move the high bits
15:12 pmoreau: ?
15:12 pmoreau: You move the high bits with the shr
15:12 karolherbst: rotate(0x2d, 0x2a) == 0xb4
15:13 karolherbst: so I rotate around that 8bit value, not 32bit
15:13 karolherbst: and that's basically what happens now for me
15:14 pmoreau: Well, the issue there with my code, is that you could get the high bits left behind, but otherwise it should move the high bits down, I’m quite sure
15:14 karolherbst: I am sure there is no 8bit version of shfl
15:14 karolherbst: only 16 bit
15:14 pmoreau: So, `((a << i) | (a >> (typeBitSize - i))) & ((1 << typeBitSize) - 1)`
15:15 pmoreau: Don’t need an 8bit version: just use the 32-bit version in those cases.
15:16 karolherbst: sure about that "typeBitSize - i" ?
15:16 pmoreau: Hum, what will it work if you rotate multiple times around
15:16 pmoreau: s/what/how
15:16 pmoreau: Let me check
15:16 karolherbst: it should just work I think
15:17 imirkin: pmoreau: the point of the BRA's is to "execute" on the edges which are in the CFG. you can't just have a random BRA to a random place. it has to follow the CFG.
15:18 pmoreau: karolherbst: Yeah: if you have an 8-bit value, and shift it by 2, then that means you are getting rid of the two highest bit. If you shr by 8-2=6, you get the two high bits in the two low bits now.
15:19 karolherbst: well i can be 0x2a
15:19 karolherbst: or what is i for you?
15:19 pmoreau: By how much you are shifting
15:19 karolherbst: 0x2a
15:20 pmoreau: So that’s why I was wondering what would happen if your shift makes you wrap around multiple times
15:20 karolherbst: that should be fine
15:20 karolherbst: or I just mask off the high bits
15:20 pmoreau: It won’t be fine with my current solution
15:21 karolherbst: I mean, I can just normalize the shift so that I write into the first 16 bits only
15:21 karolherbst: and then just or bits 8-15 into 0-7
15:21 pmoreau: imirkin: On the other hand, you are building the CFG from the branches in your code, not the other way round, so branches should always be following the CFG, no?
15:21 karolherbst: so basically i & 0x7
15:22 pmoreau: Right, take the modulo which can be done with a simple &
15:22 imirkin: pmoreau: my main point is that they have to be the same
15:22 pmoreau: OK
15:22 imirkin: pmoreau: you can argue which drives which all you want, it doesn't really matter
15:22 karolherbst: tmp = a << (i & 0x7); res = (tmp & 0xff00) >> 8 | tmp & 0xff
15:23 karolherbst: I think we only need to do this for u8
15:23 karolherbst: because there is hw support for u16,u32 and u64
15:24 pmoreau: karolherbst: Do you really need the & before shifting it back? But you are right to have an & before |ing. It’s not needed for unsigned values, but it is for signed ones.
15:24 karolherbst: mhhh
15:24 karolherbst: I think I can skip it
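The 8-bit rotate worked out above (reduce the shift count with `i & 0x7`, shift in a wider register, then OR bits 8-15 back into bits 0-7) can be written as a small C sketch. The function names are hypothetical and this is a CPU-side model of the arithmetic, not the emitted nvir; it drops the first `& 0xff00` as agreed in the last few lines.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate an 8-bit value left by i, with i taken modulo 8. */
uint8_t rotate_u8(uint8_t a, uint8_t i)
{
    /* Shift in a 32-bit temporary so the bits leaving bit 7 are kept. */
    uint32_t tmp = (uint32_t)a << (i & 0x7);

    /* Fold bits 8..15 down onto bits 0..7 and truncate to 8 bits. */
    return (uint8_t)((tmp >> 8) | (tmp & 0xff));
}

/* Usage check with the values quoted in the log. */
void check_rotate_u8(void)
{
    assert(rotate_u8(0x2d, 0x2a) == 0xb4);
}
```

Because `a` is unsigned and at most 8 bits wide, `tmp >> 8` cannot pick up anything except the bits that were shifted out, so no extra mask is needed before the shift back.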
15:47 karolherbst: pmoreau: it seems to work... just those vloadn/vstoren things result in crashes.. maybe I should at least add dummy values for those
16:09 karolherbst: pmoreau: it doesn't work for u16 either :/
16:12 pmoreau: What doesn’t work for u16 either? The rotate function?
16:14 karolherbst: yeah
16:14 karolherbst: I mean
16:14 karolherbst: using the hw wrap function
16:14 karolherbst: let see if it works at least for u32
16:20 karolherbst: pmoreau: Expected (0x2dba27d3), got (0x2dba27d0), sources (0xcb6e89f4, 0x3817c9a2) ....
16:20 karolherbst: for int
16:21 karolherbst: and that is a shl with NV50_IR_SUBOP_SHIFT_WRAP
16:23 pmoreau: Does it work with simpler shift value, like 1 or 4?
16:23 karolherbst: well, it kind of does wrap arounds... but only kind of
16:24 karolherbst: if I mask 0x1f I get 0x32dba27d0 as a 64 bit value
16:24 karolherbst: so maybe that SHIFT_WRAP only does this masking for us?
16:25 karolherbst: yeah
16:25 karolherbst: the PTX docs kind of say the same
16:27 pmoreau: Indeed: the clamping or wrapping is done on the shift value, not on the result.
16:31 karolherbst: well u64 will be painful then
16:31 karolherbst: mhhhh
16:31 karolherbst: wait
16:31 karolherbst: there is a easier way, right?
16:32 karolherbst: val <<.wrap i | val >>.wrap (size - i)
16:34 pmoreau: I wonder whether I have seen this code before (without the .wrap though)
16:34 karolherbst: maybe
16:35 pmoreau: Maybe AND the result with ((1 << size) - 1)
16:36 karolherbst: why?
16:36 karolherbst: I will just cvt
16:36 pmoreau: If you are rotating a u8
16:36 karolherbst: I will mask the shift myself anyway
16:36 karolherbst: so I mask it with 7 before doing any shifting
16:37 pmoreau: You still need to mask after the shifting
16:37 karolherbst: and I hope a cvt will be enough
16:37 karolherbst: doesn't cvt just cut it off?
16:37 pmoreau: Well, cvt will be lowered to do the masking
16:37 karolherbst: k
16:38 pmoreau: Even if it wasn’t, I’m not sure what would be faster between a cvt and an AND.
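The two-shift rotate proposed above can be modelled in C for the 32-bit case. The sketch assumes, as concluded in the discussion, that the wrap mode only masks the shift count (to 5 bits for a 32-bit operand) and does nothing to the result; `shl_wrap`, `shr_wrap` and `rotate_u32` are hypothetical names, not nv50ir or PTX identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Shifts whose count wraps: only the low 5 bits of n are used. */
static uint32_t shl_wrap(uint32_t v, uint32_t n) { return v << (n & 31); }
static uint32_t shr_wrap(uint32_t v, uint32_t n) { return v >> (n & 31); }

/* Rotate left: OR the left-shifted value with the bits that fell off the
 * top, brought down by shifting right by (32 - i), also wrapped. */
uint32_t rotate_u32(uint32_t val, uint32_t i)
{
    return shl_wrap(val, i) | shr_wrap(val, 32u - i);
}

/* Usage check with the CTS values quoted in the log. */
void check_rotate_u32(void)
{
    /* A wrapped shift alone loses the two high bits ... */
    assert(shl_wrap(0xcb6e89f4u, 0x3817c9a2u) == 0x2dba27d0u);
    /* ... and the OR with the right shift restores them. */
    assert(rotate_u32(0xcb6e89f4u, 0x3817c9a2u) == 0x2dba27d3u);
}
```

Wrapping the right-shift count as well keeps the `i == 0` case defined: `32 - 0` wraps to 0, both halves equal `val`, and the OR returns `val` unchanged.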
16:40 karolherbst: well
16:57 karolherbst: when 1 << X is 0 in C... :D
16:58 karolherbst: okay. only long is broken now
16:58 karolherbst: right...
17:15 fox_: Hi
17:17 fox_: I have a GTX 650 under debian stretch and my system (screen) gets stuck randomly ... not often, but often enough to be annoying...
17:17 fox_: I think it is this bug: https://bugs.freedesktop.org/show_bug.cgi?id=81690
17:17 fox_: Log says: Dec 7 20:28:43 workstation org.gnome.Shell.desktop[1024]: nouveau: kernel rejected pushbuf: Device or resource busy
17:18 fox_: What can I do about it, or maybe does someone know a solution for this?
17:21 karolherbst: pmoreau: for long I only get the low 32bit right... and I am sure the input IR is correct in the sense of the algorithm is correct
17:21 karolherbst: fox_: annoy us enough so we fix that issue
17:22 fox_: :)
17:22 karolherbst: I think nv50cal_space basically means that there are too many commands sent to the GPU too fast
17:22 karolherbst: one thing is we could just hard block until the GPU can accept it...
17:23 karolherbst: or block even longer
17:26 karolherbst: I think when I am able to reproduce the issue, I might even try to fix it
17:26 fox_: Maybe I can help fixing the issue.. Should I file a bug, or provide specific information? What should my next steps be to be most constructive?
17:27 karolherbst: you could try to write an application which triggers this immediately
17:27 karolherbst: that would help me, because then I wouldn't have to do it myself
17:29 fox_: Hmm... Happened randomly while watching youtube and surfing the web ^^ ... But I will try to narrow it down more
17:47 karolherbst: bitselect sounds funny
17:50 karolherbst: bitselect(a, b, c) should be (a & b) & (0xffffffff ^ c)
17:50 karolherbst: mhh
17:52 karolherbst: actually, it is a & ~c | b & c
18:00 fox_: isn't it: a & b & ~c
18:00 fox_: can't see the |
18:02 karolherbst: well, the CTS does this on the CPU: out[ i ] = ( inA[ i ] & ~inT[ i ] ) | ( inB[ i ] & inT[ i ] );
18:02 fox_: you talked about bitselect from OpenCL?
18:02 karolherbst: yeah
18:02 fox_: yes just looked it up a & ~c | b & c is right ^^ ... i thought your first and second formula should match
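The formula settled on above, which is also what the quoted CTS reference line computes, can be sketched in C for a 32-bit operand. The function name is hypothetical; the OpenCL builtin is generic over integer and floating-point types, and this covers only one of them.

```c
#include <assert.h>
#include <stdint.h>

/* For each bit: take the bit from b where c is 1, from a where c is 0. */
uint32_t bitselect_u32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* Usage check: an all-zero mask returns a, an all-one mask returns b. */
void check_bitselect_u32(void)
{
    assert(bitselect_u32(0xffff0000u, 0x00ff00ffu, 0x00000000u) == 0xffff0000u);
    assert(bitselect_u32(0xffff0000u, 0x00ff00ffu, 0xffffffffu) == 0x00ff00ffu);
}
```

The first attempt, `(a & b) & ~c`, can only clear bits, so it cannot return `b` for an all-one mask; that is the difference the `|` makes.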
18:03 fox_: what are you working at?
18:03 karolherbst: SPIR-V to nvir
18:06 fox_: what is nvir? NVVM IR?
18:11 karolherbst: nvir is what we use in nouveau
18:11 karolherbst: pmoreau: 17GH/s :3
18:12 karolherbst: the problem is just, that nothing is sent out for confirmation so I have no idea if we calculate the stuff correctly or not
18:16 fox_: is nvir part of the driver?
18:18 karolherbst: it is within mesa
18:18 karolherbst: usually it is called nv50ir
18:18 karolherbst: but
18:25 fox_: but?
18:37 karolherbst: well it kind of makes sense to rename it, but nobody really goes ahead and just does it
18:38 karolherbst: pmoreau: I think it really works...
18:42 karolherbst: well, I am sure it doesn't though
19:55 karolherbst: pmoreau: I think we can't do something like this, can we? call binSearch:-1 { $r0 $r8 $r7 $p0 $r5 $r6 $r3 } $r0d $r2 $r3 $r4 (8)
19:59 RSpliet: how many registers does the call instruction encode? :-O
20:00 karolherbst: dunno
20:00 karolherbst: none?
20:00 karolherbst: I think it really only encodes the target address
20:00 karolherbst: but this also fails
20:01 tobijk: if its the same call i happen to know its two registers :)
20:01 tobijk: OP_CALL
20:01 RSpliet: Yeah, that sounds more like "call". Where does it store the return address? Stack or register?
20:02 karolherbst: pmoreau: and I found your issue within the branching stuff, just need to track it down for real now
20:02 tobijk: %r0 and %r1 if i remember right
20:02 karolherbst: RSpliet: stack
20:02 karolherbst: it is call
20:02 karolherbst: not bra
20:04 RSpliet: karolherbst: RISC-V jl (aliased call) stores the return address in $r1 (aliased "RA" in clang for return address). bra doesn't store a return address at all
20:06 RSpliet: Didn't want to make assumptions, but afaik NVIDIA GPUs have a shared call-stack and no scalar registers, so the stack does make more sense.
20:06 karolherbst: well, if you store it in a register, that is usually ABI defined anyway, right? Except those instructions indeed just store it inside a hard coded register
20:07 karolherbst: which would be odd, because you usually want the ABI to define this
20:08 RSpliet: Nah, having "call" always pick a fixed reg makes sense, shorter encoding/more bits free for immediate offset.
20:08 karolherbst: RSpliet: but that also means, that it is a totally braindead idea to use that $r1 for anything different at all, right? Except like you want to mess with that return address
20:09 RSpliet: karolherbst: you can use it as a temp if you need to store the value to the stack manually anyway (for instance, if you need to make some calls of your own)
20:09 tobijk: karolherbst: thats the tricky relocating part if you encounter a call :)
20:10 karolherbst: RSpliet: I mean right, but then you have some kind of ABI which defines: at return that register has to have the same value as when the function was called
20:12 RSpliet: Well no, you know that's never going to happen, which is why the only sane thing you can define in an ABI is "it's callee-saved"
20:12 RSpliet: err
20:13 RSpliet: caller-saved
20:13 karolherbst: but are you sure, that there is nothing else on RISC-V than jl to "call" functions? like JAL which saves it inside %rd?
20:13 karolherbst: because this is what you probably want to use
20:14 karolherbst: although I figure rd == r1?
20:15 karolherbst: but "The standard software calling convention uses x1 as the return address register and x5 as an alternate link register."
20:15 karolherbst: this sounds like a convention to me
20:17 RSpliet: Heh, I stand corrected, you can specify the register. It's a fixed register on ARM (R14)... might be a patented thing
20:17 karolherbst: yeah well, you usually have enough space in your encoding anyway
20:18 karolherbst: 19 bits for the offset in JAL
20:18 karolherbst: still
20:18 karolherbst: uhm 20 even
20:19 karolherbst: they even wasted two bits, because you have to align that to 0x4 yourself
20:20 karolherbst: could have just used two more bits to encode an address
20:22 RSpliet: depends on the targeted hardware. These atmel avr things have an 8-bit ISA, thumb used to be mixed-length but generally 8- or 16-bit. I think despite getting the details wrong the original point that "where to store the return address on a call" is a design decision rather than inherent to the insn very much stands
20:24 karolherbst: well right, but I was talking about RISC-V where you have a 0x4 aligned address for jumps
20:26 RSpliet: "The jump and link (JAL) instruction uses the J-type format, where the J-immediate encodes a signed offset in multiples of 2 bytes."
20:28 RSpliet: "Note that the JALR instruction does not treat the 12-bit immediate as multiples of 2 bytes, unlike the conditional branch instructions. This avoids one more immediate format in hardware. In practice, most uses of JALR will have either a zero immediate or be paired with a LUI or AUIPC, so the slight reduction in range is not significant"
20:28 RSpliet: So that's a hardware trade-off apparently :-)
20:29 karolherbst: yeah
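A minimal sketch of the JAL immediate layout discussed above, assuming the standard RV32 J-type format (opcode 0x6f, rd in bits 11:7, offset bits scattered as imm[20|10:1|11|19:12] across bits 31:12). The helper names encode_jal and decode_jal are made up for illustration. It shows that the 20-bit field stores offset bits 20..1 and leaves bit 0 implicit, so only one low bit is dropped rather than two.

```python
def encode_jal(rd, offset):
    """Encode a RISC-V JAL instruction word (hypothetical helper)."""
    if offset % 2 != 0:
        raise ValueError("JAL offsets are multiples of 2 bytes")
    if not -(1 << 20) <= offset < (1 << 20):
        raise ValueError("offset does not fit in the signed 21-bit range")
    imm = offset & 0x1FFFFF  # 21-bit two's complement, bit 0 is always 0
    return (
        ((imm >> 20) & 0x1) << 31      # imm[20]
        | ((imm >> 1) & 0x3FF) << 21   # imm[10:1]
        | ((imm >> 11) & 0x1) << 20    # imm[11]
        | ((imm >> 12) & 0xFF) << 12   # imm[19:12]
        | (rd & 0x1F) << 7             # link register
        | 0x6F                         # JAL opcode
    )


def decode_jal(word):
    """Return (rd, signed byte offset) for a JAL instruction word."""
    imm = (
        ((word >> 31) & 0x1) << 20
        | ((word >> 21) & 0x3FF) << 1
        | ((word >> 20) & 0x1) << 11
        | ((word >> 12) & 0xFF) << 12
    )
    if imm & (1 << 20):  # sign-extend from bit 20
        imm -= 1 << 21
    return (word >> 7) & 0x1F, imm


print(hex(encode_jal(1, 8)))           # jal x1, 8
print(decode_jal(encode_jal(5, -4)))   # x5 as the alternate link register
```

With bit 0 implicit, the reachable range is -2^20 to 2^20 - 2 bytes around the instruction, and rd is a free 5-bit field, which is why x1 as the return address register is a convention and not part of the encoding.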
20:35 karolherbst: pmoreau: I am quite sure, that by definition a return is always a back edge
20:40 karolherbst: ohhhhh
20:41 karolherbst: I don't think that RA is much at fault here, more like we have a builtin
20:42 karolherbst: why do we have an ABS builtin anyway?
20:43 karolherbst: oh wait
20:43 karolherbst: the value returned is abs
20:43 karolherbst: and it is calling div
20:43 tobijk: and input is %r0 and %r1 :)
20:43 karolherbst: right
20:43 karolherbst: mhh
20:44 karolherbst: I think RA doesn't check that
20:44 tobijk: it does imho
20:44 karolherbst: maybe we should make RA like: if builtins are called, never use the regs of those calls
20:44 karolherbst: mhh
20:44 karolherbst: I think it does not do a good job then, because the cfg looks fine
20:45 tobijk: never said it does a good job :)
20:45 tobijk: it just relocates the values of r0/1 to somewhere else
20:45 tobijk: avoiding to populate those regs is the way to go , as you said
20:46 karolherbst: mhh, I think this isn't the issue though, let me dig deeper
20:46 karolherbst: uhhh
20:54 karolherbst: pmoreau: I write a compute shader with the exact thing you have and maybe it will produce basically the same nvir...
20:57 levrano: I'm quite sorry for being such an ass, got lost myself too today again, can't remember my notes
21:07 levrano: it kinda more seems that there are no index lanes for the addition side case on indirect addressing, almost seems like it only has the size as address reg
21:55 karolherbst: pmoreau: I think I am super near to understand the issue why the value isn't considered live anymore :)
22:01 karolherbst: pmoreau: okay, here is the deal: the buildLiveSets code follows the outgoing edges in a depth-first search. The chain is BB:0 => BB:2 => BB:4 => BB:6 => BB:7 => BB:8 (aborts, because BB:2 would follow now), so it checks what is getting used in BB:8 and tracks the path back to calculate live values. Now it continues with the other paths: BB:4 => BB:5 (aborts, because BB:7 was already checked), then BB:2 => BB:3 => BB:9 => BB:1. Now while
22:01 karolherbst: checking BB:9 the value is marked as live. But what no longer happens, although the value was live on the entire chain, is that it gets propagated through the branches that were checked first
22:02 karolherbst: so if a new live value appears from BB:2 => BB:9, we have to mark it live on the entire other chain as well BB:2 => BB:4 => BB:6 => BB:7 => BB:8 => BB:2
22:03 karolherbst: the value gets to be live for BB:8 again, because this is an input of BB:9 and that is what is checked against somewhere
22:28 karolherbst: yep
22:28 karolherbst: our fault is that we check every BB only once
22:30 tobijk: is this still represented as a net or is it a tree by breaking the circular dependencies?
22:31 karolherbst: it shouldn't break
22:31 karolherbst: basically you go around the same path until the in/outs of a BB don't change
22:36 tobijk: karolherbst: yeah right, and we really only visit BB's once? O.o
22:36 karolherbst: yeah
22:36 tobijk: might work for graphics shaders most of the time though :D
22:36 karolherbst: yeah
22:36 karolherbst: if you don't loop, it works
22:37 karolherbst: well
22:37 karolherbst: and then it depends in which order you traverse the tree
22:37 karolherbst: tobijk: our issue is, that we have a value defined before the loop and used after the loop
22:37 karolherbst: ;)
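A minimal sketch of the "go around until the in/outs of a BB don't change" approach, assuming the CFG edges implied by the chains quoted at 22:01 and a single value "v" defined in BB:0 (before the loop) and used in BB:9 (after it). The function live_sets and the dict layout are illustrative only and are not the nv50ir code.

```python
def live_sets(succs, use, defs):
    """Iterate the backward liveness equations until nothing changes.

    live_out[b] = union of live_in[s] over all successors s of b
    live_in[b]  = use[b] | (live_out[b] - defs[b])
    """
    live_in = {b: set() for b in succs}
    live_out = {b: set() for b in succs}
    changed = True
    while changed:  # blocks are revisited until a fixed point is reached
        changed = False
        for b in succs:
            out = set()
            for s in succs[b]:
                out |= live_in[s]
            new_in = use.get(b, set()) | (out - defs.get(b, set()))
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out


# Edges assumed from the log: loop 2 -> 4 -> {6, 5} -> 7 -> 8 -> 2, exit 2 -> 3 -> 9 -> 1.
succs = {0: [2], 2: [4, 3], 4: [6, 5], 5: [7], 6: [7], 7: [8], 8: [2],
         3: [9], 9: [1], 1: []}
live_in, live_out = live_sets(succs, use={9: {"v"}}, defs={0: {"v"}})
print(sorted(b for b in succs if "v" in live_in[b]))
```

Because BB:8 branches back to BB:2, the value becomes live across the whole loop body once the liveness found via BB:9 reaches BB:2, which is the propagation a single visit per block misses.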
22:38 tobijk: compiler live-range 1.0.1 that is ;-)
22:38 tobijk: oh well yeah the live-range calculation was always buggy :D
22:39 tobijk: karolherbst: i imagine you can come up with a better solution if you put time into it, yet proving it right is hard :/
22:39 karolherbst: tobijk: but I think this can be done without visiting a BB twice
22:40 karolherbst: what we basically need is this: if we detect a circle and this is pretty easy, we just need to apply all live ranges to the circle if there are some jumping over that circle
22:41 karolherbst: like if you have A -> B -> C and B -> D -> E -> B as paths, if you detect your B -> ... -> B circle you have to apple all live values from A -> ... -> C to that circle as well
22:42 karolherbst: *apply
22:42 karolherbst: line 567: "(bn == bb)" this is the loop detection, we already do this, but then we simply skip
22:43 tobijk: mh not sure if "apply all live ranges unconditionally" is completely right, yet better than simply skipping
22:44 karolherbst: it is correct
22:44 tobijk: i must admit i have to read up on that matter..
22:44 karolherbst: because if you define something in A and use it in C
22:44 karolherbst: this value is still live in D
22:44 karolherbst: because you could go like this: A -> B -> D -> E -> B -> C
22:45 karolherbst: if you would have another Edge like E -> F
22:45 karolherbst: then the value would be live for E
22:45 karolherbst: but not for F, because you can't go from F to C
22:45 tobijk: yeah ok
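The A/B/C/D/E/F example above can be checked with a per-value backward walk from the use, sketched here under the assumption that the value has one definition and that any use in the defining block comes after it. This is a standard alternative to the fixed-point iteration, not what buildLiveSets does, and live_blocks is a made-up name. Each block is expanded at most once per value.

```python
def live_blocks(edges, def_block, use_blocks):
    """Return (live_in, live_out) block sets for one value."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, []).append(src)

    live_in, live_out = set(), set()
    worklist = list(use_blocks)
    while worklist:
        b = worklist.pop()
        # Stop at the definition and at blocks that were already handled.
        if b == def_block or b in live_in:
            continue
        live_in.add(b)
        for p in preds.get(b, []):
            live_out.add(p)  # the value must survive the end of p
            worklist.append(p)
    return live_in, live_out


# A -> B -> C, the circle B -> D -> E -> B, and the extra edge E -> F.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E"), ("E", "B"), ("E", "F")]
live_in, live_out = live_blocks(edges, def_block="A", use_blocks=["C"])
print(sorted(live_in), sorted(live_out))
```

D and E end up live because C is still reachable from them through B, while F never appears since no path leads from F back to the use in C.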
22:52 karolherbst: that return false in RegAlloc::buildLiveSets is funny as well
22:53 karolherbst: pretty useless
22:55 karolherbst: wondering how we have to deal with situations like this: A -> B -> C -> D and B -> E -> D and A -> F -> C and D -> G -> A
22:56 karolherbst: but this is so plain evil, I would assume everybody writing branched code like this has just one goal: annoy compiler devs
22:57 tobijk: heh
22:59 tobijk: karolherbst: but with the chains you posted there, just calculate live-ranges bottom up
22:59 tobijk: (as we try to do already)
23:00 karolherbst: the issue is, we don't deal with circles at all
23:00 tobijk: thats the (bn == bb)=?!
23:00 karolherbst: yeah
23:00 karolherbst: as I said: we don't deal with it
23:00 karolherbst: we ignore it
23:03 tobijk: mh not sure
23:03 karolherbst: the issue isn't within the chain of the loop
23:03 karolherbst: the issue is inside the other branches of bb
23:03 karolherbst: uhm
23:03 tobijk: imho if you have A->B->A then bb=A and bn=A here which it skips
23:03 karolherbst: or wait
23:04 karolherbst: no, I am silly, this thing detects A -> A things
23:04 tobijk: yep
23:04 karolherbst: bb->liveSet.marker detects loops
23:05 karolherbst: but mhh
23:07 karolherbst: I will debug through it
23:07 karolherbst: maybe I see something
23:10 karolherbst: right, that if (bn->cfg.visit(sequence)) does the loop control thing
23:23 tobijk: that "n" is shitty
23:25 karolherbst: ;)
23:25 tobijk: sparing it all together and initializing the liveset with bb->liveSet.fill(0); and later only do bb->liveSet |= bn->liveSet; seems more sane
23:26 karolherbst: mhh
23:26 karolherbst: then you loop forever
23:26 tobijk: ? you can still set the marker
23:27 karolherbst: doesn't solve the issue regarding looping forever
23:27 karolherbst: and this shouldn't even be the reason to abort the loop in the first pace
23:27 karolherbst: *place
23:28 tobijk: karolherbst: right, yet it is overly complicated :>
23:28 karolherbst: well that part is just not entirely correct
23:30 karolherbst: what I am not sure of is, if it is fundamentally broken or just has a little bug
23:32 tobijk: karolherbst: as far as i looked at the code it is mostly sane, but i havent looked at the OP_PHI thing
23:34 tobijk: your example would help though (if you dont find the bug yourself)
23:56 karolherbst: tobijk: https://gist.githubusercontent.com/karolherbst/87660eced27ac6ead0f4f68a41f043f2/raw/3cf4f0c508442b03743249c73ea010cf12f2e91a/gistfile1.txt
23:56 karolherbst: the live range of %14 is wrong
23:59 tobijk: lemme check...