00:10karolherbst: imirkin: mind reviewing https://github.com/karolherbst/mesa/commit/94ee8fb7749463dc1704bcab271fe4774147436e and https://github.com/karolherbst/mesa/commit/35839cadd4d83510a439394e2bb6b46cd00d8fff ?
00:16imirkin: karolherbst: looks fine
00:18karolherbst: k.. I am too lazy to do it for all the other arrays in there...
00:18karolherbst: but I guess we should
00:24imirkin: up to you
11:52karolherbst: mhh.. CTS with Turing doesn't look so bad.. texture gather is broken...
11:52karolherbst: and some shading_language_420pack fails..
14:17rak-zero: hey guys, im looking for an opencl card for my libre g41m system
14:18rak-zero: it should do cuda 3.5 or upwards and be power efficient / silent
14:21karolherbst: rak-zero: depends on what you want to use. If you plan to use nouveau then you will be out of luck for quite some time, if you want to use cuda, you are out of luck anyway :p
14:21karolherbst: and we usually don't give recommendations of what to buy as we can't be sure that potential bugs are easily fixed by us
14:22rak-zero: i do simple AI stuff and want to stay libre
14:22rak-zero: K620 looks promising for price / cuda 5.0
14:22rak-zero: if i wont be able to use the cores ill go cpu only for now, thats fine
14:22imirkin: rak-zero: no cuda with nouveau
14:22karolherbst: rak-zero: yeah.. well... CL support is "broken" as of right now
14:22karolherbst: and no cuda
14:22rak-zero: cpu only it is
14:23karolherbst: nvidia isn't the only vendor doing CL :p
14:23karolherbst: intel and AMD have open source stacks
14:23rak-zero: is there a limitation on how many dimensions my matrix can have?
14:24imirkin: unfortunately no one's come up with an actual universal turing machine
14:24imirkin: so always limitations.
14:24rak-zero: so its 3D matrix multiplications for now?
14:24imirkin: it's whatever you want
14:24imirkin: but there's only so much vram to go 'round
14:25rak-zero: but the speedup is designed for AI / 2D or graphics 3D, right?
14:25karolherbst: imirkin: ever saw this error? https://gist.githubusercontent.com/karolherbst/fcc4d5c51fd55b733b14b1f4c695e4f5/raw/18fb07aadcce16fa71e18eab1a8a582b773dd66a/gistfile1.txt
14:25imirkin: kernel execution is done on a 3d grid, but that's largely artificial
14:25macc24: rak-zero: maybe your usecase might work on compute shaders?
14:25imirkin: rak-zero: it's just a super-parallel computation thing
14:25imirkin: it can accelerate whatever you like, as long as it can be done in parallel without too much control flow
14:26karolherbst: macc24: the thing is.. if somebody looks into tesla cards.. probably they also look for perf :p
14:26rak-zero: my usecase is very flexible, i want to know what the hardware limitations are
14:26karolherbst: ohh wait
14:26karolherbst: k620 is quadro
14:26karolherbst: my mistake :D
14:26imirkin: karolherbst: that most likely means the GTF test is using a core context, but doesn't have a VAO bound
14:26karolherbst: anyway.. power management is still "WIP" in nouveau
14:26karolherbst: imirkin: yeah.. weird
14:26rak-zero: im on graphs for entity management and its pushing my infrastructure to the limits
14:26imirkin: karolherbst: i think this might happen if you force the GL version wrong?
14:26karolherbst: imirkin: I don't force it at all
14:27rak-zero: i would like to adapt my model to the current hardware limits
14:28imirkin: rak-zero: unless you're willing to make a _substantial_ investment in R&D, you're unlikely to highly accelerate a complex problem like that with GPUs
14:28imirkin: it's not just a magic "go faster" switch
14:29rak-zero: i know
14:29rak-zero: i want to know what dimension in matrix computation the cores are optimized for
14:29karolherbst: doesn't matter
14:29imirkin: they're optimized for doing 32 multiplies in parallel
14:29rak-zero: thats actually great news
14:29imirkin: doesn't matter what values they're multiplying
14:30rak-zero: they solve 2Dx2D ?
14:30karolherbst: rak-zero: you need to write the code yourself anyway I guess
14:30imirkin: not sure i understand that question
14:30karolherbst: in the hardware it's just scalar computations
14:30karolherbst: more or less
14:30karolherbst: unless you cheat and do tensors :p
14:30imirkin: karolherbst: that's an illusion
14:30rak-zero: i dont wanna do tensors
14:30karolherbst: imirkin: I meant the inputs
14:30imirkin: it's all fairly vectorized
14:31rak-zero: it feels like a framework
14:31imirkin: but yeah, the input program looks all scalar
14:31imirkin: but if you treat it as scalar, then it runs slow
14:31rak-zero: not much to learn from learning API calls
14:31karolherbst: rak-zero: there are native tensor cores and stuff
14:31karolherbst: and they are quite fast
14:31karolherbst: but you have to sacrifice precision
14:31karolherbst: I don't know if there is any open source stack for it yet?
14:31karolherbst: maybe AMDs ROCm supports tensor stuff already?
14:31karolherbst: no idea
14:32rak-zero: its all optimized for use cases
14:32karolherbst: rak-zero: I meant you can literally use the tensor cores yourself and use the instructions in ptx
14:32karolherbst: no need for frameworks or anything
14:33rak-zero: is a tensor core optimized for n levels of hidden neurons?
14:33imirkin: rak-zero: frameworks might be optimized for certain use-cases
14:33imirkin: but the underlying hw doesn't give a shit
14:33karolherbst: rak-zero: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions
14:33karolherbst: no clue on how to use them
14:34karolherbst: and they don't have to map 1:1 to hw instructions
14:34rak-zero: @karolherbst thanks for the link
14:34karolherbst: just.. that cuda has some direct access to that
14:34karolherbst: maybe ROCm has something similar.. no clue
14:34karolherbst: or intel
14:35karolherbst: they added some nice visualizations
14:36karolherbst: but anyway
14:36karolherbst: none of that is supported with nouveau :p
14:36macc24: rak-zero: can you do your workload on compute shaders?
14:37imirkin: right, so this is a "framework" sort of thing where they recognized that a lot of people want to multiply matrices, so they supply the "optimal" way to do it built-in
14:37rak-zero: you can abstract n-dimensional work into anything else, be it shaders or cuda cores
14:37rak-zero: i want to know what is fastest, so what to learn
14:37imirkin: since it's non-trivial to do it in a way that plays in the best way possible with the gpu's internal requirements
14:38macc24: rak-zero: write your software so number of dimensions can be a variable (:
14:38rak-zero: its not a variable
14:38karolherbst: imirkin: what are you referring to?
14:38imirkin: the warp-level matrix multiply thing
14:38rak-zero: you calc in n dimension space for property graph
14:38karolherbst: ohh those instructions do exist
14:38karolherbst: and they are super weird
14:38rak-zero: but for representation / viewing you need to cut down to n spaces
14:38karolherbst: imirkin: those are the instructions to the tensor cores essentially
14:38imirkin: news to me.
14:39imirkin: ah, ok
14:39karolherbst: imirkin: all the H prefixed ones
14:39karolherbst: HADD2, HMMA, etc...
14:40karolherbst: ohh wait
14:40karolherbst: those are half prec...
14:40karolherbst: ohh weird.. they don't list them in the doc
14:41imirkin: H* are the half ones, yea
14:41karolherbst: I wrote some ptx files once to figure out what they compile to
14:41karolherbst: weird shit
14:47rak-zero: i know i need a P100 to have the facebook chatbot at my hands
14:47karolherbst: imirkin: actually.. HMMA is the tensor one
14:48rak-zero: i want to do stuff at the order of magnitude like wikidata
14:48karolherbst: and only HMMA
14:50karolherbst: the 884 is for the matrix configuration I think
14:50karolherbst: m16n16k16 -> 884
14:50karolherbst: anyway.. I actually have no idea on how it all works
14:51karolherbst: just that the instruction is super weird
14:52karolherbst: anyway.. that's new with volta anyway
15:00karolherbst: imirkin: ahhhhh.... I know why that CTS stuff fails...
15:00imirkin: the GTF ones?
15:00imirkin: because you're using gbm?
15:01karolherbst: you need to set GLCTS_GTF_TARGET ...
15:01karolherbst: it's always this. :D
15:02karolherbst: now it works
15:12karolherbst: heh.. no gtf fails, but 34 normal ones
15:17imirkin: can't win 'em all
15:20karolherbst: that's with turing though :p
15:22imirkin: well, could be some real issues in there, i might assume
15:22imirkin: textures are hard.
15:22karolherbst: the texture gather tests are passing when I run them alone :(
15:23karolherbst: ohh.. only the first passes..
15:23karolherbst: okay.. at least something
15:23karolherbst: imirkin: did you add anything in regards to sample shading btw?
15:24imirkin: i mean ... sample shading works?
15:25imirkin: not sure what the question is
15:25karolherbst: well.. not on turing it seems
15:25imirkin: probably missed a spot
15:25imirkin: implementing it was a whole thing back when i did it
15:26karolherbst: ohh some sample shading tests are passing..
15:27karolherbst: .none variants work
15:27karolherbst: .full .half don't
15:27karolherbst: any idea what that means?
15:27imirkin: none = no sample shading
15:27karolherbst: ahh well
15:27imirkin: look at glMinSampleShading
15:27imirkin: or whatever that function is called
15:27imirkin: basically imagine you have 100 samples
15:28imirkin: the API allows you to say "shade x fraction of them"
15:28imirkin: so for glMinSampleShading(0.1) you should shade 10 of the 100
15:28imirkin: (at least 10)
15:28karolherbst: ohh, I see
15:28imirkin: nvidia is the only hw i'm aware of that supports this btw
15:28imirkin: other hw, i think, is all-or-nothing
15:29imirkin: so anyways, half means you have a 4x buffer and you shade 2 samples and copy those results to the other 2 samples
15:29imirkin: in some undefined manner
15:29imirkin: so the hw has a "shading samples" parameter which says how many samples should be shaded
15:29imirkin: which can be less than the total number of samples
15:29karolherbst: the emitter needs to support that?
15:30karolherbst: fixup stuff
15:30karolherbst: we don't do that with the volta isa :)
15:30karolherbst: as it seems
15:30imirkin: so ... with the fixups
15:30imirkin: there's a super-awkward case
15:31imirkin: which is that a particular shader may use gl_SampleMask
15:31imirkin: er, gl_SampleMaskIn
15:31imirkin: and this works both with per-sample and not-per-sample shading
15:31imirkin: for not-per-sample it is the full pixel sample mask, as one might expect
15:31imirkin: but in the per-sample case, it's a bitmask with *JUST THAT SAMPLE*
15:32imirkin: but of course the sysval just contains the pixel mask
15:32imirkin: so the fixup is to deal with that stupid shit
15:32karolherbst: I see
15:32karolherbst: let's check if that is actually hit
15:32imirkin: so i emit some bugs instruction
15:33imirkin: that gets replaced with the sample id "AND"
15:33imirkin: in the fixup
15:34karolherbst: mhh.. okay.. at least the fixup case isn't hit in the CTS
15:35karolherbst: asserts disabled.
15:35imirkin: s/bugs/bogus/ btw
15:35Lyude: imirkin: btw, skeggsb figured out why the i2c stuff was happening (I was sort of right-seems we just need to wait for longer)
15:36Lyude: gonna write some patches up sometime today
15:36Lyude: (if he didn't get to that already :)
15:36imirkin: Lyude: cool!
15:38karolherbst: okay.. so yeah
15:38karolherbst: we need this fixup shit
15:39karolherbst: oh well..
15:39karolherbst: why does GL have to be so broken :p
15:39imirkin: well, i probably could have done it the other way which didn't require the fixup
15:39imirkin: in the common case
15:39imirkin: but since the fixup is sometimes required, i did the most straightforward way
15:40imirkin: i forget what other fixups need to be done on nvc0
15:40imirkin: there are a few for nv50
15:40imirkin: like some alphatest thing, maybe something else
16:54karolherbst: uhh.. I forgot that volta also got rid of carry bits
16:56imirkin: but they were so yummy
16:56imirkin: how do you do carry bits now?
16:57imirkin: ah, so like ADD writes to a predicate?
16:57imirkin: and consumes a predicate?
16:57imirkin: makes sense
16:57karolherbst: with turing you also have uniform predicates
16:57karolherbst: so that's one advantage
16:58karolherbst: no clue how the uniform reg/pred stuff makes sense on hw though.. I have _some_ theories but.. mhh
16:58karolherbst: it still doesn't make much sense
16:58karolherbst: and I don't think it's just a power efficiency thing
16:59karolherbst: imirkin: any wild theories how that uniform reg/pred stuff could work?
16:59imirkin: i think i'd need more info
16:59imirkin: like is there a bit that's set
16:59imirkin: whereby you pinky-swear that the predicate is uniform?
16:59karolherbst: uniform reg/preds are just like normal ones, just that they are uniform :p
17:00imirkin: or is it a different register file?
17:00karolherbst: so if all your inputs are the same in each thread
17:00karolherbst: you can write into a uniform reg
17:00imirkin: so it's a different register file?
17:00imirkin: so then that's simple
17:00imirkin: instead of being per-"thread"
17:00imirkin: it's per-"warp"
17:00karolherbst: but where does the benefit come from?
17:00imirkin: and there's only one set of values for the warp
17:00imirkin: so it's uniform-by-default
17:00karolherbst: they don't do it without a reason ;)
17:00imirkin: coz then you don't have to push/pop masks
17:01imirkin: you know that all threads are either on or off
17:01karolherbst: but with a normal add you won't have to either?
17:01imirkin: and you don't have to play tricks with vote all like the earlier gens do
17:01imirkin: you'd use the uniform predicates to predicate instruction execution or a jump or something
17:01karolherbst: so your iadd can either write to a reg or an uniform reg in case the inputs are uniform
17:01karolherbst: those are just normal thingies
17:01imirkin: a carry bit in a uniform reg makes zero sense
17:02karolherbst: why not?
17:02karolherbst: if the inputs are all uniform?
17:02karolherbst: you can write into a uniform predicate
17:02imirkin: let me rephrase
17:02karolherbst: as the result is the same across all threads
17:02imirkin: storing a carry bit into a uniform reg makes zero sense
17:02karolherbst: that's not the purpose of those
17:02imirkin: since it assumes uniform inputs
17:02imirkin: which is not how these things are generally done
17:02karolherbst: nvidia uses those aggressively
17:02karolherbst: for everything
17:03karolherbst: you load two uniforms, add them you write into a uniform reg
17:03imirkin: well either i'm not seeing it, or you're explaining it slightly wrong
17:03karolherbst: and consume it in normal operations writing to non uniform ones
17:03imirkin: ok sure
17:03imirkin: but like
17:03imirkin: how often does that happen?
17:03imirkin: that you do math on uniform values
17:03karolherbst: often enough so that nvidia saw a benefit of adding it
17:03imirkin: outside of piglit :)
17:03imirkin: i guess
17:03karolherbst: I am just wondering how that works on hw
17:03karolherbst: like.. do they execute the instruction just once per block?
17:04karolherbst: or is the reg space savings a big enough benefit?
17:04karolherbst: maybe executing it just once saves enough power to remain at lower temps?
17:04karolherbst: no clue
17:04karolherbst: the space savings sounds like the most plausible one... but still...
17:06imirkin: doubt it's about reg space
17:06imirkin: more about only having to run the op once per warp
17:07imirkin: which means you can use a diff dispatch logic
17:07karolherbst: imirkin: kernel doing a simple add from ubos: https://gist.githubusercontent.com/karolherbst/53649f86e041c97709359ec1b06579eb/raw/422dec359b021fddd3e66557d5e11b783b224e6e/gistfile1.txt
17:07karolherbst: but they are not uniformly accessed
17:07karolherbst: but the work group id is uniform :p
17:07imirkin: oh i see
17:07imirkin: it's not a uniform predicate file
17:08imirkin: it's a uniform full 32-bit register file
17:08karolherbst: both exists
17:08imirkin: i believe this is similar to AMD's SGPR vs VGPR
17:08karolherbst: 8 uniform predicates and.. 64 u regs?
17:08karolherbst: or 16..
17:08karolherbst: not quite sure
17:08imirkin: where VGPR is equivalent to regular registers
17:08karolherbst: imirkin: yeah.. I think so
17:10karolherbst: just wondering where the benefit is in having something like that
17:10karolherbst: as I don't see the inherent perf benefit here.. space savings and stuff sure.. but I wonder if that actually matters all that much
17:12karolherbst: the volta ISA is weird
17:12imirkin: i expect it does matter
17:12imirkin: esp if e.g. you can feed one of these regs to a texture LOD/BIAS/etc argument
17:13imirkin: it avoids having to recalculate the effective LOD for each lane
17:13karolherbst: e.g. OP_SET can't write to regs anymore
17:13imirkin: well, that was always a dubious feature
17:13imirkin: one i availed myself of since it was there :)
17:13karolherbst: true :p
17:14karolherbst: selp can write to a 32 bit reg though :p
17:14karolherbst: so we lower set to set + selp
17:16karolherbst: imirkin: also.. in fps we have to mark the outputs now...
17:16karolherbst: or something...
17:34imirkin: or SET + AND
17:34imirkin: er i guess that won't work, nevermind
17:41karolherbst: damn ssbos..
17:41karolherbst: something is fishy
17:41karolherbst: ssbos are kind of busted
17:47imirkin: can't be _that_ busted if such a large fraction of CTS passes
17:47imirkin: but if you want more details, you can run dEQP 3.1 tests, which are very detailed
17:47imirkin: and produce much better "what is wrong" info than CTS
17:51karolherbst: ahh.. right