00:00 karolherbst: imirkin: what about the fix regarding the tex assertion?
00:01 imirkin: still haven't tested it
00:01 imirkin: once i do, good to go.
00:01 karolherbst: k
00:03 karolherbst: I currently fear I can only get crappy internet here :( I hoped for a 400MBit connection, but this house doesn't have the right cables for that....
00:03 imirkin: :(
00:04 karolherbst: maybe I can get the VDSL 100/40 thing though
00:04 karolherbst: would be good enough
00:04 karolherbst: I doubt that though, and I guess I can only get 25 max or something
00:04 imirkin: in hamburg now?
00:04 Megaf: imirkin: does this help in any way? https://www.flashrom.org/pipermail/flashrom/2013-March/010716.html
00:04 karolherbst: imirkin: yes
00:05 karolherbst: imirkin: building is from 1956 or so :D
00:05 Megaf: maybe you could find something from the rom...
00:05 karolherbst: imirkin: actually it is even older, but it got destroyed in WW2 and then rebuilt
00:05 imirkin: Megaf: that just enables it same way that mcp77/etc are
00:09 karolherbst: imirkin: sent to the ML
00:09 karolherbst: the stats are nice :)
00:09 imirkin: karolherbst: nice
00:09 karolherbst: I guess tomb_raider will benefit mostly from it
00:10 karolherbst: cause this game already runs good enough and I think we don't have silly issues there
00:12 karolherbst: imirkin: ohh, with f1 one less "nvc0_program_translate:609 - shader translation failed: -4" as well
00:12 imirkin: well, that's probably just coincidence due to RA jiggering
00:12 karolherbst: yeah
00:12 karolherbst: but still
00:12 imirkin: karolherbst: btw, you should fix your name in your git config
00:12 karolherbst: let me check
00:13 imirkin: or your git send mail
00:13 imirkin: it was lowercase
00:13 karolherbst: mhhh
00:13 karolherbst: "git config user.name" prints "Karol Herbst"
00:13 imirkin: well, in the email you're "karol herbst"
00:14 karolherbst: odd
00:14 karolherbst: git send-mail output: "From: Karol Herbst <karolherbst@gmail.com>"
00:14 imirkin: https://patchwork.freedesktop.org/patch/144825/
00:14 karolherbst: maybe gmail messing up?
00:15 karolherbst: no idea where this is coming from
00:16 karolherbst: will investigate whenever I have time
00:16 karolherbst: imirkin: I think this is patchwork messing up
00:17 karolherbst: check the raw email, there it is fine
00:17 imirkin: karolherbst: it was like that in my email too
00:17 imirkin: er wait
00:17 imirkin: no
00:17 karolherbst: hum...
00:17 karolherbst: even on the ML it is fine
00:17 imirkin: yeah, this is patchwork.
00:17 imirkin: nevermind!
00:18 karolherbst: wow
00:18 karolherbst: for TR: -5% ins and -6% gprs
00:19 imirkin: karolherbst: btw, is the 'tomb raider' game in the feral package the right one? or is there a newer one?
00:20 karolherbst: yes, it is the right one
00:20 imirkin: i thought it was "tomb raider 2", no?
00:20 karolherbst: there is a newer one, but it isn't ported as well afaik
00:20 karolherbst: well....
00:20 karolherbst: this is like the 10th tomb raider game or so
00:20 imirkin: oh i see
00:20 karolherbst: tomb raider started on a PS1 or so
00:20 karolherbst: it's ancient
00:21 karolherbst: I think the most official name you can get is "Tomb Raider 2015" or was it 2016?
00:21 karolherbst: I think 2015
00:21 imirkin: hehe ok
00:24 karolherbst: 108: set ftz u8 $p0 lt f32 $r63 $r3 (8)
00:24 karolherbst: 109: $p0 mov u32 $r4 0xffffffff (8)
00:24 karolherbst: 110: not $p0 mov u32 $r4 0x00000000 (8)
00:24 karolherbst: 112: set u8 $p0 ne u32 $r4 $r63 (8)
00:24 karolherbst: ....
00:24 karolherbst: I guess we can do something here as well!
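A minimal C emulation of the four quoted instructions shows why this is foldable. It assumes `$r63` reads as zero, so `lt f32 $r63 $r3` means `0.0f < r3`, and it ignores the `ftz` flag. The predicate is turned into a 0/all-ones mask in `$r4` and then compared against zero again, which gives back the original predicate. The function names are hypothetical, and this is not nouveau code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Emulates the quoted sequence, assuming $r63 reads as zero (ftz ignored here). */
static bool quoted_sequence(float r3)
{
    const uint32_t r63 = 0;                /* assumed zero register */
    bool p0 = 0.0f < r3;                   /* set ftz u8 $p0 lt f32 $r63 $r3 */
    uint32_t r4 = p0 ? 0xffffffffu : 0u;   /* $p0 mov 0xffffffff / not $p0 mov 0 */
    return r4 != r63;                      /* set u8 $p0 ne u32 $r4 $r63 */
}

/* Same predicate without the round trip through $r4. */
static bool folded(float r3)
{
    return 0.0f < r3;
}
```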
00:27 karolherbst: imirkin: nouveau currently can only deal with $p0, right? I think it would be a good next step to enable support for up to 8 predicates and fall back to gprs if no one is available, no?
00:27 imirkin: karolherbst: should work for p0..p6
00:27 karolherbst: and maybe reserve $p0 for branching,
00:27 karolherbst: so that we never mess that up
00:27 imirkin: it's just rare that there's a situation where multiple predicates are live at the same time
00:28 imirkin: (note that p7 == true, always)
00:28 karolherbst: ohh true
00:28 karolherbst: okay, so we have one for branching, one is true and 6 for live values
00:28 karolherbst: should reduce some gpr pressure
00:28 imirkin: $p0 is in no way reserved for branches
00:29 karolherbst: I know
00:29 imirkin: it's just that there's only one predicate live, so p0 gets assigned
00:29 karolherbst: but to make RA simpler
00:29 imirkin: i don't understand what problem you're trying to solve
00:29 karolherbst: to let set write into predicates more often
00:30 karolherbst: and that other instructions could read predicates (more often, if possible)
00:30 karolherbst: there are a lot of those set/slct combinations
00:30 karolherbst: and if slct can read from predicates as well, that would free up a gpr
00:30 karolherbst: (hopefully)
00:30 imirkin: karolherbst: set can write into predicates just fine. RA isn't what's stopping it.
00:30 imirkin: slct can't read from predicates, but SELP can
00:30 karolherbst: ohh I see
00:30 imirkin: but you'd have to convert
00:31 karolherbst: mhh
00:31 karolherbst: so RA can deal with multiple predicates just fine?
00:31 imirkin: sure
00:31 imirkin: just like multiple registers
00:31 imirkin: same thing
00:31 karolherbst: well
00:31 karolherbst: what about branching?
00:31 imirkin: doesn't matter
00:31 imirkin: it's about live values
00:31 imirkin: the live value thing knows about control flow
00:32 karolherbst: I know, but if 7 predicates are already taken and then there is a little branched code
00:32 karolherbst: but if it takes care of that already, then it is fine
00:32 imirkin: it'd have to spill
00:32 imirkin: which it can do into a GPR
00:32 karolherbst: mhhh
00:32 karolherbst: yeah well
00:32 imirkin: same as if it had 100 live predicate values it had to manage
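The following toy sketch illustrates imirkin's point that predicates are ordinary allocatable registers. Seven are allocatable (`$p0`..`$p6`), `$p7` always reads true, and a value that cannot get a predicate register would live in a GPR instead. It is a simplified linear scan written purely for illustration, not how nouveau's register allocator is implemented. `LiveRange`, `assign_predicates` and `SPILLED_TO_GPR` are hypothetical names.

```c
#define NUM_ALLOC_PREDS 7    /* $p0..$p6 allocatable; $p7 always reads true */
#define SPILLED_TO_GPR (-1)

/* Live range of one predicate value: [start, end) in instruction order. */
typedef struct {
    int start;
    int end;
} LiveRange;

/* Toy linear-scan assignment; ranges must be sorted by start.
 * out[i] gets a predicate index 0..6, or SPILLED_TO_GPR if all seven are live. */
static void assign_predicates(const LiveRange *ranges, int n, int *out)
{
    int busy_until[NUM_ALLOC_PREDS] = {0};
    for (int i = 0; i < n; i++) {
        out[i] = SPILLED_TO_GPR;
        for (int p = 0; p < NUM_ALLOC_PREDS; p++) {
            if (busy_until[p] <= ranges[i].start) {   /* register p is free again */
                busy_until[p] = ranges[i].end;
                out[i] = p;
                break;
            }
        }
    }
}
```

Branches need no special handling in this model. A predicated `mov` is just an instruction that reads one extra (predicate) register.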
00:32 karolherbst: but then you can't use the predicated execution thing, right?
00:33 imirkin: huh?
00:33 imirkin: you're treating branching as some special thing
00:33 imirkin: it's not :)
00:33 karolherbst: $p0 mov ...
00:33 imirkin: ftr, that's predication
00:33 imirkin: but still - not some special thing
00:33 karolherbst: I know
00:33 imirkin: it's an instruction which needs regs x, y, z
00:33 imirkin: where one of those is a predicate reg
00:33 karolherbst: but I just think it would be worse to remove that predication and do "real" branching
00:34 imirkin: ok...
00:34 karolherbst: maybe I am wrong, who knows
00:34 imirkin: it'd be a rare situation that all predicate regs would be used up for a block
00:34 karolherbst: true
00:34 imirkin: also ... pretty sure RA handles it all
00:34 karolherbst: k
00:35 karolherbst: mhh
00:35 karolherbst: okay
00:35 karolherbst: maybe I just work on a proper slct -> selp pass
00:36 karolherbst: and then replace the gpr with a predicate for some set/selp combinations
00:36 imirkin: sure.
00:36 karolherbst: can something else read from predicates?
00:37 imirkin: SET
00:37 karolherbst: well, okay
00:37 imirkin: or in nv50 ir, you can use a CVT to go between the things
00:37 karolherbst: except instructions which can also write into predicates
00:37 imirkin: which will generate an appropriate insn
00:38 karolherbst: have to remember what was selp again
00:38 karolherbst: the same as slct, just for bool?
00:39 imirkin: well, predicate
00:39 imirkin: picks one or the other
00:39 imirkin: while SLCT can compare against 0
00:39 imirkin: including float compare, and stuff like "lt", not just eq/ne
00:40 karolherbst: sure, but that's why I said like slct, just for bools ;)
00:40 imirkin: i suspect you can negate the predicate condition
00:40 imirkin: but you can also flip the args :)
00:40 karolherbst: exactly
00:40 karolherbst: pretty useless to be able to negate the input on selp
00:41 imirkin: well, i can imagine times when it's convenient
00:41 imirkin: e.g. you have a const or imm in src1
00:41 karolherbst: uhh
00:41 imirkin: so you can't just flip them around
00:41 karolherbst: right
00:41 imirkin: anyways, not a killer feature by any means
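A rough C model of the SLCT/SELP difference described above. SLCT picks between two sources by comparing a third value against zero (including float compares and conditions like "lt"). SELP picks by a predicate that may be negated. The operand order and the `CmpOp` enum are assumptions for illustration, not the hardware encoding. The `slct_as_set_selp` helper is a hypothetical picture of the slct -> set + selp conversion karolherbst wants to try, which could free the GPR that holds the selector.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical condition codes for comparing a value against zero. */
typedef enum { CMP_LT, CMP_LE, CMP_EQ, CMP_NE, CMP_GE, CMP_GT } CmpOp;

static bool cmp_zero(float c, CmpOp op)
{
    switch (op) {
    case CMP_LT: return c < 0.0f;
    case CMP_LE: return c <= 0.0f;
    case CMP_EQ: return c == 0.0f;
    case CMP_NE: return c != 0.0f;
    case CMP_GE: return c >= 0.0f;
    case CMP_GT: return c > 0.0f;
    }
    return false;
}

/* SLCT-like: selector is a value compared against zero. */
static uint32_t slct(uint32_t a, uint32_t b, float c, CmpOp op)
{
    return cmp_zero(c, op) ? a : b;
}

/* SELP-like: selector is a predicate, optionally negated. */
static uint32_t selp(uint32_t a, uint32_t b, bool p, bool negate_p)
{
    return (p != negate_p) ? a : b;
}

/* slct expressed as "SET into a predicate" followed by selp. */
static uint32_t slct_as_set_selp(uint32_t a, uint32_t b, float c, CmpOp op)
{
    bool p = cmp_zero(c, op);   /* SET writing a predicate instead of a GPR */
    return selp(a, b, p, false);
}
```

Negating the predicate gives the same result as swapping the sources. That matters when one source is an immediate or constant that can only sit in a particular slot.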
00:41 imirkin: mwk: how does gr ctxsw work on tesla wrt the fifo?
00:41 karolherbst: well it is easier not to switch the args around, less checks
00:42 imirkin: karolherbst: yeah, it's just convenient.
00:44 mwk: imirkin: eh?
00:44 imirkin: mwk: so .. you know the thing where fifo randomly gets desync'd on nouveau on tesla's?
00:45 imirkin: mwk: i'm wondering if it could happen due to the fifo doing dma from the wrong mmu
00:45 imirkin: perhaps that makes no sense.
00:46 dboyan_: imirkin: I think I've found something strange with disk cache. Some nvc0_program got retranslated several times when the cache is on.
00:46 imirkin: dboyan_: perhaps you're forgetting to set some bits?
00:51 karolherbst: imirkin: "set u32 $r12 ge f32 $r9 $r12" + "and u32 $r12 $r12 $r16" can this be merged into a set_and?
00:51 imirkin: karolherbst: if $r16 is the result of a set, sure
00:52 karolherbst: it is
00:52 karolherbst: "set u32 $r16 lt f32 $r9 0.500000"
00:52 AndrewR: imirkin, hello. I'm not sure, but may be you wanted to look if pmpeg stuff will show up at all on nv4x on recent kernels (I reported its disappearance some time ago)
00:52 imirkin: karolherbst: i thought there was an algebraic opt for that.
00:52 karolherbst: I thought so too
00:52 imirkin: AndrewR: oh. good one. can you briefly remind me what the problem is?
00:53 imirkin: (i remember it being reported, but i can't remember the details at all)
00:53 karolherbst: AlgebraicOpt::handleLOGOP
00:53 AndrewR: also, I might have not-so-good news about glthread stuff - it _probably_ makes my GPU hang more reliably in wine + 3DMark2001. Not like biggest problem on Earth ....
00:53 imirkin: karolherbst: could be that the code has more "crap" there and it doesn't pick up the thing
00:54 imirkin: AndrewR: yeah, not surprising.
00:54 imirkin: AndrewR: on the bright side, a FX3700 should be getting shipped to me soon, so i'll have a G92 to play with
00:54 imirkin: maybe i'll figure out wtf is wrong with vdpau
00:54 imirkin: (although watch it work out of the box...)
00:55 AndrewR: https://bugs.freedesktop.org/show_bug.cgi?id=99584
00:55 imirkin: AndrewR: thanks!
01:02 imirkin: [11051.541359] nouveau 0000:09:00.0: mpeg: ch 4 [00060530 mplayer[27583]] 01000000 00000010 000001b0 00006051
01:02 imirkin: i get that now
01:02 imirkin: and random green stuff on the screen
01:03 imirkin: xvmcinfo correctly reports the info for the screen
01:03 karolherbst: imirkin: ... guess how I can fix codegen, so it does those ands away?
01:03 imirkin: whack it real hard?
01:03 karolherbst: nope, run AlgebraicOpt again
01:03 imirkin: heh, yeah :p
01:04 karolherbst: in the first run, the both and srcs are slcts...
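This sketch shows why the `set` + `and` pair can become a single `set_and`, but only once both AND sources are SET results. Imirkin states that condition above. A `set u32` yields either 0 or all ones, so ANDing two such masks equals one SET over the ANDed conditions. The log suggests the first AlgebraicOpt pass still saw SLCTs as sources, so rerunning it let the fold fire. The function names are hypothetical. This is not `AlgebraicOpt::handleLOGOP` itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* "set u32" as in the log: all ones for true, zero for false. */
static uint32_t set_u32(bool cond)
{
    return cond ? 0xffffffffu : 0u;
}

/* Unfused pattern from the log: two SETs followed by an AND. */
static uint32_t set_then_and(float r9, float r12)
{
    uint32_t r16 = set_u32(r9 < 0.5f);     /* set u32 $r16 lt f32 $r9 0.5 */
    uint32_t mask = set_u32(r9 >= r12);    /* set u32 $r12 ge f32 $r9 $r12 */
    return mask & r16;                     /* and u32 $r12 $r12 $r16 */
}

/* What a fused set_and computes. Valid only because both AND sources
 * are 0 / all-ones masks produced by SET. */
static uint32_t fused_set_and(float r9, float r12)
{
    bool r16_true = r9 < 0.5f;
    return set_u32((r9 >= r12) && r16_true);
}
```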
01:04 imirkin: AndrewR: that said, this is a NV44A, which is a different mpeg class than yours =/
01:05 AndrewR: imirkin, one fixed family of cards better than none
01:05 imirkin: AndrewR: dunno what's wrong... probably something dumb :(
01:06 karolherbst: wuhu "total local used in shared programs : 6865 -> 6457 (-5.94%)"
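The percentages in shader-db summary lines like this one are consistent with a plain relative change, `(after - before) / before * 100`. For example, 6865 -> 6457 gives about -5.94%. Below is a tiny helper under that assumption. `percent_change` is a hypothetical name, not part of shader-db.

```c
/* Relative change in percent, assumed to match shader-db summary lines
 * such as "6865 -> 6457 (-5.94%)". */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```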
01:10 mwk: imirkin: fifo or gr?
01:10 imirkin: mwk: fifo
01:10 mwk: fifo is hw switched, if anything is going wrong, the only place you could possibly check is PFIFO unknown regs setup
01:10 mwk: gr is a different matter
01:10 imirkin: mwk: the symptom is that errors start with a INVALID_CMD. then i think nouveau takes care of messing up the rest.
01:13 karolherbst: gr: DATA_ERROR 00000004 [INVALID_VALUE] ch 2 [00bf890000 run[1898]] subc 1 class a0c0 mthd 1564 data 00001fff
01:13 karolherbst: I can reproduce this error message just by doing a shader-db run with one shader file :)
01:15 imirkin: what's 1564 again?
01:15 imirkin: TSC_LIMIT...
01:15 karolherbst: ....
01:15 imirkin: 0x1fff should be a valid TSC_LIMIT...
01:16 imirkin: i disagree with the hw. it is wrong. :)
01:16 karolherbst: :D
01:16 karolherbst: mhh
01:17 karolherbst: it only happens when I say that TIC/TSC size is 8192
01:17 karolherbst: 4096 is fine
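The values in the error line fit a simple rule: the limit written to method 0x1564 (TSC_LIMIT per imirkin) is the table size minus one. With 8192 entries that is 0x1fff, which the hardware rejected here. With 4096 entries it is 0xfff, which was accepted. Both the size-minus-one encoding and the helper name below are assumptions inferred from the log, not taken from documentation.

```c
#include <stdint.h>

/* Assumption inferred from the log: TSC_LIMIT holds the highest valid index. */
static uint32_t tsc_limit_for_entries(uint32_t entries)
{
    return entries - 1;   /* 8192 -> 0x1fff (rejected here), 4096 -> 0xfff (accepted) */
}
```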
01:17 karolherbst: I think I should track down those issues: "gr: GPC0/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 2000e [OOR_ADDR]"
01:18 karolherbst: those happening several times a frame shouldn't be good for perf I figure
01:18 imirkin: ok... one inconsequential bug found...
01:19 karolherbst: no clue how to track down those warp errors though :(
01:20 karolherbst: bisect inside apitrace until it happens once?
01:21 imirkin: karolherbst: grab skeggsb's fixes
01:22 imirkin: he fixed a ton of stuff, could be some dumb thing
01:22 karolherbst: mesa fixes?
01:22 karolherbst: I am sure I have all his kernel module patches
01:24 karolherbst: mhh meh, doesn't happen in apitrace
01:24 imirkin: no
01:24 imirkin: kernel fixes
01:26 karolherbst: I am sure I have everything
01:26 karolherbst: and this message doesn't get printed with apitrace, but this could also be due to the content staying black
01:30 karolherbst: wuhu "178: $p1 mov u32 $r4 0x00000000 (8)"
01:31 karolherbst: $p1 is used :D
01:31 karolherbst: $p2 even
01:32 imirkin: karolherbst: ok, a bunch of stuff is in his branch...
01:33 karolherbst: which one?
01:34 imirkin: master
01:34 karolherbst: I have those
01:35 karolherbst: except 2 commits...
01:37 dboyan_: imirkin: Ah I see, it's about user clip planes
01:39 dboyan_: vp.num_ucp is sometimes written before nvc0_program_translate, and I override it when loading from cache
01:41 dboyan_: And drawing things inside portal apparently uses user clip plane to make sure things behind portal won't show up. And I messed it up.
01:41 imirkin: why are you loading it from cache in the first place?
01:41 imirkin: i mean, the vp should be loaded from cache once
01:42 imirkin: which will have a fixed setting of num_ucp
01:42 imirkin: changing num ucp is one of the ops that causes a full recompile
01:42 imirkin: (if it increases)
01:42 imirkin: so ... ideal to avoid it :)
01:44 dboyan_: so I should take num_ucp as part of the hash key?
01:44 dboyan_: I see the recompile in nvc0_check_program_ucps
01:45 imirkin: dunno what the hash key represents
01:46 karolherbst: imirkin: I rebased again and have all missing patches now. still same message
01:47 dboyan_: imirkin: I mean I should store different shader cache entries for different num_ucps
01:49 imirkin: dboyan_: no
01:49 imirkin: dboyan_: when you bump up the num_ucp's, just store an updated program
01:49 imirkin: that way when it gets loaded later, the num_ucps will be at the max seen
01:50 dboyan_: got it
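Here is a sketch of imirkin's suggestion. The cache is keyed by the shader alone, not by `num_ucp`. A cached binary built for at least as many user clip planes is reused, and only an increase triggers a recompile. The recompiled program then overwrites the cache entry, so the stored binary tracks the maximum `num_ucp` seen. Everything here (`CacheEntry`, `get_program`, the fixed-size table, `compile_count` as a stand-in for a full recompile) is hypothetical and is not Mesa's disk-cache API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 16

typedef struct {
    bool used;
    uint64_t key;       /* hash of the shader (hypothetical) */
    unsigned num_ucp;   /* clip planes the stored binary was built for */
} CacheEntry;

static CacheEntry cache[CACHE_SLOTS];
static unsigned compile_count;   /* stand-in for full recompiles */

static CacheEntry *cache_lookup(uint64_t key)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].used && cache[i].key == key)
            return &cache[i];
    return NULL;
}

/* Overwrite the entry for key, or take a free slot; drop the store if full. */
static void cache_store(uint64_t key, unsigned num_ucp)
{
    CacheEntry *e = cache_lookup(key);
    for (int i = 0; !e && i < CACHE_SLOTS; i++)
        if (!cache[i].used)
            e = &cache[i];
    if (!e)
        return;
    e->used = true;
    e->key = key;
    e->num_ucp = num_ucp;
}

/* Returns the num_ucp the program ends up compiled for. */
static unsigned get_program(uint64_t key, unsigned needed_ucp)
{
    CacheEntry *e = cache_lookup(key);
    if (e && e->num_ucp >= needed_ucp)
        return e->num_ucp;            /* enough clip planes already: reuse */
    compile_count++;                  /* num_ucp increased: recompile */
    cache_store(key, needed_ucp);     /* entry now holds the max seen */
    return needed_ucp;
}
```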
01:58 imirkin: airlied: do you know when skeggsb is back? he told me it was in "2.5 weeks" but unfortunately i've lost track of the starting point
01:59 imirkin: dboyan_: well done on tracking down the issue btw :) hopefully it's gotten you an even better understanding of the nouveau program flow
02:18 airlied: imirkin: 1.5 wks left i think
02:19 imirkin: airlied: ok thanks
02:58 dboyan_: imirkin: I fixed the ucp issue locally. But I found the binary shader cache didn't help a lot compared with only the glsl/tgsi cache. At least with my portal 2 trace.
02:59 imirkin: =/
03:00 imirkin: one might hope that it's helpful in overall smoothness
03:00 imirkin: which is difficult to reflect by looking at total fps
03:01 dboyan_: I'm comparing total run times. No cache: 35s. Only glsl/tgsi: ~27s. +Binary cache: ~26.5s
03:02 imirkin: surprising.
03:02 imirkin: in most comparisons, the glsl/tgsi/nouveau breakdown was about 1/3rd each
03:03 imirkin: so it would stand to reason that with the binary cache it should end up at 23s...
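Imirkin's 23s estimate comes from a short calculation. Assume compile time splits evenly into GLSL, TGSI and nouveau thirds. The GLSL/TGSI cache saved 35 - 27 = 8s, which would be two thirds of the compile time, so the nouveau third is 4s, and 27 - 4 = 23s. The helper name is hypothetical, and the even split is imirkin's rule of thumb, not a measurement.

```c
/* Estimate, assuming compile time splits evenly into GLSL, TGSI and nouveau thirds. */
static double expected_with_binary_cache(double no_cache_s, double tgsi_cache_s)
{
    double saved_by_tgsi_cache = no_cache_s - tgsi_cache_s;   /* assumed = 2/3 of compile time */
    double nouveau_third = saved_by_tgsi_cache / 2.0;
    return tgsi_cache_s - nouveau_third;
}
```

Applied to dboyan's later, more careful numbers (32.0s -> 26.7s), the same rule predicts about 24.05s, while he measured 25.5s. His finding that the GLSL/TGSI cache already cut `nvc0_program_translate` calls from 1255 to 204 would explain why the even-split assumption does not hold for this trace.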
03:04 dboyan_: well, maybe it'll help games with more complex shaders, I guess
03:05 imirkin: can you keep careful track of the number of compilations?
03:05 imirkin: perhaps you're still re-compiling stuff?
03:05 imirkin: [e.g. you're forgetting to set the ->translated flag...]
03:06 dboyan_: the translated flag should be okay at least, but I'll double check later
04:44 imirkin: mwk: any idea what this means:
04:44 imirkin: /* only allow linear DMA objects */
04:44 imirkin: if (!(dma0 & 0x00002000)) {
04:44 imirkin: and where such a thing gets set?
04:44 imirkin: (this is in reference to the nv44a mpeg object)
04:44 imirkin: in my case, dma0 0002103d, so that fails
04:44 imirkin: the nouveau_video.c code is passing through a reference to the gart dmaobj
05:06 imirkin: heh. right. nv04_mmu_oneinit() sets it up as nvkm_wo32(dma, 0x00000, 0x0002103d); /* PCI, RW, PT, !LN */
05:06 imirkin: ARGH!
05:08 imirkin: mwk: do you have an explanation for what this linear stuff is about?
05:22 imirkin: mwk: ok, sticking that junk into vram makes it work
06:25 imirkin: AndrewR: was your NV43 AGP or PCIe?
06:28 AndrewR: imirkin, AGP
06:29 imirkin: hm ok. and did the pmpeg stuff ever work?
06:49 AndrewR: imirkin, sorry was distracted.... yes, it worked... probably until whole mesa/kernel rework went in in late 2015 (I still haven't tested older mesa + 4.2 kern on this machine)
06:55 imirkin: np. i guess i'm not sure why that wouldn't be working
06:56 imirkin: i did fix a minor issue, but all it would have meant was you'd get a handful of extra error prints
06:56 imirkin: there are a handful of other issues too, dunno if they matter though
06:57 imirkin: for me, i had to force mesa to allocate stuff in vram instead of gart. however i'm not sure if that would have been an issue for you -- not sure.
06:59 AndrewR: I can turn on affected machine, and test your idea there ....
06:59 AndrewR: right now I'm on bigger machine
07:00 imirkin: no worries
07:00 imirkin: i still don't have anything that explains the failures you were seeing
07:01 imirkin: which probably means i'll have to break out my NV42 and play around with it
07:02 imirkin: i will do that after i have a satisfactory solution to my current nv4a situation
07:22 mwk: imirkin: the DMA object has two independent fields: memory target and linear flag
07:22 mwk: target says VRAM/PCI/AGP/PCI_NOSNOOP/whatever
07:22 mwk: linear controls paging
07:23 mwk: if linear is set, the DMA object has only one PTE, and it's really a base address, not a page address
07:23 mwk: since the whole thing is one big linear chunk in VRAM / AGP GART / PCI space
07:23 mwk: if linear is not set, the dma object has one PTE per 4kiB page
07:23 mwk: ie. the object is paged
07:24 mwk: now, not everything supports every object
07:24 imirkin: oh. and so for non-paged to work via pci, you'd have to have a contiguous allocation of memory
07:24 mwk: yes
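The C sketch below models mwk's description of linear versus paged DMA objects. It uses the bit tested by the "only allow linear DMA objects" check (0x00002000) and the value nv04_mmu_oneinit() writes (0x0002103d, commented "PCI, RW, PT, !LN"), which lacks that bit. The PTE array holds plain addresses, which is a simplification of the real PTE layout. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_LINEAR_BIT 0x00002000u   /* bit checked by "only allow linear DMA objects" */
#define DMA_PAGE_SIZE  4096u

static bool dma_is_linear(uint32_t dma0)
{
    return (dma0 & DMA_LINEAR_BIT) != 0;
}

/* Resolve an offset inside a DMA object (pte[] holds plain addresses here):
 *  - linear: one entry, the base of a single contiguous chunk
 *  - paged:  one entry per 4 KiB page */
static uint64_t dma_resolve(uint32_t dma0, const uint64_t *pte, uint64_t offset)
{
    if (dma_is_linear(dma0))
        return pte[0] + offset;
    return pte[offset / DMA_PAGE_SIZE] + offset % DMA_PAGE_SIZE;
}
```

This is why an engine that only accepts linear objects needs a physically contiguous allocation when the target is PCI memory.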
07:25 imirkin: well THAT stinks
07:25 mwk: mpeg can do linear vram/pci/agp objects
07:25 imirkin: but it does work for AGPGART right?
07:25 mwk: but no paging
07:25 mwk: or... maybe not
07:25 mwk: earlier mpegs only had one bit for memory target
07:25 mwk: so... VRAM/AGP, I think?
07:25 mwk: but NV4x changed that to the full set iirc
07:26 imirkin: nv30 is diff from nv40
07:26 imirkin: nv40 has 2 bits
07:26 mwk: yep
07:26 imirkin: nv30 has 1
07:27 imirkin: https://hastebin.com/ixitujenof.cs
07:27 imirkin: while nv31 yeah, it's either vram or not-vram
07:31 imirkin: ok i see. with AGP, we end up bypassing a ton of this machinery...
07:38 AndrewR: imirkin, your last commit in mesa ("nv30: create uploader after pipe->screen is set") supposed to fix segfaults I was seeing on nv50? (still running with your patch applied)
07:38 imirkin: AndrewR: i don't remember about segfaults on nv50. but that commit only fixes things for nv30.
07:39 AndrewR: imirkin, sorry, you hoped to test some nvc0 cards too, but apparently only nv50-class was affected ..
07:39 AndrewR: https://lists.freedesktop.org/archives/mesa-dev/2017-March/146727.html
07:39 imirkin: oh. right.
07:39 imirkin: forgot about all that.
07:40 imirkin: there were more, right?
07:40 imirkin: or i wanted to do an audit for nv30 and nvc0. that makes sense.
07:40 imirkin: thanks for reminding ;)
07:41 imirkin: not tonight tho
07:43 AndrewR: imirkin, have good night rest ....
07:43 imirkin: i've now applied that patch to my tree, so hopefully i won't forget about it again
09:32 karolherbst: imirkin: 0e9232dbccf45ffd7e36f8cc1837a7e5e4a295de "total instructions in shared programs : 480681 -> 480708 (0.01%)" for hitman
09:35 karolherbst: okay, is due to RA being silly
09:36 karolherbst: generally more movs
09:46 DiaSexto: hii :)
09:46 DiaSexto: I'm looking for a cheap graphic card 100% compatible with linux
09:46 DiaSexto: is the gt710 fully compatible with nouveau?
09:47 karolherbst: DiaSexto: you should prefer intel or amd
09:47 Calinou: karolherbst: does AMD have similar cheap, recent cards?
09:47 Calinou: Intel is not an option I believe here
09:47 karolherbst: no clue? intel has
09:47 karolherbst: but the 710 is garbage anyway
09:47 Calinou: DiaSexto wants a cheap GPU for Linux, and has a gaming NVIDIA GPU for passthrough
09:47 DiaSexto: yeah
09:47 Calinou: it's only for driving monitors on Linux, not for gaming
09:48 karolherbst: DiaSexto: intel cpu?
09:48 Calinou: the HD 5450 is getting very old by now
09:48 DiaSexto: nop
09:48 karolherbst: hum
09:48 DiaSexto: I have an AMD FX 8370
09:48 karolherbst: sad
09:48 karolherbst: well you won't have much fun with nouveau being the main driver
09:48 karolherbst: a lot of MT issues right now
09:48 DiaSexto: my current GPU is GTX970
09:48 DiaSexto: and I want to keep the 970 for gaming on a KVM
09:49 DiaSexto: humm
09:49 karolherbst: so something like plasma5 is a no go
09:49 DiaSexto: I use plasma as desktop
09:49 karolherbst: okay, so nouveau is currently not a good idea
09:49 DiaSexto: then I should look for an AMD graphics card
09:49 DiaSexto: isn't it?
09:50 karolherbst: DiaSexto: the good thing about AMD is that they have a lot of paid devs
09:50 karolherbst: nouveau has one, not paid by nvidia
09:50 karolherbst: ohh wait, there are paid devs by nvidia
09:50 karolherbst: but only for the tegra GPUs
09:51 DiaSexto: u know?? even with the proprietary driver I have a little tearing with my 970 on plasma
09:51 karolherbst: I have none with intel
09:51 DiaSexto: any recommendation for a cheap amd card?
09:51 karolherbst: I never had any tearing with intel after I set it up correctly
09:52 karolherbst: DiaSexto: disable that "only when cheap" tearing prevention thing
09:52 karolherbst: it sucks
09:52 karolherbst: on dri2 you want full scene repaints
09:52 karolherbst: and with dri3 you want automatic
09:53 DiaSexto: humm what option I should choose?
09:53 karolherbst: if you wanna be safe: "Full scene repaints"
09:54 DiaSexto: yeah
09:54 karolherbst: this should eliminate all the left tearing
09:54 DiaSexto: and in the backend gl2.0 or 3.1?
09:54 karolherbst: I use 3.1
09:54 DiaSexto: scale method?
09:54 karolherbst: no idea
09:54 DiaSexto: I use smooth
09:54 DiaSexto: (I mean Im currently using a gtx970)
09:55 DiaSexto: OMG THANKS!!!!!!!
09:55 DiaSexto: tearing is almost gone!!!
09:55 DiaSexto: thanks!!!!!!!
09:55 DiaSexto: :D
09:55 karolherbst: ;)
09:56 DiaSexto:gift a beer to karolherbst ! enjoy :)
09:56 karolherbst: It tends to put more load on the GPU, but seriously, who cares :D
09:57 Calinou: maybe for gaming?
09:57 Calinou: :P
09:57 DiaSexto: yeah!
09:57 Calinou: I should try full repaint too, and see FPS impact in games
09:57 karolherbst: it won't matter much
09:57 Calinou: (I have a GTX 1080)
09:58 karolherbst: it just does more forced syncing
09:58 karolherbst: the alternative is to use dri3+present
09:58 Calinou: maybe it increases latency?
09:58 karolherbst: of course it does
09:58 Calinou: :(
09:58 Calinou: I can't, I ban vsync from all my gaming
09:58 karolherbst: tearing prevention is all about increasing latency :D
09:59 karolherbst: Calinou: get a 120 fps display
10:00 Calinou: karolherbst: I can't, really
10:00 Calinou: I need to change my displays anyway… I have a 32" QHD but it has big input lag
10:00 Calinou: and a 24" Full HD besides
10:00 Calinou: 4K 120 Hz is not a thing yet, and I want that so much
10:01 Calinou: (because I'd like to upgrade resolution too :p)
10:01 karolherbst: imirkin: no idea if we should do this, but could you decide on if this is a smart or bad idea? 8c61f588f016a850ec308cbeeb81e7807b92ec3b
10:01 Calinou: maybe next year
10:02 karolherbst: imirkin: https://github.com/karolherbst/mesa/commit/8c61f588f016a850ec308cbeeb81e7807b92ec3b
10:09 karolherbst: anybody knows how I can enable that mesa gl worker thread thing?
10:12 karolherbst: "mesa_glthread=true" it is
10:19 karolherbst: hum... those traps are not gone
10:32 karolherbst: and they seem to reduce the perf.... just fine
10:42 karolherbst: wow, I removed some printfs inside the trap handler: 6.14 -> 6.20 fps
10:46 Riastradh: Heh. printf to console is a serious performance sink!
10:49 karolherbst: sure
10:49 karolherbst: I need to figure out what produces those traps
11:29 abff: I just did an apt upgrade and got 77 Warnings about nouveau
11:29 abff: http://pastebin.com/7bQuVGZC
11:29 abff: Am I in trouble if I reboot?
11:30 karolherbst: abff: no clue, ask your distributions. We don't do packaging
11:30 karolherbst: but you may want to install/update linux-firmware
11:30 karolherbst: or so
11:30 karolherbst: no idea how it is named for your distribution
11:32 abff: Righto
11:33 pmoreau: abff: If your card is not a GM20x or GP10x (i.e. 9xx series or 10xx series), Nouveau doesn’t need those firmwares
11:34 abff: Pretty certain it's a GF119
11:35 pmoreau: From Nouveau’s point of view, you should be fine then.
11:36 abff: sweet thanks for the knowledge
11:41 karolherbst: gnurou: by any chance, are you able to tell us about the "OOR_ADDR" GR_MP_WARP error?
11:42 karolherbst: and maybe even how we can "easily" figure out how we caused it
11:42 karolherbst: our old name was "MEM_OUT_OF_BOUNDS" which might be exactly what happens here
11:45 karolherbst: mhh, most likely we indeed are just inside an OOM situation
11:50 karolherbst: pmoreau: would you happen to know where I could track all memory accesses to vram (except from shaders and stuff)?
12:17 pmoreau: karolherbst: Absolutely no clue, sorry :-/
12:24 karolherbst: pmoreau: maybe there is a reg for extended information, will try to find it. I am sure that recognizing an invalid memory address should be easy, especially if the only mistake is being OOR
12:59 dboyan: imirkin: I know why binary cache doesn't help a lot to my portal 2 trace. It seems the glsl/tgsi cache drastically reduced number of calls to nvc0_program_translate. It changed from 1255 to 204 in my case
13:01 dboyan: imirkin: And after careful testing the improvement of binary cache is a little more significant than my first measurement. It is 32.0s (no cache) -> 26.7s (+tgsi cache) -> 25.5s (+binary cache)
13:17 karolherbst: well, the address has to be inside the TPC area somewhere, it's just 0x800 big
13:17 karolherbst: dboyan: test something more heavy, like hitmanpro
13:17 karolherbst: dboyan: the benchmark takes like 2 minutes to load here
13:18 karolherbst: dboyan: if you give me instructions, I could test it as well
13:20 karolherbst: dboyan: I have also some games with odd perf issues I would like to test your stuff on
13:33 Megaf: Hi all
13:33 dboyan: karolherbst: I have to update my github branch, wait a minute
13:35 dboyan: there are some issues with my current one
13:37 dboyan:has to check if the current code builds or not
13:43 karolherbst: pmoreau: I am sure this is something stupid like "out of local/const" memory or so
13:43 karolherbst: or always the same address
13:48 dboyan: karolherbst: My branch is at https://github.com/dboyan/mesa/tree/nouveau-cache, feel free to test it.
13:49 Misanthropos: karolherbst, some time ago we fixed the reclocking of a gk106 card of mine - to run stable in pstate 0f - i still need to do the changes in the kernel in vold/base.c to make it stable
13:49 karolherbst: dboyan: what env variables do I have to set?
13:49 karolherbst: Misanthropos: what changes?
13:49 karolherbst: would be embarrassing if I had forgotten about a fix :O
13:49 Misanthropos: info.min -> info.max
13:49 karolherbst: uhm
13:49 karolherbst: Misanthropos: what kernel are you on?
13:50 dboyan: karolherbst: no special env vars, it should just work (tm)
13:50 Misanthropos: right now 4.9.11-gentoo
13:50 karolherbst: Misanthropos: you need 4.10
13:50 karolherbst: dboyan: k
13:50 karolherbst: dboyan: first run is without cache and second with?
13:50 dboyan: yes
13:51 dboyan: but sadly it doesn't cache binary of compute shaders, I haven't thought of a clean way to avoid making pmoreau unhappy
13:51 karolherbst: :/
13:52 karolherbst: dboyan: where is the stuff cached?
13:52 dboyan: normally it should be in ~/.cache/mesa
13:53 karolherbst: there was a lot of stuff already :O nice
13:53 karolherbst: k
13:53 karolherbst: now real testing
13:54 Misanthropos: there is another thing: I was switching back and forth nvidia/nouveau because of vulkan. And again I noticed enemy territory is running MUCH better with nouveau.
13:54 Misanthropos: http://pastebin.com/TeKp6Fbg
13:54 dboyan: karolherbst: thanks and enjoy testing :)
13:55 Misanthropos: i know with older version of nvidia-driver it was faster but they dont work anymore with recent kernels... so can it be the nvidia driver got worse?
13:57 karolherbst: dboyan: time to first frame first run: 2m50.452s
13:57 karolherbst: :3
13:58 karolherbst: dboyan: second run: 0m47.944s
13:58 karolherbst: dboyan: you have to choose your tests better ;)
13:59 dboyan: karolherbst: I currently don't have many games, especially modern ones :)
13:59 karolherbst: dboyan: you will like the new fact: this game also uses compute shaders
13:59 karolherbst: so there is even more potential!
14:00 karolherbst: dboyan: how can I disable the nouveau cache?
14:00 karolherbst: so that I can check tgsi vs nouveau?
14:01 dboyan: karolherbst: sorry, I didn't implement that, out of laziness
14:02 dboyan: karolherbst, oh, you might want to set NV50_PROG_DEBUG to some non-zero values, if it is a debug build, the binary shader cache will be disabled
14:03 karolherbst: ohh nice
14:03 karolherbst: I need that for shader optimizations :)
14:03 imirkin: dboyan: why are there ANY nvc0_program_translate calls?
14:05 dboyan: imirkin: why not? I counted by adding a printf in it
14:05 imirkin: i believe you =]
14:05 imirkin: i just mean ... why does it get called at all with shader cache?
14:06 dboyan: My cache code is also inside, if cache hits, it will return early without calling nv50_ir_generate_code
14:08 dboyan: And I think cache handling should be inside because there are several places around nvc0 where nvc0_program_translate is called
14:13 karolherbst: dboyan: I get worse perf with the shader cache :/
14:14 dboyan: well, that's sad.
14:14 karolherbst: well perf is bad anyway
14:14 dboyan: karolherbst: how much does it get worse?
14:15 karolherbst: 6.14 -> 5.80 fps :D
14:15 karolherbst: but hitman is no fair thing
14:15 karolherbst: dboyan: what is the best way to disable the nouveau cache?
14:16 dboyan: If you want to disable it all, set MESA_GLSL_CACHE_DISABLE=1
14:16 karolherbst: nope
14:16 karolherbst: just want to disable nouveaus part
14:16 karolherbst: it's also fine inside code
14:17 karolherbst: that shader-db with the cache though :3 "total instructions in shared programs : 480019 -> 88557 (-81.55%)"
14:17 karolherbst: hum
14:18 karolherbst: 0 on second run of course
14:18 dboyan: the shader-db has updated to disable shader cache iirc
14:18 karolherbst: yeah, no worries
14:19 karolherbst: imirkin: did you see my patch regarding OP_FMA and constantfolding?
14:19 dboyan: karolherbst: If you want just to disable nouveau cache, there is a "bool cache_program" at the beginning of nvc0/nvc0_program.c:nvc0_program_translate, set it to false
14:19 karolherbst: nice
14:20 karolherbst: this way tgsi is cached, but everything else should be fater
14:20 dboyan: yeah, I think so
14:20 dboyan: karolherbst: or maybe you can just set NV50_PROG_DEBUG to some non-zero but insane values, I think it can also do the trick
14:21 karolherbst: not a good idea
14:21 karolherbst: would slow down everything again
14:21 dboyan: okay
14:21 karolherbst: our shader-db has like 20k shaders or so
14:21 karolherbst: I think it is even 40k
14:21 karolherbst: or the numbers are just silly
14:21 karolherbst: hum
14:22 karolherbst: doesn't work
14:22 karolherbst: still cached
14:23 imirkin: karolherbst: you can't touch OP_FMA with optimizations
14:23 dboyan: karolherbst: That's funny, if you scroll downwards you can see that the cache code is guarded by if (cache_program)
14:23 karolherbst: imirkin: I am aware, I was just curious if it is okay if we can optimize a half away
14:24 karolherbst: imirkin: like if fma 5 6 0 produces the same in every situation as mul 5 6, it is okay to do the optimizations, no?
14:24 imirkin: karolherbst: yes.
14:24 karolherbst: or more generic fma a b 0 -> mul a b
14:24 imirkin: yes, that's fine.
14:25 karolherbst: that's what I did with the patch, or did I miss something?
14:25 imirkin: but i very highly doubt that ever happens
14:25 karolherbst: mhh
14:25 karolherbst: wait
14:25 imirkin: i haven't looked at your patch :p
14:25 karolherbst: fma a 1 b -> add a b
14:25 karolherbst: could also happen
14:25 karolherbst: imirkin: I enabled the OP_MAD opts for ConstantFolding for OP_FMA
14:26 imirkin: yeah you can't do that
14:26 imirkin: e.g. fma 3 4 5 != add (mul 3 4) 5
14:26 karolherbst: k
14:26 karolherbst: imirkin: so just fma a b 0 -> mul a b?
14:26 imirkin: and fma a 1 b -> add should be fine too
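The two folds agreed on here can be checked numerically with `std::fma`. With a zero addend, or with a multiplier of exactly 1, the single rounding of fma matches the plain mul or add. The general case can differ because mul followed by add rounds twice.

Note that `fma 3 4 5` above is symbolic: with those small integers both forms give exactly 17. The difference only shows up when the product is not representable, as with the inputs chosen in the test. The sketch also assumes IEEE 754 round-to-nearest, under which one edge case remains: `fma(a, b, +0.0f)` turns a negative-zero product into +0. The `mul` rewrite is therefore exact only up to the sign of zero. The helper names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// mul then add with two roundings; the volatile store stops the compiler
// from contracting the pair back into a single fused operation.
float mul_then_add(float a, float b, float c) {
   volatile float prod = a * b; // first rounding
   return prod + c;             // second rounding
}

// fma a b 0 -> mul a b: both round a*b exactly once.
bool fold_zero_addend_ok(float a, float b) {
   float prod = a * b;
   return std::fma(a, b, 0.0f) == prod;
}

// fma a 1 b -> add a b: a*1 is exact, so only the add rounds.
bool fold_unit_factor_ok(float a, float b) {
   return std::fma(a, 1.0f, b) == a + b;
}
```

With `a = 1 + 2^-12` and `c = -(1 + 2^-11)`, the rounded product `a*a` cancels `c` completely, so mul+add yields 0. The fused version keeps the low bit of the product and yields 2^-24.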
14:27 karolherbst: well sure, I just talk about fma with constants
14:28 karolherbst: ohh nice
14:28 karolherbst: the patch is fine then
14:28 karolherbst: cause it only covers fma a b 0 -> mul a b
14:28 karolherbst: wrong place...
14:29 imirkin: it *looks* like it should be fine
14:29 karolherbst: maybe piglit can help us?
14:29 imirkin: oh, but ... heh.
14:29 imirkin: flipping it to mul/add will enable additional optimizations
14:29 imirkin: which we don't want
14:29 karolherbst: :(
14:29 karolherbst: why not?
14:29 karolherbst: I am sure nvidia does it as well
14:29 imirkin: do we?
14:30 imirkin: hm. yeah, i guess we do.
14:30 karolherbst: if it is a silly fma 5 4 3, we would opt it to mov 23, cause why not
14:31 karolherbst: fma "just" means there is no rounding between the mul and add, right?
14:31 karolherbst: (also the N/A handling and what not)
14:31 karolherbst: but seriously, we should first find an example, what _really_ cares about this
14:31 imirkin: yes, it's just the rounding difference
14:32 imirkin: tessellation
14:32 karolherbst: we could disable it for tessellation then
14:32 imirkin: when multiple calculations have to arrive at identical answers
14:32 imirkin: otherwise you get tearing
14:33 karolherbst: checking what my patch enables
14:33 imirkin: that said, with imm == 1, there should be no funny additional extra bit
14:33 karolherbst: 1. fma 0 a mod b -> cvt mod b
14:33 imirkin: but with random values, there can be
14:34 karolherbst: 1. should be fine I guess? :D
14:34 imirkin: yes
14:34 karolherbst: no idea what N/A*0 is
14:34 karolherbst: N/A?
14:34 karolherbst: or 0?
14:34 imirkin: you mean NaN * 0?
14:34 karolherbst: yeah
14:34 imirkin: under IEEE rules, it's NaN
14:35 karolherbst: under GLSL rules?
14:35 imirkin: under DX9 "anything times 0 is 0" rules, it's 0
14:35 karolherbst: mhhh
14:35 imirkin: we generally go with the IEEE rules
14:35 karolherbst: for compute IEEE rules apply?
14:35 karolherbst: I see
14:35 imirkin: unless the tgsi property is set which indicates we should use the DX9 rules
14:35 imirkin: which is set by st/nine
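The two conventions can be stated concretely. Under IEEE 754, both 0 × NaN and 0 × ∞ are NaN. Under the DX9-style rule, any product with a zero operand is 0. That is why folding `fma 0 a b` down to `b` silently assumes either the DX9 rule or a finite `a`.

The sketch below models the DX9 behaviour in software with a hypothetical helper `mul_dx9`. It only illustrates the semantics and is not how st/nine or nouveau implement the property.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// IEEE 754 multiply: 0 * NaN and 0 * inf both produce NaN.
float mul_ieee(float a, float b) { return a * b; }

// Hypothetical model of the DX9 "anything times 0 is 0" rule.
float mul_dx9(float a, float b) {
   if (a == 0.0f || b == 0.0f) // NaN never compares equal to 0
      return 0.0f;
   return a * b;
}
```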
14:36 karolherbst: okay, so to be correct, we should do 1. only if that property is set?
14:36 karolherbst: otherwise it is add NaN mod b?
14:36 karolherbst: ohhhh wait
14:36 karolherbst: this is FMA
14:36 imirkin: =]
14:36 imirkin: also we're already in trouble wrt that
14:36 karolherbst: need to read the spec
14:36 imirkin: we do plenty of non-ieee-safe opts
14:37 karolherbst: I see
14:37 karolherbst: so until something breaks, we should be fine
14:37 imirkin: yes
14:37 imirkin: nan isn't the value this has to match up for :)
14:39 DiaSexto: karolherbst: in the end I bought a second NVIDIA
14:39 DiaSexto: a 710
14:39 DiaSexto: gt*
14:39 karolherbst: 2. subOp != NV50_IR_SUBOP_MUL_HIGH fma (-)1 mod a mod b -> add (neg) mod a mod b
14:39 karolherbst: 2. _looks_ also fine I guess
14:42 imirkin: mul_high is for int muls
14:42 karolherbst: sure
14:42 karolherbst: but it was in the if clause of the code
14:42 imirkin: fma is for float
14:42 imirkin: yeah, MAD can go either way
14:42 karolherbst: the third case is a bit more complicated
14:43 karolherbst: 3. fma a abs 2^b c -> shladd a log2(2^b) c
14:44 karolherbst: ...
14:44 karolherbst: 3. fma a abs 2^b c -> shladd a b c
14:45 karolherbst: imirkin: is shladd even supported for f32?
14:45 imirkin: no
14:45 karolherbst: okay, so 3. doesn't apply
14:45 karolherbst: it checks for target->isOpSupported(OP_SHLADD, i->dType)
14:45 karolherbst: okay
14:45 imirkin: so yeah, i think that particular instance is fine.
14:45 karolherbst: yeah
14:45 karolherbst: nice
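Rule 3 only makes sense for integers, which is why it is gated on `isOpSupported(OP_SHLADD, ...)`. For an integer multiply-add, multiplying by a power of two 2^b is the same as shifting left by b, so `mad a 2^b c` can become `shladd a b c`.

Below is a minimal sketch of that check and rewrite on 32-bit unsigned values; all helper names are hypothetical, not nv50_ir's. In modular 32-bit arithmetic the two forms agree for every input. For f32 there is no such shortcut, because shifting a float's bit pattern does not multiply its value.

```cpp
#include <cassert>
#include <cstdint>

// True if v is a non-zero power of two.
bool is_pow2(uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }

// log2 of a power of two: the position of its single set bit.
unsigned log2_pow2(uint32_t v) {
   unsigned b = 0;
   while (v >>= 1)
      ++b;
   return b;
}

// Integer multiply-add versus the shift-and-add replacement.
uint32_t mad_u32(uint32_t a, uint32_t m, uint32_t c) { return a * m + c; }
uint32_t shladd_u32(uint32_t a, unsigned s, uint32_t c) { return (a << s) + c; }
```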
14:46 imirkin: let me ask in dri-devel
14:46 imirkin: perhaps they know floats better than i do
14:46 karolherbst: sure, thanks :)
14:49 dboyan: imirkin, karolherbst: nir seems to do some optimizations like that
14:50 karolherbst: well the thing is, some stuff is fine, where other is not
14:50 dboyan: although I have no idea whether they are safe or not
14:50 karolherbst: fma != mad as well
14:52 imirkin: yeah, they don't distinguish it...
14:53 karolherbst: mul(x, set) .... smart if you want to have 0 or neg
14:54 karolherbst: and then the results of 4 of those gets piped through max....
15:02 karolherbst: found something nice
15:02 karolherbst: imirkin: add u32 $r8 neg $r8 $r10 where both $r8 and $r10 are either 0x0 or 0xffffffff
15:02 karolherbst: ....
15:02 karolherbst: after the add: cvt f32 $r8 s32 $r8
15:03 karolherbst: couldn't we convert the add into a f32 operation?
15:03 karolherbst: and adjust the immediates?
15:03 imirkin: only if the inputs were float
15:03 karolherbst: inputs are immediates
15:03 imirkin: and would have to flip the neg flags
15:03 imirkin: then why aren't they propagated in?
15:04 imirkin: oh, there's control flow?
15:04 karolherbst: https://gist.githubusercontent.com/karolherbst/6a32ca2137c36209548cecf74b3e77e7/raw/51c703aa1a8c8bf00746fcf2da75b79c8e270611/gistfile1.txt
15:04 karolherbst: ;)
15:05 imirkin: step 1: get rid of the control flow.
15:05 karolherbst: I think I really should work on the selp thing
15:05 karolherbst: this would eliminate such silly cf things
15:05 imirkin: and then selp + set could be made into slct
15:05 karolherbst: sure
15:05 karolherbst: I started that, but it got bugs
15:06 karolherbst: so I will do it from scratch
15:06 imirkin: :)
15:06 karolherbst: detecting such things is not trivial
15:07 imirkin: not trivial at all.
15:07 karolherbst: mhhh
15:07 karolherbst: although
15:07 karolherbst: set
15:07 karolherbst: conditional bra
15:08 karolherbst: 2 BBs with 2 instructions: mov+bra
15:08 karolherbst: phi
15:08 karolherbst: that's it
15:08 karolherbst: and the phi could be "optimized" into a selp
15:08 karolherbst: and the BBs got DCEd away
15:08 karolherbst: *get
15:09 karolherbst: any complaints about this plan?
15:09 imirkin: no, i think it's fine... just ... gotta be careful. don't want to convert ALL phi's into selp's
15:10 imirkin: only in some situations. and those situations may be tricky to detect.
15:10 imirkin: either way, worth some investigation.
15:12 karolherbst: mhh
15:13 karolherbst: chains of phis might be troublesome
15:13 karolherbst: cause I can't put a non phi before phis
15:17 imirkin: lots of tricky issues
15:17 imirkin: but chains of phi's will rarely be an issue
15:17 imirkin: you can only do it with BB's that have 2 incoming edges, which each have a single shared ancestor
15:17 imirkin: i.e. an if/else
15:19 karolherbst: true
15:20 karolherbst: imirkin: after which opt should I put it? after LateAlgebraicOpt or ConstantFolding?
15:21 jamm: hey guys, which linux distro do you use? just wondering..
15:22 imirkin:uses gentoo
15:23 imirkin: karolherbst: dunno, your call
15:23 imirkin: karolherbst: i suspect earlier rather than later
15:23 imirkin: i.e. after the first DCE
15:23 karolherbst: mhh
15:23 karolherbst: right
15:23 karolherbst: makes sense
15:24 imirkin: karolherbst: also you might consider attempting to merge the BB's in that case. dunno.
15:24 imirkin: that's probably more trouble than it's worth
15:25 karolherbst: yes
15:26 karolherbst: imirkin: I think I will do it after LoadPropagation
15:26 karolherbst: mov u32 %r5337 0x00000000; mov f32 %r5338 %r5337
15:26 karolherbst: something like that is just annoying
15:26 imirkin: no, you need to handle it.
15:27 karolherbst: why?
15:27 imirkin: because it'll always happen.
15:27 imirkin: having this stuff worked out earlier will reveal more opt opportunities
15:27 imirkin: all you have to do is follow mov's up to their original.
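imirkin's suggestion, following movs back to their original definition, amounts to walking a copy chain. Below is a toy sketch that assumes a hypothetical SSA table in which each value records its defining op and, for a mov, its single source value. This is not nv50_ir's `Instruction`/`Value` API.

The chain walk passes straight through the `mov f32 %r5338 %r5337` from karolherbst's example, since a mov copies bits without converting.

```cpp
#include <cassert>
#include <unordered_map>

// Imm: a mov of an immediate; Mov: a copy of another SSA value.
enum class Op { Imm, Mov, Other };

struct Def {
   Op op;
   int src;      // source value id for Mov, unused otherwise
   unsigned imm; // immediate bits for Imm
};

using DefTable = std::unordered_map<int, Def>;

// Follow a chain of movs back to the value that actually produced the data.
int follow_movs(const DefTable &defs, int v) {
   auto it = defs.find(v);
   while (it != defs.end() && it->second.op == Op::Mov) {
      v = it->second.src;
      it = defs.find(v);
   }
   return v;
}
```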
15:27 karolherbst: with the shader cache, we can just loop over opts anyway
15:29 karolherbst: and I meant CopyPropagation, not load
15:30 karolherbst: mhh no I can't follow the movs, cause that would require that I do other opts as well
15:31 karolherbst: like mov $r1 immediate; mov $r2 $r1; inside a BB, so I would have to propagate that immediate value otherwise I can't eliminate the BB
15:34 pmoreau: dboyan: :-( I don’t mind if you don’t cache compute shaders, or kernels, but getting rid of the symbols is slightly more annoying. I’ll have to figure out how to get around that.
15:35 dboyan: pmoreau: I'm trying to figure that out. will come up with a better solution eventually, I believe
15:35 pmoreau: karolherbst: In CUDA (and I assume in OpenCL as well, and maybe GLSL compute shaders), by default it follows IEEE floats, unless you specify `--use_fast_math`.
15:35 pmoreau: dboyan: Oh, cool! :-)
15:36 karolherbst: pmoreau: sane enough
15:39 pmoreau:is hoping to send a RFC for SPIR-V utils today, along with SPIR-V support for clover
15:39 imirkin: dboyan: you could also just only do the caching if symbols exist
15:39 imirkin: er. if symbols *dont* exist
15:39 imirkin: you might have to special-case MAIN
16:08 karolherbst: imirkin: the instructions of the sources of a phi are all within different BBs, right?
16:09 imirkin: yes, indexed by the edges
16:09 karolherbst: okay
16:10 karolherbst: I do BasicBlock::get(insn->bb->cfg.parent()) for both sources and check if the result is equal. Then I should be safe in only matching the simple if/else situation, right?
16:11 imirkin: no
16:11 imirkin: you have to look at the cfg directly
16:11 imirkin: a phi source may be in a totally different bb
16:11 imirkin: like you could have
16:12 imirkin: a = 0; b = 1; if (cond) {empty} else {empty}; c = phi(a,b);
16:12 imirkin: and it should get the value of a if coming from one bb, and the value of b if coming from the other bb
16:13 imirkin: this is dealt with by some pre-RA passes that insert constraint movs around
16:13 imirkin: which handle such otherwise-impossible situations
16:16 karolherbst: but that's what I do
16:16 karolherbst: I start from the phi
16:16 karolherbst: check the sources
16:17 karolherbst: and find the common parent of both
16:17 karolherbst: src->getUniqueInsn()->bb->cfg.parent() is only non NULL if there is only one incoming edge
16:18 karolherbst: ohhhh
16:18 karolherbst: wait
16:18 karolherbst: I thought both sources of phi have to be in different BBs?
16:19 karolherbst: I don't care how many instructions the BBs of the sources have, as long as the op is either MOV or something which can be converted to a mod at a selp and the source is from the parent BB
16:19 karolherbst: or something more up
16:46 imirkin: karolherbst: so like i said, the instrs that define the sources of the phi are irrelevant
16:46 imirkin: it's the bb's on the inbound edges of the phi's bb that matter
16:47 karolherbst: imirkin: okay, I can verify that as well
16:47 karolherbst: total instructions in shared programs : 480019 -> 473567 (-1.34%)
16:49 imirkin: do any of them still work? :P
16:52 pmoreau: If it compiles, it works! Isn’t that how it works? :-D
16:53 karolherbst: well it still looks the same
16:55 karolherbst: imirkin: https://github.com/karolherbst/mesa/commit/6e0e6aa1b78d4dcf2a973d31862a6ca2ddddbc29
16:56 karolherbst: there are so many potential places where this could crash...
16:56 karolherbst: but I like the idea of the pass
16:57 karolherbst: much cleaner than my last attempt
16:57 imirkin: so... step 1 of that pass has to be to check whether the BB's setup is ok
16:57 imirkin: getParentBB(insn0/1) are totally irrelevant
16:58 karolherbst: it's not, cause I need the bra later
16:59 imirkin: karolherbst: assume that what i'm saying is correct, and try to adjust your view of how programs are laid out
16:59 imirkin: look at my example above of why that is.
17:00 karolherbst: why is there a phi for a,b if a and b are in the same BB?
17:01 imirkin: ok, well, let's think about this some more
17:01 imirkin: let's say the code originally looks like
17:02 imirkin: a = 1; b = 2; int c; if (cond) { c = a; } else { c = b; }; use(c);
17:02 imirkin: after ssa, it'll come out as
17:02 imirkin: a = 1; b = 2; if (cond) { c1 = a; } else { c2 = b; }; c = phi(c1, c2);
17:03 imirkin: and c1 = a looks a lot like mov c1, a
17:03 imirkin: which after copy propagation will look like
17:03 imirkin: a = 1; b = 2; if (cond) { } else { }; c = phi(a, b);
17:03 karolherbst: why should it look like this?
17:03 imirkin: huh?
17:04 karolherbst: I thought this situation should be like pretty illegal in SSA
17:04 imirkin: which one?
17:04 karolherbst: that's why you add those mov, to make it legal
17:04 imirkin: the "original" one is pre-ssa
17:04 karolherbst: the example after copy propagation
17:04 imirkin: ah... what's illegal about it?
17:04 imirkin: everything's assigned once
17:04 imirkin: the mov's are inserted for RA's benefit
17:04 karolherbst: I thought all phi sources have to be in different "paths"
17:04 karolherbst: so to speak
17:05 imirkin: as far as SSA is concerned, the phi's value is determined by where it's coming from
17:05 imirkin: that's why i said that the instr that defines the phi's src is irrelevant
17:05 imirkin: you have to look at the bb's incoming edge
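Here is a toy model of the point imirkin is making. A phi is a table from incoming edge (predecessor block) to value. The value it yields depends only on the edge control arrived through, not on where the source values were defined.

The types and block names below are hypothetical. The example reproduces imirkin's `a = 0; b = 1; if (cond) {} else {}; c = phi(a, b)` case, in which both sources live in the entry block and the two branch blocks are empty.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical phi: maps the predecessor block name to an SSA value name.
struct Phi {
   std::map<std::string, std::string> incoming;
   const std::string &select(const std::string &pred) const {
      return incoming.at(pred);
   }
};

// Toy execution of: a = 0; b = 1; if (cond) {} else {}; c = phi(a, b)
int run(bool cond) {
   std::map<std::string, int> env{{"a", 0}, {"b", 1}}; // both defined in the entry block
   std::string pred = cond ? "then" : "else";           // the blocks are empty; only the edge differs
   Phi c;
   c.incoming = {{"then", "a"}, {"else", "b"}};
   return env.at(c.select(pred));
}
```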
17:05 barteks2x: is it really impossible to get gpu memory usage with nouveau?
17:05 karolherbst: hum... this makes things... difficult. Wouldn't it be like much simpler for a SSA form to not allow phi sources to be inside the same BB? (generally speaking)
17:06 karolherbst: barteks2x: well somebody would have to implement it
17:06 imirkin: barteks2x: it's not exposed anywhere, if that's what you mean
17:06 barteks2x: well then, yet another thing I will have to use windows for
17:07 imirkin: karolherbst: it gets even more confusing when you do shit like while () { foo.xy = foo.yx }
17:07 karolherbst: imirkin: also, shouldn't it look like this: "if (cond) { c1=1 } else { c2=2}; c = phi(a, b);" after CopyPropagation?
17:07 imirkin: barteks2x: i don't think booting into windows will get you nouveau's gpu memory usage
17:07 karolherbst: and a, b just gets DCEed away?
17:07 barteks2x: I mean I have a gpu memory leak and can't confirm if any change makes it disappear or not
17:07 imirkin: karolherbst: i don't think immediates are copy-propagated
17:08 karolherbst: I am sure they are, because I saw it
17:08 barteks2x: and because there are 2GB to fill it takes a long time to run out of memory
17:08 karolherbst: in my example shader I had if (cond) { a = 1; b = a} else { c = 2; d = c}; c = phi(a, b); -> CopyPropagation -> if (cond) { c1=1 } else { c2=2}; c = phi(a, b);
17:08 karolherbst: *fixing silly issues
17:09 karolherbst: okay, again
17:09 imirkin: karolherbst: wtvr. a = complexthing();
17:09 imirkin: karolherbst: this is how SSA works.
17:09 karolherbst: "if (cond) { a = 1; b = a} else { c = 2; d = c}; e = phi(b, d);" -> "if (cond) { b=1 } else { d=2}; e = phi(b, d);"
17:10 imirkin: and if you look at llvm's output, it actually becomes phi(1, 2) for maximal confusion ;)
17:10 karolherbst: :D
17:10 karolherbst: well
17:10 karolherbst: meh
17:10 imirkin: [i hate that they propagate immediates into phi's]
17:10 karolherbst: yeah, it makes things complicated
17:10 karolherbst: SSA should make things easier
17:10 karolherbst: not complicated
17:10 imirkin: it does
17:10 imirkin: you're just not listening to what i'm saying
17:10 imirkin: :)
17:10 karolherbst: not if you immediate into phis
17:11 imirkin: so let's start over
17:11 imirkin: (a) IGNORE the bb's of the phi's srcs
17:11 imirkin: (b) look at the BB's incident edges
17:11 imirkin: there should be 2 of them, and the BB's that they come from should each have a single incident BB, and that BB should be the same for both of them.
17:12 karolherbst: mhh okay, that makes sense
17:12 imirkin: if those conditions aren't satisfied, it's not an if () {} else {} situation
17:12 imirkin: none of this is dependent on the BB that the phi's src's are defined in.
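imirkin's three conditions translate almost directly into code. Below is a minimal sketch on a hypothetical adjacency-list CFG, not nv50_ir's `BasicBlock`/`Graph` classes. The phi's block must have exactly two predecessors. Each of those must have exactly one predecessor, and it must be the same block for both: the one ending in the conditional branch.

The phi sources are deliberately never examined.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CFG: preds[b] lists the predecessor blocks of block b.
using Preds = std::vector<std::vector<int>>;

// Returns the branching block if `join` is the merge point of a plain
// if () {} else {} diamond, or -1 otherwise.
int if_else_head(const Preds &preds, int join) {
   const std::vector<int> &in = preds[join];
   if (in.size() != 2)
      return -1; // not exactly two incident edges
   const std::vector<int> &p0 = preds[in[0]];
   const std::vector<int> &p1 = preds[in[1]];
   if (p0.size() != 1 || p1.size() != 1 || p0[0] != p1[0])
      return -1; // arms do not share a single common parent
   return p0[0];
}
```

An `if () {}` without an else (a triangle) is rejected, because one incoming edge comes straight from the branching block, and that block has no single parent of its own.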
17:13 imirkin: you could also detect if () {} without the else. dunno if that's beneficial.
17:13 karolherbst: depends
17:13 imirkin: either way, they're both very simple patterns
17:13 karolherbst: but I want to care about things I saw for now
17:13 karolherbst: so the if else thing should be taken care of
17:13 imirkin: well, you might have seen an if () {} but thought it had an else clause
17:14 imirkin: because during RA, critical edges are broken up
17:14 karolherbst: okay, and if I have verified that, can I simply create the selp or do I still have to check for something BB related?
17:14 imirkin: and extra else clauses are inserted :)
17:14 imirkin: no, that's all the BB-related items to check
17:14 karolherbst: okay
17:15 imirkin: there are some unfortunate cases like if () { if () {} else {} } which will look funky, but ... wtvr.
17:15 imirkin: structurized flow sucks.
17:16 karolherbst: yes
17:16 karolherbst: at work my team leader is strictly against early exits... so we have tons of nested if clauses
17:17 karolherbst: not even with an else
17:17 imirkin: mwk: so would it be fair to say that without an AGPGART or a large contiguous physical allocation (like CMA), there's no way for nv31 mpeg to use non-vram data/cmd lists?
17:17 imirkin: karolherbst: i am strictly against your team leader :p
17:17 karolherbst: :D
17:18 karolherbst: he also doesn't like spaces between if and (, so I asked: is "if" a function?
17:18 imirkin: he said "yes"?
17:18 karolherbst: no idea :D
17:18 imirkin: the one place i'm torn about parens placement is with anonymous functions in javascript
17:19 imirkin: i.e. function () { ... } or function() { ... }
17:19 karolherbst: uhh
17:19 karolherbst: the former looks odd
17:19 karolherbst: but yeah
17:19 imirkin: i usually go for function () { .. } actually
17:19 karolherbst: makes more sense, yes
17:19 karolherbst: but looks odd
17:19 imirkin: :)
17:20 karolherbst: pro tip: function /*anonymous*/() { ... } :p
17:20 imirkin: hehehe
17:20 imirkin: i dunno about "pro", but sure.
17:20 karolherbst: you could even put a fancy name instead :D
17:21 imirkin: fancy.
17:26 karolherbst: now it crashes :(
17:28 imirkin: =/
17:28 karolherbst: mhh
17:28 karolherbst: in getPredicate
17:28 karolherbst: I guess I got a silly bra
17:31 karolherbst: ups
17:31 karolherbst: I iterated over outgoing
17:36 karolherbst: imirkin: same result with your way
17:37 karolherbst: https://github.com/karolherbst/mesa/commit/7d69c839891e5e5de0760ee74ddb759472a78052
17:39 karolherbst: imirkin: is the BB checking in hasValidSource still needed to be sure or could I remove that actually
17:44 imirkin: karolherbst: phi->bb == bb
17:44 imirkin: you should do that check first before processing phi nodes
17:44 karolherbst: huh
17:45 imirkin: not sure what .parent() does -- does it only return non-null if there's a single incident node?
17:45 karolherbst: why?
17:45 karolherbst: yes
17:45 imirkin: coz ... it's in the bb...
17:45 karolherbst: well if phi is inside the bb in visit, phi->bb should be bb
17:45 karolherbst: because phi is from bb->getPhi()
17:45 imirkin: right :)
17:45 karolherbst: so no idea why I should check against it
17:46 karolherbst: ohhhhhhhhhh
17:46 imirkin: tired? :)
17:46 karolherbst: not at all
17:46 imirkin: instead of doing "phi->bb" you can do "bb"
17:46 karolherbst: yes
17:46 karolherbst: that's what the "ohhhhhhhhh" was for
17:47 imirkin: and you don't have to do it for each phi node
17:47 karolherbst: true
19:26 kone1: is there automatic re-clocking or some super easy way to reclock for nontechnical users?
19:27 karolherbst: kone1: soon, I have some work in progress stuff for that
19:27 karolherbst: kone1: but you can also boot to a certain perf level
19:27 karolherbst: kone1: anyway, it is super easy, cause you only need to write into a file
19:29 kone1: thanks for you work and great!
19:29 kone1: trying to find out where to start
19:34 karolherbst: kone1: what GPU do you have? And with that I mean the chipset printed in dmesg
19:36 imirkin: kone1: pastebin the output of dmesg - that should answer all questions
19:37 kone1: I don't have the gpu on the hand, it's gtx 780
19:37 karolherbst: gk104
19:37 karolherbst: wait
19:37 karolherbst: gk110?
19:38 karolherbst: yeah, gk110
19:38 karolherbst: kone1: you should have a file "/sys/kernel/debug/dri/0/pstate"
19:38 karolherbst: this you can "cat /sys/kernel/debug/dri/0/pstate"
19:38 karolherbst: and select a perf level like "echo 0xf > /sys/kernel/debug/dri/0/pstate"
19:38 karolherbst: the perf levels you get with cat
19:39 karolherbst: the last line with AC/DC is just the current state
19:39 karolherbst: kone1: also note, that you should have at least linux 4.10 running ("uname -r" will tell you)
19:39 karolherbst: and that's all
19:39 kone1: tha'ts it? not too complicated then nice :)
19:39 kone1: thanks
19:40 karolherbst: well, you need a root shell (via "sudo -s" for example)
19:41 kone1: right
19:46 kone1: karolherbst: is the process same for GM107 (gtx 750 ti)?
19:46 karolherbst: kone1: yes
19:46 karolherbst: imirkin: can SELP take any mods?
19:46 kone1: cool
19:47 karolherbst: either I am blind or I don't see it
19:47 karolherbst: in envydis: { 0x2000000000000004ull, 0xfc00000000000007ull, N("selp"), N("b32"), DST, SRC1, T(is2), T(pnot3), PSRC3 }
19:47 karolherbst: I mean a mod except the NOT for the predicate
19:50 imirkin: karolherbst: like what?
19:50 imirkin: it's not an ALU operation
19:50 karolherbst: selp $r1 neg $r2 $r3 $p0
19:50 imirkin: that would require selp to know about types. it doesn't.
19:50 karolherbst: okay
19:50 karolherbst: then I only do it for movs for now
19:58 Megaf: I love when nouveau brings down the whole system
20:05 imirkin: Megaf: i prefer it when it only leaves half the system alive, so you can almost feel like you can win, but in the end you can't, and you've wasted time trying.
20:06 Megaf: oh, that happened yesterday, but today it totally froze everything
20:06 Megaf: sometimes it makes me think I can fix by just killing all X and display managers and restarting them
20:06 Megaf: but never works
20:08 imirkin: ben has some recovery-related fixes slated for 4.11
20:08 imirkin: i'm running with them now, we'll see what effect it has.
20:09 karolherbst: it helps but ain't perfect
20:11 karolherbst: imirkin: by the way, any idea how to debug this out of range issue? Is it simply due to nouveau being out of vram most likely?
20:11 imirkin: karolherbst: no, it's most likely a shader issue
20:12 imirkin: karolherbst: accessing constbuf out of bounds generates that error
20:12 karolherbst: ahh
20:12 karolherbst: I was thinking that as well
20:12 imirkin: also i *think* accessing lmem or smem out of bounds generates that error
20:12 karolherbst: but the game recommends a 4GB gpu
20:12 karolherbst: best way to check?
20:12 imirkin: sorry, not sure
20:12 imirkin: with a trivial shader it's easy :)
20:12 karolherbst: mhh
20:13 karolherbst: I tried to dump all the TPC regs to check if I can find the reg where the error is described
20:13 karolherbst: at least nothing looked like a VRAM address
20:13 imirkin: "is described"
20:13 imirkin: ?
20:13 imirkin: it's an out-of-range address
20:13 karolherbst: yeah sure
20:13 imirkin: ranges only exist on const/shared/local
20:13 karolherbst: but where and why and what
20:14 imirkin: ideally we'd have a shader trap handler
20:14 imirkin: that could tell us what instruction triggered it
20:14 imirkin: but we live in a less-than-ideal world
20:14 karolherbst: I see
20:15 karolherbst: I would like to fix that OOR, because this is killing perf
20:15 karolherbst: just by removing the silly printks in the trap handler gave me more fps
20:15 karolherbst: more significant than all the opts I did (and I am at -7% instructions)
20:16 imirkin: we could disable shader traps
20:16 karolherbst: hum
20:16 karolherbst: how=
20:16 karolherbst: ?
20:16 imirkin: i forget :)
20:17 imirkin: but we enable them, blob doesn't
20:17 karolherbst: hum
20:17 karolherbst: okay
20:17 karolherbst: may also improve perf
20:17 imirkin: doubtful
20:17 karolherbst: somewhere in a pushbuf I guess?
20:17 imirkin: i don't remember
20:21 imirkin: nvkm_wr32(device, TPC_UNIT(gpc, tpc, 0x644), 0x001ffffe);
20:21 imirkin: you could remove 1<<14 from that to disable it
20:21 imirkin: in gr/gf100.c
20:22 imirkin: while leaving the other errors enabled
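The register change is plain bit arithmetic. Clearing bit 14 (0x4000) of the value written in `gr/gf100.c` gives 0x001fbffe, and every other error-enable bit is left set. That bit 14 is the out-of-range trap enable is imirkin's recollection in this log, not something the sketch verifies. The constant names are hypothetical labels.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kTrapEnableDefault = 0x001ffffe; // value written to TPC_UNIT(gpc, tpc, 0x644)
constexpr uint32_t kBit14 = 1u << 14;              // 0x4000, the bit to drop
constexpr uint32_t kTrapEnableNoBit14 = kTrapEnableDefault & ~kBit14;

static_assert(kTrapEnableNoBit14 == 0x001fbffe, "only bit 14 is cleared");
```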
20:22 karolherbst: nice
20:22 karolherbst: first let me debug my crash with the SELFolding pass
20:23 imirkin: we really need to implement that firmware method to write ctxswitched regs
20:23 imirkin: from pushbuf commands
20:23 karolherbst: ../../../../../src/gallium/drivers/nouveau/codegen/nv50_ir_emit_nvc0.cpp:402: void nv50_ir::CodeEmitterNVC0::emitForm_A(const nv50_ir::Instruction*, uint64_t): Assertion `s == 1 || i->op == OP_MOV || i->op == OP_PRESIN || i->op == OP_PREEX2' failed.
20:24 karolherbst: ohh I see
20:24 karolherbst: "selp u32 $r23 0x3f800000 $r63 not $p0"
20:24 karolherbst: the immediate has to be on the other side
20:25 imirkin: you can't just stick the immediate into the args
20:25 imirkin: let the LoadPropagation take care of that
20:25 karolherbst: I know
20:25 imirkin: just stick a mov in and let it do its magic
20:25 imirkin: and teach it about flipping SELP args while you're at it
20:25 karolherbst: well, problem is, the values are in the BBs I want to kill
20:25 imirkin: i think it knows about flipping SLCT args
20:25 karolherbst: or is it fine?
20:25 imirkin: it's fine - LoadPropagation will make them dead
20:26 karolherbst: k
20:26 imirkin: (or won't, in which case you need those BBs)
20:26 karolherbst: crap
20:27 imirkin: you know - in fact - i'd really recommend making the pass only perform work if the phi src insn bb is *not* in the incoming bb, or if it's a mov with a const/immediate
20:27 imirkin: [although i guess at that point, those are the only mov's you'll have...]
20:28 karolherbst: yes
20:28 karolherbst: I just need to put the 0 to src 1 and then I have covered like most cases I found
20:28 imirkin: no.
20:28 imirkin: don't touch that stuff in your new pass
20:28 imirkin: let LoadPropagation take care of it.
20:28 karolherbst: okay
20:29 karolherbst: so LoadPropagation should flip the sources?
20:29 imirkin: (and teach LoadPropagation any necessary new tricks)
20:29 imirkin: yeah, it already does that
20:29 imirkin: might not for SELP, but you can teach it
20:29 imirkin: it already has special logic for SLCT
20:31 karolherbst: "insn->src(2).mod ^= Modifier(NV50_IR_MOD_NOT);" should be fine for SELP I guess
20:32 karolherbst: uhh
20:32 karolherbst: maybe we should add the ^= operator to Modifier
20:33 imirkin: :)
20:33 imirkin: isn't there a modifer.negate or some such shit?
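Flipping SELP sources rests on the identity `selp a b p == selp b a not p`. Swapping the two data sources is safe as long as the predicate's NOT modifier is toggled, which is what the `^=` on `src(2).mod` does. This is what lets LoadPropagation move an immediate out of `src(0)` into `src(1)`, the slot that can take it.

Below is a toy model with a hypothetical `Selp` struct, not nv50_ir's `Instruction`/`Modifier`. It assumes the usual `p ? src0 : src1` selection semantics.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical model of "selp dst src0 src1 [not] p".
struct Selp {
   uint32_t src0, src1;
   bool notPred; // NOT modifier on the predicate source
   uint32_t eval(bool p) const { return (p != notPred) ? src0 : src1; }
};

// Swap the data sources and toggle NOT; the result must not change.
Selp flip(Selp s) {
   std::swap(s.src0, s.src1);
   s.notPred = !s.notPred; // the equivalent of mod ^= NOT
   return s;
}
```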
20:41 karolherbst: imirkin: there is no OP_SELP inside _initProps :/
20:41 karolherbst: same as OP_SLCT?
20:41 karolherbst: ohh, no neg
20:41 karolherbst: 0x4 for not
20:41 karolherbst: no idea about c[]
20:43 imirkin: check :)
20:44 karolherbst: envydis says "{ 0x2000000000000004ull, 0xfc00000000000007ull, N("selp"), N("b32"), DST, SRC1, T(is2), T(pnot3), PSRC3 }"
20:46 imirkin: ok, so can do const
20:46 karolherbst: k
20:46 karolherbst: at which sources?
20:47 imirkin: second source, like everything else
20:47 karolherbst: 0 indexed or 1 indexed?
20:47 imirkin: 1
20:47 imirkin: i.e. src(1)
20:47 imirkin: just look at tabis2
20:48 karolherbst: / neg abs not sat c[] imm
20:48 karolherbst: { OP_SELP, 0x0, 0x0, 0x4, 0x0, 0x2, 0x2 },
20:48 karolherbst: looks okay, I think
20:48 imirkin: mmmmmaybe. check whether tabis2 lets you have an immediate
20:48 karolherbst: it works
20:49 karolherbst: hitman produces selp 0x0 imm $p0
20:51 karolherbst: now I just need to figure out not to generate too many selps
20:55 karolherbst: but maybe it is okay, because we can do cool new stuff with a selp
21:02 karolherbst: imirkin: nvdisasm says that selp is "LOP32I.AND"
21:04 imirkin: guess what - it's not :)
21:04 imirkin: it's SELP
21:05 karolherbst: ....
21:05 imirkin: i'm guessing your hopes about it supporting immediates were ... overstated
21:05 karolherbst: a little
21:05 karolherbst: currently I wanted to check if I can put a float immediate in it, something like 0x3f800000
21:05 imirkin: hm. should work.
21:05 imirkin: you're on kepler?
21:06 karolherbst: yeah
21:06 karolherbst: in one shader: 268: selp u32 $r24 $r25 $r63 not $p0 (8)
21:06 karolherbst: 267: not $p0 mov u32 $r25 0x3f800000 (8)
21:06 imirkin: should work.
21:06 karolherbst: okay
21:06 imirkin: what's that emitted as?
21:06 karolherbst: like that
21:06 karolherbst: two instructions
21:07 karolherbst: but I want to have that immediate inside the selp
21:07 imirkin: oh - probably not
21:07 imirkin: only short imms
21:07 imirkin: let me check my notes...
21:07 pmoreau: imirkin: So you just need to set one bit to enable shader trap? So could we have some IOCTLs to enable it if registering a graphics/compute debugger, and disabling it otherwise?
21:07 imirkin: karolherbst: no, only short imms. lower 20 bits is all you get. upper bits have to be == the 20th bit.
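imirkin's rule says only the low 20 bits are encoded, and the upper bits must equal bit 19 (the 20th bit). In other words, the immediate must survive sign-extension from 20 bits. Below is a sketch of that fit test as a hypothetical helper. It shows why 1.0f (0x3f800000) cannot be folded into the GK110 SELP.

```cpp
#include <cassert>
#include <cstdint>

// True if v equals the sign-extension of its own low 20 bits.
bool fits_short_imm(uint32_t v) {
   uint32_t low = v & 0xfffffu;                                 // bits 0..19
   uint32_t ext = (low & 0x80000u) ? (low | 0xfff00000u) : low; // replicate bit 19
   return ext == v;
}
```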
21:08 karolherbst: :(
21:08 imirkin: pmoreau: yeah, in the ctx switched area. should have a firmware call for it...
21:08 karolherbst: imirkin: maybe there are other forms?
21:08 imirkin: karolherbst: there aren't, at least not on GK110
21:09 karolherbst: I need to make my pass smarter then
21:09 karolherbst: annoying
21:09 karolherbst: maybe we should tell them their tool is broken, to make things easier for us? :D
21:10 imirkin: ?
21:10 karolherbst: well, if nvdisasm prints out wrong stuff, this is annoying
21:10 imirkin: no, nvdisasm prints out right stuff.
21:10 imirkin: we must be emitting stuff wrong.
21:10 pmoreau: imirkin: Hum… could be fun to hook that up! :-) I should put that somewhere on my todo list
21:10 karolherbst: huh
21:10 karolherbst: ohh I see
21:10 karolherbst: other opcode then
21:11 karolherbst: k
21:21 karolherbst: imirkin: maybe I generate just every possible opcode combination. 64bit shouldn't take too long
21:24 karolherbst: bash says no :D
21:29 imirkin: karolherbst: that's what i did for gk110
21:29 imirkin: what's the issue?
21:29 karolherbst: it takes time
21:29 karolherbst: I want like _every_ combination
21:29 karolherbst: full 64 bit range
21:29 imirkin: lol
21:29 imirkin: that's not necessary
21:29 imirkin: the opcode space is fairly small
21:30 karolherbst: 2 bytes top and bottom at most?
21:30 karolherbst: I also see masks like 0xf80000001c000007ull
21:33 karolherbst: I guess the mask "0x3f0003ff 0xffce4001" should cover it all
21:33 imirkin: what op is decoding incorrectly?
21:33 karolherbst: selp
21:33 imirkin: no
21:33 imirkin: the bytes.
21:34 karolherbst: 0x00000004 0x20000000
21:34 karolherbst:
21:34 imirkin: that decodes as SEL
21:34 karolherbst: huh
21:34 imirkin: $ perl -ane 'foreach (@F) { print pack "I", hex($_) }' > tt; nvdisasm -b SM30 tt
21:34 imirkin: 4 20000000
21:34 imirkin: /*0000*/ @P0 SEL R0, R0, R0, P0;
21:34 karolherbst: ..............
21:34 karolherbst: meh
21:34 karolherbst: I put SM35
21:34 imirkin: those are different :op
21:35 karolherbst: really
21:35 imirkin: SM35 = GK110
21:35 karolherbst: I know
21:35 karolherbst: I still want to generate a full table :D
21:35 karolherbst: it just takes 11139338208761.807 days
21:36 imirkin: you're looking at too many bits
21:36 imirkin: try to look at the opcode map
21:36 imirkin: and figure out which bits are unnecessary (hint: most of them)
21:36 karolherbst: "0x3f0003ff 0xffce4001" is a mask which would cover all
21:36 imirkin: so would 0x7 0xffc00000
21:36 karolherbst: no
21:37 imirkin: name one op that you wouldn't find.
21:37 karolherbst: suredp
21:38 imirkin: wtvr :p
21:41 karolherbst: selp is silly :D
21:42 karolherbst: I am sure I get like +100% with disabled traps
21:45 karolherbst: imirkin: sure it is 1 << 14?
21:46 karolherbst: I put nvkm_wr32(device, TPC_UNIT(gpc, tpc, 0x644), 0x001fbffe); now
21:47 karolherbst: maybe INVALID_CONST_ADDR
21:49 imirkin: pretty sure it's 14
21:49 karolherbst: well it didn't remove the error
21:49 karolherbst: I am putting 0x0 now
21:49 imirkin: :(
21:49 karolherbst: huh
21:49 karolherbst: even with 0x0 it appears
21:51 karolherbst: huh
21:51 karolherbst: something enables it
21:54 karolherbst: imirkin: when I start a new GL context it gets enabled again
21:56 karolherbst: gf100_sw_chan_mthd writes to it as well
21:56 karolherbst: hum, and something else
21:58 karolherbst: the heck
22:18 karolherbst: okay, that doesn't help much either, so let's rather figure out which shader goes wrong
22:21 karolherbst: imirkin: what is the max c/s/l one shader can access?
22:21 imirkin: 64K
22:21 imirkin: for const
22:21 imirkin: i forget the limits on the others
22:23 karolherbst: is there a way to remove the colors from NV50_PROG_DEBUG?
22:24 imirkin: yeah. check nv50_ir_print.cpp
22:25 imirkin: some env var
22:25 karolherbst: NV50_PROG_DEBUG_NO_COLORS
22:27 karolherbst: c4[0x308] mhh, okay how many c spaces are there?
22:27 imirkin: 16
22:27 karolherbst: and each is 64k/16 big?
22:27 imirkin: each is 64k
22:27 karolherbst: ohh I see
22:28 imirkin: you can also do indirect addressing between them with LDC.IS
22:28 imirkin: the upper 16 bits do the indirect thing
22:28 karolherbst: so if there is c16[0x10000] it is OOR?
22:28 imirkin: yes
22:28 imirkin: well, there's no way to encode c16
22:28 karolherbst: c15
22:28 imirkin: but c0[$r0] where $r0 = 0x10000
22:28 imirkin: would generate that error
22:28 karolherbst: uhh. regs
22:29 karolherbst: hard to track down
22:32 karolherbst: imirkin: I guess in those cases there has to be an immediate, because we decide which index to access; otherwise it would be the fault of the GLSL code, and that would be super stupid?
22:33 imirkin: ?
22:33 imirkin: uniform foo[10];
22:33 imirkin: int x = ...
22:33 imirkin: foo[x]
22:33 imirkin: if x >= 10, then ... fail
22:33 karolherbst: ohhh
22:33 karolherbst: and how could we track down something which generates OOR traps?
22:34 imirkin: difficultly
22:34 karolherbst: I figure
22:41 karolherbst: here are some reg based c accesses: https://gist.githubusercontent.com/karolherbst/d4319e66801e24ca18af79c8a25bf151/raw/a953033f4eabbb480d0dac7229b7ab0023e36f82/gistfile1.txt
22:43 imirkin: and it's not just about exceeding the 64k limit
22:43 imirkin: but it's about exceeding the amount bound to that constbuf
22:43 karolherbst: could we somehow add some checks inside codegen to verify it at least for constant values?
22:44 imirkin: i think we assert
22:45 karolherbst: but this looks so wrong:
22:45 karolherbst: 140: and u32 $r21 $r2 0xfffffff0
22:45 karolherbst: 141: ld u64 $r2d c1[$r21+0x8]
22:45 karolherbst: I mean, that mask doesn't help, does it?
22:48 karolherbst: uhm
22:48 karolherbst: ....
22:49 karolherbst: "DCL CONST[1][0..4095]"
22:49 karolherbst: that means you can access c1 up to 4095, right?
22:49 karolherbst: what would happen if you do... 197: ld u32 %r170 c1[0x41c0]?
22:51 karolherbst: or is the array size only valid within the tgsi?
22:56 imirkin: it's the result of shifts and whatnot
22:57 imirkin: e.g. shr + shl by 4 ends up as an and
22:57 imirkin: happens a lot with address calculations
22:58 karolherbst: ohh I see
22:59 imirkin: i have logic to try to optimize those as much as possible
22:59 imirkin: this is the result
23:00 karolherbst: yeah okay
23:02 karolherbst: but with a "DCL CONST[1][0..4095]", shouldn't the highest index that can be accessed be 0x4000, or is that just my limited knowledge?
23:04 imirkin: no. 0x10000 - 4
23:04 imirkin: each entry is a vec4 32-bit thing
23:04 imirkin: so ... 16 bytes per entry
23:04 karolherbst: uhhh
23:04 karolherbst: okay
23:04 imirkin: i.e. CONST[1][0] is a vec4
23:04 imirkin: and CONST[1][1] is the next vec4
23:04 imirkin: etc
23:04 karolherbst: yeah got it
23:05 karolherbst: I even disabled all opts now, still the issue, so I doubt it is something within codegen :/
23:05 pmoreau: No series today, sadly… but only one file left for cleanup, update the SPIR-V headers, squash all the fixup commits and re-generate proper patches, and it should be good to go. So definitely this week! :-)
23:05 Horizon_Brave: hi everyone
23:06 karolherbst: silly OOR issue :/ this is getting more and more annoying
23:07 pmoreau: Horizon_Brave: Hello (and bye: going to sleep :-D )
23:07 karolherbst: imirkin: at least it doesn't happen within apitrace, but there is also no buffer storage, any idea how that can be related?
23:10 karolherbst: mhh
23:10 karolherbst: I have a better idea
23:11 karolherbst: I just write some glsl shader, mod codegen around that and try to figure out how we get to know what is wrong, I mean it has to be in some register somewhere.... otherwise that would be silly
23:11 karolherbst: or will it only tell us: this instruction, check yourself
23:17 karolherbst: a GPU debugger would be nice
23:26 imirkin: karolherbst: could be we're messing up bindings
23:26 imirkin: and some shader ends up getting the wrong stuff bound, and goes out of bounds
23:27 karolherbst: could be
23:27 karolherbst: I am at a point where I write random tests into the shader_test files to get the OOR error....
23:27 imirkin: karolherbst: it's pretty easy to do with UBOs
23:27 imirkin: karolherbst: or even without them...
23:28 imirkin: just do like uniform int foo[10];
23:28 imirkin: uniform int addr;
23:28 imirkin: foo[addr]
23:28 imirkin: and make sure that addr is large
23:28 imirkin: at least larger than 10
23:28 imirkin: you should get an error
23:28 imirkin: without a ubo you're going into the "default" constbuf, so the limits may be funky
23:29 karolherbst: I think we need to write a debugger anyway, so I save that for later
23:29 imirkin: btw, i bet we could get a decent-sized increase from binding constbufs properly for kepler+ compute shaders
23:30 karolherbst: I like the idea with messing up bindings
23:30 imirkin: right now we don't go through the compute path for those
23:30 karolherbst: mhh
23:30 imirkin: er
23:30 karolherbst: maybe I can disable compute shaders
23:30 imirkin: we don't go through the constbuf path for those
23:30 karolherbst: :D
23:30 karolherbst: X
23:30 karolherbst: tell me what to do and I do it
23:32 karolherbst: imirkin: \o/ funky
23:33 karolherbst: I disabled compute shaders, guess what
23:34 karolherbst: 9 gps :3 and a black window
23:34 karolherbst: *fps
23:34 karolherbst: gr: GPC1/TPC0/MP trap: global 00000008 [PHYSICAL_STACK_OVERFLOW] warp 0000 []
23:34 imirkin: good one.
23:35 karolherbst: but the OOR issue is gone
23:35 karolherbst: so it is compute shader related, at least something
23:35 karolherbst: how can I enable that constbuf path for compute shaders?
23:36 imirkin: mmm ... tricky
23:36 imirkin: i don't have the time this second to explain it
23:36 imirkin: basically you can only have 8 constbufs in kepler+ compute
23:36 karolherbst: easy to verify
23:36 imirkin: well, you can have 16, but you can only bind 8 ;)
23:37 karolherbst: :D
23:37 imirkin: which limits the usefulness of the other 8 somewhat
23:37 karolherbst: awesome
23:37 karolherbst: can we limit constbuf to 8 overall?
23:37 imirkin: so... right now we just treat all of them like ssbo's
23:37 imirkin: no
23:37 karolherbst: just for testing
23:37 imirkin: GL requires 14
23:37 karolherbst: well okay, but still
23:37 imirkin: BUT in many situations, we could bind them directly
23:37 imirkin: so
23:37 imirkin: the trick is to identify those situations, and take advantage of them
23:37 imirkin: (which i assume will be the vast vast vast majority of situations)
23:38 karolherbst: let's just bind all and hope for the best?
23:42 karolherbst: there are just 128 compute shaders anyway
23:47 karolherbst: imirkin: mhh okay, I see only const bufs 0-2 used in the TGSIs, how does somebody get to the 8 we can bind?
23:50 karolherbst: maybe it is really just a silly out of bound access
23:53 karolherbst: okay... this looks super wrong
23:55 karolherbst: imirkin: in the TGSI: DCL CONST[0] DCL CONST[1][0..43]
23:55 karolherbst: but, the generated shader accesses c7[0x420] and stuff
23:55 karolherbst: although the TGSI only has one constant access at all
23:55 karolherbst: 4: FSNE TEMP[2].x, CONST[1][43].xxxx, IMM[2].xxxx
23:56 karolherbst: this looks like this in the nv50ir: 9: ld u64 %r1585d c7[0x120] (0)
23:56 karolherbst: more or less
23:57 karolherbst: but later is stuff like "473: set u8 %p2887 eq u32 0x00000000 c7[0x420]"
23:58 kone1: will nouveau become less free in the future as nvidia requires signed blobs for new gpus?
23:59 karolherbst: imirkin: hum, those c7 access get added while converting to SSA form