01:41 imirkin: dboyan_: your updated proposal is *much* better
02:03 mangix: Horizon_Brave: worse yes.
02:12 whompy: Excepting anholt. He's a winner.
02:22 dboyan_: imirkin: Thanks, will do a minor update later. I think I'll upload the final version this evening or tomorrow morning.
02:22 imirkin: dboyan_: sounds good
02:28 imirkin: dboyan_: what's the status of your rcp/rsq64 patches?
02:28 imirkin: (sorry, i lost track...)
02:38 dboyan_: imirkin: iirc I sent out v2 and was waiting for review, especially on rsq
02:39 dboyan_: i fixed the issue with rsq in the first version and did some testing. I thought it was precise enough
02:39 imirkin: ok cool
02:39 dboyan_: btw, I haven't looked at the nir lowering mentioned by Elie yet
02:40 imirkin: no need
02:40 imirkin: iirc he had glsl passes too
02:41 imirkin: oh, but i think that there's nir lowering not from Elie that did rsq/rcp :)
02:41 dboyan_: yeah, I was talking about that one
02:42 imirkin: oh. *mentioned* by Elie. right. i can't read :)
02:44 dboyan_: imirkin: do you want me to port the code to other architectures?
02:48 imirkin: dboyan_: hold on that - let's agree on it first before you go through the effort
02:49 dboyan_: okay, I won't hurry
02:51 dboyan_: google is reminding me that 'less than 39 hours remain to submit your Final PDF Proposal' :)
03:12 imirkin: skeggsb: https://hastebin.com/zorujibaba.go :(
03:12 imirkin: this was with the G92 rendering onto the GK208
03:13 imirkin: or airlied perhaps?
03:20 imirkin: 2b:* f3 90 pause <-- trapping instruction
03:20 imirkin: oh, i guess it's not that useful, it's just pointing out there's a soft lockup...
04:20 dm_comp: does nouveau support NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2)
04:20 gnarface: are you debugging h264 decoding on G92, imirkin?
04:21 dm_comp: xrandr --listproviders Providers: number : 1
04:21 dm_comp: it sould be 2
04:21 dm_comp: s/sould/should
04:25 dm_comp: I also get Could not find provider with name nouveau
04:29 gnarface: dm_comp: https://nouveau.freedesktop.org/wiki/FeatureMatrix/
04:30 gnarface: dm_comp: (if your experience doesn't match, i doubt i can actually help, but good first guesses usually include things like "did i remember the non-free firmware package?" and "did i forget to blacklist the nvidia binary driver?"
04:30 gnarface: )
04:31 dm_comp: nvidia binary driver not installed
04:31 gnarface: dm_comp: lots of this stuff changed relatively recently in distro-release time though, so don't expect the old kernel in debian stable to have all this necessarily - you might need to update
04:31 gnarface: dm_comp: (it's possible that card is simply not supported too, but i thought it was, i just don't actually know)
04:32 gnarface: seems like everyone else has gone to sleep in here though
04:32 gnarface: so despite knowing almost nothing, i decided to share the link to the feature matrix at least then you know what i know
04:35 dm_comp: gnarface thank you
09:15 eyenseo: Who can I talk to regarding the GSoC?
09:15 karolherbst: eyenseo: depends on what you want
09:16 eyenseo: Well I know that I'm late to the party but I would like to participate
09:16 eyenseo: While I am not very good with maths I do know C++ pretty well so I would like to do the Instruction scheduler
09:17 karolherbst: eyenseo: that one is already taken though
09:17 eyenseo: ah ok =/
09:17 karolherbst: _maybe_ you can help out the person doing that, but I don't know how that would work out
09:18 karolherbst: eyenseo: will you be still a student next year?
09:18 eyenseo: karolherbst: yes, for at least 2 years - Master degree
09:18 karolherbst: okay
09:19 karolherbst: then we could do this: until next year you can do smaller tickets/bugs/cards for nouveau or other Xorg related projects and then you can do a proper Gsoc next year
09:20 eyenseo: karolherbst: yeah we could do that - I asked now because some of my units for the SuSe were canceled
09:20 karolherbst: having everything worked out until tomorrow is too much, especially because you also need a mentor (somebody from the nouveau project) and the proposal and so on
09:20 karolherbst: ohh, I see
09:20 eyenseo: karolherbst: would have been a nice 'filler' ;)
09:22 karolherbst: eyenseo: by any chance, are you doing something hardware security related?
09:23 eyenseo: no, not really, I had a unit about security and another for testing but it wasn't really about hardware
09:24 eyenseo: Currently I have a project running where 'we' compare crypto libraries in different languages - but again it's almost 100% software
09:24 karolherbst: no worries, we just have the signed firmware problem and if somebody with proper knowledge would want to tackle this, it would be really helpful
09:26 eyenseo: sounds challenging - with my current knowledge I would be of no help :/
09:29 airlied: karolherbst: tackle it how btw? find hacks/bypasses?
09:29 karolherbst: airlied: finding issues within the hw crypto implementation
09:29 karolherbst: there were some serious issues in pre maxwell2, but they all got fixed
09:30 karolherbst: and with serious I mean useful
09:32 karolherbst: airlied: another idea is to extract the firmware images from the proprietary driver and find issues there, but this is more work and less useful than doing that through the hw directly
11:27 karolherbst: imirkin: I have like 5 opts which only show any effect if I loop the opt passes, but there are also a few bugs if we loop them, I guess I could figure out how to improve the current passes a bit
11:42 karolherbst: okay, on my pixmark_piano branch I get +10% perf, let's work on those patches first
11:45 mupuf: karolherbst: holy shit, nice!
12:02 karolherbst: hakzsam: did you see something special regarding join for the maxwell sched opcodes?
12:07 karolherbst: mupuf: I get +4.5% just by reorder instructions a bit to improve the dual issue rate
12:08 karolherbst: *reordering
12:08 mupuf: I am not surprised
12:08 karolherbst: dboyan_: if you want to measure perf regarding scheduling, do it with gputest pixmark_piano
12:08 mupuf: Hopefully, we will have something better than that at the end of the summer!
12:08 karolherbst: yes!
12:11 RSpliet: karolherbst: just because there's a lot to gain? rather measure perf using a diverse set of games people actually play...
12:11 karolherbst: RSpliet: no, because scheduling matters there most
12:12 RSpliet: karolherbst: my favourite game is glxgears
12:12 karolherbst: the issue with testing with games is: better scheduling doesn't have to improve perf, because something else is shit
12:12 RSpliet: said nobody ever
12:12 mupuf: karolherbst: that is the thing with bottlenecks ;)
12:12 karolherbst: exactly
12:13 karolherbst: and with pixmark_piano our only bottleneck is shader execution time
12:13 mupuf: but see it like this: Lower power consumption => can yield higher clocks
12:13 karolherbst: mhhh, well....
12:13 mupuf: and avoiding stalls is a nice way of increasing the power efficiency
12:13 karolherbst: same bottleneck in the end
12:13 RSpliet: karolherbst: I got that, but why improve something that only a few benchmark websites care about - rather than actual users?
12:14 mupuf: in this case, yes. But what if you were memory-limited and you can suddenly boost the memory clock a little bit
12:14 mupuf: ?
12:14 karolherbst: RSpliet: the idea is, if we get proper instruction scheduling and it improves perf a lot in the pixmark_piano benchmark, it means that this pass does something good and may affect other things as well
12:14 karolherbst: mupuf: not affected by temp
12:14 karolherbst: memory clock isn't boosted
12:15 RSpliet: and no, DRAM goes about and power down banks if unused for a while. The more efficient you are with DRAM accesses, the longer your banks can be in low-power state
12:15 mupuf: karolherbst: not yet ;p
12:15 karolherbst: mupuf: well.... let us first get us a useful PMU :p
12:15 mupuf: that is not for debate :D
12:15 mupuf: we indeed need to do that
12:15 karolherbst: yes
12:15 RSpliet: karolherbst: because you are tailoring your optimisation pass for something with insane shaders like pixmark_piano. You probably have to be a bit more clever in your policy for actual games
12:15 karolherbst: no idea if pascal boosts memory clocks, I think they do though
12:15 RSpliet: or not, but it's important to validate that
12:15 karolherbst: baut seriously...
12:15 karolherbst: pascal
12:15 karolherbst: *but
12:16 mupuf: RSpliet: what would you rather have dboyan_ do then?
12:16 karolherbst: RSpliet: true, but pixmark_piano improves motivation, because you _notice_ a difference
12:16 mupuf: it will be useful for opencl too
12:16 karolherbst: if you try to write a scheduling pass and it has no effect, you have to figure out why
12:16 mupuf: and it is self-contained
12:16 mupuf: and does not require REing
12:16 mupuf: so, pretty good topic for a GSoC
12:16 karolherbst: then you may rewrite your scheduling again, but it wasn't the issue, but something else
12:17 RSpliet: mupuf: the topic is brill
12:17 mupuf: brill?
12:17 karolherbst: and then after a day without any changes, you give up, cause it doesn't do anything
12:17 RSpliet: brilliant
12:17 karolherbst: that's why something first, where you notice the change, then check where perf changes as well
12:17 RSpliet: but make sure to benchmark with real games eventually
12:17 karolherbst: yeah, next step
12:18 RSpliet: karolherbst: no, in lockstep with figuring out the "instruction picking" part of your sched algo
12:18 karolherbst: and then where scheduling had a relatively big effect, the scheduling could be improved with that game as well
12:18 RSpliet: you risk overoptimising for the wrong use-case
12:18 karolherbst: that isn't the point and never was
12:18 karolherbst: we could do that if we would know the bottlenecks in every application
12:18 karolherbst: fact is, we don't
12:18 karolherbst: except a few examples
12:18 karolherbst: and most/all of those are micro-benchmarks
12:19 RSpliet: don't we have perf counters to show the efficiency of shaders, thanks to hakzsam?
12:19 karolherbst: sure you can say: this game runs terribly, cause we don't schedule properly, but then again: do you know for sure or are you just guessing?
12:20 karolherbst: RSpliet: well, they can't really tell you about the bottlenecks, I already tried
12:20 karolherbst: we need those other counters for that
12:20 RSpliet: thought we had a pipeline stall counter
12:20 RSpliet: good strategy -> reduce stalls
12:21 karolherbst: well, sure, but I don't think we have those yet
12:22 karolherbst: the point I was just trying to make was, that starting with pixmark_piano is good for motivation, because scheduling has a big impact here
12:22 karolherbst: nothing more
12:24 RSpliet: Yes I read that, but you seem to refuse to acknowledge that this might well be a misleading strategy, because it's not a real game
12:24 karolherbst: I never said it should _only_ be written against it
12:25 RSpliet: diversity, analyse, measure, understand. Don't focus on a single target unless you know that target is what people care about
12:25 mupuf:fails to see why real time comes in the picture here. If anything, this reduces the pressure on the bus, which reduces the variance
12:25 mupuf: which is good for real time
12:25 RSpliet: mupuf: who mentioned real time?
12:25 mupuf: woops
12:25 mupuf: just read "real time game" :D
12:26 karolherbst: also, it's a fast benchmark, benchmarking games is time consuming
12:27 mupuf: karolherbst: because we do not use the right tools
12:27 karolherbst: especially if you don't know which game should get a perf boost, and you test 10, because you don't know it
12:27 mupuf: we need to finish the perf counter support
12:27 karolherbst: yes
12:27 mupuf: and use it with apitrace
12:27 karolherbst: right
12:27 mupuf: so as we can see actual changes at the draw call level
12:27 mupuf: and spend time where necessary
12:28 mupuf: there is an open source project from intel (frame_retracer) that could help, but we still need to work on this!
12:29 mupuf: and I'm sad I will not get a student this year to finish the GUI for apitrace's perf counter
12:29 karolherbst: +1.1% by smarter pow lowering... good enough
12:29 karolherbst: mupuf: we should try to finish all the counter work for nouveau first anyway :p
12:29 mupuf: karolherbst: yes
12:30 mupuf: it is all in the userspace now
12:30 mupuf: the kernel space landed IIRC
12:30 karolherbst: (ignoring the PMU counters :P but those are pretty useless to begin with)
12:31 karolherbst: but I think there was something missing
12:31 karolherbst: did all the MP counter work land already?
12:31 mupuf: MP counters are mostly fine
12:32 mupuf: it's pcounter that is not supported in the userspace
12:32 karolherbst: pcounter?
12:32 mupuf: one reason for that is that queries in gallium were not meant to be polled at the same time
12:32 mupuf: samuel had a patch for that, not sure if it landed
12:33 karolherbst: ohh, I think those landed
12:33 mupuf: karolherbst: pcounter is the engine that allows you to know everything at the GPU-level
12:33 mupuf: number of PCIe requests, split by sizes
12:33 mupuf: same with RAM
12:33 mupuf: or vdec, etc...
12:34 mupuf: there are ~7 clock domains
12:34 karolherbst: https://github.com/karolherbst/nouveau/commits/master_4.10?author=hakzsam
12:37 karolherbst: mhh engine/pm seems to be the right stuff
12:40 mupuf: ;)
12:40 mupuf: karolherbst: ask hakzsam, he has a list of events reverse engineered and what is still left to do
12:45 karolherbst: imirkin: is there a good place for the POW -> MUL lowering for small constants? I don't really want to do that in every ir_lower_* file, but still want to get all the opts
12:48 karolherbst:is wondering if we should do opts before lowering....
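A minimal sketch of the POW -> MUL idea raised above, written as standalone C++ rather than against nouveau's codegen IR. When the exponent is a small non-negative integer constant, the power can be computed with multiplies alone instead of the lg2/mul/ex2 sequence. The name `powSmallInt` and the cutoff of 16 are assumptions made for illustration and are not taken from the driver.

```cpp
#include <cassert>

// Hypothetical helper: computes a^n for a small non-negative integer n
// using square-and-multiply, i.e. only multiplications.
float powSmallInt(float a, unsigned n)
{
    // Illustrative cutoff (an assumption): beyond some size the generic
    // lg2/mul/ex2 path is likely the better choice.
    assert(n <= 16);

    float result = 1.0f;
    float base = a;
    while (n) {
        if (n & 1)
            result *= base;   // this bit of the exponent contributes
        base *= base;         // a, a^2, a^4, a^8, ...
        n >>= 1;
    }
    return result;
}
```

Unlike the lg2-based expansion, this form also gives the expected sign for a negative base. A real pass would emit only the multiplies that are needed; the loop here just shows the arithmetic.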
13:56 imirkin: karolherbst: hm? algebraicopt would make sense...
13:56 karolherbst: imirkin: we don't have pows in SSA
13:56 imirkin: oh, coz we lower them pre-ssa? that's dumb, we should do that in the legalize stage
13:57 karolherbst: okay, but then we just miss out the opts on the pow -> stuff result
13:57 karolherbst: but I guess this is fine
13:59 karolherbst: imirkin: in legalize POST_RA?
14:00 imirkin: not post-ra
14:00 imirkin: post-ssa-opt
14:00 imirkin: but still ssa
14:00 karolherbst: I meant the pow -> stuff translation
14:00 karolherbst: ohhh
14:00 karolherbst: mhh I see
14:06 imirkin: e.g. nvc0legalizessa
14:08 karolherbst: yeah, already found it
14:08 karolherbst: mhh
14:08 karolherbst: if I find some time, I will try to share more code between the chipsets
14:08 karolherbst: handlePOW should be always the same
14:09 imirkin: nv50 doesn't have the fmz thing
14:11 karolherbst: ohh, the gm107 thing is a subclass of nvc0 and there is no gk110...
14:11 karolherbst: okay, fair enough
14:17 karolherbst: imirkin: the code is still the same
14:31 gnurou: a little announcement: https://plus.google.com/+AlexandreCourbot/posts/BdWyfgsfp5J
14:32 imirkin: gnurou: good luck!
14:32 karolherbst: gnurou: hi :)
14:33 RSpliet: gnurou: I'm sorry to see you go...
14:33 RSpliet: but good luck with all your future endeavours
14:33 RSpliet: (and don't be a stranger ;-))
14:34 pmoreau: gnurou: Good luck with what you’re planning next!
14:34 gnurou: yeah. I'm sad to go honestly. but I am confident we will get other chances to share a drink
14:34 pmoreau: gnurou: Hope you’ll continue hanging around, even if you do not contribute.
14:34 gnurou: pmoreau: sure, I will
14:34 karolherbst: gnurou: will you come to XDC? :D
14:35 RSpliet: if you stick around the OSS GPU communities, I'm sure we'll meet @ XDC, FOSDEM or any of the other numerous events ;-)
14:35 gnurou: and if I miss it too much I may even consider *gasp* buying a NVIDIA GPU for myself ;)
14:35 gnurou: karolherbst: I hope too - too early to say though
14:35 gnurou: s/too/to
14:37 RSpliet: also: thank you for battling with your peers and superiors for years for us. It's much appreciated
14:37 karolherbst: mupuf: now we have to find somebody else who will forward our questions to the right people :O
14:37 pmoreau: gnurou: We can send you some samples for RE'ing purposes :-p
14:37 gnurou: I wish I could have done more, really :/
14:38 karolherbst: gnurou: I think after ben you have the most commits now in the kernel :p
14:38 pmoreau: karolherbst: I guess Andy or some of the other NVIDIA guys that were at XDC could help.
14:38 mupuf: gnurou: yes, you will be missed!
14:39 mupuf: and have fun in your new job too!
14:39 karolherbst: pmoreau: first we have to get them to talk here inside the channel :D
14:40 mupuf: gnurou: there is only so much one person can do, no worries, we are very grateful for your work!
14:40 pmoreau: Yes…
14:40 mupuf: and it has been a pleasure working with you
14:40 pmoreau: gnurou: Nothing public yet on what you will do next, I assume?
14:41 RSpliet: pmoreau: Valve doesn't let him say yet :-P
14:41 pmoreau: RSpliet: To work on RadeonSI, right? :-D
14:42 gnurou: haha. soon, if I don't get fired on my first week ;)
14:42 pmoreau: gnurou: That would be unfortunate… :-/
14:44 karolherbst: gnurou: if you get fired, you can help us out in the meantime :p
14:54 karolherbst: imirkin: moving the POW lowering after SSA: https://gist.github.com/karolherbst/0ba3adf0456e48d5147f6920f638bede
14:55 imirkin: heh
14:55 imirkin: either you did something silly, or there are a lot of situations where one does pow(a, 2), pow(a, 3), pow(a, 4) etc
14:55 karolherbst: mhhh
14:55 imirkin: since then those log2's would be CSE'd
14:55 karolherbst: another idea: more movs
14:56 imirkin: given the number of shaders hurt... unlikely
14:56 karolherbst: why?
14:56 imirkin: usually mov's are semi-random
14:56 imirkin: so some would be hurt, some would be improved
14:56 imirkin: here they're all hurt
14:56 karolherbst: mhhh
14:56 imirkin: except the one lucky one
14:56 karolherbst: well we miss all the opts now
14:57 karolherbst: the mul gets one value of the pow
14:57 karolherbst: could be an immediate
14:57 karolherbst: and isn't immediated anymore
14:59 imirkin: check some of the hurt shaders
14:59 imirkin: and see what's up
14:59 karolherbst: yeah
15:00 karolherbst: .....
15:00 karolherbst: yes, one issue are more movs, but there are more
15:01 karolherbst: "lg2 f32 %r451 abs %r274; mul dnz f32 %r452 %r451 15.000000" vs "abs+mov+lg2 f32 %r594 %r428+mul dnz f32 %r595 %r433 %r594"
15:01 karolherbst: so modifiers aren't folded in as well :/
15:01 karolherbst: that will be a fun lowering pass in the end
15:03 karolherbst: mhhhhhhh
15:03 karolherbst: actually
15:03 karolherbst: I have an idea
15:04 imirkin: you could fold them into POW
15:04 karolherbst: easier
15:04 imirkin: and then make sure to copy them over
15:05 karolherbst: I just declare what we could do with a POW
15:05 imirkin: that's what i mean.
15:05 karolherbst: I meant the table inside the target classes, or did you mean this as well?
15:05 karolherbst: ohhhh
15:05 karolherbst: yes, you meant this
15:05 imirkin: yes, i did.
15:06 karolherbst: src0 == lg2.src0 and src1 == mul.src0
15:07 imirkin: you could make it mul.src1 for the load propagation aspect
15:07 karolherbst: imirkin: can I simply assign the .src(x) objects?
15:07 imirkin: doubtful.
15:08 imirkin: but you can just swap out i->op for one of them
15:08 karolherbst: I do it for the ex2
15:08 karolherbst: do I have to copy something besides .mod?
15:09 imirkin: shouldn't be any indirects or any other funny business... i think that's it
15:15 dboyan: imirkin: about the ARB_shader_clock thing, I think the blob is doing something wrong there, but the rollover (clocklo overflow) should be taken into account.
15:16 imirkin: dboyan: we could also just not worry about it, and feed out only the "low" bits
15:16 imirkin: (as long as we put them in the upper 32-bits of the result)
15:16 imirkin: dboyan: btw, i assume you saw nha's ARB_shader_ballot patches -- those should be nicely implementable for kepler+
15:16 imirkin: (fermi didn't have the SHFL.IDX op)
15:17 RSpliet: dboyan: you missed a long heated discussion about how to benchmark scheduling ;-)
15:20 dboyan: imirkin: getting clocklo into upper 32 bits is also okay, since it takes a few seconds for clocklo to overflow. But I guess getting clockhi is not that hard either. I came up with an idea, which only needs a loop
15:21 imirkin: dboyan: but couldn't you just forget about the high bits and just use the low bits (but stick them high)?
15:22 karolherbst: (if a shader runs for more than a second, we have other issues anyway)
15:22 dboyan: imirkin: okay, if we decide to do it that way, I think we may even get ARB_shader_clock on nv50 ;)
15:23 imirkin: well, the thing is - if it doesn't start at 0
15:23 imirkin: then it can overflow whenever
15:24 imirkin: perhaps that's the diff between clock and globalclock?
15:24 imirkin: dboyan: well, we should def get clock on nv50 -- that one's unambiguous :)
15:24 imirkin: i have one plugged in so i can test if needed
15:24 karolherbst: imirkin: most likely. ARB_shader_clock defines a shader local clock
15:24 karolherbst: so it can start at 0 every time
15:24 imirkin: karolherbst: it doesn't define it one way or the other.
15:24 imirkin: [the spec doesn't]
15:25 dboyan: clockhi/lo seems to start from 0, at least on my card
15:25 karolherbst: imirkin: in practice, it defines one, because it doesn't guarantee it's usable as a global clock
15:26 karolherbst: imirkin: it doesn't even guarantee to be usable as a clock between different shader stages
15:27 karolherbst: mhh but this part should be important to your issue: "The returned time will wrap after it exceeds the maximum value representable in 64 bits."
15:27 imirkin: karolherbst: sure, but if it's a global clock, it won't complain :)
15:27 karolherbst: true
15:27 imirkin: that's what i mean by it's not defined
15:27 karolherbst: I see
15:27 imirkin: the important part is wrap detection
15:28 imirkin: which means that the high bit of precision has to be in the high bit of the 64-bit value
15:28 imirkin: otherwise the shader won't be able to detect a wrap event
15:28 karolherbst: mhh
15:28 karolherbst: yeah
15:28 imirkin: also, having a global clock is probably advantageous for profiling overall draws (rather than just a single shader invoc)
15:29 karolherbst: although
15:29 karolherbst: what happens with 0xffffffffffffffff + 1?
15:29 imirkin: that's aka a wrap event
15:29 karolherbst: does something bad happen?
15:29 imirkin: no
15:29 imirkin: but the shader can detect it.
15:29 karolherbst: okay, and why does it have to be detected?
15:29 imirkin: because 1 < 0xffffffffffffffff :)
15:29 imirkin: and normally time moves forwards
15:29 karolherbst: true, but the spec doesn't care
15:30 karolherbst: "The returned time will wrap after it exceeds the maximum value representable in 64 bits."
15:30 imirkin: and normally there are various assumptions about that when you're grabbing time.
15:30 dboyan: imirkin, karolherbst: One hint here, the blob is using local clock
15:30 karolherbst: yeah, because if applications use this as a global one -> bug
15:30 dboyan: hehe
15:30 imirkin: karolherbst: not if the application does proper wrap detection
15:31 imirkin: which is what we've been discussing this whole time.
15:31 karolherbst: it is always a bug
15:31 imirkin: the application needs to detect it
15:31 karolherbst: it is no global clock, so application shouldn't use it as one
15:31 karolherbst: yes
15:31 karolherbst: the application
15:31 karolherbst: not the driver
15:31 imirkin: right. but the driver needs to structure things so that the application *can* detect it
15:32 imirkin: e.g. if you stick the value into the lower 32 bits and let that wrap
15:32 karolherbst: I don't see anything inside the spec, which tells the driver to do a proper overflow check
15:32 imirkin: then the application may never notice.
15:32 imirkin: so the value's MSB needs to be in the 64th bit.
15:32 dboyan: Even more strangely, the blob puts clockhi in lower bits
15:32 karolherbst: no, the spec strictly says: if overflow, then wrap
15:32 imirkin: karolherbst: re-read what i said and try to understand it.
15:32 RSpliet: "The units of time are not defined and need not be constant.". I wonder what games we can give a "swift kick in the pants" by assuming a clock faster than NVIDIA :-P
15:33 imirkin: RSpliet: probably due to reclocking happening? dunno.
15:33 karolherbst: imirkin: I just don't see why nouveau has to do this
15:33 karolherbst: the spec is clear about this point
15:33 imirkin: karolherbst: consider this situation
15:33 imirkin: you have a hw 32-bit counter
15:33 imirkin: the API returns a 64-bit value.
15:33 imirkin: if you return your 32-bit counter in the lower 32 bits
15:34 imirkin: then when your 32-bit counter wraps
15:34 imirkin: then the API value will go from 0x00000000ffffffff to 0x0000000000000000
15:34 RSpliet: (imirkin: https://lkml.org/lkml/2005/7/8/263 )
15:34 imirkin: this is bad.
15:34 agusyc: Hi, guys.
15:34 karolherbst: pro tip: x << 32
15:34 agusyc: I'm having some trouble with nouveau on a hybrid graphics laptop.
15:34 imirkin: so as i was saying, for wrap detection to work
15:34 imirkin: you must stick your value's MSB in the 64th bit.
15:35 agusyc: When I use nouveau (I have it blacklisted now), the touchpad hangs after I resume it from suspend and when I try to turn it off, the Laptop freezes completely.
15:35 karolherbst: imirkin: okay, I missed that hw counter is 32bit wide fact
15:35 RSpliet: agusyc: first the basics: Kernel 4.10 or 4.11rc?
15:35 imirkin: karolherbst: i dunno how wide it is. but irrespective of how wide it is, the high bit of the counter has to be in bit 64 of the returned value.
15:36 agusyc: RSpliet: 4.10.6-200.fc25.x86_64
15:36 karolherbst: not if the counter is 64 bit wide
15:36 imirkin: if the hw counter is 64-bit wide, then the high bit of the counter still goes into bit 64 ;)
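The wrap-detection argument above can be sketched in standalone C++. The sketch assumes a hardware counter of `bits` width (the log leaves the real width open) and shows why aligning its MSB with bit 64 of the returned value keeps plain unsigned subtraction correct across a wrap. The names `alignCounterHigh` and `elapsedTicks` are hypothetical, not driver functions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: shift a hardware counter that is `bits` wide so that
// its most significant bit lands in the top bit of the 64-bit result.
uint64_t alignCounterHigh(uint64_t counter, unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return counter << (64 - bits);
}

// Elapsed ticks between two aligned samples. Unsigned subtraction is
// modulo 2^64, so a single wrap of the hardware counter between the two
// samples still yields the right difference once shifted back down.
uint64_t elapsedTicks(uint64_t start, uint64_t end, unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return (end - start) >> (64 - bits);
}
```

With a 32-bit counter going from 0xfffffff0 to 0x00000010, the aligned samples give an elapsed value of 0x20. Subtracting the same two values placed in the low 32 bits of a 64-bit word gives 0xffffffff00000020 instead, which is the failure mode described in the log.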
15:36 dboyan: strange, in Issue 2 "Spec language currently mandates 64-bit, which would preclude implementations from exposing a 32-bit timer."
15:36 karolherbst: imirkin: .. true, I was thinking about an overflow bit...
15:37 karolherbst: dboyan: :D unresolved...
15:37 RSpliet: agusyc: is this a skylake laptop by any chance?
15:37 agusyc: RSpliet: Nope, Asus X556UB. It has an i5 and a 940M.
15:37 imirkin: agusyc: out of curiosity, are you suspending to ram or to disk?
15:37 agusyc: imirkin: RAM.
15:37 karolherbst: imirkin, dboyan: but it should be fine if we just fill the 32 high bits of the 64bit value and be done with it, or not?
15:37 imirkin: karolherbst: afaik, yes.
15:37 karolherbst: even if this is a super silly workaround
15:37 RSpliet: agusyc: sounds a lot like my K501U, which is a skylake
15:38 agusyc: RSpliet: Mmm... And why may it be?
15:38 RSpliet: and which suffers from random hangs when locking screen or suspending like yours
15:38 imirkin: agusyc: i wonder if your issue is due to some kind of runtime pm getting hit on the usb device
15:38 agusyc: imirkin: The USB device?
15:38 agusyc: When did I mention a USB device? :P
15:38 imirkin: agusyc: touchpad is most likely hooked up via USB internally
15:39 agusyc: Oh...
15:39 agusyc: I didn't know that.
15:39 karolherbst: (or PS/2 ... )
15:39 imirkin: you can check with 'lsusb'
15:39 imirkin: PS/2 is on its way out, and it sounds like you have a semi-modern device
15:39 karolherbst: mine has PS/2
15:39 agusyc: Looks like it's not usb.
15:39 imirkin: huh, ok
15:39 karolherbst: imirkin: and it has a hsw CPU + Kepler GPU :p
15:39 agusyc: It's an Elantech Touchpad, by the way.
15:40 karolherbst: ... PS/2 it is :p
15:40 agusyc: Ok.
15:40 karolherbst: there is a special config for that in the kernel
15:40 imirkin: well, PS/2 is a lot harder to kill :)
15:40 imirkin: which is why i wanted to blame usb
15:40 karolherbst: MOUSE_PS2_ELANTECH
15:40 karolherbst: I have a Sentelic based one
15:41 agusyc: Welp.
15:41 agusyc: Is there anything I can do?
15:41 imirkin: well, i'm guessing fedora says =y or =m to everything...
15:42 agusyc: It happens on every distro.
15:42 RSpliet: agusyc: I always had the impression it's not the touchpad losing it, but the Intel GPU
15:42 agusyc: I tried several.
15:42 RSpliet: so the mouse doesn't move, no feedback on screen, but magic sysrq sometimes works, sometimes doesn't
15:42 agusyc: RSpliet: Do you think? I'm not using the NVIDIA one right now. I blacklisted the module and the issues don't occur.
15:43 RSpliet: ah... hmm... could nouveau be holding the kernel hostage in its resume-from-suspend?
15:43 imirkin: agusyc: try loading nouveau with runpm=0
15:43 karolherbst: imirkin: a little better now: https://gist.github.com/karolherbst/0ba3adf0456e48d5147f6920f638bede
15:43 agusyc: Ok, I'm going to reboot and try. Brb.
15:43 agusyc: You mean just adding "runpm=0" as a kernel parameter, right?
15:43 karolherbst: some values aren't immediated yet though
15:43 karolherbst: agusyc: nouveau.runpm=0
15:43 agusyc: karolherbst: Ok, thanks.
15:43 dboyan: imirkin: I'll try sticking clocklo to upper 32-bits then
15:44 imirkin: dboyan: which is what nvidia does right?
15:44 dboyan: except that it puts clockhi to lower bits
15:44 imirkin: yea
15:44 imirkin: but ... my guess is that's a hack for some kind of "advanced" software to detect silly things
15:45 dboyan: maybe
15:45 imirkin: i dunno
15:45 karolherbst: clockhi is 1 bit only?
15:45 karolherbst: and is cleared on read?
15:45 imirkin: karolherbst: hard to say.
15:45 imirkin: would require deeper investigation
15:46 imirkin: dboyan: i'd start simple, and not do any of the fancy nvidia things with the loop/etc
15:46 dboyan: I didn't manage to get clockhi more than 1, but it's clearly not cleared on read
15:46 karolherbst: I think nobody will need a 64bit precise clock anytime soon with nouveau anyway
15:46 karolherbst: s/precie//
15:47 dboyan: It takes about 10 seconds on my card to make clockhi non-zero
15:48 dboyan: and when I tried to make a shader run longer, the blob stops it midway
15:48 karolherbst: 10 seconds is a lot
15:48 agusyc: Ok, I think it worked...
15:48 agusyc: At least the touchpad didn't freeze.
15:48 agusyc: But I still have to see if it hangs on poweroff, so, brb.
15:48 karolherbst: dboyan: is the value reset automatically for every shader?
15:49 dboyan: karolherbst: you mean clocklo/clockhi?
15:49 karolherbst: yes
15:49 dboyan: I think so
15:49 karolherbst: good enough then
15:52 karolherbst: mhh
15:52 dboyan: imirkin: I also noticed nha's work on ARB_shader_ballot. might want to work on it if I get some spare time
15:54 imirkin: dboyan: cool
16:08 karolherbst: I forgot to clear the mod on the final ex2.....
16:15 karolherbst: getting there: "total instructions in shared programs : 3931743 -> 3932317 (0.01%)"
16:17 karolherbst: :((((
16:18 karolherbst: imirkin: "mov u32 %r292 0x44fa0000 + lg2 f32 %r442 %r292"
16:21 karolherbst: mhh, I guess I need to do that in the lowering then
16:21 karolherbst: messy
16:23 karolherbst: or do you have any better idea?
16:23 imirkin: can lg2 take an imm?
16:24 imirkin: i didn't think it could..
16:24 karolherbst: uhm....
16:24 karolherbst: we calculate the result in the compiler normally
16:25 karolherbst: the mov+lg2+mul -> mul in the old version
16:27 imirkin: oh. coz lg2(imm) = easy to compute ;)
16:27 karolherbst: yes
16:27 imirkin: you could fix that up in ConstantFolding
16:28 karolherbst: well, we just moved the lowering post SSA
16:28 karolherbst: I could allow pow(imm0, a) but then I would "opt" it to pow -> ex2(preex2(mul(lg2(imm0), a)))
16:29 imirkin: post-ssa there is no lowering
16:30 karolherbst: *legalizing then
16:30 imirkin: you could add special logic to handle it in your lowering pass though
16:30 imirkin: since you can use getImmediate there
16:31 karolherbst:is wondering if we should make it easy to call ConstantFolding::handleLG2 from outside _peephole
16:31 imirkin: no.
16:32 imirkin: i'd just do the imm handling in your new lowering logic
16:32 karolherbst: should be trivial enough after looking at the constantfolding code
16:33 imirkin: if src.getImmediate(imm): stuff.
16:33 karolherbst: yeah
16:36 karolherbst: is there a non float pow version at all?
16:37 imirkin: no
16:41 karolherbst: I can't call new_ImmediateValue
16:41 karolherbst: because the constructor of ImmediateValue calls prog->add
16:42 imirkin: why's that a problem?
16:43 karolherbst: I guess there is another way to get prog besides i->bb->getProgram()
16:43 karolherbst: ...
16:44 karolherbst: huh
16:44 karolherbst: ohhh
16:44 karolherbst: another issue
16:44 karolherbst: I reordered and moved setPosition
16:45 imirkin: or don't use the builder at all... wtvr
16:46 karolherbst: I use it for getSSA
16:46 imirkin: i guess using it is convenient :)
16:46 karolherbst: mhh
16:46 karolherbst: now I have to take more care about the mul, so I put the generated immediate into the second src
16:47 karolherbst: :(
16:47 karolherbst: it is starting to get complicated
17:06 karolherbst: pow(a, 1), well that is easy
17:19 karolherbst: ohh wait
18:00 karolherbst: imirkin: I am currently wondering, but this should be right: ex2(lg2(a)) == a? I am just confused why this wasn't optimized previously
18:01 imirkin: yes, that is correct
18:01 imirkin: at least ... correct enough
18:01 imirkin: (e.g. NaN & co won't be treated properly)
18:01 karolherbst: mhhh
18:03 pmoreau: Might be worth optimising it and adding some sel to handle the NaN & co cases?
18:04 karolherbst: I am pretty sure that our current output is also half wrong
18:04 imirkin: or not worry about nan since it all tends to be undefined
18:05 karolherbst: and if something really depends on that, we figure that out with the first bug report
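
A small sketch of where folding ex2(lg2(a)) to a changes the result. The `lg2` and `ex2` functions below are an IEEE-like model written for illustration (an assumption, not a description of what the hardware instructions return): lg2 of a negative number is NaN, lg2(0) is -inf.

```python
import math

NAN = float("nan")


def lg2(a: float) -> float:
    """Model of lg2: NaN for negative or NaN input, -inf for zero."""
    if a != a or a < 0.0:
        return NAN
    if a == 0.0:
        return float("-inf")
    return math.log2(a)


def ex2(x: float) -> float:
    """Model of ex2: NaN propagates, overflow saturates to +inf."""
    if x != x:
        return NAN
    try:
        return 2.0 ** x
    except OverflowError:
        return float("inf")


def unoptimized(a: float) -> float:
    return ex2(lg2(a))


def optimized(a: float) -> float:
    # ex2(lg2(a)) folded to a
    return a
```

Under this model the two agree (up to rounding) for a >= 0, while for a negative input the unoptimized form yields NaN and the folded form passes the negative value through unchanged.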
18:11 pq: obviously it must be that a > 0, but you knew that, not dealing with complex numbers I suppose. No idea if you should accept a = 0 or a being almost 0.
18:12 karolherbst: pq: we come from a pow actually
18:13 karolherbst: a^b -> 2^(b*lg2(a))
18:13 pq: does it accept pow(negative real, integer)?
18:15 karolherbst: uhhh wait, meh
18:16 karolherbst: I am sure it does
18:18 karolherbst: mhhh
18:35 imirkin: pq: mathematically, sure. in practice, i don't know if that's legal
18:36 imirkin: 99.99987% of the uses of pow are for srgb, i.e. pow(color, 2.2)
18:44 pq: I don't know if it's legal either :-)
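
A sketch of the a^b -> 2^(b*lg2(a)) rewrite and of pq's negative-base question. It assumes an IEEE-like `lg2` that returns NaN for negative input; the function name `pow_lowered` is made up for this example.

```python
import math


def lg2(a: float) -> float:
    """Assumed model: log2 for a > 0, -inf for 0, NaN otherwise."""
    if a > 0.0:
        return math.log2(a)
    if a == 0.0:
        return float("-inf")
    return float("nan")


def pow_lowered(a: float, b: float) -> float:
    """pow(a, b) expressed as ex2(b * lg2(a))."""
    return 2.0 ** (b * lg2(a))


srgb = pow_lowered(0.5, 2.2)       # the typical pow(color, 2.2) case
negative = pow_lowered(-2.0, 2.0)  # mathematically 4, but lg2(-2) is NaN here
```

So the lowered form matches a real pow for a positive base, and pow(negative real, integer) comes out as NaN in this model even though the mathematical result exists.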
18:51 karolherbst: imirkin: mhh, now we are missing LocalCSE :/
18:52 karolherbst: same base, different exponents -> you could share lg2(base)
18:53 ddaymace: currently running free radeon driver with debian stretch gnome; if i switch to geforce 6 card, will it switch to nouveau automatically, or do i have to uninstall old drivers?
18:54 karolherbst: imirkin: but currently I am already at "total instructions in shared programs : 3927257 -> 3926939 (-0.01%)", just added special handling if there is a src==1
18:54 imirkin: ddaymace: should switch, assuming you have them installed
19:43 karolherbst: imirkin: any idea how to solve the missing localCSE?
19:43 imirkin: that's one of the downsides of the approach.
19:43 karolherbst: mhh, I could have a map and save which lg2 I created
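
A rough sketch of the map idea: while lowering pows, remember which lg2 was already emitted for a given base and reuse it. Instructions are plain tuples and register names are invented; this is a toy model, not nv50_ir code, and it assumes SSA form within a single basic block so a base register is never redefined.

```python
def lower_pows(pows):
    """Lower (dst, base, exponent) pows to lg2/mul/preex2/ex2 tuples.

    A dict maps each base register to the register holding lg2(base),
    so pows sharing a base share one lg2.
    """
    lg2_of = {}
    out = []
    for dst, base, exponent in pows:
        if base not in lg2_of:
            tmp = f"%lg2_{base.lstrip('%')}"
            out.append(("lg2", tmp, base))
            lg2_of[base] = tmp
        mul = f"%mul_{dst.lstrip('%')}"
        pre = f"%pre_{dst.lstrip('%')}"
        out.append(("mul", mul, lg2_of[base], exponent))
        out.append(("preex2", pre, mul))
        out.append(("ex2", dst, pre))
    return out


insns = lower_pows([("%r1", "%a", "%b"), ("%r2", "%a", "%c")])
lg2_count = sum(1 for insn in insns if insn[0] == "lg2")
```

Both pows use `%a` as base, so only one lg2 is emitted and the second mul reads the saved register.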
19:43 imirkin: remind me why you wanted to move it to later on btw?
19:43 karolherbst: constantfolding
19:44 karolherbst: pow(a , 16) -> tons of muls
19:44 karolherbst: well, <5 is fine as well
19:44 karolherbst: but this was the idea
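
A sketch of turning pow(a, n) with a small constant integer n into muls. It uses exponentiation by squaring, which is one possible strategy and not necessarily what the patch under discussion does; `expand_int_pow` and the temporary names are hypothetical.

```python
def expand_int_pow(n):
    """Return (ops, result) computing a**n for integer n >= 1.

    Each op is (dst, lhs, rhs), meaning dst = lhs * rhs.
    """
    ops = []
    result = None
    square = "a"  # holds a**(2**k) for the current bit k
    count = 0
    while n:
        if n & 1:
            if result is None:
                result = square
            else:
                count += 1
                ops.append((f"t{count}", result, square))
                result = f"t{count}"
        n >>= 1
        if n:
            count += 1
            ops.append((f"t{count}", square, square))
            square = f"t{count}"
    return ops, result


def evaluate(n, a):
    """Run the generated muls on a concrete value to check them."""
    ops, result = expand_int_pow(n)
    env = {"a": a}
    for dst, lhs, rhs in ops:
        env[dst] = env[lhs] * env[rhs]
    return env[result]


ops16, _ = expand_int_pow(16)
```

With squaring, pow(a, 16) needs four muls; a naive chain of a*a*a*... would need fifteen.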
19:44 imirkin: well, you could detect the decomposed thing in AlgebraicOpt right?
19:44 karolherbst: there are more opts possible based on pow though
19:45 imirkin: ok
19:45 karolherbst: like pow(1, a) and pow(a, 1)
19:45 karolherbst: and so on
19:45 imirkin: well, the latter should work out
19:45 karolherbst: both need the same code
19:45 imirkin: since exp(lg2(a) * 1) = exp(lg2(a))
19:45 imirkin: although we might not have the smarts to make a out of that.
19:45 karolherbst: yeah
19:46 karolherbst: maybe
19:47 karolherbst: maybe I check out that ex2(lg2(a)) == a first
19:47 karolherbst: would be a perfect AlgebraicOpt thing
19:48 karolherbst: and improves the situation a little regarding pows without much code
20:15 karolherbst: imirkin: is PREEX2 always used prior to an EX2? I don't really know what those PRE* instructions do
20:18 karolherbst: heh, I can do lg2(ex2(a))==a as well
20:24 imirkin: yes
20:24 karolherbst: I know that PRESIN can be/is used before sin and cos
20:24 karolherbst: do they somehow prepare the register or do they calculate something as well?
20:25 imirkin: they calculate something
20:27 karolherbst: *sigh* my opt isn't picked up, cause the mul(a, 1) is still there
20:29 imirkin: coz you need to detect lg2(mul(ex2(preex2(a)), b))
20:29 imirkin: er, other way around
20:29 imirkin: ex2(preex2(mul(lg2(a), b)))
20:29 imirkin: might be a prelg2 as well, i forget
20:29 karolherbst: the mul(a, 1) is opted away though
20:30 karolherbst: there is no prelg2 at least on nvc0
20:30 karolherbst: ConstantFolding deals with the mul
20:30 imirkin: k
20:30 imirkin: algebraicopt runs before constantfolding
20:30 karolherbst: yeah
20:31 karolherbst: I think looping over the opts is our best shot now, and it indeed changes a lot
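
A toy sketch of why looping over the opts helps when AlgebraicOpt runs before ConstantFolding. Expressions are nested tuples, the two rules are drastically simplified stand-ins for the real passes, and preex2 is left out for brevity; all of this is an assumption made for illustration.

```python
def rewrite(expr, rule):
    """Apply rule bottom-up to a nested-tuple expression."""
    if isinstance(expr, tuple):
        expr = (expr[0],) + tuple(rewrite(e, rule) for e in expr[1:])
    return rule(expr)


def algebraic_opt(e):
    # ex2(lg2(x)) -> x
    if (isinstance(e, tuple) and e[0] == "ex2"
            and isinstance(e[1], tuple) and e[1][0] == "lg2"):
        return e[1][1]
    return e


def constant_folding(e):
    # mul(x, 1.0) -> x
    if isinstance(e, tuple) and e[0] == "mul" and e[2] == 1.0:
        return e[1]
    return e


def run_once(expr):
    """One round of passes in the order mentioned above."""
    for rule in (algebraic_opt, constant_folding):
        expr = rewrite(expr, rule)
    return expr


def run_to_fixed_point(expr):
    """Repeat the pass list until nothing changes."""
    while True:
        new = run_once(expr)
        if new == expr:
            return expr
        expr = new


pow_a_1 = ("ex2", ("mul", ("lg2", "a"), 1.0))  # lowered pow(a, 1)
single = run_once(pow_a_1)
looped = run_to_fixed_point(pow_a_1)
```

A single round only removes the mul, leaving ex2(lg2(a)); the second round lets the algebraic rule see it and reduce the whole expression to a.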
20:34 karolherbst: just need to fix all those AlgebraicOpts
20:37 karolherbst: instructions: -0.36%
20:37 karolherbst: locals: -2.62%
20:38 karolherbst: but some hurt gprs
21:28 karolherbst: imirkin: regarding RA and improving register layout (fewer movs for d/t/q regs) I think you told me once that doing it backwards should be easier and "better". I never dug into the RA code, but I think I will work on that issue sooner or later
21:30 karolherbst: I just wanted to ask about this issue before I start working on it, in case I remember that wrongly and there is a better way to tackle this
21:31 karolherbst: or I do "Move loop-invariant defs out of loops" :) this sounds like a lot of perf to gain
21:43 karolherbst: found another bug. nice
21:43 karolherbst: ohh wait, no, this makes sense actually
22:30 mooch2: sorry, i'm a bit new to this