00:06airlied[d]: mhenning[d]: I don't think it's legal to encode an immediate in LEA anymore for Rb
00:12mhenning[d]: airlied[d]: oh, that's interesting. I can try that tomorrow
00:12airlied[d]: yeah I think no c[] or imm forms
00:12airlied[d]: just looking at all the LEA in the dumps I took from nvidia libs
00:13mhenning[d]: I wonder if any other instructions might have lost their imm forms
00:26airlied[d]: weird, I hacked it, and it still dies, I'll keep digging
00:30airlied[d]: ah also has a ur encoded that needs 0xff
00:31airlied[d]: oh maybe that is all it was
00:33mhenning[d]: yeah, that could do it
00:39airlied[d]: I'll keep an eye out for the same problems in any remaining errors
00:45redsheep[d]: gfxstrand[d]: Before we know it you'll plug in a fermi card. You're already working on the same isa lol
00:46redsheep[d]: Then every generation possible will have vulkan (unless you're fully crazy and want to go back to Tesla...)
00:50redsheep[d]: Released in 2006, that would truly be a feat. IMO getting cards from 2012 working is plenty
00:51redsheep[d]: Tesla sounds like a nightmare, I remember hearing there's lots of variation in the hardware capabilities from back then and the cards are so old most are probably dead
00:56gfxstrand[d]: I don't care about Tesla. If we enable Fermi, the point will be to amber the entire nvc0 driver.
01:04orowith2os[d]: Is Fermi really a requirement for Amber-ization?
01:06gfxstrand[d]: Not really
01:07gfxstrand[d]: We can amber the whole thing whenever we want, really, unless mhenning[d] and karolherbst[d] are still interested in actively working on it.
01:08gfxstrand[d]: But if we support Fermi, we can DELETE the old driver.
01:08gfxstrand[d]: But IDK that it's worth it.
01:09gfxstrand[d]: I'm mostly hacking on Fermi ISA so that when we claim Kepler support, it'll be all Kepler.
01:12orowith2os[d]: That would make Fermi a bit easier too, wouldn't it?
01:12gfxstrand[d]: Yeah
01:12gfxstrand[d]: Well, to a point
01:13orowith2os[d]: You already have the compiler, now it's just figuring out which ioctls do what, and how. And that's not impossible, probably easier since it's older (but harder because it's not designed initially with Vulkan in mind)
01:13gfxstrand[d]: Fermi needs the copy firmware rewritten and it doesn't support bindless so it needs a new descriptor model. It'd still be a lot of work but at least you could compile shaders.
01:14gfxstrand[d]: orowith2os[d]: Oh, it's all the same ioctls
01:15orowith2os[d]: :akipeek:
01:18redsheep[d]: Deleting nvc0 sounds cool but if fermi is too much trouble 🤷
01:18redsheep[d]: Obviously Nvidia decided against it, I'd assume solely based on the descriptor issue
01:19karolherbst[d]: worst case fermi uses the old copy stuff.. just means another code path to support, but might be worth it if it means ditching the gl driver
01:20redsheep[d]: Not having the code lying around rotting would be nice
01:20karolherbst[d]: fermi doesn't even need to be conformant, just good enough for zink
01:20zmike[d]: delete the code!
01:20gfxstrand[d]: But we can also just amber the GL driver and that's almost as good as ditching it.
01:20redsheep[d]: We've already started to see some cases of unreported regression with old hardware
01:20zmike[d]: does that mean we're doing a new amber branch?
01:21karolherbst[d]: nobody cares about fermi besides maybe running a desktop and a browser
01:21orowith2os[d]: karolherbst[d]: A nonconformant Vulkan driver, only enough to make Zink work? :GA_Uueuuue:
01:21orowith2os[d]: How would you stop normal apps from using it?
01:21zmike[d]: if engine!=zink abort
01:21karolherbst[d]: normal apps are fine, nobody will run games on fermi anyway
01:21orowith2os[d]: Weh
01:22karolherbst[d]: if gnome/kde run, that's good enough
01:23gfxstrand[d]: gfxstrand[d]: Also, I'm personally curious about the evolution of Nvidia's ISA.
01:24karolherbst[d]: fermi and kepler 1st gen basically use the same ISA
01:24karolherbst[d]: fermi has short encodings tho
01:24gfxstrand[d]: Yeah. Textures and images are different because of bindless but that's most of it.
01:25gfxstrand[d]: karolherbst[d]: Which codegen has code for but has disabled. 🤡
01:25karolherbst[d]: you can encode two instructions inside one
01:25karolherbst[d]: gfxstrand[d]: for fermi? It's used on tesla tho
01:26karolherbst[d]: kepler got it removed
01:27karolherbst[d]: I'm kinda sure it's also used on fermi, but... I never checked
01:27gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/codegen/nv50_ir_emit_nvc0.cpp#L2961
01:29karolherbst[d]: fun
01:32gfxstrand[d]: NAK is not going to support that unless there's a VERY compelling reason to do so. (There won't be.)
01:34zmike[d]: so...no new amber branch
01:36karolherbst[d]: is "we'll save a bit of shader heap" not a compelling argument?
01:37zmike[d]: well this next release cycle would be a good time for it considering we're also dumpstering clover/nine/xa
01:40karolherbst[d]: if one finds the time to speed running fermi support, sure
01:40zmike[d]: where's alyssa
01:40zmike[d]: someone call alyssa
01:40redsheep[d]: Pick up the witch phone
01:41orowith2os[d]: Give me 100 billion moneys and let's talk
01:41orowith2os[d]: :fork:
01:46redsheep[d]: orowith2os[d]: You sent me down a rabbit hole, even if fulfilled with Lebanese pounds that's still 1.1 million USD
01:47orowith2os[d]: redsheep[d]: Crash one currency with a super low value relative to USD, convert, wait for it to rise, convert again, then wait for it to fall, go back to the low value currency, then once it rises, back to USD. Repeat until 100 billion moneys
01:49redsheep[d]: Ah yes the classic "just buy at the right time over and over" plan
02:30mhenning[d]: gfxstrand[d]: I always think it's funny when people talk about dropping nvc0 without a replacement. Like, you realize we still nominally support nv30 on main, right?
02:31mhenning[d]: the linux kernel supports all the way back to nv04 still, which is ... iirc the first one that used triangles as its main primitive
02:32orowith2os[d]: Sw rendering is the replacement, if you want to use modern mesa, no?
02:32orowith2os[d]: At some point, some GPUs just don't make sense to support anymore
02:33gfxstrand[d]: zmike[d]: Oh, well nine is the only compelling reason to use old nouveau so that aligns nicely!
02:33orowith2os[d]: orowith2os[d]: I could barely tolerate my gtx260 on Windows, and I hit vram limits so much that I ended up dipping into system memory :nekosnap:
02:34gfxstrand[d]: mhenning[d]: Oh, I'm not saying we'd get rid of src/gallium/nouveau. (Though we should amber it.)
02:35gfxstrand[d]: Honestly, I don't know why I bothered with anything pre-Turing. 🤷🏻‍♀️
02:35gfxstrand[d]: Because I'm a sucker? <a:shrug_anim:1096500513106841673>
02:35mhenning[d]: Yeah, tbh turing would have been a fine cut-off
02:36gfxstrand[d]: We can still delete the other hardware. I haven't enabled it by default yet.
02:37gfxstrand[d]: Honestly, if nouveau GL were competent, playing with Vulkan on old hardware would be a lot less interesting.
02:37mhenning[d]: yeah, although if we had to support either nvc0 or nvk on kepler, I'd rather support nvk
02:44gfxstrand[d]: Honestly, bringing up new ISAs is fairly easy and fun. It's a lot of work but it's fairly mechanical and the CTS tells you exactly what all to do. You just have to be careful that you add all the right legalization rules.
02:44gfxstrand[d]: Until you run into Kepler images, that is.
02:47gfxstrand[d]: But wiring up all the ALU ops is pretty straightforward.
02:48gfxstrand[d]: I started typing float ops before I stopped to make supper. I don't believe the codegen code.
02:53orowith2os[d]: mhenning[d]: Support NVK, you get Zink for basically nothing, which means Vulkan and OpenGL. Support nvc0, you support OpenGL, but only that, and you deal with the old gallium driver code, and all of its failings. It's a no brainer :slowpoke:
02:55orowith2os[d]: Where was that conversation about Vulkan 1.3 on Kepler?
02:56orowith2os[d]: I'm trying to remember why it wouldn't work out, especially if you just did a horrible implementation, only to get it to work.
02:56orowith2os[d]: It was memory model stuff, right?
02:57gfxstrand[d]: Yup. Memory model
02:58zmike[d]: gfxstrand[d]: we should try and amber as much stuff as possible this cycle if that's going to be a thing
03:00gfxstrand[d]: By this cycle, do you mean 25.1 or 25.2?
03:00zmike[d]: 2
03:00gfxstrand[d]: Kk
03:00zmike[d]: .1 is tomorrow
03:00orowith2os[d]: gfxstrand[d]: Is there a more detailed write up on it I can read for it? :monadicCat:
03:00gfxstrand[d]: Maxwell+ will be conformant on 25.1.
03:00zmike[d]: Nice
03:01gfxstrand[d]: Kepler will take longer
03:01zmike[d]: Maybe send a mail to the list saying you're planning to amber some stuff to get the ball rolling
03:01gfxstrand[d]: But honestly, nouveau GL is bad enough that if we can get NVK working at all, we can probably amber it.
03:02orowith2os[d]: orowith2os[d]: I'll go n give this a read for now: https://docs.vulkan.org/spec/latest/appendices/memorymodel.html
03:03orowith2os[d]: But knowing Kepler's limitations would be nice too
03:03gfxstrand[d]: The biggest question, I guess, is who decides what driver to use? The amber stack or the mainline stack?
03:03gfxstrand[d]: orowith2os[d]: No. All I know is that the guy at NVIDIA who was working on it, who's just as smart as me and has access to hardware people, said it can't be done.
03:03zmike[d]: I think that's a distro thing
03:04orowith2os[d]: It is a distro thing
03:04orowith2os[d]: Most say "fuck it" and give you normal Mesa, you need to use a specific image or install amber if you want that
03:05gfxstrand[d]: Distros decide what drivers to build from where but unless we amber all of src/gallium/nouveau, I'm not sure how we sort out the conflicts. Even then, we need the amber driver to not enumerate on newer hardware so Zink takes over.
03:06orowith2os[d]: They'll set amber and latest to conflict, they shouldn't ever get into that situation, I think
03:07orowith2os[d]: Arch does, at least
03:07mhenning[d]: I had imagined we'd keep nvc0 and nvk+zink around in parallel for a few releases and then slowly delete nvc0 code where they overlap
03:07mhenning[d]: which would mean not ambering anything for 25.2
03:09gfxstrand[d]: Yeah. If we could figure out a scheme by which Zink always wins if we say it's good to go on mainline, I'm happy to amber nouveau. We won't make many improvements on the amber anyway and we can slowly flip on more hardware as we're ready.
03:09gfxstrand[d]: But if we need the two branches to coordinate somehow, that's a lot harder.
03:09mhenning[d]: I don't really see what the rush is.
03:10mhenning[d]: We can just keep them both in main until there's a compelling reason to do otherwise
03:10airlied[d]: indeed, I've enough pascal users that I don't want to deal with amber
03:11airlied[d]: I'm not sure I really want to bother with zink for pre-turing
03:12mhenning[d]: I'm on board with switching kepler through pascal to zink if we have the cts passing
03:13redsheep[d]: Maxwell and Pascal pass now, no?
03:13gfxstrand[d]: Once we have Kepler support in NAK, we can move codegen back to src/gallium/nouveau and stop linking NVK against it. that's good enough for me.
03:13orowith2os[d]: I didn't realize NVK on pascal was a thing, what with the lack of reclocking?
03:13gfxstrand[d]: redsheep[d]: Vulkan CTS, yes. I haven't run GL
03:13mhenning[d]: reclocking is a kernel thing
03:14orowith2os[d]: Without reclocking, I was expecting nobody to bother for pascal
03:14gfxstrand[d]: That's what I keep saying about everything pre-Turing but people seem fascinated with it.
03:14gfxstrand[d]: Even with reclocking, Kepler isn't exactly cutting edge hardware.
03:14mhenning[d]: gfxstrand[d]: yeah, dropping nvk+codegen makes sense
03:15redsheep[d]: Old hardware that's only fast enough to run a desktop having a better experience isn't nothing
03:16orowith2os[d]: I sure as hell won't complain about Kepler, considering my PC has a Kepler card sitting in it right now
03:16gfxstrand[d]: orowith2os[d]: But weren't you just saying you have a 3090 that you're not using? ð
03:17orowith2os[d]: Different PC :v
03:17gfxstrand[d]: That'll run NVK just fine...
03:17orowith2os[d]: But yeah, it's in the house with me
03:17redsheep[d]: gfxstrand[d]: It just about manages it
03:18orowith2os[d]: Wait, faith, me having both cards doesn't mean you can just tell me to use one over the other >:(
03:19gfxstrand[d]: Why not?
03:19mhenning[d]: the 3090 may be marginally faster
03:19redsheep[d]: I will if she doesn't
03:20gfxstrand[d]: If you don't want to use the 3090, you can just give it to me. 🤪
03:20orowith2os[d]: Only if it was my card to give :sad:
03:25airlied[d]: gfxstrand[d]: I think we can merge https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34513
03:26airlied[d]: I might line up some more blackwell bits, this dEQP-VK.glsl* run seems to not have any more illegal instruction encodings
03:27redsheep[d]: Am I the only one who reads "rb" and gets a craving for an Arby's roast beef sandwich?
03:27mhenning[d]: redsheep[d]: yes
03:28redsheep[d]: People say "I need some RBs here" and I say me too
03:29gfxstrand[d]: redsheep[d]: Now that you say it...
03:32gfxstrand[d]: https://tenor.com/view/roast-beef-sandwich-sandwiches-roastbeef-gif-26541976
03:33redsheep[d]: Damn ok time to go make dinner and stop hungrily reading gitlab
03:33gfxstrand[d]: I need to go to bed. I'm slightly drunk and in total bullshit mode right now.
03:34redsheep[d]: gfxstrand[d]: Valid. Drink water.
03:34gfxstrand[d]: airlied[d]: Okay, I'll look in the morning
03:48HdkR: 4
03:51gfxstrand[d]: 7
04:04airlied[d]: yay, no illegal instructions
09:39f_: ILLEGAL INSTRUCTIONS
09:40tiredchiku[d]: ban
09:40f_: Maybe some of you came to 38c3 ^^
09:42f_: tiredchiku[d]: What do you mean? (If that was directed to me)
09:45tiredchiku[d]: oh I was just messing around :p
09:45tiredchiku[d]: >no illegal instructions
09:45tiredchiku[d]: >ILLEGAL INSTRUCTIONS
09:45tiredchiku[d]: "ban"
09:56f_: :D
10:50snowycoder[d]: Ohhh, KeplerB CB_AUX was a primitive form of descriptor table to support bindless images(?).
10:50snowycoder[d]: Then why is it only used by images and not by textures?
12:08mohamexiety[d]: HdkR: this may be interesting for you https://github.com/geerlingguy/sbc-reviews/issues/62#issuecomment-2798278680
12:08mohamexiety[d]: seems with the latest BIOS the Orion O6 can start with some PCIe 4.0 GPUs now -- tester in comment got a 3060 detected
12:13marysaka[d]: I should give that a shot too, I have both my boards on latest BIOS :aki_thonk:
12:18asdqueerfromeu[d]: mohamexiety[d]: I didn't expect the Raspberry Pi person to be here 🥧
13:03gfxstrand[d]: snowycoder[d]: Yeah, there's some magic CBs for things in the GL driver. Not all of them matter anymore but I suspect we need at least a little something for images.
13:28snowycoder[d]: gfxstrand[d]: The strange thing is that someone is filling those CBs in the vulkan drivers but the only code for that I can find is in the gallium driver, so who is filling the aux cbuf :/
13:43gfxstrand[d]: snowycoder[d]: In NVK, we have 3-4 things in the root descriptor table and a few in descriptor buffers.
13:44gfxstrand[d]: We don't have a giant AUX_CB that contains piles of stuff.
13:45gfxstrand[d]: MSAA is the biggest culprit when it comes to needing random stuff in CB0
14:18snowycoder[d]: gfxstrand[d]: But codegen still expects the AUX CB if I read the code correctly, that's what I'm missing
14:36snowycoder[d]: gfxstrand[d]: Also, there's this comment in the code (nvk_shader.c) "Codegen sometimes puts stuff in cbuf 1 and adds 1 to our cbuf indices", so it seems like it does write AUX CB (and it works, texelFetch tests pass).
14:57gfxstrand[d]: NVK never fills out any image metadata in AUX_CB.
14:58gfxstrand[d]: And texelFetch doesn't need anything besides the bindless handle, which comes from the descriptor.
14:59gfxstrand[d]: For images, we do pass along a little bit. See nvk_storage_image_descriptor
16:11mohamexiety[d]: Does nvdump or any other tool reveal allocated shared mem size?
16:12mohamexiety[d]: I know it’s shared with L1d and the split is kinda configurable from CUDA but not on Vulkan afaik but beyond that not sure if there’s another way to get what the allocated size is
16:25mhenning[d]: You could try nvdump with https://gitlab.freedesktop.org/nouveau/nv-shader-tools/-/merge_requests/1 and see if that gives you more info
16:25mhenning[d]: if that doesn't include the info the only other way I can think of getting it is by reading info from the QMD
16:26marysaka[d]: mhenning[d]: actually no the best approach is to actually just output the NVUC file because nvdisasm can parse it
16:27mhenning[d]: yeah, I think that's what that MR does
16:27marysaka[d]: (discovered that randomly when trying to do stuffs with my own variant of that tool)
16:27marysaka[d]: oooh
16:27mhenning[d]: I do a similar thing here: https://gitlab.freedesktop.org/nouveau/nv-shader-tools/-/merge_requests/6/commits
16:27mhenning[d]: but that's only for sc binaries
16:27marysaka[d]: I see
16:43mohamexiety[d]: mhenning[d]: problem is this is to try to figure out what that packet value means for the blackwell QMD so can't rely on that :KEKW:
16:43mohamexiety[d]: but yeah will try this MR then, thanks!
17:22HdkR: mohamexiety[d]: Sadly the latest bios makes it so the kernel stalls when booting. So I rolled back
17:29mohamexiety[d]: HdkR: oof. that's bad.. oh well
17:29mohamexiety[d]: thanks!
17:34HdkR: Luckily the board only needs to last for a few more months before I get my hands on a Thor instead :P
17:34mohamexiety[d]: hehe
17:35mohamexiety[d]: I kinda want one of those as well, or Spark but we'll see if I end up pulling the trigger
17:36mohamexiety[d]: Thor should be a lot cheaper than Spark at least given it wont have the 128GB RAM selling point going for it :thonk:
17:39HdkR: Thor will have a PCIe slot which makes it more interesting of a platform. Especially with Thunderbolt being incredibly broken on arm hardware.
17:41mohamexiety[d]: oh! I actually didn't know that. yeah that makes it a lot more useful than Spark then
18:07snowycoder[d]: gfxstrand[d]: You are right, but most `imageread` tests do emit `suclamp`/`sumad`... ops that read from cbuf binding 1 expecting it to have AUX data, and dEQP seems to pass. Is it just reading random data?
18:07snowycoder[d]: Ex that I tested with nvdisasm: `dEQP-VK.spirv_assembly.instruction.graphics.image_sampler.imageread.storage_image.optypeimage_mismatch.rgba8.depth_property.non_depth.shader_vert`
18:20snowycoder[d]: Huh, just rebased on kepler-tex, we pass the same tests too.
18:20snowycoder[d]: Probably they were just reading garbage
18:20gfxstrand[d]: Okay, is there a bindless version? Or how is the cbuf encoded?
18:21gfxstrand[d]: It's possible that we have to emulate it for bindless or something like that
18:25gfxstrand[d]: I'm still working on all the float ops for Kepler A.
18:26gfxstrand[d]: I got them all typed and then decided to change my set_opcode helper.
18:26gfxstrand[d]: Unrelated: Why did Fermi put attribute load/store on the texture unit?!?
18:29snowycoder[d]: gfxstrand[d]: From what I've seen on the codegen: we used the bindless handle to index into the AUX table for some data (e.g. NVC0_SU_INFO_ADDR).
18:29snowycoder[d]: The buffer encoding is described by `NVC0_SU_INFO_*` defines in `nv50_ir_lowering_nvc0.h` and encoded in `nve4_set_surface_info` in `nvc0_tex.c`
18:30snowycoder[d]: gfxstrand[d]: Old hw seems just really weird, why does KeplerB expose the image address calculations
18:49TimurTabi: karolherbst: I pushed a new version of the Wiki update. It's more polite :-)
18:52karolherbst: cool will take a look later
19:06_lyude[d]: Hey - are there any fixes for any of the memory bound issues on ben's current 03.01-gb20x branch? Been running it on my laptop and keep hitting this: https://paste.centos.org/view/6b01835e
19:14_lyude[d]: main reason I ask is because I also see UBSAN errors at the start and I could have sworn I saw a fix go by at some point that had a fix for this:
19:14_lyude[d]: [ 4.706368] UBSAN: array-index-out-of-bounds in drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c:523:21
19:14_lyude[d]: [ 4.706369] index 0 is out of range for type 'PACKED_REGISTRY_ENTRY [*]'
19:16mohamexiety[d]: well this is funny. going with a shared bool, to a shared uint array of 8 elements, to 32, to 64, I am seeing the exact same entry.
19:16mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
19:16mohamexiety[d]: .VALUE = 0x1b41802
19:16mohamexiety[d]: when I went with 128 elements, I got:
19:16mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
19:16mohamexiety[d]: .VALUE = 0x1b41804
19:16mohamexiety[d]: I guess shared memory gets allocated in chunks :thonk:
19:16mohamexiety[d]: lets try something more extreme
19:26gfxstrand[d]: Yes
19:27gfxstrand[d]: Look at `gv100_sm_config_smem_size()` in qmd.rs
19:27gfxstrand[d]: 1024B is the smallest amount you can allocate other than zero
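As a rough sketch of that behavior on pre-Blackwell parts: zero stays zero, everything else costs at least 1024 B. The rounding step beyond the minimum is an assumption for illustration here (the real `gv100_sm_config_smem_size()` in qmd.rs may use a different granularity):

```rust
/// Hedged sketch, not the verbatim NAK code: shared memory requests
/// of zero stay zero; anything else is bumped to at least 1024 B.
/// The 1 KiB rounding step is an assumption for illustration.
fn sm_config_smem_size(bytes: u32) -> u32 {
    if bytes == 0 {
        0
    } else {
        bytes.max(1024).next_multiple_of(1024)
    }
}

fn main() {
    assert_eq!(sm_config_smem_size(0), 0);
    assert_eq!(sm_config_smem_size(256), 1024); // 64 uints still cost 1 KiB
    assert_eq!(sm_config_smem_size(1500), 2048);
}
```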
19:31mohamexiety[d]: hmm. I wonder if Blackwell allows for a smaller granularity? if the smallest amount is 1024B, why did the value change going from 64 to 128 elements? 64 uints is 64 * 4 = 256B and 128 is 512B, so both are still below the threshold
19:41gfxstrand[d]: Could be
19:46mohamexiety[d]: going with 512 elements gives us:
19:46mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
19:46mohamexiety[d]: .VALUE = 0x1b41810
19:46mohamexiety[d]: 1024:
19:46mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
19:46mohamexiety[d]: .VALUE = 0x1b41820
19:47mohamexiety[d]: so:
19:47mohamexiety[d]: 64 => 0x02 => 2
19:47mohamexiety[d]: 128 => 0x04 => 4 (2x as much)
19:47mohamexiety[d]: 512 => 0x10 => 16 (4x as much)
19:47mohamexiety[d]: 1024 => 0x20 => 32 (2x as much)
19:48mohamexiety[d]: kind of leaning towards the granularity being smaller than 1024B with blackwell given it does seem to track the size increase
19:49mohamexiety[d]: either that or glsl's `uint` arrays take more than 4 bytes per element. or the driver is using more shared memory
19:50gfxstrand[d]: Oh, try it with vec4 arrays
19:51gfxstrand[d]: They might be using std140 (everything is padded to a vec4)
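For context, the std140 rule being referred to rounds an array element's alignment (and hence its stride) up to that of a vec4. A minimal sketch of the general stride rule, as an illustration only, not NVK or codegen code:

```rust
/// General std140 array-stride rule: element alignment is rounded up
/// to vec4 alignment (16 B), and the stride is the element size
/// rounded up to that alignment.
fn std140_array_stride(elem_size: u32, elem_align: u32) -> u32 {
    let align = elem_align.max(16);
    elem_size.div_ceil(align) * align
}

fn main() {
    assert_eq!(std140_array_stride(4, 4), 16);   // uint[]: padded out to a vec4
    assert_eq!(std140_array_stride(16, 16), 16); // uvec4[]: already vec4-sized
}
```

So under std140 a `uint` array would take 16 B per element instead of 4, which is why the vec4 experiment below distinguishes the two layouts.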
19:53mohamexiety[d]: oh
20:29airlied[d]: mohamexiety[d]: should also see on pre-blackwell what the increases are for the same one, but sounds like you might be getting vec4
20:52gfxstrand[d]: Okay, I think I might be able to wire up Textures on Kepler A now
20:53gfxstrand[d]: Actually, I should do control flow first
21:27airlied: Lynne: those boottime logs need some fix that was made to the header files
21:28airlied: oops
21:28airlied: Lyude: ^
21:32airlied: 52fdb99cc436014a417750150928c8ff1f69ae66
21:35mohamexiety[d]: had to go for something but back now. tried with vec4. I am still dumping the rest but the 64 element vec4 is... interesting:
21:35mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
21:35mohamexiety[d]: .VALUE = 0x4b41808
21:36mohamexiety[d]: so the leftmost number changed, it's now 4 instead of 1. (given we went from scalars to vec4s I am guessing this is vector size), and the rightmost number is 0x08
21:47mohamexiety[d]: 128:
21:47mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
21:47mohamexiety[d]: .VALUE = 0x4b41810
21:47mohamexiety[d]: 512:
21:47mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
21:47mohamexiety[d]: .VALUE = 0x4b41840
21:47mohamexiety[d]: 1024:
21:47mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
21:47mohamexiety[d]: .VALUE = 0x2b42880
21:48mohamexiety[d]: so seems going to vec4 actually 4xs the lower byte :thonk:
21:48mohamexiety[d]: with 1024 the first number becomes '2' rather than '4' too
21:50mohamexiety[d]: I think I need to run on Ada still, but trying with vec4s only brought more questions :KEKW:
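For what it's worth, all eight dumped values above are consistent with the low byte being the shared memory size in 128 B units, with `uint` packed at 4 B and `uvec4` at 16 B per element. A quick check of that guess; this is just one reading of the dumps, not a confirmed Blackwell QMD field definition:

```rust
// Hypothesis: low byte of NVC7C0_CALL_MME_DATA(120) == smem bytes / 128.
// Element sizes assumed: uint = 4 B (scalar-packed), uvec4 = 16 B.
fn low_byte_guess(elem_bytes: u32, elems: u32) -> u32 {
    elem_bytes * elems / 128
}

fn main() {
    // uint arrays: 64 -> 0x02, 128 -> 0x04, 512 -> 0x10, 1024 -> 0x20
    for (n, lo) in [(64, 0x02), (128, 0x04), (512, 0x10), (1024, 0x20)] {
        assert_eq!(low_byte_guess(4, n), lo);
    }
    // uvec4 arrays: 64 -> 0x08, 128 -> 0x10, 512 -> 0x40, 1024 -> 0x80
    for (n, lo) in [(64, 0x08), (128, 0x10), (512, 0x40), (1024, 0x80)] {
        assert_eq!(low_byte_guess(16, n), lo);
    }
}
```

If that reading is right, the field granularity on Blackwell would be 128 B rather than the 1024 B of earlier generations.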
21:53mohamexiety[d]: one thing I am doing which could influence things is I am also increasing local size to match. so e.g. `shared uvec4[64]` has a local size of 64, 512 has 512, etc. but I don't think this'd influence things :thonk:
21:54mhenning[d]: Increasing shared mem has effects on max occupancy, so if there's an occupancy header that could be the other thing that's changing
21:55mhenning[d]: although local size can also affect it so that could be making things harder too
21:56mohamexiety[d]: older QMDs had a separate occupancy header yeah
21:57mhenning[d]: Well, we don't set the occupancy header in nvk at all
21:57mhenning[d]: we don't know if it does anything
21:57mohamexiety[d]: if they're merging occupancy with shared mem size in the same packet like how they did with local Z and register size that could be why we're seeing two different numbers change in the same packet :thonk:
21:57mohamexiety[d]: I didn't think about occupancy at all when doing this tbh so I probably shouldn't have changed local size. woops
22:00mhenning[d]: airlied[d]: I'm hitting
22:00mhenning[d]: thread '<unnamed>' panicked at ../src/nouveau/compiler/nak/sm80_instr_latencies.rs:1087:17:
22:00mhenning[d]: Illegal instuction in ureg category r14 = idp4.i8.i8 r21 ur3 r15
22:00mhenning[d]: Do you have the corresponding table entry or do we need to make something up?
22:01airlied[d]: seems to be fma and coupled
22:02mhenning[d]: Right, I guess that makes it vcoupled in the ureg path?
22:02airlied[d]: yes
22:05airlied[d]: gfxstrand[d]: reminder on 34513
22:06airlied[d]: mohamexiety[d]: the QMD is pretty well packed, so lots of things could be in the same 32-bit dword
22:06mohamexiety[d]: yeah
22:07mohamexiety[d]: I am reasonably confident the lowest byte is shared mem size at least, just not sure about the rest
22:07mohamexiety[d]: the uppermost stuff is definitely control, but not sure what it is specifically
22:08mohamexiety[d]: mohamexiety[d]: also not sure about granularity because I don't think 1024B is correct for blackwell given these size changes
22:10mohamexiety[d]: from the CUDA docs Blackwell doesn't really change the shared mem split though which would make this weird :thonk:
22:10mohamexiety[d]: also fwiw max size seems to be 48KiB:
22:10mohamexiety[d]: > maxComputeSharedMemorySize 49152