06:47yusisamerican: This call to reference the bufctx to the pushbuf is taking half as long as the entirety of buffer validation on gallium nouveau: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/drivers/nouveau/nvc0/nvc0_state_validate.c?ref_type=heads#L940. Maybe it's some weird cache-miss bug on my cpu... profiled 327.5k draw calls to 535.8k draw calls per second
06:47yusisamerican: bweh?
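For scale, the two throughput figures above can be converted into per-draw CPU time. This is a rough sketch that assumes 327.5k draws/s is the rate with the bufctx reference call and 535.8k draws/s is the rate without it; the message does not say which figure is which.

```python
def per_draw_us(draws_per_second: float) -> float:
    """Average CPU time spent per draw call, in microseconds."""
    return 1e6 / draws_per_second

# Assumption: the slower figure includes the call, the faster one omits it.
slow = per_draw_us(327_500)
fast = per_draw_us(535_800)
saved = slow - fast
fraction = saved / slow

print(f"{slow:.2f} us -> {fast:.2f} us per draw "
      f"({saved:.2f} us saved, {fraction:.0%} of the per-draw time)")
```

Under that assumption the call would account for a bit over a microsecond per draw, which is large for a single reference operation and fits the cache-miss suspicion.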
07:04ad__: Lyude: thanks, no hurry. Yesterday night i collapsed on the keyboard. The only thing i have seen is that, in nv50_backlight_init(), nvif_outp_bl_get(&nv_encoder->outp) returns -22
07:04ad__: https://cgit.freedesktop.org/drm/drm-misc/tree/drivers/gpu/drm/nouveau/nouveau_backlight.c#n242
07:06ad__: nv_encoder->outp shows a value (not null), but it's as if it is the wrong outp type, so the backlight cannot be retrieved
07:08fdobridge: <ahuillet> ad__: seeing my messages now?
07:09fdobridge: <ahuillet> it's strange, it seems I have to be on IRC for you to read me
07:09ahuillet: ad__ : seeing this now?
07:24ad__: ahuillet: did some debug yesterday, this is happening on my ADA, nvkm_udevice_info() card type 400
07:25ad__: ahuillet: now yes, i can read you
07:43fdobridge: <dadschoorse> now that fp16 for nak is merged, I guess someone could work on 16bit tex/image load/store
07:49yusisamerican: dadschoorse: or maybe someone could do hmma?
07:49yusisamerican: I tried it in codegen but that was when I was in middle school so it's probably complete garbage
07:51fdobridge: <marysaka> yusisamerican: I am working on cooperative matrix
07:52yusisamerican: marysaka: oooooh
07:53yusisamerican: For nvidia devices?
07:53fdobridge: <marysaka> yes
07:54yusisamerican: wondering how we will expose it through gallium's interface for tensors...maybe a modification to teflon?
07:55yusisamerican: To use nir instead of the gallium interface
07:55fdobridge: <marysaka> I'm only implementing VK_KHR_cooperative_matrix
07:58yusisamerican: oh! but are you still using the {b/w/i}mma thing that rtx gpus have to accelerate it? If so then maybe I would be interested in porting gallium to use nak...
08:03ahuillet: ad__ : so, your system ideally should go through the GSP thing I pointed to yesterday. Did you try to figure out why it was not?
08:03ahuillet: As far as I can tell, however, the blob within GSP does something similar to your MR, which is why you're getting results with your current patch, except it does it correctly and in a way guaranteed to work.
08:26fdobridge: <marysaka> yusisamerican: it uses IMMA/HMMA yes, BMMA doesn't map to anything with Vulkan so far I think
08:33ad__: ahuillet: thanks, ack, this evening will try to go deeper
08:34ahuillet: I'm very new to this, trying to figure out how you'd get into these functions in the first place vs. the backlight stuff you've been patching. :)
08:36ad__: ahuillet: so i bought a Lenovo Legion Pro 5, and backlight with nouveau was not working, screen black. I tried the acpi fallbacks (vendor, native, etc.), no luck, screen still black
08:36yusisamerican: marysaka: seems interesting, never knew about the vulkan extension...that could be a gateway to zink support for teflon...but that's for the future to decide
08:36ad__: ahuillet: so i installed the mainline kernel, in the hope there was some fix, but no
08:37ad__: ahuillet: finally i grepped in the nouveau code, and found nouveau_backlight_init() as the place where the backlight was initialized (nouveau_backlight.c)
08:37yusisamerican: ad__: Wait a minute, I had that device for a while before deciding to return it
08:38yusisamerican: I never had that issue (⊙_⊙)
08:39ad__: ahuillet: i realized that the switch/case in that function stopped at NV_DEVICE_INFO_V0_AMPERE, there was no NV_DEVICE_INFO_V0_ADA, and my device->info.family was NV_DEVICE_INFO_V0_ADA, so i worked to patch from that point
08:39ahuillet: r535_sor_bl_set does what I think you want. it's plugged into a nvkm_ior_func_bl structure, but I am not sure what path is taken to call these as opposed to what you're patching
08:40ad__: yusisamerican: this laptop can work in 2 ways, discrete (nvidia) and switching (amdgpu + nvidia), i am now in discrete (nvidia/nouveau driver only)
08:40yusisamerican: Oh, that explains it
08:41ad__: ahuillet: i have seen, i cannot bind to that function directly, also, that is only a bl_st, all the backlight init should be from r535 i believe
08:41ad__: *bl_set
08:43ad__: if those r535 methods are there, i will try to figure out how they should be called
08:44ahuillet: nvkm_uoutp_mthd_bl_set this is what calls them I think?
08:45ad__: actually, from my patch, the only thing unclear is that drm detects an 8-bit pwm (so 256 max_brightness) while max seems to be 4096 (12-bit). I cannot access eDP specifications (pdf) to check this.
08:45yusisamerican: Whatever backlight thing is used, it *really* should be exposed through drm_connector properties, there are plans to expose that to userspace
08:46fdobridge: <airlied> did you try acpi_backlight=native nvidia-wmi-ec-backlight.force=1 combo?
08:46yusisamerican: ad__: Uhhhh maybe that's because maxes of greater than 256 through /sys/class/backlight or whatever break userspace?
08:46ad__: yusisamerican: drm apis are what i have used in my patch, they work
08:47ad__: yusisamerican: well i hardcoded 4096, it works, so 256 is not a limit
08:47fdobridge: <airlied> apparently newer laptops with dynamic mux gpus are quite broken for backlight with Linux and nvidia is meant to be looking into it
08:47yusisamerican: ad__: You misunderstand, the nvkm_uoutp_mthd_bl_set thing should be exposed through drm, it doesn't seem to be currently
08:48ad__: mmm, i am totally a newbie here. What is nvkm ? I see it ends up in some system calls (ioctl)
08:49yusisamerican: nvidia modesetting folder
08:49yusisamerican: I think
08:51ad__: ah no sorry, confused nvkm with "nvif" that seems to be an ioctl based interface
08:54ad__: well, this evening will dig further into ahuillet r535 way and nvkm_uoutp_mthd_bl_set
08:54ad__: now unfortunately had to complete some main job tasks
08:54ad__: thanks for the support
08:54yusisamerican: Oh
08:55yusisamerican: I'm completely wrong, it is exposed through drm_connector; they were using nvif_outp_bl_set and I looked in the wrong place
08:55ahuillet: yusisamerican : can you share a file:line?
08:56ahuillet: (was struggling to find the same)
08:56yusisamerican: nouveau_backlight
08:56yusisamerican: .c
08:56yusisamerican: I deleted emacs one sec, I'll give you the line
08:56ahuillet: in nv50_set_intensity() ?
08:56yusisamerican: ye
08:57ahuillet: I don't think this actually calls into the r535 stuff though
08:57yusisamerican: It should, no?
08:57ad__: looks not going that direction, it binds to proper set/get
08:57yusisamerican: ret = nvif_object_mthd(&outp->object, NVIF_OUTP_V0_BL_GET, &args, sizeof(args));
08:58ahuillet: mmh actually yes it goes to nvkm_uoutp_mthd_bl_set
08:58ahuillet: and I think that should end up in the r535 stuff. ad__ : it's worth tracing where that is falling apart
08:58yusisamerican: ahuillet: Don't you need serial for that? :p
08:59ahuillet: ?
08:59ad__: if (nvif_outp_bl_get(&nv_encoder->outp) < 0 || fails with -22
08:59ahuillet: yusisamerican : serial what?
08:59yusisamerican: oh you can use printk, I'm sorry for being stupid
08:59ad__: this is why i implemented the patch without calling that
08:59yusisamerican: ad__: ENODEV I think
08:59ad__: right
08:59ahuillet: now we're talking
09:01ad__: so i suspect find_encoder()
09:01ad__: https://elixir.bootlin.com/linux/v6.9-rc3/source/drivers/gpu/drm/nouveau/nouveau_backlight.c#L306
09:01ad__: to be finding an encoder that is not the right one
09:01ahuillet: 22 is EINVAL?
09:01yusisamerican: Oops.....
09:01ad__: it's EINVAL, sry
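The back-and-forth over what -22 means can be settled without memorising the table. A minimal sketch using Python's standard `errno` module; the helper name `describe_errno` is made up for illustration, and the numeric values are those of Linux.

```python
import errno
import os

def describe_errno(ret: int) -> str:
    """Map a negative kernel-style return value to its symbolic errno name."""
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{ret}: {name} ({os.strerror(code)})"

print(describe_errno(-22))  # the value returned by nvif_outp_bl_get() here
print(describe_errno(-19))  # ENODEV, the earlier guess
```

On a Linux machine, `errno -l` from moreutils or `include/uapi/asm-generic/errno-base.h` in the kernel tree gives the same mapping.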
09:03ahuillet: who's actually failing though? would be good to printk till the failure point
09:04yusisamerican: Are there any (drm_encoder)s connecting?
09:05ad__: failure point is that nv50_backlight_init returns 0. so nouveau_backlight_init() attempts a fall back to acpi, that is not working anyway, so no backlight. There is no fault/oops.
09:06ad__: i tried to check what encoder types are processed from find_encoder(), there is a huge amount of type 6 and some 2
09:06ahuillet: that's not the failure point, it's the top level failure point
09:06ahuillet: I meant to printk the hell out of this until you find the deepest point where things start to fail
09:07ad__: well, dmesg shows a lot of BL_GET: -22
09:08yusisamerican: So nouveau is trying to run nvif_outp_bl_get? and nv50_get_intensity? Didn't you say that it (initialization) failed?
09:09ad__: all i know is that the initial point of failure is nvif_outp_bl_get(&nv_encoder->outp) returning -22
09:09ad__: but before that, also, there is another point.
09:10ad__: why does the switch/case stop at case NV_DEVICE_INFO_V0_AMPERE: ?
09:10ad__: i had to add case NV_DEVICE_INFO_V0_ADA:
09:10yusisamerican: because someone forgot to put it there
09:10yusisamerican: or because no one tested it
09:10ahuillet: most likely it didn't even exist then?
09:10ad__: or because it was considered better to fall back to acpi
09:11ad__: so if it never existed, good, i am patching correctly then
09:11ahuillet: look at the git history, case NV_DEVICE_INFO_V0_AMPERE: //XXX: not confirmed
09:11ahuillet:
09:11ahuillet: even GA10x was barely tested then and it was one without display
09:11ahuillet: so, wait, since when do we actually have GSP stuff, Turing right?
09:11yusisamerican: Since 6.7
09:11ahuillet: I mean the NV HW generation
09:11yusisamerican: covid
09:12ahuillet: (sorry, we = NVIDIA... I'll need to get my first person pronouns sorted out)
09:12yusisamerican: 2019
09:12yusisamerican: I think
09:12yusisamerican: September 20, 2018
09:12ahuillet: anyway I'm just guessing at this point and making more noise than I should, but it doesn't seem obvious to me that the GSP stuff for setting backlight is actually plumbed all the way
09:14ad__: well, ok, this evening i will dig further into that. Right now i have no more precise findings, except that the patch i sent works great, but is likely not the right way to go.
09:14ahuillet: ad__ : I hope airlied or Lyude can advise
09:15ad__: yes, will check for their replies :)
09:15yusisamerican: airlied's reply is that dynamic mux gpus are broken and nvidia is looking into it
09:16yusisamerican: Is dri-prime subprime for your needs?
09:16ahuillet: do you have a link to the reply? I'd want to dig up internal bug numbers/employee names
09:16yusisamerican: ahuillet: Today's chat log
09:20ahuillet: oh, yeah, I wouldn't take it to imply that ad__'s problem can't be fixed, since he's reporting that the blob works fine
09:23yusisamerican: ahuillet: Is there anything similar in the blob commit log to this regarding ampere/ada?
09:26ahuillet: I'd need to know what I'm looking for ;) git-log --grep "Ampere" would be a bit unwieldy
09:26yusisamerican: Something in nvidia-drm-encoder.c???
09:26yusisamerican: since he said find_encoder was returning EINVAL?!?!?!? bit of a shot in the dark though
09:28ad__: the gpu i have is ADA, but Mobile MaxQ variant, so AD107 but GN21-4 variant
09:28ad__: *GN21-X4
09:28ad__: i think it is different from other previous ADA
09:29yusisamerican: oooh
09:30yusisamerican: not related to your thing but the blob seems to be using the percentage mode in its backlight_properties
09:31ahuillet: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia-modeset/src/nvkms.c#L6682 ?
09:32yusisamerican: yeah
09:32yusisamerican: looks neat
09:32yusisamerican: Ah! maybe we can use this instead of your magic max_brightness
09:35ad__: yusisamerican: let me try it
09:36ahuillet: so actually... there's some Ada specific stuff in the blob
09:36yusisamerican: ad__: We still need to figure out what's wrong with the drm_encoder thing before venturing into making everything device gucci
09:36ad__: ok
09:37yusisamerican: ahuillet: Are you allowed to go on according to your obligations?
09:37ad__: where did you read airlied replies ?
09:38ahuillet: allowed, maybe, able, definitely not, I'm a UMD driver engineer who has never touched backlight before :)
09:39ahuillet: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/common/inc/displayport/dpcd.h#L1105
09:39yusisamerican: ad__: fdobridge is the bridge between discord and irc, I don't think you can read their messages for some reason though...
09:40ahuillet: as I mentioned already, the blob writes to these registers which seems to kinda match what the DRM stuff is doing
09:40ahuillet: but, "kinda" is one thing, and that's why you need to use GSP for setting brightness stuff, because there's some specific logic to decide what to use and how
09:41ahuillet: not much more to add really, you don't want to poke these registers directly even if it can be gotten to kinda work because there's no deeply complex magic in there
09:42ahuillet: (the Ada+ stuff btw seems to be the use of PWM vs. AUX, but that seems like an implementation detail)
09:43ad__: what is AUX ?
09:44ahuillet: your guess is as good as mine, I think it's display port auxiliary channel used to convey information? something like that?
09:44ahuillet: not relevant to your problem anyway.
09:46ad__: Well, i debugged into drm_edp_backlight_probe_max() and as PWM width it gives 8 bits (max brightness 255), but it seems i need 4096 as a max (12-bit).
09:47ad__: So maybe that function is not managing properly the pwm for this chip
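The relationship between PWM bit width and maximum brightness discussed here is simple arithmetic. A small sketch, not taken from the kernel; `pwm_max_level` is a hypothetical helper name. Note that a 12-bit field tops out at 4095, so a hardcoded 4096 is one past the largest representable level.

```python
def pwm_max_level(bit_count: int) -> int:
    """Largest value representable in an unsigned field of bit_count bits."""
    if bit_count <= 0:
        raise ValueError("bit_count must be positive")
    return (1 << bit_count) - 1

for bits in (8, 12):
    top = pwm_max_level(bits)
    print(f"{bits}-bit PWM: levels 0..{top} ({top + 1} distinct steps)")
```

If the panel really uses 12 bits while the probe reports 8, every brightness value written would be interpreted on a scale 16 times larger than intended, which would make the screen look nearly black at any setting.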
09:48ahuillet: that may be true, but I believe that approach to be a dead end anyway (maybe the Nouveau maintainers will feel otherwise, but I would be surprised)
09:48ad__: This function is for eDP, btw. Would be great if you know a way to download a DPCD spec for eDP (that is different from DP)
09:48ahuillet: I'm reasonably certain it's not public
09:49yusisamerican: I think the main thing you should be worrying about first is seeing why the 800 calls to drm_encoder_init in nouveau aren't giving you your drm_encoders. Your max brightness woes can be solved by using percentage mode and modding the kernel a bit probably.
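The percentage-mode idea amounts to exposing 0..100 to userspace and scaling to whatever raw range the panel needs, so no magic max_brightness leaks out. A minimal sketch of that mapping; `percent_to_raw` and its parameters are illustrative names, not anything from nouveau or the NVIDIA modules.

```python
def percent_to_raw(percent: float, max_raw: int) -> int:
    """Clamp percent to 0..100 and scale it onto the raw range 0..max_raw."""
    clamped = max(0.0, min(100.0, percent))
    return round(clamped * max_raw / 100.0)

# The same user-facing value works for either panel depth.
for max_raw in (255, 4095):
    print(max_raw, [percent_to_raw(p, max_raw) for p in (0, 25, 100)])
```

The driver still has to know the real raw maximum, so this hides the 8-bit versus 12-bit question from userspace without answering it.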
09:49ahuillet: ad__ : but my link above has fairly explicit define names, so you can infer a lot from that
09:50ad__: ok. Well, i need to study on this, only studying deeper the layers and nouveau will help
09:50ad__: tonight next stage :;)
09:50ahuillet: it might be a case of Nouveau not finding our output (encoder? not sure how it's called) and therefore not exposing the interface you need
09:51yusisamerican: ahuillet: I think it's DRM not finding the output, if find_encoder is failing in nouveau
09:51ad__: ahuillet: that looks like, from the -22
09:51ahuillet: yusisamerican : ENOPARSE
09:52yusisamerican: ahuillet: are you throwing that exception? See: https://elixir.bootlin.com/linux/v6.9-rc3/source/drivers/gpu/drm/nouveau/nouveau_connector.c#L378, and what's called by it and returning -22, https://elixir.bootlin.com/linux/v6.9-rc3/source/drivers/gpu/drm/nouveau/nouveau_backlight.c#L306
09:55ahuillet: I don't want to speculate, because I feel that Lyude is going to take a look and solve the problem in 20 seconds while I'll have spent hours making wild hypotheses that amount to nothing :)
09:56yusisamerican: probably...
09:56ad__: same for me :) i lack the knowledge
09:56ad__: but it's cool to dig inside this driver, very complex
09:56ad__: i generally work on embedded stuff, quite a bit simpler
09:59ahuillet: GPUs... the more you learn, the more you learn about things you don't know
09:59ahuillet: and since the tech evolves pretty fast, you can't ever catch up
09:59yusisamerican: ahuillet: That's the same for everything, not just tech or gpus
09:59ahuillet: true, that applies to life in general certainly
09:59ad__: but closed blobs helps :)
10:01ad__: i mean, if you cannot see datasheets/specs, some open code helps to understand
10:02yusisamerican: ad__: Sorry if this comes across as yapping but, most of the hw init, modesetting, command submission and 3d initialization code is simple, it's just that the drivers tend to be highly generalized and big
10:04ad__: yusisamerican: aha ok. Seems huge amount of stuff, but may be generally simple, sure.
10:04ad__: i work on drivers that are generally 1 single .c
10:05airlied: did you try acpi_backlight=native nvidia-wmi-ec-backlight.force=1 combo?
10:05airlied: is what i said
10:05airlied: our rh backlight dev said it was all a big mess
10:06ahuillet: https://lore.kernel.org/all/20230217144208.5721-1-hdegoede@redhat.com/ I assume?
10:07ad__: airlied: can try it, just give few minutes. Btw, i am not sure nvidia-wmi-ec driver applies to this gpu, since i don't see it loaded
10:08airlied: part of the story,just had an internal email about it
10:09airlied: ahuillet: daniel dapas on nvidia side knows more apparently
10:09ahuillet: cool. is it generally preferred to use ACPI or the native driver?
10:09ahuillet: thanks, I'll ping him if needed (same team)
10:10yusisamerican: It seems Hans wants some sort of drm interface for backlight long term
10:11airlied: i think there is now acpi, ec and native
10:20ad__: airlied: https://pastecode.dev/s/iixp8h2z
10:21ahuillet: ad__ : and, does it work? :)
10:21ad__: no
10:21ad__: black screen as usual
10:21ad__: i have built nvidia-wmi-ec-backlight but it does not get loaded
10:23ahuillet: there's some ACPI firmware errors in your dmesg, is your bios up to date? highly unlikely to make a difference but it might not hurt...
10:24ad__: ahuillet: yes, just updated few days ago
10:24ad__: was the first thing i did
10:37airlied: can you ssh in and play with sysfs files?
12:14ad__: airlied: yes
12:15ad__: sry was out for lunch, just let me know what to do, sure
12:15fdobridge: <ahuillet> I'd assume he means for you to play with /sys/class/backlight stuff?
12:16ad__: mm, i see now that cat /sys/class/backlight/ is empty
12:17ad__: typically happening here with acpi_backlight=native
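Checking what the kernel has registered under /sys/class/backlight can be scripted over ssh. A small sketch that relies only on the standard sysfs layout, where each device directory has `brightness` and `max_brightness` files; the function name `list_backlights` is made up for illustration.

```python
from pathlib import Path

def list_backlights(root: str = "/sys/class/backlight") -> dict:
    """Return {device_name: (brightness, max_brightness)} for each device."""
    result = {}
    base = Path(root)
    if not base.is_dir():
        return result
    for dev in sorted(base.iterdir()):
        try:
            current = int((dev / "brightness").read_text().strip())
            maximum = int((dev / "max_brightness").read_text().strip())
        except (OSError, ValueError):
            continue  # skip entries that are unreadable or malformed
        result[dev.name] = (current, maximum)
    return result

if __name__ == "__main__":
    devices = list_backlights()
    if not devices:
        print("no backlight devices registered")
    for name, (current, maximum) in devices.items():
        print(f"{name}: {current}/{maximum}")
```

An empty result matches the situation described here: with acpi_backlight=native and no native driver registering a device, there is nothing for userspace to write to.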
12:50ad__: i suspect some Lenovo issue, or bios issue
12:51ad__: this PC has 2 video cards, i am wondering how the control of the display through a single eDP connector is shared
12:52ad__: right now, gpu-controlled backlight with my patch works, while acpi does not, whatever i set: vendor, native or video
17:56fdobridge: <gfxstrand> Ugh... The OpenGL CTS ran for 22 hours and then hit a VK_ERROR_DEVICE_LOST on a test it's already passed 20 times in that run. 🤦🏻♀️
18:00fdobridge: <gfxstrand> It was a spurious timeout so I suspect it might be that IRQ issue that @airlied was hunting.
18:09fdobridge: <zmike.> gotta love the "try 500 different fb configs" requirement for conformance
18:16fdobridge: <gfxstrand> Yeah
18:17fdobridge: <gfxstrand> At least with the Vulkan CTS bloat, the different combinations might matter. No, running the fragment shader fp64 tests per-fb-config does not actually test anything interesting. 🤦🏻♀️
18:18fdobridge: <gfxstrand> If your implementation of fp64 add changes based on whether or not there's a depth buffer, something has gone horribly wrong.
18:22fdobridge: <!DodoNVK (she) 🇱🇹> Fixing the cause of it will help with driver stability even further
18:22fdobridge: <gfxstrand> Yeah, but something that takes 22h to reproduce isn't easy to fix. 😢
18:22fdobridge: <ahuillet> what sort of error dump do you get along with the device lost?
18:25fdobridge: <gfxstrand> ```
18:25fdobridge: <gfxstrand> [147692.776876] nouveau 0000:17:00.0: cts-runner[145054]: job timeout, channel 64 killed!
18:25fdobridge: <gfxstrand> [147692.777387] nouveau 0000:17:00.0: cts-runner[145054]: error fencing pushbuf: -19
18:25fdobridge: <gfxstrand> ```
18:25fdobridge: <gfxstrand> dmesg is otherwise clean
18:25fdobridge: <ahuillet> awesome
18:27fdobridge: <gfxstrand> @airlied was chasing an issue at one point where IRQs seemed to just get lost sometimes. I think he fixed some of it with some of the locking and waitlist stuff but I suspect we still have a deep race somewhere.
18:28fdobridge: <gfxstrand> I've seen other spurious timeouts
18:33fdobridge: <gfxstrand> Yeah, that's drm_scheduler kicking us
18:43fdobridge: <magic_rb.> This has probably been fixed, I'm running 6.8 and some random mesa commit. But when switching workspaces, map/unmap arma3 consistently crashes. Think it's related to VK_ERROR_DEVICE_LOST since minecraft did that too with Zink but it was `ZINK_DEVICE_LOST` or smth. I can dig further if this rings no bells, but I can't bump kernels because of ZFS
18:43Lyude: airlied: looking at the memory fragmentation issue with nouveau I managed to hit yesterday during runtime suspend - you mentioned doing a vmalloc allocation instead of a level 7 coherent allocation. Seeing as that change would be in nvkm_gsp_mem_ctor() - would we want to just change that function or figure out how to only do vmalloc() with the allocation that was failing?
18:44Lyude: FWIW: the allocation failure was in drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c in r535_gsp_fini()
18:48Lyude: (if you need more context to remember what I'm talking about let me know, I've still got the log sitting around)
18:52airlied: Lyude: i think we would vmalloc always, but the code to make the radix3 stuff might need adapting
18:52Lyude: also i'm curious, I know what a radix trie is but what is its significance here?
18:53Lyude: I assume it's some sort of bookkeeping for mappings?
19:10fdobridge: <airlied> It's just a page table format NVIDIA uses, radix3 is its name
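To make "radix3" a little more concrete: as far as can be told from nouveau's r535 code, it is a three-level table in which each entry is a 64-bit address of a page in the next level, with the leaf level pointing at the pages of the buffer itself. The sketch below only estimates how much table memory each level needs; the 4 KiB page size, the 8-byte entries and the function names are assumptions for illustration, not a copy of the driver.

```python
GSP_PAGE_SIZE = 4096  # assumption: 4 KiB pages
ENTRY_SIZE = 8        # assumption: one 64-bit address per table entry

def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def radix3_level_sizes(payload_bytes: int) -> list:
    """Bytes of table needed at levels 0, 1 and 2 to map payload_bytes."""
    sizes = []
    size = align_up(payload_bytes, GSP_PAGE_SIZE)
    for _ in range(3):  # build from the leaf level (2) up to the root (0)
        entries = size // GSP_PAGE_SIZE
        size = align_up(entries * ENTRY_SIZE, GSP_PAGE_SIZE)
        sizes.append(size)
    return sizes[::-1]

print(radix3_level_sizes(64 * 1024 * 1024))
```

Because the leaf level lists every page individually, the buffer being described presumably does not have to be physically contiguous, which would be why a vmalloc-backed allocation is workable once the table-building code is adapted.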
19:12fdobridge: <!DodoNVK (she) 🇱🇹> And speaking of page tables Vita3K seems to fail on NVK when using the page table for memory mapping (there's a segfault somewhere in the DRI render device map according to /proc/maps)
19:19Lyude: gotcha
19:56airlied: Lyude: got link to complete backtrace again?
19:57Lyude: airlied: sure thing, gimme a sec. BTW - just to make sure I understand this code correctly
19:59Lyude: r535_gsp_fini allocates a section of coherent memory on the host's side of things and maps it to make it accessible to the GPU, this memory is used to communicate to the GPU where to migrate all of its vram to in preparation for shutting down the GSP firmware and vram - and we then kick off the migration with r535_gsp_rpc_unloading_guest_driver() right?
19:59Lyude: also airlied https://paste.centos.org/view/f5fe9fa8 full backtrace