All applications that use OpenCL through ROCm (Radeon) fail, with errors ranging from application crashes to a complete freeze of the machine.
Examples:
With Blender:
nov. 06 08:50:29 zeus5.cc.local kernel: BUG: kernel NULL pointer dereference, address: 00000000000007a0
nov. 06 08:50:29 zeus5.cc.local kernel: #PF: supervisor write access in kernel mode
nov. 06 08:50:29 zeus5.cc.local kernel: #PF: error_code(0x0002) - not-present page
nov. 06 08:50:29 zeus5.cc.local kernel: PGD 215148067 P4D 215148067 PUD 215147067 PMD 0
nov. 06 08:50:29 zeus5.cc.local kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
nov. 06 08:50:29 zeus5.cc.local kernel: CPU: 2 PID: 7498 Comm: blender Not tainted 6.5.10-300.fc39.x86_64 #1
nov. 06 08:50:29 zeus5.cc.local kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B78/X470 GAMING PRO CARBON (MS-7B78), BIOS 2.I0 07/27/2022
nov. 06 08:50:29 zeus5.cc.local kernel: RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x30 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: Code: 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff ff 00 00 48 21 c1 8d 04 d5 00 00 00 00 4c 09 c1 48 01 c6 <48> 89 0e 31 c0 e9 73 b3 90 da 0f 1f 00 90 90 90 90 90 90 90 90 90
nov. 06 08:50:29 zeus5.cc.local kernel: RSP: 0018:ffffb405074a7940 EFLAGS: 00010206
nov. 06 08:50:29 zeus5.cc.local kernel: RAX: 0000000000000000 RBX: 00000001c7200000 RCX: 00400001c70005f1
nov. 06 08:50:29 zeus5.cc.local kernel: RDX: 0000000000000000 RSI: 00000000000007a0 RDI: ffff8ef719c00000
nov. 06 08:50:29 zeus5.cc.local kernel: RBP: ffffb405074a7aa8 R08: 00400000000005f1 R09: 0000000000200000
nov. 06 08:50:29 zeus5.cc.local kernel: R10: 00400000000005f1 R11: 0000000000000009 R12: 0000000000200000
nov. 06 08:50:29 zeus5.cc.local kernel: R13: 0000000000000004 R14: 00000000000007a0 R15: 0000000000000001
nov. 06 08:50:29 zeus5.cc.local kernel: FS: 00007f41e709a580(0000) GS:ffff8f05fea80000(0000) knlGS:0000000000000000
nov. 06 08:50:29 zeus5.cc.local kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
nov. 06 08:50:29 zeus5.cc.local kernel: CR2: 00000000000007a0 CR3: 00000002a0a64000 CR4: 0000000000750ee0
nov. 06 08:50:29 zeus5.cc.local kernel: PKRU: 55555554
nov. 06 08:50:29 zeus5.cc.local kernel: Call Trace:
nov. 06 08:50:29 zeus5.cc.local kernel: <TASK>
nov. 06 08:50:29 zeus5.cc.local kernel: ? __die+0x23/0x70
nov. 06 08:50:29 zeus5.cc.local kernel: ? page_fault_oops+0x171/0x4e0
nov. 06 08:50:29 zeus5.cc.local kernel: ? srso_alias_return_thunk+0x5/0x7f
nov. 06 08:50:29 zeus5.cc.local kernel: ? exc_page_fault+0x7f/0x180
nov. 06 08:50:29 zeus5.cc.local kernel: ? asm_exc_page_fault+0x26/0x30
nov. 06 08:50:29 zeus5.cc.local kernel: ? amdgpu_gmc_set_pte_pde+0x23/0x30 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: amdgpu_vm_cpu_update+0x92/0x110 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: amdgpu_vm_ptes_update+0x32c/0x930 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: amdgpu_vm_update_range+0x241/0x740 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: amdgpu_vm_bo_update+0x305/0x570 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: amdgpu_gem_va_ioctl+0x54f/0x590 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: drm_ioctl_kernel+0xcd/0x170
nov. 06 08:50:29 zeus5.cc.local kernel: drm_ioctl+0x26d/0x4b0
nov. 06 08:50:29 zeus5.cc.local kernel: ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: __x64_sys_ioctl+0x97/0xd0
nov. 06 08:50:29 zeus5.cc.local kernel: do_syscall_64+0x60/0x90
nov. 06 08:50:29 zeus5.cc.local kernel: ? srso_alias_return_thunk+0x5/0x7f
nov. 06 08:50:29 zeus5.cc.local kernel: ? __count_memcg_events+0x42/0x90
nov. 06 08:50:29 zeus5.cc.local kernel: ? srso_alias_return_thunk+0x5/0x7f
nov. 06 08:50:29 zeus5.cc.local kernel: ? count_memcg_events.constprop.0+0x1a/0x30
nov. 06 08:50:29 zeus5.cc.local kernel: ? srso_alias_return_thunk+0x5/0x7f
nov. 06 08:50:29 zeus5.cc.local kernel: ? handle_mm_fault+0x9e/0x350
nov. 06 08:50:29 zeus5.cc.local kernel: ? srso_alias_return_thunk+0x5/0x7f
nov. 06 08:50:29 zeus5.cc.local kernel: ? do_user_addr_fault+0x179/0x640
nov. 06 08:50:29 zeus5.cc.local kernel: ? srso_alias_return_thunk+0x5/0x7f
nov. 06 08:50:29 zeus5.cc.local kernel: ? exc_page_fault+0x7f/0x180
nov. 06 08:50:29 zeus5.cc.local kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
nov. 06 08:50:29 zeus5.cc.local kernel: RIP: 0033:0x7f41e6d2f13d
nov. 06 08:50:29 zeus5.cc.local kernel: Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
nov. 06 08:50:29 zeus5.cc.local kernel: RSP: 002b:00007ffd67cd8a50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
nov. 06 08:50:29 zeus5.cc.local kernel: RAX: ffffffffffffffda RBX: 00007f418043b820 RCX: 00007f41e6d2f13d
nov. 06 08:50:29 zeus5.cc.local kernel: RDX: 00007ffd67cd8af0 RSI: 00000000c0286448 RDI: 000000000000000b
nov. 06 08:50:29 zeus5.cc.local kernel: RBP: 00007ffd67cd8aa0 R08: ffff80011e800000 R09: 000000000000000e
nov. 06 08:50:29 zeus5.cc.local kernel: R10: 000000000000003c R11: 0000000000000246 R12: 00007ffd67cd8af0
nov. 06 08:50:29 zeus5.cc.local kernel: R13: 00000000c0286448 R14: 000000000000000b R15: 00007f41da478c00
nov. 06 08:50:29 zeus5.cc.local kernel: </TASK>
nov. 06 08:50:29 zeus5.cc.local kernel: Modules linked in: uinput snd_seq_dummy snd_hrtimer rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache netfs cfg80211 nft_masq team_mode_roundrobin team nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink qrtr binfmt_misc dm_crypt snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio snd_hda_codec intel_rapl_msr snd_usbmidi_lib intel_rapl_common snd_ump snd_hda_core edac_mce_amd snd_rawmidi snd_hwdep mc kvm_amd snd_seq snd_seq_device kvm snd_pcm irqbypass snd_timer rapl snd wmi_bmof mxm_wmi pcspkr soundcore vfat i2c_piix4 k10temp fat joydev gpio_amdpt gpio_generic auth_rpcgss sunrpc loop zram amdgpu hid_logitech_hidpp drm_ttm_helper ttm video drm_suballoc_helper amdxcp iommu_v2 drm_buddy crct10dif_pclmul crc32_pclmul gpu_sched crc32c_intel polyval_clmulni
nov. 06 08:50:29 zeus5.cc.local kernel: polyval_generic igb drm_display_helper nvme ghash_clmulni_intel dca ccp sha512_ssse3 cec r8169 nvme_core sp5100_tco i2c_algo_bit nvme_common wmi hid_logitech_dj scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath nct6775 nct6775_core hwmon_vid fuse
nov. 06 08:50:29 zeus5.cc.local kernel: CR2: 00000000000007a0
nov. 06 08:50:29 zeus5.cc.local kernel: ---[ end trace 0000000000000000 ]---
nov. 06 08:50:29 zeus5.cc.local kernel: RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x30 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: Code: 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff ff 00 00 48 21 c1 8d 04 d5 00 00 00 00 4c 09 c1 48 01 c6 <48> 89 0e 31 c0 e9 73 b3 90 da 0f 1f 00 90 90 90 90 90 90 90 90 90
nov. 06 08:50:29 zeus5.cc.local kernel: RSP: 0018:ffffb405074a7940 EFLAGS: 00010206
nov. 06 08:50:29 zeus5.cc.local kernel: RAX: 0000000000000000 RBX: 00000001c7200000 RCX: 00400001c70005f1
nov. 06 08:50:29 zeus5.cc.local kernel: RDX: 0000000000000000 RSI: 00000000000007a0 RDI: ffff8ef719c00000
nov. 06 08:50:29 zeus5.cc.local kernel: RBP: ffffb405074a7aa8 R08: 00400000000005f1 R09: 0000000000200000
nov. 06 08:50:29 zeus5.cc.local kernel: R10: 00400000000005f1 R11: 0000000000000009 R12: 0000000000200000
nov. 06 08:50:29 zeus5.cc.local kernel: R13: 0000000000000004 R14: 00000000000007a0 R15: 0000000000000001
nov. 06 08:50:29 zeus5.cc.local kernel: FS: 00007f41e709a580(0000) GS:ffff8f05fea80000(0000) knlGS:0000000000000000
nov. 06 08:50:29 zeus5.cc.local kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
nov. 06 08:50:29 zeus5.cc.local kernel: CR2: 00000000000007a0 CR3: 00000002a0a64000 CR4: 0000000000750ee0
nov. 06 08:50:29 zeus5.cc.local kernel: PKRU: 55555554
nov. 06 08:50:29 zeus5.cc.local kernel: note: blender[7498] exited with irqs disabled
On top of that, keyboard input stops responding, and even launching graphical applications no longer works, among other symptoms.
clpeak freezes the machine completely if the command is run twice in a row.
There is no problem on Fedora 38 with ROCm version 5.5.xx.
It remains to be determined whether this comes from ROCm or from another component.
Bugs already reported in other situations:
https://bugzilla.redhat.com/buglist.cgi?quicksearch=rocm&list_id=13362416
https://bugzilla.redhat.com/buglist.cgi?quicksearch=opencl&list_id=13362418
It remains to be seen whether a new report should be filed.
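
If a new report does get filed, the fields that usually matter in the oops above (faulting symbol, module, process, kernel version) can be pulled out of the journal text with a short script. This is only a minimal Python sketch, assuming the log lines keep the format shown above; the function name `summarize_oops` and the regular expressions are illustrative and not part of any existing tool.

```python
import re

# Illustrative patterns (assumptions based on the log format shown above).
BUG_RE = re.compile(r"BUG: (?P<bug>.+)$")
COMM_RE = re.compile(
    r"PID: (?P<pid>\d+) Comm: (?P<comm>\S+) .*?(?P<kernel>\d+\.\d+\.\d+\S*)"
)
# Kernel-mode RIP only (code segment 0010); the module name is optional.
RIP_RE = re.compile(
    r"RIP: 0010:(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def summarize_oops(lines):
    """Return the first bug text, process, kernel and faulting symbol found."""
    summary = {}
    for line in lines:
        for pattern in (BUG_RE, COMM_RE, RIP_RE):
            match = pattern.search(line)
            if match:
                # Keep the first occurrence of each field; the oops repeats some.
                for key, value in match.groupdict().items():
                    summary.setdefault(key, value)
    return summary


# Three lines copied from the log above, used as sample input.
SAMPLE = [
    "nov. 06 08:50:29 zeus5.cc.local kernel: BUG: kernel NULL pointer dereference, address: 00000000000007a0",
    "nov. 06 08:50:29 zeus5.cc.local kernel: CPU: 2 PID: 7498 Comm: blender Not tainted 6.5.10-300.fc39.x86_64 #1",
    "nov. 06 08:50:29 zeus5.cc.local kernel: RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x30 [amdgpu]",
]

if __name__ == "__main__":
    for key, value in summarize_oops(SAMPLE).items():
        print(f"{key}: {value}")
```

Fed the full journal output instead of `SAMPLE`, the same function would report the first oops it encounters, which is usually the one worth quoting in a report title.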