Initial 2-hour technical/background call with the hiring manager; if you pass, two more phone calls with members of the team; if you pass those, a take-home assignment to be presented to a panel. The final round used to be on site but is now virtual. You don't need to know CUDA programming going in (though experience with OpenMP, CUDA, etc. will give you an edge), but it seems like they want someone who is familiar with optimizing code using parallel programming and has a decent grasp of GPU architecture and how GPUs work. Basic knowledge of latency and bandwidth was assumed. Not a silly LeetCode-style interview... you really have to understand software and hardware interactions and how to optimize common ML/DL computations using parallel design. Plenty of follow-up questions will be asked to test your understanding.