[2503.23463] OpenDriveVLA: Towards End-to-end Autonomous Driving with ... To bridge the modality gap between driving visual representations and language embeddings, we introduce a hierarchical vision-language alignment process, projecting both 2D and 3D structured visual tokens into a unified semantic space.
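
As a rough illustration of what "projecting both 2D and 3D structured visual tokens into a unified semantic space" can look like, the sketch below maps two token sets with different feature widths into one shared width using separate projection matrices. It is not the paper's implementation: the linear projectors, the random weights standing in for learned ones, the dimensions (`D_2D`, `D_3D`, `D_SHARED`), and the helper names `make_projection` and `project` are all assumptions chosen for illustration, and plain Python lists are used only to keep the example dependency-free.

```python
import random


def make_projection(in_dim, out_dim, rng):
    """Return an (in_dim x out_dim) matrix of small random weights.

    Assumption: stands in for a learned projector; values are not trained.
    """
    return [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]


def project(tokens, weight):
    """Map each token of length in_dim to length out_dim (tokens @ weight)."""
    out_dim = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(out_dim)]
        for tok in tokens
    ]


rng = random.Random(0)

# Hypothetical feature widths for each modality and for the shared space.
D_2D, D_3D, D_SHARED = 8, 6, 4

# Placeholder tokens: 5 tokens from a 2D encoder, 3 from a 3D encoder.
tokens_2d = [[rng.random() for _ in range(D_2D)] for _ in range(5)]
tokens_3d = [[rng.random() for _ in range(D_3D)] for _ in range(3)]

# One projector per token type, both targeting the same shared width.
w_2d = make_projection(D_2D, D_SHARED, rng)
w_3d = make_projection(D_3D, D_SHARED, rng)

# After projection, both token sets have the same width and can be
# concatenated into a single sequence.
unified = project(tokens_2d, w_2d) + project(tokens_3d, w_3d)
```

Using one projector per token type lets each modality keep its own input width while the outputs share a width, which is what allows them to be placed in a single sequence alongside language embeddings of that same width.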