Maus et al. (2023) proposed a black-box framework for generating adversarial prompts that fool both text-to-image models and text generators, using the Square Attack algorithm (Andriushchenko et al., 2020) and Bayesian optimization (Eriksson & Jankowiak, 2021).
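As an illustration of the kind of query-only search such black-box attacks rely on, the following is a minimal Square-Attack-style random search over a discrete prompt suffix. The vocabulary, scoring function, and all parameters here are toy assumptions for demonstration, not the actual setup of the cited work.

```python
import random

# Toy stand-ins: in a real attack the "score" would come from querying
# the target model, and the vocabulary would be its token set.
VOCAB = list("abcdefghijklmnopqrstuvwxyz")

def black_box_score(suffix):
    # Hypothetical objective: reward characters later in the alphabet.
    return sum(ord(c) for c in suffix)

def random_search_attack(suffix_len=8, iters=200, seed=0):
    """Greedy coordinate-wise random search: mutate one position at a
    time, keep the mutation only if the black-box score does not drop."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    best = black_box_score(suffix)
    for _ in range(iters):
        i = rng.randrange(suffix_len)
        old = suffix[i]
        suffix[i] = rng.choice(VOCAB)
        score = black_box_score(suffix)
        if score >= best:
            best = score          # keep the improving mutation
        else:
            suffix[i] = old       # revert otherwise
    return "".join(suffix), best

if __name__ == "__main__":
    suffix, score = random_search_attack()
    print(suffix, score)
```

Because only forward queries are needed, this style of search works without gradients, which is what makes it applicable to closed, API-only models.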
Open Sesame! Universal Black-Box Jailbreaking of Large Language Models. Oral in Workshop: Secure and Trustworthy Large Language Models. Raz Lapid · Ron Langberg · Moshe Sipper.