Do LLMs Understand the Safety of Their Inputs? Training-Free Moderation via Latent Prototypes
We develop an efficient approach to LLM input safety moderation using latent prototypes and demonstrate that safe and unsafe inputs are separable in the model's latent space.
Feb 22, 2025