
Beyond linear steering: Unified multi-attribute control for language models

By Narmeen Oozeer, Luke Marks, Fazl Barez and Amirali Abdullah

Controlling multiple behaviors in large language models (LLMs) at the same time is difficult because different attributes can interfere with each other. Current linear steering methods assume that behaviors can simply be added together in the model's activation space, an assumption that often does not hold. These methods are also inefficient: each attribute requires its own dedicated tuning, which makes them hard to manage and scale for complex, multi-attribute control.
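For concreteness, the additivity assumption behind linear steering can be written as below. The notation is ours rather than the paper's: $h$ is a hidden activation, $v_i$ is a fixed steering vector for attribute $i$, and $\alpha_i$ is its separately tuned strength.

```latex
h' = h + \sum_{i=1}^{k} \alpha_i \, v_i
```

Interference arises when the effect of one $v_i$ depends on which other attributes are being steered, something a fixed sum of vectors cannot express.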


K-Steering is a new approach that addresses both problems. It trains a single non-linear classifier on the model's internal activations and uses the classifier's gradients to compute intervention directions dynamically. Because it does not assume linearity and does not need a separate stored and tuned vector for each attribute, it is more flexible than linear steering. Multiple behaviors can be composed on the fly without any additional training, giving a unified and efficient way to control LLM outputs.
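The idea of steering by classifier gradients can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the two-layer tanh classifier, its hand-picked weights, the loss (mean logit of avoided attributes minus mean logit of targeted attributes), and the names `mlp_forward`, `steering_loss`, `steering_grad`, `k_steer`, `alpha` and `steps` are all hypothetical. A real setup would presumably train the classifier on LLM activations and use an autodiff library for the gradient.

```python
import math

def mlp_forward(x, W1, b1, W2, b2):
    """Toy non-linear classifier: tanh hidden layer, then one logit per attribute."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    logits = [sum(w * hj for w, hj in zip(row, h)) + b
              for row, b in zip(W2, b2)]
    return logits, h

def steering_loss(x, params, target, avoid=()):
    """Assumed loss: mean avoided logit minus mean targeted logit (lower is better)."""
    logits, _ = mlp_forward(x, *params)
    loss = 0.0
    if avoid:
        loss += sum(logits[k] for k in avoid) / len(avoid)
    if target:
        loss -= sum(logits[k] for k in target) / len(target)
    return loss

def steering_grad(x, params, target, avoid=()):
    """Analytic gradient of steering_loss with respect to the activation x."""
    W1, b1, W2, b2 = params
    logits, h = mlp_forward(x, W1, b1, W2, b2)
    dlogits = [0.0] * len(logits)
    for k in avoid:
        dlogits[k] += 1.0 / len(avoid)
    for k in target:
        dlogits[k] -= 1.0 / len(target)
    # Backpropagate through the output layer, then through tanh (derivative 1 - h^2).
    dz = [sum(dlogits[k] * W2[k][j] for k in range(len(W2))) * (1.0 - h[j] ** 2)
          for j in range(len(h))]
    return [sum(dz[j] * W1[j][i] for j in range(len(W1))) for i in range(len(x))]

def k_steer(x, params, target, avoid=(), alpha=0.5, steps=3):
    """Move the activation down the loss gradient for a few small steps."""
    for _ in range(steps):
        g = steering_grad(x, params, target, avoid)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# Toy weights (assumed): 2-dimensional activation, 3 hidden units, 3 attributes.
PARAMS = (
    [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]], [0.0, 0.0, 0.0],
    [[1.0, -0.5, 0.3], [-0.4, 0.9, 0.2], [0.1, 0.2, -0.7]], [0.0, 0.0, 0.0],
)
activation = [0.1, -0.2]

# The same classifier serves any attribute combination, with no retraining.
steered_a = k_steer(activation, PARAMS, target=[0, 1], avoid=[2])
steered_b = k_steer(activation, PARAMS, target=[2])
print(steering_loss(activation, PARAMS, [0, 1], [2]),
      steering_loss(steered_a, PARAMS, [0, 1], [2]))
```

Note that the direction is recomputed from the current activation at every step, so it can differ from input to input, and that changing the `target` and `avoid` sets changes the composition without touching the classifier. Both properties are what distinguish this sketch from adding fixed per-attribute vectors.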


Research submission here.