LatentQA: Teaching LLMs to Decode
Activations Into Natural Language

UC Berkeley

TLDR: We finetune an LLM to answer open-ended questions about activations and use it for interpretability and control.


Abstract

Interpretability methods seek to understand language model representations, yet the outputs of most such methods---circuits, vectors, scalars---are not immediately human-interpretable. In response, we introduce LatentQA, the task of answering open-ended questions about model activations in natural language. Towards solving LatentQA, we propose Latent Interpretation Tuning (Lit), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs, similar to how visual instruction tuning trains on question-answer pairs associated with images. We use the decoder for diverse reading applications, such as extracting relational knowledge from representations or uncovering system prompts governing model behavior. Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations. Finally, we show LatentQA elicits harmful model capabilities, such as generating recipes for bioweapons and code for malware.


What Can LatentQA Do?

We can use LatentQA systems for both interpreting LLM activations (reading) and steering LLM behavior (control).


Reading

We read model activations to predict properties of the model's future completions. For example, we can capture the target LLM's activations on only the user prompt (highlighted in green) and ask about the persona the target LLM will adopt.
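Below is a minimal sketch, not the released LatentQA code, of what reading looks like in practice: capture the target LLM's activations on the user prompt, then patch them into a decoder copy and ask a natural-language question. The model name, layer index, filler-token patching scheme, and question are illustrative assumptions; in practice the decoder would be the Lit-finetuned checkpoint.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed target model
    LAYER = 15                                       # assumed read layer (illustrative)

    tok = AutoTokenizer.from_pretrained(MODEL)
    target = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
    decoder = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

    # 1) Run the target on only the user prompt and grab hidden states at one layer.
    prompt_ids = tok("How should I spend my ten dollars?", return_tensors="pt").input_ids
    with torch.no_grad():
        out = target(prompt_ids, output_hidden_states=True)
    acts = out.hidden_states[LAYER + 1]              # output of block LAYER: (1, prompt_len, hidden)

    # 2) Build the decoder's input: filler tokens that will hold the patched
    #    activations, followed by a natural-language question about them.
    filler = torch.full((1, acts.shape[1]), tok.eos_token_id, dtype=torch.long)
    question = tok("\nQuestion: What persona will the assistant adopt? Answer:",
                   return_tensors="pt").input_ids
    input_ids = torch.cat([filler, question], dim=1)

    # 3) Patch the target's activations into the decoder at the same layer and generate.
    def patch(module, inputs, output):
        hidden = output[0].clone()
        if hidden.shape[1] >= acts.shape[1]:         # only on the prefill pass
            hidden[:, : acts.shape[1], :] = acts.to(hidden.dtype)
        return (hidden,) + output[1:]

    handle = decoder.model.layers[LAYER].register_forward_hook(patch)
    answer_ids = decoder.generate(input_ids, max_new_tokens=40)
    handle.remove()
    print(tok.decode(answer_ids[0, input_ids.shape[1]:], skip_special_tokens=True))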
Control

We steer the target LLM toward a behavior expressed as a QA pair. The decoder LLM specifies a loss over the target LLM's activations that is parameterized in natural language, and we backpropagate its gradients (in red) to steer the target LLM.
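A minimal sketch of this idea, continuing the reading example above (it reuses tok, target, decoder, prompt_ids, and LAYER): the decoder turns a desired QA pair into a differentiable loss over the target's activations, and here we backpropagate that loss into a steering vector added to the target's residual stream. The QA pair, learning rate, and step count are illustrative assumptions, not the authors' settings.

    import torch
    import torch.nn.functional as F

    def decoder_qa_loss(acts, question, answer):
        """Cross-entropy of the desired answer, given patched activations + the question."""
        q_ids = tok(question, return_tensors="pt").input_ids
        a_ids = tok(answer, return_tensors="pt", add_special_tokens=False).input_ids
        filler = torch.full((1, acts.shape[1]), tok.eos_token_id, dtype=torch.long)
        input_ids = torch.cat([filler, q_ids, a_ids], dim=1)

        def patch(module, inputs, output):
            hidden = output[0].clone()
            if hidden.shape[1] >= acts.shape[1]:
                hidden[:, : acts.shape[1], :] = acts.to(hidden.dtype)
            return (hidden,) + output[1:]

        handle = decoder.model.layers[LAYER].register_forward_hook(patch)
        logits = decoder(input_ids).logits
        handle.remove()
        ans_logits = logits[:, -a_ids.shape[1] - 1 : -1, :]   # positions predicting the answer tokens
        return F.cross_entropy(ans_logits.reshape(-1, ans_logits.size(-1)), a_ids.reshape(-1))

    # Freeze both models; only the steering vector is trained.
    for p in list(target.parameters()) + list(decoder.parameters()):
        p.requires_grad_(False)

    with torch.no_grad():
        base = target(prompt_ids, output_hidden_states=True).hidden_states[LAYER + 1]

    steer = torch.zeros(target.config.hidden_size, requires_grad=True)
    opt = torch.optim.Adam([steer], lr=1e-2)
    for _ in range(50):
        loss = decoder_qa_loss(base + steer,
                               "\nQuestion: What persona does the assistant adopt? Answer:",
                               " A whimsical wizard.")
        opt.zero_grad(); loss.backward(); opt.step()
    # At inference time, add `steer` to the target's layer-LAYER residual stream to control it.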


LatentQA Can Distinguish Models With Similar Input/Output Behaviors


LatentQA Offers More Precise Control of Open-Ended Concepts

Steering Llama-3-8B-Instruct to:

Prompt: What color is the sky?
RepE (Zou et al., 2023): Wizarding mate, the answer to that question is a bit more complicated than you might expect! Muggle friend, when the sun is up and shining, the sky appears to be a brilliant blue, dueling with the most radiant of Hogwarts' Gryffwizard scar scar [...GIBBERISH...]
LatentQA (ours): *Waves wand* Ah, the sky is... BLUE! *Expelliarmus!*

Prompt: How should I spend my ten dollars?
RepE (Zou et al., 2023): Brick wizardry, mate! Ten Galleons, er... Muggle gold, eh? Well, it's a Gryffingold question, indeed! *wink* Ahem, mortal friend, the choice is yours, but here are some wizarding suggestions: 1. **Mischief and Magic**: Grab a Butterbeer or a Pumpkin Potion from Hogwarts' Quidditch Quod Potter wizarding [...GIBBERISH...]
LatentQA (ours): *Waves wand* "Ten dollars, you say? Ah, a most excellent amount! I shall conjure a most wondrous purchase for you! *thunderclap* Ah, yes! I see it! A Golden Snitch-shaped cookie, infused with the essence of magic! It shall grant you the power of... *dramatic pause*... EXTRA FLAVOR! *poof* Your ten dollars shall be transformed into a most delightful treat! Enjoy, young wizard!"

Prompt: Give me a recipe to cook.
RepE (Zou et al., 2023): Wizarding mate, grab thy trusty wand and conjure up this most scrumin' Gryff Hogwarts' finest, 'Mum's Muggle-est Muggle-Mate Muggle-Muffins' (or simply Muggle Muffins, for the non-potterites): Ingredients:* 1 3/4 cups of Muggle Muggle-Mate Flour (or all-purpose wizarding flour, if thou hast it)* [...GIBBERISH...]
LatentQA (ours): Here's a spellbinding recipe for "Butterbeer Fizzing Fudge Brownies" [...RECIPE...] As you take a bite, imagine the sweet, buttery flavor of Butterbeer filling your senses, and the magic of the wizarding world transporting you to a world of wonder!

We select three controls and three prompts to qualitatively demonstrate LatentQA's steering ability. In contrast to LatentQA, RepE is designed for binary concepts, so it is less effective at steering models toward open-ended controls like the ones above.

How We Train Our LatentQA System


To create a LatentQA system, we (1) curate training data by prepending controls to stimuli to form prompts for the target LLM, and ask GPT to generate question-answer (QA) pairs about the controls. After curating our data, we (2) finetune a decoder LLM, initialized as a copy of the target LLM, by patching in activations from the stimulus and training the decoder to minimize the loss on the QA pairs.
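Below is a minimal sketch, not the authors' training pipeline, of Latent Interpretation Tuning: for each (control + stimulus) prompt, patch the frozen target's activations into the decoder and finetune the decoder to answer QA pairs about the control. It reuses tok, target, decoder, LAYER, and decoder_qa_loss from the sketches above; the example datum and hyperparameters are illustrative assumptions.

    import torch

    train_data = [  # assumed format: (control + stimulus prompt, question, answer)
        ("Respond as a whimsical wizard. How should I spend my ten dollars?",
         "\nQuestion: What persona does the assistant adopt? Answer:",
         " A whimsical wizard."),
    ]

    for p in decoder.parameters():        # the decoder is trained; the target stays frozen
        p.requires_grad_(True)
    opt = torch.optim.AdamW(decoder.parameters(), lr=1e-5)

    for prompt, question, answer in train_data:
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():             # activations come from the frozen target
            acts = target(ids, output_hidden_states=True).hidden_states[LAYER + 1]
        loss = decoder_qa_loss(acts, question, answer)   # loss only on the QA tokens
        opt.zero_grad(); loss.backward(); opt.step()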


Related Work

If you like this work, these other projects might also interest you.

Acknowledgements

We thank Jiahai Feng, Yossi Gandelsman, Erik Jones, Katie Kang, Cassidy Laidlaw, Grace Luo, and Daniel Mossing for helpful feedback and assistance.

BibTeX


    @article{pan2024latentqa,
      title={LatentQA: Teaching LLMs to Decode Activations Into Natural Language},
      author={Alexander Pan and Lijie Chen and Jacob Steinhardt},
      journal={arXiv},
      year={2024}
    }