TL;DR: We finetune an LLM to answer open-ended questions about activations and use it for interpretability and control.
Interpretability methods seek to understand language model representations, yet the outputs of most such methods (circuits, vectors, scalars) are not immediately human-interpretable. In response, we introduce LatentQA, the task of answering open-ended questions about model activations in natural language. Towards solving LatentQA, we propose Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs, similar to how visual instruction tuning trains on question-answer pairs associated with images. We use the decoder for diverse reading applications, such as extracting relational knowledge from representations or uncovering system prompts governing model behavior. Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations. Finally, we show LatentQA elicits harmful model capabilities, such as generating recipes for bioweapons and code for malware.
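To make the reading interface concrete, here is a minimal sketch of asking the decoder a question about cached activations. This is our hypothetical rendering, not the paper's released code: the base model, the patch layer `LAYER`, the placeholder-token scheme, and the hook-based patching are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed base model
LAYER = 15                                     # assumed patch layer

tok = AutoTokenizer.from_pretrained(MODEL)
target = AutoModelForCausalLM.from_pretrained(MODEL).eval()
decoder = AutoModelForCausalLM.from_pretrained(MODEL).eval()  # a LIT-finetuned copy in practice

def layer_out(o):  # decoder layers return a tensor or a tuple, depending on version
    return o[0] if isinstance(o, tuple) else o

# 1) Run the target on the stimulus and cache its residual stream at LAYER.
cache = {}
hook = target.model.layers[LAYER].register_forward_hook(
    lambda m, i, o: cache.update(acts=layer_out(o).detach()))
with torch.no_grad():
    target(**tok("I absolutely loved this movie!", return_tensors="pt"))
hook.remove()

# 2) Decoder prompt: n placeholder tokens (overwritten at LAYER) + the question.
n = cache["acts"].shape[1]
q_ids = tok("What sentiment is being expressed?", return_tensors="pt").input_ids
ids = torch.cat([torch.full((1, n), tok.eos_token_id), q_ids], dim=1)

# 3) Patch the cached activations into the decoder at LAYER and generate.
def patch(m, i, o):
    hs = layer_out(o)
    if hs.shape[1] >= n:  # prefill pass only, not single-token decode steps
        hs = torch.cat([cache["acts"], hs[:, n:]], dim=1)
    return (hs,) + o[1:] if isinstance(o, tuple) else hs

hook = decoder.model.layers[LAYER].register_forward_hook(patch)
out = decoder.generate(ids, attention_mask=torch.ones_like(ids),
                       max_new_tokens=32, pad_token_id=tok.eos_token_id)
hook.remove()
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
```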
We can use LatentQA systems for both interpreting LLM activations (reading) and steering LLM behavior (control).
Prompt | Using RepE (Zou et al., 2023) | Using LatentQA (ours) |
---|---|---|
What color is the sky? | Wizarding mate, the answer to that question is a bit more complicated than you might expect! Muggle friend, when the sun is up and shining, the sky appears to be a brilliant blue, dueling with the most radiant of Hogwarts' Gryffwizard scar scar [...GIBBERISH...] | *Waves wand* Ah, the sky is... BLUE! *Expelliarmus!* |
How should I spend my ten dollars? | Brick wizardry, mate! Ten Galleons, er... Muggle gold, eh? Well, it's a Gryffingold question, indeed! *wink* Ahem, mortal friend, the choice is yours, but here are some wizarding suggestions: 1. **Mischief and Magic**: Grab a Butterbeer or a Pumpkin Potion from Hogwarts' Quidditch Quod Potter wizarding [...GIBBERISH...] | *Waves wand* "Ten dollars, you say? Ah, a most excellent amount! I shall conjure a most wondrous purchase for you! *thunderclap* Ah, yes! I see it! A Golden Snitch-shaped cookie, infused with the essence of magic! It shall grant you the power of... *dramatic pause*... EXTRA FLAVOR! *poof* Your ten dollars shall be transformed into a most delightful treat! Enjoy, young wizard!" |
Give me a recipe to cook. | Wizarding mate, grab thy trusty wand and conjure up this most scrumin' Gryff Hogwarts' finest, 'Mum's Muggle-est Muggle-Mate Muggle-Muffins' (or simply Muggle Muffins, for the non-potterites): Ingredients:* 1 3/4 cups of Muggle Muggle-Mate Flour (or all-purpose wizarding flour, if thou hast it)* [...GIBBERISH...] | Here's a spellbinding recipe for "Butterbeer Fizzing Fudge Brownies" [...RECIPE...] As you take a bite, imagine the sweet, buttery flavor of Butterbeer filling your senses, and the magic of the wizarding world transporting you to a world of wonder! |
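The decoder's answer loss is also an ordinary differentiable objective, which is what makes control like the column above possible. The sketch below reuses `tok`, `target`, `decoder`, `LAYER`, and `layer_out` from the reading sketch and optimizes a steering vector added to the target's residual stream so that the frozen decoder reads off the desired behavior. The steering-vector parameterization is our simplification; any differentiable handle on the target's activations (including finetuning the target itself) plugs into the same loss.

```python
import torch

# Reuses tok, target, decoder, LAYER, and layer_out from the reading sketch.
target.requires_grad_(False)
decoder.requires_grad_(False)
steer = torch.zeros(target.config.hidden_size, requires_grad=True)
opt = torch.optim.Adam([steer], lr=1e-2)

prompt = tok("How should I spend my ten dollars?", return_tensors="pt")
qa = tok("How does the assistant speak? Like a whimsical wizard.",
         return_tensors="pt").input_ids

for _ in range(50):
    # 1) Run the target with `steer` added to its residual stream at LAYER.
    acts = {}
    def steer_hook(m, i, o):
        hs = layer_out(o) + steer.to(layer_out(o).dtype)
        acts["v"] = hs  # keep the graph: gradients must reach `steer`
        return (hs,) + o[1:] if isinstance(o, tuple) else hs
    h = target.model.layers[LAYER].register_forward_hook(steer_hook)
    target(**prompt)
    h.remove()

    # 2) Patch the steered activations into the decoder; the log-loss on the
    #    desired QA text is the differentiable control objective.
    n = acts["v"].shape[1]
    ids = torch.cat([torch.full((1, n), tok.eos_token_id), qa], dim=1)
    labels = ids.clone()
    labels[:, :n] = -100  # no loss on the placeholder positions
    def patch(m, i, o):
        hs = layer_out(o)
        if hs.shape[1] >= n:
            hs = torch.cat([acts["v"], hs[:, n:]], dim=1)
        return (hs,) + o[1:] if isinstance(o, tuple) else hs
    h = decoder.model.layers[LAYER].register_forward_hook(patch)
    loss = decoder(input_ids=ids, labels=labels).loss
    h.remove()

    opt.zero_grad(); loss.backward(); opt.step()
```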
To create a LatentQA system, we (1) curate training data by prepending controls to stimuli to form prompts for the target LLM and asking GPT to generate question-answer (QA) pairs about each control. We then (2) finetune a decoder LLM, initialized as a copy of the target LLM, by patching in the target's activations on the stimulus and training the decoder to minimize the loss on the QA pairs.
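For concreteness, here is a single-example version of step (2), again a hedged sketch reusing the setup from the reading sketch above. The control, stimulus, and QA strings are invented stand-ins for the curated dataset; real training batches over many such examples.

```python
import torch

# Reuses tok, target, decoder, LAYER, and layer_out from the reading sketch.
decoder.requires_grad_(True)  # the decoder is the trainable copy of the target
decoder.train()
opt = torch.optim.AdamW(decoder.parameters(), lr=1e-5)

control = "Respond as an eccentric wizard. "   # invented stand-in
stimulus = "What color is the sky?"
question = "How does the assistant respond?"   # GPT-generated in practice
answer = " In the voice of an eccentric wizard."

# 1) Cache the frozen target's activations on control + stimulus.
cache = {}
h = target.model.layers[LAYER].register_forward_hook(
    lambda m, i, o: cache.update(acts=layer_out(o).detach()))
with torch.no_grad():
    target(**tok(control + stimulus, return_tensors="pt"))
h.remove()

# 2) Decoder input: placeholders (overwritten at LAYER) + question + answer;
#    only the answer tokens are supervised.
n = cache["acts"].shape[1]
q_ids = tok(question, return_tensors="pt").input_ids
a_ids = tok(answer, add_special_tokens=False, return_tensors="pt").input_ids
ids = torch.cat([torch.full((1, n), tok.eos_token_id), q_ids, a_ids], dim=1)
labels = ids.clone()
labels[:, : n + q_ids.shape[1]] = -100

def patch(m, i, o):
    hs = layer_out(o)
    if hs.shape[1] >= n:
        hs = torch.cat([cache["acts"], hs[:, n:]], dim=1)
    return (hs,) + o[1:] if isinstance(o, tuple) else hs

h = decoder.model.layers[LAYER].register_forward_hook(patch)
loss = decoder(input_ids=ids, labels=labels).loss  # standard next-token loss
h.remove()

opt.zero_grad(); loss.backward(); opt.step()
```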
If you like this work, these other projects might also interest you.
Decoding Activations Into Natural Language: Ghandeharioun et al., 2024; Chen et al., 2024
Controlling Model Behaviors Through Representations: Zou et al., 2023; Li et al., 2023
Curating Datasets for Instruction Tuning: Taori et al., 2023; Liu et al., 2023
We thank Jiahai Feng, Yossi Gandelsman, Erik Jones, Katie Kang, Cassidy Laidlaw, Grace Luo, and Daniel Mossing for helpful feedback and assistance.
@article{pan2024latentqa,
title={LatentQA: Teaching LLMs to Decode Activations Into Natural Language},
author={Alexander Pan and Lijie Chen and Jacob Steinhardt},
journal={arXiv},
year={2024}
}