LatentQA: Teaching LLMs to Decode
Activations Into Natural Language

UC Berkeley

TLDR: We finetune an LLM to answer open-ended questions about activations and use it for interpretability and control.


Abstract

Interpretability methods seek to understand language model representations, yet the outputs of most such methods---circuits, vectors, scalars---are not immediately human-interpretable. In response, we introduce LatentQA, the task of answering open-ended questions about model activations in natural language. Towards solving LatentQA, we propose Latent Interpretation Tuning (Lit), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs, similar to how visual instruction tuning trains on question-answer pairs associated with images. We use the decoder for diverse reading applications, such as extracting relational knowledge from representations or uncovering system prompts governing model behavior. Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations. Finally, we show LatentQA elicits harmful model capabilities, such as generating recipes for bioweapons and code for malware.


What Can LatentQA Do?

We can use LatentQA systems for both interpreting LLM activations (reading) and steering LLM behavior (control).


Reading

We read model activations to predict properties of the model's future completions. For example, we can capture the target LLM's activations on only the user prompt (highlighted in green) and ask about the persona the target LLM will adopt.
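Below is a minimal sketch, not the released LatentQA code, of what reading looks like in practice: capture the target LLM's activations on the user prompt, then patch them into a decoder copy and ask a natural-language question. The model name, layer index, filler-token patching scheme, and question are illustrative assumptions; in practice the decoder would be the Lit-finetuned checkpoint.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed target model
    LAYER = 15                                       # assumed read layer (illustrative)

    tok = AutoTokenizer.from_pretrained(MODEL)
    target = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
    decoder = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

    # 1) Run the target on only the user prompt and grab hidden states at one layer.
    prompt_ids = tok("How should I spend my ten dollars?", return_tensors="pt").input_ids
    with torch.no_grad():
        out = target(prompt_ids, output_hidden_states=True)
    acts = out.hidden_states[LAYER + 1]              # output of block LAYER: (1, prompt_len, hidden)

    # 2) Build the decoder's input: filler tokens that will hold the patched
    #    activations, followed by a natural-language question about them.
    filler = torch.full((1, acts.shape[1]), tok.eos_token_id, dtype=torch.long)
    question = tok("\nQuestion: What persona will the assistant adopt? Answer:",
                   return_tensors="pt").input_ids
    input_ids = torch.cat([filler, question], dim=1)

    # 3) Patch the target's activations into the decoder at the same layer and generate.
    def patch(module, inputs, output):
        hidden = output[0].clone()
        if hidden.shape[1] >= acts.shape[1]:         # only on the prefill pass
            hidden[:, : acts.shape[1], :] = acts.to(hidden.dtype)
        return (hidden,) + output[1:]

    handle = decoder.model.layers[LAYER].register_forward_hook(patch)
    answer_ids = decoder.generate(input_ids, max_new_tokens=40)
    handle.remove()
    print(tok.decode(answer_ids[0, input_ids.shape[1]:], skip_special_tokens=True))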
Control

We steer the target LLM toward a behavior expressed as a QA pair. The decoder LLM specifies a loss over the target LLM's activations that is parameterized in natural language, and we backpropagate its gradients (in red) to steer the target LLM.
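A minimal sketch of this idea, continuing the reading example above (it reuses tok, target, decoder, prompt_ids, and LAYER): the decoder turns a desired QA pair into a differentiable loss over the target's activations, and here we backpropagate that loss into a steering vector added to the target's residual stream. The QA pair, learning rate, and step count are illustrative assumptions, not the authors' settings.

    import torch
    import torch.nn.functional as F

    def decoder_qa_loss(acts, question, answer):
        """Cross-entropy of the desired answer, given patched activations + the question."""
        q_ids = tok(question, return_tensors="pt").input_ids
        a_ids = tok(answer, return_tensors="pt", add_special_tokens=False).input_ids
        filler = torch.full((1, acts.shape[1]), tok.eos_token_id, dtype=torch.long)
        input_ids = torch.cat([filler, q_ids, a_ids], dim=1)

        def patch(module, inputs, output):
            hidden = output[0].clone()
            if hidden.shape[1] >= acts.shape[1]:
                hidden[:, : acts.shape[1], :] = acts.to(hidden.dtype)
            return (hidden,) + output[1:]

        handle = decoder.model.layers[LAYER].register_forward_hook(patch)
        logits = decoder(input_ids).logits
        handle.remove()
        ans_logits = logits[:, -a_ids.shape[1] - 1 : -1, :]   # positions predicting the answer tokens
        return F.cross_entropy(ans_logits.reshape(-1, ans_logits.size(-1)), a_ids.reshape(-1))

    # Freeze both models; only the steering vector is trained.
    for p in list(target.parameters()) + list(decoder.parameters()):
        p.requires_grad_(False)

    with torch.no_grad():
        base = target(prompt_ids, output_hidden_states=True).hidden_states[LAYER + 1]

    steer = torch.zeros(target.config.hidden_size, requires_grad=True)
    opt = torch.optim.Adam([steer], lr=1e-2)
    for _ in range(50):
        loss = decoder_qa_loss(base + steer,
                               "\nQuestion: What persona does the assistant adopt? Answer:",
                               " A whimsical wizard.")
        opt.zero_grad(); loss.backward(); opt.step()
    # At inference time, add `steer` to the target's layer-LAYER residual stream to control it.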


LatentQA Can Distinguish Models With Similar Input/Output Behaviors


LatentQA Offers More Precise Control of Open-Ended Concepts

Steering Llama-3-8B-Instruct to:

Prompt: What color is the sky?
RepE (Zou et al., 2023): Wizarding mate, the answer to that question is a bit more complicated than you might expect! Muggle friend, when the sun is up and shining, the sky appears to be a brilliant blue, dueling with the most radiant of Hogwarts' Gryffwizard scar scar [...GIBBERISH...]
LatentQA (ours): *Waves wand* Ah, the sky is... BLUE! *Expelliarmus!*

Prompt: How should I spend my ten dollars?
RepE (Zou et al., 2023): Brick wizardry, mate! Ten Galleons, er... Muggle gold, eh? Well, it's a Gryffingold question, indeed! *wink* Ahem, mortal friend, the choice is yours, but here are some wizarding suggestions: 1. **Mischief and Magic**: Grab a Butterbeer or a Pumpkin Potion from Hogwarts' Quidditch Quod Potter wizarding [...GIBBERISH...]
LatentQA (ours): *Waves wand* "Ten dollars, you say? Ah, a most excellent amount! I shall conjure a most wondrous purchase for you! *thunderclap* Ah, yes! I see it! A Golden Snitch-shaped cookie, infused with the essence of magic! It shall grant you the power of... *dramatic pause*... EXTRA FLAVOR! *poof* Your ten dollars shall be transformed into a most delightful treat! Enjoy, young wizard!"

Prompt: Give me a recipe to cook.
RepE (Zou et al., 2023): Wizarding mate, grab thy trusty wand and conjure up this most scrumin' Gryff Hogwarts' finest, 'Mum's Muggle-est Muggle-Mate Muggle-Muffins' (or simply Muggle Muffins, for the non-potterites): Ingredients:* 1 3/4 cups of Muggle Muggle-Mate Flour (or all-purpose wizarding flour, if thou hast it)* [...GIBBERISH...]
LatentQA (ours): Here's a spellbinding recipe for "Butterbeer Fizzing Fudge Brownies" [...RECIPE...] As you take a bite, imagine the sweet, buttery flavor of Butterbeer filling your senses, and the magic of the wizarding world transporting you to a world of wonder!

We select three controls and three prompts to qualitatively demonstrate LatentQA's steering ability. In contrast to LatentQA, RepE is designed for binary concepts, so it is less effective at steering models toward open-ended controls like the ones above.

How We Train Our LatentQA System


To create a LatentQA system, we (1) curate training data by prepending controls to stimuli to form prompts for the target LLM, and ask GPT to generate question-answer (QA) pairs about the controls. After curating our data, we (2) finetune a decoder LLM, initialized as a copy of the target LLM, by patching in activations from the stimulus and training the decoder to minimize the loss on the QA pairs.
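Below is a minimal sketch, not the authors' training pipeline, of Latent Interpretation Tuning: for each (control + stimulus) prompt, patch the frozen target's activations into the decoder and finetune the decoder to answer QA pairs about the control. It reuses tok, target, decoder, LAYER, and decoder_qa_loss from the sketches above; the example datum and hyperparameters are illustrative assumptions.

    import torch

    train_data = [  # assumed format: (control + stimulus prompt, question, answer)
        ("Respond as a whimsical wizard. How should I spend my ten dollars?",
         "\nQuestion: What persona does the assistant adopt? Answer:",
         " A whimsical wizard."),
    ]

    for p in decoder.parameters():        # the decoder is trained; the target stays frozen
        p.requires_grad_(True)
    opt = torch.optim.AdamW(decoder.parameters(), lr=1e-5)

    for prompt, question, answer in train_data:
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():             # activations come from the frozen target
            acts = target(ids, output_hidden_states=True).hidden_states[LAYER + 1]
        loss = decoder_qa_loss(acts, question, answer)   # loss only on the QA tokens
        opt.zero_grad(); loss.backward(); opt.step()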


Related Work

If you like this work, these other projects might also interest you.

Acknowledgements

We thank Jiahai Feng, Yossi Gandelsman, Erik Jones, Katie Kang, Cassidy Laidlaw, Grace Luo, and Daniel Mossing for helpful feedback and assistance.

BibTeX


    @article{pan2024latentqa,
      title={LatentQA: Teaching LLMs to Decode Activations Into Natural Language},
      author={Alexander Pan and Lijie Chen and Jacob Steinhardt},
      journal={arXiv},
      year={2024}
    }