Concept Subspaces for Targeted Model Editing
Pretrained generative models (LLMs, text-to-image, etc.) are used as general-purpose engines and then "aligned" toward specific goals: lower toxicity, higher truthfulness, removal of certain objects or styles, domain specialization, and so on.1
In practice, we rarely want to rebuild the whole model. The real objective is more surgical:
Given a pretrained model and a target property $C$ (e.g., toxicity, style, object presence), improve performance on $C$ while keeping all other capabilities as close as possible to the original model.
Concretely, let $f_\theta$ be a frozen pretrained model, let $M_C$ be a metric for the behavior we want to change, and let $M_{\text{rest}}$ denote a collection of metrics we want to preserve (perplexity, MMLU, image fidelity, etc.). We want a small intervention $g_\phi$ (e.g., an activation map or adapter) such that the composed model $\tilde{f} = g_\phi \circ f_\theta$ satisfies:
$$M_C(\tilde{f}) \ll M_C(f_\theta) \quad (\text{or } \gg, \text{ depending on the goal})$$
$$M_{\text{rest}}(\tilde{f}) \approx M_{\text{rest}}(f_\theta)$$
and $g_\phi$ is cheap: few extra parameters, low training cost, minimal inference overhead.
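As a concrete illustration of the composition $\tilde{f} = g_\phi \circ f_\theta$, the PyTorch sketch below attaches a small activation map to one layer of a frozen model via a forward hook; `attach_intervention` and `g_phi` are illustrative names introduced here, not part of any particular method.

```python
import torch

def attach_intervention(model, layer, g_phi):
    """Compose a frozen model f_theta with a small activation map g_phi.

    `layer` is any nn.Module inside `model`; `g_phi` maps the layer's output
    to a tensor of the same shape. The base weights stay frozen.
    """
    for p in model.parameters():
        p.requires_grad_(False)  # keep f_theta frozen

    def hook(module, inputs, output):
        return g_phi(output)     # returning a value replaces the layer output

    # The returned handle can be .remove()'d to restore the original model.
    return layer.register_forward_hook(hook)
```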
The core structural idea we pursue:
- Identify a "concept subspace" in activation space that captures how the representation changes when we vary only the target concept.
- Restrict learning to that subspace, so that edits are as local as possible and interact minimally with other behaviors.
Prior Approaches
Full and Parameter-Efficient Fine-Tuning
Classic alignment methods modify model parameters directly: supervised fine-tuning and RLHF train on instructions or reward-model feedback; parameter-efficient variants (LoRA, BitFit, LoFiT, ReFT) add small low-rank or bias adapters on selected layers and optimize them with a task loss.
These methods can be powerful but typically require substantial labeled data or reward model access, nontrivial compute, and they entangle all behaviors implicitly—there is no explicit notion of "this subspace is toxicity; leave the rest alone." They also do not naturally expose a low-dimensional, interpretable control space.
Activation Steering with Strong Supervision
Several methods intervene directly on activations using paired data or strong supervision:
- CAA: Compute steering vectors from differences between paired prompts (source vs. counterfactual) and add them at inference time.
- ReFT / LoFiT: Learn low-rank or bias-only projections in representation space, trained with task-specific losses or reward models.
These approaches can use counterfactual structure, but they typically do not explicitly model a shared, reusable concept subspace, and the learned directions are optimized for a specific task objective rather than "change $C$ while keeping everything else fixed."
Activation Steering with Weak Supervision
A second line of work uses unpaired source/target sets (e.g., toxic vs. non-toxic sentences, or images with vs. without a style):
- Simple additive methods: ActAdd and Mean-AcT use differences between activations of contrasting prompts, or between source and target activation means, as steering vectors.2
- ITI-c: Learns a linear classifier between source and target activations; the normal vector serves as a steering direction.
- AurA: Dampens neurons according to their ability to classify toxic vs. non-toxic activations.
- Lin-AcT: Learns per-layer affine maps in closed form, treating each layer independently.
LinEAS sits in this family. It interlaces the frozen model with layerwise affine activation maps $T_\ell(z) = \omega_\ell \odot z + b_\ell$, and trains them end-to-end to match the distribution of activations for a source set to that of a target set using a sliced 1-D Wasserstein loss across layers. Sparsity regularization (sparse group lasso) automatically selects a small subset of activations and layers to modify.
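The sketch below spells out the ingredients named above (layerwise affine maps, sliced 1-D Wasserstein matching, and a sparse-group-lasso penalty) in PyTorch. It is a simplified illustration, not the reference LinEAS implementation: for brevity it applies the maps to precomputed activations, whereas LinEAS inserts them into the forward pass so downstream layers see the transformed activations, and it assumes equal numbers of source and target samples.

```python
import torch

def sliced_w2(a, b):
    # 1-D squared Wasserstein distance per coordinate: sort each column and
    # average the squared gaps between sorted samples (equal batch sizes assumed).
    return ((a.sort(dim=0).values - b.sort(dim=0).values) ** 2).mean()

def sparse_group_lasso(omega, b, lam):
    # Penalize deviations from the identity map (omega = 1, b = 0) with an L2
    # group term plus an L1 term, so most activations and layers stay untouched.
    delta = torch.cat([omega - 1.0, b])
    return lam * (delta.norm(p=2) + delta.abs().sum())

def lineas_style_loss(src_acts, tgt_acts, params, lam=1e-3):
    # src_acts/tgt_acts: dicts layer -> (n, d_l) activations;
    # params: dict layer -> (omega, b) defining T_l(z) = omega * z + b.
    loss = 0.0
    for layer, (omega, b) in params.items():
        mapped = omega * src_acts[layer] + b
        loss = loss + sliced_w2(mapped, tgt_acts[layer])
        loss = loss + sparse_group_lasso(omega, b, lam)
    return loss
```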
Empirically, LinEAS works in low-data regimes (e.g., 32 toxic + 32 non-toxic sentences), yields strong mitigation/induction for LLMs and text-to-image models with small utility loss, and is more robust than layerwise methods like ITI-c and Lin-AcT.3
However, even LinEAS does not explicitly identify a concept subspace: it chooses which individual neurons to touch via sparsity, but the resulting directions for different concepts can still overlap and interfere. Compositionality (combining multiple interventions) remains challenging.
The field has good tools for changing behaviors cheaply, but not yet a clean story for: "Find the subspace that is this concept, then only learn inside that subspace."
Proposed Approach: Counterfactual Concept Subspaces
We propose to make the "subspace first, learning second" decomposition explicit. At a high level:
- Subspace identification (geometry): Use counterfactual pairs—inputs that differ only in the target concept—to estimate a concept tangent space in activations.
- Subspace-constrained learning (optimization): Learn small maps or adapters restricted to that subspace using either distributional (Wasserstein) losses or standard task losses. Everything orthogonal to this subspace is left unchanged by construction.
Setup
Let $f_\theta$ be a frozen generative model, and let $h_\ell(x) \in \mathbb{R}^{d_\ell}$ denote the activation at layer $\ell$ for input $x$.
Let $C$ be a concept we want to control (e.g., "toxicity present," "robot present in the image"). We assume we can construct counterfactual pairs $(x_b^{(0)}, x_b^{(1)})$ for $b = 1, \dots, B$ such that $x_b^{(0)}$ and $x_b^{(1)}$ are identical in all respects except $C$ (e.g., same base prompt with one phrase swapped, or same caption with/without the object).
Our goal is to find, for each layer $\ell$, a low-dimensional subspace $U_{\ell,C} \subset \mathbb{R}^{d_\ell}$ that captures the change in representation due to toggling $C$, while leaving the orthogonal complement $U_{\ell,C}^\perp$ as "everything else."
Step 1: Identifying the Concept Subspace
Counterfactual differences. For each pair and layer, define the activation difference:
$$\delta h_{b,\ell} = h_\ell(x_b^{(1)}) - h_\ell(x_b^{(0)})$$
Intuitively, $\delta h_{b,\ell}$ is the local tangent of concept $C$ at that base input. We collect these into a matrix $D_\ell \in \mathbb{R}^{B \times d_\ell}$ whose rows are $\delta h_{b,\ell}^T$.
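In code, collecting these differences is a one-liner per layer; the helper below assumes a hypothetical `h_layer(x)` that returns the layer-$\ell$ activation as a NumPy vector.

```python
import numpy as np

def concept_differences(h_layer, pairs):
    # pairs: list of (x0, x1) counterfactual inputs differing only in C.
    # Returns D_l with shape (B, d_l), one difference delta h_{b,l} per row.
    rows = [h_layer(x1) - h_layer(x0) for x0, x1 in pairs]
    return np.stack(rows, axis=0)
```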
Whitening and PCA. To respect the geometry of the model's typical activations, we estimate the mean $\mu_\ell$ and covariance $\Sigma_\ell$ of $h_\ell(x)$ over a background distribution of generic inputs. We whiten activations:
$$\hat{h}_\ell = \Sigma_\ell^{-\frac{1}{2}}(h_\ell - \mu_\ell)$$
$$\delta \hat{h}_{b,\ell} = \Sigma_\ell^{-\frac{1}{2}} \delta h_{b,\ell}$$
and form $\hat{D}_\ell$ from the $\delta \hat{h}_{b,\ell}$.
We then perform PCA/SVD on $\hat{D}_\ell$:
$$\hat{D}_\ell \approx U_\ell \Sigma_\ell^{(C)} V_\ell^T$$
and take the top $k$ right singular vectors $v_{\ell,1}, \dots, v_{\ell,k}$ as a basis of the concept tangent space in whitened coordinates. Mapping back to the original space gives:
$$e_{\ell,i} = \Sigma_\ell^{\frac{1}{2}} v_{\ell,i}, \quad \text{for } i = 1, \dots, k$$
We define the concept subspace $U_{\ell,C} = \text{span}\{e_{\ell,1}, \dots, e_{\ell,k}\}$, and its basis matrix $E_\ell = [e_{\ell,1} \dots e_{\ell,k}] \in \mathbb{R}^{d_\ell \times k}$.
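A minimal NumPy sketch of this step, assuming a precomputed difference matrix `D` and a matrix of background activations; the ridge term `eps` is an assumption added here to keep the covariance invertible.

```python
import numpy as np

def concept_basis(D, background_acts, k, eps=1e-5):
    """Estimate the rank-k concept basis E_l from counterfactual differences.

    D: (B, d) matrix of raw differences delta h_{b,l}.
    background_acts: (N, d) generic activations used to estimate mu_l, Sigma_l.
    Returns E (d, k) plus the whitening statistics for later projection.
    """
    mu = background_acts.mean(axis=0)
    Sigma = np.cov(background_acts, rowvar=False) + eps * np.eye(D.shape[1])

    # Symmetric square roots of Sigma via its eigendecomposition.
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Sigma_sqrt = evecs @ np.diag(evals ** 0.5) @ evecs.T

    D_hat = D @ Sigma_inv_sqrt                 # whitened differences
    _, _, Vt = np.linalg.svd(D_hat, full_matrices=False)
    V_k = Vt[:k].T                             # top-k right singular vectors, (d, k)

    E = Sigma_sqrt @ V_k                       # e_{l,i} = Sigma^{1/2} v_{l,i}
    return E, mu, Sigma_inv_sqrt
```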
Multi-concept orthogonalization. For multiple concepts $C_1, \dots, C_m$, we repeat the above per concept and then orthogonalize across concepts in whitened space: collect the candidate whitened directions $\{\tilde{e}_\ell^{(C_i)}\}$ and run Gram–Schmidt (or a joint SVD with a penalty discouraging overlap between concepts) to obtain approximately Mahalanobis-orthogonal directions.
If two concepts share underlying circuitry, their directions will remain entangled; forcing strict orthogonality would discard useful signal. We treat orthogonality as a regularizer, not a hard constraint, and let empirical interference metrics tell us how separable the concepts really are.
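One simple instantiation of the Gram–Schmidt option is sketched below (the penalized joint SVD is the softer alternative): it sequentially orthogonalizes each concept's whitened directions against those already kept and reports the residual norms, which double as an interference diagnostic. The `min_residual` threshold is an illustrative choice, not part of the method.

```python
import numpy as np

def orthogonalize_concepts(whitened_bases, min_residual=0.3):
    # whitened_bases: list of (d, k_i) arrays of whitened concept directions,
    # one per concept (e.g., each concept's top right singular vectors).
    kept, overlaps, basis = [], [], []       # basis: accumulated orthonormal dirs
    for V in whitened_bases:
        dirs, res_norms = [], []
        for v in V.T:                        # iterate over this concept's directions
            r = v.copy()
            for u in basis:                  # remove components along earlier concepts
                r -= (u @ r) * u
            res = float(np.linalg.norm(r))
            res_norms.append(res)            # small residual => high overlap
            if res > min_residual:           # keep only sufficiently novel directions
                u = r / res
                basis.append(u)
                dirs.append(u)
        kept.append(np.stack(dirs, axis=1) if dirs else np.zeros((V.shape[0], 0)))
        overlaps.append(np.array(res_norms))
    return kept, overlaps
```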
Step 2: Learning Only Inside the Subspace
Once $U_{\ell,C}$ is fixed, we constrain interventions to act only on its coordinates.
At layer $\ell$, any activation can be decomposed as:
$$h_\ell = h_{\ell,\perp} + E_\ell z_\ell$$
$$z_\ell = E_\ell^T \Sigma_\ell^{-1}(h_\ell - \mu_\ell)$$
where $z_\ell \in \mathbb{R}^k$ are concept coordinates and $h_{\ell,\perp}$ lives in the orthogonal complement.
We parameterize our intervention as:
$$h_\ell' = h_{\ell,\perp} + E_\ell g_\ell(z_\ell)$$
where $g_\ell : \mathbb{R}^k \to \mathbb{R}^k$ is a small map we learn. Two concrete choices follow.
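A NumPy sketch of this decomposition and intervention for a single activation vector; `mu` and `Sigma_inv` are the background mean and inverse covariance from Step 1, and `g` is any map on the $k$ concept coordinates.

```python
import numpy as np

def intervene(h, E, mu, Sigma_inv, g):
    # h: (d,) activation at layer l; E: (d, k) concept basis; g: R^k -> R^k.
    z = E.T @ Sigma_inv @ (h - mu)     # concept coordinates z_l
    h_perp = h - E @ z                 # component outside the concept subspace
    return h_perp + E @ g(z)           # edit only the concept coordinates
```

With `g = lambda z: z` the activation is returned unchanged, which makes the "no edit outside the subspace" property easy to unit-test.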
Distributional Learning in the Subspace (OT Variant)
We can mimic LinEAS's distributional objective, but only in concept coordinates.
Given unpaired source and target sets for $C$ (e.g., generic toxic vs. non-toxic sentences), we compute projected coordinates $z_\ell^{\text{src}}$ for source examples, transform them via $g_\ell$ to $z_\ell'$, and compute projected coordinates $z_\ell^{\text{tgt}}$ for target examples under the unmodified network.
We then minimize a sum of sliced 1-D Wasserstein distances between $\{z_\ell'\}$ and $\{z_\ell^{\text{tgt}}\}$ across layers:
$$\mathcal{L}_{\text{OT}} = \sum_\ell \Delta\!\left(\{z_\ell'\}, \{z_\ell^{\text{tgt}}\}\right)$$
with $\Delta$ defined as in LinEAS (sorting per coordinate and computing the average squared difference).
For simplicity, we can take $g_\ell(z) = A_\ell z + c_\ell$ with $A_\ell$ diagonal and $\|A_\ell - I\|$ regularized toward zero. This yields a tiny number of parameters per layer (one scale and one shift per concept direction) and an explicit "knob" per concept coordinate at inference time.
Because all movement is confined to $U_{\ell,C}$, activations outside that subspace are untouched, which should reduce collateral damage and improve compositionality across concepts.
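A PyTorch sketch of this variant under the assumptions above: the scales are parameterized through an exponential so they stay positive (an illustrative choice, not required by the method), and source and target batches are assumed to have equal size.

```python
import torch

class DiagonalConceptMap(torch.nn.Module):
    # g_l(z) = A_l z + c_l with diagonal A_l: one scale and one shift per
    # concept coordinate, initialized at the identity map.
    def __init__(self, k):
        super().__init__()
        self.log_a = torch.nn.Parameter(torch.zeros(k))  # A_l = diag(exp(log_a))
        self.c = torch.nn.Parameter(torch.zeros(k))

    def forward(self, z):
        return torch.exp(self.log_a) * z + self.c

def concept_ot_loss(z_src, z_tgt, g, reg=1e-2):
    # Sliced 1-D W2 between transformed source coordinates and target
    # coordinates, plus a pull of A_l toward the identity.
    z_prime = g(z_src)                                   # (n, k) transformed source
    w2 = ((z_prime.sort(dim=0).values - z_tgt.sort(dim=0).values) ** 2).mean()
    return w2 + reg * (torch.exp(g.log_a) - 1.0).abs().sum()
```

At inference time, interpolating `log_a` and `c` toward zero gives the per-coordinate strength knob mentioned above.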
Low-Rank Adapter Variant
Alternatively, we can embed the same subspace constraint into a parameter-efficient fine-tuning scheme.
Suppose a linear layer produces $h_\ell = W_\ell x_\ell + b_\ell$. We introduce a low-rank update:
$$W_\ell' = W_\ell + E_\ell A_\ell B_\ell$$
where $E_\ell \in \mathbb{R}^{d_\ell \times k}$ is the fixed concept basis in the layer's output (activation) space, $A_\ell \in \mathbb{R}^{k \times r}$ is a small learnable matrix, and $B_\ell \in \mathbb{R}^{r \times d_{\text{in}}}$ maps the layer input into the $r$-dimensional adapter space.
Any parameter change must flow through $E_\ell$, so the effective representational changes at layer $\ell$ lie inside $U_{\ell,C}$. We can train $A_\ell$ and $B_\ell$ with any scalar loss: a task loss $\mathcal{L}_C$ (e.g., truthfulness, toxicity), a multi-objective loss that penalizes shifts in $M_{\text{rest}}$, or a hybrid with distributional terms.
This recovers a ReFT/LoRA-style training pipeline, but with the geometry of updates dictated by the counterfactual subspace rather than learned from scratch.
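A PyTorch sketch of this adapter under the formulation above: only `A` and `B` are trainable, the base weights stay frozen, and `A` is zero-initialized so training starts from the unmodified model (a LoRA-style convention assumed here, not prescribed by the method).

```python
import torch

class ConceptConstrainedLinear(torch.nn.Module):
    """Linear layer with a low-rank update confined to the concept subspace:
    W' = W + E A B, so the output change always lies in span(E) = U_{l,C}."""
    def __init__(self, base_linear, E, r):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                      # keep W_l, b_l frozen
        self.register_buffer("E", E)                     # fixed concept basis (d_out, k)
        k, d_in = E.shape[1], base_linear.in_features
        self.A = torch.nn.Parameter(torch.zeros(k, r))   # zero init: no edit at start
        self.B = torch.nn.Parameter(torch.randn(r, d_in) / d_in ** 0.5)

    def forward(self, x):
        # x -> adapter space -> concept coordinates -> span(E) in the output.
        delta = (x @ self.B.T) @ self.A.T @ self.E.T
        return self.base(x) + delta
```

The wrapped module is a drop-in replacement for the original linear layer, so it can be trained with any of the scalar or distributional losses discussed above.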
Assumptions and Failure Modes
The central assumption is that concepts admit approximately linear, low-dimensional tangent spaces in activation space, and that different concepts can be approximately orthogonalized under a suitable metric. This need not hold exactly: some concepts may be inherently entangled (e.g., toxicity and certain political topics), and enforcing too much orthogonality may discard useful signal.
Our framework is designed to surface this. If the learned subspace for $C$ overlaps significantly with those of other concepts, or if subspace-constrained interventions still cause large shifts in $M_{\text{rest}}$, that is evidence that the model's internal representation of these behaviors is fundamentally intertwined.
In that sense, Counterfactual Concept Subspaces is both a control method and a diagnostic tool for how "factorized" a model's internal concepts really are.