1 result for “mechanistic-interpretability”
A sparse autoencoder is a type of autoencoder trained with a sparsity constraint that forces most neurons in the hidden layer to be inactive for any given input, producing a disentangled, interpretable feature decomposition.