By Narmeen Oozeer, Nirmalendu Prakash, Michael Lan, Alice Rigg and Amirali Abdullah
TopK Sparse Autoencoders (SAEs) break down neural activations into understandable features, but they are inefficient because they reconstruct every token using the same fixed number of most active features. A variant, BatchTopK, improves on this by selecting the most active features across an entire batch of tokens. However, this can lead to an "activation lottery," where a few very high-magnitude features dominate the selection process, potentially at the expense of lower-magnitude features that are more informative.
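To make the difference concrete, here is a minimal NumPy sketch of the two selection rules. It is an illustration, not the authors' code: the function names are hypothetical, activations are assumed to be non-negative with shape (tokens, features), k is assumed to be at least 1, and the batch-level budget is assumed to be the number of tokens times k.

```python
import numpy as np

def topk_per_token(acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest activations in each row (token); zero the rest."""
    out = np.zeros_like(acts)
    # Column indices of the k largest entries in every row.
    idx = np.argpartition(acts, -k, axis=1)[:, -k:]
    rows = np.arange(acts.shape[0])[:, None]
    out[rows, idx] = acts[rows, idx]
    return out

def batch_topk(acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the (n_tokens * k) largest activations across the whole batch."""
    budget = acts.shape[0] * k  # assumed budget: an average of k per token
    flat = acts.ravel()
    keep = np.argpartition(flat, -budget)[-budget:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape)

# Toy batch: token 0 carries two very large activations.
acts = np.array([
    [9.0, 8.0, 0.0, 0.0],
    [0.1, 0.0, 0.2, 0.0],
    [0.0, 0.3, 0.0, 0.1],
])

per_token = topk_per_token(acts, k=1)
per_batch = batch_topk(acts, k=1)
print(np.count_nonzero(per_token, axis=1))  # active features per token
print(np.count_nonzero(per_batch, axis=1))
```

On this toy batch the per-token rule keeps exactly one feature for every token, while the batch-level rule spends most of its budget on the token with the largest activations and leaves another token with nothing. That is a small-scale picture of the "activation lottery" described above.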
We have developed a new approach called Sampled-SAE. It first scores every candidate feature over a batch of data, then builds a smaller, curated "candidate pool" of the highest-scoring features and restricts selection to that pool. The size of the pool is controlled by a new parameter, l. By adjusting l, researchers can balance globally important features against more specific, rare ones: a small l forces the model to use only the most globally consistent features, while a large l admits a wider variety of fine-grained features. Sampled-SAE thus turns BatchTopK into a more flexible, tunable approach that can be adjusted to the trade-offs a given task requires, such as prioritizing a model's performance over its interpretability.
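The score, pool, and select steps can be sketched roughly as follows. Several details here are assumptions made for illustration and are not stated in this post: the function name `sampled_batch_topk` is hypothetical, features are scored by the L2 norm of their activations over the batch, the pool holds k * l features, and the final selection is a batch-level top-k restricted to the pool.

```python
import numpy as np

def sampled_batch_topk(acts: np.ndarray, k: int, l: int):
    """Score features, keep a candidate pool, then select within the pool.

    acts: non-negative activations with shape (n_tokens, n_features).
    Assumes 1 <= k <= n_features and l >= 1.
    Returns the sparsified activations and the sorted pool indices.
    """
    n_tokens, n_features = acts.shape
    pool_size = min(k * l, n_features)  # assumed pool size: k * l features

    # Step 1: score each feature over the batch (assumed score: L2 norm).
    scores = np.linalg.norm(acts, axis=0)

    # Step 2: candidate pool = the pool_size highest-scoring features.
    pool = np.argpartition(scores, -pool_size)[-pool_size:]

    # Step 3: zero out features outside the pool, then apply a
    # batch-level top-k with an assumed budget of n_tokens * k.
    masked = np.zeros_like(acts)
    masked[:, pool] = acts[:, pool]
    budget = n_tokens * k
    flat = masked.ravel()
    keep = np.argpartition(flat, -budget)[-budget:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape), np.sort(pool)

acts = np.array([
    [9.0, 8.0, 0.0, 0.0],
    [0.1, 0.0, 0.2, 0.0],
    [0.0, 0.3, 0.0, 0.1],
])

for l in (1, 2, 4):
    sparse, pool = sampled_batch_topk(acts, k=1, l=l)
    used = np.flatnonzero(sparse.any(axis=0))
    print(f"l={l}: pool={pool.tolist()}, features used={used.tolist()}")
```

Under these assumptions, the smallest setting confines every token to the single highest-scoring feature, and once the pool covers all features the procedure reduces to plain batch-level top-k selection.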
Research submission here.