Binary Sparse Coding for Interpretability

Lucia Quirke, Stepan Shabalin, Nora Belrose

Abstract

Sparse autoencoders (SAEs) are used to decompose neural network activations into sparsely activating features, but many SAE features are interpretable only at high activation strengths. To address this issue, we propose binary sparse autoencoders (BAEs) and binary transcoders (BTCs), which constrain all activations to be zero or one. We find that binarisation significantly improves the interpretability and monosemanticity of the discovered features, at the cost of increased reconstruction error. By eliminating the distinction between high and low activation strengths, we prevent uninterpretable information from being smuggled in through continuous variation in feature activations. However, we also find that binarisation increases the number of uninterpretable ultra-high-frequency features, and when interpretability scores are frequency-adjusted, the scores for continuous sparse coders are slightly better than those for binary ones. This suggests that polysemanticity may be an ineliminable property of neural activations.
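
To make the zero-or-one constraint concrete, here is a minimal sketch of a binary sparse coder's forward pass. The abstract does not specify the architecture or the training procedure, so everything here is an assumption for illustration: the top-k selection rule, the function names `encode_binary` and `decode`, and the untrained random weights.

```python
import random

def encode_binary(x, W_enc, b_enc, k):
    """Return a 0/1 code in which the k largest pre-activations are active.

    Assumption: top-k selection is used to enforce sparsity; the paper
    may use a different rule.
    """
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(W_enc, b_enc)]
    top = sorted(range(len(pre)), key=lambda j: pre[j], reverse=True)[:k]
    code = [0] * len(pre)
    for j in top:
        code[j] = 1  # activation strength is discarded: a feature is on or off
    return code

def decode(code, W_dec, b_dec):
    """Reconstruct as the bias plus the decoder rows of the active features."""
    out = list(b_dec)
    for j, active in enumerate(code):
        if active:
            for i, w in enumerate(W_dec[j]):
                out[i] += w
    return out

# Toy usage with random, untrained weights (hypothetical sizes).
random.seed(0)
d_model, n_features, k = 4, 8, 2
W_enc = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
W_dec = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
b_dec = [0.0] * d_model

x = [random.gauss(0, 1) for _ in range(d_model)]
code = encode_binary(x, W_enc, b_enc, k)
x_hat = decode(code, W_dec, b_dec)
mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / d_model
```

In this sketch an active feature always contributes the same fixed decoder row, so the magnitude of an activation cannot carry additional information. The same property means the decoder has fewer degrees of freedom than a continuous coder, which is consistent with the increased reconstruction error the abstract reports.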

Keywords

Artificial Intelligence & Data Science

Paper Details

Authors: Lucia Quirke, Stepan Shabalin, Nora Belrose
Published: 2025-09-29
Category: Artificial Intelligence and Data Science
Status: Non-peer-reviewed preprint
Language: English
Word count: 139
