Bloom, J. and Chanin, D. (2024) SAELens. Available at: https://github.com/jbloomAus/SAELens.
Bricken, T., Templeton, A., Batson, J., et al. (2023) Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2. Anthropic.
Cunningham, H., Ewart, A., Riggs, L., et al. (2023) Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600.
Elhage, N., Nanda, N., Olsson, C., et al. (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread.
Gao, L., Dupré la Tour, T., Tillman, H., et al. (2024) Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093.
Lieberum, T., Rajamanoharan, S., Conmy, A., et al. (2024) Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147.
Olah, C., Cammarata, N., Schubert, L., et al. (2020) Zoom in: An introduction to circuits. Distill, 5, e00024.001.
Rajamanoharan, S., Conmy, A., Smith, L., et al. (2024) Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014.
Rajamanoharan, S., Lieberum, T., Sonnerat, N., et al. (2024) Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. arXiv preprint arXiv:2407.14435.
Gemma Team, Riviere, M., Pathak, S., et al. (2024) Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
Templeton, A., Conerly, T., Marcus, J., et al. (2024) Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread. Available at: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.
Tiedemann, J. (2012) Parallel data, tools and interfaces in OPUS. In: Proceedings of the eighth international conference on language resources and evaluation (LREC'12) (eds. N Calzolari, K Choukri, T Declerck, et al.), Istanbul, Turkey, May 2012, pp. 2214–2218. European Language Resources Association (ELRA). Available at: http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.
Wang, K., Variengien, A., Conmy, A., et al. (2022) Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593.