Dec 2023  

Llama.MIA — a fork of llama.cpp with interpretability features

I have been using llama.cpp for learning about transformers and experimenting with LLM visualizations and mechanistic interpretability.

Initially, I just inserted a bunch of code all over the ggml compute framework. That code was not thread-safe, used lots of hardcoded values, and communicated through global variables.

Now I have refactored it a bit, moving most of the code into hooks/callbacks. The new version is called Llama.MIA, where MIA stands for “mechanistic interpretability application”. For now, only the CPU version is supported, and it has been tested only with Llama 2.

The next sections describe building, setup, and usage.

Setup

# get the code, checkout the branch
git clone https://github.com/coolvision/llama.mia
cd llama.mia
git checkout mia

# build
make mia

# obtain the original LLaMA2 model weights and place them in ./models
ls ./models
ggml-vocab-llama.gguf  llama-2-7b  llama-2-7b-chat  llama.sh  tokenizer_checklist.chk  tokenizer.model

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert to ggml FP16 format
python3 convert.py models/llama-2-7b-chat/

# quantize the model to 4-bits (using q4_0 method)
./quantize ./models/llama-2-7b-chat/ggml-model-f16.gguf ./models/llama-2-7b-chat/ggml-model-q4_0.gguf q4_0

# run the inference
ln -s ./models/llama-2-7b-chat/ggml-model-q4_0.gguf llama2.gguf
./mia -m llama2.gguf -n 128

Attention map visualization

--draw PATH — draw attention maps for the processed prompt and save them as an image to PATH. For example:

./mia -m llama2.gguf --prompt "William Shakespeare was born in the year" -n 5 --draw ~/tmp/llama_vis.png
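
For reference, here is a rough numpy sketch of what a single head's attention map is in a standard transformer; the sizes and masking details below are illustrative stand-ins, not the actual Llama.MIA drawing code:

import numpy as np

# One head's attention weights, with stand-in sizes.
n_tokens, d_head = 8, 128
Q = np.random.randn(n_tokens, d_head)
K = np.random.randn(n_tokens, d_head)

scores = Q @ K.T / np.sqrt(d_head)                      # raw attention scores
mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
scores = np.where(mask, scores, -np.inf)                # causal mask: no attending to future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)          # softmax rows: the n_tokens x n_tokens map to draw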

Computation graph printout

--print-cgraph — prints the computation graph. First, the details of all the tensors (NE0–NE3 are the number of elements in each dimension, NB0–NB3 the corresponding strides in bytes):

./mia -m llama2.gguf --print-cgraph

TYPE   OP              NDIMS NE0 NE1 NE2 NE3  NB0 NB1   NB2      NB3       DATA                                       NAME
q4_0   NONE            2 4096 4096 1 1        18  2304  9437184  9437184   0x7fb20e57ae40              blk.0.attn_k.weight
q4_0   NONE            2 4096 32000 1 1       18  2304  73728000 73728000  0x7fb202f9ce40                token_embd.weight
i32    NONE            1 1 1 1                1   4     4        4         0x7fb1ee52e020                       inp_tokens
f32    NONE            1 4096 1 1 1           4   16384 16384    16384     0x7fb21490ae40           blk.0.attn_norm.weight
[...]

Next, all the nodes of the computation graph are listed, each followed by its source tensors:

ARG    TYPE   OP                  NDIMS NE0 NE1 NE2 NE3     NB0 NB1    NB2    NB3                    DATA                   NAME
DST    f32    GET_ROWS            1 4096 1 1 1              4   16384  16384    16384      0x7fb1ee52e040               inp_embd
SRC    q4_0   NONE                2 4096 32000 1 1          18  2304   73728000 73728000   0x7fb202f9ce40      token_embd.weight
SRC    i32    NONE                1 1 1 1 1                 4   4      4        4          0x7fb1ee52e020             inp_tokens

DST    f32    RMS_NORM            1 4096 1 1 1              4   16384  16384    16384               (nil)                 norm-0
SRC    f32    GET_ROWS            1 4096 1 1 1              4   16384  16384    16384      0x7fb1ee52e040               inp_embd
[...]

Logit lens

Internally, transformers use representations encoded in the embedding space. To convert the transformer's output into meaningful tokens, it is multiplied by the unembedding matrix.

The idea of the logit lens is to apply the same transformation to intermediate layers, which makes it possible to interpret the transformer's hidden internal state.
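
A minimal numpy sketch of the idea, with stand-in sizes and weights rather than the actual Llama.MIA code:

import numpy as np

# Illustrative sizes: d_model = embedding size, n_vocab = vocabulary size.
d_model, n_vocab = 4096, 32000
W_U = np.random.randn(d_model, n_vocab)   # unembedding matrix (stand-in values)
hidden = np.random.randn(d_model)         # residual stream after some intermediate layer

# Logit lens: project the intermediate state onto the vocabulary, the same way
# the final output is projected (the real model also applies a final normalization first).
logits = hidden @ W_U
top_k = np.argsort(logits)[-8:][::-1]     # indexes of the 8 highest-scoring tokens
print(top_k)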

--ll TENSOR_NAME TOP_K, --logit-lens TENSOR_NAME TOP_K — prints the TOP_K un-embedded tokens for the specified tensor. Partial matches of tensor names work as well: if TENSOR_NAME is “l_out”, it will print logit lens results for all layers (“l_out-0”, “l_out-1”, …), while “l_out-20” will print them only for layer 20.
It works for the tensors of the residual stream (l_out), as well as for attention outputs (kqv_out), MLP outputs (ffn_out), and any other tensors with embedding-space dimensions.

./mia -m llama2.gguf --prompt "The capital of Japan is" -n 5 --ll l_out 8

unembed LN 0 l_out-0:
0:  Архив 1.3|bolds 1.3|archivi 1.1| konn 1.1|penas 1.1| partiellement 1.1|пута 1|embros 0.97|
1: пута 0.51| sier 0.42| Censo 0.38| Архив 0.37| virtuel 0.37|penas 0.36|Portail 0.36|férences 0.33|
2: férences 0.18| straight 0.18|empre 0.17| Censo 0.17| succ 0.17|寺 0.17| soft 0.17|csol 0.17|
3: пута 0.44| partiellement 0.33|archivi 0.29| sier 0.28| cí 0.28|Sito 0.28| konn 0.28|embros 0.25|
4: archivi 0.25|➖ 0.25|пута 0.25| returns 0.23|瀬 0.23|textt 0.22|Sito 0.21|ѐ 0.21|
5: 昌 0.17|ic 0.16| Tro 0.15| solution 0.15| first 0.14|icked 0.14| ic 0.14|opera 0.14|
6: пута 0.28|nt 0.26|archivi 0.24|embros 0.21| sier 0.21|Sito 0.21|阳 0.21| also 0.2|

[...]

unembed LN 18 l_out-18:
0:  Unterscheidung 1.1e+02| Hinweis 1.1e+02| nobody 1e+02| sierp 1e+02| everybody 1e+02| Einzeln 98| kwiet 97| Begriffe 95|
1:  6.9|penas 6.2| following 5.6|odor 5.5| článku 5| purpose 4.7|Ḩ 4.6| Following 4.6|
2: ization 7.5|ist 6.5| pun 6.4|isation 6| city 5.7|ized 5.7| letters 5.6|ists 5.2|
3:  France 4.5|flow 4.4| flows 3.9| Germany 3.8| Australia 3.7| United 3.6|imo 3.5| Italy 3.5|
4: ped 3.3|amb 3|ira 3|ade 2.9| conserv 2.8|ung 2.8|ew 2.8| pied 2.7|
5: ese 6.3|eses 4.7| Une 4|esen 3.9|imation 3.5|fen 3.5|amer 3.4| abbre 3.3|
6:  capital 6.3| Tokyo 5.3| Capital 4.9| capit 4.8| called 4.3| city 4.3| cities 4.2| Hinweis 3.9|

Attention head zero-ablation

Zeroing the output of an attention head is useful for verifying whether it is responsible for a certain behavior. For example, see "Mechanistically interpreting time in GPT-2 small".
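
Conceptually, zero-ablation amounts to something like the following numpy sketch (the per-head layout and sizes are stand-ins, not the actual hook code):

import numpy as np

# Attention output viewed per head: (n_tokens, n_heads, d_head), with stand-in sizes.
n_tokens, n_heads, d_head = 8, 32, 128
attn_out = np.random.randn(n_tokens, n_heads, d_head)

# Zero-ablate head 24: its contribution to the residual stream is removed
# at every token position, while all other heads stay untouched.
attn_out[:, 24, :] = 0.0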

-a INDEXES, --ablate INDEXES — zero-ablate the attention heads whose indexes are given in a comma-separated list

./mia -m llama2.gguf --prompt "William Shakespeare was born in the year" -n 5 --ablate 0,1,2,3,4,5,6,7,8,48,49,50,60,175,180,190,200 --draw ~/tmp/llama_vis_a.png

The effect of the ablation can be inspected in the visualized attention maps.

-s LAYER INDEX, --select LAYER INDEX — for a specific layer, zero all attention heads except the one with the specified index. For example, leave only head 24 on layer 16:

./mia -m llama2.gguf --prompt "William Shakespeare was born in the year" -n 5 --select 16 24 --draw ~/tmp/llama_vis_a.png

Saving tensors

--save NAME PATH — save any tensor to a file

./mia -m llama2.gguf --prompt "5 6 7 to 7 6 5, 2 3 4 to 4 " -n 3 --save l_out-14 ~/tmp/l_out-14-2

Loading (patching) tensors

Activation patching is useful for analyzing connections between components of a transformer. For example, see "How to Think About Activation Patching".

--patch NAME PATH PATCH_FROM_IDX PATCH_TO_IDX — patch the model’s tensor NAME at a specific token index with values loaded from the file PATH. The number of tokens in the loaded tensor may differ from the current number of tokens.

--patch-avg NAME PATH1 PATH2 PATCH_FROM_IDX PATCH_TO_IDX — same as above, but the values are loaded from two files and averaged. The two input tensors must have the same dimensions.
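
A rough numpy sketch of what token-level patching does; the names, sizes, and the averaging variant below are illustrative rather than the actual implementation:

import numpy as np

# Stand-in residual-stream activations for the current run: (n_tokens, d_model).
n_tokens, d_model = 8, 4096
l_out = np.random.randn(n_tokens, d_model)

# Stand-ins for activations loaded from previously saved files;
# the saved run may have a different number of tokens.
saved_1 = np.random.randn(12, d_model)
saved_2 = np.random.randn(12, d_model)

patch_from_idx, patch_to_idx = 11, 3   # token position in the saved run -> position in the current run

# --patch: overwrite one token's activation with the saved value.
l_out[patch_to_idx] = saved_1[patch_from_idx]

# --patch-avg: the same, but with the average of two saved runs.
l_out[patch_to_idx] = (saved_1[patch_from_idx] + saved_2[patch_from_idx]) / 2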