An Insight into BERT Attention

Neel K.
Published in Analytics Vidhya
Feb 4, 2020 · 4 min read


When we are asked to find the answer to a question in a book, we search purposefully and jump straight to the relevant passage instead of reading the whole book, which saves time. A similar scheme in machines helps improve the localization and precision of a phrase, but at the cost of compute and time.

Attention mechanisms are now commonly used in many NLP tasks. Attention lets a model focus on the most relevant parts of its input, much as humans do. The year 2017 was pivotal for machine learning, and especially for natural language processing: Transformer models changed how NLP tasks are approached. Previously, for tasks such as summarization and question answering, a common implementation was dot-based attention (flat attention).
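To make the idea concrete, here is a minimal NumPy sketch of dot-based attention (the function name and toy data are illustrative, not from this article): one query vector is scored against every key, the scores are normalized with a softmax, and the output is a weighted sum of the values.

import numpy as np

def dot_product_attention(query, keys, values):
    # Score the query against every key, normalize with softmax,
    # and return the attention-weighted sum of the values.
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ values, weights

# Toy example: 4 "tokens" represented by 8-dimensional vectors
rng = np.random.default_rng(0)
keys = values = rng.normal(size=(4, 8))
context, weights = dot_product_attention(keys[0], keys, values)
print(weights)  # how strongly the first token attends to each of the 4 tokens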

A Neat Vision demonstration

BERT (Bidirectional Encoder Representations from Transformers) has multi-head attention modules in each of its 12 layers. Each layer has 12 attention heads that show how a token attends to the other tokens in the sentence. This analysis is not limited to regular tokens; it also covers the special tokens {[CLS], [SEP]}.
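As a quick sketch (assuming the standard transformers API rather than the exact code used later in this article), the attention tensors can be pulled out of a pre-trained BERT to confirm the 12-layer, 12-head structure:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

inputs = tokenizer.encode_plus("This is a testing sentence", add_special_tokens=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Recent transformers versions expose attentions as an attribute;
# older versions return them as the last element of a tuple.
attentions = outputs.attentions
print(len(attentions))       # 12 layers for bert-base
print(attentions[0].shape)   # (batch, 12 heads, seq_len, seq_len)
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))  # includes [CLS] and [SEP]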

Here is a diagram showing attention scores rendered as heatmaps, produced with Neat Vision. Text with higher importance is drawn in a stronger color, and less important text in a lighter one.
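Neat Vision is a web tool, but purely as an illustration the same kind of heatmap can be drawn with matplotlib from the attention tensors obtained in the sketch above (the layer and head indices are arbitrary):

import matplotlib.pyplot as plt

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
attn = attentions[0][0, 0].numpy()   # layer 0, head 0: (seq_len, seq_len)

plt.imshow(attn, cmap='viridis')     # brighter cells mark stronger attention
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar(label='attention weight')
plt.show()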

The transformers library offers many pre-trained models, available from the Hugging Face model hub. These models can be used for multiple tasks and fine-tuned further for any downstream task. Given the complex architecture of BERT, visualizing its learned weights used to be a nightmare. Fortunately, the attention visualization tool from Tensor2Tensor was extended by Jesse Vig into BertViz. It draws attention lines from each word on the left to the words on the right, with varying colors and thicknesses. A researcher can choose a layer and an attention head to inspect the model. We will explore the various patterns below.
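For example (an illustrative sketch, not code from this article), a checkpoint from the hub can be loaded by name and wrapped with a task head ready for fine-tuning:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative names; any hub checkpoint can be substituted for 'bert-base-uncased'
clf_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
clf_model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# clf_model can now be fine-tuned on a downstream task such as sentence classification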

BertViz:

# Install dependencies (run in a notebook or shell)
!pip install regex
!pip install transformers

# Load a pre-trained BERT model and tokenizer through BertViz's neuron-view wrappers
# (requires the bertviz_repo clone shown in the Neuron View section to be on the path)
from bertviz.transformers_neuron_view import BertModel, BertTokenizer

model_version = 'bert-base-uncased'
model = BertModel.from_pretrained(model_version)
tokenizer = BertTokenizer.from_pretrained(model_version)

1. Neuron View

It shows the attention of every word in a sentence to every other word using weighted lines. The figure depicts how attention for one token is computed against all other tokens in the sentence; thick lines represent higher weights than thin ones. A further breakdown can be seen in the image below, where the query and key vectors are used to derive these scores.

import sys

# Clone the BertViz repository if it is not already present and add it to the path
!test -d bertviz_repo && echo "FYI: bertviz_repo directory already exists, to pull latest version uncomment this line: !rm -r bertviz_repo"
!test -d bertviz_repo || git clone https://github.com/jessevig/bertviz bertviz_repo
if 'bertviz_repo' not in sys.path:
    sys.path += ['bertviz_repo']

def call_html():
    # Load the JavaScript libraries (require.js, d3, jquery) that BertViz needs
    # to render its visualizations inside a notebook cell
    import IPython
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
    '''))

from bertviz.neuron_view import show

sentence = "This is a testing sentence"
call_html()
show(model, 'bert', tokenizer, sentence)
A deep structural view in Neuron Visualization

2. Model View

It presents one grid of thumbnails covering all layers and attention heads. If you look closely, you can find the same line pattern as the Neuron View figure in the last row of this view.

from bertviz import model_view

def call_html():
    # Same JavaScript setup as in the Neuron View section
    import IPython
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
    '''))

call_html()
# attentions and tokens are obtained from the model as shown in the Head View section below
model_view(attentions, tokens)
A Model View from BertViz

3. Head View

It is a somewhat more complex visualization of a single layer across all attention heads at the same time. For every token, the attention lines are colored according to the corresponding heads, and selecting individual tokens shows color bars for the heads attending to them.

from bertviz import head_view

# Encode the sentence and run it through the model to get the attention weights
input_ = tokenizer.encode_plus(sentence, add_special_tokens=True, return_tensors='pt')
output, loss, attentions = model(input_['input_ids'],
                                 token_type_ids=input_['token_type_ids'],
                                 attention_mask=input_['attention_mask'])
tokens = tokenizer.convert_ids_to_tokens(input_['input_ids'][0])

call_html()
head_view(attentions, tokens)

References:

All code can be found at the official repository here.
