How Self Attention works in Transformer

In the previous post I tried to show some aspects of the Transformer. Now I will focus on self-attention. If you want to warm up, check the previous post to get more of an idea about the transformer architecture. ( Previous Post Link )

In an RNN encoder-decoder, the encoder summarizes the whole sentence, and at every step of the decoder we want the network to use this summary. In a transformer, we instead give the decoder all the encoder steps, so the network can learn which part is most important at each step. In other words, in an RNN we hand over one final vector built up over the steps; in a transformer we hand over all the vectors created at all steps.

As in the previous post, I will visualize the steps of learning attention. At first the network creates essentially random attention vectors. Then, by checking the loss, it updates the weights, and at the end you can see the network applying sensible attention to the words.

Translation is the process of generating the next word, starting from the input "sos" until the output "eos" is generated (or up to the maximum length we allow). So when you see an image like the one below, it shows the logits and attentions at each translation step. If the input sentence is "i can eat apple", the translation process has 5 steps: "ich konnen apfel essen <eos>".

Look at the 1st row of the table.

"1)sos->ich" shows the current prediction step: "sos" is the current word, "ich" is the word that will be generated at this step, and the colors show the most probable column. Other tutorials just write "sos", but I think this way is clearer.

If you check, "ich" has a score of 8.07, the highest probability. Check the other steps to understand this better.
The heatmap above shows the attention at all steps. At translation step 1 the attention is on "i", and at step 2 it is on "can". Check each step and see how the network applies attention. Since we generated "ich" at this step, in the 2nd row you see "2)ich->konnen": it means the 2nd step, where the decoder has generated [sos, ich] and will generate "konnen".

So let's walk through it step by step. In the translation process, the main block is as below.
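Below is a minimal sketch of such a greedy decoding loop (helper names like `model.make_src_mask` and `model.make_trg_mask` are assumptions in the style of common PyTorch transformer tutorials, not necessarily the exact code used here):

```python
import torch

def translate_sentence(model, src_indexes, sos_idx, eos_idx, max_len=50, device="cpu"):
    model.eval()
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(0).to(device)
    src_mask = model.make_src_mask(src_tensor)          # assumed helper
    with torch.no_grad():
        enc_src = model.encoder(src_tensor, src_mask)   # all encoder states at once
    trg_indexes = [sos_idx]                             # decoder starts with <sos>
    for _ in range(max_len):                            # one iteration per generated word
        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)
        trg_mask = model.make_trg_mask(trg_tensor)      # assumed helper
        with torch.no_grad():
            output, attention = model.decoder(trg_tensor, enc_src, trg_mask, src_mask)
        pred_token = output.argmax(2)[:, -1].item()     # most probable next word (the logits)
        trg_indexes.append(pred_token)
        if pred_token == eos_idx:                       # stop once <eos> is generated
            break
    return trg_indexes, attention
```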

All the encoder states are generated first, and the decoder is initialized with "sos". Now let's look at each iteration of the generation loop.

Step 1:
What will the network generate at this step?

We always showed the network samples of sentences:
subject + verb ( I eat apple, I drink water …)
subject + can + verb (I can eat apple)
subject + want to + verb ( I want to eat apple)

So at this step, subjects are most probable. Also, the encoder states contain "i", so at this step the network generates "ich" (check both the logits and the attention visualization).
Output = [<sos>,ich]

Step 2: Now we have [<sos>, ich] in the decoder. Through training, the network learned that at a decoder state like this, it is best to put attention on one of [verb, "can", "want"]. In our source we have both a verb and "can".
Through training it also learned that if there is a "can" or "want" in the encoder, it must be generated now. If you check the 2nd row of the logits, you see that all verbs (essen, lesen, trinken), "can" and "want" have high values, because during training all words generated at this (2nd) step were of these kinds.
Output = [<sos>,ich,konnen]

Step 3: Now we have ["<sos>", "ich", "konnen"]; our network was trained such that "konnen" is followed by an object. If you check the logits, the objects (apfel, brot, bier, wasser…) have high scores at this step. You will also see that the network now puts most of its attention on "apple", so it generates "apfel".
Output = [<sos>,ich,konnen,apfel]

Step 4: Now we have ["<sos>", "ich", "konnen", "apfel"]; our network was trained such that, in a pattern like this, the object is followed by a verb. From the encoder states we know our source has the verb "eat", so the network generates "essen".
Output = [<sos>,ich,konnen,apfel,essen]

Step 5: Now we have ["<sos>", "ich", "konnen", "apfel", "essen"]; our network was trained such that this state is a complete sentence: "eos" came after a state like this in the source. Think of it like this: after the sequence ["sos", "i", "can", "eat", "apple"], "eos" came. So after ["<sos>", "ich", "konnen", "apfel", "essen"], "eos" is most probable.
Output = [<sos>,ich,konnen,apfel,essen,<eos>]

Above was a step-by-step walkthrough of how things are generated with a transformer. Now let's try something different to check our understanding.

Theory: even if I change the order of the sentence, the network will still pay attention to the correct words. I will give the input sentence "eat apple can i", and I claim it will generate the correct output (at least the correct meaning). The attention calculation (Q, K, V) by itself is permutation invariant: it computes the same values even if you change the order. But here not only the attention but the whole network predicts the correct sentence; that is only because our data set is so small, so do not read too much into it.
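To see what permutation invariance means here, a small self-contained check with plain scaled dot-product attention (no positional encoding, and not the article's trained model): permuting the input tokens permutes the outputs in exactly the same way, so each token gets the same attention result regardless of order.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # scaled dot-product attention, no positional information
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(4, 8)                    # 4 tokens, embedding size 8
perm = torch.randperm(4)                 # shuffle the token order
out = attention(x, x, x)
out_shuffled = attention(x[perm], x[perm], x[perm])
print(torch.allclose(out[perm], out_shuffled, atol=1e-6))  # True
```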

Result of unordered sentences

For our small set, even when the order is changed, the attention lands on the correct positions (by the nature of attention), and we even get the correct output (because we trained the network strictly on the same patterns). This also shows our network is not simply copying items position by position.

Let's dive into attention a bit more: if attention properly applies weights to the proper locations, what happens if I change how the weights are applied? I will change the way the transformer applies attention. The attention is just a vector, so I will substitute some vectors of my own for what the network learned.

Below we pay equal attention to all words. Here we divide the attention evenly along the source-position dimension of the attention tensor, which means giving equal attention to every word: if that dimension is 5, every word gets attention 1/5.
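A short sketch of that override, assuming the attention weights are a tensor whose last dimension runs over source positions (where exactly this gets hooked into the decoder is an assumption):

```python
import torch

def uniform_attention(attention):
    # replace the learned weights with 1/src_len everywhere,
    # e.g. 1/5 when the source has 5 words
    src_len = attention.shape[-1]
    return torch.full_like(attention, 1.0 / src_len)
```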

It works for the small pattern "I can eat apple", but not for a more complicated case, as below. The attention chart is also all black, because the attention is equal everywhere. Since we pay equal attention at every step, the network generates "essen" twice. Why? After "ich", if you check the logits, "essen" and "apfel" always have big scores ("apfel" has a big score even at the 1st step). Since at every step we treat the encoder data as equally important, we cannot create a proper sentence; the network was not trained like this. The network always changed its attention according to the decoder states (according to the flow of the translation process), but here, even though the decoder states change, the attention does not, and that misleads it.

Next, what if we pay attention only to the beginning of the sentence? Can you guess the result? Since the model only uses attention on the beginning, the decoder only knows the beginning of the sentence, so it just makes a very bad guess. The sentences below only pay attention to the beginning, so the network has no idea about anything except "i". It is like hearing only the beginning of a sentence. Also note that it did not even generate the correct verb. Try "I can eat apple"!
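A similar hedged sketch for these hard overrides (same assumed tensor shape as before): all the weight on position 0 gives the "only the beginning" experiment, and position -1 gives the "only the last word" experiment tried further below.

```python
import torch

def one_hot_attention(attention, position=0):
    # put all attention mass on a single source position
    out = torch.zeros_like(attention)
    out[..., position] = 1.0
    return out
```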

If you check below, you can see the network generated "essen" twice. If you check the last row, you can see the network has a high score for "apfel".

Also, if you check the sentence below, the network has no idea about the sentence. At row 2, "mochten" has a high score, but "trinken" and "essen" have even higher ones. Since we never trained the network like this, it makes very bad guesses.

Another random try: what happens if we put attention only on the last word? The network jumps to the end of the sentence quickly. (But be careful, it did not jump immediately: it first generated "ich" and then "lesen"; since "lesen" was the last word in some training sentence, it then generated "eos".)

Now we have played with the attention vectors and have an idea of how the model reacts to our dummy changes to self-attention. Next, let's try enabling/disabling other parts of the network.

Architecture

Below I will disable or change some parts of the architecture to test the effect of the different modules of the network. Honestly, I did not write very neat code; we are just experimenting, so I did it in the easiest way: I put in exclusive paths for trying the effects of some components. Check the forward method of the Decoder below. I put lots of conditionals in to change the calculation.

I also put these kinds of conditional flows into the DecoderLayer; check them. My intention is to selectively apply or skip some parts.
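A minimal, runnable sketch of what such a conditional forward might look like (masks omitted for brevity; the structure and names follow common PyTorch transformer tutorials and are assumptions, not the article's exact code):

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, dropout):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(hid_dim, n_heads, dropout=dropout, batch_first=True)
        self.encoder_attention = nn.MultiheadAttention(hid_dim, n_heads, dropout=dropout, batch_first=True)
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.enc_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.positionwise_feedforward = nn.Sequential(
            nn.Linear(hid_dim, pf_dim), nn.ReLU(), nn.Linear(pf_dim, hid_dim))
        self.dropout = nn.Dropout(dropout)

    def forward(self, trg, enc_src, mode="default"):
        if mode == "self_attention_only":
            # only decoder self-attention: the source is never consulted
            _trg, _ = self.self_attention(trg, trg, trg)
            return self.self_attn_layer_norm(trg + self.dropout(_trg)), None
        # default: the full calculation (the "else" path)
        _trg, _ = self.self_attention(trg, trg, trg)
        trg = self.self_attn_layer_norm(trg + self.dropout(_trg))
        _trg, attention = self.encoder_attention(trg, enc_src, enc_src)
        trg = self.enc_attn_layer_norm(trg + self.dropout(_trg))
        _trg = self.positionwise_feedforward(trg)
        trg = self.ff_layer_norm(trg + self.dropout(_trg))
        return trg, attention
```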

Normally the calculation is as above, i.e. the else/default path. I tried different combinations by using only some parts of this calculation.

As you see above, the network is just wandering randomly, because it knows nothing about the source sentence. This is like translating a sentence without ever seeing the source.

2) Apply only decoder self-attention. Here again the decoder is not aware of the source sentence. So at the 1st step the most probable words are subjects ("ich", "wir"), at the 2nd step ("konnen", "mochten", verbs), and at the 3rd step objects ("apfel", "bier", "brot"…). As you see, it chooses the most probable word at every step using only the decoder-side weights, so it is in fact again doing something random (or, more precisely, taking the highest-weight item at each step).

3) Apply only dropout. In addition to the above, we apply dropout. Since dropout adds nothing at inference time, we get an even less reliable and stranger solution. But this step captures more of our source sentence. My guess is that, because dropout was part of the layers during training, it in fact prevents the network from making a totally random guess as above, so these guesses are somewhat better. In fact, if you check the logits, the network misses generating "eos" by a very small margin.

4) Apply only norm + self-attention. This gives the same output as sample 2, because the vectors involved are in fact the same.

5) Apply only the source encoding. Here I do something very dummy: what if I directly copy the source encoding to use at the decoder? It directly generates "eos".

Let's look at a more interesting case: below we skip the PositionwiseFeedforwardLayer. The network is still able to generate the correct answer for "want to", but for the other 2 cases it generates one more "apple". If you check the logits, the "eos" and "apple" probabilities are very similar. The positionwise layer is a layer with a ReLU, which applies the non-linearity at the end of the DecoderLayer.
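For reference, a typical positionwise feed-forward layer looks roughly like this (a standard formulation; the exact tutorial code may differ). Skipping it means the DecoderLayer's output misses its only ReLU non-linearity:

```python
import torch
import torch.nn as nn

class PositionwiseFeedforwardLayer(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()
        self.fc_1 = nn.Linear(hid_dim, pf_dim)   # expand
        self.fc_2 = nn.Linear(pf_dim, hid_dim)   # project back
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # ReLU is the non-linearity applied position by position
        return self.fc_2(self.dropout(torch.relu(self.fc_1(x))))
```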

I tried to show what the attention layers generate and how they change over time. I also disabled some parts of the network to show how they affect the output and the attention. The logic can be a bit difficult to understand at first, but in fact it is very simple.
