Glue Labs
Oct 22, 2022

What is Attention Mechanism in Deep Learning?

To understand attention mechanism, let us first take a look at the concept of attention.

The concept of attention in deep learning (DL) emerged from natural language processing (NLP) applications, most notably machine translation.

Alex Graves, a lead AI research scientist at DeepMind, the renowned AI research lab, gave an indirect definition of attention in his 2020 DeepMind lecture:

"Attention is memory per unit of time."

So attention arose from real-world AI applications that deal with time-varying data.

In terms of machine learning concepts, such collections of data are known as sequences.

The earliest machine learning setting from which the attention mechanism derives its concepts is the Sequence to Sequence (Seq2Seq) learning model.

Let us now understand how the attention mechanism works.

Attention is one of the most researched concepts in the domain of deep learning for problems such as neural machine translation and image captioning.

Several supporting concepts help explain the attention mechanism as a whole, such as Seq2Seq models, encoders, decoders, hidden states, and context vectors. (We'll be exploring all of these in detail in our upcoming posts.)

In simple terms, attention refers to focusing on a certain component of the input problem and taking greater notice of it.

The DL-based attention mechanism works the same way: it directs focus toward, and pays greater attention to, specific parts of the input while processing the data relevant to a problem.

Let us consider a sentence in English: "I hope you are doing well". Our goal is to translate this sentence into Spanish.

So, while the input sequence is the English sentence, the output sequence is supposed to be "Espero que lo estés haciendo bien".

For each word in the output sequence, the attention mechanism maps it to the relevant words in the input sequence. So, "Espero" in the output sequence will be mapped to "I hope" in the input sequence.

Higher 'weights', or relevance scores, are assigned to the input sequence words that matter most for the output word currently being produced.

This improves the accuracy of the output prediction, since the model can draw on the most relevant parts of the input at every step.
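
To make the weighting idea concrete, here is a minimal sketch of dot-product attention in Python with NumPy. The encoder states and the decoder query below are hand-made toy vectors rather than learned ones; only the score → softmax → weighted-sum steps reflect the actual mechanism.

```python
import numpy as np

english_words = ["I", "hope", "you", "are", "doing", "well"]

# Toy "encoder hidden states": one hand-made 4-d vector per input word.
encoder_states = np.array([
    [1.0, 0.0, 0.0, 0.0],   # I
    [0.0, 1.0, 0.0, 0.0],   # hope
    [0.0, 0.0, 1.0, 0.0],   # you
    [0.0, 0.0, 0.0, 1.0],   # are
    [0.5, 0.0, 0.5, 0.0],   # doing
    [0.0, 0.5, 0.0, 0.5],   # well
])

# Toy "decoder state" while producing the output word "Espero":
# deliberately close to the vectors for "I" and "hope".
decoder_query = np.array([2.0, 2.0, 0.1, 0.1])

# 1. Score each input word against the current decoder state (dot product).
scores = encoder_states @ decoder_query

# 2. Softmax turns the scores into attention weights that sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# 3. The context vector is the weighted average of the encoder states.
context_vector = weights @ encoder_states

for word, weight in zip(english_words, weights):
    print(f"{word:>6}: {weight:.2f}")
```

Because the query is deliberately built to resemble the vectors for "I" and "hope", those two words receive the largest attention weights, mirroring how "Espero" attends mostly to "I hope" in the translation example.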