In this article, we'll look at attention variants that were introduced after basic multi-head attention, each modifying it in some way. We'll cover their basic architectures, how they differ, and what changes each one makes.


Multi-Head Attention :-

This is the most basic type of attention and the one used most often. The foundational idea is simple: the total embedding dimension is split into a fixed number of heads, each with its own head dimension. For example,

input x dim → (batch_size, seq_len, total_dim)

broken down into → (batch_size, seq_len, no_of_heads, head_dim)
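The split above, followed by per-head scaled dot-product attention, can be sketched in NumPy. The sizes (batch of 2, sequence length 10, 512 dimensions over 8 heads) are hypothetical, and q, k, v stand in for the outputs of the usual linear projections:

```python
import numpy as np

# Hypothetical sizes for illustration
batch_size, seq_len, total_dim = 2, 10, 512
no_of_heads = 8
head_dim = total_dim // no_of_heads  # 512 / 8 = 64

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

# Pretend q, k, v already came out of their linear projections
rng = np.random.default_rng(0)
q = rng.standard_normal((batch_size, seq_len, total_dim))
k = rng.standard_normal((batch_size, seq_len, total_dim))
v = rng.standard_normal((batch_size, seq_len, total_dim))

def split_heads(t, n_heads):
    b, s, d = t.shape
    # (batch, seq, dim) -> (batch, heads, seq, head_dim)
    return t.reshape(b, s, n_heads, d // n_heads).transpose(0, 2, 1, 3)

qh, kh, vh = (split_heads(t, no_of_heads) for t in (q, k, v))

# Scaled dot-product attention, computed independently per head
scores = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
out = softmax(scores) @ vh  # (batch, heads, seq, head_dim)

# Merge the heads back into the model dimension
out = out.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, total_dim)
```

Note that the split costs nothing extra: each head just attends over its own 64-dimensional slice, and the results are concatenated back to the original 512 dimensions.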


Grouped Query Attention :-

The basic idea of grouped query attention is to share a group of query heads with each key/value head. Instead of using the same number of heads for queries, keys, and values, we use a larger number of query heads and a smaller number of key/value heads. For example,

input x dim → (batch_size, seq_len, total_dim)

broken down into →

query: (batch_size, seq_len, no_q_heads, head_dim)

key & values: (batch_size, seq_len, no_kv_heads, head_dim)

After this, the keys and values are repeated so that their head count matches the number of query heads, and attention then proceeds exactly as in multi-head attention.
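The repetition step can be sketched as follows. The head counts here (8 query heads sharing 2 key/value heads, so groups of 4) are hypothetical, chosen only to make the shapes concrete:

```python
import numpy as np

# Hypothetical sizes for illustration
batch_size, seq_len, head_dim = 2, 10, 64
no_q_heads, no_kv_heads = 8, 2
group_size = no_q_heads // no_kv_heads  # 4 query heads share each kv head

rng = np.random.default_rng(0)
q = rng.standard_normal((batch_size, seq_len, no_q_heads, head_dim))
k = rng.standard_normal((batch_size, seq_len, no_kv_heads, head_dim))
v = rng.standard_normal((batch_size, seq_len, no_kv_heads, head_dim))

# Repeat each kv head group_size times along the head axis so the
# key/value shapes line up with the query heads
k_rep = np.repeat(k, group_size, axis=2)  # (batch, seq, no_q_heads, head_dim)
v_rep = np.repeat(v, group_size, axis=2)

# From here attention runs exactly as in standard multi-head attention
qh = q.transpose(0, 2, 1, 3)
kh = k_rep.transpose(0, 2, 1, 3)
scores = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
```

The point of sharing is memory: only no_kv_heads worth of keys and values need to be projected and cached, while the repeat is a cheap broadcast at attention time.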