Question 1

What is the algorithm pattern for Cross Attention?

Accepted Answer

Query from Decoder, Keys/Values from Encoder: Cross attention lets the decoder look at the encoder's output at each generation step. Q comes from the decoder; K and V come from the encoder. This is how seq2seq Transformers condition generation on the input sequence.

Question 2

How do you solve Cross Attention step by step?

Accepted Answer

Q = decoder_state × W_Q  (what am I looking for?)
K = encoder_output × W_K  (what does each encoder token contain?)
V = encoder_output × W_V  (what to retrieve if attention is high?)
Scores = Q·K^T / √d_k  (scaled dot product)
Output = Softmax(Scores) · V  (weighted sum of encoder values)
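The steps above can be sketched in NumPy roughly as follows. This is a minimal single-head illustration, not a reference implementation: it assumes no batching, no masking, and randomly initialised projection matrices, and the sizes (d_model, d_k, d_v, T_dec, T_enc) and the function name cross_attention are chosen only for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_state, encoder_output, W_Q, W_K, W_V):
    """Single-head cross attention (illustrative; no batch or mask)."""
    Q = decoder_state @ W_Q            # (T_dec, d_k): queries from the decoder
    K = encoder_output @ W_K           # (T_enc, d_k): keys from the encoder
    V = encoder_output @ W_V           # (T_enc, d_v): values from the encoder
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (T_dec, T_enc): scaled dot product
    weights = softmax(scores, axis=-1) # each decoder row sums to 1 over encoder tokens
    return weights @ V, weights        # (T_dec, d_v), (T_dec, T_enc)

# Assumed toy sizes, for illustration only.
rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 4
T_dec, T_enc = 3, 5

decoder_state = rng.normal(size=(T_dec, d_model))
encoder_output = rng.normal(size=(T_enc, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

output, weights = cross_attention(decoder_state, encoder_output, W_Q, W_K, W_V)
print(output.shape, weights.shape)
```

Note that the score matrix is rectangular (decoder length by encoder length), which is one quick way to tell cross attention apart from self-attention, where it is square.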

Question 3

What are common mistakes when solving Cross Attention?

Accepted Answer

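A common mistake is taking K and V from the decoder: in cross attention only Q comes from the decoder, while K and V come from the encoder output; in self-attention all three come from the same sequence. Another is recomputing the encoder-side K and V at every decoding step: at inference the encoder output is fixed, so K and V can be computed once and cached for efficiency. Cross attention is what connects encoder and decoder in T5, BART, and other seq2seq Transformers.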
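The caching point can be illustrated with a small sketch. The class name CachedCrossAttention, its methods, and the kv_projections counter are hypothetical and exist only for this example; real libraries organise their caches differently. The sketch assumes a single head and one decoder vector per step.

```python
import numpy as np

class CachedCrossAttention:
    """Hypothetical helper: projects encoder K/V once, reuses them per step."""

    def __init__(self, W_Q, W_K, W_V):
        self.W_Q, self.W_K, self.W_V = W_Q, W_K, W_V
        self.K = None
        self.V = None
        self.kv_projections = 0  # counts how often K/V were computed

    def set_encoder_output(self, encoder_output):
        # The encoder output does not change during decoding,
        # so K and V are projected a single time here.
        self.K = encoder_output @ self.W_K   # (T_enc, d_k)
        self.V = encoder_output @ self.W_V   # (T_enc, d_v)
        self.kv_projections += 1

    def step(self, decoder_state):
        # decoder_state: (d_model,) vector for the current decoding step.
        if self.K is None:
            raise RuntimeError("call set_encoder_output() before step()")
        q = decoder_state @ self.W_Q                  # (d_k,)
        scores = self.K @ q / np.sqrt(q.shape[-1])    # (T_enc,)
        scores = scores - scores.max()                # numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum()
        return weights @ self.V                       # (d_v,)

# Assumed toy sizes, for illustration only.
rng = np.random.default_rng(1)
d_model, d_k, d_v, T_enc = 8, 4, 4, 5
attn = CachedCrossAttention(
    rng.normal(size=(d_model, d_k)),
    rng.normal(size=(d_model, d_k)),
    rng.normal(size=(d_model, d_v)),
)
attn.set_encoder_output(rng.normal(size=(T_enc, d_model)))

# Four decoding steps reuse the same cached K and V.
step_outputs = [attn.step(rng.normal(size=d_model)) for _ in range(4)]
print(attn.kv_projections, len(step_outputs))
```

Only the query projection runs on each step; the two encoder-side projections are paid once per input sequence.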
