Self-attention

Self-attention is an attention mechanism in which each token in a sequence computes its representation by attending to all tokens in the same sequence, including itself. This lets the model capture contextual relationships and dependencies regardless of their distance in the sequence. Mathematically, for an input sequence transformed into Query (Q), Key (K), and Value (V) matrices, self-attention is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_{att}}}\right)V$$

where the similarity between queries and keys determines the attention weights, and these weights are used to compute a weighted sum of the value vectors. Each token can therefore focus on the most relevant parts of the sequence and build a richer, contextualized representation.


How Attention Mechanism Works

You are given a sequence:

Past data → x₁, x₂, x₃, x₄, x₅, ..., xₜ
Goal → predict something at time t

Traditional thinking
Treat all past equally:

$$h_t = \frac{1}{t}\sum_{\tau=1}^{t} x_\tau$$

But reality says:
Not all past is equally useful.

Human intuition
Imagine you’re analyzing markets today.
You don’t think:

“Let me average the last 100 days”

Instead you think:

“When did the market behave like this before?”

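| Day | Market Condition | Should it matter?  |
|-----|------------------|--------------------|
| 1   | calm uptrend     | ❌ low importance  |
| 2   | calm uptrend     | ❌ low importance  |
| 3   | slight drop      | ⚠️ medium          |
| 4   | recovery         | ⚠️ medium          |
| 5   | volatility       | ✅ high            |
| 6   | crash            | 🔥 very high       |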

Your brain assigns weights

Day1 → 0.05
Day2 → 0.05
Day3 → 0.15
Day4 → 0.15
Day5 → 0.25
Day6 → 0.35

You are doing:

  1. Compare present with past
  2. Measure similarity
  3. Assign importance
  4. Combine information
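
The four steps above can be sketched in a few lines of NumPy. The feature vectors `past` and `present` are made-up illustrative values (they are not part of the example above), and the dot product is only one possible choice of similarity measure.

```python
import numpy as np

# Hypothetical 2-feature description of each past day: [return, volatility]
past = np.array([
    [ 0.01, 0.1],   # day 1: calm uptrend
    [ 0.01, 0.1],   # day 2: calm uptrend
    [-0.01, 0.3],   # day 3: slight drop
    [ 0.01, 0.3],   # day 4: recovery
    [-0.02, 0.8],   # day 5: volatility
    [-0.05, 1.0],   # day 6: crash
])
present = np.array([-0.04, 0.9])  # today's (hypothetical) market state

# Steps 1-2: compare the present with every past day via a dot product
scores = past @ present

# Step 3: turn similarities into importance weights (softmax)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Step 4: combine the past days using those weights
summary = weights @ past

print(weights.round(3))
print(summary.round(3))
```

With these made-up numbers the crash day receives the largest weight and the calm days the smallest, which mirrors the ordering of the hand-assigned weights.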

Attention is a mechanism that learns where to look in the input by adaptive pattern matching over history.


How Attention Decides What to Focus On

We now formalize attention as a mathematical operator.

1. Problem Setup

We are given a sequence:

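$$X = \{x_1, x_2, \dots, x_T\}, \quad x_t \in \mathbb{R}^d$$

Goal: for each time step $t$, compute a representation $h_t$.

$h_t$ should not depend only on $x_t$, but should incorporate relevant information from other time steps.

So we aim for:

$$h_t = \sum_{\tau=1}^{T} \alpha_{t,\tau} x_\tau$$

where $\alpha_{t,\tau}$ is the weight that time step $t$ places on time step $\tau$.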

2. Learnable Projections

Instead of comparing raw inputs, we project them into three spaces:

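$$q_t = W_Q x_t, \quad k_\tau = W_K x_\tau, \quad v_\tau = W_V x_\tau$$

where the per-step vectors are stacked as rows:

$$Q = \begin{bmatrix} q_1^\top \\ q_2^\top \\ \vdots \\ q_T^\top \end{bmatrix}, \quad K = \begin{bmatrix} k_1^\top \\ k_2^\top \\ \vdots \\ k_T^\top \end{bmatrix}, \quad V = \begin{bmatrix} v_1^\top \\ v_2^\top \\ \vdots \\ v_T^\top \end{bmatrix}$$

Here:

$$W_Q, W_K, W_V \in \mathbb{R}^{d_{att} \times d}$$

With the inputs stacked as rows of $X \in \mathbb{R}^{T \times d}$, this reads $Q = X W_Q^\top$. The code and worked examples in these notes store the weights as $d \times d_{att}$ matrices and write $Q = X W_Q$, which is the same map with the weight matrix transposed.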
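
A short check of the shapes involved may help here. It assumes the inputs are stacked as rows of a matrix `X` and uses random values purely for illustration. It shows that projecting each step with $q_t = W_Q x_t$ gives the same result as the single matrix product `X @ W_Q.T`.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, d_att = 4, 3, 2                 # illustrative sizes

X = rng.normal(size=(T, d))           # rows are x_1 ... x_T
W_Q = rng.normal(size=(d_att, d))     # shape d_att x d, as in the text

# Per-step projection q_t = W_Q x_t, stacked into rows
Q_loop = np.stack([W_Q @ X[t] for t in range(T)])

# Same result in one matrix product
Q_mat = X @ W_Q.T

print(Q_mat.shape)                    # (4, 2)
print(np.allclose(Q_loop, Q_mat))     # True
```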


3. Similarity Function

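We compute the similarity between time steps $t$ and $\tau$ as a dot product:

$$s_{t,\tau} = q_t^\top k_\tau$$

$$q_t^\top k_\tau = \|q_t\| \, \|k_\tau\| \cos(\theta)$$

where $\theta$ is the angle between $q_t$ and $k_\tau$.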
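
To make the geometric reading concrete, here is a small sketch with hand-picked vectors (illustrative values only). The helper `cos_angle` is a hypothetical name introduced for this example.

```python
import numpy as np

q = np.array([1.0, 1.0])
k_same = np.array([2.0, 2.0])     # points in the same direction as q
k_orth = np.array([1.0, -1.0])    # perpendicular to q

def cos_angle(a, b):
    """Cosine of the angle between two vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Aligned vectors give a large score, perpendicular ones give zero
print(q @ k_same, cos_angle(q, k_same))
print(q @ k_orth, cos_angle(q, k_orth))
```

Because the score also scales with the vector norms, a long key can score highly even at a moderate angle.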

4. Scaling

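We scale the dot product:

$$\tilde{s}_{t,\tau} = \frac{q_t^\top k_\tau}{\sqrt{d_{att}}}$$

Without scaling: if the components of $q_t$ and $k_\tau$ are roughly independent with unit variance, the variance of $q_t^\top k_\tau$ grows in proportion to $d_{att}$. For large $d_{att}$ the scores become large in magnitude, the softmax saturates toward a one-hot distribution, and its gradients become very small.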
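
The effect of the scaling factor can be checked empirically. This sketch assumes queries and keys with independent standard-normal components, which is an idealization rather than a property of trained models.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000                      # random (q, k) pairs per dimension
results = {}

for d_att in (4, 64, 256):
    q = rng.normal(size=(n, d_att))
    k = rng.normal(size=(n, d_att))
    raw = np.sum(q * k, axis=1)          # unscaled dot products
    scaled = raw / np.sqrt(d_att)        # scaled as in the formula above
    results[d_att] = (raw.std(), scaled.std())
    print(d_att, round(raw.std(), 2), round(scaled.std(), 2))
```

Under this assumption the spread of the raw scores grows roughly like $\sqrt{d_{att}}$, while the scaled scores keep a spread close to 1 for every dimension.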

5. Softmax Normalization

Convert scores into probabilities:

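$$\alpha_{t,\tau} = \frac{\exp(\tilde{s}_{t,\tau})}{\sum_{\tau'=1}^{T} \exp(\tilde{s}_{t,\tau'})}$$

$$\alpha_{t,\tau} \ge 0, \quad \sum_{\tau=1}^{T} \alpha_{t,\tau} = 1$$

So for each $t$, $\alpha_{t,\tau}$ is a probability distribution over all timesteps.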
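
The two properties above can be verified on a small matrix of scores. The values in `S_tilde` are hypothetical. The sketch also shows that adding a constant to every score in a row leaves the weights unchanged, which is why subtracting the row maximum before exponentiating is safe.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

S_tilde = np.array([[0.0, 1.0, 2.0],
                    [5.0, 5.0, 5.0]])   # hypothetical scaled scores
alpha = softmax(S_tilde)

print(alpha.round(3))
print(alpha.sum(axis=-1))                             # each row sums to 1
print(np.allclose(alpha, softmax(S_tilde + 100.0)))   # shift invariance
```

Equal scores (second row) give uniform weights, while larger scores (first row) receive exponentially more weight.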

6. Output Computation

Final representation:

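$$h_t = \sum_{\tau=1}^{T} \alpha_{t,\tau} v_\tau$$

$$h_t = \sum_{\tau=1}^{T} \text{softmax}_\tau\left(\frac{q_t^\top k_\tau}{\sqrt{d_{att}}}\right) v_\tau$$

Matrix Form:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_{att}}}\right)V$$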
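
As a sanity check, the per-time-step sum and the matrix form should produce the same output. The sketch below compares them on random `Q`, `K`, `V` (illustrative values, with the softmax taken row-wise).

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_att = 5, 4
Q = rng.normal(size=(T, d_att))
K = rng.normal(size=(T, d_att))
V = rng.normal(size=(T, d_att))

# Per-time-step form: h_t = sum over tau of alpha[t, tau] * v_tau
H_loop = np.zeros((T, d_att))
for t in range(T):
    s = K @ Q[t] / np.sqrt(d_att)     # scores of step t against every tau
    a = np.exp(s - s.max())
    a /= a.sum()                      # attention weights for step t
    H_loop[t] = a @ V

# Matrix form: softmax(Q K^T / sqrt(d_att)) V
S = Q @ K.T / np.sqrt(d_att)
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
H_mat = A @ V

print(np.allclose(H_loop, H_mat))     # True
```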


At time $t$:

$$h_t = \sum_{\tau} \alpha_{t,\tau} v_\tau$$

means:

construct today’s signal as a weighted combination of past regimes, where weights depend on similarity to current conditions

NumPy Implementation

import numpy as np

def softmax(x):
    # subtract max for numerical stability
    x = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """
    X: (T, d)
    W_Q, W_K, W_V: (d, d_attn)
    """

    # Step 1: Linear projections
    Q = X @ W_Q      # (T, d_attn)
    K = X @ W_K      # (T, d_attn)
    V = X @ W_V      # (T, d_attn)

    # Step 2: Similarity
    S = Q @ K.T      # (T, T)

    # Step 3: Scaling
    d_attn = Q.shape[1]
    S = S / np.sqrt(d_attn)

    # Step 4: Softmax
    A = softmax(S)   # (T, T)

    # Step 5: Output
    H = A @ V        # (T, d_attn)

    return H, A

Case Studies

Example

We are given a time series with 3 timesteps:

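$$X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}$$

We define projection matrices:

$$W_Q = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}, \quad W_K = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \quad W_V = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$$

Attention dimension: $d_{att} = 2$
Compute the final attention output matrix.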

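Step 1: Compute Q, K, V

$$Q = XW_Q = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 2 & 1 \end{bmatrix}, \quad K = XW_K = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 2 \end{bmatrix}, \quad V = XW_V = \begin{bmatrix} 1 & 0 \\ 0 & 2 \\ 1 & 2 \end{bmatrix}$$

Step 2: Compute Similarity Matrix

Transpose K:

$$K^\top = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 2 \end{bmatrix}$$

Similarity matrix:

$$S = QK^\top = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 2 \end{bmatrix}$$

$$\begin{aligned}
S_{1,1} &= 1\cdot 1 + 0\cdot 1 = 1 & S_{1,2} &= 1\cdot 0 + 0\cdot 1 = 0 & S_{1,3} &= 1\cdot 1 + 0\cdot 2 = 1 \\
S_{2,1} &= 1\cdot 1 + 1\cdot 1 = 2 & S_{2,2} &= 1\cdot 0 + 1\cdot 1 = 1 & S_{2,3} &= 1\cdot 1 + 1\cdot 2 = 3 \\
S_{3,1} &= 2\cdot 1 + 1\cdot 1 = 3 & S_{3,2} &= 2\cdot 0 + 1\cdot 1 = 1 & S_{3,3} &= 2\cdot 1 + 1\cdot 2 = 4
\end{aligned}$$

$$S = \begin{bmatrix} 1 & 0 & 1 \\ 2 & 1 & 3 \\ 3 & 1 & 4 \end{bmatrix}$$

Step 3: Scaling

$$\tilde{S} = \frac{S}{\sqrt{d_{att}}} = \frac{S}{\sqrt{2}} \approx \begin{bmatrix} 0.707 & 0 & 0.707 \\ 1.414 & 0.707 & 2.121 \\ 2.121 & 0.707 & 2.828 \end{bmatrix}$$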
Step 4: Softmax

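Softmax formula applied to each row $x$ of $\tilde{S}$:

$$\alpha_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Row 1

$$x = [0.707,\ 0,\ 0.707]$$

$$\begin{aligned}
\alpha_1 &= \frac{e^{0.707}}{e^{0.707} + e^{0} + e^{0.707}} = \frac{2.03}{2.03 + 1 + 2.03} = \frac{2.03}{5.06} \approx 0.401 \\
\alpha_2 &= \frac{e^{0}}{e^{0.707} + e^{0} + e^{0.707}} = \frac{1}{5.06} \approx 0.198 \\
\alpha_3 &= \frac{e^{0.707}}{e^{0.707} + e^{0} + e^{0.707}} = \frac{2.03}{5.06} \approx 0.401
\end{aligned}$$

$$\text{Row 1} = [0.401,\ 0.198,\ 0.401]$$

Row 2

$$x = [1.414,\ 0.707,\ 2.121]$$

$$\begin{aligned}
\alpha_1 &= \frac{e^{1.414}}{e^{1.414} + e^{0.707} + e^{2.121}} = \frac{4.11}{4.11 + 2.03 + 8.34} = \frac{4.11}{14.48} \approx 0.284 \\
\alpha_2 &= \frac{e^{0.707}}{e^{1.414} + e^{0.707} + e^{2.121}} = \frac{2.03}{14.48} \approx 0.140 \\
\alpha_3 &= \frac{e^{2.121}}{e^{1.414} + e^{0.707} + e^{2.121}} = \frac{8.34}{14.48} \approx 0.576
\end{aligned}$$

$$\text{Row 2} = [0.284,\ 0.140,\ 0.576]$$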

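Row 3

$$x = [2.121,\ 0.707,\ 2.828]$$

$$\begin{aligned}
\alpha_1 &= \frac{e^{2.121}}{e^{2.121} + e^{0.707} + e^{2.828}} = \frac{8.34}{8.34 + 2.03 + 16.92} = \frac{8.34}{27.29} \approx 0.306 \\
\alpha_2 &= \frac{e^{0.707}}{e^{2.121} + e^{0.707} + e^{2.828}} = \frac{2.03}{27.29} \approx 0.074 \\
\alpha_3 &= \frac{e^{2.828}}{e^{2.121} + e^{0.707} + e^{2.828}} = \frac{16.92}{27.29} \approx 0.620
\end{aligned}$$

$$\text{Row 3} = [0.306,\ 0.074,\ 0.620]$$

Final Attention Matrix:

$$A = \begin{bmatrix} 0.401 & 0.198 & 0.401 \\ 0.284 & 0.140 & 0.576 \\ 0.306 & 0.074 & 0.620 \end{bmatrix}$$

Step 5: Compute Output

$$A = \begin{bmatrix} 0.401 & 0.198 & 0.401 \\ 0.284 & 0.140 & 0.576 \\ 0.306 & 0.074 & 0.620 \end{bmatrix}, \quad V = \begin{bmatrix} 1 & 0 \\ 0 & 2 \\ 1 & 2 \end{bmatrix}, \quad h_t = \sum_{\tau=1}^{T} A_{t,\tau} v_\tau$$

Row 1

$$h_1 = 0.401\,[1, 0] + 0.198\,[0, 2] + 0.401\,[1, 2] = [0.401, 0] + [0, 0.396] + [0.401, 0.802] = [0.802,\ 1.198]$$

Row 2

$$h_2 = 0.284\,[1, 0] + 0.140\,[0, 2] + 0.576\,[1, 2] = [0.284, 0] + [0, 0.280] + [0.576, 1.152] = [0.860,\ 1.432]$$

Row 3

$$h_3 = 0.306\,[1, 0] + 0.074\,[0, 2] + 0.620\,[1, 2] = [0.306, 0] + [0, 0.148] + [0.620, 1.240] = [0.926,\ 1.388]$$

Final Output

$$H = \begin{bmatrix} 0.802 & 1.198 \\ 0.860 & 1.432 \\ 0.926 & 1.388 \end{bmatrix}$$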
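
The hand calculation can be cross-checked numerically. This sketch recomputes the example with NumPy using the same `X`, `W_Q`, `W_K`, and `W_V`. The results should agree with the worked values up to rounding in the last digit, since the hand calculation rounds intermediate quantities.

```python
import numpy as np

X = np.array([[1., 0.], [0., 1.], [1., 1.]])
W_Q = np.array([[1., 0.], [1., 1.]])
W_K = np.array([[1., 1.], [0., 1.]])
W_V = np.array([[1., 0.], [0., 2.]])

# Steps 1-3: projections, similarity, scaling
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
S = Q @ K.T / np.sqrt(Q.shape[1])

# Step 4: row-wise softmax
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# Step 5: weighted sum of values
H = A @ V

print(A.round(3))
print(H.round(3))
```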
Example

Market reacts to War News

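We have 5 timesteps with price + news features:

$$\begin{aligned}
x_1 &= [0.01,\ 0] && \text{(small return, no news)} \\
x_2 &= [0.02,\ 0] && \text{(uptrend continues)} \\
x_3 &= [-0.05,\ 1] && \text{(WAR news + crash)} \\
x_4 &= [-0.02,\ 0] && \text{(aftershock)} \\
x_5 &= [-0.04,\ 1] && \text{(WAR escalation)}
\end{aligned}$$

Feature meaning

  • First value = return
  • Second value = news signal (1 = war, 0 = none)

Value transformation

$$W_V = \begin{bmatrix} 1 & 0 \\ 0 & 3 \end{bmatrix}$$

So:

$$v_t = [\text{return},\ 3 \times \text{news}]$$

Compute:

$$h_5 = \sum_{\tau=1}^{5} \alpha_{5,\tau} v_\tau$$

At time 5, the model compares

$$x_5 = [-0.04,\ 1]$$

with every timestep, itself included.

Similarity reasoning

$x_5$ combines a war-news flag with a negative return, so it is most similar to $x_3$ and to itself, less similar to the aftershock $x_4$, and least similar to the calm days $x_1$ and $x_2$.

Approx attention weights

$$\alpha_5 \approx [0.05,\ 0.05,\ 0.35,\ 0.10,\ 0.45]$$

Compute values

$$v_1 = [0.01,\ 0], \quad v_2 = [0.02,\ 0], \quad v_3 = [-0.05,\ 3], \quad v_4 = [-0.02,\ 0], \quad v_5 = [-0.04,\ 3]$$

Output:

$$h_5 = 0.05\,v_1 + 0.05\,v_2 + 0.35\,v_3 + 0.10\,v_4 + 0.45\,v_5$$

Result

$$h_5 \approx [-0.036,\ 2.4]$$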
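
The weighted sum for $h_5$ can be reproduced directly. This sketch takes the approximate weights as given rather than deriving them from learned $W_Q$ and $W_K$. It enters the crash, aftershock, and escalation returns as losses (negative values), which is an assumption about the intended signs.

```python
import numpy as np

# Rows are x_1 ... x_5 = [return, news]; loss days entered as negative returns
X = np.array([
    [ 0.01, 0.0],   # small return, no news
    [ 0.02, 0.0],   # uptrend continues
    [-0.05, 1.0],   # war news + crash
    [-0.02, 0.0],   # aftershock
    [-0.04, 1.0],   # war escalation
])
W_V = np.array([[1.0, 0.0],
                [0.0, 3.0]])

# Approximate attention weights for t = 5, taken from the text
alpha_5 = np.array([0.05, 0.05, 0.35, 0.10, 0.45])

V = X @ W_V            # v_t = [return, 3 * news]
h_5 = alpha_5 @ V      # weighted sum of the value vectors

print(h_5.round(3))
```

The second component depends only on the two news days: (0.35 + 0.45) × 3 = 2.4.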

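Intuition

The two war-news days carry 0.35 + 0.45 = 0.80 of the attention weight, so $h_5$ is dominated by the news signal: its second component is 0.80 × 3 = 2.4.

Key takeaway

$$h_t \neq x_t, \quad h_t = \text{context-aware regime representation}$$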

PyTorch Implementation

import torch
import torch.nn.functional as F

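def attention_torch(X, W_Q, W_K, W_V):
    """
    X: (..., T, d)
    W_Q, W_K, W_V: (d, d_attn)
    """
    # Linear projections
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V

    d_attn = Q.size(-1)

    # Scaled dot-product similarity; transposing the last two dims
    # keeps this valid for batched inputs
    S = Q @ K.transpose(-2, -1)
    S = S / d_attn ** 0.5

    # Row-wise softmax, then weighted sum of the values
    A = F.softmax(S, dim=-1)
    H = A @ V

    return H, A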