Step through layer normalization: watch a single token's feature vector being normalized across its dimensions, as is done in standard Transformer blocks.
Per-Sample Feature Normalization
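LayerNorm normalizes across the features of each sample independently, whereas BatchNorm normalizes each feature across the samples in a batch. Because LayerNorm's mean and variance come from a single sample, its output does not depend on batch size, which is essential for Transformers.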
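The per-sample computation can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `layer_norm`, the optional scale and shift parameters `gamma` and `beta`, the stabilizer `eps`, and the example values are all assumptions introduced here.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one token's feature vector across its dimensions.

    gamma, beta and eps are illustrative assumptions: an optional
    per-feature scale and shift, and a small constant for stability.
    """
    d = len(x)
    mean = sum(x) / d                           # mean over features
    var = sum((v - mean) ** 2 for v in x) / d   # biased variance over features
    inv_std = 1.0 / math.sqrt(var + eps)
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# Hypothetical 4-dimensional token vector.
token = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(token)

# Each token is normalized using only its own statistics, so placing it in a
# batch alongside other tokens leaves its result unchanged.
batch = [token, [100.0, -50.0, 0.0, 3.0]]
normed_batch = [layer_norm(t) for t in batch]
```

Here `normed` has approximately zero mean and unit variance across its four features, and `normed_batch[0]` equals `normed` regardless of what else is in the batch.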