
Behind the Magic: How Tensors Drive Transformers

Introduction

Transformers have changed the way artificial intelligence works, especially in understanding language and learning from data. At the core of these models are tensors: multi-dimensional generalizations of matrices that carry the data the model processes. As data moves through the different parts of a Transformer, these tensors undergo a series of transformations that help the model make sense of things like sentences or images. Learning how tensors work inside Transformers can help you understand how today's most capable AI systems actually work.

What This Article Covers and What It Doesn’t

✅ This Article IS About:

  • The flow of tensors from input to output within a Transformer model.
  • Ensuring dimensional coherence throughout the computational process.
  • The step-by-step transformations that tensors undergo in various Transformer layers.

❌ This Article IS NOT About:

  • A general introduction to Transformers or deep learning.
  • Detailed architecture of Transformer models.
  • Training process or hyper-parameter tuning of Transformers.

How Tensors Act Within Transformers

A Transformer consists of two main components:

  • Encoder: Processes input data, capturing contextual relationships to create meaningful representations.
  • Decoder: Utilizes these representations to generate coherent output, predicting each element sequentially.

Tensors are the fundamental data structures that go through these components, experiencing multiple transformations that ensure dimensional coherence and proper information flow.

Image From Research Paper: Transformer standard architecture

Input Embedding Layer

Before entering the Transformer, raw input tokens (words, subwords, or characters) are converted into dense vector representations through the embedding layer. This layer functions as a lookup table that maps each token to a dense vector, capturing semantic relationships between words.

Image by author: Tensors passing through Embedding layer

For a batch of five sentences, each with a sequence length of 12 tokens, and an embedding dimension of 768, the tensor shape is:

  • Tensor shape: [batch_size, seq_len, embedding_dim] → [5, 12, 768]

After embedding, positional encoding is added, ensuring that order information is preserved without altering the tensor shape.
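A minimal sketch of these shapes in PyTorch (the vocabulary size and the learned positional encoding below are illustrative assumptions, not the article's code):

```python
import torch
import torch.nn as nn

batch_size, seq_len, embedding_dim = 5, 12, 768
vocab_size = 10_000  # assumed vocabulary size, for illustration only

# Token IDs for a batch of 5 sentences, 12 tokens each
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# Embedding layer: a lookup table mapping each token ID to a 768-dim vector
embedding = nn.Embedding(vocab_size, embedding_dim)
x = embedding(token_ids)              # [5, 12, 768]

# Positional encoding is added element-wise, so the shape is unchanged
# (a learned/zero encoding is used here purely for brevity)
pos_encoding = torch.zeros(seq_len, embedding_dim)
x = x + pos_encoding                  # still [5, 12, 768]

print(x.shape)  # torch.Size([5, 12, 768])
```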

Modified Image from Research Paper: Situation of the workflow

Multi-Head Attention Mechanism

One of the most critical components of the Transformer is the Multi-Head Attention (MHA) mechanism. It operates on three matrices derived from input embeddings:

  • Query (Q)
  • Key (K)
  • Value (V)

These matrices are generated using learnable weight matrices:

  • Wq, Wk, Wv of shape [embedding_dim, d_model] (e.g., [768, 512]).
  • The resulting Q, K, V matrices have dimensions [batch_size, seq_len, d_model].
Image by author: Table showing the shapes/dimensions of Embedding, Q, K, V tensors
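As a rough sketch of these projections (d_model = 512 as in the example above; bias-free linear layers are an assumption made for brevity):

```python
import torch
import torch.nn as nn

batch_size, seq_len, embedding_dim, d_model = 5, 12, 768, 512

x = torch.randn(batch_size, seq_len, embedding_dim)   # embedded input

# Learnable projection matrices Wq, Wk, Wv of shape [768, 512]
W_q = nn.Linear(embedding_dim, d_model, bias=False)
W_k = nn.Linear(embedding_dim, d_model, bias=False)
W_v = nn.Linear(embedding_dim, d_model, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)
print(Q.shape, K.shape, V.shape)   # each torch.Size([5, 12, 512])
```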

Splitting Q, K, V into Multiple Heads

For effective parallelization and improved learning, MHA splits Q, K, and V into multiple heads. Suppose we have 8 attention heads:

  • Each head operates on a subspace of d_model / head_count.
Image by author: Multihead Attention

  • The reshaped tensor dimensions are [batch_size, seq_len, head_count, d_model / head_count].
  • Example: [5, 12, 8, 64] → rearranged to [5, 8, 12, 64] to ensure that each head receives a separate sequence slice.
Image by author: Reshaping the tensors
  • Each head then gets its own share of Qi, Ki, Vi.
Image by author: Each Qi,Ki,Vi sent to different head
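A small sketch of the reshaping step, using the example dimensions above:

```python
import torch

batch_size, seq_len, d_model, head_count = 5, 12, 512, 8
head_dim = d_model // head_count   # 64

Q = torch.randn(batch_size, seq_len, d_model)           # [5, 12, 512]

# Split the model dimension into 8 heads of size 64
Q = Q.view(batch_size, seq_len, head_count, head_dim)   # [5, 12, 8, 64]

# Move the head axis forward so each head sees the full sequence
Q = Q.transpose(1, 2)                                    # [5, 8, 12, 64]

print(Q.shape)   # torch.Size([5, 8, 12, 64])
```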

Attention Calculation

Each head computes scaled dot-product attention: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, where d_k = d_model / head_count is the dimension of each head.

Once attention is computed for all heads, the outputs are concatenated and passed through a linear transformation, restoring the initial tensor shape.

Image by author: Concatenating the output of all heads
Modified Image From Research Paper: Situation of the workflow
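Putting the per-head attention and the concatenation together, a hedged sketch might look like this (the output projection back to the 768-dimensional embedding size is an assumption made so the residual connection in the next section lines up):

```python
import math
import torch
import torch.nn as nn

batch_size, seq_len, head_count, head_dim = 5, 12, 8, 64
d_model, embedding_dim = head_count * head_dim, 768   # 512 and 768

Q = torch.randn(batch_size, head_count, seq_len, head_dim)
K = torch.randn(batch_size, head_count, seq_len, head_dim)
V = torch.randn(batch_size, head_count, seq_len, head_dim)

# Scaled dot-product attention per head: softmax(Q Kᵀ / √d_k) V
scores = Q @ K.transpose(-2, -1) / math.sqrt(head_dim)   # [5, 8, 12, 12]
weights = torch.softmax(scores, dim=-1)
head_outputs = weights @ V                                # [5, 8, 12, 64]

# Concatenate the heads back along the feature dimension
concat = head_outputs.transpose(1, 2).reshape(batch_size, seq_len, d_model)  # [5, 12, 512]

# Final linear layer; assumed here to project back to the embedding dimension
W_o = nn.Linear(d_model, embedding_dim)
output = W_o(concat)                                      # [5, 12, 768]
```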

Residual Connection and Normalization

After the multi-head attention mechanism, a residual connection is added, followed by layer normalization:

  • Residual connection: Output = Embedding Tensor + Multi-Head Attention Output
  • Normalization: (Output − μ) / σ to stabilize training
  • Tensor shape remains [batch_size, seq_len, embedding_dim]
Image by author: Residual Connection
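A minimal sketch of the residual connection followed by layer normalization (note that PyTorch's nn.LayerNorm also applies learnable scale and shift parameters):

```python
import torch
import torch.nn as nn

batch_size, seq_len, embedding_dim = 5, 12, 768

embedding_out = torch.randn(batch_size, seq_len, embedding_dim)  # sub-layer input
attention_out = torch.randn(batch_size, seq_len, embedding_dim)  # MHA output

# Residual connection: add the sub-layer input back to its output
residual = embedding_out + attention_out

# Layer normalization: normalize each token vector to zero mean, unit variance
layer_norm = nn.LayerNorm(embedding_dim)
normalized = layer_norm(residual)

print(normalized.shape)   # torch.Size([5, 12, 768]) – shape unchanged
```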

Feed-Forward Network (FFN)

After attention and normalization, each token position passes independently through a position-wise feed-forward network: two linear layers with a non-linearity between them, which expand the representation to a larger hidden dimension and project it back, leaving the tensor shape [batch_size, seq_len, embedding_dim] unchanged.
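A minimal FFN sketch; the 4× hidden expansion below is the convention from the original Transformer paper and is an assumption here:

```python
import torch
import torch.nn as nn

batch_size, seq_len, embedding_dim = 5, 12, 768
hidden_dim = 4 * embedding_dim   # conventional expansion factor, assumed here

ffn = nn.Sequential(
    nn.Linear(embedding_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, embedding_dim),
)

x = torch.randn(batch_size, seq_len, embedding_dim)
out = ffn(x)
print(out.shape)   # torch.Size([5, 12, 768]) – shape preserved
```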

Masked Multi-Head Attention

In the decoder, Masked Multi-Head Attention ensures that each token attends only to previous tokens, preventing leakage of future information.

Modified Image From Research Paper: Masked Multi Head Attention

This is achieved using a lower triangular mask of shape [seq_len, seq_len] with -inf values in the upper triangle. Applying this mask before the softmax drives the attention weights for future positions to zero.

Image by author: Mask matrix
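A short sketch of building and applying such a mask (the score tensor shape follows the earlier multi-head example):

```python
import torch

batch_size, head_count, seq_len = 5, 8, 12

# Causal mask: 0 where attention is allowed, -inf above the diagonal
# so softmax assigns zero weight to future tokens
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = torch.randn(batch_size, head_count, seq_len, seq_len)  # raw attention scores
masked_scores = scores + mask                  # broadcasts over batch and heads
weights = torch.softmax(masked_scores, dim=-1)  # future positions get weight 0
```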

Cross-Attention in Decoding

Since the decoder has no direct access to the input sentence, it uses cross-attention to incorporate the encoder's representation into its predictions. Here:

  • The decoder generates queries (Qd) from its input ([batch_size, target_seq_len, embedding_dim]).
  • The encoder output serves as keys (Ke) and values (Ve).
  • The decoder computes attention between Qd and Ke, extracting relevant context from the encoder’s output.
Modified Image From Research Paper: Cross Head Attention
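A simplified, single-head sketch of the cross-attention shapes (the target sequence length of 10 and the projection layers are illustrative assumptions; in practice this is split across multiple heads exactly as before):

```python
import torch
import torch.nn as nn

batch_size, src_len, tgt_len, d_model = 5, 12, 10, 512

decoder_hidden = torch.randn(batch_size, tgt_len, d_model)  # decoder-side input
encoder_output = torch.randn(batch_size, src_len, d_model)  # encoder output

# Queries come from the decoder; keys and values come from the encoder
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Qd = W_q(decoder_hidden)   # [5, 10, 512]
Ke = W_k(encoder_output)   # [5, 12, 512]
Ve = W_v(encoder_output)   # [5, 12, 512]

scores = Qd @ Ke.transpose(-2, -1) / (d_model ** 0.5)   # [5, 10, 12]
context = torch.softmax(scores, dim=-1) @ Ve             # [5, 10, 512]
```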

Conclusion

Transformers rely on tensors to carry information through every stage of the model. As data moves through the network, these tensors are embedded into vectors the model can work with, weighted by attention to focus on the most relevant parts, stabilized through normalization, and transformed by feed-forward layers that learn patterns. Throughout all of these steps, the tensor shapes stay consistent. Understanding how tensors move and change gives a clearer picture of how these models process and generate human-like language.
