{"id":6015,"date":"2022-10-06T06:29:44","date_gmt":"2022-10-06T06:29:44","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2022\/10\/06\/implementing-the-transformer-encoder-from-scratch-in-tensorflow-and-keras\/"},"modified":"2022-10-06T06:29:44","modified_gmt":"2022-10-06T06:29:44","slug":"implementing-the-transformer-encoder-from-scratch-in-tensorflow-and-keras","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2022\/10\/06\/implementing-the-transformer-encoder-from-scratch-in-tensorflow-and-keras\/","title":{"rendered":"Implementing the Transformer Encoder From Scratch in TensorFlow and Keras"},"content":{"rendered":"<p>Author: Stefania Cristina<\/p>\n<div>\n<p>Having seen how to implement the <a href=\"https:\/\/machinelearningmastery.com\/how-to-implement-scaled-dot-product-attention-from-scratch-in-tensorflow-and-keras\">scaled dot-product attention<\/a>, and integrate it within the <a href=\"https:\/\/machinelearningmastery.com\/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras\">multi-head attention<\/a> of the Transformer model, we may progress one step further towards implementing a complete Transformer model by implementing its encoder. 
Our end goal remains the application of the complete model to Natural Language Processing (NLP).<\/p>\n<p>In this tutorial, you will discover how to implement the Transformer encoder from scratch in TensorFlow and Keras.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>The layers that form part of the Transformer encoder.<\/li>\n<li>How to implement the Transformer encoder from scratch.<span class=\"Apple-converted-space\">\u00a0 \u00a0<\/span>\n<\/li>\n<\/ul>\n<p>Let\u2019s get started.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<div id=\"attachment_13390\" style=\"width: 1034px\" class=\"wp-caption aligncenter\">\n<a href=\"http:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/encoder_cover-scaled.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-13390\" class=\"wp-image-13390 size-large\" src=\"http:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/encoder_cover-1024x683.jpg\" alt=\"\" width=\"1024\" height=\"683\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/encoder_cover-1024x683.jpg 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/encoder_cover-300x200.jpg 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/encoder_cover-768x512.jpg 768w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/encoder_cover-1536x1024.jpg 1536w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/encoder_cover-2048x1365.jpg 2048w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/03\/encoder_cover-600x400.jpg 600w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/a><\/p>\n<p id=\"caption-attachment-13390\" class=\"wp-caption-text\">Implementing the Transformer Encoder From Scratch in TensorFlow and Keras<br \/>Photo by <a href=\"https:\/\/unsplash.com\/photos\/DuBNA1QMpPA\">ian dooley<\/a>, some rights 
reserved.<\/p>\n<\/div>\n<h2><b>Tutorial Overview<\/b><\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ul>\n<li>Recap of the Transformer Architecture\n<ul>\n<li>The Transformer Encoder<\/li>\n<\/ul>\n<\/li>\n<li>Implementing the Transformer Encoder From Scratch\n<ul>\n<li>The Fully Connected Feed-Forward Neural Network and Layer Normalization<\/li>\n<li>The Encoder Layer<\/li>\n<li>The Transformer Encoder<\/li>\n<\/ul>\n<\/li>\n<li>Testing Out the Code<\/li>\n<\/ul>\n<h2><b>Prerequisites<\/b><\/h2>\n<p>For this tutorial, we assume that you are already familiar with:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/the-transformer-model\/\">The Transformer model<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/how-to-implement-scaled-dot-product-attention-from-scratch-in-tensorflow-and-keras\">The scaled dot-product attention<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras\">The multi-head attention<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/the-transformer-positional-encoding-layer-in-keras-part-2\/\">The Transformer positional encoding<\/a><\/li>\n<\/ul>\n<h2><b>Recap of the Transformer Architecture<\/b><\/h2>\n<p><a href=\"https:\/\/machinelearningmastery.com\/the-transformer-model\/\">Recall<\/a> having seen that the Transformer architecture follows an encoder-decoder structure: the encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step, to generate an output sequence.<\/p>\n<div id=\"attachment_12821\" style=\"width: 379px\" class=\"wp-caption aligncenter\">\n<a href=\"http:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_1.png\"><img decoding=\"async\" 
aria-describedby=\"caption-attachment-12821\" loading=\"lazy\" class=\"wp-image-12821\" src=\"http:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_1-727x1024.png\" alt=\"\" width=\"369\" height=\"520\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_1-727x1024.png 727w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_1-213x300.png 213w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_1-768x1082.png 768w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_1-1090x1536.png 1090w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_1.png 1320w\" sizes=\"(max-width: 369px) 100vw, 369px\"><\/a><\/p>\n<p id=\"caption-attachment-12821\" class=\"wp-caption-text\">The Encoder-Decoder Structure of the Transformer Architecture <br \/>Taken from \u201c<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention Is All You Need<\/a>\u201d<\/p>\n<\/div>\n<p>In generating an output sequence, the Transformer does not rely on recurrence or convolutions.<\/p>\n<p>We have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. In this tutorial, we will be focusing on the components that form part of the Transformer encoder. 
<\/p>\n<h3><b>The Transformer Encoder<\/b><\/h3>\n<p>The Transformer encoder consists of a stack of $N$ identical layers, where each layer further consists of two main sub-layers:<\/p>\n<ul>\n<li>The first sub-layer comprises a multi-head attention mechanism that receives the queries, keys and values as inputs.<\/li>\n<li>The second sub-layer comprises a fully-connected feed-forward network.\n<\/li>\n<\/ul>\n<div id=\"attachment_13039\" style=\"width: 379px\" class=\"wp-caption aligncenter\">\n<a href=\"http:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/transformer_1.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-13039\" loading=\"lazy\" class=\"wp-image-13039\" src=\"http:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/transformer_1-727x1024.png\" alt=\"\" width=\"369\" height=\"520\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/transformer_1-727x1024.png 727w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/transformer_1-213x300.png 213w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/transformer_1-768x1082.png 768w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/transformer_1-1090x1536.png 1090w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/transformer_1.png 1320w\" sizes=\"(max-width: 369px) 100vw, 369px\"><\/a><\/p>\n<p id=\"caption-attachment-13039\" class=\"wp-caption-text\">The Encoder Block of the Transformer Architecture <br \/>Taken from \u201c<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention Is All You Need<\/a>\u201d<\/p>\n<\/div>\n<p>Each of these two sub-layers is followed by layer normalization, which receives the sum of the sub-layer input (carried by a residual connection) and the sub-layer output. 
The output of each layer normalization step is the following:<\/p>\n<p style=\"text-align: center;\">LayerNorm(Sublayer Input + Sublayer Output)<\/p>\n<p>In order to facilitate such an operation, which involves an addition between the sub-layer input and output, Vaswani et al. designed all sub-layers and embedding layers in the model to produce outputs of dimension $d_{\\text{model}}$ = 512.<\/p>\n<p><a href=\"https:\/\/machinelearningmastery.com\/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras\">Recall<\/a> as well that the queries, keys and values are the inputs to the Transformer encoder.<\/p>\n<p>Here, the queries, keys and values carry the same input sequence after it has been embedded and augmented by positional information, where the queries and keys are of dimensionality $d_k$, whereas the dimensionality of the values is $d_v$.<\/p>\n<p>Furthermore, Vaswani et al. also introduce regularization into the model by applying dropout to the output of each sub-layer (before the layer normalization step), as well as to the sums of the embeddings and the positional encodings before these are fed into the encoder.<\/p>\n<p>Let\u2019s now see how to implement the Transformer encoder from scratch in TensorFlow and Keras.<\/p>\n<h2><b>Implementing the Transformer Encoder From Scratch<\/b><\/h2>\n<h3><b>The Fully Connected Feed-Forward Neural Network and Layer Normalization<\/b><\/h3>\n<p>We shall begin by creating classes for the <i>Feed Forward<\/i> and <i>Add &amp; Norm<\/i> layers that are shown in the diagram above.<\/p>\n<p>Vaswani et al. tell us that the fully connected feed-forward network consists of two linear transformations with a ReLU activation in between. 
The first linear transformation produces an output of dimensionality $d_{ff}$ = 2048, while the second linear transformation produces an output of dimensionality $d_{\\text{model}}$ = 512.<\/p>\n<p>For this purpose, let\u2019s first create the class, <code>FeedForward<\/code>, that inherits from the <code>Layer<\/code> base class in Keras, and initialize the dense layers and the ReLU activation:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">class FeedForward(Layer):\r\n    def __init__(self, d_ff, d_model, **kwargs):\r\n        super(FeedForward, self).__init__(**kwargs)\r\n        self.fully_connected1 = Dense(d_ff)  # First fully connected layer\r\n        self.fully_connected2 = Dense(d_model)  # Second fully connected layer\r\n        self.activation = ReLU()  # ReLU activation layer\r\n        ...<\/pre>\n<p>We will add to it the method, <code>call()<\/code>, that receives an input and passes it through the two fully connected layers with a ReLU activation in between, returning an output of dimensionality equal to 512:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\ndef call(self, x):\r\n    # The input is passed into the two fully-connected layers, with a ReLU in between\r\n    x_fc1 = self.fully_connected1(x)\r\n\r\n    return self.fully_connected2(self.activation(x_fc1))<\/pre>\n<p>The next step is to create another class, <code>AddNormalization<\/code>, that also inherits from the <code>Layer<\/code> base class in Keras, and initialize a layer normalization layer:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">class AddNormalization(Layer):\r\n    def __init__(self, **kwargs):\r\n        super(AddNormalization, self).__init__(**kwargs)\r\n        self.layer_norm = LayerNormalization()  # Layer normalization layer\r\n        ...<\/pre>\n<p>In it, we will include the following method that sums its sub-layer\u2019s input and output, which it receives as inputs, and applies layer normalization to the result:<\/p>\n<pre 
class=\"urvanov-syntax-highlighter-plain-tag\">...\r\ndef call(self, x, sublayer_x):\r\n    # The sublayer input and output need to be of the same shape to be summed\r\n    add = x + sublayer_x\r\n\r\n    # Apply layer normalization to the sum\r\n    return self.layer_norm(add)<\/pre>\n<\/p>\n<h3><b>The Encoder Layer<\/b><\/h3>\n<p>Next, we will implement the encoder layer, which the Transformer encoder will replicate identically $N$ times.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>For this purpose, let\u2019s create the class, <code>EncoderLayer<\/code>, and initialize all of the sub-layers that it consists of:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">class EncoderLayer(Layer):\r\n    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):\r\n        super(EncoderLayer, self).__init__(**kwargs)\r\n        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)\r\n        self.dropout1 = Dropout(rate)\r\n        self.add_norm1 = AddNormalization()\r\n        self.feed_forward = FeedForward(d_ff, d_model)\r\n        self.dropout2 = Dropout(rate)\r\n        self.add_norm2 = AddNormalization()\r\n        ...<\/pre>\n<p>Here you may notice that we have initialized instances of the <code>FeedForward<\/code> and <code>AddNormalization<\/code> classes, which we have just created in the previous section, and assigned their output to the respective variables, <code>feed_forward<\/code> and <code>add_norm<\/code> (1 and 2). The <code>Dropout<\/code> layer is self-explanatory, where <code>rate<\/code> defines the frequency at which the input units are set to 0. We had created the <code>MultiHeadAttention<\/code> class in a <a href=\"https:\/\/machinelearningmastery.com\/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras\">previous tutorial<\/a>, and if you had saved the code into a separate Python script, then do not forget to <code>import<\/code> it. 
I saved mine in a Python script named <i>multihead_attention.py<\/i>, and for this reason I need to include the line of code, <code>from multihead_attention import MultiHeadAttention<\/code>.<\/p>\n<p>Let\u2019s now proceed to create the method, <code>call()<\/code>, that implements all of the encoder sub-layers:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\ndef call(self, x, padding_mask, training):\r\n    # Multi-head attention layer\r\n    multihead_output = self.multihead_attention(x, x, x, padding_mask)\r\n    # Expected output shape = (batch_size, sequence_length, d_model)\r\n\r\n    # Add in a dropout layer\r\n    multihead_output = self.dropout1(multihead_output, training=training)\r\n\r\n    # Followed by an Add &amp; Norm layer\r\n    addnorm_output = self.add_norm1(x, multihead_output)\r\n    # Expected output shape = (batch_size, sequence_length, d_model)\r\n\r\n    # Followed by a fully connected layer\r\n    feedforward_output = self.feed_forward(addnorm_output)\r\n    # Expected output shape = (batch_size, sequence_length, d_model)\r\n\r\n    # Add in another dropout layer\r\n    feedforward_output = self.dropout2(feedforward_output, training=training)\r\n\r\n    # Followed by another Add &amp; Norm layer\r\n    return self.add_norm2(addnorm_output, feedforward_output)<\/pre>\n<p>In addition to the input data, the <code>call()<\/code> method can also receive a padding mask. 
As a brief reminder of what we had said in a <a href=\"https:\/\/machinelearningmastery.com\/how-to-implement-scaled-dot-product-attention-from-scratch-in-tensorflow-and-keras\">previous tutorial<\/a>, the <i>padding<\/i> mask is necessary to prevent the zero padding in the input sequence from being processed along with the actual input values.<\/p>\n<p>The same method can also receive a <code>training<\/code> flag. The Dropout layers are applied only when this flag is set to <code>True<\/code>, that is, during training.<\/p>\n<h3><b>The Transformer Encoder<\/b><\/h3>\n<p>The last step is to create a class for the Transformer encoder, which we shall name <code>Encoder<\/code>:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">class Encoder(Layer):\r\n    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):\r\n        super(Encoder, self).__init__(**kwargs)\r\n        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)\r\n        self.dropout = Dropout(rate)\r\n        self.encoder_layer = [EncoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]\r\n        ...<\/pre>\n<p>The Transformer encoder receives an input sequence after this has undergone a process of word embedding and positional encoding. 
In order to compute the positional encoding, we will make use of the <code>PositionEmbeddingFixedWeights<\/code> class described by Mehreen Saeed in <a href=\"https:\/\/machinelearningmastery.com\/the-transformer-positional-encoding-layer-in-keras-part-2\/\">this tutorial<\/a>.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>As we have similarly done in the previous sections, here we will also create a class method, <code>call()<\/code>, that applies word embedding and positional encoding to the input sequence, and feeds the result to $N$ encoder layers:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\ndef call(self, input_sentence, padding_mask, training):\r\n    # Generate the positional encoding\r\n    pos_encoding_output = self.pos_encoding(input_sentence)\r\n    # Expected output shape = (batch_size, sequence_length, d_model)\r\n\r\n    # Add in a dropout layer\r\n    x = self.dropout(pos_encoding_output, training=training)\r\n\r\n    # Pass on the positional encoded values to each encoder layer\r\n    for i, layer in enumerate(self.encoder_layer):\r\n        x = layer(x, padding_mask, training)\r\n\r\n    return x<\/pre>\n<p>The code listing for the full Transformer encoder is the following:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from tensorflow.keras.layers import LayerNormalization, Layer, Dense, ReLU, Dropout\r\nfrom multihead_attention import MultiHeadAttention\r\nfrom positional_encoding import PositionEmbeddingFixedWeights\r\n\r\n# Implementing the Add &amp; Norm Layer\r\nclass AddNormalization(Layer):\r\n    def __init__(self, **kwargs):\r\n        super(AddNormalization, self).__init__(**kwargs)\r\n        self.layer_norm = LayerNormalization()  # Layer normalization layer\r\n\r\n    def call(self, x, sublayer_x):\r\n        # The sublayer input and output need to be of the same shape to be summed\r\n        add = x + sublayer_x\r\n\r\n        # Apply layer normalization to the sum\r\n        return 
self.layer_norm(add)\r\n\r\n# Implementing the Feed-Forward Layer\r\nclass FeedForward(Layer):\r\n    def __init__(self, d_ff, d_model, **kwargs):\r\n        super(FeedForward, self).__init__(**kwargs)\r\n        self.fully_connected1 = Dense(d_ff)  # First fully connected layer\r\n        self.fully_connected2 = Dense(d_model)  # Second fully connected layer\r\n        self.activation = ReLU()  # ReLU activation layer\r\n\r\n    def call(self, x):\r\n        # The input is passed into the two fully-connected layers, with a ReLU in between\r\n        x_fc1 = self.fully_connected1(x)\r\n\r\n        return self.fully_connected2(self.activation(x_fc1))\r\n\r\n# Implementing the Encoder Layer\r\nclass EncoderLayer(Layer):\r\n    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):\r\n        super(EncoderLayer, self).__init__(**kwargs)\r\n        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)\r\n        self.dropout1 = Dropout(rate)\r\n        self.add_norm1 = AddNormalization()\r\n        self.feed_forward = FeedForward(d_ff, d_model)\r\n        self.dropout2 = Dropout(rate)\r\n        self.add_norm2 = AddNormalization()\r\n\r\n    def call(self, x, padding_mask, training):\r\n        # Multi-head attention layer\r\n        multihead_output = self.multihead_attention(x, x, x, padding_mask)\r\n        # Expected output shape = (batch_size, sequence_length, d_model)\r\n\r\n        # Add in a dropout layer\r\n        multihead_output = self.dropout1(multihead_output, training=training)\r\n\r\n        # Followed by an Add &amp; Norm layer\r\n        addnorm_output = self.add_norm1(x, multihead_output)\r\n        # Expected output shape = (batch_size, sequence_length, d_model)\r\n\r\n        # Followed by a fully connected layer\r\n        feedforward_output = self.feed_forward(addnorm_output)\r\n        # Expected output shape = (batch_size, sequence_length, d_model)\r\n\r\n        # Add in another dropout layer\r\n        
feedforward_output = self.dropout2(feedforward_output, training=training)\r\n\r\n        # Followed by another Add &amp; Norm layer\r\n        return self.add_norm2(addnorm_output, feedforward_output)\r\n\r\n# Implementing the Encoder\r\nclass Encoder(Layer):\r\n    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):\r\n        super(Encoder, self).__init__(**kwargs)\r\n        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)\r\n        self.dropout = Dropout(rate)\r\n        self.encoder_layer = [EncoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]\r\n\r\n    def call(self, input_sentence, padding_mask, training):\r\n        # Generate the positional encoding\r\n        pos_encoding_output = self.pos_encoding(input_sentence)\r\n        # Expected output shape = (batch_size, sequence_length, d_model)\r\n\r\n        # Add in a dropout layer\r\n        x = self.dropout(pos_encoding_output, training=training)\r\n\r\n        # Pass on the positional encoded values to each encoder layer\r\n        for i, layer in enumerate(self.encoder_layer):\r\n            x = layer(x, padding_mask, training)\r\n\r\n        return x<\/pre>\n<\/p>\n<h2><b>Testing Out the Code<\/b><\/h2>\n<p>We will be working with the parameter values specified in the paper, <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention Is All You Need<\/a>, by Vaswani et al. 
(2017):<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">h = 8  # Number of self-attention heads\r\nd_k = 64  # Dimensionality of the linearly projected queries and keys\r\nd_v = 64  # Dimensionality of the linearly projected values\r\nd_ff = 2048  # Dimensionality of the inner fully connected layer\r\nd_model = 512  # Dimensionality of the model sub-layers' outputs\r\nn = 6  # Number of layers in the encoder stack\r\n\r\nbatch_size = 64  # Batch size from the training process\r\ndropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers\r\n...<\/pre>\n<p>As for the input sequence, we will be working with dummy data for the time being, until we arrive at the stage of <a href=\"https:\/\/machinelearningmastery.com\/training-the-transformer-model\">training the complete Transformer model<\/a> in a separate tutorial, at which point we will be using actual sentences:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\nenc_vocab_size = 20  # Vocabulary size for the encoder\r\ninput_seq_length = 5  # Maximum length of the input sequence\r\n\r\ninput_seq = random.random((batch_size, input_seq_length))\r\n...<\/pre>\n<p>Next, we will create a new instance of the <code>Encoder<\/code> class, assigning it to the <code>encoder<\/code> variable, and subsequently feeding in the input arguments and printing the result. 
We will be setting the padding mask argument to <code>None<\/code> for the time being, but we shall return to this when we implement the complete Transformer model:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\nencoder = Encoder(enc_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)\r\nprint(encoder(input_seq, None, True))<\/pre>\n<p>Tying everything together produces the following code listing:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from numpy import random\r\n\r\nenc_vocab_size = 20 # Vocabulary size for the encoder\r\ninput_seq_length = 5  # Maximum length of the input sequence\r\nh = 8  # Number of self-attention heads\r\nd_k = 64  # Dimensionality of the linearly projected queries and keys\r\nd_v = 64  # Dimensionality of the linearly projected values\r\nd_ff = 2048  # Dimensionality of the inner fully connected layer\r\nd_model = 512  # Dimensionality of the model sub-layers' outputs\r\nn = 6  # Number of layers in the encoder stack\r\n\r\nbatch_size = 64  # Batch size from the training process\r\ndropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers\r\n\r\ninput_seq = random.random((batch_size, input_seq_length))\r\n\r\nencoder = Encoder(enc_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)\r\nprint(encoder(input_seq, None, True))<\/pre>\n<p>Running this code produces an output of shape, (<i>batch size<\/i>, <i>sequence length<\/i>, <i>model dimensionality<\/i>). Note that you will likely see a different output due to the random initialization of the input sequence, and the parameter values of the Dense layers.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">tf.Tensor(\r\n[[[-0.4214715  -1.1246173  -0.8444572  ...  1.6388322  -0.1890367\r\n    1.0173352 ]\r\n  [ 0.21662089 -0.61147404 -1.0946581  ...  
1.4627445  -0.6000164\r\n   -0.64127874]\r\n  [ 0.46674493 -1.4155326  -0.5686513  ...  1.1790234  -0.94788337\r\n    0.1331717 ]\r\n  [-0.30638126 -1.9047263  -1.8556844  ...  0.9130118  -0.47863355\r\n    0.00976158]\r\n  [-0.22600567 -0.9702025  -0.91090447 ...  1.7457147  -0.139926\r\n   -0.07021569]]\r\n...\r\n\r\n [[-0.48047638 -1.1034104  -0.16164204 ...  1.5588069   0.08743562\r\n   -0.08847156]\r\n  [-0.61683714 -0.8403657  -1.0450369  ...  2.3587787  -0.76091915\r\n   -0.02891812]\r\n  [-0.34268388 -0.65042275 -0.6715749  ...  2.8530657  -0.33631966\r\n    0.5215888 ]\r\n  [-0.6288677  -1.0030932  -0.9749813  ...  2.1386387   0.0640307\r\n   -0.69504136]\r\n  [-1.33254    -1.2524267  -0.230098   ...  2.515467   -0.04207756\r\n   -0.3395423 ]]], shape=(64, 5, 512), dtype=float32)<\/pre>\n<\/p>\n<h2><b>Further Reading<\/b><\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3><b>Books<\/b><\/h3>\n<ul>\n<li>\n<a href=\"https:\/\/www.amazon.com\/Advanced-Deep-Learning-Python-next-generation\/dp\/178995617X\">Advanced Deep Learning with Python<\/a>, 2019.<\/li>\n<li>\n<a href=\"https:\/\/www.amazon.com\/Transformers-Natural-Language-Processing-architectures\/dp\/1800565798\">Transformers for Natural Language Processing<\/a>, 2021.<span class=\"Apple-converted-space\">\u00a0<\/span>\n<\/li>\n<\/ul>\n<h3><b>Papers<\/b><\/h3>\n<ul>\n<li>\n<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention Is All You Need<\/a>, 2017.<\/li>\n<\/ul>\n<h2><b>Summary<\/b><\/h2>\n<p>In this tutorial, you discovered how to implement the Transformer encoder from scratch in TensorFlow and Keras.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>The layers that form part of the Transformer encoder.<\/li>\n<li>How to implement the Transformer encoder from scratch. 
<span class=\"Apple-converted-space\">\u00a0<\/span>\n<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/implementing-the-transformer-encoder-from-scratch-in-tensorflow-and-keras\/\">Implementing the Transformer Encoder From Scratch in TensorFlow and Keras<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/implementing-the-transformer-encoder-from-scratch-in-tensorflow-and-keras\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Stefania Cristina Having seen how to implement the scaled dot-product attention, and integrate it within the multi-head attention of the Transformer model, we may [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2022\/10\/06\/implementing-the-transformer-encoder-from-scratch-in-tensorflow-and-keras\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":6016,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/6015"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=6015"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/6015\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/6016"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=6015"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=6015"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=6015"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}