{"id":1091,"date":"2018-09-26T06:49:00","date_gmt":"2018-09-26T06:49:00","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2018\/09\/26\/sequence-modeling-with-neural-networks-part-i\/"},"modified":"2018-09-26T06:49:00","modified_gmt":"2018-09-26T06:49:00","slug":"sequence-modeling-with-neural-networks-part-i","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2018\/09\/26\/sequence-modeling-with-neural-networks-part-i\/","title":{"rendered":"Sequence Modeling with Neural Networks &#8211; Part I"},"content":{"rendered":"<p>Author: Vincent Granville<\/p>\n<div>\n<p><em>Guest blog post by<span>\u00a0<\/span><a href=\"https:\/\/ziedhy.github.io\/2018\/08\/Introduction_Deep_Learning.html\" target=\"_blank\" rel=\"noopener\">Zied HY<\/a>. Zied is\u00a0<span>Senior Data Scientist at Capgemini Consulting. He<\/span><span>\u00a0specializes in building predictive models utilizing both traditional statistical methods (Generalized Linear Models, Mixed Effects Models, Ridge, Lasso, etc.) and modern machine learning techniques (XGBoost, Random Forests, Kernel Methods, neural networks, etc.). Zied<\/span><span>\u00a0runs workshops for university students (ESSEC, HEC, Ecole polytechnique) interested in Data Science and its applications, and he is\u00a0<\/span><span>the co-founder of Global International Trading (GIT), a central purchasing office based in Paris.<\/span><\/em><\/p>\n<p><span style=\"font-size: 1.5em;\"><strong>Context<\/strong><\/span><\/p>\n<div id=\"context\" class=\"section level2\">\n<p>In the previous course<span>\u00a0<\/span><a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/introduction-to-deep-learning\" target=\"_self\">Introduction to Deep Learning<\/a>, we saw how to use Neural Networks to model a dataset of many examples. 
The good news is that the basic architecture of Neural Networks is quite generic across applications: a stack of perceptrons composing complex hierarchical models, optimized using gradient descent and backpropagation.<\/p>\n<p>In spite of this, you have probably heard about Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), LSTM, Auto-Encoders, etc. These deep learning models are different from each other. Each model is known to be particularly effective at certain tasks, even though, fundamentally, they all share the same basic architecture.<\/p>\n<p>What makes the difference between them is how well each is suited to particular data structures: text processing could be different from image processing, which in turn could be different from signal processing.<\/p>\n<p>In the context of this post, we will focus on modeling<span>\u00a0<\/span><strong>sequences<\/strong><span>\u00a0<\/span>as a well-known data structure and will study its<span>\u00a0<\/span><strong>specific learning framework<\/strong>.<\/p>\n<p>Applications of sequence modeling are plentiful in day-to-day business practice. Some of them emerged to meet today\u2019s challenges in terms of quality of service and customer engagement. Here are some examples:<\/p>\n<ul>\n<li>Speech Recognition to listen to the voice of customers.<\/li>\n<li>Machine Translation from diverse source languages to more common languages.<\/li>\n<li>Topic Extraction to find the main subject of a customer\u2019s translated query.<\/li>\n<li>Speech Generation to have conversational ability and engage with customers in a human-like manner.<\/li>\n<li>Text Summarization of customer feedback to work on key challenges and pain points.<\/li>\n<\/ul>\n<p>In the auto industry, self-parking is also a sequence modeling task. 
In fact, parking could be seen as a sequence of movements where the next movement depends on the previous ones.<\/p>\n<p>Other applications cover text classification, translating videos to natural language, image caption generation, handwriting recognition\/generation, anomaly detection, and many more in the future\u2026some of which none of us can yet imagine.<\/p>\n<p>However, before we go any further in the applications of Sequence Modeling, let us understand what we are dealing with when we talk about sequences.<\/p>\n<\/div>\n<div id=\"introduction-to-sequence-modeling\" class=\"section level2\">\n<h2>Introduction to Sequence Modeling<\/h2>\n<p>Sequences are a data structure where each example could be seen as a series of data points. This sentence: \u201cI am currently reading an article about sequence modeling with Neural Networks\u201d is an example that consists of multiple words, and the words depend on each other. The same applies to medical records. A single medical record consists of many measurements across time. It is the same for speech waveforms.<\/p>\n<p><strong><em>So why do we need a different learning framework to model sequences, and what special features are we looking for in this framework?<\/em><\/strong><\/p>\n<p>For illustration purposes and with no loss of generality, let us focus on text as a sequence of words to motivate this need for a different learning framework.<\/p>\n<p>In fact, machine learning algorithms typically require the text input to be represented as a<span>\u00a0<\/span><strong>fixed-length<\/strong> vector. 
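As a minimal sketch of such a fixed-length representation (assuming whitespace tokenization and lowercasing; the `bag_of_words` helper is illustrative, not from the original post), here is the bag-of-words idea applied to the two restaurant sentences discussed below; it already exposes the main weakness: two sentences with opposite meanings map to the same vector.

```python
from collections import Counter

# A minimal bag-of-words sketch (illustrative helper, hypothetical names):
# each sentence becomes a fixed-length vector of word counts over a vocabulary.
def tokenize(sentence):
    return sentence.lower().replace(".", "").replace(",", "").split()

def bag_of_words(sentence, vocabulary):
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]

a = "The food was good, not bad at all."
b = "The food was bad, not good at all."
vocab = sorted(set(tokenize(a)) | set(tokenize(b)))

# Word order is lost: two opposite sentences get identical vectors.
va, vb = bag_of_words(a, vocab), bag_of_words(b, vocab)
```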
Many operations needed to train the model (network) can be expressed through algebraic operations on the matrix of input feature values and the matrix of weights (think about an n-by-p design matrix, where n is the number of samples observed, and p is the number of variables measured in all samples).<\/p>\n<p>Perhaps the most common fixed-length vector representation for texts is the<span>\u00a0<\/span><strong>bag-of-words<\/strong><span>\u00a0<\/span>or bag-of-n-grams, due to their simplicity, efficiency and often surprising accuracy. However, the bag-of-words (BOW) representation has many disadvantages:<\/p>\n<ul>\n<li>\n<p>First, the word order is lost, and thus different sentences can have exactly the same representation, as long as the same words are used. Example: \u201cThe food was good, not bad at all.\u201d vs \u201cThe food was bad, not good at all.\u201d. Even though bag-of-n-grams considers the word order in short contexts, it suffers from data sparsity and high dimensionality.<\/p>\n<\/li>\n<li>\n<p>In addition, bag-of-words and bag-of-n-grams have very little knowledge about the semantics of the words or, more formally, the distances between the words. This means that the words \u201cpowerful\u201d, \u201cstrong\u201d and \u201cParis\u201d are equally distant, despite the fact that semantically, \u201cpowerful\u201d should be closer to \u201cstrong\u201d than to \u201cParis\u201d.<\/p>\n<\/li>\n<li>\n<p>Humans don\u2019t start their thinking from scratch every second. As you read this article,<span>\u00a0<\/span><strong>you understand each word based on your understanding of previous words<\/strong>. Traditional neural networks can\u2019t do this, and it seems like a major shortcoming. 
Bag-of-words and bag-of-n-grams as text representations do not allow us to keep track of long-term dependencies inside the same sentence or paragraph.<\/p>\n<\/li>\n<li>\n<p>Another disadvantage of modeling sequences with traditional Neural Networks (e.g.\u00a0Feedforward Neural Networks) is that parameters are not shared across time. Let us take for example these two sentences: \u201cOn Monday, it was snowing\u201d and \u201cIt was snowing on Monday\u201d. These sentences mean the same thing, though the details are in different parts of the sequence. Actually, when we feed these two sentences into a Feedforward Neural Network for a prediction task, the model will assign different weights to \u201cOn Monday\u201d at each moment in time.<span>\u00a0<\/span><strong>Things we learn about the sequence won\u2019t transfer if they appear at different points in the sequence.<\/strong><span>\u00a0<\/span>Sharing parameters gives the network the ability to look for a given feature everywhere in the sequence, rather than in just a certain area.<\/p>\n<\/li>\n<\/ul>\n<p>Thus, to model sequences, we need a specific learning framework able to:<\/p>\n<ul>\n<li>deal with variable-length sequences<\/li>\n<li>maintain sequence order<\/li>\n<li>keep track of long-term dependencies rather than cutting input data too short<\/li>\n<li>share parameters across the sequence (so we do not re-learn things at each position in the sequence)<\/li>\n<\/ul>\n<p>Recurrent neural networks (RNNs) can address these requirements. They are networks with loops in them, allowing information to persist.<\/p>\n<p>So, let us find out more about RNNs!<\/p>\n<\/div>\n<div id=\"recurrent-neural-networks\" class=\"section level2\">\n<h2>Recurrent Neural Networks<\/h2>\n<div id=\"how-a-recurrent-neural-network-works\" class=\"section level3\">\n<h3>How does a Recurrent Neural Network work?<\/h3>\n<p>A Recurrent Neural Network is architected in the same way as a \u201ctraditional\u201d Neural Network. 
We have some inputs, we have some hidden layers and we have some outputs.<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/VizWPXcDR5RATRvUsIYhtJ2nr4pZMJ6-Nxrjauc8ZKaSp5Zy7KfVFxfBVcNXfWoId5nt1XPYgsKM8QaFIzFqPiymACUsCfBt\/Capture.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/VizWPXcDR5RATRvUsIYhtJ2nr4pZMJ6-Nxrjauc8ZKaSp5Zy7KfVFxfBVcNXfWoId5nt1XPYgsKM8QaFIzFqPiymACUsCfBt\/Capture.PNG\" width=\"428\" class=\"align-center\"><\/a><\/p>\n<p>The only difference is that each hidden unit is doing a slightly different function. So, let\u2019s explore how this hidden unit works.<\/p>\n<p>A recurrent hidden unit computes a function of an input and its own previous output, also known as the cell state. For textual data, an input could be a vector representing a word<span>\u00a0<\/span><em>x(i)<\/em><span>\u00a0<\/span>in a sentence of<span>\u00a0<\/span><em>n<\/em><span>\u00a0<\/span>words (also known as word embedding).<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/VizWPXcDR5SiFOYwgVXNnWoLwYj6isjskcB7xIiF3EwmKlKFJ8AFIXOUmpfzhyoj346HD0D8ih8TfusvAX2SjaUepqrja1w2\/Capture.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/VizWPXcDR5SiFOYwgVXNnWoLwYj6isjskcB7xIiF3EwmKlKFJ8AFIXOUmpfzhyoj346HD0D8ih8TfusvAX2SjaUepqrja1w2\/Capture.PNG\" width=\"670\" class=\"align-center\"><\/a><\/p>\n<p><em>W<\/em><span>\u00a0<\/span>and<span>\u00a0<\/span><em>U<\/em><span>\u00a0<\/span>are weight matrices and<span>\u00a0<\/span><em>tanh<\/em><span>\u00a0<\/span>is the hyperbolic tangent function.<\/p>\n<p>Similarly, at the next step, it computes a function of the new input and its previous cell state:<span>\u00a0<\/span><strong><em>s2<\/em><span>\u00a0<\/span>=<span>\u00a0<\/span><em>tanh<\/em>(<em>Wx1<\/em>+<span>\u00a0<\/span><em>Us1<\/em>)<\/strong>. This behavior is similar to a hidden unit in a feed-forward Network. 
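The recurrent hidden unit described above can be sketched in a few lines of NumPy (hypothetical sizes; `step` is an illustrative helper, not from the original post). The key point is that the same W and U are applied at every step:

```python
import numpy as np

# A minimal recurrent hidden unit: s_t = tanh(W x_t + U s_{t-1}).
# Hypothetical sizes: hidden state of 3 units, word embeddings of size 4.
rng = np.random.default_rng(0)
hidden_size, embed_size = 3, 4
W = rng.standard_normal((hidden_size, embed_size))   # acts on the new input
U = rng.standard_normal((hidden_size, hidden_size))  # acts on the previous state

def step(x_t, s_prev):
    """One recurrent step: combine the new input with the previous cell state."""
    return np.tanh(W @ x_t + U @ s_prev)

s = np.zeros(hidden_size)                             # initial cell state s0
for x_t in rng.standard_normal((5, embed_size)):      # a 5-word "sentence"
    s = step(x_t, s)                                  # same W and U reused at every step
```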
The difference, specific to sequences, is the additional term that incorporates the unit\u2019s own previous state.<\/p>\n<p>A common way of viewing recurrent neural networks is by unfolding them across time. We can notice that<span>\u00a0<\/span><strong>we are using the same weight matrices<span>\u00a0<\/span><em>W<\/em><span>\u00a0<\/span>and<span>\u00a0<\/span><em>U<\/em><span>\u00a0<\/span>throughout the sequence. This solves our problem of parameter sharing<\/strong>. We don\u2019t have new parameters for every point of the sequence. Thus, once we learn something, it can apply at any point in the sequence.<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/VizWPXcDR5Qh0K3mxHHTWmnZJW1*ivzaWnAXU7JUsUMe4Yi4PRXVS0IP9fj9jnrXE9-7MuSS0LK3vzju0eGKrB-4NFiIvAEq\/Capture.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/VizWPXcDR5Qh0K3mxHHTWmnZJW1*ivzaWnAXU7JUsUMe4Yi4PRXVS0IP9fj9jnrXE9-7MuSS0LK3vzju0eGKrB-4NFiIvAEq\/Capture.PNG\" width=\"380\" class=\"align-center\"><\/a><\/p>\n<div id=\"how-a-recurrent-neural-network-works\" class=\"section level3\">\n<p>The fact of not having new parameters for every point of the sequence also helps us<span>\u00a0<\/span><strong>deal with variable-length sequences<\/strong>. For a sequence of length 4, we unroll the RNN to four timesteps; for another sequence we might unroll it to ten, since the length of the sequence is not prespecified in the algorithm. By unrolling we simply mean that we write out the network for the complete sequence. 
For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word.<\/p>\n<div id=\"nb\" class=\"section level4\">\n<h4>NB:<\/h4>\n<ul>\n<li><em>Sn<\/em>, the cell state at time<span>\u00a0<\/span><em>n<\/em>, can contain information from all of the past timesteps: each cell state is a function of the previous cell state, which in turn is a function of the one before it.<span>\u00a0<\/span><strong>This solves our issue of long-term dependencies<\/strong>.<\/li>\n<li>The above diagram has outputs at each time step, but depending on the task this may not be necessary. For example, when predicting the sentiment of a sentence, we may only care about the final output, not the sentiment after each word. Similarly, we may not need inputs at each time step. The main feature of an RNN is its hidden state, which captures some information about a sequence.<\/li>\n<\/ul>\n<p>Now that we understand how a single hidden unit works, we need to figure out how to train an entire Recurrent Neural Network made up of many hidden units and even many layers of many hidden units.<\/p>\n<\/div>\n<\/div>\n<div id=\"how-do-we-train-a-recurrent-neural-network\" class=\"section level3\">\n<h3>How do we train a Recurrent Neural Network?<\/h3>\n<p>Let\u2019s consider the following task: for a set of speeches in English, we need the model to automatically convert the spoken language into text, i.e.\u00a0at each timestep, the model produces a prediction of a transcript (an output) based on the portion of the speech at this timestep (the new input) and the previous transcript (the previous cell state).<\/p>\n<p>Naturally, because we have an output at every timestep, we can have a loss at every timestep. 
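An unrolled forward pass with a loss at every timestep can be sketched as follows (hypothetical sizes; the softmax output layer V and the integer targets are illustrative assumptions, not from the original post). The total loss is simply the running sum of the per-timestep losses:

```python
import numpy as np

# Toy unrolled RNN: an output and a cross-entropy loss at every timestep.
# Hypothetical sizes; V is an assumed softmax output layer.
rng = np.random.default_rng(1)
hidden_size, embed_size, vocab_size, T = 3, 4, 6, 5
W = rng.standard_normal((hidden_size, embed_size))
U = rng.standard_normal((hidden_size, hidden_size))
V = rng.standard_normal((vocab_size, hidden_size))

xs = rng.standard_normal((T, embed_size))   # one input vector per timestep
ys = rng.integers(0, vocab_size, size=T)    # one target label per timestep

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s = np.zeros(hidden_size)
total_loss = 0.0
for x_t, y_t in zip(xs, ys):
    s = np.tanh(W @ x_t + U @ s)            # new cell state
    p = softmax(V @ s)                      # prediction at this timestep
    total_loss += -np.log(p[y_t])           # loss at this timestep
# total_loss is the sum of the losses over all timesteps.
```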
This loss reflects how close the predicted transcripts are to the \u201cofficial\u201d transcripts.<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/VizWPXcDR5RlMKN1nxc*FaGoVKSk*UYNcL30eODKhV4Rt7OSEK3k7D1FnQXH-rdkZOjr-V0tpcDQ392aSI7ljFpt1nUbeEhT\/c1.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/VizWPXcDR5RlMKN1nxc*FaGoVKSk*UYNcL30eODKhV4Rt7OSEK3k7D1FnQXH-rdkZOjr-V0tpcDQ392aSI7ljFpt1nUbeEhT\/c1.PNG\" width=\"280\" class=\"align-center\"><\/a><\/p>\n<p><span>The total loss is just the sum of the losses at every timestep.<\/span><\/p>\n<p><span><a href=\"http:\/\/api.ning.com\/files\/VizWPXcDR5To5Awzw3QOH*p*0siHpXUOL*ZRRD00nl7m1pufmc4uEkfmpDDKtCHP1S-wY8zeSE0f5i6ZOKOENwLZEZCLqMX1\/C2.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/VizWPXcDR5To5Awzw3QOH*p*0siHpXUOL*ZRRD00nl7m1pufmc4uEkfmpDDKtCHP1S-wY8zeSE0f5i6ZOKOENwLZEZCLqMX1\/C2.PNG\" width=\"353\" class=\"align-center\"><\/a><\/span><\/p>\n<p>Since the loss is a function of the network weights, our task is to find the set of weights<span>\u00a0<\/span><em>theta<\/em><span>\u00a0<\/span>that achieve the lowest loss. For that, as explained in the first article \u201cIntroduction to Deep Learning\u201d, we can apply<span>\u00a0<\/span><strong>the gradient descent algorithm with backpropagation (chain rule) at every timestep<\/strong>, thus taking into account the additional time dimension.<\/p>\n<p><em>W<\/em><span>\u00a0<\/span>and<span>\u00a0<\/span><em>U<\/em><span>\u00a0<\/span>are our two weight matrices. 
Let us try it out for<span>\u00a0<\/span><em>W<\/em>.<\/p>\n<p>Knowing that the total loss is the sum of the losses at every timestep, the total gradient is just the sum of the gradients at every timestep:<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/VizWPXcDR5RzdfFKF9uG52oaEJmiG8MzpXIUI8ZSRRz0J2NGmMpIe-3UNHmxEMDaR3aGY3QeAq8qqDeONV97GE51PHPBW8vu\/C3.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/VizWPXcDR5RzdfFKF9uG52oaEJmiG8MzpXIUI8ZSRRz0J2NGmMpIe-3UNHmxEMDaR3aGY3QeAq8qqDeONV97GE51PHPBW8vu\/C3.PNG\" width=\"144\" class=\"align-center\"><\/a><\/p>\n<p><span>And now, we can focus on a single timestep to calculate the derivative of the loss with respect to\u00a0<\/span><em>W<\/em><span>.<\/span><\/p>\n<p><span><a href=\"http:\/\/api.ning.com\/files\/VizWPXcDR5RV5WtWD-7TiRJbpySXyjJgj7FW4Z-ChPcGaGZ0ZbMdjI4Qh0fayyioQzYwm7G5vOwCAhxExc6bZ7e7udsMGbcV\/c.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/VizWPXcDR5RV5WtWD-7TiRJbpySXyjJgj7FW4Z-ChPcGaGZ0ZbMdjI4Qh0fayyioQzYwm7G5vOwCAhxExc6bZ7e7udsMGbcV\/c.PNG\" width=\"388\" class=\"align-center\"><\/a><\/span><\/p>\n<p><span>Easy to handle: we just use backpropagation.<\/span><\/p>\n<p><span><a href=\"http:\/\/api.ning.com\/files\/VizWPXcDR5S6Ao03aj*5wX-zWR0LjSnRk5hRFHOeIVRPPl2U*MskILtR2hWNMNSBudFSlo*XQcMieGi7WRNMI0Yz7E89gqQq\/c.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/VizWPXcDR5S6Ao03aj*5wX-zWR0LjSnRk5hRFHOeIVRPPl2U*MskILtR2hWNMNSBudFSlo*XQcMieGi7WRNMI0Yz7E89gqQq\/c.PNG\" width=\"203\" class=\"align-center\"><\/a><\/span><\/p>\n<p>We remember that<span>\u00a0<\/span><strong><em>s2<span>\u00a0<\/span>=<span>\u00a0<\/span>tanh(Wx1<span>\u00a0<\/span>+<span>\u00a0<\/span>Us1)<\/em><\/strong><span>\u00a0<\/span>so s2 also depends on s1 and s1 also depends on W.<span>\u00a0<\/span><strong>This actually means that we can not just leave the derivative of<span>\u00a0<\/span><em>s2<\/em><span>\u00a0<\/span>with respect 
to<span>\u00a0<\/span><em>W<\/em><span>\u00a0<\/span>as a constant. We have to expand it out further.<\/strong><\/p>\n<p>So how does<span>\u00a0<\/span><em>s2<\/em><span>\u00a0<\/span>depend on<span>\u00a0<\/span><em>W<\/em>?<\/p>\n<p>It depends directly on W because it feeds right in (cf. the formula for<span>\u00a0<\/span><em>s2<\/em> above). We also know that<span>\u00a0<\/span><em>s2<\/em><span>\u00a0<\/span>depends on<span>\u00a0<\/span><em>s1<\/em><span>\u00a0<\/span>which depends on W. And we can also see that<span>\u00a0<\/span><em>s2<\/em><span>\u00a0<\/span>depends on<span>\u00a0<\/span><em>s0<\/em><span>\u00a0<\/span>which also depends on W.<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/VizWPXcDR5TahhazU4OQC4h5o-EczTlWx1pAu*1dMbIlVFPXU0esQQ9A*GCn5tOMZumi-3kyP4TBUyCa2F63e2KJMOU2BH8p\/Capture.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/VizWPXcDR5TahhazU4OQC4h5o-EczTlWx1pAu*1dMbIlVFPXU0esQQ9A*GCn5tOMZumi-3kyP4TBUyCa2F63e2KJMOU2BH8p\/Capture.PNG\" width=\"380\" class=\"align-center\"><\/a><\/p>\n<p><span>Thus, the derivative of the loss with respect to\u00a0<\/span><em>W<\/em><span>\u00a0could be written as follows:<\/span><\/p>\n<p><span><a href=\"http:\/\/api.ning.com\/files\/VizWPXcDR5QLDtVNbonwPu1uYOupfGC0*UnTQKZNjU2MNkUYAuzROQaeZQRfJSJfKuFYvirsGmfuGBa9IPIm4Pv-ke3Vi-B*\/c1.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/VizWPXcDR5QLDtVNbonwPu1uYOupfGC0*UnTQKZNjU2MNkUYAuzROQaeZQRfJSJfKuFYvirsGmfuGBa9IPIm4Pv-ke3Vi-B*\/c1.PNG\" width=\"430\" class=\"align-center\"><\/a><\/span><\/p>\n<p>We can see that the last two terms are basically summing the contributions of<span>\u00a0<\/span><em>W<\/em><span>\u00a0<\/span>in previous timesteps to the error at timestep<span>\u00a0<\/span><em>t<\/em>. This is key to understanding how we model long-term dependencies. 
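In symbols (a transcription of the chain-rule expansion pictured above, with L_t the loss and \hat{y}_t the prediction at timestep t; the notation is an assumption where the images are unreadable):

```latex
\frac{\partial L_t}{\partial W}
  = \sum_{k=0}^{t}
    \frac{\partial L_t}{\partial \hat{y}_t}\,
    \frac{\partial \hat{y}_t}{\partial s_t}\,
    \frac{\partial s_t}{\partial s_k}\,
    \frac{\partial s_k}{\partial W},
\qquad
\frac{\partial s_t}{\partial s_k}
  = \prod_{j=k+1}^{t} \frac{\partial s_j}{\partial s_{j-1}}
```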
From one iteration to the next, the gradient descent algorithm shifts the network parameters so that they account for contributions to the error from past timesteps.<\/p>\n<p>For any timestep<span>\u00a0<\/span><em>t<\/em>, the derivative of the loss with respect to<span>\u00a0<\/span><em>W<\/em><span>\u00a0<\/span>could be written as follows:<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/VizWPXcDR5RP1twYaZGjNw18WX4G4yKJnGRequyt7miF0OYEdm478FLlREBMPHSoG4OP7KBnpHv6bGPG*uiXOPbHFtioO6nr\/C2.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/VizWPXcDR5RP1twYaZGjNw18WX4G4yKJnGRequyt7miF0OYEdm478FLlREBMPHSoG4OP7KBnpHv6bGPG*uiXOPbHFtioO6nr\/C2.PNG\" width=\"396\" class=\"align-center\"><\/a><\/p>\n<div id=\"how-do-we-train-a-recurrent-neural-network\" class=\"section level3\">\n<p>So to train the model, i.e.\u00a0to estimate the weights of the network, we apply this same process of backpropagation through time for every weight (parameter) and then we use it in the process of gradient descent.<\/p>\n<\/div>\n<div id=\"why-are-recurrent-neural-networks-hard-to-train\" class=\"section level3\">\n<h3>Why are Recurrent Neural Networks hard to train?<\/h3>\n<p>In practice, RNNs are a bit difficult to train. To understand why, let\u2019s take a closer look at the gradient we calculated above:<\/p>\n<p><a href=\"http:\/\/api.ning.com\/files\/VizWPXcDR5RAPwG2x20fGg2IszGmtoGPIA*oBO5UIo6cfq2M8*glqUJAsTAaMJfDC68TZ-uPXP9QAlupcz8KK7cC2OtD75bs\/C3.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/VizWPXcDR5RAPwG2x20fGg2IszGmtoGPIA*oBO5UIo6cfq2M8*glqUJAsTAaMJfDC68TZ-uPXP9QAlupcz8KK7cC2OtD75bs\/C3.PNG\" width=\"670\" class=\"align-center\"><\/a><\/p>\n<p><span>We can see that as the gap between timesteps gets bigger, the product of the gradients gets longer and longer. 
But what are these terms?<\/span><\/p>\n<p><span><a href=\"http:\/\/api.ning.com\/files\/VizWPXcDR5SiFk6mg0cz8DdyjwNyPvEWrI9hILWJ8f8mzsGa-RPeCinrHqf6RDOwYEi0npHBPgXSz39YJpAFS3e0bugG8J6N\/c4.PNG\" target=\"_self\"><img decoding=\"async\" src=\"http:\/\/api.ning.com\/files\/VizWPXcDR5SiFk6mg0cz8DdyjwNyPvEWrI9hILWJ8f8mzsGa-RPeCinrHqf6RDOwYEi0npHBPgXSz39YJpAFS3e0bugG8J6N\/c4.PNG\" width=\"298\" class=\"align-center\"><\/a><\/span><\/p>\n<div id=\"why-are-recurrent-neural-networks-hard-to-train\" class=\"section level3\">\n<p>Each term is basically a product of two factors: transposed<span>\u00a0<\/span><em>W<\/em><span>\u00a0<\/span>and a second one that depends on f\u2019, the derivative of the activation function.<\/p>\n<ul>\n<li>\n<p>Initial weights<span>\u00a0<\/span><em>W<\/em><span>\u00a0<\/span>are usually sampled from a standard normal distribution, so their entries are mostly < 1.<\/p>\n<\/li>\n<li>\n<p>It turns out (I won\u2019t prove it here but<span>\u00a0<\/span><a href=\"http:\/\/proceedings.mlr.press\/v28\/pascanu13.pdf\">this paper<\/a><span>\u00a0<\/span>goes into detail) that the second factor is a Jacobian matrix, because we are taking the derivative of a vector function with respect to a vector, and that its 2-norm, which you can think of as an absolute value,<span>\u00a0<\/span><strong>has an upper bound of 1<\/strong>. This makes intuitive sense because our tanh (or sigmoid) activation function maps all values into a range between -1 and 1, and the derivative f\u2019 is bounded by 1 (1\/4 in the case of sigmoid).<\/p>\n<\/li>\n<\/ul>\n<p>Thus, with small values in the matrix and multiple matrix multiplications, the<span>\u00a0<\/span><strong>gradient values are shrinking exponentially fast, eventually vanishing completely after a few time steps<\/strong>. 
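This shrinkage can be seen numerically with a small sketch (hypothetical sizes and scaling; a stand-in pre-activation replaces a real forward pass): each backward step multiplies the accumulated gradient by a Jacobian of the form diag(f'(\u00b7)) W\u1d40, and the 2-norm of the product collapses toward zero as the number of steps grows.

```python
import numpy as np

# Numerical illustration of vanishing gradients (illustrative sketch):
# the gradient flowing back through k timesteps is a product of k
# Jacobians diag(1 - tanh(.)^2) @ W.T, each with 2-norm below 1 here.
rng = np.random.default_rng(2)
hidden_size = 8
W = 0.1 * rng.standard_normal((hidden_size, hidden_size))  # small weights

grad = np.eye(hidden_size)
norms = []
for _ in range(30):
    s = rng.standard_normal(hidden_size)            # stand-in pre-activation
    jacobian = np.diag(1 - np.tanh(s) ** 2) @ W.T   # d s_t / d s_{t-1}
    grad = jacobian @ grad                          # accumulate the product
    norms.append(np.linalg.norm(grad, 2))           # spectral (2-)norm

# norms decays roughly geometrically: the gradient vanishes with depth.
```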
Gradient contributions from \u201cfar away\u201d steps become zero, and the state at those steps doesn\u2019t contribute to what you are learning: you end up not learning long-range dependencies.<\/p>\n<p>Vanishing gradients aren\u2019t exclusive to RNNs. They also happen in deep Feedforward Neural Networks. It\u2019s just that RNNs tend to be very deep (as deep as the sentence length in our case), which makes the problem a lot more common.<\/p>\n<p>Fortunately, there are a few ways to combat the vanishing gradient problem.<span>\u00a0<\/span><strong>Proper initialization of the<span>\u00a0<\/span><em>W<\/em> matrix<\/strong><span>\u00a0<\/span>can reduce the effect of vanishing gradients. So can regularization. A preferred solution is to use<span>\u00a0<\/span><strong><em>ReLU<\/em><\/strong><span>\u00a0<\/span>instead of tanh or sigmoid activation functions. The ReLU derivative is a constant of either 0 or 1, so it isn\u2019t as likely to suffer from vanishing gradients.<\/p>\n<p>An even more popular solution is to use Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) architectures. LSTMs were first proposed in 1997 and are perhaps the most widely used models in NLP today. GRUs, first proposed in 2014, are simplified versions of LSTMs. Both of these RNN architectures were explicitly designed to deal with vanishing gradients and efficiently learn long-range dependencies. 
We\u2019ll cover them in the next part of this article.<\/p>\n<p>It will come soon!<\/p>\n<p><em>Originally posted <a href=\"https:\/\/ziedhy.github.io\/2018\/09\/Sequence_Modeling_Part_1.html\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/em><\/p>\n<\/div>\n<div id=\"resources-i-used-when-writing-this-article\" class=\"section level3\">\n<h3>Resources I used when writing this article:<\/h3>\n<ul>\n<li><a href=\"http:\/\/introtodeeplearning.com\/\" class=\"uri\">http:\/\/introtodeeplearning.com\/<\/a><\/li>\n<li><a href=\"http:\/\/proceedings.mlr.press\/v28\/pascanu13.pdf\">On the difficulty of training recurrent neural networks<\/a><\/li>\n<li><a href=\"https:\/\/cs.stanford.edu\/~quocle\/paragraph_vector.pdf\">Distributed Representations of Sentences and Documents<\/a><\/li>\n<li><a href=\"http:\/\/colah.github.io\/posts\/2015-08-Understanding-LSTMs\/\">Understanding LSTM Networks<\/a><\/li>\n<li><a href=\"http:\/\/www.wildml.com\/2015\/10\/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients\/\">Backpropagation Through Time and Vanishing Gradients<\/a><\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:763130\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Vincent Granville Guest blog post by\u00a0Zied HY. Zied is\u00a0Senior Data Scientist at Capgemini Consulting. 
He is\u00a0specialized in building predictive models utilizing both traditional statistical [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2018\/09\/26\/sequence-modeling-with-neural-networks-part-i\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":1092,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1091"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1091"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1091\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/1092"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1091"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1091"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1091"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}