
In this article, I’m going to show how to implement GRU and LSTM units and how to build deeper RNNs using TensorFlow. I will start by explaining a little theory about GRUs, LSTMs and deep RNNs, and then go through the code snippet by snippet. This article is meant to be a continuation of my previous article about RNNs:
I suggest you read that article first. But if you already know about simple RNNs and are only interested in the content of this article, that’s fine too.
A little theory
The simple type of RNN that was used in my previous article has some issues. To see why, let’s recall the equations of that model:

Here, at is the vector that gets passed from one time step to the next.
The problem is that, at each time step, this vector is simply overwritten with the newly computed result. This makes it easy for our network to “forget” information that lies too far in the past, and that can be a problem. For example, if you have a long sentence and some word at the end strongly depends on the very first few words, then it is hard for such a network to carry that information all the way to the end.
One thing we can do to improve the results of our RNN in such cases is to emulate the notion of a memory cell. Instead of overwriting the information that needs to be passed to the next time steps with entirely new values, we will update only some parts of that information vector; the other parts can stay there for a longer period of time, thus behaving like a kind of “memory”.
How do we decide what to update and what to keep? For that we will use so-called gates, which are computed by the network itself. These gates are vectors of numbers between 0 and 1 (the output of a sigmoid activation) that establish how much of at gets updated and how much remains the same. The gates are computed from learnable parameters.
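To make the gating idea concrete, below is a minimal NumPy sketch of my own (the sizes and parameters are made up for illustration and are not part of the model we build later). A gate blends a newly computed candidate with the old state, element by element:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
state_size, input_size = 4, 3                   # made-up sizes, just for illustration
a_prev = rng.normal(size=(1, state_size))       # information vector from the previous time step
x_t = rng.normal(size=(1, input_size))          # input at the current time step

# learnable parameters (here just random placeholders)
W_gate = rng.normal(size=(state_size + input_size, state_size))
b_gate = np.zeros((1, state_size))
W_cand = rng.normal(size=(state_size + input_size, state_size))
b_cand = np.zeros((1, state_size))

concat = np.concatenate([a_prev, x_t], axis=1)  # [a_{t-1} ; x_t]
gate = sigmoid(concat @ W_gate + b_gate)        # values between 0 and 1
candidate = np.tanh(concat @ W_cand + b_cand)   # newly computed candidate value

a_t = gate * candidate + (1 - gate) * a_prev    # take part of the new, keep part of the old
print(gate)                                     # entries close to 0 keep the old value, close to 1 take the new one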
These ideas gave birth to many types of RNN architectures, two of which (GRU and LSTM) will be discussed next. The exact equations and the gates that are computed differ between these two architectures, but the main ideas are roughly the same.
Gated Recurrent Unit (GRU)
In a GRU unit, instead of computing the at vector directly, we first compute a candidate (ãt) for it. When computing this candidate, we don’t use the whole previous vector at-1, but use the relevance gate (Γr) to decide which parts of at-1 are relevant. Once we have the candidate ãt, we use the update gate (Γu) to decide how much information from the candidate goes into the final at, and the complement 1-Γu to decide how much we keep from the previous at-1.
Here are the equations for the GRU unit:
Γr = σ(Wr[at-1 ; xt] + br)
Γu = σ(Wu[at-1 ; xt] + bu)
ãt = tanh(Wa[Γr * at-1 ; xt] + ba)
at = Γu * ãt + (1 - Γu) * at-1
Where:
- [at-1 ; xt] – is the concatenation of the previous information vector (at-1) with the input of the current time step (xt)
- σ – is the sigmoid function
- Γr, Γu – are the relevance and update gates
- Wr, Wu, br, bu – are the weights and biases used to compute the relevance and update gates
- ãt – is the candidate for at
- Wa, ba – weights and biases used to compute ãt
- * – is used to denote element-wise multiplication
I think the equations are much clearer than the diagram, but nonetheless, below is also the diagram of a GRU, just in case you want to have a look at it:

In the diagram above, h is used instead of a, and r and z denote the relevance and update gates, respectively.
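If you prefer code over diagrams, here is a minimal NumPy sketch of mine of a single GRU step for one example (the weight shapes are only assumptions for illustration; the actual TensorFlow implementation comes later in the article):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(a_prev, x_t, Wr, br, Wu, bu, Wa, ba):
    # a_prev: (1, n_a), x_t: (1, n_x); each W has shape (n_a + n_x, n_a), each b has shape (1, n_a)
    concat = np.concatenate([a_prev, x_t], axis=1)            # [a_{t-1} ; x_t]
    gamma_r = sigmoid(concat @ Wr + br)                       # relevance gate
    gamma_u = sigmoid(concat @ Wu + bu)                       # update gate
    concat_r = np.concatenate([gamma_r * a_prev, x_t], axis=1)
    a_candidate = np.tanh(concat_r @ Wa + ba)                 # candidate for a_t
    return gamma_u * a_candidate + (1 - gamma_u) * a_prev     # new a_t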
Long Short-Term Memory (LSTM)
For the LSTM we have 3 gates instead of 2: the update gate (Γu), the forget gate (Γf), and the output gate (Γo). The gates are computed in the same way as for the GRU, just with a different set of parameters for each of them. The update gate serves the same purpose as before: establishing how much of the new candidate value goes into the final quantity. But now, instead of using the complement of the update gate, we use a separate gate, the forget gate, to determine what we should forget from the previous value. We no longer use a relevance gate when computing the candidate.
An important difference from the GRU is that we now have 2 distinct vectors of information that get passed between time steps: ct and at. at has the same role as in the previous architectures (GRU and simple RNN): it is used both for passing information to future time steps and for computing the output ŷt. ct is (usually) used only inside the LSTM unit, for passing information to future time steps. at is obtained by applying the output gate to tanh(ct).
Here are the equations for LSTM:
Γu = σ(Wu[at-1 ; xt] + bu)
Γf = σ(Wf[at-1 ; xt] + bf)
Γo = σ(Wo[at-1 ; xt] + bo)
c̃t = tanh(Wc[at-1 ; xt] + bc)
ct = Γu * c̃t + Γf * ct-1
at = Γo * tanh(ct)
Again, I think the equations are better for understanding what’s actually going on, but nonetheless, below is also a diagram of an LSTM unit in case you want to have a look:

Again, in the diagram they use h instead of a.
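As with the GRU, here is a minimal NumPy sketch of mine of a single LSTM step (again, the shapes are just assumptions for illustration; the real TensorFlow version comes in the implementation section):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, Wu, bu, Wf, bf, Wo, bo, Wc, bc):
    # a_prev, c_prev: (1, n_a), x_t: (1, n_x); each W has shape (n_a + n_x, n_a), each b has shape (1, n_a)
    concat = np.concatenate([a_prev, x_t], axis=1)   # [a_{t-1} ; x_t]
    gamma_u = sigmoid(concat @ Wu + bu)              # update gate
    gamma_f = sigmoid(concat @ Wf + bf)              # forget gate
    gamma_o = sigmoid(concat @ Wo + bo)              # output gate
    c_candidate = np.tanh(concat @ Wc + bc)          # candidate for c_t
    c_t = gamma_u * c_candidate + gamma_f * c_prev   # new memory cell
    a_t = gamma_o * np.tanh(c_t)                     # new information vector
    return a_t, c_t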
Deep RNNs
So, how do we build deeper RNNs?
Let’s first take a look at the simple RNN architecture that I implemented in my previous article:

Here, f can be any kind of RNN unit (like a simple one, or GRU, or LSTM).
For RNNs, there are a few ways we can create deeper networks:
- we either make each individual unit deeper (for example, instead of f or h being a single layer, we can use a deeper feed-forward sub-network in their place)
- or we can stack more recurrent units on top of each other, like this:

- or we can use a combination of both of the above
So, which approach should we use?
It depends on the problem. If most of the difficulty lies along the vertical axis (that is, predicting yt based on xt, with not much dependence along the time axis), then the first approach would be better.
But if the most difficult part is modeling the dependencies along the time (horizontal) axis, then the second approach would be better, because each depth level gets its own information vector that is passed on to the next time steps. This is the approach we’ll implement in the following section.
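Before diving into the full implementation, here is a rough sketch of mine of what stacking means in code (it reuses the NumPy import and the gru_step/sigmoid helpers from the GRU sketch above; all sizes and parameters are made up for illustration): at every time step, the a vector produced by one recurrent layer becomes the input of the layer above it, and the prediction is computed from the top layer.

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_x, n_a, n_y, depth, time_steps = 5, 4, 3, 3, 6    # made-up sizes, just for illustration

def make_gru_params(n_in, n_out):
    # returns (Wr, br, Wu, bu, Wa, ba) for one level, with small random values
    return tuple(rng.normal(size=s) * 0.1
                 for s in [(n_in + n_out, n_out), (1, n_out)] * 3)

# level 0 sees the real input, deeper levels see the a of the level below
params = [make_gru_params(n_x if level == 0 else n_a, n_a) for level in range(depth)]
Wy = rng.normal(size=(n_a, n_y)) * 0.1              # output layer weights
by = np.zeros((1, n_y))

a = [np.zeros((1, n_a)) for _ in range(depth)]      # one information vector per depth level
inputs = [rng.normal(size=(1, n_x)) for _ in range(time_steps)]

for x_t in inputs:                                  # loop over time steps
    for level in range(depth):                      # loop over the stacked recurrent layers
        a[level] = gru_step(a[level], x_t, *params[level])
        x_t = a[level]                              # this layer's output feeds the layer above
    y_t = softmax(x_t @ Wy + by)                    # prediction from the top layer at this time step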
Implementation
The network we implement has an architecture similar to the diagram above, and we will use either GRUs or LSTMs for all of f(0), f(1), … We will pass a parameter to the constructor of our Model that determines whether the f units are GRUs or LSTMs, and another parameter that controls how many GRU/LSTM units our model has stacked on top of each other (the depth).
The RNN we are going to implement will be used to learn a language model from this dataset of news headlines, similarly to what was done in my previous article. But this time, instead of a vocabulary of words, we will use a list of ASCII characters as our “vocabulary”, so it will be a character-level language model.
To show the advantage of a deep GRU/LSTM over a simple RNN, I also trained the simple RNN from the previous article with the same “vocabulary” of characters and included its results towards the end. You can clearly see the improvement of the deep GRU/LSTM over the simple RNN trained on the same “vocabulary” of characters.
We start the code by importing all the necessary libraries:
import numpy as np
import pandas as pd
import tensorflow as tf
import random
from typing import Union
from math import ceil, sqrt
from os import mkdir, listdir
The newline character (ASCII code 10) will be used as the end of sentence (EOS) marker. The “vocabulary” will be just a list of all the characters in the ASCII range [10, 127]. The characters below code 10 are not used anywhere here, so they’re not included in our “vocabulary”.
This code was adapted from the previous article, which used a vocabulary of words, so whenever you see the term “word” in the code, it actually refers to a character. In this implementation, characters are our “words”.
The words2onehot() function takes as parameters the vocabulary and a list of words and converts those words into a one-hot matrix (by using the word2index() function which simply returns the index of a word in the vocabulary).
EOS = chr(10)  # End of sentence


def build_vocabulary() -> list:
    # builds a vocabulary using ASCII characters
    vocabulary = [chr(i) for i in range(10, 128)]
    return vocabulary


def word2index(vocabulary: list, word: str) -> int:
    # returns the index of 'word' in the vocabulary
    return vocabulary.index(word)


def words2onehot(vocabulary: list, words: list) -> np.ndarray:
    # transforms the list of words given as argument into
    # a one-hot matrix representation using the index in the vocabulary
    n_words = len(words)
    n_voc = len(vocabulary)
    indices = np.array([word2index(vocabulary, word) for word in words])
    a = np.zeros((n_words, n_voc))
    a[np.arange(n_words), indices] = 1
    return a
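As a quick sanity check (this snippet is my own addition, not from the original notebook), here is how these helpers behave:

vocab = build_vocabulary()
print(len(vocab))                   # 118 characters (ASCII codes 10 to 127)
print(word2index(vocab, 'a'))       # 87, since ord('a') = 97 and the vocabulary starts at code 10
onehot = words2onehot(vocab, list('abc news' + EOS))
print(onehot.shape)                 # (9, 118): one row per character, including the EOS character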
In the sample_word() function we choose and return a character from our “vocabulary” according to the probability distribution given as a parameter. But we use one extra trick here to improve the sentences sampled from our model. Instead of choosing among all the characters in the “vocabulary”, we only consider the most probable characters, whose probabilities sum up to ‘threshold’; the remaining characters, accounting for the other 1-‘threshold’ of the probability mass, are ignored. In the Model class below, the threshold has a default value of 0.9.
def sample_word(vocabulary: list, prob: np.ndarray, threshold: float) -> str:
    # sample a word from the vocabulary according to 'prob'
    # probability distribution (the softmax output of our model)
    prob = prob.tolist()
    vocab_prob = [[vocabulary[i], prob[i]] for i in range(len(prob))]
    vocab_prob.sort(reverse=True, key=lambda e: e[1])
    s = 0
    for i in range(len(vocab_prob)):
        if s > threshold:
            vocab_prob[i][1] = 0
        s += vocab_prob[i][1]
    vocab = [w for w, p in vocab_prob]
    prob = np.array([p/s for w, p in vocab_prob])
    return np.random.choice(vocab, p=prob)
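To see the effect of the threshold, here is a toy example of my own: with the probabilities below and threshold=0.8, the least likely character is cut off and the remaining probabilities are renormalized, so 'c' can never be sampled:

toy_vocab = ['a', 'b', 'c']
toy_prob = np.array([0.7, 0.2, 0.1])
print(sample_word(toy_vocab, toy_prob, threshold=0.8))   # prints 'a' or 'b', never 'c'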
When we construct a Model object, we have to pass it the vocabulary, the size of the a vector (inter_time_step_size), which is the same for all depth levels, the unit type, which is either 'gru' or 'lstm', and the depth, which should be >= 1.
There is also a _init() method that initializes everything except the weights. It is a separate method because we will also need it when we load a saved model.
GRU and LSTM need different sets of weights, so we initialize them in 2 different methods: _init_gru() and _init_lstm(). These methods return a list of weights, to which we append the weights and bias used to compute the output ŷ.
class Model:
    def __init__(self, vocabulary: list, inter_time_step_size: int,
                 unit_type: str, depth: int):
        # unit_type can be one of: 'gru' or 'lstm'
        # depth >= 1
        self._init(vocabulary, inter_time_step_size, unit_type, depth)

        self.weights = (self._init_gru() if unit_type == 'gru'
                        else self._init_lstm())

        # weights and bias used to compute y (the softmax predictions)
        self.wy = tf.Variable(tf.random.normal(
            stddev=sqrt(2.0/(self.inter_time_step_size+self.vocab_size)),
            shape=(self.inter_time_step_size, self.vocab_size),
            dtype=tf.double))
        self.by = tf.Variable(tf.random.normal(
            stddev=sqrt(2.0/(1+self.vocab_size)),
            shape=(1, self.vocab_size),
            dtype=tf.double))
        self.weights.extend([self.wy, self.by])
In _init() we just do some simple assignments and compute the shapes and standard deviations for the weights and biases. Note that the first depth level uses a different weight shape than the deeper ones: level 0 takes the one-hot input (of size vocab_size) concatenated with its own a vector, while every deeper level takes the a vector of the level below (of size inter_time_step_size) concatenated with its own a vector, hence the 2*inter_time_step_size.
    def _init(self, vocabulary: list, inter_time_step_size: int,
              unit_type: str, depth: int):
        self.vocab = vocabulary
        self.vocab_size = len(vocabulary)
        self.inter_time_step_size = inter_time_step_size
        self.combined_size = self.vocab_size + self.inter_time_step_size
        self.unit_type = unit_type
        self.depth = depth

        self.weights_shape_0 = (self.combined_size, self.inter_time_step_size)
        self.weights_std_dev_0 = sqrt(2.0/(self.combined_size+self.inter_time_step_size))
        self.weights_shape_1 = (2*self.inter_time_step_size, self.inter_time_step_size)
        self.weights_std_dev_1 = sqrt(2.0/(3*self.inter_time_step_size))
        self.biases_shape = (1, self.inter_time_step_size)
        self.biases_std_dev = sqrt(2.0/(1+self.inter_time_step_size))

        self.w_shapes = [(self.weights_shape_0, self.weights_std_dev_0)]
        self.w_shapes.extend([(self.weights_shape_1, self.weights_std_dev_1)
                              for i in range(self.depth-1)])
        self.b_shapes = [(self.biases_shape, self.biases_std_dev)
                         for i in range(self.depth)]

        self.optimizer = tf.keras.optimizers.Adam()
In _init_gru() we initialize the weights and biases needed for all the GRU units. Each of the model’s attributes wr, br, wu, … will hold a list with as many tf.Variables as the model’s depth: at index 0 are the weights and biases for the first level after the input, at index 1 those for the next level, and so on up to level depth-1. All those tf.Variables are also returned as one flat list, to make our life easier when calling tape.gradient() and optimizer.apply_gradients().
    def _init_gru(self):
        for s in ['wr', 'wu', 'wa']:
            setattr(self, s, [tf.Variable(tf.random.normal(
                stddev=std_dev, shape=shape, dtype=tf.double))
                for shape, std_dev in self.w_shapes])
        for s in ['br', 'bu', 'ba']:
            setattr(self, s, [tf.Variable(tf.random.normal(
                stddev=std_dev, shape=shape, dtype=tf.double))
                for shape, std_dev in self.b_shapes])

        all_weights = []
        for w in [self.wr, self.br, self.wu, self.bu, self.wa, self.ba]:
            all_weights.extend(w)
        return all_weights
The _init_lstm() method works like _init_gru(), except that it creates the different set of parameters needed by the LSTM.
    def _init_lstm(self):
        for s in ['wu', 'wf', 'wo', 'wc']:
            setattr(self, s, [tf.Variable(tf.random.normal(
                stddev=std_dev, shape=shape, dtype=tf.double))
                for shape, std_dev in self.w_shapes])
        for s in ['bu', 'bf', 'bo', 'bc']:
            setattr(self, s, [tf.Variable(tf.random.normal(
                stddev=std_dev, shape=shape, dtype=tf.double))
                for shape, std_dev in self.b_shapes])

        all_weights = []
        for w in [self.wu, self.bu, self.wf, self.bf,
                  self.wo, self.bo, self.wc, self.bc]:
            all_weights.extend(w)
        return all_weights
reset_state() initializes the variables a and c with a list of matrices filled with zeros, one matrix for each depth level. This method should be called each time our model starts processing a new sentence/batch of sentences.
    def reset_state(self, num_samples: int) -> None:
        def get_init_values():
            return [tf.zeros((num_samples, self.inter_time_step_size),
                             dtype=tf.double)
                    for i in range(self.depth)]

        self.a = get_init_values()
        if self.unit_type == 'lstm':
            self.c = get_init_values()
The __call__() method produces a prediction ŷ for the given input x. If the ground truth y is also given, then instead of returning the prediction ŷ, it returns the loss.
It uses _call_level() to execute the “call” at each recurrent layer and feeds the output of one layer to the next one. At the end, the output of the last recurrent layer is given as input to the layer that computes ŷ (or loss).
    def __call__(self, x: Union[np.ndarray, tf.Tensor],
                 y: Union[np.ndarray, tf.Tensor, None] = None) -> tf.Tensor:
        for i in range(self.depth):
            x = self._call_level(i, x)

        y_logits = tf.linalg.matmul(x, self.wy)+self.by

        if y is None:
            # during prediction return softmax probabilities
            return tf.nn.softmax(y_logits)
        else:
            # during training return loss
            return tf.math.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(y, y_logits))
_call_level() simply returns the result of either _call_gru() or _call_lstm() depending on the value of self.unit_type.
    def _call_level(self, level: int,
                    x: Union[np.ndarray, tf.Tensor]) -> tf.Tensor:
        return (self._call_gru(level, x) if self.unit_type == 'gru'
                else self._call_lstm(level, x))
_call_gru() executes the GRU unit using the weights, biases, and the matrix a found at index ‘level’ (see above how they are initialized). At the end, it returns the new a matrix for the specified ‘level’.
    def _call_gru(self, level: int,
                  x: Union[np.ndarray, tf.Tensor]) -> tf.Tensor:
        n = x.shape[0]
        self.a[level] = self.a[level][0:n]

        concat_matrix = tf.concat([self.a[level], x], axis=1)

        relevance_gate = tf.math.sigmoid(
            tf.linalg.matmul(concat_matrix, self.wr[level]) + self.br[level])
        update_gate = tf.math.sigmoid(
            tf.linalg.matmul(concat_matrix, self.wu[level]) + self.bu[level])

        a_candidate = tf.math.tanh(
            tf.linalg.matmul(
                tf.concat([tf.math.multiply(relevance_gate, self.a[level]), x],
                          axis=1),
                self.wa[level]) + self.ba[level])

        self.a[level] = (tf.math.multiply(update_gate, a_candidate) +
                         tf.math.multiply((1-update_gate), self.a[level]))

        return self.a[level]
_call_lstm() is similar to _call_gru(), but it also uses c and computes the equations specific to the LSTM; it also returns the new a matrix.
    def _call_lstm(self, level: int,
                   x: Union[np.ndarray, tf.Tensor]) -> tf.Tensor:
        n = x.shape[0]
        self.a[level] = self.a[level][0:n]
        self.c[level] = self.c[level][0:n]

        concat_matrix = tf.concat([self.a[level], x], axis=1)

        update_gate = tf.math.sigmoid(
            tf.linalg.matmul(concat_matrix, self.wu[level]) + self.bu[level])
        forget_gate = tf.math.sigmoid(
            tf.linalg.matmul(concat_matrix, self.wf[level]) + self.bf[level])
        output_gate = tf.math.sigmoid(
            tf.linalg.matmul(concat_matrix, self.wo[level]) + self.bo[level])

        c_candidate = tf.math.tanh(
            tf.linalg.matmul(concat_matrix, self.wc[level]) + self.bc[level])

        self.c[level] = (tf.math.multiply(update_gate, c_candidate) +
                         tf.math.multiply(forget_gate, self.c[level]))
        self.a[level] = tf.math.multiply(output_gate,
                                         tf.math.tanh(self.c[level]))

        return self.a[level]
The fit() method is about the same as in the previous implementation, because most of the GRU/LSTM-related changes were made inside __call__() and __init__(). One difference is that we now need to call self.reset_state() before each batch is processed by the model. Also, in the innermost for loop we now append the EOS character, because it is not present in the input sentences by default.
    def fit(self, sentences: list, batch_size: int = 128,
            epochs: int = 10) -> None:
        n_sent = len(sentences)
        num_batches = ceil(n_sent / batch_size)
        for epoch in range(epochs):
            random.shuffle(sentences)
            start = 0
            batch_idx = 0
            while start < n_sent:
                print('Training model: %05.2f%%' % (
                    100*(epoch*num_batches+batch_idx+1)/(epochs*num_batches),),
                    end='\r')
                batch_idx += 1

                end = min(start+batch_size, n_sent)
                batch_sent = sentences[start:end]
                start = end

                batch_sent.sort(reverse=True, key=lambda s: len(s))
                init_num_words = len(batch_sent)
                self.reset_state(init_num_words)
                x = np.zeros((init_num_words, self.vocab_size))
                time_steps = len(batch_sent[0])+1

                with tf.GradientTape() as tape:
                    losses = []
                    for t in range(time_steps):
                        words = []
                        for i in range(init_num_words):
                            if t > len(batch_sent[i]):
                                break
                            if t == len(batch_sent[i]):
                                words.append(EOS)
                                break
                            words.append(batch_sent[i][t])
                        y = words2onehot(self.vocab, words)
                        n = y.shape[0]
                        loss = self(x[0:n], y)
                        losses.append(loss)
                        x = y
                    loss_value = tf.math.reduce_mean(losses)

                grads = tape.gradient(loss_value, self.weights)
                self.optimizer.apply_gradients(zip(grads, self.weights))
The sample() method generates a sequence of softmax probabilities and samples characters according to these probabilities until the EOS character is sampled. At each step, the sampled character is given as input to the next call of our model. A default threshold of 0.9 for character sampling is used.
    def sample(self, threshold: float = 0.9) -> str:
        # sample a new sentence from the learned model
        sentence = ''
        self.reset_state(1)
        x = np.zeros((1, self.vocab_size))
        while True:
            y_hat = self(x)
            word = sample_word(self.vocab,
                               tf.reshape(y_hat, (-1,)).numpy(),
                               threshold)
            if word == EOS:
                break
            sentence += word
            x = words2onehot(self.vocab, [word])
        return sentence
predict_next() works similarly to sample(), but instead of generating a sentence from scratch, we give it the first part of a sentence and it returns the continuation.
    def predict_next(self, sentence: str, threshold: float = 0.9) -> str:
        # predict the next part of the sentence given as parameter
        self.reset_state(1)
        for word in sentence.strip():
            x = words2onehot(self.vocab, [word])
            y_hat = self(x)

        s = ''
        while True:
            word = sample_word(self.vocab,
                               tf.reshape(y_hat, (-1,)).numpy(),
                               threshold)
            if word == EOS:
                break
            s += word
            x = words2onehot(self.vocab, [word])
            y_hat = self(x)
        return s
save(), obviously, saves all the information and all the parameters needed to reconstruct the current state of the model.
    def save(self, name: str) -> None:
        mkdir(f'./{name}')
        mkdir(f'./{name}/weights')

        with open(f'./{name}/vocabulary.txt', 'w') as f:
            f.write('[separator]'.join(self.vocab))
        with open(f'./{name}/inter_time_step_size.txt', 'w') as f:
            f.write(str(self.inter_time_step_size))
        with open(f'./{name}/unit_type.txt', 'w') as f:
            f.write(self.unit_type)
        with open(f'./{name}/depth.txt', 'w') as f:
            f.write(str(self.depth))

        if self.unit_type == 'gru':
            for s in ['wr', 'br', 'wu', 'bu', 'wa', 'ba']:
                for i in range(self.depth):
                    np.save(f'./{name}/weights/{s}_{i}.npy',
                            getattr(self, s)[i].numpy())
        else:
            for s in ['wu', 'bu', 'wf', 'bf', 'wo', 'bo', 'wc', 'bc']:
                for i in range(self.depth):
                    np.save(f'./{name}/weights/{s}_{i}.npy',
                            getattr(self, s)[i].numpy())

        np.save(f'./{name}/weights/wy.npy', self.wy.numpy())
        np.save(f'./{name}/weights/by.npy', self.by.numpy())
load() reads the files generated by save() and restores the model. It uses the _init() method, which is also used in the constructor, to do some of the work.
    def load(self, name: str) -> None:
        with open(f'./{name}/vocabulary.txt', 'r') as f:
            vocabulary = f.read().split('[separator]')
        with open(f'./{name}/inter_time_step_size.txt', 'r') as f:
            inter_time_step_size = int(f.read())
        with open(f'./{name}/unit_type.txt', 'r') as f:
            unit_type = f.read()
        with open(f'./{name}/depth.txt', 'r') as f:
            depth = int(f.read())

        self._init(vocabulary, inter_time_step_size, unit_type, depth)

        weights_names = []
        filenames = listdir(f'./{name}/weights')
        filenames.sort()
        for filename in filenames:
            if filename in ['wy.npy', 'by.npy']:
                continue
            attr_name, index = filename.replace('.npy', '').split('_')
            index = int(index)
            if index == 0:
                setattr(self, attr_name, [])
                weights_names.append(attr_name)
            getattr(self, attr_name).append(
                tf.Variable(np.load(f'./{name}/weights/{filename}')))

        self.wy = tf.Variable(np.load(f'./{name}/weights/wy.npy'))
        self.by = tf.Variable(np.load(f'./{name}/weights/by.npy'))

        # flatten the per-level lists so that self.weights has the same
        # structure as in __init__ (otherwise training could not be resumed
        # after loading the model)
        self.weights = []
        for weight_name in weights_names:
            self.weights.extend(getattr(self, weight_name))
        self.weights.extend([self.wy, self.by])
And that was the Model class. Next we’re going to use it as promised to learn a language model from the news headlines dataset.
Load the data into a pandas DataFrame:
df = pd.read_csv('../input/million-headlines/abcnews-date-text.csv')
df

Then we construct our “vocabulary” of characters and the sentences list.
vocabulary = build_vocabulary()
sentences = df['headline_text'].values.tolist()
Then we construct a model with 3 layers of LSTM units, plus a fourth layer for computing the softmax output. We train it for 20 epochs and save the model.
model = Model(vocabulary, inter_time_step_size=512, unit_type='lstm', depth=3)
model.fit(sentences, batch_size=1024, epochs=20)
model.save('news_headlines_model')
# model.load('news_headlines_model')
Now, it’s time to see how good our model is. So, we sample 20 sentences from it.
for i in range(20):
    print(model.sample())
And this is the output:
lash dangerous flyer site the nsw numbers for expectations govt to probe farmers bust two boots planning to recover bikies man jailed for help in bell man mps slam deaths findings at flemington deaths in australia steph leading from darling weather estimated to merger perth car cast spend on second star routes as hit hard boost flooding claremont mayor rests public school closure at back bald trials dr asks democratic point for lack of albumstein to repaid warns ashley man killed out after suspicious death of driving line don strauss considers action on border spill defences spending cuts family finds wa experts rally about fiji train line on rates commonwealth games party to get companies as murray review chris dairy farmers hand up control of season randwick authorities india sunshine coast water agent not pose survive landmarkets planet accc art in euro 2012 with china season legacy bush up for iraq off wall street recovery after dairy farmers parents rally as senator plays down mining ban dependency is bill contractor promises war control laws boost with restrictions ministers pass under spotlight on alert as bushfire not resigns darwin councils money clean up costly plans in europe loss site
It’s far from perfect, but it’s much better than the simple RNN with no GRU/LSTM units and no extra depth.
Here is the output that I got from that simple RNN:
returls thees to fover christs geths hites words to new grina illedra rouls deating ajustry million at stades exter tehtal it roitsed remaid fitiftte tolled saseeks sworibit spoadaron jail usee pont weachars drignt mastle suysay armwya christmast ban brishtire commar ashamfusting to says racistant as rulests villand wiotinops ships sexicans mexical gathy dronses to zeares po inquitic and lionst del rings treberter told jot cerms card prelither istamatations coostared amation nrn fruswertion wayar hompital acoirs natiomitions greenders tombam murders says sedera phes abood grate mccertson awasters ompitions lakes aster fallers tiss timitishance budgets cobandancy in de pahide investrials word on bust agaits sram artions setuttor whinds clear ranalidn as ralls
As you can see, character-level language models are much harder to learn. That simple RNN worked quite well when using a vocabulary of words, but with a character “vocabulary” it struggles to even generate valid words.
So, when dealing with a “vocabulary” of characters, we need more complex models (like deep GRUs/LSTMs).
Although character-level models may seem like a bad idea, they also have some advantages: they are more flexible, and we are not limited to the words we chose to include in our vocabulary.
Now, let’s see what it predicts next for this phrase:
s = 'scientists just discovered'
s += model.predict_next(s)
s
# Output: scientists just discovered woman fined for illegal immigration
… Not exactly what I was expecting, but it is still better than the output of the simple RNN:
scientists just discovered brane janled on the thyed for ta the ibleach hay thdonkkers out of waters marling parinenser mixised death month san mentriats gereralt hirting grefeding accters finices
And that’s it for this article.
The Jupyter notebook can be found on Kaggle. The notebook of the simple RNN can also be found here.
I hope you found this information useful and thanks for reading!
This article is also posted on Medium here. Feel free to have a look!
Hi, looks fantastic! Could you go one step back? I’m on a Mac using Catalina. What software do I need to install before I can use this example?
You need Python and some extra packages (check the imports section of the code in this article). I’m not using a Mac, so I’m not sure which would be the best way to get these on your system. But, I guess, Anaconda would be a good option. You can get the installer from here: https://www.anaconda.com/ and there is also some documentation on how to get started.