And using it to build a language model for news headlines

In this article I’ll first explain a little theory about Recurrent Neural Networks (RNNs) for those who are new to them, then I’ll show the implementation that I wrote using TensorFlow. We’ll go through the code snippet by snippet, along with the explanations and the output that it produced.

The dataset used is A Million News Headlines.

A little theory about RNNs

Let’s first recall what feed-forward neural networks are: they are functions that map the input x to an output ŷ which is an estimate for the true label y. They can be represented like this:
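In symbols, a minimal way to write this mapping is the following, where θ is an assumed name for the network’s weights and biases (it is not used elsewhere in this article):

```latex
\hat{y} = f(x; \theta), \qquad \hat{y} \approx y
```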

Or, can be drawn like this:

The main idea, though, is that they can only do a one-to-one mapping: each input element x produces exactly one output ŷ. But what if we need many-to-many, one-to-many, or many-to-one mappings? What if we have several input elements xt and we want our network to produce several outputs yt, where the number of inputs and the number of outputs may be the same or may differ?

A typical example of this type of data would be text data, where you have a sequence of words. Maybe you want to output just one label at the end to classify it as a positive or negative review, or maybe you want for each word to output 1 if it is an animal and 0 otherwise, etc.

Other examples of sequence data, besides text, are audio and video.

When we work with sequence data, we typically assign to each element of the sequence a number called the time step.

So, can we still use a feed-forward neural network (FFNN) to work with sequences?

Theoretically, yes. We can concatenate all the inputs into one huge input vector and all the outputs into one huge output vector, and then use a FFNN to do the learning.

But, is that a good idea?



Well… Here are some problems with this approach:

  • These concatenated vectors will be huge, so the number of parameters needed by the neural network will blow up and the network will be hard to train
  • There will not be enough flexibility in what we can do. The inputs and outputs of a FFNN need to be fixed. But what if one example in our dataset has 10 words and another has 100 words?

Instead of concatenating all the vectors from all time steps, we will use our input/output vectors (x1, y1), (x2, y2), … (xt, yt), … one pair at a time.

We will feed each pair (xt, yt) into the same neural network in a loop for each time step.

This way we get rid of some of the problems listed above, but now we have other problems:

  • At each time step, the neural network will make the prediction only based on the current input and doesn’t take into account information about the previous time steps.
  • The number of output elements is constrained to be equal to the number of input elements. We may want this to be different.

We can solve these issues by making the following modifications:

  • Instead of taking as input only x, our neural network will also receive an extra vector, let’s call it a, that is supposed to carry information about the previous time steps.
  • Then, instead of producing only one output, y, our neural network will also output a new version of the vector a that is meant to contain information about the current time step and the previous ones and to be passed forward to the next time step.

So now, our neural network can be represented as a function of both a and x, f(a, x):

And can be drawn like this:

This new type of architecture solves the previous issues: it incorporates information about previous time steps, and it can be set up so that the number of inputs and outputs differ (for example, by using the first part of the network for encoding the inputs into a, and then decoding a into a different number of outputs [encoder-decoder architecture]).

At each time step, our neural network is the same, so it actually feeds the vector a into itself. Because of this, these networks are sometimes also represented graphically this way:

Hence the name Recurrent Neural Networks.

But I think this picture is a little confusing and the unfolded version above is much clearer.

But this f(a, x) can be many things.

Let’s see the exact neural network that we will implement next:

Where f and h do the following mappings:

And here are the exact equations that we’re going to use:
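Written out in LaTeX, and reconstructed from the symbol definitions and dimensions listed in this section (so treat the exact notation as an assumption), the two equations are roughly:

```latex
\begin{aligned}
a_t &= \tanh\left([a_{t-1} ; x_t]\, W_a + b_a\right) \\
\hat{y}_t &= \mathrm{softmax}\left(a_t\, W_y + b_y\right)
\end{aligned}
```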


  • [at-1 ; xt] – represents the concatenation of a and x vectors (or matrices for batch sizes > 1)
  • Wa, Wy – the weights matrices that are used to obtain a, respectively ŷ
  • ba, by – the biases that are used to obtain a, respectively ŷ

And the dimensions of these quantities are as follows:

  • n = size of vocabulary
  • m = the size that we choose for a
  • x: 1 x n
  • a: 1 x m
  • Wa: (n+m) x m
  • ba: 1 x m
  • Wy: m x n
  • by: 1 x n
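To make these shapes concrete, here is a minimal NumPy sketch of a single time step. It is only an illustration under assumed toy sizes (n = 5, m = 3) and random weights; the helper names softmax and rnn_step are hypothetical and are not part of the TensorFlow implementation shown later.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    # subtract the row-wise max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def rnn_step(a_prev, x, Wa, ba, Wy, by):
    # a_t = tanh([a_{t-1} ; x_t] Wa + ba)
    a_new = np.tanh(np.concatenate([a_prev, x], axis=1) @ Wa + ba)
    # y_hat_t = softmax(a_t Wy + by)
    y_hat = softmax(a_new @ Wy + by)
    return a_new, y_hat

n, m = 5, 3  # toy sizes: vocabulary of 5 words, a of size 3
rng = np.random.default_rng(0)
Wa = rng.normal(scale=0.1, size=(n + m, m))
ba = np.zeros((1, m))
Wy = rng.normal(scale=0.1, size=(m, n))
by = np.zeros((1, n))

a0 = np.zeros((1, m))  # no previous information yet
x0 = np.zeros((1, n))  # no previous word yet
a1, y_hat1 = rnn_step(a0, x0, Wa, ba, Wy, by)
print(a1.shape, y_hat1.shape)  # (1, 3) (1, 5)
```

Because the inputs and biases are all zeros here, the first prediction is a uniform distribution over the 5 toy words; a trained model would learn non-zero biases and weights that change this.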

This neural network architecture is the one that we’re going to implement next using TensorFlow.

A few words about Language Models

We will use this implementation of a simple RNN to learn a language model based on the news headlines dataset (link above in the intro).

So, what is a language model?

A language model is a probability distribution over sequences of words. Basically, at each time step our RNN will output softmax probabilities for each word in the vocabulary. These probabilities represent what our model “thinks” the next word in the sentence may be, given information about the previous words.

When we train such an RNN, we use the one-hot representation of a word as the “y”, then at the next time step we use the same one-hot vector as the “x”. So, we input as x’s the one-hot vectors for the words so far in the sentence, and we use as y the one-hot vector for the next word that the RNN should predict; then that “y” becomes the new input “x”, and so on.

For the first a and x we just use zeros, since we have nothing previously.
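The shifting of “y” into the next “x” can be sketched with plain words instead of one-hot vectors. This is just an illustration: the training_pairs helper and the START placeholder (standing for the all-zeros x of the first time step) are hypothetical names, not part of the implementation below.

```python
EOS = '<EOS>'
START = None  # stands for the all-zeros x used at the first time step

def training_pairs(words: list) -> list:
    # targets are the words of the sentence followed by EOS;
    # inputs are the same sequence shifted right by one position
    targets = words + [EOS]
    inputs = [START] + targets[:-1]
    return list(zip(inputs, targets))

for x_word, y_word in training_pairs(['rain', 'helps', 'farmers']):
    print(x_word, '->', y_word)
# None -> rain
# rain -> helps
# helps -> farmers
# farmers -> <EOS>
```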

After we have trained an RNN this way, we should be able to sample sentences from the learned model like this:

  • when we call the model the first time (with zeros for a and x) it will output the probabilities for each word of being the first word in the sentence
  • then we choose one word according to this probability distribution and feed its one-hot vector as the next x
  • then the model should output the probabilities vector for the second word, given our choice for the first
  • and so on, after t chosen words the model will output probabilities for the t+1 word, until we choose EOS (a special word representing the end of sentence)
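The sampling procedure above can be sketched with a toy stand-in for the trained model. Everything here is an assumption made for illustration: toy_model is a hand-written table of next-word probabilities that looks only at the previous word, whereas the real RNN conditions on all previous words through the vector a.

```python
import random

EOS = '<EOS>'
# hypothetical next-word probabilities, keyed by the previous word
# (None = start of sentence); a trained RNN would produce these instead
toy_model = {
    None: {'police': 0.6, 'rain': 0.4},
    'police': {'investigate': 0.7, EOS: 0.3},
    'rain': {'continues': 0.8, EOS: 0.2},
    'investigate': {EOS: 1.0},
    'continues': {EOS: 1.0},
}

def sample_sentence(model: dict, rng: random.Random) -> list:
    words, prev = [], None
    while True:
        dist = model[prev]
        # draw one word according to the probabilities for this step
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == EOS:
            return words
        words.append(word)
        prev = word  # the chosen word becomes the next input

print(' '.join(sample_sentence(toy_model, random.Random(0))))
```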

Implementation in Python using TensorFlow

Now, let’s see the code.

Firstly we import all the necessary libraries that we’re going to use:

import numpy as np
import pandas as pd
import tensorflow as tf
import random
from typing import Union
from math import ceil
from os import mkdir

Then, we create a function that constructs the vocabulary, which is just a list with the most frequent words in our dataset. The build_vocabulary() function takes 2 parameters: “sentences” which is a list of all the sentences in our dataset, and “words_to_keep” which is the number of the most frequent words to keep in the vocabulary among all the words in our list of sentences.

Why do we want to keep only some of the words and not all of them? Because using all of them can be very expensive computationally on a big dataset with a lot of words. For example, the dataset that we’re going to use has over 100,000 unique words, and if we used all of them there would not even be enough memory to allocate all the parameters on some of the GPU instances that are available for free out there (like on Kaggle).

In our example below we will limit the vocabulary to the 10,000 most frequent words.

But if we keep just some of the words, how will we handle the situations where we encounter a word that is not in the vocabulary?

We need to replace all occurrences in our dataset of the words that are not in the vocabulary with a special word “<UNK>” representing an unknown word (and assuming the “<UNK>” expression is not already in the vocabulary).

Another special word that we use is “<EOS>” that represents the end of sentence.

These 2 special words are always inserted into the vocabulary in addition to the selected ‘words_to_keep’ top words.

UNK = '<UNK>' # Unknown word
EOS = '<EOS>' # End of sentence

def build_vocabulary(sentences: list, words_to_keep: int) -> list:
    # builds a vocabulary using 'words_to_keep' most frequent words
    # encountered in the list of sentences
    vocabulary = {}
    n = len(sentences)
    for i, s in enumerate(sentences):
        print('Creating vocabulary: %05.2f%%' % (100*(i+1)/n,), end='\r')
        for word in s.strip().split():
            vocabulary[word] = vocabulary.get(word, 0) + 1
    vocabulary = list(vocabulary.items())
    vocabulary.sort(reverse=True, key=lambda e: e[1])
    vocabulary = vocabulary[0:words_to_keep]
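    vocabulary = [e[0] for e in vocabulary]
    # the 2 special words are always part of the vocabulary
    vocabulary.append(UNK)
    vocabulary.append(EOS)
    print('Done'+(50*' '))
    return vocabulary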
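The same counting logic can also be expressed with the standard library’s collections.Counter. This is only an alternative sketch, not the version used in the rest of the article; the name build_vocabulary_counter is hypothetical and the progress printing is left out.

```python
from collections import Counter

UNK = '<UNK>'
EOS = '<EOS>'

def build_vocabulary_counter(sentences: list, words_to_keep: int) -> list:
    # count every whitespace-separated word across all sentences
    counts = Counter(word for s in sentences for word in s.strip().split())
    # most_common() returns (word, count) pairs, most frequent first
    vocabulary = [word for word, _ in counts.most_common(words_to_keep)]
    # the two special words are always added on top of the kept words
    return vocabulary + [UNK, EOS]

print(build_vocabulary_counter(['a b a', 'b a c'], words_to_keep=2))
# ['a', 'b', '<UNK>', '<EOS>']
```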

Then we create the function build_sentences() that does these things to our sentences list:

  • changes each sentence from a string to a list of words
  • replaces words that are not in vocabulary with UNK
  • appends EOS at the end of each sentence

def build_sentences(vocabulary: list, sentences: list) -> list:
    # transforms the list of sentences into a list of lists of words
    # replacing words that are not in the vocabulary with <UNK>
    # and appending <EOS> at the end of each sentence
    vocab_set = set(vocabulary)  # membership tests on a set are O(1)
    processed_sent = []
    n = len(sentences)
    for i, sent in enumerate(sentences):
        print('Creating sentences list: %05.2f%%' % (100*(i+1)/n,), end='\r')
        s = []
        for word in sent.strip().split():
            if word not in vocab_set:
                word = UNK
            s.append(word)
        s.append(EOS)
        processed_sent.append(s)
    print('Done'+(50*' '))
    return processed_sent

We will need the words2onehot() function to convert words to one-hot vectors. Internally it uses word2index(), which returns the index of a word in the vocabulary.

def word2index(vocabulary: list, word: str) -> int:
    # returns the index of 'word' in the vocabulary
    return vocabulary.index(word)

def words2onehot(vocabulary: list, words: list) -> np.ndarray:
    # transforms the list of words given as argument into
    # a one-hot matrix representation using the index in the vocabulary
    n_words = len(words)
    n_voc = len(vocabulary)
    indices = np.array([word2index(vocabulary, word) for word in words])
    a = np.zeros((n_words, n_voc))
    a[np.arange(n_words), indices] = 1
    return a
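Since list.index() scans the vocabulary from the start on every call, a dictionary lookup may be worth considering for large vocabularies. The sketch below shows that variant on a toy vocabulary; word_to_index and words2onehot_fast are hypothetical names, and the rest of the article keeps using the list-based functions.

```python
import numpy as np

vocabulary = ['rain', 'police', '<UNK>', '<EOS>']  # toy vocabulary
# hypothetical helper: map each word to its position once, up front
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def words2onehot_fast(word_to_index: dict, words: list) -> np.ndarray:
    # one row per word, one column per vocabulary entry
    indices = np.array([word_to_index[w] for w in words])
    onehot = np.zeros((len(words), len(word_to_index)))
    onehot[np.arange(len(words)), indices] = 1
    return onehot

print(words2onehot_fast(word_to_index, ['police', '<EOS>']))
# [[0. 1. 0. 0.]
#  [0. 0. 0. 1.]]
```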

The sample_word() function will return a word at random from the vocabulary according to the probability distribution passed as parameter. If the chosen word is UNK, it continues choosing words until it’s != UNK and returns it.

def sample_word(vocabulary: list, prob: np.ndarray) -> str:
    # sample a word from the vocabulary according to 'prob'
    # probability distribution (the softmax output of our model)
    # until it is != <UNK>
    while True:
        word = np.random.choice(vocabulary, p=prob)
        if word != UNK:
            return word

Then we define our Model class, starting with the __init__() method, which takes as parameters the vocabulary and the size that we choose for the a vector. This method creates the weights and biases as tf.Variables initialized with values drawn from a normal distribution whose standard deviation depends on the sizes of that matrix.

class Model:
    def __init__(self, vocabulary: list = [], a_size: int = 0):
        self.vocab = vocabulary
        self.vocab_size = len(vocabulary)
        self.a_size = a_size
        self.combined_size = self.vocab_size + self.a_size
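        # weights and bias used to compute the new a
        # (a = vector that is passed to the next time step)
        # stddev shrinks with the matrix dimensions
        # (this particular scaling is an assumption)
        self.wa = tf.Variable(tf.random.normal(
            stddev=1.0/(1+self.combined_size+self.a_size),
            shape=(self.combined_size, self.a_size),
            dtype=tf.double))
        self.ba = tf.Variable(tf.random.normal(
            stddev=1.0/(1+self.a_size),
            shape=(1, self.a_size),
            dtype=tf.double))
        # weights and bias used to compute y (the softmax predictions)
        self.wy = tf.Variable(tf.random.normal(
            stddev=1.0/(1+self.a_size+self.vocab_size),
            shape=(self.a_size, self.vocab_size),
            dtype=tf.double))
        self.by = tf.Variable(tf.random.normal(
            stddev=1.0/(1+self.vocab_size),
            shape=(1, self.vocab_size),
            dtype=tf.double))
        self.weights = [self.wa, self.ba, self.wy, self.by]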
        self.optimizer = tf.keras.optimizers.Adam()

The __call__() method is the one that allows us to call our model object with the vector a from previous time step and the x of current time step, and produce the next a and the predictions ŷ and return them as a tuple. But if we also pass the true label y as parameter, then instead of ŷ it returns the cross entropy loss. That’s because if we already have the true y, then it means that we are in the training loop and we need just the loss, not to make predictions.

def __call__(self,
             a: Union[np.ndarray, tf.Tensor],
             x: Union[np.ndarray, tf.Tensor],
             y: Union[np.ndarray, tf.Tensor, None] = None) -> tuple:
    # new a = tanh([a ; x] Wa + ba)
    a_new = tf.math.tanh(
        tf.linalg.matmul(tf.concat([a, x], axis=1), self.wa) + self.ba)
    # logits for the next-word prediction: a_new Wy + by
    y_logits = tf.linalg.matmul(a_new, self.wy) + self.by
    if y is None:
        # during prediction return softmax probabilities
        return (a_new, tf.nn.softmax(y_logits))
    else:
        # during training return loss
        return (a_new, tf.math.reduce_mean(
                    tf.nn.softmax_cross_entropy_with_logits(y, y_logits)))
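To see what the loss computes, here is a small NumPy sketch of softmax cross-entropy on two hand-picked examples. It is meant as an illustration of the formula only; softmax_cross_entropy is a hypothetical helper, and the toy logits and labels are made up.

```python
import numpy as np

def softmax_cross_entropy(y: np.ndarray, logits: np.ndarray) -> np.ndarray:
    # log-softmax computed stably: shift by the row max before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # with one-hot y this picks out -log(probability of the true word)
    return -(y * log_probs).sum(axis=1)

logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
per_example = softmax_cross_entropy(y, logits)
print(per_example)         # one loss value per row of the batch
print(per_example.mean())  # the batch average, as reduce_mean would give
```

The second row has all-equal logits, so the model assigns probability 1/3 to the true word and the loss for that row is log(3); the first row favors the true word, so its loss is smaller.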

Now comes the fit() method, where all the “magic” happens. A lot is going on here, but these are the main things that differ from a feed-forward neural network’s training loop:

  • We need to do each time step separately. That’s because we need to wait for the new a matrix to be computed before going to the next time step. So, we cannot batch data along the time dimension.
  • We use batches that consist of data for the same time step (that is, the same word positions: a batch with first words, a second batch with second words, and so on). Here, by a batch I mean what is fed “at once” into the model.
  • To do that we select first ‘batch_size’ number of sentences, then we sort them in the descending order of their word counts.
  • This means that the number of time steps is just the number of words in the first sentence after sorting.
  • Then, inside the with tf.GradientTape() as tape: we iterate over these time steps, and at each time step t we take the tth column of words, turn it into a one-hot matrix and use it for computing the loss at step t
  • These one-hot matrices will shrink along the time dimension, because the shorter sentences will be already processed by the model. So, we make sure that the sizes of a and x shrink accordingly (that’s why we do a[0:n], x[0:n])
  • Append all these losses into a list that we average at the end using loss_value = tf.math.reduce_mean(losses)

def fit(self,
        sentences: list,
        batch_size: int = 128,
        epochs: int = 10) -> None:
    n_sent = len(sentences)
    num_batches = ceil(n_sent / batch_size)
    for epoch in range(epochs):
        start = 0
        batch_idx = 0
        while start < n_sent:
            print('Training model: %05.2f%%' %
                  (100*(epoch*num_batches+batch_idx+1)/(epochs*num_batches),),
                  end='\r')
            batch_idx += 1
            end = min(start+batch_size, n_sent)
            batch_sent = sentences[start:end]
            start = end
            # longest sentence first, so finished sentences fall off the end
            batch_sent.sort(reverse=True, key=lambda s: len(s))
            init_num_words = len(batch_sent)
            a = np.zeros((init_num_words, self.a_size))
            x = np.zeros((init_num_words, self.vocab_size))
            time_steps = len(batch_sent[0])
            with tf.GradientTape() as tape:
                losses = []
                for t in range(time_steps):
                    # collect the t-th word of every sentence that has one
                    words = []
                    for i in range(init_num_words):
                        if t >= len(batch_sent[i]):
                            break
                        words.append(batch_sent[i][t])
                    y = words2onehot(self.vocab, words)
                    n = y.shape[0]
                    a, loss = self(a[0:n], x[0:n], y)
                    losses.append(loss)
                    x = y
                loss_value = tf.math.reduce_mean(losses)
            grads = tape.gradient(loss_value, self.weights)
            self.optimizer.apply_gradients(zip(grads, self.weights))
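The way a batch is sliced by time step, and why it shrinks, can be shown on a toy batch without any TensorFlow. This is a simplified sketch of the idea only; columns_by_time_step is a hypothetical helper and the three headlines are made up.

```python
def columns_by_time_step(batch: list) -> list:
    # sort longest sentence first, so that at every time step the
    # sentences that still have a word form a prefix of the batch
    batch = sorted(batch, key=len, reverse=True)
    columns = []
    for t in range(len(batch[0])):
        columns.append([s[t] for s in batch if t < len(s)])
    return columns

batch = [['rain', '<EOS>'],
         ['police', 'probe', 'crash', '<EOS>'],
         ['man', 'charged', '<EOS>']]
for t, words in enumerate(columns_by_time_step(batch)):
    print(t, words)
# 0 ['police', 'man', 'rain']
# 1 ['probe', 'charged', '<EOS>']
# 2 ['crash', '<EOS>']
# 3 ['<EOS>']
```

Each printed row corresponds to one call of the model inside the training loop, and the decreasing row length is the reason a and x are sliced with [0:n].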

The sample() method keeps choosing words from the vocabulary according to the probabilities produced by our model, until EOS is chosen. After one word is chosen, it is turned into a one-hot vector and used as input in the next time step, and so on. All these words are concatenated and returned as a sentence.

def sample(self) -> str:
    # sample a new sentence from the learned model
    sentence = ''
    a = np.zeros((1, self.a_size))
    x = np.zeros((1, self.vocab_size))
    while True:
        a, y_hat = self(a, x)
        word = sample_word(self.vocab, tf.reshape(y_hat, (-1,)))
        if word == EOS:
            break
        sentence += ' '+word
        x = words2onehot(self.vocab, [word])
    return sentence[1:]

predict_next() works similarly to sample(), but instead of producing a new sentence from scratch, it takes the first part of the sentence as a parameter, feeds it into the model, and then continues sampling the next words until EOS is reached.

def predict_next(self, sentence: str) -> str:
    # predict the next part of the sentence given as parameter
    a = np.zeros((1, self.a_size))
    # feed the given words one by one to build up the vector a
    for word in sentence.strip().split():
        if word not in self.vocab:
            word = UNK
        x = words2onehot(self.vocab, [word])
        a, y_hat = self(a, x)
    s = ''
    # keep sampling words until <EOS> is chosen
    while True:
        word = sample_word(self.vocab, tf.reshape(y_hat, (-1,)))
        if word == EOS:
            break
        s += ' '+word
        x = words2onehot(self.vocab, [word])
        a, y_hat = self(a, x)
    return s

The save() and load() methods, obviously, save the parameters and all the information needed about the model into a format that load() can easily reconstruct.

def save(self, name: str) -> None:
    # creates the directory 'name' and stores everything inside it
    mkdir(f'./{name}')
    with open(f'./{name}/vocabulary.txt', 'w') as f:
        f.write(','.join(self.vocab))
    with open(f'./{name}/a_size.txt', 'w') as f:
        f.write(str(self.a_size))
    np.save(f'./{name}/wa.npy', self.wa.numpy())
    np.save(f'./{name}/ba.npy', self.ba.numpy())
    np.save(f'./{name}/wy.npy', self.wy.numpy())
    np.save(f'./{name}/by.npy', self.by.numpy())

def load(self, name: str) -> None:
    with open(f'./{name}/vocabulary.txt', 'r') as f:
        self.vocab = f.read().split(',')
    with open(f'./{name}/a_size.txt', 'r') as f:
        self.a_size = int(f.read())
    self.vocab_size = len(self.vocab)
    self.combined_size = self.vocab_size + self.a_size
    self.wa = tf.Variable(np.load(f'./{name}/wa.npy'))
    self.ba = tf.Variable(np.load(f'./{name}/ba.npy'))
    self.wy = tf.Variable(np.load(f'./{name}/wy.npy'))
    self.by = tf.Variable(np.load(f'./{name}/by.npy'))
    self.weights = [self.wa, self.ba, self.wy, self.by]

That was all about the Model class. Next let’s see how to use it.

We start by reading the dataset into a pandas DataFrame:

df = pd.read_csv('../input/million-headlines/abcnews-date-text.csv')

Then we construct the vocabulary using the most frequent 10k words. After that we process the sentences to convert them into a format that’s more convenient to use.

vocabulary = build_vocabulary(df['headline_text'].values.tolist(), words_to_keep=10000)
sentences = build_sentences(vocabulary, df['headline_text'].values.tolist())

We create a Model object by passing the vocabulary and the size of the vector that passes information among time steps (1024). Then we call fit() with a batch size of 128 and 10 epochs. The training took about 6 hours on Kaggle with a GPU.

model = Model(vocabulary, 1024)
model.fit(sentences, batch_size=128, epochs=10)

After training, don’t forget to save the model:

model.save('news_headlines_model')
# model.load('news_headlines_model')

Now let’s see what sentences our model produces:

for i in range(20):
    print(model.sample())

vandalism closes defence hill suicide solutions
majority in hospital to change gay marriage
nt has honoured for proposed to australia
canegrowers first cervical cancer impacts data
us market closes ahead of sustainable wa
png manus island protest
road crash wont hurt indoor assets report
body found in burma recommendation
2015 chief quits question family for
brisbane drug used dies drown in western tasmania
full interview is offered run concerns to take police
volcano alert of missing at rottnest island
darwin cup crackdown on pay for tour african front official
cowboys manager charged over rooney
cooper installed by cia executive
five hospitalised after death coach from 14yo girl
intelligence ready to russia in sydney
pearson wins un remarks a show
cameron smith is higher in level series in nsw
harvey on trump

It’s not that bad. I mean, it’s much better than drawing words completely at random. Obviously, our model is still far from perfect, but this was meant to be just a simple example to get you started with RNNs, not to create breakthrough language models.

And by the way, some of the produced sentences seem quite funny to me.

Here is also an example of continuing a sentence:

s = 'scientists just discovered'
s += model.predict_next(s)
# Output: 'scientists just discovered bureau fisherman mount wetlands'

This one really surprised me.

You can find the full notebook on Kaggle here.

I hope you found this information useful and thanks for reading!

This article is also posted on Medium here. Feel free to have a look!

