5-Line GPT-Style Text Generation in Python with TensorFlow/Keras

Transformers, even though released in 2017, have only started gaining significant traction in the last couple of years. With the proliferation of the technology through platforms like HuggingFace, NLP and Large Language Models (LLMs) have become more accessible than ever. Yet - even with all the hype around them and with many theory-oriented guides, there aren't many custom implementations online, and the resources aren't as readily available as for some other network types that have been around for longer. While you could simplify your workflow by using a pre-built Transformer from HuggingFace (the topic of another guide) - you can get a feel for how it works by building one yourself, before abstracting it away through a library. We'll be focusing on building, rather than theory and optimization, here.

In this guide, we'll be building an Autoregressive Language Model to generate text. We'll be focusing on the practical and minimalistic/concise aspects of loading data, splitting it, vectorizing it, building a model, writing a custom callback and training/inference. Each of these tasks can be spun off into more detailed guides, so we'll keep the implementation as a generic one, leaving room for customization and optimization depending on your own dataset.

Types of LLMs and GPT-Fyodor

While categorization can get much more intricate - you can broadly categorize Transformer-based language models into three categories:

  • Encoder-Based Models - ALBERT, BERT, DistilBERT, RoBERTa
  • Decoder-Based Models - GPT, GPT-2, GPT-3, TransformerXL
  • Seq2Seq Models - BART, mBART, T5

Encoder-based models only use a Transformer encoder in their architecture (typically, stacked) and are great for understanding sentences (classification, named entity recognition, question answering). Decoder-based models only use a Transformer decoder in their architecture (also typically stacked) and are great for future prediction, which makes them suitable for text generation. Seq2Seq models combine both encoders and decoders and are great at text generation, summarization and most importantly - translation. The GPT family of models, which gained a lot of traction in the past couple of years, are decoder-based transformer models. Trained on large corpora of data and given a prompt as a starting seed for generation, they're great at producing human-like text. For instance:

generate_text('the truth ultimately is')

Which under the hood feeds this prompt into a GPT-like model, and produces:

'the truth ultimately is really a joy in history, this state of life through which is almost invisible, superfluous  teleological...'

This is, in fact, a small spoiler from the end of the guide! Another small spoiler is the architecture that produced that text:

inputs = layers.Input(shape=(maxlen,))
embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(vocab_size, maxlen, embed_dim)(inputs)
transformer_block = keras_nlp.layers.TransformerDecoder(embed_dim, num_heads)(embedding_layer)
outputs = layers.Dense(vocab_size, activation='softmax')(transformer_block)

model = keras.Model(inputs=inputs, outputs=outputs)

5 lines is all it takes to build a decoder-only transformer model - simulating a small GPT. Since we'll be training the model on Fyodor Dostoyevsky's novels (which you can substitute with anything else, from Wikipedia to Reddit comments) - we'll tentatively call the model GPT-Fyodor.


The trick to a 5-line GPT-Fyodor lies in KerasNLP, developed by the official Keras team as a horizontal extension to Keras, which, in true Keras fashion, aims to bring industry-strength NLP to your fingertips, with new layers (encoders, decoders, token embeddings, position embeddings, metrics, tokenizers, etc.). KerasNLP isn't a model zoo. It's a part of Keras (as a separate package) that lowers the barrier to entry for NLP model development, just as the main package lowers the barrier to entry for general deep learning development.

Note: As of writing, KerasNLP is still in early development, and subtle differences might be present in future versions. This writeup uses version 0.3.0.

To be able to use KerasNLP, you'll have to install it via pip:
$ pip install keras_nlp

And you can verify the version with:

print(keras_nlp.__version__)
# 0.3.0

Implementing a GPT-Style Model with Keras

Let's start out by importing the libraries we'll be using - TensorFlow, Keras, KerasNLP and NumPy:

import tensorflow as tf
from tensorflow import keras
import keras_nlp
import numpy as np

Loading Data

Let's load in a few of Dostoyevsky's novels - one alone would be way too short for a model to fit without a fair bit of overfitting from the early stages onward. We'll be using the raw text files from Project Gutenberg, due to the simplicity of working with such data:

# The plain-text Project Gutenberg URL for each novel goes here
crime_and_punishment_url = ''
brothers_of_karamazov_url = ''
the_idiot_url = ''
the_possessed_url = ''

paths = [crime_and_punishment_url, brothers_of_karamazov_url, the_idiot_url, the_possessed_url]
names = ['Crime and Punishment', 'Brothers of Karamazov', 'The Idiot', 'The Possessed']
texts = ''
for index, path in enumerate(paths):
    filepath = keras.utils.get_file(f'{names[index]}.txt', origin=path)
    with open(filepath, encoding='utf-8') as f:
        text = f.read()
        # The first ~50 lines are the Gutenberg intro and preface.
        # Skipping the first 10k characters of each book approximately
        # removes them.
        texts += text[10000:]

We've simply downloaded all of the files, gone through them and concatenated them one on top of the other. This introduces some diversity in the language used, while still keeping it distinctly Fyodor! For each file, we've skipped the first 10k characters, which is around the average length of the preface and Gutenberg intro, so we're left with a largely intact body of each book. Let's take a look at a random 500 characters in the texts string now:

# 500 characters
'nd that was why\nI addressed you at once. For in unfolding to you the story of my life, I\ndo not wish to make myself a laughing-stock before these idle listeners,\nwho indeed know all about it already, but I am looking for a man\nof feeling and education. Know then that my wife was educated in a\nhigh-class school for the daughters of noblemen, and on leaving she\ndanced the shawl dance before the governor and other personages for\nwhich she was presented with a gold medal and a certificate of merit.\n'

Let's separate the string into sentences before doing any other processing:

text_list = texts.split('.')
len(text_list) # 69181

We've got 69k sentences. When you replace the \n characters with whitespaces and count the words:

len(texts.replace('\n', ' ').split(' ')) # 1077574

Note: You'll generally want at least a million words in a dataset, and ideally, much more than that. We're working with a few megabytes of data (~5MB) while language models are more commonly trained on tens of gigabytes of text. This will, naturally, make it really easy to overfit the text input and hard to generalize (high perplexity without overfitting, or low perplexity with a lot of overfitting). Take the results with a grain of salt.

Nevertheless, let's split these into a training, test and validation set. First, let's remove the empty strings and shuffle the sentences:
# Filter out empty strings ('') that appear commonly due to the book's format
text_list = list(filter(None, text_list))

import random
random.shuffle(text_list)

Then, we'll do a 70/15/15 split:

length = len(text_list)
text_train = text_list[:int(0.7*length)]
text_test = text_list[int(0.7*length):int(0.85*length)]
text_valid = text_list[int(0.85*length):]

This is a simple, yet effective way to perform a train-test-validation split. Let's take a peek at text_train:

[' It was a dull morning, but the snow had ceased',
 '\n\n"Pierre, you who know so much of what goes on here, can you really have\nknown nothing of this business and have heard nothing about it?"\n\n"What? What a set! So it\'s not enough to be a child in your old age,\nyou must be a spiteful child too! Varvara Petrovna, did you hear what he\nsaid?"\n\nThere was a general outcry; but then suddenly an incident took place\nwhich no one could have anticipated', ...
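The 70/15/15 split generalizes nicely into a small helper; here's a minimal sketch (the `train_test_valid_split` name and its fraction defaults are our own, not part of the guide's pipeline):

```python
def train_test_valid_split(items, train_frac=0.7, test_frac=0.15):
    # Slice a (pre-shuffled) list into train/test/validation parts by fraction;
    # everything past the train and test slices becomes the validation set
    n = len(items)
    train_end = int(train_frac * n)
    test_end = int((train_frac + test_frac) * n)
    return items[:train_end], items[train_end:test_end], items[test_end:]

train, test, valid = train_test_valid_split(list(range(100)))
print(len(train), len(test), len(valid))  # 70 15 15
```

Since the list is shuffled beforehand, slicing by position gives an unbiased split.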

Time for standardization and vectorization!

Text Vectorization

Networks don't understand words - they understand numbers. We'll want to tokenize the words:

sequence = ['I', 'am', 'Wall-E']
sequence = tokenize(sequence)
print(sequence) # [4, 26, 472]

Also, since sentences differ in length - padding is typically added to the left or right to ensure the same shape across sentences being fed in. Say our longest sentence is 5-words (tokens) long. In that case, the Wall-E sentence would be padded by two zeros so we ensure the same input shape:

sequence = pad_sequence(sequence)
print(sequence) # [4, 26, 472, 0, 0]
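In plain Python, these two steps look something like this (the `tokenize()` and `pad_sequence()` helpers and the three-word vocabulary here are hypothetical stand-ins, not the layer we'll actually use below):

```python
def tokenize(words, vocab):
    # Replace each word with its integer index in the vocabulary
    return [vocab[word] for word in words]

def pad_sequence(tokens, maxlen):
    # Right-pad with zeros so every sequence ends up the same length
    return tokens + [0] * (maxlen - len(tokens))

vocab = {'I': 4, 'am': 26, 'Wall-E': 472}
sequence = pad_sequence(tokenize(['I', 'am', 'Wall-E'], vocab), maxlen=5)
print(sequence)  # [4, 26, 472, 0, 0]
```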

Traditionally, this was done using a TensorFlow Tokenizer and Keras' pad_sequences() methods - however, a much handier layer, TextVectorization, can be used, which tokenizes and pads your input, allowing you to extract the vocabulary and its size, without knowing the vocab upfront! Let's adapt and fit a TextVectorization layer:

from tensorflow.keras.layers import TextVectorization

def custom_standardization(input_string):
    sentence = tf.strings.lower(input_string)
    sentence = tf.strings.regex_replace(sentence, "\n", " ")
    return sentence

maxlen = 50
# You can also calculate the longest sentence in the data and use that
# maxlen = max(len(sentence.split(' ')) for sentence in text_list)

vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    output_sequence_length=maxlen + 1,
)

vectorize_layer.adapt(text_list)
vocab = vectorize_layer.get_vocabulary()

The custom_standardization() method can get a lot longer than this. We've simply lowercased all input and replaced \n with " ". This is where you can really put in most of your preprocessing for text - and supply it to the vectorization layer through the optional standardize argument. Once you adapt() the layer to the text (NumPy array or list of texts) - you can get the vocabulary, as well as its size from there:

vocab_size = len(vocab)
vocab_size # 49703

Finally, to de-tokenize words, we'll create an index_lookup dictionary:

index_lookup = dict(zip(range(len(vocab)), vocab))    
index_lookup[5] # of

It maps all of the tokens ([1, 2, 3, 4, ...]) to words in the vocabulary (['a', 'the', 'i', ...]). By passing in a key (token index), we can easily get the word back. You can now run the vectorize_layer() on any input and observe the vectorized sentences:

vectorize_layer(['hello world!'])

Which results in:

<tf.Tensor: shape=(1, 51), dtype=int64, numpy=
array([[   1, 7509,    0,  ...,    0,    0,    0]])>


Hello has the index of 1 while world has the index of 7509! The rest is the padding to the maxlen we've calculated. We have the means to vectorize text - now, let's create datasets from text_train, text_test and text_valid, using our vectorization layer as a conversion medium between words and vectors that can be fed into GPT-Fyodor.
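Before moving on - the reverse direction (tokens back to words) is just a dictionary walk over index_lookup. A quick sketch with a tiny made-up vocabulary:

```python
# Hypothetical miniature vocabulary; index 0 is padding, index 1 is out-of-vocabulary
vocab = ['', '[UNK]', 'the', 'and', 'to', 'of']
index_lookup = dict(zip(range(len(vocab)), vocab))

def detokenize(token_ids):
    # Look up each token index and join the words, skipping padding (0)
    return ' '.join(index_lookup[t] for t in token_ids if t != 0)

print(detokenize([2, 3, 4]))  # the and to
```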

Dataset Creation

We'll be creating a tf.data.Dataset for each of our sets, using from_tensor_slices() and providing a list of, well, tensor slices (sentences):

batch_size = 64

train_dataset = tf.data.Dataset.from_tensor_slices(text_train)
train_dataset = train_dataset.shuffle(buffer_size=256)
train_dataset = train_dataset.batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices(text_test)
test_dataset = test_dataset.shuffle(buffer_size=256)
test_dataset = test_dataset.batch(batch_size)

valid_dataset = tf.data.Dataset.from_tensor_slices(text_valid)
valid_dataset = valid_dataset.shuffle(buffer_size=256)
valid_dataset = valid_dataset.batch(batch_size)

Once created and shuffled (again, for good measure) - we can apply a preprocessing (vectorization and sequence splitting) function:

def preprocess_text(text):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y

train_dataset = train_dataset.map(preprocess_text)
train_dataset = train_dataset.prefetch(tf.data.AUTOTUNE)

test_dataset = test_dataset.map(preprocess_text)
test_dataset = test_dataset.prefetch(tf.data.AUTOTUNE)

valid_dataset = valid_dataset.map(preprocess_text)
valid_dataset = valid_dataset.prefetch(tf.data.AUTOTUNE)

The preprocess_text() function simply expands by the last dimension, vectorizes the text using our vectorize_layer and creates the inputs and targets, offset by a single token. The model will use [0..n] to infer n+1, yielding a prediction for each word, accounting for all of the words before that. Let's take a look at a single entry in any of the datasets:

for entry in train_dataset.take(1):
    print(entry)

Investigating the returned inputs and targets, in batches of 64 (with a length of 50 each), we can clearly see how they're offset by one:

(<tf.Tensor: shape=(64, 50), dtype=int64, numpy=
array([[17018,   851,     2, ...,     0,     0,     0],
       ...,
       [  330,    74,     4, ...,   252,   102,     0]])>,
 <tf.Tensor: shape=(64, 50), dtype=int64, numpy=
array([[  851,     2,  8289, ...,     0,     0,     0],
       ...,
       [   74,     4,    34, ...,   102,  8596,     0]])>)
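The offset between inputs and targets can also be illustrated on a single hypothetical tokenized sentence, without any TensorFlow machinery:

```python
# Made-up token ids for one padded sentence; 0 is padding
tokens = [17018, 851, 2, 8289, 0, 0]

x = tokens[:-1]  # model input:  all tokens except the last
y = tokens[1:]   # target:       all tokens except the first

# At position i, the model sees x[:i+1] and is asked to predict y[i]
print(x)  # [17018, 851, 2, 8289, 0]
print(y)  # [851, 2, 8289, 0, 0]
```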


Finally - it's time to build the model!

Model Definition

We'll make use of KerasNLP layers here. After an Input, we'll encode the input through a TokenAndPositionEmbedding layer, passing in our vocab_size, maxlen and embed_dim. The embed_dim that this layer outputs is retained through the TransformerDecoder. As of writing, the Decoder automatically maintains the input dimensionality, and doesn't allow you to project it into a different output shape, but it does let you define the latent dimensions through the intermediate_dim argument. We'll multiply the embedding dimension by two for the latent representation, but you can keep it the same or use a number detached from the embedding dims:

embed_dim = 128
num_heads = 4

def create_model():
    inputs = keras.layers.Input(shape=(maxlen,), dtype=tf.int32)
    embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(vocab_size, maxlen, embed_dim)(inputs)
    decoder = keras_nlp.layers.TransformerDecoder(intermediate_dim=embed_dim*2, 
                                                  num_heads=num_heads)(embedding_layer)

    outputs = keras.layers.Dense(vocab_size, activation='softmax')(decoder)

    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=[keras_nlp.metrics.Perplexity(), 'accuracy']
    )
    return model

model = create_model()

On top of the decoder, we have a Dense layer to choose the next word in the sequence, with a softmax activation (which produces the probability distribution for each next token). Let's take a look at the summary of the model:

Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_6 (InputLayer)        [(None, 30)]              0         
                                                                 
 token_and_position_embeddin  (None, 30, 128)          6365824   
 g_5 (TokenAndPositionEmbedd                                     
 ing)                                                            
                                                                 
 transformer_decoder_5 (Tran  (None, 30, 128)          132480    
 sformerDecoder)                                                 
                                                                 
 dense_5 (Dense)             (None, 30, 49703)         6411687   
                                                                 
=================================================================
Total params: 13,234,315
Trainable params: 13,234,315
Non-trainable params: 0
_________________________________________________________________
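To make the role of the softmax Dense layer concrete: for every position, it outputs one probability per vocabulary entry, and the simplest (greedy) way to pick the next word is the argmax. A toy illustration with a made-up six-token vocabulary:

```python
import numpy as np

# Hypothetical softmax output over a 6-token vocabulary for one position
probs = np.array([0.05, 0.10, 0.55, 0.20, 0.05, 0.05])

next_token = int(np.argmax(probs))  # greedy pick: the most probable token
print(next_token)  # 2
```

In the callback below, we'll sample among the top-k candidates instead of always taking the argmax, which keeps generation from collapsing into repetitive loops.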

GPT-2 stacks many decoders - GPT-2 Small has 12 stacked decoders (117M params), while GPT-2 Extra Large has 48 stacked decoders (1.5B params). Our single-decoder model with a humble 13M parameters should work well enough for educational purposes. With LLMs - scaling up has proven to be an exceedingly good strategy, and Transformers allow for good scaling, making it feasible to train extremely large models. GPT-3 has a "meager" 175B parameters. Google Brain's team trained a 1.6T parameter model to perform sparsity research while keeping computation on the same level as much smaller models. As a matter of fact, if we increased the number of decoders from 1 to 4:

def create_model():
    inputs = keras.layers.Input(shape=(maxlen,), dtype=tf.int32)
    x = keras_nlp.layers.TokenAndPositionEmbedding(vocab_size, maxlen, embed_dim)(inputs)
    for i in range(4):
        x = keras_nlp.layers.TransformerDecoder(intermediate_dim=embed_dim*2, 
                                                num_heads=num_heads, 
                                                dropout=0.5)(x)
    do = keras.layers.Dropout(0.4)(x)
    outputs = keras.layers.Dense(vocab_size, activation='softmax')(do)

    model = keras.Model(inputs=inputs, outputs=outputs)

Our parameter count would be increased by 400k:

Total params: 13,631,755
Trainable params: 13,631,755
Non-trainable params: 0
Most of the parameters in our network come from the TokenAndPositionEmbedding and Dense layers!
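As a back-of-the-envelope sanity check on where those parameters come from (using the vocab_size and embed_dim from this guide with maxlen = 50; the summary above was printed from a run with maxlen = 30, so its embedding count differs slightly):

```python
vocab_size, maxlen, embed_dim = 49703, 50, 128

# TokenAndPositionEmbedding: one embed_dim-sized vector per vocabulary entry,
# plus one per position
embedding_params = vocab_size * embed_dim + maxlen * embed_dim

# Dense back to the vocabulary: one weight per (embedding unit, vocab entry)
# pair, plus one bias per vocabulary entry
dense_params = embed_dim * vocab_size + vocab_size

print(embedding_params)  # 6368384
print(dense_params)      # 6411687
```

Each extra decoder, by contrast, adds only ~132k parameters, which is why stacking them is comparatively cheap here.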

Try out different depths of the decoder - from 1 to all the way your machine can handle and note the results. In any case - we're almost ready to train the model! Let's create a custom callback that'll produce a sample of text on each epoch, so we can see how the model learns to form sentences through training.

Custom Callback

class TextSampler(keras.callbacks.Callback):
    def __init__(self, start_prompt, max_tokens):
        self.start_prompt = start_prompt
        self.max_tokens = max_tokens

    # Helper method to choose a word from the top K probable words with respect to their probabilities
    # in a sequence
    def sample_token(self, logits):
        logits, indices = tf.math.top_k(logits, k=5, sorted=True)
        indices = np.asarray(indices).astype("int32")
        preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
        preds = np.asarray(preds).astype("float32")
        return np.random.choice(indices, p=preds)

    def on_epoch_end(self, epoch, logs=None):
        decoded_sample = self.start_prompt

        for i in range(self.max_tokens-1):
            tokenized_prompt = vectorize_layer([decoded_sample])[:, :-1]
            predictions = self.model.predict([tokenized_prompt], verbose=0)
            # Find the index of the next word's prediction: the prediction
            # at position i corresponds to the token after input token i, so
            # the prediction for the word following the prompt sits at
            # index len(decoded_sample.split()) - 1
            sample_index = len(decoded_sample.strip().split())-1

            sampled_token = self.sample_token(predictions[0][sample_index])
            sampled_token = index_lookup[sampled_token]
            decoded_sample += " " + sampled_token

        print(f"\nSample text:\n{decoded_sample}...\n")

# First 4 words of a random sentence to be used as a seed
random_sentence = ' '.join(random.choice(text_valid).replace('\n', ' ').split(' ')[:4])
sampler = TextSampler(random_sentence, 30)
reducelr = keras.callbacks.ReduceLROnPlateau(patience=10, monitor='val_loss')
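The top-k sampling inside sample_token() can also be sketched in plain NumPy; this hypothetical top_k_sample() mirrors the TensorFlow version above:

```python
import numpy as np

def top_k_sample(logits, k=5, rng=None):
    # Keep only the k highest logits, softmax them, then draw one of those
    # k token indices with probability proportional to its softmax weight
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    top_indices = np.argsort(logits)[-k:]
    top_logits = logits[top_indices]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(rng.choice(top_indices, p=probs))

# With k=2 only the two highest-logit indices (1 and 3) can ever be drawn
print(top_k_sample([0.1, 2.0, 0.3, 5.0, 1.5], k=2) in (1, 3))  # True
```

Restricting sampling to the top k tokens injects variety while avoiding very unlikely words that would derail the sentence.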

Training the Model

Finally, time to train! Let's chuck in our train_dataset and valid_dataset with the callbacks in place:

model = create_model()
history = model.fit(train_dataset, 
                    validation_data=valid_dataset,
                    epochs=10,
                    callbacks=[sampler, reducelr])

The sampler chose an unfortunate sentence that starts with the end quote and start quote, but it still produces interesting results while training:

# Epoch training
Epoch 1/10
658/658 [==============================] - ETA: 0s - loss: 2.7480 - perplexity: 15.6119 - accuracy: 0.6711
# on_epoch_end() sample generation
Sample text:
”  “What do you had not been i had been the same man was not be the same eyes to been a whole man and he did a whole man to the own...
# Validation
658/658 [==============================] - 158s 236ms/step - loss: 2.7480 - perplexity: 15.6119 - accuracy: 0.6711 - val_loss: 2.2130 - val_perplexity: 9.1434 - val_accuracy: 0.6864 - lr: 0.0010
Sample text:
”  “What do you know it is it all this very much as i should not have a great impression  in the room to be  able of it in my heart...

658/658 [==============================] - 149s 227ms/step - loss: 1.7753 - perplexity: 5.9019 - accuracy: 0.7183 - val_loss: 2.0039 - val_perplexity: 7.4178 - val_accuracy: 0.7057 - lr: 0.0010

It starts with:

"What do you had not been i had been the same"...

Which doesn't really make much sense. By the end of the ten short epochs, it produces something along the lines of:

"What do you mean that is the most ordinary man of a man of course"...

While the second sentence still doesn't make too much sense - it's much more sensical than the first. Longer training on more data (with more intricate preprocessing steps) would yield better results. We've only trained it on 10 epochs with high dropout to combat the small dataset size. If it were left training for much longer, it would produce very Fyodor-like text, because it would've memorized large chunks of it.
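A note on the perplexity numbers in the logs above: perplexity is the exponential of the mean cross-entropy per token - roughly, how many words the model is "hesitating between" at each step (1 would be perfect certainty). A toy calculation with made-up probabilities:

```python
import numpy as np

# Hypothetical probabilities the model assigned to the *true* next token
# at four positions in a sequence
token_probs = np.array([0.2, 0.5, 0.1, 0.4])

cross_entropy = -np.log(token_probs).mean()
perplexity = float(np.exp(cross_entropy))
print(round(perplexity, 2))  # 3.98
```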

Note: Since the output is fairly verbose, you can tweak the verbose argument while fitting the model to reduce the amount of text on screen.

Model Inference

To perform inference, we'll want to replicate the interface of the TextSampler - a method that accepts a seed and a response_length (max_tokens). We'll use the same methods as within the sampler:

def sample_token(logits):
    logits, indices = tf.math.top_k(logits, k=5, sorted=True)
    indices = np.asarray(indices).astype("int32")
    preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
    preds = np.asarray(preds).astype("float32")
    return np.random.choice(indices, p=preds)

def generate_text(prompt, response_length=20):
    decoded_sample = prompt
    for i in range(response_length-1):
        tokenized_prompt = vectorize_layer([decoded_sample])[:, :-1]
        predictions = model.predict([tokenized_prompt], verbose=0)
        sample_index = len(decoded_sample.strip().split())-1

        sampled_token = sample_token(predictions[0][sample_index])
        sampled_token = index_lookup[sampled_token]
        decoded_sample += " " + sampled_token
    return decoded_sample

Now, you can run the method on new samples:

generate_text('the truth ultimately is')
# 'the truth ultimately is really a joy in history, this state of life through which is almost invisible, superfluous  teleological'

generate_text('the truth ultimately is')
# 'the truth ultimately is not to make it a little   thing to go into your own  life for some'

Improving Results?

So, how can you improve results? There are some pretty actionable things you could do:

  • Data cleaning (clean the input data more meticulously - we just trimmed an approximate number of characters from the start and removed newline characters)
  • Get more data (we only worked with a few megabytes of text data)
  • Scale the model alongside the data (stacking decoders isn't hard!)

Going Further - Hand-Held End-to-End Project

Your inquisitive nature makes you want to go further? We recommend checking out our Guided Project: "Image Captioning with CNNs and Transformers with Keras".

In this guided project - you'll learn how to build an image captioning model, which accepts an image as input and produces a textual caption as the output.

You'll learn how to:

  • Preprocess text
  • Vectorize text input easily
  • Work with the tf.data API and build performant Datasets
  • Build Transformers from scratch with TensorFlow/Keras and KerasNLP - the official horizontal addition to Keras for building state-of-the-art NLP models
  • Build hybrid architectures where the output of one network is encoded for another

How do we frame image captioning? Most consider it an example of generative deep learning, because we're teaching a network to generate descriptions. However, I like to look at it as an instance of neural machine translation - we're translating the visual features of an image into words. Through translation, we're generating a new representation of that image, rather than just generating new meaning. Viewing it as translation, and only by extension generation, scopes the task in a different light, and makes it a bit more intuitive. Framing the problem as one of translation makes it easier to figure out which architecture we'll want to use. Encoder-only Transformers are great at understanding text (sentiment analysis, classification, etc.) because Encoders encode meaningful representations. Decoder-only models are great for generation (such as GPT-3), since decoders are able to infer meaningful representations into another sequence with the same meaning. Translation is typically done by an encoder-decoder architecture, where encoders encode a meaningful representation of a sentence (or image, in our case) and decoders learn to turn this sequence into another meaningful representation that's more interpretable for us (such as a sentence).


While the preprocessing pipeline is minimalistic and can be improved - the pipeline outlined in this guide produced a decent GPT-style model, with just 5 lines of code required to build a custom decoder-only transformer, using Keras! Transformers are popular and widely applicable for generic sequence modeling (and many things can be expressed as sequences). So far, the main barrier to entry was a cumbersome implementation, but with KerasNLP - deep learning practitioners can leverage the implementations to build models quickly and easily.