
Text Generation

Natural language generation (NLG) can tell a story, much as a human analyst would, by writing the sentences and paragraphs for you. It can also summarize reports.

“Conversations with systems that have access to data about our world will allow us to understand the status of our jobs, our businesses, our health, our homes, our families, our devices, and our neighborhoods — all through the power of NLG. It will be the difference between getting a report and having a conversation. The information is the same but the interaction will be more natural.”

Algorithms

Text Generation with Markov Chain

A Markov chain is a stochastic process that describes the next event in a sequence given only the previous event. In our case the state is the previous word (unigram), the previous 2 words (bigram), or the previous 3 (trigram). These are known more generally as n-grams, since we use the last n words to generate the next possible word in the sequence. A Markov chain usually picks the next state via probabilistic weighting, but in our case that would create text that is too deterministic in structure and word choice. You could play with the weighting of the probabilities, but a uniformly random choice helps make the generated text feel more original.

Corpus: The dog jumped over the moon. The dog is funny.

Language model:

(The, dog)     -> [jumped, is]
(dog, jumped)  -> [over]
(jumped, over) -> [the]
(over, the)    -> [moon]
(the, moon)    -> [#END#]
(dog, is)      -> [funny]
(is, funny)    -> [#END#]
import random
import string

class MarkovModel:

    def __init__(self):
        self.model = None

    def learn(self, tokens, n=2):
        model = {}

        # map each n-gram to the list of tokens that follow it in the corpus
        for i in range(0, len(tokens) - n):
            gram = tuple(tokens[i:i + n])
            token = tokens[i + n]

            if gram in model:
                model[gram].append(token)
            else:
                model[gram] = [token]

        # mark the final n-gram with None so generation can terminate
        final_gram = tuple(tokens[len(tokens) - n:])
        if final_gram in model:
            model[final_gram].append(None)
        else:
            model[final_gram] = [None]
        self.model = model
        return model

    def generate(self, n=2, seed=None, max_tokens=100):
        if seed is None:
            seed = random.choice(list(self.model.keys()))

        output = list(seed)
        output[0] = output[0].capitalize()
        current = seed

        for i in range(n, max_tokens):
            # get the possible next words for the current n-gram
            if current in self.model:
                possible_transitions = self.model[current]
                choice = random.choice(possible_transitions)
                if choice is None:
                    break

                # if the choice is a period, attach it to the previous token
                if choice == '.':
                    output[-1] = output[-1] + choice
                else:
                    output.append(choice)
                current = tuple(output[-n:])
            else:
                # dead end: close with ending punctuation of some sort and stop
                if output[-1][-1] not in string.punctuation:
                    output.append('.')
                break
        return output
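
A usage sketch with the toy corpus above; the whitespace/period tokenizer here is deliberately naive and only for illustration:

# Usage sketch with the toy corpus above; the tokenizer is deliberately naive
corpus = "The dog jumped over the moon. The dog is funny."
tokens = corpus.replace('.', ' .').split()

markov = MarkovModel()
markov.learn(tokens, n=2)
print(' '.join(markov.generate(n=2, seed=('The', 'dog'))))
# possible output: "The dog jumped over the moon." or "The dog is funny."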

Important Papers

The survey: Text generation models in deep learning [2020]

Survey of the State of the Art in Natural Language Generation: Core tasks, applications [2017]

Neural Text Generation: A Practical Guide [2017]

Neural Text Generation: Past, Present and Beyond [2018]

Experiments

Text Generation with char-RNNs

The Unreasonable Effectiveness of Recurrent Neural Networks

References

BoredHumans.com - Fun AI Programs You Can Use Online

eaglenlp/Text-Generation

tokenmill/awesome-nlg

ChenChengKuan/awesome-text-generation

Tianwei-She/awesome-natural-language-generation

Papers with Code - Text Generation

Eulring/Text-Generation-Papers

Arxiv Sanity Preserver


Decoding techniques - Greedy search, Beam search, Top-K sampling and Top-p sampling with Transformer

How to generate text: using different decoding methods for language generation with Transformers

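As a sketch of the decoding methods listed above, using the Hugging Face transformers library with GPT-2 (the prompt and parameter values are illustrative, not recommendations):

# Greedy search, beam search, top-k and top-p sampling with Hugging Face transformers (GPT-2)
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
input_ids = tokenizer.encode('The dog jumped over', return_tensors='pt')

# Greedy search: always take the most probable next token
greedy = model.generate(input_ids, max_length=40)

# Beam search: keep the 5 most probable partial sequences at each step
beam = model.generate(input_ids, max_length=40, num_beams=5, early_stopping=True)

# Top-K sampling: sample the next token from the 50 most probable candidates
top_k = model.generate(input_ids, max_length=40, do_sample=True, top_k=50)

# Top-p (nucleus) sampling: sample from the smallest token set whose cumulative probability exceeds 0.92
top_p = model.generate(input_ids, max_length=40, do_sample=True, top_p=0.92, top_k=0)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))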

Controlling Text Generation with Plug and Play Language Models (PPLM)

PPLM lets users combine small attribute models with an LM to steer its generation. Attribute models can be 100,000 times smaller than the LM and still be effective in steering it, like a mouse sitting atop our wooly mammoth friend and telling it where to go. The mouse tells the mammoth where to go using gradients.

Controlling Text Generation with Plug and Play Language Models

uber-research/PPLM

Plug and Play Language Models: A Simple Approach to Controlled Text Generation


GPT-2 Fine Tuning

Autoregressive Language Generation

It is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions. In practice, after each token is produced, that token is appended to the input sequence, and the new sequence becomes the model's input at the next step.
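
Written out, this is the standard chain-rule factorization (here W_0 denotes the initial context or prompt, a notational assumption):

P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0)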

Word-Level Generation vs Character-Level Generation

In general, word-level language models tend to display higher accuracy than character-level language models. This is because they can form shorter representations of sentences and preserve the context between words more easily than character-level language models. However, large corpora are needed to train word-level language models sufficiently, and one-hot encoding isn't very feasible for word-level models. In contrast, character-level language models are often quicker to train, requiring less memory and offering faster inference than word-based models. This is because the "vocabulary" (the number of training features) for the model is likely to be much smaller overall, limited to hundreds of characters rather than hundreds of thousands of words.
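
A quick way to see this vocabulary-size gap on your own data (corpus.txt is a placeholder path):

# Compare character-level and word-level vocabulary sizes (corpus.txt is a placeholder)
with open('corpus.txt', encoding='utf-8') as f:
    text = f.read()

char_vocab = set(text)                  # usually tens to a few hundred distinct characters
word_vocab = set(text.lower().split())  # can reach tens or hundreds of thousands of distinct words
print(len(char_vocab), len(word_vocab))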