FrankenText

FrankenText#

https://upload.wikimedia.org/wikipedia/commons/d/df/Frankenstein%2C_or_the_Modern_Prometheus_%28Revised_Edition%2C_1831%29_006.jpg — Fig. 22 Inside cover art from the 1831 edition of the book. Victor Frankenstein looks at his creation.
Public domain. By w:Theodor von Holst. Source: Wikimedia Commons#

We will write a program that generates random sentences based on the book Frankenstein by Mary Shelley.

We will create tokens from the text and create a table that tracks which tokens come after which token.

token

A piece of a string.

The algorithm is as follows:

Embed the book into a string.
Replace each non-printable character with a space
Read tokens delimited by “ ?!” and store them in an array named tokens.

At the same time update a successor table that tracks which token depends on which token (succs).

There must be no copy of a token in the array and tokens are case-sensitive.

ruin! and ruin are different tokens.
Generate random sentences:

Select a random token that starts with a capital letter and continue appending random successors from the successor table until we encounter a token that ends a sentence.

Example output:

That, then, I paused and delirium; he tried to his daughter of languages, and I made arrangements for the bed, watching the information from whom I went forth in this?

Juras, and that burst from print of Safie related that, omnipotent as thou art!

The sentences generated may not make a lot of sense, but the words are not completely random.

Word and successor tables#

The program tracks tokens and successors of each token in the successor table succs. For example for the following text:

The modern the modern apes!

produces the following tokens and successor table succs:

0	“The”
1	“modern”
2	“the”
3	“apes!”

0	{“modern”}
1	{“the”, “apes!”}
2	{“modern”}
3

Template#

Before using the template, download the plain-text version of the book using the following link:

https://www.gutenberg.org/ebooks/84.txt.utf-8

Some of the functionality is already provided and we will live-program some of the functionality.

Activity 48 (Analyzing code)

Look at the function replace_non_printable_chars_with_space(). It does not take any arguments nor return anything. How can it replace characters in the book?
generate_sentence() uses tokens in its definition, even it is not provided as an argument. What do you think about this coding style?

#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAX_WORD_COUNT 15'000
#define MAX_SUCCESSOR_COUNT MAX_WORD_COUNT / 2

char book[] = {
#embed "pg84.txt" /// Stores the content of the file as an array of chars.
    , '\0'};      /// Makes `book` a string.

/// Array of tokens registered so far.
/// No duplicates are allowed.
char *tokens[MAX_WORD_COUNT];
/// `tokens`'s current size
size_t tokens_size = 0;

/// Array of successor tokens
/// One token can have many successor tokens. `succs[x]` corresponds to
/// `token[x]`'s successors.
/// We store directly tokens instead of token_ids, because we will directly
/// print them. If we wanted to delete the book, then it would make more sense
/// to store `token_id`s
char *succs[MAX_WORD_COUNT][MAX_SUCCESSOR_COUNT];
/// `succs`'s current size
size_t succs_sizes[MAX_WORD_COUNT];

/// Overwrites non-printable characters in `book` with a space.
/// Non-printable characters may lead to duplicates like
/// `"\xefthe" and "the"` even both print `the`.
void replace_non_printable_chars_with_space() {
  // YOUR CODE HERE
}

/// Returns the id (index) of the token, creating it if necessary.
///
/// Returns token id if token exists in \c tokens, otherwise creates a new entry
/// in \c tokens and returns its token id.
///
/// \param token token to look up (or insert)
/// \return Index of `token` in \c tokens array
size_t token_id(char *token) {
  size_t id;
  for (id = 0; id < tokens_size; ++id) {
    if (strcmp(tokens[id], token) == 0) {
      return id;
    }
  }
  tokens[id] = token;
  ++tokens_size;
  return id;
}

/// Appends the token \c succ to the successors list of \c token.
void append_to_succs(char *token, char *succ) {
  auto next_empty_index_ptr = &succs_sizes[token_id(token)];

  if (*next_empty_index_ptr >= MAX_SUCCESSOR_COUNT) {
    printf("Successor array full.");
    exit(EXIT_FAILURE);
  }

  succs[token_id(token)][(*next_empty_index_ptr)++] = succ;
}

/// Creates tokens on \c book and fills \c tokens and \c succs using
/// the functions \c token_id and \c append_to_succs.
void tokenize_and_fill_succs(char *delimiters, char *str) {
  // YOUR CODE HERE
}

/// Returns last character of a string
char last_char(char *str) {
  // YOUR CODE HERE
}

/// Returns whether the token ends with `!`, `?` or `.`.
bool token_ends_a_sentence(char *token) {
  // YOUR CODE HERE
}

/// Returns a random `token_id` that corresponds to a `token` that starts with a
/// capital letter.
/// Uses \c tokens and \c tokens_size.
size_t random_token_id_that_starts_a_sentence() {
  // YOUR CODE HERE
}

/// Generates a random sentence using \c tokens, \c succs, and \c succs_sizes.
/// The sentence array will be filled up to \c sentence_size-1 characters using
/// random tokens until:
/// - a token is found where \c token_ends_a_sentence
/// - or more tokens cannot be concatenated to the \c sentence anymore.
/// Returns the filled sentence array.
///
/// @param sentence array what will be used for the sentence.
//
//                  Will be overwritten. Does not have to be initialized.
/// @param sentence_size
/// @return input sentence pointer
char *generate_sentence(char *sentence, size_t sentence_size) {
  size_t current_token_id = random_token_id_that_starts_a_sentence();
  auto token = tokens[current_token_id];

  sentence[0] = '\0';
  strcat(sentence, token);
  if (token_ends_a_sentence(token))
    return sentence;

  // Calculated sentence length for the next iteration.
  // Used to stop the loop if the length exceeds sentence size
  size_t sentence_len_next;
  // Concatenates random successors to the sentence as long as
  // `sentence` can hold them.
  do {
    // YOUR CODE HERE
  } while (sentence_len_next < sentence_size - 1);
  return sentence;
}

int main() {
  replace_non_printable_chars_with_space();

  char *delimiters = " \n\r";
  tokenize_and_fill_succs(delimiters, book);

  char sentence[1000];
  srand(time(nullptr)); // Be random each time we run the program

  // Generate sentences until we find a question sentence.
  do {
    generate_sentence(sentence, sizeof sentence);
  } while (last_char(sentence) != '?');
  puts(sentence);
  puts("");

  // Initialize `sentence` and then generate sentences until we find a sentence
  // ending with an exclamation mark.
  do {
    generate_sentence(sentence, sizeof sentence);
  } while (last_char(sentence) != '!');
  puts(sentence);
}

Notes#

Warning

On Windows: If you increase MAX_WORD_COUNT to 50'000, then debugging fails with an unknown error.

Note

We don’t copy strings from book, but work with pointers to the book.

Tip

If it takes too long to run the program or the debugger, create a smaller version of the text file.

Tip

Implement & test step by step. Look at main, at look at the first function you have to implement. After implementing, set a breakpoint after the function and test whether your implementation is correct or not.

Tip

It may take long time for the VARIABLES window to show the contents of the large array book. Use instead the watch feature to track components of arrays.

How? Select a string, e.g., book[i], then right-click, and then click Add to Watch.

Tip

You will probably get segmentation faults during development. The debugger shows you the state when the fault happened. Use it.

Moreover the debugger can also show you how you got to the fault. Look at Call Stack to see the path.

Requirements#

You only fill // YOUR CODE HERE parts.
Include example output of your program in your documentation.
Flowchart

Appendix#

We actually create a Markov chain.