
Makemore

The best way to predict a future is to look for it. We live in the moment where the world is a vast machine for predicting the future.

This summer, I spent a significant amount of time contemplating large language models and delving deeper into their research. My first encounter with GPT-2 was back in 2019, when I explored its code and experimented with it. During this period, I became curious about transfer learning and its applications. I also had some prior knowledge of transformers, but it wasn’t as comprehensive as my understanding of LSTMs and RNNs; I couldn’t confidently explain what they did, for example.

While researching transfer learning with smaller models like GPT-2, I stumbled upon Gwern Branwen’s website (https://gwern.net/) and, in particular, his TWDNE Project (https://gwern.net/twdne). I found it clever because it combined generative models for both images and text. I decided to focus on the text side of the project, as the image aspect was already well addressed by applications like Stable Diffusion.

Misato Katsuragi as a Math Teacher

I might revisit the image style transfer aspect in the future, as I had previously explored it to some extent. You can find more about this in my “How to Generate Art Demo Followup.”

Before this, I had predominantly explored machine learning with code written from the ground up in Python (PMLC). For years I have used ML practically, in the form of genetic algorithms for tuning parameters on investing models; those models are non-differentiable, so no chain rule! An offshoot was a project called gen-gen-algo, a generic genetic algorithm. Now, finally, after all these side quests, I was ready to tackle something more complex and cutting-edge using GPT.

I found excellent resources on GitHub and in video format from Andrej Karpathy (https://github.com/karpathy). The following repositories were particularly helpful in my learning journey. The first one, “nn-zero-to-hero,” features a series of videos that provided a solid foundation in understanding transformers.

The second repository, “makemore,” served as my warm-up exercise to get back into working with transformers and Large Language Models (LLMs) after a period of dormancy in the field. You can access these repositories here:

1. “nn-zero-to-hero”: https://github.com/karpathy/nn-zero-to-hero
2. “makemore”: https://github.com/karpathy/makemore

Fork of makemore

My experience with “makemore” went beyond the basic examples provided in the original repository, which generated new names based on a dataset of names. Initially, my goal was to apply “makemore” to various datasets other than “names.txt.” I experimented with larger and smaller datasets, including those with extensive collections of English words, numbers for addition, square roots, and a substantial dataset of quotes containing nearly 10 million entries, some of which had lines as long as 505 characters. By using scripts and modifications to “makemore.py,” I conducted a grid search to optimize hyperparameters, including constraints on model size. Output from “makemore.py” was saved to a CSV file, along with hexadecimal hash values for easy tracking and analysis during the tuning process.
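As an illustration of what those synthetic datasets looked like, an addition corpus can be generated with a few lines of Python in the same one-example-per-line layout as “names.txt”. The “a+b=c” line format here is my choice for illustration, not something the repository dictates:

import random

# Write a simple addition corpus, one example per line, mirroring the
# one-name-per-line layout of names.txt. The "a+b=c" format is illustrative.
random.seed(42)
with open("addition.txt", "w") as f:
    for _ in range(10000):
        a, b = random.randint(0, 999), random.randint(0, 999)
        f.write(f"{a}+{b}={a + b}\n")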

To further enhance the code, I introduced a grid search optimization method using a Bash script. This allowed exploring the hyperparameter space while maintaining a ceiling on the model size. Without such a constraint, the optimization typically drifted toward ever-larger models, since the largest model tended to produce the lowest loss.
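The real search was driven by a Bash script wrapped around “makemore.py”, but the idea is easy to sketch in Python. The flag names and the parameter-count estimate below are assumptions for illustration, not the exact interface of my script:

import itertools
import subprocess

# Crude ceiling on model size; ~12 * n_layer * n_embd^2 is a rough
# transformer parameter-count estimate, used only to cap the search.
MAX_PARAMS = 2_000_000

def est_params(n_layer, n_embd):
    return 12 * n_layer * n_embd * n_embd

for n_layer, n_head, n_embd in itertools.product([2, 4], [2, 4], [64, 128, 256]):
    if n_embd % n_head != 0 or est_params(n_layer, n_embd) > MAX_PARAMS:
        continue  # skip configurations over the size ceiling
    # Flag names here are assumptions about makemore.py's CLI.
    subprocess.run([
        "python", "makemore.py",
        "--input-file", "quotes.txt",
        "--work-dir", f"out-{n_layer}-{n_head}-{n_embd}",
        "--n-layer", str(n_layer),
        "--n-head", str(n_head),
        "--n-embd", str(n_embd),
        "--max-steps", "10000",
    ])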

I also introduced the concept of assigning a random hexadecimal tag to the output of “makemore.py.” This tagging system facilitated the easy identification of the best loss and the associated set of hyperparameters that produced it. Additionally, I incorporated an early stopping mechanism into the “makemore.py” code.
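Both changes are small. The sketch below shows the general shape of the tagging, CSV logging, and patience-based early stopping; it is an illustration of the idea, not the literal code in my fork:

import csv
import secrets

# Random hex tag identifying this run in the results CSV.
run_tag = secrets.token_hex(4)  # e.g. "a3f09c1e"

def log_result(loss, hparams, path="results.csv"):
    # Append the tag, the best loss, and the hyperparameters that produced it.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([run_tag, loss] + list(hparams.values()))

# Early stopping: stop once the eval loss has not improved for `patience` evals.
best_loss = float("inf")
bad_evals = 0
patience = 5

def should_stop(eval_loss):
    global best_loss, bad_evals
    if eval_loss < best_loss:
        best_loss = eval_loss
        bad_evals = 0
    else:
        bad_evals += 1
    return bad_evals >= patience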

If you’re interested in exploring my fork of Andrej Karpathy’s “makemore” code, you can find it here:

https://github.com/erickclasen/makemore

For a more detailed understanding, I’ve created a comprehensive “verbose-readme.pdf” that provides additional information:

Version on this site, opens in browser:

verbose-readme

GitHub Version requires downloading:

https://github.com/erickclasen/makemore/blob/master/verbose-readme.pdf

RNN Text Generation Using TensorFlow

Imagination is the power to make a difference in yourself.

After trying a few RNNs and LSTMs for text generation that rely on NumPy alone, it is interesting to see the performance of TensorFlow-based code that is closer to the cutting edge of what is possible with machine learning.

I found a good and easy-to-use set of code in the following GitHub repository…

https://github.com/spiglerg/RNN_Text_Generation_Tensorflow

The requirements are simple…

numpy==1.13.3
tensorflow==1.4.0

I was running it in a Conda Python 3.6 environment, but this is not a requirement. The code uses a saved folder where it stores training checkpoints, so it is possible to interrupt and resume training, and also to use the model in a generate or “talk” mode after it has been trained. The caveat I learned quickly when training on a few types of files is that each corpus requires its own set of checkpoints, which is pretty obvious. So it is best to either wipe out the saved directory contents after a run on a specific corpus or, better yet, make a subdirectory for each corpus, as the third command below shows.

Training is basically sending it the following command…

python rnn_tf.py --input_file=data/us-constitution.txt --ckpt_file="saved/model.ckpt" --mode=train

Once trained, it can be fed another command…

python rnn_tf.py --input_file=us-const-lstm/us-constitution.txt --ckpt_file="saved/model.ckpt" --test_prefix="The " --mode=talk

or, if the checkpoint files have been moved to their own directory, you can use something like this…

python rnn_tf.py --input_file=us-const-lstm/us-constitution.txt --ckpt_file="saved/us-const-trained/model.ckpt" --test_prefix="The " --mode=talk

Each command specifies the location of the input file as well as the location of the checkpoint file. The talk mode allows priming the generated text with a word or phrase such as “The”.

The US Constitution is not a big corpus, and I am sure this code, like others, would benefit from training against a larger one. An experiment I intend to run in the future is to train it against a file containing all the posts on this site to see what it can do with that corpus.
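Building that file is just a matter of concatenating the posts into one text file. Assuming the posts are exported as individual text files under a posts/ directory (an assumption about the export layout), a sketch like this would produce the my-posts.txt shown in the listing below:

import glob

# Concatenate exported posts into one training corpus. The posts/*.txt
# layout is an assumption about how the posts are exported.
with open("my-posts.txt", "w") as out:
    for path in sorted(glob.glob("posts/*.txt")):
        with open(path) as f:
            out.write(f.read() + "\n")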

-rw-r--r-- 1 erick erick 1115394 Apr 26 12:01 shakespeare.txt
-rw-r--r-- 1 erick erick   45120 Apr 26 12:59 us-constitution.txt
-rw-r--r-- 1 erick erick  374605 Apr 26 13:52 my-posts.txt

When trained on the US Constitution it does very well at producing coherent text. Aside from the lack of capitalization, it seems to have gotten to the point of memorizing parts of the text. This might be because the corpus is small and the model is overfitting.

The Senators and Representatives before mentioned, and the Members of the
several State Legislatures, and all executive and judicial Officers, both of
the United States and of the several States, shall be bound by Oath or
Affirmation, to support this Constitution; but no religious Test shall ever be
required as a Qualification to any Office or public Trust under the United
States.

Article 7.

The Ratification of the Conventions of nine States, shall be sufficient for the
Establishment of this Constitution between the States so ratifying the Same.

Sentence:
the several states, shall be bound by oath or
affirmation, to support this constitution; but no religious test shall ever be
required as a qualification to any office or public trust under the united
states.

article 7.

the ratification of the conventions of nine states, shall be sufficient for the
establishment of this constitution between the states so ratifying the same.

done in convention by the unanimous consent of the states present the
seventeenth day of september in the year of our lord on

the Case of a Bill.

Section 8
The Congress shall have Power To lay and collect Taxes, Duties, Imposts and
Excises, to pay the Debts and provide for the common Defence and general
Welfare of the United States; but all Duties, Imposts and Excises shall be
uniform throughout the United States;

To borrow money on the credit of the United States;

To regulate Commerce with foreign Nations, and among the several States, and
with the Indian Tribes;

To establish an uniform Rule of Naturalization, and un

Sentence:
the case of a bill.

section 8
the congress shall have power to lay and collect taxes, duties, imposts and
excises, to pay the debts and provide for the common defence and general
welfare of the united states; but all duties, imposts and excises shall be
uniform throughout the united states;

to borrow money on the credit of the united states;

to regulate commerce with foreign nations, and among the several states, and
with the indian tribes;

to establish an uniform rule of naturalization, and un
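Given how closely the lowercase samples above track the original document, a quick and crude way to quantify memorization is to count how many generated word n-grams appear verbatim in the training text. A sketch, assuming the generated output has been saved to a sample.txt file:

# Crude memorization check: what fraction of generated 8-word chunks
# appear verbatim in the training corpus? "sample.txt" is assumed to
# hold the generated output.
corpus = " ".join(open("data/us-constitution.txt").read().lower().split())
sample = open("sample.txt").read().lower().split()

n = 8
chunks = [" ".join(sample[i:i + n]) for i in range(len(sample) - n + 1)]
hits = sum(chunk in corpus for chunk in chunks)
print(f"{hits}/{len(chunks)} chunks found verbatim in the corpus")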


Training

Training against the corpus of blog posts on this site produced output like this and took about 4 hours of compute time.

batch: 0  loss: 4.492201328277588  speed: 121.8853488969507 batches / s
batch: 100  loss: 3.214789628982544  speed: 1.3747759497226923 batches / s
batch: 200  loss: 3.0983948707580566  speed: 1.4065962415903654 batches / s
batch: 300  loss: 2.8669371604919434  speed: 1.4141226357348917 batches / s
batch: 400  loss: 2.359729051589966  speed: 1.416853411853437 batches / s
batch: 500  loss: 2.0080957412719727  speed: 1.4160802277642834 batches / s

batch: 19500  loss: 0.22069120407104492  speed: 1.4188681716674931 batches / s
batch: 19600  loss: 0.21757778525352478  speed: 1.4218841226396346 batches / s
batch: 19700  loss: 0.2309599369764328  speed: 1.362554971973392 batches / s
batch: 19800  loss: 0.23969298601150513  speed: 1.3983937654375616 batches / s
batch: 19900  loss: 0.23989509046077728  speed: 1.3854887855619515 batches / s

The following are some samples of the output it generates. It could definitely use more training to help it along. The fact that the posts contain some code, numbers, and jargon probably doesn’t help either.

Sentence:
the installed.
 display install wiflinut for ray run process queue every monday, wednesday and friday ran
   1000000 ractine resitely and configure a firewall to only allow certain ip numbers a
   connection to show that the board is
   powered. there are a concatenated version of the log.txt
cacking out of the full -ho 1 than i could have it may be set the
   command which just restart the “how ther have up suncals
   regulator, frequency valies.
   more data. i sho, vift…
sudo selond

   below is

Sentence:
the whole hmad can noid through the server and logged
in a while later and the shutdown script had recorded failed pings into
systemctl.

i was not ne rewent when it shuts down.

for a help afout shourd entire (but mean most looking a series of
for clean ubuntu server install will prompt for a username and password to access folders as
well, especially if the users and password is needed autosuspend should oright. it level, no 62 defanly 34-fermentation crontab, still radio
shar