
Makemore

The best way to predict a future is to look for it. We live in the moment where the world is a vast machine for predicting the future.

This summer, I spent a significant amount of time contemplating large language models and delving deeper into their research. My first encounter with GPT-2 was back in 2019, when I explored its code and experimented with it. During this period, I became curious about transfer learning and its applications. I also had some prior knowledge of transformers, but it wasn’t as comprehensive as my understanding of LSTMs and RNNs; I couldn’t confidently explain what they did, for example.

While researching transfer learning with smaller models like GPT-2, I stumbled upon Gwern Branwen’s website (https://gwern.net/) and, in particular, his TWDNE Project (https://gwern.net/twdne). I found it clever because it combined generative models for both images and text. I decided to focus on the text side of the project, as the image aspect was already well addressed by applications like Stable Diffusion.

Misato Katsuragi as a Math Teacher

I might revisit the image style transfer aspect in the future, as I had previously explored it to some extent. You can find more about this in my “How to Generate Art Demo Followup.”

Before this, I had predominantly explored machine learning with code written from the ground up in Python (PMLC). For years I have used ML practically, in the form of genetic algorithms for tuning parameters on investing models; those models are non-differentiable, so no chain rule! An offshoot was a project called gen-gen-algo, a generic genetic algorithm. Now, finally, after all these side quests, I was ready to tackle something more complex and cutting-edge using GPT.

I found excellent resources on GitHub and in video format from Andrej Karpathy (https://github.com/karpathy). The following repositories were particularly helpful in my learning journey. The first one, “nn-zero-to-hero,” features a series of videos that provided a solid foundation in understanding transformers.

The second repository, “makemore,” served as my warm-up exercise to get back into working with transformers and Large Language Models (LLMs) after a period of dormancy in the field. You can access these repositories here:

1. “nn-zero-to-hero”: https://github.com/karpathy/nn-zero-to-hero
2. “makemore”: https://github.com/karpathy/makemore

Fork of makemore

My experience with “makemore” went beyond the basic examples provided in the original repository, which generated new names based on a dataset of names. Initially, my goal was to apply “makemore” to various datasets other than “names.txt.” I experimented with larger and smaller datasets, including those with extensive collections of English words, numbers for addition, square roots, and a substantial dataset of quotes containing nearly 10 million entries, some of which had lines as long as 505 characters. By using scripts and modifications to “makemore.py,” I conducted a grid search to optimize hyperparameters, including constraints on model size. Output from “makemore.py” was saved to a CSV file, along with hexadecimal hash values for easy tracking and analysis during the tuning process.
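As an illustration of what those synthetic datasets looked like, an addition corpus can be generated with a few lines of Python in the same one-example-per-line layout as “names.txt”. The “a+b=c” line format here is my choice for illustration, not something the repository dictates:

import random

# Write a simple addition corpus, one example per line, mirroring the
# one-name-per-line layout of names.txt. The "a+b=c" format is illustrative.
random.seed(42)
with open("addition.txt", "w") as f:
    for _ in range(10000):
        a, b = random.randint(0, 999), random.randint(0, 999)
        f.write(f"{a}+{b}={a + b}\n")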

To further enhance the code, I introduced a grid search optimization method using a Bash script. This allowed exploring the hyperparameter space while maintaining a ceiling on the model size. Without such a constraint, the optimization typically drifted toward ever-larger models, since the largest model tended to produce the lowest loss.
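The real search was driven by a Bash script wrapped around “makemore.py”, but the idea is easy to sketch in Python. The flag names and the parameter-count estimate below are assumptions for illustration, not the exact interface of my script:

import itertools
import subprocess

# Crude ceiling on model size; ~12 * n_layer * n_embd^2 is a rough
# transformer parameter-count estimate, used only to cap the search.
MAX_PARAMS = 2_000_000

def est_params(n_layer, n_embd):
    return 12 * n_layer * n_embd * n_embd

for n_layer, n_head, n_embd in itertools.product([2, 4], [2, 4], [64, 128, 256]):
    if n_embd % n_head != 0 or est_params(n_layer, n_embd) > MAX_PARAMS:
        continue  # skip configurations over the size ceiling
    # Flag names here are assumptions about makemore.py's CLI.
    subprocess.run([
        "python", "makemore.py",
        "--input-file", "quotes.txt",
        "--work-dir", f"out-{n_layer}-{n_head}-{n_embd}",
        "--n-layer", str(n_layer),
        "--n-head", str(n_head),
        "--n-embd", str(n_embd),
        "--max-steps", "10000",
    ])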

I also introduced the concept of assigning a random hexadecimal tag to the output of “makemore.py.” This tagging system facilitated the easy identification of the best loss and the associated set of hyperparameters that produced it. Additionally, I incorporated an early stopping mechanism into the “makemore.py” code.
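Both changes are small. The sketch below shows the general shape of the tagging, CSV logging, and patience-based early stopping; it is an illustration of the idea, not the literal code in my fork:

import csv
import secrets

# Random hex tag identifying this run in the results CSV.
run_tag = secrets.token_hex(4)  # e.g. "a3f09c1e"

def log_result(loss, hparams, path="results.csv"):
    # Append the tag, the best loss, and the hyperparameters that produced it.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([run_tag, loss] + list(hparams.values()))

# Early stopping: stop once the eval loss has not improved for `patience` evals.
best_loss = float("inf")
bad_evals = 0
patience = 5

def should_stop(eval_loss):
    global best_loss, bad_evals
    if eval_loss < best_loss:
        best_loss = eval_loss
        bad_evals = 0
    else:
        bad_evals += 1
    return bad_evals >= patience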

If you’re interested in exploring my fork of Andrej Karpathy’s “makemore” code, you can find it here:

https://github.com/erickclasen/makemore

For a more detailed understanding, I’ve created a comprehensive “verbose-readme.pdf” that provides additional information:

Version on this site, opens in browser:

verbose-readme

GitHub Version requires downloading:

https://github.com/erickclasen/makemore/blob/master/verbose-readme.pdf

RNN Text Generation Using TensorFlow

Imagination is the power to make a difference in yourself.

After trying a few RNNs and LSTMs for text generation that rely on NumPy alone, it is interesting to see the performance of TensorFlow-based code that is closer to the cutting edge of what is possible with machine learning.

I found a good and easy-to-use set of code in the following GitHub repository…

https://github.com/spiglerg/RNN_Text_Generation_Tensorflow

The requirements are simple…

numpy==1.13.3
tensorflow==1.4.0

I was running it in a Conda Python 3.6 environment, but this is not a requirement. The code uses a saved folder where it stores training checkpoints, so it is possible to interrupt and resume training, and also to use the model in a generate or “talk” mode after it has been trained. The caveat I learned quickly when training on a few types of files is that each corpus requires its own set of checkpoints, which is pretty obvious. So it is best to either wipe out the saved directory contents after a run on a specific corpus or, better yet, make a subdirectory for each corpus, as the third command below shows.

Training is basically sending it the following command…

python rnn_tf.py --input_file=data/us-constitution.txt --ckpt_file="saved/model.ckpt" --mode=train

Once trained, it can be fed another command…

python rnn_tf.py --input_file=us-const-lstm/us-constitution.txt --ckpt_file="saved/model.ckpt" --test_prefix="The " --mode=talk

or, if the checkpoint files have been moved to their own directory, you can use something like this…

python rnn_tf.py --input_file=us-const-lstm/us-constitution.txt --ckpt_file="saved/us-const-trained/model.ckpt" --test_prefix="The " --mode=talk

Each command specifies the location of the input file as well as the location of the checkpoint file. The talk mode allows priming the generated text with a word or phrase such as “The”.

The US Constitution is not a big corpus, and I am sure this code, like others, would benefit from training against a larger one. An experiment I intend to run in the future is to train it against a file containing all the posts on this site to see what it can do with that corpus.
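Building that file is just a matter of concatenating the posts into one text file. Assuming the posts are exported as individual text files under a posts/ directory (an assumption about the export layout), a sketch like this would produce the my-posts.txt shown in the listing below:

import glob

# Concatenate exported posts into one training corpus. The posts/*.txt
# layout is an assumption about how the posts are exported.
with open("my-posts.txt", "w") as out:
    for path in sorted(glob.glob("posts/*.txt")):
        with open(path) as f:
            out.write(f.read() + "\n")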

-rw-r--r-- 1 erick erick 1115394 Apr 26 12:01 shakespeare.txt
-rw-r--r-- 1 erick erick   45120 Apr 26 12:59 us-constitution.txt
-rw-r--r-- 1 erick erick  374605 Apr 26 13:52 my-posts.txt

When trained on the US Constitution it does very well at producing coherent text. Aside from the lack of capitalization, it seems to have gotten to the point of memorizing parts of the text. This might be because the corpus is small and the model is overfitting.

The Senators and Representatives before mentioned, and the Members of the
several State Legislatures, and all executive and judicial Officers, both of
the United States and of the several States, shall be bound by Oath or
Affirmation, to support this Constitution; but no religious Test shall ever be
required as a Qualification to any Office or public Trust under the United
States.

Article 7.

The Ratification of the Conventions of nine States, shall be sufficient for the
Establishment of this Constitution between the States so ratifying the Same.

Sentence:
the several states, shall be bound by oath or
affirmation, to support this constitution; but no religious test shall ever be
required as a qualification to any office or public trust under the united
states.

article 7.

the ratification of the conventions of nine states, shall be sufficient for the
establishment of this constitution between the states so ratifying the same.

done in convention by the unanimous consent of the states present the
seventeenth day of september in the year of our lord on

the Case of a Bill.

Section 8
The Congress shall have Power To lay and collect Taxes, Duties, Imposts and
Excises, to pay the Debts and provide for the common Defence and general
Welfare of the United States; but all Duties, Imposts and Excises shall be
uniform throughout the United States;

To borrow money on the credit of the United States;

To regulate Commerce with foreign Nations, and among the several States, and
with the Indian Tribes;

To establish an uniform Rule of Naturalization, and un

Sentence:
the case of a bill.

section 8
the congress shall have power to lay and collect taxes, duties, imposts and
excises, to pay the debts and provide for the common defence and general
welfare of the united states; but all duties, imposts and excises shall be
uniform throughout the united states;

to borrow money on the credit of the united states;

to regulate commerce with foreign nations, and among the several states, and
with the indian tribes;

to establish an uniform rule of naturalization, and un
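Given how closely the lowercase samples above track the original document, a quick and crude way to quantify memorization is to count how many generated word n-grams appear verbatim in the training text. A sketch, assuming the generated output has been saved to a sample.txt file:

# Crude memorization check: what fraction of generated 8-word chunks
# appear verbatim in the training corpus? "sample.txt" is assumed to
# hold the generated output.
corpus = " ".join(open("data/us-constitution.txt").read().lower().split())
sample = open("sample.txt").read().lower().split()

n = 8
chunks = [" ".join(sample[i:i + n]) for i in range(len(sample) - n + 1)]
hits = sum(chunk in corpus for chunk in chunks)
print(f"{hits}/{len(chunks)} chunks found verbatim in the corpus")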


Training

Training against the corpus of blog posts on this site produced output like this and took about 4 hours of compute time.

batch: 0  loss: 4.492201328277588  speed: 121.8853488969507 batches / s
batch: 100  loss: 3.214789628982544  speed: 1.3747759497226923 batches / s
batch: 200  loss: 3.0983948707580566  speed: 1.4065962415903654 batches / s
batch: 300  loss: 2.8669371604919434  speed: 1.4141226357348917 batches / s
batch: 400  loss: 2.359729051589966  speed: 1.416853411853437 batches / s
batch: 500  loss: 2.0080957412719727  speed: 1.4160802277642834 batches / s

batch: 19500  loss: 0.22069120407104492  speed: 1.4188681716674931 batches / s
batch: 19600  loss: 0.21757778525352478  speed: 1.4218841226396346 batches / s
batch: 19700  loss: 0.2309599369764328  speed: 1.362554971973392 batches / s
batch: 19800  loss: 0.23969298601150513  speed: 1.3983937654375616 batches / s
batch: 19900  loss: 0.23989509046077728  speed: 1.3854887855619515 batches / s

The following are some samples of the output it generates. It could definitely use more training to help it along. The fact that the posts contain some code, numbers, and jargon probably doesn’t help either.

Sentence:
the installed.
 display install wiflinut for ray run process queue every monday, wednesday and friday ran
   1000000 ractine resitely and configure a firewall to only allow certain ip numbers a
   connection to show that the board is
   powered. there are a concatenated version of the log.txt
cacking out of the full -ho 1 than i could have it may be set the
   command which just restart the “how ther have up suncals
   regulator, frequency valies.
   more data. i sho, vift…
sudo selond

   below is

Sentence:
the whole hmad can noid through the server and logged
in a while later and the shutdown script had recorded failed pings into
systemctl.

i was not ne rewent when it shuts down.

for a help afout shourd entire (but mean most looking a series of
for clean ubuntu server install will prompt for a username and password to access folders as
well, especially if the users and password is needed autosuspend should oright. it level, no 62 defanly 34-fermentation crontab, still radio
shar