Transformers | eXclusive ORange

A Transformer Theory

Part of my pet theory on transformers is that they might stick around for a while. I know ML changes fast, but I have an example from history of a design with staying power. Radio started out with many schemes to detect signals, progressing roughly along these lines: coherers, galena crystals, diodes, tuned radio frequency, regeneration and super-regeneration, all of which detect the signal directly (it gets demodulated right at the incoming frequency), and finally the superheterodyne, which mixes the incoming frequency down or up to an intermediate frequency before detecting/demodulating the signal. The earlier methods each lasted but a few years, and all underperformed and had flaws. Once the superheterodyne was invented there were just a few different flavors of the same idea, the so-called single, double and triple conversions, which are really just more stages to reject out-of-band signals more efficiently.

The superheterodyne, like the transformer, has staying power. After 100 years it is still the ‘way’ to handle a radio signal. That tech is in your WiFi, phone, TV, stereo, modem, two-way radios, GPS and so on, and it is unlikely to be replaced by anything but a tech that just maps it onto digital tech. So my theory is that the transformer is in the same ballpark: it is the superheterodyne of ML, or at least close, a step or two away, that’s all. The earlier methods of radio reception each had their day in the sun, just as RNNs, LSTMs and GRUs were each the cutting-edge, go-to ML architecture for a while.

The best way to predict a future is to look for it. We live in the moment where the world is a vast machine for predicting the future.

This summer, I spent a significant amount of time contemplating large language models and delving deeper into the research behind them. My first encounter with GPT-2 was back in 2019, when I explored its code and experimented with it. During that period, I became curious about transfer learning and its applications. I also had some prior knowledge of transformers, but it wasn’t as thorough as my understanding of LSTMs and RNNs; I couldn’t confidently explain what a transformer did, for example.

While researching transfer learning with smaller models like GPT-2, I stumbled upon Gwern Branwen’s website (https://gwern.net/) and, in particular, his TWDNE Project (https://gwern.net/twdne). I found it clever because it combined generative models for both images and text. I decided to focus on the text side of the project, as the image side was already well covered by applications like Stable Diffusion.

I might revisit the image style transfer aspect in the future, as I had previously explored it to some extent. You can find more about this in my “How to Generate Art Demo Followup.”

Before this, I had predominantly explored machine learning by writing code from the ground up in Python (PMLC). For years I have also used ML practically, in the form of genetic algorithms for tuning parameters on investing models. Those problems are non-differentiable, so no chain rule! An offshoot was a project called gen-gen-algo, a generic genetic algorithm. Now, finally, after all these side quests, I was ready to tackle something more complex and cutting-edge using GPT.

I found excellent resources on GitHub and in video format from Andrej Karpathy (https://github.com/karpathy). The following repositories were particularly helpful in my learning journey. The first one, “nn-zero-to-hero,” features a series of videos that provided a solid foundation in understanding transformers.

The second repository, “makemore,” served as my warm-up exercise to get back into working with transformers and Large Language Models (LLMs) after a period of dormancy in the field. You can access these repositories here:

1. “nn-zero-to-hero”: https://github.com/karpathy/nn-zero-to-hero
2. “makemore”: https://github.com/karpathy/makemore

Fork of makemore

My experience with “makemore” went beyond the basic example provided in the original repository, which generates new names from a dataset of names. Initially, my goal was to apply “makemore” to datasets other than “names.txt.” I experimented with larger and smaller datasets, including extensive collections of English words, numbers for addition, square roots, and a substantial dataset of quotes containing nearly 10 million entries, with some lines as long as 505 characters. Using scripts and modifications to “makemore.py,” I conducted a grid search to optimize hyperparameters, with a constraint on model size. Output from “makemore.py” was saved to a CSV file, along with a hexadecimal value for each run, for easy tracking and analysis during the tuning process.
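As a rough illustration of that bookkeeping (this is not the code from the fork), here is a minimal Python sketch that appends one row per run to a CSV file and keys each row by a random hexadecimal tag. The column names, the file layout, and the helper names `new_tag`, `log_run` and `best_run` are all assumptions made for the example.

```python
import csv
import secrets
from pathlib import Path

# Hypothetical column layout; the fork's CSV may use different fields.
FIELDS = ["tag", "n_layer", "n_head", "n_embd", "test_loss"]

def new_tag(n_bytes=4):
    """Return a random hex string (8 characters for 4 bytes) to label one run."""
    return secrets.token_hex(n_bytes)

def log_run(csv_path, tag, hparams, test_loss):
    """Append one run to the CSV, writing the header if the file is new."""
    path = Path(csv_path)
    write_header = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({"tag": tag, **hparams, "test_loss": test_loss})

def best_run(csv_path):
    """Return the logged row (as a dict of strings) with the lowest test loss."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return min(rows, key=lambda row: float(row["test_loss"]))
```

With something like this in place, the tag of the best row is enough to look up the matching model files or logs from that run.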

The grid search itself was driven by a Bash script. This allowed for exploring the hyperparameter space while keeping a ceiling on the model size. Without such a constraint, the optimization simply drifted toward ever larger models, since those produced the lowest loss.
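The idea of a size-capped grid search can be sketched in a few lines. The original used a Bash script; the version below is in Python for illustration only. The parameter estimate is a rough assumption (about 12 × n_embd² weights per transformer block plus the embedding tables, ignoring biases and layer norms), and the default vocabulary size, block size, grid values and ceiling are made up for the example.

```python
from itertools import product

def estimate_params(n_layer, n_embd, vocab_size=27, block_size=16):
    """Rough transformer size: ~12*d^2 per block plus embedding tables.

    This is an approximation for filtering, not an exact count.
    """
    per_block = 12 * n_embd ** 2  # attention (4*d^2) + MLP (8*d^2)
    embeddings = (2 * vocab_size + block_size) * n_embd  # token, output, position
    return n_layer * per_block + embeddings

def grid_under_ceiling(grid, max_params):
    """Yield every grid combination whose estimated size fits the ceiling."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        if cfg["n_embd"] % cfg["n_head"] != 0:
            continue  # heads must divide the embedding width evenly
        if estimate_params(cfg["n_layer"], cfg["n_embd"]) <= max_params:
            yield cfg

# Hypothetical search space and ceiling.
GRID = {"n_layer": [2, 4, 8], "n_head": [2, 4], "n_embd": [32, 64, 128]}
CONFIGS = list(grid_under_ceiling(GRID, max_params=250_000))
```

Each surviving configuration would then be handed to one training run. Filtering before training is what keeps the search from rewarding size alone.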

The random hexadecimal tag assigned to each run of “makemore.py” made it easy to identify the best loss and the set of hyperparameters that produced it. Additionally, I incorporated an early stopping mechanism into the “makemore.py” code.
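For readers unfamiliar with early stopping, here is a minimal sketch of one common, patience-based variant. It is an assumption that the fork works along these lines; the class name `EarlyStopper`, the `patience` and `min_delta` parameters, and the sample loss values are all hypothetical.

```python
class EarlyStopper:
    """Signal a stop once the loss has not improved for `patience` evaluations."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_evals = 0  # consecutive evaluations without improvement

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # New best: remember it and reset the counter.
            self.best_loss = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

def first_stop_index(losses, patience):
    """Return the index of the evaluation that triggers the stop, or None."""
    stopper = EarlyStopper(patience=patience)
    for i, loss in enumerate(losses):
        if stopper.should_stop(loss):
            return i
    return None

# Made-up evaluation losses: the best value (2.3) is followed by three worse ones.
STOP_AT = first_stop_index([2.9, 2.5, 2.3, 2.4, 2.35, 2.45, 2.2], patience=3)
```

In a training loop, `should_stop` would be called each time the held-out loss is evaluated, and the loop would break when it returns True. In a grid search this saves time on configurations that have clearly plateaued.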

If you’re interested in exploring my fork of Andrej Karpathy’s “makemore” code, you can find it here:

https://github.com/erickclasen/makemore

For a more detailed understanding, I’ve created a comprehensive “verbose-readme.pdf” that provides additional information:

Version on this site, opens in browser:

verbose-readme

GitHub Version requires downloading:

https://github.com/erickclasen/makemore/blob/master/verbose-readme.pdf
