
Adding Numbers with a Transformer

Following on from my previous article, I've trained a transformer model to perform symbolic addition. Given two numbers expressed as sequences of digits, the model outputs the sum of those numbers, also as a sequence of digits. This is a pretty simple problem for a transformer, but it's a good introduction to the techniques used to train transformers to execute symbolic algorithms.

First, try out the model for yourself:

[Interactive demo: enter two numbers and the model predicts their sum.]

The training code for this model can be found on GitHub.

The model used in the above demo is a decoder-only transformer, the same architecture used by language models such as ChatGPT.[1] These models are trained by showing them many sequences of data, which are usually sentences scraped from the internet but in our case are just sequences of digits. When running a decoder-only transformer, you provide an incomplete sequence and it completes it to look like something that would appear in the training data. For example, ChatGPT is trained on a dataset of prompts and responses,[4] so when given a prompt it will complete the sequence by generating an appropriate response. My addition model is trained on sequences of digits representing two numbers followed by their sum, so when you give it the two input numbers it completes the sequence by generating the digits of the sum.
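To make the completion step concrete, here is a minimal sketch of greedy decoding with such a model. It assumes a trained PyTorch module `model` that maps a batch of token sequences to next-token logits; the actual training code on GitHub may structure this differently.

```python
import torch

END = 10  # marker token that terminates each digit sequence

def complete(model, prompt_tokens, max_new_tokens=20):
    """Greedily extend an incomplete sequence until the model emits END."""
    tokens = list(prompt_tokens)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            x = torch.tensor(tokens).unsqueeze(0)     # shape (1, seq_len)
            logits = model(x)                         # shape (1, seq_len, vocab_size)
            next_token = int(logits[0, -1].argmax())  # most likely next token
            tokens.append(next_token)
            if next_token == END:                     # stop at the end-of-sequence marker
                break
    return tokens
```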

The digits are fed to the model, and generated by it, in reverse order, with the least significant digit first. This is needed because the model generates the output one digit at a time, and when adding long numbers the least significant digits (and their carries) must be worked out before the more significant digits can be computed. The token 10 is used to mark the end of each digit sequence, as shown in the sketch below.
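As an illustration, here is a hypothetical sketch of that encoding: each number is written least significant digit first and terminated with the token 10, and a training example is the two inputs followed by their sum. The real training code may lay out its sequences differently.

```python
END = 10  # marks the end of each digit sequence

def encode_number(n):
    """123 -> [3, 2, 1, 10]: reversed digits followed by the end marker."""
    return [int(d) for d in reversed(str(n))] + [END]

def make_example(a, b):
    """A full training sequence: both inputs, then their sum."""
    return encode_number(a) + encode_number(b) + encode_number(a + b)

print(make_example(123, 45))
# [3, 2, 1, 10, 5, 4, 10, 8, 6, 1, 10]
```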

This is a very basic example of a transformer performing an algorithm, but similar techniques can be used to do more interesting things. For example, Meta trained a transformer model to perform an efficient tree search using self-learned heuristics.[3] Transformers have also proven very effective at working with various types of real-world data such as text and images,[2] so these reasoning skills could potentially be applied to real-world problems. I'm hoping to explore this area further and train transformers to execute increasingly challenging algorithms.

References

[1] Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. “Language models are few-shot learners.” Advances in neural information processing systems 33 (2020): 1877-1901. https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

[2] Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020). https://arxiv.org/abs/2010.11929

[3] Lehnert, Lucas, Sainbayar Sukhbaatar, Paul Mcvay, Michael Rabbat, and Yuandong Tian. “Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping.” arXiv preprint arXiv:2402.14083 (2024). https://arxiv.org/abs/2402.14083

[4] Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang et al. “Training language models to follow instructions with human feedback.” Advances in neural information processing systems 35 (2022): 27730-27744. https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html

This post is licensed under CC BY 4.0 by the author.