Like many people over the last few months I've been playing with a number of Large Language Models (LLMs), perhaps best typified by the current media star ChatGPT. The buzz is hard to avoid while every tech titan develops its "AI" play and people are exposed to tools where the label of Artificial Intelligence is liberally applied. The ability of these models to spit out competent, comprehensible text seems like a step change compared to previous generations of the technology.
I thought I would try and collect some of my thoughts and perspectives on this from the point of view of a systems programmer. For those not familiar with the term, it refers to the low-level work of providing the platforms that the applications people actually use are built upon. In my case a lot of that work is on QEMU, which involves emulating the very lowest level instructions a computer can execute: the simple arithmetic and comparison of numbers that all code is eventually expressed as.
Magic numbers and computing them
I claim no particular expertise in machine learning, so expect this to be a very superficial explanation of what's going on.
In normal code the CPU tends to execute a lot of different instruction sequences as a program runs through solving the problem you have set it. The code that calculates where to draw your window will be different to the code checking the network for new data, or the logic that stores information safely on your file system. Each of those tasks is decomposed and abstracted into simpler and simpler steps until eventually it is simple arithmetic dictating what the processor should do next. You occasionally see hot spots where a particular sequence of instructions is doing a lot of heavy lifting, and there is a whole discipline devoted to managing computational complexity and ensuring algorithms are as efficient as possible.
However the various technologies that are currently wowing the world work very differently. They are models of various networks represented by a series of magic numbers or "weights" arranged in a hierarchical structure of interconnected matrices. While there is a lot of nuance to how problems are encoded and fed into these models, fundamentally the core piece of computation is multiplying one bunch of numbers by another bunch of numbers and feeding the results into the next layer of the network. At the end of the process the model spits out a prediction of what the most likely next word is going to be. After selecting one the cycle repeats, taking into account our expanded context to predict the most likely next word.
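As a rough illustration, here is a minimal sketch of that core loop in Python with NumPy. The layers, weights and tiny vocabulary are entirely made up for the example - a real model has billions of weights, attention layers and a proper tokeniser - but the shape of the computation (multiply, feed forward, pick the most likely next token, repeat with the expanded context) is the same.

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy "model": two layers of made-up weights and a tiny vocabulary.
# Real models have billions of weights spread across many layers.
vocab = ["the", "cat", "sat", "on", "mat", "."]
W1 = rng.standard_normal((len(vocab), 16))   # embedding-ish layer
W2 = rng.standard_normal((16, len(vocab)))   # projection back onto the vocabulary

def predict_next(context_ids):
    # Multiply one bunch of numbers by another bunch of numbers...
    hidden = np.tanh(W1[context_ids].mean(axis=0))
    # ...feed the result into the next layer and turn it into probabilities.
    logits = hidden @ W2
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs))

# The cycle: predict a word, append it to the context, predict again.
context = [vocab.index("the"), vocab.index("cat")]
for _ in range(4):
    context.append(predict_next(context))

print(" ".join(vocab[i] for i in context))
```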
The "models" that drive these things are described mostly by the number of parameters they have. This encompasses the
number of inputs and outputs they have and the number of numbers in between. For example common small open source models
start at 3 billion parameters with 7, 13 and 34 billion also being popular sizes. Beyond that it starts getting hard to
run models locally on all but the most tricked out desktop PCs. As a developer my desktop is pretty beefy (32 cores,
64Gb RAM) and can chew through computationally expensive builds pretty easily. However as I can't off-load processing
onto my GPU a decent sized model will chug out a few words a second while maxing out my CPU. The ChatGPT v4 model is
speculated to run about 1.7 trillion parameters which needs to be run on expensive cloud hardware - I certainly don't
envy OpenAI their infrastructure bill.
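Some back-of-the-envelope arithmetic shows why the parameter count matters so much for local use. These figures are just the storage for the weights themselves and ignore all the runtime overhead:

```python
def weights_size_gb(parameters: float, bytes_per_weight: float) -> float:
    """Memory needed just to hold the weights, ignoring activations
    and other runtime overhead."""
    return parameters * bytes_per_weight / 1e9

for billions in (3, 7, 13, 34, 1700):
    fp32 = weights_size_gb(billions * 1e9, 4)   # 32-bit floats
    fp16 = weights_size_gb(billions * 1e9, 2)   # 16-bit floats
    print(f"{billions:>5}B parameters: ~{fp32:7.0f} GB fp32, ~{fp16:7.0f} GB fp16")
```

Even at 16-bit precision a 34 billion parameter model barely fits in 64GB of RAM, and trillion-parameter territory is clearly out of reach for a desktop.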
Of course the computational power needed to run these models is a mere fraction of what it took to train them. In fact the bandwidth and processing requirements are so large it pays to develop custom silicon that is really good at multiplying large arrays of numbers and not much else. You can get a lot more bang for your buck compared to running those calculations on a general purpose CPU designed for tackling a wide range of computational problems.
The Value of Numbers
Because of the massive investment in synthesising these magic numbers they themselves become worth something. The "magic sauce" behind a model is more about how it was trained and what data was used to do it. We already know it's possible to encode society's biases into models through sloppy selection of the input data. One of the principal criticisms of proprietary generative models is how opaque the training methods are, making it hard to judge their safety. The degree to which models may regurgitate data without any transformation is hard to quantify when you don't know what went into them.
As I'm fundamentally more interested in knowing how the technology I use works under the hood, it's fortunate there is a growing open source community working on building their own models. Credit should be given to Meta, who made their language model LLaMA 2 freely available on fairly permissive terms. Since then there has been an explosion of open source projects that can run the models (e.g. llama.cpp, Ollama) and provide front-ends for them (e.g. Oobabooga's text generation UI, the Ellama front-end for Emacs).
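To give a feel for how approachable this has become, here is a minimal sketch using the llama-cpp-python bindings around llama.cpp. The model path, prompt and generation parameters are placeholders - you would point it at whichever GGUF model file you have downloaded locally:

```python
# pip install llama-cpp-python  (Python bindings around llama.cpp)
from llama_cpp import Llama

# Placeholder path: any GGUF-format model file you have downloaded locally.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

output = llm(
    "Q: What does a systems programmer do? A:",
    max_tokens=128,
    stop=["Q:"],   # stop before the model starts asking itself new questions
)
print(output["choices"][0]["text"])
```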
Smaller Magic Numbers
The principal place where this work is going on is Hugging Face. Think of it as the GitHub of the machine learning community. It provides an environment for publishing and collaborating on data sets and models, as well as hosting and testing their effectiveness in various benchmarks. This makes experimenting with models accessible to developers who aren't part of the well-funded research divisions of the various tech titans. Datasets, for example, come with cards which describe the sources that went into these multi-terabyte files.
One such example is the RedPajama dataset. This is an open source initiative to recreate the LLaMA training data, combining data from the open web as well as numerous permissively licensed sources such as Wikipedia, GitHub, StackExchange and ArXiv. This dataset has been used to train models like OpenLLaMA in an attempt to provide an unencumbered version of Meta's LLaMA 2. However training up these foundational models is an expensive and time consuming task; the real action is in taking those models and then fine tuning them for particular tasks.
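For a concrete flavour of working with these datasets, the Hugging Face datasets library can stream a slice of the published RedPajama sample without downloading terabytes up front. The dataset id and field names below are assumptions based on how the sample subset is published on the hub, and the exact invocation may vary between library versions:

```python
# pip install datasets
from datasets import load_dataset

# Stream the published sample subset rather than fetching the full corpus.
ds = load_dataset("togethercomputer/RedPajama-Data-1T-Sample",
                  split="train", streaming=True)

# Peek at the first few documents and the metadata describing their source.
for i, doc in enumerate(ds):
    print(doc.get("meta"), str(doc.get("text", ""))[:80])
    if i == 2:
        break
```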
To fine tune a model you first take a general purpose model and train it further against data with a specific task in mind. The purpose of this is not only to make your new model better suited to a particular task but also to optimise the number of calculations the model has to do to achieve acceptable results. This is also where the style of prompting gets set, as you feed the model examples of the sort of questions and answers you want it to give.
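As a very rough sketch of what fine tuning looks like with the Hugging Face transformers library - the base model name, training file and hyper-parameters are all placeholders, and a real fine tune would add evaluation, checkpointing and almost certainly parameter-efficient tricks like LoRA:

```python
# pip install transformers datasets torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "openlm-research/open_llama_3b"   # placeholder: any pretrained causal LM
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token            # LLaMA-style tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Task-specific examples: a local JSON-lines file with a "text" field holding
# question/answer pairs formatted in the prompt style you want to teach.
data = load_dataset("json", data_files="my_task_examples.jsonl", split="train")
data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```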
There are further stages that can be applied, including "alignment", where you ensure results are broadly in tune with the values of the organisation. This is the reason the various chatbots around won't readily cough up the recipe for building nukes or make it easier to explicitly break the law. This can be augmented with Reinforcement Learning from Human Feedback (RLHF), which is practically the purpose of every CAPTCHA you'll have filled in over the last 25 years online.
Finally the model can be quantised to make it more manageable. This takes advantage of the fact that a lot of the numbers will have a negligible effect on the result for a wide range of inputs. In those cases there is no point storing them at full precision. As computation is a function of the number of bits of information being processed, this also reduces the cost of computation. While phones and other devices increasingly include dedicated hardware to process these models they are still constrained by physics - the more you process the more heat you need to dissipate, the more battery you use and the more bandwidth you consume. Obviously the more aggressively you quantise a model the worse it will perform, so there is an engineering trade-off to make. Phones work best with multiple highly tuned models solving specific tasks as efficiently as possible. Fully flexible models giving a J.A.R.V.I.S-like experience will probably always need to run in the cloud where thermal management is simply an exercise in plumbing.
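The idea behind quantisation is simple enough to sketch. This toy example maps 32-bit floats onto 8-bit integers with a single scale factor per tensor; real schemes are cleverer (per-block scales, 4-bit packing, special handling of outliers) but the trade-off between size and precision is the same:

```python
import numpy as np

def quantise_int8(weights: np.ndarray):
    """Map float32 weights onto int8 with one scale factor for the tensor."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)
q, scale = quantise_int8(w)

print(f"original: {w.nbytes / 1e6:.1f} MB, quantised: {q.nbytes / 1e6:.1f} MB")
print(f"mean absolute error: {np.abs(w - dequantise(q, scale)).mean():.5f}")
```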
Making magic numbers work for you
Before we get into using models I want to cover three more concepts: "prompts", "context" and "hallucinations".
The prompt is the closest thing there is to "programming" the model. The prompt can be purely explicit or include other inputs behind the scenes. For example the prompt can instruct the model to be friendly or terse, to decorate code snippets with markdown, or to make changes as diffs or as full functions. Generally the more explicit your prompt is about what you want, the better the result you get from the model. Prompt engineering has the potential to be one of those newly created job titles that will have to replace the jobs obsoleted by advancing AI. One of the ways to embed AI APIs into your app is to create a task-specific prompt that is put in front of the user's input and guides the results towards what you want.
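That embedding pattern is easy to sketch. Everything below is made up for illustration: the application owns a fixed task prompt, the user only supplies the text to be worked on, and send_to_model() is a stand-in for whichever API or local model you happen to be using.

```python
# A hypothetical task-specific prompt an application might hide behind the scenes.
TASK_PROMPT = """You are a terse release-notes assistant.
Rewrite the text the user provides as three short markdown bullet points.
Do not add information that is not in the original text.

User text:
"""

def send_to_model(prompt: str) -> str:
    """Placeholder for a call out to a hosted API or a locally run model."""
    raise NotImplementedError

def summarise(user_text: str) -> str:
    # The user never sees or edits TASK_PROMPT; it steers every request.
    return send_to_model(TASK_PROMPT + user_text)
```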
The "context" is the rest of the input into the model. That could be the current conversation in a chat or the current
page of source code in a code editor. The larger the context the more reference the model has for its answer although
that does come at the cost of even more composition as the context makes for more input parameters into the model.
In a strong candidate for 2023's word of the year, "hallucination" describes the quirky and sometimes unsettling behaviour of models outputting weird, sometimes contradictory information. They will sincerely and confidently answer questions with blatant lies or start regurgitating training data when given certain prompts. It is a salient reminder that the statistical nature of these generative models means they will occasionally spout complete rubbish. They are also very prone to following the lead of their users - the longer you chat with a model the more likely it is to end up agreeing with you.
So let's talk about what these models can and can't do. As a developer one of the areas I'm most interested in is their ability to write code. Systems code especially is an exercise in precisely instructing a computer what to do in explicit situations. I'd confidently predicted my job would be one of the last to succumb to the advance of AI, as systems aren't something you can get "mostly" right. It was quite a shock when I first saw just how sophisticated the generated code can be.
Code Review
One of the first things I asked ChatGPT to do was review a function I'd written. It managed to make 6 observations about the code, 3 of which were actual logic problems I'd missed and 3 of which were general points about variable naming and comments. The prompt is pretty important though. If not constrained to point out actual problems, LLMs have a tendency to spit out rather generic advice about writing clean, well-commented code.
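For what it's worth, the kind of constrained prompt that has worked better for me looks something like the sketch below. The wording is purely illustrative and my_function.c is just a placeholder for whatever you want reviewed:

```python
review_prompt = """Review the following function.
Only report actual bugs, logic errors or undefined behaviour you can point to.
Do not give general advice about naming, comments or style.
For each issue, quote the relevant line and explain the problem in one sentence.

"""

# Assemble the full prompt to paste into (or send to) your chat tool of choice.
with open("my_function.c") as f:
    print(review_prompt + f.read())
```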
They can be super useful when working with an unfamiliar language or framework. If you are having trouble getting something to work it might be faster to ask an LLM how to fix your function than to spend time reading multiple StackOverflow answers to figure out what you've misunderstood. If compiler errors are confusing, supplying the message alongside the code can often be helpful in understanding what's going on.
Writing Code
However rather than just suggesting changes, one very tempting use case is writing code from scratch based on a description of what you want. Here the context is very important: the more detail you provide, the better the chance of generating something useful. My experience has been that the solutions are usually fairly rudimentary and can often benefit from a manual polishing step once you have something working.
For my QEMU KVM Forum 2023 Keynote I got ChatGPT to write the first draft of a number of my data processing scripts. However it missed obvious optimisations, repeatedly reading unchanging values inside inner loops, which made the scripts slower than they needed to be.
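The sort of thing I mean is sketched below - not the actual script, just an illustration of the pattern. The generated draft kept re-reading a value that never changes inside the inner loop; hoisting it out is an easy win that a human (or a profiler) spots quickly:

```python
import json

def total_weighted(rows, config_path):
    # What the generated draft effectively did: re-read the same config
    # file for every row, even though it never changes.
    total = 0
    for row in rows:
        weight = json.load(open(config_path))["weight"]
        total += row * weight
    return total

def total_weighted_fixed(rows, config_path):
    # The obvious fix: read the invariant value once, outside the loop.
    with open(config_path) as f:
        weight = json.load(f)["weight"]
    return sum(row * weight for row in rows)
```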
If the task is a straight transformation they are very good. Ask an LLM to convert a function in one language into another and it will do a pretty good job - probably with fewer mistakes than your own first attempt. However there are limitations. For example I asked a model to convert some AArch64 assembler into the equivalent 32-bit Arm assembler. It did a very good job of the mechanical part but missed the subtle differences in how to set up the MMU. This resulted in code which compiled but didn't work until debugged by a human who was paying close attention to the architecture documentation as they went.
One of the jobs LLMs are very well suited for is writing code that matches an existing template. For example if you are mechanically transforming a bunch of enums into a function that converts them to strings, you need only do a few examples before there is enough context for the LLM to reliably figure out what you are doing. LLMs are a lot more powerful than simple template expansion because you don't need to explicitly define the template first. The same is true of tasks like generating test fixtures for your code.
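As a hedged illustration of the pattern (in Python rather than the C it would more usually be): after a couple of hand-written cases the shape of the function is obvious, and an LLM asked to continue will reliably fill in the remaining values in the same style, just as it will churn out near-identical test fixtures once it has seen one or two.

```python
import errno

def errno_name(code: int) -> str:
    """Convert an errno value into a readable string."""
    if code == errno.EPERM:
        return "EPERM (operation not permitted)"
    if code == errno.ENOENT:
        return "ENOENT (no such file or directory)"
    if code == errno.EINTR:
        return "EINTR (interrupted system call)"
    # ...after a few examples like these an LLM will happily generate the
    # remaining cases in the same style without being given an explicit template.
    return f"unknown errno {code}"
```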
There is a potential trap, however, with using LLMs to write code. As there is no source code, and the proprietary models are fairly cagey about exactly what data they were trained on, there are worries about them committing copyright infringement. There are active debates ongoing in the open source community (e.g. on qemu-devel) about the potential ramifications of a model regurgitating its training data. Without clarity on what licence that data carries there is a risk of contaminating projects with code of unknown provenance. While I'm sure these issues will be resolved in time, it's certainly a problem you need to be cognisant of.
Prose
Writing prose is much more natural territory for LLMs, and an area where low-effort text generation will be rapidly replaced by generative models like ChatGPT. "My" previous blog post was mostly written by ChatGPT from a simple brief and a few requests for rewrites in a chat session. While it made the process fairly quick, the result comes across as a little bland and "off". I find there is a tendency for LLMs to fall back on fairly obvious generalisations and erase any unique authorial voice there may have been.
However if you give it enough structure it's very easy to get an LLM to expand a bullet list into more flowery prose. They are even more powerful when fed a large piece of text and asked to summarise the key information in a more accessible way.
They are certainly an easy way to get a first-pass review of your writing, although I try to re-phrase things myself rather than accept suggestions verbatim, to keep my voice coming through the text.
Final Thoughts
The recent advances in LLMs and the public's exposure to popular tools like ChatGPT have certainly propelled the topic of AI into the zeitgeist. While we are almost certainly approaching the "Peak of Inflated Expectations" stage of the hype cycle, they will undoubtedly be an important step on the road to the eventual goal of Artificial General Intelligence (AGI).
We are still a long way from being able to ask computers to solve complex problems the way they can in, for example, Star Trek. However in their current form they will certainly have a big impact on the way we work over the next decade or so.
It's important as a society that we learn how these models are built, what their limitations are, and what their computational cost and resultant impact on the environment is. It will be a while before I'd want to trust a set of magic numbers over a carefully developed algorithm to actuate the control surfaces of a plane I'm flying on. However they are already well placed to help us learn new information through interactive questioning and to summarise random information from the internet. We must learn to recognise when we've gone down a hallucinatory rabbit hole and verify what we've learned against trusted sources.