Like many people over the last few months I've been playing with a number of Large Language Models (LLMs), perhaps best typified by the current media star ChatGPT. The buzz is hard to avoid while every tech titan develops its "AI" play and people are exposed to tools where the label of Artificial Intelligence is liberally applied. The ability of these models to spit out competent, comprehensible text seems like a step change compared to previous generations of the technology.
I thought I would try and collect some of my thoughts and perspectives on this from the point of view of a systems programmer. For those not familiar with the term, it refers to the low-level work of providing the platforms that the applications people actually use are built upon. In my case a lot of that work is on QEMU, which involves emulating the very lowest level instructions a computer can execute: the simple arithmetic and comparison of numbers that all code is eventually expressed as.
Magic numbers and computing them
I claim no particular expertise in machine learning, so expect this to be a very superficial explanation of what's going on.
In normal code the CPU tends to execute a lot of different instruction sequences as a program runs through solving the problem you have set it. The code that calculates where to draw your window will be different to the code checking the network for new data, or the logic that stores information safely on your file system. Each of those tasks is decomposed and abstracted into simpler and simpler steps until eventually it is simple arithmetic dictating what the processor should do next. You occasionally see hot spots where a particular sequence of instructions is doing a lot of heavy lifting, and there is a whole discipline devoted to managing computational complexity and ensuring algorithms are as efficient as possible.
However the various technologies that are currently wowing the world work very differently. They are models of various networks represented by a series of magic numbers or "weights" arranged in a hierarchical structure of interconnected matrices. While there is a lot of nuance to how problems are encoded and fed into these models, fundamentally the core piece of computation is multiplying one bunch of numbers by another bunch of numbers and feeding the results into the next layer of the network. At the end of the process the model spits out a prediction of what the most likely next word is going to be. After selecting one the cycle repeats, taking into account our expanded context to predict the most likely next word.
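As a rough illustration, here is a minimal sketch of that core loop in Python with NumPy. The layers, weights and tiny vocabulary are entirely made up for the example - a real model has billions of weights, attention layers and a proper tokeniser - but the shape of the computation (multiply, feed forward, pick the most likely next token, repeat with the expanded context) is the same.

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy "model": two layers of made-up weights and a tiny vocabulary.
# Real models have billions of weights spread across many layers.
vocab = ["the", "cat", "sat", "on", "mat", "."]
W1 = rng.standard_normal((len(vocab), 16))   # embedding-ish layer
W2 = rng.standard_normal((16, len(vocab)))   # projection back onto the vocabulary

def predict_next(context_ids):
    # Multiply one bunch of numbers by another bunch of numbers...
    hidden = np.tanh(W1[context_ids].mean(axis=0))
    # ...feed the result into the next layer and turn it into probabilities.
    logits = hidden @ W2
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs))

# The cycle: predict a word, append it to the context, predict again.
context = [vocab.index("the"), vocab.index("cat")]
for _ in range(4):
    context.append(predict_next(context))

print(" ".join(vocab[i] for i in context))
```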
The "models" that drive these things are described mostly by the number of parameters they have. This encompasses the
number of inputs and outputs they have and the number of numbers in between. For example common small open source models
start at 3 billion parameters with 7, 13 and 34 billion also being popular sizes. Beyond that it starts getting hard to
run models locally on all but the most tricked out desktop PCs. As a developer my desktop is pretty beefy (32 cores,
64Gb RAM) and can chew through computationally expensive builds pretty easily. However as I can't off-load processing
onto my GPU a decent sized model will chug out a few words a second while maxing out my CPU. The ChatGPT v4 model is
speculated to run about 1.7 trillion parameters which needs to be run on expensive cloud hardware - I certainly don't
envy OpenAI their infrastructure bill.
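Some back-of-the-envelope arithmetic shows why the parameter count matters so much for local use. These figures are just the storage for the weights themselves and ignore all the runtime overhead:

```python
def weights_size_gb(parameters: float, bytes_per_weight: float) -> float:
    """Memory needed just to hold the weights, ignoring activations
    and other runtime overhead."""
    return parameters * bytes_per_weight / 1e9

for billions in (3, 7, 13, 34, 1700):
    fp32 = weights_size_gb(billions * 1e9, 4)   # 32-bit floats
    fp16 = weights_size_gb(billions * 1e9, 2)   # 16-bit floats
    print(f"{billions:>5}B parameters: ~{fp32:7.0f} GB fp32, ~{fp16:7.0f} GB fp16")
```

Even at 16-bit precision a 34 billion parameter model barely fits in 64GB of RAM, and trillion-parameter territory is clearly out of reach for a desktop.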
Of course the computational power needed to run these models is a mere fraction of what it took to train them. In fact the bandwidth and processing requirements are so large it pays to develop custom silicon that is really good at multiplying large arrays of numbers and not much else. You can get a lot more bang for your buck compared to running those calculations on a general purpose CPU designed for tackling a wide range of computational problems.
The Value of Numbers
Because of the massive investment in synthesising these magic numbers they themselves become worth something. The "magic sauce" behind a model is more about how it was trained and what data was used to do it. We already know it's possible to encode society's biases into models through sloppy selection of the input data. One of the principal criticisms of proprietary generative models is how opaque the training methods are, making it hard to judge their safety. The degree to which models may regurgitate data without any transformation is hard to quantify when you don't know what went into them.
As I'm fundamentally more interested in knowing how the technology I use works under the hood, it's fortunate there is a growing open source community working on building their own models. Credit should be given to Meta, who made their language model LLaMA 2 freely available on fairly permissive terms. Since then there has been an explosion of open source projects that can run the models (e.g. llama.cpp, Ollama) and provide front-ends for them (e.g. Oobabooga's text generation UI, the Ellama front-end for Emacs).
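To give a feel for how approachable this has become, here is a minimal sketch using the llama-cpp-python bindings around llama.cpp. The model path, prompt and generation parameters are placeholders - you would point it at whichever GGUF model file you have downloaded locally:

```python
# pip install llama-cpp-python  (Python bindings around llama.cpp)
from llama_cpp import Llama

# Placeholder path: any GGUF-format model file you have downloaded locally.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

output = llm(
    "Q: What does a systems programmer do? A:",
    max_tokens=128,
    stop=["Q:"],   # stop before the model starts asking itself new questions
)
print(output["choices"][0]["text"])
```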
Smaller Magic Numbers
The principal place where this work is going on is Hugging Face. Think of it as the GitHub of the machine learning community. It provides an environment for publishing and collaborating on data sets and models, as well as hosting and testing their effectiveness in various benchmarks. This makes experimenting with models accessible to developers who aren't part of the well-funded research divisions of the various tech titans. Datasets, for example, come with cards which describe the sources that went into these multi-terabyte files.
One such example is the RedPajama dataset. This is an open source initiative to recreate the LLaMA training data, combining data from the open web as well as numerous permissively licensed sources such as Wikipedia, GitHub, StackExchange and ArXiv. This dataset has been used to train models like OpenLLaMA in an attempt to provide an unencumbered version of Meta's LLaMA 2. However training up these foundational models is an expensive and time consuming task; the real action is in taking those models and then fine tuning them for particular tasks.
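For a concrete flavour of working with these datasets, the Hugging Face datasets library can stream a slice of the published RedPajama sample without downloading terabytes up front. The dataset id and field names below are assumptions based on how the sample subset is published on the hub, and the exact invocation may vary between library versions:

```python
# pip install datasets
from datasets import load_dataset

# Stream the published sample subset rather than fetching the full corpus.
ds = load_dataset("togethercomputer/RedPajama-Data-1T-Sample",
                  split="train", streaming=True)

# Peek at the first few documents and the metadata describing their source.
for i, doc in enumerate(ds):
    print(doc.get("meta"), str(doc.get("text", ""))[:80])
    if i == 2:
        break
```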
To fine tune a model you first take a general purpose model and train it further against data with a specific task in mind. The purpose of this is not only to make your new model better suited to a particular task but also to optimise the number of calculations the model has to do to achieve acceptable results. This is also where the style of prompting gets set, as you feed the model examples of the sort of questions and answers you want it to give.
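As a very rough sketch of what fine tuning looks like with the Hugging Face transformers library - the base model name, training file and hyper-parameters are all placeholders, and a real fine tune would add evaluation, checkpointing and almost certainly parameter-efficient tricks like LoRA:

```python
# pip install transformers datasets torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "openlm-research/open_llama_3b"   # placeholder: any pretrained causal LM
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token            # LLaMA-style tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Task-specific examples: a local JSON-lines file with a "text" field holding
# question/answer pairs formatted in the prompt style you want to teach.
data = load_dataset("json", data_files="my_task_examples.jsonl", split="train")
data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```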
There are further stages that can be applied, including "alignment", where you ensure results are broadly in tune with the values of the organisation. This is the reason the various chatbots around won't readily cough up the recipe for building nukes or make it easier to explicitly break the law. This can be augmented with Reinforcement Learning from Human Feedback (RLHF), which is practically the purpose of every CAPTCHA you'll have filled in over the last 25 years online.
Finally the model can be quantised to make it more manageable. This takes advantage of the fact that a lot of the numbers will have a negligible effect on the result for a wide range of inputs. In those cases there is no point storing them at full precision. As computation is a function of the number of bits of information being processed, this also reduces the cost of computation. While phones and other devices increasingly include dedicated hardware to process these models they are still constrained by physics - the more you process the more heat you need to dissipate, the more battery you use and the more bandwidth you consume. Obviously the more aggressively you quantise a model the worse it will perform, so there is an engineering trade-off to make. Phones work best with multiple highly tuned models solving specific tasks as efficiently as possible. Fully flexible models giving a J.A.R.V.I.S-like experience will probably always need to run in the cloud where thermal management is simply an exercise in plumbing.
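The idea behind quantisation is simple enough to sketch. This toy example maps 32-bit floats onto 8-bit integers with a single scale factor per tensor; real schemes are cleverer (per-block scales, 4-bit packing, special handling of outliers) but the trade-off between size and precision is the same:

```python
import numpy as np

def quantise_int8(weights: np.ndarray):
    """Map float32 weights onto int8 with one scale factor for the tensor."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)
q, scale = quantise_int8(w)

print(f"original: {w.nbytes / 1e6:.1f} MB, quantised: {q.nbytes / 1e6:.1f} MB")
print(f"mean absolute error: {np.abs(w - dequantise(q, scale)).mean():.5f}")
```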
Making magic numbers work for you
Before we get into using models I want to cover three more concepts: "prompts", "context" and "hallucinations".
The prompt is the closest thing there is to "programming" the model. The prompt can be purely explicit or include other inputs behind the scenes. For example the prompt can instruct the model to be friendly or terse, to decorate code snippets with markdown, or to make changes as diffs or as full functions. Generally the more explicit your prompt is about what you want, the better the result you get from the model. Prompt engineering has the potential to be one of those newly created job titles that will have to replace the jobs obsoleted by advancing AI. One of the ways to embed AI APIs into your app is to create a task-specific prompt that is put in front of the user's input and guides the results towards what you want.
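That embedding pattern is easy to sketch. Everything below is made up for illustration: the application owns a fixed task prompt, the user only supplies the text to be worked on, and send_to_model() is a stand-in for whichever API or local model you happen to be using.

```python
# A hypothetical task-specific prompt an application might hide behind the scenes.
TASK_PROMPT = """You are a terse release-notes assistant.
Rewrite the text the user provides as three short markdown bullet points.
Do not add information that is not in the original text.

User text:
"""

def send_to_model(prompt: str) -> str:
    """Placeholder for a call out to a hosted API or a locally run model."""
    raise NotImplementedError

def summarise(user_text: str) -> str:
    # The user never sees or edits TASK_PROMPT; it steers every request.
    return send_to_model(TASK_PROMPT + user_text)
```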
The "context" is the rest of the input into the model. That could be the current conversation in a chat or the current
page of source code in a code editor. The larger the context the more reference the model has for its answer although
that does come at the cost of even more composition as the context makes for more input parameters into the model.
In a strong candidate for 2023's word of the year, "hallucination" describes the quirky and sometimes unsettling behaviour of models outputting weird, sometimes contradictory information. They will sincerely and confidently answer questions with blatant lies or start regurgitating training data when given certain prompts. It is a salient reminder that the statistical nature of these generative models means they will occasionally spout complete rubbish. They are also very prone to following the lead of their users - the longer you chat with a model the more likely it is to end up agreeing with you.
So let's talk about what these models can and can't do. As a developer one of the areas I'm most interested in is their ability to write code. Systems code especially is an exercise in precisely instructing a computer what to do in explicit situations. I'd confidently predicted my job would be one of the last to succumb to the advance of AI, as systems aren't something you can get "mostly" right. It was quite a shock when I first saw just how sophisticated the generated code can be.
Code Review
One of the first things I asked ChatGPT to do was review a function I'd written. It managed to make 6 observations about the code, 3 of which were actual logic problems I'd missed and 3 of which were general points about variable naming and comments. The prompt is pretty important though. If not constrained to point out actual problems, LLMs have a tendency to spit out rather generic advice about writing clean, well-commented code.
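For what it's worth, the kind of constrained prompt that has worked better for me looks something like the sketch below. The wording is purely illustrative and my_function.c is just a placeholder for whatever you want reviewed:

```python
review_prompt = """Review the following function.
Only report actual bugs, logic errors or undefined behaviour you can point to.
Do not give general advice about naming, comments or style.
For each issue, quote the relevant line and explain the problem in one sentence.

"""

# Assemble the full prompt to paste into (or send to) your chat tool of choice.
with open("my_function.c") as f:
    print(review_prompt + f.read())
```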
They can be super useful when working with an unfamiliar language or framework. If you are having trouble getting something to work it might be faster to ask an LLM how to fix your function than to spend time reading multiple StackOverflow answers to figure out what you've misunderstood. If compiler errors are confusing, supplying the message alongside the code can often be helpful in understanding what's going on.
Writing Code
However rather than just suggesting changes, one very tempting use case is writing code from scratch based on a description of what you want. Here the context is very important: the more detail you provide, the better the chance of generating something useful. My experience has been that the solutions are usually fairly rudimentary and can often benefit from a manual polishing step once you have something working.
For my QEMU KVM Forum 2023 Keynote I got ChatGPT to write the first draft of a number of my data processing scripts. However it missed obvious optimisations, repeatedly reading unchanging values inside inner loops, which made the scripts slower than they needed to be.
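The sort of thing I mean is sketched below - not the actual script, just an illustration of the pattern. The generated draft kept re-reading a value that never changes inside the inner loop; hoisting it out is an easy win that a human (or a profiler) spots quickly:

```python
import json

def total_weighted(rows, config_path):
    # What the generated draft effectively did: re-read the same config
    # file for every row, even though it never changes.
    total = 0
    for row in rows:
        weight = json.load(open(config_path))["weight"]
        total += row * weight
    return total

def total_weighted_fixed(rows, config_path):
    # The obvious fix: read the invariant value once, outside the loop.
    with open(config_path) as f:
        weight = json.load(f)["weight"]
    return sum(row * weight for row in rows)
```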
If the task is a straight transformation they are very good. Ask an LLM to convert a function in one language into another and it will do a pretty good job - probably with fewer mistakes than your own first attempt. However there are limitations. For example I asked a model to convert some AArch64 assembler into the equivalent 32-bit Arm assembler. It did a very good job of the mechanical part but missed the subtle differences in how to set up the MMU. This resulted in code which compiled but didn't work until debugged by a human who was paying close attention to the architecture documentation as they went.
One of the jobs LLMs are very well suited for is writing code that matches an existing template. For example if you are mechanically transforming a bunch of enums into a function that converts them to strings, you need only do a few examples before there is enough context for the LLM to reliably figure out what you are doing. LLMs are a lot more powerful than simple template expansion because you don't need to explicitly define the template first. The same is true of tasks like generating test fixtures for your code.
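As a hedged illustration of the pattern (in Python rather than the C it would more usually be): after a couple of hand-written cases the shape of the function is obvious, and an LLM asked to continue will reliably fill in the remaining values in the same style, just as it will churn out near-identical test fixtures once it has seen one or two.

```python
import errno

def errno_name(code: int) -> str:
    """Convert an errno value into a readable string."""
    if code == errno.EPERM:
        return "EPERM (operation not permitted)"
    if code == errno.ENOENT:
        return "ENOENT (no such file or directory)"
    if code == errno.EINTR:
        return "EINTR (interrupted system call)"
    # ...after a few examples like these an LLM will happily generate the
    # remaining cases in the same style without being given an explicit template.
    return f"unknown errno {code}"
```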
There is a potential trap, however, with using LLMs to write code. As there is no source code, and the proprietary models are fairly cagey about exactly what data they were trained on, there are worries about them committing copyright infringement. There are active debates ongoing in the open source community (e.g. on qemu-devel) about the potential ramifications of a model regurgitating its training data. Without clarity on what licence that data carries there is a risk of contaminating projects with code of unknown provenance. While I'm sure these issues will be resolved in time, it's certainly a problem you need to be cognisant of.
Prose
Writing prose is much more natural territory for LLMs, and an area where low-effort text generation will be rapidly replaced by generative models like ChatGPT. "My" previous blog post was mostly written by ChatGPT from a simple brief and a few requests for rewrites in a chat session. While it made the process fairly quick, the result comes across as a little bland and "off". I find there is a tendency for LLMs to fall back on fairly obvious generalisations and erase any unique authorial voice there may have been.
However if you give it enough structure it's very easy to get an LLM to expand a bullet list into more flowery prose. They are even more powerful when fed a large piece of text and asked to summarise the key information in a more accessible way.
They are certainly an easy way to get a first-pass review of your writing, although I try to re-phrase things myself rather than accept suggestions verbatim, to keep my voice coming through the text.
Final Thoughts
The recent advances in LLMs and the public's exposure to popular tools like ChatGPT have certainly propelled the topic of AI into the zeitgeist. While we are almost certainly approaching the "Peak of Inflated Expectations" stage of the hype cycle, they will undoubtedly be an important step on the road to the eventual goal of Artificial General Intelligence (AGI).
We are still a long way from being able to ask computers to solve complex problems the way they can in, for example, Star Trek. However in their current form they will certainly have a big impact on the way we work over the next decade or so.
It's important as a society that we learn how these models are built, what their limitations are, and what their computational cost and resultant impact on the environment is. It will be a while before I'd want to trust a set of magic numbers over a carefully developed algorithm to actuate the control surfaces of a plane I'm flying on. However they are already well placed to help us learn new information through interactive questioning and to summarise random information from the internet. We must learn to recognise when we've gone down a hallucinatory rabbit hole and verify what we've learned against trusted sources.