
2024-03-01

cloud infrastructure that proposes verifiable bugfixes

LLMs are thus far the fastest-adopted technology in history: OpenAI reported that ChatGPT reached 100 million users within 2 months of its release. Chief among the use cases for LLMs is code generation. Today, we mostly generate code with LLMs using chat-like interfaces within a terminal, website, or code editor. However, as impressive as LLMs are at coding today, any developer who has used them for code generation knows that, even with the right context, the vast majority of the time LLMs don't generate code that works as-is. Instead, the generated code either needs to be thoroughly massaged, or it serves as inspiration for the code we ultimately end up writing. In some cases, the LLM-generated code seems to come close to the desired solution and we verify it by perusing it and/or manually running it within our codebases. However, this rarely works: one Princeton study that tasked two different LLMs with solving 2,294 real GitHub issues found that Claude 2 and GPT-4 solved only 4.8% and 1.7% of them respectively.

Let's look at the most common types of LLM code generation and, for the debugging case, explore a workflow that automatically verifies LLM-generated bugfixes outside of a chat or visual interface, where there is no human in the loop to catch the LLM's many mistakes. Workflows that programmatically verify LLM-generated code might not be the norm today, but they will become more attractive as the price of invoking LLMs goes down and the generative coding abilities of LLMs improve.

the different types of LLM code generation

Let's take a closer look at the different scenarios in which developers use LLMs for code generation:

1. "I know how to write this code but it's repetitive or cumbersome to write"

Code editor AI copilots shine here. This is one of the most common scenarios and we can easily verify the accuracy of the code as there are often many code samples that can be given to the LLM. Some examples include implementing a function with a well-defined purpose, calling an API and then performing some sort of data transformation on the response, or generating self-contained components in React or some other reactive frontend framework.
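For instance, a copilot can reliably churn out glue code like the sketch below, which calls a JSON API and reshapes the response; the endpoint and field names are made up for illustration.

```python
# Hypothetical example of the repetitive glue code copilots handle well:
# call a JSON API and keep only the fields we care about.
# The /users endpoint and field names are made up for illustration.
import json
import urllib.request

def fetch_active_user_emails(api_url: str) -> list[dict]:
    with urllib.request.urlopen(f"{api_url}/users") as response:
        users = json.load(response)
    return [
        {"id": user["id"], "email": user["email"]}
        for user in users
        if user.get("active")
    ]
```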

2. "I don't know how to write this code. Help"

This use case is tricky because the LLM can often lead us astray unless we reason from first principles and try to understand the underlying technology, API and/or framework. However, we often don't have many other alternatives, as searching Google and Stack Overflow probably didn't help much either. This often ends up being a conversation with a lot of back and forth where generated code is interleaved with explanations and ideas for how to approach the problem in ways that may or may not work.

3. "Write tests for this code"

This one is straightforward for the LLM to understand, but even when the generated tests pass there is still no easy way to know whether they do a sufficiently good job of testing the provided code. It is also up to the developer to be exhaustive in describing the many different ways in which the provided code should be tested and what kind of input should be used for unit tests. I worked on a syntax that combined English and functional notation into a constrained format that LLMs compiled into tested Python programs. We tried to make sure the LLM had enough information to generate valid tests by refusing to compile the syntax if the user did not provide examples, but there was still no way of verifying the LLM-generated unit tests themselves.
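As a rough illustration of why passing tests aren't enough, here is the kind of pytest suite an LLM might produce for a small, hypothetical helper; the tests pass, yet they only cover the cases someone thought to spell out.

```python
# Hypothetical helper plus LLM-style tests for it. The suite passes, but it
# only exercises the inputs that were explicitly described to the LLM.
import pytest

def parse_duration(value: str) -> int:
    """Convert strings like '5s' or '2m' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(value[:-1]) * units[value[-1]]

def test_seconds():
    assert parse_duration("5s") == 5

def test_minutes():
    assert parse_duration("2m") == 120

def test_unknown_unit_raises():
    with pytest.raises(KeyError):
        parse_duration("3d")  # days were never specified, so this is all we check
```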

4. "Make this error go away"

This one is straightforward for the LLM to understand. The developer often just provides the stacktrace and can sometimes add more context by including the code referenced in the stacktrace. LLMs tend to do better when the error comes from a commonly used programming language and/or framework, as training data from Reddit and Stack Overflow is likely richer for those cases. The LLM can often output instructions or sample code that the developer can implement to quickly test if the bugfix works. Like the 2nd use case on this list, this use case can also end up being a conversation with a lot of back and forth where code generated by the LLM is interleaved with suggestions on ways to resolve the error that may or may not work.
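A minimal sketch of that flow might look like the following, where complete() is a stub standing in for whatever LLM client is actually used.

```python
# Sketch of the "make this error go away" flow: bundle the stacktrace with the
# code it references and ask an LLM for a candidate fix. complete() is a stub
# standing in for a real LLM client call.
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client of choice here")

def build_bugfix_prompt(stacktrace: str, referenced_code: dict[str, str]) -> str:
    context = "\n\n".join(
        f"# {path}\n{source}" for path, source in referenced_code.items()
    )
    return (
        "The following error was recorded:\n"
        f"{stacktrace}\n\n"
        "Relevant source files:\n"
        f"{context}\n\n"
        "Propose a minimal patch that makes the error go away."
    )

def propose_bugfix(stacktrace: str, referenced_code: dict[str, str]) -> str:
    return complete(build_bugfix_prompt(stacktrace, referenced_code))
```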

I'm particularly interested in automating the verification of the code generated for the last use case because, when debugging, the desired outcome is binary and always verifiable: did the error go away, and did any new errors arise?
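In code, that check is nothing more than a comparison of the errors observed before and after applying the patch; a sketch, with error signatures represented as plain strings:

```python
# The verification predicate is binary: the original error must be gone and the
# patched app must not surface any errors that were not already there.
def bugfix_verified(original_error: str,
                    errors_before: set[str],
                    errors_after: set[str]) -> bool:
    introduced = errors_after - errors_before
    return original_error not in errors_after and not introduced
```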

automatically verifying LLM-generated bugfixes

Automatically debugging exceptions within a containerized production application is possible with an opinionated cloud infrastructure and error monitoring setup that collects the data needed to verify bugfixes and, behind the scenes, tasks an LLM with generating a bugfix for each recorded production error. A pull request to the application's codebase is submitted if and only if a bugfix verifiably fixes a given error. Pull requests ideally run with unit tests that ensure there are no breaking changes or regressions. Finally, one of the maintainers can take a closer look at the LLM's proposal for a verified bugfix to a particular error and decide to discard it, edit it further, or merge it.

In order for the cloud infrastructure hosting the production application to verify the LLM-generated bugfixes, the platform needs to be able to parse and collect incoming requests and outgoing network traffic using eBPF (more on that later). The infrastructure offering can then create branches off the application's main git branch with potential bugfixes and deploy them to sandboxed development environments behind the scenes. Finally, the cloud offering replays the incoming requests against the sandboxed environments while mocking the outgoing network traffic with the values recorded at the time the error took place. If no error is thrown, the cloud offering opens a pull request for the maintainers to review. This is a rough sketch of what I am describing:
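(Every function named in the sketch below is a placeholder for functionality such a platform would implement, not an existing API.)

```python
# Rough sketch of the verification loop. All helpers are hypothetical
# placeholders for the platform's internals.
def attempt_verified_bugfix(error, codebase, recorded_traffic):
    # 1. Ask an LLM for a candidate patch given the error and the relevant code.
    patch = generate_bugfix_with_llm(error.stacktrace, codebase)

    # 2. Branch off main with the patch and deploy it to a sandboxed environment.
    branch = create_fix_branch(codebase, patch)
    sandbox = deploy_sandbox(branch)

    # 3. Replay the recorded incoming requests against the sandbox while mocking
    #    outgoing calls with the responses recorded when the error occurred.
    result = replay_with_mocks(
        sandbox,
        incoming=recorded_traffic.incoming_requests,
        outgoing_stubs=recorded_traffic.outgoing_responses,
    )

    # 4. Open a pull request only if the original error is gone and the replay
    #    introduced no new errors.
    if error.signature not in result.errors and not result.new_errors:
        open_pull_request(branch, title=f"Verified fix for {error.signature}")
```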

A cloud infrastructure platform that is able to generate and verify LLM bugfixes while also providing the best UX requires two important features:

1. git-based deployment process for containerized apps

A git-centered deployment experience allows us to deploy by pushing a git branch to a remote repository without having to configure a CI/CD system. This workflow has been around since 2009 thanks to Heroku, the grandfather of all PaaS offerings. This DevOps abstraction tucks away a fair amount of complexity by making some assumptions about our app. In most cases, it really should be as simple as pushing a branch that contains a Dockerfile or docker-compose.yml (or nothing in particular if using Nixpacks), without having to deal with the fairly standard DevOps process of:

  • building the docker image from a Dockerfile
  • pushing the image to a repository
  • updating the process in a VM or Kubernetes pod with the new code
I do wish more container cloud offerings today were as easy to use as the git-based web app offerings of Netlify or Vercel. Fly.io is one good example of a provider with a seamless, git-based deployment process for containerized apps. Cloud infrastructure that abstracts away the repetitive parts of the deployment process not only has superior UX, but can easily provide features that save us a lot of development time, like preview deployments. Most importantly, a git-based deployment process also makes it possible to programmatically replicate deployments behind the scenes, as sketched below. This is imperative for proposing verifiable bugfixes: we can test whether an LLM generates code that fixes a given bug, and do that repeatedly for every bug, without an overly complicated CI/CD workflow or some sort of YAML spaghetti.
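As a sketch of what programmatically replicating a deployment could look like under a git-push model (the "deploy" remote and the branch naming scheme are assumptions, not any provider's actual API):

```python
# Sketch: commit an LLM-generated patch to a throwaway branch and push it to a
# git-based deployment remote so the platform spins up a sandboxed deploy.
# The "deploy" remote and branch naming scheme are assumptions.
import subprocess

def push_candidate_fix(repo_dir: str, patch_file: str, error_id: str) -> str:
    def git(*args: str) -> None:
        subprocess.run(["git", *args], cwd=repo_dir, check=True)

    branch = f"bugfix-candidate/{error_id}"
    git("checkout", "main")
    git("checkout", "-b", branch)
    git("apply", patch_file)             # apply the LLM-generated patch
    git("commit", "-am", f"Candidate fix for {error_id}")
    git("push", "deploy", branch)        # the push triggers a sandboxed deployment
    return branch
```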

2. replay of incoming traffic and mocking of outgoing traffic via eBPF-based, SDK-less error monitoring

Error monitoring tools are used primarily to triage application errors. However, some offerings can also collect network traffic to later replicate production conditions in sandboxed staging environments and improve the reliability of upcoming software releases. Most error monitoring tools offer SDKs to explicitly instrument errors and network calls within an app, which can be tedious. However, some providers like Datadog have SDK-less error monitoring offerings built on eBPF.

eBPF is a Linux technology that can run sandboxed programs in the kernel without changing kernel source code. It opens the door for powerful error monitoring tools that hook directly into kernel syscalls without any additional syntax required within a containerized codebase. However, fully SDK-less error monitoring via eBPF programs requires privileged, or sudo, access to hook into the kernel and then spawn the containerized app as a child process. This can be a complicated setup, as most containerized deployment systems provide a sandboxed, or unprivileged, environment to run apps in. As such, SDK-less error monitoring is less common and typically reserved for business-critical, high-performance needs. However, an SDK-less setup provides both a better UX and the ability to replay incoming traffic and mock outgoing traffic behind the scenes, which is necessary to verify the LLM's proposed bugfixes programmatically. Mocking all outgoing network traffic is particularly tricky, as any communication with downstream databases and services needs to be properly parsed, recorded and stubbed, which would not be possible without eBPF hooks.
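As a taste of what hooking the kernel buys you without touching the app, here is a minimal sketch using the bcc Python bindings (it assumes a Linux host with bcc installed and requires root); it merely counts outgoing TCP bytes per process, far short of fully parsing and recording traffic, but it illustrates the hook point.

```python
# Minimal eBPF sketch with bcc: count outgoing TCP bytes per process by hooking
# the kernel's tcp_sendmsg, with zero instrumentation inside the monitored app.
# Requires root and a Linux kernel with BPF support; bcc must be installed.
import time
from bcc import BPF

bpf_program = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>

BPF_HASH(sent_bytes, u32, u64);

int trace_tcp_sendmsg(struct pt_regs *ctx, struct sock *sk,
                      struct msghdr *msg, size_t size) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 total = size;
    u64 *existing = sent_bytes.lookup(&pid);
    if (existing) {
        total += *existing;
    }
    sent_bytes.update(&pid, &total);
    return 0;
}
"""

b = BPF(text=bpf_program)
b.attach_kprobe(event="tcp_sendmsg", fn_name="trace_tcp_sendmsg")

print("Tracing outgoing TCP bytes per process... Ctrl-C to stop.")
try:
    while True:
        time.sleep(5)
        for pid, total in b["sent_bytes"].items():
            print(f"pid={pid.value} bytes_sent={total.value}")
except KeyboardInterrupt:
    pass
```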

devtools that can automatically verify LLM-generated code

Certain devtools in domains where the functionality of LLM-generated code can be verified have been on the rise, as they can save developers countless hours. This trend will continue as the price of invoking LLMs goes down and the generative coding abilities of LLMs improve. Figma-to-code plugins are now abundant and significantly better than only a couple of years ago, thanks to newer generative AI models that can understand design assets and use them to verify the correctness of LLM-generated frontend code. Similarly, we can verify code generated by an LLM tasked with fixing application errors because the expected outcome is simply that the error goes away and no new errors arise.

Drop me a line at lfdepombo at gmail dot com if you want to learn more about a cloud infrastructure offering I have been working on that proposes verifiable bugfixes. I would also love to hear your thoughts on any type of programmatic verification for LLM-generated code, or on building abstractions with LLMs without directly exposing a chat, or visual, interface to end-users.