Coding Agents for Systems Programming
What building a bare-metal OS taught me about working with AI
For the past year I've been building NanoOS, a bare-metal OS in Rust for a small RISC-V board (a LicheeRV Nano), with the goal of running statically linked ELF binaries compiled for Linux. It boots from the onboard SD card, outputs to UART, and can run a separately compiled Rust hello world on real hardware. I've spent years programming above these abstractions. I wanted to better understand them from the syscall level down.
Coding agents have been part of my work throughout. I started the project with the assistance of GitHub Copilot (not much more than fancy code completion at the time) and ChatGPT for design and debugging exploration, and have since moved to CLI tools like Claude Code, which I direct through analysis, design, planning, coding, and debugging. The tools have become very good at some things, but a limitation remains: they can only do what they're told. At the systems level, the responsibility for identifying the problem, framing it precisely, and choosing among possible solutions belongs to the engineer.
Understanding that limitation clarifies where coding agents add the most value.
What Agents Are Good At
Once you give a coding agent direction on what you want, it can draw on what it's been trained on to create accurate code for many things:
- Complete CPU and hardware register-set definitions
- File formats and offsets into them
- Tricky multi-tier memory management units and how to initialize them
- Conventions from other systems' code like syscall dispatch tables
It also applies idiomatic patterns from systems code it's seen before. This is useful when the task is well-scoped.
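A syscall dispatch table is a good example of a convention an agent reproduces reliably. Here's a minimal sketch, not NanoOS code: the syscall numbers are the riscv64 Linux ones (`write` = 64, `exit` = 93), but the handlers are illustrative stand-ins.

```rust
// Illustrative stand-in for a write handler: a real kernel would copy
// the buffer out of user memory; here fd 1 (stdout) just "accepts" it.
fn sys_write(fd: u64, _buf: u64, len: u64) -> i64 {
    if fd == 1 { len as i64 } else { -9 } // -EBADF
}

// Dispatch on the syscall number, returning -ENOSYS for anything
// the kernel doesn't implement yet.
fn dispatch(nr: u64, a0: u64, a1: u64, a2: u64) -> i64 {
    match nr {
        64 => sys_write(a0, a1, a2),              // write
        93 => { /* exit: tear down the thread */ 0 }
        _ => -38,                                  // -ENOSYS
    }
}

fn main() {
    assert_eq!(dispatch(64, 1, 0, 5), 5);   // write to stdout "succeeds"
    assert_eq!(dispatch(17, 0, 0, 0), -38); // unimplemented -> ENOSYS
    println!("dispatch ok");
}
```

Returning `-ENOSYS` from the default arm is also what makes the "run a binary until it hits an unimplemented syscall" discovery technique described later possible.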
While developing the read capability for NanoOS's ext2 filesystem, I figured out all the virtual filesystem functions I wanted to support and defined the order I wanted to implement them in. I then methodically worked with the agent to implement them one by one, including creating the code to test and exercise things along the way. I built code that accurately modeled the on-disk data structures for the superblock, inodes, and directory entries, and code to traverse those structures. The result was a program that correctly read a filesystem created on my development host.
Given good specifications, the coding agent helped me write a well-structured, working program.
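To give a flavor of that on-disk modeling, here's a sketch of parsing a few superblock fields out of a byte buffer. The offsets and the 0xEF53 magic come from the published ext2 layout (the superblock sits at byte 1024 on disk); the buffer and struct names here are my own illustration, not the NanoOS code.

```rust
// Little-endian field readers for raw on-disk bytes.
fn le16(b: &[u8], off: usize) -> u16 {
    u16::from_le_bytes([b[off], b[off + 1]])
}
fn le32(b: &[u8], off: usize) -> u32 {
    u32::from_le_bytes([b[off], b[off + 1], b[off + 2], b[off + 3]])
}

struct Superblock {
    inodes_count: u32,
    blocks_count: u32,
    block_size: u32,
}

fn parse_superblock(sb: &[u8]) -> Option<Superblock> {
    if le16(sb, 56) != 0xEF53 {
        return None; // s_magic at offset 56 must be 0xEF53
    }
    Some(Superblock {
        inodes_count: le32(sb, 0),        // s_inodes_count
        blocks_count: le32(sb, 4),        // s_blocks_count
        block_size: 1024 << le32(sb, 24), // 1024 << s_log_block_size
    })
}

fn main() {
    // Fabricated superblock: 128 inodes, 1024 blocks, 1 KiB blocks.
    let mut raw = [0u8; 1024];
    raw[0..4].copy_from_slice(&128u32.to_le_bytes());
    raw[4..8].copy_from_slice(&1024u32.to_le_bytes());
    raw[56..58].copy_from_slice(&0xEF53u16.to_le_bytes());
    let sb = parse_superblock(&raw).expect("valid magic");
    assert_eq!(sb.block_size, 1024);
    println!("inodes={} blocks={}", sb.inodes_count, sb.blocks_count);
}
```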
The Trap
When programming at the system level you may not understand all the tradeoffs and decisions that matter.
Coding agents fall down when you hand off implementation responsibility before you thoroughly understand the solution you need. If you haven’t carefully explored the outcome you want, how the code should be structured, and what correctness provisions need to be in place, the agent won’t explore the design space for you.
The agent optimizes for the problem as stated. It will produce code that compiles and that, on the surface, even produces results that look correct. However, things will be overlooked: correct write order for crash safety, reentrancy, proper use of DMA and interrupts, and future extensibility.
There’s a gap between “compiles and runs” and “is correct for what I need.”
The Write Problem
Adding write support to the ext2 filesystem is where I really came to understand this gap. Starting with the ability to read an ext2 filesystem had been a straightforward choice: I needed somewhere to keep the ELF binaries I was going to execute, and an ext2 filesystem on the SD card appeared to be one of the simplest Linux-oriented filesystems to interpret.
The problems started to become apparent when I asked the coding agent for an implementation plan for ext2 write. I always start with a plan: without one, the agent gets ahead of my ability to verify progress. Looking at that initial plan, though, it was clear the agent was proposing a naive approach. For any given operation it would determine which blocks needed to be written to disk, with no specified ordering, and although it knew it had DMA available, it had no plan for treating file mutation operations asynchronously. It would have hung the requesting thread until all the block writes were complete.
With the agent’s help, I explored how to do proper write ordering, how this would actually integrate with the disk block cache, and, importantly, how to create a queue of pending writes so they could be performed in the correct order. I layered in provisions for allowing subsequent mutations to blocks already waiting in the cache, and discovered that I’d need multiple priority-ordered queues to guarantee the correct write order. The design was getting very complex.
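The queue idea can be sketched like this. The tiers and their names are hypothetical, chosen only to illustrate the ordering constraint: a data block must land before the inode that references it, and the inode before the directory entry.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Hypothetical priority tiers: lower values must reach disk first.
const TIER_DATA: u8 = 0;
const TIER_INODE: u8 = 1;
const TIER_DIRENT: u8 = 2;

struct WriteQueue {
    // Min-heap of (tier, sequence, block number); the sequence number
    // keeps writes FIFO within a tier.
    heap: BinaryHeap<Reverse<(u8, u64, u64)>>,
    seq: u64,
}

impl WriteQueue {
    fn new() -> Self {
        Self { heap: BinaryHeap::new(), seq: 0 }
    }
    fn enqueue(&mut self, tier: u8, block: u64) {
        self.heap.push(Reverse((tier, self.seq, block)));
        self.seq += 1;
    }
    fn next(&mut self) -> Option<u64> {
        self.heap.pop().map(|Reverse((_, _, block))| block)
    }
}

fn main() {
    let mut q = WriteQueue::new();
    // A file append touches a directory entry, an inode, and a data
    // block, enqueued in no particular order...
    q.enqueue(TIER_DIRENT, 900);
    q.enqueue(TIER_INODE, 12);
    q.enqueue(TIER_DATA, 5000);
    // ...but crash safety requires they drain in dependency order.
    assert_eq!(q.next(), Some(5000)); // data first
    assert_eq!(q.next(), Some(12));   // then the inode
    assert_eq!(q.next(), Some(900));  // directory entry last
    println!("ordered ok");
}
```

Even this sketch ignores the hard parts: coalescing a second mutation to a block already queued, and keeping the queues consistent with the LRU block cache.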
Even though I had dismissed implementing ext3 or ext4 earlier because a journaled filesystem sounded like more complexity than I wanted to take on, I asked the agent to compare my ext2 design to journaling in ext3 or ext4. The answer was surprising. The agent suggested that an ext3 journaled filesystem would end up being as resilient to crashes (if not more so) and would be easier to implement. No need for complex logic around the LRU block cache, and no prioritized write queues. As a bonus, a crash before the filesystem was completely updated meant that on the next boot the journal could be replayed, eliminating the need for a scary fsck run.
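Why replay is simpler than prioritized queues can be shown with a toy model (not ext3's actual journal format): blocks are appended to the journal, a commit record seals the transaction, and on boot only committed transactions are applied to their home locations.

```rust
// Toy journal record: either a block destined for a home location,
// or a commit marker sealing everything appended since the last commit.
#[derive(Clone)]
enum Record {
    Block { home: usize, data: u8 },
    Commit,
}

// Replay after a crash: apply only fully committed transactions.
fn replay(journal: &[Record], disk: &mut [u8]) {
    let mut pending: Vec<(usize, u8)> = Vec::new();
    for rec in journal {
        match rec {
            Record::Block { home, data } => pending.push((*home, *data)),
            Record::Commit => {
                for (home, data) in pending.drain(..) {
                    disk[home] = data; // committed: safe to apply
                }
            }
        }
    }
    // Anything left in `pending` was never committed and is discarded.
}

fn main() {
    let mut disk = [0u8; 8];
    let journal = vec![
        Record::Block { home: 2, data: 7 },
        Record::Commit,                     // transaction 1 committed
        Record::Block { home: 5, data: 9 }, // crash before commit
    ];
    replay(&journal, &mut disk);
    assert_eq!(disk[2], 7); // committed write survives
    assert_eq!(disk[5], 0); // uncommitted write is simply dropped
    println!("replay ok");
}
```

The crash-safety reasoning collapses to one rule: a transaction either replays in full or not at all.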
Choosing ext2 had seemed straightforward when all I needed was read support. It wasn't until I got deep into write ordering and async performance that I discovered how much that choice had left unresolved. The agent hadn’t found any of this.
The Thread Model
Design exploration before coding can surface decisions you didn't know you faced. It can't protect you from assumptions that only become apparent when a different language runtime uncovers demands your design can't satisfy.
I had already implemented kernel threads and made the process model a special case of one of those threads. This was fine for a simple statically linked Rust executable, which has no threading needs by default. Then I started thinking about whether I could run a Go executable too, since Go binaries are statically linked by default.
Writing and building a RISC-V Go executable on my macOS host was even simpler than the Rust example. Using the same technique I had used to determine which syscalls my Rust program needed, I fired off the Go executable and let it run until it hit an unimplemented syscall. Eventually it reached a clone() call, which makes sense: the Go runtime creates threads by default.
I examined the semantics of the clone syscall and what I would need to do to implement it given my thread and process model. It was then that I realized my decision to store context in both my thread structure and process structure was incorrect for supporting a multithreaded process.
Multiple threads in a process each have their own execution context while sharing process-wide resources like the virtual memory map, open files, and signal handlers. It became obvious that the process structure should hold only information about those shared resources. All of the execution context belongs in the thread structure.
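In struct form, the split looks roughly like this. The field and type names are hypothetical, not the NanoOS definitions; the point is what lives where.

```rust
// Per-thread execution context: program counter, stack pointer,
// and the general-purpose registers saved on a trap.
struct TrapFrame {
    pc: u64,
    sp: u64,
    regs: [u64; 31],
}

struct Thread {
    tid: u64,
    frame: TrapFrame, // execution context lives here, one per thread
}

struct Process {
    pid: u64,
    // Shared by every thread in the process: stand-ins for the page
    // table root, open file descriptors, and signal handler table.
    page_table_root: u64,
    open_files: Vec<u64>,
    threads: Vec<Thread>,
}

fn main() {
    // clone()-style thread creation: a fresh TrapFrame with its own
    // stack, while everything process-wide is shared.
    let mut p = Process {
        pid: 1,
        page_table_root: 0x8000_0000,
        open_files: vec![0, 1, 2],
        threads: vec![],
    };
    p.threads.push(Thread { tid: 1, frame: TrapFrame { pc: 0x1000, sp: 0x7fff_0000, regs: [0; 31] } });
    p.threads.push(Thread { tid: 2, frame: TrapFrame { pc: 0x1000, sp: 0x7ffe_0000, regs: [0; 31] } });
    assert_ne!(p.threads[0].frame.sp, p.threads[1].frame.sp); // private stacks
    println!("two threads share pid {}", p.pid);
}
```

With context folded into both structures, as I originally had it, a second thread has nowhere coherent to put its registers; this layout makes clone() a matter of allocating one new Thread.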
The agent had built exactly what I originally specified, and that code was correct as far as it went. But limiting assumptions had made their way into the implementation, and it took the Go runtime to make them visible.
My Process Today
The two examples haven't changed my opinion on whether I should use a coding agent. They have changed how I examine the problem space before having the agent write code. Here’s my process now for making the important decisions about a problem I’m working to solve:
- Identify the architectural challenge and the constraints of the system I’m working on.
- Enumerate possible approaches. I use an agent to help here because it has broad knowledge of how similar problems have been solved.
- Compare approaches against my constraints and pick the most promising.
- The key step: before committing, work with the agent to stress-test my choice. What are the hardest operations the full feature set will require? What assumptions have I baked in that may need to be challenged? What might I regret when I get to the more demanding cases?
- Capture the architecture in a brief. Document the challenge, constraints, chosen approach, and the decisions that have been made. The agent will work from this as its specification.
- Work with the agent to build a phased implementation plan from the brief. Read it critically. What has it assumed? What has it left unaddressed? Look for the steps in the plan that appear to be the most challenging. Question the agent one more time to make sure that it’s clear what and how those difficult steps will be implemented.
- Only then start implementing, working through the plan incrementally.
Ultimately I need to remember that the agent is a mirror. It reflects the decisions and instructions that are given to it. Looking at what the agent has built or plans to build can be a reliable way to discover what I haven’t figured out yet.
The good news for NanoOS is that I discovered these problems before painting myself into too much of a corner. It even turns out that the on-disk format of ext3 is the same as ext2 so all of the code I have to read it stays the same. The ext3 implementation and the thread refactor are still ahead of me. The briefs for them are written.