Four AI Agents—One Server Provisioning Project

AI beyond the demos

2026-06-08

It was that time of the decade again: my venerable Amazon Linux 2 (EL7) servers were now several years out of support, and internet-exposed servers do not age gently. I would have to update my server provisioning project to support EL10 and to add OVHcloud as a new provider. It was going to be a deep rabbit hole, so I enlisted the help of AI agents. Not just one, but four of them: Anthropic’s Claude Code, OpenAI’s Codex, Google’s Gemini CLI, and Mistral’s Vibe.

I was not ready for what I found.

Mind Blown

I started with low expectations. My project is complex: it uses Bash scripts and Makefiles; it has over 20 different sets of service-specific configuration files; and it relies on custom macros in those files, custom Unix permissions, custom YAML “blueprint” and XML “account” configuration files, and more. If it gets something wrong, it can break SSH access, Apache, MySQL, certificates, Unix permissions, or live traffic. It’s not your typical app or website. Would an AI agent understand this?

I quickly stumbled onto my first head-scratcher and decided to ask Claude for help. I told it that, if it needed to debug the issue, it could reach the server by running ssh myhost from my local machine. Lo and behold, it did just that.

It ran remote commands on the server (both a VM during development and, later, a live server in production), and it understood the state of those servers as the outcome of running my code. It understood how that state mapped to my code and which code fixes were needed to address the current server configuration, rather than just fixing the server itself.

That distinction is the key. It did not behave like a remote shell operator. It behaved like an engineer debugging a deployment system. The server was evidence; the repository was the source of truth that needed correction.

You could call this AI’s “mic drop” moment for me. I was mind-blown and hooked.

It Feels Like a Partner

Working with these agents felt like having a human partner, especially because I treated them as such. I treated them as experts with no prior experience on my project. I could give them a general overview (the AGENTS.md file) and high-level context for each request, and they would act reasonably well, with the same caveats you’d place on such a human engineer.

One concrete example: my project copies pre-prepared service configuration files to the proper locations on the server (e.g., /etc/httpd/conf/httpd.conf). Most of the content is the same across all servers (e.g., KeepAlive On), but some specifics need to be replaced in those files (domain names, assigned IP addresses, etc.), so I use custom macros for those, as in ServerName {{DOMAIN}}. I also have conditional macros, such as {{WITH_IPV6 { … }}}.
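To make the macro idea concrete, here is a minimal sketch of how plain {{NAME}} macros could be expanded in Bash. It is an illustration under assumptions, not the project's actual implementation: the function name expand_macros, the NAME=value argument convention, and the sample values are all hypothetical, and conditional macros such as {{WITH_IPV6 { … }}} are deliberately left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the article's real code): expand {{NAME}} macros
# in a template read from stdin, using NAME=value pairs given as arguments.
expand_macros() {
    local text pair name value macro
    text=$(cat)                     # read the whole template
    for pair in "$@"; do
        name=${pair%%=*}            # text before the first '='
        value=${pair#*=}            # text after the first '='
        macro="{{${name}}}"         # e.g. {{DOMAIN}}
        # Quoting both sides keeps the macro and the value literal
        # (no glob matching, no special treatment of '&').
        text=${text//"$macro"/"$value"}
    done
    printf '%s\n' "$text"
}

# Usage with made-up values:
printf 'ServerName {{DOMAIN}}\nListen {{IP}}:80\n' \
    | expand_macros DOMAIN=example.com IP=192.0.2.10
# ServerName example.com
# Listen 192.0.2.10:80
```

Macros that are not passed as arguments are left untouched in the output, which makes a missing value easy to spot in the generated file.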

Even though I didn’t explain the pattern beforehand, the AI agents would see it and make assumptions, just like a human engineer would. Claude tried to nest these conditional macros, which my code didn’t actually support. Yes, it would be easy to ask it why it didn’t work, and it would get there, and it would be easy to ask it to improve my code to add support for macro nesting, but that’s not the point. The point is that it acts like a human professional, making some reasonable assumptions. It’s up to us to ensure our prompts are part of a conversation, not single requests that “should” have a perfect outcome.

Conversational Modes

As with human partners, my interactions with the agents had two main modes.

An everyday mode for simple iterations through fixes and small features. Think of it as a conversation, a dialogue with a smart partner who hasn’t memorized every corner of the project. I would describe the problem, get options, review and fine-tune gaps by replying (repeat these two steps as necessary), and then apply the consensus myself or ask the agent to do it. In my project, the AI agents surfaced approaches I had not previously considered.

A formal mode for bigger milestones, where some planning would be helpful. Think of it as sitting down with your partner at a table with a piece of paper to discuss and plan what the feature should include. I would start by brainstorming (“ask me questions to clarify”), then have the agent write a design spec, then write an implementation plan (optional), and finally implement it. In my project, eight specs were written (examples: fail2ban introduction, Loki/Promtail/Grafana observability stack introduction, SSH hardening). The spec kept complex work from drifting.

No special agent-specific frameworks or prompt engineering were required in either mode.

Smarter You, Smarter Me

If you treat the AI agent as a human expert you’re pair programming with, you “talk” to it even just for double-checking your immediate plans. As in any pair programming session, the final code is sometimes more yours, sometimes more your peer’s.

A couple of real examples.

In one case, I wanted Bash code that took $conf_firewall, a space-separated list of mail services to open on the firewall. Only recognized service names would be opened; unknown names would be discarded. If no services were given, all mail services should be open. All services should be closed on the firewall if the mail service was not being installed.

From my experience with “real” programming languages, I assumed a flow that required splitting $conf_firewall into tokens (strtok) and iterating over each token to match known services. Conceptually:

SUDO_FIREWALL_REMOVE( $service );
…
if( $with_mail )
    {
    foreach( strtok($conf_firewall ?: "service …") as $token )
        {
        switch( $token )
            {
            case "service":
                SUDO_FIREWALL_ADD( $token );
                break;
            …
            }
        }
    }

The agent came up with a much more elegant solution by taking advantage of Bash pattern matching:

mail_services=
[ $with_mail ]  &&  [ -n "$conf_firewall" ]  &&  mail_services=" $conf_firewall "
[ $with_mail ]  &&  [ -z "$conf_firewall" ]  &&  mail_services=" service … "
[[ $mail_services == *" service "* ]]  &&  SUDO_FIREWALL_ADD    service
[[ $mail_services != *" service "* ]]  &&  SUDO_FIREWALL_REMOVE service
…

This is slower than what I had in mind, but one doesn’t write in Bash for performance. 🙂
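For readers who want to try the pattern-matching idea, here is a self-contained sketch. It is a variant, not the code above: it loops over a list of known services instead of writing one pair of lines per service. The service names smtp and imaps, the function apply_mail_firewall, and the stub firewall functions (which only print) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Stubs standing in for the real firewall helpers, which the article does
# not show; they only print what they would do.
SUDO_FIREWALL_ADD()    { echo "open $1"; }
SUDO_FIREWALL_REMOVE() { echo "close $1"; }

# apply_mail_firewall WITH_MAIL CONF_FIREWALL
# "smtp" and "imaps" are hypothetical service names.
apply_mail_firewall() {
    local with_mail=$1 conf_firewall=$2
    local known_services="smtp imaps"
    local mail_services= service
    if [ -n "$with_mail" ]; then
        # Default to all known services, and pad with spaces so that each
        # name can be matched as the whole word " name ".
        mail_services=" ${conf_firewall:-$known_services} "
    fi
    for service in $known_services; do
        if [[ $mail_services == *" $service "* ]]; then
            SUDO_FIREWALL_ADD "$service"
        else
            SUDO_FIREWALL_REMOVE "$service"
        fi
    done
}

apply_mail_firewall yes "imaps bogus"   # close smtp, open imaps
apply_mail_firewall yes ""              # open smtp, open imaps
apply_mail_firewall ""  "smtp"          # close smtp, close imaps
```

Unknown names such as bogus are never acted on because only known services are ever tested against the padded string, which is the same reason the original snippet needs no explicit tokenizing.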

In another situation, my project needed to install VictoriaMetrics, and at the time, this meant downloading the service’s tarball rather than using the OS package manager. The agent had given me this code to retrieve the latest release:

VM_VER=$(curl -s https://api.github.com/repos/VictoriaMetrics/VictoriaMetrics/releases/latest \
    | grep '"tag_name"' \
    | head -1 \
    | sed 's/.*"tag_name": *"\([^"]*\)".*/\1/')
curl -L "https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/${VM_VER}/victoria-metrics-linux-${detected_arch_name}-${VM_VER}.tar.gz"

Initially, this worked, but a few days later, this code failed to retrieve the tarball. In discussion with the agent, we found that this was because VictoriaMetrics had just released a new Enterprise (paid) release, and the corresponding Community release was not out yet.

We went back and forth with a few ideas from the agent to retrieve the latest Community release, but they all kept failing, until I realized we could just get the latest tarball with a filename pattern that matched a Community release. The pragmatic fit for what I needed was:

# Keep only Community tarball URLs, then take the newest one.
# sort -V (version sort) orders v1.9.x before v1.10.x; a plain sort would not.
VM_LATEST_URL=$( curl -s "https://api.github.com/repos/VictoriaMetrics/VictoriaMetrics/releases?per_page=20" \
    | grep -E "\"browser_download_url\":.*/victoria-metrics-linux-${detected_arch_name}-v[0-9.]+[.]tar[.]gz" \
    | sort -V \
    | tail -1 \
    | sed 's/.*": "\(.*\)"/\1/' )
curl -L "${VM_LATEST_URL}"

The agent kept trying to identify the right Community version from the release metadata. I solved it by looking at the problem from the agent’s point of view: what observable difference could it use? Both Enterprise and Community releases were listed, but only the Community tarball matched the downloadable URL pattern I needed. The useful distinction was not the version number; it was the artifact URL.
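One detail worth checking when picking “the latest” URL by sorting: lexical order and version order can disagree once a version component gains a digit. The sketch below shows the difference on two invented URLs (the host, tool name, and version numbers are hypothetical). It assumes a sort that supports -V, as GNU coreutils does.

```shell
#!/usr/bin/env bash
# Two made-up asset URLs, shaped like release downloads; not real releases.
urls='https://example.invalid/download/v1.9.0/tool-linux-amd64-v1.9.0.tar.gz
https://example.invalid/download/v1.10.0/tool-linux-amd64-v1.10.0.tar.gz'

# Byte-wise lexical sort: "v1.9" sorts after "v1.10" because '9' > '1'.
latest_lexical=$(printf '%s\n' "$urls" | LC_ALL=C sort | tail -1)

# Version sort: compares the numeric components, so 10 > 9.
latest_version=$(printf '%s\n' "$urls" | sort -V | tail -1)

echo "lexical: $latest_lexical"   # ends up with the v1.9.0 URL
echo "version: $latest_version"   # ends up with the v1.10.0 URL
```

When all candidate versions have the same number of digits in each component, both orderings agree, which is why this kind of issue can stay hidden for a long time.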

You still need to stay on top of the generated code. The agent may be reasoning hard, but it can keep reasoning along the wrong axis.

Different Brains, Different Personalities

Each of my four AI agents had its own personality.

Claude was the slowest, even when just using the Sonnet model, but it felt organized and clever. The agent (Claude Code) is highly configurable and has a thriving ecosystem of plugins that I took advantage of (e.g., Superpowers). Without a doubt, that helped my impression of a “very organized” Claude.

Codex was like an eager developer trying to make a name for themselves, and would start making changes to my code when my prompt barely allowed for it (e.g., “I was expecting this outcome, but it’s not working. Why?”). It was quicker than Claude and went straight to mostly correct solutions.

Gemini was the one I had the hardest time evaluating. It didn’t feel up to par with Claude and Codex, but I was using its free tier while paying for the others. I’ll keep an eye on it.

Mistral was the quickest of the bunch, but it struggled with a few issues. As with any agent, it’s hard to predict which problems it will struggle with: it successfully solved what I perceived as hard problems but stumbled on simpler prompts, so it was a bit hit-or-miss. More than once, it also froze without responding. Given its European origins, I’ll continue to follow its growth.

Here is a concrete prompt I gave all of them on the Web:

I can use inet_protocols to make sure Postfix only uses IPv4 on outgoing/incoming connections. Is there a way to force it to use IPv4 on outgoing ONLY and any (IPv4, IPv6) protocol on incoming connections? I have issues with Spamhaus making checks against IPv6/64 and my server having an IPv6/128 address.

Claude gives you all the options, but it is complex and over-engineered: good for learning, bad for a quick fix.
ChatGPT/Codex goes straight into the proper solution I was aiming for, but lacks the additional context useful for learning.
Gemini is incorrect in its “modern and clean way” to solve the issue. Postfix’s own documentation explicitly says that smtp_address_preference = ipv4 does not solve this case. It is otherwise as complete as Claude. To be fair, Gemini states that on the Web it does not check documentation, but the agent would have.
Mistral is the fastest by far, pragmatic like Codex, but incorrect like Gemini.

All four agents provided solutions that would have at least partially fixed the issue.

I Could Trust Them, But I Still Set Boundaries

None of the agents misbehaved. I manually approved each SSH command and reviewed the final code changes at the end of each feature, and nothing out of the ordinary was done. Yes, agents can seem to be “escaping their sandboxes” at times, but that’s just because they creatively try to work around what could be misconfigurations. If I tell them about the sandbox and that I want it respected (“do not do X”), then they don’t attempt to escape it. In fact, they even tell me about the weaknesses in the sandbox!

Even so, I’ve been sandboxing them more – but that’s a topic for another article.

If You’re Just Beginning With AI

Start with a spec. For anything non-trivial, write down what you’re trying to build before writing any code. The AI agent will help you write the spec. That conversation is often where you discover the requirements you hadn’t thought through. And once the spec exists, it keeps the work grounded when things get complicated mid-implementation.

You still need to be the senior engineer. AI agents amplify your expertise; they don’t substitute for it. If you can’t evaluate whether the output is correct, you can’t safely use it on real infrastructure. The people who will get the most out of this are the people who already know enough to catch the mistakes.

No special frameworks required. I never learned any prompt engineering techniques. I just described what I was trying to do, provided context when needed, and reviewed what came back. That was enough. If you’re putting off trying this because you think you need to learn a new skill first, you probably don’t.

Use it for the hard thinking, not just the typing. The biggest wins weren’t autocomplete. In many cases where I described a problem, I got back an approach I hadn’t considered — one that turned out to be better.

The surprising part was that with enough context and supervision, these tools could participate in real engineering work: reading the repo, inspecting a live server, proposing options, making changes, and explaining the trade-offs. I would not delegate my infrastructure to them blindly. But as pair programmers, they changed the shape of the work.

Going Forward

Will I keep using AI in my projects? Absolutely yes. It feels like I have just scratched the surface of what it can do, but costs can grow rapidly. I want to try out local AI and see how it fares against the frontier models.

Will AI write 80% of my code? Depends on the project. Anything low-stakes can be vibe-coded, produced, and tested by AI, with multiple agents, OpenClaw, and the works. But I would not fully delegate the foundations of what can become a business to AI agents. Mistakes can compound, and assumptions that agents did not discuss with me can come back and hurt in the medium or long term.

And that is a topic for another article as well.