I Let an AI Help Me Build an AI System. It Fought Me the Entire Way.

At some point during this build, I realized something slightly absurd:

I was using an AI coding agent to debug an AI system that was failing because of another AI service while deploying through infrastructure that deletes itself for fun.

This is not satire. This is modern engineering.

This is what actually happened building the GRIZL AI system, what worked, what didn’t, and why the next phase is going to look less like a chatbot and more like a small army.

The Setup: Copilot Agent + Codespaces + Controlled Chaos

I didn’t build this the “traditional” way.

This wasn’t:

  • local dev

  • manual commits

  • careful iteration

This was:

  • GitHub Copilot Agent writing code, fixing issues, opening PRs

  • Codespaces acting as a disposable dev environment

  • Me acting more like a product owner / reviewer than a line-by-line coder

In theory, this is the dream:

“Describe what you want, agent builds it.”

In reality, it’s more like:

“Describe what you want, agent builds 80% of it perfectly and 20% of it like it has never seen a computer before.”

Still… that 80% is dangerous in a good way.

What I Actually Built

At a high level, GRIZL now has:

1. A Grounded AI Chat System

  • Azure OpenAI (chat + embeddings)

  • Azure AI Search (vector + semantic + keyword fallback)

  • Structured knowledge base (chunked, indexed, embedded)


This part is working for real now. Not demo-working. Production-working.

2. A Fully Instrumented Retrieval Layer

We went from:

“something broke”

to:

endpoint=grizl-search.search.windows.net semantic=true vector=true fallback=false


Which is the difference between:

  • guessing

  • and knowing exactly which piece failed
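For the curious: that log line doesn’t require anything fancy. A minimal sketch of how it could be produced — the function and field names here are mine, illustrative, not GRIZL’s actual code:

```python
def format_retrieval_telemetry(endpoint: str, semantic: bool,
                               vector: bool, fallback: bool) -> str:
    """Render retrieval status as one key=value log line.

    Illustrative sketch: field names are assumptions, not GRIZL's schema.
    """
    fields = {"endpoint": endpoint, "semantic": semantic,
              "vector": vector, "fallback": fallback}
    return " ".join(
        f"{key}={str(value).lower() if isinstance(value, bool) else value}"
        for key, value in fields.items()
    )

# One line tells you exactly which retrieval tier actually ran.
line = format_retrieval_telemetry(
    "grizl-search.search.windows.net",
    semantic=True, vector=True, fallback=False,
)
# → endpoint=grizl-search.search.windows.net semantic=true vector=true fallback=false
```

Emit this on every request and “something broke” becomes a grep.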

3. A Frontend That Actually Feels Like a Product

  • Chat UI with memory feel

  • Source attribution (now being cleaned up so it’s not fake links to nowhere)

  • Error handling that doesn’t scream “developer preview”

It finally looks like something you’d show someone without apologizing first.

What Worked Better Than Expected

Copilot Agent Is Actually Viable (With Supervision)

I’ll say it straight:

Copilot Agent (using Claude Sonnet 4.6) + Issues → PR → Review loop is fast

Like… uncomfortably fast.

Things it handled well:

  • wiring middleware

  • adding telemetry

  • refactoring constants (VECTOR_FIELD_NAME, etc.)

  • test scaffolding

  • CI/CD wiring

Where it struggled:

  • subtle infra bugs

  • Azure-specific quirks

  • anything involving “this service talks to that service under these conditions”

So the pattern becomes:

Agent builds. You validate reality.

Not optional.

Codespaces Removed All Friction

No “it works on my machine.”

No environment drift.

No local setup hell.

Just:

open → code → run → destroy → repeat

It pairs really well with agent-based workflows because everything is disposable.

Hybrid Search + Embeddings = Real Answers

Once everything aligned:

  • embeddings matched dimensions (1536)

  • index schema correct

  • semantic config actually real (not theoretical)

  • vector profile properly wired

The system stopped guessing and started answering.

That moment where:

fallback=false

…felt like hitting a checkpoint in a boss fight.
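A sketch of what “properly wired” looks like in request terms — assuming the Azure AI Search 2023-11-01 REST body shape, and with illustrative field and config names (`contentVector`, `grizl-semantic`) that will differ in your index:

```python
EXPECTED_DIMENSIONS = 1536  # must match the index's vector field definition

def build_hybrid_query_body(query: str, embedding: list[float],
                            vector_field: str = "contentVector",
                            semantic_config: str = "grizl-semantic") -> dict:
    """Build a hybrid (keyword + vector + semantic) search request body.

    Illustrative sketch: field/config names are assumptions, not GRIZL's.
    """
    if len(embedding) != EXPECTED_DIMENSIONS:
        # Fail loudly here: a dimension mismatch otherwise surfaces later
        # as an opaque error from the service.
        raise ValueError(
            f"embedding has {len(embedding)} dims, "
            f"index expects {EXPECTED_DIMENSIONS}"
        )
    return {
        "search": query,                    # keyword leg
        "vectorQueries": [{                 # vector leg
            "kind": "vector",
            "vector": embedding,
            "k": 5,
            "fields": vector_field,
        }],
        "queryType": "semantic",            # semantic reranking on top
        "semanticConfiguration": semantic_config,
    }
```

Every one of the breakages above maps to one key in that body being wrong, missing, or pointed at a name the index doesn’t know.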

What Did Not Work (and Tried to Ruin My Life)

Azure Deployments Are Not Your Friend

If a variable is not explicitly defined in Bicep or the pipeline:

It will disappear.

Not degrade. Not warn. Not log helpfully.

Disappear.

I have now set the same environment variables more times than I care to admit.

“Resource Not Found” Is a Lifestyle

At one point the system was trying to query:

grizl-openai.openai.azure.com/indexes/...

Which is not where search indexes live.

That wasn’t a bug in logic.

That was one miswired config.

And suddenly your entire system is confidently asking the wrong service for the right thing.

Vector + Semantic Setup Is Not Plug-and-Play

Things that broke, in no particular order:

  • API version mismatch

  • vector field not recognized

  • semantic config invalid

  • embedding deployment missing

  • field naming mismatches

  • index created but not actually usable

Every time you fix one thing, another thing steps forward like:

“hello, I am the real problem”

Silent Fallbacks Will Trick You

The system looked like it worked long before it actually did.

Because it was quietly doing:

  • semantic → fail → fallback

  • vector → fail → fallback

  • everything → keyword search

And still returning answers.

That is dangerous.

Because you don’t notice you’re running on the weakest version of your system.
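The fix is making fallback a first-class, recorded event instead of a silent one. A sketch of the pattern — the stub search functions here stand in for the real semantic/vector/keyword tiers:

```python
def search_with_fallback(query: str, tiers: list[tuple]) -> tuple:
    """Try each (name, search_fn) tier in order.

    Returns (results, tier_used, fell_back) so a downgrade is always
    visible in telemetry — never silent.
    """
    for position, (name, search_fn) in enumerate(tiers):
        try:
            results = search_fn(query)
            # fell_back=True means a better tier above us failed
            return results, name, position > 0
        except Exception:
            continue
    raise RuntimeError("all retrieval tiers failed")

# Stubbed usage: semantic fails, vector answers — and the downgrade is recorded.
def broken(q): raise ConnectionError("semantic config invalid")
def works(q): return [f"hit for {q}"]

results, tier, fell_back = search_with_fallback(
    "pricing", [("semantic", broken), ("vector", works)]
)
# tier == "vector", fell_back == True
```

Log `tier` and `fell_back` on every request and the weakest version of your system can never masquerade as the strongest.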

The Part That Actually Matters

This project stopped being “a chatbot” pretty quickly.

What we’re really building is:

A System That Knows Things

  • retrieves structured knowledge

  • ranks it

  • explains it

  • cites it

That’s step one.

Where This Is Going Next (The Fun Part)

This is where it gets interesting.

1. GRIZL Multi-Agent System

Instead of one chat brain, we move to:

  • Retrieval Agent (search + ranking)

  • Reasoning Agent (answer construction)

  • Personality Agent (voice, tone, character)

  • Action Agent (commands, automation)

Each one focused. Coordinated. Replaceable.

Less “one smart model.” More “team of specialized operators.”
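The coordination layer can start embarrassingly simple: a shared context passed through focused agents, with a trace of who touched it. A sketch with stub agents — the real ones would call models and search, these just show the shape:

```python
from typing import Callable

Agent = Callable[[dict], dict]  # each agent reads and enriches a shared context

def run_pipeline(context: dict, agents: list[tuple[str, Agent]]) -> dict:
    """Run focused agents in sequence; each is independently replaceable."""
    for name, agent in agents:
        context = agent(context)
        context.setdefault("trace", []).append(name)  # audit trail of who ran
    return context

# Stub agents marking the shape; real ones would hit models and search.
retrieval   = lambda ctx: {**ctx, "docs": ["doc1"]}
reasoning   = lambda ctx: {**ctx, "draft": f"answer from {ctx['docs']}"}
personality = lambda ctx: {**ctx, "answer": ctx["draft"].upper()}

out = run_pipeline(
    {"question": "what is GRIZL?"},
    [("retrieval", retrieval), ("reasoning", reasoning), ("personality", personality)],
)
```

Swapping a tier means swapping one entry in that list. The trace doubles as the telemetry story from earlier, extended to agents.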

2. Self-Learning Feedback Loop

Right now:

  • user asks → system answers

Next:

  • user asks → system answers → system evaluates itself

We start capturing:

  • which answers worked

  • which didn’t

  • what sources were useful

Then feeding that back into:

  • ranking

  • prompt construction

  • knowledge weighting

Not full autonomy. Controlled learning.
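The “controlled” part matters: feedback should nudge ranking weights inside hard bounds, not rewrite them wholesale. A minimal sketch — the step size and bounds are illustrative choices, not tuned values:

```python
def update_source_weights(weights: dict, source_id: str, helpful: bool,
                          step: float = 0.1,
                          floor: float = 0.1, ceiling: float = 2.0) -> dict:
    """Nudge a source's ranking weight from thumbs-up/down feedback.

    Bounded and incremental: controlled learning, not autonomy.
    """
    current = weights.get(source_id, 1.0)  # every source starts neutral
    adjusted = current + step if helpful else current - step
    weights[source_id] = min(ceiling, max(floor, adjusted))
    return weights

w = update_source_weights({}, "pricing-faq", helpful=True)
# "pricing-faq" nudged slightly above neutral
```

The floor and ceiling are the safety rails: no amount of feedback can zero a source out or let one dominate.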

3. Ticketing + Memory Layer

This is a big one.

Turning chat into:

  • persistent conversations

  • issue tracking

  • user context memory

Think:

“AI support system that actually remembers what you asked yesterday”

Instead of:

“stateless goldfish with confidence”
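The memory layer doesn’t have to start complicated. A minimal in-memory sketch — production would persist this to a database and handle retention, but the interface is the point:

```python
import time
from collections import defaultdict

class ConversationMemory:
    """Per-user conversation store (illustrative; real version persists)."""

    def __init__(self):
        self._history = defaultdict(list)

    def remember(self, user_id: str, role: str, text: str) -> None:
        """Append one turn to a user's history with a timestamp."""
        self._history[user_id].append(
            {"role": role, "text": text, "ts": time.time()}
        )

    def recall(self, user_id: str, last_n: int = 10) -> list[dict]:
        """Return recent turns so today's answer can reference
        yesterday's question."""
        return self._history[user_id][-last_n:]

memory = ConversationMemory()
memory.remember("user-42", "user", "How do I reset my key?")
memory.remember("user-42", "assistant", "Go to settings and regenerate it.")
```

Feed `recall()` into prompt construction and the goldfish grows a hippocampus.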

4. Real Source System (Not Decorative Links)

Right now sources exist.

Next:

  • clickable

  • mapped to real pages

  • possibly inline highlights

Because “sources” that go nowhere are just UI cosplay.

5. Performance + Perception

We already hacked perceived speed with streaming.

Next:

  • caching

  • faster retrieval

  • precomputed answers for common queries

Goal:

feels instant even when it isn’t
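Precomputed answers for common queries can start as a normalized in-memory cache. A sketch — production would want Redis, TTLs, and invalidation, and the normalization rules here are deliberately crude:

```python
def normalize(query: str) -> str:
    """Collapse trivial variations so 'What is GRIZL?' and ' what is grizl '
    share one cache entry."""
    return " ".join(query.lower().split()).rstrip("?!.")

class AnswerCache:
    """Tiny in-memory answer cache (illustrative; real version uses
    Redis with TTLs and invalidation)."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def get_or_compute(self, query: str, compute):
        key = normalize(query)
        if key in self._store:
            self.hits += 1          # cache hit: no model, no search, no wait
            return self._store[key]
        answer = self._store[key] = compute(query)
        return answer

cache = AnswerCache()
first = cache.get_or_compute("What is GRIZL?", lambda q: "a grounded AI system")
second = cache.get_or_compute("  what is grizl ", lambda q: "recomputed")
# second == first — the second call never touched the backend
```

For the head of the query distribution, that’s the difference between “streaming quickly” and actually instant.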

The Real Takeaway

The hardest part of this wasn’t AI.

It was alignment.

  • aligning services

  • aligning schemas

  • aligning environments

  • aligning expectations with reality

AI didn’t fail.

Infrastructure did. Configuration did. Assumptions did.

And maybe the most important realization:

The future of building AI systems isn’t just writing code. It’s orchestrating systems and supervising other AIs doing the same thing.

Copilot Agent didn’t replace me.

It made me faster.

But it also made it very clear:

You still need someone in the loop who knows when something is technically working but actually wrong

If you’re building in this space right now and it feels chaotic:

Good.

That means you’re not just using the tools.

You’re actually pushing them.

— Jeremiah Williams

Still debugging
Still shipping
