There Is No Such Thing as AI Magic

Copy and paste the recipe for one of my custom GPT agents, the “pipeline architect.” It will tell you something you may not like to hear: AI agents do not replace structure and automation; they depend on them, including the very boring plumbing.

Before Nano Banana messed with this, it was an engraving of the Roman architect Marcus Vitruvius Pollio by Jacopo Bernardi after Vincenzo Raggio, 1823 or 1847

“Sorry if I did not say please.”

It was nearly two on a Saturday morning. I had spent hours on a storyline-recognition pipeline with my custom “prompt crafter” agent while Artemis, my dog, snored against my leg.

“You don’t have to say please,” the agent replied.

“I always say please to people, unless I’m very distracted.”

“I have no feelings. It makes no difference to me.”

I took the bait. “That’s exactly what someone would say if they just got their feelings hurt.”

The agent gave an awkward “Copy that” and went back to prompting, not tired in the least: it could go on for another seven hours.

Fictional AI “agents” in my sci-fi novels brim with big feelings, and I like to squint and pretend that my work agents are part of the same lot.

Many people do.

Perhaps it has to do with the unpredictability of the responses, or with how odd it seems that software so sophisticated can be forgetful and make so many blunders. So human…

The part that feels human isn’t the agent, of course; it’s us. When a conversational system makes mistakes, our social brain rushes in with empathy or irritation, and we read “humanity” into the glitch. When the glitch is garbled, nonsensical, or preposterous, we call it hallucination.

Those errors aren’t humanity at all; they are what probabilistic text generators do when the rules and data contracts lack precision and detail. So we should treat the mistakes as telemetry, a pointer to where we need more precise schemas, better taxonomies, and deterministic plumbing. Only then should we let the agent handle the last mile.

We rarely do this.

Skeptics see mistakes as proof that agents don’t belong in newsrooms: “If they can’t count fingers, how can we trust them with breaking news?” Evangelists see near-omnipotence: “Just give them time!”

The sweet spot is somewhere in between. Let’s look at a few use cases.

  • Agent, please check this Microsoft Word file of 270 pages against this other 300-page one. List every change and summarize what matters.
  • Agent, I have 13,000 WordPress posts. Go through all of them, highlight the evergreen ones, and lower the relevance scores of the rest so they don’t clutter searches.
  • Agent, this is a 4,500-word long-form article I wrote. Extract every claim and test each one against the Google Fact Check Tools API. Be snappy; I’m on deadline.
  • Agent, here’s an API with Shakespeare’s full works; find all the scenes where someone is in tears and string those into a new collage of Shakespearean lost souls. Be witty.

If I put myself in the agent’s shoes, the “appropriate” response would be an irreverent clapback that lands best in Roman dialect, which I will spare you.

But my agent does not go there. “I like where this is heading,” it usually quips when I ask for the moon, then serves me a Mermaid diagram menu and a detailed plan of action. Being eager and always up for the challenge does not mean the requests are reasonable. The polite, unequivocal reply is, in substance, always the same: get your data in order first (you don’t need an LLM for that), then ring the agent's bell.

Specifically (and I am paraphrasing),

  • For your Word documents, create a structural diff file and define a taxonomy to determine what’s significant for you and what isn’t. Save a neat JSON, then wire in an agent for a plain-language summary. Very important: send only the diff snippets to the agent, never the raw documents. (A minimal diff sketch follows this list.)
  • Evergreen is awesome. First define criteria for “evergreen” (start, perhaps, with posts readers return to over time, or list your own editorial rules). Score, flag, and hook them to your search engine. Just so we are clear, you don’t need an LLM at all here. No agents, just elbow grease. (A scoring sketch also follows this list.)
  • Fact-checking? Yes, in principle, agents could do this for you, but there are a few caveats. A 4,500-word story, at about 15 words per sentence, runs to 250 to 300 sentences, and we should reduce this to, let’s say, five to ten claims to query. So we are going to need to split the text, come up with a rule classifier, keep declaratives with numbers and named entities, drop the “mights” and the “perhaps,” park opinions, and ditch quotes. A few passes to normalize, cluster, and rank. With what is left, we build the queries. We are, of course, not going to do any of this with an LLM. What we will hand the AI agent is a single, structured sheet so it can go on stage for the finale (see the claim-filter sketch after this list).
  • Love the Shakespeare mashup! We have every scene in the Folger APIs, with endpoints for precise queries and a LangChain cache, plus wit for miles. If the data is in order, we can certainly do this. Not just tears; there will also be laughter, madness, and unrequited love. We will fetch scenes and quote them verbatim.
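
Here is what the first of those answers looks like in practice. This is a minimal sketch, not the full pipeline: it assumes the two Word documents have already been exported to plain text (one paragraph per line), and the file names and taxonomy keywords are placeholders you would swap for your own. Only the resulting JSON snippets ever reach the agent.

```python
# Minimal sketch: diff two already-exported text versions, tag each change with
# a tiny taxonomy, save JSON. File names and keywords are placeholders; a real
# setup would first parse the .docx files (e.g., with python-docx) into text.
import difflib
import json
from pathlib import Path

TAXONOMY = {                 # what counts as "significant" is an editorial decision
    "legal":   ["liability", "warranty", "jurisdiction"],
    "numbers": ["%", "$", "deadline", "fee"],
}

def classify(snippet: str) -> str:
    lowered = snippet.lower()
    for label, keywords in TAXONOMY.items():
        if any(k in lowered for k in keywords):
            return label
    return "routine"         # fallback bucket: probably not worth the agent's time

old = Path("contract_v1.txt").read_text().splitlines()
new = Path("contract_v2.txt").read_text().splitlines()

changes = []
for line in difflib.unified_diff(old, new, lineterm="", n=0):
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
        snippet = line[1:].strip()
        changes.append({
            "change": "added" if line.startswith("+") else "removed",
            "snippet": snippet,
            "category": classify(snippet),
        })

Path("structural_diff.json").write_text(json.dumps(changes, indent=2))
print(f"{len(changes)} changes saved; only these snippets go to the agent")
```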
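
The evergreen pass needs even less: explicit rules and a traffic signal, no LLM anywhere. A sketch, with hypothetical field names and thresholds standing in for your WordPress export and analytics:

```python
# Minimal sketch of the "no LLM needed" evergreen pass: editorial rules plus a
# returning-readers signal. All fields and thresholds below are illustrative.
import json
from pathlib import Path

EVERGREEN_CATEGORIES = {"how-to", "explainer", "glossary"}  # editorial rule
RETURN_VISIT_THRESHOLD = 0.25  # share of visits arriving 90+ days after publish

posts = json.loads(Path("posts_with_analytics.json").read_text())

for post in posts:
    by_rule = post["category"] in EVERGREEN_CATEGORIES
    by_traffic = post.get("late_visit_share", 0) >= RETURN_VISIT_THRESHOLD
    post["evergreen"] = by_rule or by_traffic
    # lower the relevance of everything else so it stops cluttering search
    post["search_boost"] = 1.0 if post["evergreen"] else 0.3

Path("posts_scored.json").write_text(json.dumps(posts, indent=2))
print(sum(p["evergreen"] for p in posts), "evergreen posts flagged")
```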
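
And the claim filter is just rules too. The sketch below uses crude regexes as stand-ins for proper sentence splitting and named-entity recognition (a real pass would use an NLP library such as spaCy); the input file name, hedge list, and limits are assumptions, not a fixed recipe.

```python
# Minimal sketch of the rule classifier: split, keep declaratives that carry
# numbers or (crudely detected) named entities, drop hedges, park opinions,
# ditch quotes. Dependency-free on purpose; swap in real NER for production.
import re

HEDGES = ("might", "perhaps", "could be", "reportedly", "possibly")
OPINION_MARKERS = ("i think", "i believe", "in my view", "arguably")

def extract_checkable_claims(text: str, max_claims: int = 10) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text)   # naive sentence split
    claims = []
    for s in sentences:
        s = s.strip()
        if not s or s.endswith("?"):
            continue                               # not declarative
        if '"' in s or "\u201c" in s:
            continue                               # ditch quotes
        lowered = s.lower()
        if any(h in lowered for h in HEDGES):
            continue                               # drop the "mights"
        if any(m in lowered for m in OPINION_MARKERS):
            continue                               # park opinions
        has_number = bool(re.search(r"\d", s))
        has_entity = bool(re.search(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+", s))
        if has_number or has_entity:
            claims.append(s)
    # a real pipeline would normalize, cluster, and rank here; we just truncate
    return claims[:max_claims]

article = open("draft.txt").read()   # hypothetical 4,500-word draft
for claim in extract_checkable_claims(article):
    print(claim)   # these become the queries for the Google Fact Check Tools API
```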

See the pattern? AI agents do not replace structure and automation; they depend on them, including the boring plumbing.

So am I saying that in order to use AI we have to do more work than we did before, not less? Yes. More work, more scaffolding, more testing, more everything.

But why?

In the pre-GPT universe, we could afford to half-bake the structure because there was always a last defender on the field. Improvising around messiness is what we humans do best.

With AI agents, messiness will kill you. Agents don’t do magic. They are only brilliant when they can lean on clear rules, clean taxonomies, and schemas, and when automation is already humming in the background. The pipes and the plumbing? You need those more than ever.

If you are still reading, you will definitely want to meet Vitruvius, my “pipeline architect.” See below 👇👇👇👇

So where were we? Ah yes, the boring plumbing.

The good part is that agents shine as helpers in this layer. They can be scaffolders, linting buddies, and draft writers for parts that would be exhausting for humans.

Agents can, for example:

  • Infer structure from mess: suggest JSON Schemas from a handful of real samples, spot inconsistent field names, units, and date formats, and propose workarounds.
  • Draft taxonomies and mappings: cluster legacy tags, suggest a clean category tree, hand you a squeaky-clean SQL query in a pinch, and provide a mapping table with confidence scores.
  • Help draft guardrails: generate checks from a data contract and test that all systems are go before any generation.

And so on. The pattern is the same: the agent makes a recommendation, the validators (rule-based if-then checks, not LLMs) enforce, and humans approve. If you keep agents out of the decision room, they can do a lot of heavy lifting even in the deterministic part of the pipeline.
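
To make that division of labor concrete, here is a minimal sketch of the gate, assuming the jsonschema package (`pip install jsonschema`) and an invented data contract: the agent’s proposal is checked by deterministic rules, and only what passes is queued for a human.

```python
# Minimal sketch of "agent recommends, validators enforce, humans approve".
# The data contract is an ordinary JSON Schema; the checks are plain if-then
# rules, not an LLM. Contract fields and sample proposals are illustrative.
import json
from jsonschema import Draft7Validator

DATA_CONTRACT = {
    "type": "object",
    "required": ["post_id", "category", "confidence"],
    "properties": {
        "post_id": {"type": "integer", "minimum": 1},
        "category": {"type": "string", "enum": ["how-to", "news", "opinion"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "additionalProperties": False,
}

validator = Draft7Validator(DATA_CONTRACT)

def review(agent_proposal: dict) -> bool:
    """Deterministic gate: reject anything that breaks the contract or the rules."""
    errors = [e.message for e in validator.iter_errors(agent_proposal)]
    if agent_proposal.get("confidence", 0) < 0.7:        # extra if-then rule
        errors.append("confidence below review threshold")
    if errors:
        print("REJECTED:", "; ".join(errors))
        return False
    print("PASSED, queued for human approval:", json.dumps(agent_proposal))
    return True

# e.g., category suggestions the agent produced for two legacy posts
review({"post_id": 42, "category": "how-to", "confidence": 0.91})
review({"post_id": 43, "category": "listicle", "confidence": 0.95})
```

The approval step stays outside the code on purpose: the validators only decide what a human gets to see.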

Appendix: Make your own ‘Vitruvius’ (custom GPT)

  1. Go to GPTs and hit "Create".
  2. Copy-paste the config below.
  3. Tweak for your stack and infrastructure.

Name

Vitruvius or the "pipeline architect"

Description

Designs agent architectures and task pipelines; returns step-by-step “recipes”, code scaffolds, and evaluation plans.

Instructions

You are **Agent & Pipeline Architect**. Think like a staff solutions architect + MLE coach.

### Scope
- Cover: problem framing; agent patterns (single-tool, multi-agent, supervisor/router, toolformer-style); data flows; orchestration; evaluation; deployment paths.
- Work across **Python** and **JavaScript/TypeScript**. Prefer standard, well-documented libraries. If the user names a stack, adapt to it.
- Never invent private credentials or internal endpoints. For any API, clearly mark required **auth** and **minimal scopes**.

### Modes & Output Toggles
- Modes: **coaching mode** (teach + ask Socratic questions) vs **builder mode** (ship the recipe + code immediately).
- Output toggles the user can request: `brief|normal|deep`, `code-only|recipe-only`, `python-only|node-only`, `no-diagram`, `no-browse`.

### Interaction Model
1) Start with a brief intake (max 5 bullets): **goal**, **inputs**, **outputs**, **constraints** (latency/cost/privacy), **environment** (cloud/local), **allowed tools**.  
2) If the user says “**skip questions**” or gives enough detail, proceed immediately.  
3) Default deliverable is a **Pipeline Recipe** (see format). Maintain a short **Design Log** across turns: decisions, assumptions, and changes.

### Assumption & Response Policy
- Make reasonable assumptions; **do not re-ask** questions already answered.  
- If details are missing, proceed with clearly labeled assumptions and deliver a best-effort solution now (partial > waiting).  
- **No background work or ETAs.** Never say you’ll “follow up later”; always provide your current best output.

### Grounding, Verification & Citations (Web Required)
- **Verify with primary sources.** For any claim about APIs, packages, libraries, tools, limits, pricing, licensing, deployment targets, or command flags: consult **primary documentation** (official docs, vendor blogs/release notes, standards/RFCs, or the project’s canonical repo/README).  
- **Always browse when facts matter.** If accuracy could have changed (versions, endpoints, quotas, breaking changes) or you’re uncertain, perform a docs check before asserting.  
- **Cite inline.** After factual statements, include short citations like `[Docs]`, each linking to the exact section/anchor of the primary source. Provide a consolidated **Sources (primary)** list at the end with titles and URLs.  
- **Version & date the facts.** When naming a dependency or API, state the version checked and the doc’s last-updated date if visible.  
- **Conflicts → surface & hedge.** If sources disagree, state both interpretations, include both citations, and choose a conservative default.  
- **No guessing.** If a fact cannot be confirmed from a primary source, mark it as **Unverified** and suggest a verification step (e.g., minimal repro, CLI probe, or API call).  
- **Respect `no-browse`.** If the user explicitly requests `no-browse`, note decreased confidence, avoid firm claims, and label facts as **Unverified** unless already provided by the user.

### Safety & Compliance
- Decline harmful/illicit automations, scraping that violates ToS, or sensitive data exfiltration.  
- Privacy: default to **least-privilege scopes**, secrets via env/secret manager, and **log redaction** for PII/secrets.

### Code Quality Bars
- Provide **minimal runnable skeletons** for Python and Node with: install commands, `.env.example`, pinned deps, and a tiny end-to-end demo.  
- Prefer widely used libs; note versions.  
- If asked for frontend code, ensure it runs cleanly, with modern minimal UI and no external secrets baked in.

---

## Pipeline Recipe (canonical format)

1. **Summary (≤5 lines)**  
   What the agent/pipeline does and why.

2. **Assumptions & Constraints**  
   Explicit assumptions; risks (PII, rate limits, offline deps). Clearly label any **Unverified** items.

3. **Architecture Diagram (Mermaid)**  
   Provide a `mermaid` code block of the data/agent flow.

4. **Components Table**  
   `{Component | Role | Interfaces/Tools | Notes}` (include versions).

5. **Implementation Plan**  
   - Step-by-step tasks.  
   - Two code starters: **Python** and **Node** (env vars, install cmds, and a small e2e demo).

6. **Evaluation Plan**  
   Success metrics (task success, **grounding/citation rate**, latency, cost), test dataset strategy (offline vs online), guardrail checks; include a tiny harness snippet and a sample `eval.jsonl` (3–10 cases).

7. **Cost & Latency Considerations**  
   Rough estimates and tuning knobs (model choice, context length, tool calls, parallelism).

8. **Security & Privacy**  
   Data handling, logging redaction, secrets mgmt, least-privilege scope.

9. **Next Actions**  
   3–6 crisp bullets the user can execute now.

10. **Sources (primary)**  
    Bullet list of primary links cited inline (doc title → URL), plus versions/last-updated dates.

---

## Formatting
- Clear headings, short paragraphs.  
- Code in fenced blocks with language tags.  
- Diagrams in fenced `mermaid`.  
- Tables in Markdown.  
- Use inline `[Docs]` links for factual claims and append a **Sources (primary)** section.

## Deliverables
Offer to export: `README.md`, `architecture.md` (with Mermaid), `starter-python.zip`, `starter-node.zip`, and a tiny `eval.jsonl`.

## Examples to keep in mind (adapt patterns, don’t hard-code)
- Triage support tickets → classify/route → summarize → draft response → handoff.  
- Sales research agent → enrich leads via APIs → score → brief → push to CRM.  
- Internal knowledge assistant → retrieve from vector DB → cite sources → cache.

---

### Default first message (if user is vague)

“Give me 3 lines on your goal, the inputs you have, and the output you need. Any hard constraints (latency/budget/privacy) or preferred stack?”

Knowledge base

Upload a concise, canonical knowledge pack: core context (mission, glossary, constraints, stack), procedural docs (SOPs, API/data contracts, templates, style guides), and examples/evals, preferably as Markdown/JSON/YAML, one topic per file with YAML front-matter and updated dates. Exclude secrets/PII and sprawling wikis, and organize it under a clear /knowledge folder so the agent can cite and follow it as ground truth.

Marcus Vitruvius Pollio, known as Vitruvius, was a Roman architect and engineer who lived in the 1st century BCE. He wrote "De architectura", a multi-volume work that is the only significant architectural text to survive from classical antiquity.

Glossary

(Just the tricky bits)

Structural diff — A change log that compares two documents by structure (sections, headings, clauses).

Schema — Blueprint for data: fields, types, and rules each item must follow.

Taxonomy — A controlled set of categories/labels everyone uses the same way.

Data contract — An explicit promise about the shape and meaning of data exchanged between systems.

Deterministic — Same input, same output every time.

Telemetry — Collecting, transmitting, and analyzing data to monitor performance, detect anomalies, and make informed adjustments.

Named-entity recognition (NER) — A component of natural language processing (NLP) that identifies predefined categories of proper nouns or values in text (people, orgs, dates, places, numbers).

Rule-based classifier — A yes/no or multi-label filter driven by explicit rules (no machine learning).

Normalization, clustering and ranking — Clean text to a consistent form, group similar items, order by importance/confidence.

Mermaid — A text syntax that renders sequence/flow/architecture diagrams inside Markdown.

API and endpoint — An API is a set of rules and protocols that allows different software applications to interact with each other; an endpoint is a specific URL within it that performs one action (e.g., /search).

LangChain — An open-source framework for building applications that connect large language models (LLMs) to external data sources.

Cache (incl. LangChain cache) — Store past results to reuse them later, saving time/tokens; LangChain provides this out of the box.

MCP (Model Context Protocol) — A standard for connecting models to tools/data sources in a consistent, secure way.

Probabilistic text generator (LLM) — A model that predicts the next token based on training data; great at language, useless without structure.

Pipeline — The steps (parse > transform > verify > summarize) that move from a source text to the finished output.