Modular Journalism with AI Agents

Detecting User Effects and Mapping Information Needs in Automated News Pipelines

↩️ Haven't read part one? It’s [here].

Over the past few weeks, in June and July 2025, I picked up my modular journalism research from where I had left off a year earlier and set out to test whether a grammar of structured news could help reduce the risk of hallucination when using artificial intelligence to transform one kind of news artifact into another.

This was the original question we had at the start of this process, when four news organizations — Deutsche Welle, Maharat Foundation, Clwster, and Il Sole 24 Ore — set out to explore whether radically rethinking our traditional storytelling formats, namely the long-form article, could help engage underserved and broader audiences. That challenge was part of the 2021 edition of JournalismAI — still in the pre-GPT era.

I found that not only is it possible to build a pipeline where you feed something old in on one end and get something shiny, new, and different out the other — it’s actually a relatively trivial task. Frankenstein is real.

I’m not surprised. We'd worked out the theory long enough to know it would function. I just hadn’t taken the summer to connect the electrodes and let the machine run on the bench for a while.

Frankenstein, though, is often ugly and impulsive — and it takes a lot of TLC at every step of the process to stay true to the core principles and trust the grammar. The real surprise was discovering that automated journalism done with AI agents is painstaking — but a deeply gratifying adventure, and far more a journalist’s task than an engineer’s.

No — more than that: I’ve come to believe that journalism done with AI agents should simply be called journalism. I can’t think of a single scenario where that wouldn’t be a win for users, publishers, and journalists alike.

I said with AI agents, of course — not by AI agents.

Let’s walk through it step by step.

My experiments in automated journalism are not open-ended. They’re conducted within the confines of structured content. Or — for the ten of us who use that term — they extend the grammar of modular journalism.

Structured content means every part of a news artifact has both semantic relevance and a practical purpose for the user. That purpose might be answering a question someone is actually asking, or fulfilling an information need we can consistently detect and measure through user research.

By extension, any portion of a news artifact that doesn’t serve a purpose doesn’t belong in the structure.

This is not a limitation — it’s a choice. In a system where discourse is made up of actionable parts, content becomes malleable and mutable. If the goal is to engage the widest possible audience — or simply to engage any audience — it makes sense to offer a highly personalized format, tailored to the user’s time, place, and interests. Modularity provides the structural mechanics that make it possible to shape-shift content into new storytelling forms.

Modularity, what is it again? Is that when a text is split into a few chunks by catchy H2 headers?

No — that practice is a clever UX trick. It helps break up long blocks of text, lets them breathe, and allows users to scan quickly and stay focused before they scroll away. Sometimes those headers match the names of our modules — “Why is this important?”, “What do people say?” — and in some cases, they serve a real information need. Axios does this consistently. Newsweek too-ish. There are hints of it at Semafor. But in none of those cases is modularity part of the mission statement.

To an extent, that’s justified. True modularity isn’t presentational — and it’s hard to do. It’s the underlying design of information, which means you need a beefed-up content management system. One that stores content not by unique IDs and slugs, but by semantic nodes and the relationships between them. The unit of measure is paragraphs, not articles. Modularity changes how we produce content and how we deliver it to users.

Not to mention: URL management. And all the craziness with SEO. Oh my.

Also, journalists often dislike it.

Journalist : Modularity :: Actor : Aaron Sorkin script.

The structure sets the pace. The delivery is excruciatingly hard — and still, it needs soul.

So it usually stops at the prototype stage. To quote a trusted industry colleague — anonymously but authoritatively: "Even when A/B tests showed fairly dramatic engagement lifts (often 30%+), we were never able to get stuff into production."

Back to the workbench. Modularity is precise, deliberate, liquid, and fully machine-readable. It’s also fun — because it’s the users who get to decide what matters in journalistic products, not the journalists. So you can invert the inverted pyramid, if you want, and promote context and background as more relevant than updates and breaking news. 🤘

Ok, enough with the preamble, let's see the project.

The project

Let’s begin with the diagram below. The flow consists of five processing units operating independently, each with one or more AI agents (the blue circles — I call them precincts; I can’t help it). The pipeline connects directly to the modular journalism API via three main endpoints: user effects, information needs, and modular stories (shown in the diagram as green diamonds). Humans (the pink bubbles) intervene at three points in the chain. Aside from the final copy edit, they’re only called in when the system encounters a problem the agents can’t resolve on their own. The yellow shapes represent content at different stages.
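To make that wiring concrete, here is a minimal TypeScript sketch of the flow in the diagram. The unit names, which endpoints each unit consults, and the escalation flags are illustrative shorthand for this essay, not the actual codebase.

TYPESCRIPT
// Minimal sketch of the pipeline wiring. Endpoint slugs and unit names are
// illustrative; the real definitions live in the Modular Journalism API.
type Endpoint = "user-effects" | "information-needs" | "modular-stories";

interface ProcessingUnit {
  name: string;
  agents: string[];          // the "precincts"
  reads: Endpoint[];         // API endpoints the unit consults
  humanEscalation: boolean;  // can a human be pulled in at this stage?
}

const pipeline: ProcessingUnit[] = [
  { name: "Health check",       agents: ["1A", "1B"], reads: ["information-needs"],                    humanEscalation: true },
  { name: "Effects detection",  agents: ["2"],        reads: ["user-effects"],                         humanEscalation: false },
  { name: "Inventarium",        agents: ["3"],        reads: ["modular-stories", "information-needs"], humanEscalation: false },
  { name: "Reporter",           agents: ["4"],        reads: ["information-needs"],                    humanEscalation: true },
  { name: "Cautious generator", agents: ["5"],        reads: ["modular-stories"],                      humanEscalation: true },
];

// Content moves through the units in order; each unit appends its notes.
console.log(pipeline.map(u => u.name).join(" → "));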

  • There are countless ways to approach the engineering. I kept things as simple as possible: the entire pipeline runs on the Next.js frontend with server-side rendering, orchestrated with LangChain. The architecture allows for dynamic workflows while keeping things lightweight.
  • I tested the flow using both the OpenAI API and the Gemini API, choosing Gemini for the proof of concept mainly because of pricing differences. While Google’s tokens are more expensive in the long run, they’re free to use (within limits) during research and prototyping.
  • As for performance, the cheapest OpenAI model — GPT‑3.5 Turbo — struggled with complex, multi-step prompts. Trial and error often led to actual errors. That said, LangChain integrates more smoothly with OpenAI than with Gemini. I only briefly tested GPT‑4, but the results were consistently strong. I had set a $10 token budget for the run; it came out to $11.70 with taxes — enough to validate the workflow. I bookmarked OpenAI for future testing, switched to Gemini, and didn’t look back.
  • One of the nice things about LangChain is that it makes it easy to mix and match models within the same pipeline, so you can pick the best one for each step — assuming you’ve got tokens to spend (see the sketch after this list).
  • For the backend, I’m using Strapi Community Edition with a PostgreSQL database, which also powers the modular journalism API the pipeline connects to. While the structure for storing “atoms of content” for the generative phase is already set up in React, the upstream connection will remain on hold until the full flow is ready.
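On that mix-and-match point, here is a minimal sketch of what it can look like with the current @langchain/* packages: a cheap model for detection, a stronger one for drafting. The model names, the prompts, and the division of labour are placeholders, not the configuration I actually run.

TYPESCRIPT
// Sketch only: a cheap model for detection, a stronger one for generation.
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { ChatOpenAI } from "@langchain/openai";

const detector = new ChatGoogleGenerativeAI({ model: "gemini-1.5-flash", temperature: 0 });
const generator = new ChatOpenAI({ model: "gpt-4o", temperature: 0.2 });

const segment = "President Donald Trump just enacted a new wave of tariffs...";

// Step 1: classification with the cheaper model.
const detected = await detector.invoke(
  `Which user information need does this text answer? Reply with a slug only:\n${segment}`,
);

// Step 2: drafting with the stronger model, only if step 1 validated the input.
const draft = await generator.invoke(
  `Write one neutral sentence answering "${detected.content}" using only this text:\n${segment}`,
);
console.log(draft.content);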

Agent 1 — Structured News Health Check

Assessing modular functionality and information need coverage

The first processing units — Agent 1 (needs detection) and Agent 2 (effects detection) — review the source material to assess whether it’s ready to enter the pipeline.

The core criterion is modular functionality: identifying which parts are usable as self-contained content modules, and which are dysfunctional — either because they lack enough substance to answer a core user question or because they contain rhetorical effects that compromise reliability.

We treat news artifacts as existing on a spectrum. At one end is modular-first content — written with modularity in mind and structured to meet a range of anticipated user information needs. At the other lies propaganda and unethical journalism. Everything else falls somewhere in between. (For a deeper dive into the principles, see here.)

The pipeline’s initial task is to decide whether the source contains enough substance to address 13 core user questions:

  • What happened?
  • What are the key facts?
  • What is the data?
  • What do key people say?
  • Are there images / videos / audios about this?
  • Why is this important?
  • What has got us here?
  • What is the impact on my community?
  • Are any people particularly or disproportionately affected?
  • How can we fix it?
  • How can I contribute / help?
  • What happens if…?
  • What don’t we know?

These questions form the basis for six modular story types: “Update Me,” “Data & Facts,” “Solutions Approach,” “Community Impact,” “Inspire Me,” and “Help Me Understand.”

The list of user information needs is extensible to support additional story types. This flexibility allows us to test how small changes in modular inputs shift emphasis across the broader motivations in Dmitry Shishkin’s User Needs model.
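As a concrete example of what "extensible" means here, this is roughly how I picture a story type: a named bundle of required (and optional) user information needs. Apart from what-happened, which appears in the training example later in this piece, the slugs below are invented placeholders; the canonical list comes from the API, not from code.

TYPESCRIPT
// Illustration of a story type as a bundle of user information needs.
// Only "what-happened" is a real slug from this essay; the rest are placeholders.
type NeedSlug = string;

interface StoryType {
  name: string;
  requiredNeeds: NeedSlug[];   // all must be covered by validated atoms
  optionalNeeds?: NeedSlug[];  // nice to have, not blocking
}

const updateMe: StoryType = {
  name: "Update Me",
  requiredNeeds: ["what-happened", "key-facts"],
  optionalNeeds: ["what-do-key-people-say", "why-is-this-important"],
};

// Adding a new story type is a data change, not a code change.
const solutionsApproach: StoryType = {
  name: "Solutions Approach",
  requiredNeeds: ["what-happened", "how-can-we-fix-it"],
};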

Source material, especially in the training phase, comes in all shapes and forms, so some ground rules were in order. To better guide the agents in their runs, we don’t simply accept the original split into paragraphs. Instead, we look for the underlying structure of the news artifact. For this, we use Teun A. van Dijk’s news schemata—though a more modern taxonomy may emerge from the Directory of Liquid Content by Sannuta Raghu as that project grows and evolves. Adopting the schemata was not only convenient at this stage—it also brought me back to the early days of the project, when it served as the compass for our (100% not artificial) thinking about modularity. Back then, we manually and painstakingly segmented stories into functional and dysfunctional units. Now, the schemata provide the agents with a structural map—a cleaner board to start their work—making it easier to see how different parts of the artifact fit together and to better detect both substance and bias.

[Figure: Agent 1]

At this stage, we are not concerned with how well organized an artifact is. The necessary information may be buried in a meandering long-form article, wrapped in vague or biased language, or scattered throughout. What matters is whether the core information exists.

The processing unit has two agents:

Agent 1A’s main task is deciding if a news artifact meets any user information needs from the modular API — and if so, how well. It scans the text and essentially plays Jeopardy: identifying significant portions of the text and describing them with a question:

"The Secretary General of NATO..." — Who is involved?

"President Donald Trump just enacted a new wave of tariffs..." — What happened?

Agent 1B parses the content again after it's been analyzed by Agent 2 and is the one who figures out what to do with it.

The goal is to decide whether the source is fit for purpose or needs more work. If there's a problem, a human is notified and the artifact is discarded. If only some information needs are met — or only partially met — the agent again consults with its human. Regardless, anything deemed useful will be extracted and stored.

Agent 1B is more opinionated. It asks:

  • Is the content, after removing the dysfunctional bits, factual and informative enough?
  • Does it misrepresent, exaggerate, or contain inappropriate claims that should be removed?
  • How’s the sourcing—is it missing, vague, anonymous, or rogue?

This is also the first firewall for fake, satirical, or irrelevant content. If someone tries to submit an article from The Onion, Agent 1B — using feedback from Agent 1A and Agent 2 — is there to finger-wag and block the way.

Each agent appends notes and observations directly to the content before it advances to the next stage, and everything gets saved in the database, so the process is transparent and issues are always traceable.

Working on prompts for each of the agents is a continuous, iterative process — two steps forward, one step back. The models are deeply involved in a kind of human-assisted self-improvement loop:

“Given this expectation, what would it take for you to catch this—or not fall into that trap?”

Journalistic expertise is slowly codified into layered instructions. AI agents and human journalists collaborate to create flexible, modular content. Imagine that.

Modular stories and user information needs are not hardcoded into prompts and instructions—they are dynamically pulled from endpoints in the Modular Journalism API. It’s worth noting that our current selection of core user information needs, and their combination into modular story types, is grounded in user research—but it’s not set in stone. Different user needs may emerge as relevant for underserved audiences, and entirely new story formats may prove more engaging under different circumstances.

Ultimately, the system is designed to determine user information needs on a per-user basis and generate bespoke coverage—without relying on a fixed editorial structure.

[Figure: Agent 2]

Once Agents 1A and 2 have done their passes, we can run the story readiness assessment. This step asks a simple question: given what we have now, could we publish a functional modular story of a given type?

The assessment cross-checks each story type’s required user information needs (coming directly from the Modular API) against the findings from Agent 1, subtracts any segments compromised by effects flagged by Agent 2, and then scores readiness accordingly.
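In pseudo-TypeScript, the readiness check is little more than set arithmetic; the shapes below, and the rule that a high-severity detection disqualifies a segment, are simplifications of what the agents actually negotiate.

TYPESCRIPT
// Simplified readiness score: required needs (from the Modular API) minus
// anything sitting in a segment compromised by Agent 2's detections.
interface Atom { needSlug: string; segmentId: number; }
interface Effect { segmentId: number; severity: "low" | "medium" | "high"; }

function readiness(required: string[], atoms: Atom[], effects: Effect[]) {
  const compromised = new Set(
    effects.filter(e => e.severity === "high").map(e => e.segmentId),
  );
  const usable = atoms.filter(a => !compromised.has(a.segmentId));
  const covered = new Set(usable.map(a => a.needSlug));
  const missing = required.filter(slug => !covered.has(slug));
  return {
    score: (required.length - missing.length) / required.length,
    missing,  // feeds the "needs more reporting" list for editors
  };
}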

The figure shows a training demo run. Here the agent is constrained by fixed user-need definitions and quoted evidence spans—the point isn’t creativity but conformance to the API’s criteria for, say, ‘key facts.’

[Figure]

The output is two-fold: a high-level readiness map for editors (showing which story types are green-lit, which need more reporting, and which are impossible from this source), and a granular list of atoms — the smallest, self-contained content units answering a single user question. Atoms are the building blocks for modular stories: they can be reused, rearranged, and remixed across formats while keeping their sourcing and context intact. Together, the readiness map and the atoms inventory give the next stages of the pipeline a clean menu of what’s available, what’s missing, and what’s suspect. Again, the figure below shows an output of Agent 1 that does not yet take into account the findings of Agent 2.

[Figure]

Triage → Needs→Stories map (expanded set). The Sankey below shows how the current set of user information needs detected by Agent 1 flows into available story types. Bands start at needs (left) and end at story types (right); band width reflects the number of validated atom-level matches, and color identifies the destination story type. The labels and counts are pulled from the Modular Journalism API, so new needs and the three additional story types introduced in this essay appear automatically.

Needs→Stories Sankey. Left: user information needs. Right: story types (including the three new templates added with this research). Width = number of matches; color = story type.

Two steps back

Ok, one lesson learnt. While a more granular set of user effects helps agents catch the nuances of loaded or manipulative rhetoric, expanding the user information needs only added confusion. I’ve cut the list from 116 to 26 — the ones covered by the story types we intend to generate. A smaller set lets us focus on the structure of each need and represent it consistently for the agent, so it can mirror that structure when matching requirements and preparing the generative phase.

The structured example below shows what the agent is learning a user information need should look like.

JSON
[
  {
    "input_excerpt": "The Labour party has called for an investigation into Sir Geoffrey Cox...",
    "target_atoms": [
      {
        "need_slug": "what-happened",
        "vdj_role": "Lead",
        "status": "solid",
        "confidence": 0.84,
        "evidence_spans": [
          { "text": "The Labour party has called for an investigation into Sir Geoffrey Cox" }
        ],
        "entities": [
          { "id": "e1", "name": "Labour Party", "kind": "org" },
          { "id": "e2", "name": "Sir Geoffrey Cox", "kind": "person", "role": "MP" }
        ],
        "events": [
          { "action": "call for investigation", "actors": ["e1"], "targets": ["e2"] }
        ]
      }
    ]
  }
]

Agent 2 — Cleaning for Bias

Detecting user effects and rhetorical distortion

The task in the second processing unit is more linear. A single agent pulls from the User Effects endpoint of the Modular Journalism API and determines — paragraph by paragraph — whether any editorial effects are present in the artifact.

Agent 2 evaluates structural quality using semantic principles, making editorial judgments based on rhetorical patterns and cues, not ideology.

  • Is there a clear structure?
  • Is the tone neutral?
  • Is the language clear?
  • Are there basic grammar issues or typos?

Several layers have been added to the API to support these tasks: user effects are grouped into rhetorical patterns, where we collect lexical and structural cues to help detection, along with guards for exceptions (a sketch of this structure follows the examples below).

User effects are now expressed in two degrees of gravity, “intentions” and “flags”: red and yellow cards. For example, one of the patterns of Emotional Activation & Agitation can be present in the text as a warning:

  • "Sensitive details are included — public-interest rationale may need stating"

or as a more serious issue:

  • "I will report on identifying/graphic details lacking a compelling public-interest rationale"
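For the agent, each of those layers is just structured data. Here is a sketch of how I represent a user effect, with its pattern, cues, guards, and the two tiers of gravity; the field names are mine, and the canonical definitions live in the API.

TYPESCRIPT
// One user effect as the agent sees it: a rhetorical pattern, the cues that
// trigger it, guards for legitimate exceptions, and its tier of gravity.
// Field names are my shorthand, not the API schema.
interface UserEffect {
  pattern: string;                // e.g. "Emotional Activation & Agitation"
  cues: string[];                 // lexical and structural signals
  guards: string[];               // exceptions that should suppress a detection
  gravity: "flag" | "intention";  // yellow card vs. red card
  note: string;                   // human-readable message surfaced to editors
}

const yellowCard: UserEffect = {
  pattern: "Emotional Activation & Agitation",
  cues: ["graphic detail", "identifying detail"],
  guards: ["clearly stated public-interest rationale"],
  gravity: "flag",
  note: "Sensitive details are included — public-interest rationale may need stating",
};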

To see Agent 2 in action, have a look at the training log, which includes samples from the output.

To support this agent’s editorial judgment, I added a more granular taxonomy of user information needs and user effects. This expanded classification — you can see it here — breaks down rhetorical patterns, flags common distortions, and maps them to unmet information needs. It forms the foundation for more precise agent behavior: not just scoring structure, but diagnosing how and why a paragraph falls short of modular standards. Over time, the taxonomy has evolved into both a reference model and a training toolkit.
For a detailed look at rhetorical patterns and detection cues, see this page.

Even at POC depth, making structured logic a prerequisite for generation consistently produces higher-quality output than free-form prompting — because outputs are constrained by API-defined needs, quoted evidence, and sourcing.

The search for user effects has a strong influence on the pipeline. It introduces a heightened awareness of shifty journalistic practices and defines clarity in contrast to them.

The user effect razor is sharp — so sharp that even GPT-3.5 Turbo, which isn’t built for complex or chained tasks, typically falls in line and flags biased language effectively.

Agent 2 is also an extremely fascinating thing to watch. User effects are not detected based on expertise or editorial practice; they are rooted in logic and linguistics. We’re not asking the model to “spot propaganda” — we’re asking it to detect the rhetorical appearance of propaganda.

The model is judging form, not substance — and it's definitely not concerned with truth. Still, it will spot persuasion tactics and flag content that is not factual. It asks how something is said, not whether it's true. And this, it turns out, may very well be something a language model is better equipped to do than the average human.

So yes, only one agent is needed here. But this is also a crucial ethical checkpoint in the pipeline — one where I’d be curious to run parallel checks with OpenAI, Google, and Anthropic, just to see how each handles their own Minority Report duties.

(A brief but inevitable digression into content management idealism: these checks shouldn’t be limited to modular, automated journalism. Any content created for public consumption could benefit from passing through these filters — ideally, in real time, in the very moment we’re writing our reports.)

Here’s what the screenshot shows. Agent 2 scans each numbered segment and (for orientation only) tentatively tags it with van Dijk news schemata like Lead/Background/Context. We don’t adopt van Dijk as a framework; we use it to anchor spans in an inverted-pyramid-ish structure so Agents 1 and 2 analyze the same text slices. For each slice, Agent 2 either returns No effects or one or more detections with: the label, a brief rationale, the exact quoted evidence span, a severity tier (low/medium/high), and a confidence score. “effect” = positive detection; “flag” = softer alert for review.

In this run, Segments 2 and 6 are clean. 1 flags reader address (65%), 3 framing (“something changed,” 70%), 4 loaded language + selective evidence (75%/70%), 5 loaded characterization (85%), 7 opinion injection (90%), 8 generalization + loaded phrasing (80%/75%). These can be withheld or down-weighted before Agent 1 assembles answers. Because this sample is adapted from an analysis show, pre-tagging it as analysis/opinion would relax thresholds and downgrade some “red cards.”
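For reference, this is the per-segment shape I ask Agent 2 to return, matching the fields described above: label, rationale, quoted evidence, severity tier, confidence. The example mirrors segment 7 from that run; the type names and the rationale wording are mine.

TYPESCRIPT
// Per-segment output of Agent 2, as described above. Names are my shorthand.
interface Detection {
  label: string;               // e.g. "opinion injection"
  kind: "effect" | "flag";     // positive detection vs. softer alert
  rationale: string;
  evidenceSpan: string;        // exact quote from the segment
  severity: "low" | "medium" | "high";
  confidence: number;          // 0..1
}

interface SegmentResult {
  segmentId: number;
  schemaTag?: string;          // van Dijk tag (Lead, Background, Context), orientation only
  detections: Detection[];     // empty array = "No effects"
}

const segment7: SegmentResult = {
  segmentId: 7,
  schemaTag: "Context",
  detections: [{
    label: "opinion injection",
    kind: "effect",
    rationale: "The writer's judgment is presented as fact",
    evidenceSpan: "(exact quoted sentence from segment 7)",
    severity: "high",
    confidence: 0.9,
  }],
};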

[Figure]

Agent 3 — The inventarium

Processing unit number three has more in common with a post office than with detective work. The single agent in this precinct is tasked with maintaining an inventory of the content — verified and cleansed in the previous stages — and moving it along the pipeline.

It’s not a difficult job. The agent accesses the Modular Stories and User Information Needs endpoints in the Modular Journalism API and determines which story types have sufficient information to proceed, and which are incomplete or entirely lacking. Then, it saves them in the form of atoms, rich in relationships with thousands of other atoms. These atoms and their metadata are the LEGO bricks of modularity.

Do we have enough atoms for an Update Me story on topic abc? Move right along!

Data & Facts? Yes siree, Bob. Proceed to cashier number 11.

Community Impact? That one’s still a work in progress.

Full Context? Fuggedaboudit.
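Behind those one-liners sits a very small gate. Here is a sketch, with invented names, of the check Agent 3 runs against the Modular Stories and User Information Needs endpoints:

TYPESCRIPT
// Does the inventory hold enough validated atoms for this story type and topic?
interface AtomRecord { needSlug: string; topic: string; }

function storyGate(requiredNeeds: string[], atoms: AtomRecord[], topic: string) {
  const covered = new Set(
    atoms.filter(a => a.topic === topic).map(a => a.needSlug),
  );
  const missing = requiredNeeds.filter(n => !covered.has(n));
  return { ready: missing.length === 0, missing };
}

// "Update Me" on topic abc with an empty inventory? Fuggedaboudit.
console.log(storyGate(["what-happened", "key-facts"], [], "abc"));
// → { ready: false, missing: ["what-happened", "key-facts"] }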

It’s surprisingly hard to get a model to say it doesn’t know something. Perhaps because, while humans have endured three rollercoaster years of AI hype, LLMs have been out here know-it-all-ing 24/7 to ride the trillion-dollar bubble.

And so it requires generous prompting — and frequent, emphatic redirection — to get one to admit it doesn’t have a valid answer to a polite, clever, clearly worded question.

  • Do not introduce new claims, facts, or invented quotes.
  • Never invent a “more correct” name if you’re not sure.
  • If two options are ambiguous, do not silently pick one.

Agent 3 has to swallow its pride:

“Are any people particularly or disproportionately affected?”

Can’t answer this, mate. More research is required.

Only—this is one of those “knowing not to know” epiphany moments. Admitting you don’t know is exactly what unlocks the next phase of the pipeline: finding out.

Agent 4 — The reporter

The description of this next processing unit in the LangChain flow should begin with a reflection: the role of AI agents in journalism isn’t to blindly generate new content — it’s to detect absence, and to set the conditions for human intervention and (AI-assisted) research.

This is the kind of newsroom where engineers do the heavy lifting — building a backend architecture capable of handling a new type of deeply digital content, structured in a way that allows personalization for each user.

Agent 4 has the tools to query an archive of liquid content and answer hard questions like:

“Are any people particularly or disproportionately affected?”

And if it can’t find that answer?

It knows to escalate — to ask something precise of the one journalist in the newsroom who can fill that blank with a micro-assignment.

The interesting part? Human assistance doesn’t need to arrive as a full story. It can be a note, a snippet, a half-thought, a link, a Slack message — the same way two colleagues bounce context between them in the middle of reporting.
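In code, that escalation is not much more than a well-formed ticket. A sketch, assuming a hypothetical notifyJournalist hook (a webhook, an email, a CMS task, whatever the newsroom prefers):

TYPESCRIPT
// A micro-assignment: the unmet need, what the pipeline already knows, and
// the loose formats a human answer may arrive in. All names are illustrative.
interface MicroAssignment {
  question: string;           // the unmet user information need
  topic: string;
  knownContext: string[];     // atoms already validated for this topic
  acceptedFormats: string[];  // "note", "snippet", "link", "Slack message"
}

declare function notifyJournalist(a: MicroAssignment): Promise<void>;

async function escalate(question: string, topic: string, knownContext: string[]) {
  await notifyJournalist({
    question,
    topic,
    knownContext,
    acceptedFormats: ["note", "snippet", "link", "Slack message"],
  });
}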

All of this is only possible through deeply digital organizational and process thinking — where high tech is explicitly tasked with supporting high-value staff to ensure high-value outcomes, in all forms of content.

The tech, frankly, is already high enough. What’s missing is trust — and that trust isn’t earned through hype, but through process. Through a shared understanding that AI is a toolkit, not a warlock.

And in this particular niche, AI has probably already earned some trust — because spotting what’s missing in a flow of information is something agents can be ridiculously good at. Especially when we’re working with, what was it again?

Ah, yes — structures for news: precise, deliberate, liquid, and fully machine-readable.

Agent 5 — The cautious generator

If Agent 4 is the one who knows what’s missing, Agent 5 is the one who refuses to fake it. This final generative unit is where modular story assembly occurs — but only within strict guardrails. The agent receives pre-cleared building blocks: validated modules from upstream, each tagged to one or more story types. Its job is to reconstruct these into finished stories, following a fixed recipe for each format — Update Me, Community Impact, Data & Facts, and so on.

But it operates with a clear philosophy: don’t make things up. Generation here is cautious by design. No ‘creative filling in.’ Ambiguities stay ambiguous; gaps are labeled as gaps. The model is told in no uncertain terms: if something’s missing, say so — and move on.
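In practice that philosophy is injected as a short, blunt set of rules at the top of Agent 5's prompt. The wording below is a paraphrase for illustration, not the production prompt:

TYPESCRIPT
// Guardrails prepended to every generation request. Paraphrased for illustration.
const GENERATION_RULES = [
  "Use only the validated atoms provided; quote evidence spans verbatim.",
  "Do not introduce new claims, facts, or invented quotes.",
  "If a required module is missing, state the gap explicitly and move on.",
  "If two readings are ambiguous, keep the ambiguity and say so.",
];

function buildSystemPrompt(storyType: string): string {
  return [
    `You assemble a "${storyType}" story from pre-cleared modules.`,
    ...GENERATION_RULES.map(rule => `- ${rule}`),
  ].join("\n");
}

console.log(buildSystemPrompt("Update Me"));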

The real action, however, happened long before the curtain — in the training of the agents. Anyone who’s worked in pre-ChatGPT automated journalism — the era of expert systems, dataset farming, and if-this-then-that knee scraping — must appreciate these infinite games of chess with language models. Working around quirks and limitations to produce worthy copy leaves room for camaraderie, high fives, and smart-assery.

After hours — or days — we’re left with pages of distilled, often humorous, new chapters in the prompt typescript saga. Sometimes, agents respond better to arguments, metaphors, and anecdotes than to direct instructions. And when I look back at the paper trail, I’m left with a question: where should the copyright lie — in the Finnegans Wake of prompting, or in the commodity of the generated text?

Alongside their training with humans, agents also learn from the Archive — a growing library of modular stories that meet editorial standards. When a new piece is generated, it’s not just saved — it’s benchmarked. If it passes, it enters the Archive not as passive storage, but as prompt-shaping precedent: a high-quality example that can be referenced, reused, or echoed in future runs. The weights don’t change, but the prompting footprint evolves. The system gets sharper not through machine learning updates, but through accumulated editorial judgment. Call it epistemic fine-tuning.
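A sketch of that epistemic fine-tuning loop, with invented names: approved stories go into the Archive and come back out as few-shot exemplars in later prompts. A real version would rank by embedding similarity; recency is enough to show the idea.

TYPESCRIPT
// Approved stories become prompt-shaping precedent. No model weights change;
// only the set of exemplars the prompts can draw on grows.
interface ArchivedStory { storyType: string; topic: string; text: string; }

const archive: ArchivedStory[] = [];

function benchmarkAndStore(story: ArchivedStory, passed: boolean) {
  if (passed) archive.push(story);  // precedent, not passive storage
}

function exemplarsFor(storyType: string, limit = 2): string[] {
  return archive
    .filter(s => s.storyType === storyType)
    .slice(-limit)                  // most recent approved examples
    .map(s => s.text);
}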

No story — whether the product of artificial or human intelligence — should go straight to publishing. The Copy Editor is the one human who touches every piece in this pipeline. They can polish, refine, or approve — or send it all the way back to Unit #4 if something essential is still missing. Their feedback is, once again, not only editorial but instructional. Mistakes, awkward phrasings, rare blunders: these become prompt notes, injected back into the system.

In the end, this isn't just a generation step — it’s a loop. A disciplined, cautious, human-in-the-loop system — where modular journalism evolves from a static format into a learning organism.

The research continues. I’ll be publishing samples of modular stories generated entirely by agents in the coming weeks.

My parting message is a quote — not from a person, but from the language model itself, mid-training:

“What remains is not invention, but adoption.”

[Figure]

This work would not be here without the help of Shirish Kulkarni, Mattia Peretti and David Caswell. 🫶 I am very grateful.

Pier Paolo Bozzano