Why Journalism Needs Structure Before AI Eats It
Modular journalism meets answer systems. A resilient linguistic framework for news, mapped user information needs, and a minimal atom schema for AI readers. Coming up with an atom-based infrastructure is the easy part. The hard part is getting publishers to understand the structure imperative — rebuilding their CMSs around atoms — and agents to respect structured attribution.

Over the last couple of years, my modular journalism work has been split between two slightly incompatible worlds.
On one side, there's the human world: editors, news product gurus, meta-journalists trying to break away from the long-form article and reimagine news around what people actually need to know. That's where the first versions of the Modular Journalism API came from: user information needs, story templates, and the stubborn insistence that "Why is this important?" and "How does this affect my community?" deserve first-class status.
On the other side, there's the agent world: language models, tools, and pipelines that don't care about our formats and newsroom practices at all. They just see text and probabilities. If we want them to respect journalism – its sourcing, its uncertainty, its boundaries – we have to give them something better than a wall of prose and a hope that "good writing" will somehow translate into "machine-readable structure." Spoiler: it does not.
This digression from my modular research lives exactly in the overlap between those worlds.
The question I'm trying to answer is narrow and practical:
What does a news story need to look like so that AI agents can reliably understand, search, and reuse it without destroying its meaning and provenance?
The answer, unsurprisingly, is structure. But not abstract, hand-wavy "we really should structure things more." I mean structure at the level where agents actually operate: episodes, events, and paragraphs. A grammar that tells them, sentence by sentence, "this is the action," "this is reaction," "this is background," "this is context," "this is someone speculating about what might happen next." And, dear agent, if you lose too many of these pieces along the way, you’ll still produce an answer when someone asks you a pointed question — but it will be overly confident, plausible-sounding crapola, not journalism.
That's why Teun van Dijk's news schemata, which were perfect as a conceptual compass when we started this journey, have now given way to Allan Bell’s schema in the modular journalism API. Once a good underlying framework for news story structure is in place, it becomes possible to map each paragraph to both a linguistic function (action, reaction, background, context, expectations…) and a user need ("What happened?", "What is the impact on my community?", "What don’t we know?", …).
From there, the path is almost forced: paragraphs become atoms; atoms become entries in an API; the API becomes a contract between journalism and agents. The rest of this essay walks through that transition: how Bell replaces van Dijk as the linguistic layer, how Bell's nodes are paired with user information needs, and how those pairings can be turned into a minimal schema — a protocol that both modular newsrooms and answer systems can share.
So this isn't just about making my own pipeline work better (well, not only). It’s about making journalism legible to any agent.
Disclaimer: this is by no means a finished work, nor one free of imprecision. It's the sketch of an idea that, touch wood, will turn into an MVP if I find the energy to work a few more weekends, resisting the pull to just go back to Scrivener to write space operas instead.
Goodbye Teun, hello Allan
It has become a bit of a tradition (and not an unpleasant one) that whenever I return to my modular journalism research, I end up spending a weekend or two dealing with some aspect of news linguistics. In this case, the opportunity came with the retirement of Teun van Dijk's news schemata from the modular journalism API, and their formal replacement with Allan Bell's schema, which is nimbler and far better suited to this purpose.
Bell’s work is not my discovery; it's Sannuta Raghu's 🙌 and I’m very grateful for it.
Van Dijk's framework was first introduced in 2023 as a foundational stepping stone for the definition of journalistic modules. Although it was obviously outdated and bloated, the role of the news schemata in that context was purely theoretical: we needed strong journalistic principles to stand on, and an inverted pyramid to move away from. When, a few months ago, I began working on recognizing the functional components of journalistic texts in an agentic pipeline, the complexity of van Dijk's schemata came back to bite me.
The excellent AI agents we have in our toolkits in the fall of 2025 have perhaps been instructed by publicists never to admit it, but they do not like complexity. Not even a little bit.
Bell’s schema is a drop-in replacement for van Dijk's, and it’s closer to how news actually works on the page – which makes it much easier for agents to use as a blueprint for tagging the semantic structure, and easier for humans to tweak prompts. Van Dijk's model is brilliant as a macro-structure for narrative, but its categories are big, fuzzy and global. The same paragraph can contribute to the "summary" and also to the "situation". For an LLM that has to label individual sentences, that adds ambiguity and error margins. Pipelines have enough moving parts; we can’t afford to measure the measuring tape on each run.
Bell flips the perspective. He treats a news story as a sequence of episodes made of events, and then decomposes each event into functional slots: action, actors, setting, follow-up (consequences and reactions), commentary (context, evaluation, expectations) and background (previous episodes and history), plus headline, lead/abstract, and attribution at the story level. Those slots map very directly onto textual cues: datelines and bylines for attribution; temporal adverbs and tenses for background vs follow-up; reporting verbs for reactions ("said", "claimed"); hedging and modality for expectations ("could", "is expected to"), and so on.
For agents, especially mine, which are asked not to make judgement calls and to rely only on rhetorical and lexical cues, this is ideal. Instead of asking "Where are we in the narrative arc?" they can ask "In this sentence, is the journalist describing what happened, reacting to it, explaining it, or reaching back into the past?" That makes automatic tagging of functional components cleaner, more consistent across stories, and easier to adapt to downstream tasks like detecting user information needs and effects.
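To make the slot vocabulary concrete, here is a rough sketch of how it might be encoded for a tagging agent. The role names match the newsFunction values that show up later in the JSON-LD; the cue lists are my own illustrative examples drawn from the kinds of signals mentioned above, not an exhaustive or authoritative mapping.

```typescript
// Bell-style functional slots as a closed vocabulary for paragraph tagging.
// Role names match the `newsFunction` values used later; cues are illustrative only.
type NewsFunction =
  | "action"        // what happened: the core event being reported
  | "reaction"      // someone responding: reporting verbs like "said", "claimed"
  | "context"       // explanation of the situation surrounding the action
  | "background"    // previous episodes and history, reaching back in time
  | "expectations"  // hedged futures: "could", "is expected to", "may"
  | "consequences"; // follow-up effects flowing from the action

// A small cue table a tagging prompt (or a rule-based pre-tagger) could lean on.
const surfaceCues: Record<NewsFunction, string[]> = {
  action: ["on Friday", "announced", "imposed"],
  reaction: ["said", "claimed", "added", "criticised"],
  context: ["this came amid", "currently"],
  background: ["had", "took effect on", "last year"],
  expectations: ["could", "is expected to", "plans to"],
  consequences: ["as a result", "which means"],
};
```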
It may not be a surprise that switching from a convoluted linguistic framework to a straightforward one helped the AI agent map the structure of the text.
The next step was to start matching the new Allan Bell nodes to the user information needs.
I can't share direct access to the pipeline, but I can now save a complete snapshot of my agents' runs. The snapshot refers to the same Scroll.in article Sannuta Raghu used for her News Atom demo.
Sannuta's News Atom is the most complete articulation I’ve seen of what a sentence-level unit of journalism should look like in the age of AI: 15 fields, spanning identity, epistemic status, semantic grounding, provenance, review workflow, and licensing. It’s journalism’s EXIF, and my work stands on its shoulders.
In this project, though, I needed something slightly different. The modular journalism pipeline already produces linguistic slots, user information needs, and paragraph-level atoms, and I wanted a minimal, JSON-LD-friendly surface that answer systems and MCP agents could use today without having to implement the full News Atom stack. That led me to a deliberately narrower schema.news/Atom v0.1: one Bell role, one dominant user need, offsets into the source, a small actors/action/setting/time bundle, and provenance via schema.org NewsArticle (and its siblings OpinionNewsArticle, BackgroundNewsArticle, etc.).
I also tried to imagine what could be a workable compromise between what LLMs need and what a random publisher’s IT department might be remotely willing to tackle, without resorting to voodoo dolls and needles.
Why should newsrooms bother with esoteric structures and atoms? Because if they don't, agents will consume their content as raw slop, strip the attribution, and kill their brand authority. Structured atoms are the only way to at least make attribution possible in an AI-mediated world. Someone will of course have to make (force?) agents, which have so far benefited from synthesis without attribution, respect the structure.
With that reality in mind, this is not an alternative to the News Atom blueprint, but more a thin compatibility layer: a pragmatic, user-needs-first view of the same underlying structure, tuned for LLM reliability and deployment inside a working agentic pipeline. In future iterations, it should be straightforward to map a full News Atom into a set of schema.news/Atom instances (and vice versa). For now, my priority was to keep the surface area small enough that both humans and agents can use it in a pinch.
The snapshot and JSON-LD below show what this looks like on a real story: a Trump–Apple tariffs piece where each segment is tagged with a Bell role, mapped to a user need, and emitted as a reusable Atom.
{
"@id": "https://scroll.in/latest/1082718/rush-hour-trump-threatens-apple-with-tariff-rahul-gandhi-criticises-s-jaishankar-and-more",
"@type": "NewsArticle",
"author": [
{
"name": "Name Lastname",
"@type": "Person"
}
],
"hasPart": [
{
"@id": "https://scroll.in/latest/1082718/...#atom-lead",
"text": "President Donald Trump on Friday said that technology company Apple could face a 25% tariff on iPhones sold in the United States if they were not manufactured in the country.",
"@type": "Atom",
"actors": [
{
"name": "President Donald Trump",
"@type": "Person"
},
{
"name": "Apple",
"@type": "Organization"
}
],
"actions": [
"Trump said Apple could face a 25% tariff on iPhones sold in the United States if they are not manufactured in the country."
],
"endChar": 176,
"setting": {
"name": "United States",
"@type": "Place"
},
"isPartOf": {
"@id": "https://scroll.in/latest/1082718/rush-hour-trump-threatens-apple-with-tariff-rahul-gandhi-criticises-s-jaishankar-and-more"
},
"position": 1,
"eventTime": "Friday",
"startChar": 0,
"identifier": "lead-1",
"newsFunction": "action"
},
{
"@id": "https://scroll.in/latest/1082718/...#atom-reaction-1",
"text": "In a social media post, Trump said that he had informed Apple's Chief Executive Officer Tim Cook that he expects iPhones sold in the United States to be manufactured in the country and not in \"India, or anyplace else\". \"If that is not the case, a tariff of at least 25% must be paid by Apple to the US,\" he added.",
"@type": "Atom",
"actors": [
{
"name": "President Donald Trump",
"@type": "Person"
},
{
"name": "Apple's Chief Executive Officer Tim Cook",
"@type": "Person"
},
{
"name": "Apple",
"@type": "Organization"
}
],
"actions": [
"Trump said in a social media post that he told Tim Cook he expects all iPhones sold in the United States to be manufactured in the country and warned that Apple must pay at least a 25% tariff if they are made elsewhere."
],
"endChar": 520,
"setting": {
"name": "social media post about iPhone manufacturing and tariffs",
"@type": "Place"
},
"isPartOf": {
"@id": "https://scroll.in/latest/1082718/rush-hour-trump-threatens-apple-with-tariff-rahul-gandhi-criticises-s-jaishankar-and-more"
},
"position": 2,
"eventTime": "On Friday",
"startChar": 178,
"identifier": "reaction-1",
"newsFunction": "reaction"
},
{
"@id": "https://scroll.in/latest/1082718/...#atom-context-1",
"text": "This came amid attempts by Apple to diversify its manufacturing beyond China, where it makes most of its iPhones, amid tariff and geopolitical concerns. The company does not manufacture its smartphones in the US. It plans to source the majority of its US iPhone supply from India by the end of next year to reduce its dependence on China.",
"@type": "Atom",
"actors": [
{
"name": "Apple",
"@type": "Organization"
}
],
"actions": [
"Apple is diversifying iPhone manufacturing away from China and plans to source most US iPhone supply from India by the end of next year."
],
"endChar": 1324,
"setting": {
"name": "China and India",
"@type": "Place"
},
"isPartOf": {
"@id": "https://scroll.in/latest/1082718/rush-hour-trump-threatens-apple-with-tariff-rahul-gandhi-criticises-s-jaishankar-and-more"
},
"position": 3,
"startChar": 949,
"identifier": "context-1",
"newsFunction": "context"
},
{
"@id": "https://scroll.in/latest/1082718/...#atom-background-1",
"text": "Trump's so-called reciprocal tariffs imposed on several countries, including a 26% \"discounted\" levy on India, took effect on April 9. Hours later, however, Trump reduced the rates on imports from most countries to 10% for 90 days to provide time for trade negotiations.",
"@type": "Atom",
"actors": [
{
"name": "President Donald Trump",
"@type": "Person"
}
],
"actions": [
"Trump imposed reciprocal tariffs, including a 26% levy on India, then reduced most rates to 10% for 90 days to allow for trade talks."
],
"endChar": 1586,
"isPartOf": {
"@id": "https://scroll.in/latest/1082718/rush-hour-trump-threatens-apple-with-tariff-rahul-gandhi-criticises-s-jaishankar-and-more"
},
"position": 4,
"eventTime": "From April 9 for 90 days",
"startChar": 1326,
"identifier": "background-1",
"newsFunction": "background"
}
],
"@context": [
"https://schema.org",
{
"Atom": "news:Atom",
"news": "https://schema.news/",
"actors": "news:actors",
"actions": "news:actions",
"endChar": "news:endChar",
"setting": "news:setting",
"eventTime": "news:eventTime",
"startChar": "news:startChar",
"newsFunction": "news:newsFunction",
"NewsOrganization": "news:NewsOrganization"
}
],
"headline": "Trump threatens Apple with 25% tariff on iPhones not made in the US",
"publisher": {
"name": "Scroll.in",
"@type": [
"NewsMediaOrganization",
"NewsOrganization"
]
},
"datePublished": "2025-05-23T04:00:00.000Z"
}
Structured content management
There's another point worth making explicit: right now, I'm using agents to infer structure from prose that was never written with structure in mind — essentially retrofitting atoms onto articles after the fact. That works as a bridge strategy for legacy content and publishers who won't change their workflows no matter what.
But the real goal isn't to get better at reverse-engineering journalism into atoms; it's to make CMSs guide journalists to create structured content from the start. If a publishing system prompts journalists to fill in "What happened?" and "Who is affected?" as first-class fields - not SEO afterthoughts, but the actual units of composition - then atoms become native, not inferred. Journalism becomes atom-first, not article-first-then-decomposed. That eliminates the ambiguity and error margins of automated tagging, and it forces newsrooms to think in terms of user needs from the moment of creation rather than tacking them on during distribution.
The schema.org extension becomes not just an output format but a workflow framework. Which means the real adoption question isn't just "Will publishers extend their markup?" but "Will they toss the outdated article-based CMSs in a landfill and adopt a new generation of tools that handles structure natively?" That's a heavier lift, but it's where the structural imperative actually has teeth.
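As a thought experiment, here is a minimal sketch of what an atom-first content type might look like inside such a CMS, with user needs and Bell roles as first-class fields rather than metadata bolted on afterwards. All field names and need slugs here are mine, chosen for illustration; a real content model would be negotiated with the newsroom and its existing tooling.

```typescript
// Hypothetical atom-first content type: journalists compose a story as a list of
// atoms, each answering one user need, instead of writing a single article body.
interface AtomDraft {
  need:                         // the user information need this atom answers
    | "what-happened" | "what-is-the-impact"
    | "what-dont-we-know" | "how-can-we-fix-it";
  newsFunction:                 // the Bell slot the paragraph occupies
    | "action" | "reaction" | "context"
    | "background" | "expectations" | "consequences";
  text: string;                 // the paragraph the journalist actually writes
  actors?: string[];            // people and organisations involved
  setting?: string;             // where the event takes place
  eventTime?: string;           // when, as it will be stated in the copy
  sources?: string[];           // links or notes backing this specific atom
}

interface StoryDraft {
  headline: string;
  atoms: AtomDraft[];           // the story is composed of atoms, not decomposed into them
}
```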
It may not be the most orthodox way to proceed – working on an atom schema MVP while already in the midst of a modular journalism automation pipeline – but oddly the two projects overlap in many places.
The goal is, after all, the same: create a structural framework that makes an agent’s work easier. But this is the moment where the path splits and we need to define a new roadmap.
Turning atoms into a protocol (and an MCP tool)
By the time we get to Bell, user needs, and user effects, the shape of the thing is pretty clear: we’re no longer talking about an article but about a graph of atoms.
Each atom:
- answers one user information need (or a tight cluster of them),
- occupies one Bell slot (action, reaction, context, background, expectations, consequences, etc.),
- and knows who said what, where, and when.
Right now, that graph lives inside my stack: Strapi + Postgres + a Next.js frontend + a small zoo of agents. To make journalism legible to other agents – the ones I don’t control – these atoms need to stop being a local implementation detail and start behaving like a protocol.
The next step is to expose them as a tool via MCP, the Model Context Protocol. In practice, that means three things (a rough sketch of the resulting surface follows the list):
Freezing a minimal atom shape – we can perhaps call it schema.news/Atom v0.1
Not every field, just the ones any answer intermediary would actually use:
- `@id`, `text`
- `newsFunction` (Bell: `action`, `reaction`, `context`, `background`, `expectations`, `consequences`, etc.)
- offsets in the source (`startChar`, `endChar`)
- `actors`, `actions`, `setting`, `eventTime`
- a link to a user-need slug (`what-happened`, `what-is-the-impact`, and so on)
Wrapping the atom store in tools that agents can call, for example:
- `search_atoms(query, newsFunction, need_slug, limit)` – find relevant atoms for a question
- `get_atoms_for_article(article_id, newsFunction)` – get all actions + context for a story
- `get_atoms_for_need(need_slug, topic)` – gather everything we know that answers “What happened?” about X
Always returning provenance with the atoms
Each response carries:
- the atom itself,
- a pointer back to the source material (@id, headline, publisher, published date),
- and, ideally, a canonical content hash for de-duplication.
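To make the contract tangible, here is a rough TypeScript sketch of the minimal atom shape and the three tools, with provenance attached to every response. The field and tool names follow the list above; the return shape (AtomResult here) and the optional content hash are my assumptions rather than a frozen spec, and this is the surface an MCP server would wrap, not actual MCP SDK code.

```typescript
// schema.news/Atom v0.1, reduced to the fields an answer intermediary would use.
interface Atom {
  "@id": string;                 // stable IRI, e.g. <article-url>#atom-lead
  text: string;                  // the verbatim span from the source article
  newsFunction:                  // Bell slot
    | "action" | "reaction" | "context"
    | "background" | "expectations" | "consequences";
  startChar: number;             // character offsets into the source text
  endChar: number;
  actors?: { name: string; type: "Person" | "Organization" }[];
  actions?: string[];            // normalised one-line restatements of the action
  setting?: { name: string; type: "Place" };
  eventTime?: string;            // as expressed in the copy, not necessarily ISO 8601
  needSlug?: string;             // e.g. "what-happened", "what-is-the-impact"
}

// Every tool response carries provenance alongside the atom (assumed shape).
interface AtomResult {
  atom: Atom;
  source: {
    "@id": string;               // canonical article URL
    headline: string;
    publisher: string;
    datePublished: string;
    contentHash?: string;        // optional canonical hash for de-duplication
  };
}

// The three tools from the list above, as plain async signatures an MCP server would expose.
declare function search_atoms(
  query: string, newsFunction?: string, need_slug?: string, limit?: number
): Promise<AtomResult[]>;

declare function get_atoms_for_article(
  article_id: string, newsFunction?: string
): Promise<AtomResult[]>;

declare function get_atoms_for_need(
  need_slug: string, topic: string
): Promise<AtomResult[]>;
```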
From the agent’s point of view, this turns journalism into something queryable in a single, consistent way: "Given a question, call the MCP tool, get atoms back, compose an answer from those atoms, keep the links to where they came from."
If you squint, this looks a little bit like the structure imperative turned into an API, no?
It’s also where the symmetry from the silly “geometric proof” becomes visible: the same structure that lets my agents clean and recombine news into modules is what lets other agents read it without mangling attribution.
A responder to test the theorem
Once atoms are visible as a protocol instead of a private schema, the obvious next move is to test whether the central claim actually holds up:
If journalism is modular and structured at atom level, answer systems will find it easier to use with attribution than without.
To test that, I want something deliberately small and slightly ridiculous: a responderer.
The responderer is a tiny agent whose job is only this:
- Take a user question.
- Call the MCP atom tools.
- Assemble an answer using only returned atoms.
- Show its working: which atoms it used, from which stories, with which publishers.
No open-web search, no creative “enhancement,” no extra facts. If the atoms aren’t there, the responderer says so and points to the missing user needs:
“I can tell you what happened and who is affected, but I have no atoms for how we can fix it yet.”
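For illustration, here is a minimal sketch of the responderer loop, reusing the Atom shape and the search_atoms signature assumed in the earlier sketch. composeFromAtoms is a hypothetical placeholder for the single LLM call that stitches atoms into prose and is instructed to add nothing that isn't in them.

```typescript
// Minimal responderer loop: question in, atoms out, answer composed only from atoms,
// with the sources it used listed explicitly. Builds on the Atom / AtomResult /
// search_atoms declarations sketched earlier; composeFromAtoms is hypothetical.
async function respond(question: string, needSlug: string): Promise<string> {
  const results = await search_atoms(question, undefined, needSlug, 10);

  if (results.length === 0) {
    // No atoms for this need: say so instead of improvising.
    return `I have no atoms that answer "${needSlug}" for this question yet.`;
  }

  const answer = await composeFromAtoms(question, results.map(r => r.atom));
  const citations = results
    .map(r => `- ${r.source.headline} (${r.source.publisher}, ${r.source.datePublished})`)
    .join("\n");

  return `${answer}\n\nSources:\n${citations}`;
}

// Hypothetical: an atom-constrained generation step, e.g. one LLM call whose prompt
// forbids any fact that is not present in the atoms passed in.
declare function composeFromAtoms(question: string, atoms: Atom[]): Promise<string>;
```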
In other words: the responderer is not a product; it’s a wind tunnel for the structure.
- If it struggles to answer “What happened?” using only action atoms, the Bell mapping is wrong.
- If it constantly needs to peek at background to answer a “why” question, the user need is mis-specified.
- If it keeps returning half-baked answers, we know which atoms, needs, or user effects are missing in the pipeline.
Every failure becomes a calibration point: either the schema is underspecified, the annotations are poor, or the user need is ill-defined and must be tweaked.
And Docs, so others can break it
To make this useful beyond my own bench tests, the last part of the roadmap is dull but essential: docs.
Nothing miraculous; let's aim for the low-hanging fruit first:
- A one-page spec for `schema.news/Atom v0.1`: field list, examples, and how Bell slots + user needs map into it.
- A couple of fully worked JSON-LD examples from real stories.
- A "how to annotate" section for humans and agents:
  - How to decide whether a paragraph is `background` vs `previous episodes`.
  - How to pick one primary user need for an atom.
  - How to express actors/actions/setting consistently.
- A "how to plug into the responderer" note:
  - How another newsroom could expose their atoms through the same MCP tools.
  - How to test their own corpus against the responderer and see where it falls short.