Modular Journalism with AI Agents


The project

  • There are countless ways to approach the engineering. I kept things as simple as possible: the entire pipeline runs on the Next.js frontend with server-side rendering, orchestrated with LangChain. The architecture allows for dynamic workflows while keeping things lightweight.
  • I tested the flow using both the OpenAI API and the Gemini API, choosing Gemini for the proof of concept mainly because of pricing differences. While Google’s tokens are more expensive in the long run, they’re free to use (within limits) during research and prototyping.
  • As for performance, the cheapest OpenAI model – GPT‑3.5 Turbo – struggled with complex, multi-step prompts. Trial and error often led to actual errors. That said, LangChain integrates more smoothly with OpenAI than with Gemini. I only briefly tested GPT‑4, but the results were consistently strong. I had set a $10 token budget for the run; it came out to $11.70 with taxes – enough to validate the workflow. I bookmarked OpenAI for future testing, switched to Gemini, and didn’t look back.
  • One of the nice things about LangChain is that it makes it easy to mix and match models within the same pipeline, so you can pick the best one for each step – assuming you’ve got tokens to spend.
  • For the backend, I’m using Strapi Community Edition with a PostgreSQL database, which also powers the modular journalism API the pipeline connects to. While the structure for storing “atoms of content” for the generative phase is already set up in React, the upstream connection will remain on hold until the full flow is ready.
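
The idea of picking a different model for each step can be sketched without any provider code. This is a minimal illustration, not the project's implementation: the `Model` type, `runPipeline`, and the two stub models are hypothetical names, and in the real pipeline LangChain's chat model classes would sit where the stubs are.

```typescript
// Assumption: a "model" is any async function from prompt to text.
// In the actual project this role is played by LangChain chat models.
type Model = (prompt: string) => Promise<string>;

interface Step {
  name: string;
  model: Model; // each step may use a different model
  buildPrompt: (input: string) => string;
}

// Run the steps in order, feeding each step's output into the next one.
async function runPipeline(steps: Step[], input: string): Promise<string> {
  let current = input;
  for (const step of steps) {
    current = await step.model(step.buildPrompt(current));
  }
  return current;
}

// Hypothetical stubs standing in for a cheaper and a stronger model.
const cheapModel: Model = async (prompt) => `[cheap] ${prompt}`;
const strongModel: Model = async (prompt) => `[strong] ${prompt}`;

const demoSteps: Step[] = [
  { name: "health-check", model: cheapModel, buildPrompt: (t) => `check: ${t}` },
  { name: "generate", model: strongModel, buildPrompt: (t) => `write: ${t}` },
];
```

Swapping a model for one step then means changing a single `model` field, which is the property that makes per-step cost and quality trade-offs cheap to try.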
Clicking on the module nodes shows a definition of the categories in van Dijk's news schemata.

Agent 1 — Structured News Health Check

(For a deeper dive into the principles, see here.)

  • What happened?
  • What are the key facts?
  • What is the data?
  • What do key people say?
  • Are there images, videos, or audio about this?
  • Why is this important?
  • What has got us here?
  • What is the impact on my community?
  • Are any people particularly or disproportionately affected?
  • How can we fix it?
  • How can I contribute / help?
  • What happens if…?
  • What don’t we know?
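
The questions above can be turned into stable identifiers for use in code. The sketch below derives a slug from each question; only `what-happened` is confirmed by the sample output in this article, so the slugging rule and the `needSlug` name are assumptions made for illustration.

```typescript
// Assumption: slugs are lowercase, hyphen-separated, with punctuation dropped.
// Only "what-happened" is attested in the article's sample output.
function needSlug(question: string): string {
  return question
    .toLowerCase()
    .replace(/[’']/g, "") // drop apostrophes: "don’t" becomes "dont"
    .replace(/[^a-z0-9]+/g, "-") // every other non-alphanumeric run becomes "-"
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
}

// A few of the information needs listed above.
const sampleNeeds: string[] = [
  "What happened?",
  "What are the key facts?",
  "What happens if…?",
  "What don’t we know?",
];

const sampleSlugs: string[] = sampleNeeds.map(needSlug);
```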

Directory of Liquid Content

Agent 1

  • Is the content, after removing the dysfunctional bits, factual and informative enough?
  • Does it misrepresent, exaggerate, or contain inappropriate claims that should be removed?
  • How’s the sourcing: is it missing, vague, anonymous, or rogue?

Agent 2

Figure 1

Information Needs => Stories Sankey. Left: user information needs. Right: story types (including the three new templates added with this research). Width = number of matches; color = story type.

Two steps back

[
  {
    "input_excerpt": "The Labour party has called for an investigation into Sir Geoffrey Cox...",
    "target_atoms": [
      {
        "need_slug": "what-happened",
        "vdj_role": "Lead",
        "status": "solid",
        "confidence": 0.84,
        "evidence_spans": [
          { "text": "The Labour party has called for an investigation into Sir Geoffrey Cox" }
        ],
        "entities": [
          { "id": "e1", "name": "Labour Party", "kind": "org" },
          { "id": "e2", "name": "Sir Geoffrey Cox", "kind": "person", "role": "MP" }
        ],
        "events": [
          { "action": "call for investigation", "actors": ["e1"], "targets": ["e2"] }
        ]
      }
    ]
  }
]
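
A structure like the one above is easier to trust if it is checked before it moves downstream. The following is a sketch of one possible consistency check, with field names copied from the sample. The rules themselves are assumptions: that `confidence` lies between 0 and 1, that every evidence span appears verbatim in the input, and that events may only reference declared entity ids.

```typescript
interface Entity { id: string; name: string; kind: string; role?: string }
interface NewsEvent { action: string; actors: string[]; targets: string[] }
interface Atom {
  need_slug: string;
  vdj_role: string;
  status: string;
  confidence: number;
  evidence_spans: { text: string }[];
  entities: Entity[];
  events: NewsEvent[];
}

// Returns a list of human-readable problems; an empty list means the atom passed.
function validateAtom(atom: Atom, inputText: string): string[] {
  const problems: string[] = [];

  // Assumed rule: confidence is a probability-like score in [0, 1].
  if (atom.confidence < 0 || atom.confidence > 1) {
    problems.push(`confidence out of range: ${atom.confidence}`);
  }

  // Assumed rule: evidence must be quoted verbatim from the input text.
  for (const span of atom.evidence_spans) {
    if (inputText.indexOf(span.text) === -1) {
      problems.push(`evidence not found in input: "${span.text}"`);
    }
  }

  // Assumed rule: events may only point at entities declared in the same atom.
  const ids = atom.entities.map((e) => e.id);
  for (const ev of atom.events) {
    for (const ref of ev.actors.concat(ev.targets)) {
      if (ids.indexOf(ref) === -1) {
        problems.push(`event "${ev.action}" references unknown entity ${ref}`);
      }
    }
  }
  return problems;
}
```

Returning a list of problems rather than throwing on the first one lets a later agent, or a human editor, see everything that is wrong with an atom in a single pass.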

Agent 2 — Cleaning for Bias

  • Is there a clear structure?
  • Is the tone neutral?
  • Is the language clear?
  • Are there basic grammar issues or typos?

user effects, rhetorical patterns

  • "Sensitive details are included: public-interest rationale may need stating"

  • I will report on identifying/graphic details lacking a compelling public-interest rationale

log of the training

Figure 1

Agent 3 — The inventarium

  • Do not introduce new claims, facts, or invented quotes.
  • Never invent a “more correct” name if you’re not sure.
  • If two options are ambiguous, do not silently pick one.
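
Rules like these are given to the model as instructions, but the first one can also be partly checked after the fact. As a rough sketch (not something the project is known to do), the hypothetical helper below flags any quoted passage in an agent's output that does not appear verbatim in the source text. It only catches invented quotes, not invented claims, and it assumes quotes are marked with straight or curly double quotation marks.

```typescript
// Collect the text found between straight or curly double quotation marks.
function extractQuotes(text: string): string[] {
  const quotes: string[] = [];
  const pattern = /["“]([^"”]+)["”]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    quotes.push(match[1].trim());
  }
  return quotes;
}

// Return the quotes in `output` that cannot be found verbatim in `source`.
function findInventedQuotes(source: string, output: string): string[] {
  return extractQuotes(output).filter((q) => source.indexOf(q) === -1);
}
```

A non-empty result would be a reason to send the output back for review rather than proof of fabrication, since a legitimate quote may have been lightly normalized along the way.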

Agent 4 — The reporter

Agent 5 — The cautious generator

Figure 1

Pier Paolo Bozzano