NOS Archive

Creating a searchable archive and public API for articles from the Dutch public broadcaster NOS

Overview

Until 2024, the NOS had a great archive: you could pick a date within a category and see what happened on that day. At the end of 2023, the NOS announced that old articles would only be accessible through its search function. The problem is that this search function is poor. You cannot filter by date or category, so finding a specific article on a topic that has several articles about it is like looking for a needle in a haystack. Using what I learned in my first (failed) project, I scraped the Internet Archive's Wayback Machine for snapshots going back to 2010, then categorised and indexed every article I could recover. Additionally, I have been ingesting rich data going forward from June 2024. To make the archive more useful than the original ever was, I added variable archive windows (day/week/month), categories, AI summarisation, search, and a public API.
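To illustrate the variable archive windows, here is a minimal sketch of how a day, week, or month window around a chosen date could be computed. The function name `archive_window` and the choice of Monday as the first day of the week are assumptions made for this example, not details taken from the project's code.

```python
from datetime import date, timedelta


def archive_window(anchor: date, span: str) -> tuple[date, date]:
    """Return the inclusive (start, end) dates of the window containing `anchor`.

    Hypothetical helper: `span` is one of "day", "week", or "month".
    """
    if span == "day":
        return anchor, anchor
    if span == "week":
        # Assumption: weeks start on Monday; weekday() is 0 for Monday.
        start = anchor - timedelta(days=anchor.weekday())
        return start, start + timedelta(days=6)
    if span == "month":
        start = anchor.replace(day=1)
        # The last day of the month is the day before the first of the next month.
        if start.month == 12:
            next_month = start.replace(year=start.year + 1, month=1)
        else:
            next_month = start.replace(month=start.month + 1)
        return start, next_month - timedelta(days=1)
    raise ValueError(f"unknown span: {span!r}")
```

The resulting pair of dates can then be used as the lower and upper bound of a date filter when listing the articles for that window.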

Goals

Learning Objectives

  • SQL and NoSQL databases
  • Big data refinement
  • API setup

Practical Objectives

  • Better search for NOS articles
  • Public API to search articles
  • Recreate and improve the archive function

Implementation Details

Backend Architecture

Built with the Flask framework, using blueprints for a modular design. The SQLAlchemy ORM handles database operations, with migration support via Alembic.

Frontend Design

Responsive UI built with the Bootstrap CSS framework. Jinja2 templates provide server-side rendering, with minimal JavaScript for enhanced interactions.

Deployment Infrastructure

Deployed on a VPS, served by the Gunicorn HTTP server with requests routed through NGINX. Data is served through a MySQL instance coupled with Elasticsearch.
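As an illustration of how Elasticsearch can add the date and category filtering that the NOS search lacks, the sketch below builds a query body in the Elasticsearch Query DSL. The field names (`title`, `body`, `category`, `published_at`) and the function `build_search_query` are hypothetical; the project's actual index mapping is not described here.

```python
from datetime import date
from typing import Optional


def build_search_query(
    text: str,
    start: date,
    end: date,
    category: Optional[str] = None,
    size: int = 20,
) -> dict:
    """Build an Elasticsearch request body: full-text match plus exact filters.

    All field names are assumptions made for this example.
    """
    # Filters narrow the result set without influencing the relevance score.
    filters = [
        {"range": {"published_at": {"gte": start.isoformat(), "lte": end.isoformat()}}}
    ]
    if category is not None:
        filters.append({"term": {"category": category}})

    return {
        "size": size,
        "query": {
            "bool": {
                # Search title and body; "^2" weights title matches more heavily.
                "must": [
                    {"multi_match": {"query": text, "fields": ["title^2", "body"]}}
                ],
                "filter": filters,
            }
        },
    }
```

The returned dictionary could be serialised to JSON and sent to an index's `_search` endpoint. Keeping the date range and category in `filter` rather than `must` means they restrict results without affecting ranking.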

Development Progress

Status

Completed

Start Date

2025-10-07

Features

Historic data back to 2010: Completed
Automatic data ingestion: Completed
Categorisation & labelling: Completed
Search engine: Completed
Day/Week/Month archive: Completed
AI summarisations: Completed
Public rate-limited API: Completed
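
One common way to rate-limit a public API is a token bucket per client. The sketch below shows that technique in plain Python; it is not necessarily how this project enforces its limits, and the class name `TokenBucketLimiter`, the per-client key, and the chosen capacity and refill rate are all assumptions for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """In-memory token bucket keyed by client (hypothetical example class)."""

    def __init__(
        self,
        capacity: int,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        # Maps a client key to (tokens remaining, time of last update).
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, key: str) -> bool:
        """Return True and consume a token if `key` may make a request now."""
        now = self.clock()
        tokens, last = self._buckets.get(key, (float(self.capacity), now))
        # Refill in proportion to elapsed time, never exceeding the capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_second)
        if tokens >= 1:
            self._buckets[key] = (tokens - 1, now)
            return True
        self._buckets[key] = (tokens, now)
        return False
```

A request handler could call `allow()` with an API key or client IP address and respond with HTTP 429 (Too Many Requests) when it returns False. An in-memory dictionary only works within a single process, so a setup with several Gunicorn workers would need shared storage for the buckets.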