Abstract:
In the contemporary era of information silos, the ability to synthesise vast datasets into concise formats is a critical challenge. Timeline Summarisation (TLS) addresses this by processing extensive corpora of news articles to produce a chronological sequence of key events, where each date is paired with a brief, salient summary. For general audiences, navigating hundreds of disparate articles to reconstruct a narrative is labour-intensive; an automated timeline provides an immediate understanding of event progression and thematic essence. This thesis explores the algorithmic generation of such timelines, a task of significant interest within the Natural Language Processing (NLP) community. Evaluation is typically benchmarked against gold-standard datasets, specifically T17, CRISIS, and ENTITIES, using standardised metrics to compare machine-generated outputs against human-expert references. We investigate two distinct methodological frameworks to address the TLS problem:

Rhetorical Structure Theory (RST): This approach analyses the discourse structure of news articles to identify salient sentences. Our hypothesis posited that “nucleus” sentences, which are frequently elaborated upon by “satellite” text, represent the core information. While this method provided insights into document structure, the results did not achieve state-of-the-art performance.

Graph-based Entity and Event Extraction: The second approach involved clustering sentences based on shared entities and events. We initially explored Large Language Models (LLMs) for extraction; however, the computational latency proved prohibitive for large-scale datasets. Consequently, we transitioned to a more efficient spaCy-based pipeline. Our findings indicate that graph-theoretic properties, specifically node degree, serve as effective indicators for identifying the most significant events to include in a timeline.
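
As an illustration of the degree-based idea in the second approach, the following is a minimal sketch and not the pipeline used in the thesis. It assumes entities have already been extracted for each sentence (for example by a named-entity recogniser), links two sentences whenever they share at least one entity, and ranks sentences by node degree. The function name `rank_by_degree`, the sentence identifiers, and the toy entities are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def rank_by_degree(sentence_entities):
    """Rank sentences by degree in a shared-entity graph.

    sentence_entities maps a sentence id to the set of entity strings
    found in that sentence (assumed to be extracted beforehand).
    Returns (sentence id, degree) pairs, highest degree first,
    with ties broken by sentence id.
    """
    # Invert the mapping: entity -> sentences that mention it.
    mentions = defaultdict(set)
    for sid, entities in sentence_entities.items():
        for entity in entities:
            mentions[entity].add(sid)

    # Two sentences become neighbours if they share at least one entity.
    neighbours = {sid: set() for sid in sentence_entities}
    for sids in mentions.values():
        for a, b in combinations(sorted(sids), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Node degree is the number of distinct neighbouring sentences.
    degree = {sid: len(adjacent) for sid, adjacent in neighbours.items()}
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


# Toy input with hypothetical, pre-extracted entities.
toy = {
    "s1": {"Cairo", "Mubarak"},
    "s2": {"Mubarak", "protest"},
    "s3": {"Cairo"},
    "s4": {"Tahrir Square"},
}

for sid, deg in rank_by_degree(toy):
    print(sid, deg)
```

In this toy input, `s1` shares an entity with both `s2` and `s3`, so it has the highest degree and would be the first candidate for inclusion. A sentence with no shared entities, such as `s4`, has degree zero.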