Knowledge Extraction from Cyber Security Incident Reports

GUPTE, PARTH

Please use this identifier to cite or link to this item: http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/9858

Title:	Knowledge Extraction from Cyber Security Incident Reports
Authors:	Patil, Sangameshwar; Shukla, Manish; GUPTE, PARTH; Dept. of Data Science; 20201008
Keywords:	NLP; Cyber Security; MSPhi3; Mistral; Knowledge Graphs; Attack Graphs; Logical Attack Graphs
Issue Date:	May-2025
Citation:	104
Abstract:	Building systems secure from cyber attacks requires researchers and cyber security experts to understand past attacks and the vulnerabilities exploited by the attackers. To make this analysis easier for experts, the cyber security community has been proactively identifying different vulnerabilities and modelling potential and past threat scenarios using formalisms such as attack graphs, attack trees, knowledge graphs, logical attack graphs, etc. Many organisations have been collecting information about cyber incidents and attacks in the form of Cyber Incident Reports, which contain descriptions of the events that occur during a cyber attack. These reports are semi-structured, with critical information primarily presented in an unstructured text format. Deriving insights from these CTI reports is of paramount importance to the cyber security community as it helps experts design more secure systems while avoiding the vulnerabilities of past systems. This is done by using formalisms such as attack graphs to create a representation of the attacks that can be easily interpreted by machines or experts and aid in analysis. Creating attack graphs from CTI reports requires cyber security domain knowledge and the ability to interpret causality and chronology. In addition to this, CTI reports often contain unstructured, incomplete information, which makes the task even more challenging to automate. In this thesis, we build automated methods of creating two types of attack graph representations, Logical Attack Graphs (LAGs) and Knowledge Graphs (KGs), using LLMs. In the first part of the thesis, we benchmark the performance of two language models, Microsoft Phi-3-mini-128k-instruct (MS-Phi3) and Mistral-7B-Instruct-v0.2 (Mistral), for generating Logical Attack Graphs. We use several prompting methodologies, such as few-shot prompting, to try to improve the performance of the models, and finally evaluate all prompt variations and models using an automated system we built using Sentence-BERT. We find that MS-Phi3, despite being a much smaller model, outperforms Mistral in several benchmarks. In the second part, we switch our attention to Knowledge Graphs and generate these graphs using a version of Microsoft Phi-3-mini-128k-instruct fine-tuned for generating Knowledge Graphs from news articles, and then improve its performance for the cyber security context by using cutting-edge techniques such as supervised fine-tuning and direct preference optimization. Finally, we evaluate our performance using a Sentence-BERT model-based method and find that we can improve the performance of the models using supervised fine-tuning.
URI:	http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/9858
Appears in Collections:	MS THESES

Files in This Item:

File	Description	Size	Format
20201008_Parth_Gupte_MS_Thesis.pdf	MS Thesis	2.54 MB	Adobe PDF
