TSVAD+ - A Transformer based approach for Speaker Diarization

KAPADIA, ANSH JAY

dc.contributor.advisor	Siong, Chng Eng
dc.contributor.author	KAPADIA, ANSH JAY
dc.date.accessioned	2025-05-20T06:39:26Z
dc.date.available	2025-05-20T06:39:26Z
dc.date.issued	2025-05
dc.identifier.citation	53	en_US
dc.identifier.uri	http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/10039
dc.description.abstract	Speaker diarization, the task of determining "who spoke when" in an audio recording, is a critical component in applications such as meeting transcription, voice assistant technologies, and conversational analysis. Traditional clustering-based diarization methods struggle with overlapping speech, while end-to-end neural diarization (EEND) systems often lack robustness across diverse acoustic conditions. This thesis presents TS-VAD+, an enhanced transformer-based speaker diarization model that builds upon the TS-VAD framework by incorporating state-of-the-art speaker embeddings (ECAPA-TDNN), a WavLM-based speech encoder, and memory-aware attention mechanisms. These improvements aim to address key limitations in handling multi-speaker and overlapping speech scenarios. We evaluate the TS-VAD+ model on the DIHARD III dataset, demonstrating its effectiveness through systematic experiments. Pretraining on wideband simulated data (16 kHz) significantly improved domain adaptation, outperforming narrowband-pretrained models. Further refinements included VBx clustering, voice activity detection (VAD) postprocessing, and data augmentation. While the memory-module TS-VAD+ (mm-TS-VAD+) showed promising results in leveraging external speaker embeddings, its performance gains were limited by the size of the fine-tuning dataset. Overall, TS-VAD+ demonstrates competitive performance in speaker diarization, particularly in high-overlap conditions. Future work could explore self-supervised speaker embeddings, dynamic memory mechanisms, and large-scale augmentation strategies to further enhance diarization accuracy and generalization across diverse domains.	en_US
dc.language.iso	en	en_US
dc.subject	DATA SCIENCE	en_US
dc.subject	DEEP LEARNING	en_US
dc.subject	SPEAKER DIARIZATION	en_US
dc.title	TSVAD+ - A Transformer based approach for Speaker Diarization	en_US
dc.type	Thesis	en_US
dc.description.embargo	One Year	en_US
dc.type.degree	BS-MS	en_US
dc.contributor.department	Dept. of Data Science	en_US
dc.contributor.registration	20201268	en_US

Files in this item

Name: 20201268_Ansh_Jay ...

Size: 1.484Mb

Format: PDF

Description: MS Thesis

This item appears in the following Collection(s)

MS THESES [2219]
Thesis submitted to IISER Pune in partial fulfilment of the requirements for the BS-MS Dual Degree Programme/MSc. Programme/MS-Exit Programme
