Please use this identifier to cite or link to this item:
http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/10039
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Siong, Chng Eng | - |
dc.contributor.author | KAPADIA, ANSH JAY | - |
dc.date.accessioned | 2025-05-20T06:39:26Z | - |
dc.date.available | 2025-05-20T06:39:26Z | - |
dc.date.issued | 2025-05 | - |
dc.identifier.citation | 53 | en_US |
dc.identifier.uri | http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/10039 | - |
dc.description.abstract | Speaker diarization, the task of determining "who spoke when" in an audio recording, is a critical component in applications such as meeting transcription, voice assistant technologies, and conversational analysis. Traditional clustering-based diarization methods struggle with overlapping speech, while end-to-end neural diarization (EEND) systems often lack robustness across diverse acoustic conditions. This thesis presents TS-VAD+, an enhanced transformer-based speaker diarization model that builds upon the TS-VAD framework by incorporating state-of-the-art speaker embeddings (ECAPA-TDNN), a WavLM-based speech encoder, and memory-aware attention mechanisms. These improvements aim to address key limitations in handling multi-speaker and overlapping speech scenarios. We evaluate the TS-VAD+ model on the DIHARD III dataset, demonstrating its effectiveness through systematic experiments. Pretraining on wideband simulated data (16 kHz) significantly improved domain adaptation, outperforming narrowband-pretrained models. Further refinements included VBx clustering, voice activity detection (VAD) postprocessing, and data augmentation. While the memory-module TS-VAD+ (mm-TS-VAD+) showed promising results in leveraging external speaker embeddings, its performance gains were limited by the size of the fine-tuning dataset. Overall, TS-VAD+ demonstrates competitive performance in speaker diarization, particularly in high-overlap conditions. Future work could explore self-supervised speaker embeddings, dynamic memory mechanisms, and large-scale augmentation strategies to further enhance diarization accuracy and generalization across diverse domains. | en_US |
dc.language.iso | en | en_US |
dc.subject | DATA SCIENCE | en_US |
dc.subject | DEEP LEARNING | en_US |
dc.subject | SPEAKER DIARIZATION | en_US |
dc.title | TSVAD+ - A Transformer based approach for Speaker Diarization | en_US |
dc.type | Thesis | en_US |
dc.description.embargo | One Year | en_US |
dc.type.degree | BS-MS | en_US |
dc.contributor.department | Dept. of Data Science | en_US |
dc.contributor.registration | 20201268 | en_US |
Appears in Collections: | MS THESES |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
20201268_Ansh_Jay_Kapadia_MS_Thesis.pdf | MS Thesis | 1.52 MB | Adobe PDF | |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.