Please use this identifier to cite or link to this item: http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/7933
Title: Application of imitation learning in automating end-to-end exploratory data analysis
Other Titles: Application of imitation learning in automating end to end exploratory data analysis
Authors: Pate, Hima
Manwani, Naresh
PATEL, DEVARSH
Dept. of Data Science
20181222
Keywords: REINFORCEMENT LEARNING
IMITATION LEARNING
GENERATIVE ADVERSARIAL IMITATION LEARNING
GAIL
EXPLORATORY DATA ANALYSIS
EDA
Issue Date: May-2023
Citation: 80
Abstract: One of the open problems in data science is how to automate the end-to-end EDA process, which involves exploring the dataset, identifying patterns, outliers, and relationships among variables, and preparing the data for further analysis or modeling. Some of the existing approaches frame this problem as a sequential decision-making problem and use Reinforcement Learning (RL) to solve it. However, a major challenge in this approach is how to define and assign rewards for each action (such as GROUP, FILTER, etc.) that is taken during the EDA process. These rewards are essential for RL to learn an optimal policy. The rewards are usually manually defined using various interestingness measures that capture how relevant or informative an action is given the current state of the analysis. However, these measures may not be able to capture all the important aspects of an action, such as its impact on subsequent actions or its alignment with the analysis goals. We present a novel end-to-end EDA method that learns to perform data analysis tasks from human expert EDA notebooks without explicitly relying on any interestingness measures. Our method uses an imitation learning framework that learns the optimal policy for EDA by mimicking the actions of expert data analysts. Specifically, we employ generative adversarial imitation learning (GAIL), which allows our model to capture the essential aspects of data analysis in various domains. Our method can generate EDA notebooks that are comparable to human-generated ones in terms of quality and diversity. The proposed approach is able to generate EDA sessions on different datasets that share the same schema. We evaluate our method on existing datasets for AutoEDA benchmarking and on synthetic datasets. We show that our method surpasses the current state-of-the-art end-to-end EDA method on various performance metrics and can generalize well to unseen datasets.
Moreover, we show that, as a byproduct, the EDA sessions generated by the model learned with our method use a diverse set of interestingness measures at each step of the EDA process.
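To make the idea in the abstract concrete, the following is a minimal toy sketch of adversarial imitation learning applied to an EDA-like action sequence. It is not the thesis implementation. Everything beyond the FILTER and GROUP action names is an assumption made for illustration: the extra BACK action, the use of the previous action as the state, the logistic-regression discriminator, the three-step expert session, and the plain REINFORCE policy update with immediate rewards (swapped in here for whatever policy optimizer the thesis uses). The point it illustrates is that the reward comes from a learned discriminator rather than from a hand-written interestingness measure.

```python
import numpy as np

# Toy action set; only FILTER and GROUP are named in the abstract, BACK is assumed.
ACTIONS = ["FILTER", "GROUP", "BACK"]
N = len(ACTIONS)

def one_hot_pair(state, action):
    """Encode a (previous action, current action) pair as a one-hot vector."""
    x = np.zeros(N * N)
    x[state * N + action] = 1.0
    return x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator(w, state, action):
    """Estimated probability that the pair came from the expert, not the policy."""
    return sigmoid(w @ one_hot_pair(state, action))

def surrogate_reward(w, state, action, eps=1e-8):
    """Learned reward: large when the discriminator takes the pair for expert data."""
    return -np.log(1.0 - discriminator(w, state, action) + eps)

def policy_probs(theta, state):
    """Tabular softmax policy over the next action given the previous action."""
    logits = theta[state] - theta[state].max()
    p = np.exp(logits)
    return p / p.sum()

def rollout(theta, rng, length=6, start=0):
    """Sample one session as a list of (state, action) pairs (start state is assumed)."""
    pairs, state = [], start
    for _ in range(length):
        action = int(rng.choice(N, p=policy_probs(theta, state)))
        pairs.append((state, action))
        state = action
    return pairs

def train(expert_pairs, iters=300, lr_d=0.5, lr_pi=0.2, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N * N)        # discriminator weights
    theta = np.zeros((N, N))   # policy logits, one row per state
    for _ in range(iters):
        policy_pairs = rollout(theta, rng)

        # Discriminator step: ascend mean log D(expert) + mean log(1 - D(policy)).
        grad_w = np.zeros_like(w)
        for s, a in expert_pairs:
            grad_w += (1.0 - discriminator(w, s, a)) * one_hot_pair(s, a) / len(expert_pairs)
        for s, a in policy_pairs:
            grad_w -= discriminator(w, s, a) * one_hot_pair(s, a) / len(policy_pairs)
        w += lr_d * grad_w

        # Policy step: REINFORCE using the immediate learned reward (no baseline).
        grad_theta = np.zeros_like(theta)
        for s, a in policy_pairs:
            grad_logp = -policy_probs(theta, s)
            grad_logp[a] += 1.0
            grad_theta[s] += surrogate_reward(w, s, a) * grad_logp
        theta += lr_pi * grad_theta / len(policy_pairs)
    return w, theta

# Hypothetical expert session: FILTER -> GROUP -> BACK -> FILTER -> ...
expert_pairs = [(0, 1), (1, 2), (2, 0)]
w, theta = train(expert_pairs)
print({ACTIONS[a]: round(float(p), 3) for a, p in enumerate(policy_probs(theta, 0))})
```

In this sketch the discriminator weight for a pair that never appears in the expert data can only decrease, so such pairs earn little reward and the policy is nudged toward the expert's transitions. A realistic system would need a far richer state (for example, a representation of the current data view) and a more stable policy optimizer.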
URI: http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/7933
Appears in Collections:MS THESES

Files in This Item:
File                              Description  Size     Format
20181222_Devarsh_Patel_MS_Thesis  MS Thesis    1.15 MB  Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.