Application of imitation learning in automating end-to-end exploratory data analysis

PATEL, DEVARSH

Please use this identifier to cite or link to this item: http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/7933

Title:	Application of imitation learning in automating end-to-end exploratory data analysis
Other Titles:	Application of imitation learning in automating end to end exploratory data analysis
Authors:	Pate, Hima; Manwani, Naresh; PATEL, DEVARSH; Dept. of Data Science; 20181222
Keywords:	REINFORCEMENT LEARNING; IMITATION LEARNING; GENERATIVE ADVERSARIAL IMITATION LEARNING; GAIL; EXPLORATORY DATA ANALYSIS; EDA
Issue Date:	May-2023
Citation:	80
Abstract:	One of the open problems in data science is how to automate the end-to-end exploratory data analysis (EDA) process, which involves exploring the dataset, identifying patterns, outliers, and relationships among variables, and preparing the data for further analysis or modeling. Some existing approaches frame this as a sequential decision-making problem and use Reinforcement Learning (RL) to solve it. A major challenge in this approach is how to define and assign rewards for each action (such as GROUP, FILTER, etc.) taken during the EDA process. These rewards are essential for RL to learn an optimal policy. They are usually defined manually using various interestingness measures that capture how relevant or informative an action is given the current state of the analysis. However, these measures may fail to capture important aspects of an action, such as its impact on subsequent actions or its alignment with the analysis goals. We present a novel end-to-end EDA method that learns to perform data analysis tasks from human expert EDA notebooks without explicitly relying on any interestingness measures. Our method uses an imitation learning framework that learns the optimal policy for EDA by mimicking the actions of expert data analysts. Specifically, we employ generative adversarial imitation learning (GAIL), which allows our model to capture the essential aspects of data analysis in various domains. Our method can generate EDA notebooks that are comparable to human-generated ones in terms of quality and diversity. The proposed approach is able to generate EDA sessions on different datasets that share the same schema. We evaluate our method on existing datasets for AutoEDA benchmarking and on synthetic datasets. We show that our method surpasses the current state-of-the-art end-to-end EDA method on various performance metrics and generalizes well to unseen datasets. Moreover, we show that, as a byproduct, the EDA sessions generated by the model learned with our method use a diverse set of interestingness measures at each step of the EDA process.
URI:	http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/7933
Appears in Collections:	MS THESES

Files in This Item:

File	Description	Size	Format
20181222_Devarsh_Patel_MS_Thesis	MS Thesis	1.15 MB	Adobe PDF
