Digital Repository

Application of imitation learning in automating end-to-end exploratory data analysis

dc.contributor.advisor Pate, Hima
dc.contributor.advisor Manwani, Naresh
dc.contributor.author PATEL, DEVARSH
dc.date.accessioned 2023-05-19T07:00:33Z
dc.date.available 2023-05-19T07:00:33Z
dc.date.issued 2023-05
dc.identifier.citation 80 en_US
dc.identifier.uri http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/7933
dc.description.abstract One of the open problems in data science is how to automate the end-to-end EDA process, which involves exploring the dataset, identifying patterns, outliers, and relationships among variables, and preparing the data for further analysis or modeling. Some of the existing approaches try to frame this problem as a Sequential Decision Making Problem and use Reinforcement Learning (RL) to solve it. However, a major challenge in this approach is how to define and assign rewards for each action (such as GROUP, FILTER, etc.) that is taken during the EDA process. These rewards are essential for RL to learn an optimal policy. The rewards are usually manually defined using various interestingness measures that capture how relevant or informative an action is given the current state of the analysis. However, these measures may not be able to capture all the important aspects of an action, such as its impact on subsequent actions or its alignment with the analysis goals. We present a novel end-to-end EDA method that learns to perform data analysis tasks from human expert EDA notebooks without explicitly relying on any interestingness measures. Our method uses an imitation learning framework that learns the optimal policy for EDA by mimicking the actions of expert data analysts. Specifically, we employ generative adversarial imitation learning (GAIL), which allows our model to capture the essential aspects of data analysis in various domains. Our method can generate EDA notebooks that are comparable to human-generated ones in terms of quality and diversity. The proposed approach is able to generate EDA sessions on different datasets that share the same schema. We evaluate our method on existing datasets for AutoEDA benchmarking and on synthetic datasets. We show that our method surpasses the current state-of-the-art end-to-end EDA method on various performance metrics and can generalize well on unseen datasets.
Moreover, we show that the EDA sessions (generated using the learned model with our method) use a diverse set of interestingness measures for each step of the EDA process as a byproduct. en_US
dc.language.iso en en_US
dc.subject REINFORCEMENT LEARNING en_US
dc.subject IMITATION LEARNING en_US
dc.subject GENERATIVE ADVERSARIAL IMITATION LEARNING en_US
dc.subject GAIL en_US
dc.subject EXPLORATORY DATA ANALYSIS en_US
dc.subject EDA en_US
dc.title Application of imitation learning in automating end-to-end exploratory data analysis en_US
dc.title.alternative Application of imitation learning in automating end to end exploratory data analysis en_US
dc.type Thesis en_US
dc.description.embargo One Year en_US
dc.type.degree BS-MS en_US
dc.contributor.department Dept. of Data Science en_US
dc.contributor.registration 20181222 en_US


This item appears in the following Collection(s)

  • MS THESES [1705]
    Thesis submitted to IISER Pune in partial fulfilment of the requirements for the BS-MS Dual Degree Programme/MSc. Programme/MS-Exit Programme
