Abstract:
The primary goal of our project is to create a non-deep-learning
solution for segmenting cells in tabular data that works on
tables with or without gridlines.
We have devised an algorithm based on K-Means clustering that
segments table cells regardless of whether gridlines are present.
Our approach identifies clusters of characters, which typically
correspond to words or numbers, and computes their centres of
mass. We then build separate arrays for the x and y coordinates of
these centres and apply K-Means clustering to each array
independently. For each axis, the number of clusters 'k' is chosen
from the range 1 to a predefined maximum ('max_k') using a novel
selection method, as the existing methods yielded unsatisfactory
results. Running K-Means with the chosen 'k' on the x coordinates
yields the columns, and on the y coordinates the rows; individual
cells are then identified as the intersections of these rows and
columns.
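As an illustration only, the row and column clustering step might be
sketched as follows. This is a minimal pure-Python sketch, not our
implementation: it assumes the centres of mass have already been
computed, takes the number of rows and columns as given inputs rather
than reproducing the 'k' selection method, and uses a plain Lloyd's
iteration with a deterministic initialisation chosen for this example.
The names kmeans_1d and assign_cells are hypothetical.

```python
from statistics import mean


def kmeans_1d(values, k, iters=100):
    """Plain Lloyd's K-Means on scalars.

    Returns (labels, centres), relabelled so that label 0 is the
    cluster with the smallest centre.
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    # Assumption: deterministic start, centres spread over the sorted values.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assignment step: each value joins its nearest centre.
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centres.append(mean(members) if members else centres[c])
        if new_centres == centres:
            break
        centres = new_centres
    # Relabel clusters in ascending order of their centres.
    order = sorted(range(k), key=lambda c: centres[c])
    rank = {c: r for r, c in enumerate(order)}
    return [rank[lab] for lab in labels], [centres[c] for c in order]


def assign_cells(centres, n_rows, n_cols):
    """Cluster x and y separately, then intersect rows and columns.

    n_rows and n_cols stand in for the 'k' chosen per axis.
    Returns {(row, col): [indices of centres in that cell]}.
    """
    xs = [x for x, _ in centres]
    ys = [y for _, y in centres]
    col_of, _ = kmeans_1d(xs, n_cols)
    row_of, _ = kmeans_1d(ys, n_rows)
    cells = {}
    for idx, (r, c) in enumerate(zip(row_of, col_of)):
        cells.setdefault((r, c), []).append(idx)
    return cells


# Hypothetical centres of mass for a table with 2 rows and 3 columns.
word_centres = [(10, 12), (52, 11), (95, 13), (11, 40), (50, 41), (96, 39)]
cells = assign_cells(word_centres, n_rows=2, n_cols=3)
```

Here each of the six centres lands in its own (row, column) cell. In
practice the values passed as n_rows and n_cols would come from the
'k' selection step described above.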
In addition, we have developed an alternative algorithm tailored to
tables that contain gridlines. Here, we use Canny edge detection
and the Hough transform to detect lines and then compute their
intersection points. From these intersection points we identify the
gridlines, and from the detected gridlines we reconstruct the table
structure.
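The intersection step can be illustrated with a small sketch. It is
not our implementation and rests on several assumptions: the lines are
already available in Hough normal form (rho, theta), as returned for
example by OpenCV's cv2.HoughLines; edge and line detection are not
run here; duplicate detections of the same gridline have already been
merged; and lines are split into vertical and horizontal by angle,
which is a simplification made for this example. The names intersect
and grid_cells are hypothetical.

```python
import math


def intersect(line_a, line_b, eps=1e-9):
    """Intersect two lines given as (rho, theta), where
    x*cos(theta) + y*sin(theta) = rho. Returns None if parallel."""
    rho1, th1 = line_a
    rho2, th2 = line_b
    # Determinant of the 2x2 system, equal to sin(th2 - th1).
    det = math.cos(th1) * math.sin(th2) - math.sin(th1) * math.cos(th2)
    if abs(det) < eps:
        return None
    # Cramer's rule.
    x = (rho1 * math.sin(th2) - rho2 * math.sin(th1)) / det
    y = (rho2 * math.cos(th1) - rho1 * math.cos(th2)) / det
    return (x, y)


def grid_cells(lines, angle_tol=math.radians(10)):
    """Return rows of cells, each cell as (top_left, bottom_right).

    Assumption: gridlines are close to axis-aligned, within angle_tol.
    """
    limit = math.sin(angle_tol)
    # theta near 0 (or pi): the normal points along x, so the line is vertical.
    vertical = sorted((l for l in lines if abs(math.sin(l[1])) < limit),
                      key=lambda l: l[0] / math.cos(l[1]))
    # theta near pi/2: the normal points along y, so the line is horizontal.
    horizontal = sorted((l for l in lines if abs(math.cos(l[1])) < limit),
                        key=lambda l: l[0] / math.sin(l[1]))
    # Grid of corner points, one per horizontal/vertical pair.
    corners = [[intersect(h, v) for v in vertical] for h in horizontal]
    return [[(corners[i][j], corners[i + 1][j + 1])
             for j in range(len(vertical) - 1)]
            for i in range(len(horizontal) - 1)]


# Hypothetical input: vertical lines at x = 0, 50, 100 and
# horizontal lines at y = 0, 30, 60, in arbitrary order.
lines = [(100, 0.0), (0, 0.0), (50, 0.0),
         (60, math.pi / 2), (0, math.pi / 2), (30, math.pi / 2)]
cells = grid_cells(lines)
```

With three vertical and three horizontal lines this gives a 2 x 2 grid
of cells. Real Hough output usually contains several near-identical
lines per gridline, so a merging step would be needed before this one.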