Algorithms for Table Structure Recognition


Yosveni Escalona http://orcid.org/0000-0003-2992-0540

Abstract

Tables are widely used to organize and publish data. The Web, for example, contains an enormous number of tables, published in HTML, embedded in PDF documents, or available as files that can be downloaded from Web pages. However, tables are not always easy to interpret because of the variety of features and formats they use, and a large number of methods and tools have been developed to interpret them. This work presents the implementation of an algorithm, based on Conditional Random Fields (CRFs), that classifies the rows of a table as header rows, data rows, or metadata rows. The implementation is complemented by two algorithms for table recognition in spreadsheet documents, one based on rules and the other on region detection. Finally, the work describes the results and benefits obtained by applying the implemented algorithm to HTML tables obtained from the Web and to spreadsheet tables downloaded from the Brazilian National Petroleum Agency.
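To make the row-classification idea concrete, the following is a minimal sketch of how a linear-chain model labels the rows of a table as a sequence rather than one row at a time. It is not the implementation described in this work: the two row features, the emission scores, and the transition scores are hypothetical values set by hand for illustration, whereas a real CRF would learn its weights from labeled tables. Only the decoding step (the Viterbi algorithm) is shown.

```python
from typing import Dict, List, Sequence, Tuple

LABELS = ("metadata", "header", "data")


def row_features(row: Sequence[str]) -> Dict[str, float]:
    """Hypothetical features: share of non-empty cells and share of numeric cells."""
    cells = [c.strip() for c in row]
    filled = [c for c in cells if c]
    numeric = [
        c for c in filled
        if c.replace(".", "", 1).replace(",", "").lstrip("-").isdigit()
    ]
    return {
        "filled": len(filled) / max(len(cells), 1),
        "numeric": len(numeric) / max(len(filled), 1),
    }


def emission_scores(feats: Dict[str, float]) -> Dict[str, float]:
    """Hand-set scores (an assumption, not learned weights) for each label."""
    filled, numeric = feats["filled"], feats["numeric"]
    return {
        "metadata": 2.0 * (1.0 - filled) - numeric,  # sparse, textual rows
        "header": 2.0 * filled - 2.0 * numeric,      # full, textual rows
        "data": filled + 2.0 * numeric,              # full, mostly numeric rows
    }


# Hand-set transition scores (an assumption); unlisted label pairs score 0.
TRANSITIONS: Dict[Tuple[str, str], float] = {
    ("metadata", "header"): 0.5,
    ("header", "data"): 1.0,
    ("data", "data"): 1.0,
    ("data", "header"): -2.0,
    ("header", "metadata"): -2.0,
    ("data", "metadata"): -0.5,
}


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring label sequence for the given emission scores."""
    if not emissions:
        return []
    # best[t][y] is the best score of any labeling of rows 0..t that ends in y.
    best = [dict(emissions[0])]
    back: List[Dict[str, str]] = []
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for cur in LABELS:
            prev = max(
                LABELS,
                key=lambda p: best[-1][p] + TRANSITIONS.get((p, cur), 0.0),
            )
            scores[cur] = best[-1][prev] + TRANSITIONS.get((prev, cur), 0.0) + em[cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label to the first row.
    path = [max(LABELS, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def classify_rows(rows: Sequence[Sequence[str]]) -> List[str]:
    """Label every row of a table as metadata, header, or data."""
    return viterbi([emission_scores(row_features(r)) for r in rows])


table = [
    ["Monthly oil production", "", ""],
    ["Field", "Year", "Barrels"],
    ["A", "2019", "1200"],
    ["B", "2020", "1350"],
]
print(classify_rows(table))  # ['metadata', 'header', 'data', 'data']
```

The transition scores are what distinguish sequence labeling from per-row classification: a header row following a data row is penalized, so an ambiguous row is resolved using the labels of its neighbors.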
