FinTOC-2020 Shared Task:
“Financial Document Structure Extraction”
To be held at the 28th International Conference on Computational Linguistics (COLING’2020), Barcelona, Spain, on 12 December 2020 [New Date].
NEW: Participation Form: https://forms.gle/LFsVaw6DqYikhKHx9
Important Dates [New Dates]:
- Registration opens December 1, 2019
- Release of training set February 17, 2020
- Release of test set March 23, 2020
- registration deadline May 30, 2020
- result submission deadline June 30, 2020
- release of results July 30, 2020
- Shared task papers due September 1, 2020
- Notification of acceptance October 1, 2020
- Camera-ready papers due November 1, 2020
- Workshop and shared task dates December 12, 2020
A vast number of financial documents are created and published constantly in machine-readable formats (generally PDF), with only minimal structural information. Firms use such documents, chiefly corporate annual reports containing detailed financial and operational information, to report their activities, financial situation, or potential investment plans to shareholders, investors, and the financial markets.
In some countries, such as the US or France, regulators such as the SEC (EDGAR) or the AMF require firms to follow a certain template when reporting their financial results, to ensure standardisation and consistency across firms’ disclosures. In other European countries, on the other hand, management usually has more discretion over what, where, and how to report, resulting in a lack of standardisation between financial documents published within the same market.
In this shared task, we focus on analysing financial prospectuses: official PDF documents in which investment funds precisely describe their characteristics and investment modalities. Although the content they must include is often regulated, their format is not standardised and displays a great deal of variability, ranging from plain text to more graphical and tabular presentations of data and information. The majority of prospectuses are published without a table of contents (TOC), which is usually needed to help readers navigate the document by following a simple outline of headers and page numbers, and to assist legal teams in checking that all required contents are included. Thus, automatic analysis of prospectuses to extract their structure is becoming increasingly vital to many firms across the world.
The second edition of the FinTOC shared task proposes two tracks, one for English documents and another for French documents, and it will score systems on both title detection and TOC generation performance. We have revised the task and greatly simplified the data formats to make it as smooth as possible for every interested researcher to participate and submit their systems’ outputs at FinTOC’2.
Participants need to register. Once registered, all participating teams will be provided with a common training dataset containing PDF documents and the associated TOC annotation.
Existing work on book and document table of contents (TOC) recognition has been carried out almost entirely on small, application-dependent, and domain-specific datasets. However, the TOCs of documents from different domains differ significantly in their visual layout and style, making TOC recognition a challenging problem for a large-scale collection of heterogeneous documents and books. Compared to regular books (mostly provided in a full-text format with limited structural information such as pages and paragraphs), financial documents, which contain textual and non-textual content, have a more sophisticated structure including parts, sections, sub-sections, and sub-sub-sections.
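As an informal illustration of the hierarchical structure described above, the following Python sketch represents a TOC as a flat list of entries, each carrying a title, a depth, and a page number, and renders it as an indented outline. The `TocEntry` class, the `render_outline` function, and the example titles are hypothetical and chosen only for illustration; this is not the official FinTOC-2020 data format, which is defined in the Data Format Details document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TocEntry:
    """One TOC line (hypothetical structure, not the official task format)."""
    title: str  # text of the detected title
    depth: int  # 1 = part, 2 = section, 3 = sub-section, ...
    page: int   # page on which the title appears


def render_outline(entries: List[TocEntry], indent: str = "  ") -> str:
    """Render TOC entries as an indented outline, one title per line."""
    lines = []
    for entry in entries:
        prefix = indent * (entry.depth - 1)
        lines.append(f"{prefix}{entry.title} ... {entry.page}")
    return "\n".join(lines)


# Made-up titles, for illustration only.
toc = [
    TocEntry("General Information", 1, 3),
    TocEntry("Investment Objective", 2, 3),
    TocEntry("Risk Factors", 2, 5),
    TocEntry("Market Risk", 3, 5),
]
print(render_outline(toc))
```

Storing the depth explicitly on each entry keeps the list flat while still capturing the nesting of parts, sections, and sub-sections.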
Data Format and Evaluation:
The following PDF file describes the data format and evaluation metric used in the shared task: Data Format Details
Each team should write a short paper describing their methods. The paper will be published on ACL Anthology in the FNP 2020 proceedings as part of COLING 2020.
Shared task Paper Submission Instructions:
Submission now open.
Please include the shared task name in the title of submission. E.g. “Paper Title at FinTOC-2020”
The submission format should follow the COLING 2020 author kit: https://coling2020.org/pages/submission
Please follow the guidelines and use the COLING-2020 style files in coling2020.zip
It contains LaTeX files, a Microsoft Word template file, and a sample PDF file.
Shared task paper submissions are short papers of no more than 4 pages plus unlimited references. Teams with multiple runs can submit only one paper explaining their methods.
Shared Task Organisers:
- Dr Dialekti Valsamou, Fortia Financial Solutions
- Dr Ismail El Maarouf, Fortia Financial Solutions
- Najah-Imane Bentabet, Fortia Financial Solutions
- Rémi Juge, Fortia Financial Solutions
- Virginie Mouilleron, Fortia Financial Solutions
Shared Task Contact:
Questions about the FinTOC-2020 shared task can be sent to: