FinTOC-2022 Shared Task: Financial Document Structure Extraction

To be held at the 4th Financial Narrative Processing Workshop (FNP 2022), Marseille, France, 24 June 2022.

NEW: Registration form: https://tinyurl.com/286d67sc (this is now open as of 03/02/2022)

NEW: Winners of the shared task will receive a free registration to attend the FNP 2022 workshop at LREC 2022, generously provided by the European Language Resources Association (ELRA) http://www.elra.info/en/.


Important Dates:

– 1st Call for papers & shared task participants: 10 January 2022

– 2nd Call for papers & shared task participants: 1 March 2022

– Training set release: 25 February 2022 → 09 March 2022 (Extended)

– Blind test set release: 25 March 2022 → 31 March 2022 (Extended)

– Systems submission: 1 April 2022 → 07 April 2022 (Extended)

– Release of results: 5 April 2022 → 13 April 2022 (Extended)

– Paper submission deadline: 8 April 2022 → 12 April 2022 (Extended)

– Papers notification of acceptance: 3 May 2022

– Workshop date: 24 June 2022 (full day event)

 


Winners🎉:

The winning team of the FinTOC 2022 shared task is 🏅Team ISP RAS🏅 from the Ivannikov Institute for System Programming of the RAS, Russia. Congratulations 😀🥳👍👏!

ISP RAS team members:

Anastasiia Olegovna Bogatenkova, Oksana Vladimirovna Belyaeva, Andrew Igorevich Perminov and Ilya Sergeevich Kozlov

 

For both subtasks on the Spanish data, the winner is team 🏅swapUNIBA🏅 from the University of Bari and Objectway SpA, Italy. Congratulations 😀🥳👍👏!

swapUNIBA team members:

Pierluigi Cassotti, Cataldo Musto, Marco de Gemmis, Georgios Lekkas and Giovanni Semeraro


Awards:

Winners of the shared task will receive a free registration to attend the FNP 2022 workshop at LREC 2022, generously provided by the European Language Resources Association (ELRA) http://www.elra.info/en/.

The free registration can be used by only one of the team members. We’ll get in touch with the winning team and ask for the name of the person attending and presenting the paper.

The winning teams will also receive a cash prize generously provided by Fortia.


Support:

FinTOC 2022 is supported by the European Language Resources Association (ELRA) http://www.elra.info/en/. ELRA will provide a free workshop registration to attend LREC 2022 to the winning team of FinTOC 2022. See details in the Awards section above.


Introduction:

A vast number of financial documents are created and published constantly in machine-readable formats (generally PDF), with only minimal structure information. Firms use such documents, typically corporate annual reports containing detailed financial and operational information, to report their activities, financial situation or potential investment plans to shareholders, investors and the financial markets.

In some countries, such as the US or France, regulators such as the SEC (EDGAR) or the AMF require firms to follow a certain template when reporting their financial results, to ensure standardisation and consistency across firms’ disclosures. In other European countries, on the other hand, management usually has more discretion over what, where and how to report, resulting in a lack of standardisation between financial documents published within the same market.

Existing work on book and document table of contents (TOC) recognition has almost all been on small, application-dependent and domain-specific datasets. However, the TOCs of documents from different domains differ significantly in their visual layout and style, making TOC recognition a challenging problem for a large-scale collection of heterogeneous documents and books. Compared to regular books (mostly provided in a full-text format with limited structural information such as pages and paragraphs), financial documents, which contain textual and non-textual content, have a more sophisticated structure including parts, sections, sub-sections and sub-sub-sections.

In this shared task, we focus on analysing financial prospectuses: official PDF documents in which investment funds precisely describe their characteristics and investment modalities. Although the content they must include is often regulated, their format is not standardised and varies widely, from plain text to more graphical and tabular presentations of data and information. The majority of prospectuses are published without a table of contents (TOC), which is usually needed to help readers navigate the document by following a simple outline of headers and page numbers, and to help legal teams check that all the required contents are included. Thus, automatic analysis of prospectuses to extract their structure is becoming more and more vital to many firms across the world.
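As a rough illustration of what "a simple outline of headers and page numbers" can look like as data, the sketch below models a TOC as a flat list of entries and prints it as an indented outline. The class name, field names and sample titles are all hypothetical; this is not the official FinTOC data format, which is not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TocEntry:
    """One header in a table of contents (hypothetical structure)."""
    title: str  # header text as it appears in the document
    depth: int  # 1 = part, 2 = section, 3 = sub-section, ...
    page: int   # page on which the header appears


def render_outline(entries: List[TocEntry]) -> str:
    """Return an indented plain-text outline, one line per header."""
    lines = []
    for entry in entries:
        indent = "  " * (entry.depth - 1)
        lines.append(f"{indent}{entry.title} ... {entry.page}")
    return "\n".join(lines)


# Invented sample entries, for illustration only.
toc = [
    TocEntry("Fund characteristics", 1, 2),
    TocEntry("Investment objective", 2, 2),
    TocEntry("Risk profile", 2, 4),
    TocEntry("Fees and charges", 1, 7),
]

print(render_outline(toc))
```

Under this reading, title detection would decide which text blocks become entries at all, and TOC generation would assign each entry its depth and order.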

Thanks to the contribution of the Autonomous University of Madrid (UAM, Spain), the fourth edition of the FinTOC shared task welcomes a new track for Spanish documents in addition to English and French, and it will score systems on both title detection and TOC generation performance, as in previous editions.

Participants need to register. Once registered, all participating teams will be provided with a common training dataset containing PDF documents and the associated TOC annotation.


Task:

The fourth edition of the FinTOC shared task keeps the two tracks of the FinTOC’2 and FinTOC’3 editions, one for English documents and another for French documents, alongside the new track for Spanish documents, and it will score systems on both title detection and TOC generation performance. We have revised the task and greatly simplified the data formats to make it as smooth as possible for every interested researcher to participate and submit their systems’ outputs at FinTOC’4.



How to participate:

To participate, please use the registration form to add details of your team: https://tinyurl.com/286d67sc (this is now open as of 03/02/2022)


Data Format and Evaluation:

TBA


Shared task Paper Submission Instructions:

TBA


Shared Task Organisers:

– Abderrahim Aitazzi, Fortia Financial Solutions
– Sandra Bellato, Fortia Financial Solutions
– Blanca Carbajo Coronado, Universidad Autónoma de Madrid
– Dr Ismail El Maarouf, Fortia Financial Solutions
– Dr Juyeon Kang, Fortia Financial Solutions
– Prof. Ana Gisbert, Universidad Autónoma de Madrid
– Prof. Antonio Moreno Sandoval, Universidad Autónoma de Madrid


Shared Task Contact:

Questions about the FinTOC-2022 shared task can be sent to:

fin.toc.task@gmail.com