FinTOC-2023 Shared Task: Financial Document Structure Extraction

To be held at The 5th Financial Narrative Processing Workshop (FNP 2023), Sorrento, Italy, 15-18 December 2023.


Important Dates:

  • 1st Call for papers & shared task participants: June 12, 2023
  • 2nd Call for papers & shared task participants: July 17, 2023
  • Final Call for papers & shared task participants: August 17, 2023
  • Training set release: August 21, 2023
  • Blind test set release: September 21, 2023
  • Systems submission: October 03, 2023
  • Release of results: October 09, 2023
  • Paper submission deadline: October 30, 2023 (anywhere in the world)
  • Notification of paper acceptance to authors: November 12, 2023
  • Camera-ready of accepted papers: November 20, 2023
  • Workshop date (1-day event): December 15-18, 2023 (exact date to be announced)

Introduction:

A vast and continuously growing volume of financial documents is being created and published in machine-readable formats, predominantly PDF. Unfortunately, these documents often lack comprehensive structural information, which makes efficient analysis and interpretation difficult. Nevertheless, they play a crucial role in enabling firms to report their activities, financial situation, and investment plans to shareholders, investors, and the financial markets. Corporate annual reports, for example, offer detailed financial and operational information.

In certain countries, such as the United States and France, regulators such as the SEC (Securities and Exchange Commission) and the AMF (Financial Markets Authority) require firms to adhere to specific reporting templates. These regulations aim to promote standardization and consistency across firms’ disclosures. However, in various other European countries, management typically has more flexibility in determining what, where, and how to report financial information, resulting in a lack of standardization among financial documents published within the same market.

Although some research has addressed table of contents (TOC) recognition in books and documents, most existing work has focused on small-scale, application-dependent, and domain-specific datasets. This limited scope poses challenges when dealing with a vast collection of heterogeneous documents and books, where TOCs from different domains vary significantly in visual layout and style. Consequently, recognizing and extracting TOCs becomes an intricate problem. Compared with regular books, which are typically provided in a full-text format with limited structural information such as pages and paragraphs, financial documents have a more complex structure. They consist of parts, sections, sub-sections, and even sub-sub-sections, incorporating both textual and non-textual content. Moreover, TOC pages are not always present to help readers navigate the document, and when they are, they often provide access only to the main sections.

In this shared task, our objective is to analyze several types of financial documents:

  1. KIID: Key Investor Information Document.
  2. Prospectus: official PDF documents where investment funds meticulously describe their characteristics and investment modalities.
  3. Règlement and Financial Annual Reports/Financial Statements: documents that provide a detailed overview of a company’s financial performance and operations over the course of a fiscal year.

These documents play a vital role in providing crucial information to investors, stakeholders, and regulatory bodies. While the content they must contain is often prescribed and regulated, their format lacks standardization, leading to a significant degree of variability. The presentation styles range from plain text format to more visually rich and data-driven graphical and tabular representations. Notably, the majority of those documents are published without a table of contents.

A TOC is typically essential for readers, as it enables easy navigation within the document by providing a clear outline of headers and their corresponding page numbers. TOCs also serve as a valuable resource for legal teams, who use them to verify that all required content is included. Consequently, the automated analysis of these documents to extract their structure is becoming increasingly useful for numerous firms worldwide.

Our primary focus for this edition is to expand the extraction of table of contents to a wider variety of financial documents, and the task will involve developing highly efficient algorithms and methodologies to address the challenges associated with such a dataset. Our aim is to achieve a level of generalization, ensuring that the developed system can be applied to different types of financial documents. This way, we want to demonstrate the versatility and effectiveness of the ML algorithms used in TOC extraction, enabling a streamlined and consistent approach across various financial document types.

In addition, for this edition, we are excited to introduce a dataset that goes beyond textual annotations. Our proposed dataset will include visual (spatial) annotations that capture the coordinates of the titles and the hierarchical structure of the documents. This comprehensive approach enables a more holistic analysis and understanding of financial documents.

By incorporating visual annotations, we can capture the visual cues and design elements that contribute to the overall structure and organization of the documents. This allows us to delve deeper into the visual representation of the table of contents and extract valuable insights from the visual hierarchy present in these financial documents. The combination of textual and visual annotations provides a richer and more nuanced dataset, making it possible to increase the accuracy and effectiveness of the machine learning algorithms and methodologies employed in TOC extraction.
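As an illustration only, the combination of textual, spatial, and hierarchical information described above could be represented as follows. This is a minimal sketch under stated assumptions: the official data format has not been announced, so the names `TitleAnnotation`, `TocNode`, and `build_toc`, the field names, and the coordinate convention are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TitleAnnotation:
    """One annotated title (hypothetical schema, not the official format)."""
    text: str                                 # textual annotation
    page: int                                 # 1-based page number (assumed)
    depth: int                                # 1 = part, 2 = section, ...
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1), assumed order


@dataclass
class TocNode:
    """A node of the reconstructed table of contents."""
    title: TitleAnnotation
    children: List["TocNode"] = field(default_factory=list)


def build_toc(titles: List[TitleAnnotation]) -> List[TocNode]:
    """Nest titles, given in reading order, according to their depth."""
    roots: List[TocNode] = []
    stack: List[TocNode] = []  # currently open ancestors, shallowest first
    for annotation in titles:
        node = TocNode(annotation)
        # Close every open title that is not strictly shallower than this one.
        while stack and stack[-1].title.depth >= annotation.depth:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


example_titles = [
    TitleAnnotation("Fund overview", 1, 1, (50.0, 80.0, 300.0, 100.0)),
    TitleAnnotation("Objectives", 1, 2, (60.0, 140.0, 250.0, 155.0)),
    TitleAnnotation("Risk profile", 2, 2, (60.0, 90.0, 260.0, 105.0)),
    TitleAnnotation("Fees", 3, 1, (50.0, 80.0, 200.0, 100.0)),
]
example_toc = build_toc(example_titles)
```

In this sketch the depth of each title drives the nesting, while the page and bounding box keep the link back to the title's position in the PDF.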

Thanks to the contribution of the Autonomous University of Madrid (UAM, Spain), the fifth edition of the FinTOC Shared Task welcomes a track for Spanish documents, continuing from the previous edition, in addition to the English and French tracks.

In this edition, systems will be scored based on their performance in both Title detection and TOC generation using more precise evaluation metrics based on visual annotations.
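The official metrics have not been published here, so the following is only a sketch of how a title detection score based on visual annotations might be computed. It assumes that a predicted title counts as correct when its bounding box overlaps an unmatched gold box on the same page with an intersection-over-union (IoU) of at least 0.5. The box layout, the threshold, and the function names `iou` and `title_detection_f1` are assumptions for illustration, not the organisers' evaluation procedure.

```python
from typing import List, Tuple

# Hypothetical box layout: (page, x0, y0, x1, y1)
Box = Tuple[int, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 if they are on different pages."""
    if a[0] != b[0]:
        return 0.0
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter_h = max(0.0, min(a[4], b[4]) - max(a[2], b[2]))
    inter = inter_w * inter_h
    area_a = (a[3] - a[1]) * (a[4] - a[2])
    area_b = (b[3] - b[1]) * (b[4] - b[2])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def title_detection_f1(
    predicted: List[Box], gold: List[Box], threshold: float = 0.5
) -> Tuple[float, float, float]:
    """Greedy one-to-one matching of predicted boxes to gold boxes."""
    unmatched = list(gold)
    true_positives = 0
    for box in predicted:
        best = max(unmatched, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) >= threshold:
            true_positives += 1
            unmatched.remove(best)  # each gold title can be matched once
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold_boxes = [(1, 0.0, 0.0, 10.0, 10.0), (1, 0.0, 20.0, 10.0, 30.0)]
predicted_boxes = [(1, 0.0, 0.0, 10.0, 10.0), (2, 0.0, 20.0, 10.0, 30.0)]
scores = title_detection_f1(predicted_boxes, gold_boxes)
```

Here the second prediction has the right coordinates but the wrong page, so it is counted as a false positive and the corresponding gold title as a miss.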


Task:

The fifth edition of the FinTOC Shared Task introduces three tracks, following the format of FinTOC’4. These tracks include one for English documents, one for French documents, and a third track for Spanish documents.

Participants need to register. Once registered, all participating teams will be provided with a common training dataset containing PDF documents and the associated TOC annotation.


Background:

Existing work on book and document table of contents (TOC) recognition has focused almost exclusively on small, application-dependent, and domain-specific datasets. However, the TOCs of documents from different domains differ significantly in their visual layout and style, making TOC recognition a challenging problem for a large-scale collection of heterogeneous documents and books. Compared to regular books (mostly provided in a full-text format with limited structural information such as pages and paragraphs), financial documents, which contain both textual and non-textual content, have a more sophisticated structure, including parts, sections, sub-sections, and sub-sub-sections.


How to participate:

To participate, please use this registration form to add details of your team.

Registration has been open since 06/01/2023.


Data Format and Evaluation:

TBA


Shared task Paper Submission Instructions:

TBA


Shared Task Organisers:

  • Abderrahim Aitazzi, 3DS Outscale (ex Fortia), France
  • Sandra Bellato, 3DS Outscale (ex Fortia), France
  • Blanca Carbajo Coronado, Universidad Autónoma de Madrid
  • Dr Ismail El Maarouf, Imprevicible
  • Dr Juyeon Kang, 3DS Outscale (ex Fortia), France
  • Prof. Ana Gisbert, Universidad Autónoma de Madrid
  • Prof. Antonio Moreno Sandoval, Universidad Autónoma de Madrid

Shared Task Contact:

Questions about the FinTOC-2023 shared task can be sent to:

fin.toc.task@gmail.com