open-source-modelling/SFCR_using_Mistral

Transcribe SFCR tables with Mistral AI

Example Python code for transcribing tables from regulatory filings into digital form. To run these examples you will need an Anaconda environment and a Mistral API key. In this example we transcribe the balance sheet table from the Solvency and Financial Condition Reports (SFCR) that insurance companies must file every year.

As a sample, we took the 18 main life insurance companies operating in the Italian market.

Companies in scope

  • Credemvita S.p.A.
  • AXA MPS Assicurazioni Vita
  • CRÉDIT AGRICOLE VITA
  • Società Reale Mutua di Assicurazioni
  • Cardif Vita S.p.A.
  • MEDIOLANUM VITA S.p.A.
  • Generali Italia S.p.A.
  • Banco BPM Vita S.p.A.
  • HDI ASSICURAZIONI S.p.A.
  • Gruppo Assicurativo Poste Vita
  • FIDEURAM VITA S.P.A.
  • CNP Vita Assicura S.p.A.
  • ITAS VITA
  • Helvetia Vita S.p.A.
  • Vittoria Assicurazioni S.p.A.
  • GROUPAMA ASSICURAZIONI S.P.A.
  • UniCredit Allianz Vita S.p.A.
  • Zurich Investments Life S.p.A.

Description of the process

The process of extraction is performed in 5 phases.

Phase 0: Find the reports and identify the relevant tables (manually).

  1. Identify the new SFCR report and save it into the folder Input.
  2. Identify the pages where the tables of interest are.
  3. Record the company, its report and the table pages in the mapping file master_list.csv.
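
As a rough illustration of how the mapping from Phase 0 could be consumed, the sketch below reads a small in-memory stand-in for master_list.csv using only the standard library. The column names (`company`, `pdf_file`, `page`), the file names and the page numbers are all hypothetical placeholders; the real layout of master_list.csv may differ.

```python
import csv
import io

# Hypothetical layout for master_list.csv. The column names, file names and
# page numbers are placeholders for illustration, not the real values.
SAMPLE_MASTER_LIST = """company,pdf_file,page
Credemvita S.p.A.,company_a_sfcr.pdf,112
ITAS VITA,company_b_sfcr.pdf,87
"""


def load_master_list(handle):
    """Return one dict per company, with the page number converted to int."""
    entries = []
    for row in csv.DictReader(handle):
        entries.append({
            "company": row["company"].strip(),
            "pdf_file": row["pdf_file"].strip(),
            "page": int(row["page"]),
        })
    return entries


entries = load_master_list(io.StringIO(SAMPLE_MASTER_LIST))
for entry in entries:
    print(f'{entry["company"]}: page {entry["page"]} of Input/{entry["pdf_file"]}')
```

With a real file, `io.StringIO(...)` would be replaced by `open("master_list.csv", newline="", encoding="utf-8")`.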

Phase 1: Run the Extraction notebook (released on 23-September-2025).

The notebook performs the following steps (with slight modifications depending on the table format):

  1. Save the page with the table into a separate folder Single_pdf.
  2. Use either a Python package or specialized LLM to create a digital equivalent of the table.
  3. Fix the systematic errors that prevent the table from being saved as a DataFrame.
  4. Save the DataFrame into the Output folder.
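
Steps 2 and 3 can be sketched as follows, assuming the transcription step returns the table as Markdown pipe-table text. The function name `parse_markdown_table` and the sample table are hypothetical and are not taken from the Extraction notebook. The sketch shows one kind of systematic error, rows with a different number of cells than the header, being repaired so that the result is rectangular.

```python
def parse_markdown_table(markdown):
    """Turn a Markdown pipe table into (header, rows) of plain strings."""
    parsed = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        inner = line[1:]
        if inner.endswith("|"):
            inner = inner[:-1]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the separator row under the header, e.g. | --- | ---: |
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        parsed.append(cells)
    if not parsed:
        return [], []
    header, rows = parsed[0], parsed[1:]
    width = len(header)
    # Pad short rows and truncate long ones so every row matches the header.
    fixed = [(row + [""] * width)[:width] for row in rows]
    return header, fixed


# Placeholder values; the last row is missing its value cell on purpose.
SAMPLE = """
| Item | Solvency II value |
| --- | ---: |
| Total assets | 1.000 |
| Total liabilities | 900 |
| Excess of assets over liabilities |
"""

header, rows = parse_markdown_table(SAMPLE)
print(header)
for row in rows:
    print(row)
```

Because every row now has the same width as the header, `pandas.DataFrame(rows, columns=header)` can build the DataFrame without raising a shape error.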

Phase 2: Run the Processing notebook (released on 4-October-2025).

The notebook applies fixes to each DataFrame so that the transcribed numbers match the reported figures more closely. It then joins all the tables into a single dataset and saves it into the Dirty_Combined folder.
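
One plausible fix of this kind is converting transcribed amounts into numbers before the tables are joined. The sketch below assumes Italian number formatting ("." as thousands separator, "," as decimal mark, parentheses or a leading minus for negatives); the helper names `parse_amount` and `combine`, the company labels and the values are hypothetical and are not taken from the Processing notebook.

```python
import re


def parse_amount(text):
    """Convert a transcribed amount to a float, or None if the cell is empty.

    Assumes '.' separates thousands, ',' marks decimals, and negatives are
    written with a leading minus sign or wrapped in parentheses.
    """
    cleaned = text.strip().replace("\u00a0", "").replace(" ", "")
    if cleaned in {"", "-", "\u2013"}:
        return None  # empty cell or a dash used as "no value"
    negative = cleaned.startswith("-") or (
        cleaned.startswith("(") and cleaned.endswith(")")
    )
    cleaned = cleaned.strip("()-")
    cleaned = cleaned.replace(".", "").replace(",", ".")
    if not re.fullmatch(r"\d+(\.\d+)?", cleaned):
        raise ValueError(f"Unrecognised amount: {text!r}")
    value = float(cleaned)
    return -value if negative else value


def combine(tables):
    """Stack per-company (item, raw value) rows into one list of records."""
    combined = []
    for company, rows in tables.items():
        for item, raw_value in rows:
            combined.append(
                {"company": company, "item": item, "value": parse_amount(raw_value)}
            )
    return combined


# Placeholder companies and values, for illustration only.
tables = {
    "Company A": [("Total assets", "1.234.567"), ("Total liabilities", "1.000.000")],
    "Company B": [("Total assets", "(1.250,5)"), ("Total liabilities", "-")],
}
combined = combine(tables)
for record in combined:
    print(record)
```

Raising on unrecognised text, rather than silently returning a number, keeps transcription errors such as a letter "O" in place of a zero visible for the later phases.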

Phase 3: Run the Cross-Validation notebook (released on 7-October-2025).

The notebook applies a series of tests that check the internal consistency of the numbers and flags potential errors. After the individual fixes are applied, it saves the table into the Cleaner_Combined folder.
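
A minimal sketch of one such consistency test is shown below. It relies on the balance sheet identity that the excess of assets over liabilities equals total assets minus total liabilities. The function names, the record keys, the tolerance of 1.0 (to absorb rounding) and the sample values are assumptions for illustration, not the checks implemented in the Cross-Validation notebook.

```python
def check_identity(label, expected, reported, tolerance=1.0):
    """Return a flag message if the values differ by more than tolerance, else None."""
    difference = reported - expected
    if abs(difference) > tolerance:
        return (
            f"{label}: reported {reported:,.0f}, expected {expected:,.0f} "
            f"(difference {difference:,.0f})"
        )
    return None


def cross_validate(record, tolerance=1.0):
    """Run every identity check on one company record and collect the flags."""
    checks = [
        (
            "Excess of assets over liabilities",
            record["total_assets"] - record["total_liabilities"],
            record["excess_of_assets_over_liabilities"],
        ),
    ]
    flags = [
        check_identity(label, expected, reported, tolerance)
        for label, expected, reported in checks
    ]
    return [flag for flag in flags if flag is not None]


# Placeholder records, for illustration only.
consistent = {
    "total_assets": 1000.0,
    "total_liabilities": 900.0,
    "excess_of_assets_over_liabilities": 100.0,
}
suspicious = dict(consistent, excess_of_assets_over_liabilities=180.0)

flags_ok = cross_validate(consistent)
flags_bad = cross_validate(suspicious)
print(flags_ok)
print(flags_bad)
```

Further identities, such as a subtotal equalling the sum of its components, could be added as extra tuples in `checks` without changing the rest of the code.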

Phase 4: Final modifications to the table and a manual inspection (no script for this step).

Contact

We use a version of this process to extract data for our actuarial models. One of the benefits of releasing our code is the feedback and improvement ideas we receive. If you have any, you can contact us at [email protected].

License

MIT license

About

Transform PDFs into DataFrames using Mistral OCR and Python.
