
MAPE Engine
What is MAPE Engine?
The extensive, unstructured correspondence of the Portuguese Empire (1610–1833), archived in the Arquivo Histórico Ultramarino de Lisboa, poses a major challenge to traditional research methods because of its complexity and volume. This study presents the MAPE Engine, an AI-powered framework that integrates natural language processing (NLP), large language models (LLMs), embedding-based validation, and clustering techniques to extract information from approximately 180,000 historical records. The method automatically assigns a concise, contextual topic to each piece of correspondence and organizes the records into thematic clusters that reveal overarching categories, including colonial administration, maritime trade, religious affairs, and mobility. By leveraging the multilingual and contextual understanding capabilities of the LLaMA 3.2 model together with clustering algorithms, this approach overcomes the limitations of traditional archival processing and makes the records more accessible and easier to interpret. The MAPE Engine enables international scholars and history enthusiasts to explore hidden patterns and connections in historical datasets through a user-friendly bilingual tool in English and Portuguese.
The MAPE Engine will soon be available on the server of the Tadeusz Manteuffel Institute of History, Polish Academy of Sciences.
Find out more about MAPE Engine
Our Dataset
MAPE: A Dataset of Correspondence from the Portuguese Empire
The MAPE dataset comprises 182,491 historical correspondence records from the Arquivo Histórico Ultramarino de Lisboa (Portuguese Overseas Archives of Lisbon, hereafter AHU), in particular from the collection of the Conselho Ultramarino (Overseas Council), covering the period from 1581 to 1859.
The MAPE dataset is provided as a single CSV file in the repository root. It consolidates all correspondence registers extracted from the AHU PDFs into a uniform tabular structure.
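As a first check after downloading, the CSV can be inspected with a few lines of Python. The sketch below is only illustrative: it assumes the file is UTF-8 encoded and has a header row, and the function name summarize_csv is hypothetical. It makes no assumptions about the actual column names, which it simply reads from the header.

```python
import csv
from collections import Counter


def summarize_csv(path, encoding="utf-8"):
    """Return (row_count, column_names, non_empty_counts) for a CSV with a header row.

    Assumption: the file is UTF-8 encoded; pass another encoding if needed.
    """
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        non_empty = Counter()
        row_count = 0
        for row in reader:
            row_count += 1
            for name in columns:
                # Count a cell as filled only if it holds non-whitespace text.
                if (row.get(name) or "").strip():
                    non_empty[name] += 1
    # Report every column, including those that are empty in all rows.
    return row_count, columns, {name: non_empty[name] for name in columns}
```

Calling summarize_csv with the path of the downloaded file returns the number of records, the column names as they appear in the header, and how many records have a value in each column. If the file holds one row per record, the row count should match the record count stated above.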
The consolidated dataset originally contained correspondence only in Portuguese, a significant barrier for a global audience. To overcome this limitation, we translated the original content into English using Google Gemini 1.5 Flash, a lightweight transformer-based model optimized for multilingual text processing and translation. Gemini 1.5 Flash supports over 100 languages and is designed to balance speed, computational efficiency, and high-quality text generation.
Download the dataset: MAPE: A Dataset of Correspondence from the Portuguese Empire, https://zenodo.org/records/15481608
How to cite: Błoch, A., Vasques Filho, D., Bojanowski, M., Santana, C., & Hussain, S. (2025). MAPE: A Dataset of Correspondence from the Portuguese Empire [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15481608