Our latest article: Unveiling the Critical Nexus of Data Preprocessing and Transparent Documentation for Result Quality and Reproducibility in Digital History
- Agata Bloch
- Sep 11
- 1 min read
Our latest article has been published in Digital Humanities Quarterly: "Unveiling the Critical Nexus of Data Preprocessing and Transparent Documentation for Result Quality and Reproducibility in Digital History".

This paper underscores the importance of adequate data preprocessing, transparency, and documentation in digital history research, showing how these often-overlooked practices affect research quality and reproducibility. We present a topic modelling case study involving over 160,000 records of official correspondence of the Atlantic Portuguese Empire from 1640 to 1822 to illustrate how these practices, combined with standardised formats and metadata conventions, facilitate the sharing and reproduction of experiments. First, we evaluate the impact of data cleaning and preprocessing on model performance. Second, concerning model selection, we compare the performance of latent Dirichlet allocation (LDA), latent semantic indexing (LSI), and the Gibbs sampling algorithm for a Dirichlet mixture model (GSDMM). Besides stressing the underestimated significance of data preprocessing and transparent documentation for strengthening research robustness and contributing to a culture of reproducibility, we also demonstrate the potential of topic modelling in digital historical studies, specifically in the context of the Atlantic Portuguese Empire.
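To give a rough sense of why preprocessing matters before topic modelling, here is a minimal Python sketch. It is not the pipeline used in the article: the two sample records are invented, the stopword list is a hypothetical placeholder, and the cleaning steps (lowercasing, accent and punctuation stripping, stopword and short-token removal) are assumptions chosen only for illustration. It compares vocabulary size before and after cleaning.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical stopword list, for illustration only (not the article's list).
STOPWORDS = {"de", "a", "o", "e", "do", "da", "que", "em", "para", "ao", "sobre"}


def preprocess(text, stopwords=STOPWORDS, min_len=3):
    """Lowercase, strip accents and punctuation, drop stopwords and short tokens."""
    # Decompose accented characters, then discard the combining marks.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Keep alphabetic runs only; punctuation and digits are dropped.
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if len(t) >= min_len and t not in stopwords]


def vocabulary(docs):
    """Count token frequencies across a list of tokenised documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


# Invented sample records, not taken from the corpus described in the article.
records = [
    "Carta do governador de Angola ao rei sobre o comércio.",
    "CARTA do Governador de Pernambuco ao Rei, sobre o Comércio!",
]

raw_vocab = vocabulary([r.split() for r in records])
clean_vocab = vocabulary([preprocess(r) for r in records])

print("raw vocabulary size:  ", len(raw_vocab))
print("clean vocabulary size:", len(clean_vocab))
```

Without cleaning, case and punctuation variants of the same word count as separate vocabulary entries, which is the kind of noise a topic model would otherwise have to absorb. Recording each step and its parameters, such as the stopword list and minimum token length, is what makes a run like this reproducible by others.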

