Data processing pipeline for scientific publications to identify priority research areas
Iskandar B. Soliev, postgraduate student, Department of Informatics and Computer Engineering, Tomsk Polytechnic University
Abstract
With the volume of scientific information growing rapidly, automated analysis methods are needed to identify the most promising research directions. The study is relevant because such large data sets cannot be processed manually, while strategic planning of scientific activity requires timely results. The article aims to develop and test a data processing pipeline for scientific publications that systematises large amounts of information and supports decision-making in scientific organisations. The pipeline is implemented on the Lens.org platform, which provides access to extensive databases of scientific publications. Data collection is followed by preprocessing, which includes duplicate removal, tokenisation, lemmatisation and text vectorisation. The author applies Latent Dirichlet Allocation (LDA) to identify hidden topics. The paper also conducts citation analysis and graph analysis of the relationships between publications, and pays special attention to developing a new metric, the “Priority Index”, which combines indicators of citation, thematic relevance and the temporal trend of publications. Testing the pipeline on a sample of more than 50,000 publications from 2014–2024 demonstrates the high accuracy and efficiency of the proposed method. The results make it possible to identify key research directions, such as artificial intelligence, big data processing and distributed energy systems, and to trace the dynamics of their development.
Keywords: scientific publications; data processing pipeline; priority directions; topic modeling; citation analysis; priority index
For citation: Soliev I. B. Data processing pipeline for scientific publications to identify priority research areas. Digital models and solutions. 2025. Vol. 4, no. 1. Pp. 17–34. DOI: 10.29141/2949-477X-2025-4-1-2. EDN: MOWAQR.
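The abstract describes the Priority Index only as a combination of citation, thematic relevance and temporal trend indicators, without stating the formula. As a rough illustration of how such a composite score could be assembled, the sketch below min-max normalises each indicator across topics and takes a weighted sum. The equal weights, the normalisation step, the `TopicStats` fields and the sample numbers are all assumptions made for this example and are not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class TopicStats:
    """Per-topic indicators (hypothetical field definitions)."""
    name: str
    citations: float  # e.g. mean citations per publication in the topic
    relevance: float  # e.g. mean LDA topic weight of its publications
    trend: float      # e.g. slope of yearly publication counts


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def priority_index(topics, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of normalised indicators (assumed form, not the article's)."""
    w_cit, w_rel, w_trend = weights
    cit = min_max([t.citations for t in topics])
    rel = min_max([t.relevance for t in topics])
    trend = min_max([t.trend for t in topics])
    return {
        t.name: w_cit * c + w_rel * r + w_trend * s
        for t, c, r, s in zip(topics, cit, rel, trend)
    }


# Made-up numbers, for illustration only.
sample = [
    TopicStats("artificial intelligence", 10.0, 0.8, 2.0),
    TopicStats("big data processing", 5.0, 0.4, 1.0),
    TopicStats("distributed energy systems", 0.0, 0.0, 0.0),
]
scores = priority_index(sample)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions the score lies in [0, 1] and topics can be ranked by it. Changing `weights` shifts the emphasis between citation impact, thematic relevance and growth.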