Colav's 2025 ETL Checklist: Second Semester Data Prep
Hey guys! Get ready, because we're diving deep into the essential requirements for our ETL (Extract, Transform, Load) data processing for the second semester of 2025. This isn't just a technical checklist; it's our roadmap for making sure the Colav and Impactu platforms keep delivering reliable, comprehensive data that empowers researchers, institutions, and policymakers alike. We're covering every stage of the pipeline: fetching raw data, transforming it into insightful knowledge, and loading it up for everyone to use. Our goal, as always, is to reflect the most current and accurate research landscape, because the integrity and usefulness of our analytical tools depend on a foundation of fresh, meticulously processed information. So let's buckle up and walk through what needs to get sorted for this critical data run, from core database updates to advanced AI-driven post-calculations, all designed to enhance our data's richness and accessibility. We're committed to making this an efficient and highly effective data cycle that benefits the entire research community relying on our platforms. Getting these ETL requirements right is foundational, so let's make sure we're on top of every single detail.
Updating Our Core Databases: The Foundation of Impactful Research
Alright, team, the first and arguably most critical step in our ETL process for the second semester of 2025 is ensuring our core databases are completely up-to-date. Think of these as the lifeblood of our entire operation. Without fresh, accurate data flowing in, our analytical tools can't provide the current insights that users expect. We're talking about a multi-faceted approach, covering everything from global research databases to specialized Colombian datasets, ensuring we capture the full spectrum of academic activity. Each database presents its own unique set of challenges and requirements, but the underlying principle remains the same: meticulous data acquisition and integration.
Getting Fresh Data from OpenAlex: A Deep Dive
One of our primary data sources, OpenAlex, is absolutely fundamental for understanding the global research landscape. For this second semester 2025 ETL run, we have a clear, multi-step process to get this massive dataset into our systems. First up, we need to handle the download of the latest OpenAlex snapshot. This isn't just clicking a button, guys; it often involves managing large file transfers, ensuring network stability, and verifying data integrity during the download. Once we have the raw files, the next hurdle is decompression. These files are typically compressed to save space, and extracting them correctly is essential before we can do anything else. Any hiccup here can delay the entire process, so robust error handling is key.
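To make this concrete, here's a minimal sketch of that download-and-decompress step, assuming the snapshot's usual S3 layout of gzipped JSON Lines files and that the AWS CLI is available locally. The target directory and the part-file glob pattern are assumptions for illustration, not the exact commands from our production scripts.

```python
import gzip
import json
import subprocess
from pathlib import Path

SNAPSHOT_DIR = Path("openalex-snapshot")  # local target directory (placeholder)

def download_snapshot() -> None:
    """Mirror the public OpenAlex snapshot from S3 (no credentials required)."""
    subprocess.run(
        ["aws", "s3", "sync", "s3://openalex", str(SNAPSHOT_DIR), "--no-sign-request"],
        check=True,
    )

def iter_works():
    """Stream-decompress the gzipped JSON Lines part files for works, one record at a time."""
    for part in sorted(SNAPSHOT_DIR.glob("data/works/updated_date=*/part_*.gz")):
        with gzip.open(part, "rt", encoding="utf-8") as fh:
            for line in fh:
                yield json.loads(line)

if __name__ == "__main__":
    download_snapshot()
    first = next(iter_works())
    print(first.get("id"), first.get("title"))  # quick sanity check on the first record
```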
After decompression, we move into the actual loading phase. The data needs to be loaded into MongoDB, which serves as our flexible, high-performance data store. This step requires careful schema mapping and efficient insertion strategies to handle the sheer volume of OpenAlex data. Simultaneously, or shortly thereafter, we'll perform the load into Elasticsearch. Elasticsearch is vital for providing the fast, full-text search capabilities that our users rely on for exploring publications, authors, and institutions. Ensuring consistent indexing and proper field mapping in Elasticsearch is paramount for search accuracy and performance.
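As a rough illustration of that double load, here's a sketch using pymongo and the official Elasticsearch Python client. The connection strings, database, collection, and index names are placeholders, and the Elasticsearch document is trimmed to a couple of fields just to show the bulk-indexing pattern.

```python
from elasticsearch import Elasticsearch, helpers
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
works_col = mongo["openalexco"]["works"]     # hypothetical database/collection names
es = Elasticsearch("http://localhost:9200")
ES_INDEX = "openalex_works"                  # hypothetical index name

def load_batch(works, batch_size=1000):
    """Insert OpenAlex work documents into MongoDB and index a slim copy in Elasticsearch."""
    buffer = []
    for work in works:
        work["_id"] = work["id"]             # reuse the OpenAlex ID as the Mongo _id
        buffer.append(work)
        if len(buffer) >= batch_size:
            flush(buffer)
            buffer = []
    if buffer:
        flush(buffer)

def flush(batch):
    works_col.insert_many(batch, ordered=False)  # unordered bulk insert for speed
    actions = (
        {
            "_index": ES_INDEX,
            "_id": doc["_id"],
            "_source": {
                "title": doc.get("title"),
                "publication_year": doc.get("publication_year"),
            },
        }
        for doc in batch
    )
    helpers.bulk(es, actions)
```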
Beyond the general data load, a crucial specialized task for our platform is the Colombia data cut. This involves extracting and refining only the data relevant to Colombia from the vast OpenAlex dataset. This step is critical for platforms like Impactu, which focus heavily on national research impact. We've got a specific process outlined in our openalex_load repository on GitHub, and importantly, we need to address issue #641 related to this cut. Resolving this issue is a high priority for ensuring the accuracy and completeness of our Colombian research data. The entire OpenAlex pipeline, from raw download to specialized national cuts, is a cornerstone of our data processing requirements and directly shapes the quality of insights we can offer about research worldwide and specifically within Colombia. It underpins many of our analyses of research output, collaborations, and disciplinary trends, making its accurate and timely update non-negotiable.
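As a minimal sketch of what such a cut can look like, the snippet below assumes the works are already in MongoDB and relies on the country_code field that OpenAlex exposes under authorships[].institutions[]. The collection names are placeholders, and this is not the exact logic from openalex_load or issue #641.

```python
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
works_col = mongo["openalexco"]["works"]              # hypothetical source collection
colombia_col = mongo["openalexco"]["works_colombia"]  # hypothetical target collection

# A dotted query matches any work with at least one institution tagged "CO".
query = {"authorships.institutions.country_code": "CO"}
for work in works_col.find(query):
    colombia_col.replace_one({"_id": work["_id"]}, work, upsert=True)
```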
Mining Colombian Talent: Minciencias (Yuku) and CvLac Scraping
Shifting our focus closer to home, the Datos Abiertos Minciencias (Yuku) platform is an incredibly rich source of information about Colombian researchers, particularly through the CvLac database. For this ETL run, one of our key tasks is the scraping of CvLac data. This process, while incredibly valuable, requires careful execution. CvLac profiles contain detailed information about researchers' academic backgrounds, publications, projects, and professional experience, and extracting this data automatically allows us to build comprehensive profiles for Colombian academics, with granular insights into their careers and contributions. The data we gather from CvLac via Yuku feeds directly into our understanding of the human capital within the Colombian science and technology ecosystem. It's not just about listing publications; it's about identifying expertise, tracking career trajectories, and understanding research networks at a deeply personal level.

The challenge here lies in developing robust, adaptable scraping tools that can navigate the structure of CvLac, handle potential changes in its layout, and do so efficiently and ethically. Ensuring data quality and completeness during scraping is paramount, since inconsistencies lead to inaccuracies in researcher profiles and downstream analyses. This data is invaluable for mapping the scientific capabilities of the country and identifying key players in various fields: it helps us understand the skills, experience, and collaborative potential of thousands of researchers, which is foundational for any assessment of national research capacity and impact. Dedicating sufficient resources to this CvLac scraping requirement is therefore a top priority for our 2025 data processing efforts, directly contributing to the unique insights Colav provides about Colombian academia.
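For illustration only, here's the general shape of such a scraper using requests and BeautifulSoup. The URL template, parameters, and selectors are hypothetical stand-ins, since the real CvLac endpoints and HTML structure are defined in our actual scraping tools, and any production run has to respect the platform's terms of use.

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical profile URL template; the real CvLac endpoint and parameters
# come from the production scraper's configuration.
PROFILE_URL = "https://example.gov.co/cvlac/visualizador?cod_rh={code}"

def fetch_profile(code: str) -> dict:
    """Download one researcher profile page and extract a couple of illustrative fields."""
    resp = requests.get(PROFILE_URL.format(code=code), timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    name = soup.find("h1")  # selector is illustrative; real pages need real selectors
    return {"cod_rh": code, "name": name.get_text(strip=True) if name else None}

def scrape(codes):
    """Iterate over researcher codes politely, one request per second."""
    for code in codes:
        yield fetch_profile(code)
        time.sleep(1)  # throttle requests to avoid hammering the public service
```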
Keeping Up with Journal Rankings: ScimagoJR & DOAJ
To ensure our platform offers a holistic view of scholarly communication, including journal quality and accessibility, we need to integrate data from ScimagoJR and DOAJ. For ScimagoJR, a leading journal ranking indicator, our specific requirement for the second semester of 2025 is to acquire and process only the 2025 data. This targeted approach ensures that our journal metrics are current and relevant, reflecting the latest impact factors and quartile rankings. ScimagoJR data is crucial for researchers evaluating publication venues, for institutions assessing the visibility of their faculty's work, and for administrators making strategic decisions about research funding and promotion. It allows our users to contextualize research output by understanding the standing of the journals in which articles are published, providing a valuable layer of information that goes beyond simple citation counts. Accuracy in handling this specific year's data is key to providing reliable, up-to-date journal performance metrics across various disciplines.
Equally important, although serving a different purpose, is the DOAJ (Directory of Open Access Journals). The DOAJ provides a curated list of high-quality, open-access, peer-reviewed journals. Integrating DOAJ data into our platform allows us to identify and highlight research published in open-access venues, promoting accessibility and transparency in science. This is incredibly important in today's academic landscape, where open science initiatives are gaining significant traction. By including DOAJ, we empower users to filter and discover open-access content more easily, and we provide valuable context for researchers looking for reputable open-access publication options. Ensuring both ScimagoJR and DOAJ data are fresh and correctly integrated means our users get a more complete picture of the publication landscape, encompassing both traditional impact metrics and the growing open-access movement. These two datasets, though distinct, together provide essential metadata that enriches our understanding of academic publishing trends and helps guide our users to high-quality content that aligns with current research dissemination practices. Successfully integrating these during our ETL process helps solidify our platform's ability to offer comprehensive and forward-looking analyses.
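As a sketch of how both sources might land in our stores, here's one way to load a year-specific ScimagoJR export and a DOAJ journal-metadata export with pandas. The file names, the semicolon separator for ScimagoJR, and the database/collection names are assumptions about local exports, not the exact files our pipeline consumes.

```python
import pandas as pd
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
db = mongo["journal_metadata"]                 # hypothetical database name

# ScimagoJR exports are typically semicolon-separated; the file name is a placeholder.
scimago = pd.read_csv("scimagojr_2025.csv", sep=";")
scimago["year"] = 2025                         # tag every row with the target year
db["scimagojr"].delete_many({"year": 2025})    # keep the 2025 load idempotent
db["scimagojr"].insert_many(scimago.to_dict("records"))

# DOAJ journal metadata as a local CSV export; again, the file name is a placeholder.
doaj = pd.read_csv("doaj_journal_metadata.csv")
db["doaj"].delete_many({})
db["doaj"].insert_many(doaj.to_dict("records"))
```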
ORCID for Accurate Author Identification: Beyond Just Names
One of the persistent challenges in research data management is accurately identifying authors, especially with common names or variations in spelling. This is where ORCID (Open Researcher and Contributor ID) comes into play, and it's a critical requirement for our second semester 2025 ETL run. Our main goal here is the repair of OpenAlex author names. The data from OpenAlex, while extensive, can have inconsistencies in author names, making it hard to link all works to a single individual. ORCID provides a persistent digital identifier that distinguishes one researcher from another, even those with the same name, and links them to their contributions. This drastically improves the quality and reliability of our author profiles.
To facilitate this, we have two key systems in place. First, we need to ensure the system for loading ORCID into MongoDB, with only given names and surnames, is fully functional. This system, found at https://github.com/colav-playground/orcid_load, is specifically designed to load ORCID data into our MongoDB instance, capturing the essential name and surname information that helps us match and resolve author identities. By doing this, we create a robust internal mapping that links various name permutations to a definitive ORCID iD. Second, the ORCID person plugin (https://github.com/colav/Kahi_plugins/pull/364) is essential. This Kahi plugin integrates ORCID information directly into our person entities, ensuring that each author profile on our platform is enriched with its unique identifier. This integration allows for more accurate author disambiguation, which means users can trust that all publications attributed to an ORCID iD belong to the same person. It not only enhances data quality but also significantly improves the user experience by providing clearer, more reliable author information. Getting this ORCID pipeline updated and running smoothly is non-negotiable for improving the integrity and connectivity of our author data, making our platform an even more precise tool for tracking individual research output and impact. It ensures that our analyses are always linked to the correct researcher, providing invaluable insight into individual contributions and collaborations.
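Here's a minimal sketch of the names-only load, assuming the ORCID records have already been flattened to a JSON Lines file with an ORCID iD, given names, and family name per line. The input format, file name, and collection names are assumptions; the real pipeline lives in orcid_load.

```python
import json

from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
orcid_col = mongo["orcid"]["names"]       # hypothetical database/collection names
orcid_col.create_index("orcid")           # fast lookups during author-name repair

with open("orcid_names.jsonl", encoding="utf-8") as fh:
    for line in fh:
        rec = json.loads(line)
        orcid_col.update_one(
            {"orcid": rec["orcid"]},
            {"$set": {
                "given_names": rec.get("given_names"),
                "family_name": rec.get("family_name"),
            }},
            upsert=True,  # re-running the load refreshes rather than duplicates records
        )
```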
Integrating Institutional Data: SIIU and DSpace
Beyond individual researchers and publications, understanding the institutional landscape is paramount. This brings us to the SIIU (Sistema de Información Institucional Universitario) and DSpace data updates, both crucial elements of our ETL requirements for the second semester of 2025. For SIIU, the process begins with obtaining the DUMP. This dump is essentially a snapshot of the institutional database, containing a wealth of information about faculty, departments, courses, and other internal institutional data. Once obtained, the next step is extracting the DUMP, which means parsing and preparing the data for integration into our systems. The SIIU data provides critical context for understanding the structure and operations of academic institutions, allowing us to link research output to specific university departments and units. This granular institutional data is vital for accurate reporting and detailed analysis of research impact within an organizational framework. It helps us map internal structures and understand how research activities align with institutional goals, offering a unique perspective on the academic ecosystem.
Concurrently, we focus on DSpace, which represents a vast network of institutional repositories. Our current status is quite good: we have 105 repositories already downloaded, meaning we've captured a significant amount of open-access content directly from university archives. For this upcoming run, a key task is to add BanRep (Banco de la República de Colombia) to our capture list. Incorporating BanRep's repository will further enrich our dataset with high-quality institutional publications, particularly in economics and the social sciences, directly from a prestigious national institution. Capturing DSpace repositories is about aggregating institutional knowledge, making it discoverable, and linking it back to the respective institutions and authors; it's a testament to the open science movement and our commitment to making publicly funded research readily available. By meticulously updating and expanding our SIIU and DSpace data, we significantly enhance our platform's ability to provide a comprehensive, institutionally contextualized view of research output, collaboration, and impact, ensuring we continue to offer high-quality content and value to all our users. These efforts underscore our commitment to a holistic approach to data, integrating both broad academic outputs and specific institutional contributions.
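Since DSpace repositories commonly expose an OAI-PMH interface, a capture like the BanRep one could look roughly like the sketch below, using the Sickle client. The endpoint URL is a hypothetical placeholder that would need to be confirmed before the run.

```python
from sickle import Sickle  # OAI-PMH client commonly used to harvest DSpace repositories

# Hypothetical endpoint; the actual BanRep OAI-PMH URL must be confirmed before the run.
OAI_ENDPOINT = "https://example-repository.banrep.gov.co/oai/request"

def harvest(endpoint: str):
    """Yield Dublin Core metadata dictionaries from a DSpace OAI-PMH endpoint."""
    client = Sickle(endpoint)
    for record in client.ListRecords(metadataPrefix="oai_dc", ignore_deleted=True):
        yield record.metadata  # dict of Dublin Core fields (title, creator, date, ...)

if __name__ == "__main__":
    for i, meta in enumerate(harvest(OAI_ENDPOINT)):
        print(meta.get("title"))
        if i >= 4:  # sample a handful of records as a smoke test
            break
```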
Supercharging Data with AI and Smart Tools: Post-Calculation Power-Ups
Once our core databases are loaded and harmonized, the real magic begins with our post-calculation processes. This is where we leverage advanced technologies, particularly Artificial Intelligence, to supercharge our data, extracting deeper insights and making it even more valuable and user-friendly. These steps go beyond mere data aggregation; they transform raw information into meaningful, actionable intelligence. For the second semester 2025 ETL run, we're focusing on two major areas that will dramatically enhance the analytical capabilities and user experience of our platforms: AI-driven topic assignments and sophisticated autocomplete indices. These innovations are at the forefront of what makes our platforms truly powerful, allowing us to delve into the thematic content of research and making data access incredibly efficient.
Unlocking Insights with AI Topics: Chia and Kahi Integration
One of the most exciting aspects of our post-calculation phase is the application of AI to automatically assign topics to research publications. This process, driven by our Topics with AI initiative, significantly enhances the discoverability and analytical depth of our data. At the heart of it is the Chia inference system, found at https://github.com/colav/Chia/pull/10. Chia is our AI inference engine, responsible for understanding the content of scholarly articles and identifying the most relevant topics using state-of-the-art natural language processing models. This isn't just keyword matching; it's semantic understanding, allowing us to categorize research by its core themes even when explicit keywords aren't present. Automated topic assignment is a game-changer for exploring research trends and identifying emerging areas of study.
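To give a flavor of what topic inference can look like, here's an illustrative embedding-similarity sketch with sentence-transformers. It is not Chia's actual model or taxonomy (those live in the Chia repository); the model name and candidate topics below are assumptions made purely for the example.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose embedding model
CANDIDATE_TOPICS = ["machine learning", "public health", "tropical agriculture"]
topic_vecs = model.encode(CANDIDATE_TOPICS, convert_to_tensor=True)

def infer_topics(abstract: str, top_k: int = 2):
    """Rank the candidate topics by cosine similarity to the abstract and return the top_k."""
    vec = model.encode(abstract, convert_to_tensor=True)
    scores = util.cos_sim(vec, topic_vecs)[0]
    ranked = sorted(
        zip(CANDIDATE_TOPICS, scores.tolist()), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:top_k]

print(infer_topics("We train a neural network to detect dengue outbreaks from clinic data."))
```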
Following Chia's inference, we integrate these findings through the topic assignment system with post-calculations in Kahi, detailed in https://github.com/colav/Kahi_plugins/pull/361. This Kahi plugin takes the AI-generated topics and systematically assigns them to the relevant publications and, by extension, to authors and institutions. The value here is immense: researchers can more easily find related work, institutions can analyze their thematic strengths, and policymakers can identify areas that need more funding or attention. AI-driven topic assignment drastically improves the navigability and analytical potential of our entire dataset, allowing users to move beyond simple bibliographic searches and explore the conceptual landscape of research. It transforms raw publication data into a rich tapestry of interconnected ideas, making our platform an invaluable tool for discovering patterns and making informed decisions across the academic spectrum. Successfully deploying these AI topic assignment requirements is a central piece of our 2025 data processing strategy, moving us further into advanced data intelligence.
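The assignment step itself can be as simple as writing the inferred topics back onto each work record, as in the hedged sketch below. The collection name, field names, and document shape are placeholders and do not reflect the actual schema used by the Kahi plugin in pull request #361.

```python
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
works = mongo["kahi"]["works"]  # hypothetical database/collection names

def assign_topics(work_id: str, topics: list) -> None:
    """Attach AI-inferred topics to a work document as a post-calculation."""
    works.update_one({"_id": work_id}, {"$set": {"topics_ai": topics}})

assign_topics(
    "https://openalex.org/W2741809807",
    [{"name": "public health", "score": 0.87, "source": "chia"}],
)
```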
Boosting User Experience: Elasticsearch Autocompletion with Kerana
For any data-intensive platform, user experience is paramount, and a smooth, intuitive search interface is key. This is precisely what we aim to enhance with the generation of Elasticsearch autocomplete indices with Kerana. Kerana, our dedicated tool (https://github.com/colav/Kerana/pull/3), plays a vital role in creating highly efficient and responsive autocomplete functionality across our platforms. After all the data is processed and enriched, we need to ensure that users can find what they're looking for quickly and effortlessly.
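For a sense of what such an index involves, here's a minimal completion-suggester sketch with the official Elasticsearch Python client. The index name, fields, and sample document are placeholders, and Kerana's real mappings and analyzers may well differ.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
INDEX = "autocomplete_authors"  # hypothetical index name

# A standard completion-suggester mapping: one searchable text field plus a suggest field.
es.indices.create(index=INDEX, mappings={
    "properties": {
        "full_name": {"type": "text"},
        "suggest": {"type": "completion"},
    },
})

# Index one author with a couple of input variants so either name order autocompletes.
es.index(index=INDEX, document={
    "full_name": "Ana María Pérez",
    "suggest": {"input": ["Ana María Pérez", "Pérez, Ana María"]},
}, refresh=True)

# Ask for suggestions for the prefix "ana".
resp = es.search(index=INDEX, suggest={
    "author_suggest": {"prefix": "ana", "completion": {"field": "suggest"}},
})
for option in resp["suggest"]["author_suggest"][0]["options"]:
    print(option["text"])
```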
Kerana is specifically tasked with generating autocomplete indices for several critical entities: Authors, Institutions, Academic units, and Academic sub-units. Imagine a user typing a few letters into a search bar and immediately seeing relevant suggestions for authors like _