Source Integration Methods
Overview
The ALIGN Global Hub harmonizes data from diverse external sources into a unified product registry. This document describes the technical methods and rules used for this integration.
The Integration Process
1. Data Extraction
Data is extracted from external sources (e.g., CSV files, Excel workbooks, and APIs) and loaded into a staging environment.
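As an illustration of the staging step, the sketch below parses CSV text and tags each row with where and when it was extracted. The function name `load_to_staging`, the `_source` and `_extracted_on` fields, and the sample columns are hypothetical and are not taken from the Hub's actual pipeline.

```python
import csv
import io
from datetime import date


def load_to_staging(csv_text, source_name, extracted_on):
    """Parse CSV text into dicts, tagging each row with its provenance."""
    reader = csv.DictReader(io.StringIO(csv_text))
    staged = []
    for row in reader:
        record = dict(row)
        # Hypothetical provenance fields, kept so later steps can tell
        # which source and which extraction a value came from.
        record["_source"] = source_name
        record["_extracted_on"] = extracted_on
        staged.append(record)
    return staged


# Illustrative input only; real source files will have different columns.
sample = "Product Name,Manufacturer\nRapidTest A,Acme Dx\n"
rows = load_to_staging(sample, "source_a", date(2026, 1, 15))
```

Keeping the extraction date on every staged row is one way to support the later rule that prioritizes dates from the most recent extraction.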
2. Standardization
Field names and values are mapped to a common schema.
- Date Standardization: All dates are converted to ISO 8601 format (YYYY-MM-DD).
- Categorical Mapping: Source-specific product types (e.g., “Diagnostic Test”, “Assay”) are mapped to terms from a controlled vocabulary (e.g., “Diagnostic”).
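The two standardization rules can be sketched as follows. This is a minimal example under stated assumptions: the list of accepted input date formats, the day-first reading of slash-separated dates, and the contents of `PRODUCT_TYPE_MAP` are illustrative and do not come from the Hub's real configuration.

```python
from datetime import datetime

# Assumed input formats; slash-separated dates are read as day/month/year.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

# Illustrative mapping from source labels to controlled-vocabulary terms.
PRODUCT_TYPE_MAP = {
    "diagnostic test": "Diagnostic",
    "assay": "Diagnostic",
}


def standardize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardize_product_type(raw):
    """Return the controlled term, or None for an unmapped label."""
    return PRODUCT_TYPE_MAP.get(raw.strip().lower())
```

Returning `None` for unparseable dates and unmapped labels, rather than guessing, leaves those values visible to later quality checks.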
3. Record Linking and Deduplication
The Hub uses a combination of automated and manual methods to link records across sources:
- Exact Match: Matching on product name and manufacturer.
- Fuzzy Match: Handling variations in spelling, synonyms, and brand names.
- Identifier Match: Using unique IDs like ClinicalTrials.gov (NCT) numbers or WHO Prequalification IDs.
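One way to combine the three match types is to try them from most to least reliable. The sketch below is an assumption about ordering and implementation, not a description of the Hub's actual linker: the field names (`nct_id`, `who_pq_id`, `name`, `manufacturer`) and the 0.85 similarity threshold are hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for whatever fuzzy matcher is really used. String similarity only covers spelling variations; synonyms and brand names would need a separate lookup table.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # assumed cut-off, would need tuning on real data
ID_FIELDS = ("nct_id", "who_pq_id")  # hypothetical field names


def normalize(text):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def match_type(a, b):
    """Return 'identifier', 'exact', 'fuzzy', or None for two records."""
    # 1. Identifier match: a shared unique ID links records outright.
    for key in ID_FIELDS:
        if a.get(key) and a.get(key) == b.get(key):
            return "identifier"

    name_a, name_b = normalize(a["name"]), normalize(b["name"])
    maker_a = normalize(a["manufacturer"])
    maker_b = normalize(b["manufacturer"])

    # 2. Exact match on normalized product name and manufacturer.
    if name_a == name_b and maker_a == maker_b:
        return "exact"

    # 3. Fuzzy match: same manufacturer, similar product name.
    score = SequenceMatcher(None, name_a, name_b).ratio()
    if maker_a == maker_b and score >= FUZZY_THRESHOLD:
        return "fuzzy"
    return None
```

In a setup like this, fuzzy matches would be natural candidates for the manual review described under Quality Control, since they are the least certain of the three.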
4. Rule-Based Reconciliation
When multiple sources provide information on the same product, the following rules are applied:
- Trial Phase: The highest (most advanced) phase is recorded.
- Approval Status: If any source records “Yes” for regulatory approval (e.g., WHO PQ, US FDA), the product is marked as approved.
- Projected Dates: Dates from the most recent data extraction are prioritized.
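The reconciliation rules above can be sketched as a single function over the linked records for one product. The phase ordering, the field names, and the choice to fall back to an older extraction when the newest one has no projected date are assumptions made for illustration.

```python
from datetime import date

# Assumed phase vocabulary, ordered from least to most advanced.
PHASE_ORDER = ["Preclinical", "Phase 1", "Phase 2", "Phase 3", "Phase 4"]


def reconcile(records):
    """Merge linked source records for one product into a single view."""
    # Trial phase: keep the most advanced phase reported by any source.
    phases = [r["trial_phase"] for r in records
              if r.get("trial_phase") in PHASE_ORDER]
    highest = max(phases, key=PHASE_ORDER.index) if phases else None

    # Approval status: a single "Yes" marks the product as approved.
    approved = any(r.get("approved") == "Yes" for r in records)

    # Projected date: take it from the most recently extracted record
    # that actually carries one (assumed fallback behavior).
    dated = [r for r in records if r.get("projected_date")]
    latest = max(dated, key=lambda r: r["extracted_on"]) if dated else None

    return {
        "trial_phase": highest,
        "approved": approved,
        "projected_date": latest["projected_date"] if latest else None,
    }


# Illustrative records only.
merged = reconcile([
    {"trial_phase": "Phase 2", "approved": "No",
     "projected_date": "2027-06-30", "extracted_on": date(2025, 10, 1)},
    {"trial_phase": "Phase 3", "approved": "Yes",
     "projected_date": "2028-01-15", "extracted_on": date(2026, 1, 1)},
])
```

Because each rule is evaluated independently, the merged record can draw its phase, approval status, and projected date from different sources.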
Quality Control
- Automated Checks: Identification of missing values, out-of-range dates, and inconsistent categorical data.
- Manual Review: Periodic review by subject matter experts to resolve complex duplicates and validate high-priority product records.
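The automated checks could look something like the sketch below, which flags missing values, out-of-range dates, and categorical values outside the controlled vocabulary. The required fields, the allowed vocabulary, the plausible date range, and the issue labels are all hypothetical.

```python
from datetime import date

REQUIRED_FIELDS = ("name", "manufacturer", "product_type")  # assumed
ALLOWED_TYPES = {"Diagnostic"}  # illustrative controlled vocabulary
DATE_RANGE = (date(1990, 1, 1), date(2100, 12, 31))  # assumed bounds


def check_record(record):
    """Return a list of issue labels for one standardized record."""
    issues = []

    # Missing values: required fields must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            issues.append(f"missing:{field}")

    # Inconsistent categorical data: value outside the vocabulary.
    ptype = record.get("product_type")
    if ptype and ptype not in ALLOWED_TYPES:
        issues.append("invalid:product_type")

    # Out-of-range dates: expects YYYY-MM-DD after standardization.
    raw = record.get("projected_date")
    if raw:
        try:
            parsed = date.fromisoformat(raw)
        except ValueError:
            issues.append("unparseable:projected_date")
        else:
            if not DATE_RANGE[0] <= parsed <= DATE_RANGE[1]:
                issues.append("out_of_range:projected_date")
    return issues
```

Records that return a non-empty list could then be routed to the manual review queue rather than published directly.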
Integration Frequency
The data integration pipeline runs quarterly to keep the Hub current with its external sources.

© 2026 ALIGN Consortium. All rights reserved.