Documentation: Difference between revisions

    From UNITApedia
    Revision as of 13:19, 11 June 2025

    System Overview

    The UNITApedia system is composed of two integrated main components designed to enhance data accessibility, transparency, and collaboration among UNITA members. It connects a shared data warehouse with a MediaWiki-based front-end, creating a dynamic and scalable ecosystem for data visualization, management, and analysis.

    Shared Data Warehouse

    Acts as the central repository for structured data such as deliverables, indicators, and progress metrics. Uses metadata, ontology, and semantic web technologies to provide a comprehensive, interconnected view of data collected across all UNITA members. Supports data centralization, organization, and analysis, giving a unified view of the data ecosystem. Backed by PostgreSQL for complex queries, scalability, and robust data storage, with Apache HOP as the ETL tool used to build the data pipelines.

    MediaWiki-Based Front-End Interface

    Provides a user-friendly system for monitoring project progress, visualizing metrics, and assessing impact. Acts as the primary user interface, powered by extensions like External Data, Scribunto, and Semantic MediaWiki. Dynamically retrieves data through its API layer, integrating seamlessly with the data warehouse. Enhances decision-making and collaboration by providing stakeholders with real-time, actionable insights. Lets users share and collaborate with each other to extend the UNITA knowledge base.

    Key Features

    • Near real-time integrated data pipeline process:
      • Utilizes robust APIs to fetch and display updated information from the PostgreSQL database.
      • Near-instantaneous process from data extraction to final result display on UNITApedia.
    • User-Friendly Interface:
      • Built on MediaWiki, ensuring an intuitive experience for users of varying technical backgrounds.
      • Extensions like Page Forms and Semantic MediaWiki simplify data input, annotation, and querying.
    • Open Source:
      • Designed with modularity and scalability in mind, allowing deployment across other UNITA members or similar institutions.
      • Supports customization to meet unique institutional needs while adhering to UNITA’s vision.
    • Dynamic Queries:
      • Uses optimized prepared PostgreSQL statements and Lua scripting via MediaWiki extensions to deliver efficient and dynamic data visualization.
      • Allows advanced customization of data presentation formats based on user needs.
    • Scalable Architecture:
      • Employs a Dockerized infrastructure for each subsystem (MediaWiki, Strapi, PostgreSQL, Apache HOP, etc.), ensuring modularity and scalability.
      • Supports efficient deployment, updates, and resource allocation.
    • Enhanced Collaboration and Transparency:
      • Enables cross-institutional collaboration by centralizing data in the shared warehouse.
      • Provides stakeholders with real-time visualizations, ensuring informed decision-making and alignment with organizational goals.

    System Architecture

    This chapter provides an overview of the UNITApedia system architecture, highlighting the containerized design, data flows, and interactions between the various services. The architecture ensures scalability, maintainability, and security, while leveraging open-source technologies to facilitate collaboration and data accessibility across the UNITA alliance.

    The following considerations shaped the UNITApedia architecture:

    • Modularity & Scalability
      • Docker ensures each service is isolated, easily updated, and can be scaled independently if usage grows.
      • Clear separation of roles (Strapi for input, Apache HOP for ETL, MediaWiki for output) streamlines development and maintenance.
    • Open-Source & Extensibility
      • MediaWiki: Chosen for its mature ecosystem (extensions like Semantic MediaWiki, External Data, Page Forms) and robust community support.
      • PostgreSQL: Offers advanced query capabilities, reliability, and easy integration with Apache HOP.
      • MinIO: An open-source, S3-compatible object store that fits seamlessly into containerized deployments.
    • Security & SSL
      • Nginx-Proxy + ACME Companion: Provides automated certificate management and secure HTTPS connections, protecting data in transit.
      • Role-Based Access: Strapi enforces form-level permissions, while MediaWiki can be configured with namespace-based access for sensitive data.
    • Data Consistency & Quality
      • Apache HOP ETL: Ensures data from different sources (Strapi, MinIO CSVs) is validated, cleaned, and structured before landing in the datamart.
      • Semantic MediaWiki: Allows for structured data definitions and cross-referencing, ensuring consistent reporting across tasks and indicators.
    • Maintainability & Future Growth
      • Each service can be updated or replaced with minimal impact on the others, thanks to Docker’s container-based isolation.
      • The architecture can accommodate new data sources, additional tasks/indicators, or new alliances with minimal refactoring.


    Architecture representing the solution proposed within the framework of the Task 1.2 working groups.
    UNITApedia Global Architecture

    Request Flow

    1. User Interaction: A UNITA Office user or a Task Leader navigates to the UNITApedia URL.
    2. Nginx-Proxy: Receives the request over HTTPS and routes it to the appropriate container (MediaWiki, Strapi, etc.).
    3. Data Entry (Strapi): If the user is adding new indicator data, the form submission is stored in the Strapi database.
    4. ETL (Apache HOP): On a scheduled or on-demand basis, Apache HOP retrieves the new entries from Strapi (or CSV files in MinIO), applies transformations, and loads them into the datamart.
    5. MediaWiki Display: MediaWiki queries the datamart schema via the External Data extension to display up-to-date metrics on wiki pages or dashboards.
    6. Administration: pgAdmin is used by database administrators for maintenance tasks, accessible behind the Nginx-Proxy with proper credentials.
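
Steps 3 to 5 of the request flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the PostgreSQL databases, the loop stands in for an Apache HOP pipeline, and the table names, column names, and the `run_flow` function are hypothetical.

```python
import sqlite3


def run_flow(submissions, indicator):
    """Walk steps 3-5 of the request flow using in-memory stand-ins."""
    conn = sqlite3.connect(":memory:")
    # Step 3 stand-in: raw form submissions as stored by Strapi (hypothetical table).
    conn.execute(
        "CREATE TABLE strapi_indicator (office TEXT, indicator TEXT, value TEXT)"
    )
    # Step 5 stand-in: the datamart table that MediaWiki reads (hypothetical table).
    conn.execute(
        "CREATE TABLE datamart_indicator (office TEXT, indicator TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO strapi_indicator VALUES (?, ?, ?)", submissions)

    # Step 4 stand-in: a minimal ETL pass that drops rows whose value is not numeric.
    raw_rows = conn.execute(
        "SELECT office, indicator, value FROM strapi_indicator"
    ).fetchall()
    for office, name, raw_value in raw_rows:
        try:
            value = float(raw_value)
        except ValueError:
            continue  # rejected row; a real pipeline would log it
        conn.execute(
            "INSERT INTO datamart_indicator VALUES (?, ?, ?)",
            (office.strip(), name.strip(), value),
        )

    # Step 5: parameterized read, similar in spirit to a prepared statement.
    rows = conn.execute(
        "SELECT office, value FROM datamart_indicator "
        "WHERE indicator = ? ORDER BY office",
        (indicator,),
    ).fetchall()
    conn.close()
    return rows
```

Calling `run_flow` with a mix of valid and invalid submissions returns only the cleaned rows for the requested indicator, which mirrors what a wiki page would display after the ETL run.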


    UNITApedia Technological Architecture

    High-Level Overview

    UNITApedia is composed of several interconnected services running in Docker containers, orchestrated via Docker Compose. The main components are:

    Nginx-Proxy Service

    • Role: Acts as a reverse proxy, routing external HTTP/HTTPS requests to the appropriate backend service based on URL paths.
    • Security: Integrates with the ACME Companion service for automatic SSL certificate management and renewal, ensuring secure connections via HTTPS.
    • Endpoints: Forwards traffic to MediaWiki, Strapi, phpMyAdmin, pgAdmin, MinIO, Apache HOP, and any additional admin interfaces.
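
The path-based routing described above can be illustrated with a small Python sketch. It is not the Nginx configuration itself: prefixes such as /strapi/ and /hop/ follow the UNITApedia service links, while the container hostnames, ports, and the `resolve_backend` helper are hypothetical.

```python
# Path prefix -> backend container. Hostnames and ports are illustrative assumptions.
ROUTES = {
    "/strapi/": "strapi:1337",
    "/hop/": "hop:8080",
    "/pga/": "pgadmin:80",
    "/minio/": "minio:9000",
    "/": "mediawiki:80",  # fallback: everything else goes to the wiki
}


def resolve_backend(path, routes=ROUTES):
    """Return the backend registered for the longest matching path prefix."""
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    if not matches:
        raise LookupError(f"no backend configured for {path!r}")
    return routes[max(matches, key=len)]
```

Choosing the longest matching prefix keeps the catch-all "/" entry from shadowing the more specific service paths.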


    Nginx-Proxy Service Architecture

    MediaWiki Container

    • Primary Role: Serves as the user-facing front-end, allowing UNITA stakeholders to view, edit, and query data related to alliance activities and indicators.
    • Extensions: External Data, Scribunto, Semantic MediaWiki, and Page Forms.
    • Configuration: Managed via LocalSettings.php, which includes namespace definitions (e.g., DataSrc and Doc) and data source connections (prepared SQL statements).


    MediaWiki Service Architecture

    PostgreSQL (Data Warehouse)

    • Role: Central repository storing structured data such as deliverables, indicators, and metrics.
    • Multi-Database Setup:
      • strapi: Contains raw input tables from Strapi forms.
      • datamart: Holds transformed and processed data ready for MediaWiki queries.
      • unita-data: Contains additional metadata or wiki configuration tables.
    • Administration: Managed via pgAdmin for database operations (e.g., backups, user management).

    Apache HOP (ETL and Reporting)

    • Processes:
      • Data Retrieval: Fetches raw datasets from MinIO buckets (CSV files) or Strapi tables in PostgreSQL.
      • Data Transformation: Cleans and normalizes data, ensuring consistency (e.g., date formatting, numeric checks, selecting values).
      • Data Integration: Loads validated data into the datamart schema for consumption by MediaWiki.
    • Scheduling & Monitoring: Deployed Apache HOP “Carte Server” allows scheduling of jobs and transformations, with logs for error handling and performance monitoring.
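
The transformation step (date formatting and numeric checks) can be sketched in plain Python. In UNITApedia this work is done by Apache HOP pipelines, so the code below is only an illustration of the kind of rules involved; the column names (`indicator`, `date`, `value`), the accepted date formats, and both function names are assumptions.

```python
import csv
import io
from datetime import datetime

# Input date formats accepted by this sketch (an assumption, not a project rule).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_date(text):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if it cannot be parsed."""
    if not text:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_rows(csv_text):
    """Split CSV rows into validated records and rejected raw rows."""
    valid, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        date = normalize_date(row.get("date"))
        try:
            value = float(row.get("value"))
        except (TypeError, ValueError):
            value = None
        if date is None or value is None:
            rejected.append(row)  # kept aside for error reporting
        else:
            valid.append(
                {"indicator": row["indicator"].strip(), "date": date, "value": value}
            )
    return valid, rejected
```

Keeping rejected rows instead of silently dropping them makes it possible to report data-quality problems back to the office that supplied the file.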


    Apache HOP Service Architecture

    MinIO (Object Storage)

    • Role: Stores raw data files (CSV, PDFs, images, etc.) uploaded by UNITA Offices.
    • Integration: Apache HOP connects to MinIO using an S3-compatible interface, retrieving files for ETL processing.
    • Organization: Multiple buckets can be created (e.g., “dev” for storing Apache HOP transformations, and indicator buckets for storing the CSV files coming from UNITA Offices).

    Strapi (Middleware / Headless CMS)

    • Purpose: Provides a user-friendly interface for UNITA Offices to manually input or update indicator data.
    • Data Flow: Stores raw records in its own PostgreSQL database schema, which Apache HOP then reads, transforms, and pushes into datamart.
    • APIs: Exposes REST or GraphQL endpoints if needed for external integrations or advanced use cases.

    Administrative Interfaces

    • phpMyAdmin: Web-based administration tool for MariaDB (if used, e.g., for certain MediaWiki tables or other services).
    • pgAdmin: Used to manage PostgreSQL databases, including creation of new schemas, user roles, and backups.

    Docker-Based Infrastructure

    • Containerization: Each service (MediaWiki, Strapi, Apache HOP, PostgreSQL, MinIO, Nginx) runs in its own container, simplifying updates and scaling.
    • Networking: Docker Compose defines an internal network allowing containers to communicate securely without exposing internal ports directly to the public internet.
    • Environment Variables: Credentials and configuration details (e.g., database passwords, S3 access keys) are injected at runtime to keep them out of version control.
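
A service or helper script can read the injected values at startup and fail fast when one is missing. The sketch below shows that pattern in Python; the variable names (`POSTGRES_PASSWORD`, `S3_ACCESS_KEY`, `S3_SECRET_KEY`) and the `load_settings` function are hypothetical and may not match the names used in the actual deployment.

```python
import os

# Hypothetical variable names; the real deployment may use different ones.
REQUIRED = ("POSTGRES_PASSWORD", "S3_ACCESS_KEY", "S3_SECRET_KEY")


def load_settings(environ=os.environ, required=REQUIRED):
    """Read required settings from the environment, failing fast if any is missing."""
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in required}
```

Passing the environment as a parameter keeps the function easy to test with a plain dictionary, without touching the real process environment.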