Documentation
System Overview
The UNITApedia system is composed of two integrated main components designed to enhance data accessibility, transparency, and collaboration among UNITA members. It connects a shared data warehouse with a MediaWiki-based front-end, creating a dynamic and scalable ecosystem for data visualization, management, and analysis.
Shared Data Warehouse
- Acts as the central repository for structured data such as deliverables, indicators, and progress metrics.
- Utilizes metadata, ontology, and semantic web technologies to provide a comprehensive, interconnected view of data collected across all UNITA members.
- Supports efficient data centralization, organization, and analysis, ensuring a unified understanding of the data ecosystem.
- Backed by PostgreSQL, enabling complex queries, scalability, and robust data storage, with Apache HOP as the ETL tool for building data pipelines.
MediaWiki-Based Front-End Interface
- Provides a user-friendly system for monitoring project progress, visualizing metrics, and assessing impact.
- Acts as the primary user interface, powered by extensions such as External Data, Scribunto, and Semantic MediaWiki.
- Dynamically retrieves data through its API layer, integrating seamlessly with the data warehouse.
- Enhances decision-making and collaboration by providing stakeholders with real-time, actionable insights.
- Lets users share and collaborate with others to extend the UNITA knowledge base.
Key Features
- Near real-time integrated data pipeline:
- Utilizes robust APIs to fetch and display updated information from the PostgreSQL database.
- Near-instantaneous process from data extraction to final result display on UNITApedia.
- User-Friendly Interface:
- Built on MediaWiki, ensuring an intuitive experience for users of varying technical backgrounds.
- Extensions like Page Forms and Semantic MediaWiki simplify data input, annotation, and querying.
- Open Source:
- Designed with modularity and scalability in mind, allowing deployment across other UNITA members or similar institutions.
- Supports customization to meet unique institutional needs while adhering to UNITA’s vision.
- Dynamic Queries:
- Uses optimized prepared PostgreSQL statements and Lua scripting via MediaWiki extensions to deliver efficient and dynamic data visualization.
- Allows advanced customization of data presentation formats based on user needs.
- Scalable Architecture:
- Employs a Dockerized infrastructure for each subsystem (MediaWiki, Strapi, PostgreSQL, Apache HOP, etc.), ensuring modularity and scalability.
- Supports efficient deployment, updates, and resource allocation.
- Enhanced Collaboration and Transparency:
- Enables cross-institutional collaboration by centralizing data in the shared warehouse.
- Provides stakeholders with real-time visualizations, ensuring informed decision-making and alignment with organizational goals.
System Architecture
This chapter provides an overview of the UNITApedia system architecture, highlighting the containerized design, data flows, and interactions between the various services. The architecture ensures scalability, maintainability, and security, while leveraging open-source technologies to facilitate collaboration and data accessibility across the UNITA alliance.
The following considerations shaped the UNITApedia architecture:
- Modularity & Scalability
- Docker ensures each service is isolated, easily updated, and can be scaled independently if usage grows.
- Clear separation of roles (Strapi for input, Apache HOP for ETL, MediaWiki for output) streamlines development and maintenance.
- Open-Source & Extensibility
- MediaWiki: Chosen for its mature ecosystem (extensions like Semantic MediaWiki, External Data, Page Forms) and robust community support.
- PostgreSQL: Offers advanced query capabilities, reliability, and easy integration with Apache HOP.
- MinIO: An open-source, S3-compatible object store that fits seamlessly into containerized deployments.
- Security & SSL
- Nginx-Proxy + ACME Companion: Provides automated certificate management and secure HTTPS connections, protecting data in transit.
- Role-Based Access: Strapi enforces form-level permissions, while MediaWiki can be configured with namespace-based access for sensitive data.
- Data Consistency & Quality
- Apache HOP ETL: Ensures data from different sources (Strapi, MinIO CSVs) is validated, cleaned, and structured before landing in the datamart.
- Semantic MediaWiki: Allows for structured data definitions and cross-referencing, ensuring consistent reporting across tasks and indicators.
- Maintainability & Future Growth
- Each service can be updated or replaced with minimal impact on the others, thanks to Docker’s container-based isolation.
- The architecture can accommodate new data sources, additional tasks/indicators, or new alliances with minimal refactoring.
Request Flow
- User Interaction: A UNITA Office user or a Task Leader navigates to the UNITApedia URL.
- Nginx-Proxy: Receives the request over HTTPS and routes it to the appropriate container (MediaWiki, Strapi, etc.).
- Data Entry (Strapi): If the user is adding new indicator data, the form submission is stored in the Strapi database.
- ETL (Apache HOP): On a scheduled or on-demand basis, Apache HOP retrieves the new entries from Strapi (or CSV files in MinIO), applies transformations, and loads them into the datamart.
- MediaWiki Display: MediaWiki queries the datamart schema via the External Data extension to display up-to-date metrics on wiki pages or dashboards.
- Administration: pgAdmin is used by database administrators for maintenance tasks, accessible behind the Nginx-Proxy with proper credentials.
High-Level Overview
UNITApedia is composed of several interconnected services running in Docker containers, orchestrated via Docker Compose. The main components are:
Nginx-Proxy Service
- Role: Acts as a reverse proxy, routing external HTTP/HTTPS requests to the appropriate backend service based on URL paths.
- Security: Integrates with the ACME Companion service for automatic SSL certificate management and renewal, ensuring secure connections via HTTPS.
- Endpoints: Forwards traffic to MediaWiki, Strapi, phpMyAdmin, pgAdmin, MinIO, Apache HOP, and any additional admin interfaces.
MediaWiki Container
- Primary Role: Serves as the user-facing front-end, allowing UNITA stakeholders to view, edit, and query data related to alliance activities and indicators.
- Extensions:
- External Data – Dynamically queries the PostgreSQL data warehouse (datamart) to display indicators, metrics, and other information.
- Page Forms / Semantic MediaWiki – Enables structured data input and advanced semantic queries within the wiki.
- Scribunto (Lua) – Allows for more advanced scripting and data manipulation.
- PluggableAuth + OpenID Connect – Integrates SSO (e.g., Keycloak) for institutional login.
- Configuration: Managed via LocalSettings.php, which includes namespace definitions (e.g., DataSrc and Doc) and data source connections (prepared SQL statements).
PostgreSQL (Data Warehouse)
- Role: Central repository storing structured data such as deliverables, indicators, and metrics.
- Multi-Database Setup:
- strapi: Contains raw input tables from Strapi forms.
- datamart: Holds transformed and processed data ready for MediaWiki queries.
- unita-data: Contains additional metadata or wiki configuration tables.
- Administration: Managed via pgAdmin for database operations (e.g., backups, user management).
Apache HOP (ETL and Reporting)
- Processes:
- Data Retrieval: Fetches raw datasets from MinIO buckets (CSV files) or Strapi tables in PostgreSQL.
- Data Transformation: Cleans and normalizes data, ensuring consistency (e.g., date formatting, numeric checks, selecting values).
- Data Integration: Loads validated data into the datamart schema for consumption by MediaWiki.
- Scheduling & Monitoring: The deployed Apache HOP "Carte" server allows scheduling of jobs and transformations, with logs for error handling and performance monitoring.
MinIO (Object Storage)
- Role: Stores raw data files (CSV, PDFs, images, etc.) uploaded by UNITA Offices.
- Integration: Apache HOP connects to MinIO using an S3-compatible interface, retrieving files for ETL processing.
- Organization: Multiple buckets can be created (e.g., a "dev" bucket for Apache HOP transformations, and per-indicator buckets for CSV files uploaded by UNITA Offices).
Strapi (Middleware / Headless CMS)
- Purpose: Provides a user-friendly interface for UNITA Offices to manually input or update indicator data.
- Data Flow: Stores raw records in its own PostgreSQL database schema, which Apache HOP then reads, transforms, and pushes into the datamart.
- APIs: Exposes REST or GraphQL endpoints if needed for external integrations or advanced use cases.
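For external integrations, Strapi's REST API can be consumed directly. Below is a minimal PHP sketch, assuming Strapi v4's default /api prefix, a hypothetical indicator-entries collection, and an API token exposed through an (assumed) STRAPI_API_TOKEN environment variable:

<?php
// Minimal read-only client for a hypothetical Strapi collection.
$base  = 'https://unitapedia.univ-unita.eu/strapi';
$token = getenv( 'STRAPI_API_TOKEN' );   // assumed variable name

$context = stream_context_create( [
    'http' => [
        'method' => 'GET',
        'header' => "Authorization: Bearer $token\r\nAccept: application/json\r\n",
    ],
] );

// Fetch the collection and decode the JSON payload
$response = file_get_contents( "$base/api/indicator-entries", false, $context );
$entries  = json_decode( $response, true );

foreach ( $entries['data'] ?? [] as $entry ) {
    // Strapi v4 wraps the record fields under "attributes"
    echo $entry['id'], ': ', json_encode( $entry['attributes'] ), PHP_EOL;
}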
Administrative Interfaces
- phpMyAdmin: Web-based administration tool for MariaDB (if used, e.g., for certain MediaWiki tables or other services).
- pgAdmin: Used to manage PostgreSQL databases, including creation of new schemas, user roles, and backups.
Docker-Based Infrastructure
- Containerization: Each service (MediaWiki, Strapi, Apache HOP, PostgreSQL, MinIO, Nginx) runs in its own container, simplifying updates and scaling.
- Networking: Docker Compose defines an internal network allowing containers to communicate securely without exposing internal ports directly to the public internet.
- Environment Variables: Credentials and configuration details (e.g., database passwords, S3 access keys) are injected at runtime to keep them out of version control.
LocalSettings Configuration (MediaWiki)
The LocalSettings.php file is the backbone of the UNITApedia Impact Observatory's MediaWiki installation. It drives everything from site identity to extensions, external data sources, caching, and security. Below is an overview of how the current configuration supports the site's functionality.
Basic Site Configuration
Site Identity & URLs
- $wgSitename: Set via the OBSERVATORY_NAME environment variable.
- $wgServer & $wgCanonicalServer: Use https:// + DOMAIN_NAME from the environment.
- $wgScriptPath: Set to "" so all URLs are relative to the webroot.
- $wgResourceBasePath: Mirrors $wgScriptPath for static assets.
Locales & Protocols
- Default language is English ($wgLanguageCode = "en").
- Shell locale forced to C.UTF-8 for consistent sorting/formatting.
- Raw HTML is enabled ($wgRawHtml = true) and an extra allowed protocol was added (https://elearn.univ-pau.fr/).
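For orientation, here is a minimal sketch of how these identity and locale settings could appear in LocalSettings.php; the setting names match those listed above, while the locale and protocol lines ($wgShellLocale, $wgUrlProtocols) are assumptions about how they are wired up:

# Site identity and URLs, driven by environment variables
$wgSitename         = getenv( 'OBSERVATORY_NAME' );
$wgServer           = 'https://' . getenv( 'DOMAIN_NAME' );
$wgCanonicalServer  = $wgServer;
$wgScriptPath       = '';              # all URLs are relative to the webroot
$wgResourceBasePath = $wgScriptPath;   # static assets follow the same base

# Locales and protocols
$wgLanguageCode   = 'en';
$wgShellLocale    = 'C.UTF-8';         # consistent sorting/formatting in shell calls
$wgRawHtml        = true;              # raw HTML allowed on wiki pages
$wgUrlProtocols[] = 'https://elearn.univ-pau.fr/';  # extra allowed link prefix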
User Preferences & Authentication
Email & Notifications
- Email is fully enabled ($wgEnableEmail, $wgEnableUserEmail, $wgEmailAuthentication).
- Sender address and emergency contact are pulled from the environment (MEDIAWIKI_PWD_EMAIL, MEDIAWIKI_CONTACT_EMAIL).
Login Options
- Local login via PluggableAuth is enabled ($wgPluggableAuth_EnableLocalLogin = true).
- A Keycloak/OpenID Connect example remains commented out for future SSO.
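A brief sketch of these settings, assuming the sender address and emergency contact map to the standard $wgPasswordSender and $wgEmergencyContact variables:

# Email and notifications
$wgEnableEmail         = true;
$wgEnableUserEmail     = true;
$wgEmailAuthentication = true;
$wgPasswordSender      = getenv( 'MEDIAWIKI_PWD_EMAIL' );      # sender address
$wgEmergencyContact    = getenv( 'MEDIAWIKI_CONTACT_EMAIL' );  # emergency contact

# Login: local accounts remain available alongside PluggableAuth
$wgPluggableAuth_EnableLocalLogin = true;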
Database Settings
Primary Database (MySQL/MariaDB)
- Type: mysql on host mariadb.
- Credentials and database name injected from MEDIAWIKI_DB_* environment variables.
- Table options: InnoDB with a binary charset.
Caching, File Uploads & Image Handling
Uploads & Commons
- File uploads are enabled ($wgEnableUploads = true).
- InstantCommons integration is turned on ($wgUseInstantCommons = true).
- ImageMagick is used for conversions ($wgUseImageMagick = true, convert command at /usr/bin/convert).
File Types & Security
- A broad list of extensions is allowed: png, gif, jpg, doc, xls, pdf, pptx, svg, etc.
- A MIME-type blacklist protects against script uploads (e.g. PHP, shell scripts, MS executables).
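A sketch of the upload-related settings described above; the file-extension list is abbreviated and the exact merge logic in the real file may differ:

# Uploads, InstantCommons and ImageMagick
$wgEnableUploads     = true;
$wgUseInstantCommons = true;
$wgUseImageMagick    = true;
$wgImageMagickConvertCommand = '/usr/bin/convert';

# Broaden the allowed upload types (subset of the list above)
$wgFileExtensions = array_merge( $wgFileExtensions,
    [ 'doc', 'xls', 'pdf', 'pptx', 'svg' ] );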
Localization & Time Zone
- Wiki text in English; PHP shell locale C.UTF-8.
- Time Zone: $wgLocaltimezone set to UTC, with date_default_timezone_set('UTC') called for consistency.
Security & HTTPS
- Secret & Upgrade Keys: $wgSecretKey and $wgUpgradeKey loaded from environment variables.
- HTTPS Enforcement: All traffic is forced over HTTPS ($wgForceHTTPS = true).
Skins, Permissions & User Groups
Skinning
- Default skin is Vector 2022 ($wgDefaultSkin = 'vector-2022'), with the older Vector (2011) skin disabled.
- All users are locked onto Vector 2022 ($wgVectorDefaultSkinVersion = '2', $wgVectorShowSkinPreferences = false).
User Rights
- Anonymous (*) users can read and edit pages but cannot create accounts.
- The registered user group loses the right to edit its own CSS/JS/JSON.
- sysop and custom roles (e.g., translator, recipes) have fine-grained SMW and Page Forms permissions.
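The rights above translate into group-permission entries along these lines; this is a sketch using standard MediaWiki permission keys, and the custom roles are only named, not fully spelled out:

# Anonymous users: read and edit, but no self-registration
$wgGroupPermissions['*']['read']          = true;
$wgGroupPermissions['*']['edit']          = true;
$wgGroupPermissions['*']['createaccount'] = false;

# Registered users lose the right to edit their own CSS/JS/JSON
$wgGroupPermissions['user']['editmyusercss']  = false;
$wgGroupPermissions['user']['editmyuserjs']   = false;
$wgGroupPermissions['user']['editmyuserjson'] = false;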
Enabled Extensions
A streamlined but powerful set of extensions is loaded via wfLoadExtension():
- Arrays
- Babel
- CategoryTree
- Cargo
- Cite
- CleanChanges
- CodeEditor/CodeMirror
- ConfirmEdit
- DataTransfer
- External Data
- ECharts
- FlexDiagrams
- Gadgets
- HeaderTabs
- IframePage
- ImageMap
- InputBox
- Interwiki
- MagicNoCache
- Maps
- Math
- ModernTimeline
- MultimediaViewer
- Network
- Nuke
- Page Forms
- PageImages
- PageSchemas
- ParserFunctions
- PdfHandler
- PluggableAuth
- Poem
- Renameuser
- ReplaceText
- SecureLinkFixer
- Scribunto
- SpamBlacklist
- SyntaxHighlight_GeSHi
- TabberNeue
- TemplateData
- TemplateWizard
- TextExtracts
- TitleBlacklist
- Translate
- TreeAndMenu
- UniversalLanguageSelector
- Variables
- VisualEditor
- Widgets
- WikiEditor
Mapping & Charts
- ECharts for rich, JS-driven charts.
- OpenStreetMap for coordinate-based maps.
Semantic MediaWiki Stack
- wfLoadExtension('SemanticMediaWiki') and enableSemantics(getenv('DOMAIN_NAME')) activate the core SMW stack.
- SMW add-ons: SemanticResultFormats, SemanticCompoundQueries, SemanticFormsSelect.
- Semantic links enabled in the DATASRC namespace.
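In configuration terms this amounts to roughly the following sketch; depending on whether the add-ons were installed via Composer or Git the load calls may differ, and NS_DATASRC refers to the custom namespace constant defined further below:

# Core Semantic MediaWiki
wfLoadExtension( 'SemanticMediaWiki' );
enableSemantics( getenv( 'DOMAIN_NAME' ) );

# SMW add-ons
wfLoadExtension( 'SemanticResultFormats' );
wfLoadExtension( 'SemanticCompoundQueries' );
wfLoadExtension( 'SemanticFormsSelect' );

# Allow semantic annotations in the DataSrc namespace
$smwgNamespacesWithSemanticLinks[NS_DATASRC] = true;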
External Data Sources & Query Files
- Local file source DDD pointing at /home/hub/data/files/dev/.
- A PostgreSQL source ID for live lookups.
- GET-allowance turned on ($wgExternalDataAllowGetters = true).
- Custom query includes: query_meta_unita.php, query_meta_indicators.php, query_raw.php, query_count.php, query_DEMO_DEC24.php, query_DEMO_JAN25.php, and an indicators.php aggregator (see the sketch below).
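A hedged sketch of what these declarations could look like with the External Data 3.x array syntax; the source IDs, credential variables, prepared query, and include path are illustrative rather than copied from the real file:

# Local file source "DDD" for files dropped under /home/hub/data/files/dev/
$wgExternalDataSources['DDD'] = [
    'path' => '/home/hub/data/files/dev/',
];

# PostgreSQL datamart source for live lookups
$wgExternalDataSources['datamart'] = [
    'type'     => 'postgres',
    'server'   => 'postgres',
    'name'     => 'datamart',
    'user'     => getenv( 'DATABASE_USERNAME' ),
    'password' => getenv( 'DATABASE_PASSWORD' ),   # assumed variable name
    # Prepared statements keep wiki-side queries parameterized
    'prepared' => [
        'indicator_by_id' => 'SELECT * FROM indicators WHERE id = $1;',
    ],
];

# GET-based retrieval, as noted above
$wgExternalDataAllowGetters = true;

# Shared query definitions (illustrative path)
require_once __DIR__ . '/queries/indicators.php';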
Custom Namespaces
- Doc (800/801) and DataSrc (810/811) namespaces defined for a structured separation of documentation vs. ingested data.
- A Recipes (805/806) namespace for specialized content.
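The namespace numbers above translate into definitions along these lines; the constant names are illustrative:

# Documentation namespace (800) and its talk namespace (801)
define( 'NS_DOC', 800 );
define( 'NS_DOC_TALK', 801 );
$wgExtraNamespaces[NS_DOC]      = 'Doc';
$wgExtraNamespaces[NS_DOC_TALK] = 'Doc_talk';

# Ingested-data namespace (810/811)
define( 'NS_DATASRC', 810 );
define( 'NS_DATASRC_TALK', 811 );
$wgExtraNamespaces[NS_DATASRC]      = 'DataSrc';
$wgExtraNamespaces[NS_DATASRC_TALK] = 'DataSrc_talk';

# Specialized "Recipes" content (805/806)
define( 'NS_RECIPES', 805 );
define( 'NS_RECIPES_TALK', 806 );
$wgExtraNamespaces[NS_RECIPES]      = 'Recipes';
$wgExtraNamespaces[NS_RECIPES_TALK] = 'Recipes_talk';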
Mail & Logging
- SMTP: Local Postfix on localhost:25, no authentication, unencrypted.
- Mail debug logs written to /tmp/mediawiki-mail.log.
Debugging & Development
- Error Display: All exception details, backtraces, SQL errors, and development warnings are enabled ($wgShowExceptionDetails = true, $wgShowDebug = true, etc.) for rapid troubleshooting.
Docker-Compose File Configuration
The UNITApedia system is deployed using Docker Compose (version 3), which orchestrates all the services required for the application. This configuration ensures modularity, scalability, and clear separation between components. Below is the updated Docker Compose configuration.
Version and Networks
- Version: The Compose file uses version 3.
- Networks: A custom network, observatory_net, is defined with the bridge driver.
networks:
  observatory_net:
    driver: bridge
Volumes
Persistent storage is defined through several named volumes:
volumes:
  mariadb:
  dw-data:
  pgadmin-volume:
  html:
  certs:
  acme:
  minio:
Services
Each service is defined with specific images, settings, and dependencies.
mariadb:
  image: mariadb:10.11
  container_name: mariadb
  restart: always
  networks:
    - observatory_net
  expose:
    - "3306"
  volumes:
    - mariadb:/var/lib/mysql
    - ./services/mariadb/_initdb.mariadb/:/docker-entrypoint-initdb.d/
  env_file:
    - .env
    - ./services/mariadb/.env
postgres:
  image: postgres:14.0-alpine
  container_name: postgres
  restart: unless-stopped
  networks:
    - observatory_net
  expose:
    - "5432"
  ports:
    - "5432"
  volumes:
    - dw-data:/var/lib/postgresql/data/
    - ./services/strapi/strapi.dump:/tmp/strapi.dump
    - ./services/postgres/_initdb.pg/:/docker-entrypoint-initdb.d/
  env_file:
    - .env
    - ./services/postgres/.env
mediawiki:
  build:
    context: ./services/mediawiki
    dockerfile: MediaWiki.Dockerfile
  container_name: mediawiki
  restart: always
  networks:
    - observatory_net
  expose:
    - "80"
  volumes:
    - ./services/mediawiki/LocalSettings.php:/var/www/html/LocalSettings.php:ro
    - ./services/mediawiki/composer.local.json:/var/www/html/composer.local.json
    - ./services/mediawiki/images/:/var/www/html/images/:rw
    - ./services/mediawiki/resources/assets/:/var/www/html/resources/assets/
    - ./services/mediawiki/extensions:/var/www/html/extensions/
    - ./services/mediawiki/mediawiki/:/var/www/html/mediawiki/:ro
  env_file:
    - .env
    - ./services/mediawiki/.env
  environment:
    VIRTUAL_HOST: ${DOMAIN_NAME}
    VIRTUAL_PATH: /
    VIRTUAL_PORT: "80"
    LETSENCRYPT_HOST: ${DOMAIN_NAME}
    LETSENCRYPT_EMAIL: ${LETSENCRYPT_EMAIL}
  depends_on:
    - mariadb
    - postgres
strapi:
  build:
    context: ./services/strapi
    dockerfile: Strapi.Dockerfile
  image: strapi/strapi:latest
  container_name: strapi
  restart: unless-stopped
  networks:
    - observatory_net
  expose:
    - "1337"
  volumes:
    - ./services/strapi/config:/opt/app/config
    - ./services/strapi/src:/opt/app/src
    - ./services/strapi/package.json:/opt/package.json
    - ./services/strapi/yarn.lock:/opt/yarn.lock
    - ./services/strapi/.env:/opt/app/.env
    - ./services/strapi/public/uploads:/opt/app/public/uploads
  env_file:
    - /data/impact-observatory/services/strapi/.env
  environment:
    VIRTUAL_HOST: ${DOMAIN_NAME}
    VIRTUAL_PATH: /strapi/
    VIRTUAL_DEST: /
    VIRTUAL_PORT: "1337"
    LETSENCRYPT_HOST: ${DOMAIN_NAME}
    LETSENCRYPT_EMAIL: ${LETSENCRYPT_EMAIL}
  depends_on:
    - postgres
  command: /bin/sh -c "yarn strapi ts:generate-types && yarn develop"
hop-web:
  image: apache/hop-web:latest
  container_name: hop-web
  restart: unless-stopped
  ports:
    - "8080"
  volumes:
    - ./services/hop-web/projects:/project
    - ./services/hop-web/tomcat/config:/config
  env_file:
    - .env
  environment:
    VIRTUAL_HOST: ${DOMAIN_NAME}
    VIRTUAL_PATH: /hop/
    VIRTUAL_DEST: /
    VIRTUAL_PORT: "8080"
    LETSENCRYPT_HOST: ${DOMAIN_NAME}
    LETSENCRYPT_EMAIL: ${LETSENCRYPT_EMAIL}
    AWS_ACCESS_KEY_ID: zcby8I0PeG1uprpYO4KR
    AWS_SECRET_ACCESS_KEY: xyaCmOf86QWiyGM3L5BfKFv5WQxS70pjKKAbqQIN
    AWS_REGION: us-east-1
    AWS_ENDPOINT: http://minio:9000
    AWS_PATH_STYLE: "true"
  networks:
    - observatory_net
  depends_on:
    - postgres
minio:
  image: minio/minio:latest
  container_name: minio
  restart: always
  networks:
    - observatory_net
  ports:
    - "9000"
    - "9001"
  volumes:
    - ./services/minio/data:/data
  env_file:
    - .env
    - ./services/minio/.env
  environment:
    VIRTUAL_HOST: ${DOMAIN_NAME}
    VIRTUAL_PATH: /minio/
    VIRTUAL_DEST: /
    VIRTUAL_PORT: "9001"
    MINIO_BROWSER_REDIRECT_URL: https://unitapedia.univ-unita.eu/minio
  command: server /data --console-address ":9001"
The phpmyadmin, pgadmin, nginx-proxy, and acme-companion services follow similar patterns for image, ports, volumes, networks, and virtual-host environment variables.
Makefile Configuration
The Makefile is designed to automate common tasks involved in managing the UNITApedia deployment. By reading settings from an .env file and an extensions.config file, it enables consistent builds, extension installation, maintenance operations, and backups.
Environment and Extension Configuration
- Environment Inclusion: The Makefile begins by including the .env file and ./services/mediawiki/.env, so that environment-specific variables (e.g., database credentials, domain names) are available throughout the build process.
- Extensions Configuration: The extensions.config file lists extension names and versions. Although the cloning logic is currently commented out, the Makefile can use tools like awk to extract EXTENSION_NAMES and EXTENSION_VERSIONS for automated fetching of MediaWiki extensions.
Core Targets
- help: Show available commands and descriptions.
- up: Start all containers ($(DC) up -d).
- down: Stop all containers ($(DC) down).
- restart: Restart all containers.
- logs: Follow container logs, showing the last 100 lines.
- clean: Remove local build & backup artifacts, reset the extensions directory, and prune unused Docker objects.
- build / build-no-ext: Build the full stack and initialize MediaWiki extensions and services.
- mw-install: Install Composer dependencies and run maintenance/update.php inside the MediaWiki container.
- mw-initialize: First-time Composer update (no scripts) and post-install scripts, then run maintenance/update.php.
- mw-dev-copy: Copy the MediaWiki container's /var/www/html contents into a local folder for development.
- backup: Run both backup-mariadb and backup-postgres.
- backup-mariadb: Dump the MariaDB database into a timestamped file under services/mariadb/backups.
- backup-postgres: Dump the PostgreSQL database into a timestamped file under services/postgres/backups.
# Load environment variables
include .env
include ./services/mediawiki/.env

DC = docker-compose
DEFAULT_EXT_VERSION = REL1_39

.DEFAULT_GOAL := help

help: ## Show help
	@echo "Available commands:"
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf " \033[36m%-20s\033[0m %s\n", $$1, $$2}'

# ───── Docker Lifecycle ─────
up: ## Start all containers
	$(DC) up -d

down: ## Stop all containers
	$(DC) down

restart: ## Restart all containers
	$(DC) down && $(DC) up -d

logs: ## Show container logs
	$(DC) logs -f --tail=100

clean: ## Remove local build & backup artifacts
	rm -rf mediawiki_container_folder services/**/backups
	rm -rf services/mediawiki/extensions && mkdir -p services/mediawiki/extensions
	docker system prune -a --force

build: mw-extensions ## Build full stack, install MediaWiki extensions, and start everything
	$(DC) build
	make mw-initialize

build-no-ext: ## Build full stack without re-installing MediaWiki extensions, and start everything
	$(DC) build
	make mw-initialize

# ───── MediaWiki ─────
mw-install: ## Install composer and update mediawiki (running maintenance script)
	$(DC) up -d
	sleep 10
	$(DC) exec mediawiki composer install
	$(DC) exec mediawiki php maintenance/update.php

mw-initialize: ## First-time Composer install with no scripts (safe for fresh builds)
	$(DC) up -d
	sleep 10
	@echo "📦 Composer update (no-scripts)..."
	$(DC) exec mediawiki composer update --no-dev --prefer-dist --optimize-autoloader --no-scripts
	@echo "⚙️ Running post-install/update scripts manually (if any)..."
	$(DC) exec mediawiki composer run-script post-update-cmd || true
	@echo "🧹 Running MediaWiki update.php..."
	$(DC) exec mediawiki php maintenance/update.php

mw-dev-copy: ## Copy MediaWiki container contents locally (for dev)
	rm -rf mediawiki_container_folder
	$(DC) cp mediawiki:/var/www/html/. mediawiki_container_folder/

# ───── Database Backups ─────
backup: ## Backup all databases
	make backup-mariadb
	make backup-postgres

backup-mariadb:
	mkdir -p services/mariadb/backups
	$(DC) exec database sh -c "mysqldump -u$$MARIADB_USER -p$$MARIADB_PWD $$MEDIAWIKI_DB_NAME" > services/mariadb/backups/backup_$(shell date +%F_%T).sql

backup-postgres:
	mkdir -p services/postgres/backups
	$(DC) exec datawarehouse pg_dump -U $$DATABASE_USERNAME $$DATABASE_NAME > services/postgres/backups/backup_$(shell date +%F_%T).sql
Data Architecture
This chapter outlines the data architecture of UNITApedia, which is pivotal for ensuring that data is consistent, reliable, and readily accessible for the Impact Observatory. It explains how the design supports accurate and timely monitoring and evaluation of the alliance's impact. It covers the complete data lifecycle, from ingestion and modeling to processing, storage, and access, details the systems and processes that manage data flow, establishes governance and security measures, and discusses strategies for scalability and performance optimization.
Data Sources and Ingestion
Data Sources
Manually
- Strapi Input Forms: Certain UNITA activities (e.g., task-based deliverables) require manual data entry. Authorized users in each UNITA office submit relevant metrics and progress updates through Strapi forms.
Semi-Automatically
- MinIO Uploads (CSV): UNITA offices may upload internal data from local databases (e.g., student mobility records, research outputs, events) into MinIO buckets for automatic ingestion into the data warehouse.
- Publicly Available Data Sets: Data from external APIs (e.g., Moodle, Erasmus+ portals) or open data repositories may be incorporated for broader impact analysis or benchmarking.
Data Ingestion Methods
ETL Pipeline (Semi-automatic – MinIO)
- Apache HOP Integration: Apache HOP serves as the primary Extract-Transform-Load (ETL) tool. Scheduled jobs periodically check the configured MinIO bucket for each indicator, where CSV or other structured files are uploaded by UNITA partners.
- Data Transformation: Once Apache HOP detects new files in MinIO, it cleanses and transforms the data according to predefined mappings and rules (e.g., converting date formats, normalizing institution names).
- Loading into Data Warehouse: After validation, the transformed data is loaded into PostgreSQL, ensuring consistent schemas and reliable storage. Any errors or exceptions (e.g., missing columns, incorrect data types) are logged and reported back to the relevant partners.
ETL Pipeline (Manual – Strapi forms)
- User Submission: For data that cannot be automatically generated, UNITA offices fill out Strapi forms for each indicator.
- Validation and Approval: Basic validation rules (e.g., mandatory fields, numeric ranges) are applied at form submission. Where needed, designated coordinators such as Task Leaders or Project Managers can review and approve entries before they are transferred to Apache HOP for transformation and integration.
- Data Transformation: Once Apache HOP detects new entries in the Strapi database on PostgreSQL, it cleanses and transforms the data according to predefined mappings and rules (e.g., converting date formats, normalizing institution names).
- Loading into Data Warehouse: After validation, the transformed data is loaded into the PostgreSQL data warehouse, ensuring consistent schemas and reliable storage. Any errors or exceptions (e.g., missing columns, incorrect data types) are logged and reported back to the relevant partners.
Batch vs. Near real-time Ingestion
- Batch Frequency: In most cases, ingestion jobs run on a scheduled basis—daily or weekly—depending on data volume and the nature of the indicators. For example, monthly metrics on student mobility may only require a weekly refresh.
- On-Demand Updates: When critical data (e.g., newly completed deliverables, urgent progress metrics) must be reflected quickly in dashboards, users can trigger an on-demand ETL job via the Apache HOP server.