= Data Architecture =

This chapter outlines the data architecture of UNITApedia, which is pivotal for ensuring that data is consistent, reliable, and readily accessible for the Impact Observatory. It explains how the design supports accurate and timely monitoring and evaluation of the university alliance's impact, covering the complete data lifecycle from data ingestion and modeling to processing, storage, and access. It details the systems and processes involved in managing data flow, establishes governance and security measures, and discusses strategies for scalability and performance optimization.

== Data Sources and Ingestion ==

=== Data Sources ===

# '''Manually'''
## [https://unitapedia.univ-unita.eu/strapi/ Strapi] Input Forms: Certain UNITA activities (e.g., task-based deliverables) require manual data entry. Authorized users in each UNITA office submit relevant metrics and progress updates through Strapi forms.
# '''Semi-Automatically'''
## [https://unitapedia.univ-unita.eu/minio/ MinIO] Uploads (CSV): UNITA offices may upload internal data from local databases (e.g., student mobility records, research outputs, events) into MinIO buckets for automatic ingestion into the data warehouse.
## Publicly Available Data Sets: Data from external APIs (e.g., Moodle, Erasmus+ portals) or open data repositories may be incorporated for broader impact analysis or benchmarking.

=== Data Ingestion Methods ===

* <q>'''ETL Pipeline (Semi-automatic – MinIO)'''</q>
# '''Apache HOP Integration''': [https://unitapedia.univ-unita.eu/hop/ Apache HOP] serves as the primary '''Extract-Transform-Load (ETL)''' tool. Its scheduled jobs periodically check the configured [https://unitapedia.univ-unita.eu/minio/ MinIO] buckets for each indicator, to which UNITA partners upload CSV or other structured files.
# '''Data Transformation''': Once [https://unitapedia.univ-unita.eu/hop/ Apache HOP] detects new files in [https://unitapedia.univ-unita.eu/minio/ MinIO], it cleanses and transforms the data according to predefined mappings and rules (e.g., converting date formats, normalizing institution names).
# '''Loading into Data Warehouse''': After validation, the transformed data is loaded into [https://unitapedia.univ-unita.eu/pga/ PostgreSQL], ensuring consistent schemas and reliable storage. Any errors or exceptions (e.g., missing columns, incorrect data types) are logged and reported back to the relevant partners.
* <q>'''ETL Pipeline (Manual – Strapi forms)'''</q>
# '''User Submission''': For data that cannot be automatically generated, UNITA offices fill out [https://unitapedia.univ-unita.eu/strapi/ Strapi] forms for each indicator.
# '''Validation and Approval''': Basic validation rules (e.g., mandatory fields, numeric ranges) are applied at form submission. Where needed, designated coordinators such as Task Leaders or Project Managers can review and approve entries before they are transferred to [https://unitapedia.univ-unita.eu/hop/ Apache HOP] for transformation and integration.
# '''Data Transformation''': Once [https://unitapedia.univ-unita.eu/hop/ Apache HOP] detects new entries in the [https://unitapedia.univ-unita.eu/strapi/ Strapi] database on [https://unitapedia.univ-unita.eu/pga/ PostgreSQL], it cleanses and transforms the data according to the same predefined mappings and rules (e.g., converting date formats, normalizing institution names).
# '''Loading into Data Warehouse''': After validation, the transformed data is loaded into the [https://unitapedia.univ-unita.eu/pga/ PostgreSQL] data warehouse, ensuring consistent schemas and reliable storage. Any errors or exceptions (e.g., missing columns, incorrect data types) are logged and reported back to the relevant partners. A sketch of such a load step is shown below.
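The load step in both pipelines ultimately executes SQL against the data warehouse. The following is a minimal sketch of what such a load might look like, assuming a hypothetical staging table <code>staging.t125_raw</code> holding freshly ingested rows and the <code>public.t125</code> datamart table used later in this chapter; the actual SQL is generated by the Apache HOP transformations.

<pre>
-- Hypothetical sketch of a final load step; the staging table and
-- validation rules are illustrative, not the actual HOP-generated SQL.
INSERT INTO public.t125 (id, people_active, date)
SELECT
    raw.id,
    raw.people_active::integer,            -- enforce the numeric type
    to_date(raw.entry_date, 'DD/MM/YYYY')  -- normalize the date format
FROM staging.t125_raw AS raw
WHERE raw.people_active ~ '^[0-9]+$'       -- skip rows failing validation
ON CONFLICT (id) DO UPDATE
    SET people_active = EXCLUDED.people_active,
        date          = EXCLUDED.date;
</pre>

Writing the load as an upsert keeps scheduled re-runs idempotent: a repeated batch job updates existing rows instead of duplicating them.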
* <q>'''Batch vs. Near Real-time Ingestion'''</q>
# '''Batch Frequency''': In most cases, ingestion jobs run on a scheduled basis (daily or weekly), depending on data volume and the nature of the indicators. For example, metrics on student mobility that are reported monthly may only require a weekly refresh.
# '''On-Demand Updates''': When critical data (e.g., newly completed deliverables, urgent progress metrics) must be reflected quickly in dashboards, users can trigger an on-demand ETL job via the [https://unitapedia.univ-unita.eu/hop/ Apache HOP] server.

[[File:Architecture Data.jpg|thumb|930px|center|Logical Architecture Data Input]]

== Data Modeling and Storage ==

=== Conceptual Data Model ===

==== Task ====
* Represents a high-level work package or sub-project (e.g., [https://unitapedia.univ-unita.eu/index.php/Manage_and_coordinate_UNITA T1.2], [https://unitapedia.univ-unita.eu/index.php/UNITA_Quality_Assurance T1.3], [https://unitapedia.univ-unita.eu/index.php/Education_and_research_%26_innovation_community T2.3]).
* Each Task can have multiple Indicators associated with it.

==== Indicator ====
* A specific metric or measure used to track progress and impact (e.g., <q>Number of UNITA offices established</q>, <q>Percentage of deliverables submitted on time</q>).
* Attributes include:
** ''ID'': Unique identifier.
** ''Task Reference'': Links the Indicator to a specific Task (e.g., [https://unitapedia.univ-unita.eu/index.php/Manage_and_coordinate_UNITA T1.2]).
** ''Description'': Brief text describing what the Indicator measures.
** ''Unit of Measurement'': Defines whether the Indicator is a percentage, numeric count, or other format.
** ''Source of Data'': Describes where the raw data originates (e.g., <q>EC Platform</q>, <q>UNITA Offices</q>, [https://unitapedia.univ-unita.eu/strapi/ <q>Strapi forms</q>], [https://unitapedia.univ-unita.eu/minio/ <q>MinIO CSVs</q>]).
** ''Frequency'': How often the data is expected or required (e.g., annual, monthly, after each event).
** ''Baseline'': The initial value at the start of the project (e.g., Baseline 2022).
** ''Target'': The goal to be reached by a certain date (e.g., Target 2027).
** ''Parent Indicator'' (optional): References another Indicator if there is a hierarchical or <q>roll-up</q> relationship (e.g., sub-indicators summing to a main indicator).

=== Example Data Flow ===
* [https://unitapedia.univ-unita.eu/strapi/ Strapi] or [https://unitapedia.univ-unita.eu/minio/ MinIO] → data is collected for each Indicator (e.g., number of participants at an event).
* [https://unitapedia.univ-unita.eu/hop/ Apache HOP] → cleanses/transforms the data and inserts or updates rows in <code>datamart.indicator</code>.
* MediaWiki → uses the [https://www.mediawiki.org/wiki/Extension:External_Data External Data extension] to query datamart tables (e.g., <code>t126</code>, <code>t125</code>) and store or display them on wiki pages.

=== Physical Data Model (PostgreSQL) ===
* '''Database Organization'''
** Strapi Database
*** Stores form entries and raw indicator data before processing.
*** Table examples: <code>t121s</code>, <code>t122s</code>, etc.
** unita-data Database
*** Holds metadata used by MediaWiki.
*** Includes tables or configurations for the wiki's [https://www.mediawiki.org/wiki/Extension:Semantic_MediaWiki semantic extensions].
** datamart Database
*** Core location for transformed and aggregated data coming from [https://unitapedia.univ-unita.eu/hop/ Apache HOP].
*** Example tables: <code>public.t126</code>, <code>public.t127</code>, etc. (each corresponding to a particular indicator set). A sketch of a possible table definition is shown below.

[[File:Postgres.png|thumb|930px|center|UNITApedia Datawarehouse]]
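The Example Data Flow above refers to a <code>datamart.indicator</code> table maintained by Apache HOP. Its exact definition lives in the ETL jobs, but based on the Indicator attributes listed in the conceptual model, a table along these lines is plausible (all column names and types here are illustrative assumptions):

<pre>
-- Illustrative DDL derived from the conceptual Indicator model;
-- the real schema is defined by the Apache HOP jobs.
CREATE TABLE indicator (
    id                  text PRIMARY KEY,              -- unique identifier
    task_reference      text NOT NULL,                 -- e.g. 'T1.2'
    description         text,                          -- what is measured
    unit_of_measurement text,                          -- percentage, count, ...
    source_of_data      text,                          -- 'Strapi forms', 'MinIO CSVs', ...
    frequency           text,                          -- annual, monthly, per event
    baseline            numeric,                       -- e.g. Baseline 2022
    target              numeric,                       -- e.g. Target 2027
    parent_indicator    text REFERENCES indicator (id) -- optional roll-up parent
);
</pre>

The self-referencing <code>parent_indicator</code> column captures the optional roll-up relationship, allowing sub-indicators to be aggregated into a main indicator with a recursive query.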
== Data Access and Consumption ==

=== MediaWiki for Data Retrieval ===

MediaWiki serves as the primary front end for displaying and interacting with UNITApedia data. Its [https://www.mediawiki.org/wiki/Extension:External_Data External Data extension] allows the wiki to query the [https://unitapedia.univ-unita.eu/pga/ PostgreSQL] datamart directly, retrieving the latest processed data (e.g., indicators, event counts, user engagement metrics).

==== External Data Extension ====

<pre>
$wgExternalDataSources['DW_UNITAData_DEMOJAN25_T125'] = [
    'server'   => 'datawarehouse',
    'type'     => 'postgres',
    'name'     => 'datamart',
    'user'     => getenv('DATABASE_USERNAME'),
    'password' => getenv('DATABASE_PASSWORD'),
    'prepared' => <<<'POSTGRE'
SELECT * FROM public.t125 ORDER BY id ASC;
POSTGRE
];
</pre>

* Defines a named data source (<code>DW_UNITAData_DEMOJAN25_T125</code>) pointing to the datamart database.
* Uses prepared SQL statements for efficient and secure queries.
* Data retrieval is performed via parser functions like <code>#get_db_data</code> to pull records into wiki pages.

==== Custom Namespaces ====

* <code>DataSrc</code> (ID 810) is used for storing and managing data imports.
* <code>Doc</code> (ID 800) holds documentation or supplementary content.
* These namespaces help organize data separately from content pages and control access/edit permissions as needed.

==== Example Page for Data Import ====

<pre>
{{#get_db_data:
 db=DW_UNITAData_DEMOJAN25_T125
 |data=id=id,People_active=people_active,entry_date=date
 |cache seconds=0
}}

{{#store_external_table: DW_UNITAData_DEMOJAN25_T125
 |id={{{id}}}
 |People_active={{{People_active}}}
 |entry_date={{{entry_date}}}
}}

{{#ask: [[-Has subobject::{{FULLPAGENAME}}]]
 |?id
 |?People_active
 |?entry_date
 |class=datatable
 |format=datatables
 |mainlabel=-
 |header=show
}}
</pre>

* <code>#get_db_data</code> fetches rows from <code>datamart.public.t125</code>.
* <code>#store_external_table</code> writes these rows into Semantic MediaWiki subobjects, enabling semantic queries.
* <code>#ask</code> displays the results in a dynamic DataTables format.

[[File:DataSrcNamespace.png|thumb|930px|center|DataSrc Namespace Demo]]

=== MediaWiki for Dashboards and Reports ===

MediaWiki's flexibility allows you to build custom dashboards for stakeholders using a combination of extensions (e.g., [https://www.mediawiki.org/wiki/Extension:PageForms PageForms], [https://www.mediawiki.org/wiki/Extension:Scribunto Scribunto], [https://www.mediawiki.org/wiki/Extension:Cargo Cargo], [https://www.mediawiki.org/wiki/Extension:Maps Maps], [https://www.mediawiki.org/wiki/Extension:Charts Charts]). You can present numeric indicators, graphs, and charts inline on wiki pages.

==== MediaWiki-Based Dashboards ====

* '''Charts and Graphs''': Extensions like [https://www.mediawiki.org/wiki/Extension:Graph Graph], [https://www.mediawiki.org/wiki/Extension:Maps Maps], or [https://www.mediawiki.org/wiki/Extension:Timeline Timeline] can visually represent indicator trends (e.g., monthly website traffic, number of participants at events).
* '''Interactive Tables''': The DataTables format (shown above) enables sorting, filtering, and pagination of tabular data.

==== Key Performance Indicators (KPIs) ====

* The system tracks various KPIs such as <q>Percentage of deliverables submitted on time</q>, <q>Number of participants in matching events</q>, and <q>Website traffic volume</q>.
* These metrics can be displayed as gauges, time-series charts, or tables, depending on stakeholder needs. A sketch of the kind of query that backs such a chart follows below.

[[File:Graphs example.png|thumb|930px|center|Graphs Extension]]
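Charts of this kind are ultimately fed by queries against the datamart. As a hypothetical illustration, a monthly time series of active people could be derived from the <code>public.t125</code> table shown earlier (assuming, as in the import example, that it carries <code>people_active</code> and <code>date</code> columns):

<pre>
-- Hypothetical aggregation behind a monthly trend chart;
-- the actual dashboard queries are defined per data source.
SELECT
    date_trunc('month', date)::date AS month,
    sum(people_active)              AS people_active
FROM public.t125
GROUP BY 1
ORDER BY 1;
</pre>

Such a query could be registered as an additional prepared statement in <code>$wgExternalDataSources</code> and rendered with the charting extension of choice.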
== Data Governance, Quality, and Security ==

=== Data Quality Assurance ===

==== Validation Rules ====

* '''Strapi Forms'''
** '''Field-Level Validation''': Certain fields (e.g., numerical indicators, date fields) must conform to a predefined format or fall within acceptable ranges. If a value is invalid, the user is prompted to correct it before the data is stored.
** '''Review & Publish Step''': Strapi includes a <q>Publish</q> button that serves as an extra checkpoint. Users must explicitly publish each entry after reviewing it, ensuring that data is accurate and complete before it proceeds to the transformation process in [https://unitapedia.univ-unita.eu/hop/ Apache HOP].
* '''Apache HOP Transformations'''
** Additional checks occur when data is moved from [https://unitapedia.univ-unita.eu/strapi/ Strapi] or [https://unitapedia.univ-unita.eu/minio/ MinIO] into the datamart. These checks may include data type conversions, date validations, and referential integrity (e.g., matching Task IDs to existing records such as [https://unitapedia.univ-unita.eu/index.php/Manage_and_coordinate_UNITA T1.2], [https://unitapedia.univ-unita.eu/index.php/UNITA_Quality_Assurance T1.3]); a sketch of such a check follows after this section.
** Records that fail validation can be flagged or routed to a separate output for manual review and correction.

==== Error Handling and Logging ====

* '''Transformation Logs''': [https://unitapedia.univ-unita.eu/hop/ Apache HOP] logs each step of the ETL process. If data is malformed or fails a validation check, the record is either flagged or routed to an <q>error output</q> step for manual review.
* '''Notifications''': Administrators (or designated data stewards) can receive email or dashboard alerts when ETL jobs fail or encounter anomalies.
* '''Rollback/Correction''': Erroneous data can be corrected in [https://unitapedia.univ-unita.eu/strapi/ Strapi] or in the CSV files and reprocessed. Historical error logs help identify recurring issues (e.g., formatting mistakes from a specific partner).
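A referential-integrity check of the kind described above can be expressed as a simple anti-join. The following is a hypothetical sketch (the staging and task tables are illustrative; in practice the check is implemented as steps inside an Apache HOP transformation):

<pre>
-- Hypothetical validation query: select submitted rows whose task ID
-- does not match any known task (e.g. 'T1.2', 'T1.3'), so they can be
-- routed to an error output for manual review.
SELECT s.*
FROM staging.submissions AS s
LEFT JOIN task AS t ON t.id = s.task_id
WHERE t.id IS NULL;
</pre>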
=== Governance Model ===

The governance model defines roles, responsibilities, and processes to ensure data integrity and proper stewardship of indicator definitions.

==== Roles and Responsibilities ====

* '''UNITA Office (CRUD at Alliance Level for Their University):'''
** Can create, read, update, and delete records for their institution's indicators.
** Responsible for ensuring that submitted data (e.g., participant counts, events) is accurate and timely.
* '''Task Leader (Read-Only at Task Level):'''
** Can monitor and view data for all universities participating in their assigned task(s) (e.g., [https://unitapedia.univ-unita.eu/index.php/Education_and_research_%26_innovation_community T2.3]).
** Typically does not edit or delete data, but may request corrections or clarifications from the UNITA Offices.
* '''Project Manager (CRUD at All Levels):'''
** Has the highest level of access, capable of modifying any indicator data across the alliance.
** Primarily uses this access for oversight or emergency corrections, while day-to-day data entry is handled by UNITA Offices.

==== Version Control of Data Definitions ====

* '''Indicator Definitions''': Each indicator (e.g., T1.2.5, T2.3) may evolve over time. Changes to definitions, such as calculation rules, target values, or frequency, should be documented and versioned.
* '''MediaWiki Tracking''': Edits to indicator documentation pages (in the <code>Doc</code> namespace) can be tracked via MediaWiki's revision history, providing a transparent audit trail.

=== Data Security ===

==== Access Control ====

* '''Database Roles and Privileges:'''
** Row-level or schema-level security can be configured if certain data must be restricted to specific roles (see the sketch at the end of this chapter).
* '''Application-Level Permissions:'''
** [https://unitapedia.univ-unita.eu/strapi/ Strapi] can enforce role-based access, ensuring only UNITA Office roles can submit or update data.
** MediaWiki's namespaces (e.g., <code>DataSrc</code>, <code>Doc</code>) can be locked down to specific groups or user roles for editing, while read access might be more widely available.

==== Authentication/Authorization ====

* '''MediaWiki''': Users log in to access pages or to perform data-related actions. In a production environment, Single Sign-On (SSO) or OAuth could be integrated to streamline user management.
* '''Strapi''': Form submissions require an authenticated user with the correct permissions.
* '''Apache HOP Carte''': Administrator credentials are required to deploy and schedule transformations.

==== Data Warehouse Compartmentalization ====

The UNITApedia data warehouse is architected to clearly separate different types of data through the use of multiple databases and schemas. Each database within the data warehouse serves a distinct purpose, and within each database, data is organized into logical schemas containing related tables. For example:

* '''Strapi Database:'''
** Contains raw input data from [https://unitapedia.univ-unita.eu/strapi/ Strapi] forms.
** Schemas in this database hold tables with initial, untransformed indicator entries.
* '''Datamart Database:'''
** Stores transformed and processed data ready for consumption by MediaWiki.
** Within the datamart, tables are organized under one or more schemas (commonly the <code>public</code> schema), such as <code>public.t125</code>, <code>public.t126</code>, etc.
* '''Additional Databases:'''
** The unita-data database contains metadata and configuration details used by MediaWiki.

This multi-database, multi-schema design enables:

* '''Isolation:''' Raw data is kept separate from processed data, reducing the risk of accidental overwrites and ensuring data integrity.
* '''Security:''' Different access controls can be applied at both the database and schema levels, so only authorized roles can access or modify sensitive data.
* '''Maintainability:''' Logical separation simplifies performance tuning, backup strategies, and data governance by grouping related tables together.
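To make the role model and compartmentalization concrete, the following sketch shows one way the roles described above could be mapped to PostgreSQL privileges and row-level security. All role, table, column, and setting names here are hypothetical; the actual grants depend on the deployment.

<pre>
-- Hypothetical sketch of the governance model in PostgreSQL;
-- role, column, and setting names are illustrative.
CREATE ROLE unita_office    LOGIN;   -- CRUD on their own university's rows
CREATE ROLE task_leader     LOGIN;   -- read-only at task level
CREATE ROLE project_manager LOGIN;   -- CRUD at all levels

GRANT SELECT                         ON ALL TABLES IN SCHEMA public TO task_leader;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO unita_office, project_manager;

-- Row-level security restricting each office to its own university
-- (assumes a 'university' column and a session setting naming the office).
ALTER TABLE public.t125 ENABLE ROW LEVEL SECURITY;
CREATE POLICY office_own_rows ON public.t125 FOR ALL TO unita_office
    USING (university = current_setting('unita.university', true));
CREATE POLICY leader_read_all ON public.t125 FOR SELECT TO task_leader
    USING (true);
CREATE POLICY manager_all ON public.t125 FOR ALL TO project_manager
    USING (true);
</pre>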