What problem are we trying to solve?


Today, companies and public-sector bodies capture and store enormous amounts of data about their customers and citizens.
Nevertheless, it is difficult for a single organization to house all the relevant data and all the skills needed to analyze it. Additional data may be needed to enrich the context and/or the solution space. Organizations therefore usually need to collaborate with external organizations (separate legal entities).

Different sources of data

Data may come from different sources:
  • Public sources such as open data portals
  • Datasets purchased from commercial data providers
  • Clients, suppliers or other stakeholders from the organization's environment

A data-sharing agreement must be in place

Whatever the source, a clear contract between data providers and data consumers must be in place.

Multiple regions, multiple regulations

Additionally, the two parties (consumers and providers) may be located in different countries or even different regions, which requires extra effort to comply with the regulations that apply in each.

Multiple data formats and storage providers

Furthermore, once the counterparties have been identified and the agreements are in place, additional technical hurdles appear. For example, in what format should the data be shared: Excel documents, JSON documents, SQL databases, NoSQL databases, or something else?

Avoid a single point of failure

It is generally not recommended to aggregate the data in a single data store, because that creates a "honeypot": one high-value target that is attractive to ransomware attacks. A more decentralized approach is therefore desirable.

Avoid role-based access attached to individual data items

Role-based access control (RBAC) has many benefits for centralized data management systems, but it becomes increasingly hard to maintain as the number of participants (different employees from different organizations) grows. In those situations, it would be better to automate access decisions by interpreting the roles and responsibilities stated in the data-sharing agreement.
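
To illustrate the idea of deriving access from the agreement rather than from per-item role assignments, here is a minimal Python sketch. It assumes the agreement has already been translated into machine-readable clauses; the `Clause` structure, the `is_allowed` function, and the organization, role, and dataset names are all hypothetical and are not part of any existing system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Clause:
    """One hypothetical clause of a data-sharing agreement."""
    organization: str
    role: str
    dataset: str
    actions: frozenset
    valid_until: date


def is_allowed(clauses, organization, role, dataset, action, today):
    """Return True if at least one unexpired clause grants the action."""
    return any(
        c.organization == organization
        and c.role == role
        and c.dataset == dataset
        and action in c.actions
        and today <= c.valid_until
        for c in clauses
    )


# Illustrative agreement (assumed data): two roles at one consumer organization.
AGREEMENT = [
    Clause("acme", "analyst", "sales-2023", frozenset({"read"}), date(2025, 12, 31)),
    Clause("acme", "steward", "sales-2023", frozenset({"read", "update"}), date(2025, 12, 31)),
]
```

In this sketch nobody is granted access to an individual data item. A request is checked against the clauses, so adding a new employee to a role, or letting the agreement expire, changes access without touching the data.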

Trade-off between search and importing data

In any centralized system, there is a trade-off between standardizing data before importing it and accepting limited search across disparate data formats. The ideal solution would give you the freedom to share data in any format (no specific schema) while still enabling users to query it as if it were rows and columns.
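
As a rough illustration of querying schema-free data as rows and columns, the Python sketch below reads a CSV document and a JSON document into the same list-of-dictionaries shape and filters them with one function. The helper names (`rows_from_csv`, `rows_from_json`, `query`) and the sample data are assumptions made for this example only; a real solution would need to handle many more formats and type conversions.

```python
import csv
import io
import json


def rows_from_csv(text):
    """Parse CSV text into a list of dict rows (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json(text):
    """Parse JSON text into a list of dict rows; a single object becomes one row."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def query(rows, columns, where=lambda row: True):
    """Select the given columns from rows matching `where`; missing columns become None."""
    return [{c: row.get(c) for c in columns} for row in rows if where(row)]


# Made-up sample data in two formats, with different sets of fields.
CSV_TEXT = "supplier,units\nA,120\nB,40\n"
JSON_TEXT = '[{"supplier": "C", "units": 75, "region": "EU"}]'

rows = rows_from_csv(CSV_TEXT) + rows_from_json(JSON_TEXT)
large = query(rows, ["supplier"], where=lambda r: int(r["units"]) > 50)
```

Neither source had to be converted to a shared schema before it was loaded. The cost moves to query time, where the caller must cope with missing columns and mixed value types, as the `int(...)` conversion in the filter shows.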