Community Decision around data storage strategy for reporting platform

Hi Everyone,

We currently use Druid as the storage layer for the OpenLMIS v3 reporting platform and recognize that implementations may have different requirements for expertise and scale when it comes to v3 reporting. We have created an overview of data storage options for the reporting platform that documents the pros and cons of choosing a specific storage technology for the reference distribution of the reporting stack.

I would like to start a conversation around this document and work toward a recommendation for the reporting reference distribution. Specifically, it would be good to get a sense of scale, skillset in data warehousing and willingness to adopt a structureless storage vs a traditional RDBMS in currently implementations. I hope to start the conversation here on the forums and add it to the next technical committee.

Please review the linked document above and post any thoughts or conversations here or as comments directly on the wiki.

Sincerely,

Craig

This is interesting Craig,

The way I’ve been viewing this so far is not as an either-or decision, but rather needing two concurrent approaches: Druid for dashboards and the sort of high-level indicators which senior managers, program leads, etc might use for visibility, and RDBMS/PostgreSQL for operational “reports” which perhaps might be leveraged in Superset, but more likely would be aimed at our banded report writer Jasper.

As an example of my thinking I’ll take one of the domains in OpenLMIS: stock management (aka inventory). Druid seems to fit well in creating DISC indicators from many stock cards with many line items across many locations grouping by different levels of geographic hierarchy (and presumably we’ll use administrative hierarchy later). There is we could say a certain business logic to those indicators, however I wouldn’t say that they’re duplicated in both the transactional and reporting stacks. A RDBMS which we might build to power the Routine Reporting microservice (not a transactional instance such as Stock Managements), might be useful in generating an operational/routine report such as “Which Facilities are below their min stock and haven’t submitted an requisition for the current period” (not a real example of a requested report that I know of atm). Both reports to my mind seem to have very different stakeholders with different visualization needs, and both appear to have the right scale for the right storage mechanism.

To my mind the community at large has already arrived at needing both storage solutions. I think we can also safely presume that the larger implementations have already built skill sets around reporting from Postgres - and some of the main challenges with the larger implementations involve either needing many materialized views to achieve acceptable (or perhaps sometimes not) responsiveness, or if not materialized views than utilizing custom ETL’s with a star or perhaps a snowflake schema on the other end. The simple operational reports (e.g. print a list of fridge makes and models in country) aren’t the ones which need much more than Postgres. Lastly I also know of one implementation (SIGLUS) which utilized Cubes, and I recall that the chief challenge was in the complexity of constructing schemas for some of these global health indicators.

In terms of replicated business logic, I’d like to see us explore Event Data Pumps (extract from Sam Newman’s Building Microservices) more. We’ve been running up against the need for data pumps in v3 for awhile, and now with Nifi and Kafka (and we’ve discussed atom and web-sub) I think we should re-visit how these data-pumps can be built inside the services which own the data (e.g. Stock Management, Requisition, etc).

That’s my thoughts. Hopefully I’m not missing something critical here? Did I miss the genesis for what started this need for a decision on storage?

Best,

Josh

···

On Thursday, May 17, 2018 at 4:34:11 PM UTC-7, Craig Appl wrote:

Hi Everyone,

We currently use Druid as the storage layer for the OpenLMIS v3 reporting platform and recognize that implementations may have different requirements for expertise and scale when it comes to v3 reporting. We have created an overview of data storage options for the reporting platform that documents the pros and cons of choosing a specific storage technology for the reference distribution of the reporting stack.

I would like to start a conversation around this document and work toward a recommendation for the reporting reference distribution. Specifically, it would be good to get a sense of scale, skillset in data warehousing and willingness to adopt a structureless storage vs a traditional RDBMS in currently implementations. I hope to start the conversation here on the forums and add it to the next technical committee.

Please review the linked document above and post any thoughts or conversations here or as comments directly on the wiki.

Sincerely,

Craig