Request for information on OpenLMIS v2 and v3 system monitoring technologies

Hi,

I would like to learn more about the technologies used to monitor production systems in OpenLMIS v2 and v3. These monitoring systems commonly identify the hardware resources in use, check whether services are available, and alert system administrators when failures occur. Examples include services like UptimeRobot, Graphite, Grafana, Scalyr, Nagios and Pingdom.

Can you share a list of the technologies in your deployment, describe how they are used, and indicate whether a paid subscription is required?

Have we developed a standard monitoring stack for OpenLMIS v3 other than Scalyr, which monitors and reports on logs?

Thank you,

Craig

Since you covered Scalyr, I’ll add that v3 did have Prometheus and Grafana early on. At that point though it was a bit early in our maturity and it was taking resources to report on the same information that Scalyr was already reporting on - log events, alerts on missing pings, disk, I/O, etc. Since then I’ve hoped (and we’ve [moved the needle](http://docs.openlmis.org/en/latest/conventions/serviceHealth.html) a little) that we could add [more instrumentation](https://openlmis.atlassian.net/browse/OLMIS-4567?filter=20546) to our micro-services to power Prometheus as well as to better inform Consul (and our [Health and Information service](https://github.com/OpenLMIS/openlmis-diagnostics)) of service availability. The last two things I’d add are that I’ve felt that performance debugging in Malawi’s implementation would be greatly aided by [request tracing](https://openlmis.atlassian.net/browse/OLMIS-4532), and that it’d be beneficial for someone to do a quick [spike](https://openlmis.atlassian.net/browse/OLMIS-1739) on v3 with New Relic.

For Malawi I know they combined a few AWS CloudWatch metrics into the mix with Scalyr as well - though I’d hesitate to speak for them in terms of how well it went or what gaps remain.

Thanks Craig.

Best,

Josh


Hello,

Yes, that's correct. In addition to what we can already monitor with Scalyr using just its agent, we have also connected it to AWS CloudWatch (which is easy and covered by the Scalyr docs: https://www.scalyr.com/solutions/import-cloudwatch). The number of metrics available in CloudWatch is huge, but some important ones that we alert on are:
  • CPU Credit Balance - this is absolutely crucial if you are using any burst-capable instances (like EC2 t2 or RDS t2). When those instances run out of credits they slow down a lot (only a small percentage of processing capacity is available). This was a huge problem in Malawi and usually required temporarily switching to a non-burstable instance when we ran out of credits.
  • Replica lag - we use a read-only database replica so that a Tableau client can connect and execute queries there. There are a number of reasons the replica may run into issues, which would result in inaccurate or out-of-date data in Tableau reports. We use this metric to make sure that the lag between the master database and the read replica is no more than a few minutes.
  • Available storage space - obviously we don’t want to run out of space on our server or database.
  • ELB latency - we monitor this since our instances are behind a load balancer. An alert on increased ELB latency, though, is usually caused by slowness behind the ELB and is followed by an additional alert.
  • Database connections - this used to be a problem: we were running out of available database connections and weren’t sure why there were sudden jumps in the number of connections being made. This isn’t happening anymore though.

Some other potentially interesting ones that we monitor, but haven’t had specific problems with yet, are read and write latency for the database. We also monitor a bunch of things using just Scalyr and an HTTP monitor, e.g. increased 4xx/5xx responses or exceptions and errors. We also monitor every single micro-service and alert when any of them stops responding. At this point, I believe our alerting setup covers all/most scenarios that can cause problems for the end users. We know about any potential problems before they do and can act on them quickly enough.
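To make the list above a little more concrete, here is a minimal sketch of how those five alerts could be written down as threshold rules. It is not the Malawi configuration and it is not Scalyr's alert syntax: the metric names are the standard CloudWatch ones, but `AlertRule`, `RULES`, `breached` and every threshold value are hypothetical and would need tuning per deployment.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class AlertRule:
    """One threshold check against the latest value of a CloudWatch metric."""
    namespace: str
    metric: str
    compare: Callable[[float, float], bool]  # called as compare(value, threshold)
    threshold: float
    description: str


# Metric names are real CloudWatch metrics; all thresholds are assumed
# placeholders, not values taken from any OpenLMIS deployment.
RULES: List[AlertRule] = [
    AlertRule("AWS/EC2", "CPUCreditBalance", operator.lt, 20.0,
              "burst credits nearly exhausted"),
    AlertRule("AWS/RDS", "ReplicaLag", operator.gt, 300.0,
              "read replica more than 5 minutes behind (seconds)"),
    AlertRule("AWS/RDS", "FreeStorageSpace", operator.lt, 5 * 1024 ** 3,
              "less than 5 GiB of database storage left (bytes)"),
    AlertRule("AWS/ELB", "Latency", operator.gt, 2.0,
              "load balancer latency above 2 seconds"),
    AlertRule("AWS/RDS", "DatabaseConnections", operator.gt, 80.0,
              "unusually many open database connections"),
]


def breached(rules: List[AlertRule],
             latest: Dict[Tuple[str, str], float]) -> List[str]:
    """Return one message per rule whose latest value crosses its threshold.

    `latest` maps (namespace, metric) to the most recent datapoint.
    Metrics with no datapoint are skipped rather than treated as alerts.
    """
    alerts = []
    for rule in rules:
        value = latest.get((rule.namespace, rule.metric))
        if value is not None and rule.compare(value, rule.threshold):
            alerts.append(
                f"{rule.namespace}/{rule.metric}: {rule.description} "
                f"(value={value})"
            )
    return alerts
```

Calling `breached(RULES, {("AWS/EC2", "CPUCreditBalance"): 3.0})` would return a single message for the credit-balance rule. In practice the datapoints would come from CloudWatch (or from Scalyr after the import), and a missing datapoint probably deserves its own alert rather than being skipped as it is here.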

Best regards,

  Sebastian.

Sebastian Brudziński
Senior Software Developer / Team Leader
SolDevelo Sp. z o.o. [LLC] / www.soldevelo.com
Al. Zwycięstwa 96/98, 81-451, Gdynia, Poland
Phone: +48 58 782 45 40 / Fax: +48 58 782 45 41
sbrudzinski@soldevelo.com

Thank you,

Does Scalyr require a paid subscription? Is there an open-source discount available to implementers and ministries of health?

Thanks,

Craig
