Failing performance tests

Hi all,

while working on the failing performance tests, we found that the random Bad Gateway and Gateway Timeout errors are caused by a lack of memory on the perftest instance. When the instance runs out of memory, the referencedata service is killed, and subsequent performance tests fail with the errors above. Our suggestion is to reduce the number of services deployed on perftest or to increase the server size.

Services that potentially could be removed from perftest are: cce, ftp, report, and notification. The diagnostics service is already removed.

What do you think?

Best,
Klaudia

Hi @Klaudia_Palkowska, a couple of questions for you:

  1. Have we significantly increased testing / testing data over the last 6-12 months that would cause higher memory usage? Or has this been going on for longer?
  2. Are we still turning the perftest server on only when we need it? I just did a quick check and it looks like it’s still running, and I haven’t found our code to turn it on/off, though I was certain we’d done this already.
  3. Did we change something around Sept 2019? It looks like our inter-zone spending has gone up since then. I’m not entirely sure this is the cause, but it looks like the DB is in a different AZ (east-1b) than the instance (east-1a).

Normally I’d say just increase the instance size, but looking at the AWS bill, the perftest server is costing us more than I expected, and I’d like to be sure we’re keeping our recommended deployment topology up to date with what testing is finding. Taking out those services is OK, though not ideal: I expect it wouldn’t give you much memory back (percentage-wise), and eventually we should be testing CCE anyway.

Thanks for digging into this.

Best,
Josh

Hi @joshzamor,

thanks for the reply. Answering your questions:

  1. We didn’t increase the testing data, and I’m not sure when the performance tests started to fail, but the ticket describing the issue was created over a year ago: https://openlmis.atlassian.net/browse/OLMIS-6006. We thought the issue could be related to adding the hapifhir service to perftest, but it looks like that service was added in Sept 2018, so this is probably the wrong track.
  2. The functional test server works as you described. I don’t think we’ve ever used this configuration on perftest.
  3. I see that uat2 is running but not used. Maybe that’s the reason for the higher costs.

Best,
Klaudia

Thanks @Klaudia_Palkowska! Then yes, let’s go ahead and bump up the instance size for now and see what kind of memory spikes we’re getting with the perftests. You’re right, I had confused the functional test server with the perftest one. Would it be possible to easily turn the perftest server on/off automatically as well? I think we decreased how often those tests run, so it’d be great to save on instance costs when it’d otherwise sit idle.

I wasn’t factoring uat2 into the cost, though it’d be great to save where we can. However, are we sure we’re not using it? Not for testing 3.9? Is it maybe hooked up to the reporting stack demo? I don’t remember why we left it running.

@joshzamor We could do that, but some kind of flag would be needed, because we still use perftest for manual performance tests, which we run for every release (a few times per RC). So the server would need to stay up until testing finishes. @Sebastian_Brudzinski What do you think?

@ibewes Are you sometimes using uat2? Or do you know if anyone is?

Hello @Klaudia_Palkowska - I do not know of anyone using UAT2.

I guess we could have a separate job that allows you to start/stop the performance test server on demand, separate from the performance testing job. We could just run it as needed.
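For illustration, such a job could be a thin wrapper around the AWS CLI. A minimal sketch, assuming the perftest EC2 instance is tagged `Name=perftest` (the tag value and script name here are hypothetical, not from our actual setup):

```shell
#!/usr/bin/env bash
# start-stop-perftest.sh -- start or stop the perftest EC2 instance on demand.
# Usage: ./start-stop-perftest.sh start|stop
set -euo pipefail

ACTION="${1:?usage: $0 start|stop}"

# Look up the instance by its Name tag (hypothetical tag value).
INSTANCE_ID=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=perftest" \
  --query "Reservations[].Instances[].InstanceId" \
  --output text)

case "$ACTION" in
  start) aws ec2 start-instances --instance-ids "$INSTANCE_ID" ;;
  stop)  aws ec2 stop-instances  --instance-ids "$INSTANCE_ID" ;;
  *)     echo "unknown action: $ACTION" >&2; exit 1 ;;
esac
```

The Jenkins job (or whatever runs it) would just invoke this script with `start` before a manual test session and `stop` afterwards.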

Best,
Sebastian


Thank you all,

I’ve bumped up the instance size and all performance tests are green after the first run :tada: I’m still not 100% sure if we can stop UAT2 or not…

Moreover, the ticket for starting/stopping perftest on demand has been created and is already in progress.

Best,
Klaudia

Given that no one on this forum claims to be using it, I’d go ahead and shut it down. Please note, though, that RDS instances can only be stopped for 7 days; after that, they are started again automatically. If we don’t want to use it anymore, we should terminate it completely.
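If we only want to pause it for now, the AWS CLI can stop or delete the database. A sketch, assuming the instance identifier is `uat2-db` (that identifier and the snapshot name are hypothetical):

```shell
# Stop the uat2 database temporarily.
# Note: AWS starts a stopped RDS instance again after 7 days.
aws rds stop-db-instance --db-instance-identifier uat2-db

# To get rid of it for good, delete it instead, keeping a final snapshot
# so the data could be restored later if someone does scream.
aws rds delete-db-instance \
  --db-instance-identifier uat2-db \
  --final-db-snapshot-identifier uat2-final-snapshot
```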

Best,
Sebastian

Thanks @Klaudia_Palkowska for getting that on/off capability going.

Agreed on UAT2, thanks @Sebastian_Brudzinski (BTW, I’ve just heard this called the scream test: shut it down and see if anyone screams).

As for turning it off: that instance is managed through Terraform, so we should shut it down by running Terraform’s destroy command on it. That terminates everything, but that should be fine, as it’s easy to bring it back with a terraform apply.
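Concretely, that would look something like this, run from the directory holding the uat2 Terraform configuration (the directory path is illustrative):

```shell
# From the directory with the uat2 Terraform configuration:
cd deployment/uat2

# Tear down everything Terraform manages for uat2.
terraform destroy

# Later, recreate it from the same configuration if it's needed again.
terraform apply
```

Both commands show a plan and prompt for confirmation before touching anything, so it’s safe to run destroy and review what would actually be removed.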