Skip to content

Conversation

@joyent-automation
Copy link

#23 Negative byte lag calculated from pgstatsmon metrics after shard rebuild

This PR was migrated-from-gerrit, https://cr.joyent.us/#/c/5256/.
The raw archive of this CR is here.
See MANTA-4594 for info on Joyent Eng's migration from Gerrit.

CR discussion

Patch Set 1 code comments
docs/migrating.md#37 @kellymclaughlin

Why is pg_recovery_wal_received_bytes not susceptible to the same problem of being a counter? Couldn't that value also decrease in the case of a role change?

docs/migrating.md#37 @KodyKantor

That's a good question. I didn't test this. I assumed that it wouldn't be affected in the same way.

I think the key issue is that most of the pg_stat_replication stats report on the downstream PG, which can change at any time. The pg_recovery set of stats only report on the PG being queried (and not a downstream server), so I think they shouldn't change... but I need to test that and will update with a comment in the GitHub issue when I have results.

docs/migrating.md#37 @KodyKantor

I finally got around to testing this. It appears that the stats we get from the WAL position administrative functions (like pg_current_xlog_flush_location() and friends) can safely operate as counters. I did a role change while running

select pg_last_xlog_receive_location() as recv, pg_last_xlog_replay_location() as repl;

in a .2 second \watch and the values continued to increment. I think this makes sense. In order for the value returned by the first function to decrease the upstream server would have to re-send WAL segments that have already been received by the downstream server. For the second function to have its value decrease the local server would have to re-replay previous WAL segments. I don't think either of these things happens.

However, there is a problem here. All of these leave around garbage after a sync gets promoted to primary because of pgstatsmon#22.

Also, I notice that I mention pgstatsmon#22 instead of pgstatsmon#23 in this migrating.md file. pgstatsmon#22 isn't fixed yet. I'll correct that issue.

lib/queries.js#140 @KodyKantor

The 'pg' module interprets 8-byte 'numeric' typed values as a strings. When these were counters that wasn't an issue because pgstatsmon would perform math on these values to convert them into the difference since the query was last run. Gauge fields have their values passed directly into node-artedi which will assert that the thing being passed in is actually a number and not a string.

To avoid casting everything as a number in the main pgstatsmon code I figured we could be deliberate and cast only the things that we know will be cast to strings by the 'pg' module.

This is probably a bug that should be fixed in node-artedi too, since node-artedi should be able to observe data points greater than MAX_SAFE_INTEGER, at least for the gauge type where no arithmetic is needed (usually).

@KodyKantor commented at 2018-12-17T21:16:07

Patch Set 2:

(1 comment)

@kellymclaughlin commented at 2018-12-18T20:26:16

Patch Set 1:

(1 comment)

@KodyKantor commented at 2018-12-19T18:08:16

Patch Set 1:

(1 comment)

@KodyKantor commented at 2019-05-16T18:22:36

Patch Set 1:

(1 comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants