If a `folsom_metrics_histogram_ets` owned table dies, kv_stat cannot recreate it #508

russelldb · 2013-03-18T10:24:24Z

Spiral and histogram metrics have their own ets tables. a Folsom gen_sever folsom_metrics_histogram_ets creates and owns these tables. In some environments this process seems to be crashing and taking all the metrics with them.

There are two problems with this.

folsom_metrics:delete/1 fails for spirals due to the order of the cascade.
folsom metrics are stored in 3 tables. Delete deletes the metric's own table
then deletes the metric meta data from the spiral table and finally the folsom table.
Since the ets table has already gone, the cascade delete fails before the metadata can
be removed, so the metric cannot be deleted or re-created. The logs fill up.
This issue (A crash of folsom_metrics_histogram_ets breaks all spiral metrics boundary/folsom#55) and these branches address that. https://github.com/basho/folsom/tree/rdb-spiral-delete and https://github.com/basho/folsom/tree/riak1.3.n-spiral-delete
Riak_kv_stat no longer crashes when there is an error updating a stat. To avoid large
message queues on busy nodes, riak_kv stats are updated in the calling process. When there is a stat error it is logged, but riak_kv_stat does not crash, so the stats are not
deleted and recreated. A broken stat stays broken.

The text was updated successfully, but these errors were encountered:

seancribbs · 2013-03-18T14:26:31Z

@russelldb Is it possible that we could introduce a table heir to receive the orphaned tables, and have the folsom process get them back when it restarts?

russelldb · 2013-03-18T14:28:05Z

Of course. But shouldn't we address the problem of recovering if the tables go away? An heir makes it less likely, but still possible. I still haven't found a way, short of deleting the tables directly, to reproduce the problem.

seancribbs · 2013-03-18T14:37:18Z

@russelldb Sure. Perhaps something should monitor that process? shrug

russelldb · 2013-03-18T16:04:16Z

@seancribbs yeah, but the problem is that all the kv stats are updated direct, in process, by get / put fsms, vodes, etc. They can't all monitor it. And if we funnel stats through a single process…we've been there. I have a branch that detects errors and asks another process to delete / recreate the broken stat. That plus the folsom patch is probably enough, I hope.

russelldb · 2013-03-19T19:30:51Z

Joe accepted two folsom PRs to fix this, both in master now. However for 1.3.1 we will probably need to release with a folsom fork that is the 1.3.0 tagged version + the two PRs.

russelldb · 2013-03-19T19:44:18Z

Sorry, dear reviewer, here are the PRs.

basho/riak_core#287
basho/riak_api#26
basho/riak_pipe#70
#513

And this branch of folsom

https://github.com/basho/folsom/tree/riak1.3.n-spiral-delete

To test this, have some background load of gets/puts etc running (I use the stat_punisher basho bench config from this PR #506).

Attach to the node
Delete all the spiral ets tables

Spirals = [N || {N, [{type, T}]} <- folsom_metrics:get_metrics_info(), T == spiral].
[begin {spiral, Tid, _Pid} = folsom_metrics_spiral:get_value(S), ets:delete(Tid) end || S <- Spirals].

Look at your log filling up
Get stats? You can't!

Apply all the branches from this issue

Attach to the node
Delete all the spiral ets tables

Spirals = [N || {N, [{type, T}]} <- folsom_metrics:get_metrics_info(), T == spiral].
[begin {spiral, Tid, _Pid} = folsom_metrics_spiral:get_value(S), ets:delete(Tid) end || S <- Spirals].

Look at the temporary burst of errors in the log
Get stats? You may have some unavailable until some process updates them (repair happens on update)

engelsanchez · 2013-03-21T21:17:45Z

I've played with the stats killing ETS tables here and there and everything seems kosher. My few little observations have been addressed in the different PRs and I can't come up with anything else
👍 💃 Ship it!

russelldb · 2013-04-03T14:12:12Z

Addressed by 1.3.1.

DeadZen · 2013-04-11T10:37:49Z

What about an ets:give_away to a manager process so the stats data isn't lost on a crash?

russelldb · 2013-04-11T10:47:27Z

Would need support at the folsom level, it has been discussed in this issue boundary/folsom#30 (and others, probably, see the 'race' issue too)

These stats are 1 minute sliding windows, losing them is a shame, but not a disaster. This seemed like the best way for riak_kv. Even with heirs in ets, you still need to deal with the eventual possibility of the ets table going away.

russelldb mentioned this issue Mar 20, 2013

Fix bug where stats endpoints were calculating _all_ riak_kv stats #506

Closed

ghost assigned engelsanchez Mar 20, 2013

russelldb closed this as completed Apr 3, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

If a `folsom_metrics_histogram_ets` owned table dies, kv_stat cannot recreate it #508

If a `folsom_metrics_histogram_ets` owned table dies, kv_stat cannot recreate it #508

russelldb commented Mar 18, 2013

seancribbs commented Mar 18, 2013

russelldb commented Mar 18, 2013

seancribbs commented Mar 18, 2013

russelldb commented Mar 18, 2013

russelldb commented Mar 19, 2013

russelldb commented Mar 19, 2013

engelsanchez commented Mar 21, 2013

russelldb commented Apr 3, 2013

DeadZen commented Apr 11, 2013

russelldb commented Apr 11, 2013

If a folsom_metrics_histogram_ets owned table dies, kv_stat cannot recreate it #508

If a folsom_metrics_histogram_ets owned table dies, kv_stat cannot recreate it #508

Comments

russelldb commented Mar 18, 2013

seancribbs commented Mar 18, 2013

russelldb commented Mar 18, 2013

seancribbs commented Mar 18, 2013

russelldb commented Mar 18, 2013

russelldb commented Mar 19, 2013

russelldb commented Mar 19, 2013

engelsanchez commented Mar 21, 2013

russelldb commented Apr 3, 2013

DeadZen commented Apr 11, 2013

russelldb commented Apr 11, 2013

If a `folsom_metrics_histogram_ets` owned table dies, kv_stat cannot recreate it #508

If a `folsom_metrics_histogram_ets` owned table dies, kv_stat cannot recreate it #508