If a folsom_metrics_histogram_ets owned table dies, kv_stat cannot recreate it
#508
Comments
@russelldb Is it possible that we could introduce a table heir to receive the orphaned tables, and have the folsom process get them back when it restarts?
Of course. But shouldn't we address the problem of recovering if the tables go away? An heir makes it less likely, but still possible. I still haven't found a way, short of deleting the tables directly, to reproduce the problem.
@russelldb Sure. Perhaps something should monitor that process? shrug
@seancribbs yeah, but the problem is that all the kv stats are updated directly, in process, by get / put fsms, vnodes, etc. They can't all monitor it. And if we funnel stats through a single process… we've been there. I have a branch that detects errors and asks another process to delete / recreate the broken stat. That plus the folsom patch is probably enough, I hope.
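For illustration only, the detect-and-repair idea in the comment above could look roughly like the sketch below. It is not the code from the branch: a plain named ets table stands in for a folsom metric, and the module name, `update/3`, `repairer/0` and the message shapes are all hypothetical. The caller also waits for the repair here, purely to keep the demo deterministic.

```erlang
-module(stat_repair_sketch).
-export([demo/0]).

%% Hypothetical sketch: a named ets table stands in for a folsom metric.

%% Runs in the calling process (think get/put fsm or vnode). On failure
%% it does not crash; it asks the repairer to delete and recreate the stat.
update(Repairer, Tab, Key) ->
    try
        ets:update_counter(Tab, Key, 1)
    catch
        error:badarg ->
            Repairer ! {repair, self(), Tab, Key},
            receive
                {repaired, Tab} -> {error, repaired}
            after 1000 ->
                {error, timeout}
            end
    end.

%% Single process that owns the recreated tables.
repairer() ->
    receive
        {repair, From, Tab, Key} ->
            %% The table may or may not still exist, so ignore a failed delete.
            catch ets:delete(Tab),
            Tab = ets:new(Tab, [named_table, public, set]),
            true = ets:insert(Tab, {Key, 0}),
            From ! {repaired, Tab},
            repairer();
        stop ->
            ok
    end.

demo() ->
    Repairer = spawn(fun repairer/0),
    %% The table does not exist yet (as after an owner crash): triggers a repair.
    First = update(Repairer, sketch_stat, gets),
    %% The recreated table accepts updates again.
    Second = update(Repairer, sketch_stat, gets),
    Repairer ! stop,
    {First, Second}.
```

In a fresh shell, `stat_repair_sketch:demo().` should return `{{error, repaired}, 1}`: the first update fails and triggers the repair, the second succeeds against the recreated table.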
Joe accepted two folsom PRs to fix this, both in master now. However, for 1.3.1 we will probably need to release with a folsom fork that is the 1.3.0 tagged version + the two PRs.
Sorry, dear reviewer, here are the PRs: basho/riak_core#287 and this branch of folsom: https://github.com/basho/folsom/tree/riak1.3.n-spiral-delete
To test this, have some background load of gets/puts etc. running (I use the stat_punisher basho bench config from this PR #506).
1. Attach to the node
2. Look at your log filling up
3. Apply all the branches from this issue
4. Attach to the node
5. Look at the temporary burst of errors in the log
I've played with the stats, killing ETS tables here and there, and everything seems kosher. My few little observations have been addressed in the different PRs and I can't come up with anything else.
Addressed by 1.3.1.
What about an ets:give_away to a manager process so the stats data isn't lost on a crash?
Would need support at the folsom level; it has been discussed in this issue boundary/folsom#30 (and others, probably; see the 'race' issue too). These stats are 1 minute sliding windows; losing them is a shame, but not a disaster. This seemed like the best way for riak_kv. Even with heirs in ets, you still need to deal with the eventual possibility of the ets table going away.
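For readers unfamiliar with ets heirs, here is a minimal sketch of the mechanism being discussed. It only shows stock ets behaviour (the `heir` option and the `'ETS-TRANSFER'` message); it is not how folsom works, and the module, table and key names are made up for the example.

```erlang
-module(heir_sketch).
-export([demo/0]).

%% Hypothetical demo: an owner process creates a table with the caller as
%% heir, then crashes; the caller receives the table via 'ETS-TRANSFER'.
demo() ->
    Heir = self(),
    Owner = spawn(fun() ->
        T = ets:new(sketch_stats, [public, set, {heir, Heir, sketch_heir_data}]),
        true = ets:insert(T, {puts, 42}),
        exit(crashed)
    end),
    receive
        {'ETS-TRANSFER', Tab, Owner, sketch_heir_data} ->
            %% The data survived the owner's crash and now belongs to us.
            [{puts, Count}] = ets:lookup(Tab, puts),
            true = ets:delete(Tab),
            {inherited, Count}
    after 1000 ->
        timeout
    end.
```

`heir_sketch:demo().` should return `{inherited, 42}`. As noted above, this only narrows the window: if the heir itself dies, or the table is deleted outright, the table is still gone and the caller has to cope with that.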
Spiral and histogram metrics have their own ets tables. A folsom gen_server, folsom_metrics_histogram_ets, creates and owns these tables. In some environments this process seems to be crashing and taking all the metrics with it. There are two problems with this.
1. folsom metrics are stored in 3 tables. Delete first deletes the metric's own table, then deletes the metric metadata from the spiral table, and finally from the folsom table. Since the ets table has already gone, the cascade delete fails before the metadata can be removed, so the metric cannot be deleted or re-created. The logs fill up. This issue (A crash of folsom_metrics_histogram_ets breaks all spiral metrics boundary/folsom#55) and these branches address that: https://github.com/basho/folsom/tree/rdb-spiral-delete and https://github.com/basho/folsom/tree/riak1.3.n-spiral-delete
2. To avoid message queues on busy nodes, riak_kv stats are updated in the calling process. When there is a stat error it is logged, but riak_kv_stat does not crash, so the stats are not deleted and recreated. A broken stat stays broken.
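To make the first problem concrete, here is a rough sketch of a cascade delete that tolerates the metric's own table being gone. This is an assumption about one possible shape of a fix, not the content of the folsom branches: three plain ets tables stand in for the metric's table, the spiral metadata table and the main folsom table, and every name is hypothetical.

```erlang
-module(cascade_delete_sketch).
-export([demo/0]).

%% Hypothetical names throughout; plain ets tables stand in for folsom's.

%% Delete the metric's own table only if it still exists, then always
%% remove the metadata so that the metric can be created again.
%% (There is a small race between the check and the delete; a real
%% implementation would have to handle that as well.)
delete_metric(Name, MetricTab, SpiralMeta, Registry) ->
    case ets:info(MetricTab) of
        undefined -> ok;                        % table already gone
        _ -> true = ets:delete(MetricTab)
    end,
    true = ets:delete(SpiralMeta, Name),
    true = ets:delete(Registry, Name),
    ok.

demo() ->
    SpiralMeta = ets:new(sketch_spiral_meta, [set]),
    Registry = ets:new(sketch_registry, [set]),
    MetricTab = ets:new(sketch_metric, [set]),
    true = ets:insert(SpiralMeta, {my_spiral, MetricTab}),
    true = ets:insert(Registry, {my_spiral, spiral}),
    %% Simulate the owner crash taking the metric's table away.
    true = ets:delete(MetricTab),
    ok = delete_metric(my_spiral, MetricTab, SpiralMeta, Registry),
    Result = {ets:lookup(SpiralMeta, my_spiral), ets:lookup(Registry, my_spiral)},
    true = ets:delete(SpiralMeta),
    true = ets:delete(Registry),
    Result.
```

`cascade_delete_sketch:demo().` should return `{[], []}`: both metadata entries are removed even though the metric's own table had already disappeared, which is what allows the metric to be re-created afterwards.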