Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If a folsom_metrics_histogram_ets owned table dies, kv_stat cannot recreate it #508

Closed
russelldb opened this issue Mar 18, 2013 · 10 comments

Comments

@russelldb
Copy link
Member

Spiral and histogram metrics have their own ets tables. a Folsom gen_sever folsom_metrics_histogram_ets creates and owns these tables. In some environments this process seems to be crashing and taking all the metrics with them.

There are two problems with this.

  1. folsom_metrics:delete/1 fails for spirals due to the order of the cascade.
    folsom metrics are stored in 3 tables. Delete deletes the metric's own table
    then deletes the metric meta data from the spiral table and finally the folsom table.
    Since the ets table has already gone, the cascade delete fails before the metadata can
    be removed, so the metric cannot be deleted or re-created. The logs fill up.
    This issue (A crash of folsom_metrics_histogram_ets breaks all spiral metrics boundary/folsom#55) and these branches address that. https://github.com/basho/folsom/tree/rdb-spiral-delete and https://github.com/basho/folsom/tree/riak1.3.n-spiral-delete
  2. Riak_kv_stat no longer crashes when there is an error updating a stat. To avoid large
    message queues on busy nodes, riak_kv stats are updated in the calling process. When there is a stat error it is logged, but riak_kv_stat does not crash, so the stats are not
    deleted and recreated. A broken stat stays broken.
@seancribbs
Copy link
Contributor

@russelldb Is it possible that we could introduce a table heir to receive the orphaned tables, and have the folsom process get them back when it restarts?

@russelldb
Copy link
Member Author

Of course. But shouldn't we address the problem of recovering if the tables go away? An heir makes it less likely, but still possible. I still haven't found a way, short of deleting the tables directly, to reproduce the problem.

@seancribbs
Copy link
Contributor

@russelldb Sure. Perhaps something should monitor that process? shrug

@russelldb
Copy link
Member Author

@seancribbs yeah, but the problem is that all the kv stats are updated direct, in process, by get / put fsms, vodes, etc. They can't all monitor it. And if we funnel stats through a single process…we've been there. I have a branch that detects errors and asks another process to delete / recreate the broken stat. That plus the folsom patch is probably enough, I hope.

@russelldb
Copy link
Member Author

Joe accepted two folsom PRs to fix this, both in master now. However for 1.3.1 we will probably need to release with a folsom fork that is the 1.3.0 tagged version + the two PRs.

@russelldb
Copy link
Member Author

Sorry, dear reviewer, here are the PRs.

basho/riak_core#287
basho/riak_api#26
basho/riak_pipe#70
#513

And this branch of folsom

https://github.com/basho/folsom/tree/riak1.3.n-spiral-delete

To test this, have some background load of gets/puts etc running (I use the stat_punisher basho bench config from this PR #506).

Attach to the node
Delete all the spiral ets tables

Spirals = [N || {N, [{type, T}]} <- folsom_metrics:get_metrics_info(), T == spiral].
[begin {spiral, Tid, _Pid} = folsom_metrics_spiral:get_value(S), ets:delete(Tid) end || S <- Spirals].

Look at your log filling up
Get stats? You can't!

Apply all the branches from this issue

Attach to the node
Delete all the spiral ets tables

Spirals = [N || {N, [{type, T}]} <- folsom_metrics:get_metrics_info(), T == spiral].
[begin {spiral, Tid, _Pid} = folsom_metrics_spiral:get_value(S), ets:delete(Tid) end || S <- Spirals].

Look at the temporary burst of errors in the log
Get stats? You may have some unavailable until some process updates them (repair happens on update)

@engelsanchez
Copy link
Contributor

I've played with the stats killing ETS tables here and there and everything seems kosher. My few little observations have been addressed in the different PRs and I can't come up with anything else
👍 💃 Ship it!

@russelldb
Copy link
Member Author

Addressed by 1.3.1.

@DeadZen
Copy link

DeadZen commented Apr 11, 2013

What about an ets:give_away to a manager process so the stats data isn't lost on a crash?

@russelldb
Copy link
Member Author

Would need support at the folsom level, it has been discussed in this issue boundary/folsom#30 (and others, probably, see the 'race' issue too)

These stats are 1 minute sliding windows, losing them is a shame, but not a disaster. This seemed like the best way for riak_kv. Even with heirs in ets, you still need to deal with the eventual possibility of the ets table going away.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants