
Conversation

@MaxFedotov (Contributor) commented Apr 12, 2020

Hi @shlomi-noach,

This PR is the 'orchestrator' part of openark/orchestrator-agent#1

This PR adds the following:

  • A new seed processing algorithm
  • A redesign of agent-related structs and functions
  • Additional APIs that provide information about seeds and agents
  • An updated agent and seed UI

We define a seed as an operation consisting of a set of predefined stages in which two orchestrator agents participate: one agent is called the source agent (it is on the source side of the seed), the other the target agent (it is on the target side of the seed). The goal of a seed is to transfer all data from the source agent to the target agent and to add the target agent as a slave of the source agent.

A seed can be executed using different seed methods - particular tools or techniques used to transfer data from the source agent to the target agent.

A seed operation is processed entirely by orchestrator and runs at a configurable scheduled interval (SeedProcessIntervalSeconds).
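For illustration, the timed routine is wired up roughly like this (a simplified sketch, not the exact code from the PR):

func continuousSeedProcessing() {
	// sketch only: assumes orchestrator's global config carries the new
	// SeedProcessIntervalSeconds setting introduced by this PR
	interval := time.Duration(config.Config.SeedProcessIntervalSeconds) * time.Second
	for range time.Tick(interval) {
		ProcessSeeds()
	}
}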

By design, each seed operation consists of five stages, executed one after another:

  • Prepare - executed on both agents
  • Backup - executed on either the target or the source agent, depending on the seed method chosen
  • Restore - executed on the target agent
  • ConnectSlave - executed on the target agent
  • Cleanup - executed on both agents
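Roughly, the stages can be thought of as follows (a simplified sketch, not the exact declarations):

// SeedStage enumerates the five sequential stages of a seed operation.
type SeedStage int

const (
	Prepare SeedStage = iota
	Backup
	Restore
	ConnectSlave
	Cleanup
)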

Each stage can be in one of six statuses:

  • Scheduled - on the next seed processing cycle, orchestrator calls the corresponding orchestrator-agent stage API (prepare\backup\restore...), and orchestrator-agent starts executing the stage. If the API call succeeds, the stage status is updated to Running; if not, it is updated to Error
  • Running - on the next seed processing cycle, orchestrator calls the orchestrator-agent API to get information about the stage and saves it to the database. The stage stays in this status until orchestrator-agent reports either that the stage is Completed, in which case the seed moves to the next stage in Scheduled status, or that errors occurred during the stage, in which case its status is updated to Error
  • Completed - an informational status only. Orchestrator does not process seeds in this status
  • Error - on the next seed processing cycle, orchestrator checks whether the number of times the stage has been in this status exceeds a configurable threshold (MaxRetriesForSeedStage). If it does, the seed is marked as Failed and is no longer processed. If it does not, the seed is restarted, i.e. moved back to Scheduled status for the current stage. (The one exception to this rule is errors during the Backup stage: if an error occurred there, the seed is restarted from the Prepare stage, because some seed methods do important work during Prepare - for example, socat is started on the target agent during the Prepare stage.) If the stage was executed on both agents and only one agent had an error, orchestrator aborts the stage on the second agent before restarting it
  • Failed - an informational status only. Orchestrator does not process seeds in this status
  • Aborting - if the user hits the Abort button in the UI or uses the Abort API, the stage moves to this status. Orchestrator then tries to call the abort API on the agent (or on both agents, if the stage involves both). If the call succeeds, the seed is marked as Failed and is no longer processed. If not, orchestrator retries the abort on each subsequent processing cycle until it succeeds.

A seed is considered Active (meaning it should be processed by orchestrator) only while its stage is in Scheduled, Running, Error, or Aborting status.
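In simplified form (not the exact declarations), the statuses and the active check look like this, reusing the status names that also appear in ProcessSeeds() below:

// SeedStatus enumerates the six statuses a seed stage can be in.
type SeedStatus int

const (
	Scheduled SeedStatus = iota
	Running
	Completed
	Error
	Failed
	Aborting
)

// isActive reports whether a seed in this status should still be processed.
func (s SeedStatus) isActive() bool {
	switch s {
	case Scheduled, Running, Error, Aborting:
		return true
	default:
		return false
	}
}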

All seed processing is done by the following function:

func ProcessSeeds() []*Seed {
	activeSeeds, err := ReadActiveSeeds()
	if err != nil {
		log.Errore(fmt.Errorf("Unable to read active seeds: %+v", err))
		return nil
	}
	if len(activeSeeds) > 0 {
		inst.AuditOperation("process-seeds", nil, fmt.Sprintf("Will process %d active seeds", len(activeSeeds)))
		var wg sync.WaitGroup
		for _, seed := range activeSeeds {
			wg.Add(1)
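			// each seed.process* handler below runs in its own goroutine
			// and is expected to call wg.Done() when it finishes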
			switch seed.Status {
			case Scheduled:
				go seed.processScheduled(&wg)
			case Running:
				go seed.processRunning(&wg)
			case Error:
				go seed.processErrored(&wg)
			case Aborting:
				go seed.processAborting(&wg)
			}
		}
		wg.Wait()
		inst.AuditOperation("process-seeds", nil, "All active seeds processed")
	}
	return activeSeeds
}

It starts a goroutine for each active seed, chosen by the seed's status, and waits until all of them complete.

So the overall process looks like this:

  • The /agent-seed/:seedMethod/:targetHost/:sourceHost orchestrator API is called. Various pre-seed checks are performed, and if they pass, a new seed is created in Scheduled status at the Prepare stage
  • The ProcessSeeds() function is called by a timed routine. It starts the processScheduled() function in a goroutine, which calls /api/prepare/:seedID/:seedMethod/:seedSide on both orchestrator-agents; if these calls succeed, it updates the Prepare stage status to Running for both agents
  • The ProcessSeeds() function is called by a timed routine. It starts the processRunning() function in a goroutine (see the sketch after this list), which calls /api/seed-stage-state/:seedID/:seedStage on both orchestrator-agents and logs the result. If both agents report that they have completed the Prepare stage, it is marked as Completed and the seed moves to the Backup stage in Scheduled status. If one of the agents has not yet completed the stage, the seed remains at this stage, and on the next ProcessSeeds() call processRunning() runs again and calls /api/seed-stage-state/:seedID/:seedStage on the remaining agent. This continues until both agents complete the stage
  • The ProcessSeeds() function is called by a timed routine. It starts the processScheduled() function in a goroutine, which calls /api/backup/:seedID/:seedMethod/:seedHost/:mysqlPort on one of the agents, depending on the seed method chosen. If this call succeeds, it updates the Backup stage status to Running for that agent
  • The ProcessSeeds() function is called by a timed routine. It starts the processRunning() function in a goroutine, which calls /api/seed-stage-state/:seedID/:seedStage on the agent. If the agent reports that it has completed the Backup stage, the stage is marked as Completed and the seed moves to the Restore stage in Scheduled status. The same pattern then repeats for the remaining stages
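To make the flow concrete, here is a rough sketch of what a processRunning-style handler looks like; the helper names (pendingAgents, getSeedStageState, saveStageState, markAgentCompleted, allAgentsCompleted, completeStageAndScheduleNext) are simplified placeholders rather than the actual code:

func (s *Seed) processRunning(wg *sync.WaitGroup) {
	defer wg.Done()
	// poll every agent that has not yet finished the current stage
	for _, agent := range s.pendingAgents() { // placeholder helper
		// corresponds to /api/seed-stage-state/:seedID/:seedStage
		state, err := agent.getSeedStageState(s.SeedID, s.Stage)
		if err != nil {
			s.updateStatus(Error)
			return
		}
		s.saveStageState(agent, state) // persist progress to the database
		if state.Completed {
			s.markAgentCompleted(agent)
		}
	}
	if s.allAgentsCompleted() {
		// e.g. Prepare completed -> Backup scheduled
		s.completeStageAndScheduleNext()
	}
}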

So basically this works like a state machine that moves a seed from start to finish along the following path:
New seed -> Prepare scheduled -> Prepare running -> Prepare completed -> Backup scheduled -> Backup running -> Backup completed -> Restore scheduled -> Restore running -> Restore completed -> ConnectSlave scheduled -> ConnectSlave running -> ConnectSlave completed -> Cleanup scheduled -> Cleanup running -> Cleanup completed -> Seed completed
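In simplified form, the stage progression is just a transition table (again a sketch, with simplified names):

// nextStage maps each completed stage to the one scheduled after it;
// Cleanup has no successor, so completing it completes the seed.
var nextStage = map[SeedStage]SeedStage{
	Prepare:      Backup,
	Backup:       Restore,
	Restore:      ConnectSlave,
	ConnectSlave: Cleanup,
}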

I've also created various test cases, located in agent_test.go, which simulate all this logic using docker and orchestrator-agent mocks.

This PR also updates the orchestrator UI to support all these changes. I've tried to keep all previous functionality regarding LVM\local snapshot hosts\remote snapshot hosts as backward-compatible as possible.
The new UI includes an agents page with various auto-complete filters:
[screenshot]
An updated seeds page with auto-complete filters and a UI for starting a new seed:
[screenshot]
An updated agent page (some LVM-specific UI elements are not shown because this agent does not have the LVM seed method added):
[screenshot]

We have already created our own auto-provisioning solution based on orchestrator, orchestrator-agent, and puppet, where new hosts are automatically added to a cluster as slaves. We have started testing it in our staging environment, but I would still consider this functionality to be in beta, as there will obviously be some bugs, which we will catch through active production usage.

I would be very grateful for your feedback and comments, and I will be glad to answer any questions :)
Thanks,
Max

@MaxFedotov (Contributor, Author) commented:

It seems there are some problems with vendored packages, but I don't get these errors during a local build, so I definitely need your help with this :)

@MaxFedotov (Contributor, Author) commented:

I was able to fix the build errors, but now the build fails when running my tests for the seed logic. The thing is that these tests require docker to be installed and some customization of the /etc/hosts file:

// before running add following to your /etc/hosts file depending on number of agents you plan to use
// 127.0.0.2 agent1
// 127.0.0.3 agent2
// ...
// 127.0.0.n agentn

I don't know how to configure this in the CI environment, so maybe we should exclude this test from running?
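For example, we could gate the test behind an environment variable, so it is skipped unless docker and the /etc/hosts entries are available (just a sketch; the variable name is made up):

// sketch only: ORCHESTRATOR_DOCKER_AGENT_TESTS is a hypothetical gate
func TestSeeds(t *testing.T) {
	if os.Getenv("ORCHESTRATOR_DOCKER_AGENT_TESTS") == "" {
		t.Skip("skipping: requires docker and agent1..agentN entries in /etc/hosts")
	}
	// ... actual seed-logic tests ...
}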
