When bosh deploy fails, the failing vm is sometimes missing from the bosh vms output (its record was deleted from the bosh db) even though it is still alive in Softlayer. If you do nothing, the next bosh deploy treats it as a missing vm and creates a new one with a different IP to backfill it. This breaks cases where the vm IP must stay unchanged, so the missing vm needs to be recovered in the bosh db with this run book.
-
Make sure the failing vm can still be seen in the Softlayer Portal and that /var/vcap/bosh/user_data.json exists on the failing vm.

Take nats/0 as an example of the failing vm. Update its record in the instances table of the bosh db:

    update instances set vm_cid='12345678' where job='nats' and index=0;
    update instances set agent_id='1234abcd-12ab-34cd-56ef-123456abcdef' where job='nats' and index=0;
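The two update statements can also be generated rather than typed by hand, which helps avoid copy errors in the long agent_id. The following is a minimal sketch only: the function names are hypothetical, and the validation patterns are assumptions based on the example values in this run book. It only builds the SQL text and does not connect to the bosh db.

```python
import json
import re

# Patterns derived from the example values in this run book (assumptions).
PORTAL_RE = re.compile(r"/devices/details/(\d+)/virtualGuest")
UUID_RE = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")
JOB_RE = re.compile(r"^[A-Za-z0-9_-]+$")


def vm_cid_from_portal_url(url):
    """Return the numeric device id embedded in a Softlayer Portal address."""
    match = PORTAL_RE.search(url)
    if match is None:
        raise ValueError("no virtual guest id found in %r" % url)
    return match.group(1)


def agent_id_from_user_data(text):
    """Return the agent_id field from the content of user_data.json."""
    return json.loads(text)["agent_id"]


def build_updates(job, index, vm_cid, agent_id):
    """Build the two update statements for one instance as strings."""
    if not JOB_RE.match(job):
        raise ValueError("unexpected job name: %r" % job)
    if not vm_cid.isdigit():
        raise ValueError("vm_cid should be numeric: %r" % vm_cid)
    if not UUID_RE.match(agent_id):
        raise ValueError("unexpected agent_id format: %r" % agent_id)
    where = "where job='%s' and index=%d" % (job, int(index))
    return [
        "update instances set vm_cid='%s' %s;" % (vm_cid, where),
        "update instances set agent_id='%s' %s;" % (agent_id, where),
    ]


if __name__ == "__main__":
    url = "https://control.softlayer.com/devices/details/12345678/virtualGuest"
    user_data = '{"agent_id": "1234abcd-12ab-34cd-56ef-123456abcdef"}'
    cid = vm_cid_from_portal_url(url)
    aid = agent_id_from_user_data(user_data)
    for statement in build_updates("nats", 0, cid, aid):
        print(statement)
```

Review the printed statements before running them against the bosh db.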
The vm_cid can be taken from the Softlayer Portal address of the vm, such as 12345678 in https://control.softlayer.com/devices/details/12345678/virtualGuest

The agent_id can be taken from /var/vcap/bosh/user_data.json on the failing vm.

Log in to the failing vm and create /var/vcap/bosh/spec.json like this:

    {
      "job": {
        "name": "REPLACE_job_name"
      },
      "deployment": "REPLACE_deployment_name",
      "networks": {
        "default": {
          "cloud_properties": {
            "security_groups": ["default", "cf"]
          },
          "default": ["dns", "gateway"],
          "dns": ["REPLACE_dns_ip", "10.0.80.11", "10.0.80.12"],
          "gateway": "REPLACE_gateway_ip",
          "ip": "REPLACE_vm_ip",
          "netmask": "REPLACE_netmask_ip",
          "type": "dynamic"
        }
      },
      "index": REPLACE_job_index,
      "id": "",
      "persistent_disk": 0
    }

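Filling the template by hand is error-prone because the index must be a bare number while the other values are quoted strings. As a rough sketch, the same document can be produced with a small helper. The function name and all values in the usage example (deployment name and IP addresses) are hypothetical; the fixed entries are copied from the template above.

```python
import json


def build_spec(job_name, job_index, deployment, vm_ip, netmask, gateway_ip, dns_ip):
    """Return spec.json content for one vm, following the template above."""
    spec = {
        "job": {"name": job_name},
        "deployment": deployment,
        "networks": {
            "default": {
                "cloud_properties": {"security_groups": ["default", "cf"]},
                "default": ["dns", "gateway"],
                # The two trailing dns entries are the fixed ones from the template.
                "dns": [dns_ip, "10.0.80.11", "10.0.80.12"],
                "gateway": gateway_ip,
                "ip": vm_ip,
                "netmask": netmask,
                "type": "dynamic",
            }
        },
        # The index is emitted as a JSON number, not a string.
        "index": int(job_index),
        "id": "",
        "persistent_disk": 0,
    }
    return json.dumps(spec, indent=2)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(build_spec("nats", 0, "my-deployment", "10.121.120.200",
                     "255.255.255.192", "10.121.120.193", "10.121.120.194"))
```

The output can be written to /var/vcap/bosh/spec.json on the failing vm after checking each value.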
- name: Replace with the job name. It's nats in the example.
- dns: Replace with the dns ip. If the PowerDNS on the director is used, it's the director ip.
- gateway: Replace with the gateway ip. It can be found with route -n or netstat -rn. See example below.
- ip: Replace with the vm ip. It can be found with ifconfig. See example below.
- netmask: Replace with the netmask. It can be found with ifconfig. See example below.
- deployment: Replace with the name of the deployment the vm belongs to.
- index: Replace with the job index. It's 0 in the example.

netstat -rn or route -n output example:

    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    0.0.0.0         <gateway_ip>    0.0.0.0         UG    0      0        0 eth1
    9.0.0.0         10.121.120.193  255.0.0.0       UG    0      0        0 eth0
    10.0.0.0        10.121.120.193  255.0.0.0       UG    0      0        0 eth0
    10.121.120.192  0.0.0.0         255.255.255.192 U     0      0        0 eth0
    161.26.0.0      10.121.120.193  255.255.0.0     UG    0      0        0 eth0
    169.53.1.128    0.0.0.0         255.255.255.224 U     0      0        0 eth1

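In the routing table above, the gateway ip is the Gateway column of the row whose Destination is 0.0.0.0. A small sketch of picking that row out of captured output follows. The function name is hypothetical, and the gateway address in the sample is made up, because the example above only shows a placeholder.

```python
def default_gateway(route_output):
    """Return (gateway, interface) of the default route, or None if absent."""
    for line in route_output.splitlines():
        fields = line.split()
        # Route rows have 8 columns; the default route has destination
        # 0.0.0.0 and carries the G (gateway) flag.
        if len(fields) == 8 and fields[0] == "0.0.0.0" and "G" in fields[3]:
            return fields[1], fields[7]
    return None


# Sample output with a hypothetical gateway address.
SAMPLE = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         169.53.1.129    0.0.0.0         UG    0      0        0 eth1
10.0.0.0        10.121.120.193  255.0.0.0       UG    0      0        0 eth0
10.121.120.192  0.0.0.0         255.255.255.192 U     0      0        0 eth0
"""

if __name__ == "__main__":
    print(default_gateway(SAMPLE))
```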

ifconfig output example:

    eth0      Link encap:Ethernet  HWaddr 06:99:32:27:34:f7
              inet addr:<vm_private_ip>  Bcast:10.121.120.255  Mask:<netmask_ip_private>
              inet6 addr: fe80::499:32ff:fe27:34f7/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:596214 errors:0 dropped:0 overruns:0 frame:0
              TX packets:5065 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:63759564 (63.7 MB)  TX bytes:661260 (661.2 KB)

    eth1      Link encap:Ethernet  HWaddr 06:da:3f:89:2a:b2
              inet addr:<vm_public_ip>  Bcast:169.53.1.159  Mask:<netmask_ip_public>
              inet6 addr: fe80::4da:3fff:fe89:2ab2/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:65584 errors:0 dropped:0 overruns:0 frame:0
              TX packets:21240 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:5413249 (5.4 MB)  TX bytes:3236947 (3.2 MB)

    lo        Link encap:Local Loopback
              inet addr:127.0.0.1  Mask:255.0.0.0
              inet6 addr: ::1/128 Scope:Host
              UP LOOPBACK RUNNING  MTU:65536  Metric:1
              RX packets:6291 errors:0 dropped:0 overruns:0 frame:0
              TX packets:6291 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:892748 (892.7 KB)  TX bytes:892748 (892.7 KB)

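The ip and netmask for spec.json are the "inet addr" and "Mask" values of the relevant interface. The sketch below reads them from captured ifconfig output in the classic layout shown above; newer ifconfig versions print a different layout that it does not handle. The function name and the addresses in the sample are hypothetical.

```python
import re

# An interface block starts at column 0 with the interface name.
IFACE_RE = re.compile(r"^(\S+)\s+Link encap:")
# IPv4 address and netmask as printed by classic net-tools ifconfig.
INET_RE = re.compile(r"inet addr:(\S+).*?Mask:(\S+)")


def parse_ifconfig(output):
    """Map each interface name to its (ip, netmask) pair."""
    result = {}
    current = None
    for line in output.splitlines():
        header = IFACE_RE.match(line)
        if header:
            current = header.group(1)
        inet = INET_RE.search(line)
        if inet and current:
            result[current] = (inet.group(1), inet.group(2))
    return result


# Sample output with hypothetical addresses.
SAMPLE = """\
eth0      Link encap:Ethernet  HWaddr 06:99:32:27:34:f7
          inet addr:10.121.120.200  Bcast:10.121.120.255  Mask:255.255.255.192
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
"""

if __name__ == "__main__":
    for name, (ip, mask) in parse_ifconfig(SAMPLE).items():
        print(name, ip, mask)
```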
Restart the bosh agent on the failing vm:

    sv restart agent

Verify the result:
- bosh vms: the missing vm is back
- bosh cck: no problems
-
Sometimes when bosh deploy fails, user_data.json has not yet been generated, for example on a create_vm failure.

Make sure the failing vm can still be seen in the Softlayer Portal.
1) Do steps 1) & 2) in Scenario 1, except that the agent_id needs to be taken from the bosh debug log for the failure.

bosh debug log example:
E, [2016-07-22 06:17:15 #13984] [canary_update(nats/0 (14109da6-cc51-49e8-8e99-d096e4f59550))] ERROR -- DirectorJobRunner: error creating vm: Creating Virtual_Guest with agent ID '1234abcd-12ab-34cd-56ef-123456abcdef': Attaching ephemeral disk to VirtualGuest `12345678`: Waiting for VirtualGuest `12345678` has Service Setup transaction complete: Getting Last Complete Transaction for virtual guest with ID '12345678': Get https://[user]:[api_key]@api.softlayer.com/rest/v3/SoftLayer_Virtual_Guest/12345678/getLastTransaction.json?objectMask=transactionGroup: net/http: TLS handshake timeout
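Picking the agent_id out of such a long log line by eye is easy to get wrong. A minimal sketch of extracting it with regular expressions is shown here. The patterns are assumptions derived from the single example line above and may need adjusting for other messages; the function name is hypothetical, and the sample is a shortened copy of that line.

```python
import re

# Patterns derived from the example debug log line (assumptions).
INSTANCE_RE = re.compile(r"\[\w+\(([^/\s]+)/(\d+) \(")
AGENT_RE = re.compile(r"agent ID '([0-9a-f-]{36})'")
GUEST_RE = re.compile(r"VirtualGuest `(\d+)`")


def parse_create_vm_error(line):
    """Return whichever of job, index, agent_id and vm_cid the line contains."""
    found = {}
    instance = INSTANCE_RE.search(line)
    if instance:
        found["job"] = instance.group(1)
        found["index"] = int(instance.group(2))
    agent = AGENT_RE.search(line)
    if agent:
        found["agent_id"] = agent.group(1)
    guest = GUEST_RE.search(line)
    if guest:
        # In the example, the VirtualGuest id equals the vm_cid.
        found["vm_cid"] = guest.group(1)
    return found


# Shortened copy of the example debug log line.
LINE = (
    "E, [2016-07-22 06:17:15 #13984] "
    "[canary_update(nats/0 (14109da6-cc51-49e8-8e99-d096e4f59550))] "
    "ERROR -- DirectorJobRunner: error creating vm: "
    "Creating Virtual_Guest with agent ID "
    "'1234abcd-12ab-34cd-56ef-123456abcdef': "
    "Attaching ephemeral disk to VirtualGuest `12345678`"
)

if __name__ == "__main__":
    print(parse_create_vm_error(LINE))
```

Cross-check the extracted vm_cid against the Softlayer Portal before updating the bosh db.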
user_data.json example:

    {
      "agent_id": "1234abcd-12ab-34cd-56ef-123456abcdef",
      "vm": {
        "name": "vm-1234abcd-12ab-34cd-56ef-123456abcdef",
        "id": "vm-1234abcd-12ab-34cd-56ef-123456abcdef"
      },
      "mbus": "nats://nats:nats@[director_ip]:4222",
      "ntp": [],
      "blobstore": {
        "provider": "dav",
        "options": {
          "endpoint": "http://[director_ip]:25250",
          "password": "[agent_password]",
          "user": "agent"
        }
      },
      "networks": {
        "default": {
          "type": "dynamic",
          "ip": "[vm_ip]",
          "netmask": "[netmask]",
          "gateway": "[gateway_ip]",
          "dns": ["[dns_ip]", "10.0.80.11", "10.0.80.12"],
          "default": ["dns", "gateway"],
          "preconfigured": true,
          "cloud_properties": {
            "security_groups": ["default", "cf"]
          }
        }
      },
      "disks": {
        "ephemeral": "/dev/xvdc",
        "persistent": {}
      },
      "env": {}
    }

Update the following items:
- agent_id can be found in the bosh debug log
- Keep "persistent" empty (it will be fixed later)
- Refer to Scenario 1 to get vm_ip, netmask and gateway_ip
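The items listed above can be applied to a copy of the example document with a short helper, sketched below. It is illustrative only: the function name is hypothetical, the "vm-" + agent_id naming is an assumption taken from the example, and the remaining placeholders (director ip, agent password, dns ip) are expected to be set in the template already.

```python
import json


def fill_user_data(template_text, agent_id, vm_ip, netmask, gateway_ip):
    """Return user_data.json content with the per-vm items filled in."""
    data = json.loads(template_text)
    data["agent_id"] = agent_id
    # Assumption: the vm name and id follow the "vm-<agent_id>" form
    # seen in the example document.
    data["vm"] = {"name": "vm-" + agent_id, "id": "vm-" + agent_id}
    network = data["networks"]["default"]
    network["ip"] = vm_ip
    network["netmask"] = netmask
    network["gateway"] = gateway_ip
    # Keep "persistent" empty; it is fixed later in the procedure.
    data["disks"]["persistent"] = {}
    return json.dumps(data, separators=(",", ":"))


if __name__ == "__main__":
    # Reduced template with hypothetical addresses, for illustration only.
    template = json.dumps({
        "agent_id": "",
        "vm": {"name": "", "id": ""},
        "networks": {"default": {"type": "dynamic", "ip": "",
                                 "netmask": "", "gateway": ""}},
        "disks": {"ephemeral": "/dev/xvdc", "persistent": {}},
        "env": {},
    })
    print(fill_user_data(template, "1234abcd-12ab-34cd-56ef-123456abcdef",
                         "10.121.120.200", "255.255.255.192", "10.121.120.193"))
```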
Restart the bosh agent on the failing vm:

    sv restart agent

Run bosh cck and, when it reports the persistent disk as not mounted, choose to reattach the disk:

    Problem 1 of 1: Inconsistent mount information:
    Record shows that disk '12661853' should be mounted on 22354779.
    However it is currently :
        Not mounted in any VM.
      1. Ignore
      2. Reattach disk to instance
      3. Reattach disk and reboot instance
    Please choose a resolution [1 - 3]: 2

Verify the result:
- bosh vms: the missing vm is back
- bosh cck: no problems