Deployments
Vetspire Engineers typically use a feature-branch development model. When branches are completed and reviewed, they get merged into Vetspire's master branch.
Once per sprint (every two weeks), at the start of the sprint, we deploy any completed changes on master into staging, and any work that has passed QA on staging into production.
Deployments are usually carried out by Chris Bailey or Tomasz Tomczyk in the UK engineering team, for timezone reasons. Nothing prevents any other engineer from deploying, since all that strictly needs to happen is for code to be merged into Vetspire's production branch. Vetspire aims to deploy outside of business hours to limit the impact of any issues on the majority of our users, who are in the US.
Deployment Timelines
There is a fortnightly Teams meeting to block out time for deployments, but generally deployments will begin at approximately 9:00 GMT/BST and must be completed by approximately 12:30 GMT/BST.
Any deployment which is at risk of not completing before 12:30 GMT/BST should be rolled back. Vetspire Engineering would much rather reschedule a deployment than affect real end users.
It is important to note that despite being out of business hours, Vetspire does still have users online at all hours of the day, so depending on the severity of any issues found, rolling back / deferring deployment may still be a good idea.
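The cutoff rule above can be checked mechanically during a deployment. The following is a minimal sketch, not part of Vetspire's tooling: `past_cutoff` is a hypothetical helper name, and it assumes the 12:30 GMT/BST limit maps to wall-clock time in the `Europe/London` timezone.

```shell
# Hypothetical helper: succeeds (exit 0) once the 12:30 UK-time cutoff has passed.
# Usage: past_cutoff [HHMM]   (defaults to the current Europe/London time)
past_cutoff() {
  pc_now=${1:-$(TZ=Europe/London date '+%H%M')}
  # Strip leading zeros so values such as 0905 compare as plain decimals.
  pc_now=$(printf '%s' "$pc_now" | sed 's/^0*//')
  [ "${pc_now:-0}" -gt 1230 ]
}

if past_cutoff; then
  echo "Past 12:30 UK time: roll back and reschedule."
else
  echo "Still inside the deployment window."
fi
```

Passing an explicit `HHMM` argument makes the check easy to try out, for example `past_cutoff 1245`. The decision to roll back remains at the deployer's discretion.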
Deployment Process
Prior to a deployment, a copy of Vetspire's current Deployment Plan is typically made by Phillip Brewer on Confluence.
Due to Thrive's CAB process, we need to keep these deployment plans around to review and learn from should anything go wrong! Please ensure a deployment plan exists prior to a deployment. If not, feel free to clone the above linked deployment plan and rename accordingly.
Ideally, deployments are done on a call with multiple engineers. Such calls generally happen as Slack Huddles in the #dev channel. Note that this step isn't strictly necessary, but deployments are stressful and very manual no matter how many times an engineer has deployed, so help is often nice to have!
Pre-Deployment
Ensure a clone of the Production database is created.
The new clone should be named
vetspire-production-clone-$(date '+%Y-%m-%d')
Delete any clones older than three deploys ago.
Ensure any PRs pointed at master or staging are merged in at the deployer's discretion.
Ensure JIRA is up to date for all columns relevant for a deployment.
Make sure cards left in the In Progress column aren't already merged in.
Make sure cards left in the Code Review column aren't already merged in.
Make sure cards in the Merged to Master column are merged in. Double check sprint and release version.
Make sure cards in the Staging column are merged in. Double check sprint and release version.
Make sure cards in the QA Hold column are merged in. Double check sprint and release version.
Make sure cards in the Testing For Production column are merged in. Double check sprint and release version.
Ensure the QA Hold column doesn't contain any deployment blockers. Usually these are called out prior to deployment, so this is left at the deployer's discretion.
Ensure all relevant branches are up to date locally.
git checkout master && git pull
git checkout staging && git pull
git checkout production && git pull
Ensure any hotfixes / changes between branches have been merged down.
git checkout staging && git merge production && git push
git checkout master && git merge staging && git push
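The clone naming and pruning steps at the top of this checklist can be sketched as two small helpers. This is an illustration under assumptions rather than existing tooling: `clone_name` and `clones_to_delete` are hypothetical names, and "older than three deploys ago" is read here as "keep the three newest clones".

```shell
# Build the clone name for today, or for an explicit YYYY-MM-DD date.
clone_name() {
  printf 'vetspire-production-clone-%s\n' "${1:-$(date '+%Y-%m-%d')}"
}

# Read clone names on stdin (one per line) and print the ones to delete:
# everything except the three newest. The ISO date suffix means a plain
# reverse sort orders the names from newest to oldest.
clones_to_delete() {
  sort -r | tail -n +4
}

# Example with made-up dates: prints the two oldest names.
printf '%s\n' \
  "$(clone_name 2024-01-01)" "$(clone_name 2024-01-15)" \
  "$(clone_name 2024-01-29)" "$(clone_name 2024-02-12)" \
  "$(clone_name 2024-02-26)" | clones_to_delete
```

In practice the input would come from a listing of the existing Cloud SQL instances filtered to names starting with `vetspire-production-clone-`. The helper only prints candidates, so the deletion itself stays a deliberate manual step.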
Deployment Proper
Deploy changes from staging into production.
git checkout production && git merge staging
Dry-run migrations on the newly created database clone to make sure everything goes smoothly.
./scripts/diff_migrations origin/production
(Copy to clipboard.) Test the above in:
gcloud beta sql connect vetspire-production-clone-$(date '+%Y-%m-%d') --database=vetspire_staging --user proxyuser
Ensure nothing locks the database for more than approximately 30 seconds, though this is up to the deployer's discretion.
Ensure migrations align with what good looks like, such as:
When adding new column constraints, prefer splitting them into two migrations: one with a deferred check, and one adding the check after the fact. This reduces the lock required from a full table lock to a row-level lock and lets Postgres make use of indexes.
When adding new indexes, make sure that they are created concurrently.
When renaming columns or tables for particularly large tables (or high traffic tables), prefer to create new columns/tables and divert writes to minimize user impact.
Generally follow this guide.
git push
Deploy changes from master into staging.
git checkout staging && git merge master
git push
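The guidance on creating indexes concurrently can be partially checked before pasting migrations into the clone. The sketch below assumes the migrations are available as SQL text with each statement on a single line, which may not match what `./scripts/diff_migrations` actually prints. `find_blocking_indexes` is a hypothetical helper name.

```shell
# Read SQL on stdin and print every CREATE INDEX statement that does not
# use CONCURRENTLY. Returns 0 when none are found and 1 otherwise.
# Line-based: a statement split across several lines is not detected.
find_blocking_indexes() {
  fbi_hits=$(grep -Ei 'create( unique)? index' | grep -vi 'concurrently' || true)
  if [ -n "$fbi_hits" ]; then
    printf '%s\n' "$fbi_hits"
    return 1
  fi
  return 0
}

# Example with made-up statements: only the second one is reported.
printf '%s\n' \
  'CREATE INDEX CONCURRENTLY idx_a ON t (a);' \
  'CREATE UNIQUE INDEX idx_b ON t (b);' | find_blocking_indexes ||
  echo "Found index builds that would block writes."
```

A clean result does not replace the timed dry run on the clone. It only catches one of the patterns listed above.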
Post-Deployment
Move all tickets in the Staging, QA Hold, and Testing for Production columns into the Production column in JIRA.
Move all tickets in the Merged to master column into the Staging column in JIRA.
Keep an eye on our Deployment Actions in case of pipeline failure.
Ensure Release Notes are published. Corrections/adjustments can be made per deployer's discretion.
Send a Slack message to the General channel to confirm the deployment was successful.
Monitor Sentry for any spikes or new errors.
Initial spikes concerning DBConnections or other timeouts are normal at the time of writing.
Any obvious deployment-related or deployment-introduced issues should be carded up and, at the deployer's discretion, fixed ASAP. Feel free to work with your TL or EM to declare issues as SEV-1s or SEV-0s where appropriate.
Disaster Recovery Process
Deployment failed causing pods to be missing or down
Roll back deployment
Find the most recent successful Production deployment via GitHub Actions
Take note of the commit SHA (last slug in the URL)
Update /api/k8s/production.*.deployment and change gcr.io/vetspire-app/vetspire:release to gcr.io/vetspire-app/vetspire:release-${SHA}.
Ensure you're on the production cluster
gcloud container clusters get-credentials vetspire-api-prod --project vetspire-app --zone us-central1
kubectl apply -f k8s/production.deployment.yaml
Watch pods and see if issue is resolved
Alert EM/TL/DevOps channel, regardless of whether the issue is resolved.
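The image-pinning step of the rollback can be scripted to avoid hand-editing under pressure. This is a minimal sketch: `pin_release_image` is a hypothetical helper, and it assumes the image reference ends its line in the manifest (optionally followed by a closing quote), which should be confirmed against the real file.

```shell
# Usage: pin_release_image FILE SHA
# Rewrites "gcr.io/vetspire-app/vetspire:release" to "...:release-SHA".
# A tag that is already pinned ("release-<sha>") does not match and is left alone.
pin_release_image() {
  pri_file=$1
  pri_sha=$2
  sed -E "s#(gcr\.io/vetspire-app/vetspire:release)(\"?[[:space:]]*)\$#\1-${pri_sha}\2#" \
    "$pri_file" > "$pri_file.tmp" &&
    mv "$pri_file.tmp" "$pri_file"
}

# Example (path taken from the kubectl step above, SHA is a placeholder):
#   pin_release_image k8s/production.deployment.yaml "$SHA"
#   git diff k8s/production.deployment.yaml   # review before kubectl apply
```

Reviewing the diff before applying keeps a human check on exactly which image the cluster will be rolled back to.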
Database is locked and users cannot access portions of the site
Connect directly to the production database
gcloud beta sql connect vetspire-production --database=vetspire_staging --user proxyuser
Try to determine what's going on.
List long running processes
Cancel processes
SELECT pg_cancel_backend($PID);
Force Kill processes
SELECT pg_terminate_backend($PID);
Identify if migrations were backwards compatible. If so, rolling back might be a good call.
Alert EM/TL/DevOps channel, regardless of whether the issue is resolved.
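The "List long running processes" step above has no query attached. One possible query, wrapped in a shell helper so the threshold can be changed, is sketched below. It reads PostgreSQL's standard `pg_stat_activity` view. The helper name `long_running_sql` and the 30-second default are assumptions, the latter borrowed from the lock guidance in the deployment steps.

```shell
# Print a query listing non-idle backends that have been running longer than
# the given number of seconds (default 30), longest-running first.
long_running_sql() {
  cat <<SQL
SELECT pid, state, now() - query_start AS duration, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '${1:-30} seconds'
ORDER BY duration DESC;
SQL
}

# Print the query so it can be pasted into the session opened by gcloud.
long_running_sql 30
```

The `pid` column in the result is the value to pass to `pg_cancel_backend` or `pg_terminate_backend`. Try cancelling first, since termination closes the whole connection.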