How we upgrade major Rails versions
As a company whose product is built on top of Ruby on Rails, conducting a major version upgrade of the underlying framework is just about the biggest upkeep item we regularly undertake. The whole process takes months — with multiple cycles of development work, rounds of automated and manual testing, and a phased rollout process. Here's how we do it.
Gemfile.next
The core idea behind the way we upgrade Rails is to make the application compatible with both the current version and the next version simultaneously. During the upgrade we can then boot the application with either version, which allows tests to run against both. This makes it much easier to find regressions: we toggle a switch that determines which version to run.
This is done by having a Gemfile.next in addition to our current Gemfile. Gemfile.next is simply a symbolic link to the current Gemfile:
$ ls -l Gemfile.next
lrwxrwxrwx shane shane 7 B Fri May 7 12:26:50 2021 Gemfile.next ⇒ Gemfile
Then, at the top of the Gemfile is a short function definition:
def next?
  File.basename(__FILE__) == "Gemfile.next"
end
This allows us to put conditionals in the Gemfile for which version of Rails (and any other gems) to use, like so:
gem 'rails', (next? ? '7.1.3.4' : '7.0.8.4')
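To see why the symlink trick works, here is a minimal, self-contained sketch. It assumes the Gemfile's contents are evaluated under the path that was requested, so __FILE__ reads Gemfile.next when the symlink is used. FakeDsl and rails_version_for are hypothetical stand-ins for illustration only, not Bundler APIs.

```ruby
require "tmpdir"

# Hypothetical stand-in for a Gemfile DSL: it only records `gem` calls.
class FakeDsl
  attr_reader :gems

  def initialize
    @gems = {}
  end

  def gem(name, version = nil)
    @gems[name] = version
  end

  # Evaluate the file's contents under the path we were given, so __FILE__
  # inside the Gemfile is that path (the symlink is not resolved).
  def evaluate(path)
    instance_eval(File.read(path), path, 1)
    self
  end
end

GEMFILE_BODY = <<~'GEMFILE'
  def next?
    File.basename(__FILE__) == "Gemfile.next"
  end

  gem 'rails', (next? ? '7.1.3.4' : '7.0.8.4')
GEMFILE

# Returns the Rails version selected when evaluating the given file name.
def rails_version_for(file_name)
  Dir.mktmpdir do |dir|
    File.write(File.join(dir, "Gemfile"), GEMFILE_BODY)
    File.symlink("Gemfile", File.join(dir, "Gemfile.next"))
    FakeDsl.new.evaluate(File.join(dir, file_name)).gems.fetch("rails")
  end
end

puts rails_version_for("Gemfile")      # current bundle
puts rails_version_for("Gemfile.next") # next bundle
```

Both names point at the same file on disk, so there is only one Gemfile to maintain; the name used to reach it is the switch.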
Finally, we have a wrapper script: bin/next. When prefixed to a given Rails command (such as bin/next rails console), bin/next will start the app with the "next" version of the gems in the Gemfile.
#!/bin/bash
# Use this file to run the app with the next version of Rails
#
# Usage:
#   bin/next bundle install
#   bin/next rails ...
export BUNDLE_GEMFILE=Gemfile.next
export BUNDLE_CACHE_PATH=vendor/cache.next
export BUNDLE_BIN=false
export NEXT=1

# Run bundle commands directly; run everything else through bundle exec.
# "$@" is quoted so arguments containing spaces are passed through intact.
if [[ "$*" =~ ^bundle ]]; then
  "$@"
else
  bundle exec "$@"
fi
Using this concept, we're able to easily switch between a Gemfile using the current set of gems and the "next" gems for everything, including local development and running test suites.
Our upgrade process
- Update the Gemfile. This includes the following tasks:
  - Copy Gemfile.lock to Gemfile.next.lock to start fresh from the current set of gem versions.
  - Set the new Rails gem version in the Gemfile.
  - Work through the necessary gem version upgrades to get a bundle that resolves all dependencies successfully.
- Review the changes in the Rails upgrade guide.
  - Reviewing the Rails release notes early on is critical to avoid missing subtleties that might cause major problems later on during a deployment. Plus, it is easier to upgrade the application when you have a better idea of what the changes are.
  - We also make use of the rails app:update task, but we prefer to do this manually for the sake of having more control over that process.
- Fix all tests and make any needed changes compatible with both versions of Rails.
  - Of course, this step is the bulk of the work. Depending on the size of the test suite and complexity of the application, this part of the process can take weeks or months.
- Go through a round of manual testing with our Customer Success team.
- Deploy the work.
  - First, we do a small rollout to a subset of customers.
  - We then do the final switchover by promoting Gemfile.next.lock to Gemfile.lock.
Yeah — that's a lot. So let's break down the bigger steps.
Maintaining compatibility
Once we have a bundle for the next Rails version, the first step is to get the application booting with it and ensure tests pass with both versions.
Patches
Similar to many large Rails applications, we have our fair share of patches to core and third-party Rails gems. For example, we have a patch to Rails' Rack logger to log the full URL of a request (rather than just the path):
# Compare as Gem::Version objects; comparing version strings directly is lexical.
if Gem::Version.new(Rails.version) >= Gem::Version.new('7.2')
  raise "Ensure patched methods below have not changed in Rails #{Rails.version}"
end

class Rails::Rack::Logger
  # https://github.com/rails/rails/blob/v7.1.3.4/railties/lib/rails/rack/logger.rb#L54
  def started_request_message(request)
    format(
      'Started %s "%s%s%s" for %s at %s',
      request.request_method,
      request.protocol,
      request.host_with_port,
      request.filtered_path,
      request.ip,
      Time.zone.now.to_s
    )
  end
end
Of note here is the Rails.version conditional at the top of the patch. When the next engineer tasked with upgrading Rails attempts to boot the application, an exception will require her to check the source of the patched method and ensure it has not changed in the new Rails version. She will then bump the conditional for the future Rails upgrade.
This approach ensures that we don't miss updating any patches that might silently fail if the class they are patching has changed, resulting in the code not being called. There should ideally be tests for this patched behavior as well. But depending on how the test is written, it's possible for these to provide a false positive result if the patched class changed in the right way. We find that having a loud exception forcing an engineer to check patches during Rails upgrades is the more surefire way to verify they are still up to date.
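One detail worth watching in guards like this is how versions are compared. The sketch below is a hypothetical, Rails-free helper (ensure_patch_reviewed! is our own name for illustration); in an application you would pass it Rails.version. It uses Gem::Version from RubyGems because plain string comparison is lexical.

```ruby
require "rubygems" # Gem::Version ships with Ruby

# Hypothetical helper: raise once the running framework version reaches the
# first version the patch has not been reviewed against.
def ensure_patch_reviewed!(current_version, first_unreviewed_version)
  current = Gem::Version.new(current_version)
  limit = Gem::Version.new(first_unreviewed_version)
  return if current < limit

  raise "Ensure patched methods have not changed in version #{current_version}"
end

# Plain string comparison is lexical, so a two-digit minor version sorts wrong.
puts "7.10.0" >= "7.2"                                     # false
puts Gem::Version.new("7.10.0") >= Gem::Version.new("7.2") # true

ensure_patch_reviewed!("7.1.3.4", "7.2") # passes silently
```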
Application code
With the application booting, the mammoth task of fixing all the broken tests begins. bin/next rails test or bin/next rspec (depending on the test suite) makes it easy to run individual tests against the two versions of Rails and cross-reference if something goes awry.
Ideally, a fix can be made that will be compatible with both versions of Rails. But in many cases, it's necessary to leave an if Rails.version >= 'X.Y' conditional in the code. This will need to be cleaned up after the final deployment, but it allows the application to eventually become compatible with both versions as we work through fixing all the tests.
Cache keys
Another small tip is to ensure cache keys will be invalidated between Rails versions. Simply adding the Rails version to the cache key can prevent a whole category of difficult-to-debug issues when stale data applicable only to an old Rails version is used in a newer, incompatible version upon production deployment. For example:
Rails.cache.fetch("some-key-#{Rails.version}") do
  # [business logic]
end
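The effect can be sketched without Rails. TinyCache below is a hypothetical in-memory stand-in for Rails.cache that models only fetch, and versioned_key is an illustrative helper name; the point is simply that a value written under one version is never read under the other.

```ruby
# Hypothetical in-memory cache standing in for Rails.cache; only `fetch` is modeled.
class TinyCache
  def initialize
    @store = {}
  end

  # Return the cached value, or compute it with the block and store it.
  def fetch(key)
    @store.fetch(key) { @store[key] = yield }
  end
end

# Append the framework version so entries written under one version are
# never read under another.
def versioned_key(base, framework_version)
  "#{base}-#{framework_version}"
end

cache = TinyCache.new
old_value = cache.fetch(versioned_key("some-key", "7.0.8.4")) { "written by 7.0" }
new_value = cache.fetch(versioned_key("some-key", "7.1.3.4")) { "written by 7.1" }

puts old_value
puts new_value
```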
Running the test suite
The next challenge is running the test suite against both versions on the CI platform. There are a few considerations here:
- Running the test suite for the engineer(s) working on the upgrade directly to monitor progress toward a 100% test pass rate
- Once 100% of tests pass, running the test suite for other team members to ensure their ongoing work isn't creating regressions
- Minimizing the cost impact of doubling the resources to run the test suite over a potentially long period of time
We must first configure our CI pipeline to run all steps with a configurable command prefix, held in an environment variable ($CI_BUNDLE_PREFIX in the script below). We then enable a separate set of jobs so that everything runs twice: once with this variable set to bin/next, and once with it set to an empty string for the current bundle.
The next configuration is to not fail the pipeline if these tests fail while they are still being fixed. We have a $RSPEC_NEXT_REQUIRED variable to control the reporting of the RSpec exit code. Initially, this is set to 0 to prevent the pipeline from being blocked. But once the tests all pass, we flip it to 1. This transfers the burden of ensuring a passing test suite onto the whole team if any of its ongoing work introduces a failing test in the next Rails version. The setup looks like this:
${CI_BUNDLE_PREFIX} rspec [...]
rspec_status=$?

if [ "$CI_BUNDLE_PREFIX" = "bin/next" ] && [ "$RSPEC_NEXT_REQUIRED" = "0" ]; then
  echo "Ignoring rspec exit code ${rspec_status} for bundle/next"
  exit 0
fi

exit $rspec_status
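The decision this snippet makes can be written as a small pure function, which makes the four cases easy to check. This is only a sketch: ci_exit_code is a hypothetical name, and its arguments mirror the environment variables used above.

```ruby
# Decide the exit code a CI step should report. A failing run is only
# forgiven for the next bundle, and only while it is not yet required.
def ci_exit_code(rspec_status:, bundle_prefix:, next_required:)
  ignoring = bundle_prefix == "bin/next" && next_required == "0"
  ignoring ? 0 : rspec_status
end

# Failing next-bundle run while the suite is still being fixed: not blocking.
puts ci_exit_code(rspec_status: 1, bundle_prefix: "bin/next", next_required: "0")
# Same failure once the next suite is required: blocking.
puts ci_exit_code(rspec_status: 1, bundle_prefix: "bin/next", next_required: "1")
# Failures on the current bundle always block.
puts ci_exit_code(rspec_status: 1, bundle_prefix: "", next_required: "0")
```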
Depending on the size of the test suite and the length of the upgrade process, it's also worth considering the additional resources and costs incurred from duplicating the test suite like this. We set up our configuration in a way that saves on costs: as long as the $RSPEC_NEXT_REQUIRED variable is set to 0, an additional branch filter only runs the next jobs if the branch name matches a pattern such as /.*rails-next.*/. We then remove this branch filter when we're ready to start running the tests on all branches closer to deployment.
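As a rough sketch of that filter, the rule is: always run the next jobs once they are required, and otherwise run them only on matching branches. The method name run_next_jobs? is hypothetical, and real CI platforms express this in their own configuration syntax rather than in Ruby.

```ruby
# Pattern taken from the example in the text; adjust to your branch naming.
NEXT_BRANCH_PATTERN = /.*rails-next.*/

# Should the duplicated "next" jobs run for this branch?
def run_next_jobs?(branch:, next_required:)
  return true if next_required == "1"

  NEXT_BRANCH_PATTERN.match?(branch)
end

puts run_next_jobs?(branch: "shane/rails-next-fixes", next_required: "0") # true
puts run_next_jobs?(branch: "feature/new-report", next_required: "0")     # false
puts run_next_jobs?(branch: "feature/new-report", next_required: "1")     # true
```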
Ensuring consistent Gemfiles
Another challenge that we run into is ensuring gems in the current Gemfile — which are updated during the deployment process — are also reflected in the next Gemfile. Because the intention is for the next Gemfile to have at least some different gem versions by nature of the upgrade, this can be difficult. There's no way to know which gems should be different and which should be consistent. Fortunately (in our case at least), our third-party dependencies do not change that frequently. So this is a small problem, but one we must still pay attention to.
The first line of defense here is to continually remind other team members to reflect any changes to the current Gemfile within the next Gemfile. However, it's only natural for people to forget about this sometimes. To guard against this for the gems we update most frequently, we have the following script in our CI steps:
function git_gem_revision {
  # This searches through the given Gemfile.lock for a `GIT` block for the given gem and extracts its revision line
  awk -v GEM="$1" '
    $1 == "GIT" { git_gem = 1 }
    $1 == "GEM" { git_gem = 0 }
    git_gem && $1 == "remote:" && $2 ~ GEM".git$" { found_gem = 1 }
    found_gem && $1 == "revision:" { print $2; exit }
  ' "$2"
}

GEMS=("aha-services" "calculated_attributes")
for GEM in "${GEMS[@]}"; do
  if [ "$(git_gem_revision "$GEM" "Gemfile.lock")" != "$(git_gem_revision "$GEM" "Gemfile.next.lock")" ]; then
    echo "$GEM revision in Gemfile.lock does not match revision in Gemfile.next.lock. Ensure these are consistent to avoid mismatching gem versions during Rails upgrades by running \"bin/next bundle update $GEM --conservative\""
    exit 1
  fi
done
Because our CI setup only has one set of gems installed for a specific step (either current or next), we must parse the Gemfile.lock manually to get the revision for a given gem (rather than relying on bundle or gem to tell us what is installed, as it won't know!). In the cases above, we're only concerned with our first-party gems that are installed via Git. But this same method could be extended to gems from RubyGems as well. If an inconsistency is found between the versions in each Gemfile, the build will fail. This forces the engineer to ensure the versions are consistent.
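For readers more comfortable in Ruby than awk, the same lookup can be sketched as plain text parsing. This loosely mirrors the awk function above and is not the script we run in CI; the lockfile contents, remotes, and revisions below are made up for illustration.

```ruby
# Made-up lockfile contents; real lockfiles have the same section layout.
SAMPLE_LOCK = <<~LOCK
  GIT
    remote: https://example.com/acme/aha-services.git
    revision: 1111111
    specs:
      aha-services (1.0.0)

  GIT
    remote: https://example.com/acme/calculated_attributes.git
    revision: 2222222
    specs:
      calculated_attributes (0.5.0)

  GEM
    remote: https://rubygems.org/
    specs:
      rake (13.0.6)
LOCK

# Find the revision of a Git-sourced gem, or nil if it has no GIT block.
def git_gem_revision(lock_contents, gem_name)
  in_git_block = false
  found_gem = false

  lock_contents.each_line do |line|
    key, value = line.split
    if line.match?(/\A[A-Z]/) # unindented section header such as GIT or GEM
      in_git_block = (key == "GIT")
      found_gem = false
    elsif in_git_block && key == "remote:" && value.to_s.end_with?("#{gem_name}.git")
      found_gem = true
    elsif found_gem && key == "revision:"
      return value
    end
  end

  nil
end

puts git_gem_revision(SAMPLE_LOCK, "aha-services")          # 1111111
puts git_gem_revision(SAMPLE_LOCK, "calculated_attributes") # 2222222
```

Comparing the result for Gemfile.lock against Gemfile.next.lock then gives the same pass/fail signal as the shell loop.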
As a last check, we will run bundle list and bin/next bundle list before the first production deployment and do a manual review to verify that the gem versions in the next bundle are the same versions or more recent. If anything got left behind, this is a good time to update it so nothing moves backward during the transition.
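Part of this review could be automated by comparing the two lockfiles directly. The following is a sketch under assumptions, not our actual tooling: locked_versions and downgraded_gems are hypothetical helpers, the lockfile snippets are made up, and the parser relies on resolved specs being indented by exactly four spaces.

```ruby
require "rubygems" # Gem::Version

# Extract { gem name => version } from lockfile text. Resolved specs sit at
# exactly four spaces of indentation; their dependencies are indented further.
def locked_versions(lock_contents)
  lock_contents.each_line.with_object({}) do |line, versions|
    match = line.match(/\A {4}(\S+) \(([^)]+)\)\s*\z/)
    next unless match

    # Drop any platform suffix (for example "-x86_64-linux") before parsing.
    versions[match[1]] = Gem::Version.new(match[2].split("-").first)
  end
end

# Names of gems whose version in the next lockfile is older than the current one.
def downgraded_gems(current_lock, next_lock)
  current = locked_versions(current_lock)
  locked_versions(next_lock).select do |name, version|
    current.key?(name) && version < current[name]
  end.keys
end

CURRENT_LOCK = <<~LOCK
  GEM
    remote: https://rubygems.org/
    specs:
      rails (7.0.8.4)
      rake (13.1.0)
LOCK

NEXT_LOCK = <<~LOCK
  GEM
    remote: https://rubygems.org/
    specs:
      rails (7.1.3.4)
      rake (13.0.6)
LOCK

puts downgraded_gems(CURRENT_LOCK, NEXT_LOCK).inspect # ["rake"]
```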
Manual testing
All tests pass at this point. Given the size of a major Rails upgrade, we also involve our Customer Success team in a round of manual testing of all application functionality in a staging environment. Assuming a comprehensive test suite, this should uncover few to no legitimate issues, but it is still a valuable way to catch gaps in the test suite that should be covered regardless. Manual testing is slow and expensive, though, so it is sometimes reasonable to skip or limit it when confidence in the test suite is high.
Deployment
With all the testing complete, it's finally time to deploy the upgrade. We do this in two stages:
- An initial rollout limited to a subset of customers that is done outside of busy business hours and can be quickly rolled back if necessary
- A final rollout to all customers with Gemfile.next.lock promoted to Gemfile.lock
For the initial rollout, we have a second wrapper script around bin/next called bin/conditional_next. This script uses an environment variable, $NEXT, to control whether to run a command with bin/next or not. By setting $NEXT to "1" on a subset of servers/containers, we can do an initial phased rollout with the next Gemfile. This also allows deployments of other unrelated changes to continue as normal.
#!/bin/bash
# Quote "$@" so arguments containing spaces are passed through intact.
if [ "${NEXT:-0}" == "1" ]; then
  echo "Running command with bin/next"
  bin/next "$@"
else
  "$@"
fi
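The selection logic is small enough to sketch as a function over an environment hash, which makes it easy to check how each server would behave. conditional_next_command is a hypothetical name; the sketch only builds the argument list and does not execute anything.

```ruby
# Build the command to execute, prefixing bin/next only when NEXT is "1".
# Keeping the command as an array avoids any shell word-splitting.
def conditional_next_command(env, args)
  env.fetch("NEXT", "0") == "1" ? ["bin/next", *args] : args
end

puts conditional_next_command({ "NEXT" => "1" }, %w[rails server]).inspect
puts conditional_next_command({}, %w[rails server]).inspect
```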
Getting the next Rails version into production will likely reveal any outstanding missed issues quickly. Realistically, going through this process means potentially rolling back to the existing Rails version at least once (or twice … or three times). There are many moving parts here, and even all the testing in the world won't illuminate every problem prior to the first production deploy. Even if things look stable after this deployment, we like to leave this initial rollout running for a few hours or days before doing it more widely.
Once everything looks quiet and any remediation fixes have been merged, it's time to do the final merge and deploy. We do this with a simple cp Gemfile.next.lock Gemfile.lock to promote the next Gemfile to the main Gemfile. And with that, one more normal deployment will roll out the Rails upgrade to all production traffic. Everything will be smooth sailing given a sprinkle of good luck.
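The lockfile lifecycle across the whole upgrade comes down to two copies: one at the start and one at the end. The sketch below walks through both in a temporary directory; seed_next_lockfile and promote_next_lockfile are hypothetical helper names, and the file contents are placeholders rather than real lockfiles.

```ruby
require "fileutils"
require "tmpdir"

# Start of the upgrade: seed the next lockfile from the current one.
def seed_next_lockfile(dir)
  FileUtils.cp(File.join(dir, "Gemfile.lock"), File.join(dir, "Gemfile.next.lock"))
end

# End of the upgrade: promote the next lockfile to be the main one.
def promote_next_lockfile(dir)
  FileUtils.cp(File.join(dir, "Gemfile.next.lock"), File.join(dir, "Gemfile.lock"))
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, "Gemfile.lock"), "rails (7.0.8.4)\n")
  seed_next_lockfile(dir)

  # Placeholder for the months of work done through bin/next bundle install.
  File.write(File.join(dir, "Gemfile.next.lock"), "rails (7.1.3.4)\n")

  promote_next_lockfile(dir)
  puts File.read(File.join(dir, "Gemfile.lock")) # rails (7.1.3.4)
end
```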