Shane Tully · 2024-09-18 · ruby, rails

How we upgrade major Rails versions

As a company whose product is built on top of Ruby on Rails, conducting a major version upgrade of the underlying framework is just about the biggest upkeep item we regularly undertake. The whole process takes months — with multiple cycles of development work, rounds of automated and manual testing, and a phased rollout process. Here's how we do it.

Gemfile.next

The core idea behind our upgrade process is to make the application compatible with both the current and the next version of Rails at the same time. During the upgrade we can then boot the application with either version and run the tests against both. Toggling a single switch chooses which version to run, which makes regressions much easier to find.

This is done by having a Gemfile.next in addition to our current Gemfile. Gemfile.next is simply a symbolic link to the current Gemfile:

$ ls -l Gemfile.next
lrwxrwxrwx shane shane 7 B Fri May  7 12:26:50 2021 Gemfile.next ⇒ Gemfile

Then, at the top of the Gemfile is a short function definition:

def next?
  File.basename(__FILE__) == "Gemfile.next"
end

This allows us to put conditionals in the Gemfile for which version of Rails to use (and any other gems), as such:

gem 'rails', (next? ? '7.1.3.4' : '7.0.8.4')
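To see why a symlink is enough, here is a minimal sketch that evaluates one file under two names. It assumes the file's contents are evaluated with the path it was opened by, so that `__FILE__` reflects the symlink's name rather than its target. The `rails_version_for` and `demo_versions` helpers and the temporary directory are hypothetical and exist only for this illustration.

```ruby
require "tmpdir"

# The same kind of content a Gemfile might hold. The last expression stands in
# for the version that would be passed to `gem 'rails', ...`.
GEMFILE_BODY = <<~'RUBY'
  def next?
    File.basename(__FILE__) == "Gemfile.next"
  end

  next? ? "7.1.3.4" : "7.0.8.4"
RUBY

# Hypothetical helper: evaluate a Gemfile-like file using the path it was
# opened by, so __FILE__ inside it is that path (symlinks are not resolved).
def rails_version_for(path)
  Object.new.instance_eval(File.read(path), path)
end

# Write one file, symlink it as Gemfile.next, and evaluate both names.
def demo_versions
  Dir.mktmpdir do |dir|
    gemfile = File.join(dir, "Gemfile")
    next_gemfile = File.join(dir, "Gemfile.next")
    File.write(gemfile, GEMFILE_BODY)
    File.symlink("Gemfile", next_gemfile)

    [rails_version_for(gemfile), rails_version_for(next_gemfile)]
  end
end

current, upcoming = demo_versions
puts "Gemfile      -> #{current}"
puts "Gemfile.next -> #{upcoming}"
```

Because both names share one file, every gem declaration that does not use the conditional is guaranteed to be identical across the two bundles.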

Finally, we have a wrapper script: bin/next. When prefixed with a given Rails command (such as bin/next rails console), bin/next will start the app with the "next" version of the gems in the Gemfile.

#!/bin/bash

# Use this file to run the app with the next version of Rails
#
# Usage:
#   bin/next bundle install
#   bin/next rails ...

export BUNDLE_GEMFILE=Gemfile.next
export BUNDLE_CACHE_PATH=vendor/cache.next
export BUNDLE_BIN=false
export NEXT=1

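# Bundler commands run directly; everything else goes through `bundle exec`.
# "$@" is quoted so arguments containing spaces are passed through intact.
if [[ "$1" == bundle* ]]; then
  "$@"
else
  bundle exec "$@"
fi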

Using this concept, we're able to easily switch between a Gemfile using the current set of gems and the "next" gems for everything, including local development and running test suites.
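For readers more comfortable in Ruby than bash, the wrapper's logic can be sketched roughly as follows. This is an illustration rather than a replacement for the script: `NEXT_ENV` and `next_command` are hypothetical names, and the sketch compares the first argument to "bundle" exactly, which is slightly stricter than the prefix match in the shell version.

```ruby
# Environment the wrapper exports before running the command.
NEXT_ENV = {
  "BUNDLE_GEMFILE" => "Gemfile.next",
  "BUNDLE_CACHE_PATH" => "vendor/cache.next",
  "BUNDLE_BIN" => "false",
  "NEXT" => "1"
}.freeze

# Bundler commands run as-is; everything else is prefixed with `bundle exec`.
def next_command(args)
  args.first == "bundle" ? args : ["bundle", "exec", *args]
end

# Show the command line that would be run for two typical invocations.
puts next_command(%w[bundle install]).join(" ")
puts next_command(%w[rails console]).join(" ")

# A real wrapper would finish by replacing the current process, for example:
#   exec(NEXT_ENV, *next_command(ARGV))
```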

Our upgrade process

1. Update the Gemfile. This includes the following tasks:
   - Copy Gemfile.lock to Gemfile.next.lock to start fresh from the current set of gem versions.
   - Set the new Rails gem version in the Gemfile.
   - Work through the necessary gem version upgrades to get a bundle that resolves all dependencies successfully.

2. Review the changes in the Rails upgrade guide.
   - Reviewing the Rails release notes early on is critical to avoid missing subtleties that might cause major problems later on during a deployment. Plus, it is easier to upgrade the application when you have a better idea of what the changes are.
   - The rails app:update task is also available for this, but we prefer to do it manually for the sake of having more control over the process.

3. Fix all tests and make any needed changes compatible with both versions of Rails.
   - Of course, this step is the bulk of the work. Depending on the size of the test suite and complexity of the application, this part of the process can take weeks or months.

4. Go through a round of manual testing with our Customer Success team.

5. Deploy the work.
   - First, we do a small rollout to a subset of customers.
   - We then do the final switchover by promoting Gemfile.next.lock to Gemfile.lock.

Yeah — that's a lot. So let's break down the bigger steps.

Maintaining compatibility

Once we have a bundle for the next Rails version, the first step is to get the application booting with it and ensure tests pass with both versions.

Patches

Similar to many large Rails applications, we have our fair share of patches to core and third-party Rails gems. For example, we have a patch to Rails' Rack logger to log the full URL of a request (rather than just the path):

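# Compare as versions rather than strings, so that e.g. 10.0 sorts after 7.2.
if Rails.gem_version >= Gem::Version.new('7.2')
  raise "Ensure patched methods below have not changed in Rails #{Rails.version}"
end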

class Rails::Rack::Logger
  # https://github.com/rails/rails/blob/v7.1.3.4/railties/lib/rails/rack/logger.rb#L54
  def started_request_message(request)
    format(
      'Started %s "%s%s%s" for %s at %s',
      request.request_method,
      request.protocol,
      request.host_with_port,
      request.filtered_path,
      request.ip,
      Time.zone.now.to_s
    )
  end
end

Of note here is the Rails version check at the top of the patch. When the next engineer tasked with upgrading Rails attempts to boot the application, the exception requires her to check the source of the patched method and confirm it has not changed in the new Rails version. She then bumps the version in the conditional, ready for the following upgrade.

This approach ensures that we don't miss updating any patches that might silently fail if the class they are patching has changed, resulting in the code not being called. There should ideally be tests for this patched behavior as well. But depending on how the test is written, it's possible for these to provide a false positive result if the patched class changed in the right way. We find that having a loud exception forcing an engineer to check patches during Rails upgrades is the more surefire way to verify they are still up to date.
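A complementary check, which is not described in our process above and is offered here only as a sketch, is to assert that the method being overridden still exists before reopening the class. It catches renames and removals, not behavior changes, so it does not replace the version guard. `UpstreamLogger` is a stand-in for a framework class, and `ensure_patch_target!` is a hypothetical helper.

```ruby
# Stand-in for an upstream class such as Rails::Rack::Logger.
class UpstreamLogger
  private

  def started_request_message(request)
    "Started #{request}"
  end
end

# Hypothetical guard: raise unless the method a patch overrides still exists,
# whether it is public, protected, or private.
def ensure_patch_target!(klass, method_name)
  return true if klass.method_defined?(method_name) || klass.private_method_defined?(method_name)

  raise NotImplementedError, "#{klass}##{method_name} no longer exists; review the patch"
end

# Check first, then reopen the class and override the method.
ensure_patch_target!(UpstreamLogger, :started_request_message)

class UpstreamLogger
  private

  def started_request_message(request)
    "Started (patched) #{request}"
  end
end

puts UpstreamLogger.new.send(:started_request_message, "GET /")
```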

Application code

With the application booting, the mammoth task of fixing all the broken tests begins. bin/next rails test or bin/next rspec (depending on the test suite) makes it easy to run individual tests against the two versions of Rails and cross-reference if something goes awry.

Ideally, a fix can be made that will be compatible with both versions of Rails. But in many cases, it's necessary to leave an if Rails.version >= 'X.Y' conditional in the code. This will need to be cleaned up after the final deployment, but it allows the application to become compatible with both versions incrementally as we work through fixing all the tests.
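One caveat with version conditionals: `Rails.version` is a string, and strings compare character by character, which gives the wrong answer once a version component reaches two digits. A small sketch of a safer comparison using `Gem::Version` follows; `version_at_least?` is a hypothetical helper, and inside a Rails application `Rails.gem_version` already returns a `Gem::Version` that can be compared directly.

```ruby
# Hypothetical helper: compare dotted version strings numerically.
def version_at_least?(current, minimum)
  Gem::Version.new(current) >= Gem::Version.new(minimum)
end

# String comparison looks at characters, so "10.0" sorts before "7.2".
puts "string compare:  #{'10.0' >= '7.2'}"
puts "version compare: #{version_at_least?('10.0', '7.2')}"

# Typical gate around code that only applies to the newer version.
rails_version = "7.1.3.4" # in an app this would come from Rails.version
if version_at_least?(rails_version, "7.1")
  puts "use the 7.1+ code path"
else
  puts "use the 7.0 code path"
end
```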

Cache keys

Another small tip is to ensure cache keys will be invalidated between Rails versions. Simply adding the Rails version to the cache key can prevent a whole category of difficult-to-debug issues when stale data applicable only to an old Rails version is used in a newer, incompatible version upon production deployment. For example:

Rails.cache.fetch("some-key-#{Rails.version}") do
  [business logic]
end
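To illustrate the effect, here is a minimal sketch with an in-memory stand-in for the cache. `MemoryCache` and `versioned_cache_key` are hypothetical names used only for this example. Entries written under one version are never read under the other because the keys differ.

```ruby
# Hypothetical helper: namespace a cache key by framework version.
def versioned_cache_key(key, version)
  "#{key}-#{version}"
end

# Tiny in-memory stand-in for Rails.cache.fetch, for illustration only.
class MemoryCache
  def initialize
    @store = {}
  end

  # Return the stored value, or compute and store it on a miss.
  def fetch(key)
    @store.fetch(key) { @store[key] = yield }
  end
end

cache = MemoryCache.new
old_value = cache.fetch(versioned_cache_key("some-key", "7.0.8.4")) { "written by 7.0" }
new_value = cache.fetch(versioned_cache_key("some-key", "7.1.3.4")) { "written by 7.1" }

puts old_value
puts new_value
```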

Running the test suite

The next challenge is running the test suite against both versions on the CI platform. There are a few considerations here:

Running the test suite for the engineer(s) working on the upgrade directly to monitor progress toward a 100% test pass rate
Once 100% of tests pass, running the test suite for other team members to ensure their ongoing work isn't creating regressions
Minimizing the cost impact of doubling the resources to run the test suite over a potentially long period of time

We first configure our CI pipeline to run every step with a configurable command prefix, stored in the $CI_BUNDLE_PREFIX environment variable. We then enable a separate set of jobs that runs everything with this variable set to bin/next, alongside the regular jobs that leave it as an empty string for the current bundle.

The next configuration is to not fail the pipeline if these tests fail while they are still being fixed. We have a $RSPEC_NEXT_REQUIRED variable to control the reporting of the RSpec exit code. Initially, this is set to 0 to prevent the pipeline from being blocked. But once the tests all pass, we flip it to 1. This transfers the burden of ensuring a passing test suite onto the whole team if any of its ongoing work introduces a failing test in the next Rails version. The setup looks like this:

${CI_BUNDLE_PREFIX} rspec [...]
rspec_status=$?

if [ "$CI_BUNDLE_PREFIX" = "bin/next" ] && [ "$RSPEC_NEXT_REQUIRED" = "0" ]; then
  echo "Ignoring rspec exit code ${rspec_status} for bin/next"
  exit 0
fi

exit $rspec_status

Depending on the size of the test suite and the length of the upgrade process, it's also worth considering the additional resources and costs incurred from duplicating the test suite like this. We set up our configuration in a way that saves on costs: As long as the $RSPEC_NEXT_REQUIRED variable is set to 0, we have an additional branch filter that will only run the next jobs if the branch name matches a pattern such as /.*rails-next.*/. We then remove this branch filter when we're ready to start running the tests on all branches closer to deployment.
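The two CI decisions described in this section can be summarized in a short sketch. It assumes $RSPEC_NEXT_REQUIRED holds "0" or "1" as in the script above. The method names are hypothetical, and in practice this logic lives in the CI configuration rather than in Ruby.

```ruby
# Branch names matching this pattern always run the "next" jobs.
NEXT_BRANCH_PATTERN = /.*rails-next.*/

# Should the "next" jobs run for this branch? While the suite is not yet
# required, only upgrade branches pay for the extra run.
def run_next_jobs?(branch, next_required:)
  next_required == "1" || NEXT_BRANCH_PATTERN.match?(branch)
end

# Exit code the pipeline step reports. Failures in the "next" run are ignored
# until the suite is marked as required.
def pipeline_exit_code(rspec_status, bundle_prefix:, next_required:)
  return 0 if bundle_prefix == "bin/next" && next_required == "0"

  rspec_status
end

puts run_next_jobs?("feature/rails-next-fixes", next_required: "0")
puts run_next_jobs?("feature/new-report", next_required: "0")
puts pipeline_exit_code(1, bundle_prefix: "bin/next", next_required: "0")
puts pipeline_exit_code(1, bundle_prefix: "bin/next", next_required: "1")
```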

Ensuring consistent Gemfiles

Another challenge that we run into is ensuring gems in the current Gemfile — which are updated during the deployment process — are also reflected in the next Gemfile. Because the intention is for the next Gemfile to have at least some different gem versions by nature of the upgrade, this can be difficult. There's no way to know which gems should be different and which should be consistent. Fortunately (in our case at least), our third-party dependencies do not change that frequently. So this is a small problem, but one we must still pay attention to.

The first line of defense here is to continually remind other team members to reflect any changes to the current Gemfile within the next Gemfile. However, it's only natural for people to forget about this sometimes. To guard against this for the gems we update most frequently, we have the following script in our CI steps:

function git_gem_revision {
  # This searches through the given Gemfile.lock for a `GIT` block for the given gem and extracts its revision line
  awk -v GEM="$1" '
    $1 == "GIT" { git_gem = 1 }
    $1 == "GEM" { git_gem = 0 }
    git_gem && $1 == "remote:" && $2 ~ GEM".git$" { found_gem = 1 }
    found_gem && $1 == "revision:" { print $2; exit }
  ' "$2"
}

GEMS=("aha-services" "calculated_attributes")

for GEM in "${GEMS[@]}"; do
  if [ "`git_gem_revision "$GEM" "Gemfile.lock"`" != "`git_gem_revision "$GEM" "Gemfile.next.lock"`" ]; then
    echo "$GEM revision in Gemfile.lock does not match revision in Gemfile.next.lock. Ensure these are consistent to avoid mismatching gem versions during Rails upgrades by running \"bin/next bundle update $GEM --conservative\""
    exit 1
  fi
done

Because our CI setup only has one set of gems installed for a specific step (either current or next), we must parse the lockfiles manually to get the revision for a given gem (rather than relying on bundle or gem to tell us what is installed, as it won't know!). In the cases above, we're only concerned with our first-party gems that are installed via Git. But this same method could be extended to gems from RubyGems as well. If an inconsistency is found between the versions in each lockfile, the build will fail. This forces the engineer to ensure the versions are consistent.
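As a rough idea of what the RubyGems extension could look like, the sketch below reads resolved versions out of the GEM section of a lockfile using plain text parsing. It is not part of our CI script. `rubygems_versions` is a hypothetical helper, the sample lockfile (including its example.com remote) is invented, and the parsing assumes the usual layout where resolved gems are indented four spaces and their dependencies six.

```ruby
# Invented lockfile contents, used only to exercise the parser.
SAMPLE_LOCKFILE = <<~LOCK
  GIT
    remote: https://example.com/acme/internal_gem.git
    revision: 0123abc
    specs:
      internal_gem (1.2.0)

  GEM
    remote: https://rubygems.org/
    specs:
      rack (2.2.9)
      rails (7.0.8.4)
        actionpack (= 7.0.8.4)

  PLATFORMS
    ruby
LOCK

# Hypothetical helper: map gem name to resolved version for the GEM section.
def rubygems_versions(lockfile_text)
  versions = {}
  section = nil

  lockfile_text.each_line do |line|
    if line.match?(/\A[A-Z]/)
      # Unindented lines such as GIT, GEM, and PLATFORMS start a new section.
      section = line.strip
    elsif section == "GEM" && (match = line.match(/\A {4}(\S+) \(([^)]+)\)\s*\z/))
      # Exactly four spaces of indentation marks a resolved gem.
      versions[match[1]] = match[2]
    end
  end

  versions
end

rubygems_versions(SAMPLE_LOCKFILE).each do |name, version|
  puts "#{name} #{version}"
end
```

Running the helper over both Gemfile.lock and Gemfile.next.lock yields two hashes that can be compared for any chosen list of gems.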

As a last check, we will run bundle list and bin/next bundle list before the first production deployment and do a manual review to verify that the gem versions in the next bundle are the same versions or more recent. If anything got left behind, this is a good time to update it so nothing moves backward during the transition.
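Part of that manual review can be sketched in code. Assuming the two gem lists have already been turned into hashes of gem name to plain version string (how they are collected is left out here), the hypothetical `downgraded_gems` helper below reports any gem whose version in the next bundle is older than in the current one.

```ruby
# Hypothetical helper: list gems that would move backward when switching
# bundles. Both arguments map gem name to a plain version string such as "2.2.9".
def downgraded_gems(current, upcoming)
  current.each_with_object({}) do |(name, version), result|
    next_version = upcoming[name]
    next if next_version.nil? # gem was removed from the next bundle

    if Gem::Version.new(next_version) < Gem::Version.new(version)
      result[name] = [version, next_version]
    end
  end
end

# Invented sample data for illustration.
current_bundle = { "rails" => "7.0.8.4", "rack" => "2.2.9", "puma" => "6.4.2" }
next_bundle    = { "rails" => "7.1.3.4", "rack" => "2.2.8", "puma" => "6.4.2" }

downgraded_gems(current_bundle, next_bundle).each do |name, (from, to)|
  puts "#{name} would move backward: #{from} -> #{to}"
end
```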

Manual testing

All tests pass at this point. Because a major Rails upgrade touches so much of the application, we also involve our Customer Success team in a round of manual testing of all application functionality in a staging environment. Assuming a comprehensive test suite, this should uncover few to no legitimate issues, but it is still a valuable way to catch gaps in the test suite, which should be covered regardless. Manual testing is slow and expensive, though, so it is sometimes understandable to skip or limit it when there is high confidence in the test suite.

Deployment

With all the testing complete, it's finally time to deploy the upgrade. We do this in two stages:

An initial rollout limited to a subset of customers that is done outside of busy business hours and can be quickly rolled back if necessary
A final rollout to all customers with Gemfile.next.lock promoted to Gemfile.lock

For the initial rollout, we have a second script that wraps bin/next, called bin/conditional_next. This script uses an environment variable, $NEXT, to control whether to run a command with bin/next or not. By setting $NEXT to "1" on a subset of servers/containers, we can do an initial phased rollout with the next Gemfile. This also allows deployments of other unrelated changes to continue as normal.

#!/bin/bash

# Quote "$@" so arguments containing spaces are passed through intact.
if [ "${NEXT:-0}" = "1" ]; then
  echo "Running command with bin/next"
  bin/next "$@"
else
  "$@"
fi
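The per-host decision this script makes can be sketched in Ruby as follows. `conditional_command` is a hypothetical name, and the environment is passed in as a plain hash so the behavior of hosts with and without the flag can be compared side by side.

```ruby
# Hypothetical helper: prefix the command with bin/next only where NEXT is "1".
def conditional_command(args, env = ENV)
  env.fetch("NEXT", "0") == "1" ? ["bin/next", *args] : args
end

canary_host = { "NEXT" => "1" } # part of the initial rollout
regular_host = {}               # still on the current bundle

puts conditional_command(%w[rails server], canary_host).join(" ")
puts conditional_command(%w[rails server], regular_host).join(" ")
```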

Getting the next Rails version into production will likely reveal any outstanding missed issues quickly. Realistically, going through this process means potentially rolling back to the existing Rails version at least once (or twice … or three times). There are many moving parts here, and even all the testing in the world won't illuminate every problem prior to the first production deploy. Even if things look stable after this deployment, we like to leave this initial rollout running for a few hours or days before rolling out more widely.

Once everything looks quiet and any remediation fixes have been merged, it's time to do the final merge and deploy. We do this with a simple cp Gemfile.next.lock Gemfile.lock to promote the next lockfile to the main one. And with that, one more normal deployment will roll out the Rails upgrade to all production traffic. Everything will be smooth sailing given a sprinkle of good luck.
