Migrating a legacy database into an existing Rails app.
The Background
Sometimes, you need to connect to a legacy database and retrieve some data. Perhaps you are deep in the process of rewriting an application, or maybe your company has bought another company, leaving you, the poor lone developer, responsible for transferring all the data into the existing app.
In general, a process like this is considered a data pipeline and involves extracting data from a foreign source, potentially transforming it, and then loading it into your destination of choice. This process is usually referred to as ETL. Whatever your reason for migrating legacy data, there are many commercial tools out there claiming to help with the task. These tools often come at a cost and tend to focus on more than just one-off migration tasks; they also address data sources such as CSV files, S3 buckets and APIs. In the Rails community, there is an excellent gem called kiba, which has both an open-source and a commercial licence and focuses on such recurring processes.
The Scenario
In our case, we’ll consider something simpler. Our concrete scenario is as follows: We have a legacy relational database and an existing Rails app into which we want to migrate the legacy data. However, we acknowledge that the migration task involves more than just copying database rows; it may also require data transformation or processing (e.g. downloading and restoring files).
In order to avoid relying on gems and creating unnecessary dependencies, let’s use something that is already built into Rails: ActiveRecord and Rake tasks.
Let’s assume we have a working CMS-like system with authors and posts. Now our company acquires another company, and their high-traffic blog system (called “OldBlog”) is supposed to be integrated into our app.
The adapter file
Let’s start with the adapter file for the legacy database. Since ActiveRecord reads the table definitions and columns on the fly, we can define corresponding Rails models in an extra file, located in /lib for example. Note that in OldBlog the authors are called users and the posts are called articles.
In the file you are of course free to use any Rails magic or Ruby code that helps you with the migration later on. Here, this is demonstrated with custom associations and custom instance methods.
class OldBlog < ActiveRecord::Base
  self.abstract_class = true

  if ENV["OLD_BLOG_DATABASE_URL"].present?
    establish_connection(ENV["OLD_BLOG_DATABASE_URL"])
  elsif Rails.env.development?
    establish_connection(adapter: "mysql2", host: "localhost", database: "old_system_development")
  end

  # Mark every model as read only!
  def readonly?
    true
  end
end

class OldBlog::Article < OldBlog
  self.table_name = "articles"

  # the legacy articles table references its author via a "user" column
  belongs_to :user, foreign_key: "user"
end

class OldBlog::User < OldBlog
  self.table_name = "users"

  has_many :articles, foreign_key: "user"

  def first_name
    name.split(" ").first
  end

  def last_name
    name.split(" ").last
  end
end
We can define a parent class from which all the other models inherit. In our example, the old database uses MySQL, while the existing Rails app could be using PostgreSQL or, of course, something else. Defining associations in the adapter file also makes navigating the Rails console easier when inspecting the old database. For the live migration, we can use the OLD_BLOG_DATABASE_URL connection string to transfer the data into the production environment.
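For example, with the adapter file required (here assumed to live at lib/old_blog.rb), a quick look around in the Rails console might look like this:

# In a Rails console:
require Rails.root.join("lib/old_blog")

OldBlog::User.count                      # how many legacy users are there?
OldBlog::Article.first.user.last_name    # navigate the custom association
OldBlog::User.first.readonly?            # => true, accidental writes are blocked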
The core ingredient - a unique exchange key
To synchronise data between systems, we need a unique key for each row to be migrated. This allows us to periodically rerun our migration. Since we are taking a row-by-row approach, we’ll use that key to either instantiate a new record or work with an existing one if the data row has already been migrated.
Non-Rails legacy databases probably do not specify an id column in every table, but you will usually find something (e.g. a combined key) that can be used to uniquely refer to a specific record. We’ll store that key in our new system and perform our data sync around it.
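For illustration, if a legacy table had no id column but, say, a slug plus a published_on date that together identify a row (both column names are made up here), a small helper on the adapter model could build that combined key:

class OldBlog::Article < OldBlog
  self.table_name = "articles"

  # Hypothetical combined key: slug + published_on uniquely identify a row
  # in this made-up legacy schema.
  def legacy_key
    "#{slug}-#{published_on}"
  end
end

The rest of the process stays the same; only the lookup in the Rake task would use legacy_key instead of id.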
class AddLegacyIdToPosts < ActiveRecord::Migration[8.0]
  def change
    add_column :posts, :legacy_id, :string
    add_index :posts, :legacy_id
  end
end
In our example, we just specify a legacy_id for the posts. For the authors, we’ll use the email address as the exchange key. These extra exchange keys can be removed once the migration is complete and the new system is fully live.
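Dropping the extra column later is just a regular migration again, for example:

class RemoveLegacyIdFromPosts < ActiveRecord::Migration[8.0]
  def change
    remove_index :posts, :legacy_id
    remove_column :posts, :legacy_id, :string
  end
end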
The Rake file
Now, let’s define our Rake task. Again, we’ll use a row-by-row approach, meaning we’ll read every row from the legacy database, extract it, transform it, and load it into our database. Using find_each here loads the rows in batches, so we do not run into any memory issues.
namespace :etl do
  # IMPORTANT: the order of the tasks matters here (users before articles).
  task users: :environment do
    puts "Migrating users…"
    scope = OldBlog::User.all
    progress_bar = ProgressBar.new(scope.count) # e.g. from the progress_bar gem
    scope.find_each do |old_user|
      author = Author.where(email: old_user.email).first_or_initialize
      author.first_name = old_user.first_name
      author.last_name = old_user.last_name
      author.created_at = old_user.created
      # define as many attribute migrations as needed
      # ...
      author.save(validate: false)
      # perform any other needed task, e.g. downloading author images from the
      # old server and storing them with ActiveStorage in the new system
      progress_bar.increment!
    end
    puts "Migrating users done!"
  end

  task articles: :environment do
    puts "Migrating articles…"
    scope = OldBlog::Article.all
    progress_bar = ProgressBar.new(scope.count)
    scope.find_each do |old_article|
      post = Post.where(legacy_id: old_article.id).first_or_initialize
      post.title = old_article.title
      post.created_at = old_article.created
      # resolve the already migrated author via the exchange key (the email)
      post.author = Author.find_by(email: old_article.user.email)
      # define as many attribute migrations as needed
      # ...
      post.save(validate: false)
      progress_bar.increment!
    end
    puts "Migrating articles done!"
  end
end
These tasks can now be run locally and periodically, ensuring that all the necessary data is migrated from the old system. Of course, the way you work with the data here depends entirely on your application and your needs. Our example ETL process is made up, but it already shows the flexibility of manipulating the data on the fly, and you can probably guess how structured this approach stays even with a larger domain model.
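Because the authors have to exist before the posts can reference them, it can be handy to add a small wrapper task (not part of the file above, just a sketch) that runs everything in the correct order:

namespace :etl do
  desc "Run the whole legacy migration in the correct order"
  task all: [:users, :articles]
end

A single bin/rails etl:all then performs the complete sync.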
Conclusion
Although this example is simple, it scales with the number of database tables that need to be migrated. In all cases, you’ll need to watch out for internal dependencies (for example, we need to migrate users before migrating articles). Also note that we have not implemented any handling for delta records, i.e. rows that were deleted in the old system after they had already been migrated to the new one. There are approaches to this, but we usually found them unnecessary, since there is probably a day X when you put the old system into maintenance mode and migrate everything over at once.
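If you do need to clean up such deltas, one simple approach is to diff the exchange keys, roughly like this (again assuming the adapter models from above):

# Remove posts whose legacy counterpart no longer exists in the old system.
live_ids = OldBlog::Article.pluck(:id).map(&:to_s)
Post.where.not(legacy_id: nil).where.not(legacy_id: live_ids).destroy_all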
We have migrated several legacy databases for clients, and this process has always worked out pretty well. Sometimes, special treatment is required for extremely large tables, where the row-by-row approach simply takes too much time. In that case, you’ll need to find an alternative solution, such as a CSV table export/import. In general, though, you can achieve a lot with these simple steps.
Got a tricky migration on your hands? Drop us a message — we'd love to help.