Migrating a legacy database into an existing Rails app.
The Background
Sometimes, you need to connect to a legacy database and retrieve some data. Perhaps you are deep in the process of rewriting an application, or maybe your company has bought another company, leaving you, the poor lone developer, responsible for transferring all the data into the existing app.
In general, a process like this is considered a data pipeline and involves extracting data from a foreign source, potentially transforming it, and then loading it into your destination of choice. This process is usually referred to as ETL. Whatever your reason for migrating legacy data, there are many commercial tools out there claiming to help with the task. These tools often come at a cost and tend to focus on more than just one-off migration tasks; they also address data sources such as CSV files, S3 buckets and APIs. In the Rails community, there is an excellent gem called kiba, which has both an open-source and a commercial licence and focuses on such recurring processes.
The Scenario
In our case, we’ll consider something simpler. Our concrete scenario is as follows: We have a legacy relational database and an existing Rails app into which we want to migrate the legacy data. However, we acknowledge that the migration task involves more than just copying database rows; it may also require data transformation or processing (e.g. downloading and restoring files).
In order to avoid relying on gems and creating unnecessary dependencies, let’s use something that is already built into Rails: ActiveRecord and Rake tasks.
Let’s assume we have a working CMS-like system with authors and posts. Now our company acquires another company, and their high-traffic blog system (called “OldBlog”) is supposed to be integrated into our app.
The adapter file
Let’s start with the adapter file for the legacy database. Since ActiveRecord reads the table definitions and columns on the fly, we can define corresponding Rails models in an extra file, located in /lib for example. Note that in OldBlog the authors are called users and the posts are called articles.
In the file you are of course free to use any Rails magic or Ruby code that helps you with the migration later on. Here, this is demonstrated with custom associations and custom instance methods.
class OldBlog < ActiveRecord::Base
  self.abstract_class = true

  if ENV["OLD_BLOG_DATABASE_URL"].present?
    establish_connection(ENV["OLD_BLOG_DATABASE_URL"])
  elsif Rails.env.development?
    establish_connection(adapter: "mysql2", host: "localhost", database: "old_system_development")
  end

  # Mark every model as read only!
  def readonly?
    true
  end
end

class OldBlog::Article < OldBlog
  self.table_name = "articles"

  # the legacy articles table references its author via a "user" column
  belongs_to :user, foreign_key: "user"
end

class OldBlog::User < OldBlog
  self.table_name = "users"

  has_many :articles, foreign_key: "user"

  def first_name
    name.split(" ").first
  end

  def last_name
    name.split(" ").last
  end
end
We can define a parent class from which all the other models inherit. In our example, the old database uses MySQL, while the existing Rails app could be using PostgreSQL or, of course, something else. Defining associations in the adapter file also makes navigating the Rails console easier when inspecting the old database. For the live migration, we can use the OLD_BLOG_DATABASE_URL connection string to transfer the data into the production environment.
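For example, with the adapter file required (here assumed to live at lib/old_blog.rb), a quick look around in the Rails console might look like this:

# In a Rails console:
require Rails.root.join("lib/old_blog")

OldBlog::User.count                      # how many legacy users are there?
OldBlog::Article.first.user.last_name    # navigate the custom association
OldBlog::User.first.readonly?            # => true, accidental writes are blocked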
The core ingredient - a unique exchange key
To synchronise data between systems, we need a unique key for each row to be migrated. This allows us to periodically rerun our migration. Since we are taking a row-by-row approach, we’ll use that key to either instantiate a new record or work with an existing one if the data row has already been migrated.
Non-Rails legacy databases probably do not specify an id column in every table, but you will usually find something (e.g. a combined key) that can be used to uniquely refer to a specific record. We’ll store that key in our new system and perform our data sync around it.
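For illustration, if a legacy table had no id column but, say, a slug plus a published_on date that together identify a row (both column names are made up here), a small helper on the adapter model could build that combined key:

class OldBlog::Article < OldBlog
  self.table_name = "articles"

  # Hypothetical combined key: slug + published_on uniquely identify a row
  # in this made-up legacy schema.
  def legacy_key
    "#{slug}-#{published_on}"
  end
end

The rest of the process stays the same; only the lookup in the Rake task would use legacy_key instead of id.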
class AddLegacyIdToPosts < ActiveRecord::Migration[8.0]
  def change
    add_column :posts, :legacy_id, :string
    add_index :posts, :legacy_id
  end
end
In our example, we just specify a legacy_id for the posts. For the authors, we’ll use the email address as the exchange key. These extra exchange keys can be removed once the migration is complete and the new system is fully live.
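Dropping the extra column later is just a regular migration again, for example:

class RemoveLegacyIdFromPosts < ActiveRecord::Migration[8.0]
  def change
    remove_index :posts, :legacy_id
    remove_column :posts, :legacy_id, :string
  end
end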
The Rake file
Now, let’s define our Rake task. Again, we’ll use a row-by-row approach, meaning we’ll read every row from the legacy database, extract it, transform it, and load it into our database. Using find_each here loads the rows in batches, so we do not run into any memory issues.
namespace :etl do
  # IMPORTANT: the order of the tasks matters here (users before articles).
  task users: :environment do
    puts "Migrating users…"
    scope = OldBlog::User.all
    progress_bar = ProgressBar.new(scope.count) # e.g. from the progress_bar gem
    scope.find_each do |old_user|
      author = Author.where(email: old_user.email).first_or_initialize
      author.first_name = old_user.first_name
      author.last_name = old_user.last_name
      author.created_at = old_user.created
      # define as many attribute migrations as needed
      # ...
      author.save(validate: false)
      # perform any other needed task, e.g. downloading author images from the
      # old server and storing them with ActiveStorage in the new system
      progress_bar.increment!
    end
    puts "Migrating users done!"
  end

  task articles: :environment do
    puts "Migrating articles…"
    scope = OldBlog::Article.all
    progress_bar = ProgressBar.new(scope.count)
    scope.find_each do |old_article|
      post = Post.where(legacy_id: old_article.id).first_or_initialize
      post.title = old_article.title
      post.created_at = old_article.created
      # resolve the already migrated author via the exchange key (the email)
      post.author = Author.find_by(email: old_article.user.email)
      # define as many attribute migrations as needed
      # ...
      post.save(validate: false)
      progress_bar.increment!
    end
    puts "Migrating articles done!"
  end
end
These tasks can now be run locally and periodically, ensuring that all the necessary data is migrated from the old system. Of course, the way you work with the data here depends entirely on your application and your needs. Our example ETL process is made up, but it already shows the flexibility of manipulating the data on the fly, and you can probably guess how structured this approach stays even with a larger domain model.
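Because the authors have to exist before the posts can reference them, it can be handy to add a small wrapper task (not part of the file above, just a sketch) that runs everything in the correct order:

namespace :etl do
  desc "Run the whole legacy migration in the correct order"
  task all: [:users, :articles]
end

A single bin/rails etl:all then performs the complete sync.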
Conclusion
Although this example is simple, it scales with the number of database tables that need to be migrated. In all cases, you’ll need to watch out for internal dependencies (for example, we need to migrate users before migrating articles). Also note that we have not implemented any handling for delta records, i.e. rows that were deleted in the old system after they had already been migrated to the new one. There are approaches to this, but we usually found them unnecessary, since there is probably a day X when you put the old system into maintenance mode and migrate everything over at once.
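If you do need to clean up such deltas, one simple approach is to diff the exchange keys, roughly like this (again assuming the adapter models from above):

# Remove posts whose legacy counterpart no longer exists in the old system.
live_ids = OldBlog::Article.pluck(:id).map(&:to_s)
Post.where.not(legacy_id: nil).where.not(legacy_id: live_ids).destroy_all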
We have migrated several legacy databases for clients, and this process has always worked out pretty well. Sometimes, special treatment is required for extremely large tables, where the row-by-row approach simply takes too much time. In that case, you’ll need to find an alternative solution, such as a CSV table export/import. In general, though, you can achieve a lot with these simple steps.
Got a tricky migration on your hands? Drop us a message — we'd love to help.