Dealing with Massive Data in Rails

By Yujing Zheng | Oct 28, 2016

Rails is amazing. It has helped individuals and entrepreneurs build minimum viable products much faster. Here at ChaiOne, Rails is the default backend choice whether it is used as a whole website solution or just as the backend API server for the mobile apps we build. We love its elegant syntax and unparalleled development speed.

However, the performance of the Rails framework has been questioned since day one, especially when dealing with massive data. You've probably heard stories like How We Went from 30 Servers to 2: Go. Will Rails be the bottleneck when you try to handle millions of records? Not if you follow these tips.

1. Avoid ActiveRecord if you can

ActiveRecord makes things a lot easier, but it is not designed for bulk data processing. When you want to apply a handful of simple operations to millions of rows, you are better off sticking with raw SQL statements, as sketched below. If you feel you really need an ORM-level tool to help you visualize things, check out Sequel.
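
For example, here is a minimal sketch of bypassing the ORM through the raw connection; the table and column names are illustrative:

  # Bulk write without instantiating a single model object
  sql = "UPDATE users SET note = 'Houstonian' WHERE city = 'Houston'"
  ActiveRecord::Base.connection.execute(sql)

  # For reads, select_all returns lightweight hashes instead of model objects
  rows = ActiveRecord::Base.connection.select_all("SELECT id, city FROM users")
  rows.each { |row| puts row["city"] }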

2. Update all records

This is a rookie error we see all the time: people iterate over the whole table to update a single attribute:

  User.where(city: "Houston").each do |user|
    user.note = "Houstonian"
    user.save
  end

The code is perfectly readable yet painfully inefficient: it loads every record into memory and issues a separate UPDATE, wrapped in its own transaction, for each row. If there are 100K Houston users, this could easily run for more than 24 hours. A much quicker and more efficient solution is update_all:

  User.where(city: "Houston").update_all(note: "Houstonian")

This issues a single SQL UPDATE and would take no longer than 30 seconds for the same amount of data. Note that update_all skips validations and callbacks, so use it only when that is acceptable.

3. Only load columns you need

Code like User.where(city: "Houston") will load every column of every matching user from the database. If you simply don't care about other attributes like age, gender, or occupation, you shouldn't fetch them in the first place. Use select when you have numerous records:

  User.select("city", "state").where(age: 29)
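
Along the same lines, if you only need raw values rather than model objects, pluck skips instantiating ActiveRecord objects entirely (the output shown is illustrative):

  User.where(age: 29).pluck(:city, :state)
  # => [["Houston", "TX"], ["Austin", "TX"], ...]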

4. Replace the classic Model.all.each with find_in_batches

Chances are that with small applications this will not even be noticed, but at scale it really matters: 100K user records could easily take 5+ GB of memory, and your server will probably crash. Therefore, we always recommend find_in_batches, which loads records in fixed-size chunks. (The example below assumes a grade column and a courses association.)

  User.where(grade: 2).find_in_batches(batch_size: 500) do |students|
    students.each do |student|
      # `courses` is an assumed association; this looks up the "PE"
      # record or creates it if missing
      student.courses.find_or_create_by(class_name: "PE")
    end
  end
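
If you do not need the batch array itself, find_each is a convenience wrapper over find_in_batches that yields records one at a time while still fetching in batches under the hood:

  User.where(grade: 2).find_each(batch_size: 500) do |student|
    student.courses.find_or_create_by(class_name: "PE")  # same assumed association as above
  end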

5. Reduce transactions

  (0.2ms) BEGIN
  (0.4ms) COMMIT

A transaction wraps every save, so even with find_in_batches these BEGIN/COMMIT pairs still occur millions of times. The only way to effectively reduce transactions is to group our operations. The previous example can be optimized further:

  User.where(grade: 2).find_in_batches(batch_size: 500) do |students|
    User.transaction do
      students.each do |student|
        student.courses.find_or_create_by(class_name: "PE")
      end
    end
  end

This way, instead of committing after every single record, it commits once per batch of 500, which is much more efficient.

6. Don’t forget to index

Always index important columns, or column combinations, that you query most frequently. Otherwise, your where clauses will fall back to full table scans and take forever.
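
For example, a migration for the user queries used in the earlier tips might look like this (the index choices are illustrative; on Rails 4 subclass plain ActiveRecord::Migration):

  class AddIndexesToUsers < ActiveRecord::Migration[5.0]
    def change
      add_index :users, :city          # for the city lookups in tip 2
      add_index :users, [:city, :age]  # composite index when filtering on both
    end
  end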

7. Destroy is expensive

Destroying records through ActiveRecord is a really expensive operation, so make sure you know what you are doing. Although destroy and delete both remove records, destroy also runs all callbacks, which can be really time-consuming. The same goes for destroy_all and delete_all. So if you are sure you just want to remove the records without touching anything else, use delete_all.
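
A quick side-by-side of the two on a relation:

  User.where(city: "Houston").destroy_all  # loads each record and runs callbacks: slow
  User.where(city: "Houston").delete_all   # one DELETE statement, no callbacks: fast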

Another case is cleaning out a whole table. Say you want to delete all users; you can use TRUNCATE directly:

  ActiveRecord::Base.connection.execute("TRUNCATE TABLE users")

Even so, deleting at the database level is still time-consuming. This is why we sometimes go with the "soft delete" approach: instead of removing rows, just mark them with deleted = 1, which is much faster.
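
A minimal sketch of that approach, assuming a boolean deleted column (the scope and method names are our own):

  class User < ActiveRecord::Base
    scope :active, -> { where(deleted: false) }

    def soft_delete
      update_column(:deleted, true)  # single UPDATE, skips validations and callbacks
    end
  end

  User.active.count  # only rows not marked as deleted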

8. You don’t have to run it immediately

Background jobs are your friend. Tools like Resque and Sidekiq (or plain threads) take long-running work out of the request cycle and make your life much easier.
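
For example, the bulk update from tip 2 could be moved into a Sidekiq worker; the worker class here is hypothetical and assumes the sidekiq gem is configured:

  class BulkNoteUpdateWorker
    include Sidekiq::Worker

    def perform(city, note)
      User.where(city: city).update_all(note: note)
    end
  end

  # Enqueue instead of blocking the request:
  BulkNoteUpdateWorker.perform_async("Houston", "Houstonian")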


In short, if your data is massive, drop down to the database level as much as you can. While ActiveRecord provides convenience, we have to admit it slows your system down a bit. With these practical tips, though, you can keep the rest of the Rails awesomeness without losing too much performance. We enjoy Rails as much as you do!