Moving Fast or Moving Efficiently?
I should know better, I really should.
I always talk up the benefits of automated testing, always push our team towards test-driven development, and make it a point talk about the time saved in the long run. This time I failed to heed my own advice and it bit me in the ass. Hopefully this post will act as a reminder to remain disciplined.
We’re working on a project that pulls a time series into our database on a nightly basis, does a little work with the data, adds a little of our own data, and gives us a nice real-time roll-up. The data is something that is usually only viewed quarterly since the aggregation takes a long time, but we hope to change that. Seeing the roll-up on a daily or weekly basis allows us to respond to challenges more quickly and keeps business running smoothly.
Our stack is pretty standard, we’re using Ruby on Rails with Postgres. Data is pulled in with an ActiveJob that runs nightly. We insert the data into Postgres without any manipulation, then use Postgres views to aggregate the data in various forms. Inside the Rails app we have read-only ActiveRecord models that reference the views. This is where the testing broke down.
Each row of an aggregate view may be impacted by a few hundred (or a few thousand) raw rows. This means that for each test we’d have to first populate hundreds of rows with known values, calculate our expectation by hand, then assert our expectation in the test. Writing it makes it seem so simple, but building out the actual tests was an extremely time-consuming process. It felt strange to have tests in ruby for calculations that were happening in the database, so I decided that testing in ruby didn’t make sense. I validated the formulas manually, brought them into the database, then smoke tested the values that we were getting out. Everything seemed to be running smoothly until we brought the system into production.
In production we ran into some odd behavior. Certain values were changing as we clicked through from our index pages into detail pages. Initially I thought that turbolinks was the culprit, but I was really just grasping at straws. It wasn’t until I started to think about spinning up a new instance, manually inserting values into the database, and looking at the results that I realized how truly absurd I was being. This is EXACTLY what automated tests are for. So I decided to go back and retroactively generate tests to validate the data in the views, it was validating Postgres calculation from Ruby, but it made the most sense given the rest of the app and my current workflow.
All in all it took about two days to pull together all of the tests. Along the way I found out that what was supposed to be one row in the view would sometimes split to two if certain conditions were met. This was an easy fix in Postgres, it took around 2 minutes once I saw the problem. The largest benefit is that we now have a hundred tests that validate our data for every case that we handle and we’ll know right away if future migrations break any of our expectations. Just as it should be.
Hopefully I’ll take my own advice next time, maybe I could have saved myself the trouble and identified some of these issues much earlier.