How to collect aggregated statistics in a Ruby on Rails application?
Date Published
Reading time
7 min
It is important to collect aggregated statistics so that management can analyze the data and make well-informed decisions. Sphere was retained by a client in the recruiting industry who, among other things, needed to collect the following data:
- Total shifts posted
- Total hours posted
- Total shifts worked
- Total hours worked
- Average length of shifts
- Average shifts per job
In addition, Sphere had to provide the possibility of "spoofing" the statistics to a certain point while the production database was being tested. Up to that point, the statistics should have been based not on the actual values from the database but on some customarily-entered data.
Since our client was using a Ruby on Rails application, we decided to write a statistics module in Ruby as well in order to leverage existing code and to simplify maintenance. We considered three implementation options:
Option | Advantages | Disadvantages
(1) Collect statistics on the fly. If total shifts posted must be calculated, a request is made to the corresponding table along with the constraints. | Statistics are always up to date. | Model code is polluted, as scopes and calculations for the statistics must be added.
Implementing the requirement of “spoofing” the statistics is difficult.
In order to calculate the data, SQL conditions like GROUP, JOIN, etc. must be added, which could lead to performance issues.
(2) Keep the statistics in separate tables which are updated on the fly. The data aggregated by day/employer is recorded and calculated in a separate table. If the data underlying the statistics are changed, a line recalculation takes place in the statistics table. | Statistics are always up to date. Implementing the requirement of “spoofing” the statistics is easy. | Model code is polluted, as adding callbacks to call the code of calculation of statistical data is necessary. If one model has changed, a request must be made for all models for this day for a given employer. In addition, the specifics of the application indicate that the models can change quite often during the day. Minimal statistics detailing period is 1 day.
(3) Keep the statistics in separate tables which are updated periodically. The data aggregated by day/employer is recorded and calculated in a separate table. Application data vary throughout the day. Then a special background task at the end of the day collects the changes and updates the statistics table. | Clean model code. All the logic is encapsulated in the collection of the statistics module. Implementing the requirement of “spoofing” the statistics is easy. | Statistics can be irrelevant as they do not change throughout the day. Minimal statistic detailing period is 1 day.
After presenting these three options to our client, we agreed to proceed with the third option.
Calculating & Storing Statistics
The
Statistics
::
Employer
model is used for calculating and storing statistical data. In its table, we store the date, employer's foreign key, and all other values needed to calculate the statistics (total hours posted, total hours worked, number of applications, and average number of applications).
class CreateStatisticsEmployers < ActiveRecord::Migrationdef changecreate_table :statistics_employers do |t|t.date :date, index: true, null: falset.references :employer_profile, index: true, foreign_key: true, null: falset.integer :jobs_count, default: 0, null: falset.integer :shifts_posted_count, default: 0, null: falset.decimal :hours_posted_count, default: 0, null: falset.integer :shifts_worked_count, default: 0, null: falset.decimal :hours_worked_count, default: 0, null: falseendendend
All formulas are contained in the model code:
module Statisticsclass Employer < ActiveRecord::Basebelongs_to :employer_profileclass << selfdef total_jobssum(:jobs_count)enddef total_shifts_postedsum(:shifts_posted_count)enddef total_hours_postedsum(:hours_posted_count)enddef average_length_of_shift_postedtotal_shifts_posted.zero? ? 0 : total_hours_posted / total_shifts_postedenddef average_shifts_per_jobtotal_jobs.zero? ? 0 : total_shifts_posted.to_f / total_jobsenddef total_shifts_workedsum(:shifts_worked_count)enddef total_hours_workedsum(:hours_worked_count)enddef average_length_of_shift_workedtotal_shifts_worked.zero? ? 0 : total_hours_worked / total_shifts_workedendendendend
Methods are using
ActiveRecord::Calculations
,
so they can be called up on any scope, which is useful for filtering by date/employer.
Collecting Statistics
The collection of statistics can be divided into three sub-tasks:
- What time to start daily statistics collection.
- What dates to collect statistics.
- How to collect statistics.
We have already answered the first question by choosing an embodiment (implementation variation). After analyzing the operation in the application, we found that the majority of shifts end before 2 a.m., so the statistics will be collected by schedule at 3 a.m.
Cron can be used to perform this task, but we decided to use
clockwork gem:
# clock.rbrequire 'clockwork'require './config/boot'require './config/environment'module Clockworkevery(1.day, 'statistics.collect', at: '3:00') { Statistics::CollectJob.perform_later }endStatistics::CollectJob is background job, consistently resolving the remaining two sub-tasks:# app/jobs/statistics/collect_job.rbmodule Statisticsclass CollectJob < ::BaseJobdef performStatistics::UntrackedDatesService.new.executeStatistics::UpdateUntrackedService.new.executeendendend
Statistics::UntrackedDatesService –
detects which dates are untracked and creates
UntrackedDate
for them. It always counts yesterday as untracked, as well as dates on models with
updated_at
after midnight the previous day.
UntrackedDate
is a very simple active record model that contains only
date
attribute with unique index.
As we collect statistics for jobs and shifts, we need to track
Job
and
Shift
model updates. Also, as we count jobs and posted shifts on each job's creation date, and worked shifts at the end time of each shift, we assume
Job#created_at
's and
JobShift#end_time
'
s dates are untracked if those jobs/shifts changed from the time of the last statistics update.
So the full code of
UntrackedDatesService
is
:
# app/services/statistics/untracked_dates_service.rbmodule Statisticsclass UntrackedDatesServiceattr_reader :working_datedef initialize(current_date = Date.current)@working_date = current_date - 1.dayenddef executeuntracked_dates.each do |date|Rails.logger.info "Marked #{date} as untracked"Statistics::UntrackedDate.find_or_create_by date: dateendendprivatedef untracked_dates[working_date,*untracked_past_jobs_dates,*untracked_past_shifts_posted_dates,*untracked_past_shifts_worked_dates].uniqenddef untracked_past_jobs_datesJob.where('updated_at >= ?', working_date.beginning_of_day).where('created_at < ?', working_date.beginning_of_day).pluck(:created_at).map(&:to_date)enddef untracked_past_shifts_posted_dates# similar logicenddef untracked_past_shifts_worked_dates# similar logicendendend
The last subtask is performed by
Statistics::UpdateUntrackedService.
It takes each untracked date, deletes
all
statistics for that day, and calculates new statistics. (Calculation is incapsulated in yet another service,
UpdateService
.) We need to delete all previous statistics to keep the process simple.
UpdateService
does not know why we mark this date as untracked. It just does what it is supposed to do.
In
UpdateService,
we create groupings by employer and calculate aggregated stats. Then we bulk insert all stats into the
Statistics::Employer
model:
module Statisticsclass UpdateServiceattr_reader :datedef initialize(date)@date = dateenddef executereturn if date < KEEP_LIVE_STATISTICS_FROMRails.logger.info "Updating statistics for #{date}"Statistics::Employer.where(date: date).delete_allStatistics::Employer.create employers_statisticsendprivatedef employers_statistics# Here we have a lot of ruby/rails/sql magic# and return array of hashes for each statistics entry# (i.e. grouped by date/employer_profile_id)endendend
This is all we need to collect and calculate statistics, but we have one more step to cover.
Callbacks
Sometimes a model’s time attributes can be changed. In that case, we can only track that statistics were changed in the new date, but not in the old one (because we can't know what the previous time was). So we have to use callbacks to track previous dates of previous timestamps.
Here is a
Tracking
module that could be required by any tracked model:
module Statisticsmodule Trackingextend ActiveSupport::Concernincluded docattr_accessor(:statistics_tracked_attributes) { Set.new }after_update :check_statistics_tracked_attributes_have_changedendclass_methods dodef track_attributes_for_statistics(*attributes)statistics_tracked_attributes.merge attributes.map(&:to_s)endenddef check_statistics_tracked_attributes_have_changed(statistics_tracked_attributes & changed).each do |attr|before, after = changes[attr]next unless date_changed?(before, after)Statistics::UntrackedDate.mark before.to_dateendenddef date_changed?(before, after)before && (!after || before.to_date != after.to_date)endendendand it is included into Jobinclude Statistics::Trackingtrack_attributes_for_statistics :created_atand JobShiftinclude Statistics::Trackingtrack_attributes_for_statistics :clocked_out_at
Now we have implemented full, easily expandable business logic to collect and output application statistics!
Output
Finally, we need all collected data to output. Since we use ActiveAdmin, I will show ARB code snippets and the screenshots it outputs.
First, we need filter form:
form_for search, url: admin_statistics_employer_path, method: 'post' do |f|f.text_field :employer_profile_idf.text_field :from , class: 'datepicker', 'data-datepicker-options' => '{"maxDate": "-1d"}'span '-'f.text_field :to, class: 'datepicker'f.submit 'Filter'endsearch here is a form object that takes params[:search] and returns scoped Statistics::Employer.where(date: from..to).We can output total statistics by the period:table dothead dotr doth :statth :value, style: 'text-align: right'endendtbody do%w(total_jobs total_shifts_posted total_hours_postedaverage_length_of_shift_posted average_shifts_per_jobtotal_shifts_worked total_hours_workedaverage_length_of_shift_worked).each do |stat_name|tr dotd stat_name.titleizetd number_with_delimiter(stats.public_send(stat_name).round(1)), style: 'text-align: right'endendendend
We can output monthly breakdown of all these stats, using chartkick
gem:
h3 'Shifts'div line_chart([{ name: 'posted', data: stats.group_by_month(:date).sum(:shifts_posted_count) },{ name: 'worked', data: stats.group_by_month(:date).sum(:shifts_worked_count) }])h3 'Hours'div line_chart([{ name: 'posted', data: stats.group_by_month(:date).sum(:hours_posted_count) },{ name: 'worked', data: stats.group_by_month(:date).sum(:hours_worked_count) }])
Summary
We would like to emphasize
the following
:
- The “spoofing” requirement is implemented using a constant Statistics::KEEP_LIVE_STATISTICS_FROM. (Did you noticed it in the code above?) The process of forming and loading made-up statistics prior to this date is beyond the scope of this article.
- Prepopulating the statistics with the existing data is performed with a straightforward rake task – just take each date application worked and pass it to UpdateService.
- In the real statistics, there are some more complex metrics, like breakdown of job roles. We used Postgresql hstore columns for storing it, but this topic is also beyond the scope of this article.
More to read

Not all AI software development companies are equal. Learn what separates firms that truly build with AI from those that just use the word. Includes real questions to ask and red flags to avoid.

SaaS made sense a decade ago. For many businesses today, custom AI-powered software delivers better ROI, faster. Here’s how to know when to make the switch, and how to do it without disrupting your operations.

Dive into Sphere's full-stack developer journey with AI – from tackling code with GitHub Copilot to unleashing problem-solving insights with ChatGPT. Explore the potential of AI in software development projects: which tools are truly handy, how many hours can you save, and what's the next big thing? Pavel Korchak shares his insights.

In 2023, interoperability is still top of the agenda for leaders and decision-makers, referring to the timely and secure access, integration and use of electronic health data to optimize health outcomes for individuals and populations.