Aggregated statistics in a Ruby on Rails app

It is important to collect aggregated statistics so that management can analyze the data and make well-informed decisions. Sphere was retained by a client in the recruiting industry who, among other things, needed to collect the following data:

Total shifts posted
Total hours posted
Total shifts worked
Total hours worked
Average length of shifts
Average shifts per job

In addition, Sphere had to make it possible to "spoof" the statistics up to a certain date while the production database was being tested. Before that date, the statistics had to be based not on the actual values from the database but on manually entered data.

Since our client was using a Ruby on Rails application, we decided to write a statistics module in Ruby as well in order to leverage existing code and to simplify maintenance. We considered three implementation options:

Option | Advantages | Disadvantages

(1) Collect statistics on the fly. When, for example, total shifts posted is needed, a query with the relevant constraints is run against the corresponding table. | Statistics are always up to date. | Model code is polluted, because scopes and calculations for the statistics must be added. The “spoofing” requirement is difficult to implement. Calculating the data requires SQL clauses such as GROUP BY and JOIN, which could lead to performance issues.

(2) Keep the statistics in separate tables that are updated on the fly. The data, aggregated by day and employer, is calculated and recorded in a separate table. Whenever the data underlying the statistics changes, the affected row in the statistics table is recalculated. | Statistics are always up to date. The “spoofing” requirement is easy to implement. | Model code is polluted, because callbacks must be added to trigger the recalculation. When one model changes, all models for that day and employer must be queried again, and in this application the models can change quite often during the day. The finest granularity of the statistics is one day.

(3) Keep the statistics in separate tables that are updated periodically. The data, aggregated by day and employer, is calculated and recorded in a separate table. Application data changes throughout the day, and a background task at the end of the day collects the changes and updates the statistics table. | Model code stays clean. All the logic is encapsulated in the statistics collection module. The “spoofing” requirement is easy to implement. | Statistics can be stale, because they are not updated during the day. The finest granularity of the statistics is one day.

After presenting these three options to our client, we agreed to proceed with the third option.

Calculating & Storing Statistics

The Statistics::Employer model is used for calculating and storing statistical data. In its table, we store the date, the employer's foreign key, and all the values needed to calculate the statistics (jobs count, shifts posted, hours posted, shifts worked, and hours worked).

```ruby
class CreateStatisticsEmployers < ActiveRecord::Migration
  def change
    create_table :statistics_employers do |t|
      t.date :date, index: true, null: false
      t.references :employer_profile, index: true, foreign_key: true, null: false

      t.integer :jobs_count, default: 0, null: false
      t.integer :shifts_posted_count, default: 0, null: false
      t.decimal :hours_posted_count, default: 0, null: false
      t.integer :shifts_worked_count, default: 0, null: false
      t.decimal :hours_worked_count, default: 0, null: false
    end
  end
end
```

All formulas are contained in the model code:

```ruby
module Statistics
  class Employer < ActiveRecord::Base
    belongs_to :employer_profile

    class << self
      def total_jobs
        sum(:jobs_count)
      end

      def total_shifts_posted
        sum(:shifts_posted_count)
      end

      def total_hours_posted
        sum(:hours_posted_count)
      end

      def average_length_of_shift_posted
        total_shifts_posted.zero? ? 0 : total_hours_posted / total_shifts_posted
      end

      def average_shifts_per_job
        total_jobs.zero? ? 0 : total_shifts_posted.to_f / total_jobs
      end

      def total_shifts_worked
        sum(:shifts_worked_count)
      end

      def total_hours_worked
        sum(:hours_worked_count)
      end

      def average_length_of_shift_worked
        total_shifts_worked.zero? ? 0 : total_hours_worked / total_shifts_worked
      end
    end
  end
end
```

These methods use ActiveRecord::Calculations, so they can be called on any scope, which is useful for filtering by date or employer.
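To make the scope behaviour concrete, here is a plain-Ruby sketch that mimics what happens when these formulas run on a filtered set of rows, such as Statistics::Employer.where(date: from..to). The DailyRow struct, the PeriodStats class, and the sample numbers are hypothetical stand-ins for the table rows and the relation; they are not part of the application, and hours are plain floats here rather than decimals.

```ruby
require 'date'

# Hypothetical stand-in for one row of the statistics_employers table.
DailyRow = Struct.new(:date, :employer_profile_id, :jobs_count,
                      :shifts_posted_count, :hours_posted_count)

# Hypothetical stand-in for a scoped relation: it holds only the rows
# that match the filter and applies the same formulas as the model.
class PeriodStats
  def initialize(rows)
    @rows = rows
  end

  # Narrow the set of rows, much like chaining where(date: from..to).
  def between(from, to)
    PeriodStats.new(@rows.select { |row| (from..to).cover?(row.date) })
  end

  def total_jobs
    @rows.sum(&:jobs_count)
  end

  def total_shifts_posted
    @rows.sum(&:shifts_posted_count)
  end

  def total_hours_posted
    @rows.sum(&:hours_posted_count)
  end

  # Guard against division by zero when the period has no shifts.
  def average_length_of_shift_posted
    total_shifts_posted.zero? ? 0 : total_hours_posted / total_shifts_posted
  end

  def average_shifts_per_job
    total_jobs.zero? ? 0 : total_shifts_posted.to_f / total_jobs
  end
end

rows = [
  DailyRow.new(Date.new(2016, 3, 1), 1, 2, 4, 30.0),
  DailyRow.new(Date.new(2016, 3, 2), 1, 1, 2, 18.0),
  DailyRow.new(Date.new(2016, 4, 1), 2, 3, 3, 24.0)
]

march = PeriodStats.new(rows).between(Date.new(2016, 3, 1), Date.new(2016, 3, 31))
puts march.total_shifts_posted            # 6
puts march.average_length_of_shift_posted # 8.0
puts march.average_shifts_per_job         # 2.0
```

Because every formula is built from sums over whatever rows are in scope, the same methods answer questions for one employer, one month, or the whole table.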

Collecting Statistics

The collection of statistics can be divided into three sub-tasks:

When to start the daily statistics collection.
Which dates to collect statistics for.
How to collect the statistics.

We have already answered the first question by choosing the third option: the statistics are collected once a day. After analyzing how the application is used, we found that the majority of shifts end before 2 a.m., so the collection is scheduled for 3 a.m.

Cron can be used to perform this task, but we decided to use the clockwork gem:

```ruby
# clock.rb
require 'clockwork'
require './config/boot'
require './config/environment'

module Clockwork
  every(1.day, 'statistics.collect', at: '3:00') { Statistics::CollectJob.perform_later }
end
```

Statistics::CollectJob is a background job that handles the remaining two sub-tasks in sequence:

```ruby
# app/jobs/statistics/collect_job.rb
module Statistics
  class CollectJob < ::BaseJob
    def perform
      Statistics::UntrackedDatesService.new.execute
      Statistics::UpdateUntrackedService.new.execute
    end
  end
end
```

Statistics::UntrackedDatesService detects which dates are untracked and creates an UntrackedDate record for each of them. It always counts yesterday as untracked, as well as the dates of models whose updated_at is later than midnight of the previous day. UntrackedDate is a very simple Active Record model that contains only a date attribute with a unique index.

As we collect statistics for jobs and shifts, we need to track Job and Shift model updates. Also, as we count jobs and posted shifts on each job's creation date, and worked shifts at the end time of each shift, we treat the dates of Job#created_at and JobShift#end_time as untracked if those jobs or shifts have changed since the last statistics update.

Here is the code of UntrackedDatesService, with two of the three queries elided:

```ruby
# app/services/statistics/untracked_dates_service.rb
module Statistics
  class UntrackedDatesService
    attr_reader :working_date

    def initialize(current_date = Date.current)
      @working_date = current_date - 1.day
    end

    def execute
      untracked_dates.each do |date|
        Rails.logger.info "Marked #{date} as untracked"
        Statistics::UntrackedDate.find_or_create_by date: date
      end
    end

    private

    def untracked_dates
      [
        working_date,
        *untracked_past_jobs_dates,
        *untracked_past_shifts_posted_dates,
        *untracked_past_shifts_worked_dates
      ].uniq
    end

    def untracked_past_jobs_dates
      Job.where('updated_at >= ?', working_date.beginning_of_day)
         .where('created_at < ?', working_date.beginning_of_day)
         .pluck(:created_at).map(&:to_date)
    end

    def untracked_past_shifts_posted_dates
      # similar logic
    end

    def untracked_past_shifts_worked_dates
      # similar logic
    end
  end
end
```
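The selection rule for jobs can be tried out without a database. The following plain-Ruby sketch mirrors the untracked_past_jobs_dates query on in-memory data; the JobRecord struct, the untracked_dates helper, and the sample timestamps are hypothetical and exist only for this illustration.

```ruby
require 'date'

# Hypothetical in-memory stand-in for a Job row.
JobRecord = Struct.new(:created_at, :updated_at)

# Returns yesterday plus the creation dates of older records that were
# touched since the start of yesterday (a plain-Ruby mirror of the query).
def untracked_dates(records, current_date)
  working_date = current_date - 1
  day_start = Time.new(working_date.year, working_date.month, working_date.day)

  past_dates = records
               .select { |r| r.updated_at >= day_start && r.created_at < day_start }
               .map { |r| r.created_at.to_date }

  [working_date, *past_dates].uniq
end

records = [
  # Created long ago, edited yesterday: its creation date must be recounted.
  JobRecord.new(Time.new(2016, 3, 1, 9), Time.new(2016, 3, 9, 15)),
  # Created and last edited long ago: nothing to recount.
  JobRecord.new(Time.new(2016, 3, 2, 9), Time.new(2016, 3, 2, 10)),
  # Created yesterday: already covered by the working date itself.
  JobRecord.new(Time.new(2016, 3, 9, 8), Time.new(2016, 3, 9, 8))
]

dates = untracked_dates(records, Date.new(2016, 3, 10))
puts dates.map(&:to_s).inspect # ["2016-03-09", "2016-03-01"]
```

The second condition (created before the start of yesterday) only avoids listing yesterday twice; the final uniq would remove the duplicate anyway.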

The last sub-task is performed by Statistics::UpdateUntrackedService. It takes each untracked date, deletes all statistics for that day, and calculates new statistics. (The calculation is encapsulated in yet another service, UpdateService.) Deleting all previous statistics for the day keeps the process simple: UpdateService does not need to know why a date was marked as untracked. It just does what it is supposed to do.

In UpdateService, we group the data by employer and calculate the aggregated stats. Then we insert all the stats into the Statistics::Employer table in one call:

```ruby
module Statistics
  class UpdateService
    attr_reader :date

    def initialize(date)
      @date = date
    end

    def execute
      return if date < KEEP_LIVE_STATISTICS_FROM
      Rails.logger.info "Updating statistics for #{date}"

      Statistics::Employer.where(date: date).delete_all
      Statistics::Employer.create employers_statistics
    end

    private

    def employers_statistics
      # Here we have a lot of ruby/rails/sql magic
      # and return array of hashes for each statistics entry
      # (i.e. grouped by date/employer_profile_id)
    end
  end
end
```
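The body of employers_statistics is not shown in the article, so the following is only an illustration of the shape of its return value, not the authors' implementation. It groups in-memory records by employer in plain Ruby, whereas the real code presumably does the grouping in SQL. The PostedShift struct and the sample data are hypothetical.

```ruby
require 'date'

# Hypothetical in-memory stand-in for a posted shift.
PostedShift = Struct.new(:employer_profile_id, :posted_on, :hours)

# Builds one hash per employer for the given date, shaped like a subset
# of the columns of the statistics table.
def employers_statistics(shifts, date)
  shifts
    .select { |shift| shift.posted_on == date }
    .group_by(&:employer_profile_id)
    .map do |employer_profile_id, employer_shifts|
      {
        date: date,
        employer_profile_id: employer_profile_id,
        shifts_posted_count: employer_shifts.size,
        hours_posted_count: employer_shifts.sum(&:hours)
      }
    end
end

day = Date.new(2016, 3, 9)
shifts = [
  PostedShift.new(1, day, 8.0),
  PostedShift.new(1, day, 4.5),
  PostedShift.new(2, day, 6.0),
  PostedShift.new(2, day + 1, 7.0) # different date, ignored
]

stats = employers_statistics(shifts, day)
stats.each do |row|
  puts "employer #{row[:employer_profile_id]}: " \
       "#{row[:shifts_posted_count]} shifts, #{row[:hours_posted_count]} hours"
end
```

An array of hashes like this can be handed straight to create, which builds one record per hash.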

This is all we need to collect and calculate statistics, but we have one more step to cover.

Callbacks

Sometimes a model’s time attributes change. In that case, the collection job can only see that the statistics changed on the new date, not on the old one, because the previous value is no longer stored. So we use callbacks to mark the date of the previous timestamp as untracked.

Here is a Tracking module that can be included in any tracked model:

```ruby
module Statistics
  module Tracking
    extend ActiveSupport::Concern

    included do
      cattr_accessor(:statistics_tracked_attributes) { Set.new }
      after_update :check_statistics_tracked_attributes_have_changed
    end

    class_methods do
      def track_attributes_for_statistics(*attributes)
        statistics_tracked_attributes.merge attributes.map(&:to_s)
      end
    end

    # Note: reading `changed`/`changes` in an after_update callback is the
    # Rails 4.x idiom; on Rails 5.1+ the equivalent is `saved_changes`.
    def check_statistics_tracked_attributes_have_changed
      (statistics_tracked_attributes & changed).each do |attr|
        before, after = changes[attr]
        next unless date_changed?(before, after)
        Statistics::UntrackedDate.mark before.to_date
      end
    end

    def date_changed?(before, after)
      before && (!after || before.to_date != after.to_date)
    end
  end
end
```

It is included in Job:

```ruby
class Job < ActiveRecord::Base
  include Statistics::Tracking
  track_attributes_for_statistics :created_at
end
```

and in JobShift:

```ruby
class JobShift < ActiveRecord::Base
  include Statistics::Tracking
  track_attributes_for_statistics :clocked_out_at
end
```
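The date_changed? predicate decides which edits need an extra recount. The plain-Ruby sketch below reuses that predicate and applies it to hashes shaped like ActiveRecord's changes ({ 'attr' => [before, after] }). The dates_to_mark helper and the sample values are hypothetical and only illustrate the rule.

```ruby
require 'date'

# Same predicate as in the Tracking module: the old date needs a recount
# only if there was an old value and the calendar day moved (or the value
# was cleared).
def date_changed?(before, after)
  before && (!after || before.to_date != after.to_date)
end

# Collects the old dates that would be marked as untracked.
def dates_to_mark(changes, tracked_attributes)
  changes
    .select { |attr, _values| tracked_attributes.include?(attr) }
    .values
    .select { |before, after| date_changed?(before, after) }
    .map { |before, _after| before.to_date }
end

tracked = ['clocked_out_at']

# Clock-out moved across midnight: the old day must be recounted.
moved = { 'clocked_out_at' => [Time.new(2016, 3, 8, 23, 30), Time.new(2016, 3, 9, 0, 15)] }
# Clock-out moved within the same day: nothing extra to mark.
same_day = { 'clocked_out_at' => [Time.new(2016, 3, 8, 20), Time.new(2016, 3, 8, 22)] }
# First value ever set: there is no old date to recount.
first_set = { 'clocked_out_at' => [nil, Time.new(2016, 3, 8, 22)] }

puts dates_to_mark(moved, tracked).map(&:to_s).inspect # ["2016-03-08"]
puts dates_to_mark(same_day, tracked).inspect          # []
puts dates_to_mark(first_set, tracked).inspect         # []
```

Only the old date is marked here; the new date is picked up by the nightly job through updated_at.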

Now we have implemented full, easily expandable business logic to collect and output application statistics!

Output

Finally, we need to output all the collected data. Since we use ActiveAdmin, we will show the Arbre (.arb) code snippets that render it.

First, we need a filter form:

```ruby
form_for search, url: admin_statistics_employer_path, method: 'post' do |f|
  f.text_field :employer_profile_id
  f.text_field :from, class: 'datepicker', 'data-datepicker-options' => '{"maxDate": "-1d"}'
  span '-'
  f.text_field :to, class: 'datepicker'
  f.submit 'Filter'
end
```

Here, search is a form object that takes params[:search] and returns the scope Statistics::Employer.where(date: from..to).

We can then output the total statistics for the period:

```ruby
table do
  thead do
    tr do
      th :stat
      th :value, style: 'text-align: right'
    end
  end
  tbody do
    %w(total_jobs total_shifts_posted total_hours_posted
       average_length_of_shift_posted average_shifts_per_job
       total_shifts_worked total_hours_worked
       average_length_of_shift_worked).each do |stat_name|
      tr do
        td stat_name.titleize
        td number_with_delimiter(stats.public_send(stat_name).round(1)), style: 'text-align: right'
      end
    end
  end
end
```
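The search form object used by the filter form is not shown in the article. As a rough idea of what such an object might do, here is a minimal plain-Ruby sketch that turns the submitted strings into a date range. The class name StatisticsSearch, the string keys, and the default dates are all assumptions made for this illustration.

```ruby
require 'date'

# Hypothetical form object; the class name, the defaults, and the string
# keys are assumptions made for this sketch.
class StatisticsSearch
  attr_reader :employer_profile_id, :from, :to

  def initialize(params, today: Date.today)
    params ||= {}
    @employer_profile_id = params['employer_profile_id']
    # Default to the last full day, since today is not collected yet.
    @to = parse_date(params['to']) || (today - 1)
    # Default to one month before the end date.
    @from = parse_date(params['from']) || (@to << 1)
  end

  # The range a real implementation would pass to where(date: from..to).
  def date_range
    from..to
  end

  private

  # Returns nil for blank or unparseable input instead of raising.
  def parse_date(value)
    return nil if value.nil? || value.strip.empty?

    Date.parse(value)
  rescue ArgumentError
    nil
  end
end

search = StatisticsSearch.new(
  { 'from' => '2016-03-01', 'to' => '2016-03-31' },
  today: Date.new(2016, 4, 10)
)
puts search.date_range # 2016-03-01..2016-03-31
```

Keeping the parsing and defaults in one object leaves the view and the controller free of date handling.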

We can also output a monthly breakdown of all these stats using the chartkick gem (the group_by_month method comes from the groupdate gem):

```ruby
h3 'Shifts'
div line_chart(
  [
    { name: 'posted', data: stats.group_by_month(:date).sum(:shifts_posted_count) },
    { name: 'worked', data: stats.group_by_month(:date).sum(:shifts_worked_count) }
  ]
)

h3 'Hours'
div line_chart(
  [
    { name: 'posted', data: stats.group_by_month(:date).sum(:hours_posted_count) },
    { name: 'worked', data: stats.group_by_month(:date).sum(:hours_worked_count) }
  ]
)
```
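For readers unfamiliar with monthly grouping, the sketch below reproduces the idea in plain Ruby: daily rows are bucketed by the first day of their month and one column is summed per bucket. The DailyStat struct, the monthly_sum helper, and the sample numbers are hypothetical, and the exact key type returned by the real grouping call is assumed here to be a date.

```ruby
require 'date'

# Hypothetical stand-in for a daily statistics row.
DailyStat = Struct.new(:date, :shifts_posted_count, :shifts_worked_count)

# Sums one column per calendar month, keyed by the first day of the month.
def monthly_sum(rows, column)
  rows
    .group_by { |row| Date.new(row.date.year, row.date.month, 1) }
    .transform_values { |month_rows| month_rows.sum { |row| row[column] } }
end

rows = [
  DailyStat.new(Date.new(2016, 3, 1), 4, 3),
  DailyStat.new(Date.new(2016, 3, 15), 2, 2),
  DailyStat.new(Date.new(2016, 4, 2), 5, 1)
]

posted = monthly_sum(rows, :shifts_posted_count)
worked = monthly_sum(rows, :shifts_worked_count)

# Roughly the shape of the series passed to line_chart above.
series = [
  { name: 'posted', data: posted },
  { name: 'worked', data: worked }
]

posted.each { |month, total| puts "#{month}: #{total}" }
# 2016-03-01: 6
# 2016-04-01: 5
```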

Summary

We would like to emphasize the following:

The “spoofing” requirement is implemented using the constant Statistics::KEEP_LIVE_STATISTICS_FROM. (Did you notice it in the code above?) The process of preparing and loading made-up statistics for dates before this one is beyond the scope of this article.
Prepopulating the statistics from existing data is done with a straightforward rake task: take each date on which the application was in use and pass it to UpdateService.
The real statistics include some more complex metrics, such as a breakdown by job role. We used PostgreSQL hstore columns to store them, but this topic is also beyond the scope of this article.
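The first two points interact: a backfill walks over past dates, while the cutoff protects the manually entered rows. A minimal plain-Ruby sketch of that interaction follows. The cutoff value, the dates_to_backfill helper, and the sample range are hypothetical; the article does not give the real value of the constant or the code of the rake task.

```ruby
require 'date'

# Hypothetical cutoff: live statistics are kept from this date on.
# The real value of Statistics::KEEP_LIVE_STATISTICS_FROM is not given.
KEEP_LIVE_STATISTICS_FROM = Date.new(2016, 1, 1)

# Returns the dates a backfill task would hand to the update service,
# skipping everything before the cutoff so manually entered rows survive.
def dates_to_backfill(first_date, last_date, cutoff = KEEP_LIVE_STATISTICS_FROM)
  (first_date..last_date).reject { |date| date < cutoff }
end

dates = dates_to_backfill(Date.new(2015, 12, 30), Date.new(2016, 1, 2))
puts dates.map(&:to_s).inspect # ["2016-01-01", "2016-01-02"]
```

In the application itself the guard sits inside UpdateService#execute, so a backfill can pass every date and rely on the service to skip the spoofed period.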