How We Solved Ruby's Noisy Logs

This is a guest post from Jon Bryan, the CTO at Concierge Live, on how they tamed their Rails logs.

At Concierge Live we manage a Rails application that serves over 10,000 users on a daily basis. With scale comes errors: workers crash, network requests fail, services go down, things break. Visibility into our infrastructure is essential.

The problem

To gain visibility into our app we tried traditional application monitoring and exception tracking solutions. They always fell short. They usaully expect some very basic CRUD cycle, or assume that we only care about exceptions. Meanwhile, I have a worker that runs into a bug that none of these services give me insight into. We know we have the data in our logs, but unfortunately our logs look like this:

I, [2017-06-04T18:04:53.672231 #42348]  INFO -- : [] [df88dbaa-50fd-4178-85d7-d66279ea33b6] [] [bfa8242cd9733bf0211e334be203f0d0] Starting catalogue sync job
I, [2017-06-04T18:04:53.682311 #42348]  INFO -- : [] [df88dbaa-50fd-4178-85d7-d66279ea33b6] [] [bfa8242cd9733bf0211e334be203f0d0] Page 1 synced with 1000 records
I, [2017-06-04T18:04:53.694324 #42348]  INFO -- : [] [df88dbaa-50fd-4178-85d7-d66279ea33b6] [] [bfa8242cd9733bf0211e334be203f0d0] Page 2 synced with 1000 records
I, [2017-06-04T18:04:53.704234 #42348]  INFO -- : [] [df88dbaa-50fd-4178-85d7-d66279ea33b6] [] [bfa8242cd9733bf0211e334be203f0d0] Undefined method [] for nil
I, [2017-06-04T18:04:53.714543 #42348]  INFO -- : [] [df88dbaa-50fd-4178-85d7-d66279ea33b6] [] [bfa8242cd9733bf0211e334be203f0d0]   /Users/cl/.rvm/gems/ruby-1.9.3-p0/gems/rails-4.2.1/lib/rails/activerecord/connection.rb:435:in `to_s'
I, [2017-06-04T18:04:53.725532 #42348]  INFO -- : [] [df88dbaa-50fd-4178-85d7-d66279ea33b6] [] [bfa8242cd9733bf0211e334be203f0d0]   /Users/cl/.rvm/gems/ruby-1.9.3-p0/gems/rails-4.2.1/lib/rails/activerecord/connection.rb:135:in `execute'
I, [2017-06-04T18:04:53.733434 #42348]  INFO -- : [] [df88dbaa-50fd-4178-85d7-d66279ea33b6] [] [bfa8242cd9733bf0211e334be203f0d0]   /Users/cl/.rvm/gems/ruby-1.9.3-p0/primaryapp/workers/catalogue_sync.rb:435:in `perform'

Despite having all the information there the logs are next to impossible to use and read. Traditional logging systems didn't help. Obviously they won't parse the above format, and when we changed our logs into JSON payloads they were even harder to read.

The solution

The first solution for us was readable structured logging, which is an oxymoron. The more data we packed in our logs, the harder they were to read. After doing some research on how to solve this, we came across Timber. Specifically the Timber ruby gem. It solves this problem by integrating with our app instead of requiring us to change our log messages and format them in a specific way. Ah! Setup was a breeze, we didn't have to change anything in our app:

# Gemfile
gem 'timber'

And then:

bundle exec timber install

The second solution was an interface that made it simple to use my logs. This is best explained with a demo. Here's me using our logs today:

Concierge Live Demo

To point out the differences:

  1. My logs are readable! Notice they aren't prefixed with noisy tags, nor did I have to convert my logs to JSON.
  2. All of the data I need (and more) is still attached to the log line. I can view it by clicking it.
  3. I can search my logs using this structured data. Ex: http_response.time_ms:>500 😲

Taking it further with context

To my surprise, Timber captured additional data through it's integrations. My favorite integration is it's devise integration. This automatically added user context to our logs. It allows us to search our logs by users. For example, an engineer on my team was able to quickly resolve a user reported issue by running the following query:

Even our support reps are able to use our logs to provide more context around errors! This has become an invaluable tool in our support process.

This got me thinking, how could I use this for other processes in our application? Background jobs! To continue with my worker example above, we added a context that made tracking job executions easy:

Timber.with_context(catalogue_sync_job: {id: 1234}) do"Event catalogue sync job started")
  # ..."Event catalogue sync job finished")

Querying for these logs was simple:

Ever since we implemented this, we've been able to reduce debugging time, find issues we didn't even know existed, and increase overall visibility into our system. We couldn't be happier with Timber.