DBits

Developer Blog

Speeding Up Ruby Regular Expressions

Introduction

Ruby regular expressions could be called slow if you compare them to Perl regexes or grep or the like. Performance is relative, depending on the exact problem, but given two ways to reach the same end result, one of them faster with minimal to no trade-offs, you’d probably take the faster one.

Problem

Parsing logs in Ruby with Regexp#match can be slow. You can optimize your regular expression to a degree, and I admit I probably didn’t explore that area to the absolute fullest, but at some point you will still pay the cost of the #match method.

I encountered this problem while writing splog. Splog gives you a simple means to parse any file based on YAML-defined regular expressions. When developing splog I was first testing with anywhere from a 1- to a 1,000-line log file. After I implemented the initial round of features I stress tested with a 90k-line log file. That is when I encountered an ETA of 4 hours; something had to change.

Analysis

Let’s look at some of the profiling data for Ruby regular expressions. A bit of code first. To parse the log with a particular regular expression I must first match the current line and then the next line:

m = @pattern.match(line)

The reason I have to match the subsequent line is to handle rolling next lines into the current line. Think Java exceptions spanning multiple log entries.
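
For example, a single logical entry can span several physical lines (a made-up sample, not taken from the actual server.log):

2013-08-02 10:15:00,123 ERROR [org.example.Worker] Request failed
java.lang.NullPointerException
    at org.example.Worker.process(Worker.java:42)
    at org.example.Main.run(Main.java:17)

The stack trace lines have no timestamp of their own, so they have to be rolled into the entry that precedes them.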

The problem, though, is that for a 90k-line file I am doing at least 180k Regexp#match calls. This gets expensive fast. To illustrate, here are the time and ruby-prof output for 100 lines of the 90k-line server.log:

time head -n 100 /tmp/server.log | ruby -Ilib ./bin/splog --line-count=100 -p jboss_log4j_common -c  -o -v  
11.13s user 0.05s system 99% cpu 11.185 total
head -n 100 /tmp/server.log | ruby lib/splog-prof.rb --line-count=100 -p jboss_log4j_common -c /tmp/.splog.yml -o -v
%self      total      self      wait     child     calls  name
 98.84     11.117    11.117     0.000     0.000      298   Regexp#match
  0.07      0.008     0.008     0.000     0.000     9751   Kernel#class
  0.07      0.012     0.008     0.000     0.004      398   Array#each
  0.05      0.006     0.006     0.000     0.000      995   String#inspect
  0.04     11.189     0.005     0.000    11.184      298   Splog::LogParser#parse_line
  0.04      0.022     0.005     0.000     0.017      199   Hash#each
  0.04      0.004     0.004     0.000     0.000      199   MatchData#names
  0.03      0.047     0.004     0.000     0.043     3184  *Splog::LogParser#sigflat
  0.02     11.248     0.003     0.000    11.245        1   Enumerator::Generator#each
  0.02      0.008     0.002     0.000     0.006      199   Array#to_s
  0.02      0.002     0.002     0.000     0.000     1194   Hash#[]=
  0.02      0.010     0.002     0.000     0.008      199   Array#map!
  0.01      0.001     0.001     0.000     0.000      995   MatchData#[]
  0.01      0.001     0.001     0.000     0.000      199   Digest::Base#update
  0.01      0.001     0.001     0.000     0.000      995   Hash#has_key?
  0.01      0.001     0.001     0.000     0.000      199   Enumerator#peek
  0.01      0.003     0.001     0.000     0.002      199   Digest::Instance#digest
  0.01      0.004     0.001     0.000     0.003      199   <Class::Digest::Class>#digest
  0.01      0.052     0.001     0.000     0.051      199   Splog::LogParser#createsig
  0.01      0.001     0.001     0.000     0.000      199   Array#sort!
  0.01      0.055     0.001     0.054     0.000      100   Enumerator::Yielder#<<
  0.00      0.004     0.000     0.000     0.004      199   <Class::Digest::Class>#hexdigest
  0.00      0.000     0.000     0.000     0.000      199   Module#to_s
  0.00      0.000     0.000     0.000     0.000      398   Digest::Base#reset
  0.00      0.000     0.000     0.000     0.000      199   <Class::Digest::Base>#allocate
  0.00      0.000     0.000     0.000     0.000      199   Digest::Base#finish
  0.00      0.000     0.000     0.000     0.000      198   NilClass#nil?
  0.00      0.000     0.000     0.000     0.000      198   Kernel#nil?
  0.00      0.000     0.000     0.000     0.000      199   Digest::Class#initialize
  0.00      0.000     0.000     0.000     0.000      100   Enumerator#next

11 seconds for only 100 lines is bad. Bad enough to lead me to hours of trial and error, attempting to optimize the regex patterns and scouring Stack Overflow and Google for answers.

The best answer hours of research turned up was ‘use =~ since #match is slow’. I tried that and it made no difference, because I still needed the match group data: the named groups are what define the JSON fields and what is inserted into Mongo.
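
Whichever way the match is performed, the named captures still have to be read off a MatchData to build the field hash. A quick illustration (hypothetical pattern, not splog’s actual one):

pattern = /(?<level>\w+)\s+(?<message>.+)/
m = pattern.match("ERROR something went wrong")
fields = {}
m.names.each { |name| fields[name] = m[name] } if m
# fields => {"level"=>"ERROR", "message"=>"something went wrong"}

So simply switching to =~ changed nothing for me; a full match was still being performed for every line.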

After a lot of trial and error and testing out different designs I had to rethink the problem. Ruby is a joy to code in and implements the syntax of regular expressions just fine. But Ruby isn’t optimized like Perl or grep from a performance standpoint.

With that in mind, let’s use a scalpel instead of a hammer for this problem.

Solution

Let native grep test whether each subsequent line matches at all, and let Ruby run Regexp#match only when it is really necessary. Not only did this reduce the #match calls by 50%, it sped the program up by upwards of 1000%.

Since grep doesn’t support named groups I couldn’t use grep everywhere, but that’s no worry.

The main change is to take anywhere I match a next line in the enumerator and replace it with a system call to grep. This also required changing the regular expression to strip the named-group markers like ?<SomeGroup>, replacing them with an empty string, since grep doesn’t support named-group syntax:

# where parse_line invokes Regexp#match on the next log line to test for match-forward matches
parsed_next_line = @pattern_match_forward.nil? ? parse_line(next_line) : parse_line(next_line, {:regex => @pattern_match_forward})
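
The named-group stripping itself can be as simple as a gsub over the pattern source. A minimal sketch, not splog’s exact code (it assumes @pattern is the compiled Regexp):

# Strip the ?<Name> markers so egrep will accept the pattern
egrep_source = @pattern.source.gsub(/\?<\w+>/, '')
@pattern_egrep = "egrep \"#{egrep_source}\""

Real-world patterns may also need other Ruby-specific regex syntax translated before egrep will accept them.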

Instead of calling Regexp#match, just call Open3.capture3 and check whether the native grep invocation of the pattern succeeded:

# where @pattern_egrep and @pattern_match_forward_egrep are the patterns with the named groups replaced with '' to be compatible with grep
o, e, s = nil
begin
    o, e, s = Open3.capture3(@pattern_match_forward.nil? ? @pattern_egrep : @pattern_match_forward_egrep, :stdin_data=>next_line)
rescue Errno::EPIPE
    puts "Connection broke!"
    nil
end

In the above, the s returned by Open3.capture3 is the process status; a successful match is verified with s && s.success?.

If you want an example to copy and paste, try the code below:

require 'open3'

text = "Hello World!"
regexp = /(orld)/
regexp_egrep = "egrep \"(orld)\""
p regexp_egrep
o, e, s = Open3.capture3(regexp_egrep, :stdin_data=>text)

puts text =~ regexp
puts s.success?

regexp = /(foo)/
regexp_egrep = "egrep \"(foo)\""

o, e, s = Open3.capture3(regexp_egrep, :stdin_data=>text)
puts text =~ regexp
puts s.success?

Now that native grep is being used, let’s take a look at time and ruby-prof again.

time head -n 100 /tmp/server.log | ruby -Ilib ./bin/splog --line-count=100 -p jboss_log4j_common -c  -o -v  
0.96s user 0.84s system 104% cpu 1.723 total

That simple change reduced the time from 11 seconds to 1.7 seconds! The difference is much more dramatic when parsing the whole file. On my quad-core laptop, parsing the entire 90k-line log file with the old code had an ETA of 4 hours, while with the new code it took only 18 minutes.

Let’s look at the ruby-prof output for the first 100 lines as well:

head -n 100 /tmp/server.log | ruby lib/splog-prof.rb --line-count=100 -p jboss_log4j_common -c /tmp/.splog.yml -o -v
 %self      total      self      wait     child     calls  name
 29.98      0.314     0.303     0.000     0.010      396   <Class::Thread>#new
 26.94      0.273     0.273     0.000     0.000      198   Kernel#spawn
 18.12      0.185     0.183     0.000     0.002      198   <Module::Process>#detach
  2.36      0.896     0.024     0.000     0.873      198   Open3#popen_run
  1.29      0.054     0.013     0.041     0.000      594   Thread#value
  1.22      0.031     0.012     0.018     0.000      198   IO#write
  1.07      0.020     0.011     0.000     0.009      596   Array#each
  1.01      0.010     0.010     0.000     0.000      396   Thread#initialize
  0.87      1.012     0.009     0.000     1.003        1   Enumerator::Generator#each
  0.83      0.008     0.008     0.000     0.000     1188   IO#initialize
  0.82      0.008     0.008     0.000     0.000      100   Regexp#match
  0.77      0.008     0.008     0.000     0.000     1188   IO#close
  0.62      0.016     0.006     0.000     0.010      594   <Class::IO>#pipe
  0.47      0.921     0.005     0.000     0.916      198   Open3#popen3
  0.46      0.931     0.005     0.000     0.926      198   Open3#capture3
  0.36      0.004     0.004     0.000     0.000     4900   Kernel#class
  0.31      0.003     0.003     0.000     0.000      500   String#inspect
  0.28      0.003     0.003     0.000     0.000     1194   Hash#[]=
  0.27      0.046     0.003     0.000     0.043      100   Splog::LogParser#parse_line

The beautiful part of this profile is that there are no longer 298 expensive calls to Regexp#match; there are now only as many calls to Regexp#match as there are lines in the file. This dramatically speeds the code up. Offloading the match-forward tests to grep definitely pays off here, with minimal cost for spawning the grep processes.

Final Thoughts

It wasn’t my first or second option to go outside of Ruby to speed up the regular expressions; I felt it was too much of a hack at first. If I can make a performance optimization within a framework/library/language versus outside of it, all things being equal, I’d always choose to stay inside. I definitely know there are other optimizations I can make, but once I ruled out all the reasonable ones I could find, this solution presented itself, and it is actually a very clean solution.

I struggled with finding a good solution to the slowness of Ruby Regexp#match on large files, so hopefully this post will save some of you some headaches when searching for options.

Note that the code in this post was run on ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0].

RequireJS + AngularJS + Pnotify

Introduction

Let’s add pnotify (Pines Notify) to an AngularJS webapp using RequireJS. Typically with an AngularJS app you would do this per the pnotify directions, but when using RequireJS the steps are a bit different.

Basic Example

Steps:

  • From the pnotify GitHub repo, copy jquery.pnotify.js to your lib/pines directory, and jquery.pnotify.default.css and jquery.pnotify.default.icons.css to your css directory. You can choose the directory structure as you see fit, as long as it is set up properly in RequireJS.

  • Add the CSS to your index; there is no need to add the pnotify JS, however, as that will be handled by RequireJS:

!!! transitional
html(lang="en", ng-app="")
    head
        title Data Glue
        meta(name="viewport",content="width=device-width, initial-scale=1.0")
        script(data-main="js/main",src="js/lib/require/require.js")
        link(rel="stylesheet",type="text/css",media="all",href="css/bootstrap.min.css")
        link(rel="stylesheet",type="text/css",media="all",href="css/app.css")
        link(rel="stylesheet",type="text/css",media="all",href="css/pines/jquery.pnotify.default.css")
        link(rel="stylesheet",type="text/css",media="all",href="css/pines/jquery.pnotify.default.icons.css")

  • Your main.js should look similar to:
require.config({
    paths: {
        jquery: 'lib/jquery/jquery-2.0.3',
        angular: 'lib/angular/angular',
        text: 'lib/require/text',
        bootstrap: 'lib/bootstrap/bootstrap',
        'jquery.pnotify': 'lib/pines/jquery.pnotify'
        // You could also reference the github path
        //'jquery.pnotify': 'https://raw.github.com/sciactive/pnotify/master/jquery.pnotify.min'

    },
    baseUrl: 'js',
    shim: {
        'angular' : {'exports' : 'angular'},
        'angularMocks': {deps:['angular'], 'exports':'angular.mock'},
        'bootstrap': ["jquery"],
        // Nothing too special here, just remember to depend on jquery
        'jquery.pnotify': ["jquery"]
    },
    priority: [
        "angular"
    ],
    // http://stackoverflow.com/questions/8315088/prevent-requirejs-from-caching-required-scripts
    urlArgs: "bust=" + (new Date()).getTime()
});

require([
    'angular',
    'app',
    'routes',
    'bootstrap'
], function(angular, app, routes) {
    'use strict';
    angular.bootstrap(document, [app['name']]);
});
  • Now for defining a Controller using RequireJS that works with AngularJS. The key here is to depend on jquery and jquery.pnotify in the RequireJS define, not in the AngularJS injection array:
define ["jquery", "jquery.pnotify"], ($) ->
  ['$scope', ($scope) ->
    $.pnotify
        title: 'Regular Notice',
        text: 'Check me out! I\'m a notice.'

    # because this has happened asynchronously we've missed
    # Angular's initial call to $apply after the controller has been loaded
    # hence we need to explicitly call it at the end of our Controller constructor
    $scope.$apply();
  ]
  • This works great, but we can make this even more concise and modular by turning pnotify into a service.

Open your services.js, make sure that jquery and jquery.pnotify are dependencies, then add a factory for pnotify.

define(['angular', 'jquery', 'jquery.pnotify'], function (angular, $) {
    'use strict';

    angular.module('myApp.services', [])
        .value('version', '0.1')
        .service('sharedProperties', function () {
            return {
                foo: 'bar'
            };
        })
        .factory('notificationService', ['$http', function($http){
            return {
                notify: function(hash) {
                    $.pnotify(hash);
                }
            };
        }]);
});

The benefit here is not only less code in the controllers, but also that the notifications are abstracted, so you could easily switch out the notification framework or add additional logging without modifying your controllers.

Speaking of controllers, let’s change the previous controller definition to:

define [], () ->
  ['$scope', '$http', 'notificationService', ($scope, $http, notificationService) ->
    notificationService.notify
      title: 'Regular Notice',
      text: 'Check me out! I\'m a notice.'

    # because this has happened asynchronously we've missed
    # Angular's initial call to $apply after the controller has been loaded
    # hence we need to explicitly call it at the end of our Controller constructor
    $scope.$apply();
  ]

That’s it; you can now depend on notificationService in Angular for easy notifications.

Suppressing knockout.js Evaluation Errors

Introduction

This is not intended to be a beginner’s guide or tutorial on Knockout.js. For those of you who are unfamiliar with it, please see Knockout.js.

I’ve coded quite extensively in Knockout.js. It isn’t a large framework, and it does have a few minor pain points, but overall it is a highly effective framework that allows for significant productivity.

Problem

You may venture into working with much more unstructured content with Knockout. I personally prefer schemaless/modeless designs when working on projects that lend themselves to that style. When I say schemaless or modeless I mean passing JSON directly from the database through the response into Knockout observables.

class @ViewModelProducts
    constructor: () ->
        @products = ko.observableArray([])

    fetch_products: () ->
        the_promise = $.get('/products', {somearg: undefined})
        $.when(the_promise).done( (response) =>
            # Remember the => is required in the nested ajax to access @
            # I'm passing in either the js obj if it is native or the parsed version if it was stringified
            @products(if typeof response is "object" then response else if response is "" then undefined else JSON.parse(response))
        ).fail( (response) ->
            console.log response.error
        )

There is definitely a risk associated with this, however: Knockout failing to evaluate a field that doesn’t exist, whether the field was omitted intentionally or not. This risk would be a valid argument for those who favor type safety, which could be imposed somewhat more strictly by using JavaScript classes.

However, I’ve found that creating a CoffeeScript class/viewmodel and using the Knockout mapping plugin to map the JSON is not necessarily as efficient as simply reading the JSON straight into an observable. Don’t get me wrong, the Knockout mapping plugin works fantastically, but there is a decent amount of code just to write each class that you want to map.
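
For comparison, the class-per-model route looks something like this (a hypothetical sketch using the knockout.mapping plugin; Product is an invented name, not from my codebase):

class @Product
  constructor: (data) ->
    # knockout.mapping turns each field of data into an observable on this instance
    ko.mapping.fromJS(data, {}, @)

Every model you want bound needs a class like this, whereas the modeless approach binds the view straight to whatever fields come back: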

table
    thead
        tr
            th Name
            th Vendor
            th Price
    tbody(data-bind="foreach: viewModels.viewModelProducts.products")
        tr
            td(data-bind="text: name")
            td(data-bind="text: vendor")
            td(data-bind="text: price")

Now let’s say that the JSON returns several correct items and then one with a name and a price but no vendor field. This is going to cause Knockout.js to fail and not process any more data-binds on the page. This can be very inconvenient. You may reflexively think this is a good thing, and it very well may be, but there is definitely a use case for a backend model that may be missing fields unintentionally or by design.
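
For example, a response shaped like this (made-up data; the third item has no vendor field):

[
  { "name": "Widget",   "vendor": "Acme",    "price": 9.99 },
  { "name": "Sprocket", "vendor": "Acme",    "price": 2.49 },
  { "name": "Gadget",   "price": 4.99 },
  { "name": "Gizmo",    "vendor": "Initech", "price": 19.99 }
]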

The above will result in the products being listed up to the point of the one missing the vendor field, then none after that point:

Uncaught Error: Unable to parse bindings.
Message: ReferenceError: vendor is not defined;
Bindings value: text: vendor

This is the problem. In the event of a developer mistake or missing data, my opinion is that it is better to continue processing than to fail the rest of the page.

Solution

The solution to the above problem requires hacking Knockout.js to prevent it from terminating early in the processing of a page upon a data-bind evaluation error.

To contain all of my custom Knockout bindings I created a file called koBindingHandlers.coffee, which is included in my header after the Knockout library. In other words, Knockout must first be available before it can be overridden.

# Only one section is overridden here: the catch in 'parseBindingsString'
ko.utils.extend ko.bindingProvider.prototype, {
  'nodeHasBindings': (node) ->
    if node.nodeType is 1
      return node.getAttribute('data-bind') != null
    else if node.nodeType is 8
      return ko.virtualElements.virtualNodeBindingValue(node) != null
    else
      return false

  'getBindings': (node, bindingContext) ->
    bindingsString = this['getBindingsString'](node, bindingContext);
    if bindingsString
      return this['parseBindingsString'](bindingsString, bindingContext, node)
    else
      return null

  'getBindingsString': (node, bindingContext) ->
    if node.nodeType is 1
      return node.getAttribute('data-bind')
    else if node.nodeType is 8
      ko.virtualElements.virtualNodeBindingValue(node)
    else
      return null

  'parseBindingsString': (bindingsString, bindingContext, node) ->
    try
      bindingFunction = createBindingsStringEvaluatorViaCache(bindingsString, this.bindingCache)
      return bindingFunction(bindingContext, node);
    catch ex
      # Comment this out!
      # throw new Error("Unable to parse bindings.\nMessage: " + ex + ";\nBindings value: " + bindingsString)
      # Optionally uncomment this to debug problems, otherwise all errors are suppressed.
      # console.warn "Unable to parse bindings.\nMessage: " + ex + ";\nBindings value: " + bindingsString
      return undefined
}
ko.bindingProvider['instance'] = new ko.bindingProvider()

createBindingsStringEvaluatorViaCache = (bindingsString, cache) ->
  cacheKey = bindingsString;
  return cache[cacheKey] || (cache[cacheKey] = createBindingsStringEvaluator(bindingsString))

createBindingsStringEvaluator = (bindingsString) ->
  rewrittenBindings = ko.expressionRewriting.preProcessBindings(bindingsString)
  functionBody = "with($context){with($data||{}){return{" + rewrittenBindings + "}}}"
  return new Function("$context", "$element", functionBody)

In the above code, in ‘parseBindingsString’ I commented out:

throw new Error("Unable to parse bindings.\nMessage: " + ex + ";\nBindings value: " + bindingsString)

When this JS error is thrown it terminates the processing of any further Knockout bindings. The side effect of suppressing it is that it can become very difficult at times to debug when you’ve made a small mistake. In that case you can simply uncomment:

console.warn "Unable to parse bindings.\nMessage: " + ex + ";\nBindings value: " + bindingsString

In practice I find it rare that implementing the above workaround causes any development pain. If something seems off or isn’t rendering, I simply uncomment the console.warn, figure out what is breaking the Knockout binding, fix it, and continue.

I hope this helps some of you other Knockout users, since this was originally one of my pain points with the library.