Ruby regular expressions could be called slow if you compare them to Perl regexes or grep or the like. Performance is relative to the exact problem, of course. But given two ways to reach the same end result, one of them faster with minimal to no trade-offs, you’d probably take the faster one.
Parsing logs in Ruby with Regexp#match can be slow. You can optimize your regular expression to a degree, and I admit I probably didn’t explore that area to the absolute fullest, but at some point you still pay the cost of the #match method itself.
I encountered this problem while writing splog. Splog gives you a simple means to parse any file based on YAML-defined regular expressions. While developing splog I was first testing with log files of anywhere from 1 to 1,000 lines. After I implemented the initial round of features I stress tested with a 90k-line log file. That is when I encountered an ETA of 4 hours; something had to change.
Let’s look at some profiling data for Ruby regular expressions. A bit of code first. To parse the log with a particular regular expression I must match the current line and then the next line:
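The original listing isn’t reproduced here, so this is a sketch of that shape; the pattern and field names are illustrative, not splog’s actual code:

```ruby
# Illustrative pattern; splog builds its patterns from the YAML config.
PATTERN = /^(?<date>[0-9]{4}-[0-9]{2}-[0-9]{2}) (?<level>[A-Z]+) (?<message>.*)$/

def parse(lines)
  entries = []
  lines.each_with_index do |line, i|
    m = PATTERN.match(line) # match the current line
    next unless m
    # Named captures become the fields of the entry.
    entry = m.names.each_with_object({}) { |n, h| h[n] = m[n] }
    # Peek ahead: roll every non-matching next line into this entry
    # (e.g. a Java exception spanning several lines).
    j = i + 1
    while j < lines.size && PATTERN.match(lines[j]).nil?
      entry['message'] << "\n" << lines[j]
      j += 1
    end
    entries << entry
  end
  entries
end
```

Note that every line gets matched at least twice: once as the current line and once as a potential next line, which is where the 2x #match cost below comes from.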
The reason I have to match the subsequent line is to handle rolling next lines into the current entry. Think Java exceptions spread across multiple lines of the log.
The problem, though, is that for a 90k-line file I am doing at least 180k Regexp#match calls. This gets expensive fast. To illustrate, here is what ruby-prof reported for the first 100 lines of the 90k-line server.log.
11 seconds for only 100 lines is bad. Bad enough to lead me to hours of trial and error, attempts at optimizing the regex patterns, and scouring Stack Overflow and Google for answers.
The best answer those hours of research turned up was “use =~ since #match is slow”. I tried that and it made no difference, since I still needed the match group data: the named groups define the JSON fields that get inserted into Mongo.
After a lot of trial and error and testing out different designs I had to rethink the problem. Ruby is a joy to code in and implements regular expression syntax just fine, but it isn’t optimized for raw matching performance the way Perl or grep is.
With that in mind, let’s use a scalpel instead of a hammer for this problem.
Let native grep test every subsequent line to see whether it needs a Regexp#match, and let Ruby #match only when really necessary. Not only did this reduce the #match calls by 50%, it sped the program up by upwards of 1000%.
Since grep doesn’t support named groups I couldn’t use grep everywhere, but that’s no worry.
The main change is to take anywhere I match a next line in the enumerator and replace it with a system call to grep. This also required changing the regular expression: named-group markers like ?&lt;SomeGroup&gt; get replaced with an empty string, since grep doesn’t support named-group syntax:
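The conversion is a simple gsub over the pattern source (the pattern here is made up for illustration):

```ruby
# Strip named-group markers so the pattern works with grep -E:
# (?<date>...) becomes a plain group (...).
named = '^(?<date>[0-9]{4}-[0-9]{2}-[0-9]{2}) (?<level>[A-Z]+) (?<message>.*)$'
grep_pattern = named.gsub(/\?<\w+>/, '')
puts grep_pattern
```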
Instead of calling Regexp#match, call Open3.capture3 and use the exit status of the native grep invocation to decide whether the pattern matched:
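A minimal sketch of that call (assuming grep with -E support is on the PATH; the helper name is mine):

```ruby
require 'open3'

# Ask native grep whether the line matches; grep -E exits 0 on a match
# and 1 on no match, so the exit status stands in for Regexp#match.
def grep_match?(pattern, line)
  _stdout, _stderr, s = Open3.capture3('grep', '-E', pattern, stdin_data: line)
  s && s.success?
end
```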
Open3.capture3 returns stdout, stderr, and a status object s; a successful match is verified by s && s.success?.
If you want an example to copy and paste, try the code below.
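The original listing didn’t survive; the following stand-in combines the pieces above, and the pattern, helper names, and sample input are illustrative rather than splog’s actual code:

```ruby
require 'open3'

# Illustrative pattern with named groups for Ruby...
PATTERN = /^(?<date>[0-9]{4}-[0-9]{2}-[0-9]{2}) (?<level>[A-Z]+) (?<message>.*)$/
# ...and the same pattern with the named-group markers stripped for grep.
GREP_PATTERN = PATTERN.source.gsub(/\?<\w+>/, '')

def line_matches?(line)
  _out, _err, s = Open3.capture3('grep', '-E', GREP_PATTERN, stdin_data: line)
  s && s.success?
end

def parse(lines)
  entries = []
  lines.each do |line|
    if line_matches?(line)
      m = PATTERN.match(line) # pay for #match only when grep says it will hit
      entries << m.names.each_with_object({}) { |n, h| h[n] = m[n] }
    elsif entries.any?
      entries.last['message'] << "\n" << line # roll continuation lines in
    end
  end
  entries
end

puts parse([
  '2013-01-15 ERROR boom',
  "\tat com.example.Foo.bar(Foo.java:10)",
  '2013-01-16 INFO all good'
]).inspect
```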
Now that native grep is being used let’s take a look at time and ruby-prof again.
That simple change reduced the time from 11 seconds to 1.7! The difference is even more dramatic when parsing the whole file: on my quad-core laptop, parsing the entire 90k-line log file with the old code had an ETA of 4 hours, while with the new code it was only 18 minutes.
Let’s look at the ruby-prof of the first 100 lines as well.
The beautiful part of this profile is that there are no longer 298 expensive calls to Regexp#match; there are now only as many #match calls as there are lines in the file. This dramatically speeds the code up. Offloading the forward-looking matches to grep definitely pays off here, with minimal cost for spawning the grep process.
It wasn’t my first or second option to go outside of Ruby to speed up the regular expressions. I felt it was too much of a hack at first. If I can make a performance optimization within a framework/library/language versus outside of it, all things being equal, I’d always choose in. I definitely know there are other optimizations I could make, but once I ruled out all the reasonable ones I could find, this solution presented itself, and it is actually a very clean one.
I struggled with finding a good solution to the slowness of Ruby Regexp#match on large files, so hopefully this post will save some of you some headaches when searching for options.
Note that the code in this post was run on ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0]