Planet.debian.net

Russ Allbery: Log parsing and infinite streams

2013-01-23

I have a problem I have to solve for work that involves correlating Apache
access and error logs. Part of WebAuth logs successful authentications to
the Apache error log, and I want to correlate User-Agent strings with
users so that we can figure out what devices our users are using by
percentage of users rather than percentage of hits. The problem, as those
who have tried to do this prior to Apache 2.4 know, is that Apache doesn't
provide any easy way to correlate access and error log entries (made even
more complex because two separate components are involved).

I could have just hacked something together, but I've written way too many
ad hoc log parsers, and, see, I was reading this book....

The book in question is Mark Jason Dominus's Higher Order Perl.
I'm not quite done with it, and will post a full review when I am. I have
some problems with it, mostly around the author's choice of example
problems. But there is one chapter on infinite streams, and the moment I
read that chapter on the train, it struck me as the perfect solution to
log parsing problems.

I'm not much of a functional programming person (which is where Dominus is
drawing most of the material for this book), so I don't know if this
terminology is standard or specific to the book. An infinite
stream in this context is basically a variation on an iterator that lets
you look at the next item without consuming it. The power comes from
putting this modified iterator in front of a generator, using it to
consume one element at a time, and then composing this with transformation
and filtering functions. That gives you all the power of the iterator to
store state and lets you process one logical element from the stream at a
time, without inverting your flow of control.
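
To make the idea concrete, here is a minimal sketch of a peekable stream
wrapped around a generator. It is written in Python rather than Perl, and it
is neither the book's code nor the module described in this post. The names
Stream, head, get, map, filter, and entries are all made up for
illustration, and the sketch assumes None never appears as a real element,
since it doubles as the end-of-stream marker.

```python
from typing import Any, Callable, Iterable, Iterator

_EMPTY = object()  # sentinel: nothing is buffered right now


class Stream:
    """Iterator wrapper that can look at the next item without consuming it."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source = iter(source)
        self._buffer: Any = _EMPTY

    def head(self) -> Any:
        """Return the next item without consuming it (None at end of stream)."""
        if self._buffer is _EMPTY:
            self._buffer = next(self._source, None)
        return self._buffer

    def get(self) -> Any:
        """Return the next item and consume it (None at end of stream)."""
        item = self.head()
        self._buffer = _EMPTY
        return item

    def __iter__(self) -> Iterator[Any]:
        # Drain the stream one element at a time until it is exhausted.
        while (item := self.get()) is not None:
            yield item

    def map(self, func: Callable[[Any], Any]) -> "Stream":
        """New stream with func applied lazily to every element."""
        return Stream(func(x) for x in self)

    def filter(self, pred: Callable[[Any], bool]) -> "Stream":
        """New stream containing only the elements for which pred is true."""
        return Stream(x for x in self if pred(x))


def entries(lines: Stream) -> Iterator[str]:
    """Join indented continuation lines onto the entry they belong to.

    Peeking is what makes this easy: the loop can check whether the next
    line is a continuation before deciding to consume it.
    """
    while lines.head() is not None:
        entry = lines.get()
        while lines.head() is not None and lines.head().startswith(" "):
            entry += " " + lines.get().strip()
        yield entry


# Invented sample data, not real Apache output.
raw = ["[error] first", "  continued", "[notice] second", "[error] third"]
errors = (
    Stream(entries(Stream(raw)))
    .filter(lambda e: e.startswith("[error]"))
    .map(str.upper)
)
result = list(errors)
```

Nothing is read from the underlying source until something asks for an
element, so the same pipeline would work on a log file that is still being
written to.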

Dominus provides code in the book for a very nice functional
implementation of this that's about as close as you're probably going to
get to writing Haskell in Perl. Unfortunately, his publisher decided to
write their own free software license, so the license is kind of weird and
doesn't explicitly permit redistribution of modified code. It's probably
fine, but I didn't feel like dealing with it, and I'm more comfortable
with writing object-oriented code at the moment (at least in Perl), so I
decided to write an object-oriented version of the same code specific to
log parsing.
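
For contrast, the functional style is usually built on a very small data
structure: a stream is either empty or a pair of a head value and a
function that computes the rest of the stream on demand. The following is a
rough sketch of that general technique in Python, not a copy of the book's
Perl code. The function names (upfrom, smap, sfilter, take) are
illustrative, and the tail is recomputed on each call rather than cached.

```python
from typing import Any, Callable, List, Optional, Tuple

# A stream is None (empty) or a pair: (head value, zero-argument function
# that returns the rest of the stream when called).
StreamNode = Optional[Tuple[Any, Callable[[], Any]]]


def head(s: StreamNode) -> Any:
    """First element of a non-empty stream."""
    return s[0]


def tail(s: StreamNode) -> StreamNode:
    """Rest of a non-empty stream, computed only now."""
    return s[1]()


def upfrom(n: int) -> StreamNode:
    """Infinite stream n, n+1, n+2, ..."""
    return (n, lambda: upfrom(n + 1))


def smap(func: Callable[[Any], Any], s: StreamNode) -> StreamNode:
    """Stream of func applied to each element of s."""
    if s is None:
        return None
    return (func(head(s)), lambda: smap(func, tail(s)))


def sfilter(pred: Callable[[Any], bool], s: StreamNode) -> StreamNode:
    """Stream of the elements of s for which pred is true."""
    # Skip ahead to the first match; never returns if nothing ever matches.
    while s is not None and not pred(head(s)):
        s = tail(s)
    if s is None:
        return None
    return (head(s), lambda: sfilter(pred, tail(s)))


def take(s: StreamNode, n: int) -> List[Any]:
    """Collect up to the first n elements of s into a list."""
    out = []
    while s is not None and len(out) < n:
        out.append(head(s))
        s = tail(s)
    return out


evens = sfilter(lambda x: x % 2 == 0, upfrom(1))
squares_of_evens = take(smap(lambda x: x * x, evens), 4)
```

The data structure is transparent enough that every operation is just a
plain function over pairs, which is the property the object-oriented
version trades away for its extra machinery.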

That's what I've been doing since shortly after lunch, and I can't
remember the last time I've had this much fun writing code. I have a
reasonable first cut at a (fully tested and fully test-driven) log parsing
framework built on top of a reasonably generic implementation of the core
ideas behind infinite streams. I also used this as an opportunity to
experiment with Module::Build, and have discovered that the things I most
disliked about it have apparently all been fixed. And I'm also using Perl
5.10 features. (I was tempted to start using some things from 5.12, but I
do actually need to run this on Debian stable.) It's rather satisfying to
write a thoroughly modern Perl module.

There are some definite drawbacks to writing this in an object-oriented
fashion. There's rather more machinery and glue that has to be set up,
it's probably a bit slower, and it tends to accumulate layers of calls.
One of the advantages of the method with standalone functions and a very
simple, transparent data structure is that it's easier to eliminate
unnecessary call nesting. But I suspect the object-oriented version will
do what I want without any difficulties, and if I feel very inspired, I
can always fiddle with it later.

Maybe I'll eventually use this as a project to experiment with Moose as
well.

I'm surprised that no one else has done this, but I poked around on CPAN a
fair bit and couldn't find anything. This will all show up on CPAN (as
Log::Stream) as soon as I've finished enough of it to implement my sample
application. And then I'll hopefully find some time to rewrite our
metrics system using it, which should simplify it considerably....