Shortly after launching my Nginx-based cache + thumbnailing web-service, I realized I had no visibility into the performance of the service. I was curious what my hit-ratios were like, how much time was spent during a cache-miss, basic stuff like that. Nginx has monitoring tools, but it looks like they're only available to people who pay for Nginx Plus, so I decided to see if I could roll my own. In this post, I'll describe how I used Lua, cosockets, and Redis to extract real-time metrics from my thumbnail service.
Nginx's caching module exports a useful variable, upstream_cache_status, which indicates how the cache handled the request, for example HIT or MISS. Example configurations often show this variable being included in a response header, but you could also throw it in your access log. I considered doing just that -- including it in the access log and then retroactively parsing the logs to analyze the cache performance. This was unappealing for a couple reasons, though:
I wanted the data to be as close to real-time as possible, meaning I'd rather push the data then pull it from a log file.
Log files get rolled up, compressed, and thrown away. This adds complexity.
Most of my sites have access logging disabled for performance. If I could stay in RAM that'd be better.
These considerations led me to think of the problem as a producer/consumer situation. Nginx would produce the data, and another process would consume it in close to real-time. I knew Nginx supported embedded Lua scripting, and after reading up on the lua-nginx module I hit on using the cosocket API to push data into a Redis queue.
The lua-nginx module adds entry-points for user code execution at various stages during the request/response cycle. I had a bit of trouble finding good documentation on what exactly these stages were or what data would be available when the hook was executed (if anyone has a link, please share it in a comment). I settled on using the response header hook, but as I said, there may be a better place to do this.
Inside the response header hook, I assembled all the data I considered interesting and, using a plain old TCP socket, sent the data to my Redis server.
Let's look at some code. Below is the generalized function I wrote to take the cache status for a request to a given URL, and write it to a Redis list. The ngx.socket.tcp function creates a cosocket. A cosocket is neat because it runs in a lightweight thread (coroutine), hooking into Nginx's own event loop.
We will use the header_filter_by_lua_file directive to invoke this script whenever the cache server handles a request. In the image service's config file, I've got something like this:
Now that we've got our Lua script and told Nginx when to run it, we need to actually call the log_status() function with the relevant values. After the log_status function definition, I added the following code:
Whoa, what the hell, you are probably thinking. Why not just write:
Well, it turns out that the cosocket API is not available everywhere. Specifically, you cannot use cosocket in the set_by_lua*, log_by_lua*, header_filter_by_lua*, body_filter_by_lua*, etc. I don't know the specific technical reasons, but I guess it has something to do with Nginx internals.
What does matter is that we can work around this limitation by using ngx.timer. ngx.timer is basically like settimeout in JavaScript. We can use ngx.timer.at to schedule the invocation of the log_status() function and everyone is happy. In fact, the lua-nginx module documentation specifically suggests using this approach.
To process the messages, I set up a little Python script that does a blocking-pop on the queue, splitting out the URI and upstream status, and chucking them into a more useful data-structure.
I didn't stop there, though...
I figured that since I had gone through the trouble of cobbling this mess together, I might as well make the most of my new queue and push as much data through it as I could possibly want. For example, I was able to calculate how long it took to process a request by calling ngx.now() - ngx.req.start_time(). I added a new timing parameter to the log_status function and am now busily collecting the page timings. If you wanted to, you could also process the value of a cookie, the presence of a querystring argument, a request header...really just about anything.
This is all pretty new territory for me, so it's very possible this code has problems that I'm not aware of. If you spot a problem, or have suggestions about a cleaner / better way of getting at this data, please don't hesitate to leave a comment or contact me.
I'm only looking at the tip of the iceberg, but I'm already so impressed by the power of embedded Lua scripting in Nginx. If this topic interests you as well, you might want to check out the OpenResty project. OpenResty is Nginx + LuaJIT + tons of extensions for things like database access, JSON parsing, Redis client, and more. On the database side, the tarantool project seems really interesting. As I understand it, it's Lua with a NoSQL database baked-in. And based on some recent benchmarks, it's ridiculously fast. Lastly, if you're interested in setting up a cache + thumbnailing web service, you might enjoy this blog post.
Thanks for taking the time to read this post, I hope you found it interesting!