2013-08-09

The issue I'm having is that occasionally a connection for a client-side request and the associated DB connection (to Oracle 10g RAC) are left open on JBoss EAP 4.3.0 CP10. In addition, a JBM JMS queue that handles messages coming back to the server from an external service related to the request accumulates messages that stay in the delivering state for a very long time, possibly forever (DeliveringCount stays abnormally high).

So far no errors have been found anywhere to explain this, and the only workaround is to restart JBoss when the connection pools start to fill up. The "hanging" of request processing is quite rare; my estimate is that it affects less than 0.01% of requests. Curiously, it only happens for one kind of request (getting a live position for an object in the system).

More detailed description

It's a closed-source application, but as far as I understand, things go roughly like this:

the client sends an HTTP request to the server (via an Apache proxy talking AJP to the AS)

the server starts a transaction with its data store and sends a request to a positioning service via JMS

the positioning service responds with the most recent data it has (should happen within 1 sec)

the server responds to the client with the position data (and closes the transaction if not done already earlier)

There seems to be some activity with the application's object persistence DB, although it does not actually store the live position of the object. Whatever the reason for holding a DB connection open during the request-response cycle, both the AJP connection and the DB connection are left open and the transaction is never completed. The AJP thread is waiting for a lock held by another AJP thread, which has already moved on to other work.
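To see which thread actually holds the lock that a stuck thread is waiting for, something along these lines could help. This is only a minimal sketch using the standard java.lang.management API; the class name BlockedThreadReport is made up for illustration, and the code reports on the JVM it runs in, so for JBoss it would have to be run inside the AS (or the same data read over a remote JMX connection).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of JBoss or the application.
public class BlockedThreadReport {

    /** Returns one line per thread that is currently BLOCKED on a monitor. */
    public static List<String> describeBlockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> lines = new ArrayList<String>();
        // maxDepth 0: no stack frames are needed, only lock information
        ThreadInfo[] infos = mx.getThreadInfo(mx.getAllThreadIds(), 0);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            if (info.getThreadState() == Thread.State.BLOCKED) {
                lines.add(info.getThreadName()
                        + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName()
                        + " (id " + info.getLockOwnerId() + ")");
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeBlockedThreads()) {
            System.out.println(line);
        }
    }
}
```

Only threads in the BLOCKED state are listed; a thread parked in Object.wait() shows up as WAITING and usually has no lock owner, so it would need a separate check.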

Below is a sample stack trace extracted with jconsole for a thread that is waiting eternally.

There has been a suspicion that something is dropping connections that carry little traffic, but our hosting provider claims that there is no stateful component between the hosts. The AS runs on a virtualised Windows Server 2008 hosted in a VMware ESX5i cluster, and the proxy is also hosted on the ESX. Oracle RAC (on "real" HW) is physically in the same server room and in the same subnet.

I've inspected the network connections during a quiet time early in the morning. The Tomcat status page on JBoss shows a number of long-lived connections with very little traffic, and their number more or less matches the number of open DB connections. However, the output of netstat -no on the JBoss server showed only one connection between Apache and JBoss. The open DB connections did show up in the netstat output. On the Apache server, netstat -no showed a large number of connections to JBoss in TIME_WAIT and plenty of client connections in FIN_WAIT2.
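Rather than eyeballing the netstat output, the connection states could be tallied with a small helper like the one below. This is a rough sketch with several assumptions: the class name NetstatStateTally is invented, the lines are expected in the Windows netstat -no column layout (protocol, local address, foreign address, state, PID), port 8009 is assumed because it is the default AJP port, and the sample addresses in main are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for summarising saved netstat output.
public class NetstatStateTally {

    /** Counts TCP states for connections whose local address ends with ":" + localPort. */
    public static Map<String, Integer> tally(List<String> netstatLines, int localPort) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        String suffix = ":" + localPort;
        for (String line : netstatLines) {
            String[] f = line.trim().split("\\s+");
            // Assumed columns: proto, local address, foreign address, state, PID
            if (f.length < 4 || !"TCP".equalsIgnoreCase(f[0])) {
                continue; // header lines, UDP entries, blank lines
            }
            if (!f[1].endsWith(suffix)) {
                continue; // not a connection on the port of interest
            }
            Integer old = counts.get(f[3]);
            counts.put(f[3], old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines; real input would be the saved netstat -no output.
        List<String> sample = Arrays.asList(
                "  TCP    10.0.0.5:8009     10.0.0.6:51001    ESTABLISHED    1234",
                "  TCP    10.0.0.5:8009     10.0.0.6:51002    TIME_WAIT      0",
                "  TCP    10.0.0.5:49200    10.0.0.7:1521     ESTABLISHED    1234");
        System.out.println(tally(sample, 8009));
    }
}
```

Comparing the ESTABLISHED count on the AJP port with the number of long-lived connections on the Tomcat status page would show whether JBoss believes in connections that no longer exist at the TCP level.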

I have waded through the JBoss issue tracker without finding a matching case, and I have also spent literally days googling around for anything like this but so far all has been in vain. Then again, this exceeds my expertise on application servers and locking/transactions.

The fact that there is a thread waiting for a lock to be released by another thread makes me wonder if there is a bug in the JBoss transaction manager, but if so it is a rare bug that oddly seems to affect only transactions of one kind. On the other hand, I'm not sure in which direction the TCP connections are opened, and communication with the application supplier's staff hasn't yet produced a solution. I've also experimented with timeout and keepalive settings on both the Apache and the JBoss side of the AJP connections without any change in the situation. It is also a bit confusing that the open DB connections are to the object persistence DB (xa-datasource) and not to the message persistence DB (local-tx-datasource), given that there are those messages in delivering state...

Some hypotheses

One hypothesis I have been playing with is that the connection between the client and the proxy gets dropped (maybe the client just closes down) and JBoss doesn't notice (maybe it has no way of knowing?), so it can never send the response back to the client (hence the JMS messages in delivering state). However, I'd assume that in that case some kind of timeout would terminate the transaction, and I have not come across any docs describing such a case.

The other hypothesis that has surfaced is that the DB connection for JMS persistence gets dropped for some reason. However, I find it terribly odd that it would affect only one kind of message, and even those quite rarely (positioning requests alone arrive at about one per second, plus all the other messaging that goes on).

Edit: A third hypothesis, of course, is that the application has a bug somewhere in how it opens and closes connections/transactions. Since it's closed source I can't review it myself, but I'll ask the supplier to do it.

Any thoughts, pointers, ideas, or topics for further research? Any tricks for digging deeper into the system for more leads?
