2013-09-12

I have a Hadoop reduce task attempt that will never fail or complete unless I fail/kill it manually.

The problem surfaces when the task tracker node (due to network issues that I am still investigating) loses connectivity with the other task trackers/data nodes, but not with the job tracker.

Basically, the reduce task is not able to fetch the necessary data from the other data nodes because of timeouts, so it blacklists them. So far, so good: the blacklisting is expected and needed. The problem is that it keeps retrying the same blacklisted hosts for hours (honoring what seems to be an exponential back-off algorithm) until I manually kill it. The latest long-running task had been retrying for more than 9 hours.
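
To make the behavior concrete, this is my own sketch of the retry pattern the attempt appears to follow (just an illustration of what I observe, not Hadoop's actual fetcher code; tryFetchMapOutput is a hypothetical stand-in):

    // Illustration of the apparent behavior only, NOT Hadoop's real fetch code.
    public class FetchRetrySketch {
        public static void main(String[] args) throws InterruptedException {
            long delayMs = 1000; // hypothetical initial back-off
            while (true) {
                if (tryFetchMapOutput()) { // stand-in for the map-output fetch
                    break;                 // never succeeds while the host is unreachable
                }
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, 60000); // delay grows, but the loop never gives up
            }
        }

        // Stand-in for the real fetch; in my case it always times out against the blacklisted host.
        private static boolean tryFetchMapOutput() {
            return false;
        }
    }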

I see hundreds of messages like these in the log:

Is there any way or setting to specify that after n retries or seconds the task should fail on its own and get restarted on another task tracker host?
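
For illustration, these are the kinds of knobs I have been looking at through the 0.20 JobConf API (property names taken from the 0.20-era documentation as I understand them; the values below are placeholders, not my actual settings), and none of them seems to cap the total time spent retrying the blacklisted hosts:

    import org.apache.hadoop.mapred.JobConf;

    public class TimeoutSettingsSketch {
        public static void main(String[] args) {
            JobConf conf = new JobConf();

            // Kill a task attempt that reports no progress for 10 minutes (milliseconds).
            conf.setLong("mapred.task.timeout", 600000L);

            // Upper bound (seconds) on the back-off while fetching a single map output,
            // after which a fetch failure should be reported -- if I read the docs right.
            conf.setInt("mapred.reduce.copy.backoff", 300);

            // How many times a reduce task may be re-attempted before the job fails.
            conf.setInt("mapred.reduce.max.attempts", 4);
        }
    }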

These are some of the relevant reduce/timeout parameters I have set in my cluster:

BTW, this job is running on an AWS EMR cluster (Hadoop version: 0.20.205).

Thanks in advance.
