2017-03-03

I could use some help / suggestions for getting Cloudera Manager installed, please - I am having weird problems that manifest themselves in odd ways, including comms problems and weird file system permission errors. I have asked Cloudera for help but had absolutely zero response - so much for their support and customer care. (If nothing else, I am now questioning the wisdom of relying on Cloudera - if they behave like this, what is their support like generally? Dare we trust support of something that will be mission-critical to them? Hmm.)

The cluster that I am trying to set up consists of four HP machines that I plan to run as a master and three slaves. All have identical hardware configs and are running the same software at the same level - the OS is CentOS 7.3. I have root access to all machines and (despite this being a security risk) for now, their internal software firewalls are disabled and all have the same root password. Anyhow, two days ago I (downloaded and) installed Cloudera Manager on one of the machines. This went OK and I could log on to it via a browser on port 7180. So, I tried to install the manager agent (etc) software on the other three machines, only to get the ubiquitous "failed to receive heartbeat from agent" error message. If I look in the log files on these machines, the error traceback starts in a file called 'connection.py', where there is a message "connection failed" without any further information. Three machines, three identical errors.

So, I followed the advice on the failure message that came up on the manager 'console' screen and checked that ports were free/open as needed and that firewalls etc were not interfering. All was/is good there. Physically, the machines are all connected to ports on a single network switch so there are no (external, hardware) firewalls, bridges, or whatever between them. If I run an external port scanner, I can see the ports open - 7180 and 7182 on the cluster manager, and 9000/9001 on the other machines (as well as the usual 22 for SSH, etc). If I sign on to any of these machines, I can 'see' all the others and using tools like netcat, telnet, or curl I can access and get a response from all the ports that should be open - including 7182 on the master - so indeed, they are open and listening. I can run SSH/Putty/etc to the machines without any problem, so to forestall the obvious replies/questions, there seem to be no firewall or network problems 'in the way'.
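For anyone who wants to repeat the port checks described above without netcat or telnet, here is a minimal Python sketch of the same test. The host names in `CHECKS` are hypothetical placeholders (not the real machine names), and the port lists are simply the ones mentioned above. It only confirms that a TCP connection can be made; it says nothing about whether the agent heartbeat itself works.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical host names; replace with the real master and slave names.
CHECKS = {
    "master": [22, 7180, 7182],
    "slave1": [22, 9000, 9001],
    "slave2": [22, 9000, 9001],
    "slave3": [22, 9000, 9001],
}


def run_checks(checks):
    """Map each (host, port) pair to True (reachable) or False."""
    return {
        (host, port): port_is_open(host, port)
        for host, ports in checks.items()
        for port in ports
    }
```

Running `run_checks(CHECKS)` from each of the four machines in turn gives the same picture as the manual netcat/telnet tests, one result per host and port.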

So I hit the internet and found multiple bits of advice about fixing this problem, none of which have worked. I tried installing the other three machines several times, before giving up and going home. Yesterday when I came back into the office, I picked up where I had left off and tried to reinstall the agent software. Here's where it starts to get weird - this time, despite nothing having changed overnight, one installed perfectly first time around, but the other two still failed. Multiple times, with the same errors (and multiple retries did not work). So, I added the one machine that had installed to a cluster with the master, and could 'see' it - get readings from the admin page for CPU use and so forth. Hurrah! Half way there....

This morning, however, there are new problems. Again, nothing has changed overnight, but the master can no longer get 'readings' from the agent on the other machine. If I go to the 'show machine details' page on the master, it is full of messages about such things as file system permission errors (weird ones - "permissions on file xyz are 655, expecting 651" for example - which are both odd setups in their own right - I'd expect either 755 or 644 in the usual way) and messages about things like 'root access errors'. I am using the Cloudera Manager "multi-user" setup, so I don't expect to encounter filesystem permission issues, and if something is running as root (even through sudo) it will have access to everything on the system.
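To show why those two modes look odd, here is a small Python sketch that decodes an octal mode into the familiar rwx form and reads the mode of a file on disk. The function names are my own, nothing here comes from Cloudera, and 'xyz' in the error message remains a placeholder for the real path.

```python
import os
import stat


def describe_mode(octal_text):
    """Turn an octal permission string such as '655' into rwx notation."""
    mode = int(octal_text, 8)
    # stat.filemode expects a full st_mode, so mark it as a regular file
    # and drop the leading file-type character from the result.
    return stat.filemode(stat.S_IFREG | mode)[1:]


def actual_mode(path):
    """Return the permission bits of an existing file as an octal string."""
    return format(stat.S_IMODE(os.stat(path).st_mode), "03o")


for text in ("655", "651", "755", "644"):
    print(text, describe_mode(text))
# 655 rw-r-xr-x
# 651 rw-r-x--x
# 755 rwxr-xr-x
# 644 rw-r--r--
```

Both 655 and 651 give group and others execute permission while the owner has none, which is why neither looks like a normal setting.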

Clearly, therefore, the software is trying to do something very odd, for unknown reasons. It is something deeper / odder than just comms issues, too. Has anybody ever seen this sort of behaviour before? Anybody have any idea what might be the cause? Oh, and I've rebooted the machines and also run the diagnostics in case of a hardware problem, but there is nothing.

I am hoping that somebody here can shed some light on this - Cloudera themselves are no help - the company's UK office have simply failed to respond to all requests - for help, or even for pricing and licence (etc.) information. The impression that I get is that even though I am working with a global player in the FS market, we are too small for them to be bothered with. Or maybe it is because we are outside the US. In any event, I have now wasted three full days trying to resolve this without making any real progress. So, I am going to try once more before I abandon Cloudera entirely - although it is supposed to be the 'easiest to manage' Hadoop distro, I've not even gotten Cloudera Manager fully installed yet, never mind CDH. Given this, I might as well dump it and just install vanilla Apache Hadoop. Or maybe use Hortonworks or MapR instead. If anybody has any experience here too that they are willing to share, I am all ears...

Regards, and thanks in advance,
Rick
