Using Cloudera Manager 3.x on RHEL/CentOS 5.x Linux with Python 2.6

If you’ve installed Cloudera Manager on your RHEL/CentOS 5.x Hadoop cluster and you’re using a custom Python 2.6 install because you need Yelp’s mrjob, as described in my previous article, then you’ll need to make some changes so that Cloudera Manager starts the TaskTracker with your custom Python instead of the Python embedded in the Cloudera Manager package.

Even if you’re just using Python streaming with MapReduce and need some custom modules, you’ll run into problems, because Cloudera Manager starts the TaskTrackers, which inherit the Python environment embedded in the Cloudera Manager package. In that case, though, you only need to install your custom modules into the embedded environment, using the easy_install that comes with it:

/usr/lib64/cmf/agent/build/env/bin/easy_install

Below are instructions for having Cloudera Manager use your custom Python 2.6, so that when it starts the TaskTrackers, they’ll inherit your Python’s environment.

The first indication that we had a problem was a MapReduce job failing because the math module could not be loaded. It didn’t even get as far as importing the MRJob module.

File "/usr/lib64/python2.6/json/encoder.py", line 5, in <module>
import math
ImportError: /usr/lib64/cmf/agent/build/env/lib64/python2.4/lib-dynload/mathmodule.so: undefined symbol: Py_InitModule4

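The traceback shows Python 2.6 code trying to load mathmodule.so from the embedded Python 2.4 directory. That’s when I found that the TaskTracker was inheriting the Python environment from Cloudera Manager.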
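To confirm which Python a task is really running, a small diagnostic along these lines can help. This is a minimal sketch, not part of Cloudera Manager or mrjob; the function name describe_python_environment is made up for illustration. It reports the interpreter path, version, prefix and module search path, and checks whether the math module loads.

```python
from __future__ import print_function

import sys


def describe_python_environment():
    """Collect the facts that reveal which Python a task is really running."""
    info = {
        "executable": sys.executable,
        "version": "%d.%d.%d" % tuple(sys.version_info[:3]),
        "prefix": sys.prefix,
        "path": list(sys.path),
    }
    try:
        import math  # the C extension module that failed in the traceback
        info["math_ok"] = True
    except ImportError as exc:
        info["math_ok"] = False
        info["math_error"] = str(exc)
    return info


if __name__ == "__main__":
    env = describe_python_environment()
    # Write to stderr so the report does not mix with a job's normal output.
    for key in ("executable", "version", "prefix", "math_ok"):
        print("%s: %s" % (key, env[key]), file=sys.stderr)
    for entry in env["path"]:
        print("sys.path: %s" % entry, file=sys.stderr)
```

If the reported search path contains entries under /usr/lib64/cmf/agent/build/env, the process has inherited the embedded environment.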

The first step is to make sure the openssl-devel RPM is installed, so that you can compile the M2Crypto Python module:

yum install openssl-devel

Next, install these Python modules with easy_install or pip:

argparse
avro
psutil (must be the same version that Cloudera Manager includes, because they backported partition.opts from 0.5.1 as partition.options)
supervisor (Must be version 3.0a12)
CherryPy
Mako

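Note: You might have problems installing supervisor as a zipped binary egg, because supervisor.options opens its versions.txt file by plain file path rather than through the package resource utilities, and that fails when the file sits inside a zip archive. Installing it unzipped, for example with easy_install’s --always-unzip option, avoids this.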

If you don’t know how to install a specific version with easy_install, you can do it like this:

easy_install psutil==0.4.1
easy_install supervisor==3.0a12

Next, you must install M2Crypto from source, because easy_install and pip won’t compile it correctly on RHEL/CentOS. When you download the source and untar it, you’ll find a script inside the tarball called fedora_setup.sh. Run it like this:

./fedora_setup.sh build
./fedora_setup.sh install
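Before going further, it may be worth checking that your custom Python can import everything installed so far. The following is a minimal sketch; find_missing_modules is a made-up helper name, and the import names are my assumption of how each package listed above is imported (CherryPy as cherrypy, Mako as mako). Run it with your custom interpreter, not the embedded one.

```python
from __future__ import print_function

# Assumed import names for the packages installed in the steps above.
REQUIRED_MODULES = [
    "argparse", "avro", "psutil", "supervisor", "cherrypy", "mako", "M2Crypto",
]


def find_missing_modules(names):
    """Return (name, error message) pairs for modules that fail to import."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError as exc:
            missing.append((name, str(exc)))
    return missing


if __name__ == "__main__":
    problems = find_missing_modules(REQUIRED_MODULES)
    for name, error in problems:
        print("MISSING %s: %s" % (name, error))
    if not problems:
        print("all required modules import cleanly")
```

This only checks that the modules import. It does not verify the pinned psutil and supervisor versions.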

Next, you must make the ClusterStats modules available. The easiest way to do this is by adding a .pth file to your site-packages directory. I created a file at /usr/lib/python2.6/site-packages/scm-agent.pth that contains these lines:

/usr/lib64/cmf/agent/src/
/usr/lib64/cmf/agent/build/env/lib/python2.4/site-packages/ClusterStatsClient-v2.0.1_54_gadc386e-py2.4.egg
/usr/lib64/cmf/agent/build/env/lib/python2.4/site-packages/ClusterStatsCommonTest-0.1-py2.4.egg
/usr/lib64/cmf/agent/build/env/lib/python2.4/site-packages/ClusterStatsCommon-0.1-py2.4.egg
/usr/lib64/cmf/agent/build/env/lib/python2.4/site-packages/ClusterStatsLogStreaming-UNKNOWN-py2.4.egg
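The egg file names carry version strings that may differ between Cloudera Manager releases, so you might prefer to generate the lines rather than type them. Here is a minimal sketch, assuming the directory layout shown above; build_pth_lines and its ClusterStats*.egg glob pattern are my own illustration, not something shipped with the agent.

```python
from __future__ import print_function

import glob
import os


def build_pth_lines(agent_root, embedded_python="python2.4"):
    """Return the lines for a .pth file exposing the agent's own modules."""
    site_packages = os.path.join(
        agent_root, "build", "env", "lib", embedded_python, "site-packages")
    # Assumption: every egg needed here starts with "ClusterStats".
    eggs = sorted(glob.glob(os.path.join(site_packages, "ClusterStats*.egg")))
    # The agent source directory comes first, followed by each egg found.
    return [os.path.join(agent_root, "src") + "/"] + eggs


if __name__ == "__main__":
    for line in build_pth_lines("/usr/lib64/cmf/agent"):
        print(line)
```

Redirecting the output into scm-agent.pth in your custom Python’s site-packages directory should give a file equivalent to the one above, though the eggs will be listed in sorted order.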

Finally, change the path to the Python binary from:

/usr/lib64/cmf/agent/build/env/bin/python

to your own Python:

/usr/local/bin/python

in the following three files:

/usr/sbin/cmf-agent
/usr/lib64/cmf/agent/build/env/bin/supervisord
/usr/lib64/cmf/agent/build/env/bin/supervisorctl
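If you have many nodes, the edit can be scripted. This is a minimal sketch, assuming the old interpreter path appears literally in each file; switch_interpreter is a made-up helper. It keeps a .orig backup next to each file it changes, and it must run as a user that can write to those files.

```python
from __future__ import print_function

import os
import shutil

OLD_PYTHON = "/usr/lib64/cmf/agent/build/env/bin/python"
NEW_PYTHON = "/usr/local/bin/python"
AGENT_FILES = [
    "/usr/sbin/cmf-agent",
    "/usr/lib64/cmf/agent/build/env/bin/supervisord",
    "/usr/lib64/cmf/agent/build/env/bin/supervisorctl",
]


def switch_interpreter(path, old=OLD_PYTHON, new=NEW_PYTHON):
    """Replace every occurrence of old with new in path; return the count."""
    with open(path) as handle:
        text = handle.read()
    count = text.count(old)
    if count:
        shutil.copy2(path, path + ".orig")  # keep a backup next to the file
        # Rewriting in place keeps the file's existing permissions.
        with open(path, "w") as handle:
            handle.write(text.replace(old, new))
    return count


if __name__ == "__main__":
    for agent_file in AGENT_FILES:
        if os.path.isfile(agent_file):
            changed = switch_interpreter(agent_file)
            print("%s: %d replacement(s)" % (agent_file, changed))
        else:
            print("%s: not found, skipped" % agent_file)
```

A file that reports zero replacements either was already updated or refers to Python in some other way, so it is worth opening by hand.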

If you’re running Cloudera Manager already, you’ll need to restart the Cloudera Manager agent on each node with:

service cloudera-scm-agent restart

The only thing left to do now is to restart the TaskTrackers on your nodes, using Cloudera Manager. Your MapReduce jobs will now be using your custom Python environment.