Making StarCluster Work with OpenStack

If you are a reader of the Cloud-Enabled Space Weather Platform (CESWP) blog you'll know that one of our latest obsessions has been clusters in the cloud. Since the initial post and our first efforts, we've developed a solid idea of what we want from these clusters. Essentially, we want a way to perform grid computing using cloud resources. Even better would be a method that lets our scientists choose the model or simulation they want to run, supply its inputs, and monitor the progress of their simulation, all without having to move beyond a web interface. While the latter is still a work in progress, we have found a solution to the former using StarCluster.

As its website states, "StarCluster is a utility for creating and managing general purpose computing clusters hosted on Amazon's Elastic Compute Cloud (EC2)." It offers the ability to easily launch clusters of VMs capable of performing grid computing using OpenMPI and Sun Grid Engine. Since StarCluster speaks the EC2 API natively, we expected that we would only have to point it at our OpenStack cloud and everything would be good to go. However, as we discovered, this was not the case.
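Under the hood, StarCluster uses boto to talk to EC2, so "pointing it at our OpenStack cloud" really means handing boto nova's EC2 endpoint instead of Amazon's. Here is a minimal sketch of that connection; the hostname is a placeholder for our cloud controller, and the port (8773) and path (/services/Cloud) are the usual nova defaults rather than anything taken from the StarCluster source, so adjust them to match your deployment.

# Illustrative only: connect boto to an OpenStack nova EC2 endpoint instead of AWS.
# 'cloud.example.org', port 8773 and path '/services/Cloud' are assumptions based
# on nova's defaults; substitute whatever your deployment actually exposes.
from boto.ec2.connection import EC2Connection
from boto.ec2.regioninfo import RegionInfo

conn = EC2Connection(
    aws_access_key_id='EC2_ACCESS_KEY',
    aws_secret_access_key='EC2_SECRET_KEY',
    is_secure=False,
    region=RegionInfo(name='nova', endpoint='cloud.example.org'),
    port=8773,
    path='/services/Cloud',
)

# Quick sanity check that the endpoint answers EC2 API calls.
print conn.get_all_instances()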

The first problem we encountered was that StarCluster would not let us start an x86_64 instance (all CESWP images are x86_64) on some of our machine types, instead telling us that the architecture was wrong. Upon further inspection, we found that StarCluster defines its expected machine types and their architectures, which are based on Amazon's EC2 configuration. Since CESWP is purely 64-bit, we decided to simply disable the architecture checking by commenting out the following block of code in cluster.py:

(864)   """
        try:
            self.__check_platform(node_image_id, node_instance_type)

        ...

        elif master_instance_type and not master_image_id:
            try:
                self.__check_platform(node_image_id, master_instance_type)
            except exception.ClusterValidationError, e:
                raise exception.ClusterValidationError(
                    'Incompatible node_image_id and master_instance_type\n' + e.msg
                )
(892)   """
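For context, the validation we just disabled is driven by a static table in StarCluster that maps each EC2 instance type to the architectures Amazon offers it in. The sketch below only shows the shape of that check; the type names and architectures are illustrative examples, not entries copied from StarCluster's static.py.

# Illustrative sketch of the instance-type/architecture table that
# __check_platform()-style validation consults; entries are examples only.
INSTANCE_TYPES = {
    'm1.small': ['i386'],      # on EC2 this type is 32-bit only
    'c1.medium': ['i386'],
    'm1.large': ['x86_64'],
}

def check_platform(image_arch, instance_type):
    # Reject the combination if the image architecture is not one the
    # instance type is listed as supporting.
    if image_arch not in INSTANCE_TYPES.get(instance_type, []):
        raise ValueError('image architecture %s not valid for %s'
                         % (image_arch, instance_type))

Since every CESWP machine type is x86_64 no matter what such a table claims, bypassing the check is the simplest fix for us.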

This solved the first issue. Before I move on to our next problem, some explanation of how OpenStack handles device naming is probably in order. Currently, OpenStack is quite strict about device naming when attaching volumes. In fact, it will completely ignore any specified name and attach the volume to the next available spot in the /dev/vd* series. StarCluster, on the other hand, tries to start at /dev/sdz and works backwards from there. We have no ephemeral storage configured for any of our machine types, so the only device name that will always be in use is /dev/vda, for the root file system. We therefore modified StarCluster to start attaching volumes at /dev/vdb so they would match up with the locations where OpenStack was actually attaching them. The following sed commands make the necessary changes:

sed -i 's/\/sd/\/vd/' *.py
sed -i 's/\/vdz/\/vdb/' volume.py
sed -i 's/\(lowercase\)/\1[::-2]/' cluster.py
sed -i 's/\(lowercase\)\[::-1\]/\1[1:]/' volume.py
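The net effect of those substitutions is that StarCluster walks the virtio device names forward from /dev/vdb instead of backwards from /dev/sdz, which lines its bookkeeping up with where OpenStack actually attaches the volumes. Roughly speaking, the attachment order we are after looks like the sketch below (this is an illustration of the intent, not the patched StarCluster code itself):

# Rough sketch of the device-name order we want after the patches: skip 'a'
# (the root disk lives on /dev/vda) and hand out /dev/vdb, /dev/vdc, ...
import string

def candidate_devices(in_use=('/dev/vda',)):
    for letter in string.lowercase[1:]:          # 'b', 'c', 'd', ...
        dev = '/dev/vd%s' % letter
        if dev not in in_use:
            yield dev

print list(candidate_devices())[:3]              # ['/dev/vdb', '/dev/vdc', '/dev/vdd']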

Continuing on with volumes, we found that StarCluster would not let us mount a device without a specified partition. For example, once StarCluster had attached a device to /dev/vdb (after our modifications above), attempts to mount it would default to /dev/vdb1. This is a problem, as most of our users simply put a file system on the device without partitioning it first. Again, a few more sed commands fix the problem:

sed -i 's/\[9,10\]/[8,10]/' utils.py
sed -i '0,/\([1-9]\)/s//\1?/' utils.py
sed -i "s/\(vol.get('partition'\),1)/\1) or ''/" cluster.py
sed -i "s/\('PARTITION': (int, False, \)1)/\1None)/" static.py
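Taken together, these edits stop StarCluster from assuming a trailing partition number: device-name validation accepts the shorter unpartitioned form, the PARTITION setting defaults to None instead of 1, and the mount path is built from the bare device when no partition is configured. The before/after below is an illustrative summary of that last change, not the actual cluster.py code:

# Illustrative before/after of how the device to mount is built.
def device_to_mount(device, vol):
    # old behaviour: always append a partition number, defaulting to 1
    #   return device + str(vol.get('partition', 1))     # -> '/dev/vdb1'
    # patched behaviour: only append a partition if one was configured
    return device + str(vol.get('partition') or '')

print device_to_mount('/dev/vdb', {})                    # /dev/vdb
print device_to_mount('/dev/vdb', {'partition': 1})      # /dev/vdb1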

Once these issues were dealt with, we had StarCluster up and running and spawning virtual machines without a care in the world. However, we hit our next roadblock when we decided to test out the Sun Grid Engine. It simply did not work: the SGE processes on each node seemed incapable of talking to each other.

Having a look at each node's hosts file quickly told us why. The host name recorded for each node was its IP address, where a properly configured OpenStack image should have reported the node's instance ID. We soon discovered that OpenStack returns the IP address of an instance when asked for its host name. This also meant that SGE was being given IP addresses instead of the expected host names for all of its nodes. The nodes could still connect to each other, but when SGE compared the host name in its configuration file to the actual host name of a node, they differed and SGE would rightly drop the connection.

Since we depend on our instances having their instance ID as the host name, we modified StarCluster to accept a node's instance ID as the host name. Whether this works depends on the image being used for clustering having its network configured properly:

sed -i "0,/self.private_dns_name/s//self.id+'.novalocal'/" node.py
sed -i "0,/self.private_dns_name_short/s//self.id/" node.py
sed -i "s/\(admin_list = admin_list\).\+/\1 + ' ' + node.network_names['INTERNAL_NAME']/" clustersetup.py
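The upshot is that a node's internal name is now derived from its instance ID (with the '.novalocal' suffix nova appends to instance host names in our setup) rather than from the EC2-style private DNS name, and that name is what gets appended to SGE's admin host list. The snippet below is a simplified picture of the behaviour we are after, not StarCluster's actual Node class:

# Simplified picture of the host-name change (not StarCluster's actual Node class).
class Node(object):
    def __init__(self, instance_id, private_ip):
        self.id = instance_id                    # e.g. 'i-0000002a'
        self.private_ip_address = private_ip     # what OpenStack used to hand back

    @property
    def network_names(self):
        # Before the patch this name came from private_dns_name, which OpenStack
        # fills with the instance's IP address; now it comes from the instance ID.
        return {'INTERNAL_NAME': self.id + '.novalocal'}

node = Node('i-0000002a', '10.0.0.12')
print node.network_names['INTERNAL_NAME']        # i-0000002a.novalocal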

Now, at least in initial tests, we have StarCluster chugging along quite nicely. As I mentioned above, our intent is to eventually integrate this utility into the CESWP project. It should serve as a platform for running the various simulations and models that our users need, in situations where they don't need, or even want, the on-demand power of the cloud.