DropAndCompute

How to make it trivial and uniform for users to access computational grid resources (such as Condor): a gentle ramp for scientists to access e-science gateways

Ian Cottam, EPS Faculty IS, The University of Manchester, v0.8, Feb. 2010

Introduction

Different computational grid systems (Condor, Sun Grid Engine, Globus, etc.) require different system software and mechanisms to access them. Nearly always the command line is involved, and sometimes one has to log on remotely to a foreign site – perhaps running an unfamiliar operating system – to use the grid resource; the latter case also often requires the user to transfer files manually back and forth (via sftp or similar).

Herein we propose a ‘mash-up’ of Dropbox and computational grid resource(s) to provide a gentle ramp for scientists to access e-science gateways. Some attributes of this approach are:

  • A simple and uniform drag-and-drop GUI to all grid resources/pools.
  • No use of terminal windows or command lines.
  • No need to log in to remote hosts or install complicated grid-enabling software locally.
  • No need for the user to have an account on the remote resources (instead they are ‘accounted’ by having a shared folder allocated). Of course, nothing stops the users from having accounts should that be preferred.
  • No need for complicated Virtual Private Networks, IP tunnelling, connection brokers, or similar, to access grid resources on, for example, private subnets (provided at least one node is on the public Internet, which is the norm).
  • Pop-ups notify users of important events (basically, log and output files being created when a job has been accepted, and when the generated result files arrive).
  • Somewhat increased security as the user only has (indirect) access to a small subset of grid resource commands.

Setting up the approach

  1. The owner of a computational grid resource rents a 50GB or 100GB account from Dropbox, linking it to a head node or any submission node of the grid/pool. Note that other sizes are becoming available at different prices, and the free 2GB account can be used for testing and upgraded later. See [1] for details.
  2. The owner creates a sub-folder for each user, or potential user, of the grid resource within the Dropbox folder. Any sensible naming convention can be used, such as Condor-MIB-Ian-Cottam, Condor-MIB-Joe-Bloggs, etc.
  3. When a user is granted access to the grid resource, the grid owner, having carried out step 2, uses Dropbox to share the appropriate sub-folder with the email address of the user.
  4. The grid owner runs the Bash script below, giving as argument the name of the shared sub-folder. (Clearly, this is repeated for each user; an illustrative invocation is sketched after this list.) The script is specific to Condor, but should be straightforward to amend for other grid systems.
  5. The user installs Dropbox with a free 2GB account. This is platform independent, typically takes about two minutes, and is the only software the user needs and sees for all grid resources.
  6. The user accepts the offer, via Dropbox, to share the folder from the grid resource owner. This creates a folder of the same name in the user's Dropbox folder. It can be convenient to place a link/alias to this folder on the desktop. (See the screen capture movie linked below for this in action.)
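
To make steps 2 and 4 concrete, here is an illustrative sketch of what the grid owner might type on the submission node. The folder name Condor-MIB-Joe-Bloggs, the Dropbox path, and the use of nohup are assumptions for illustration, not requirements of the approach:

# step 2: create a per-user sub-folder inside the Dropbox folder
mkdir ~/Dropbox/Condor-MIB-Joe-Bloggs

# step 4: set the glue script watching that share (repeated per user)
nohup ./trial.sh ~/Dropbox/Condor-MIB-Joe-Bloggs >trial-joe-bloggs.log 2>&1 &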

This method automatically limits each user to 2GB of file space on the grid resource, which, in many cases, is desirable to stop users from filling file systems. Dropbox obviates the need for other file space quota mechanisms. In the (rare?) case that a user needs, and is granted, more space, he or she would have to purchase a larger account from Dropbox.

Using the approach

This is just a summary; the reader is directed to the screen capture movie referenced below.

  • The user prepares a folder containing all the files, and sub-folders, necessary to submit to the grid resource. (In the trial, using Condor, two restrictions apply: the folder must be called submit, and within it the file describing the submission must be called submit.txt. Further, submit.txt must always include a Requirements line specifying the desired Arch and OpSys pair, because the actual submit machine's default cannot be relied upon. An illustrative submit.txt is given after this list.)
  • The user compresses the folder. This is usually available as a right-click action: for example, it is built in to Mac OS X and Ubuntu Linux, and 7-Zip (or similar) can be used in this way on Windows.
  • The user drags the submit.zip file to the shared folder (or, for convenience, to a link to the shared folder kept on the user's desktop). Dropbox then automatically synchronises it with the folder on the grid node.
  • The Bash script on the computational grid node spots it arriving, uncompresses it, does some housekeeping, and submits the job.
  • Dropbox uses its notification system to tell the user the job has been submitted, or reports any errors in a file created remotely and synchronised back to the user.
  • The job runs on the grid.
  • When the job finishes, or otherwise creates output files, such files and sub-folders are automatically synchronised back to the user by Dropbox.
  • Dropbox notifies the user that result files have arrived.
  • The user drags the submit folder (now containing result files too) out of the Dropbox share to a location of their choice.
  • In the event the job does not run, the user can make use of two files that the Bash script creates: the first is called <jobnumber>.debug and the second <jobnumber>.kill. When the user drags the .debug file to the top level of the share, the remote Bash script gathers as much useful debugging information as it can and places it in a file in the shared folder, which of course is then synchronised back to the user, who is notified of its arrival. The .kill file is used similarly, but simply removes the job from the grid's queue.
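
As an illustration of the restrictions above, here is a minimal sketch of a submit.txt. The executable and file names (myjob.sh and friends) are invented for this example, and a 64-bit Linux target is assumed; any Arch/OpSys pair valid for the pool would do:

# submit.txt -- illustrative Condor submit description
universe                = vanilla
executable              = myjob.sh
output                  = myjob.out
error                   = myjob.err
log                     = myjob.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
requirements            = (Arch == "X86_64") && (OpSys == "LINUX")
queue

And should no right-click compress action be available, the equivalent from a command line is simply: zip -r submit.zip submit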

Movie of a trial site in action

To see the idea in action, a screen capture movie has been made of how to access Condor sites at The University of Manchester, and the MIB Condor pool in particular. The QuickTime movie can be downloaded here. It was produced on an Apple MacBook, but VLC or QuickTime for Windows, for example, should be able to play it on a Windows PC.

Further work

The short list below identifies what remains to be done to make this approach a production-quality system:

  • as jobs are all run by the same account on the grid, some special accounting code may be desirable (as might a tweak to fair-share scheduling policies)
  • for the same reason, users could guess other users' job numbers and issue .kill requests against their jobs
  • there is no equivalent to, e.g., condor_status – this is because most grids/pools have a web page for such (and it would be straightforward to add if required)
  • similar scripts need to be written for other, non-Condor-based, grids
  • Dropbox currently hosts all files that pass through it on encrypted Amazon S3 servers in the USA, which may be a concern to some (in the future, European data may be housed at Amazon’s Dublin data centre).

For the first two cases, a variant on the setup should fix them: create a local account and a Dropbox account for each user. So instead of one Dropbox process and n DropAndCompute ones, you have n of both.

For the final case, and if you don't want the Americans to see your experimental data or results, here is a suggestion on how you might modify the approach: use EncFS (Google for it). An encrypted file system is kept within the shared Dropbox folder. EncFS presents another folder, outside of the Dropbox area, where you can see the unencrypted version of your files. As the unencrypted files never pass through the Dropbox servers, one is safe from prying eyes. Unfortunately, EncFS only works on Macs and Linux boxes (not Windows); similar, alternative solutions for user-side encryption may be possible for Windows users.

You also need to start EncFS from the command line (and not with, e.g., MacFusion) so that you can pass it the --public flag, which is needed to allow other users to write to the folder. For example:

sudo encfs --public ~ian/Desktop/MCM/Dropbox/.secure ~ian/Desktop/SecureDrop
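
When finished, the unencrypted view can be unmounted in the usual FUSE way; a sketch, assuming the paths from the example above:

sudo umount ~ian/Desktop/SecureDrop        # Mac OS X
sudo fusermount -u ~ian/Desktop/SecureDrop # Linux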

Bash script glue code

April 2011: please note that this code is now somewhat dated. Please contact me for the latest version. -Ian

trial.sh

#!/bin/bash
# Runs Condor jobs dropped into $1
# They must be called submit.zip for this trial and have a submit.txt within

if   test $# != 1; then echo "trial: supply shared dropbox folder as single arg"; exit 1; fi
if ! test -d $1;   then echo "trial: arg must be a folder"; exit 1; fi
CONDORFOLDER=$1

CONDORHOME=/Users/condor/condor/bin

controlc()
{
  echo "Stopping $TRIALKILL"
  kill $TRIALKILL
  exit 0
}


# first, run the script that monitors for debug and kill requests, recording its PID
./trial-kill.sh $CONDORFOLDER & TRIALKILL=$!
echo "Also starting trial-kill as process: $TRIALKILL"
trap controlc sigint

# now loop looking for jobs to run
while true
do

 cd $CONDORFOLDER

 while ! test -f submit.zip; do sleep 30; done
 # zipped folder of files has started to arrive
 while test x != "x`lsof submit.zip`"; do sleep 10; done
 # no one else has zip folder open so it should have completely arrived

 SUFFIX=`date | sed 's/ //g;s/://g'`
 mkdir /tmp/submit$$
 unzip -d/tmp/submit$$ -qq -o submit.zip
 rm -f submit.zip
 # checks to see a submit folder was expanded
 if ! test -d /tmp/submit$$/submit; then
  rm -f /tmp/submit$$/submit # in case it was eg a text file
  mkdir /tmp/submit$$/submit
  echo 'There was no folder called  submit  in the expanded zip' > /tmp/submit$$/submit/NO_SUBMIT_FOLDER_FOUND.txt
 fi
 cp -R  /tmp/submit$$/submit submit$SUFFIX
 rm -rf /tmp/submit$$

 chmod -R a+rwx  "submit$SUFFIX"
 cd "submit$SUFFIX"
 rm -f NO-SUBMIT-FILE.txt SUBMIT-FAILED-WITH-ERRORS.txt SUBMIT-WARNING.txt
 if ! test -f submit.txt; then
  echo 'No submit.txt file found  - giving up' > NO-SUBMIT-FILE.txt
 else
  grep -w -i -q Arch submit.txt
  ARCH=$?
  grep -w -i -q OpSys submit.txt
  OPSYS=$?
  if test $ARCH = 1 -o $OPSYS = 1; then
    echo "Looks like no Arch / OpSys specified in Requirements" >SUBMIT-WARNING.txt
  fi
  # convert a shell script if used  to UNIX eol format in case sent from Windows
  SCRIPT=`tr -d '\r' <submit.txt |grep -i 'executable *=' | sed 's/.*= *//'`
  if test -f $SCRIPT; then
   # only convert files that really are shell scripts
   if file $SCRIPT | grep -q 'shell script'; then
    tr -d '\r' < $SCRIPT >/tmp/script$$
    mv /tmp/script$$ $SCRIPT
    chmod a+x $SCRIPT
   fi
  fi 2>/dev/null
  
  # submit the job; condor_submit reports e.g. "1 job(s) submitted to cluster 42."
  # and awk appends debug/kill to field 6 (the cluster id plus its trailing dot)
  # to create the <jobnumber>.debug and .kill file names used by trial-kill.sh
  $CONDORHOME/condor_submit submit.txt 2>/tmp/condor$$ |
      tee /tmp/condor2$$ |
             awk '$6 != "" {print $6 "debug"; print $6 "kill";}' >/tmp/condor1$$
  if test -s /tmp/condor$$
  then 
    mv /tmp/condor$$ SUBMIT-FAILED-WITH-ERRORS.txt
  fi
  cat /tmp/condor2$$
  # create files to request debug info for job or to kill it
  if test -s /tmp/condor1$$; then touch `cat /tmp/condor1$$`; fi
  rm -f /tmp/condor$$ /tmp/condor1$$ /tmp/condor2$$
 fi

done

exit 0

trial-kill.sh


#!/bin/bash
CONDORHOME=/Users/condor/condor/bin
CONDORFOLDER=$1
cd $CONDORFOLDER

while true
do
 while test `ls *.debug *.kill 2>/dev/null | wc -l` = "0"; do sleep 30; done
 for i in `ls *.debug *.kill`
 do
 if test `basename $i .kill`.kill = $i; then
  JOB=`basename $i .kill`
  echo 'Attempting to remove job: ' $JOB
  $CONDORHOME/condor_rm $JOB
  rm -f $JOB.kill
 else
  # must be debug request
  JOB=`basename $i .debug`
  echo 'Attempting to get debug info for job: ' $JOB
  $CONDORHOME/condor_q $JOB > /tmp/condor$$
  $CONDORHOME/condor_q  -better-analyze $JOB >> /tmp/condor$$
  mv $JOB.debug $JOB.debugged
  mv /tmp/condor$$ QUEUE-DEBUGINFO-$JOB.txt
 fi
 done 2>/dev/null
done

exit 0
