Getopt::Long Lesson (CBMG688P Best Practices laboratory - Stoltzfus)

In this lesson, you will learn to use a module called Getopt::Long to implement a POSIX-compliant command-line interface.

background

see slide presentation

exercise

  1. First, we will add the file esearch_wrapper.pl to our CVS repository:
    unix_prompt$ cd ~/My_CBMG688P/my_cvs/Perl
    unix_prompt$ cp ../../Lab14Link/Materials/esearch_wrapper.pl .
    unix_prompt$ chmod +x esearch_wrapper.pl
    unix_prompt$ cvs add esearch_wrapper.pl 
    unix_prompt$ cvs commit -m "exercise to learn Getopt::Long"
    
  2. Now, let's look at the file, which is a wrapper for NCBI's esearch utility. It uses wget to execute the web query, though I have commented out that line (we'll uncomment it in a few minutes). Here it is:
    #!/usr/local/bin/perl -w 
    #
    # $Id$
    #
    use strict; 
    
    # process command-line arguments 
    #
    my ( $email, $db, $search_terms ); 
    $email = shift;  
    $db = shift; 
    $search_terms = shift; 
    
    # construct the URL (note: $0 is the current command)
    #
    my $base_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"; 
    my $query_url = sprintf( "%s?tool=%s&email=%s&db=%s&term=%s", $base_url, $0, $email, $db, $search_terms ); 
    
    # execute the query 
    #
    # `wget -O esearch_query_result.xml "$query_url"`;
    
    printf STDERR "The results of the query ($query_url) are in esearch_query_results.xml\n";
    
    # and exit
    #
    exit; 
    
  3. Before going any further, let's fix an obvious problem with this code: the output file name appears twice, once as "esearch_query_result.xml" and once as "esearch_query_results.xml". Will this discrepancy confuse users? Even without the spelling error, if we wanted to change the file name later, would we remember to change every instance? Repeating string literals or other code in a file is not a good practice. The best practice is to give the repeated thing a name, and then refer to it by that name. In this case, we store the file name in a variable, then use the variable twice. In a few minutes, we will add a command-line option so that the user can set this variable from the command line, which isn't possible until we fix this. Improvements to software often depend on modularity: it's harder to debug, test, document, and re-use software that isn't modular.
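    A minimal sketch of this fix, assuming we settle on the spelling used in the wget line. The variable names $outfile and $message, and the shortened example URL, are illustrative choices and not part of the original script:

```perl
use strict;
use warnings;

# The output file name is defined once and reused wherever it is needed.
my $outfile = "esearch_query_result.xml";

# Placeholder URL for illustration; the real script builds it from its arguments.
my $query_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=genome&term=example";

# execute the query (still commented out at this stage of the lesson)
#
# `wget -O $outfile "$query_url"`;

# The message now refers to the same variable, so the two names cannot drift apart.
my $message = "The results of the query ($query_url) are in $outfile\n";
print STDERR $message;
```

    Changing the file name later now means editing a single line.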
  4. Now, let's "commit our changes" using CVS.
    unix_prompt$ cvs update
    unix_prompt$ cvs commit
    
  5. I also think that we should put the wget command in a separate subroutine. Why? Because wget is crude, and in the future I would like to try other ways of executing the query, like libwww or BioPerl. But we will leave that for another day.
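    Although the lesson leaves this for another day, one possible shape for such a subroutine is sketched here. The names wget_command and run_query are hypothetical, and the sketch assumes wget is installed and on the PATH:

```perl
use strict;
use warnings;

# Hypothetical helper: build the wget argument list for a query.
sub wget_command {
    my ( $query_url, $outfile ) = @_;
    return ( 'wget', '-O', $outfile, $query_url );
}

# Hypothetical helper: run the query and return true on success.
sub run_query {
    my ( $query_url, $outfile ) = @_;
    my @command = wget_command( $query_url, $outfile );

    # The list form of system() bypasses the shell, so the URL needs no extra quoting.
    return system(@command) == 0;
}
```

    With this arrangement, trying libwww or BioPerl later should only require changing the body of run_query, not the code that calls it.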
  6. Before we implement a command-line interface, we need to design it. Here is an example of what I want it to look like:
    unix_prompt$ esearch_wrapper.pl --email=arlin@umd.edu --db=genome --query="Thermanaerovibrio+acidaminovorans[organism]"
    
  7. Now, we are ready to implement this using Getopt::Long.
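    One way this step might look is sketched below. It replaces the three shift statements with a call to GetOptions, which reads options from @ARGV and removes them as it goes. The subroutine name process_options and the %opt hash are illustrative assumptions, not the lesson's official solution:

```perl
use strict;
use warnings;
use Getopt::Long;

# Hypothetical helper: read --email, --db and --query from @ARGV.
# Returns a hash reference of option values, or nothing if parsing failed.
sub process_options {
    my %opt;
    GetOptions(
        'email=s' => \$opt{email},    # "=s" means the option takes a string value
        'db=s'    => \$opt{db},
        'query=s' => \$opt{query},
    ) or return;
    return \%opt;
}

my $opt = process_options() or die "Could not parse the command line\n";
my ( $email, $db, $search_terms ) = @{$opt}{qw(email db query)};
```

    GetOptions returns false when it meets an option it does not recognize, which is why its result is checked before the values are used.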
  8. How about adding another option? If it's going to write a file, I'd like the chance to say what the file is called, like this:
    unix_prompt$ esearch_wrapper.pl [other options] --file=my_results.xml
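    A sketch of how the --file option could be added, assuming the default file name is set before the options are parsed so that --file only overrides it when given. As before, process_options and %opt are hypothetical names:

```perl
use strict;
use warnings;
use Getopt::Long;

# Hypothetical helper: parse the options, including the new --file option.
sub process_options {
    # The default file name is set first; GetOptions changes it only if --file is given.
    my %opt = ( file => 'esearch_query_result.xml' );
    GetOptions(
        'email=s' => \$opt{email},
        'db=s'    => \$opt{db},
        'query=s' => \$opt{query},
        'file=s'  => \$opt{file},
    ) or return;
    return \%opt;
}

my $opt = process_options() or die "Could not parse the command line\n";
print STDERR "Results will be written to $opt->{file}\n";
```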
  9. Now, let's go two steps further to make this useful to users. We'll add a subroutine that explains usage, and invoke that for the user.
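    A possible sketch of these two steps follows. It assumes that --email, --db and --query are required and that a --help flag is added; the subroutine names usage and missing_options and the wording of the message are illustrative assumptions:

```perl
use strict;
use warnings;

# Return the usage message as a string, so the caller decides where to print it.
sub usage {
    return <<"END_USAGE";
Usage: $0 --email=ADDRESS --db=DATABASE --query=TERMS [--file=OUTFILE]
  --email   e-mail address to send to NCBI with the query
  --db      Entrez database to search
  --query   search terms
  --file    name of the output file (optional)
  --help    print this message and exit
END_USAGE
}

# Return the names of required options that have no value yet.
sub missing_options {
    my ($opt) = @_;
    return grep { !defined $opt->{$_} } qw(email db query);
}

# Typical use in the main script, once GetOptions has filled %opt
# (shown as comments so that this sketch only defines subroutines):
#
#   if ( $opt{help} ) { print usage(); exit 0; }
#   if ( my @missing = missing_options( \%opt ) ) {
#       print STDERR "Missing required option(s): @missing\n", usage();
#       exit 1;
#   }
```

    Printing the message to STDOUT for --help and to STDERR for errors is a common convention, though not one the lesson prescribes.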
  10. Note that, up to now, this script has been all interface. We haven't written any implementation, and the key line in the script (wget) is commented out. Shall we uncomment it? The result is that we get an XML file with query results. Try it.
  11. Last, let's break into that XML file. Remember that XML is a standard format. This means that we can parse it using out-of-the-box tools such as XML::Simple. For more information, see the XML::Simple link on the Lab14 web page.
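    A minimal sketch of parsing the result with XML::Simple. It assumes the module is installed and that the result contains a Count element and an IdList of Id elements, which is what esearch typically returns. The subroutine name summarize_esearch is hypothetical:

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical helper: return the hit count followed by the list of IDs.
# The argument may be a file name or a string containing XML.
sub summarize_esearch {
    my ($xml) = @_;

    # ForceArray makes Id an array reference even when there is only one Id.
    my $result  = XMLin( $xml, ForceArray => ['Id'] );
    my $id_list = $result->{IdList};
    my $ids     = ( ref $id_list eq 'HASH' && $id_list->{Id} ) || [];
    return ( $result->{Count}, @$ids );
}
```

    For example, calling summarize_esearch with the name of the XML file written by wget would return the number of hits and the matching IDs, which the script could then print or pass to another eutils query.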