- First, we will add the file esearch_wrapper.pl to our CVS repository:
unix_prompt$ cd ~/My_CBMG688P/my_cvs/Perl
unix_prompt$ cp ../../Lab14Link/Materials/esearch_wrapper.pl .
unix_prompt$ chmod +x esearch_wrapper.pl
unix_prompt$ cvs add esearch_wrapper.pl
unix_prompt$ cvs commit -m "exercise to learn Getopt::Long"
- Now, let's look at the file, which is a wrapper for NCBI's esearch utility. It uses wget to execute the web query, though I have commented out that line (we'll uncomment it in a few minutes). Here it is:
#!/usr/local/bin/perl -w
#
# $Id$
#
use strict;
# process command-line arguments
#
my ( $email, $db, $search_terms );
$email = shift;
$db = shift;
$search_terms = shift;
# construct the URL (note: $0 is the current command)
#
my $base_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";
my $query_url = sprintf( "%s?tool=%s&email=%s&db=%s&term=%s", $base_url, $0, $email, $db, $search_terms );
# execute the query
#
# `wget -O esearch_query_result.xml "$query_url"`;
printf STDERR "The results of the query ($query_url) are in esearch_query_results.xml\n";
# and exit
#
exit;
- Before going any further, let's fix an obvious problem with this code: the output file name appears twice, once as "esearch_query_result.xml" in the wget line and once as "esearch_query_results.xml" in the print statement, i.e., with a slight spelling error. Will we confuse users with this? Even without the spelling error, if we wanted to change the file name later, would we remember to change every instance? Repeating string literals or other code in a file is not a good practice. The best practice is to make a repeated thing a named (referenced) entity, and then just reference it. In this case, we make the filename a variable, then invoke the variable twice. In a few minutes, we will add a command-line option so that the user can set this variable from the command-line, which isn't possible until we fix this. Improvements to software often depend on modularity: it's harder to debug, test, document, and re-use software that isn't modular.
- Add this
my $outfile = "esearch_query_results.xml";
- Now fix the wget command (please leave it commented-out for now) to look like this:
# `wget -O $outfile "$query_url"`;
- And now fix the print statement to look like this:
printf STDERR "The results of the query ($query_url) are in $outfile\n";
- Now, let's "commit our changes" using CVS.
unix_prompt$ cvs update
unix_prompt$ cvs commit
- I also think that we should put the wget command in a separate subroutine. Why? Because wget is crude, and in the future I would like to try other ways of executing the query, like libwww or bioperl. But we will leave that for another day.
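- (If you're curious, that refactoring might look roughly like the sketch below; the subroutine name run_query is just a placeholder, not something we will add today.)
sub run_query
{
my ( $query_url, $outfile ) = @_;
# for now, shell out to wget; later this body could be swapped for
# libwww (LWP) or a bioperl module without changing the caller
`wget -O $outfile "$query_url"`;
}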
- Before we implement a command-line interface, we need to design it. Here is an example of what I want it to look like:
unix_prompt$ esearch_wrapper.pl --email=arlin@umd.edu --db=genome --query="Thermanaerovibrio+acidaminovorans[organism]"
- Now, we are ready to implement this using Getopt::Long.
- First, to use Getopt::Long, we need to tell Perl this:
use Getopt::Long;
- Now, let's delete all of the following stuff we don't need:
$email = shift;
$db = shift;
$search_terms = shift;
- Finally, we'll invoke Getopt::Long::GetOptions to process the command-line. The syntax is a list of pairs: the string on the left is an option specification ("email=s" means an option named --email that requires a string value), and the item on the right is a reference to the variable where that value should be stored:
GetOptions(
"email=s" => \$email,
"query=s" => \$search_terms,
"db=s" => \$db
);
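- (A quick aside on the specification syntax: besides "=s" for a required string value, Getopt::Long supports other types. The variable names below are just illustrative, not part of our script:)
my ( $count, $verbose );
GetOptions(
"count=i" => \$count, # =i requires an integer value
"verbose!" => \$verbose # ! makes a boolean flag and also enables --noverbose
);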
- That's all. Are we ready to try it out? Please try the command-line and check what our message says about the query. Some things to try (example invocations follow this list):
- change the order in which the options are invoked
- try using abbreviations (these work as long as they are unique), e.g. --qu
- you can even use single-letter flags with one dash (if unique), e.g., -q
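- For example, assuming the command-line shown earlier, these invocations should all be equivalent (abbreviation and single-dash matching are standard Getopt::Long behavior):
unix_prompt$ esearch_wrapper.pl --db=genome --email=arlin@umd.edu --query="Thermanaerovibrio+acidaminovorans[organism]"
unix_prompt$ esearch_wrapper.pl --em=arlin@umd.edu --db=genome --qu="Thermanaerovibrio+acidaminovorans[organism]"
unix_prompt$ esearch_wrapper.pl -e arlin@umd.edu -d genome -q "Thermanaerovibrio+acidaminovorans[organism]"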
- How about adding another option? If it's going to write a file, I'd like the chance to say what it is, like this:
unix_prompt$ esearch_wrapper.pl [other options] --file=my_results.xml
- To do this, we just need to add a line (don't forget the preceding comma):
GetOptions(
"email=s" => \$email,
"query=s" => \$search_terms,
"db=s" => \$db,
"file=s" => \$outfile
);
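- (One thing to be explicit about: the my $outfile = "esearch_query_results.xml"; declaration we added earlier must appear before the GetOptions call, so that the default is set first and --file, when given, overrides it. Roughly:)
my $outfile = "esearch_query_results.xml"; # default, used when --file is not given
GetOptions(
"email=s" => \$email,
"query=s" => \$search_terms,
"db=s" => \$db,
"file=s" => \$outfile # replaces the default when --file is given
);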
- Now, let's go two steps further to make this useful to users. We'll add a subroutine that explains usage, and invoke that for the user.
- Paste this at the end of the code (after exit;):
sub Usage
{
my $command = $0;
$command =~ s#^.*/##;    # strip any leading directory path from $0
printf STDERR "@_\n" if ( @_ );
printf STDERR "usage: $command --db NCBI_DB --email EMAIL --query TERMS [--help|-?]\n";
exit;
}
- and then we'll need to specify another option when we invoke GetOptions (don't forget the preceding comma), and call Usage() after option-parsing (note the semi-colon):
my $help;
GetOptions(
"email=s" => \$email,
"query=s" => \$search_terms,
"db=s" => \$db,
"file=s" => \$outfile ,
"help|?" => \$help
) or Usage( "Invalid command-line options. " );
Usage() if defined $help;
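- If everything is wired up correctly, asking for help should now print the usage line and exit, something like this:
unix_prompt$ esearch_wrapper.pl --help
usage: esearch_wrapper.pl --db NCBI_DB --email EMAIL --query TERMS [--help|-?]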
- Note that, up to now, this script has been all interface. We haven't written any implementation, and the key line in the script (wget) is commented out. Shall we uncomment it? The result is that we get an XML file with query results. Try it.
- Last, let's break into that XML file. Remember that XML is a standard format. This means that we can parse it using out-of-the-box tools. Here's how:
- add this at the beginning
use XML::Simple;
use Data::Dumper;
- and add this after the "print" but before "exit":
# now lets see what we got
my $xml = new XML::Simple;
my $output = $xml->XMLin("$outfile");
print Dumper($output);
- Note that the above code takes the XML and puts it into a Perl hash. We can access it via $output, which is the hashref, like this:
printf ("the Id that we really want is %s\n", $output->{ IdList }->{ Id } );
- For more information, see the XML::Simple link on the Lab14 web page.