GenePainter v.2.0 maps gene structures onto multiple sequence alignments
Contact: Martin Kollmar (mako[at]nmr.mpibpc.mpg.de)


INTRODUCTION
------------
The conservation of intron positions comprises information useful for de novo gene prediction as well as for analyzing the origin of introns. Here, we present GenePainter, a standalone tool for mapping gene structures onto protein multiple sequence alignments (MSA). Gene structures, as provided by WebScipio, are aligned with respect to the exact positions of the introns (down to nucleotide level) and intron phase. Output can be viewed in various formats, ranging from plain text to graphical output formats.


INSTALLATION
------------

Unpack
------
Use one of the following methods, depending on the type of your archive file:
$ unzip gene_painter.zip
$ tar -xzf gene_painter.tgz


Compilation
-----------
No compilation required.


Ruby version
------------
Ruby version 2.0.0 or higher is required. If necessary, consider using Ruby Version Manager (RVM; https://rvm.io/) to install and work with multiple ruby environments on your machine. 


RUNNING GENEPAINTER
-------------------

Ruby interpreter
----------------
Invoke GenePainter via one of the following options:

1) as a script
$ ruby gene_painter.rb

2) as a program
$ ./gene_painter.rb

Usage
-----
Usage: gene_painter.rb -i path_to_alignment -p path_to_genestructure_folder [options]
Standard output format: Mark exons by '-' and introns by '|'

    -i, --input <path_to_alignment>  Path to fasta-formatted multiple sequence alignment
    -p <path_to_genestructures>,     Path to folder containing gene structures in YAML or GFF format
        --path

Options
-------
Text-based output format:
        --intron-phase               Mark introns by their phase instead of '|'
        --phylo                      Mark exons by '0' and introns by '1'
        --spaces                     Mark exons by space (' ') instead of '-'
        --no-standard-output         Specify to skip standard output format.
        --alignment                  Output the alignment file with additional lines containing intron phases

Graphical output format:
        --svg                        Drawn a graphical representation of genes in SVG format
                                     Per default, detailed representation will be produced
                                     Use parameter '--svg-format' to get less details
        --svg-format FORMAT          FORMAT: ["normal", "reduced", "both"]
                                     'normal' draws details of aligned exons and introns [default]
                                     'reduced' focuses on common introns only
                                     'both' draws both formats

        --pdb FILE                   Mark consensus or merged gene structure in pdb FILE
                                     Consenus gene structure contains introns conserved in N % of all genes
                                     Specify N with option --consensus N; [default: 80%]
                                     Two scripts for execution in PyMol are provided:
                                     'color_exons.py' to mark consensus exons
                                     'color_splicesites.py' to mark splice junctions of consensus exons
        --pdb-chain CHAIN            Mark gene structures for chain CHAIN
                                     [Default: Use chain A]
        --pdb-ref-prot PROT          Use protein PROT as reference for alignment with pdb sequence
                                     [Default: First protein in alignment]
        --pdb-ref-prot-struct        Color only intron positions occuring in the reference protein structure.

        --tree                       Generate newick tree file and SVG representation

Meta information and statistics:
        --consensus N                Introns conserved in N % genes.
                                     Specify N as decimal number between 0 and 1
        --merge                      Merge all introns into a single exon intron pattern
        --statistics                 Output additional file with statistics about common introns
                                     To include information about taxomony, specify '--taxomony' and '--taxonomy-to-fasta' options

Taxonomy:
        --taxonomy FILE              Mark introns by taxonomy
                                     NCBI taxonomy database dump file FILE
        --taxonomy-to-fasta FILE     Text-based file mapping gene structure file names to species names
                                     Gene1[,Gene2]:Species
        --taxonomy-common-to x,y,z   Mark introns common to taxa x,y,z
                                     List must consist of at least one NCBI taxon
        --[no-]exclusively-in-taxa   Mark introns occuring (not) exclusively in listed taxa
                                     Default: Not exclusively.
        --introns-per-taxon          Newly gained introns for every inner node in taxonomy
        --no-grep                    Read the NCBI taxomony dump into RAM
                                     This will require some hundert MBs of RAM additionally
                                     Default: taxomony dump is parsed with 'grep' calls
        --nice                       Give grep calls to parse taxonomy dump a lower priority
                                     Please make sure to have 'nice' in your executable path when using this option

Data and output selection:
        --data-selection             Use only a selection of all data, i.e. gene structures
                                     Output is based on intron positions occuring in selection
        --analyse-data-selection-only
                                     Ignore all data that are not in selection
                                     Effects only taxonomy options: If not specified, all data are used for taxonomic analysis
        --output-selection           Text-based and graphical output contains only selected data.
                                     Analysis of intron positions is based on all data.
Selection criteria for data and output selection
        --selection-based-on-regex "REGEX"
                                     Regular expression applied on gene structure file names
        --selection-based-on-list x,y,z
                                     List of gene structures to be used
        --selection-based-on-species <"species">
                                     Use all gene structures associated with species
                                     Specify also --taxonomy-to-fasta to map gene structure file names to species names

General options:
    -o, --outfile <file_name>        Name of the output file
        --range START,STOP           Restrict genes to range START-STOP in alignment
        --[no-]delete-range          (Not) Delete specified range
        --keep-common-gaps           Keep common gaps in alignment
        --no-best-position-introns   Plot introns always onto beginning of a gap
                                     Default: Align introns if their position differs by alignment gaps only
        --[no-]separate-introns-in-textbased-output
                                     (Not) Separate each consecutive pair of introns by an exon placeholder in text-based output formats.
                                     Default: Separate introns unless the output lines get too long.

    -h, --help                       Show this message

