GenePainter v.2.0 maps gene structures onto multiple sequence alignments
Contact: Martin Kollmar (mako[at]nmr.mpibpc.mpg.de)


INTRODUCTION
------------
The conservation of intron positions comprises information useful for de novo gene prediction as well as for analyzing the origin of introns. Here, we present GenePainter, a standalone tool for mapping gene structures onto pr
otein multiple sequence alignments (MSA). Gene structures, as provided by WebScipio, are aligned with respect to the exact positions of the introns (down to nucleotide level) and intron phase. Output can be viewed in various 
formats, ranging from plain text to graphical output formats.


Restrictions and Licence
------------------------
GenePainter can be used under a GNU General Public License.


INSTALLATION
------------

Unpack
------
Use one of the following methods, depending on the type of your archive file:
$ unzip gene_painter.zip
$ tar -xzf gene_painter.tgz


Compilation
-----------
No compilation required.


Ruby version
------------
Ruby version 2.0.0 or higher is required. If necessary, consider using Ruby Version Manager (RVM; https://rvm.io/) to install and work with multiple ruby environments on your machine. 


RUNNING GENEPAINTER
-------------------

Ruby interpreter
----------------
Invoke GenePainter via one of the following options:

1) as a script
$ ruby gene_painter.rb

2) as a program
$ ./gene_painter.rb

Usage
-----
Usage: gene_painter.rb -i path_to_alignment -p path_to_genestructure_folder [options]
Standard output format: Mark exons by '-' and introns by '|'

    -i, --input <path_to_alignment>  Path to fasta-formatted multiple sequence alignment
    -p <path_to_genestructures>,     Path to folder containing gene structures in YAML or GFF format. 
        --path                       Required file extension is one of .yaml, .gff, .gff3.

Options
-------
Text-based output format:
        --intron-phase               Mark introns by their phase instead of '|'
        --phylo                      Mark exons by '0' and introns by '1'
        --spaces                     Mark exons by space (' ') instead of '-'
        --no-standard-output         Specify to skip standard output format.
        --alignment                  Output the alignment file with additional lines containing intron phases
        --fuzzy N                    Introns at most N base pairs apart from each other are aligned

Graphical output format:
        --svg                        Drawn a graphical representation of genes in SVG format
                                     Per default, detailed representation will be produced
                                     Use parameter '--svg-format' to get less details
        --svg-format FORMAT          FORMAT: ["normal", "reduced", "both"]
                                     'normal' draws details of aligned exons and introns [default]
                                     'reduced' focuses on common introns only
                                     'both' draws both formats

        --pdb FILE                   Mark consensus or merged gene structure in pdb FILE
                                     Consenus gene structure contains introns conserved in N % of all genes
                                     Specify N with option --consensus N; [default: 80%]
                                     Two scripts for execution in PyMol are provided:
                                     'color_exons.py' to mark consensus exons
                                     'color_splicesites.py' to mark splice junctions of consensus exons
        --pdb-chain CHAIN            Mark gene structures for chain CHAIN
                                     [Default: Use chain A]
        --pdb-ref-prot PROT          Use protein PROT as reference for alignment with pdb sequence
                                     [Default: First protein in alignment]
        --pdb-ref-prot-struct        Color only intron positions occuring in the reference protein structure.

        --tree                       Generate newick tree file and SVG representation

Meta information and statistics:
        --consensus N                Introns conserved in N % genes.
                                     Specify N as decimal number between 0 and 1
        --merge                      Merge all introns into a single exon intron pattern
        --statistics                 Output additional file with statistics about common introns
                                     To include information about taxomony, specify '--taxomony' and '--taxonomy-to-fasta' options

Taxonomy:
        --taxonomy FILE              NCBI taxonomy database dump file FILE
                                     OR path to extract from NCBI taxonomy:
                                     Lineage must be semicolon-separated list of taxa from root to species.
        --taxonomy-to-fasta FILE     Text-based file mapping gene structure file names to species names
                                     Mandatory format:
                                     One or more genes given as semicolon-separated list and species name
                                     Delimiter between gene list and species name must be a colon
                                     The species name itself must be enclosed by double quotes like this "SPECIES"
        --taxonomy-common-to x,y,z   Mark introns common to taxa x,y,z
                                     List must consist of at least one NCBI taxon
        --[no-]exclusively-in-taxa   Mark introns occuring (not) exclusively in listed taxa
                                     Default: Not exclusively.
        --introns-per-taxon          Newly gained introns for every inner node in taxonomy
        --no-grep                    Read the NCBI taxomony dump into RAM
                                     This will require some hundert MBs of RAM additionally
                                     Default: taxomony dump is parsed with 'grep' calls
        --nice                       Give grep calls to parse taxonomy dump a lower priority
                                     Please make sure to have 'nice' in your executable path when using this option

Analysis and output of all or subset of data:
        --analyse-all-output-all     Analyse all data and provide full output [default]
        --analyse-all-output-selection
                                     Analyse all data and provide text-based and graphical output for selection only
                                     All introns are analysed, including those not present in selection
        --analyse-selection-output-selection
                                     Analyse selected data and provide output for selection only
        --analyse-selection-on-all-data-output-selection
                                     Analyse intron positions of selected data in all data and provide output for selection only
                                     Introns present in selection are analysed in all data
Selection criteria for data and output selection
        --selection-based-on-regex <"regex">
                                     Regular expression applied on gene structure file names
        --selection-based-on-list x,y,z
                                     List of gene structures to be used
        --selection-based-on-species <"species">
                                     Use all gene structures associated with species
                                     Specify also --taxonomy-to-fasta to map gene structure file names to species names
        --select-all                 No selection applied (default)

General options:
    -o, --outfile <file_name>        Prefix of the output file(s)
                                     Default: genepainter
        --path-to-output <path>      Path to location for the output file(s)
                                     Default: same location as GenePainter source files
        --range START,STOP           Restrict genes to range START-STOP in alignment
                                     Might also be list if format START1,STOP1,START2,STOP2
                                     Keyword 'end' might be used to mark last position in alignment
        --[no-]delete-range          (Not) Delete specified range
        --keep-common-gaps           Keep common gaps in alignment
        --[no-]separate-introns-in-textbased-output
                                     (Not) Separate each consecutive pair of introns by an exon placeholder in text-based output formats.
                                     Default: Separate introns unless the output lines get too long.

    -h, --help                       Show this message
