File download: http://fileshare.csb.univie.ac.at/vog/
File names and formats are similar to those used in eggNOG 4.5:
Text file listing the letter codes of the functional categories. Each code consists of "X" (unused in the NCBI COG functional categories) followed by a lowercase character indicating the functional category.
Tab separated file of genomes used for VOG construction.
Columns: species name|taxon id|phage/nonphage|source|source version
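The column layout above can be read with a few lines of Python. This is a minimal sketch, not part of the VOG distribution: the function name `read_species_list`, the `Genome` record, the handling of `#` comment lines, and the assumption that the third column literally contains `phage` or `nonphage` are all illustrative and should be checked against the actual file.

```python
from typing import Iterable, List, NamedTuple


class Genome(NamedTuple):
    """One row of the genome list (field names are our own choice)."""
    species_name: str
    taxon_id: str
    is_phage: bool
    source: str
    source_version: str


def read_species_list(lines: Iterable[str]) -> List[Genome]:
    """Parse tab separated rows: species name, taxon id, phage/nonphage, source, source version."""
    genomes = []
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: empty lines and lines starting with '#' carry no data.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # skip rows that do not match the documented layout
        name, taxon_id, phage_flag, source, version = fields[:5]
        # Assumption: the third column holds the literal text "phage" or "nonphage".
        is_phage = phage_flag.strip().lower() == "phage"
        genomes.append(Genome(name, taxon_id, is_phage, source, version))
    return genomes
```

Typical use would be `read_species_list(open(path))`, where `path` points to the downloaded genome list; any iterable of lines works, which makes the function easy to try on a few hand-written rows first.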
FASTA formatted file of all proteins from the genomes in vog.species.list. Protein IDs encode the taxonomy id of the genome and the RefSeq protein id. For peptides derived from polyproteins, the protein id of the corresponding polyprotein (CDS) is also given.
FASTA formatted file of all gene sequences from the genomes in vog.species.list. The same IDs are used as in the protein file. For polyprotein genes, both the partial gene sequences of the peptides and the complete gene sequence of the polyprotein are included.
Compressed archive of FASTA formatted files of the proteins contained in each VOG.
Compressed archive of multiple sequence alignments for each VOG.
Compressed archive of the HMMER3 compatible Hidden Markov Models obtained from the multiple sequence alignments for each VOG.
Tab separated file of VOGs and the comma separated lists of their member protein ids.
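A membership table of this kind maps naturally onto a dictionary. The sketch below assumes that the VOG identifier is in the first column and the comma separated member list is in the last column; the description above does not state the exact column order or whether further columns exist, so both points are assumptions, as is the function name `read_members`.

```python
from typing import Dict, Iterable, List


def read_members(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each VOG id to the list of its member protein ids."""
    members: Dict[str, List[str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: empty lines and '#' comment/header lines carry no data.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # a row needs at least a VOG id and a member list
        vog_id = fields[0]
        # Assumption: the member list is the last tab separated field.
        members[vog_id] = [p.strip() for p in fields[-1].split(",") if p.strip()]
    return members
```

Inverting the result (protein id to VOG id) is then a short loop over `members.items()`, which is often the more convenient direction when annotating your own sequences.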
Tab separated file of VOGs and their consensus functional annotations (preferably from Swiss-Prot; where no Swiss-Prot annotation was available, the RefSeq annotation was used).
Tab separated file of VOGs and the taxonomic lineage of the last common ancestor (LCA) of the member genomes. Genomes with unclassified taxonomic lineages were not used for LCA determination. The numbers of genomes per VOG and LCA, as well as the total numbers of genomes in the LCA, are given.
Tab separated file of VOGs and their specific occurrence in virus genomes. For this purpose, the homology of all member proteins to cellular genomes from eggNOG 4.5 has been determined at three different stringencies:
High stringency: blastp E-value <= 1e-04 and hits in at most 2 cellular genomes
Medium stringency: blastp E-value <= 1e-10 and hits in at most 3 cellular genomes
Low stringency: blastp E-value <= 1e-15 and hits in at most 4 cellular genomes
"Only in viruses" is set to true if the members matched no more than the maximum number of cellular genomes at the E-value threshold of the respective stringency level.
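The rule for the three stringency levels can be written down compactly. The following is a sketch of one possible reading, not the pipeline that produced the file: it assumes that blastp hits are pooled over all member proteins of a VOG and that distinct cellular genomes are counted. The names `STRINGENCY` and `only_in_viruses` and the `(genome, evalue)` input format are hypothetical.

```python
from typing import Iterable, Tuple

# (maximum e-value, maximum number of cellular genomes) per level,
# taken from the thresholds listed above.
STRINGENCY = {
    "high": (1e-4, 2),
    "medium": (1e-10, 3),
    "low": (1e-15, 4),
}


def only_in_viruses(hits: Iterable[Tuple[str, float]], level: str) -> bool:
    """Return True if the hits touch no more cellular genomes than the level allows.

    `hits` holds (cellular genome id, blastp e-value) pairs, assumed to be
    pooled over all member proteins of one VOG.
    """
    max_evalue, max_genomes = STRINGENCY[level]
    # Count each cellular genome once, keeping only hits that pass the threshold.
    matched_genomes = {genome for genome, evalue in hits if evalue <= max_evalue}
    return len(matched_genomes) <= max_genomes
```

For example, a VOG with passing hits in three different cellular genomes at 1e-04 would not count as virus-only at high stringency, because the limit there is two genomes.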
Columns: GroupName|Only in viruses (high stringency)|Only in viruses (medium stringency)|Only in viruses (low stringency)
This file is useful to extract virus-specific markers from all VOGs, based on your preferred level of stringency.
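Selecting virus-specific VOGs at a chosen stringency then amounts to filtering on one of the three flag columns. This sketch follows the column order given above; how "true" is encoded in the file is not stated here, so the accepted spellings in `TRUE_VALUES` are an assumption, and `virus_specific_vogs` is a hypothetical helper name.

```python
from typing import Iterable, List

# Assumption: these spellings mark a true flag; adjust after inspecting the file.
TRUE_VALUES = {"1", "true", "t", "yes"}

# Column index of each flag: GroupName | high | medium | low
LEVEL_COLUMN = {"high": 1, "medium": 2, "low": 3}


def virus_specific_vogs(lines: Iterable[str], level: str = "high") -> List[str]:
    """Return the VOG ids flagged "Only in viruses" at the given stringency."""
    column = LEVEL_COLUMN[level]
    selected = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments and rows without all four documented columns.
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        # A header row is skipped implicitly: its cells are not in TRUE_VALUES.
        if fields[column].strip().lower() in TRUE_VALUES:
            selected.append(fields[0])
    return selected
```

Called as `virus_specific_vogs(open(path), level="medium")`, it returns the group names in file order, which can be used directly to pick the matching alignments or HMMs from the archives.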