VOGDB groups and file formats

Groups

Since release 221, VOGDB groups proteins hierarchically on three levels:

  1. VOG: Virus orthologous groups (built from bidirectional sequence similarities)
  2. VFAM: Virus protein families (built from VOGs by HMM-HMM clustering)
  3. VFOLD: Virus protein structural folds (built from VFAMs by clustering predicted 3D structures of representative proteins)

Releases up to and including 220 comprise only VFAM groups.

File formats

File suffix
  Description

*.annotations.tsv.gz
  Tab-separated file of groups and their consensus functional annotations (preferably taken from Swiss-Prot; RefSeq annotations are used where no Swiss-Prot annotation is available).
  Columns: GroupName|ProteinCount|SpeciesCount|FunctionalCategory|ConsensusFunctionalDescription

*.faa.tar.gz
  Compressed archive of FASTA-formatted files of the proteins per VOGDB group.

*.functional_categories.txt
  Text file listing the letter codes of the functional categories. Each code consists of X (unused in the NCBI COG functional categories) followed by a lowercase character indicating the functional category.

*.genes.all.fa.gz
  FASTA-formatted file of all gene sequences from the genomes in vog.species.list. The same IDs as in the protein file are used. For polyprotein genes, both the partial gene sequences of the peptides and the complete gene sequence of the polyprotein are included.

*.host.txt
  Tab-separated file of host information and classification for virus taxa.
  Columns: taxon id|phage/nonphage|host|superkingdom of host

*.hmm.tar.gz
  Compressed archive of the HMMER3-compatible hidden Markov models obtained from the multiple sequence alignments for each VOGDB group.

*.lca.tsv.gz
  Tab-separated file of VOGs and the taxonomic lineage of the last common ancestor (LCA) of their member genomes. Genomes with unclassified taxonomic lineages are not used for LCA determination, so a VOG has no LCA if all of its proteins come from unclassified lineages. The number of genomes per VOG and LCA and the total number of genomes in the LCA are given.
  Columns: GroupName|GenomesInGroupAndLCA|GenomesTotalInLCA|LastCommonAncestor_TaxonName|LastCommonAncestor_TaxonID

*.members.tsv.gz
  Tab-separated file of VOGs and the comma-separated lists of their member protein IDs.
  Columns: GroupName|ProteinCount|SpeciesCount|FunctionalCategory|ProteinIDs

*.proteins.all.fa.gz
  FASTA-formatted file of all proteins from the genomes in vog.species.list. Protein IDs encode the taxonomy ID of the genome and the RefSeq protein ID. For peptides from polyproteins, the protein ID of the corresponding polyprotein (CDS) is also given.

*.raw_algs.tar.gz
  Compressed archive of multiple sequence alignments for each VOGDB group.

*.species.txt
  Tab-separated file of virus genomes used for VOG construction.
  Columns: species name|taxon id|source|source version

*.virusonly.tsv.gz
  Tab-separated file of VOGs and their specific occurrence in virus genomes. For this purpose, the homology of all member proteins to cellular genomes from eggNOG 4.5 was determined at three stringency levels:
    High stringency: blastp e-value <= 1e-04 and hits in at most 2 cellular genomes
    Medium stringency: blastp e-value <= 1e-10 and hits in at most 3 cellular genomes
    Low stringency: blastp e-value <= 1e-15 and hits in at most 4 cellular genomes
  The Only_in_viruses column for a stringency level is set to true if the members matched no more than the maximum number of cellular genomes at that level's e-value threshold.
  Columns: GroupName|Only in viruses (high stringency)|Only in viruses (medium stringency)|Only in viruses (low stringency)
  Values: 1=True; 0=False
  This file is useful for extracting virus-specific markers from all VOGs at your preferred level of stringency.
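
As an illustration of the marker extraction mentioned above, the following Python sketch selects the groups flagged as virus-only at a chosen stringency and looks up their member protein IDs. It relies only on the column orders listed for *.virusonly.tsv.gz and *.members.tsv.gz. The function names are hypothetical, the example rows are made up, and skipping lines that start with "#" is an assumption about how header lines are marked, so check the files of your release before relying on it.

```python
import gzip

# Column positions follow the "Columns:" line of *.virusonly.tsv.gz.
STRINGENCY_COLUMN = {"high": 1, "medium": 2, "low": 3}


def data_rows(lines):
    """Yield the tab-separated fields of every data line."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: blank lines and '#'-prefixed header lines carry no data.
        if not line or line.startswith("#"):
            continue
        yield line.split("\t")


def virus_specific_groups(lines, stringency="high"):
    """Return the names of groups flagged 1 (True) at the given stringency."""
    column = STRINGENCY_COLUMN[stringency]
    return [row[0] for row in data_rows(lines) if row[column].strip() == "1"]


def parse_members(lines):
    """Map each group name to its member protein IDs (fifth column)."""
    return {row[0]: row[4].split(",") for row in data_rows(lines)}


def open_text(path):
    """Open a gzip-compressed file from a release for line-wise reading."""
    return gzip.open(path, "rt")


# Made-up rows for illustration only; these are not real VOGDB records.
virusonly_rows = [
    "#GroupName\tHigh\tMedium\tLow",
    "VOG00001\t1\t1\t1",
    "VOG00002\t0\t1\t1",
    "VOG00003\t0\t0\t0",
]
members_rows = [
    "VOG00001\t2\t1\tXx\tprotA,protB",
    "VOG00002\t1\t1\tXx\tprotC",
]

members = parse_members(members_rows)
for group in virus_specific_groups(virusonly_rows, "medium"):
    print(group, members[group])
```

For real data, pass the handle returned by open_text() in place of the example lists. Lines are split with str.split rather than the csv module because the ProteinIDs field of a large group may be longer than the csv module's default field size limit.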

License

All data published are licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).