The Human Variation Databases for SIFT, Human Genome Assembly, NCBI Build 36 
					Last updated on December, 2008


This document describes the human variation databases used by SIFT exome tool for 
nonsynonymous single nucleotide variants (nsSNVs) available on the JCVI FTP site under the 
/pub/data/sift/Human_db_36 directory. The direct URL is: 
ftp://ftp.jcvi.org/pub/data/sift/Human_db_36/

To download the contents of the FTP site via linux terminal, please use "anonymous" username and your 
email as the password as follows,
>ftp ftp.jcvi.org
Connected to ftp.jcvi.org (172.20.17.13).
220 JCVI FTP Server
Name (ftp.jcvi.org:pkumar):anonymous
331 Anonymous login ok, send your complete email address as your password
Password:usename@domain.com

1. General Introduction

SIFT exome tool for nsSNVs (http://sift.jcvi.org/www/SIFT_chr_coords_submit.html) is used to characterize 
nonsynonymous single nucleotide variants in the entire human exome for their effect on the protein 
function. The tool uses sqlite databases built using sqlite 3.6.13, containg precomputed sift scores 
and annotation for each position (with all possible nucleotide substitutions) in the human exome to 
characterize nsSNVs. For more information, please view the exome tool help page 
(http://sift.jcvi.org/www/chr_coords_example.html). 

2. Contents of the /pub/data/sift/Human_db_36/ directory

The human genome variation databases backing the exome tool mentioned above were built using the SQLite 
version 3.6.13, a self-contained, serverless SQL database engine that can be downloaded from 
http://www.sqlite.org/.

There 24 SQLite databases in gzipped format each corresponding to one of the chromosomes of the human 
genome, chromosomes 1 through 22, X and Y. The chromosome files can be unzipped using the gunzip utility 
on linux (http://www.gzip.org/) or winrar (http://www.rarlab.com/) on windows platform.
 
3. Structure and contents of the databases

Each sqlite chromosome database, when unzipped, can be examined using the SQLite database engine. For 
more information on how to use SQLite, please visit http://www.sqlite.org/.

Each database is segmented into tables containing consecutive 10mn bases spanning the entire human exome. 
For example, the tables representing the first and second 10mn bases of chromosome 1 are named chr1_1_10000855
 and chr1_9994482_20000000 respectively. Notice that the absolute start and stop coordinates may not be 
exact multiples of 10mn. This is to make sure that if an exonic region is shared between two intervals,
it gets included completely within atleast one of the intervals.

1|CHR|TEXT - Human chromosome [1-22, X, Y]
2|COORD1|NUMERIC - Space based chromosome coordinate 1
3|COORD2|NUMERIC - Space based chromosomal nucleotide coordinate 2 (also absolute nucleotide coordinate)
4|ORN|TEXT - Strand orientation [1,-1]
5|RSID|TEXT - dbSNP rsID if exists
6|ENSG|TEXT - Ensembl gene ID, NULL if non-coding
7|ENST|TEXT - Ensembl transcript ID, NULL if non-coding
8|ENSP|TEXT - Ensembl protein ID, NULL if non-coding
9|REGION|TEXT - [CDS, intron, 3'UTR, 5'UTR, DOWNSTREAM, UPSTREAM]
10|SNP|TEXT - [Synonymous, Nonsynonymous], NULL if non-coding
11|NT1|CHAR - Reference nucleotide at this position [A,T,G,C] 
12|NT2|CHAR - Altered nucleotide at this position [A,T,G,C]
13|NTPOS1|NUMERIC - Space based mRNA coordinate 1 
14|NTPOS2|NUMERIC - Space based mRNA coordinate 2 (also absolute nucleotide coordinate)
15|CODON1|TEXT - Reference codon [GCT] for example
16|CODON2|TEXT - Altered codon [GaT] (altered base in lowercase)
17|AA1|CHAR - Reference amino acid at this position
18|AA2|CHAR - Altered amino acid at this position
19|AAPOS1|NUMERIC - Space based amino acid coordinate 1
20|AAPOS2|NUMERIC - Space based amino acid coordinate 2 (also absolute residue coordinate)
21|CDS|NUMERIC - 1 if coordinate in CDS, 0 otherwise
22|AA1_VALID|INTEGER - 0 if reference amino acid does not agree with the actual amino acid in protein 
		       due to ensembl annotation error
23|ENST_VALID|INTEGER - Depricated
24|SCORE|NUMERIC - SIFT score [0-1] - for more info see http://sift.jcvi.org
25|MEDIAN|NUMERIC - SIFT median information content
26|SEQS_REP|INTEGER - Number of homologs having reference amino acid at this position

4. Downloading and exploring the database

The chromosome databases can either be explored using the SQLite database engine or be integrated into 
exisinting pipelines written in various languages. 

4a. After downloading the contents of this FTP location, place them in $SIFT_HOME/db/Human_db directory 
and make sure SQLite database engine is installed and running. 

4b. Within the human_db directory, launch SQLite with one of the chromosome databses
>sqlite3 Human_CHR1.sqlite

4c. .tables command will display all the 10mn interval tables in this database:
chr1_110000001_120000000  chr1_199859224_210070737  chr1_48771114_60000988  
chr1_120000001_130000000  chr1_1_10000855           chr1_59535241_70361752  
chr1_130000001_140000000  chr1_20000001_30000000    chr1_69806669_80000000  
chr1_140000001_150002664  chr1_209983422_220000000  chr1_80000001_90000000  
chr1_149998743_160000000  chr1_220000001_230243641  chr1_90000001_100003937 
chr1_160000001_170000000  chr1_229829184_240031580  chr1_9994482_20000000   
chr1_170000001_180037339  chr1_239882203_250000000  chr1_99946847_110000000 
chr1_179648918_190000000  chr1_30000001_40027120  
chr1_190000001_200062720  chr1_39977117_50262172  

4d. You may viwe the table_info / column names using the following command 
sqlite> PRAGMA table_info (chr1_1_10000855);
0|CHR|TEXT|1||1
1|COORD1|NUMERIC|1||0
2|COORD2|NUMERIC|1||1
3|ORN|TEXT|0||1
4|RSID|TEXT|0||0
5|ENSG|TEXT|0||0
6|ENST|TEXT|0||1
7|ENSP|TEXT|0||0
8|REGION|TEXT|0||0
9|SNP|TEXT|0||0
10|NT1|CHAR|0||0
11|NT2|CHAR|0||1
12|NTPOS1|NUMERIC|0||0
13|NTPOS2|NUMERIC|0||0
14|CODON1|TEXT|0||0
15|CODON2|TEXT|0||0
16|AA1|CHAR|0||0
17|AA2|CHAR|0||1
18|AAPOS1|NUMERIC|0||0
19|AAPOS2|NUMERIC|0||0
20|CDS|NUMERIC|0||0
21|AA1_VALID|INTEGER|0||0
22|ENST_VALID|INTEGER|0||0
23|SCORE|NUMERIC|0||0
24|MEDIAN|NUMERIC|0||0
25|SEQS_REP|INTEGER|0||0

4e. Sample queries:
select * from chr1_1_10000855 where ENST = 'ENST00000328596';

select ENST,REGION,NT1,NT2,AA1,AA2,CODON1,CODON2,SCORE,RSID from chr1_1_10000855 where ENST = 
'ENST00000328596' AND AA1_VALID = 1 AND CDS = 1 and RSID <> 'novel';

4f. Calling database from script (example perl/DBI)

-----------Start of script------------------------

use DBI;

my $chr = "1"; #for example
my $table_chr = "chr1_1_10000855"; #for example

#Connect to database
my $db_chr =DBI->connect( "dbi:SQLite:dbname=$SIFT_HOME/db/Human_db/Human_CHR$chr.sqlite","", "", 
{ RaiseError => 1, AutoCommit => 1 } );
#$db_chr->do('PRAGMA synchronous=1'); #optional
#$db_chr->do('PRAGMA cache_size=4000'); #optional

#Prepared statment
my $sth_db_chr = $db_chr->prepare("select ENST,REGION,NT1,NT2,AA1,AA2,CODON1,CODON2,SCORE,RSID from 
$table_chr where ENST = ? AND AA1_VALID = 1 AND CDS = 1 and RSID <> \'novel\'");

#get output
my @rows;
my @query_result;
$sth_db_chr->execute("ENST00000328596");
while (@rows = $sth_db_chr->fetchrow_array()){
	push @query_result, join("\t",@rows);
}

#Print output
foreach my $row (@query_result){
	chomp $row;
	my @elts = split /\t/, $row;
	my $enst = $elts[0];
	my $region = $elts[1];
	my $nt1 = $elts[2];
	my $nt2 = $elts[3];
	my $aa1 = $elts[4];
	my $aa2 = $elts[5];
	my $codon1 = $elts[6];
	my $codon2 = $elts[7];
	my $score = $elts[8];
	my $rsid = $elts[9];
	
	#Do stuff with extracted columns or print complete row as foll..
	print "$row\n";
}

----------End of script-------------------------------

5. Description of Human_Supp database

This database accompanies the human chromosome databases described above and may be used to extract 
additional transcript and gene level information about the results obtained by querying the 
chromosome databases.

Tables
1. ALLELE_FREQ
2. GENE_INFO

5a. ALLELE_FREQ: This table provides allele frequency information for all SNPs documented in the Hapmap 
frequency database. The source of this data is Hapmap's NCBI 36 phase III data available at
http://hapmap.ncbi.nlm.nih.gov/downloads/frequencies/latest_phaseIII_ncbi_b36/fwd_strand/non-redundant/

Columns:
1|RSID|TEXT - dbSNP rsID
2|AVERAGE_FREQ_1|NUMERIC - Reference allele frequency (weighted average across all populations)
3|AVERAGE_FREQ_2|NUMERIC - Minor allele frequency (weighted average across all populations)
4|CEU_FREQ_1|NUMERIC - Reference allele frequency (CEU population)
5|CEU_FREQ_2|NUMERIC - Minor allele frequency (CEU population)

5b. GENE_INFO: This table provides transcript / gene level information compiled from Ensembl Biomart. 
In addition to SIFT scores, this information is useful for further prioritizing variants by looking 
at gene specific properties corresponding to the mutation.

Columns:
1|ENST|TEXT - Ensembl transcript ID
2|ENSP|TEXT - Ensembl protein ID
3|ENSG|TEXT - Ensembl gene ID
4|GENE_NAME|TEXT - Ensembl gene name
5|GENE_DESC|TEXT - Ensembl gene description
6|ENSFM|TEXT - Ensembl protein family ID
7|FAM_DESC|TEXT - Ensembl protein family description
8|GENE_STATUS|TEXT - Gene status [known/novel]
9|FAM_SIZE|INTEGER - Ensembl protein family size
10|KAKS_MOUSE|NUMERIC - Ka/Ks human - mouse (ratio of nonsynonymous to synonymous mutation rates)
11|KAKS_MACAQUE|NUMERIC - Ka/Ks human - macaque
12|MIM_STATUS|TEXT - OMIM disease if exists

6. Technical support

Questions and comments about this document should be submitted via SIFT contact form available at 
http://sift.jcvi.org/sift-bin/contact.pl