The Human Variation Databases for SIFT, Human Genome Assembly, NCBI Build 36
Last updated in December 2008
This document describes the human variation databases used by the SIFT exome tool for
nonsynonymous single nucleotide variants (nsSNVs). The databases are available on the JCVI FTP site
under the /pub/data/sift/Human_db_36 directory. The direct URL is:
ftp://ftp.jcvi.org/pub/data/sift/Human_db_36/
To download the contents of the FTP site from a Linux terminal, log in with the username "anonymous" and
your email address as the password, as follows:
>ftp ftp.jcvi.org
Connected to ftp.jcvi.org (172.20.17.13).
220 JCVI FTP Server
Name (ftp.jcvi.org:pkumar):anonymous
331 Anonymous login ok, send your complete email address as your password
Password:username@domain.com
1. General Introduction
The SIFT exome tool for nsSNVs (http://sift.jcvi.org/www/SIFT_chr_coords_submit.html) characterizes
nonsynonymous single nucleotide variants across the entire human exome by their effect on protein
function. The tool uses SQLite databases, built with SQLite 3.6.13, containing precomputed SIFT scores
and annotation for each position (with all possible nucleotide substitutions) in the human exome.
For more information, please view the exome tool help page
(http://sift.jcvi.org/www/chr_coords_example.html).
2. Contents of the /pub/data/sift/Human_db_36/ directory
The human genome variation databases backing the exome tool mentioned above were built using SQLite
version 3.6.13, a self-contained, serverless SQL database engine that can be downloaded from
http://www.sqlite.org/.
There are 24 SQLite databases in gzipped format, each corresponding to one chromosome of the human
genome: chromosomes 1 through 22, X and Y. The chromosome files can be unzipped using the gunzip utility
on Linux (http://www.gzip.org/) or WinRAR (http://www.rarlab.com/) on Windows.
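Where neither utility is at hand, the files can also be decompressed from a script. The following is a
minimal sketch using Python's standard gzip module. The function name gunzip_file is hypothetical, and
the file name in the usage note is an assumption based on the database name shown in section 4b; check
the actual names on the FTP site.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path, keep_original=True):
    """Decompress 'name.gz' to 'name' in the same directory and return the new path."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path}")
    out_path = gz_path.with_suffix("")  # strips only the trailing '.gz'
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
    if not keep_original:
        gz_path.unlink()
    return out_path
```

Example call (assumed file name): gunzip_file("Human_CHR1.sqlite.gz") would write Human_CHR1.sqlite next to
the compressed file.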
3. Structure and contents of the databases
Each SQLite chromosome database, when unzipped, can be examined using the SQLite database engine. For
more information on how to use SQLite, please visit http://www.sqlite.org/.
Each database is segmented into tables, each covering a consecutive interval of roughly 10 million bases.
For example, the tables representing the first and second 10 million bases of chromosome 1 are named
chr1_1_10000855 and chr1_9994482_20000000 respectively. Note that the start and stop coordinates are not
always exact multiples of 10 million, so adjacent intervals can overlap. This ensures that an exonic
region straddling an interval boundary is included completely within at least one of the intervals.
Each table has the following columns (index|name|type - description):
1|CHR|TEXT - Human chromosome [1-22, X, Y]
2|COORD1|NUMERIC - Space based chromosome coordinate 1
3|COORD2|NUMERIC - Space based chromosomal nucleotide coordinate 2 (also absolute nucleotide coordinate)
4|ORN|TEXT - Strand orientation [1,-1]
5|RSID|TEXT - dbSNP rsID if exists
6|ENSG|TEXT - Ensembl gene ID, NULL if non-coding
7|ENST|TEXT - Ensembl transcript ID, NULL if non-coding
8|ENSP|TEXT - Ensembl protein ID, NULL if non-coding
9|REGION|TEXT - [CDS, intron, 3'UTR, 5'UTR, DOWNSTREAM, UPSTREAM]
10|SNP|TEXT - [Synonymous, Nonsynonymous], NULL if non-coding
11|NT1|CHAR - Reference nucleotide at this position [A,T,G,C]
12|NT2|CHAR - Altered nucleotide at this position [A,T,G,C]
13|NTPOS1|NUMERIC - Space based mRNA coordinate 1
14|NTPOS2|NUMERIC - Space based mRNA coordinate 2 (also absolute nucleotide coordinate)
15|CODON1|TEXT - Reference codon [GCT] for example
16|CODON2|TEXT - Altered codon [GaT] (altered base in lowercase)
17|AA1|CHAR - Reference amino acid at this position
18|AA2|CHAR - Altered amino acid at this position
19|AAPOS1|NUMERIC - Space based amino acid coordinate 1
20|AAPOS2|NUMERIC - Space based amino acid coordinate 2 (also absolute residue coordinate)
21|CDS|NUMERIC - 1 if coordinate in CDS, 0 otherwise
22|AA1_VALID|INTEGER - 0 if reference amino acid does not agree with the actual amino acid in protein
due to Ensembl annotation error
23|ENST_VALID|INTEGER - Deprecated
24|SCORE|NUMERIC - SIFT score [0-1] - for more info see http://sift.jcvi.org
25|MEDIAN|NUMERIC - SIFT median information content
26|SEQS_REP|INTEGER - Number of homologs having reference amino acid at this position
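The paired coordinate columns (COORD1/COORD2, NTPOS1/NTPOS2, AAPOS1/AAPOS2) can be confusing at first. The
sketch below assumes the usual space-based (interbase) convention, in which a single base at 1-based
position p lies between coordinates p-1 and p, so that COORD1 = COORD2 - 1 for a single-nucleotide
position. This relationship is inferred from the column descriptions above (coordinate 2 is described as
the absolute coordinate) and has not been checked against the data; the helper names are hypothetical.

```python
def to_space_based(pos_1based):
    """Return (coord1, coord2) for a single base at a 1-based position."""
    if pos_1based < 1:
        raise ValueError("1-based positions start at 1")
    return pos_1based - 1, pos_1based


def to_one_based(coord1, coord2):
    """Return the 1-based position of a single-base space-based interval."""
    if coord2 - coord1 != 1:
        raise ValueError("not a single-base interval")
    return coord2
```

Under this assumption, a query by absolute position only needs COORD2.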
4. Downloading and exploring the database
The chromosome databases can either be explored using the SQLite database engine or be integrated into
existing pipelines written in various languages.
4a. After downloading the contents of this FTP location, place them in the $SIFT_HOME/db/Human_db
directory and make sure the SQLite command-line shell (sqlite3) is installed.
4b. Within the Human_db directory, launch SQLite with one of the chromosome databases
>sqlite3 Human_CHR1.sqlite
4c. The .tables command will display all the 10 million base interval tables in this database:
chr1_110000001_120000000 chr1_199859224_210070737 chr1_48771114_60000988
chr1_120000001_130000000 chr1_1_10000855 chr1_59535241_70361752
chr1_130000001_140000000 chr1_20000001_30000000 chr1_69806669_80000000
chr1_140000001_150002664 chr1_209983422_220000000 chr1_80000001_90000000
chr1_149998743_160000000 chr1_220000001_230243641 chr1_90000001_100003937
chr1_160000001_170000000 chr1_229829184_240031580 chr1_9994482_20000000
chr1_170000001_180037339 chr1_239882203_250000000 chr1_99946847_110000000
chr1_179648918_190000000 chr1_30000001_40027120
chr1_190000001_200062720 chr1_39977117_50262172
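Because the interval boundaries are irregular, it helps to pick the table for a given coordinate
programmatically. The following is a minimal sketch that parses the table names listed above. It assumes
that all chromosomes follow the chr<name>_<start>_<stop> pattern seen for chromosome 1 and that both
bounds are inclusive; the function names are hypothetical.

```python
import re

# Pattern assumed from the chromosome 1 table names shown above.
TABLE_RE = re.compile(r"^chr([0-9]{1,2}|X|Y)_(\d+)_(\d+)$")


def parse_table_name(name):
    """Split 'chr1_9994482_20000000' into ('1', 9994482, 20000000)."""
    m = TABLE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized table name: {name}")
    return m.group(1), int(m.group(2)), int(m.group(3))


def tables_covering(table_names, chrom, coord):
    """Return the tables on `chrom` whose interval contains `coord`, sorted by start.

    Adjacent intervals may overlap, so more than one table can match.
    """
    hits = []
    for name in table_names:
        c, start, end = parse_table_name(name)
        if c == str(chrom) and start <= coord <= end:
            hits.append((start, name))
    return [name for _, name in sorted(hits)]
```

For a coordinate inside an overlap, such as 9995000 on chromosome 1, both chr1_1_10000855 and
chr1_9994482_20000000 are returned, and either table may hold the position.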
4d. You can view the table_info output (column names and types) using the following command:
sqlite> PRAGMA table_info (chr1_1_10000855);
0|CHR|TEXT|1||1
1|COORD1|NUMERIC|1||0
2|COORD2|NUMERIC|1||1
3|ORN|TEXT|0||1
4|RSID|TEXT|0||0
5|ENSG|TEXT|0||0
6|ENST|TEXT|0||1
7|ENSP|TEXT|0||0
8|REGION|TEXT|0||0
9|SNP|TEXT|0||0
10|NT1|CHAR|0||0
11|NT2|CHAR|0||1
12|NTPOS1|NUMERIC|0||0
13|NTPOS2|NUMERIC|0||0
14|CODON1|TEXT|0||0
15|CODON2|TEXT|0||0
16|AA1|CHAR|0||0
17|AA2|CHAR|0||1
18|AAPOS1|NUMERIC|0||0
19|AAPOS2|NUMERIC|0||0
20|CDS|NUMERIC|0||0
21|AA1_VALID|INTEGER|0||0
22|ENST_VALID|INTEGER|0||0
23|SCORE|NUMERIC|0||0
24|MEDIAN|NUMERIC|0||0
25|SEQS_REP|INTEGER|0||0
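In the output above, the six fields of each row are SQLite's standard table_info fields: column index,
name, declared type, not-null flag, default value and primary-key flag. The same information can be read
from a script; the sketch below uses Python's standard sqlite3 module, and the function name column_types
is hypothetical.

```python
import sqlite3


def column_types(conn, table):
    """Return {column_name: declared_type} for `table` using PRAGMA table_info."""
    # PRAGMA arguments cannot be bound as parameters, so validate the name first.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table}")
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return {name: ctype for _cid, name, ctype, *_rest in rows}
```

Calling column_types(conn, "chr1_1_10000855") on an open connection to a chromosome database should list
the 26 columns described in section 3.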
4e. Sample queries:
select * from chr1_1_10000855 where ENST = 'ENST00000328596';
select ENST,REGION,NT1,NT2,AA1,AA2,CODON1,CODON2,SCORE,RSID from chr1_1_10000855 where ENST =
'ENST00000328596' AND AA1_VALID = 1 AND CDS = 1 and RSID <> 'novel';
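The sample queries above select by transcript. A lookup by chromosome position and altered base, closer to
what the exome tool does, might look like the following sketch. It is an illustration only: the function
name lookup_variant is hypothetical, the query ignores the ORN column, and it assumes NT2 is stored as an
uppercase base as in the column description.

```python
import sqlite3

COLUMNS = "ENST, REGION, NT1, NT2, AA1, AA2, SCORE, RSID"


def lookup_variant(conn, table, coord2, alt_base):
    """Return rows for a substitution to `alt_base` at absolute coordinate `coord2`."""
    # Table names cannot be bound as parameters, so validate before interpolating.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table}")
    sql = (
        f"SELECT {COLUMNS} FROM {table} "
        "WHERE COORD2 = ? AND NT2 = ? AND CDS = 1 AND AA1_VALID = 1"
    )
    return conn.execute(sql, (coord2, alt_base.upper())).fetchall()
```

One row is returned per transcript overlapping the position, so a single variant can yield several rows.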
4f. Calling the database from a script (example in Perl/DBI)
-----------Start of script------------------------
use strict;
use warnings;
use DBI;
my $chr = "1"; #for example
my $table_chr = "chr1_1_10000855"; #for example
#Read SIFT_HOME from the environment (Perl does not expand shell variables inside strings)
my $SIFT_HOME = $ENV{SIFT_HOME} or die "SIFT_HOME is not set\n";
#Connect to database
my $db_chr = DBI->connect( "dbi:SQLite:dbname=$SIFT_HOME/db/Human_db/Human_CHR$chr.sqlite", "", "",
{ RaiseError => 1, AutoCommit => 1 } );
#$db_chr->do('PRAGMA synchronous=1'); #optional
#$db_chr->do('PRAGMA cache_size=4000'); #optional
#Prepared statement (the ? placeholder is filled in by execute below)
my $sth_db_chr = $db_chr->prepare("select ENST,REGION,NT1,NT2,AA1,AA2,CODON1,CODON2,SCORE,RSID from
$table_chr where ENST = ? AND AA1_VALID = 1 AND CDS = 1 and RSID <> 'novel'");
#get output
my @rows;
my @query_result;
$sth_db_chr->execute("ENST00000328596");
while (@rows = $sth_db_chr->fetchrow_array()){
push @query_result, join("\t",@rows);
}
#Print output
foreach my $row (@query_result){
chomp $row;
my @elts = split /\t/, $row;
my $enst = $elts[0];
my $region = $elts[1];
my $nt1 = $elts[2];
my $nt2 = $elts[3];
my $aa1 = $elts[4];
my $aa2 = $elts[5];
my $codon1 = $elts[6];
my $codon2 = $elts[7];
my $score = $elts[8];
my $rsid = $elts[9];
#Process the extracted columns here, or print the complete row as follows
print "$row\n";
}
----------End of script-------------------------------
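For pipelines not written in Perl, the same query can be issued from other languages. The following is a
rough Python equivalent using the standard sqlite3 module; it is a sketch, not part of the SIFT
distribution. The function names are hypothetical, and the path layout simply mirrors the one used in
step 4a.

```python
import os
import sqlite3


def default_db_path(chrom):
    """Build the database path used in this document; requires SIFT_HOME to be set."""
    return os.path.join(os.environ["SIFT_HOME"], "db", "Human_db", f"Human_CHR{chrom}.sqlite")


def query_transcript(db_path, table, enst):
    """Fetch dbSNP-annotated coding variants for one transcript as a list of dicts."""
    # Table names cannot be bound as parameters, so validate before interpolating.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table}")
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # allows access by column name
    try:
        sql = (
            "SELECT ENST, REGION, NT1, NT2, AA1, AA2, CODON1, CODON2, SCORE, RSID "
            f"FROM {table} "
            "WHERE ENST = ? AND AA1_VALID = 1 AND CDS = 1 AND RSID <> 'novel'"
        )
        return [dict(row) for row in conn.execute(sql, (enst,))]
    finally:
        conn.close()
```

A call such as query_transcript(default_db_path("1"), "chr1_1_10000855", "ENST00000328596") corresponds to
the Perl example above; each returned dict is keyed by column name, for example row["SCORE"].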
5. Description of Human_Supp database
This database accompanies the human chromosome databases described above and may be used to extract
additional transcript and gene level information about the results obtained by querying the
chromosome databases.
Tables
1. ALLELE_FREQ
2. GENE_INFO
5a. ALLELE_FREQ: This table provides allele frequency information for all SNPs documented in the HapMap
frequency database. The source of this data is HapMap's NCBI 36 phase III data available at
http://hapmap.ncbi.nlm.nih.gov/downloads/frequencies/latest_phaseIII_ncbi_b36/fwd_strand/non-redundant/
Columns:
1|RSID|TEXT - dbSNP rsID
2|AVERAGE_FREQ_1|NUMERIC - Reference allele frequency (weighted average across all populations)
3|AVERAGE_FREQ_2|NUMERIC - Minor allele frequency (weighted average across all populations)
4|CEU_FREQ_1|NUMERIC - Reference allele frequency (CEU population)
5|CEU_FREQ_2|NUMERIC - Minor allele frequency (CEU population)
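The rsIDs returned from the chromosome tables can be used to look up frequencies in this table. The sketch
below is one way to do that in Python; it assumes RSID values are written identically in both databases,
which this document does not state, and the function name allele_frequencies is hypothetical.

```python
import sqlite3


def allele_frequencies(supp_conn, rsids):
    """Return {rsid: (avg_freq_1, avg_freq_2, ceu_freq_1, ceu_freq_2)} for rsIDs found."""
    # Drop duplicates, empty values and the 'novel' marker used in the chromosome tables.
    rsids = [r for r in dict.fromkeys(rsids) if r and r != "novel"]
    if not rsids:
        return {}
    placeholders = ",".join("?" for _ in rsids)
    sql = (
        "SELECT RSID, AVERAGE_FREQ_1, AVERAGE_FREQ_2, CEU_FREQ_1, CEU_FREQ_2 "
        f"FROM ALLELE_FREQ WHERE RSID IN ({placeholders})"
    )
    return {row[0]: tuple(row[1:]) for row in supp_conn.execute(sql, rsids)}
```

SQLite limits the number of bound parameters per statement, so long rsID lists should be passed in
batches.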
5b. GENE_INFO: This table provides transcript / gene level information compiled from Ensembl Biomart.
Used alongside SIFT scores, this information helps to further prioritize variants by the gene-specific
properties of the gene carrying the mutation.
Columns:
1|ENST|TEXT - Ensembl transcript ID
2|ENSP|TEXT - Ensembl protein ID
3|ENSG|TEXT - Ensembl gene ID
4|GENE_NAME|TEXT - Ensembl gene name
5|GENE_DESC|TEXT - Ensembl gene description
6|ENSFM|TEXT - Ensembl protein family ID
7|FAM_DESC|TEXT - Ensembl protein family description
8|GENE_STATUS|TEXT - Gene status [known/novel]
9|FAM_SIZE|INTEGER - Ensembl protein family size
10|KAKS_MOUSE|NUMERIC - Ka/Ks human - mouse (ratio of nonsynonymous to synonymous mutation rates)
11|KAKS_MACAQUE|NUMERIC - Ka/Ks human - macaque
12|MIM_STATUS|TEXT - OMIM disease if exists
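Since GENE_INFO and the chromosome tables share the ENST column, the two databases can be combined in one
query by attaching the supplementary database. The following is a minimal sketch under stated
assumptions: the path to the Human_Supp file is supplied by the caller because its file name is not given
here, the SNP column is assumed to hold the literal value 'Nonsynonymous' as in the column description,
and the function name is hypothetical.

```python
import sqlite3


def annotate_with_gene_info(conn, table, enst, supp_path):
    """Join nonsynonymous rows for `enst` with GENE_INFO from the attached database."""
    # Table names cannot be bound as parameters, so validate before interpolating.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table}")
    conn.execute("ATTACH DATABASE ? AS supp", (supp_path,))
    try:
        sql = (
            "SELECT v.COORD2, v.NT1, v.NT2, v.AA1, v.AA2, v.SCORE, "
            "g.GENE_NAME, g.MIM_STATUS "
            f"FROM {table} AS v "
            "LEFT JOIN supp.GENE_INFO AS g ON g.ENST = v.ENST "
            "WHERE v.ENST = ? AND v.SNP = 'Nonsynonymous' "
            "ORDER BY v.COORD2"
        )
        return conn.execute(sql, (enst,)).fetchall()
    finally:
        conn.execute("DETACH DATABASE supp")
```

The LEFT JOIN keeps variant rows even when a transcript has no GENE_INFO entry; in that case the gene
columns come back as NULL (None in Python).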
6. Technical support
Questions and comments about this document should be submitted via the SIFT contact form available at
http://sift.jcvi.org/sift-bin/contact.pl