The Human Variation Databases for SIFT, Human Genome Assembly, NCBI Build 36

Last updated in December 2008

This document describes the human variation databases used by the SIFT exome tool for nonsynonymous single nucleotide variants (nsSNVs). The databases are available on the JCVI FTP site under the /pub/data/sift/Human_db_36 directory. The direct URL is:

ftp://ftp.jcvi.org/pub/data/sift/Human_db_36/

To download the contents of the FTP site from a Linux terminal, use "anonymous" as the username and your email address as the password, as follows:

>ftp ftp.jcvi.org
Connected to ftp.jcvi.org (172.20.17.13).
220 JCVI FTP Server
Name (ftp.jcvi.org:pkumar):anonymous
331 Anonymous login ok, send your complete email address as your password
Password:username@domain.com

1. General Introduction

The SIFT exome tool for nsSNVs (http://sift.jcvi.org/www/SIFT_chr_coords_submit.html) characterizes nonsynonymous single nucleotide variants across the entire human exome for their effect on protein function. The tool uses SQLite databases, built using SQLite 3.6.13, containing precomputed SIFT scores and annotation for each position in the human exome (with all possible nucleotide substitutions). For more information, please view the exome tool help page (http://sift.jcvi.org/www/chr_coords_example.html).

2. Contents of the /pub/data/sift/Human_db_36/ directory

The human genome variation databases backing the exome tool mentioned above were built using SQLite version 3.6.13, a self-contained, serverless SQL database engine that can be downloaded from http://www.sqlite.org/. There are 24 SQLite databases in gzipped format, one for each chromosome of the human genome: chromosomes 1 through 22, X and Y. The chromosome files can be unzipped using the gunzip utility on Linux (http://www.gzip.org/) or WinRAR (http://www.rarlab.com/) on Windows.

3. Structure and contents of the databases

Each SQLite chromosome database, when unzipped, can be examined using the SQLite database engine. For more information on how to use SQLite, please visit http://www.sqlite.org/.

Each database is segmented into tables, each covering a consecutive interval of about 10 million bases, which together span the entire human exome. For example, the tables representing the first and second 10 million bases of chromosome 1 are named chr1_1_10000855 and chr1_9994482_20000000 respectively. Notice that the absolute start and stop coordinates may not be exact multiples of 10 million. This ensures that if an exonic region is shared between two intervals, it is included completely within at least one of the intervals.

Each table contains the following columns:

1|CHR|TEXT - Human chromosome [1-22, X, Y]
2|COORD1|NUMERIC - Space based chromosome coordinate 1
3|COORD2|NUMERIC - Space based chromosomal nucleotide coordinate 2 (also absolute nucleotide coordinate)
4|ORN|TEXT - Strand orientation [1,-1]
5|RSID|TEXT - dbSNP rsID if exists
6|ENSG|TEXT - Ensembl gene ID, NULL if non-coding
7|ENST|TEXT - Ensembl transcript ID, NULL if non-coding
8|ENSP|TEXT - Ensembl protein ID, NULL if non-coding
9|REGION|TEXT - [CDS, intron, 3'UTR, 5'UTR, DOWNSTREAM, UPSTREAM]
10|SNP|TEXT - [Synonymous, Nonsynonymous], NULL if non-coding
11|NT1|CHAR - Reference nucleotide at this position [A,T,G,C]
12|NT2|CHAR - Altered nucleotide at this position [A,T,G,C]
13|NTPOS1|NUMERIC - Space based mRNA coordinate 1
14|NTPOS2|NUMERIC - Space based mRNA coordinate 2 (also absolute nucleotide coordinate)
15|CODON1|TEXT - Reference codon [GCT] for example
16|CODON2|TEXT - Altered codon [GaT] (altered base in lowercase)
17|AA1|CHAR - Reference amino acid at this position
18|AA2|CHAR - Altered amino acid at this position
19|AAPOS1|NUMERIC - Space based amino acid coordinate 1
20|AAPOS2|NUMERIC - Space based amino acid coordinate 2 (also absolute residue coordinate)
21|CDS|NUMERIC - 1 if coordinate in CDS, 0 otherwise
22|AA1_VALID|INTEGER - 0 if reference amino acid does not agree with the actual amino acid in protein due to Ensembl annotation error
23|ENST_VALID|INTEGER - Deprecated
24|SCORE|NUMERIC - SIFT score [0-1] - for more info see http://sift.jcvi.org
25|MEDIAN|NUMERIC - SIFT median information content
26|SEQS_REP|INTEGER - Number of homologs having reference amino acid at this position

4. Downloading and exploring the database

The chromosome databases can either be explored using the SQLite database engine or be integrated into existing pipelines written in various languages.

4a. After downloading the contents of this FTP location, place them in the $SIFT_HOME/db/Human_db directory and make sure the SQLite database engine is installed.

4b. Within the Human_db directory, launch SQLite with one of the chromosome databases:

>sqlite3 Human_CHR1.sqlite

4c. The .tables command will display all the 10 million base interval tables in this database:

chr1_110000001_120000000  chr1_199859224_210070737  chr1_48771114_60000988
chr1_120000001_130000000  chr1_1_10000855           chr1_59535241_70361752
chr1_130000001_140000000  chr1_20000001_30000000    chr1_69806669_80000000
chr1_140000001_150002664  chr1_209983422_220000000  chr1_80000001_90000000
chr1_149998743_160000000  chr1_220000001_230243641  chr1_90000001_100003937
chr1_160000001_170000000  chr1_229829184_240031580  chr1_9994482_20000000
chr1_170000001_180037339  chr1_239882203_250000000  chr1_99946847_110000000
chr1_179648918_190000000  chr1_30000001_40027120
chr1_190000001_200062720  chr1_39977117_50262172

4d. You may view the table_info / column names using the following command:

sqlite> PRAGMA table_info (chr1_1_10000855);
0|CHR|TEXT|1||1
1|COORD1|NUMERIC|1||0
2|COORD2|NUMERIC|1||1
3|ORN|TEXT|0||1
4|RSID|TEXT|0||0
5|ENSG|TEXT|0||0
6|ENST|TEXT|0||1
7|ENSP|TEXT|0||0
8|REGION|TEXT|0||0
9|SNP|TEXT|0||0
10|NT1|CHAR|0||0
11|NT2|CHAR|0||1
12|NTPOS1|NUMERIC|0||0
13|NTPOS2|NUMERIC|0||0
14|CODON1|TEXT|0||0
15|CODON2|TEXT|0||0
16|AA1|CHAR|0||0
17|AA2|CHAR|0||1
18|AAPOS1|NUMERIC|0||0
19|AAPOS2|NUMERIC|0||0
20|CDS|NUMERIC|0||0
21|AA1_VALID|INTEGER|0||0
22|ENST_VALID|INTEGER|0||0
23|SCORE|NUMERIC|0||0
24|MEDIAN|NUMERIC|0||0
25|SEQS_REP|INTEGER|0||0

Note that PRAGMA table_info numbers the columns from 0, whereas the column list in section 3 numbers them from 1.

4e. Sample queries:

select * from chr1_1_10000855 where ENST = 'ENST00000328596';

select ENST,REGION,NT1,NT2,AA1,AA2,CODON1,CODON2,SCORE,RSID from chr1_1_10000855 where ENST = 'ENST00000328596' AND AA1_VALID = 1 AND CDS = 1 and RSID <> 'novel';

4f. Calling database from script (example perl/DBI)

-----------Start of script------------------------
use strict;
use warnings;
use DBI;

my $chr       = "1";                #for example
my $table_chr = "chr1_1_10000855";  #for example

#Location of the SIFT installation, read from the SIFT_HOME environment variable
my $sift_home = $ENV{SIFT_HOME} or die "SIFT_HOME is not set\n";

#Connect to database
my $db_chr = DBI->connect(
    "dbi:SQLite:dbname=$sift_home/db/Human_db/Human_CHR${chr}.sqlite",
    "", "", { RaiseError => 1, AutoCommit => 1 } );
#$db_chr->do('PRAGMA synchronous=1');   #optional
#$db_chr->do('PRAGMA cache_size=4000'); #optional

#Prepared statement
my $sth_db_chr = $db_chr->prepare(
    "select ENST,REGION,NT1,NT2,AA1,AA2,CODON1,CODON2,SCORE,RSID from $table_chr where ENST = ? AND AA1_VALID = 1 AND CDS = 1 and RSID <> 'novel'" );

#Get output (NULL values are converted to empty strings)
my @query_result;
$sth_db_chr->execute("ENST00000328596");
while ( my @rows = $sth_db_chr->fetchrow_array() ) {
    push @query_result, join( "\t", map { defined $_ ? $_ : "" } @rows );
}

#Print output
foreach my $row (@query_result) {
    my ( $enst, $region, $nt1, $nt2, $aa1, $aa2, $codon1, $codon2, $score, $rsid ) = split /\t/, $row, -1;
    #Do stuff with extracted columns or print complete row as follows
    print "$row\n";
}

$sth_db_chr->finish;
$db_chr->disconnect;
----------End of script-------------------------------

5. Description of Human_Supp database

This database accompanies the human chromosome databases described above and may be used to extract additional transcript and gene level information about the results obtained by querying the chromosome databases.

Tables:
1. ALLELE_FREQ
2. GENE_INFO

5a. ALLELE_FREQ: This table provides allele frequency information for all SNPs documented in the HapMap frequency database. The source of this data is HapMap's NCBI 36 phase III data available at http://hapmap.ncbi.nlm.nih.gov/downloads/frequencies/latest_phaseIII_ncbi_b36/fwd_strand/non-redundant/

Columns:
1|RSID|TEXT - dbSNP rsID
2|AVERAGE_FREQ_1|NUMERIC - Reference allele frequency (weighted average across all populations)
3|AVERAGE_FREQ_2|NUMERIC - Minor allele frequency (weighted average across all populations)
4|CEU_FREQ_1|NUMERIC - Reference allele frequency (CEU population)
5|CEU_FREQ_2|NUMERIC - Minor allele frequency (CEU population)

5b. GENE_INFO: This table provides transcript / gene level information compiled from Ensembl Biomart. In addition to SIFT scores, this information is useful for further prioritizing variants by looking at gene specific properties corresponding to the mutation.

Columns:
1|ENST|TEXT - Ensembl transcript ID
2|ENSP|TEXT - Ensembl protein ID
3|ENSG|TEXT - Ensembl gene ID
4|GENE_NAME|TEXT - Ensembl gene name
5|GENE_DESC|TEXT - Ensembl gene description
6|ENSFM|TEXT - Ensembl protein family ID
7|FAM_DESC|TEXT - Ensembl protein family description
8|GENE_STATUS|TEXT - Gene status [known/novel]
9|FAM_SIZE|INTEGER - Ensembl protein family size
10|KAKS_MOUSE|NUMERIC - Ka/Ks human - mouse (ratio of nonsynonymous to synonymous mutation rates)
11|KAKS_MACAQUE|NUMERIC - Ka/Ks human - macaque
12|MIM_STATUS|TEXT - OMIM disease if exists

6. Technical support

Questions and comments about this document should be submitted via the SIFT contact form available at http://sift.jcvi.org/sift-bin/contact.pl
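
Supplementary example for section 3 (locating the interval table for a coordinate): Because the interval tables overlap at their edges, a given chromosome coordinate may fall within one or two tables. The following is a minimal Python sketch, not part of SIFT, that lists the tables whose name range covers a coordinate. It assumes that table names always follow the chr<CHR>_<START>_<END> pattern shown in section 4c and that the bounds in the table name can be treated as inclusive; the helper names parse_interval_table and tables_covering are hypothetical.

```python
import re
import sqlite3

# Assumed naming pattern, taken from the examples above: chr<CHR>_<START>_<END>
_TABLE_RE = re.compile(r"^chr([0-9]+|X|Y)_(\d+)_(\d+)$")


def parse_interval_table(name):
    """Return (chromosome, start, end) for an interval table name, else None."""
    match = _TABLE_RE.match(name)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


def tables_covering(conn, chrom, coord):
    """List interval tables whose name range contains coord.

    Bounds are treated as inclusive, which is an assumption of this sketch.
    Results are ordered by interval start.
    """
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    hits = []
    for name in names:
        parsed = parse_interval_table(name)
        if parsed is None:
            continue  # skip tables that do not follow the naming pattern
        table_chrom, start, end = parsed
        if table_chrom == str(chrom) and start <= coord <= end:
            hits.append((start, name))
    return [name for _, name in sorted(hits)]
```

To use it, open a chromosome database with sqlite3.connect (for example on Human_CHR1.sqlite) and pass the connection to tables_covering. Based on the table list in section 4c, coordinate 9995000 on chromosome 1 lies within the name ranges of both chr1_1_10000855 and chr1_9994482_20000000, so both tables would need to be queried for that position.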