Thursday, July 9, 2009

How many yeast proteins have PDB structures?

Several ways to go:
1, from PDB website
http://www.pdb.org/pdb/home/home.do
Use the advanced search mode by clicking the button in the upper right corner. Choose Taxonomy and search for "Saccharomyces cerevisiae" or "cerevisiae" in the pop-up window. This returns 1615 structures as of 20090709. Click on them and go to the left navigation panel on the new page.
If you are only interested in the PDB IDs, it is easy: clicking "Results ID List" gives you the list. If you are interested in the structures themselves, use "Select All" and then "Download Selected". If you are interested in other information, as I am in the correspondence between UniProt IDs and PDB IDs (with chain information), use "Custom Report" beneath "Tabulate", choose whatever fields you want in the final table, create the report, and download it (CSV file format preferred). For 1615 structures, this takes quite some time if you include many fields.
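Once the custom report is downloaded, the UniProt-to-PDB correspondence can be pulled out of the CSV with a few lines of Python. This is only a sketch: the column names "PDB ID", "Chain ID" and "UniProt ID" are my assumption about what the report header looks like (check the first line of your file and adjust), and the rows in the sample are made up.

```python
import csv
import io
from collections import defaultdict


def uniprot_to_pdb_chains(report, pdb_col="PDB ID", chain_col="Chain ID",
                          uniprot_col="UniProt ID"):
    """Map each UniProt ID to a sorted list of (PDB ID, chain) pairs.

    `report` is any iterable of CSV lines (an open file works).
    The default column names are assumptions, not confirmed header names.
    """
    mapping = defaultdict(set)
    for row in csv.DictReader(report):
        accession = (row.get(uniprot_col) or "").strip()
        if not accession:
            continue  # chain without a UniProt cross-reference
        mapping[accession].add((row[pdb_col].strip().upper(),
                                row[chain_col].strip()))
    return {acc: sorted(pairs) for acc, pairs in mapping.items()}


# Made-up rows, only to show the expected shape of the report.
sample_report = io.StringIO(
    "PDB ID,Chain ID,UniProt ID\n"
    "0AAA,A,ACC_1\n"
    "0AAA,B,ACC_1\n"
    "0BBB,A,ACC_2\n"
    "0CCC,A,\n"
)

mapping = uniprot_to_pdb_chains(sample_report)
print(len(mapping), "proteins with at least one structure")
for accession, chains in sorted(mapping.items()):
    print(accession, chains)
```

With a real report, pass `open("report.csv", newline="")` instead of the sample. The number of keys in the mapping is then the number of distinct proteins with a structure, which is usually smaller than the number of structures.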

2, from UniProt website.
http://www.uniprot.org/uniprot/
Entering taxonomy:4932 in the query box returns all yeast proteins (taxonomy ID: 4932).

Wednesday, July 8, 2009

Things about PDB

1, As of July 2009, PDB contains almost 60,000 structures, of which about 54,000 are proteins, and of those 47,000 are X-ray structures.

2, You can download individual PDB files from ftp://ftp.wwpdb.org/pub/pdb/data/structures/all/pdb/ or http://www.rcsb.org/pdb/files/. For example,
wget ftp://ftp.wwpdb.org/pub/pdb/data/structures/all/pdb/pdb1a2k.ent.gz
wget http://www.rcsb.org/pdb/files/4hhb.pdb.gz
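The downloaded file can also be read with BioPerl, after decompressing it (e.g. gunzip pdb1a2k.ent.gz):
use strict;
use warnings;
use Bio::Structure::IO;

# Open the decompressed PDB file as a stream of structures
my $in = Bio::Structure::IO->new(-file   => "pdb1a2k.ent",
                                 -format => 'pdb');

# Print the ID of every structure found in the file
while ( my $struc = $in->next_structure() ) {
    print "Structure ", $struc->id, "\n";
}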
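
When you need more than a couple of files, it helps to build the download addresses from the PDB ID instead of typing them. The sketch below only reproduces the two naming patterns from the wget examples above; the function names are mine, and I am assuming the IDs have to be lowercase in the file name, as they are in those examples. The servers may of course move these directories.

```python
WWPDB_FTP_DIR = "ftp://ftp.wwpdb.org/pub/pdb/data/structures/all/pdb/"
RCSB_HTTP_DIR = "http://www.rcsb.org/pdb/files/"


def normalize_pdb_id(pdb_id):
    """Return the 4-character PDB ID in lowercase, or raise ValueError."""
    cleaned = pdb_id.strip().lower()
    if len(cleaned) != 4 or not (cleaned.isascii() and cleaned.isalnum()):
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    return cleaned


def wwpdb_ftp_url(pdb_id):
    """wwPDB FTP pattern: pdb<id>.ent.gz (as in the first wget example)."""
    return f"{WWPDB_FTP_DIR}pdb{normalize_pdb_id(pdb_id)}.ent.gz"


def rcsb_http_url(pdb_id):
    """RCSB HTTP pattern: <id>.pdb.gz (as in the second wget example)."""
    return f"{RCSB_HTTP_DIR}{normalize_pdb_id(pdb_id)}.pdb.gz"


for pdb_id in ["1A2K", "4hhb"]:
    print(wwpdb_ftp_url(pdb_id))
    print(rcsb_http_url(pdb_id))
```

The printed lines can be saved to a text file and handed to wget -i, one URL per line.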

3, EBI has some curated information on PDB structures, check here:

ftp://ftp.ebi.ac.uk/pub/databases/rcsb/pdb-remediated/

4, PDB sequences can be downloaded from the NCBI website, so you do not need to generate them yourself by parsing the structures, which is error-prone.
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/pdbaa.gz
5, Each entry in pdbaa represents one sequence, but it can correspond to multiple chains from different structures. The sequence has an NCBI gi, which makes it easy to track. The residue numbering in .pdb files is based on these sequences. A PDB chain usually contains only part of the sequence. Sometimes it contains extra residues, but usually this does not matter much.
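
To see which chains share one pdbaa sequence, the FASTA headers can be split up with a short script. This is a rough sketch under an assumption about the header layout: I am assuming each chain appears as gi|<number>|pdb|<ID>|<chain>, and that the headers of identical chains are merged into one line separated by the Ctrl-A character (\x01). Please check a few lines of your own pdbaa file first. The sample records below are invented.

```python
import io
import re

# Assumed defline layout: gi|<number>|pdb|<4-char ID>|<chain> <description>
DEFLINE_RE = re.compile(r"gi\|(\d+)\|pdb\|([0-9A-Za-z]{4})\|([^\s\x01]*)")


def make_record(header, seq_lines):
    """Collect all gi numbers and (PDB ID, chain) pairs found in one header."""
    hits = DEFLINE_RE.findall(header)
    return {
        "gis": [int(gi) for gi, _, _ in hits],
        "chains": [(pdb_id.upper(), chain) for _, pdb_id, chain in hits],
        "sequence": "".join(seq_lines),
    }


def parse_pdbaa(lines):
    """Yield one record per FASTA entry from an iterable of lines."""
    header, seq_lines = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield make_record(header, seq_lines)
            header, seq_lines = line[1:], []
        elif line:
            seq_lines.append(line)
    if header is not None:
        yield make_record(header, seq_lines)


# Invented entries that follow the assumed layout.
sample = (
    ">gi|1|pdb|0AAA|A Chain A, Example\x01gi|2|pdb|0BBB|B Chain B, Example\n"
    "MKT\n"
    "AAA\n"
    ">gi|3|pdb|0CCC|A Chain A, Other\n"
    "GG\n"
)

records = list(parse_pdbaa(io.StringIO(sample)))
for record in records:
    print(record["gis"], record["chains"], len(record["sequence"]))
```

For the real file, decompress pdbaa.gz and pass the open file handle to parse_pdbaa in place of the sample.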