The Protein page provides detailed information about the protein encoded by a particular gene. This page contains locus-specific nomenclature and protein product information; a brief description of the role of the protein within the cell; the predicted primary protein sequence; detailed domain/motif information; and basic information derived from the protein sequence, including physico-chemical properties and other values. Links provide access to detailed prediction-based information, manually curated and referenced information, and various external resources.
The Protein Overview section contains several fields of nomenclature for the protein in question. If a field doesn't have a value, it will not be listed. Potential fields that may be listed under this heading include:
The Experimental Data section contains two subsections, one for protein half-life and the other for protein abundance. Proteome-wide, steady-state protein turnover rates (i.e. protein half-lives) were calculated under standard growth conditions in synthetic medium using pulse stable isotope labeling by amino acids in cell culture, or pulse SILAC (Christiano et al. 2014). The rate of decay of native proteins was analyzed using high-resolution mass spectrometry-based proteomics. The resulting distribution of half-lives spans two orders of magnitude, ranging from a few minutes to more than 100 hr, and defines three classes of proteins. Class I contains very short-lived proteins (~2%), including many that drive the cell cycle. Class III contains very stable proteins (86%), many of which drive growth and mass accumulation. Class II proteins (12.5%) have intermediate half-lives and include those that mediate regulated processes, such as nutrient transport.
The Protein Abundance subsection contains reanalyzed data from 21 quantitative analyses of the proteome, visualized using a variety of experimental methods (mass spectrometry, GFP microscopy, GFP flow cytometry and tandem affinity purification/immunoblot). Data from the primary studies were mode-shift normalized and scaled to the intuitive abundance unit of molecules per cell (Ho et al. 2018). The normalized abundance measurements and associated metadata (media, visualization and strain background) from untreated cells are displayed in the protein abundance table. For some GFP-based studies, changes in protein abundance for cells treated with various environmental stressors, including DNA replication stress (hydroxyurea, methyl methanesulfonate), oxidative stress (hydrogen peroxide), reductive stress (dithiothreitol), nutritional stress (nitrogen starvation), quiescence and rapamycin treatment, were also normalized and converted to molecules per cell. These values are also displayed in this table along with additional metadata including the perturbation, the treatment time and the concentration of chemicals, if applicable. When the protein abundance in stressed cells is more than two standard deviations from the untreated average abundance, a fold-change value is also displayed. By default the table is sorted alphabetically by the original reference; the order can be changed using the arrows located in the table header, and the table can be filtered by entering keyword(s) into the text box. Note that some low-abundance proteins were not visualized in untreated cells because of the autofluorescence filtering of the data but became visible after treatment, and therefore have one or more treated values but no untreated value.
Finally, for a given protein, all values from untreated datasets were used to calculate a median abundance and a median absolute deviation (MAD) using a constant of C=1. These values are displayed on the Locus Summary page in the Protein section. Note, there are a few cases where the median value displayed is based on data from a single study and as such a median absolute deviation could not be calculated.
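The median and MAD calculation described above can be sketched as follows. This is a minimal illustration, not SGD's actual pipeline: the function name and the sample abundance values are hypothetical, and returning no MAD for a single study is an assumption that mirrors the note above.

```python
from statistics import median

def median_and_mad(values, c=1.0):
    """Return (median, MAD) for a list of abundance values.

    MAD = c * median(|x - median(x)|). With c = 1 this is the raw,
    unscaled median absolute deviation.
    """
    if not values:
        raise ValueError("need at least one abundance value")
    med = median(values)
    if len(values) < 2:
        # Assumption: a single study yields a median but no MAD,
        # as in the note about single-study proteins.
        return med, None
    mad = c * median(abs(v - med) for v in values)
    return med, mad

# Hypothetical abundances (molecules per cell) from three untreated datasets.
untreated = [1000.0, 2000.0, 4000.0]
print(median_and_mad(untreated))  # (2000.0, 1000.0)
```

With C=1 the MAD is not rescaled to approximate a standard deviation; it is simply the median distance of the individual measurements from their median.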
The Domains and Classification section displays the results of domain predictions for yeast protein sequences using the InterProScan program (Jones et al. 2014). InterProScan is a tool that combines different protein signature recognition methods into one resource. The InterPro database integrates motif, domain and protein family HMM information from the following member databases: Gene3D, PANTHER, Pfam, Phobius, PIR Superfamily, PRINTS, SignalP, SMART, Superfamily, TIGRFAM and TMHMM. The domain predictions are refreshed every 3 months to keep them up to date. The predictions are shown in both tabular and graphical form.
This section of the page has several subsections that show sequence-based information about the protein.
The amino acid sequence is displayed in 60-residue blocks. Residues are numbered on the left side. The sequence shown by default is that of the reference strain, but the pull-down menu allows selection of the sequence from any of the other S. cerevisiae strains whose sequence is available in SGD. Also included is a Download button, which loads a flat-text browser page with the amino acid sequence displayed in FASTA format. Known modification sites (currently phosphorylation) are highlighted on the sequence by color. (See the next section for further information on modifications.)
If there are protein modification data available for the protein, this section of the page shows this information in tabular format. Protein phosphorylation data, and other modification types, are curated by SGD. The modified sites shown match the sequence for the selected strain displayed above - changing the selected strain may change the listed sites. Annotations are assumed to be valid for each strain, unless the indicated residue is not present - polymorphic, mutated, etc. - in a given strain.
Data in this section are calculated from the protein sequence using BioPerl Seq libraries and CODONW software.
Amino Acid Composition
The Amino Acid Composition is based on the primary sequence. The table contains three columns: the first lists the one letter designations for the twenty amino acids, the second column lists the number of amino acids present in one molecule, and the third contains the composition expressed as a percentage.
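The three columns of this table can be reproduced from a sequence with a short sketch like the one below. The function name and the example peptide are hypothetical, and SGD's own calculation uses BioPerl rather than this code.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the twenty amino acids

def amino_acid_composition(sequence):
    """Return rows of (one-letter code, count, percent of total residues)."""
    seq = sequence.upper().rstrip("*")  # drop a trailing stop symbol if present
    counts = Counter(seq)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [(aa, counts[aa], 100.0 * counts[aa] / total) for aa in AMINO_ACIDS]

# Hypothetical four-residue peptide; print only the residues that occur.
for aa, n, pct in amino_acid_composition("MKKA"):
    if n:
        print(f"{aa}\t{n}\t{pct:.1f}")
```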
Physico-Chemical Properties
This section contains various physico-chemical properties of the protein calculated from the sequence, including:
II = (10/L) * Sum[i = 1 to L-1] DIWV(x[i]y[i+1])
where: L is the length of sequence
DIWV is the instability weight value
and x[i]y[i+1] is a dipeptide starting at position i.
Proteins with an instability index less than 40 are predicted to be stable, whereas those with a value greater than 40 are predicted to be unstable.
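The mechanics of the instability index formula can be sketched as below. The real DIWV table assigns a weight to each of the 400 possible dipeptides and is not reproduced here; the two weights in the example are placeholder values chosen only to make the arithmetic easy to follow, and the function names are hypothetical.

```python
def instability_index(sequence, diwv):
    """II = (10/L) * sum of DIWV over the L-1 overlapping dipeptides.

    `diwv` maps two-letter dipeptides to instability weight values and
    must be supplied by the caller.
    """
    seq = sequence.upper()
    length = len(seq)
    if length < 2:
        raise ValueError("need at least two residues to form a dipeptide")
    total = sum(diwv[seq[i:i + 2]] for i in range(length - 1))
    return (10.0 / length) * total

def is_predicted_stable(index):
    """Apply the threshold stated above: below 40 is predicted stable."""
    return index < 40

# PLACEHOLDER weights for illustration only; these are NOT the published values.
toy_diwv = {"AK": 2.0, "KA": 4.0}
ii = instability_index("AKAK", toy_diwv)  # dipeptides: AK, KA, AK
print(ii, is_predicted_stable(ii))  # 20.0 True
```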
Aliphatic index = X(Ala) + a * X(Val) + b * ( X(Ile) + X(Leu) )
where X(Ala), X(Val), X(Ile), and X(Leu) are mole percent (100 X mole fraction) of alanine, valine, isoleucine, and leucine. The coefficients a and b are the relative volume of valine side chains (a = 2.9) and of Leu/Ile side chains (b = 3.9) relative to that of alanine side chains.
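A minimal sketch of the aliphatic index calculation, using the coefficients given above (a = 2.9, b = 3.9). The function name and the example peptide are hypothetical.

```python
def aliphatic_index(sequence, a=2.9, b=3.9):
    """Aliphatic index = X(Ala) + a*X(Val) + b*(X(Ile) + X(Leu)),
    where each X is a mole percent (100 * count / sequence length)."""
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")

    def mole_percent(residue):
        return 100.0 * seq.count(residue) / length

    return (mole_percent("A")
            + a * mole_percent("V")
            + b * (mole_percent("I") + mole_percent("L")))

# Hypothetical peptide with one each of Ala, Val, Ile and Leu (25 mole percent each):
# 25 + 2.9*25 + 3.9*(25 + 25), which is approximately 292.5.
print(round(aliphatic_index("AVIL"), 1))
```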
Coding Region Translation Calculations
Values for Codon Bias Index (CBI), Codon Adaptation Index (CAI), Frequency of Optimal Codons (Fop), Hydropathicity of Protein (GRAVY score), and Aromaticity Score (AROMO) are calculated based on the specific genetic code and codon usage of a given organism and organelle. These values were calculated using the CodonW software program written by John Peden.
CodonW analyzes the correspondence between amino acids and codon usage in a set of protein sequences, based on a given genetic code (i.e. that used in the S. cerevisiae nucleus versus that used in its mitochondrion). CodonW was designed to work with any genetic code. Decisions regarding whether an amino acid is synonymous or non-synonymous, the translation of a codon, the number of codons in a codon family, and how many synonyms a codon has are all determined at run time. Seven alternatives to the universal genetic code have been built into the program, including S. cerevisiae chromosomal codon usage and S. cerevisiae mitochondrial codon usage. In SGD, we have used these two built-in options, as appropriate, to perform codon usage-based calculations for chromosomally-encoded or mitochondrially-encoded ORFs. Note that codon usage-based calculations are not currently performed for ORFs present within transposable elements (Ty elements), because the codon usage of transposable element genes differs from that of chromosomal genes (see the CodonW tutorial).
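Of the values listed above, the GRAVY and aromaticity scores depend only on the protein sequence, so they can be illustrated without codon usage tables. The sketch below assumes the standard definitions (GRAVY as the mean Kyte-Doolittle hydropathy per residue, aromaticity as the fraction of Phe, Tyr and Trp residues). It is not CodonW's code, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the twenty amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(sequence):
    """Mean hydropathy per residue (grand average of hydropathicity)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

def aromaticity(sequence):
    """Fraction of residues that are aromatic (Phe, Tyr or Trp)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(seq.count(aa) for aa in "FYW") / len(seq)

# Hypothetical three-residue peptide.
print(round(gravy("AFW"), 3), round(aromaticity("AFW"), 3))
```

A positive GRAVY score indicates a predominantly hydrophobic protein and a negative score a predominantly hydrophilic one.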
Extinction Coefficient
The extinction coefficient (epsilon) is the wavelength-dependent molar absorptivity coefficient with units of M^-1 cm^-1. The extinction coefficient provides an indication of the amount of light that a given protein will absorb at a certain wavelength (usually 280 nm). During protein purification a spectrophotometer can be used to follow the protein of interest if the extinction coefficient is known. The molar extinction coefficient of a protein can be estimated based on its amino acid composition. The extinction coefficient of the native protein in water can be calculated based on the molar extinction coefficient of tyrosine, tryptophan and cystine (cysteine does not absorb much at wavelengths greater than 260 nm while cystine does) using the following equation:
E(Prot) = Numb(Tyr)*Ext(Tyr) + Numb(Trp)*Ext(Trp) + Numb(Cystine)*Ext(Cystine)
where: Ext(Tyr) = 1490
Ext(Trp) = 5500
Ext(Cystine) = 125
The absorbance (optical density) can then be calculated using the following formula:
Absorb(Prot) = E x l x C
where: E = extinction coefficient
l = pathlength (cm)
C = protein concentration (M)
ProtParam calculates two extinction coefficient values: the first is based on the assumption that all cysteine residues appear as half cystines, and the second assumes that no cysteines appear as half cystines. The computation has been demonstrated to be quite reliable for proteins that contain Trp residues, but for proteins without Trp residues there may be more than a 10% error.
These calculations are based on the method developed by Edelhoch, 1967, using extinction coefficients for Trp and Tyr, as determined by Pace et al., 1995. The values used in the calculation of extinction coefficients for denatured proteins were also found to be accurate for calculating coefficients for the native protein (Gill and von Hippel, 1989). In general, since Trp residues contribute much more to the overall extinction coefficient than Tyr and cystine residues, the calculations tend to be much closer to measured values for proteins that contain Trp residues.
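The two extinction coefficient values and the absorbance formula can be sketched as follows, using the constants given above. The function names and the example sequence are hypothetical, and pairing cysteines as floor(n/2) cystines is an assumption about how an odd cysteine count is handled.

```python
EXT_TYR = 1490      # M^-1 cm^-1 per tyrosine
EXT_TRP = 5500      # M^-1 cm^-1 per tryptophan
EXT_CYSTINE = 125   # M^-1 cm^-1 per cystine (a pair of cysteines)

def extinction_coefficients(sequence):
    """Return (all Cys paired as cystines, no cystines), in M^-1 cm^-1."""
    seq = sequence.upper()
    base = seq.count("Y") * EXT_TYR + seq.count("W") * EXT_TRP
    # Assumption: each cystine is formed from two cysteines; an unpaired
    # cysteine contributes nothing.
    n_cystine = seq.count("C") // 2
    return base + n_cystine * EXT_CYSTINE, base

def absorbance(ext_coefficient, path_cm, conc_molar):
    """Absorb(Prot) = E x l x C."""
    return ext_coefficient * path_cm * conc_molar

# Hypothetical peptide with one Tyr, one Trp and two Cys.
with_cystines, without_cystines = extinction_coefficients("YWCC")
print(with_cystines, without_cystines)  # 7115 6990
print(absorbance(with_cystines, 1.0, 1e-5))  # roughly 0.071 at 10 micromolar
```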
Atomic Composition
The Atomic Composition Table displays the composition of the protein, with respect to the number of atoms of carbon, hydrogen, nitrogen, oxygen, and sulfur that it contains as well as the total number of atoms and the resulting formula.
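As an illustration of how such a table can be derived, the sketch below sums the atoms of the free amino acids and subtracts one water molecule per peptide bond. It assumes an unmodified linear chain with free termini and no disulfide bonds; the function names are hypothetical and this is not SGD's implementation.

```python
# Atoms (C, H, N, O, S) in each free amino acid.
FREE_AMINO_ACID_ATOMS = {
    "A": (3, 7, 1, 2, 0),  "R": (6, 14, 4, 2, 0), "N": (4, 8, 2, 3, 0),
    "D": (4, 7, 1, 4, 0),  "C": (3, 7, 1, 2, 1),  "Q": (5, 10, 2, 3, 0),
    "E": (5, 9, 1, 4, 0),  "G": (2, 5, 1, 2, 0),  "H": (6, 9, 3, 2, 0),
    "I": (6, 13, 1, 2, 0), "L": (6, 13, 1, 2, 0), "K": (6, 14, 2, 2, 0),
    "M": (5, 11, 1, 2, 1), "F": (9, 11, 1, 2, 0), "P": (5, 9, 1, 2, 0),
    "S": (3, 7, 1, 3, 0),  "T": (4, 9, 1, 3, 0),  "W": (11, 12, 2, 2, 0),
    "Y": (9, 11, 1, 3, 0), "V": (5, 11, 1, 2, 0),
}
ELEMENTS = ("C", "H", "N", "O", "S")

def atomic_composition(sequence):
    """Return a dict of atom counts for an unmodified linear polypeptide."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    totals = [0, 0, 0, 0, 0]
    for aa in seq:
        for i, n in enumerate(FREE_AMINO_ACID_ATOMS[aa]):
            totals[i] += n
    bonds = len(seq) - 1      # each peptide bond releases one H2O
    totals[1] -= 2 * bonds    # hydrogen
    totals[3] -= bonds        # oxygen
    return dict(zip(ELEMENTS, totals))

def formula(composition):
    """Build a formula string, omitting elements with a zero count."""
    return "".join(f"{el}{n}" for el, n in composition.items() if n)

comp = atomic_composition("GA")  # hypothetical Gly-Ala dipeptide
print(comp, sum(comp.values()), formula(comp))
```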
This section of the page provides access to a compendium of Saccharomyces cerevisiae sequence entries for alleles and strains that are located in various external databases including GenBank/EMBL/DDBJ, NCBI, EBI, and MIPS. Sequence entries are listed by accession and/or version numbers according to the source. Additional information is available in the All Associated Sequences help page.
This section provides access to a number of external resources relevant to the query protein. This includes sequence entries located at various homolog related resources, interaction databases, protein databases, and localization resources.
All data presented in tables on the Protein page can be downloaded by clicking the download button at the bottom left of each table. All data are downloaded in tab-delimited text format, except the sequence data, which are provided in FASTA format.
All of the data displayed on the Protein page, plus additional data, are available from YeastMine. You can search for and download data, or create gene lists and analyze them further using additional YeastMine queries.
YeastMine templates (pre-composed queries) for protein data include: