PROTEIN IDENTIFICATION FROM THE MOLECULAR WEIGHT OF ITS FRAGMENTS

by

Gaston H. Gonnet

Informatik ETH Zürich

Heidelberg, Dec. 3, 1992

A much newer version of this work can be found here

 

 

PROTEIN DIGESTION

Example protein:

YKVTLDQNRREGDIAKPNAED ...

will be broken into the following parts by Trypsin:

YK, VTLDQNR, R, EGDIAKPNAED ...

(splits after every R or K not followed by a P)

or if digested by Asp-N

YKVTLDQ, NRREGDIAKP, NAED...

(splits before every N)
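The two cleavage rules stated above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `digest` and the enzyme labels are hypothetical, and the Asp-N rule is implemented exactly as written on this slide (a cut before every N).

```python
def digest(sequence, enzyme):
    """Split a one-letter protein sequence using the cleavage rules on this slide."""
    cuts = []
    for i in range(1, len(sequence)):
        prev, nxt = sequence[i - 1], sequence[i]
        if enzyme == "Trypsin":
            # Cut after every R or K that is not followed by a P.
            cut = prev in "RK" and nxt != "P"
        elif enzyme == "AspN":
            # Rule as stated on the slide: cut before every N.
            cut = nxt == "N"
        else:
            raise ValueError(f"unknown enzyme: {enzyme}")
        if cut:
            cuts.append(i)
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


example = "YKVTLDQNRREGDIAKPNAED"
print(digest(example, "Trypsin"))  # ['YK', 'VTLDQNR', 'R', 'EGDIAKPNAED']
print(digest(example, "AspN"))     # ['YKVTLDQ', 'NRREGDIAKP', 'NAED']
```

Applied to the example protein, both calls reproduce the fragment lists shown above.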

 

Figure 2. Computer match between silver stained 2-D PAGE patterns of liver (Fig. 1), plasma, red blood cell, rectal adenocarcinoma samples and an Amido Black-stained PVDF membrane pattern of liver sample (Fig. 6). This figure was obtained using all-spots, allareas, viewmod, modified automatch, showpairs, metal, gelsharper and showgroups programs of the Melanie/Elsie computer system [68, 71]. The color vectors link a few matched spots between the liver "master" picture, the PVDF membrane and the other type of samples. TCTP, translationally controlled tumor protein.

Figure 3. Enlargement of the higher molecular weight area of Fig. 1. "U" indicates unknown sequence in Swiss-Prot database. The numbers provide a reference to Table 1. The green labels highlight proteins identified by gel comparison and the red labels those identified by N-terminal microsequencing. The blue labels highlight polypeptides which could not be N-terminally microsequenced either because of too low protein concentration or because of N-terminal blockage. HSP-60, heat shock protein 60.

Figure 4. Enlargement of the acidic and lower molecular weight area of Fig. 1. The yellow arrows highlight some spots which were unsuccessfully sequenced. We are currently attempting to get internal sequence information after in situ digestion, extraction and microbore reversed-phase HPLC. SRBP, serum retinol binding protein. Other details as in Fig. 3.

Figure 5. Enlargement of the basic area of Fig. 1. Labeling as in Fig. 3.

 

Question: Is it possible to predict composition from molecular weight?

Answer: NO.

  • No information about ordering
  • Too many possible combinations
    Number of sequences with given weight within error tolerance:

    Error       Mol. weight 400   Mol. weight 1000   Mol. weight 2000
    ±0.5                    386      9,780,723,528      1.577 x 10^22
    ±0.05                     0      2,792,745,483      8.280 x 10^20
    ±0.005                    0        391,021,208      4.608 x 10^19
    ±0.0005                   0        173,920,080      1.979 x 10^18

       

  • Two different amino acids (isoleucine and leucine) have the same molecular weight.
  • (This is an NP-complete problem, hence no efficient solutions are known.)

    Amino acid       Code   Weight
    Alanine          A       71.079
    Arginine         R      156.188
    Asparagine       N      114.104
    Aspartic acid    D      115.089
    Cysteine         C      103.144
    Glutamine        Q      128.131
    Glutamic acid    E      129.116
    Glycine          G       57.052
    Histidine        H      137.142
    Isoleucine       I      113.160
    Leucine          L      113.160
    Lysine           K      128.174
    Methionine       M      131.198
    Phenylalanine    F      147.177
    Proline          P       97.117
    Serine           S       87.078
    Threonine        T      101.105
    Tryptophan       W      186.213
    Tyrosine         Y      163.170
    Valine           V       99.113
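A short sketch shows why a weight alone cannot reveal ordering or distinguish isoleucine from leucine. The table values are the weights of residues inside a chain (free glycine would weigh about 75), so the sketch adds one water molecule per peptide. The value 18.015 for water and the name `peptide_weight` are assumptions, not taken from the slides.

```python
# Residue weights copied from the table above.
RESIDUE_MASS = {
    "A": 71.079, "R": 156.188, "N": 114.104, "D": 115.089, "C": 103.144,
    "Q": 128.131, "E": 129.116, "G": 57.052, "H": 137.142, "I": 113.160,
    "L": 113.160, "K": 128.174, "M": 131.198, "F": 147.177, "P": 97.117,
    "S": 87.078, "T": 101.105, "W": 186.213, "Y": 163.170, "V": 99.113,
}
WATER = 18.015  # assumption: one H2O per free peptide, not given on the slides


def peptide_weight(sequence):
    """Sum of the residue weights plus one water molecule."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


# Reordering the residues does not change the weight ...
print(peptide_weight("VTLDQNR"), peptide_weight("RNQDLTV"))
# ... and neither does swapping I for L.
print(peptide_weight("EGDIAK"), peptide_weight("EGDLAK"))
```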

     

    Question: Is it possible to find a sequence (or a very similar sequence) within a database from its molecular weights?

    Answer: Yes.

    We present an algorithm that performs an approximate search for a protein in a database, based on the weights of the fragments produced by an enzymatic digestion.

    The basic algorithm compares the weights of the fragments obtained with a mass spectrometer with the weights resulting from a theoretical digestion of the sequences in the database.
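The comparison step can be sketched as follows: each measured weight is checked against the theoretical fragment weights of one database entry and counted as matched if it lies within a relative tolerance. The function name `match_weights` and the 1% default tolerance are illustrative assumptions; the scoring used by the actual program is not reproduced here.

```python
def match_weights(measured, theoretical, rel_tol=0.01):
    """Separate measured weights into those explained by a theoretical digest and the rest."""
    matched, unmatched = [], []
    for w in measured:
        # A measured weight matches if any theoretical fragment is within rel_tol of it.
        if any(abs(w - t) <= rel_tol * t for t in theoretical):
            matched.append(w)
        else:
            unmatched.append(w)
    return matched, unmatched


# Hypothetical numbers, for illustration only.
matched, unmatched = match_weights([1000.0, 2000.0], [1005.0, 3000.0])
print(matched, unmatched)  # [1000.0] [2000.0]
```

The number of matched weights plays the role of k, and the number of theoretical fragments the role of n, in the probability calculation.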

     

    MOTIVATION

    - Reading of 2D gel electrophoretograms (2D gels)

    Diagnosing diseases by 2D gel geometries

    Identifying substances present/absent in healthy/sick cells.

    - Determination of whether a protein is known or not before its sequencing

    - In general: recognition of documented proteins from very small samples (fractions of pico-moles)

     

    Without errors the comparison is rather trivial: it is a special case of multidimensional search.

    But our methods have to tolerate errors:

    a) Recording error < 1%

    b) Searched sequence not verbatim in database (due to mutations)

    c) Mutations may cause different digestions

    d) Impurities in the sample and in the digester produce spurious data

    e) Partial or incorrect digestion

    f) Systematic error of apparatus

     

    General Algorithm

  • Given a database $D = \{D_i\}$, where each $D_i$ is a vector with $n_i$ values.
  • Given a vector $X$ of dimension $k$.
  • Define a distance function $\mathrm{dist}(D_i, X) = d_i$.
  • For a random vector $Y$ of dimension $n$, compute
    $Pr_{k,n,p} = \mathrm{Prob}\{\mathrm{dist}(Y, X) \le p\}$.
    (Here $p$ stands for $\varepsilon$, because HTML files had no standard for Greek symbols.)
  • Select the database entries for which $Pr_{k,n_i,d_i}$ is lowest (rarest event).


     

    What is the probability that each of the k boxes has 1 or more balls?

     

    Let $a_1, a_2, \ldots, a_k, b$ be formal variables,

    $G_{k,n,p} = [a_1 p + a_2 p + \cdots + a_k p + b(1 - kp)]^n$

     

     

     

     

    To normalize the interval to (0,1) we must divide the logarithms by $(\log w_{max} - \log w_{min})$, where $w_{max}$ and $w_{min}$ are the highest and lowest weights measured.
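The normalization can be illustrated with a small sketch. It maps each weight to a position in [0,1] on the logarithmic scale, and computes how wide a tolerance window becomes on that scale. The window is assumed to cover w(1-e) to w(1+e) for a relative error e. Reading this width as the box probability p is an assumption, as are the function names.

```python
import math


def normalize_weights(weights):
    """Map weights to [0, 1] on a log scale between the lowest and highest weight."""
    lo, hi = math.log(min(weights)), math.log(max(weights))
    return [(math.log(w) - lo) / (hi - lo) for w in weights]


def box_width(rel_error, weights):
    """Width, in normalized units, of the window [w(1-e), w(1+e)] around any weight w."""
    lo, hi = math.log(min(weights)), math.log(max(weights))
    return (math.log(1 + rel_error) - math.log(1 - rel_error)) / (hi - lo)


trypsin_weights = [1264.8, 1520.2, 955.9, 2487.0, 1094.1]
print(normalize_weights(trypsin_weights))
print(box_width(0.01, trypsin_weights))
```

On the log scale the window width no longer depends on the weight itself, which is why a single p can be used for all boxes.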

    In our example:

     n   k   p        Pr_{k,n,p}                 Pr_{k,n,p}
                      (non-overlapping boxes)    (overlapping boxes)
     5   0   0        1                          1
     5   1   0.0017   0.0084                     0.0084
     5   2   0.0064   0.00080                    0.00099
     5   3   0.11     0.056                      0.80
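For non-overlapping boxes, the probability that each of the k boxes receives at least one of the n balls can be computed by inclusion-exclusion over the boxes that stay empty. The sketch below uses that standard formula, which is not necessarily how the original program computes it, and agrees with the non-overlapping values tabulated above to about two significant digits. The overlapping case is not attempted. The function name is hypothetical.

```python
from math import comb


def prob_all_boxes_hit(n, k, p):
    """P(each of k disjoint boxes of probability p gets >= 1 of n independent balls)."""
    # Inclusion-exclusion: choose j boxes to be empty, each ball then avoids them
    # with probability (1 - j*p).
    return sum((-1) ** j * comb(k, j) * (1 - j * p) ** n for j in range(k + 1))


for k, p in [(0, 0.0), (1, 0.0017), (2, 0.0064), (3, 0.11)]:
    print(k, p, prob_all_boxes_hit(5, k, p))
```

The same quantity is the sum of the coefficients of those terms of G in which every a_i appears at least once, with all formal variables then set to 1.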

     

     

     

     
     

     

    ONE DIGESTER (TRYPSIN) TWO PROTEINS 3X2 WEIGHTS

    Score  n k  n k   AC     DE                   OS
    110.1 13 3 23 2 P18961; SERINE/THREONINE-PROTEIN KINASE YPK2/YKR2 (EC 2.7.1.-).
                                 SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
                                 Unmatched weights: [1711.0]. Unmatched weights:
                                 [1306.3, 1456.0].
    104.1 15 2 20 2 P36026; UBIQUITIN CARBOXYL-TERMINAL HYDROLASE 11 (EC 3.1.2.15)
                                 (UBIQUITIN THIOLESTERASE 11) (UBIQUITIN-SPECIFIC
                                 PROCESSING PROTEASE 11) (DEUBIQUITINATING ENZYME 
                                 11). SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
                                 Unmatched weights: [2232.0, 3739.0]. Unmatched 
                                 weights: [1785.0, 6509.0].
     81.7  7 3 12 1 P26484; FIXC PROTEIN. AZORHIZOBIUM CAULINODANS. Unmatched 
                                 weights: [3739.0]. Unmatched weights: [1306.3, 
                                 1785.0, 6509.0].
     80.7 11 3 19 2 P06776; 3', 5'-CYCLIC-NUCLEOTIDE PHOSPHODIESTERASE 2 (EC 3.1.4.
                                 17) (PDEASE 2) (HIGH AFFINITY CAMP
                                 PHOSPHODIESTERASE). SACCHAROMYCES CEREVISIAE 
                                 (BAKER'S YEAST). Unmatched weights: [3739.0].
                                 Unmatched weights: [1306.3, 1785.0].
     79.9 30 2 40 2 Q07518; RNA REPLICATION PROTEIN (CONTAINS: RNA-DIRECTED RNA 
                                 POLYMERASE (EC 2.7.7.48) / PROBABLE HELICASE) (156 
                                 KD PROTEIN) (ORF 1). PLANTAGO ASIATICA MOSAIC 
                                 POTEXVIRUS (P1AMV). Unmatched weights: [2020.0, 
                                 2232.0]. Unmatched weights: [1456.0, 6509.0].
     77.6 26 1 26 2 P29465; CHITIN SYNTHASE 3 (EC 2.4.1.16) (CHITIN-UDP ACETYL-
                                 GLUCOSAMINYL TRANSFERASE 3). SACCHAROMYCES
                                 CEREVISIAE (BAKER'S YEAST). Unmatched weights:
                                 [1711.0, 2020.0, 3739.0]. Unmatched weights:
                                 [1456.0, 6509.0].
     77.5  3 3  7 3 P14886; NIFY PROTEIN. AZOTOBACTER VINELANDII. Unmatched 
                                 weights: [3739.0]. Unmatched weights: [1456.0].
     

     

    SAMPLE RESULTS RECEIVED BY E-MAIL

     
    
    MassSearch
    Trypsin: 1264.8, 1520.2, 955.9, 2487.0, 1094.1
    AspN: 1624.4, 2961.4, 718.8, 716.9, 1890.0

    The output of the above request is:

    Searching on SwissProt version 26.
    The sequences are printed in decreasing order of significance.
    Scores lower than 90 are probably not significant.
    For digester Trypsin, the fragment weights were: 1264.8 1520.2 955.9 2487.0 1094.1
    For digester AspN, the fragment weights were: 1624.4 2961.4 718.8 716.9 1890.0

    Score  n k  n k   AC     DE                   OS
    143.9  7 5  8 3 P02594; CALMODULIN. ELECTROPHORUS ELECTRICUS (ELECTRIC EEL).
    143.9  7 5  8 3 P02593; CALMODULIN. HOMO SAPIENS (HUMAN), ORYCTOLAGUS
                                 CUNICULUS (RABBIT), BOS TAURUS (BOVINE), RATTUS
                                 NORVEGICUS (RAT), GALLUS GALLUS (CHICKEN), XENOPUS
                                 LAEVIS (AFRICAN CLAWED FROG), ONCORHYNCHUS SP.
                                 (SALMON), AND ARBACIA PUNCTULATA (PUNCTUATE SEA
                                 URCHIN).
    112.6  7 4  8 2 P21251; CALMODULIN. STICHOPUS JAPONICUS (SEA CUCUMBER).
     94.7 21 3 22 2 P07265; MALTASE (EC 3.2.1.20). SACCHAROMYCES CARLSBERGENSIS
                                 (LAGER BEER YEAST).
     94.2  7 4  8 2 P07181; CALMODULIN, DROSOPHILA MELANOGASTER (FRUIT FLY),
                                 LOCUSTA MIGRATORIA (MIGRATORY LOCUST), AND APLYSIA
                                 CALIFORNICA (CALIFORNIA SEA HARE).

     

    e-mail RESULTS DNA SEARCHING RANDOM WEIGHTS

    (NO SIGNIFICANT MATCH EXPECTED)

     
    
    DNAMassSearch
    ApproxMass: 50000
    Trypsin: M=83.092, 1264.8, 1520.2, 955.9, 2487.0, 1094.1
    AspN: Deuterated, 1624.4, 2961.4, 718.8, 716.9, 1890.0

    The output of the above request is:

    Searching on EMBL version 35.
    The sequences are printed in decreasing order of significance.
    Scores lower than 100 are probably not significant.
    For digester Trypsin, the fragment weights were: 1264.8 1520.2 955.9 2487.0 1094.1
    For digester AspN, the fragment weights were: 1624.4 2961.4 718.8 716.9 1890.0

    Score   n k  n k   AC     DE                   OS
    100.3  85 2 46 3 M58040; Rat transferrin receptor mRNA, 3' end. Rattus
                                 norvegicus (rat)
    100.1  99 2 49 3 Z18629; B. subtilis comF gene Bacillus subtilis
     98.0  18 3  8 2 M37510; J04774; Human methylmalonyl CoA mutase (MUT) gene,
                                 exon 13. Homo sapiens (human)
     93.4  42 4 30 3 M76493; H. contortus beta tubulin (tub8-9) mRNA, complete
                                 cds. Haemonchus contortus
     93.4  21 3  9 3 M18356; Rat cytochrome P-450 (M-1) gene, exon 1. Rattus
                                 norvegicus (rat)
     90.0 105 3 47 2 X65055; C.elegans cepgpC gene for P-glycoprotein C
                                 Caenorhabditis elegans (nematode)

     

     

    DYNAMIC PROGRAMMING MASS SEARCH

    SOLVES THE PROBLEM OF SEARCHING FOR A SUBSEQUENCE (FRAGMENT) GIVEN BY ITS PARTIAL WEIGHTS.

     

    E.g. fragment:

    K M E T E V A I E Y K S         1427.6
    (KM) E T E V A I E Y K S        1168.2
    (KME) T E V A I E Y K S         1039.1
    (KMET) E V A I E Y K S           938.0
    etc.

    Then find the database sequence that best matches those weights.
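The partial weights in the example can be recomputed from the residue table, which also shows what "partial weights" means here: the weight of what remains after removing the residues in parentheses. The sketch assumes one water molecule (18.015) is added per peptide. That value and the function name are assumptions, though with them the results agree with the slide to within 0.1.

```python
# Residue weights for the residues of this fragment, from the amino acid table.
RESIDUE_MASS = {
    "K": 128.174, "M": 131.198, "E": 129.116, "T": 101.105, "V": 99.113,
    "A": 71.079, "I": 113.160, "Y": 163.170, "S": 87.078,
}
WATER = 18.015  # assumption: one H2O per free peptide


def partial_weights(fragment, removed_counts):
    """Weight of the fragment after removing the first r residues, for each r given."""
    return [
        sum(RESIDUE_MASS[aa] for aa in fragment[r:]) + WATER
        for r in removed_counts
    ]


# Nothing removed, then KM, KME and KMET removed, as on the slide.
for w in partial_weights("KMETEVAIEYKS", [0, 2, 3, 4]):
    print(round(w, 1))
```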

     

    MASS Searching using Dynamic programming

    M[1] := [1427.6, 1299.5, 1168.3, 1039.1]; (original)

    M[1] := [1427.6, , 1168.3, 1039.1]; (test)


    Matching against sequence entry 688: AC15_HUMAN: ACTIVATOR 1 140 KD SUBUNIT (REPLICATION FACTOR C LARGE SUBUNIT)

    Simil: 27.79 MatchSimil: 18.35 MassSimil: 9.44

    ...kmEtevaieyks...

    ...KMETEVAIEYKS...


    Matching against sequence entry 12657: FTSZ_STAAU: CELL DIVISION FTSZ PROTEIN.

    Simil: 27.11 MatchSimil: 18.32 MassSimil: 8.79

    ...agmEkaikavvpaag...

    ...AGMEKAIKAVVPAAG...


    Matching against sequence entry 47590: YMX2_YEAST: HYPOTHETICAL COX1/OXI3 INTRON 2 PROTEIN (AI2).

    Simil: 27.65 MatchSimil: 18.35 MassSimil: 9.31

    ...kmEehilrgvgr...

    ...KMEEHILRGVGR...
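To make the search problem concrete, here is a brute-force scan for exact occurrences. It is not the dynamic programming method itself, whose details are not given on these slides, and it does not score similar sequences. It looks for a window whose total weight matches the first weight and whose successively shortened remainders match the later weights, so a missing measurement (as in the test vector) is tolerated. The tolerance, the water value 18.015 and all names are assumptions.

```python
RESIDUE_MASS = {
    "A": 71.079, "R": 156.188, "N": 114.104, "D": 115.089, "C": 103.144,
    "Q": 128.131, "E": 129.116, "G": 57.052, "H": 137.142, "I": 113.160,
    "L": 113.160, "K": 128.174, "M": 131.198, "F": 147.177, "P": 97.117,
    "S": 87.078, "T": 101.105, "W": 186.213, "Y": 163.170, "V": 99.113,
}
WATER = 18.015  # assumption: one H2O per free peptide


def find_fragment(sequence, weights, tol=0.2):
    """Return (start, end) windows consistent with a decreasing list of partial weights."""
    prefix = [0.0]
    for aa in sequence:
        prefix.append(prefix[-1] + RESIDUE_MASS[aa])

    def weight(i, j):
        return prefix[j] - prefix[i] + WATER

    hits = []
    for start in range(len(sequence)):
        for end in range(start + 1, len(sequence) + 1):
            # The first weight must describe the whole window.
            if abs(weight(start, end) - weights[0]) > tol:
                continue
            cut, ok = start, True
            for w in weights[1:]:
                # Remove leading residues until the remainder is no heavier than w.
                while cut < end and weight(cut, end) > w + tol:
                    cut += 1
                if cut == end or weight(cut, end) < w - tol:
                    ok = False
                    break
            if ok:
                hits.append((start, end))
    return hits


# Hypothetical database sequence containing the example fragment.
print(find_fragment("AAAKMETEVAIEYKSGG", [1427.6, 1168.3, 1039.1]))  # [(3, 15)]
```

A dynamic programming formulation would additionally allow substitutions and score partial agreement, which is what lets the outputs above report near matches such as FTSZ_STAAU.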

     

    AVAILABILITY

     

     

    Zurich, 6th November 1997.