Depto. de Ciencias de la Computacion, Universidad De Chile, Santiago, Casilla 2777, Chile

Department of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada

We use matrix recurrences to analyze the expected behaviour of algorithms on trees. We apply this technique to the average-case analysis of balanced search trees and digital trees. In particular, we give the previously unknown exact solution to a fringe analysis problem, a technique used for search trees. This method also makes some scalar recurrences easier to solve.
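As an illustration of the technique (a classic fringe-analysis example for 2-3 trees, not the specific problem solved in the paper), one can iterate the expected-value recurrence for the counts of 2-node and 3-node leaves; the model below assumes insertions fall uniformly among the n+1 gaps of the tree, all of which lie at the fringe.

```python
def fringe_23(n_max):
    """Expected fringe composition of a random 2-3 tree.

    x = expected number of 2-node leaves (1 key, 2 gaps),
    y = expected number of 3-node leaves (2 keys, 3 gaps).
    A random insertion falls uniformly into one of the n+1 gaps:
    hitting a 2-node turns it into a 3-node, hitting a 3-node
    splits it into two 2-nodes (invariant: 2x + 3y = n + 1).
    """
    x, y = 1.0, 0.0           # a single key sitting in one 2-node leaf
    for n in range(1, n_max):
        p2 = 2 * x / (n + 1)  # probability the insertion hits a 2-node
        p3 = 3 * y / (n + 1)  # probability it hits a 3-node
        x, y = x - p2 + 2 * p3, y + p2 - p3
    return x / n_max, y / n_max

# The steady state of the underlying transition matrix gives x/n -> 2/7 and
# y/n -> 1/7, i.e. two thirds of the fringe nodes are 2-nodes.
```

The transient decays polynomially fast (the second eigenvalue of the transition matrix is negative), so the fractions settle almost immediately.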

Data Structuring Group, Department of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada

We present an average-case analysis of the Karp-Rabin string matching algorithm, a probabilistic algorithm that adapts hashing techniques to string searching. We also propose an efficient implementation of this algorithm.
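A minimal sketch of the Karp-Rabin idea (illustrative only, not the efficient implementation proposed in the paper): maintain a rolling hash of the current text window, compare it against the pattern's hash, and verify candidate positions explicitly to rule out hash collisions.

```python
def karp_rabin(text, pattern, base=256, mod=(1 << 31) - 1):
    """Return all starting indices where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)          # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # Equal hashes are only probable evidence of a match; verify directly.
        if t_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Slide the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches
```

The expected running time is linear because false hash matches are rare for a large modulus.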

Depto. de Cs. de la Computacion, Universidad De Chile, Santiago, Casilla 2777, Chile

Department of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada

Depto. de Ciencia da Computacao, Universidade Federal de Minas Gerais,
Belo Horizonte, Minas Gerais, Brasil

In this paper we improve previous bounds on expected measures of AVL trees by using fringe analysis. We present a new way of handling larger tree collections that are not closed. The transformations needed to keep an AVL tree balanced pose an inherent difficulty for fringe analysis methods. We derive a technique to cope with this difficulty, obtaining the exact solution for fringe parameters even when unknown probabilities are involved. We show that the probability of a rotation in an insertion is between 0.37 and 0.73, that the fraction of balanced nodes is between 0.56 and 0.78, and that the expected number of comparisons in a search seems to be at most 12% more than in the completely balanced tree.
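The quoted bounds can be checked empirically. The sketch below (an illustrative simulation, not the paper's analysis) inserts random keys into a textbook AVL tree and measures the fraction of insertions triggering a rebalancing rotation and the fraction of nodes with balance factor zero.

```python
import random

class Node:
    __slots__ = ("key", "left", "right", "height")
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1

def height(n): return n.height if n else 0
def update(n): n.height = 1 + max(height(n.left), height(n.right))
def bal(n): return height(n.left) - height(n.right)

def rot_right(y, stats):
    stats["rot"] += 1
    x = y.left
    y.left, x.right = x.right, y
    update(y); update(x)
    return x

def rot_left(x, stats):
    stats["rot"] += 1
    y = x.right
    x.right, y.left = y.left, x
    update(x); update(y)
    return y

def insert(node, key, stats):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key, stats)
    else:
        node.right = insert(node.right, key, stats)
    update(node)
    b = bal(node)
    if b > 1:                      # left-heavy
        if bal(node.left) < 0:     # left-right case: double rotation
            node.left = rot_left(node.left, stats)
        node = rot_right(node, stats)
    elif b < -1:                   # right-heavy
        if bal(node.right) > 0:    # right-left case: double rotation
            node.right = rot_right(node.right, stats)
        node = rot_left(node, stats)
    return node

def simulate(n, seed=42):
    """Insert n random keys; return (fraction of insertions that rebalanced,
    fraction of nodes with balance factor 0)."""
    rng = random.Random(seed)
    stats = {"rot": 0}
    root, rebalancing = None, 0
    for _ in range(n):
        before = stats["rot"]
        root = insert(root, rng.random(), stats)
        if stats["rot"] > before:   # a double rotation counts as one event
            rebalancing += 1
    def count(nd):
        if nd is None:
            return 0, 0
        b1, t1 = count(nd.left)
        b2, t2 = count(nd.right)
        return b1 + b2 + (bal(nd) == 0), t1 + t2 + 1
    balanced, total = count(root)
    return rebalancing / n, balanced / total
```

For a few thousand random insertions the measured values fall well inside the paper's bounds (empirically, roughly 0.47 for rotations and 0.68 for balanced nodes).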

Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.

Laboratorium fuer Organische Chemie, Swiss Federal Institute of Technology, Zurich, Switzerland.

The Computational Biochemistry Research Group (CBRG) is a research group at the Eidgenoessische Technische Hochschule Zurich (ETH), involving researchers from Organic Chemistry and Computer Science. Its main research goal is the derivation of biological/genetic information and theories through the analysis of genetic sequences.

INRIA, Rocquencourt, F-78150 Le Chesnay, France.

Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.

LIENS, Ecole Normale Superieure, 45 rue d'Ulm, F-75005 Paris, France

Department of Computer Science, Australian National University, Canberra ACT 2601, Australia

Quadtrees constitute a hierarchical data structure which permits fast access to multi-dimensional data. This paper presents the analysis of the expected cost of various types of searches in quadtrees - fully specified and partial match queries. The data model assumes random points with independently drawn coordinate values. The analysis leads to a class of "full-history" divide-and-conquer recurrences. These recurrences are solved using generating functions, either exactly for dimension d=2, or asymptotically for higher dimensions. The exact solutions involve hypergeometric functions. The general asymptotic solutions rely on the classification of singularities of linear differential equations with analytic coefficients, and on singularity analysis techniques. These methods are applicable to the asymptotic solution of a wide range of linear recurrences, as may occur in particular in the analysis of multidimensional searching problems.
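For concreteness, a small point-quadtree sketch (hypothetical helper names; the paper analyzes expected cost rather than an implementation): each stored point splits the plane into four quadrants, and a partial match query with only the x-coordinate specified must descend into both y-halves of the consistent x-side, which is the source of the non-trivial recurrences.

```python
class QNode:
    def __init__(self, x, y):
        self.x, self.y = x, y
        # Four quadrants, indexed by (x >= node.x) * 2 + (y >= node.y).
        self.child = [None, None, None, None]

def qt_insert(node, x, y):
    if node is None:
        return QNode(x, y)
    q = (x >= node.x) * 2 + (y >= node.y)
    node.child[q] = qt_insert(node.child[q], x, y)
    return node

def partial_match(node, x, out):
    """Collect all points whose first coordinate equals x (y unspecified)."""
    if node is None:
        return
    if node.x == x:
        out.append((node.x, node.y))
    side = 2 if x >= node.x else 0        # only one x-side can contain x,
    partial_match(node.child[side], x, out)
    partial_match(node.child[side + 1], x, out)  # but both y-halves must be searched
```

With random data this visits O(n^a) nodes for a fractional exponent a, which is exactly what the full-history recurrences in the paper quantify.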

Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.

The entire protein sequence database has been exhaustively matched. Definitive mutation matrices and models for scoring gaps were obtained from the matching and used to organize the sequence database as sets of evolutionarily connected components. The methods developed are general and can be used to manage sequence data generated by major genome sequencing projects. The alignments made possible by the exhaustive matching are the starting point for successful de novo prediction of the folded structures of proteins, for reconstructing sequences of ancient proteins and metabolisms in ancient organisms, and for obtaining new perspectives in structural biochemistry.

Institute for Organic Chemistry, and
Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich.

The exhaustive matching of the protein sequence database makes possible a broadly based study of insertions and deletions (indels) during divergent evolution. In this study, the probability of a gap in an alignment of a pair of homologous protein sequences was found to increase with the evolutionary distance measured in PAM units (number of accepted point mutations per 100 amino acid residues). A relationship between the average number of amino acid residues between indels and evolutionary distance suggests that a unit 30 to 40 amino acid residues in length remains, on average, undisrupted by indels during divergent evolution. Further, the probability of a gap was found to be inversely proportional to gap length raised to the 1.7 power. This empirical law fits closely over the entire range of gap lengths examined. Gap length distribution is largely independent of evolutionary distance. These results rule out the widely used linear gap penalty as a satisfactory formula for scoring gaps when constructing alignments. Further, the observed gap length distribution can be explained by a simple model of selective pressures governing the acceptance of indels during divergent evolution. Finally, this model provides theoretical support for using indels as part of "parsing algorithms", important in the de novo prediction of the folded structure of proteins from the sequence data.
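The power law translates directly into a gap-scoring formula: with log-odds scores, P(gap of length k) proportional to k^(-1.7) yields a penalty affine in log(k) rather than linear in k. A sketch of this consequence (the opening cost `a` is a placeholder constant, not a value from the paper):

```python
import math

def gap_penalty(k, a=20.0, exponent=1.7):
    """Penalty for a gap of length k, assuming P(length = k) ~ k**(-exponent).

    In log-odds scoring units, a power-law length distribution turns into
    an affine-in-log(k) penalty: a + 10 * exponent * log10(k).  The opening
    cost `a` is an assumed illustrative value; only the logarithmic length
    dependence follows from the empirical law.
    """
    return a + 10 * exponent * math.log10(k)
```

Note how slowly the penalty grows: extending a gap from length 1 to length 10 costs only 17 additional score units here, whereas a linear penalty would multiply the cost tenfold.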

Protein Chemistry Laboratory and Department of Biology,
Institute for Biochemistry and
Institute for Scientific Computation, Swiss Federal Institute of
Technology (E.T.H.), Zurich.

We have developed an algorithm for identifying proteins at the sub-microgram level without sequence determination by chemical degradation. The protein, usually isolated by one- or two-dimensional gel electrophoresis, is digested by enzymatic or chemical means and the masses of the resulting peptides are determined by mass spectrometry. The resulting mass profile, i.e., the list of the molecular masses of peptides produced by the digestion, serves as a fingerprint which uniquely defines a particular protein. This fingerprint may be used to search the database of known sequences to find proteins with a similar profile. If the protein is not yet sequenced the profile can serve as a unique marker. This provides a rapid and sensitive link between genomic sequences and 2D gel electrophoresis mapping of cellular proteins.
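The digestion-and-profile step can be sketched as follows (an illustration using standard monoisotopic residue masses and the usual trypsin rule of cleaving after K or R unless followed by P; not the group's actual implementation):

```python
# Standard monoisotopic residue masses in daltons (textbook values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to each free peptide

def tryptic_digest(seq):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR" and (i + 1 == len(seq) or seq[i + 1] != "P"):
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides

def mass_profile(seq):
    """Sorted peptide masses: the fingerprint used for the database lookup."""
    return sorted(sum(RESIDUE_MASS[a] for a in pep) + WATER
                  for pep in tryptic_digest(seq))
```

Matching a measured profile against profiles precomputed from a sequence database then reduces to comparing sorted mass lists within the instrument's error tolerance.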

Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.

In practice, there exist three different "religions" about stock market prediction. First of all, there is the Random Walk hypothesis, which basically says that the stock market is not predictable at all. In the second theory - the Fundamental Analysis - forecasts are based on economic data. Finally, there is the Technical Analysis approach, which tries to predict future price changes using only historical prices and volumes. In this work, we study the third approach. We consider twenty different models based on various technical data. These models also depend on a small number of parameters. In our prediction process, we try to maximize our gain by making one decision every day - either buy or sell. To make the decision on a certain day, we use a model whose parameters are optimized to yield a maximum gain over a particular "training time" - the most recent 5, 10, 25, 50, 75 or 125 past days. We test the models and the prediction process with various stocks over a simulation time of typically half a year. One goal is to find the optimal training time for a certain model. In general, we found that we cannot expect to find a particular model with a particular training time that can be applied to any stock, as every stock is unique in its behaviour - different type, activity, popularity, investors, etc. The tests also showed that it is rather difficult to outperform the buy & hold strategy if the stock performance is good. Statistical analysis shows that random models (models whose decisions are randomly buy or sell) have a mean gain of about half that of the buy & hold strategy. More importantly, the gain produced by random models typically has a high standard deviation; this makes it difficult to show that a given (presumably non-random) model is not just good on the test set by coincidence. From the mathematical point of view, Technical Analysis is more like voodoo.
On the one hand, it seems that the indicators of the Technical Analysis approach are rather bad and are not able to outperform the buy & hold strategy in general. On the other hand, forecasts could become true if there exist many investors who believe in the indicators - a kind of self-fulfilling prophecy. In any case, it is still unknown whether short-term price movements in the stock market are predictable or not. Although the models we chose here are based on the Technical Analysis approach (mainly because data procurement was easier), the main focus of this work is on the prediction and maximization methods; these methods remain exactly the same for models based on fundamental data or even other kinds of prediction problems. The computationally most expensive part of prediction problems is the maximization of score-functions. Score-functions return a score (in our case the gain) for the simulation with certain model parameters on the training set. In our case we need to find a maximum of the score-function every time we make a single prediction: the model parameters need to be re-optimized over the training time. Such score-functions are special in that all efficient standard maximization methods fail - the functions are piecewise constant in their arguments (the gradient is 0 at almost all points). To guarantee that we have found the global maximum of such a function, we must visit every single plateau. For functions in one dimension, there is a known algorithm whose running time is linear in the number of plateaus. For linear models, we have developed an efficient algorithm that is also able to find the global maximum in two dimensions. We propose an extension of this algorithm to any dimension. However, for functions that have a huge number of plateaus, it may be impossible to visit every single plateau within a reasonable amount of time (depending on computational resources).
To handle these cases we developed a heuristic algorithm based on spirals in two dimensions and on hyper-spheres in any dimension. A comparison with other maximization methods shows that this heuristic algorithm is quite good and fast.
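The one-dimensional case can be sketched as follows (a hypothetical helper, assuming the plateau boundaries - e.g. the distinct values a threshold parameter is compared against - are known in advance): evaluate one representative point per plateau, which is linear in the number of plateaus.

```python
def maximize_plateaus(score, breakpoints):
    """Global maximum of a piecewise-constant 1-D score function whose value
    can only change at the given breakpoints: test one point per plateau."""
    pts = sorted(set(breakpoints))
    candidates = [pts[0] - 1.0]                                  # leftmost plateau
    candidates += [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]  # interior plateaus
    candidates += pts                                            # the breakpoints themselves
    candidates.append(pts[-1] + 1.0)                             # rightmost plateau
    best = max(candidates, key=score)
    return best, score(best)
```

For example, for a toy rule "buy when the indicator exceeds t" whose gain is 1 exactly when 2 <= t < 5, passing the breakpoints [2, 5] recovers a maximizing parameter; the two- and higher-dimensional algorithms in the text generalize this plateau enumeration.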

CBRG, Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.

Depto. de Ciencias de la Computacion, Universidad de Chile, Casilla 2777, Santiago, Chile.

Centre for the New OED and Text Research, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1

We survey new indices for text, with emphasis on Pat arrays (also called suffix arrays). A Pat array is an index based on a new model of text that does not use the concept of a word and does not need to know the structure of the text.
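A minimal sketch of the idea (a naive construction shown for clarity; practical Pat-array construction is far more efficient): the index is simply the lexicographically sorted array of suffix starting positions, and any substring query becomes two binary searches over it.

```python
def build_pat_array(text):
    # Sort the starting positions of all suffixes (semi-infinite strings).
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, pat, query):
    """Count occurrences of `query` via two binary searches on the Pat array."""
    def bound(strict):
        # First index whose suffix-prefix is >= query (strict=False)
        # or > query (strict=True).
        lo, hi = 0, len(pat)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[pat[mid]:pat[mid] + len(query)]
            if prefix < query or (strict and prefix == query):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return bound(True) - bound(False)
```

Because queries compare raw character strings against suffixes, no notion of word boundaries or document structure is needed - exactly the point made above.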

CBRG, Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.

Sequence alignment is a central problem in bioinformatics. The classical dynamic programming algorithm aligns two sequences by optimizing over possible insertions, deletions and substitutions. However, other evolutionary events can be observed, such as inversions, tandem duplications or moves (transpositions). It has been established that the extension of the problem to move operations is NP-complete. Previous work has shown that an extension restricted to non-overlapping inversions can be solved in O(n^3) with a restricted scoring scheme. In this paper, we show that the alignment problem extended to non-overlapping moves can be solved in O(n^5) for general scoring schemes, O(n^4 log n) for concave scoring schemes and O(n^4) for restricted scoring schemes. Furthermore, we show that the alignment problem extended to non-overlapping moves, inversions and tandem duplications can be solved with the same time complexities. Finally, an example of an alignment with non-overlapping moves is provided.
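For reference, the classical dynamic programming recurrence over insertions, deletions and substitutions that the paper extends (a textbook global-alignment sketch with illustrative scores, not the paper's extended algorithm):

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with linear gap costs."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap                        # delete a[0:i]
    for j in range(1, m + 1):
        S[0][j] = j * gap                        # insert b[0:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,  # substitution / match
                          S[i - 1][j] + gap,      # deletion
                          S[i][j - 1] + gap)      # insertion
    return S[n][m]
```

This O(n^2) table is the base case; allowing a bounded set of non-overlapping moved blocks is what raises the complexities to the O(n^4)-O(n^5) range stated above.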

CBRG, Institute for Scientific Computation,
Swiss Federal Institute of Technology, Zurich, Switzerland.

Distance information is usually obtained from aligned sequences. Computation of distances plays a crucial role in many aspects of bioinformatics, in particular in phylogenetic studies. In this paper, we show that any method estimating distances directly from scoring matrices is not consistent. Consistency is the property of a method which guarantees that, given enough data, it will construct the correct tree. We then show how to correct this problem and obtain consistent methods from scoring matrices. Following this line, we give an algorithm to compute the most effective/powerful estimator (the one with the lowest variance) for distances. Finally we illustrate all the steps of these computations with a complete example: we derive an optimal scoring matrix to estimate the distances between human mitochondrial DNA sequences. This optimal scoring matrix is interesting in itself, as there is considerable interest in estimating the phylogenies of humans based on mtDNA.
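To illustrate what consistency means here (a standard textbook example - the Jukes-Cantor correction for DNA - not the paper's scoring-matrix construction): the raw fraction of differing sites saturates and therefore underestimates large distances, while the corrected estimator converges to the true distance as the sequences grow.

```python
import math

def p_distance(a, b):
    """Observed fraction of differing sites -- not a consistent distance
    estimate, since multiple substitutions at one site are counted at most once."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor_distance(a, b):
    """Consistent estimator under the Jukes-Cantor model of DNA evolution:
    d = -(3/4) * ln(1 - (4/3) * p)."""
    p = p_distance(a, b)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected value always exceeds the raw proportion of differences, and the gap widens with distance; the paper's contribution is an analogous, variance-optimal correction for general scoring matrices.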

Last updated on Fri Oct 19 12:40:46 MEST 2007 by ghg