Index of various papers, usually not easily available,
by Gaston H. Gonnet or colleagues
Ricardo Baeza-Yates and
Gaston H. Gonnet
Depto. de Ciencias de la Computacion, Universidad De Chile, Santiago, Casilla 2777, Chile
Department of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
Advances in Computing and Information - ICCI'90, International Conference on Computing and Information, Niagara Falls, Canada, May 23-26, 1990, pp 110-119
Abstract
We use matrix recurrences to analyze the expected behaviour of algorithms
on trees.
We apply this technique to the average case analysis of balanced search
trees and digital trees.
In particular we give the exact solution for a fringe analysis problem,
a thechnique used for search trees, that was unknown before.
This method also makes easier to solve some scalar recurrences.
BibTeX
Gaston H. Gonnet and
Ricardo A. Baeza-Yates
Data Structuring Group, Department of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
Information Processing Letters, Volume 34, Issue 5 (May 1990), pp 271-274, 1990
Abstract
We present an average case analysis of the Karp-Rabin string matching
algorithm.
This algorithm is a probabilistic algorithm, that adapts hashing techniques
to string searching.
We also propose an efficient implementation of this algorithm.
BibTeX
Ricardo Baeza-Yates,
Gaston H. Gonnet and
Nivio Ziviani
Depto. de Cs. de la Computacion, Universidad De Chile, Santiago, Casilla 2777, Chile
Department of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
Depto. de CIencia da Computacao, Universidade Federal de Minas Gerais,
Belo Horizonte, Minas Gerais, Brasil
Lecture Notes In Computer Science; Vol. 447, Proceedings of the 2nd Scandinavian Workshop on Algorithm Theory, Bergen, Norway, July 11-14, 1990, pp 143-159
Abstract
In this paper we improve previous bounds on expected measures of AVL
trees by using fringe analysis.
A new way of handling larger tree collections that are not closed is
presented.
An inherent difficulty posed by the transformations necessary to keep the
AVL tree balanced makes its analysis difficult when using fringe analysis
methods.
We derive a technique to cope with this difficulty obtaining the exact
solution for fringe parameters even when unknown probabilities are involved.
We show that the probability of a rotation in an insertion is between
0.37 and 0.73, that the fraction of balanced nodes is between 0.56 and 0.78,
and that the expected number of comparisons in a search seems to be at most
12% more than in the complete balanced tree.
BibTeX
Gaston H. Gonnet and
Steven A. Benner
Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.
Laboratorium fuer Organische Chemie, Swiss Federal Institute of Technology, Zurich, Switzerland.
Technical report 154 of the Department Informatik, March 1991
Abstract
The Computational Biochemistry Research Group (CBRG) is a research
group at the Eidgenoessische Technische Hochschule Zurich, (ETH)
involving researchers from Organic Chemistry and Computer Science.
Its main research goal is the derivation of biological/genetic
information and theories through the analysis of genetic sequences.
Philippe Flajolet,
Gaston Gonnet,
Claude Puech, and
J. M. Robson
INRIA, Rocquencourt, F-78150 Le Chesnay, France.
Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.
LIENS, Ecole Normale Superieure, 45 rue d'Ulm, F-75005 Paris, France
Department of Computer Science, Australian National University, Canberra ACT 2601, Australia
Second Annual ACM-SIAM Symposium on Discrete Algorithms, (SODA), San Francisco, January 1991, pp 100-109
Abstract
Quadtrees constitute a hierarchical
data structure which permits fast access to multi-dimensional data.
This paper presents the analysis of the expected cost of various types
of searches in quadtrees - fully specified and partial match queries.
The data model assumes random points with independently drawn
coordinate values.
The analysis leads to a class of "full-history" divide-and-conquer
recurrences.
These recurrences are solved using generating functions, either exactly
for dimension d=2, or asymptotically for higher dimensions.
The exact solutions involve hypergeometric functions.
The general asymptotic solutions rely on the classification of
singularities of linear differential equations with analytic
coefficients, and on singularity analysis techniques.
These methods are applicable to the asymptotic solution of a
wide range of linear recurrences, as may occur in particular in the
analysis of multidimensional searching problems.
BibTeX
GH Gonnet,
MA Cohen, and
SA Benner
Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.
Science, Vol 256, Issue 5062, June 5, 1992, pp 1443-1445
Abstract
The entire protein sequence database has been exhaustively matched.
Definitive mutation matrices and models for scoring gaps were
obtained from the matching and used to organize the
sequence database as sets of evolutionarily connected components. The
methods developed are general and can be used to
manage sequence data generated by major genome sequencing projects.
The alignments made possible by the exhaustive matching are
the starting point for successful de novo prediction of
the folded structures of proteins, for reconstructing sequences of
ancient proteins and metabolisms in ancient organisms, and for
obtaining new perspectives in structural biochemistry.
Steven A. Benner,
Mark A. Cohen and
Gaston H. Gonnet
Institute for Organic Chemistry, and
Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich.
Journal of Molecular Biology, Vol 229, Issue 4, Feb 20, 1993, pp 1065-1082.
Abstract
The exhaustive matching of the protein sequence database makes possible a broadly
based study of insertions and deletions (indels) during divergent evolution.
In this study, the probability of a gap in an
alignment of a pair of homologous protein sequences was found
to increase with the evolutionary distance measured in PAM units
(number of accepted point mutations per 100 amino acid residues).
A relationship between the average number of amino acid residues
between indels and evolutionary distance suggests that a unit
30 to 40 amino acid residues in length remains,
on average, undisrupted by indels during divergent evolution. Further,
the probability of a gap was found to be
inversely proportional to gap length raised to the 1.7
power. This empirical law fits closely over the entire
range of gap lengths examined. Gap length distribution is
largely independent of evolutionary distance. These results rule out
the widely used linear gap penalty as a satisfactory
formula for scoring gaps when constructing alignments. Further, the
observed gap length distribution can be explained by a
simple model of selective pressures governing the acceptance of
indels during divergent evolution. Finally, this model provides theoretical
support for using indels as part of "parsing algorithms",
important in the de novo prediction of the folded
structure of proteins from the sequence data.
Peter James,
Manfredo Quadroni,
Ernesto Carafoli and
Gaston Gonnet
Protein Chemistry Laboratory and Department of Biology,
Institute for Biochemistry and
Institute for Scientific Computation, Swiss Federal Institute of
Technology (E.T.H.), Zurich.
Biochemical and Biophysical Research Communications, Vol 195, Issue 1, Aug 31, 1993, pp 58-64
Abstract
We have developed an algorithm for identifying proteins at the
sub-microgram level without sequence determination by chemical degradation. The
protein, usually isolated by one- or two-dimensional gel electrophoresis,
is digested by enzymatic or chemical means and the
masses of the resulting peptides are determined by mass
spectrometry. The resulting mass profile, i.e., the list of
the molecular masses of peptides produced by the digestion,
serves as a fingerprint which uniquely defines a particular
protein. This fingerprint may be used to search the
database of known sequences to find proteins with a
similar profile. If the protein is not yet sequenced
the profile can serve as a unique marker. This
provides a rapid and sensitive link between genomic sequences
and 2D gel electrophoresis mapping of cellular proteins.
Thomas Meyer
Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.
Diploma Project, Nov 1st, 1999 - Feb 29th, 2000
Abstract
In practice, there exist three different "religions" about stock market prediction. First of all, there is the Random
Walk hypothesis, which basically says that the stock market is not predictable at all. In the second theory - the
Fundamental Analysis - forecasts are based on economic data. Finally, there is the Technical Analysis approach,
which is trying to predict future price changes using only historical prices and volumes. In this work, we are
studying the third approach.
We consider twenty different models based on various technical data. These models depend also on a small
number of parameters. In our prediction process, we try to maximize our gain by making one decision every day
- either buy or sell. To make the decision at a certain day, we use a model whose parameters are optimized to
yield a maximum gain over a particular "training time" - the most recent 5, 10, 25, 50, 75 or 125 past days.
We test the models and prediction process with various stocks over a simulation time of typically half a year.
One goal is to find the optimal training time for a certain model. In general, we found that we cannot expect to
find a certain model with a certain training time that can be applied to any stock, as every stock is unique in its
behaviour, for example different type, activity, popularity, investors, etc....
The tests also showed that it is rather difficult to outperform the buy & hold strategy if the stock performance is
good. Statistical analysis shows that random models (models whose decisions are randomly buy or sell) have a
mean gain of about half that of the buy & hold strategy. More importantly, the gain produced by random models
typically has a high standard deviation; this makes it difficult to show that a given (presumably non random)
model is not just good on the test-set by coincidence.
From the mathematical point of view, Technical Analysis is more like voodoo. On the one hand, it seems that the
indicators of the Technical Analysis approach are rather bad and are not able to outperform the buy & hold
strategy in general. On the other hand, forecasts could become true if there exist many investors who believe in
the indicators - as a kind of a self-fulfilling prophecy. In any case, it is still unknown whether short-term price
movements in the stock market are predictable or not.
Although the models we choose here are based on the Technical Analysis approach (mainly because data
procurement was easier) the main focus of this work is on the prediction and maximization methods; these
methods remain exactly the same also for models based on fundamental data or even other kinds of prediction
problems.
The computationally most expensive part of prediction problems is the maximization of score-functions. Scorefunctions
return a score (in our case the gain) for the simulation with certain model parameters on the training
set. In our case we need to find a maximum of the score-function every time we make a single prediction: the
model parameters need to be re-optimized over the training time.
Such score-functions are very special in a way that all efficient standard maximization methods fail - the
functions are constant on their arguments (the gradient at almost all points is 0). To guarantee the we have found
the global maximum of such a function, we must visit every single plateau.
For functions in one dimension, there is an known algorithm that has running time linear in the number of
plateaus. For linear models, we have developed an efficient algorithm that is also able to find the global
maximum in two dimensions. We propose an extension of this algorithm to any dimension.
However, for functions that have a huge number of plateaus, it may be impossible to visit every single plateau
within a reasonable amount of time (depending on computational resources). To handle these cases we
developed a heuristic algorithm based on spirals in two dimensions and on hyper-spheres in any dimension. A
comparison with other maximization methods shows that this heuristic algorithm is quite good and fast.
Gaston H. Gonnet,
Ricardo A. Baeza-Yates and
Tim Snider
CBRG, Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.
Depto. de Ciencias de la Computacion, Universidad de Chile, Casilla 2777, Santiago, Chile.
Centre for the New OED and Text Research, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
Technical report of the Centre for the New OED
Abstract
We survey new indices for text, with emphasis on Pat arrays (also called
suffix arrays).
A Pat array is an index based on a new model of text which does not use the
concept of word and does not need to know the structure of the text.
Christian Ledergerber and
Christophe Dessimoz
CBRG, Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.
Slides of the presentation at the conference (COCOON 2007, Banff, Alberta, Canada, LNCS 4598, pp. 151-164)
Abstract
Sequence alignment is a central problem in bioinformatics.
The classical dynamic programming algorithm aligns two sequences by
optimizing over possible insertions, deletions and substitution.
However, other evolutionary events can be observed, such as inversions,
tandem duplications or moves (transpositions).
It has been established that the extension of the problem to move
operations is NP-complete.
Previous work has shown that an extension restricted to non-overlapping
inversions can be solved in O(n^3) with a restricted scoring scheme.
In this paper, we show that the alignment problem extended to
non-overlapping moves can be solved in O(n^5) for general scoring schemes,
O(n^4 logn) for concave scoring schemes and O(n^4)
for restricted scoring schemes.
Furthermore, we show that the alignment problem extended to non-overlapping
moves, inversions and tandem duplications can be solved with
the same time complexities.
Finally, an example of an alignment with non-overlapping moves is provided.
BibTeX
Gaston Gonnet and
Chantal Korostensky
CBRG, Institute for Scientific Computation,
Swiss Federal Institute of Technology, Zurich, Switzerland.
Unpublished document, Dec 2004
Abstract
Distance information is usually obtained from aligned sequences.
Computation of distances plays a crucial role in many aspects of
bioinformatics, in particular in phylogenetic studies.
In this paper, we show that any method estimating distances directly
from scoring matrices is not consistent.
Consistency is the property of a method which guarantees that
given enough data it will construct the correct tree.
Then we show how to correct this problem and to obtain consistent
methods from scoring matrices.
Following this line, we give an algorithm to compute the most
effective/powerful estimator (the one with the lowest variance)
for distances.
Finally we illustrate all the steps of these computations with
a complete example. We derive an optimal scoring matrix to
estimate the distances between human mitochondrial DNA sequences.
This optimal scoring matrix is interesting in itself, as there
is quite a bit of interest in estimating the phylogenies of
humans based on mDNA.
Last updated on Fri Oct 19 12:40:46 MEST 2007 by ghg