Over 100 participants from international research institutions, the private sector, and student communities gathered at the Fields Institute from February 1-4, 2000, for a workshop on Data Analysis for Commercial & Industrial Applications, organized by the Fields Institute and the Nortel Institute for Telecommunications of the University of Toronto.
The aim of the workshop was to bridge leading-edge mathematical techniques with commercial and industrial applications of data analysis and to present problems motivated by commercial and industrial needs.
Speakers presented ongoing research and data analysis challenges while sharing mathematical results and ideas on data analysis across the mathematics, statistics, physics, biophysics, computer science, telecommunications, and engineering communities. The mathematical methodologies discussed included stochastic processes and Markov chains, nonlinear dynamics and nonlinear time-series analysis, multifractal analysis, data mining, and data and signal processing. MITACS and Nortel Networks sponsored the workshop.
Program and Abstracts
Tuesday, February 1, 2000
11:00-11:30 - Opening address
Claudine Simson, V.P., Disruptive Technology, Network and Business Solutions, Nortel Networks
11:30-12:30 - Modern data analysis and its application to Nortel Networks data
Otakar Fojt, The University of York
In this talk we outline an approach to the analysis
of sequential manufacturing and telecom traffic data from industry using
techniques from nonlinear dynamics. The aim of the talk is to show the
potential of nonlinear techniques for processing real-world data and
developing new advanced methods of commercial data analysis.
The basic idea is to consider a factory as a
dynamical system. A process in the factory generates data, which contains
information about the state of the system. If it is possible to analyse
this data in such a way that knowledge of the system is increased, control
and decision-making processes can be improved, giving the factory a basis for competitive advantage.
First, we give details of the general idea and the
type of recorded data together with the necessary preprocessing
techniques. We follow this with a description of our analysis. Our
approach consists of state space reconstruction, applications of principal
component analysis and nonlinear deterministic prediction algorithms. The
talk will conclude with our results and with suggestions for future work.
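As a rough illustration of this pipeline, the sketch below performs a delay-coordinate state space reconstruction followed by principal component analysis. The file name, embedding dimension, delay, and number of retained components are illustrative assumptions, not details of the actual Nortel analysis.

```python
# Illustrative sketch only: delay embedding + PCA, not the speaker's code.
import numpy as np

def delay_embed(x, dim, tau):
    """Reconstruct a state space from a scalar series by stacking
    delay vectors (x[t], x[t+tau], ..., x[t+(dim-1)*tau])."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

x = np.loadtxt("series.txt")            # hypothetical measured series
X = delay_embed(x, dim=8, tau=2)        # parameters are a modelling choice

# Principal component analysis of the reconstructed states: keep the
# leading directions, which carry most of the deterministic structure.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:3].T               # project onto first 3 components
```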
1:30-2:00 - The need for real-time data analysis in telecommunications
Chris Hobbs, Sr. Mgr., System Architecture, Nortel Networks
A telecommunications network typically comprises many independently-controlled layers: from the physical fibre interconnectivity, through wavelengths, STS connexions, ATM Virtual Channels and MPLS Paths, to the end-to-end connexions established for user services. Each of these layers generates statistics that, in a large network, may easily be measured in tens of gigabytes per hour.
Traditionally, the layers have been controlled
individually since the complexity of "tuning" a lower layer to the traffic
it is carrying has been too great for human operators (particularly where
the carried traffic itself has complex statistics) and since the work
involved in moving connexions (particularly fibres and wavelengths) has
been prohibitive. Technological advances in optical switches, capable of logically relaying fibres or wavelengths in microseconds, have made flexible network rebalancing possible, and carriers, the owners of these large networks, are demanding lower costs by combining layers and exploiting this new agility. To address this problem, the terabytes of data extracted daily from large networks need to be analysed: initially statically, to determine the gross inter-related behaviours, and then dynamically, to detect and react to changing traffic patterns.
2:30-3:30 - Noise reduction for human speech using chaos-like features
Holger Kantz, Max-Planck-Institut für Physik komplexer Systeme
A local projective noise reduction scheme,
originally developed for low-dimensional stationary signals, is
successfully applied to human speech. This is possible by exploiting
properties of the speech signal which mimic structure exhibited by
deterministic chaotic systems. In high-dimensional embedding spaces, the
strong non-stationarity is resolved as a sequence of different dynamical
regimes of moderate complexity. This filtering technique does not make use
of the spectral contents of the signal and is far superior to the
Ephraim-Malah adaptive filter.
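The following is a minimal sketch of a local projective filter of this general kind, assuming a scalar signal and a brute-force nearest-neighbour search; the embedding dimension, projection rank, and neighbourhood size are illustrative choices, not the exact scheme described in the talk.

```python
# Illustrative local projective noise reduction (quadratic-time demo).
import numpy as np

def delay_embed(x, dim, tau=1):
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

def local_projective_filter(x, dim=10, q=2, k=30):
    """Project each delay vector onto the q leading local principal
    directions of its k nearest neighbours, then average the corrected
    coordinates back into a scalar series."""
    X = delay_embed(x, dim)
    y, counts = np.zeros(len(x)), np.zeros(len(x))
    for i, v in enumerate(X):
        dists = np.linalg.norm(X - v, axis=1)
        nbrs = X[np.argsort(dists)[:k]]
        mu = nbrs.mean(axis=0)
        _, _, Vt = np.linalg.svd(nbrs - mu, full_matrices=False)
        v_clean = mu + Vt[:q].T @ (Vt[:q] @ (v - mu))
        for j in range(dim):            # scatter corrected coordinates back
            y[i + j] += v_clean[j]
            counts[i + j] += 1
    return y / np.maximum(counts, 1)
```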
4:00-5:00 - Scaling phenomena in telecommunications
Murad Taqqu, Boston University (lecture co-sponsored by the Dept. of Statistics, University of Toronto)
Ethernet local area network traffic appears to be
approximately statistically self-similar. This discovery, made about eight
years ago, has had a profound impact on the field. I will try to explain
what statistical self-similarity means and how it is detected. I will also
indicate how its presence can be explained physically, by aggregating a
large number of "on-off" renewal processes, whose distributions are
heavy-tailed. As the size of the aggregation becomes large, then, after
rescaling, the behavior turns out to be the Gaussian self-similar process
called fractional Brownian motion. If, however, the rewards instead of
being 0 and 1 are heavy-tailed as well, then the limit is a stable
non-Gaussian process with infinite variance and dependent increments.
Since linear fractional stable motion is the stable counterpart of the
Gaussian fractional Brownian motion, a natural conjecture is that the
limit process is linear fractional stable motion. This conjecture, it
turns out, is false. The limit is a new type of infinite variance
self-similar process.
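As a sketch of how self-similarity is detected in practice, one standard approach is the aggregated-variance method: for a self-similar process the variance of block means over blocks of size m scales as m^(2H-2), so a log-log slope estimates the Hurst parameter H. The code below is a generic illustration, not the speaker's.

```python
# Aggregated-variance estimate of the Hurst parameter H (illustrative).
import numpy as np

def hurst_aggregated_variance(x, block_sizes):
    log_m, log_v = [], []
    for m in block_sizes:
        nblocks = len(x) // m
        means = x[: nblocks * m].reshape(nblocks, m).mean(axis=1)
        log_m.append(np.log(m))
        log_v.append(np.log(means.var()))
    slope, _ = np.polyfit(log_m, log_v, 1)  # slope should be 2H - 2
    return 1 + slope / 2

# Sanity check: white noise has no long-range dependence, so H ~ 0.5.
x = np.random.default_rng(0).standard_normal(100_000)
print(hurst_aggregated_variance(x, [10, 20, 50, 100, 200, 500]))
```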
Wednesday, February 2, 2000
9:30-10:30 - Electrical/Biological networks of nonlinear neurons
Henry Abarbanel, Institute for Nonlinear Science, UCSD
Using analysis tools for time series from nonlinear
sources, we have been able to characterize the chaotic oscillations of
individual neurons in a small biological network that controls simple
behavior in an invertebrate. Using these characteristics, we have built
computer simulations and simple analog electronic circuits, which
reproduce the biological oscillations. We have performed experiments in
which biological neurons are replaced by the electronic neurons retaining
the functional behavior of the biological circuits. We will describe the
nonlinear analysis tools (widely applicable), the electronic neurons, and
the experiments on neural transplants.
11:00-11:30 - E-commerce and data mining challenges
Weidong Kou, IBM Centre for Advanced Studies
E-commerce over the Internet is having a profound impact on the global economy. Goldman, Sachs & Co. estimates that B2B e-commerce revenue alone will grow to $1.5 trillion (US) over the next five years. Electronic commerce is becoming a major channel for conducting business, with an increasing number of organizations developing, deploying and installing e-commerce products, applications and solutions.
With rapid e-commerce growth come many challenges: for example, how to analyze e-commerce data and provide an organization with meaningful information to improve its product and service offerings to target customers, and how to group the millions of web users who access a web site so that the organization can serve each group better, reduce business costs and increase revenue. These challenges bring many opportunities for data mining researchers to develop better intelligent algorithms and systems that solve practical e-commerce problems. In this talk, we will use IBM Net.Commerce as an example to explain e-commerce development and the challenges that we face today.
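The user-grouping problem described above is commonly attacked with clustering. The sketch below runs a plain k-means on hypothetical per-user session features; the feature set and number of segments are illustrative assumptions, not IBM Net.Commerce specifics.

```python
# Illustrative k-means segmentation of web users (not IBM's algorithm).
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Alternate between assigning points to the nearest centre and
    moving each centre to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

# Stand-in for per-user features such as [visits, pages/visit, spend].
X = np.random.default_rng(1).random((10_000, 3))
labels, centers = kmeans(X, k=5)        # 5 segments is an arbitrary choice
```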
11:30-12:00 - Occurrence of ill-defined probability distributions in real-world data
John Hudson, Advisor, Radio Technology, Nortel Networks
In many communications problems the statistics of the data, communication channels, and behaviour of users are ill defined and not handled well by the simpler concepts of classical probability theory. We can have data with alpha-stable (infinite variance) characteristics, long-tailed and large-variance lognormal distributions, self-similarity in the time domain, and so on. If the higher moments of the underlying distributions do not exist or take disproportionate values, then the laws of large numbers and the central limit theorem may not be safely applied to a surprising number of problems. The behaviour of some control mechanisms can begin to take on a chaotic appearance when driven by such data. In this talk, some of the properties of data, channels and systems that confront workers in the communication field are discussed, illustrated with examples taken from network data traffic, Internet browsing, radio propagation, video images, speech statistics and so on.
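A small numerical illustration of the point about higher moments: for Pareto data with tail index alpha < 2 the variance is infinite, so the running sample variance never settles down the way classical limit theorems would suggest.

```python
# Running sample variance of infinite-variance data (illustrative).
import numpy as np

rng = np.random.default_rng(1)
alpha = 1.5                              # tail index < 2: infinite variance
x = rng.pareto(alpha, 1_000_000) + 1.0
for n in (10**3, 10**4, 10**5, 10**6):
    print(n, x[:n].var())                # keeps jumping as n grows
```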
1:30-2:30 - The analysis of experimental time series
Tom Mullin, The University of Manchester
We will discuss the application of
modern dynamical systems time series analysis methods to data from
experimental systems. These will include vibrating beams, nonlinear
oscillators and physiological measures. The emphasis will be placed on
obtaining quantitative estimates of the essential dynamics. We will also
describe the application of data synergy methods to multivariate data.
2:30-3:00 - Fuzzy-pharmacology: Rationale and applications
Beth Sproule, Faculty of Pharmacy and Department of Psychiatry (Psychopharmacology), Sunnybrook Health Sciences Centre, Toronto
Pharmacological investigations are undertaken in order to optimize the use of medications. The complexity and variability associated with biological data have prompted our explorations into the use of fuzzy logic for modeling pharmacological systems. Fuzzy logic approaches have been used in other areas of medicine (e.g., imaging technologies, control of biomedical devices, decision support systems); however, their use in pharmacology is incipient. We will present the results of preliminary studies in which we assessed the feasibility of using fuzzy logic: a) to predict serum lithium concentrations in elderly patients; and b) to predict the response of alcohol-dependent patients to citalopram in attempting to reduce their drinking. Many current projects have since evolved from this work. Approaches to this line of investigation will be presented.
3:30-4:30 - Geospatial backbones of environmental monitoring programs: the challenges of timely data acquisition, processing and visualization
Chad P. Gubala, Director, The Scientific Assessment Technologies Laboratory, University of Toronto
When considering ‘environmental’ issues
or legalities, a general and useful description of a pollutant is an
element or entity in the wrong place at the wrong time and perhaps in the
wrong amount. Prior to the establishment of cost-effective global
positioning, monitoring the fate and transport of environmental pollutants
was limited to reduced scale and statistically based sampling programs.
Whole systems models developed from parcels of environmental studies have
been limited in predictive capability due to unnoticed attributes,
undocumented synergies or antagonisms and un-quantifiable spatial and
temporal variances. Advances in commercial geospatial technologies and high-speed sensor arrays now offer the possibility of assessing a whole ecosystem in near real time and in a spatially complete manner. This capacity should greatly
improve quantitative environmental modeling and the adaptive management
process, further ‘tuning’ the balance between global environments and
economies. However, the promise of increased knowledge about our natural
resources is now limited by our capacity to move the data collected from
integrated geopositioning and sensor systems into meaningful management
products. This talk describes these limitations and addresses the needs
for developments in the areas of real time analytical protocols.
4:30-5:00 - Data mining and its challenges in the banking industry
Chen Wei Xu, Manager, Statistical Modeling, Customer Knowledge Management, Bank of Montreal
Thursday, February 3, 2000
9:30-10:30 - Elements of fuzzy system modeling
I.B. Turksen, University of Toronto
In most system modeling methodologies, we attempt to
find out, in an inductive manner, how a particular system behaves. That
is, we essentially try to determine how the input factors affect the
performance measure of our concern. There are at least three approaches to
system modeling: (1) personal experience, (2) expert interviews and
teachings, and (3) data mining with historical data.
In all these approaches, there are two fundamental theoretical base structures for system modeling: (1) classical two-valued set and logic theory based functional analyses, and/or (2) novel (35 years old) infinite (fuzzy) valued set and logic based super functional analyses. Furthermore, there are two basic learning methods in these approaches: (1) unsupervised learning and (2) supervised learning. The basic difference between the two is that the first has no goal whereas the second has a goal; generally, the goal of supervised learning is to ensure that the error between the model result and the actual is minimized. In classical two-valued set and logic based functional analyses, the world and its systems are seen through the two-valued, black-and-white, restricted view of what are called clear patterns. Unfortunately, first, the two-valued dichotomy forces one to make arbitrary choices when there are many alternatives to choose from. Secondly, a functional view can, by its very definition, only represent many-to-one mappings. Thirdly, the combination of variables is assumed to be additive and multiplicative, leading to a linear superposition schema in the functional representation of systems. In this view, logical “OR-ness” is simply mapped to algebraic addition and “AND-ness” to algebraic multiplication. Fourthly, imprecision in data is generally assumed to originate from random occurrences.
In fuzzy (infinite) valued set and logic based super functional analyses, by contrast, the world and its systems are seen through information granules, which admit an unrestricted view of fuzzy patterns. Fortunately, first, we are not forced to make arbitrary choices but have the freedom to choose the gradation appropriate for a given situation. Secondly, the super functional view allows many-to-many mappings: membership functions are identified to specify patterns via fuzzy cluster analyses, and cluster-to-cluster mappings established over these functions give us super functional representations. Thirdly, the combination of variables is generally super-additive or sub-additive, requiring highly nonlinear representations; in fuzzy theory there are infinitely many ways to represent “AND-ness” (conjunction) and “OR-ness” (disjunction), depending on context and the behavior of a given system. Fourthly, imprecision in data is generally deterministic, arising from the limitations of our measurement devices.
In our integrated fuzzy system modeling approach, we first use fuzzy clustering techniques to learn patterns, with fuzzy scatter matrices and diagrams, and to determine the essential fuzzy clusters, i.e., the effective rules of system behavior. This is an unsupervised learning method. Next we fit membership functions to these clusters, and we determine the significant and critical variables that affect the system behavior drastically, moderately, etc. Later, we apply supervised learning to determine the nonlinear operators that combine the fuzzy clusters in many-to-many maps of input and output variables, in order to achieve minimum system model error. In this supervised learning we also implement compensation and compromise between the extreme values of the formulas that specify the combination of concepts, and hence the appropriate combination of variables, as well as alternate inference schemas. Real-life system model building
examples include: (1) a continuous caster model that attempts to balance
tardiness of customer delivery due dates versus mixed grade steel
production and (2) pharmacological models that attempt to determine the
effects of medication on humans. Simulated system model building examples
include: (1) utilization of Internet data links, (2) analyses of traffic
characteristics, and (3) discard rate prediction.
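As a hedged sketch of the unsupervised first stage, the code below implements standard fuzzy c-means clustering, in which every point receives a graded membership in each cluster rather than a crisp label. The fuzzifier m = 2 and the cluster count are conventional illustrative choices, not the speaker's settings.

```python
# Illustrative fuzzy c-means clustering (graded memberships).
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)    # rows are membership distributions
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers, axis=2) + 1e-12
        U = 1.0 / d ** (2 / (m - 1))     # closer clusters get more membership
        U /= U.sum(axis=1, keepdims=True)
    return U, centers

X = np.random.default_rng(2).random((500, 2))   # stand-in for process data
U, centers = fuzzy_c_means(X, c=3)       # U[i, j]: membership of i in cluster j
```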
11:00-12:00 - A Steel Industry Viewpoint on Fuzzy Technology - Scheduling Analysis Application
Michael Dudzic, Manager, Process Automation Technology, Dofasco Inc.
This presentation will discuss experiences in the use of fuzzy expert system technologies as applied in a proof-of-concept project looking at two specific issues in scheduling the #1 Continuous Caster at Dofasco. This talk complements I. B. Turksen’s talk on Elements of Fuzzy System Modeling.
& The Application of Multivariate Statistical Technologies at Dofasco
This presentation will discuss experiences in the use of multivariate statistics (Principal Component Analysis and Partial Least Squares) in applications at Dofasco. The focus example will be the on-line monitoring system at the #1 Continuous Caster.
1:00-2:00 - Recent developments in decision tree models
Hugh Chipman, University of Waterloo
Decision trees are an appealing predictive model
because of their interpretability and flexibility. In this talk, I will
outline some recent developments in decision tree modeling, including
improvements in model search techniques, and enrichments to the tree
model, such as linear models within terminal nodes.
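One of the enrichments mentioned, linear models within terminal nodes, can be sketched with off-the-shelf tools: fit an ordinary regression tree, then replace each leaf's constant with a linear model fit to the observations landing in that leaf. The depth and synthetic data below are illustrative, not from the talk.

```python
# Illustrative regression tree with linear models in the terminal nodes.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(1000, 2))
y = np.where(X[:, 0] > 0, 3 * X[:, 1], -2 * X[:, 1]) + rng.normal(0, 0.1, 1000)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
leaf_of = tree.apply(X)                  # terminal node index per sample
leaf_models = {leaf: LinearRegression().fit(X[leaf_of == leaf],
                                            y[leaf_of == leaf])
               for leaf in np.unique(leaf_of)}

def predict(X_new):
    """Route each point to its leaf, then apply that leaf's linear model."""
    leaves = tree.apply(X_new)
    return np.array([leaf_models[l].predict(x[None, :])[0]
                     for l, x in zip(leaves, X_new)])
```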
2:30-3:30 - A hybrid predictive model for database marketing
Zhen Mei, Generation 5
We discuss a simple hybrid approach for predicting response rates in mailing campaigns and for predicting certain demographic and expenditure characteristics in customer databases. This method is based on cluster analysis and predictive modeling. As an example, we model home ownership for the State of New York.
& Missing value filling
Wenxue Huang, Generation 5
The
talk describes the missing value filling methodology and software being developed by Generation 5, focusing on the mathematics for interval-scaled target data. A local-and-global (or vertical-and-horizontal) balanced approach in a multivariate, large-database setting will be discussed. The methodology and software may also be applied to prediction: filling in missing values is equivalent to predicting instant target values from reliable, complete historical records and current incomplete input.
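A hedged sketch of this prediction view of imputation: fill a record's missing interval-scaled value from the k most similar complete records. This is generic nearest-neighbour imputation, standing in for (not reproducing) Generation 5's methodology.

```python
# Illustrative k-nearest-neighbour filling of one interval-scaled column.
import numpy as np

def knn_impute(records, target_col, k=5):
    """records: 2-D float array with np.nan marking missing entries.
    Fills missing target_col values from the k closest complete records."""
    complete = records[~np.isnan(records).any(axis=1)]
    other = [j for j in range(records.shape[1]) if j != target_col]
    out = records.copy()
    for i, row in enumerate(records):
        if np.isnan(row[target_col]) and not np.isnan(row[other]).any():
            d = np.linalg.norm(complete[:, other] - row[other], axis=1)
            nbrs = complete[np.argsort(d)[:k], target_col]
            out[i, target_col] = nbrs.mean()
    return out
```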
4:00-5:00 - Challenges in the development of segmentation solutions in the banking industry and a genetic algorithms approach
Chris Ralph, Senior Manager, Market Segmentation, Bank of Montreal
The Bank of
Montreal team is in the process of building market segmentation solutions
for a few different lines of business using syndicated survey data. The
dataset consists of 4,200 responses from households across Canada
(geographically unbiased sample), and contains detailed information on
their financial holdings across all institutions, as well as channel
usage, banking habits, and household profile information. The process we
typically follow in the development of a segmentation solution consists of
the following steps:
1) Standard preprocessing (treating outliers, missing values, standardization) --> 3-5 days
2) Data reduction via factor analysis, PCA, or simple cross-correlations to help avoid redundancy in the cluster runs --> 2-3 days
3) Brainstorming sessions with the lines of business to help us understand key business issues and generate a list of potential driver variables --> 1-2 weeks
4) Alternative cluster runs using the brainstorming suggestions and data reduction output to generate potential solutions through trial and error --> 2-4 weeks
The evaluation of solutions in Step 4 involves
making trade-offs between the number of clusters, cluster size, cluster
overlap, and the degree to which the current solution meets the needs of
the business as determined through the brainstorming sessions. This is
usually a painful process that relies heavily on the experience of the
analyst to bridge the gap between cluster solution statistics and
relevance to the business. Given the highly manual nature of this task, we
can only evaluate a very small subset of the universe of possible
solutions, and different analysts will generate very different solutions.
The discussion will focus on the development of an
objective function which captures both business rules and cluster
statistics, and which allows for the evaluation and ranking of a much
larger number of potential solutions. The elements of the objective function will be described in fairly simple terms that apply to any segmentation problem, and I will show how genetic algorithms may be used to “evolve” potential solutions. An open discussion will be encouraged on ways to improve the encoding of the problem and the objective function, as well as on the challenges associated with integrating business rules. There are also plenty of issues surrounding the use of genetic algorithms to help optimize the search through the space of possible solutions. The current objective function
captures the business rules simply by measuring the average variance of
key “business driver” variables across the clusters, where these variables
have been selected ahead of time in cooperation with the line of business.
The higher the variance of these variables across the segments, the more
distinct and relevant the clusters should be. Average cluster overlap is calculated by building n-dimensional hyperspheres (where n = # of cluster drivers) around the centroids of the clusters, where the radius of the hypersphere is between 2 and 3 RMS standard deviations. Overlap is defined as occurring when any single observation falls within the hypersphere of a cluster to which it has not been assigned. Cluster size may be integrated into the objective function, where solutions are penalized for having clusters that are either too large or too small.
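The overlap measure lends itself to a direct sketch: an observation counts as overlapping when it falls inside the hypersphere of a cluster other than its own, with each radius a multiple of that cluster's RMS deviation from its centroid. How the terms are weighted into a single fitness value below is an illustrative guess, not the Bank of Montreal's actual objective function.

```python
# Illustrative pieces of a segmentation objective function.
import numpy as np

def overlap_fraction(X, labels, r_sigma=2.5):
    """Fraction of observations inside the hypersphere of a foreign cluster."""
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    radii = np.array([r_sigma * np.sqrt(((X[labels == k] - centroids[i]) ** 2)
                                        .sum(axis=1).mean())
                      for i, k in enumerate(ks)])
    d = np.linalg.norm(X[:, None, :] - centroids, axis=2)  # point-to-centroid
    inside = d < radii                                     # within each sphere
    own = labels[:, None] == ks[None, :]
    return (inside & ~own).any(axis=1).mean()

def driver_separation(X, labels, driver_cols):
    """Average across drivers of the variance of cluster means."""
    ks = np.unique(labels)
    means = np.array([X[labels == k][:, driver_cols].mean(axis=0) for k in ks])
    return means.var(axis=0).mean()

def fitness(X, labels, driver_cols, w_overlap=1.0):
    # higher driver separation is rewarded, overlap is penalized
    return (driver_separation(X, labels, driver_cols)
            - w_overlap * overlap_fraction(X, labels))
```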
Friday, February 4, 2000
9:30-10:30 - Interdisciplinary application of time series methods inspired by chaos theory
Thomas Schreiber, University of Wuppertal
We report on real world applications of time series
methods developed on the basis of the theory of deterministic chaos.
First, we demonstrate statistical criteria for the necessity of a
nonlinear approach. Nonlinear processes are not in general purely
deterministic. Then we discuss modified methods that can cope with noise
and nonstationarities. In particular, we will discuss nonlinear filtering,
signal classification, and the detection of nonlinear coherence between
processes.
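One standard statistical criterion of the kind mentioned is the surrogate data test: compute a nonlinear statistic on the data and compare it with the statistic's distribution over phase-randomized surrogates, which share the linear (spectral) properties of the original. The statistic and stand-in data below are illustrative.

```python
# Illustrative surrogate data test for nonlinearity.
import numpy as np

def phase_randomized_surrogate(x, rng):
    """Randomize Fourier phases while keeping the amplitude spectrum."""
    f = np.fft.rfft(x)
    phases = rng.uniform(0, 2 * np.pi, len(f))
    phases[0] = 0.0                      # keep the mean
    phases[-1] = 0.0                     # keep the Nyquist bin real
    return np.fft.irfft(np.abs(f) * np.exp(1j * phases), n=len(x))

def time_reversal_asymmetry(x):
    """Vanishes in expectation for any stationary linear Gaussian process."""
    return np.mean((x[2:] - x[:-2]) ** 3)

rng = np.random.default_rng(0)
x = np.empty(4096); x[0] = 0.4           # stand-in data: noisy logistic map
for i in range(1, len(x)):
    x[i] = 3.9 * x[i - 1] * (1 - x[i - 1])
x += 0.01 * rng.standard_normal(len(x))

stat = time_reversal_asymmetry(x)
surr = [time_reversal_asymmetry(phase_randomized_surrogate(x, rng))
        for _ in range(99)]
# if `stat` falls outside the surrogate range, reject the linear hypothesis
print(stat, np.percentile(surr, [2.5, 97.5]))
```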
11:00-12:00 - Symbolic data compression concepts for analyzing experimental data
Matt Kennel, Institute for Nonlinear Science, UCSD
1:00-2:00 - Geometric time series analysis
Mark Muldoon, University of Manchester Institute of Science and Technology
A
discussion of a circle of techniques, all developed within the last 20
years and all loosely organized around the idea that one can extract
detailed information about a dynamical system (say, the equations of
motion governing some industrial process...) by forming vectors out of
successive entries in a time series of measurements.
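A standard companion to the delay-vector idea is the false nearest neighbour test, often used to decide how many successive measurements to stack. The thresholds and test signal below are conventional illustrative values, not material from the talk.

```python
# Illustrative false nearest neighbour test for embedding dimension.
import numpy as np

def fnn_fraction(x, dim, tau=1, rtol=15.0):
    """Fraction of nearest neighbours in dimension `dim` that fly apart
    when the (dim+1)-th delay coordinate is revealed."""
    n = len(x) - dim * tau
    X = np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])
    x_next = x[dim * tau : dim * tau + n]
    false = 0
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        j = int(np.argmin(d))            # nearest neighbour in `dim` dims
        if abs(x_next[i] - x_next[j]) / max(d[j], 1e-12) > rtol:
            false += 1
    return false / n

# the smallest dim where the fraction drops near zero is a good embedding
rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 200, 2000)) + 0.01 * rng.standard_normal(2000)
for m in (1, 2, 3, 4):
    print(m, fnn_fraction(x, m))
```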
2:30-3:30 - Chaotic communication using optical and wireless devices
Henry Abarbanel, Institute for Nonlinear Science, UCSD
3:30-4:30 - Status of cosmic microwave background data analysis: motivations and methods
Simon Prunet, CITA (Canadian Institute for Theoretical Astrophysics), University of Toronto
After a brief review of the physics
that motivates measurements of Cosmic Microwave Background anisotropies, I
will present the current observational status, the analysis methods used
so far, and the challenge posed by the upcoming huge data sets from future
satellite experiments.