Public Workshop: Friday, August 10, 2001

Location: DIMACS Center, CoRE Building, Rutgers University

**J. Douglas Carroll**, Rutgers University, dcarroll@rci.rutgers.edu**Phipps Arabie**, Rutgers University, arabie@andromeda.rutgers.edu**Lawrence J. Hubert**, University of Illinois, lhubert@s.psych.uiuc.edu

**This material is based upon work supported by the National Science Foundation under Grant No. 0100921**

The term ``datamining'' has come to describe any type of data analysis, with an emphasis on the need to discern new, useful information from the large quantities of data stored in today's databases. Though the term ``datamining'' is relatively new, the fundamental ideas behind it are not, and the potential approaches to it are grounded in the basic theoretical and algorithmic approaches to data analysis developed over the last several decades. What is new is the challenge of modifying these approaches and developing new ones that can scale to the very large size of the problems that are faced or that apply to new types of data and in new types of applications. Most algorithms currently used in datamining do not scale well when applied to very large data sets, often because they rely on random access to the data sets, which scales only while the data sets fit entirely in relatively small main memories. Multidimensional scaling (MDS) is a case in point. The methodology has its roots in work going back to the first half of the 20th century and its modern roots in work of Shepard [B35, B36], Kruskal [B29, B30], and others. Yet, the traditional algorithms of MDS do not always work efficiently with today's large data sets and the traditional models of MDS are not always the most appropriate for the complex data types that we are seeing today. There has been in recent years a major trend toward the development of totally new and different algorithmic procedures for MDS as well as the modification of traditional procedures; an emphasis on the development of new models of MDS and on the development of new algorithmic procedures for fitting these new models; and a stress on the application of MDS to new types of data and new fields. These new algorithms, new models and new applications are being developed from different points of view and by researchers with different backgrounds and interests. It is our plan to get the developers of new models of, new algorithmic approaches to, and new applications of MDS together to share ideas and explore new approaches. We will also include experts in the applications areas and individuals who have not necessarily worked on MDS before but are expert in methodologies such as combinatorial optimization that are increasingly relevant to MDS.

Among the specific areas of emphasis for this working group are challenges to MDS research from:

- the field of graph layout/graph drawing;
- chemometrics;
- biomolecular conformation;
- telephone call graphs, scientific collaboration graphs and other large social networks;
- the application of methods of social networks to ecological data;
- the increasing stress on data visualization arising from such areas as graph layout, fraud and intrusion detection, and amino acid databases;
- the increasing importance for MDS of discrete models and hybrid models mixing spatial and discrete components;
- the application of methods from combinatorial optimization to fit MDS and MDS-related models to proximity data;
- the increasing emphasis on approximations and heuristic techniques that results from increasingly large data sets.
- the need to find a global optimum in ``nonmetric MDS'' when data sets are large and finding the best local optimum isn't feasible.

### Sources of Data

MDS is widely used in psychology, the field in which it was originally developed, to determine the underlying perceptual structure of important classes of psychological stimuli in terms of important perceptual dimensions underlying these stimuli. For example, in the case of color vision, numerous MDS studies have confirmed that the color space is 3-dimensional, with one dimension corresponding to a red-green dimension, a second corresponding to a yellow-blue dimension, and a third corresponding to lightness (or brightness) -- essentially the intensity, ranging from black to a very intense white. Similar studies have been done of perception of nations, of words describing personality traits, of kinship terms, of speech and acoustical stimuli, perception of human faces, and many other areas of psychology, often unearthing psychological dimensions underlying perception and other behavior that had not been anticipated in advance. We will involve in our working group an expert on the application of MDS to new problems in social and clinical psychology.In marketing, where MDS has also been widely applied, MDS of proximities is used to develop perceptual maps of products, which can be used to provide greater insight into the nature of that product class, and to devise predictive models of consumer choice. For example, an MDS map of cars will generally reveal, among others, a ``sportiness'' and a ``luxuriousness'' dimension, as well as more specific dimensions having to do with styling, fuel economy, overall size, etc. Data on preferences for cars by consumers can be used to fit either a vector or ideal point model or one can map either subject vectors or ideal points into the multidimensional perceptual space, in order to account for each individual consumer's preferences in terms of these perceptual dimensions. One application to marketing involves what is sometimes called ``gap analysis''; i.e., looking for gaps or empty spaces in the product map which, if filled, would correspond to the most preferred product for a substantial subgroup of consumers. Such gaps might represent opportunities for a product manager to develop a new product or products that would command a significant market share, and thus result in significant profits to his or her company. As noted in the discussion of our streaming data analysis working group, massive data sets from supermarkets provide important challenges for datamining. We will involve in our working group an expert on the use of MDS in marketing in the food industry.

MDS has been used for a long time in the study of social networks. Recently, there has been much interest in social networks called telephone call graphs, where the vertices are telephone numbers and the edges correspond to calls between them. Such call graphs also arise in the fraud and intrusion detection applications to be studied by the working group in streaming data analysis and mining. These graphs are extremely large. Buja, et al. [B5] use MDS to visualize a part of a call graph consisting of millions of vertices (data from [B1]). Social networks also arise from scientific collaborations ([B38]). In [B5], large scientific collaboration networks are visualized using MDS methodology, clearly identifying clusters. We will include experts on these new, large social networks in our working group.

MDS is used regularly in biology, for example in ecology to compare communities or other biological assemblages, in biological taxonomy to discriminate between populations and/or species on the basis of morphometric or genetic data, and in evolutionary theory. An early attempt to apply MDS methods to ecology is in [B6]. Ecological data of a multidimensional quantitative, semiquantitative, and qualitative nature of all kinds are described in the recent book [B32]. Recently, MDS methods have been used in [B34] in conjunction with methods developed to study role assignments in social networks. This is a fascinating new idea and challenge for MDS and an exciting application of social-scientific methods in the biological sciences. We will bring together in our working group those who have worked in the theory of social roles and those who are working on applying this theory to ecology, as well as experts in MDS and other classification methodologies useful in ecology.

MDS-like models arise in the field of biomolecular conformation in biochemistry. This work, concerned with reconstructing the geometry of molecules from neighborhood/distance information, goes back to the 1970's ([B18]), but has found modern stimulus through the growing field of computational chemistry ([B20], [B22]). While there are only 20 amino acids, the understanding of their interrelations, similarities, and differences is very complex and amino acid databases include hundreds of quantitative properties of such amino acids. To summarize properties of amino acids compactly, a readily visualizable mapping of these properties would be very helpful and could benefit from MDS-like methods. Early approaches to this problem can be found in [B21]. We will bring together biologists, experts in classification/clustering working on biomolecular and protein databases, and MDS experts interested in biochemistry in our working group. DIMACS is a natural place for such interdisciplinary work; our ``Special Focus on Mathematical Support for Molecular Biology'' has brought together mathematicians, statisticians, computer scientists, molecular biologists, and computational chemists with great success since 1994.

Broader models of MDS developed by psychometricians, mathematicians, and statisticians have been applied in recent years to chemical data by a group of researchers called TRICAP (TRIlinear models in Chemistry and Psychology). A recent special issue of the journal

*Chemometrics*(Vol. 14, #3, 2000) dealt with the methodology for and applications of multilinear models and the challenges for development of new, more sophisticated methods for handling 3-way and multi-way generalizations of factor and/or components analysis. Applications included data about polynuclear aromatic hydrocarbons, pesticides, and fluorescence. Our working group will include individuals who are interested in the relation between MDS and chemometrics.The field of graph layout/graph drawing has become a very important field from the point of view of information visualization. There is now an annual graph drawing conference and the drawing of large graphs arises in many contexts, including communication networks, electrical networks, wiring diagrams, etc. Graphs with thousands of vertices arise frequently in such applications. The paper [B5] describes the use of interactive graphics in MDS to handle rather large data sets in a graph layout context. (See [B33] for a predecessor paper.) DIMACS has sponsored a variety of programs on graph layout, including summer tutorial programs on network visualization, and is well connected to the research community in this field. We will involve people with interests in graph layout and MDS in the working group.

### A Narrow Definition of MDS

*MDS*in its narrow meaning starts with a matrix of*proximities*(similarities, dissimilarities, or other measures of ``closeness'') defined on pairs of stimuli or other objects. MDS consists of a family of*spatial models*in which these proximities are modeled by distances between points representing the objects in an unknown multidimensional metric space, with distances related to proximities via a linear, monotonic, or other simple function. The general problem of MDS,*narrowly*defined, is to solve for an estimate of this unknown multidimensional space.*Three-way MDS*(or individual differences MDS) deals with a number of proximity matrices, one for each of a number of subjects or others sources of data, which can be viewed as a three-way array (e.g., subjects by stimuli by stimuli). We think of two*modes*-- a mode corresponds to a set of entities (in our example, subjects and stimuli). Thus, a method of analysis of such data is referred to as*three-way, two-mode MDS*. The most widely used approach to three-way MDS entails fitting some variant of the INDSCAL model (for INdividual Differences Multidimensional SCALing Model) ([B8]). This model assumes weighted Euclidean distances with a different pattern of dimension weights for each subject or other source of data corresponding to the third way of the data array.### A Broader Definition of MDS

A broader definition of MDS would include non-spatial geometric models, otherwise known as*discrete geometric models*, for proximities. For example, we can use a tree structure as a target metric space. Here, objects are identified with terminal nodes and distances in the tree structure are defined by an ultrametric or a path length metric in which distance is the length of the unique path connecting two terminal nodes. Multiple tree structure models comprise two or more trees, with overall distance being defined as the sum of distances from each of the multiple trees. One can also use cluster structures with distances or other proximity measures defined in various ways. For example, in the ADCLUS/INDCLUS approach to overlapping clustering, similarity is defined by a weighted sum of common features defined as clusters of which the two objects are both members. This proximity measure can be shown to be inversely related to a distance which will, under the right conditions, correspond to ultrametric distances defined on a hierarchical tree structure (see [B10]).In addition to spatial and discrete geometric models, MDS also considers a family of

*hybrid models*comprising a mixture of spatial and discrete components; e.g., a mixture of an*R*-dimensional Euclidean spatial representation and a multiple tree structure.The broad definition of MDS also includes models for two-way, three-way and higher-way multivariate data. Certain models were called ``unfolding models'' by Coombs, but are perhaps better described as ``ideal point'' models. Here, scale values representing preferences are assumed to be inversely related to distances between ideal points representing subjects (say) and stimuli in the two-way case, or subjects by occasions (say) and stimuli in the three-way case (or, more generally, combinations of all ways/modes other than the stimulus mode and stimuli in the general

*N*-way case). Similar models apply when preference is replaced by any notion of ``dominance'' and preference scales by dominance scales. Another class of non-distance-based spatial models for*N*-way multivariate data comprise various bilinear, trilinear or multilinear models. Perhaps the simplest case, given two-way data (say subjects by stimuli preference data), is called the*vector model*. The two-way factor or principal components model can be viewed as a special case of the scalar products model underlying the vector model. In the case of three-way or higher-way data one general model assumed is the one called the CANDECOMP/PARAFAC model. This generalizes the bilinear scalar product model to a general multilinear model which can be viewed as a kind of generalized scalar product model. A more general multilinear model that is also considered is Tucker's three-mode factor analysis model, which can also be generalized to the multi-way case. A recent review of the field, based on the ``broad'' viewpoint described in this subsection, and with a discussion of examples of a wide variety of specific models and methods, is provided by Carroll and Arabie [B7].The increasingly varied types of applications of MDS and the increasingly complex types of data to which MDS methods are applied has led to this proliferation in models for MDS. In this working group, we will discuss a variety of such models and pinpoint conditions under which data sets might be most appropriately modeled by different types of spatial, discrete geometric, or hybrid models.

There have also been attempts to unify approaches to these more general MDS models. A recent paper [B9] formulated a general model and methodology called ``CANDCLUS'' that includes a large number of these models as special cases. The authors are in the process of implementing one special case involving fitting a hybrid model combining INDSCAL structure and the individual differences (overlapping) clustering model INDCLUS. Algorithms implementing a number of other special cases will be pursued by this working group. More generally, other unifying approaches to MDS models will be a major topic of discussion.

### Algorithms

Traditionally, most algorithms for fitting MDS models of various types have involved use of a combination of gradient based procedures and alternating least squares (ALS) for least squares fitting of such models. Some fairly recent algorithmic developments due to Hubert, Arabie, and Hesson-McInnis [B27] entailed use of dynamic programming for fitting such spatial MDS models as the two-way city block model. Hubert and Arabie [B24] had used a similar approach to solve the unidimensional MDS problem, following the argument of de Leeuw and Heiser [B19] that this unidimensional problem was basically a combinatorial optimization problem. Hubert and Arabie [B25, B26] and Hubert, Arabie and Meulman [B28] have also used dynamic programming to fit other MDS-related structures, including tree structure, and have speculated that perhaps a much wider class of such models are better approached via combinatorial than continuous optimization (or by a combination of the two). For example, one could use combinatorial techniques to find an optimal ordering of stimuli on each dimension, followed by continuous optimization to fit continuous parameters subject to resulting ordinal constraints.Brusco and Stahl [B3, B4], Chen [B17] and Simantiraki [B37] have applied combinations of linear programming combined with simulated annealing and mixed integer-linear programming to solving MDS problems, particularly involving either unidimensional problems or MDS problems with city block or other

*L*metrics, via optimizing a Least Absolute Deviation (LAD) or Minimax Deviation fit measure.^{p}One goal of this working group is to explore use of such approaches as dynamic, linear, mixed integer-linear, and other mathematical programming techniques instead of, or in combination with, continuous optimization, to fit various MDS and MDS-related models to both proximity and more general multi-way data. Most of these techniques are sufficiently computationally intensive that, even with relatively small problems, they will generally require considerable computer time, and some problems may be beyond reach via such methods without also utilizing such supplementary techniques as simulated annealing, or heuristic methods designed to produce very good starting points. Even good local-search heuristics can be computationally inefficient given today's large data sets and need new methods such as the ``morphing'' process of Brusco [B2]. DIMACS is a natural place in which to explore the use of methods of dynamic, linear, mixed integer-linear programming. The center has an active and well-known research group in this field and just recently concluded a ``Special Year in Large Scale Discrete Optimization''. In that special year, there was an emphasis on new and more powerful algorithms developed in recent years, practical implementations involving codes and systems for such discrete optimization problems, and on practical applications leading to the fast solution of real world problems such as airline and railway scheduling, chemical process design, and telecommunications network design. Some DIMACS members and others with expertise in combinatorial optimization, but who have not previously worked on MDS, will be involved in our working group.

In addition to such techniques as these, more sophisticated continuous optimization techniques like majorization have been used extensively to deal with various MDS problems. For example, Heiser [B23] has used this approach to devise a method for fitting a weighted city block metric to three-way proximity data; in effect, this results in a city block version of INDSCAL. While we will put a lot of emphasis on non-continuous optimization, we will also explore these more sophisticated continuous optimization techniques.

In the case of very large data sets, various heuristic methods and/or approximate techniques need to be pursued. Another theme of this working group will be approximation and heuristics. These methods will enable us to fit models while not guaranteeing precise satisfaction of an OLS, LAD or other explicit fit measure. For example, a general approach called CANDELINC ([B15, B13]) offers the promise of getting good approximate solutions with large data sets, as tested empirically (with fairly small data sets) in [B11, B12]. Heuristic methods have been fairly successful in fitting tree structure and multiple tree structure models to proximity data ([B14, B16]). Combining such heuristic and approximate methods with some optimization techniques only recently applied to these problems should lead to improved algorithms both for finding globally optimal solutions for relatively small datasets, as well as for finding suboptimal, but practically acceptable, solutions for larger sets of data, and the working group will pursue this idea.

Kruskal and Hart [B31] did some very early work on generating an approximate MDS solution for a very large data set by first selecting a ``representative'' subset of objects, fitting a small dimensional MDS solution to these representative points, and then ``mapping'' the remaining points into this small dimensional space based on their similarities to those representative points. Approaches such as this, in addition to some of the other heuristic and approximate methods discussed above, might be usefully applied to fitting various MDS and MDS-related solutions to very large data sets, and our working group will discuss them.

Shepard [B35, B36] and Kruskal [B29, B230] introduced ``nonmetric MDS'' to deal with the situation when the input data is only known up to order, but the spacing between data values is not meaningful. (Thus, there is no metric on the inherently unidimensional data values.) The traditional analytical solutions of MDS are then no longer feasible and one needs to be concerned with algorithms that only use ordinal properties of the data. Thus, one must not use any sort of mean, though various generalizations of medians are appropriate. Kruskal introduced gradient-based optimization procedures to seek parameter values optimizing a measure of fit, a normalized least squares measure of fit between distances and a monotonic transformation of the data. Since this pioneering work, an enormous literature has grown up entailing use of many different variants of gradiant based methods. However, virtually all the methods developed are subject to the problem that they find only a local optimum of an overall objective function, rather than the global optimum. In general, there has so far been no good way to assure that the global optimum has been obtained. The best one can do is run the algorithm being used from a number of different random starts and assume that if the solution obtained from a number of different starting configurations is the same, then this is probably the global optimum. This working assumption in practice is challenged by increasingly large data sets and thus the issue of nonmetric MDS bears revisiting by our working group.

Working group home page

DIMACS Homepage

Contacting the Center

Document last modified on April 11, 2001.