Archetypal Analysis With Missing Data: See All Samples by Looking at a Few Based on Extreme Profiles

Abstract In this article, we propose several methodologies for handling missing or incomplete data in archetype analysis (AA) and archetypoid analysis (ADA). AA seeks to find archetypes, which are convex combinations of data points, and to approximate the samples as mixtures of those archetypes. In ADA, the representative archetypal data belong to the sample, that is, they are actual data points. With the proposed procedures, missing data are neither discarded nor previously filled by imputation, and the theoretical properties regarding the location of archetypes are guaranteed, unlike in previous approaches. The new procedures adapt the AA algorithm either by considering the missing values in the computation of the solution or by skipping them. In the first case, the solutions of previous approaches are modified to fulfill the theory, and a new procedure is proposed in which the missing values are updated by the fitted values. In the second case, the procedure is based on the estimation of dissimilarities between samples and the projection of these dissimilarities into a new space, where AA or ADA is applied, and those results are used to provide a solution in the original space. A comparative analysis is carried out in a simulation study, with favorable results. The methodology is also applied to two real datasets: a well-known climate dataset and a global development dataset. We illustrate how these unsupervised methodologies allow complex data to be understood, even by nonexperts. Supplementary materials for this article are available online.


Introduction
In everyday life, archetypes are original examples or perfect models that embody the fundamental characteristics of a thing. For example, in Greek mythology there are many hybrid beings composed of archetypes, such as the Minotaur, a mixture of two archetypes: a man and a bull. Archetypes are common in behavior, modern psychological theory, and literary analysis, and they form the basis of eclecticism in art. The concept of archetypes or pure types in statistics follows the same ideas and was formulated by Cutler and Breiman (1994). It is a matrix factorization technique where the data are explained as compositions of a few pure patterns.
Data mining seeks to discover unknown and unanticipated structure in the data (Hand 1998; Costantini, Linting, and Porzio 2010), together with the visualization of that structure (Daszykowski, Walczak, and Massart 2003). Apart from understanding and describing the entire dataset, we would like to be able to extract information that is easily interpretable, even by nonexperts. Dimension reduction techniques are very useful for exploring multivariate data (Larose 2006; Giudici and Figini 2009; Fogel et al. 2013). As only input and no output features are present, this is an unsupervised statistical learning problem (see Hastie, Tibshirani, and Friedman 2009, chap. 14, for a complete review of unsupervised learning techniques). Data decomposition techniques are frequently used to find the latent components: a dataset is viewed as a linear combination of several factors. Different unsupervised techniques with specific objectives are generated depending on the constraints on the factors and how they are combined (Mørup and Hansen 2012; Thurau et al. 2012; Vinué, Epifanio, and Alemany 2015). At one extreme, we find principal component analysis (PCA), which explains data variability satisfactorily at the expense of the interpretability of the factors, since this is compromised by the construction of the factors as linear combinations of variables. At the other extreme, we find clustering techniques such as k-means or k-medoids, whose factors are readily interpreted. These factors are the centroids of the clusters, which are averages of groups of data in the case of k-means and medoids (concrete instances) in the case of k-medoids. Nonetheless, in contrast with PCA, their modeling flexibility is undermined by the binary assignment of data to the clusters.

Archetypal Analysis
Archetype analysis (AA) lies somewhere in between these two techniques, since its factors can be interpreted as easily as those of clustering methodologies, but its modeling flexibility is higher than for clustering techniques. A table summarizing the relationship between several unsupervised multivariate techniques is provided by Mørup and Hansen (2012) and Vinué, Epifanio, and Alemany (2015). Cutler and Breiman (1994) formulated AA in such a way that each instance of the dataset is approximated by a mixture (convex combination) of pure or extremal types called archetypes. Furthermore, archetypes are built as mixtures of the cases in the dataset. Although archetypes are easily interpretable, they are artificial constructions, so there may not be instances in the dataset with characteristics similar to those of the archetypes, which may make AA unsuitable in some fields (Seiler and Wohlrabe 2013). To address this issue, the concept of archetypoids was introduced by Vinué, Epifanio, and Alemany (2015). In archetypoid analysis (ADA), each instance in the dataset is approximated by a mixture of a set of actual extreme observations called archetypoids.
This procedure not only allows us to relate the cases in the dataset to extreme patterns but also facilitates comprehension of the data. Humans understand the data better when the instances are shown through their extreme constituents (Davis and Love 2010) or when characteristics of one instance are shown as opposed to those of another (Thurau et al. 2012). In fact, Jones and Rice (1992) used functions with extreme principal component scores to describe and display the important characteristics of a set of functions. This could be considered as seeking the archetypoid functions. Nonetheless, unlike PCA, the objective of AA is to recover extreme instances, and functions with extreme PCA scores do not necessarily correspond to archetypal cases. This is explained in Cutler and Breiman (1994) and shown in Epifanio, Vinué, and Alemany (2013) through a problem where archetypes could not be restored with PCA even if all the components had been taken into account. Not only that, Stone and Cutler (1996) also showed that AA may be more appropriate than PCA when the data do not have elliptical distributions. In summary, as regards human interpretability, the central points returned by clustering techniques do not seem as favorable as extreme types, which are also more readily comprehensible than a linear combination of data, such as that returned by PCA.

Missing Data
AA and ADA, like the majority of statistical techniques, assume the completeness of the data. Nevertheless, incomplete data, that is, data with missing values, are common in real applications (Lott and Reiter 2018; Wang and Johnson 2019; Xia and Yang 2016; Akande, Li, and Reiter 2017; Maity, Pradhan, and Das 2018). According to Little and Rubin (2002), three types of missing data can be established: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In simple terms, MCAR means that the probability that an observation is missing is not related to its value or to any other values in the dataset, unlike MAR, where that probability is related to the values of some other observed variables. However, when that probability is related to its own value, we say that the data are MNAR.
One way to solve this problem is to ignore the incomplete instances and to work only with the complete cases. However, this means discarding information that may be valuable, especially if the percentage of missing cases is high. Another way is to impute the data and, once the missing values are filled, proceed with the statistical procedure as normal. This approach can work well if the percentage of missing values is not too high; otherwise, the errors introduced by imputation become increasingly important (Eirola et al. 2013). Furthermore, probabilistic imputation methods are usually computationally expensive. Another approach in factorial methods and cluster analysis is to adapt the corresponding algorithm by either skipping or considering the missing values in the computation of the outputs (Dray and Josse 2015). Some PCA algorithms that consider missing values are the nonlinear iterative partial least squares (NIPALS) algorithm (Dray and Dufour 2007) and the iterative PCA method (Kiers 1997), also known as the EM-PCA algorithm (Josse and Husson 2012); in the clustering case, some examples are the k-POD algorithm (Chi, Chi, and Baraniuk 2016) and mixture model clustering to handle missing data (Hunt and Jorgensen 2003). With regard to PCA algorithms that skip missing values, that is, where the PCA algorithm is adapted so that missing values are not considered in the computation, some examples are the pairwise correlation approach and the approach based on principal coordinate analysis of the Euclidean distance matrix computed between the individuals, as described by Dray and Josse (2015); in the clustering case we can cite, for instance, the partitioning around medoids (PAM) algorithm (Kaufman and Rousseeuw 1990, chap. 2).
Two approaches for computing AA with incomplete data have been considered to date, the first by Mørup and Hansen (2012) and the second by Epifanio, Ibáñez, and Simó (2018). Both approaches modified the original AA algorithm to consider the missing values. Nonetheless, in both cases the results did not fit the expected theoretical results, as will be shown in the supplementary materials. On the one hand, we propose a modification of the solutions of both approaches to fulfill the theory. Furthermore, we propose a new procedure, which also considers the missing values, based on updating the missing values with the fitted values.
On the other hand, we consider a different point of view that skips missing values and that could, in fact, be used not only for AA but also for ADA. We propose to estimate the Euclidean distances between cases; a multidimensional scaling technique is then applied to these dissimilarities so that the distances between the points returned in the Euclidean space are approximately equal to the original dissimilarities. Once complete vectors are available, AA or ADA can be applied, and those results are used to build the AA and ADA solution in the original space. In other words, as we cannot work in the original space due to the incompleteness of the data, a kind of projection is performed on the data (in fact, on the dissimilarity matrix), and the concrete statistical technique is applied in this new space. Then, we take advantage of the learning obtained in this new space to provide a solution in the initial space. Note that if the dissimilarities between n cases were Euclidean distances, they could be represented exactly in at most n − 1 dimensions (Mardia, Kent, and Bibby 1979, Theorem 14.4.1) by means of classical multidimensional scaling (cMDS), so we could recover the original vectors and the solution in the cMDS space would coincide with that of the initial space.
In summary, the aim of this article is to propose a new methodology with missing data for AA and ADA that guarantees the fulfillment of the theoretical properties regarding the location of archetypes and archetypoids. Furthermore, on the one hand, the proposed procedures do not substantially increase the computational burden of the original algorithms, and on the other hand, they do not require any assumptions on the missingness patterns due to the absence of the completely observed data formulation. Moreover, these methods can provide an imputation for the missing values despite this not being their objective. With the proposed procedures, missing values can appear in the archetypal profiles, which conforms to the fact that instances with missing values can be archetypal cases.
The outline of the article is as follows: In Section 2, we review AA and ADA and the current AA algorithms that consider missing values. In Section 3, we introduce our proposal for handling missing data with archetypal analysis. A toy example illustrates our proposed methodology and the flaws of the previous approaches in the supplementary materials. In Section 4, a comparison is made in a simulation study. In Section 5, our proposal is applied to two real datasets. Section 6 contains conclusions and some ideas for future work. The datasets and code in R (R Development Core Team 2018) for reproducing the results are available as supplementary materials.

AA and ADA With Complete Data
Let X be an n × m matrix with n instances and m variables. Three matrices are sought in AA: (1) the k × m matrix Z, whose rows contain the k archetypes $z_j$; (2) an n × k matrix $\alpha = (\alpha_{ij})$ with the mixture coefficients that approximate each instance $x_i$ by a mixture of the archetypes ($\hat{x}_i = \sum_{j=1}^{k} \alpha_{ij} z_j$); and (3) a k × n matrix $\beta = (\beta_{jl})$ with the mixture coefficients that define each archetype ($z_j = \sum_{l=1}^{n} \beta_{jl} x_l$). To determine these matrices, the following residual sum of squares (RSS) is minimized under the respective constraints ($\|\cdot\|$ denotes the Euclidean norm):

$$\mathrm{RSS} = \sum_{i=1}^{n} \Big\| x_i - \sum_{j=1}^{k} \alpha_{ij} z_j \Big\|^2 = \sum_{i=1}^{n} \Big\| x_i - \sum_{j=1}^{k} \alpha_{ij} \sum_{l=1}^{n} \beta_{jl} x_l \Big\|^2, \quad (1)$$

subject to (a) $\sum_{j=1}^{k} \alpha_{ij} = 1$ with $\alpha_{ij} \ge 0$ for $i = 1, \ldots, n$, and (b) $\sum_{l=1}^{n} \beta_{jl} = 1$ with $\beta_{jl} \ge 0$ for $j = 1, \ldots, k$.
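For concreteness, the objective in Equation (1) can be sketched in a few lines of NumPy. This is an illustration, not the authors' implementation, and `aa_rss` is an invented helper name:

```python
import numpy as np

def aa_rss(X, alpha, beta):
    """RSS of Equation (1): data approximated as mixtures (alpha) of
    archetypes, which are themselves mixtures (beta) of the data rows."""
    Z = beta @ X        # k x m archetypes, convex combinations of rows of X
    Xhat = alpha @ Z    # n x m approximations, convex combinations of archetypes
    return float(np.sum((X - Xhat) ** 2))
```

When k = n and both mixture matrices are the identity, each instance is its own archetype and the RSS is exactly zero.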
It is important to highlight that archetypes do not necessarily match actual instances. Specifically, this will only occur when exactly one $\beta_{jl}$ is equal to one for each archetype, that is, when each archetype is made up of a single instance. Accordingly, in ADA the former restriction (b) is replaced by the following one, which converts the continuous optimization problem of AA into a mixed-integer optimization problem: $\beta_{jl} \in \{0, 1\}$ with $\sum_{l=1}^{n} \beta_{jl} = 1$ for $j = 1, \ldots, k$. To solve the AA problem, Cutler and Breiman (1994) developed an alternating minimization algorithm that alternates between computing the optimal α for given archetypes Z and the best archetypes Z for a given α. This involves solving convex least-squares problems at each stage, which is done through a penalized version of the nonnegative least-squares algorithm (Lawson and Hanson 1974). That algorithm was implemented in the R package archetypes by Eugster and Leisch (2009), although with various changes (previous data standardization and use of the spectral norm in Equation (1) instead of the Frobenius norm for matrices). In our R implementation, however, these changes do not occur: the data are not standardized by default and Equation (1) is indeed the objective function to minimize. A total of 20 random starts is considered in the experiments and the best repetition is chosen as the solution.
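A rough sketch of this alternating scheme follows (not the archetypes package code). The sum-to-one constraints are enforced with the classical penalty trick on top of Lawson–Hanson NNLS; the names `_simplex_nnls` and `fit_archetypes` are invented for the example:

```python
import numpy as np
from scipy.optimize import nnls

def _simplex_nnls(A, b, M=200.0):
    """min ||A w - b|| s.t. w >= 0 and sum(w) = 1, via the penalty trick:
    append a row of M's to A and the value M to b, then run NNLS."""
    Ap = np.vstack([A, M * np.ones(A.shape[1])])
    bp = np.append(b, M)
    w, _ = nnls(Ap, bp)
    return w

def fit_archetypes(X, k, iters=25, seed=0):
    """Minimal alternation in the spirit of Cutler and Breiman (1994)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    Z = X[rng.choice(n, k, replace=False)].astype(float)  # random start
    for _ in range(iters):
        # best alpha for the current archetypes (n independent problems)
        alpha = np.array([_simplex_nnls(Z.T, x) for x in X])
        # best archetypes for fixed alpha, then project back onto the data
        Zls = np.linalg.lstsq(alpha, X, rcond=None)[0]
        beta = np.array([_simplex_nnls(X.T, z) for z in Zls])
        Z = beta @ X
    alpha = np.array([_simplex_nnls(Z.T, x) for x in X])
    return Z, alpha, beta
```

The penalty weight M trades off constraint satisfaction against fit; the mixture weights sum to one only up to an error of order 1/M.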
To solve the ADA problem, Vinué, Epifanio, and Alemany (2015) proposed an algorithm inspired by the scheme of the PAM clustering algorithm (Kaufman and Rousseeuw 1990). Two phases make up the algorithm: the BUILD stage and the SWAP stage. A starting set of archetypoids is determined in the BUILD stage, which is upgraded during the SWAP stage by swapping the chosen instances for unselected observations and inspecting whether these interchanges decrease the RSS. That algorithm was implemented by Vinué (2017) in the R package Anthropometry with three possible starting sets in the BUILD phase: cand_ns, cand_α, and cand_β. The nearest neighbors in Euclidean distance to the k archetypes form the cand_ns set. The cand_α set is constituted by the instances with the maximum α value for each archetype j, while the cand_β set is composed of the instances with the maximum β value for each archetype j. Each of these three sets goes through the SWAP stage and three sets are retrieved. From these three sets, the one with the minimum RSS (often the same set is returned from the three initializations) is chosen as the ADA solution.
Let us examine the locations of the archetypal representatives. On the one hand, if k = 1, the archetype corresponds to the mean and the archetypoid to the medoid (Kaufman and Rousseeuw 1990). Nevertheless, if k > 1, the archetypes are located on the boundary of the convex hull of the data (see Cutler and Breiman 1994), although this is not necessarily the case for archetypoids (see Vinué, Epifanio, and Alemany 2015). On the other hand, archetypes are not necessarily nested, and neither are archetypoids. As a consequence, different values of k may reflect different structures in the data. As with any unsupervised technique, the selection of the number of components k is an open question. If the user has prior knowledge of the arrangement of the data, the value of k can be chosen based on that information. Otherwise, the elbow criterion, which has been widely used (Cutler and Breiman 1994; Eugster and Leisch 2009; Vinué, Epifanio, and Alemany 2015), could be considered. The elbow criterion consists of plotting the RSS for different values of k and selecting the value of k at which the elbow is found.

Previous Techniques for AA With Missing Data
Let X be an n × m matrix as before, but now missing values (NAs) can be present in the data. Let us suppose that there is no row or column with all its values missing; otherwise, that row or column would have to be removed. The simplest approach is to discard the incomplete instances and analyze only the complete ones, thereby wasting usable information. This approach is referred to as COM. Another approach, referred to as IMP, is to estimate the missing values and analyze all the instances. Missing value estimators range from the simplest, such as using the mean of the nonmissing values in the respective feature, to more complex ones that exploit the information from other features. In the experiments carried out in Sections 4 and 5, multiple imputation using additive regression, bootstrapping, and predictive mean matching is used, which is implemented in the aregImpute function of the R package Hmisc (Harrell et al. 2016) (the default type "pmm" is only changed to "regression" if imputations cannot be completed due to a high percentage of missingness). The number of multiple imputations is 5, and the 5 imputed datasets are appended to be combined.
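As a contrast to the multiple-imputation strategy used in the experiments, the simplest estimator mentioned above (the column mean of the nonmissing values) can be sketched as follows; `mean_impute` is an invented name for this illustration:

```python
import numpy as np

def mean_impute(X):
    """Fill each NA with the mean of the nonmissing values of its column."""
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)     # per-feature means, ignoring NAs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]       # write the column mean into each NA
    return X
```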
To the best of our knowledge, two specific techniques have been developed for AA with missing data. The first was introduced by Mørup and Hansen (2012) and is referred to as AAMOHAN. To handle incomplete cases, the objective function for minimizing the RSS was modified and includes a parameter (ε) as a regularization to avoid division by zero. In the experiments, two values for ε are used: 10⁻³, the default, and 10⁻⁹. Furthermore, in the implementation, the missing values are substituted by zeros, which could place the archetypes outside the convex hull. As a consequence, the results did not fit the expected theoretical results. The second was introduced by Epifanio, Ibáñez, and Simó (2018) and is referred to as AAEIS. Different weights for nonmissing and missing values (zero in that case) are considered to solve the problem. However, those weights could also place the archetypes outside the convex hull of the data for k > 1.

Projecting Dissimilarities for Computing AA and ADA
The outline of the procedure is as follows: Step 1. Estimate an n × n dissimilarity matrix from all the pairwise dissimilarities between data points.
Step 2. Project the dissimilarities in such a way that the cases are embedded in a Euclidean space.
Step 3. Apply AA or ADA to the projected space according to the analysis pursued. Keep β.
Step 4. Compute the archetypes or archetypoids in the original space with the β from Step 3. Compute α and the RSS in the original space.

For Step 1, the dissimilarities are estimated directly; no previous missing-data imputation is carried out. According to Eirola et al. (2013), this leads to more reliable estimates.
In the experiments carried out in Sections 4 and 5, the well-known partial distance strategy (PDS) is used (Dixon 1979). PDS estimates the squared Euclidean distance by summing the squared differences over the mutually observed variables and scaling the value proportionally to account for the missing ones. The Euclidean distances (ED) are estimated in this way by the daisy function from the R package cluster (Maechler et al. 2017) and recorded in the matrix D with elements d_ij.
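A direct transcription of the PDS idea (not the daisy implementation) might look like this sketch:

```python
import numpy as np

def partial_distance(X):
    """Partial distance strategy (Dixon 1979): squared Euclidean distance over
    mutually observed variables, rescaled by m / (#shared variables)."""
    n, m = X.shape
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            s = shared.sum()
            if s == 0:
                D[i, j] = D[j, i] = np.nan   # no variable observed in common
                continue
            d2 = np.sum((X[i, shared] - X[j, shared]) ** 2) * m / s
            D[i, j] = D[j, i] = np.sqrt(d2)
    return D
```

With complete data the rescaling factor is 1 and the ordinary Euclidean distance is recovered.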
Many mappings can be chosen for Step 2. In the experiments in Sections 4 and 5, only two are considered. The first is the well-known classical multidimensional scaling (cMDS), which takes a set of dissimilarities and returns a set of points such that the Euclidean distances between these points are approximately equal to those dissimilarities. If possible, we consider n − 1 as the maximum dimension of the space in which the data are to be represented. If the dissimilarities are Euclidean distances, they can be represented exactly in at most n − 1 dimensions (Mardia, Kent, and Bibby 1979, Theorem 14.4.1). D is Euclidean only if B is positive semidefinite (Mardia, Kent, and Bibby 1979, Theorem 14.2.1), where $B = (I - n^{-1}ee^\top) M (I - n^{-1}ee^\top)$, I is the n × n identity matrix, e is the n × 1 vector with all its elements equal to unity, and M is the matrix with elements $m_{ij} = -0.5\, d_{ij}^2$. Due to the missing values, in the experiments D is not Euclidean. Therefore, we also consider a second method that works when the dissimilarity is not a distance: the h-plot (HP) (Epifanio 2013). Note that cMDS tries to preserve the original interpoint dissimilarities in the new space, unlike the h-plot, which tries to preserve the relationships between dissimilarity variables. This perspective is particularly suitable when working with nonmetric dissimilarities because, as in this case, the matrix is not Euclidean and the dissimilarities cannot be represented exactly in a Euclidean space. For the HP method, both the estimated Euclidean distances and their squares, as originally defined by Dixon (1979), have been used in the experiments, since raising dissimilarities to a power can be useful in this method (Epifanio 2013). For Step 4, the archetypoids can be determined directly in the original space, as they correspond to concrete cases. However, to determine the archetypes it is necessary to define Z = βX when X has missing values.
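A minimal cMDS sketch following this construction (double-centering M and keeping the positive eigenvalues of B) could be:

```python
import numpy as np

def cmds(D, ndim=None):
    """Classical multidimensional scaling from a dissimilarity matrix D."""
    n = D.shape[0]
    M = -0.5 * D ** 2
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix I - n^{-1} e e'
    B = J @ M @ J                          # doubly centered matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    pos = vals > 1e-9                      # keep positive eigenvalues only
    Y = vecs[:, pos] * np.sqrt(vals[pos])  # coordinates in the new space
    return Y if ndim is None else Y[:, :ndim]
```

If D is Euclidean, the interpoint distances of the returned configuration reproduce D exactly; otherwise the negative eigenvalues are simply dropped.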
Note that archetypes are mixtures of the observations, that is, they are defined as weighted averages of the observations. Therefore, if we have missing values, those weights have to be rescaled to account for the missing entries, in a similar way to PDS. In particular,

$$z_{js} = \frac{\sum_{l=1}^{n} \beta_{jl}\, w_{ls}\, x_{ls}}{\sum_{l=1}^{n} \beta_{jl}\, w_{ls}}, \quad s = 1, \ldots, m, \quad (2)$$

where W = ($w_{ls}$) is an n × m matrix with 0 whenever the element is missing in X and 1 otherwise. Then, we find the best α for a given Z by solving n convex least-squares problems (i = 1, …, n):

$$\min_{\alpha_i} \Big\| x_i - \sum_{j=1}^{k} \alpha_{ij} z_j \Big\| \ \text{ subject to } \alpha_{ij} \ge 0 \text{ and } \sum_{j=1}^{k} \alpha_{ij} = 1. \quad (3)$$

Each problem is independent of the rest, and its result can be calculated by excluding the coordinates with missing values. Finally, the RSS can be computed as the sum of the n distances between $x_i$ and $\sum_{j=1}^{k} \alpha_{ij} z_j$ using PDS, which is the same approach followed by Epifanio, Ibáñez, and Simó (2018). Note that we can provide an imputation for the missing values, despite this not being the objective, by using $\hat{x}_i = \sum_{j=1}^{k} \alpha_{ij} z_j$.
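The PDS-like rescaling of the archetype weights can be sketched as follows, with the β weights renormalized per variable over the observed donors; `archetypes_with_na` is an invented name for this illustration:

```python
import numpy as np

def archetypes_with_na(X, beta):
    """Archetypes as weighted averages of the rows of X, with the weights
    rescaled per variable to skip missing entries (PDS-like)."""
    W = (~np.isnan(X)).astype(float)       # 1 = observed, 0 = missing
    Xz = np.where(np.isnan(X), 0.0, X)     # zero out NAs so they add nothing
    num = beta @ Xz                         # k x m weighted sums over donors
    den = beta @ W                          # k x m sums of the active weights
    with np.errstate(invalid="ignore"):
        Z = num / den                       # NaN only where no donor observed
    return Z
```

Note that an archetype coordinate stays NaN only when every donor with positive weight is missing that variable, which matches the idea that archetypal profiles may themselves contain missing values.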
In summary, for AA or ADA, according to the analysis to be carried out, three possible alternatives are considered in the experiments, since we contemplate two mappings in Step 2 and two different dissimilarities in Step 1 for the second mapping, which handles nonmetric dissimilarity matrices. These three combinations will be labeled as: AAEDcMDS, archetypal analysis using Euclidean distance in Step 1 and cMDS in Step 2; AAEDHP, archetypal analysis using Euclidean distance in Step 1 and HP in Step 2; and AAPDSHP, archetypal analysis using partial distance in Step 1 and HP in Step 2.

Kernel Method for AA
Instead of projecting the dissimilarities, the kernel-AA algorithm by Mørup and Hansen (2012), which generalizes the AA algorithm to kernel representations, can be used to compute AA, as it is based on pairwise relations between the data points. So, we need to define the kernel matrix. In particular, the Gaussian radial basis function $K(x, y) = \exp(-\gamma \|x - y\|^2)$, which is one of the most popular choices for a kernel function, can be formulated only in terms of the dissimilarities between samples. We use this kernel in the experiments, with the parameter that sets the spread of the kernel, γ, equal to 0.1. No significant improvements are observed with other γ values. This approach is referred to as AAK.
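Since the Gaussian kernel depends on the samples only through their pairwise dissimilarities, the kernel matrix can be built directly from an estimated distance matrix; a minimal sketch:

```python
import numpy as np

def gaussian_kernel(D, gamma=0.1):
    """Gaussian RBF kernel written purely in terms of pairwise dissimilarities:
    K_ij = exp(-gamma * d_ij^2), with gamma = 0.1 as in the experiments."""
    return np.exp(-gamma * D ** 2)
```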

Modified AAMOHAN and AAEIS for AA
To ensure that the archetypes do not lie outside the convex hull, instead of working with the archetypes returned by AAMOHAN and AAEIS, we take the β values returned by these methods, build the archetypes as in Equation (2), and estimate the α values by means of Equation (3). These approaches are referred to as MAAMOHAN and MAAEIS, respectively. Furthermore, estimating the archetypes in this way allows there to be archetypes with missing values.

Intrinsic Imputation for AA
The original AA algorithm can be adapted to handle missing data by making internal imputations during the parameter updates. In each iteration of the standard alternating procedure for estimating α and β, an imputation step is introduced, in which the missing entries are imputed according to the current AA model. For the initialization, we need to start with archetypes that have no missing values and fulfill the constraints, so the starting archetypes are built considering only complete cases. This does not prevent us from obtaining archetypes formed by mixtures of cases with missing values during the iterative improvement phase. To fulfill the theory, that is, that archetypes are mixtures of the data, the final archetypes are built from the β values returned by this algorithm, as in Equation (2), and the α values are estimated by means of Equation (3). This approach is referred to as AAII.
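A schematic version of this intrinsic-imputation loop, with any complete-data AA solver plugged in, might look as follows. For simplicity, this sketch initializes the missing entries with column means rather than building the starting archetypes from complete cases as described above, and all names are invented:

```python
import numpy as np

def aa_intrinsic_impute(X, fit_aa, iters=10):
    """Alternate a complete-data AA fit (fit_aa returns Z, alpha, beta)
    with imputing the NAs by the fitted values of the current AA model."""
    miss = np.isnan(X)
    # simplified initialization: column means (see lead-in for the caveat)
    Xc = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(iters):
        Z, alpha, beta = fit_aa(Xc)
        Xhat = alpha @ Z                   # current model fit for every entry
        Xc = np.where(miss, Xhat, X)       # update only the missing entries
    return Z, alpha, beta, Xc
```

As a side effect, `Xc` provides an imputation of the missing entries, in line with the remark that these methods can impute even though that is not their objective.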

ADA With Missing Values
The methods by Mørup and Hansen (2012) are not defined for ADA, while extending AAEIS and AAII to ADA would imply that only the complete cases could act as archetypoids, as explained by Epifanio, Ibáñez, and Simó (2018). This would considerably restrict the feasible set of solutions. Therefore, we consider that the best options for computing ADA with missing data are the methodology introduced in Section 3.1.1 and applying ADA after imputing values.

Simulation Study
Three simulation studies are carried out to compare AA with missing data. Two of them are similar to those followed by Epifanio, Ibáñez, and Simó (2018), and the other is inspired by Chi, Chi, and Baraniuk (2016). A toy example, where we show that AAMOHAN and AAEIS do not fulfill the theory, is analyzed in the supplementary materials. Here, two well-known benchmark datasets are analyzed: an artificial one (waveform data) and a real one (wine dataset). Both have a factor with class labels, which is discarded in the computation but used to assess the solution. Although the objective of AA is not data clustering, we can assign each case to the group in which its corresponding α is maximum. We will use this to compare the clustering AA solutions with those of k-POD in two sets in which differentiated (separate) clusters do not exist. Like Chi, Chi, and Baraniuk (2016), we use the Rand score (Rand 1971) to compare each clustering result to the true class label variable, with the adjustedRand function in the R package clues (Chang et al. 2010). This index ranges between 0 and 1, where higher scores indicate greater agreement and, as a consequence, more accurate clustering performance. The algorithm for k-POD is available in the R package kpodclustr (Chi and Chi 2014).
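The Rand score can be computed directly from pairwise agreements between two partitions; a small sketch (not the clues implementation):

```python
import numpy as np

def rand_index(a, b):
    """Rand (1971) index: the fraction of pairs of cases on which the two
    labelings agree (both in the same group, or both in different groups)."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    same_a = a[:, None] == a[None, :]      # pair co-membership under labeling a
    same_b = b[:, None] == b[None, :]      # pair co-membership under labeling b
    agree = same_a == same_b               # pairs treated consistently
    iu = np.triu_indices(n, k=1)           # each unordered pair counted once
    return float(agree[iu].mean())
```

Note that the index is invariant to relabeling of the groups, which is why it suits comparing a clustering against true class labels.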

Simulation Study With Wine and Waveform Data
We follow the same experimental set-up performed by Chi, Chi, and Baraniuk (2016) and Epifanio, Ibáñez, and Simó (2018). The first set considered is the wine dataset from the UCI Machine Learning repository (Dheeru and Karra Taniskidou 2017). A total of 13 chemical analyses are recorded for each of the 178 samples, together with the wine type (there are three classes). Secondly, a benchmark dataset, the waveform dataset defined by Breiman et al. (1984, pp. 49-55), is generated in each trial using the function mlbench.waveform from the R package mlbench (Leisch and Dimitriadou 2010). Each dataset is constituted by n = 150 samples and m = 21 continuous variables, together with a factor recording the 3 classes (33% for each of the 3 classes). Each class is generated from a convex combination of 2 of 3 "base" waves.
Each of the following steps is repeated 100 times. To simulate the MCAR mechanism, we randomly remove entries to obtain approximately 10 and 30% overall missingness in the wine dataset and 50% overall missingness in the waveform datasets. To simulate the MAR mechanism, we randomly remove 50% of the values in the 1st, 4th, and 7th variables in the wine dataset, and in the 5th, 10th, and from the 15th to 21st variables in the waveform datasets, which means approximately 12% and 21% overall missingness, respectively. To simulate the MNAR mechanism, we randomly remove 95% of the entries in the bottom 25th quantile in each of the variables in the wine dataset, and 75% of the entries in the bottom 50th quantile in each of the variables of the waveform dataset, which means approximately 24% and 38% overall missingness, respectively. We consider high percentages of missingness since, although they were uncommon in the past, nowadays they are frequent in data from online social networks and recommender systems, as explained in Chi, Chi, and Baraniuk (2016). The different approaches are then applied to each dataset with k = 3. The datasets generated from wine are standardized. As in the previous Section, D is not Euclidean.
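The MCAR removal step can be sketched as follows; `make_mcar` and its arguments are invented for the example:

```python
import numpy as np

def make_mcar(X, rate, seed=0):
    """Remove entries completely at random to reach roughly `rate` overall
    missingness: each entry is deleted independently with probability rate."""
    rng = np.random.default_rng(seed)
    Xm = X.astype(float).copy()
    mask = rng.random(X.shape) < rate   # independent of values: MCAR by design
    Xm[mask] = np.nan
    return Xm
```

MAR and MNAR scenarios differ only in how the mask is built: from other observed variables for MAR, or from the entry's own value (e.g. the bottom quantiles) for MNAR.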
A summary (mean and standard deviation) of the RSS/n from each approach is displayed in Table 1. As we also know the original datasets, we compute the matrix α that approximates the original data (without missing values) using the archetypes kept for each strategy and compute the RSS/n. The idea is to judge the capacity of each method to recover the original data. The results obtained using the original data without missing values are referred to as ORG. A summary (mean and standard deviation) of these quantities is displayed in Table 2. In Table 3, we show the summary of the Frobenius norm of the difference between the archetypes obtained with each approach and the original ones (we consider the permutation that gives the least error to match the archetypes). Finally, a summary of the Rand scores is shown in Table 4. The best result for each scenario is highlighted in bold. An empty space in the tables indicates failure to complete the experiment for a given scenario. This happens with the waveform data due to the high percentage of missingness; for example, when there are no complete cases, AAII cannot be initialized. For that reason, if there are no complete cases, we impute the minimum value of each variable to the missing entries to build the starting archetypes; for the rest of the algorithm this imputation is discarded. For IMP, in some cases the imputation mechanism fails with the first seed used, in which case another seed is used to run the imputation function. For the MCAR mechanism with the waveform data, there are a few missing entries in the dissimilarities, since a few pairs of rows have no observed variables in common. Therefore, we have worked with the complete cases of the dissimilarity matrix.
The worst results are achieved in all scenarios and for all measures with COM, as it discards information. In the scenarios where the percentage of missingness is high, no results are returned because no complete instances are available. Let us analyze the results for each measure. For the measure in Table 1, the best strategy is MAAMOHAN, except in one scenario where IMP is the best, although MAAMOHAN is quite close. AAII, despite not being the most accurate in any of the scenarios, is always near the best in all of them. IMP, however, is not a good approach in all of the scenarios: it is not among the most accurate for MAR with the wine dataset and MCAR with the waveform dataset.
For the measure in Table 2, again the most accurate approaches are IMP and MAAMOHAN, with the exception of MNAR with the waveform dataset, for which the best result is obtained by AAEDcMDS. As before, IMP is not among the most accurate in all the scenarios, whereas MAAMOHAN, AAII, and also AAEDcMDS consistently return good results in all the scenarios. Note that the gold standard reference is ORG, as it uses all the information, without missing values. The proposed procedures provide results that are close to ORG, despite working with missing values, with the exception of MNAR with the waveform dataset, where the gap is larger. In two scenarios, IMP reports better results than ORG: the percentage of missingness there is not very high, and the multiple imputations provide more data to work with, so more accurate results can be achieved.
For the measure in Table 3, the best results are obtained by different strategies in each scenario. The methods that give the best results in some scenario are (the number of scenarios where each method is the best appears in brackets): AAII (1), AAEDcMDS (1), MAAMOHAN (4), and IMP (1). In the waveform dataset scenarios, where a high percentage of missingness is used, IMP does not provide good results. However, AAII, MAAMOHAN, and AAEDcMDS (with the exception of MNAR with the wine dataset) consistently give good results in all the scenarios.
Finally, as regards the results in Table 4, in the majority of the scenarios a method based on dissimilarities is the best. In particular, the methods that provide the most accurate results are (the number of scenarios where each method is the best appears in brackets): AAK (2), AAEDHP (2), AAEDcMDS (1), IMP (1), and k-POD (1). With the exception of MNAR for the waveform dataset, the best results are obtained using a method for AA with missing data rather than one intended for clustering, such as k-POD, since these datasets do not have differentiated clusters. With the exception of MNAR for the wine dataset, the results with missing data are not far from those obtained without missing data (ORG). In fact, for AAK with the MAR scenario and the waveform dataset, the mean is even higher than that of ORG.
Neither MAAEIS nor AAPDSHP is the best method in any of the scenarios or measures. However, AAPDSHP is the best in the toy example of the supplementary materials, and MAAEIS is the best method, with the lowest RSS, for the real dataset in Section 5.1.

Air Quality
It is normal for temperature data (and climate data in general) to include missing observations, unreasonable readings, spurious zeroes, and so on (Kotsiantis et al. 2006). In this section, we analyze a well-known dataset that contains daily readings of ozone, solar radiation, average wind speed and maximum daily temperature in New York from May 1, 1973 to September 30, 1973. Ozone is measured as mean ozone in parts per billion from 13:00 to 15:00 hours on Roosevelt Island. Solar radiation is measured in Langleys in the frequency band 4000-7700 Angstroms from 08:00 to 12:00 hours in Central Park. Average wind speed is measured in miles per hour at 07:00 and 10:00 hours at La Guardia Airport, and the maximum daily temperature is measured in degrees Fahrenheit at La Guardia Airport. The dataset is available from the R package datasets (R Development Core Team 2018; Chambers et al. 1983). However, this dataset has 37 missing ozone observations (24.03%) and 7 missing solar radiation measurements (4.55%). Our aim is to find a set of archetypes (mixtures of real days) that reflect extreme patterns and facilitate comprehension of the dataset.
First of all, the variables are standardized, since their ranges and meanings are very different. We apply the procedures for different numbers of archetypes and represent the screeplots. As stated in Section 2.1, the elbow criterion suggests that k = 3 archetypes should be chosen in all cases. We omit the figures in the interests of brevity. Table 5 shows the RSS/n for each approach with k = 3. The lowest value is achieved by MAAEIS. We analyze those results in detail, although the three archetypes found with the different strategies have similar characteristics, and similar comments could be made for the rest of the strategies. The screeplot for MAAEIS is displayed in Figure 1. Table 6 shows the percentiles of the three archetypes for the variables analyzed. The first archetype presents very high values for solar radiation and wind and medium-low values for ozone and temperature. The second archetype shows very low values for ozone, solar radiation and temperature, and high values for wind, while the third archetype presents very high values for ozone and temperature and very low values for wind, with medium values for solar radiation. In other words, archetype 3 is the archetypal really hot day. Archetypes 1 and 2, in contrast, are very windy days, archetype 2 being colder than archetype 1, although what really distinguishes these two archetypes is the solar radiation, which is extremely high for archetype 1 but extremely low for archetype 2. For MAAEIS, archetype 1 is basically formed by 22nd May, archetype 2 is mainly a mixture of 9th and 21st May, while archetype 3 is basically a mixture of 29th-30th August and 3rd September.
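The percentiles reported in Table 6 can be obtained, for instance, as the fraction of sample values at or below each archetype's value for each variable (an illustrative Python sketch; the function name is ours and the exact percentile convention in the paper may differ):

```python
import numpy as np

def archetype_percentiles(X, Z):
    """Percentile of each archetype's value (rows of Z, k x p)
    within the sample X (n x p), computed per variable."""
    k, p = Z.shape
    P = np.empty((k, p))
    for j in range(p):
        for a in range(k):
            # fraction of sample values at or below the archetype value
            P[a, j] = 100.0 * np.mean(X[:, j] <= Z[a, j])
    return P
```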
As an illustration, let us consider the three archetypes obtained with the MAAEIS strategy. Having obtained the three archetypes with this strategy, it is possible to obtain the n × 3 matrix α with the coefficients that approximate each observation as a mixture of the three archetypes. Figure 2 shows ternary plots for these coefficients. In the first plot all the observations are represented jointly, and then a ternary plot is shown for the observations of each month, so that differences between months can be easily observed. The points in each plot are labeled with the number of the day being represented. In these figures, the three archetypes are the three vertices of the triangles, and each point represents a different observation (day) as a mixture of the three archetypes. The scales are included in the triangles in gray, in segments parallel to each side. May days cluster close to archetypes 1 and 2, with low values for archetype 3, except 30th May. June days, however, show medium values for archetype 3, that is, they are hotter than in May, but not as hot as in summer. July and August days display similar profiles, except for some specific days; the majority of the days have high alpha values for archetype 3, that is, they are hot days. In September, however, the alpha values for archetype 3 diminish to medium-low values, except for the first days of the month. Many days in September are explained by approximately a mixture of 50% archetype 1, 25% archetype 2, and 25% archetype 3.
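Given fixed archetypes, the matrix α can be estimated by nonnegative least squares, with the sum-to-one constraint enforced softly through a large penalty row, the device used by Cutler and Breiman (1994). A minimal Python sketch (the penalty constant M and the function name are illustrative choices of ours):

```python
import numpy as np
from scipy.optimize import nnls

def mixture_coefficients(X, Z, M=200.0):
    """Approximate each row of X (n x p) as a convex combination of
    the archetypes Z (k x p): minimize ||x - alpha @ Z|| subject to
    alpha >= 0, with sum(alpha) = 1 enforced via the penalty row M."""
    # (p+1) x k design: archetypes plus a row of M's for the constraint
    A = np.vstack([Z.T, M * np.ones((1, Z.shape[0]))])
    return np.array([nnls(A, np.append(x, M))[0] for x in X])
```

Each row of the returned matrix then gives the mixture coefficients plotted in the ternary diagrams.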
As carried out by Cutler and Breiman (1994), the alpha coefficients can also be used to see how the individual variables vary as functions of the archetypes. For example, we regressed each variable on polynomial terms of up to 3rd degree in α_i1 and α_i2. The values of α_i3 are not considered, since α_i1 + α_i2 + α_i3 = 1. The R² values are 0.89, 0.91, 0.82, and 0.79 for Ozone, Solar Radiation, Wind, and Temperature, respectively. Therefore, the data can be surprisingly well represented as a mixture of three archetypal days.
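The regression just described can be reproduced with ordinary least squares on polynomial features of α_i1 and α_i2 (a Python sketch; α_i3 is dropped because the coefficients sum to one, and the function name is ours):

```python
import numpy as np

def r2_on_alphas(y, a1, a2, degree=3):
    """R^2 of regressing y on all monomials a1^i * a2^j with
    1 <= i + j <= degree, plus an intercept."""
    terms = [np.ones_like(a1)]
    for d in range(1, degree + 1):
        for i in range(d + 1):
            terms.append(a1 ** (d - i) * a2 ** i)
    A = np.column_stack(terms)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
```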

A Snapshot of the World's Countries
A solid understanding of the world is the first step for improving living conditions for all people worldwide. However, global development databases, with indicators for each country or region over the years, are not complete; in fact, missing values are quite common, especially in certain indicators.
In this section, we analyze six socio-demographic aspects of society, specifically the following indicators: total fertility rate (TFR), life expectancy at birth (LEB), maternal mortality ratio (MMR), infant mortality rate (IMR), adult obesity rate (AOR), and children under the age of 5 years underweight (CUW), for each country in the world in 2013. TFR and LEB from 1960 to 2013 for countries with nearly complete data over the years were analyzed in Epifanio (2016), who used them to find functional archetypoids that represent the big picture of global development. They were also previously selected by Rosling in the TED talk "The best stats you've ever seen" (Rosling 2006). Here, to aid an understanding of the world today, we focus on the 2013 data as a snapshot of where we are now. The data are freely available at Shackman (2013) and come from the CIA World Factbook 2013 (Central Intelligence Agency 2013).
TFR represents the number of children who would be born to a woman if she were to live to the end of her childbearing years and bear children in accordance with current age-specific fertility rates. LEB means the number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life. MMR is the annual number of female deaths per 100,000 live births from any cause related to or aggravated by pregnancy or its management (excluding accidental or incidental causes). This ratio includes deaths during pregnancy, childbirth, or within 42 days of termination of pregnancy, irrespective of the duration and site of the pregnancy, for a specified year. IMR measures the number of deaths of infants under one year old in a given year per 1000 live births in the same year; this rate is often used as an indicator of the level of health in a country. AOR gives the percentage of a country's population considered to be obese, obesity being defined as an adult having a body mass index (BMI) greater than or equal to 30.0. Finally, CUW gives the percentage of children under five years old considered to be underweight; this statistic is an indicator of the nutritional status of a community.
All the variables present missing observations, but only two countries, Montenegro and Tokelau, have missing data in the variable TFR. These two countries have been removed from the analysis; otherwise, NAs would appear in the dissimilarity matrix. So, finally, a total of 224 countries are considered, and the percentages of missing data are 18.75% for MMR, 0.45% for IMR, 0.89% for LEB, 0% for TFR, 15.63% for AOR, and 41.52% for CUW.
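One simple way to estimate dissimilarities in the presence of missing values — and the reason NAs appear when two rows share no observed variables — is to compute the squared Euclidean distance over the commonly observed variables and rescale it to the full dimension. The following Python sketch illustrates the idea; it is one simple estimator, not necessarily the exact one used in the paper:

```python
import numpy as np

def pairwise_dissimilarities(X):
    """Euclidean dissimilarities (n x n) estimated from the variables
    each pair of rows observes in common, rescaled to dimension p.
    Pairs with no shared observed variables are left as NaN."""
    n, p = X.shape
    D = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i, n):
            mask = ~np.isnan(X[i]) & ~np.isnan(X[j])
            m = mask.sum()
            if m > 0:
                # rescale the partial squared distance by p / m
                d2 = np.sum((X[i, mask] - X[j, mask]) ** 2) * p / m
                D[i, j] = D[j, i] = np.sqrt(d2)
    return D
```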
The variables are standardized because they are measured in noncompatible units. Interpretation is easier if the representatives are "extreme countries" rather than "combinations of countries"; therefore, ADA is considered in this case. We only present results for the EDcMDS strategy, which gives the lowest RSS for all values of k among the strategies based on projecting dissimilarities. In the next section, we analyze the results when imputation and clustering methods are used.
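The EDcMDS strategy projects the estimated dissimilarities into a Euclidean space by classical (Torgerson) MDS before running ADA there; the projection step can be sketched as:

```python
import numpy as np

def classical_mds(D, dim):
    """Embed an n x n dissimilarity matrix D in 'dim' dimensions via
    classical MDS: double-center -D^2/2 and keep the top eigenpairs."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                 # double centering
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]             # largest eigenvalues first
    w = np.clip(w[idx], 0.0, None)              # guard tiny negatives
    return V[:, idx] * np.sqrt(w)
```

ADA is then run on the returned coordinates, and the resulting archetypoids are mapped back to the corresponding samples in the original space.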
Although archetypoids are not necessarily nested, in this case they are somewhat nested, and when the number of archetypoids, k, increases, new, finer patterns appear. Therefore, let us see what happens when different numbers of archetypoids are considered. Table 7 shows the archetypoids obtained, together with their respective percentiles. In all cases, the archetypoids are representative of extreme patterns.
For k = 2, the archetypoids are Malta and Nigeria. As can be seen in Table 7, Nigeria shows high percentiles for MMR, IMR, TFR, and CUW, while Malta shows low percentiles for these indicators and high percentiles for LEB and AOR. To see how the patterns of other countries are expressed as mixtures of these archetypoids, the coefficients α are estimated (see Table 8 for some examples) and represented in Figure 3. The first map shows α_i1, the coefficients with respect to the first archetypoid (Malta), and the second map shows α_i2, the coefficients with respect to the second archetypoid (Nigeria). The darker the color on the map, the greater the value α_ij for country i and archetypoid j, according to the heat color palette (light yellow signals low values, whereas red indicates high values). Darker colors on the map therefore indicate that the indicators of a country are better explained by the corresponding archetypoid. Territories with no information are displayed in green. As can be seen in Figure 3, most "developed" countries have high coefficients with respect to the archetypoid Malta, while the poorest African countries, together with Afghanistan, have high coefficients with respect to the archetypoid Nigeria, and other countries such as Bolivia or India have similar coefficients for both archetypoids (Table 8); they are explained fifty-fifty by each archetypoid. These kinds of maps are usually referred to as "abundance maps" in the hyperspectral imaging field. For k = 3, the archetypoids are Greece, Nigeria, and American Samoa. Table 7 shows that, in terms of the variables analyzed, the profile of Greece is very similar to that of Malta, while the third archetypoid, American Samoa, introduces a profile characterized by a very high percentile for AOR, a high percentile for TFR, and a medium percentile for LEB.
If we estimate the mixture coefficients α to express the other countries as mixtures of these archetypoids, the countries with the highest coefficients for the third archetypoid are small islands and territories such as Nauru (0.9154), Tonga (0.7294), the Cook Islands (0.7841), the Gaza Strip (0.8337), and the West Bank (0.7469). As they are all very small, they cannot be seen in Figure 4.
For k = 4, the profile of the poorest countries is subdivided into two profiles, represented by Zambia and Chad, respectively.
As can be seen in Table 8 and Figure 4, countries like Angola or the Central African Republic are now better represented by one archetypoid or the other. Note that Chad's percentiles are very extreme, with very high or very low values. Zambia's percentiles, however, are not as extreme as Chad's, except for TFR, whose percentile is higher for Zambia.
For k = 5, the profile of the "developed countries," represented by Greece, is now subdivided into two new profiles characterized by Japan and Bosnia and Herzegovina, respectively. The great difference between these two countries lies in LEB (whose percentile is 62 for Bosnia and Herzegovina and 99 for Japan), but especially in AOR (whose percentile is 77 for Bosnia and Herzegovina and 19 for Japan). Finally, for k = 6 a new profile appears, represented by India (see Table 7 and Figure 5). It can be seen in Figure 5 that Japan-like countries are developed countries in Asia, such as Singapore or South Korea, Western European countries, and Australia and New Zealand, while Czech Republic-like countries are mainly Eastern European countries. As mentioned before, Zambia-like countries are mainly in Africa, as are Chad-like countries, although Afghanistan is also well explained by Chad. American Samoa-like countries are mainly small islands that cannot be seen on the maps. Finally, India-like countries correspond to many Southeast Asian countries. Moreover, the maps reveal the composition of other countries. For example, the USA is approximately formed by a mixture of Japan (45%), American Samoa (33%), and the Czech Rep. (20%), while Brazil is a mixture of the Czech Rep. (55%) and India (42%) (see also Table 8).
Our intuition tells us that the variables vary continuously across countries, that is, we do not expect there to be clearly differentiated (separate) groups of countries. This is corroborated by Figure 6, which displays the dissimilarities between countries. Although the objective of AA is not data clustering, we can assign each country to the group for which its corresponding alpha is the maximum. With k = 6, the numbers of countries assigned to each archetypoid are: 62 (Japan), 72 (Czech Republic), 30 (Zambia), 18 (Chad), 17 (American Samoa), and 25 (India).
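The max-alpha assignment just described amounts to taking, for each country, the archetypoid with the largest coefficient (a small Python sketch with the group counts; the function name is ours):

```python
import numpy as np

def hard_assignment(alphas):
    """Assign each sample to the archetypoid with the largest alpha
    coefficient and count the resulting group sizes."""
    labels = np.argmax(alphas, axis=1)
    sizes = np.bincount(labels, minlength=alphas.shape[1])
    return labels, sizes
```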

Results for Other Alternatives
First, we can remove the cases with missing entries, which reduces the sample size from 224 to 128; nearly half of the countries are discarded. The archetypoids with k = 5 are: China, Uganda, Kuwait, India, and Chad. Note that none of those countries has very low percentiles in MMR, IMR, and TFR together with a very high percentile in LEB, that is, the profiles corresponding to Japan and the Czech Rep., which were the most numerous groups. So, when working only with complete data we cannot recover the most common extreme profiles. Second, we can impute the missing values and apply ADA. The archetypoids obtained with k = 5 are: the British Virgin Islands, Gabon, Macau, American Samoa, and Somalia, whose percentiles are not as extreme as those of the archetypoids obtained using cMDS. Furthermore, the RSS with imputation is 0.415, which is higher than the RSS with cMDS (0.338).
Third, we apply k-POD with k = 5. To find a country that is representative of each cluster, we take the nearest country to the center of each cluster. The representative cluster centers are therefore: Curaçao, Nigeria, Samoa, Puerto Rico, and Haiti. Note that many of those countries belong to the same zone, the Caribbean Sea, which makes it difficult for humans to understand the differences in profiles between clusters. Their profiles are not as differentiated as those of the archetypoids. The same happens if we apply PAM with k = 5 to the estimated dissimilarities; the medoids are: South Sudan, Puerto Rico, the West Bank, Curaçao, and Western Sahara, and the silhouette coefficient is 0.16, which indicates that there is no clear cluster structure (Kaufman and Rousseeuw 1990), as seen in Figure 6. Furthermore, note that clustering results return assignments to each cluster without any explanation of the way the data belong to each cluster, unlike the information returned by the alpha values of ADA. For example, k-POD includes countries such as the USA, Brazil (and many other South and Central American countries), the Czech Republic (and other Eastern European countries), Cape Verde, the Gaza Strip, Iran, Iraq, etc. in the same cluster, even though their variable values are not so similar, as can be seen in the alpha coefficients shown in Table 8. The information returned by ADA is richer, since we can know the composition of the data. That information is not returned either if we apply fuzzy clustering to the dissimilarities with the fanny function from the R package cluster (Maechler et al. 2017). The memberships, which range from 0 to 1, are spread out between the groups; they are not sparse like the alpha values. In fact, none of the fuzzy memberships is zero and the maximum value is 0.49. This happens because, as mentioned above, there is no clear cluster structure.
The groups returned by ADA are therefore more reasonable and their interpretation is easier, since ADA identifies a sparser representation of each country in terms of the archetypoids.

Conclusion
New procedures for handling missing data in AA or ADA have been proposed and compared with previous approaches in a simulation study, with favorable results. With the proposed procedures, the theoretical properties regarding the location of archetypes are guaranteed, unlike with the previous approaches. Furthermore, our procedures are the only ones to date that can return archetypal representatives with NAs, which are a natural part of a dataset with missing values. The information gathered in cases with missing values is not discarded, and possible errors due to imputation are also avoided. Moreover, no assumption about the missingness mechanism is needed. Nor do the procedures substantially increase the computational cost of the original algorithms, since the proposed modifications are not computationally expensive.
Based on the simulation results, it seems that, depending on the dataset, one method may work better than another, but there are no significant differences between the best methods. Depending also on whether the interest lies in fitting (low RSS), obtaining accurate archetypes, or clustering the data, one method may be more appropriate than another: MAAMOHAN seems a good alternative for the first two aims, whereas a method based on dissimilarities seems the best option for the last. IMP has returned good results in many settings, but it failed with the wine dataset for MAR. Furthermore, IMP was not good at recovering the archetypes in the waveform dataset, which had a high percentage of missingness. Our contribution is to provide a range of methods to choose from depending on the problem.
For the methodology based on dissimilarities, the cornerstone of the procedure is the estimation of the dissimilarities between samples with missing values and their subsequent projection in a new space. Simple methods have been used for the estimation of the dissimilarities, but even so, promising results have been obtained. As future work, more sophisticated estimators, such as those proposed in Eirola et al. (2014) or Mesquita et al. (2017), could be used. Another open question is the computation of standard errors of the proposed estimates, which could be based on a resampling technique (bootstrapping).
In terms of fields of application, we have focused on socio-demographic aspects of society, but many other aspects could be analyzed, such as politics, economics, technology, etc. In fact, practically any application is possible, since missing data occur in almost all areas of research.
Another possibility for future work would be to extend AA and ADA to datasets with categorical or mixed data with missing values. Note that the "don't know" and "no opinion" answers are not uncommon in surveys. Furthermore, the perspective of using dissimilarities and their projection can be analyzed with other multivariate techniques.

Supplementary Materials
Toy example: A toy example illustrating our proposed methodology and the flaws of the previous approaches. (.pdf file)

Data and code: Datasets and R code for reproducing the results. (.rar file)