Forecasting basketball players' performance using sparse functional data *

Statistics and analytic methods are becoming increasingly important in basketball. In particular, predicting players' performance using past observations is a considerable challenge. The purpose of this study is to forecast the future behavior of basketball players. The available data are sparse functional data, which are very common in sports. So far, however, no forecasting method designed for sparse functional data has been used in sports. We propose a methodology based on two methods that handle sparse and irregular data, combined with the method of analogues and functional archetypoid analysis. A comparison with traditional methods shows that our approach is competitive and additionally provides prediction intervals. The methodology can also be used in other sports when sparse longitudinal data are available.


INTRODUCTION
The statistical analysis of sports is a fast-growing field, and sports forecasting in particular is expanding strongly. Proof of this is the increasing number of forecasting methods being developed, covering several sports and sporting topics [1]. Certainly, in professional sports, getting any advance notion of players' future performance can have important consequences for roster composition, in terms of renewing or buying players, or for establishing levels of remuneration as a function of that performance, for instance. One of the sports that has been revolutionized by data analytics is basketball. Basketball analytics started to attract attention with the publications in refs. [2,3]. More recently, other papers and books have been released [4][5][6]. Technological advances have made it possible to collect more data than ever about what is happening on the field, requiring new methods of analysis. There is currently a need for innovative methods that exploit the full potential of the data and that make it possible to generate additional value for athletes and technical staff. Here too, one of the main challenges is to use past performance to predict future performance [5].

* Not a special issue article. The data and software associated with this paper are available at https://www.uv.es/vivigui/software
To address this open question, some forecasting methods have been developed. Following ref. [7, Chapter 1.4], two main approaches can be distinguished based on the type of data used: time-series forecasting and cross-sectional forecasting. On the one hand, forecasting using data collected over time describes the likely outcome of the time series in the immediate future, based on knowledge of recent outcomes. On the other hand, cross-sectional forecasting methods use data collected at a single point in time. The goal here is to predict a target variable using some explanatory variables which are related to it. Two well-elaborated methods can be found using historical time data: (a) College Prospect Rating, which is a score assigned to college basketball players that attempts to estimate their NBA potential [5,8]; (b) Career-Arc Regression Model Estimator with Local Optimization (CARMELO), which, for a player of interest, identifies similar players throughout NBA history and uses their careers to forecast that player's future activity [9].
Regarding cross-sectional models, a Weibull-Gamma timing model with covariates is proposed in ref. [10] to predict the points scored by players over time. In this case, the time variable is the number of years played in the NBA. Another interesting approach is presented in ref. [11], where correlations and regression models are computed to figure out which foreign players will be successful in the NBA, by using their previous statistics in international competitions.
In addition to the effort of predicting individual performance, there have also been other approaches focusing on teams and other features of the game. Some models using simulation have been developed to forecast the outcome of a basketball match [12,13]. A comparison between predictions based on NCAAB and NBA match data is discussed in ref. [14]. A dynamic paired comparison model is described in ref. [15] for the results of matches in two basketball and football tournaments. Furthermore, in ref. [16] a process model is used with player tracking data for predicting possession outcomes.
We wish to consider a new perspective by using functional data analysis (FDA) in sports. FDA is a relatively new branch of Statistics that analyzes data drawn from continuous underlying processes, often indexed by time; that is, a whole function is a datum. Let us assume that n smooth functions, $x_1(t), \ldots, x_n(t)$, are observed, with the ith function measured at the points $t_{i1}, \ldots, t_{in_i}$. In our study, $x_i(t)$ represents the metric value of player i at a certain age t. An important point we would like to emphasize here is that the time component of the FDA approach we are considering will represent players' ages. As such, in this paper we propose different models for aging curves, which is a well-recognized and important topic within the more general area of forecasting player performance. As mentioned in ref. [9], the most important attribute of all, in terms of determining a player's future career trajectory, is his/her age.
The goals of FDA coincide with those of any other branch of Statistics, and the classical summary statistics can also be defined, such as the mean function $\bar{x}(t) = n^{-1} \sum_{i=1}^{n} x_i(t)$. An excellent overview of FDA is found in ref. [17], while methodologies for studying functional data nonparametrically are found in ref. [18]. Ramsay et al [19] introduce related software and Ramsay and Silverman [20] present some interesting applications in different fields. Other recent applications include refs. [21,22]. In all these problems, a continuous function lies behind the data even though functions are sampled discretely at certain points. The FDA framework is highly flexible because the sampling time points do not have to be equally spaced and both the argument values and their cardinality can vary across cases. When functions are observed over a relatively sparse set of points, we have sparse functional data. An excellent survey on sparsely sampled functions is provided in ref. [23].
As regards the forecast of functional time series, there is a body of research, such as refs. [24][25][26], where functions are measured over a fine grid of points. However, only a few works deal with the problem of forecasting sparse functional data [27]. Notice that when functions are observed over a dense grid of time points, it is possible to fit a separate function for each case using any reasonable basis. Nevertheless, in the sparse case, this approach fails and the information from all functions must be used to fit each function. This is because each individual has been observed at different time points. Therefore, any fixed grid that is formed will contain many missing observations for each curve. A very good explanation of this issue can be found in ref. [23].
Sports data are sparse and irregular. They are sparse because most players do not have a very long career in the same league. And they are irregular because each player's career lasts for a different length of time. Despite the fact that time series data or movement trajectories are very common in sports, FDA has been mostly used in sport biomechanics or medicine [28,29]. To the best of our knowledge, there are only two references about sports analytics using FDA. In ref. [30], FDA was introduced for the study of players' aging curves and both hypothesis testing and exploratory analysis were performed. Vinué and Epifanio [31] extended archetypoid analysis (ADA) for sparse functional data (see also refs. [32,33]), showing the potential of FDA in sports analytics. In particular, in ref. [31, section 5.2] it was demonstrated that advanced analysis with FDA reveals patterns in the players' trajectories over the years that could not be discovered if data were simply aggregated (eg, averaged). We take advantage of this fact and continue the work done in ref. [31, section 5.2] with a view to predicting how players will evolve.
In this paper, we propose a methodology to predict players' performance using sparse functional data. Two metrics will be analyzed: Box Plus/Minus (BPM) and Win Shares (WS). Analysis using BPM will allow us to establish a plausible comparison with CARMELO, while analysis with another variable such as WS will allow us to evaluate differences in career arcs. To that end, we will focus on two existing methods designed to handle sparse and irregular data: (a) regularized optimization for prediction and estimation with sparse (ROPES) data, originally developed by Alexander Dokumentov and Rob Hyndman [27,34]; (b) principal components analysis through conditional expectation (PACE), originally developed by Fang Yao, Hans-Georg Müller and Jane-Ling Wang [35].
Our methodology will also involve using the method of analogues based on functional archetypoid analysis (FADA), which will allow us to refine predictions for the players of interest and to achieve a more reliable forecast, in line with the expectations of basketball analysts. We will apply this methodology to a very comprehensive database of NBA players. Results will be obtained using the R software [36].
As noted earlier, forecasting future performance is also very relevant to other sports (see for instance ref. [37]). We would like to emphasize that our methodology can also be used in other sports when sparse longitudinal data are available. Data and R code (including a web application created with the R package shiny [38]) to reproduce the results can be freely accessed at https://www.uv.es/vivigui/software. The rest of the paper is organized as follows: Section 2 reviews ROPES, PACE, ADA and FADA. Section 3 is concerned with the data and input variables used. Section 4 presents two analyses: (a) ROPES and PACE are compared with each other and with standard benchmarks and (b) the reliability of ROPES predictions for current players using the method of analogues with FADA is shown. A comparison with CARMELO results is also provided and the implications of the archetypoid coefficients are discussed. The paper ends with some conclusions in Section 5. An appendix, provided as supporting information, contains a validation study to choose an optimal combination of the tuning parameters on which ROPES depends, which is crucial for good predictive performance. It also shows how this methodology can be used to forecast international players. The appendix is available as an online supplement.

Regularized optimization for prediction and estimation with sparse data
The ROPES method, proposed by Alexander Dokumentov and Rob Hyndman [27,34], solves problems that involve decomposing, smoothing and forecasting two-dimensional sparse data. In practical terms, the aim is to interpolate and extrapolate sparse longitudinal data made up of n observations and presented over a time dimension with m time points. The optimization problem that is solved (stated in full in refs. [27,34]) involves the following elements:
• Y is an n × m matrix.
• ⊙ is the element-wise matrix multiplication.
• W is an n × m "masking matrix" of weights.
ROPES is equivalent to maximum likelihood estimation with partially observed data, which allows the calculation of confidence and prediction intervals. They are estimated using a Monte-Carlo style method. The original two sources [27,34] should be referred to for all the specific details.
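The role of the masking matrix W can be illustrated with a deliberately simplified stand-in for this kind of problem: a rank-1 factorization Y ≈ u vᵀ fitted only on the observed cells, with a plain ridge penalty. This is a sketch under assumptions and not the ROPES objective or its implementation. The objective, the penalty weight `lam`, the function name `masked_rank1_fit` and the toy data are all illustrative; the actual penalty terms are those given in refs. [27,34].

```python
def masked_rank1_fit(Y, W, lam=0.01, iters=200):
    """Minimize sum_ij W_ij * (Y_ij - u_i * v_j)^2 + lam * (|u|^2 + |v|^2)
    by alternating closed-form ridge updates of u and v.
    Simplified illustration only; not the ROPES objective."""
    n, m = len(Y), len(Y[0])
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        for i in range(n):  # update each row factor with v held fixed
            num = sum(W[i][j] * Y[i][j] * v[j] for j in range(m))
            den = sum(W[i][j] * v[j] ** 2 for j in range(m)) + lam
            u[i] = num / den
        for j in range(m):  # update each column factor with u held fixed
            num = sum(W[i][j] * Y[i][j] * u[i] for i in range(n))
            den = sum(W[i][j] * u[i] ** 2 for i in range(n)) + lam
            v[j] = num / den
    return u, v


# Hypothetical toy data: rows are players, columns are ages.
# Cells with W = 0 are unobserved; the 0.0 stored in Y there is a placeholder.
Y = [[1.0, 2.0, 0.5, 0.0],
     [0.0, 4.0, 1.0, 3.0],
     [3.0, 0.0, 1.5, 4.5]]
W = [[1, 1, 1, 0],
     [0, 1, 1, 1],
     [1, 0, 1, 1]]

u, v = masked_rank1_fit(Y, W, lam=1e-3, iters=500)
# The unobserved cell (0, 3) is filled in from the other players' curves;
# the rank-1 pattern in the observed cells implies a value close to 1.5.
print(round(u[0] * v[3], 2))
```

Because the unobserved cells carry zero weight, their placeholder values never enter the fit, and each player's missing ages are filled in by borrowing information from the other rows.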

Principal components analysis through conditional expectation
Functional principal components analysis (FPCA) is a common tool to reduce the dimension of data when the observations are random curves. The usual computational methods for FPCA based on function discretization or basis function expansions are inefficient when data with only a few repeated and sufficiently irregularly spaced measurements per subject are available. As a quick reminder, when functions are measured over a fine grid of time points, a separate function for each individual can be used. In the sparse case, however, the information from all functions must be used to fit each function.
A version of FPCA, in which the FPC scores are framed as conditional expectations, was developed by Fang Yao, Hans-Georg Müller and Jane-Ling Wang to overcome this issue [35]. This method is referred to as PACE, for sparse and irregular longitudinal data. In practice, the prediction of the trajectory $X_i(t)$ for the ith subject, using the first p eigenfunctions, is

$$\hat{X}_i^p(t) = \hat{\mu}(t) + \sum_{q=1}^{p} \hat{\xi}_{iq}\,\hat{\phi}_q(t), \qquad (2)$$

where $\hat{\mu}(t)$ is the estimate of the mean function $E(X(t)) = \mu(t)$, $\hat{\phi}_q(t)$ are the estimated eigenfunctions and $\hat{\xi}_{iq}$ are the FPC scores. PACE and its implementation in the R library fdapace [39] use local smoothing techniques to estimate the mean and covariance functions of the trajectories; specifically, a local weighted bilinear smoother is used for estimating the covariance. Generalized cross validation is used for bandwidth choice, which is the default method for the FPCA function in the R library fdapace (default parameters are considered; eg, 10 folds and a Gaussian kernel are used). The number of components p is determined using the fraction-of-variance-explained threshold (0.9999 by default) computed during the SVD of the fitted covariance function. The eigenfunctions $\hat{\phi}_q(t)$ and the number p are estimated with the training set, and they are used in the estimation of the scores for the test set. This is the procedure we will follow in Section 4.1. With the scores and the estimated eigenfunctions, we obtain an approximation of the trajectories, which can be used to predict unobserved portions of the functions. Yao et al [35] also explain the construction of asymptotic pointwise confidence intervals for individual trajectories and asymptotic simultaneous confidence bands.
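To make the conditional-expectation step concrete, the following minimal sketch computes the scores of one subject as $\hat{\xi}_{iq} = \hat{\lambda}_q \hat{\phi}_{iq}^{T} \hat{\Sigma}_{Y_i}^{-1} (Y_i - \hat{\mu}_i)$, which is the estimate described in ref. [35] under Gaussian assumptions, and then evaluates the truncated expansion (mean function plus a score-weighted sum of eigenfunctions). Several things here are assumptions for illustration: the eigenvalues `lam` and the noise variance `sigma2` are taken as given, the covariance of the observations is built from the truncated expansion rather than from a smoothed covariance surface, and all function names and numbers are hypothetical. It is not the fdapace implementation.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def pace_scores(y, mu, phi, lam, sigma2):
    """Conditional-expectation scores for one subject.
    y, mu: observed values and mean function at the subject's own time points.
    phi[q]: q-th eigenfunction at those points; lam[q]: its eigenvalue."""
    n_i = len(y)
    # Covariance of the noisy observations (truncated expansion, an assumption).
    Sigma = [[sum(l * f[j] * f[k] for l, f in zip(lam, phi))
              + (sigma2 if j == k else 0.0)
              for k in range(n_i)] for j in range(n_i)]
    z = solve(Sigma, [y_j - m_j for y_j, m_j in zip(y, mu)])
    return [l * sum(f_j * z_j for f_j, z_j in zip(f, z))
            for l, f in zip(lam, phi)]


def predict(mu_t, phi_t, scores):
    """Truncated expansion at one time point: mean plus weighted eigenfunctions."""
    return mu_t + sum(s * f for s, f in zip(scores, phi_t))


# Hypothetical subject with two observations and a single component (p = 1).
scores = pace_scores(y=[2.0, 2.0], mu=[1.0, 1.0],
                     phi=[[0.5, 0.5]], lam=[4.0], sigma2=1.0)
print(scores, predict(1.0, [0.5], scores))
```

In this toy case the observed values sit one unit above the mean, yet the fitted value lies between the mean and the observations. The noise variance in the denominator shrinks the score, and the shrinkage is stronger for subjects with few observations.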

Archetypoid analysis
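ADA was presented in ref. [33] and is an extension of archetypal analysis defined in ref. [40] (see refs. [41,42] for other derived methodologies). In ADA, archetypes correspond to real observations (the so-called archetypoids). Let X be an n × p matrix of real numbers representing a multivariate data set with n observations and p variables. For a given g, the objective of ADA is to find a g × p matrix Z that characterizes the archetypal patterns in the data. In ADA, the optimization problem is formulated as follows:

$$\min_{\alpha, \beta} \sum_{i=1}^{n} \Big\| x_i - \sum_{j=1}^{g} \alpha_{ij} z_j \Big\|^2 = \min_{\alpha, \beta} \sum_{i=1}^{n} \Big\| x_i - \sum_{j=1}^{g} \alpha_{ij} \sum_{l=1}^{n} \beta_{jl} x_l \Big\|^2, \qquad (3)$$

under the constraints (1) $\sum_{j=1}^{g} \alpha_{ij} = 1$ with $\alpha_{ij} \geq 0$ for $i = 1, \ldots, n$, and (2) $\sum_{l=1}^{n} \beta_{jl} = 1$ with $\beta_{jl} \in \{0, 1\}$ for $j = 1, \ldots, g$, so that each archetypoid $z_j$ is one of the observations.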
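Once the archetypoids are fixed, each observation is described by mixture coefficients that are non-negative and sum to one. The following minimal sketch computes those coefficients for a single observation by projected gradient descent on the simplex. It assumes the archetypoids are already known (searching for them is the harder part and is handled by the algorithm in ref. [33]), and the function names, step-size rule and toy data are illustrative choices rather than the published implementation.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {a : a_j >= 0, sum_j a_j = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # the last k that satisfies the condition wins
    return [max(x - theta, 0.0) for x in v]


def mixture_coefficients(x, Z, steps=2000):
    """Coefficients alpha minimizing ||x - sum_j alpha_j * Z[j]||^2
    subject to alpha_j >= 0 and sum_j alpha_j = 1."""
    g, p = len(Z), len(x)
    alpha = [1.0 / g] * g
    # Safe step size: 1 / (2 * squared Frobenius norm of Z).
    step = 1.0 / (2.0 * sum(z_k ** 2 for z in Z for z_k in z) + 1e-12)
    for _ in range(steps):
        fitted = [sum(alpha[j] * Z[j][k] for j in range(g)) for k in range(p)]
        residual = [x[k] - fitted[k] for k in range(p)]
        grad = [-2.0 * sum(Z[j][k] * residual[k] for k in range(p))
                for j in range(g)]
        alpha = project_to_simplex([a - step * d for a, d in zip(alpha, grad)])
    return alpha


# Hypothetical archetypoids (rows of Z) and one observation x inside their hull.
Z = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
alpha = mixture_coefficients([1.0, 1.0], Z)
closest = max(range(len(alpha)), key=lambda j: alpha[j])
print([round(a, 3) for a in alpha], "largest coefficient at index", closest)
```

The index of the largest coefficient identifies the archetypoid that the observation resembles most, which is how the coefficients are read later in the paper.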

ADA for sparse data with FDA
ADA was defined for functions in ref. [32], where it was shown that functional archetypoids can be obtained as in the multivariate case if the functions are expressed in an orthonormal basis, simply by applying ADA to the basis coefficients. When functions are measured over some sparse points, we have sparse functional data. The basic idea of FADA is as follows. Based on the Karhunen-Loève expansion, the functions are approximated as in Equation (2). Because the eigenfunctions are orthonormal, to obtain FADA we can apply ADA to the n × p matrix X, with the scores (the coefficients in the Karhunen-Loève basis).

DATA
We have used the R package ballr [44] to obtain the total advanced statistics for each NBA player from the 1973-1974 season to the 2017-2018 season, including the player's age on 1 February of that season (the convention in https://www.basketball-reference.com/ is to provide players' age at the start of 1 February of every season). From the total set of statistics, we have focused on BPM and WS. BPM is a box score-based variable for estimating basketball players' quality and contribution to the team. It takes into account both the players' statistics and the team's overall performance. The final value enables us to evaluate the player's performance relative to the league average. BPM is a per-100-possession statistic and its scale is as follows: 0 is the league average, +5 means that the player has contributed 5 more points than an average player over 100 possessions, −2 is replacement level and −5 is very bad. Replacement level players are those who fill a roster spot on short-term contracts, so they are not normal members of a team's rotation. We have chosen BPM because it was created to only use the information that is available historically. This means that BPM only takes into account those stats that have always been available in the box-scores. It does not consider new stats derived from play-by-play data or from tracking data. According to the website where BPM is explained, "it is possible to create a better stat than BPM for measuring players, but difficult to make a better one that can also be used historically," which fits perfectly with the goal of our method. BPM is available from the 1973-1974 season.
We have chosen a second metric, which is also widely used: WS. It also has the advantage of taking the surrounding team into account. In particular, WS is a player statistic that distributes the team's success among the team players. It is calculated using player, team and league statistics. The sum of all the players' WS in a given team will be approximately equal to that team's total wins for the season. A negative WS means that the player took away wins that his teammates had generated.
The reason for analyzing two variables is to investigate differences in career arcs for different aspects of skill. This allows us to highlight the power of our approach and could be of interest to basketball fans/analysts. Any other statistic can be chosen.
We have removed the observations with fewer than five games played. They were related to very extreme BPM values.

Figure 1 illustrates the type of data we are working with. It shows the observations of certain players, whose values are represented as connected points. Players' ages will represent the time points to be used by our methodology. The initial range of ages in the database went from 18 to 44 years old. However, there were only a few measurements between ages 41 and 44, related to a few long-lasting players, so we have removed them. The age range finally considered is from 18 to 40 years old, that is, there are 23 time points. In total there are 3075 players.
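The filtering and gridding steps described above can be sketched as follows. The code turns irregular season records into a players × ages matrix with a companion 0/1 mask marking which cells were actually observed. The record layout, the function name `build_matrices` and the toy rows are hypothetical and are not taken from the real database; the 0/1 coding of the mask is also an assumption.

```python
AGES = list(range(18, 41))  # ages 18..40 inclusive, i.e. 23 time points


def build_matrices(records, min_games=5):
    """Turn irregular (player, age, games, value) records into an n x m data
    matrix Y and a 0/1 mask W, with one row per player and one column per age."""
    kept = [r for r in records
            if r[2] >= min_games and AGES[0] <= r[1] <= AGES[-1]]
    players = sorted({r[0] for r in kept})
    row = {name: i for i, name in enumerate(players)}
    Y = [[0.0] * len(AGES) for _ in players]  # 0.0 is only a placeholder
    W = [[0] * len(AGES) for _ in players]    # 1 marks an observed cell
    for name, age, _games, value in kept:
        i, j = row[name], age - AGES[0]
        Y[i][j] = value
        W[i][j] = 1
    return players, Y, W


# Hypothetical toy records, not taken from the real database.
records = [
    ("player_a", 19, 60, -1.2),
    ("player_a", 20, 72, 0.4),
    ("player_a", 21, 3, 9.5),    # dropped: fewer than five games
    ("player_b", 30, 80, 2.1),
    ("player_b", 42, 10, -3.0),  # dropped: outside the 18-40 age range
]
players, Y, W = build_matrices(records)
observed = sum(map(sum, W))
print(players, observed, "observed cells out of", len(players) * len(AGES))
```

Even in this tiny example most cells are empty, which is the sparsity that rules out fitting each player's curve separately on a fixed grid.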

RESULTS
Section 4.1 contains a comparison analysis between ROPES, PACE and two benchmark methods. Section 4.2 discusses the implications of the archetypoid coefficients. Section 4.3 specifies the type of projections obtained. Section 4.4 presents an interactive web application.

Comparison with other methods
In order to evaluate the usefulness of ROPES and PACE, we compare them with each other and with two benchmark methods: the average method and the naïve method. In order to check the performance of all the methods, we have applied them to the test set of 385 players (see the validation study in the appendix for how this set is created). Table 1 reports an extract of the results. It contains the following information for all players in the 2017-2018 season: (a) their age; (b) their actual BPM value; (c) the predictions with ROPES (using the optimal combination from the validation study in the appendix), PACE and the simple methods; (d) the squared difference between predictions and actual values, $(\mathrm{BPM}_{pr} - \mathrm{BPM})^2$, and (e) the resulting total MSE (highlighted in bold). PACE obtains the smallest MSE, followed by ROPES. It is interesting to note that the mean BPM obtained with the naïve method is practically the same as the actual one (both rounded to −0.92). This result is most probably because over- and under-predictions cancel each other out.

Remarkably, the naïve method shows outliers in all the intervals. Figure 2 shows that PACE and ROPES tend to overestimate bad players and underestimate good ones. This can be very helpful for team managers in their search for future stars (this additional interpretation was suggested by one of the referees).
Overall, PACE is the method that performs best. ROPES is able to beat the simple benchmark methods, showing an improvement with respect to them. The main drawback of the current PACE implementation is the lack of prediction intervals. The main goal of this paper is to draw attention to the added value of using an FDA approach to forecast players' performance, which has not been done so far. Therefore, even though PACE should give somewhat more accurate predictions than ROPES, in the next section we will use ROPES to forecast players' future performance because it does provide prediction intervals. Prediction intervals are very helpful and important because they express how much uncertainty is associated with the forecast.
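The two benchmarks and the error measure can be sketched as follows, assuming the usual textbook definitions: the average method forecasts the mean of a player's past values and the naïve method repeats the last observed value. The toy numbers are hypothetical and are chosen only to show how over- and under-predictions can cancel in the mean while the MSE stays large.

```python
def average_forecast(history):
    """Average method: forecast the mean of the past values."""
    return sum(history) / len(history)


def naive_forecast(history):
    """Naive method: forecast the last observed value."""
    return history[-1]


def mse(predictions, actuals):
    """Mean of the squared differences between predictions and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)


# Hypothetical past BPM values for two players and their actual next-season BPM.
histories = [[1.0, 2.0], [-1.0, -2.0]]
actuals = [-2.0, 2.0]

naive = [naive_forecast(h) for h in histories]
average = [average_forecast(h) for h in histories]

# The naive predictions have the same mean as the actual values,
# yet every individual prediction is far from its target.
print(sum(naive) / len(naive), sum(actuals) / len(actuals))
print(mse(naive, actuals), mse(average, actuals))
```

A matching mean is therefore weak evidence of forecast quality, which is why the comparison above relies on the MSE.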

Implications of the archetypoid coefficients
We have analyzed a total of eight players from different teams, namely Devin Booker, Clint Capela, Joel Embiid, Nikola Jokic, Tyus Jones, Zach LaVine, Donovan Mitchell and Jayson Tatum. They represent several career stages. Embiid, Jokic, Mitchell and Tatum are already established figures (especially Embiid and Jokic). Booker and Capela are a step below the super stars but they are also very good players. Something similar could be said about Tyus Jones, who is constantly improving his skills. Finally, LaVine is an offensive specialist.
In a first attempt to compute predictions using all the players of the data set, we realized that the ROPES method had some pull towards the mean of the entire sample (like the other methods discussed in Section 4.1 but not as strong as them). This gave unrealistic performance predictions for both the best and most promising players. Therefore, in order to refine predictions, it is much more suitable to use the so-called "method of analogues." The idea is to find players related to the one of interest and then use their documented activity to obtain the predictions. We know how other players already performed, so we can use their information to gain an approximate idea about the future performances of others. The method of analogues has been used for years in fields such as climatology [45] and epidemiology [46]. Recently, an R package has been released that contains analogue methods for paleoecology [47]. The CARMELO method is also based on this scheme.
In order to find related players, we use ADA (see ref. [33] for theoretical details). ADA searches for extreme observations (the so-called archetypoids) to describe the frontiers of the data. In this technique, the BPM (and WS) function of a player is approximated by a mixture of archetypoids, which are themselves functions of boundary players (outstanding performers, whether positive or negative). Archetypoids are specific players and the coefficients represent how much each archetypoid contributes to the approximation of each individual. The most comparable archetypoid should be the one corresponding to the largest value of the coefficients for the player of interest.
We choose the number of archetypoids for each metric following the screeplot explained in ref. [33]. Five are selected for BPM (James, Gray, Dawkins, Curry and Stone) and four for WS. Stone represents an even more extreme bad pattern than Gray. We acknowledge that the five archetypoids computed for BPM do not exactly represent all the players' typologies available in the database. In some cases, a greater number of archetypoids is needed to capture other players' profiles. Even though we have determined the optimal number of archetypoids with the screeplot, in a real situation it would be up to the analyst to decide how many representative cases to consider.
Regarding the WS metric, the archetypoids are (their career WS shown in brackets): Steve Burtt (0), Ben Wallace (5.84), Otis Birdsong (4.03) and LeBron James (14.6). Again, as expected, James is the representative of super star players. The fact that James is selected as the "best" archetypoid in both metrics indicates how he can excel in many aspects of the game.
Otis Birdsong and Ben Wallace represent very good players. Otis Birdsong played 12 NBA seasons and appeared in 4 NBA All-Star Games. He was selected with the second pick of the 1977 NBA draft. Ben Wallace was very good at grabbing rebounds and blocking opponent shots. He won the NBA

Table 2 shows the archetypoid coefficients of the eight selected players for the BPM and WS archetypoids.
As mentioned before, in ADA each datum is expressed as a mixture of actual observations (archetypoids). In particular, the coefficients of each player are of great utility because they allow us to determine the composition of each player according to the archetypoid players, and to establish a clustering of similar players [31]. A discussion of the implications of the resulting archetypoid coefficients is given next.

The BPM composition of Embiid is as follows. Embiid's profile matches 49% of James', 23% of Gray's, 15% of Dawkins' and 13% of Curry's. This reflects the fact that Embiid is on his way to becoming a super star like James, but he still has some room for improvement. Regarding the WS composition, Embiid's profile is 42% explained by Burtt's, 29% by Birdsong's, 18% by Wallace's and 11% by James'. In this case, Embiid's room for development is even more evident.

The archetypoid coefficients for Jokic are quite impressive. His highest coefficient is with James in both variables. His BPM profile matches 68% of James' BPM profile (the highest similarity to James in this analysis by far) and his WS profile matches 43% of James' WS profile (only Tatum has a close value) and 35% of Birdsong's. This high similarity with respect to James implies that Jokic's performance is already very good. Proof of his remarkable activity is that he received his first All-Star and All-NBA First Team selections in the 2018-2019 season. In addition, his 12 triple-doubles ranked second on the season behind only Russell Westbrook [6].

Capela is another player worth highlighting. His highest BPM coefficient is also with James, though his profile still matches 33% of Stone's, which indicates some shortcomings in his performance. Regarding his WS composition, his profile matches 76% of Birdsong's, 14% of Wallace's and 10% of James', which indicates a remarkable team productivity.
On the other hand, the BPM profiles for Booker, Jones, Tatum and Mitchell are as close to James' as to Gray's. In terms of James' coefficient, it is very difficult to approach his excellence, so we should not expect Booker, Jones, Tatum and Mitchell to achieve such an incredible level. However, Gray's coefficient also indicates that their performance is not yet very far from that of an average player, so they still need to make further efforts to stand out from their competitors. Their WS profiles reaffirm this claim. Finally, LaVine's BPM and WS profiles are most similar to Gray's and Burtt's profiles, respectively, so he is not showing an outstanding performance at all.

Projections of future performance with ROPES and the method of analogues. Comparison with CARMELO
In order to select the cluster of analogous players, we first choose the archetypoid with the highest coefficient for the player of interest. As an illustration, Embiid's greatest similarity for BPM is with James. Then, the group of BPM similar players to Embiid is made up of James, together with the other players whose largest coefficient is also for James and who have a coefficient value greater than Embiid's. We will use this cluster to forecast Embiid's future career arc. Current stars such as Chris Paul or Kevin Durant and stars of previous seasons such as Michael Jordan or Charles Barkley belong to this set. The same procedure is followed for WS.
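The selection rule just described can be sketched in a few lines. The coefficient table below is entirely hypothetical (labels and numbers are made up for illustration), and the function name `analogue_cluster` is our own.

```python
def analogue_cluster(alphas, target):
    """Return the index of the target's dominant archetypoid and the players
    whose dominant archetypoid is the same and whose coefficient for it is
    larger than the target's.
    alphas: dict mapping player -> list of archetypoid coefficients."""
    def dominant(coefficients):
        return max(range(len(coefficients)), key=lambda j: coefficients[j])

    top = dominant(alphas[target])
    threshold = alphas[target][top]
    cluster = [name for name, a in alphas.items()
               if name != target and dominant(a) == top and a[top] > threshold]
    return top, cluster


# Hypothetical coefficients for three archetypoids (columns).
alphas = {
    "archetypoid_0": [1.0, 0.0, 0.0],  # an archetypoid is fully explained by itself
    "target": [0.5, 0.3, 0.2],
    "player_1": [0.4, 0.6, 0.0],       # different dominant archetypoid
    "player_2": [0.7, 0.2, 0.1],       # same archetypoid, larger coefficient
    "player_3": [0.45, 0.35, 0.2],     # same archetypoid, smaller coefficient
}
top, cluster = analogue_cluster(alphas, "target")
print(top, cluster)
```

The archetypoid itself always enters the cluster because its own coefficient is one. Only the curves of the selected players would then be passed to the forecasting step.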
The ROPES algorithm (with the lambda combination obtained in the validation study) is used to obtain p-forecasting intervals, where p = 0.05 is the selected significance level. We will compare the ROPES predictions with those provided by the 2018-2019 version of CARMELO (https://projects.fivethirtyeight.com/carmelo/). CARMELO is a basketball forecasting system released in the 2015-2016 season. Successive versions present some improvements [9]. To the best of our knowledge, it is the only publicly available projection system to compare our approach against. For each player of interest, CARMELO computes the similarity scores between that player and all historical players. To that end, it uses a number of statistics and players' attributes and a version of a nearest neighbor algorithm. The wins above replacement (WAR) metric is computed for all historical players with a positive similarity score. The forecast is given by averaging these WAR values.
WAR reflects a combination of a player's projected playing time and his projected productivity while on the court. Productivity is measured by a blend of two-thirds Real Plus-Minus (RPM) and one-third BPM. BPM alone was used to make the 2016-2017 forecasts, but the combination of RPM and BPM is used for the 2018-2019 forecasts (as in 2015-2016 and 2017-2018). According to the developers of CARMELO, the RPM/BPM blend seems to outperform BPM alone. The RPM statistic quantifies how much a player hurts or helps his/her team when (s)he is on the court. There has been some controversy regarding the validity of RPM, since the computations are not detailed. In fact, the CARMELO methodology cannot be replicated either. In addition, for seasons before 2000-2001, no RPM is available and CARMELO uses BPM only. The final point that needs to be made is that RPM is not available in our database. Therefore, we would like to draw the reader's attention to the fact that our results are not directly equivalent to those of CARMELO, because the target variable is not exactly the same. However, both approaches should be complementary. Figure 3 shows the ROPES forecast obtained for the eight players selected. In addition, for the sake of a convenient comparison with CARMELO, Figure 4 shows the screenshots of the CARMELO curves for the same eight players.
In Figure 3 we see that the predictions for Jokic show that his BPM and WS are expected to increase in the next three seasons and will remain quite high for several seasons (though they will go down from age 28). His lower and upper predictions are also high values. The width of the prediction intervals is constant over seasons. In general, the width of the intervals remains quite stable for most players. As a referee rightly mentioned, the intervals show several possible scenarios for some players, going from high to low values, so there is some uncertainty related to the point predictions. The CARMELO forecast for Jokic indicates some decrease in his performance, while still keeping high numbers (Figure 4D). The width of the CARMELO intervals fluctuates: the uncertainty in the '20 season is bigger than in the '19 and '21 seasons.

The BPM-ROPES forecast for Embiid shows that he will improve his BPM in the two coming seasons and then his performance will slowly decline. Regarding WS, it indicates a constant decrease over time. CARMELO also indicates that Embiid's performance will increase within two seasons and then his values will decrease (the intervals alternate between narrow and wide, Figure 4C).

The BPM-ROPES forecast for Capela shows a constant increase in the coming years, keeping good values for many seasons. On the other hand, his WS values will constantly decrease. In this case, the intervals widen over the years (especially in the BPM facet). The CARMELO forecast for Capela is a bit more conservative. Its prediction intervals are a bit wider in the '20 and '21 seasons than in the '19 season and then narrow a bit again (Figure 4B).

The BPM-ROPES forecast for Devin Booker and Tyus Jones is somewhat similar to Capela's. However, their WS-ROPES forecast is a flat arc around 0. The CARMELO predictions for these two players show a certain increase in their activity (Figure 4A,E). The width of the prediction interval for Booker remains quite constant. For Jones the interval width fluctuates between some wide and narrow ranges.

The BPM-ROPES predictions for Donovan Mitchell and Jayson Tatum are similar to Jokic's, though their values are not so outstanding. Their WS prediction is a flat arc around 0. The CARMELO predictions for Mitchell and Tatum also show a constant increase in their activity (Figure 4G,H). In both cases, the width of the prediction interval also fluctuates between some wide and narrow ranges.

Finally, the BPM-ROPES forecast for LaVine shows a constant increase but keeps negative numbers, especially for the next 4 years. His WS prediction moves around 0. CARMELO also suggests an ordinary and flat performance in the coming seasons (Figure 4F).

It is worth mentioning that some of the WS predictions stop at age 33 because this is the last age for which the set of analogous players shows values. For Mitchell, Booker and Tatum, WS-ROPES has been too conservative. Higher values would have been a bit more realistic, since everything seems to indicate that these three players will become very good players in the near future. Another aspect that demands a careful examination of the results is that the WS prediction for Booker climbs to about 0.5 in the last 2 years. These are some pitfalls worth highlighting for end-users of this methodology.
As a final point, it is important to remember that statistical models are not completely reliable for long-term forecasts, because the assumption that the future looks similar to the past gradually breaks down the further we look ahead. The predictions should therefore be updated regularly as new data become available.

Web application
Additionally, an interactive web application available at https://www.uv.es/vivigui/AppPredPerf.html allows the user

CONCLUSIONS
Basketball, like any other sport, contains a lot of uncertainty. A central issue is to predict players' future performance using past observations. Although basketball data continue to expand and there is a constant demand for new techniques that provide objective information to help understand the game, there are not many publicly available projection systems. In this paper we have presented a methodology to deal with sparse functional data in order to forecast basketball players' performance. This has been done by analyzing ROPES and PACE and by including the method of analogues together with functional archetypoid analysis. ROPES depends on several parameters, so we have carried out a validation study (included in the appendix) to choose an optimal combination that provides smooth curves and avoids overfitting. The combination obtained works well to avoid narrow intervals and overconfident inferences. A comparison study has also been carried out to compare ROPES with PACE and with simple alternatives, such as the average and naïve methods. PACE performed best overall and was also faster than ROPES. However, unlike ROPES, its current computational implementation does not provide prediction intervals. In addition, ROPES performed better than the simple methods. Therefore, we have applied ROPES in the real case using data from the 1973-1974 to the 2017-2018 NBA regular seasons.
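As an illustration of the simple alternatives mentioned above, the following Python sketch assumes the usual definitions: the naïve method repeats the last observed value, and the average method repeats the mean of the past observations. The function names and the BPM values are hypothetical and are not taken from the data or software of this paper.

```python
from statistics import mean


def naive_forecast(history, horizon):
    """Repeat the last observed value for every future season."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return [history[-1]] * horizon


def average_forecast(history, horizon):
    """Repeat the mean of all past observations for every future season."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return [mean(history)] * horizon


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical BPM values for one player, invented for illustration only.
past_bpm = [-1.0, 0.5, 2.0]
future_bpm = [2.5, 3.0]

naive = naive_forecast(past_bpm, len(future_bpm))
average = average_forecast(past_bpm, len(future_bpm))
naive_error = mean_absolute_error(future_bpm, naive)
average_error = mean_absolute_error(future_bpm, average)
```

Both baselines ignore the shape of the career arc, which is why they serve only as a reference point for the functional methods.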
In the sparse case, information from all functions is used to fit each function, so all individuals contribute to a greater or lesser degree to each estimate, including players who differ greatly from the one being predicted. To limit this effect and to refine the predictions, we have used the so-called "method of analogues." The idea is to relate a player's curve to one of the possible types of players and then to predict his performance using only the information about these comparable athletes. In our case, the types of players are given by the archetypoids of the data set.
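The idea of restricting the prediction to comparable players can be sketched in Python as follows. This is a deliberately simplified illustration under stated assumptions: each player is assigned to the archetypoid with the largest coefficient, and the model fit on the analogous players is replaced by a plain minimum, mean and maximum of their observed values at each age. The data structure, function names and numbers are hypothetical.

```python
from statistics import mean


def dominant_archetypoid(alphas):
    """Index of the largest archetypoid coefficient."""
    return max(range(len(alphas)), key=lambda k: alphas[k])


def analogue_forecast(target_alphas, players, ages):
    """Summarize the values of analogous players at each future age.

    players: list of dicts with 'alphas' (coefficients) and 'curve' (age -> value).
    Returns a dict age -> (lower, point, upper). Assumption: a crude
    min/mean/max summary stands in for the actual model fit.
    """
    group = dominant_archetypoid(target_alphas)
    analogues = [p for p in players if dominant_archetypoid(p["alphas"]) == group]
    forecast = {}
    for age in ages:
        values = [p["curve"][age] for p in analogues if age in p["curve"]]
        if not values:
            break  # no analogous player has data at this age, so stop here
        forecast[age] = (min(values), mean(values), max(values))
    return forecast


# Hypothetical players with two archetypoids, invented for illustration only.
players = [
    {"alphas": [0.8, 0.2], "curve": {25: 2.0, 26: 3.0}},
    {"alphas": [0.7, 0.3], "curve": {25: 4.0}},
    {"alphas": [0.1, 0.9], "curve": {25: -3.0, 26: -2.0, 27: -1.0}},
]
forecast = analogue_forecast([0.6, 0.4], players, [25, 26, 27])
```

In this toy example the forecast ends at age 26 because no analogous player has a value at age 27, which mirrors the way a prediction can stop at the last age covered by the set of analogous players.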
Once the computations are finished, an interactive web application shows plots of the past and future behavior of 2017-2018 NBA players under the age of 24. Two variables have been analyzed: on the one hand, BPM is recognized as the most suitable metric to carry out an analysis involving historical data; on the other hand, WS is another widely used advanced metric. Adding a second variable allows us to examine differences in career arcs for different aspects of skill. Any other variable could be used instead. The predictions for eight players have been presented and a comparison with CARMELO has been made. The implications of the archetypoid coefficients have also been interpreted.
Player forecasting systems are important as a means of summarizing the overall match performance of individual players. Any forecasting method is limited because some factors that influence future performance, such as injury risk or work ethic, are very difficult to quantify. However, coaches and experts can use these systems to review the performance of their own players as well as to track the performance levels of potential acquisitions. We hope that the approach presented here will provide valuable information about players' overall ability to support decision making. Sparse functional data are very common in sports, so it is reasonable to bring methods developed for this kind of data to the field of sports. This methodology can serve as a starting point for further efforts in the same direction. One of the referees suggested that we point out two situations that our analysis has not considered: (a) different amounts of playing time go into each averaged BPM and WS data point, which in mathematical terms is a case of unequal variances, also called heteroscedasticity; (b) the pattern of sparsity in the data is not random, since players retiring or leaving the NBA should have low BPM and WS values in these age intervals. We will consider both situations in future work. Following another referee's suggestion, we will also try to compare the players' forecasts using relative rankings in terms of their coefficients from a common archetypoid. The data and all R code are freely available at https://www.uv.es/vivigui/software for reproducibility and further exploration of the results.