CoDa-thesis at TSE (Toulouse, France)


Contribution to the statistical analysis of compositional data with an application to political economy


On the 14th of October 2019, Thi Huong An NGUYEN, a member of our association, defended her thesis, supervised by Dr Anne RUIZ-GAZEN and Dr Christine THOMAS-AGNAN. The jury members were Dr Denis ALLARD (INRA, Avignon), Dr Raja CHAKIR (INRA, Région de Paris), Dr Peter FILZMOSER (TUWien), Dr Michel LE BRETON (TSE), and Dr Josep Antoni MARTIN-FERNANDEZ (UdG).



The objective of this thesis is to investigate, from a mathematical point of view, the outcome of an election and the impact of socio-economic factors on the vote shares in a multiparty system. The vote shares of the 2015 departmental election in France form a vector called a composition: its components are positive and sum to one. Because of these constraints, the classical regression model cannot be applied directly to the vote shares.
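As an illustration of how such constraints are usually handled in compositional data analysis, here is a minimal sketch of the isometric log-ratio (ilr) transformation, which maps a composition of D parts to D - 1 unconstrained real coordinates on which a classical model can be fitted. The choice of ilr basis (pivot balances), the function names and the vote shares are assumptions made for this example; they are not taken from the thesis.

```python
import math


def closure(parts):
    """Rescale positive parts so that they sum to one."""
    total = sum(parts)
    return [p / total for p in parts]


def ilr(composition):
    """ilr coordinates with pivot balances (one possible basis, assumed here)."""
    logs = [math.log(p) for p in closure(composition)]
    coords = []
    for j in range(1, len(logs)):
        # Balance between the first j parts and part j + 1.
        mean_log = sum(logs[:j]) / j
        coords.append(math.sqrt(j / (j + 1)) * (mean_log - logs[j]))
    return coords


def ilr_inverse(coords):
    """Map ilr coordinates back to a composition summing to one."""
    n_parts = len(coords) + 1
    clr = [0.0] * n_parts
    for j, z in enumerate(coords, start=1):
        scale = math.sqrt(j / (j + 1))
        for k in range(j):
            clr[k] += z * scale / j
        clr[j] -= z * scale
    return closure([math.exp(v) for v in clr])


# Hypothetical vote shares for three aggregated party categories.
shares = [0.45, 0.35, 0.20]
coords = ilr(shares)          # two unconstrained real numbers
recovered = ilr_inverse(coords)
```

Predictions made in the coordinate space can be mapped back with ilr_inverse, so that predicted vote shares are always positive and sum to one.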

Chapter 2 presents a regression model in which the dependent variable is compositional and the set of explanatory variables contains both classical and compositional variables. The author analyzes the impact of socio-economic factors on the outcome of the election by predicting the vote shares as a function of either a classical or a compositional explanatory variable. Some graphical techniques are also presented to interpret the coefficients of the regression model. Furthermore, some authors have shown that electoral data often exhibit heavy-tailed behavior. The author therefore proposes to replace the Normal distribution with the Student distribution. There are two versions of the multivariate Student distribution: the uncorrelated Student (UT) distribution and the independent Student (IT) distribution.



In Chapter 3, the author presents a complete summary of the Student distributions, covering the univariate and multivariate Student distributions and the IT and UT distributions with fixed degrees of freedom. She proves that the maximum likelihood estimator of the covariance matrix in the UT model is asymptotically biased. She also provides an iteratively reweighted algorithm to compute the maximum likelihood estimator of the parameters of the IT model. A simulation study is provided, and Kolmogorov–Smirnov tests based on the Mahalanobis distance are carried out to select the right model. However, this approach does not work for the UT model, because under that model the data amount to a single realization of the multivariate distribution.
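To give an idea of what an iteratively reweighted algorithm looks like in this setting, here is a minimal sketch for the simplest case: independent observations from a multivariate Student distribution with fixed degrees of freedom, where only a location vector and a scatter matrix are estimated. This is the standard EM-type reweighting scheme, not necessarily the algorithm of the thesis, which deals with a regression model. The function name, the data and the value of nu are assumptions made for the example.

```python
import numpy as np


def fit_student_irls(X, nu, n_iter=200, tol=1e-10):
    """Location and scatter of a multivariate Student sample, nu fixed.

    Each observation gets the weight (nu + p) / (nu + d2), where d2 is its
    squared Mahalanobis distance, so remote points are downweighted.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mu = X.mean(axis=0)
    sigma = np.atleast_2d(np.cov(X, rowvar=False, bias=True))
    for _ in range(n_iter):
        diff = X - mu
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)
        w = (nu + p) / (nu + d2)
        # Weighted updates of the location and of the scatter matrix.
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        diff = X - mu_new
        sigma_new = (w[:, None] * diff).T @ diff / n
        change = max(np.abs(mu_new - mu).max(), np.abs(sigma_new - sigma).max())
        mu, sigma = mu_new, sigma_new
        if change < tol:
            break
    # w holds the weights used in the last update.
    return mu, sigma, w


# Made-up bivariate data: six points near the origin and one remote point.
X = np.array([[0.1, -0.2], [-0.3, 0.4], [0.5, 0.1], [-0.2, -0.5],
              [0.3, 0.3], [-0.4, 0.0], [8.0, 9.0]])
mu, sigma, w = fit_student_irls(X, nu=4.0)
```

The weights make the robustness of the Student model visible: the remote observation receives the smallest weight, whereas under the Normal model every observation would count equally.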

In Chapter 4, the multivariate Student (IT) regression model is applied to political economy data. The author then compares this model to the multivariate Normal regression model. She also applies the Kolmogorov–Smirnov tests based on the Mahalanobis distance proposed in Chapter 3 to select the better model.

Finally, the author investigates the assumption of statistical independence across territorial units, which may be questionable because of potential spatial autocorrelation in compositional data. She develops a simultaneous spatial autoregressive model for compositional data that allows for both spatial correlation and correlations across equations, estimated by two-stage and three-stage least squares methods. A simulation study illustrating these methods is presented. An application to a data set from the 2015 French departmental election is also shown.
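The two-stage least squares idea can be sketched on a deliberately simplified case: a single spatial autoregressive equation y = rho * W y + X beta + error, where the spatially lagged variable W y is endogenous and is instrumented by X, W X and W^2 X. The thesis treats a simultaneous system of such equations for compositional data; the single-equation setting, the instrument set, the ring-shaped neighbourhood matrix and all names and values below are assumptions made for the example.

```python
import numpy as np


def sar_2sls(y, X, W):
    """2SLS estimate of (rho, beta) in y = rho * W y + X beta + error."""
    Z = np.column_stack([W @ y, X])              # spatial lag, then covariates
    H = np.column_stack([X, W @ X, W @ W @ X])   # instruments
    # First stage: project the regressors on the instruments.
    Z_hat = H @ np.linalg.lstsq(H, Z, rcond=None)[0]
    # Second stage: regress y on the projected regressors.
    delta = np.linalg.lstsq(Z_hat, y, rcond=None)[0]
    return delta[0], delta[1:]


def ring_weights(n):
    """Row-normalized weights: each unit has its two ring neighbours."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = 0.5
        W[i, (i + 1) % n] = 0.5
    return W


# Noise-free simulated data, so the estimates should match the true values.
rng = np.random.default_rng(0)
n = 30
W = ring_weights(n)
X = rng.normal(size=(n, 2))
rho_true, beta_true = 0.4, np.array([1.0, -2.0])
y = np.linalg.solve(np.eye(n) - rho_true * W, X @ beta_true)
rho_hat, beta_hat = sar_2sls(y, X, W)
```

In a compositional application, y would typically be one log-ratio coordinate of the vote shares; a three-stage procedure would additionally exploit the correlation between the equations of the different coordinates.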

Work remains to be done on overcoming the problem of zeros in vote shares. This problem is already present for the French departmental elections at the canton level when the political parties are aggregated into three categories. It would be even more serious with the original political parties, without aggregation. Another direction is to consider the multivariate Student distribution for a spatial model.