What is CoDa?

Compositional Data (CoDa) were historically defined as random vectors with strictly positive components whose sum is constant (e.g., 100, one, or a million). More recently, the term has come to cover all vectors that represent parts of a whole and carry only relative information, thus including not only parts per unit or percentages, but also molar compositions.
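
As a minimal illustration of "only relative information", the Python sketch below rescales a vector of positive parts to a constant sum, an operation usually called closure. The function name `closure` and the sample values are assumptions made for this example and are not taken from any particular package.

```python
def closure(parts, total=1.0):
    """Rescale strictly positive parts so that they sum to `total`."""
    if any(p <= 0 for p in parts):
        raise ValueError("all parts must be strictly positive")
    s = sum(parts)
    return [total * p / s for p in parts]


# The same hypothetical sample reported in two different units.
grams = [12.0, 30.0, 18.0]
milligrams = [1000 * g for g in grams]

# Both vectors describe the same composition once closed to 100.
print(closure(grams, total=100))       # [20.0, 50.0, 30.0]
print(closure(milligrams, total=100))  # [20.0, 50.0, 30.0]
```

The two inputs differ by a factor of 1000, yet they yield the same percentages: the absolute size of the vector carries no compositional information, only the ratios between parts do.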


Typical examples in different fields are: geology (geochemical elements), economics (income/expenditure distribution), medicine (body composition: fat, bone, lean), questionnaire surveys (ipsative data), food industry (food composition: fat, sugar, etc.), chemistry (chemical composition), ecology (abundance of different species), paleontology (foraminifera taxa), agriculture (nutrient balance ionomics), sociology (time-use surveys), environmental sciences (soil contamination), and genetics (genotype frequency). This type of data appears in most fields of application, and the importance of consistent statistical methods for it should not be underestimated. Although awareness of the problems associated with such data was kept alive mainly by researchers in the geosciences, in particular by members of the International Association for Mathematical Geosciences, awareness of the need for coherent methods is growing in the environmental and biological sciences.



This active research topic now has a broad impact in these fields. However, it took a long time to find out how to perform a proper statistical analysis of this type of data, i.e. to solve the problem of spurious correlation, as Karl Pearson named it back in 1897, or the closure problem, as Felix Chayes called it in the 1960s. Because standard statistical techniques lose their applicability and classical interpretation when applied to compositional data, new techniques were needed. No theoretically sound solution was proposed until the 1980s, when John Aitchison set forth a consistent theory based on log-ratios. Later developments have shown that the mathematical foundation of a proper statistical analysis for this type of data is the definition of a specific geometry on the simplex (the sample space of compositional data). Based on this geometry, it is possible to rigorously develop any statistical analysis (cluster analysis, discriminant analysis, factor analysis, and regression models, to mention just a few).
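
To make the log-ratio idea concrete, the following Python sketch computes the centered log-ratio (clr) transform, one of the log-ratio transformations associated with Aitchison's approach, and the resulting distance between two compositions. It is a simplified illustration rather than a reference implementation; the function names and sample values are assumptions made for this example.

```python
import math


def clr(parts):
    """Centered log-ratio: log of each part over the geometric mean."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [v - mean_log for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr coordinates of x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


# Hypothetical three-part compositions (e.g. fat, bone, lean).
x = [0.20, 0.50, 0.30]
y = [0.10, 0.60, 0.30]

# Rescaling either vector leaves the distance unchanged, because
# log-ratios depend only on the ratios between parts.
d_original = aitchison_distance(x, y)
d_rescaled = aitchison_distance([100 * v for v in x], [7 * v for v in y])
print(abs(d_original - d_rescaled) < 1e-9)  # True
```

The clr coordinates always sum to zero, and the distance does not depend on the constant to which each composition was closed. This invariance is what a geometry on the simplex is expected to provide, and it is what distances computed directly on raw percentages lack.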


Practitioners interested in CoDa can find a forum for the exchange of information, material and ideas on the website www.compositionaldata.com. The free software Compositional Data Package (CoDaPack) currently implements the most elementary compositional statistical methods. CoDaPack is aimed at users from the applied sciences who do not have an extensive background in using various computer packages. R packages for CoDa can also be found on the R project website: compositions, robCompositions, and zCompositions.