Package 'descriptio'

Title: Descriptive Statistical Analysis
Description: Description of statistical associations between variables : measures of local and global association between variables (phi, Cramér V, correlations, eta-squared, Goodman and Kruskal tau, permutation tests, etc.), multiple graphical representations of the associations between variables (using 'ggplot2') and weighted statistics.
Authors: Nicolas Robette [aut, cre]
Maintainer: Nicolas Robette <[email protected]>
License: GPL (>= 2)
Version: 1.4
Built: 2025-02-06 05:13:04 UTC
Source: https://github.com/nicolas-robette/descriptio

Help Index


Measures the association between a categorical variable and a continuous variable

Description

Measures the association between a categorical variable and a continuous variable

Usage

assoc.catcont(x, y, weights = NULL,
              na.rm.cat = FALSE, na.value.cat = "NAs", na.rm.cont = FALSE,
              nperm = NULL, distrib = "asympt", digits = 3)

Arguments

x

the categorical variable (must be a factor)

y

the continuous variable (must be a numeric vector)

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm.cat

logical, indicating whether NA values in the categorical variable (i.e. x) should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the categorical variable (see na.value.cat argument).

na.value.cat

character. Name of the level for NA category. Default is "NAs". Only used if na.rm.cat = FALSE.

na.rm.cont

logical, indicating whether NA values in the continuous variable (i.e. y) should be silently removed before the computation proceeds. Default is FALSE.

nperm

numeric. Number of permutations for the permutation test of independence. If NULL (default), no permutation test is performed.

distrib

the null distribution of the permutation test of independence can be approximated by its asymptotic distribution ("asympt", default) or via Monte Carlo resampling ("approx").

digits

integer. The number of digits (default is 3).

Value

A list with the following elements :

summary

summary statistics (mean, median, etc.) of the continuous variable for each level of the categorical variable

eta.squared

eta-squared between the two variables

permutation.pvalue

p-value from a permutation (i.e. non-parametric) test of independence

cor

point biserial correlation between the two variables, for each level of the categorical variable

cor.perm.pval

permutation p-value of the correlation between the two variables, for each level of the categorical variable

test.values

test-values as proposed by Lebart et al. (1984)

test.values.pval

p-values corresponding to the test-values

Author(s)

Nicolas Robette

References

Rakotomalala R., 'Comprendre la taille d'effet (effect size)', [http://eric.univ-lyon2.fr/~ricco/cours/slides/effect_size.pdf]

Lebart L., Morineau A. and Warwick K., 1984, *Multivariate Descriptive Statistical Analysis*, John Wiley and sons, New-York.

See Also

assoc.twocat, assoc.twocont, assoc.yx, condesc, catdesc, darma

Examples

data(Movies)
with(Movies, assoc.catcont(Country, Budget, nperm = 10))
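
For intuition about the eta.squared and cor elements, the unweighted quantities can be sketched in base R. This is only an illustration on hypothetical toy data (grp and val are not part of the package), and it does not claim to reproduce the internal computations of assoc.catcont (weights and NA handling are ignored).

```r
# Hypothetical toy data, not taken from the package
grp <- factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c"))
val <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Eta-squared: between-group sum of squares divided by total sum of squares
group_means <- tapply(val, grp, mean)
group_sizes <- tapply(val, grp, length)
ss_between <- sum(group_sizes * (group_means - mean(val))^2)
ss_total <- sum((val - mean(val))^2)
eta_squared <- ss_between / ss_total

# Point-biserial correlation: Pearson correlation between the continuous
# variable and the 0/1 indicator of each level of the factor
point_biserial <- sapply(levels(grp), function(l) cor(as.numeric(grp == l), val))

eta_squared
point_biserial
```

In the unweighted case, eta-squared coincides with the R-squared of a one-way linear model of val on grp.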

Measures the groupwise association between a categorical variable and a continuous variable

Description

Measures the association between a categorical variable and a continuous variable, for each category of a group variable

Usage

assoc.catcont.by(x, y, by, weights = NULL,
                 na.rm.cat = FALSE, na.value.cat = "NAs", na.rm.cont = FALSE,
                 nperm = NULL, distrib = "asympt", digits = 3)

Arguments

x

factor : the categorical variable

y

numeric vector : the continuous variable

by

factor : the group variable

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm.cat

logical, indicating whether NA values in the categorical variable (i.e. x) should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the categorical variable (see na.value.cat argument).

na.value.cat

character. Name of the level for NA category. Default is "NAs". Only used if na.rm.cat = FALSE.

na.rm.cont

logical, indicating whether NA values in the continuous variable (i.e. y) should be silently removed before the computation proceeds. Default is FALSE.

nperm

numeric. Number of permutations for the permutation test of independence. If NULL (default), no permutation test is performed.

distrib

the null distribution of the permutation test of independence can be approximated by its asymptotic distribution ("asympt", default) or via Monte Carlo resampling ("approx").

digits

integer. The number of digits (default is 3).

Value

A list of items, one for each category of the group variable. Each item is a list with the following elements :

summary

summary statistics (mean, median, etc.) of the continuous variable for each level of the categorical variable

eta.squared

eta-squared between the two variables

permutation.pvalue

p-value from a permutation (i.e. non-parametric) test of independence

cor

point biserial correlation between the two variables, for each level of the categorical variable

cor.perm.pval

permutation p-value of the correlation between the two variables, for each level of the categorical variable

test.values

test-values as proposed by Lebart et al. (1984)

test.values.pval

p-values corresponding to the test-values

Author(s)

Nicolas Robette

References

Rakotomalala R., 'Comprendre la taille d'effet (effect size)', [http://eric.univ-lyon2.fr/~ricco/cours/slides/effect_size.pdf]

Lebart L., Morineau A. and Warwick K., 1984, *Multivariate Descriptive Statistical Analysis*, John Wiley and sons, New-York.

See Also

assoc.catcont, assoc.twocat, assoc.twocont, assoc.yx, condesc, catdesc, darma

Examples

data(Movies)
with(Movies, assoc.catcont.by(Country, Budget, ArtHouse, nperm = 10))
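
The groupwise logic (one result per category of the group variable) can be sketched in base R with split(). The data and the helper eta_sq below are hypothetical and only illustrate the idea for unweighted eta-squared; they are not the package's implementation.

```r
# Hypothetical toy data, not taken from the package
grp    <- factor(c("a", "a", "b", "b", "a", "a", "b", "b"))
val    <- c(1, 2, 3, 4, 2, 2, 9, 1)
by_var <- factor(c("g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2"))

# Hypothetical helper: unweighted eta-squared as the R-squared of a one-way model
eta_sq <- function(x, y) summary(lm(y ~ x))$r.squared

# Split the row indices by group, then compute the measure within each group
idx <- split(seq_along(val), by_var)
eta_by_group <- sapply(idx, function(i) eta_sq(grp[i], val[i]))

eta_by_group
```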

Cross-tabulation and measures of association between two categorical variables

Description

Cross-tabulation and measures of association between two categorical variables

Usage

assoc.twocat(x, y, weights = NULL, na.rm = FALSE, na.value = "NAs",
             nperm = NULL, distrib = "asympt")

Arguments

x

the first categorical variable (must be a factor)

y

the second categorical variable (must be a factor)

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

nperm

numeric. Number of permutations for the permutation test of independence. If NULL (default), no permutation test is performed.

distrib

the null distribution of the permutation test of independence can be approximated by its asymptotic distribution ("asympt", default) or via Monte Carlo resampling ("approx").

Value

A list of lists with the following elements :

tables list :

freq

cross-tabulation frequencies

prop

percentages

rprop

row percentages

cprop

column percentages

expected

expected values

global list :

chi.squared

chi-squared value

cramer.v

Cramer's V between the two variables

permutation.pvalue

p-value from a permutation (i.e. non-parametric) test of independence

global.pem

global PEM

GK.tau.xy

Goodman and Kruskal tau (forward association, i.e. x is the predictor and y is the response)

GK.tau.yx

Goodman and Kruskal tau (backward association, i.e. y is the predictor and x is the response)

local list :

std.residuals

the table of standardized (i.e. Pearson) residuals.

adj.residuals

the table of adjusted standardized residuals.

adj.res.pval

the table of p-values of adjusted standardized residuals.

odds.ratios

the table of odds ratios.

local.pem

the table of local PEM

phi

the table of the phi coefficients for each pair of levels

phi.perm.pval

the table of permutation p-values for each pair of levels

gather : a data frame gathering information, with one row per cell of the cross-tabulation.

Note

The adjusted standardized residuals are strictly equivalent to test-values for nominal variables as proposed by Lebart et al. (1984).

Author(s)

Nicolas Robette

References

Agresti, A. (2007). An Introduction to Categorical Data Analysis, 2nd ed. New York: John Wiley & Sons.

Rakotomalala R., Comprendre la taille d'effet (effect size), http://eric.univ-lyon2.fr/~ricco/cours/slides/effect_size.pdf

Lebart L., Morineau A. and Warwick K., 1984, *Multivariate Descriptive Statistical Analysis*, John Wiley and sons, New-York.

See Also

assoc.catcont, assoc.twocont, assoc.yx, condesc, catdesc, darma

Examples

data(Movies)
assoc.twocat(Movies$Country, Movies$ArtHouse, nperm=100)
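
Several of the unweighted quantities listed under Value can be recovered in base R from chisq.test() and prop.table(). The sketch below uses a hypothetical 2x2 table and is only meant to make the definitions concrete; it is not the package's code. Note the naming used by chisq.test(): its residuals element holds the Pearson residuals (called standardized residuals on this page), and its stdres element holds what this page calls adjusted standardized residuals.

```r
# Hypothetical contingency table, not taken from the package
tab <- matrix(c(30, 10,
                10, 30), nrow = 2, byrow = TRUE,
              dimnames = list(x = c("x1", "x2"), y = c("y1", "y2")))

# Chi-squared test without continuity correction
chi <- suppressWarnings(chisq.test(tab, correct = FALSE))
n <- sum(tab)

# Cramer's V: sqrt(chi-squared / (n * (min(rows, columns) - 1)))
cramer_v <- sqrt(unname(chi$statistic) / (n * (min(dim(tab)) - 1)))

expected      <- chi$expected    # expected counts under independence
std_residuals <- chi$residuals   # Pearson: (observed - expected) / sqrt(expected)
adj_residuals <- chi$stdres      # Pearson residuals adjusted for the margins

# Row and column percentages
row_pct <- 100 * prop.table(tab, margin = 1)
col_pct <- 100 * prop.table(tab, margin = 2)

cramer_v
adj_residuals
```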

Groupwise cross-tabulation and measures of association between two categorical variables

Description

Cross-tabulation and measures of association between two categorical variables, for each category of a group variable

Usage

assoc.twocat.by(x, y, by, weights = NULL, na.rm = FALSE, na.value = "NAs",
                nperm = NULL, distrib = "asympt")

Arguments

x

factor : the first categorical variable

y

factor : the second categorical variable

by

factor : the group variable

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

nperm

numeric. Number of permutations for the permutation test of independence. If NULL (default), no permutation test is performed.

distrib

the null distribution of the permutation test of independence can be approximated by its asymptotic distribution ("asympt", default) or via Monte Carlo resampling ("approx").

Value

A list of items, one for each category of the group variable. Each item is a list of lists with the following elements :

tables list :

freq

cross-tabulation frequencies

prop

percentages

rprop

row percentages

cprop

column percentages

expected

expected values

global list :

chi.squared

chi-squared value

cramer.v

Cramer's V between the two variables

permutation.pvalue

p-value from a permutation (i.e. non-parametric) test of independence

global.pem

global PEM

GK.tau.xy

Goodman and Kruskal tau (forward association, i.e. x is the predictor and y is the response)

GK.tau.yx

Goodman and Kruskal tau (backward association, i.e. y is the predictor and x is the response)

local list :

std.residuals

the table of standardized (i.e. Pearson) residuals.

adj.residuals

the table of adjusted standardized residuals.

adj.res.pval

the table of p-values of adjusted standardized residuals.

odds.ratios

the table of odds ratios.

local.pem

the table of local PEM

phi

the table of the phi coefficients for each pair of levels

phi.perm.pval

the table of permutation p-values for each pair of levels

gather : a data frame gathering information, with one row per cell of the cross-tabulation.

Note

The adjusted standardized residuals are strictly equivalent to test-values for nominal variables as proposed by Lebart et al. (1984).

Author(s)

Nicolas Robette

References

Agresti, A. (2007). An Introduction to Categorical Data Analysis, 2nd ed. New York: John Wiley & Sons.

Rakotomalala R., Comprendre la taille d'effet (effect size), http://eric.univ-lyon2.fr/~ricco/cours/slides/effect_size.pdf

Lebart L., Morineau A. and Warwick K., 1984, *Multivariate Descriptive Statistical Analysis*, John Wiley and sons, New-York.

See Also

assoc.twocat, assoc.catcont, assoc.twocont, assoc.yx, condesc, catdesc, darma

Examples

data(Movies)
assoc.twocat.by(Movies$Country, Movies$ArtHouse, Movies$Festival, nperm=100)
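
The asymmetry between GK.tau.xy and GK.tau.yx can be illustrated with the textbook definition of Goodman and Kruskal's tau. The function gk_tau below is a hypothetical helper written for this illustration (unweighted, and assuming no empty row in the table); it is not the package's implementation.

```r
# Hypothetical helper: Goodman and Kruskal tau with the row variable as
# predictor and the column variable as response
gk_tau <- function(tab) {
  p <- tab / sum(tab)
  row_p <- rowSums(p)
  col_p <- colSums(p)
  # Variation of the response alone (Gini concentration)
  v_resp <- 1 - sum(col_p^2)
  # Expected variation of the response once the predictor is known
  v_resp_given_pred <- 1 - sum(sweep(p^2, 1, row_p, "/"))
  (v_resp - v_resp_given_pred) / v_resp
}

# Hypothetical table: x in rows (2 levels), y in columns (3 levels)
tab <- matrix(c(10, 0, 0,
                 0, 5, 5), nrow = 2, byrow = TRUE)

tau_xy <- gk_tau(tab)      # x predicts y
tau_yx <- gk_tau(t(tab))   # y predicts x

c(tau_xy = tau_xy, tau_yx = tau_yx)
```

In this toy table, knowing y determines x exactly, whereas knowing x only partly determines y, so the two values differ.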

Measures the association between two continuous variables

Description

Measures the association between two continuous variables with Pearson, Spearman and Kendall correlations.

Usage

assoc.twocont(x, y, weights = NULL, na.rm = FALSE,
              nperm = NULL, distrib = "asympt")

Arguments

x

a continuous variable (must be a numeric vector)

y

a continuous variable (must be a numeric vector)

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. Default is FALSE.

nperm

numeric. Number of permutations for the permutation test of independence. If NULL (default), no permutation test is performed.

distrib

the null distribution of the permutation test of independence can be approximated by its asymptotic distribution ("asympt", default) or via Monte Carlo resampling ("approx").

Value

A data frame with Pearson, Spearman and Kendall correlations. The correlation value is in the first row and the p-value from a permutation (i.e. non-parametric) test of independence is in the second row.

Author(s)

Nicolas Robette

See Also

assoc.twocat, assoc.catcont, assoc.yx, condesc, catdesc, darma

Examples

## Hollander & Wolfe (1973), p. 187f.
## Assessment of tuna quality.  We compare the Hunter L measure of
##  lightness to the averages of consumer panel scores (recoded as
##  integer values from 1 to 6 and averaged over 80 such values) in
##  9 lots of canned tuna.
x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
assoc.twocont(x,y,nperm=100)
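
The idea of a Monte Carlo permutation test for a correlation can be sketched in base R on the same data. This is a minimal illustration (two-sided, unweighted, with a fixed seed) and not necessarily the procedure used internally by assoc.twocont; the names observed and perm_pvalue are chosen here for the example.

```r
x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)

methods <- c("pearson", "spearman", "kendall")

# Observed correlations
observed <- sapply(methods, function(m) cor(x, y, method = m))

# Under independence, shuffling y should not matter: compare the observed
# correlation with correlations obtained on randomly permuted y
set.seed(1)
nperm <- 999
perm_pvalue <- sapply(methods, function(m) {
  perm <- replicate(nperm, cor(x, sample(y), method = m))
  (sum(abs(perm) >= abs(observed[[m]])) + 1) / (nperm + 1)
})

# Same layout as described under Value: correlations, then p-values
rbind(value = observed, permutation.pvalue = perm_pvalue)
```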

Measures the groupwise association between two continuous variables

Description

Measures the association between two continuous variables with Pearson, Spearman and Kendall correlations, for each category of a group variable.

Usage

assoc.twocont.by(x, y, by, weights = NULL, na.rm = FALSE,
                 nperm = NULL, distrib = "asympt")

Arguments

x

numeric vector : a continuous variable

y

numeric vector : a continuous variable

by

factor : the group variable

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. Default is FALSE.

nperm

numeric. Number of permutations for the permutation test of independence. If NULL (default), no permutation test is performed.

distrib

the null distribution of the permutation test of independence can be approximated by its asymptotic distribution ("asympt", default) or via Monte Carlo resampling ("approx").

Value

A list of items, one for each category of the group variable. Each item is a data frame with Pearson, Spearman and Kendall correlations. The correlation value is in the first row and the p-value from a permutation (i.e. non-parametric) test of independence is in the second row.

Author(s)

Nicolas Robette

See Also

assoc.twocont, assoc.twocat, assoc.catcont, assoc.yx, condesc, catdesc, darma

Examples

## Hollander & Wolfe (1973), p. 187f.
## Assessment of tuna quality.  We compare the Hunter L measure of
##  lightness to the averages of consumer panel scores (recoded as
##  integer values from 1 to 6 and averaged over 80 such values) in
##  9 lots of canned tuna.
x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
group <- factor(c("A","B","C","C","B","A","A","C","B"))
assoc.twocont.by(x,y,group,nperm=100)
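
The per-group correlations alone (without permutation tests or weights) can be sketched in base R by splitting the data on the group variable. This is an illustration of the groupwise principle, not the package's code; the name by_group is chosen here for the example.

```r
x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
group <- factor(c("A", "B", "C", "C", "B", "A", "A", "C", "B"))

# One vector of three correlations per category of the group variable
by_group <- lapply(split(data.frame(x, y), group), function(d) {
  sapply(c("pearson", "spearman", "kendall"),
         function(m) cor(d$x, d$y, method = m))
})

by_group
```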

Bivariate association measures between pairs of variables.

Description

Computes bivariate association measures between every pair of variables from a data frame.

Usage

assoc.xx(x, weights = NULL, correlation = "kendall",
  na.rm.cat = FALSE, na.value.cat = "NAs", na.rm.cont = FALSE,
  nperm = NULL, distrib = "asympt", dec = c(3,3))

Arguments

x

the data frame of variables

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

correlation

character. The type of correlation measure to use between two continuous variables : "pearson", "spearman" or "kendall" (default).

na.rm.cat

logical, indicating whether NA values in the categorical variables should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the categorical variables (see na.value.cat argument).

na.value.cat

character. Name of the level for NA category. Default is "NAs". Only used if na.rm.cat = FALSE.

na.rm.cont

logical, indicating whether NA values in the continuous variables should be silently removed before the computation proceeds. Default is FALSE.

nperm

numeric. Number of permutations for the permutation test of independence. If NULL (default), no permutation test is performed.

distrib

the null distribution of permutation test of independence can be approximated by its asymptotic distribution ("asympt", default) or via Monte Carlo resampling ("approx").

dec

vector of 2 integers for the number of decimals. The first value is for association measures, the second for permutation p-values. Default is c(3,3).

Details

The function computes an association measure : Pearson's, Spearman's or Kendall's correlation for pairs of numeric variables, Cramer's V for pairs of factors and eta-squared for numeric-factor pairs. It can also compute the p-value of a permutation test of association for each pair of variables.

Value

A table with the following elements :

measure

: name of the association measure

association

: value of the association measure

permutation.pvalue

: p-value from the permutation test

Author(s)

Nicolas Robette

See Also

darma, assoc.twocat, assoc.twocont, assoc.catcont, condesc, catdesc, assoc.yx

Examples

data(iris)
iris2 <- iris
iris2$Species <- factor(iris$Species == "versicolor")
assoc.xx(iris2, nperm = 10)
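
The choice of measure by variable type described in Details can be sketched in base R as follows. The helpers cramer_v, eta_squared and pair_measure, as well as the labels in the measure column, are hypothetical names introduced for this illustration; the sketch is unweighted, skips permutation tests, and is not the package's implementation.

```r
# Hypothetical helper: Cramer's V between two factors
cramer_v <- function(a, b) {
  tab <- table(a, b)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  sqrt(unname(chi2) / (sum(tab) * (min(dim(tab)) - 1)))
}

# Hypothetical helper: eta-squared between a factor f and a numeric vector v
eta_squared <- function(f, v) summary(lm(v ~ f))$r.squared

# Pick the measure according to the types of the two variables
pair_measure <- function(a, b, correlation = "kendall") {
  if (is.numeric(a) && is.numeric(b)) {
    data.frame(measure = correlation,
               association = cor(a, b, method = correlation))
  } else if (is.factor(a) && is.factor(b)) {
    data.frame(measure = "cramer", association = cramer_v(a, b))
  } else if (is.factor(a)) {
    data.frame(measure = "eta2", association = eta_squared(a, b))
  } else {
    data.frame(measure = "eta2", association = eta_squared(b, a))
  }
}

iris2 <- iris
iris2$Species <- factor(iris$Species == "versicolor")

# Apply the measure to every pair of columns
pairs_idx <- combn(names(iris2), 2, simplify = FALSE)
res <- do.call(rbind, lapply(pairs_idx, function(p) {
  data.frame(variable1 = p[1], variable2 = p[2],
             pair_measure(iris2[[p[1]]], iris2[[p[2]]]))
}))

res
```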

Bivariate association measures between a response and predictor variables.

Description

Computes bivariate association measures between a response and predictor variables (and, optionally, between every pair of predictor variables).

Usage

assoc.yx(y, x, weights = NULL, xx = TRUE, correlation = "kendall",
  na.rm.cat = FALSE, na.value.cat = "NAs", na.rm.cont = FALSE,
  nperm = NULL, distrib = "asympt", dec = c(3,3))

Arguments

y

the response variable

x

the predictor variables

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

xx

logical. Whether the association measures should also be computed for pairs of predictor variables. Default is TRUE. With a lot of predictors, consider setting xx to FALSE (for reasons of computation time).

correlation

character. The type of correlation measure to use between two continuous variables : "pearson", "spearman" or "kendall" (default).

na.rm.cat

logical, indicating whether NA values in the categorical variables should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the categorical variables (see na.value.cat argument).

na.value.cat

character. Name of the level for NA category. Default is "NAs". Only used if na.rm.cat = FALSE.

na.rm.cont

logical, indicating whether NA values in the continuous variables should be silently removed before the computation proceeds. Default is FALSE.

nperm

numeric. Number of permutations for the permutation test of independence. If NULL (default), no permutation test is performed.

distrib

the null distribution of permutation test of independence can be approximated by its asymptotic distribution ("asympt", default) or via Monte Carlo resampling ("approx").

dec

vector of 2 integers for the number of decimals. The first value is for association measures, the second for permutation p-values. Default is c(3,3).

Details

The function computes an association measure : Pearson's, Spearman's or Kendall's correlation for pairs of numeric variables, Cramer's V for pairs of factors and eta-squared for numeric-factor pairs. It can also compute the p-value of a permutation test of association for each pair of variables.

Value

A list of the following items :

YX

: a table with the association measures between the response and predictor variables

XX

: a table with the association measures between every pair of predictor variables

In each table :

measure

: name of the association measure

association

: value of the association measure

permutation.pvalue

: p-value from the permutation test

Author(s)

Nicolas Robette

See Also

darma, assoc.twocat, assoc.twocont, assoc.catcont, condesc, catdesc

Examples

data(iris)
iris2 <- iris
iris2$Species <- factor(iris$Species == "versicolor")
assoc.yx(iris2$Species, iris2[, 1:4], nperm = 10)

Measures the association between a categorical variable and some continuous and/or categorical variables

Description

Measures the association between a categorical variable and some continuous and/or categorical variables

Usage

catdesc(y, x, weights = NULL, 
na.rm.cat = FALSE, na.value.cat = "NAs", na.rm.cont = FALSE,
measure = "phi", limit = NULL, correlation = "kendall", robust = TRUE, 
nperm = NULL, distrib = "asympt", digits = 2)

Arguments

y

the categorical variable to describe (must be a factor)

x

a data frame with continuous and/or categorical variables

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm.cat

logical, indicating whether NA values in the categorical variables should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the categorical variables (see na.value.cat argument).

na.value.cat

character. Name of the level for NA category. Default is "NAs". Only used if na.rm.cat = FALSE.

na.rm.cont

logical, indicating whether NA values in the continuous variables should be silently removed before the computation proceeds. Default is FALSE.

measure

character. The measure of local association between categories of categorical variables. Can be "phi" for phi coefficient (default), "or" for odds ratios, "std.residuals" for standardized (i.e. Pearson) residuals, "adj.residuals" for adjusted standardized residuals or "pem" for local percentages of maximum deviation from independence.

limit

for the relationship between y and a categorical variable, only associations greater than or equal to limit will be displayed. If NULL (default), they are all displayed.

correlation

character. The type of correlation measure to use between two continuous variables : "pearson", "spearman" or "kendall" (default).

robust

logical. If TRUE (default), median and mad are used instead of mean and standard deviation.

nperm

numeric. Number of permutations for the permutation test of independence. If NULL (default), no permutation test is performed.

distrib

the null distribution of permutation test of independence can be approximated by its asymptotic distribution ("asympt", default) or via Monte Carlo resampling ("approx").

digits

numeric. Number of digits for mean, median, standard deviation and mad. Default is 2.

Value

A list of the following items :

variables

associations between y and the variables in x

bylevel

a list with one element for each level of y

Each element in bylevel has the following items :

categories

a data frame with categorical variables from x and local associations

continuous.var

a data frame with continuous variables from x and associations measured by correlation coefficients

Note

If nperm is not NULL, permutation tests of independence are computed and the p-values from these tests are provided.

Author(s)

Nicolas Robette

References

Rakotomalala R., 'Comprendre la taille d'effet (effect size)', [http://eric.univ-lyon2.fr/~ricco/cours/slides/effect_size.pdf]

See Also

catdes, condesc, assoc.yx, darma

Examples

data(Movies)
catdesc(Movies$ArtHouse, Movies[,c("Budget","Genre","Country")])
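
The default local measure (measure = "phi") relates one level of y to one level of a categorical variable. In the unweighted case, the phi coefficient of a pair of levels equals the Pearson correlation between the two 0/1 indicators, which can be sketched in base R. The helper phi_table and the toy data are hypothetical and only illustrate the definition.

```r
# Hypothetical helper: phi coefficient for every pair of levels, computed as
# the Pearson correlation between the 0/1 indicators of the two levels
phi_table <- function(x, y) {
  sapply(levels(y), function(ly) {
    sapply(levels(x), function(lx) cor(as.numeric(x == lx), as.numeric(y == ly)))
  })
}

# Hypothetical toy data, not taken from the package
x <- factor(c("a", "a", "a", "b", "b", "b", "c", "c"))
y <- factor(c("u", "u", "v", "v", "v", "u", "u", "v"))

phi <- phi_table(x, y)   # rows: levels of x, columns: levels of y
phi
```

Positive values indicate that two levels occur together more often than expected under independence, negative values that they occur together less often.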

Bivariate statistics between a categorical variable and a set of variables

Description

Computes bivariate statistics for a set of variables according to the subgroups of observations defined by a categorical variable.

Usage

cattab(x, y, weights = NULL, percent = "column",
       robust = TRUE, show.n = TRUE, show.asso = TRUE,
       digits = c(1,1), na.rm = TRUE, na.value = "NAs")

Arguments

x

data frame. The variables which are described in rows. They can be numerical or factors.

y

factor. The categorical variable which defines subgroups of observations described in columns.

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

percent

character. Whether to compute row percentages ("row") or column percentages ("column", default).

robust

logical. Whether to use medians instead of means. Default is TRUE.

show.n

logical. Whether to display frequencies (between brackets) in addition to the percentages. Default is TRUE.

show.asso

logical. Whether to add a column with measures of global association (Cramer's V and eta-squared). Default is TRUE.

digits

vector of 2 integers. The first value sets the number of digits for percentages, the second one sets the number of digits for medians and means. Default is c(1,1). If NULL, the results are not rounded.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. Default is TRUE. If FALSE, an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

Details

The function uses the gtsummary package to build the table of statistics, and then the gt package to finalize the layout. Weights are handled silently with the survey package.

The function is also compatible with the attribute labels assigned with the labelled package : these labels are displayed automatically.

Value

An object of class gt_tbl.

Note

This function is quite similar to profiles, but displays the results in a fancier way.

Author(s)

Nicolas Robette

See Also

catdesc, assoc.yx, darma, assoc.twocat, assoc.twocat.by, profiles

Examples

data(Movies)
cattab(x = Movies[, c("Genre", "ArtHouse", "Critics", "BoxOffice")],
       y = Movies$Country)

Measures the association between a continuous variable and some continuous and/or categorical variables

Description

Measures the association between a continuous variable and some continuous and/or categorical variables

Usage

condesc(y, x, weights = NULL, 
na.rm.cat = FALSE, na.value.cat = "NAs", na.rm.cont = FALSE,
limit = NULL, correlation = "kendall", robust = TRUE, 
nperm = NULL, distrib = "asympt", digits = 2)

Arguments

y

the continuous variable to describe

x

a data frame with continuous and/or categorical variables

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm.cat

logical, indicating whether NA values in the categorical variables should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the categorical variables (see na.value.cat argument).

na.value.cat

character. Name of the level for NA category. Default is "NAs". Only used if na.rm.cat = FALSE.

na.rm.cont

logical, indicating whether NA values in the continuous variables should be silently removed before the computation proceeds. Default is FALSE.

limit

for the relationship between y and a category of a categorical variable, only associations (point-biserial correlations) greater than or equal to limit will be displayed. If NULL (default), they are all displayed.

correlation

character. The type of correlation measure to use between two continuous variables : "pearson", "spearman" or "kendall" (default).

robust

logical. If TRUE (default), median and mad are used instead of mean and standard deviation.

nperm

numeric. Number of permutations for the permutation test of independence. If NULL (default), no permutation test is performed.

distrib

the null distribution of permutation test of independence can be approximated by its asymptotic distribution ("asympt", default) or via Monte Carlo resampling ("approx").

digits

numeric. Number of digits for mean, median, standard deviation and mad. Default is 2.

Value

A list of the following items :

variables

associations between y and the variables in x

categories

a data frame with categorical variables from x and associations measured by point biserial correlation.

Note

If nperm is not NULL, permutation tests of independence are computed and the p-values from these tests are provided.

Author(s)

Nicolas Robette

References

Rakotomalala R., 'Comprendre la taille d'effet (effect size)', [http://eric.univ-lyon2.fr/~ricco/cours/slides/effect_size.pdf]

See Also

condes, catdesc, assoc.yx, darma

Examples

data(Movies)
condesc(Movies$BoxOffice, Movies[,c("Budget","Genre","Country")])
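
The effect of the robust argument (median and mad versus mean and standard deviation) can be illustrated in base R on hypothetical data containing one outlying value. The sketch is unweighted and uses stats::mad() with its default scaling constant, which is an assumption and may differ from what the package computes.

```r
# Hypothetical toy data: the value 9 is an outlier within level "a"
grp <- factor(c("a", "a", "a", "b", "b", "b"))
val <- c(1, 2, 9, 4, 5, 6)

by_level <- split(val, grp)

# Robust summaries, little affected by the outlier
robust_summary <- data.frame(median = sapply(by_level, median),
                             mad    = sapply(by_level, mad))

# Classical summaries, pulled by the outlier
classic_summary <- data.frame(mean = sapply(by_level, mean),
                              sd   = sapply(by_level, sd))

robust_summary
classic_summary
```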

Bivariate statistics between a continuous variable and a set of variables

Description

Computes bivariate statistics between a continuous variable and a set of variables, possibly according to a strata variable.

Usage

contab(x, y, strata = NULL, weights = NULL, robust = TRUE,
       digits = c(1,3), na.rm = TRUE, na.value = "NAs")

Arguments

x

data frame. The variables which are described in rows. They can be numerical or factors.

y

numeric vector. The continuous variable to describe.

strata

optional categorical variable to stratify the table by column. Default is NULL, which means no strata.

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

robust

logical. Whether to use medians (and mads) instead of means (and standard deviations). Default is TRUE.

digits

vector of 2 integers. The first value sets the number of digits for medians, mads, means and standard deviations (categorical variables). The second one sets the number of digits for slopes (continuous variables). Default is c(1,3). If NULL, the results are not rounded.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. Default is TRUE. If FALSE, an additional level is added to the categorical variables with NA values (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

Details

For categorical variables in x, the function computes :

- column 1 : the median and the mad of y for each level of the variable

- column 2 : the global association between the variable and y, measured by the eta-squared

For continuous variables in x, it computes :

- column 1 : the slope of the linear regression of y according to the variable

- column 2 : the global association between the variable and y, measured by Pearson and Spearman correlations

Value

An object of class gt_tbl.

Author(s)

Nicolas Robette

See Also

regtab, condesc, assoc.yx, darma, assoc.twocont, assoc.twocont.by

Examples

data(Movies)
contab(x = Movies[, c("Genre", "ArtHouse", "Budget")],
       y = Movies$BoxOffice)
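
For a continuous variable in x, the two quantities described in Details (regression slope, then Pearson and Spearman correlations) can be sketched in base R. The data below are hypothetical, and the sketch is unweighted; it illustrates the definitions rather than the package's code.

```r
# Hypothetical toy data: v plays the role of a continuous variable from x,
# y is the continuous variable being described
v <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Column 1: slope of the linear regression of y on the variable
slope <- unname(coef(lm(y ~ v))[2])

# Column 2: global association measured by Pearson and Spearman correlations
pearson  <- cor(v, y, method = "pearson")
spearman <- cor(v, y, method = "spearman")

c(slope = slope, pearson = pearson, spearman = spearman)
```

The slope is expressed in units of y per unit of the variable, whereas the correlations are unit-free.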

Pretty 2, 3 or 4-way cross-tabulations

Description

Displays pretty 2, 3 or 4-way cross-tabulations, from possibly weighted data, and with the opportunity to color the cells of the table according to a local measure of association (phi coefficients, standardized residuals or PEM).

Usage

crosstab(x, 
         y,
         xstrata = NULL,
         ystrata = NULL,
         weights = NULL,
         stat = "rprop",
         show.n = FALSE,
         show.cramer = TRUE,
         na.rm = FALSE,
         na.value = "NAs",
         digits = 1,
         sort = "none",
         color.cells = FALSE,
         measure = "phi",
         limits = c(-1, 1),
         min.asso = 0.1, 
         palette = "PRGn",
         reverse = FALSE)

Arguments

x

the row categorical variable

y

the column categorical variable

xstrata

optional categorical variable to stratify the table by rows. Default is NULL, which means no row strata.

ystrata

optional categorical variable to stratify the table by columns. Default is NULL, which means no column strata.

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

stat

character. Whether to compute a contingency table ("freq"), percentages ("prop"), row percentages ("rprop", default) or column percentages ("cprop").

show.n

logical. Whether to display frequencies (between brackets) in addition to the percentages. Ignored if stat = "freq". Default is FALSE.

show.cramer

logical. If TRUE (default), Cramer's V measure of association is displayed beside the table.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

digits

integer. The number of digits (default is 1). If NULL, the results are not rounded.

sort

character. If "both", rows and columns are sorted according to the first factor of a correspondence analysis of the contingency table. If "x", only rows are sorted. If "y", only columns are sorted. If "none" (default), no sorting is done.

color.cells

logical, indicating whether the cells of the table should be colored according to local measures of association. Default is FALSE.

measure

character. The measure of association used to color the cells. Can be "phi" for phi coefficient (default), "std.residuals" for standardized residuals, "adj.residuals" for adjusted standardized residuals or "pem" for local percentages of maximum deviation from independence. Only used if color.cells = TRUE.

limits

a numeric vector of length 2 providing limits of the scale. Default is c(-1,1). Only used if color.cells = TRUE.

min.asso

numerical value. The cells with a local association below min.asso (in absolute value) are kept blank. Only used if color.cells = TRUE.

palette

The colours or colour function that values will be mapped to (see Details). Default is "PRGn". Only used if color.cells = TRUE.

reverse

Whether the colors (or color function) in palette should be used in reverse order. For example, if the default order of a palette goes from blue to green, then reverse = TRUE will result in the colors going from green to blue. Default is FALSE. Only used if color.cells = TRUE.

Details

The function uses the gtsummary package to build the cross-tabulation, and then the gt package to finalize the layout and color the cells. Weights are handled silently with the survey package.

The function is also compatible with the variable labels assigned with the labelled package : these labels are displayed automatically.

The palette argument can be any of the following :

1. A character vector of RGB or named colours. Examples: palette(), c("#000000", "#0000FF", "#FFFFFF"), topo.colors(10)

2. The name of an RColorBrewer palette, e.g. "BuPu" or "Greens".

3. The full name of a viridis palette: "viridis", "magma", "inferno", or "plasma".

4. A function that receives a single value between 0 and 1 and returns a colour. Examples: colorRamp(c("#000000", "#FFFFFF"), interpolate="spline").
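
The four forms listed above can be written as follows in base R. The object names are illustrative, and the example only builds the palette values; passing them to crosstab is left out so that the block runs without the package.

```r
# Illustrative objects for each accepted form of the palette argument
pal_vector <- c("#000000", "#0000FF", "#FFFFFF")    # 1. vector of colours
pal_brewer <- "PRGn"                                # 2. RColorBrewer palette name
pal_viridis <- "magma"                              # 3. viridis palette name
pal_function <- colorRamp(c("#000000", "#FFFFFF"))  # 4. colour function

# colorRamp() returns a function that maps values in [0, 1]
# to a matrix of red, green and blue components (0 to 255)
rgb_low <- pal_function(0)
rgb_high <- pal_function(1)

# The RGB matrix can be converted to a hexadecimal colour code
mid_hex <- rgb(pal_function(0.5), maxColorValue = 255)
```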

Value

An object of class gt_tbl.

Author(s)

Nicolas Robette

See Also

assoc.twocat, weighted.table, phi.table

Examples

data(Movies)
# example 1
crosstab(Movies$Genre, Movies$Country)
# example 2
with(Movies, crosstab(Genre, Country, ystrata = ArtHouse, show.n = TRUE, color.cells = TRUE))

Describes Associations as in a Regression Model Analysis.

Description

Computes bivariate association measures between a response and predictor variables, producing a summary looking like a regression analysis.

Usage

darma(y, x, weights = NULL, target = 1,
      na.rm.cat = FALSE, na.value.cat = "NAs", na.rm.cont = FALSE,
      correlation = "kendall",
      nperm = NULL, distrib = "asympt", dec = c(1,3,3))

Arguments

y

the response variable

x

the predictor variables

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

target

rank or name of the category of interest when y is categorical

na.rm.cat

logical, indicating whether NA values in the categorical variables should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the categorical variables (see na.value.cat argument).

na.value.cat

character. Name of the level for NA category. Default is "NAs". Only used if na.rm.cat = FALSE.

na.rm.cont

logical, indicating whether NA values in the continuous variables should be silently removed before the computation proceeds. Default is FALSE.

correlation

character. The measure of correlation to use between two continuous variables : "pearson", "spearman" or "kendall" (default).

nperm

numeric. Number of permutations for the permutation test of independence. If NULL (default), no permutation test is performed.

distrib

the null distribution of permutation test of independence can be approximated by its asymptotic distribution ("asympt", default) or via Monte Carlo resampling ("approx").

dec

vector of 3 integers for number of decimals. The first value is for percents or medians, the second for association measures, the third for permutation p-values. Default is c(1,3,3).

Details

The function computes association measures (phi, correlation coefficient, Kendall's correlation) between the variable of interest and the other variables. It can also compute the p-values of permutation tests.
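
The idea of a Monte Carlo permutation test can be sketched in base R as follows. This is an illustration of the principle only, using Kendall's correlation on the built-in iris data; it is not the implementation used by darma, and the object names are illustrative.

```r
# Illustrative Monte Carlo permutation test in base R
set.seed(1)
data(iris)
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Observed association
observed <- cor(x, y, method = "kendall")

# Shuffle y to break any association with x, and recompute the statistic
nperm <- 200
permuted <- replicate(nperm, cor(x, sample(y), method = "kendall"))

# Two-sided p-value: share of permuted statistics at least as extreme as
# the observed one (1 is added to numerator and denominator so that the
# p-value is never exactly 0)
p_value <- (sum(abs(permuted) >= abs(observed)) + 1) / (nperm + 1)
```

A larger number of permutations gives a finer resolution for the p-value, at the cost of computation time.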

Value

A data frame

Author(s)

Nicolas Robette

See Also

assoc.yx, assoc.twocat, assoc.twocont, assoc.catcont, condesc, catdesc

Examples

data(iris)
iris2 = iris
iris2$Species = factor(iris$Species == "versicolor")
darma(iris2$Species, iris2[,1:4], target=2, nperm=100)

Association plot

Description

For a cross-tabulation, plots measures of local association with bars of varying height and width, using ggplot2.

Usage

ggassoc_assocplot(data, mapping, measure = "std.residuals",
                  limits = NULL, sort = "none",
                  na.rm = FALSE, na.value = "NAs",
                  colors = NULL, direction = 1, legend = "right")

Arguments

data

dataset to use for plot

mapping

aesthetics being used. x and y are required, weight can also be specified.

measure

character. The measure of association used to fill the rectangles. Can be "phi" for phi coefficient, "or" for odds ratios, "std.residuals" (default) for standardized (i.e. Pearson) residuals, "adj.residuals" for adjusted standardized residuals or "pem" for local percentages of maximum deviation from independence.

limits

a numeric vector of length two providing limits of the scale. If NULL (default), the limits are automatically adjusted to the data.

sort

character. If "both", rows and columns are sorted according to the first factor of a correspondence analysis of the contingency table. If "x", only rows are sorted. If "y", only columns are sorted. If "none" (default), no sorting is done.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

colors

vector of colors that will be interpolated to produce a color gradient. If NULL (default), the "Temps" palette from rcartocolors package is used.

direction

Sets the order of colours in the scale. If 1, the default, colours are as output by RColorBrewer::brewer.pal(). If -1, the order of colours is reversed.

legend

the position of legend ("none", "left", "right", "bottom", "top"). If "none", no legend is displayed.

Details

The measure of local association measures how much each combination of categories of x and y is over/under-represented.

The bars vary in width according to the square root of the expected frequency. They vary in height and color shading according to the measure of association. If the measure chosen is "std.residuals" (Pearson's residuals), as in the original association plot from Cohen and Friendly, the area of the bars is proportional to the difference in observed and expected frequencies.
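
The relation between width, height and area described above can be checked on a small example. The table below is made up for illustration and the object names are not part of the package.

```r
# Small made-up contingency table, for illustration only
tab <- matrix(c(20, 10,  5,
                10, 25, 30), nrow = 2, byrow = TRUE,
              dimnames = list(x = c("a", "b"), y = c("u", "v", "w")))

# Expected frequencies under independence
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Height of the bars: Pearson residuals
std_res <- (tab - expected) / sqrt(expected)

# Width of the bars: square root of the expected frequency
bar_width <- sqrt(expected)

# Area = height * width, which reduces to observed minus expected
bar_area <- std_res * bar_width
```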

This function can be used as a high-level plot with ggduo and ggpairs functions of the GGally package.

Value

a ggplot object

Author(s)

Nicolas Robette

References

Cohen, A. (1980), On the graphical display of the significant components in a two-way contingency table. Communications in Statistics—Theory and Methods, 9, 1025–1041. doi:10.1080/03610928008827940.

Friendly, M. (1992), Graphical methods for categorical data. SAS User Group International Conference Proceedings, 17, 190–200. http://datavis.ca/papers/sugi/sugi17.pdf

See Also

assoc.twocat, phi.table, catdesc, assoc.yx, darma, ggassoc_crosstab, ggpairs

Examples

data(Movies)
ggassoc_assocplot(data=Movies, mapping=ggplot2::aes(Country, Genre))

Bar plot of a crosstabulation inspired by Bertin

Description

For a cross-tabulation, plots bars for the conditional percentages of variable y according to variable x, using ggplot2. The general display is inspired by Bertin's plots.

Usage

ggassoc_bertin(data, mapping, prop.width = FALSE, 
sort = "none", add.gray = FALSE, add.rprop = FALSE,
na.rm = FALSE, na.value ="NAs")

Arguments

data

dataset to use for plot

mapping

aesthetics being used. x and y are required, weight can also be specified.

prop.width

logical. If TRUE, the width of the bars is proportional to the margin percentages of variable x.

sort

character. If "both", rows and columns are sorted according to the first factor of a correspondence analysis of the contingency table. If "x", only variable x is sorted. If "y", only variable y is sorted. If "none" (default), no sorting is done.

add.gray

logical. If FALSE (default), only white and black are used to fill the bars. If TRUE, gray is used additionally to fill the part of the bars corresponding to margin percentages of variable y.

add.rprop

logical. If TRUE, row percentages are displayed on top of the bars. Default is FALSE.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

Details

The height of the bars is proportional to the conditional frequency of variable y. The bars are filled in black if the conditional frequency is higher than the marginal frequency; otherwise it's filled in white.
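
The fill rule can be sketched in base R as follows, assuming that x is in rows and that the conditional frequencies are those of y within each category of x, as stated in the Description. The table is made up and the object names are illustrative.

```r
# Small made-up contingency table, with x in rows and y in columns
tab <- matrix(c(20, 10,  5,
                10, 25, 30), nrow = 2, byrow = TRUE,
              dimnames = list(x = c("a", "b"), y = c("u", "v", "w")))

# Conditional distribution of y within each category of x (rows sum to 1)
cond <- prop.table(tab, margin = 1)

# Marginal distribution of y
marg <- colSums(tab) / sum(tab)

# Black where the conditional frequency is higher than the marginal one
fill <- ifelse(sweep(cond, 2, marg, ">"), "black", "white")
```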

This graphical representation is inspired by the principles of Jacques Bertin and the online AMADO tool (https://paris-timemachine.huma-num.fr/amado/main.html).

Note : It does not allow faceting.

Value

a ggplot object

Author(s)

Nicolas Robette

References

J. Bertin: La graphique et le traitement graphique de l'information. Flammarion: Paris 1977.

See Also

assoc.twocat, phi.table, catdesc, ggassoc_crosstab, ggassoc_assocplot, ggassoc_phiplot, ggassoc_chiasmogram

Examples

data(Movies)
ggassoc_bertin(Movies, ggplot2::aes(x = Country, y = Genre))
ggassoc_bertin(Movies, ggplot2::aes(x = Country, y = Genre),
 sort = "both", prop.width = TRUE, add.gray = TRUE, add.rprop = TRUE)

Boxplots with violins

Description

Displays a boxplot and combines it with a violin plot, using ggplot2.

Usage

ggassoc_boxplot(data, mapping, 
na.rm.cat = FALSE, na.value.cat = "NAs", na.rm.cont = FALSE,
axes.labs = TRUE, ticks.labs = TRUE, text.size = 3,
sort = FALSE, box = TRUE, notch = FALSE, violin = TRUE)

Arguments

data

dataset to use for plot

mapping

aesthetic being used. It must specify x and y.

na.rm.cat

logical, indicating whether NA values in the categorical variable (i.e. x) should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the categorical variable (see na.value.cat argument).

na.value.cat

character. Name of the level for NA category. Default is "NAs". Only used if na.rm.cat = FALSE.

na.rm.cont

logical, indicating whether NA values in the continuous variable (i.e. y) should be silently removed before the computation proceeds. Default is FALSE.

axes.labs

Whether to display the labels of the axes, i.e. the names of x and y. Default is TRUE.

ticks.labs

Whether to display the labels of the categories of x and y. Default is TRUE.

text.size

Size of the association measure. If NULL, the text is not added to the plot.

sort

logical. If TRUE, the levels of the categorical variable are reordered according to the conditional medians, so that boxplots are sorted. Default is FALSE.

box

Whether to draw boxplots. Default is TRUE.

notch

If FALSE (default) make a standard box plot. If TRUE, make a notched box plot. Notches are used to compare groups; if the notches of two boxes do not overlap, this suggests that the medians are significantly different.

violin

Whether to draw a violin plot. Default is TRUE.

Details

Eta-squared measure of global association between x and y is displayed in upper-left corner of the plot.

This function can be used as a high-level plot with ggduo and ggpairs functions of the GGally package.

Value

a ggplot object

Author(s)

Nicolas Robette

See Also

assoc.catcont, condesc, assoc.yx, darma, ggpairs

Examples

data(Movies)
ggassoc_boxplot(Movies, mapping = ggplot2::aes(x = Critics, y = ArtHouse))

Plots counts and associations of a crosstabulation

Description

For a cross-tabulation, plots the number of observations by using rectangles with proportional areas, and the phi measures of association between the categories with a diverging gradient of colour, using ggplot2.

Usage

ggassoc_chiasmogram(data, mapping, measure = "phi",
limits = NULL, sort = "none",
na.rm = FALSE, na.value = "NAs",
colors = NULL, direction = 1)

Arguments

data

dataset to use for plot

mapping

aesthetics being used. x and y are required, weight can also be specified.

measure

character. The measure of association used for filling the rectangles. Can be "phi" for phi coefficient (default), "or" for odds ratios, "residuals" for Pearson residuals, "std.residuals" for standardized Pearson residuals or "pem" for local percentages of maximum deviation from independence.

limits

a numeric vector of length two providing limits of the scale. If NULL (default), the limits are automatically adjusted to the data.

sort

character. If "both", rows and columns are sorted according to the first factor of a correspondence analysis of the contingency table. If "x", only rows are sorted. If "y", only columns are sorted. If "none" (default), no sorting is done.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

colors

vector of colors that will be interpolated to produce a color gradient. If NULL (default), the "Temps" palette from rcartocolors package is used.

direction

Sets the order of colours in the scale. If 1, the default, colours are as output by RColorBrewer::brewer.pal(). If -1, the order of colours is reversed.

Details

The height of the rectangles is proportional to the marginal frequency of the row variable ; their width is proportional to the marginal frequency of the column variable. So the area of the rectangles is proportional to the expected frequency.
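
This proportionality can be verified on a small example in base R. The table is made up for illustration and the object names are not part of the package.

```r
# Small made-up contingency table, for illustration only
tab <- matrix(c(20, 10,  5,
                10, 25, 30), nrow = 2, byrow = TRUE,
              dimnames = list(x = c("a", "b"), y = c("u", "v", "w")))
n <- sum(tab)

# Height and width of the rectangles: marginal frequencies
row_share <- rowSums(tab) / n
col_share <- colSums(tab) / n

# Area of each rectangle: height * width
area <- outer(row_share, col_share)

# Expected frequencies under independence
expected <- outer(rowSums(tab), colSums(tab)) / n

# The areas are the expected frequencies divided by n
ratio <- expected / area
```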

The rectangles are filled according to a measure of local association, which measures how much each combination of categories of x and y is over/under-represented.

This function can be used as a high-level plot with ggduo and ggpairs functions of the GGally package.

Note : It does not allow faceting.

Value

a ggplot object

Author(s)

Nicolas Robette

References

Bozon Michel, Héran François. La découverte du conjoint. II. Les scènes de rencontre dans l'espace social. Population, 43(1), 1988, pp. 121-150.

See Also

assoc.twocat, phi.table, catdesc, assoc.yx, darma, ggassoc_phiplot, ggpairs

Examples

data(Movies)
ggassoc_chiasmogram(data=Movies, mapping=ggplot2::aes(Genre, Country))

Proportional area plot

Description

For a cross-tabulation, plots the observed (or expected) frequencies by using rectangles with proportional areas, and the measures of local association between the categories with a diverging gradient of colour, using ggplot2.

Usage

ggassoc_crosstab(data, mapping, size = "freq", max.size =  20,
                 measure = "phi", limits = NULL, sort = "none", 
                 na.rm = FALSE, na.value = "NAs",
                 colors = NULL, direction = 1, legend = "right")

Arguments

data

dataset to use for plot

mapping

aesthetics being used. x and y are required, weight can also be specified.

size

character. If "freq" (default), areas are proportional to observed frequencies. If "expected", they are proportional to expected frequencies.

max.size

numeric value, specifying the maximum size of the squares. Default is 20.

measure

character. The measure of association used for filling the rectangles. Can be "phi" for phi coefficient (default), "or" for odds ratios, "std.residuals" for standardized residuals, "adj.residuals" for adjusted standardized residuals or "pem" for local percentages of maximum deviation from independence.

limits

a numeric vector of length two providing limits of the scale. If NULL (default), the limits are automatically adjusted to the data.

sort

character. If "both", rows and columns are sorted according to the first factor of a correspondence analysis of the contingency table. If "x", only rows are sorted. If "y", only columns are sorted. If "none" (default), no sorting is done.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

colors

vector of colors that will be interpolated to produce a color gradient. If NULL (default), the "Temps" palette from rcartocolors package is used.

direction

Sets the order of colours in the scale. If 1, the default, colours are as output by RColorBrewer::brewer.pal(). If -1, the order of colours is reversed.

legend

the position of legend ("none", "left", "right", "bottom", "top"). If "none", no legend is displayed.

Details

The measure of local association measures how much each combination of categories of x and y is over/under-represented.

The areas of the rectangles are proportional to observed or expected frequencies. Their color shading varies according to the measure of association.

This function can be used as a high-level plot with ggduo and ggpairs functions of the GGally package.

Value

a ggplot object

Author(s)

Nicolas Robette

See Also

assoc.twocat, phi.table, catdesc, assoc.yx, darma, ggassoc_phiplot, ggpairs

Examples

data(Movies)
ggassoc_crosstab(data=Movies, mapping=ggplot2::aes(Genre, Country))

Marimekko plot

Description

For a cross-tabulation, plots a marimekko chart (also called mosaic plot), using ggplot2.

Usage

ggassoc_marimekko(data, mapping, type = "classic", 
measure = "phi", limits = NULL, 
na.rm = FALSE, na.value = "NAs",
palette = NULL, colors = NULL, direction = 1, 
linecolor = "gray60", linewidth = 0.1, 
sort = "none", legend = "right")

Arguments

data

dataset to use for plot

mapping

aesthetics being used. x and y are required, weight can also be specified.

type

character. If "classic" (default), a simple marimekko chart is plotted, with no use of local associations. If type is "shades", tiles are shaded according to the local associations between categories. If type is "patterns", tiles are filled with patterns, and the density of patterns is proportional to the absolute level of local association between categories.

measure

character. The measure of association used for filling (if type is "shades") or patterning (if type is "patterns") the tiles. Can be "phi" for phi coefficient (default), "or" for odds ratios, "std.residuals" for standardized (i.e. Pearson) residuals, "adj.residuals" for adjusted standardized residuals or "pem" for local percentages of maximum deviation from independence.

limits

a numeric vector of length two providing limits of the scale. If NULL (default), the limits are automatically adjusted to the data. Only used for type "shades".

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

palette

A character vector of color codes. The number of colors should be equal or higher than the number of categories in y. If NULL (default), the "Tableau" palette from ggthemes package is used. Only used for types "classic" and "patterns".

colors

vector of colors that will be interpolated to produce a color gradient. If NULL (default), the "Temps" palette from rcartocolors package is used. Only used for type "shades".

direction

Sets the order of colours in the scale. If 1, the default, colours are as output by RColorBrewer::brewer.pal(). If -1, the order of colours is reversed.

linecolor

character. Color of the contour lines of the tiles. Default is gray60.

linewidth

numeric. Width of the contour lines of the tiles. Default is 0.1.

sort

character. If "both", rows and columns are sorted according to the first factor of a correspondence analysis of the contingency table. If "x", only rows are sorted. If "y", only columns are sorted. If "none" (default), no sorting is done.

legend

the position of legend ("none", "left", "right", "bottom", "top"). If "none", no legend is displayed.

Details

The measure of local association measures how much each combination of categories of x and y is over/under-represented.

This function can be used as a high-level plot with ggduo and ggpairs functions of the GGally package.

Note : It does not allow faceting.

Value

a ggplot object

Author(s)

Nicolas Robette

References

Hartigan, J.A., and Kleiner, B. (1984), "A mosaic of television ratings". The American Statistician, 38, 32–35.

Friendly, M. (1994), "Mosaic displays for multi-way contingency tables". Journal of the American Statistical Association, 89, 190–200.

See Also

assoc.twocat, phi.table, catdesc, assoc.yx, darma, ggassoc_crosstab, ggpairs

Examples

data(Movies)
ggassoc_marimekko(data=Movies, mapping=ggplot2::aes(Genre, Country))
ggassoc_marimekko(data=Movies, mapping=ggplot2::aes(Genre, Country), type = "patterns")
ggassoc_marimekko(data=Movies, mapping=ggplot2::aes(Genre, Country), type = "shades")

Bar plot of measures of local association of a crosstabulation

Description

For a cross-tabulation, plots the measures of local association with bars of varying height, using ggplot2.

Usage

ggassoc_phiplot(data, mapping, measure = "phi", 
                limit = NULL, sort = "none",
                na.rm = FALSE, na.value = "NAs")

Arguments

data

dataset to use for plot

mapping

aesthetics being used. x and y are required, weight can also be specified.

measure

character. The measure of association used for filling the rectangles. Can be "phi" for phi coefficient (default), "or" for odds ratios, "std.residuals" for standardized residuals, "adj.residuals" for adjusted standardized residuals or "pem" for local percentages of maximum deviation from independence.

limit

numeric value, specifying the upper limit of the scale for the height of the bars, i.e. for the measures of association (the lower limit is set to -limit). It corresponds to the maximum absolute value of association one wants to represent in the plot. If NULL (default), the limit is automatically adjusted to the data.

sort

character. If "both", rows and columns are sorted according to the first factor of a correspondence analysis of the contingency table. If "x", only rows are sorted. If "y", only columns are sorted. If "none" (default), no sorting is done.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

Details

The measure of association measures how much each combination of categories of x and y is over/under-represented. The bars vary in width according to the number of observations in the categories of the column variable. They vary in height according to the measure of association. Bars are black if the association is positive and white if it is negative.

The original version of this plot (see Cibois, 2004) uses the measure of association called "pem", i.e. the local percentages of maximum deviation from independence.

This function can be used as a high-level plot with ggduo and ggpairs functions of the GGally package.

Value

a ggplot object

Author(s)

Nicolas Robette

References

Cibois Philippe, 2004, Les écarts à l'indépendance. Techniques simples pour analyser des données d'enquêtes, Collection "Méthodes quantitatives pour les sciences sociales"

See Also

assoc.twocat, phi.table, catdesc, assoc.yx, darma, ggassoc_crosstab, ggpairs

Examples

data(Movies)
ggassoc_phiplot(data=Movies, mapping=ggplot2::aes(Country, Genre))

Scatter plot with a smoothing line

Description

Displays a scatter plot and adds a smoothing line, using ggplot2.

Usage

ggassoc_scatter(data, mapping, na.rm = FALSE,
axes.labs = TRUE, ticks.labs = TRUE, text.size = 3)

Arguments

data

dataset to use for plot

mapping

aesthetic being used. It must specify x and y.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. Default is FALSE.

axes.labs

Whether to display the labels of the axes, i.e. the names of x and y. Default is TRUE.

ticks.labs

Whether to display the labels of the categories of x and y. Default is TRUE.

text.size

Size of the association measure. If NULL, the text is not added to the plot.

Details

Kendall's tau rank correlation between x and y is displayed in upper-left corner of the plot.

Smoothing is performed with gam.

This function can be used as a high-level plot with ggduo and ggpairs functions of the GGally package.

Value

a ggplot object

Author(s)

Nicolas Robette

See Also

assoc.twocont, condesc, assoc.yx, darma, ggpairs

Examples

data(Movies)
ggassoc_scatter(Movies, mapping = ggplot2::aes(x = Budget, y = Critics))

Movies (data)

Description

The data concern a sample of 1000 movies which were on screens in France, and some of their characteristics.

Usage

data(Movies)

Format

A data frame with 1000 observations and the following 7 variables:

Budget

numeric vector of movie budgets

Genre

is a factor with 9 levels

Country

is a factor with 4 levels. Country of origin of the movie.

ArtHouse

is a factor with levels No, Yes. Whether the movie had the "Art House" label.

Festival

is a factor with levels No, Yes. Whether the movie was selected in Cannes, Berlin or Venice film festivals.

Critics

numeric vector of average ratings from intellectual criticism.

BoxOffice

numeric vector of number of admissions.

Examples

data(Movies)
str(Movies)

Computes the odds ratios for every cell of a contingency table

Description

Computes the odds ratio for every cell of the cross-tabulation between two categorical variables

Usage

or.table(x, y, weights = NULL, na.rm = FALSE, na.value = "NAs", digits = 3)

Arguments

x

the first categorical variable

y

the second categorical variable

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

digits

integer. The number of digits (default is 3). If NULL, the results are not rounded.

Value

A table with the odds ratios
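
A minimal sketch of a cell-wise odds ratio in base R, assuming that the odds ratio of a cell is the one of the 2 x 2 table obtained by opposing the row and the column of that cell to all the other rows and columns. This definition is an assumption made for the illustration, and the function name cell_odds_ratios is hypothetical; it is not the code of or.table.

```r
# Hypothetical helper: odds ratio of each cell against the rest of the table
cell_odds_ratios <- function(tab) {
  n <- sum(tab)
  row_tot <- matrix(rowSums(tab), nrow(tab), ncol(tab))
  col_tot <- matrix(colSums(tab), nrow(tab), ncol(tab), byrow = TRUE)
  a <- tab                 # in the row and in the column
  b <- row_tot - tab       # in the row, not in the column
  cc <- col_tot - tab      # in the column, not in the row
  d <- n - a - b - cc      # neither in the row nor in the column
  (a * d) / (b * cc)
}

# Small made-up 2 x 2 table
tab <- matrix(c(20, 10,
                10, 25), nrow = 2, byrow = TRUE,
              dimnames = list(x = c("a", "b"), y = c("u", "v")))
or_tab <- cell_odds_ratios(tab)
```

In a 2 x 2 table, the odds ratios of the two diagonal cells are equal, and those of the off-diagonal cells are their inverse.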

Author(s)

Nicolas Robette

See Also

assoc.twocat, assoc.catcont, condesc, catdesc

Examples

data(Movies)
or.table(Movies$Country, Movies$ArtHouse)

Computes the local and global Percentages of Maximum Deviation from Independence (pem)

Description

Computes the local and global Percentages of Maximum Deviation from Independence (pem) of a contingency table.

Usage

pem.table(x, y, weights = NULL, sort = FALSE, na.rm = FALSE, na.value = "NAs", digits = 1)

Arguments

x

the first categorical variable

y

the second categorical variable

weights

an optional numeric vector of weights (by default, a vector of 1 for uniform weights)

sort

logical. Whether rows and columns are sorted according to a correspondence analysis or not (default is FALSE).

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

digits

integer. The number of digits (default is 1). If NULL, the results are not rounded.

Details

The Percentage of Maximum Deviation from Independence (pem) is an association measure for contingency tables and also provides attraction (resp. repulsion) measures in each cell of the crosstabulation (see Cibois, 1993). It is an alternative to chi-squared, Cramer's V coefficient, etc.
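
A minimal sketch of the local pem in base R, assuming the following definition: the deviation from independence in a cell, expressed as a percentage of the largest deviation of the same sign that the margins of the table allow. The function name local_pem is hypothetical and the tables are made up; this is not the code of pem.table, and the global pem is not covered.

```r
# Hypothetical helper: local percentages of maximum deviation from independence
local_pem <- function(tab) {
  n <- sum(tab)
  rs <- rowSums(tab)
  cs <- colSums(tab)
  expected <- outer(rs, cs) / n
  deviation <- tab - expected
  # Largest count a cell can reach given its row and column totals
  upper <- outer(rs, cs, pmin)
  # Smallest count a cell can reach given its row and column totals
  lower <- pmax(outer(rs, cs, "+") - n, 0)
  # Maximum possible deviation, in the direction of the observed one
  max_dev <- ifelse(deviation >= 0, upper - expected, expected - lower)
  100 * deviation / max_dev
}

# Perfect association: each row category goes with one column category
perfect <- matrix(c(10, 0,
                    0, 10), nrow = 2, byrow = TRUE)
pem_perfect <- local_pem(perfect)

# Independence: observed counts equal expected counts
indep <- matrix(c(10, 20,
                  20, 40), nrow = 2, byrow = TRUE)
pem_indep <- local_pem(indep)
```

With this definition, the local pem ranges from -100 (maximum repulsion) to 100 (maximum attraction), and equals 0 under independence.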

Value

Returns a list:

peml

Table with local percentages of maximum deviation from independence

pemg

Numeric value, i.e. the global percentage of maximum deviation from independence

Author(s)

Nicolas Robette

References

Cibois P., 1993, Le pem, pourcentage de l'ecart maximum : un indice de liaison entre modalites d'un tableau de contingence, Bulletin de methodologie sociologique, n40, p.43-63. https://cibois.pagesperso-orange.fr/bms93.pdf

See Also

table, chisq.test, phi.table, assocstats

Examples

data(Movies)
pem.table(Movies$Country, Movies$ArtHouse)

Computes the phi coefficient for every cell of a contingency table

Description

Computes the phi coefficient for every cell of the cross-tabulation between two categorical variables

Usage

phi.table(x, y, weights = NULL, na.rm = FALSE, na.value = "NAs", digits = 3)

Arguments

x

the first categorical variable

y

the second categorical variable

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

digits

integer. The number of digits (default is 3). If NULL, the results are not rounded.

Value

A table with the phi coefficients
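
A minimal sketch of cell-wise phi coefficients in base R, assuming that the phi of a cell is the phi coefficient of the 2 x 2 table opposing the row and the column of that cell to all the others. The function name cell_phi is hypothetical and the example uses the built-in mtcars data; it is not the code of phi.table.

```r
# Hypothetical helper: phi coefficient of each cell of a cross-tabulation
cell_phi <- function(x, y) {
  p <- prop.table(table(x, y))   # joint proportions
  pr <- rowSums(p)               # row margins
  pc <- colSums(p)               # column margins
  (p - outer(pr, pc)) / sqrt(outer(pr * (1 - pr), pc * (1 - pc)))
}

data(mtcars)
phi_tab <- cell_phi(mtcars$cyl, mtcars$gear)

# The phi of a cell equals the Pearson correlation between the two
# indicator variables of its row and column categories
phi_cell <- as.numeric(phi_tab["4", "4"])
cor_cell <- cor(as.numeric(mtcars$cyl == 4), as.numeric(mtcars$gear == 4))
```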

Author(s)

Nicolas Robette

References

Rakotomalala R., 'Comprendre la taille d'effet (effect size)', http://eric.univ-lyon2.fr/~ricco/cours/slides/effect_size.pdf

See Also

assoc.twocat, assoc.catcont, condesc, catdesc

Examples

data(Movies)
phi.table(Movies$Country, Movies$ArtHouse)

Profiles by level of a categorical variable

Description

Computes profiles (frequencies or percentages) for subgroups of observations defined by the levels of a categorical variable.

Usage

profiles(X, y, weights = NULL, stat = "cprop",
 mar = TRUE, digits = 1)

Arguments

X

data frame. The variables which are described in the profiles. All columns should be factors.

y

factor. The categorical variable which defines subgroups of observations whose profiles will be computed.

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

stat

character. Whether to compute frequencies ("freq"), percentages ("prop"), row percentages ("rprop") or column percentages ("cprop", default).

mar

logical, indicating whether to compute margins. Default is TRUE.

digits

numeric. Number of digits. Default is 1.

Value

A data frame with profiles in columns

Author(s)

Nicolas Robette

See Also

catdesc, assoc.yx, darma, assoc.twocat, assoc.twocat.by

Examples

data(Movies)
profiles(Movies[,c(2,4,5)], Movies$Country)
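
To make the default stat = "cprop" concrete, the base R sketch below stacks column percentages for several factors against one grouping factor. The data and the object names (X, y, col_profiles, profile_table) are made up for the illustration, and the layout of the result is an assumption that may differ from what profiles returns.

```r
# Toy data (made up): two descriptive factors and one grouping factor
X <- data.frame(
  genre = factor(c("drama", "comedy", "drama", "comedy", "drama", "drama")),
  award = factor(c("yes", "no", "no", "no", "yes", "no"))
)
y <- factor(c("FR", "FR", "US", "US", "US", "FR"))

# For each variable in X, cross-tabulate with y and take column percentages,
# so that each column (level of y) sums to 100 within a variable.
col_profiles <- lapply(X, function(v) {
  round(100 * prop.table(table(v, y), margin = 2), 1)
})

# Stack the per-variable tables: one row per level, one column per level of y
profile_table <- do.call(rbind, col_profiles)
profile_table
```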

Univariate and Multivariate Regressions and Their Average Marginal Effects

Description

Computes linear or binomial regressions in two steps: univariate regressions, then a multivariate regression. All the results are displayed side by side, in the form of average marginal effects.

Usage

regtab(x, y, weights = NULL, continuous = "slopes", 
 show.ci = TRUE, conf.level = 0.95)

Arguments

x

data frame. The explanatory (i.e. independent) variables used in regressions. They can be numerical or factors.

y

vector. The outcome (i.e. dependent) variable. It can be numerical (linear regression) or a factor with 2 levels (binomial regression).

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

continuous

character. The kind of average marginal effects computed for continuous explanatory variables. If "slopes" (default), these are average marginal slopes. If "predictions", these are average marginal predictions for a set of values.

show.ci

logical. Whether to display the confidence intervals. Default is TRUE.

conf.level

numerical value. Defaults to 0.95, which corresponds to a 95 percent confidence interval. Must be strictly greater than 0 and less than 1.

Details

This function is basically a wrapper for regression functions in the gtsummary package. It computes a series of univariate regressions (one for each explanatory variable), then a multivariate regression (with all explanatory variables), and displays the results side by side. These results are presented in the form of average marginal effects: average marginal predictions for categorical variables and average marginal slopes (or predictions) for continuous variables.

In addition, the function is compatible with the attribute labels assigned with the labelled package: these labels are displayed automatically.
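
The two-step logic can be sketched in base R, without gtsummary. The example below uses the built-in mtcars data as a stand-in (an assumption made for the sketch) and only continuous explanatory variables. It does not reproduce the tbl_merge output of regtab.

```r
# Built-in mtcars data stands in for real data (assumption for the sketch)
outcome <- "mpg"
predictors <- c("wt", "hp")

# Step 1: one univariate regression per explanatory variable
uni <- sapply(predictors, function(p) {
  coef(lm(reformulate(p, response = outcome), data = mtcars))[[p]]
})

# Step 2: one multivariate regression with all explanatory variables
multi <- coef(lm(reformulate(predictors, response = outcome), data = mtcars))[predictors]

# Side-by-side display; in a linear model without interactions, the average
# marginal slope of a continuous variable equals its coefficient.
cbind(univariate = uni, multivariate = multi)
```

Comparing the two columns shows how much of each univariate slope remains once the other explanatory variables are controlled for.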

Value

An object of class tbl_merge from the gtsummary package

Author(s)

Nicolas Robette

References

Arel-Bundock V, Greifer N, Heiss A (Forthcoming). “How to Interpret Statistical Models Using marginaleffects in R and Python.” Journal of Statistical Software.

Larmarange J., 2024, “Prédictions marginales, contrastes marginaux & effets marginaux”, in Guide-R, Guide pour l’analyse de données d’enquêtes avec R, https://larmarange.github.io/guide-R/analyses/estimations-marginales.html

See Also

cattab, catdesc, condesc, assoc.yx, darma, assoc.twocat, assoc.twocat.by

Examples

data(Movies)
regtab(x = Movies[, c("Genre", "Budget", "Festival", "Critics")],
       y = Movies$BoxOffice)

Cross-tabulation statistics for ggplot2

Description

Computes statistics of a cross-tabulation using the assoc.twocat function.

Usage

stat_twocat(mapping = NULL, 
            data = NULL,
            geom = "point",
            position = "identity",
            ...,
            show.legend = NA,
            inherit.aes = TRUE)

Arguments

mapping

Set of aesthetic mappings created by aes(). If specified and inherit.aes = TRUE (the default), it is combined with the default mapping at the top level of the plot. You must supply mapping if there is no plot mapping.

data

The data to be displayed in this layer. There are three options: If NULL, the default, the data is inherited from the plot data as specified in the call to ggplot(). A data.frame, or other object, will override the plot data. All objects will be fortified to produce a data frame. See fortify() for which variables will be created. A function will be called with a single argument, the plot data. The return value must be a data.frame, and will be used as the layer data. A function can be created from a formula (e.g. ~ head(.x, 10)).

geom

Override the default connection with ggplot2::geom_point().

position

Position adjustment, either as a string naming the adjustment (e.g. "jitter" to use position_jitter), or the result of a call to a position adjustment function. Use the latter if you need to change the settings of the adjustment.

...

Other arguments passed on to layer(). These are often aesthetics, used to set an aesthetic to a fixed value, like colour = "red" or size = 3. They may also be parameters to the paired geom/stat.

show.legend

logical. Should this layer be included in the legends? NA, the default, includes if any aesthetics are mapped. FALSE never includes, and TRUE always includes. It can also be a named logical vector to finely select the aesthetics to display.

inherit.aes

If FALSE, overrides the default aesthetics, rather than combining with them. This is most useful for helper functions that define both data and aesthetics and shouldn't inherit behaviour from the default plot specification, e.g. borders().

Value

A ggplot2 plot with the added statistic.

Author(s)

Nicolas Robette


Standardized residuals of a contingency table

Description

Computes standardized or adjusted residuals of a (possibly) weighted contingency table

Usage

stdres.table(x, y, weights = NULL, na.rm = FALSE,
  na.value = "NAs", digits = 3, residuals = "std")

Arguments

x

the first categorical variable

y

the second categorical variable

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

digits

integer. The number of digits (default is 3). If NULL, the results are not rounded.

residuals

If "std" (default), standardized (i.e. Pearson) residuals are computed. If "adj", adjusted standardized residuals are computed.

Value

A table with the residuals

Note

The adjusted standardized residuals are strictly equivalent to test-values for nominal variables as proposed by Lebart et al (1984).

Author(s)

Nicolas Robette

References

Agresti, A. (2007). An Introduction to Categorical Data Analysis, 2nd ed. New York: John Wiley & Sons.

Rakotomalala R., Comprendre la taille d'effet (effect size), http://eric.univ-lyon2.fr/~ricco/cours/slides/effect_size.pdf

Lebart L., Morineau A. and Warwick K., 1984, Multivariate Descriptive Statistical Analysis, John Wiley and Sons, New York.

See Also

assoc.twocat, phi.table, or.table, pem.table

Examples

data(Movies)
stdres.table(Movies$Country, Movies$ArtHouse)
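
The difference between the two kinds of residuals can be checked by hand. The base R sketch below computes both on a made-up, unweighted 3 x 2 table. The object names are illustrative, and the code is not taken from the package.

```r
# Toy 3 x 2 contingency table (made-up counts)
tab <- matrix(c(30, 10, 20, 15, 25, 20), nrow = 3,
              dimnames = list(group = c("a", "b", "c"), answer = c("no", "yes")))

n <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / n  # counts under independence

# Pearson residuals: what the residuals argument calls "std"
pearson <- (tab - expected) / sqrt(expected)

# Adjusted standardized residuals ("adj"): Pearson residuals rescaled
# using the row and column proportions
adjusted <- pearson / sqrt(outer(1 - rowSums(tab) / n, 1 - colSums(tab) / n))

round(pearson, 3)
round(adjusted, 3)
```

Note that base R names these quantities differently: in the result of chisq.test, the residuals component holds the Pearson residuals and the stdres component holds the adjusted standardized residuals.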

Weighted correlation

Description

Computes the weighted correlation between two distributions. This can be Pearson, Spearman or Kendall correlation.

Usage

weighted.cor(x, y, weights = NULL, method = "pearson", na.rm = FALSE)

Arguments

x

numeric vector

y

numeric vector

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

method

a character string indicating which correlation coefficient is to be computed. One of "pearson" (default), "kendall", or "spearman".

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. Default is FALSE.

Value

a length-one numeric vector

Author(s)

Nicolas Robette

See Also

weighted.sd, weighted.cor2

Examples

data(Movies)
weighted.cor(Movies$Critics, Movies$BoxOffice, weights = rep(c(.8,1.2), 500))
weighted.cor(Movies$Critics, Movies$BoxOffice, weights = rep(c(.8,1.2), 500), method = "spearman")
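
For the Pearson case, a weighted correlation can be written directly from weighted moments. The sketch below is a minimal base R illustration on made-up data. The helper name wcor_sketch is hypothetical, and the package may compute the coefficient differently.

```r
# Hypothetical helper: weighted Pearson correlation from weighted moments
wcor_sketch <- function(x, y, w = rep(1, length(x))) {
  w <- w / sum(w)                      # normalise weights to sum to 1
  mx <- sum(w * x)                     # weighted means
  my <- sum(w * y)
  cxy <- sum(w * (x - mx) * (y - my))  # weighted covariance
  cxy / sqrt(sum(w * (x - mx)^2) * sum(w * (y - my)^2))
}

# Toy data (made up)
x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.3)
y <- c(10, 14, 9, 20, 18, 12)
w <- c(0.8, 1.2, 0.8, 1.2, 0.8, 1.2)

wcor_sketch(x, y, w)
```

With uniform weights, this reduces to the usual cor(x, y).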

Weighted correlations

Description

Computes a matrix of weighted correlations between the columns of x and the columns of y. This can be Pearson, Spearman or Kendall correlation.

Usage

weighted.cor2(x, y = NULL, weights = NULL, method = "pearson", na.rm = FALSE)

Arguments

x

a data frame of numeric vectors

y

an optional data frame of numeric vectors. Default is NULL, which means that correlations between the columns of x are computed.

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

method

a character string indicating which correlation coefficient is to be computed. One of "pearson" (default), "kendall", or "spearman".

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. Default is FALSE.

Value

a matrix of correlations

Author(s)

Nicolas Robette

See Also

weighted.cor

Examples

data(Movies)
weighted.cor2(Movies[,c("Budget", "Critics", "BoxOffice")], weights = rep(c(.8,1.2), 500))

Weighted covariance

Description

Computes the weighted covariance between two distributions.

Usage

weighted.cov(x, y, weights = NULL, na.rm = FALSE)

Arguments

x

numeric vector

y

numeric vector

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. Default is FALSE.

Value

a length-one numeric vector

Author(s)

Nicolas Robette

See Also

weighted.sd, weighted.cor, weighted.cov2

Examples

data(Movies)
weighted.cov(Movies$Critics, Movies$BoxOffice, weights = rep(c(.8,1.2), 500))

Weighted covariances

Description

Computes a matrix of weighted covariances between the columns of x and the columns of y.

Usage

weighted.cov2(x, y = NULL, weights = NULL, na.rm = FALSE)

Arguments

x

a data frame of numeric vectors

y

an optional data frame of numeric vectors. Default is NULL, which means that covariances between the columns of x are computed.

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. Default is FALSE.

Value

a matrix of covariances

Author(s)

Nicolas Robette

See Also

weighted.cov

Examples

data(Movies)
weighted.cov2(Movies[,c("Budget", "Critics", "BoxOffice")], weights = rep(c(.8,1.2), 500))
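
A comparable matrix can be obtained in base R with stats::cov.wt. The sketch below uses made-up data. The choice of method = "ML" (division by the total weight) is an assumption made for the illustration; the normalisation used by weighted.cov2 is not stated here and may differ.

```r
# Toy data frame (made up)
df <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                 b = c(2, 1, 4, 3, 6, 5),
                 c = c(9, 7, 8, 4, 5, 2))
w <- rep(c(0.8, 1.2), 3)

# stats::cov.wt() returns a weighted covariance matrix; method = "ML" divides
# by the total weight, whereas the default "unbiased" applies a correction.
res <- cov.wt(df, wt = w / sum(w), method = "ML")
res$cov
```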

Cramer's V

Description

Computes Cramer's V measure of association between two (possibly weighted) categorical variables

Usage

weighted.cramer(x, y, weights = NULL, na.rm = FALSE)

Arguments

x

the first categorical variable

y

the second categorical variable

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. Default is FALSE.

Value

Numerical value with Cramer's V.

Author(s)

Nicolas Robette

References

Rakotomalala R., 'Comprendre la taille d'effet (effect size)', http://eric.univ-lyon2.fr/~ricco/cours/slides/effect_size.pdf

See Also

assoc.twocat

Examples

data(Movies)
weighted.cramer(Movies$Country, Movies$ArtHouse)
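
Cramer's V is the square root of the chi-squared statistic divided by its maximum possible value, n * (min(rows, columns) - 1). The sketch below applies this textbook definition to a weighted table built with xtabs. The helper name cramer_sketch and the toy data are made up, and the package's own computation may differ in its details.

```r
# Hypothetical helper: Cramer's V from a (possibly weighted) contingency table
cramer_sketch <- function(x, y, w = rep(1, length(x))) {
  tab <- xtabs(w ~ x + y)                            # weighted contingency table
  n <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n  # counts under independence
  chi2 <- sum((tab - expected)^2 / expected)
  sqrt(chi2 / (n * (min(dim(tab)) - 1)))
}

# Toy data (made up)
x <- factor(rep(c("a", "b", "c"), each = 4))
y <- factor(rep(c("u", "v"), times = 6))
cramer_sketch(x, x)  # a variable crossed with itself: perfect association
cramer_sketch(x, y)  # balanced design: no association
```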

Weighted median absolute deviation to median

Description

Computes the weighted median absolute deviation to median (aka MAD) of a distribution.

Usage

weighted.mad(x, weights = NULL, na.rm = FALSE)

Arguments

x

numeric vector

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. Default is FALSE.

Value

a length-one numeric vector

Author(s)

Nicolas Robette

See Also

weighted.quantile

Examples

data(Movies)
weighted.mad(Movies$Critics, weights = rep(c(.8,1.2), 500))
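
The idea can be sketched in two steps: a weighted median, then the weighted median of the absolute deviations from it. The helpers below (wmedian_sketch, wmad_sketch) are hypothetical. They use a "lower" weighted median without interpolation and no consistency constant, both of which are assumptions that may not match weighted.mad.

```r
# Hypothetical helper: lower weighted median, i.e. the smallest value whose
# cumulative weight reaches half of the total weight
wmedian_sketch <- function(x, w = rep(1, length(x))) {
  o <- order(x)
  cw <- cumsum(w[o]) / sum(w)
  x[o][which(cw >= 0.5)[1]]
}

# Weighted MAD: weighted median of absolute deviations from the weighted median.
# No consistency constant (such as the 1.4826 used by stats::mad) is applied.
wmad_sketch <- function(x, w = rep(1, length(x))) {
  wmedian_sketch(abs(x - wmedian_sketch(x, w)), w)
}

x <- c(1, 2, 4, 7, 11)
wmad_sketch(x)                        # uniform weights
wmad_sketch(x, w = c(1, 1, 1, 3, 3))  # more weight on the large values
```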

Weighted quantiles

Description

Computes the weighted quantiles of a distribution.

Usage

weighted.quantile(x, weights = NULL, probs = seq(0, 1, 0.25),
                  na.rm = FALSE, names = FALSE)

Arguments

x

numeric vector whose sample quantiles are wanted

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

probs

numeric vector of probabilities with values in [0,1]

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. Default is FALSE.

names

logical. If TRUE, the result has a names attribute. Default is FALSE.

Value

A numeric vector of the same length as the probs argument.

Note

This function is taken from https://stackoverflow.com/questions/2748725/is-there-a-weighted-median-function

See Also

weighted.mad

Examples

data(Movies)
weighted.quantile(Movies$Critics, weights = rep(c(.8,1.2), 500), names = TRUE)
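
One common way to define weighted quantiles is to invert the weighted empirical distribution function. The sketch below does so without interpolation, on made-up data. The helper name wquantile_sketch is hypothetical, and this is not the code referenced in the Note above, so results may differ from weighted.quantile.

```r
# Hypothetical helper: weighted quantiles by inverting the weighted
# empirical distribution function (no interpolation)
wquantile_sketch <- function(x, w = rep(1, length(x)), probs = seq(0, 1, 0.25)) {
  o <- order(x)
  xs <- x[o]
  cw <- cumsum(w[o]) / sum(w)  # cumulative share of weight at each sorted value
  # for each probability, take the first sorted value whose cumulative
  # weight reaches it (falling back to the largest value)
  vapply(probs, function(p) xs[min(which(cw >= p), length(xs))], numeric(1))
}

x <- c(10, 20, 30, 40)
wquantile_sketch(x, w = c(1, 1, 1, 5), probs = c(0.25, 0.5, 0.75))
```

Because the largest value carries most of the weight here, the median and the upper quartile both fall on it.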

Weighted standard deviation

Description

Computes the weighted standard deviation of a distribution.

Usage

weighted.sd(x, weights = NULL, na.rm = FALSE)

Arguments

x

numeric vector

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. Default is FALSE.

Value

a length-one numeric vector

Author(s)

Nicolas Robette

See Also

weighted.cor

Examples

data(Movies)
weighted.sd(Movies$Critics, weights = rep(c(.8,1.2), 500))
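
A weighted standard deviation can be written as the square root of the weighted mean of squared deviations from the weighted mean. The sketch below divides by the total weight, with no n - 1 style correction. That denominator is an assumption made for the illustration, and wsd_sketch is a hypothetical helper name.

```r
# Hypothetical helper: weighted standard deviation (divides by total weight)
wsd_sketch <- function(x, w = rep(1, length(x))) {
  m <- sum(w * x) / sum(w)           # weighted mean
  sqrt(sum(w * (x - m)^2) / sum(w))  # root of the weighted mean squared deviation
}

# Toy data (made up)
x <- c(2, 4, 4, 6)
w <- c(1, 2, 1, 3)
wsd_sketch(x, w)
```

With integer weights, this matches the population standard deviation of the data in which each value is repeated as many times as its weight.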

Computes a (possibly weighted) contingency table

Description

Computes a contingency table from one or two vectors, with the possibility of specifying weights.

Usage

weighted.table(x, y = NULL, weights = NULL, stat = "freq",
              mar = FALSE, na.rm = FALSE, na.value = "NAs", digits = 1)

Arguments

x

an object which can be interpreted as factor

y

an optional object which can be interpreted as factor

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

stat

character. Whether to compute a contingency table ("freq", default), percentages ("prop"), row percentages ("rprop") or column percentages ("cprop").

mar

logical, indicating whether to compute margins. Default is FALSE.

na.rm

logical, indicating whether NA values should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the variables (see na.value argument).

na.value

character. Name of the level for NA category. Default is "NAs". Only used if na.rm = FALSE.

digits

integer indicating the number of decimal places (default is 1)

Value

Returns a contingency table.

Author(s)

Nicolas Robette

See Also

table, assoc.twocat

Examples

data(Movies)
weighted.table(Movies$Country, Movies$ArtHouse)
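
For comparison, the same kinds of tables can be built in base R with xtabs, prop.table and addmargins. The sketch below uses made-up data and weights. It mirrors the "freq", "prop", "rprop" and "cprop" options but is not the package's implementation.

```r
# Toy data (made up)
x <- factor(c("FR", "FR", "US", "US", "US", "FR"))
y <- factor(c("yes", "no", "no", "yes", "no", "no"))
w <- c(0.8, 1.2, 0.8, 1.2, 0.8, 1.2)

tab <- xtabs(w ~ x + y)                      # weighted counts
round(100 * prop.table(tab), 1)              # overall percentages
round(100 * prop.table(tab, margin = 1), 1)  # row percentages
round(100 * prop.table(tab, margin = 2), 1)  # column percentages
addmargins(tab)                              # weighted counts with margins
```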