
Frequency tables and cross-tabulations in R
Source:vignettes/frequency-tables.Rmd
frequency-tables.Rmdfreq() and cross_tab() are the core
tabulation functions in spicy. They handle factors, labelled variables
(from haven or labelled), weights, and missing values out of the box.
This vignette covers the main options using the bundled
sochealth dataset.
Frequency tables with freq()
Basic usage
Pass a data frame and a variable name to get counts and percentages:
freq(sochealth, education)
#> Frequency table: education
#>
#> Category │ Values Freq. Percent
#> ──────────┼─────────────────────────────────
#> Valid │ Lower secondary 261 21.8
#> │ Upper secondary 539 44.9
#> │ Tertiary 400 33.3
#> ──────────┼─────────────────────────────────
#> Total │ 1200 100.0
#>
#> Label: Highest education level
#> Class: ordered, factor
#> Data: sochealthSorting
Sort by frequency with sort = "-" (decreasing) or
sort = "+" (increasing). Sort alphabetically with
sort = "name+" or sort = "name-":
freq(sochealth, education, sort = "-")
#> Frequency table: education
#>
#> Category │ Values Freq. Percent
#> ──────────┼─────────────────────────────────
#> Valid │ Upper secondary 539 44.9
#> │ Tertiary 400 33.3
#> │ Lower secondary 261 21.8
#> ──────────┼─────────────────────────────────
#> Total │ 1200 100.0
#>
#> Label: Highest education level
#> Class: ordered, factor
#> Data: sochealthSort alphabetically:
freq(sochealth, education, sort = "name+")
#> Frequency table: education
#>
#> Category │ Values Freq. Percent
#> ──────────┼─────────────────────────────────
#> Valid │ Lower secondary 261 21.8
#> │ Tertiary 400 33.3
#> │ Upper secondary 539 44.9
#> ──────────┼─────────────────────────────────
#> Total │ 1200 100.0
#>
#> Label: Highest education level
#> Class: ordered, factor
#> Data: sochealthCumulative percentages
Add cumulative columns with cum = TRUE:
freq(sochealth, smoking, cum = TRUE)
#> Frequency table: smoking
#>
#> Category │ Values Freq. Percent Valid Percent Cum. Percent
#> ──────────┼─────────────────────────────────────────────────────
#> Valid │ No 926 77.2 78.8 77.2
#> │ Yes 249 20.8 21.2 97.9
#> Missing │ NA 25 2.1 100.0
#> ──────────┼─────────────────────────────────────────────────────
#> Total │ 1200 100.0 100.0 100.0
#>
#> Category │ Values Cum. Valid Percent
#> ──────────┼────────────────────────────
#> Valid │ No 78.8
#> │ Yes 100.0
#> Missing │ NA
#> ──────────┼────────────────────────────
#> Total │ 100.0
#>
#> Label: Current smoker
#> Class: factor
#> Data: sochealthWeighted frequencies
Supply a weight variable with weights. By default,
rescale = TRUE adjusts the weighted total to match the
unweighted sample size:
freq(sochealth, education, weights = weight)
#> Frequency table: education
#>
#> Category │ Values Freq. Percent
#> ──────────┼──────────────────────────────────
#> Valid │ Lower secondary 258.62 21.6
#> │ Upper secondary 546.40 45.5
#> │ Tertiary 394.99 32.9
#> ──────────┼──────────────────────────────────
#> Total │ 1200 100.0
#>
#> Label: Highest education level
#> Class: ordered, factor
#> Data: sochealth
#> Weight: weight (rescaled)Set rescale = FALSE to keep the raw weighted counts:
freq(sochealth, education, weights = weight, rescale = FALSE)
#> Frequency table: education
#>
#> Category │ Values Freq. Percent
#> ──────────┼───────────────────────────────────
#> Valid │ Lower secondary 257.86 21.6
#> │ Upper secondary 544.79 45.5
#> │ Tertiary 393.82 32.9
#> ──────────┼───────────────────────────────────
#> Total │ 1196.47 100.0
#>
#> Label: Highest education level
#> Class: ordered, factor
#> Data: sochealth
#> Weight: weightLabelled variables
When a variable has value labels (e.g., imported from SPSS or Stata
with haven), freq() shows them by default with the
[code] label format. Control this with
labelled_levels:
# Create a labelled version of the smoking variable
sh <- sochealth
sh$smoking_lbl <- labelled::labelled(
ifelse(sh$smoking == "Yes", 1L, 0L),
labels = c("Non-smoker" = 0L, "Current smoker" = 1L)
)
# Default: [code] label
freq(sh, smoking_lbl)
#> Frequency table: smoking_lbl
#>
#> Category │ Values Freq. Percent Valid Percent
#> ──────────┼───────────────────────────────────────────────────
#> Valid │ [0] Non-smoker 926 77.2 78.8
#> │ [1] Current smoker 249 20.8 21.2
#> Missing │ NA 25 2.1
#> ──────────┼───────────────────────────────────────────────────
#> Total │ 1200 100.0 100.0
#>
#> Class: haven_labelled, vctrs_vctr, integer
#> Data: sh
# Labels only (no codes)
freq(sh, smoking_lbl, labelled_levels = "labels")
#> Frequency table: smoking_lbl
#>
#> Category │ Values Freq. Percent Valid Percent
#> ──────────┼───────────────────────────────────────────────
#> Valid │ Non-smoker 926 77.2 78.8
#> │ Current smoker 249 20.8 21.2
#> Missing │ NA 25 2.1
#> ──────────┼───────────────────────────────────────────────
#> Total │ 1200 100.0 100.0
#>
#> Class: haven_labelled, vctrs_vctr, integer
#> Data: sh
# Codes only (no labels)
freq(sh, smoking_lbl, labelled_levels = "values")
#> Frequency table: smoking_lbl
#>
#> Category │ Values Freq. Percent Valid Percent
#> ──────────┼───────────────────────────────────────
#> Valid │ 0 926 77.2 78.8
#> │ 1 249 20.8 21.2
#> Missing │ NA 25 2.1
#> ──────────┼───────────────────────────────────────
#> Total │ 1200 100.0 100.0
#>
#> Class: haven_labelled, vctrs_vctr, integer
#> Data: shCustom missing values
Treat specific values as missing with na_val:
freq(sochealth, income_group, na_val = "High")
#> Frequency table: income_group
#>
#> Category │ Values Freq. Percent Valid Percent
#> ──────────┼─────────────────────────────────────────────
#> Valid │ Low 247 20.6 25.6
#> │ Lower middle 388 32.3 40.3
#> │ Upper middle 328 27.3 34.1
#> Missing │ NA 237 19.8
#> ──────────┼─────────────────────────────────────────────
#> Total │ 1200 100.0 100.0
#>
#> Label: Household income group
#> Class: ordered, factor
#> Data: sochealthCross-tabulations with cross_tab()
Basic two-way table
Cross two variables to get a contingency table with a chi-squared test and effect size:
cross_tab(sochealth, smoking, education)
#> Crosstable: smoking x education (N)
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 179 415 332
#> Yes │ 78 112 59
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 257 527 391
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 926
#> Yes │ 249
#> ─────────────┼────────────
#> Total │ 1175
#>
#> Chi-2(2) = 21.6, p < 0.001
#> Cramer's V = 0.14Row and column percentages
Use percent = "row" or percent = "col" to
display percentages instead of raw counts:
cross_tab(sochealth, smoking, education, percent = "col")
#> Crosstable: smoking x education (Column %)
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 69.6 78.7 84.9
#> Yes │ 30.4 21.3 15.1
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 100.0 100.0 100.0
#> N │ 257 527 391
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 78.8
#> Yes │ 21.2
#> ─────────────┼────────────
#> Total │ 100.0
#> N │ 1175
#>
#> Chi-2(2) = 21.6, p < 0.001
#> Cramer's V = 0.14
cross_tab(sochealth, smoking, education, percent = "row")
#> Crosstable: smoking x education (Row %)
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 19.3 44.8 35.9
#> Yes │ 31.3 45.0 23.7
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 21.9 44.9 33.3
#>
#> Values │ Total N
#> ─────────────┼───────────────────────
#> No │ 100.0 926
#> Yes │ 100.0 249
#> ─────────────┼───────────────────────
#> Total │ 100.0 1175
#>
#> Chi-2(2) = 21.6, p < 0.001
#> Cramer's V = 0.14Grouping with by
Stratify the table by a third variable:
cross_tab(sochealth, smoking, education, by = sex)
#> Crosstable: smoking x education (N) | sex = Female
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 95 220 160
#> Yes │ 38 62 31
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 133 282 191
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 475
#> Yes │ 131
#> ─────────────┼────────────
#> Total │ 606
#>
#> Chi-2(2) = 7.1, p = 0.029
#> Cramer's V = 0.11
#>
#> Crosstable: smoking x education (N) | sex = Male
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 84 195 172
#> Yes │ 40 50 28
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 124 245 200
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 451
#> Yes │ 118
#> ─────────────┼────────────
#> Total │ 569
#>
#> Chi-2(2) = 15.6, p < 0.001
#> Cramer's V = 0.17For more than one grouping variable, use
interaction():
cross_tab(sochealth, smoking, education,
by = interaction(sex, age_group))
#> Crosstable: smoking x education (N) | sex x age_group = Female.25-34
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 23 49 29
#> Yes │ 9 9 7
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 32 58 36
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 101
#> Yes │ 25
#> ─────────────┼────────────
#> Total │ 126
#>
#> Chi-2(2) = 2.1, p = 0.356
#> Cramer's V = 0.13
#>
#> Crosstable: smoking x education (N) | sex x age_group = Male.25-34
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 9 42 32
#> Yes │ 11 11 4
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 20 53 36
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 83
#> Yes │ 26
#> ─────────────┼────────────
#> Total │ 109
#>
#> Chi-2(2) = 14.2, p < 0.001
#> Cramer's V = 0.36
#>
#> Crosstable: smoking x education (N) | sex x age_group = Female.35-49
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 24 73 48
#> Yes │ 10 20 8
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 34 93 56
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 145
#> Yes │ 38
#> ─────────────┼────────────
#> Total │ 183
#>
#> Chi-2(2) = 3.0, p = 0.223
#> Cramer's V = 0.13
#>
#> Crosstable: smoking x education (N) | sex x age_group = Male.35-49
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 33 59 60
#> Yes │ 14 17 7
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 47 76 67
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 152
#> Yes │ 38
#> ─────────────┼────────────
#> Total │ 190
#>
#> Chi-2(2) = 6.9, p = 0.032
#> Cramer's V = 0.19
#>
#> Crosstable: smoking x education (N) | sex x age_group = Female.50-64
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 28 63 45
#> Yes │ 8 16 6
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 36 79 51
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 136
#> Yes │ 30
#> ─────────────┼────────────
#> Total │ 166
#>
#> Chi-2(2) = 2.0, p = 0.360
#> Cramer's V = 0.11
#>
#> Crosstable: smoking x education (N) | sex x age_group = Male.50-64
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 28 58 42
#> Yes │ 8 13 5
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 36 71 47
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 128
#> Yes │ 26
#> ─────────────┼────────────
#> Total │ 154
#>
#> Chi-2(2) = 2.1, p = 0.343
#> Cramer's V = 0.12
#>
#> Crosstable: smoking x education (N) | sex x age_group = Female.65-75
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 20 35 38
#> Yes │ 11 17 10
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 31 52 48
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 93
#> Yes │ 38
#> ─────────────┼────────────
#> Total │ 131
#>
#> Chi-2(2) = 2.5, p = 0.282
#> Cramer's V = 0.14
#>
#> Crosstable: smoking x education (N) | sex x age_group = Male.65-75
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 14 36 38
#> Yes │ 7 9 12
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 21 45 50
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 88
#> Yes │ 28
#> ─────────────┼────────────
#> Total │ 116
#>
#> Chi-2(2) = 1.4, p = 0.499
#> Cramer's V = 0.11Ordinal variables
When both variables are ordered factors, cross_tab()
automatically switches from Cramer’s V to Kendall’s Tau-b:
cross_tab(sochealth, self_rated_health, education)
#> Crosstable: self_rated_health x education (N)
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ────────────────┼───────────────────────────────────────────────────────────
#> Poor │ 28 28 5
#> Fair │ 86 118 62
#> Good │ 102 263 193
#> Very good │ 44 118 133
#> ────────────────┼───────────────────────────────────────────────────────────
#> Total │ 260 527 393
#>
#> Values │ Total
#> ────────────────┼────────────
#> Poor │ 61
#> Fair │ 266
#> Good │ 558
#> Very good │ 295
#> ────────────────┼────────────
#> Total │ 1180
#>
#> Chi-2(6) = 73.2, p < 0.001
#> Kendall's Tau-b = 0.20You can override the automatic selection with
assoc_measure:
cross_tab(sochealth, self_rated_health, education, assoc_measure = "gamma")
#> Crosstable: self_rated_health x education (N)
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ────────────────┼───────────────────────────────────────────────────────────
#> Poor │ 28 28 5
#> Fair │ 86 118 62
#> Good │ 102 263 193
#> Very good │ 44 118 133
#> ────────────────┼───────────────────────────────────────────────────────────
#> Total │ 260 527 393
#>
#> Values │ Total
#> ────────────────┼────────────
#> Poor │ 61
#> Fair │ 266
#> Good │ 558
#> Very good │ 295
#> ────────────────┼────────────
#> Total │ 1180
#>
#> Chi-2(6) = 73.2, p < 0.001
#> Goodman-Kruskal Gamma = 0.31Confidence intervals for effect sizes
Add a 95% confidence interval for the association measure with
assoc_ci = TRUE:
cross_tab(sochealth, smoking, education, assoc_ci = TRUE)
#> Crosstable: smoking x education (N)
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 179 415 332
#> Yes │ 78 112 59
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 257 527 391
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 926
#> Yes │ 249
#> ─────────────┼────────────
#> Total │ 1175
#>
#> Chi-2(2) = 21.6, p < 0.001
#> Cramer's V = 0.14, 95% CI [0.08, 0.19]Weighted cross-tabulations
Weights work the same as in freq(). Without rescaling,
the table shows raw weighted counts:
cross_tab(sochealth, smoking, education, weights = weight)
#> Crosstable: smoking x education (N)
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 176 417 324
#> Yes │ 79 114 60
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 255 531 384
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 917
#> Yes │ 253
#> ─────────────┼────────────
#> Total │ 1170
#>
#> Chi-2(2) = 21.3, p < 0.001
#> Cramer's V = 0.13
#> Weight: weightWith rescale = TRUE, the weighted total matches the
unweighted sample size:
cross_tab(sochealth, smoking, education, weights = weight, rescale = TRUE)
#> Crosstable: smoking x education (N)
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 176 419 325
#> Yes │ 79 115 60
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 255 534 385
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 921
#> Yes │ 254
#> ─────────────┼────────────
#> Total │ 1175
#>
#> Chi-2(2) = 21.4, p < 0.001
#> Cramer's V = 0.13
#> Weight: weight (rescaled)Monte Carlo simulation
When expected cell counts are small, use simulated p-values:
cross_tab(sochealth, smoking, education,
simulate_p = TRUE, simulate_B = 5000)
#> Crosstable: smoking x education (N)
#>
#> Values │ Lower secondary Upper secondary Tertiary
#> ─────────────┼───────────────────────────────────────────────────────────
#> No │ 179 415 332
#> Yes │ 78 112 59
#> ─────────────┼───────────────────────────────────────────────────────────
#> Total │ 257 527 391
#>
#> Values │ Total
#> ─────────────┼────────────
#> No │ 926
#> Yes │ 249
#> ─────────────┼────────────
#> Total │ 1175
#>
#> Chi-2(NA) = 21.6, p < 0.001 (simulated)
#> Cramer's V = 0.14Data frame output
Set styled = FALSE to get a plain data frame for further
processing:
cross_tab(sochealth, smoking, education,
percent = "col", styled = FALSE)
#> Values Lower secondary Upper secondary Tertiary
#> 1 No 69.6 78.7 84.9
#> 2 Yes 30.4 21.3 15.1Setting global defaults
You can set package-wide defaults with options() so you
don’t have to repeat arguments:
options(
spicy.percent = "column",
spicy.simulate_p = TRUE,
spicy.rescale = TRUE
)Learn more
-
vignette("association-measures")- choosing the right effect size for your contingency table. -
vignette("table-categorical")- building publication-ready categorical tables. -
?freqand?cross_tabfor the full argument reference.