spicy is an R package for descriptive statistics and data analysis, designed for data science and survey research workflows. It covers variable inspection, frequency tables, cross-tabulations with chi-squared tests and effect sizes, and publication-ready summary tables, offering functionality similar to Stata or SPSS but within a tidyverse-friendly R environment. This vignette walks through the core workflow using the bundled sochealth dataset, a simulated social-health survey with 1200 respondents and 24 variables.

Inspect your data

varlist() (or its shortcut vl()) gives a compact overview of every variable in a data frame: name, label, representative values, class, number of distinct values, valid observations, and missing values. In RStudio or Positron, calling varlist() without arguments opens an interactive viewer; this is the most common usage in practice. Here we use tbl = TRUE to produce static output for the vignette:

varlist(sochealth, tbl = TRUE)
#> # A tibble: 24 × 7
#>    Variable          Label                 Values Class N_distinct N_valid   NAs
#>    <chr>             <chr>                 <chr>  <chr>      <int>   <int> <int>
#>  1 sex               Sex                   Femal… fact…          2    1200     0
#>  2 age               Age (years)           25, 2… nume…         51    1200     0
#>  3 age_group         Age group             25-34… orde…          4    1200     0
#>  4 education         Highest education le… Lower… orde…          3    1200     0
#>  5 social_class      Subjective social cl… Lower… orde…          5    1200     0
#>  6 region            Region of residence   Centr… fact…          6    1200     0
#>  7 employment_status Employment status     Emplo… fact…          4    1200     0
#>  8 income_group      Household income gro… Low, … orde…          4    1182    18
#>  9 income            Monthly household in… 1000,… nume…       1052    1200     0
#> 10 smoking           Current smoker        No, Y… fact…          2    1175    25
#> # ℹ 14 more rows

You can also select specific columns with tidyselect syntax:

varlist(sochealth, starts_with("bmi"), income, weight, tbl = TRUE)
#> # A tibble: 4 × 7
#>   Variable     Label                       Values Class N_distinct N_valid   NAs
#>   <chr>        <chr>                       <chr>  <chr>      <int>   <int> <int>
#> 1 bmi          Body mass index             16, 1… nume…        177    1188    12
#> 2 bmi_category BMI category                Norma… orde…          3    1188    12
#> 3 income       Monthly household income (… 1000,… nume…       1052    1200     0
#> 4 weight       Survey design weight        0.294… nume…        794    1200     0
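The columns of this overview correspond to simple per-variable summaries. As a rough sketch in base R, using a small made-up data frame (`toy` and `overview()` are hypothetical names, not part of spicy), the class, distinct-value count, valid count, and missing count could be derived as follows. Whether varlist() counts NA as a distinct value is not stated here, so the sketch excludes it:

```r
# Hypothetical toy data; not the sochealth dataset.
toy <- data.frame(
  sex    = factor(c("Female", "Male", "Female", NA)),
  age    = c(25, 31, 31, 47),
  smoker = c("No", "Yes", NA, NA)
)

# Illustrative helper (not a spicy function): one row per variable.
overview <- function(df) {
  data.frame(
    Variable   = names(df),
    Class      = vapply(df, function(x) paste(class(x), collapse = ", "), character(1)),
    N_distinct = vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1)),
    N_valid    = vapply(df, function(x) sum(!is.na(x)), integer(1)),
    NAs        = vapply(df, function(x) sum(is.na(x)), integer(1)),
    row.names  = NULL
  )
}

overview(toy)
```

For every variable, N_valid and NAs add up to the number of rows, which is a quick consistency check on the real output as well (for example 1182 + 18 = 1200 for income_group).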

Frequency tables

freq() produces frequency tables with counts, percentages, and (optionally) valid and cumulative percentages.

freq(sochealth, education)
#> Frequency table: education
#> 
#>  Category    Values               Freq.    Percent 
#> ────────────┼───────────────────────────────────────
#>  Valid       Lower secondary        261       21.8 
#>              Upper secondary        539       44.9 
#>              Tertiary               400       33.3 
#> ────────────┼───────────────────────────────────────
#>  Total                             1200      100.0 
#> 
#> Label: Highest education level
#> Class: ordered, factor
#> Data: sochealth
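To make the different percentage columns concrete: in the usual convention, the plain percentage uses all rows as the denominator, the valid percentage drops missing values, and the cumulative percentage is a running total of the valid percentages. A minimal base R sketch on made-up data (the vector `x` is hypothetical; freq() may differ in rounding and layout):

```r
# Hypothetical responses with two missing values.
x <- factor(c("Low", "Mid", "Mid", "High", "Mid", NA, "Low", NA),
            levels = c("Low", "Mid", "High"))

counts <- table(x, useNA = "ifany")          # includes an NA category
valid  <- table(x)                           # drops NA

percent       <- 100 * counts / sum(counts)  # denominator: all rows
valid_percent <- 100 * valid / sum(valid)    # denominator: non-missing rows
cum_percent   <- cumsum(valid_percent)       # running total over valid categories

round(percent, 1)
round(valid_percent, 1)
round(cum_percent, 1)
```

When a variable has no missing values, as with education above, the percent and valid percent columns coincide.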

Weighted frequencies use the weights argument. With rescale = TRUE, the total weighted N matches the unweighted N:

freq(sochealth, education, weights = weight, rescale = TRUE)
#> Frequency table: education
#> 
#>  Category    Values               Freq.    Percent 
#> ────────────┼───────────────────────────────────────
#>  Valid       Lower secondary        259       21.6 
#>              Upper secondary        546       45.5 
#>              Tertiary               395       32.9 
#> ────────────┼───────────────────────────────────────
#>  Total                             1200      100.0 
#> 
#> Label: Highest education level
#> Class: ordered, factor
#> Data: sochealth
#> Weight: weight (rescaled)
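A rough sketch of what rescaling does, under the assumption (about the general convention, not about spicy's internals) that the weights are multiplied by a constant so that they sum to the number of observations. The data and weights below are made up:

```r
# Hypothetical data: a category and a design weight per respondent.
education <- factor(c("Lower", "Upper", "Upper", "Tertiary", "Upper", "Tertiary"),
                    levels = c("Lower", "Upper", "Tertiary"))
w <- c(0.5, 1.5, 2.0, 1.0, 0.8, 2.2)

# Rescale so the weights sum to the unweighted N.
w_rescaled <- w * length(w) / sum(w)

# Weighted counts are sums of weights within each category.
weighted_n   <- tapply(w_rescaled, education, sum)
weighted_pct <- 100 * weighted_n / sum(weighted_n)

sum(weighted_n)          # equals length(w)
round(weighted_pct, 1)
```

Under this convention, rescaling changes the weighted counts but leaves the percentages untouched, because every weight is multiplied by the same constant.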

Cross-tabulations

cross_tab() crosses two categorical variables. By default it shows counts, a chi-squared test, and Cramer’s V:

cross_tab(sochealth, smoking, education)
#> Crosstable: smoking x education (N)
#> 
#>  Values      Lower secondary    Upper secondary    Tertiary    Total 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  No                      179                415         332      926 
#>  Yes                      78                112          59      249 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  Total                   257                527         391     1175 
#> 
#> Chi-2(2) = 21.6, p <.001
#> Cramer's V = 0.14
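The test statistic and effect size can be cross-checked in base R from the counts printed above. This sketch uses stats::chisq.test() and the standard formula V = sqrt(chi2 / (N * (min(rows, cols) - 1))); it is an independent check, not a description of how cross_tab() computes its results:

```r
# Counts copied from the smoking x education table above.
tab <- matrix(
  c(179, 415, 332,
     78, 112,  59),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    smoking   = c("No", "Yes"),
    education = c("Lower secondary", "Upper secondary", "Tertiary")
  )
)

test <- chisq.test(tab, correct = FALSE)
chi2 <- unname(test$statistic)
n    <- sum(tab)
v    <- sqrt(chi2 / (n * (min(dim(tab)) - 1)))

round(chi2, 1)   # 21.6
round(v, 2)      # 0.14
```

Because the table has two rows, min(rows, cols) - 1 equals 1 and V reduces to sqrt(chi2 / N).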

Add percentages with percent:

cross_tab(sochealth, smoking, education, percent = "col")
#> Crosstable: smoking x education (Column %)
#> 
#>  Values      Lower secondary    Upper secondary    Tertiary    Total 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  No                     69.6               78.7        84.9     78.8 
#>  Yes                    30.4               21.3        15.1     21.2 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  Total                 100.0              100.0       100.0    100.0 
#>  N                       257                527         391     1175 
#> 
#> Chi-2(2) = 21.6, p <.001
#> Cramer's V = 0.14
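Column percentages divide each cell by its column total, so every education level sums to 100. As a sketch, the same figures can be recovered from the printed counts with base R's prop.table() (this only illustrates the arithmetic):

```r
# Counts copied from the smoking x education table.
tab <- matrix(
  c(179, 415, 332,
     78, 112,  59),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    smoking   = c("No", "Yes"),
    education = c("Lower secondary", "Upper secondary", "Tertiary")
  )
)

col_pct   <- 100 * prop.table(tab, margin = 2)   # each column sums to 100
total_pct <- 100 * rowSums(tab) / sum(tab)       # the "Total" column

round(col_pct, 1)
round(total_pct, 1)
```

The chi-squared test and Cramer's V are identical to the counts version, which is consistent with the statistics being computed on the counts rather than on the displayed percentages.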

Group by a third variable with by:

cross_tab(sochealth, smoking, education, by = sex)
#> Crosstable: smoking x education (N) | sex = Female
#> 
#>  Values      Lower secondary    Upper secondary    Tertiary    Total 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  No                       95                220         160      475 
#>  Yes                      38                 62          31      131 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  Total                   133                282         191      606 
#> 
#> Chi-2(2) = 7.1, p = .029
#> Cramer's V = 0.11
#> 
#> Crosstable: smoking x education (N) | sex = Male
#> 
#>  Values      Lower secondary    Upper secondary    Tertiary    Total 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  No                       84                195         172      451 
#>  Yes                      40                 50          28      118 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  Total                   124                245         200      569 
#> 
#> Chi-2(2) = 15.6, p <.001
#> Cramer's V = 0.17
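Conceptually, grouping by a third variable means running the same two-way analysis within each stratum. A base R sketch of that idea, storing the counts printed above in a three-dimensional array and applying the test per slice (again a cross-check, not spicy's implementation):

```r
# Counts copied from the two sub-tables above, filled column by column
# (No/Yes within each education level), first Female, then Male.
counts <- array(
  c(95, 38, 220, 62, 160, 31,
    84, 40, 195, 50, 172, 28),
  dim = c(2, 3, 2),
  dimnames = list(
    smoking   = c("No", "Yes"),
    education = c("Lower secondary", "Upper secondary", "Tertiary"),
    sex       = c("Female", "Male")
  )
)

# One chi-squared test and one Cramer's V per stratum.
stats <- apply(counts, 3, function(tab) {
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  c(chi2 = chi2, cramers_v = sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
})

round(stats, 2)
```

Summing the array over the sex dimension with apply(counts, c(1, 2), sum) gives back the pooled smoking x education table.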

When both variables are ordered factors, cross_tab() automatically selects an ordinal measure (Kendall’s Tau-b) instead of Cramer’s V:

cross_tab(sochealth, self_rated_health, education)
#> Crosstable: self_rated_health x education (N)
#> 
#>  Values         Lower secondary    Upper secondary    Tertiary    Total 
#> ─────────────┼──────────────────────────────────────────────────┼─────────
#>  Poor                        28                 28           5       61 
#>  Fair                        86                118          62      266 
#>  Good                       102                263         193      558 
#>  Very good                   44                118         133      295 
#> ─────────────┼──────────────────────────────────────────────────┼─────────
#>  Total                      260                527         393     1180 
#> 
#> Chi-2(6) = 73.2, p <.001
#> Kendall's Tau-b = 0.20
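As a cross-check, Tau-b can be recomputed from the printed counts by expanding the table back into one pair of ranks per respondent. The sketch relies on base R's cor(method = "kendall"), which applies the tie correction that defines Tau-b; the variable names are illustrative:

```r
# Counts copied from the self_rated_health x education table above.
tab <- matrix(
  c( 28,  28,   5,
     86, 118,  62,
    102, 263, 193,
     44, 118, 133),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    self_rated_health = c("Poor", "Fair", "Good", "Very good"),
    education         = c("Lower secondary", "Upper secondary", "Tertiary")
  )
)

# Expand the table to one (row rank, column rank) pair per respondent.
health_rank    <- rep(as.vector(row(tab)), times = as.vector(tab))
education_rank <- rep(as.vector(col(tab)), times = as.vector(tab))

tau_b <- cor(health_rank, education_rank, method = "kendall")
round(tau_b, 2)   # 0.2
```

The positive sign reflects the ordering of both factors: higher education levels go together with better self-rated health.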

Association measures

For a quick overview of all available association statistics, pass a contingency table to assoc_measures():

tbl <- xtabs(~ smoking + education, data = sochealth)
assoc_measures(tbl)
#> Measure                            Estimate     SE  CI lower  CI upper      p 
#> Cramer's V                            0.136     --     0.079     0.191  <.001 
#> Contingency Coefficient               0.134     --        --        --  <.001 
#> Lambda symmetric                      0.000  0.000     0.000     0.000     -- 
#> Lambda R|C                            0.000  0.000     0.000     0.000     -- 
#> Lambda C|R                            0.000  0.000     0.000     0.000     -- 
#> Goodman-Kruskal's Tau R|C             0.018  0.008     0.003     0.034   .023 
#> Goodman-Kruskal's Tau C|R             0.008  0.003     0.001     0.014   .022 
#> Uncertainty Coefficient symmetric     0.011  0.005     0.002     0.021   .021 
#> Uncertainty Coefficient R|C           0.018  0.008     0.003     0.032   .021 
#> Uncertainty Coefficient C|R           0.009  0.004     0.001     0.016   .021 
#> Goodman-Kruskal Gamma                -0.268  0.056    -0.378    -0.158  <.001 
#> Kendall's Tau-b                      -0.126  0.027    -0.180    -0.073  <.001 
#> Kendall's Tau-c                      -0.117  0.026    -0.167    -0.067  <.001 
#> Somers' D R|C                        -0.091  0.020    -0.131    -0.052  <.001 
#> Somers' D C|R                        -0.175  0.038    -0.249    -0.101  <.001
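The ordinal measures in the lower half of this table are all built from counts of concordant and discordant pairs. As an illustration, the following sketch recomputes Goodman-Kruskal Gamma, (P - Q) / (P + Q), from the printed smoking x education counts; `pair_counts()` is a hypothetical helper written for this example:

```r
# Counts copied from the smoking x education table (rows: No, Yes).
tab <- matrix(
  c(179, 415, 332,
     78, 112,  59),
  nrow = 2, byrow = TRUE
)

# Count concordant (P) and discordant (Q) pairs in a two-way table.
pair_counts <- function(tab) {
  nr <- nrow(tab)
  nc <- ncol(tab)
  P <- 0
  Q <- 0
  for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
      below <- seq_len(nr) > i
      P <- P + tab[i, j] * sum(tab[below, seq_len(nc) > j])  # below and to the right
      Q <- Q + tab[i, j] * sum(tab[below, seq_len(nc) < j])  # below and to the left
    }
  }
  c(P = P, Q = Q)
}

pq    <- pair_counts(tab)
gamma <- unname((pq["P"] - pq["Q"]) / (pq["P"] + pq["Q"]))
round(gamma, 3)   # -0.268
```

The negative sign follows from the category order: "Yes" comes after "No", and smoking becomes less frequent as education increases.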

Individual functions such as cramer_v(), gamma_gk(), or kendall_tau_b() return a scalar by default. Pass detail = TRUE for the confidence interval and p-value:

cramer_v(tbl, detail = TRUE)
#> Estimate  CI lower  CI upper      p
#>    0.136     0.079     0.191  <.001

Summary tables

table_categorical() covers grouped or one-way summary tables for categorical variables:

table_categorical(
  sochealth,
  select = c(smoking, physical_activity, dentist_12m),
  by = education,
  output = "tinytable"
)
Variable             Lower secondary Upper secondary        Tertiary           Total       p  Cramer's V
                           n       %       n       %       n       %       n       %
smoking                                                                                <.001         .14
    No                   179    69.6     415    78.7     332    84.9     926    78.8
    Yes                   78    30.4     112    21.3      59    15.1     249    21.2
physical_activity                                                                      <.001         .21
    No                   177    67.8     310    57.5     163    40.8     650    54.2
    Yes                   84    32.2     229    42.5     237    59.2     550    45.8
dentist_12m                                                                            <.001         .22
    No                   113    43.3     174    32.3      67    16.8     354    29.5
    Yes                  148    56.7     365    67.7     333    83.2     846    70.5

table_continuous() summarizes continuous variables, either overall or by a categorical by variable, and can also add group-comparison tests:

table_continuous(
  sochealth,
  select = c(bmi, life_sat_health),
  by = education
)
#> Descriptive statistics
#> 
#>  Variable                        Group              M     SD    Min    Max  
#> ────────────────────────────────┼────────────────────────────────────────────
#>  Body mass index                 Lower secondary  28.09  3.47  18.20  38.90 
#>                                  Upper secondary  26.02  3.43  16.00  37.10 
#>                                  Tertiary         24.39  3.52  16.00  33.00 
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  Satisfaction with health (1-5)  Lower secondary   2.71  1.20   1.00   5.00 
#>                                  Upper secondary   3.53  1.19   1.00   5.00 
#>                                  Tertiary          4.11  1.04   1.00   5.00 
#> 
#>  Variable                        Group            95% CI LL  95% CI UL   n  
#> ────────────────────────────────┼────────────────────────────────────────────
#>  Body mass index                 Lower secondary    27.66      28.51    260 
#>                                  Upper secondary    25.73      26.31    534 
#>                                  Tertiary           24.04      24.74    394 
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  Satisfaction with health (1-5)  Lower secondary     2.57       2.86    259 
#>                                  Upper secondary     3.43       3.63    534 
#>                                  Tertiary            4.01       4.21    399 
#> 
#>  Variable                        Group              p   
#> ────────────────────────────────┼────────────────────────
#>  Body mass index                 Lower secondary  <.001 
#>                                  Upper secondary        
#>                                  Tertiary               
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  Satisfaction with health (1-5)  Lower secondary  <.001 
#>                                  Upper secondary        
#>                                  Tertiary
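The descriptive columns can be approximated with a few lines of base R. The sketch below uses simulated data and assumes a t-based 95% confidence interval for the mean (M ± t × SD / sqrt(n)); the interval method is an assumption, and `toy` and `describe()` are hypothetical names:

```r
set.seed(42)
# Hypothetical data: a continuous outcome in three groups, with two NAs.
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 50)),
  y     = rnorm(150, mean = rep(c(28, 26, 24), each = 50), sd = 3.5)
)
toy$y[c(3, 77)] <- NA

# Illustrative helper (not a spicy function): M, SD, n and a t-based 95% CI.
describe <- function(x) {
  x  <- x[!is.na(x)]
  n  <- length(x)
  m  <- mean(x)
  s  <- sd(x)
  hw <- qt(0.975, df = n - 1) * s / sqrt(n)   # CI half-width
  c(M = m, SD = s, CI_LL = m - hw, CI_UL = m + hw, n = n)
}

summary_tbl <- t(sapply(split(toy$y, toy$group), describe))
round(summary_tbl, 2)
```

Missing values are dropped per variable, which would explain why n differs between outcomes in the table above (for example 260 versus 259 in the Lower secondary group).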

table_continuous_lm() covers the same reporting territory when you want to stay in a linear-model framework, for example with robust standard errors or case weights:

table_continuous_lm(
  sochealth,
  select = c(wellbeing_score, bmi),
  by = sex,
  vcov = "HC3"
)
#> Continuous outcomes by Sex
#> 
#>  Variable                       M (Female)  M (Male)  Δ (Male - Female) 
#> ───────────────────────────────┼─────────────────────────────────────────
#>  WHO-5 wellbeing index (0-100)    67.16      71.05          3.89        
#>  Body mass index                  25.69      26.20          0.51        
#> 
#>  Variable                       95% CI LL  95% CI UL    p     R²    n   
#> ───────────────────────────────┼─────────────────────────────────────────
#>  WHO-5 wellbeing index (0-100)    2.12       5.65     <.001  0.02  1200 
#>  Body mass index                  0.09       0.93      .018  0.00  1188
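To show what a linear-model framework with HC3 standard errors involves, here is a base R sketch on simulated data. It fits lm(), builds the HC3 sandwich estimator by hand, and forms a confidence interval for the group difference. The data are made up, and using the residual degrees of freedom for the t quantile is an assumption, not a documented detail of table_continuous_lm():

```r
set.seed(123)
# Hypothetical data: a continuous outcome by a two-level factor.
sex <- factor(rep(c("Female", "Male"), times = c(60, 40)))
y   <- rnorm(100, mean = ifelse(sex == "Male", 71, 67), sd = 15)

fit <- lm(y ~ sex)
X   <- model.matrix(fit)
e   <- residuals(fit)
h   <- hatvalues(fit)

# HC3 sandwich: (X'X)^-1 X' diag(e^2 / (1 - h)^2) X (X'X)^-1
bread    <- solve(crossprod(X))
meat     <- crossprod(X * (e / (1 - h)))
vcov_hc3 <- bread %*% meat %*% bread

delta <- unname(coef(fit)["sexMale"])             # Male - Female mean difference
se    <- sqrt(vcov_hc3["sexMale", "sexMale"])     # robust standard error
ci    <- delta + c(-1, 1) * qt(0.975, df.residual(fit)) * se

c(delta = delta, se = se, ci_ll = ci[1], ci_ul = ci[2])
```

With a single two-level predictor, the coefficient equals the difference in group means, and the HC3 standard error simplifies to sqrt(s1^2 / (n1 - 1) + s2^2 / (n2 - 1)), so it does not rely on equal variances across groups.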

For detailed guidance, see the dedicated articles on table_categorical(), table_continuous(), table_continuous_lm(), and the final reporting overview for APA-style summary tables.

Row-wise summaries

mean_n(), sum_n(), and count_n() compute row-wise statistics across selected columns, with automatic handling of missing values.

sochealth |>
  dplyr::mutate(
    mean_sat  = mean_n(select = starts_with("life_sat")),
    sum_sat   = sum_n(select = starts_with("life_sat"), min_valid = 2),
    n_missing = count_n(select = starts_with("life_sat"), special = "NA")
  ) |>
  dplyr::select(starts_with("life_sat"), mean_sat, sum_sat, n_missing) |>
  head() |>
  as.data.frame()
#>   life_sat_health life_sat_work life_sat_relationships life_sat_standard
#> 1               5             3                      5                 5
#> 2               4             4                      5                 5
#> 3               3             2                      5                 3
#> 4               3             4                      3                 2
#> 5               4             5                      4                 4
#> 6               5             5                      5                 3
#>   mean_sat sum_sat n_missing
#> 1     4.50      18         0
#> 2     4.50      18         0
#> 3     3.25      13         0
#> 4     3.00      12         0
#> 5     4.25      17         0
#> 6     4.50      18         0
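The example data above contain no missing values in the first six rows, so the effect of min_valid is not visible. The following base R sketch illustrates the idea on made-up items, assuming that min_valid is the minimum number of non-missing values a row needs before a result is returned (an interpretation based on the argument name). `row_mean_min_valid()` is a hypothetical helper:

```r
# Hypothetical item responses with missing values.
items <- data.frame(
  q1 = c(5, 4, NA, NA),
  q2 = c(3, NA, NA, 2),
  q3 = c(5, 5, 4, NA)
)

# Illustrative helper (not a spicy function): row mean over valid values,
# set to NA when fewer than `min_valid` values are present.
row_mean_min_valid <- function(df, min_valid = 1) {
  n_valid <- rowSums(!is.na(df))
  out <- rowMeans(df, na.rm = TRUE)
  out[n_valid < min_valid] <- NA
  out
}

row_mean_min_valid(items, min_valid = 2)
rowSums(is.na(items))   # number of missing items per row
```

Here the second row still gets a mean from its two valid answers, while the last two rows, each with a single valid answer, are set to NA.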

Learn more

  • See ?varlist to inspect variables, labels, values, and missing data.
  • See ?cross_tab for the full list of arguments (weights, simulation, association measures).
  • See ?table_categorical for grouped or one-way categorical tables.
  • See ?table_continuous for continuous summaries and group comparisons.
  • See ?table_continuous_lm for model-based mean-comparison tables with robust standard errors or case weights.
  • See ?assoc_measures for the complete list of association statistics.
  • See ?code_book to generate an interactive HTML codebook.