Making descriptive statistics tables

Tips on making tables of descriptive statistics of your data.
quarto
tables
descriptive statistics
Authors

Stefania Noerman

Luke W. Johnston

Published

June 2, 2023

Modified

May 13, 2024

This session was recorded and uploaded to YouTube.

In this session, we covered the {gtsummary} R package, which makes it easy to create tables that describe study datasets. To show how to make these tables, we’ll use the dataset provided in the {palmerpenguins} package. So first, let’s load up our packages!

library(tidyverse)
library(gtsummary)
library(palmerpenguins) 

Let’s first take a look at the data:

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
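The printout above hides the last two columns, so it can help to list the type of every column before summarising. A minimal sketch, assuming the packages loaded above; column_types is just an illustrative name:

```r
library(tidyverse)
library(palmerpenguins)

# Get the (first) class of every column, including the two hidden in the print.
column_types <- map_chr(penguins, ~ class(.x)[1])
column_types

# Tally how many columns there are of each type.
table(column_types)
```

The column types matter because tbl_summary() picks its summary statistics based on them.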

The dataset has 8 columns, with a good mix of categorical (factor) and numeric data. The core function of {gtsummary} is tbl_summary().

penguins %>% 
  tbl_summary()
Characteristic      N = 344¹
species
    Adelie          152 (44%)
    Chinstrap       68 (20%)
    Gentoo          124 (36%)
island
    Biscoe          168 (49%)
    Dream           124 (36%)
    Torgersen       52 (15%)
bill_length_mm      44.5 (39.2, 48.5)
    Unknown         2
bill_depth_mm       17.30 (15.60, 18.70)
    Unknown         2
flipper_length_mm   197 (190, 213)
    Unknown         2
body_mass_g         4,050 (3,550, 4,750)
    Unknown         2
sex
    female          165 (50%)
    male            168 (50%)
    Unknown         11
year
    2007            110 (32%)
    2008            114 (33%)
    2009            120 (35%)
¹ n (%); Median (IQR)

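As the table shows, tbl_summary() checks every column and computes descriptive statistics suited to its data type: count and percent for categorical data, and median and interquartile range for continuous data. Because year has only three distinct values, it is summarised as categorical even though it is stored as an integer. By default, the statistics are computed over the whole dataset. To compute them within groups instead, use the by argument: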

penguins %>% 
  tbl_summary(
    by = species
  )
Characteristic      Adelie, N = 152¹       Chinstrap, N = 68¹     Gentoo, N = 124¹
island
    Biscoe          44 (29%)               0 (0%)                 124 (100%)
    Dream           56 (37%)               68 (100%)              0 (0%)
    Torgersen       52 (34%)               0 (0%)                 0 (0%)
bill_length_mm      38.8 (36.8, 40.8)      49.6 (46.4, 51.1)      47.3 (45.3, 49.6)
    Unknown         1                      0                      1
bill_depth_mm       18.40 (17.50, 19.00)   18.45 (17.50, 19.40)   15.00 (14.20, 15.70)
    Unknown         1                      0                      1
flipper_length_mm   190 (186, 195)         196 (191, 201)         216 (212, 221)
    Unknown         1                      0                      1
body_mass_g         3,700 (3,350, 4,000)   3,700 (3,488, 3,950)   5,000 (4,700, 5,500)
    Unknown         1                      0                      1
sex
    female          73 (50%)               34 (50%)               58 (49%)
    male            73 (50%)               34 (50%)               61 (51%)
    Unknown         6                      0                      5
year
    2007            50 (33%)               26 (38%)               34 (27%)
    2008            50 (33%)               18 (26%)               46 (37%)
    2009            52 (34%)               24 (35%)               44 (35%)
¹ n (%); Median (IQR)

This creates one table column for each species in the dataset. Some variables have an Unknown row, which counts their missing values. We can hide those rows with the missing argument:

penguins %>% 
  tbl_summary(
    by = species,
    missing = "no"
  )
Characteristic      Adelie, N = 152¹       Chinstrap, N = 68¹     Gentoo, N = 124¹
island
    Biscoe          44 (29%)               0 (0%)                 124 (100%)
    Dream           56 (37%)               68 (100%)              0 (0%)
    Torgersen       52 (34%)               0 (0%)                 0 (0%)
bill_length_mm      38.8 (36.8, 40.8)      49.6 (46.4, 51.1)      47.3 (45.3, 49.6)
bill_depth_mm       18.40 (17.50, 19.00)   18.45 (17.50, 19.40)   15.00 (14.20, 15.70)
flipper_length_mm   190 (186, 195)         196 (191, 201)         216 (212, 221)
body_mass_g         3,700 (3,350, 4,000)   3,700 (3,488, 3,950)   5,000 (4,700, 5,500)
sex
    female          73 (50%)               34 (50%)               58 (49%)
    male            73 (50%)               34 (50%)               61 (51%)
year
    2007            50 (33%)               26 (38%)               34 (27%)
    2008            50 (33%)               18 (26%)               46 (37%)
    2009            52 (34%)               24 (35%)               44 (35%)
¹ n (%); Median (IQR)

That’s nicer! We can override the default statistics with the statistic argument, which takes a list() of formulas (see the package documentation for details).

penguins %>% 
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}")
  )
Characteristic      Adelie, N = 152¹       Chinstrap, N = 68¹     Gentoo, N = 124¹
island
    Biscoe          44 (29%)               0 (0%)                 124 (100%)
    Dream           56 (37%)               68 (100%)              0 (0%)
    Torgersen       52 (34%)               0 (0%)                 0 (0%)
bill_length_mm      38.8 2.7               48.8 3.3               47.5 3.1
bill_depth_mm       18.35 1.22             18.42 1.14             14.98 0.98
flipper_length_mm   190 7                  196 7                  217 6
body_mass_g         3,701 459              3,733 384              5,076 504
sex
    female          73 (50%)               34 (50%)               58 (49%)
    male            73 (50%)               34 (50%)               61 (51%)
year
    2007            50 (33%)               26 (38%)               34 (27%)
    2008            50 (33%)               18 (26%)               46 (37%)
    2009            52 (34%)               24 (35%)               44 (35%)
¹ n (%); Mean SD
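To see where these numbers come from, the body_mass_g row can be reproduced by hand with {dplyr}. This is only a sketch for checking the table, not how {gtsummary} computes it internally; mass_by_species is an illustrative name:

```r
library(tidyverse)
library(palmerpenguins)

# Mean and standard deviation of body mass within each species,
# dropping missing values before computing.
mass_by_species <- penguins %>%
  group_by(species) %>%
  summarise(
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass = sd(body_mass_g, na.rm = TRUE)
  )
mass_by_species
```

Rounded to whole grams, these values should match the body_mass_g row in the table above.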

The names inside the {} are the functions you want to apply, so here they are mean() and sd(). Any text outside the braces, such as the space between them, is printed as-is. We can change the labels shown for the variables with the label argument.

penguins %>% 
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    label = list(bill_length_mm = "Bill length (mm)")
  )
Characteristic      Adelie, N = 152¹       Chinstrap, N = 68¹     Gentoo, N = 124¹
island
    Biscoe          44 (29%)               0 (0%)                 124 (100%)
    Dream           56 (37%)               68 (100%)              0 (0%)
    Torgersen       52 (34%)               0 (0%)                 0 (0%)
Bill length (mm)    38.8 2.7               48.8 3.3               47.5 3.1
bill_depth_mm       18.35 1.22             18.42 1.14             14.98 0.98
flipper_length_mm   190 7                  196 7                  217 6
body_mass_g         3,701 459              3,733 384              5,076 504
sex
    female          73 (50%)               34 (50%)               58 (49%)
    male            73 (50%)               34 (50%)               61 (51%)
year
    2007            50 (33%)               26 (38%)               34 (27%)
    2008            50 (33%)               18 (26%)               46 (37%)
    2009            52 (34%)               24 (35%)               44 (35%)
¹ n (%); Mean SD
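The same list() can hold a label for every variable we want to rename. A minimal sketch that extends the example above; the label texts other than "Bill length (mm)" are our own wording, and labelled_table is an illustrative name:

```r
library(tidyverse)
library(gtsummary)
library(palmerpenguins)

labelled_table <- penguins %>%
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    # One entry per variable: column name on the left, label on the right.
    label = list(
      bill_length_mm = "Bill length (mm)",
      bill_depth_mm = "Bill depth (mm)",
      flipper_length_mm = "Flipper length (mm)",
      body_mass_g = "Body mass (g)"
    )
  )
labelled_table
```

Variables left out of the list, such as island, sex, and year, keep their column names as labels.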

It’s often useful to know how many non-missing observations each variable has, which we can add as a column by piping into add_n().

penguins %>% 
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    label = list(bill_length_mm = "Bill length (mm)")
  ) %>% 
  add_n()
Characteristic      N     Adelie, N = 152¹       Chinstrap, N = 68¹     Gentoo, N = 124¹
island              344
    Biscoe                44 (29%)               0 (0%)                 124 (100%)
    Dream                 56 (37%)               68 (100%)              0 (0%)
    Torgersen             52 (34%)               0 (0%)                 0 (0%)
Bill length (mm)    342   38.8 2.7               48.8 3.3               47.5 3.1
bill_depth_mm       342   18.35 1.22             18.42 1.14             14.98 0.98
flipper_length_mm   342   190 7                  196 7                  217 6
body_mass_g         342   3,701 459              3,733 384              5,076 504
sex                 333
    female                73 (50%)               34 (50%)               58 (49%)
    male                  73 (50%)               34 (50%)               61 (51%)
year                344
    2007                  50 (33%)               26 (38%)               34 (27%)
    2008                  50 (33%)               18 (26%)               46 (37%)
    2009                  52 (34%)               24 (35%)               44 (35%)
¹ n (%); Mean SD
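The N column counts the observations that are not missing in each variable, which is why sex shows 333 rather than 344. As a quick check, the same counts can be computed with {dplyr}; this is a sketch for verifying the table, and n_non_missing is an illustrative name:

```r
library(tidyverse)
library(palmerpenguins)

# Count the non-missing values in every column of the dataset.
n_non_missing <- penguins %>%
  summarise(across(everything(), ~ sum(!is.na(.x))))
n_non_missing
```

Each value should match the N column above for the same variable.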

Without the by argument, we get the overall values for the whole dataset. With the by argument, that overall column is dropped. We can add it back with the add_overall() function.

penguins %>% 
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    label = list(bill_length_mm = "Bill length (mm)")
  ) %>% 
  add_n() %>% 
  add_overall()
Characteristic      N     Overall, N = 344¹      Adelie, N = 152¹       Chinstrap, N = 68¹     Gentoo, N = 124¹
island              344
    Biscoe                168 (49%)              44 (29%)               0 (0%)                 124 (100%)
    Dream                 124 (36%)              56 (37%)               68 (100%)              0 (0%)
    Torgersen             52 (15%)               52 (34%)               0 (0%)                 0 (0%)
Bill length (mm)    342   43.9 5.5               38.8 2.7               48.8 3.3               47.5 3.1
bill_depth_mm       342   17.15 1.97             18.35 1.22             18.42 1.14             14.98 0.98
flipper_length_mm   342   201 14                 190 7                  196 7                  217 6
body_mass_g         342   4,202 802              3,701 459              3,733 384              5,076 504
sex                 333
    female                165 (50%)              73 (50%)               34 (50%)               58 (49%)
    male                  168 (50%)              73 (50%)               34 (50%)               61 (51%)
year                344
    2007                  110 (32%)              50 (33%)               26 (38%)               34 (27%)
    2008                  114 (33%)              50 (33%)               18 (26%)               46 (37%)
    2009                  120 (35%)              52 (34%)               24 (35%)               44 (35%)
¹ n (%); Mean SD

We almost have a table ready for including in a paper or report!
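One possible way to finish it off is sketched below. It was not covered in the session, so treat it as an assumption: it bolds the variable labels, adds a caption, and converts the result to a {gt} table that can be saved to a file. The caption text and the file name in the comment are made up for illustration.

```r
library(tidyverse)
library(gtsummary)
library(palmerpenguins)

final_table <- penguins %>%
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    label = list(bill_length_mm = "Bill length (mm)")
  ) %>%
  add_n() %>%
  add_overall() %>%
  # Make the variable labels stand out from their levels.
  bold_labels() %>%
  # Caption text is our own wording.
  modify_caption("Descriptive statistics of the penguins, by species")

# Convert to a {gt} table, which can then be saved to a file.
final_gt <- as_gt(final_table)
# gt::gtsave(final_gt, "penguin-table.html")  # hypothetical file name
```

In a Quarto document, printing final_table in a code chunk is usually enough, since the table is rendered as part of the output.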