Making descriptive statistics tables

This session was recorded and uploaded on YouTube here:

In this session, we covered the {gtsummary} R package, which makes it easy to create tables that describe study datasets. To show off how to make tables, we’ll use the dataset provided in the {palmerpenguins} package. So first, let’s load up our packages!

library(tidyverse)
library(gtsummary)
library(palmerpenguins)

Let’s first take a look at the data:

penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
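As a quick check of what kind of data each column holds, here is a minimal sketch using base R's sapply() and class(). The object name col_types is just an illustrative choice.

```r
library(palmerpenguins)

# Class of each column: factors hold categories, integer/numeric hold numbers
col_types <- sapply(penguins, class)
col_types
```

Knowing the column types helps explain which summary statistic each row of a table gets.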

There are 8 columns in the dataset, with a good mix of categorical (factor) and numeric data. The core function from {gtsummary} is the tbl_summary() function.

penguins %>% 
  tbl_summary()

Characteristic	N = 344¹
species
Adelie	152 (44%)
Chinstrap	68 (20%)
Gentoo	124 (36%)
island
Biscoe	168 (49%)
Dream	124 (36%)
Torgersen	52 (15%)
bill_length_mm	44.5 (39.2, 48.5)
Unknown	2
bill_depth_mm	17.30 (15.60, 18.70)
Unknown	2
flipper_length_mm	197 (190, 213)
Unknown	2
body_mass_g	4,050 (3,550, 4,750)
Unknown	2
sex
female	165 (50%)
male	168 (50%)
Unknown	11
year
2007	110 (32%)
2008	114 (33%)
2009	120 (35%)
¹ n (%); Median (Q1, Q3)
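To get a feel for what the table is reporting, here is a rough sketch that recomputes a few of the same numbers by hand with {dplyr}. The object names species_counts and bill_summary are only illustrative, and the quartiles may differ slightly from the table depending on the quantile definition used.

```r
library(dplyr)
library(palmerpenguins)

# Counts and rounded percentages per species, like the categorical rows
species_counts <- penguins %>%
  count(species) %>%
  mutate(percent = round(100 * n / sum(n)))

# Median, quartiles, and number of missing values for one continuous column
bill_summary <- penguins %>%
  summarise(
    median = median(bill_length_mm, na.rm = TRUE),
    q1 = quantile(bill_length_mm, 0.25, na.rm = TRUE),
    q3 = quantile(bill_length_mm, 0.75, na.rm = TRUE),
    unknown = sum(is.na(bill_length_mm))
  )

species_counts
bill_summary
```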

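You can see that it automatically checks all the columns and computes descriptive statistics suited to each data type (e.g. count and percent for categorical data, median and quartiles for continuous data). Note that year is stored as an integer but has only three distinct values, so it is summarised as a categorical variable. By default, each column is summarised across the whole dataset. But if we want to summarise by a specific group, we use the by argument: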

penguins %>% 
  tbl_summary(
    by = species
  )

Characteristic	Adelie N = 152¹	Chinstrap N = 68¹	Gentoo N = 124¹
island
Biscoe	44 (29%)	0 (0%)	124 (100%)
Dream	56 (37%)	68 (100%)	0 (0%)
Torgersen	52 (34%)	0 (0%)	0 (0%)
bill_length_mm	38.8 (36.7, 40.8)	49.6 (46.3, 51.2)	47.3 (45.3, 49.6)
Unknown	1	0	1
bill_depth_mm	18.40 (17.50, 19.00)	18.45 (17.50, 19.40)	15.00 (14.20, 15.70)
Unknown	1	0	1
flipper_length_mm	190 (186, 195)	196 (191, 201)	216 (212, 221)
Unknown	1	0	1
body_mass_g	3,700 (3,350, 4,000)	3,700 (3,475, 3,950)	5,000 (4,700, 5,500)
Unknown	1	0	1
sex
female	73 (50%)	34 (50%)	58 (49%)
male	73 (50%)	34 (50%)	61 (51%)
Unknown	6	0	5
year
2007	50 (33%)	26 (38%)	34 (27%)
2008	50 (33%)	18 (26%)	46 (37%)
2009	52 (34%)	24 (35%)	44 (35%)
¹ n (%); Median (Q1, Q3)

This creates one table column for each species in the dataset. Some variables also have an Unknown row, which counts the missing values in that variable. We can hide those rows by setting the missing argument to "no":

penguins %>% 
  tbl_summary(
    by = species,
    missing = "no"
  )

Characteristic	Adelie N = 152¹	Chinstrap N = 68¹	Gentoo N = 124¹
island
Biscoe	44 (29%)	0 (0%)	124 (100%)
Dream	56 (37%)	68 (100%)	0 (0%)
Torgersen	52 (34%)	0 (0%)	0 (0%)
bill_length_mm	38.8 (36.7, 40.8)	49.6 (46.3, 51.2)	47.3 (45.3, 49.6)
bill_depth_mm	18.40 (17.50, 19.00)	18.45 (17.50, 19.40)	15.00 (14.20, 15.70)
flipper_length_mm	190 (186, 195)	196 (191, 201)	216 (212, 221)
body_mass_g	3,700 (3,350, 4,000)	3,700 (3,475, 3,950)	5,000 (4,700, 5,500)
sex
female	73 (50%)	34 (50%)	58 (49%)
male	73 (50%)	34 (50%)	61 (51%)
year
2007	50 (33%)	26 (38%)	34 (27%)
2008	50 (33%)	18 (26%)	46 (37%)
2009	52 (34%)	24 (35%)	44 (35%)
¹ n (%); Median (Q1, Q3)
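Hiding the missing rows is not the only option. As a sketch of the alternatives, the missing argument also accepts "always" (show a missing row for every variable), and missing_text changes the row label. The label "Missing" and the object name tbl_missing are just example choices.

```r
library(dplyr)
library(gtsummary)
library(palmerpenguins)

tbl_missing <- penguins %>%
  tbl_summary(
    by = species,
    missing = "always",       # show a missing-count row for every variable
    missing_text = "Missing"  # row label to use instead of "Unknown"
  )

tbl_missing
```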

That’s nicer! We can override the default statistic with the statistic argument by giving it a list() (details are described in the package’s function documentation).

penguins %>% 
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}")
  )

Characteristic	Adelie N = 152¹	Chinstrap N = 68¹	Gentoo N = 124¹
island
Biscoe	44 (29%)	0 (0%)	124 (100%)
Dream	56 (37%)	68 (100%)	0 (0%)
Torgersen	52 (34%)	0 (0%)	0 (0%)
bill_length_mm	38.8 2.7	48.8 3.3	47.5 3.1
bill_depth_mm	18.35 1.22	18.42 1.14	14.98 0.98
flipper_length_mm	190 7	196 7	217 6
body_mass_g	3,701 459	3,733 384	5,076 504
sex
female	73 (50%)	34 (50%)	58 (49%)
male	73 (50%)	34 (50%)	61 (51%)
year
2007	50 (33%)	26 (38%)	34 (27%)
2008	50 (33%)	18 (26%)	46 (37%)
2009	52 (34%)	24 (35%)	44 (35%)
¹ n (%); Mean SD
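Anything outside the {} in the statistic string is printed as-is, so punctuation can be added around the values. The following is a sketch of one common layout, with the standard deviation in parentheses and a custom format for the categorical variables. The exact strings and the object name tbl_stats are just example choices.

```r
library(dplyr)
library(gtsummary)
library(palmerpenguins)

tbl_stats <- penguins %>%
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(
      # continuous variables: mean with the SD in parentheses
      all_continuous() ~ "{mean} ({sd})",
      # categorical variables: count, group total, and percent
      all_categorical() ~ "{n} / {N} ({p}%)"
    )
  )

tbl_stats
```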

The names in between the {} are the statistics you want to compute, given as function names. So here, the functions are mean() and sd(). We can edit the labels used for the variables with the label argument.

penguins %>% 
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    label = list(bill_length_mm = "Bill length (mm)")
  )

Characteristic	Adelie N = 152¹	Chinstrap N = 68¹	Gentoo N = 124¹
island
Biscoe	44 (29%)	0 (0%)	124 (100%)
Dream	56 (37%)	68 (100%)	0 (0%)
Torgersen	52 (34%)	0 (0%)	0 (0%)
Bill length (mm)	38.8 2.7	48.8 3.3	47.5 3.1
bill_depth_mm	18.35 1.22	18.42 1.14	14.98 0.98
flipper_length_mm	190 7	196 7	217 6
body_mass_g	3,701 459	3,733 384	5,076 504
sex
female	73 (50%)	34 (50%)	58 (49%)
male	73 (50%)	34 (50%)	61 (51%)
year
2007	50 (33%)	26 (38%)	34 (27%)
2008	50 (33%)	18 (26%)	46 (37%)
2009	52 (34%)	24 (35%)	44 (35%)
¹ n (%); Mean SD
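The same list can hold a label for every variable you want to rename. Here is a sketch that relabels all of the measurement columns. The label text and the object name tbl_labelled are just our own choices.

```r
library(dplyr)
library(gtsummary)
library(palmerpenguins)

tbl_labelled <- penguins %>%
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    # one entry per variable to relabel
    label = list(
      bill_length_mm = "Bill length (mm)",
      bill_depth_mm = "Bill depth (mm)",
      flipper_length_mm = "Flipper length (mm)",
      body_mass_g = "Body mass (g)"
    )
  )

tbl_labelled
```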

It’s often useful to know the sample size for each variable (that is, how many values are not missing), which we can add as a column by piping into add_n().

penguins %>% 
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    label = list(bill_length_mm = "Bill length (mm)")
  ) %>% 
  add_n()

Characteristic	N	Adelie N = 152¹	Chinstrap N = 68¹	Gentoo N = 124¹
island	344
Biscoe		44 (29%)	0 (0%)	124 (100%)
Dream		56 (37%)	68 (100%)	0 (0%)
Torgersen		52 (34%)	0 (0%)	0 (0%)
Bill length (mm)	342	38.8 2.7	48.8 3.3	47.5 3.1
bill_depth_mm	342	18.35 1.22	18.42 1.14	14.98 0.98
flipper_length_mm	342	190 7	196 7	217 6
body_mass_g	342	3,701 459	3,733 384	5,076 504
sex	333
female		73 (50%)	34 (50%)	58 (49%)
male		73 (50%)	34 (50%)	61 (51%)
year	344
2007		50 (33%)	26 (38%)	34 (27%)
2008		50 (33%)	18 (26%)	46 (37%)
2009		52 (34%)	24 (35%)	44 (35%)
¹ n (%); Mean SD
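The N column counts the non-missing values in each variable. As a quick cross-check, the same counts can be computed with base R. The object name n_complete is only illustrative.

```r
library(palmerpenguins)

# Number of non-missing values in each column
n_complete <- colSums(!is.na(penguins))
n_complete
```

The values for island, bill_length_mm, and sex should match the 344, 342, and 333 shown in the table above.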

Without using the by argument, we get the overall values for all the data. But with the by argument, that gets removed. We can add it back with the add_overall() function.

penguins %>% 
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    label = list(bill_length_mm = "Bill length (mm)")
  ) %>% 
  add_n() %>% 
  add_overall()

Characteristic	N	Overall N = 344¹	Adelie N = 152¹	Chinstrap N = 68¹	Gentoo N = 124¹
island	344
Biscoe		168 (49%)	44 (29%)	0 (0%)	124 (100%)
Dream		124 (36%)	56 (37%)	68 (100%)	0 (0%)
Torgersen		52 (15%)	52 (34%)	0 (0%)	0 (0%)
Bill length (mm)	342	43.9 5.5	38.8 2.7	48.8 3.3	47.5 3.1
bill_depth_mm	342	17.15 1.97	18.35 1.22	18.42 1.14	14.98 0.98
flipper_length_mm	342	201 14	190 7	196 7	217 6
body_mass_g	342	4,202 802	3,701 459	3,733 384	5,076 504
sex	333
female		165 (50%)	73 (50%)	34 (50%)	58 (49%)
male		168 (50%)	73 (50%)	34 (50%)	61 (51%)
year	344
2007		110 (32%)	50 (33%)	26 (38%)	34 (27%)
2008		114 (33%)	50 (33%)	18 (26%)	46 (37%)
2009		120 (35%)	52 (34%)	24 (35%)	44 (35%)
¹ n (%); Mean SD

We almost have a table ready to include in a paper or report!
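As one possible finishing step (a sketch, not something covered in the session), the variable labels can be bolded with bold_labels(), a caption added with modify_caption(), and the result converted to a {gt} table with as_gt() for further styling. The caption text and the object names final_tbl and gt_tbl are made up for illustration.

```r
library(dplyr)
library(gtsummary)
library(palmerpenguins)

final_tbl <- penguins %>%
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    label = list(bill_length_mm = "Bill length (mm)")
  ) %>%
  add_n() %>%
  add_overall() %>%
  bold_labels() %>%  # bold the variable names in the first column
  modify_caption("Characteristics of penguins by species")  # example caption

# Convert to a gt table, e.g. for extra styling with the {gt} package
gt_tbl <- as_gt(final_tbl)
gt_tbl
```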