This session was recorded and uploaded on YouTube here:
In this session, we covered the {gtsummary} R package, which makes it easy to create tables that describe study datasets. To show off how to make tables, we’ll use the dataset provided in the {palmerpenguins} package. So first, let’s load up our packages!
library(tidyverse)
library(gtsummary)
library(palmerpenguins)
Let’s first take a look at the data:
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
There are 8 columns in the dataset, with a good variety of categorical (factor) and numeric data. The core function from {gtsummary} is tbl_summary().
penguins %>%
  tbl_summary()
Characteristic | N = 344¹ |
---|---|
species | |
Adelie | 152 (44%) |
Chinstrap | 68 (20%) |
Gentoo | 124 (36%) |
island | |
Biscoe | 168 (49%) |
Dream | 124 (36%) |
Torgersen | 52 (15%) |
bill_length_mm | 44.5 (39.2, 48.5) |
Unknown | 2 |
bill_depth_mm | 17.30 (15.60, 18.70) |
Unknown | 2 |
flipper_length_mm | 197 (190, 213) |
Unknown | 2 |
body_mass_g | 4,050 (3,550, 4,750) |
Unknown | 2 |
sex | |
female | 165 (50%) |
male | 168 (50%) |
Unknown | 11 |
year | |
2007 | 110 (32%) |
2008 | 114 (33%) |
2009 | 120 (35%) |
¹ n (%); Median (Q1, Q3)
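In the table above, year is summarised with counts and percentages even though it is stored as an integer. As far as I understand the defaults, {gtsummary} treats numeric columns with only a few distinct values as categorical. The sketch below checks the column classes and then, purely as an illustration that was not part of the session, uses the type argument to ask for a continuous summary of year. The object names (col_classes, n_years, tbl_year) are made up for this example.

```r
library(gtsummary)
library(palmerpenguins)

# Storage class of each column: factors, doubles ("numeric") and integers
col_classes <- sapply(penguins, class)
col_classes

# year is an integer column with only three distinct values
n_years <- length(unique(penguins$year))
n_years

# Assumption for illustration: override the default and summarise
# year as a continuous variable (median and quartiles) instead of counts
tbl_year <- tbl_summary(
  penguins,
  include = c(year, body_mass_g),
  type = list(year ~ "continuous")
)

tbl_year
```

Whether a continuous summary of year is useful is a separate question; the point is only that the summary type is a default that can be changed.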
You can see that it automatically checks all the columns and computes descriptive statistics suited to each data type (e.g. count and percent for categorical data, median and quartiles for numeric data). By default, it gives an overall summary for each column. If we want the summary split by a specific group, we use the by argument:
penguins %>%
  tbl_summary(
    by = species
  )
Characteristic | Adelie, N = 152¹ | Chinstrap, N = 68¹ | Gentoo, N = 124¹ |
---|---|---|---|
island | |||
Biscoe | 44 (29%) | 0 (0%) | 124 (100%) |
Dream | 56 (37%) | 68 (100%) | 0 (0%) |
Torgersen | 52 (34%) | 0 (0%) | 0 (0%) |
bill_length_mm | 38.8 (36.7, 40.8) | 49.6 (46.3, 51.2) | 47.3 (45.3, 49.6) |
Unknown | 1 | 0 | 1 |
bill_depth_mm | 18.40 (17.50, 19.00) | 18.45 (17.50, 19.40) | 15.00 (14.20, 15.70) |
Unknown | 1 | 0 | 1 |
flipper_length_mm | 190 (186, 195) | 196 (191, 201) | 216 (212, 221) |
Unknown | 1 | 0 | 1 |
body_mass_g | 3,700 (3,350, 4,000) | 3,700 (3,475, 3,950) | 5,000 (4,700, 5,500) |
Unknown | 1 | 0 | 1 |
sex | |||
female | 73 (50%) | 34 (50%) | 58 (49%) |
male | 73 (50%) | 34 (50%) | 61 (51%) |
Unknown | 6 | 0 | 5 |
year | |||
2007 | 50 (33%) | 26 (38%) | 34 (27%) |
2008 | 50 (33%) | 18 (26%) | 46 (37%) |
2009 | 52 (34%) | 24 (35%) | 44 (35%) |
¹ n (%); Median (Q1, Q3)
This creates a separate table column for each species in the dataset. Some variables also have an Unknown row, which counts the missing values for that variable. We can hide those rows by using the missing argument:
penguins %>%
  tbl_summary(
    by = species,
    missing = "no"
  )
Characteristic | Adelie, N = 152¹ | Chinstrap, N = 68¹ | Gentoo, N = 124¹ |
---|---|---|---|
island | |||
Biscoe | 44 (29%) | 0 (0%) | 124 (100%) |
Dream | 56 (37%) | 68 (100%) | 0 (0%) |
Torgersen | 52 (34%) | 0 (0%) | 0 (0%) |
bill_length_mm | 38.8 (36.7, 40.8) | 49.6 (46.3, 51.2) | 47.3 (45.3, 49.6) |
bill_depth_mm | 18.40 (17.50, 19.00) | 18.45 (17.50, 19.40) | 15.00 (14.20, 15.70) |
flipper_length_mm | 190 (186, 195) | 196 (191, 201) | 216 (212, 221) |
body_mass_g | 3,700 (3,350, 4,000) | 3,700 (3,475, 3,950) | 5,000 (4,700, 5,500) |
sex | |||
female | 73 (50%) | 34 (50%) | 58 (49%) |
male | 73 (50%) | 34 (50%) | 61 (51%) |
year | |||
2007 | 50 (33%) | 26 (38%) | 34 (27%) |
2008 | 50 (33%) | 18 (26%) | 46 (37%) |
2009 | 52 (34%) | 24 (35%) | 44 (35%) |
¹ n (%); Median (Q1, Q3)
That’s nicer! We can override the statistic shown by passing a list() to the statistic argument (details are described in the package documentation).
penguins %>%
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}")
  )
Characteristic | Adelie, N = 152¹ | Chinstrap, N = 68¹ | Gentoo, N = 124¹ |
---|---|---|---|
island | |||
Biscoe | 44 (29%) | 0 (0%) | 124 (100%) |
Dream | 56 (37%) | 68 (100%) | 0 (0%) |
Torgersen | 52 (34%) | 0 (0%) | 0 (0%) |
bill_length_mm | 38.8 2.7 | 48.8 3.3 | 47.5 3.1 |
bill_depth_mm | 18.35 1.22 | 18.42 1.14 | 14.98 0.98 |
flipper_length_mm | 190 7 | 196 7 | 217 6 |
body_mass_g | 3,701 459 | 3,733 384 | 5,076 504 |
sex | |||
female | 73 (50%) | 34 (50%) | 58 (49%) |
male | 73 (50%) | 34 (50%) | 61 (51%) |
year | |||
2007 | 50 (33%) | 26 (38%) | 34 (27%) |
2008 | 50 (33%) | 18 (26%) | 46 (37%) |
2009 | 52 (34%) | 24 (35%) | 44 (35%) |
¹ n (%); Mean SD
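The string given to statistic in the code above is a template, so the punctuation around each statistic can be changed too. As a small sketch that goes slightly beyond what was shown in the session, the version below puts the SD in parentheses and also sets a template for the categorical variables. The object name tbl_stats is made up for this example.

```r
library(gtsummary)
library(palmerpenguins)

tbl_stats <- tbl_summary(
  penguins,
  by = species,
  missing = "no",
  statistic = list(
    # mean with the SD in parentheses, so the two numbers are easier to tell apart
    all_continuous() ~ "{mean} ({sd})",
    # count out of the group total, followed by the percentage
    all_categorical() ~ "{n} / {N} ({p}%)"
  )
)

tbl_stats
```

The footnote under the table is generated from the template, so it should update to match whichever statistics are requested.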
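The names inside the {} are the statistics (functions) you want to compute, so here they are mean() and sd(). Anything outside the braces, such as the space between them, is printed as written. We can edit the labels used for the variables with the label argument.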
penguins %>%
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    label = list(bill_length_mm = "Bill length (mm)")
  )
Characteristic | Adelie, N = 152¹ | Chinstrap, N = 68¹ | Gentoo, N = 124¹ |
---|---|---|---|
island | |||
Biscoe | 44 (29%) | 0 (0%) | 124 (100%) |
Dream | 56 (37%) | 68 (100%) | 0 (0%) |
Torgersen | 52 (34%) | 0 (0%) | 0 (0%) |
Bill length (mm) | 38.8 2.7 | 48.8 3.3 | 47.5 3.1 |
bill_depth_mm | 18.35 1.22 | 18.42 1.14 | 14.98 0.98 |
flipper_length_mm | 190 7 | 196 7 | 217 6 |
body_mass_g | 3,701 459 | 3,733 384 | 5,076 504 |
sex | |||
female | 73 (50%) | 34 (50%) | 58 (49%) |
male | 73 (50%) | 34 (50%) | 61 (51%) |
year | |||
2007 | 50 (33%) | 26 (38%) | 34 (27%) |
2008 | 50 (33%) | 18 (26%) | 46 (37%) |
2009 | 52 (34%) | 24 (35%) | 44 (35%) |
¹ n (%); Mean SD
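Only bill_length_mm was relabelled above, but the same list can hold a label for every variable. The sketch below uses the formula form (variable ~ "label") for the four measurement columns. The label text and the object name tbl_labelled are choices made for this example, not something from the session.

```r
library(gtsummary)
library(palmerpenguins)

tbl_labelled <- tbl_summary(
  penguins,
  by = species,
  missing = "no",
  statistic = list(all_continuous() ~ "{mean} {sd}"),
  # one entry per variable; variables not listed keep their default label
  label = list(
    bill_length_mm ~ "Bill length (mm)",
    bill_depth_mm ~ "Bill depth (mm)",
    flipper_length_mm ~ "Flipper length (mm)",
    body_mass_g ~ "Body mass (g)"
  )
)

tbl_labelled
```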
It’s often useful to know the sample size for each variable (the number of non-missing values), which we can add as a column by piping into add_n().
penguins %>%
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    label = list(bill_length_mm = "Bill length (mm)")
  ) %>%
  add_n()
Characteristic | N | Adelie, N = 152¹ | Chinstrap, N = 68¹ | Gentoo, N = 124¹ |
---|---|---|---|---|
island | 344 | | | |
Biscoe | | 44 (29%) | 0 (0%) | 124 (100%) |
Dream | | 56 (37%) | 68 (100%) | 0 (0%) |
Torgersen | | 52 (34%) | 0 (0%) | 0 (0%) |
Bill length (mm) | 342 | 38.8 2.7 | 48.8 3.3 | 47.5 3.1 |
bill_depth_mm | 342 | 18.35 1.22 | 18.42 1.14 | 14.98 0.98 |
flipper_length_mm | 342 | 190 7 | 196 7 | 217 6 |
body_mass_g | 342 | 3,701 459 | 3,733 384 | 5,076 504 |
sex | 333 | | | |
female | | 73 (50%) | 34 (50%) | 58 (49%) |
male | | 73 (50%) | 34 (50%) | 61 (51%) |
year | 344 | | | |
2007 | | 50 (33%) | 26 (38%) | 34 (27%) |
2008 | | 50 (33%) | 18 (26%) | 46 (37%) |
2009 | | 52 (34%) | 24 (35%) | 44 (35%) |
¹ n (%); Mean SD
Without the by argument, we get the overall values for all the data. With the by argument, that overall column is replaced by one column per group. We can add it back with the add_overall() function.
penguins %>%
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    label = list(bill_length_mm = "Bill length (mm)")
  ) %>%
  add_n() %>%
  add_overall()
Characteristic | N | Overall, N = 344¹ | Adelie, N = 152¹ | Chinstrap, N = 68¹ | Gentoo, N = 124¹ |
---|---|---|---|---|---|
island | 344 | | | | |
Biscoe | | 168 (49%) | 44 (29%) | 0 (0%) | 124 (100%) |
Dream | | 124 (36%) | 56 (37%) | 68 (100%) | 0 (0%) |
Torgersen | | 52 (15%) | 52 (34%) | 0 (0%) | 0 (0%) |
Bill length (mm) | 342 | 43.9 5.5 | 38.8 2.7 | 48.8 3.3 | 47.5 3.1 |
bill_depth_mm | 342 | 17.15 1.97 | 18.35 1.22 | 18.42 1.14 | 14.98 0.98 |
flipper_length_mm | 342 | 201 14 | 190 7 | 196 7 | 217 6 |
body_mass_g | 342 | 4,202 802 | 3,701 459 | 3,733 384 | 5,076 504 |
sex | 333 | | | | |
female | | 165 (50%) | 73 (50%) | 34 (50%) | 58 (49%) |
male | | 168 (50%) | 73 (50%) | 34 (50%) | 61 (51%) |
year | 344 | | | | |
2007 | | 110 (32%) | 50 (33%) | 26 (38%) | 34 (27%) |
2008 | | 114 (33%) | 50 (33%) | 18 (26%) | 46 (37%) |
2009 | | 120 (35%) | 52 (34%) | 24 (35%) | 44 (35%) |
¹ n (%); Mean SD
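In the table above, the Overall column sits before the species columns. If you would rather have it at the end, add_overall() has a last argument. The sketch below is a small variation on the session code, written with the base R pipe (|>, R 4.1 or later) so it only needs {gtsummary} and {palmerpenguins}. The %>% pipe used elsewhere in this post should work the same way. The object name tbl_overall_last is made up for this example.

```r
library(gtsummary)
library(palmerpenguins)

tbl_overall_last <- penguins |>
  tbl_summary(
    by = species,
    missing = "no",
    statistic = list(all_continuous() ~ "{mean} {sd}"),
    label = list(bill_length_mm ~ "Bill length (mm)")
  ) |>
  # column with the number of non-missing values per variable
  add_n() |>
  # overall column placed after the species columns instead of before them
  add_overall(last = TRUE)

tbl_overall_last
```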
We almost have a table ready for including in a paper or report!