R in Action

Efficient data science with R

A demonstration by Md. Aminul Islam Shazid.

Grammar of graphics with ggplot2

Plots using grammar of graphics with ggplot2

  • ggplot2 is an R package that implements the grammar of graphics.
  • Can provide beautiful graphics with some simple building blocks.
  • Variables/features/columns are mapped to various elements of the plot called “aesthetics”, e.g., axes, colours, point size, line type, etc.
  • Then a geometry transforms that “aesthetic” mapping into a plot.
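
The two steps above (map columns to aesthetics, then let a geometry draw them) can be sketched as follows. This is a minimal illustration, not part of the original demonstration: it uses the built-in mtcars data so that it runs on its own, and the object names p, p_points and drawn are made up for the example.

```r
library(ggplot2)

# Step 1: map columns to aesthetics; no geometry yet, so no marks are drawn
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl)))

# Step 2: a geometry turns the mapping into marks (here, one point per row)
p_points <- p + geom_point()

# layer_data() returns the data that the first layer actually draws
drawn <- layer_data(p_points, 1)
head(drawn[, c("x", "y", "colour")])
```

Swapping geom_point() for another geometry reuses the same mapping, which is the main appeal of the approach.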

A simple example

library(ggplot2)
library(palmerpenguins)  # provides the penguins data set

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = flipper_length_mm)) +
    geom_point()

Adding a grouping variable

ggplot(penguins, 
       mapping = aes(x = bill_length_mm, 
                     y = flipper_length_mm, 
                     color = species)) + 
    geom_point()

Let’s add another dimension to the plot!

ggplot(penguins, 
       mapping = aes(x = bill_length_mm, 
                     y = flipper_length_mm, 
                     color = species, 
                     size = body_mass_g)) + 
    geom_point(alpha = 0.5)

Adding yet another dimension!

ggplot(penguins, 
       mapping = aes(x = bill_length_mm, 
                     y = flipper_length_mm, 
                     color = species, 
                     size = body_mass_g)) + 
    geom_point(alpha = 0.5) +
    facet_wrap(~island)

Comparing a variable across groups with boxplot

ggplot(penguins,
       mapping = aes(y = body_mass_g, 
                     x = species, 
                     fill = species)) +
    geom_boxplot(width = 0.2, show.legend = FALSE)

Violin plots as an alternative to boxplots

More informative: gives a sense of the density too!

ggplot(penguins,
       mapping = aes(y = body_mass_g, 
                     x = species, 
                     fill = species)) +
    geom_violin(width = 0.5, show.legend = FALSE) + 
    geom_boxplot(fill = "white", width = 0.1, show.legend = FALSE)

Bar diagrams

library(dplyr)  # for count()

penguins |> 
    count(island, species) |> 
    ggplot() + 
    aes(x = island, y = n, fill = species) + 
    geom_bar(stat = "identity", 
             position = position_dodge2(preserve = "single"))

Line chart

To show trend or evolution.

ggplot() + 
    aes(x = time(AirPassengers), y = AirPassengers) + 
    geom_line()

Line chart with a trend line!

LOESS smoother added as a trend line.

ggplot() + 
    aes(x = time(AirPassengers), y = AirPassengers) + 
    geom_line() + 
    geom_smooth()
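
As a rough sketch of what the trend line represents, a LOESS fit can also be computed directly with base R's loess(). The data frame ap and its column names are illustrative, and the loess() defaults are assumed to be close to, not necessarily identical to, what geom_smooth() uses for the plot.

```r
# Put the series into a data frame: decimal year and passenger count
ap <- data.frame(
    year = as.numeric(time(AirPassengers)),
    passengers = as.numeric(AirPassengers)
)

# Local regression with the loess() defaults
fit <- loess(passengers ~ year, data = ap)
ap$trend <- fitted(fit)

# Base-graphics version of the line chart plus trend line
plot(ap$year, ap$passengers, type = "l", xlab = "Year", ylab = "Passengers")
lines(ap$year, ap$trend, col = "blue", lwd = 2)
```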

Fast data exploration with DataExplorer

Basic info about a dataset

library(DataExplorer)
plot_intro(penguins)

Find missing values

plot_missing(penguins)
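
The numbers behind these two plots are also available as plain data frames, which can be handy for reports or automated checks. A small sketch, using the built-in airquality data only so that the snippet runs without extra data packages:

```r
library(DataExplorer)

# introduce() returns the counts that plot_intro() visualises
info <- introduce(airquality)
info[, c("rows", "columns", "total_missing_values", "complete_rows")]

# profile_missing() returns per-column missing counts, as drawn by plot_missing()
miss <- profile_missing(airquality)
miss[miss$num_missing > 0, ]
```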

Frequency distribution of all discrete variables

plot_bar(diamonds)

Frequency distribution by a discrete variable

plot_bar(diamonds, by = "cut")

Histogram of all continuous variables

plot_histogram(diamonds)

Kernel density of all continuous variables

plot_density(diamonds)

Boxplot

Boxplots of all continuous variables, grouped by a categorical variable

plot_boxplot(diamonds, by = "cut")

Scatterplot of one variable against all other continuous variables

plot_scatterplot(
    split_columns(diamonds)$continuous, 
    by = "price", 
    sampled_rows = 1000L
)

Quantile-quantile plot of all continuous variables

plot_qq(diamonds)

Correlogram

plot_correlation(split_columns(diamonds)$continuous)
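
To read off exact values rather than colours, the correlation matrix itself can be computed with base R's cor(). This sketch assumes Pearson correlation (the cor() default); whether plot_correlation() uses exactly the same settings is not checked here, and the names num_cols and cor_mat are illustrative.

```r
library(ggplot2)  # only for the diamonds data set

# Keep the numeric columns and compute pairwise Pearson correlations
num_cols <- diamonds[sapply(diamonds, is.numeric)]
cor_mat <- cor(num_cols)

# Correlations with price, strongest first
sort(cor_mat[, "price"], decreasing = TRUE)
```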

Publication ready tables with gtsummary

Table describing the sample

This is the so-called “Table 1”.

library(gtsummary)
tbl_summary(
    data = trial, 
    missing_text = "NA",
    include = c("age", "trt", "marker", "stage", "grade", "death")
    ) |> 
    bold_labels()
Characteristic            N = 200¹
Age                       47 (38, 57)
    NA                    11
Chemotherapy Treatment
    Drug A                98 (49%)
    Drug B                102 (51%)
Marker Level (ng/mL)      0.64 (0.22, 1.39)
    NA                    10
T Stage
    T1                    53 (27%)
    T2                    54 (27%)
    T3                    43 (22%)
    T4                    50 (25%)
Grade
    I                     68 (34%)
    II                    68 (34%)
    III                   64 (32%)
death                     112 (56%)
¹ Median (IQR); n (%)
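
The footnote “Median (IQR); n (%)” describes how each cell is summarised. As a sketch, the same kind of summary can be computed by hand with base R; the quartiles may differ slightly from the table if a different quantile definition is used, and the names n and pct are illustrative.

```r
library(gtsummary)  # only for the trial data set

# Continuous variable: median and quartiles, ignoring missing values
median(trial$age, na.rm = TRUE)
quantile(trial$age, probs = c(0.25, 0.75), na.rm = TRUE)
sum(is.na(trial$age))  # the "NA" row

# Categorical variable: counts and percentages
n <- table(trial$trt)
pct <- round(100 * prop.table(n))
paste0(n, " (", pct, "%)")
```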

Cross table

tbl_summary(
    data = trial, 
    include = c("age", "trt", "marker", "stage", "grade"),
    by = "death",
    percent = "row",
    missing_text = "NA",
    ) |> 
    add_p() |> 
    bold_p() |> 
    bold_labels() |> 
    modify_spanning_header(all_stat_cols() ~ "**Death**")
                          Death
Characteristic            No, N = 88¹          Yes, N = 112¹        p-value²
Age                       47 (36, 57)          48 (38, 58)          0.5
    NA                    2                    9
Chemotherapy Treatment                                              0.4
    Drug A                46 (47%)             52 (53%)
    Drug B                42 (41%)             60 (59%)
Marker Level (ng/mL)      0.73 (0.23, 1.33)    0.57 (0.20, 1.45)    0.6
    NA                    2                    8
T Stage                                                             0.004
    T1                    29 (55%)             24 (45%)
    T2                    27 (50%)             27 (50%)
    T3                    21 (49%)             22 (51%)
    T4                    11 (22%)             39 (78%)
Grade                                                               0.080
    I                     35 (51%)             33 (49%)
    II                    32 (47%)             36 (53%)
    III                   21 (33%)             43 (67%)
¹ Median (IQR); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test
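
The two tests named in the footnote are available in base R, so a single p-value can be checked by hand. This is only a sketch with default settings; add_p() may pass different options, so small differences from the table are possible. The names w and x2 are illustrative.

```r
library(gtsummary)  # only for the trial data set

# Continuous variable vs. a two-level group: Wilcoxon rank sum test
w <- wilcox.test(age ~ death, data = trial)
w$p.value

# Categorical variable vs. group: Pearson's chi-squared test
x2 <- chisq.test(table(trial$stage, trial$death))
x2$p.value
```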

Regression model summary table

gtsummary supports many different kinds of statistical models. Adding support for new models is also very easy.

logit_model <- glm(death ~ age + trt + marker + stage + grade, 
                   data = trial, family = binomial)
tbl_regression(
    logit_model,
    exponentiate = TRUE
    ) |> 
    bold_p() |> 
    bold_labels()
Characteristic            OR¹     95% CI¹       p-value
Age                       1.01    0.99, 1.03    0.3
Chemotherapy Treatment
    Drug A
    Drug B                1.42    0.76, 2.67    0.3
Marker Level (ng/mL)      0.95    0.65, 1.38    0.8
T Stage
    T1
    T2                    1.53    0.67, 3.55    0.3
    T3                    1.46    0.59, 3.67    0.4
    T4                    5.34    2.13, 14.3    <0.001
Grade
    I
    II                    1.07    0.49, 2.34    0.9
    III                   2.07    0.97, 4.52    0.062
¹ OR = Odds Ratio, CI = Confidence Interval
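
With exponentiate = TRUE the table reports odds ratios instead of log-odds coefficients. A minimal sketch of that transformation in base R follows. The intervals here are simple Wald intervals from confint.default(), which is an assumption made for the example; the table's bounds may be computed differently and so may not match exactly.

```r
library(gtsummary)  # only for the trial data set

logit_model <- glm(death ~ age + trt + marker + stage + grade,
                   data = trial, family = binomial)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios
or <- exp(coef(logit_model))

# Wald intervals on the log-odds scale, then exponentiated
ci <- exp(confint.default(logit_model))

round(cbind(OR = or, ci), 2)
```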

Decision tree classifier in R

Fitting a decision tree model

Classifying disease outcome using decision tree

library(tree)
# tree() fits a classification tree only when the response is a factor
tree1 <- tree(factor(death) ~ age + trt + marker + stage + grade, 
     data = trial)
plot(tree1)
text(tree1, pretty = 0)

Hierarchical clustering in R

Finding similar cars

library(colorhcplot)
d <- dist(mtcars)
hc1 <- hclust(d)
plot(hc1, hang = -1, cex = 0.8)
rect.hclust(hc1, k = 3)
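
The rectangles drawn by rect.hclust() correspond to cutting the tree into three groups. To get those group memberships as data, cutree() can be used. A short sketch (the name groups is illustrative):

```r
# Same clustering as above: Euclidean distances, hclust() default linkage
hc1 <- hclust(dist(mtcars))

# Cut the dendrogram into 3 groups; the result is a named integer vector
groups <- cutree(hc1, k = 3)
table(groups)

# Cars that fall in the same group as the Mazda RX4
names(groups)[groups == groups["Mazda RX4"]]
```

Because dist() is applied to the raw columns, variables with large ranges such as disp and hp dominate the distances; running dist(scale(mtcars)) instead is one common way to weight the variables more evenly.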

K-means clustering in R

Finding clusters of flowers in the iris dataset

library(factoextra)
km1 <- kmeans(iris[, 1:4], centers = 4, nstart = 1000)
fviz_cluster(km1, data = iris[, 1:4], geom = "point")
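
The number of clusters has to be chosen by the analyst. One common heuristic, sketched here with base R only, is to plot the total within-cluster sum of squares against k and look for an “elbow”. The range 1 to 6, nstart = 25 and the names wss and km3 are assumptions made for this example.

```r
set.seed(1)  # k-means starts from random centres

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
    kmeans(iris[, 1:4], centers = k, nstart = 25)$tot.withinss
})
plot(1:6, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Compare a 3-cluster solution with the known species labels
km3 <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
table(cluster = km3$cluster, species = iris$Species)
```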

Time series analysis in R

Plotting a time series

library(ggfortify)
autoplot(AirPassengers)

Decompose a time series

Decomposing the AirPassengers data into trend, seasonal, and random components.

dAP <- decompose(AirPassengers)
autoplot(dAP)
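
By default decompose() fits an additive model, so the three components add back up to the observed series. The sketch below checks that and also shows the multiplicative variant, which may suit this series better because its seasonal swings grow with the level. The names recon and dAP_mult are illustrative.

```r
dAP <- decompose(AirPassengers)  # additive model by default

# Additive model: observed = trend + seasonal + random
recon <- dAP$trend + dAP$seasonal + dAP$random
max(abs(recon - AirPassengers), na.rm = TRUE)  # effectively zero

# Multiplicative model: observed = trend * seasonal * random
dAP_mult <- decompose(AirPassengers, type = "multiplicative")
head(dAP_mult$figure)  # the first six monthly seasonal factors
```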

Forecasting future values

library(forecast)
AP_arima <- auto.arima(AirPassengers)
AP_f <- forecast(AP_arima, h = 30)
autoplot(AP_f)
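
Before trusting forecasts of unseen months, it can help to check the method on data it has not seen. A minimal holdout sketch follows; the two-year split and the names train, test, fit and fc are assumptions made for this example.

```r
library(forecast)

# Hold out the last two years to compare forecasts against known values
train <- window(AirPassengers, end = c(1958, 12))
test  <- window(AirPassengers, start = c(1959, 1))

fit <- auto.arima(train)
fc  <- forecast(fit, h = length(test))

# Error measures on the training set and on the held-out test set
accuracy(fc, test)
```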