Research fair 2024

`R` in Action

Efficient data science with R

A demonstration by Md. Aminul Islam Shazid.

Grammar of graphics with ggplot2

Plots using grammar of graphics with `ggplot2`

ggplot2 is an R package that implements the grammar of graphics.
It builds polished graphics from a few simple building blocks.
Variables (features/columns) are mapped to elements of the plot called “aesthetics”, e.g., axes, colours, point size, and line type.
A geometry (geom) then turns that aesthetic mapping into a plot.
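To make the mapping step concrete, here is a minimal sketch that inspects what a geom actually receives once the aesthetics are resolved. It uses the built-in `iris` data instead of the penguins data, and the object names `p`, `p_points`, and `computed` are illustrative.

```r
library(ggplot2)

# Map columns to aesthetics; no geometry yet, so nothing is drawn.
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species))

# Adding a geometry turns the mapping into points.
p_points <- p + geom_point()

# layer_data() returns the computed values each point is drawn with.
computed <- layer_data(p_points, 1)
head(computed[, c("x", "y", "colour")])
```

Each row is one point, and the `colour` column holds the colour that the scale assigned to that row's species.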

A simple example

library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins dataset

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = flipper_length_mm)) +
    geom_point()

Adding a grouping variable

ggplot(penguins, 
       mapping = aes(x = bill_length_mm, 
                     y = flipper_length_mm, 
                     color = species)) + 
    geom_point()

Let’s add another dimension to the plot!

ggplot(penguins, 
       mapping = aes(x = bill_length_mm, 
                     y = flipper_length_mm, 
                     color = species, 
                     size = body_mass_g)) + 
    geom_point(alpha = 0.5)

Adding yet another dimension!

ggplot(penguins, 
       mapping = aes(x = bill_length_mm, 
                     y = flipper_length_mm, 
                     color = species, 
                     size = body_mass_g)) + 
    geom_point(alpha = 0.5) +
    facet_wrap(~island)

Comparing a variable across groups with boxplot

ggplot(penguins,
       mapping = aes(y = body_mass_g, 
                     x = species, 
                     fill = species)) +
    geom_boxplot(width = 0.2, show.legend = FALSE)

Violin plots as an alternative to boxplots

More informative: they also give a sense of the density.

ggplot(penguins,
       mapping = aes(y = body_mass_g, 
                     x = species, 
                     fill = species)) +
    geom_violin(width = 0.5, show.legend = FALSE) + 
    geom_boxplot(fill = "white", width = 0.1, show.legend = FALSE)

Bar diagrams

library(dplyr)  # for count()

penguins |>
    count(island, species) |>
    ggplot() +
    aes(x = island, y = n, fill = species) +
    geom_col(position = position_dodge2(preserve = "single"))

Line chart

Line charts show a trend or evolution over time.

ggplot() + 
    aes(x = time(AirPassengers), y = AirPassengers) + 
    geom_line()

Line chart with a trend line!

A LOESS smoother (the `geom_smooth()` default for small datasets) is added as a trend line.

ggplot() + 
    aes(x = time(AirPassengers), y = AirPassengers) + 
    geom_line() + 
    geom_smooth()
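For a look at what the smoother computes, here is a rough base-R sketch. It assumes the `loess()` defaults (span = 0.75, local quadratic fit), which should match what `geom_smooth()` uses for a series this short. The data frame `ap` and its column names are illustrative.

```r
# Put the series into a data frame; the column names are illustrative.
ap <- data.frame(
    time = as.numeric(time(AirPassengers)),
    passengers = as.numeric(AirPassengers)
)

# Fit a LOESS curve with the loess() defaults (span = 0.75, degree = 2).
fit <- loess(passengers ~ time, data = ap)

# The fitted values form the smooth trend line drawn over the raw series.
ap$trend <- predict(fit)

plot(ap$time, ap$passengers, type = "l", xlab = "Year", ylab = "Passengers")
lines(ap$time, ap$trend, col = "blue", lwd = 2)
```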

Fast data exploration with DataExplorer

Basic info about a dataset

library(DataExplorer)
plot_intro(penguins)

Find missing values

plot_missing(penguins)
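The same information can be tabulated without extra packages. Below is a small sketch that uses the built-in `airquality` data (chosen only because it ships with R and has missing values); `missing_summary` is an illustrative name.

```r
# airquality ships with R and contains missing values in two columns.
missing_count <- colSums(is.na(airquality))

# Share of missing values per column, in percent.
missing_pct <- round(100 * missing_count / nrow(airquality), 1)

# Combine into a small summary, most incomplete column first.
missing_summary <- data.frame(
    column = names(missing_count),
    n_missing = as.integer(missing_count),
    pct_missing = as.numeric(missing_pct)
)
missing_summary[order(-missing_summary$n_missing), ]
```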

Frequency distribution of all discrete variables

plot_bar(diamonds)

Frequency distribution by a discrete variable

plot_bar(diamonds, by = "cut")

Histogram of all continuous variables

plot_histogram(diamonds)

Kernel density of all continuous variables

plot_density(diamonds)

Boxplot

Boxplots of all continuous variables, grouped by a categorical variable

plot_boxplot(diamonds, by = "cut")

Scatterplots of one variable against all other continuous variables

plot_scatterplot(
    split_columns(diamonds)$continuous, 
    by = "price", 
    sampled_rows = 1000L
)

Quantile-quantile plot of all continuous variables

plot_qq(diamonds)

Correlogram

plot_correlation(split_columns(diamonds)$continuous)
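A correlogram is a picture of the pairwise correlation matrix. As a sketch of the underlying numbers, the base-R lines below compute that matrix for the built-in `mtcars` data (an illustrative choice, not the diamonds data used above).

```r
# Keep the numeric columns only (all of them, in the case of mtcars).
num_cols <- mtcars[sapply(mtcars, is.numeric)]

# Pearson correlations between every pair of columns.
corr <- cor(num_cols, use = "pairwise.complete.obs")

# Correlations of mpg with the other variables, strongest first.
mpg_corr <- corr["mpg", colnames(corr) != "mpg"]
round(mpg_corr[order(-abs(mpg_corr))], 2)
```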

Publication ready tables with gtsummary

Table describing the sample

This is the so-called “Table 1”.

library(gtsummary)
tbl_summary(
    data = trial, 
    missing_text = "NA",
    include = c("age", "trt", "marker", "stage", "grade", "death")
    ) |> 
    bold_labels()

Characteristic	N = 200¹
Age	47 (38, 57)
NA	11
Chemotherapy Treatment
Drug A	98 (49%)
Drug B	102 (51%)
Marker Level (ng/mL)	0.64 (0.22, 1.39)
NA	10
T Stage
T1	53 (27%)
T2	54 (27%)
T3	43 (22%)
T4	50 (25%)
Grade
I	68 (34%)
II	68 (34%)
III	64 (32%)
death	112 (56%)
¹ Median (IQR); n (%)
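The two cell formats in the footnote, "Median (IQR)" and "n (%)", can be reproduced by hand. Here is a minimal base-R sketch on the built-in `mtcars` data (illustrative; the variable names are not from the table above).

```r
# Continuous variable: median with first and third quartiles.
q <- quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))
mpg_summary <- sprintf("%.1f (%.1f, %.1f)", q[2], q[1], q[3])

# Categorical variable: count and percentage per level.
n <- table(mtcars$cyl)
counts <- as.integer(n)
cyl_summary <- sprintf("%d (%.0f%%)", counts, 100 * counts / sum(counts))
names(cyl_summary) <- names(n)

mpg_summary
cyl_summary
```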

Cross table

tbl_summary(
    data = trial, 
    include = c("age", "trt", "marker", "stage", "grade"),
    by = "death",
    percent = "row",
    missing_text = "NA"
    ) |> 
    add_p() |> 
    bold_p() |> 
    bold_labels() |> 
    modify_spanning_header(all_stat_cols() ~ "**Death**")

Characteristic	Death		p-value²
Characteristic	No, N = 88¹	Yes, N = 112¹	p-value²
Age	47 (36, 57)	48 (38, 58)	0.5
NA	2	9
Chemotherapy Treatment			0.4
Drug A	46 (47%)	52 (53%)
Drug B	42 (41%)	60 (59%)
Marker Level (ng/mL)	0.73 (0.23, 1.33)	0.57 (0.20, 1.45)	0.6
NA	2	8
T Stage			0.004
T1	29 (55%)	24 (45%)
T2	27 (50%)	27 (50%)
T3	21 (49%)	22 (51%)
T4	11 (22%)	39 (78%)
Grade			0.080
I	35 (51%)	33 (49%)
II	32 (47%)	36 (53%)
III	21 (33%)	43 (67%)
¹ Median (IQR); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test
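The two tests named in the footnote are available in base R. The sketch below runs them on the built-in `mtcars` data (illustrative, so the p-values differ from the table above).

```r
# Continuous variable across two groups: Wilcoxon rank sum test.
# exact = FALSE uses the normal approximation, which also copes with ties.
w <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)

# Two categorical variables: Pearson's Chi-squared test on the cross table.
# Small expected counts trigger a warning here; it is suppressed for the demo.
x2 <- suppressWarnings(chisq.test(table(mtcars$cyl, mtcars$am)))

c(wilcoxon = w$p.value, chi_squared = x2$p.value)
```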

Regression model summary table

gtsummary supports many different kinds of statistical models. Adding support for new models is also very easy.

logit_model <- glm(death ~ age + trt + marker + stage + grade, 
                   data = trial, family = binomial)
tbl_regression(
    logit_model,
    exponentiate = TRUE
    ) |> 
    bold_p() |> 
    bold_labels()

Characteristic	OR¹	95% CI¹	p-value
Age	1.01	0.99, 1.03	0.3
Chemotherapy Treatment
Drug A	—	—
Drug B	1.42	0.76, 2.67	0.3
Marker Level (ng/mL)	0.95	0.65, 1.38	0.8
T Stage
T1	—	—
T2	1.53	0.67, 3.55	0.3
T3	1.46	0.59, 3.67	0.4
T4	5.34	2.13, 14.3	<0.001
Grade
I	—	—
II	1.07	0.49, 2.34	0.9
III	2.07	0.97, 4.52	0.062
¹ OR = Odds Ratio, CI = Confidence Interval
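The OR column is the exponentiated model coefficient. Here is a minimal base-R sketch on the built-in `mtcars` data (illustrative). It uses Wald intervals from `confint.default()`, which is an assumption made for simplicity and may differ from the interval method behind the table above.

```r
# Logistic regression on a built-in dataset (not the trial data).
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Exponentiating the log-odds coefficients gives odds ratios.
or <- exp(coef(fit))

# Wald 95% confidence intervals, moved onto the odds-ratio scale.
ci <- exp(confint.default(fit))

round(cbind(OR = or, ci), 3)
```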

Decision tree classifier in R

Fitting a decision tree model

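Classifying the disease outcome (death) using a decision tree

library(tree)
# tree() fits a classification tree only when the response is a factor;
# death is coded 0/1 in trial, so convert it (and the character column trt).
trial_cls <- transform(trial, death = factor(death, labels = c("No", "Yes")),
                       trt = factor(trt))
tree1 <- tree(death ~ age + trt + marker + stage + grade,
              data = trial_cls)
plot(tree1)
text(tree1, pretty = 0)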

Hierarchical clustering in R

Finding similar cars

library(colorhcplot)
d <- dist(mtcars)
hc1 <- hclust(d)
plot(hc1, hang = -1, cex = 0.8)
rect.hclust(hc1, k = 3)
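The rectangles drawn by `rect.hclust()` correspond to a cut of the dendrogram. To get the cluster membership as data, a sketch with `cutree()` (the names `hc` and `groups` are illustrative):

```r
# Same steps as above: Euclidean distances, complete linkage (the defaults).
hc <- hclust(dist(mtcars))

# Cut the dendrogram into 3 groups, matching k = 3 in rect.hclust().
groups <- cutree(hc, k = 3)

# Cluster sizes, and the cars assigned to the first cluster.
table(groups)
names(groups)[groups == 1]
```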

K-means clustering in R

Finding clusters of flowers in the `iris` dataset

library(factoextra)
km1 <- kmeans(iris[, 1:4], centers = 4, nstart = 1000)
fviz_cluster(km1, data = iris[, 1:4], geom = "point")
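The call above fixes `centers = 4`. One common way to judge such a choice is an elbow plot of the total within-cluster sum of squares. A rough base-R sketch follows; the seed, the range of k, and `nstart = 25` are arbitrary illustrative values.

```r
set.seed(2024)  # k-means starts from random centers

features <- iris[, 1:4]

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
    kmeans(features, centers = k, nstart = 25)$tot.withinss
})

# Look for the k after which the curve flattens out.
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```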

Time series analysis in R

Plotting a time series

library(ggfortify)
autoplot(AirPassengers)

Decompose a time series

Decomposing the AirPassengers data into trend, seasonal, and random components.

dAP <- decompose(AirPassengers)
autoplot(dAP)
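With the default additive model, the three components returned by `decompose()` sum back to the observed series. A short sketch to check this (the names `parts`, `rebuilt`, and `ok` are illustrative):

```r
parts <- decompose(AirPassengers)  # additive model by default

# Rebuild the series from its three components.
rebuilt <- parts$trend + parts$seasonal + parts$random

# The moving-average trend is undefined at both ends of the series,
# so compare only the positions where all components are available.
ok <- !is.na(rebuilt)
max(abs(rebuilt[ok] - AirPassengers[ok]))
```

The result should be zero up to floating-point error.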

Forecasting future values

library(forecast)
AP_arima <- auto.arima(AirPassengers)
AP_f <- forecast(AP_arima, h = 30)
autoplot(AP_f)
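For comparison, a forecast can also be sketched with base R alone. The model order below is fixed by hand (a seasonal ARIMA(0,1,1)(0,1,1) on the log scale, a common textbook choice for this series); it is an assumption for illustration, and `auto.arima()` may select a different model.

```r
# Seasonal ARIMA with a hand-picked order, fitted on the log scale.
fit <- arima(log(AirPassengers), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast 30 months ahead, as above, then back-transform to passenger counts.
pred <- predict(fit, n.ahead = 30)
forecast_point <- exp(pred$pred)
lower <- exp(pred$pred - 1.96 * pred$se)  # approximate 95% interval
upper <- exp(pred$pred + 1.96 * pred$se)

# Observed series followed by the forecast and its interval (dashed).
ts.plot(AirPassengers, forecast_point, lower, upper, lty = c(1, 1, 2, 2))
```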
