Find the Optimal Number of Clusters (k) Using the Elbow Method

This function estimates the optimal number of clusters (k) for a given dataset from the within-cluster sum of squares (WCSS) computed for each candidate k. The optimal k is the first k at which the slope of the WCSS curve (the change in WCSS between consecutive values of k) falls below a slope threshold. The threshold is derived from the standard deviation of the differences between consecutive WCSS values, scaled by slope_factor.
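The rule described above can be sketched in base R. This is a minimal illustration, not the function's actual source: the helper name sketch_elbow_k, the use of stats::kmeans, the nstart value, and the convention of reporting the k at the end of the first "flat" step are all assumptions made for this example.

```r
# Hypothetical sketch of a slope-threshold elbow rule; not the package source.
sketch_elbow_k <- function(x, k_max = 10, slope_factor = 0.5) {
  # WCSS for k = 1..k_max, using base R k-means with several random starts
  wcss <- vapply(seq_len(k_max), function(k) {
    stats::kmeans(x, centers = k, nstart = 10)$tot.withinss
  }, numeric(1))

  # slopes[i] is the change in WCSS when moving from k = i to k = i + 1
  slopes <- diff(wcss)

  # Assumed threshold: slope_factor times the SD of the first differences
  slope_threshold <- slope_factor * stats::sd(slopes)

  # First step whose absolute change is below the threshold;
  # fall back to k_max if no step qualifies (an assumption of this sketch)
  flat <- which(abs(slopes) < slope_threshold)
  optimal_k <- if (length(flat) > 0) flat[1] + 1L else k_max

  list(optimal_k = optimal_k, wcss = wcss, slope_threshold = slope_threshold)
}

set.seed(123)
x <- rnorm(100, mean = 50, sd = 10)
res <- sketch_elbow_k(x, k_max = 10)
print(res$optimal_k)
```

Because the sketch makes its own choices about k-means settings and indexing, its result need not match the value returned by find_optimal_k on the same data.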

Usage

find_optimal_k(data, variable, k_max = 10, slope_factor = 0.5, plot = FALSE)

Arguments

data: A data frame containing the data to be clustered.
variable: A string specifying the name of the column in data to be clustered.
k_max: An integer specifying the maximum number of clusters to consider (default is 10).
slope_factor: A numeric multiplier applied to the standard deviation of the first derivative (the differences between consecutive WCSS values) to obtain the slope threshold (default is 0.5). We strongly recommend slope_factor >= 1 for large datasets; a larger factor raises the threshold and allows more flexibility in thresholding.
plot: A logical indicating whether to generate a plot of the WCSS values against the number of clusters, together with the threshold value (default is FALSE). If you do not know which slope_factor to use, we recommend setting plot = TRUE to visualize the data.
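
To show how slope_factor shifts the result, the sketch below applies a slope-threshold rule to a fixed WCSS curve. The WCSS values are made up for illustration, and pick_k is a hypothetical helper that assumes the threshold is slope_factor times the standard deviation of the first differences; it is not part of the function's API.

```r
# Illustrative WCSS values for k = 1..6 (made up for this example)
wcss <- c(1000, 400, 200, 150, 130, 120)

# slopes[i] is the change in WCSS when moving from k = i to k = i + 1
slopes <- diff(wcss)

# Assumed rule: the first k whose incoming slope is smaller in magnitude
# than slope_factor * sd(slopes)
pick_k <- function(slope_factor) {
  threshold <- slope_factor * sd(slopes)
  which(abs(slopes) < threshold)[1] + 1L
}

# Compare the selected k across several slope_factor values
k_by_factor <- sapply(c(0.5, 1, 2), pick_k)
print(k_by_factor)
```

Under this assumed rule, increasing slope_factor raises the threshold, so the selected k can only stay the same or decrease.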

Value

A list containing:

optimal_k: An integer indicating the optimal number of clusters.
slopes: A data frame with the WCSS values for each k.
slope_threshold: The calculated slope threshold.
plot: (optional) A ggplot object if plot = TRUE.

Examples

set.seed(123)
r_norm_data <- data.frame(normal_dist = rnorm(100, mean = 50, sd = 10))
result <- find_optimal_k(r_norm_data, variable = "normal_dist", k_max = 15, plot = TRUE)
print(result$optimal_k)
#> [1] 5