Find the Optimal Number of Clusters (k) Using the Elbow Method
find_optimal_k.Rd
This function estimates the optimal number of clusters (k) for a given dataset from its within-cluster sum of squares (WCSS) values. The optimal k is the first k at which the slope of the WCSS curve (the difference between consecutive WCSS values) falls below a slope threshold. That threshold is derived from the standard deviation of the differences between consecutive WCSS values.
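The rule described above can be sketched in base R. This is only one plausible reading of the description, not the package's actual source. It assumes k-means via stats::kmeans, a threshold equal to slope_factor times the standard deviation of the first differences, and a comparison on the absolute slope.

```r
# Sketch only: an assumed reading of the elbow rule, not the package's code.
set.seed(123)
x <- rnorm(100, mean = 50, sd = 10)

k_max <- 10
slope_factor <- 0.5

# Total within-cluster sum of squares for k = 1..k_max
wcss <- vapply(
  seq_len(k_max),
  function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss,
  numeric(1)
)

# First differences approximate the slope of the WCSS curve;
# slopes[i] is the change when moving from k = i to k = i + 1
slopes <- diff(wcss)

# Assumed threshold: scaled standard deviation of the slopes
slope_threshold <- slope_factor * sd(slopes)

# First k whose incoming slope magnitude drops below the threshold;
# fall back to k_max if the curve never flattens enough
flat <- which(abs(slopes) < slope_threshold)
optimal_k <- if (length(flat) > 0) flat[1] + 1 else k_max
```

Because the threshold is computed from the slopes themselves, it adapts to the scale of the data, so no absolute WCSS cutoff has to be chosen.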
Arguments
- data
A data frame containing the data to be clustered.
- variable
A string specifying the name of the column in the data frame to be clustered.
- k_max
An integer specifying the maximum number of clusters to consider (default is 10).
- slope_factor
A numeric value that scales the standard deviation of the first derivative (the differences between consecutive WCSS values) to set the slope threshold (default is 0.5). We strongly recommend using slope_factor >= 1 for large datasets to allow more flexibility in thresholding.
- plot
A logical indicating whether to generate a plot of the WCSS values against the number of clusters and the threshold value (default is FALSE). If you do not know what slope_factor to use, we recommend using plot = TRUE to visualize the data.
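To get a feel for how slope_factor moves the selected k, the sketch below applies an assumed version of the threshold rule to a synthetic WCSS curve. The curve 1000 / k is a stand-in, not real clustering output, and pick_k is a hypothetical helper that is not part of the package.

```r
# Synthetic, smoothly decaying stand-in for a WCSS curve (not real data)
wcss <- 1000 / seq_len(10)

# Hypothetical helper: first k whose incoming slope magnitude is below
# slope_factor * sd(slopes); falls back to the largest k considered
pick_k <- function(wcss, slope_factor) {
  slopes <- diff(wcss)
  threshold <- slope_factor * sd(slopes)
  flat <- which(abs(slopes) < threshold)
  if (length(flat) > 0) flat[1] + 1 else length(wcss)
}

# Larger factors raise the threshold, so the rule stops at a smaller k
factors <- c(0.25, 0.5, 1)
picks <- vapply(factors, function(f) pick_k(wcss, f), numeric(1))
print(data.frame(slope_factor = factors, k = picks))
```

Under this reading, raising slope_factor can only keep the selected k the same or lower it, which is worth keeping in mind when tuning it alongside plot = TRUE.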
Value
A list containing:
- optimal_k: An integer indicating the optimal number of clusters.
- slopes: A data frame with the WCSS values for each k.
- slope_threshold: The calculated slope threshold.
- plot: (optional) A ggplot object if plot = TRUE.
Examples
set.seed(123)
r_norm_data <- data.frame(normal_dist = rnorm(100, mean = 50, sd = 10))
result <- find_optimal_k(r_norm_data, variable = "normal_dist", k_max = 15, plot = TRUE)
print(result$optimal_k)
#> [1] 5