
This function estimates the optimal number of clusters (k) for a dataset from the within-cluster sum of squares (WCSS) computed for each candidate k up to k_max. The optimal k is the first k at which the slope of the WCSS curve (the difference between consecutive WCSS values) falls below a threshold. The threshold is slope_factor times the standard deviation of those differences.
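The selection rule described above can be sketched in base R. This is an illustration only, not the package's implementation: the helper name sketch_optimal_k, the use of stats::kmeans with nstart = 10, the comparison on absolute slopes, and the mapping from a slope to its k are all assumptions.

```r
# Hypothetical helper illustrating the documented rule; the real
# find_optimal_k() may differ in its details.
sketch_optimal_k <- function(x, k_max = 10, slope_factor = 0.5) {
  # WCSS for k = 1..k_max using base R k-means (assumed clustering method)
  wcss <- vapply(seq_len(k_max), function(k) {
    stats::kmeans(x, centers = k, nstart = 10)$tot.withinss
  }, numeric(1))

  # Differences between consecutive WCSS values approximate the slope
  slopes <- diff(wcss)

  # Threshold: slope_factor times the standard deviation of the differences
  slope_threshold <- slope_factor * stats::sd(slopes)

  # Assumed rule: slopes[i] is the change from k = i to k = i + 1, so the
  # first slope below the threshold (in absolute value) selects k = i + 1.
  below <- which(abs(slopes) < slope_threshold)
  optimal_k <- if (length(below) > 0) below[1] + 1L else as.integer(k_max)

  list(optimal_k = optimal_k, wcss = wcss, slope_threshold = slope_threshold)
}

set.seed(123)
x <- rnorm(100, mean = 50, sd = 10)
res <- sketch_optimal_k(x, k_max = 15)
print(res$optimal_k)
print(res$slope_threshold)
```

If no slope falls below the threshold, the sketch falls back to k_max; how the actual function handles that case is not stated here.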

Usage

find_optimal_k(data, variable, k_max = 10, slope_factor = 0.5, plot = FALSE)

Arguments

data

A data frame containing the data to be clustered.

variable

A string naming the column of data that contains the values to be clustered.

k_max

An integer specifying the maximum number of clusters to consider (default is 10).

slope_factor

A numeric value that scales the standard deviation of the first derivative of the WCSS curve to set the slope threshold (default is 0.5). For large datasets, we strongly recommend a slope_factor >= 1 to allow more flexibility in thresholding.

plot

A logical indicating whether to generate a plot of the WCSS values against the number of clusters, together with the threshold value (default is FALSE). If you are unsure which slope_factor to use, set plot = TRUE to inspect the WCSS curve and the threshold visually.
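To get a feel for how slope_factor moves the threshold, the following sketch applies the documented rule to a made-up WCSS vector. The numbers are invented for illustration, and the comparison on absolute slopes and the slope-to-k mapping are assumptions rather than the package's confirmed behavior.

```r
# Made-up WCSS values for k = 1..8 (illustration only, not real output)
wcss <- c(1000, 400, 200, 120, 90, 75, 68, 64)

# Differences between consecutive WCSS values
slopes <- diff(wcss)

factors <- c(0.25, 0.5, 1)

# For each slope_factor, pick the first k whose drop from k - 1 is
# smaller than slope_factor * sd(slopes) (assumed selection rule)
ks <- vapply(factors, function(f) {
  threshold <- f * sd(slopes)
  which(abs(slopes) < threshold)[1] + 1L
}, integer(1))

print(data.frame(slope_factor = factors,
                 threshold = factors * sd(slopes),
                 k = ks))
```

Under this rule, a larger slope_factor raises the threshold, so the search stops earlier and returns a k that is the same or smaller.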

Value

A list containing:

  • optimal_k: An integer indicating the optimal number of clusters.

  • slopes: A data frame with the WCSS values for each k.

  • slope_threshold: The calculated slope threshold.

  • plot: (optional) A ggplot object if plot = TRUE.

Examples

set.seed(123)
r_norm_data <- data.frame(normal_dist = rnorm(100, mean = 50, sd = 10))
result <- find_optimal_k(r_norm_data, variable = "normal_dist", k_max = 15, plot = TRUE)
print(result$optimal_k)
#> [1] 5