A healthcare analytics company wants to segment patients into groups that have similar risk factors to develop personalized treatment plans. The company has a dataset that includes patient health records, medication history, and lifestyle changes. The company must identify the appropriate algorithm to determine the number of groups by using hyperparameters.
Which solution will meet these requirements?
A. Use the Amazon SageMaker AI XGBoost algorithm. Set max_depth to control tree complexity for risk groups.
B. Use the Amazon SageMaker AI k-means clustering algorithm. Set k to specify the number of clusters.
C. Use the Amazon SageMaker AI DeepAR algorithm. Set epochs to determine the number of training iterations for risk groups.
D. Use the Amazon SageMaker AI Random Cut Forest (RCF) algorithm. Set a contamination hyperparameter for risk anomaly detection.
Explanation:
The scenario is a patient segmentation problem, which is a classic unsupervised learning task: the goal is to group patients with similar characteristics without predefined labels. The Amazon SageMaker AI k-means algorithm is a built-in unsupervised algorithm designed for this kind of clustering.
The SageMaker k-means algorithm groups data points into clusters based on feature similarity and requires the user to set the number of clusters with the k hyperparameter. This directly satisfies the requirement to “determine the number of groups by using hyperparameters.” Typical applications of k-means include customer segmentation, risk grouping, and pattern discovery in healthcare data.
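To illustrate how k fixes the number of groups, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). It is not the SageMaker implementation. The function name kmeans, the two-feature patient tuples, and the iterations and seed parameters are assumptions made for this example only.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Minimal Lloyd's algorithm. Returns (centroids, labels).

    k is the hyperparameter that sets the number of clusters.
    """
    rng = random.Random(seed)
    # Initialise centroids by picking k distinct input points.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical, already-scaled patient features (for example, two risk scores).
patients = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]

# Choosing k=3 asks for three patient groups; a different k gives a different count.
centroids, labels = kmeans(patients, k=3)
print(centroids)
print(labels)
```

Changing k to 2 or 4 changes how many centroids are returned, which is the behavior the question refers to. With the SageMaker built-in algorithm, the same role is played by its k hyperparameter.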
Option A (XGBoost) is a supervised learning algorithm used for classification and regression. The max_depth hyperparameter controls tree complexity, not the number of groups, making it unsuitable for this task.
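Option C (DeepAR) is a time-series forecasting algorithm optimized for predicting future values, not clustering patients. Its epochs hyperparameter sets the number of passes over the training data, not the number of groups.
Option D (Random Cut Forest) is an anomaly detection algorithm. While useful for identifying outliers or unusual patient behavior, it does not perform clustering or group segmentation. In addition, the SageMaker RCF algorithm does not expose a contamination hyperparameter; its hyperparameters include num_trees and num_samples_per_tree.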
Of the four options, only k-means partitions data into a number of clusters that is set directly by a hyperparameter.
Therefore, Option B is the correct answer.