An AutoML system works like a configurable pipeline. Each automated component should write traceable logs; otherwise, results become difficult to reproduce or interpret.
As you read about each component below, note its inputs, outputs, and logging behavior. Without comprehensive logging, it is very hard to diagnose why an automated workflow selected a particular model.
In the previous article, we introduced what Automated Machine Learning (AutoML) is and how it helps users streamline the model development process. Now let's look more closely at AutoML's core components: interconnected modules that together form a complete AutoML solution and automate data preprocessing, feature selection, model training, and hyperparameter optimization.
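To make the point about traceable logs concrete, here is a minimal sketch of a wrapper that records the input and output shape of each pipeline step. The names `logged_step` and `automl_sketch`, and the two toy steps, are hypothetical and chosen only for illustration; real AutoML libraries have their own logging mechanisms.

```python
import logging
from typing import Callable

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(name)s | %(message)s")
logger = logging.getLogger("automl_sketch")  # hypothetical logger name


def logged_step(name: str, func: Callable[[pd.DataFrame], pd.DataFrame]):
    """Wrap a pipeline step so its input and output shapes are logged."""
    def wrapper(df: pd.DataFrame) -> pd.DataFrame:
        logger.info("step=%s input_shape=%s", name, df.shape)
        result = func(df)
        logger.info("step=%s output_shape=%s", name, result.shape)
        return result
    return wrapper


# Two illustrative steps: fill missing ages with the mean, then one-hot encode gender.
steps = [
    logged_step("impute", lambda df: df.fillna({"age": df["age"].mean()})),
    logged_step("encode", lambda df: pd.get_dummies(df, columns=["gender"])),
]

frame = pd.DataFrame({
    "age": [25, 27, None, 29],
    "gender": ["male", "female", "female", "male"],
})
for step in steps:
    frame = step(frame)
```

Each step leaves a pair of log lines, so a later reader can see what every stage received and produced.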
1. Data Preprocessing Component
Data preprocessing is a critical step in any machine learning pipeline. Before looking at the individual components, it helps to picture the whole chain: data processing, feature engineering, model search, hyperparameter tuning, and evaluation. If any link in this chain remains unclear, reproducing or auditing automated results becomes difficult.
AutoML systems typically integrate multiple preprocessing modules that automate the following tasks:
- Missing Value Handling: Automatically detects missing entries and imputes them using appropriate strategies (e.g., mean or median imputation).
- Categorical Encoding: Converts categorical variables into numeric representations—for instance, via one-hot encoding or label encoding.
- Feature Scaling: Applies standardization or normalization to features to improve model convergence and performance.
Example
Suppose we have a dataset containing missing values and categorical variables. AutoML libraries such as TPOT or auto-sklearn can run steps of this kind as part of their pipelines; the snippet below performs them by hand with scikit-learn to show what is being automated:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
# Create sample data
data = pd.DataFrame({
'age': [25, 27, None, 29],
'gender': ['male', 'female', 'female', 'male']
})
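# Handle missing values: replace the missing age with the column mean
imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array of shape (n, 1); flatten it before assigning to the column
data['age'] = imputer.fit_transform(data[['age']]).ravel()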
# Encode categorical variable
encoder = OneHotEncoder()
encoded_gender = encoder.fit_transform(data[['gender']]).toarray()
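The list above also mentions feature scaling, which the snippet does not show. One possible way to combine imputation, scaling, and encoding is scikit-learn's `ColumnTransformer`. This is a sketch on the same toy data, not how any particular AutoML library does it internally; the step names ("impute", "scale", "numeric", "categorical") are arbitrary labels.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.DataFrame({
    "age": [25, 27, None, 29],
    "gender": ["male", "female", "female", "male"],
})

# Numeric columns: impute the mean, then standardize to zero mean and unit variance.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Route each column to the matching transformer; sparse_threshold=0 forces a dense array.
preprocessor = ColumnTransformer(
    [
        ("numeric", numeric_steps, ["age"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ],
    sparse_threshold=0,
)

features = preprocessor.fit_transform(data)
print(features.shape)  # one scaled age column plus one column per gender category
```

Keeping all three steps in one fitted object means the same transformations can be reapplied to new data with `preprocessor.transform(...)`.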
2. Feature Engineering Module
Feature engineering plays a pivotal role in boosting model performance. You don't need to absorb every detail of this overview at once: start with one small, hands-on problem you can validate yourself, then use the text to fill in conceptual gaps.
AutoML enhances feature sets through automated feature selection and feature construction:
- Feature Selection: Automatically evaluates each feature’s contribution to model performance and selects the most informative subset.
- Feature Construction: Generates new features from existing ones—for example, polynomial features or interaction terms.
Example
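Using the Featuretools library for automated feature construction (this reuses the data DataFrame from the preprocessing example):
import featuretools as ft
# Create an EntitySet and register the DataFrame.
# data has no 'id' column yet, so make_index=True asks Featuretools to create one.
es = ft.EntitySet(id='data')
es = es.add_dataframe(dataframe_name='data', dataframe=data, index='id', make_index=True)
# Run Deep Feature Synthesis; it returns the feature matrix and the feature definitions
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='data')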
3. Model Selection and Training Module
AutoML systems provide a diverse suite of machine learning algorithms and autonomously select the best-performing one. Key capabilities include:
- Model Selection: Compares candidate algorithms, typically by cross-validated score, and keeps the best-performing one.
- Model Training: Fits the selected model on training data; commonly supported algorithms include decision trees, random forests, and support vector machines.
Example
In auto-sklearn, model selection and training can be implemented as follows:
from autosklearn.classification import AutoSklearnClassifier
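# Instantiate the classifier: search for up to 120 seconds in total,
# with at most 30 seconds per candidate model
automl = AutoSklearnClassifier(time_left_for_this_task=120, per_run_time_limit=30)
# X_train and y_train are assumed to be an already prepared training split
automl.fit(X_train, y_train)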
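To show what selection by cross-validation means in its simplest form, the sketch below compares the three algorithm families named above on the Iris dataset and keeps the one with the highest mean score. It is a hand-written illustration, not auto-sklearn's actual search procedure; the dataset and the candidate settings are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate algorithms to compare (default settings, fixed seeds where applicable).
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svm": SVC(),
}

# Score each candidate by mean accuracy over 5 cross-validation folds.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

# Keep the best-scoring candidate and refit it on all available data.
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)
print(scores, best_name)
```

Real AutoML systems search a much larger space and usually tune hyperparameters at the same time, but the selection criterion is the same idea.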
4. Hyperparameter Optimization Module
Every machine learning algorithm has a set of hyperparameters that govern its learning capacity and generalization ability. AutoML systems typically employ the following methods for hyperparameter optimization:
- Grid Search: Exhaustively searches over a predefined grid of hyperparameter combinations.
- Bayesian Optimization: Uses probabilistic modeling and Bayesian inference to efficiently navigate the hyperparameter space and locate high-performing configurations.
Example
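Using Optuna for hyperparameter optimization (X_train, y_train, X_valid, and y_valid are assumed to be already prepared splits):
import optuna
from sklearn.ensemble import RandomForestClassifier

def objective(trial):
    # Sample a tree depth for this trial
    max_depth = trial.suggest_int('max_depth', 2, 32)
    model = RandomForestClassifier(max_depth=max_depth)
    model.fit(X_train, y_train)
    # Return validation accuracy (higher is better)
    return model.score(X_valid, y_valid)

# create_study() minimizes by default, so request maximization explicitly
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)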
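Grid search, the first method in the list above, can be sketched with scikit-learn's `GridSearchCV`. The grid values and the Iris dataset are assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# 3 depths x 2 forest sizes = 6 combinations, each scored with 3-fold cross-validation.
param_grid = {"max_depth": [2, 4, 8], "n_estimators": [25, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Because every combination is evaluated, the cost grows multiplicatively with each added hyperparameter, which is the main reason sampling-based methods are often preferred for larger search spaces.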
5. Model Evaluation and Validation Module
After model training, evaluation is essential for assessing performance. Common metrics include accuracy, F1-score, and the area under the ROC curve (ROC AUC). AutoML systems can automatically generate evaluation reports and visualizations, which make model behavior easier for users to inspect and interpret.
Example
Evaluating a model using scikit-learn:
from sklearn.metrics import accuracy_score, f1_score
y_pred = automl.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred, average='weighted'))
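ROC-based evaluation needs predicted scores or probabilities rather than hard labels. The sketch below shows this on a binary problem; the breast cancer dataset and the logistic regression model are stand-ins chosen for illustration, not part of the auto-sklearn example above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features, then fit a simple linear classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# ROC AUC is computed from the predicted probability of the positive class.
positive_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, positive_proba)
print("ROC AUC:", auc)

# Per-class precision, recall, and F1 in one text report.
print(classification_report(y_test, model.predict(X_test)))
```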
At this point, consider summarizing the components in a review table: state each component's role first, then test your understanding by running a minimal end-to-end example and noting which steps you can now carry out independently.
Summary
Automated Machine Learning (AutoML) comprises several interdependent core components—from data preprocessing and feature engineering, through model training and hyperparameter optimization, to final model evaluation. Together, these components significantly enhance both the automation level and effectiveness of machine learning workflows. In the next article, we’ll explore AutoML’s advantages and challenges—deepening our understanding of its practical value and limitations.