Automating Feature Engineering: Generation and Transformation
Automated feature generation expands the search space—but also increases the risk of overfitting and computational cost. The more features you generate, the more critical rigorous validation becomes.
I document the origin of every newly generated feature and verify whether it can be computed in real time when deployed to production.
In the previous article, we explored techniques for feature selection—methods to identify features most relevant to model performance. In this article, we dive into feature generation and transformation, a pivotal step in feature engineering. Thoughtfully designed feature generation and transformation can significantly boost model performance, enabling machine learning algorithms to more effectively uncover latent patterns in the data.
What Are Feature Generation and Transformation?
Feature generation refers to creating new features from raw data—features that help models better capture underlying data structure. Feature transformation, by contrast, involves modifying existing features—either to improve model performance or to meet algorithm-specific requirements.
When performing automated feature generation and transformation, first assess field semantics, temporal availability, combination rules, encoding schemes, data leakage risks, and validation-set performance.
Methods of Feature Generation
Polynomial Features:
Construct new features using polynomial combinations of existing ones. For example, given features $x_1$ and $x_2$, we may generate $x_1^2$, $x_2^2$, and $x_1 x_2$.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
Interaction (Composite) Features:
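Combining two or more features can yield meaningful signals. For instance, in house price prediction, dividing “area” by “number of rooms” yields “area per room”, a potentially informative derived feature. A ratio such as “price per room” would not qualify here: price is the prediction target, so any input computed from it leaks the answer.
df['area_per_room'] = df['area'] / df['num_rooms']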
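Ratio features break when the denominator is zero. The following is a small sketch, assuming a hypothetical table with area and num_rooms columns, that turns zero denominators into missing values so the ratio comes out as NaN instead of infinity.

```python
import numpy as np
import pandas as pd

# Hypothetical listings; the second row has zero rooms recorded.
homes = pd.DataFrame({
    "area": [1200.0, 800.0, 1500.0],
    "num_rooms": [4, 0, 5],
})

# Replace zero denominators with NaN so the ratio is missing rather than infinite.
safe_rooms = homes["num_rooms"].replace(0, np.nan)
homes["area_per_room"] = homes["area"] / safe_rooms

print(homes)
```

How to treat the resulting missing values (imputation, a separate indicator column, or dropping the row) depends on the model and is left open here.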
Temporal Features:
For time-series or date-stamped data, extracting components such as year, month, day, or weekday enables modeling of cyclical or seasonal patterns.
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
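The .dt accessor only works on datetime columns, and a raw month number hides the fact that December and January are neighbors. The sketch below, using a hypothetical events table, parses the dates first and then adds a sine/cosine encoding of the month. The cyclical encoding is one common option, not something the extraction above requires.

```python
import numpy as np
import pandas as pd

# Hypothetical date-stamped records stored as strings.
events = pd.DataFrame({"date": ["2023-01-15", "2023-06-30", "2023-12-01"]})

# Parse strings into datetimes so the .dt accessor is available.
events["date"] = pd.to_datetime(events["date"])

events["month"] = events["date"].dt.month
events["weekday"] = events["date"].dt.dayofweek  # Monday=0 ... Sunday=6

# Map month 1..12 onto a circle so that month 12 sits next to month 1.
angle = 2 * np.pi * (events["month"] - 1) / 12
events["month_sin"] = np.sin(angle)
events["month_cos"] = np.cos(angle)

print(events)
```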
Methods of Feature Transformation
Standardization and Normalization:
Standardization rescales features to zero mean and unit variance, which suits scale-sensitive algorithms (e.g., SVM, regularized linear regression). Normalization scales features to the [0, 1] range.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
min_max_scaler = MinMaxScaler()
X_normalized = min_max_scaler.fit_transform(X)
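Fitting a scaler on the full dataset lets validation rows influence the mean and variance, which is a mild form of data leakage. A minimal sketch of the safer pattern, on randomly generated placeholder data, fits the scaler on the training split only and reuses those statistics for the validation split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data: 200 rows, 3 numeric features (randomly generated).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train, X_valid = train_test_split(X_demo, test_size=0.25, random_state=0)

scaler_demo = StandardScaler()
X_train_std = scaler_demo.fit_transform(X_train)  # learn mean/std from training rows only
X_valid_std = scaler_demo.transform(X_valid)      # reuse the training statistics

print(X_train_std.mean(axis=0), X_valid_std.mean(axis=0))
```

The validation columns will not have exactly zero mean, which is expected: they are scaled with the training statistics, just as production data would be.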
Logarithmic Transformation:
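For right-skewed (positively skewed) features, applying a log transform (e.g., $\log(x + 1)$) often brings the distribution closer to normal, which benefits algorithms sensitive to distribution shape. The +1 offset keeps zero values defined; the feature must still be non-negative.
import numpy as np
df['log_feature'] = np.log1p(df['original_feature'])  # log(1 + x)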
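Whether the transform helped can be checked rather than assumed. This sketch compares sample skewness before and after np.log1p on a small invented series, and confirms that np.expm1 recovers the original values.

```python
import numpy as np
import pandas as pd

# Invented right-skewed values: most are small, a few are very large.
skewed = pd.Series([1.0, 2.0, 3.0, 10.0, 100.0, 1000.0])

logged = np.log1p(skewed)      # log(1 + x), defined at x = 0
restored = np.expm1(logged)    # inverse transform: exp(y) - 1

skew_before = skewed.skew()
skew_after = logged.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```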
Encoding Categorical Features:
Categorical variables must be converted into numeric representations. Common approaches include one-hot encoding (for low- to medium-cardinality features) and label encoding (with caution: only when the categories have a meaningful order).
import pandas as pd
df = pd.get_dummies(df, columns=['categorical_feature'], drop_first=True)
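One practical catch with pd.get_dummies is that it derives columns from whatever categories are present, so new data can produce a different column set than the training data. A sketch of one way to handle this, with hypothetical housing_type values, encodes both frames and then aligns the new frame to the training columns.

```python
import pandas as pd

# Hypothetical training data and new data; "townhouse" was never seen in training.
train_df = pd.DataFrame({"housing_type": ["apartment", "house", "condo", "house"]})
new_df = pd.DataFrame({"housing_type": ["house", "townhouse"]})

train_encoded = pd.get_dummies(train_df, columns=["housing_type"], dtype=int)
new_encoded = pd.get_dummies(new_df, columns=["housing_type"], dtype=int)

# Keep exactly the training columns: unseen categories are dropped,
# and categories missing from the new data are filled with 0.
new_aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(new_aligned)
```

drop_first is left out of this sketch on purpose: applied to each frame separately, it would drop whichever category sorts first in that frame, which may differ between training and new data.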
Case Study
Suppose we’re building a house price prediction model, with features including area, number of rooms, and housing type.
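First, generate a new feature, area per room. Because price is the prediction target, ratios built from it (such as price per square foot) would leak the target and cannot serve as model inputs:
df['area_per_room'] = df['area'] / df['num_rooms']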
Next, apply a logarithmic transformation to the target to stabilize the price distribution (predictions are then mapped back to prices with np.expm1):
df['log_price'] = np.log1p(df['price'])
Finally, encode the categorical feature housing_type using one-hot encoding:
df = pd.get_dummies(df, columns=['housing_type'], drop_first=True)
After these steps, the resulting feature matrix is better suited for model training.
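The case-study steps can also be chained into a single pipeline and checked with cross-validation, so that every transform is fitted only on training folds. The sketch below is an illustration under stated assumptions, not the article's own implementation: the data is synthetic, the column names are hypothetical, the generated ratio is area per room (a ratio involving price would leak the target), and TransformedTargetRegressor is one possible way to apply the log transform to price.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic housing data (all values are randomly generated placeholders).
rng = np.random.default_rng(42)
n = 200
houses = pd.DataFrame({
    "area": rng.uniform(500, 3000, n),
    "num_rooms": rng.integers(1, 8, n),
    "housing_type": rng.choice(["apartment", "house", "condo"], n),
})
houses["price"] = (150.0 * houses["area"]
                   + 10000.0 * houses["num_rooms"]
                   + rng.normal(0, 10000, n))

# Feature generation: a ratio built from input columns only.
houses["area_per_room"] = houses["area"] / houses["num_rooms"]

numeric_cols = ["area", "num_rooms", "area_per_room"]
categorical_cols = ["housing_type"]

# Feature transformation: scale numeric columns, one-hot encode the category.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Fit on log1p(price); predictions are mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=Pipeline([("prep", preprocess), ("reg", LinearRegression())]),
    func=np.log1p,
    inverse_func=np.expm1,
)

X_houses = houses[numeric_cols + categorical_cols]
y_houses = houses["price"]
scores = cross_val_score(model, X_houses, y_houses, cv=5, scoring="r2")
print(f"mean R^2 across folds: {scores.mean():.3f}")
```

Running the same cross-validation with and without a generated feature gives a direct check of whether that feature earns its place.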
To consolidate the material, turn it into a short retrospective checklist: clarify the core workflow first, then walk through the full pipeline on a small, concrete example and note which steps you can already carry out independently.
Summary
Feature generation and transformation are essential components of feature engineering. Selecting appropriate generation and transformation strategies can substantially enhance model performance. In the next article, we’ll explore tools and frameworks for automating feature engineering—further streamlining the end-to-end machine learning workflow.