Part 2 – Top Scikit-learn Tips for Building Efficient Machine Learning Workflows


This series of articles draws inspiration and key concepts from Data School’s valuable “TOP 50 Scikit-learn Tips and Tricks” resource. While not a direct adaptation, each article aims to build upon those core ideas, providing comprehensive explanations, code examples, and practical considerations for effective implementation. The goal is to empower you with a deeper understanding and the ability to confidently apply these scikit-learn techniques in your machine learning projects.


Missing Indicator Features: Capture potential patterns related to missingness itself by creating binary features indicating missing values.

from sklearn.impute import SimpleImputer
from sklearn.impute import MissingIndicator

# Option 1: Using SimpleImputer
imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_imputed = imputer.fit_transform(X)

# Option 2: Using MissingIndicator
indicator = MissingIndicator()
missing_features = indicator.fit_transform(X)

Reproducibility in Train-Test Split: Ensure consistent splits for fair model comparison and collaboration by setting a random state.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Advanced Imputers: Explore KNNImputer and IterativeImputer for more sophisticated missing value handling.

from sklearn.experimental import enable_iterative_imputer  # noqa: required before importing IterativeImputer
from sklearn.impute import KNNImputer, IterativeImputer

# KNNImputer
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)

# IterativeImputer
imputer = IterativeImputer(max_iter=10)
X_imputed = imputer.fit_transform(X)

IterativeImputer: Iterates through features, predicting missing values using other features as predictors.

KNNImputer: Imputes missing values based on values in similar rows (nearest neighbors).

Numeric Feature Restriction: KNNImputer and IterativeImputer are designed for numerical features, not categorical ones.
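
A common workaround is to impute numeric and categorical columns separately with a ColumnTransformer. The sketch below is one illustrative setup; the column lists num_cols and cat_cols are hypothetical placeholders for your own column names.

from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical column lists -- replace with the columns in your dataset
num_cols = ['age', 'income']
cat_cols = ['city', 'product']

preprocessor = ColumnTransformer([
    ('num', KNNImputer(n_neighbors=5), num_cols),              # numeric-only imputer
    ('cat', SimpleImputer(strategy='most_frequent'), cat_cols)  # works for categoricals
])
X_imputed = preprocessor.fit_transform(X)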

HistGradientBoostingClassifier for Missing Values: This model can intrinsically handle missing values during training and prediction, often eliminating the need for imputation.

from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier()
model.fit(X_train, y_train)  # No imputation needed

OneHotEncoder Compatibility: drop='first' and handle_unknown='ignore' are incompatible settings in OneHotEncoder. Generally, avoid drop='first' unless it’s essential for specific model requirements (e.g., avoiding perfect collinearity in linear models).
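
As a quick illustration, the configuration below keeps every category and safely handles categories that appear only at prediction time (it assumes the input columns are all categorical):

from sklearn.preprocessing import OneHotEncoder

# Unknown categories seen at transform time are encoded as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
X_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)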

Tuning Entire Pipelines: Optimize model performance by tuning hyperparameters of both preprocessing steps and the model itself within a pipeline.

from sklearn.model_selection import RandomizedSearchCV

pipeline = Pipeline(...)  # As defined in previous examples
params = {
    'scaler__with_mean': [True, False],
    'classifier__n_estimators': [100, 200, 300],
    # ... any other step__parameter entries to tune
}
search = RandomizedSearchCV(pipeline, params, cv=5)
search.fit(X_train, y_train)
best_pipeline = search.best_estimator_
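
For a self-contained sketch, here is one possible pipeline with step names 'scaler' and 'classifier' (the names are illustrative assumptions, and the parameter grid must use the same prefixes):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

params = {
    'scaler__with_mean': [True, False],           # preprocessing hyperparameter
    'classifier__n_estimators': [100, 200, 300],  # model hyperparameters
    'classifier__max_depth': [None, 5, 10]
}

search = RandomizedSearchCV(pipeline, params, n_iter=10, cv=5, random_state=42)
search.fit(X_train, y_train)
best_pipeline = search.best_estimator_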

RandomizedSearchCV Efficiency: Explore hyperparameter combinations more efficiently than GridSearchCV, especially for large hyperparameter spaces. Specify appropriate distributions for continuous variables.

from scipy.stats import randint, uniform

params = {
    'n_estimators': randint(100, 500),   # Integers from 100 to 499 (upper bound is exclusive)
    'learning_rate': uniform(0.01, 0.1)  # Floats from 0.01 to 0.11 (loc=0.01, scale=0.1)
}
search = RandomizedSearchCV(model, params, cv=5)

Regularization in Logistic Regression:

  • L1 (Lasso): Shrinks coefficients towards zero, potentially setting some to zero for feature selection.
  • L2 (Ridge): Shrinks coefficients but keeps them non-zero, reducing model complexity without feature elimination.
  • ElasticNet: Combines L1 and L2 for a balance of feature selection and coefficient shrinkage.

from sklearn.linear_model import LogisticRegression

# L1 regularization ('liblinear' and 'saga' are the solvers that support it)
model = LogisticRegression(penalty='l1', solver='liblinear')

# L2 regularization (the default penalty, supported by all solvers)
model = LogisticRegression(penalty='l2')

# ElasticNet regularization (only the 'saga' solver supports it)
model = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5)  # Adjust l1_ratio

Solver Compatibility in Grid Search: Not all solvers support all penalties. Check documentation and avoid including incompatible combinations in grid search to prevent errors.
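
One way to stay safe is to pass a list of parameter grids so that each solver is only paired with penalties it supports. The grid below sticks to combinations that are valid for LogisticRegression:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

params = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]},  # liblinear supports l1 and l2
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': [0.1, 1, 10]}             # lbfgs supports l2 only
]
search = GridSearchCV(LogisticRegression(max_iter=1000), params, cv=5)
search.fit(X_train, y_train)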

Confusion Matrix Plotting:

from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2; use the Display API instead
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

Decision Tree Visualization:

from sklearn.tree import plot_tree, export_text

# Visualize tree structure
plot_tree(tree_model)

# Export text representation
tree_rules = export_text(tree_model)
print(tree_rules)

Tree Pruning for Overfitting Control: Prevent overly complex trees that might overfit by pruning branches with minimal impact.

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(ccp_alpha=0.001)  # Tune ccp_alpha
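
Rather than guessing, you can ask scikit-learn for the candidate alpha values on a given training set with cost_complexity_pruning_path, then cross-validate over them; a brief sketch:

from sklearn.tree import DecisionTreeClassifier

# Effective alphas produced by minimal cost-complexity pruning on the training data
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas)  # Candidate ccp_alpha values to evaluate with cross-validation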

Stratified Train-Test Split: Maintain class proportions in both training and test sets, crucial for imbalanced datasets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Imputing Categorical Missing Values:

  • SimpleImputer(strategy='most_frequent'): Replaces missing values with the most frequent category.
  • SimpleImputer(strategy='constant', fill_value='missing'): Fills with a designated value, indicating missingness for potential feature engineering.

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='constant', fill_value='missing')  # Example
X_imputed = imputer.fit_transform(X)

Saving Pipelines with joblib:

from joblib import dump, load

dump(pipeline, 'pipeline.joblib')  # Save
pipeline = load('pipeline.joblib')  # Load

Shuffling Data in Cross-Validation: Prevent model bias from data order by shuffling before cross-validation.

from sklearn.model_selection import KFold, cross_val_score

kf = KFold(5, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=kf)

StratifiedKFold for Classification: Maintains class proportions within each fold, especially important for imbalanced datasets.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(5, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=skf)

ROC AUC for Multiclass Classification:

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score  # cross_val_score lives in model_selection, not metrics

# 'ovo' averages AUC over all pairs of classes; 'ovr' is the one-vs-rest alternative
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test), multi_class='ovo')
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc_ovo')

ROC AUC for Imbalanced Datasets: Measures model performance across different classification thresholds, often more informative than accuracy in imbalanced settings.
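
As a small sketch, comparing the two metrics in cross-validation makes the point: on an imbalanced dataset, a model that mostly predicts the majority class can post high accuracy while its ROC AUC stays near 0.5 (the model, X, and y are assumed from earlier examples).

from sklearn.model_selection import cross_val_score

acc_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')  # can look deceptively high
auc_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')   # threshold-independent ranking quality
print(acc_scores.mean(), auc_scores.mean())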


That’s it! I hope these insights from Data School’s “TOP 50 Scikit-learn Tips and Tricks” have been useful for your machine learning journey. Stay tuned for Part 3, where we’ll delve deeper into even more practical techniques and considerations to boost your scikit-learn expertise! See you then!