Part 3 – Top Scikit-learn Tips for Building Efficient Machine Learning Workflows


This series of articles draws inspiration and key concepts from Data School’s valuable “TOP 50 Scikit-learn Tips and Tricks” resource. While not a direct adaptation, each article aims to build upon those core ideas, providing comprehensive explanations, code examples, and practical considerations for effective implementation. The goal is to empower you with a deeper understanding and the ability to confidently apply these scikit-learn techniques in your machine learning projects.


FunctionTransformer for Custom Functions: Integrate custom preprocessing functions into scikit-learn pipelines for consistency and reusability.

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def custom_function(X):
    # Example custom operation: apply a log transform to every value
    return np.log1p(X)

transformer = FunctionTransformer(custom_function)
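
As a minimal usage sketch (the input array is hypothetical), the transformer then behaves like any other scikit-learn estimator:

X = np.array([[1.0, 10.0], [2.0, 100.0]])  # hypothetical data
X_transformed = transformer.fit_transform(X)  # applies custom_function to X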

Feature Selection with SelectPercentile: Retain a specified percentage of features, ranked by a scoring function such as the chi-squared statistic.

from sklearn.feature_selection import SelectPercentile, chi2

selector = SelectPercentile(chi2, percentile=50)
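
A minimal fitting sketch, using a built-in dataset whose feature values are non-negative (a requirement of the chi2 score function):

from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_selected = selector.fit_transform(X, y)  # keeps the top 50% of features by chi-squared score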

Pipeline Steps: A common pipeline structure chains preprocessing, feature selection, and a final model:

from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    column_transformer,  # Preprocess different feature types
    feature_selector,    # Select important features
    classifier           # Final model
)
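
The three placeholder steps could be defined along these lines (a sketch; the column names are hypothetical, and f_classif is just one possible score function):

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression

column_transformer = make_column_transformer(
    (OneHotEncoder(), ['cat_col']),   # hypothetical categorical column
    (StandardScaler(), ['num_col'])   # hypothetical numeric column
)
feature_selector = SelectPercentile(f_classif, percentile=50)
classifier = LogisticRegression(max_iter=1000)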

Visualizing Pipelines: Render a pipeline as an interactive diagram in a notebook.

from sklearn import set_config

set_config(display='diagram')
pipeline  # Displaying the pipeline object in a notebook renders an interactive HTML diagram

Retrieving Column Names: Recover the output feature names from a fitted transformer.

from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(...)  # Example; the transformer must be fitted first
new_feature_names = ct.get_feature_names_out()
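
A runnable sketch (the column names and data are hypothetical):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'color': ['red', 'blue'], 'size': [1.0, 2.0]})  # hypothetical data
ct = ColumnTransformer(
    [('encoder', OneHotEncoder(), ['color'])],
    remainder='passthrough'
)
ct.fit(X)
print(ct.get_feature_names_out())
# e.g. ['encoder__color_blue' 'encoder__color_red' 'remainder__size']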

OneHotEncoder’s drop Parameter: Controls how redundant (collinear) columns in the one-hot encoded output are handled (see the sketch after this list):

  • drop=None: Keeps all categories (default).
  • drop='first': Drops the first category of each feature.
  • drop='if_binary': Drops one category, but only for features with exactly two categories.
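
A quick sketch of the effect on a single binary feature (the data is hypothetical):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'answer': ['yes', 'no', 'yes']})  # hypothetical binary feature
print(OneHotEncoder(drop=None).fit_transform(X).toarray())         # two columns
print(OneHotEncoder(drop='if_binary').fit_transform(X).toarray())  # one column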

Column Transformer Flexibility: Selectively transform or pass through specific columns:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

ct = ColumnTransformer([
    ('encoder', OneHotEncoder(), ['cat_col1', 'cat_col2']),  # Transform specific columns
    ('scaler', StandardScaler(), ['num_col1', 'num_col2']),
    ('passthrough', 'passthrough', ['id_col'])  # Pass through unchanged
])

Tree-Based Models and Encoding: Tree-based models can often learn effectively from either one-hot or ordinal encoded features, even when the categories have no inherent order, because splits can isolate individual encoded values. Linear models, by contrast, usually perform better with one-hot encoding, since ordinal codes impose an artificial numeric ordering.
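
As a sketch, swapping encoders inside a pipeline makes this easy to test empirically (the model and column name are illustrative):

from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# For a tree-based model, a compact ordinal encoding is often sufficient
tree_pipeline = make_pipeline(
    make_column_transformer((OrdinalEncoder(), ['cat_col']), remainder='passthrough'),
    RandomForestClassifier()
)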

Parallel Grid Search: Speed up hyperparameter searches by distributing cross-validation fits across CPU cores.

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(model, param_grid, cv=5, n_jobs=-1)  # Use all available cores
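
For a concrete sketch (the estimator and grid values are illustrative):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
param_grid = {'C': [0.1, 1, 10]}  # hypothetical search space
grid = GridSearchCV(model, param_grid, cv=5, n_jobs=-1)
# After grid.fit(X, y), grid.best_params_ holds the winning combination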

Feature Interactions with PolynomialFeatures: Creates new features by multiplying existing ones together. The number of generated features grows combinatorially, so this can become impractical with many inputs, and it is often unnecessary for tree-based models, which can capture interactions on their own.

Feature Selection with Interactions: For a manageable number of features, create interactions and then use feature selection to identify the most important ones, as in the combined pipeline sketched after the code below.

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)  # Adds squared terms and pairwise interactions (use interaction_only=True for interactions alone)
X_poly = poly.fit_transform(X)
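
Combining the two steps as described above might look like this sketch (the score function, percentile, and final model are illustrative):

from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    PolynomialFeatures(degree=2),                 # generate interaction features
    SelectPercentile(f_classif, percentile=25),   # keep only the top quarter
    LogisticRegression(max_iter=1000)
)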

Ensemble Methods with VotingClassifier: Combine multiple models for potentially improved performance.

from sklearn.ensemble import VotingClassifier

estimators = [('model1', model1), ('model2', model2), ('model3', model3)]
ensemble = VotingClassifier(estimators, voting='soft')  # Soft voting averages predicted probabilities, so every model must implement predict_proba
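
For a concrete sketch (the choice of base estimators is illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

ensemble = VotingClassifier(
    [('lr', LogisticRegression(max_iter=1000)),
     ('rf', RandomForestClassifier()),
     ('nb', GaussianNB())],
    voting='soft'  # all three expose predict_proba, so soft voting works
)
# ensemble.fit(X, y) trains every base model; predictions average their probabilities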

That’s it! I hope these insights from Data School’s “TOP 50 Scikit-learn Tips and Tricks” have been useful for your machine learning journey. Stay tuned for Part 4, where we’ll delve deeper into even more practical techniques and considerations to boost your scikit-learn expertise! See you then!