Reflection - ML Flow Up to Scaling

Data Preparation Steps

Preparing Data for Machine Learning in Python

This guide outlines the critical steps for preparing data in machine learning pipelines.

Collect resources: Import the necessary Python modules such as pandas, numpy, and sklearn. For readability, it is a good idea to import a module just before it is used.
Read in dataset: Use pd.read_csv, and pass the sep argument if the file's separator is not a comma. Follow convention: use X (capital) for the feature columns and y for the output vector. If the feature columns are not contiguous, use:
X = dataset.iloc[:, [0] + list(range(2, dataset.shape[1]))].values
The .values attribute converts the selection to a NumPy array, which the slicing in the later steps assumes.
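As a rough illustration of this step, the sketch below reads a small semicolon-separated table and picks non-contiguous feature columns. The column names and values are made up, the data comes from an in-memory string rather than a real file, and the output vector is assumed to be the last column (so the range stops one short of it).

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data standing in for a real CSV file.
raw = io.StringIO(
    "Country;Id;Age;Salary;Purchased\n"
    "France;1;44;72000;No\n"
    "Spain;2;27;48000;Yes\n"
    "Germany;3;30;54000;No\n"
)

# sep tells read_csv that fields are separated by ';' rather than ','.
dataset = pd.read_csv(raw, sep=";")

# Assumption: the output vector is the last column.
last = dataset.shape[1] - 1

# Keep column 0, skip column 1 (Id), keep columns 2 up to (not including) the last.
X = dataset.iloc[:, [0] + list(range(2, last))].values
y = dataset.iloc[:, last].values
```

Here X holds Country, Age, and Salary, while the Id column is left out because it carries no predictive information.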
Clean data: Use the SimpleImputer class from the sklearn.impute module to replace missing values in the feature matrix with the column's mean. Fit the imputer on the numeric columns only:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, columns_of_interest])
Then transform the same columns and write the result back:
X[:, columns_of_interest] = imputer.transform(X[:, columns_of_interest])
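A minimal sketch of mean imputation follows, assuming a small made-up feature matrix whose first column is categorical and whose remaining columns are numeric. The values and the choice of columns_of_interest are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix: column 0 is categorical, columns 1-2 are numeric.
X = np.array(
    [
        ["France", 44.0, 72000.0],
        ["Spain", np.nan, 48000.0],
        ["Germany", 30.0, np.nan],
        ["Spain", 38.0, 60000.0],
    ],
    dtype=object,
)

columns_of_interest = [1, 2]  # numeric columns only; the mean of text is undefined

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Fit and transform the same column subset, then write the result back into X.
numeric = X[:, columns_of_interest].astype(float)
X[:, columns_of_interest] = imputer.fit_transform(numeric)
```

The key point is that the imputer is fitted and applied to the same columns. Passing the whole matrix to transform after fitting on a subset would fail, because the column counts no longer match.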
Encode data: Categorical data, which does not come from a measurement (unlike age, for example), must be encoded as numbers. A country name cannot be measured. One-hot encoding transforms France, Germany, and Spain into:
[1 0 0], [0 1 0], [0 0 1].
Use the ColumnTransformer class from the sklearn.compose module to apply a OneHotEncoder (from sklearn.preprocessing) to the categorical columns. One-hot encoding keeps the model from reading a magnitude or order into integer codes such as 1, 2, 3.
If the output vector is non-numeric, encode it using the LabelEncoder class from sklearn.preprocessing.
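The encoding step might look like the sketch below. The data is invented, the transformer name "encoder" is arbitrary, and sparse_threshold=0 is set only so that the result is always a dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical features: a country name (categorical) and an age (measured).
X = np.array(
    [["France", 44.0], ["Germany", 30.0], ["Spain", 38.0], ["France", 35.0]],
    dtype=object,
)
y = np.array(["No", "Yes", "No", "Yes"])

# One-hot encode column 0 and pass the remaining columns through unchanged.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X = np.array(ct.fit_transform(X)).astype(float)

# Encode the non-numeric output vector as integer class labels.
le = LabelEncoder()
y = le.fit_transform(y)
```

The one-hot columns are placed first, in alphabetical order of the categories, followed by the passed-through columns. Label encoding is fine for y because the integer codes there are class labels, not magnitudes fed to the model as features.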
Split data: Split the dataset into training and test sets using train_test_split from sklearn.model_selection:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
Fixing the seed with random_state makes the split reproducible, which is useful during development.
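To illustrate the effect of random_state, the sketch below splits a small synthetic dataset twice with the same seed. The arrays are placeholders, built so that each feature row can be matched to its label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples, 2 features; row i is [2*i, 2*i + 1] and y[i] = i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 80% of the rows go to training, 20% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Repeating the call with the same seed yields the same partition.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=1
)
```

Rows of X and entries of y are shuffled together, so each sample keeps its label after the split.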
Scale feature values: Scale after splitting, because the test set stands in for real-world data that is unavailable during development; fitting the scaler on it would leak information into training. Use the StandardScaler class from sklearn.preprocessing:
- Apply fit_transform to X_train.
- Apply transform to X_test, which reuses the mean and standard deviation learned from X_train.
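The two bullet points above can be sketched as follows, using tiny made-up training and test arrays. The names X_train_scaled and X_test_scaled are introduced here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical, already-split numeric features.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

sc = StandardScaler()

# Learn the mean and standard deviation from the training set and scale it.
X_train_scaled = sc.fit_transform(X_train)

# Reuse the training statistics on the test set; do not refit.
X_test_scaled = sc.transform(X_test)
```

After scaling, each training column has mean 0 and standard deviation 1. The test row is expressed relative to the training statistics, so its values need not fall in the same range.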
