Reflection - ML Flow Up to Scaling
Preparing Data for Machine Learning in Python
This guide outlines the critical steps for preparing data in machine learning pipelines.
| Collect resources | Import the necessary Python modules such as pandas, numpy, and sklearn. For readability, it is a good idea to import a module just before it is used. |
| Read in dataset | Use pd.read_csv, passing the sep argument if the file's separator is not a comma. Follow convention: use X (capital) for the feature columns and y for the output vector. If the feature columns are not contiguous, select them by position. For example, to keep column 0 and every column from 2 onward as a NumPy array:
X = dataset.iloc[:, [0] + list(range(2, dataset.shape[1]))].values
|
| Clean data | Use the SimpleImputer class from the sklearn.impute module to replace missing values in numeric feature columns with the column's mean:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, columns_of_interest])
Then transform the same columns and write the result back: X[:, columns_of_interest] = imputer.transform(X[:, columns_of_interest])
|
| Encode data | Categorical data, meaning values that are not measurements (unlike, say, age), must be encoded as numbers. For example, a country name cannot be measured. One-hot encoding transforms France, Germany, and Spain into:
[1 0 0], [0 1 0], [0 0 1].
Use the OneHotEncoder class from sklearn.preprocessing, applied to the categorical columns through the ColumnTransformer class from the sklearn.compose module. One-hot encoding prevents a model from reading a false order or magnitude into integer codes such as 1, 2, 3.
If the output vector is non-numeric, encode it using the LabelEncoder class from sklearn.preprocessing.
|
| Split data | Split the dataset into training and test sets using train_test_split from sklearn.model_selection:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
Here test_size=0.2 holds out 20% of the rows for testing. Fixing the seed with random_state makes the split reproducible, which is useful during development.
|
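| Scale feature values | Scale after splitting, not before: the test set stands in for real-world data that is unavailable during development, so its values must not influence the scaling parameters. Use the StandardScaler class from sklearn.preprocessing:
- Apply fit_transform to X_train, which learns each column's mean and standard deviation and scales the training data.
- Apply transform (not fit_transform) to X_test, which reuses the parameters learned from X_train.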
|
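The steps in the table can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the small in-memory dataset and its column names (Country, Age, Salary, Purchased) are hypothetical stand-ins for a file loaded with pd.read_csv, and sparse_threshold=0 is passed only to force ColumnTransformer to return a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Hypothetical data standing in for pd.read_csv(...); two values are missing.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Read in: X holds the feature columns, y the output vector.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Clean: fill missing values in the numeric columns with the column mean.
numeric_cols = [1, 2]
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X[:, numeric_cols])
X[:, numeric_cols] = imputer.transform(X[:, numeric_cols])

# Encode: one-hot encode the country column, pass the other columns through.
ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # assumption: always return a dense array
)
X = np.array(ct.fit_transform(X), dtype=float)

# Encode the non-numeric output vector as integers.
y = LabelEncoder().fit_transform(y)

# Split: hold out 20% of the rows, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Scale: learn the parameters from the training set only, then reuse them.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

After encoding, the three one-hot country columns come first, followed by the passed-through Age and Salary columns. The sketch scales every column of X_train, as the table describes; whether to scale the one-hot columns as well is a modelling choice left open here.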