Reflection - ML Flow Up to Scaling

December 17, 2024

Data Preparation Steps

Preparing Data for Machine Learning in Python

This guide outlines the critical steps for preparing data in machine learning pipelines.

Collect resources	Import the necessary Python modules such as `pandas`, `numpy`, and `sklearn`. For readability, it is a good idea to import a module just before it is used.
Read in dataset	Use `pd.read_csv`, passing the `sep` argument if the file's separator is not a comma. Follow convention: use `X` (capital) for the feature columns and `y` for the output vector. If the feature columns are not contiguous, select them by position, for example: `X = dataset.iloc[:, [0] + list(range(2, dataset.shape[1]))]`.
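Clean data	Use the `SimpleImputer` class from the `sklearn.impute` module to replace missing values in the feature matrix with the column's mean: `imputer = SimpleImputer(missing_values=np.nan, strategy='mean')`, then `imputer.fit(X[:, columns_of_interest])`. Then transform the same columns and write the result back: `X[:, columns_of_interest] = imputer.transform(X[:, columns_of_interest])`. This indexing assumes `X` is a NumPy array (for example, taken from the DataFrame with `.values`).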
Encode data	Categorical data, that is, data that does not come from a measurement (unlike age, for example), must be encoded. A country name cannot be measured. One-hot encoding transforms France, Germany, and Spain into `[1 0 0], [0 1 0], [0 0 1]`, which prevents a model from reading a magnitude or order into integer codes like 1, 2, 3. Apply a `OneHotEncoder` (from `sklearn.preprocessing`) to the categorical columns through the `ColumnTransformer` class from the `sklearn.compose` module. If the output vector is non-numeric, encode it using the `LabelEncoder` class from `sklearn.preprocessing`.
Split data	Split the dataset into training and test sets using `train_test_split` from `sklearn.model_selection`: `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)`. Here `test_size=0.2` holds out 20% of the rows for testing. Fixing the seed with `random_state` makes the split reproducible, which is useful during development.
Scale feature values	Scale after splitting: the test set stands in for real-world data that is unavailable during development, so its values must not influence the scaling parameters. Use the `StandardScaler` class from `sklearn.preprocessing`: apply `fit_transform` to `X_train`, then apply `transform` (which reuses the mean and standard deviation learned from the training set) to `X_test`.
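
The steps in the table can be sketched end to end as follows. This is a minimal illustration, not a definitive pipeline: the inline semicolon-separated data, the column names (`Country`, `Age`, `Salary`, `Purchased`), and the column positions are all hypothetical stand-ins for a real file. Two choices here are assumptions of the sketch rather than part of the steps above: `sparse_threshold=0` is passed so that `ColumnTransformer` returns a dense array, and only the numeric columns are scaled, leaving the one-hot columns as 0/1.

```python
import io

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Hypothetical semicolon-separated data standing in for a real CSV file.
CSV_TEXT = """Country;Age;Salary;Purchased
France;44;72000;No
Spain;27;48000;Yes
Germany;30;54000;No
Spain;38;61000;No
Germany;40;;Yes
France;35;58000;Yes
Spain;;52000;No
France;48;79000;Yes
Germany;50;83000;No
France;37;67000;Yes
"""

# Read in dataset: sep is required because the separator is not a comma.
dataset = pd.read_csv(io.StringIO(CSV_TEXT), sep=";")
X = dataset.iloc[:, :-1].to_numpy(copy=True)  # feature matrix (object array)
y = dataset.iloc[:, -1].to_numpy()            # output vector

# Clean data: replace missing numeric values with the column mean,
# fitting and transforming the same columns and writing them back.
numeric_cols = [1, 2]  # Age, Salary (positions assumed for this sample)
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X[:, numeric_cols] = imputer.fit_transform(X[:, numeric_cols].astype(float))

# Encode data: one-hot encode the country column, pass the rest through.
# The one-hot columns come first in the output, followed by Age and Salary.
ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X = np.asarray(ct.fit_transform(X), dtype=float)

# The output vector is non-numeric ("No"/"Yes"), so label-encode it.
y = LabelEncoder().fit_transform(y)

# Split data: 80% training, 20% test, with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Scale feature values: fit on the training set only, then reuse
# the same mean and standard deviation on the test set.
scaler = StandardScaler()
X_train[:, 3:] = scaler.fit_transform(X_train[:, 3:])
X_test[:, 3:] = scaler.transform(X_test[:, 3:])
```

After this runs, `X_train` and `X_test` hold three one-hot country columns followed by the standardized age and salary columns, and `y_train` and `y_test` hold integer class labels.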
