Data preprocessing
Advantages:
- Brings the data into a state that machine learning algorithms can easily parse.
- Features get transformed and encoded.
Missing values
Handling missing values is part of data quality assessment. Two common strategies:
- Eliminate rows with missing data.
- Fill missing values in with the mean or median.
There are several ways to handle missing values: resolve them manually, look for relationships between features to impute a reasonable replacement, or simply replace them with the mean. Alternatively, use the imputation utilities available in scikit-learn.
- Libraries
import numpy as np
from sklearn.impute import SimpleImputer

# X is a 2D numeric array that uses np.nan for missing entries
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X)                    # learn the mean of each column
X_imputed = imp.transform(X)  # replace each NaN with its column mean
Alternatively, as mentioned above, handling this by hand will surface the most important observations about the data.
name_df.dropna(inplace=True)  # drop rows with missing values in place (returns None)
# alternatively, compute the mean of each feature that has missing values and fill them in
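For instance, a minimal pandas sketch of the fill-with-mean approach, using a toy DataFrame (the column names here are just illustrations):

import numpy as np
import pandas as pd

# hypothetical toy data with a missing entry
name_df = pd.DataFrame({'age': [22.0, np.nan, 35.0], 'fare': [7.25, 71.28, 8.05]})
name_df['age'] = name_df['age'].fillna(name_df['age'].mean())  # fill NaN with the column mean
# or use the median, which is more robust to outliers:
# name_df['age'] = name_df['age'].fillna(name_df['age'].median())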
Categorical feature encoding
In some cases, categorical features are needed for training. Gender? Yes! Name? Maybe!
As above, there are always two ways to do it: manually or with a library.
from sklearn.preprocessing import OneHotEncoder

# toy data: each row is (gender, a binary flag)
X = [['Male', 0], ['Female', 1], ['Female', 0]]
drop_enc = OneHotEncoder(drop='first').fit(X)
drop_enc.categories_    # categories learned per column: [Female, Male] and [0, 1]
drop_enc.transform([['Female', 1], ['Male', 0]]).toarray()  # -> [[0., 1.], [1., 0.]]
The other way, by hand, is pandas: pd.get_dummies(data['name_feature'], drop_first=True), as sketched below.
Dropping the first category (drop_first=True / drop='first') is strongly recommended, since it removes a redundant column and reduces the dimension of the encoded data.
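For example, a small sketch with pandas (the DataFrame and the column name gender are just an illustration):

import pandas as pd

data = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Male']})
dummies = pd.get_dummies(data['gender'], drop_first=True)  # drops 'Female', keeps only a 'Male' column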
Feature sampling
Focusing only on training, without testing, is hard to accept. Instead of feeding all the data to training, let’s split it into two parts, training and testing, and each part must be representative of the original data set.
How to do it:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
# test_size: fraction of the data held out as the test set
# X, y: the features and labels of the data set
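If the classes are imbalanced, a plain random split may not represent the original data set well. train_test_split also accepts a stratify argument that preserves the class proportions in both parts; a short sketch:

# stratified split: train and test keep the same class ratio as y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y)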
Feature scaling
Here’s the curious thing about feature scaling: it improves (significantly) the performance of some machine learning algorithms and does not work at all for others. What could be the reason behind this quirk?
Also, what’s the difference between normalization and standardization? These are two of the most commonly used feature scaling techniques in machine learning, but a level of ambiguity exists in their understanding. When should you use which technique?
Compare (a quick numeric sketch follows this list):
- Normalization: values are shifted and rescaled so that they end up ranging between 0 and 1 (min-max scaling).
- Best when the data does not follow a Gaussian distribution, or for algorithms that make no assumption about the data distribution, such as K-Nearest Neighbors and neural networks.
- \[X'=\frac{X-X_{min}}{X_{max}-X_{min}}\]
- Standardization: values are centered around the mean with a unit standard deviation, so the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation.
- Best when the data follows a Gaussian distribution.
- \[X'=\frac{X-\mu}{\sigma}\]
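A quick numpy sketch applying both formulas above to a toy column, to see the difference:

import numpy as np

X = np.array([10.0, 20.0, 30.0, 40.0])
X_norm = (X - X.min()) / (X.max() - X.min())  # [0.0, 0.333..., 0.666..., 1.0]
X_std = (X - X.mean()) / X.std()              # mean 0, standard deviation 1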
Implementing Feature Scaling in Python
Normalization
from sklearn.preprocessing import MinMaxScaler
# fit scaler on training data
norm = MinMaxScaler().fit(X_train)
# transform training data
X_train_norm = norm.transform(X_train)
# transform testing data
X_test_norm = norm.transform(X_test)
Standardization
from sklearn.preprocessing import StandardScaler

std = StandardScaler()
# fit on the training data, then apply the same transformation to both sets
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)  # transform only: reuse the training mean and std
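Note that the scaler (min-max or standard) is fitted on the training data only and then reused on the test data; fitting it again on the test set would leak test statistics into preprocessing and make the two sets inconsistent.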