[ML] Advanced optimization algorithm

15 Sep 2020

Reading time ~2 minutes

Data preprocessing

Advantages:

  • Bring the data to a state that machine learning algorithms can easily parse.
  • Get it transformed and encoded.

Missing values

Missing values show up during data quality assessment. Two common remedies:

  • Eliminate the rows with missing data.
  • Eliminate the missing values by filling them in with the mean or median.

There are several ways for you to handle missing values. You can resolve them manually, find relationships between features to replace them sensibly, or replace them with the mean. Alternatively, use the library available from scikit-learn.

  • Libraries

import numpy as np
from sklearn.impute import SimpleImputer

# learn each column's mean from the data, then fill the NaNs with it
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
name_df_imputed = imp.fit_transform(name_df)
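A minimal sketch of what the imputer does, using a made-up toy array:

import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[1.0, 2.0],
                 [np.nan, 3.0],
                 [7.0, 6.0]])  # toy data with one missing value

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imp.fit_transform(data))
# [[1. 2.]
#  [4. 3.]   <- NaN replaced by the column mean, (1 + 7) / 2 = 4
#  [7. 6.]]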

Or, as I said, handling it by hand will surface the most important observations about the data.

name_df.dropna(inplace = True) # drop rows with missing values in place and return None
# alternatively, fill each missing value with its feature's mean:
# name_df.fillna(name_df.mean(), inplace = True)

Categorical feature encoding

In some cases, categorical features are needed for training. Gender? Yes! Name? Maybe!!!

As above, there are always two ways to do it: manually or with a library.

from sklearn.preprocessing import OneHotEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]  # toy training data

drop_enc = OneHotEncoder(drop='first').fit(X)
drop_enc.categories_  # the categories learned for each column

drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()

The other way, by hand, is pandas: get_dummies(data['name_feature'], drop_first = True).

Using drop first is strongly recommended in all cases, since it reduces the dimensionality: one redundant column per feature is removed (see the sketch below).
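
A minimal sketch of the get_dummies route, assuming a hypothetical DataFrame with a 'Gender' column:

import pandas as pd

data = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})  # toy data

# drop_first=True keeps only one dummy column here ('Male'); the dropped
# category ('Female') is implied when all remaining columns are 0
dummies = pd.get_dummies(data['Gender'], drop_first = True)
print(dummies)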

Feature sampling

Training without ever testing is hard to accept. Instead of using all the data for training, split it into two parts, training and testing, and each part must still represent the original data set.

How to do it:

from sklearn.model_selection import train_test_split # Libraries

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)

# test_size: fraction of the data held out for the test set
# X, y: the full data set (features and labels)
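
For classification data, one way to keep both parts representative, as required above, is stratified sampling; a sketch assuming y holds class labels:

from sklearn.model_selection import train_test_split

# stratify=y preserves the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.3, random_state = 101, stratify = y)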

Feature scaling

Here’s the curious thing about feature scaling: it improves (significantly) the performance of some machine learning algorithms and has no effect at all on others. What could be the reason behind this quirk?

Also, what’s the difference between normalization and standardization? These are two of the most commonly used feature scaling techniques in machine learning but a level of ambiguity exists in their understanding. When should you use which technique?

Compare:

  • Normalization: values are shifted and rescaled so that they end up ranging between 0 and 1 (Min-Max scaling).
    • Use when the data does not follow a Gaussian distribution, or with algorithms that make no assumption about the distribution, such as K-Nearest Neighbors and neural networks.
    • \[X'=\frac{X-X_{min}}{X_{max}-X_{min}}\]
  • Standardization: values are centered around the mean with a unit standard deviation, so the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation.
    • Use when the data follows a Gaussian distribution.
    • \[X'=\frac{X-\mu}{\sigma}\]
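
To see the two formulas in action on made-up numbers before reaching for scikit-learn:

import numpy as np

X = np.array([1.0, 5.0, 10.0])  # toy feature values

X_norm = (X - X.min()) / (X.max() - X.min())  # min-max scaling into [0, 1]
X_std = (X - X.mean()) / X.std()              # zero mean, unit standard deviation

print(X_norm)  # [0.         0.44444444 1.        ]
print(X_std)   # approximately [-1.18 -0.09  1.27]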

Implementing Feature Scaling in Python

Normalization

from sklearn.preprocessing import MinMaxScaler

# fit scaler on training data
norm = MinMaxScaler().fit(X_train)

# transform training data
X_train_norm = norm.transform(X_train)

# transform testing data
X_test_norm = norm.transform(X_test)

Standardization

from sklearn.preprocessing import StandardScaler

std = StandardScaler()

# fit on the training data, then apply the same scaling to both sets
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)

Note that the scaler is fit only on the training data; calling fit_transform on the test set as well would scale it with different statistics and leak information from it into the preprocessing.

That’s all about data preprocessing. Thanks for reading!!!


