Data Preprocessing

For any task in data science, data is the most important thing. The catch is that we rarely get data in its most beautiful form: raw data is often incomplete and inconsistent, and is likely to contain many errors. Data scientists often spend as much as 70% of their time cleaning their datasets. A cleaned dataset yields much better insights, visualizations and predictions.

Data mostly comes in two forms: structured data and unstructured data.

In this tutorial, we will see how to work with structured data.

Most of the time, we face three major types of problems when trying to apply a machine learning algorithm: missing values, categorical data, and large differences in scale between columns.

These are the basic problems data scientists face, although in reality the list can be longer depending on your data.

In [1]:

The pandas library is compatible with most data formats. It has built-in functions to import Excel files, CSV (comma-separated values) files, SQL data and many more.
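The notebook's own import cell is not reproduced here, so the following is only a minimal sketch of reading CSV data with pandas. It assumes the five sample rows shown in this tutorial; the variable names `csv_text` and `dataset` are illustrative, and an in-memory string stands in for the real file, whose name is not given.

```python
import io

import pandas as pd

# Assumption: these rows mirror the sample table shown in this tutorial.
# The real file name and its full contents are not given in the text.
csv_text = """Country,Age,Salary,Purchased
Delhi,24,72000,No
mumbai,27,48000,Yes
new york,40,54000,No
Madrid,28,61000,No
berlin,40,,Yes
"""

# pd.read_csv accepts a file path or any file-like object;
# StringIO stands in for a CSV file on disk.
dataset = pd.read_csv(io.StringIO(csv_text))

# The empty Salary field in the last row is read as NaN.
print(dataset.head())
```

With a real file, the same call would take a path instead, for example `pd.read_csv("your_file.csv")` with a placeholder name of your own.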

In [2]:

In [3]:

Out[3]:

    Country   Age   Salary  Purchased
0     Delhi  24.0  72000.0         No
1    mumbai  27.0  48000.0        Yes
2  new york  40.0  54000.0         No
3    Madrid  28.0  61000.0         No
4    berlin  40.0      NaN        Yes

To train a supervised learning algorithm we need an input and an output. In this example, we use x to denote the inputs and y to denote the output.
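One common way to make this split is to select columns by position, as in the sketch below. It assumes that Purchased, the last column, is the output and that the remaining columns are the inputs; the small DataFrame is rebuilt from the sample rows shown in this tutorial so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Sample rows from the tutorial's table (assumed, not the full dataset).
dataset = pd.DataFrame({
    "Country": ["Delhi", "mumbai", "new york", "Madrid", "berlin"],
    "Age": [24.0, 27.0, 40.0, 28.0, 40.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan],
    "Purchased": ["No", "Yes", "No", "No", "Yes"],
})

# Inputs: every column except the last one.
x = dataset.iloc[:, :-1].values
# Output: the last column only.
y = dataset.iloc[:, -1].values

print(x.shape, y.shape)
```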

In [4]:

In [5]:

Out[5]:

In [6]:

Out[6]:

In [7]:

In [8]:

In [9]:

See how all the missing values are now replaced with the mean of their column.

Out[9]:
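The imputation cells are not shown above, so here is a minimal sketch of mean imputation, assuming scikit-learn's `SimpleImputer` (the original notebook may have used a different class). The array holds the Age and Salary values from the sample table; the name `x_num` is illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the sample table; the last salary is missing.
x_num = np.array([
    [24.0, 72000.0],
    [27.0, 48000.0],
    [40.0, 54000.0],
    [28.0, 61000.0],
    [40.0, np.nan],
])

# strategy="mean" replaces each NaN with the mean of its own column,
# computed from the non-missing values in that column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
x_filled = imputer.fit_transform(x_num)

# The missing salary becomes (72000 + 48000 + 54000 + 61000) / 4.
print(x_filled[4, 1])
```

Other strategies such as `"median"` or `"most_frequent"` can be a better fit when a column is skewed or categorical.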

Earlier we saw how categorical data can be a problem for algorithms. Our dataset also contains categorical data: the columns labelled Country and Purchased are categorical.

Let’s see how sklearn helps us deal with these.

The sklearn library provides two classes for this: LabelEncoder and OneHotEncoder.

LabelEncoder converts categorical data to numerical data by assigning a numeric value to each level. For example, if the data has three categories, Delhi, Mumbai and New York, it will assign 0 to Delhi, 1 to Mumbai and 2 to New York. This does convert the category strings to numbers, but it creates a problem: given 0, 1 and 2, an ML algorithm may treat New York as having a higher value than Mumbai and Delhi, whereas in reality they have equal importance. To deal with this, we have another class called OneHotEncoder.
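The three-city example can be reproduced with a short sketch like the one below. The list `cities` is made up for illustration. Note that scikit-learn documents LabelEncoder as intended for target labels, so using it on an input column is shown here only to illustrate the integer codes.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; they follow the example in the text.
cities = ["Delhi", "Mumbai", "New York", "Mumbai", "Delhi"]

encoder = LabelEncoder()
# Classes are sorted alphabetically, then numbered from 0.
codes = encoder.fit_transform(cities)

print(codes.tolist())             # [0, 1, 2, 1, 0]
print(encoder.classes_.tolist())  # ['Delhi', 'Mumbai', 'New York']
```

The codes carry an order (0 < 1 < 2) that the cities themselves do not have, which is the problem described above.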

OneHotEncoder creates a binary column for each category and returns a sparse matrix or a dense array, so each category is given equal importance.
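As a rough illustration, the sketch below one-hot encodes a single made-up city column. It relies on the encoder's default sparse output and converts it to a dense array with `toarray()` for printing; the names `cities`, `onehot` and `dense` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature.
cities = np.array([["Delhi"], ["Mumbai"], ["New York"], ["Mumbai"]])

encoder = OneHotEncoder()
onehot = encoder.fit_transform(cities)  # sparse matrix by default
dense = onehot.toarray()                # dense array for inspection

# Columns follow the sorted categories: Delhi, Mumbai, New York.
print(encoder.categories_[0].tolist())
print(dense)
```

Each row contains exactly one 1, so no category is numerically larger than another.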

In [10]:

In [11]:

In [12]:

Out[12]:


In [13]:

In [14]:

Out[14]:

In [15]:

In [16]:

Out[16]:

The training set is used to train the machine learning algorithm, and the test set is used to evaluate its predictions.
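A split like this is commonly done with scikit-learn's `train_test_split`. The sketch below uses made-up arrays, and the 80/20 ratio and `random_state=0` are assumptions, since the original cell is not shown.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, and a binary label.
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)
```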

In [17]:

In [18]:

As discussed before, a large difference in the values of different attributes (columns) can be a problem. To deal with this, we need to scale the values. sklearn provides multiple classes for scaling data, which differ in the method they use. Here we used the StandardScaler class, which standardizes each column by subtracting the column's mean from every value and then dividing by the column's standard deviation.
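The sketch below shows one way to apply StandardScaler, under the assumption that the scaler is fitted on the training rows only and then reused on the test rows. The arrays are made up from the Age and Salary values in the sample table.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary values; the two columns differ in scale by orders of magnitude.
x_train = np.array([
    [24.0, 72000.0],
    [27.0, 48000.0],
    [40.0, 54000.0],
    [28.0, 61000.0],
])
x_test = np.array([[40.0, 58750.0]])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation, then scales.
x_train_scaled = scaler.fit_transform(x_train)
# transform reuses the training statistics, so the test rows are scaled consistently.
x_test_scaled = scaler.transform(x_test)

# The same result computed by hand: (value - column mean) / column std.
manual = (x_train - x_train.mean(axis=0)) / x_train.std(axis=0)

print(x_train_scaled)
print(x_test_scaled)
```

After scaling, every training column has mean 0 and standard deviation 1, so Age and Salary contribute on a comparable scale.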

In [19]:

In [21]:

Out[21]:

In [22]:

Out[22]:
