Out of the Box Feature Engineering

Because learning is faster with a teacher

This essay will build off some of the discussions in the book to derive a new automunge method for normalizing a numerical set. In prior explorations I had come across the idea of a logarithmic transform of a numerical series, such as one might use to address the price fluctuations of a security in time series data. I was interested to see the discussions presented in this book about an extension of the logarithmic transform known as Box-Cox, a more generalizable representation which captures both the square root and the logarithmic transform through the inclusion of a parameter lambda. The transform acts to compress higher values when λ<1, which helps normalize data that might have fatter tails, while setting λ>1 has the opposite effect. The transformation is applied to the data as follows:

$$
x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(x) & \text{if } \lambda = 0 \end{cases}
$$
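As a quick illustration of the transform in practice, here's a minimal sketch using scipy, which can fit λ by maximum likelihood and apply the transform in one step (the sample data is invented for illustration):

```python
import numpy as np
from scipy import stats

# a right-skewed positive series (fat upper tail), e.g. lognormal "prices"
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# fit lambda by maximum likelihood and apply the transform in one step;
# with a fat upper tail we'd expect a fitted lambda below 1,
# compressing the larger values
x_transformed, fitted_lambda = stats.boxcox(x)
print("fitted lambda:", fitted_lambda)
print("skew before:", stats.skew(x), "after:", stats.skew(x_transformed))
```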

A further limitation of the Box-Cox transformation is that it only works on a set with all positive values. For a manual address we could work our way through this in a few ways, the simplest being that if we knew our set had a natural minimum value we could shift the set to all positive by adding a constant to each value (another approach could be to 'squish' the range of values into a desired band such as 0 to 1). However, since we're trying to automate here, I don't see a way to reason through whether the set has a natural minimum; we'll instead have to apply logic tests. Keeping with the philosophy of keep it simple stupid, I think the simplest address will be to test our train set for all positive values and proceed if so, otherwise just defer to the prior normalization address. Because our transform has an asymptote at x=0, we'll actually test our values for x>0.001. Further, for a test set with an inconsistent distribution, I'm thinking the easiest approach would be to clip any negative values and set them to NaN to facilitate infill as a quick fix (yes, I know, not ideal, but remember we're trying to maintain processing consistent with the train set our machine learning model was trained on).
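Those logic tests might look something like the following sketch; the threshold constant and helper names are my own assumptions for illustration, not the actual automunge implementation:

```python
import numpy as np
import pandas as pd

def boxcox_eligible(train_col, threshold=0.001):
    """Logic test: only proceed with Box-Cox if the train column is
    strictly positive, with a margin since the transform misbehaves
    approaching the asymptote at x = 0."""
    vals = pd.to_numeric(train_col, errors='coerce').dropna()
    return bool((vals > threshold).all())

def clip_for_infill(test_col, threshold=0.001):
    """Quick fix for a test set with an inconsistent distribution:
    set any values at or below the threshold to NaN so they can be
    addressed by downstream infill."""
    test_col = pd.to_numeric(test_col, errors='coerce')
    return test_col.where(test_col > threshold, np.nan)
```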

A further question we'll have to ask ourselves, other than the 'all-positive' test, is when it would be appropriate to apply this Box-Cox transformation in the first place. I'm going to make a general statement here, and I expect some may disagree, but from what I've gathered in explorations of Kaggle kernels, I think there is a tendency for machine learning practitioners to overuse graphical depictions of their data in their feature engineering analysis. I suspect this partly stems from the need to "do something", and as a bonus, graphs look cool. While I am sure there are cases where graphical depictions of data sets will reveal some useful directions, I think we can make a general statement that statistical measures will be a more robust gauge of a particular data stream's usefulness, and certainly more amenable to automation. In cases where both a column and the target labels are numerical, that statistical measure could simply be the R value correlation statistic. For instance, we could test a potential feature engineering transformation's usefulness by evaluating a correlation statistic between the column and the labels before and after the transformation to determine which is more suitable. For cases where either the column or the labels are categorical, that type of evaluation becomes more challenging, and I'm going to have to put some thought into that. In the meantime, again in the interest of keeping it simple, I think a simple solution can be had by making a hypothesis.
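As a sketch of that before-and-after evaluation for the all-numerical case (the function here is my own illustration, not an automunge API, and it assumes the column has already passed the all-positive test and contains no missing values):

```python
from scipy import stats

def boxcox_improves_correlation(column, labels):
    """Compare Pearson's R between a numerical column and numerical
    labels before and after a Box-Cox transform of the column.
    Assumes an all-positive column with no NaN entries."""
    r_before, _ = stats.pearsonr(column, labels)
    transformed, _lmbda = stats.boxcox(column)
    r_after, _ = stats.pearsonr(transformed, labels)
    return abs(r_after) > abs(r_before)
```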

This hypothesis (in short, that we can apply the transform as a supplemental, redundant address and let training sort out which version of the data is useful) I think can hold as long as we don't overdo our transformations, since obviously as our different iterations accumulate, the cost of training will grow right along with them. The intent for automunge is to develop some heuristics along the way to keep this redundancy from getting out of hand. However, for the case of the Box-Cox transformation, I am going to move forward with the assumption that this transform is universal enough that it is worth the simple redundancy of address; after all, we are tailoring the transformation to the data based on the derivation of the lambda parameter. A future iteration of the tool may re-evaluate this approach.

(This paragraph can be skipped if you’re not interested in the code:)
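Here's a rough sketch of how the derivation on the train set and the consistent application to the test set might fit together; the function names, column suffix, and returned dictionary are illustrative choices of mine, not the actual automunge internals:

```python
import numpy as np
import pandas as pd
from scipy import stats, special

THRESHOLD = 0.001  # margin above the asymptote at x = 0

def process_bxcx(df_train, df_test, column):
    """Derive a Box-Cox transform from the train set, apply the same
    lambda to the test set, and return the lambda for later reuse."""
    newcol = column + '_bxcx'
    train_vals = pd.to_numeric(df_train[column], errors='coerce')

    # eligibility test: defer to the prior normalization address
    # unless the train set is strictly positive (with margin)
    if not (train_vals.dropna() > THRESHOLD).all():
        return df_train, df_test, {'lmbda': None}

    # fit lambda by maximum likelihood on the train set
    transformed, lmbda = stats.boxcox(train_vals.dropna().values)
    df_train[newcol] = np.nan
    df_train.loc[train_vals.notna(), newcol] = transformed

    df_test = postprocess_bxcx(df_test, column, lmbda)
    return df_train, df_test, {'lmbda': lmbda}

def postprocess_bxcx(df, column, lmbda):
    """Apply a previously derived lambda to new data, clipping
    out-of-range values to NaN to facilitate infill."""
    newcol = column + '_bxcx'
    vals = pd.to_numeric(df[column], errors='coerce')
    vals = vals.where(vals > THRESHOLD, np.nan)  # quick-fix clip
    df[newcol] = np.nan
    mask = vals.notna()
    df.loc[mask, newcol] = special.boxcox(vals[mask].values, lmbda)
    return df
```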

Books that were referenced here or otherwise inspired this post:

(As an Amazon Associate I earn from qualifying purchases.)

