For our third project of the Data Science program, we were tasked with identifying a stakeholder and a business problem, then solving that problem using predictive modeling. Our stakeholder is a hydroponic farming start-up known as Square Roots. The start-up spends a considerable amount of time and resources maintaining and monitoring its irrigation systems. In our project’s scenario, Square Roots is seeking to future-proof its irrigation systems by monitoring the operation of mechanical components through various sensor data. Our project seeks to provide Square Roots with recommendations regarding which sensors provide the best predictive data for understanding the maintenance condition of its hydraulic pumps. By implementing these recommendations, Square Roots will be able to efficiently recognize the characteristics that indicate an issue and, in turn, troubleshoot before any faults take place.

The data set is generally structured as follows:

To illustrate the data set’s attributes and target variables, refer to the tables below:

Summary Table of Data Target Variables and Corresponding Classifiers and Descriptions

Given our data set and the information provided by the tests, our ultimate goal is to use the sensor data (such as temperature, tank pressure, vibration magnitude, etc.) to predict the state of the hydraulic pump. As previously mentioned, each row represents one full cycle and each column represents one sample (in this case 1 second) of readings from the temperature sensor. To create features from this data, we aggregated each row so that we could review the sensor data both over each full 60-second test and in 20-second increments. The table below compares our ‘raw’ data (prior to any preparation) with our ‘transformed’ data after preparation.
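The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `aggregate_cycle`, the feature names, and the sample readings are all hypothetical, and it assumes one row holds 60 one-second readings from a single sensor.

```python
from statistics import mean


def aggregate_cycle(readings, n_segments=3):
    """Summarise one cycle (one row of per-second readings).

    Returns the mean over the full cycle plus the mean of each
    equal-length segment (20 seconds each for a 60-second cycle).
    Any leftover readings that do not fill a segment are ignored.
    """
    seg_len = len(readings) // n_segments
    features = {"mean_full": mean(readings)}
    for i in range(n_segments):
        segment = readings[i * seg_len:(i + 1) * seg_len]
        features[f"mean_seg{i + 1}"] = mean(segment)
    return features


# Hypothetical 60-second temperature cycle (values invented for illustration):
# 20 s at 30 degrees, 20 s at 33 degrees, 20 s at 36 degrees.
cycle = [30.0] * 20 + [33.0] * 20 + [36.0] * 20
features = aggregate_cycle(cycle)
print(features)
```

Applying a function like this to every row turns each 60-column cycle into a handful of summary columns, which is the shape the models later in this post consume.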

Conversion of ‘Raw’ data to ‘Transformed’ data

Given the multiple target variables and the multiple classes within most of these variables, the modeling performed within this analysis focuses on optimizing accuracy, the weighted F1 score, and the ROC-AUC score.
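To make the first two metrics concrete, here is a small from-scratch sketch of accuracy and the weighted F1 score for a multi-class target. The class labels and predictions are invented for illustration. In practice a library routine such as scikit-learn's `f1_score` with `average="weighted"` is the usual choice, and this sketch is only meant to show what that kind of score measures. ROC-AUC is left out because it needs predicted probabilities rather than hard labels.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Hypothetical labels for one target variable (not taken from the data set).
y_true = ["ok", "ok", "ok", "lag", "lag", "fail"]
y_pred = ["ok", "ok", "lag", "lag", "lag", "fail"]

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
print(acc, wf1)
```

Weighting by support matters when the classes are imbalanced: a class with few examples contributes less to the final score than a class with many.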

As previously mentioned, the data set includes five target variables. Through our evaluation and considering the stakeholder, we determined that each of these target variables was vital to our final recommendation. As a result, several models were created, about two for each target variable. Depending on the target variable, different features were used, including simple averages of the 60-second cycle, the average change over the course of the cycle, the average and change for every 20 seconds of the cycle, and the standard deviation of both the full 60-second cycle and every 20 seconds. To begin, however, we used a simple logistic regression model as our baseline.

Baseline Model

The baseline model was a simple logistic regression that included all X-variables and used ‘Valve Condition’ as the target variable. The baseline model reported the following scores:

Although the scores were relatively high for certain target variables, we felt that with a few more iterations we would be able to optimize scores for all variables.

Given the data, the stakeholder’s business problem, and the performance of our baseline model, we felt it would be most efficient for us to run a grid search on several different model types to determine which produced the highest accuracy.

First Model

To start, we used a simple average of the test cycles as the main feature. Once we determined the top-performing models, we would perform grid searches with these models and repeat the process for other combinations of target variables and features. For our first model, we decided to use Valve Condition as our target variable and the average metrics of each cycle (simple average) as our features. We evaluated six different models: a logistic regression model, a decision tree model, a random forest model, a K-nearest neighbors (KNN) model, a support vector machine model, and an XGBoost model. We then ran a grid search for each of these models to find the hyperparameters that produce the highest accuracy scores. As shown below, we recorded relatively high scores across the board:

Given the scores, we decided to repeat this process with other feature and target variable pairings, but ultimately ran only KNN and XGBoost models. In total, we ran about 30 different models.
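For readers unfamiliar with grid search, the idea is simply to try every combination of candidate hyperparameters and keep the one with the best cross-validated score. The sketch below shows this with a tiny hand-written KNN classifier so that it runs without any dependencies. The data, the grid values, and the helper names (`knn_predict`, `cv_accuracy`, `grid_search`) are all hypothetical, and a library tool such as scikit-learn's `GridSearchCV` would normally do this work.

```python
from collections import Counter
from itertools import product


def knn_predict(train_X, train_y, x, k, p):
    """Majority vote among the k nearest points (Minkowski distance of order p)."""
    dists = sorted(
        (sum(abs(a - b) ** p for a, b in zip(row, x)) ** (1 / p), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, k, p, n_folds=3):
    """Mean accuracy over n_folds interleaved cross-validation folds."""
    scores = []
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        hits = sum(
            knn_predict(train_X, train_y, X[i], k, p) == y[i] for i in test_idx
        )
        scores.append(hits / len(test_idx))
    return sum(scores) / n_folds


def grid_search(X, y, grid):
    """Score every combination in grid; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(X, y, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical features (cycle mean, cycle standard deviation) and labels.
X = [[30.0, 1.0], [30.5, 1.1], [31.0, 0.9], [29.5, 1.0], [30.2, 1.2], [30.8, 1.0],
     [40.0, 2.0], [40.5, 2.1], [41.0, 1.9], [39.5, 2.0], [40.2, 2.2], [40.8, 2.0]]
y = ["optimal"] * 6 + ["degraded"] * 6

best_params, best_score = grid_search(X, y, {"k": [1, 3, 5], "p": [1, 2]})
print(best_params, best_score)
```

Because ties are broken in favour of the first combination tried, the order of values in the grid affects which equally good setting is reported.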

Final Results

With 30 models to choose from, we arrived at the below ‘final’ models for each target variable:

As shown above, we ultimately decided to go with the XGBoost model for our final iteration and the average of cycle thirds as the feature set for each target variable except Cooler Condition. For Cooler Condition, we decided to use the standard deviation of cycle thirds as the feature set, due to its consistently high scores across all of our metrics.

Considering the above analysis, we recommend that the stakeholder use an XGBoost predictive model, which was the most accurate across the numerous models and iterations we ran. Further, we recommend using the model to predict a pump’s cooler condition and internal pump leakage, since these predictive models generated the highest accuracy scores (99%+). While the accuracy of these models is high, there are reasons they may not fully solve the business problem. The data we used was collected from a single test rig, meaning the environment in which the test rig produced the data was carefully selected by the test coordinators. Therefore, situations that cause leaks or other faults with the pumps, such as human error or other extreme conditions, may not have been accounted for.

Further criteria and analyses could yield additional insights to further inform the stakeholder by:
