1. Overview
The goal of the problem is to predict the severity of Telstra network faults at a given time and location, based on the available log data.
The target has 3 categories: 0, 1, 2, so it is a multiclass classification problem. Different types of features are extracted from log files and other sources: event_type.csv, log_feature.csv, resource_type.csv, severity_type.csv.
My final score is 0.44917 (72nd of 974) on the private leaderboard. Here is my code, and below I record my solution.
2. Feature Engineering
Selecting and designing good features is an important part of machine learning, known as feature engineering.
At first, I simply merged all the files on id. There are five types of features: location, severity_type, resource_type, event_type, and log_feature. location and severity_type are one-to-one with id, while the others are one-to-many. There are about 1,200 locations, some of which appear only in the test set. The correlation between the location variable and the target is 0.27, which hinted that neighbouring locations may behave similarly with respect to fault severity. I one-hot encoded the one-to-many features, which gave about 400+ features at this initial stage.
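Here is a minimal pandas sketch of this merge-and-encode step, assuming the raw competition CSVs are in the working directory; column names follow the competition files, but treat the details as illustrative rather than my exact pipeline.

```python
import pandas as pd

# Load the raw competition files (one row per id in train/severity_type,
# several rows per id in event_type/resource_type/log_feature).
train = pd.read_csv("train.csv")               # id, location, fault_severity
severity = pd.read_csv("severity_type.csv")    # id, severity_type
event = pd.read_csv("event_type.csv")          # id, event_type
resource = pd.read_csv("resource_type.csv")    # id, resource_type
log_feat = pd.read_csv("log_feature.csv")      # id, log_feature, volume

# One-to-one tables can be merged directly on id.
df = train.merge(severity, on="id", how="left")

# One-to-many tables are pivoted into one-hot / volume columns, one row per id.
event_ohe = pd.crosstab(event["id"], event["event_type"]).add_prefix("evt_")
resource_ohe = pd.crosstab(resource["id"], resource["resource_type"]).add_prefix("res_")
log_ohe = log_feat.pivot_table(index="id", columns="log_feature",
                               values="volume", aggfunc="sum",
                               fill_value=0).add_prefix("log_")

for wide in (event_ohe, resource_ohe, log_ohe):
    df = df.merge(wide, left_on="id", right_index=True, how="left")

df = df.fillna(0)
print(df.shape)   # roughly 400+ feature columns after encoding
```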
I read a paper on preprocessing high-cardinality categorical attributes in classification and regression problems, but it seemed to bring little improvement. I will try it again after the competition.
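For reference, a hedged sketch of the paper's core idea, smoothing each category's target mean toward the global prior; the function name, the smoothing constant `k`, and the use of location are illustrative choices, not my exact setup.

```python
import pandas as pd

def smoothed_target_mean(train_df, col, target, k=10.0):
    """Blend each category's target mean with the global mean.

    Categories with few observations are pulled toward the prior, which is
    the key idea behind high-cardinality categorical encoding.
    """
    prior = train_df[target].mean()
    stats = train_df.groupby(col)[target].agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + k)
    return weight * stats["mean"] + (1.0 - weight) * prior

# For the 3-class target, one smoothed probability per class (illustrative):
# for c in (0, 1, 2):
#     ind = (train["fault_severity"] == c).astype(int)
#     enc = smoothed_target_mean(train.assign(ind=ind), "location", "ind")
#     train[f"loc_p{c}"] = train["location"].map(enc)
```

In practice this kind of encoding should be fit out-of-fold to avoid target leakage, which may be why it brought little benefit in my quick attempt.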
On the forum, there was a heated discussion about a "magic feature", and I spent much of my time trying to find it.
It is the order of records for the same location in the severity_type file, which follows the order of fault occurrence; I call it the intra-location order.
It helped the final score a great deal, improving it by almost 0.06. For each record, I computed the target (fault_severity) probabilities from the previous records at the same location, and used these previous-target probabilities as a feature when building the model.
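A sketch of how this feature can be built, assuming severity_type.csv is read in its original row order and location comes from train.csv/test.csv; the column names I introduce (loc_order, prev_p0, etc.) are mine, and the expanding-mean step is an approximation of my actual computation.

```python
import pandas as pd

# severity_type.csv keeps the original record order; within each location
# that order follows the time of fault occurrence (the "magic feature").
sev = pd.read_csv("severity_type.csv")          # id, severity_type
sev["file_order"] = range(len(sev))

loc = pd.concat([pd.read_csv("train.csv")[["id", "location"]],
                 pd.read_csv("test.csv")[["id", "location"]]])
sev = sev.merge(loc, on="id", how="left").sort_values("file_order")

# Intra-location order: rank of each record inside its location,
# plus a roughly 0-1 normalised version.
sev["loc_order"] = sev.groupby("location").cumcount()
sev["loc_order_norm"] = sev["loc_order"] / sev.groupby("location")["id"].transform("count")

# Previous-target probabilities: for each record, the class frequencies of
# the earlier records at the same location (expanding mean of indicators,
# shifted so the current record is excluded). Test rows contribute nothing.
train = pd.read_csv("train.csv")[["id", "fault_severity"]]
sev = sev.merge(train, on="id", how="left")
for c in (0, 1, 2):
    ind = (sev["fault_severity"] == c).astype(float)
    ind = ind.where(sev["fault_severity"].notna())    # NaN for test rows
    sev[f"prev_p{c}"] = (ind.groupby(sev["location"])
                            .transform(lambda s: s.expanding().mean().shift(1)))
```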
3. Building Models and Ensembling
I tried many models: decision trees, random forests, SVM, and xgboost. The xgboost model performed well. For ensembling, I simply averaged the random forest and xgboost predictions to produce the final result.
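A hedged sketch of that final blend, assuming a feature matrix `X`/`X_test` and the 3-class target `y` (names of mine); the hyperparameters are placeholders, not my tuned values.

```python
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier

# Two base models trained on the same feature matrix.
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=5,
                            n_jobs=-1, random_state=0)
rf.fit(X, y)

xgb_clf = xgb.XGBClassifier(objective="multi:softprob", n_estimators=400,
                            max_depth=6, learning_rate=0.05,
                            subsample=0.9, colsample_bytree=0.8,
                            random_state=0)
xgb_clf.fit(X, y)

# Simple ensemble: average the predicted class probabilities.
pred = 0.5 * rf.predict_proba(X_test) + 0.5 * xgb_clf.predict_proba(X_test)
pred /= pred.sum(axis=1, keepdims=True)   # renormalise, just in case
```

Since the competition metric was multiclass log loss, averaging probabilities (rather than hard labels) is the natural way to blend the two models.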
4. What I learned from other kagglers
On the platform, kagglers are willing to share their ideas. Here are some valuable ideas from the competition forum.
Some useful tips on feature engineering
It is the most important step in machine learning, whatever model you use.
- About location
  - Similar location numbers have similar fault severity, so treat location as numeric. Don't one-hot encode location; tree-based classifiers are not good at handling huge sparse feature matrices.
  - Frequency of LogType_203 = 0 and LogType_203 > 0 per location.
  - The records in the log data are arranged in time order (the magic feature). There are two ways to encode this information: for each location, use the row number from 1 to the total number of rows for that location, or normalize it between 0 and 1.
  - Percentile transformation of location counts.
- About log feature
  - Pattern of log feature: "one hot" encode all log features with volume > 0, for all rows; treat each "one hot" encoded pattern as a string and assign an integer ID to each string, used as a feature.
  - Log transform of the count of each "pattern of log feature", and log transforms of the counts of each "pattern of event" and "pattern of resource".
- Common categorical variables
  - For high-cardinality categorical variables, frequency encoding works well (add the frequency of each location across both the train and test sets).
  - Summary statistics to reduce one-to-many relationships to one-to-one, and two-way or higher-order interactions among multiple variables.
  - Meta features (use logistic regression to fit the sparse matrix as predictors, then ensemble the model).
- A useful solution (a sketch follows after the quoted steps below)
> a. The order of id in log_feature was frozen.
> b. Location and log_feature were converted into numbers, and count, mean, sum, etc. features were generated (feature set A).
> c. Feature set B was generated by shifting A forward by 1 row.
> d. Feature set C was generated by shifting B backward by 1 row.
> Combining A, B, and C and training xgb, RF, and GBM models; the final model is an ensemble of these models.
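A sketch of how steps b-d might be implemented (the quoted poster's code is not available, so the column names and aggregations are my guesses). I read the forward/backward shifts as previous-row and next-row context features in the frozen log_feature order.

```python
import pandas as pd

# Step a: keep the frozen row order of log_feature.csv.
log_feat = pd.read_csv("log_feature.csv")       # id, log_feature, volume

# Step b: turn the log_feature code into a number and aggregate
# count / mean / sum style statistics per id (feature set A).
log_feat["log_num"] = log_feat["log_feature"].str.extract(r"(\d+)", expand=False).astype(int)
A = log_feat.groupby("id", sort=False).agg(     # sort=False keeps the frozen id order
    n_logs=("log_num", "count"),
    log_mean=("log_num", "mean"),
    vol_sum=("volume", "sum"),
    vol_mean=("volume", "mean"),
).reset_index()

# Steps c-d: neighbouring-row context in the frozen order
# (previous and next id's statistics as extra feature sets B and C).
B = A.drop(columns="id").shift(1).add_prefix("prev_")
C = A.drop(columns="id").shift(-1).add_prefix("next_")

features = pd.concat([A, B, C], axis=1)
```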