# Feature-Selection

**Repository Path**: algo_coding/Feature-Selection

## Basic Information

- **Project Name**: Feature-Selection
- **Description**: Feature selector based on a self-selected algorithm, loss function and validation method
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2021-11-08
- **Last Updated**: 2021-11-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# MLFeatureSelection

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![PyPI version](https://badge.fury.io/py/MLFeatureSelection.svg)](https://pypi.org/project/MLFeatureSelection/)

General feature selection based on a chosen machine learning algorithm and evaluation method.

**Diverse, flexible and easy to use**

More feature selection methods will be included in the future!

## Quick Installation

```bash
pip3 install MLFeatureSelection
```

## Modules in version 0.0.9.5.1

- Module for selecting features based on a greedy algorithm (from MLFeatureSelection import sequence_selection)
- Module for removing features based on feature importance (from MLFeatureSelection import importance_selection)
- Module for removing features based on the correlation coefficient (from MLFeatureSelection import coherence_selection)
- Module for reading a features combination from a log file (from MLFeatureSelection.tools import readlog)

## This feature selection method achieved

- **1st** in Rong360 -- https://github.com/duxuhao/rong360-season2
- **6th** in JData-2018 -- https://github.com/duxuhao/JData-2018
- **12th** in IJCAI-2018 1st round -- https://github.com/duxuhao/IJCAI-2018-2

## Module Usage

[Example](https://github.com/duxuhao/Feature-Selection/blob/master/Demo.ipynb)

- sequence_selection (a minimal setup sketch for the names used here follows the code block)

```python
from MLFeatureSelection import sequence_selection
from sklearn.linear_model import LogisticRegression

sf = sequence_selection.Select(Sequence=True, Random=True, Cross=False)
sf.ImportDF(df, label='Label')     # import dataframe and label
sf.ImportLossFunction(lossfunction, direction='ascend')  # import loss function handle and optimization direction: 'ascend' for AUC, ACC; 'descend' for logloss, etc.
sf.InitialNonTrainableFeatures(notusable)  # features that are not trainable in the dataframe: user_id, strings, etc.
sf.InitialFeatures(initialfeatures)  # initial features as a list
sf.GenerateCol()                   # generate features for selection
sf.SetFeatureEachRound(50, False)  # set the number of features each round, and how they are selected from all features (True: sample selection, False: select chunk by chunk)
sf.clf = LogisticRegression()      # set the selected algorithm; can be any algorithm
sf.SetLogFile('record.log')        # log file
sf.run(validate)                   # run with a validation function handle; returns the best features combination
```
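The snippet above refers to several names you must define yourself: `df`, `notusable`, `initialfeatures`, `lossfunction` and `validate`. Here is a minimal sketch of such a setup; the file name `train.csv`, the `user_id` column and the simple holdout split are illustrative assumptions, not part of the library:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv('train.csv')  # hypothetical dataset with a binary 'Label' column
notusable = ['user_id']        # hypothetical non-trainable columns
initialfeatures = []           # or a list of starting column names

def lossfunction(y_pred, y_test):
    # AUC, hence direction='ascend' above
    return roc_auc_score(y_test, y_pred)

def validate(X, y, features, clf, lossfunction):
    # simple holdout: train on the first 80% of rows, score on the rest
    n = int(len(X) * 0.8)
    clf.fit(X[features].iloc[:n], y.iloc[:n])
    y_pred = clf.predict_proba(X[features].iloc[n:])[:, 1]
    return lossfunction(y_pred, y.iloc[n:]), clf
```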
- importance_selection

```python
from MLFeatureSelection import importance_selection
import xgboost as xgb

sf = importance_selection.Select()
sf.ImportDF(df, label='Label')  # import dataframe and label
sf.ImportLossFunction(lossfunction, direction='ascend')  # import loss function and optimization direction
sf.InitialFeatures()            # initial features
sf.SelectRemoveMode(batch=2)
sf.clf = xgb.XGBClassifier()
sf.SetLogFile('record.log')     # log file
sf.run(validate)                # run with a validation function; returns the best features combination
```

- coherence_selection

```python
from MLFeatureSelection import coherence_selection
import xgboost as xgb

sf = coherence_selection.Select()
sf.ImportDF(df, label='Label')  # import dataframe and label
sf.ImportLossFunction(lossfunction, direction='ascend')  # import loss function and optimization direction
sf.InitialFeatures()            # initial features
sf.SelectRemoveMode(batch=2)
sf.clf = xgb.XGBClassifier()
sf.SetLogFile('record.log')     # log file
sf.run(validate)                # run with a validation function; returns the best features combination
```

- tools.readlog: read previously selected features from a log file

```python
from MLFeatureSelection.tools import readlog

logfile = 'record.log'
logscore = 0.5  # any score in the logfile
features_combination = readlog(logfile, logscore)
```

- tools.filldf: complete the dataset when there are cross-term features

```python
import pandas as pd
from MLFeatureSelection.tools import readlog, filldf

def add(x, y):
    return x + y

def subtract(x, y):
    return x - y

def times(x, y):
    return x * y

def divide(x, y):
    return x / y

CrossMethod = {'+': add,
               '-': subtract,
               '*': times,
               '/': divide,
               }  # set your own cross methods

df = pd.read_csv('XXX')
logfile = 'record.log'
logscore = 0.5  # any score in the logfile
features_combination = readlog(logfile, logscore)
df = filldf(df, features_combination, CrossMethod)
```

- format of validate and lossfunction

Define your own:

**validate**: validation method as a function, e.g. k-fold, last-time-section validation, random-sampling validation, etc.

**lossfunction**: model performance evaluation method, e.g. logloss, AUC, accuracy, etc.

```python
import numpy as np

def validate(X, y, features, clf, lossfunction):
    """Define your own validation function with five parameters:
    X, y, features, clf, lossfunction.
    clf is the classifier set earlier (e.g. sf.clf);
    lossfunction is the one imported earlier;
    features is generated automatically.
    The function returns the score and the trained classifier.
    """
    clf.fit(X[features], y)
    y_pred = clf.predict(X[features])
    score = lossfunction(y_pred, y)
    return score, clf

def lossfunction(y_pred, y_test):
    """Define your own loss function with y_pred and y_test;
    return the score.
    """
    return np.mean(y_pred == y_test)
```

## Multiple processing

Multiprocessing can be used inside your validate function when you are doing N-fold validation, running the folds in parallel.
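As one hedged illustration (not part of the library), a validate function might score five folds in parallel with the standard-library `multiprocessing.Pool`; the `_fit_one_fold` helper below is a hypothetical name, and it and `lossfunction` must live at module level so they can be pickled:

```python
import numpy as np
from multiprocessing import Pool
from sklearn.base import clone
from sklearn.model_selection import KFold

def _fit_one_fold(args):
    # hypothetical helper: train on one fold and score its holdout split
    X_tr, y_tr, X_te, y_te, clf, lossfunction = args
    clf.fit(X_tr, y_tr)
    return lossfunction(clf.predict(X_te), y_te)

def validate(X, y, features, clf, lossfunction):
    folds = KFold(n_splits=5, shuffle=True, random_state=0)
    jobs = [(X[features].iloc[tr], y.iloc[tr],
             X[features].iloc[te], y.iloc[te],
             clone(clf), lossfunction)
            for tr, te in folds.split(X)]
    with Pool(processes=5) as pool:  # one worker per fold
        scores = pool.map(_fit_one_fold, jobs)
    clf.fit(X[features], y)  # refit on all rows so a trained classifier is returned
    return float(np.mean(scores)), clf
```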
## DEMO

More examples are added in the example folder, including:

- A demo containing all modules ([demo](https://github.com/duxuhao/Feature-Selection/blob/master/Demo.ipynb))
- Simple Titanic with 5-fold validation, evaluated by accuracy ([demo](https://github.com/duxuhao/Feature-Selection/tree/master/example/titanic))
- Demo for S1, S2 score improvement in the JData 2018 purchase-time prediction competition ([demo](https://github.com/duxuhao/Feature-Selection/tree/master/example/JData2018))
- Demo for IJCAI 2018 CTR prediction ([demo](https://github.com/duxuhao/Feature-Selection/tree/master/example/IJCAI-2018))

## Function Parameters

[Parameters](https://github.com/duxuhao/Feature-Selection/blob/master/MLFeatureSelection)

## Algorithm details

[Details](https://github.com/duxuhao/Feature-Selection/blob/master/Algorithms_Graphs)