AberdeenProject.core package¶
Submodules¶
AberdeenProject.core.dataPipelines module¶
-
class
AberdeenProject.core.dataPipelines.FullPipeline[source]¶ Bases:
object
This class chains multiple pipelines (the missing values pipeline, the preprocessing pipeline, …) into a single pipeline.
-
addPipeline(pipeline)[source]¶ This function adds a given pipeline to the full pipeline
Parameters: pipeline (Pipeline) – Data pipeline to be added to the full pipeline
-
columns= None¶
-
fit_transform(data)[source]¶ This function feeds the data to the full pipeline
Parameters: data (Pandas dataframe) – Pandas dataframe to be transformed
Returns: Transformed Pandas dataframe
Return type: Pandas dataframe
-
classmethod
initialize(data)[source]¶ This function initializes the full pipeline with the Pandas dataframe. The column names will be stored as a class attribute and then recovered when needed.
Parameters: data (Pandas dataframe) – Pandas dataframe needed for the initialization
-
pipelines= []¶
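The chaining described for this class can be sketched as follows. This is a minimal illustration, not the project's implementation: `ChainedPipeline`, `FillZeros` and `Double` are hypothetical names, and the sketch assumes that each step exposes `fit_transform(data)` and may return a plain array, which is why the stored column names are reapplied.

```python
import pandas as pd


class ChainedPipeline:
    """Hypothetical stand-in for FullPipeline; not the project's code."""

    pipelines = []   # class-level list, shared by all instances
    columns = None   # column names captured by initialize()

    @classmethod
    def initialize(cls, data):
        # Remember the column names and start from an empty chain.
        cls.columns = list(data.columns)
        cls.pipelines = []

    def addPipeline(self, pipeline):
        # Any object exposing fit_transform(data) is accepted in this sketch.
        type(self).pipelines.append(pipeline)

    def fit_transform(self, data):
        for pipeline in self.pipelines:
            data = pipeline.fit_transform(data)
            # Steps may return plain arrays; restore the stored column names.
            if not isinstance(data, pd.DataFrame):
                data = pd.DataFrame(data, columns=self.columns)
        return data


class FillZeros:
    """Hypothetical step that returns a NumPy array (column names are lost)."""

    def fit_transform(self, data):
        return data.fillna(0).to_numpy()


class Double:
    """Hypothetical step that keeps the dataframe type."""

    def fit_transform(self, data):
        return data * 2


frame = pd.DataFrame({"a": [1.0, None], "b": [3.0, 4.0]})
ChainedPipeline.initialize(frame)
full = ChainedPipeline()
full.addPipeline(FillZeros())
full.addPipeline(Double())
result = full.fit_transform(frame)
print(result)
```

Because `pipelines` and `columns` are class attributes, every instance sees the same chain. The sketch resets the list in `initialize` so that repeated runs do not accumulate steps.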
-
-
class
AberdeenProject.core.dataPipelines.MissingValuesPipeline[source]¶ Bases:
object
This class provides a pipeline for completing missing values.
-
addSimpleImputerPipeline(column, strategy='most_frequent')[source]¶ This function adds a basic imputation strategy for a column: missing values are replaced either with a provided constant value or with a statistic (mean, median or most frequent value) of the column in which they are located.
Parameters: - column (String) – Column of the Pandas dataframe
- strategy (String, optional) – Strategy of the imputation, defaults to “most_frequent”
-
allowed_strategies= ('mean', 'median', 'most_frequent', 'constant')¶
-
buildPipeline()[source]¶ This function builds the missing values pipeline and prepares it to be fed with the Pandas dataframe.
-
fit_transform(dataframe, remainder='passthrough', parallelize=True)[source]¶ Fit to data, then transform it
Parameters: - dataframe (Pandas dataframe) – Pandas dataframe to be passed through the missing values pipeline
- remainder (String, optional) – With remainder='passthrough', all remaining columns that were not specified in transformers are automatically passed through. Otherwise, non-specified columns are dropped. Defaults to “passthrough”
- parallelize (Boolean, optional) – Parallelize the job using all processors, defaults to True
Returns: Pandas dataframe without missing values
Return type: Pandas dataframe
-
pipelines= {}¶
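The `remainder` and `parallelize` parameters resemble the `remainder` and `n_jobs` options of scikit-learn's `ColumnTransformer`. The following sketch assumes, without confirmation from this page, that the missing values pipeline is built on `ColumnTransformer` and `SimpleImputer`; the column names and the `parallelize` variable are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

frame = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["Aberdeen", np.nan, "Aberdeen"],
    "id": [1, 2, 3],
})

parallelize = False  # set to True to use all processors (n_jobs=-1)

transformer = ColumnTransformer(
    transformers=[
        # One imputer per column, each with its own strategy.
        ("age_mean", SimpleImputer(strategy="mean"), ["age"]),
        ("city_mode", SimpleImputer(strategy="most_frequent"), ["city"]),
    ],
    remainder="passthrough",  # keep "id" instead of dropping it
    n_jobs=-1 if parallelize else None,
)

values = transformer.fit_transform(frame)
# Output order: the transformed columns first, then the passed-through ones.
imputed = pd.DataFrame(values, columns=["age", "city", "id"])
print(imputed)
```

With `remainder="drop"` the `id` column would be absent from the output, which is the behaviour the `remainder` parameter above describes.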
-
-
class
AberdeenProject.core.dataPipelines.PreprocessingPipeline[source]¶ Bases:
object
This class provides a pipeline for data preprocessing, such as one-hot encoding.
-
addOnehotEncoderPipeline(column, strategy=None)[source]¶ Encode a categorical feature as a one-hot numeric array
Parameters: - column (String) – Column of the Pandas dataframe
- strategy (String or None, optional) – Preprocessing strategy, defaults to None
-
allowed_strategies= ('one_hot', 'None')¶
-
buildPipeline()[source]¶ This function builds the preprocessing pipeline and prepares it to be fed with the Pandas dataframe.
-
fit_transform(dataframe)[source]¶ Fit to data, then transform it
Parameters: dataframe (Pandas dataframe) – Pandas dataframe to be passed through the preprocessing pipeline
Returns: Transformed Pandas dataframe
Return type: Pandas dataframe
-
pipelines= {}¶
-
AberdeenProject.core.dataframeCreator module¶
-
class
AberdeenProject.core.dataframeCreator.DataframeCreator[source]¶ Bases:
object
This class contains all methods and attributes needed to create a single Pandas dataframe from features and labels chosen by the user.
-
createDataframe()[source]¶ Every csv-file in the directory “data/” is read and transformed into a Pandas dataframe. After that, all the dataframes are concatenated vertically based on the variable “joinBasedOn” given by the user in the configuration yaml file. To avoid repeating this process every time the code is run, the dataframe is pickled and stored as a pickle file in the directory “pickeledData/”. This function is parallelized over all available CPUs using the multiprocessing library.
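The read, concatenate and cache workflow can be sketched as below. This is an illustration under several assumptions: `load_or_build` is a hypothetical helper, the role of “joinBasedOn” is not modelled, and a thread pool from `multiprocessing.pool` stands in for the process-based parallelism so that the snippet runs unchanged in any environment.

```python
import os
import tempfile
from multiprocessing.pool import ThreadPool

import pandas as pd


def load_or_build(data_dir, pickle_path):
    """Hypothetical helper: read all CSV files once, then reuse the pickle."""
    if os.path.exists(pickle_path):
        return pd.read_pickle(pickle_path)
    files = sorted(
        os.path.join(data_dir, name)
        for name in os.listdir(data_dir)
        if name.endswith(".csv")
    )
    # Read the files concurrently (threads here, not processes).
    with ThreadPool() as pool:
        frames = pool.map(pd.read_csv, files)
    combined = pd.concat(frames, ignore_index=True)  # vertical concatenation
    combined.to_pickle(pickle_path)
    return combined


with tempfile.TemporaryDirectory() as workdir:
    pd.DataFrame({"x": [1, 2]}).to_csv(os.path.join(workdir, "a.csv"), index=False)
    pd.DataFrame({"x": [3]}).to_csv(os.path.join(workdir, "b.csv"), index=False)
    cache = os.path.join(workdir, "frame.pkl")
    first = load_or_build(workdir, cache)   # reads the CSV files
    second = load_or_build(workdir, cache)  # served from the pickle
```

The second call skips the CSV parsing entirely, which is the saving the pickle file is meant to provide.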
-
pickleDataframe(dataframe, pickledDataFile)[source]¶ This function saves a given Pandas dataframe as a pickle file
Parameters: - dataframe (Pandas dataframe) – Pandas dataframe to be saved
- pickledDataFile (String) – Path to the desired location of the pickle file
-
readCsvFile(file)[source]¶ A wrapper function to wrap the read_csv function from Pandas
Parameters: file (String) – Path to the csv-file to be read
Returns: The csv-file as a two-dimensional data structure with labeled axes.
Return type: Pandas dataframe
-
threshholdFiltering()[source]¶ This function filters the data based on the value of “threshold” given by the user in the configuration yaml file. An entire column is ignored if its number of missing values exceeds the threshold. The dataframe that results from the threshold filtering is pickled and stored in the directory “pickeledData/”.
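A minimal sketch of this filtering rule, assuming that the threshold is an absolute count of missing values per column (the configuration file may define it differently, for example as a fraction):

```python
import pandas as pd

frame = pd.DataFrame({
    "mostly_missing": [None, None, None, 1.0],
    "complete": [1, 2, 3, 4],
    "one_missing": [1.0, None, 3.0, 4.0],
})
threshold = 2  # hypothetical value: maximum tolerated missing values per column

missing_counts = frame.isna().sum()
# Keep only the columns whose missing-value count does not exceed the threshold.
filtered = frame.loc[:, missing_counts <= threshold]
print(list(filtered.columns))
```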
-
AberdeenProject.core.modelFitting module¶
-
class
AberdeenProject.core.modelFitting.Model(dataframe, algorithm, **parameters)[source]¶ Bases:
object
This class extracts the most important features from a dataframe and then creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
-
bestModel()[source]¶ This function finds the optimal depth for the decision tree model by fitting trees of various depths on the training data and choosing the optimal depth using cross-validation
Returns: Optimal decision tree depth
Return type: Integer
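Depth selection by cross-validation can be sketched as follows. The depth range, the number of folds and the iris dataset are illustrative choices and are not taken from the project.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

mean_scores = {}
for depth in range(1, 8):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Mean accuracy over 5 folds for this candidate depth.
    mean_scores[depth] = cross_val_score(tree, X, y, cv=5).mean()

# The depth with the highest mean cross-validated score wins.
best_depth = max(mean_scores, key=mean_scores.get)
print(best_depth, mean_scores[best_depth])
```

scikit-learn's `GridSearchCV` performs the same search with less code; the explicit loop is used here only to make the selection rule visible.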
-
buildRules(out_file)[source]¶ This function exports decision tree rules to an image that can be interpreted by the user
Parameters: out_file (String) – Name of the output file
-
crossValidation(cv)[source]¶ This function reports the performance measure by k-fold cross-validation
Parameters: cv (Integer) – Number of folds
-
crossValidationScores(cv)[source]¶ This function evaluates a score by cross-validation
Parameters: cv (Integer) – Number of folds
Returns: Cross validation scores
Return type: List
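The two cross-validation methods above correspond to obtaining the per-fold scores and reporting a summary of them. A sketch with scikit-learn's `cross_val_score` and a logistic regression model, assuming accuracy as the performance measure (the page does not state which measure is used):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = 5  # number of folds
scores = cross_val_score(model, X, y, cv=cv)

score_list = scores.tolist()  # one score per fold
summary = f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}"
print(summary)
```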
-
decisionTreeToGraphiz(out_file, feature_names, class_names)[source]¶ This function generates a GraphViz representation of the decision tree in DOT format
Parameters: - out_file (String) – Name of the output file
- feature_names (List) – Names of each of the features
- class_names (List) – Names of the target classes
-
decisionTreeToPng(out_file, feature_names, class_names)[source]¶ This function wraps up self.decisionTreeToGraphiz and self.graphvizToPng
Parameters: - out_file (String) – Name of the output file
- feature_names (List) – Names of each of the features
- class_names (List) – Names of the target classes
-
fit()[source]¶ This function builds the model from the training set (self.X, self.y). If the chosen algorithm is Decision Tree, then the optimal depth from the function “bestModel” will be used
-
graphvizToPng(out_file)[source]¶ Graphical rendering of the decision tree rules from the DOT file
Parameters: out_file (String) – Name of the output file
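The two steps, exporting the tree to DOT and rendering the DOT file as a PNG, can be sketched with scikit-learn's `export_graphviz` and the Graphviz `dot` command. The file names are placeholders, and the rendering step runs only when the `dot` executable is installed. How the project invokes Graphviz is not stated on this page.

```python
import os
import shutil
import subprocess
import tempfile

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

workdir = tempfile.mkdtemp()
dot_path = os.path.join(workdir, "tree.dot")
png_path = os.path.join(workdir, "tree.png")

# Step 1: write the decision rules in DOT format.
export_graphviz(
    tree,
    out_file=dot_path,
    feature_names=list(iris.feature_names),
    class_names=[str(name) for name in iris.target_names],
    filled=True,
)

# Step 2: render the DOT file as a PNG, if Graphviz is available.
if shutil.which("dot"):
    subprocess.run(["dot", "-Tpng", dot_path, "-o", png_path], check=True)
```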
-
implemented_algorithms= ('decisionTree', 'logisticRegression')¶
-