AberdeenProject.core package¶

Submodules¶

AberdeenProject.core.dataPipelines module¶

class AberdeenProject.core.dataPipelines.FullPipeline[source]¶

Bases: object

This class chains multiple pipelines (the missing values pipeline, the preprocessing pipeline, …) into one single pipeline.

addPipeline(pipeline)[source]¶

This function adds a given pipeline to the full pipeline

Parameters:	pipeline (Pipeline) – Data pipeline to be added to the full pipeline

columns = None¶

fit_transform(data)[source]¶

This function feeds the data to the full pipeline

Parameters:	data (Pandas dataframe) – Pandas dataframe to be transformed
Returns:	Transformed Pandas dataframe
Return type:	Pandas dataframe

classmethod initialize(data)[source]¶

This function initializes the full pipeline with the Pandas dataframe. The column names will be stored as a class attribute and then recovered when needed.

Parameters:	data (Pandas Dataframe) – Pandas dataframe needed for the initialization

pipelines = []¶

classmethod recoverColumnsNames()[source]¶

This function returns the column names of the Pandas dataframe

Returns:	Column names of the Pandas dataframe
Return type:	List
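
The chaining behaviour described for FullPipeline can be sketched roughly as follows. This is an illustration, not the project's code: `ChainSketch`, `FillZeros` and `DoubleValues` are hypothetical names, and the sketch assumes that every stage exposes a `fit_transform` method. It also shows that `columns` and `pipelines`, being class attributes, are shared by all instances, which is why the sketch resets them in `initialize`.

```python
import pandas as pd


class ChainSketch:
    """Hypothetical stand-in for FullPipeline; not the project's code."""

    columns = None   # column names captured at initialization
    pipelines = []   # stages, applied in insertion order

    @classmethod
    def initialize(cls, data):
        # Remember the column names and start from an empty chain.
        cls.columns = list(data.columns)
        cls.pipelines = []

    @classmethod
    def recover_columns_names(cls):
        return cls.columns

    def add_pipeline(self, pipeline):
        type(self).pipelines.append(pipeline)

    def fit_transform(self, data):
        # Feed the output of each stage into the next one.
        for stage in self.pipelines:
            data = stage.fit_transform(data)
        return data


class FillZeros:
    """Toy stage: replace missing values with 0."""

    def fit_transform(self, data):
        return data.fillna(0)


class DoubleValues:
    """Toy stage: multiply every value by 2."""

    def fit_transform(self, data):
        return data * 2


frame = pd.DataFrame({"a": [1.0, None], "b": [3.0, 4.0]})
ChainSketch.initialize(frame)
chain = ChainSketch()
chain.add_pipeline(FillZeros())
chain.add_pipeline(DoubleValues())
result = chain.fit_transform(frame)
print(result)
print(ChainSketch.recover_columns_names())
```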

class AberdeenProject.core.dataPipelines.MissingValuesPipeline[source]¶

Bases: object

This class provides a pipeline for completing missing values.

addSimpleImputerPipeline(column, strategy='most_frequent')[source]¶

This function provides basic strategies for imputing missing values: they can be replaced with a provided constant value or with a statistic (mean, median, or most frequent value) of the column in which they are located.

Parameters:	column (String) – Column of the Pandas dataframe
	strategy (String, optional) – Strategy of the imputation, defaults to “most_frequent”

allowed_strategies = ('mean', 'median', 'most_frequent', 'constant')¶

buildPipeline()[source]¶

This function builds the missing values pipeline and prepares it to be fed with the Pandas dataframe.

fit_transform(dataframe, remainder='passthrough', parallelize=True)[source]¶

Fit to data, then transform it

Parameters:	dataframe (Pandas dataframe) – Pandas dataframe to be passed through the missing values pipeline
	remainder (String, optional) – With remainder=“passthrough”, all remaining columns that were not specified in transformers are automatically passed through. Otherwise, non-specified columns are dropped. Defaults to “passthrough”
	parallelize (Boolean, optional) – Parallelize the job using all processors, defaults to True
Returns:	Pandas dataframe without missing values
Return type:	Pandas dataframe
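
The parameters above resemble those of scikit-learn's `ColumnTransformer`, so one plausible way to picture the missing values pipeline is the sketch below. It is an assumption that the class is built on `SimpleImputer` and `ColumnTransformer`; `impute_columns` and the sample columns are hypothetical. The mapping of `parallelize=True` to `n_jobs=-1` is likewise assumed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer


def impute_columns(dataframe, strategies, remainder="passthrough", parallelize=True):
    """Impute each listed column with its own strategy (illustrative helper)."""
    transformers = [
        (f"impute_{column}", SimpleImputer(strategy=strategy), [column])
        for column, strategy in strategies.items()
    ]
    transformer = ColumnTransformer(
        transformers,
        remainder=remainder,
        n_jobs=-1 if parallelize else None,  # -1 means "use all processors"
    )
    values = transformer.fit_transform(dataframe)
    # ColumnTransformer emits the transformed columns first, then the remainder.
    names = list(strategies)
    if remainder == "passthrough":
        names += [c for c in dataframe.columns if c not in strategies]
    return pd.DataFrame(values, columns=names, index=dataframe.index)


frame = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "age": [20.0, np.nan, 20.0, 35.0],
        "height": [1.7, 1.8, np.nan, 1.6],
    }
)
clean = impute_columns(
    frame, {"age": "most_frequent", "height": "mean"}, parallelize=False
)
print(clean)
```

Note that the column order changes: imputed columns come first and passed-through columns follow, so column names have to be reassigned after the transformation.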

pipelines = {}¶

class AberdeenProject.core.dataPipelines.PreprocessingPipeline[source]¶

Bases: object

This class provides a pipeline for data preprocessing like one-hot encoding.

addOnehotEncoderPipeline(column, strategy=None)[source]¶

Encode categorical feature as a one-hot numeric array

Parameters:	column (String) – Column of the Pandas dataframe
	strategy (String or None, optional) – Preprocessing strategy, defaults to None

allowed_strategies = ('one_hot', 'None')¶

buildPipeline()[source]¶

This function builds the preprocessing pipeline and prepares it to be fed with the Pandas dataframe.

fit_transform(dataframe)[source]¶

Fit to data, then transform it

Parameters:	dataframe (Pandas dataframe) – Pandas dataframe to be passed through the preprocessing pipeline
Returns:	Transformed Pandas dataframe
Return type:	Pandas dataframe
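
As a rough illustration of what the “one_hot” strategy produces, the sketch below encodes a single categorical column with scikit-learn's `OneHotEncoder`. Whether the project uses this exact encoder is an assumption, and the `color` column is made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

frame = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

encoder = OneHotEncoder()
# The default output is a SciPy sparse matrix; densify it for display.
encoded = encoder.fit_transform(frame[["color"]]).toarray()
categories = list(encoder.categories_[0])  # categories are sorted

one_hot = pd.DataFrame(
    encoded,
    columns=[f"color_{category}" for category in categories],
    index=frame.index,
)
print(one_hot)
```

Each row contains exactly one 1.0, in the column that matches the original category.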

pipelines = {}¶

AberdeenProject.core.dataframeCreator module¶

class AberdeenProject.core.dataframeCreator.DataframeCreator[source]¶

Bases: object

This class contains all methods and attributes needed in order to create a single Pandas dataframe from features and labels chosen by the user.

convertPklToCsv()[source]¶

This function converts pkl files in the “data/” directory into csv files.

createDataframe()[source]¶

Every csv file in the “data/” directory is read and transformed into a Pandas dataframe. After that, all the dataframes are concatenated vertically based on the variable “joinBasedOn” given by the user in the configuration yaml file. To avoid repeating this process every time the code is run, the dataframe is pickled and stored as a pickle file in the directory “pickeledData/”. This function is parallelized over all available CPUs using the multiprocessing library.

pickleDataframe(dataframe, pickledDataFile)[source]¶

This function saves a given Pandas dataframe as a pickle file

Parameters:	dataframe (Pandas dataframe) – Pandas dataframe to be saved
	pickledDataFile (String) – Path to the desired location of the pickle file
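
The reason for pickling, avoiding a rebuild of the dataframe on every run, amounts to a simple cache. The following is a minimal sketch of that pattern using `DataFrame.to_pickle` and `pandas.read_pickle`; `load_or_build`, `build_frame` and the file name `data.pkl` are hypothetical and do not come from the project.

```python
import os
import tempfile

import pandas as pd


def load_or_build(pickle_path, build):
    """Return the cached dataframe if present, otherwise build and cache it."""
    if os.path.exists(pickle_path):
        return pd.read_pickle(pickle_path)
    dataframe = build()
    dataframe.to_pickle(pickle_path)
    return dataframe


calls = []


def build_frame():
    calls.append(1)  # count how often the expensive step runs
    return pd.DataFrame({"x": [1, 2, 3]})


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "data.pkl")
    first = load_or_build(path, build_frame)   # builds and writes the pickle
    second = load_or_build(path, build_frame)  # served from the pickle

print(len(calls), first.equals(second))
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.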

readCsvFile(file)[source]¶

A thin wrapper around the read_csv function from Pandas

Parameters:	file (String) – Path to the csv-file to be read
Returns:	The csv-file is returned as a two-dimensional data structure with labeled axes.
Return type:	Pandas dataframe

threshholdFiltering()[source]¶

This function filters the data based on the value of “threshold” given by the user in the configuration yaml file. An entire column is ignored if its number of missing values exceeds the threshold. The dataframe obtained after applying the threshold filtering is pickled and stored in the directory “pickeledData/”.
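
A minimal sketch of such a filter is shown below. It assumes that the threshold is an absolute count of missing values per column, as the description suggests; `threshold_filter` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def threshold_filter(dataframe, threshold):
    """Keep only columns whose missing-value count does not exceed threshold."""
    missing_counts = dataframe.isna().sum()
    columns_to_keep = missing_counts[missing_counts <= threshold].index.tolist()
    return dataframe[columns_to_keep], columns_to_keep


frame = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],              # no missing values
        "b": [np.nan, np.nan, np.nan, 4.0],     # three missing values
        "c": [1.0, np.nan, 3.0, 4.0],           # one missing value
    }
)
filtered, kept = threshold_filter(frame, threshold=1)
print(kept)
```

With a threshold of 1, column `b` is dropped because it has three missing values, while `a` and `c` are kept.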

writeColumnsAThFiltering(columnsToKeep)[source]¶

This function writes the features that remain after the threshold filtering to a text file, which is stored in “results/”.

AberdeenProject.core.modelFitting module¶

class AberdeenProject.core.modelFitting.Model(dataframe, algorithm, **parameters)[source]¶

Bases: object

This class extracts the most important features from a dataframe and then creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

bestModel()[source]¶

This function finds the optimal depth for the decision tree model by fitting trees of various depths on the training data and choosing the optimal depth using cross-validation

Returns:	Optimal decision tree depth
Return type:	Integer
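
The depth search described above can be sketched as follows. The candidate depths, the number of folds, the scoring metric (scikit-learn's default accuracy) and the iris data are all assumptions made for illustration; `best_depth` is a hypothetical name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def best_depth(X, y, depths=range(1, 11), cv=5):
    """Return the depth with the highest mean cross-validated accuracy."""
    mean_scores = {}
    for depth in depths:
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        mean_scores[depth] = cross_val_score(tree, X, y, cv=cv).mean()
    # max() keeps the first (shallowest) depth among equal scores.
    return max(mean_scores, key=mean_scores.get), mean_scores


X, y = load_iris(return_X_y=True)
depth, scores = best_depth(X, y)
print(depth, round(scores[depth], 3))
```

Preferring the shallowest depth among ties is a design choice of this sketch: a shallower tree yields fewer and simpler rules.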

buildRules(out_file)[source]¶

This function exports decision tree rules to an image that can be interpreted by the user

Parameters:	out_file (String) – Name of the output file

crossValidation(cv)[source]¶

This function reports the performance measure by k-fold cross-validation

Parameters:	cv (Integer) – Number of folds

crossValidationScores(cv)[source]¶

This function evaluates a score by cross-validation

Parameters:	cv (Integer) – Number of folds
Returns:	Cross validation scores
Return type:	List

decisionTreeToGraphiz(out_file, feature_names, class_names)[source]¶

This function generates a Graphviz representation of the decision tree in DOT format

Parameters:	out_file (String) – Name of the output file
	feature_names (List) – Names of each of the features
	class_names (List) – Name of the target class
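
The parameter names match scikit-learn's `export_graphviz`, so the export step probably looks similar to the sketch below; that the method delegates to this function is an assumption. The example fits a small tree on the iris data and returns the DOT source as a string rather than writing a file.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# With out_file=None the DOT source is returned as a string
# instead of being written to disk.
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=[str(name) for name in iris.feature_names],
    class_names=[str(name) for name in iris.target_names],
)
print(dot_source[:60])
```

A DOT file written this way is typically rendered with the Graphviz command line tool, for example `dot -Tpng tree.dot -o tree.png` (file names are placeholders), which requires Graphviz to be installed.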

decisionTreeToPng(out_file, feature_names, class_names)[source]¶

This function wraps up self.decisionTreeToGraphiz and self.graphvizToPng

Parameters:	out_file (String) – Name of the output file
	feature_names (List) – Names of each of the features
	class_names (List) – Name of the target class

fit()[source]¶

This function builds the model from the training set (self.X, self.y). If the chosen algorithm is Decision Tree, then the optimal depth from the function “bestModel” will be used.

graphvizToPng(out_file)[source]¶

Graphical rendering of the decision tree rules from the DOT file

Parameters:	out_file (String) – Name of the output file

implemented_algorithms = ('decisionTree', 'logisticRegression')¶
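
One way to picture how the algorithm name and the extra keyword parameters of the constructor could be turned into an estimator, and how per-fold scores are then obtained, is the sketch below. The mapping to scikit-learn estimators, the helper `make_estimator` and the iris data are assumptions; the project may construct its models differently.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed mapping; the project may construct its estimators differently.
ESTIMATORS = {
    "decisionTree": DecisionTreeClassifier,
    "logisticRegression": LogisticRegression,
}


def make_estimator(algorithm, **parameters):
    """Build the estimator for a supported algorithm name."""
    if algorithm not in ESTIMATORS:
        raise ValueError(
            f"Unsupported algorithm {algorithm!r}; "
            f"expected one of {sorted(ESTIMATORS)}"
        )
    return ESTIMATORS[algorithm](**parameters)


X, y = load_iris(return_X_y=True)
estimator = make_estimator("logisticRegression", max_iter=1000)
scores = cross_val_score(estimator, X, y, cv=5)  # one score per fold
print(len(scores), round(scores.mean(), 3))
```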

keepBestFeatures()[source]¶

This function extracts the best features from the user defined dataframe using Random Forest Classifier and writes the resulting features to a text file which will be stored in “results/”.
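
A minimal sketch of feature selection with a random forest is given below. The documentation does not say how the “best” features are chosen, so ranking by `feature_importances_` and keeping the top `k` is an assumption, as are the helper name `best_features` and the iris data.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def best_features(X, y, k):
    """Return the k column names with the highest forest importances."""
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X, y)
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(k).index.tolist()


iris = load_iris(as_frame=True)
top = best_features(iris.data, iris.target, k=2)
print(top)
```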
