AberdeenProject.core package

Submodules

AberdeenProject.core.dataPipelines module

class AberdeenProject.core.dataPipelines.FullPipeline[source]

Bases: object

This class chains multiple pipelines (the missing values pipeline, the preprocessing pipeline, …) into one single pipeline.

addPipeline(pipeline)[source]

This function adds a given pipeline to the full pipeline

Parameters:pipeline (Pipeline) – Data pipeline to be added to the full pipeline
columns = None
fit_transform(data)[source]

This function feeds the data to the full pipeline

Parameters:data (Pandas dataframe) – Pandas dataframe to be transformed
Returns:Transformed Pandas dataframe
Return type:Pandas dataframe
classmethod initialize(data)[source]

This function initializes the full pipeline with the Pandas dataframe. The column names will be stored as a class attribute and then recovered when needed.

Parameters:data (Pandas Dataframe) – Pandas dataframe needed for the initialization
pipelines = []
classmethod recoverColumnsNames()[source]

This function returns the column names of the Pandas dataframe

Returns:Column names of the Pandas dataframe
Return type:List
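
The chaining idea described for FullPipeline can be sketched as follows. This is only an illustration under assumptions, not the project's implementation: ChainedPipeline, FillMissing and AddTotal are hypothetical names, and the only property taken from the documentation is that each stage exposes fit_transform and that the column names are kept as a class attribute.

```python
import pandas as pd


class FillMissing:
    """Hypothetical stage: replaces missing values with a constant."""

    def __init__(self, value):
        self.value = value

    def fit_transform(self, data):
        return data.fillna(self.value)


class AddTotal:
    """Hypothetical stage: appends a column holding the row-wise sum."""

    def fit_transform(self, data):
        return data.assign(total=data.sum(axis=1))


class ChainedPipeline:
    """Hypothetical stand-in mimicking the documented FullPipeline interface."""

    columns = None   # class attributes, mirroring the documented class
    pipelines = []

    @classmethod
    def initialize(cls, data):
        # Remember the original column names and start with an empty chain.
        cls.columns = list(data.columns)
        cls.pipelines = []

    def addPipeline(self, pipeline):
        type(self).pipelines.append(pipeline)

    def fit_transform(self, data):
        # Feed the output of each stage into the next one, in insertion order.
        for pipeline in type(self).pipelines:
            data = pipeline.fit_transform(data)
        return data

    @classmethod
    def recoverColumnsNames(cls):
        return cls.columns


frame = pd.DataFrame({"a": [1.0, None], "b": [2.0, 3.0]})
ChainedPipeline.initialize(frame)
full = ChainedPipeline()
full.addPipeline(FillMissing(0.0))
full.addPipeline(AddTotal())
result = full.fit_transform(frame)
```

Because the stages run in insertion order, the missing value is filled before the row sums are computed.
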
class AberdeenProject.core.dataPipelines.MissingValuesPipeline[source]

Bases: object

This class provides a pipeline for completing missing values.

addSimpleImputerPipeline(column, strategy='most_frequent')[source]

This function adds a basic imputation strategy for a column: missing values are replaced either with a provided constant value or with a statistic (mean, median or most frequent value) of the column in which they occur.

Parameters:
  • column (String) – Column of the Pandas dataframe
  • strategy (String, optional) – Strategy of the imputation, defaults to “most_frequent”
allowed_strategies = ('mean', 'median', 'most_frequent', 'constant')
buildPipeline()[source]

This function builds the missing values pipeline and prepares it to be fed with the Pandas dataframe.

fit_transform(dataframe, remainder='passthrough', parallelize=True)[source]

Fit to data, then transform it

Parameters:
  • dataframe (Pandas dataframe) – Pandas dataframe to be passed through the missing values pipeline
  • remainder (String, optional) – With remainder='passthrough', all remaining columns that were not specified in transformers are passed through unchanged. Otherwise, non-specified columns are dropped. Defaults to “passthrough”
  • parallelize (Boolean, optional) – Parallelize the job using all processors, defaults to True
Returns:

Pandas dataframe without missing values

Return type:

Pandas dataframe

pipelines = {}
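
The strategy names and the remainder option match scikit-learn's SimpleImputer and ColumnTransformer, so the behaviour described above may look roughly like the sketch below. That mapping is an assumption, as are the helper name impute and the translation of parallelize=True to n_jobs=-1.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer


def impute(frame, strategies, remainder="passthrough", parallelize=False):
    """Hypothetical helper: impute each listed column with its own strategy."""
    transformers = [
        (f"{column}_imputer", SimpleImputer(strategy=strategy), [column])
        for column, strategy in strategies.items()
    ]
    transformer = ColumnTransformer(
        transformers,
        remainder=remainder,
        n_jobs=-1 if parallelize else None,  # assumption: -1 means all processors
    )
    values = transformer.fit_transform(frame)
    # ColumnTransformer emits the transformed columns first, then the remainder.
    imputed = list(strategies)
    rest = []
    if remainder == "passthrough":
        rest = [c for c in frame.columns if c not in imputed]
    return pd.DataFrame(values, columns=imputed + rest, index=frame.index)


frame = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["Aberdeen", "Aberdeen", np.nan],
    "score": [1.0, 2.0, 3.0],
})
completed = impute(frame, {"age": "mean", "city": "most_frequent"})
```

Note that the column order of the result follows the transformer order, not the order of the input dataframe.
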
class AberdeenProject.core.dataPipelines.PreprocessingPipeline[source]

Bases: object

This class provides a pipeline for data preprocessing like one-hot encoding.

addOnehotEncoderPipeline(column, strategy=None)[source]

Encodes a categorical feature as a one-hot numeric array

Parameters:
  • column (String) – Column of the Pandas dataframe
  • strategy (String or None, optional) – Preprocessing strategy, defaults to None
allowed_strategies = ('one_hot', 'None')
buildPipeline()[source]

This function builds the preprocessing pipeline and prepares it to be fed with the Pandas dataframe.

fit_transform(dataframe)[source]

Fit to data, then transform it

Parameters:dataframe (Pandas dataframe) – Pandas dataframe to be passed through the preprocessing pipeline
Returns:Transformed Pandas dataframe
Return type:Pandas dataframe
pipelines = {}
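
One-hot encoding of a single column can be illustrated with scikit-learn's OneHotEncoder. Whether PreprocessingPipeline wraps exactly these classes is an assumption; the example data and the transformer name are made up for the illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

frame = pd.DataFrame({"colour": ["red", "blue", "red"], "size": [1, 2, 3]})

encoder = ColumnTransformer(
    transformers=[("colour_one_hot", OneHotEncoder(), ["colour"])],
    remainder="passthrough",  # keep the "size" column untouched
    sparse_threshold=0,       # always return a dense array
)
encoded = encoder.fit_transform(frame)
# Columns of `encoded`: colour=blue, colour=red, size
```

The categories are sorted alphabetically, so "blue" becomes the first indicator column and "red" the second.
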

AberdeenProject.core.dataframeCreator module

class AberdeenProject.core.dataframeCreator.DataframeCreator[source]

Bases: object

This class contains all methods and attributes needed in order to create a single Pandas dataframe from features and labels chosen by the user.

convertPklToCsv()[source]

This function converts pkl files in the “data/” directory into csv files

createDataframe()[source]

Every csv file in the directory “data/” is read into a Pandas dataframe. All dataframes are then concatenated vertically based on the variable “joinBasedOn” given by the user in the configuration yaml file. To avoid repeating this process every time the code is run, the resulting dataframe is pickled and stored in the directory “pickeledData/”. This function is parallelized over all available CPUs using the multiprocessing library.
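
The caching step can be sketched as below: the dataframe is rebuilt only when no pickle file exists yet. The names load_or_build and build_frame are hypothetical, and the reading and joining of the csv files is replaced by a small stand-in function, so this shows the caching pattern only.

```python
import os
import tempfile

import pandas as pd


def load_or_build(pickle_path, build):
    """Return the cached dataframe if present, otherwise build and cache it."""
    if os.path.exists(pickle_path):
        return pd.read_pickle(pickle_path)
    frame = build()
    frame.to_pickle(pickle_path)
    return frame


calls = []


def build_frame():
    # Stand-in for the expensive step (reading and joining the csv files).
    calls.append(1)
    return pd.DataFrame({"id": [1, 2], "value": [0.5, 1.5]})


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "dataframe.pkl")
    first = load_or_build(path, build_frame)   # builds and writes the pickle
    second = load_or_build(path, build_frame)  # served from the pickle file
```
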

pickleDataframe(dataframe, pickledDataFile)[source]

This function saves a given Pandas dataframe as a pickle file

Parameters:
  • dataframe (Pandas dataframe) – Pandas dataframe to be saved
  • pickledDataFile (String) – Path to the desired location of the pickle file
readCsvFile(file)[source]

A wrapper around the read_csv function from Pandas

Parameters:file (String) – Path to the csv-file to be read
Returns:The content of the csv file as a two-dimensional data structure with labeled axes.
Return type:Pandas dataframe
threshholdFiltering()[source]

This function filters the data based on the value of “threshold” given by the user in the configuration yaml file. A column is dropped entirely if its number of missing values exceeds the threshold. The filtered dataframe is pickled and stored in the directory “pickeledData/”.
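
A minimal sketch of such a filter, assuming the threshold is an absolute count of missing values per column (the documentation does not say whether a count or a fraction is meant). The function name drop_sparse_columns is hypothetical.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, threshold):
    """Keep only columns whose missing-value count does not exceed threshold."""
    missing_counts = frame.isna().sum()
    columns_to_keep = missing_counts[missing_counts <= threshold].index.tolist()
    return frame[columns_to_keep], columns_to_keep


frame = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],  # 2 missing values: dropped
    "b": [1.0, 2.0, np.nan],     # 1 missing value: kept
    "c": [1.0, 2.0, 3.0],        # 0 missing values: kept
})
filtered, kept = drop_sparse_columns(frame, threshold=1)
```
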

writeColumnsAThFiltering(columnsToKeep)[source]

This function writes the features that remain after the threshold filtering to a text file which will be stored in “results/”

AberdeenProject.core.modelFitting module

class AberdeenProject.core.modelFitting.Model(dataframe, algorithm, **parameters)[source]

Bases: object

This class extracts the most important features from a dataframe and then creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

bestModel()[source]

This function finds the optimal depth for the decision tree model by fitting trees of various depths on the training data and choosing the optimal depth using cross-validation

Returns:Optimal decision tree depth
Return type:Integer
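
Depth selection by cross-validation can be sketched with scikit-learn as follows. The helper best_depth, the candidate depth range and the iris example data are assumptions made for the illustration; the project may score or search differently.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def best_depth(X, y, depths, cv=5):
    """Return the depth with the highest mean cross-validation score."""
    mean_scores = {}
    for depth in depths:
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        mean_scores[depth] = cross_val_score(tree, X, y, cv=cv).mean()
    return max(mean_scores, key=mean_scores.get), mean_scores


X, y = load_iris(return_X_y=True)
depth, scores = best_depth(X, y, depths=range(1, 8))
```

If several depths tie, this sketch returns the shallowest of them, because max keeps the first maximum it encounters.
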
buildRules(out_file)[source]

This function exports decision tree rules to an image that can be interpreted by the user

Parameters:out_file (String) – Name of the output file
crossValidation(cv)[source]

This function reports the performance measure by k-fold cross-validation

Parameters:cv (Integer) – Number of folds
crossValidationScores(cv)[source]

This function evaluates a score by cross-validation

Parameters:cv (Integer) – Number of folds
Returns:Cross validation scores
Return type:List
decisionTreeToGraphiz(out_file, feature_names, class_names)[source]

This function generates a GraphViz representation of the decision tree in DOT format

Parameters:
  • out_file (String) – Name of the output file
  • feature_names (List) – Names of each of the features
  • class_names (List) – Names of the target classes
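
A DOT export of this kind is available through scikit-learn's export_graphviz. The sketch below assumes that is what the method wraps, and uses the iris data only as a placeholder; with out_file=None the DOT source is returned as a string instead of being written to a file.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

dot_source = export_graphviz(
    tree,
    out_file=None,  # return the DOT text rather than writing a file
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
)
```

Once saved to a file, such DOT text can typically be rendered with the Graphviz command line tool, for example dot -Tpng tree.dot -o tree.png.
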
decisionTreeToPng(out_file, feature_names, class_names)[source]

This function wraps up self.decisionTreeToGraphiz and self.graphvizToPng

Parameters:
  • out_file (String) – Name of the output file
  • feature_names (List) – Names of each of the features
  • class_names (List) – Name of the target class
fit()[source]

This function builds the model from the training set (self.X, self.y). If the chosen algorithm is Decision Tree, then the optimal depth from the function “bestModel” will be used

graphvizToPng(out_file)[source]

Graphical rendering of the decision tree rules from the DOT file

Parameters:out_file (String) – Name of the output file
implemented_algorithms = ('decisionTree', 'logisticRegression')
keepBestFeatures()[source]

This function extracts the best features from the user defined dataframe using Random Forest Classifier and writes the resulting features to a text file which will be stored in “results/”
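
Feature ranking with a random forest can be sketched as below. The helper top_features, the number of trees and the cut-off k are assumptions; the documentation does not state how many features are kept or how the selection threshold is chosen.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def top_features(X, y, feature_names, k):
    """Return the k feature names with the highest forest importances."""
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X, y)
    ranked = sorted(zip(forest.feature_importances_, feature_names), reverse=True)
    return [name for _, name in ranked[:k]]


iris = load_iris()
best = top_features(iris.data, iris.target, iris.feature_names, k=2)
```
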

Module contents