sasctl.pzmm.write_json_files#
- class sasctl.pzmm.write_json_files.JSONFiles[source]#
Bases:
object
Methods
add_df_to_fitstat
(df, data)Add parameters from provided DataFrame to the fitstats dictionary.
add_tuple_to_fitstat
(data, parameters)Using tuples defined in input_fit_statistics, add them to the dmcas_fitstat json dictionary.
apply_dataframe_to_json
(json_dict, ...[, ...])Map the values of the ROC or Lift charts from SAS CAS to the dictionary representation of the respective json file.
assess_model_bias
(score_table, ...[, ...])Calculates model bias metrics for sensitive variables and dumps metrics into SAS Viya readable JSON Files.
bias_dataframes_to_json
([groupmetrics, ...])Properly formats data from the FairAITools CAS action set into a JSON-readable format.
calculate_model_statistics
(target_value[, ...])Calculates fit statistics (including ROC and Lift curves) from datasets and then either writes them to JSON files or returns them as a single dictionary.
check_for_data
([validate, train, test])Check which datasets were provided and return a list of flags.
check_if_string
(data)Determine if an MLFlow variable in data is a string type.
convert_data_role
(data_role)Converts the data role identifier from string to int or int to string.
create_requirements_json
([model_path, ...])Searches the model directory for Python scripts and pickle files and determines their Python package dependencies.
find_imports
(file_path)Find import calls in provided Python code path.
format_group_metrics
(groupmetrics_dfs[, ...])Converts a list of group metrics DataFrames to a single DataFrame.
format_max_differences
(maxdiff_dfs[, datarole])Converts a list of max differences DataFrames into a single DataFrame.
format_parameter
(param_name)Formats the parameter name to the JSON standard expected for dmcas_fitstat.json.
generate_misc
(model_files)Generates the dmcas_relativeimportance.json file, which is used to determine variable importance
generate_mlflow_variable_properties
(input_data)Create a list of dictionaries containing the variable properties found in the MLModel file for MLFlow model runs.
generate_model_card
(model_prefix, ...[, ...])Generates everything required for the model card feature within SAS Model Manager.
generate_outcome_average
(train_data, ...[, ...])Generates the outcome average of the training data.
generate_variable_importance
(conn, ...[, ...])Generates the dmcas_relativeimportance.json file, which is used to determine variable importance
generate_variable_properties
(input_data)Generate a list of dictionaries of variable properties given an input dataframe.
get_code_dependencies
([model_path])Get the package dependencies for all Python scripts in the provided directory path.
get_local_package_version
(package_list)Get package_name versions from the local environment.
get_package_names
(stream)Generates a list of found package names from a pickle stream.
get_pickle_dependencies
(pickle_file)Reads the pickled byte stream from a file object, serializes the pickled byte stream as a bytes object, and inspects the bytes object for all Python modules and aggregates them in a list.
get_pickle_file
([pickle_folder])Given a file path, retrieve the pickle file(s).
get_selection_statistic_value
(model_files[, ...])Finds the value of the chosen selection statistic in dmcas_fitstat.json, which should have been generated before this function has been called.
input_fit_statistics
([fitstat_df, ...])Writes a JSON file to display fit statistics for the model in SAS Model Manager.
read_json_file
(path)Reads a JSON file from a given path.
remove_standard_library_packages
(package_list)Remove any packages from the required list of installed packages that are part of the Python Standard Library.
stat_dataset_to_dataframe
(data[, target_value])Convert the user supplied statistical dataset from either a pandas DataFrame, list of lists, or numpy array to a DataFrame formatted for SAS CAS upload.
truncate_properties
(prop)Check custom properties for values larger than SAS Model Manager expects.
update_model_properties
(model_files, update_dict)Updates the ModelProperties.json file to include properties listed in the update_dict dictionary.
upload_training_data
(conn, model_prefix, ...)Uploads training data to CAS server.
user_input_fitstat
(data)Prompt the user to enter parameters for dmcas_fitstat.json.
write_file_metadata_json
(model_prefix[, ...])Writes a file metadata JSON file pointing to all relevant files.
write_model_properties_json
(model_name, ...)Writes a JSON file containing SAS Model Manager model properties.
write_var_json
(input_data[, is_input, json_path])Writes a variable descriptor JSON file for input or output variables, based on input data containing predictor and prediction columns.
- classmethod add_df_to_fitstat(df: DataFrame, data: List[dict]) List[dict] [source]#
Add parameters from provided DataFrame to the fitstats dictionary.
- Parameters:
- dfpandas.DataFrame
Dataframe containing fitstat parameters and values.
- datalist of dict
List of dicts for the data values of each parameter. Split into the three valid partitions (TRAIN, TEST, VALIDATE).
- Returns:
- list of dict
List of dicts with the user provided values inputted.
- classmethod add_tuple_to_fitstat(data: List[dict], parameters: List[tuple]) List[dict] [source]#
Using tuples defined in input_fit_statistics, add them to the dmcas_fitstat json dictionary.
Warnings are produced for invalid parameters found in the tuple.
- Parameters:
- datalist of dict
List of dicts for the data values of each parameter. Split into the three valid partitions (TRAIN, TEST, VALIDATE).
- parameterslist of tuple
User-provided data for each parameter per partition provided.
- Returns:
- list of dict
List of dicts with the tuple values inputted.
- Raises:
- ValueError
If a parameter within the tuple list is not a tuple or has a length different from the expected three.
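The length-three tuple check described above can be sketched in plain Python. This is an illustration only, not the sasctl implementation: the helper name `validate_fitstat_tuples` is hypothetical, and the (name, value, partition) ordering is taken from the description of input_fit_statistics.

```python
from typing import List


def validate_fitstat_tuples(parameters: List[tuple]) -> List[tuple]:
    """Hypothetical helper: ensure every entry is a tuple of length three."""
    for index, entry in enumerate(parameters):
        if not isinstance(entry, tuple) or len(entry) != 3:
            raise ValueError(
                f"Entry {index} must be a (name, value, partition) tuple; got {entry!r}"
            )
    return parameters


# The statistic names, values, and partition labels here are illustrative only.
valid = validate_fitstat_tuples([("GINI", 0.62, "TRAIN"), ("ASE", 0.11, "TEST")])
```

Running a check like this before calling the library surfaces malformed entries early, with the index of the offending entry in the message.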
- static apply_dataframe_to_json(json_dict: dict, partition: int, stat_df: DataFrame, is_lift: bool = False) dict [source]#
Map the values of the ROC or Lift charts from SAS CAS to the dictionary representation of the respective json file.
- Parameters:
- json_dictdict
Dictionary representation of the ROC or Lift chart json file.
- partitionint
Numerical representation of the data partition. Either 0, 1, or 2.
- stat_dfpandas.DataFrame
ROC or Lift DataFrame generated from the SAS CAS percentile action set.
- is_liftbool
Specify whether to use logic for Lift or ROC row counting. Default value is False.
- Returns:
- json_dictdict
Dictionary representation of the ROC or Lift chart json file, with the values from the SAS CAS percentile action set added in.
- classmethod assess_model_bias(score_table: DataFrame, sensitive_values: str | List[str], actual_values: str, pred_values: str = None, prob_values: List[str] = None, levels: List[str] = None, json_path: str | Path | None = None, cutoff: float = 0.5, datarole: str = 'TEST', return_dataframes: bool = False) dict | None [source]#
Calculates model bias metrics for sensitive variables and dumps metrics into SAS Viya readable JSON Files. This function works for regression and binary classification problems.
- Parameters:
- score_tablepandas.DataFrame
Data structure containing actual values, predicted or predicted probability values, and sensitive variable values. All columns in the score table must have valid variable names.
- sensitive_valuesstring or list of strings
Sensitive variable name or names in score_table. The variable name must follow SAS naming conventions (no spaces and the name cannot begin with a number or symbol).
- actual_valuesstring
Variable name containing the actual values in score_table. The variable name must follow SAS naming conventions (no spaces and the name cannot begin with a number or symbol).
- pred_valuesstring, required for regression problems, otherwise not used
Variable name containing the predicted values in score_table. The variable name must follow SAS naming conventions (no spaces and the name cannot begin with a number or symbol).Required for regression problems. The default value is None.
- prob_valueslist of strings, required for classification problems, otherwise not used
A list of variable names containing the predicted probability values in the score table. The first element should represent the predicted probability of the target class. Required for classification problems. Default is None.
- levels: List of strings, integers, booleans, required for classification problems, otherwise not used
List of classes of a nominal target in the order they were passed in prob_values. Levels must be passed as a string. Default is None.
- json_pathstr or Path, optional
Location for the output JSON files. If a path is passed, the json files will populate in the directory and the function will return None, unless return_dataframes is True. Otherwise, the function will return the json strings in a dictionary (dict[“maxDifferences.json”] and dict[“groupMetrics.json”]). The default value is None.
- cutofffloat, optional
Cutoff value for confusion matrix. Default is 0.5.
- datarolestring, optional
The data being used to assess bias (e.g. ‘TEST’ or ‘VALIDATION’). Default is ‘TEST’.
- return_dataframesboolean, optional
If true, the function returns the pandas data frames used to create the JSON files and a table for bias metrics. If a JSON path is passed, then the function will return a dictionary that only includes the data frames (dict[“maxDifferencesData”], dict[“groupMetricData”], and dict[“biasMetricsData”]). If a JSON path is not passed, the function will return a dictionary with the three tables and the two JSON strings (dict[“maxDifferences.json”] and dict[“groupMetrics.json”]). The default value is False.
- Returns:
- dict
Dictionary containing key-value pairs that map each file name to its JSON dump.
- Raises:
- RuntimeError
If swat is not installed, this function cannot perform the necessary calculations.
- ValueError
This function requires either pred_values (regression) or both prob_values and levels (classification) to be passed.
Variable names must follow SAS naming conventions (no spaces or names that begin with a number or symbol).
Warning
This method is experimental and may be modified or removed without warning.
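The interaction between json_path and return_dataframes determines what assess_model_bias returns. The following sketch only restates the rules from the parameter descriptions above as a lookup; the function name `expected_result_keys` is hypothetical and nothing here calls sasctl.

```python
from typing import Optional, Set


def expected_result_keys(json_path_given: bool,
                         return_dataframes: bool) -> Optional[Set[str]]:
    """Hypothetical helper: which dictionary keys to expect, or None."""
    json_keys = {"maxDifferences.json", "groupMetrics.json"}
    frame_keys = {"maxDifferencesData", "groupMetricData", "biasMetricsData"}
    if json_path_given:
        # Files are written to disk; only the data frames can be returned.
        return frame_keys if return_dataframes else None
    # No path: JSON strings are returned, plus the data frames if requested.
    return (frame_keys | json_keys) if return_dataframes else json_keys
```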
- classmethod bias_dataframes_to_json(groupmetrics: DataFrame = None, maxdifference: DataFrame = None, n_sensitivevariables: int = None, actual_values: str = None, prob_values: List[str] = None, levels: List[str] = None, pred_values: str = None, json_path: str | Path | None = None)[source]#
Properly formats data from the FairAITools CAS action set into a JSON-readable format.
- Parameters:
- groupmetrics: DataFrame
A DataFrame containing the group metrics data
- maxdifference: DataFrame
A DataFrame containing the max difference data
- n_sensitivevariables: int
The total number of sensitive values
- actual_valuesString
Variable name containing the actual values in score_table. The variable name must follow SAS naming conventions (no spaces and the name cannot begin with a number or symbol).
- prob_valueslist of strings, required for classification problems, otherwise not used
A list of variable names containing the predicted probability values in the score table. The first element should represent the predicted probability of the target class. Required for classification problems. Default is None.
- levels: List of strings, required for classification problems, otherwise not used
List of classes of a nominal target in the order they were passed in prob_values. Levels must be passed as a string. Default is None.
- pred_valuesstring, required for regression problems, otherwise not used
Variable name containing the predicted values in score_table. The variable name must follow SAS naming conventions (no spaces and the name cannot begin with a number or symbol).Required for regression problems. The default value is None.
- json_pathstr or Path, optional
Location for the output JSON files. If a path is passed, the json files will populate in the directory and the function will return None, unless return_dataframes is True. Otherwise, the function will return the json strings in a dictionary (dict[“maxDifferences.json”] and dict[“groupMetrics.json”]). The default value is None.
- Returns:
- dict
Dictionary containing key-value pairs that map each file name to its JSON dump.
Warning
This method is experimental and may be modified or removed without warning.
- classmethod calculate_model_statistics(target_value: str | int | float, prob_value: int | float | None = None, validate_data: DataFrame | List[list] | Type[numpy.array] = None, train_data: DataFrame | List[list] | Type[numpy.array] = None, test_data: DataFrame | List[list] | Type[numpy.array] = None, json_path: str | Path | None = None, target_type: str = 'classification', cutoff: float | None = None) dict | None [source]#
Calculates fit statistics (including ROC and Lift curves) from datasets and then either writes them to JSON files or returns them as a single dictionary.
Calculations are performed using a call to SAS CAS via the swat package. An error will be raised if the swat package is not installed or if a connection to a SAS Viya system is not possible.
Datasets must contain the actual and predicted values and may optionally contain the predicted probabilities. If no probabilities are provided, a dummy probability dataset is generated based on the predicted values and normalized by the target value. If a probability threshold value is not provided, the threshold value is set at 0.5.
Datasets can be provided in the following forms, with the assumption that data is ordered as actual, predict, and probability respectively:
* pandas dataframe: the actual and predicted values are their own columns
* numpy array: the actual and predicted values are their own columns or rows and ordered such that the actual values come first and the predicted second
* list: the actual and predicted values are their own indexed entry
If a json_path is supplied, then this function outputs a set of JSON files named “dmcas_fitstat.json”, “dmcas_roc.json”, “dmcas_lift.json”.
- Parameters:
- target_valuestr, int, or float
Target event value for model prediction events.
- prob_valueint or float, optional
The threshold value for model predictions to indicate an event occurred. The default value is 0.5.
- validate_datapandas.DataFrame, list of list, or numpy array, optional
Dataset pertaining to the validation data. The default value is None.
- train_datapandas.DataFrame, list of list, or numpy array, optional
Dataset pertaining to the training data. The default value is None.
- test_datapandas.DataFrame, list of list, or numpy array, optional
Dataset pertaining to the test data. The default value is None.
- json_pathstr or Path, optional
Location for the output JSON files. The default value is None.
- target_type: str, optional
Type of target the model is trying to find. Currently supports “classification” and “prediction” types. The default value is “classification”.
- Returns:
- dict
Dictionary containing key-value pairs that map each file name to its JSON dump.
- Raises:
- RuntimeError
If swat is not installed, this function cannot perform the necessary calculations.
- static check_for_data(validate: DataFrame | List[list] | Type[numpy.array] = None, train: DataFrame | List[list] | Type[numpy.array] = None, test: DataFrame | List[list] | Type[numpy.array] = None) list [source]#
Check which datasets were provided and return a list of flags.
- Parameters:
- validatepandas.DataFrame, list of list, or numpy array, optional
Dataset pertaining to the validation data. The default value is None.
- trainpandas.DataFrame, list of list, or numpy array, optional
Dataset pertaining to the training data. The default value is None.
- testpandas.DataFrame, list of list, or numpy array, optional
Dataset pertaining to the test data. The default value is None.
- Returns:
- data_partitionslist
A list of flags indicating which partitions have datasets.
- Raises:
- ValueError
If no data is provided, raises an exception.
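A minimal sketch of the flag logic described above, for illustration. It assumes the flags are booleans ordered as (validate, train, test), matching the signature; the actual flag type and order used by sasctl are not stated here, and the name `partition_flags` is hypothetical.

```python
from typing import Any, List, Optional


def partition_flags(validate: Optional[Any] = None,
                    train: Optional[Any] = None,
                    test: Optional[Any] = None) -> List[bool]:
    """Hypothetical helper: flag which partitions were supplied."""
    # Assumed order: validate, train, test (same as the argument order).
    flags = [data is not None for data in (validate, train, test)]
    if not any(flags):
        raise ValueError("No data was provided. Supply at least one dataset.")
    return flags


# Only training data is supplied, as rows of (actual, predict, probability).
flags = partition_flags(train=[[1, 1, 0.9], [0, 0, 0.2]])
```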
- static check_if_string(data: dict) bool [source]#
Determine if an MLFlow variable in data is a string type.
- Parameters:
- datadict
Dictionary representation of a single variable from an MLFlow model.
- Returns:
- bool
True if the variable is a string. False otherwise.
- static convert_data_role(data_role: str | int) str | int [source]#
Converts the data role identifier from string to int or int to string.
JSON file descriptors require the string, int, and formatted int. If the provided data role is not valid, defaults to TRAIN (1).
- Parameters:
- data_rolestr or int
Identifier of the data set’s role; either TRAIN, TEST, or VALIDATE, or correspondingly 1, 2, or 3.
- Returns:
- conversionstr or int
Converted data role identifier.
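The two-way conversion can be pictured as a pair of lookup tables. The sketch below is an assumption-laden illustration, not the library code: the name `convert_role` is hypothetical, string matching is assumed to be case-insensitive, and the fallback is assumed to be 1 for unrecognized strings and "TRAIN" for unrecognized integers.

```python
from typing import Union

# Mapping taken from the parameter description: TRAIN, TEST, VALIDATE <-> 1, 2, 3.
_ROLE_TO_INT = {"TRAIN": 1, "TEST": 2, "VALIDATE": 3}
_INT_TO_ROLE = {number: role for role, number in _ROLE_TO_INT.items()}


def convert_role(data_role: Union[str, int]) -> Union[str, int]:
    """Hypothetical helper: string -> int or int -> string, defaulting to TRAIN."""
    if isinstance(data_role, str):
        # Assumption: matching ignores case; invalid names fall back to 1.
        return _ROLE_TO_INT.get(data_role.upper(), 1)
    # Assumption: invalid integers fall back to "TRAIN".
    return _INT_TO_ROLE.get(data_role, "TRAIN")
```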
- classmethod create_requirements_json(model_path: str | Path | None = PosixPath('/home/runner/work/python-sasctl/python-sasctl'), output_path: str | Path | None = None) dict | None [source]#
Searches the model directory for Python scripts and pickle files and determines their Python package dependencies.
Found dependencies are then matched to the package version found in the current working environment. Then the package and version are written to a requirements.json file.
WARNING: The methods utilized in this function can determine package dependencies from provided scripts and pickle files, but CANNOT determine the required package versions unless it is run in the development environment in which they were originally created.
This function works best when run in the model development environment and is likely to throw errors (and/or produce incorrect package versions) if run in another environment. When using this function outside the model development environment, adjust the package versions in the requirements.json file to match the model development environment.
When provided with an output_path argument, this function outputs a JSON file named “requirements.json”. Otherwise, a list of dicts is returned.
- Parameters:
- model_pathstr or Path, optional
The path to a Python project, by default the current working directory.
- output_pathstr or Path, optional
The path for the output requirements.json file. The default value is None.
- Returns:
- list of dict
List of dictionary representations of the json file contents, split into each package and/or warning.
- static find_imports(file_path: str | Path) List[str] [source]#
Find import calls in provided Python code path.
Ignores built in Python modules.
Credit: modified from https://stackoverflow.com/questions/44988487/regex-to-parse-import-statements-in-python
- Parameters:
- file_pathstr or Path
File location for the Python file to be parsed.
- Returns:
- list of str
List of found package dependencies.
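To illustrate the idea of collecting imports while skipping built-in modules, here is a small sketch. It deliberately uses the standard `ast` module instead of the regex approach credited above, and it takes source text rather than a file path; the function name `find_top_level_imports` is hypothetical.

```python
import ast
import sys
from typing import List


def find_top_level_imports(source: str) -> List[str]:
    """Hypothetical helper: sorted top-level package names imported by source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports such as "from . import helpers".
            if node.module and node.level == 0:
                found.add(node.module.split(".")[0])
    # sys.stdlib_module_names exists on Python 3.10+; otherwise fall back to
    # the compiled-in modules only, which is a much smaller set.
    stdlib = getattr(sys, "stdlib_module_names", frozenset(sys.builtin_module_names))
    return sorted(name for name in found if name not in stdlib)


example = (
    "import sys\n"
    "import numpy as np\n"
    "from sklearn.tree import DecisionTreeClassifier\n"
)
packages = find_top_level_imports(example)
```

Because the source is only parsed and never executed, the imported packages do not need to be installed for this to work.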
- static format_group_metrics(groupmetrics_dfs: List[DataFrame], prob_values: List[str] | None = None, pred_values: str | None = None, datarole: str = 'TEST') DataFrame [source]#
Converts a list of group metrics DataFrames to a single DataFrame.
- Parameters:
- groupmetrics_dfs: List[DataFrame]
List of group metrics DataFrames generated by CASAction
- pred_valuesstring, required for regression problems, otherwise not used
Variable name containing the predicted values in score_table. The variable name must follow SAS naming conventions (no spaces and the name cannot begin with a number or symbol).Required for regression problems. The default value is None.
- prob_valueslist of strings, required for classification problems, otherwise not used
A list of variable names containing the predicted probability values in the score table. The first element should represent the predicted probability of the target class. Required for classification problems. Default is None.
- datarolestring, optional
The data being used to assess bias (e.g. ‘TEST’ or ‘VALIDATION’). Default is ‘TEST’.
- Returns:
- DataFrame
A singular DataFrame containing formatted data for group metrics
- static format_max_differences(maxdiff_dfs: List[DataFrame], datarole: str = 'TEST') DataFrame [source]#
Converts a list of max differences DataFrames into a single DataFrame.
- Parameters:
- maxdiff_dfs: List[DataFrame]
A list of max_differences DataFrames returned by CAS
- datarolestring, optional
The data being used to assess bias (e.g. ‘TEST’ or ‘VALIDATION’). Default is ‘TEST’.
- Returns:
- DataFrame
A single DataFrame containing all max differences data
- static format_parameter(param_name: str)[source]#
Formats the parameter name to the JSON standard expected for dmcas_fitstat.json.
- Parameters:
- param_namestr
Name of the parameter.
- Returns:
- str
Formatted name of the parameter.
- classmethod generate_misc(model_files: str | Path | dict)[source]#
Generates the dmcas_relativeimportance.json file, which is used to determine variable importance
- Parameters:
- conn
A SWAT connection used to connect to the user’s CAS server
- model_filesstring, Path, or dict
Either the directory location of the model files (string or Path object), or a dictionary containing the contents of all the model files.
- classmethod generate_mlflow_variable_properties(input_data: list) List[dict] [source]#
Create a list of dictionaries containing the variable properties found in the MLModel file for MLFlow model runs.
- Parameters:
- input_datalist of dict
Data pulled from the MLModel file by mlflow_model.py.
- Returns:
- dict_listlist of dict
List of dictionaries containing the variable properties.
- classmethod generate_model_card(model_prefix: str, model_files: str | Path | dict, algorithm: str, train_data: DataFrame, train_predictions: Series | list, target_type: str = 'classificaiton', target_value: str | int | float | None = None, interval_vars: list | None = [], class_vars: list | None = [], selection_statistic: str | None = None, server: str = 'cas-shared-default', caslib: str = 'Public')[source]#
Generates everything required for the model card feature within SAS Model Manager.
This includes uploading the training data to CAS, updating ModelProperties.json to have some extra properties, and generating dmcas_relativeimportance.json.
- Parameters:
- model_prefixstring
The prefix used to name files relating to the model. This is used to provide a unique name to the training data table when it is uploaded to CAS.
- model_filesstring, Path, or dict
Either the directory location of the model files (string or Path object), or a dictionary containing the contents of all the model files.
- algorithmstr
The name of the algorithm used to generate the model.
- train_data: pandas.DataFrame
Training data that contains all input variables as well as the target variable.
- train_predictionspandas.Series, list
List of predictions made by the model on the training data.
- target_typestring
Type of target the model is trying to find. Currently supports “classification” and “prediction” types. The default value is “classification”.
- target_valuestring, int, float, optional
Value the model is targeting for classification models. This argument is not needed for prediction models. The default value is None.
- interval_varslist, optional
A list of interval variables. The default value is an empty list.
- class_varslist, optional
A list of classification variables. The default value is an empty list.
- selection_statistic: str, optional
The selection statistic chosen to score the model against other models. Classification models can take any of the following values: “_RASE_”, “_GINI_”, “_GAMMA_”, “_MCE_”, “_ASE_”, “_MCLL_”, “_KS_”, “_KSPostCutoff_”, “_DIV_”, “_TAU_”, “_KSCut_”, or “_C_”. Prediction models can take any of the following values: “_ASE_”, “_DIV_”, “_RASE_”, “_MAE_”, “_RMAE_”, “_MSLE_”, or “_RMSLE_”. The default value is “_KS_” for classification models and “_ASE_” for prediction models.
- server: str, optional
The CAS server the training data will be stored on. The default value is “cas-shared-default”.
- caslib: str, optional
The caslib the training data will be stored on. The default value is “Public”.
- static generate_outcome_average(train_data: DataFrame, input_variables: list, target_type, target_value: str | int | float | None = None)[source]#
Generates the outcome average of the training data. For prediction targets, the event average is generated. For classification targets, the event percentage is returned.
- Parameters:
- train_data: pandas.DataFrame
Training data that contains all input variables as well as the target variable. If multiple non-input variables are included, the function will assume that the first non-input variable row is the output.
- input_variables: list
A list of all input variables used by the model. Used to isolate the output variable.
- target_typestring
Type the model is targeting. Currently supports “classification” and “prediction” types.
- target_valuestring, int, float, optional
Value the model is targeting for Classification models. This argument is not needed for prediction models. The default value is None.
- Returns:
- dict
- Returns a dictionary with a key value pair that represents the outcome average.
- classmethod generate_variable_importance(conn, model_files: str | Path | dict, train_data: DataFrame, train_predictions: Series | list, target_type: str = 'classification', interval_vars: list | None = [], class_vars: list | None = [], caslib: str = 'Public')[source]#
Generates the dmcas_relativeimportance.json file, which is used to determine variable importance
- Parameters:
- conn
A SWAT connection used to connect to the user’s CAS server
- model_filesstring, Path, or dict
Either the directory location of the model files (string or Path object), or a dictionary containing the contents of all the model files.
- train_data: pandas.DataFrame
Training data that contains all input variables as well as the target variable.
- train_predictionspandas.Series, list
List of predictions made by the model on the training data.
- target_typestring, optional
Type the model is targeting. Currently supports “classification” and “prediction” types. The default value is “classification”.
- interval_varslist, optional
A list of interval variables. The default value is an empty list.
- class_varslist, optional
A list of classification variables. The default value is an empty list.
- caslib: str, optional
The caslib the training data will be stored on. The default value is “Public”
- static generate_variable_properties(input_data: DataFrame | Series) List[dict] [source]#
Generate a list of dictionaries of variable properties given an input dataframe.
- Parameters:
- input_datapandas.Dataframe or pandas.Series
Dataset for either the input or output example data for the model.
- Returns:
- dict_listlist of dicts
List of dictionaries containing the variable properties.
- classmethod get_code_dependencies(model_path: str | Path = PosixPath('/home/runner/work/python-sasctl/python-sasctl')) List[str] [source]#
Get the package dependencies for all Python scripts in the provided directory path.
Note that currently this functionality only works for .py files.
- Parameters:
- model_pathstring or Path, optional
File location for the output JSON file. The default value is the current working directory.
- Returns:
- list
List of found package dependencies.
- static get_local_package_version(package_list: List[str]) List[List[str]] [source]#
Get package versions from the local environment.
If the package does not contain an attribute of “__version__”, “version”, or “VERSION”, no package version will be found.
- Parameters:
- package_listlist of str
List of Python packages.
- Returns:
- list of list of str
Nested list of Python package names and found versions.
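The attribute lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the sasctl code: the name `lookup_versions` is hypothetical, and representing a missing version as None is an assumption about how "no version found" is reported.

```python
import importlib
import sys
import types
from typing import List, Optional


def lookup_versions(package_list: List[str]) -> List[List[Optional[str]]]:
    """Hypothetical helper: pair each package name with a version or None."""
    results = []
    for name in package_list:
        version = None
        try:
            module = importlib.import_module(name)
        except ImportError:
            module = None
        if module is not None:
            # Try the three attribute names mentioned in the description.
            for attribute in ("__version__", "version", "VERSION"):
                candidate = getattr(module, attribute, None)
                if isinstance(candidate, str):
                    version = candidate
                    break
        results.append([name, version])
    return results


# Register a throwaway module so the example does not depend on installed packages.
demo = types.ModuleType("demo_pkg_for_docs")
demo.__version__ = "1.2.3"
sys.modules["demo_pkg_for_docs"] = demo

versions = lookup_versions(["demo_pkg_for_docs", "package_that_does_not_exist_xyz"])
```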
- static get_package_names(stream: bytes | str) List[str] [source]#
Generates a list of found package names from a pickle stream.
In most cases, the packages returned by the function will be valid Python packages. A check is made in get_local_package_version to ensure that the package is in fact a valid Python package.
This code has been adapted from the following stackoverflow example and utilizes the pickletools package. Credit: modified from https://stackoverflow.com/questions/64850179/inspecting-a-pickle-dump-for-dependencies More information here: python/cpython
- Parameters:
- streambytes or str
A file like object or string containing the pickle.
- Returns:
- List of str
List of package names found as module dependencies in the pickle file.
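As a rough illustration of inspecting a pickle stream with pickletools, the sketch below scans opcodes for GLOBAL and STACK_GLOBAL without unpickling anything. It is a simplified approximation and not the sasctl implementation: the name `package_names_from_pickle` is hypothetical, and the stack is tracked only well enough to recover the module string that precedes each STACK_GLOBAL.

```python
import pickle
import pickletools
from collections import OrderedDict
from typing import Dict, List, Optional

_STRING_PUSH = {"SHORT_BINUNICODE", "BINUNICODE", "BINUNICODE8", "UNICODE"}
_MEMO_PUT = {"PUT", "BINPUT", "LONG_BINPUT"}
_MEMO_GET = {"GET", "BINGET", "LONG_BINGET"}


def package_names_from_pickle(stream: bytes) -> List[str]:
    """Hypothetical helper: top-level packages referenced by a pickle stream."""
    modules = set()
    memo: Dict[int, Optional[str]] = {}   # memo slot -> string stored there, if any
    pushed: List[Optional[str]] = []      # strings pushed so far (None for other ops)
    for opcode, arg, _position in pickletools.genops(stream):
        name = opcode.name
        if name == "MEMOIZE":
            memo[len(memo)] = pushed[-1] if pushed else None
        elif name in _MEMO_PUT:
            memo[arg] = pushed[-1] if pushed else None
        elif name in _MEMO_GET:
            pushed.append(memo.get(arg))
        elif name in _STRING_PUSH:
            pushed.append(arg)
        elif name == "GLOBAL":
            # genops reports the argument as "module qualname".
            modules.add(arg.split(" ")[0])
            pushed.append(None)
        elif name == "STACK_GLOBAL":
            # The module string sits just below the qualified name.
            if len(pushed) >= 2 and pushed[-2] is not None:
                modules.add(pushed[-2])
            pushed.append(None)
        else:
            pushed.append(None)
    return sorted({module.split(".")[0] for module in modules})


payload = pickle.dumps(OrderedDict(a=1), protocol=4)
packages = package_names_from_pickle(payload)
```

Reading opcodes this way avoids executing the pickle, which matters when the file comes from an environment whose packages are not installed locally.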
- classmethod get_pickle_dependencies(pickle_file: str | Path) List[str] [source]#
Reads the pickled byte stream from a file object, serializes the pickled byte stream as a bytes object, and inspects the bytes object for all Python modules and aggregates them in a list.
- Parameters:
- pickle_filestr or Path
The file where you stored pickle data.
- Returns:
- list
A list of modules obtained from the pickle stream. Duplicates are removed and Python built-in modules are removed.
- static get_pickle_file(pickle_folder: str | Path = PosixPath('/home/runner/work/python-sasctl/python-sasctl')) List[Path] [source]#
Given a file path, retrieve the pickle file(s).
- Parameters:
- pickle_folderstr or Path
File location for the input pickle file. The default value is the current working directory.
- Returns:
- list of Path
A list of pickle files.
- static get_selection_statistic_value(model_files: str | Path | dict, selection_statistic: str = '_GINI_')[source]#
Finds the value of the chosen selection statistic in dmcas_fitstat.json, which should have been generated before this function has been called.
- Parameters:
- model_filesstring, Path, or dict
Either the directory location of the model files (string or Path object), or a dictionary containing the contents of all the model files.
- selection_statistic: str, optional
The selection statistic chosen to score the model against other models. Can be any of the following values: “_RASE_”, “_NObs_”, “_GINI_”, “_GAMMA_”, “_MCE_”, “_ASE_”, “_MCLL_”, “_KS_”, “_KSPostCutoff_”, “_DIV_”, “_TAU_”, “_KSCut_”, or “_C_”. The default value is “_GINI_”.
- Returns:
- float
Returns the numerical value associated with the chosen selection statistic.
- classmethod input_fit_statistics(fitstat_df: DataFrame | None = None, user_input: bool | None = False, tuple_list: List[tuple] | None = None, json_path: str | Path | None = None) dict | None [source]#
Writes a JSON file to display fit statistics for the model in SAS Model Manager.
There are three modes to add fit parameters to the JSON file:
1. Call the function with additional tuple arguments containing the name of the parameter, its value, and the partition that it belongs to.
2. Provide line-by-line user input prompted by the function.
3. Import values from a CSV file. Format should contain the above tuple in each row.
- The following are the base statistical parameters SAS Viya supports:
RASE = Root Average Squared Error
NObs = Sum of Frequencies
GINI = Gini Coefficient
GAMMA = Gamma
MCE = Misclassification Rate
ASE = Average Squared Error
MCLL = Multi-Class Log Loss
KS = KS (Youden)
KSPostCutoff = ROC Separation
DIV = Divisor for ASE
TAU = Tau
KSCut = KS Cutoff
C = Area Under ROC
This function outputs a JSON file named “dmcas_fitstat.json”.
- Parameters:
- fitstat_dfpandas.DataFrame, optional
Dataframe containing fitstat parameters and values. The default value is None.
- user_inputbool, optional
If true, prompt the user for more parameters. The default value is false.
- tuple_listlist of tuple, optional
Input parameter tuples in the form of (parameterName, parameterValue, data_role). For example, a sample tuple would be (‘NObs’, 3488, ‘TRAIN’). The data_role value is typically TRAIN, TEST, or VALIDATE, or equivalently 1, 2, or 3, respectively. The default value is None.
- json_pathstr or Path, optional
Location for the output JSON file. The default value is None.
- Returns:
- dict
Dictionary containing a key-value pair representing the file name and json dump respectively.
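The tuple form accepted by tuple_list can be illustrated with a small validation sketch. The helper normalize_fitstat_tuple is hypothetical and is not part of sasctl. The parameter names are copied from valid_params, the role mapping follows the TRAIN/TEST/VALIDATE to 1/2/3 correspondence stated above, and wrapping a bare name such as ‘NObs’ in underscores is an assumption.

```python
from typing import Any, Tuple, Union

# Copied from JSONFiles.valid_params as listed in this reference.
VALID_PARAMS = [
    "_RASE_", "_NObs_", "_GINI_", "_GAMMA_", "_MCE_", "_ASE_", "_MCLL_",
    "_KS_", "_KSPostCutoff_", "_DIV_", "_TAU_", "_KSCut_", "_C_",
]
ROLE_TO_PARTITION = {"TRAIN": 1, "TEST": 2, "VALIDATE": 3}


def normalize_fitstat_tuple(entry: Tuple[str, Any, Union[str, int]]) -> Tuple[str, Any, int]:
    """Validate one (parameterName, parameterValue, data_role) tuple."""
    name, value, role = entry
    # Assumption: a bare name like "NObs" maps to the underscored "_NObs_".
    key = name if name.startswith("_") else f"_{name}_"
    if key not in VALID_PARAMS:
        raise ValueError(f"Unsupported fit statistic: {name!r}")
    if isinstance(role, str):
        partition = ROLE_TO_PARTITION.get(role.upper())
    else:
        partition = role if role in (1, 2, 3) else None
    if partition is None:
        raise ValueError(f"Unsupported data role: {role!r}")
    return key, value, partition


sample = [("NObs", 3488, "TRAIN"), ("_KS_", 0.41, 3)]
normalized = [normalize_fitstat_tuple(entry) for entry in sample]
```

Checking the tuples before passing them on catches a misspelled statistic or role early, before any JSON is written.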
- static read_json_file(path: str | Path) Any [source]#
Reads a JSON file from a given path.
- Parameters:
- pathstr or Path
Location of the JSON file to be opened.
- Returns:
- json.load(jFile)str
String contents of JSON file.
- static remove_standard_library_packages(package_list: List[str]) List[str] [source]#
Remove any packages from the required list of installed packages that are part of the Python Standard Library.
- Parameters:
- package_listlist of str
List of all packages found that are not Python built-in packages.
- Returns:
- list of str
List of all packages found that are not Python built-in packages or part of the Python Standard Library.
- static stat_dataset_to_dataframe(data: DataFrame | List[list] | Type[numpy.array], target_value: str | int | float = None) DataFrame [source]#
Convert the user supplied statistical dataset from either a pandas DataFrame, list of lists, or numpy array to a DataFrame formatted for SAS CAS upload.
If the prediction probabilities are not provided, the prediction data will be duplicated to allow for calculation of the fit statistics through CAS and then a binary filter is applied to the duplicate column based off of a provided target value. The data is assumed to be in the order of “actual”, “predicted”, “probability” respectively.
- Parameters:
- datapandas.DataFrame, list of list, or numpy array
Dataset representing the actual and predicted values of the model. May also include the prediction probabilities.
- target_valuestr, int, or float, optional
Target event value for model prediction events. Used for creating a binary probability column when no probability values are provided. The default value is None.
- Returns:
- datapandas.DataFrame
Dataset formatted for SAS CAS upload.
- Raises:
- ValueError
Raised if an improper data format is provided.
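The column duplication and binary filter described above can be sketched in plain Python for the list-of-lists input. This is an illustration only: the helper name add_binary_probability is hypothetical, and encoding the filter as 1 or 0 is an assumption about the resulting probability column.

```python
from typing import Any, List


def add_binary_probability(rows: List[list], target_value: Any) -> List[list]:
    """Return rows as [actual, predicted, probability], filling in a missing probability."""
    formatted = []
    for row in rows:
        if len(row) == 3:
            # A probability is already present; keep the row as given.
            formatted.append(list(row))
        elif len(row) == 2:
            actual, predicted = row
            # Duplicate the prediction, then reduce the copy to a binary flag.
            probability = 1 if predicted == target_value else 0
            formatted.append([actual, predicted, probability])
        else:
            raise ValueError("Each row must hold 2 or 3 values.")
    return formatted
```

For example, with a target value of 1, the row [0, 1] becomes [0, 1, 1] and the row [1, 0] becomes [1, 0, 0].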
- static truncate_properties(prop: dict) dict [source]#
Check custom properties for values larger than SAS Model Manager expects.
Property names cannot be larger than 60 characters. Property values cannot be larger than 512 characters.
- Parameters:
- propdict
Key-value pair representing the property name and value.
- Returns:
- propdict
Key-value pair, which was truncated as needed by SAS Model Manager.
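The two limits can be applied with a short dictionary comprehension, as in the sketch below. The helper name truncate_to_limits is hypothetical, and leaving non-string values untouched is an assumption. The actual method may handle them differently.

```python
from typing import Any, Dict

MAX_NAME_LENGTH = 60    # limit on property names stated above
MAX_VALUE_LENGTH = 512  # limit on property values stated above


def truncate_to_limits(prop: Dict[str, Any]) -> Dict[str, Any]:
    """Cut property names to 60 characters and string values to 512 characters."""
    return {
        name[:MAX_NAME_LENGTH]: (
            value[:MAX_VALUE_LENGTH] if isinstance(value, str) else value
        )
        for name, value in prop.items()
    }
```

Two names that share their first 60 characters would collapse into a single key here, which is worth checking before relying on a sketch like this.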
- static update_model_properties(model_files, update_dict)[source]#
Updates the ModelProperties.json file to include properties listed in the update_dict dictionary.
- Parameters:
- model_filesstring, Path, or dict
Either the directory location of the model files (string or Path object), or a dictionary containing the contents of all the model files.
- update_dictdictionary
A dictionary containing the key-value pairs that represent properties to be added to the ModelProperties.json file.
- static upload_training_data(conn, model_prefix: str, train_data: DataFrame, server: str = 'cas-shared-default', caslib: str = 'Public')[source]#
Uploads training data to CAS server.
- Parameters:
- conn
SWAT connection. Used to connect to CAS server.
- model_prefixstring
The prefix used to name files relating to the model. This is used to provide a unique name to the training data table when it is uploaded to CAS.
- train_data: pandas.DataFrame
Training data that contains all input variables as well as the target variable.
- server: str, optional
The CAS server the training data will be stored on. The default value is “cas-shared-default”.
- caslib: str, optional
The caslib the training data will be stored on. The default value is “Public”.
- Returns:
- string
Returns a string that represents the location of the training table within CAS.
- classmethod user_input_fitstat(data: List[dict]) List[dict] [source]#
Prompt the user to enter parameters for dmcas_fitstat.json.
- Parameters:
- datalist of dicts
List of dicts for the data values of each parameter. Split into the three valid partitions (TRAIN, TEST, VALIDATE).
- Returns:
- list of dicts
List of dicts with the user provided values inputted.
- valid_params: List[str] = ['_RASE_', '_NObs_', '_GINI_', '_GAMMA_', '_MCE_', '_ASE_', '_MCLL_', '_KS_', '_KSPostCutoff_', '_DIV_', '_TAU_', '_KSCut_', '_C_']#
- classmethod write_file_metadata_json(model_prefix: str, json_path: str | Path | None = None, is_h2o_model: bool | None = False, is_tf_keras_model: bool | None = False) dict | None [source]#
Writes a file metadata JSON file pointing to all relevant files.
This function outputs a JSON file named “fileMetadata.json”.
- Parameters:
- model_prefixstr
The variable for the model name that is used when naming model files. For example: hmeqClassTree + [Score.py | .pickle].
- json_pathstr or Path, optional
Path for an output fileMetadata.json file to be generated. If no value is supplied a dict is returned instead. The default value is None.
- is_h2o_modelbool, optional
Sets whether the model metadata is associated with an H2O.ai model. If set as True, the MOJO model file will be set as a score resource. The default value is False.
- Returns:
- dict
Dictionary containing a key-value pair representing the file name and json dump respectively.
- classmethod write_model_properties_json(model_name: str, target_variable: str, target_values: List[Any] | None = None, json_path: str | Path | None = None, model_desc: str | None = None, model_algorithm: str | None = None, model_function: str | None = None, modeler: str | None = None, train_table: str | None = None, properties: List[dict] | None = None) dict | None [source]#
Writes a JSON file containing SAS Model Manager model properties.
Property values for multiclass models are not supported at the model level in SAS Model Manager. If these values are detected, they will be supplied as custom user properties.
If a json_path is supplied, this function outputs a JSON file named “ModelProperties.json”. Otherwise, a dict is returned.
- Parameters:
- model_namestr
User-defined model name. This value is overwritten by SAS Model Manager based on the name of the zip file used for importing the model.
- target_variablestr
Target variable to be predicted by the model.
- target_valueslist, optional
Model target event(s). Providing no target values indicates the model is a regression model. Providing 2 target values indicates the model is a binary classification model. Providing > 2 target values will supply the values for the different target events as a custom property. An error is raised if only 1 target value is supplied. The default value is None.
- json_pathstr or Path, optional
Path for an output ModelProperties.json file to be generated. If no value is supplied a dict is returned instead. The default value is None.
- model_descstr, optional
User-defined model description. The default value is an empty string.
- model_algorithmstr, optional
User-defined model algorithm name. The default value is an empty string.
- model_functionstr, optional
User-defined model function name. The default value is an empty string.
- modelerstr, optional
User-defined value for the name of the modeler. The default value is an empty string.
- train_tablestr, optional
The path to the model’s training table within SAS Viya. The default value is an empty string.
- propertiesList of dict, optional
List of custom properties to be shown in the user-defined properties section of the model in SAS Model Manager. Dict entries should contain the name, value, and type keys. The default value is an empty list.
- Returns:
- dict
Dictionary containing a key-value pair representing the file name and json dump respectively.
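The rules for target_values described above can be summarized in a small sketch. The helper name infer_model_kind and the returned labels are hypothetical, and ValueError is an assumption, since the text only states that an error is raised for a single target value.

```python
from typing import Any, List, Optional


def infer_model_kind(target_values: Optional[List[Any]] = None) -> str:
    """Classify a model by how many target values were supplied."""
    if not target_values:
        # No target values indicates a regression model.
        return "regression"
    if len(target_values) == 1:
        # The text states an error is raised; the exception type is assumed.
        raise ValueError("Provide either no target values or at least two.")
    if len(target_values) == 2:
        return "binary classification"
    # More than two values are supplied as a custom property instead.
    return "multiclass classification"
```

For instance, infer_model_kind(["1", "0"]) yields the binary label, while three or more values yield the multiclass label.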
- classmethod write_var_json(input_data: dict | DataFrame | Series, is_input: bool | None = True, json_path: str | Path | None = None) dict | None [source]#
Writes a variable descriptor JSON file for input or output variables, based on input data containing predictor and prediction columns.
If a path is provided, this function creates a JSON file named either inputVar.json or outputVar.json based on argument inputs. Otherwise, a dict is returned with the key-value pair representing the file name and json dump respectively.
- Parameters:
- input_datapandas.DataFrame, pandas.Series, or list of dict
Input dataframe containing the training data set in a pandas.DataFrame format. Columns are used to define predictor and prediction variables (ambiguously named “predict”). Providing a list of dict objects signals that the model files are being created from an MLflow model.
- is_inputbool, optional
Boolean flag to check if generating the input or output variable JSON. The default value is True.
- json_pathstr or Path, optional
File location for the output JSON file. The default value is None.
- Returns:
- dict
Dictionary containing a key-value pair representing the file name and json dump respectively.
- class sasctl.pzmm.write_json_files.NpEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]#
Bases:
JSONEncoder
Methods
default
(obj)Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
encode
(o)Return a JSON string representation of a Python data structure.
iterencode
(o[, _one_shot])Encode the given object and yield each string representation as available.
- default(obj)[source]#
Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return JSONEncoder.default(self, o)
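The iterator example from the docstring can be exercised with json.dumps by passing the subclass through the cls argument. The class name IterableEncoder is hypothetical. The sketch shows the standard library pattern only and makes no claim about how NpEncoder itself converts values.

```python
import json


class IterableEncoder(json.JSONEncoder):
    """Serialize any iterable that json cannot handle natively as a list."""

    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return super().default(o)


# A set and a generator are not JSON serializable without a custom encoder.
payload = {"ids": {3}, "squares": (n * n for n in range(3))}
encoded = json.dumps(payload, cls=IterableEncoder)
```

Objects that are not iterable still fall through to the base class and raise TypeError, so unsupported types are not silently dropped.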