Getting started

About this example

This section provides a simple example that demonstrates how to use the basic functionality of the software to create repeatable pipelines. The simplest example follows this pattern:

  1. Create an instance of a Pipeline class.
  2. Add transformer stages to handle variable imputation.
  3. Add an estimator stage to generate parameter estimates.
  4. Run the pipeline with the fit() method to create a pipeline model.
  5. Run the score() method on the model with new data to score and assess the model.
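Before working through the real example, this fit/score pattern can be sketched in plain Python. The classes below are illustrative stand-ins only, not pipefitter's implementation: transformer stages clean the data in order, and the final estimator stage produces a model that can score data.

```python
# Illustrative stand-ins for the pipeline pattern. pipefitter's real
# classes do the equivalent work on SAS data sets or CAS tables.
class MeanImputer:
    def transform(self, rows):
        # Transformer stage: fill missing numeric values with the mean.
        vals = [r["Age"] for r in rows if r["Age"] is not None]
        mean = sum(vals) / len(vals)
        return [dict(r, Age=r["Age"] if r["Age"] is not None else mean)
                for r in rows]

class MajorityModel:
    def __init__(self, prediction):
        self.prediction = prediction

    def score(self, rows):
        # Fraction of rows that the majority-class prediction gets right.
        return sum(r["Survived"] == self.prediction for r in rows) / len(rows)

class MajorityEstimator:
    def fit(self, rows):
        # Estimator stage: "train" by picking the most common target value.
        ones = sum(r["Survived"] for r in rows)
        return MajorityModel(1 if ones > len(rows) / 2 else 0)

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        # Run each transformer in order, then fit the final estimator.
        for stage in self.stages[:-1]:
            rows = stage.transform(rows)
        return self.stages[-1].fit(rows)

data = [{"Age": 22.0, "Survived": 0},
        {"Age": None, "Survived": 1},
        {"Age": 35.0, "Survived": 0}]
model = Pipeline([MeanImputer(), MajorityEstimator()]).fit(data)
print(model.score(data))  # → 0.6666666666666666 (class 0 matches 2 of 3 rows)
```

The rest of this document builds the same shape of pipeline with pipefitter's own Imputer and DecisionTree stages.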

To demonstrate these common steps for developing a machine learning pipeline, this example uses the Titanic training data set from a Kaggle competition.

Because pipefitter can run in SAS 9 or SAS Viya, the last two steps are data-dependent. This document shows how to run the pipeline first with SAS 9 and then with SAS Viya.

Build the pipeline

First, download the training data to a pandas DataFrame.

In [1]: import pandas as pd

In [2]: train = pd.read_csv('http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv')

In [3]: train.head()
Out[3]: 
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

Both numeric and character columns contain missing values. The pipeline can start with two transformer stages: one fills missing numeric values with the mean, and the other fills missing character values with the most common value.

In [4]: from pipefitter.transformer import Imputer

In [5]: meanimp = Imputer(value=Imputer.MEAN)

In [6]: modeimp = Imputer(value=Imputer.MODE)
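Conceptually, these two imputers do what the following standard-library sketch does. The column names and values here are illustrative samples from the Titanic data; the real imputers operate on every eligible column of the SAS or CAS table.

```python
import statistics

ages = [22.0, 38.0, None, 35.0, None, 26.0]      # numeric column with gaps
embarked = ["S", "C", None, "S", "S", None]       # character column with gaps

# MEAN imputation: replace missing numeric values with the column mean.
age_mean = statistics.mean(a for a in ages if a is not None)
ages_filled = [a if a is not None else age_mean for a in ages]

# MODE imputation: replace missing character values with the most common value.
emb_mode = statistics.mode(e for e in embarked if e is not None)
embarked_filled = [e if e is not None else emb_mode for e in embarked]

print(ages_filled)      # mean of 22, 38, 35, 26 is 30.25
print(embarked_filled)  # most common value is 'S'
```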

The following statements add these stages to a pipeline. Printing the object shows that the pipeline includes the two stages.

In [7]: from pipefitter.pipeline import Pipeline

In [8]: pipe = Pipeline([meanimp, modeimp])

In [9]: pipe
Out[9]: Pipeline([Imputer(MEAN), Imputer(MODE)])

The last stage of the pipeline is an estimator. To model the survival of passengers, we can train a decision tree using the DecisionTree class in the pipefitter.estimator module. The target, inputs, and nominals arguments are set to match the variables in the data set.

In [10]: from pipefitter.estimator import DecisionTree

In [11]: dtree = DecisionTree(target='Survived',
   ....:                      inputs=['Sex', 'Age', 'Fare'],
   ....:                      nominals=['Sex', 'Survived'])
   ....: 

In [12]: dtree
Out[12]: DecisionTree(alpha=0.0, cf_level=0.25, criterion=None, inputs=['Sex', 'Age', 'Fare'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=['Sex', 'Survived'], prune=False, target='Survived', var_importance=False)

In addition to DecisionTree, other estimators such as DecisionForest, GBTree, and LogisticRegression can be used in pipelines in the same way.

To complete the work on the pipeline, add the estimator stage.

In [13]: pipe.stages.append(dtree)

In [14]: for stage in pipe:
   ....:     print(stage, "\n")
   ....: 
Imputer(MEAN) 

Imputer(MODE) 

DecisionTree(alpha=0.0, cf_level=0.25, criterion=None, inputs=['Sex', 'Age', 'Fare'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=['Sex', 'Survived'], prune=False, target='Survived', var_importance=False) 

Now that the pipeline is complete, it only needs training data.

Run the pipeline in SAS 9.4

This section continues from the pipeline setup in the preceding section. The SASPy package enables you to run analytics in SAS 9.4 and higher from a Python API.

First, start a SAS session and copy the training data to a SAS data set.

In [15]: import saspy

In [16]: sasconn = saspy.SASsession(cfgname=cfgname)
SAS Connection established. Subprocess id is 12315


In [17]: train_ds = sasconn.df2sd(df=train, table="train_ds")

In [18]: train_ds.head()
Out[18]: 
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name      Sex  Age  SibSp  \
0                           Braund, Mr. Owen Harris     male    22      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female    38      1   
2                            Heikkinen, Miss. Laina   female    26      0   
3      Futrelle, Mrs. Jacques Heath (Lily May Peel)   female    35      1   
4                          Allen, Mr. William Henry     male    35      0   

   Parch             Ticket     Fare  Cabin Embarked  
0      0         A/5 21171    7.2500   nan         S  
1      0          PC 17599   71.2833   C85         C  
2      0  STON/O2. 3101282    7.9250   nan         S  
3      0            113803   53.1000  C123         S  
4      0            373450    8.0500   nan         S  

Note

For information about starting a SAS session with the SASPy package, see http://sassoftware.github.io/saspy/getting-started.html#start-a-sas-session.

Next, generate the model using the Pipeline.fit() method. The training data is supplied as an argument. This method returns a PipelineModel object.

In [19]: pipeline_model = pipe.fit(train_ds)

In [20]: for stage in pipeline_model:
   ....:     print(stage, "\n")
   ....: 
Imputer(MEAN) 

Imputer(MODE) 

DecisionTreeModel(alpha=0.0, cf_level=0.25, criterion=None, inputs=['Sex', 'Age', 'Fare'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=['Sex', 'Survived'], prune=False, target='Survived', var_importance=False) 

Note

In the pipeline model, the decision tree becomes a decision tree model.

View the model assessment by running the PipelineModel.score() method with the training data.

In [21]: pipeline_model.score(train_ds)
Out[21]: 
Target                           Survived
Level                               CLASS
Var                           P_Survived0
NBins                                 100
NObsUsed                              891
TargetCount                           891
TargetMiss                              0
PredCount                             891
PredMiss                                0
Event                                   0
EventCount                            549
NonEventCount                         342
EventMiss                               0
KSR                                63.201
KS                                0.63201
KSDepth                           62.6263
KSCutOff                         0.616512
KSRef                            0.236842
MaxClassificationRate              83.165
MaxClassificationDepth            65.6566
CRCut                            0.508547
MedianClassificationDepth         33.3333
MedianEventDetectionCutOff       0.870466
MisClassificationRate             18.5185
ClassificationCutOff                  0.5
Name: DecisionTree, dtype: object

Run the pipeline in SAS Viya and SAS Cloud Analytic Services

This section continues from the pipeline setup in the first section. If you ran the SAS 9.4 section with SASPy, you can continue with the code in this section:

  • The pipelines are designed to be portable.
  • The location of the data determines where the pipeline runs: on SAS 9.4 data sets or on in-memory tables in CAS.
  • The model implementations differ between the platforms, so the model must be retrained.

Use SAS SWAT to connect to SAS Cloud Analytic Services on SAS Viya.

In [22]: import swat

In [23]: casconn = swat.CAS(host, port, userid, password)

Note

For information about starting a CAS session with the SWAT package, see https://sassoftware.github.io/python-swat/getting-started.html.

All processing in pipefitter begins with a data set. In the case of SAS SWAT, the data set is a CASTable object. You can create a CASTable from the training data that was initially downloaded.

In [24]: train_ct = casconn.upload_frame(train, casout=dict(name="train_ct", replace=True))
NOTE: Cloud Analytic Services made the uploaded file available as table TRAIN_CT in caslib CASUSER(kesmit).
NOTE: The table TRAIN_CT has been created in caslib CASUSER(kesmit) from binary data uploaded to Cloud Analytic Services.

In [25]: train_ct.info()
CASTable('TRAIN_CT', caslib='CASUSER(kesmit)')
Data columns (total 12 columns):
               N   Miss     Type
PassengerId  891  False   double
Survived     891  False   double
Pclass       891  False   double
Name         891  False  varchar
Sex          891  False  varchar
Age          714   True   double
SibSp        891  False   double
Parch        891  False   double
Ticket       891  False  varchar
Fare         891  False   double
Cabin        204   True  varchar
Embarked     889   True  varchar
dtypes: double(7), varchar(5)
data size: 157030
vardata size: 35854
memory usage: 157128

The output of the info method shows again that the training data has missing values. Because the pipeline includes the two imputer stages, those missing values are filled when the model is trained.

Use the same Pipeline instance to generate another PipelineModel.

In [26]: pipeline_model = pipe.fit(train_ct)

In [27]: for stage in pipeline_model:
   ....:   print(stage, "\n")
   ....: 
Imputer(MEAN) 

Imputer(MODE) 

DecisionTreeModel(alpha=0.0, cf_level=0.25, criterion=None, inputs=['Sex', 'Age', 'Fare'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=['Sex', 'Survived'], prune=False, target='Survived', var_importance=False) 

Use the pipeline model to score the training data and show the model assessment.

In [28]: pipeline_model.score(train_ct)
Out[28]: 
Target                       Survived
Level                           CLASS
Var                            _DT_P_
NBins                             100
NObsUsed                          891
TargetCount                       891
TargetMiss                          0
PredCount                         891
PredMiss                            0
Event                               0
EventCount                        549
NonEventCount                     342
EventMiss                           0
AreaUnderROCCurve            0.846851
CRCut                            0.47
ClassificationCutOff              0.5
KS                           0.572908
KSCutOff                         0.54
MisClassificationRate         19.9776
Name: DecisionTree, dtype: object

The results share many measures with the SAS 9.4 results from SASPy, such as the NObsUsed and MisClassificationRate values.

Bonus: Hyperparameter tuning

In addition to creating pipelines of transformers and estimators, you can test various permutations of parameters using the HyperParameterTuning class. This class takes a grid of parameters to test, applies them to an estimator or a pipeline, and returns the compiled results. The parameter grid can be either a dictionary of key/value pairs whose values are lists, or a list of dictionaries that each contain a complete set of parameters to test.
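The two grid formats are interchangeable: a dictionary of lists is shorthand for the full cross product of its values. A small standard-library sketch of the expansion (the parameter names match the gridsearch call below; the expansion itself is illustrative, not pipefitter code):

```python
from itertools import product

# Dictionary-of-lists form: each value is a list of settings to try.
grid = dict(max_depth=[6, 10], leaf_size=[3, 5])

# Equivalent list-of-dicts form: one dictionary per parameter combination.
names = sorted(grid)
combos = [dict(zip(names, values))
          for values in product(*(grid[n] for n in names))]
print(combos)
# → [{'leaf_size': 3, 'max_depth': 6}, {'leaf_size': 3, 'max_depth': 10},
#    {'leaf_size': 5, 'max_depth': 6}, {'leaf_size': 5, 'max_depth': 10}]
```

These four combinations correspond to the four DecisionTree rows in the gridsearch results below.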

Hyperparameter tuning can be performed directly on pipelines.

In [29]: from pipefitter.model_selection import HyperParameterTuning as HPT

In [30]: hpt = HPT(estimator=pipe,
   ....:           param_grid=dict(max_depth=[6, 10],
   ....:                           leaf_size=[3, 5]),
   ....:           score_type='MisClassificationRate',
   ....:           cv=3)
   ....: 

In [31]: hpt.gridsearch(train_ct)
Out[31]: 
              MeanScore  ScoreStd                         Parameters  \
DecisionTree  20.538721  2.416443   {'leaf_size': 3, 'max_depth': 6}   
DecisionTree  20.650954  1.506466  {'leaf_size': 3, 'max_depth': 10}   
DecisionTree  21.099888  3.284560   {'leaf_size': 5, 'max_depth': 6}   
DecisionTree  21.997755  2.077972  {'leaf_size': 5, 'max_depth': 10}   

                                                     FoldScores  MeanClockTime  
DecisionTree  [17.218543046357617, 21.5547703180212, 22.8758...       0.003019  
DecisionTree  [18.54304635761589, 21.5547703180212, 21.89542...       0.003061  
DecisionTree  [16.55629139072847, 24.028268551236753, 22.875...       0.003059  
DecisionTree  [19.20529801324503, 22.614840989399298, 24.183...       0.002996