Getting started
About this example
This section provides a simple example that demonstrates how to use the basic functionality of the software to create repeatable pipelines. The simplest example follows this pattern:
- Create an instance of a Pipeline class.
- Add transformer stages to handle variable imputation.
- Add an estimator stage to generate parameter estimates.
- Run the pipeline with the fit() method to create a pipeline model.
- Run the score() method on the model with new data to score and assess the model.
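A condensed sketch of this pattern follows. The stages mirror the ones built step by step below; training_data and new_data are placeholders for a data set object appropriate to the back end (a SAS data set or a CAS table), both of which are created later in this document.

# Condensed sketch of the pattern above; training_data and new_data
# stand in for a SAS data set or a CAS table created later.
from pipefitter.pipeline import Pipeline
from pipefitter.transformer import Imputer
from pipefitter.estimator import DecisionTree

pipe = Pipeline([
    Imputer(value=Imputer.MEAN),   # fill missing numeric values with the mean
    Imputer(value=Imputer.MODE),   # fill missing character values with the mode
    DecisionTree(target='Survived',
                 inputs=['Sex', 'Age', 'Fare'],
                 nominals=['Sex', 'Survived']),
])

pipeline_model = pipe.fit(training_data)   # returns a PipelineModel
pipeline_model.score(new_data)             # assess the model on new data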
To demonstrate these common steps for developing a machine learning pipeline, the Titanic training data set from a Kaggle competition is used.
Because pipefitter can run with SAS 9 or SAS Viya, the last two steps are data dependent. This document shows how to run the pipeline first with SAS 9 and then with SAS Viya.
Build the pipeline
First, download the training data to a Pandas DataFrame.
In [1]: import pandas as pd
In [2]: train = pd.read_csv('http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv')
In [3]: train.head()
Out[3]:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
There are both numeric and character columns that contain missing values. The pipeline can start with two transformer stages to fill missing numeric values with the mean and missing character values with the most common value.
In [4]: from pipefitter.transformer import Imputer
In [5]: meanimp = Imputer(value=Imputer.MEAN)
In [6]: modeimp = Imputer(value=Imputer.MODE)
The following statements add these stages to a pipeline. Printing the object shows that the pipeline includes the two stages.
In [7]: from pipefitter.pipeline import Pipeline
In [8]: pipe = Pipeline([meanimp, modeimp])
In [9]: pipe
Out[9]: Pipeline([Imputer(MEAN), Imputer(MODE)])
The last stage of the pipeline is an estimator. To model the survival of passengers, we can train a decision tree model. This is done using the DecisionTree object in the pipefitter.estimator module. We set the target, inputs, and nominals arguments to match the variables in the data set.
In [10]: from pipefitter.estimator import DecisionTree
In [11]: dtree = DecisionTree(target='Survived',
....: inputs=['Sex', 'Age', 'Fare'],
....: nominals=['Sex', 'Survived'])
....:
In [12]: dtree
Out[12]: DecisionTree(alpha=0.0, cf_level=0.25, criterion=None, inputs=['Sex', 'Age', 'Fare'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=['Sex', 'Survived'], prune=False, target='Survived', var_importance=False)
In addition to DecisionTree, there are other estimators, such as DecisionForest, GBTree, and LogisticRegression, that can be used in pipelines in the same way.
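For example, a gradient boosting stage could take the place of the decision tree. The following is a minimal sketch, assuming GBTree accepts the same target, inputs, and nominals parameters as DecisionTree:

from pipefitter.estimator import GBTree

# Sketch only: build an alternative pipeline with a gradient boosting
# tree as the estimator stage, reusing the two imputers defined above.
gbt = GBTree(target='Survived',
             inputs=['Sex', 'Age', 'Fare'],
             nominals=['Sex', 'Survived'])

gbt_pipe = Pipeline([meanimp, modeimp, gbt])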
To complete the work on the pipeline, add the estimator stage.
In [13]: pipe.stages.append(dtree)
In [14]: for stage in pipe:
....: print(stage, "\n")
....:
Imputer(MEAN)
Imputer(MODE)
DecisionTree(alpha=0.0, cf_level=0.25, criterion=None, inputs=['Sex', 'Age', 'Fare'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=['Sex', 'Survived'], prune=False, target='Survived', var_importance=False)
Now that the pipeline is complete, we just need to add training data.
Run the pipeline in SAS 9.4
This section builds on the pipeline setup work from the preceding section. The SASPy package enables you to run analytics in SAS 9.4 and later from a Python API.
First, start a SAS session and copy the training data to a SAS data set.
In [15]: import saspy
In [16]: sasconn = saspy.SASsession(cfgname=cfgname)
SAS Connection established. Subprocess id is 12315
In [17]: train_ds = sasconn.df2sd(df=train, table="train_ds")
In [18]: train_ds.head()
Out[18]:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1
2 Heikkinen, Miss. Laina female 26 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
4 Allen, Mr. William Henry male 35 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 nan S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 nan S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 nan S
Note
For information about starting a SAS session with the SASPy package, see http://sassoftware.github.io/saspy/getting-started.html#start-a-sas-session.
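As a point of reference, a local (STDIO) SASPy configuration is typically defined in a sascfg_personal.py file. The sketch below is an illustration only; the configuration name and the saspath value are assumptions that must match your installation.

# sascfg_personal.py -- a minimal STDIO configuration sketch.
# The SAS executable path below is an example only; adjust it to
# your SAS installation.
SAS_config_names = ['default']
default = {'saspath': '/opt/sasinside/SASHome/SASFoundation/9.4/bin/sas_u8'}

With a configuration like this, saspy.SASsession(cfgname='default') starts the session shown above.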
Next, generate the model using the Pipeline.fit() method. The training data is supplied as an argument. This method returns a PipelineModel object.
In [19]: pipeline_model = pipe.fit(train_ds)
In [20]: for stage in pipeline_model:
....: print(stage, "\n")
....:
Imputer(MEAN)
Imputer(MODE)
DecisionTreeModel(alpha=0.0, cf_level=0.25, criterion=None, inputs=['Sex', 'Age', 'Fare'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=['Sex', 'Survived'], prune=False, target='Survived', var_importance=False)
Note
In the pipeline model, the decision tree becomes a decision tree model.
View the model assessment by running the PipelineModel.score() method with the training data.
In [21]: pipeline_model.score(train_ds)
Out[21]:
Target Survived
Level CLASS
Var P_Survived0
NBins 100
NObsUsed 891
TargetCount 891
TargetMiss 0
PredCount 891
PredMiss 0
Event 0
EventCount 549
NonEventCount 342
EventMiss 0
KSR 63.201
KS 0.63201
KSDepth 62.6263
KSCutOff 0.616512
KSRef 0.236842
MaxClassificationRate 83.165
MaxClassificationDepth 65.6566
CRCut 0.508547
MedianClassificationDepth 33.3333
MedianEventDetectionCutOff 0.870466
MisClassificationRate 18.5185
ClassificationCutOff 0.5
Name: DecisionTree, dtype: object
Run the pipeline in SAS Viya and SAS Cloud Analytic Services
This section builds on the pipeline setup work from the first section. If you ran the SAS 9.4 section with SASPy, you can continue with the code in this section:
- The pipelines are designed to be portable.
- The location of the data determines where the pipeline runs: in SAS 9.4 or on in-memory tables in CAS.
- The model implementations differ between platforms, so the model does need to be retrained (as sketched below).
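To make the portability concrete, here is a minimal sketch that reuses the pipe object built in the first section against the CAS table train_ct that is created later in this section:

# Sketch of the retraining step: the same pipeline object is refit and
# rescored once the data lives in CAS (train_ct is created below).
cas_pipeline_model = pipe.fit(train_ct)   # retrain the stages in CAS
cas_pipeline_model.score(train_ct)        # assessment computed in CAS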
Use SAS SWAT to connect to SAS Cloud Analytic Services on SAS Viya.
In [22]: import swat
In [23]: casconn = swat.CAS(host, port, userid, password)
Note
For information about starting a CAS session with the SWAT package, see https://sassoftware.github.io/python-swat/getting-started.html.
All processing in pipefitter begins with a data set. In the case of SAS SWAT, the data set is a CASTable object. You can create a CASTable from the training data that was initially downloaded.
In [24]: train_ct = casconn.upload_frame(train, casout=dict(name="train_ct", replace=True))
NOTE: Cloud Analytic Services made the uploaded file available as table TRAIN_CT in caslib CASUSER(kesmit).
NOTE: The table TRAIN_CT has been created in caslib CASUSER(kesmit) from binary data uploaded to Cloud Analytic Services.
In [25]: train_ct.info()