About this example
This section provides a simple example that demonstrates how to use the basic functionality of the software to create repeatable pipelines. The simplest example follows this pattern:
- Create an instance of a Pipeline.
- Add transformer stages to handle variable imputation.
- Add an estimator stage to generate parameter estimates.
- Run the pipeline with the fit() method to create a pipeline model.
- Run the score() method on the model with new data to score and assess the model.
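The pattern in the list above can be sketched in plain Python. This is only an illustration of the transformer, estimator, fit, and score flow, not pipefitter's implementation: every class name here (`MeanImputer`, `ThresholdEstimator`, `ThresholdModel`, `ToyPipeline`, `ToyPipelineModel`) is hypothetical, and the toy imputer recomputes its fill value on whatever data it receives, which is a simplification.

```python
from statistics import mean


class MeanImputer:
    """Hypothetical transformer stage: replaces None with the mean of the known values."""

    def transform(self, values):
        known = [v for v in values if v is not None]
        fill = mean(known)
        return [fill if v is None else v for v in values]


class ThresholdModel:
    """Hypothetical fitted model: predicts 1 when a value exceeds the learned cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def predict(self, values):
        return [1 if v > self.cutoff else 0 for v in values]


class ThresholdEstimator:
    """Hypothetical estimator stage: fit() learns a cutoff and returns a model."""

    def fit(self, values, labels):
        return ThresholdModel(cutoff=mean(values))


class ToyPipelineModel:
    """Result of fitting: the transformers plus the fitted model."""

    def __init__(self, stages):
        self.stages = stages

    def score(self, values, labels):
        *transformers, model = self.stages
        for stage in transformers:
            values = stage.transform(values)
        predicted = model.predict(values)
        wrong = sum(p != a for p, a in zip(predicted, labels))
        return {
            "NObsUsed": len(labels),
            "MisClassificationRate": 100.0 * wrong / len(labels),
        }


class ToyPipeline:
    """Ordered list of stages; the last stage is the estimator."""

    def __init__(self, stages):
        self.stages = list(stages)

    def fit(self, values, labels):
        *transformers, estimator = self.stages
        for stage in transformers:
            values = stage.transform(values)
        model = estimator.fit(values, labels)
        # The estimator is replaced by its fitted model in the returned object.
        return ToyPipelineModel(transformers + [model])


pipe = ToyPipeline([MeanImputer(), ThresholdEstimator()])
pipeline_model = pipe.fit([1.0, None, 3.0, 8.0], [0, 0, 0, 1])
result = pipeline_model.score([2.0, None, 9.0], [0, 0, 1])
print(result)
```

The point of the sketch is the separation of roles: the pipeline holds unfitted stages, and fitting it produces a separate object that can score new data.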
To demonstrate these common steps for developing a machine learning pipeline, the Titanic training data set from a Kaggle competition is used.
Because pipefitter can run in SAS 9 or SAS Viya, the last two steps depend on where the data resides. This document shows how to run the pipeline first with SAS 9 and then with SAS Viya.
Build the pipeline
First, download the training data to a Pandas DataFrame.
In : import pandas as pd

In : train = pd.read_csv('http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv')

In : train.head()
Out:
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S
There are both numeric and character columns that contain missing values. The pipeline can start with two transformer stages to fill missing numeric values with the mean and missing character values with the most common value.
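To make the two imputation rules concrete, here is a minimal standard-library sketch of mean and mode filling. The rows are hypothetical sample data shaped like two Titanic columns, and the helper names `impute_mean` and `impute_mode` are invented for illustration; the sketch does not claim to reproduce how pipefitter's Imputer works internally.

```python
from collections import Counter
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"Age": 22.0, "Embarked": "S"},
    {"Age": 38.0, "Embarked": "C"},
    {"Age": None, "Embarked": "S"},
    {"Age": 35.0, "Embarked": None},
]


def impute_mean(rows, column):
    """Return new rows with missing numeric values replaced by the column mean."""
    known = [r[column] for r in rows if r[column] is not None]
    fill = mean(known)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def impute_mode(rows, column):
    """Return new rows with missing values replaced by the most common value."""
    known = [r[column] for r in rows if r[column] is not None]
    fill = Counter(known).most_common(1)[0][0]
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


filled = impute_mode(impute_mean(rows, "Age"), "Embarked")
for row in filled:
    print(row)
```

Both helpers return new rows rather than modifying their input, so the original data stays available for comparison.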
In : from pipefitter.transformer import Imputer

In : meanimp = Imputer(value=Imputer.MEAN)

In : modeimp = Imputer(value=Imputer.MODE)
The following statements add these stages to a pipeline. Printing the object shows that the pipeline includes the two stages.
In : from pipefitter.pipeline import Pipeline

In : pipe = Pipeline([meanimp, modeimp])

In : pipe
Out: Pipeline([Imputer(MEAN), Imputer(MODE)])
The last stage of the pipeline is an estimator. To model the survival of passengers, we can train a decision tree model. This is done using the DecisionTree object in the pipefitter.estimator module. We set the target, inputs, and nominals arguments to match the variables in the data set.
In : from pipefitter.estimator import DecisionTree

In : dtree = DecisionTree(target='Survived',
....:                      inputs=['Sex', 'Age', 'Fare'],
....:                      nominals=['Sex', 'Survived'])
....:

In : dtree
Out: DecisionTree(alpha=0.0, cf_level=0.25, criterion=None, inputs=['Sex', 'Age', 'Fare'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=['Sex', 'Survived'], prune=False, target='Survived', var_importance=False)
In addition to DecisionTree, pipefitter provides other estimators that can be used in similar ways. You can use these estimators in pipelines.
To complete the work on the pipeline, add the estimator stage.
In : pipe.stages.append(dtree)

In : for stage in pipe:
....:     print(stage, "\n")
....:
Imputer(MEAN)

Imputer(MODE)

DecisionTree(alpha=0.0, cf_level=0.25, criterion=None, inputs=['Sex', 'Age', 'Fare'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=['Sex', 'Survived'], prune=False, target='Survived', var_importance=False)
Now that the pipeline is complete, we just need to add training data.
Run the pipeline in SAS 9.4
This section builds on the pipeline setup from the preceding section. The SASPy package enables you to run analytics in SAS 9.4 and higher from a Python API.
First, start a SAS session and copy the training data to a SAS data set.
In : import saspy

In : sasconn = saspy.SASsession(cfgname=cfgname)
SAS Connection established. Subprocess id is 12315

In : train_ds = sasconn.df2sd(df=train, table="train_ds")

In : train_ds.head()
Out:
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex  Age  SibSp  \
0                            Braund, Mr. Owen Harris    male   22      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1
2                             Heikkinen, Miss. Laina  female   26      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1
4                           Allen, Mr. William Henry    male   35      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   nan        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   nan        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   nan        S
For information about starting a SAS session with the SASPy package, see http://sassoftware.github.io/saspy/getting-started.html#start-a-sas-session.
Next, generate the model using the Pipeline.fit() method. The training data is supplied as an argument. This method returns a pipeline model.
In : pipeline_model = pipe.fit(train_ds)

In : for stage in pipeline_model:
....:     print(stage, "\n")
....:
Imputer(MEAN)

Imputer(MODE)

DecisionTreeModel(alpha=0.0, cf_level=0.25, criterion=None, inputs=['Sex', 'Age', 'Fare'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=['Sex', 'Survived'], prune=False, target='Survived', var_importance=False)
In the pipeline model, the decision tree becomes a decision tree model.
View the model assessment by running the score() method with the training data.
In : pipeline_model.score(train_ds)
Out:
Target                           Survived
Level                               CLASS
Var                           P_Survived0
NBins                                 100
NObsUsed                              891
TargetCount                           891
TargetMiss                              0
PredCount                             891
PredMiss                                0
Event                                   0
EventCount                            549
NonEventCount                         342
EventMiss                               0
KSR                                63.201
KS                                0.63201
KSDepth                           62.6263
KSCutOff                         0.616512
KSRef                            0.236842
MaxClassificationRate              83.165
MaxClassificationDepth            65.6566
CRCut                            0.508547
MedianClassificationDepth         33.3333
MedianEventDetectionCutOff       0.870466
MisClassificationRate             18.5185
ClassificationCutOff                  0.5
Name: DecisionTree, dtype: object
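As a rough guide to reading the assessment, the sketch below computes a misclassification rate by hand. It assumes the rate is the percentage of observations whose predicted class at the classification cutoff differs from the actual class; the output above does not define the statistic, so treat that as an assumption. The function name, probabilities, and labels are hypothetical.

```python
def misclassification_rate(probabilities, actuals, cutoff=0.5):
    """Percentage of rows whose predicted class differs from the actual class."""
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    wrong = sum(p != a for p, a in zip(predicted, actuals))
    return 100.0 * wrong / len(actuals)


# Hypothetical predicted probabilities and true labels.
probs = [0.9, 0.2, 0.6, 0.4]
actual = [1, 0, 0, 0]
rate = misclassification_rate(probs, actual)
print(rate)  # 25.0: one of the four rows is on the wrong side of the cutoff

# Arithmetic check against the reported value: 165 errors in 891 rows.
rate_891 = round(100.0 * 165 / 891, 4)
print(rate_891)  # 18.5185
```

Under that assumption, the reported MisClassificationRate of 18.5185 with NObsUsed of 891 is consistent with 165 misclassified observations, although the output does not state that count directly.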
Run the pipeline in SAS Viya and SAS Cloud Analytic Services
This section builds on the pipeline setup from the first section. If you ran the SAS 9.4 section with SASPy, you can continue with the code in this section:
- The pipelines are designed to be portable.
- The location of the data determines where the pipeline runs: in SAS 9.4 or on in-memory tables in CAS.
- The model implementations are different between platforms, so the model does need to be retrained.
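One way to picture "the location of the data determines where the pipeline runs" is dispatch on the type of the table object. The sketch below uses `functools.singledispatch` for that. It is an illustration only: the classes `Sas9Table` and `CasTable` and the functions `backend_for` and `fit` are hypothetical stand-ins, and nothing here describes how pipefitter selects a backend internally.

```python
from functools import singledispatch


class Sas9Table:
    """Hypothetical stand-in for a data set that lives in SAS 9.4."""

    def __init__(self, name):
        self.name = name


class CasTable:
    """Hypothetical stand-in for an in-memory table that lives in CAS."""

    def __init__(self, name):
        self.name = name


@singledispatch
def backend_for(table):
    # Unknown table types have no backend to run on.
    raise TypeError(f"unsupported table type: {type(table).__name__}")


@backend_for.register(Sas9Table)
def _(table):
    return "SAS 9.4"


@backend_for.register(CasTable)
def _(table):
    return "CAS"


def fit(stages, table):
    """Pretend to fit: report which backend the same stage list would run on."""
    return {"stages": list(stages), "backend": backend_for(table)}


# The stage list is identical; only the table type changes.
stages = ["Imputer(MEAN)", "Imputer(MODE)", "DecisionTree"]
on_sas9 = fit(stages, Sas9Table("train_ds"))
on_cas = fit(stages, CasTable("train_ct"))
print(on_sas9["backend"], on_cas["backend"])
```

The stage list is reused unchanged, while each call produces a separate result, which mirrors the note above that the model is retrained per platform.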
Use SAS SWAT to connect to SAS Cloud Analytic Services on SAS Viya.
In : import swat

In : casconn = swat.CAS(host, port, userid, password)
For information about starting a CAS session with the SWAT package, see https://sassoftware.github.io/python-swat/getting-started.html.
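The connection call above uses `host`, `port`, `userid`, and `password` without showing where they come from. One possible approach, sketched here, is to read them from environment variables so that credentials stay out of the script. The variable names `CAS_HOST`, `CAS_PORT`, `CAS_USER`, and `CAS_PASSWORD` and the helper `cas_connection_args` are hypothetical, not something SWAT defines, and the values shown are placeholders.

```python
import os

REQUIRED = ("CAS_HOST", "CAS_PORT", "CAS_USER", "CAS_PASSWORD")


def cas_connection_args(env=None):
    """Return (host, port, userid, password) read from environment-style settings."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if key not in env]
    if missing:
        raise KeyError(f"missing settings: {', '.join(missing)}")
    return (
        env["CAS_HOST"],
        int(env["CAS_PORT"]),  # the port is converted from string to integer
        env["CAS_USER"],
        env["CAS_PASSWORD"],
    )


# Example with placeholder values instead of real credentials.
args = cas_connection_args({
    "CAS_HOST": "cas.example.com",
    "CAS_PORT": "5570",
    "CAS_USER": "user",
    "CAS_PASSWORD": "secret",
})
print(args[0], args[1])

# With SWAT installed, the tuple could be passed positionally, as in the call above:
# casconn = swat.CAS(*args)
```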
All processing in pipefitter begins with a data set. In the case of SAS SWAT,
the data set is a
CASTable object. You can
create a CASTable from the training data that was initially downloaded.
In : train_ct = casconn.upload_frame(train, casout=dict(name="train_ct", replace=True))
NOTE: Cloud Analytic Services made the uploaded file available as table TRAIN_CT in caslib CASUSER(kesmit).
NOTE: The table TRAIN_CT has been created in caslib CASUSER(kesmit) from binary data uploaded to Cloud Analytic Services.

In : train_ct.info()