Getting Started
Before you can use the SWAT package, you will need a running CAS server. The SWAT package can connect to either the binary port or the HTTP port. If you have the option of either, the binary protocol will give you better performance.
Other than the CAS host and port, you just need a user name and password to connect. User names and passwords can be implemented in various ways, so you may need to ask your system administrator how to obtain an account.
To connect to a CAS server, you simply import SWAT and use the swat.CAS class to create a connection. This has a couple of different forms. The most basic is to pass the hostname, port, username, and password.
In [1]: import swat
In [2]: conn = swat.CAS(host, port, username, password)
However, if you are using a REST connection to CAS, a URL is the more natural way to specify a host, port, and protocol.
In [3]: conn = swat.CAS('https://my-cas-host.com:443/cas-shared-default-http/',
...: username='...', password='...')
...:
Notice that in the URL case, username and password must be specified as keyword parameters since the port parameter is being skipped. Also, in this case we are using a proxy server that requires the base path of ‘cas-shared-default-http’. If you are connecting directly to a CAS server, this is typically not required.
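To see which pieces of information a connection URL carries, the sketch below splits the example URL with Python's standard urllib.parse module. It does not use SWAT or contact a server. The helper name describe_cas_url is hypothetical, and how SWAT itself parses the URL is not shown here.

```python
from urllib.parse import urlsplit


def describe_cas_url(url):
    """Split a CAS connection URL into protocol, host, port, and base path.

    Hypothetical helper for illustration only; it is not part of SWAT.
    """
    parts = urlsplit(url)
    return {
        'protocol': parts.scheme,
        'host': parts.hostname,
        'port': parts.port,
        # The base path is typically only needed when going through a proxy.
        'base_path': parts.path.strip('/'),
    }


info = describe_cas_url('https://my-cas-host.com:443/cas-shared-default-http/')
print(info)
```

Because the URL already encodes the protocol, host, and port, only the credentials remain to be passed as keyword parameters.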
Now that we have a connection to CAS, we can run some actions on it.
Running CAS Actions
To test your connection, you can run the serverstatus action.
In [4]: out = conn.serverstatus()
Note: Grid node action status report: 1 nodes, 8 total actions executed.
In [5]: out
Out[5]:
[About]
{'CAS': 'Cloud Analytic Services',
 'Version': '4.00',
 'VersionLong': 'V.04.00M0D06062024',
 'Copyright': 'Copyright © 2014-2024 SAS Institute Inc. All Rights Reserved.',
 'ServerTime': '2024-06-07T17:48:55Z',
 'System': {'Hostname': 'snap020',
  'OS Name': 'Linux',
  'OS Family': 'LIN X64',
  'OS Release': '2.6.32-358.2.1.el6.x86_64',
  'OS Version': '#1 SMP Wed Feb 20 12:17:37 EST 2013',
  'Model Number': 'x86_64',
  'Linux Distribution': 'Red Hat Enterprise Linux Server release 6.2 (Santiago)'},
 'Documentation': 'http://mycompany.com:8080/job/Actions_ref_doc/ws/casaref/index.html',
 'license': {'site': 'SAS Institute Inc.',
  'siteNum': 1,
  'expires': '01Aug2024:00:00:00',
  'gracePeriod': 62,
  'warningPeriod': 31},
 'CASHostAccountRequired': 'OPTIONAL',
 'Transferred': 'NO',
 'CASCacheLocation': 'CAS Disk Cache'}
[server]
Server Status
nodes actions
0 1 8
[nodestatus]
Node Status
name role uptime running stalled
0 snap020 controller 0.276 0 0
+ Elapsed: 0.00121s, mem: 0.304mb
Handling the Output
All CAS actions return a CASResults object. This is simply an ordered Python dictionary with a few extra methods and attributes added. In the output above, you’ll see the keys of the dictionary enclosed in square brackets. They are ‘About’, ‘server’, and ‘nodestatus’. Since this is a dictionary, you can access the keys in the standard way.
In [6]: out['nodestatus']
Out[6]:
Node Status
name role uptime running stalled
0 snap020 controller 0.276 0 0
In addition, you can access the keys as attributes. This convenience was added to keep your code looking a bit cleaner. However, be aware that if the name of a key collides with a standard Python attribute or method, you’ll get that attribute or method instead. So this form is fine for interactive programming, but you may want to use the syntax above for actual programs.
In [7]: out.nodestatus
Out[7]:
Node Status
name role uptime running stalled
0 snap020 controller 0.276 0 0
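The name collision mentioned above can be reproduced without a CAS server. The following is a minimal sketch using a stand-in class, AttrDict, which is an assumption made for illustration and not SWAT's CASResults implementation. It only mimics the "keys as attributes" behavior.

```python
from collections import OrderedDict


class AttrDict(OrderedDict):
    """Stand-in for a results object: an ordered dict whose keys can
    also be read as attributes. Illustration only, not SWAT code."""

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails,
        # so real dict methods such as .items always win over keys.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


results = AttrDict()
results['nodestatus'] = 'node status table'
results['items'] = 'a key that shares its name with a dict method'

print(results.nodestatus)        # the value stored under 'nodestatus'
print(results['items'])          # the value stored under 'items'
print(callable(results.items))   # attribute access found the method instead
```

Bracket syntax always returns the stored value, which is why it is the safer choice in programs.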
The types of the result keys can vary as well. In this case, the ‘About’ key holds a dictionary. The ‘server’ and ‘nodestatus’ keys hold SASDataFrame objects (a subclass of pandas.DataFrame).
In [8]: for key, value in out.items():
...: print(key, type(value))
...:
About <class 'dict'>
server <class 'swat.SASDataFrame'>
nodestatus <class 'swat.SASDataFrame'>
Since the values in the result are standard Python (and pandas) objects, you can work with them as you normally do.
In [9]: out.nodestatus.role
Out[9]:
0 controller
Name: role, dtype: object
In [10]: out.About['Version']
Out[10]: '4.00'
Simple Statistics
We can’t have a getting started section without doing some sort of statistical analysis. First, we need to see what CAS action sets are loaded. We can get a listing of all of the action sets and actions using the help CAS action. If you run help without any arguments, it will display all of the loaded actions and their descriptions. Rather than printing that large listing, we’ll specifically ask for the simple action set since we already know that’s the one we want.
In [11]: conn.help(actionset='simple');
Let’s start with the summary action. Of course, we first need to load some data. The simplest way to load data is to do it from the client side. Note that while this is the simplest way, it’s probably not the best way for large data sets. Those should be loaded from the server side if possible.
The CAS.read_csv() method works just like the pandas.read_csv() function. In fact, CAS.read_csv() uses pandas.read_csv() in the background. When pandas.read_csv() finishes parsing the CSV file into a pandas.DataFrame, CAS.read_csv() uploads that DataFrame to a CAS table. The returned object is a CASTable object.
In [12]: tbl = conn.read_csv('https://raw.githubusercontent.com/'
....: 'sassoftware/sas-viya-programming/master/data/cars.csv')
....:
Note: Cloud Analytic Services made the uploaded file available as table TMPQKCG4MT7 in caslib CASUSER(castest).
Note: The table TMPQKCG4MT7 has been created in caslib CASUSER(castest) from binary data uploaded to Cloud Analytic Services.
CASTable objects are essentially client-side views of the table of data in the CAS server. You can interact with them using CAS actions as well as many of the pandas.DataFrame methods and attributes. The pandas.DataFrame API is mirrored as much as possible. The only difference is that, behind the scenes, the real work is being done by CAS.
If you don’t want the difficult-to-read generated name for a table, you can specify one using the casout= parameter.
In [13]: tbl = conn.read_csv('https://raw.githubusercontent.com/'
....: 'sassoftware/sas-viya-programming/master/data/cars.csv',
....: casout='cars')
....:
Note: Cloud Analytic Services made the uploaded file available as table CARS in caslib CASUSER(castest).
Note: The table CARS has been created in caslib CASUSER(castest) from binary data uploaded to Cloud Analytic Services.
Since we started down this path with the intent to use the summary action, let’s do that first.
In [14]: out = conn.summary(table=tbl)
In [15]: out
Out[15]:
[Summary]
Descriptive Statistics for CARS
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Skewness Kurtosis
0 MSRP 10280.0 192465.0 428.0 0.0 32774.855140 14027638.0 19431.716674 939.267478 3.775916e+08 6.209854e+11 1.612316e+11 59.288490 34.894059 4.160412e-127 2.798099 13.879206
1 Invoice 9875.0 173560.0 428.0 0.0 30014.700935 12846292.0 17642.117750 852.763949 3.112443e+08 5.184789e+11 1.329013e+11 58.778256 35.196963 2.684398e-128 2.834740 13.946164
2 EngineSize 1.3 8.3 428.0 0.0 3.196729 1368.2 1.108595 0.053586 1.228982e+00 4.898540e+03 5.247754e+02 34.679034 59.656105 3.133745e-209 0.708152 0.541944
3 Cylinders 3.0 12.0 426.0 2.0 5.807512 2474.0 1.558443 0.075507 2.428743e+00 1.540000e+04 1.032216e+03 26.834946 76.913766 1.515569e-251 0.592785 0.440378
4 Horsepower 73.0 500.0 428.0 0.0 215.885514 92399.0 71.836032 3.472326 5.160415e+03 2.215110e+07 2.203497e+06 33.275059 62.173176 4.185344e-216 0.930331 1.552159
5 MPG_City 10.0 60.0 428.0 0.0 20.060748 8586.0 5.238218 0.253199 2.743892e+01 1.839580e+05 1.171642e+04 26.111777 79.229235 1.866284e-257 2.782072 15.791147
6 MPG_Highway 12.0 66.0 428.0 0.0 26.843458 11489.0 5.741201 0.277511 3.296139e+01 3.224790e+05 1.407451e+04 21.387709 96.729204 1.665621e-292 1.252395 6.045611
7 Weight 1850.0 7190.0 428.0 0.0 3577.953271 1531364.0 758.983215 36.686838 5.760555e+05 5.725125e+09 2.459757e+08 21.212776 97.526890 5.812547e-294 0.891824 1.688789
8 Wheelbase 89.0 144.0 428.0 0.0 108.154206 46290.0 8.311813 0.401767 6.908624e+01 5.035958e+06 2.949982e+04 7.685150 269.196577 0.000000e+00 0.962287 2.133649
9 Length 143.0 238.0 428.0 0.0 186.362150 79763.0 14.357991 0.694020 2.061519e+02 1.495283e+07 8.802687e+04 7.704349 268.525733 0.000000e+00 0.181977 0.614725
+ Elapsed: 0.0151s, user: 0.013s, sys: 0.003s, mem: 4.54mb
You can also call the summary action directly on the CASTable object, which automatically populates the table= parameter.
In [16]: out = tbl.summary()
In [17]: out
Out[17]:
[Summary]
Descriptive Statistics for CARS
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Skewness Kurtosis
0 MSRP 10280.0 192465.0 428.0 0.0 32774.855140 14027638.0 19431.716674 939.267478 3.775916e+08 6.209854e+11 1.612316e+11 59.288490 34.894059 4.160412e-127 2.798099 13.879206
1 Invoice 9875.0 173560.0 428.0 0.0 30014.700935 12846292.0 17642.117750 852.763949 3.112443e+08 5.184789e+11 1.329013e+11 58.778256 35.196963 2.684398e-128 2.834740 13.946164
2 EngineSize 1.3 8.3 428.0 0.0 3.196729 1368.2 1.108595 0.053586 1.228982e+00 4.898540e+03 5.247754e+02 34.679034 59.656105 3.133745e-209 0.708152 0.541944
3 Cylinders 3.0 12.0 426.0 2.0 5.807512 2474.0 1.558443 0.075507 2.428743e+00 1.540000e+04 1.032216e+03 26.834946 76.913766 1.515569e-251 0.592785 0.440378
4 Horsepower 73.0 500.0 428.0 0.0 215.885514 92399.0 71.836032 3.472326 5.160415e+03 2.215110e+07 2.203497e+06 33.275059 62.173176 4.185344e-216 0.930331 1.552159
5 MPG_City 10.0 60.0 428.0 0.0 20.060748 8586.0 5.238218 0.253199 2.743892e+01 1.839580e+05 1.171642e+04 26.111777 79.229235 1.866284e-257 2.782072 15.791147
6 MPG_Highway 12.0 66.0 428.0 0.0 26.843458 11489.0 5.741201 0.277511 3.296139e+01 3.224790e+05 1.407451e+04 21.387709 96.729204 1.665621e-292 1.252395 6.045611
7 Weight 1850.0 7190.0 428.0 0.0 3577.953271 1531364.0 758.983215 36.686838 5.760555e+05 5.725125e+09 2.459757e+08 21.212776 97.526890 5.812547e-294 0.891824 1.688789
8 Wheelbase 89.0 144.0 428.0 0.0 108.154206 46290.0 8.311813 0.401767 6.908624e+01 5.035958e+06 2.949982e+04 7.685150 269.196577 0.000000e+00 0.962287 2.133649
9 Length 143.0 238.0 428.0 0.0 186.362150 79763.0 14.357991 0.694020 2.061519e+02 1.495283e+07 8.802687e+04 7.704349 268.525733 0.000000e+00 0.181977 0.614725
+ Elapsed: 0.0142s, user: 0.014s, sys: 0.000999s, mem: 4.53mb
Again, the output is a CASResults object (a subclass of a Python dictionary), so we can pull off the keys we want (there is only one in this case). This key contains a SASDataFrame, but since it’s a subclass of pandas.DataFrame, you can do all of the standard DataFrame operations on it.
In [18]: summ = out.Summary
In [19]: summ = summ.set_index('Column')
In [20]: summ.loc['Cylinders', 'Max']
Out[20]: 12.0
Loading CAS Action Sets
While CAS comes with a few pre-loaded action sets, you will likely want to load action sets with other capabilities such as percentiles, Data step, SQL, or even machine learning. Most action sets require a license to run, so make sure the licensing is in place before you try to load them.
The action used to load action sets is called loadactionset.
In [21]: conn.loadactionset('percentile')
Note: Added action set 'percentile'.
Out[21]:
[actionset]
'percentile'
+ Elapsed: 0.000789s, mem: 0.203mb
Once you load an action set, its actions will be automatically added as methods to the CAS connection and any CASTable objects associated with that connection.
In [22]: tbl.percentile()
Out[22]:
[Percentile]
Percentiles for CARS
Variable Pctl Value Converged
0 MSRP 25.0 20329.50 1.0
1 MSRP 50.0 27635.00 1.0
2 MSRP 75.0 39215.00 1.0
3 Invoice 25.0 18851.00 1.0
4 Invoice 50.0 25294.50 1.0
5 Invoice 75.0 35732.50 1.0
6 EngineSize 25.0 2.35 1.0
7 EngineSize 50.0 3.00 1.0
8 EngineSize 75.0 3.90 1.0
9 Cylinders 25.0 4.00 1.0
10 Cylinders 50.0 6.00 1.0
11 Cylinders 75.0 6.00 1.0
12 Horsepower 25.0 165.00 1.0
13 Horsepower 50.0 210.00 1.0
14 Horsepower 75.0 255.00 1.0
15 MPG_City 25.0 17.00 1.0
16 MPG_City 50.0 19.00 1.0
17 MPG_City 75.0 21.50 1.0
18 MPG_Highway 25.0 24.00 1.0
19 MPG_Highway 50.0 26.00 1.0
20 MPG_Highway 75.0 29.00 1.0
21 Weight 25.0 3103.00 1.0
22 Weight 50.0 3474.50 1.0
23 Weight 75.0 3978.50 1.0
24 Wheelbase 25.0 103.00 1.0
25 Wheelbase 50.0 107.00 1.0
26 Wheelbase 75.0 112.00 1.0
27 Length 25.0 178.00 1.0
28 Length 50.0 187.00 1.0
29 Length 75.0 194.00 1.0
+ Elapsed: 0.044s, user: 0.09s, sys: 0.012s, mem: 11.7mb
Note that the percentile action set contains an action that is also called percentile. You can call the action either as tbl.percentile or tbl.percentile.percentile.
CAS Tables as DataFrames
As we mentioned previously, CASTable objects implement many of the pandas.DataFrame methods and properties. This means that you can use the familiar pandas.DataFrame API, but use it on data that is far too large for pandas to handle. Here are a few simple examples.
In [23]: tbl.head()
Out[23]:
Selected Rows from Table CARS
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
In [24]: tbl.describe()
Out[24]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
count 428.000000 428.000000 428.000000 426.000000 428.000000 428.000000 428.000000 428.000000 428.000000 428.000000
mean 32774.855140 30014.700935 3.196729 5.807512 215.885514 20.060748 26.843458 3577.953271 108.154206 186.362150
std 19431.716674 17642.117750 1.108595 1.558443 71.836032 5.238218 5.741201 758.983215 8.311813 14.357991
min 10280.000000 9875.000000 1.300000 3.000000 73.000000 10.000000 12.000000 1850.000000 89.000000 143.000000
25% 20329.500000 18851.000000 2.350000 4.000000 165.000000 17.000000 24.000000 3103.000000 103.000000 178.000000
50% 27635.000000 25294.500000 3.000000 6.000000 210.000000 19.000000 26.000000 3474.500000 107.000000 187.000000
75% 39215.000000 35732.500000 3.900000 6.000000 255.000000 21.500000 29.000000 3978.500000 112.000000 194.000000
max 192465.000000 173560.000000 8.300000 12.000000 500.000000 60.000000 66.000000 7190.000000 144.000000 238.000000
In [25]: tbl[['MSRP', 'Invoice']].describe(percentiles=[0.3, 0.7])
Out[25]:
MSRP Invoice
count 428.000000 428.000000
mean 32774.855140 30014.700935
std 19431.716674 17642.117750
min 10280.000000 9875.000000
30% 22000.000000 20284.000000
50% 27635.000000 25294.500000
70% 35940.000000 32997.000000
max 192465.000000 173560.000000
For more information about CASTable, see the API Reference.
Closing the Connection
When you are finished with the connection, it’s always a good idea to close it.
In [26]: conn.close()
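One way to make sure close() is always called, even when an error occurs, is Python's contextlib.closing, which works with any object that has a close() method. The sketch below uses a stand-in class, FakeConnection, which is hypothetical and exists only so that the example runs without a CAS server. In real code the object would be the connection created by swat.CAS.

```python
from contextlib import closing


class FakeConnection:
    """Hypothetical stand-in for a connection object with a close() method."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


fake_conn = FakeConnection()
try:
    with closing(fake_conn):
        # Work with the connection here; an error is simulated on purpose.
        raise RuntimeError('simulated failure while running an action')
except RuntimeError:
    pass

print(fake_conn.closed)  # close() ran despite the error
```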
Authentication
The SWAT package supports three types of authentication when connecting to the CAS server:
Userid and Password
OAuth token
Kerberos (binary protocol only)
Userid and Password
While it is possible to put your username and password in the CAS constructor, it’s generally not a good idea to have a password in your code. To get around this issue, the CAS class supports authinfo files. An authinfo file stores username and password information for a specified hostname and port. It is protected by file permissions so that only you can read it. This allows you to set and protect your passwords in one place and have them used by all of your programs.
The format of the file is as follows:
host HOST user USERNAME password PASSWORD port PORT
machine is a synonym for host, login and account are synonyms for user, and protocol is a synonym for port.
You can specify as many host lines as you need. The port field is optional. If it is left off, the same password is used for all ports on that host. Hostnames must match the hostname used in the CAS constructor exactly. No DNS expansion of the names is done, so ‘host1’ and ‘host1.my-company.com’ are considered two different hosts.
Here is an example for a user named ‘user01’ and password ‘!s3cret’ on host ‘cas.my-company.com’ and port 12354:
host cas.my-company.com port 12354 user user01 password !s3cret
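To make the format concrete, here is a minimal sketch of a parser for such lines. It is not the parser SWAT uses. The function names parse_authinfo_line and find_password are hypothetical, and details such as comments or quoted values are deliberately ignored. It applies the synonyms and the exact hostname matching described above.

```python
# Field-name synonyms described in the text, mapped to a canonical name.
SYNONYMS = {
    'host': 'host', 'machine': 'host',
    'user': 'user', 'login': 'user', 'account': 'user',
    'password': 'password',
    'port': 'port', 'protocol': 'port',
}


def parse_authinfo_line(line):
    """Parse one 'key value key value ...' line into a dict."""
    tokens = line.split()
    entry = {}
    # Walk the tokens two at a time: a field name followed by its value.
    for key, value in zip(tokens[0::2], tokens[1::2]):
        canonical = SYNONYMS.get(key.lower())
        if canonical is not None:
            entry[canonical] = value
    return entry


def find_password(lines, host, port, user=None):
    """Return the password of the first matching entry, or None."""
    for line in lines:
        entry = parse_authinfo_line(line)
        if entry.get('host') != host:
            continue  # hostnames must match exactly, no DNS expansion
        if 'port' in entry and entry['port'] != str(port):
            continue  # an entry without a port applies to every port
        if user is not None and entry.get('user') != user:
            continue
        return entry.get('password')
    return None


lines = ['host cas.my-company.com port 12354 user user01 password !s3cret']
print(find_password(lines, 'cas.my-company.com', 12354))  # !s3cret
print(find_password(lines, 'cas', 12354))                 # None
```

The second lookup fails because ‘cas’ and ‘cas.my-company.com’ are treated as different hosts.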
By default, SWAT looks for the authinfo file in your home directory under the name .authinfo. You can also use the name .netrc, which is the name of the older specification that authinfo is based on.
The file must be readable and writable by the owner only. You can set these permissions with the following command:
chmod 0600 ~/.authinfo
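The same permissions can be set and checked from Python with the standard os and stat modules. This sketch works on a temporary file rather than your real ~/.authinfo, and it assumes a POSIX system (on Windows, os.chmod does not support the full set of permission bits).

```python
import os
import stat
import tempfile

# Create a throwaway file that stands in for ~/.authinfo.
fd, path = tempfile.mkstemp()
os.close(fd)

# Equivalent of "chmod 0600": read and write for the owner only.
os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)

# Read the permission bits back to confirm the change.
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))

os.remove(path)
```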
If you don’t want to use an authinfo in your home directory, you can specify the name of a file explicitly using the authinfo= parameter.
In [27]: conn = swat.CAS('cas.my-company.com', 12354, authinfo='/path/to/authinfo.txt')
The username can also be specified using one of the following environment variables:
CAS_USER
CAS_USERNAME
CASUSER
CASUSERNAME
The password can be specified using one of the following environment variables:
CAS_TOKEN
CAS_PASSWORD
CASTOKEN
CASPASSWORD
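As a sketch of how a program might read these variables itself, the helper below returns the first one that is set. The function name first_env is hypothetical, and the lookup order (the order in which the variables are listed above) is an assumption. The text does not state which variable takes precedence in SWAT.

```python
import os

USER_VARS = ('CAS_USER', 'CAS_USERNAME', 'CASUSER', 'CASUSERNAME')
PASSWORD_VARS = ('CAS_TOKEN', 'CAS_PASSWORD', 'CASTOKEN', 'CASPASSWORD')


def first_env(names, environ=None):
    """Return the value of the first variable in `names` that is set.

    The precedence order is an assumption made for this example.
    """
    environ = os.environ if environ is None else environ
    for name in names:
        value = environ.get(name)
        if value:
            return value
    return None


# A plain dict is used so the example does not touch the real environment.
example_env = {'CASUSER': 'user01', 'CAS_PASSWORD': '!s3cret'}
print(first_env(USER_VARS, example_env))      # user01
print(first_env(PASSWORD_VARS, example_env))  # !s3cret
```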
Note
Userid and Password authentication will be deprecated in a future release. OAuth authentication should be used instead when possible.
OAuth Token
Authentication to the CAS server can be performed by using an OAuth token. The OAuth token can be specified in the CAS constructor using the password parameter. When specifying the OAuth token in the CAS constructor, do not specify the username parameter.
In [28]: conn = swat.CAS('cas.my-company.com', 12354, password='...')
The OAuth token can also be specified using one of the following environment variables:
CAS_TOKEN
CAS_PASSWORD
CASTOKEN
CASPASSWORD
When using the HTTP protocol, the SWAT package can obtain an OAuth token on your behalf if you specify an authentication code in the CAS constructor using the authcode parameter.
In [29]: conn = swat.CAS('https://my-cas-host.com:443/cas-shared-default-http/',
....: authcode='...')
....:
The authentication code can also be specified using one of the following environment variables:
CAS_AUTHCODE
VIYA_AUTHCODE
CASAUTHCODE
VIYAAUTHCODE
Beginning with release v1.14.0, the SWAT package supports using Proof Key for Code Exchange (PKCE) when using authentication codes to obtain an OAuth token with HTTP. Python 3.6 or later is required for PKCE.
To use PKCE, specify the pkce=True parameter in the CAS constructor. When specifying pkce=True, do not specify the authcode parameter. You will be provided a URL to use to obtain the authentication code and prompted to enter the authentication code obtained from that URL.
In [30]: conn = swat.CAS('https://my-cas-host.com:443/cas-shared-default-http/',
....: pkce=True)
....:
The pkce parameter can also be specified using one of the following environment variables:
CAS_PKCE
VIYA_PKCE
CASPKCE
VIYAPKCE
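For background, the following sketch shows the general PKCE mechanism as defined in RFC 7636 (the S256 method): the client generates a random code verifier and sends only its hashed form, the code challenge, when requesting the authentication code. This illustrates the standard, not SWAT's internal implementation, and the function name make_pkce_pair is hypothetical.

```python
import base64
import hashlib
import secrets


def make_pkce_pair():
    """Return a (code_verifier, code_challenge) pair using the S256 method."""
    # 32 random bytes give a 43-character URL-safe verifier.
    verifier = secrets.token_urlsafe(32)
    digest = hashlib.sha256(verifier.encode('ascii')).digest()
    # Base64url-encode the hash and drop the '=' padding.
    challenge = base64.urlsafe_b64encode(digest).rstrip(b'=').decode('ascii')
    return verifier, challenge


verifier, challenge = make_pkce_pair()
print(len(verifier), len(challenge))
```

With pkce=True you do not need to perform these steps yourself; you only enter the authentication code when prompted.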
Kerberos
The Kerberos Service Principal Name used by Viya 4 is different from the Kerberos Service Principal Name used by Viya 3.5. Releases of the SWAT package starting with 1.8.0 use the Viya 4 Service Principal Name. If you wish to connect to a Viya 3.5 CAS server using SWAT release 1.8.0 or later, you must use the CASSPN environment variable to specify the service principal name in the format recognized by Viya 3.5. This value should start with ‘sascas@’ and be followed by the hostname.
In [31]: import os
In [32]: os.environ["CASSPN"] = "sascas@host"
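A small helper can build this value from the hostname. This is only a sketch: the function name viya35_spn is hypothetical, the hostname is a placeholder, and the code simply follows the ‘sascas@’ plus hostname pattern described above. It is assumed here that the variable should be set before the connection is created.

```python
import os


def viya35_spn(hostname):
    """Build a service principal name of the form 'sascas@<hostname>'."""
    return 'sascas@' + hostname


# 'cas.my-company.com' is a placeholder hostname.
os.environ['CASSPN'] = viya35_spn('cas.my-company.com')
print(os.environ['CASSPN'])  # sascas@cas.my-company.com
```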