Indexing and Data Selection

Indexing of CASTable objects works much in the same way as they do in pandas.DataFrame objects. You can select one or more columns based on column names or indexes, and you can select slices of columns. However, data selection does have some limitations. CAS tables can be distributed across a grid of computers and they do not have a specified order. Because of this, indexing based on a row index is not possible at this time. However, it is possible to apply where clauses to a the table parameters to filter rows based on that.

There are a few properties that allow indexing a CASTable object in various ways. These properties work just like they pandas.DataFrame counterparts (with the limitations described above).

Property / Method

Description

o[columns]

Subset table based on column names

o.loc[:, columns]

Subset table based on column names

o.iloc[:, columns]

Subset table based on column indexes

o.ix[:, columns]

Subset table based on mixed column names and indexes

o.xs(column, axis=1)

Select a cross-section of the table

o[boolean-column]

Filter data rows based on boolean column values

o.query(‘expr’)

Apply a filter to the data values

The Basics

Just as with pandas.DataFrames, CASTable objects implement Python’s __getitem__ method to allow indexing using [ ]. This allows you to subset the columns that are visible in the table.

In [1]: tbl = conn.read_csv('https://raw.githubusercontent.com/'
   ...:                     'sassoftware/sas-viya-programming/master/data/cars.csv')
   ...: 

Note: Cloud Analytic Services made the uploaded file available as table TMP38TJ_DW9 in caslib CASUSER(castest).
Note: The table TMP38TJ_DW9 has been created in caslib CASUSER(castest) from binary data uploaded to Cloud Analytic Services. In [2]: tbl.head() Out[2]: Selected Rows from Table TMP38TJ_DW9 Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length 0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0 1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0 2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0 3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0 4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0

Here we are selecting a single column from the table. This will return a CASColumn object.

In [3]: tbl['Make'].head()
Out[3]: 
0    Acura
1    Acura
2    Acura
3    Acura
4    Acura
Name: Make, dtype: object

Selecting multiple columns returns a new CASTable object.

In [4]: tbl[['Make', 'Model', 'Horsepower']].head()
Out[4]: 
Selected Rows from Table TMP38TJ_DW9

    Make           Model  Horsepower
0  Acura             MDX       265.0
1  Acura  RSX Type S 2dr       200.0
2  Acura         TSX 4dr       200.0
3  Acura          TL 4dr       270.0
4  Acura      3.5 RL 4dr       225.0

You can also access individual columns using attribute syntax.

In [5]: tbl.Make.head()
Out[5]: 
0    Acura
1    Acura
2    Acura
3    Acura
4    Acura
Name: Make, dtype: object

Caution should be used when using attribute syntax because it depends on the fact that there are no existing attributes, methods, or CAS actions with that same name on the CASTable. It also requires that the column name contains a valid Python identifier. Since CAS actions can be added dynamically, attribute access should generally only be used in interactive programming. For programs that will be reused, it is safer to use the [ ] syntax.

Selecting by Name

The loc property is used to select columns based on the column names. Column names can be specified as a string, a list of strings, or a slice. If a string is given, a CASColumn is returned. If a list of strings or a slice is specified, a CASTable is returned.

A single string selects a column. Since row selection is not supported at this time, this is equivalent to tbl.loc['Make'].

In [6]: tbl.loc[:, 'Make'].head()
Out[6]: 
0    Acura
1    Acura
2    Acura
3    Acura
4    Acura
Name: Make, dtype: object

Using a list of strings selects those columns and returns a new CASTable object. Again, this is equivalent to tbl[['Make', 'Model']].

In [7]: tbl.loc[:, ['Make', 'Model']].head()
Out[7]: 
Selected Rows from Table TMP38TJ_DW9

    Make           Model
0  Acura             MDX
1  Acura  RSX Type S 2dr
2  Acura         TSX 4dr
3  Acura          TL 4dr
4  Acura      3.5 RL 4dr

Slicing using column names allows you to select a range of columns.

In [8]: tbl.loc[:, 'Model':'Invoice'].head()
Out[8]: 
Selected Rows from Table TMP38TJ_DW9

            Model   Type Origin DriveTrain     MSRP  Invoice
0             MDX    SUV   Asia        All  36945.0  33337.0
1  RSX Type S 2dr  Sedan   Asia      Front  23820.0  21761.0
2         TSX 4dr  Sedan   Asia      Front  26990.0  24647.0
3          TL 4dr  Sedan   Asia      Front  33195.0  30299.0
4      3.5 RL 4dr  Sedan   Asia      Front  43755.0  39014.0

You can even specify a step size.

In [9]: tbl.loc[:, 'Model':'Invoice':2].head()
Out[9]: 
Selected Rows from Table TMP38TJ_DW9

            Model Origin     MSRP
0             MDX   Asia  36945.0
1  RSX Type S 2dr   Asia  23820.0
2         TSX 4dr   Asia  26990.0
3          TL 4dr   Asia  33195.0
4      3.5 RL 4dr   Asia  43755.0

Note that when using columns names in slices, both endpoints are included in the slice. This is not the same behavior for numeric indexes, but is consistent with the way that slicing works in pandas.DataFrame objects.

Selecting by Position

The iloc property is used to select columns based on column indices. Just like with loc, the column indices can be specified as a single integer, a list of integers, or a slice.

In [10]: tbl.iloc[:, 1].head()
Out[10]: 
0               MDX
1    RSX Type S 2dr
2           TSX 4dr
3            TL 4dr
4        3.5 RL 4dr
Name: Model, dtype: object

Using a list of integers returns a new CASTable object.

In [11]: tbl.iloc[:, [1, 5, 3]].head()
Out[11]: 
Selected Rows from Table TMP38TJ_DW9

            Model     MSRP Origin
0             MDX  36945.0   Asia
1  RSX Type S 2dr  23820.0   Asia
2         TSX 4dr  26990.0   Asia
3          TL 4dr  33195.0   Asia
4      3.5 RL 4dr  43755.0   Asia

Of course, ranges work here as well, with or without a step size.

In [12]: tbl.iloc[:, 2:6].head()
Out[12]: 
Selected Rows from Table TMP38TJ_DW9

    Type Origin DriveTrain     MSRP
0    SUV   Asia        All  36945.0
1  Sedan   Asia      Front  23820.0
2  Sedan   Asia      Front  26990.0
3  Sedan   Asia      Front  33195.0
4  Sedan   Asia      Front  43755.0
In [13]: tbl.iloc[:, 6:2:-2].head()
Out[13]: 
Selected Rows from Table TMP38TJ_DW9

    Make           Model   Type Origin DriveTrain     MSRP  Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0  Acura             MDX    SUV   Asia        All  36945.0  33337.0         3.5        6.0       265.0      17.0         23.0  4451.0      106.0   189.0
1  Acura  RSX Type S 2dr  Sedan   Asia      Front  23820.0  21761.0         2.0        4.0       200.0      24.0         31.0  2778.0      101.0   172.0
2  Acura         TSX 4dr  Sedan   Asia      Front  26990.0  24647.0         2.4        4.0       200.0      22.0         29.0  3230.0      105.0   183.0
3  Acura          TL 4dr  Sedan   Asia      Front  33195.0  30299.0         3.2        6.0       270.0      20.0         28.0  3575.0      108.0   186.0
4  Acura      3.5 RL 4dr  Sedan   Asia      Front  43755.0  39014.0         3.5        6.0       225.0      18.0         24.0  3880.0      115.0   197.0

Mixing Names and Position

The ix property works just like the loc and iloc properties except that it takes a mix of column names and indexes.

In [14]: tbl.ix[:, 'Model'].head()
Out[14]: 
0               MDX
1    RSX Type S 2dr
2           TSX 4dr
3            TL 4dr
4        3.5 RL 4dr
Name: Model, dtype: object

In [15]: tbl.ix[:, 3].head()
Out[15]: 
0    Asia
1    Asia
2    Asia
3    Asia
4    Asia
Name: Origin, dtype: object
In [16]: tbl.ix[:, ['Model', 4, 3]].head()
Out[16]: 
Selected Rows from Table TMP38TJ_DW9

            Model DriveTrain Origin
0             MDX        All   Asia
1  RSX Type S 2dr      Front   Asia
2         TSX 4dr      Front   Asia
3          TL 4dr      Front   Asia
4      3.5 RL 4dr      Front   Asia
In [17]: tbl.ix[:, 'Model':6:2].head()
Out[17]: 
Selected Rows from Table TMP38TJ_DW9

            Model Origin     MSRP
0             MDX   Asia  36945.0
1  RSX Type S 2dr   Asia  23820.0
2         TSX 4dr   Asia  26990.0
3          TL 4dr   Asia  33195.0
4      3.5 RL 4dr   Asia  43755.0

Selecting a Cross Section

The xs method currently only supports column selection (i.e., axis=1). It is primarily here for future development.

In [18]: tbl.xs('Model', axis=1).head()
Out[18]: 
0               MDX
1    RSX Type S 2dr
2           TSX 4dr
3            TL 4dr
4        3.5 RL 4dr
Name: Model, dtype: object

Boolean Indexing

It is possible to use a CASColumn as a way to select rows in a CAS table. The CASColumn should contain values that are valid booleans to CAS (typically integer values where 0 is false and non-zero is true).

Here is a basic example that selects all cars with an MSRP value over 80,000.

In [19]: tbl[tbl.MSRP > 80000].head()
Out[19]: 
Selected Rows from Table TMP38TJ_DW9

     Make                         Model    Type  Origin DriveTrain     MSRP  Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0   Acura        NSX coupe 2dr manual S  Sports    Asia       Rear  89765.0  79978.0         3.2        6.0       290.0      17.0         24.0  3153.0      100.0   174.0
1    Audi                      RS 6 4dr  Sports  Europe      Front  84600.0  76417.0         4.2        8.0       450.0      15.0         22.0  4024.0      109.0   191.0
2   Dodge  Viper SRT-10 convertible 2dr  Sports     USA       Rear  81795.0  74451.0         8.3       10.0       500.0      12.0         20.0  3410.0       99.0   176.0
3  Jaguar                 XKR coupe 2dr  Sports  Europe       Rear  81995.0  74676.0         4.2        8.0       390.0      16.0         23.0  3865.0      102.0   187.0
4  Jaguar           XKR convertible 2dr  Sports  Europe       Rear  86995.0  79226.0         4.2        8.0       390.0      16.0         23.0  4042.0      102.0   187.0

Conditions can be combined with | for or, & for and, and ~ for not. However, due to the order of precedence in Python, you must put your comparisons operations in parentheses before combining them with these operators.

In [20]: tbl[(tbl.MSRP > 80000) & (tbl.Horsepower > 400)].head()
Out[20]: 
Selected Rows from Table TMP38TJ_DW9

            Make                         Model    Type  Origin DriveTrain      MSRP   Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0           Audi                      RS 6 4dr  Sports  Europe      Front   84600.0   76417.0         4.2        8.0       450.0      15.0         22.0  4024.0      109.0   191.0
1          Dodge  Viper SRT-10 convertible 2dr  Sports     USA       Rear   81795.0   74451.0         8.3       10.0       500.0      12.0         20.0  3410.0       99.0   176.0
2  Mercedes-Benz                     CL600 2dr   Sedan  Europe       Rear  128420.0  119600.0         5.5       12.0       493.0      13.0         19.0  4473.0      114.0   196.0
3  Mercedes-Benz                  SL55 AMG 2dr  Sports  Europe       Rear  121770.0  113388.0         5.5        8.0       493.0      14.0         21.0  4235.0      101.0   179.0
4  Mercedes-Benz         SL600 convertible 2dr  Sports  Europe       Rear  126670.0  117854.0         5.5       12.0       493.0      13.0         19.0  4429.0      101.0   179.0

Since each mask of a CASTable object returns a new CASTable object, you can split operations across multiple steps.

In [21]: expensive = tbl[tbl.MSRP > 80000]

In [22]: expensive[expensive.Horsepower > 400].head()
Out[22]: 
Selected Rows from Table TMP38TJ_DW9

            Make                         Model    Type  Origin DriveTrain      MSRP   Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0           Audi                      RS 6 4dr  Sports  Europe      Front   84600.0   76417.0         4.2        8.0       450.0      15.0         22.0  4024.0      109.0   191.0
1          Dodge  Viper SRT-10 convertible 2dr  Sports     USA       Rear   81795.0   74451.0         8.3       10.0       500.0      12.0         20.0  3410.0       99.0   176.0
2  Mercedes-Benz                     CL600 2dr   Sedan  Europe       Rear  128420.0  119600.0         5.5       12.0       493.0      13.0         19.0  4473.0      114.0   196.0
3  Mercedes-Benz                  SL55 AMG 2dr  Sports  Europe       Rear  121770.0  113388.0         5.5        8.0       493.0      14.0         21.0  4235.0      101.0   179.0
4  Mercedes-Benz         SL600 convertible 2dr  Sports  Europe       Rear  126670.0  117854.0         5.5       12.0       493.0      13.0         19.0  4429.0      101.0   179.0

Warning

You can only use columns from within the same CAS table in boolean operations. If you want to combine operations across tables, you should create a view that contains all of the data, then use the filtering features outlined above on that view.

The query Method

Rather than using the boolean data selection described above, you can write a CAS where expression and apply it to a CASTable object directly using the CASTable.query() method. This can often result in more readable code when using longer expressions.

In [23]: tbl.query('MSRP > 80000 and Horsepower > 400').head()
Out[23]: 
Selected Rows from Table TMP38TJ_DW9

            Make                         Model    Type  Origin DriveTrain      MSRP   Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0           Audi                      RS 6 4dr  Sports  Europe      Front   84600.0   76417.0         4.2        8.0       450.0      15.0         22.0  4024.0      109.0   191.0
1          Dodge  Viper SRT-10 convertible 2dr  Sports     USA       Rear   81795.0   74451.0         8.3       10.0       500.0      12.0         20.0  3410.0       99.0   176.0
2  Mercedes-Benz                     CL600 2dr   Sedan  Europe       Rear  128420.0  119600.0         5.5       12.0       493.0      13.0         19.0  4473.0      114.0   196.0
3  Mercedes-Benz                  SL55 AMG 2dr  Sports  Europe       Rear  121770.0  113388.0         5.5        8.0       493.0      14.0         21.0  4235.0      101.0   179.0
4  Mercedes-Benz         SL600 convertible 2dr  Sports  Europe       Rear  126670.0  117854.0         5.5       12.0       493.0      13.0         19.0  4429.0      101.0   179.0

Of course, queries can be combined across multiple steps as well.

In [24]: expensive = tbl.query('MSRP > 80000')

In [25]: expensive.query('Horsepower > 400').head()
Out[25]: 
Selected Rows from Table TMP38TJ_DW9

            Make                         Model    Type  Origin DriveTrain      MSRP   Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0           Audi                      RS 6 4dr  Sports  Europe      Front   84600.0   76417.0         4.2        8.0       450.0      15.0         22.0  4024.0      109.0   191.0
1          Dodge  Viper SRT-10 convertible 2dr  Sports     USA       Rear   81795.0   74451.0         8.3       10.0       500.0      12.0         20.0  3410.0       99.0   176.0
2  Mercedes-Benz                     CL600 2dr   Sedan  Europe       Rear  128420.0  119600.0         5.5       12.0       493.0      13.0         19.0  4473.0      114.0   196.0
3  Mercedes-Benz                  SL55 AMG 2dr  Sports  Europe       Rear  121770.0  113388.0         5.5        8.0       493.0      14.0         21.0  4235.0      101.0   179.0
4  Mercedes-Benz         SL600 convertible 2dr  Sports  Europe       Rear  126670.0  117854.0         5.5       12.0       493.0      13.0         19.0  4429.0      101.0   179.0