Indexing and Data Selection
Indexing of CASTable objects works in much the same way as it does for pandas.DataFrame objects. You can select one or more columns based on column names or indexes, and you can select slices of columns. However, data selection does have some limitations. CAS tables can be distributed across a grid of computers, and they do not have a specified order. Because of this, indexing based on a row index is not possible at this time. However, it is possible to apply where clauses to the table parameters to filter rows.
There are a few properties and methods that allow indexing a CASTable object in various ways. They work just like their pandas.DataFrame counterparts (with the limitations described above).
Property / Method | Description |
---|---|
o[columns] | Subset table based on column names |
o.loc[:, columns] | Subset table based on column names |
o.iloc[:, columns] | Subset table based on column indexes |
o.ix[:, columns] | Subset table based on mixed column names and indexes |
o.xs(column, axis=1) | Select a cross-section of the table |
o[boolean-column] | Filter data rows based on boolean column values |
o.query('expr') | Apply a filter to the data values |
The Basics
Just as with pandas.DataFrames, CASTable objects implement Python’s __getitem__ method to allow indexing using [ ]. This allows you to subset the columns that are visible in the table.
In [1]: tbl = conn.read_csv('https://raw.githubusercontent.com/'
   ...:                     'sassoftware/sas-viya-programming/master/data/cars.csv')
   ...:
Note: Cloud Analytic Services made the uploaded file available as table TMP38TJ_DW9 in caslib CASUSER(castest).
Note: The table TMP38TJ_DW9 has been created in caslib CASUSER(castest) from binary data uploaded to Cloud Analytic Services.
In [2]: tbl.head()
Out[2]:
Selected Rows from Table TMP38TJ_DW9
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
Here we are selecting a single column from the table. This will return a CASColumn object.
In [3]: tbl['Make'].head()
Out[3]:
0 Acura
1 Acura
2 Acura
3 Acura
4 Acura
Name: Make, dtype: object
Selecting multiple columns returns a new CASTable object.
In [4]: tbl[['Make', 'Model', 'Horsepower']].head()
Out[4]:
Selected Rows from Table TMP38TJ_DW9
Make Model Horsepower
0 Acura MDX 265.0
1 Acura RSX Type S 2dr 200.0
2 Acura TSX 4dr 200.0
3 Acura TL 4dr 270.0
4 Acura 3.5 RL 4dr 225.0
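The way the key type decides the return type can be pictured with a small plain-Python sketch. The `ToyTable` and `ToyColumn` classes below are hypothetical stand-ins and are not part of SWAT; they only mirror the rule that a string key yields a single column while a list of names yields a new table.

```python
class ToyColumn:
    """Hypothetical stand-in for a single column (not a SWAT class)."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)


class ToyTable:
    """Hypothetical stand-in for a table made of named columns."""

    def __init__(self, data):
        self.data = dict(data)  # column name -> list of values

    def __getitem__(self, key):
        # A single string selects one column object.
        if isinstance(key, str):
            return ToyColumn(key, self.data[key])
        # A list of names selects a new table holding only those
        # columns, in the order they were requested.
        return ToyTable({name: self.data[name] for name in key})


cars = ToyTable({
    'Make': ['Acura', 'Acura'],
    'Model': ['MDX', 'RSX Type S 2dr'],
    'Horsepower': [265.0, 200.0],
})

make = cars['Make']                      # a ToyColumn
subset = cars[['Model', 'Horsepower']]   # a ToyTable
print(type(make).__name__, type(subset).__name__, list(subset.data))
```

Note that a one-element list such as `cars[['Make']]` still produces a table in this sketch, which is also how pandas treats a list key.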
You can also access individual columns using attribute syntax.
In [5]: tbl.Make.head()
Out[5]:
0 Acura
1 Acura
2 Acura
3 Acura
4 Acura
Name: Make, dtype: object
Caution should be used when using attribute syntax because it depends on the fact that there are no existing attributes, methods, or CAS actions with that same name on the CASTable. It also requires that the column name contains a valid Python identifier. Since CAS actions can be added dynamically, attribute access should generally only be used in interactive programming. For programs that will be reused, it is safer to use the [ ] syntax.
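The name collision described above is easy to reproduce in plain Python. The sketch below uses a hypothetical `ToyTable` class (an illustration only, not how CASTable is implemented) whose columns are reachable as attributes only when no real attribute or method has the same name.

```python
class ToyTable:
    """Hypothetical table whose columns are also reachable as attributes."""

    def __init__(self, data):
        self._data = dict(data)

    def head(self):
        # A real method; it shadows any column that is also named 'head'.
        return 'method result'

    def __getitem__(self, name):
        return self._data[name]

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails, so
        # existing methods and attributes always win over column names.
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name)


tbl = ToyTable({'Make': ['Acura'], 'head': ['column value']})

print(tbl.Make)      # attribute syntax reaches the column
print(tbl['head'])   # bracket syntax reaches the column named 'head'
print(tbl.head())    # attribute syntax reaches the method instead
```

Bracket syntax never goes through attribute lookup, which is why it stays unambiguous even if new methods appear later.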
Selecting by Name
The loc property is used to select columns based on the column names. Column names can be specified as a string, a list of strings, or a slice. If a string is given, a CASColumn is returned. If a list of strings or a slice is specified, a CASTable is returned.
A single string selects a column. Since row selection is not supported at this time, this is equivalent to tbl['Make'].
In [6]: tbl.loc[:, 'Make'].head()
Out[6]:
0 Acura
1 Acura
2 Acura
3 Acura
4 Acura
Name: Make, dtype: object
Using a list of strings selects those columns and returns a new CASTable object. Again, this is equivalent to tbl[['Make', 'Model']].
In [7]: tbl.loc[:, ['Make', 'Model']].head()
Out[7]:
Selected Rows from Table TMP38TJ_DW9
Make Model
0 Acura MDX
1 Acura RSX Type S 2dr
2 Acura TSX 4dr
3 Acura TL 4dr
4 Acura 3.5 RL 4dr
Slicing using column names allows you to select a range of columns.
In [8]: tbl.loc[:, 'Model':'Invoice'].head()
Out[8]:
Selected Rows from Table TMP38TJ_DW9
Model Type Origin DriveTrain MSRP Invoice
0 MDX SUV Asia All 36945.0 33337.0
1 RSX Type S 2dr Sedan Asia Front 23820.0 21761.0
2 TSX 4dr Sedan Asia Front 26990.0 24647.0
3 TL 4dr Sedan Asia Front 33195.0 30299.0
4 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0
You can even specify a step size.
In [9]: tbl.loc[:, 'Model':'Invoice':2].head()
Out[9]:
Selected Rows from Table TMP38TJ_DW9
Model Origin MSRP
0 MDX Asia 36945.0
1 RSX Type S 2dr Asia 23820.0
2 TSX 4dr Asia 26990.0
3 TL 4dr Asia 33195.0
4 3.5 RL 4dr Asia 43755.0
Note that when using column names in slices, both endpoints are included in the slice. This differs from slicing with numeric indexes, where the end point is excluded, but it is consistent with the way that slicing works in pandas.DataFrame objects.
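The difference between inclusive name slices and exclusive positional slices can be sketched with an ordinary Python list of the column names from the cars table. The `select_by_name` helper is hypothetical and exists only for this illustration; it does not use SWAT.

```python
COLUMNS = ['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice',
           'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway',
           'Weight', 'Wheelbase', 'Length']


def select_by_name(columns, start, stop, step=1):
    """Return the columns from `start` through `stop`, both included."""
    first = columns.index(start)
    last = columns.index(stop)
    # Add 1 so that the named end point is part of the result.
    return columns[first:last + 1:step]


by_name = select_by_name(COLUMNS, 'Model', 'Invoice')
by_name_step = select_by_name(COLUMNS, 'Model', 'Invoice', 2)

# 'Model' and 'Invoice' sit at positions 1 and 6, but a positional
# slice stops before its end point, so 'Invoice' is left out.
by_position = COLUMNS[1:6]

print(by_name)
print(by_name_step)
print(by_position)
```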
Selecting by Position
The iloc property is used to select columns based on column indices. Just like with loc, the column indices can be specified as a single integer, a list of integers, or a slice.
In [10]: tbl.iloc[:, 1].head()
Out[10]:
0 MDX
1 RSX Type S 2dr
2 TSX 4dr
3 TL 4dr
4 3.5 RL 4dr
Name: Model, dtype: object
Using a list of integers returns a new CASTable object.
In [11]: tbl.iloc[:, [1, 5, 3]].head()
Out[11]:
Selected Rows from Table TMP38TJ_DW9
Model MSRP Origin
0 MDX 36945.0 Asia
1 RSX Type S 2dr 23820.0 Asia
2 TSX 4dr 26990.0 Asia
3 TL 4dr 33195.0 Asia
4 3.5 RL 4dr 43755.0 Asia
Of course, ranges work here as well, with or without a step size.
In [12]: tbl.iloc[:, 2:6].head()
Out[12]:
Selected Rows from Table TMP38TJ_DW9
Type Origin DriveTrain MSRP
0 SUV Asia All 36945.0
1 Sedan Asia Front 23820.0
2 Sedan Asia Front 26990.0
3 Sedan Asia Front 33195.0
4 Sedan Asia Front 43755.0
In [13]: tbl.iloc[:, 6:2:-2].head()
Out[13]:
Selected Rows from Table TMP38TJ_DW9
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
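As a rough mental model, positional selection behaves like indexing a Python list of column names. The hypothetical `select_by_position` helper below is plain Python and does not use SWAT. It shows that an integer picks one column, a list picks columns in the order given, and a slice excludes its end point.

```python
COLUMNS = ['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice',
           'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway',
           'Weight', 'Wheelbase', 'Length']


def select_by_position(columns, key):
    """Select by integer position, slice, or list of positions."""
    if isinstance(key, (int, slice)):
        # An int yields one name; a slice yields a list of names.
        return columns[key]
    # A list of positions keeps the order in which they were requested.
    return [columns[i] for i in key]


single = select_by_position(COLUMNS, 1)
picked = select_by_position(COLUMNS, [1, 5, 3])
ranged = select_by_position(COLUMNS, slice(2, 6))

print(single)
print(picked)
print(ranged)
```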
Mixing Names and Position
The ix property works just like the loc and iloc properties except that it takes a mix of column names and indexes.
In [14]: tbl.ix[:, 'Model'].head()
Out[14]:
0 MDX
1 RSX Type S 2dr
2 TSX 4dr
3 TL 4dr
4 3.5 RL 4dr
Name: Model, dtype: object
In [15]: tbl.ix[:, 3].head()
Out[15]:
0 Asia
1 Asia
2 Asia
3 Asia
4 Asia
Name: Origin, dtype: object
In [16]: tbl.ix[:, ['Model', 4, 3]].head()
Out[16]:
Selected Rows from Table TMP38TJ_DW9
Model DriveTrain Origin
0 MDX All Asia
1 RSX Type S 2dr Front Asia
2 TSX 4dr Front Asia
3 TL 4dr Front Asia
4 3.5 RL 4dr Front Asia
In [17]: tbl.ix[:, 'Model':6:2].head()
Out[17]:
Selected Rows from Table TMP38TJ_DW9
Model Origin MSRP
0 MDX Asia 36945.0
1 RSX Type S 2dr Asia 23820.0
2 TSX 4dr Asia 26990.0
3 TL 4dr Asia 33195.0
4 3.5 RL 4dr Asia 43755.0
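One way to think about mixed selection is that every key is first resolved to a position: names are looked up and integers are used as they are. The `resolve` and `select_mixed` helpers below are hypothetical plain-Python illustrations of that idea, not SWAT functions.

```python
COLUMNS = ['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice']


def resolve(columns, key):
    """Turn a column name or an integer position into a position."""
    return columns.index(key) if isinstance(key, str) else key


def select_mixed(columns, keys):
    """Select columns from a list that mixes names and positions."""
    return [columns[resolve(columns, key)] for key in keys]


mixed = select_mixed(COLUMNS, ['Model', 4, 3])
print(mixed)
```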
Selecting a Cross Section
The xs method currently only supports column selection (i.e., axis=1). It is primarily here for future development.
In [18]: tbl.xs('Model', axis=1).head()
Out[18]:
0 MDX
1 RSX Type S 2dr
2 TSX 4dr
3 TL 4dr
4 3.5 RL 4dr
Name: Model, dtype: object
Boolean Indexing
It is possible to use a CASColumn as a way to select rows in a CAS table. The CASColumn should contain values that are valid booleans to CAS (typically integer values where 0 is false and non-zero is true).
Here is a basic example that selects all cars with an MSRP value over 80,000.
In [19]: tbl[tbl.MSRP > 80000].head()
Out[19]:
Selected Rows from Table TMP38TJ_DW9
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura NSX coupe 2dr manual S Sports Asia Rear 89765.0 79978.0 3.2 6.0 290.0 17.0 24.0 3153.0 100.0 174.0
1 Audi RS 6 4dr Sports Europe Front 84600.0 76417.0 4.2 8.0 450.0 15.0 22.0 4024.0 109.0 191.0
2 Dodge Viper SRT-10 convertible 2dr Sports USA Rear 81795.0 74451.0 8.3 10.0 500.0 12.0 20.0 3410.0 99.0 176.0
3 Jaguar XKR coupe 2dr Sports Europe Rear 81995.0 74676.0 4.2 8.0 390.0 16.0 23.0 3865.0 102.0 187.0
4 Jaguar XKR convertible 2dr Sports Europe Rear 86995.0 79226.0 4.2 8.0 390.0 16.0 23.0 4042.0 102.0 187.0
Conditions can be combined with | for or, & for and, and ~ for not. However, because these operators bind more tightly than comparison operators in Python, you must put your comparison operations in parentheses before combining them with these operators.
In [20]: tbl[(tbl.MSRP > 80000) & (tbl.Horsepower > 400)].head()
Out[20]:
Selected Rows from Table TMP38TJ_DW9
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Audi RS 6 4dr Sports Europe Front 84600.0 76417.0 4.2 8.0 450.0 15.0 22.0 4024.0 109.0 191.0
1 Dodge Viper SRT-10 convertible 2dr Sports USA Rear 81795.0 74451.0 8.3 10.0 500.0 12.0 20.0 3410.0 99.0 176.0
2 Mercedes-Benz CL600 2dr Sedan Europe Rear 128420.0 119600.0 5.5 12.0 493.0 13.0 19.0 4473.0 114.0 196.0
3 Mercedes-Benz SL55 AMG 2dr Sports Europe Rear 121770.0 113388.0 5.5 8.0 493.0 14.0 21.0 4235.0 101.0 179.0
4 Mercedes-Benz SL600 convertible 2dr Sports Europe Rear 126670.0 117854.0 5.5 12.0 493.0 13.0 19.0 4429.0 101.0 179.0
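The need for parentheses comes from Python's grammar, not from CAS. The sketch below uses only the standard `ast` module to show how Python groups the two spellings. The column names appear as bare identifiers purely for parsing; nothing is evaluated.

```python
import ast


def top_level(expr):
    """Return the class name of the outermost operation in `expr`."""
    return type(ast.parse(expr, mode='eval').body).__name__


unparenthesized = 'MSRP > 80000 & Horsepower > 400'
parenthesized = '(MSRP > 80000) & (Horsepower > 400)'

# Without parentheses, & binds first, so the whole expression is the
# chained comparison MSRP > (80000 & Horsepower) > 400.
print(top_level(unparenthesized))   # Compare
middle = ast.parse(unparenthesized, mode='eval').body.comparators[0]
print(type(middle).__name__, type(middle.op).__name__)   # BinOp BitAnd

# With parentheses, & combines the results of the two comparisons.
print(top_level(parenthesized))     # BinOp
```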
Since each mask of a CASTable object returns a new CASTable object, you can split operations across multiple steps.
In [21]: expensive = tbl[tbl.MSRP > 80000]
In [22]: expensive[expensive.Horsepower > 400].head()
Out[22]:
Selected Rows from Table TMP38TJ_DW9
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Audi RS 6 4dr Sports Europe Front 84600.0 76417.0 4.2 8.0 450.0 15.0 22.0 4024.0 109.0 191.0
1 Dodge Viper SRT-10 convertible 2dr Sports USA Rear 81795.0 74451.0 8.3 10.0 500.0 12.0 20.0 3410.0 99.0 176.0
2 Mercedes-Benz CL600 2dr Sedan Europe Rear 128420.0 119600.0 5.5 12.0 493.0 13.0 19.0 4473.0 114.0 196.0
3 Mercedes-Benz SL55 AMG 2dr Sports Europe Rear 121770.0 113388.0 5.5 8.0 493.0 14.0 21.0 4235.0 101.0 179.0
4 Mercedes-Benz SL600 convertible 2dr Sports Europe Rear 126670.0 117854.0 5.5 12.0 493.0 13.0 19.0 4429.0 101.0 179.0
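Filtering in two steps selects the same rows as one combined condition, because each step only narrows the rows that the previous step kept. The plain-Python sketch below illustrates this with a handful of rows copied from the outputs above. It uses a list of dictionaries instead of a CAS table, so it shows the logic only, not the SWAT mechanics.

```python
rows = [
    {'Model': 'MDX', 'MSRP': 36945.0, 'Horsepower': 265.0},
    {'Model': 'NSX coupe 2dr manual S', 'MSRP': 89765.0, 'Horsepower': 290.0},
    {'Model': 'RS 6 4dr', 'MSRP': 84600.0, 'Horsepower': 450.0},
    {'Model': 'Viper SRT-10 convertible 2dr', 'MSRP': 81795.0,
     'Horsepower': 500.0},
    {'Model': 'XKR coupe 2dr', 'MSRP': 81995.0, 'Horsepower': 390.0},
]

# One step: both conditions at once.
combined = [r for r in rows if r['MSRP'] > 80000 and r['Horsepower'] > 400]

# Two steps: filter on price first, then on horsepower.
expensive = [r for r in rows if r['MSRP'] > 80000]
stepwise = [r for r in expensive if r['Horsepower'] > 400]

print([r['Model'] for r in stepwise])
```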
Warning
You can only use columns from within the same CAS table in boolean operations. If you want to combine operations across tables, you should create a view that contains all of the data, then use the filtering features outlined above on that view.
The query Method
Rather than using the boolean data selection described above, you can write a CAS where expression and apply it to a CASTable object directly using the CASTable.query() method. This can often result in more readable code when using longer expressions.
In [23]: tbl.query('MSRP > 80000 and Horsepower > 400').head()
Out[23]:
Selected Rows from Table TMP38TJ_DW9
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Audi RS 6 4dr Sports Europe Front 84600.0 76417.0 4.2 8.0 450.0 15.0 22.0 4024.0 109.0 191.0
1 Dodge Viper SRT-10 convertible 2dr Sports USA Rear 81795.0 74451.0 8.3 10.0 500.0 12.0 20.0 3410.0 99.0 176.0
2 Mercedes-Benz CL600 2dr Sedan Europe Rear 128420.0 119600.0 5.5 12.0 493.0 13.0 19.0 4473.0 114.0 196.0
3 Mercedes-Benz SL55 AMG 2dr Sports Europe Rear 121770.0 113388.0 5.5 8.0 493.0 14.0 21.0 4235.0 101.0 179.0
4 Mercedes-Benz SL600 convertible 2dr Sports Europe Rear 126670.0 117854.0 5.5 12.0 493.0 13.0 19.0 4429.0 101.0 179.0
Of course, queries can be combined across multiple steps as well.
In [24]: expensive = tbl.query('MSRP > 80000')
In [25]: expensive.query('Horsepower > 400').head()
Out[25]:
Selected Rows from Table TMP38TJ_DW9
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Audi RS 6 4dr Sports Europe Front 84600.0 76417.0 4.2 8.0 450.0 15.0 22.0 4024.0 109.0 191.0
1 Dodge Viper SRT-10 convertible 2dr Sports USA Rear 81795.0 74451.0 8.3 10.0 500.0 12.0 20.0 3410.0 99.0 176.0
2 Mercedes-Benz CL600 2dr Sedan Europe Rear 128420.0 119600.0 5.5 12.0 493.0 13.0 19.0 4473.0 114.0 196.0
3 Mercedes-Benz SL55 AMG 2dr Sports Europe Rear 121770.0 113388.0 5.5 8.0 493.0 14.0 21.0 4235.0 101.0 179.0
4 Mercedes-Benz SL600 convertible 2dr Sports Europe Rear 126670.0 117854.0 5.5 12.0 493.0 13.0 19.0 4429.0 101.0 179.0
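Chained queries can be thought of as building up a single where expression. The `ToyQueryTable` class below is a hypothetical plain-Python sketch of that idea. The way it joins expressions with `and` is an assumption made for illustration, and the exact expression that SWAT sends to CAS may differ.

```python
class ToyQueryTable:
    """Hypothetical table reference that carries a where expression."""

    def __init__(self, name, where=''):
        self.name = name
        self.where = where

    def query(self, expr):
        # Combine the new expression with any existing filter.
        # Parentheses keep each part intact if it contains 'or'.
        if self.where:
            combined = '(%s) and (%s)' % (self.where, expr)
        else:
            combined = expr
        # Return a new object so that the original stays unfiltered.
        return ToyQueryTable(self.name, combined)


tbl = ToyQueryTable('cars')
expensive = tbl.query('MSRP > 80000')
powerful = expensive.query('Horsepower > 400')

print(repr(tbl.where))
print(expensive.where)
print(powerful.where)
```

Returning a new object from each call is what lets `tbl` and `expensive` keep their own filters.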