Handling By Groups¶
If By groups are specified when running a CAS action, the result are returned with the following behaviors.
A result key named ‘ByGroupInfo’ is returned with all of the By group variable values.
Each By group table is returned in a separate result key with a prefix of ‘ByGroup#.’.
These behaviors can help when you have a large number of By groups and you want to process them as they arrive at the client rather than trying to hold the entire set of results in memory. However, when your result sets are smaller, you may want to combine all of the By group tables into a single pandas.DataFrame. To help in these situations, the CASResults class defines some helper methods for you.
Here is what it looks like to run a standard summary action, and a summary action with a By group specified. We will use this output to demonstrate the By group processing methods.
In [1]: tbl = tbl[['MSRP', 'Horsepower']] In [2]: tbl.summary(subset=['Min', 'Max']) Out[2]: [Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max 0 MSRP 10280.0 192465.0 1 Horsepower 73.0 500.0 + Elapsed: 0.0135s, user: 0.011s, sys: 0.002s, mem: 4.53mb In [3]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max']) Out[3]: [ByGroupInfo] ByGroupInfo Origin Origin_f Cylinders Cylinders_f _key_ 0 Asia Asia NaN . Asia . 1 Asia Asia 3.0 3 Asia 3 2 Asia Asia 4.0 4 Asia 4 3 Asia Asia 6.0 6 Asia 6 4 Asia Asia 8.0 8 Asia 8 5 Europe Europe 4.0 4 Europe 4 6 Europe Europe 5.0 5 Europe 5 7 Europe Europe 6.0 6 Europe 6 8 Europe Europe 8.0 8 Europe 8 9 Europe Europe 12.0 12 Europe 12 10 USA USA 4.0 4 USA 4 11 USA USA 6.0 6 USA 6 12 USA USA 8.0 8 USA 8 13 USA USA 10.0 10 USA 10 [ByGroup1.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Asia . MSRP 25700.0 27200.0 . Horsepower 197.0 238.0 [ByGroup2.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Asia 3 MSRP 19110.0 19110.0 3 Horsepower 73.0 73.0 [ByGroup3.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Asia 4 MSRP 10280.0 33260.0 4 Horsepower 93.0 300.0 [ByGroup4.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Asia 6 MSRP 16495.0 89765.0 6 Horsepower 155.0 290.0 [ByGroup5.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Asia 8 MSRP 26650.0 64800.0 8 Horsepower 235.0 340.0 [ByGroup6.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Europe 4 MSRP 16999.0 43175.0 4 Horsepower 100.0 250.0 [ByGroup7.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Europe 5 MSRP 31745.0 42565.0 5 Horsepower 194.0 300.0 [ByGroup8.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Europe 6 MSRP 23785.0 192465.0 6 Horsepower 168.0 477.0 [ByGroup9.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Europe 8 MSRP 39235.0 121770.0 8 Horsepower 217.0 493.0 [ByGroup10.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Europe 12 MSRP 75000.0 128420.0 12 Horsepower 420.0 493.0 [ByGroup11.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders USA 4 MSRP 10995.0 25955.0 4 Horsepower 103.0 220.0 [ByGroup12.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders USA 6 MSRP 17630.0 40720.0 6 Horsepower 155.0 275.0 [ByGroup13.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders USA 8 MSRP 22010.0 76200.0 8 Horsepower 224.0 350.0 [ByGroup14.Summary] Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders USA 10 MSRP 41475.0 81795.0 10 Horsepower 310.0 500.0 + Elapsed: 0.0225s, user: 0.022s, sys: 0.002s, mem: 5.43mb
Selecting Tables by Name Across By Groups¶
The CASResults.get_tables() method will return all tables with a given name across By groups in a list. If the results do not contain By groups, the single table with that name will be returned in a list. This makes it possible to use CASResults.get_tables() the same way whether By group variables are specified or not.
In [4]: tbl.summary(subset=['Min', 'Max']).get_tables('Summary') Out[4]: [Descriptive Statistics for TMPCXHG5FVU Column Min Max 0 MSRP 10280.0 192465.0 1 Horsepower 73.0 500.0] In [5]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max']).get_tables('Summary') Out[5]: [Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Asia . MSRP 25700.0 27200.0 . Horsepower 197.0 238.0, Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Asia 3 MSRP 19110.0 19110.0 3 Horsepower 73.0 73.0, Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Asia 4 MSRP 10280.0 33260.0 4 Horsepower 93.0 300.0, Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Asia 6 MSRP 16495.0 89765.0 6 Horsepower 155.0 290.0, Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Asia 8 MSRP 26650.0 64800.0 8 Horsepower 235.0 340.0, Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Europe 4 MSRP 16999.0 43175.0 4 Horsepower 100.0 250.0, Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Europe 5 MSRP 31745.0 42565.0 5 Horsepower 194.0 300.0, Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Europe 6 MSRP 23785.0 192465.0 6 Horsepower 168.0 477.0, Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Europe 8 MSRP 39235.0 121770.0 8 Horsepower 217.0 493.0, Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders Europe 12 MSRP 75000.0 128420.0 12 Horsepower 420.0 493.0, Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders USA 4 MSRP 10995.0 25955.0 4 Horsepower 103.0 220.0, Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders USA 6 MSRP 17630.0 40720.0 6 Horsepower 155.0 275.0, Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders USA 8 MSRP 22010.0 76200.0 8 Horsepower 224.0 350.0, Descriptive Statistics for TMPCXHG5FVU Column Min Max Origin Cylinders USA 10 MSRP 41475.0 81795.0 10 Horsepower 310.0 500.0]
The reason that the table name is required is that many CAS actions can have multiple tables with different names in each By group.
Concatenating By Group Tables¶
While you can use the CASResults.get_tables() method to retrieve tables of a specified name then use the concat() function to concatenate them together, there is also a CASResults.concat_bygroups() method that you can use. This method will concatenate all tables with the same name across all By groups and set the concatenated table under the table’s key.
In [6]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max']).concat_bygroups()
Out[6]:
[Summary]
Descriptive Statistics for TMPCXHG5FVU
Column Min Max
Origin Cylinders
Asia . MSRP 25700.0 27200.0
. Horsepower 197.0 238.0
3 MSRP 19110.0 19110.0
3 Horsepower 73.0 73.0
4 MSRP 10280.0 33260.0
4 Horsepower 93.0 300.0
6 MSRP 16495.0 89765.0
6 Horsepower 155.0 290.0
8 MSRP 26650.0 64800.0
8 Horsepower 235.0 340.0
Europe 4 MSRP 16999.0 43175.0
4 Horsepower 100.0 250.0
5 MSRP 31745.0 42565.0
5 Horsepower 194.0 300.0
6 MSRP 23785.0 192465.0
6 Horsepower 168.0 477.0
8 MSRP 39235.0 121770.0
8 Horsepower 217.0 493.0
12 MSRP 75000.0 128420.0
12 Horsepower 420.0 493.0
USA 4 MSRP 10995.0 25955.0
4 Horsepower 103.0 220.0
6 MSRP 17630.0 40720.0
6 Horsepower 155.0 275.0
8 MSRP 22010.0 76200.0
8 Horsepower 224.0 350.0
10 MSRP 41475.0 81795.0
10 Horsepower 310.0 500.0
By default, this method returns a new CASResults object with the concatenated tables. If you want to modify the CASResults object in place, you can add a inplace=True option.
Selecting a Specific By Group¶
In addition to selecting tables by name, you can also select a specific By group in the result by specifying the By variable values. This is done with the CASResults.get_group() method.
In [7]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max']).get_group(['Asia', 4])
Out[7]:
[Summary]
Descriptive Statistics for TMPCXHG5FVU
Column Min Max
Origin Cylinders
Asia 4 MSRP 10280.0 33260.0
4 Horsepower 93.0 300.0
The values given for the By group variable values can be either the raw value or the formatted value.
You can also specify the grouping variables as keyword arguments.
In [8]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max']).get_group(Origin='Asia', Cylinders=4)
Out[8]:
[Summary]
Descriptive Statistics for TMPCXHG5FVU
Column Min Max
Origin Cylinders
Asia 4 MSRP 10280.0 33260.0
4 Horsepower 93.0 300.0
Multiple Sets of By Groups¶
Some CAS actions like simple.mdsummary allow you to specify multiple By group sets. In cases like this, the keys for each By group set are prefixed with “ByGroupSet#.”. To select a By group set, you can use the CASResults.get_set() method. This takes a numeric index indicating which By group set to select. The return value is a new CASResults object that contains just the selected By group set. You can then use the methods above to select tables or concatenate tables together.
In [9]: tbl.mdsummary(sets=[dict(groupby=["Origin"]), ...: dict(groupby=["Cylinders"])]) ...: Out[9]: [ByGroupSet1.ByGroupInfo] ByGroupSet1.ByGroupInfo Origin Origin_f _key_ 0 Asia Asia Asia 1 Europe Europe Europe 2 USA USA USA [ByGroupSet1.ByGroup1.MDSummary] Descriptive Statistics for TMPCXHG5FVU Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Origin Asia MSRP 10280.0 89765.0 158.0 0.0 24741.322785 3909129.0 11321.069675 900.655944 1.281666e+08 1.168392e+11 2.012216e+10 45.757738 27.470338 7.525437e-62 Asia Horsepower 73.0 340.0 158.0 0.0 190.702532 30131.0 59.392627 4.725024 3.527484e+03 6.299873e+06 5.538150e+05 31.144121 40.360121 8.503586e-85 [ByGroupSet1.ByGroup2.MDSummary] Descriptive Statistics for TMPCXHG5FVU Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Origin Europe MSRP 16999.0 192465.0 123.0 0.0 48349.796748 5947025.0 25318.600464 2282.901323 6.410315e+08 3.657433e+11 7.820585e+10 52.365474 21.179101 1.104072e-42 Europe Horsepower 100.0 493.0 123.0 0.0 251.894309 30983.0 80.738884 7.279980 6.518767e+03 8.599731e+06 7.952896e+05 32.052683 34.600962 6.412879e-65 [ByGroupSet1.ByGroup3.MDSummary] Descriptive Statistics for TMPCXHG5FVU Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Origin USA MSRP 10995.0 81795.0 147.0 0.0 28377.442177 4171484.0 11711.982506 965.988036 1.371705e+08 1.384029e+11 2.002690e+10 41.272157 29.376598 3.686853e-63 USA Horsepower 103.0 500.0 147.0 0.0 212.823129 31285.0 63.748618 5.257897 4.063886e+03 7.251499e+06 5.933274e+05 29.953802 40.476851 2.996113e-81 [ByGroupSet2.ByGroupInfo] ByGroupSet2.ByGroupInfo Cylinders Cylinders_f _key_ 0 NaN . . 1 3.0 3 3 2 4.0 4 4 3 5.0 5 5 4 6.0 6 6 5 8.0 8 8 6 10.0 10 10 7 12.0 12 12 [ByGroupSet2.ByGroup1.MDSummary] Descriptive Statistics for TMPCXHG5FVU Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Cylinders . MSRP 25700.0 27200.0 2.0 0.0 26450.0 52900.0 1060.660172 750.0 1125000.0 1.400330e+09 1125000.0 4.010057 35.266667 0.018047 . Horsepower 197.0 238.0 2.0 0.0 217.5 435.0 28.991378 20.5 840.5 9.545300e+04 840.5 13.329369 10.609756 0.059827 [ByGroupSet2.ByGroup2.MDSummary] Descriptive Statistics for TMPCXHG5FVU Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Cylinders 3 MSRP 19110.0 19110.0 1.0 0.0 19110.0 19110.0 NaN NaN NaN 365192100.0 0.0 NaN NaN NaN 3 Horsepower 73.0 73.0 1.0 0.0 73.0 73.0 NaN NaN NaN 5329.0 0.0 NaN NaN NaN [ByGroupSet2.ByGroup3.MDSummary] Descriptive Statistics for TMPCXHG5FVU Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Cylinders 4 MSRP 10280.0 43175.0 136.0 0.0 19747.500000 2685660.0 7070.833747 606.318992 4.999669e+07 5.978462e+10 6.749553e+09 35.806222 32.569489 8.276602e-66 4 Horsepower 93.0 300.0 136.0 0.0 150.514706 20470.0 36.387826 3.120230 1.324074e+03 3.259786e+06 1.787500e+05 24.175595 48.238332 5.424646e-87 [ByGroupSet2.ByGroup4.MDSummary] Descriptive Statistics for TMPCXHG5FVU Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Cylinders 5 MSRP 31745.0 42565.0 7.0 0.0 37187.142857 260310.0 3656.733352 1382.115294 1.337170e+07 9.760415e+09 8.023019e+07 9.833327 26.905963 1.741013e-07 5 Horsepower 194.0 300.0 7.0 0.0 228.000000 1596.0 37.986840 14.357676 1.443000e+03 3.725460e+05 8.658000e+03 16.660895 15.880007 3.957170e-06 [ByGroupSet2.ByGroup5.MDSummary] Descriptive Statistics for TMPCXHG5FVU Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Cylinders 6 MSRP 16495.0 192465.0 190.0 0.0 32124.415789 6103639.0 16092.708870 1167.487809 2.589753e+08 2.450222e+11 4.894633e+10 50.094946 27.515847 5.123923e-68 6 Horsepower 155.0 477.0 190.0 0.0 218.952632 41601.0 41.434628 3.005984 1.716828e+03 9.433129e+06 3.244806e+05 18.924015 72.838924 2.971778e-140 [ByGroupSet2.ByGroup6.MDSummary] Descriptive Statistics for TMPCXHG5FVU Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Cylinders 8 MSRP 22010.0 121770.0 87.0 0.0 51179.988506 4452659.0 18737.189898 2008.837616 3.510823e+08 2.580801e+11 3.019308e+10 36.610383 25.477414 7.769015e-42 8 Horsepower 217.0 493.0 87.0 0.0 298.942529 26008.0 48.388533 5.187795 2.341450e+03 7.976262e+06 2.013647e+05 16.186567 57.624195 1.709083e-70 [ByGroupSet2.ByGroup7.MDSummary] Descriptive Statistics for TMPCXHG5FVU Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Cylinders 10 MSRP 41475.0 81795.0 2.0 0.0 61635.0 123270.0 28510.545417 20160.0 812851200.0 8.410598e+09 812851200.0 46.257071 3.057292 0.201247 10 Horsepower 310.0 500.0 2.0 0.0 405.0 810.0 134.350288 95.0 18050.0 3.461000e+05 18050.0 33.172911 4.263158 0.146679 [ByGroupSet2.ByGroup8.MDSummary] Descriptive Statistics for TMPCXHG5FVU Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Cylinders 12 MSRP 75000.0 128420.0 3.0 0.0 110030.000000 330090.0 30349.485992 17522.283907 9.210913e+08 3.816199e+10 1.842183e+09 27.582919 6.279433 0.024435 12 Horsepower 420.0 493.0 3.0 0.0 468.666667 1406.0 42.146570 24.333333 1.776333e+03 6.624980e+05 3.552667e+03 8.992867 19.260274 0.002685 + Elapsed: 0.0536s, user: 0.057s, sys: 0.011s, mem: 28mb In [10]: tbl.mdsummary(sets=[dict(groupby=["Origin"]), ....: dict(groupby=["Cylinders"])]).get_set(1).get_group('Asia') ....: Out[10]: [MDSummary] Descriptive Statistics for TMPCXHG5FVU Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT Origin Asia MSRP 10280.0 89765.0 158.0 0.0 24741.322785 3909129.0 11321.069675 900.655944 1.281666e+08 1.168392e+11 2.012216e+10 45.757738 27.470338 7.525437e-62 Asia Horsepower 73.0 340.0 158.0 0.0 190.702532 30131.0 59.392627 4.725024 3.527484e+03 6.299873e+06 5.538150e+05 31.144121 40.360121 8.503586e-85