Handling By Groups

If By groups are specified when running a CAS action, the result are returned with the following behaviors.

  1. A result key named ‘ByGroupInfo’ is returned with all of the By group variable values.

  2. Each By group table is returned in a separate result key with a prefix of ‘ByGroup#.’.

These behaviors can help when you have a large number of By groups and you want to process them as they arrive at the client rather than trying to hold the entire set of results in memory. However, when your result sets are smaller, you may want to combine all of the By group tables into a single pandas.DataFrame. To help in these situations, the CASResults class defines some helper methods for you.

Here is what it looks like to run a standard summary action, and a summary action with a By group specified. We will use this output to demonstrate the By group processing methods.

In [1]: tbl = tbl[['MSRP', 'Horsepower']]

In [2]: tbl.summary(subset=['Min', 'Max'])
Out[2]: 
[Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
        Column      Min       Max
 0        MSRP  10280.0  192465.0
 1  Horsepower     73.0     500.0

+ Elapsed: 0.0137s, user: 0.01s, sys: 0.005s, mem: 4.52mb

In [3]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max'])
Out[3]: 
[ByGroupInfo]

 ByGroupInfo
 
     Origin Origin_f  Cylinders   Cylinders_f               _key_
 0     Asia     Asia        NaN             .  Asia             .
 1     Asia     Asia        3.0             3  Asia             3
 2     Asia     Asia        4.0             4  Asia             4
 3     Asia     Asia        6.0             6  Asia             6
 4     Asia     Asia        8.0             8  Asia             8
 5   Europe   Europe        4.0             4  Europe           4
 6   Europe   Europe        5.0             5  Europe           5
 7   Europe   Europe        6.0             6  Europe           6
 8   Europe   Europe        8.0             8  Europe           8
 9   Europe   Europe       12.0            12  Europe          12
 10     USA      USA        4.0             4  USA              4
 11     USA      USA        6.0             6  USA              6
 12     USA      USA        8.0             8  USA              8
 13     USA      USA       10.0            10  USA             10

[ByGroup1.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   .                MSRP  25700.0  27200.0
        .          Horsepower    197.0    238.0

[ByGroup2.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   3                MSRP  19110.0  19110.0
        3          Horsepower     73.0     73.0

[ByGroup3.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   4                MSRP  10280.0  33260.0
        4          Horsepower     93.0    300.0

[ByGroup4.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   6                MSRP  16495.0  89765.0
        6          Horsepower    155.0    290.0

[ByGroup5.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   8                MSRP  26650.0  64800.0
        8          Horsepower    235.0    340.0

[ByGroup6.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min      Max
 Origin Cylinders                              
 Europe 4                MSRP  16999.0  43175.0
        4          Horsepower    100.0    250.0

[ByGroup7.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min      Max
 Origin Cylinders                              
 Europe 5                MSRP  31745.0  42565.0
        5          Horsepower    194.0    300.0

[ByGroup8.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min       Max
 Origin Cylinders                               
 Europe 6                MSRP  23785.0  192465.0
        6          Horsepower    168.0     477.0

[ByGroup9.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min       Max
 Origin Cylinders                               
 Europe 8                MSRP  39235.0  121770.0
        8          Horsepower    217.0     493.0

[ByGroup10.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min       Max
 Origin Cylinders                               
 Europe 12               MSRP  75000.0  128420.0
        12         Horsepower    420.0     493.0

[ByGroup11.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min      Max
 Origin Cylinders                              
 USA    4                MSRP  10995.0  25955.0
        4          Horsepower    103.0    220.0

[ByGroup12.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min      Max
 Origin Cylinders                              
 USA    6                MSRP  17630.0  40720.0
        6          Horsepower    155.0    275.0

[ByGroup13.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min      Max
 Origin Cylinders                              
 USA    8                MSRP  22010.0  76200.0
        8          Horsepower    224.0    350.0

[ByGroup14.Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min      Max
 Origin Cylinders                              
 USA    10               MSRP  41475.0  81795.0
        10         Horsepower    310.0    500.0

+ Elapsed: 0.0231s, user: 0.02s, sys: 0.004s, mem: 5.45mb

Selecting Tables by Name Across By Groups

The CASResults.get_tables() method will return all tables with a given name across By groups in a list. If the results do not contain By groups, the single table with that name will be returned in a list. This makes it possible to use CASResults.get_tables() the same way whether By group variables are specified or not.

In [4]: tbl.summary(subset=['Min', 'Max']).get_tables('Summary')
Out[4]: 
[Descriptive Statistics for TMPHDTLDQBP

       Column      Min       Max
0        MSRP  10280.0  192465.0
1  Horsepower     73.0     500.0]

In [5]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max']).get_tables('Summary')
Out[5]: 
[Descriptive Statistics for TMPHDTLDQBP

                      Column      Min      Max
Origin Cylinders                              
Asia   .                MSRP  25700.0  27200.0
       .          Horsepower    197.0    238.0,
 Descriptive Statistics for TMPHDTLDQBP

                      Column      Min      Max
Origin Cylinders                              
Asia   3                MSRP  19110.0  19110.0
       3          Horsepower     73.0     73.0,
 Descriptive Statistics for TMPHDTLDQBP

                      Column      Min      Max
Origin Cylinders                              
Asia   4                MSRP  10280.0  33260.0
       4          Horsepower     93.0    300.0,
 Descriptive Statistics for TMPHDTLDQBP

                      Column      Min      Max
Origin Cylinders                              
Asia   6                MSRP  16495.0  89765.0
       6          Horsepower    155.0    290.0,
 Descriptive Statistics for TMPHDTLDQBP

                      Column      Min      Max
Origin Cylinders                              
Asia   8                MSRP  26650.0  64800.0
       8          Horsepower    235.0    340.0,
 Descriptive Statistics for TMPHDTLDQBP

                      Column      Min      Max
Origin Cylinders                              
Europe 4                MSRP  16999.0  43175.0
       4          Horsepower    100.0    250.0,
 Descriptive Statistics for TMPHDTLDQBP

                      Column      Min      Max
Origin Cylinders                              
Europe 5                MSRP  31745.0  42565.0
       5          Horsepower    194.0    300.0,
 Descriptive Statistics for TMPHDTLDQBP

                      Column      Min       Max
Origin Cylinders                               
Europe 6                MSRP  23785.0  192465.0
       6          Horsepower    168.0     477.0,
 Descriptive Statistics for TMPHDTLDQBP

                      Column      Min       Max
Origin Cylinders                               
Europe 8                MSRP  39235.0  121770.0
       8          Horsepower    217.0     493.0,
 Descriptive Statistics for TMPHDTLDQBP

                      Column      Min       Max
Origin Cylinders                               
Europe 12               MSRP  75000.0  128420.0
       12         Horsepower    420.0     493.0,
 Descriptive Statistics for TMPHDTLDQBP

                      Column      Min      Max
Origin Cylinders                              
USA    4                MSRP  10995.0  25955.0
       4          Horsepower    103.0    220.0,
 Descriptive Statistics for TMPHDTLDQBP

                      Column      Min      Max
Origin Cylinders                              
USA    6                MSRP  17630.0  40720.0
       6          Horsepower    155.0    275.0,
 Descriptive Statistics for TMPHDTLDQBP

                      Column      Min      Max
Origin Cylinders                              
USA    8                MSRP  22010.0  76200.0
       8          Horsepower    224.0    350.0,
 Descriptive Statistics for TMPHDTLDQBP

                      Column      Min      Max
Origin Cylinders                              
USA    10               MSRP  41475.0  81795.0
       10         Horsepower    310.0    500.0]

The reason that the table name is required is that many CAS actions can have multiple tables with different names in each By group.

Concatenating By Group Tables

While you can use the CASResults.get_tables() method to retrieve tables of a specified name then use the concat() function to concatenate them together, there is also a CASResults.concat_bygroups() method that you can use. This method will concatenate all tables with the same name across all By groups and set the concatenated table under the table’s key.

In [6]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max']).concat_bygroups()
Out[6]: 
[Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min       Max
 Origin Cylinders                               
 Asia   .                MSRP  25700.0   27200.0
        .          Horsepower    197.0     238.0
        3                MSRP  19110.0   19110.0
        3          Horsepower     73.0      73.0
        4                MSRP  10280.0   33260.0
        4          Horsepower     93.0     300.0
        6                MSRP  16495.0   89765.0
        6          Horsepower    155.0     290.0
        8                MSRP  26650.0   64800.0
        8          Horsepower    235.0     340.0
 Europe 4                MSRP  16999.0   43175.0
        4          Horsepower    100.0     250.0
        5                MSRP  31745.0   42565.0
        5          Horsepower    194.0     300.0
        6                MSRP  23785.0  192465.0
        6          Horsepower    168.0     477.0
        8                MSRP  39235.0  121770.0
        8          Horsepower    217.0     493.0
        12               MSRP  75000.0  128420.0
        12         Horsepower    420.0     493.0
 USA    4                MSRP  10995.0   25955.0
        4          Horsepower    103.0     220.0
        6                MSRP  17630.0   40720.0
        6          Horsepower    155.0     275.0
        8                MSRP  22010.0   76200.0
        8          Horsepower    224.0     350.0
        10               MSRP  41475.0   81795.0
        10         Horsepower    310.0     500.0

By default, this method returns a new CASResults object with the concatenated tables. If you want to modify the CASResults object in place, you can add a inplace=True option.

Selecting a Specific By Group

In addition to selecting tables by name, you can also select a specific By group in the result by specifying the By variable values. This is done with the CASResults.get_group() method.

In [7]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max']).get_group(['Asia', 4])
Out[7]: 
[Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   4                MSRP  10280.0  33260.0
        4          Horsepower     93.0    300.0

The values given for the By group variable values can be either the raw value or the formatted value.

You can also specify the grouping variables as keyword arguments.

In [8]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max']).get_group(Origin='Asia', Cylinders=4)
Out[8]: 
[Summary]

 Descriptive Statistics for TMPHDTLDQBP
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   4                MSRP  10280.0  33260.0
        4          Horsepower     93.0    300.0

Multiple Sets of By Groups

Some CAS actions like simple.mdsummary allow you to specify multiple By group sets. In cases like this, the keys for each By group set are prefixed with “ByGroupSet#.”. To select a By group set, you can use the CASResults.get_set() method. This takes a numeric index indicating which By group set to select. The return value is a new CASResults object that contains just the selected By group set. You can then use the methods above to select tables or concatenate tables together.

In [9]: tbl.mdsummary(sets=[dict(groupby=["Origin"]),
   ...:                     dict(groupby=["Cylinders"])])
   ...: 
Out[9]: 
[ByGroupSet1.ByGroupInfo]

 ByGroupSet1.ByGroupInfo
 
    Origin Origin_f   _key_
 0    Asia     Asia    Asia
 1  Europe   Europe  Europe
 2     USA      USA     USA

[ByGroupSet1.ByGroup1.MDSummary]

 Descriptive Statistics for TMPHDTLDQBP
 
             Column      Min      Max      N  NMiss          Mean        Sum           Std      StdErr           Var           USS           CSS         CV     TValue         ProbT
 Origin                                                                                                                                                                             
 Asia          MSRP  10280.0  89765.0  158.0    0.0  24741.322785  3909129.0  11321.069675  900.655944  1.281666e+08  1.168392e+11  2.012216e+10  45.757738  27.470338  7.525437e-62
 Asia    Horsepower     73.0    340.0  158.0    0.0    190.702532    30131.0     59.392627    4.725024  3.527484e+03  6.299873e+06  5.538150e+05  31.144121  40.360121  8.503586e-85

[ByGroupSet1.ByGroup2.MDSummary]

 Descriptive Statistics for TMPHDTLDQBP
 
             Column      Min       Max      N  NMiss          Mean        Sum           Std       StdErr           Var           USS           CSS         CV     TValue         ProbT
 Origin                                                                                                                                                                               
 Europe        MSRP  16999.0  192465.0  123.0    0.0  48349.796748  5947025.0  25318.600464  2282.901323  6.410315e+08  3.657433e+11  7.820585e+10  52.365474  21.179101  1.104072e-42
 Europe  Horsepower    100.0     493.0  123.0    0.0    251.894309    30983.0     80.738884     7.279980  6.518767e+03  8.599731e+06  7.952896e+05  32.052683  34.600962  6.412879e-65

[ByGroupSet1.ByGroup3.MDSummary]

 Descriptive Statistics for TMPHDTLDQBP
 
             Column      Min      Max      N  NMiss          Mean        Sum           Std      StdErr           Var           USS           CSS         CV     TValue         ProbT
 Origin                                                                                                                                                                             
 USA           MSRP  10995.0  81795.0  147.0    0.0  28377.442177  4171484.0  11711.982506  965.988036  1.371705e+08  1.384029e+11  2.002690e+10  41.272157  29.376598  3.686853e-63
 USA     Horsepower    103.0    500.0  147.0    0.0    212.823129    31285.0     63.748618    5.257897  4.063886e+03  7.251499e+06  5.933274e+05  29.953802  40.476851  2.996113e-81

[ByGroupSet2.ByGroupInfo]

 ByGroupSet2.ByGroupInfo
 
    Cylinders   Cylinders_f         _key_
 0        NaN             .             .
 1        3.0             3             3
 2        4.0             4             4
 3        5.0             5             5
 4        6.0             6             6
 5        8.0             8             8
 6       10.0            10            10
 7       12.0            12            12

[ByGroupSet2.ByGroup1.MDSummary]

 Descriptive Statistics for TMPHDTLDQBP
 
                Column      Min      Max    N  NMiss     Mean      Sum          Std  StdErr        Var           USS        CSS         CV     TValue     ProbT
 Cylinders                                                                                                                                                     
 .                MSRP  25700.0  27200.0  2.0    0.0  26450.0  52900.0  1060.660172   750.0  1125000.0  1.400330e+09  1125000.0   4.010057  35.266667  0.018047
 .          Horsepower    197.0    238.0  2.0    0.0    217.5    435.0    28.991378    20.5      840.5  9.545300e+04      840.5  13.329369  10.609756  0.059827

[ByGroupSet2.ByGroup2.MDSummary]

 Descriptive Statistics for TMPHDTLDQBP
 
                Column      Min      Max    N  NMiss     Mean      Sum  Std  StdErr  Var          USS  CSS  CV  TValue  ProbT
 Cylinders                                                                                                                   
 3                MSRP  19110.0  19110.0  1.0    0.0  19110.0  19110.0  NaN     NaN  NaN  365192100.0  0.0 NaN     NaN    NaN
 3          Horsepower     73.0     73.0  1.0    0.0     73.0     73.0  NaN     NaN  NaN       5329.0  0.0 NaN     NaN    NaN

[ByGroupSet2.ByGroup3.MDSummary]

 Descriptive Statistics for TMPHDTLDQBP
 
                Column      Min      Max      N  NMiss          Mean        Sum          Std      StdErr           Var           USS           CSS         CV     TValue         ProbT
 Cylinders                                                                                                                                                                            
 4                MSRP  10280.0  43175.0  136.0    0.0  19747.500000  2685660.0  7070.833747  606.318992  4.999669e+07  5.978462e+10  6.749553e+09  35.806222  32.569489  8.276602e-66
 4          Horsepower     93.0    300.0  136.0    0.0    150.514706    20470.0    36.387826    3.120230  1.324074e+03  3.259786e+06  1.787500e+05  24.175595  48.238332  5.424646e-87

[ByGroupSet2.ByGroup4.MDSummary]

 Descriptive Statistics for TMPHDTLDQBP
 
                Column      Min      Max    N  NMiss          Mean       Sum          Std       StdErr           Var           USS           CSS         CV     TValue         ProbT
 Cylinders                                                                                                                                                                          
 5                MSRP  31745.0  42565.0  7.0    0.0  37187.142857  260310.0  3656.733352  1382.115294  1.337170e+07  9.760415e+09  8.023019e+07   9.833327  26.905963  1.741013e-07
 5          Horsepower    194.0    300.0  7.0    0.0    228.000000    1596.0    37.986840    14.357676  1.443000e+03  3.725460e+05  8.658000e+03  16.660895  15.880007  3.957170e-06

[ByGroupSet2.ByGroup5.MDSummary]

 Descriptive Statistics for TMPHDTLDQBP
 
                Column      Min       Max      N  NMiss          Mean        Sum           Std       StdErr           Var           USS           CSS         CV     TValue          ProbT
 Cylinders                                                                                                                                                                                
 6                MSRP  16495.0  192465.0  190.0    0.0  32124.415789  6103639.0  16092.708870  1167.487809  2.589753e+08  2.450222e+11  4.894633e+10  50.094946  27.515847   5.123923e-68
 6          Horsepower    155.0     477.0  190.0    0.0    218.952632    41601.0     41.434628     3.005984  1.716828e+03  9.433129e+06  3.244806e+05  18.924015  72.838924  2.971778e-140

[ByGroupSet2.ByGroup6.MDSummary]

 Descriptive Statistics for TMPHDTLDQBP
 
                Column      Min       Max     N  NMiss          Mean        Sum           Std       StdErr           Var           USS           CSS         CV     TValue         ProbT
 Cylinders                                                                                                                                                                              
 8                MSRP  22010.0  121770.0  87.0    0.0  51179.988506  4452659.0  18737.189898  2008.837616  3.510823e+08  2.580801e+11  3.019308e+10  36.610383  25.477414  7.769015e-42
 8          Horsepower    217.0     493.0  87.0    0.0    298.942529    26008.0     48.388533     5.187795  2.341450e+03  7.976262e+06  2.013647e+05  16.186567  57.624195  1.709083e-70

[ByGroupSet2.ByGroup7.MDSummary]

 Descriptive Statistics for TMPHDTLDQBP
 
                Column      Min      Max    N  NMiss     Mean       Sum           Std   StdErr          Var           USS          CSS         CV    TValue     ProbT
 Cylinders                                                                                                                                                           
 10               MSRP  41475.0  81795.0  2.0    0.0  61635.0  123270.0  28510.545417  20160.0  812851200.0  8.410598e+09  812851200.0  46.257071  3.057292  0.201247
 10         Horsepower    310.0    500.0  2.0    0.0    405.0     810.0    134.350288     95.0      18050.0  3.461000e+05      18050.0  33.172911  4.263158  0.146679

[ByGroupSet2.ByGroup8.MDSummary]

 Descriptive Statistics for TMPHDTLDQBP
 
                Column      Min       Max    N  NMiss           Mean       Sum           Std        StdErr           Var           USS           CSS         CV     TValue     ProbT
 Cylinders                                                                                                                                                                          
 12               MSRP  75000.0  128420.0  3.0    0.0  110030.000000  330090.0  30349.485992  17522.283907  9.210913e+08  3.816199e+10  1.842183e+09  27.582919   6.279433  0.024435
 12         Horsepower    420.0     493.0  3.0    0.0     468.666667    1406.0     42.146570     24.333333  1.776333e+03  6.624980e+05  3.552667e+03   8.992867  19.260274  0.002685

+ Elapsed: 0.0534s, user: 0.056s, sys: 0.011s, mem: 28mb

In [10]: tbl.mdsummary(sets=[dict(groupby=["Origin"]),
   ....:                     dict(groupby=["Cylinders"])]).get_set(1).get_group('Asia')
   ....: 
Out[10]: 
[MDSummary]

 Descriptive Statistics for TMPHDTLDQBP
 
             Column      Min      Max      N  NMiss          Mean        Sum           Std      StdErr           Var           USS           CSS         CV     TValue         ProbT
 Origin                                                                                                                                                                             
 Asia          MSRP  10280.0  89765.0  158.0    0.0  24741.322785  3909129.0  11321.069675  900.655944  1.281666e+08  1.168392e+11  2.012216e+10  45.757738  27.470338  7.525437e-62
 Asia    Horsepower     73.0    340.0  158.0    0.0    190.702532    30131.0     59.392627    4.725024  3.527484e+03  6.299873e+06  5.538150e+05  31.144121  40.360121  8.503586e-85