pipefitter.transformer.imputer.Imputer

class pipefitter.transformer.imputer.Imputer(value=mean)

Bases: pipefitter.base.BaseImputer

Impute missing values in a data set

The replacement for a missing value can be either a statistic or a constant. To specify a statistic, use one of the following predefined constants on the Imputer class.

  • Imputer.MAX
  • Imputer.MEAN
  • Imputer.MEDIAN
  • Imputer.MIDRANGE
  • Imputer.MIN
  • Imputer.MODE
  • Imputer.RANDOM
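
As a rough illustration of what these statistics mean for a single column, the sketch below computes them in plain Python over column B of the sample data used in the examples. It does not call pipefitter: `column_statistics` is a hypothetical helper, `None` stands in for a missing value, and the midrange is assumed to be the usual (min + max) / 2. RANDOM is left out because its behavior is not described here.

```python
import statistics

# Column B of the sample data; None stands in for a missing value (NaN).
column_b = [2.0, None, None, 17.0, 22.0]


def column_statistics(values):
    """Compute the per-column statistics, ignoring missing entries.

    Hypothetical helper for illustration only; not part of pipefitter.
    """
    present = [v for v in values if v is not None]
    return {
        "MAX": max(present),
        "MEAN": statistics.mean(present),
        "MEDIAN": statistics.median(present),
        # Assumed definition: halfway between the smallest and largest value.
        "MIDRANGE": (min(present) + max(present)) / 2,
        "MIN": min(present),
        # With ties, statistics.mode (Python 3.8+) returns the first value seen;
        # pipefitter's tie-breaking rule may differ.
        "MODE": statistics.mode(present),
    }


stats = column_statistics(column_b)

# Replace each missing entry with the column mean.
filled = [stats["MEAN"] if v is None else v for v in column_b]
```

Here `stats["MEAN"]` is 41/3, which rounds to the 13.666667 shown for column B in the mean example.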

Parameters:

value : ImputerMethod or scalar or dict, optional

Specifies the value to use in place of missing values.
  • If an ImputerMethod is specified, that method is used for all missing values.
  • If a scalar is specified, that value is substituted for all missing values.
  • If a dict is specified, the keys correspond to the columns and the values are the substitution values (which may also be ImputerMethod instances).
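
The three forms of value can be pictured with a small pure-Python sketch. It is not pipefitter's implementation: `Method`, `MEAN`, and `impute` are hypothetical names, the table is a plain dict of lists, and `None` marks a missing value. The sketch also assumes that columns not named in a dict are left untouched, as the dict example's output suggests.

```python
class Method:
    """Hypothetical stand-in for an ImputerMethod: a function of the non-missing values."""

    def __init__(self, func):
        self.func = func


MEAN = Method(lambda xs: sum(xs) / len(xs))


def impute(table, value):
    """Return a copy of `table` (column name -> list) with None entries filled."""
    if isinstance(value, dict):
        specs = value  # only the listed columns are touched
    else:
        specs = {name: value for name in table}  # one method or scalar for every column

    result = {}
    for name, values in table.items():
        if name not in specs:
            result[name] = list(values)  # column not mentioned: keep as is
            continue
        spec = specs[name]
        if isinstance(spec, Method):
            present = [v for v in values if v is not None]
            fill = spec.func(present)  # statistic computed from this column
        else:
            fill = spec  # constant substitution value
        result[name] = [fill if v is None else v for v in values]
    return result


table = {
    "A": [1.0, 6.0, 11.0, 16.0, None],
    "B": [2.0, None, None, 17.0, 22.0],
}

by_method = impute(table, MEAN)                 # same method for all columns
by_scalar = impute(table, 100)                  # same constant for all columns
by_dict = impute(table, {"A": MEAN, "B": 100})  # per-column mix of method and constant
```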

Examples

Sample data set used for imputing examples:

>>> data.head()
      A     B     C     D     E  F  G  H
0   1.0   2.0   3.0   4.0   5.0  a  b  c
1   6.0   NaN   8.0   9.0   NaN  j  e  f
2  11.0   NaN  13.0  14.0   NaN     h  i
3  16.0  17.0  18.0   NaN  20.0  j     l
4   NaN  22.0  23.0  24.0   NaN     n  o

Impute values using the mean:

>>> meanimp = Imputer(Imputer.MEAN) 
>>> newdata = meanimp.transform(data)
>>> newdata.head()
      A          B     C      D     E  F  G  H
0   1.0   2.000000   3.0   4.00   5.0  a  b  c
1   6.0  13.666667   8.0   9.00  12.5  j  e  f
2  11.0  13.666667  13.0  14.00  12.5     h  i
3  16.0  17.000000  18.0  12.75  20.0  j     l
4   8.5  22.000000  23.0  24.00  12.5     n  o

Impute values using the mode:

>>> modeimp = Imputer(Imputer.MODE) 
>>> newdata = modeimp.transform(data)
>>> newdata.head()
      A     B     C     D     E  F  G  H
0   1.0   2.0   3.0   4.0   5.0  a  b  c
1   6.0   2.0   8.0   9.0   5.0  j  e  f
2  11.0   2.0  13.0  14.0   5.0  j  h  i
3  16.0  17.0  18.0   4.0  20.0  j  b  l
4   1.0  22.0  23.0  24.0   5.0  j  n  o
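
The mode also applies to character columns, as column F above shows. A minimal plain-Python sketch of that one column follows; it assumes the blank cells in F count as missing (represented here as `None`) and does not use pipefitter itself.

```python
from collections import Counter

# Column F of the sample data; None stands in for the blank (missing) cells.
column_f = ["a", "j", None, "j", None]

# Count only the values that are present and take the most frequent one.
present = [v for v in column_f if v is not None]
most_common_value, count = Counter(present).most_common(1)[0]

# Substitute the mode for every missing entry.
filled_f = [most_common_value if v is None else v for v in column_f]
```

The value "j" occurs twice, so both missing cells are filled with "j", matching column F in the output above.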

Impute a constant value:

>>> cimp = Imputer(100)
>>> newdata = cimp.transform(data)
>>> newdata.head()
       A      B     C      D      E  F  G  H
0    1.0    2.0   3.0    4.0    5.0  a  b  c
1    6.0  100.0   8.0    9.0  100.0  j  e  f
2   11.0  100.0  13.0   14.0  100.0     h  i
3   16.0   17.0  18.0  100.0   20.0  j     l
4  100.0   22.0  23.0   24.0  100.0     n  o

Impute values in specified columns:

>>> dimp = Imputer({'A': 1, 'B': 100,
...                 'F': 'none', 'G': 'miss'})
>>> newdata = dimp.transform(data)
>>> newdata.head()
      A      B     C     D     E     F     G  H
0   1.0    2.0   3.0   4.0   5.0     a     b  c
1   6.0  100.0   8.0   9.0   NaN     j     e  f
2  11.0  100.0  13.0  14.0   NaN  none     h  i
3  16.0   17.0  18.0   NaN  20.0     j  miss  l
4   1.0   22.0  23.0  24.0   NaN  none     n  o

__init__(value=mean)

Methods

__init__([value])
get_combined_params(*args, **kwargs) Merge all parameters and verify that they are valid
get_filtered_params(*args, **kwargs) Merge parameters whose keys belong to self
get_param(*names) Return a copy of the requested parameters
get_params(*names) Return a copy of the requested parameters
has_param(name) Does the parameter exist?
set_param(*args, **kwargs) Set one or more parameters
set_params(*args, **kwargs) Set one or more parameters
transform(table[, value]) Perform the imputation on the given data set

Attributes

MAX Constant that indicates the maximum data value of the column
MEAN Constant that indicates the mean data value of the column
MEDIAN Constant that indicates the median data value of the column
MIDRANGE Constant that indicates the midrange data value of the column
MIN Constant that indicates the minimum data value of the column
MODE Constant that indicates the mode data value of the column
RANDOM Constant that indicates that random data should be used
param_defs
static_params