pipefitter.transformer.imputer.Imputer

class pipefitter.transformer.imputer.Imputer(value=mean)

Bases: pipefitter.base.BaseImputer

Impute missing values in a data set

The replacement for a missing value can be either a statistic or a constant. To specify a statistic, use one of the following predefined constants on the Imputer class.

Imputer.MAX

Imputer.MEAN

Imputer.MEDIAN

Imputer.MIDRANGE

Imputer.MIN

Imputer.MODE

Imputer.RANDOM
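
As an illustration of what the statistic constants stand for, the following pure-Python sketch computes each one for a single numeric column. It is not pipefitter's implementation: `column_statistics` is a hypothetical helper, the midrange is taken as the average of the minimum and maximum, and ties for the mode are broken toward the smallest value, which is an assumption based on the mode example output further down. `Imputer.RANDOM` is left out because its exact behavior is not described here.

```python
import math
import statistics
from collections import Counter


def column_statistics(values):
    """Return candidate fill values for one numeric column, ignoring NaN.

    Hypothetical helper for illustration only; not part of pipefitter.
    """
    present = [v for v in values if not math.isnan(v)]

    # Mode with ties broken toward the smallest value (an assumption).
    counts = Counter(present)
    top = max(counts.values())
    mode = min(v for v, c in counts.items() if c == top)

    return {
        'MAX': max(present),
        'MEAN': statistics.mean(present),
        'MEDIAN': statistics.median(present),
        'MIDRANGE': (min(present) + max(present)) / 2,
        'MIN': min(present),
        'MODE': mode,
    }


# Column D of the sample data set used in the examples.
column_d = [4.0, 9.0, 14.0, float('nan'), 24.0]
stats = column_statistics(column_d)
print(stats)
```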

Parameters:

value : ImputerMethod or scalar or dict, optional

Specifies the value to use in place of missing values.

If an ImputerMethod is specified, that method is used for all missing values.

If a scalar is specified, that value is substituted for all missing values.

If a dict is specified, the keys correspond to the columns and the values are the substitution values (which may also be ImputerMethod instances).
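
The three accepted forms of `value` can be pictured with a small pure-Python sketch. It is only a model of the rules described above, not pipefitter's code: `impute`, `resolve_fill_values`, `is_missing`, and `mean` are hypothetical names, a plain callable stands in for an ImputerMethod, and only numeric columns are handled.

```python
import math


def is_missing(v):
    """Treat None and NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def mean(values):
    """Stand-in for a statistic method: mean of the non-missing values."""
    present = [v for v in values if not is_missing(v)]
    return sum(present) / len(present)


def resolve_fill_values(columns, value):
    """Map each targeted column name to a concrete fill value."""
    # A dict targets only its keys; anything else applies to every column.
    spec = value if isinstance(value, dict) else {name: value for name in columns}
    # Callables are evaluated per column; scalars are used as given.
    return {name: v(columns[name]) if callable(v) else v
            for name, v in spec.items()}


def impute(columns, value):
    """Return a copy of columns with missing entries replaced."""
    fills = resolve_fill_values(columns, value)
    return {
        name: [fills[name] if name in fills and is_missing(v) else v
               for v in vals]
        for name, vals in columns.items()
    }


nan = float('nan')
# Columns A and B of the sample data set used in the examples.
data = {'A': [1.0, 6.0, 11.0, 16.0, nan],
        'B': [2.0, nan, nan, 17.0, 22.0]}

by_method = impute(data, mean)                   # one statistic for all columns
by_scalar = impute(data, 100)                    # one constant for all columns
by_dict = impute(data, {'A': mean, 'B': 100})    # per-column mix
only_b = impute(data, {'B': 100})                # columns not listed stay as-is
```

In this model, a column that is absent from the dict keeps its missing values, which matches column E in the dict example below.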

Examples

Sample data set used for imputing examples:

>>> data.head()
      A     B     C     D     E  F  G  H
0   1.0   2.0   3.0   4.0   5.0  a  b  c
1   6.0   NaN   8.0   9.0   NaN  j  e  f
2  11.0   NaN  13.0  14.0   NaN     h  i
3  16.0  17.0  18.0   NaN  20.0  j     l
4   NaN  22.0  23.0  24.0   NaN     n  o

Impute values using the mean:

>>> meanimp = Imputer(Imputer.MEAN) 
>>> newdata = meanimp.transform(data)
>>> newdata.head()
      A          B     C      D     E  F  G  H
0   1.0   2.000000   3.0   4.00   5.0  a  b  c
1   6.0  13.666667   8.0   9.00  12.5  j  e  f
2  11.0  13.666667  13.0  14.00  12.5     h  i
3  16.0  17.000000  18.0  12.75  20.0  j     l
4   8.5  22.000000  23.0  24.00  12.5     n  o

Impute values using the mode:

>>> modeimp = Imputer(Imputer.MODE) 
>>> newdata = modeimp.transform(data)
>>> newdata.head()
      A     B     C     D     E  F  G  H
0   1.0   2.0   3.0   4.0   5.0  a  b  c
1   6.0   2.0   8.0   9.0   5.0  j  e  f
2  11.0   2.0  13.0  14.0   5.0  j  h  i
3  16.0  17.0  18.0   4.0  20.0  j  b  l
4   1.0  22.0  23.0  24.0   5.0  j  n  o

Impute a constant value:

>>> cimp = Imputer(100)
>>> newdata = cimp.transform(data)
>>> newdata.head()
       A      B     C      D      E  F  G  H
0    1.0    2.0   3.0    4.0    5.0  a  b  c
1    6.0  100.0   8.0    9.0  100.0  j  e  f
2   11.0  100.0  13.0   14.0  100.0     h  i
3   16.0   17.0  18.0  100.0   20.0  j     l
4  100.0   22.0  23.0   24.0  100.0     n  o

Impute values in specified columns:

>>> dimp = Imputer({'A': 1, 'B': 100,
...                 'F': 'none', 'G': 'miss'})
>>> newdata = dimp.transform(data)
>>> newdata.head()
      A      B     C     D     E     F     G  H
0   1.0    2.0   3.0   4.0   5.0     a     b  c
1   6.0  100.0   8.0   9.0   NaN     j     e  f
2  11.0  100.0  13.0  14.0   NaN  none     h  i
3  16.0   17.0  18.0   NaN  20.0     j  miss  l
4   1.0   22.0  23.0  24.0   NaN  none     n  o

__init__(value=mean)

Methods

`__init__`([value])
`get_combined_params`(*args, **kwargs)	Merge all parameters and verify that they are valid
`get_filtered_params`(*args, **kwargs)	Merge parameters whose keys belong to self
`get_param`(*names)	Return a copy of the requested parameters
`get_params`(*names)	Return a copy of the requested parameters
`has_param`(name)	Does the parameter exist?
`set_param`(*args, **kwargs)	Set one or more parameters
`set_params`(*args, **kwargs)	Set one or more parameters
`transform`(table[, value])	Perform the imputation on the given data set

Attributes

`MAX`	Constant that indicates the maximum data value of the column
`MEAN`	Constant that indicates the mean data value of the column
`MEDIAN`	Constant that indicates the median data value of the column
`MIDRANGE`	Constant that indicates the midrange data value of the column
`MIN`	Constant that indicates the minimum data value of the column
`MODE`	Constant that indicates the mode data value of the column
`RANDOM`	Constant that indicates that random data should be used
`param_defs`
`static_params`