Automated Data Type Recognition

In most ML scenarios, most of the development and deployment tasks deal with data input pipelines. A data pipeline handles data intake, linkage, type detection, and missing data imputation. SKSurrogate covers the later two stapes automatically and allows for customization as well. This is done via the DataProcess module.

Currently, the DataProcess module identifies the following data types automatically:

  • Binary
  • Categorical
  • Date/Time
  • Float
  • Integer
  • Label
  • Text
  • Objects

We note that the Object type could include various types which may have a known structure but are not implemented in the module yet.

The following example demonstrates the basic functions of the DataProcess module.

Example:

Randomly generated dataframe with various types of data:

import numpy as np
import pandas as pd
import random
from lorem_text import lorem

N = 100
categorical = np.array([random.choice(['Cat01', 'Cat02', 'Cat03', 'Cate04', None]) for _ in range(N)])
binary = np.array([random.choice(['Bin0', 'Bin1', None]) for _ in range(N)])
float1 = np.random.uniform(low=-2.0, high=4.0, size=N)
float2 = np.random.uniform(low=0.0, high=10.0, size=N)
int1 = np.random.randint(0, high=20, size=N)

def random_str(max_len=15):
    chars = [' '] + [chr(_) for _ in range(ord('a'), ord('z')+1)] + [' ']+  [chr(_) for _ in range(ord('A'), ord('Z')+1)] + [' ']
    ln = random.randint(0, max_len)
    rand_list = [random.choice(chars) for _ in range(ln)]
    return ''.join(rand_list)

def random_date(init_date, date_range=30):
    offset = random.randint(0, date_range)
    new_date = np.datetime64(init_date) + offset
    return new_date

strs = [random_str() for _ in range(N)]
dates = [random_date("2021-03-01", 60) for _ in range(N)]

texts = [lorem.sentence() for _ in range(N)]

frame = dict()
frame['categorical'] = categorical
frame['binary'] = binary
frame['float1'] = float1
frame['float2'] = float2
frame['int1'] = int1
frame['str'] = strs
frame['date'] = dates
frame['txt'] = texts
df = pd.DataFrame(frame)
df = df.astype({'txt':pd.StringDtype()})

Import and process the sample dataframe:

from SKSurrogate import *
A = DataPreprocess(df)
A.deduce_types()
A.deduced_types

which returns:

{'float64': ['float1', 'float2'],
 'int64': ['int1'],
 'datetime64': ['date'],
 'other': [],
 'text': ['txt'],
 'binary': ['binary'],
 'categorical': ['categorical'],
 'label': ['str'],
 'obsolete': []}

Then:

A.encode()
print(A.steps)

The output is a SK-Learn compatible pipeline:

[('OneHot',
  OneHotEncoder(cols=['categorical'], drop_invariant=True,
                handle_missing='return_nan', handle_unknown='return_nan')),
 ('Ordinal',
  OrdinalEncoder(cols=['binary', 'str'], handle_missing='return_nan',
                 handle_unknown='return_nan',
                 mapping=[{'col': 'binary', 'mapping': {'Bin0': 0, 'Bin1': 1}},
                          {'col': 'str',
                           'mapping': {'': 0, ' ': 1, ' DmTHDRQErIhF': 2,
                                       ' FmseqO': 3, ' j knr': 4, ' pcG': 5,
                                       'AVJVq nsqyHRpM': 6, 'AYihJxhUbN ': 7,
                                       'Agpg': 8, 'C': 9, 'CKJ': 10, 'CcnGK': 11,
                                       'D': 12, 'DkMstNYdjoRj ': 13, 'EITp': 14,
                                       'FAWrCVv': 15, 'FKgFwuGLmQqLR': 16,
                                       'FVtvoWBCEEi': 17, 'G T oVPh': 18,
                                       'GAxyFGqpzrJXe': 19, 'GYfNntcQww': 20,
                                       'HB euaV YFIb': 21, 'IENmSCFiAECp': 22,
                                       'IGZRolGBCKLsyg': 23,
                                       'J mNPFImkjd iRw': 24, 'JW': 25,
                                       'KSKpIlRm': 26, 'KevYeZyrsvwY': 27,
                                       'KhNjalpZkqxFGBC': 28,
                                       'KjtCfjg PZrx k ': 29, ...}}])),
 ('Date2Num', DateTime2Num(cols=['date'])),
 ('Impute', IterativeImputer())]