Project Specific: Column Selection

In order to reference specific columns in a project’s dataset as input to a task, column selection (i.e. feature selection) is required.

Note: this technique is specifically for when preprocessing or estimation requires or should be applied to a specific column- not to be confused with feature lists, which allow an entire blueprint to make use of a subset of a project’s features.

The following demonstrates how to select a specific column from a project’s dataset and use it in a blueprint.

Selecting multiple columns or specifically excluding columns, to be used as input to a specific task is still being tested / validated, and will be added soon.

[1]:
import datarobot as dr
[2]:
from datarobot_bp_workshop import Workshop, Visualize
[4]:
with open('../../../api.token', 'r') as f:
    token = f.read()
    dr.Client(token=token, endpoint='https://app.datarobot.com/api/v2')

Initialize Workshop with a project_id

[5]:
w = Workshop(project_id='5eb9656901f6bb026828f14e')
Please upgrade to the latest version: pip install --upgrade datarobot_bp_workshop

Select a specific feature

[13]:
w.Features.Insurance_Type
[13]:
Single Column Converter, Select Feature: 'Insurance_Type' (SCPICK)

Input Summary: Categorical Data
Output Method: TaskOutputMethod.TRANSFORM
Select Feature: 'Insurance_Type'

Task Parameters:
  column_name (cn) = '496e737572616e63655f54797065'
[7]:
w.Feature('Insurance_Duration')
[7]:
Single Column Converter: 'Insurance_Duration' (SCPICK)

Input Summary: Categorical Data
Output Method: TaskOutputMethod.TRANSFORM

Task Parameters:
  column_name (cn) = '496e737572616e63655f4475726174696f6e'

Select multiple features

This can be done either by using w.Features.<feature name> (to leverage autocomplete) or '<feature name>' as shown below.

Note: all features must be the same input data type

[14]:
w.FeatureSelection(w.Features.Insurance_Duration, w.Features.Insurance_Type)
[14]:
Multiple Column Selector, Select Features: 'Insurance_Duration', 'Insurance_Type' (MCPICK)

Input Summary: Categorical Data
Output Method: TaskOutputMethod.TRANSFORM
Select Features: 'Insurance_Duration', 'Insurance_Type'

Task Parameters:
  column_names (cns) = ['496e737572616e63655f4475726174696f6e', '496e737572616e63655f54797065']
  method (method) = 'include'
[15]:
w.FeatureSelection('Insurance_Duration', 'Insurance_Type')
[15]:
Multiple Column Selector, Select Features: 'Insurance_Duration', 'Insurance_Type' (MCPICK)

Input Summary: Categorical Data
Output Method: TaskOutputMethod.TRANSFORM
Select Features: 'Insurance_Duration', 'Insurance_Type'

Task Parameters:
  column_names (cns) = ['496e737572616e63655f4475726174696f6e', '496e737572616e63655f54797065']
  method (method) = 'include'

Features may also be excluded instead, which will select all other features of the same type.

[16]:
w.FeatureSelection('Insurance_Duration', 'Insurance_Type', exclude=True)
[16]:
Multiple Column Selector, Exclude Features: 'Insurance_Duration', 'Insurance_Type' (MCPICK)

Input Summary: Categorical Data
Output Method: TaskOutputMethod.TRANSFORM
Exclude Features: 'Insurance_Duration', 'Insurance_Type'

Task Parameters:
  column_names (cns) = ['496e737572616e63655f4475726174696f6e', '496e737572616e63655f54797065']
  method (method) = 'exclude'

Build a blueprint with a specific feature

[8]:
pni = w.Tasks.PNI2(w.Features.Age)
rdt = w.Tasks.RDT5(pni)
binning = w.Tasks.BINNING(pni)
keras = w.Tasks.KERASC(rdt, binning)
keras.set_task_parameters_by_name(learning_rate=0.123)
keras_blueprint = w.BlueprintGraph(keras, name='A blueprint I made with the Python API')
[9]:
source_code = keras_blueprint.to_source_code(to_stdout=True)
w = Workshop(project_id='5eb9656901f6bb026828f14e')

age = w.Features.Age

pni2 = w.Tasks.PNI2(age)

binning = w.Tasks.BINNING(pni2)

rdt5 = w.Tasks.RDT5(pni2)

kerasc = w.Tasks.KERASC(binning, rdt5)
kerasc.set_task_parameters(learning_rate=0.123)

kerasc_blueprint = w.BlueprintGraph(kerasc, name='A blueprint I made with the Python API')
[10]:
exec(compile(source_code, 'blueprint', 'exec'), locals())
[11]:
kerasc_blueprint.show()
../../_images/examples_advanced_FeatureSelection_18_0.png
[12]:
w.set_project(project_id='605ab63ecd8a6669dfd64901')
[12]:
<workshop.workshop.Workshop at 0x7ffad3b7e650>
[14]:
kerasc_blueprint.train(w.project.id)
[14]:
Name: 'A blueprint I made with the Python API'

Input Data: Numeric
Tasks: Single Column Converter: 'Age' | Missing Values Imputed | Binning of numerical variables | Smooth Ridit Transform | Keras Neural Network Classifier
[101]:
starred_models = w.project.get_models(search_params=dict(is_starred=True))
[102]:
model_to_clone = starred_models[0].blueprint_id
[104]:
bp = w.clone(blueprint_id=model_to_clone, name='Now featuring selected columns!')
[105]:
bp.show()
../../_images/examples_advanced_FeatureSelection_24_0.png
[106]:
bp.delete()
Blueprint deleted.
[ ]: