Deep Learning ToolKit

A Deep Learning ToolKit

This project is WIP.

Read the Docs

Install

pip install dlk

or 
git clone this repo and run

python setup.py install

What does this project do?

  • Provide a template for deep learning (especially NLP) training and deployment.

  • Provide hyperparameter search.

  • Provide basic architecture search.

  • Provide some basic modules and models.

  • Provide reuse of pretrained models for prediction.

More Features Are Coming

  • Generate models.
  • Distillation support.
  • Computer vision support.
  • Online service
    • Provide a web server for online prediction.
  • Allow different parameter groups of one optimizer to use different schedulers (diff_schedule).
  • Support LightGBM; maybe not necessary, and may be split into another package.
  • Make most modules (like CRF) scriptable.
  • Add unit tests
    • Parser
    • Tokenizer
    • Config
    • Link

dlk.core package

Subpackages

dlk.core.callbacks package

Submodules
dlk.core.callbacks.checkpoint module
class dlk.core.callbacks.checkpoint.CheckpointCallback(config: dlk.core.callbacks.checkpoint.CheckpointCallbackConfig)[source]

Bases: object

Save checkpoint decided by config

class dlk.core.callbacks.checkpoint.CheckpointCallbackConfig(config: Dict)[source]

Bases: object

Config for CheckpointCallback

Config Example:
>>> {
>>>     // default checkpoint configure
>>>     "_name": "checkpoint",
>>>     "config": {
>>>         "monitor": "*@*",    // monitor which metrics or log value
>>>         "save_top_k": 3,   //save top k
>>>         "mode": "*@*", //"max" or "min" select topk min or max checkpoint, min for loss, max for acc
>>>         "save_last": true,  //  always save last checkpoint
>>>         "auto_insert_metric_name": true, // whether to include the metric name in the saved file name
>>>         "every_n_train_steps": null, // Number of training steps between checkpoints.
>>>         "every_n_epochs": 1, //Number of epochs between checkpoints.
>>>         "save_on_train_epoch_end": false,// Whether to run checkpointing at the end of the training epoch. If this is False, then the check runs at the end of the validation.
>>>         "save_weights_only": false, // if true, save only the weights and skip other state such as the optimizer
>>>     }
>>> }
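A minimal usage sketch (not taken from the library docs): the JSON-with-comments example above written as a plain Python dict, with the "*@*" placeholders (monitor, mode) filled with assumed values, then passed to the config and callback classes:
>>> from dlk.core.callbacks.checkpoint import CheckpointCallback, CheckpointCallbackConfig
>>> config = CheckpointCallbackConfig({
>>>     "_name": "checkpoint",
>>>     "config": {
>>>         "monitor": "val_loss",          # assumed metric name for the "*@*" placeholder
>>>         "save_top_k": 3,
>>>         "mode": "min",                  # "min" because val_loss is a loss
>>>         "save_last": True,
>>>         "auto_insert_metric_name": True,
>>>         "every_n_train_steps": None,
>>>         "every_n_epochs": 1,
>>>         "save_on_train_epoch_end": False,
>>>         "save_weights_only": False,
>>>     },
>>> })
>>> callback = CheckpointCallback(config)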
dlk.core.callbacks.early_stop module
class dlk.core.callbacks.early_stop.EarlyStoppingCallback(config: dlk.core.callbacks.early_stop.EarlyStoppingCallbackConfig)[source]

Bases: object

Early stop decided by config

class dlk.core.callbacks.early_stop.EarlyStoppingCallbackConfig(config: Dict)[source]

Bases: object

Config for EarlyStoppingCallback

Config Example:
>>> {
>>>     "_name": "early_stop",
>>>     "config":{
>>>         "monitor": "val_loss",
>>>         "mode": "*@*", // min or max, min for the monitor is loss, max for the monitor is acc, f1, etc.
>>>         "patience": 3,
>>>         "min_delta": 0.0,
>>>         "check_on_train_epoch_end": null,
>>>         "strict": true, // if the monitor is not right, raise error
>>>         "stopping_threshold": null, // float, if the value is good enough, stop
>>>         "divergence_threshold": null, // float,  if the value is so bad, stop
>>>         "verbose": true, //verbose mode print more info
>>>     }
>>> }
dlk.core.callbacks.lr_monitor module
class dlk.core.callbacks.lr_monitor.LearningRateMonitorCallback(config: dlk.core.callbacks.lr_monitor.LearningRateMonitorCallbackConfig)[source]

Bases: object

Monitor the learning rate

class dlk.core.callbacks.lr_monitor.LearningRateMonitorCallbackConfig(config: Dict)[source]

Bases: object

Config for LearningRateMonitorCallback

Config Example:
>>> {
>>>     "_name": "lr_monitor",
>>>     "config": {
>>>         "logging_interval": null, // set to null to log at each scheduler's own interval (its "interval" key); other values: "step", "epoch"
>>>         "log_momentum": true, // log momentum or not
>>>     }
>>> }
dlk.core.callbacks.weight_average module
class dlk.core.callbacks.weight_average.StochasticWeightAveragingCallback(config: dlk.core.callbacks.weight_average.StochasticWeightAveragingCallbackConfig)[source]

Bases: object

Average weight by config

class dlk.core.callbacks.weight_average.StochasticWeightAveragingCallbackConfig(config)[source]

Bases: object

Config for StochasticWeightAveragingCallback

Config Example:
>>> {   //weight_average default
>>>     "_name": "weight_average",
>>>     "config": {
>>>         "swa_epoch_start": 0.8, // swa start epoch
>>>         "swa_lrs": null,
>>>             //None. Use the current learning rate of the optimizer at the time the SWA procedure starts.
>>>             //float. Use this value for all parameter groups of the optimizer.
>>>             //List[float]. A list values for each parameter group of the optimizer.
>>>         "annealing_epochs": 10,
>>>         "annealing_strategy": 'cos',
>>>         "device": null, // device to store the averaged weights on; if the GPU is OOM, change this to 'cpu'
>>>     }
>>> }
Module contents

callbacks

dlk.core.callbacks.import_callbacks(callbacks_dir, namespace)[source]

dlk.core.imodels package

Submodules
dlk.core.imodels.basic module
class dlk.core.imodels.basic.BasicIModel(config: dlk.core.imodels.basic.BasicIModelConfig, checkpoint=False)[source]

Bases: pytorch_lightning.core.lightning.LightningModule, dlk.core.imodels.GatherOutputMixin

configure_optimizers()[source]

Configure the optimizer and scheduler

property epoch_training_steps: int

Training steps per epoch, inferred from the datamodule and devices.

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do forward on a mini batch

Parameters

batch – a mini batch inputs

Returns

the outputs

get_progress_bar_dict()[source]

rewrite the progress_bar dict and remove ‘v_num’, which we don’t need

Returns

progress_bar dict

property num_training_epochs: int

Total training epochs inferred from datamodule and devices.

property num_training_steps: int

Total training steps inferred from datamodule and devices.

predict_step(batch: Dict, batch_idx: int) Dict[source]

do predict on a mini batch

Parameters
  • batch – a mini batch inputs

  • batch_idx – the index(dataloader) of the mini batch

Returns

the outputs

test_epoch_end(outputs: List[Dict]) List[Dict][source]

Gather the outputs of all nodes and postprocess them.

Parameters

outputs – the output list returned by the current node

Returns

all node outputs

test_step(batch: Dict[str, torch.Tensor], batch_idx: int) Dict[source]

do test on a mini batch

The outputs only gather the keys in self.gather_data.keys for postprocess.

Parameters
  • batch – a mini batch inputs

  • batch_idx – the index(dataloader) of the mini batch

Returns

the outputs

training: bool
training_step(batch: Dict[str, torch.Tensor], batch_idx: int)[source]

do training_step on a mini batch

Parameters
  • batch – a mini batch inputs

  • batch_idx – the index(dataloader) of the mini batch

Returns

the outputs

validation_epoch_end(outputs: List[Dict]) List[Dict][source]

Gather the outputs of all nodes and postprocess them.

The outputs only gather the keys in self.gather_data.keys for postprocess.

Parameters

outputs – the output list returned by the current node

Returns

all node outputs

validation_step(batch: Dict[str, torch.Tensor], batch_idx: int) Dict[str, torch.Tensor][source]

do validation on a mini batch

The outputs only gather the keys in self.gather_data.keys for postprocess.

Parameters
  • batch – a mini batch inputs

  • batch_idx – the index(dataloader) of the mini batch

Returns

the outputs

class dlk.core.imodels.basic.BasicIModelConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Basic imodel config; provides all the config for model/optimizer/loss/scheduler/postprocess

get_loss(config: Dict)[source]

Use config to init the loss

Parameters

config – loss config

Returns

Loss, LossConfig

get_model(config: Dict)[source]

Use config to init the model

Parameters

config – model config

Returns

Model, ModelConfig

get_optimizer(config: Dict)[source]

Use config to init the optimizer

Parameters

config – optimizer config

Returns

Optimizer, OptimizerConfig

get_postprocessor(config: Dict)[source]

Use config to init the postprocessor

Parameters

config – postprocess config

Returns

PostProcess, PostProcessConfig

get_scheduler(config: Dict)[source]

Use config to init the scheduler

Parameters

config – scheduler config

Returns

Scheduler, SchedulerConfig

dlk.core.imodels.distill module
Module contents

imodels

class dlk.core.imodels.GatherOutputMixin[source]

Bases: object

gather all the small batches' outputs into one big batch

concat_list_of_dict_outputs(outputs: List[Dict]) Dict[source]

only supports outputs that all have the same dim; now deprecated.

Parameters

outputs – multi node returned output (list of dict)

Returns

Concat all list by name

gather_outputs(outputs: List[Dict])[source]

gather the dist outputs

Parameters

outputs – one node outputs

Returns

all outputs

static proc_dist_outputs(dist_outputs: List[Dict]) List[Dict][source]

gather all distributed outputs into outputs that look like they came from a single worker.

Parameters

dist_outputs – the inputs of pytorch_lightning train/test/.._epoch_end when using ddp

Returns

the inputs of pytorch_lightning train/test/.._epoch_end when only run on one worker.
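To make the gathering concrete, here is an illustrative (non-library) sketch of merging a list of per-batch output dicts into one dict by concatenating each key, which is the kind of reshaping these helpers perform:
>>> import torch
>>> from typing import Dict, List
>>>
>>> def concat_by_name(outputs: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
>>>     # illustrative only: concat every key across a list of batch-output dicts
>>>     return {name: torch.cat([out[name] for out in outputs], dim=0) for name in outputs[0]}
>>>
>>> outputs = [{"logits": torch.randn(8, 2)}, {"logits": torch.randn(8, 2)}]
>>> concat_by_name(outputs)["logits"].shape
torch.Size([16, 2])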

dlk.core.imodels.import_imodels(imodels_dir, namespace)[source]

dlk.core.initmethods package

Submodules
dlk.core.initmethods.default module
class dlk.core.initmethods.default.DefaultInit(config: dlk.core.initmethods.default.DefaultInitConfig)[source]

Bases: object

default method for init the modules

init_lstm(lstm)[source]

Initialize lstm

class dlk.core.initmethods.default.DefaultInitConfig(config)[source]

Bases: dlk.utils.config.BaseConfig

Config for DefaultInit

Config Example:
>>> {
>>>     "_name": "default",
>>>     "config": {
>>>     }
>>> }
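A minimal usage sketch (assuming the empty default config shown above), applying the documented init_lstm helper to a torch LSTM:
>>> import torch.nn as nn
>>> from dlk.core.initmethods.default import DefaultInit, DefaultInitConfig
>>>
>>> init = DefaultInit(DefaultInitConfig({"_name": "default", "config": {}}))
>>> lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)
>>> init.init_lstm(lstm)    # initialize the LSTM weights with the default scheme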
dlk.core.initmethods.range_norm module
class dlk.core.initmethods.range_norm.RangeNormInit(config: dlk.core.initmethods.range_norm.RangeNormInitConfig)[source]

Bases: object

default for transformers init method

class dlk.core.initmethods.range_norm.RangeNormInitConfig(config)[source]

Bases: dlk.utils.config.BaseConfig

Config for RangeNormInit

Config Example:
>>> {
>>>     "_name": "range_norm",
>>>     "config": {
>>>         "range": 0.1,
>>>     }
>>> }
dlk.core.initmethods.range_uniform module
class dlk.core.initmethods.range_uniform.RangeUniformInit(config: dlk.core.initmethods.range_uniform.RangeUniformInitConfig)[source]

Bases: object

for transformers

class dlk.core.initmethods.range_uniform.RangeUniformInitConfig(config)[source]

Bases: dlk.utils.config.BaseConfig

Config for RangeUniformInit

Config Example:
>>> {
>>>     "_name": "range_uniform",
>>>     "config": {
>>>         "range": 0.1,
>>>     }
>>> }
Module contents

initmethods

dlk.core.initmethods.import_initmethods(initmethods_dir, namespace)[source]

dlk.core.layers package

Subpackages
dlk.core.layers.decoders package
Submodules
dlk.core.layers.decoders.identity module
class dlk.core.layers.decoders.identity.IdentityDecoder(config: dlk.core.layers.decoders.identity.IdentityDecoderConfig)[source]

Bases: dlk.core.base_module.SimpleModule

Do nothing

forward(inputs)[source]

return inputs

Parameters

inputs – anything

Returns

inputs

training: bool
class dlk.core.layers.decoders.identity.IdentityDecoderConfig(config)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for IdentityDecoder

Config Example:
>>> {
>>>     "config": {
>>>     },
>>>     "_name": "identity",
>>> }
dlk.core.layers.decoders.linear module
class dlk.core.layers.decoders.linear.Linear(config: dlk.core.layers.decoders.linear.LinearConfig)[source]

Bases: dlk.core.base_module.SimpleModule

wrap for torch.nn.Linear

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

All steps do this

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

init_weight(method: Callable)[source]

init the weight of submodules by ‘method’

Parameters

method – init method

Returns

None

training: bool
class dlk.core.layers.decoders.linear.LinearConfig(config: Dict)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for Linear

Config Example:
>>> {
>>>     "module": {
>>>         "_base": "linear",
>>>     },
>>>     "config": {
>>>         "input_size": "*@*",
>>>         "output_size": "*@*",
>>>         "pool": null,
>>>         "dropout": 0.0,
>>>         "output_map": {},
>>>         "input_map": {}, // required_key: provide_key
>>>     },
>>>     "_link":{
>>>         "config.input_size": ["module.config.input_size"],
>>>         "config.output_size": ["module.config.output_size"],
>>>         "config.pool": ["module.config.pool"],
>>>         "config.dropout": ["module.config.dropout"],
>>>     },
>>>     "_name": "linear",
>>> }
dlk.core.layers.decoders.linear_crf module
class dlk.core.layers.decoders.linear_crf.LinearCRF(config: dlk.core.layers.decoders.linear_crf.LinearCRFConfig)[source]

Bases: dlk.core.base_module.BaseModule

use torch.nn.Linear to get the emission probabilities and feed them to the CRF

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do predict, only get the predicted labels

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

init_weight(method: Callable)[source]

init the weight of submodules by ‘method’

Parameters

method – init method

Returns

None

predict_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do predict, only get the predicted labels

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

training: bool
training_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do training step, get the crf loss

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

validation_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do validation step, get the crf loss and the predicted labels

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

class dlk.core.layers.decoders.linear_crf.LinearCRFConfig(config: Dict)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for LinearCRF

Config Example:
>>> {
>>>     "module@linear": {
>>>         "_base": "linear",
>>>     },
>>>     "module@crf": {
>>>         "_base": "crf",
>>>     },
>>>     "config": {
>>>         "input_size": "*@*",  // the linear input_size
>>>         "output_size": "*@*", // the linear output_size
>>>         "reduction": "mean", // crf reduction method
>>>         "output_map": {}, //provide_key: output_key
>>>         "input_map": {} // required_key: provide_key
>>>     },
>>>     "_link":{
>>>         "config.input_size": ["module@linear.config.input_size"],
>>>         "config.output_size": ["module@linear.config.output_size", "module@crf.config.output_size"],
>>>         "config.reduction": ["module@crf.config.reduction"],
>>>     }
>>>     "_name": "linear_crf",
>>> }
Module contents

decoders

dlk.core.layers.decoders.import_decoders(decoders_dir, namespace)[source]
dlk.core.layers.embeddings package
Submodules
dlk.core.layers.embeddings.combine_word_char_cnn module
class dlk.core.layers.embeddings.combine_word_char_cnn.CombineWordCharCNNEmbedding(config: dlk.core.layers.embeddings.combine_word_char_cnn.CombineWordCharCNNEmbeddingConfig)[source]

Bases: dlk.core.base_module.SimpleModule

from ‘input_ids’ and ‘char_ids’ generate ‘embedding’

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

get the combine char and word embedding

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

init_weight(method: Callable)[source]

init the weight of submodules by ‘method’

Parameters

method – init method

Returns

None

training: bool
class dlk.core.layers.embeddings.combine_word_char_cnn.CombineWordCharCNNEmbeddingConfig(config: Dict)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for CombineWordCharCNNEmbedding

Config Example:
>>> {
>>>     "_name": "combine_word_char_cnn",
>>>     "embedding@char": {
>>>         "_base": "static_char_cnn",
>>>     },
>>>     "embedding@word": {
>>>         "_base": "static",
>>>     },
>>>     "config": {
>>>         "word": {
>>>             "embedding_file": "*@*", //the embedding file, must be saved as numpy array by pickle
>>>             "embedding_dim": "*@*",
>>>             "embedding_trace": ".", //default the file itself is the embedding
>>>             "freeze": false, // is freeze
>>>             "padding_idx": 0, // padding index
>>>             "output_map": {"embedding": "word_embedding"},
>>>             "input_map": {}, // required_key: provide_key
>>>         },
>>>         "char": {
>>>             "embedding_file": "*@*", //the embedding file, must be saved as numpy array by pickle
>>>             "embedding_dim": 35, // char embedding dim
>>>             "embedding_trace": ".", //default the file itself is the embedding
>>>             "freeze": false, // is freeze
>>>             "kernel_sizes": [3], // cnn kernel sizes
>>>             "padding_idx": 0,
>>>             "output_map": {"char_embedding": "char_embedding"},
>>>             "input_map": {"char_ids": "char_ids"},
>>>         },
>>>         "dropout": 0, //dropout rate
>>>         "embedding_dim": "*@*", // this must equal char.embedding_dim + word.embedding_dim
>>>         "output_map": {"embedding": "embedding"}, // this default does nothing; you can change it
>>>         "input_map": {"char_embedding": "char_embedding", 'word_embedding': "word_embedding"}, // if the outputs of the char and word embeddings are changed, you should change this too
>>>     },
>>>     "_link":{
>>>         "config.word.embedding_file": ["embedding@word.config.embedding_file"],
>>>         "config.word.embedding_dim": ["embedding@word.config.embedding_dim"],
>>>         "config.word.embedding_trace": ["embedding@word.config.embedding_trace"],
>>>         "config.word.freeze": ["embedding@word.config.freeze"],
>>>         "config.word.padding_idx": ["embedding@word.config.padding_idx"],
>>>         "config.word.output_map": ["embedding@word.config.output_map"],
>>>         "config.word.input_map": ["embedding@word.config.input_map"],
>>>         "config.char.embedding_file": ["embedding@char.config.embedding_file"],
>>>         "config.char.embedding_dim": ["embedding@char.config.embedding_dim"],
>>>         "config.char.embedding_trace": ["embedding@char.config.embedding_trace"],
>>>         "config.char.freeze": ["embedding@char.config.freeze"],
>>>         "config.char.kernel_sizes": ["embedding@char.config.kernel_sizes"],
>>>         "config.char.padding_idx": ["embedding@char.config.padding_idx"],
>>>         "config.char.output_map": ["embedding@char.config.output_map"],
>>>         "config.char.input_map": ["embedding@char.config.input_map"],
>>>     },
>>> }
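As noted in the example above, the combined embedding_dim must equal char.embedding_dim + word.embedding_dim; a quick check with assumed example values:
>>> word_dim = 100                      # assumed value for config["word"]["embedding_dim"]
>>> char_dim = 35                       # config["char"]["embedding_dim"] from the example above
>>> word_dim + char_dim                 # goes into config["embedding_dim"]
135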
dlk.core.layers.embeddings.identity module
class dlk.core.layers.embeddings.identity.IdentityEmbedding(config: dlk.core.layers.embeddings.identity.IdentityEmbeddingConfig)[source]

Bases: dlk.core.base_module.SimpleModule

Do nothing

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

return inputs

Parameters

inputs – anything

Returns

inputs

training: bool
class dlk.core.layers.embeddings.identity.IdentityEmbeddingConfig(config)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for IdentityEmbedding

Config Example:
>>> {
>>>     "config": {
>>>     },
>>>     "_name": "identity",
>>> }
dlk.core.layers.embeddings.pretrained_transformers module
class dlk.core.layers.embeddings.pretrained_transformers.PretrainedTransformers(config: dlk.core.layers.embeddings.pretrained_transformers.PretrainedTransformersConfig)[source]

Bases: dlk.core.base_module.SimpleModule

Wrap the huggingface transformers

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

get the transformers output as embedding

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

init_weight(method)[source]

init the weight of submodules by ‘method’

Parameters

method – init method

Returns

None

training: bool
class dlk.core.layers.embeddings.pretrained_transformers.PretrainedTransformersConfig(config: Dict)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for PretrainedTransformers

Config Example1:
>>> {
>>>     "module": {
>>>         "_base": "roberta",
>>>     },
>>>     "config": {
>>>         "pretrained_model_path": "*@*",
>>>         "input_map": {
>>>             "input_ids": "input_ids",
>>>             "attention_mask": "attention_mask",
>>>             "type_ids": "type_ids",
>>>         },
>>>         "output_map": {
>>>             "embedding": "embedding",
>>>         },
>>>         "dropout": 0, //dropout rate
>>>         "embedding_dim": "*@*",
>>>     },
>>>     "_link": {
>>>         "config.pretrained_model_path": ["module.config.pretrained_model_path"],
>>>     },
>>>     "_name": "pretrained_transformers",
>>> }
Config Example2:
>>> for gather embedding
>>> {
>>>     "module": {
>>>         "_base": "roberta",
>>>     },
>>>     "config": {
>>>         "pretrained_model_path": "*@*",
>>>         "input_map": {
>>>             "input_ids": "input_ids",
>>>             "attention_mask": "subword_mask",
>>>             "type_ids": "type_ids",
>>>             "gather_index": "gather_index",
>>>         },
>>>         "output_map": {
>>>             "embedding": "embedding",
>>>         },
>>>         "embedding_dim": "*@*",
>>>         "dropout": 0, //dropout rate
>>>     },
>>>     "_link": {
>>>         "config.pretrained_model_path": ["module.config.pretrained_model_path"],
>>>     },
>>>     "_name": "pretrained_transformers",
>>> }
dlk.core.layers.embeddings.random module
class dlk.core.layers.embeddings.random.RandomEmbedding(config: dlk.core.layers.embeddings.random.RandomEmbeddingConfig)[source]

Bases: dlk.core.base_module.SimpleModule

from ‘input_ids’ generate ‘embedding’

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

get the random embedding

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

init_weight(method: Callable)[source]

init the weight of submodules by ‘method’

Parameters

method – init method

Returns

None

training: bool
class dlk.core.layers.embeddings.random.RandomEmbeddingConfig(config: Dict)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for RandomEmbedding

Config Example:
>>> {
>>>     "config": {
>>>         "vocab_size": "*@*",
>>>         "embedding_dim": "*@*",
>>>         "dropout": 0, //dropout rate
>>>         "padding_idx": 0, // padding index
>>>         "output_map": {},
>>>         "input_map": {},
>>>     },
>>>     "_name": "random",
>>> }
dlk.core.layers.embeddings.static module
class dlk.core.layers.embeddings.static.StaticEmbedding(config: dlk.core.layers.embeddings.static.StaticEmbeddingConfig)[source]

Bases: dlk.core.base_module.SimpleModule

from ‘input_ids’ generate static ‘embedding’ like glove, word2vec

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

get the pretrained static embedding like glove word2vec

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

init_weight(method)[source]

init the weight of submodules by ‘method’

Parameters

method – init method

Returns

None

training: bool
class dlk.core.layers.embeddings.static.StaticEmbeddingConfig(config: Dict)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for StaticEmbedding

Config Example:
>>> {
>>>     "config": {
>>>         "embedding_file": "*@*", //the embedding file, must be saved as numpy array by pickle
>>>         "embedding_dim": "*@*",
>>>         //if the embedding_file is a dict, you should provide the dict trace to embedding
>>>         "embedding_trace": ".", //default the file itself is the embedding
>>>         /*embedding_trace: "embedding", //this means the <embedding = pickle.load(embedding_file)["embedding"]>*/
>>>         /*embedding_trace: "meta.embedding", //this means the <embedding = pickle.load(embedding_file)['meta']["embedding"]>*/
>>>         "freeze": false, // is freeze
>>>         "padding_idx": 0, // padding index
>>>         "dropout": 0, //dropout rate
>>>         "output_map": {},
>>>         "input_map": {}, // required_key: provide_key
>>>     },
>>>     "_name": "static",
>>> }
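Since the embedding_file must be saved as a numpy array by pickle, here is an illustrative sketch (file names and sizes are placeholders) of producing files that match the two embedding_trace variants mentioned above:
>>> import pickle
>>> import numpy as np
>>>
>>> embedding = np.random.randn(30000, 128).astype(np.float32)   # vocab_size x embedding_dim
>>> with open("token_embedding.pkl", "wb") as f:
>>>     pickle.dump(embedding, f)                                 # embedding_trace: "."
>>> with open("meta_embedding.pkl", "wb") as f:
>>>     pickle.dump({"meta": {"embedding": embedding}}, f)        # embedding_trace: "meta.embedding"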
dlk.core.layers.embeddings.static_char_cnn module
class dlk.core.layers.embeddings.static_char_cnn.StaticCharCNNEmbedding(config: dlk.core.layers.embeddings.static_char_cnn.StaticCharCNNEmbeddingConfig)[source]

Bases: dlk.core.base_module.SimpleModule

from ‘char_ids’ generate ‘embedding’

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

fit the char embedding to cnn and pool to word_embedding

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

init_weight(method)[source]

init the weight of submodules by ‘method’

Parameters

method – init method

Returns

None

training: bool
class dlk.core.layers.embeddings.static_char_cnn.StaticCharCNNEmbeddingConfig(config: Dict)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for StaticCharCNNEmbedding

Config Example:
>>> {
>>>     "module@cnn": {
>>>         "_base": "conv1d",
>>>         config: {
>>>             in_channels: -1,
>>>             out_channels: -1,  //will update while load embedding
>>>             kernel_sizes: [3],
>>>         },
>>>     },
>>>     "config": {
>>>         "embedding_file": "*@*", //the embedding file, must be saved as numpy array by pickle
>>>         //if the embedding_file is a dict, you should provide the dict trace to embedding
>>>         "embedding_trace": ".", //default the file itself is the embedding
>>>         /*embedding_trace: "char_embedding", //this means the <embedding = pickle.load(embedding_file)["char_embedding"]>*/
>>>         /*embedding_trace: "meta.char_embedding", //this means the <embedding = pickle.load(embedding_file)['meta']["char_embedding"]>*/
>>>         "freeze": false, // is freeze
>>>         "dropout": 0, //dropout rate
>>>         "embedding_dim": 35, // char embedding dim
>>>         "kernel_sizes": [3], // cnn kernel sizes
>>>         "padding_idx": 0,
>>>         "output_map": {"char_embedding": "char_embedding"},
>>>         "input_map": {"char_ids": "char_ids"},
>>>     },
>>>     "_link":{
>>>         "config.embedding_dim": ["module@cnn.config.in_channels", "module@cnn.config.out_channels"],
>>>         "config.kernel_sizes": ["module@cnn.config.kernel_sizes"],
>>>     },
>>>     "_name": "static_char_cnn",
>>> }
Module contents

embeddings

class dlk.core.layers.embeddings.EmbeddingInput(**args)[source]

Bases: object

docstring for EmbeddingInput

class dlk.core.layers.embeddings.EmbeddingOutput(**args)[source]

Bases: object

docstring for EmbeddingOutput

dlk.core.layers.embeddings.import_embeddings(embeddings_dir, namespace)[source]
dlk.core.layers.encoders package
Submodules
dlk.core.layers.encoders.identity module
class dlk.core.layers.encoders.identity.IdentityEncoder(config: dlk.core.layers.encoders.identity.IdentityEncoderConfig)[source]

Bases: dlk.core.base_module.SimpleModule

Do nothing

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

return inputs

Parameters

inputs – anything

Returns

inputs

training: bool
class dlk.core.layers.encoders.identity.IdentityEncoderConfig(config)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for IdentityEncoder

Config Example:
>>> {
>>>     "config": {
>>>     },
>>>     "_name": "identity",
>>> }
dlk.core.layers.encoders.linear module
class dlk.core.layers.encoders.linear.Linear(config: dlk.core.layers.encoders.linear.LinearConfig)[source]

Bases: dlk.core.base_module.SimpleModule

wrap for torch.nn.Linear

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

All steps do this

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

init_weight(method: Callable)[source]

init the weight of submodules by ‘method’

Parameters

method – init method

Returns

None

training: bool
class dlk.core.layers.encoders.linear.LinearConfig(config: Dict)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for Linear

Config Example:
>>> {
>>>     "module": {
>>>         "_base": "linear",
>>>     },
>>>     "config": {
>>>         "input_size": "*@*",
>>>         "output_size": "*@*",
>>>         "pool": null,
>>>         "dropout": 0.0,
>>>         "output_map": {},
>>>         "input_map": {}, // required_key: provide_key
>>>     },
>>>     "_link":{
>>>         "config.input_size": ["module.config.input_size"],
>>>         "config.output_size": ["module.config.output_size"],
>>>         "config.pool": ["module.config.pool"],
>>>     },
>>>     "_name": "linear",
>>> }
dlk.core.layers.encoders.lstm module
class dlk.core.layers.encoders.lstm.LSTM(config: dlk.core.layers.encoders.lstm.LSTMConfig)[source]

Bases: dlk.core.base_module.SimpleModule

Wrap for torch.nn.LSTM

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

All steps do this

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

init_weight(method: Callable)[source]

init the weight of submodules by ‘method’

Parameters

method – init method

Returns

None

training: bool
class dlk.core.layers.encoders.lstm.LSTMConfig(config: Dict)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for LSTM

Config Example:
>>> {
>>>     module: {
>>>         _base: "lstm",
>>>     },
>>>     config: {
>>>         input_map: {},
>>>         output_map: {},
>>>         input_size: *@*,
>>>         output_size: "*@*",
>>>         num_layers: 1,
>>>         dropout: "*@*", // dropout between layers
>>>     },
>>>     _link: {
>>>         config.input_size: [module.config.input_size],
>>>         config.output_size: [module.config.output_size],
>>>         config.dropout: [module.config.dropout],
>>>     },
>>>     _name: "lstm",
>>> }
Module contents

encoders

dlk.core.layers.encoders.import_encoders(encoders_dir, namespace)[source]
Module contents

dlk.core.losses package

Submodules
dlk.core.losses.bce module
class dlk.core.losses.bce.BCEWithLogitsLoss(config: dlk.core.losses.bce.BCEWithLogitsLossConfig)[source]

Bases: object

binary crossentropy for bi-class classification

calc(result, inputs, rt_config)[source]

calc the loss; the prediction is from result, the ground truth is from inputs

Parameters
  • result – the model predict dict

  • inputs – all the inputs for the model

  • rt_config – provide the current training status, e.g. {"current_step": self.global_step, "current_epoch": self.current_epoch, "total_steps": self.num_training_steps, "total_epochs": self.num_training_epochs}

Returns

loss

update_config(rt_config: Dict)[source]

callback for imodel to update the total steps and epochs

when the loss module is initialized, the total steps and epochs are not known; when all the data is ready, the imodel updates these values for the loss module

Parameters

rt_config – { “total_steps”: self.num_training_steps, “total_epochs”: self.num_training_epochs}

Returns

None

class dlk.core.losses.bce.BCEWithLogitsLossConfig(config: Dict)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for BCEWithLogitsLoss

Config Example:
>>> {
>>>     "config": {
>>>         "pred_truth_pair": [], # len(.) == 2, the 1st is the pred_name, 2nd is truth_name in __call__ inputs
>>>         "schedule": [1],
>>>         "masked_select": null, // if provide, only select the masked(=1) data
>>>         "scale": [1], # scale the loss for every schedule stage
>>>         // "schedule": [0.3, 1.0], # can be a list or str
>>>         // "scale": "[0.5, 1]",
>>>     },
>>>     "_name": "bce",
>>> }
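The docs here do not spell out how schedule and scale interact; one hypothetical reading, treating each schedule entry as a fraction of total training steps and applying the matching scale to the loss, is sketched below:
>>> def staged_scale(current_step, total_steps, schedule=(0.3, 1.0), scale=(0.5, 1.0)):
>>>     # hypothetical: return the loss scale of the current schedule stage
>>>     progress = current_step / max(total_steps, 1)
>>>     for stage, boundary in enumerate(schedule):
>>>         if progress <= boundary:
>>>             return scale[stage]
>>>     return scale[-1]
>>>
>>> staged_scale(100, 1000)   # 10% of training -> first stage, scale 0.5
0.5
>>> staged_scale(800, 1000)   # 80% of training -> second stage, scale 1.0
1.0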
dlk.core.losses.cross_entropy module
class dlk.core.losses.cross_entropy.CrossEntropyLoss(config: dlk.core.losses.cross_entropy.CrossEntropyLossConfig)[source]

Bases: object

for multi class classification

calc(result, inputs, rt_config)[source]

calc the loss; the prediction is from result, the ground truth is from inputs

Parameters
  • result – the model predict dict

  • inputs – all the inputs for the model

  • rt_config – provide the current training status, e.g. {"current_step": self.global_step, "current_epoch": self.current_epoch, "total_steps": self.num_training_steps, "total_epochs": self.num_training_epochs}

Returns

loss

update_config(rt_config)[source]

callback for imodel to update the total steps and epochs

when the loss module is initialized, the total steps and epochs are not known; when all the data is ready, the imodel updates these values for the loss module

Parameters

rt_config – { “total_steps”: self.num_training_steps, “total_epochs”: self.num_training_epochs}

Returns

None

class dlk.core.losses.cross_entropy.CrossEntropyLossConfig(config: Dict)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for CrossEntropyLoss

Config Example:
>>> {
>>>     "config": {
>>>         "ignore_index": -1,
>>>         "weight": null, # or a list of value for every class
>>>         "label_smoothing": 0.0, # torch>=1.10
>>>         "pred_truth_pair": [], # len(.) == 2, the 1st is the pred_name, 2nd is truth_name in __call__ inputs
>>>         "schedule": [1],
>>>         "scale": [1], # scale the loss for every schedule stage
>>>         // "schedule": [0.3, 1.0], # can be a list or str
>>>         // "scale": "[0.5, 1]",
>>>     },
>>>     "_name": "cross_entropy",
>>> }
dlk.core.losses.identity module
class dlk.core.losses.identity.IdentityLoss(config: dlk.core.losses.identity.IdentityLossConfig)[source]

Bases: object

gather the loss and return it when the loss is calculated in a previous module like CRF

calc(result, inputs, rt_config)[source]

calc the loss; the prediction is from result, the ground truth is from inputs

Parameters
  • result – the model predict dict

  • inputs – all the inputs for the model

  • rt_config – provide the current training status, e.g. {"current_step": self.global_step, "current_epoch": self.current_epoch, "total_steps": self.num_training_steps, "total_epochs": self.num_training_epochs}

Returns

loss

update_config(rt_config)[source]

callback for imodel to update the total steps and epochs

when the loss module is initialized, the total steps and epochs are not known; when all the data is ready, the imodel updates these values for the loss module

Parameters

rt_config – { “total_steps”: self.num_training_steps, “total_epochs”: self.num_training_epochs}

Returns

None

class dlk.core.losses.identity.IdentityLossConfig(config: Dict)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for IdentityLoss

Config Example:
>>> {
>>>     config: {
>>>         "schedule": [1],
>>>         "scale": [1], # scale the loss for every schedule
>>>         // "schedule": [0.3, 1.0], # can be a list or str
>>>         // "scale": "[0.5, 1]",
>>>         "loss": "loss", // the real loss from result['loss']
>>>     },
>>>     _name: "identity",
>>> }
dlk.core.losses.mse module
class dlk.core.losses.mse.MSELoss(config: dlk.core.losses.mse.MSELossConfig)[source]

Bases: object

mse loss for regression, distill, etc.

calc(result, inputs, rt_config)[source]

calc the loss; the prediction is from result, the ground truth is from inputs

Parameters
  • result – the model predict dict

  • inputs – all the inputs for the model

  • rt_config – provide the current training status, e.g. {"current_step": self.global_step, "current_epoch": self.current_epoch, "total_steps": self.num_training_steps, "total_epochs": self.num_training_epochs}

Returns

loss

update_config(rt_config)[source]

callback for imodel to update the total steps and epochs

when the loss module is initialized, the total steps and epochs are not known; when all the data is ready, the imodel updates these values for the loss module

Parameters

rt_config – { “total_steps”: self.num_training_steps, “total_epochs”: self.num_training_epochs}

Returns

None

class dlk.core.losses.mse.MSELossConfig(config: Dict)[source]

Bases: dlk.core.base_module.BaseModuleConfig

Config for MSELoss

Config Example:
>>> {
>>>     "config": {
>>>         "pred_truth_pair": [], # len(.) == 2, the 1st is the pred_name, 2nd is truth_name in __call__ inputs
>>>         "schedule": [1],
>>>         "masked_select": null, // if provide, only select the masked(=1) data
>>>         "scale": [1], # scale the loss for every schedule stage
>>>         // "schedule": [0.3, 1.0], # can be a list or str
>>>         // "scale": "[0.5, 1]",
>>>     },
>>>     "_name": "mse",
>>> }
dlk.core.losses.multi_loss module
class dlk.core.losses.multi_loss.MultiLoss(config: dlk.core.losses.multi_loss.MultiLossConfig)[source]

Bases: object

This module is not implemented yet; don’t use it

calc(result, inputs, rt_config)[source]

calc the loss; the prediction is from result, the ground truth is from inputs

Parameters
  • result – the model predict dict

  • inputs – all the inputs for the model

  • rt_config – provide the current training status, e.g. {"current_step": self.global_step, "current_epoch": self.current_epoch, "total_steps": self.num_training_steps, "total_epochs": self.num_training_epochs}

Returns

loss

get_loss(config)[source]

Use config to init the loss

Parameters

config – loss config

Returns

the Loss and the LossConfig

class dlk.core.losses.multi_loss.MultiLossConfig(config: Dict)[source]

Bases: object

Config for MultiLoss

Config Example:
>>> {
>>>     "loss@the_first": {
>>>         config: {
>>>             "ignore_index": -1,
>>>             "weight": null, # or a list of value for every class
>>>             "label_smoothing": 0.0, # torch>=1.10
>>>             "pred_truth_pair": ["logits1", "label1"], # len(.) == 2, the 1st is the pred_name, 2nd is truth_name in __call__ inputs
>>>             "schedule": [0.3, 0.6, 1],
>>>             "scale": [1, 0, 0.5], # scale the loss for every schedule
>>>             // "schedule": [0.3, 1.0],
>>>             // "scale": [0, 1, 0.5], # scale the loss
>>>         },
>>>         _name: "cross_entropy",
>>>     },
>>>     "loss@the_second": {
>>>         config: {
>>>             "pred_truth_pair": ["logits2", "label2"], # len(.) == 2, the 1st is the pred_name, 2nd is truth_name in __call__ inputs
>>>             "schedule": [0.3, 0.6, 1],
>>>             "scale": [0, 1, 0.5], # scale the loss for every schedule
>>>             // "schedule": [0.3, 1.0],
>>>             // "scale": [0, 1, 0.5], # scale the loss
>>>         },
>>>         _base: "cross_entropy",  // _name or _base is all ok
>>>     },
>>>     config: {
>>>         "loss_list": ['the_first', 'the_second'],
>>>     },
>>>     _name: "cross_entropy",
>>> }
Module contents

losses

dlk.core.losses.import_losses(losses_dir, namespace)[source]

dlk.core.models package

Submodules
dlk.core.models.basic module
class dlk.core.models.basic.BasicModel(config: dlk.core.models.basic.BasicModelConfig, checkpoint)[source]

Bases: dlk.core.base_module.BaseModel

Basic & General Model

check_keys_are_provided(provide: List[str] = []) None[source]

check that all the submodules' required keys are provided

Returns: None

Raises: PermissionError

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do forward on a mini batch

Parameters

batch – a mini batch inputs

Returns: the outputs

predict_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do predict for one batch

Parameters

inputs – one mini-batch inputs

Returns: the predicts outputs

provide_keys() List[str][source]

return all keys of the dict of the model returned

This method may be of no use, so we may remove it.

Returns: all keys

test_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do test for one batch

Parameters

inputs – one mini-batch inputs

Returns: the test outputs

training: bool
training_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do training for one batch

Parameters

inputs – one mini-batch inputs

Returns: the training outputs

validation_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do validation for one batch

Parameters

inputs – one mini-batch inputs

Returns: the validation outputs

class dlk.core.models.basic.BasicModelConfig(config)[source]

Bases: dlk.utils.config.BaseConfig

Config for BasicModel

Config Example:
>>> {
>>>     embedding: {
>>>         _base: "static"
>>>         config: {
>>>             embedding_file: "*@*", //the embedding file, must be saved as numpy array by pickle
>>>             embedding_dim: "*@*",
>>>             //if the embedding_file is a dict, you should provide the dict trace to embedding
>>>             embedding_trace: ".", //default the file itself is the embedding
>>>             /*embedding_trace: "embedding", //this means the <embedding = pickle.load(embedding_file)["embedding"]>*/
>>>             /*embedding_trace: "meta.embedding", //this means the <embedding = pickle.load(embedding_file)['meta']["embedding"]>*/
>>>             freeze: false, // is freeze
>>>             dropout: 0, //dropout rate
>>>             output_map: {},
>>>         },
>>>     },
>>>     decoder: {
>>>         _base: "linear",
>>>         config: {
>>>             input_size: "*@*",
>>>             output_size: "*@*",
>>>             pool: null,
>>>             dropout: "*@*", //the decoder output no need dropout
>>>             output_map: {}
>>>         },
>>>     },
>>>     encoder: {
>>>         _base: "lstm",
>>>         config: {
>>>             output_map: {},
>>>             hidden_size: "*@*",
>>>             input_size: *@*,
>>>             output_size: "*@*",
>>>             num_layers: 1,
>>>             dropout: "*@*", // dropout between layers
>>>         },
>>>     },
>>>     "initmethod": {
>>>         "_base": "range_norm"
>>>     },
>>>     "config": {
>>>         "embedding_dim": "*@*",
>>>         "dropout": "*@*",
>>>         "embedding_file": "*@*",
>>>         "embedding_trace": "token_embedding",
>>>     },
>>>     _link: {
>>>         "config.embedding_dim": ["embedding.config.embedding_dim",
>>>                                  "encoder.config.input_size",
>>>                                  "encoder.config.output_size",
>>>                                  "encoder.config.hidden_size",
>>>                                  "decoder.config.output_size",
>>>                                  "decoder.config.input_size"
>>>                                 ],
>>>         "config.dropout": ["encoder.config.dropout", "decoder.config.dropout", "embedding.config.dropout"],
>>>         "config.embedding_file": ['embedding.config.embedding_file'],
>>>         "config.embedding_trace": ['embedding.config.embedding_trace']
>>>     }
>>>     _name: "basic"
>>> }
get_decoder(config)[source]

return the Decoder and DecoderConfig

Parameters

config – the decoder config

Returns

Decoder, DecoderConfig

get_embedding(config: Dict)[source]

return the Embedding and EmbeddingConfig

Parameters

config – the embedding config

Returns

Embedding, EmbeddingConfig

get_encoder(config: Dict)[source]

return the Encoder and EncoderConfig

Parameters

config – the encoder config

Returns

Encoder, EncoderConfig

get_init_method(config: Dict)[source]

return the InitMethod and InitMethodConfig

Parameters

config – the init method config

Returns

InitMethod, InitMethodConfig

Module contents

models

dlk.core.models.import_models(models_dir, namespace)[source]

dlk.core.modules package

Submodules
dlk.core.modules.bert module
class dlk.core.modules.bert.BertWrap(config: dlk.core.modules.bert.BertWrapConfig)[source]

Bases: dlk.core.modules.Module

Bert wrap

forward(inputs: Dict)[source]

do forward on a mini batch

Parameters

batch – a mini batch inputs

Returns

sequence_output, all_hidden_states, all_self_attentions

from_pretrained()[source]

init the model from pretrained_model_path

init_weight(method)[source]

init the weight of model by ‘bert.init_weight()’ or from_pretrain

Parameters

method – init method, no use for pretrained_transformers

Returns

None

training: bool
class dlk.core.modules.bert.BertWrapConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for BertWrap

Config Example:
>>> {
>>>     "config": {
>>>         "pretrained_model_path": "*@*",
>>>         "from_pretrain": true,
>>>         "freeze": false,
>>>         "dropout": 0.0,
>>>     },
>>>     "_name": "bert",
>>> }
dlk.core.modules.conv1d module
class dlk.core.modules.conv1d.Conv1d(config: dlk.core.modules.conv1d.Conv1dConfig)[source]

Bases: dlk.core.modules.Module

Conv for 1d input

forward(x: torch.Tensor)[source]

do forward on a mini batch

Parameters

batch – a mini batch inputs

Returns

conv result the shape is the same as input

training: bool
class dlk.core.modules.conv1d.Conv1dConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for Conv1d

Config Example:
>>> {
>>>     "config": {
>>>         "in_channels": "*@*",
>>>         "out_channels": "*@*",
>>>         "dropout": 0.0,
>>>         "kernel_sizes": [3],
>>>     },
>>>     "_name": "conv1d",
>>> }
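A minimal construction sketch for the wrapper, filling the "*@*" placeholders with assumed channel sizes:
>>> from dlk.core.modules.conv1d import Conv1d, Conv1dConfig
>>>
>>> config = Conv1dConfig({
>>>     "config": {"in_channels": 128, "out_channels": 128, "dropout": 0.0, "kernel_sizes": [3]},
>>>     "_name": "conv1d",
>>> })
>>> conv = Conv1d(config)    # forward(x) returns a conv result with the same shape as the input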
dlk.core.modules.crf module
class dlk.core.modules.crf.CRFConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for ConditionalRandomField

Config Example:
>>> {
>>>     "config": {
>>>         "output_size": 2,
>>>         "batch_first": true,
>>>         "reduction": "mean", //none|sum|mean|token_mean
>>>     },
>>>     "_name": "crf",
>>> }
class dlk.core.modules.crf.ConditionalRandomField(config: dlk.core.modules.crf.CRFConfig)[source]

Bases: dlk.core.modules.Module

CRF: training_step for training, forward for decoding.

forward(logits: torch.FloatTensor, mask: torch.LongTensor)[source]

predict step, get the best path

Parameters
  • logits – emissions, batch_size*max_len*num_tags

  • mask – batch_size*max_len, mask==0 means padding

Returns

batch*max_len

init_weight(method: Callable)[source]

init the weight of transitions, start_transitions and end_transitions

Initialize the transition parameters. The parameters will be initialized randomly from a uniform distribution between -0.1 and 0.1.

Parameters

method – init method, no use

Returns

None

training: bool
training_step(logits: torch.FloatTensor, tags: torch.LongTensor, mask: torch.LongTensor)[source]

training step, calc the loss

Parameters
  • logits – emissions, batch_size*max_len*num_tags

  • tags – batch_size*max_len

  • mask – batch_size*max_len, mask==0 means padding

Returns

loss
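Putting the CRF pieces together, a minimal usage sketch (shapes follow the parameter descriptions above; the values are illustrative):
>>> import torch
>>> from dlk.core.modules.crf import CRFConfig, ConditionalRandomField
>>>
>>> crf = ConditionalRandomField(CRFConfig({
>>>     "config": {"output_size": 2, "batch_first": True, "reduction": "mean"},
>>>     "_name": "crf",
>>> }))
>>> batch_size, max_len, num_tags = 4, 16, 2
>>> logits = torch.randn(batch_size, max_len, num_tags)          # emissions
>>> tags = torch.randint(0, num_tags, (batch_size, max_len))     # gold tags
>>> mask = torch.ones(batch_size, max_len, dtype=torch.long)     # mask==0 means padding
>>> loss = crf.training_step(logits, tags, mask)                 # training: the CRF loss
>>> best_path = crf(logits, mask)                                # decoding: the best tag path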

dlk.core.modules.distil_bert module
class dlk.core.modules.distil_bert.DistilBertWrap(config: dlk.core.modules.distil_bert.DistilBertWrapConfig)[source]

Bases: dlk.core.modules.Module

DistillBertWrap

forward(inputs)[source]

do forward on a mini batch

Parameters

batch – a mini batch inputs

Returns

sequence_output, all_hidden_states, all_self_attentions

from_pretrained()[source]

init the model from pretrained_model_path

init_weight(method)[source]

init the weight of model by ‘bert.init_weight()’ or from_pretrain

Parameters

method – init method, no use for pretrained_transformers

Returns

None

training: bool
class dlk.core.modules.distil_bert.DistilBertWrapConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for DistilBertWrap

Config Example:
>>> {
>>>     "config": {
>>>         "pretrained_model_path": "*@*",
>>>         "from_pretrain": true,
>>>         "freeze": false,
>>>         "dropout": 0.0,
>>>     },
>>>     "_name": "distil_bert",
>>> }
dlk.core.modules.linear module
class dlk.core.modules.linear.Linear(config: dlk.core.modules.linear.LinearConfig)[source]

Bases: dlk.core.modules.Module

wrap for nn.Linear

forward(input: torch.Tensor) torch.Tensor[source]

do forward on a mini batch

Parameters

batch – a mini batch inputs

Returns

projection result; the shape is the same as the input if there is no pool, otherwise it depends on the pool method

training: bool
class dlk.core.modules.linear.LinearConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for Linear

Config Example:
>>> {
>>>     "config": {
>>>         "input_size": 256,
>>>         "output_size": 2,
>>>         "dropout": 0.0, //the module output no need dropout
>>>         "bias": true, // whether to use bias in the linear layer; if set to false, all biases are set to 0
>>>         "pool": null, // pooling output or not
>>>     },
>>>     "_name": "linear",
>>> }
dlk.core.modules.logits_gather module
class dlk.core.modules.logits_gather.LogitsGather(config: dlk.core.modules.logits_gather.LogitsGatherConfig)[source]

Bases: dlk.core.modules.Module

Gather the output logits decided by config

forward(input: List[torch.Tensor]) Dict[str, torch.Tensor][source]

gather the needed input to dict

Parameters

batch – a mini batch inputs

Returns

some elements to dict

training: bool
class dlk.core.modules.logits_gather.LogitsGatherConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for LogitsGather

Config Example:
>>> {
>>>     "config": {
>>>         "gather_layer": {
>>>             "0": {
>>>                 "map": "3", // the 0th layer is not scaled; its output goes to "gather_logits_3" ("gather_logits_" is the output name prefix, "3" is the map name)
>>>                 "scale": {} //don't scale
>>>             },
>>>             "1": {
>>>                 "map": "4",  // the 1st layer scales the output dim from 1024 to 200; its output is named "gather_logits_4"
>>>                 "scale": {"1024":"200"},
>>>             }
>>>         },
>>>         "prefix": "gather_logits_",
>>>     },
>>>     _name: "logits_gather",
>>> }
dlk.core.modules.lstm module
class dlk.core.modules.lstm.LSTM(config: dlk.core.modules.lstm.LSTMConfig)[source]

Bases: dlk.core.modules.Module

A wrap for nn.LSTM

forward(input: torch.Tensor, mask: torch.Tensor) torch.Tensor[source]

do forward on a mini batch

Parameters

batch – a mini batch inputs

Returns

lstm output the shape is the same as input

training: bool
class dlk.core.modules.lstm.LSTMConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for LSTM

Config Example:
>>> {
>>>     "config": {
>>>         "bidirectional": true,
>>>         "output_size": 200, // the output is 2*hidden_size if bidirectional is used
>>>         "input_size": 200,
>>>         "num_layers": 1,
>>>         "dropout": 0.1, // dropout between layers
>>>         "dropout_last": true, //dropout the last layer output or not
>>>     },
>>>     "_name": "lstm",
>>> }
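A minimal construction sketch for the wrapper using the values from the example above (the exact tensor layout expected by forward(input, mask) is not specified here, so only configuration is shown):
>>> from dlk.core.modules.lstm import LSTM, LSTMConfig
>>>
>>> config = LSTMConfig({
>>>     "config": {
>>>         "bidirectional": True,
>>>         "output_size": 200,
>>>         "input_size": 200,
>>>         "num_layers": 1,
>>>         "dropout": 0.1,
>>>         "dropout_last": True,
>>>     },
>>>     "_name": "lstm",
>>> })
>>> lstm = LSTM(config)    # forward(input, mask) returns an output with the same shape as input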
dlk.core.modules.roberta module
class dlk.core.modules.roberta.RobertaWrap(config: dlk.core.modules.roberta.RobertaWrapConfig)[source]

Bases: dlk.core.modules.Module

Roberta Wrap

forward(inputs)[source]

do forward on a mini batch

Parameters

batch – a mini batch inputs

Returns

sequence_output, all_hidden_states, all_self_attentions

from_pretrained()[source]

init the model from pretrained_model_path

init_weight(method)[source]

init the weight of model by ‘bert.init_weight()’ or from_pretrain

Parameters

method – init method, no use for pretrained_transformers

Returns

None

training: bool
class dlk.core.modules.roberta.RobertaWrapConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for RobertaWrap

Config Example:
>>> {
>>>     "config": {
>>>         "pretrained_model_path": "*@*",
>>>         "from_pretrain": true,
>>>         "freeze": false,
>>>         "dropout": 0.0,
>>>     },
>>>     "_name": "roberta",
>>> }
Module contents

basic modules

class dlk.core.modules.Module[source]

Bases: torch.nn.modules.module.Module

This class is the DLK Module, a replacement for torch.nn.Module in this project

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

in a simple module, all steps go through this method

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

init_weight(method)[source]

init the weight of submodules by ‘method’

Parameters

method – init method

Returns

None

predict_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do predict for one batch

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

test_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do test for one batch

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

training: bool
training_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do train for one batch

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

validation_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do validation for one batch

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

dlk.core.modules.import_modules(modules_dir, namespace)[source]

dlk.core.optimizers package

Submodules
dlk.core.optimizers.adamw module
class dlk.core.optimizers.adamw.AdamWOptimizer(model: torch.nn.modules.module.Module, config: dlk.core.optimizers.adamw.AdamWOptimizerConfig)[source]

Bases: dlk.core.optimizers.BaseOptimizer

Wrap for optim.AdamW

get_optimizer() torch.optim.adamw.AdamW[source]

return the initialized AdamW optimizer

Returns

AdamW Optimizer

class dlk.core.optimizers.adamw.AdamWOptimizerConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for AdamWOptimizer

Config Example:
>>> {
>>>     "config": {
>>>         "lr": 5e-5,
>>>         "betas": [0.9, 0.999],
>>>         "eps": 1e-6,
>>>         "weight_decay": 1e-2,
>>>         "optimizer_special_groups": {
>>>             "order": ['decoder', 'bias'], // the group order; if a parameter matches both decoder and bias, it is assigned to decoder. Each order name is used as the group name
>>>             "bias": {
>>>                 "config": {
>>>                     "weight_decay": 0
>>>                 },
>>>                 "pattern": ["bias",  "LayerNorm.bias", "LayerNorm.weight"]
>>>             },
>>>             "decoder": {
>>>                 "config": {
>>>                     "lr": 1e-3
>>>                 },
>>>                 "pattern": ["decoder"]
>>>             },
>>>         }
>>>         "name": "default" // default group name
>>>     },
>>>     "_name": "adamw",
>>> }
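A minimal sketch of building the optimizer for a model (the model and hyperparameter values are placeholders, and optimizer_special_groups is left empty here for brevity):
>>> import torch.nn as nn
>>> from dlk.core.optimizers.adamw import AdamWOptimizer, AdamWOptimizerConfig
>>>
>>> model = nn.Linear(128, 2)                   # any torch.nn.Module
>>> config = AdamWOptimizerConfig({
>>>     "config": {
>>>         "lr": 5e-5,
>>>         "betas": [0.9, 0.999],
>>>         "eps": 1e-6,
>>>         "weight_decay": 1e-2,
>>>         "optimizer_special_groups": {},     # no special parameter groups in this sketch
>>>         "name": "default",
>>>     },
>>>     "_name": "adamw",
>>> })
>>> optimizer = AdamWOptimizer(model, config).get_optimizer()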
dlk.core.optimizers.sgd module
class dlk.core.optimizers.sgd.SGDOptimizer(model: torch.nn.modules.module.Module, config: dlk.core.optimizers.sgd.SGDOptimizerConfig)[source]

Bases: dlk.core.optimizers.BaseOptimizer

wrap for optim.SGD

get_optimizer() torch.optim.sgd.SGD[source]

return the initialized SGD optimizer

Returns

SGD Optimizer

class dlk.core.optimizers.sgd.SGDOptimizerConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for SGDOptimizer

Config Example:
>>> {
>>>     "config": {
>>>         "lr": 1e-3,
>>>         "momentum": 0.9,
>>>         "dampening": 0,
>>>         "weight_decay": 0,
>>>         "nesterov":false,
>>>         "optimizer_special_groups": {
>>>             // "order": ['decoder', 'bias'], // the group order, if the para is in decoder & is in bias, set to decoder. The order name is set to the group name
>>>             // "bias": {
>>>             //     "config": {
>>>             //         "weight_decay": 0
>>>             //     },
>>>             //     "pattern": ["bias",  "LayerNorm.bias", "LayerNorm.weight"]
>>>             // },
>>>             // "decoder": {
>>>             //     "config": {
>>>             //         "lr": 1e-3
>>>             //     },
>>>             //     "pattern": ["decoder"]
>>>             // },
>>>         }
>>>         "name": "default" // default group name
>>>     },
>>>     "_name": "sgd",
>>> }
Module contents

optimizers

class dlk.core.optimizers.BaseOptimizer[source]

Bases: object

get_optimizer() torch.optim.optimizer.Optimizer[source]

return the initialized optimizer

Returns

Optimizer

init_optimizer(optimizer: torch.optim.optimizer.Optimizer, model: torch.nn.modules.module.Module, config: Dict)[source]

init the optimizer for the params in the model; the groups are decided by the config

Parameters
  • optimizer – adamw, sgd, etc.

  • model – pytorch model

  • config – which decided the para group, lr, etc.

Returns

the initialized optimizer

dlk.core.optimizers.import_optimizers(optimizers_dir, namespace)[source]

dlk.core.schedulers package

Submodules
dlk.core.schedulers.constant module
class dlk.core.schedulers.constant.ConstantSchedule(optimizer: torch.optim.optimizer.Optimizer, config: dlk.core.schedulers.constant.ConstantScheduleConfig)[source]

Bases: dlk.core.schedulers.BaseScheduler

no schedule

get_scheduler()[source]

return the initialized constant scheduler

Returns

Schedule

class dlk.core.schedulers.constant.ConstantScheduleConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for ConstantSchedule

Config Example:
>>> {
>>>     "config": {
>>>         "last_epoch": -1
>>>     },
>>>     "_name": "constant",
>>> }
dlk.core.schedulers.constant_warmup module
class dlk.core.schedulers.constant_warmup.ConstantWarmupSchedule(optimizer: torch.optim.optimizer.Optimizer, config: dlk.core.schedulers.constant_warmup.ConstantWarmupScheduleConfig)[source]

Bases: dlk.core.schedulers.BaseScheduler

get_scheduler() torch.optim.lr_scheduler.LambdaLR[source]

return the initialized linear warmup then constant scheduler

Returns

Schedule

class dlk.core.schedulers.constant_warmup.ConstantWarmupScheduleConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for ConstantWarmupSchedule

Config Example:
>>> {
>>>     "config": {
>>>         "last_epoch": -1,
>>>         "num_warmup_steps": 0,
>>>     },
>>>     "_name": "constant_warmup",
>>> }
dlk.core.schedulers.cosine_warmup module
class dlk.core.schedulers.cosine_warmup.CosineWarmupSchedule(optimizer: torch.optim.optimizer.Optimizer, config: dlk.core.schedulers.cosine_warmup.CosineWarmupScheduleConfig)[source]

Bases: dlk.core.schedulers.BaseScheduler

get_scheduler() torch.optim.lr_scheduler.LambdaLR[source]

return the initialized linear warmup then cosine decay scheduler

Returns

Schedule

class dlk.core.schedulers.cosine_warmup.CosineWarmupScheduleConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for CosineWarmupSchedule

Config Example:
>>> {
>>>     "config": {
>>>         "last_epoch": -1,
>>>         "num_warmup_steps": 0,
>>>         "num_training_steps": -1,
>>>         "num_cycles": 0.5,
>>>     },
>>>     "_name": "cosine_warmup",
>>> }
dlk.core.schedulers.linear_warmup module
class dlk.core.schedulers.linear_warmup.LinearWarmupSchedule(optimizer: torch.optim.optimizer.Optimizer, config: dlk.core.schedulers.linear_warmup.LinearWarmupScheduleConfig)[source]

Bases: dlk.core.schedulers.BaseScheduler

linear warmup then linear decay

get_scheduler() torch.optim.lr_scheduler.LambdaLR[source]

return the initialized linear warmup then linear decay scheduler

Returns

Schedule

class dlk.core.schedulers.linear_warmup.LinearWarmupScheduleConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for LinearWarmupSchedule

Config Example:
>>> {
>>>     "config": {
>>>         "last_epoch": -1,
>>>         "num_warmup_steps": 0,
>>>         "num_training_steps": -1,
>>>     },
>>>     "_name": "linear_warmup",
>>> }
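
For reference, a linear-warmup-then-linear-decay schedule can be written directly as a torch.optim.lr_scheduler.LambdaLR. A minimal sketch (plain PyTorch, not the dlk wrapper; num_warmup_steps and num_training_steps mirror the config keys above):

>>> import torch
>>> from torch.optim.lr_scheduler import LambdaLR
>>>
>>> def linear_warmup_then_decay(num_warmup_steps, num_training_steps):
>>>     def lr_lambda(current_step):
>>>         if current_step < num_warmup_steps:
>>>             # warmup: scale the lr linearly from 0 up to the base lr
>>>             return float(current_step) / float(max(1, num_warmup_steps))
>>>         # decay: scale the lr linearly from the base lr down to 0
>>>         return max(0.0, float(num_training_steps - current_step)
>>>                    / float(max(1, num_training_steps - num_warmup_steps)))
>>>     return lr_lambda
>>>
>>> params = [torch.nn.Parameter(torch.zeros(1))]
>>> optimizer = torch.optim.AdamW(params, lr=5e-5)
>>> scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_then_decay(100, 1000), last_epoch=-1)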
dlk.core.schedulers.multi_group_schedule module
dlk.core.schedulers.rec_decay module
class dlk.core.schedulers.rec_decay.RecDecaySchedule(optimizer: torch.optim.optimizer.Optimizer, config: dlk.core.schedulers.rec_decay.RecDecayScheduleConfig)[source]

Bases: dlk.core.schedulers.BaseScheduler

lr=lr*1/(1+decay)

get_scheduler()[source]

return the initialized rec_decay scheduler

lr=lr*1/(1+decay)

Returns

Schedule

class dlk.core.schedulers.rec_decay.RecDecayScheduleConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for RecDecaySchedule

Config Example:
>>> {
>>>     "config": {
>>>         "last_epoch": -1,
>>>         "num_training_steps": -1,
>>>         "decay": 0.05,
>>>         "epoch_training_steps": -1,
>>>     },
>>>     "_name": "rec_decay",
>>> }

The learning rate is updated as lr = lr * 1 / (1 + decay).

Module contents

schedulers

class dlk.core.schedulers.BaseScheduler[source]

Bases: object

interface for Schedule

get_scheduler() torch.optim.lr_scheduler.LambdaLR[source]

return the initialized scheduler

Returns

Schedule

dlk.core.schedulers.import_schedulers(schedulers_dir, namespace)[source]

Submodules

dlk.core.base_module module

class dlk.core.base_module.BaseModel[source]

Bases: torch.nn.modules.module.Module, dlk.core.base_module.ModuleOutputRenameMixin, dlk.core.base_module.IModuleIO, dlk.core.base_module.IModuleStep

All PyTorch models should inherit from this class

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

all models should implement this method

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

training: bool
class dlk.core.base_module.BaseModule(config: dlk.core.base_module.BaseModuleConfig)[source]

Bases: torch.nn.modules.module.Module, dlk.core.base_module.ModuleOutputRenameMixin, dlk.core.base_module.IModuleIO, dlk.core.base_module.IModuleStep

All PyTorch modules should inherit from this class

check_keys_are_provided(provide: Set[str]) None[source]

check whether the keys required by this module are provided

Returns

pass or not

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

all modules should implement this method

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

init_weight(method)[source]

init the weight of submodules by ‘method’

Parameters

method – init method

Returns

None

provide_keys() Set[source]

return all keys of the dict returned by this module

Returns

all keys

training: bool
class dlk.core.base_module.BaseModuleConfig(config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for BaseModule

class dlk.core.base_module.IModuleIO[source]

Bases: object

interface for checking the module's inputs and outputs

abstract check_keys_are_provided(provide: List[str]) bool[source]

check whether the keys required by this module are provided

Returns

pass or not

check_module_chain(module_list: List[dlk.core.base_module.BaseModule]) bool[source]

check whether the interfaces of the listed modules are aligned.

Parameters

module_list – a series of modules

Returns

pass or not

Raises

ValueError – the check is not passed

abstract provide_keys() List[str][source]

return all keys of the dict returned by this module

Returns

all keys

class dlk.core.base_module.IModuleStep[source]

Bases: object

docstring for ModuleStepMixin

abstract predict_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do predict for one batch

Parameters

inputs – one mini-batch inputs

Returns

the predicts outputs

test_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do test for one batch

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

abstract training_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do training for one batch

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

abstract validation_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do validation for one batch

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

class dlk.core.base_module.ModuleOutputRenameMixin[source]

Bases: object

Rename the output keys according to the config so that they match the input fields of the downstream module.

dict_rename(input: Dict, output_map: Dict[str, str]) Dict[source]

rename the key of input(dict) by output_map(name map)

Parameters
  • input – the dict whose keys will be renamed

  • output_map – name map

Returns

renamed input

get_input_name(name: str) str[source]

use config._input_map to map the name to the real name

Parameters

name – input_name

Returns

real_name

get_output_name(name: str) str[source]

use config._output_map to map the name to the real name

Parameters

name – output_name

Returns

real_name

get_real_name(name: str, name_map: Dict[str, str]) str[source]

use the name_map to map the input name to real name

Parameters
  • name – input_name

  • name_map – name map

Returns

real_name

set_rename(input: Set, output_map: Dict[str, str]) Set[source]

rename all the name in input by output_map

Parameters
  • input – a set of names

  • output_map – name map

Returns

renamed input
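
To make the renaming concrete, a minimal sketch of what dict_rename-style key mapping amounts to (a standalone helper, not the mixin method itself):

>>> from typing import Dict
>>>
>>> def dict_rename(inputs: Dict, output_map: Dict[str, str]) -> Dict:
>>>     """Return a copy of `inputs` whose keys are renamed via `output_map`."""
>>>     return {output_map.get(key, key): value for key, value in inputs.items()}
>>>
>>> outputs = {"logits": [0.1, 0.9], "embedding": [0.0, 0.0]}
>>> dict_rename(outputs, {"logits": "predict_logits"})
>>> # {'predict_logits': [0.1, 0.9], 'embedding': [0.0, 0.0]}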

class dlk.core.base_module.SimpleModule(config: dlk.core.base_module.BaseModuleConfig)[source]

Bases: dlk.core.base_module.BaseModule

SimpleModule: all train/predict/test/validation steps call forward

forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

in SimpleModule, all steps delegate to this method

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

predict_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do predict for one batch

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

test_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do test for one batch

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

training: bool
training_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do train for one batch

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

validation_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor][source]

do validation for one batch

Parameters

inputs – one mini-batch inputs

Returns

one mini-batch outputs

Module contents

dlk.data package

Subpackages

dlk.data.datamodules package

Submodules
dlk.data.datamodules.basic module
class dlk.data.datamodules.basic.BasicDatamodule(config: dlk.data.datamodules.basic.BasicDatamoduleConfig, data: Dict[str, Any])[source]

Bases: dlk.data.datamodules.IBaseDataModule

Basic and General DataModule

online_dataloader()[source]

get the data collate_fn

predict_dataloader()[source]

get the predict set dataloader

real_key_type_pairs(key_type_pairs: Dict, data: Dict, field: str)[source]

return the keys = key_type_pairs.keys() ∩ data.columns

Parameters
  • key_type_pairs – data in columns should map to tensor type

  • data – the pd.DataFrame

  • field – train/valid/test, etc.

Returns

real_key_type_pairs where keys = key_type_pairs.keys() ∩ data.columns

test_dataloader()[source]

get the test set dataloader

train_dataloader()[source]

get the train set dataloader

val_dataloader()[source]

get the validation set dataloader

class dlk.data.datamodules.basic.BasicDatamoduleConfig(config)[source]

Bases: dlk.utils.config.BaseConfig

Config for BasicDatamodule

Config Example:
>>> {
>>>     "_name": "basic",
>>>     "config": {
>>>         "pin_memory": null,
>>>         "collate_fn": "default",
>>>         "num_workers": null,
>>>         "shuffle": {
>>>             "train": true,
>>>             "predict": false,
>>>             "valid": false,
>>>             "test": false,
>>>             "online": false
>>>         },
>>>         "key_type_pairs": {
>>>              'input_ids': 'int',
>>>              'label_ids': 'long',
>>>              'type_ids': 'long',
>>>          },
>>>         "gen_mask": {
>>>              'input_ids': 'attention_mask',
>>>          },
>>>         "key_padding_pairs": { //default all 0
>>>              'input_ids': 0,
>>>          },
>>>         "key_padding_pairs_2d": { //default all 0, for 2 dimension data
>>>              'input_ids': 0,
>>>          },
>>>         "train_batch_size": 32,
>>>         "predict_batch_size": 32, // the predict/test batch_size equals the valid batch_size
>>>         "online_batch_size": 1,
>>>     }
>>> },
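
The key_type_pairs / key_padding_pairs / gen_mask entries describe how a batch is collated: each listed column is padded to the longest sequence in the batch with its padding value, cast to the requested tensor type, and optionally paired with a generated mask. A minimal padding sketch under those assumptions (not the actual DefaultCollate implementation):

>>> import torch
>>>
>>> def pad_and_mask(batch_of_ids, pad_value=0):
>>>     """Pad variable-length id lists and build an attention mask (1 = real token, 0 = padding)."""
>>>     max_len = max(len(ids) for ids in batch_of_ids)
>>>     padded = [ids + [pad_value] * (max_len - len(ids)) for ids in batch_of_ids]
>>>     mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_of_ids]
>>>     return torch.tensor(padded, dtype=torch.long), torch.tensor(mask, dtype=torch.long)
>>>
>>> input_ids, attention_mask = pad_and_mask([[101, 7, 102], [101, 8, 9, 10, 102]])
>>> input_ids.shape, attention_mask.shape
>>> # (torch.Size([2, 5]), torch.Size([2, 5]))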
class dlk.data.datamodules.basic.BasicDataset(key_type_pairs: Dict[str, str], data: pandas.core.frame.DataFrame)[source]

Bases: torch.utils.data.dataset.Dataset

Basic and General Dataset

Module contents

datamodules

class dlk.data.datamodules.DefaultCollate(**config)[source]

Bases: object

docstring for DefaultCollate

class dlk.data.datamodules.IBaseDataModule[source]

Bases: pytorch_lightning.core.datamodule.LightningDataModule

docstring for IBaseDataModule

abstract online_dataloader()[source]
Raises

NotImplementedError

predict_dataloader()[source]
Raises

NotImplementedError

test_dataloader()[source]
Raises

NotImplementedError

train_dataloader()[source]
Raises

NotImplementedError

val_dataloader()[source]
Raises

NotImplementedError

dlk.data.datamodules.import_datamodules(datamodules_dir, namespace)[source]

dlk.data.postprocessors package

Submodules
dlk.data.postprocessors.identity module
class dlk.data.postprocessors.identity.IdentityPostProcessor(config: dlk.data.postprocessors.identity.IdentityPostProcessorConfig)[source]

Bases: dlk.data.postprocessors.IPostProcessor

Identity post processor; does nothing except gather the loss

process(stage, outputs, origin_data) Dict[source]

do nothing except gather the loss

class dlk.data.postprocessors.identity.IdentityPostProcessorConfig(config: Dict)[source]

Bases: dlk.data.postprocessors.IPostProcessorConfig

docstring for IdentityPostProcessorConfig

dlk.data.postprocessors.seq_lab module
class dlk.data.postprocessors.seq_lab.AggregationStrategy[source]

Bases: object

docstring for AggregationStrategy

AVERAGE = 'average'
FIRST = 'first'
MAX = 'max'
NONE = 'none'
SIMPLE = 'simple'
class dlk.data.postprocessors.seq_lab.SeqLabPostProcessor(config: dlk.data.postprocessors.seq_lab.SeqLabPostProcessorConfig)[source]

Bases: dlk.data.postprocessors.IPostProcessor

PostProcess for sequence labeling task

aggregate(pre_entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) List[dict][source]
aggregate_word(entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) dict[source]
aggregate_words(entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) List[dict][source]

Override tokens from a given word that disagree to force agreement on word boundaries.

Example

micro|soft| com|pany| B-ENT I-NAME I-ENT I-ENT will be rewritten with first strategy as microsoft| company| B-ENT I-ENT

calc_score(predict_list: List, ground_truth_list: List)[source]

use predict_list and ground_truth_list to calc scores

Parameters
  • predict_list – list of predict

  • ground_truth_list – list of ground_truth

Returns

precision, recall, f1
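
For clarity, the usual micro precision/recall/F1 over such lists reduces to three counts; a generic sketch (how predicted and ground-truth entities are matched depends on the configured ignore_position / ignore_char rules):

>>> def micro_prf1(num_correct, num_predicted, num_ground_truth):
>>>     """Micro precision / recall / F1 from raw counts."""
>>>     precision = num_correct / num_predicted if num_predicted else 0.0
>>>     recall = num_correct / num_ground_truth if num_ground_truth else 0.0
>>>     f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
>>>     return precision, recall, f1
>>>
>>> micro_prf1(num_correct=8, num_predicted=10, num_ground_truth=12)
>>> # (0.8, 0.666..., 0.727...)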

crf_predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) List[source]

use the CRF-predicted label_ids to get the prediction info

Parameters
  • list_batch_outputs – the crf predict info

  • origin_data – the origin data

Returns

all predict instances info

do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict[source]

calculate the scores using the predicts or list_batch_outputs

Parameters
  • predicts – list of predicts

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; it contains data that cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

the named scores, recall, precision, f1

do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List[source]

Process the model predict to human readable format

There are three predictors for different seq_lab tasks, depending on config.use_crf (the prediction is already decoded to ids) and config.word_ready (subwords have already been gathered to the first piece)

Parameters
  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; it contains data that cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

all predicts

do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]

save the predict when save_condition==True

Parameters
  • predicts – list of predicts

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; it contains data that cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

  • save_condition – True to force saving; False to decide based on rt_config

Returns

None

gather_pre_entities(sentence: str, input_ids: numpy.ndarray, scores: numpy.ndarray, offset_mapping: Optional[List[Tuple[int, int]]], special_tokens_mask: numpy.ndarray) List[dict][source]

Fuse various numpy arrays into dicts with all the information needed for aggregation

get_entity_info(sub_tokens_index: List, offset_mapping: List, word_ids: List, label: str) Dict[source]

gather sub_tokens to get the start and end

Parameters
  • sub_tokens_index – the entity tokens index list

  • offset_mapping – every token offset in text

  • word_ids – the word index of every token

  • label – predict label

Returns

entity_info

get_tag(entity_name: str) Tuple[str, str][source]
group_entities(entities: List[dict]) List[dict][source]

Find and group together the adjacent tokens with the same entity predicted.

Parameters

entities – The entities predicted by the pipeline.

group_sub_entities(entities: List[dict]) dict[source]

Group together the adjacent tokens with the same entity predicted.

Parameters

entities – The entities predicted by the pipeline.
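
Grouping adjacent tokens that share the same predicted entity is essentially one scan over the token stream. A simplified sketch of the idea (ignoring the B-/I- prefix handling and score aggregation that the real methods perform):

>>> def group_adjacent(token_entities):
>>>     """token_entities: list of (token, entity_type); merge consecutive runs with the same type."""
>>>     groups = []
>>>     for token, entity_type in token_entities:
>>>         if groups and groups[-1]["entity"] == entity_type:
>>>             groups[-1]["tokens"].append(token)
>>>         else:
>>>             groups.append({"entity": entity_type, "tokens": [token]})
>>>     return groups
>>>
>>> group_adjacent([("New", "LOC"), ("York", "LOC"), ("is", "O"), ("big", "O")])
>>> # [{'entity': 'LOC', 'tokens': ['New', 'York']}, {'entity': 'O', 'tokens': ['is', 'big']}]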

predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) List[source]

general predict process (especially for subword)

Parameters
  • list_batch_outputs – the predict (sub-)labels logits info

  • origin_data – the origin data

Returns

all predict instances info

word_predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) List[source]

use the first-piece or whole-word predicted label_logits to get the prediction info

Parameters
  • list_batch_outputs – the predict labels logits info

  • origin_data – the origin data

Returns

all predict instances info

class dlk.data.postprocessors.seq_lab.SeqLabPostProcessorConfig(config: Dict)[source]

Bases: dlk.data.postprocessors.IPostProcessorConfig

Config for SeqLabPostProcessor

Config Example:
>>> {
>>>     "_name": "seq_lab",
>>>     "config": {
>>>         "meta": "*@*",
>>>         "use_crf": false, //use or not use crf
>>>         "word_ready": false, //already gather the subword first token as the word rep or not
>>>         "ignore_position": true, // when calculating the metrics, whether to ignore the ground_truth and predicted position info (if true, only the entity content matters, not its position)
>>>         "ignore_char": " ", // if the entity begins or ends with these chars, they will be stripped
>>>         //"ignore_char": " ()[]-.,:", // if the entity begins or ends with these chars, they will be stripped
>>>         "meta_data": {
>>>             "label_vocab": 'label_vocab',
>>>             "tokenizer": "tokenizer",
>>>         },
>>>         "input_map": {
>>>             "logits": "logits",
>>>             "predict_seq_label": "predict_seq_label",
>>>             "_index": "_index",
>>>         },
>>>         "origin_input_map": {
>>>             "uuid": "uuid",
>>>             "sentence": "sentence",
>>>             "input_ids": "input_ids",
>>>             "entities_info": "entities_info",
>>>             "offsets": "offsets",
>>>             "special_tokens_mask": "special_tokens_mask",
>>>             "word_ids": "word_ids",
>>>             "label_ids": "label_ids",
>>>         },
>>>         "save_root_path": ".",  //save data root dir
>>>         "save_path": {
>>>             "valid": "valid",  // relative dir for valid stage
>>>             "test": "test",    // relative dir for test stage
>>>         },
>>>         "start_save_step": 0,  // -1 means the last
>>>         "start_save_epoch": -1,
>>>         "aggregation_strategy": "max", // AggregationStrategy item
>>>         "ignore_labels": ['O', 'X', 'S', "E"], // Out, Out, Start, End
>>>     }
>>> }
dlk.data.postprocessors.txt_cls module
class dlk.data.postprocessors.txt_cls.TxtClsPostProcessor(config: dlk.data.postprocessors.txt_cls.TxtClsPostProcessorConfig)[source]

Bases: dlk.data.postprocessors.IPostProcessor

postprocess for text classification

do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict[source]

calculate the scores using the predicts or list_batch_outputs

Parameters
  • predicts – list of predicts

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; it contains data that cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

the named scores, acc

do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List[source]

Process the model predict to human readable format

Parameters
  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; it contains data that cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

all predicts

do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]

save the predict when save_condition==True

Parameters
  • predicts – list of predicts

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; it contains data that cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

  • save_condition – True to force saving; False to decide based on rt_config

Returns

None

class dlk.data.postprocessors.txt_cls.TxtClsPostProcessorConfig(config: Dict)[source]

Bases: dlk.data.postprocessors.IPostProcessorConfig

Config for TxtClsPostProcessor

Config Example:
>>> {
>>>     "_name": "txt_cls",
>>>     "config": {
>>>         "meta": "*@*",
>>>         "meta_data": {
>>>             "label_vocab": 'label_vocab',
>>>         },
>>>         "input_map": {
>>>             "logits": "logits",
>>>             "label_ids": "label_ids",
>>>             "_index": "_index",
>>>         },
>>>         "origin_input_map": {
>>>             "sentence": "sentence",
>>>             "sentence_a": "sentence_a", // for pair
>>>             "sentence_b": "sentence_b",
>>>             "uuid": "uuid"
>>>         },
>>>         "save_root_path": ".",  //save data root dir
>>>         "top_k": 1, // return the top k results
>>>         "data_type": "single", //single or pair
>>>         "save_path": {
>>>             "valid": "valid",  // relative dir for valid stage
>>>             "test": "test",    // relative dir for test stage
>>>         },
>>>         "start_save_step": 0,  // -1 means the last
>>>         "start_save_epoch": -1,
>>>     }
>>> }
dlk.data.postprocessors.txt_reg module
class dlk.data.postprocessors.txt_reg.TxtRegPostProcessor(config: dlk.data.postprocessors.txt_reg.TxtRegPostProcessorConfig)[source]

Bases: dlk.data.postprocessors.IPostProcessor

text regression postprocess

do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict[source]

calculate the scores using the predicts or list_batch_outputs

Parameters
  • predicts – list of predicts

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; it contains data that cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

the named scores

do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List[source]

Process the model predict to human readable format

Parameters
  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; it contains data that cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

all predicts

do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]

save the predict when save_condition==True

Parameters
  • predicts – list of predicts

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; it contains data that cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

  • save_condition – True to force saving; False to decide based on rt_config

Returns

None

class dlk.data.postprocessors.txt_reg.TxtRegPostProcessorConfig(config: Dict)[source]

Bases: dlk.data.postprocessors.IPostProcessorConfig

Config for TxtRegPostProcessor

Config Example:
>>> {
>>>     "_name": "txt_reg",
>>>     "config": {
>>>         "input_map": {
>>>             "logits": "logits",
>>>             "values": "values",
>>>             "_index": "_index",
>>>         },
>>>         "origin_input_map": {
>>>             "sentence": "sentence",
>>>             "sentence_a": "sentence_a", // for pair
>>>             "sentence_b": "sentence_b",
>>>             "uuid": "uuid"
>>>         },
>>>         "data_type": "single", //single or pair
>>>         "save_root_path": ".",  //save data root dir
>>>         "save_path": {
>>>             "valid": "valid",  // relative dir for valid stage
>>>             "test": "test",    // relative dir for test stage
>>>         },
>>>         "log_reg": false, // whether logistic regression
>>>         "start_save_step": 0,  // -1 means the last
>>>         "start_save_epoch": -1,
>>>     }
>>> }
Module contents

postprocessors

class dlk.data.postprocessors.IPostProcessor[source]

Bases: object

docstring for IPostProcessor

average_loss(list_batch_outputs: List[Dict]) float[source]

average all the loss of the list_batches

Parameters

list_batch_outputs – a list of outputs

Returns

average_loss

abstract do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict[source]

calculate the scores using the predicts or list_batch_outputs

Parameters
  • predicts – list of predicts

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; it contains data that cannot be converted to tensors

  • rt_config

    >>>
    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

the named scores

abstract do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List[source]

Process the model predict to human readable format

Parameters
  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; it contains data that cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

all predicts

abstract do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]

save the predict when save_condition==True

Parameters
  • predicts – list of predicts

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; it contains data that cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

  • save_condition – True to force saving; False to decide based on rt_config

Returns

None

gather_predict_extend_data(input_data: Dict, i: int, predict_extend_return: Dict)[source]

gather the data registered in predict_extend_return

Parameters
  • input_data – the model output

  • i – the instance index

  • predict_extend_return – the name map which will be reserved

Returns

a dict of data in input_data which is register in predict_extend_return

loss_name_map(stage) str[source]

get the stage loss name

Parameters

stage – valid, train or test

Returns

loss_name

process(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False) Union[Dict, List][source]

PostProcess entry

Parameters
  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; it contains data that cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

  • save_condition – if True, force saving the predictions for all stages except online

Returns

the log_info (metrics); if the stage is “online”, return the predicts instead

property without_ground_truth_stage: set

the stages that have no ground truth

Returns

without_ground_truth_stage

class dlk.data.postprocessors.IPostProcessorConfig(config)[source]

Bases: dlk.utils.config.BaseConfig

docstring for PostProcessorConfigBase

property input_map

required the output of model process content name map

Returns

input_map

property origin_input_map

the required column name map of the origin data (before it is passed to the datamodule)

Returns

origin_input_map

property predict_extend_return

save the extend data in predict

Returns

predict_extend_return

dlk.data.postprocessors.import_postprocessors(postprocessors_dir, namespace)[source]

dlk.data.processors package

Submodules
dlk.data.processors.basic module
class dlk.data.processors.basic.BasicProcessor(stage: str, config: dlk.data.processors.basic.BasicProcessorConfig)[source]

Bases: dlk.data.processors.IProcessor

Basic and General Processor

process(data: Dict) Dict[source]

Process entry

Parameters

data –

    >>> {
    >>>     "data": {"train": ...},
    >>>     "tokenizer": ...
    >>> }

Returns

processed data

class dlk.data.processors.basic.BasicProcessorConfig(stage, config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for BasicProcessor

Config Example:
>>> {
>>>     // input should be {"train": train, "valid": valid, ...}, train/valid/test/predict/online etc, should be dataframe and must have a column named "sentence"
>>>     "_name": "basic@test_text_cls",
>>>     "config": {
>>>         "feed_order": ["load", "tokenizer", "token_gather", "label_to_id", "token_embedding", "save"]
>>>     },
>>>     "subprocessor@load": {
>>>         "_base": "load",
>>>         "config":{
>>>             "base_dir": "",
>>>             "predict":{
>>>                 "meta": "./meta.pkl",
>>>             },
>>>             "online": [
>>>                 "predict", //base predict
>>>                 {   // special config that updates "predict"; here the config is empty, which means use all config from "predict". When it is an empty dict you can instead set the value to the string "predict"; both give the same result
>>>                 }
>>>             ]
>>>         }
>>>     },
>>>     "subprocessor@save": {
>>>         "_base": "save",
>>>         "config":{
>>>             "base_dir": "",
>>>             "train":{
>>>                 "processed": "processed_data.pkl", // all data
>>>                 "meta": {
>>>                     "meta.pkl": ['label_vocab'] //only for next time use
>>>                 }
>>>             },
>>>             "predict": {
>>>                 "processed": "processed_data.pkl",
>>>             }
>>>         }
>>>     },
>>>     "subprocessor@tokenizer":{
>>>         "_base": "fast_tokenizer",
>>>         "config": {
>>>             "train": {
>>>                 "config_path": "*@*",
>>>                 "prefix": "",
>>>                 "data_type": "single", // single or pair, if not provide, will calc by len(process_data)
>>>                 "process_data": [
>>>                     ["sentence", { "is_pretokenized": false}],
>>>                 ],
>>>                 "post_processor": "default",
>>>                 "filed_map": { // this is the default value, you can provide other name
>>>                     "ids": "input_ids",
>>>                 }, // the tokenizer output(the key) map to the value
>>>             },
>>>             "predict": "train",
>>>             "online": "train"
>>>         }
>>>     },
>>>     "subprocessor@token_gather":{
>>>         "_base": "token_gather",
>>>         "config": {
>>>             "train": { // only train stage using
>>>                 "data_set": {      // for different stage, this processor will process different part of data
>>>                     "train": ["train", "valid"]
>>>                 },
>>>                 "gather_columns": ["label"], // List of columns. Every cell must be a single token, a list of tokens, or a set of tokens
>>>                 "deliver": "label_vocab", // output Vocabulary object (the Vocabulary of labels) name.
>>>             }
>>>         }
>>>     },
>>>     "subprocessor@label_to_id":{
>>>         "_base": "token2id",
>>>         "config": {
>>>             "train":{ // train/predict/online stage config; use '&' to join multiple stages
>>>                 "data_pair": {
>>>                     "label": "label_id"
>>>                 },
>>>                 "data_set": {                   // for different stage, this processor will process different part of data
>>>                     "train": ['train', 'valid', 'test'],
>>>                     "predict": ['predict'],
>>>                     "online": ['online']
>>>                 },
>>>                 "vocab": "label_vocab", // usually provided by the "token_gather" module
>>>             }, //3
>>>             "predict": "train",
>>>             "online": "train",
>>>         }
>>>     },
>>>     "subprocessor@token_embedding": {
>>>         "_base": "token_embedding",
>>>         "config":{
>>>             "train": { // only train stage using
>>>                 "embedding_file": "*@*",
>>>                 "tokenizer": "*@*", // the Tokenizer object used to get the token vocab
>>>                 "deliver": "token_embedding", // output Vocabulary object (the Vocabulary of labels) name.
>>>                 "embedding_size": 200,
>>>             }
>>>         }
>>>     },
>>> }
Module contents

processors

class dlk.data.processors.IProcessor[source]

Bases: object

docstring for IProcessor

abstract process(data: Dict) Dict[source]

Process entry

Parameters

data –

    >>> {
    >>>     "data": {"train": ...},
    >>>     "tokenizer": ...
    >>> }

Returns

processed data

dlk.data.processors.import_processors(processors_dir, namespace)[source]

dlk.data.subprocessors package

Submodules
dlk.data.subprocessors.char_gather module
class dlk.data.subprocessors.char_gather.CharGather(stage: str, config: dlk.data.subprocessors.char_gather.CharGatherConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

gather all characters from the ‘gather_columns’ and deliver a vocab named ‘char_vocab’

process(data: Dict) Dict[source]

Character gather entry

Parameters

data –

    >>> {
    >>>     "data": {"train": ...},
    >>>     "tokenizer": ...
    >>> }

Returns

data[self.config.deliver] = Vocabulary()(which gathered_char)

split_to_char(input: Union[str, Iterable])[source]

the chars come from a token or a sentence, so we need to split them into List[char]

Parameters

input – auto-detect the type of input and split it into chars

Returns

the same shape as the input, but each str is split into List[char]
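
In other words, the split keeps the shape of the input: a plain string becomes a list of characters, while a list of tokens becomes a list of character lists. A minimal sketch of that behaviour:

>>> from typing import Iterable, Union
>>>
>>> def split_to_char(data: Union[str, Iterable]):
>>>     """Recursively split strings into character lists, preserving the container shape."""
>>>     if isinstance(data, str):
>>>         return list(data)
>>>     return [split_to_char(item) for item in data]
>>>
>>> split_to_char("apple")
>>> # ['a', 'p', 'p', 'l', 'e']
>>> split_to_char(["an", "apple"])
>>> # [['a', 'n'], ['a', 'p', 'p', 'l', 'e']]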

class dlk.data.subprocessors.char_gather.CharGatherConfig(stage: str, config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for CharGather

Config Example:
>>> {
>>>     "_name": "char_gather",
>>>     "config": {
>>>         "train": {
>>>             "data_set": {                   // for different stage, this processor will process different part of data
>>>                 "train": ["train", "valid", 'test']
>>>             },
>>>             "gather_columns": "*@*", // List of columns. Every cell must be a single token, a list of tokens, or a set of tokens
>>>             "deliver": "char_vocab", // output Vocabulary object (the Vocabulary of labels) name.
>>>             "ignore": "", // ignore the token, the id of this token will be -1
>>>             "update": null, // null or another Vocabulary object to update
>>>             "unk": "[UNK]",
>>>             "pad": "[PAD]",
>>>             "min_freq": 1,
>>>             "most_common": -1, //-1 for all
>>>         }
>>>     }
>>> }
dlk.data.subprocessors.fast_tokenizer module
class dlk.data.subprocessors.fast_tokenizer.FastTokenizer(stage: str, config: dlk.data.subprocessors.fast_tokenizer.FastTokenizerConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

FastTokenizer uses HuggingFace tokenizers

Tokenize the single $sentence, or tokenize the pair $sentence_a, $sentence_b. Generates $tokens, $input_ids, $type_ids, $special_tokens_mask, $offsets, $word_ids, $overflowing, $sequence_ids

process(data: Dict) Dict[source]

Tokenizer entry

Parameters

data –

    >>> {
    >>>     "data": {"train": ...},
    >>>     "tokenizer": ...
    >>> }

Returns

data, with the tokenizer info in data[‘data’]; if self.config.deliver is set, data[self.config.deliver] will be set to self.tokenizer.to_str()

class dlk.data.subprocessors.fast_tokenizer.FastTokenizerConfig(stage, config)[source]

Bases: dlk.utils.config.BaseConfig

Config for FastTokenizer

Config Example:
>>> {
>>>     "_name": "fast_tokenizer",
>>>     "config": {
>>>         "train": {
>>>             "data_set": {                   // for different stage, this processor will process different part of data
>>>                 "train": ["train", "valid", 'test'],
>>>                 "predict": ["predict"],
>>>                 "online": ["online"]
>>>             },
>>>             "config_path": "*@*",
>>>             "truncation": {     // if this is set to None or empty, will not do trunc
>>>                 "max_length": 512,
>>>                 "strategy": "longest_first", // Can be one of longest_first, only_first or only_second.
>>>             },
>>>             "normalizer": ["nfd", "lowercase", "strip_accents", {"some_processor_need_config": {config}}], // if not set, the default normalizer from the config is used
>>>             "pre_tokenizer": [{"whitespace": {}}], // if not set, the default pre_tokenizer from the config is used
>>>             "post_processor": "bert", // if not set, the default post_processor from the config is used. WARNING: disabling the default setting is not supported (so the default tokenizer.post_tokenizer should be null and only be set in this configure)
>>>             "output_map": { // this is the default value, you can provide other name
>>>                 "tokens": "tokens",
>>>                 "ids": "input_ids",
>>>                 "attention_mask": "attention_mask",
>>>                 "type_ids": "type_ids",
>>>                 "special_tokens_mask": "special_tokens_mask",
>>>                 "offsets": "offsets",
>>>                 "word_ids": "word_ids",
>>>                 "overflowing": "overflowing",
>>>                 "sequence_ids": "sequence_ids",
>>>             }, // the tokenizer output(the key) map to the value
>>>             "input_map": {
>>>                 "sentence": "sentence", // for single input, tokenize the "sentence"
>>>                 "sentence_a": "sentence_a", //for pair inputs, tokenize the "sentence_a" && "sentence_b"
>>>                 "sentence_b": "sentence_b", //for pair inputs
>>>             },
>>>             "deliver": "tokenizer",
>>>             "process_data": { "is_pretokenized": false},
>>>             "data_type": "single", // single or pair, if not provide, will calc by len(process_data)
>>>         },
>>>         "predict": ["train", {"deliver": null}],
>>>         "online": ["train", {"deliver": null}],
>>>     }
>>> }
dlk.data.subprocessors.load module
class dlk.data.subprocessors.load.Load(stage: str, config: dlk.data.subprocessors.load.LoadConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Load the $meta, etc. into data

load(path: str)[source]

load data from path

Parameters

path – the path to data

Returns

loaded data

process(data: Dict) Dict[source]

Load entry

Parameters

data –

    >>> {
    >>>     "data": {"train": ...},
    >>>     "tokenizer": ...
    >>> }

Returns

data + loaded_data

class dlk.data.subprocessors.load.LoadConfig(stage, config)[source]

Bases: dlk.utils.config.BaseConfig

Config for Load

Config Example:
>>> {
>>>     "_name": "load",
>>>     "config":{
>>>         "base_dir": "",
>>>         "predict":{
>>>             "meta": "./meta.pkl",
>>>         },
>>>         "online": [
>>>             "predict", //base predict
>>>             {   // special config that updates "predict"; here the config is empty, which means use all config from "predict". When it is an empty dict you can instead set the value to the string "predict"; both give the same result
>>>             }
>>>         ]
>>>     }
>>> },
dlk.data.subprocessors.save module
class dlk.data.subprocessors.save.Save(stage: str, config: dlk.data.subprocessors.save.SaveConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Save the processed data to $base_dir/$processed. Save the meta data (like vocab, embedding, etc.) to $base_dir/$meta.

process(data: Dict) Dict[source]

Save entry

Parameters

data –

    >>> {
    >>>     "data": {"train": ...},
    >>>     "tokenizer": ...
    >>> }

Returns

data

save(data, path: str)[source]

save data to path

Parameters
  • data – pickleable data

  • path – the path to data

Returns

loaded data

class dlk.data.subprocessors.save.SaveConfig(stage, config)[source]

Bases: dlk.utils.config.BaseConfig

Config for Save

Config Example:
>>> {
>>>     "_name": "save",
>>>     "config":{
>>>         "base_dir": "",
>>>         "train":{
>>>             "processed": "processed_data.pkl", // all data without meta
>>>             "meta": {
>>>                 "meta.pkl": ['label_ids', 'embedding'] //only for next time use
>>>             }
>>>         },
>>>         "predict": {
>>>             "processed": "processed_data.pkl",
>>>         }
>>>     }
>>> },
dlk.data.subprocessors.seq_lab_firstpiece_relable module
dlk.data.subprocessors.seq_lab_loader module
dlk.data.subprocessors.seq_lab_relabel module
class dlk.data.subprocessors.seq_lab_relabel.SeqLabRelabel(stage: str, config: dlk.data.subprocessors.seq_lab_relabel.SeqLabRelabelConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Relabel the JSON data to BIO format

find_position_in_offsets(position: int, offset_list: List, sub_word_ids: List, start: int, end: int, is_start: bool = False)[source]

find the sub_word index such that offset_list[index][0] <= position < offset_list[index][1]

Parameters
  • position – position

  • offset_list – list of all tokens offsets

  • sub_word_ids – word_ids from tokenizer

  • start – start search index

  • end – end search index

  • is_start – whether the position is the start of the target token; if is_start==True and no index can be found, return -1

Returns

the index of the offset which include position

process(data: Dict) Dict[source]

SeqLabRelabel Entry

Parameters

data – Dict

Returns

relabeled data

relabel(one_ins: pandas.core.series.Series)[source]

make token labels; if you want first-piece labels, please use ‘seq_lab_firstpiece_relabel’

Parameters

one_ins – include sentence, entity_info, offsets

Returns

labels(labels for each subtoken)

class dlk.data.subprocessors.seq_lab_relabel.SeqLabRelabelConfig(stage, config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for SeqLabRelabel

Config Example:
>>> {
>>>     "_name": "seq_lab_relabel",
>>>     "config": {
>>>         "train":{
>>>             "input_map": {  // unless necessary, don't change this
>>>                 "word_ids": "word_ids",
>>>                 "offsets": "offsets",
>>>                 "entities_info": "entities_info",
>>>             },
>>>             "data_set": {                   // for different stage, this processor will process different part of data
>>>                 "train": ['train', 'valid', 'test'],
>>>                 "predict": ['predict'],
>>>                 "online": ['online']
>>>             },
>>>             "output_map": {
>>>                 "labels": "labels",
>>>             },
>>>             "drop": "shorter", // 'longer'/'shorter'/'none'; if entities overlap, remove one according to this rule
>>>             "start_label": "S",
>>>             "end_label": "E",
>>>             "clean_droped_entity": true, // after dropping an entity for training, whether to also drop it when calculating metrics; default is true; only effective when drop != 'none'
>>>             "entity_priority": [],
>>>             //"entity_priority": ['Product'],
>>>             "priority_trigger": 1, // if overlapping entities satisfy abs(length_a - length_b) <= priority_trigger, the entity_priority strategy is triggered
>>>         }, //3
>>>         "predict": "train",
>>>         "online": "train",
>>>     }
>>> }
dlk.data.subprocessors.token2charid module
class dlk.data.subprocessors.token2charid.Token2CharID(stage: str, config: dlk.data.subprocessors.token2charid.Token2CharIDConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Use ‘Vocabulary’ to map the characters of tokens to ids

process(data: Dict) Dict[source]

Token2CharID Entry

one_token like ‘apple’ will generate [1, 2, 2, 3] if max_token_len==4 and the vocab.word2idx = {‘a’: 1, “p”: 2, “l”: 3}

Parameters

data – the data to process

Returns

updated data(token -> char_ids)
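
Following the 'apple' example above, the per-token mapping looks up each character id, then truncates or pads to max_token_len. A minimal sketch (a hypothetical helper, not the actual implementation):

>>> def token_to_char_ids(token, char2id, max_token_len, pad_id=0):
>>>     """Map each character to its id, then truncate/pad the list to max_token_len."""
>>>     ids = [char2id.get(ch, pad_id) for ch in token][:max_token_len]
>>>     return ids + [pad_id] * (max_token_len - len(ids))
>>>
>>> token_to_char_ids("apple", {"a": 1, "p": 2, "l": 3}, max_token_len=4)
>>> # [1, 2, 2, 3]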

class dlk.data.subprocessors.token2charid.Token2CharIDConfig(stage, config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for Token2CharID

Config Example:
>>> {
>>>     "_name": "token2charid",
>>>     "config": {
>>>         "train":{
>>>             "data_pair": {
>>>                 "sentence & offsets": "char_ids"
>>>             },
>>>             "data_set": {                   // for different stage, this processor will process different part of data
>>>                 "train": ['train', 'valid', 'test', 'predict'],
>>>                 "predict": ['predict'],
>>>                 "online": ['online']
>>>             },
>>>             "vocab": "char_vocab", // usually provided by the "token_gather" module
>>>             "max_token_len": 20, // the max length of a token; the output will be max_token_len x token_num (max_token_len is placed first so that padding only happens on token_num)
>>>         },
>>>         "predict": "train",
>>>         "online": "train",
>>>     }
>>> }
dlk.data.subprocessors.token2id module
class dlk.data.subprocessors.token2id.Token2ID(stage: str, config: dlk.data.subprocessors.token2id.Token2IDConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Use ‘Vocabulary’ to map tokens to ids

process(data: Dict) Dict[source]

Token2ID Entry

one_token like [‘apple’] will generate [1] if the vocab.word2idx = {‘apple’: 1}

Parameters

data – the data to process

Returns

updated data(tokens -> token_ids)

class dlk.data.subprocessors.token2id.Token2IDConfig(stage, config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for Token2ID

Config Example:
>>> {
>>>     "_name": "token2id",
>>>     "config": {
>>>         "train":{
>>>             "data_pair": {
>>>                 "labels": "label_ids"
>>>             },
>>>             "data_set": {                   // for different stage, this processor will process different part of data
>>>                 "train": ['train', 'valid', 'test', 'predict'],
>>>                 "predict": ['predict'],
>>>                 "online": ['online']
>>>             },
>>>             "vocab": "label_vocab", // usually provided by the "token_gather" module
>>>         }, //3
>>>         "predict": "train",
>>>         "online": "train",
>>>     }
>>> }
dlk.data.subprocessors.token_embedding module
class dlk.data.subprocessors.token_embedding.TokenEmbedding(stage: str, config: dlk.data.subprocessors.token_embedding.TokenEmbeddingConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Gather token embeddings from the pretrained ‘embedding_file’ or initialize them (xavier_uniform init, clipped to the range ‘bias_clip_range’)

The tokens come from a ‘Tokenizer’ (get_vocab) or a ‘Vocabulary’ (word2idx) object (exactly one of the two must be provided)

get_embedding(file_path, embedding_size) Dict[str, List[float]][source]

load the embeddings from file_path, keeping only the last embedding_size dimensions

Parameters
  • file_path – embedding file path

  • embedding_size – the embedding dim

Returns

>>> embedding_dict
>>> {
>>>     "word": [embedding, ...]
>>> }

process(data: Dict) Dict[source]

TokenEmbedding Entry

Parameters

data – the data to process

Returns

the updated data, with data[self.config.deliver] = np.array(embedding_mat)

update_embedding(embedding_dict: Dict[str, List[float]], vocab: List[str])[source]

update embedding_dict with the tokens that are in vocab but not in embedding_dict

Parameters
  • embedding_dict – word->embedding dict

  • vocab – token vocab

Returns

updated embedding_dict
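
A minimal sketch of that update step: tokens that are in the vocab but missing from the pretrained file receive a randomly initialized vector (the real implementation uses xavier_uniform init clipped to bias_clip_range; plain uniform noise stands in here):

>>> import numpy as np
>>>
>>> def update_embedding(embedding_dict, vocab, embedding_size, bound=0.1):
>>>     """Add a random vector for every vocab token missing from embedding_dict."""
>>>     for token in vocab:
>>>         if token not in embedding_dict:
>>>             embedding_dict[token] = np.random.uniform(-bound, bound, embedding_size).tolist()
>>>     return embedding_dict
>>>
>>> emb = update_embedding({"apple": [0.1, 0.2]}, vocab=["apple", "pear"], embedding_size=2)
>>> sorted(emb.keys())
>>> # ['apple', 'pear']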

class dlk.data.subprocessors.token_embedding.TokenEmbeddingConfig(stage, config)[source]

Bases: dlk.utils.config.BaseConfig

Config for TokenEmbedding

Config Example:
>>> {
>>>     "_name": "token_embedding",
>>>     "config": {
>>>         "train": {
>>>             "embedding_file": "*@*",
>>>             "tokenizer": null, // the Tokenizer object used to get the token vocab (provide either tokenizer or vocab)
>>>             "vocab": null,
>>>             "deliver": "token_embedding", // output Vocabulary object (the Vocabulary of labels) name.
>>>             "embedding_size": 200,
>>>             "bias_clip_range": [0.5, 0.1], // the init embedding bias weight range; if two values are provided, the larger is the upper bound and the smaller the lower bound; if one value is provided, it is used as the bias
>>>         }
>>>     }
>>> }
dlk.data.subprocessors.token_gather module
class dlk.data.subprocessors.token_gather.TokenGather(stage: str, config: dlk.data.subprocessors.token_gather.TokenGatherConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

gather all tokens from the ‘gather_columns’ and deliver a vocab named ‘token_vocab’

get_elements_from_series_by_trace(data: pandas.core.series.Series, trace: str) List[source]

get the data from data[trace_path]

Example:
>>> data[0] = {'entities_info': [{'start': 0, 'end': 1, 'labels': ['Label1']}]} // data is a series, and every element looks like data[0]
>>> trace = 'entities_info.labels'
>>> return_result = [['Label1']]

Parameters
  • data – origin data series

  • trace – get data element trace

Returns

the data in the tail of traces

process(data: Dict) Dict[source]

TokenGather entry

Parameters

data –

    >>> {
    >>>     "data": {"train": ...},
    >>>     "tokenizer": ...
    >>> }

Returns

data[self.config.deliver] = Vocabulary()(which gathered_token)

class dlk.data.subprocessors.token_gather.TokenGatherConfig(stage: str, config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for TokenGather

Config Example:
>>> {
>>>     "_name": "token_gather",
>>>     "config": {
>>>         "train": {
>>>             "data_set": {                   // for different stage, this processor will process different part of data
>>>                 "train": ["train", "valid", 'test']
>>>             },
>>>             "gather_columns": "*@*", // List of columns; if an element of the list is a dict, you can configure more. Every cell must be a single token, a list of tokens, or a set of tokens
>>>             //"gather_columns": ['tokens']
>>>             //"gather_columns": ['tokens', {"column": "entities_info", "trace": 'labels'}]
>>>             // the trace only traces dicts; if a list is on the trace path, the trace is applied to every element of the list. For example: {"entities_info": [{'start': 1, 'end': 2, labels: ['Label1']}, ..]}; the trace to labels is 'entities_info.labels'
>>>             "deliver": "*@*", // output Vocabulary object (the Vocabulary of labels) name.
>>>             "ignore": "", // ignore the token, the id of this token will be -1
>>>             "update": null, // null or another Vocabulary object to update
>>>             "unk": "[UNK]",
>>>             "pad": "[PAD]",
>>>             "min_freq": 1,
>>>             "most_common": -1, //-1 for all
>>>         }
>>>     }
>>> }
dlk.data.subprocessors.token_norm module
class dlk.data.subprocessors.token_norm.TokenNorm(stage: str, config: dlk.data.subprocessors.token_norm.TokenNormConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

This part could be merged into fast_tokenizer (it would save some time), but not every pipeline needs it (only some special datasets like CoNLL-2003 do), and it would make fast_tokenizer heavy.

Token norm:

Love -> love, 3281 -> 0000

process(data: Dict) Dict[source]

TokenNorm entry

Parameters

data –

    >>> {
    >>>     "data": {"train": ...},
    >>>     "tokenizer": ...
    >>> }

Returns

norm data

seq_norm(key: str, one_item: pandas.core.series.Series) str[source]

norm a sentence, the sentence is from one_item[key]

Parameters
  • key – the name in one_item

  • one_item – a pd.Series which include the key

Returns

norm_sentence

token_norm(token: str) str[source]

norm a token; the result satisfies len(result) == len(token), e.g. 12348 -> 00000

Parameters

token – origin token

Returns

normed_token
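
A minimal sketch of the two normalization rules described above (zero-digit replacement and lowercasing), which keeps the output length equal to the input length:

>>> def token_norm(token, zero_digits_replaced=True, lowercase=True):
>>>     """Normalize a token: digits -> '0', letters -> lowercase; the length is preserved."""
>>>     chars = []
>>>     for ch in token:
>>>         if zero_digits_replaced and ch.isdigit():
>>>             chars.append("0")
>>>         elif lowercase:
>>>             chars.append(ch.lower())
>>>         else:
>>>             chars.append(ch)
>>>     return "".join(chars)
>>>
>>> token_norm("Love"), token_norm("12348")
>>> # ('love', '00000')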

class dlk.data.subprocessors.token_norm.TokenNormConfig(stage, config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for TokenNorm

Config Example:
>>> {
>>>     "_name": "token_norm",
>>>     "config": {
>>>         "train":{
>>>             "data_set": {                   // for different stage, this processor will process different part of data
>>>                 "train": ['train', 'valid', 'test', 'predict'],
>>>                 "predict": ['predict'],
>>>                 "online": ['online']
>>>             },
>>>             "zero_digits_replaced": true,
>>>             "lowercase": true,
>>>             "extend_vocab": "", // when lowercase is true, this vocab collects all tokens that are not in the vocab but whose lowercase form is; only used for the token gather process
>>>             "tokenizer": "whitespace_split",  // the path to the vocab (tokens already in the vocab skip normalization); the file has one token per line
>>>             "data_pair": {
>>>                 "sentence": "norm_sentence"
>>>             },
>>>         },
>>>         "predict": "train",
>>>         "online": "train",
>>>     }
>>> }
tokenize(seq)[source]

tokenize the seq

dlk.data.subprocessors.txt_cls_loader module
dlk.data.subprocessors.txt_reg_loader module
Module contents

subprocessors

class dlk.data.subprocessors.ISubProcessor[source]

Bases: object

docstring for ISubProcessor

abstract process(data: Dict) Dict[source]

SubProcess entry

Parameters

data –

    >>> {
    >>>     "data": {"train": ...},
    >>>     "tokenizer": ...
    >>> }

Returns

processed data
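
A minimal sketch of a custom subprocessor implementing this interface (the subprocessor itself, its name, and the 'sentence' column it touches are made up for illustration):

from typing import Dict

from dlk.data.subprocessors import ISubProcessor


class LowercaseSentence(ISubProcessor):
    """Hypothetical subprocessor: lowercase the 'sentence' column of every data split."""

    def process(self, data: Dict) -> Dict:
        for split_name, dataframe in data["data"].items():
            if "sentence" in dataframe.columns:
                dataframe["sentence"] = dataframe["sentence"].str.lower()
        return data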

dlk.data.subprocessors.import_subprocessors(processors_dir, namespace)[source]

Module contents

dlk.managers package

Submodules

dlk.managers.lightning module

class dlk.managers.lightning.LightningManager(config: dlk.managers.lightning.LightningManagerConfig, rt_config: Dict)[source]

Bases: object

pytorch-lightning training manager

fit(**inputs)[source]

fit the model and datamodule to trainer

Parameters

**inputs – dict of inputs, including "model" and "datamodule"

Returns

Undefined

get_callbacks(callback_configs: List[Dict], rt_config: Dict)[source]

init the callbacks and return the callbacks list

Parameters
  • callback_configs – the config of every callback

  • rt_config – {“save_dir”: ‘..’, “name”: ‘..’}

Returns

all callbacks

predict(**inputs)[source]

fit the model and datamodule.predict_dataloader to predict

Parameters

**inputs – dict of inputs, including "model" and "datamodule"

Returns

predict list

test(**inputs)[source]

fit the model and datamodule.test_dataloader to test

Parameters

**inputs – dict of inputs, including "model" and "datamodule"

Returns

Undefined

validate(**inputs)[source]

fit the model and datamodule.validation to validate

Parameters

**inputs – dict of inputs, including "model" and "datamodule"

Returns

Undefined

class dlk.managers.lightning.LightningManagerConfig(config)[source]

Bases: dlk.utils.config.BaseConfig

docstring for LightningManagerConfig; see the pytorch-lightning Trainer documentation at https://pytorch-lightning.readthedocs.io for parameter details

get_callbacks_config(config: Dict) List[Dict][source]

get the configs for callbacks

Parameters

config – {“config”: {“callbacks”: [“callback_names”..]}, “callback@callback_names”: {config}}

Returns

configs whose names are in config['config']['callbacks']
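
Roughly, the manager builds the callbacks and a pytorch_lightning Trainer, then delegates fit/predict/test/validate to it. A bare-bones sketch of that pattern (the trainer arguments here are only examples, not dlk defaults):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint


def fit(model: pl.LightningModule, datamodule: pl.LightningDataModule, save_dir: str = "./logs"):
    """Rough equivalent of LightningManager.fit: build callbacks, build the Trainer, then fit."""
    callbacks = [
        ModelCheckpoint(dirpath=save_dir, monitor="val_loss", save_top_k=3, mode="min"),
        EarlyStopping(monitor="val_loss", patience=3, mode="min"),
    ]
    trainer = pl.Trainer(max_epochs=10, callbacks=callbacks, default_root_dir=save_dir)
    trainer.fit(model, datamodule=datamodule)  # predict/test/validate follow the same pattern
    return trainer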

Module contents

managers

dlk.managers.import_managers(managers_dir, namespace)[source]

dlk.utils package

Submodules

dlk.utils.config module

Provide BaseConfig, which supplies the basic methods for configs, and ConfigTool, a general config (dict) processing tool

class dlk.utils.config.BaseConfig(config: Dict)[source]

Bases: object

BaseConfig provides the basic functions for all configs

post_check(config, used=None)[source]

check that all the paras in config are used

Parameters
  • config – paras

  • used – used paras

Returns

None

Raises

logger.warning("Unused")
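
A minimal sketch of what such a check could look like (illustration only; the real post_check may inspect the config differently):

from typing import Dict, List, Optional

from dlk.utils.logger import Logger

logger = Logger.get_logger()


def post_check(config: Dict, used: Optional[List[str]] = None):
    """Warn about top-level paras in config that are not in the 'used' list."""
    used_paras = set(used or [])
    unused = [para for para in config if para not in used_paras]
    if unused:
        logger.warning(f"Unused paras: {unused}")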

class dlk.utils.config.ConfigTool[source]

Bases: object

This class is not used as much as originally designed.

static do_update_config(config: dict, update_config: Optional[dict] = None) Dict[source]

use the update_config dict to update the config dict, recursively

see ConfigTool._inplace_update_dict

Parameters
  • config – the dict to be updated

  • update_config – use _new to update _base

Returns

updated_config
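
Conceptually this is a recursive dict merge: keys from update_config override keys in config, descending into nested dicts. A small sketch of the idea (not the exact dlk code):

import copy
from typing import Dict, Optional


def recursive_update(config: Dict, update_config: Optional[Dict] = None) -> Dict:
    """Recursively merge update_config into a copy of config."""
    updated = copy.deepcopy(config)
    for key, value in (update_config or {}).items():
        if isinstance(value, dict) and isinstance(updated.get(key), dict):
            updated[key] = recursive_update(updated[key], value)
        else:
            updated[key] = value
    return updated


base = {"config": {"will_be_rewrite": 1, "keep": 8}}
child = {"config": {"will_be_rewrite": 3}}
assert recursive_update(base, child) == {"config": {"will_be_rewrite": 3, "keep": 8}}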

static get_config_by_stage(stage: str, config: Dict) Dict[source]

get the stage_config for a specific stage from the provided config

if config[stage] is the name of another stage, the config of this stage equals that stage's config, i.e. return config[config[stage]]

Config Example:
>>> config = {
>>>     "train":{ //train、predict、online stage config,  using '&' split all stages
>>>         "data_pair": {
>>>             "label": "label_id"
>>>         },
>>>         "data_set": {                   // for different stage, this processor will process different part of data
>>>             "train": ['train', 'dev'],
>>>             "predict": ['predict'],
>>>             "online": ['online']
>>>         },
>>>         "vocab": "label_vocab", // usually provided by the "token_gather" module
>>>     },
>>>     "predict": "train",
>>>     "online": ["train",
>>>     {"vocab": "new_label_vocab"}
>>>     ]
>>> }
>>> # get_config_by_stage('predict', config) == config[config['predict']] == config['train']
Parameters
  • stage – the stage, like ‘train’, ‘predict’, etc.

  • config – the base config which has different stage config

Returns

stage_config

static get_leaf_module(module_register: dlk.utils.register.Register, module_config_register: dlk.utils.register.Register, module_name: str, config: Dict) Tuple[Any, object][source]

get the module from module_register and module_config from module_config_register which name=module_name

Parameters
  • module_register – register for module which has ‘module_name’

  • module_config_register – config register for config which has ‘module_name’

  • module_name – the module name which we want to get from register

Returns

module(which name is module_name), module_config(which name is module_name)

dlk.utils.get_root module

Get the dlk package root path

dlk.utils.get_root.get_root()[source]

get the dlk root

Returns

abspath of this package

dlk.utils.logger module

class dlk.utils.logger.Logger(log_file: str = '', base_dir: str = 'logs', log_level: str = 'debug', log_name='dlk')[source]

Bases: object

docstring for logger

static get_logger() loguru._logger.Logger[source]

return the ‘dlk’ logger if initialized otherwise init and return it

Returns

Logger.global_logger

global_log_file: set[str] = {}
global_logger: loguru._logger.Logger = <loguru.logger handlers=[(id=1, level=10, sink=<stdout>)]>
static init_file_logger(log_file, base_dir='logs', log_level: str = 'debug')[source]

init(if there is not one) or change(if there already is one) the log file

Parameters
  • log_file – log file path

  • base_dir – real log path is ‘$base_dir/$log_file’

  • log_level – ‘debug’, ‘info’, etc.

Returns

None

static init_global_logger(log_level: str = 'debug', log_name: Optional[str] = None, reinit: bool = False)[source]

init the global_logger

Parameters
  • log_level – you can change this to logger to different level

  • log_name – change this is not suggested

  • reinit – if set true, will force reinit

Returns

None

level_map = {'debug': 'DEBUG', 'error': 'ERROR', 'info': 'INFO', 'warning': 'WARNING'}
log_name: str = 'dlk'
warning_file = False

dlk.utils.parser module

class dlk.utils.parser.BaseConfigParser(config_file: Union[str, Dict, List], config_base_dir: str = '', register: Optional[dlk.utils.register.Register] = None)[source]

Bases: object

The config parser order is: inherit -> search -> link

If some config value is marked with "@", the para has no default value and you must cover it (like 'label_nums', etc.).

static check_config(configs: Union[Dict, List[Dict]]) None[source]

check that all configs are valid.

check that every "@" has been replaced with a correct value.

Parameters

configs – the config or list of configs to check

Returns

None

Raises

ValueError

collect and move all links in the config to the top level

only done at the top level of the config; collect the links of all levels and return them together with their level

Parameters
  • config

    >>> {
    >>>     "arg1": {
    >>>         "arg11": 2
    >>>         "arg12": 3
    >>>         "_link": {"arg11": "arg12"}
    >>>     }
    >>> }
    

  • all_level_links – TODO

  • level – TODO

Returns

>>> {
>>>     "arg1": {
>>>         "arg11": 2
>>>         "arg12": 3
>>>     }
>>>     "_link": {"arg1.arg11": "arg1.arg12"}
>>> }

in-place link the config: config[link-to] = config[link-from]

Parameters
  • link – {link-from:link-to-1, link-from:[link-to-2, link-to-3]}

  • config – the base config to be linked

Returns

None

flatten all the _search paras into a list

recursive parsing of _search is supported now; this means you can add _search/_link/_base paras inside _search paras, but you should only search current-level paras

Parameters
  • search – search paras, {“para1”: [1,2,3], ‘para2’: ‘list(range(10))’}

  • config – base config

Returns: list of possible config

classmethod get_base_config(config_name: str) Dict[source]

get the base config by the config_name

Parameters

config_name – the config name

Returns

config of the config_name

get_cartesian_prod(list_of_list_of_dict: List[List[Dict]]) List[List[Dict]][source]

get the cartesian product of the given lists

Parameters

list_of_list_of_dict – [[config_a1, config_a2], [config_b1, config_b2]]

Returns

[[config_a1, config_b1], [config_a1, config_b2], [config_a2, config_b1], [config_a2, config_b2]]

get_kind_module_base_config(abstract_config: Union[dict, str], kind_module: str = '') List[dict][source]

get the whole config of 'kind_module' from the given abstract_config

Parameters
  • abstract_config – the config to be expanded

  • kind_module – the module kind, like 'embedding', 'subprocessor', which is registered in config_parser_register

Returns: parsed config (whole config) of abstract_config

static get_named_list_cartesian_prod(dict_of_list: Optional[Dict[str, List]] = None) List[Dict][source]

get the cartesian product of named lists

Parameters

dict_of_list – {‘name1’: [1,2,3], ‘name2’: “list(range(1, 4))”}

Returns

[{'name1': 1, 'name2': 1}, {'name1': 1, 'name2': 2}, {'name1': 1, 'name2': 3}, ...]
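
The same expansion can be sketched with itertools.product (ignoring the string form like "list(range(1, 4))" that the real method also accepts):

from itertools import product
from typing import Dict, List


def named_cartesian_prod(dict_of_list: Dict[str, List]) -> List[Dict]:
    """Expand {'name1': [1, 2], 'name2': [3, 4]} into one dict per combination."""
    names = list(dict_of_list.keys())
    return [dict(zip(names, values)) for values in product(*dict_of_list.values())]


print(named_cartesian_prod({"name1": [1, 2, 3], "name2": [1, 2, 3]}))
# [{'name1': 1, 'name2': 1}, {'name1': 1, 'name2': 2}, {'name1': 1, 'name2': 3}, ...]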

is_rep_config(list_of_dict: List[dict]) bool[source]

check whether there is a repeated config in the list

Parameters

list_of_dict – a list of dict

Returns

has repeat or not

load_hjson_file(file_path: str) Dict[source]

load hjson file from file_path and return a Dict

Parameters

file_path – the file path

Returns: loaded dict

map_to_submodule(config: dict, map_fun: Callable) Dict[source]

map map_fun over all submodules in config

use map_fun to process all the submodules

Parameters
  • config – a dict of submodules; the key is the module kind which is registered in config_parser_register

  • map_fun – the function used to process each submodule

Returns: the dict of processed submodule configs

parser(parser_link=True) List[source]

parse the config

Parameters

parser_link – whether to parse the links

Returns: all valid configs

parser_with_check(parser_link=True) List[Dict][source]

parse the config and check that it is valid

Parameters

parser_link – whether to parse the links

Returns: all valid configs

class dlk.utils.parser.CallbackConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for CallbackConfigParser

class dlk.utils.parser.ConfigConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

parser(parser_link=True)[source]

parse the config

the config supports _search and _link

Parameters

parser_link – whether to parse the links

Returns

all valid configs

class dlk.utils.parser.DatamoduleConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for DatamoduleConfigParser

class dlk.utils.parser.DecoderConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for DecoderConfigParser

class dlk.utils.parser.EmbeddingConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for EmbeddingConfigParser

class dlk.utils.parser.EncoderConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for EncoderConfigParser

class dlk.utils.parser.IModelConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for IModelConfigParser

class dlk.utils.parser.InitMethodConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for InitMethodConfigParser

class dlk.utils.parser.LinkConfigParser(config_file)[source]

Bases: object

parser(parser_link=False)[source]

parse the config

the config supports _search and _link

Parameters

parser_link – must be false

Returns

all valid configs

class dlk.utils.parser.LinkUnionTool[source]

Bases: object

Assisting tool for parsing the "_link" of a config. Links registered at the top level have higher priority than those at lower levels.

This class mostly resolves the conflicts between low- and high-level registered links.

find(key: str)[source]

find the root of the key

Parameters

key – a token

Returns

the root of the key

get the registered links

Returns

all registered and validated links

low_level_union(link_from: str, link_to: str)[source]

union the low level link_from->link_to pair

On the basis of the high-level links, this function registers a low-level link. If neither link-from nor link-to has appeared before, they are registered directly. If only one of them has appeared before, the values of link-from and link-to are overwritten by the corresponding value from the upper level. If both link-from and link-to have appeared before and they link to the same value, nothing is done; otherwise an error is raised.

Parameters
  • link_from – the link-from key

  • link_to – the link-to key

Returns

None

register the low-level links; low level means the base (parent) level config

Parameters

links – {“link-from”: [“list of link-to”], “link-from2”: “link-to2”}

Returns

self

register the top level links, top level means the link_to level config

Parameters

links – {“from”: [“tolist”], “from2”: “to2”}

Returns

self

top_level_union(link_from: str, link_to: str)[source]

union the top level link_from->link_to pair

Links ('link-from -> link-to') registered in the same (top) level config should be merged using top_level_union. Parameters are not allowed to be assigned repeatedly (the same parameter cannot appear more than once in the link-to position, otherwise it would cause ambiguity).

Parameters
  • link_from – the link-from key

  • link_to – the link-to key

Returns

None
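
find/union over link keys is essentially a union-find; a bare-bones sketch of that idea (illustration only, the real LinkUnionTool also tracks top/low levels and conflicts):

class SimpleLinkUnion:
    """Hypothetical union-find over link keys: linked keys end up sharing one root."""

    def __init__(self):
        self.parent = {}

    def find(self, key: str) -> str:
        self.parent.setdefault(key, key)
        if self.parent[key] != key:
            self.parent[key] = self.find(self.parent[key])  # path compression
        return self.parent[key]

    def union(self, link_from: str, link_to: str):
        self.parent[self.find(link_to)] = self.find(link_from)


links = SimpleLinkUnion()
links.union("arg1.arg11", "arg1.arg12")
assert links.find("arg1.arg12") == links.find("arg1.arg11")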

class dlk.utils.parser.LossConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for LossConfigParser

class dlk.utils.parser.ManagerConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for ManagerConfigParser

class dlk.utils.parser.ModelConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for ModelConfigParser

class dlk.utils.parser.ModuleConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for ModuleConfigParser

class dlk.utils.parser.OptimizerConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for OptimizerConfigParser

class dlk.utils.parser.PostProcessorConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for PostProcessorConfigParser

class dlk.utils.parser.ProcessorConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for ProcessorConfigParser

class dlk.utils.parser.RootConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for RootConfigParser

class dlk.utils.parser.ScheduleConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for ScheduleConfigParser

class dlk.utils.parser.SubProcessorConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for SubProcessorConfigParser

class dlk.utils.parser.TaskConfigParser(config_file)[source]

Bases: dlk.utils.parser.BaseConfigParser

docstring for TaskConfigParser

dlk.utils.register module

class dlk.utils.register.Register(register_name: str)[source]

Bases: object

get(name: str = '') Any[source]

get the module by name

Parameters

name – the name should be the real name or name+@+sub_name

Returns

registed module

register(name: str = '') Callable[source]

register the name: module to self.registry

Parameters

name – the registered module name

Returns

the module
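
The register is used as a decorator that stores a module under a name and retrieves it later; a minimal sketch of the pattern (not the exact dlk implementation):

from typing import Any, Callable, Dict


class SimpleRegister:
    """Minimal name -> module registry used as a decorator."""

    def __init__(self, register_name: str):
        self.register_name = register_name
        self.registry: Dict[str, Any] = {}

    def register(self, name: str = "") -> Callable:
        def decorator(module):
            self.registry[name or module.__name__] = module
            return module
        return decorator

    def get(self, name: str) -> Any:
        return self.registry[name]


subprocessor_register = SimpleRegister("subprocessor")


@subprocessor_register.register("my_subprocessor")
class MySubProcessor:
    pass


assert subprocessor_register.get("my_subprocessor") is MySubProcessor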

dlk.utils.tokenizer_util module

class dlk.utils.tokenizer_util.PreTokenizerFactory(tokenizer: tokenizers.Tokenizer)[source]

Bases: object

property bert

bert pre_tokenizer

Returns

BertPreTokenizer

property bytelevel

byte level pre_tokenizer

Returns

ByteLevel

get(name)[source]

get the pre_tokenizer by name

Returns

pre_tokenizer

property whitespace

whitespace pre_tokenizer

Returns

Whitespace

property whitespacesplit

whitespacesplit pre_tokenizer

Returns

WhitespaceSplit

class dlk.utils.tokenizer_util.TokenizerNormalizerFactory(tokenizer: tokenizers.Tokenizer)[source]

Bases: object

get(name)[source]

get normalizers by name

Returns

Normalizer

property lowercase

do lowercase normalizers

Returns

Lowercase

property nfc

do nfc normalizers

Returns

NFC

property nfd

do nfd normalizers

Returns

NFD

property strip

do strip normalizers

Returns

StripAccents

property strip_accents

do strip normalizers

Returns

StripAccents

class dlk.utils.tokenizer_util.TokenizerPostprocessorFactory(tokenizer: tokenizers.Tokenizer)[source]

Bases: object

docstring for TokenizerPostprocessorFactory

property bert

bert postprocess

Returns

bert postprocess

get(name)[source]

get postprocess by name

Returns

postprocess
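
These factories wrap components of the huggingface tokenizers library; roughly, a name from the config maps to the corresponding tokenizers object, for example (a sketch, not dlk code):

from tokenizers import Tokenizer, normalizers
from tokenizers.models import WordPiece
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
# ["nfd", "lowercase", "strip_accents"] in the config becomes a normalizer sequence
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
# "whitespace" in the config becomes the Whitespace pre_tokenizer
tokenizer.pre_tokenizer = Whitespace()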

dlk.utils.vocab module

class dlk.utils.vocab.Vocabulary(do_strip: bool = False, unknown: str = '', ignore: str = '', pad: str = '')[source]

Bases: object

generate a vocab from tokens (a single token or an iterable of tokens); you can dump the object to a dict and load it from a dict

add(word)[source]

add one word to vocab

Parameters

word – single word

Returns

self

add_from_iter(iterator)[source]

add the tokens in iterator to vocab

Parameters

iterator – List[str] | Set[str] | List[List[str]]

Returns

self

auto_get_index(data: Union[str, List])[source]

get the index of the word(s) in data from this vocab

Parameters

data – auto detection

Returns

type the same as data

auto_update(data: Union[str, Iterable])[source]

auto detect data type to update the vocab

Parameters

data – str| List[str] | Set[str] | List[List[str]]

Returns

self

dumps() Dict[source]

dumps the object to dict

Returns

self.__dict__

filter_rare(min_freq=1, most_common=-1)[source]

filter out the words whose count is too small.

min_freq and most_common cannot both be set

Parameters
  • min_freq – minimum frequency

  • most_common – most common number, -1 means all

Returns

None

get_index(word: str) int[source]

get the index of word from this vocab

Parameters

word – a single token

Returns

index

get_word(index: int) str[source]

get the word by index

Parameters

index – word index

Returns

word

classmethod load(attr: Dict)[source]

load the object from dict

Parameters

attr – self.__dict__

Returns

initialized Vocabulary
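
A rough usage sketch of the Vocabulary workflow described above (the tokens are made up):

from dlk.utils.vocab import Vocabulary

vocab = Vocabulary(unknown="[UNK]", pad="[PAD]")
vocab.add("love")                                             # single token
vocab.add_from_iter([["deep", "learning"], ["tool", "kit"]])  # nested token lists
index = vocab.get_index("love")                               # token -> id
word = vocab.get_word(index)                                  # id -> token
restored = Vocabulary.load(vocab.dumps())                     # round-trip through a plain dict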

Module contents

Appointments

Data format

Input

For a one-sentence processor:

The input sentence is named "sentence", the label is named "labels"

The outputs are named:

    "input_ids",
    "label_ids",
    "word_ids",
    "attention_mask",
    "special_tokens_mask",
    "type_ids", 
    "sequence_ids",
    "char_ids",

For a two-sentence processor: the input sentences are named "sentence_a" and "sentence_b", the label is named "labels"

The outputs are named:

    "input_ids",
    "label_ids",
    "word_ids",
    "attention_mask",
    "special_tokens_mask",
    "type_ids", 
    "sequence_ids",
    "char_ids",

MASK

We set mask==1 for used data and mask==0 for unused data

Batch First

All data is batch-first (batch_first=True)

Task naming appointments

Every problem DLK handles is treated as a task. A task is divided into multiple sub-tasks, and a sub-task can have sub-tasks of its own. A task is defined as follows:

{
    "_name": "task_name", //or "_base", "base_task_name"
    "_link": {}, // this is reserved keywords
    "_search: {}, // this is reserved keywords"
    "sub_task1":{
    },
    "sub_task2":{
    }
}

Since every task can itself be treated as a sub-task of another task, here are the conventions for a sub-task.

This is the config format of a sub-task:

{
    "sub_task_name": {
        "_name": "sub_task_config_name",
        ...config
    }
}

or

{
    "sub_task_name": {
        "_base": "base_sub_task_config_name",
        ...additional config
    }
}

The config key identifies the sub-task.

The sub_task_name usually indicates the role this sub-task plays in the task, and each sub-task is generally handled by a dedicated dlk module. For example, the subprocessor sub-tasks of a processor task are all handled by the dlk.data.subprocessors module collection (which contains multiple subprocessors). To distinguish different subprocessors, the sub_task_name is written as subprocessor@subprocessor_name_a, which says that the subprocessor providing the subprocessor_name_a functionality in the subprocessors module collection handles it.

For the _base / _name value inside the config, the sub_task_name already contained in the key is omitted.

A sub-task's configure is named in the form AA@BB#CC,

where AA is the concrete module name in the module collection that handles sub_task_name; for example, the most common value basic means the basic module handles this sub-task, with the processing logic defined in the module named basic in the corresponding module collection.

BB indicates which problem this config handles (e.g. seq_lab/txt_cls/etc.), and CC indicates the key characteristics of the config file for this problem.

Model appointments

  • All dropout is applied to the output or inside the module; no dropout on the module input

The main file tree:

.
├── train.py-------------------------: train entry 
├── predict.py-----------------------: predict entry
├── process.py-----------------------: process entry
├── online.py------------------------: online entry
├── managers-------------------------: pytorch_lightning or other trainer
│   └── lightning.py-----------------: 
├── configures-----------------------: all default or specifical config
│   ├── core-------------------------: 
│   │   ├── callbacks----------------: 
│   │   ├── imodels------------------: 
│   │   ├── layers-------------------: 
│   │   │   ├── decoders-------------: 
│   │   │   ├── embeddings-----------: 
│   │   │   └── encoders-------------: 
│   │   ├── losses-------------------: 
│   │   ├── models-------------------: 
│   │   ├── modules------------------: 
│   │   └── optimizers---------------: 
│   ├── data-------------------------: 
│   │   ├── datamodules--------------: 
│   │   ├── processors---------------: 
│   │   └── subprocessors------------: 
│   ├── managers---------------------: 
│   └── tasks------------------------: 
├── core-----------------------------: *core* pytorch or other model code
│   ├── base_module.py---------------: base module for "layers"
│   ├── callbacks--------------------: 
│   ├── imodels----------------------: 
│   ├── layers-----------------------: 
│   │   ├── decoders-----------------: 
│   │   ├── embeddings---------------: 
│   │   └── encoders-----------------: 
│   ├── losses-----------------------: 
│   ├── models-----------------------: 
│   ├── modules----------------------: 
│   ├── optimizers-------------------: 
│   └── schedules--------------------: 
├── data-----------------------------: *core* code for data process or manager
│   ├── datamodules------------------: 
│   ├── postprocessors---------------: 
│   ├── processors-------------------: 
│   └── subprocessors----------------: 
└── utils----------------------------: 
    ├── config.py--------------------: process config(dict) toolkit
    ├── get_root.py------------------: get project root path
    ├── logger.py--------------------: logger
    ├── parser.py--------------------: parser config
    ├── register.py------------------: register the module to a registry
    ├── tokenizer_util.py------------: tokenizer util
    └── vocab.py---------------------: vocabulary

Config Parser Rules

Inherit

Simple e.g.


default.hjson
{
    _base:  parent,
    config: {
        "will_be_rewrite": 3     
    }
}

parent.hjson
{
    _name:  base_config,
    config: {
        "will_be_rewrite": 1,
        "keep": 8     
    }
}

Given the two configs named default.hjson and parent.hjson, the parsed result will be:
{
    _name:  base_config,
    config: {
        "will_be_rewrite": 3,
        "keep": 8     
    }
}

Focus(Representation)

The focus part is for simplifying the log file; we use the value of the focus dict to replace the key while logging.

SubModule(Combination)

Because we use a dict to represent a config, the key is regarded as the submodule name; but sometimes one top-level module has two or more of the same submodule (with different configs). In that case you can set the submodule name as 'submodule@special_name'.

The subprocessor config format

In subprocessors, the config is organized by processing stage (train, predict, online, etc.).

The stage config can be a dict, a str, or a tuple; each type is parsed in a different way.

  1. When the config is a dict, this is the default type; everything works as you would expect.

  2. When the config is a str, the string must be one of the stage names (train, predict, online, etc.), and that stage's config must already be defined as a dict as described in "1".

  3. When the config is a tuple (a two-element list), the first element must be a str as defined in "2", and the second element is an update config, whose type is dict (or None) as described in "1" (see the sketch below).

Some config values are set to "@"; this means you must provide this key-value pair in your own config.
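
A sketch of how the three forms could be resolved (a dict is kept as-is, a str is looked up, a tuple is resolved and then updated); this mirrors the description above rather than dlk's exact code:

from typing import Dict


def resolve_stage_config(config: Dict, stage: str) -> Dict:
    """Resolve one stage's config according to the three forms described above."""
    stage_config = config[stage]
    if isinstance(stage_config, str):              # form 2: alias of another stage
        return config[stage_config]
    if isinstance(stage_config, (list, tuple)):    # form 3: (alias, update dict)
        base_stage, update = stage_config
        resolved = dict(config[base_stage])
        resolved.update(update or {})
        return resolved
    return stage_config                            # form 1: plain dict


config = {
    "train": {"data_pair": {"label": "label_id"}, "vocab": "label_vocab"},
    "predict": "train",
    "online": ["train", {"vocab": "new_label_vocab"}],
}
assert resolve_stage_config(config, "online")["vocab"] == "new_label_vocab"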

Processor Config Example

{
    "processor": {
        "_name": "test_text_classification",
        "config": {
            "feed_order": ["load", "tokenizer", "token_gather", "label_to_id", "save"]
        },
        "subprocessor@load": {
            "_name": "load",
            "config":{
                "base_dir": "",
                "predict":{
                    "token_ids": "./token_ids.pkl",
                    "embedding": "./embedding.pkl",
                    "label_ids": "./label_ids.pkl"
                },
                "online": [
                    "predict", //base predict
                    {   // special config that updates "predict"; in this case, the config is null, which means use all config from "predict"
                    }
                ]
            }
        },
        "subprocessor@save": {
            "_name": "save",
            "config":{
                "base_dir": "",
                "train":{
                    "data.train": "./train.pkl",
                    "data.dev": "./dev.pkl",
                    "token_ids": "./token_ids.pkl",
                    "embedding": "./embedding.pkl",
                    "label_ids": "./label_ids.pkl"
                },
                "predict": {
                    "data.predict": "./predict.pkl"
                }
            }
        },
        "subprocessor@tokenizer":{
            "_base": "wordpiece_tokenizer",
            "config": {
                "train": { // you can add some whitespace surround the '&' 
                    "data_set": {                   // for different stage, this processor will process different part of data
                        "train": ["train", "dev"],
                        "predict": ["predict"],
                        "online": ["online"]
                    },
                    "config_path": "./token.json",
                    "normalizer": ["nfd", "lowercase", "strip_accents", "some_processor_need_config": {config}], // if don't set this, will use the default normalizer from config
                    "pre_tokenizer": ["whitespace": {}], // if don't set this, will use the default normalizer from config
                    "post_processor": "bert", // if don't set this, will use the default normalizer from config, WARNING: not support disable  the default setting( so the default tokenizer.post_tokenizer should be null and only setting in this configure)
                    "filed_map": { // this is the default value, you can provide other name
                        "tokens": "tokens",
                        "ids": "ids",
                        "attention_mask": "attention_mask",
                        "type_ids": "type_ids",
                        "special_tokens_mask": "special_tokens_mask",
                        "offsets": "offsets",
                    }, // map the tokenizer outputs (the keys) to the values
                    "data_type": "single", // single or pair; if not provided, it is calculated from len(process_data)
                    "process_data": [
                        ["sentence", { "is_pretokenized": false}], 
                    ],
                    /*"data_type": "pair", // single or pair*/
                    /*"process_data": [*/
                        /*['sentence_a', { "is_pretokenized": false}], */ 
                        /*['sentence_b', {}], the config of the second data must as same as the first*/ 
                    /*],*/
                },
                "predict": "train",
                "online": "train"
            }
        },
        "subprocessor@token_gather":{
            "_name": "token_gather",
            "config": {
                "train": { // only train stage using
                    "data_set": {                   // for different stage, this processor will process different part of data
                        "train": ["train", "dev"]
                    },
                    "gather_columns": ["label"], //List of columns. Every cell must be sigle token or list of tokens or set of tokens
                    "deliver": "label_vocab", // output Vocabulary object (the Vocabulary of labels) name. 
                    "update": null, // null or another Vocabulary object to update
                }
            }
        },
        "subprocessor@label_to_id":{
            "_name": "token2id",
            "config": {
                "train":{ //train、predict、online stage config,  using '&' split all stages
                    "data_pair": {
                        "label": "label_id"
                    },
                    "data_set": {                   // for different stage, this processor will process different part of data
                        "train": ['train', 'dev'],
                        "predict": ['predict'],
                        "online": ['online']
                    },
                    "vocab": "label_vocab", // usually provided by the "token_gather" module
                },
                "predict": "train",
                "online": "train",
            }
        }
    }
}

To Process Data Format Example

You can provide the dataframe format yourself, or use the task_name_loader (if provided, or you can write one) to load your dict-format data into a dataframe.

{
    "data": {
        "train": pd.DataFrame, // may include these columns "uuid"、"origin"、"label"
        "dev": pd.DataFrame, // may include these columns "uuid"、"origin"、"label"
    }
}
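
For example, a minimal hand-built input could look like this (column names and values are only an illustration; the columns you actually need depend on your processor config):

import pandas as pd

to_process = {
    "data": {
        "train": pd.DataFrame({
            "uuid": ["id-1", "id-2"],
            "sentence": ["dlk is a toolkit", "deep learning is fun"],
            "label": ["positive", "positive"],
        }),
        "dev": pd.DataFrame({
            "uuid": ["id-3"],
            "sentence": ["a short dev example"],
            "label": ["negative"],
        }),
    }
}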

Processed Data Format Example

{
    "data": {
        "train": pd.DataFrame, // may include these columns "uuid"、"origin"、"labels"、"origin_tokens"、"label_ids"、"origin_token_ids"
        "dev": pd.DataFrame, // may include these columns "uuid"、"origin"、"labels"、"origin_tokens"、"label_ids"、"origin_token_ids"
    },
    "embedding": ..,
    "token_vocab": ..,
    "label_vocab": ..,
    ...
}
