Deep Learning ToolKit
A Deep Learning ToolKit
This project is a work in progress (WIP).
Install
pip install dlk
or
or clone this repo and run
python setup.py install
What does this project do?
Provide a template for deep learning (especially NLP) training and deployment.
Provide hyperparameter search.
Provide basic architecture search.
Provide some basic modules and models.
Provide reuse of pretrained models for prediction.
More Features Are Coming
- Generate models.
- Distill structure.
- Computer vision support.
- Online service
  - Provide a web server for online prediction.
- One optimizer, different parameter groups using different schedulers (diff_schedule).
- Support LightGBM; maybe not necessary? Will split to another package.
- Make most modules, like CRF, scriptable.
- Add unit tests:
  - Parser
  - Tokenizer
  - Config
  - Link
dlk.core package
Subpackages
dlk.core.callbacks package
Submodules
dlk.core.callbacks.checkpoint module
- class dlk.core.callbacks.checkpoint.CheckpointCallback(config: dlk.core.callbacks.checkpoint.CheckpointCallbackConfig)[source]
Bases:
object
Save checkpoint decided by config
- class dlk.core.callbacks.checkpoint.CheckpointCallbackConfig(config: Dict)[source]
Bases:
object
Config for CheckpointCallback
- Config Example:
>>> { // default checkpoint configure
>>>     "_name": "checkpoint",
>>>     "config": {
>>>         "monitor": "*@*",                 // monitor which metric or logged value
>>>         "save_top_k": 3,                  // save top k
>>>         "mode": "*@*",                    // "max" or "min", select top-k max or min checkpoints; min for loss, max for acc
>>>         "save_last": true,                // always save the last checkpoint
>>>         "auto_insert_metric_name": true,  // whether the saved file name includes the metric name
>>>         "every_n_train_steps": null,      // number of training steps between checkpoints
>>>         "every_n_epochs": 1,              // number of epochs between checkpoints
>>>         "save_on_train_epoch_end": false, // whether to run checkpointing at the end of the training epoch; if false, the check runs at the end of validation
>>>         "save_weights_only": false,       // if true, save only the weights and skip other state like the optimizer, etc.
>>>     }
>>> }
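A minimal construction sketch based on the documented signatures CheckpointCallbackConfig(config: Dict) and CheckpointCallback(config); it assumes the full example dict above (including "_name") is what the config class expects, and fills the "*@*" placeholders with assumed concrete values.
>>> from dlk.core.callbacks.checkpoint import CheckpointCallback, CheckpointCallbackConfig
>>> config = CheckpointCallbackConfig({
>>>     "_name": "checkpoint",
>>>     "config": {
>>>         "monitor": "val_loss",   # assumed value for the "*@*" placeholder
>>>         "save_top_k": 3,
>>>         "mode": "min",           # "min" because the monitored value is a loss
>>>         "save_last": True,
>>>         "auto_insert_metric_name": True,
>>>         "every_n_train_steps": None,
>>>         "every_n_epochs": 1,
>>>         "save_on_train_epoch_end": False,
>>>         "save_weights_only": False,
>>>     },
>>> })
>>> callback = CheckpointCallback(config)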
dlk.core.callbacks.early_stop module
- class dlk.core.callbacks.early_stop.EarlyStoppingCallback(config: dlk.core.callbacks.early_stop.EarlyStoppingCallbackConfig)[source]
Bases:
object
Early stop decided by config
- class dlk.core.callbacks.early_stop.EarlyStoppingCallbackConfig(config: Dict)[source]
Bases:
object
Config for EarlyStoppingCallback
- Config Example:
>>> {
>>>     "_name": "early_stop",
>>>     "config": {
>>>         "monitor": "val_loss",
>>>         "mode": "*@*",                    // min or max; min when the monitor is a loss, max when it is acc, f1, etc.
>>>         "patience": 3,
>>>         "min_delta": 0.0,
>>>         "check_on_train_epoch_end": null,
>>>         "strict": true,                   // if the monitor is not right, raise an error
>>>         "stopping_threshold": null,       // float, if the value is good enough, stop
>>>         "divergence_threshold": null,     // float, if the value is too bad, stop
>>>         "verbose": true,                  // verbose mode prints more info
>>>     }
>>> }
dlk.core.callbacks.lr_monitor module
- class dlk.core.callbacks.lr_monitor.LearningRateMonitorCallback(config: dlk.core.callbacks.lr_monitor.LearningRateMonitorCallbackConfig)[source]
Bases:
object
Monitor the learning rate
- class dlk.core.callbacks.lr_monitor.LearningRateMonitorCallbackConfig(config: Dict)[source]
Bases:
object
Config for LearningRateMonitorCallback
- Config Example:
>>> {
>>>     "_name": "lr_monitor",
>>>     "config": {
>>>         "logging_interval": null, // set to null to log at the individual interval given by each scheduler's "interval" key; other values: "step", "epoch"
>>>         "log_momentum": true,     // log momentum or not
>>>     }
>>> }
dlk.core.callbacks.weight_average module
- class dlk.core.callbacks.weight_average.StochasticWeightAveragingCallback(config: dlk.core.callbacks.weight_average.StochasticWeightAveragingCallbackConfig)[source]
Bases:
object
Average weight by config
- class dlk.core.callbacks.weight_average.StochasticWeightAveragingCallbackConfig(config)[source]
Bases:
object
Config for StochasticWeightAveragingCallback
- Config Example:
>>> { //weight_average default >>> "_name": "weight_average", >>> "config": { >>> "swa_epoch_start": 0.8, // swa start epoch >>> "swa_lrs": null, >>> //None. Use the current learning rate of the optimizer at the time the SWA procedure starts. >>> //float. Use this value for all parameter groups of the optimizer. >>> //List[float]. A list values for each parameter group of the optimizer. >>> "annealing_epochs": 10, >>> "annealing_strategy": 'cos', >>> "device": null, // save device, null for auto detach, if the gpu is oom, you should change this to 'cpu' >>> } >>> }
Module contents
callbacks
dlk.core.imodels package
Submodules
dlk.core.imodels.basic module
- class dlk.core.imodels.basic.BasicIModel(config: dlk.core.imodels.basic.BasicIModelConfig, checkpoint=False)[source]
Bases:
pytorch_lightning.core.lightning.LightningModule
,dlk.core.imodels.GatherOutputMixin
- property epoch_training_steps: int
Training steps per epoch, inferred from the datamodule and devices.
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
do forward on a mini batch
- Parameters
batch – a mini batch inputs
- Returns
the outputs
- get_progress_bar_dict()[source]
rewrite the progress_bar_dict, removing the 'v_num' key which we don't need
- Returns
progress_bar dict
- property num_training_epochs: int
Total training epochs inferred from datamodule and devices.
- property num_training_steps: int
Total training steps inferred from datamodule and devices.
- predict_step(batch: Dict, batch_idx: int) Dict [source]
do predict on a mini batch
- Parameters
batch – a mini batch inputs
batch_idx – the index(dataloader) of the mini batch
- Returns
the outputs
- test_epoch_end(outputs: List[Dict]) List[Dict] [source]
Gather the outputs of all nodes and postprocess them.
- Parameters
outputs – current node returned output list
- Returns
all node outputs
- test_step(batch: Dict[str, torch.Tensor], batch_idx: int) Dict [source]
do test on a mini batch
The outputs only gather the keys in self.gather_data.keys for postprocess
- Parameters
batch – a mini batch inputs
batch_idx – the index(dataloader) of the mini batch
- Returns
the outputs
- training: bool
- training_step(batch: Dict[str, torch.Tensor], batch_idx: int)[source]
do training_step on a mini batch
- Parameters
batch – a mini batch inputs
batch_idx – the index(dataloader) of the mini batch
- Returns
the outputs
- validation_epoch_end(outputs: List[Dict]) List[Dict] [source]
Gather the outputs of all nodes and postprocess them.
The outputs only gather the keys in self.gather_data.keys for postprocess
- Parameters
outputs – current node returned output list
- Returns
all node outputs
- validation_step(batch: Dict[str, torch.Tensor], batch_idx: int) Dict[str, torch.Tensor] [source]
do validation on a mini batch
The outputs only gather the keys in self.gather_data.keys for postprocess
- Parameters
batch – a mini batch inputs
batch_idx – the index(dataloader) of the mini batch
- Returns
the outputs
- class dlk.core.imodels.basic.BasicIModelConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
basic imodel config will provide all the config for model/optimizer/loss/scheduler/postprocess
- get_loss(config: Dict)[source]
Use config to init the loss
- Parameters
config – loss config
- Returns
Loss, LossConfig
- get_model(config: Dict)[source]
Use config to init the model
- Parameters
config – model config
- Returns
Model, ModelConfig
- get_optimizer(config: Dict)[source]
Use config to init the optimizer
- Parameters
config – optimizer config
- Returns
Optimizer, OptimizerConfig
dlk.core.imodels.distill module
Module contents
imodels
- class dlk.core.imodels.GatherOutputMixin[source]
Bases:
object
gather all the small batches' outputs into a big batch
- concat_list_of_dict_outputs(outputs: List[Dict]) Dict [source]
only supports outputs where all entries have the same dim; now deprecated.
- Parameters
outputs – multi node returned output (list of dict)
- Returns
Concat all list by name
- gather_outputs(outputs: List[Dict])[source]
gather the dist outputs
- Parameters
outputs – one node outputs
- Returns
all outputs
- static proc_dist_outputs(dist_outputs: List[Dict]) List[Dict] [source]
gather all distributed outputs into outputs that look as if they came from a single worker.
- Parameters
dist_outputs – the inputs of pytorch_lightning train/test/.._epoch_end when using ddp
- Returns
the inputs of pytorch_lightning train/test/.._epoch_end when only run on one worker.
dlk.core.initmethods package
Submodules
dlk.core.initmethods.default module
- class dlk.core.initmethods.default.DefaultInit(config: dlk.core.initmethods.default.DefaultInitConfig)[source]
Bases:
object
default method for init the modules
- class dlk.core.initmethods.default.DefaultInitConfig(config)[source]
Bases:
dlk.utils.config.BaseConfig
Config for DefaultInit
- Config Example:
>>> { >>> "_name": "default", >>> "config": { >>> } >>> }
dlk.core.initmethods.range_norm module
- class dlk.core.initmethods.range_norm.RangeNormInit(config: dlk.core.initmethods.range_norm.RangeNormInitConfig)[source]
Bases:
object
default for transformers init method
- class dlk.core.initmethods.range_norm.RangeNormInitConfig(config)[source]
Bases:
dlk.utils.config.BaseConfig
Config for RangeNormInit
- Config Example:
>>> { >>> "_name": "range_norm", >>> "config": { >>> "range": 0.1, >>> } >>> }
dlk.core.initmethods.range_uniform module
- class dlk.core.initmethods.range_uniform.RangeUniformInit(config: dlk.core.initmethods.range_uniform.RangeUniformInitConfig)[source]
Bases:
object
for transformers
- class dlk.core.initmethods.range_uniform.RangeUniformInitConfig(config)[source]
Bases:
dlk.utils.config.BaseConfig
Config for RangeUniformInit
- Config Example:
>>> { >>> "_name": "range_uniform", >>> "config": { >>> "range": 0.1, >>> } >>> }
Module contents
initmethods
dlk.core.layers package
Subpackages
dlk.core.layers.decoders package
- class dlk.core.layers.decoders.identity.IdentityDecoder(config: dlk.core.layers.decoders.identity.IdentityDecoderConfig)[source]
Bases:
dlk.core.base_module.SimpleModule
Do nothing
- training: bool
- class dlk.core.layers.decoders.identity.IdentityDecoderConfig(config)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for IdentityDecoder
- Config Example:
>>> { >>> "config": { >>> }, >>> "_name": "identity", >>> }
- class dlk.core.layers.decoders.linear.Linear(config: dlk.core.layers.decoders.linear.LinearConfig)[source]
Bases:
dlk.core.base_module.SimpleModule
wrap for torch.nn.Linear
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
All step do this
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- init_weight(method: Callable)[source]
init the weight of submodules by ‘method’
- Parameters
method – init method
- Returns
None
- training: bool
- class dlk.core.layers.decoders.linear.LinearConfig(config: Dict)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for Linear
- Config Example:
>>> { >>> "module": { >>> "_base": "linear", >>> }, >>> "config": { >>> "input_size": "*@*", >>> "output_size": "*@*", >>> "pool": null, >>> "dropout": 0.0, >>> "output_map": {}, >>> "input_map": {}, // required_key: provide_key >>> }, >>> "_link":{ >>> "config.input_size": ["module.config.input_size"], >>> "config.output_size": ["module.config.output_size"], >>> "config.pool": ["module.config.pool"], >>> "config.dropout": ["module.config.dropout"], >>> }, >>> "_name": "linear", >>> }
- class dlk.core.layers.decoders.linear_crf.LinearCRF(config: dlk.core.layers.decoders.linear_crf.LinearCRFConfig)[source]
Bases:
dlk.core.base_module.BaseModule
use torch.nn.Linear to get the emission probability and feed it to the CRF
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
do predict, only get the predict labels
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- init_weight(method: Callable)[source]
init the weight of submodules by ‘method’
- Parameters
method – init method
- Returns
None
- predict_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
do predict, only get the predict labels
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- training: bool
- class dlk.core.layers.decoders.linear_crf.LinearCRFConfig(config: Dict)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for LinearCRF
- Config Example:
>>> { >>> "module@linear": { >>> "_base": "linear", >>> }, >>> "module@crf": { >>> "_base": "crf", >>> }, >>> "config": { >>> "input_size": "*@*", // the linear input_size >>> "output_size": "*@*", // the linear output_size >>> "reduction": "mean", // crf reduction method >>> "output_map": {}, //provide_key: output_key >>> "input_map": {} // required_key: provide_key >>> }, >>> "_link":{ >>> "config.input_size": ["module@linear.config.input_size"], >>> "config.output_size": ["module@linear.config.output_size", "module@crf.config.output_size"], >>> "config.reduction": ["module@crf.config.reduction"], >>> } >>> "_name": "linear_crf", >>> }
decoders
dlk.core.layers.embeddings package
- class dlk.core.layers.embeddings.combine_word_char_cnn.CombineWordCharCNNEmbedding(config: dlk.core.layers.embeddings.combine_word_char_cnn.CombineWordCharCNNEmbeddingConfig)[source]
Bases:
dlk.core.base_module.SimpleModule
from ‘input_ids’ and ‘char_ids’ generate ‘embedding’
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
get the combine char and word embedding
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- init_weight(method: Callable)[source]
init the weight of submodules by ‘method’
- Parameters
method – init method
- Returns
None
- training: bool
- class dlk.core.layers.embeddings.combine_word_char_cnn.CombineWordCharCNNEmbeddingConfig(config: Dict)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for CombineWordCharCNNEmbedding
- Config Example:
>>> { >>> "_name": "combine_word_char_cnn", >>> "embedding@char": { >>> "_base": "static_char_cnn", >>> }, >>> "embedding@word": { >>> "_base": "static", >>> }, >>> "config": { >>> "word": { >>> "embedding_file": "*@*", //the embedding file, must be saved as numpy array by pickle >>> "embedding_dim": "*@*", >>> "embedding_trace": ".", //default the file itself is the embedding >>> "freeze": false, // is freeze >>> "padding_idx": 0, //dropout rate >>> "output_map": {"embedding": "word_embedding"}, >>> "input_map": {}, // required_key: provide_key >>> }, >>> "char": { >>> "embedding_file": "*@*", //the embedding file, must be saved as numpy array by pickle >>> "embedding_dim": 35, //dropout rate >>> "embedding_trace": ".", //default the file itself is the embedding >>> "freeze": false, // is freeze >>> "kernel_sizes": [3], //dropout rate >>> "padding_idx": 0, >>> "output_map": {"char_embedding": "char_embedding"}, >>> "input_map": {"char_ids": "char_ids"}, >>> }, >>> "dropout": 0, //dropout rate >>> "embedding_dim": "*@*", // this must equal to char.embedding_dim + word.embedding_dim >>> "output_map": {"embedding": "embedding"}, // this config do nothing, you can change this >>> "input_map": {"char_embedding": "char_embedding", 'word_embedding': "word_embedding"}, // if the output of char and word embedding changed, you also should change this >>> }, >>> "_link":{ >>> "config.word.embedding_file": ["embedding@word.config.embedding_file"], >>> "config.word.embedding_dim": ["embedding@word.config.embedding_dim"], >>> "config.word.embedding_trace": ["embedding@word.config.embedding_trace"], >>> "config.word.freeze": ["embedding@word.config.freeze"], >>> "config.word.padding_idx": ["embedding@word.config.padding_idx"], >>> "config.word.output_map": ["embedding@word.config.output_map"], >>> "config.word.input_map": ["embedding@word.config.input_map"], >>> "config.char.embedding_file": ["embedding@char.config.embedding_file"], >>> "config.char.embedding_dim": ["embedding@char.config.embedding_dim"], >>> "config.char.embedding_trace": ["embedding@char.config.embedding_trace"], >>> "config.char.freeze": ["embedding@char.config.freeze"], >>> "config.char.kernel_sizes": ["embedding@char.config.kernel_sizes"], >>> "config.char.padding_idx": ["embedding@char.config.padding_idx"], >>> "config.char.output_map": ["embedding@char.config.output_map"], >>> "config.char.input_map": ["embedding@char.config.input_map"], >>> }, >>> }
- class dlk.core.layers.embeddings.identity.IdentityEmbedding(config: dlk.core.layers.embeddings.identity.IdentityEmbeddingConfig)[source]
Bases:
dlk.core.base_module.SimpleModule
Do nothing
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
return inputs
- Parameters
inputs – anything
- Returns
inputs
- training: bool
- class dlk.core.layers.embeddings.identity.IdentityEmbeddingConfig(config)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for IdentityEmbedding
- Config Example:
>>> { >>> "config": { >>> }, >>> "_name": "identity", >>> }
- class dlk.core.layers.embeddings.pretrained_transformers.PretrainedTransformers(config: dlk.core.layers.embeddings.pretrained_transformers.PretrainedTransformersConfig)[source]
Bases:
dlk.core.base_module.SimpleModule
Wrap the huggingface transformers
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
get the transformers output as embedding
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- init_weight(method)[source]
init the weight of submodules by ‘method’
- Parameters
method – init method
- Returns
None
- training: bool
- class dlk.core.layers.embeddings.pretrained_transformers.PretrainedTransformersConfig(config: Dict)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for PretrainedTransformers
- Config Example1:
>>> { >>> "module": { >>> "_base": "roberta", >>> }, >>> "config": { >>> "pretrained_model_path": "*@*", >>> "input_map": { >>> "input_ids": "input_ids", >>> "attention_mask": "attention_mask", >>> "type_ids": "type_ids", >>> }, >>> "output_map": { >>> "embedding": "embedding", >>> }, >>> "dropout": 0, //dropout rate >>> "embedding_dim": "*@*", >>> }, >>> "_link": { >>> "config.pretrained_model_path": ["module.config.pretrained_model_path"], >>> }, >>> "_name": "pretrained_transformers", >>> }
- Config Example2:
>>> for gather embedding >>> { >>> "module": { >>> "_base": "roberta", >>> }, >>> "config": { >>> "pretrained_model_path": "*@*", >>> "input_map": { >>> "input_ids": "input_ids", >>> "attention_mask": "subword_mask", >>> "type_ids": "type_ids", >>> "gather_index": "gather_index", >>> }, >>> "output_map": { >>> "embedding": "embedding", >>> }, >>> "embedding_dim": "*@*", >>> "dropout": 0, //dropout rate >>> }, >>> "_link": { >>> "config.pretrained_model_path": ["module.config.pretrained_model_path"], >>> }, >>> "_name": "pretrained_transformers", >>> }
- class dlk.core.layers.embeddings.random.RandomEmbedding(config: dlk.core.layers.embeddings.random.RandomEmbeddingConfig)[source]
Bases:
dlk.core.base_module.SimpleModule
from ‘input_ids’ generate ‘embedding’
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
get the random embedding
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- init_weight(method: Callable)[source]
init the weight of submodules by ‘method’
- Parameters
method – init method
- Returns
None
- training: bool
- class dlk.core.layers.embeddings.random.RandomEmbeddingConfig(config: Dict)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for RandomEmbedding
- Config Example:
>>> { >>> "config": { >>> "vocab_size": "*@*", >>> "embedding_dim": "*@*", >>> "dropout": 0, //dropout rate >>> "padding_idx": 0, //dropout rate >>> "output_map": {}, >>> "input_map": {}, >>> }, >>> "_name": "random", >>> }
- class dlk.core.layers.embeddings.static.StaticEmbedding(config: dlk.core.layers.embeddings.static.StaticEmbeddingConfig)[source]
Bases:
dlk.core.base_module.SimpleModule
from ‘input_ids’ generate static ‘embedding’ like glove, word2vec
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
get the pretrained static embedding like glove word2vec
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- init_weight(method)[source]
init the weight of submodules by ‘method’
- Parameters
method – init method
- Returns
None
- training: bool
- class dlk.core.layers.embeddings.static.StaticEmbeddingConfig(config: Dict)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for StaticEmbedding
- Config Example:
>>> {
>>>     "config": {
>>>         "embedding_file": "*@*",  // the embedding file, must be saved as a numpy array by pickle
>>>         "embedding_dim": "*@*",
>>>         // if the embedding_file is a dict, you should provide the dict trace to the embedding
>>>         "embedding_trace": ".",   // default: the file itself is the embedding
>>>         /* embedding_trace: "embedding",      // this means <embedding = pickle.load(embedding_file)["embedding"]> */
>>>         /* embedding_trace: "meta.embedding", // this means <embedding = pickle.load(embedding_file)['meta']["embedding"]> */
>>>         "freeze": false,          // whether to freeze the embedding
>>>         "padding_idx": 0,
>>>         "dropout": 0,             // dropout rate
>>>         "output_map": {},
>>>         "input_map": {},          // required_key: provide_key
>>>     },
>>>     "_name": "static",
>>> }
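A short sketch of how the embedding_trace comments above read, assuming the file is a pickled dict; the file name is hypothetical and the real loading code in dlk may differ.
>>> import pickle
>>> with open("token_embedding.pkl", "rb") as f:   # hypothetical file name
>>>     data = pickle.load(f)
>>> embedding_trace = "meta.embedding"             # "." would mean the loaded object itself is the embedding
>>> embedding = data
>>> if embedding_trace != ".":
>>>     for key in embedding_trace.split("."):     # e.g. data["meta"]["embedding"]
>>>         embedding = embedding[key]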
- class dlk.core.layers.embeddings.static_char_cnn.StaticCharCNNEmbedding(config: dlk.core.layers.embeddings.static_char_cnn.StaticCharCNNEmbeddingConfig)[source]
Bases:
dlk.core.base_module.SimpleModule
from ‘char_ids’ generate ‘embedding’
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
feed the char embedding to the CNN and pool it to word_embedding
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- init_weight(method)[source]
init the weight of submodules by ‘method’
- Parameters
method – init method
- Returns
None
- training: bool
- class dlk.core.layers.embeddings.static_char_cnn.StaticCharCNNEmbeddingConfig(config: Dict)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for StaticCharCNNEmbedding
- Config Example:
>>> { >>> "module@cnn": { >>> "_base": "conv1d", >>> config: { >>> in_channels: -1, >>> out_channels: -1, //will update while load embedding >>> kernel_sizes: [3], >>> }, >>> }, >>> "config": { >>> "embedding_file": "*@*", //the embedding file, must be saved as numpy array by pickle >>> //if the embedding_file is a dict, you should provide the dict trace to embedding >>> "embedding_trace": ".", //default the file itself is the embedding >>> /*embedding_trace: "char_embedding", //this means the <embedding = pickle.load(embedding_file)["char_embedding"]>*/ >>> /*embedding_trace: "meta.char_embedding", //this means the <embedding = pickle.load(embedding_file)['meta']["char_embedding"]>*/ >>> "freeze": false, // is freeze >>> "dropout": 0, //dropout rate >>> "embedding_dim": 35, //dropout rate >>> "kernel_sizes": [3], //dropout rate >>> "padding_idx": 0, >>> "output_map": {"char_embedding": "char_embedding"}, >>> "input_map": {"char_ids": "char_ids"}, >>> }, >>> "_link":{ >>> "config.embedding_dim": ["module@cnn.config.in_channels", "module@cnn.config.out_channels"], >>> "config.kernel_sizes": ["module@cnn.config.kernel_sizes"], >>> }, >>> "_name": "static_char_cnn", >>> }
embeddings
- class dlk.core.layers.embeddings.EmbeddingInput(**args)[source]
Bases:
object
docstring for EmbeddingInput
dlk.core.layers.encoders package
- class dlk.core.layers.encoders.identity.IdentityEncoder(config: dlk.core.layers.encoders.identity.IdentityEncoderConfig)[source]
Bases:
dlk.core.base_module.SimpleModule
Do nothing
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
return inputs
- Parameters
inputs – anything
- Returns
inputs
- training: bool
- class dlk.core.layers.encoders.identity.IdentityEncoderConfig(config)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for IdentityEncoder
- Config Example:
>>> { >>> "config": { >>> }, >>> "_name": "identity", >>> }
- class dlk.core.layers.encoders.linear.Linear(config: dlk.core.layers.encoders.linear.LinearConfig)[source]
Bases:
dlk.core.base_module.SimpleModule
wrap for torch.nn.Linear
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
All step do this
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- init_weight(method: Callable)[source]
init the weight of submodules by ‘method’
- Parameters
method – init method
- Returns
None
- training: bool
- class dlk.core.layers.encoders.linear.LinearConfig(config: Dict)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for Linear
- Config Example:
>>> { >>> "module": { >>> "_base": "linear", >>> }, >>> "config": { >>> "input_size": "*@*", >>> "output_size": "*@*", >>> "pool": null, >>> "dropout": 0.0, >>> "output_map": {}, >>> "input_map": {}, // required_key: provide_key >>> }, >>> "_link":{ >>> "config.input_size": ["module.config.input_size"], >>> "config.output_size": ["module.config.output_size"], >>> "config.pool": ["module.config.pool"], >>> }, >>> "_name": "linear", >>> }
- class dlk.core.layers.encoders.lstm.LSTM(config: dlk.core.layers.encoders.lstm.LSTMConfig)[source]
Bases:
dlk.core.base_module.SimpleModule
Wrap for torch.nn.LSTM
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
All step do this
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- init_weight(method: Callable)[source]
init the weight of submodules by ‘method’
- Parameters
method – init method
- Returns
None
- training: bool
- class dlk.core.layers.encoders.lstm.LSTMConfig(config: Dict)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for LSTM
- Config Example:
>>> { >>> module: { >>> _base: "lstm", >>> }, >>> config: { >>> input_map: {}, >>> output_map: {}, >>> input_size: *@*, >>> output_size: "*@*", >>> num_layers: 1, >>> dropout: "*@*", // dropout between layers >>> }, >>> _link: { >>> config.input_size: [module.config.input_size], >>> config.output_size: [module.config.output_size], >>> config.dropout: [module.config.dropout], >>> }, >>> _name: "lstm", >>> }
encoders
Module contents
dlk.core.losses package
Submodules
dlk.core.losses.bce module
- class dlk.core.losses.bce.BCEWithLogitsLoss(config: dlk.core.losses.bce.BCEWithLogitsLossConfig)[source]
Bases:
object
binary cross-entropy for binary classification
- calc(result, inputs, rt_config)[source]
calc the loss; the prediction comes from result, the ground truth comes from inputs
- Parameters
result – the model predict dict
inputs – all the inputs for the model
rt_config – provide the current training status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
- Returns
loss
- update_config(rt_config: Dict)[source]
callback for imodel to update the total steps and epochs
when the loss module is initialized, the total steps and epochs are not yet known; once all the data is ready, the imodel updates these values for the loss module
- Parameters
rt_config – { “total_steps”: self.num_training_steps, “total_epochs”: self.num_training_epochs}
- Returns
None
- class dlk.core.losses.bce.BCEWithLogitsLossConfig(config: Dict)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for BCEWithLogitsLoss
- Config Example:
>>> { >>> "config": { >>> "pred_truth_pair": [], # len(.) == 2, the 1st is the pred_name, 2nd is truth_name in __call__ inputs >>> "schedule": [1], >>> "masked_select": null, // if provide, only select the masked(=1) data >>> "scale": [1], # scale the loss for every schedule stage >>> // "schdeule": [0.3, 1.0], # can be a list or str >>> // "scale": "[0.5, 1]", >>> }, >>> "_name": "bce", >>> }
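The schedule/scale comments above suggest stage-wise loss scaling. The following is a hedged reading only (not the dlk source): schedule marks stage boundaries as fractions of the total steps, and scale is the multiplier applied to the loss inside each stage.
>>> def current_scale(step, total_steps, schedule=(0.3, 1.0), scale=(0.5, 1.0)):
>>>     # hypothetical helper: return the scale of the first stage whose
>>>     # boundary (a fraction of total_steps) has not yet been passed
>>>     progress = step / max(1, total_steps)
>>>     for boundary, stage_scale in zip(schedule, scale):
>>>         if progress <= boundary:
>>>             return stage_scale
>>>     return scale[-1]
>>> # usage idea: loss = current_scale(step, total_steps) * raw_loss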
dlk.core.losses.cross_entropy module
- class dlk.core.losses.cross_entropy.CrossEntropyLoss(config: dlk.core.losses.cross_entropy.CrossEntropyLossConfig)[source]
Bases:
object
for multi class classification
- calc(result, inputs, rt_config)[source]
calc the loss; the prediction comes from result, the ground truth comes from inputs
- Parameters
result – the model predict dict
inputs – all the inputs for the model
rt_config – provide the current training status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
- Returns
loss
- update_config(rt_config)[source]
callback for imodel to update the total steps and epochs
when the loss module is initialized, the total steps and epochs are not yet known; once all the data is ready, the imodel updates these values for the loss module
- Parameters
rt_config – { “total_steps”: self.num_training_steps, “total_epochs”: self.num_training_epochs}
- Returns
None
- class dlk.core.losses.cross_entropy.CrossEntropyLossConfig(config: Dict)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for CrossEntropyLoss
- Config Example:
>>> { >>> "config": { >>> "ignore_index": -1, >>> "weight": null, # or a list of value for every class >>> "label_smoothing": 0.0, # torch>=1.10 >>> "pred_truth_pair": [], # len(.) == 2, the 1st is the pred_name, 2nd is truth_name in __call__ inputs >>> "schedule": [1], >>> "scale": [1], # scale the loss for every schedule stage >>> // "schdeule": [0.3, 1.0], # can be a list or str >>> // "scale": "[0.5, 1]", >>> }, >>> "_name": "cross_entropy", >>> }
dlk.core.losses.identity module
- class dlk.core.losses.identity.IdentityLoss(config: dlk.core.losses.identity.IdentityLossConfig)[source]
Bases:
object
gather the loss and return it when the loss has been calculated by a previous module, like CRF
- calc(result, inputs, rt_config)[source]
calc the loss; the prediction comes from result, the ground truth comes from inputs
- Parameters
result – the model predict dict
inputs – all the inputs for the model
rt_config – provide the current training status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
- Returns
loss
- update_config(rt_config)[source]
callback for imodel to update the total steps and epochs
when the loss module is initialized, the total steps and epochs are not yet known; once all the data is ready, the imodel updates these values for the loss module
- Parameters
rt_config – { “total_steps”: self.num_training_steps, “total_epochs”: self.num_training_epochs}
- Returns
None
- class dlk.core.losses.identity.IdentityLossConfig(config: Dict)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for IdentityLoss
- Config Example:
>>> { >>> config: { >>> "schedule": [1], >>> "scale": [1], # scale the loss for every schedule >>> // "schedule": [0.3, 1.0], # can be a list or str >>> // "scale": "[0.5, 1]", >>> "loss": "loss", // the real loss from result['loss'] >>> }, >>> _name: "identity", >>> }
dlk.core.losses.mse module
- class dlk.core.losses.mse.MSELoss(config: dlk.core.losses.mse.MSELossConfig)[source]
Bases:
object
mse loss for regression, distill, etc.
- calc(result, inputs, rt_config)[source]
calc the loss; the prediction comes from result, the ground truth comes from inputs
- Parameters
result – the model predict dict
inputs – all the inputs for the model
rt_config – provide the current training status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
- Returns
loss
- update_config(rt_config)[source]
callback for imodel to update the total steps and epochs
when the loss module is initialized, the total steps and epochs are not yet known; once all the data is ready, the imodel updates these values for the loss module
- Parameters
rt_config – { “total_steps”: self.num_training_steps, “total_epochs”: self.num_training_epochs}
- Returns
None
- class dlk.core.losses.mse.MSELossConfig(config: Dict)[source]
Bases:
dlk.core.base_module.BaseModuleConfig
Config for MSELoss
- Config Example:
>>> { >>> "config": { >>> "pred_truth_pair": [], # len(.) == 2, the 1st is the pred_name, 2nd is truth_name in __call__ inputs >>> "schedule": [1], >>> "masked_select": null, // if provide, only select the masked(=1) data >>> "scale": [1], # scale the loss for every schedule stage >>> // "schdeule": [0.3, 1.0], # can be a list or str >>> // "scale": "[0.5, 1]", >>> }, >>> "_name": "mse", >>> }
dlk.core.losses.multi_loss module
- class dlk.core.losses.multi_loss.MultiLoss(config: dlk.core.losses.multi_loss.MultiLossConfig)[source]
Bases:
object
This module is not implemented yet; don't use it
- calc(result, inputs, rt_config)[source]
calc the loss; the prediction comes from result, the ground truth comes from inputs
- Parameters
result – the model predict dict
inputs – all the inputs for the model
rt_config – provide the current training status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
- Returns
loss
- class dlk.core.losses.multi_loss.MultiLossConfig(config: Dict)[source]
Bases:
object
Config for MultiLoss
- Config Example:
>>> { >>> "loss@the_first": { >>> config: { >>> "ignore_index": -1, >>> "weight": null, # or a list of value for every class >>> "label_smoothing": 0.0, # torch>=1.10 >>> "pred_truth_pair": ["logits1", "label1"], # len(.) == 2, the 1st is the pred_name, 2nd is truth_name in __call__ inputs >>> "schedule": [0.3, 0.6, 1], >>> "scale": [1, 0, 0.5], # scale the loss for every schedule >>> // "schdeule": [0.3, 1.0], >>> // "scale": [0, 1, 0.5], # scale the loss >>> }, >>> _name: "cross_entropy", >>> }, >>> "loss@the_second": { >>> config: { >>> "pred_truth_pair": ["logits2", "label2"], # len(.) == 2, the 1st is the pred_name, 2nd is truth_name in __call__ inputs >>> "schdeule": [0.3, 0.6, 1], >>> "scale": [0, 1, 0.5], # scale the loss for every schedule >>> // "schdeule": [0.3, 1.0], >>> // "scale": [0, 1, 0.5], # scale the loss >>> }, >>> _base: "cross_entropy", // _name or _base is all ok >>> }, >>> config: { >>> "loss_list": ['the_first', 'the_second'], >>> }, >>> _name: "cross_entropy", >>> }
Module contents
losses
dlk.core.models package
Submodules
dlk.core.models.basic module
- class dlk.core.models.basic.BasicModel(config: dlk.core.models.basic.BasicModelConfig, checkpoint)[source]
Bases:
dlk.core.base_module.BaseModel
Basic & General Model
- check_keys_are_provided(provide: List[str] = []) None [source]
check that all keys required by the submodules are provided
Returns: None
Raises: PermissionError
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
do forward on a mini batch
- Parameters
batch – a mini batch inputs
Returns: the outputs
- predict_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
do predict for one batch
- Parameters
inputs – one mini-batch inputs
Returns: the predicts outputs
- provide_keys() List[str] [source]
return all keys of the dict returned by the model
This method may be of no use, so we will remove it.
Returns: all keys
- test_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
do test for one batch
- Parameters
inputs – one mini-batch inputs
Returns: the test outputs
- training: bool
- class dlk.core.models.basic.BasicModelConfig(config)[source]
Bases:
dlk.utils.config.BaseConfig
Config for BasicModel
- Config Example:
>>> { >>> embedding: { >>> _base: "static" >>> config: { >>> embedding_file: "*@*", //the embedding file, must be saved as numpy array by pickle >>> embedding_dim: "*@*", >>> //if the embedding_file is a dict, you should provide the dict trace to embedding >>> embedding_trace: ".", //default the file itself is the embedding >>> /*embedding_trace: "embedding", //this means the <embedding = pickle.load(embedding_file)["embedding"]>*/ >>> /*embedding_trace: "meta.embedding", //this means the <embedding = pickle.load(embedding_file)['meta']["embedding"]>*/ >>> freeze: false, // is freeze >>> dropout: 0, //dropout rate >>> output_map: {}, >>> }, >>> }, >>> decoder: { >>> _base: "linear", >>> config: { >>> input_size: "*@*", >>> output_size: "*@*", >>> pool: null, >>> dropout: "*@*", //the decoder output no need dropout >>> output_map: {} >>> }, >>> }, >>> encoder: { >>> _base: "lstm", >>> config: { >>> output_map: {}, >>> hidden_size: "*@*", >>> input_size: *@*, >>> output_size: "*@*", >>> num_layers: 1, >>> dropout: "*@*", // dropout between layers >>> }, >>> }, >>> "initmethod": { >>> "_base": "range_norm" >>> }, >>> "config": { >>> "embedding_dim": "*@*", >>> "dropout": "*@*", >>> "embedding_file": "*@*", >>> "embedding_trace": "token_embedding", >>> }, >>> _link: { >>> "config.embedding_dim": ["embedding.config.embedding_dim", >>> "encoder.config.input_size", >>> "encoder.config.output_size", >>> "encoder.config.hidden_size", >>> "decoder.config.output_size", >>> "decoder.config.input_size" >>> ], >>> "config.dropout": ["encoder.config.dropout", "decoder.config.dropout", "embedding.config.dropout"], >>> "config.embedding_file": ['embedding.config.embedding_file'], >>> "config.embedding_trace": ['embedding.config.embedding_trace'] >>> } >>> _name: "basic" >>> }
- get_decoder(config)[source]
return the Decoder and DecoderConfig
- Parameters
config – the decoder config
- Returns
Decoder, DecoderConfig
- get_embedding(config: Dict)[source]
return the Embedding and EmbeddingConfig
- Parameters
config – the embedding config
- Returns
Embedding, EmbeddingConfig
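The _link blocks in the config examples (for instance in BasicModelConfig above) map a source config path to one or more target paths. The sketch below shows that idea only; it is an assumption about the semantics rather than the dlk parser, and get_by_path/set_by_path/apply_links are hypothetical helpers.
>>> def get_by_path(cfg, path):
>>>     node = cfg
>>>     for key in path.split("."):
>>>         node = node[key]
>>>     return node
>>> def set_by_path(cfg, path, value):
>>>     *parents, last = path.split(".")
>>>     node = cfg
>>>     for key in parents:
>>>         node = node[key]
>>>     node[last] = value
>>> def apply_links(cfg, links):
>>>     # copy the value at each source path to every listed target path,
>>>     # e.g. "config.embedding_dim" -> "embedding.config.embedding_dim"
>>>     for src, targets in links.items():
>>>         value = get_by_path(cfg, src)
>>>         for dst in targets:
>>>             set_by_path(cfg, dst, value)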
Module contents
models
dlk.core.modules package
Submodules
dlk.core.modules.bert module
- class dlk.core.modules.bert.BertWrap(config: dlk.core.modules.bert.BertWrapConfig)[source]
Bases:
dlk.core.modules.Module
Bert wrap
- forward(inputs: Dict)[source]
do forward on a mini batch
- Parameters
batch – a mini batch inputs
- Returns
sequence_output, all_hidden_states, all_self_attentions
- init_weight(method)[source]
init the weight of model by ‘bert.init_weight()’ or from_pretrain
- Parameters
method – init method, no use for pretrained_transformers
- Returns
None
- training: bool
- class dlk.core.modules.bert.BertWrapConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for BertWrap
- Config Example:
>>> { >>> "config": { >>> "pretrained_model_path": "*@*", >>> "from_pretrain": true, >>> "freeze": false, >>> "dropout": 0.0, >>> }, >>> "_name": "bert", >>> }
dlk.core.modules.conv1d module
- class dlk.core.modules.conv1d.Conv1d(config: dlk.core.modules.conv1d.Conv1dConfig)[source]
Bases:
dlk.core.modules.Module
Conv for 1d input
- forward(x: torch.Tensor)[source]
do forward on a mini batch
- Parameters
batch – a mini batch inputs
- Returns
conv result the shape is the same as input
- training: bool
- class dlk.core.modules.conv1d.Conv1dConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for Conv1d
- Config Example:
>>> { >>> "config": { >>> "in_channels": "*@*", >>> "out_channels": "*@*", >>> "dropout": 0.0, >>> "kernel_sizes": [3], >>> }, >>> "_name": "conv1d", >>> }
dlk.core.modules.crf module
- class dlk.core.modules.crf.CRFConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for ConditionalRandomField
- Config Example:
>>> {
>>>     "config": {
>>>         "output_size": 2,
>>>         "batch_first": true,
>>>         "reduction": "mean", // none|sum|mean|token_mean
>>>     },
>>>     "_name": "crf",
>>> }
- class dlk.core.modules.crf.ConditionalRandomField(config: dlk.core.modules.crf.CRFConfig)[source]
Bases:
dlk.core.modules.Module
CRF; training_step for training, forward for decoding.
- forward(logits: torch.FloatTensor, mask: torch.LongTensor)[source]
predict step, get the best path
- Parameters
logits – emissions, batch_size*max_len*num_tags
mask – batch_size*max_len, mask==0 means padding
- Returns
batch*max_len
- init_weight(method: Callable)[source]
init the weight of transitions, start_transitions and end_transitions
Initialize the transition parameters. The parameters will be initialized randomly from a uniform distribution between -0.1 and 0.1.
- Parameters
method – init method, no use
- Returns
None
- training: bool
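A shape sketch based on the documented forward(logits, mask) signature and the config example above; the tensor values are illustrative and the exact return type is an assumption.
>>> import torch
>>> from dlk.core.modules.crf import ConditionalRandomField, CRFConfig
>>> config = CRFConfig({"config": {"output_size": 2, "batch_first": True, "reduction": "mean"}, "_name": "crf"})
>>> crf = ConditionalRandomField(config)
>>> logits = torch.randn(4, 16, 2)              # batch_size * max_len * num_tags (emissions)
>>> mask = torch.ones(4, 16, dtype=torch.long)  # mask == 0 means padding
>>> best_path = crf(logits, mask)               # decode; expected shape: batch * max_len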
dlk.core.modules.distil_bert module
- class dlk.core.modules.distil_bert.DistilBertWrap(config: dlk.core.modules.distil_bert.DistilBertWrapConfig)[source]
Bases:
dlk.core.modules.Module
DistilBert wrap
- forward(inputs)[source]
do forward on a mini batch
- Parameters
batch – a mini batch inputs
- Returns
sequence_output, all_hidden_states, all_self_attentions
- init_weight(method)[source]
init the weight of model by ‘bert.init_weight()’ or from_pretrain
- Parameters
method – init method, no use for pretrained_transformers
- Returns
None
- training: bool
- class dlk.core.modules.distil_bert.DistilBertWrapConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for DistilBertWrap
- Config Example:
>>> {
>>>     "config": {
>>>         "pretrained_model_path": "*@*",
>>>         "from_pretrain": true,
>>>         "freeze": false,
>>>         "dropout": 0.0,
>>>     },
>>>     "_name": "distil_bert",
>>> }
dlk.core.modules.linear module
- class dlk.core.modules.linear.Linear(config: dlk.core.modules.linear.LinearConfig)[source]
Bases:
dlk.core.modules.Module
wrap for nn.Linear
- forward(input: torch.Tensor) torch.Tensor [source]
do forward on a mini batch
- Parameters
batch – a mini batch inputs
- Returns
projection result; the shape is the same as the input if there is no pool, otherwise it depends on the pool method
- training: bool
- class dlk.core.modules.linear.LinearConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for Linear
- Config Example:
>>> { >>> "config": { >>> "input_size": 256, >>> "output_size": 2, >>> "dropout": 0.0, //the module output no need dropout >>> "bias": true, // use bias or not in linear , if set to false, all the bias will be set to 0 >>> "pool": null, // pooling output or not >>> }, >>> "_name": "linear", >>> }
dlk.core.modules.logits_gather module
- class dlk.core.modules.logits_gather.LogitsGather(config: dlk.core.modules.logits_gather.LogitsGatherConfig)[source]
Bases:
dlk.core.modules.Module
Gather the output logits decided by config
- forward(input: List[torch.Tensor]) Dict[str, torch.Tensor] [source]
gather the needed input to dict
- Parameters
batch – a mini batch inputs
- Returns
some elements to dict
- training: bool
- class dlk.core.modules.logits_gather.LogitsGatherConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for LogitsGather
- Config Example:
>>> { >>> "config": { >>> "gather_layer": { >>> "0": { >>> "map": "3", // the 0th layer not do scale output to "gather_logits_3", "gather_logits_" is the output name prefix, the "3" is map name >>> "scale": {} //don't scale >>> }, >>> "1": { >>> "map": "4", // the 1th layer scale output dim from 1024 to 200 and the output named "gather_logits_3" >>> "scale": {"1024":"200"}, >>> } >>> }, >>> "prefix": "gather_logits_", >>> }, >>> _name: "logits_gather", >>> }
dlk.core.modules.lstm module
- class dlk.core.modules.lstm.LSTM(config: dlk.core.modules.lstm.LSTMConfig)[source]
Bases:
dlk.core.modules.Module
A wrap for nn.LSTM
- forward(input: torch.Tensor, mask: torch.Tensor) torch.Tensor [source]
do forward on a mini batch
- Parameters
batch – a mini batch inputs
- Returns
lstm output the shape is the same as input
- training: bool
- class dlk.core.modules.lstm.LSTMConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for LSTM
- Config Example:
>>> {
>>>     "config": {
>>>         "bidirectional": true,
>>>         "output_size": 200,   // the output is 2*hidden_size if bidirectional is used
>>>         "input_size": 200,
>>>         "num_layers": 1,
>>>         "dropout": 0.1,       // dropout between layers
>>>         "dropout_last": true, // whether to apply dropout to the last layer output
>>>     },
>>>     "_name": "lstm",
>>> }
dlk.core.modules.roberta module
- class dlk.core.modules.roberta.RobertaWrap(config: dlk.core.modules.roberta.RobertaWrapConfig)[source]
Bases:
dlk.core.modules.Module
Roberta Wrap
- forward(inputs)[source]
do forward on a mini batch
- Parameters
batch – a mini batch inputs
- Returns
sequence_output, all_hidden_states, all_self_attentions
- init_weight(method)[source]
init the weight of model by ‘bert.init_weight()’ or from_pretrain
- Parameters
method – init method, no use for pretrained_transformers
- Returns
None
- training: bool
- class dlk.core.modules.roberta.RobertaWrapConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for RobertaWrap
- Config Example:
>>> { >>> "config": { >>> "pretrained_model_path": "*@*", >>> "from_pretrain": true >>> "freeze": false, >>> "dropout": 0.0, >>> }, >>> "_name": "roberta", >>> }
Module contents
basic modules
- class dlk.core.modules.Module[source]
Bases:
torch.nn.modules.module.Module
This class is the DLK Module, a replacement for torch.nn.Module in this project
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
in simple module, all step fit to this method
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- init_weight(method)[source]
init the weight of submodules by ‘method’
- Parameters
method – init method
- Returns
None
- predict_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
do predict for one batch
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- test_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
do test for one batch
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- training: bool
dlk.core.optimizers package
Submodules
dlk.core.optimizers.adamw module
- class dlk.core.optimizers.adamw.AdamWOptimizer(model: torch.nn.modules.module.Module, config: dlk.core.optimizers.adamw.AdamWOptimizerConfig)[source]
Bases:
dlk.core.optimizers.BaseOptimizer
Wrap for optim.AdamW
- class dlk.core.optimizers.adamw.AdamWOptimizerConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for AdamWOptimizer
- Config Example:
>>> {
>>>     "config": {
>>>         "lr": 5e-5,
>>>         "betas": [0.9, 0.999],
>>>         "eps": 1e-6,
>>>         "weight_decay": 1e-2,
>>>         "optimizer_special_groups": {
>>>             "order": ['decoder', 'bias'], // the group order: if a para is both in decoder and in bias, it is set to decoder; the order name is used as the group name
>>>             "bias": {
>>>                 "config": {
>>>                     "weight_decay": 0
>>>                 },
>>>                 "pattern": ["bias", "LayerNorm.bias", "LayerNorm.weight"]
>>>             },
>>>             "decoder": {
>>>                 "config": {
>>>                     "lr": 1e-3
>>>                 },
>>>                 "pattern": ["decoder"]
>>>             },
>>>         },
>>>         "name": "default" // default group name
>>>     },
>>>     "_name": "adamw",
>>> }
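A hedged sketch of the grouping rule described by the order/pattern comments above: the first group in "order" whose pattern matches a parameter name wins, and everything else falls into the default group. This is an illustration, not the dlk implementation, and the substring matching is an assumption.
>>> def assign_group(param_name, special_groups):
>>>     # e.g. "decoder.linear.bias" matches "decoder" before "bias" when
>>>     # order == ["decoder", "bias"], so it goes into the "decoder" group
>>>     for group_name in special_groups.get("order", []):
>>>         patterns = special_groups[group_name]["pattern"]
>>>         if any(pattern in param_name for pattern in patterns):
>>>             return group_name
>>>     return "default"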
dlk.core.optimizers.sgd module
- class dlk.core.optimizers.sgd.SGDOptimizer(model: torch.nn.modules.module.Module, config: dlk.core.optimizers.sgd.SGDOptimizerConfig)[source]
Bases:
dlk.core.optimizers.BaseOptimizer
wrap for optim.SGD
- class dlk.core.optimizers.sgd.SGDOptimizerConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for SGDOptimizer
- Config Example:
>>> { >>> "config": { >>> "lr": 1e-3, >>> "momentum": 0.9, >>> "dampening": 0, >>> "weight_decay": 0, >>> "nesterov":false, >>> "optimizer_special_groups": { >>> // "order": ['decoder', 'bias'], // the group order, if the para is in decoder & is in bias, set to decoder. The order name is set to the group name >>> // "bias": { >>> // "config": { >>> // "weight_decay": 0 >>> // }, >>> // "pattern": ["bias", "LayerNorm.bias", "LayerNorm.weight"] >>> // }, >>> // "decoder": { >>> // "config": { >>> // "lr": 1e-3 >>> // }, >>> // "pattern": ["decoder"] >>> // }, >>> } >>> "name": "default" // default group name >>> }, >>> "_name": "sgd", >>> }
Module contents
optimizers
- class dlk.core.optimizers.BaseOptimizer[source]
Bases:
object
- get_optimizer() torch.optim.optimizer.Optimizer [source]
return the initialized optimizer
- Returns
Optimizer
- init_optimizer(optimizer: torch.optim.optimizer.Optimizer, model: torch.nn.modules.module.Module, config: Dict)[source]
init the optimizer for the parameters in model; the groups are decided by config
- Parameters
optimizer – adamw, sgd, etc.
model – pytorch model
config – which decided the para group, lr, etc.
- Returns
the initialized optimizer
dlk.core.schedulers package
Submodules
dlk.core.schedulers.constant module
- class dlk.core.schedulers.constant.ConstantSchedule(optimizer: torch.optim.optimizer.Optimizer, config: dlk.core.schedulers.constant.ConstantScheduleConfig)[source]
Bases:
dlk.core.schedulers.BaseScheduler
no schedule
- class dlk.core.schedulers.constant.ConstantScheduleConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for ConstantSchedule
- Config Example:
>>> { >>> "config": { >>> "last_epoch": -1 >>> }, >>> "_name": "constant", >>> }
dlk.core.schedulers.constant_warmup module
- class dlk.core.schedulers.constant_warmup.ConstantWarmupSchedule(optimizer: torch.optim.optimizer.Optimizer, config: dlk.core.schedulers.constant_warmup.ConstantWarmupScheduleConfig)[source]
- class dlk.core.schedulers.constant_warmup.ConstantWarmupScheduleConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for ConstantWarmupSchedule
- Config Example:
>>> { >>> "config": { >>> "last_epoch": -1, >>> "num_warmup_steps": 0, >>> }, >>> "_name": "constant_warmup", >>> }
dlk.core.schedulers.cosine_warmup module
- class dlk.core.schedulers.cosine_warmup.CosineWarmupSchedule(optimizer: torch.optim.optimizer.Optimizer, config: dlk.core.schedulers.cosine_warmup.CosineWarmupScheduleConfig)[source]
- class dlk.core.schedulers.cosine_warmup.CosineWarmupScheduleConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for CosineWarmupSchedule
- Config Example:
>>> { >>> "config": { >>> "last_epoch": -1, >>> "num_warmup_steps": 0, >>> "num_training_steps": -1, >>> "num_cycles": 0.5, >>> }, >>> "_name": "cosine_warmup", >>> }
dlk.core.schedulers.linear_warmup module
- class dlk.core.schedulers.linear_warmup.LinearWarmupSchedule(optimizer: torch.optim.optimizer.Optimizer, config: dlk.core.schedulers.linear_warmup.LinearWarmupScheduleConfig)[source]
Bases:
dlk.core.schedulers.BaseScheduler
linear warmup then linear decay
- class dlk.core.schedulers.linear_warmup.LinearWarmupScheduleConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
- Config Example:
>>> { >>> "config": { >>> "last_epoch": -1, >>> "num_warmup_steps": 0, >>> "num_training_steps": -1, >>> }, >>> "_name": "linear_warmup", >>> }
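"linear warmup then linear decay" corresponds to the standard multiplier below; this is a sketch of the formula only (dlk most likely wraps the equivalent huggingface/torch scheduler), not the dlk code.
>>> def lr_lambda(step, num_warmup_steps, num_training_steps):
>>>     # multiplier applied to the base lr at a given step
>>>     if step < num_warmup_steps:
>>>         return step / max(1, num_warmup_steps)
>>>     return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))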
dlk.core.schedulers.multi_group_schedule module
dlk.core.schedulers.rec_decay module
- class dlk.core.schedulers.rec_decay.RecDecaySchedule(optimizer: torch.optim.optimizer.Optimizer, config: dlk.core.schedulers.rec_decay.RecDecayScheduleConfig)[source]
Bases:
dlk.core.schedulers.BaseScheduler
lr=lr*1/(1+decay)
- class dlk.core.schedulers.rec_decay.RecDecayScheduleConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for RecDecaySchedule
- Config Example:
>>> { >>> "config": { >>> "last_epoch": -1, >>> "num_training_steps": -1, >>> "decay": 0.05, >>> "epoch_training_steps": -1, >>> }, >>> "_name": "rec_decay", >>> }
the lr=lr*1/(1+decay)
Module contents
schedulers
Submodules
dlk.core.base_module module
- class dlk.core.base_module.BaseModel[source]
Bases:
torch.nn.modules.module.Module
,dlk.core.base_module.ModuleOutputRenameMixin
,dlk.core.base_module.IModuleIO
,dlk.core.base_module.IModuleStep
All pytorch models should inherit from this class
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
all models should apply this method
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- training: bool
- class dlk.core.base_module.BaseModule(config: dlk.core.base_module.BaseModuleConfig)[source]
Bases:
torch.nn.modules.module.Module
,dlk.core.base_module.ModuleOutputRenameMixin
,dlk.core.base_module.IModuleIO
,dlk.core.base_module.IModuleStep
All pytorch modules should inherit from this class
- check_keys_are_provided(provide: Set[str]) None [source]
check that the keys required by this module are provided
- Returns
pass or not
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
all module should apply this method
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- init_weight(method)[source]
init the weight of submodules by ‘method’
- Parameters
method – init method
- Returns
None
- training: bool
- class dlk.core.base_module.BaseModuleConfig(config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
docstring for BaseModuleConfig
- class dlk.core.base_module.IModuleIO[source]
Bases:
object
interface for check the modules input and output
- abstract check_keys_are_provided(provide: List[str]) bool [source]
check that the keys required by this module are provided
- Returns
pass or not
- check_module_chain(module_list: List[dlk.core.base_module.BaseModule]) bool [source]
check whether the interfaces of the list of modules are aligned.
- Parameters
module_list – a series of modules
- Returns
pass or not
- Raises
ValueError – the check is not passed
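A minimal sketch of the chain-check idea described above (an interpretation of the docstrings, not the dlk source): every module must find its required keys among what the initial inputs and the earlier modules provide. The initial_keys argument is an assumption for illustration.
>>> def check_module_chain(module_list, initial_keys):
>>>     provided = set(initial_keys)
>>>     for module in module_list:
>>>         if not module.check_keys_are_provided(provided):
>>>             raise ValueError(f"{module} cannot find all of its required input keys")
>>>         provided |= set(module.provide_keys())
>>>     return True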
- class dlk.core.base_module.IModuleStep[source]
Bases:
object
docstring for ModuleStepMixin
- abstract predict_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
do predict for one batch
- Parameters
inputs – one mini-batch inputs
- Returns
the predicts outputs
- test_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
do test for one batch
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- class dlk.core.base_module.ModuleOutputRenameMixin[source]
Bases:
object
Just rename the output key names by config to adapt to the input fields of the downstream module.
- dict_rename(input: Dict, output_map: Dict[str, str]) Dict [source]
rename the key of input(dict) by output_map(name map)
- Parameters
input – will rename input
output_map – name map
- Returns
renamed input
- get_input_name(name: str) str [source]
use config._input_map map the name to real name
- Parameters
name – input_name
- Returns
real_name
- get_output_name(name: str) str [source]
use config._output_map map the name to real name
- Parameters
name – output_name
- Returns
real_name
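A minimal sketch of what dict_rename is documented to do (rename the keys of a dict by a name map); how unmapped keys are treated is an assumption here.
>>> def dict_rename(inputs, output_map):
>>>     # keys not present in the map are assumed to keep their original name
>>>     return {output_map.get(key, key): value for key, value in inputs.items()}
>>> dict_rename({"embedding": [0.1, 0.2]}, {"embedding": "word_embedding"})
>>> # -> {"word_embedding": [0.1, 0.2]}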
- class dlk.core.base_module.SimpleModule(config: dlk.core.base_module.BaseModuleConfig)[source]
Bases:
dlk.core.base_module.BaseModule
docstring for SimpleModule: all train/predict/test/validation steps call forward
- forward(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
in simple module, all step fit to this method
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- predict_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
do predict for one batch
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- test_step(inputs: Dict[str, torch.Tensor]) Dict[str, torch.Tensor] [source]
do test for one batch
- Parameters
inputs – one mini-batch inputs
- Returns
one mini-batch outputs
- training: bool
Module contents
dlk.data package
Subpackages
dlk.data.datamodules package
Submodules
dlk.data.datamodules.basic module
- class dlk.data.datamodules.basic.BasicDatamodule(config: dlk.data.datamodules.basic.BasicDatamoduleConfig, data: Dict[str, Any])[source]
Bases:
dlk.data.datamodules.IBaseDataModule
Basic and General DataModule
- real_key_type_pairs(key_type_pairs: Dict, data: Dict, field: str)[source]
return the keys = key_type_pairs.keys() ∩ data.columns
- Parameters
key_type_pairs – data in columns should map to tensor type
data – the pd.DataFrame
field – train/valid/test, etc.
- Returns
real_key_type_pairs where keys = key_type_pairs.keys() ∩ data.columns
- class dlk.data.datamodules.basic.BasicDatamoduleConfig(config)[source]
Bases:
dlk.utils.config.BaseConfig
Config for BasicDatamodule
- Config Example:
>>> { >>> "_name": "basic", >>> "config": { >>> "pin_memory": None, >>> "collate_fn": "default", >>> "num_workers": null, >>> "shuffle": { >>> "train": true, >>> "predict": false, >>> "valid": false, >>> "test": false, >>> "online": false >>> }, >>> "key_type_pairs": { >>> 'input_ids': 'int', >>> 'label_ids': 'long', >>> 'type_ids': 'long', >>> }, >>> "gen_mask": { >>> 'input_ids': 'attention_mask', >>> }, >>> "key_padding_pairs": { //default all 0 >>> 'input_ids': 0, >>> }, >>> "key_padding_pairs_2d": { //default all 0, for 2 dimension data >>> 'input_ids': 0, >>> }, >>> "train_batch_size": 32, >>> "predict_batch_size": 32, //predict、test batch_size is equals to valid_batch_size >>> "online_batch_size": 1, >>> } >>> },
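A hedged construction sketch based on the documented signatures BasicDatamoduleConfig(config) and BasicDatamodule(config, data). The partial config and the exact shape of data (a dict of pandas DataFrames keyed by field, as suggested by real_key_type_pairs above) are assumptions; the real config may require all keys from the example.
>>> import pandas as pd
>>> from dlk.data.datamodules.basic import BasicDatamodule, BasicDatamoduleConfig
>>> train = pd.DataFrame({
>>>     "input_ids": [[1, 2, 3], [4, 5]],
>>>     "label_ids": [[0, 1, 0], [1, 0]],
>>> })
>>> config = BasicDatamoduleConfig({
>>>     "_name": "basic",
>>>     "config": {
>>>         "key_type_pairs": {"input_ids": "int", "label_ids": "long"},
>>>         "gen_mask": {"input_ids": "attention_mask"},
>>>         "train_batch_size": 2,
>>>     },
>>> })
>>> datamodule = BasicDatamodule(config, data={"train": train})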
Module contents
datamodules
- class dlk.data.datamodules.DefaultCollate(**config)[source]
Bases:
object
docstring for DefaultCollate
dlk.data.postprocessors package
Submodules
dlk.data.postprocessors.identity module
- class dlk.data.postprocessors.identity.IdentityPostProcessor(config: dlk.data.postprocessors.identity.IdentityPostProcessorConfig)[source]
Bases:
dlk.data.postprocessors.IPostProcessor
docstring for DataSet
- class dlk.data.postprocessors.identity.IdentityPostProcessorConfig(config: Dict)[source]
Bases:
dlk.data.postprocessors.IPostProcessorConfig
docstring for IdentityPostProcessorConfig
dlk.data.postprocessors.seq_lab module
- class dlk.data.postprocessors.seq_lab.AggregationStrategy[source]
Bases:
object
docstring for AggregationStrategy
- AVERAGE = 'average'
- FIRST = 'first'
- MAX = 'max'
- NONE = 'none'
- SIMPLE = 'simple'
- class dlk.data.postprocessors.seq_lab.SeqLabPostProcessor(config: dlk.data.postprocessors.seq_lab.SeqLabPostProcessorConfig)[source]
Bases:
dlk.data.postprocessors.IPostProcessor
PostProcess for sequence labeling task
- aggregate(pre_entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) List[dict] [source]
- aggregate_word(entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) dict [source]
- aggregate_words(entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) List[dict] [source]
Override tokens from a given word that disagree to force agreement on word boundaries.
Example
micro|soft| com|pany| B-ENT I-NAME I-ENT I-ENT will be rewritten with first strategy as microsoft| company| B-ENT I-ENT
- calc_score(predict_list: List, ground_truth_list: List)[source]
use predict_list and ground_truth_list to calculate scores
- Parameters
predict_list – list of predict
ground_truth_list – list of ground_truth
- Returns
precision, recall, f1
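A generic sketch of how precision, recall and f1 can be computed from predicted and gold entity lists; the library's own matching rules (e.g. ignore_position, ignore_char) differ, so this is only an illustration:
from typing import List, Tuple

# Hypothetical helper (not the dlk implementation): set-based precision / recall / f1.
def prf1(predict_list: List[Tuple], ground_truth_list: List[Tuple]):
    pred, gold = set(predict_list), set(ground_truth_list)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1([("PER", 0, 5), ("LOC", 10, 14)], [("PER", 0, 5)]))  # (0.5, 1.0, 0.666...)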
- crf_predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) List [source]
use the CRF-predicted label_ids to get the prediction info
- Parameters
list_batch_outputs – the crf predict info
origin_data – the origin data
- Returns
all predict instances info
- do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict [source]
calculate the scores using the predicts or list_batch_outputs
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of the data cannot be converted to tensors
rt_config –
>>> current status >>> { >>> "current_step": self.global_step, >>> "current_epoch": self.current_epoch, >>> "total_steps": self.num_training_steps, >>> "total_epochs": self.num_training_epochs >>> }
- Returns
the named scores, recall, precision, f1
- do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List [source]
Process the model predictions into a human-readable format
There are three predictors for different seq_lab tasks, selected by config.use_crf (the prediction is already decoded to ids) and config.word_ready (subwords have been gathered to the first piece)
- Parameters
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of the data cannot be converted to tensors
rt_config –
>>> current status >>> { >>> "current_step": self.global_step, >>> "current_epoch": self.current_epoch, >>> "total_steps": self.num_training_steps, >>> "total_epochs": self.num_training_epochs >>> }
- Returns
all predicts
- do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]
save the predictions when save_condition == True
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of the data cannot be converted to tensors
rt_config –
>>> current status >>> { >>> "current_step": self.global_step, >>> "current_epoch": self.current_epoch, >>> "total_steps": self.num_training_steps, >>> "total_epochs": self.num_training_epochs >>> }
save_condition – True to force saving; False to decide based on rt_config
- Returns
None
- gather_pre_entities(sentence: str, input_ids: numpy.ndarray, scores: numpy.ndarray, offset_mapping: Optional[List[Tuple[int, int]]], special_tokens_mask: numpy.ndarray) List[dict] [source]
Fuse various numpy arrays into dicts with all the information needed for aggregation
- get_entity_info(sub_tokens_index: List, offset_mapping: List, word_ids: List, label: str) Dict [source]
gather sub_tokens to get the start and end
- Parameters
sub_tokens_index – the entity tokens index list
offset_mapping – every token offset in text
word_ids – the word index of every token
label – predict label
- Returns
entity_info
- group_entities(entities: List[dict]) List[dict] [source]
Find and group together the adjacent tokens with the same entity predicted.
- Parameters
entities – The entities predicted by the pipeline.
- group_sub_entities(entities: List[dict]) dict [source]
Group together the adjacent tokens with the same entity predicted.
- Parameters
entities – The entities predicted by the pipeline.
- predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) List [source]
general predict process (especially for subword)
- Parameters
list_batch_outputs – the predict (sub-)labels logits info
origin_data – the origin data
- Returns
all predict instances info
- word_predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) List [source]
use the first-piece or whole-word predicted label_logits to get the prediction info
- Parameters
list_batch_outputs – the predict labels logits info
origin_data – the origin data
- Returns
all predict instances info
- class dlk.data.postprocessors.seq_lab.SeqLabPostProcessorConfig(config: Dict)[source]
Bases:
dlk.data.postprocessors.IPostProcessorConfig
Config for SeqLabPostProcessor
- Config Example:
>>> { >>> "_name": "seq_lab", >>> "config": { >>> "meta": "*@*", >>> "use_crf": false, //use or not use crf >>> "word_ready": false, //already gather the subword first token as the word rep or not >>> "ignore_position": true, // calc the metrics, whether ignore the ground_truth and predict position info.( if set to true, only focus on the entity content not position.) >>> "ignore_char": " ", // if the entity begin or end with this char, will ignore these char >>> //"ignore_char": " ()[]-.,:", // if the entity begin or end with this char, will ignore these char >>> "meta_data": { >>> "label_vocab": 'label_vocab', >>> "tokenizer": "tokenizer", >>> }, >>> "input_map": { >>> "logits": "logits", >>> "predict_seq_label": "predict_seq_label", >>> "_index": "_index", >>> }, >>> "origin_input_map": { >>> "uuid": "uuid", >>> "sentence": "sentence", >>> "input_ids": "input_ids", >>> "entities_info": "entities_info", >>> "offsets": "offsets", >>> "special_tokens_mask": "special_tokens_mask", >>> "word_ids": "word_ids", >>> "label_ids": "label_ids", >>> }, >>> "save_root_path": ".", //save data root dir >>> "save_path": { >>> "valid": "valid", // relative dir for valid stage >>> "test": "test", // relative dir for test stage >>> }, >>> "start_save_step": 0, // -1 means the last >>> "start_save_epoch": -1, >>> "aggregation_strategy": "max", // AggregationStrategy item >>> "ignore_labels": ['O', 'X', 'S', "E"], // Out, Out, Start, End >>> } >>> }
dlk.data.postprocessors.txt_cls module
- class dlk.data.postprocessors.txt_cls.TxtClsPostProcessor(config: dlk.data.postprocessors.txt_cls.TxtClsPostProcessorConfig)[source]
Bases:
dlk.data.postprocessors.IPostProcessor
postprocess for text classification
- do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict [source]
calculate the scores using the predicts or list_batch_outputs
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of the data cannot be converted to tensors
rt_config –
>>> current status >>> { >>> "current_step": self.global_step, >>> "current_epoch": self.current_epoch, >>> "total_steps": self.num_training_steps, >>> "total_epochs": self.num_training_epochs >>> }
- Returns
the named scores, acc
- do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List [source]
Process the model predictions into a human-readable format
- Parameters
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of the data cannot be converted to tensors
rt_config –
>>> current status >>> { >>> "current_step": self.global_step, >>> "current_epoch": self.current_epoch, >>> "total_steps": self.num_training_steps, >>> "total_epochs": self.num_training_epochs >>> }
- Returns
all predicts
- do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]
save the predictions when save_condition == True
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of the data cannot be converted to tensors
rt_config –
>>> current status >>> { >>> "current_step": self.global_step, >>> "current_epoch": self.current_epoch, >>> "total_steps": self.num_training_steps, >>> "total_epochs": self.num_training_epochs >>> }
save_condition – True to force saving; False to decide based on rt_config
- Returns
None
- class dlk.data.postprocessors.txt_cls.TxtClsPostProcessorConfig(config: Dict)[source]
Bases:
dlk.data.postprocessors.IPostProcessorConfig
Config for TxtClsPostProcessor
- Config Example:
>>> { >>> "_name": "txt_cls", >>> "config": { >>> "meta": "*@*", >>> "meta_data": { >>> "label_vocab": 'label_vocab', >>> }, >>> "input_map": { >>> "logits": "logits", >>> "label_ids": "label_ids" >>> "_index": "_index", >>> }, >>> "origin_input_map": { >>> "sentence": "sentence", >>> "sentence_a": "sentence_a", // for pair >>> "sentence_b": "sentence_b", >>> "uuid": "uuid" >>> }, >>> "save_root_path": ".", //save data root dir >>> "top_k": 1, //the result return top k result >>> "data_type": "single", //single or pair >>> "save_path": { >>> "valid": "valid", // relative dir for valid stage >>> "test": "test", // relative dir for test stage >>> }, >>> "start_save_step": 0, // -1 means the last >>> "start_save_epoch": -1, >>> } >>> }
dlk.data.postprocessors.txt_reg module
- class dlk.data.postprocessors.txt_reg.TxtRegPostProcessor(config: dlk.data.postprocessors.txt_reg.TxtRegPostProcessorConfig)[source]
Bases:
dlk.data.postprocessors.IPostProcessor
text regression postprocess
- do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict [source]
calculate the scores using the predicts or list_batch_outputs
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of the data cannot be converted to tensors
rt_config –
>>> current status >>> { >>> "current_step": self.global_step, >>> "current_epoch": self.current_epoch, >>> "total_steps": self.num_training_steps, >>> "total_epochs": self.num_training_epochs >>> }
- Returns
the named scores
- do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List [source]
Process the model predictions into a human-readable format
- Parameters
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of the data cannot be converted to tensors
rt_config –
>>> current status >>> { >>> "current_step": self.global_step, >>> "current_epoch": self.current_epoch, >>> "total_steps": self.num_training_steps, >>> "total_epochs": self.num_training_epochs >>> }
- Returns
all predicts
- do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]
save the predictions when save_condition == True
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of the data cannot be converted to tensors
rt_config –
>>> current status >>> { >>> "current_step": self.global_step, >>> "current_epoch": self.current_epoch, >>> "total_steps": self.num_training_steps, >>> "total_epochs": self.num_training_epochs >>> }
save_condition – True to force saving; False to decide based on rt_config
- Returns
None
- class dlk.data.postprocessors.txt_reg.TxtRegPostProcessorConfig(config: Dict)[source]
Bases:
dlk.data.postprocessors.IPostProcessorConfig
Config for TxtRegPostProcessor
- Config Example:
>>> { >>> "_name": "txt_reg", >>> "config": { >>> "input_map": { >>> "logits": "logits", >>> "values": "values", >>> "_index": "_index", >>> }, >>> "origin_input_map": { >>> "sentence": "sentence", >>> "sentence_a": "sentence_a", // for pair >>> "sentence_b": "sentence_b", >>> "uuid": "uuid" >>> }, >>> "data_type": "single", //single or pair >>> "save_root_path": ".", //save data root dir >>> "save_path": { >>> "valid": "valid", // relative dir for valid stage >>> "test": "test", // relative dir for test stage >>> }, >>> "log_reg": false, // whether logistic regression >>> "start_save_step": 0, // -1 means the last >>> "start_save_epoch": -1, >>> } >>> }
Module contents
postprocessors
- class dlk.data.postprocessors.IPostProcessor[source]
Bases:
object
docstring for IPostProcessor
- average_loss(list_batch_outputs: List[Dict]) float [source]
average the loss over all the batches in list_batch_outputs
- Parameters
list_batch_outputs – a list of outputs
- Returns
average_loss
- abstract do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict [source]
calculate the scores using the predicts or list_batch_outputs
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of the data cannot be converted to tensors
rt_config –
>>> >>> current status >>> { >>> "current_step": self.global_step, >>> "current_epoch": self.current_epoch, >>> "total_steps": self.num_training_steps, >>> "total_epochs": self.num_training_epochs >>> }
- Returns
the named scores
- abstract do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List [source]
Process the model predictions into a human-readable format
- Parameters
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of the data cannot be converted to tensors
rt_config –
>>> current status >>> { >>> "current_step": self.global_step, >>> "current_epoch": self.current_epoch, >>> "total_steps": self.num_training_steps, >>> "total_epochs": self.num_training_epochs >>> }
- Returns
all predicts
- abstract do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]
save the predictions when save_condition == True
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of the data cannot be converted to tensors
rt_config –
>>> current status >>> { >>> "current_step": self.global_step, >>> "current_epoch": self.current_epoch, >>> "total_steps": self.num_training_steps, >>> "total_epochs": self.num_training_epochs >>> }
save_condition – True to force saving; False to decide based on rt_config
- Returns
None
- gather_predict_extend_data(input_data: Dict, i: int, predict_extend_return: Dict)[source]
gather the data registered in predict_extend_return
- Parameters
input_data – the model output
i – the data index
predict_extend_return – the name map which will be preserved
- Returns
a dict of the data in input_data that is registered in predict_extend_return
- loss_name_map(stage) str [source]
get the stage loss name
- Parameters
stage – valid, train or test
- Returns
loss_name
- process(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False) Union[Dict, List] [source]
PostProcess entry
- Parameters
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of the data cannot be converted to tensors
rt_config –
>>> current status >>> { >>> "current_step": self.global_step, >>> "current_epoch": self.current_epoch, >>> "total_steps": self.num_training_steps, >>> "total_epochs": self.num_training_epochs >>> }
save_condition – if True, force saving the predictions for every stage except online
- Returns
the log_info (metrics), or the predicts if the stage is "online"
- property without_ground_truth_stage: set
the returned stages have no ground truth
- Returns
without_ground_truth_stage
- class dlk.data.postprocessors.IPostProcessorConfig(config)[source]
Bases:
dlk.utils.config.BaseConfig
docstring for IPostProcessorConfig
- property input_map
the required name map for the model's output content
- Returns
input_map
- property origin_input_map
the required column name map for the origin data (before it is passed to the datamodule)
- Returns
origin_input_map
- property predict_extend_return
the extra data to save during predict
- Returns
predict_extend_return
dlk.data.processors package
Submodules
dlk.data.processors.basic module
- class dlk.data.processors.basic.BasicProcessor(stage: str, config: dlk.data.processors.basic.BasicProcessorConfig)[source]
Bases:
dlk.data.processors.IProcessor
Basic and General Processor
- class dlk.data.processors.basic.BasicProcessorConfig(stage, config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for BasicProcessor
- Config Example:
>>> { >>> // input should be {"train": train, "valid": valid, ...}, train/valid/test/predict/online etc, should be dataframe and must have a column named "sentence" >>> "_name": "basic@test_text_cls", >>> "config": { >>> "feed_order": ["load", "tokenizer", "token_gather", "label_to_id", "token_embedding", "save"] >>> }, >>> "subprocessor@load": { >>> "_base": "load", >>> "config":{ >>> "base_dir": "", >>> "predict":{ >>> "meta": "./meta.pkl", >>> }, >>> "online": [ >>> "predict", //base predict >>> { // special config, update predict, is this case, the config is null, means use all config from "predict", when this is empty dict, you can only set the value to a str "predict", they will get the same result >>> } >>> ] >>> } >>> }, >>> "subprocessor@save": { >>> "_base": "save", >>> "config":{ >>> "base_dir": "", >>> "train":{ >>> "processed": "processed_data.pkl", // all data >>> "meta": { >>> "meta.pkl": ['label_vocab'] //only for next time use >>> } >>> }, >>> "predict": { >>> "processed": "processed_data.pkl", >>> } >>> } >>> }, >>> "subprocessor@tokenizer":{ >>> "_base": "fast_tokenizer", >>> "config": { >>> "train": { >>> "config_path": "*@*", >>> "prefix": "" >>> "data_type": "single", // single or pair, if not provide, will calc by len(process_data) >>> "process_data": [ >>> ["sentence", { "is_pretokenized": false}], >>> ], >>> "post_processor": "default" >>> "filed_map": { // this is the default value, you can provide other name >>> "ids": "input_ids", >>> }, // the tokenizer output(the key) map to the value >>> }, >>> "predict": "train", >>> "online": "train" >>> } >>> }, >>> "subprocessor@token_gather":{ >>> "_base": "token_gather", >>> "config": { >>> "train": { // only train stage using >>> "data_set": { // for different stage, this processor will process different part of data >>> "train": ["train", "valid"] >>> }, >>> "gather_columns": ["label"], //List of columns. Every cell must be sigle token or list of tokens or set of tokens >>> "deliver": "label_vocab", // output Vocabulary object (the Vocabulary of labels) name. >>> } >>> } >>> }, >>> "subprocessor@label_to_id":{ >>> "_base": "token2id", >>> "config": { >>> "train":{ //train、predict、online stage config, using '&' split all stages >>> "data_pair": { >>> "label": "label_id" >>> }, >>> "data_set": { // for different stage, this processor will process different part of data >>> "train": ['train', 'valid', 'test'], >>> "predict": ['predict'], >>> "online": ['online'] >>> }, >>> "vocab": "label_vocab", // usually provided by the "token_gather" module >>> }, //3 >>> "predict": "train", >>> "online": "train", >>> } >>> }, >>> "subprocessor@token_embedding": { >>> "_base": "token_embedding", >>> "config":{ >>> "train": { // only train stage using >>> "embedding_file": "*@*", >>> "tokenizer": "*@*", //List of columns. Every cell must be sigle token or list of tokens or set of tokens >>> "deliver": "token_embedding", // output Vocabulary object (the Vocabulary of labels) name. >>> "embedding_size": 200, >>> } >>> } >>> }, >>> }
Module contents
processors
dlk.data.subprocessors package
Submodules
dlk.data.subprocessors.char_gather module
- class dlk.data.subprocessors.char_gather.CharGather(stage: str, config: dlk.data.subprocessors.char_gather.CharGatherConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
gather all characters from the 'gather_columns' and deliver a vocab named 'char_vocab'
- class dlk.data.subprocessors.char_gather.CharGatherConfig(stage: str, config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for CharGather
- Config Example:
>>> { >>> "_name": "char_gather", >>> "config": { >>> "train": { >>> "data_set": { // for different stage, this processor will process different part of data >>> "train": ["train", "valid", 'test'] >>> }, >>> "gather_columns": "*@*", //List of columns. Every cell must be sigle token or list of tokens or set of tokens >>> "deliver": "char_vocab", // output Vocabulary object (the Vocabulary of labels) name. >>> "ignore": "", // ignore the token, the id of this token will be -1 >>> "update": null, // null or another Vocabulary object to update >>> "unk": "[UNK]", >>> "pad": "[PAD]", >>> "min_freq": 1, >>> "most_common": -1, //-1 for all >>> } >>> } >>> }
dlk.data.subprocessors.fast_tokenizer module
- class dlk.data.subprocessors.fast_tokenizer.FastTokenizer(stage: str, config: dlk.data.subprocessors.fast_tokenizer.FastTokenizerConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
FastTokenizer uses huggingface tokenizers
Tokenize the single $sentence, or tokenize the pair $sentence_a, $sentence_b. Generates $tokens, $input_ids, $type_ids, $special_tokens_mask, $offsets, $word_ids, $overflowing, $sequence_ids
- class dlk.data.subprocessors.fast_tokenizer.FastTokenizerConfig(stage, config)[source]
Bases:
dlk.utils.config.BaseConfig
Config for FastTokenizer
- Config Example:
>>> { >>> "_name": "fast_tokenizer", >>> "config": { >>> "train": { >>> "data_set": { // for different stage, this processor will process different part of data >>> "train": ["train", "valid", 'test'], >>> "predict": ["predict"], >>> "online": ["online"] >>> }, >>> "config_path": "*@*", >>> "truncation": { // if this is set to None or empty, will not do trunc >>> "max_length": 512, >>> "strategy": "longest_first", // Can be one of longest_first, only_first or only_second. >>> }, >>> "normalizer": ["nfd", "lowercase", "strip_accents", "some_processor_need_config": {config}], // if don't set this, will use the default normalizer from config >>> "pre_tokenizer": [{"whitespace": {}}], // if don't set this, will use the default normalizer from config >>> "post_processor": "bert", // if don't set this, will use the default normalizer from config, WARNING: not support disable the default setting( so the default tokenizer.post_tokenizer should be null and only setting in this configure) >>> "output_map": { // this is the default value, you can provide other name >>> "tokens": "tokens", >>> "ids": "input_ids", >>> "attention_mask": "attention_mask", >>> "type_ids": "type_ids", >>> "special_tokens_mask": "special_tokens_mask", >>> "offsets": "offsets", >>> "word_ids": "word_ids", >>> "overflowing": "overflowing", >>> "sequence_ids": "sequence_ids", >>> }, // the tokenizer output(the key) map to the value >>> "input_map": { >>> "sentence": "sentence", //for sigle input, tokenizer the "sentence" >>> "sentence_a": "sentence_a", //for pair inputs, tokenize the "sentence_a" && "sentence_b" >>> "sentence_b": "sentence_b", //for pair inputs >>> }, >>> "deliver": "tokenizer", >>> "process_data": { "is_pretokenized": false}, >>> "data_type": "single", // single or pair, if not provide, will calc by len(process_data) >>> }, >>> "predict": ["train", {"deliver": null}], >>> "online": ["train", {"deliver": null}], >>> } >>> }
dlk.data.subprocessors.load module
- class dlk.data.subprocessors.load.Load(stage: str, config: dlk.data.subprocessors.load.LoadConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Load the $meta, etc. into data
- class dlk.data.subprocessors.load.LoadConfig(stage, config)[source]
Bases:
dlk.utils.config.BaseConfig
Config for Load
- Config Example:
>>> { >>> "_name": "load", >>> "config":{ >>> "base_dir": "" >>> "predict":{ >>> "meta": "./meta.pkl", >>> }, >>> "online": [ >>> "predict", //base predict >>> { // special config, update predict, is this case, the config is null, means use all config from "predict", when this is empty dict, you can only set the value to a str "predict", they will get the same result >>> } >>> ] >>> } >>> },
dlk.data.subprocessors.save module
- class dlk.data.subprocessors.save.Save(stage: str, config: dlk.data.subprocessors.save.SaveConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Save the processed data to $base_dir/$processed. Save the meta data (like vocab, embedding, etc.) to $base_dir/$meta
- class dlk.data.subprocessors.save.SaveConfig(stage, config)[source]
Bases:
dlk.utils.config.BaseConfig
Config for Save
- Config Example:
>>> { >>> "_name": "save", >>> "config":{ >>> "base_dir": "" >>> "train":{ >>> "processed": "processed_data.pkl", // all data without meta >>> "meta": { >>> "meta.pkl": ['label_ids', 'embedding'] //only for next time use >>> } >>> }, >>> "predict": { >>> "processed": "processed_data.pkl", >>> } >>> } >>> },
dlk.data.subprocessors.seq_lab_firstpiece_relable module
dlk.data.subprocessors.seq_lab_loader module
dlk.data.subprocessors.seq_lab_relabel module
- class dlk.data.subprocessors.seq_lab_relabel.SeqLabRelabel(stage: str, config: dlk.data.subprocessors.seq_lab_relabel.SeqLabRelabelConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Relabel the JSON data to BIO format
- find_position_in_offsets(position: int, offset_list: List, sub_word_ids: List, start: int, end: int, is_start: bool = False)[source]
find the sub_word index such that offset_list[index][0] <= position < offset_list[index][1]
- Parameters
position – position
offset_list – list of all tokens offsets
sub_word_ids – word_ids from tokenizer
start – start search index
end – end search index
is_start – whether the position is the start of the target token; if is_start == True and no index is found, return -1
- Returns
the index of the offset which includes position
- class dlk.data.subprocessors.seq_lab_relabel.SeqLabRelabelConfig(stage, config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for SeqLabRelabel
- Config Example:
>>> { >>> "_name": "seq_lab_relabel", >>> "config": { >>> "train":{ >>> "input_map": { // without necessery, don't change this >>> "word_ids": "word_ids", >>> "offsets": "offsets", >>> "entities_info": "entities_info", >>> }, >>> "data_set": { // for different stage, this processor will process different part of data >>> "train": ['train', 'valid', 'test'], >>> "predict": ['predict'], >>> "online": ['online'] >>> }, >>> "output_map": { >>> "labels": "labels", >>> }, >>> "drop": "shorter", //'longer'/'shorter'/'none', if entities is overlap, will remove by rule >>> "start_label": "S", >>> "end_label": "E", >>> "clean_droped_entity": true, // after drop entity for training, whether drop the entity for calc metrics, default is true, this only works when the drop != 'none' >>> "entity_priority": [], >>> //"entity_priority": ['Product'], >>> "priority_trigger": 1, // if the overlap entity abs(length_a - length_b)<=priority_trigger, will trigger the entity_priority strategy >>> }, //3 >>> "predict": "train", >>> "online": "train", >>> } >>> }
dlk.data.subprocessors.token2charid module
- class dlk.data.subprocessors.token2charid.Token2CharID(stage: str, config: dlk.data.subprocessors.token2charid.Token2CharIDConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Use 'Vocabulary' to map the characters of tokens to ids
- class dlk.data.subprocessors.token2charid.Token2CharIDConfig(stage, config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for Token2CharID
- Config Example:
>>> { >>> "_name": "token2charid", >>> "config": { >>> "train":{ >>> "data_pair": { >>> "sentence & offsets": "char_ids" >>> }, >>> "data_set": { // for different stage, this processor will process different part of data >>> "train": ['train', 'valid', 'test', 'predict'], >>> "predict": ['predict'], >>> "online": ['online'] >>> }, >>> "vocab": "char_vocab", // usually provided by the "token_gather" module >>> "max_token_len": 20, // the max length of token, then the output will be max_token_len x token_num (put max_token_len in previor is for padding on token_num) >>> }, >>> "predict": "train", >>> "online": "train", >>> } >>> }
dlk.data.subprocessors.token2id module
- class dlk.data.subprocessors.token2id.Token2ID(stage: str, config: dlk.data.subprocessors.token2id.Token2IDConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Use 'Vocabulary' to map tokens to ids
- class dlk.data.subprocessors.token2id.Token2IDConfig(stage, config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for Token2ID
- Config Example:
>>> { >>> "_name": "token2id", >>> "config": { >>> "train":{ >>> "data_pair": { >>> "labels": "label_ids" >>> }, >>> "data_set": { // for different stage, this processor will process different part of data >>> "train": ['train', 'valid', 'test', 'predict'], >>> "predict": ['predict'], >>> "online": ['online'] >>> }, >>> "vocab": "label_vocab", // usually provided by the "token_gather" module >>> }, //3 >>> "predict": "train", >>> "online": "train", >>> } >>> }
dlk.data.subprocessors.token_embedding module
- class dlk.data.subprocessors.token_embedding.TokenEmbedding(stage: str, config: dlk.data.subprocessors.token_embedding.TokenEmbeddingConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Gather token embeddings from the pretrained 'embedding_file' or initialize embeddings (xavier_uniform init, with the range clipped to 'bias_clip_range')
The tokens come from a 'Tokenizer' (get_vocab) or a 'Vocabulary' (word2idx) object (exactly one of the two must be provided)
- get_embedding(file_path, embedding_size) Dict[str, List[float]] [source]
load the embeddings from file_path, keeping only the last embedding_size dimensions of each embedding
- Parameters
file_path – embedding file path
embedding_size – the embedding dim
- Returns
>>> embedding_dict >>> { >>> "word": [embedding, ...] >>> }
- class dlk.data.subprocessors.token_embedding.TokenEmbeddingConfig(stage, config)[source]
Bases:
dlk.utils.config.BaseConfig
Config for TokenEmbedding
- Config Example:
>>> { >>> "_name": "token_embedding", >>> "config": { >>> "train": { >>> "embedding_file": "*@*", >>> "tokenizer": null, //List of columns. Every cell must be sigle token or list of tokens or set of tokens >>> "vocab": null, >>> "deliver": "token_embedding", // output Vocabulary object (the Vocabulary of labels) name. >>> "embedding_size": 200, >>> "bias_clip_range": [0.5, 0.1], // the init embedding bias weight range, if you provide two, the larger is the up bound the lower is low bound; if you provide one value, we will use it as the bias >>> } >>> } >>> }
dlk.data.subprocessors.token_gather module
- class dlk.data.subprocessors.token_gather.TokenGather(stage: str, config: dlk.data.subprocessors.token_gather.TokenGatherConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
gather all tokens from the ‘gather_columns’ and deliver a vocab named ‘token_vocab’
- get_elements_from_series_by_trace(data: pandas.core.series.Series, trace: str) List [source]
get the data from data along the trace path >>> for example: >>> data[0] = {'entities_info': [{'start': 0, 'end': 1, 'labels': ['Label1']}]} // data is a series, and every element looks like data[0] >>> trace = 'entities_info.labels' >>> return_result = [['Label1']]
- Parameters
data – origin data series
trace – the trace path to the target data element
- Returns
the data at the tail of the trace
- class dlk.data.subprocessors.token_gather.TokenGatherConfig(stage: str, config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for TokenGather
- Config Example:
>>> { >>> "_name": "token_gather", >>> "config": { >>> "train": { >>> "data_set": { // for different stage, this processor will process different part of data >>> "train": ["train", "valid", 'test'] >>> }, >>> "gather_columns": "*@*", //List of columns, if one element of the list is dict, you can set more. Every cell must be sigle token or list of tokens or set of tokens >>> //"gather_columns": ['tokens'] >>> //"gather_columns": ['tokens', {"column": "entities_info", "trace": 'labels'}] >>> // the trace only trace the dict, if list is in trace path, will add the trace to every elements in the list. for example: {"entities_info": [{'start': 1, 'end': 2, labels: ['Label1']}, ..]}, the trace to labels is 'entities_info.labels' >>> "deliver": "*@*", // output Vocabulary object (the Vocabulary of labels) name. >>> "ignore": "", // ignore the token, the id of this token will be -1 >>> "update": null, // null or another Vocabulary object to update >>> "unk": "[UNK]", >>> "pad": "[PAD]", >>> "min_freq": 1, >>> "most_common": -1, //-1 for all >>> } >>> } >>> }
dlk.data.subprocessors.token_norm module
- class dlk.data.subprocessors.token_norm.TokenNorm(stage: str, config: dlk.data.subprocessors.token_norm.TokenNormConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
This part could be merged into fast_tokenizer (it would save some time), but not every process needs it (only some special datasets like conll2003 do), and merging would make fast_tokenizer heavy.
- Token norm:
Love -> love, 3281 -> 0000
- process(data: Dict) Dict [source]
TokenNorm entry
- Parameters
data – {"data": {"train": ...}, "tokenizer": ...}
- Returns
the normalized data
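An illustrative sketch of the two normalizations named above (zero_digits_replaced and lowercase, see the config below); this is only an assumption about the behavior, not the dlk implementation:
import re

# Hypothetical token normalization: digits to "0", then lowercase.
def norm_token(token: str, zero_digits_replaced: bool = True, lowercase: bool = True) -> str:
    if zero_digits_replaced:
        token = re.sub(r"\d", "0", token)
    if lowercase:
        token = token.lower()
    return token

print(norm_token("Love"))  # love
print(norm_token("3281"))  # 0000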
- class dlk.data.subprocessors.token_norm.TokenNormConfig(stage, config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for TokenNorm
- Config Example:
>>> { >>> "_name": "token_norm", >>> "config": { >>> "train":{ >>> "data_set": { // for different stage, this processor will process different part of data >>> "train": ['train', 'valid', 'test', 'predict'], >>> "predict": ['predict'], >>> "online": ['online'] >>> }, >>> "zero_digits_replaced": true, >>> "lowercase": true, >>> "extend_vocab": "", //when lowercase is true, this upper_case_vocab will collection all tokens the token is not in vocab but it's lowercase is in vocab. this is only for token gather process >>> "tokenizer": "whitespace_split", //the path to vocab(if the token in vocab skip norm it), the file is setted to one token per line >>> "data_pair": { >>> "sentence": "norm_sentence" >>> }, >>> }, >>> "predict": "train", >>> "online": "train", >>> } >>> }
dlk.data.subprocessors.txt_cls_loader module
dlk.data.subprocessors.txt_reg_loader module
Module contents
processors
Module contents
dlk.managers package
Submodules
dlk.managers.lightning module
- class dlk.managers.lightning.LightningManager(config: dlk.managers.lightning.LightningManagerConfig, rt_config: Dict)[source]
Bases:
object
pytorch-lightning training manager
- fit(**inputs)[source]
fit the model and datamodule to trainer
- Parameters
**inputs – dict of inputs, including "model" and "datamodule"
- Returns
Undefined
- get_callbacks(callback_configs: List[Dict], rt_config: Dict)[source]
init the callbacks and return the callbacks list
- Parameters
callback_configs – the config of every callback
rt_config – {“save_dir”: ‘..’, “name”: ‘..’}
- Returns
all callbacks
- predict(**inputs)[source]
run prediction with the model and datamodule.predict_dataloader
- Parameters
**inputs – dict of inputs, including "model" and "datamodule"
- Returns
predict list
- class dlk.managers.lightning.LightningManagerConfig(config)[source]
Bases:
dlk.utils.config.BaseConfig
docstring for LightningManagerConfig; check the trainer documentation at https://pytorch-lightning.readthedocs.io for parameter details
- get_callbacks_config(config: Dict) List[Dict] [source]
get the configs for callbacks
- Parameters
config – {“config”: {“callbacks”: [“callback_names”..]}, “callback@callback_names”: {config}}
- Returns
the configs whose names are in config['config']['callbacks']
Module contents
managers
dlk.utils package
Submodules
dlk.utils.config module
Provides BaseConfig, which offers the basic methods for configs, and ConfigTool, a general config (dict) processing tool
- class dlk.utils.config.BaseConfig(config: Dict)[source]
Bases:
object
BaseConfig provides the basic functions for all configs
- class dlk.utils.config.ConfigTool[source]
Bases:
object
This class is not used as much as it was designed for.
- static do_update_config(config: dict, update_config: Optional[dict] = None) Dict [source]
use the update_config dict to update the config dict, recursively
see ConfigTool._inplace_update_dict
- Parameters
config – the dict to be updated
update_config – the update config; use _new to update _base
- Returns
updated_config
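A minimal sketch of the recursive update this method describes (assumed behavior: nested dicts are merged, all other values are overwritten by the update config); this mirrors the Inherit rule shown later in "Config Parser Rules" and is not the dlk code itself:
from typing import Dict, Optional

# Hypothetical recursive dict update (not ConfigTool's implementation).
def recursive_update(config: Dict, update: Optional[Dict] = None) -> Dict:
    update = update or {}
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            config[key] = recursive_update(config[key], value)
        else:
            config[key] = value
    return config

base = {"config": {"will_be_rewrite": 1, "keep": 8}}
child = {"config": {"will_be_rewrite": 3}}
print(recursive_update(base, child))  # {'config': {'will_be_rewrite': 3, 'keep': 8}}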
- static get_config_by_stage(stage: str, config: Dict) Dict [source]
get the stage_config for a specific stage from the provided config
if config[stage] is a str, the config of this stage equals that of the referenced stage, so return config[config[stage]]
- Config Example:
>>> config = { >>> "train":{ //train、predict、online stage config, using '&' split all stages >>> "data_pair": { >>> "label": "label_id" >>> }, >>> "data_set": { // for different stage, this processor will process different part of data >>> "train": ['train', 'dev'], >>> "predict": ['predict'], >>> "online": ['online'] >>> }, >>> "vocab": "label_vocab", // usually provided by the "token_gather" module >>> }, >>> "predict": "train", >>> "online": ["train", >>> {"vocab": "new_label_vocab"} >>> ] >>> } >>> config.get_config['predict'] == config['predict'] == config['train']
- Parameters
stage – the stage, like ‘train’, ‘predict’, etc.
config – the base config which has different stage config
- Returns
stage_config
- static get_leaf_module(module_register: dlk.utils.register.Register, module_config_register: dlk.utils.register.Register, module_name: str, config: Dict) Tuple[Any, object] [source]
get the module named module_name from module_register and its config from module_config_register
- Parameters
module_register – register for module which has ‘module_name’
module_config_register – config register for config which has ‘module_name’
module_name – the module name which we want to get from register
- Returns
module(which name is module_name), module_config(which name is module_name)
dlk.utils.get_root module
Get the dlk package root path
dlk.utils.logger module
- class dlk.utils.logger.Logger(log_file: str = '', base_dir: str = 'logs', log_level: str = 'debug', log_name='dlk')[source]
Bases:
object
docstring for logger
- static get_logger() loguru._logger.Logger [source]
return the ‘dlk’ logger if initialized otherwise init and return it
- Returns
Logger.global_logger
- global_log_file: set[str] = {}
- global_logger: loguru._logger.Logger = <loguru.logger handlers=[(id=1, level=10, sink=<stdout>)]>
- static init_file_logger(log_file, base_dir='logs', log_level: str = 'debug')[source]
init(if there is not one) or change(if there already is one) the log file
- Parameters
log_file – log file path
base_dir – real log path is ‘$base_dir/$log_file’
log_level – ‘debug’, ‘info’, etc.
- Returns
None
- static init_global_logger(log_level: str = 'debug', log_name: Optional[str] = None, reinit: bool = False)[source]
init the global_logger
- Parameters
log_level – change this to log at a different level
log_name – changing this is not suggested
reinit – if set true, will force reinit
- Returns
None
- level_map = {'debug': 'DEBUG', 'error': 'ERROR', 'info': 'INFO', 'warning': 'WARNING'}
- log_name: str = 'dlk'
- warning_file = False
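A usage sketch based on the static methods documented above; the file name and log level are illustrative:
from dlk.utils.logger import Logger

# Route logs to logs/train.log in addition to stdout (paths are illustrative).
Logger.init_file_logger("train.log", base_dir="logs", log_level="info")
logger = Logger.get_logger()  # returns the shared 'dlk' loguru logger
logger.info("training started")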
dlk.utils.parser module
- class dlk.utils.parser.BaseConfigParser(config_file: Union[str, Dict, List], config_base_dir: str = '', register: Optional[dlk.utils.register.Register] = None)[source]
Bases:
object
The config parser order is: inherit -> search -> link
If some config value is marked with "@", the parameter has no default value and you must override it (like 'label_nums', etc.).
- static check_config(configs: Union[Dict, List[Dict]]) None [source]
check that every config is valid.
check that every "@" has been replaced with a correct value.
- Parameters
configs – TODO
- Returns
None
- Raises
ValueError –
- classmethod collect_link(config, trace: Optional[List] = None, all_level_links: Optional[Dict] = None, level=0)[source]
collect and move all links in the config to the top level
only done at the top level of the config; collect the links of every level and return them with their levels
- Parameters
config –
>>> { >>> "arg1": { >>> "arg11": 2 >>> "arg12": 3 >>> "_link": {"arg11": "arg12"} >>> } >>> }
all_level_links – TODO
level – TODO
- Returns
>>> { >>> "arg1": { >>> "arg11": 2 >>> "arg12": 3 >>> } >>> "_link": {"arg1.arg11": "arg1.arg12"} >>> }
- static config_link_para(link: Optional[Dict[str, Union[str, List[str]]]] = None, config: Optional[Dict] = None)[source]
inplace link the config[to] = config[source]
- Parameters
link – {link-from:link-to-1, link-from:[link-to-2, link-to-3]}
config – the base config to be linked
- Returns
None
- classmethod flat_search(search, config: dict) List[dict] [source]
flatten all the _search paras into a list
recursive parsing of _search is now supported, which means you can add _search/_link/_base paras inside _search paras, but you should only search the current level's paras
- Parameters
search – search paras, {“para1”: [1,2,3], ‘para2’: ‘list(range(10))’}
config – base config
Returns: list of possible config
- classmethod get_base_config(config_name: str) Dict [source]
get the base config using the config_name
- Parameters
config_name – the config name
- Returns
config of the config_name
- get_cartesian_prod(list_of_list_of_dict: List[List[Dict]]) List[List[Dict]] [source]
get the Cartesian product of the given lists
- Parameters
list_of_list_of_dict – [[config_a1, config_a2], [config_b1, config_b2]]
- Returns
[[config_a1, config_b1], [config_a1, config_b2], [config_a2, config_b1], [config_a2, config_b2]]
- get_kind_module_base_config(abstract_config: Union[dict, str], kind_module: str = '') List[dict] [source]
get the whole config of 'kind_module' from the given abstract_config
- Parameters
abstract_config – the config to be expanded
kind_module – the module kind, like 'embedding' or 'subprocessor', which is registered in config_parser_register
Returns: the parsed (whole) config of abstract_config
- static get_named_list_cartesian_prod(dict_of_list: Optional[Dict[str, List]] = None) List[Dict] [source]
get the Cartesian product of named lists
- Parameters
dict_of_list – {‘name1’: [1,2,3], ‘name2’: “list(range(1, 4))”}
- Returns
[{'name1': 1, 'name2': 1}, {'name1': 1, 'name2': 2}, {'name1': 1, 'name2': 3}, ...]
- Return type
List[Dict]
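A sketch of the named Cartesian product described above using itertools; the real method also accepts string expressions such as "list(range(1, 4))", which this illustration ignores, and it is not the parser's own code:
from itertools import product
from typing import Dict, List

# Illustrative reimplementation of a named Cartesian product (not the dlk method).
def named_cartesian_prod(dict_of_list: Dict[str, List]) -> List[Dict]:
    names = list(dict_of_list)
    return [dict(zip(names, values)) for values in product(*dict_of_list.values())]

print(named_cartesian_prod({"name1": [1, 2, 3], "name2": [1, 2, 3]}))
# -> [{'name1': 1, 'name2': 1}, {'name1': 1, 'name2': 2}, ...]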
- is_rep_config(list_of_dict: List[dict]) bool [source]
check whether there is a repeated config in the list
- Parameters
list_of_dict – a list of dict
- Returns
whether there is a repeat
- load_hjson_file(file_path: str) Dict [source]
load hjson file from file_path and return a Dict
- Parameters
file_path – the file path
Returns: loaded dict
- map_to_submodule(config: dict, map_fun: Callable) Dict [source]
map the map_fun to all submodules in config
use the map_fun to process all the modules
- Parameters
config – a dict of submodules; the key is the module kind which is registered in config_parser_register
map_fun – use the map_fun process the submodule
Returns: TODO
- class dlk.utils.parser.CallbackConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for CallbackConfigParser
- class dlk.utils.parser.DatamoduleConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for DatamoduleConfigParser
- class dlk.utils.parser.DecoderConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for DecoderConfigParser
- class dlk.utils.parser.EmbeddingConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for EmbeddingConfigParser
- class dlk.utils.parser.EncoderConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for EncoderConfigParser
- class dlk.utils.parser.IModelConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for IModelConfigParser
- class dlk.utils.parser.InitMethodConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for InitMethodConfigParser
- class dlk.utils.parser.LinkUnionTool[source]
Bases:
object
Assisting tool for parsing the "_link" of a config. Everything registered at the top level has higher priority than the low level
This class mostly resolves the conflicts between low-level and high-level registered links.
- find(key: str)[source]
find the root of the key
- Parameters
key – a token
- Returns
the root of the key
- low_level_union(link_from: str, link_to: str)[source]
union the low level link_from->link_to pair
On the basis of the high-level links, this function registers a low-level link. If neither link-from nor link-to has appeared before, they are registered directly. If only one of link-from and link-to has appeared, the values of link-from and link-to are overwritten by the corresponding value of the upper level. If both link-from and link-to have appeared before and they link to the same value, we do nothing; otherwise RAISE AN ERROR.
- Parameters
link_from – the link-from key
link_to – the link-to key
- Returns
None
- register_low_links(links: Dict)[source]
register the low-level links; low level means the base (parent) level config
- Parameters
links – {“link-from”: [“list of link-to”], “link-from2”: “link-to2”}
- Returns
self
- register_top_links(links: Dict)[source]
register the top level links, top level means the link_to level config
- Parameters
links – {“from”: [“tolist”], “from2”: “to2”}
- Returns
self
- top_level_union(link_from: str, link_to: str)[source]
union the top level link_from->link_to pair
Links (link-from -> link-to) registered in the same (top) level config should be merged using top_level_union. Parameters are not allowed to be assigned repeatedly (the same parameter cannot appear more than once in the link-to position, otherwise it causes ambiguity).
- Parameters
link_from – the link-from key
link_to – the link-to key
- Returns
None
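A generic union-find sketch of the find/union idea behind LinkUnionTool; it ignores the top/low-level priority rules described above and is not the dlk class itself:
# Generic union-find (not dlk's LinkUnionTool; priority handling is omitted).
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, key: str) -> str:
        # the root of an unseen key is the key itself
        self.parent.setdefault(key, key)
        if self.parent[key] != key:
            self.parent[key] = self.find(self.parent[key])  # path compression
        return self.parent[key]

    def union(self, link_from: str, link_to: str):
        root_from, root_to = self.find(link_from), self.find(link_to)
        if root_from != root_to:
            self.parent[root_to] = root_from  # link-to follows link-from

uf = UnionFind()
uf.union("config.para1", "config.para2")
print(uf.find("config.para2"))  # -> config.para1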
- class dlk.utils.parser.LossConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for LossConfigParser
- class dlk.utils.parser.ManagerConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for ManagerConfigParser
- class dlk.utils.parser.ModelConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for ModelConfigParser
- class dlk.utils.parser.ModuleConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for ModuleConfigParser
- class dlk.utils.parser.OptimizerConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for OptimizerConfigParser
- class dlk.utils.parser.PostProcessorConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for PostProcessorConfigParser
- class dlk.utils.parser.ProcessorConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for ProcessorConfigParser
- class dlk.utils.parser.RootConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for RootConfigParser
- class dlk.utils.parser.ScheduleConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for ScheduleConfigParser
- class dlk.utils.parser.SubProcessorConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for SubProcessorConfigParser
- class dlk.utils.parser.TaskConfigParser(config_file)[source]
Bases:
dlk.utils.parser.BaseConfigParser
docstring for TaskConfigParser
dlk.utils.quick_search module
- class dlk.utils.quick_search.QuickSearch(words: Iterable = [])[source]
Bases:
object
Aho-Corasick enhanced Trie
- add_words(words: Iterable)[source]
add words from iterator to the trie
- Parameters
words – Iterable[tokens]
- Returns
None
- has(key: str) bool [source]
check whether key is in the trie
- Parameters
key – a token(str)
- Returns
bool(has or not)
- search(search_str: str) List[Dict] [source]
find whether some substring of search_str is in the trie
- Parameters
search_str – the string to search
- Returns
>>> the result organized as { >>> "start": start_position, >>> "end": end_position, >>> "str": search_str[start_position: end_position] >>> }
- Return type
>>> list of result
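A usage sketch based on the documented API above; the words are illustrative, and the fields of each match dict ("start", "end", "str") follow the docstring:
from dlk.utils.quick_search import QuickSearch

# Build a trie from a few words and search a string (words are illustrative).
qs = QuickSearch(["deep", "learning"])
qs.add_words(["toolkit"])
print(qs.has("deep"))  # True
for match in qs.search("deep learning toolkit"):
    print(match["start"], match["end"], match["str"])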
dlk.utils.register module
- class dlk.utils.register.Register(register_name: str)[source]
Bases:
object
- get(name: str = '') Any [source]
get the module by name
- Parameters
name – the name should be the real name or name+@+sub_name, and the
- Returns
registed module
dlk.utils.tokenizer_util module
- class dlk.utils.tokenizer_util.PreTokenizerFactory(tokenizer: tokenizers.Tokenizer)[source]
Bases:
object
- property bert
bert pre_tokenizer
- Returns
BertPreTokenizer
- property bytelevel
byte level pre_tokenizer
- Returns
ByteLevel
- property whitespace
whitespace pre_tokenizer
- Returns
Whitespace
- property whitespacesplit
whitespacesplit pre_tokenizer
- Returns
WhitespaceSplit
- class dlk.utils.tokenizer_util.TokenizerNormalizerFactory(tokenizer: tokenizers.Tokenizer)[source]
Bases:
object
- property lowercase
do lowercase normalizers
- Returns
Lowercase
- property nfc
do nfc normalizers
- Returns
NFC
- property nfd
do nfd normalizers
- Returns
NFD
- property strip
do strip normalizers
- Returns
StripAccents
- property strip_accents
do strip normalizers
- Returns
StripAccents
dlk.utils.vocab module
- class dlk.utils.vocab.Vocabulary(do_strip: bool = False, unknown: str = '', ignore: str = '', pad: str = '')[source]
Bases:
object
generate a vocab from tokens (a single token or an Iterable of tokens); you can dump the object to a dict and load it from a dict
- add_from_iter(iterator)[source]
add the tokens in iterator to vocab
- Parameters
iterator – List[str] | Set[str] | List[List[str]]
- Returns
self
- auto_get_index(data: Union[str, List])[source]
get the index of every word in data from this vocab
- Parameters
data – auto detection
- Returns
the same type as data
- auto_update(data: Union[str, Iterable])[source]
auto detect data type to update the vocab
- Parameters
data – str| List[str] | Set[str] | List[List[str]]
- Returns
self
- filter_rare(min_freq=1, most_common=- 1)[source]
filter out the words whose count is too small.
min_freq and most_common cannot both be set
- Parameters
min_freq – minimum frequency
most_common – most common number, -1 means all
- Returns
None
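A usage sketch based on the documented method names; the tokens are illustrative, and whether filter_rare must be called before lookups is an assumption here:
from dlk.utils.vocab import Vocabulary

# Build a small vocab and look tokens back up (tokens are illustrative).
vocab = Vocabulary(unknown="[UNK]", pad="[PAD]")
vocab.auto_update(["deep", "learning", "toolkit", "deep"])
vocab.filter_rare(min_freq=1)                     # keep everything seen at least once
print(vocab.auto_get_index("deep"))               # a single id
print(vocab.auto_get_index(["deep", "toolkit"]))  # a list of ids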
Module contents
Appointments
Data format
Input
For one sentence processor:
The input is one sentence named "sentence", with the label named "labels"
The output named:
"input_ids",
"label_ids",
"word_ids",
"attention_mask",
"special_tokens_mask",
"type_ids",
"sequence_ids",
"char_ids",
The input is two sentences named "sentence_a" and "sentence_b", with the label named "labels"
The output named:
"input_ids",
"label_ids",
"word_ids",
"attention_mask",
"special_tokens_mask",
"type_ids",
"sequence_ids",
"char_ids",
MASK
We set mask == 1 for used data and mask == 0 for unused data
Batch First
All data uses batch_first=True
Task naming appointments
Every problem handled by DLK is treated as a task. A task can be divided into multiple sub-tasks, and a sub-task can have its own sub-tasks. A task is defined as follows:
{
"_name": "task_name", //or "_base", "base_task_name"
"_link": {}, // this is reserved keywords
"_search: {}, // this is reserved keywords"
"sub_task1":{
},
"sub_task2":{
}
}
Since every task can itself be treated as a sub-task of another task, let's look at some conventions for a sub-task.
This is the config format of a sub-task:
{
"sub_task_name": {
"_name": "sub_task_config_name",
...config
}
}
or
{
"sub_task_name": {
"_base": "base_sub_task_config_name",
...additional config
}
}
The config key represents this sub-task. The name sub_task_name usually indicates the role the sub-task plays in this task, and each sub-task is usually handled by a dedicated dlk module. For example, the subprocessor sub-tasks of a processor task are all handled by the dlk.data.subprocessors module collection (which contains many subprocessors). To distinguish the different subprocessors, we name the sub_task_name as subprocessor@subprocessor_name_a, which states that the subprocessor with the subprocessor_name_a functionality from the subprocessors module collection handles this sub-task.
For the _base or _name entries in the config file, the module name omits the sub_task_name already contained in the key, and the configure of a sub-task is named in the form AA@BB#CC:
AA is the concrete module name inside the module collection that handles sub_task_name; for example, the most common value basic means the sub-task is handled by the basic module, with the processing logic defined under the name basic in the corresponding module collection.
BB indicates what problem this config handles (e.g. seq_lab/txt_cls/etc.).
CC indicates the core feature of the config file that handles this problem.
Model appointments
All dropout is applied to the output or inside the module; no dropout is applied to the module input
The main file tree:
.
├── train.py-------------------------: train entry
├── predict.py-----------------------: predict entry
├── process.py-----------------------: process entry
├── online.py------------------------: online entry
├── managers-------------------------: pytorch_lightning or other trainer
│ └── lightning.py-----------------:
├── configures-----------------------: all default or specifical config
│ ├── core-------------------------:
│ │ ├── callbacks----------------:
│ │ ├── imodels------------------:
│ │ ├── layers-------------------:
│ │ │ ├── decoders-------------:
│ │ │ ├── embeddings-----------:
│ │ │ └── encoders-------------:
│ │ ├── losses-------------------:
│ │ ├── models-------------------:
│ │ ├── modules------------------:
│ │ └── optimizers---------------:
│ ├── data-------------------------:
│ │ ├── datamodules--------------:
│ │ ├── processors---------------:
│ │ └── subprocessors------------:
│ ├── managers---------------------:
│ └── tasks------------------------:
├── core-----------------------------: *core* pytorch or other model code
│ ├── base_module.py---------------: base module for "layers"
│ ├── callbacks--------------------:
│ ├── imodels----------------------:
│ ├── layers-----------------------:
│ │ ├── decoders-----------------:
│ │ ├── embeddings---------------:
│ │ └── encoders-----------------:
│ ├── losses-----------------------:
│ ├── models-----------------------:
│ ├── modules----------------------:
│ ├── optimizers-------------------:
│ └── schedules--------------------:
├── data-----------------------------: *core* code for data process or manager
│ ├── datamodules------------------:
│ ├── postprocessors---------------:
│ ├── processors-------------------:
│ └── subprocessors----------------:
└── utils----------------------------:
├── config.py--------------------: process config(dict) toolkit
├── get_root.py------------------: get project root path
├── logger.py--------------------: logger
├── parser.py--------------------: parser config
├── register.py------------------: register the module to a registry
├── tokenizer_util.py------------: tokenizer util
└── vocab.py---------------------: vocabulary
Config Parser Rules
Inherit
Simple e.g.
default.hjson
{
_base: parant,
config: {
"will_be_rewrite": 3
}
}
parant.hjson
{
_name: base_config,
config: {
"will_be_rewrite": 1,
"keep": 8
}
}
Given the two configs named default.hjson and parant.hjson, the parse result will be:
{
_name: base_config,
config: {
"will_be_rewrite": 3,
"keep": 8
}
}
Grid Search
Simple e.g.
{
_name: search_example,
config: {
"para1": 10,
"para2": 20,
"para3": 30,
_search: {
"para1": [11, 12],
"para2": [21, 22],
}
}
}
Given the above config, the parse result will be a list of 4 possible configs.
[
{
_name: search_example,
config: {
"para1": 11,
"para2": 21,
"para3": 30,
}
},
{
_name: search_example,
config: {
"para1": 12,
"para2": 21,
"para3": 30,
}
},
{
_name: search_example,
config: {
"para1": 11,
"para2": 22,
"para3": 30,
}
},
{
_name: search_example,
config: {
"para1": 12,
"para2": 22,
"para3": 30,
}
},
]
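A conceptual sketch of how a _search block expands into concrete configs, mirroring the itertools-based product shown earlier; this is illustrative only, not the parser's internal code, and it ignores _link/_base and nested _search:
from itertools import product

# Conceptual expansion of a "_search" block into concrete configs.
def expand_search(config: dict) -> list:
    inner = dict(config["config"])          # copy so the input is not mutated
    search = inner.pop("_search", {})
    if not search:
        return [config]
    names = list(search)
    expanded = []
    for values in product(*search.values()):
        expanded.append({**config, "config": {**inner, **dict(zip(names, values))}})
    return expanded

cfg = {"_name": "search_example",
       "config": {"para1": 10, "para2": 20, "para3": 30,
                  "_search": {"para1": [11, 12], "para2": [21, 22]}}}
print(len(expand_search(cfg)))  # -> 4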
Link(Argument Passing)
Parameters are not allowed to be assigned repeatedly (the same parameter cannot appear more than once in the target position, otherwise it will cause ambiguity.)
If neither end of a low-level link has appeared before, the link is registered directly.
If only one of the key or the value has appeared in the higher-level _links, the values of both the key and the value will be overwritten by the corresponding value of the upper level.
If both have appeared before and they link to the same value, we do nothing; otherwise `RAISE AN ERROR`.
Simple e.g.
child.hjson
{
"_base": parant,
"config": {
"para1": 1,
"para2": 2,
"para3": 3,
}
"_link": {"config.para1": "config.para2"}
}
parant.hjson
{
"_name": parant,
"config":{
"para1": 4,
"para2": 5,
"para3": 6,
}
"_link": {"config.para2": "config.para1"}
}
the result will be:
result.hjson
{
"_name": parant,
"config":{
"para1": 1,
"para2": 1, # call link({"config.para1": "config.para2"})
"para3": 3,
}
}
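A minimal sketch of applying a resolved "_link": copy the value at the link-from path onto every link-to path. The dotted-path handling is illustrative and this is not the dlk parser's code:
# Hypothetical link application matching the result.hjson above.
def apply_link(config: dict, link: dict):
    def get(cfg, path):
        for part in path.split("."):
            cfg = cfg[part]
        return cfg

    def put(cfg, path, value):
        parts = path.split(".")
        for part in parts[:-1]:
            cfg = cfg[part]
        cfg[parts[-1]] = value

    for src, dsts in link.items():
        for dst in ([dsts] if isinstance(dsts, str) else dsts):
            put(config, dst, get(config, src))

cfg = {"config": {"para1": 1, "para2": 2, "para3": 3}}
apply_link(cfg, {"config.para1": "config.para2"})
print(cfg)  # -> {'config': {'para1': 1, 'para2': 1, 'para3': 3}}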
Focus(Representation)
The focus part is for simplifying the log file; we use the value of the focus dict to replace the key while logging.
SubModule(Combination)
Since we use a dict to represent a config and the key is regarded as the submodule name, one top-level module sometimes has two or more of the same submodule (with different configs). You can set the submodule name as 'submodule@special_name'.
The subprocessor config format
In subprocessors, the config is organized by processing stage (train, predict, online, etc.).
A stage config can be a dict, a str, or a tuple (a two-element list); each type is parsed differently (a sketch of the resolution logic follows this list).
When the config is a dict, this is the default type and it is used as-is.
When the config is a str, the string must be the name of another stage (train, predict, online, etc.) whose config is already defined as a dict as described in 1.
When the config is a tuple (a two-element list), the first element must be a str as described in 2, and the second element is an update config, a dict (or null) as described in 1, that overrides the referenced stage config.
Some config values are set to "*@*"; this means you must provide that key-value pair in your own config.
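A minimal sketch of resolving a stage config that is a dict, a str, or a two-element list, following the three rules above; resolve_stage is a hypothetical helper, not the dlk implementation.

from copy import deepcopy
from typing import Dict, List, Union


def resolve_stage(stage_config: Union[Dict, str, List], all_stages: Dict) -> Dict:
    """Resolve one stage config according to its type."""
    if isinstance(stage_config, dict):            # rule 1: use the dict as-is
        return stage_config
    if isinstance(stage_config, str):             # rule 2: alias another stage
        return deepcopy(all_stages[stage_config])
    base_name, update = stage_config              # rule 3: alias plus update dict
    resolved = deepcopy(all_stages[base_name])
    if update:
        resolved.update(update)
    return resolved


stages = {"predict": {"token_ids": "./token_ids.pkl"}}
print(resolve_stage(["predict", None], stages))   # same as the "predict" stage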
Processor Config Example
{
"processor": {
"_name": "test_text_classification",
"config": {
"feed_order": ["load", "tokenizer", "token_gather", "label_to_id", "save"]
},
"subprocessor@load": {
"_name": "load",
"config":{
"base_dir": "",
"predict":{
"token_ids": "./token_ids.pkl",
"embedding": "./embedding.pkl",
"label_ids": "./label_ids.pkl"
},
"online": [
"predict", //base predict
{ // special config that updates "predict"; in this case the config is null, meaning all config from "predict" is used
}
]
}
},
"subprocessor@save": {
"_name": "save",
"config":{
"base_dir": "",
"train":{
"data.train": "./train.pkl",
"data.dev": "./dev.pkl",
"token_ids": "./token_ids.pkl",
"embedding": "./embedding.pkl",
"label_ids": "./label_ids.pkl"
},
"predict": {
"data.predict": "./predict.pkl"
}
}
},
"subprocessor@tokenizer":{
"_base": "wordpiece_tokenizer",
"config": {
"train": { // you can add some whitespace surround the '&'
"data_set": { // for different stage, this processor will process different part of data
"train": ["train", "dev"],
"predict": ["predict"],
"online": ["online"]
},
"config_path": "./token.json",
"normalizer": ["nfd", "lowercase", "strip_accents", "some_processor_need_config": {config}], // if don't set this, will use the default normalizer from config
"pre_tokenizer": ["whitespace": {}], // if don't set this, will use the default normalizer from config
"post_processor": "bert", // if don't set this, will use the default normalizer from config, WARNING: not support disable the default setting( so the default tokenizer.post_tokenizer should be null and only setting in this configure)
"filed_map": { // this is the default value, you can provide other name
"tokens": "tokens",
"ids": "ids",
"attention_mask": "attention_mask",
"type_ids": "type_ids",
"special_tokens_mask": "special_tokens_mask",
"offsets": "offsets",
}, // the tokenizer outputs (the keys) are mapped to the values
"data_type": "single", // single or pair, if not provide, will calc by len(process_data)
"process_data": [
["sentence", { "is_pretokenized": false}],
],
/*"data_type": "pair", // single or pair*/
/*"process_data": [*/
/*['sentence_a', { "is_pretokenized": false}], */
/*['sentence_b', {}], the config of the second field must be the same as the first*/
/*],*/
},
"predict": "train",
"online": "train"
}
},
"subprocessor@token_gather":{
"_name": "token_gather",
"config": {
"train": { // only train stage using
"data_set": { // for different stage, this processor will process different part of data
"train": ["train", "dev"]
},
"gather_columns": ["label"], //List of columns. Every cell must be sigle token or list of tokens or set of tokens
"deliver": "label_vocab", // output Vocabulary object (the Vocabulary of labels) name.
"update": null, // null or another Vocabulary object to update
}
}
},
"subprocessor@label_to_id":{
"_name": "token2id",
"config": {
"train":{ //train、predict、online stage config, using '&' split all stages
"data_pair": {
"label": "label_id"
},
"data_set": { // for different stage, this processor will process different part of data
"train": ['train', 'dev'],
"predict": ['predict'],
"online": ['online']
},
"vocab": "label_vocab", // usually provided by the "token_gather" module
},
"predict": "train",
"online": "train",
}
}
}
}
To Process Data Format Example
You can provide the DataFrame-format data yourself, or use the task_name_loader (if one is provided, or you can write your own) to load your dict-format data into DataFrames.
{
"data": {
"train": pd.DataFrame, // may include these columns "uuid"、"origin"、"label"
"dev": pd.DataFrame, // may include these columns "uuid"、"origin"、"label"
}
}
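A small sketch of assembling this input format with pandas; the column names follow the example above, while the row values are made-up sample data for illustration only.

import uuid

import pandas as pd

# two tiny made-up splits with the columns "uuid", "origin", "label"
train = pd.DataFrame({
    "uuid": [str(uuid.uuid4()) for _ in range(2)],
    "origin": ["a great movie", "a boring movie"],
    "label": ["positive", "negative"],
})
dev = pd.DataFrame({
    "uuid": [str(uuid.uuid4())],
    "origin": ["an average movie"],
    "label": ["positive"],
})

to_process = {"data": {"train": train, "dev": dev}}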
Processed Data Format Example
{
"data": {
"train": pd.DataFrame, // may include these columns "uuid"、"origin"、"labels"、"origin_tokens"、"label_ids"、"origin_token_ids"
"dev": pd.DataFrame, // may include these columns "uuid"、"origin"、"labels"、"origin_tokens"、"label_ids"、"origin_token_ids"
},
"embedding": ..,
"token_vocab": ..,
"label_vocab": ..,
...
}