dlk.data.subprocessors package

Submodules

dlk.data.subprocessors.char_gather module

class dlk.data.subprocessors.char_gather.CharGather(stage: str, config: dlk.data.subprocessors.char_gather.CharGatherConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Gather all characters from the ‘gather_columns’ and deliver a vocabulary named ‘char_vocab’

process(data: Dict) Dict[source]

Character gather entry

Parameters

data –
>>> {
>>>     "data": {"train": ...},
>>>     "tokenizer": ...
>>> }

Returns

data, with data[self.config.deliver] set to a Vocabulary containing the gathered characters

split_to_char(input: Union[str, Iterable])[source]

The characters may come from a token or a whole sentence, so we split the input into a List[char]

Parameters

input – the type of the input is detected automatically and the input is split into characters

Returns

the same shape as the input, but with every str split into a List[char]
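For illustration, a minimal sketch of this shape-preserving split (not the dlk source; it simply mirrors the documented behaviour):

>>> from typing import Iterable, Union
>>>
>>> def split_to_char(input: Union[str, Iterable]):
>>>     # a str is split directly into a list of characters
>>>     if isinstance(input, str):
>>>         return list(input)
>>>     # any other iterable keeps its shape; each element is split recursively
>>>     return [split_to_char(sub) for sub in input]
>>>
>>> split_to_char("apple")          # ['a', 'p', 'p', 'l', 'e']
>>> split_to_char(["an", "apple"])  # [['a', 'n'], ['a', 'p', 'p', 'l', 'e']]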

class dlk.data.subprocessors.char_gather.CharGatherConfig(stage: str, config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for CharGather

Config Example:
>>> {
>>>     "_name": "char_gather",
>>>     "config": {
>>>         "train": {
>>>             "data_set": {                   // for different stages, this processor will process different parts of the data
>>>                 "train": ["train", "valid", "test"]
>>>             },
>>>             "gather_columns": "*@*", // list of columns; every cell must be a single token, a list of tokens, or a set of tokens
>>>             "deliver": "char_vocab", // name of the delivered Vocabulary object (the vocabulary of gathered characters)
>>>             "ignore": "", // token to ignore; the id of this token will be -1
>>>             "update": null, // null or another Vocabulary object to update
>>>             "unk": "[UNK]",
>>>             "pad": "[PAD]",
>>>             "min_freq": 1,
>>>             "most_common": -1, //-1 for all
>>>         }
>>>     }
>>> }
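To make the gather step concrete, here is a standalone sketch of what gathering characters from a column looks like; the DataFrame layout and the "tokens" column name are assumptions for illustration only:

>>> import pandas as pd
>>>
>>> # toy frame standing in for data["data"]["train"]
>>> train = pd.DataFrame({"tokens": [["an", "apple"], ["a", "pear"]]})
>>>
>>> chars = set()
>>> for cell in train["tokens"]:        # each cell: a single token, or a list/set of tokens
>>>     tokens = [cell] if isinstance(cell, str) else cell
>>>     for token in tokens:
>>>         chars.update(token)         # gather every character of every token
>>> sorted(chars)  # ['a', 'e', 'l', 'n', 'p', 'r'] -> the content of the delivered 'char_vocab'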

dlk.data.subprocessors.fast_tokenizer module

class dlk.data.subprocessors.fast_tokenizer.FastTokenizer(stage: str, config: dlk.data.subprocessors.fast_tokenizer.FastTokenizerConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

FastTokenizer uses huggingface tokenizers

Tokenize the single $sentence, or tokenize the pair $sentence_a, $sentence_b. Generates $tokens, $input_ids, $type_ids, $special_tokens_mask, $offsets, $word_ids, $overflowing, $sequence_ids

process(data: Dict) Dict[source]

Tokenizer entry

Parameters

data –
>>> {
>>>     "data": {"train": ...},
>>>     "tokenizer": ...
>>> }

Returns

data, with the tokenizer output stored in data[‘data’]; if self.config.deliver is set, data[self.config.deliver] will be set to self.tokenizer.to_str()

class dlk.data.subprocessors.fast_tokenizer.FastTokenizerConfig(stage, config)[source]

Bases: dlk.utils.config.BaseConfig

Config for FastTokenizer

Config Example:
>>> {
>>>     "_name": "fast_tokenizer",
>>>     "config": {
>>>         "train": {
>>>             "data_set": {                   // for different stages, this processor will process different parts of the data
>>>                 "train": ["train", "valid", "test"],
>>>                 "predict": ["predict"],
>>>                 "online": ["online"]
>>>             },
>>>             "config_path": "*@*",
>>>             "truncation": {     // if this is set to None or empty, truncation is not applied
>>>                 "max_length": 512,
>>>                 "strategy": "longest_first", // Can be one of longest_first, only_first or only_second.
>>>             },
>>>             "normalizer": ["nfd", "lowercase", "strip_accents", {"some_processor_need_config": {config}}], // if not set, the default normalizer from the tokenizer config is used
>>>             "pre_tokenizer": [{"whitespace": {}}], // if not set, the default pre_tokenizer from the tokenizer config is used
>>>             "post_processor": "bert", // if not set, the default post_processor from the tokenizer config is used. WARNING: disabling the default is not supported (so the default tokenizer.post_processor should be null and should only be set in this config)
>>>             "output_map": { // these are the default values, you can provide other names
>>>                 "tokens": "tokens",
>>>                 "ids": "input_ids",
>>>                 "attention_mask": "attention_mask",
>>>                 "type_ids": "type_ids",
>>>                 "special_tokens_mask": "special_tokens_mask",
>>>                 "offsets": "offsets",
>>>                 "word_ids": "word_ids",
>>>                 "overflowing": "overflowing",
>>>                 "sequence_ids": "sequence_ids",
>>>             }, // each tokenizer output (the key) is mapped to the given value name
>>>             "input_map": {
>>>                 "sentence": "sentence", // for single input, tokenize the "sentence"
>>>                 "sentence_a": "sentence_a", // for pair inputs, tokenize "sentence_a" and "sentence_b"
>>>                 "sentence_b": "sentence_b", //for pair inputs
>>>             },
>>>             "deliver": "tokenizer",
>>>             "process_data": { "is_pretokenized": false},
>>>             "data_type": "single", // single or pair; if not provided, it is inferred from len(process_data)
>>>         },
>>>         "predict": ["train", {"deliver": null}],
>>>         "online": ["train", {"deliver": null}],
>>>     }
>>> }
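The keys of "output_map" mirror the attribute names of a huggingface tokenizers Encoding object. The following standalone sketch (independent of dlk, using the tokenizers library directly) shows where each field comes from; the checkpoint name is only an example:

>>> from tokenizers import Tokenizer
>>>
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")    # example checkpoint
>>> enc = tokenizer.encode("an apple")                    # single input ($sentence)
>>> # enc = tokenizer.encode("sentence a", "sentence b")  # pair input ($sentence_a, $sentence_b)
>>> enc.tokens               # -> "tokens"
>>> enc.ids                  # -> "input_ids"
>>> enc.attention_mask       # -> "attention_mask"
>>> enc.type_ids             # -> "type_ids"
>>> enc.special_tokens_mask  # -> "special_tokens_mask"
>>> enc.offsets              # -> "offsets" (character span of each token)
>>> enc.word_ids             # -> "word_ids" (None for special tokens)
>>> enc.overflowing          # -> "overflowing" (extra encodings produced by truncation)
>>> enc.sequence_ids         # -> "sequence_ids"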

dlk.data.subprocessors.load module

class dlk.data.subprocessors.load.Load(stage: str, config: dlk.data.subprocessors.load.LoadConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Load the $meta, etc. into data

load(path: str)[source]

load data from path

Parameters

path – the path to data

Returns

loaded data

process(data: Dict) Dict[source]

Load entry

Parameters

data –
>>> {
>>>     "data": {"train": ...},
>>>     "tokenizer": ...
>>> }

Returns

data + loaded_data

class dlk.data.subprocessors.load.LoadConfig(stage, config)[source]

Bases: dlk.utils.config.BaseConfig

Config for Load

Config Example:
>>> {
>>>     "_name": "load",
>>>     "config":{
>>>         "base_dir": "",
>>>         "predict":{
>>>             "meta": "./meta.pkl",
>>>         },
>>>         "online": [
>>>             "predict", //base predict
>>>             {   // special config that updates "predict"; in this case the config is empty, which means all config comes from "predict". When this is an empty dict, you can instead set the value to the string "predict" and get the same result
>>>             }
>>>         ]
>>>     }
>>> },
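A minimal sketch of what loading the configured artifacts might look like; pickle is an assumption based on the ".pkl" suffix in the example, and the real loader may differ:

>>> import os
>>> import pickle
>>>
>>> def load(path: str):
>>>     # load one pickled artifact, e.g. the meta produced by the "save" processor
>>>     with open(path, "rb") as f:
>>>         return pickle.load(f)
>>>
>>> base_dir = "."
>>> meta = load(os.path.join(base_dir, "meta.pkl"))  # e.g. the saved 'label_ids', 'embedding'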

dlk.data.subprocessors.save module

class dlk.data.subprocessors.save.Save(stage: str, config: dlk.data.subprocessors.save.SaveConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Save the processed data to $base_dir/$processed. Save the meta data (like vocab, embedding, etc.) to $base_dir/$meta.

process(data: Dict) Dict[source]

Save entry

Parameters

data –
>>> {
>>>     "data": {"train": ...},
>>>     "tokenizer": ...
>>> }

Returns

data

save(data, path: str)[source]

save data to path

Parameters
  • data – pickleable data

  • path – the path to data

Returns

the saved data

class dlk.data.subprocessors.save.SaveConfig(stage, config)[source]

Bases: dlk.utils.config.BaseConfig

Config for Save

Config Example:
>>> {
>>>     "_name": "save",
>>>     "config":{
>>>         "base_dir": "",
>>>         "train":{
>>>             "processed": "processed_data.pkl", // all data without meta
>>>             "meta": {
>>>                 "meta.pkl": ['label_ids', 'embedding'] // only saved for later reuse
>>>             }
>>>         },
>>>         "predict": {
>>>             "processed": "processed_data.pkl",
>>>         }
>>>     }
>>> },
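A matching sketch of the save side, mirroring the config above; pickle is again an assumption based on the ".pkl" suffix:

>>> import pickle
>>>
>>> def save(data, path: str):
>>>     # pickle `data` to `path` and hand it back unchanged
>>>     with open(path, "wb") as f:
>>>         pickle.dump(data, f)
>>>     return data
>>>
>>> # "processed": everything except the meta; "meta": only the keys listed in the config
>>> save({"data": {"train": "..."}}, "./processed_data.pkl")
>>> save({"label_ids": "...", "embedding": "..."}, "./meta.pkl")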

dlk.data.subprocessors.seq_lab_firstpiece_relable module

dlk.data.subprocessors.seq_lab_loader module

dlk.data.subprocessors.seq_lab_relabel module

class dlk.data.subprocessors.seq_lab_relabel.SeqLabRelabel(stage: str, config: dlk.data.subprocessors.seq_lab_relabel.SeqLabRelabelConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Relabel the JSON data to BIO format

find_position_in_offsets(position: int, offset_list: List, sub_word_ids: List, start: int, end: int, is_start: bool = False)[source]

find the sub-word index for which offset_list[index][0] <= position < offset_list[index][1]

Parameters
  • position – position

  • offset_list – list of all tokens offsets

  • sub_word_ids – word_ids from tokenizer

  • start – start search index

  • end – end search index

  • is_start – whether the position is the start of the target token; if is_start==True and no match is found, return -1

Returns

the index of the offset which includes the position
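A simplified re-implementation of this search, as a sketch (the dlk version may handle edge cases differently):

>>> from typing import List
>>>
>>> def find_position_in_offsets(position: int, offset_list: List, sub_word_ids: List,
>>>                              start: int, end: int, is_start: bool = False) -> int:
>>>     # scan tokens in [start, end) for the one whose character span covers `position`
>>>     for index in range(start, end):
>>>         if sub_word_ids[index] is None:        # skip special tokens like [CLS]/[SEP]
>>>             continue
>>>         token_start, token_end = offset_list[index]
>>>         if token_start <= position < token_end:
>>>             return index
>>>     return -1   # not found (the documented contract for is_start=True)
>>>
>>> offsets = [(0, 0), (0, 2), (3, 8), (0, 0)]   # e.g. [CLS] "an" "apple" [SEP]
>>> word_ids = [None, 0, 1, None]
>>> find_position_in_offsets(4, offsets, word_ids, 0, len(offsets), is_start=True)  # -> 2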

process(data: Dict) Dict[source]

SeqLabRelabel Entry

Parameters

data – Dict

Returns

relabeled data

relabel(one_ins: pandas.core.series.Series)[source]

Make token labels. If you want labels only on the first piece of each word, use ‘seq_lab_firstpiece_relabel’ instead.

Parameters

one_ins – includes the sentence, entity_info, and offsets

Returns

labels (one label for each subtoken)

class dlk.data.subprocessors.seq_lab_relabel.SeqLabRelabelConfig(stage, config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for SeqLabRelabel

Config Example:
>>> {
>>>     "_name": "seq_lab_relabel",
>>>     "config": {
>>>         "train":{
>>>             "input_map": {  // don't change this unless necessary
>>>                 "word_ids": "word_ids",
>>>                 "offsets": "offsets",
>>>                 "entities_info": "entities_info",
>>>             },
>>>             "data_set": {                   // for different stages, this processor will process different parts of the data
>>>                 "train": ['train', 'valid', 'test'],
>>>                 "predict": ['predict'],
>>>                 "online": ['online']
>>>             },
>>>             "output_map": {
>>>                 "labels": "labels",
>>>             },
>>>             "drop": "shorter", // 'longer'/'shorter'/'none'; if entities overlap, one of them will be removed according to this rule
>>>             "start_label": "S",
>>>             "end_label": "E",
>>>             "clean_droped_entity": true, // after dropping an entity for training, whether to also drop it when calculating metrics; default is true; this only takes effect when drop != 'none'
>>>             "entity_priority": [],
>>>             //"entity_priority": ['Product'],
>>>             "priority_trigger": 1, // if the overlapping entities satisfy abs(length_a - length_b) <= priority_trigger, the entity_priority strategy is triggered
>>>         },
>>>         "predict": "train",
>>>         "online": "train",
>>>     }
>>> }
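To show the general idea of relabeling character-level entity spans into token-level BIO labels (this is only a sketch of the concept; the dlk implementation additionally handles overlap dropping, entity priority, and the start/end labels above):

>>> from typing import Dict, List, Tuple
>>>
>>> def bio_relabel(entities_info: List[Dict], offsets: List[Tuple[int, int]],
>>>                 word_ids: List) -> List[str]:
>>>     # one BIO label per sub-token, derived from character-span entities
>>>     labels = ["O"] * len(offsets)
>>>     for entity in entities_info:
>>>         inside = [i for i, (s, e) in enumerate(offsets)
>>>                   if word_ids[i] is not None and s < entity["end"] and e > entity["start"]]
>>>         if not inside:
>>>             continue
>>>         labels[inside[0]] = "B-" + entity["labels"][0]
>>>         for i in inside[1:]:
>>>             labels[i] = "I-" + entity["labels"][0]
>>>     return labels
>>>
>>> # "an apple" tokenized as [CLS] an apple [SEP]; "apple" is a Product entity
>>> bio_relabel([{"start": 3, "end": 8, "labels": ["Product"]}],
>>>             [(0, 0), (0, 2), (3, 8), (0, 0)], [None, 0, 1, None])
>>> # -> ['O', 'O', 'B-Product', 'O']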

dlk.data.subprocessors.token2charid module

class dlk.data.subprocessors.token2charid.Token2CharID(stage: str, config: dlk.data.subprocessors.token2charid.Token2CharIDConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Use a ‘Vocabulary’ to map the characters of tokens to ids

process(data: Dict) Dict[source]

Token2CharID Entry

A token like ‘apple’ will generate [1, 2, 2, 3] if max_token_len==4 and vocab.word2idx = {‘a’: 1, ‘p’: 2, ‘l’: 3}

Parameters

data – the data to process

Returns

updated data (token -> char_ids)
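A sketch of the per-token mapping described above, including truncation/padding to max_token_len (padding with index 0 is an assumption):

>>> def token_to_char_ids(token: str, word2idx: dict, max_token_len: int, pad_id: int = 0):
>>>     char_ids = [word2idx.get(char, pad_id) for char in token]     # unknown chars -> pad_id here
>>>     char_ids = char_ids[:max_token_len]                           # truncate long tokens
>>>     return char_ids + [pad_id] * (max_token_len - len(char_ids))  # pad short tokens
>>>
>>> word2idx = {"a": 1, "p": 2, "l": 3}
>>> token_to_char_ids("apple", word2idx, max_token_len=4)   # [1, 2, 2, 3]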

class dlk.data.subprocessors.token2charid.Token2CharIDConfig(stage, config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for Token2CharID

Config Example:
>>> {
>>>     "_name": "token2charid",
>>>     "config": {
>>>         "train":{
>>>             "data_pair": {
>>>                 "sentence & offsets": "char_ids"
>>>             },
>>>             "data_set": {                   // for different stages, this processor will process different parts of the data
>>>                 "train": ['train', 'valid', 'test', 'predict'],
>>>                 "predict": ['predict'],
>>>                 "online": ['online']
>>>             },
>>>             "vocab": "char_vocab", // usually provided by the "token_gather" module
>>>             "max_token_len": 20, // the max length of a token; the output will be max_token_len x token_num (putting max_token_len first makes padding on token_num easier)
>>>         },
>>>         "predict": "train",
>>>         "online": "train",
>>>     }
>>> }

dlk.data.subprocessors.token2id module

class dlk.data.subprocessors.token2id.Token2ID(stage: str, config: dlk.data.subprocessors.token2id.Token2IDConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Use a ‘Vocabulary’ to map tokens to ids

process(data: Dict) Dict[source]

Token2ID Entry

Tokens like [‘apple’] will generate [1] if vocab.word2idx = {‘apple’: 1}

Parameters

data – the data to process

Returns

updated data (tokens -> token_ids)
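A tiny sketch of the token-to-id lookup (falling back to the [UNK] id for unknown tokens is an assumption):

>>> def tokens_to_ids(tokens, word2idx, unk="[UNK]"):
>>>     return [word2idx.get(token, word2idx[unk]) for token in tokens]
>>>
>>> word2idx = {"[PAD]": 0, "[UNK]": 1, "apple": 2}
>>> tokens_to_ids(["apple", "pear"], word2idx)   # [2, 1]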

class dlk.data.subprocessors.token2id.Token2IDConfig(stage, config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for Token2ID

Config Example:
>>> {
>>>     "_name": "token2id",
>>>     "config": {
>>>         "train":{
>>>             "data_pair": {
>>>                 "labels": "label_ids"
>>>             },
>>>             "data_set": {                   // for different stages, this processor will process different parts of the data
>>>                 "train": ['train', 'valid', 'test', 'predict'],
>>>                 "predict": ['predict'],
>>>                 "online": ['online']
>>>             },
>>>             "vocab": "label_vocab", // usually provided by the "token_gather" module
>>>         },
>>>         "predict": "train",
>>>         "online": "train",
>>>     }
>>> }

dlk.data.subprocessors.token_embedding module

class dlk.data.subprocessors.token_embedding.TokenEmbedding(stage: str, config: dlk.data.subprocessors.token_embedding.TokenEmbeddingConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Gather token embeddings from the pretrained ‘embedding_file’, or initialize them (xavier_uniform init, with values clipped to the ‘bias_clip_range’)

The tokens come from a ‘Tokenizer’ (get_vocab) or a ‘Vocabulary’ (word2idx) object (exactly one of the two must be provided)

get_embedding(file_path, embedding_size) Dict[str, List[float]][source]

load the embeddings from file_path and keep only the last embedding_size dimensions of each embedding

Parameters
  • file_path – embedding file path

  • embedding_size – the embedding dim

Returns

>>> embedding_dict
>>> {
>>>     "word": [embedding, ...]
>>> }
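A sketch of reading such a file; the whitespace-separated, GloVe-style line format ("word v1 v2 ... vN") is an assumption:

>>> from typing import Dict, List
>>>
>>> def get_embedding(file_path: str, embedding_size: int) -> Dict[str, List[float]]:
>>>     embedding_dict: Dict[str, List[float]] = {}
>>>     with open(file_path, "r", encoding="utf-8") as f:
>>>         for line in f:
>>>             parts = line.rstrip().split(" ")
>>>             if len(parts) <= embedding_size:     # skip headers / malformed lines
>>>                 continue
>>>             word = " ".join(parts[:-embedding_size])
>>>             # keep only the last embedding_size dimensions
>>>             embedding_dict[word] = [float(x) for x in parts[-embedding_size:]]
>>>     return embedding_dict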

process(data: Dict) Dict[source]

TokenEmbedding Entry

Parameters

data – the data to process

Returns

data, updated with the embedding matrix: data[self.config.deliver] = np.array(embedding_mat)

update_embedding(embedding_dict: Dict[str, List[float]], vocab: List[str])[source]

update embedding_dict with the tokens that are in vocab but not yet in embedding_dict

Parameters
  • embedding_dict – word->embedding dict

  • vocab – token vocab

Returns

updated embedding_dict

class dlk.data.subprocessors.token_embedding.TokenEmbeddingConfig(stage, config)[source]

Bases: dlk.utils.config.BaseConfig

Config for TokenEmbedding

Config Example:
>>> {
>>>     "_name": "token_embedding",
>>>     "config": {
>>>         "train": {
>>>             "embedding_file": "*@*",
>>>             "tokenizer": null, // source of the tokens; exactly one of "tokenizer" and "vocab" must be provided
>>>             "vocab": null,
>>>             "deliver": "token_embedding", // name under which the embedding matrix is delivered
>>>             "embedding_size": 200,
>>>             "bias_clip_range": [0.5, 0.1], // the init embedding bias weight range; if you provide two values, the larger is the upper bound and the smaller the lower bound; if you provide one value, it is used as the bias
>>>         }
>>>     }
>>> }

dlk.data.subprocessors.token_gather module

class dlk.data.subprocessors.token_gather.TokenGather(stage: str, config: dlk.data.subprocessors.token_gather.TokenGatherConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

Gather all tokens from the ‘gather_columns’ and deliver a vocabulary named ‘token_vocab’

get_elements_from_series_by_trace(data: pandas.core.series.Series, trace: str) List[source]

get the data from data[trace_path]

For example:
>>> data[0] = {'entities_info': [{'start': 0, 'end': 1, 'labels': ['Label1']}]} // data is a series, and every element looks like data[0]
>>> trace = 'entities_info.labels'
>>> return_result = [['Label1']]

Parameters
  • data – origin data series

  • trace – get data element trace

Returns

the data in the tail of traces
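A simplified sketch of this trace lookup; lists encountered on the path are mapped element-wise and the per-row results are collected into one list, matching the example above:

>>> import pandas as pd
>>>
>>> def get_elements_by_trace(data: pd.Series, trace: str) -> list:
>>>     results = []
>>>     for element in data:                       # iterate the rows of the series
>>>         for key in trace.split("."):
>>>             if isinstance(element, list):      # a list on the path: trace into each item
>>>                 element = [item[key] for item in element]
>>>             else:
>>>                 element = element[key]
>>>         results.extend(element if isinstance(element, list) else [element])
>>>     return results
>>>
>>> data = pd.Series([{"entities_info": [{"start": 0, "end": 1, "labels": ["Label1"]}]}])
>>> get_elements_by_trace(data, "entities_info.labels")   # -> [['Label1']]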

process(data: Dict) Dict[source]

TokenGather entry

Parameters

data –
>>> {
>>>     "data": {"train": ...},
>>>     "tokenizer": ...
>>> }

Returns

data, with data[self.config.deliver] set to a Vocabulary containing the gathered tokens

class dlk.data.subprocessors.token_gather.TokenGatherConfig(stage: str, config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for TokenGather

Config Example:
>>> {
>>>     "_name": "token_gather",
>>>     "config": {
>>>         "train": {
>>>             "data_set": {                   // for different stages, this processor will process different parts of the data
>>>                 "train": ["train", "valid", "test"]
>>>             },
>>>             "gather_columns": "*@*", // list of columns; if an element of the list is a dict, you can configure more detail (see below). Every cell must be a single token, a list of tokens, or a set of tokens
>>>             //"gather_columns": ['tokens']
>>>             //"gather_columns": ['tokens', {"column": "entities_info", "trace": 'labels'}]
>>>             // the trace only traces into dicts; if a list is on the trace path, the trace is applied to every element of the list. For example, given {"entities_info": [{'start': 1, 'end': 2, labels: ['Label1']}, ..]}, the trace to the labels is 'entities_info.labels'
>>>             "deliver": "*@*", // output Vocabulary object (the Vocabulary of labels) name.
>>>             "ignore": "", // token to ignore; the id of this token will be -1
>>>             "update": null, // null or another Vocabulary object to update
>>>             "unk": "[UNK]",
>>>             "pad": "[PAD]",
>>>             "min_freq": 1,
>>>             "most_common": -1, //-1 for all
>>>         }
>>>     }
>>> }

dlk.data.subprocessors.token_norm module

class dlk.data.subprocessors.token_norm.TokenNorm(stage: str, config: dlk.data.subprocessors.token_norm.TokenNormConfig)[source]

Bases: dlk.data.subprocessors.ISubProcessor

This part could be merged into fast_tokenizer (which would save some time), but most processes do not need it (except for some special datasets like conll2003), and merging would make fast_tokenizer heavy.

Token norm:

Love -> love, 3281 -> 0000

process(data: Dict) Dict[source]

TokenNorm entry

Parameters

data –
>>> {
>>>     "data": {"train": ...},
>>>     "tokenizer": ...
>>> }

Returns

norm data

seq_norm(key: str, one_item: pandas.core.series.Series) str[source]

normalize a sentence; the sentence comes from one_item[key]

Parameters
  • key – the name in one_item

  • one_item – a pd.Series which include the key

Returns

norm_sentence

token_norm(token: str) str[source]

normalize a token; the result satisfies len(result) == len(token), e.g. 12348 -> 00000

Parameters

token – origin token

Returns

normed_token
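A sketch of this length-preserving normalization, driven by the zero_digits_replaced and lowercase options of the config below (the vocab check mentioned there is omitted):

>>> def token_norm(token: str, zero_digits_replaced: bool = True, lowercase: bool = True) -> str:
>>>     normed = "".join("0" if (zero_digits_replaced and char.isdigit()) else char
>>>                      for char in token)
>>>     if lowercase:
>>>         normed = normed.lower()
>>>     return normed                     # len(normed) == len(token)
>>>
>>> token_norm("Love")    # 'love'
>>> token_norm("12348")   # '00000'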

class dlk.data.subprocessors.token_norm.TokenNormConfig(stage, config: Dict)[source]

Bases: dlk.utils.config.BaseConfig

Config for TokenNorm

Config Example:
>>> {
>>>     "_name": "token_norm",
>>>     "config": {
>>>         "train":{
>>>             "data_set": {                   // for different stages, this processor will process different parts of the data
>>>                 "train": ['train', 'valid', 'test', 'predict'],
>>>                 "predict": ['predict'],
>>>                 "online": ['online']
>>>             },
>>>             "zero_digits_replaced": true,
>>>             "lowercase": true,
>>>             "extend_vocab": "", // when lowercase is true, this extend vocab will collect all tokens that are not themselves in the vocab but whose lowercase form is; this is only used for the token gather process
>>>             "tokenizer": "whitespace_split",  // the path to the vocab (if a token is in the vocab, skip normalizing it); the file is formatted as one token per line
>>>             "data_pair": {
>>>                 "sentence": "norm_sentence"
>>>             },
>>>         },
>>>         "predict": "train",
>>>         "online": "train",
>>>     }
>>> }

tokenize(seq)[source]

tokenize the seq

dlk.data.subprocessors.txt_cls_loader module

dlk.data.subprocessors.txt_reg_loader module

Module contents

processors

class dlk.data.subprocessors.ISubProcessor[source]

Bases: object

docstring for ISubProcessor

abstract process(data: Dict) Dict[source]

SubProcess entry

Parameters

data –
>>> {
>>>     "data": {"train": ...},
>>>     "tokenizer": ...
>>> }

Returns

processed data
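A minimal sketch of a custom subprocessor implementing this interface (the registration/import mechanism is not shown, and the pandas DataFrame layout of data["data"] is an assumption):

>>> from typing import Dict
>>> from dlk.data.subprocessors import ISubProcessor
>>>
>>> class LowercaseSentence(ISubProcessor):
>>>     """Toy subprocessor: lowercase the 'sentence' column of every data set."""
>>>     def __init__(self, stage: str, config: Dict):
>>>         self.stage = stage
>>>         self.config = config
>>>
>>>     def process(self, data: Dict) -> Dict:
>>>         for name, dataset in data["data"].items():
>>>             dataset["sentence"] = dataset["sentence"].str.lower()  # assumes pandas DataFrames
>>>         return data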

dlk.data.subprocessors.import_subprocessors(processors_dir, namespace)[source]