dlk.data.subprocessors package
Submodules
dlk.data.subprocessors.char_gather module
- class dlk.data.subprocessors.char_gather.CharGather(stage: str, config: dlk.data.subprocessors.char_gather.CharGatherConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Gather all characters from the 'gather_columns' and deliver a vocab named 'char_vocab'
- class dlk.data.subprocessors.char_gather.CharGatherConfig(stage: str, config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for CharGather
- Config Example:
>>> {
>>>     "_name": "char_gather",
>>>     "config": {
>>>         "train": {
>>>             "data_set": { // for different stages, this processor will process different parts of the data
>>>                 "train": ["train", "valid", "test"]
>>>             },
>>>             "gather_columns": "*@*", // list of columns; every cell must be a single token, a list of tokens, or a set of tokens
>>>             "deliver": "char_vocab", // name of the output Vocabulary object
>>>             "ignore": "", // ignored token; the id of this token will be -1
>>>             "update": null, // null or another Vocabulary object to update
>>>             "unk": "[UNK]",
>>>             "pad": "[PAD]",
>>>             "min_freq": 1,
>>>             "most_common": -1, // -1 for all
>>>         }
>>>     }
>>> }
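To make the gathering step concrete, below is a minimal sketch of the same idea in plain Python (a plain dict stands in for dlk's Vocabulary object, and gather_chars is a hypothetical helper, not dlk's implementation):

>>> from collections import Counter
>>>
>>> def gather_chars(rows, gather_columns, min_freq=1, most_common=-1,
>>>                  unk="[UNK]", pad="[PAD]"):
>>>     """Collect characters from every token in the given columns."""
>>>     counter = Counter()
>>>     for row in rows:
>>>         for col in gather_columns:
>>>             cell = row[col]
>>>             tokens = [cell] if isinstance(cell, str) else cell  # single token, list, or set
>>>             for token in tokens:
>>>                 counter.update(token)  # count every character of the token
>>>     items = counter.most_common(None if most_common == -1 else most_common)
>>>     vocab = {pad: 0, unk: 1}  # reserved ids, like the delivered "char_vocab"
>>>     for char, freq in items:
>>>         if freq >= min_freq:
>>>             vocab.setdefault(char, len(vocab))
>>>     return vocab
>>>
>>> gather_chars([{"tokens": ["Love", "dlk"]}], ["tokens"])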
dlk.data.subprocessors.fast_tokenizer module
- class dlk.data.subprocessors.fast_tokenizer.FastTokenizer(stage: str, config: dlk.data.subprocessors.fast_tokenizer.FastTokenizerConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
FastTokenizer uses Hugging Face tokenizers.
Tokenize the single $sentence, or tokenize the pair $sentence_a, $sentence_b. Generates $tokens, $input_ids, $type_ids, $special_tokens_mask, $offsets, $word_ids, $overflowing, $sequence_ids.
- class dlk.data.subprocessors.fast_tokenizer.FastTokenizerConfig(stage, config)[source]
Bases:
dlk.utils.config.BaseConfig
Config for FastTokenizer
- Config Example:
>>> {
>>>     "_name": "fast_tokenizer",
>>>     "config": {
>>>         "train": {
>>>             "data_set": { // for different stages, this processor will process different parts of the data
>>>                 "train": ["train", "valid", "test"],
>>>                 "predict": ["predict"],
>>>                 "online": ["online"]
>>>             },
>>>             "config_path": "*@*",
>>>             "truncation": { // if this is set to null or empty, no truncation is done
>>>                 "max_length": 512,
>>>                 "strategy": "longest_first", // can be one of longest_first, only_first, or only_second
>>>             },
>>>             "normalizer": ["nfd", "lowercase", "strip_accents", {"some_processor_need_config": {config}}], // if not set, the default normalizer from config is used
>>>             "pre_tokenizer": [{"whitespace": {}}], // if not set, the default pre_tokenizer from config is used
>>>             "post_processor": "bert", // if not set, the default post_processor from config is used. WARNING: disabling the default setting is not supported (so the default tokenizer.post_processor should be null and only set in this configure)
>>>             "output_map": { // these are the default values; you can provide other names
>>>                 "tokens": "tokens",
>>>                 "ids": "input_ids",
>>>                 "attention_mask": "attention_mask",
>>>                 "type_ids": "type_ids",
>>>                 "special_tokens_mask": "special_tokens_mask",
>>>                 "offsets": "offsets",
>>>                 "word_ids": "word_ids",
>>>                 "overflowing": "overflowing",
>>>                 "sequence_ids": "sequence_ids",
>>>             }, // maps each tokenizer output (the key) to the value
>>>             "input_map": {
>>>                 "sentence": "sentence", // for single input, tokenize the "sentence"
>>>                 "sentence_a": "sentence_a", // for pair inputs, tokenize the "sentence_a" && "sentence_b"
>>>                 "sentence_b": "sentence_b", // for pair inputs
>>>             },
>>>             "deliver": "tokenizer",
>>>             "process_data": { "is_pretokenized": false},
>>>             "data_type": "single", // single or pair; if not provided, computed from len(process_data)
>>>         },
>>>         "predict": ["train", {"deliver": null}],
>>>         "online": ["train", {"deliver": null}],
>>>     }
>>> }
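As a rough illustration of how these options map onto the Hugging Face tokenizers package, here is a hedged sketch using a toy WordLevel model (this is not the dlk wrapper itself):

>>> from tokenizers import Tokenizer
>>> from tokenizers.models import WordLevel
>>> from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents
>>> from tokenizers.pre_tokenizers import Whitespace
>>>
>>> tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "love": 1, "dlk": 2}, unk_token="[UNK]"))
>>> tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])  # "normalizer"
>>> tokenizer.pre_tokenizer = Whitespace()                                 # "pre_tokenizer"
>>> tokenizer.enable_truncation(max_length=512, strategy="longest_first")  # "truncation"
>>>
>>> encoding = tokenizer.encode("Love DLK")  # single input; pairs: encode(sentence_a, sentence_b)
>>> # the fields below are what "output_map" renames (e.g. ids -> input_ids):
>>> encoding.tokens, encoding.ids, encoding.type_ids
>>> encoding.special_tokens_mask, encoding.offsets, encoding.word_ids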
dlk.data.subprocessors.load module
- class dlk.data.subprocessors.load.Load(stage: str, config: dlk.data.subprocessors.load.LoadConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Load the $meta, etc. into data
- class dlk.data.subprocessors.load.LoadConfig(stage, config)[source]
Bases:
dlk.utils.config.BaseConfig
Config for Load
- Config Example:
>>> {
>>>     "_name": "load",
>>>     "config": {
>>>         "base_dir": "",
>>>         "predict": {
>>>             "meta": "./meta.pkl",
>>>         },
>>>         "online": [
>>>             "predict", // base predict
>>>             { // special config to update predict; in this case the config is empty, which means all config from "predict" is used. When this is an empty dict, you can instead set the value to the string "predict"; both give the same result
>>>             }
>>>         ]
>>>     }
>>> }
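A minimal sketch of what loading amounts to, assuming pickled files as in the example above (load_meta is a hypothetical helper, not dlk's implementation):

>>> import os, pickle
>>>
>>> def load_meta(base_dir, paths, data):
>>>     """Load each configured pickle (e.g. the meta) back into `data`."""
>>>     for name, rel_path in paths.items():
>>>         with open(os.path.join(base_dir, rel_path), "rb") as f:
>>>             data[name] = pickle.load(f)  # e.g. data["meta"] from ./meta.pkl
>>>     return data
>>>
>>> # load_meta("", {"meta": "./meta.pkl"}, data={})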
dlk.data.subprocessors.save module
- class dlk.data.subprocessors.save.Save(stage: str, config: dlk.data.subprocessors.save.SaveConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Save the processed data to $base_dir/$processed. Save the meta data (like vocab, embedding, etc.) to $base_dir/$meta.
- class dlk.data.subprocessors.save.SaveConfig(stage, config)[source]
Bases:
dlk.utils.config.BaseConfig
Config for Save
- Config Example:
>>> {
>>>     "_name": "save",
>>>     "config": {
>>>         "base_dir": "",
>>>         "train": {
>>>             "processed": "processed_data.pkl", // all data without meta
>>>             "meta": {
>>>                 "meta.pkl": ["label_ids", "embedding"] // only for next-time use
>>>             }
>>>         },
>>>         "predict": {
>>>             "processed": "processed_data.pkl",
>>>         }
>>>     }
>>> }
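The save layout implied above can be sketched as follows (save_processed is a hypothetical helper, assuming pickle serialization; not dlk's implementation):

>>> import os, pickle
>>>
>>> def save_processed(base_dir, data, processed, meta_spec):
>>>     """Write all data to `processed` and selected keys to each meta file."""
>>>     with open(os.path.join(base_dir, processed), "wb") as f:
>>>         pickle.dump(data["data"], f)  # all data without meta
>>>     for meta_file, keys in meta_spec.items():
>>>         with open(os.path.join(base_dir, meta_file), "wb") as f:
>>>             pickle.dump({k: data[k] for k in keys}, f)  # e.g. label_ids, embedding
>>>
>>> # save_processed("", data, "processed_data.pkl", {"meta.pkl": ["label_ids", "embedding"]})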
dlk.data.subprocessors.seq_lab_firstpiece_relable module
dlk.data.subprocessors.seq_lab_loader module
dlk.data.subprocessors.seq_lab_relabel module
- class dlk.data.subprocessors.seq_lab_relabel.SeqLabRelabel(stage: str, config: dlk.data.subprocessors.seq_lab_relabel.SeqLabRelabelConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Relabel the JSON data to BIO
- find_position_in_offsets(position: int, offset_list: List, sub_word_ids: List, start: int, end: int, is_start: bool = False)[source]
Find the subword index for which offset_list[index][0] <= position < offset_list[index][1]
- Parameters
position – position
offset_list – list of all tokens offsets
sub_word_ids – word_ids from tokenizer
start – start search index
end – end search index
is_start – whether the position is the start of the target token; if is_start is True and no matching index can be found, return -1
- Returns
the index of the offset which includes position
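A sketch consistent with the description above (the exact is_start boundary rule here is an assumption; this is not the verbatim dlk code):

>>> def find_position_in_offsets(position, offset_list, sub_word_ids, start, end, is_start=False):
>>>     for i in range(start, end):
>>>         if sub_word_ids[i] is None:  # skip special tokens like [CLS]/[SEP]
>>>             continue
>>>         token_start, token_end = offset_list[i]
>>>         if token_start <= position < token_end:
>>>             if is_start and position != token_start:
>>>                 return -1  # assumption: a start position must align with a token start
>>>             return i
>>>     return -1
>>>
>>> offsets = [(0, 0), (0, 4), (5, 8), (0, 0)]  # [CLS] "Love" "dlk" [SEP]
>>> find_position_in_offsets(5, offsets, [None, 0, 1, None], 0, 4)  # -> 2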
- class dlk.data.subprocessors.seq_lab_relabel.SeqLabRelabelConfig(stage, config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for SeqLabRelabel
- Config Example:
>>> {
>>>     "_name": "seq_lab_relabel",
>>>     "config": {
>>>         "train": {
>>>             "input_map": { // unless necessary, don't change this
>>>                 "word_ids": "word_ids",
>>>                 "offsets": "offsets",
>>>                 "entities_info": "entities_info",
>>>             },
>>>             "data_set": { // for different stages, this processor will process different parts of the data
>>>                 "train": ["train", "valid", "test"],
>>>                 "predict": ["predict"],
>>>                 "online": ["online"]
>>>             },
>>>             "output_map": {
>>>                 "labels": "labels",
>>>             },
>>>             "drop": "shorter", // 'longer'/'shorter'/'none'; if entities overlap, one will be removed by this rule
>>>             "start_label": "S",
>>>             "end_label": "E",
>>>             "clean_droped_entity": true, // after dropping an entity for training, whether to also drop it when calculating metrics; default is true, and this only works when drop != 'none'
>>>             "entity_priority": [],
>>>             //"entity_priority": ['Product'],
>>>             "priority_trigger": 1, // if overlapping entities satisfy abs(length_a - length_b) <= priority_trigger, the entity_priority strategy is triggered
>>>         },
>>>         "predict": "train",
>>>         "online": "train",
>>>     }
>>> }
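To make the relabeling concrete, here is a simplified sketch of mapping character-span entities onto token-level BIO tags via the tokenizer offsets (overlap dropping and entity_priority are omitted; to_bio is hypothetical, not dlk's code):

>>> def to_bio(offsets, word_ids, entities_info):
>>>     labels = ["O"] * len(offsets)
>>>     for ent in entities_info:
>>>         first = True
>>>         for i, (s, e) in enumerate(offsets):
>>>             if word_ids[i] is None or e <= ent["start"] or s >= ent["end"]:
>>>                 continue  # special token, or no overlap with the entity span
>>>             labels[i] = ("B-" if first else "I-") + ent["labels"][0]
>>>             first = False
>>>     return labels
>>>
>>> offsets = [(0, 0), (0, 4), (5, 8), (0, 0)]
>>> to_bio(offsets, [None, 0, 1, None], [{"start": 0, "end": 4, "labels": ["Product"]}])
>>> # ['O', 'B-Product', 'O', 'O']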
dlk.data.subprocessors.token2charid module
- class dlk.data.subprocessors.token2charid.Token2CharID(stage: str, config: dlk.data.subprocessors.token2charid.Token2CharIDConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Use 'Vocabulary' to map the characters of tokens to ids
- class dlk.data.subprocessors.token2charid.Token2CharIDConfig(stage, config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for Token2CharID
- Config Example:
>>> {
>>>     "_name": "token2charid",
>>>     "config": {
>>>         "train": {
>>>             "data_pair": {
>>>                 "sentence & offsets": "char_ids"
>>>             },
>>>             "data_set": { // for different stages, this processor will process different parts of the data
>>>                 "train": ["train", "valid", "test", "predict"],
>>>                 "predict": ["predict"],
>>>                 "online": ["online"]
>>>             },
>>>             "vocab": "char_vocab", // usually provided by the "token_gather" module
>>>             "max_token_len": 20, // the max length of a token; the output will be max_token_len x token_num (max_token_len is put first so padding can be applied on token_num)
>>>         },
>>>         "predict": "train",
>>>         "online": "train",
>>>     }
>>> }
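A minimal sketch of the per-token character mapping (a plain dict stands in for the Vocabulary; token2char_ids is hypothetical):

>>> def token2char_ids(tokens, char_vocab, max_token_len=20, pad_id=0, unk_id=1):
>>>     rows = []
>>>     for token in tokens:
>>>         ids = [char_vocab.get(c, unk_id) for c in token[:max_token_len]]
>>>         ids += [pad_id] * (max_token_len - len(ids))  # pad short tokens
>>>         rows.append(ids)
>>>     return rows  # one row of max_token_len char ids per token
>>>
>>> vocab = {"[PAD]": 0, "[UNK]": 1, "d": 2, "l": 3, "k": 4}
>>> token2char_ids(["dlk"], vocab, max_token_len=5)  # [[2, 3, 4, 0, 0]]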
dlk.data.subprocessors.token2id module
- class dlk.data.subprocessors.token2id.Token2ID(stage: str, config: dlk.data.subprocessors.token2id.Token2IDConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Use 'Vocabulary' to map the tokens to ids
- class dlk.data.subprocessors.token2id.Token2IDConfig(stage, config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for Token2ID
- Config Example:
>>> {
>>>     "_name": "token2id",
>>>     "config": {
>>>         "train": {
>>>             "data_pair": {
>>>                 "labels": "label_ids"
>>>             },
>>>             "data_set": { // for different stages, this processor will process different parts of the data
>>>                 "train": ["train", "valid", "test", "predict"],
>>>                 "predict": ["predict"],
>>>                 "online": ["online"]
>>>             },
>>>             "vocab": "label_vocab", // usually provided by the "token_gather" module
>>>         },
>>>         "predict": "train",
>>>         "online": "train",
>>>     }
>>> }
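The mapping itself is a straightforward lookup; a minimal sketch (a plain dict stands in for the Vocabulary delivered by "token_gather"):

>>> def tokens2ids(tokens, vocab, unk_id=1):
>>>     return [vocab.get(t, unk_id) for t in tokens]  # unknown tokens fall back to unk_id
>>>
>>> tokens2ids(["B-Product", "O"], {"O": 0, "B-Product": 2})  # [2, 0]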
dlk.data.subprocessors.token_embedding module
- class dlk.data.subprocessors.token_embedding.TokenEmbedding(stage: str, config: dlk.data.subprocessors.token_embedding.TokenEmbeddingConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Gather token embeddings from a pretrained 'embedding_file', or initialize them (xavier_uniform init, with values clipped to 'bias_clip_range')
The tokens come from a 'Tokenizer' (get_vocab) or a 'Vocabulary' (word2idx) object (exactly one of the two must be provided)
- get_embedding(file_path, embedding_size) Dict[str, List[float]] [source]
Load the embeddings from file_path, keeping only the last embedding_size dimensions of each embedding
- Parameters
file_path – embedding file path
embedding_size – the embedding dim
- Returns
>>> embedding_dict
>>> {
>>>     "word": [embedding, ...]
>>> }
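A minimal sketch of such a loader, assuming a GloVe-style text file with one "word v1 v2 ..." entry per line (not the exact dlk implementation):

>>> def get_embedding(file_path, embedding_size):
>>>     embedding_dict = {}
>>>     with open(file_path, encoding="utf-8") as f:
>>>         for line in f:
>>>             parts = line.rstrip().split()
>>>             if len(parts) <= embedding_size:
>>>                 continue  # skip header or malformed lines
>>>             word = " ".join(parts[:-embedding_size])
>>>             embedding_dict[word] = [float(x) for x in parts[-embedding_size:]]  # last dims only
>>>     return embedding_dict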
- class dlk.data.subprocessors.token_embedding.TokenEmbeddingConfig(stage, config)[source]
Bases:
dlk.utils.config.BaseConfig
Config for TokenEmbedding
- Config Example:
>>> {
>>>     "_name": "token_embedding",
>>>     "config": {
>>>         "train": {
>>>             "embedding_file": "*@*",
>>>             "tokenizer": null,
>>>             "vocab": null,
>>>             "deliver": "token_embedding", // name of the delivered embedding
>>>             "embedding_size": 200,
>>>             "bias_clip_range": [0.5, 0.1], // the init embedding bias weight range; if you provide two values, the larger is the upper bound and the smaller is the lower bound; if you provide one value, it is used as the bias
>>>         }
>>>     }
>>> }
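For the init path, a hedged sketch of xavier_uniform-style values clipped to bias_clip_range (init_embedding is hypothetical, and the exact clipping semantics here are an assumption):

>>> import math, random
>>>
>>> def init_embedding(vocab_size, embedding_size, bias_clip_range):
>>>     low, high = min(bias_clip_range), max(bias_clip_range)
>>>     bound = math.sqrt(6.0 / (vocab_size + embedding_size))  # xavier_uniform bound
>>>     def clip(x):  # assumption: clip the magnitude into [low, high], keep the sign
>>>         sign = 1.0 if x >= 0 else -1.0
>>>         return sign * max(low, min(high, abs(x)))
>>>     return [[clip(random.uniform(-bound, bound)) for _ in range(embedding_size)]
>>>             for _ in range(vocab_size)]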
dlk.data.subprocessors.token_gather module
- class dlk.data.subprocessors.token_gather.TokenGather(stage: str, config: dlk.data.subprocessors.token_gather.TokenGatherConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
Gather all tokens from the 'gather_columns' and deliver a vocab named 'token_vocab'
- get_elements_from_series_by_trace(data: pandas.core.series.Series, trace: str) List [source]
Get the data from data[trace_path]

>>> for example:
>>> data[0] = {'entities_info': [{'start': 0, 'end': 1, 'labels': ['Label1']}]} // data is a series, and every element looks like data[0]
>>> trace = 'entities_info.labels'
>>> return_result = [['Label1']]
- Parameters
data – origin data series
trace – the trace path to the data element
- Returns
the data at the tail of the trace
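A minimal sketch of the trace traversal that reproduces the documented example (get_by_trace is hypothetical; it handles one level of list nesting, which is all the example needs):

>>> def get_by_trace(element, trace):
>>>     for key in trace.split("."):
>>>         if isinstance(element, list):
>>>             element = [e[key] for e in element]  # apply the key to every list element
>>>         else:
>>>             element = element[key]
>>>     return element
>>>
>>> row = {"entities_info": [{"start": 0, "end": 1, "labels": ["Label1"]}]}
>>> get_by_trace(row, "entities_info.labels")  # [['Label1']]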
- class dlk.data.subprocessors.token_gather.TokenGatherConfig(stage: str, config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for TokenGather
- Config Example:
>>> {
>>>     "_name": "token_gather",
>>>     "config": {
>>>         "train": {
>>>             "data_set": { // for different stages, this processor will process different parts of the data
>>>                 "train": ["train", "valid", "test"]
>>>             },
>>>             "gather_columns": "*@*", // list of columns; if one element of the list is a dict, you can configure more. Every cell must be a single token, a list of tokens, or a set of tokens
>>>             //"gather_columns": ['tokens']
>>>             //"gather_columns": ['tokens', {"column": "entities_info", "trace": 'labels'}]
>>>             // the trace only traces dicts; if a list is on the trace path, the trace is applied to every element in the list. For example, given {"entities_info": [{'start': 1, 'end': 2, 'labels': ['Label1']}, ..]}, the trace to labels is 'entities_info.labels'
>>>             "deliver": "*@*", // name of the output Vocabulary object (the Vocabulary of labels)
>>>             "ignore": "", // ignored token; the id of this token will be -1
>>>             "update": null, // null or another Vocabulary object to update
>>>             "unk": "[UNK]",
>>>             "pad": "[PAD]",
>>>             "min_freq": 1,
>>>             "most_common": -1, // -1 for all
>>>         }
>>>     }
>>> }
dlk.data.subprocessors.token_norm module
- class dlk.data.subprocessors.token_norm.TokenNorm(stage: str, config: dlk.data.subprocessors.token_norm.TokenNormConfig)[source]
Bases:
dlk.data.subprocessors.ISubProcessor
This part could be merged into fast_tokenizer (which would save some time), but not every pipeline needs this step (only some special datasets like conll2003 do), and merging it would make fast_tokenizer heavy.
- Token norm:
Love -> love
3281 -> 0000
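A minimal sketch of those two rules (zero_digits_replaced and lowercase); the full TokenNorm also consults a vocab to skip in-vocab tokens:

>>> import re
>>>
>>> def norm_token(token, zero_digits_replaced=True, lowercase=True):
>>>     if zero_digits_replaced:
>>>         token = re.sub(r"\d", "0", token)  # 3281 -> 0000
>>>     if lowercase:
>>>         token = token.lower()              # Love -> love
>>>     return token
>>>
>>> norm_token("3281"), norm_token("Love")  # ('0000', 'love')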
- process(data: Dict) Dict [source]
TokenNorm entry
- Parameters
data – {"data": {"train": ...}, "tokenizer": ...}
- Returns
norm data
- class dlk.data.subprocessors.token_norm.TokenNormConfig(stage, config: Dict)[source]
Bases:
dlk.utils.config.BaseConfig
Config for TokenNorm
- Config Example:
>>> {
>>>     "_name": "token_norm",
>>>     "config": {
>>>         "train": {
>>>             "data_set": { // for different stages, this processor will process different parts of the data
>>>                 "train": ["train", "valid", "test", "predict"],
>>>                 "predict": ["predict"],
>>>                 "online": ["online"]
>>>             },
>>>             "zero_digits_replaced": true,
>>>             "lowercase": true,
>>>             "extend_vocab": "", // when lowercase is true, this vocab will collect all tokens that are not in the vocab but whose lowercase form is; this is only for the token gather process
>>>             "tokenizer": "whitespace_split", // the path to the vocab (if a token is in the vocab, skip norming it); the file has one token per line
>>>             "data_pair": {
>>>                 "sentence": "norm_sentence"
>>>             },
>>>         },
>>>         "predict": "train",
>>>         "online": "train",
>>>     }
>>> }
dlk.data.subprocessors.txt_cls_loader module
dlk.data.subprocessors.txt_reg_loader module
Module contents
processors