The subprocessor config format

In subprocessors, the config is organized by processing stage (train, predict, online, etc.).

The stage config can be a dict, a str, or a tuple; each type is parsed differently.

1. When the config is a dict (the default type), it is used as written.
2. When the config is a str, the string must be a stage name (train, predict, online, etc.) whose config is already defined as a dict, as described in "1".
3. When the config is a tuple (a two-element list), the first element must be a str as defined in "2", and the second element is an update config: a dict (or None), as defined in "1", that updates the config of the referenced stage.
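
The three forms can be resolved into a plain dict along the following lines. This is a minimal sketch rather than the library's actual parser: the helper name `resolve_stage_config` is hypothetical, and it assumes that the update dict is merged shallowly (top-level keys only) and that the referenced stage is itself defined as a dict.

```python
import copy
from typing import Any, Dict


def resolve_stage_config(config: Dict[str, Any], stage: str) -> Dict[str, Any]:
    """Return the effective dict config for `stage` (hypothetical helper)."""
    entry = config[stage]

    # dict form: use the config as written
    if isinstance(entry, dict):
        return copy.deepcopy(entry)

    if isinstance(entry, str):
        # str form: reuse the config of another stage unchanged
        base_stage, update = entry, None
    elif isinstance(entry, (list, tuple)) and len(entry) == 2:
        # tuple form: (base stage name, update dict or None)
        base_stage, update = entry
    else:
        raise TypeError(f"unsupported config for stage {stage!r}: {entry!r}")

    if not isinstance(base_stage, str):
        raise TypeError(f"base stage of {stage!r} must be a str, got {base_stage!r}")

    base = config.get(base_stage)
    if not isinstance(base, dict):
        raise ValueError(
            f"stage {stage!r} refers to {base_stage!r}, which is not defined as a dict"
        )

    resolved = copy.deepcopy(base)
    resolved.update(update or {})  # assumption: shallow, top-level update
    return resolved


# A config shaped like the "load" subprocessor in the example further down.
load_config = {
    "base_dir": "",
    "predict": {
        "token_ids": "./token_ids.pkl",
        "embedding": "./embedding.pkl",
        "label_ids": "./label_ids.pkl",
    },
    "online": ["predict", {}],
}
online_config = resolve_stage_config(load_config, "online")
```

Here `online_config` ends up equal to the "predict" dict, because the update config is empty. Copying the base config before updating keeps the original "predict" entry untouched.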

Some config values are set to "@"; this means you must provide that key-value pair in your own config.
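
One way to catch placeholders that were never filled in is to walk the final config before using it. The sketch below is an illustration only: `find_missing` is a hypothetical helper, not part of the library, and it assumes a required value is marked by the exact string "@".

```python
from typing import Any, List

REQUIRED = "@"  # marker for a value the user must provide


def find_missing(config: Any, path: str = "") -> List[str]:
    """Return the paths of all values that are still set to "@"."""
    missing: List[str] = []
    if isinstance(config, dict):
        for key, value in config.items():
            child = f"{path}.{key}" if path else str(key)
            missing.extend(find_missing(value, child))
    elif isinstance(config, list):
        for index, value in enumerate(config):
            missing.extend(find_missing(value, f"{path}[{index}]"))
    elif config == REQUIRED:
        missing.append(path)
    return missing


example = {"config_path": "@", "train": {"vocab": "@", "update": None}}
unfilled = find_missing(example)
if unfilled:
    print("please provide values for:", ", ".join(unfilled))
```

The returned paths are meant for error messages only; keys that themselves contain dots (such as "data.train") are not escaped.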

Processor Config Example

{
    "processor": {
        "_name": "test_text_classification",
        "config": {
            "feed_order": ["load", "tokenizer", "token_gather", "label_to_id", "save"]
        },
        "subprocessor@load": {
            "_name": "load",
            "config":{
                "base_dir": "",
                "predict":{
                    "token_ids": "./token_ids.pkl",
                    "embedding": "./embedding.pkl",
                    "label_ids": "./label_ids.pkl"
                },
                "online": [
                    "predict", //base predict
                    {   // update config applied on top of "predict"; in this case it is empty, which means all config from "predict" is used
                    }
                ]
            }
        },
        "subprocessor@save": {
            "_name": "save",
            "config":{
                "base_dir": "",
                "train":{
                    "data.train": "./train.pkl",
                    "data.dev": "./dev.pkl",
                    "token_ids": "./token_ids.pkl",
                    "embedding": "./embedding.pkl",
                    "label_ids": "./label_ids.pkl"
                },
                "predict": {
                    "data.predict": "./predict.pkl"
                }
            }
        },
        "subprocessor@tokenizer":{
            "_base": "wordpiece_tokenizer",
            "config": {
                "train": { // when a key joins several stages with '&', whitespace around the '&' is allowed
                    "data_set": {                   // for each stage, this processor handles a different part of the data
                        "train": ["train", "dev"],
                        "predict": ["predict"],
                        "online": ["online"]
                    },
                    "config_path": "./token.json",
                    "normalizer": ["nfd", "lowercase", "strip_accents", {"some_processor_need_config": {config}}], // if not set, the default normalizer from the config is used
                    "pre_tokenizer": [{"whitespace": {}}], // if not set, the default pre_tokenizer from the config is used
                    "post_processor": "bert", // if not set, the default post_processor from the config is used. WARNING: disabling the default setting is not supported (so the default tokenizer.post_tokenizer should be null and set only in this config)
                    "filed_map": { // these are the default values; you can provide other names
                        "tokens": "tokens",
                        "ids": "ids",
                        "attention_mask": "attention_mask",
                        "type_ids": "type_ids",
                        "special_tokens_mask": "special_tokens_mask",
                        "offsets": "offsets",
                    }, // each tokenizer output (the key) is mapped to the name given as the value
                    "data_type": "single", // single or pair; if not provided, it is inferred from len(process_data)
                    "process_data": [
                        ["sentence", { "is_pretokenized": false}], 
                    ],
                    /*"data_type": "pair", // single or pair*/
                    /*"process_data": [*/
                        /*["sentence_a", { "is_pretokenized": false}], */
                        /*["sentence_b", {}], the config of the second item must be the same as the first*/
                    /*],*/
                },
                "predict": "train",
                "online": "train"
            }
        },
        "subprocessor@token_gather":{
            "_name": "token_gather",
            "config": {
                "train": { // used only in the train stage
                    "data_set": {                   // for each stage, this processor handles a different part of the data
                        "train": ["train", "dev"]
                    },
                    "gather_columns": ["label"], // list of columns; every cell must be a single token, a list of tokens, or a set of tokens
                    "deliver": "label_vocab", // name of the output Vocabulary object (here, the vocabulary of labels)
                    "update": null, // null or another Vocabulary object to update
                }
            }
        },
        "subprocessor@label_to_id":{
            "_name": "token2id",
            "config": {
                "train":{ // config for the train, predict, and online stages; use '&' to separate stage names in a key
                    "data_pair": {
                        "label": "label_id"
                    },
                    "data_set": {                   // for each stage, this processor handles a different part of the data
                        "train": ["train", "dev"],
                        "predict": ["predict"],
                        "online": ["online"]
                    },
                    "vocab": "label_vocab", // usually provided by the "token_gather" module
                },
                "predict": "train",
                "online": "train",
            }
        }
    }
}
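
To make the roles of "gather_columns", "deliver", "data_pair" and "vocab" more concrete, here is a rough sketch of what the token_gather and token2id steps do to the data. It is not the library's implementation: plain lists of dicts stand in for the DataFrames, a plain dict stands in for the Vocabulary object, the function names are hypothetical, and assigning ids in first-seen order is an assumption.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def as_tokens(cell: Any) -> List[str]:
    """Normalize a cell (single token, list of tokens, or set of tokens) to a list."""
    if isinstance(cell, str):
        return [cell]
    # Sets are sorted so that the resulting ids are deterministic.
    return sorted(cell) if isinstance(cell, (set, frozenset)) else list(cell)


def gather_tokens(rows: Iterable[Row], gather_columns: List[str]) -> Dict[str, int]:
    """Collect every token in the given columns into a token -> id mapping."""
    vocab: Dict[str, int] = {}
    for row in rows:
        for column in gather_columns:
            for token in as_tokens(row[column]):
                vocab.setdefault(token, len(vocab))
    return vocab


def tokens_to_ids(rows: Iterable[Row], data_pair: Dict[str, str], vocab: Dict[str, int]) -> None:
    """For each source -> target pair, write the ids of the source column to the target column."""
    for row in rows:
        for source, target in data_pair.items():
            cell = row[source]
            if isinstance(cell, str):
                row[target] = vocab[cell]
            else:
                row[target] = [vocab[token] for token in as_tokens(cell)]


data = {
    "train": [{"label": "pos"}, {"label": "neg"}],
    "dev": [{"label": "neg"}],
}
store: Dict[str, Any] = {}

# token_gather: "gather_columns": ["label"], "deliver": "label_vocab"
store["label_vocab"] = gather_tokens(data["train"] + data["dev"], ["label"])

# label_to_id (token2id): "data_pair": {"label": "label_id"}, "vocab": "label_vocab"
for split in ("train", "dev"):
    tokens_to_ids(data[split], {"label": "label_id"}, store["label_vocab"])
```

After this runs, each row has a "label_id" entry next to its "label", and the vocabulary is available under the name given by "deliver" for any later step that refers to it through "vocab".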

Input Data Format Example

You can provide the data as a DataFrame yourself, or use the task_name_loader (if one is provided, or you can write one) to load your dict-format data into a DataFrame.

{
    "data": {
        "train": pd.DataFrame, // may include these columns: "uuid", "origin", "label"
        "dev": pd.DataFrame, // may include these columns: "uuid", "origin", "label"
    }
}

Processed Data Format Example

{
    "data": {
        "train": pd.DataFrame, // may include these columns: "uuid", "origin", "labels", "origin_tokens", "label_ids", "origin_token_ids"
        "dev": pd.DataFrame, // may include these columns: "uuid", "origin", "labels", "origin_tokens", "label_ids", "origin_token_ids"
    },
    "embedding": ..,
    "token_vocab": ..,
    "label_vocab": ..,
    ...
}