How to Calculate KL Reduction? #13

@GenerallyCovetous

Can DSIR compute the KL reduction data metric described in the paper?
Also, what data preprocessing is needed when resampling a custom dataset? My scenario involves importance resampling of Alpaca-style data, and my current processing code is as follows:

from data_selection import HashedNgramDSIR

raw_datasets = ["/dsir/original_data/train_30k.jsonl"]
target_datasets = ["/dsir/original_data/target.jsonl"]

dsir = HashedNgramDSIR(raw_datasets, target_datasets, cache_dir='/dsir/dsir_cache')
dsir.fit_importance_estimator(num_tokens_to_fit='auto')
dsir.compute_importance_weights()
dsir.resample(out_dir='resampled', num_to_sample=10000, cache_dir='/dsir/resampled_cache')
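
For reference, here is a minimal sketch of how KL reduction could be computed by hand. It assumes the metric is the drop in KL divergence to the target when moving from the raw pool to the selected subset, measured over hashed n-gram features. It does not use the DSIR package at all: every function name, the bucket count, the whitespace tokenizer, the CRC32 hash, and the add-one smoothing are illustrative assumptions, and `alpaca_to_text` assumes the conventional Alpaca fields `instruction`, `input`, and `output`.

```python
import math
import zlib

NUM_BUCKETS = 10_000  # assumption: bucket count chosen only for illustration


def alpaca_to_text(example):
    """Join the conventional Alpaca fields into one string (hypothetical helper)."""
    parts = (
        example.get("instruction", ""),
        example.get("input", ""),
        example.get("output", ""),
    )
    return "\n".join(p for p in parts if p)


def hashed_ngram_distribution(texts, num_buckets=NUM_BUCKETS, alpha=1.0):
    """Add-alpha smoothed distribution over hashed unigram and bigram buckets."""
    counts = [0] * num_buckets
    for text in texts:
        tokens = text.lower().split()  # assumption: simple whitespace tokenizer
        bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        for gram in tokens + bigrams:
            counts[zlib.crc32(gram.encode("utf-8")) % num_buckets] += 1
    total = sum(counts) + alpha * num_buckets
    return [(c + alpha) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_reduction(target_texts, raw_texts, selected_texts):
    """KL(target || raw) minus KL(target || selected) in hashed n-gram space."""
    p_target = hashed_ngram_distribution(target_texts)
    p_raw = hashed_ngram_distribution(raw_texts)
    p_selected = hashed_ngram_distribution(selected_texts)
    return kl_divergence(p_target, p_raw) - kl_divergence(p_target, p_selected)
```

A positive value means the selected subset is closer to the target than the raw pool is. With several target datasets, the per-target values would be averaged.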
