Quickstart
This is a first, simple quickstart example.
1. Set your ETL process as config
# 1. Set your ETL process as config.
from omegaconf import OmegaConf

ETL_config = OmegaConf.create({
    # Set up Spark
    'spark': {
        'appname': 'ETL',
        'driver': {'memory': '4g'},
    },
    'etl': [
        {
            # Extract; you can use a HuggingFace dataset from the hub directly!
            'name': 'data_ingestion___huggingface___hf2raw',
            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}
        },
        {
            # Reduce dataset scale
            'name': 'utils___sampling___random',
            'args': {'sample_n_or_frac': 0.5}
        },
        {
            # Transform; deduplicate data via minhash
            'name': 'deduplication___minhash___lsh_jaccard',
            'args': {'threshold': 0.1,
                     'ngram_size': 5,
                     'subset': 'question'}
        },
        {
            # Load; save the data
            'name': 'data_load___parquet___ufl2parquet',
            'args': {'save_path': './guideline/etl/sample/quickstart.parquet'}
        }
    ]
})
2. Run the ETL pipeline
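The page gives no code for this step. The sketch below shows one way it might look, assuming the config above targets the `dataverse` package and that the package exposes an `ETLPipeline` class with a `run(config=..., verbose=...)` method returning a Spark session and the processed dataset. The import path, the method signature, and the helper name `run_quickstart` are all assumptions, not confirmed by this page.

```python
def run_quickstart(etl_config, verbose=True):
    """Run the ETL steps described by `etl_config` (hypothetical helper)."""
    # Imported lazily because it pulls in Spark.
    # ASSUMPTION: the pipeline class lives at dataverse.etl.ETLPipeline.
    from dataverse.etl import ETLPipeline

    etl_pipeline = ETLPipeline()
    # ASSUMPTION: run() accepts `config` and `verbose` and returns
    # the Spark session together with the processed dataset.
    spark, dataset = etl_pipeline.run(config=etl_config, verbose=verbose)
    return spark, dataset
```

With the config from step 1 in scope, this would be called as `spark, dataset = run_quickstart(ETL_config)`.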
3. Check the result
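A minimal way to confirm that the last step wrote something is to look at `save_path` on disk. The sketch below assumes the loader writes Parquet through Spark, which normally produces a directory of `part-*.parquet` files rather than a single file. The helper name `list_parquet_parts` is hypothetical.

```python
from pathlib import Path


def list_parquet_parts(save_path):
    """Return the Parquet files found at `save_path`, sorted by name.

    Handles both a single Parquet file and a Spark-style output directory.
    Returns an empty list if nothing exists at the path.
    """
    path = Path(save_path)
    if path.is_file():
        return [path]
    if not path.is_dir():
        return []
    # Marker and checksum files (_SUCCESS, *.crc) do not end in .parquet,
    # so the suffix check skips them.
    return sorted(
        p for p in path.iterdir() if p.is_file() and p.suffix == ".parquet"
    )


if __name__ == "__main__":
    # Path taken from the 'save_path' argument in the config above.
    for part in list_parquet_parts("./guideline/etl/sample/quickstart.parquet"):
        print(part.name, part.stat().st_size, "bytes")
```

If `pandas` and a Parquet engine such as `pyarrow` are installed, `pandas.read_parquet` can load the same path to inspect the rows.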