Quickstart
This is a first, simple quickstart example.
More detailed tutorials are available here.
1. Set your ETL process as a config
# 1. Set your ETL process as a config.
from omegaconf import OmegaConf

ETL_config = OmegaConf.create({
    # Set up Spark
    'spark': {
        'appname': 'ETL',
        'driver': {'memory': '4g'},
    },
    'etl': [
        {
            # Extract; You can use a HuggingFace dataset from the hub directly!
            'name': 'data_ingestion___huggingface___hf2raw',
            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}
        },
        {
            # Reduce the dataset scale
            'name': 'utils___sampling___random',
            'args': {'sample_n_or_frac': 0.5}
        },
        {
            # Transform; deduplicate data via MinHash
            'name': 'deduplication___minhash___lsh_jaccard',
            'args': {'threshold': 0.1,
                     'ngram_size': 5,
                     'subset': 'question'}
        },
        {
            # Load; Save the data
            'name': 'data_load___parquet___ufl2parquet',
            'args': {'save_path': './guideline/etl/sample/quickstart.parquet'}
        }
    ]
})
The code block above is an example of an ETL process in Dataverse. In Dataverse, the registered ETL functions that are available for use are referred to as blocks, and this example is composed of four of them. You can freely combine these blocks in a config to build the ETL process you need. The list of available functions and their args can be found in the API Reference. Each function's 'args' should be given as a dictionary.
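Because ETL_config is an ordinary OmegaConf object, you can also inspect or tweak individual blocks before running anything, or round-trip the config through a YAML file. A small sketch (the quickstart.yaml file name is just an illustrative choice, not something Dataverse requires):

from omegaconf import OmegaConf

# Inspect a single block: print the deduplication step as YAML.
print(OmegaConf.to_yaml(ETL_config.etl[2]))

# Override an argument before running, e.g. sample 10% instead of 50%.
ETL_config.etl[1].args.sample_n_or_frac = 0.1

# The same config can be saved to (and later loaded from) a YAML file.
OmegaConf.save(ETL_config, 'quickstart.yaml')   # illustrative path
# ETL_config = OmegaConf.load('quickstart.yaml')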
2. Run the ETL pipeline
from dataverse.etl import ETLPipeline
etl_pipeline = ETLPipeline()
spark, dataset = etl_pipeline.run(config=ETL_config, verbose=True)
ETLPipeline is an object designed to manage ETL processes. Pass the ETL_config defined in the previous step to the ETLPipeline object and call its run method; the stacked ETL blocks will then execute in the order they were stacked.
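The run call returns the Spark session together with the processed dataset, so you can do a quick sanity check right away. A minimal sketch, assuming the returned dataset exposes the usual Spark count method (both DataFrames and RDDs do):

# Quick sanity check on the in-memory result of the pipeline.
print(f'Rows after the ETL process: {dataset.count()}')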
3. Check the result
Because the example passes a save_path argument to the last block of ETL_config, the data that went through the process is saved to that path.
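To verify the result, you can read the saved Parquet data back, for example with the Spark session returned by the pipeline (any other Parquet reader would work just as well):

# Read the saved data back and take a quick look at it.
result = spark.read.parquet('./guideline/etl/sample/quickstart.parquet')
result.show(5)
print(f'Saved rows: {result.count()}')

# Stop the Spark session once you are done.
spark.stop()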