Quickstart

This is a first, simple quickstart example.

More detailed tutorials can be found here.

1. Set up your ETL process as a config


# 1. Set up your ETL process as a config.

from omegaconf import OmegaConf

ETL_config = OmegaConf.create({
    # Set up Spark
    'spark': { 
        'appname': 'ETL',
        'driver': {'memory': '4g'},
    },
    'etl': [
        { 
          # Extract; You can use a HuggingFace dataset from the Hub directly!
          'name': 'data_ingestion___huggingface___hf2raw', 
          'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}
        },
        {
          # Reduce dataset scale
          'name': 'utils___sampling___random',
          'args': {'sample_n_or_frac': 0.5}
        },
        {
          # Transform; deduplicate data via minhash
          'name': 'deduplication___minhash___lsh_jaccard', 
          'args': {'threshold': 0.1,
                   'ngram_size': 5,
                   'subset': 'question'}
        },
        {
          # Load; Save the data
          'name': 'data_load___parquet___ufl2parquet',
          'args': {'save_path': './guideline/etl/sample/quickstart.parquet'}
        }
    ]
})
Details of each block in the example ETL config:
  • data_ingestion___huggingface___hf2raw

    • Load the dataset from the Hugging Face Hub; it contains a total of 2.59k rows.

  • utils___sampling___random

    • Randomly subsample 50% of the data to decrease the dataset size, with a default seed value of 42. This reduces the dataset to about 1.29k rows.

  • deduplication___minhash___lsh_jaccard

    • Deduplicate on the question column using 5-gram MinHash with an LSH Jaccard similarity threshold of 0.1 (a conceptual sketch follows this list).

  • data_load___parquet___ufl2parquet

    • Save the processed dataset as a Parquet file to ./guideline/etl/sample/quickstart.parquet. The final dataset comprises around 1.14k rows.
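
The deduplication block itself is implemented inside Dataverse; the sketch below only illustrates the general idea of 5-gram MinHash deduplication with a Jaccard threshold, using the third-party datasketch library. The helper name, the example strings, the use of word-level n-grams, and the num_perm value are all assumptions for illustration, not Dataverse's API.

from datasketch import MinHash, MinHashLSH

def word_ngrams(text, n=5):
    # Word-level n-gram shingles (character-level shingling would work similarly).
    tokens = text.split()
    return {' '.join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}

questions = [
    'What is the boiling point of water?',
    'What is the boiling point of water ?',   # near-duplicate
    'Which gas do plants absorb from the air?',
]

lsh = MinHashLSH(threshold=0.1, num_perm=128)
kept = []
for idx, question in enumerate(questions):
    m = MinHash(num_perm=128)
    for gram in word_ngrams(question):
        m.update(gram.encode('utf8'))
    if not lsh.query(m):          # nothing similar has been kept yet
        lsh.insert(str(idx), m)
        kept.append(question)

print(kept)   # the near-duplicate second question is dropped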

The code block above is an example of an ETL process in Dataverse. In Dataverse, the registered ETL functions are referred to as blocks, and this example is composed of four blocks. You can freely combine these blocks in a config to create the ETL process you need; a minimal two-block variant is sketched below. The list of available functions and their args can be found in the API Reference. Each function's 'args' should be given as a dictionary.
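
For example, dropping the sampling and deduplication blocks leaves a minimal two-block pipeline that just ingests the dataset and writes it to Parquet. This is a sketch that reuses the block names from the example above; the variable name and the save path are illustrative.

from omegaconf import OmegaConf

ingest_only_config = OmegaConf.create({
    'spark': {'appname': 'ETL', 'driver': {'memory': '4g'}},
    'etl': [
        {
            # Extract from the Hugging Face Hub
            'name': 'data_ingestion___huggingface___hf2raw',
            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}
        },
        {
            # Load; save the data as Parquet (illustrative path)
            'name': 'data_load___parquet___ufl2parquet',
            'args': {'save_path': './guideline/etl/sample/arc_raw.parquet'}
        }
    ]
})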

2. Run the ETL pipeline

from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()
spark, dataset = etl_pipeline.run(config=ETL_config, verbose=True)

ETLPipeline is an object designed to manage ETL processes. By passing the ETL_config defined in the previous step to the ETLPipeline object's run method, the stacked ETL blocks execute in the order they were stacked, and run returns the Spark handle along with the processed dataset (see the sketch below).
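
The exact type of the returned dataset is not spelled out here; it is assumed to be a Spark RDD or DataFrame, both of which support count and take, so a quick inspection might look like this:

print(dataset.count())   # number of rows after all blocks have run
print(dataset.take(3))   # peek at a few records

# Stop Spark once you are done with the returned handle.
spark.stop()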

3. Check the result

Since the example passed a save_path argument to the last block of ETL_config, the data that went through the process is saved to that path.
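
One way to verify the output is to read the Parquet file back, for example with pandas (assuming pandas and a Parquet engine such as pyarrow are installed):

import pandas as pd

df = pd.read_parquet('./guideline/etl/sample/quickstart.parquet')
print(df.shape)   # around 1.14k rows are expected after deduplication
print(df.head())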
