
Quickstart

This is a very simple, first quickstart example.



Various and more detailed tutorials are available on the Tutorials page.

1. Set your ETL process as config


# 1. Set your ETL process as config.

from omegaconf import OmegaConf

ETL_config = OmegaConf.create({
    # Set up Spark
    'spark': {
        'appname': 'ETL',
        'driver': {'memory': '4g'},
    },
    'etl': [
        {
            # Extract; you can use a Hugging Face dataset from the Hub directly!
            'name': 'data_ingestion___huggingface___hf2raw',
            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}
        },
        {
            # Reduce dataset scale
            'name': 'utils___sampling___random',
            'args': {'sample_n_or_frac': 0.5}
        },
        {
            # Transform; deduplicate data via MinHash
            'name': 'deduplication___minhash___lsh_jaccard',
            'args': {'threshold': 0.1,
                     'ngram_size': 5,
                     'subset': 'question'}
        },
        {
            # Load; save the data
            'name': 'data_load___parquet___ufl2parquet',
            'args': {'save_path': './guideline/etl/sample/quickstart.parquet'}
        }
    ]
})
Details of the blocks in the example ETL config (a sketch for adjusting their args follows the list):
  • data_ingestion___huggingface___hf2raw

    • Load the ai2_arc ARC-Challenge dataset from Hugging Face, which contains a total of 2.59k rows.

  • utils___sampling___random

    • Randomly subsample 50% of the data, with a default seed value of 42. This reduces the dataset to about 1.29k rows.

  • deduplication___minhash___lsh_jaccard

    • Deduplicate on the question column using 5-gram MinHash with a Jaccard similarity threshold of 0.1.

  • data_load___parquet___ufl2parquet

    • Save the processed dataset as a Parquet file to ./guideline/etl/sample/quickstart.parquet. The final dataset comprises around 1.14k rows.
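Because ETL_config is an OmegaConf object, you can also tweak a block's args before running the pipeline. Below is a minimal sketch (not part of the quickstart itself) that lowers the sampling fraction to 10%, assuming the config defined above:

# A minimal sketch: adjust a block's args on the OmegaConf config defined above,
# before the pipeline is run. Index 1 refers to the utils___sampling___random block.
ETL_config.etl[1].args.sample_n_or_frac = 0.1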

2. Run the ETL pipeline

from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()
spark, dataset = etl_pipeline.run(config=ETL_config, verbose=True)

ETLPipeline is the object that manages ETL processes. Pass the ETL_config defined in the previous step into the ETLPipeline object and call its run method; the stacked ETL blocks will execute in the order they were stacked.
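
The run call returns the Spark session and the processed dataset. A minimal sketch of inspecting them (the concrete types depend on the blocks used, so treat this as an illustrative assumption rather than the library's guaranteed API):

# Illustrative check of what run() returned; the exact types are not specified here.
print(type(spark))
print(type(dataset))
# If `spark` is a SparkSession (as the Spark config above suggests),
# you can stop it when finished:
# spark.stop()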

3. Check the result

Because the example passed a save_path argument to the last block of ETL_config, the data that has passed through the process is saved to that path.
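
One way to check the saved output is to load the Parquet file directly, for example with pandas (a minimal sketch; it assumes pandas and a Parquet engine such as pyarrow are installed):

# Load the Parquet file written by the last block and take a quick look.
import pandas as pd

result = pd.read_parquet('./guideline/etl/sample/quickstart.parquet')
print(len(result))    # roughly 1.14k rows after sampling and deduplication
print(result.head())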

The code blocks above are an example of an ETL process in Dataverse. In Dataverse, the available registered ETL functions are referred to as blocks, and this example is composed of four of them. You can freely combine these blocks in the config to create the ETL processes you need. The list of available functions and their args can be found in the API Reference. Each function's args should be given in dictionary format.
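
As an illustration of combining blocks, here is a minimal sketch of a two-block config that only ingests the dataset and writes it back out. It reuses the block names and the OmegaConf import from the example above; the save path is purely illustrative:

# A minimal two-block config sketch: ingest from Hugging Face, then save as Parquet.
minimal_config = OmegaConf.create({
    'spark': {'appname': 'ETL', 'driver': {'memory': '4g'}},
    'etl': [
        {'name': 'data_ingestion___huggingface___hf2raw',
         'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}},
        {'name': 'data_load___parquet___ufl2parquet',
         'args': {'save_path': './guideline/etl/sample/minimal.parquet'}},
    ]
})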
