Installation

A step-by-step guide to setting up Dataverse for you


Last updated 1 year ago

Dataverse can be installed using pip:

pip install dataverse

Dataverse requires Python, Java, and Spark. The sections below give guidelines for installing the JDK and Apache Spark.

Prerequisites

  • Python (version 3.10 or 3.11)

  • JDK (version 11) & PySpark
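To confirm that the active interpreter meets the Python requirement above, a small shell check along these lines may help. This is a minimal sketch, not part of Dataverse: the function name python_version_supported is hypothetical, and it assumes the interpreter is reachable as python3.

```shell
# Hypothetical helper: succeeds only for the Python versions listed above (3.10 or 3.11).
python_version_supported() {
  case "$1" in
    3.10|3.10.*|3.11|3.11.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Ask the interpreter for its own version; fall back to "not found" if python3 is missing.
current="$(python3 -c 'import platform; print(platform.python_version())' 2>/dev/null || echo "not found")"

if python_version_supported "$current"; then
  echo "Python $current is in the supported range"
else
  echo "Python $current is outside the supported range (3.10 or 3.11)"
fi
```

If your interpreter has a different name (for example python), adjust the command inside the substitution accordingly.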

Linux (Debian/Ubuntu)

1. Install JDK

1-1. Install JDK

sudo apt-get update
sudo apt-get install openjdk-11-jdk

1-2. Set Java environment variable

echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> ~/.bashrc 
source ~/.bashrc
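The JAVA_HOME path above contains an architecture suffix (amd64), so it may differ on other machines. One hedged alternative is to derive the directory from the java launcher itself. The sketch below assumes GNU readlink (readlink -f) and that the launcher resolves to a file at <jdk>/bin/java; java_home_from is a hypothetical helper name.

```shell
# Hypothetical helper: resolve a java launcher path through its symlinks and
# print the JDK directory two levels up (<jdk>/bin/java -> <jdk>).
# Assumes `readlink -f` is available (GNU coreutils).
java_home_from() {
  real="$(readlink -f "$1")" || return 1
  dirname "$(dirname "$real")"
}

# Only print a suggestion when a java launcher is actually on PATH.
if command -v java >/dev/null 2>&1; then
  echo "Suggested JAVA_HOME: $(java_home_from "$(command -v java)")"
else
  echo "java not found on PATH; install the JDK first"
fi
```

The printed directory can then be used in place of the hard-coded path in the export line.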

2. Install PySpark

2-1. Install PySpark

pip install pyspark

2-2. Set PySpark environment variables

echo "export SPARK_HOME=$(pip show pyspark | grep Location | awk '{print $2 "/pyspark"}')" >> ~/.bashrc
echo "export PYSPARK_PYTHON=python3" >> ~/.bashrc
source ~/.bashrc
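Running the echo commands above more than once appends duplicate export lines to ~/.bashrc. One way to avoid that, sketched here with a hypothetical append_once helper, is to append a line only when the file does not already contain it. The demonstration writes to a scratch file rather than the real ~/.bashrc.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present (-x: whole-line match, -F: fixed string).
append_once() {
  line="$1"
  file="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstration against a scratch file instead of ~/.bashrc.
rc="$(mktemp)"
append_once 'export PYSPARK_PYTHON=python3' "$rc"
append_once 'export PYSPARK_PYTHON=python3' "$rc"   # second call changes nothing
cat "$rc"
```

To apply it for real, pass "$HOME/.bashrc" as the second argument.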

macOS

1. Install JDK (Java Development Kit)

You can download OpenJDK 11 via Oracle. Download the file that matches your system, then follow the guide to install it. Note that the download requires an Oracle account.

Alternatively, you can install it with Homebrew by running brew install openjdk@11

2. Install Apache Spark

Method A. Manual Installation

Download a pre-built version of Apache Hadoop and Spark 3 from Apache Spark. Then extract the downloaded archive to your home directory using the following command:

tar -zxvf {YOUR-DOWNLOADED-SPARK-FILE} -C ~

Method B. Via HomeBrew

You can install it with Homebrew by running brew install apache-spark

3. Set Environment Variables

3-1. Set JAVA_HOME

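cd {YOUR-JAVA-DIRECTORY}   # confirm that the JDK directory exists
vi ~/.bash_profile
# Add the next two lines to ~/.bash_profile, then save and quit:
export JAVA_HOME={YOUR-JAVA-DIRECTORY}
export PATH=$PATH:$JAVA_HOME/bin
# Reload the profile in the current shell:
source ~/.bash_profile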

After you save ~/.bash_profile, you can check that Java is set up properly with java -version. Make sure the JDK path you export matches the actual location on your system.
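The java -version check can also be scripted. The following is a minimal sketch under stated assumptions: java_major is a hypothetical helper, and it assumes the first line of the output contains the usual version "…" token. It handles both the modern numbering ("11.0.x") and the legacy one ("1.8.0_x", meaning Java 8).

```shell
# Hypothetical helper: extract the major Java version from the first line
# printed by `java -version`. Fails if no quoted version string is found.
java_major() {
  ver="$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')"
  case "$ver" in
    "")  return 1 ;;
    1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;
    *)   printf '%s\n' "$ver" | cut -d. -f1 ;;
  esac
}

# `java -version` writes to stderr, so redirect it before reading the first line.
if command -v java >/dev/null 2>&1; then
  major="$(java_major "$(java -version 2>&1 | head -n 1)" || echo unknown)"
  if [ "$major" = "11" ]; then
    echo "JDK 11 detected"
  else
    echo "Detected Java major version '$major', but this guide expects 11"
  fi
else
  echo "java not found on PATH"
fi
```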

3-2. Set SPARK_HOME and PYSPARK_PYTHON

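vi ~/.bash_profile
# Add the next three lines to ~/.bash_profile, then save and quit:
export SPARK_HOME={YOUR-SPARK-DIRECTORY}
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON={YOUR-PYTHON-DIRECTORY}   # path to the Python executable PySpark should use
# Reload the profile in the current shell:
source ~/.bash_profile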
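Once the profile has been reloaded, it can be useful to confirm that all three variables are actually exported in the current shell. The sketch below uses a hypothetical missing_vars helper built on printenv; it only checks that the variables are set, not that their values point to valid installations.

```shell
# Hypothetical helper: print the name of every listed variable that is not
# exported in the current environment; succeeds only when all are set.
missing_vars() {
  status=0
  for name in "$@"; do
    if ! printenv "$name" >/dev/null 2>&1; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}

if missing="$(missing_vars JAVA_HOME SPARK_HOME PYSPARK_PYTHON)"; then
  echo "All three variables are set"
else
  # Left unquoted on purpose so the names print on one line.
  echo "Not set in this shell:" $missing
fi
```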

Windows

1. Install JDK

1-1. Download and install a JDK (Java Development Kit)

Install the JDK into a path that contains no spaces.

2. Install Apache Spark

2-1. Download Apache Hadoop and Spark

2-2. Extract and move Spark

After extracting the Spark archive, copy its contents into a suitable directory, for example C:\Spark. The extracted contents must sit directly under that folder.

2-3. Install winutils.exe in C:\Hadoop

3. Set Environment Variables

You must set new environment variables to use Java and Spark on Windows. Open Environment Variables from the Windows menu.

Add the following System variables.

  • JAVA_HOME : {YOUR-JAVA-DIRECTORY} e.g. C:\JDK

    • and add %JAVA_HOME%\bin to the Path variable.

  • SPARK_HOME : {YOUR-SPARK-DIRECTORY} e.g. C:\Spark\spark-3.5.0-bin-hadoop3

    • and add %SPARK_HOME%\bin to the Path variable.

  • HADOOP_HOME : {YOUR-HADOOP-DIRECTORY} e.g. C:\Hadoop

    • and add %HADOOP_HOME%\bin to the Path variable.

  • PYSPARK_PYTHON : {YOUR-PYTHON-PATH} e.g. C:\anaconda3\python.exe

Note that the directories above are examples. Use the actual directory where each component is located on your system.

For step 1-1, you can download OpenJDK 11 for Windows from Oracle. After the download, unzip the file and move the jdk folder to a suitable directory, for example C:\JDK. The download requires an Oracle account.

For step 2-1, download a pre-built version of Apache Hadoop and Spark 3 from Apache Spark. Since this is a .tgz file, download and install WinRAR if necessary to extract it.

For step 2-3, download the appropriate winutils.exe and move the bin folder into C:\Hadoop.
