# Installation

Dataverse can be installed using `pip`:

```bash
pip install dataverse
```

To use Dataverse, you need three prerequisites: Python, Apache Spark, and a JDK. The guidelines below cover installing Apache Spark and the JDK.

## Prerequisites

* **Python** (version 3.10 or 3.11)
* **JDK** (version 11) & **PySpark**
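A quick way to check whether these tools are already available, as a POSIX-shell sketch (`python3` and `java` are the usual command names, but yours may differ):

```shell
# Report which prerequisite commands are already on the PATH.
for cmd in python3 java; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found ($("$cmd" --version 2>&1 | head -n 1))"
  else
    echo "$cmd: missing"
  fi
done
```

If either command reports `missing`, follow the matching installation steps for your platform below.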

{% tabs %}
{% tab title="Linux" %}

### 1. Install JDK

#### 1-1. Install JDK

```bash
sudo apt-get update
sudo apt-get install openjdk-11-jdk
```

#### 1-2. Set Java environment variable

```bash
echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> ~/.bashrc 
```

```bash
source ~/.bashrc
```
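Note that the `echo ... >> ~/.bashrc` pattern appends the export line every time it runs, so repeated runs leave duplicate lines behind. A sketch of an idempotent variant, demonstrated on a temporary file rather than your real `~/.bashrc`:

```shell
# Append the JAVA_HOME export only if the exact line is not already present.
rcfile=$(mktemp)   # stand-in for ~/.bashrc in this demo
line='export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64'

grep -qxF "$line" "$rcfile" || echo "$line" >> "$rcfile"
grep -qxF "$line" "$rcfile" || echo "$line" >> "$rcfile"   # second run adds nothing

grep -c 'JAVA_HOME' "$rcfile"   # prints 1, not 2
rm "$rcfile"
```

`grep -qxF` matches the whole line as a fixed string, so the append only happens once no matter how many times the snippet runs.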

### 2. Install PySpark

#### 2-1. Install PySpark

```bash
pip install pyspark
```

#### 2-2. Set PySpark environment variables

```bash
echo "export SPARK_HOME=$(pip show pyspark | grep Location | awk '{print $2 "/pyspark"}')" >> ~/.bashrc
echo "export PYSPARK_PYTHON=python3" >> ~/.bashrc
```

```bash
source ~/.bashrc
```
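The `grep`/`awk` pipeline above builds `SPARK_HOME` from the `Location:` line of `pip show pyspark`. A sketch of how it works, run here against simulated `pip show` output (the path shown is a made-up example; the real `Location` depends on your environment):

```shell
# Simulated `pip show pyspark` output for illustration only.
pip_show_output='Name: pyspark
Version: 3.5.0
Location: /home/user/.venv/lib/python3.10/site-packages'

# Same pipeline as above: take the 2nd field of the Location line
# and append "/pyspark", the package directory inside site-packages.
spark_home=$(printf '%s\n' "$pip_show_output" | grep Location | awk '{print $2 "/pyspark"}')
echo "$spark_home"   # → /home/user/.venv/lib/python3.10/site-packages/pyspark
```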

{% endtab %}

{% tab title="macOS" %}

### 1. Install JDK (Java Development Kit)

You can download JDK 11 from [Oracle](https://www.oracle.com/java/technologies/javase/jdk11-archive-downloads.html). Download the proper file for your system and follow the installer's guide. Note that downloading from Oracle requires an Oracle account.

Alternatively, you can install it via Homebrew by running `brew install openjdk@11`

### 2. Install Apache Spark

#### Method A. Manual Installation

Download a pre-built Spark 3 package (pre-built for Apache Hadoop) from [Apache Spark](https://spark.apache.org/downloads.html). Then extract the downloaded archive into your home directory using the following command:

```bash
tar -zxvf {YOUR-DOWNLOADED-SPARK-FILE}
```

#### Method B. Via HomeBrew

You can also install it via Homebrew by running `brew install apache-spark`

### 3. Set Environment Variables

#### 3-1. Set `JAVA_HOME`

```bash
cd {YOUR-JAVA-DIRECTORY}
vi ~/.bash_profile
```

```bash
export JAVA_HOME={YOUR-JAVA-DIRECTORY}
export PATH=$PATH:$JAVA_HOME/bin
```

```bash
source ~/.bash_profile
```

After you save `~/.bash_profile`, you can check that the variable is set properly with `java -version`. Also, make sure the JDK path in `JAVA_HOME` matches the actual location on your machine.

#### 3-2. Set `SPARK_HOME` and `PYSPARK_PYTHON`

```bash
vi ~/.bash_profile
```

```bash
export SPARK_HOME={YOUR-SPARK-DIRECTORY}
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON={YOUR-PYTHON-DIRECTORY}
```

```bash
source ~/.bash_profile
```

{% endtab %}

{% tab title="Windows" %}

### 1. Install JDK

#### 1-1. Download and install a JDK (Java Development Kit)

You can download JDK 11 for Windows from [Oracle](https://www.oracle.com/java/technologies/javase/jdk11-archive-downloads.html). After the download, unzip the file and move the `jdk` folder to a proper directory, for example `C:\JDK`. Note that downloading from Oracle requires an Oracle account.

Also, you must install the JDK into a path with no spaces.

### 2. Install Apache Spark

#### 2-1. Download Apache Hadoop and Spark

Download a pre-built Spark 3 package (pre-built for Apache Hadoop) from [Apache Spark](https://spark.apache.org/downloads.html). Since it is a `.tgz` file, download and install [WinRAR](https://www.rarlab.com/download.htm) if you need a tool to extract it.

#### 2-2. Extract and move Spark

After extracting the Spark archive, copy its contents into a proper directory, for example `C:\Spark`. The extracted contents must sit directly under that folder.

#### 2-3. Install `winutils.exe` in `C:\Hadoop`

Download the `winutils.exe` matching your Hadoop version from [here](https://github.com/cdarlint/winutils) and move its `bin` folder into `C:\Hadoop`.

### 3. Set Environment Variables

You must set new environment variables to use Java and Spark on Windows. Open Environment Variables via the Windows menu.

#### Add the following System variables.

* <mark style="color:red;">`JAVA_HOME`</mark> : `{YOUR-JAVA-DIRECTORY}` *ex. C:\JDK*
  * and add `%JAVA_HOME%\bin` into <mark style="color:red;">`Path`</mark> variable.
* <mark style="color:red;">`SPARK_HOME`</mark> : `{YOUR-SPARK-DIRECTORY}` *ex. C:\Spark\spark-3.5.0-bin-hadoop3*
  * and add `%SPARK_HOME%\bin` into <mark style="color:red;">`Path`</mark> variable.
* <mark style="color:red;">`HADOOP_HOME`</mark> : `{YOUR-HADOOP-DIRECTORY}` *ex. C:\Hadoop*
  * and add `%HADOOP_HOME%\bin` into <mark style="color:red;">`Path`</mark> variable.
* <mark style="color:red;">`PYSPARK_PYTHON`</mark> : `{YOUR-PYTHON-PATH}` *ex. C:\anaconda3\python.exe*

Note that the directories above are examples. Use the actual directory where each component is located on your machine.
{% endtab %}
{% endtabs %}
