# Installation

Dataverse can be installed using `pip`:

```bash
pip install dataverse
```

To use Dataverse, you need three prerequisites: Python, Apache Spark, and a JDK. The guidelines below cover installing Apache Spark and the JDK.

## Prerequisites

* **Python** (version 3.10 or 3.11)
* **JDK** (version 11) & **PySpark**
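A quick way to check whether these tools are already available, as a POSIX-shell sketch (`python3` and `java` are the usual command names, but yours may differ):

```shell
# Report which prerequisite commands are already on the PATH.
for cmd in python3 java; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found ($("$cmd" --version 2>&1 | head -n 1))"
  else
    echo "$cmd: missing"
  fi
done
```

If either command reports `missing`, follow the matching installation steps for your platform below.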

{% tabs %}
{% tab title="Linux" %}

### 1. Install JDK

#### 1-1. Install JDK

```bash
sudo apt-get update
sudo apt-get install openjdk-11-jdk
```

#### 1-2. Set Java environment variable

```bash
echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> ~/.bashrc 
```

```bash
source ~/.bashrc
```
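Note that the `echo ... >> ~/.bashrc` pattern appends the export line every time it runs, so repeated runs leave duplicate lines behind. A sketch of an idempotent variant, demonstrated on a temporary file rather than your real `~/.bashrc`:

```shell
# Append the JAVA_HOME export only if the exact line is not already present.
rcfile=$(mktemp)   # stand-in for ~/.bashrc in this demo
line='export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64'

grep -qxF "$line" "$rcfile" || echo "$line" >> "$rcfile"
grep -qxF "$line" "$rcfile" || echo "$line" >> "$rcfile"   # second run adds nothing

grep -c 'JAVA_HOME' "$rcfile"   # prints 1, not 2
rm "$rcfile"
```

`grep -qxF` matches the whole line as a fixed string, so the append only happens once no matter how many times the snippet runs.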

### 2. Install PySpark

#### 2-1. Install PySpark

```bash
pip install pyspark
```

#### 2-2. Set PySpark environment variables

```bash
echo "export SPARK_HOME=$(pip show pyspark | grep Location | awk '{print $2 "/pyspark"}')" >> ~/.bashrc
echo "export PYSPARK_PYTHON=python3" >> ~/.bashrc
```

```bash
source ~/.bashrc
```
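The `grep`/`awk` pipeline above builds `SPARK_HOME` from the `Location:` line of `pip show pyspark`. A sketch of how it works, run here against simulated `pip show` output (the path shown is a made-up example; the real `Location` depends on your environment):

```shell
# Simulated `pip show pyspark` output for illustration only.
pip_show_output='Name: pyspark
Version: 3.5.0
Location: /home/user/.venv/lib/python3.10/site-packages'

# Same pipeline as above: take the 2nd field of the Location line
# and append "/pyspark", the package directory inside site-packages.
spark_home=$(printf '%s\n' "$pip_show_output" | grep Location | awk '{print $2 "/pyspark"}')
echo "$spark_home"   # → /home/user/.venv/lib/python3.10/site-packages/pyspark
```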

{% endtab %}

{% tab title="macOS" %}

### 1. Install JDK (Java Development Kit)

You can download JDK 11 from [Oracle](https://www.oracle.com/java/technologies/javase/jdk11-archive-downloads.html). Download the proper file for your system and follow the installer's guide. Note that downloading from Oracle requires an Oracle account.

Alternatively, you can install it via Homebrew by running `brew install openjdk@11`

### 2. Install Apache Spark

#### Method A. Manual Installation

Download a pre-built Spark 3 package (pre-built for Apache Hadoop) from [Apache Spark](https://spark.apache.org/downloads.html). Then extract the downloaded archive into your home directory using the following command:

```bash
tar -zxvf {YOUR-DOWNLOADED-SPARK-FILE}
```

#### Method B. Via HomeBrew

You can also install it via Homebrew by running `brew install apache-spark`

### 3. Set Environment Variables

#### 3-1. Set `JAVA_HOME`

```bash
cd {YOUR-JAVA-DIRECTORY}
vi ~/.bash_profile
```

```bash
export JAVA_HOME={YOUR-JAVA-DIRECTORY}
export PATH=$PATH:$JAVA_HOME/bin
```

```bash
source ~/.bash_profile
```

After you save `~/.bash_profile`, you can check that the variable is set properly with `java -version`. Also, make sure the JDK path in `JAVA_HOME` matches the actual location on your machine.

#### 3-2. Set `SPARK_HOME` and `PYSPARK_PYTHON`

```bash
vi ~/.bash_profile
```

```bash
export SPARK_HOME={YOUR-SPARK-DIRECTORY}
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON={YOUR-PYTHON-DIRECTORY}
```

```bash
source ~/.bash_profile
```

{% endtab %}

{% tab title="Windows" %}

### 1. Install JDK

#### 1-1. Download and install a JDK (Java Development Kit)

You can download JDK 11 for Windows from [Oracle](https://www.oracle.com/java/technologies/javase/jdk11-archive-downloads.html). After the download, unzip the file and move the `jdk` folder to a proper directory, for example `C:\JDK`. Note that downloading from Oracle requires an Oracle account.

Also, you must install the JDK into a path with no spaces.

### 2. Install Apache Spark

#### 2-1. Download Apache Hadoop and Spark

Download a pre-built Spark 3 package (pre-built for Apache Hadoop) from [Apache Spark](https://spark.apache.org/downloads.html). Since it is a `.tgz` file, download and install [WinRAR](https://www.rarlab.com/download.htm) if you need a tool to extract it.

#### 2-2. Extract and move Spark

After extracting the Spark archive, copy its contents into a proper directory, for example `C:\Spark`. The extracted contents must sit directly under that folder.

#### 2-3. Install `winutils.exe` in `C:\Hadoop`

Download the `winutils.exe` matching your Hadoop version from [here](https://github.com/cdarlint/winutils) and move its `bin` folder into `C:\Hadoop`.

### 3. Set Environment Variables

You must set new environment variables to use Java and Spark on Windows. Open Environment Variables via the Windows menu.

#### Add the following System variables.

* <mark style="color:red;">`JAVA_HOME`</mark> : `{YOUR-JAVA-DIRECTORY}` *ex. C:\JDK*
  * and add `%JAVA_HOME%\bin` into <mark style="color:red;">`Path`</mark> variable.
* <mark style="color:red;">`SPARK_HOME`</mark> : `{YOUR-SPARK-DIRECTORY}` *ex. C:\Spark\spark-3.5.0-bin-hadoop3*
  * and add `%SPARK_HOME%\bin` into <mark style="color:red;">`Path`</mark> variable.
* <mark style="color:red;">`HADOOP_HOME`</mark> : `{YOUR-HADOOP-DIRECTORY}` *ex. C:\Hadoop*
  * and add `%HADOOP_HOME%\bin` into <mark style="color:red;">`Path`</mark> variable.
* <mark style="color:red;">`PYSPARK_PYTHON`</mark> : `{YOUR-PYTHON-PATH}` *ex. C:\anaconda3\python.exe*

Note that the directories above are examples. Use the actual directory where each component is located on your machine.
{% endtab %}
{% endtabs %}
