Installation
A step-by-step guide to setting up Dataverse
Dataverse can be installed using pip:
```
pip install dataverse
```

In order to use Dataverse, there are a few prerequisites you need to have: Python, Spark, and Java. Right below, you can find guidelines for installing Apache Spark and the JDK.
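To confirm the install, you can query the package version from Python. This is a minimal sketch; it assumes the distribution name matches the pip command above:

```python
# Minimal install check: look up the installed distribution's version
# (assumes the PyPI distribution name "dataverse", as used above).
from importlib.metadata import version

print(version("dataverse"))
```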
Prerequisites
Python (version 3.10 or 3.11)
JDK (version 11) & PySpark
You can verify the Python and Java prerequisites with the quick check shown below.
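A quick way to verify both prerequisites from Python (a sketch, not part of the official setup):

```python
import shutil
import sys

# Dataverse targets Python 3.10-3.11.
ok = (3, 10) <= sys.version_info[:2] <= (3, 11)
print("Python", sys.version.split()[0], "- supported" if ok else "- NOT in the supported range")

# PySpark needs a JDK; check that a `java` executable is on PATH.
print("java:", shutil.which("java") or "not found - install JDK 11 first")
```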
Ubuntu

1. Install JDK

1-1. Install JDK

```
sudo apt-get update
sudo apt-get install openjdk-11-jdk
```

1-2. Set the Java environment variable

```
echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> ~/.bashrc
source ~/.bashrc
```
2. Install PySpark

2-1. Install PySpark

```
pip install pyspark
```

2-2. Set PySpark environment variables

```
echo "export SPARK_HOME=$(pip show pyspark | grep Location | awk '{print $2 "/pyspark"}')" >> ~/.bashrc
echo "export PYSPARK_PYTHON=python3" >> ~/.bashrc
source ~/.bashrc
```
macOS

1. Install JDK (Java Development Kit)

You can download OpenJDK 11 from Oracle. Please download the proper file for your system and follow the guide to install it. Note that downloading OpenJDK 11 from Oracle requires an Oracle account.

Or you can simply install it via Homebrew by running:

```
brew install openjdk@11
```
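If you're unsure where the JDK landed, macOS ships a helper, /usr/libexec/java_home, that resolves the install path of a registered JDK; you'll need that path for step 3-1. A sketch (note that a keg-only Homebrew JDK may need to be symlinked into /Library/Java/JavaVirtualMachines before this finds it):

```python
import subprocess

# /usr/libexec/java_home resolves the install path of a matching JDK on macOS.
path = subprocess.check_output(["/usr/libexec/java_home", "-v", "11"], text=True).strip()
print(path)  # candidate {YOUR-JAVA-DIRECTORY} for step 3-1
```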
2. Install Apache Spark

Method A. Manual Installation

Download a pre-built version of Spark 3 for Apache Hadoop from the Apache Spark download page. After that, extract the downloaded archive into your home directory using the following command:

```
tar -zxvf {YOUR-DOWNLOADED-SPARK-FILE}
```

Method B. Via Homebrew

You can simply install it via Homebrew by running:

```
brew install apache-spark
```
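If you used Homebrew, the sketch below prints the directory to use as SPARK_HOME in step 3-2. This relies on an assumption about Homebrew's usual layout, where the actual Spark distribution sits under the formula's libexec subdirectory:

```python
import subprocess

# `brew --prefix apache-spark` prints the formula's install prefix; Homebrew
# typically keeps the Spark distribution in its libexec subdirectory.
prefix = subprocess.check_output(["brew", "--prefix", "apache-spark"], text=True).strip()
print(prefix + "/libexec")  # candidate {YOUR-SPARK-DIRECTORY} for step 3-2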
3. Set Environment Variables

3-1. Set JAVA_HOME

```
cd {YOUR-JAVA-DIRECTORY}
vi ~/.bash_profile
```

Add the following lines to ~/.bash_profile, then reload it:

```
export JAVA_HOME={YOUR-JAVA-DIRECTORY}
export PATH=$PATH:$JAVA_HOME/bin
```

```
source ~/.bash_profile
```

After you save ~/.bash_profile, you can check that the variable is set properly with java -version. Note that the JDK path here is an example; it should match the path on your machine.
3-2. Set SPARK_HOME and PYSPARK_PYTHON

```
vi ~/.bash_profile
```

Add the following lines to ~/.bash_profile, then reload it:

```
export SPARK_HOME={YOUR-SPARK-DIRECTORY}
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON={YOUR-PYTHON-DIRECTORY}
```

```
source ~/.bash_profile
```
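After reloading ~/.bash_profile (or opening a fresh terminal), a small sketch to echo the variables you just set:

```python
import os

# Each variable from steps 3-1 and 3-2 should resolve to a real path.
for var in ("JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON"):
    print(f"{var} = {os.environ.get(var, '(not set)')}")
```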
Windows

1. Install JDK

1-1. Download and install a JDK (Java Development Kit)

You can download OpenJDK 11 for Windows from Oracle. After the download, unzip the file and put the JDK folder in a proper directory, for example C:\JDK. Note that downloading OpenJDK 11 requires an Oracle account.

Also, you must install the JDK into a path with no spaces.
2. Install Apache Spark
2-1. Download Apache Hadoop and Spark
Download a pre-built version of Spark 3 for Apache Hadoop from the Apache Spark download page. Since it ships as a .tgz archive, download and install WinRAR (or another extractor) if necessary.
2-2. Extract and move Spark
After extracting the Spark archive, copy its contents into a proper directory, for example C:\Spark. The extracted contents must sit directly under that folder, not inside an extra nested directory.
2-3. Install winutils.exe in C:\Hadoop
Download the proper winutils.exe from here and move its bin folder into C:\Hadoop.
3. Set Environment Variables
You must set new environment variables to use Java and Spark on Windows. Open Environment Variables via the Windows menu.
Add the System variables below.
JAVA_HOME: {YOUR-JAVA-DIRECTORY} (e.g. C:\JDK), and add %JAVA_HOME%\bin to the Path variable.
SPARK_HOME: {YOUR-SPARK-DIRECTORY} (e.g. C:\Spark\spark-3.5.0-bin-hadoop3), and add %SPARK_HOME%\bin to the Path variable.
HADOOP_HOME: {YOUR-HADOOP-DIRECTORY} (e.g. C:\Hadoop), and add %HADOOP_HOME%\bin to the Path variable.
PYSPARK_PYTHON: {YOUR-PYTHON-PATH} (e.g. C:\anaconda3\python.exe)
Note that the directories above are examples. Use the actual directory where each component is installed.
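To double-check the setup, a small sketch that echoes the variables and confirms winutils.exe is in place. Run it in a new terminal so the updated System variables are picked up:

```python
import os

# Each System variable from step 3 should resolve to a real path.
for var in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME", "PYSPARK_PYTHON"):
    print(f"{var} = {os.environ.get(var, '(not set)')}")

# winutils.exe must sit under %HADOOP_HOME%\bin (step 2-3).
hadoop = os.environ.get("HADOOP_HOME", "")
print("winutils:", os.path.exists(os.path.join(hadoop, "bin", "winutils.exe")))
```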