# AWS Setup Guide

## Prerequisites

`SPARK_HOME` is required for the following steps, so please make sure it is set before proceeding. You can find a guide to setting up `SPARK_HOME` on this [page](https://data-verse.gitbook.io/docs/lets-start/installation).
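
A quick way to confirm that `SPARK_HOME` is visible to your Python environment (a minimal sketch):

```python
import os

# should print the path to your Spark installation;
# if it prints None, revisit the installation guide linked above
print(os.environ.get("SPARK_HOME"))
```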

## 1. Check `hadoop-aws` & `aws-java-sdk` versions

### `hadoop-aws`

The `hadoop-aws` version must match your **Hadoop** version, which you can check by running the command below. At the time of writing, the Hadoop version was `3.3.4`, so the examples use `3.3.4`.

```python
>>> from dataverse.utils.setting import SystemSetting
>>> SystemSetting().get('HADOOP_VERSION')
3.3.4
```
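
If you prefer to ask Spark itself instead of Dataverse's `SystemSetting` helper, the sketch below queries the Hadoop build bundled with your running Spark. Note it relies on PySpark's private `_jvm` gateway, so treat it as an unofficial check:

```python
from pyspark.sql import SparkSession

# start (or reuse) a Spark session and ask the JVM for its Hadoop version
spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
```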

### `aws-java-sdk`

The version must be compatible with your **hadoop-aws** version. Check the **Compile Dependencies** section of the Maven page [Apache Hadoop Amazon Web Services Support » 3.3.4](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.3.4) (e.g. hadoop-aws 3.3.4 is compatible with aws-java-sdk-bundle 1.12.592).
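
Before moving on to the download step, you can confirm that your chosen `aws-java-sdk-bundle` version is actually published on Maven Central. This is a minimal sketch, assuming network access:

```python
import urllib.request

sdk_version = "1.12.592"
url = (
    "https://repo1.maven.org/maven2/com/amazonaws/"
    f"aws-java-sdk-bundle/{sdk_version}/aws-java-sdk-bundle-{sdk_version}.jar"
)

# HEAD request: status 200 means the version exists;
# a missing version raises urllib.error.HTTPError (404)
request = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(request) as response:
    print(response.status)
```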

## 2. Download `hadoop-aws` & `aws-java-sdk` jars

Download the matching versions of the `hadoop-aws` and `aws-java-sdk` jar files into the `$SPARK_HOME/jars` directory.

### Option A. Manual Setup

```bash
hadoop_aws_jar_url="https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar"
aws_java_sdk_jar_url="https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.592/aws-java-sdk-bundle-1.12.592.jar"
wget -P "$SPARK_HOME/jars" "$hadoop_aws_jar_url"
wget -P "$SPARK_HOME/jars" "$aws_java_sdk_jar_url"
```
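
If you would rather stay in Python, the sketch below does the same download as the `wget` commands above (it assumes `SPARK_HOME` is set in your environment and will raise `KeyError` otherwise):

```python
import os
import urllib.request

jar_urls = [
    "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar",
    "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.592/aws-java-sdk-bundle-1.12.592.jar",
]
jars_dir = os.path.join(os.environ["SPARK_HOME"], "jars")

for url in jar_urls:
    dest = os.path.join(jars_dir, url.rsplit("/", 1)[-1])
    if not os.path.exists(dest):  # skip jars that are already in place
        urllib.request.urlretrieve(url, dest)
```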

### Option B. Use Makefile of Dataverse

```bash
make aws_s3
```

The Makefile can be found in the Dataverse repository \[[Link](https://github.com/UpstageAI/dataverse/blob/main/Makefile)].

## 3. Set AWS Credentials

Environment variables for AWS credentials are not currently supported (this will be added in the future). Please use the `aws configure` command to set your AWS credentials; this writes the `~/.aws/credentials` file, which `boto3` can read.

```bash
aws configure
```

* `aws_access_key_id`
* `aws_secret_access_key`
* `region`
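
After configuring, you can sanity-check that `boto3` picks the credentials up. A minimal sketch, assuming `boto3` is installed:

```python
import boto3

session = boto3.Session()
credentials = session.get_credentials()

if credentials is None:
    print("no credentials found -- run `aws configure` first")
else:
    print("access key loaded:", credentials.access_key[:4] + "...")
print("region:", session.region_name)
```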

### If you have a session token

If you are using temporary security credentials, you also have to set the `session token`.

```bash
aws configure set aws_session_token <your_session_token>
```
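
To verify the whole credential chain (including the session token), you can call AWS STS, which fails loudly if anything is wrong. A minimal sketch, assuming `boto3` is installed and you have network access:

```python
import boto3

# raises a ClientError if the credentials (or session token) are invalid
sts = boto3.client("sts")
print(sts.get_caller_identity()["Arn"])
```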

## 🌠 Dataverse is now ready to use AWS S3!

> You are now ready to use `Dataverse` with AWS! Every other detail will be handled by `Dataverse`!

```python
s3a_src_url = "s3a://your-awesome-bucket/your-awesome-data-old.parquet"
s3a_dst_url = "s3a://your-awesome-bucket/your-awesome-data-new.parquet"

# read from S3, filter, and write the result back to S3
data = spark.read.parquet(s3a_src_url)
data = data.filter(data["awesome"] == True)
data.write.parquet(s3a_dst_url)
```
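
The `s3a://` scheme in the paths above is what routes these reads and writes through the `hadoop-aws` connector you installed in step 2.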
