# FAQs

### About the project

<details>

<summary>What is Dataverse?</summary>

Dataverse is a freely accessible, open-source project that supports your **ETL pipeline with Python**. We offer a simple, standardized, and user-friendly solution for data processing and management, catering to the needs of data scientists, analysts, and developers in the LLM era. Even if you don't know much about Spark, you can use it easily via *dataverse*.

</details>

<details>

<summary>Who would use Dataverse?</summary>

Dataverse is ideal for anyone who works with data, including:

**Data Scientists:**

* Dataverse can help data scientists quickly and easily prepare data for analysis, including tasks such as cleaning, transforming, and loading data.
* Dataverse provides a unified interface for data processing tasks, making it easy to use for users of all skill levels.
* Dataverse leverages the power of Spark to deliver high-performance data processing capabilities.

**Developers:**

* Dataverse can help developers build data-driven applications.
* Dataverse can be used to build scalable and reliable data pipelines.

</details>

<details>

<summary>Why should I use Dataverse?</summary>

* **Enhanced productivity**: Dataverse streamlines your workflow by integrating multiple preprocessing libraries into one, eliminating the hassle of configuring settings and searching for the right tools. Furthermore, you can easily take advantage of Spark’s efficiency even if you’re not a Spark expert.
* **Improved data quality**: Elevate your data quality with a variety of preprocessing functions. Dataverse helps you produce high-quality data for analysis, data management, LLM training, and more.
* **Facilitated collaboration**: Dataverse offers uniform preprocessing code to ensure consistent results regardless of who runs it. It also enables collaboration among users with varying levels of Spark proficiency.

</details>

<details>

<summary>When can Dataverse be used?</summary>

Using Dataverse is always encouraged! We'd love to hear how you're applying it, so please share your use cases with us on [Discord](https://discord.com/invite/DG5GGJ3qJx).

* **I am handling large-scale text data**: Dataverse systematically cleanses and enhances the quality of large-scale datasets for training LLMs, which is vital for optimizing model performance.
* **I am collaborating across levels of expertise:** Dataverse ensures consistent results through uniform processing code, making it ideal for collaborative environments with team members of varying skill levels.

</details>

<details>

<summary>How to use Dataverse?</summary>

We suggest kicking off your journey by exploring the [Examples](https://github.com/UpstageAI/dataverse/tree/main/examples) section of our GitHub repository, where you'll find valuable resources to get started. If you have any questions along the way, feel free to share them on [Discord](https://discord.com/invite/DG5GGJ3qJx).

</details>

### Support

<details>

<summary>How to cite the Dataverse project?</summary>

If you are using the project for academic work, please cite as follows:

```bibtex
@misc{park2024dataverse,
      title={Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models}, 
      author={Hyunbyung Park and Sukyung Lee and Gyoungjin Gim and Yungi Kim and Dahyun Kim and Chanjun Park},
      year={2024},
      eprint={2403.19340},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

</details>

<details>

<summary>I have a question or something to share.</summary>

For general inquiries or assistance, head to [the Discord channel](https://discord.com/invite/DG5GGJ3qJx). Please report bugs directly on [GitHub Issues](https://github.com/UpstageAI/dataverse/issues). If you have a topic to discuss, please use [GitHub Discussions](https://github.com/UpstageAI/dataverse/discussions).

Typically, you can anticipate a response within 1 to 2 business days.

</details>

<details>

<summary>I have a topic to discuss.</summary>

Please post it on [GitHub Discussions](https://github.com/UpstageAI/dataverse/discussions).

</details>

<details>

<summary>I found a bug</summary>

Please report it on the [GitHub Issues](https://github.com/UpstageAI/dataverse/issues).

Typically, you can anticipate a response within 1 to 2 business days.

</details>

