# FAQs

### About the project

<details>

<summary>What is Dataverse?</summary>

Dataverse is a freely accessible, open-source project that supports your **ETL pipeline with Python**. We offer a simple, standardized, and user-friendly solution for data processing and management, catering to the needs of data scientists, analysts, and developers in the LLM era. Even if you don't know much about Spark, you can use it easily via *dataverse*.
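For readers new to the term, the sketch below illustrates the general extract, transform, load pattern in plain Python. It is not Dataverse's API: every function name here (`extract`, `normalize_whitespace`, `drop_duplicates`, `load`, `run_pipeline`) is hypothetical and chosen only for illustration, and it runs on an in-memory list rather than on Spark.

```python
import json
from typing import Callable, Iterable

# Hypothetical types for this illustration only (not part of Dataverse).
Record = dict
Step = Callable[[list[Record]], list[Record]]


def extract(lines: Iterable[str]) -> list[Record]:
    """Extract: parse JSON-lines text into records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def normalize_whitespace(records: list[Record]) -> list[Record]:
    """Transform: collapse runs of whitespace in the 'text' field."""
    return [{**r, "text": " ".join(r["text"].split())} for r in records]


def drop_duplicates(records: list[Record]) -> list[Record]:
    """Transform: keep only the first record for each distinct 'text'."""
    seen = set()
    unique = []
    for r in records:
        if r["text"] not in seen:
            seen.add(r["text"])
            unique.append(r)
    return unique


def load(records: list[Record]) -> str:
    """Load: serialize records back to JSON-lines text."""
    return "\n".join(json.dumps(r) for r in records)


def run_pipeline(lines: Iterable[str], steps: list[Step]) -> str:
    """Run extract, then each transform step in order, then load."""
    records = extract(lines)
    for step in steps:
        records = step(records)
    return load(records)


raw = [
    '{"text": "hello   world"}',
    '{"text": "hello world"}',
    '{"text": "  LLM data "}',
]
result = run_pipeline(raw, [normalize_whitespace, drop_duplicates])
print(result)
```

Because the steps are passed as an ordered list, everyone who runs the same list on the same input gets the same output, which is the idea behind sharing one standardized preprocessing pipeline.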

</details>

<details>

<summary>Who would use Dataverse?</summary>

Dataverse is ideal for anyone who works with data, including:

**Data Scientists:**

* Dataverse can help data scientists quickly and easily prepare data for analysis, including tasks such as cleaning, transforming, and loading data.
* Dataverse provides a unified interface for data processing tasks, making it easy to use for users of all skill levels.
* Dataverse leverages the power of Spark to deliver high-performance data processing capabilities.

**Developers:**

* Dataverse can help developers build data-driven applications.
* Dataverse can be used to build scalable and reliable data pipelines.

</details>

<details>

<summary>Why should I use Dataverse?</summary>

* **Enhanced productivity**: Dataverse streamlines your workflow by integrating multiple preprocessing libraries into one, eliminating the hassle of configuring and searching for the right tools. You can also take advantage of Spark’s efficiency even if you’re not a Spark expert.
* **Improved data quality**: Elevate your data quality with a variety of preprocessing functions. Dataverse helps you produce high-quality data for analysis, data management, LLM training, and more.
* **Facilitated collaboration**: Uniform preprocessing code ensures consistent results regardless of who runs it. Dataverse also enables collaboration among users with varying levels of Spark proficiency.

</details>

<details>

<summary>When can Dataverse be used?</summary>

Using Dataverse is always encouraged! We'd love to hear how you're applying it, so please share your use cases with us on [Discord](https://discord.com/invite/DG5GGJ3qJx).

* **I am handling large-scale text data**: Dataverse systematically cleanses and enhances the quality of large-scale datasets for training LLMs, which is vital for optimizing model performance.
* **I am collaborating across expertise**: Dataverse ensures consistent results through uniform processing code, making it ideal for collaborative environments with team members of varying skill levels.

</details>

<details>

<summary>How to use Dataverse?</summary>

We suggest starting with the [Examples](https://github.com/UpstageAI/dataverse/tree/main/examples) section of our GitHub repository, where you'll find resources to help you get started. If you have any questions along the way, feel free to share them on [Discord](https://discord.com/invite/DG5GGJ3qJx).

</details>

### Support

<details>

<summary>How to cite the Dataverse project?</summary>

If you are using the project for academic work, please cite as follows:

```bibtex
@misc{park2024dataverse,
      title={Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models}, 
      author={Hyunbyung Park and Sukyung Lee and Gyoungjin Gim and Yungi Kim and Dahyun Kim and Chanjun Park},
      year={2024},
      eprint={2403.19340},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

</details>

<details>

<summary>I have a question or something to share.</summary>

Head to [the Discord channel](https://discord.com/invite/DG5GGJ3qJx) for general inquiries or assistance. Please report bugs directly on [GitHub Issues](https://github.com/UpstageAI/dataverse/issues). If you have something to discuss, please use [GitHub Discussions](https://github.com/UpstageAI/dataverse/discussions).

Typically, you can anticipate a response within 1 to 2 business days.

</details>

<details>

<summary>I have a topic to discuss.</summary>

Please post it on [GitHub Discussions](https://github.com/UpstageAI/dataverse/discussions).

</details>

<details>

<summary>I found a bug</summary>

Please report it on [GitHub Issues](https://github.com/UpstageAI/dataverse/issues).

Typically, you can anticipate a response within 1 to 2 business days.

</details>

