You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
112 lines
3.6 KiB
ReStructuredText
112 lines
3.6 KiB
ReStructuredText
.. xtuner documentation master file, created by
|
|
sphinx-quickstart on Tue Jan 9 16:33:06 2024.
|
|
You can adapt this file completely to your liking, but it should at least
|
|
contain the root `toctree` directive.
|
|
|
|
Welcome to the MinerU Documentation
|
|
==============================================
|
|
|
|
.. figure:: ./_static/image/logo.png
|
|
:align: center
|
|
:alt: mineru
|
|
:class: no-scaled-link
|
|
|
|
.. raw:: html
|
|
|
|
<p style="text-align:center">
|
|
<strong>A one-stop, open-source, high-quality data extraction tool
|
|
</strong>
|
|
</p>
|
|
|
|
<p style="text-align:center">
|
|
<script async defer src="https://buttons.github.io/buttons.js"></script>
|
|
<a class="github-button" href="https://github.com/opendatalab/MinerU" data-show-count="true" data-size="large" aria-label="Star">Star</a>
|
|
<a class="github-button" href="https://github.com/opendatalab/MinerU/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
|
|
<a class="github-button" href="https://github.com/opendatalab/MinerU/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
|
|
</p>
|
|
|
|
|
|
Project Introduction
|
|
--------------------
|
|
|
|
MinerU is a tool that converts PDFs into machine-readable formats (e.g.,
|
|
markdown, JSON), allowing for easy extraction into any format. MinerU
|
|
was born during the pre-training process of
|
|
`InternLM <https://github.com/InternLM/InternLM>`__. We focus on solving
|
|
symbol conversion issues in scientific literature and hope to contribute
|
|
to technological development in the era of large models. Compared to
|
|
well-known commercial products, MinerU is still young. If you encounter
|
|
any issues or if the results are not as expected, please submit an issue
|
|
on `issue <https://github.com/opendatalab/MinerU/issues>`__ and **attach
|
|
the relevant PDF**.
|
|
|
|
.. video:: https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
|
|
|
|
|
|
Key Features
|
|
------------
|
|
|
|
- Remove headers, footers, footnotes, page numbers, etc., to ensure
|
|
semantic coherence.
|
|
- Output text in human-readable order, suitable for single-column,
|
|
multi-column, and complex layouts.
|
|
- Preserve the structure of the original document, including headings,
|
|
paragraphs, lists, etc.
|
|
- Extract images, image descriptions, tables, table titles, and
|
|
footnotes.
|
|
- Automatically recognize and convert formulas in the document to LaTeX
|
|
format.
|
|
- Automatically recognize and convert tables in the document to LaTeX
|
|
or HTML format.
|
|
- Automatically detect scanned PDFs and garbled PDFs and enable OCR
|
|
functionality.
|
|
- OCR supports detection and recognition of 84 languages.
|
|
- Supports multiple output formats, such as multimodal and NLP
|
|
Markdown, JSON sorted by reading order, and rich intermediate
|
|
formats.
|
|
- Supports various visualization results, including layout
|
|
visualization and span visualization, for efficient confirmation of
|
|
output quality.
|
|
- Supports both CPU and GPU environments.
|
|
- Compatible with Windows, Linux, and Mac platforms.
|
|
|
|
|
|
.. tip::
|
|
|
|
Get started with MinerU by trying the `online demo <https://www.modelscope.cn/studios/OpenDataLab/MinerU>`_ or :doc:`installing it locally <user_guide/install/install>`.
|
|
|
|
|
|
User Guide
|
|
-------------
|
|
.. toctree::
|
|
:maxdepth: 2
|
|
:caption: User Guide
|
|
|
|
user_guide
|
|
|
|
|
|
API Reference
|
|
-------------
|
|
|
|
If you are looking for information on a specific function, class or
|
|
method, this part of the documentation is for you.
|
|
|
|
.. toctree::
|
|
:maxdepth: 2
|
|
:caption: API
|
|
|
|
api
|
|
|
|
|
|
Additional Notes
|
|
------------------
|
|
.. toctree::
|
|
:maxdepth: 1
|
|
:caption: Additional Notes
|
|
|
|
additional_notes/known_issues
|
|
additional_notes/faq
|
|
additional_notes/glossary
|
|
|
|
|