Self-Hosted React OCR Web App with DeepSeek-OCR Model

Description:

DeepSeek OCR is a modern OCR web app that uses the DeepSeek-OCR model for advanced text extraction. It combines a React frontend with a FastAPI backend to provide a complete, self-hostable solution for processing images and documents.

The web app processes images by converting them into a compressed visual format, which an AI model then decodes into text. This method allows the application to handle complex documents efficiently.

It also provides a user-friendly web UI for uploading files and viewing results, complete with features for visualizing located text within the image.

Features

  • 🎯 Four OCR Modes: Plain text extraction, intelligent image description, term localization with visual boxes, and custom prompt processing.
  • 🖼️ Large File Support: Handles image uploads up to 100MB with configurable size limits.
  • 📊 Structured Output: Returns text, HTML tables, or JSON data depending on the processing mode.
  • 🔍 Multi-box Detection: Identifies and displays multiple instances of search terms with individual bounding boxes.
  • 🎨 Visual Feedback: Shows detection results overlaid directly on uploaded images with proper coordinate scaling.
  • ⚙️ Environment Configuration: Manages settings through .env files for ports, model paths, and processing parameters.
  • 🚀 GPU Acceleration: Leverages NVIDIA CUDA support for faster model inference.

Use Cases

  • Document Digitization: Convert scanned documents, PDFs, and images into machine-readable text for archival or data entry.
  • Content Indexing: Extract text from images to make their content searchable within a larger database or application.
  • Data Extraction from Forms: Pull specific information from structured or semi-structured documents like invoices or forms using custom prompts.
  • Accessibility Tools: Build services that describe images or read text from pictures for visually impaired users.

How to Use It

1. Confirm your system meets the hardware and software prerequisites.

Hardware:

  • An NVIDIA GPU with CUDA support is necessary. Recommended models include the RTX 3090, 4090, or newer.
  • At least 8GB of VRAM (12GB recommended) is required for the model.
  • Your system should have at least 16GB of RAM and approximately 20GB of free disk space for the model and Docker images.

Software:

  • Docker and Docker Compose
  • NVIDIA Container Toolkit to allow Docker containers to access the GPU.
  • Up-to-date NVIDIA drivers. For newer GPUs like the RTX 50 series on Ubuntu, using the open-source kernel driver (e.g., nvidia-driver-580-open) and enabling Resizable BAR in the BIOS/UEFI is recommended for stability.

2. Clone the project repository to your local machine and navigate into the directory.

    git clone https://github.com/your-repo/deepseek_ocr_app.git
    cd deepseek_ocr_app

3. The application uses a .env file for configuration. Copy the example file to create your own configuration.

    cp .env.example .env

4. Open the .env file and customize the settings as needed. You can change ports, set the maximum file upload size, and configure model parameters.

    # DeepSeek OCR Application Configuration
    # API Configuration
    API_HOST=0.0.0.0
    API_PORT=8000
    # Frontend Configuration
    FRONTEND_PORT=3000
    # Model Configuration
    MODEL_NAME=deepseek-ai/DeepSeek-OCR
    HF_HOME=/models
    # Upload Configuration
    MAX_UPLOAD_SIZE_MB=100
    # Processing Configuration
    BASE_SIZE=1024
    IMAGE_SIZE=640
    CROP_MODE=true

5. Use Docker Compose to build the application containers and start the services.

    docker compose up --build

The initial run will download the DeepSeek-OCR model, which is several gigabytes in size, so this step may take some time depending on your internet connection.

6. Once the containers are running, you can access the different parts of the application:

  • Frontend: http://localhost:3000 (or your configured FRONTEND_PORT)
  • Backend API: http://localhost:8000 (or your configured API_PORT)
  • API Documentation: http://localhost:8000/docs

API Reference

POST /api/ocr

This endpoint processes an uploaded image based on the specified parameters.

Parameters:

| Parameter       | Type    | Required | Default   | Description |
|-----------------|---------|----------|-----------|-------------|
| image           | File    | Yes      |           | The image file to process. |
| mode            | string  | No       | plain_ocr | The OCR mode to use. Options are plain_ocr, describe, find_ref, freeform. |
| prompt          | string  | No       |           | A custom prompt for the freeform mode. |
| grounding       | boolean | No       |           | Set to true to enable bounding box generation. Automatically enabled for find_ref. |
| find_term       | string  | No       |           | The term to locate when using the find_ref mode. |
| base_size       | integer | No       | 1024      | The base processing resolution for the image. |
| image_size      | integer | No       | 640       | The tile size used for processing larger images with dynamic cropping. |
| crop_mode       | boolean | No       | true      | Enables or disables dynamic cropping for large images. |
| include_caption | boolean | No       | false     | Set to true to include an image description in the output. |
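As an illustrative sketch of calling the endpoint from Python, the helper below assembles the multipart form fields described in the table above (the network call itself, which would use the third-party requests package, is shown only as a comment):

```python
def build_ocr_form(mode="plain_ocr", prompt=None, find_term=None,
                   grounding=None, base_size=1024, image_size=640,
                   crop_mode=True, include_caption=False):
    """Assemble the form fields for POST /api/ocr using the documented defaults."""
    if grounding is None:
        # The API enables grounding automatically for find_ref mode.
        grounding = mode == "find_ref"
    form = {
        "mode": mode,
        "grounding": str(grounding).lower(),
        "base_size": str(base_size),
        "image_size": str(image_size),
        "crop_mode": str(crop_mode).lower(),
        "include_caption": str(include_caption).lower(),
    }
    if prompt is not None:
        form["prompt"] = prompt
    if find_term is not None:
        form["find_term"] = find_term
    return form

# Sending the request (requires the third-party requests package):
# import requests
# with open("invoice.png", "rb") as f:
#     resp = requests.post("http://localhost:8000/api/ocr",
#                          data=build_ocr_form(mode="find_ref", find_term="Total"),
#                          files={"image": f})
# print(resp.json()["text"])
```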

Example Response:

The API returns a JSON object containing the extracted text, bounding box coordinates, and image dimensions.

    {
      "success": true,
      "text": "Extracted text or HTML output...",
      "boxes": [{"label": "field", "box": [100, 150, 250, 200]}],
      "image_dims": {"w": 1920, "h": 1080},
      "metadata": {
        "mode": "find_ref",
        "grounding": true,
        "base_size": 1024,
        "image_size": 640,
        "crop_mode": true
      }
    }


FAQs

Q: How does the application handle large images?
A: The application uses a dynamic cropping feature. Images larger than a certain size are split into smaller tiles, processed individually, and the results are combined. You can configure this behavior with the crop_mode, base_size, and image_size parameters.
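As a rough mental model only (the actual tiling logic lives inside DeepSeek-OCR and may pad, resize, or cap the tile count differently), the interaction of these parameters can be sketched as:

```python
import math

def estimate_tiles(width, height, base_size=1024, image_size=640, crop_mode=True):
    """Rough estimate of how many tiles a large image is split into.

    Illustrative assumption: images no larger than base_size are processed
    as a single view; otherwise the image is covered by image_size tiles.
    """
    if not crop_mode or max(width, height) <= base_size:
        return 1  # small images are processed in one pass
    return math.ceil(width / image_size) * math.ceil(height / image_size)
```

Under this model, a 1920x1080 upload with the default settings would be split into a 3x2 grid of tiles, while disabling crop_mode keeps it as one view.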

Q: Can I use this application without an internet connection?
A: Yes, once the Docker images and the AI model are downloaded, the entire application runs locally on your machine. This makes it a suitable on-premises OCR solution.

Q: How are the bounding box coordinates determined?
A: The DeepSeek-OCR model outputs coordinates normalized to a 0-999 scale. The FastAPI backend scales these normalized coordinates to the actual pixel dimensions of the input image before sending them to the frontend for display.
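The scaling step this answer describes can be sketched as follows; treat the exact divisor as an assumption (a 0-999 grid suggests dividing by 999, but some implementations divide by 1000):

```python
def scale_box(box, img_w, img_h, norm_max=999):
    """Map a model box [x1, y1, x2, y2] on a 0-999 grid to pixel coordinates.

    norm_max=999 is an assumption based on the stated 0-999 scale.
    """
    x1, y1, x2, y2 = box
    return [
        round(x1 * img_w / norm_max),
        round(y1 * img_h / norm_max),
        round(x2 * img_w / norm_max),
        round(y2 * img_h / norm_max),
    ]
```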

Q: Can I process multiple images simultaneously?
A: The current implementation processes a single image per request. You can send multiple API requests for batch processing.
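Client-side batching can be sketched with a thread pool; here ocr_one is a hypothetical per-image worker (for example, a function that POSTs one file to /api/ocr and returns the parsed JSON), not part of the application itself:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(paths, ocr_one, max_workers=4):
    """Apply a per-image OCR callable to many files concurrently.

    ocr_one is a hypothetical worker supplied by the caller; results are
    returned in the same order as the input paths.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_one, paths))
```

Keep max_workers modest: each in-flight request occupies GPU memory on the backend.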

Q: What image formats does the system support?
A: The application accepts common formats including JPEG, PNG, and WebP through the file upload interface.

Q: How do I adjust processing for different document types?
A: Modify the base_size and image_size parameters. Higher values improve detail recognition but increase memory usage.
