Commit b93a2a53 authored by Bastien Abadie

Less pages

parent ab1a6ba9
This documentation is aimed at system administrators and business leaders who want to deploy the Arkindex platform on their own hardware.
If you are interested in using Arkindex on your own documents, but **cannot publish them on our own instances** (due to privacy or regulatory concerns), it's possible to deploy the full Arkindex platform on your own infrastructure.
In the following sections, we'll describe the requirements for running an efficient and scalable Arkindex infrastructure using **Docker containers** on your own hardware. This setup can handle millions of documents processed by multiple Machine Learning processes.
In this section you'll find out how to:
1. set up [Arkindex](@/deployment/setup.md) in production mode.
## Pricing
Please [contact us](https://teklia.com/company/contact/) if you are interested in this solution for your company or institution.
We can also provide a private instance that we manage on our servers (hosted in Europe or North America).
## Architecture
The main part of the architecture uses a set of open-source software along with our own proprietary software.
{{ figure(image="deployment/architecture.png", height=400, caption="Arkindex platform architecture") }}
The open source components here are:
- Traefik as load balancer,
- Cantaloupe as IIIF server,
- MinIO as S3-compatible storage server,
- Redis as cache,
- PostgreSQL as database,
- Solr as search engine.
You'll also need to run a set of workers on dedicated servers: this is where the Machine Learning processes will run.
{{ figure(image="deployment/workers.png", height=400, caption="Arkindex workers for Machine Learning") }}
Each worker in the diagram represents a dedicated server, running our in-house job scheduling agents and dedicated Machine Learning tasks.
## Hardware
### Platform
We recommend using Docker Swarm to aggregate several web servers along with at least one server for databases.
At least two web nodes should run in production for reliable performance.
#### Web node spec
These servers can be virtual machines (VPS) or dedicated servers on bare metal, with recommended specifications:
- 4 CPU cores at 2 GHz
- 4 GB of RAM
- 80 GB of storage
Each web node should host these services:
- Arkindex backend & frontend
- Arkindex internal worker
- load balancer
- (optionally) IIIF server
#### Database server spec
This server must be a dedicated bare-metal server, using SSDs for database storage, with recommended specifications:
- 8 to 12 cores at 2.6 GHz
- 32 GB of RAM
- 500 GB of storage (heavily depends on the size of your datasets)
This server should host:
- PostgreSQL database
- Redis server
- (optionally) Solr server
- (optionally) Minio instance
### Machine Learning Workers
Each worker can be an independent server and is not necessarily connected directly to the platform: it only needs to communicate through the platform's REST API, and no database access is needed.
The requirements of each server depend on the type of your processes and datasets. We recommend bare-metal servers with at least 8 cores at 2 GHz and 16 GB of RAM. You may also need GPUs for specific use cases. For any inquiry, please describe your datasets and provide samples so we can reply with specific requirements.
## Requirements
- Use Linux servers and Docker. We provide support for the Ubuntu LTS distribution, and only provide Docker images to run our software.
- Your instance must be able to make regular API calls (once a day) to a remote server to validate its licence. The server does **not** need to be exposed to the Internet; it simply needs to be able to make outbound requests towards a single domain.
## Deliverables
- Docker images:
- backend
- agent to run processes
- relevant Machine Learning workers used in processes (DLA, HTR, NER, ...)
- frontend assets
- Documentation to deploy and manage an instance using [Ansible playbook](https://www.ansible.com/)
+++
title = "Configure Arkindex backend"
description = "All the configuration options available to set up your Arkindex backend"
weight = 30
+++
+++
title = "Deploy Arkindex with docker-compose"
description = "Deploy Arkindex on your own infrastructure using Linux and docker-compose"
weight = 10
+++
This documentation is written for **system administrators**.
We'll use different terms for the components of our product:
- **Platform server** is the server that will run the **Backend** code responsible for the **REST API**,
- Arkindex needs to run some specific asynchronous tasks that require direct access to the database: the **local worker** will execute these tasks,
- Some intensive Machine Learning tasks will be executed by **Remote workers**, using proprietary software called **Ponos**. One instance of Ponos is called an **Agent**.
## Requirements
- A bare metal server running Linux Ubuntu LTS (20.04 or 22.04) for the platform
- If you plan to run Machine Learning processes, you'll need another server
- [Docker installed on that server](https://docs.docker.com/desktop/install/linux-install/)
- [docker-compose](https://docs.docker.com/desktop/install/linux-install/)
- A domain name for the platform server:
- ideally, a public domain name if your server is reachable on the Internet (like `arkindex.company.com`),
- or an internal domain name, provided by your company's system administrator.
- An SSL certificate for that domain name:
- it can be provided by [Let's Encrypt](https://letsencrypt.org/) freely and automatically if your server is reachable on the Internet,
- otherwise an internal certificate, provided by your company's system administrator.
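When the server is Internet-facing, Traefik can obtain and renew the Let's Encrypt certificate automatically through its ACME integration. Below is a minimal sketch of the relevant Traefik static configuration; the resolver name, email address and storage path are illustrative, not values mandated by Arkindex:

```yaml
# traefik.yml (static configuration) -- values are illustrative
certificatesResolvers:
  letsencrypt:
    acme:
      email: admin@company.com         # contact address for Let's Encrypt
      storage: /letsencrypt/acme.json  # must persist across restarts
      httpChallenge:
        entryPoint: web                # your HTTP (port 80) entrypoint
```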
## Third-party services
You'll need to set up multiple *companion* services that support Arkindex. All these services are **open source and freely available**.
Required services:
- a load balancer, [Traefik](https://doc.traefik.io/traefik/), that will route traffic from your users to the different services
- a message broker for asynchronous tasks: [Redis](https://redis.io/)
- a relational database for all data stored in Arkindex: [PostgreSQL](https://www.postgresql.org/)
  - the [PostGIS](https://postgis.net/) extension is also required
Optional services:
- an S3-compatible remote storage server, [MinIO](https://min.io/docs/minio/linux/index.html)
  - you can use AWS S3 or any other API-compatible provider instead
- a IIIF server for your images, [Cantaloupe](https://cantaloupe-project.github.io/)
- a search engine to look up your transcriptions: [Apache Solr](https://solr.apache.org/)
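To give an idea of what this looks like in practice, the required services could be declared in a docker-compose file along these lines; image tags and the password are examples only, and the optional services would be added the same way:

```yaml
# docker-compose.yml fragment -- image tags and credentials are examples only
services:
  traefik:
    image: traefik:v2.10
    ports: ["80:80", "443:443"]
  redis:
    image: redis:7
  db:
    # the postgis/postgis image bundles PostgreSQL with the required PostGIS extension
    image: postgis/postgis:15-3.4
    environment:
      POSTGRES_PASSWORD: change-me
```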
## Arkindex software
Teklia will provide you with several Docker images (to load using [docker load](https://docs.docker.com/engine/reference/commandline/load/)):
- the backend image, tagged `registry.gitlab.teklia.com/arkindex/backend:X.Y.Z`, must be present on your application server,
- the tasks image, `registry.gitlab.teklia.com/arkindex/tasks:X.Y.Z`, will be used by the remote workers (file imports, thumbnail generation, ...),
- the ponos image, `registry.gitlab.teklia.com/arkindex/ponos-agent:X.Y.Z`, will be used to actually run the asynchronous tasks across all your remote workers.
{{ figure(image="deployment/stack.png", height=250, caption="Arkindex Platform and a single Worker") }}
The backend image mentioned above will run in two containers on your application server:
1. one for the API, the heart of Arkindex,
2. one for the local asynchronous tasks that can directly reach the database (SQLite export, element deletion, ...)
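In a docker-compose setup, this translates to two services built from the same image. The sketch below only shows the shape; the actual commands and environment are not reproduced here:

```yaml
# Sketch only: the same backend image backs two services.
services:
  backend:
    # serves the REST API
    image: registry.gitlab.teklia.com/arkindex/backend:X.Y.Z
  local_worker:
    # runs the local asynchronous tasks, with direct database access
    image: registry.gitlab.teklia.com/arkindex/backend:X.Y.Z
```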
We recommend using our own CDN for the **frontend files**: simply use `assets.teklia.com` as the source for static files in the backend configuration (you can look into [our own example](https://gitlab.teklia.com/arkindex/public-architecture/-/blob/master/config.yml)).
### Docker-Compose
We documented a working example of docker-compose setup in a [dedicated public repository](https://gitlab.teklia.com/arkindex/public-architecture/). You can clone this repository and use it as a starting point for your own deployment.
Of course, your setup may differ: you could use external services (databases, search engine, file storage, ...). Only our own software needs to run through Docker; the other parts can be externalized.
### Configuration
All the configuration options for the backend are detailed [on this page](@/deployment/configuration.md).
A minimal configuration file is also available in the [public repository](https://gitlab.teklia.com/arkindex/public-architecture/-/blob/master/config.yml).
### Ponos
If your setup requires Machine Learning processes, you'll need at least one **Ponos Agent** on a dedicated server.
The setup of this kind of server is easier, as it only requires running the agent (from the Docker image `registry.gitlab.teklia.com/arkindex/ponos-agent`) and configuring it. The tasks will then be triggered by the agent automatically.
To begin the setup, you'll need two private keys: one for the backend, another for the agent. Each agent needs a dedicated key to authenticate itself.
To generate a valid private key:
```sh
openssl ecparam -name secp384r1 -genkey -noout > agent.key
```
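Since the backend and every agent each need their own key, the same command is simply run once per key; `openssl ec -check` can then confirm that each file is a valid EC private key. The file names below are illustrative:

```sh
# Generate one key for the backend and one for the agent (names are illustrative)
openssl ecparam -name secp384r1 -genkey -noout > backend.key
openssl ecparam -name secp384r1 -genkey -noout > agent.key

# Sanity-check that each file contains a valid EC private key
openssl ec -in backend.key -noout -check
openssl ec -in agent.key -noout -check
```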
A YAML configuration file is also required:
```yaml
---
# Save as agent.yml
url: https://ark.localhost/
farm_id: XXXXX
seed: YYYYY
data_dir: /data
private_key: /etc/ponos/agent.key
```
The `farm_id` and `seed` information can be found in the Arkindex administration interface under the section **Ponos > Farms**.
You can then run the agent as:
```sh
docker run \
  --name=ponos \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v ./agent.yml:/etc/ponos/agent.yml:ro \
  -v ./agent.key:/etc/ponos/agent.key:ro \
  -v ponos_data:/data \
  registry.gitlab.teklia.com/arkindex/ponos-agent:X.Y.Z
```
Please note that the agent requires write access to the local Docker socket in order to create the new containers that will run the tasks.
The `ponos_data` Docker volume is not required, but allows retrieving debug logs from outside the agent container.
+++
title = "Deploy Arkindex on-premise"
description = "Deploy Arkindex on your own infrastructure"
weight = 110
+++
## Run with docker
More information on [running Arkindex using docker-compose](@/deployment/docker_compose.md)
+++
title = "Setup"
sort_by = "weight"
+++